Sequence Analysis on the Web

Biochemistry 4010/5010

19 January 2017

Due Date: 26 January 2017


NOTE: The following sections will guide you through your sequence analyses and at several points there are questions. Please answer these questions as you go along -- you don't need. to write in polished English -- just point form answers are fine. Complete all work using the Firefox web browser (or Google Chrome).
If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email Shannon Sibbald (shannon.sibbald at dal.ca) or stop by (Rm. 8H1, Sir Charles Tupper Medical Building).
Please answer all questions on the printed out lab sheet provided (or print them yourself from the web site). Alternatively, you can answer all the questions in a separate file and email the completed lab to Shannon.

Goals:

  1. Basic database searching with text queries

  2. Producing Dot-Plots online

  3. Basic database searching using BLAST to pull out homologues

  4. Pairwise sequence alignment using BLAST, FASTA, Needleman-Wunsch and Smith-Waterman alignments

  5. Randomisation tests for significance of alignments (using the "PRSS" program)


A. Basic database searching with text queries

  1. Go to the National Center for Biotechnology Information (NCBI) web site: http://www.ncbi.nlm.nih.gov/

  2. Change the search menu to "Nucleotide" and search for the Escherichia coli elongation factor tufA gene (EF-Tu) using the accession number AH002539.

    Note: this sequence contains multiple genes including: fusA (elongation factor G), rpsG (ribosomal protein S7) and tufA (elongation factor Tu).


    QUESTION 1. How can you find where the gene begins and ends? (Hint: look under the heading "FEATURES". The numbers tell you where the particular feature listed begins and ends). Where does the tufA gene begin/end?




  3. View just the coding region of the tufA gene. Change the display to "FASTA".

    QUESTION 2. Based on your knowledge of the genetic code (see http://plato.stanford.edu/entries/information-biological/#GenCod), what is strange about this gene?



  4. Go to the NCBI (http://www.ncbi.nlm.nih.gov), click on "Analyze" then go to ORF Finder.

    Copy and paste the coding region of the tufA gene (in FASTA format) into the "Enter Query Sequence" area, then click "Submit". Once it is complete, click "Add six-frame translation track".
    This map tells you the location of canonical start codons (ATG) and stop codons (TAA, TAG, TGA) in the sequence in all 6 reading frames of the strand you have pasted in (an ORF is the region stretching from a start codon to the first stop codon). In this ORF map, a green vertical line indicates a canonical start codon (ATG) and a pink vertical line indicates a canonical stop codon (TAA, TAG, TGA).
    Note: If you have correctly obtained the tufA gene only, there should be 5 ORFs. If there are more, go back to (3) above and redo it.

    QUESTION 3. What do the ORF maps for 6 reading frames represent?




    Copy and paste the coding region of the tufA gene into a Notepad window, and add a ">" symbol plus a name (e.g., ec_tufA) at the top.
    Note: keep the length of the name less than or equal to 8 characters! Make sure the sequence starts on the next line after the name (see below for an example). This format is called Pearson or FASTA format. Save the file in the My Documents folder with a useful filename.

    FASTA format:

      >name
      agcttagcggtaaa...
      

  5. Go back to the web browser with your sequence in it, and change display to "Genbank". Click on the "protein_id" number for tufA. This is the GenBank protein sequence entry corresponding to the nucleotide sequence. At the top and select the FASTA option to show the sequence in FASTA format. Then, in another Notepad window (or some text editor), copy and paste amino acid sequence EF-Tu, once again giving a unique name to the sequence (i.e., remove all of the garbage at the top of the sequence after the carat). Save this file in My Documents.


B. Producing Dot-Plots Online

  1. Go to the DOTMATCHER website: http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher.

    Here we will use an online tool to get a feeling for what kind of information dot-plots can give you about your sequence. In the Additional section of the web interface, you're allowed to modify window size (how big is the window of comparison) and threshold (how many matches or how big the score should be for a window before plotting a dot). The scoring matrix (from among several PAM and BLOSUM family matrices (e.g. EBLOSUM90 or EBLOSUM30), as well as "EDNAMAT", a matrix for nucleotide sequences, available from ftp://ftp.ncbi.nih.gov/blast/matrices/NUC.4.2) can be changed in the input section.

    Generally, when using the BLOSUM or PAM matrices, the program will sum the score of the alignment implied by the "window" using these matrices. If this sum is negative, then the value for the window is set to 0, otherwise the score is compared to the Stringency value. If Score > Stringency, then a dot is plotted in the matrix, otherwise nothing is plotted.

  2. Under the Input section, paste the E. coli EF-Tu (tufA) amino acid sequence (including the name) into the first input area, then paste the amino acid sequence from the GenBank database accession number P16018 (searching in the Protein Database, not Nucleotide) into the second input area area. Again, make sure both of these are amino acid sequences!

    QUESTION 4. What is the function of the protein in entry P16018? What database did this sequence originally come from?




    Look under the Additional section of the web interface and fill in the following information:

    QUESTION 5. What are the default options?

    Window size:

    Threshold:

    (Note: the default matrix file is EBLOSUM62)

    Now click "Run dotmatcher". In a few seconds, a results page should open.


    QUESTION 6. Print theDot-plot and circle the regions of the plot which appear to indicate good alignment. Attach this printout to this lab. (Alternatively,you can use Photoshop or something to "draw" on the image file, and e-mail it to me).

    QUESTION 7. Change the default matrix (e.g. you can change it to EBLOSUM30, or EBLOSUM90 etc.), window size and threshold parameters and remake the dot-plots. Record what you did below and describe what effects these changes have on the resulting Dot-plots? Print one of them out, label it and attach it to this lab when you hand it in.



  3. Go to GenBank and find Human transcript factor IIB by searching the protein database with its accession number: Q00403. Go back to the DOTMATCHER page (or open it again) and paste this sequence into both the top and bottom sequence input fields. Then run dotmatcher using the default options.

    QUESTION 8: In this dot-plot, what do the "off-diagonal" lines suggest about the sequence of this protein?






C. Basic database searching with BLAST and FASTA

  1. Go to the NCBI web site: http://www.ncbi.nlm.nih.gov.

  2. Click on BLAST (under POPULAR at the bottom of the page).

  3. Look under the heading Basic BLAST and click on the link "nucleotide blast".

  4. Go down the set of menus and options and familiarize yourself with the default settings and what you can change. Pay particular attention to the settings under the expandable "Algorithm parameters" section.

  5. Copy and paste your FASTA format E. coli tufA nucleotide sequence into the window.

  6. Select the database with the broadest representation of sequences.

    QUESTION 9. What is this database and what does its abbreviation stand for?


  7. Under Program Selection be sure to select blastn blastn as the BLAST algorithm (note that megablast is the default). Then BLAST away!! While waiting for the job to complete the formatting options link will allow you to specify the output formats. For now leave the defaults.

  8. When your search results appear scan through them and answer the following question.

    QUESTION 10. In general, what do the colours indicate in the top of the graphical output? What do the lengths of these lines relative to the top bar (with the numerical scale) represent?




    Scroll down the page and look at the list of best hits and below that the best local alignments. Look at the percent identity.

  9. Have a look at the "E-values" (or "expect" values). These are the expectation scores of the alignment. That is, given a database of the size you searched, the expectation scores indicate how many matches with this score or greater would you expect at random (from a random alignment). Generally speaking E-values below 0.01-0.001 (1E-2 - 1E-3) are worth paying attention to as "possibly" significant with anything below 1E-4 definitely significant.

  10. Try BLASTing again using the nucleotide sequence, but this time restrict your results to Saccharomyces cerevisiae (taxid:4932) sequences. To restrict to a given Organism, under "Choose Search Set" start typing in the organism name in the "Organism" form field until you can select it from the list.). Once you have the results record the score of the highest hit (below), the number of hits with E < 0.001, and look at the pairwise alignment with the best hit.

    BLASTN

    E-value:

    Number of hits with E < 0.001:



    Go back the the BLAST server and use your saved E. coli tufA amino acid sequence to do a Protein BLAST (Standard protein-protein BLAST [blastp]). To get to blastp go back to the BLAST start page and under Basic BLAST select protein blast. Under Program Selection make sure blastp (default) is selected as the algorithm. Then, as above restrict the BLAST search to Saccharomyces cerevisiae sequences. Go to the Algorithm section and increase the MAX TARGET SEQUENCES to 500. Record the E-score of the highest hit and look at the pairwise alignment.

    BLASTP

    E-value:

    Number of hits with E < 0.001:



    QUESTION 11. Did blastn or blastp give a more significant hit? Therefore, which is more sensitive?(Include the E-value of the most significant hit for each algorithm) How many proteins in yeast are significantly similar (i.e., have E < 0.001 ) according to blastp versus blastn?




  11. Go to the top hitting protein from yeast and copy it into another text file on your desktop and give it a name: yeast1. Also, get a sequence with a hit of E > 0.0001 but < 0.1, copy the amino acid sequence to another Notepad file and call it yeast2. Note both E-values below.

    E. coli vs. yeast 1:

    E. coli vs. yeast 2:

  12. Go to the FASTA server (http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml) at the University of Virginia (Bill Pearson's site).

    Click on "protein-protein FASTA", paste your E.coli EF-Tu protein sequence into the query window (in FASTA format) and choose the Human/Uniprot database (don't try to compare the results with BLAST) and then click "Search Database" (this will run the program against the database with default gap penalties). Look at the results and record the E-value of the best hit. Record the number of hits with E < 0.001 below:

    FASTA

    Score:

    Number of hits with E < 0.001:

  13. Move back a page and run the search again with different scoring matrix and LARGER negative gap opening and extension penalties (note the default gap penalties are -10 (gap opening (open)) and -2 (gap extension (ext) --> choose values that are MORE negative). Record your settings and the E-value of the best hit below:

    Protein matrix:

    Gap:

    Ext:

    E-value:



    QUESTION 12. Briefly, what effect did changing the gap penalty parameters have on the alignments?



    How many hits did you get with E < 0.001 (with your new parameters)?


    Did making the penalties stronger for gaps allow you to pick up more or fewer distant relatives at significant E-values? Why?




D. Pairwise alignment with Needleman-Wunsch and Smith-Waterman algorithms

    When searching the database, the BLAST and FASTA programs do rapid non-optimal local alignments for all sequence in the database against the query sequence and then report the N best scores and alignments from the whole database -- these are the potential homologs of your protein.

    If you want the BEST (optimal) alignment(s) between your two sequences you can use the Needleman-Wunsch algorithm (for global alignment, i.e., spans both full sequences) or the Smith-Waterman algorithm (for local alignment -- does not need to span full sequences). These algorithms are much more time-consuming than BLAST or FASTA, but they are arguably more sensitive. In fact, there are several "supercomputer" type facilities that allow you to search entire databases with Smith-Waterman pairwise alignments (e.g., ssearch on Bill Pearson's site: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml... try it sometime!!!)

  1. In this exercise we are going to do a pairwise alignment between the E. coli EF-Tu and the yeast1 homologue you saved on your computer.

    Go to the NEEDLE web server: NEEDLE

    Use the following parameters for the Needlman-Wunsch alignment:

          Gap open penalty: 5

          Gap extension penalty: 0.5

          Matrix: EBLOSUM62

    Needleman-Wunsch alignment similarity score:

  2. Redo the alignment but change the gap extension penalty to 10.

    QUESTION 13. What happens to the alignment and roughly why?



  3. Now repeat at the WATER web-server: WATER in a new window ( you will need to compare to the NEEDLE alignment. Use the same default settings (Open penalty 5, extension penalty 0.5, EBLOSUM62). ).

    QUESTION 14: Look at the alignment. Record the score below and describe what is different about it compared to the two NEEDLE alignments

    Score:





E. Testing significance of pairwise alignments using "random shuffling" (Monte-Carlo method)

    If a database of sequences is unavailable or irrelevant to your question, another way to evaluate the significance of alignments is to randomly permute (shuffle the order of letters) one of the sequences in the alignment, and then align it to the other sequence using your pairwise alignment algorithm. If you "shuffle" the sequence like this hundreds of times, you can generate a distribution of scores that would be expected under random alignments with sequences of the same amino acid composition as the ones you are looking at. In this exercise we will do this permutation test using the Smith-Waterman local alignment algorithm.

  1. Luckily there is a web-server that does this. Bill Pearson's PRSS program!!!! Go back to the FASTA website: http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.s html. Click on the "Protein vs Protein shuffle (prss)" link found in the Statistical Significance section.

  2. Copy and paste your E.coli EF-Tu sequence into the first window and your yeast1 sequence into the second window. Leave everything as default and click on "Shuffle Sequences". Look at the results.

    QUESTION 15. What is the E-value for your alignment?


  3. Redo the analysis, but change the gap penalty (a lot) and extension penalty (a lot) -- make sure they are negative!!!! (defaults are -10 and -2 -- the more negative the bigger the penalty in this case). Also fiddle around with the different protein matrices to see what happens to the score.

  4. Do the analysis with E. coli EF-tu and yeast2 using default settings.

    QUESTION 16. Is this alignment significant? What is the E-value and how does it compare to the E-value recorded above (Question 15)? Do you think these two genes are homologous (explain your answer)? Note there is no one right answer to this question because you may have chosen different sequences to be yeast2.




Websites you need for this exercise:

NCBI: Entrez (databases), BLAST, Orf Finder
http://www.ncbi.nlm.nih.gov

BillPearson's site: FASTA searching, randomisation test
http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml

Smith-Waterman:
http://www.ebi.ac.uk/Tools/psa/emboss_water/

Needleman-Wunsch:
http://www.ebi.ac.uk/Tools/psa/emboss_needle/

Standard genetic code:
http://molbio.info.nih.gov/molbio/gcode.html