Multiple Sequence Alignments

Protein domains and multiple alignments

Biochemistry 4010/5010

26 January 2017

This assignment will be due 1 week from today (Feb 2nd). Again, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email Shannon Sibbald (shannon.sibbald at dal.ca) or stop by (Rm. 8H1, Sir Charles Tupper Medical Building).

Goals:

  1. ClustalW multiple sequence alignments online

  2. Protein domains at NCBI

  3. PSI-BLAST


A. ClustalW multiple sequence alignments

This exercise uses ClustalW, a popular multiple sequence alignment program. We'll be using ClustalW via a web interface.

  1. Download the amino acid sequences for accession numbers AAA50993, P02992, and P32481 in FASTA format from the NCBI protein database (http://www.ncbi.nlm.nih.gov). These are the E. coli tufA, the yeast tufA, and another yeast sequence from our previous lab.
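If you would rather script the download than click through the website, the same records can be requested from NCBI's E-utilities "efetch" service. The following is only a minimal sketch: the function names are made up for this handout, and fetch_fasta needs network access, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Build a URL that asks efetch for protein records as plain-text FASTA."""
    params = {
        "db": "protein",              # the NCBI protein database
        "id": ",".join(accessions),   # several accessions in one request
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the records as one FASTA string (requires network access)."""
    with urlopen(efetch_fasta_url(accessions)) as response:
        return response.read().decode("utf-8")


# Show the request URL for the three sequences used in this exercise.
print(efetch_fasta_url(["AAA50993", "P02992", "P32481"]))
```

Calling fetch_fasta(["AAA50993", "P02992", "P32481"]) should return all three records in one block of text, which you can then save to a file.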

  2. Connect to the ClustalW web interface (http://www.ch.embnet.org/software/ClustalW.html).
    Paste all 3 sequences in FASTA format into the input sequences area so that they look like:

        >ecoli_tu
        MSKEKFERTKPHVNVGT...
        >yeast1
        MSALLPRLLTRTAFKAS...
        >yeast2
        MSDLQDQEPSIIINGNL... 
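The FASTA files you download have long header lines, so you need to shorten them to labels like the ones above. You can do this by hand, or with a few lines of Python. This is a rough sketch, not part of the assignment: relabel_fasta is a made-up helper, and the header text and sequences in the example are placeholders (only the first residues shown above), not the real records.

```python
def relabel_fasta(fasta_text, labels):
    """Replace each FASTA header, in order, with a short label of your choice."""
    labels = list(labels)
    out = []
    index = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop blank lines between records
        if line.startswith(">"):
            if index >= len(labels):
                raise ValueError("more sequences than labels")
            out.append(">" + labels[index])
            index += 1
        else:
            out.append(line)  # sequence lines are kept unchanged
    return "\n".join(out) + "\n"


# Placeholder input: invented header text, truncated sequences.
raw = (
    ">AAA50993.1 placeholder header text\n"
    "MSKEKFERTKPHVNVGT\n"
    "\n"
    ">P02992 placeholder header text\n"
    "MSALLPRLLTRTAFKAS\n"
)
print(relabel_fasta(raw, ["ecoli_tu", "yeast1"]))
```

Short labels with no spaces keep the sequence names readable in the alignment output.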
    
        

    QUESTION 1. What are the default gap opening and gap extension penalties for the multiple sequence alignment (not the pairwise alignment)? What is the default scoring matrix used? (Note: ignore 'End gap penalty' and 'Separation gap penalty'.)

    Gap opening:

    Gap extension:

    Scoring matrix:


  3. Click "Run ClustalW". In a short while, a results page will appear. It is possible to view a plain text output of your alignment by clicking "clustalw (aln)" under the Multiple alignments column. This will download the multiple sequence alignment, which you can open in any plain text viewing program (e.g. Notepad or TextEdit). Change the name of the file to something meaningful and open the file to have a look at your alignment.

    At sequence positions along the alignment, different symbols represent the degree to which the residues are conserved at that site. An asterisk (*) indicates complete conservation, a colon (:) indicates conservation between groups with highly similar properties and a period (.) indicates conservation between groups with weakly similar properties.
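To make the three symbols concrete, here is a small sketch of how a symbol could be assigned to one alignment column. The "strong" and "weak" residue groups are the ones listed in the ClustalW/ClustalX documentation as we recall them, so check them against the version you are using. Leaving a column blank when it contains a gap is an assumption of this sketch, and the three short sequences are invented.

```python
# Residue groups (assumed from the Clustal documentation; verify for your version).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of aligned residues."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # assumption: no symbol for columns containing a gap
    if len(residues) == 1:
        return "*"  # every sequence has the same residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weakly similar group
    return " "


def consensus_line(rows):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*rows))


# Invented toy alignment, one column per case.
rows = ["MKSLC-A",
        "MRTISGC",
        "MKAVAGW"]
print(repr(consensus_line(rows)))  # → '*:::.  '
```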

    Keep this alignment, as you'll be comparing this alignment to others.

  4. Now, examine the effects of changing gap opening and extension parameters:

    • Go to the ClustalW server again and redo the alignment, but this time CHANGE the alignment parameters above to radically increase the Gap Open value (e.g. to 100).

    • Now do it again, but this time put the Gap Open value back to its default setting and increase the Gap Extension value to 10. Redo the alignment.

    • Now do BOTH changes at once and redo the alignment.
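When you compare the alignments from these runs, it can help to count the gaps rather than judge them by eye. The sketch below reads Clustal-format (.aln) text and reports, for each sequence, how many separate gaps there are and how many gap characters in total. It assumes the usual layout (a "CLUSTAL ..." header line, then blocks of "name sequence" lines with a symbol line underneath). The function names and the tiny alignment are invented for illustration.

```python
import re


def parse_clustal(text):
    """Return {name: aligned_sequence} from Clustal-format alignment text."""
    sequences = {}
    lines = text.splitlines()
    for line in lines[1:]:  # the first line is the "CLUSTAL ..." header
        if not line.strip() or line[0].isspace():
            continue  # skip blank separators and conservation-symbol lines
        fields = line.split()
        name, segment = fields[0], fields[1]
        sequences[name] = sequences.get(name, "") + segment
    return sequences


def gap_summary(aligned):
    """Count separate gaps (runs of '-') and total gap characters."""
    runs = re.findall(r"-+", aligned)
    return {"gap_runs": len(runs), "gap_columns": sum(len(r) for r in runs)}


# Invented two-sequence alignment in two blocks.
TOY_ALN = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seqA      MKT--LLA",
    "seqB      MKTAALLA",
    "          ***  ***",
    "",
    "seqA      GG-K",
    "seqB      GGRK",
    "          ** *",
])

for name, aligned in parse_clustal(TOY_ALN).items():
    print(name, gap_summary(aligned))
```

Running this on each of your saved .aln files gives you numbers you can compare directly across the parameter settings.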

    QUESTION 2. Does the alignment change when you alter the Gap Extension and Gap Opening penalties? If so, how and why does it change? Explain in general terms.










B. Protein motifs/domains using InterPro, NCBI, SMART, and PFAM

Recently, the BLAST server at NCBI added the ability to identify "conserved domains" using a modification of the PSI-BLAST searching procedure called RPS-BLAST. Here, the BLAST algorithm takes a query sequence and searches the CDD (Conserved Domain Database), a database of position-specific scoring matrices (PSSMs) built from well-known protein motifs (also called domains) that tend to occur in many different protein families. There are many such "motif" or "domain" databases. In addition to NCBI's own "curated domain alignments" (cd, LOAD and COG), CDD currently draws on two of them: the SMART database and the PFAM database. These databases are all curated collections of aligned protein motifs.
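If the idea of a PSSM is new to you, the following toy example may help. It builds a small log-odds matrix from four aligned copies of a made-up four-residue motif, then slides it along a made-up sequence to find the best-scoring window. This is a deliberately simplified sketch: it assumes equal background frequencies for all 20 amino acids and a flat pseudocount, whereas the matrices in CDD are built with more careful sequence weighting and pseudocounts.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_motifs, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned_motifs[0])):
        column = [seq[pos] for seq in aligned_motifs]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum the position-specific scores of the residues in one window."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max((score_window(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Invented motif instances and query sequence.
motifs = ["CPGY", "CPGF", "CAGY", "CPGY"]
pssm = build_pssm(motifs)
score, start = best_hit(pssm, "MKTCPGYLLA")
print(f"best window starts at position {start} with score {score:.2f}")
```

Notice that the same residue can score differently at different positions of the motif. That is what "position-specific" means, and it is the key difference from a general matrix such as BLOSUM62.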

  1. Go to the NCBI website (http://www.ncbi.nlm.nih.gov) and retrieve a "mystery" amino acid sequence with the accession number CAC38754.

    Copy this sequence and go to the BLAST website (https://blast.ncbi.nlm.nih.gov/Blast.cgi).

    Scroll down and look under Specialized Searches. Find and click on CD-Search.

    QUESTION 3. What databases are available to search and how big are they (in number of PSSMs)?










  2. Choose the database with the largest number of motifs and paste your sequence into the query window and click on "Submit".

    Once the result window loads, you should see a gray/black line with numbers over it representing your sequence and below it some coloured boxes, which is where you will find your hits.

    Below the line there are coloured regions with abbreviations in the middle. These are the locations and names of the conserved motifs/domains identified. There are 3 different domains identified in this protein. If you click on the "View" dropdown menu in the top right corner and choose "Full Results" you can see different abbreviations given to the "same" domains by different databases. Below all of this is a List of domain hits and corresponding E-values.

  3. Under List of domain hits click "cd00033" under the Accession column then scroll down to the bottom of the page to see the multiple alignment of the conserved domain.

    QUESTION 4: In general what do you think the coloring of the alignment corresponds to? (If you have problems with this question, try playing with the "Color Bits" pull-down menu - change the value and hit 'Reformat'.) What do you think the grey numbers in brackets indicate? (Hint: Check the help document)








  4. Look in the Links box on the left-hand side near the top of the page, next to "Source".

    QUESTION 5. What database does this come from?


    Take note of the name of the conserved domain.

    Domain name:

  5. Go to the SMART database website, Normal SMART mode tool: http://smart.embl-heidelberg.de/smart/set_mode.cgi?NORMAL=1

    At the bottom of the next page (under "Domains detected by SMART"), type the NAME of your domain into the "Search domain and protein annotation" box. Click "Search". If you have trouble here, please ask for help; the SMART website can be a bit tricky to figure out.

    Note: if multiple results come up, select the result that is a SMART domain.

    QUESTION 6. Give a brief description of the domain in the following categories:

    Name:

    SMART accession number:

    Description:



    QUESTION 7: How many of these domains are found in the SMART non-redundant database (abbreviated nrdb)?

    How many proteins in the "nrdb" have these domains?

    Why are these two numbers (above) different?





    QUESTION 8. List two other kinds of information about this protein that can be retrieved from the SMART database (i.e., look on the page and click on links to see what info you can get)









  6. Go back in your browser to the first window from the results of NCBI conserved domains search that showed the conserved domains as coloured rectangles. Click on the button just below the graphical display that says "Search for similar domain architectures".

    QUESTION 9. What does this new page show? Do all proteins on this page have the same numbers and locations of the conserved domains? What does this tell you about protein evolution? (NOTE: We want you to think here; there is no right or wrong answer to this question. This last question is worth the most points.)









    There are other tools that are commonly used to search a protein sequence for domains/motifs and to annotate their function. You will use InterPro to help you determine the function of the domains present in your "mystery" protein sequence.

  7. Go to InterPro (https://www.ebi.ac.uk/interpro/) and copy the "mystery" sequence you are working with (accession number CAC38754) into the "Analyze your protein sequence" window. Press search.

    Near the top, the results window shows the domains and repeats that are present in your sequence. Each coloured bar represents a different domain or repeat. Move your cursor over a bar to display more information about the domain or repeat and its location in your sequence.

  8. Find the domain you noted in Question 5 and click on its name to find out more information about it.

    QUESTION 10: What sequence features are characteristic of this domain?




  9. Go back to the original Results page. Scroll down to the bottom of the page (below the detailed signatures) to the "GO Term Prediction" section. Here you can find information about what functions are predicted for your protein based on the domains that are present, and whether it is part of a cellular component. If you click on the predicted GO Term, you can find out even more!

    QUESTION 11: What are the Molecular functions predicted for your protein?







C. Scouring the database for distant homologues using PSI-BLAST

PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins.

Here, you will learn how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example.

  1. Get the uncharacterized protein MJ0414 from Methanococcus jannaschii (accession number Q57857) in FASTA format.

  2. Go back to the BLAST page, click on Protein BLAST and under Algorithm select PSI-BLAST.

  3. Now paste your protein sequence into the text area. Click on Algorithm Parameters and, near the bottom of the page, change the PSI-BLAST threshold from 0.005 to 0.01. The Expect threshold should be 10. If you are running Internet Explorer, Safari, or Chrome, it may show a different value; if so, please change it back to 10. Also check that, under Scoring Parameters, the Gap existence and Gap extension penalties are 11 and 1, respectively. Now run the BLAST search.
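To see what the "11 and 1" gap costs mean in practice, here is a small worked example. It assumes the usual BLAST convention that a gap of length k costs existence + k × extension. The function name is made up for this handout.

```python
def gap_cost(length, existence=11, extension=1):
    """Cost of a single gap of `length` residues under affine gap penalties."""
    if length <= 0:
        return 0
    return existence + extension * length


# One gap of five residues versus five separate one-residue gaps.
one_long_gap = gap_cost(5)
five_short_gaps = 5 * gap_cost(1)
print(one_long_gap, five_short_gaps)  # → 16 60
```

Under these settings, opening a new gap is much more expensive than lengthening one that already exists.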

    Examine the results of the program's initial gapped BLAST search.

    QUESTION 10. How many significant hits did you get? (significant = E < 0.01)


    Select all significant hits, i.e. those with E-values below the threshold. At this point all of the "checked" sequences will be multiply aligned by PSI-BLAST to build a position-specific scoring matrix (PSSM), which will be used in the next iteration of searching.

  4. Now scroll back up to the top of the Descriptions section and "Run PSI-Blast iteration 2".
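The selection that feeds each new iteration is just a filter on E-values. The sketch below shows that logic on a made-up hit list; the hit names and E-values are invented, and 0.01 is the inclusion threshold you set in step 3.

```python
def select_for_pssm(hits, inclusion_threshold=0.01):
    """Return the (name, e_value) hits below the threshold, best first."""
    selected = [hit for hit in hits if hit[1] < inclusion_threshold]
    return sorted(selected, key=lambda hit: hit[1])


# Invented hits: only the first two are significant at E < 0.01.
hits = [("hit_C", 0.02), ("hit_A", 3e-40), ("hit_D", 7.5), ("hit_B", 0.004)]
for name, e_value in select_for_pssm(hits):
    print(name, e_value)
```

Hits such as hit_C and hit_D would still be listed in the results, because the Expect threshold is 10, but they would not be used to build the PSSM.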

    QUESTION 11. The sequences that were picked up in this iteration are indicated in yellow shading. By looking at all the hits, what is the most common name for the proteins that were identified as hits?


    QUESTION 12. Based on these annotations, can you putatively assign a function to the "unknown protein" you originally used to do the search (describe the FUNCTION, not just the name)?





    The following two questions will require you to read the handouts I've given you in class, the lecture 5 material, and/or NCBI's PSI-BLAST tutorial.

    QUESTION 13. Why do new database hits in the second iteration have E-values of << 0.01, yet did not appear at all in the first iteration?





    QUESTION 14. If you keep running more PSI-BLAST iterations, the search may converge. What does this mean?