This assignment is due one week from today (Feb 2nd). Again, point-form answers are fine. If we don't get through all of the exercises today, you can continue to work on them from any computer with web access. If you have any trouble, feel free to email Shannon Sibbald (shannon.sibbald at dal.ca) or stop by (Rm. 8H1, Sir Charles Tupper Medical Building).
ClustalW multiple sequence alignments online
Protein domains at NCBI
This exercise uses ClustalW, a popular multiple sequence alignment program. We'll be using ClustalW via a web interface.
Download the amino acid sequences for accession numbers AAA50993, P02992, and P32481 in FASTA format from the NCBI protein database (http://www.ncbi.nlm.nih.gov). These are the E. coli tufA, the yeast tufA, and the other yeast sequence from our previous lab.
Connect to the ClustalW web interface. (http://www.ch.embnet.org/software/ClustalW.html)
Paste all 3 sequences in FASTA format into the input sequences area so that they look like:
>ecoli_tu
MSKEKFERTKPHVNVGT...
>yeast1
MSALLPRLLTRTAFKAS...
>yeast2
MSDLQDQEPSIIINGNL...
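If you would rather prepare the input outside the browser, the pasted block can be assembled with a few lines of Python. This is only a sketch: the function name `to_fasta` and the 60-column line width are assumptions made for illustration, and the sequences below are just the short fragments shown above, not the full proteins.

```python
def to_fasta(records, width=60):
    """Format (label, sequence) pairs as multi-FASTA text.

    Each record becomes a '>label' header line followed by the
    sequence wrapped at `width` residues per line.
    """
    lines = []
    for label, seq in records:
        seq = "".join(seq.split()).upper()  # remove spaces and line breaks
        lines.append(f">{label}")
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder fragments only; substitute the full sequences from NCBI.
records = [
    ("ecoli_tu", "MSKEKFERTKPHVNVGT"),
    ("yeast1", "MSALLPRLLTRTAFKAS"),
    ("yeast2", "MSDLQDQEPSIIINGNL"),
]

if __name__ == "__main__":
    print(to_fasta(records), end="")
```

Short labels such as ecoli_tu make the alignment output much easier to read than the long NCBI header lines.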
QUESTION 1. What are the default gap opening and gap extension penalties for the multiple sequence alignment (not pairwise)? What is the default scoring matrix used? (Note: ignore 'End gap penalty' and 'Separation gap penalty'.)
Click "Run ClustalW". In a short while, a results page will appear. It is possible to view a plain text output of your alignment by clicking "clustalw (aln)" under the Multiple alignments column. This will download the multiple sequence alignment, which you can open in any plain text viewing program (e.g. Notepad or TextEdit). Change the name of the file to something meaningful and open the file to have a look at your alignment.
At sequence positions along the alignment, different symbols represent the degree to which the residues are conserved at that site. An asterisk (*) indicates complete conservation, a colon (:) indicates conservation between groups with highly similar properties and a period (.) indicates conservation between groups with weakly similar properties.
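The symbol line can be reproduced approximately with a short script. The sketch below assumes the "strong" and "weak" residue groups as they are usually listed in the ClustalW/ClustalX documentation (quoted here from memory, so please check them against the program's help before relying on them) and assumes that any column containing a gap gets no symbol. The function names are made up for this example.

```python
# Residue groups assumed from the ClustalW/ClustalX documentation.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "          # assumption: gapped columns get no symbol
    if len(residues) == 1:
        return "*"          # identical residue in every sequence
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"          # all residues fall in one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."          # all residues fall in one weak group
    return " "


def consensus_line(aligned):
    """Build the symbol line for equal-length aligned sequences."""
    return "".join(column_symbol(col) for col in zip(*aligned))


if __name__ == "__main__":
    toy = ["MSSK", "MTGR", "MSSK"]   # toy alignment, not real data
    print(consensus_line(toy))       # prints "*:.:"
```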
Keep this alignment, as you'll be comparing this alignment to others.
Now, examine the effects of changing gap opening and extension parameters:
Go to the ClustalW server again and redo the alignment, but this time CHANGE the alignment parameters above to radically increase the Gap Open value (e.g. to 100).
Now do it again, but increase the Gap Extension value to 10 (and put Gap Open back to its default setting), then redo the alignment.
Now do BOTH changes at once and redo the alignment.
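To see how the two parameters interact, it can help to compute gap costs by hand. The sketch below assumes a plain affine model in which the opening penalty is charged once per gap and the extension penalty once per gapped position. ClustalW also adjusts penalties by position and residue type, which is ignored here. The baseline values of 5 and 1 are arbitrary placeholders, NOT the server defaults (finding those is Question 1).

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` positions under a simple affine model."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def total_gap_cost(gap_lengths, gap_open, gap_extend):
    """Summed cost of several separate gaps."""
    return sum(affine_gap_cost(n, gap_open, gap_extend) for n in gap_lengths)


if __name__ == "__main__":
    one_long_gap = [6]            # six gapped positions in a single gap
    three_short_gaps = [2, 2, 2]  # the same six positions in three gaps

    # (gap_open, gap_extend): placeholder baseline, then the lab's changes
    settings = [(5, 1), (100, 1), (5, 10), (100, 10)]
    for gap_open, gap_extend in settings:
        long_cost = total_gap_cost(one_long_gap, gap_open, gap_extend)
        short_cost = total_gap_cost(three_short_gaps, gap_open, gap_extend)
        print(f"open={gap_open:>3} extend={gap_extend:>2}  "
              f"one long gap: {long_cost:>5}  three short gaps: {short_cost:>5}")
```

Comparing the two columns across the four settings shows which parameter penalizes the number of gaps and which penalizes their total length.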
QUESTION 2. Does the alignment change when you alter the Gap Extension and Gap Opening penalties? If so, how and why does it change? Explain in general terms.
Recently, the BLAST server at NCBI added the capacity to identify "conserved domains" using a modification of the PSI-BLAST search procedure called RPS-BLAST. Here the BLAST algorithm is used with a query sequence to search a database (CDD, the Conserved Domain Database) of position-specific scoring matrices (PSSMs) of well-known protein motifs (also called domains) that tend to occur in many different protein families. There are many such "motif" or "domain" databases. In addition to NCBI's own curated domain alignments (cd, LOAD and COG), the two currently included in CDD are the SMART database and the PFAM database. These databases are all curated collections of aligned protein motifs.
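To make the idea of a PSSM concrete, here is a minimal sketch of how one could be built from a small ungapped motif alignment and used to score a query. It is not how RPS-BLAST or CDD actually construct their matrices (those use sequence weighting, real background frequencies and more careful pseudocounts). The uniform background, the pseudocount of 1, the toy motif and all function names are assumptions made for illustration.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned, pseudocount=1.0):
    """Build log-odds scores (bits) for each column of an ungapped alignment."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_window(pssm, window):
    """Sum the position-specific scores for one window of the query."""
    return sum(column[aa] for column, aa in zip(pssm, window))


def best_hit(pssm, sequence):
    """Slide the PSSM along the sequence; return (start, score) of the best window."""
    width = len(pssm)
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: score_window(pssm, sequence[i:i + width]))
    return best, score_window(pssm, sequence[best:best + width])


if __name__ == "__main__":
    motif = ["CPAG", "CPSG", "CPAG", "CKAG"]  # hypothetical motif instances
    pssm = build_pssm(motif)
    start, score = best_hit(pssm, "MKLCPAGTT")  # hypothetical query
    print(f"best window starts at index {start}, score {score:.2f} bits")
```

Unlike a single substitution matrix such as BLOSUM62, the score for a residue here depends on where in the motif it falls.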
Go to the NCBI website (http://www.ncbi.nlm.nih.gov) and retrieve a "mystery" amino acid sequence with the accession number CAC38754.
Copy this sequence and go to the BLAST page (https://blast.ncbi.nlm.nih.gov/Blast.cgi).
Scroll down and look under Specialized Searches. Find and click on CD-Search.
QUESTION 3. What databases are available to search and how big are they (in number of PSSMs)?
Choose the database with the largest number of motifs, paste your sequence into the query window, and click on "Submit".
Once the result window loads, you should see a gray/black line with numbers over it representing your sequence and below it some coloured boxes, which is where you will find your hits.
Below the line there are coloured regions with abbreviations in the middle. These are the locations and names of the conserved motifs/domains identified. There are 3 different domains identified in this protein. If you click on the "View" dropdown menu in the top right corner and choose "Full Results" you can see different abbreviations given to the "same" domains by different databases. Below all of this is a List of domain hits and corresponding E-values.
Under List of domain hits, click "cd00033" under the Accession column, then scroll down to the bottom of the page to see the multiple alignment of the conserved domain.
QUESTION 4: In general what do you think the coloring of the alignment corresponds to? (If you have problems with this question, try playing with the "Color Bits" pull-down menu - change the value and hit 'Reformat'.) What do you think the grey numbers in brackets indicate? (Hint: Check the help document)
Look in the Links box on the left-hand side near the top of the page, next to
QUESTION 5. What database does this come from?
Take note of the name of the conserved domain.
Go to the SMART database website, Normal SMART mode tool:
At the bottom of the next page (under "Domains detected by SMART"), type the NAME of your domain into the "Search domain and protein annotation" box. Click "Search". If you have trouble here, please ask for help, the SMART website can be a bit tricky to figure out.
Note: if multiple results come up, select the result that is a SMART domain.
QUESTION 6. Give a brief description of the domain in the following categories:
SMART accession number:
QUESTION 7: How many of these domains are found in the SMART non-redundant database (abbreviated nrdb)?
How many proteins in the "nrdb" have these domains?
Why are these two numbers (above) different?
QUESTION 8. List two other kinds of information about this protein that can be retrieved from the SMART database (i.e., look on the page and click on links to see what info you can get)
Go back in your browser to the first window from the results of NCBI conserved domains search that showed the
conserved domains as coloured rectangles. Click on the button just
below the graphical display that says "Search for similar
QUESTION 9. What does this new page show? Do all proteins on this page have the same numbers and locations of the conserved domains? What does this tell you about protein evolution (NOTE: We want you to think here... there is no right or wrong answer to this question. This last question is worth the most points)
PSI-BLAST uncovers many protein relationships missed by single-pass database-search methods and has identified relationships that were previously detectable only from information about the three-dimensional structure of the proteins.
Here, you will learn how to operate PSI-BLAST by using a comparison of proteins from thermophilic archaea and bacteria as an example.
Get the uncharacterized protein MJ0414 from Methanococcus jannaschii (accession# Q57857) in FASTA format.
Go back to the BLAST page, click on Protein BLAST and under Algorithm select PSI-BLAST.
Now paste your protein sequence into the text area. Click on Algorithm Parameters and, near the bottom of the page, change the PSI-BLAST threshold from 0.005 to 0.01. The Expect threshold should be 10. If you are running Internet Explorer, Safari, or Chrome, this value may be different; please change it back to 10 if this is the case. Also check that under Scoring Parameters the Gap existence and Gap extension penalties are 11 and 1, respectively. Now run the BLAST search.
Examine the results of the program's initial gapped BLAST search.
QUESTION 10. How many significant hits did you get? (significant = E < 0.01)
Select all significant hits, i.e. those with E-values better (lower) than the threshold. At this point all of the "checked" sequences will be multiply aligned by PSI-BLAST to build a position-specific scoring matrix (PSSM), and this will be used in the next iteration of searching.
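The selection step amounts to a simple filter on E-values. The sketch below assumes a hit is included when its E-value is below the 0.01 threshold set earlier (the same cutoff used in Question 10). The hit names and E-values are invented for illustration and are not results from this search.

```python
def select_for_pssm(hits, threshold=0.01):
    """Return the names of hits whose E-value is below the inclusion threshold.

    `hits` is a list of (name, evalue) pairs.
    """
    return [name for name, evalue in hits if evalue < threshold]


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [
        ("hit_A", 3e-40),
        ("hit_B", 0.004),
        ("hit_C", 0.2),
        ("hit_D", 7.5),
    ]
    print(select_for_pssm(hits))  # prints ['hit_A', 'hit_B']
```

Only the selected sequences contribute to the PSSM, so a loose threshold risks building the matrix from unrelated proteins.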
Now scroll back up to the top of the Descriptions section and "Run PSI-Blast iteration 2".
QUESTION 11. The sequences that were picked up in this iteration are indicated in yellow shading. By looking at all the hits, what is the most common name for the proteins that were identified as hits?
QUESTION 12. Based on these annotations, can you putatively assign a function to the "unknown protein" you originally used to do the search (describe the FUNCTION, not just the name)?
The following two questions will require you to read the handouts I've given you in class or the lecture 5 material, and/or read NCBI's PSI-BLAST tutorial.
QUESTION 13. Why do new database hits in the second iteration have E-values of << 0.01, even though they did not appear at all in the first iteration?
QUESTION 14. If you kept running more PSI-BLAST iterations, they may converge. What does this mean?