LEMURS OF MADAGASCAR

PART III: MOLECULAR DATA

This is the third part of the Lemur practical. In this part we'll work on some molecular datasets. Open the file 'molecular' in MacClade. All the sequences in this matrix were downloaded from GenBank. The matrix consists of sequences from two different genes: cytochrome B (the first 1140 base pairs) and cytochrome C (the other 684 base pairs). We will first analyse and compare the phylogenetic signals contained in each of these genes separately and then we will try to combine these data sets.

In MacClade, enter what type of data we’re dealing with ('Characters > data Format'). What did you enter?


Let's see if we can calculate the codon positions for each base in the matrix. Given that both the cytochrome B and C sequences in the matrix start at codon position 1, can we calculate the codon positions for both genes 'in one go'? Why (not)?

Let MacClade calculate the codon positions ('Characters > codon positions > calculate positions'), save the file and open and execute it in PAUP.
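
As a rough illustration of what 'calculate positions' does for a single gene, the sketch below numbers the bases of a coding sequence 1, 2, 3, 1, 2, 3, and so on. The function name `codon_positions` and its `start` parameter are made up for this example and are not part of MacClade; the sketch assumes a sequence without gaps or introns.

```python
def codon_positions(n_bases, start=1):
    """Return the codon position (1, 2 or 3) of each base in a gap-free coding sequence.

    start is the codon position of the first base (hypothetical parameter).
    """
    if start not in (1, 2, 3):
        raise ValueError("start must be 1, 2 or 3")
    # Positions simply cycle 1, 2, 3 from the starting position onwards.
    return [(start - 1 + i) % 3 + 1 for i in range(n_bases)]


# Toy sequence of 7 bases whose first base is at codon position 1.
print(codon_positions(7))  # [1, 2, 3, 1, 2, 3, 1]
```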

Although the matrix contains 1140+684 characters, it’s still relatively small (taxonwise). What would be a suitable search method in this case?

What’s more important when deciding what search method to use: the number of species or the number of characters? Why is that?

Exclude the cytochrome C characters ('Data > Include-Exclude characters > CharSets > cytochrome C > Exclude'). Run a bootstrap analysis and root the resulting tree on Daubentonia. Print the bootstrap consensus. We'll call this tree "B".

Are there any differences between this tree ("B") and the last tree you made using the communication signals data (the one we printed: "A")? What are they?

Save the data and the tree. Open the file in MacClade. Go to the tree window using the tree you've just made in PAUP. In MacClade you need to reroot the tree on Daubentonia (an explanation of how to manipulate trees in MacClade is in the manual).

Calculate the total number of character state changes per codon position (MacClade tree window > Chart > Character steps/etc). Explain the differing rates among codon positions. Be specific: why would positions 1 and 2 be so different from position 3? (You might want to take a look at the 'Genetic code' table under the 'Characters' menu in the data window.) Refer to natural selection in your answer.

What do you think the phrase ‘multiple hits’ means and how might it affect the observed differing rates among codon positions?

The ratio between the number of state changes of the three codon positions is information we might be able to use to modify the data PAUP uses when searching for phylogenies.

If we did this, would that be a case of weighting characters differently, or weighting types of character changes differently?

Chart the state changes (Tree window > Chart > State changes and Stasis). State changes to the A base and from the G base are the least common.

Why is that? (Charting the states (number of occurrences for each state) might give you a clue.)

How might this information be relevant for molecular phylogeneticists?

What's the transition/transversion ratio? (Go to the MacClade tree window > Chart > State changes. You'll find the absolute number of transitions and transversions. Calculate the ratio from these values).

The A and G bases are both purines; C and T are both pyrimidines. When a purine replaces another purine, or a pyrimidine another pyrimidine, this is called a transition; when a purine replaces a pyrimidine (or vice versa) this is called a transversion. Now take a look at this picture (this is where I found the picture of the DNA bases) and explain why the transition/transversion ratio is significantly different from 1:1. (Refer to the DNA picture in your answer.)
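
The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the counts passed to `ts_tv_ratio` are invented numbers, not values from the lemur matrix.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_change(a, b):
    """Classify a change from base a to base b as 'transition' or 'transversion'."""
    a, b = a.upper(), b.upper()
    bases = PURINES | PYRIMIDINES
    if a not in bases or b not in bases:
        raise ValueError("expected bases A, C, G or T")
    if a == b:
        raise ValueError("the two bases are identical, so there is no change")
    # Purine <-> purine or pyrimidine <-> pyrimidine is a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(n_transitions, n_transversions):
    """Ratio of the absolute numbers of transitions and transversions."""
    return n_transitions / n_transversions


print(classify_change("A", "G"))  # transition
print(classify_change("A", "C"))  # transversion
print(ts_tv_ratio(30, 10))        # 3.0 (invented counts)
```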

Calculate the transition/transversion ratio for the Eulemurs excluding all others. (To do this, you have to go back to PAUP, delete all species except the Eulemurs and perform a branch and bound search. Save the resulting tree(s) and open them in MacClade. Go to the tree window and chart the state changes. Calculate the transition/transversion ratio from the absolute number of transitions and transversions). The transition/transversion ratio within the Eulemurs is significantly different from the transition/transversion ratio within all Lemurs from the matrix. We're going to examine what causes this by looking at an example:

Consider two groups of species. The species within group A more recently shared a common ancestor than the species in group B. Choose the correct words (the choices are between brackets): Between the species in group B [more/fewer] mutations have occurred than between the species in group A. Therefore [more/fewer] 'multiple hits' have occurred in group B.

The chance that we overlook mutations because of these 'multiple hits' under maximum parsimony is higher for group [A/B]. We are especially prone to overlook [transitions/transversions] because they occur more often and because of saturation of 'available' loci. The apparent transition/transversion ratio is therefore [higher/lower] in group B.

Go to PAUP and open the file 'molecular'. Run a branch and bound search on the cytochrome C part of the matrix (so you'll want to exclude the cytochrome B part). Save the resulting tree(s). Run a bootstrap analysis (rooted on Daubentonia) and print the bootstrap consensus. We'll call this one "C".

Reopen the molecular file in MacClade. Go to the tree window using the cytochrome C B&B tree you've just saved. Take a look at the total number of transitions and transversions (MacClade > Tree window > Chart > State changes).

Why are the totals for cytochrome C much lower than for cytochrome B? Are the ratios more or less the same for the two genes? How might you test this statistically?

What are the differences (if any) between the trees constructed using cytochrome B data and cytochrome C data (i.e. between trees "B" and "C")?

Perhaps we can combine both the cytochrome B and cytochrome C data in the matrix. We're going to examine whether or not we can do this.

To do this, we'll perform a partition homogeneity test. This test repeatedly (the default is 100 replicates) divides the characters at random into partitions of the same size as the original cytochrome B and C partitions (so that's 100 partitions of 1140 base pairs, and 100 partitions of 684 base pairs), builds trees using these randomly selected partitions (you can choose which tree building algorithm is used; let's use b&b, it's fast enough with our data) and then checks whether the summed length of the trees based on the original (= cytochrome B and C) partitions falls within the distribution of the summed tree lengths based on the random partitions. If it does, we'll get a p>0.05 and we can safely combine the two data sets.
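
The random partitioning step can be sketched as follows. This is a minimal illustration under the assumption that each replicate simply shuffles the character numbers and cuts them into two blocks of the original sizes; the function name `random_partition` is made up, and the tree search that PAUP runs on each partition is not shown.

```python
import random


def random_partition(n_characters, size_first, rng):
    """Split character indices 0..n-1 at random into two blocks of fixed size."""
    indices = list(range(n_characters))
    rng.shuffle(indices)
    first = sorted(indices[:size_first])
    second = sorted(indices[size_first:])
    return first, second


rng = random.Random(1)  # fixed seed so the example is reproducible
# 100 replicates, each with blocks of 1140 and 684 characters.
replicates = [random_partition(1140 + 684, 1140, rng) for _ in range(100)]
first, second = replicates[0]
print(len(first), len(second))  # 1140 684
```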

Go to Analysis>Partition homogeneity test..., use 'branch-and-bound' and click 'continue'. The resulting p value is the probability of finding incongruence at least as large as that between the original partitions if the data sets are in fact compatible. We reject the possibility that they are compatible when p<0.05 (you probably know this is not unusual in statistics, although it's a rather arbitrary value).
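
To make the p value a little more concrete, here is a small sketch of how such a value could be obtained from the replicates. It rests on two assumptions that are not taken from this practical: that the quantity compared is the summed length of the shortest trees for the two partitions, and that p is the fraction of partitions (the original one included) whose summed length is no greater than the original's. All numbers are invented, and PAUP's exact way of counting may differ.

```python
def homogeneity_p_value(original_sum, random_sums):
    """Fraction of partitions (original included) with summed length <= original_sum."""
    # The original partition counts as one observation that is "as extreme" as itself.
    as_extreme = 1 + sum(1 for s in random_sums if s <= original_sum)
    return as_extreme / (1 + len(random_sums))


# Invented summed tree lengths for 9 random partitions and for the original one.
random_sums = [512, 515, 509, 514, 516, 513, 511, 517, 515]
p = homogeneity_p_value(510, random_sums)
print(p)  # 0.2
print("combine" if p > 0.05 else "do not combine")
```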

So, are the datasets compatible? What's the p value?

Run a branch and bound search using the combined dataset.

Describe the differences (if any) between this tree and the trees you've made using the cytochrome B and cytochrome C data sets separately.

Run a bootstrap analysis using the combined dataset. Print the bootstrap consensus (rooted on Daubentonia) and call it "D".

Does the combined dataset lead to more robust trees (e.g. higher bootstrap values)?

What problems would arise if we tried to combine the molecular and the communication signals matrices, even if the datasets produced the same trees (and were therefore compatible in a phylogenetic sense)?

Construct a genetic distance data tree using the 'combined' data set with distance options 'Jukes-Cantor', and 'equal rates across variable sites'.
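
For reference, the Jukes-Cantor distance between two aligned sequences is d = -3/4 ln(1 - 4p/3), where p is the proportion of sites at which the sequences differ. The sketch below computes it for two invented toy sequences; the function names are made up, and skipping sites with gaps or ambiguity codes is an assumption of this example, not necessarily what PAUP does.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has something other than A, C, G or T are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    valid = "ACGT"
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in valid and b in valid]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 ln(1 - 4p/3); undefined for p >= 0.75."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Two invented 10-base sequences that differ at the last two sites.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(p, jukes_cantor(p))
```

Note that the corrected distance is always at least as large as p, because the correction estimates changes hidden by multiple hits at the same site.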

Is the tree similar to that found with MP?

Notice the relative branch lengths. Where are the branches the shortest, and where the longest?

Does the distance tree indicate the ‘problem areas’ you have found in the earlier trees?

When we calculated the transition/transversion ratio for the Eulemurs, we assumed that the difference between this ratio and the ratio for all Lemurs could be ascribed to the closer relatedness within the Eulemur clade. Was this a correct assumption?

So far, we have constructed and printed 4 bootstrap trees (A through D). We're going to take a closer look at these trees.

Which grouping is by far the most reliable in these trees?

Are the branches within this group as reliable as between this group and the rest of the species?

Name one grouping that was more reliable and one grouping that was less reliable in the combined data set (tree D) than in the separate sets (trees B and C).

If we included skull shapes in the communication signals matrix (remember how I referred to the Hapalemur, Eulemur and Lemur catta skulls earlier in the practical?), would that have a positive or a negative effect on the confidence we can have in tree A?

Is the most reliable grouping of trees B, C and D also present in tree A? And is it just as reliable?

Name:

Student #:

E-Mail address: