
PHYLIP, a few suggestions


Check out PHYLIP on the web! The Institut Pasteur, Paris accepts several alignment formats.

Example: Getting a distance (neighbor-joining) tree. First you must get a distance matrix; from that you get the n-j tree.

1. Run your DNA data (infile) in Dnadist (change to sequential if necessary). Save your infile under a different name, because the next step replaces it (better yet, already have your infile backed up under a different name).

2. Rename the newly created 'outfile' to 'infile'. Run Neighbor. (This is a very fast program!)
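
The rename step is where data gets lost most easily, so here is a minimal Python sketch of it, assuming the PHYLIP programs were run in one working directory. The function name and the backup name 'infile.orig' are made up for this illustration; they are not part of PHYLIP.

```python
import shutil
from pathlib import Path


def promote_outfile(workdir, backup_name="infile.orig"):
    """Back up 'infile' once, then rename 'outfile' to 'infile'.

    backup_name is a hypothetical name chosen for this sketch.
    """
    workdir = Path(workdir)
    infile = workdir / "infile"
    outfile = workdir / "outfile"
    backup = workdir / backup_name

    if not outfile.exists():
        raise FileNotFoundError(f"no 'outfile' in {workdir}")

    # Keep only the very first infile (the original data); later calls
    # leave the backup alone.
    if infile.exists() and not backup.exists():
        shutil.copy2(infile, backup)

    # Path.replace overwrites an existing 'infile' in one step.
    outfile.replace(infile)
    return infile
```

Calling promote_outfile(".") after Dnadist finishes leaves the distance matrix in 'infile', ready for Neighbor, with the sequence data kept in 'infile.orig'.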


A statistical test of your tree using bootstrapping in PHYLIP (neighbor-joining)
(Make sure none of the taxon names have parentheses or any punctuation):

1) Use Seqboot to create (100 or 1000) datasets. (Turning off the menu option that prints indications of the progress of the run will speed the analysis.)
2) Rename the 'outfile' to 'infile'. [In UNIX: % rm infile; mv outfile infile]
3) Run Dnadist (change to sequential and multiple datasets)
4) Rename the 'outfile' to 'infile'.
5) Run Neighbor (change to multiple datasets)
6) Rename the 'treefile' to 'infile'.
7) Run Consense (commas in the names of taxa will cause a problem at this point!)
8) Find the bootstrap tree in 'outfile'.
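
The warning about taxon names can be checked before the run starts. This is a small sketch, assuming you already have the names in a Python list. It flags every ASCII punctuation character, which follows the advice above literally and is probably stricter than the programs need; narrow the UNSAFE set if you know a character is harmless on your system.

```python
import string

# Assumption for this sketch: treat all ASCII punctuation as unsafe.
# Parentheses and commas are the ones named above; they are tree-file syntax.
UNSAFE = set(string.punctuation)


def problem_names(names):
    """Return {name: sorted offending characters} for names that need editing."""
    problems = {}
    for name in names:
        bad = sorted(set(name) & UNSAFE)
        if bad:
            problems[name] = bad
    return problems
```

An empty result means no name contains punctuation; otherwise rename the listed taxa before running Seqboot.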


A statistical test of your tree using bootstrapping in PHYLIP (dnapars) (make sure none of the taxon names have parentheses or any punctuation):
1) Use Seqboot to create (100) datasets. (Change to sequential, not interleaved; turning off the menu option that prints indications of the progress of the run will speed the analysis.)
2) Rename the 'outfile' to 'infile'.
3) Run Dnapars or some other discrete character program (change to sequential and multiple datasets, i.e. menu options 'I' and 'M')
4) Rename the 'treefile' to 'infile'.
5) Run Consense
6) Find the bootstrap tree in 'outfile'.


Common error messages

"error allocating memory": the data file was saved as something other than a text document. (So in Word, use "File, Save As", then choose text rather than Word under type of document.)

"sequences out of alignment" or "base ratios wrong": a possible cause is that you have a sequential format file rather than an interleaved file, which is what the programs expect unless told otherwise. The solution is to choose the option "I" in PHYLIP.
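
For the first error above, a quick way to see whether a file really is plain text is to look for bytes that a text editor would never write. This is only a sketch under the assumption that a valid infile is plain ASCII; the function name is made up for the example.

```python
def non_text_bytes(data):
    """Return the offsets of bytes that are not plain ASCII text."""
    # Printable ASCII plus tab, line feed and carriage return.
    allowed = set(range(32, 127)) | {9, 10, 13}
    return [i for i, b in enumerate(data) if b not in allowed]
```

Use it as non_text_bytes(open("infile", "rb").read()). An empty list suggests a plain text file; a long list suggests a word-processor document.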


Running on UNIX in the background

(please note that your UNIX system may differ from this; ask people locally how to run your job in the background):

nice dnaml (follow menu to start program)
^Z (this will stop program and return you to command line)
jobs (this will list any jobs running and give their number)
bg %1 (this will start job # 1 running in the background)
ps -u your_user_name -l (this will give a long list of all your jobs)


Dave Carmean, Simon Fraser University ver Nov'97 (Please send comments to me at carmean@sfu.ca)
http://mendel.mbb.sfu.ca/staff/dc/resources.html or http://carex.ekbot.umu.se/~dave/resources.html

Excerpts from Felsenstein's documentation (copyrighted by Joe Felsenstein and the University of Washington. Please get his programs and complete documentation by anonymous FTP at 128.95.12.41):

...DNAPARS uses the parsimony method allowing changes between all bases and counting the number of those. DNAMOVE is an interactive parsimony program allowing the user to rearrange trees by hand and see where characters states change. DNAPENNY uses the branch-and-bound method to search for all most parsimonious trees in the nucleic acid sequence case. ... DNAML does a maximum likelihood estimate of the phylogeny (Felsenstein, 1981a). DNAMLK is similar to DNAML but assumes a molecular clock. DNADIST computes distance measures between pairs of species from nucleotide sequences, distances that can then be used by the distance matrix programs FITCH and KITSCH. RESTML does a maximum likelihood estimate from restriction sites data. SEQBOOT allows you to read in a data set and then produce multiple data sets from it by bootstrapping, delete-half jackknifing, or by permuting within sites. This then allows most of these methods to be bootstrapped or jackknifed, and for the Permutation Tail Probability Test of Archie (1989) and Faith and Cranston (1991) to be carried out.


INTERLEAVED AND SEQUENTIAL FORMATS

[I strongly suggest you use the sequential format! -DC]
The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main document file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names.
Here are hypothetical examples of interleaved format and of sequential format (same sequences).

interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

sequential format (same sequences):

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

In interleaved format the present versions of the programs may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage-return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that the programs complain that the sequences are not properly aligned, and you can find no other cause for this complaint.
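
[Since I recommend sequential format, here is a minimal Python sketch of the conversion, not a PHYLIP tool. It assumes the first line holds only the two counts (no option letters), that every block lists the species in the same order, and that names occupy the first 10 characters. The function name is made up for this example. -DC-style note]

```python
def interleaved_to_sequential(text, name_width=10):
    """Convert interleaved PHYLIP text to sequential PHYLIP text."""
    lines = text.splitlines()
    counts = lines[0].split()
    n_species, n_sites = int(counts[0]), int(counts[1])

    # Blank lines (even ones holding stray spaces) only separate blocks.
    body = [ln for ln in lines[1:] if ln.strip()]
    if not body or len(body) % n_species != 0:
        raise ValueError("data lines are not a multiple of the species count")

    names = []
    parts = [[] for _ in range(n_species)]
    for i, line in enumerate(body):
        if i < n_species:
            # Only the first block carries the names.
            names.append(line[:name_width].ljust(name_width))
            line = line[name_width:]
        parts[i % n_species].append(line.strip())

    out = [f"{n_species:5d} {n_sites:5d}"]
    for name, chunks in zip(names, parts):
        # Blanks and digits do not count as sites.
        n = sum(1 for chunk in chunks for c in chunk
                if not (c.isspace() or c.isdigit()))
        if n != n_sites:
            raise ValueError(f"{name.strip()}: found {n} sites, expected {n_sites}")
        out.append(name + chunks[0])
        out.extend(chunks[1:])
    return "\n".join(out) + "\n"
```

Because whitespace-only lines are dropped before the blocks are counted, the sketch also sidesteps the blank-line problem described above. The site count check catches a file that is really sequential or is misaligned.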

INPUT FOR THE DNA SEQUENCE PROGRAMS
The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing.

These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.
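
The reading rules just described (blanks and digits ignored, case folded to upper, a fixed set of accepted symbols) can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not the code PHYLIP itself uses; the names ALLOWED and clean_sequence are made up for the example.

```python
# The symbols listed above as acceptable sequence characters.
ALLOWED = set("ABCDGHKMNORSTUVWXY?-")


def clean_sequence(raw):
    """Drop blanks and digits, upper-case the rest, reject anything else."""
    kept = [c.upper() for c in raw if not (c.isspace() or c.isdigit())]
    bad = sorted(set(kept) - ALLOWED)
    if bad:
        raise ValueError("characters not accepted: " + "".join(bad))
    return "".join(kept)
```

A GENBANK-style line such as "        1 aagcttnggc atttcagggt" comes out as one upper-case string, while a sequence containing a period raises an error.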

 
    Symbol   Meaning                   Symbol    Meaning
   ------    -------                   ------    -------
     A       Adenine                      Y       pYrimidine  (C or T)      
     G       Guanine                      R       puRine      (A or G)      
     C       Cytosine                     W       "Weak"      (A or T)      
     T       Thymine                      S       "Strong"    (C or G)      
     U       Uracil                       K       "Keto"      (T or G)     
     M       "aMino" (C or A)             V       not T       (A or C or G)    
     B       not A   (C or G or T)      X,N,?     unknown     (A or C or G or T)
     D       not C   (A or G or T)        O       deletion (letter)
     H       not G   (A or C or T)        -       deletion
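
The table translates directly into a lookup. The sketch below is an illustration only; it makes two assumptions that the table does not state: U is read as T, and the two deletion symbols map to an empty set of bases.

```python
# Symbol -> bases it can stand for, taken from the table above.
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "T",                      # assumption: U is read as T
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "", "-": "",              # assumption: deletions match no base
}


def possible_bases(symbol):
    """Return the set of bases a sequence symbol may represent."""
    return set(IUPAC[symbol.upper()])


def is_ambiguous(symbol):
    """True when the symbol stands for more than one base."""
    return len(possible_bases(symbol)) > 1
```

For example, possible_bases("y") gives C and T, and is_ambiguous("A") is False.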

The programs allow options chosen from their menus. Many of these are as described in the main documentation file, particularly the options J, O, U, T, W, and Y. (Although T has a different meaning in the programs DNAML and DNADIST than in the others).

The U option indicates that user-defined trees are provided at the end of the input file. This happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and DNAMLK, the trees must be strictly bifurcating, containing only two-way splits, e. g.: ((A,B),(C,(D,E)));. For DNAML and RESTML it must have a trifurcation at its base, e. g.: ((A,B),C,(D,E));. The root of the tree may in those cases be placed arbitrarily, since the trees needed are actually unrooted, though they look different when printed out. The program RETREE should enable you to reroot the trees without having to hand edit or retype them. For DNAMOVE the U option is not available (although there is an equivalent feature which uses rooted user trees).
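
A user tree can be checked for these shapes before it is appended to the input file. The sketch below simply counts the branches at each pair of parentheses. It assumes taxon names contain no parentheses or commas, and it also assumes that in the trifurcating case every fork other than the base must be two-way, which the text above does not say explicitly. The function names are made up for the example.

```python
def fork_sizes(newick):
    """Return the number of branches at each fork, outermost fork first."""
    sizes, stack = [], []
    for ch in newick:
        if ch == "(":
            sizes.append(1)              # a new fork starts with one branch
            stack.append(len(sizes) - 1)
        elif ch == "," and stack:
            sizes[stack[-1]] += 1        # one more branch at the current fork
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()
    if stack:
        raise ValueError("unbalanced parentheses")
    return sizes


def is_strictly_bifurcating(newick):
    """Shape asked of PROTPARS, DNAPARS, DNACOMP and DNAMLK user trees."""
    sizes = fork_sizes(newick)
    return bool(sizes) and all(n == 2 for n in sizes)


def has_trifurcating_base(newick):
    """Shape asked of DNAML and RESTML user trees (other forks assumed two-way)."""
    sizes = fork_sizes(newick)
    return bool(sizes) and sizes[0] == 3 and all(n == 2 for n in sizes[1:])
```

With the two example trees above, is_strictly_bifurcating accepts ((A,B),(C,(D,E))); and has_trifurcating_base accepts ((A,B),C,(D,E));.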


This page is maintained by Dave Carmean with an eye towards speed and clarity, and last modified 8 April 1997. Comments or suggestions are welcomed!
