
PHYLIP, a few suggestions to start with

Dave Carmean, Simon Fraser University ver 0.3 (Please send comments to me at carmean@sfu.ca)

PHYLIP is a suite of programs that runs on most computers (Unix, Macintosh, Windows and others). The most commonly used programs for DNA analysis are DNAPARS, DNADIST, and DNAML. For protein sequences, use PROTDIST and PROTPARS. To generate trees from either protein or DNA distance matrices, you can choose FITCH or NEIGHBOR. The programs can be obtained from J. Felsenstein.

PHYLIP requires aligned sequences, all with the same number of bases. If you are starting in SeqApp, choose a block of sequences to be analyzed, lock all indels, and in the top right corner of the align window change the format to PHYLIP sequential (this allows deleting or adding taxa). SeqApp will not create a file from sequences that differ in length (and if you do not give it a new filename, it may destroy your file while attempting to). Sometimes SeqApp does not recognize that all sequences are the same length. If you get a message to that effect, highlight the block of sequence to be analyzed, and under 'File' choose 'Save selection', giving the file a new name.

In Word, search and replace all "." with "N". DNAPARS considers '-' as a 5th character state, so a gap 3 bases long is potentially three added steps; this should be taken into account when preparing the file and comparing the results with PAUP. DNAML treats '-' as 'N'. The two numbers at the top of the file are the number of taxa and the number of characters (bases); the number of taxa must be changed if you delete any taxa. Commas in taxon names will cause problems with treefiles. Save as a text file with a simple filename. (PHYLIP searches for a file named 'infile'; if it does not find one, it asks for the name of the file. PHYLIP may give the message "error allocating memory" if you try to run a file saved other than as a text document.) Place this file in the same directory (folder) as the PHYLIP program. Make sure that there is no outfile or treefile that needs to be saved (PHYLIP overwrites any files named outfile or treefile, creating new ones on each run). PHYLIP reads only the number of taxa and characters indicated in the first line, and the number of user trees indicated in the first line after the data of the last taxon. Thus you may place comments (or data from taxa not included in that analysis run) at the end of the file.
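
Because PHYLIP overwrites outfile and treefile on every run, it can help to move old results aside first. The following is a minimal Python sketch of that precaution, not part of PHYLIP itself; the function name protect_outputs and the numbered-suffix naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def protect_outputs(directory):
    """Rename any existing 'outfile' or 'treefile' so a new run cannot overwrite it.

    Each file is renamed to the first free numbered name (outfile.1, outfile.2, ...).
    Returns the list of new file names. Hypothetical helper, not a PHYLIP feature.
    """
    moved = []
    for name in ("outfile", "treefile"):
        path = Path(directory) / name
        if not path.exists():
            continue
        n = 1
        # Find the first numbered name that is not already taken.
        while (target := path.with_name(f"{name}.{n}")).exists():
            n += 1
        path.rename(target)
        moved.append(target.name)
    return moved
```

Run it on the PHYLIP folder before starting an analysis, for example protect_outputs("."), and then start the program as usual.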

In PHYLIP, enter 'I' (assuming your sequences are sequential), change any other options, and then 'y' to begin the analysis. With luck, the taxon names will scroll across the screen (with user trees you will see no progress until the program is done computing). Otherwise, you may get a message: sequences out of alignment, or base ratios wrong. Maximum likelihood is much faster with user trees but the progress of the run is not indicated.

A PHYLIP sequential file can be used in PAUP by simply adding a NEXUS header to it (deleting 'interleaved') and adding "; end" at the end of the file.
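
To show roughly what the resulting file looks like, here is a small Python sketch that wraps already-parsed sequences in a NEXUS data block. The exact header used on this page is not given, so the header lines below (datatype=dna, missing=?, gap=-) are an assumption based on the usual layout of a NEXUS data block, and the function name phylip_to_nexus is hypothetical.

```python
def phylip_to_nexus(records):
    """Build a NEXUS data block from a list of (name, sequence) pairs.

    Assumes DNA data, equal-length sequences, and names without commas.
    Spaces inside names are replaced by underscores, since NEXUS treats
    whitespace as a separator.
    """
    if not records:
        raise ValueError("no sequences given")
    nchar = len(records[0][1])
    if any(len(seq) != nchar for _, seq in records):
        raise ValueError("sequences differ in length")
    out = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(records)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",  # assumed header, see text
        "  matrix",
    ]
    for name, seq in records:
        label = name.strip().replace(" ", "_")
        out.append(f"  {label:<10} {seq}")
    out += ["  ;", "end;"]
    return "\n".join(out) + "\n"


print(phylip_to_nexus([("CICADA", "ACGT-A"), ("ELAPHROTH", "ACGTNA")]))
```

The sequence rows are the same as in the sequential file; only the header and the closing lines are new.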

User trees can be quickly made in MacClade and exported in PHYLIP format. To be consistent, make all trees right-leaning. For use in DNAml, trees must be trifurcated at the base (unrooted): erase the last closing parenthesis ')' of each tree and an earlier opening parenthesis '('. Do not change any commas. Thus:

      (CICADA    ,((ELAPHROTH ,(ONCOTHRIP ,HOPLOTHRI )),(FRANKLINI ,(HETEROTHR ,AEOLOTHRI ))));
becomes:
      (CICADA    ,(ELAPHROTH ,(ONCOTHRIP ,HOPLOTHRI )),(FRANKLINI ,(HETEROTHR ,AEOLOTHRI )));
and
      (CICADA    ,(ELAPHROTH ,(FRANKLINI ,(HETEROTHR ,AEOLOTHRI ))));
becomes:
      (CICADA    ,ELAPHROTH ,(FRANKLINI ,(HETEROTHR ,AEOLOTHRI )));

Alternatively, use PHYLIP's RETREE program: open the treefile (call it intree), view it briefly, and quit. RETREE will then ask whether you wish to save the tree (answer yes) and whether it should be rooted or unrooted (answer unrooted).
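
The hand edit described above (erase the last closing parenthesis and one earlier opening parenthesis) can also be scripted. This is a minimal Python sketch under stated assumptions: one tree per string, no branch lengths or node labels, and no commas in taxon names. The function names split_top_level and unroot are hypothetical, and this is not how RETREE itself works.

```python
def split_top_level(s):
    """Split s at commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return parts


def unroot(newick):
    """Turn a basal two-way split into a trifurcation by unwrapping one group."""
    body = newick.strip().rstrip(";").strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("tree must be enclosed in parentheses")
    children = split_top_level(body[1:-1])
    if len(children) != 2:
        return body + ";"  # base is not a two-way split; leave the tree alone
    # Prefer the right-hand group, matching the right-leaning examples above.
    for idx in (1, 0):
        child = children[idx].strip()
        if child.startswith("(") and child.endswith(")"):
            children[idx] = child[1:-1]  # drop that group's outer parentheses
            return "(" + ",".join(children) + ");"
    raise ValueError("a two-taxon tree cannot be given a basal trifurcation")


print(unroot("(A,((B,(C,D)),(E,(F,G))));"))
```

No commas are added or removed; only one pair of parentheses disappears, which is the same change as the hand edit.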

Comments on formatting.
PHYLIP requires aligned sequences, each with the same total number of characters (bases, gaps, and N's). DNAPARS considers the gap '-' as a 5th character state, so a gap 3 bases long is potentially three added steps; this should be taken into account when preparing the file and comparing the results with PAUP or other programs. DNAML treats '-' as missing or 'N'. Two numbers, the number of taxa and the number of characters (bases), are required at the top of the file. The data file should be a text (ASCII) file with no formatting and a simple filename. PHYLIP searches for a file named 'infile'; if it does not find one, it asks for the name of the file. Place this file in the same directory (folder) as the PHYLIP program. Make sure that there is no outfile or treefile that needs to be saved (PHYLIP overwrites any files named outfile or treefile, creating new ones on each run). PHYLIP reads only the number of taxa and characters indicated in the first line, and the number of user trees indicated in the first line after the data of the last taxon. Thus you may place comments (or data from taxa not included in that analysis run) at the end of the file.

PHYLIP accepts two main formats, sequential and interleaved. Sequential means that all the characters (bases) of each taxon follow the name of that taxon in a single block. Sequential is the simplest format to deal with, as it allows you to quickly add or delete taxa; however, unless each taxon has only one line of characters, it is not possible to see at a glance that the characters are reasonably aligned.
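
Since misaligned or miscounted files are the usual cause of trouble, a sequential file can be checked before it is handed to PHYLIP. The Python sketch below reads the two header numbers, takes the first 10 characters of each taxon's first line as its name, and verifies that each taxon has exactly the stated number of characters. It assumes strict sequential format with no user trees; the function name read_sequential is hypothetical.

```python
def _clean(text):
    """Drop blanks and digits, which the DNA programs ignore in sequence data."""
    return "".join(c for c in text if not c.isspace() and not c.isdigit())


def read_sequential(text):
    """Parse a sequential PHYLIP file and check every sequence length.

    Returns a list of (name, sequence) pairs. Anything after the last taxon
    (comments, unused taxa) is ignored, as described above.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    records, pos = [], 1
    for _ in range(ntax):
        if pos >= len(lines):
            raise ValueError(f"header promises {ntax} taxa, found {len(records)}")
        name = lines[pos][:10].strip()
        seq = _clean(lines[pos][10:])
        pos += 1
        # A sequence may continue over several lines until nchar is reached.
        while len(seq) < nchar and pos < len(lines):
            seq += _clean(lines[pos])
            pos += 1
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, found {len(seq)}")
        records.append((name, seq))
    return records


example = "2 8\nCICADA    ACGT\nACGT\nELAPHROTH ACG-NNTT\na comment after the data\n"
print(read_sequential(example))
```

Because a short sequence makes the reader run on into the next taxon's line, the error message may name the taxon before the one that is actually at fault.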

To start a PHYLIP program in DOS or UNIX, enter the name of the program you wish to run. In Windows or on a Macintosh, (double) click on the program. You must quit before starting a new analysis.

In PHYLIP, enter 'I' (to change to sequential), change any other options, and then 'y' to begin the analysis. With luck, the taxon names will scroll across the screen (with user trees you will see no progress until the program is done computing). Otherwise, you may get a message: sequences out of alignment, or base ratios wrong. Maximum likelihood, which is computationally intensive and one of the slowest programs on earth, is much faster with user trees but again the progress of the run is not indicated.

To view PHYLIP output, use a non-proportional font such as Courier; otherwise the trees will be difficult to interpret. A smaller size or a landscape (sideways) view may also help.

From Joe Felsenstein's documentation (copyrighted by Joe Felsenstein and the University of Washington; please get his programs and complete documentation from Felsenstein):

The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing.

These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.

 
    Symbol   Meaning                   Symbol    Meaning
   ------    -------                   ------    -------
     A       Adenine                      Y       pYrimidine  (C or T)      
     G       Guanine                      R       puRine      (A or G)      
     C       Cytosine                     W       "Weak"      (A or T)      
     T       Thymine                      S       "Strong"    (C or G)      
     U       Uracil                       K       "Keto"      (T or G)     
     M       "aMino" (C or A)             V       not T       (A or C or G)    
     B       not A   (C or G or T)      X,N,?     unknown     (A or C or G or T)
     D       not C   (A or G or T)        O       deletion (letter)
     H       not G   (A or C or T)        -       deletion
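
As an illustration only, the table above can be written as a Python lookup and used to find characters that the DNA programs will not accept, such as the period. The names IUPAC, expand, and invalid_symbols are hypothetical and not part of PHYLIP; the two deletion symbols are represented here by "-".

```python
# Each symbol maps to the bases it may stand for, transcribed from the table above.
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "-", "-": "-",  # deletion
}


def expand(symbol):
    """Return the bases a symbol may represent (case-insensitive)."""
    return IUPAC[symbol.upper()]


def invalid_symbols(sequence):
    """List characters that are neither in the table nor ignored.

    Blanks and digits are skipped because the programs ignore them.
    """
    return sorted(
        {c for c in sequence.upper()
         if c not in IUPAC and not c.isspace() and not c.isdigit()}
    )


print(expand("y"))
print(invalid_symbols("ACG.T 10 -n"))
```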

The programs allow options chosen from their menus. Many of these are as described in the main documentation file, particularly the options J, O, U, T, W, and Y. (Although T has a different meaning in the programs DNAML and DNADIST than in the others).

The U option indicates that user-defined trees are provided at the end of the input file. This happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and DNAMLK, the trees must be strictly bifurcating, containing only two-way splits, e. g.: ((A,B),(C,(D,E)));. For DNAML and RESTML it must have a trifurcation at its base, e. g.: ((A,B),C,(D,E));. The root of the tree may in those cases be placed arbitrarily, since the trees needed are actually unrooted, though they look different when printed out. The program RETREE should enable you to reroot the trees without having to hand edit or retype them. For DNAMOVE the U option is not available (although there is an equivalent feature which uses rooted user trees).
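
Whether a user tree has the shape a given program expects can be checked by counting the branches at each fork. The following Python sketch is an illustration, not part of PHYLIP; it assumes a single tree per string and no commas or parentheses inside taxon names, and the function names are hypothetical.

```python
def child_counts(newick):
    """Return the number of branches at each fork, in the order forks close.

    The last entry belongs to the base of the tree.
    """
    counts, stack = [], []
    for ch in newick:
        if ch == "(":
            stack.append(1)       # a new fork starts with one branch
        elif ch == ",":
            if stack:
                stack[-1] += 1    # each comma adds a branch to the open fork
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            counts.append(stack.pop())
    if stack:
        raise ValueError("unbalanced parentheses")
    return counts


def is_strictly_bifurcating(newick):
    """True if every fork is a two-way split (PROTPARS, DNAPARS, DNACOMP, DNAMLK)."""
    counts = child_counts(newick)
    return bool(counts) and all(n == 2 for n in counts)


def has_basal_trifurcation(newick):
    """True if the base of the tree is a three-way split (DNAML, RESTML)."""
    counts = child_counts(newick)
    return bool(counts) and counts[-1] == 3


print(is_strictly_bifurcating("((A,B),(C,(D,E)));"))
print(has_basal_trifurcation("((A,B),C,(D,E));"))
```

An unbalanced tree raises an error instead of returning a count, which catches a stray or missing parenthesis left over from hand editing.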

This page is maintained by Dave Carmean with an eye towards speed and clarity, and last modified 13 May 1996. Comments or suggestions are welcomed!
