From SeqApp to NEXUS format (PAUP and MacClade)

Version Oct-95, Dave Carmean. Comments to carmean@sfu.ca

In SeqApp, select the block of sequences to be analyzed (preferably with the outgroup sequences first) and lock all indels (alternatively, use Word to search and replace every '~' with '-'). Then, in the top right corner of the align window, change the format to PAUP/NEXUS.
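
For those who prefer a script to Word, the search-and-replace alternative can be sketched in Python. This is only an illustration, not part of SeqApp. The function name lock_indels is hypothetical, and the sketch assumes that '~' appears nowhere in the file except as an indel character.

```python
def lock_indels(text: str) -> str:
    """Replace every '~' with '-', like a global search-and-replace."""
    # Assumption: '~' occurs only as an indel character in the sequence data.
    return text.replace("~", "-")


print(lock_indels("LUCMTPIEA  TTTTATCCTGCAT~~ACTTT"))
```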

PAUP requires aligned sequences, all with the same number of bases, and SeqApp will not create a file from sequences that differ in length. Sometimes SeqApp fails to recognize that all the sequences are the same length. If you get a message to that effect, highlight the block of sequence to be analyzed and choose 'Save selection' from the 'File' menu. Give new names to files being saved in the PAUP format before saving: on rare occasions people have lost all the data in their files at this point.
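
The equal-length requirement can be checked outside SeqApp with a few lines of Python. This is a minimal sketch under stated assumptions: the sequences have already been read into a dictionary of taxon name to sequence string, the function name unequal_lengths is hypothetical, and the first sequence is taken as the reference length.

```python
def unequal_lengths(seqs: dict[str, str]) -> dict[str, int]:
    """Return {name: length} for sequences whose length differs from the first one."""
    lengths = {name: len(seq) for name, seq in seqs.items()}
    if not lengths:
        return {}
    # Assumption: the first sequence defines the expected alignment length.
    expected = next(iter(lengths.values()))
    return {name: n for name, n in lengths.items() if n != expected}


alignment = {
    "Dros": "TACTATCCTGCT",
    "Apis": "TACTTTCCCTCA",
    "MSQNCATR": "TATTACCCCTC",  # one base short
}
print(unequal_lengths(alignment))
```

An empty result means every sequence has the same number of bases (gaps included), which is what PAUP expects.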

SeqApp provides a NEXUS header and ending, but the header does not specify the characters used for gaps and missing data. If you execute a SeqApp-generated file in PAUP, you will get the error message: "Keyword 'Matrix' not recognized." To fix this, find "missing=;" immediately before the word "matrix" and change it to "missing=. gap=- ;".
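
The header fix can also be applied with a short script. The following is a sketch rather than a tested tool: the function name fix_format_line is hypothetical, and it assumes the SeqApp output contains the text "missing=;" (possibly with stray spaces around the '=') exactly as described above.

```python
import re


def fix_format_line(nexus_text: str) -> str:
    """Change the first 'missing=;' to 'missing=. gap=- ;'."""
    return re.sub(
        r"missing\s*=\s*;",      # the incomplete keyword as SeqApp writes it
        "missing=. gap=- ;",     # missing-data and gap characters filled in
        nexus_text,
        count=1,
        flags=re.IGNORECASE,
    )


header = " format datatype=dna interleave missing=;\n  matrix\n"
print(fix_format_line(header))
```

A header that already names the missing and gap characters is left unchanged, so running the script twice does no harm.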

Check that the correct number of bases has been saved: sometimes SeqApp saves only a portion of the highlighted sequences. If this occurs, build the PAUP file by hand: save the sequences (all the same length) in the Pearson/FASTA format, remove the information after each taxon name (so the only space is between the taxon name and the data), and add a PAUP header, deleting the word 'interleave'.
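
The hand conversion from Pearson/FASTA can be sketched in Python. This is an illustration only, with several assumptions: each record starts with a '>' line whose first word is the taxon name, everything after that word is discarded, the function name fasta_to_nexus is hypothetical, and the format line simply copies the first sample file on this page with 'interleave' removed.

```python
def fasta_to_nexus(fasta_text: str) -> str:
    """Build a non-interleaved NEXUS data block from Pearson/FASTA text."""
    names, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the taxon name; drop the description after it.
            fields = line[1:].split()
            names.append(fields[0] if fields else f"taxon{len(names) + 1}")
            seqs.append("")
        elif names:
            seqs[-1] += line.replace(" ", "")
    if not names:
        raise ValueError("no sequences found")
    nchar = len(seqs[0])
    if any(len(s) != nchar for s in seqs):
        raise ValueError("sequences differ in length")
    width = max(len(n) for n in names)
    rows = [f"{n.rjust(width)}  {s}" for n, s in zip(names, seqs)]
    lines = [
        "#NEXUS",
        "begin data;",
        f" dimensions ntax={len(names)} nchar={nchar};",
        " format datatype=dna missing=. gap=-;",  # no 'interleave'
        "  matrix",
        *rows,
        ";",
        "  end;",
    ]
    return "\n".join(lines) + "\n"


print(fasta_to_nexus(">Dros 36 bases\nTACTATCC\n>Apis 36 bases\nTACTTTCC\n"))
```

Because the dimensions line is computed from the data, this also sidesteps the problem of a wrong base count in the saved file.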

As shown in the second file below, one may also add "respectcase" to the header: this allows ambiguous or unalignable data to be included in the file (as lowercase) but ignored in the analysis. MacClade does not currently support the 'respectcase' option, but one can rapidly produce a data matrix for MacClade from PAUP with all the ignored characters as '?'. By using equate="A=G" and equate="T=C" one may do transversion parsimony (the second file contains a similar recoding to R and Y, commented out).
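
To make the two recodings concrete, here is a small Python sketch of what they do to a single sequence. It is not how PAUP implements 'respectcase' or 'equate', and it is not the PAUP-to-MacClade route mentioned above; the function names mask_lowercase and to_ry are hypothetical. The first function turns lowercase (ignored) bases into '?'. The second collapses purines to R and pyrimidines to Y, so that only transversions remain as differences.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_TABLE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})


def mask_lowercase(seq: str, missing: str = "?") -> str:
    """Replace every lowercase base with the missing-data character."""
    return "".join(missing if ch.islower() else ch for ch in seq)


def to_ry(seq: str) -> str:
    """Recode uppercase A/G as R and C/T as Y; leave everything else alone."""
    return seq.translate(RY_TABLE)


print(mask_lowercase("TATTATCCATCCTTAACacT"))
print(to_ry("TACTACCC-TGC"))
```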

Following are two PAUP files. The first is directly from SeqApp, except that the required information has been put in after the word 'missing' and a comment has been added. Note that PAUP ignores all comments between brackets. The second file has character and taxon sets added to it, as well as many other embellishments explained in brackets.

#NEXUS
[   Sample file from SeqApp-139638 -- data title]

[Name: DROMTTGNC         Len:    36  Check:  1ADB7C5]
[Name: Dros              Len:    36  Check: C8051EB0]
[Name: YMU09206          Len:    36  Check: E84CA691]
[Name: LUCMTPIEA         Len:    36  Check: 80147A4E]
[Name: MSQNCATR          Len:    36  Check:   31F0A8]
[Name: Apis              Len:    36  Check: 736F31AC]

begin data;
 dimensions ntax=6 nchar=36;
 format datatype=dna interleave missing=. gap=-; ["missing=;" changed to "missing=. gap=- ;" ]
  matrix
DROMTTGNC  TACTACCCTGCTCTTTCTTT ATTATTAGTAAGAAGA
     Dros  ...TATCCTGCTCTTTCTTT ATTATTAGTAAGAAGA
 YMU09206  TATTATCCATCCTTAACACT ATTAATTTCTAGAAGA
LUCMTPIEA  TTTTATCCTGCAT--ACTTT ACTATTAGTAAGTAGT
 MSQNCATR  TATTACCCCTCTTTAACTCT TCTAA-TTCTAGAAGT
     Apis  TACTTTCCCTCATTATTTAT ACTTTTATTAAGAAAT
;
  end;







#NEXUS
[! CO1 Data][Comments in brackets are ignored by PAUP, the '!' next to the left bracket 
	makes the comment visible when the file is executed by PAUP ]

 begin data;
 dimensions  ntax=6 nchar=36 [If you do not know the number of characters, use a very large number here, place a 
	'@' as the last character of the last taxon, execute the file, and PAUP will generate an error message of 
	one more than the actual number of characters];
FORMAT     
    MISSING=N    respectcase  
  [Enclose the "equate...  =N" in brackets and re-execute file to produce data matrix in PAUP]
   equate="a=N"     equate="c=N" equate=".=N"     equate="n=N"         
   equate="g=N"     equate="t=N"   equate="I=N" 
   [equate="A=R" equate="G=R"   equate="T=Y" equate="C=Y" ][Allows Transversion Parsimony]
   SYMBOLS="ACGTacgtI" INTERLEAVE  [Don't interleave if using a PIR file or a PHYLIP sequential file] 
   GAP=-;  OPTIONS IGNOR=INVAR;  
  matrix
DROMTTGNC  TACTACCCTGCTCTTTCT TTATTATTAGTAAGAAGA
     Dros  TACTATCCTGCTCTTTCT TTATTATTAGTAAGAAGA
 YMU09206  TATTATCCATCCTTAACa cTATTAATTTCTAGAAGA
LUCMTPIEA  TTTTATCCTGCATTAACT TTACTATtagtaagtagt [lower case ignored if using the respectcase format]
 MSQNCATR  TATTACCCCTCTTTAACT CTTCTAATTTCTAGAAGT
     Apis  TACTTTCCCTCATTATTT ATACTTTTATTAAGAAAT   ;
  end;
begin assumptions;
charset begin = 1-5;  [Character sets (for excluding/including etc). This one is named 'begin']
charset various = 8 10 13-16;
charset first = 1-36\3;   [For protein-coding sequences: every third base, here the first codon positions]
charset second = 2-36\3;
charset third = 3-36\3;
taxset  one = 1 3 5;    [Taxon sets, note taxa may be referred to by number or name]
taxset Diptera = DROMTTGNC LUCMTPIEA MSQNCATR Dros;    end;
begin PAUP;
outgroup Apis; [Automatically roots trees at Apis instead of the first taxon when file is executed]
delete DROMTTGNC  ;  [Automatically excludes taxa or taxa sets]
exclude various  third ; [Automatically excludes characters or character sets]

  constraints   both_genera =  ((Dros,DROMTTGNC),(LUCMTPIEA, MSQNCATR)); [Constraint tree, only 
	enforced when ticked in the search box]   end;
begin trees;    uTREE both_genera =  (((2,3),6),(4,5)) [Places tree in memory upon execution];     end;
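
The three codon-position character sets in the second file above follow a simple pattern (start-end\3), so for longer alignments they can be generated rather than typed. The sketch below is an illustration with hypothetical function names (codon_charsets, expand_charset). It assumes the reading frame starts at base 1 and uses the same set names as the sample file.

```python
def codon_charsets(nchar: int, names=("first", "second", "third")) -> list[str]:
    """Return charset lines for the three codon positions of an nchar-long alignment."""
    # Assumption: base 1 is a first codon position.
    return [
        f"charset {name} = {start}-{nchar}\\3;"
        for start, name in enumerate(names, start=1)
    ]


def expand_charset(start: int, stop: int, step: int) -> list[int]:
    """List the character numbers covered by 'start-stop\\step'."""
    return list(range(start, stop + 1, step))


for line in codon_charsets(36):
    print(line)
print(expand_charset(3, 36, 3))
```

Listing the expanded positions is a quick way to confirm that a set such as 'third' covers the bases intended before excluding it in the PAUP block.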




This page is maintained by Dave Carmean with an eye towards speed and clarity, and was last modified 13 May 1996. Comments or suggestions are welcome!

