A tutorial for Just_Annotate_My_genome using Drosophila melanogaster, a small amount (50 million pairs) of RNA-Seq and public Sanger sequences.

In modern projects you will have substantially more RNA-Seq (at least 150 million pairs, i.e. one HiSeq lane) and probably little to no Sanger/454 sequence. However, this makes the tutorial shorter and shows how even established genomes can be annotated (it is also a shameless plug for InsectaCentral).

This tutorial assumes you know how to use a (ba)sh Linux shell and can create and navigate directories. In the tutorial we don’t create any special directories, but in real-world scenarios you would. You’d also keep a log of what you’ve done (and why), noting anything that didn’t work. If it was a bug on our side, please let us know (but first ask your system administrator if you are not too comfortable with Linux). I also assume you know not to overload your server(s) in terms of CPUs, I/O (hard disk) or network bandwidth; use the Unix command top to see if you have caused such a bottleneck.
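
For example, a quick way to check whether you are saturating a machine (a minimal sketch using standard Linux tools):

uptime # load averages should stay below the number of CPU cores
top -b -n 1 | head -n 20 # one non-interactive snapshot of the busiest processes
free -g # memory usage in gigabytes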

Further, in this tutorial we are skipping a step: aligning known proteins to the genome. Because our tutorial uses Drosophila melanogaster, if we did that we would essentially be using the official annotations, so we would not be able to validate the procedure. In a real-world scenario, you would follow the procedure to align known proteins of your choice. More tips on using JAMg in real-world scenarios are scattered throughout this tutorial.

Tip At times, entries like this will appear. They offer commentary, tips or warnings. For example, many of the commands that you will run below may take hours. Considering that you will be using a server via SSH, it is advisable to use the Unix command screen to ensure that an intermittent connection error doesn’t cramp your style.
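
A minimal screen workflow (the session name "jamg" is just an example):

screen -S jamg # start a named session
# run your long command, then detach with Ctrl-a d
screen -r jamg # reattach later, e.g. after a dropped connection
screen -ls # list sessions if you forget the name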

Steps that share a number can be run in parallel:

Step 0: Configure and download example data

I’m assuming you’ve downloaded the software from somewhere and unpacked it. The next step is to store the path in an environment variable so you can copy-paste for the rest of the tutorial. We will also export any other species-specific parameters that we will use for the rest of the tutorial. You will also need to run make, which may take some time (unzipping files etc.) but ought to run unattended (any errors will cause it to stop).

Basic installation instructions:

cd $JAMG_PATH
make # will install the various software, including RepeatMasker
# Go to link:http://www.girinst.org[girinst.org] and install the repeat library in $JAMG_PATH/3rd_party/RepeatMasker
# Run configure. Use the RMBlast NCBI option when asked about an engine (the engine is in $JAMG_PATH/3rd_party/bin/)
$JAMG_PATH/3rd_party/RepeatMasker/configure

Then download the refseq_insecta_march13_just_useful.tar.bz2 HHblits database from here and uncompress it in $JAMG_PATH/databases/hhblits
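
For example, assuming the archive was downloaded to the current directory:

mkdir -p $JAMG_PATH/databases/hhblits
tar -xjf refseq_insecta_march13_just_useful.tar.bz2 -C $JAMG_PATH/databases/hhblits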

Finally, we also assume you have correctly installed PASA (and MySQL) to work with your computing environment. Prepare the configuration file:

cp $JAMG_PATH/3rd_party/PASA/pasa_conf/pasa.CONFIG.template $JAMG_PATH/3rd_party/PASA/pasa_conf/conf.txt
# Edit conf.txt and set the values for these MySQL database settings:
# MYSQLSERVER=(your mysql server name)
# MYSQL_RO_USER=(mysql read-only username)
# MYSQL_RO_PASSWORD=(mysql read-only password) - leave empty if none used
# MYSQL_RW_USER=(mysql all privileges username)
# MYSQL_RW_PASSWORD=(mysql all privileges password) - leave empty if none used
# PASA_ADMIN_DB=pasa_admin # we need something here but it is not used
# you do not need to worry about the others
vim $JAMG_PATH/3rd_party/PASA/pasa_conf/conf.txt

This is all you need to do to set up PASA these days.
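
Before moving on, you may want to verify that the credentials in conf.txt actually work (a minimal sketch; substitute your own server name and the read/write user you configured):

mysql -h your_mysql_server -u your_rw_user -p -e 'SELECT VERSION();'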

Next let us prepare for the tutorial (i.e. tutorial-specific instructions follow). First, set up some variables:

export JAMG_PATH=$HOME/software/jamg # or wherever you installed it.
export MAX_INTRON_LENGTH=70000 # maximum size of intron.
export GENOME_PATH=[full path to]/dmel-all-r5.53.fasta
export GENOME_NAME=dmel-all-r5.53 # just a base name for our genome
# number of CPUs to use for the local machine. Always at least one less than the number of available CPUs:
export LOCAL_CPUS=5
export MAX_MEMORY_G=100 # maximum amount of memory to use in Gb
export PATH=$JAMG_PATH/bin:$JAMG_PATH/3rd_party/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAMG_PATH/3rd_party/lib:$JAMG_PATH/3rd_party/lib64/

These ought to be project specific, so if you store the above in a file you can use the following command to import them into your environment. That way you don’t have to type them every time, and you will also have a record of what you used.

vim env_vars.sh # add the above data
. ./env_vars.sh # that is a dot in the beginning

Next, we prepare some data. First of all, since this is a tutorial and we are analyzing one of the better assembled/annotated genomes out there (Drosophila melanogaster, the fruitfly), we want to be able to evaluate whether what we do is correct. For that we will need the official FlyBase annotations. Sadly the GFF3 provided by FlyBase is often corrupt (e.g. CDS lines appear before a gene etc.), so we have post-processed their GFF, cleaned it and provide a GTF that will be used for our evaluation. It sits in the test_suite directory, so:

tar -xf $JAMG_PATH/test_suite/Drosophila_official_annotations_cleaned.tar melanogaster/dmel-all-no-analysis-r5.53.gff3.gff3.clean.gtf.bz2
bunzip2 melanogaster/dmel-all-no-analysis-r5.53.gff3.gff3.clean.gtf.bz2 # we will use this later during evaluation.

Now we need some input data. JAMg is specifically built for RNA-Seq type data (for example Illumina, but it doesn’t need to be NGS data; it could be Sanger too, it is just no longer cost-effective to produce it). How can we get some RNA-Seq data? We can download it from NCBI and run the following script to pre-process it:

mkdir RNAseq; cd RNAseq
# we have installed Aspera and the sra_sdk from NCBI
ascp \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023199 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023502 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023504 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023538 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023539 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR023/SRR023540 \
 .
find . -name "*sra" -exec fastq-dump --split-spot --split-files --skip-technical -F -Q 33 -W -T -R pass '{}' \;
find . -name "*sra" -delete # delete primary data if you don't need the backup
# quality pre-process:
find . -name "pass" -type d -exec $JAMG_PATH/3rd_party/justpreprocessmyreads/preprocess_illumina.pl \
 -cdna -no_screen -paired '{}/1/fastq' '{}/2/fastq' \;
# let's see what we produced:
ls -l SRR*/pass/1/fastq.trimmomatic.bz2 # All 'left' files of a pair
ls -l SRR*/pass/2/fastq.trimmomatic.bz2 # 'right' files of a pair
ls -l SRR*/pass/?/fastq.unpaired.bz2 # remnant unpaired files when the pair has been discarded
# let's rename them to something easier to use.
$JAMG_PATH/3rd_party/justpreprocessmyreads/rename_SRA_sra_fastq.pl
ls -l *.trimmomatic.bz2
bunzip2 -k *.trimmomatic.bz2 # highly recommend pbzip2 -d
Tip If you want to use the traditional Trinity RNA-Seq way of creating input files, you can do this (but we will not use it for this tutorial):
pbzip2 -dck SRR*/pass/1/fastq.trimmomatic.bz2 > drosoph_ALL.Left.fastq
pbzip2 -dck SRR*/pass/2/fastq.trimmomatic.bz2 > drosoph_ALL.Right.fastq
pbzip2 -dck SRR*/pass/?/fastq.unpaired.bz2 >> drosoph_ALL.Left.fastq # unpaired remnants can go to any file
Tip We can also use 454 and Sanger data! Then go to InsectaCentral and download a FASTA of all Drosophila melanogaster (Diptera:Drosophilidae) contigs. This is a MIRA assembly of all Sanger and 454 data that was available for D. melanogaster at the time. For the tutorial we’ll assume you saved them as D_melanogaster_ICcontigs.fsa.
Tip If you want to increase the amount of data for this tutorial (i.e. you are willing to wait much longer), you can download these data, which were used for testing JAMg. This will increase TGG’s run time by 4 days and TDN’s by about half a day.
ascp \
anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR767/SRR767611 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR767/SRR767619 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR767/SRR767621 \
 anonftp@ftp-trace.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR767/SRR767623 \
.

Next we would like to get the Drosophila genome and official annotations in order to evaluate our performance. In a real-life scenario this is not possible, as you are the one building the annotation from scratch! However, you can use this procedure to compare different types of annotations (as well as see how JAMg performs). This tutorial was built with FlyBase version 5.53. One can download the GFF from FlyBase (the 'no-analysis' file), but we found that the melanogaster file had a number of errors (e.g. CDS features before their parent genes, CDSs without any exon and, in one instance, a weakly supported gene model that we couldn’t translate to GTF). We’ve cleaned it up and made the GTF we will use for evaluation.

ls $JAMG_PATH/test_suite/Drosophila_official_annotations_cleaned.tar
tar -xf $JAMG_PATH/test_suite/Drosophila_official_annotations_cleaned.tar
bunzip2 -v $JAMG_PATH/test_suite/*bz2
ls -l dmel-X-r5.53.fasta dmel-X-r5.53.repeat.hard dmel-X-r5.53.repeat.soft melanogaster/

Step 1a: RepeatMask and identify untranscribed coding exons

In this step you will search the genome for exons that code for known protein domains. This provides evidence even in the absence of RNA-Seq data. The script will also run RepeatMasker, the output of which we will use later on.

mkdir exon_search; cd exon_search
# MPI with many hosts; localhost (morgan) uses 5 threads for repeatmasker
$JAMG_PATH/bin/prepare_domain_exon_annotation.pl -verbose -genome $GENOME_PATH \
 -repthreads $LOCAL_CPUS -engine mpi -hosts morgan:5-haldane3:12-haldane2:10-haldane1:5-haldane4:12 -mpi 44 \
 -uniprot_db $JAMG_PATH/databases/hhblits/refseq_insecta_march13_just_useful \
 -scratch /dev/shm/$USER
# OR  MPI with a single local host and 5 CPUs
$JAMG_PATH/bin/prepare_domain_exon_annotation.pl -verbose -genome $GENOME_PATH \
 -repthreads $LOCAL_CPUS -engine localmpi -mpi $LOCAL_CPUS \
 -uniprot_db $JAMG_PATH/databases/hhblits/refseq_insecta_march13_just_useful \
 -scratch /dev/shm/$USER
ls ./*hints # should be two files
# RepeatMasker is run as part of the step above. Once it finishes, run the following, as later we will need a "soft-masked" genome:
cd ../ # go back to the original directory
ln exon_search/*masked exon_search/*.out.gff . # hard-link the .masked genome and RepeatMasker output to the main directory
$JAMG_PATH/3rd_party/bin/maskFastaFromBed -soft -fi $GENOME_PATH -fo $GENOME_PATH.softmasked \
 -bed *.out.gff # this last file is the output from RepeatMasker

Purely FYI: in my local cluster environment, using MPI with 44 CPUs and the databases copied to /dev/shm (first command above), this step took 36h (36h and 17 minutes to be exact). The network speed used for MPI was a bottleneck (we only have 10 Gb Ethernet, not InfiniBand).

Step 1b: Assemble the transcriptome to create high-quality gene models

While the previous step is running, prepare TDN and TGG assemblies (Trinity de-novo and Trinity genome-guided) for the RNA-seq data.

mkdir Trinity_assemblies; cd Trinity_assemblies
$JAMG_PATH/3rd_party/trinityrnaseq/Trinity --seqType fq --min_kmer_cov 2 \
 --left ../*1_fastq.trimmomatic ../*_unpaired_fastq.trimmomatic --right ../*_2_fastq.trimmomatic \
 --output TDN --JM "$MAX_MEMORY_G"G --CPU $LOCAL_CPUS --full_cleanup |& tee tdn.log # expected output is a Trinity.fasta
# the above will take some time. The first step is unzipping all the input files before JellyFish starts running
# In parallel prepare for TGG. First align the reads to the genome:
mkdir TGG; cd TGG
$JAMG_PATH/bin/align_rnaseq_gsnap.pl -fasta $GENOME_PATH -dbname $GENOME_NAME -cpus $LOCAL_CPUS \
-gmap_dir . -nofail -suffix -input_dir ../../ |& tee tgg.log

In the above we have prepared the input for TDN and TGG (Trinity de-novo and Trinity genome-guided). For this tutorial we use --min_kmer_cov 2 for TDN because it saves time and resources, but in a real-world genome annotation don’t use it unless you have to. For TGG we use -suffix because we don’t expect any substantial polymorphism in Drosophila melanogaster. If you have a polymorphic species then don’t use -suffix (though in my tests using -suffix made the search slower rather than faster, the opposite of what is expected). Once the alignment of the RNA-Seq is complete we can continue with the TGG process. Some reads will map to multiple locations; we know that Drosophila is well assembled, so these RNA-Seq reads are almost certainly repeats, and for this tutorial we will not use them. The overall process above will take about 16-24h if you’re running both in parallel and using 5 CPUs for TDN and 10 CPUs for TGG.

Caution In NGS-derived assemblies it is not uncommon to have haplotype scaffolds (see the Heliconius genome paper). For that reason we would keep them but decrease the -path_number option of align_rnaseq_gsnap from the default of 50 to something that is expected for your assembly (e.g. 4), as sketched below. See the files TGG/*.concordant_mult_xs for read pairs that map to more than -path_number paths.
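
A sketch of the same alignment call with a lowered path limit (assuming the script accepts -path_number as named in the caution above; check its help output first):

$JAMG_PATH/bin/align_rnaseq_gsnap.pl -fasta $GENOME_PATH -dbname $GENOME_NAME -cpus $LOCAL_CPUS \
 -gmap_dir . -nofail -suffix -path_number 4 -input_dir ../../
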
Tip For parallelization on computing clusters, you can use the -commands_only option to create a text file with one line of commands per input. You can then use the Unix command split or ParaFly to run it on a cluster (see the sketch below). I find that this GSNAP step is the slowest in the entire procedure (1 day per 50 million pairs). By splitting the sequence files, we can leverage HPC architectures to get through the data in a day or so.
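
For example (a sketch, assuming -commands_only emits the command list on standard output; if it writes a file directly, split that file instead; the chunk size is arbitrary):

$JAMG_PATH/bin/align_rnaseq_gsnap.pl -fasta $GENOME_PATH -dbname $GENOME_NAME -cpus $LOCAL_CPUS \
 -gmap_dir . -nofail -suffix -input_dir ../../ -commands_only > gsnap.commands
split -l 2 -d gsnap.commands gsnap.commands.chunk. # two commands per chunk, one chunk per cluster node
# or run them on a single machine, a few at a time:
ParaFly -c gsnap.commands -CPU 2 -v -failed_cmds gsnap.commands.failed
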
# grab all the outputs for Drosophila
cd TGG
# prepare files for TGN, splitting them to those that will take a very long time/resources, medium and very short
$JAMG_PATH/bin/prepare_trinity_genome_assembly_pbs.pl -files ./*.concordant_uniq.bam -intron $MAX_INTRON_LENGTH
ls ./*.cmds # what needs to be run.
# We can start assembling the TGG data:
# In our tutorial, only small_trinity_GG.cmds will be created. Run it.
ParaFly -CPU $LOCAL_CPUS -v -c small_trinity_GG.cmds -failed_cmds small_trinity_GG.cmds.failed
# Instead, you can use the small_trinity_GG.cmds.000 and small_trinity_GG.cmds.001
# and load them on two separate machines (or cluster). These were produced using the unix command split
Caution Even though TGG is very fast, Trinity itself is rather demanding on I/O (hard disk/network reads and writes), especially when you are running multiple Trinity runs in parallel. Decreasing the number of CPUs / parallel runs may actually complete faster; use the Unix command top to see if many of your commands get stuck in the D (uninterruptible disk wait) state instead of R (running), as in the snippet after the note below.
Note If you had used -files ./*uniq_mult.bam rather than ./*.concordant_uniq.bam, then a medium_trinity_GG.cmds would have been created too. Because Drosophila is well assembled and the reads are high quality, these are likely to be repeats. More rarely, a large_trinity_GG.cmds may also exist, especially in very large RNA-Seq projects. These are probably repeats or very highly expressed genes; they can take days to complete and their value is debatable. I recommend you use Trinity’s kmer data-reduction algorithm; currently this has to be done manually.
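
One quick way to spot the I/O-bound processes mentioned in the caution above (standard Linux tools):

ps -eo state,pid,cmd | awk '$1=="D"' # list processes stuck in uninterruptible disk wait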

While this ParaFly procedure is running, we can post-process the alignments to create RNA-Seq coverage data for Augustus (it will take considerable time):

# RNASeq_TGG_input.bam is from prepare_trinity_genome_assembly_pbs.pl above
$JAMG_PATH/bin/augustus_RNAseq_hints.pl -bam RNASeq_TGG_input.bam -genome $GENOME_PATH

Once TGG and TDN are complete, we can integrate TGG and TDN using PASA2.

find TGG/Dir_* -name "*inity.fasta" | $JAMG_PATH/3rd_party/trinityrnaseq/util/support_scripts/GG_trinity_accession_incrementer.pl > Trinity_GG.fasta
#p.s With Trinity_GG.fasta in your base directory, you can safely delete the TGG directory now
cat TDN/Trinity.fasta Trinity_GG.fasta > transcripts.fasta
cat TDN/Trinity.fasta | $JAMG_PATH/3rd_party/PASA/misc_utilities/accession_extractor.pl > tdn.accs
# prepare a PASA assembly configuration (separate from the PASA-wide configuration you did in the beginning)
cp $JAMG_PATH/3rd_party/PASA/pasa_conf/pasa.alignAssembly.Template.txt alignAssembly.config
# Edit the alignAssembly.config and give the database a unique name, set the following:
# MYSQLDB=jamg_drosie_tutorial
$JAMG_PATH/3rd_party/bin/seqclean transcripts.fasta -c $LOCAL_CPUS -n 10000
# first use -x to check everything is OK and to create a list of the commands that PASA will run:
$JAMG_PATH/3rd_party/PASA/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R \
 -g $GENOME_PATH --MAX_INTRON_LENGTH $MAX_INTRON_LENGTH \
 --ALIGNERS blat,gmap --TRANSDECODER --CPU $LOCAL_CPUS \
 -T -t transcripts.fasta.clean -u transcripts.fasta \
 --TDN tdn.accs -x > pasa.alignAssembly.commands.to.run
# Now run it.
$JAMG_PATH/3rd_party/PASA/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R \
 -g $GENOME_PATH --MAX_INTRON_LENGTH $MAX_INTRON_LENGTH \
 --ALIGNERS blat,gmap --TRANSDECODER --CPU $LOCAL_CPUS \
 -T -t transcripts.fasta.clean -u transcripts.fasta \
 --TDN tdn.accs |& tee pasa.log
ls jamg_drosie_tutorial.assemblies.fasta jamg_drosie_tutorial.pasa_assemblies*
# Find transcripts that did not make it to the genome
$JAMG_PATH/3rd_party/PASA/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta.clean

This will take some time, about one day, with the two alignment steps (blat and gmap) taking about 5-7 hours. For very large data or mission-critical scenarios, we can run the alignment steps separately on a cluster and use the -s and -e options to determine which of the steps shown in pasa.alignAssembly.commands.to.run will be run on which computer or cluster. The tee command copies the screen output to a file, and |& sends any errors to the same file too.
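
In plain shell terms (shown with a placeholder command):

some_long_command |& tee run.log # bash's '|&' pipes stderr together with stdout
some_long_command 2>&1 | tee run.log # equivalent long form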

Tip If you’ve never run PASA on this system before, it makes sense to try the first command produced in pasa.alignAssembly.commands.to.run; it attempts to connect and create the database. If it doesn’t work then your MySQL settings are wrong. A common error is Access denied for user user_demo@localhost when the user already exists; granting the relevant privileges can solve it:
mysql -u root
CREATE USER 'user_demo'@'localhost' IDENTIFIED BY 'pass13';
# OR for without a password, skip the IDENTIFIED BY part.
GRANT SELECT,INSERT,UPDATE,DELETE ON *.* TO 'user_demo'@'localhost';

Step 1c: run de-novo predictors that require no training

There are some predictors that require no user-supplied training at all.

GeneMarkES is one such example. Note that you have to install it yourself due to licensing restrictions (GeneMark website):

$JAMG_PATH/3rd_party/genemark/gm_es_bp_linux64_v2.3e/gmes/gm_es.pl $GENOME_PATH.masked |tee genemark.log

GeneMark will take some time, about overnight. Note that we used the masked version of our genome. Always use a masked version unless you’re using Augustus (for which we will specify the repeat co-ordinates separately).

Another tool (under development) is Gavin Huttley’s projection approach. This approach takes a well annotated genome and projects its gene models to your un-annotated genome. We will not use it for this tutorial but see the procedure on how to use it.

Step 2a: Acquire a golden sub-set of gene models

For phase 2, we assume you have completed the PASA step.

We need to identify some gene models that are complete and of very high quality. These can be used downstream to train our de-novo predictors. Traditionally fewer than 100 genes have been used, but this was a limitation of data availability. Here we can identify thousands of such golden models, though we will only use a subset: some we will keep for validating the output.

Now we need to create a golden gene set for training. First, PASA has a script that uses TransDecoder to convert the PASA output to CDS-aware genome co-ordinates. As we will show later, there is a significant number of not-so-golden genes in that set (sensitivity is very low). So with JAMg we can use prepare_golden_genes_for_predictors.pl to acquire a subset of good gene models, informed by a splice-aware aligner such as exonerate.

$JAMG_PATH/3rd_party/PASA/scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta ./*.assemblies.fasta \
 --pasa_transcripts_gff3 ./*.pasa_assemblies.gff3
$JAMG_PATH/bin/prepare_golden_genes_for_predictors.pl -genome $GENOME_PATH.masked -softmasked $GENOME_PATH.softmasked \
 -same_species -intron $MAX_INTRON_LENGTH -cpu $LOCAL_CPUS -norefine -complete -no_single \
 -pasa_gff ./*.assemblies.fasta.transdecoder.gff3 \
 -pasa_peptides ./*.assemblies.fasta.transdecoder.pep \
 -pasa_cds ./*.assemblies.fasta.transdecoder.cds \
 -pasa_genome ./*.assemblies.fasta.transdecoder.genome.gff3 \
 -pasa_assembly ./*.assemblies.fasta
ls -l *golden.gff3

Evaluate [todo]

$JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_gtf_format.pl jamg_drosie_tutorial.assemblies.fasta.transdecoder.genome.gff3 $GENOME_PATH > jamg_drosie_tutorial.assemblies.fasta.transdecoder.genome.gtf
$JAMG_PATH/3rd_party/eval-2.2.8/evaluate_gtf.pl -g dmel-all-no-analysis-r5.53.gff3.gff3.clean.gtf jamg_drosie_tutorial.assemblies.fasta.transdecoder.genome.gtf

Step 2b: Train and run de-novo predictors that need no evidence

Some predictors, like SNAP and GlimmerHMM, can optionally use evidence, but (GlimmerHMM at least) take longer when doing so, and in a small test I did the results were no better than without the additional weights. Regardless, we first need to train them using our golden gene sets from above.

First, training and running SNAP is very easy now that we have the ZFF files:

# SNAP - specific instructions
mkdir snap; cd snap
#train
mkdir train ; cd train
# also copy/link the relevant .fasta .zff data used below
ln -s ../../*zff* ../../*golden.train.gff3.fasta .
$JAMG_PATH/3rd_party/bin/fathom ./*golden.train.zff ./*golden.train.gff3.fasta -gene-stats 2>snap_warnings1.log | tee gene.statistics.log
$JAMG_PATH/3rd_party/bin/fathom ./*golden.train.zff ./*golden.train.gff3.fasta -categorize 1000 2> snap_warnings2.log
$JAMG_PATH/3rd_party/bin/fathom -export 1000 -plus uni.ann uni.dna
$JAMG_PATH/3rd_party/snap/forge export.ann export.dna
$JAMG_PATH/3rd_party/snap/hmm-assembler.pl $GENOME_NAME . > $GENOME_NAME.hmm # model to use to predict
#predict
cd ../ ; mkdir predict; cd predict
# create a directory where each genome sequence is in a single file. Use the softmasked repeats
ln -s $GENOME_PATH.softmasked $GENOME_NAME.softmasked
$JAMG_PATH/bin/splitfasta.pl -i $GENOME_NAME.softmasked
# prepare execution for each genome sequence
find $GENOME_NAME.softmasked_dir1 -maxdepth 1 -type f -exec sh -c \
 'echo "$JAMG_PATH/3rd_party/snap/snap ../train/$GENOME_NAME.hmm $1 -lcmask -quiet > $1.snap 2>/dev/null ; \
  $JAMG_PATH/3rd_party/evidencemodeler/OtherGeneFinderTrainingGuide/SNAP/SNAP_output_to_gff3.pl $1.snap $1 > $1.snap.gff3 2>/dev/null ; \
  $JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_gtf_format.pl $1.snap.gff3 $1 > $1.snap.gtf 2>/dev/null "' \
  find-copy '{}' \; > snap.commands
ParaFly -c snap.commands -CPU $LOCAL_CPUS -v -shuffle
cat $GENOME_NAME.softmasked_dir1/*snap.gtf > snap.gtf
cd ../../

SNAP is very fast, but for Drosophila this will still take some time (5-10 minutes) because our genome is split into whole chromosomes, i.e. only 15 commands, each of which takes a while. In most NGS projects you will have thousands (or at least dozens) of scaffolds, so each one will be rather quick. By the way, if you want to use external evidence, please check the SNAP external-evidence documentation in the procedure.

Next we process GlimmerHMM.

mkdir -p glimmer/train; cd glimmer/train
ln -s ../../*train.good.gb.fasta ../../*train.good.gb.glimmer .
$JAMG_PATH/3rd_party/GlimmerHMM/train/trainGlimmerHMM \
 ./*train.good.gb.fasta ./*train.good.gb.glimmer \
 -d attempt1 >/dev/null
cd ../

Step 2c: Prepare evidence for Augustus and train

Warning Note to readers: due to time constraints, the rest of the tutorial below has not been edited to use the correct filenames from the steps above. This is a preliminary write-up and will move into the procedure documentation once I’ve run the tutorial. So be careful about directories and file paths; copy-pasting is unlikely to work.
# collate data from the output of predict_golden
mkdir augustus; cd augustus;
ln -s [path from trinity genome guided]/gsnap.coordSorted.bam.rnaseq.hints . # from when you ran $JAMG_PATH/bin/augustus_RNAseq_hints.pl at the Trinity Genome Guided above.
ln -s [path from prepare_golden_genes] final_golden_genes.gff3.nr.golden.*.good.gb* .

In this section we have the slightly complicated task of preparing a configuration file that controls how the external evidence (e.g. RNA-Seq) is weighted. Even though eventually we will run Augustus on our genome, to create the configuration we must use a test dataset for which we know the truth. This is our golden gene set, which we have split into training, test and optimization (a small subset of training) sets. In order to get the configuration file we must create evidence tracks for our RNA-Seq (etc.) against these datasets, so we need to rerun our alignments:

# now we need to prepare some files for optimizing the extrinsic evidence scores. Specifically we need to recreate the hints for use with our 'test data' derived from the genbank files.
mkdir optimization; cd optimization
cat ../final_golden_genes.gff3.nr.golden.optimization.good.gb.fasta ../final_golden_genes.gff3.nr.golden.train.good.gb.fasta > complete_reference_training_set.fasta
# prepare hints for training set reference: RNA-Seq and repeatmasking/exons
mkdir rnaseqalign; cd rnaseqalign
ln ../complete_reference_training_set.fasta .
# symlink the RNA-Seq raw data here.
$JAMG_PATH/bin/align_rnaseq_gsnap.pl -fasta complete_reference_training_set.fasta -gmap_dir . -suffix -dbname reference_training -pattern1 _1_ -pattern2 _2_ -nofail
$JAMG_PATH/bin/augustus_RNAseq_hints.pl -bam *uniq_mult.bam -genome complete_reference_training_set.fasta
cd ..
# domains and repeats
mkdir repeats ; cd repeats
ln ../complete_reference_training_set.fasta .
$JAMG_PATH/bin/prepare_domain_exon_annotation.pl -fasta complete_reference_training_set.fasta -engine localmpi -repthreads $LOCAL_CPUS -mpi_cpus $LOCAL_CPUS
$JAMG_PATH/3rd_party/bin/maskFastaFromBed -soft -fi complete_reference_training_set.fasta -fo complete_reference_training_set.fasta.softmasked  -bed complete_reference_training_set.fasta.out.gff
cd ..
# Foreign proteins
mkdir foreign; cd foreign
ln ../repeats/complete_reference_training_set.fasta* .
$JAMG_PATH/bin/prepare_golden_genes_for_predictors.pl -identical 40 -similar 70 -mismatch_cutoff 100 -genome complete_reference_training_set.fasta.masked -softmasked complete_reference_training_set.fasta.softmasked -peptides foreign_proteins.nr90  -intron $MAX_INTRON_LENGTH -cpu $LOCAL_CPUS -norefine
cd ..

Besides preparing the above evidence, we also want to create our HMM:

# the following will train the coding sequence models. The --metapars must point to a valid configuration which you're welcome to change if you want - here I used a rather broad parameter search.
$JAMG_PATH/bin/optimize_augustus.pl --species $SPECIES --optimize ../final_golden_genes.gff3.nr.golden.optimization.good.gb --onlytrain ../final_golden_genes.gff3.nr.golden.train.good.gb \
 --cpus $LOCAL_CPUS --kfold $LOCAL_CPUS --metapars $JAMG_PATH/configs/metaparameters_broad.cfg | tee opt_cds.log
# Once the above finishes, now also train the UTR
$JAMG_PATH/bin/optimize_augustus.pl --species $SPECIES --optimize ../final_golden_genes.gff3.nr.golden.optimization.good.gb --onlytrain ../final_golden_genes.gff3.nr.golden.train.good.gb \
 --cpus $LOCAL_CPUS --kfold $LOCAL_CPUS --onlyutr --metapars $JAMG_PATH/configs/metaparameters_utr_broad.cfg | tee opt_utr.log
Tip Each of the above commands can be run in parallel (with the exception of the last optimize_augustus.pl -onlyutr command above which must be run after the other optimize_augustus.pl command).

We have now prepared all our evidence for creating the configuration, the HMM parameters have been set and our HMM has been trained. Next we prepare our evidence configuration file, which will determine how much each type of evidence (e.g. RNA-Seq, RepeatMasker etc.) affects the prediction of each type of feature (exon, intron etc.):

cat complete_reference_training_set.fasta.repeatmasked.hints foreign/final_golden_genes.gff3.nr.hints hhr.complete_reference_training_set.fasta.masked.exons.aa.trim.db.transposon.db.results.hints hhr.complete_reference_training_set.fasta.masked.exons.aa.trim.norep.db.uniprot.db.results.hints  rnaseqalign/master_bamfile.bam.rnaseq.hints | sort -n -k 4,4 | sort -s -n -k 5,5 | sort -s -n -k 3,3 | sort -s -k 1,1 > complete_reference_training_set.fasta.all.hints
cp $JAMG_PATH/configs/extrinsic_meta.cfg .
# We need to edit the above file to remove any src entries that are not present in our .hints:
grep -oP 'src=[^;]+' complete_reference_training_set.fasta.all.hints |sort -u
# edit the extrinsic_meta.cfg and remove anything not present in the above grep command
vim extrinsic_meta.cfg
# now run the optimisation, note the -utr option requires the above --onlyutr to have completed successfully (otherwise remove -utr).
$JAMG_PATH/bin/optimize_extrinsic_augustus.pl --species $SPECIES -opt ../final_golden_genes.gff3.nr.golden.optimization.good.gb --train ../final_golden_genes.gff3.nr.golden.train.good.gb \
 -cpu $LOCAL_CPUS -notrain -hint complete_reference_training_set.fasta.all.hints -utr -extr ./extrinsic_meta.cfg | tee opt_extr.log

Check the opt_extr.log file for instructions on which configuration file to pick and copy it as your new configuration, e.g. extrinsic_chosen.cfg. You’re almost done!
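
Something along these lines (the chosen filename below is hypothetical; use the actual name reported in the log):

tail opt_extr.log # the log reports which extrinsic configuration scored best
cp extrinsic_meta.cfg.best_example extrinsic_chosen.cfg # hypothetical filename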

Finally, run Augustus. We have a helper script that allows you to run each scaffold independently:

mkdir genome_run; cd genome_run
# gather all the hints produced using the $GENOME_PATH reference
# ALEXIE NOTE: document this: util/hhblits/parse_hhsearch2hints.pl -in $GENOME_PATH.masked.exons.aa.trim.trim.db.uniprot.all.db -is_from_getorf
cat [your hint files]  | sort -n -k 4,4 | sort -s -n -k 5,5 | sort -s -n -k 3,3 | sort -s -k 1,1 > genome_augustus.all.hints
# ALEXIE NOTE: check and document this note: "final configuration: we are not optimizing GLD, give it the same as XNT"
# check which sources we have
grep -oP 'src=[^;]+' genome_augustus.all.hints |sort -u
# sed -i -e 's/GLD/XNT/' genome_augustus.all.hints
# Now we can split our genome for each scaffold and run augustus
cp $JAMG_PATH/bin/run_split_augustus.cmd  .
# edit ./run_split_augustus.cmd and make sure the top two variables are correct
#   MAX_CONTIGS=9999 # NB if there are more than 9999 scaffolds, change this number
#   PREFIX_GENOME="scaffold_" # after the prefix a number is expected, 0 to $MAX_CONTIGS
# Follow the instructions, making sure you export the AUGUSTUS_CONFIG variable in your compute nodes.
./run_split_augustus.cmd $GENOME_PATH genome_augustus.all.hints extrinsic_chosen.cfg
Tip Some notes: feel free to create your own metaparameters_broad.cfg, metaparameters_utr_broad.cfg and extrinsic_meta.cfg. With UTR training there is a chance that Augustus will crash; see the notes within $JAMG_PATH/configs/metaparameters_utr_broad.cfg.
Tip If you have spare time or computing power you can also try running optimize_extrinsic_augustus.pl without the -utr option. You can then compare the two types of prediction, but make sure you don’t rely on the Accuracy value, as we estimate it differently with and without UTR; look at the individual Sensitivity and Specificity values instead.

Step 3: Run Augustus

For phase 3, we assume you have completed all of the previous steps (except perhaps running the other de-novo predictors). First let’s collate all the data and then run it. It’s actually pretty straightforward if you are using a single machine or know how to distribute a file of commands to your HPC. Simply:

cat exon_search/*repeatmasked.hints gsnap.coordSorted.bam.rnaseq.hints foreign/*golden*hints > all_hints.augustus # or any other hints you may have

Step 4: Integrate with EVidenceModeler and add RNA-Seq supported UTRs

Phase 4 requires all of the previous phases to have been completed.

mkdir evidence_modeller; cd evidence_modeller
export PREDGFF3=abinitio_gene_predictions.gff3
export PROTGFF3=protein_alignments.gff3
export TRANSGFF3=transcript_alignments.gff3
export WEIGHT=${PWD}/evm_weights.txt
# prepare the weights file and optionally edit it with the weights you think are appropriate. For example, if processing fungi, increase GeneMark's weight since it performs rather better there
cp $JAMG_PATH/configs/evm_weights.txt .
vim evm_weights.txt
# split genome and prepare commands
$JAMG_PATH/3rd_party/evidencemodeler/EvmUtils/partition_EVM_inputs.pl --genome $GENOME --gene_predictions $PREDGFF3 --protein_alignments $PROTGFF3 --transcript_alignments $TRANSGFF3 --pasaTerminalExons $PASATERM --segmentSize 50000000 --overlapSize 10000 --partition_listing partitions_list.out
$JAMG_PATH/3rd_party/evidencemodeler/EvmUtils/write_EVM_commands.pl --genome $GENOME --weights $WEIGHT --gene_predictions $PREDGFF3 --protein_alignments $PROTGFF3 --transcript_alignments $TRANSGFF3 --output_file_name evm.out --partitions partitions_list.out > commands.list
# run commands
$JAMG_PATH/3rd_party/bin/ParaFly -shuffle -v -CPU $LOCAL_CPUS -c commands.list -failed_cmds commands.list.failed
# assuming that nothing failed from ParaFly
$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out --genome $GENOME
# delete anything that exists from before and combine all outputs.
rm -f $GENOME.evm.gff3
find . -name evm.out.gff3 -exec cat '{}' \; >> $GENOME.evm.gff3

Step 5a: Functional annotation with JAMp

Phase 5 comes once you are happy with your annotation and would now like to manually curate it.

todo

Step 5b: Deploy WebApollo

blah