Workflow
Overview
JAMg works by generating and then combining different lines of evidence to construct one comprehensive gene set. You manually curate this gene set and over time you can incrementally improve it by running some of the aspects of the workflow. The core components are:
-
Augustus for de-novo gene prediction (guided by evidence); also finds some alternative transcripts
-
PASA2 for high quality gene prediction using transcriptomes (Sanger, 454, Illumina etc). Used to train Augustus above
-
RepeatMasker to identify some repeats.
-
We recommend you install RMBlast as it is the easiest to use (and free). You will also need the TRF program as they recommend on the above website.
-
A good repeat library is important so make sure you have a registration to RepBase.
-
-
HHblits to comprehensively identify transposable elements and other nasties that wreck havoc in annotation
-
also used to identify protein domains and inform Augustus that a region is coding
-
-
If you have a closely related species, then you can use a new projection algoritm from Huttley et al (in prep)
-
Any other gene predictor or other lines of evidence; we support:
-
SNAP
-
geneid
-
GlimmerHMM
-
GENEMARK-ES
-
MAKER (perceived as our competitor but well, why not)
-
Something else we are not aware of (email us but basically if you can produce a GFF/GTF then you’re good)
-
Then:
-
EvidenceModeller that brings the above together before another round of PASA2 post-processing
-
WebApollo on your webserver to view all of the above lines of evidence and manually curate the genes
-
JAMp to functionally annotate your predicted proteins and (in near future) integrate any downstream experiments
Preparing for the workflow
Installing pre-requisites
For the rest of the instructions, I’ll assume your genome FASTA file is genome.fasta and the scripts are in an environmental variable called JAMG_PATH.
export JAMG_PATH=$HOME/software/jamg export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAMG_PATH/3rd_party/lib:$JAMG_PATH/3rd_party/lib64/ ls -l $JAMG # try it
First you need to install the required software. PASA might be slightly challenging if you have never used mySQL before. We distribute most of the software you will need under 3rd_party (when licensed to do so). To automatically install things, use this make command.
cd $JAMG_PATH make all # will install all utilities
You will then still need to install blat, Apache/Tomcat and WebApollo (see 3rd_party/WebApollo/README for instructions on the later)
Finally, read $JAMG_PATH/databases/hhblits/README and download a database for your taxon (or UniProt which may be an overkill but your only choice if your taxon is not represented in the public databases).
For the rest of this document, I use some environmental variables as shortcuts as above for $JAMG_PATH. For example, LOCAL_CPUS is the number of CPUs to use, MAX_INTRON_LENGTH is the maximum length an intron is expected to be, GENOME_PATH is the path to the genome assembly and GENOME_NAME is just a friendly name for the genome (without any spaces). As shown in the tutorial, you can set your own and it is a handy trick to store them in a file so you remember what were those values.
Input files
-
Reference: The main input to JAMg is, simply put, a well-assembled genome, in upper-case FASTA format, without any (soft)masking. Then you will benefit from some RNA-Seq. Finally, you will need a high performance computing (HPC) environment, preferably with MPI enabled (a standard in all modern HPCs). Your main server (at least) should also have sufficient disk space for processing the RNA-Seq data. Trinity RNA-Seq doesn’t require vast amounts of RAM, especially if you choose to use the kmer-reduction algorithm.
-
Evidence: If you have not generated any transcriptomes, read Sequencing data below. For this document, I will assume you have not undertaken any transcriptome assembly (with Trinity or other) yet. We will run it here. If you have already performed a de-novo one, then you will still to perform a genome-guided one and keep the alignment .bam files. Otherwise, you will still require the raw RNA-Seq. If you have all that, then feel free to skip those steps. You may also have non-RNA-seq transcriptome data, such as 454, Sanger or other full length sequences. Use MIRA to make a de-novo assembly of them if you haven’t already assembled them. Don’t use raw 454 data with PASA. If you also have RNA-Seq, you could provide the Illumina data to MIRA if you are brave but providing any Trinity assemblies would be better (it will allow the 454 to be better assembled). If you don’t that’s fine, PASA will integrate them using the genome as a reference. Also, there is a script download_my_sequences.pl that you can use to download any public data found in GenBank for your species of interest but it is not very well tested.
$JAMG_PATH/bin/download_my_sequences.pl -t Insecta -t fasta -m protein #=> downloads Insecta proteins $JAMG_PATH/bin/download_my_sequences.pl -s "Drosophila melanogaster" -m cDNA -f fasta #=> Downloads all D. melanogaster cDNA sequences
So to recap you need:
-
one genome sequence in FASTA format
-
some raw transcriptome data (RNA-Seq preferred, see Sequencing data below). Diversity is more important than quantity
-
any public sequences for your species or other Sanger/454 data you may have
-
If it is 454 then assemble them with MIRA first.
-
If you have Illumina and 454, run a de-novo Trinity assembly with the Illumina data and a MIRA assembly with the Trinity contigs and the 454 raw data.
-
-
a high-performance computing (HPC) environment and knowing how to use it.
-
this software installed on a server that can access the HPC
If your genome has a lot of small scaffolds, you can use the following command to keep scaffolds above a certain size:
$JAMG_PATH/bin/trim_fasta_all.pl -i $GENOME_FASTA -out $GENOME_FASTA.10000 -length 9999
Familiarize with the software, read some literature and being aware of limitations of de-novo prediction
First, I suggest you try out the software that we didn’t produce but you will be using. Learn what RepeatMasker (and RepBase) or HHblits is for… If you’re having issues installing or running them, then please let the respective authors know, it’s unlikely we can help.
Also, I recommend you read a couple of papers. You may want to learn a bit of the theory if you thus inclined but for the average bioinformatician, these two papers will be more than sufficient (if you read them you may know more than the average person running these software).
-
Yandell, M. & Ence, D., 2012. A beginner’s guide to eukaryotic genome annotation. Nature reviews. Genetics, 13(5), pp.329–42. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22510764
-
Haas, B., Zeng, Q. & Pearson, M., 2011. Approaches to fungal genome annotation. Mycology, 2(3), pp.118–141. Available at: http://www.tandfonline.com/doi/abs/10.1080/21501203.2011.606851
Also you may want to practice with our tutorial before proceeding. That way you will know if you’re doing something wrong, if the software is not behaving as it should (i.e. a bug) or there is something peculiar about your data. It would not be unlikely if your HPC environment and our software are not compatible, in that case ask you system administrator to let us know.
Annotation, step by step
You may follow any of the following steps in any order, at times you can even accomplish them in parallel. See the tutorial for inspiration. Leave Augustus for the end, just before EvidenceModeller.
-
Exon identification: Using your genome FASTA, run the script prepare_domain_exon_annotation.pl. This script will run RepeatMasker on your genome, and explore if any ORF is coding for a protein. It does this by first extracting all putative ORFs that have enough amino acids (stretches of Ns, as in gaps, will be translated to X. We don’t like those…). Then for each putative ORF it will search against a transposon database and then against a database of known proteins.
If you have already run RepeatMasker that is ok, make sure that a file that is called $GENOME_PATH.masked is in the same directory as $GENOME_PATH. It will continue with the ORF exploration. You can choose which known protein database to use after the transposons. It can be the entire Uniprot distributed with HHblits or one of the taxon-specific databases we provide from RefSeq. These databases are in the folder databases/hhblits/. This script can make use of MPI so that if you have a computing PC-Farm (i.e. no batch system) you can do this:
$JAMG_PATH/bin/prepare_domain_exon_annotation.pl -genome $GENOME_PATH -verbose \ -uniprot_db $JAMG_PATH/databases/hhblits/refseq_insecta_march13_just_useful \ -trans $JAMG_PATH/databases/hhblits/transposons \ -engine mpi -hosts morgan:5-haldane3:12-haldane2:10-haldane1:5-haldane4:12 -mpi 44 \ -repthreads $LOCAL_CPUS -mpi_cpus $LOCAL_CPUS \ -scratch /dev/shm/$USER # RepeatMasker is going to be run above. Once finished, run this as later we will need a "soft-masked" genome: $JAMG_PATH/3rd_party/bin/maskFastaFromBed -soft -fi $GENOME_PATH -fo $GENOME_PATH.softmasked \ -bed $GENOME_PATH.out.gff # this last file is the output from RepeatMasker
What you specify as a database for -uniprot_db or -trans is the full path and basename of the database (i.e. there is no file $JAMG_PATH/databases/hhblits/transposons but there are files such as transposons.a3m_db, transposons.cs219 etc). The last option -scratch, tells the program to copy all the database files to every node’s local memory. You can use any local directory (/tmp/$USER or a scratch) but be careful you have enough space (and memory). Remember that /dev/shm and some /tmp use the computer’s local memory (not hard disk). That’s very fast but it will use RAM. Our computers have 48Gb of RAM each and that is far more than needed (depending on database size, estimate 1-5Gb per MPI process). Not including this option means that the databases will be read over the network. That’s fine if your network connection is fast, unsaturated and the databases are small. Otherwise, decrease the number of processes, find another computing environment or use a smaller database.
-engine option has a number of possible options. We’ve tested mpi and localmpi and routinely use PBS. The cluster option splits the input into segments and produces command files for you to run (we haven’t tested it). See MPI help. Once prepare_domain_exon_annotation.pl is complete, you can provide the .hint files to Augustus (eventually).
-
RNA-Seq processing: essentially you will be following the process outline here. Briefly:
-
Choose the maximum intron expected in your species (in base pairs). For the rest of these instructions, we will store in the env. variable $MAX_INTRON_LENGTH:
export MAX_INTRON_LENGTH=70000 export LOCAL_CPUS=4 # example number of CPUs to use
-
Do some mild trimming of your sequences, see 3rd_party/preprocess_reads (you can use the -noscreen option to improve speed).
-
Prepare Trinity RNA-Seq de-novo assemblies (a.k.a. TDN) with all the data concatanated (separately for -left and -right for paired end; any additional single end reads can be concatanated to the -left).
-
Prepare Trinity RNA-Seq genome-guided assemblies (a.k.a. TGG) with the same input data.
-
Make sure you keep the aligment .bam files. We will use them down the line.
-
If you are assembling transcripts from microbial genomes, make sure you use the --jaccard_clip option.
-
If you annotating a large eukaryotic genome (e.g. mouse), feel free to use Cufflinks as well but use gsnap as an aligner, not Tophat. If your genome is compact (e.g. Drosophila, microbes), just don’t…
-
We have two scripts if you have a lot of data (e.g. a dozen lanes of HiSeq) but there is no benefit learning them if you only have a few Gb of data or are not in a hurry:
-
bin/prepare_trinity_genome_assembly_pbs.pl prepares everything you need for a TGG assembly. It splits the data into small, medium and large jobs so that all the small run together. Otherwise a single large job will delay the entire processing, only to find out that you’re assembling a highly expressed retrotransposon.
-
bin/align_rnaseq_gsnap.pl automatically run against all files that match a pattern for left and right so that you don’t have to do it manually. In other words, it is for advanced users with lots of data.
There is little benefit using more than 6 threads for each GSNAP (it’s actually making it slower). Also it can take > 2 days to align 50 M reads but depends on the quality of the data. For high-throughput I recommend you split your input data into chunks and align in parallel. For parallelization with computing clusters, you can use the -commands_only option and create a text file that has one line worth of commands for each input. You can then use the unix command split or ParaFly to run it on a cluster. -
also $JAMG_PATH/bin/JAMG_TGG_cmds.pl can be used to (re)run batches of Trinity-guided assemblies on a PBS cluster.
-
-
This is the last command you will need for the genome-guided part:
# store what is TDN output $JAMG_PATH/3rd_party/PASA/misc_utilities/accession_extractor.pl < Trinity_denovo.fasta > tdn.accs # prepare TGG output $JAMG_PATH/bin/prepare_trinity_genome_assembly_pbs.pl -files ./*.concordant_uniq.bam -intron $MAX_INTRON_LENGTH ls *cmds # Run each one using your method of choice, e.g. ParaFly find Dir_* -name "*inity.fasta" | $JAMG_PATH/3rd_party/trinityrnaseq/util/support_scripts/GG_trinity_accession_incrementer.pl > Trinity_GG.fasta # compile TGG and TDN outputs into one file. cat Trinity_denovo.fasta Trinity_GG.fasta > transcripts.fasta
-
-
Before we continue with the assembly, we ought to prepare the RNA-Seq files for use with Augustus later on. In particular we want to know the coverage, which exons are joined together, where are the introns etc
-
First, converting the BAM alignment files of RNA-Seq to something that Augustus can appreciate and also identify the intron/exon junction reads
$JAMG_PATH/bin/augustus_RNAseq_hints.pl -bam RNASeq_TGG_input.bam -genome $GENOME_PATH # RNASeq_TGG_input.bam is from prepare_trinity_genome_assembly_pbs.pl
-
-
Now follow the PASA guidelines to assemble them as transcripts. If you are having issues installing PASA, look at the tutorial for advice.
# identify poly-a tails using SeqClean $JAMG_PATH/3rd_party/bin/seqclean transcripts.fasta -c $LOCAL_CPUS -n 10000 $JAMG_PATH/3rd_party/PASA/scripts/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g $GENOME_PATH \ --ALIGNERS blat,gmap --TRANSDECODER --CPU $LOCAL_CPUS \ -T -t transcripts.fasta.clean -u transcripts.fasta \ --TDN tdn.accs # Find transcripts that did not make it to the genome $JAMG_PATH/3rd_party/PASA/scripts/build_comprehensive_transcriptome.dbi -c alignAssembly.config -t transcripts.fasta.clean # Identify alternative splicing. This will take a very long time so I don't currently recommend it (you could launch it and leave it running but we are not currently using the output) $JAMG_PATH/3rd_party/PASA/scripts/Launch_PASA_pipeline.pl -c annotCompare.config -g $GENOME_PATH -t transcripts.fasta.clean --ALT_SPLICE
If your gene density is high and you expect transcripts from neighboring genes to often overlap in their UTR regions (e.g. fungi), you can perform more stringent clustering of alignments by adding --stringent_alignment_overlap 30.0. -
If your RNA-seq was single-stranded (used the --sslib option) then add the PASA option --transcribed_is_aligned_orient.
-
I’m not patient person, so I run the blat and gmap separately on a cluster with dozens of CPUs. You can use the -x, -s and -e options to control which steps of the pipeline to perform. We recommend this only to people who are/want to be expert as it can take sometime to get used to.
-
If you have 50 million read pairs, the entire process should be done in a day. If you have > 1 billion read pairs then the PASA step will not be that much slower (a few days) but your Trinity assembly will take a considerable time. Consider assembling by library or using the kmer normalization technique.
-
The output file of interest is the one matching *.assemblies.fasta, let us assume it is called my.assemblies.fasta from now on.
-
-
-
Gene models for training and evaluation: Identify a subset of you gene data that is of high quality (this process diverges from the standard PASA approach):
-
The standard PASA approach is to use the genome and a perl script to convert my.assemblies.fasta into gene models. This will create a lot of gene models, most of which will be not be correct. So we also need to identify a subset of really good ones (golden) that can be used for training de-novo gene prediction (including generating the different file formats these predictors expect):
# the PASA approach which will also run TransDecoder $JAMG_PATH/3rd_party/PASA/scripts/pasa_asmbls_to_training_set.dbi --pasa_transcripts_fasta ./*.assemblies.fasta \ --pasa_transcripts_gff3 ./*.pasa_assemblies.gff3
-
Once that is complete use the prepare_golden_genes_for_predictors.pl script to prepare the various files.
-
This script uses exonerate with an initial step to find the approximate regions. In this scenario the approximate regions have been identified by PASA (and pasa_asmbls_to_training_set.dbi) but the script could be used in other scenarios that have no transdecoder data (see the -peptide option - in that case BLAST or the AATPACKAGE will be used to find the approximate regions). Exonerate works better if it knows that certain regions are repetitive so we will create a soft-(repeat)masked version of your genome using bedtools.
$JAMG_PATH/bin/prepare_golden_genes_for_predictors.pl -genome $GENOME_PATH.masked -softmasked $GENOME_PATH.softmasked \ -same_species -augustus $JAMG_PATH/3rd_party/augustus/bin \ -intron $MAX_INTRON_LENGTH -cpu $LOCAL_CPUS -norefine -complete -no_single \ -pasa_gff ./*.assemblies.fasta.gff3 \ -pasa_assembly ./*.assemblies.fasta.transdecoder.gff3 \ -pasa_peptides ./*.assemblies.fasta.transdecoder.pep \ -pasa_cds ./*.assemblies.fasta.transdecoder.cds \ -pasa_genome ./*.assemblies.fasta.transdecoder.genome.gff3 \ -pasa_assembly ./*.assemblies.fasta
-
I find that -norefine is quicker and makes little difference (but as always I could be wrong). The -complete flag can also be used to ensure only full-length genes are printed out.
Generally, the prepare_golden_genes_for_predictors.pl can also be used for non-PASA derived data (see -mrna, -peptides), or even from another species (remove -same_species). To be honest, support for PASA was the last option added :-) -
Let’s explore the output and find some statistics:
$JAMG_PATH/3rd_party/PASA/misc_utilities/index_gff3_files.pl final_golden_genes.gff3.nr.golden.gff3 >/dev/null $JAMG_PATH/3rd_party/PASA/misc_utilities/exon_and_intron_stats.pl final_golden_genes.gff3.nr.golden.gff3.inx $JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_feature_types.pl final_golden_genes.gff3.nr.golden.gff3 2> /dev/null
You can then use R to get for example the (median) size of introns, exons etc.
data<-read.csv("*.gff3.intergenes",sep="\t",header=F) summary(abs(data$V2-data$V3)) // Min. 1st Qu. Median Mean 3rd Qu. Max. // 0 1469 7025 43780 34770 1113000 mean(abs(data$V2-data$V3), trim=0.20) // [1] 12175.85 Trimmed mean
Currently the AATPACKAGE and exonerate work rather well. They also very well for foreign proteins (i.e. from another species), just make sure you remove the -same_species parameter from above. In a future version, I’m thinking of integrating a GMAP step (on top or instead of aatpackage or even exonerate) for this step of mapping within the same species (GMAP will not perform between species).
-
-
-
Optional: Run de-novo gene predictors that don’t require external evidence (all but Augustus).
-
For almost all gene predictors, use the RepeatMasked genome (.masked, above).
-
GeneMark-ES does not require any training but you should still use the repeatmasked genome. You also need to install (and accept the license) of GeneMark. Genemark will take a couple of days to complete.
$JAMG_PATH/3rd_party/genemark/gm_es_bp_linux64_v2.3e/gmes/gm_es.pl $GENOME_PATH.masked # use --BP ON if you're working on fungi
-
For geneid, glimmerhmm and snap, you can train them using the output of prepare_golden_genes_for_predictors.pl (see below for each software)
-
Glimmerhmm and snap can use external evidence but when we run a validation we saw that they performed less well than without any evidence. We’re not experts of the software and there is no documentation so some optimization might be necessary. The input to these software is the output of prepare_golden_genes_for_predictors.pl, in particular .zff and *.xdef (for snap) and *geneid or *glimmer for the others. Generally, we will train with the .train. file, then we make predictions and finally we test (evaluate) them against the *.golden.test.gtf file. For more details, see [evaluation].
-
SNAP:
# SNAP - specific instructions mkdir snap; cd snap #train mkdir train ; cd train # also copy/link the relevant .fasta .zff data used below ln -s ../../*zff* ../../*gff3.fasta . $JAMG_PATH/3rd_party/bin/fathom golden.train.zff golden.train.gff3.fasta -gene-stats | tee gene.statistics.log $JAMG_PATH/3rd_party/bin/fathom golden.train.zff golden.train.gff3.fasta -categorize 1000 $JAMG_PATH/3rd_party/bin/fathom -export 1000 -plus uni.ann uni.dna $JAMG_PATH/3rd_party/snap/forge export.ann export.dna $JAMG_PATH/3rd_party/snap/hmm-assembler.pl Pult . > $GENOME_NAME.hmm # model to use to predict #predict cd .. ; mkdir predict; cd predict # create a directory where each genome sequence is in a single file. Use the softmasked repeats ln -s $GENOME_PATH.softmasked $GENOME_NAME.softmasked $JAMG_PATH/bin/splitfasta.pl -i $GENOME_NAME.softmasked # prepare execution for each genome sequence find $GENOME_NAME.softmasked_dir1 -maxdepth 1 -type f -exec sh -c \ 'echo "$JAMG_PATH/3rd_party/snap/snap ../train/$GENOME_NAME.hmm $1 -lcmask -quiet > $1.snap 2>/dev/null ; \ $JAMG_PATH/3rd_party/evidencemodeler/OtherGeneFinderTrainingGuide/SNAP/SNAP_output_to_gff3.pl $1.snap $1 > $1.snap.gff3 ; \ $JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_gtf_format.pl $1.snap.gff3 $1 > $1.snap.gtf"' \ find-copy '{}' \; > snap.commands ParaFly -c snap.commands -CPU $LOCAL_CPUS -v -shuffle cat $GENOME_NAME.softmasked_dir1/*snap.gtf > snap.gtf # evaluate $JAMG_PATH/3rd_party/eval-2.2.8/evaluate_gtf.pl -g ./*golden.test.gtf snap.gtf > snap.eval cd ../../
-
If you don’t like the results, you can create a new training directory and try training with fewer, or more genes. Alternatively you can use external evidence (see snap -help) in order to improve specificity and prevent wrong overlapping gene models from being predicted. First create an external evidence file for snap using the training genes and then run snap with an extra option:
cd snap/predict # reset input genome directory rm -f $GENOME_NAME.softmasked_dir1/*snap* $GENOME_NAME.softmasked_dir1/*cidx # prepare evidence $JAMG_PATH/bin/zff2hintzff.pl golden.train.gff3.zff # then prepare snap with the -xdef command find $GENOME_NAME.softmasked_dir1 -maxdepth 1 -type f -exec sh -c \ 'echo "$JAMG_PATH/3rd_party/snap/bin/snap $GENOME_NAME.hmm $1 -lcmask -quiet -xdef $1.snap.evidence > $1.snap 2>/dev/null ; \ $JAMG_PATH/3rd_party/evidencemodeler/OtherGeneFinderTrainingGuide/SNAP/SNAP_output_to_gff3.pl $1.snap $1 > $1.snap.gff3 2>/dev/null ; \ $JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_gtf_format.pl $1.snap.gff3 $1 > $1.snap.gtf 2>/dev/null"' \ find-copy '{}' \; > snap.commands2 cp evidence/* $GENOME_NAME.softmasked_dir1/ # bit of a convenience hack here ParaFly -c snap.commands2 -CPU $LOCAL_CPUS -v -shuffle cat $GENOME_NAME.softmasked_dir1/*snap.gtf > snap2.gtf # evaluate both snap runs $JAMG_PATH/3rd_party/eval-2.2.8/evaluate_gtf.pl -g ./*golden.test.gtf snap.gtf snap2.gtf > snap.eval # Repeat with any evidence/training data as you wish. Once you're happy, you can delete the $GENOME_NAME.softmasked_dir1 directory rm -rf $GENOME_NAME.softmasked_dir1 cd ../../
-
First, I should say that the zff2hintzff.pl script has room for improvement (specifically, only the coding regions are currently used). Second, if you don’t want to evaluate the output and you think that the external evidence has improved matters, there is no reason why you cannot provide all the golden genes (golden.gff3.zff) as external evidence. Clearly you will not be able to have an independent test (since your test genes have been included as evidence) but your aim here is to improve the annotation of a genome, not prove that a particular gene model prediction algorithm is better or worse. We will follow this procedure at the very end with Augustus et al after integrating everything with the evidencemodeler.
-
-
Glimmer:
# GlimmerHMM - specific instructions # train mkdir -p glimmer/train; cd glimmer/train ln -s ../../*glimmer* ../../*golden*.fasta . $JAMG_PATH/3rd_party/GlimmerHMM/train/trainGlimmerHMM \ ./*.train.good.fasta ./*.train.good.gb.glimmer \ -d attempt1 >/dev/null cd ../
-
You can provide the options -f, -l and -n which are the average values of respectively: up- and down-stream UTR and intergenic regions. You can use scripts within PASA to get an estimate for these.
-
Glimmer asks you set certain false positive/negative thresholds: they are found in the false. files: false.acc for acceptor sites, false.don for donor sites, false.atg for start sites. These are thresholds and are defined in the .cfg file and some defaults are already stored, they are often not very good. You can opt to change these defaults in order to balance over- with under- predicting. I tend to pick something that prevents over-prediction (i.e. when false positives jumps down to something acceptable). NB: make sure you only use two and only two decimal places in the cfg file. Now you can start predicting:
# predict # create a directory where each genome sequence is in a single file. Use the hardmasked repeats ln -s $GENOME_PATH.masked $GENOME_NAME.masked $JAMG_PATH/bin/splitfasta.pl -i $GENOME_NAME.masked # prepare execution for each genome sequence find $GENOME_NAME.masked_dir1 -maxdepth 1 -type f -exec sh -c \ 'echo "$JAMG_PATH/3rd_party/GlimmerHMM/bin/glimmerhmm $1 train/attempt1 > $1.glimmer 2>/dev/null ; \ $JAMG_PATH/3rd_party/evidencemodeler/OtherGeneFinderTrainingGuide/GlimmerHMM/glimmerHMM_out_to_gff3.pl $1.glimmer $1 > $1.glimmer.gff3 2>/dev/null; \ $JAMG_PATH/3rd_party/PASA/misc_utilities/gff3_to_gtf_format.pl $1.glimmer.gff3 $1 > $1.glimmer.gtf 2>/dev/null "' \ find-copy '{}' \; > glimmer.commands ParaFly -c glimmer.commands -CPU $LOCAL_CPUS -v -shuffle cat $GENOME_NAME.masked_dir1/*glimmer.gtf > glimmer.gtf # evaluate $JAMG_PATH/3rd_party/eval-2.2.8/evaluate_gtf.pl -g ./*golden.test.gtf glimmer.gtf > glimmer.eval
Glimmer may crash with the very not informative segmentation fault. It is usually data & system (i.e. C library) specific. As Glimmer is unsupported (i.e. they don’t reply to emails) and rather old, there is little we could do. If that happens for you, try another server. If it still happens, simply don’t use it.
-
-
Geneid
$JAMG_PATH/3rd_party/cegma/bin/geneid-train -v ./*geneid.golden.train.gff3 $GENOME_PATH train cd train make_paramfile $JAMG_PATH/3rd_party/cegma/data/self.param.template coding.initial.5.logs coding.transition.5.logs start.logs acc.logs don.logs intron.max > $GENOME_NAME.param cd .. #TODO CONVERT TO PERL $JAMG_PATH/bin/optimize_geneid.sh train/$GENOME_NAME.param ./*.golden.train.good.gb.geneid.fasta ./*.golden.train.good.gb.geneid run1 > $GENOME_NAME.optimization.cmds ParaFly -c $GENOME_NAME.optimization.cmds -CPU $LOCAL_CPUS
-
-
If we have a closely species that is well annotated, a good approach is to project that genome’s gene models to our unannotated one. This script, by Gavin Huttley () is still under construction but once ready we can use it like so:
create_projections.py -reference annotated_genome.fasta -genes [annotated_genome.gff3|annotated_genome.genbank] -genome new_genome.fasta -out new_genome.gff3
-
-
-
Pick a species or taxon, download data, align
-
First get some foreign proteins from a taxon that makes sense. Then go to UniProt or RefSeq and download something appropriate. Alternatively, you could bulk download some taxa you’d like to sue with the download_my_sequences script. I highly recommend you use the cd-hit program to reduce redundancy. Regardless, when you have your data, you can align them to your genome using some kind of cutoff (your guess will depend on the data and will be as good as my guess…).
# download data e.g. insects/invertebrate: wget -c --mirror --accept=protein.faa.gz ftp://ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate \ && gunzip -dc ftp.ncbi.nlm.nih.gov/refseq/release/invertebrate/*gz > foreign_proteins.fsa # OR: $JAMG_PATH/bin/download_my_sequences.pl -t Insecta -f fasta -m protein -out foreign_proteins.fsa # more data than RefSeq, slower cd-hit -c 0.90 -i foreign_proteins.fsa -o foreign_proteins.nr90 -d 0 -M 0 -T 10 # remove reduncancy to 90% # align data using some rather arbitrary cut-offs: $JAMG_PATH/bin/prepare_golden_genes_for_predictors.pl -identical 40 -similar 70 -mismatch_cutoff 100 -stop_golden \ -genome $GENOME_PATH.masked -softmasked $GENOME_PATH.softmasked -peptides foreign_proteins.nr90 \ -intron $MAX_INTRON_LENGTH -cpu $LOCAL_CPUS -norefine
-
-
-
Running Augustus
Augustus is our more powerful gene predictor for the simple fact that it accepts a number of tracks with external evidence (and also it is still actively maintained). In practice, in my projects I barely provide the other gene predictors with external evidence and just rely on Augustus. JAMg was written before Augustus 3 was released and I have had time to test it so JAMg uses the latest Augustus 2 (also Augustus 3 is under the rather viral GPL so there is the headache of changing the license). In any case, moving on: the input to Augustus is your genome FASTA, a subset of your golden gene predictions and your evidence tracks and two sets of configuration. The first set is to parameterize how Augustus builds the HMM. It is very important to do that once per species. Then the second configuration file is about your external evidence: how much weight to give to each type of evidence (on a per feature type basis, exon, intron etc). JAMg uses a modified script from Augustus to conduct the former and a new script to do the latter. Both are quite easy to use but may take a day to run. The major complication is that for the second configuration we must create a set of evidence tracks using our test sequence as a reference, not our genome. This is happening because our golden genes are used to create a train/test set and therefore we must create our evidence tracks using these as a reference:
mkdir augustus; cd augustus cat ../final_golden_genes.gff3.nr.golden.optimization.good.gb.fasta ../final_golden_genes.gff3.nr.golden.train.good.gb.fasta > complete_reference_training_set.fasta
Therefore, we must rerun our scripts: for the RNASeq: align_rnaseq_gsnap.pl and augustus_RNAseq_hints.pl, for repeats and domains: prepare_domain_exon_annotation.pl and maskFastaFromBed, and for our Foreign proteins prepare_golden_genes_for_predictors.pl. See the tutorial section 2C on how we are doing this exactly.
-
Creating a consensus gene set
-
Adding UTR and alternative splicing
-
Functional annotation with JAMPs
-
Setting up WebApollo
-
Adding more data to WebApollo
genome_gaps_to_bed.pl
-
-
Where do I go from here?
General info and help
Every perl script in JAMg has a PerlDoc so that you can do this to read the manual. |
perldoc prepare_domain_exon_annotation.pl # the complete manual prepare_domain_exon_annotation.pl # or short info Usage: Mandatory -fasta|genome|in :s => FASTA file of genome -engine :s => How to run hhblits: none, local, localmpi, PBS or cluster (def. localmpi) -transposon_db :s => HHblits transposon database (provided) -uniprot_db :s => HHblits Uniprot database (see ftp://toolkit.genzentrum.lmu.de/pub/HH-suite/databases/hhsuite_dbs) -hosts :s => Only for -engine mpi: a definition for which hosts to use in the format hostname1:number_of_cpus-hostname2:number_of_cpus, e.g. localhost:5-remote:5
You don’t have to type the entire argument, the first few unique letters will be enough. The pipe character (|) tells you that -fasta or -genome (or -in) can be used interchangebly. The :s or :i above means that we expect a string or integer to be the argument. Remember to quote (") strings that have spaces in them. When in doubt use the defaults. |
-
MPI is free. We recommend openMPI but we also support MPICH2. You can install it from repositories, e.g. on Debian/Ubuntu:
apt-get install openmpi-bin
-
Tell FFINDEX where the shared libraries are. FFINDEX is installed as part of transdecoder You ought to include it in your $HOME/.bashrc or your sys-admin can copy the libraries in a system-wide path.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAMG_PATH/3rd_party/transdecoder/util/lib64/
To make the most of the annotation platform you will need some RNA-Seq. Generally speaking, the more is better but beyond a certain point, any added value decreases. If you have the ability to design your genome/transcriptome sequencing before you’ve reached this stage then, first of all, "Well Done"™! Too often sequencing is undertaken with little understanding of the needs of the downstream processes… Second, your genome assembly will greatly benefit from long-range mate pair libraries, long reads (such as Pac-Bio) or optical mapping (if working on bacteria or have lots of cash). The reason for this is that your ability to fully ascertain gene families that have paralogues will only be as good as your feature annotation, and you feature annotation cannot be better than the underlying genome assembly sequence for those regions. Third, for your transcriptome diversity of tissue/life stages is key to acquiring sequencing of us many diverse tissues as possible. Be particularly careful if you wish to identify lowly expressed genes: even with tissue specific libraries, you may need to sequence deeply (the literature is your friend). The most difficult class of genes to annotate are rapidly evolving and lowly expressed genes since you will then have to rely solely on the de-novo prediction but the protein domain search above will be of great help.
TODO Because for evaluation we are using the entire genome, specificity will be somewhat low if your test set has few genes. Basically a low specificity means that the prediction created gene models not found in your test set (as expected).
- Reference sequence
-
A contiguous, relatively long, sequence that is used to anchor other sequences or features
- Feature
-
In this context, an annotation of a reference sequence that has start and stop co-ordinates (e.g. a gene). It can have sub-features (e.g. exons). Usually we just use the term feature' for sub-features too.
Email us one!