2. Bioinformatics services
a. Emboss and wEmboss:
Access to Emboss software is available on emboss.uit.tufts.edu, which provides both shell and web access. In both cases you will need an account, which you may request at http://research.uit.tufts.edu . The server hardware is a single quad-core 64-bit host with 4 GB of RAM.
For shell access to command line tools:
> ssh -Y emboss.uit.tufts.edu
For access to the web interface wEmboss.
For access to emboss web documentation.
Emboss tutorial
If you have questions about Emboss usage or applications, or need assistance with the software, please contact bio-support@tufts.edu.
b. Tufts Center for Neuroscience Research Genomics Core
The Tufts CNR Genomics Core supplies links to bioinformatics resources related to their operation. See Tufts CNR Genomics Core Resources for more information.
c. Genome Indexes on Cluster
Several mammalian and model system genomes, indexes, and annotations are located on the Tufts HPC cluster. The genomes in the directory tree shown below are UCSC genome builds, except for canFam3, which is an NCBI build.
/cluster/tufts/genomes
    /HomoSapiens
        /hg18
        /hg19
Within each build subdirectory, there are two subdirectories.
    /Annotation
    /Sequence
In the Annotation directory there are subdirectories for gene annotations (Gene) and, depending on the degree of annotation, directories for smallRNA and Variation.
Under the Sequence directory, there are subdirectories containing indexes for popular short read sequence mapping programs.
/AbundantSequences -- data files with over-represented sequences
/BlastDB -- blast formatted genomic indexes: use genome.fa as reference name
/Bowtie2Index -- Bowtie2 formatted indexes: use genome as reference name
/BowtieIndex -- Bowtie formatted indexes: use genome as reference name
/BWAIndex -- BWA formatted indexes: use genome.fa as reference name
/Chromosomes -- individual chromosomes as fasta files
/Transcriptome -- Bowtie2 formatted index of transcriptome sites: use transcript as ref name
/WholeGenomeFasta -- Genome as one file with accessory files
Please read the documentation for your mapping program to understand how it expects reference indexes to be named.
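The layout above can be turned into a small lookup so that scripts do not hard-code long paths. This is a minimal sketch, assuming only the directory names and reference names listed above; GENOMES_ROOT and index_ref are illustrative names, not something provided on the cluster.

```shell
# Sketch: build the reference name a mapper expects from the layout above.
# GENOMES_ROOT and index_ref are illustrative names, not cluster-provided.
GENOMES_ROOT=/cluster/tufts/genomes

index_ref() {
    # usage: index_ref <Species> <build> <tool>
    species=$1
    build=$2
    tool=$3
    seqdir="$GENOMES_ROOT/$species/$build/Sequence"
    case "$tool" in
        bwa)     echo "$seqdir/BWAIndex/genome.fa" ;;
        bowtie2) echo "$seqdir/Bowtie2Index/genome" ;;
        bowtie)  echo "$seqdir/BowtieIndex/genome" ;;
        blast)   echo "$seqdir/BlastDB/genome.fa" ;;
        *)       echo "unknown tool: $tool" >&2; return 1 ;;
    esac
}

# Print the BWA reference name for the mouse mm10 build.
index_ref MusMusculus mm10 bwa
```

The function only assembles a path; it does not check that the build exists on disk.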
Examples
The examples listed below should be placed in a script and then submitted to the cluster as a job.
Example: BWA
It helps to set up environment variables to avoid typing long paths. Here a set of short reads (myreads.fq) is mapped to the mouse genome (mm10), with a SAM-formatted file as output. Note that bwa uses genome.fa as the reference index name and that the bwa mem algorithm is used. See the BWA documentation for other ways to invoke bwa.
module load bwa/0.7.9a
export MM10=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/BWAIndex
export MYDATADIR=/cluster/shared/myutln/mmdata
bwa mem $MM10/genome.fa $MYDATADIR/myreads.fq > $MYDATADIR/myreads.sam
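Before submitting a long mapping job it can help to confirm that the index and the reads are where the script expects them. The following is a sketch only: have_bwa_index is a hypothetical helper, and the five extensions are the files that bwa index normally writes next to the reference name.

```shell
# Sketch: confirm a BWA index prefix and the read file exist before mapping.
# have_bwa_index is a hypothetical helper name.
have_bwa_index() {
    # usage: have_bwa_index <prefix>, e.g. .../BWAIndex/genome.fa
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$1.$ext" ]; then
            echo "missing index file: $1.$ext" >&2
            return 1
        fi
    done
    return 0
}

MM10=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/BWAIndex
MYDATADIR=/cluster/shared/myutln/mmdata

# Run the mapping only when both inputs are present; otherwise say why not.
if have_bwa_index "$MM10/genome.fa" && [ -r "$MYDATADIR/myreads.fq" ]; then
    bwa mem "$MM10/genome.fa" "$MYDATADIR/myreads.fq" > "$MYDATADIR/myreads.sam"
else
    echo "skipping bwa mem: index or reads not found" >&2
fi
```

Quoting the variables keeps the command intact if a data directory name ever contains spaces.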
Example: Bowtie2
Similarly, environment variables can be set up; for bowtie2, the BOWTIE2_INDEXES variable must also be set so that the index can be found by its base name. Here is an example of a paired-end analysis with minimal options. See the bowtie2 documentation for the complete set of command options. Note that Bowtie2 uses genome as the reference index name (-x genome).
module load bowtie2
export BOWTIE2_INDEXES=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/Bowtie2Index
export MYDATADIR=/cluster/shared/myutln/mmdata
bowtie2 -q -x genome -1 $MYDATADIR/myreads_1.fq -2 $MYDATADIR/myreads_2.fq -S $MYDATADIR/myreads.sam
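When several paired-end samples follow the same naming pattern, the second mate and the output name can be derived from the first mate. The sketch below assumes the _1.fq / _2.fq naming used in the example above and only prints the command rather than running it; bowtie2_cmd is a hypothetical helper name.

```shell
# Sketch: derive mate 2 and the SAM name from mate 1 and print the command.
# Assumes the _1.fq / _2.fq naming used above; bowtie2_cmd is hypothetical.
bowtie2_cmd() {
    # usage: bowtie2_cmd <path/to/sample_1.fq>
    mate1=$1
    base=${mate1%_1.fq}
    if [ "$base" = "$mate1" ]; then
        echo "expected a name ending in _1.fq: $mate1" >&2
        return 1
    fi
    echo "bowtie2 -q -x genome -1 $mate1 -2 ${base}_2.fq -S $base.sam"
}

MYDATADIR=/cluster/shared/myutln/mmdata
# Dry run: show the command that would be used for one sample.
bowtie2_cmd "$MYDATADIR/myreads_1.fq"
```

Printing the command first makes it easy to review a batch of samples before any job is submitted.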
d. HPC Modules for Bioinformatics
Note: some bioinformatics software, such as the R package collection Bioconductor or Python packages like numpy and matplotlib, is not listed here because it is provided by a larger module, for example R/3.1.0 or python/2.7.6. Load the larger module to use it.
To list the entirety of the module collection use this command
module avail
To load a module use this command
module load modulename/version
where modulename/version is one of the modules listed below. Default versions are marked with '*'.
To show the currently loaded modules use this command
module list
To unload a module use this command
module unload modulename/version
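In a job script the module command may be unavailable, for instance when the script is run outside a cluster login environment. The following sketch guards the load step; load_checked is a hypothetical helper, and bwa/0.7.9a is simply the version used in the BWA example.

```shell
# Sketch: load a module only when the module command exists in this shell.
# load_checked is a hypothetical helper name.
load_checked() {
    # usage: load_checked <modulename/version>
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found in this shell" >&2
        return 1
    fi
    # Load the module, then show what is currently loaded.
    module load "$1" && module list
}

load_checked bwa/0.7.9a || echo "continuing without bwa/0.7.9a" >&2
```

Whether a script should continue or stop when the load fails depends on the job; stopping is usually safer for analysis pipelines.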
Performance Considerations using threads
In general, threads can give useful performance gains, but requesting more threads than an application can exploit wastes cluster resources. Applications that support thread parallelism vary in how well they make use of additional threads.
Application performance is not always well documented, so it may be worth doing some benchmarking of your own. Doing so will put you in a better position to use cluster resources efficiently. For example, here is a benchmark examination of blastp and other tools.
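A simple way to start benchmarking is to time the same command at several thread counts. This is a rough sketch, not a rigorous benchmark: bench_threads is a hypothetical helper, timings have one-second resolution, and the command string is expected to read the thread count from $THREADS.

```shell
# Sketch: time one command at several thread counts.
# The command string reads the count from $THREADS; bench_threads is a
# hypothetical helper name.
bench_threads() {
    # usage: bench_threads '<command using $THREADS>' <count>...
    bench_cmd=$1
    shift
    for n in "$@"; do
        start=$(date +%s)
        status=0
        THREADS=$n sh -c "$bench_cmd" >/dev/null 2>&1 || status=$?
        end=$(date +%s)
        # One line per run: thread count, elapsed seconds, exit status.
        printf '%s threads\t%s s\texit %s\n' "$n" "$((end - start))" "$status"
    done
}

# Harmless stand-in command; replace it with a real mapping job.
bench_threads 'sleep 1' 1 2
```

For a real test the stand-in could be replaced with something like 'bowtie2 -p "$THREADS" -q -x genome -U reads.fq -S /dev/null', where reads.fq is a placeholder file name; bwa mem takes its thread count with -t and blastp with -num_threads. Run each setting more than once on a quiet node before drawing conclusions.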