Running the Pipeline¶
If you have not already, make sure to check out the Install page first.
You will have to re-source the virtualenv each time you want to use the pipeline, using:
. bactpipeline/bin/activate
Running a single sample¶
Running a single sample is fairly straightforward, assuming you have your MiSeq reads parsed out into individual folders for each sample.
Usage Example¶
For this example, assume you have a directory /home/username/reads/sample1 that contains MiSeq reads for sample1:
- sample1_S1_L001_R1_001.fastq
- sample1_S1_L001_R2_001.fastq
Run the following command to run the pipeline on sample1’s data:
runsample -o sample1 /home/username/reads/sample1
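Before launching a run, it can help to confirm that every R1 file in the reads directory has a matching R2 file. The following is a minimal sketch, not part of the pipeline: `check_read_pairs` is a hypothetical helper, and it assumes the Illumina file naming shown above, where the mate is encoded as `_R1_` or `_R2_` in the file name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline).
# Assumes reads are named like sample1_S1_L001_R1_001.fastq / ..._R2_001.fastq.
check_read_pairs() {
    local readdir=$1 r1 base r2 found=0 status=0
    for r1 in "$readdir"/*_R1_*.fastq; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        found=1
        base=$(basename "$r1")
        r2="$readdir/${base/_R1_/_R2_}"     # expected name of the R2 mate
        if [ ! -e "$r2" ]; then
            echo "missing mate for $base" >&2
            status=1
        fi
    done
    if [ "$found" -ne 1 ]; then
        echo "no R1 reads found in $readdir" >&2
        return 1
    fi
    return "$status"
}
```

It could be combined with the command above, for example `check_read_pairs /home/username/reads/sample1 && runsample -o sample1 /home/username/reads/sample1`.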
Running multiple samples¶
Multiple samples can be run with the --sample-sheet parameter.
./runsample --sample-sheet samplesheet.csv -o outdir
./runsample -s samplesheet.csv -o outdir
Sample Sheet Syntax¶
sample_directory,sample_id,primer_file
test/fixtures/fix_fastq,SampleA,
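If each sample's reads live in their own subdirectory, a sample sheet in the syntax above can be generated rather than typed by hand. This is a rough sketch under stated assumptions: `make_samplesheet` is a hypothetical helper, the sample id is taken from the subdirectory name, and the primer_file column is left empty.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline).
# Assumes one subdirectory per sample under the given parent directory.
make_samplesheet() {
    local parent=$1 dir
    echo "sample_directory,sample_id,primer_file"
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue           # the glob matched nothing
        dir=${dir%/}                        # drop the trailing slash
        echo "$dir,$(basename "$dir"),"     # primer_file left empty
    done
}
```

For example, `make_samplesheet /home/username/reads > samplesheet.csv` would write one row per sample directory.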
PBS/Torque Example¶
Here is a quick example of how to run all your samples using the qsub command.
A bash while loop reads each line of the samplesheet.csv file and spawns a new job for it.
qsub’s -j option joins standard output and standard error into one stream.
qsub’s -V option forwards your current environment to each job. This is important because your current environment should already include your virtualenv’s variables and have the Newbler executables in your PATH.
First we will define where all project directories will be created:
OUTDIR="outdir"
Now we can run our loop over the samplesheet.csv file
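mkdir -p "$OUTDIR"
while IFS=',' read -r path sn primer; do
    # Skip the header row of the sample sheet
    [ "$sn" = "sample_id" ] && continue
    # Submit one job per sample; \$PBS_O_WORKDIR is expanded inside the job, not here
    echo "cd \$PBS_O_WORKDIR; runsample -o $OUTDIR/${sn} $path" | qsub -j oe -N "$sn" -V
done < samplesheet.csv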
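Before submitting anything to the queue, it may be useful to preview the commands the loop would generate. The sketch below is a dry-run variant: `print_jobs` is a hypothetical helper that prints one runsample command per sample sheet row and never calls qsub. It also tolerates a final row that lacks a trailing newline.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper (not shipped with the pipeline).
# Prints the runsample command for each row instead of submitting a job.
print_jobs() {
    local sheet=$1 outdir=$2 path sn primer
    # "|| [ -n "$path" ]" keeps a last line that has no trailing newline
    while IFS=',' read -r path sn primer || [ -n "$path" ]; do
        [ "$sn" = "sample_id" ] && continue   # skip the header row
        [ -n "$sn" ] || continue              # skip blank or malformed rows
        echo "runsample -o $outdir/$sn $path"
    done < "$sheet"
}
```

Running `print_jobs samplesheet.csv "$OUTDIR"` lists the commands so they can be checked before the real loop is run.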
Once all jobs are completed, you can build your full report and aggregate all contig files into a single directory, with each file named after its sample.
grep -h sample $OUTDIR/*/summary.tsv | head -1 > $OUTDIR/full_summary.tsv
grep -h -v sample $OUTDIR/*/summary.tsv >> $OUTDIR/full_summary.tsv
mkdir -p $OUTDIR/contigs
for c in "$OUTDIR"/*/top_contigs.fasta; do
    sn=$(basename "$(dirname "$c")")
    ln -s "../${sn}/top_contigs.fasta" "$OUTDIR/contigs/${sn}.contigs.fasta"
done
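Since the full report is built from each sample's summary.tsv, it can be worth checking which samples have not produced one, for instance because a job failed. A minimal sketch follows; `missing_summaries` is a hypothetical helper, and it assumes every subdirectory of the output directory other than contigs is a sample project.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline).
# Prints the name of every sample directory without a non-empty summary.tsv.
missing_summaries() {
    local outdir=$1 dir
    for dir in "$outdir"/*/; do
        [ -d "$dir" ] || continue                          # the glob matched nothing
        dir=${dir%/}
        [ "$(basename "$dir")" = "contigs" ] && continue   # aggregate directory, not a sample
        [ -s "$dir/summary.tsv" ] || basename "$dir"
    done
    return 0
}
```

An empty result from `missing_summaries "$OUTDIR"` suggests that every sample has a summary to aggregate.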
runsample¶
This script takes care of putting all the pieces of the pipeline together.
Pipeline Flow¶
- fix_fastq
- flash
- btrim
- Newbler (runAssembly)
Usage¶
- Output Directory
  - -o or --output
  - Specifies where to put all the resulting output directories for the various stages
  - Default: output
- primer
  - -p or --primer
  - Specify a primer trimming file to use for the newbler assembly. It is passed using -vt to runProject
- sample sheet
  - -s or --sample-sheet
  - Specify a samplesheet to parse containing your input files and primer files
- truseq
  - -t or --truseq
  - Specify the path to a truseq.txt file for adapter trimming
- readdir
  - Specifies a directory that contains the paired MiSeq reads
runsample --help
Output Files¶
- fix_fastq
  - fastq files with the same names as those in the readdir argument, but with the sequence ids modified for Newbler
- flash/
  - out.extendedFrags.fastq
    - paired reads combined together
  - out.notCombined_1.fastq
    - R1 reads that did not combine
  - out.notCombined_2.fastq
    - R2 reads that did not combine
  - out.hist
    - Combined read lengths
  - out.histogram
    - Combined read lengths visual
- btrim
  - fastq files with the same names as the out.*.fastq files from flash, but ending in .btrim.fastq
- newbler_assembly
  - gsAssembler project directory
  - See the Newbler documentation for the contents of this directory.
- top_contigs.fasta
  - Contains the top 100 contigs from newbler_assembly/assembly/454AllContigs.fna sorted by sequence length
- summary.tsv
  - Quick summary of all the contigs, including their length, the number of reads used to compose them, N50, and the percentage of total reads remaining after btrim that compose each contig
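The list of outputs above can be turned into a quick completeness check for a finished run. This is only a sketch: `check_outputs` is a hypothetical helper, and it tests for existence only, because the list does not state which entries are files and which are directories.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the pipeline).
# Reports any of the outputs listed above that are absent from a project directory.
check_outputs() {
    local project=$1 name status=0
    for name in fix_fastq flash btrim newbler_assembly top_contigs.fasta summary.tsv; do
        if [ ! -e "$project/$name" ]; then
            echo "missing: $project/$name" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the single-sample example earlier on this page, `check_outputs sample1` would return non-zero if any stage left no output behind.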
fix_fastq¶
This script renames the sequence identifiers in Illumina reads so that Newbler correctly treats them as paired-end reads.
It addresses this
Usage¶
fix_fastq [-o outdir] fastq [fastq ...]
Example usage¶
Supply the script with the paths of the fastq files you want to fix. It replaces the sequence id in each file and copies the modified version into an output directory.
If you have a bunch of fastq files in a directory, let's say /home/username/reads, then you could run it as follows:
fix_fastq -o newbler_reads /home/username/reads/*.fastq
All modified reads would then be placed in a directory called newbler_reads in the current directory.
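Because fix_fastq only rewrites sequence ids, the modified files should contain the same number of reads as the originals. The sketch below checks this. `count_reads` and `compare_read_counts` are hypothetical helpers, and they assume standard four-line fastq records and that each modified file keeps the name of its original.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with the pipeline).
# Assumes four lines per fastq record and identical file names in both directories.
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

compare_read_counts() {
    local orig_dir=$1 fixed_dir=$2 f base status=0
    for f in "$orig_dir"/*.fastq; do
        [ -e "$f" ] || continue             # the glob matched nothing
        base=$(basename "$f")
        if [ ! -e "$fixed_dir/$base" ] ||
           [ "$(count_reads "$f")" -ne "$(count_reads "$fixed_dir/$base")" ]; then
            echo "mismatch: $base" >&2
            status=1
        fi
    done
    return "$status"
}
```

For the example above, this would be called as `compare_read_counts /home/username/reads newbler_reads`.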