Running the Pipeline

If you have not already, make sure to check out the Install page first.

You will have to re-source the virtualenv when you want to use the pipeline, using

. bactpipeline/bin/activate

Running a single sample

Running a single sample is fairly straight forward assuming you have your miseq reads parsed out into individual folders for each sample.

Usage Example

For this example, assume you have a directory /home/username/reads/sample1 that contains MiSeq reads for sample1:

  • sample1_S1_L001_R1_001.fastq
  • sample1_S1_L001_R2_001.fastq

Simply run the following command to run the pipeline on sample1’s data

runsample -o sample1 /home/username/reads/sample1

Running multiple samples

Multiple samples can be run with the --sample-sheet parameter.

./runsample --sample-sheet samplesheet.csv -o outdir
./runsample -s samplesheet.csv -o outdir

Sample Sheet Syntax

sample_directory,sample_id,primer_file
test/fixtures/fix_fastq,SampleA,

PBS/Torque Example

Here is a quick example on how to run all your samples using the qsub command

Here we are using a bash while loop to loop through all lines in the samplesheet.csv file to spawn a new job for each line.

We use qsub’s -j option to ensure standard output and standard error are joined into one stream.

We then use qsub’s -V option to ensure that your current environment gets forwarded on to each job. This is important as your current environment should already include your virtualenv’s variables as well as having the newbler executables in your PATH.

First we will define where all project directories will be created:

OUTDIR="outdir"

Now we can run our loop over the samplesheet.csv file

mkdir -p $OUTDIR
while IFS=',' read path sn primer; do \
    [ "$sn" == "sample_id" ] && continue; \
    echo "cd \$PBS_O_WORKDIR; runsample -o $OUTDIR/${sn} $path" | qsub -j oe -N $sn -V; \
done < samplesheet.csv

Once all jobs are completed you can build your full report and aggregate all contig files into a single directory named after each sample.

grep -h sample $OUTDIR/*/summary.tsv | head -1 > $OUTDIR/full_summary.tsv
grep -h -v sample $OUTDIR/*/summary.tsv >> $OUTDIR/full_summary.tsv
mkdir -p $OUTDIR/contigs
for c in $OUTDIR/*/top_contigs.fasta; do sn=$(basename $(dirname $c)); ln -s ../${sn}/top_contigs.fasta $OUTDIR/contigs/${sn}.contigs.fasta; done;

runsample

This script takes care of putting all the pieces of the pipeline together

Pipeline Flow

  • fix_fastq
  • flash
  • btrim
  • Newbler(runAssembly)

Usage

  • Output Directory
    • -o or –output
    • Specifies where to put all the resulting output directories for the various stages
    • Default: output
  • primer
    • -p or –primer
    • Specify a primer trimming file to use for the newbler assembly. It is passed using -vt to runProject
  • sample sheet
    • -s or –sample-sheet
    • Specify a samplesheet to parse containing your input files and primer files
  • truseq
    • -t or –truseq
    • Specify the path to a truseq.txt file for adapter trimming
  • readdir
    • Specifies a directory that contains the paired MiSeq reads
runsample --help

Output Files

  • fix_fastq
    • fastq files with same name as were in the readdir argument but with sequence id modified for Newbler
  • flash/
    • out.extendedFrags.fastq
      • paired reads combined together
    • out.notCombined_1.fastq
      • R1 reads that did not combine
    • out.notCombined_2.fastq
      • R2 reads that did not combine
    • out.hist
      • Combined read lengths
    • out.histogram
      • Combined read lengths visual
  • btrim
    • fastq files with same name as out.*.fastq from flash, but with .btrim.fastq at end
  • newbler_assembly
    • gsAssembler project directory
      • See Newbler documentation about contents of this directory.
  • top_contigs.fasta

    Contains the top 100 contigs from newbler_assembly/assembly/454AllContigs.fna sorted by sequence length

  • summary.tsv

    Summary file that contains quick easy summary to view about all the contigs including their length, number of reads used to compose them, N50, % of total reads from after btrim ran that compose each contig

fix_fastq

This script handles renaming sequence identifiers in Illumina reads such that Newbler will use them as paired end correctly.

It addresses this

Usage

fix_fastq [-o outdir] fastq [fastq ...]

Example usage

You essentially supply the script with the location of any fastq files you want and it will replace the sequence id in each and copy the modified version into an output directory.

If you have a bunch of fastq files in a directory, lets say /home/username/reads, then you could run it as follows:

fix_fastq -o newbler_reads /home/username/reads/*.fastq

All modified reads would then be placed in a directory called newbler_reads in the current directory.