Day 5: Whole Genome Assembly | STEM: DNA Biology and Bioinformatics

ASSEMBLY OF BACTERIAL GENOME SEQUENCES

Today we will be assembling the sequencing reads from our Agrobacterium tumefaciens genomic DNA extractions from Week 1. The samples were sequenced on an Illumina MiSeq over the weekend, and today we will be assembling the short reads into longer genome sequences using the program Velvet.

First, make sure you are in your home directory, the "~" is a shortcut for your home folder:

cd ~

We will make a new folder called sequencing_reads. This folder is where we will put the the forward and reverse sequencing reads from the MiSeq run of your Agrobacterium genome samples. These reads have been trimmed so that only high quality reads remain. Change into that directory with the command:

mkdir sequencing_reads

cd sequencing_reads

and type

We will now copy the sequencing read files for one of the 17 sequenced genomes. These files are stored in the folder "/nfs1/Teaching/CGRB/dbbc_s16/data/reads"

Let's look at the files in that folder:

ls /nfs1/Teaching/CGRB/dbbc_s16/data/reads/

Each pair of files is named isolate#-R1.fastq and isolate#-R2.fastq, where # is a number from 1 to 17. Copy the isolate files with the number matching the number on your desk. For example, if you are at the computer marked "lab5", your number is 5.

(replace 5 with your number)

cp /nfs1/Teaching/CGRB/dbbc_s16/data/reads/isolate5-* ./

Let’s look at what some of these reads look like by typing the following command:

less isolate#-R1.fastq

Each sequence has 4 lines, the first starts with “@” and is the identifier for that sequence. The second line contains the actual DNA sequence. The fourth line contains the quality score for each base in the sequence. Each character corresponds to a number that represents the quality of that DNA base being correct.

Lets make a new directory for our genome assembly. First, go up one directory to your home directory

cd ..

and make a new folder called genome_assembly:

mkdir genome_assembly

and change directory into that folder

cd genome_assembly

To assemble these reads into longer genome sequences, we will use the program Velvet. This program can take a long time to run since it has to piece together millions of reads. Velvet has two parts, velveth and velvetg.

Here are the commands to run Velvet:

velveth ./kmer75assembly 75 -shortPaired -fastq -separate ../sequencing_reads/isolate#-R1.fastq ../sequencing_reads/isolate#-R2.fastq

where # is your isolate number. Once velveth is done running, then run:

velvetg ./kmer75assembly

If there is an error message, please ask one of the TAs for help and they will help you get it running again.

Once the velvet commands have finished running, let’s check and see what output it produced.

Type the ls command again to see what files are in your current directory.

You should see a folder called kmer75assembly. This folder contains your assembled genome and other files from the assembly process. Change directories to that folder and list the files inside.

cd kmer75assembly

You should see a file named "contigs.fa". This file contains your assembled genome sequences.

Let’s look at the very beginning of this file to see what they look like. Type the following command:

less contigs.fa

These are your assembled contigs. Unfortunately we cannot piece together all of the contigs together because the sequencing reads are so short. To do that we would need to do several more sequencing runs with longer reads. However this gives us a majority of the genome of our organism.

Let’s take a look now at some of the statistics from our genome assembly. Run the following command:

assemblathon_stats.pl contigs.fa

This shows various statistics of the results of your Velvet assembly. Look for the line that says “Number of scaffolds”. This number is how many pieces (or contigs) your final genome assembly was assembled into. The lower this number is the better.

Look for the line that says “Total number of bases in assembly”. This is the total length of your assembled genome sequence in nucleotide bases. This number should be around 5 million, the estimated size of the Agrobacterium tumefaciens genome.

Also look at the line that says “Longest scaffold”. This is the length of your longest contig.

If you finish early, try assembling your genome again using a different “kmer” size with the velveth command and then the velvetg command. That is, try replacing the number 75 with an odd number between 19 and 149 and see if the number of scaffolds you get changes for better or for worse. A kmer is the length of DNA sequence that velvet will split your reads into before aligning them. For example, a kmer size of 75 (or “75-mer”) will align pieces of your reads together that are 75 bases long.

Be sure to go back up a folder (cd ..) before starting again, and pick a new name for your assembly folder in the velveth and velvetg commands (ie. “kmer149assembly”) so that your old assembly does not get overwritten.

Which kmer size gave you the best assembly?