ANNOTATE BACTERIAL GENOME SEQUENCES (PREDICT GENES AND THEIR FUNCTIONS) USING PROKKA
Today we will learn about genome annotation. We will use Prokka, a program designed for annotating bacterial genomes, to annotate the Agrobacterium tumefaciens genome assemblies we made earlier.
First, open PuTTY and log in to crick if you are not already logged in.
Change directory to the folder containing your genome assembly:
cd ~/genome_assembly/kmer75assembly/
You should now be in the folder that contains your assembled contigs (“contigs.fa”). Run the ls command to list all of the files.
Next, we need to rename the sequence IDs in our assembly so that they will work with Prokka. Run the following command:
awk '/^>/{print ">contig_" ++i; next}{print}' < contigs.fa > contigs_number.fa
This will make a new file called contigs_number.fa that contains all of your assembled sequences, with each sequence ID replaced by a sequentially numbered name (contig_1, contig_2, etc.). The sequences themselves are not changed.
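To see what the awk command does before trusting it with a real assembly, here is a small sketch on a toy file. The file names demo_contigs.fa and demo_contigs_number.fa and the two short sequences are made up for illustration; the header names only imitate what an assembler might write.

```shell
# Build a tiny two-sequence FASTA file (toy data, not real contigs).
printf '>NODE_1_length_8_cov_2.5\nACGTACGT\n>NODE_2_length_4_cov_1.0\nGGCC\n' > demo_contigs.fa

# Same awk command as above: each header line (starting with ">") is replaced
# by ">contig_" plus a running counter; sequence lines are printed unchanged.
awk '/^>/{print ">contig_" ++i; next}{print}' < demo_contigs.fa > demo_contigs_number.fa

# Show the renamed file.
cat demo_contigs_number.fa
```

The headers in the output should read >contig_1 and >contig_2, with the sequence lines left exactly as they were.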
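To run Prokka, use the following command. The --outdir option names the folder for the results, --prefix sets the base name of the output files, and --force lets Prokka overwrite the output folder if it already exists: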
prokka contigs_number.fa --outdir annotated_genome --prefix Atumefaciens_assembly --force
Next, change directory to the new folder that Prokka created:
cd annotated_genome
and list the files in the directory:
ls
Prokka writes several output files. Among them you should see Atumefaciens_assembly.faa and Atumefaciens_assembly.fna. The “.faa” file is a protein FASTA file containing all of the predicted and annotated proteins that Prokka identified in your assembly. The “.fna” file is a nucleotide FASTA file containing the DNA sequences of your contigs.
Let’s take a look at a few of the (probable) proteins Prokka found in your genome:
nano Atumefaciens_assembly.faa
What is the predicted function of the first protein found?
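How many predicted proteins did Prokka find? (Hint: use grep with --count. In a FASTA file, every sequence begins with a header line that starts with ">". Put quotes around the search pattern, because an unquoted ">" tells the shell to overwrite a file.)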
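As a rough illustration of the hint, the sketch below counts sequences in a toy protein file. The file name demo_proteins.faa and its three records are invented for this example; you will need to adapt the command to your own Prokka output.

```shell
# Toy protein FASTA file with three records (made-up names and sequences).
printf '>demo_1 hypothetical protein\nMKT\n>demo_2 hypothetical protein\nMSL\n>demo_3 hypothetical protein\nMAA\n' > demo_proteins.faa

# Count the lines that begin with ">"; each one is the header of one sequence.
# The pattern is quoted so the shell does not treat ">" as a redirection.
grep --count '^>' demo_proteins.faa
```

For this toy file the command prints 3, one for each header line.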