Log into your cloud instance machine with SSH (described in Getting started with Amazon EC2) and we can start to
You should be starting at the prompt that looks something like 'ubuntu@ip-10-98-166-81:~$’, inside your Terminal or gitbash window.
This is now not bash on your local machine, but bash on amazon’s cloud instance.
The instance we’ve started for you is a blank slate. It has bash, but little else. No blast, no genomes, no bwa...
Now that we are in the cloud, download and install BLAST:
curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar xzf blast-2.2.26-x64-linux.tar.gz
sudo cp blast-2.2.26/bin/* /usr/local/bin
sudo cp -r blast-2.2.26/data /usr/local/blast-data
Download and install some useful scripts::
sudo apt-get -y install git python-pip
sudo git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts
Create a working directory on a large disk, and change to that working directory:
cd /mnt
sudo chown $USER /mnt
mkdir blast
cd blast
Download the E. coli MG1655 protein data set:
curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa
This grabs that URL and saves the contents of ‘NC_000913.faa’ to the local disk.
Grab a Prokka-generated set of proteins
curl -O http://athyra.idyll.org/~t/ecoli0104.faa
Let’s take a quick look at these files:
head ecoli0104.faa
head NC_000913.faa
Format it for BLAST and run BLAST of the O104 protein set against the MG1655 protein set:
formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC
Look at the output file:
head 0104.x.NC
Let’s convert ‘em to a CSV file:
sudo pip install screed
python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv
This creates a file ‘0104.x.NC.csv’, which you an open in Excel.
The notes from the Berkeley workshop show you how to synchronize high-value, output files on your instance using Dropbox. This tutorial does not show off this neat feature.
But we will copy the spreadsheet onto your computer using scp.
In a new Terminal or gitbash window (one that is NOT running SSH into your instance), type:
scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/0104.x.NC.csv .
(Note: there is a period at the end of this line. It is essential. scp USER@HOST:DIRECTORY/FILENAME DESTINATION )
(Why does it need to be a new terminal window? Your first terminal window is busy running SSH into the cloud instance, and your cloud instance is not authorized to write files to your laptop.)
This will copy the output file 0104.x.NC.csv to your current directory where you can open it, say, with excel.
Do the reciprocal BLAST, too:
formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104
Extract reciprocal best hit:
python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv
This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations.:
scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/ortho.csv .
Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria
This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.
For an introduction to the documentation format please see the reST primer.