Basic EC2, command line, and BLAST

Log into your cloud instance with SSH (described in Getting started with Amazon EC2), and we can start to:

get the tools
get the data
get the comparison data
run the tools
get the data off

Install BLAST and some other software

You should be starting at a prompt that looks something like ‘ubuntu@ip-10-98-166-81:~$’, inside your Terminal or gitbash window.

This is no longer bash on your local machine, but bash on your Amazon cloud instance.

The instance we’ve started for you is a blank slate. It has bash, but little else. No BLAST, no genomes, no bwa...

Now that we are in the cloud, download and install BLAST:

curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar xzf blast-2.2.26-x64-linux.tar.gz
sudo cp blast-2.2.26/bin/* /usr/local/bin
sudo cp -r blast-2.2.26/data /usr/local/blast-data

Download and install some useful scripts:

sudo apt-get -y install git python-pip
sudo git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts

Create a working directory on a large disk, and change to that working directory:

cd /mnt
sudo chown $USER /mnt
mkdir blast
cd blast

Download the E. coli MG1655 protein data set:

curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa

This downloads that URL and saves it as ‘NC_000913.faa’ in the current directory.
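With the -O option, curl names the saved file after the last component of the URL path. As a small illustration of that naming rule, here is a Python sketch; the helper name `local_name` is made up for this example and is not part of curl or of the tutorial's scripts.

```python
import posixpath
from urllib.parse import urlparse


def local_name(url):
    """Return the last component of the URL path, which is the
    file name `curl -O` uses when saving the download."""
    return posixpath.basename(urlparse(url).path)


url = ("http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/"
       "Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa")
print(local_name(url))
```

The same rule explains why the next download ends up in a file called ‘ecoli0104.faa’.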

Grab a Prokka-generated set of E. coli O104 proteins:

curl -O http://athyra.idyll.org/~t/ecoli0104.faa

Let’s take a quick look at these files:

head ecoli0104.faa
head NC_000913.faa
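head only shows the first ten lines of each file. If you also want to know how many protein records a file holds, one option is to count the header lines, which start with ‘>’ in FASTA format. The following is a minimal Python sketch that runs on a tiny made-up sample rather than the real files; the function name `fasta_headers` is hypothetical.

```python
def fasta_headers(lines):
    """Yield the header text (without the leading '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


# Two made-up records, for illustration only (not taken from the real files).
sample = """>prot_1 hypothetical protein
MKRISTTITTTITITTGNGAG
>prot_2 made-up example
MRVLKFGGTSVANAERFLRV
""".splitlines()

headers = list(fasta_headers(sample))
print(len(headers), "records")
print(headers[0])
```

To try it on a downloaded file, pass an open file handle instead of the sample, for example `fasta_headers(open("NC_000913.faa"))`.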

Format the MG1655 protein set as a BLAST database, then BLAST the O104 protein set against it:

formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC

Look at the output file:

head 0104.x.NC

Let’s convert the BLAST output to a CSV file:

sudo pip install screed
python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv

This creates a file ‘0104.x.NC.csv’, which you can open in Excel.

The notes from the Berkeley workshop show you how to synchronize high-value output files on your instance using Dropbox. This tutorial does not cover that feature.

Instead, we will copy the spreadsheet onto your computer using scp.

In a new Terminal or gitbash window (one that is NOT running SSH into your instance), type:

scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/0104.x.NC.csv .

(Note: the period at the end of this line is essential. It is the destination, and means "the current directory". The general form is: scp USER@HOST:DIRECTORY/FILENAME DESTINATION)

(Why does it need to be a new terminal window? Your first terminal window is busy running SSH into the cloud instance, and your cloud instance is not authorized to write files to your laptop.)

This will copy the output file 0104.x.NC.csv to your current directory, where you can open it with, say, Excel.

Reciprocal BLAST calculation

Do the reciprocal BLAST, too:

formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104

Extract the reciprocal best hits:

python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv

This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations. Copy it to your computer with scp, as before, from a terminal window on your local machine:

scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/ortho.csv .
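The idea behind a reciprocal best hit is that protein A's best match in the other genome is protein B, and protein B's best match back in the first genome is protein A. The following Python sketch shows that logic on made-up hits. It is not the code in blast-to-ortho-csv.py; the function names, the choice of bit score as the ranking criterion, and the sample data are all assumptions made for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up hits, for illustration only (not real BLAST results).
a_vs_b = [
    ("o104_1", "mg_1", 250.0),
    ("o104_1", "mg_2", 90.0),
    ("o104_2", "mg_2", 180.0),
    ("o104_3", "mg_2", 60.0),
]
b_vs_a = [
    ("mg_1", "o104_1", 248.0),
    ("mg_2", "o104_2", 181.0),
    ("mg_2", "o104_3", 55.0),
]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this sample, o104_3 hits mg_2, but mg_2's best hit is o104_2, so o104_3 is not assigned an ortholog.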

A few post-tutorial links

Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

‘.faa’ files are protein data sets;
‘.fna’ files are genomic DNA;
the rest are annotation files of various kinds.
