Basic EC2, command line, and BLAST

Log into your cloud instance machine with SSH (described in Getting started with Amazon EC2) and we can start to

  • get the tools
  • get the data
  • get the comparison data
  • run the tools
  • get the data off

Install BLAST and some other software

You should be starting at the prompt that looks something like 'ubuntu@ip-10-98-166-81:~$’, inside your Terminal or gitbash window.

This is now not bash on your local machine, but bash on amazon’s cloud instance.

The instance we’ve started for you is a blank slate. It has bash, but little else. No blast, no genomes, no bwa...

Now that we are in the cloud, download and install BLAST:

curl -O ftp://ftp.ncbi.nih.gov/blast/executables/release/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar xzf blast-2.2.26-x64-linux.tar.gz
sudo cp blast-2.2.26/bin/* /usr/local/bin
sudo cp -r blast-2.2.26/data /usr/local/blast-data

Download and install some useful scripts::

sudo apt-get -y install git python-pip
sudo git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts

Create a working directory on a large disk, and change to that working directory:

cd /mnt
sudo chown $USER /mnt
mkdir blast
cd blast

Download the E. coli MG1655 protein data set:

curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa

This grabs that URL and saves the contents of ‘NC_000913.faa’ to the local disk.

Grab a Prokka-generated set of proteins

curl -O http://athyra.idyll.org/~t/ecoli0104.faa

Let’s take a quick look at these files:

head ecoli0104.faa
head NC_000913.faa

Format it for BLAST and run BLAST of the O104 protein set against the MG1655 protein set:

formatdb -i NC_000913.faa -o T -p T
blastall -i ecoli0104.faa -d NC_000913.faa -p blastp -e 1e-12 -o 0104.x.NC

Look at the output file:

head 0104.x.NC

Let’s convert ‘em to a CSV file:

sudo pip install screed
python /usr/local/share/ngs-scripts/blast/blast-to-csv-with-names.py ecoli0104.faa NC_000913.faa 0104.x.NC > 0104.x.NC.csv

This creates a file ‘0104.x.NC.csv’, which you an open in Excel.

The notes from the Berkeley workshop show you how to synchronize high-value, output files on your instance using Dropbox. This tutorial does not show off this neat feature.

But we will copy the spreadsheet onto your computer using scp.

In a new Terminal or gitbash window (one that is NOT running SSH into your instance), type:

scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/0104.x.NC.csv .

(Note: there is a period at the end of this line. It is essential. scp USER@HOST:DIRECTORY/FILENAME DESTINATION )

(Why does it need to be a new terminal window? Your first terminal window is busy running SSH into the cloud instance, and your cloud instance is not authorized to write files to your laptop.)

This will copy the output file 0104.x.NC.csv to your current directory where you can open it, say, with excel.

Reciprocal BLAST calculation

Do the reciprocal BLAST, too:

formatdb -i ecoli0104.faa -o T -p T
blastall -i NC_000913.faa -d ecoli0104.faa -p blastp -e 1e-12 -o NC.x.0104

Extract reciprocal best hit:

python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py ecoli0104.faa NC_000913.faa 0104.x.NC NC.x.0104 > ortho.csv

This generates a file ‘ortho.csv’, containing the ortholog assignments and their annotations.:

scp -i ~/Desktop/soils.pem ubuntu@ec2-???-???-???-???.compute-1.amazonaws.com:/mnt/blast/ortho.csv .

A few post-tutorial links

Explore the NCBI bacterial genome site here: http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria

  • ‘.faa’ files are protein data sets;
  • ‘.fna’ files are genomic DNA;
  • the rest are annotation files of various kinds.

Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to Basic EC2, command line, and BLAST on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.