Bioinformatics for Cancer Genomics

Bioinformatics for Cancer Genomics 12 Lab

This lab was created by Solomon Shorser

Introduction

Description of the lab

Welcome to the lab for Big Data Analysis! This lab will consolidate what you have learned about Cloud Computing by aligning reads from a cell line as an example.

After this lab, you will be able to:

  • Install the dockstore CLI.
  • Run CWL tools and workflows using the dockstore CLI.

Things to know before you start:

The lab may take between 1 and 2 hours, depending on your familiarity with Cloud Computing and alignment tasks.

Requirements

Set up a fresh VM by following the instructions in the Module 10 lab (https://github.com/bioinformaticsdotca/BiCG_2018/blob/master/module10/lab.md), but with the following changes:

  • choose flavor c1.large
  • don’t assign a floating IP

Without a floating IP, this VM is only accessible from within Collaboratory. When you run a fleet, there are often not enough floating IPs for all of the VMs, so you have to set up a “jump server” as a gateway for ssh from outside into Collaboratory. From the jump server, you can then ssh into any of the VMs in your fleet. We’ll use the VM (c1.micro) you set up for Modules 3 and 4 as the jump server. If you haven’t already done so, add your private key to the jump server. From the console, find the IP address of the new c1.large VM and ssh into it from the jump server.

ssh -i path_to_private_key ubuntu@10.0.0.XXX

Setting up your VM

Install Java

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y oracle-java8-set-default

Get the dockstore tool

The latest instructions for installing Dockstore are at https://dockstore.org/quick-start

mkdir -p ~/bin
curl -L -o ~/bin/dockstore https://github.com/ga4gh/dockstore/releases/download/1.5.1/dockstore
chmod +x ~/bin/dockstore
echo 'export PATH=~/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

The echo command above adds the location of the dockstore script to $PATH.

If you prefer to make the change by hand, use your favourite text editor (try pico if you don’t have one) to add this line to the end of ~/.bashrc instead:

export PATH=~/bin:$PATH

Now, set up the dockstore configuration file:

cd ~
mkdir -p ~/.dockstore
touch ~/.dockstore/config

Add to ~/.dockstore/config these lines:

The URL for dockstore

server-url: https://dockstore.org:8443

A token

You only need a valid token if you want to push data TO dockstore. To pull data, “DUMMY” is fine.

token: DUMMY

Caching

Turn on caching to prevent the same input files from being downloaded again and again and again…

use-cache=true
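The three settings above can also be written in a single step. The following is a minimal sketch, assuming the config lives at ~/.dockstore/config as described above; the helper name write_dockstore_config and its optional directory argument are illustrative and not part of the dockstore CLI. Note that it overwrites any existing config file in that directory.

```shell
# Write the three settings described above into <dir>/config.
# write_dockstore_config is an illustrative helper, not a dockstore command.
# With no argument it targets ~/.dockstore, the location used in this lab.
write_dockstore_config() {
    config_dir=${1:-"$HOME/.dockstore"}
    mkdir -p "$config_dir"
    # The quoted heredoc delimiter keeps the lines literal (no expansion).
    cat > "$config_dir/config" <<'EOF'
server-url: https://dockstore.org:8443
token: DUMMY
use-cache=true
EOF
}
```

Calling write_dockstore_config with no arguments writes ~/.dockstore/config; passing a directory lets you try it somewhere harmless first.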

Install docker

The full instructions are on Docker’s website: https://docs.docker.com/install/linux/docker-ce/ubuntu/

Fast Method

curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

Manual Method

If the Fast Method does not work, you can try the manual process described below.

First, install prerequisite software:

sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

Add Docker’s official GPG key:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Add the Docker repository (for x86_64/amd64 architecture):

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Set arch in the command above to armhf for armhf platforms, ppc64el for IBM PPC, or s390x for IBM Z s390x.

Now that you’ve added a new repository, run apt update:

sudo apt-get update

Install docker:

sudo apt-get install docker-ce

Testing Docker

You can test your docker installation by running:

sudo docker run hello-world

This command will tell you if docker was successfully installed.

Add your user to the docker user group

This is so you can run docker without having to sudo every time.

sudo usermod -aG docker $USER

After you execute the command above, you will need to log out and log back in for the new group membership to take effect.
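To see whether your current login session has picked up the new group, you can inspect the groups of the running shell. This is a small sketch; session_in_group is an illustrative helper name. It deliberately calls id without a user name, because that form reports the groups of the current session rather than what the group database says.

```shell
# Succeeds if the current login session is a member of <group>.
# session_in_group is an illustrative helper name.
session_in_group() {
    # id -nG (no user argument) lists the groups of the running shell.
    id -nG | tr ' ' '\n' | grep -qxF "$1"
}

if session_in_group docker; then
    echo "docker group is active in this session: sudo is not needed for docker"
else
    echo "docker group is not active yet: log out and log back in"
fi
```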

Get cwltool

Install Python’s package manager pip (if it is not already installed) and then download the appropriate requirements file and have pip install cwltool. If you are using Python 2:

sudo apt install python-pip
curl -o requirements.txt "https://dockstore.org:443/api/metadata/runner_dependencies?client_version=1.5.1&python_version=2"
pip install -r requirements.txt

If you are using Python 3:

sudo apt install python3-pip
curl -o requirements.txt "https://dockstore.org:443/api/metadata/runner_dependencies?client_version=1.5.1&python_version=3"
pip3 install -r requirements.txt

Note: You may need to run the pip install commands with sudo (for example, sudo pip install -r requirements.txt).
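Before moving on, it can help to confirm that everything installed so far is visible on your PATH. The sketch below only looks the commands up and does not run them; check_tools is an illustrative helper name, and the list of tools is taken from the installation steps in this lab.

```shell
# Report which of the given commands are visible on PATH.
# check_tools is an illustrative helper; its exit status is the number
# of missing tools, so 0 means everything was found.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_tools java dockstore docker cwltool || echo "install the missing tools before continuing"
```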

Use the dockstore CLI to fetch the CWL

The dockstore CLI will download the CWL file for the tool specified by --entry.

dockstore tool cwl --entry quay.io/pancancer/pcawg-bwa-mem-workflow:2.6.8-cwl1 > Dockstore.cwl

Note: If you get the error “dockstore: command not found”, your current shell has most likely not picked up the updated PATH. Run source ~/.bashrc, or log out and log back in.
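Because the fetch command redirects its output with >, Dockstore.cwl is created even when the command fails, in which case it may be empty. A quick sanity check, sketched below, is to confirm the file is non-empty and mentions cwlVersion, which a top-level CWL document is expected to declare. The helper name looks_like_cwl is illustrative.

```shell
# Succeeds if <file> is non-empty and mentions cwlVersion.
# looks_like_cwl is an illustrative helper name; this is a rough check,
# not a full validation of the CWL.
looks_like_cwl() {
    [ -s "$1" ] && grep -q 'cwlVersion' "$1"
}

if looks_like_cwl Dockstore.cwl; then
    echo "Dockstore.cwl looks like a CWL document"
else
    echo "Dockstore.cwl is missing, empty, or not CWL: re-run the fetch"
fi
```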

Prepare your JSON input file

Generate the JSON file

JSON files can be generated automatically from the CWL file. You will then have to fill in the values in the generated file.

dockstore tool convert cwl2json --cwl Dockstore.cwl > Dockstore.json

Download an existing file

An existing input JSON file can be downloaded with the command below. Edit it if you wish, but note that ‘~’ in the JSON is not expanded to your home directory.

wget https://raw.githubusercontent.com/bioinformaticsdotca/BiCG_2018/master/module12_lab/sample_input.json
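After editing the JSON by hand, it is worth checking that it still parses and that no ‘~’ has crept into a path. The sketch below uses Python’s standard json.tool module for the parse check; having python3 on the PATH is an assumption about your VM, and the helper names valid_json and tilde_lines are illustrative.

```shell
# Succeeds if <file> parses as JSON (uses Python's standard json.tool).
# Assumes python3 is on PATH; valid_json is an illustrative helper name.
valid_json() {
    python3 -m json.tool "$1" >/dev/null 2>&1
}

# Print, with line numbers, every line of <file> that contains '~'.
tilde_lines() {
    grep -n '~' "$1"
}

if valid_json sample_input.json; then
    tilde_lines sample_input.json || echo "no '~' found in sample_input.json"
else
    echo "sample_input.json is missing or is not valid JSON"
fi
```

Any line that tilde_lines prints is a candidate for replacing ‘~’ with the full path to your home directory.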

Create a directory for the output data. We use ‘~/tmp’ in the example JSON.

mkdir ~/tmp

You are now ready to run BWA-Mem using the Dockstore CLI (see below). However, if you have time, try downloading the input data (unaligned BAM) to your VM using the icgc-storage-client. In Module 10, you ran icgc-storage-client as a Docker container. We’ll now run it as a command line tool.

Download unaligned BAMs using the icgc-storage-client

To install the tool:

wget -O icgc-storage-client.tar.gz https://dcc.icgc.org/api/v1/ui/software/icgc-storage-client/latest
tar -zxvf icgc-storage-client.tar.gz
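The download URL fetches the latest release, so the directory the tarball unpacks to may carry a different version number than the one used in the commands in this lab. A small sketch for locating the launcher regardless of version follows; find_storage_client is an illustrative helper name, and the bin/icgc-storage-client layout is taken from the commands in this lab.

```shell
# Print the path of the client launcher found under <dir> (default: the
# current directory), whatever version directory the tarball unpacked to.
# find_storage_client is an illustrative helper name.
find_storage_client() {
    for candidate in "${1:-.}"/icgc-storage-client-*/bin/icgc-storage-client; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

find_storage_client . || echo "client not found: check that the tarball unpacked here"
```

If the printed path differs from icgc-storage-client-1.0.23/bin/icgc-storage-client, substitute it in the commands that follow.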

Here are the 2 unaligned BAMs with their object IDs. They are open-access in Collaboratory and don’t require a token.

hg19.chr22.5x.normal.bam	26ed125c-bc28-552c-b82d-1de2561b3911
hg19.chr22.5x.normal2.bam	1039a928-a767-5fe4-a50a-4e7af8ced828

Method 1: Generate a pre-signed URL and download it using curl or wget. Remember to put the URL in quotes in the wget command. Otherwise, you’ll get a 403 error.

mkdir input
icgc-storage-client-1.0.23/bin/icgc-storage-client --profile collab url --object-id 1039a928-a767-5fe4-a50a-4e7af8ced828
wget -O input/hg19.chr22.5x.normal2.bam "<pre-signed URL>"

Method 2: Download using the icgc-storage-client, which handles multi-part downloads and resumes after an interruption.

icgc-storage-client-1.0.23/bin/icgc-storage-client --profile collab download --object-id 26ed125c-bc28-552c-b82d-1de2561b3911 --output-layout bundle --output-dir input
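If you want to fetch both BAMs with Method 2, the same command can be repeated for each object ID. The sketch below only prints the commands so you can review them first; print_download_cmds and the CLIENT variable are illustrative names, and the flags are copied from the command above.

```shell
# Path to the client as used in this lab; adjust it if your tarball
# unpacked to a different version directory.
CLIENT=icgc-storage-client-1.0.23/bin/icgc-storage-client

# Print one Method 2 download command per object ID given as an argument.
# print_download_cmds is an illustrative helper; it runs nothing itself.
print_download_cmds() {
    for id in "$@"; do
        echo "$CLIENT --profile collab download --object-id $id --output-layout bundle --output-dir input"
    done
}

# Object IDs of the two unaligned BAMs listed earlier.
print_download_cmds \
    26ed125c-bc28-552c-b82d-1de2561b3911 \
    1039a928-a767-5fe4-a50a-4e7af8ced828
```

Once the printed commands look right, you can run them by pasting them into the terminal or by piping the output to sh.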

Organize the files in your input directory, then edit sample_input.json so that the paths under “reads” point to the 2 unaligned BAMs.
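Since ‘~’ is not expanded inside the JSON, it is safest to use absolute paths for the BAMs. The sketch below lists the absolute path of every .bam file under a directory so you can copy them into the JSON; list_bam_paths is an illustrative helper name, and it searches subdirectories because the bundle output layout may nest the files.

```shell
# Print the absolute path of every .bam file under <dir>, one per line.
# list_bam_paths is an illustrative helper name; it fails if <dir> does
# not exist.
list_bam_paths() {
    [ -d "$1" ] || return 1
    # cd + pwd in a subshell turns a relative directory into an absolute one.
    find "$(cd "$1" && pwd)" -type f -name '*.bam' | sort
}

list_bam_paths input || echo "no input directory found in $(pwd)"
```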

Run it locally with the Dockstore CLI

dockstore tool launch --entry quay.io/pancancer/pcawg-bwa-mem-workflow:2.6.8_1.2 --json sample_input.json