Bioinformatics for Cancer Genomics 2019
CBW BiCG Unix and R Review Session
Introduction
Welcome to the review for Unix and R! This lab will introduce you to the the command line and using R.
After this lab, you will be able to:
- use common commands in Unix
- read data into R
- produce basic plots in R
Before We Begin
Read these directions for information on how to log in to your assigned Amazon node.
Unix Review
The unix terminal/shell/bash console is a text interface for interacting with a computer. It lets you see the folders and files saved on a computer, read and write new files, and other activities.
A lot of biological data files are very large, and running them on your laptop is not feasible. Researchers use very large computer clusters or cloud computing in order to generate, manipulate, store, and view data files. These clusters are accessed remotely from another computer, such as your laptop. Because these clusters generally require use of the unix terminal in order to perform these actions, a lot of bioinformatics software is written specifically for use in a unix terminal. Therefore it is helpful to learn the basics of navigating and using a terminal.
You can’t use a mouse and click to move the cursor–everything is done by typing. If you make a typo while typing a command, you need to use the arrow keys to navigate to the position in order to edit. A short cut to jump to the beginning of the line is ctrl + a
, and ctrl + e
will jump to the end of the line.
You can use the mouse to select and highlight text to copy/paste. To copy/paste within the terminal, you can do the following:
- Mac:
cmd + c
andcmd + v
- Windows: highlight text and click the right mouse button
- Linux:
ctrl + shift + c
andctrl + shift + v
Basic commands:
- ls: list the files in the directory:
ls
- cd: change directories:
cd workspace
- mkdir: make a new directory:
mkdir Review_Session
-
cd ..
will take you up one directory. Give it a try. - pwd: print working directory (where you are currently located):
pwd
You should be in
/home/ubuntu
. - Let’s move into the new directory we created:
cd workspace/Review_Session
- echo: print what was typed:
echo 'Hello World!'
>
redirects the output from a command into a new file:echo 'Hello World!' > test.txt
- cat: print the contents of a file:
cat test.txt
- curl: used to get contents from a URL:
curl https://raw.githubusercontent.com/bioinformaticsdotca/Genomic_Med_2017/master/test.fasta > test.fasta curl https://raw.githubusercontent.com/bioinformaticsdotca/BiCG_2017/master/Review_session/Gene_R_example.csv > Gene_R_Example.csv
- head and tail: get the beginning or end of a file:
head test.fasta tail test.fasta
- less: look at the contents of a file in a new window (handy for looking at large files when you don’t want to loose sight of your previous commands in the terminal):
less test.fasta
To exit
less
, pressq
. - cp: copy a file
cp test.fasta test2.fasta
- mv: move a file
mkdir test-files mv test2.fasta test-files ls ls test-files
You should only see test2.fa listed in the test-files directory.
- Let’s move it back:
mv test-files/test2.fasta .
The single
.
refers to the current directory. mv
can also be used to rename filesmv test2.fasta test3.fasta ls
You should see test.fasta and test3.fasta, but not test2.fasta
- rm: remove files
rm test3.fasta
Note: Removing files with
rm
permanently deletes them, so be careful! You can usingrm -i
to avoid accidentally deleting a wanted file. This forces you to confirm by typingy
. - grep: search for a matching pattern in a file:
grep TTT test.fasta
- You can use the pipe character
|
to string commands together:head test.fasta | grep TTT
- Flags are additional options for commands. They are usually indicated with a dash and a single letter:
head -n 5 test.fasta
-n
is the flag to specify the number of lines to be shown for thehead
command, and5
is the option we chose. - man: open a help page for a command:
man head
Scroll through the usage and options available for the
head
command. Pressq
to exit when done. - history: prints out the ~1000 most recently commands that you entered:
history
Useful tips:
- You can get commands that you previously used to reappear in the terminal by pressing the up arrow on your keyboard.
- If you want to quickly delete everything that you typed in the terminal, or if you want to cancel a command that is taking a long time to run, hold
ctrl
and typec
.
R Review
Connecting To RStudio
- Open an internet browser.
- In the URL bar, enter http://##.oicrcbw.ca:8080 replacing ## with your provided student number.
- Enter the supplied username and password.
Navigating RStudio
RStudio is an interactive program for reading and writing R code. It is divided into four panes that can be resized and rearranged to your preference:
- Source
- Environment, History, and Connections
- Console
- Files, Plots, Help, and Viewer
The console pane is similar to the bash/unix/terminal console. You can enter code here and see output, but you can’t save the code you’ve written. It’s useful for checking quick things, like the size of a list of genes.
The source pane is where you write and edit scripts or RStudio notebooks (where you write code that you want to save). You can select code to run and the results will show up in the console (as well as the source, if you are using a notebook).
The environment, history, and connections pane gives an overview of what items you have entered in your R session (environment) and the history of commands you’ve typed (history). (We don’t need to worry about the connections tab)
The files, plots, packages, help, and viewer pane lets you navigate folder and load files (files), view images you’ve created (plots), see what packages have been installed and loaded in your R session (packages), and look at help files for specific packages and functions (help). (We don’t need to worry about the viewer tab)
RStudio Notebooks
Information on RStudio Notebooks can be found here.
RStudio notebooks are written in R Markdown and contain text that can be executed independently.
To run code chunks, place your cursor within the code chunk and press Cmd+Shift+Enter on Mac or Crtl+Shift+Enter on Windows or Linux, or click the green triangle run button at the top right of the grey code chunk. The output of the chunk appears below the code chunk.
To start a new notebook, “File” > “New File” > “R Notebook”. It will be populated with some information about notebooks–you can delete this info if you want.
Copy the code here into the notebook you have created.
The code below is the same as the link above. If you’d prefer to work in an R script instead of an RStudio notebook, you can copy/paste what’s in the code boxes. Note that if you don’t want text notes to be interpreted as code, they have to start with #
Intro to R
R is used to manipulate various kinds of data. Data objects can be given names, called variables:
5 + 6
number1 <- 5
number2 <- 6
number1 + number2
number.sum <- number1 + number2
# print the content of a data object or variable to screen
print(number.sum)
# typing just the name of the object also prints it to screen
number.sum
Data in R are manipulated primarily with functions. Functions are called by typing the name of the function followed by round brackets. Most functions require arguments, or options, to be specified in the brackets. For print(number)
used above, print
is the name of the function, and number.sum
is the variable option that we want to print.
If you are not sure what a specific function does, or if you need a reminder on what the arguments for a funtion are, you can view the help page by typing ?function.name
without the round brackets:
?print
Note: The help page will open in the “Help” tab in the Files/Plots/Packages/Help/Viewer pane.
Getting Around
The Hard Way
# get the current working directory:
getwd()
# set a new working directory:
setwd("~/workspace/Review_Session")
# to set a working directory on your own computer, you can do one of the following:
# (these commands are commented out so that these code chunks will run on AWS RStudio)
# setwd("C:/myPATH") # on Windows
# setwd("~/myPATH") # on Mac
# setwd("/Users/david/myPATH") # on Mac
# list the files in the current working directory:
list.files()
# list objects in the current R session:
ls()
The Easy Way
In RStudio we can use “Session” > “Set Working Directory” > “Choose Directory”.
Data Types
Vectors
Vectors contain multiple pieces of data. The elements of a vector must all be of the same type (numeric, logical, character).
The c()
function combines the items inside the round brackets and can be used to create a new vector:
numeric.vector <- c(1,2,3,4,5,6,2,1)
numeric.vector
character.vector <- c("Fred", "Barney", "Wilma", "Betty")
character.vector
logical.vector <- c(TRUE, TRUE, FALSE, TRUE)
logical.vector
To refer to specific elements in the vector, use square brackets.
Because vectors are one dimensional (unlike a two-dimensional matrix), only a single number can be specified in the brackets.
If you want more than one element, you can specify a range (with :
) or a vector (with c()
):
character.vector
character.vector[2]
character.vector[2:3]
character.vector[c(2,4)]
Matrices
You can create a 3x4 numeric matrix with:
matrix.example <- matrix(1:12, nrow = 3, ncol=4, byrow = FALSE)
matrix.example
matrix.example <- matrix(1:12, nrow = 3, ncol=4, byrow = TRUE)
matrix.example
Alternatively, you can create a matrix by combining vectors:
dataset.a <- c(1,22,3,4,5)
dataset.b <- c(10,11,13,14,15)
dataset.a
dataset.b
rbind.together <- rbind(dataset.a, dataset.b)
rbind.together
cbind.together <- cbind(dataset.a, dataset.b)
cbind.together
To get elements of the matrix, square brackets are used again, but two dimensions must be specified. To do this, a comma is used, where the number before the comma is the row number and the number after the comma is the column number. If you want to specify all of the rows, leave the space before the comma blank. If you want to specify all of the columns, leave the space after the comma blank:
matrix.example[2,4]
matrix.example[2,]
matrix.example[,4]
You can add column and row names to the matrix and use the new names to get the elements of the matrix:
colnames(matrix.example) <- c("Sample1","Sample2","Sample3","Sample4")
rownames(matrix.example) <- paste("gene",1:3,sep="_")
matrix.example
matrix.example[,"Sample2"]
matrix.example[1,"Sample2"]
matrix.example["gene_1","Sample2"]
Note that all columns in a matrix must have the same type(numeric, character, etc.) and the same length.
Dataframes
Dataframes are similar to matrices, but different columns can have different types (numeric, character, factor, etc.):
people.summary <- data.frame(
age = c(30,29,25,25),
names = c("Fred", "Barney", "Wilma", "Betty"),
gender = c("m", "m", "f", "f")
)
people.summary
Getting elements of a dataframe is similar to getting elements of a matrix:
people.summary[2,1]
people.summary[2,]
people.summary[,1]
An easier way to specify a dataframe column by name is by using $
:
people.summary$age
Lists
Lists gather together a collection of objects under one name:
together.list <- list(
vector.example = dataset.a,
matrix.example = matrix.example,
data.frame.example = people.summary
)
together.list
There are several ways to get elements of a list:
together.list$matrix.example
together.list$matrix.example[,3]
together.list["matrix.example"]
together.list[["matrix.example"]]
together.list[["matrix.example"]][,2]
Reading Data In
We use read.data
or read.csv
to read in data:
gene_example <- read.csv("Gene_R_Example.csv")
In RStudio, we can use the “File” navigation window instead. Navigate to the directory containing the Gene_R_example.csv that we downloaded previously. Click on the file name then click “Import Dataset.” A new window appears allowing you to modify attributes of your file. Rename the file to the object “gene_example”.
Commands like head
and tail
also work in R. View
will open the dataframe in a new window:
head(gene_example)
View(gene_example)
Basic Plotting
A very basic plot:
plot(x=gene_example$Control, y=gene_example$Treated)
You can specify how you want your plot to look, in terms of color, shape, labels, etc. A nicer plot:
plot(x=gene_example$Control, y=gene_example$Treated,
xlab = "Control",
ylab = "Treated",
cex.lab = 1.5,
main = "A nice scatter plot",
pch = 16,
bty = "n",
col = "dark blue",
las = 1
)
# las
# To change the axes label style, use the graphics option las (label style). This changes the orientation angle of the labels:
# 0: The default, parallel to the axis
# 1: Always horizontal
# 2: Perpendicular to the axis
# 3: Always vertical
# bty
# To change the type of box round the plot area, use the option bty (box type):
# "o": The default value draws a complete rectangle around the plot.
# "n": Draws nothing around the plot.
Connecting the dots:
plot(x=gene_example$Control, y=gene_example$Treated,
xlab = "Control",
ylab = "Treated",
cex.lab = 1.5,
main = "A nice scatter plot",
pch = 16,
bty = "n",
col = "dark blue",
type = "b",
las = 1
)
# type
# The 'type' option changes what the data points are plotted with.
# "p": points
# "l": lines
# "b": both (points and lines)
# for more options, see the plot() help page
Histograms
hist(gene_example$Control)
hist(gene_example$Control,
xlab = "Expression",
ylab = "Number of Genes",
cex.lab = 1.5,
main = "A nice histogram",
col = "cyan",
breaks = 10,
las = 1
)
Boxplots
boxplot(gene_example[,2:3])
boxplot(gene_example[,2:3],
width = c(3,1),
col = "red",
border = "dark blue",
names = c("Control", "Treatment"),
main = "My boxplot",
notch = TRUE,
horizontal = TRUE
)
Saving your plots as PDFs
pdf("myfigure.pdf", height=10, width=6)
par(mfrow=c(2,1))
plot(x=gene_example$Control, y=gene_example$Treated,
xlab = "Control",
ylab = "Treated",
cex.lab = 1.5,
main = "A nice scatter plot",
pch = 16,
bty = "n",
col = "dark blue",
type = "b",
las = 1
)
boxplot(gene_example[,2:3],
width = c(3,1),
col = "red",
border = "dark blue",
names = c("Control", "Treatment"),
main = "My boxplot",
notch = TRUE,
horizontal = TRUE
)
dev.off()
If you don’t specify a path in the pdf() function like we did above, the pdf will be saved in your working directory. In a new internet tab, go to http://##.oicrcbw.ca/Review_Session/myfigure.pdf to view the saved image.
To save your RStudio notebook, go to “File” > “Save As…” and enter a name.