AWS and Unix Intro - Module 4

Module 4: Searching and sorting files!

Preamble

The exercies here are modified versions of the Software Carpentry Unix shell lesson. Licensed under CC-BY 4.0 2018–2021 by The Carpentries.

Setup

The first step will be to download some example data. First move into your home directory, download the example data as a zip file and then unzip the file:

cd ~
wget http://bioinformaticsdotca.github.io/AWS_2021/data/data.zip
unzip data.zip
cd data

Data Exploration

We’ll begin by looking at files are in Genbank gbff. This is a text file that describes the nucleotide sequence and annotation features on those nucleotide sequences. First of all, we run the ls command to view the names of the files in the ecoli_genomes directory:

$ cd data
$ ls genomes
atlanta.gbff      braunschweig.gbff lab-strain.gbff   london.gbff       muenster.gbff     nevada.gbff       texas.gbff

Let’s go into that directory with cd and run an example command wc london.gbff:

$ cd genomes
$ wc london.gbff
212462  899913 13993868 london.gbff

wc is the ‘word count’ command: it counts the number of lines, words, and characters in files (from left to right, in that order).

If we run the command wc *.gbff, the * wildcard matches zero or more occurences of any character, so the shell turns *.gbff into a list of all gbff files in the current directory:

$ wc *.gbff
  217040  915629 14251456 atlanta.gbff
  204528  837084 13332241 braunschweig.gbff
  191859  685480 11851718 lab-strain.gbff
  212462  899913 13993868 london.gbff
  202036  823021 13167730 muenster.gbff
  216788  883045 14082598 nevada.gbff
  208514  849571 13555200 texas.gbff
 1453227 5893743 94234811 total

Note that wc *.gbff also shows the total number of all lines in the last line of the output.

If we run wc -l instead of just wc, the output shows only the number of lines per file:

$ wc -l *.gbff
  217040 atlanta.gbff
  204528 braunschweig.gbff
  191859 lab-strain.gbff
  212462 london.gbff
  202036 muenster.gbff
  216788 nevada.gbff
  208514 texas.gbff
 1453227 total

Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:

$ wc -l *.gbff > lengths.txt

The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. (This is why there is no screen output: everything that wc would have printed has gone into the file lengths.txt instead.) The shell will create the file if it doesn’t exist. If the file exists, it will be silently overwritten, which may lead to data loss and thus requires some caution. ls lengths.txt confirms that the file exists:

$ ls lengths.txt
lengths.txt

We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from ‘concatenate’ i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:

$ cat lengths.txt
  217040 atlanta.gbff
  204528 braunschweig.gbff
  191859 lab-strain.gbff
  212462 london.gbff
  202036 muenster.gbff
  216788 nevada.gbff
  208514 texas.gbff
 1453227 total

Sorting

The sort command rearranges the lines in a file in order. There are different methods of sorting - lexigraphically (a-z1-9) or numerically. The default sort type is lexigraphically, where numbers are treated one character at a time. Given a hypothetical file “numbers.txt” that looks like:

cd ../sorting
cat numbers.txt
10
2
19
22
6

If we run sort on this file:

sort numbers.txt
10
19
2
22
6

If we run sort -n on the same input - specifying that we want to sort numerically, we get this instead:

sort -n numbers.txt
2
6
10
19
22

Explain why -n has this effect.

**Solution** (click here) The -n option specifies a numerical rather than an alphanumerical sort.

We will sort our lengths.txt file using the -n option to specify that the sort is numerical instead of alphanumerical. Note that running sort does not modify the file; instead, it sends the sorted result to the screen:

cd ../genomes
$ sort -n lengths.txt
  191859 lab-strain.gbff
  202036 muenster.gbff
  204528 braunschweig.gbff
  208514 texas.gbff
  212462 london.gbff
  216788 nevada.gbff
  217040 atlanta.gbff
 1453227 total

We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:

$ sort -n lengths.txt > sorted-lengths.txt
$ head -n 1 sorted-lengths.txt
  191859 lab-strain.gbff

Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.

What Does » Mean?

We have seen the use of >, but there is a similar operator >> which works slightly differently. We’ll learn about the differences between these two operators by printing some strings. We can use the echo command to print strings e.g.

$ echo The echo command prints text
The echo command prints text

Now test the commands below to reveal the difference between the two operators:

$ echo hello > testfile01.txt

and:

$ echo hello >> testfile01.txt

Running Commands Together

If you think this is confusing, you’re in good company: even once you understand what wc, sort, and head do, all those intermediate files make it hard to follow what’s going on. We can make it easier to understand by running sort and head together:

$ sort -n lengths.txt | head -n 1
  191859 lab-strain.gbff

The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.

Nothing prevents us from chaining pipes consecutively. That is, we can for example send the output of wc directly to sort, and then the resulting output to head. Thus we first use a pipe to send the output of wc to sort:

$ wc -l *.gbff | sort -n
  191859 lab-strain.gbff
  202036 muenster.gbff
  204528 braunschweig.gbff
  208514 texas.gbff
  212462 london.gbff
  216788 nevada.gbff
  217040 atlanta.gbff
 1453227 total

And now we send the output of this pipe, through another pipe, to head, so that the full pipeline becomes:

$ wc -l *.gbff | sort -n | head -n 1
  191859 lab-strain.gbff

The redirection and pipes used in the last few commands are illustrated below:

redirects_and_pipes

Piping Commands Together

In our current directory, we want to find the 3 files which have the least number of lines. Which command listed below would work?

$ wc -l * > sort -n > head -n 3
$ wc -l * | sort -n | head -n 1-3
$ wc -l * | head -n 3 | sort -n
$ wc -l * | sort -n | head -n 3

This idea of linking programs together is why Unix has been so successful. Instead of creating enormous programs that try to do many different things, Unix programmers focus on creating lots of simple tools that each do one job well, and that work well with each other. This programming model is called ‘pipes and filters’. We’ve already seen pipes; a filter is a program like wc or sort that transforms a stream of input into a stream of output. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read from standard input, do something with what they’ve read, and write to standard output.

The key is that any program that reads lines of text from standard input and writes lines of text to standard output can be combined with every other program that behaves this way as well. You can and should write your programs this way so that you and other people can put those programs into pipes to multiply their power.

Pipe Reading Comprehension

A file called annotation-dates.txt (in the data/collection folder) contains the annotation dates for our strains in CSV format. Note the file contains some duplicate lines:

2021-05-23,atlanta
2021-05-19,branschweig
2021-05-23,london
2021-05-23,london
2021-05-26,muenster
2021-05-27,nevada
2021-05-30,texas
2021-05-30,texas
2004-06-10,lab-strain

What text passes through each of the pipes and the final redirect in the pipeline below?

$ cd data/collection
$ cat annotation-dates.txt | head -n 5 | tail -n 3 | sort -r > final.txt
**Solution** (click here)

The head command extracts the first 5 lines from annotation-dates.txt. Then, the last 3 lines are extracted from the previous 5 by using the tail command. With the sort -r command those 3 lines are sorted in reverse order and finally, the output is redirected to a file final.txt. The content of this file can be checked by executing cat final.txt. The file should contain the following lines:

  2021-05-26,muenster
2021-05-23,london
2021-05-23,london
  

Pipe Construction

For the file annotation-dates.txt from the previous exercise, consider the following command:

$ cut -d , -f 2 annotation-dates.txt

The cut command is used to remove or ‘cut out’ certain sections of each line in the file, and cut expects the lines to be separated into columns by a Tab character. A character used in this way is a called a delimiter. In the example above we use the -d option to specify the comma as our delimiter character. We have also used the -f option to specify that we want to extract the second field (column). This gives the following output:

atlanta
branschweig
london
london
muenster
nevada
texas
texas
lab-strain

The uniq command filters out adjacent matching lines in a file. How could you extend this pipeline (using uniq and another command) to find out what animals the file contains (without any duplicates in their names)?

**Solution** (click here)
$ cut -d , -f 2 annotation-dates.txt | sort | uniq

Which Pipe?

The uniq command has a -c option which gives a count of the number of times a line occurs in its input. Assuming your current directory is data/collection, what command would you use to produce a table that shows the total number of times each E. coli strain appears in the file?

  1. sort annotation-dates.txt | uniq -c
  2. sort -t, -k2,2 annotation-dates.txt | uniq -c
  3. cut -d, -f 2 annotation-dates.txt | uniq -c
  4. cut -d, -f 2 annotation-dates.txt | sort | uniq -c
  5. cut -d, -f 2 annotation-dates.txt | sort | uniq -c | wc -l
**Solution** (click here) Option 4. is the correct answer. If you have difficulty understanding why, try running the commands, or sub-sections of the pipelines (make sure you are in the `data/collections` directory).

Checking Files

Let’s say our collaborator has created 17 files in the north-pacific-gyre/2012-07-03 directory. As a quick check, starting from the data directory, if we type:

$ cd north-pacific-gyre/2012-07-03
$ wc -l *.txt

The output is 18 lines that look like this:

300 NENE01729A.txt
300 NENE01729B.txt
300 NENE01736A.txt
300 NENE01751A.txt
300 NENE01751B.txt
300 NENE01812A.txt
...

Now if we run

$ wc -l *.txt | sort -n | head -n 5
  240 NENE02018B.txt
  300 NENE01729A.txt
  300 NENE01729B.txt
  300 NENE01736A.txt
  300 NENE01751A.txt

Whoops: one of the files is 60 lines shorter than the others. When we goes back and checks it, we sees that assay at 8:00 on a Monday morning — someone was probably in using the machine on the weekend, and forgot to reset it. Before re-running that sample, lets checks to see if any files have too much data:

$ wc -l *.txt | sort -n | tail -n 5
 300 NENE02040B.txt
 300 NENE02040Z.txt
 300 NENE02043A.txt
 300 NENE02043B.txt
5040 total

Those numbers look good — but what’s that ‘Z’ doing there in the third-to-last line? All of her samples should be marked ‘A’ or ‘B’; by convention, her lab uses ‘Z’ to indicate samples with missing information. To find others like it, we can:

$ ls *Z.txt
NENE01971Z.txt    NENE02040Z.txt

It turns outt hat there’s no depth recorded for either of those samples. Since it’s too late to get the information any other way, we must exclude those two files from our analysis. We could delete them using rm, but there are actually some analyses we might do later where depth doesn’t matter, so instead, we’ll have to be careful later on to select files using the wildcard expression *[AB].txt. As always, the * matches any number of characters; the expression [AB] matches either an ‘A’ or a ‘B’, so this matches all the valid data files she has.

Wildcard Expressions

Wildcard expressions can be very complex, but you can sometimes write them in ways that only use simple syntax, at the expense of being a bit more verbose. Consider the directory data/north-pacific-gyre/2012-07-03 : the wildcard expression *[AB].txt matches all files ending in A.txt or B.txt. Imagine you forgot about this.

Can you match the same set of files with basic wildcard expressions that do not use the [] syntax? Hint: You may need more than one command, or two arguments to the ls command.

If you used two commands, the files in your output will match the same set of files in this example. What is the small difference between the outputs?

If you used two commands, under what circumstances would your new expression produce an error message where the original one would not?

**Solution** (click here) 1: A solution using two wildcard commands: ls *A.txt and then ls *B.txt A solution using one command but with two arguments: ls *A.txt *B.txt 2: The output from the two new commands is separated because there are two commands. 3: When there are no files ending in `A.txt`, or there are no files ending in `B.txt`, then one of the two commands will fail.

Key Points

  • cat displays the contents of its inputs.
  • head displays the first 10 lines of its input.
  • tail displays the last 10 lines of its input.
  • sort sorts its inputs.
  • wc counts lines, words, and characters in its inputs.
  • command > [file] redirects a command’s output to a file (overwriting any existing content).
  • command >> [file] appends a command’s output to a file.
  • [first] | [second] is a pipeline: the output of the first command is used as the input to the second.
  • The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).