Note: This lesson is based on Chapter 12 of Bioinformatics Data Skills (http://shop.oreilly.com/product/0636920030157.do) by Vince Buffalo (https://github.com/vsbuffalo)
1 of 13 11/30/21, 11:42
Intro to UNIX: Shell Scripting, Writing Pipelines, a... https://data-skills.github.io/unix-and-bash/03-bash-s...
✏ Challenge 1
As you may remember, Unix provides different flavors of shells. How can you check which shell you are currently using?
Solution to Challenge 1
$SHELL gives you the default shell. $0 gives you the current shell.
echo $0
bash
#!/bin/bash
set -e
set -u
set -o pipefail
• The first line is called the shebang, and indicates the path to the interpreter used to execute this script. Although the shebang line is only required when running the script as a program, it's best to always include it.
• set -e tells the script to terminate if any command exits with a nonzero exit status. Note, however, that set -e has complex rules to accommodate cases in which a nonzero exit status indicates "false" rather than failure.
• set -u tells Bash not to run any command that references an unset variable (try, for example, echo "rm $NOTSET/*.*").
• set -o pipefail covers an exception to set -e: by default, a pipeline's exit status is that of its last command, so a command failing earlier in the pipeline would otherwise go unnoticed. With pipefail set, the pipeline returns a nonzero status if any command in it fails.
These three options are the first layer of protection against Bash scripts with silent errors and unsafe behavior.
Unlike many other programming languages, Bash's variables don't have data types. It's helpful to think of Bash's variables as strings (which may behave differently depending on context). We can create a variable and assign it a value as follows (note that spaces matter when setting Bash variables: do not put spaces around the equals sign!):
results_dir="results/"
To access a variable’s value, we use a dollar sign in front of the variable’s name (e.g., $results_dir). You can experiment with
this in a Bash script, or directly on the command line:
results_dir="results/"
echo $results_dir
results/
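To see why the spaces matter, try adding them: Bash then parses the first word as a command name rather than an assignment (a small demonstration; the variable name is just the one from the example above):

```shell
# With spaces, Bash looks for a command called "results_dir" instead of assigning
results_dir = "results/" 2>/dev/null || echo "error: 'results_dir' is not a command"
```

The assignment never happens; Bash reports that no command named results_dir exists.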
Even though accessing a variable’s value using the dollar sign syntax works, in some cases it’s not clear where a variable
name ends and where an adjacent string begins. To fix this, wrap the variable name in braces:
sample="CNTRL01A"
mkdir ${sample}_aln/
In addition, quoting variables makes your code more robust by preventing commands from interpreting any spaces or other
special characters that the variable may contain:
sample="CNTRL01A"
mkdir "${sample}_aln/"
echo '
#!/bin/bash
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh
bash args.sh arg1 arg2 arg3
Bash assigns the number of command-line arguments to the variable $# (this does not count the script name, $0, as an
argument). This is useful for user-friendly messages (this uses a Bash if conditional, which we’ll cover in more depth in the
next section):
echo '
#!/bin/bash
if [ "$#" -lt 3 ] # are there fewer than 3 arguments?
then
echo "error: too few arguments, you provided $#, 3 required"
echo "usage: script.sh arg1 arg2 arg3"
exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh
bash args.sh arg1 arg2
Note, that variables created in your Bash script will only be available for the duration of the Bash process running that script.
For example, running a script that creates a variable with some_var=3 will not create some_var in your current shell, as the
script runs in an entirely separate shell process.
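You can check this yourself; the script below assigns a variable, yet the parent shell never sees it (set_var.sh is a throwaway name used only for this demonstration):

```shell
# write a one-line script that sets a variable, then run it in a child process
echo 'some_var=3' > set_var.sh
bash set_var.sh
# the assignment happened in the child shell only; fall back to "unset" here
echo "in current shell: ${some_var:-unset}"
```

The last line prints "in current shell: unset", confirming the variable lived only in the child process.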
if [commands]
then
    [if-statements]
else
    [else-statements]
fi
where [commands] is a placeholder for any command, set of commands, pipeline, or test condition; [if-statements] is a placeholder for all statements executed if [commands] evaluates to true (0); and [else-statements] is a placeholder for all statements executed if [commands] evaluates to false (1). The else block is optional.
This is an advantage Bash has over Python when writing pipelines: Bash allows your scripts to directly work with command-
line programs without requiring any overhead to call programs.
For example, suppose we wanted to run a set of commands only if a file contains a certain string. Because grep returns 0 only
if it matches a pattern in a file and 1 otherwise, we could use:
#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
    # commands to run if "pattern" is found
    echo "found 'pattern' in 'some_file.txt'"
fi
The set of commands in an if condition can use all features of Unix we’ve mastered so far. For example, chaining commands
with logical operators like && (logical AND) and || (logical OR):
#!/bin/bash
if grep "pattern" file_1.txt > /dev/null && grep "pattern" file_2.txt > /dev/null
then echo "found 'pattern' in 'file_1.txt' and in 'file_2.txt'"
fi
# We can also negate our program’s exit status with !:
# if ! grep "pattern" some_file.txt > /dev/null
# then echo "did not find 'pattern' in 'some_file.txt'"
# fi
Finally, it’s possible to use pipelines in if condition statements. Note, however, that the behavior depends on set -o pipefail. With pipefail set, a nonzero exit status from any command in the pipeline makes the condition false, so the if-statements section is skipped (and the else block, if it exists, runs instead).
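A quick demonstration of this: below, grep fails because the file does not exist, but head (the last command in the pipeline) succeeds; with pipefail set, the pipeline's status is nonzero and the else branch runs (the filename is just a placeholder):

```shell
#!/bin/bash
set -o pipefail
# grep fails (the file does not exist) but head, the last command, succeeds;
# with pipefail the pipeline's status is nonzero, so the else branch runs
if grep "pattern" no_such_file.txt 2>/dev/null | head -n 1 > /dev/null
then
    echo "condition true"
else
    echo "condition false: a command in the pipeline failed"
fi
```

Without set -o pipefail, the same script would take the then branch, because head's exit status of 0 would mask grep's failure.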
The test command
The final component necessary to understand Bash’s if statements is the test command. Like other programs, test exits with either 0 or 1. However, test’s exit status indicates the return value of the test specified through its arguments, rather than success or error. test supports numerous standard comparison operators (whether two strings are equal, whether two integers are equal, whether one integer is greater than or equal to another, etc.), which is needed because Bash can’t rely on familiar syntax such as > for “greater than,” as this is used for redirection: instead, test has its own syntax (see Table 12-1 for a full list). You can get a sense of how test works by playing with it directly on the command line (using ; echo "$?" to print the exit status):
test "ATG" = "ATG" ; echo "$?"
0
test "ATG" = "atg" ; echo "$?"
1
test 3 -lt 1 ; echo "$?"
1
test 3 -le 3 ; echo "$?"
0
In practice, the most common tests are for whether files or directories exist and whether you can write to them. test supports numerous file- and directory-related test operations (the few that are most useful in bioinformatics are in Table 12-2). For example:
if test -f some_file.txt
then [...]
fi
However, Bash provides a simpler syntactic alternative to the test statements: [ -f some_file.txt ] . Note that spaces
around and within the brackets are required. This makes for much simpler if statements involving comparisons:
if [ -f some_file.txt ]
then [...]
fi
When using this syntax, we can chain test expressions with -a as logical AND, -o as logical OR, ! as negation, and
parentheses to group statements. Our familiar && and || operators won’t work in test, because these are shell operators. As
an example, suppose we want to ensure our script has enough arguments and that the input file is readable:
#!/bin/bash
set -e
set -u
set -o pipefail
As discussed earlier, we quote variables (especially those from human input); this is a good practice and prevents issues with
special characters.
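Putting the pieces together, the check described above might look like the following sketch (modeled on the chapter's example; with set -u in effect, we use ${1:-} so a missing argument doesn't trigger an unbound-variable error before the usage message prints):

```shell
# save the check as a script (check_args.sh is a placeholder name)
cat > check_args.sh <<'EOF'
#!/bin/bash
set -e
set -u
set -o pipefail
# require exactly one argument naming a readable file; ${1:-} avoids an
# unbound-variable error under set -u when no argument is given
if [ "$#" -ne 1 -o ! -r "${1:-}" ]
then
    echo "usage: script.sh file_in.txt"
    exit 1
fi
echo "processing $1"
EOF
printf 'some data\n' > file_in.txt
bash check_args.sh file_in.txt   # prints: processing file_in.txt
bash check_args.sh || true       # no argument: prints the usage message
```

Note the chained test: -ne checks the argument count, -o is logical OR, and ! -r negates the readability test, so either problem triggers the usage message.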
There are three essential parts to creating a pipeline to process a set of files: (1) selecting which files to apply the commands to, (2) looping over the files and applying the commands, and (3) keeping track of the names of any output files created.
There are two common ways to select files: either start with a text file containing information about samples or directly select
files in directories using some criteria.
In the first approach, you may have a file called samples.txt that tells you basic information about your raw data: sample name, read pair, and where the file is. Here’s an example (which is also in the book’s chapter 12 directory on GitHub):
cat ../data/samples.txt
zmaysA R1 seq/zmaysA_R1.fastq
zmaysA R2 seq/zmaysA_R2.fastq
zmaysB R1 seq/zmaysB_R1.fastq
zmaysB R2 seq/zmaysB_R2.fastq
zmaysC R1 seq/zmaysC_R1.fastq
zmaysC R2 seq/zmaysC_R2.fastq
The first two columns are called metadata (data about data), which is vital to relating sample information to their physical files.
Note that the metadata is also in the filename itself, which is useful because it allows us to extract it from the filename if we
need to. We have to loop over these data, and do so in a way that keeps the samples straight. Suppose that we want to loop
over every file, gather quality statistics on each and every file (using the imaginary program fastq_stat), and save this
information to an output file. Each output file should have a name based on the input file.
First, we load our filenames into a Bash array, which we can then loop over. Bash arrays can be created manually, and specific elements can be extracted by index:
sample_names=(zmaysA zmaysB zmaysC)
echo ${sample_names[0]}
zmaysA
echo ${sample_names[2]}
zmaysC
All elements can be extracted with the cryptic-looking echo ${sample_names[@]}, and the number of elements can be printed with echo ${#sample_names[@]}.
Better yet, we can use a command substitution to construct Bash arrays. Because we want to loop over each file, we need to
extract the third column using cut -f 3 from samples.txt. Demonstrating this in the shell:
sample_files=($(cut -f 3 ../data/samples.txt))
echo ${sample_files[@]}
seq/zmaysA_R1.fastq seq/zmaysA_R2.fastq seq/zmaysB_R1.fastq seq/zmaysB_R2.fastq seq/zmaysC_R1.fastq seq/zmaysC_R2.fastq
IMPORTANT:
This approach only works if our filenames contain only alphanumeric characters, underscores (_), and dashes (-)! If spaces, tabs, newlines, or special characters like * end up in filenames, this approach will break.
With our filenames in a Bash array, we’re almost ready to loop over them. The last component is to strip the path and
extension from each filename, leaving us with the most basic filename we can use to create an output filename. The Unix
program basename strips paths from filenames:
basename seqs/zmaysA_R1.fastq
zmaysA_R1.fastq
basename can also strip a suffix (e.g., an extension) provided as a second argument (or, alternatively, with the -s argument):
basename seqs/zmaysA_R1.fastq .fastq
zmaysA_R1
Now, all the pieces are ready to construct our processing script:
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt
sample_files=($(cut -f 3 "$sample_info"))
for fastq_file in ${sample_files[@]}
do
    # strip the path and .fastq extension, then add "-stats.txt" to create an output filename
    results_file="$(basename $fastq_file .fastq)-stats.txt"
    fastq_stat $fastq_file > stats/$results_file
done
A more refined script might add a few extra features, such as using an if statement to provide a friendly error if a FASTQ file
does not exist or a call to echo to report which sample is currently being processed.
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt
# our reference
reference=zmays_AGPv3.20.fa
# extract sample names from the first column, removing duplicates with uniq
sample_names=($(cut -f 1 "$sample_info" | uniq))
for sample in ${sample_names[@]}
do
    results_file="${sample}.sam"
    bwa mem $reference ${sample}_R1.fastq ${sample}_R2.fastq > $results_file
done
Here we use cut to grab the first column (corresponding to sample names) and pipe these sample names to uniq to remove duplicates (the first column repeats each sample name twice, once for each paired-end file). As before, we create an output filename for the current sample being iterated over; in this case, all that’s needed is the sample name stored in $sample. Our call to bwa provides the reference and the two paired-end FASTQ files for this sample as input. Finally, the output of bwa is redirected to $results_file.
Globbing
Finally, in some cases it might be easier to directly loop over files, rather than working with a file containing sample information like samples.txt. The easiest (and safest) way to do this is to use Bash’s wildcards to glob the files to loop over. The syntax is quite simple:
#!/bin/bash
set -e
set -u
set -o pipefail
for file in *.fastq
do
echo "$file: " $(bioawk -c fastx 'END {print NR}' "$file")
done
Bash’s loops are a handy way of applying commands to numerous files, but have a few downsides. First, globbing is not the
most powerful way to select certain files (see next section). Second, Bash’s loop syntax is lengthy for simple operations, and a
bit archaic. Finally, there’s no easy way to parallelize Bash loops in a way that constrains the number of subprocesses used.
We’ll see a powerful file-processing Unix idiom in the next section that works better for some tasks where Bash scripts may
not be optimal.
ls *.fq | process_fq
Your shell expands this wildcard to all matching files in the current directory, and ls prints these filenames. Unfortunately, this leads to a common complication that makes ls and wildcards a fragile solution. Suppose your directory contains a file called treatment 02.fq. In this case, ls returns treatment 02.fq along with other files. However, because arguments are separated by spaces, and this filename contains a space, process_fq will interpret treatment 02.fq as two separate files, named treatment and 02.fq. This problem crops up periodically in different ways, and you need to be aware of it when writing file-processing pipelines. Note that this does not occur with file globbing in arguments: if process_fq takes multiple files as arguments, your shell handles this properly:
process_fq *.fq
Here, your shell automatically escapes the space in the filename treatment 02.fq, so process_fq will correctly receive the arguments treatment-01.fq, treatment 02.fq, treatment-03.fq. The potential problem here is that there is a limit to the number of files that can be specified as arguments. The limit is high, but you can reach it with NGS data, in which case you may get a message like "Argument list too long". The solution to both of these problems is to use `find` and `xargs`, as we will see in the following sections.
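You can see your own system's limit (the maximum combined length, in bytes, of arguments plus environment for a new process) with getconf:

```shell
# print the maximum combined length of arguments and environment for a new process
getconf ARG_MAX
```

The value varies by system, but a directory of NGS files can expand to a wildcard long enough to exceed it.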
Unlike ls, find is recursive (it will search down through the directory hierarchy). In fact, running find on a directory (without other arguments) is a quick way to see its structure, e.g.:
/
/research
/sw
/WebApps
/WebApps/hello.wsgi
/.HFS+ Private Directory Data
/home
/libpeerconnection.log.0.gz
/usr
/usr/bin
Argument -maxdepth limits the depth of the search: to search only within the current directory, use find -maxdepth 1 .
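For example, with -maxdepth 1, find stays in the top directory (the demo/ directories below are created just for illustration):

```shell
# create a small nested directory tree for the demonstration
mkdir -p demo/sub/inner
# lists demo and demo/sub, but not demo/sub/inner
find demo -maxdepth 1
```

Without -maxdepth, find would also descend into demo/sub and report demo/sub/inner.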
For the following exercises, I’ll create a small directory system in my current directory:
mkdir -p zmays-snps/{data/seqs,scripts,analysis}
touch zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq
find zmays-snps
zmays-snps
zmays-snps/analysis
zmays-snps/data
zmays-snps/data/seqs
zmays-snps/data/seqs/zmaysA_R1.fastq
zmays-snps/data/seqs/zmaysA_R2.fastq
zmays-snps/data/seqs/zmaysB_R1.fastq
zmays-snps/data/seqs/zmaysB_R2.fastq
zmays-snps/data/seqs/zmaysC_R1.fastq
zmays-snps/data/seqs/zmaysC_R2.fastq
zmays-snps/scripts
Find expressions
Find’s expressions are built from predicates, which are chained together by logical AND and OR operators. Through
expressions, find can match files based on conditions such as creation time or the permissions of the file, as well as advanced
combinations of these conditions.
For example, we can use find to print the names of all files matching the pattern “zmaysB*fastq” (e.g., FASTQ files from sample “B”, both read pairs):
find zmays-snps/data/seqs -name "zmaysB*fastq"
zmays-snps/data/seqs/zmaysB_R1.fastq
zmays-snps/data/seqs/zmaysB_R2.fastq
This gives similar results to ls zmaysB*fastq , as we’d expect. The primary difference is that find reports results separated
by newlines and, by default, find is recursive.
Because we only want to return FASTQ files (and not directories with a matching name), we might want to limit our results using the -type option:
find zmays-snps/data/seqs -name "zmaysB*fastq" -type f
There are numerous different types you can search for; the most commonly used are f for files, d for directories, and l for links.
By default, find connects different parts of an expression with logical AND. The find command in this case returns results
where the name matches “zmaysB*fastq” and is a file (type “f ”). find also allows explicitly connecting different parts of an
expression with different operators.
If we want to get the names of all FASTQ files from samples A or C, we’ll use the operator -or to chain expressions:
find zmays-snps/data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq"
zmays-snps/data/seqs/zmaysA_R1.fastq
zmays-snps/data/seqs/zmaysA_R2.fastq
zmays-snps/data/seqs/zmaysC_R1.fastq
zmays-snps/data/seqs/zmaysC_R2.fastq
Let’s see another example. Suppose a file named zmaysB_R1-temp.fastq was created by your colleague in zmays-snps/data/seqs, but you want to ignore it in your file querying:
touch zmays-snps/data/seqs/zmaysB_R1-temp.fastq
find zmays-snps/data/seqs -type f "!" -name "zmaysC*fastq" -and "!" -name "*-temp*"
zmays-snps/data/seqs/zmaysA_R1.fastq
zmays-snps/data/seqs/zmaysA_R2.fastq
zmays-snps/data/seqs/zmaysB_R1.fastq
zmays-snps/data/seqs/zmaysB_R2.fastq
Note that find’s operators like !, (, and ) should be quoted to prevent your shell from interpreting them.
touch zmays-snps/data/seqs/zmays{A,C}_R{1,2}-temp.fastq
ls zmays-snps/data/seqs
zmaysA_R1-temp.fastq
zmaysA_R1.fastq
zmaysA_R2-temp.fastq
zmaysA_R2.fastq
zmaysB_R1-temp.fastq
zmaysB_R1.fastq
zmaysB_R2.fastq
zmaysC_R1-temp.fastq
zmaysC_R1.fastq
zmaysC_R2-temp.fastq
zmaysC_R2.fastq
Although we could delete these files with rm *-temp.fastq, using rm with a wildcard in a directory filled with important data files is too risky. Using find’s -exec is a much safer way to delete these files.
For example, let’s use find -exec and rm to delete these temporary files:
find zmays-snps/data/seqs -name "*-temp.fastq" -exec rm {} \;
Notice the (required!) semicolon and curly brackets at the end of the command!
In one line, we’re able to programmatically identify and execute a command on files that match a certain pattern. With find and -exec, a daunting task like processing a directory of 100,000 text files with a program becomes simple.
In general, find -exec is most appropriate for quick, simple tasks (like deleting files or changing permissions). For larger tasks, xargs (which we’ll see next) is a better choice.
xargs allows us to take input from standard input and use it as arguments to another program, which lets us build commands programmatically. Using find with xargs is much like find -exec, but with some added advantages that make xargs a better choice for larger tasks.
touch zmays-snps/data/seqs/zmays{A,C}_R{1,2}-temp.fastq
ls zmays-snps/data/seqs
zmaysA_R1-temp.fastq
zmaysA_R1.fastq
zmaysA_R2-temp.fastq
zmaysA_R2.fastq
zmaysB_R1.fastq
zmaysB_R2.fastq
zmaysC_R1-temp.fastq
zmaysC_R1.fastq
zmaysC_R2-temp.fastq
zmaysC_R2.fastq
xargs works by taking input from standard input and splitting it by spaces, tabs, and newlines into arguments. Then, these arguments are passed to the supplied command. For example, to emulate the behavior of find -exec with rm, we use xargs with rm:
find zmays-snps/data/seqs -name "*-temp.fastq"
zmays-snps/data/seqs/zmaysA_R1-temp.fastq
zmays-snps/data/seqs/zmaysA_R2-temp.fastq
zmays-snps/data/seqs/zmaysC_R1-temp.fastq
zmays-snps/data/seqs/zmaysC_R2-temp.fastq
find zmays-snps/data/seqs -name "*-temp.fastq" | xargs rm
xargs passes all arguments received through standard input to the supplied program (rm in this example). This works well for
programs like rm , touch , mkdir , and others that take multiple arguments. However, other programs only take a single
argument at a time. We can set how many arguments are passed to each command call with xargs’s -n argument. For
example, we could call rm four separate times (each on one file) with:
touch zmays-snps/data/seqs/zmays{A,C}_R{1,2}-temp.fastq
find zmays-snps/data/seqs -name "*-temp.fastq" | xargs -n 1 rm
One big benefit of xargs is that it separates the process that specifies the files to operate on (find) from the process that applies a command to these files (xargs). If we wanted to inspect the long list of files find returns before running rm on them all, we could use:
touch zmays-snps/data/seqs/zmays{A,C}_R{1,2}-temp.fastq
find zmays-snps/data/seqs -name "*-temp.fastq" > files-to-delete.txt
cat files-to-delete.txt
cat files-to-delete.txt | xargs rm
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
## Note that BSD xargs will only replace up to five instances of the string specified
## by -I by default, unless more are set with -R.
Combining xargs with basename is a powerful idiom used to apply commands to many files in a way that keeps track of
which output file was created by a particular input file. While we could accomplish this other ways, xargs allows for very
quick and incremental command building. xargs has another very large advantage over for loops: it allows parallelization
over a prespecified number of processes.
xargs allows us to define the number of processes to run simultaneously with the -P option. For example, for the previous command we could use:
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
Admittedly, the price of some powerful xargs workflows is complexity. If you find yourself using xargs mostly to parallelize tasks, or you’re writing complicated xargs commands that use basename, it may be worthwhile to learn GNU Parallel. GNU Parallel extends and enhances xargs’s functionality and fixes several limitations of xargs. For example, GNU
Parallel can handle redirects in commands, and has a shortcut ({/.}) to extract the base filename without basename. This allows for very short, powerful commands, e.g. (following the book’s example):
find . -name "*.fastq" | parallel --max-procs=6 "program {/.} > {/.}-out.txt"
GNU Parallel has numerous other options and features. If you find yourself using xargs frequently for complicated workflows,
I’d recommend learning more about GNU Parallel. The GNU Parallel website has numerous examples and a detailed tutorial.
Key Points
• Don’t forget the proper header in your scripts
Copyright © 2016–2019 Software Carpentry Foundation (https://software-carpentry.org)