You are on page 1of 20

Unix: Beyond the Basics

1
Login
Requesting a tak account
http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
Windows
PuTTY
Need to setup X-windows for graphical display
Macs
Access the Terminal: Go Utilities Terminal

2
Login using Secure Shell
ssh Y user@tak
PuTTY on Windows

Terminal on Macs

Command
Prompt
3
user@tak ~$
Hot Topics website:
http://jura.wi.mit.edu/bio/education/hot_topics/

Please login to our tak server, and create a directory for the
exercises in your lab share, and use it as your working directory
$ mkdir unix2
$ cd unix2
Download all files into your working directory
You should have the files below in your working directory:
foo.txt, sample1.txt, exercise.txt , join2filesByFirstColumn.pl , datasets folder
you can check with ls command

4
Unix Commands Review
(Unix Essentials)
$ sort k2,3 foo.tab -n or -g: sorting numbers
-n is recommended than g, except for
start end scientific notation or a leading '+'
-r: reverse order

$ cut f1,5 foo.tab


$ cut f1-5 foo.tab
-f: select only these fields
-f1,5: select 1st and 5th fields
-f1-5: select 1st, 2nd, 3rd, 4th, and 5th fields

$ wc l foo.txt
How many lines are in this file?

5
What you will learn

sed
awk
groupBy (bedtools)
loops
join files together
scripting

6
sed:
stream editor for filtering and transforming text

Print lines 10 - 15:


$ sed -n '10,15p' bigFile > selectedLines

Delete 5 header lines at the beginning of a file:


$ sed '1,5d' file > fileNoHeader

Remove all version numbers (ex: '.1') from the end of


a list of sequence accessions:
$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly

s: substitute g: global modifier


d: delete line p: print the current pattern space
-n: only print those lines matching our pattern

7
awk
Name comes from the original authors:
Alfred V. Aho, Peter J. Weinberger, Brian W.
Kernighan
A simple programing language
Good for short programs, including parsing
files

8
awk
Print the 2nd and 1st fields of the file:
$ awk ' { print $2"\t"$1 } ' foo.tab

Convert sequences from tab delimited format to fasta format:


$ awk ' { print ">"$1"\n"$2 } ' foo.tab > foo.fa
$ head -1 foo.tab
Seq1 ACTGCATCAC
$ head -2 foo.fa
>Seq1
ACGCATCAC

Character Description
\n newline
\r carriage return
\t horizontal tab
#: comment, ignored by awk
By default, awk splits each line by spaces

9
awk: field separator
Issues with default separator: white space
one field is gene description with multiple words
consecutive empty cells
To use tab as the separator:
$ awk F "\t" '{ print NF }' foo.txt
or
$ awk 'BEGIN {FS="\t"} { print NF }' foo.txt

BEGIN: action before read input


NF: number of fields in the current record
FS: input field separator
OFS: output field separator
END: action after read input
10
awk: arithmetic operations
Add average values of 4th and 5th fields to the file:
$ awk '{ print $0"\t"($4+$5)/2 }' foo.tab

$0: all fields

Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
^ Exponentiation
** Exponentiation

11
awk: making comparisons
Print out records if values in 4th or 5th field are above 4:
$ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab

Sequence Description
> Greater than
< Less than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
~ Matches
!~ Does not match
|| Logical OR
&& Logical AND

12
awk
Conditional statements:
Display expression levels for the gene NANOG: if (condition1) action1
$ awk '{ if(/NANOG/) print $0 }' foo.txt else if (condition2) action2
or else action3
$ awk '/NANOG/ { print $0 } ' foo.txt
or sub function for substitution:
$ awk '/NANOG/' foo.txt sub( regexp, replacement, target)

Add line number to the above output: for ( i = 1; i <= NF; i++ )
$ awk '/NANOG/ { print NR\t$0 }' foo.txt print $i
NR: number of the current record
Looping:
Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript
$ awk '{ total= $4 + $5 + $6; avg=total/3; print $0"\t"avg}' foo.txt
or
$ awk '{ total=0; for (i=4; i<=6; i++) total=total+$i; avg=total/3; print $0"\tavg }' foo.txt

13
Summarize by Columns:
groupBy (from bedtools)
-g grpCols column(s) for grouping
-c -opCols column(s) to be summarized
-o Operation(s) applied to opCol:

sum, count, min, max,


mean, median, stdev,
collapse
distinct
freqdesc (i.e., print desc. list of values:freq)
freqasc (i.e., print asc. list of values:freq)

Print the maximum value (3rd column) for each gene (2nd field) in the file
$ groupBy -g 2 -c 3 -o max foo.tab

14
Join files together
Unix join
$ join -1 2 -2 3 FILE1 FILE2
Join files on the 2nd field of FILE1 with the 3rd field of FILE2,
only showing the common lines.
FILE1 and FILE2 must be sorted on the join fields before running join

BaRC scripts: /nfs/BaRC_Public/BaRC_code/Perl/


sorting not required
$ join2filesByFirstColumn.pl file1 file2
$ submatrix_selector.pl matrixFile rowIDfile or 0 [columnIDfile]

15
Shell Flavors
Syntax (for scripting) depends on which shell
echo $SHELL
Several shells available, bash is the default on
tak and commonly used.
Some of the shells (incomplete listing):
Shell Name
sh Bourne
bash Bourne-Again
ksh Korn shell
csh C shell

16
Shell Script
Automation: avoid having to retype the same
commands many times
Ease of use and more efficient
Outline of a script:
#!/bin/bash shebang: interprets how to run the script
commands set of commands used in the script
#comments write comments using #
Commonly used extension for script is .sh (eg.
foo.sh), file must have executable permission
17
Bash Shell: for loop
Process multiple files with one command
Reduce computational time with many cluster
nodes
for myfile in `ls dataDir`
do
bsub wc l $myfile
done

When referring to a variable, $ is needed


before the variable name ($myfile), but $ is
not needed when defining it (myfile).
Note: this syntax is different from awk
18
#!/bin/sh
Shell Script Example
# 1. Take two arguments: the first one is a directory with all the datasets, the second one is an output
directory
# 2. For each data, calculate average gene expression, and save the results in a file in the output directory

inDir=$1 # 1st argument


outDir=$2 # 2nd argument; outDir must already exist
# Define variables: no spaces on either side of the equal sign

for i in `/bin/ls $inDir ` # refer to variable with $


do
# output file name
name="${i}_gene.txt" # {}: $i_gene not valid; prevent misinterpretation of variable as
#characters
# calculate average gene expression
# NM_001039201 Hdhd2 5.0306 5.3309 5.4998
sort -k2,2 ${inDir}/${i} | groupBy -g 2 -c 3,4,5 -o mean,mean,mean >| ${outDir}/${name}
done
# You can use graphical editors such as nedit, gedit, xemacs to create shell scripts

19
Further Reading
BaRC one liners:
http://iona.wi.mit.edu/bio/bioinfo/scripts/#unix
Unix Info for Bioinfo:
http://iona.wi.mit.edu/bio/barc_site_map.php
Online books via MIT:
http://proquest.safaribooksonline.com/
MIT: Turning Biologists into Bioinformaticists
http://rous.mit.edu/index.php/Teaching
Bash Guide for Beginners:
http://tldp.org/LDP/Bash-Beginners-Guide/html/

20

You might also like