Professional Documents
Culture Documents
1
Login
Requesting a tak account
http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
Windows
PuTTY
Need to setup X-windows for graphical display
Macs
Access the Terminal: Go Utilities Terminal
2
Login using Secure Shell
ssh Y user@tak
PuTTY on Windows
Terminal on Macs
Command
Prompt
3
user@tak ~$
Hot Topics website:
http://jura.wi.mit.edu/bio/education/hot_topics/
Please login to our tak server, and create a directory for the
exercises in your lab share, and use it as your working directory
$ mkdir unix2
$ cd unix2
Download all files into your working directory
You should have the files below in your working directory:
foo.txt, sample1.txt, exercise.txt , join2filesByFirstColumn.pl , datasets folder
you can check with ls command
4
Unix Commands Review
(Unix Essentials)
$ sort k2,3 foo.tab -n or -g: sorting numbers
-n is recommended than g, except for
start end scientific notation or a leading '+'
-r: reverse order
$ wc l foo.txt
How many lines are in this file?
5
What you will learn
sed
awk
groupBy (bedtools)
loops
join files together
scripting
6
sed:
stream editor for filtering and transforming text
7
awk
Name comes from the original authors:
Alfred V. Aho, Peter J. Weinberger, Brian W.
Kernighan
A simple programing language
Good for short programs, including parsing
files
8
awk
Print the 2nd and 1st fields of the file:
$ awk ' { print $2"\t"$1 } ' foo.tab
Character Description
\n newline
\r carriage return
\t horizontal tab
#: comment, ignored by awk
By default, awk splits each line by spaces
9
awk: field separator
Issues with default separator: white space
one field is gene description with multiple words
consecutive empty cells
To use tab as the separator:
$ awk F "\t" '{ print NF }' foo.txt
or
$ awk 'BEGIN {FS="\t"} { print NF }' foo.txt
Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo
^ Exponentiation
** Exponentiation
11
awk: making comparisons
Print out records if values in 4th or 5th field are above 4:
$ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab
Sequence Description
> Greater than
< Less than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
~ Matches
!~ Does not match
|| Logical OR
&& Logical AND
12
awk
Conditional statements:
Display expression levels for the gene NANOG: if (condition1) action1
$ awk '{ if(/NANOG/) print $0 }' foo.txt else if (condition2) action2
or else action3
$ awk '/NANOG/ { print $0 } ' foo.txt
or sub function for substitution:
$ awk '/NANOG/' foo.txt sub( regexp, replacement, target)
Add line number to the above output: for ( i = 1; i <= NF; i++ )
$ awk '/NANOG/ { print NR\t$0 }' foo.txt print $i
NR: number of the current record
Looping:
Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript
$ awk '{ total= $4 + $5 + $6; avg=total/3; print $0"\t"avg}' foo.txt
or
$ awk '{ total=0; for (i=4; i<=6; i++) total=total+$i; avg=total/3; print $0"\tavg }' foo.txt
13
Summarize by Columns:
groupBy (from bedtools)
-g grpCols column(s) for grouping
-c -opCols column(s) to be summarized
-o Operation(s) applied to opCol:
Print the maximum value (3rd column) for each gene (2nd field) in the file
$ groupBy -g 2 -c 3 -o max foo.tab
14
Join files together
Unix join
$ join -1 2 -2 3 FILE1 FILE2
Join files on the 2nd field of FILE1 with the 3rd field of FILE2,
only showing the common lines.
FILE1 and FILE2 must be sorted on the join fields before running join
15
Shell Flavors
Syntax (for scripting) depends on which shell
echo $SHELL
Several shells available, bash is the default on
tak and commonly used.
Some of the shells (incomplete listing):
Shell Name
sh Bourne
bash Bourne-Again
ksh Korn shell
csh C shell
16
Shell Script
Automation: avoid having to retype the same
commands many times
Ease of use and more efficient
Outline of a script:
#!/bin/bash shebang: interprets how to run the script
commands set of commands used in the script
#comments write comments using #
Commonly used extension for script is .sh (eg.
foo.sh), file must have executable permission
17
Bash Shell: for loop
Process multiple files with one command
Reduce computational time with many cluster
nodes
for myfile in `ls dataDir`
do
bsub wc l $myfile
done
19
Further Reading
BaRC one liners:
http://iona.wi.mit.edu/bio/bioinfo/scripts/#unix
Unix Info for Bioinfo:
http://iona.wi.mit.edu/bio/barc_site_map.php
Online books via MIT:
http://proquest.safaribooksonline.com/
MIT: Turning Biologists into Bioinformaticists
http://rous.mit.edu/index.php/Teaching
Bash Guide for Beginners:
http://tldp.org/LDP/Bash-Beginners-Guide/html/
20