Professional Documents
Culture Documents
Chapter 3
Inspection of Sequence Quality
Kwong Qi Bin, Ong Ai Ling and Martti Tammi
Glossary of Terms
FASTQ: Text-based nucleotide sequence with its quality score.
Line 1 is the FASTA identifier, line 2 is the nucleotide sequence,
line 3 starts with a ‘+’ followed by the optional FASTA identifier,
and line 4 represents the quality score.
FASTA: Text-based nucleotide sequence without its quality score,
which is just line 1 and 2 of FASTQ. It has a header that starts
with “>” and the next line is the sequence.
Kmer: Nucleotide sub-sequence that is made up of a fixed
number of K bases.
PCR: Polymerase chain reaction is a molecular biology method
that is used to amplify a DNA fragment to multiple copies.
GC: Guanine-cytosine content within a sequence.
Introduction
The adage ‘garbage in, garbage out’ serves as an important
reminder to users of NGS technologies to be careful about the
quality of sequence data that are used for analyses. Although NGS
is a powerful technology that allows us to acquire important bio-
logical information of a species such as its genome, its accuracy
depends on the raw sequenced data. Similar to any high
49
FastQC
FastQC1 is a quality control tool for NGS data. It is useful in summa-
rizing the quality of sequencing and detecting potential problems.
The program can be downloaded at http://www.bioinformatics.
babraham.ac.uk/projects/fastqc/.
$ wget http://www.bioinformatics.babraham.ac.
uk/projects/fastqc/fastqc_v0.11.4.zip
$ unzip fastqc_v0.11.4.zip
# Note that a folder named FastQC is generated.
$ cd FastQC/
# It is useful to look inside the file INSTALL.
txt to see some useful instructions on how to
use the software.
by UNIVERSITY OF BIRMINGHAM on 08/12/17. For personal use only.
Once you have done that you can run it directly by typing:
$ ./fastqc
An error might show if you cannot view the graphical interface
of FastQC as shown in Figure 1. To fix this error, the user needs to
ensure they have X11 display.
Figure 2. Screenshot for Putty’s configuration to enable Xming X for any visuali-
zation purposes.
Download Datasets
The datasets can be downloaded at http://bioinfo.
perdanauniversity.edu.my/infohub/display/NPB/Index
by UNIVERSITY OF BIRMINGHAM on 08/12/17. For personal use only.
The second line is the bases, third line is mainly just a separator,
and the fourth line is the quality scores. To run FastQC for all the
example files, use the following command:
$ ./fastqc good_seq.fastq bad_seq.fastq
contaminated.fq
Three output files will be generated for each of the input files:
good_seq_fastqc.zip, bad_seq_fastqc.zip and
contaminated_fastqc.zip.
by UNIVERSITY OF BIRMINGHAM on 08/12/17. For personal use only.
In the unzipped folder of the output files, there are also two text
files, which are ‘summary.txt’ and ‘fastqc_data.txt’. These files con-
tain the raw statistics that are used to generate the HTML reports.
Take a look at it if you are interested. For the purpose of this tutorial,
we will only focus on the more user-friendly HTML report.
Overall, it is easy to locate potential problems in the FASTQ files
by looking at the summary column in the HTML file. The summary
has 11 categories that show various aspects relevant for sequence
quality inspection (Table 1).
by UNIVERSITY OF BIRMINGHAM on 08/12/17. For personal use only.
Figure 5. Example results for average quality score for all sequences.
Figure 8. Example result for distribution of sequence length over all sequences.
$ wget http://hannonlab.cshl.edu/fastx_tool
-
kit/fastx_toolkit_0.0.13_binaries_Linux_2.6_
amd64.tar.bz2
$ fastq_quality_filter -i bad_seq.fastq -q 25
Bioinformatics Downloaded from www.worldscientific.com
-p 80 -o bad_seq.fastq.filtered –Q33
Flag:
As we can see in Figure 11, the quality of the input file has
improved drastically. Although some of the bases still fall into the
yellow zone, most of the low quality ones have been removed.
Another option is to remove the entire low-quality short read
by setting a more stringent cutoff for the quality scores. However,
we strongly do not recommend doing this.
$ fastq_quality_filter -q 20 -p 80 -i bad_seq.
fastq -o bad_sequence.txt.filtered –Q33
$ fastx_clipper -a CAAGCAGAAGACGGCATACGAGAT
CGTGATGTGACTGGAG-i contaminated.fq -o
contaminated_adapter_remove.fq –v –Q33
Or
$ fastx_clipper -a CAAGCAGAAGA -i contaminated.
fq -o contaminated_adapter_remove.fq –v –Q33
Flag:
CCAGAGGGCGACCGCCCAGAGCGAGAGCAACACGAAAACA
>HWQB2:1:10:72:192:#0/1
CGGCGAGCACAGAGAACAAGCAACGACCGAAAAGGCGGGA
distribution
FASTA formatter Change the width of sequences line in a FASTA file
Bioinformatics Downloaded from www.worldscientific.com
Conclusion
It is important to do sequence quality inspection to ensure a good
and clean data for downstream analyses. Options available are to
either omit the entire sequence, low quality bases or to treat low
quality bases as Ns.
References
1. Andrews, S. FastQC: a quality control tool for high throughput sequence data,
<http://www.bioinformatics.babraham.ac.uk/projects/fastqc> (2010).
2. Gordon, A. & Hannon, G. J. Fastx-toolkit. (2010).
3. Aronesty, E. ea-utils: Command-line tools for processing biological sequencing
data, <https://github.com/ExpressionAnalysis/ea-utils> (2011).