
Haja N. Kadarmideen Editor

Systems Biology
in Animal
Production and
Health, Vol. 1
Editor

Haja N. Kadarmideen
Faculty of Health and Medical Sciences
University of Copenhagen
Frederiksberg C, Denmark

ISBN 978-3-319-43333-2 ISBN 978-3-319-43335-6 (eBook)


DOI 10.1007/978-3-319-43335-6

Library of Congress Control Number: 2016956674

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material contained herein or for any errors
or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
The registered company address is Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

The increased prominence of “systems biology” in biological research over the past
two decades is arguably a reaction to the reductionist approach exemplified by the
genome sequencing phase of the Human Genome Project. A simplistic view of the
genome projects was that the genome sequence of a species, whether humans,
model organisms, plants or farmed animals, represents a blueprint for the organism
of interest, and thus characterising the sequence would reveal the relevant instruc-
tions. Subsequent targets for the reductionist or cataloguing approach were com-
plete lists of transcripts (transcriptomes) and proteins (proteomes) for the organism
of interest. The ‘omics approach to the comprehensive characterisation of an organ-
ism, tissue or cell has also been extended to metabolites and hence metabolomes.
A catalogue of parts, however, is insufficient to understand how an organism func-
tions. Thus, a holistic approach that recognises the interactions between compo-
nents of the system was required. Given the size and complexity of the data and the
possible interactions, it was necessary to use advanced mathematical and computa-
tional methods to attempt to make sense of the data. Thus, “systems biology” in the
‘omics era is widely considered to concern the use of mathematical modelling and
analysis together with ‘omics data (genome sequence, transcriptomes, proteomes,
metabolomes) to understand complex biological systems. The predictive aspect of
these models is viewed as particularly important. Moreover, it is desirable that the
models’ predictions can be tested experimentally. Systems biology, therefore, con-
tributes in part to converting large ‘omics data sets from data-driven biology experi-
ments into testable hypotheses.
Systems approaches and the use of predictive mathematical models in biological
systems long pre-date the post genome project (re-)emergence of systems biology.
Population biologists/geneticists, epidemiologists, agricultural scientists, quantita-
tive geneticists and plant and animal breeders have been developing and success-
fully exploiting predictive mathematical models and systems approaches for
decades.
Quantitative geneticists and animal breeders, for example, have been remarkably
successful at developing statistical animal models that are effective predictors of
future performance. For decades, these successes were achieved without any knowl-
edge of the underlying molecular components. The accuracy of these models has
been increased by using high-density molecular (single nucleotide polymorphism,
SNP) genotypes in so-called genomic selection. However, whilst the sequences and
genome locations of the SNP markers are known, little is known about the functional
impact or relevance of the individual SNP loci. Further improvements could be
achieved through the use of genome sequence data and by adding knowledge of the
likely effects of the sequence variants, whether coding or regulatory. Thus, there is a
growing commonality between the systems approaches of quantitative geneticists
and animal breeders and the ‘omics version of systems biology.
Animals are not only complex biological systems but also function within wider
complex systems. The recognition that an animal’s phenotype is determined by a
combination of its genotype and environmental factors simply restates the latter.
The environmental factors include, amongst others, feed, pathogens and the microbiomes
present in the gastrointestinal tract and other locations. The ‘omics technologies
allow the characterisation not only of the components of the animal of
interest, but also of those of its commensal microbes and of the microbes, including
pathogens, present in its environment.
As noted earlier, it is desirable that the mathematical models developed in sys-
tems biology are predictive and that the associated hypotheses are testable. Genome
editing technologies which have been demonstrated in farmed animal species facili-
tate hypothesis testing at the level of modifying the genome sequence that deter-
mines components of the system of interest.
This volume of Systems Biology in Animal Production and Health, edited by
Professor Haja Kadarmideen, explores some aspects of both quantitative genetics
and ‘omics-led systems approaches to tackling the challenges
of improving animal productivity and reducing the burden of disease. This
book contains some chapters with R code and other computer programs, as well as
workflows and pipelines for processing and analysing multi-omic datasets, from the
laboratory all the way to the interpretation of results. Hence, this book would be particularly useful for
students, teachers and practitioners of integrative genomics, bioinformatics and sys-
tems biology in animal and veterinary sciences.
Adhil et al. (chapter “Advanced Computational Methods, NGS Tools, and
Software for Mammalian Systems Biology”) review the computational methods
and tools required to analyse and integrate multi-omics data from different levels
including genome sequence, transcriptomics, proteomics and metabolomics. The
analysis of transcriptomic data, and specifically RNA-Seq data, is described in
greater detail by Heras-Saldana et al. (chapter “RNA Sequencing Applied to
Livestock Production”).
Whilst it is generally challenging to identify the causal genetic variants for com-
plex phenotypes, identifying loci with effects on primary traits such as the level of
gene expression or levels of a metabolite is easier as effects are often delivered close
to the gene. For example, many expression quantitative trait loci (eQTL) are detected
as cis-effects with the causal genetic variation located in the regulatory sequences of
the gene of interest. Of course, most phenotypes of importance to animal production
or health are controlled by the effects of many genes. Wang and Michoel (chapter
“Detection of Regulator Genes and eQTLs in Gene Networks”) address the challenge
of identifying the gene networks that capture the interaction between genes from
eQTL data. Systems genetics and systems biology using gene network methods,
with application to obesity using pig models, are reviewed by Kogelman and
Kadarmideen (chapter “Applications of Systems Genetics and Biology for Obesity
Using Pig Models”). Fontanesi (chapter “Merging Metabolomics, Genetics, and
Genomics in Livestock to Dissect Complex Production Traits”) reviews metabolite
QTL (mQTL), which have similar advantages to eQTL in respect of ease of identi-
fication, in pigs and cattle.
Rosa et al. (chapter “Applications of Graphical Models in Quantitative Genetics
and Genomics”) discuss the use of stochastic graphical models with an emphasis on
Bayesian networks to predict phenotypes, including primary traits such as gene
expression levels and end traits from sequence variants and thus arguably traversing
the path from sequence to consequence.

Professor Alan L. Archibald FRSE


Deputy Director, Head of Genetics and Genomics
The Roslin Institute and Royal (Dick) School of Veterinary Studies
University of Edinburgh
Easter Bush, Midlothian EH25 9RG, UK

Preface

Systems biology is a research discipline at the crossroads of statistical, computational,
quantitative, and molecular biology methods. It involves joint modeling,
combined analysis and interpretation of high-throughput omics (HTO) data col-
lected at many “levels or layers” of the biological systems within and across indi-
viduals in the population. The systems biology approach is often aimed at studying
associations and interactions between different “layers or levels”, but not necessar-
ily one layer or level in isolation. For instance, it involves study of multidimensional
associations or interaction among DNA polymorphisms, gene expression levels,
proteins or metabolite abundances. With modern HTO biotechnologies and their
decreasing costs, hugely comprehensive multi-omic data at all “levels or layers” of
the biological system are now available. This “big data” at lower costs, along with
development of genome scale models, network approaches and computational
power, have spearheaded the progress of the systems biology era, including applica-
tions in human biology and medicine. Systems biology is an established indepen-
dent discipline in humans and increasingly so in animals, plants and microbial
research. However, joint modeling and analyses of multilayer HTO data, in large
volumes on a scale that has never been seen before, pose enormous challenges from
both computational and statistical points of view. Systems biology tackles such joint
modeling and analyses of multiple HTO datasets using a combination of statistical,
computational, quantitative and molecular biology methods and bioinformatics
tools. As I wrote in my review article (Livestock Science 2014, 166:232–248), sys-
tems biology is not only about multilayer HTO data collection from populations of
individuals and subsequent analyses and interpretations; it is also about a philoso-
phy and a hypothesis-driven predictive modeling approach that feeds into new
experimental designs, analyses and interpretations. In fact, systems biology revolves
and iterates between these “wet” and “dry” approaches to converge on coherent
understanding of the whole biological system behind a disease or phenotype and
provide a complete blueprint of functions that leads to a phenotype or a complex
disease.
It is equally important to introduce, alongside systems biology, the sub-disci-
pline of systems genetics as a branch of systems biology. It is akin to considering
“genetics” as a sub-discipline of “biology”. It is well known that quantitative genet-
ics/genomics links genome-wide genetic variation with variation in disease risks or
a performance (phenotype or trait) that we can easily measure or observe in a
population of individuals. However, systems genetics or systems genomics not only
performs such genome-wide association studies (GWAS), but also links
genetic variations (e.g. SNPs, CNVs, QTLs etc.) at the DNA sequence level with
variation in molecular profiles or traits (e.g. gene expression or metabolomic or
proteomic levels etc. in tissues and biological fluids) that we can measure using
high-throughput next- and third-generation biotechnologies. The systems genetics
approach is still “genetics”, because we are looking at those genetic variants that
exert their effects from DNA to phenotypic expression or disease manifestations
through a number of intermediate molecular profiles. Hence, systems genetics
derives its name, as originally proposed in my earlier article (Mammalian Genome,
2006, 17:548–564), by being able to integrate analyses of all underlying genetic
factors acting at different biological levels, namely, QTL, eQTL, mQTL, pQTL and
so on. I have provided a complete up-to-date review and illustration of systems
genetics or systems genomics and multi-omic data integration and analyses in our
review paper published in Genetics Selection Evolution (2016), 48:38. Overall, sys-
tems genetics/genomics leads us to provide a holistic view on complex trait heredity
at different biological layers or levels.
Whether it is systems biology or systems genetics, the gene ontology annotation
is one of the most important and valuable means of assigning functional information
using standardized vocabulary. This would include annotation of genetic variants
falling into functional groups such as trait QTL, eQTL, mQTL, pQTL. Molecular
pathway profiling, signal transduction and gene set enrichment analyses along with
various types of annotations form the “icing on the cake”. For this purpose, several
bioinformatics tools are frequently used. Most chapters in this book and its associ-
ated volume cover these aspects.
I would like to point out that systems biology approaches have been proven to be
very powerful and shown to produce accurate and replicable discoveries of genes,
proteins and metabolites and their networks that are involved in complex diseases or
traits. In very practical terms, it delivers biomarkers, drug targets, vaccine targets,
target transcripts or metabolites, genetic markers, pathway targets etc. to diagnose
and treat diseases better or improve traits or characteristics in animals, plants and
humans. In the world of genomic prediction and genomic selection, there has been
an increasing number of studies that have shown high accuracy and predictive
power when models include functional QTLs such as eQTL, mQTL, pQTL which,
in fact, are results from systems genetics methods.
This book and its associated volume cover the above-mentioned principles, theory
and application of systems biology and systems genetics in livestock and animal
models and provide a comprehensive overview of open source and commercially
available software tools, computer programming code and other reading materials to
learn, use and successfully apply systems biology and systems genetics in animals.
Overall, I believe this book is an extremely valuable source for students interested
in learning the basics and could serve as a textbook in higher educational
institutes and universities around the world. Equally, the book chapters are very
relevant and useful for scientists interested in learning and applying advanced HTO
studies, integrative HTO data analyses (e.g. eQTLs and mQTLs) and computational
systems biology techniques to animal production, health and welfare. One of the
chapters focuses on systems genomics models and computational methods applied
to animal models for elucidating systems biology of human obesity and diabetes.
The two volumes of this book are the result of contributions from highly reputed scientists
and practitioners who originate from renowned universities and multinational
companies in the UK, Denmark, France, Italy, Australia, USA, Brazil and India.
I would like to thank the publisher Springer for inviting me to edit two volumes on
this subject, publishing in an excellent form and promoting the book across the
globe. I am grateful to all contributing authors and co-authors of this book. I also
wish to thank Ms. Gilda Kischinovsky from my research group for proofreading
and the staff at Springer involved in production of this book. Last but not least,
I wish to thank my wife and children who have given me moral support and strength
while I reviewed and edited this book.

Copenhagen, Denmark Haja N. Kadarmideen


September, 2016

Contents

Detection of Regulator Genes and eQTLs in Gene Networks
Lingfei Wang and Tom Michoel

Applications of Systems Genetics and Biology for Obesity Using Pig Models
Lisette J.A. Kogelman and Haja N. Kadarmideen

Merging Metabolomics, Genetics, and Genomics in Livestock to Dissect Complex Production Traits
Luca Fontanesi

RNA Sequencing Applied to Livestock Production
Sara de las Heras-Saldana, Hawlader A. Al-Mamun, Mohammad H. Ferdosi, Majid Khansefid, and Cedric Gondro

Applications of Graphical Models in Quantitative Genetics and Genomics
Guilherme J.M. Rosa, Vivian P.S. Felipe, and Francisco Peñagaricano

Advanced Computational Methods, NGS Tools, and Software for Mammalian Systems Biology
Mohamood Adhil, Mahima Agarwal, Prahalad Achutharao, and Asoke K. Talukder

Detection of Regulator Genes and eQTLs in Gene Networks

Lingfei Wang and Tom Michoel

L. Wang • T. Michoel (*)
Division of Genetics and Genomics, The Roslin Institute, The University of Edinburgh,
Midlothian EH25 9RG, UK
e-mail: tom.michoel@roslin.ed.ac.uk

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_1

Abstract
Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in noncoding genomic regions.
These genetic variants are often also associated with differences in expression levels
of nearby genes (they are “expression quantitative trait loci” or eQTLs, for
short) and presumably play a gene regulatory role, affecting the status of molecular
networks of interacting genes, proteins, and metabolites. Computational systems
biology approaches to reconstruct causal gene networks from large-scale
omics data have therefore become essential to understand the structure of networks
controlled by eQTLs together with other regulatory genes, as well as to
generate detailed hypotheses about the molecular mechanisms that lead from
genotype to phenotype. Here we review the main analytical methods and software
to identify eQTLs and their associated genes, to reconstruct coexpression
networks and modules, to reconstruct causal Bayesian gene and module networks,
and to validate predicted networks in silico.

1 Introduction

Genetic differences between individuals are responsible for variation in the observable
phenotypes. This principle underpins genomewide association studies (GWAS),
which map the genetic architecture of complex traits by measuring genetic variation
at single-nucleotide polymorphisms (SNPs) on a genomewide scale across many
individuals (Mackay et al. 2009). GWAS have resulted in major improvements in
plant and animal breeding (Goddard and Hayes 2009) and in numerous insights into
the genetic basis of complex diseases in humans (Manolio 2013). However, quantitative
trait loci (QTLs) with large effects are uncommon and a molecular explanation
for their trait association rarely exists (Mackay et al. 2009). The vast majority of
QTLs indeed lie in noncoding genomic regions and presumably play a gene regula-
tory role (Hindorff et al. 2009; Schaub et al. 2012). Consequently, numerous studies
have identified cis- and trans-acting DNA variants that influence gene expression
levels (i.e., “expression QTLs”; eQTLs) in model organisms, plants, farm animals,
and humans (reviewed in Rockman and Kruglyak 2006; Georges 2007; Cookson
et al. 2009; Cheung and Spielman 2009; Cubillos et al. 2012). Gene expression
programs are of course highly tissue- and cell-type specific, and the properties and
complex relations of eQTL associations across multiple tissues are only beginning
to be mapped (Dimas et al. 2009; Foroughi Asl et al. 2015; Greenawalt et al. 2011;
Ardlie et al. 2015). At the molecular level, a mounting body of evidence shows that
cis-eQTLs primarily cause variation in transcription factor (TF) binding to gene
regulatory DNA elements, which then causes changes in histone modifications,
DNA methylation, and mRNA expression of nearby genes; trans-eQTLs in turn can
usually be attributed to coding variants in regulatory genes or cis-eQTLs of such
genes (Albert and Kruglyak 2015).
Taken together, these results motivate and justify a systems biological view of
quantitative genetics (“systems genetics”), where it is hypothesized that genetic
variation, together with environmental perturbations, affects the status of molecular
networks of interacting genes, proteins, and metabolites; these networks act within
and across different tissues and collectively control physiological phenotypes
(Williams 2006; Kadarmideen et al. 2006; Rockman 2008; Schadt 2009; Schadt and
Björkegren 2012; Civelek and Lusis 2014; Björkegren et al. 2015). Studying the
impact of genetic variation on gene regulation networks is of crucial importance in
understanding the fundamental biological mechanisms by which genetic variation
causes variation in phenotypes (Chen et al. 2008), and it is expected to lead to the
discovery of novel disease biomarkers and drug targets in human and veterinary
medicine (Schadt et al. 2009). Because the direct experimental mapping of genetic,
protein–protein, or protein–DNA interactions is an immensely challenging task,
further exacerbated by the cell-type-specific and dynamic nature of these interac-
tions (Walhout 2006), comprehensive, experimentally verified molecular networks
will not become available for multi-cellular organisms in the foreseeable future.
Statistical and computational methods are therefore essential to reconstruct trait-associated
causal networks by integrating diverse omics data (Rockman 2008;
Schadt 2009; Ritchie et al. 2015).
A typical systems genetics study collects genotype and gene, protein, and/or
metabolite expression data from a large number of individuals segregating for one
or more traits of interest. After raw data processing and normalization, eQTLs are
identified for each of the expression data types, and a coexpression matrix is constructed.
Causal Bayesian gene networks, coexpression modules (i.e., clusters), and/or
causal Bayesian module networks are then reconstructed. The in silico validation
of predicted networks and modules using independent data confirms their overall
validity, ideally followed by the experimental validation of the most promising findings
in a relevant cell line or model organism (Fig. 1). Here we review the main
analytic principles behind each of the steps from eQTL identification to in silico
network validation and present a selection of the most commonly used methods and
software for each step. Throughout this chapter, we tacitly assume that all data have
been quality controlled, preprocessed, and normalized to suit the assumptions of the
analytic methods presented here. For expression data, this usually means working
with log-transformed data where each gene expression profile is centered around
zero with standard deviation one. We also assume that the data have been corrected
for any confounding factors, either by regressing out known covariates or by estimating
hidden factors (Stegle et al. 2012).

Fig. 1 A flow chart for a typical systems genetics study and the corresponding software. Steps in
light yellow are covered in this chapter. Steps shown in the chart, with the associated software or
resources in parentheses: adequate experimental design and data collection; appropriate data
preprocessing and quality control; expression quantitative trait loci analysis (matrix-eQTL, kruX);
choice of correlation function and calculation of gene co-expression; co-expression module
detection (Fast Modularity, MCL, WGCNA); genotype data assisted edge directing (NEO, Trigger);
model-based clustering (Lemon-Tree); module network reconstruction (Lemon-Tree); in silico
validation of reconstructed gene regulation network (ENCODE, Roadmap Epigenomics,
modENCODE, BioGRID, Gene Expression Omnibus, ArrayExpress); experimental verification of
regulatory pathways
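The preprocessing assumed above (log transformation, removal of known covariates, and standardization of each expression profile) can be illustrated with a minimal NumPy sketch. The function names, the toy count data, and the single `batch` covariate are hypothetical choices made for illustration only; estimating hidden factors is not shown.

```python
import numpy as np

def regress_out(X, covariates):
    """Remove the linear effect of known covariates from every row of X.

    X: genes x samples matrix; covariates: one value (or row of values) per sample.
    """
    n = X.shape[1]
    design = np.column_stack([np.ones(n), covariates])    # intercept + covariates
    coef, *_ = np.linalg.lstsq(design, X.T, rcond=None)   # least-squares fit per gene
    return X - (design @ coef).T                          # keep the residuals

def standardize_rows(X):
    """Center each row at zero and scale it to standard deviation one."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / X.std(axis=1, keepdims=True)

rng = np.random.default_rng(1)
counts = rng.poisson(lam=50, size=(6, 40)).astype(float)  # toy expression values
batch = rng.integers(0, 2, size=40)                       # hypothetical known covariate
X = standardize_rows(regress_out(np.log2(counts + 1), batch))
```

After these steps every row of `X` has mean zero and standard deviation one and is uncorrelated with the covariate, which is the form of input assumed by the matrix formulas in the following sections.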

2 Genetics of Gene Expression

A first step toward identifying molecular networks affected by DNA variants is to
identify variants that underpin variation in the abundance of transcripts (eQTLs; Cookson et al.
2009), proteins (Foss et al. 2007), or metabolites (Nicholson et al. 2011) across
individuals. When studying a single trait, as in GWAS, it is possible to consider
multiple statistical models to explicitly account for additive and/or dominant genetic
effects (Laird and Lange 2011). However, when the possible effects of a million or
more SNPs on tens of thousands of molecular abundance traits need to be tested, as
is common in modern genetics of gene expression studies, the computational cost of
testing SNP–trait associations one by one becomes prohibitive. To address this
problem, new methods have been developed to calculate the test statistics for the
parametric linear regression and analysis of variance (ANOVA) models (Shabalin
2012) and the nonparametric ANOVA model (or Kruskal–Wallis test) (Qi et al.
2014) using fast matrix multiplication algorithms, implemented in the software
matrix eQTL (http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/)
(Shabalin 2012) and kruX (https://github.com/tmichoel/krux) (Qi et al. 2014).
In both software packages, genotype values of s genetic markers and expression levels of
k transcripts, proteins, or metabolites in n individuals are organized in an s × n
genotype matrix G and a k × n expression data matrix X. Genetic markers take values
0, 1, …, ℓ, where ℓ is the maximum number of alleles (ℓ = 2 for biallelic markers),
whereas molecular traits take continuous values. In the linear model, a linear relation
is tested between the expression level of gene i and the genotype value (i.e., the
number of reference alleles) of SNP j. The corresponding test statistic is the Pearson
correlation between the ith row of X and the jth row of G, for all values of i and j.
Standardizing the data matrices to zero mean and unit variance, such that for all i
and j,

\[ \sum_{l=1}^{n} X_{il} = \sum_{l=1}^{n} G_{jl} = 0 \quad \text{and} \quad \sum_{l=1}^{n} X_{il}^{2} = \sum_{l=1}^{n} G_{jl}^{2} = n, \]

it follows that the correlation values can be computed, up to a constant factor 1/n, as

\[ R_{ij} = \sum_{l=1}^{n} X_{il} G_{jl} = \left( X G^{T} \right)_{ij}, \]

where $G^{T}$ denotes the transpose of G. Hence, a single matrix multiplication suffices
to compute the test statistics for the linear model for all pairs of traits and SNPs.
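As a minimal illustration of this idea, the sketch below computes all gene–SNP Pearson correlations with one matrix product in NumPy. It is not the Matrix eQTL or kruX implementation; the function names and the random toy data are assumptions made for the example, and monomorphic markers (constant rows of G) would have to be removed beforehand because they cannot be standardized.

```python
import numpy as np

def standardize_rows(M):
    """Center each row and scale it so that its sum of squares equals n."""
    M = np.asarray(M, dtype=float)
    M = M - M.mean(axis=1, keepdims=True)
    return M / M.std(axis=1, keepdims=True)   # population standard deviation

def all_pairs_correlation(X, G):
    """Pearson correlation of every row of X (k x n) with every row of G (s x n)."""
    n = X.shape[1]
    # One k x s matrix product; dividing by n turns the sums into correlations.
    return standardize_rows(X) @ standardize_rows(G).T / n

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(5, 50))   # toy genotypes: 5 SNPs, 50 individuals
X = rng.normal(size=(4, 50))           # toy expression: 4 genes, 50 individuals
R = all_pairs_correlation(X, G)        # R[i, j]: gene i versus SNP j
```

Each entry of `R` agrees with the value obtained by correlating the corresponding pair of rows individually, for instance with `np.corrcoef(X[i], G[j])[0, 1]`.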
The ANOVA models test if expression levels in different genotype groups origi-
nate from the same distribution. Therefore, ANOVA models can account for both
additive and dominant effects of a genetic variant on expression levels. In the parametric
ANOVA model, suppose the test samples are divided into ℓ + 1 groups by the
SNP j. The mean expression level for gene i in each group m can be written as

\[ \overline{X}_{i}^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l \,:\, G_{jl} = m\}} X_{il}, \]

where $n^{(m,j)}$ is the number of samples in genotype group m for SNP j.
Again assuming that the expression data are standardized, the F-test statistic for
testing gene i against SNP j can be written as

\[ F_{i}^{(j)} = \frac{n - \ell - 1}{\ell} \, \frac{SS_{i}^{(j)}}{n - SS_{i}^{(j)}}, \]

where $SS_{i}^{(j)}$ is the sum of squares between groups,

\[ SS_{i}^{(j)} = \sum_{m=0}^{\ell} n^{(m,j)} \left( \overline{X}_{i}^{(m,j)} \right)^{2} . \]

Let us define the n × s indicator matrix $I^{(m)}$ for genotype group m, i.e., $I^{(m)}_{lj} = 1$
if $G_{jl} = m$ and 0 otherwise. Then

\[ \sum_{\{l \,:\, G_{jl} = m\}} X_{il} = \left( X I^{(m)} \right)_{ij} . \]

Hence, for each pair of expression level $X_i$ and SNP $G_j$, the sum of squares matrix
$SS_{i}^{(j)}$ can be computed via ℓ matrix multiplications. Only ℓ rather than ℓ + 1
multiplications are needed, because every sample belongs to exactly one genotype group,
so that the data standardization implies

\[ X I^{(0)} = - \sum_{m=1}^{\ell} X I^{(m)} . \]

In the nonparametric ANOVA model, the expression data matrix is converted to
a matrix T of data ranks, independently over each row. In the absence of ties, the
Kruskal–Wallis test statistic is given by

\[ S_{ij} = \frac{12}{n(n+1)} \sum_{m=0}^{\ell} n^{(m,j)} \left( \overline{T}_{i}^{(m,j)} \right)^{2} - 3(n+1), \]

where $\overline{T}_{i}^{(m,j)}$ is the average expression rank of gene i in genotype group m of SNP j,
defined as

\[ \overline{T}_{i}^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l \,:\, G_{jl} = m\}} T_{il}, \]

which can be obtained similarly from ℓ matrix multiplications, using the fact that each
row of T sums to n(n + 1)/2.
There is as yet no consensus about which statistical model is most appropriate
for eQTL detection. Nonparametric methods were introduced in the earliest eQTL
studies (Brem et al. 2002; Schadt et al. 2008) and have remained popular, as they are
robust against variations in the underlying genetic model and trait distribution.
More recently, the linear model implemented in matrix eQTL has been used in a
number of large-scale studies (Ardlie et al. 2015; Lappalainen et al. 2013). A comparison
on a data set of 102 human whole blood samples showed that the parametric
ANOVA method was highly sensitive to the presence of outlying gene expression
values and SNPs with singleton genotype groups. Linear models reported the highest
number of eQTL associations after empirical False Discovery Rate (FDR) correction,
with an expected bias toward additive linear associations. The Kruskal–Wallis
test was most robust against data outliers and heterogeneous genotype group sizes
and detected a higher proportion of nonlinear associations but was more conservative
for calling additive linear associations than linear models (Qi et al. 2014).
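The parametric and nonparametric ANOVA statistics described in this section share the same building block, a weighted sum of squared group means obtained from indicator-matrix products. The NumPy sketch below illustrates this under stated assumptions: it is not the Matrix eQTL or kruX code, all function names are hypothetical, the expression matrix passed to `anova_f` must already be standardized, ranks are computed assuming no ties, and the degrees of freedom in `anova_f` presume that every genotype group is non-empty.

```python
import numpy as np

def weighted_squared_group_means(Y, G, ell):
    """k x s matrix of sum_m n^(m,j) * (mean of row i of Y in group m of SNP j)^2.

    Y: k x n data matrix; G: s x n integer genotype matrix with values 0..ell.
    """
    n = Y.shape[1]
    total = Y.sum(axis=1, keepdims=True)           # k x 1 row sums of Y
    rest_sum = np.zeros((Y.shape[0], G.shape[0]))
    rest_size = np.zeros(G.shape[0])
    out = np.zeros_like(rest_sum)
    for m in range(1, ell + 1):
        I_m = (G.T == m).astype(float)             # n x s indicator matrix I^(m)
        group_sum = Y @ I_m                        # one matrix multiplication per group
        group_size = I_m.sum(axis=0)
        out += group_sum**2 / np.maximum(group_size, 1.0)   # empty groups add zero
        rest_sum += group_sum
        rest_size += group_size
    # Group 0 follows from the row totals, so it needs no extra multiplication.
    out += (total - rest_sum)**2 / np.maximum(n - rest_size, 1.0)
    return out

def anova_f(X, G, ell):
    """F statistics for all gene-SNP pairs; rows of X have mean 0 and sum of squares n."""
    n = X.shape[1]
    ss = weighted_squared_group_means(X, G, ell)
    return (n - ell - 1) / ell * ss / (n - ss)

def kruskal_wallis(X, G, ell):
    """Kruskal-Wallis statistics for all gene-SNP pairs (no ties assumed)."""
    n = X.shape[1]
    T = X.argsort(axis=1).argsort(axis=1) + 1.0    # ranks 1..n within each row
    return 12.0 / (n * (n + 1)) * weighted_squared_group_means(T, G, ell) - 3.0 * (n + 1)

rng = np.random.default_rng(2)
G = rng.integers(0, 3, size=(5, 60))               # toy biallelic genotypes (ell = 2)
X = rng.normal(size=(4, 60))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
F = anova_f(X, G, ell=2)
S = kruskal_wallis(X, G, ell=2)
```

Both `F` and `S` are k × s matrices, so the statistic for gene i and SNP j is read off as `F[i, j]` or `S[i, j]`.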
In summary, when large numbers of traits and markers have to be tested for asso-
ciation, efficient matrix multiplication methods can be used to calculate all test sta-
tistics at once, leading to a dramatic reduction in computation time compared with
calculating these statistics one by one for every pair using traditional methods.
Matrix multiplication is a basic mathematical operation, which has been intensively
studied and optimized for decades (Golub and Van Loan 1996). Highly efficient
packages, such as BLAS (http://www.netlib.org/blas/) and LAPACK
(http://www.netlib.org/lapack/), are available for use on generic CPUs and are indeed used
in most mainstream scientific computing software and programming languages,
such as Matlab and R. In recent years, graphics processing unit (GPU)-accelerated
computing, such as CUDA, has revolutionized scientific calculations that involve
repetitive operations in parallel on bulky data, offering even more speedup than the
existing CPU-based packages. The first applications of GPU computing in eQTL
analysis have already appeared (e.g., Hemani et al. 2014), and more can be expected
in the future.
Lastly, for pairs exceeding a predefined threshold on the test statistic, a p-value
can be computed from the corresponding test distribution, and these p-values can
then be further corrected for multiple testing by common procedures (Shabalin
2012; Qi et al. 2014).
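The text does not name a specific correction procedure. As one common choice, the sketch below applies the Benjamini–Hochberg false discovery rate adjustment; the function name and the `n_tests` parameter are assumptions made for illustration. Because p-values are only computed for pairs exceeding a threshold, `n_tests` lets the caller supply the total number of gene–SNP pairs tested, in which case the adjusted values for the retained pairs are conservative upper bounds.

```python
import numpy as np

def benjamini_hochberg(pvalues, n_tests=None):
    """Benjamini-Hochberg adjusted p-values.

    pvalues: the p-values that were actually computed (e.g., the smallest ones);
    n_tests: total number of tests performed; defaults to len(pvalues).
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size if n_tests is None else n_tests
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, p.size + 1)       # p_(i) * m / i
    # Enforce monotonicity, starting from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(p.size)
    out[order] = np.minimum(adjusted, 1.0)
    return out

q = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60])
```

Pairs whose adjusted value falls below the desired false discovery rate would then be reported as eQTL associations.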

3 Coexpression Networks and Modules

3.1 Coexpression Gene Networks

The Pearson correlation is the simplest and computationally most efficient similar-
ity measure for gene expression profiles. For genes i and j, their Pearson correlation
can be written as
$$C_{ij} = \sum_{l=1}^{n} X_{il} X_{jl}. \qquad (1)$$

In matrix notation, this can be combined as the matrix multiplication

$$C = X X^{T}.$$

Gene pairs with large positive or negative correlation values tend to be up- or down-­
regulated together due to either a direct regulatory link between them or being
jointly coregulated by a third, often hidden, factor. By filtering for correlation values
exceeding a significance threshold determined by comparison with randomly
permuted data, a discrete coexpression network is obtained. Assuming that a high
degree of coexpression signifies that genes are involved in the same biological pro-
cesses, graph theoretical methods can be used, for instance, to predict gene function
(Sharan et al. 2007).
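
The construction of a discrete coexpression network can be sketched as follows. This is a minimal example, assuming that Eq. (1) is applied to expression profiles that have been centered and scaled to unit norm per gene, so that the matrix product equals the Pearson correlation matrix. The function names, the number of permutations, and the quantile used as threshold are illustrative choices, not values from the text.

```python
import numpy as np

def standardize_rows(X):
    """Center each gene (row) and scale it to unit Euclidean norm."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def coexpression_network(X, n_perm=20, quantile=0.999, seed=0):
    """Return the correlation matrix C, a boolean adjacency matrix, and the threshold."""
    rng = np.random.default_rng(seed)
    Z = standardize_rows(X)
    C = Z @ Z.T                       # all pairwise correlations at once
    upper = np.triu_indices(C.shape[0], 1)
    null = []
    for _ in range(n_perm):
        # Permute samples independently per gene to break coexpression
        Zp = np.array([rng.permutation(row) for row in Z])
        null.append(np.abs((Zp @ Zp.T)[upper]))
    threshold = np.quantile(np.concatenate(null), quantile)
    A = np.abs(C) > threshold
    np.fill_diagonal(A, False)        # no self-edges
    return C, A, threshold
```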
One drawback of the Pearson correlation is that by definition, it is biased toward
linear associations. To overcome this limitation, other measures are available. The
Spearman correlation uses expression data ranks (cf. Section 2) in Eq. (1) and will
give a high score to monotonic relations. Mutual information is the most general
measure and detects both linear and nonlinear associations. For a pair of discrete
random variables A and B (representing the expression levels of two genes) taking
values $a_l$ and $b_m$, respectively, the mutual information is defined as

$$MI(A, B) = H(A) + H(B) - H(A, B),$$

where

$$H(A) = -\sum_{l} P(a_l) \log P(a_l),$$

$$H(B) = -\sum_{m} P(b_m) \log P(b_m),$$

$$H(A, B) = -\sum_{l,m} P(a_l, b_m) \log P(a_l, b_m),$$

are the individual and joint Shannon entropies of A and B, and $P(a_l) = P(A = a_l)$,
and likewise for the other terms. Because gene expression data are continuous,
mutual information estimation is nontrivial and usually involves some form of dis-
cretization (Daub et al. 2004). Mutual information has been successfully used as a
coexpression measure in a variety of contexts (Butte and Kohane 2000; Basso et al.
2005; Faith et al. 2007).
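
A minimal sketch of a discretization-based estimate is given below. It uses equal-width binning and plug-in entropies, which is only one of several possible estimators and is not claimed to be the one used in the cited studies. The function name and the default number of bins are assumptions.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Plug-in mutual information estimate (in nats) after equal-width binning."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()        # joint probabilities P(a_l, b_m)
    p_a = p_ab.sum(axis=1)            # marginal P(a_l)
    p_b = p_ab.sum(axis=0)            # marginal P(b_m)

    def entropy(p):
        p = p[p > 0]                  # 0 log 0 is taken as 0
        return -np.sum(p * np.log(p))

    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())
```

For a symmetric quadratic relation such as b = a^2, the Pearson correlation is close to zero while this estimate is clearly positive, which illustrates the sensitivity to nonlinear associations mentioned above.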

3.2 Clustering and Coexpression Module Detection

It is generally understood that cellular functions are carried out by “modules,”
groups of molecules that operate together and whose function is separable from that
of other modules (Hartwell et al. 1999). Clustering gene expression data (i.e., divid-
ing genes into discrete groups on the basis of similarities in their expression pro-
files) is a standard approach to detect such functionally coherent gene modules. The
literature on gene expression clustering is vast and cannot possibly be reviewed
comprehensively here. It includes “standard” methods such as hierarchical cluster-
ing (Eisen et al. 1998), k-means (Tavazoie et al. 1999), graph-based methods that
operate directly on coexpression networks (Sharan and Shamir 2000), and model-­
based clustering algorithms which assume that the data are generated by a mixture
of probability distributions, one for each cluster (Medvedovic and Sivaganesan
2002). Here we briefly describe a few recently developed methods with readily
available software.

3.2.1 Modularity Maximization

Modularity maximization is a network-clustering method that is particularly popular
in the physical and social sciences, based on the assumption that intramodule
connectivity should be much denser than intermodule connectivity (Newman and
Girvan 2004; Newman 2006). In the context of coexpression networks, this method
can be used to identify gene modules directly from the correlation matrix C (Ayroles
et al. 2009). Suppose the genes are grouped into N modules $M_l$, $l = 1, \dots, N$. Each
module $M_l$ is a nonempty set that can contain any combination of the genes
$i = 1, \dots, k$, but each gene is contained in exactly one module. Also define $M_0$ as the
set containing all genes. The modularity score function is defined as

$$S(M) = \sum_{l=1}^{N} \left( \frac{W(M_l, M_l)}{W(M_0, M_0)} - \left( \frac{W(M_l, M_0)}{W(M_0, M_0)} \right)^{2} \right),$$

where $W(A, B) = \sum_{i \in A,\, j \in B,\, i \ne j} w(C_{ij})$ is a weight function, summing over all the edges
that connect one vertex in A with another vertex in B, and w(x) is a monotonic
function to map correlation values to edge strengths. Common functions are
$w(x) = |x|$, $|x|^{\beta}$ (power law) (Langfelder and Horvath 2008), $e^{\beta |x|}$ (exponential)
(Ayroles et al. 2009), or $1 / (1 + e^{\beta x})$ (sigmoid) (Lee et al. 2009).


A modularity maximization software package particularly suited for large networks is Fast
Modularity (http://www.cs.unm.edu/aaron/research/fastmodularity.htm) (Clauset
et al. 2004).
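
To make the score function concrete, the sketch below evaluates S(M) for a given hard partition of the genes. It only scores a partition and does not search for the optimum, which is what dedicated software does. The function name, the `labels` encoding of the partition, and the default edge weight w(x) = |x| are assumptions for this example.

```python
import numpy as np

def modularity_score(C, labels, w=np.abs):
    """Modularity S(M) of a partition; labels[i] is the module index of gene i."""
    Wmat = np.array(w(np.asarray(C, dtype=float)), dtype=float)
    np.fill_diagonal(Wmat, 0.0)       # the weight sums exclude i == j
    labels = np.asarray(labels)
    total = Wmat.sum()                # W(M_0, M_0)
    score = 0.0
    for l in np.unique(labels):
        members = labels == l
        within = Wmat[np.ix_(members, members)].sum()   # W(M_l, M_l)
        to_all = Wmat[members, :].sum()                 # W(M_l, M_0)
        score += within / total - (to_all / total) ** 2
    return score
```

On a correlation matrix made of two perfectly separated blocks, the partition that follows the blocks scores higher than one that mixes them, and placing all genes in a single module gives a score of zero.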

Markov Cluster Algorithm


The Markov cluster (MCL) algorithm is a graph-based clustering algorithm, which
emulates random walks among gene vertices to detect clusters in a graph obtained
directly from the coexpression matrix C. It is implemented in the MCL software
(http://micans.org/mcl/) (Van Dongen 2001; Enright et al. 2002). The MCL algo-
rithm starts with the correlation matrix C as the probability flow matrix of a random
walk and then iteratively suppresses weak structures of the network and performs a
multistep random walk. In the end, only backbones of the network structure remain,
essentially capturing the modules of the coexpression network. To be precise, the MCL
algorithm performs the following two operations on C alternatingly:

• Inflation: The algorithm first contrasts stronger direct connections against weaker
ones, using an element-wise power law transformation, and normalizes each column
separately to sum to one, such that the element $C_{ij}$ corresponds to the
dissipation rate from vertex $X_i$ to $X_j$ in a single step. The inflation operation hence
updates C as $C \to \Gamma_\alpha C$, where the contrast rate $\alpha > 1$ is a predefined parameter
of the algorithm. After operation $\Gamma_\alpha$, each element of C becomes

$$C_{ij} \to \Gamma_\alpha C_{ij} = \frac{C_{ij}^{\alpha}}{\sum_{p=1}^{k} C_{pj}^{\alpha}}.$$

• Expansion: The probability flow matrix C controls the random walks performed
in the expansion phase. After some integer $\beta \ge 2$ steps of random walk, gene
pairs with strong direct connections and/or strong indirect connections through
other genes tend to see more probability flow exchanges, suggesting higher
probabilities of belonging to the same gene modules. The expansion operation for the
β-step random walk corresponds to the matrix power operation

$$C \to C^{\beta}.$$
The MCL algorithm performs the above two operations iteratively until conver-
gence. Nonzero entries in the convergent matrix C connect gene pairs belonging to
the same cluster, whereas all inter-cluster edges attain the value zero, so that cluster
structure can be obtained directly from this matrix (Van Dongen 2001; Enright et al.
2002).
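
The alternation of inflation and expansion can be sketched in a few lines. This is a simplified illustration, not the MCL software itself: it assumes that absolute correlations are used as nonnegative flow weights, that the matrix is small enough for dense matrix powers, and that clusters are read off as connected components of the nonzero pattern of the converged matrix. The function name, tolerances, and defaults are assumptions.

```python
import numpy as np

def mcl(C, alpha=2.0, beta=2, max_iter=100, tol=1e-8, eps=1e-6):
    """Return a cluster label per gene from a simplified Markov cluster iteration."""
    M = np.abs(np.asarray(C, dtype=float))    # assumption: |C_ij| as flow weight
    M = M / M.sum(axis=0, keepdims=True)      # columns sum to one
    for _ in range(max_iter):
        prev = M
        M = M ** alpha                         # inflation: element-wise power ...
        M = M / M.sum(axis=0, keepdims=True)   # ... followed by column normalization
        M = np.linalg.matrix_power(M, beta)    # expansion: beta-step random walk
        if np.abs(M - prev).max() < tol:
            break
    # Nonzero entries connect genes of the same cluster: label connected components
    linked = (M > eps) | (M.T > eps)
    k = M.shape[0]
    labels = -np.ones(k, dtype=int)
    n_clusters = 0
    for start in range(k):
        if labels[start] >= 0:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(linked[i]):
                if labels[j] < 0:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    return labels
```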

Weighted Gene Coexpression Network Analysis


With higher than average correlation or edge densities within clusters, genes from
the same cluster typically share more neighboring (i.e., correlated) genes. The
weighted number of shared neighboring genes hence can be another measure of
gene function similarity. This information is captured in the so-called topological
overlap matrix Ω, first defined by Ravasz et al. (2002) for binary networks as

$$\omega_{ij} = \frac{A_{ij} + \sum_{u} A_{iu} A_{uj}}{\min(k_i, k_j) + 1 - A_{ij}},$$

where A is the (binary) adjacency matrix of the network and $k_i = \sum_{u} A_{iu}$ is the
connectivity of vertex $X_i$. The $\sum_{u} A_{iu} A_{uj}$ term represents vertex similarity through
neighboring genes, and the remaining terms normalize the output such that $0 \le \omega_{ij} \le 1$. This concept
was later extended to networks with weighted edges by applying a “soft threshold”
preprocessing step to the correlation matrix, for example, as

$$A_{ij} = \left( \frac{1 + C_{ij}}{2} \right)^{\alpha},$$

or

$$A_{ij} = |C_{ij}|^{\alpha},$$

such that $0 \le A_{ij} \le 1$ (Zhang and Horvath 2005). Note that in the first case, only
positive correlations have high edge weight, whereas in the second case, positive
and negative correlations are treated equally. The parameter $\alpha > 1$ is determined
such that the weighted network with adjacency matrix A has approximately a
scale-free degree distribution (Zhang and Horvath 2005).
In principle, any clustering algorithm (including the aforementioned ones) can
be applied to the topological overlap matrix Ω. In the popular WGCNA software
(http://labs.genetics.ucla.edu/horvath/htdocs/CoexpressionNetwork/Rpackages/
WGCNA/) (Langfelder and Horvath 2008), which is a multipurpose toolbox for
network analysis, hierarchical clustering with a dynamic tree-cut algorithm
(Langfelder et al. 2008) is used.
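
The soft-threshold and topological overlap formulas translate directly into matrix operations. The sketch below is a minimal NumPy illustration and not the WGCNA implementation. The function names, the default exponent, the removal of self-edges, and the convention of setting the diagonal of the overlap matrix to one are assumptions.

```python
import numpy as np

def soft_threshold(C, alpha=6.0, signed=False):
    """Weighted adjacency from correlations: ((1 + C) / 2)^alpha or |C|^alpha."""
    C = np.asarray(C, dtype=float)
    A = ((1.0 + C) / 2.0) ** alpha if signed else np.abs(C) ** alpha
    np.fill_diagonal(A, 0.0)          # assumption: no self-edges
    return A

def topological_overlap(A):
    """Topological overlap matrix for an adjacency matrix with zero diagonal."""
    A = np.asarray(A, dtype=float)
    shared = A @ A                    # sum_u A_iu A_uj for all pairs at once
    k = A.sum(axis=1)                 # connectivities k_i
    denom = np.minimum.outer(k, k) + 1.0 - A
    omega = (A + shared) / denom
    np.fill_diagonal(omega, 1.0)      # convention: each gene fully overlaps itself
    return omega
```

The resulting matrix (or one minus it, as a dissimilarity) could then be passed to any clustering routine, in line with the remark above.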

Model-Based Clustering
Model-based clustering approaches assume that the observed data are generated by
a mixture of probability distributions, one for each cluster, and explicitly take into
account the noise of gene expression data. To infer model parameters and cluster
assignments, techniques such as expectation maximization (EM) or Gibbs sampling
are used (Liu 2002). A recently developed method assumes that the expression lev-
els of genes in a cluster are random samples drawn from a mixture of normal distri-
butions, where each mixture component corresponds to a clustering of samples for
that module, i.e., it performs a two-way co-clustering operation (Joshi et al. 2008).
The method is available as part of the Lemon-Tree package (https://github.com/
eb00/lemon-tree) and has been successfully used in a variety of applications (Bonnet
et al. 2015).

The co-clustering is carried out by a Gibbs sampler, which iteratively updates the
assignment of each gene and, within each gene cluster, the assignment of each
experimental condition. The co-clustering operation results in the full posterior
distribution, which can be written as

$$p(\mathcal{C} \mid X) \propto \prod_{l=1}^{N} \prod_{u=1}^{L_l} \iint p(\mu, \tau) \prod_{i \in M_l} \prod_{m \in \mathcal{E}_{l,u}} p(X_{im} \mid \mu, \tau) \, d\mu \, d\tau,$$

where $\mathcal{C} = \{M_l, \mathcal{E}_{l,u} : l = 1, \dots, N;\ u = 1, \dots, L_l\}$ is a co-clustering consisting of N gene
modules $M_l$, each of which has a set of $L_l$ sample clusters $\mathcal{E}_{l,u}$; $p(X_{im} \mid \mu, \tau)$ is
a normal distribution function with mean μ and precision τ, and p(μ, τ) is a
noninformative normal-gamma prior. Detailed investigations of the convergence
properties of the Gibbs sampler showed that the best results are obtained by deriving
consensus clusters from multiple independent runs of the sampler. In the Lemon-­
Tree package, consensus clustering is performed by a novel spectral graph cluster-
ing algorithm (Michoel and Nachtergaele 2012) applied to the weighted graph of
pairwise frequencies with which two genes are assigned to the same gene module
(Bonnet et al. 2015).
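
The input to such a consensus step, the weighted graph of pairwise co-assignment frequencies, can be sketched as follows. This only builds the frequency matrix from several clustering runs. The spectral clustering applied to it in Lemon-Tree is not reproduced here, and the function name and input layout are assumptions.

```python
import numpy as np

def coassignment_frequencies(runs):
    """Fraction of runs in which each gene pair is assigned to the same module.

    runs has shape (n_runs, n_genes); runs[r, i] is the module label of
    gene i in run r. Labels need not be comparable across runs.
    """
    runs = np.asarray(runs)
    same = runs[:, :, None] == runs[:, None, :]   # (n_runs, n_genes, n_genes)
    return same.mean(axis=0)
```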

4 Causal Gene Networks

4.1 Using Genotype Data to Prioritize Edge Directions in Coexpression Networks

Pairwise correlations between gene expression traits define undirected coexpression
networks. Several studies have shown that pairs of gene expression traits can be
causally ordered using genotype data (Zhu et al. 2004; Chen et al. 2007; Aten et al.
2008; Schadt et al. 2005; Neto et al. 2008, 2013; Millstein et al. 2009). Although
varying in their statistical details, these methods conclude that gene A is causal for
gene B, if the expression of B associates significantly with A’s eQTLs, and this
association is abolished by conditioning on the expression of A and on any other
known confounding factors. In essence, this is the principle of “Mendelian random-
ization,” first introduced in epidemiology as an experimental design to detect causal
effects of environmental exposures on human health (Smith and Ebrahim 2003),
applied to gene expression traits.
To illustrate how these methods work, let A and B be two random variables rep-
resenting two gene expression traits, and let E be a random variable representing a
SNP, which is an eQTL for genes A and B. Because genotype cannot be altered by
gene expression (i.e., E cannot have any incoming edges), there are three possible
regulatory models to explain the joint association of E to A and B:

1. $E \to A \to B$: the association of E to B is indirect and due to a causal interaction
from A to B.
2. $E \to B \to A$: the same, with the roles of A and B reversed.
3. $A \leftarrow E \to B$: A and B are independently associated to E.

To determine if gene A mediates the effect of SNP E on gene B (model 1), one
can test whether conditioning on A abolishes the correlation between E and B, using
the partial correlation coefficient

$$\mathrm{cor}(E, B \mid A) = \frac{\mathrm{cor}(E, B) - \mathrm{cor}(E, A)\,\mathrm{cor}(B, A)}{\sqrt{\left(1 - \mathrm{cor}(E, A)^{2}\right)\left(1 - \mathrm{cor}(B, A)^{2}\right)}}.$$

If model 1 is correct, then $\mathrm{cor}(E, B \mid A)$ is expected to be zero, and this can be
tested, for example, using Fisher’s Z transform to assess the significance of a sample
correlation coefficient. The same approach can be used to test model 2, and if nei-
ther is significant, it is concluded that no inference on the causal direction between
A and B can be made (using SNP E), i.e., that model 3 is correct. For more details,
see Aten et al. (2008), who have implemented this approach in the NEO software
(http://labs.genetics.ucla.edu/horvath/htdocs/aten/NEO/).
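
The conditional independence test described above can be sketched as follows. This is a minimal illustration and not the NEO implementation. The function names and the use of a normal approximation for the Fisher-transformed partial correlation (with one conditioning variable subtracted from the degrees of freedom) are assumptions of this example.

```python
import math
import numpy as np

def partial_correlation(e, b, a):
    """cor(E, B | A) computed from the three pairwise Pearson correlations."""
    r_eb = np.corrcoef(e, b)[0, 1]
    r_ea = np.corrcoef(e, a)[0, 1]
    r_ba = np.corrcoef(b, a)[0, 1]
    return (r_eb - r_ea * r_ba) / math.sqrt((1 - r_ea ** 2) * (1 - r_ba ** 2))

def fisher_z_pvalue(r, n, n_conditioned=0):
    """Two-sided p-value for H0: correlation = 0, using Fisher's Z transform."""
    z = math.atanh(r) * math.sqrt(n - n_conditioned - 3)
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
```

To test model 1, one would evaluate `partial_correlation(e, b, a)`; swapping the last two arguments gives the corresponding quantity for model 2.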
Other approaches are based on the same principle but use statistical model selec-
tion to identify the most likely causal model, with the probability density functions
(PDF) for the models as follows:

• p ( E , A, B ) = p ( E ) p ( A | E ) p ( B | A ) ,
• p ( E , A, B ) = p ( E ) p ( B | E ) p ( A | B ) ,
• p ( E , A, B ) = p ( E ) p ( A | E ) p ( B | E , A ) ,

where the dependence on A in the last term of the last model indicates that there
may be a residual correlation between B and A not explained by E. The minimal
additive model assumes the distributions are (Schadt et al. 2005)

$$E \sim \mathrm{Bernoulli}(q),$$

$$A \mid E \sim N\left(\mu_{A|E},\ \sigma_A^{2}\right),$$

$$B \mid A \sim N\left(\mu_B + \rho \frac{\sigma_B}{\sigma_A}(A - \mu_A),\ (1 - \rho^{2})\,\sigma_B^{2}\right),$$

$$B \mid E, A \sim N\left(\mu_{B|E} + \rho \frac{\sigma_B}{\sigma_A}(A - \mu_{A|E}),\ (1 - \rho^{2})\,\sigma_B^{2}\right),$$

so that E follows a Bernoulli distribution, A | E follows a normal distribution
whose mean depends on E, and B | A follows a conditional normal distribution
whose mean and variance depend in part on A. For B | E, A, the mean
of B also depends on E. The parameters of all distributions can be estimated by
maximum likelihood, and the model with the highest likelihood is selected as
the most likely causal model. The number of free parameters can be accounted
for using penalties such as the Akaike information criterion (AIC) (Schadt et al.
2005).
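
A simplified version of this model selection can be sketched with ordinary least-squares fits. The example below is not the procedure of Schadt et al. (2005): it assumes that every conditional distribution is a linear Gaussian regression, omits the factor p(E) because it is common to all three models, and compares the models by AIC. All function and model names are illustrative.

```python
import numpy as np

def gaussian_loglik(y, predictors):
    """Maximized log-likelihood and parameter count of y ~ N(b0 + sum_j b_j x_j, s^2)."""
    n = len(y)
    D = np.column_stack([np.ones(n)] + list(predictors))
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    sigma2 = np.mean((y - D @ coef) ** 2)          # ML variance estimate
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return loglik, D.shape[1] + 1                  # coefficients + variance

def select_causal_model(e, a, b):
    """Return the name of the model with the lowest AIC and all AIC values."""
    models = {
        "E->A->B": [(a, [e]), (b, [a])],       # p(A|E) p(B|A)
        "E->B->A": [(b, [e]), (a, [b])],       # p(B|E) p(A|B)
        "A<-E->B": [(a, [e]), (b, [e, a])],    # p(A|E) p(B|E,A)
    }
    aic = {}
    for name, factors in models.items():
        loglik, n_params = 0.0, 0
        for y, predictors in factors:
            ll, p = gaussian_loglik(y, predictors)
            loglik += ll
            n_params += p
        aic[name] = 2 * n_params - 2 * loglik
    return min(aic, key=aic.get), aic
```

Because the third model contains the first as a special case, its likelihood is never lower, which is why a penalty on the number of parameters is needed before the two can be compared.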
The approach has been extended in various ways. In the study of Chen et al.
(2007), likelihood ratio tests, comparison to randomly permuted data, and false
discovery rate estimation techniques are used to convert the three model scores into a
single probability value $P(A \to B)$ for a causal interaction from gene A to B. This
method is available in the Trigger software
(https://www.bioconductor.org/packages/release/bioc/html/trigger.html). In the studies of
Millstein et al. (2009) and Neto et al. (2013), the model selection task is recast into a
single hypothesis test, using F-tests and Vuong’s model selection test, respectively,
resulting in a significance p-value for each gene–gene causal interaction.
It should be noted that all of these approaches suffer from limitations due to their
inherent model assumptions. In particular, the presence of unequal levels of mea-
surement noise among genes, or of hidden regulatory factors causing additional
correlation among genes, can confuse causal inference. For example, an excessive
error level in the expression data of gene A may cause the true structure
$E \to A \to B$ to be mistaken for $E \to B \to A$. These limitations are discussed by Rockman (2008)
and Li et al. (2010).

4.2 Using Bayesian Networks to Identify Causal Regulatory Mechanisms

Bayesian networks are probabilistic graphical models that encode conditional
dependencies between random variables in a directed acyclic graph (DAG).
Although Bayesian networks cannot fully reflect certain pathways in gene regulation,
such as self-regulation or feedback loops, they still serve as a popular method
for modeling gene regulation networks, as they provide a clear methodology for
learning statistical dependency structures from possibly noisy data (Friedman et al.
1999a, 2000; Koller and Friedman 2009).
We adopt our previous convention in Section 2, where we have the gene expression
data X and genetic markers G. The model contains a total of k vertices (i.e.,
random variables), $X_i$ with $i = 1, \dots, k$, corresponding to the expression level of gene
i. Given a DAG $\mathcal{G}$, and denoting the parental vertex set of $X_i$ by $\mathrm{Pa}^{(\mathcal{G})}(X_i)$, the
acyclic property of $\mathcal{G}$ allows us to define the joint probability distribution function as

$$p(X_1, \dots, X_k \mid \mathcal{G}) = \prod_{i=1}^{k} p\left(X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i)\right). \qquad (2)$$

In its simplest form, we model the conditional distributions as

$$p\left(X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i)\right) = N\left(\alpha_i + \sum_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} \beta_{ji}\,(X_j - \alpha_j),\ \sigma_i^{2}\right),$$

where $(\alpha_i, \sigma_i)$ and $\beta_{ji}$ are parameters for vertex $X_i$ and edge $X_j \to X_i$, respectively, as
part of the DAG structure $\mathcal{G}$. Under such modeling, the Bayesian network is called
a linear Gaussian network.
The likelihood of data X given the graph $\mathcal{G}$ is

$$p(X \mid \mathcal{G}) = \prod_{i=1}^{k} \prod_{l=1}^{n} p\left(X_{il} \mid \left\{X_{jl} : X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)\right\}\right).$$

Using Bayes’ rule, the log posterior probability of the DAG $\mathcal{G}$ given the gene expression
data X becomes

$$\log p(\mathcal{G} \mid X) = \log p(X \mid \mathcal{G}) + \log p(\mathcal{G}) - \log p(X),$$

where $p(\mathcal{G})$ is the prior probability for $\mathcal{G}$, and p(X) is a constant when the expression
data are provided, so the follow-up calculations do not rely on it.
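
For a linear Gaussian network, the log-likelihood decomposes into one regression term per vertex. The sketch below evaluates it at the maximum-likelihood parameter values. The function names and the encoding of the DAG as a dictionary of parent lists are assumptions, and the code does not check that the supplied graph is acyclic.

```python
import numpy as np

def node_loglik(X, i, parents):
    """Maximized Gaussian log-likelihood of gene i regressed on its parents."""
    n = X.shape[1]
    D = np.column_stack([np.ones(n)] + [X[j] for j in parents])
    coef, *_ = np.linalg.lstsq(D, X[i], rcond=None)
    sigma2 = np.mean((X[i] - D @ coef) ** 2)      # ML estimate of sigma_i^2
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def dag_loglik(X, parent_sets):
    """Sum of per-vertex terms; parent_sets maps a gene index to its parent indices.

    X has shape (k genes, n samples). Genes missing from parent_sets have no parents.
    """
    X = np.asarray(X, dtype=float)
    return sum(node_loglik(X, i, parent_sets.get(i, [])) for i in range(X.shape[0]))
```

With two variables, the graphs 0 → 1 and 1 → 0 obtain the same maximized likelihood under this model, so expression data alone cannot orient that edge.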
Typically, a locally optimal DAG is found by starting from a random graph and
randomly ascending the likelihood by adding, modifying, or removing one directed
edge at a time (Friedman et al. 1999a, 2000; Koller and Friedman 2009).
Alternatively, the posterior distribution $p(\mathcal{G} \mid X)$ can be estimated with Bayesian
inference using Markov chain Monte Carlo simulation, allowing us to estimate the
significance levels at an extra computational cost. The parameter values of α, β, and
σ, as part of $\mathcal{G}$, can be estimated with maximum likelihood.
When a Bayesian network is modified by a single edge, only the vertices whose
parent sets change require a recalculation, whereas all others remain intact.
This significantly reduces the amount of computation needed for each random step.
A further speedup is achievable if we constrain the maximum number of parents
each vertex can have, either by using the same fixed number for all nodes or by
preselecting a variable number of potential parents for each node using, for instance,
a preliminary L1-regularization step (Schmidt et al. 2007).
Two DAGs are called Markov equivalent if they result in the same PDF (Koller
and Friedman 2009). Clearly, using gene expression data alone, Bayesian networks
can only be resolved up to Markov equivalence. To break this equivalence and
uncover a more specific causal gene regulation network, genotype data are
incorporated in the model inference process. The most straightforward approach is
to use any of the methods in the previous section to calculate the probability
$P(X_i \to X_j)$ of a causal interaction from $X_i$ to $X_j$ (Zhu et al. 2004, 2008, 2012;
Zhang et al. 2013), for example, by defining the prior as

$$p(\mathcal{G}) = \prod_{X_i} \left( \prod_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} P(X_j \to X_i) \prod_{X_j \notin \mathrm{Pa}^{(\mathcal{G})}(X_i)} \left(1 - P(X_j \to X_i)\right) \right).$$

A more ambitious approach is to jointly learn the eQTL associations and causal trait (i.e., gene or
phenotype) networks. In the study of Neto et al. (2010), EM is used to alternatingly
map eQTLs given the current DAG structure and update the DAG structure and
model parameters given the current eQTL mapping. In the study of Scutari et al.
(2014), Bayesian networks are learned where SNPs and traits both enter as variables
in the model, with the constraint that traits can depend on SNPs, but not vice versa.
However, the additional complexity of both methods means that they are computa-
tionally expensive and have only been applied to problems with a handful of traits
(Neto et al. 2010; Scutari et al. 2014).
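
The genotype-informed structure prior mentioned above can be evaluated on the log scale as sketched below. The function name, the matrix layout (entry [j, i] refers to the edge from gene j to gene i), and the clipping of probabilities away from 0 and 1 to keep the logarithms finite are assumptions of this example.

```python
import numpy as np

def dag_log_prior(adjacency, edge_prob, eps=1e-12):
    """log p(G) from pairwise causal probabilities.

    adjacency[j, i] is True if the DAG contains the edge X_j -> X_i;
    edge_prob[j, i] is the genotype-derived probability P(X_j -> X_i).
    """
    A = np.asarray(adjacency, dtype=bool)
    P = np.clip(np.asarray(edge_prob, dtype=float), eps, 1.0 - eps)
    off_diagonal = ~np.eye(A.shape[0], dtype=bool)
    # Present edges contribute log P, absent edges contribute log(1 - P)
    terms = np.where(A, np.log(P), np.log1p(-P))
    return float(terms[off_diagonal].sum())
```

Adding this term to the data log-likelihood favors edge directions that are supported by the genotype data, which is how such a prior can separate Markov equivalent graphs.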
A few additional “tips and tricks” are worth mentioning:

• First, when the number of vertices is much larger than the sample count, we may
break the problem into independent subproblems by learning a separate Bayesian
network for each coexpression module (Section 3.2 and Zhang et al. 2013).
Dependencies between modules could then be learned as a Bayesian network
among the module eigengenes (Langfelder and Horvath 2007), although this
does not seem to have been explored.
• Second, Bayesian network learning algorithms inevitably result in locally opti-
mal models, which may contain a high number of false positives. To address this
problem, we can run the algorithm multiple times and report an averaged network,
consisting only of edges that appear sufficiently frequently.
• Finally, another technique that helps in distinguishing genuine dependencies
from false positives is bootstrapping, where resampling with replacement is
executed on the existing sample pool. A fixed number of samples are randomly
selected and then processed to predict a Bayesian network. This process is
repeated many times, essentially regarding the distribution of the sample pool as the
true PDF, and allows the robustness of each predicted edge to be estimated, so that
only those with high significance are retained (Friedman et al. 1999b). In theory,
even the whole pipeline of Fig. 1 up to the in silico validation could be simulated
in this way. Although bootstrapping is computationally expensive and mostly
suited for small data sets, it could be used in conjunction with the separation into
modules on larger data sets.
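
The bootstrap procedure in the last item can be sketched generically. In the example below, the network learner is passed in as a function. The learner shown is a deliberately simple stand-in (thresholded absolute correlation) so that the sketch stays self-contained, and it is not a Bayesian network algorithm. All names and defaults are assumptions.

```python
import numpy as np

def bootstrap_edge_confidence(X, learn_edges, n_boot=100, seed=0):
    """Fraction of bootstrap replicates in which each edge is predicted.

    X has shape (k genes, n samples); learn_edges maps such a matrix to a
    k x k array of 0/1 edge indicators.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    k, n = X.shape
    counts = np.zeros((k, k))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample samples with replacement
        counts += learn_edges(X[:, idx])
    return counts / n_boot

def correlation_edges(X, threshold=0.5):
    """Stand-in learner for illustration only: edges where |correlation| > threshold."""
    A = np.abs(np.corrcoef(X)) > threshold
    np.fill_diagonal(A, False)
    return A.astype(float)
```

Edges whose confidence falls below a chosen cutoff would then be discarded. The same wrapper could be applied per coexpression module to limit the computational cost.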

4.3 Using Module Networks to Identify Causal Regulatory Mechanisms

Module network inference is a statistically well-grounded method that uses probabilistic
graphical models to reconstruct modules of coregulated genes and their upstream
regulatory programs and that has been proven useful in many biological case studies
(Bonnet et al. 2015; Segal et al. 2003; Friedman 2004; Qu et al. 2016). The module
network model was originally introduced as a method to infer regulatory networks
from large-scale gene expression compendia, as implemented in the Genomica soft-
ware (http://genomica.weizmann.ac.il) (Segal et al. 2003). Subsequently, the method
has been extended to integrate eQTL and gene expression data (Lee et al. 2006, 2009;
Zhang et al. 2010). The module network model starts from the same formula as Eq.
(2). It is then assumed that genes belonging to the same module share the same parents
and conditional distributions; these conditional distributions are parameterized as
decision trees, with the parental genes on the internal (decision) nodes and normal
distributions on the leaf nodes (Segal et al. 2003). Recent algorithmic innovations
decouple the module assignment and tree structure learning from the parental gene
assignment and use Gibbs sampling and ensemble methods for improved module net-
work inference (Joshi et al. 2008, 2009). These algorithms are implemented in the
Lemon-Tree software (https://github.com/eb00/lemon-tree), a command line software
suite for module network inference (Bonnet et al. 2015).

4.4 Illustrative Example

We have recently identified genomewide significant eQTLs for 6500 genes in seven
tissues from the Stockholm Atherosclerosis Gene Expression (STAGE) study
(Foroughi Asl et al. 2015) and performed coexpression clustering and causal network
reconstruction (Talukdar et al. 2016). To illustrate the above concepts, we
show some results for a coexpression cluster in visceral fat (88 samples, 324 genes),
which was highly enriched for tissue development genes ($P = 5 \times 10^{-10}$) and
contained 10 genomewide significant eQTL genes and 25 transcription factors,
including eight members of the homeobox family (Fig. 2a).
A representative example of an inferred causal interaction is given by the
coexpression interaction between huntingtin-associated protein 1 (HAP1, chr17 q21.2-21.3)
and forkhead box G1 (FOXG1, chr14 q11-q13). The expression of both genes
is highly correlated ($r = 0.85$, $P = 4.4 \times 10^{-24}$, Fig. 2b). HAP1 expression shows a
significant, nonlinear association with its eQTL rs1558285 ($P = 1.2 \times 10^{-4}$); this
SNP also associates significantly with FOXG1 expression in the cross-association
test ($P = 0.0024$), but no longer after conditioning FOXG1 on HAP1 and its
own eQTL rs7160881 ($P = 0.67$) (Fig. 2c). By contrast, although FOXG1 expression
is significantly associated with its eQTL rs7160881 ($P = 0.0028$), there is no
association between this SNP and HAP1 expression ($P = 0.037$), and conditioning
on FOXG1 and HAP1’s eQTL has only a limited effect ($P = 0.19$) (Fig. 2d). Using
conditional independence tests (Section 4.1), this results in a high-confidence
prediction that $HAP1 \to FOXG1$ is causal.
A standard greedy Bayesian network search algorithm (Schmidt et al. 2007) was
run on the aforementioned cluster of 324 genes. Figure 2e shows the predicted con-
sensus subnetwork of causal interactions between the 10 eQTLs and the 25 TFs.
This illustrates how a sparse Bayesian network can accurately represent the fully
connected coexpression network (all 35 genes have high mutual coexpression, cf.
Fig. 2a).

[Fig. 2, panels a to f: image not reproduced. Panel b plots FOXG1 standardized expression against HAP1 standardized expression; panels c and d plot standardized expression against rs1558285 genotype and rs7160881 genotype (0, 1, 2), respectively.]

Fig. 2 (a) Heat map of standardized expression profiles across 88 visceral fat samples for 10
eQTL genes and 25 TFs belonging to a coexpression cluster inferred from the STAGE data. (b)
Coexpression of HAP1 and FOXG1 across 88 visceral fat samples. (c) Association between
HAP1’s eQTL (rs1558285) and expression of HAP1 (red), FOXG1 (blue), and FOXG1 adjusted
for HAP1 and FOXG1’s eQTL (green). (d) Association between FOXG1’s eQTL (rs7160881) and
expression of FOXG1 (blue), HAP1 (red), and HAP1 adjusted for FOXG1 and HAP1’s eQTL
(green). (e) Causal interactions inferred between the same genes as in (a) using Bayesian network
inference. (f) Example of a regulatory module inferred by Lemon-Tree from the STAGE data. See
Section 4.4 for further details

Figure 2f shows a typical regulatory module inferred by the Lemon-Tree software,
also from the STAGE data. Here, a heat map is shown of the genotypes of an
eQTL (top), the expression levels of a regulatory gene (middle), predicted to regu-
late a coexpression module of 11 genes (bottom). The red lines indicate sample
clusters representing separate normal distributions inferred by the model-based co-­
clustering algorithm (Section 3.2).

5 In Silico Validation of Predicted Gene Regulation Networks

Gene regulation networks reconstructed from omics data represent hypotheses
about the downstream molecular implications of genetic variations in a particular
cell or tissue type. An essential first step toward using these networks in concrete
applications (e.g., discovering novel candidate drug target genes and pathways)
consists of validating them using independent data. The following is a nonexhaustive
list of typical in silico validation experiments.

Model Likelihood Comparison and Cross Validation


When different algorithms are used to infer gene network models, their
log-likelihoods can be compared to select the best one. (Because the same data that
were used to learn the models are also used to compare them, this comparison is
meaningful only when the algorithms optimize exactly the same (penalized)
log-likelihood function.) In a K-fold cross-validation experiment, the available samples
are divided into K subsets of approximately equal size. For each subset, models are
learned from a data set consisting of the K − 1 other subsets, and the model likelihood
is calculated using only the unseen data subset. Thus, cross validation is used
to test the generalizability of the inferred network models to unseen data. For an
example where model likelihood comparison and cross validation were used to
compare two module network inference strategies, see Joshi et al. (2009).
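
A K-fold cross-validation of the model likelihood can be sketched as follows, again assuming a linear Gaussian network whose structure is fixed and whose parameters are re-estimated on each training split. The function names, the dictionary encoding of parent sets, and the random assignment of samples to folds are assumptions of this example.

```python
import numpy as np

def _design(X, parents):
    """Design matrix with an intercept column and one column per parent gene."""
    return np.column_stack([np.ones(X.shape[1])] + [X[j] for j in parents])

def heldout_loglik(X_train, X_test, parent_sets):
    """Log-likelihood of X_test under parameters estimated from X_train."""
    total = 0.0
    for i in range(X_train.shape[0]):
        parents = parent_sets.get(i, [])
        D_train = _design(X_train, parents)
        coef, *_ = np.linalg.lstsq(D_train, X_train[i], rcond=None)
        sigma2 = np.mean((X_train[i] - D_train @ coef) ** 2)
        resid = X_test[i] - _design(X_test, parents) @ coef
        total += np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))
    return float(total)

def cross_validated_loglik(X, parent_sets, K=5, seed=0):
    """Sum of held-out log-likelihoods over K folds of the samples (columns of X)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    total = 0.0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        total += heldout_loglik(X[:, train_idx], X[:, test_idx], parent_sets)
    return total
```

Comparing this quantity between two candidate structures on the same folds gives a measure of how well each generalizes to unseen samples.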

Functional Enrichment
Organism-specific gene ontology databases contain structured functional gene
annotations (Ashburner et al. 2000). These databases can be used to construct gene
signature sets composed of genes annotated to the same biological process, molecular
function, or cellular component. Reconstructed gene networks can then be validated
by testing for enriched connectivity of gene signature sets using a method
proposed by Zhu et al. (2008). For a given gene set, this method considers all network
nodes belonging to the set and their nearest neighbors, and from this set of
nodes and edges, the largest connected subnetwork is identified. Then the enrich-
ment of the gene set in this subnetwork is tested using the Fisher exact test and
compared with the enrichment of randomly selected gene sets of the same size.
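
The Fisher exact test at the core of this procedure reduces to a hypergeometric tail probability. The sketch below covers only that step. Extracting the largest connected subnetwork and the comparison with random gene sets are not shown, and the function and argument names are assumptions.

```python
from math import comb

def enrichment_pvalue(n_universe, n_set, n_subnet, n_overlap):
    """One-sided Fisher exact p-value P(overlap >= n_overlap).

    n_universe: number of genes in the network
    n_set:      number of genes in the signature set
    n_subnet:   number of genes in the subnetwork being tested
    n_overlap:  number of signature genes found in the subnetwork
    """
    upper = min(n_set, n_subnet)
    total = comb(n_universe, n_subnet)
    tail = sum(
        comb(n_set, x) * comb(n_universe - n_set, n_subnet - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total
```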

Comparison with Physical Interaction Networks


Networks of transcription factor–target interactions based on ChIP-sequencing data
(Furey 2012) from diverse cell and tissue types are available from the ENCODE
(The ENCODE 2012), Roadmap Epigenomics (Kundaje et al. 2015), and modEN-
CODE (Gerstein et al. 2010; Roy et al. 2010; Yue et al. 2014) projects, whereas
physical protein–protein interaction networks are available for many organisms
through databases such as the BioGRID (Chatr-Aryamontri et al. 2015). Because of
indirect effects, networks predicted from gene expression data rarely show a signifi-
cant overlap with networks of direct physical interactions. A more appropriate vali-
dation is therefore to test for enrichment for short connection paths in the physical
networks between pairs predicted to interact in the reconstructed networks (Bonnet
et al. 2015).

Gene Perturbation Experiments


Gene knockout experiments provide the ultimate gold standard of a causal network
intervention, and genes differentially expressed between knockout and control
experiments can be considered as true positive direct or indirect targets of the
knockout gene. Predicted gene networks can be validated by compiling relevant
(i.e., performed in a relevant cell or tissue type) gene knockout experiments from
the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) or ArrayExpress
(https://www.ebi.ac.uk/arrayexpress/), and comparing the overlap between gene
sets responding to a gene knockout and network genes predicted to be downstream
of the knockout gene. Overlap significance can be estimated by using randomized
networks with the same degree distribution as the predicted network.

6 Future Perspective: Integration of Multi-Omics Data

Although combining genotype and transcriptome data to reconstruct causal gene
networks has led to important discoveries in a variety of applications (Civelek and
Lusis 2014), important details are not incorporated in the resulting network models,
particularly regarding the causal molecular mechanisms linking eQTLs to their tar-
get genes, and the relation between variation in transcript levels and protein levels,
with the latter ultimately determining phenotypic responses. Several recent studies
have shown that at the molecular level, cis-eQTLs primarily cause variation in tran-
scription factor binding to gene regulatory DNA elements, which then causes
changes in histone modifications, DNA methylation, and mRNA expression of
nearby genes (reviewed in Albert and Kruglyak 2015). Although mRNA expression
can be used as a surrogate for protein expression, due to diverse posttranscriptional
regulation mechanisms, the correlation between mRNA and protein levels is known
to be modest (Lu et al. 2007; Schwanhausser et al. 2011), and genetic loci that affect
mRNA and protein expression levels do not always overlap (Foss et al. 2007; Wu
et al. 2013). Thus, an ideal systems genetics study would integrate genotype data
and molecular measurements at all levels of gene regulation from a large number of
individuals.
Human lymphoblastoid cell lines (LCLs) are emerging as the primary model
system to test such an approach. Whole-genome mRNA and micro-RNA sequenc-
ing data are available for 462 LCL samples from five populations genotyped by the
1000 Genomes Project (Lappalainen et al. 2013); protein levels from quantitative
mass spectrometry for 95 samples (Wu et al. 2013); ribosome occupancy levels
from the sequencing of ribosome-protected mRNA for 50 samples (Cenik et al.
2015); DNA-occupancy levels of the regulatory TF PU.1, the RNA polymerase II
subunit RPB2, and three histone modifications from the ChIP sequencing of 47
samples (Waszak et al. 2015); and the same three histone modifications from the
ChIP sequencing of 75 samples (Grubert et al. 2015). These population-level data
sets can be combined further with three-dimensional chromatin contact data from
Hi-C (Rao et al. 2014) and ChIA-PET (Grubert et al. 2015), knockdown experi-
ments followed by microarray measurements for 59 transcription-associated factors
and chromatin modifiers (Cusanovich et al. 2014), and more than 260 ENCODE
assays (including the ChIP sequencing of 130 TFs) (The ENCODE Project Consortium 2012)
in a reference LCL (GM12878). Although the number of samples where all measures
are simultaneously available is currently small, this number is sure to rise in
the coming years, along with the availability of similar measurements in other cell
types. Despite the challenging heterogeneity of data and analyses in the integration
of multi-omics data, web-based toolboxes, such as GenomeSpace
(http://www.genomespace.org) (Qu et al. 2016), can prove helpful to nonprogrammer
researchers.

Conclusions
In this chapter, we have reviewed the main methods and software to carry out a
systems genetics analysis, which combines genotype and various omics data to
identify eQTLs and their associated genes, to reconstruct coexpression networks
and modules, to reconstruct causal Bayesian gene and module networks, and to
validate predicted networks in silico. Several method and software options are
available for each of these steps, and by necessity, a subjective choice about
which ones to include had to be made, based largely on their ability to handle
large data sets, their popularity in the field, and our personal experience of using
them. Where methods have been compared in the literature, the comparisons have
usually been performed on a small number of data sets for a specific subset of tasks, and
results have rarely been conclusive. That is, although each of the presented methods
will give somewhat different results, no objective measurements will consistently
select one of them as the “best” one. Given this lack of an objective criterion,
the reader may well prefer to use a single software package that can perform all of
the presented analyses, but such an integrated package does not currently exist.
Nearly all of the examples discussed referred to the integration of genotype
and transcriptome data, reflecting the current dominant availability of these two
data types. However, omics technologies are evolving at a fast pace, and it is
clear that data on the variation of TF binding, histone modifications, and post-
transcriptional and protein expression levels will soon become more widely
available. Developing appropriate statistical models and computational methods
to infer causal gene regulation networks from these multi-omics data sets is
surely the most important challenge for the field.

Acknowledgments The authors’ work is supported by the BBSRC (BB/M020053/1) and Roslin
Institute Strategic Grant funding from the BBSRC (BB/J004235/1).

References
Albert FW, Kruglyak L (2015) The role of regulatory variation in complex traits and disease. Nat
Rev Genet 16:197–212
Ardlie KG et al (2015) The genotype-tissue expression (GTEx) pilot analysis: multitissue gene
regulation in humans. Science 348:648–660
Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 25:25–29
Aten JE et al (2008) Using genetic markers to orient the edges in quantitative trait networks: the
NEO software. BMC Syst Biol 2:34
Ayroles JF et al (2009) Systems genetics of complex traits in Drosophila melanogaster. Nat Genet
41:299–307
Basso K et al (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet
37:382–390
Björkegren JL et al (2015) Genome-wide significant loci: how important are they?: systems genet-
ics to understand heritability of coronary artery disease and other common complex disorders.
J Am Coll Cardiol 65:830–845
Bonnet E, Calzone L, Michoel T (2015) Integrative multi-omics module network inference with
Lemon-Tree. PLoS Comput Biol 11, e1003983
Brem RB et al (2002) Genetic dissection of transcriptional regulation in budding yeast. Science
296:752–755
Butte A, Kohane I (2000) Mutual information relevance networks: functional genomic clustering
using pairwise entropy measurements. Pac Symp Biocomput 5:415–426
Cenik C et al (2015) Integrative analysis of RNA, translation and protein levels reveals distinct
regulatory variation across humans. Genome Res. doi:10.1101/gr.193342.115
Chatr-Aryamontri A et al (2015) The BioGRID interaction database: 2015 update. Nucleic Acids
Res 43(Database issue):D470–D478. doi:10.1093/nar/gku1204
Chen LS, Emmert-Streib F, Storey JD (2007) Harnessing naturally randomized transcription to
infer regulatory relationships among genes. Genome Biol 8:R219
Chen Y et al (2008) Variations in DNA elucidate molecular networks that cause disease. Nature
452:429–435
Cheung VG, Spielman RS (2009) Genetics of human gene expression: mapping DNA variants that
influence gene expression. Nat Rev Genet 10:595–604
Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev
Genet 15:34–48
Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks.
Phys Rev E 70:066111
Cookson W et al (2009) Mapping complex disease traits with global gene expression. Nat Rev
Genet 10:184–194
Cubillos FA, Coustham V, Loudet O (2012) Lessons from eQTL mapping studies: non-coding
regions and their role behind natural phenotypic variation in plants. Curr Opin Plant Biol
15:192–198
Cusanovich DA et al (2014) The functional consequences of variation in transcription factor bind-
ing. PLoS Genet 10, e1004226
Daub CO et al (2004) Estimating mutual information using B-spline functions – an improved simi-
larity measure for analysing gene expression data. BMC Bioinf 5:118
Dimas AS et al (2009) Common regulatory variation impacts gene expression in a cell type–depen-
dent manner. Science 325:1246–1250
Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95:14863–14868
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of
protein families. Nucleic Acids Res 30:1575–1584
Faith JJ et al (2007) Large-scale mapping and validation of Escherichia coli transcriptional regula-
tion from a compendium of expression profiles. PLoS Biol 5, e8
Foroughi Asl H et al (2015) Expression quantitative trait loci acting across multiple tissues are
enriched in inherited risk of coronary artery disease. Circulation Cardiovasc Genet 8:305–315
Foss EJ et al (2007) Genetic basis of proteome variation in yeast. Nat Genet 39:1369–1375
Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science
303:799–805
Friedman N, Nachman I, Pe'er D (1999) Learning Bayesian network structure from massive datasets:
the “sparse candidate” algorithm. In: Proceedings of the fifteenth conference on uncertainty
in artificial intelligence, UAI’99. Morgan Kaufmann Publishers Inc., San Francisco,
pp 206–215
Friedman N, Goldszmidt M, Wyner A (1999b) Data analysis with Bayesian networks: a bootstrap
approach. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence.
Morgan Kaufmann Publishers Inc, San Francisco, pp 196–205
Friedman N et al (2000) Using Bayesian networks to analyze expression data. J Comput Biol
7:601–620
Furey TS (2012) ChIP–seq and beyond: new and improved methodologies to detect and character-
ize protein–DNA interactions. Nat Rev Genet 13:840–852
Georges M (2007) Mapping, fine mapping, and molecular dissection of quantitative trait loci in
domestic animals. Annu Rev Genomics Hum Genet 8:131–162
Gerstein M et al (2010) Integrative analysis of the Caenorhabditis elegans genome by the modEN-
CODE project. Science 330:1775–1787
Goddard ME, Hayes BJ (2009) Mapping genes for complex traits in domestic animals and their
use in breeding programmes. Nat Rev Genet 10:381–391
Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. The Johns Hopkins University
Press, Baltimore
Greenawalt DM et al (2011) A survey of the genetics of stomach, liver, and adipose gene expres-
sion from a morbidly obese cohort. Genome Res 21:1008–1016
Grubert F et al (2015) Genetic control of chromatin states in humans involves local and distal
chromosomal interactions. Cell 162:1051–1065
Hartwell LH et al (1999) From molecular to modular cell biology. Nature 402:C47–C52
Hemani G et al (2014) Detection and replication of epistasis influencing transcription in humans.
Nature 508:249–253
Hindorff LA et al (2009) Potential etiologic and functional implications of genome-wide associa-
tion loci for human diseases and traits. Proc Natl Acad Sci 106:9362–9367
Joshi A, Van de Peer Y, Michoel T (2008) Analysis of a Gibbs sampler for model based clustering
of gene expression data. Bioinformatics 24:176–183
Joshi A et al (2009) Module networks revisited: computational assessment and prioritization of
model predictions. Bioinformatics 25:490–496
Kadarmideen HN, von Rohr P, Janss LL (2006) From genetical genomics to systems genetics: poten-
tial applications in quantitative genomics and animal breeding. Mamm Genome 17:548–564
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT
Press, Cambridge, MA
Kundaje A et al (2015) Integrative analysis of 111 reference human epigenomes. Nature
518:317–330
Laird N, Lange C (2011) The fundamentals of modern statistical genetics. Springer, New York
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between
co-expression modules. BMC Syst Biol 1:54
Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis.
BMC Bioinf 9:559
Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the
dynamic tree cut package for R. Bioinformatics 24:719–720
Lappalainen T et al (2013) Transcriptome and genome sequencing uncovers functional variation in
humans. Nature 501:506–511
Lee S et al (2006) Identifying regulatory mechanisms using individual variation reveals key role
for chromatin modification. Proc Natl Acad Sci U S A 103:14062–14067
Lee SI et al (2009) Learning a prior on regulatory potential from eQTL data. PLoS Genet 5, e1000358
Li Y et al (2010) Critical reasoning on causal inference in genome-wide linkage and association
studies. Trends Genet 26:493–498
Liu JS (2002) Monte Carlo strategies in scientific computing. Springer, New York
Lu P et al (2007) Absolute protein expression profiling estimates the relative contributions of tran-
scriptional and translational regulation. Nat Biotech 25:117–124
Mackay TF, Stone EA, Ayroles JF (2009) The genetics of quantitative traits: challenges and pros-
pects. Nat Rev Genet 10:565–577
Manolio TA (2013) Bringing genome-wide association findings into clinical use. Nat Rev Genet
14:549–558
Medvedovic M, Sivaganesan S (2002) Bayesian infinite mixture model based clustering of gene
expression profiles. Bioinformatics 18:1194–1206
Michoel T, Nachtergaele B (2012) Alignment and integration of complex networks by
hypergraph-based spectral clustering. Phys Rev E 86:056111
Millstein J et al (2009) Disentangling molecular relationships with a causal inference test. BMC
Genet 10:23
Neto EC et al (2008) Inferring causal phenotype networks from segregating populations. Genetics
179:1089–1100
Neto EC et al (2010) Causal graphical models in systems genetics: a unified framework for joint
inference of causal network and genetic architecture for correlated phenotypes. Ann Appl Stat
4:320
Neto EC et al (2013) Modeling causality for pairs of phenotypes in system genetics. Genetics
193:1003–1013
Newman MEJ (2006) Modularity and community structure in networks. PNAS 103:8577–8582
Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys
Rev E 69:026113
Nicholson G et al (2011) A genome-wide metabolic QTL analysis in Europeans implicates two
loci shaped by recent positive selection. PLoS Genet 7, e1002270
Qi J et al (2014) kruX: Matrix-based non-parametric eQTL discovery. BMC Bioinf 15:11
Qu K et al (2016) Integrative genomic analysis by interoperation of bioinformatics tools in
GenomeSpace. Nat Methods 13:245–247
Rao SS et al (2014) A 3D map of the human genome at kilobase resolution reveals principles of
chromatin looping. Cell 159:1665–1680
Ravasz E et al (2002) Hierarchical organization of modularity in metabolic networks. Science
297:1551–1555
Ritchie MD et al (2015) Methods of integrating data to uncover genotype-phenotype interactions.
Nat Rev Genet 16:85–97
Rockman MV (2008) Reverse engineering the genotype–phenotype map with natural genetic vari-
ation. Nature 456:738–744
Rockman MV, Kruglyak L (2006) Genetics of global gene expression. Nat Rev Genet 7:862–872
Roy S et al (2010) Identification of functional elements and regulatory circuits by Drosophila
modENCODE. Science 330:1787–1797
Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature
461:218–223
Schadt EE, Björkegren JL (2012) New: network-enabled wisdom in biology, medicine, and health
care. Sci Transl Med 4:115rv1
Schadt EE et al (2005) An integrative genomics approach to infer causal associations between gene
expression and disease. Nat Genet 37:710–717
Schadt EE et al (2008) Mapping the genetic architecture of gene expression in human liver. PLoS
Biol 6, e107
Schadt EE, Friend SH, Shaywitz DA (2009) A network view of disease and compound screening.
Nat Rev Drug Disc 8:286–295
Schaub MA et al (2012) Linking disease associations with regulatory information in the human
genome. Genome Res 22:1748–1759
Schmidt M, Niculescu-Mizil A, Murphy K (2007) Learning graphical model structure using
L1-regularization paths. AAAI 7:1278–1283
Schwanhausser B et al (2011) Global quantification of mammalian gene expression control. Nature
473:337–342
Scutari M et al (2014) Multiple quantitative trait analysis using Bayesian networks. Genetics
198:129–137
Segal E et al (2003) Module networks: identifying regulatory modules and their condition-specific
regulators from gene expression data. Nat Genet 34:166–176
Shabalin AA (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Bioinformatics 28:1353–1358
Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression
analysis. In: Proc Int Conf Intell Syst Mol Biol 8:16
Sharan R, Ulitsky I, Shamir R (2007) Network-based prediction of protein function. Mol Syst Biol
3:88
Smith GD, Ebrahim S (2003) ‘Mendelian randomization’: can genetic epidemiology contribute to
understanding environmental determinants of disease? Int J Epidemiol 32:1–22
Stegle O et al (2012) Using probabilistic estimation of expression residuals (PEER) to obtain
increased power and interpretability of gene expression analyses. Nat Protoc 7:500–507
Talukdar H et al (2016) Cross-tissue regulatory gene networks in coronary artery disease. Cell Syst
2:196–208
Tavazoie S et al (1999) Systematic determination of genetic network architecture. Nat Genet
22:281–285
The ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the
human genome. Nature 489:57–74
Van Dongen SM (2001) Graph clustering by flow simulation. Dissertation, Utrecht University
Repository
Walhout AJ (2006) Unraveling transcription regulatory networks by protein–DNA and protein–
protein interaction mapping. Genome Res 16:1445–1454
Waszak SM et al (2015) Population variation and genetic control of modular chromatin architec-
ture in humans. Cell 162:1039–1050
Williams RW (2006) Expression genetics and the phenotype revolution. Mamm Genome
17:496–502
Wu L et al (2013) Variation and genetic control of protein abundance in humans. Nature
499:79–82
Yue F et al (2014) A comparative encyclopedia of DNA elements in the mouse genome. Nature
515:355–364
Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analy-
sis. Stat Appl Genet Mol Biol 4:17
Zhang W et al (2010) A Bayesian partition method for detecting pleiotropic and epistatic eQTL
modules. PLoS Comput Biol 6, e1000642
Zhang B et al (2013) Integrated systems approach identifies genetic nodes and networks in
late-onset Alzheimer’s disease. Cell 153:707–720
Zhu J et al (2004) An integrative genomics approach to the reconstruction of gene networks in
segregating populations. Cytogenet Genome Res 105:363–374
Zhu J et al (2008) Integrating large-scale functional genomic data to dissect the complexity of
yeast regulatory networks. Nat Genet 40:854–861
Zhu J et al (2012) Stitching together multiple data dimensions reveals interacting metabolomic and
transcriptomic networks that modulate cell regulation. PLoS Biol 10, e1001301
Applications of Systems Genetics and Biology for Obesity Using Pig Models

Lisette J.A. Kogelman and Haja N. Kadarmideen

Abstract
In many biomedical research areas, animals have been used as a model to increase
the understanding of molecular mechanisms involved in human diseases. One of
those areas is human obesity, where porcine models are increasingly used. The
pig shows genetic and physiological features that are very similar to those of humans
and has been shown to be an excellent model for human obesity. Using pig populations,
many genetic studies have been performed to unravel the genetic architecture of
human obesity. Most of them point toward single genes, but more and
more studies focus on a systems genetics approach, a branch of systems biology.
In this chapter, we will describe the state of the art of genetic studies on human
obesity, using pig populations. We will describe the features of using the pig as a
model for human obesity and briefly discuss the genetics of obesity, and we will
focus on systems genetic research performed using pigs with their contribution
to human obesity research.

1 The Pig as a Model for Human Obesity

Throughout the history of biomedical research, animals have been extensively used
as a model for human diseases. Animal models have several advantages with respect
to costs, ethical considerations, and measurement of phenotypic characteristics. The use
of animal models in biomedical research has been previously described in depth
(Hau 2008), showing that the choice of animal model in biomedical research is
highly dependent on the genetic, physiological, and/or psychological features of
both animal and disease under study. For human obesity, rodents are a commonly
used animal model, but because of major anatomical and physiological differences,
the translational efforts to human medical science have been limited (Houpt et al.
1979; Spurlock and Gabler 2008). To overcome those major differences, the pig
(Sus scrofa) has successfully been used as a model for human obesity.

L.J.A. Kogelman (*) • H.N. Kadarmideen
Department of Large Animal Sciences, Faculty of Health and Medical Sciences, University of
Copenhagen, Grønnegårdsvej 7, 1870 Frederiksberg C, Denmark
e-mail: lisette.kogelman@regionh.dk

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_2

The digestive tract is one of the key organs in obesity research (Halsted 1999)
and, therefore, needs to be considered when choosing an animal model for human
obesity. The anatomy of the digestive tract of pigs is very similar to that of
humans. Both species are omnivorous, and their
digestive tract consists of the esophagus, stomach, small intestine (consisting of
duodenum, jejunum, and ileum), and large intestine (consisting of cecum, colon,
rectum, and anus). Furthermore, the digestive tract has proportionally the same size:
the stomach has a capacity of 6–8 l in the pig compared to 2–4 l in humans (Curtus
and Barnes 1994), the small intestine is approximately 18 m long in the pig
(Razmaite et al. 2009) compared to 7 m in humans (Gray 1918), and the large intes-
tine is approximately 6 m in the pig (Razmaite et al. 2009) compared to 1.5 m in
humans.
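
To make the size comparison above concrete, the pig-to-human ratios implied by the quoted figures can be computed directly. This is only a back-of-the-envelope sketch: using the midpoints of the stomach capacity ranges is an assumption of this example, and the ratios are not scaled by body weight or body length.

```python
# Values quoted in the text as (pig, human); stomach capacity uses range midpoints.
measurements = {
    "stomach capacity (l)": ((6 + 8) / 2, (2 + 4) / 2),
    "small intestine length (m)": (18.0, 7.0),
    "large intestine length (m)": (6.0, 1.5),
}


def pig_to_human_ratios(data):
    """Ratio of the pig value to the human value for each organ."""
    return {organ: pig / human for organ, (pig, human) in data.items()}


ratios = pig_to_human_ratios(measurements)
for organ, ratio in ratios.items():
    print(f"{organ}: pig/human = {ratio:.2f}")
```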
Also genetically, the pig is very similar to humans. Recently, the pig
genome was assembled and analyzed (Groenen et al. 2012), resulting in the annotation
of protein-coding genes and gene transcripts, with numbers comparable to the
human genome (Table 1).
Based on the pig's genetic background, Groenen et al. (2012) also showed the
potential of the pig as a biomedical model. For example, they detected 112
positions at which the porcine amino acid sequence matched that of human orthologs
implicated in human disease. Moreover, several studies have used the pig as a biomedical
model, regarding, for example, heart physiology, brain, gut physiology and
nutrition, biomechanical models, respiratory function, and infectious disease models
(Lunney 2007; Swindle and Smith 2008). Here, we will focus on the
use of the pig as a model for human obesity and its application in systems biology
research.

Table 1 Overview of the pig and human genomes
                        Pig              Human
No. of chromosomes      19 pairs         23 pairs
No. of base pairs       2,596,639,456    3,272,480,989
Protein-coding genes    21,640           20,345
Pseudogenes             380              14,206
ncRNAs                  2,965            22,883
Gene transcripts        26,487           196,520

Already in 1979, the use of the pig as a biomedical model to study human obesity
was reviewed (Houpt et al. 1979). Several similarities between pigs and
humans were discussed with respect to obesity-related phenotypes. For example,
both in humans and in pigs, fat is mainly stored in subcutaneous adipose tissue, and
fat cell size and number are similar (Gurr et al. 1977). High-density lipoprotein
(HDL) is also structurally and compositionally similar (Davis et al. 1974).
Importantly, as in humans, there is no indication that obesity in pigs is caused by a
single locus or gene, but it is most likely caused by several loci or homozygous
recessive genotypes for obesity. Genetically, there are, however, differences among
pig breeds: a number of breeds show a strong propensity for obesity (e.g.,
Ossabaw pig and Göttingen Minipig), whereas others have been bred for centuries
for their lean meat content (e.g., Yorkshire and Duroc).
As mentioned, the pig as a biomedical model has major advantages in com-
parison with human studies with respect to the measurement of phenotypic
characteristics. Because of costs and ethical reasons, it is easier to measure a
wide range of phenotypes under controlled experimental conditions, and after
the study period, the animal can be slaughtered and samples from different tissues
and cells collected. Also during the study period, deep phenotyping can
be performed using dual-energy X-ray absorptiometry (DXA) scanning. DXA
has shown its potential in human obesity studies because of the precise mea-
surement of the body composition with respect to fat mass. Several studies
have shown the potential of estimating the fat mass percentage of pigs using
DXA scanning. For example, it was shown that the percentage of body fat mea-
sured by DXA was not significantly different from estimation by chemical
analysis, but more extensive calibration may be needed for total body analysis
(Mitchell et al. 1996), which was similarly shown in small pigs (Mitchell et al.
1998). Likewise, a study using production pigs (Large White × Landrace pigs)
showed a high accuracy of determining body composition using DXA scanning
(Suster et al. 2003).
One of the well-known pig breeds in relation to obesity is the Göttingen Minipig.
This minipig is bred for its small size and ease of handling, which is one of its main
advantages in an experimental setting (Johansen et al. 2001). It has been shown that
this breed becomes severely obese when fed a high-fat, high-energy diet, and their
glycemic control is similar to what has been observed in humans (Johansen et al.
2001). For example, pigs fed the high-fat, high-energy diet had a fat content of
15.2% ± 0.7% vs. 10.0% ± 1.2% in the low-fat, low-energy diet. However, dissimi-
larities were also observed with the high-fat, high-energy diet: the obese pig showed
an increase in triglycerides and HDL cholesterol concentration, whereas in the
obese human, triglyceride levels were increased but HDL cholesterol values were
decreased. Surprisingly, the Göttingen Minipig also develops obesity on a normal,
ad libitum diet and, therefore, is classified as prone to obesity (Bollen et al. 2005).
Another breed with high potential in obesity research is the Ossabaw pig, another
miniature pig breed. They possess a thrifty genotype, similar to humans, which
results in the ability to store large amounts of fat during feasting and consequently
survive periods of famine (Dyson et al. 2006). Studies have shown their excellent
potential as a model for obesity and the progression to type 2 diabetes and other
implications of obesity (e.g., coronary artery disease). Extensive discussions on
type and use of many breeds of minipigs in biomedical research and sources to pro-
cure them for research can be found in the book The Minipig in Biomedical Research
(McAnulty et al. 2011).
In contrast to the obese pigs, production pigs (e.g., Duroc and Yorkshire) have
been bred for centuries for less fat or for their lean meat content to live up to the
standards for human consumption, leading to a pig breed that is genetically predis-
posed for leanness. Although those animals are less valuable in an experimental
setting because of their size, the normal production setting has great potential. In
research performed by, for example, the animal breeding industry, large amounts of data
are collected to breed animals that grow fast and lean, with a high feed efficiency.
These data offer vast opportunities for human obesity studies,
for example, to gain knowledge about the genetic architecture of eating behavior
and development of lean/fat content.

2 The Complexity of Human Obesity in a Nutshell

Obesity is the excessive accumulation of body adipose tissue, commonly the
result of a chronic imbalance between energy intake and expenditure (Galgani
and Ravussin 2010). Obesity is mostly the result of both environmental and
genetic factors and interactions among and within them (multi-factorial)
(Bougnères 2002). Worldwide, the prevalence of obesity has been growing expo-
nentially over the last decades (World Health Organization 2012), which may be
largely due to the increased availability of energy-rich foods and reduced need
for physical activity (O'Rahilly and Farooqi 2006). However, it is also known
that there is a large genetic component: quantitative genetic studies have esti-
mated the heritability of obesity to be between 40% and 70% (Speliotes et al.
2010). The regulation of energy balance (homeostasis) is a very important aspect
of human obesity. Many different tissues, biological processes, and hormones are
involved, whereby genes also play a major role. For example, energy/food intake
is strongly associated with appetite and satiety. Those states are mainly regulated
by the central nervous system with several involved organs and hormones. One
of these hormones is ghrelin, an appetite-stimulatory signal secreted by the stom-
ach (Wren et al. 2001). On the other side, leptin is called the satiety hormone,
released by adipose tissue and functioning through its receptor in the hypothala-
mus, leading to a reduction of food intake and increase of energy expenditure
(Friedman 2002).
Obesity is also very closely related to the human immune system because of the
variety of functions of adipose tissue, i.e., endocrine, inflammatory and metabolic
functions (Heber 2010). White adipose tissue mainly consists of adipocytes that
store energy as fat. Adipocytes secrete a large number of adipokines (also called
adipocytokines) with important roles in energy homeostasis (Fantuzzi 2005). The
most abundant ones are leptin (function mentioned previously) and adiponectin
(Tilg and Moschen 2006). Adiponectin is involved in insulin sensitivity and has
anti-inflammatory and anti-atherogenic properties (Diez and Iglesias 2003). Besides
adipocytes, several immune cells are present in adipose tissue (Ferrante 2013), of
which the most abundant are macrophages with an important role in phagocytosis.
Among others, neutrophils (critical for the first immune response) and mast cells
(several immune functions, e.g., role in allergy) are strongly increased in individuals
with obesity (Elgazar-Carmon et al. 2008; Liu et al. 2009).


The complexity of obesity as well as, for example, the relationship of adipose
tissue with many different hormones and cells results in its association with several
other (complex) diseases, like type 2 diabetes, cardiovascular problems, and several
types of cancer. For instance, decreased insulin sensitivity (leading to type 2 diabetes)
is a consequence of the chronic, low-grade inflammation state caused by the
increased level of immune cells that accompanies the expansion of adipose tissue (Xu
et al. 2003; Shoelson et al. 2007). Insulin has an important function in regulating the
uptake of glucose, originating from carbohydrates in nutrition. After food
intake, blood glucose levels rise, and insulin ensures that cells take up the
glucose so that it can be used as an energy source. When insulin sensitivity is
decreased, cells respond insufficiently to insulin, resulting in a
limited uptake of glucose (type 2 diabetes). As the human body is dependent on an
adequate glucose supply, the disturbed glucose uptake has major consequences for
the human body, such as heart and vascular problems, diabetic retinopathy (affected
eyesight), and kidney failure. Furthermore, insulin also has an anti-inflammatory
effect, again relating obesity to the immune system.
The complexity of obesity becomes clear when looking at all the associated tissues,
cells, hormones, and biological mechanisms, through which obesity affects longevity
and quality of human life. Its consequences for human health and the subsequent
financial burden on society increase the need for a better understanding of the
biological and genetic background of obesity.

3 Single Gene Studies in Obesity: What Do We Know So Far?

For several years, genomewide association studies (GWAS) have been very important
in the detection of genes associated with diseases. GWAS on obesity have been
published since 2007. Most commonly, the body mass index (BMI) was
used as a measure of obesity. The first GWAS performed on BMI, using 4741 indi-
viduals and 362,129 SNPs, led to the detection of the fat mass and obesity-associated
(FTO) gene (Scuteri et al. 2007). Approximately 8 years later, the latest GWAS
performed on BMI, composed of 339,224 individuals and approximately 2.5 mil-
lion SNPs (GIANT consortium), detected 97 obesity-related loci (Locke et al.
2015). Although this is a huge increase in detected loci/genes associated with obe-
sity, results thus far are disappointing, as they only explain approximately 2.7% of
the BMI variation. Over the years and with new findings, the understanding of the
biological mechanisms behind obesity has also changed. Whereas FTO was discovered
as an actual fat mass gene (related to food intake), pathway analysis of the 97
obesity-related loci discovered by Locke et al. (2015) points toward a major
role for the central nervous system.
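
The per-SNP test that underlies a GWAS can be illustrated with a simple linear regression of the phenotype on allele dosage (0, 1, or 2 copies of an allele). The following is a minimal sketch on made-up toy data, not the analysis pipeline of any study cited here; the function name `snp_association` and all values are hypothetical. The squared correlation gives the fraction of phenotypic variance explained by a single SNP, which is the kind of quantity summarized by the "variance explained" figures above.

```python
from statistics import mean


def snp_association(dosages, phenotypes):
    """Regress a phenotype on allele dosage (0/1/2) for a single SNP.

    Returns (slope, r_squared): the estimated additive effect per allele
    copy and the fraction of phenotypic variance explained by the SNP.
    """
    mx, my = mean(dosages), mean(phenotypes)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in phenotypes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, phenotypes))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Hypothetical toy data: allele dosages and BMI values for eight individuals.
dosages = [0, 0, 0, 1, 1, 1, 2, 2]
bmi = [22.0, 24.0, 23.0, 25.0, 24.0, 26.0, 27.0, 25.0]

slope, r2 = snp_association(dosages, bmi)
print(f"effect per allele: {slope:.2f} BMI units, variance explained: {r2:.1%}")
```

In a real GWAS this test is repeated for every SNP, covariates and population structure are accounted for, and a stringent genomewide significance threshold is applied; none of that is shown here.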
Furthermore, several other phenotypes have been used to detect obesity-related
genes. For example, it has been shown that central obesity has more negative health
consequences than general obesity. Therefore, the waist-to-hip ratio (WHR)
might be more informative than the BMI (Vazquez et al. 2007; Molarius and Seidell
1998). Recently, 49 loci related to WHR were detected using the GIANT consor-
tium database, consisting of genes that were highly enriched in adipocyte-related
tissues (Shungin et al. 2015). As with GWAS on BMI, the variance explained by
those loci is very low: only approximately 1.4%.
Many more studies have been performed to elucidate the genetic background of
obesity and to gain understanding of the biological mechanisms of this complex
phenotype. As mentioned, the use of animal models in biomedical research has
some outstanding advantages, for example, with respect to costs and ethics. Several
research groups have made use of these advantages and investigated obesity using
the pig as a biomedical model, either using data sets coming from the pig industry
or using an experimental animal model, each having its own (dis)advantages (Fig. 1).

4 Human Obesity Genes Present in Pigs

The first gene that has been directly associated with obesity in humans is the FTO
gene. This gene has also been related to obesity-related traits in pigs by several
studies. One study focused on the alleles of the FTO gene and examined the
relationship of this gene with several measured traits in seven pig breeds (Fontanesi
et al. 2009). The authors showed, and later confirmed, that FTO was significantly
associated with obesity-related phenotypes in Duroc pigs, for example, intramuscular
fat content (Fontanesi et al. 2010). Also in an ISU Berkshire × Yorkshire population,
this gene showed significant association with average daily gain and total lipid
percentage in muscle (Fan et al. 2009). Furthermore, an expression study showed
elevated expression levels of the FTO gene in brain tissues, with significantly higher
levels in the cerebellum compared with the cortex of pigs fed a high-cholesterol diet
(Madsen et al. 2009).
Another well-known human obesity gene is MC4R. The gene is active in the
hypothalamic leptin–melanocortin signaling pathway and has been associated with
suppression of food intake (Santini et al. 2009). Also in pigs, the gene has been
associated with several obesity-related traits. In Italian Duroc and Italian Large
White pigs, it has been associated with daily gain, feed conversion ratio, and ham
weight (Davoli et al. 2012). Another study using Large White pigs showed the association
of MC4R with backfat depth, average daily gain, and daily feed intake (Houston
et al. 2004).
Another study that showed the presence of human obesity genes in pigs was
performed at Poznan University of Life Sciences (Poland). They localized seven
previously mapped (INSIG2, LIPIN1, PLIN, NAMPT, ADIPOQ, UCP2, and UCP3)
and six novel (NR3C1, GNB3, ADRB1, ADRB2, ADRB3, and UCP1) candidate
genes for human obesity in the pig genome (Nowacka-Woszuk et al. 2008). All
genes could be localized on one of the pig chromosomes, and several of them were
located within known quantitative trait loci (QTLs) for fatness traits in pigs.

Industry resources
- Bred for lean meat content
- Large sample sizes
- Controlled environment
- Phenotyping/sampling limited to commercial production line
- Genotyping available (for genomic selection)

Experimental animal model
- Breeding strategy and/or selection of specific animals
- Smaller sample sizes
- Very controlled environment
- Deep phenotyping/tissue sampling
- Different ‘omic data levels possible

Fig. 1 Overview of potential use of pigs for human obesity research; industry versus experimental animal model

Another expression-based study also showed the presence of LEP, LEPR, NEGR1,
and ADIPOQ together with FTO and MC4R in production pigs and in Göttingen
Minipigs (Cirera et al. 2014).

5 Studying the Genetics of Fatness Traits in Pigs: Input from the Industry

Fat-related traits in pigs have been studied intensively for the pig industry because
of their commercial importance. Production pigs have been bred for their lean meat
content, which made it commercially interesting to detect biomarkers for
fatness. Although many of these studies have never been related to human obesity,
the genes they detected might also be important in a human context. For
example, a QTL study for backfat thickness and intramuscular fat content in an
experimental cross between Meishan and Dutch Large White and Landrace lines
detected several QTLs on several chromosomes. Comparative mapping of the
results showed human homologues for those QTLs, for example, tumor necrosis
factor α (de Koning et al. 1999). Furthermore, in a QTL study using a three-
generation experimental cross between Meishan and Large White pig breeds, seven
chromosomal regions were detected as genomewide significant for fatness traits.
One of the detected QTLs was close to the insulin-like growth factor 2 locus
(Bidanel et al. 2001). A GWAS performed on 669 Duroc pigs across generations
resulted in the detection of a region associated with backfat thickness. In this region,
six genes were identified (PDE4B, LEPR, LEPROT, DNAJC6, AK3L1, and JAK1)
of which both LEPR and PDE4B have previously been associated with backfat-
related traits and obesity.
A GWAS of 820 commercial sows (Large White and Large White × Landrace
cross) identified several regions associated with backfat, including the genes MC4R,
ATP6V1H, OPRK1, LDHD, CHCHD3, and ATP2B3 (Fan et al. 2011). Of those,
MC4R, ATP6V1H, and CHCHD3 have been previously associated with obesity in
human or other animal studies (Hwang et al. 2010; Lubrano-Berthelier et al. 2003;
Walewski et al. 2010). Another GWAS using 651 Duroc samples (three generations)
detected a region associated with backfat thickness containing six genes: PDE4B,
LEPR, LEPROT, DNAJC6, AK3L1, and JAK1 (Okumura et al. 2013). Several of
those have been previously associated with human obesity, such as the previously
discussed LEPR.
In addition to GWAS performed on backfat and other fat-related traits, another
study focused on feeding behavior in production pigs (Duroc boars), which might
be an important trait with respect to feed efficiency (Do et al. 2013). The findings
of this study could also be valuable for human research because of their relevance
to eating behavior and the subsequent development of obesity. The genes
discovered included ENPP1, HCRT, and MTTP.
Another study characterized the porcine transcriptome using RNA
sequencing and found a gene associated with cholesterol and triglyceride levels:
CES1 (Chen et al. 2011). Interestingly, a human study has also shown that the
expression of CES1 is upregulated in obese subjects (Marrades et al. 2010),
whereas in mice, it was shown that CES1 has an important role in lipid homeostasis
(Xu et al. 2014).

6 Porcine Models for Human Obesity

As described, the pig is an excellent model for human obesity with huge potential.
Many researchers have taken this opportunity to study human obesity, using different
porcine models. In this section, we describe several of those studies to give
insight into the knowledge gained about human obesity using porcine
models.
One of the ways in which pigs are used as a model for human obesity is by induc-
ing obesity with a high-fat diet. Using this approach, Li et al. (2011) created a por-
cine model of 20 crossbred boars, which were fed a corn-soy basal diet or a diet with
lard for 180 days. There were significant differences between the high-fat diet pigs
and control pigs with respect to, for example, body weight, backfat thickness, and
abdominal fat. They detected 852 genes (387 upregulated and 465 downregulated)
that showed differential expression between the obese pigs and control pigs.
Upregulated genes were mainly associated with metabolic process, immune
response, translation, and cell cycle, whereas downregulated genes were mainly
involved in regulation of transcription, RNA splicing, and transcription.
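
How genes end up labeled "upregulated" or "downregulated" can be illustrated with a fold-change rule. The sketch below is an assumption for illustration only and is not the procedure of Li et al. (2011): it compares group means on a log2 scale with a hypothetical cutoff, whereas real differential expression analyses also apply statistical tests and multiple-testing correction. The function name `classify_genes`, the gene names, and the values are all made up.

```python
from math import log2
from statistics import mean


def classify_genes(obese, control, min_abs_log2fc=1.0):
    """Label each gene 'up', 'down', or 'unchanged' in obese vs. control pigs.

    `obese` and `control` map a gene name to a list of expression values,
    one per animal. The cutoff on the log2 fold change is an assumption.
    """
    labels = {}
    for gene in obese:
        log2fc = log2(mean(obese[gene]) / mean(control[gene]))
        if log2fc >= min_abs_log2fc:
            labels[gene] = "up"
        elif log2fc <= -min_abs_log2fc:
            labels[gene] = "down"
        else:
            labels[gene] = "unchanged"
    return labels


# Hypothetical expression values (arbitrary units) for three placeholder genes.
obese = {
    "geneA": [80.0, 100.0, 120.0],
    "geneB": [10.0, 12.0, 8.0],
    "geneC": [50.0, 55.0, 45.0],
}
control = {
    "geneA": [20.0, 25.0, 30.0],
    "geneB": [40.0, 44.0, 36.0],
    "geneC": [48.0, 52.0, 50.0],
}

labels = classify_genes(obese, control)
n_up = sum(1 for label in labels.values() if label == "up")
n_down = sum(1 for label in labels.values() if label == "down")
print(labels, n_up, n_down)
```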
Porcine models can also be obtained by intercrossing pig breeds, creating a pop-
ulation consisting of different generations that is advantageous for genetic studies.
An excellent example is the UNIK pig resource population (University of
Copenhagen, Denmark), which is an F2 pig population created by intercrossing
Göttingen Minipig boars with Duroc and Yorkshire sows (563 pigs). As described
earlier, the Göttingen Minipig and production pigs (i.e., Duroc and Yorkshire) are
very distinct in some obesity-related features, e.g., body size and body fat content.
By creating an F2 intercross, the F2 animals will show a wide range of values for
each distinct phenotype. This large variation in obesity-related phenotypes was also
obvious in the UNIK pig resource population, shown by genetic parameter calcula-
tions. The high coefficients of variation (15–42%) and moderate to high heritabili-
ties (0.22–0.81) for obesity and obesity-related traits showed that the population
was highly divergent for obesity traits (Kogelman et al. 2013). Furthermore, genetic
correlations between obesity-related traits revealed more of the genetic architecture
in the population. For example, weight and lean mass estimated by DXA scanning
were highly correlated (0.56–0.97), and fat-related traits were strongly correlated
with glucose levels (0.35–0.74). The UNIK resource population thereby proved its
potential for further genetic investigations using, for example, genomics and
transcriptomics.
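
The two kinds of genetic parameters quoted above can be made concrete with their textbook definitions: the coefficient of variation is the standard deviation expressed as a percentage of the mean, and the narrow-sense heritability is the additive genetic variance divided by the total phenotypic variance. The sketch below uses hypothetical numbers and helper names; it does not reproduce the estimation procedure of Kogelman et al. (2013), in which the variance components would come from a fitted genetic model.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def narrow_sense_heritability(additive_var, phenotypic_var):
    """h^2 = additive genetic variance / total phenotypic variance."""
    return additive_var / phenotypic_var


# Hypothetical backfat measurements (mm) in five F2 animals.
backfat_mm = [10.0, 15.0, 20.0, 25.0, 30.0]
cv = coefficient_of_variation(backfat_mm)

# Hypothetical variance components (assumed to be already estimated).
h2 = narrow_sense_heritability(additive_var=4.0, phenotypic_var=10.0)

print(f"CV = {cv:.1f}%, h2 = {h2:.2f}")
```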
To detect genomewide associations for obesity-related phenotypes in the UNIK
pig resource population, a combined linkage disequilibrium-linkage analysis was
performed (Pant et al. 2015). This resulted in the detection of 229 QTLs that were
subsequently used for comparative analysis with the human genome. Many different
genes were identified for obesity-related phenotypes, such as BMI (e.g., SMAD6 and
PAX5), fasting glucose levels (STIM1), and cholesterol levels (e.g., STRADA).
Another study within the UNIK pig resource population focused on obesity-specific
microRNA expression in subcutaneous adipose tissue (Mentzel et al. 2015). In total,
two differentially expressed microRNAs were discovered and validated using qPCR:
mir-9 and mir-124a. Both microRNAs have been previously associated with obesity-
related phenotypes in human and/or mouse studies.

7 Systems Genetics Analyses of Obesity Using a Porcine Model

As previously described, obesity is a complex disease involving many different
genes, pathways, tissues, and organs. With respect to complex diseases, systems
genetics approaches can be very useful to unravel the involved molecular mecha-
nisms. One of those systems genetics approaches is expression quantitative trait
loci (eQTL) mapping, whereby genomic and transcriptomic data are integrated.
Generally, this leads to the detection of both cis- and trans-eQTLs, where the SNPs
of a cis-eQTL are located near the affected transcript and the SNPs of a trans-eQTL
map further away from the affected transcript or even on a different
chromosome.
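
The distinction between cis- and trans-eQTLs can be sketched as a simple position check. This is a minimal illustration, not the rule used by any study cited here: the 1 Mb window is an assumption (the cutoff varies between studies), and the function name `classify_eqtl` and the coordinates are hypothetical.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Return 'cis' if the SNP lies on the gene's chromosome and within
    `window` base pairs of the transcript; otherwise return 'trans'.

    The default window of 1 Mb is an assumption made for this sketch.
    """
    if snp_chrom != gene_chrom:
        return "trans"
    if snp_pos < gene_start:
        distance = gene_start - snp_pos
    elif snp_pos > gene_end:
        distance = snp_pos - gene_end
    else:
        distance = 0  # SNP lies inside the transcript
    return "cis" if distance <= window else "trans"


# Hypothetical transcript on chromosome 1 spanning 2,000,000-2,050,000 bp.
print(classify_eqtl("1", 1_500_000, "1", 2_000_000, 2_050_000))   # nearby SNP
print(classify_eqtl("1", 50_000_000, "1", 2_000_000, 2_050_000))  # distant SNP
print(classify_eqtl("7", 1_500_000, "1", 2_000_000, 2_050_000))   # other chromosome
```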
In 2011, a group at the Leibniz Institute for Farm Animal Biology (Germany)
performed the first eQTL mapping in a porcine model to analyze obesity-related
traits (Ponsuksili et al. 2011). A total of 150 pigs (Pietrain × (German Large White
× German Landrace)) were genotyped and expression-profiled using a microarray
platform. First, they detected genes that were correlated with fat traits and showed
that positively correlated genes were mainly related to metabolism of various mac-
romolecules and nutrients. On the other hand, negatively correlated genes were
related to dynamic cellular processes. Second, the eQTL mapping was performed,
leading to the detection of 448 cis-eQTLs and 3297 trans-eQTLs. Next, pathway
analysis and network generation were performed on the detected genes within the
eQTLs. Networks constructed within both cis- and trans-eQTLs resulted in the
detection of pathways related to lipid metabolism. Overall, they detected several genes
that were previously detected in human and/or mouse studies (e.g., PRDX6, PLIN4,
and FOXO1), but also detected novel candidate genes (e.g., SLC12A1).
In the same year, another eQTL mapping study on an F2 pig population was
performed (Steibel et al. 2011). Although this F2 population was not created as a
model for human obesity, they did find several eQTLs that were part of a network
associated with lipid metabolism. In the networks associated with lipid metabolism,
many genes were present, such as AKR7A2, CASP7, and CYP4F2. The genes identified
in this study were proposed as novel candidate genes for pig production traits, but
a subset of them could potentially be translated to human obesity.
Another form of data integration combined heritability
estimates with a comparative genomics strategy to identify causal genetic factors
(Kim et al. 2012). First, they showed that the human chromosome 2 is mostly
syntenic to chromosome 3 and chromosome 15 of the pig. Second, heritabilities for
backfat thickness in the pig and subscapular skinfold thickness in humans were
estimated, and a correlation of 0.479 was detected. In both the pig and human popu-
lations, the human chromosome 2 was detected as the most important chromosome
based on the percentage of heritability explained and was therefore further investi-
gated. Several genes were suggested as being candidate genes for human obesity,
namely, MRPL33, PARD3B, ERBB4, STK39, and ZNF385B.
In the previously mentioned UNIK pig resource population, several systems
genetics approaches were also conducted. Using the 60K genotype data, genetic
networks were constructed using the weighted interaction SNP hub (WISH) method
(Kogelman et al. 2014b). The WISH method detects interactions between SNPs
based on their genotype correlations and based on the epistatic interaction effect
between SNP pairs (Kogelman and Kadarmideen 2014). First, a GWAS was per-
formed on the obesity index (OI), which is an aggregate genotypic value for the
level of obesity of each individual pig. Several genes were detected (e.g., NPC2,
OR4D10, and CACNA1E), and pathway analysis of all genomewide significant
SNPs (404 SNPs) showed association with, e.g., insulin and immune system path-
ways. Second, 2500 SNPs were selected based on GWAS results (P-value < 0.05)
and their connectivity to construct the WISH network. Pathway analysis of detected
modules resulted in pathways related to, for example, metabolic processes and
purinergic receptor activity. Lastly, a differentially wired network was constructed to
detect potential obesity genes based on differential connectivity. Here, genes such
as UBR1, PNPLA8, and CTNAP2 were detected, all of which have previously been
associated with obesity or obesity-related diseases.
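
The idea of differential connectivity can be sketched as follows: build one network per group, sum the edge weights of each node, scale by the best-connected node, and take the difference between groups. This is a simplified illustration under stated assumptions and not the WISH implementation, whose definitions are given in Kogelman and Kadarmideen (2014); the function names, node names, and edge weights below are hypothetical.

```python
def connectivity(adjacency):
    """Sum of edge weights per node (ignoring the diagonal), scaled so
    that the best-connected node has connectivity 1."""
    totals = [
        sum(weight for j, weight in enumerate(row) if j != i)
        for i, row in enumerate(adjacency)
    ]
    top = max(totals)
    return [total / top for total in totals]


def differential_connectivity(adj_a, adj_b):
    """Per-node difference in scaled connectivity between two networks."""
    return [a - b for a, b in zip(connectivity(adj_a), connectivity(adj_b))]


# Hypothetical symmetric edge weights among three placeholder nodes,
# estimated separately in obese and lean animals.
nodes = ["node1", "node2", "node3"]
obese_network = [
    [1.0, 0.8, 0.8],
    [0.8, 1.0, 0.0],
    [0.8, 0.0, 1.0],
]
lean_network = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.8],
    [0.0, 0.8, 1.0],
]

diff = dict(zip(nodes, differential_connectivity(obese_network, lean_network)))
most_rewired = max(diff, key=lambda name: abs(diff[name]))
print(diff, most_rewired)
```

Nodes with a large absolute difference are connected very differently in the two groups, which is what makes them candidates in a differentially wired network.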
Besides the genotyping in the UNIK pig resource population, a subset of the
population was selected for RNA sequencing. The selection of pigs was obtained
using selective expression profiling, based on the OI, selecting 12 extremely lean,
12 extremely obese, and 12 intermediate pigs. One of the small studies on this data
set investigated interactions between highly differentiating genes (lean
vs. obese animals). These genes were visualized in Cytoscape (Shannon et al. 2003) and
further analyzed using its built-in network analyzer (Fig. 2). We detected several
so-called hub genes, of which one clear example was TNMD. TNMD
encodes a protein related to chondromodulin I, a cartilage-specific glycoprotein.
Genetic variation in TNMD has been associated with central obesity and type 2
diabetes (Tolppanen et al. 2007).
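
The notion of a hub gene can be illustrated by counting, for each gene, how many other genes it is strongly coexpressed with. The sketch below computes Pearson correlations and keeps pairs with an absolute correlation above 0.8, the cutoff stated in the caption of Fig. 2; everything else (the function names, gene names, and expression values) is hypothetical. It shows only the degree of each node, one of many properties a network analysis tool reports.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def hub_degrees(expression, threshold=0.8):
    """Count, per gene, the partners whose |Pearson r| exceeds `threshold`."""
    genes = list(expression)
    degree = {gene: 0 for gene in genes}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            if abs(pearson(expression[gene], expression[other])) > threshold:
                degree[gene] += 1
                degree[other] += 1
    return degree


# Hypothetical expression values for four placeholder genes in five animals.
expression = {
    "hub_gene": [1.0, 2.0, 3.0, 4.0, 5.0],
    "partner1": [1.5, 1.0, 3.0, 5.0, 4.5],
    "partner2": [0.5, 3.0, 3.0, 3.0, 5.5],
    "unrelated": [3.0, 1.0, 2.0, 1.0, 3.0],
}

degree = hub_degrees(expression)
print(degree, max(degree, key=degree.get))
```

In this toy data set the two partner genes are each strongly correlated with the hub gene but only moderately with each other, so the hub gene has the highest degree.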
The complete RNA-sequencing data were then used to construct coexpression
networks using the weighted gene coexpression network analysis (WGCNA)
approach, and regulator genes were detected using Lemon-Tree algorithms
(Kogelman et al. 2014a). This resulted in the detection of several obesity-related
pathways, but more interestingly in the detection of three potential causal genes
linking obesity with osteoporosis: CCR1, MSR1, and SPI1. Continuing the sys-
tems genetics approaches using this UNIK pig resource population, one study
focused on lncRNAs present among the genes in the previously detected WGCNA
modules (Suravajhala et al. 2015). This study investigated those genes using the
RNA–protein interaction predictor and detected network properties using Cytoscape
(Shannon et al. 2003). They found that cyp2c91 has strong interactions with several
regulator genes, and they showed its importance in transcriptional regulation
localized to the cytoplasm. Furthermore, another study in progress within the UNIK
population focuses on transcriptional regulation networks. This study extracted
known transcription factors and used them as input of WGCNA to detect interaction
patterns between transcription factors (Skinkyte-Juskiene et al. 2015).

Fig. 2 Visualization of coexpression of genes selected based on their association with obesity,
resulting from RNA-sequencing data of subcutaneous adipose tissue in the UNIK pig resource
population. Edges represent Pearson's correlation coefficients between selected genes, colored on
a red–green scale based on negative–positive correlation. The thickness of the edges represents
the strength of the correlations. Nodes represent selected genes, with size representing the
fold change detected in differential expression analysis and darkness (gray scale) representing the
significance level of differential expression analysis. Only genes with a coexpression higher than 0.8
are visualized. TNMD, one of the selected hub genes, is colored yellow

The most recent research in the UNIK pig resource population is a study that integrates
the genomic data with the transcriptomic data, using an eQTL mapping
approach. First, analysis of RNA-Seq data revealed 458 differentially expressed
genes for the degree of obesity. Second, eQTL analysis revealed 987 cis-eQTLs and
73 trans-eQTLs. Data were further integrated by confining the eQTL mapping input
to genomewide associated SNPs and differentially expressed genes. Furthermore,
eQTL results were used for coexpression network construction. Many different
obesity-related pathways and genes were identified, and several obesity candidate
genes were proposed: ENPP1, CTSL, and ABHD12B (Kogelman et al. 2015).

8 Future Perspectives

Here we have described a selection of (ongoing) genetic research into obesity, using
the pig as a biomedical model. The complexity of obesity, resulting from interac-
tions between and within genes and environment, requires complex analyses of the
different -omic levels to elucidate underlying genetic and biological mechanisms.
Systems genetics approaches, a genetic branch of systems biology, offer those pos-
sibilities and are therefore an excellent choice for obesity research. Here we have
focused on genomic and transcriptomic research, and besides the huge opportuni-
ties that still can be reached within those fields, there are also prospects for other
-omics levels like epigenomics, metabolomics, and proteomics.
In humans, epigenetic studies have shown the importance of detecting epigenetic
changes underlying the development of obesity (van Dijk et al. 2015). A recently published
study using a subset of pigs from the previously mentioned UNIK pig resource
population focused on the epigenetic mechanisms in leukocytes (Jacobsen et al. 2016).
They detected several genes that were differentially expressed (lean vs. obese pigs) in
retroperitoneal adipose tissue and peripheral blood mononuclear cells. Several of those
genes, for example, SPPI1, LEP, and INSIG1, had a strong association with inflammatory
pathways and fatty acid metabolism. Also, metabolomics has shown its potential
in human obesity research (Rauschert et al. 2014). A few metabolomic studies have
been performed using pigs, with the aim to unravel metabolic mechanisms. He et al.
(2012) found many differences in metabolite levels between lean and obese pigs, showing
a distinct metabolism (e.g., differences in lipid oxidation and fermentation by
gastrointestinal microbes) for obese pigs. Obesity research could benefit greatly from the
integration of metabolomic and genomic/transcriptomic data, to narrow down the
unknown interactions between genetic and environmental factors. Proteomic studies
have already been used more frequently in obesity research using the pig as a model,
and their potential is reviewed by Bendixen et al. (2010). It has been shown that several
plasma proteins correlated well with obesity-related parameters (e.g., cholesterol and
glucose) among pigs that were fed different diets (te Pas et al. 2013). Also, the protein
expression of muscle in obese and lean pigs has been studied to gain knowledge about
growth rate and meat quality in pigs, which can be used in human research (Li et al.
2013). They found two metabolism-related proteins (COX5A and ATP5B) that were highly
expressed in the obese pigs, but not in the lean pigs; both participate in oxidative
phosphorylation. The protein that was highly expressed in the lean pigs but not
in the obese pigs (ENO3) is involved in glycolysis.
Omic data generation (both microarray and sequence based) is getting easier and
more affordable, and in combination with the ease of collecting samples from dif-
ferent cells/tissues in a porcine model, there is a huge potential to elucidate genetic
and biological mechanisms of human obesity. Inherent in all these approaches is
the dramatic increase in the use of omic-scale modeling and joint analyses of disparate
multi-omic data sets using highly advanced bioinformatic and computational
systems biology methods. This will become even more important and widespread in
biomedical research in general and in complex diseases, such as obesity, in particular.
As shown throughout this chapter, in the field of systems biology/systems genetics,
there are many untouched areas in obesity research using a porcine model,
which promises further growth and good prospects in the future.

References
Bendixen E, Danielsen M, Larsen K, Bendixen C (2010) Advances in porcine genomics and pro-
teomics—a toolbox for developing the pig as a model organism for molecular biomedical
research. Brief Funct Genomics 9(3):208–219. doi:10.1093/bfgp/elq004
Bidanel J, Milan D, Iannuccelli N et al (2001) Detection of quantitative trait loci for growth and
fatness in pigs. Genet Sel Evol 33:289–309. doi:10.1186/1297-9686-33-3-289
Bollen PJ, Madsen LW, Meyer O, Ritskes-Hoitinga J (2005) Growth differences of male and
female Gottingen minipigs during ad libitum feeding: a pilot study. Lab Anim 39(1):80–93.
doi:10.1258/0023677052886565
Bougnères P (2002) Genetics of obesity and type 2 diabetes. Diabetes 51(suppl 3):S295–S303.
doi:10.2337/diabetes.51.2007.S295
Chen C, Ai H, Ren J et al (2011) A global view of porcine transcriptome in three tissues from a
full-sib pair with extreme phenotypes in growth and fat deposition by paired-end RNA sequenc-
ing. BMC Genomics 12:448. doi:10.1186/1471-2164-12-448
Cirera S, Jensen MS, Elbrønd VS et al (2014) Expression studies of six human obesity-related
genes in seven tissues from divergent pig breeds. Anim Genet 45(1):59–66. doi:10.1111/
age.12082
Curtus H, Barnes NS (1994) Invitation to biology, vol 529, 5th edn. Worth, New York
Davis MA, Henry R, Leslie RB (1974) Comparative studies on porcine and human high density
lipoproteins. Comp Biochem Physiol B 47(4):831–849
Davoli R, Braglia S, Valastro V et al (2012) Analysis of MC4R polymorphism in Italian Large
White and Italian Duroc pigs: association with carcass traits. Meat Sci 90(4):887–892.
doi:10.1016/j.meatsci.2011.11.025
de Koning D, Janss L, Rattink A et al (1999) Detection of quantitative trait loci for backfat thick-
ness and intramuscular fat content in pigs (Sus scrofa). Genetics 152:1679–1690
Diez J, Iglesias P (2003) The role of the novel adipocyte-derived hormone adiponectin in human
disease. Eur J Endocrinol 148(3):293–300. doi:10.1530/eje.0.1480293
Do DN, Strathe AB, Ostersen T, Jensen J, Mark T, Kadarmideen HN (2013) Genome-wide asso-
ciation study reveals genetic architecture of eating behavior in pigs and its implications for
humans obesity by comparative mapping. PLoS One 8(8), e71509. doi:10.1371/journal.
pone.0071509
Dyson MC, Alloosh M, Vuchetich JP, Mokelke EA, Sturek M (2006) Components of metabolic
syndrome and coronary artery disease in female Ossabaw swine fed excess atherogenic diet.
Comp Med 56(1):35–45
Elgazar-Carmon V, Rudich A, Hadad N, Levy R (2008) Neutrophils transiently infiltrate intra-
abdominal fat early in the course of high-fat feeding. J Lipid Res 49(9):1894–1903. doi:10.1194/
jlr.M800132-JLR200
Fan B, Du ZQ, Rothschild MF (2009) The fat mass and obesity-associated (FTO) gene is associated
with intramuscular fat content and growth rate in the pig. Anim Biotechnol 20(2):58–70.
doi:10.1080/10495390902800792
Fan B, Onteru SK, Du Z-Q, Garrick DJ, Stalder KJ, Rothschild MF (2011) Genome-wide associa-
tion study identifies loci for body composition and structural soundness traits in pigs. PLoS
One 6(2), e14726. doi:10.1371/journal.pone.0014726
Fantuzzi G (2005) Adipose tissue, adipokines, and inflammation. J Allergy Clin Immunol
115(5):911–919. doi:10.1016/j.jaci.2005.02.023
Ferrante AW (2013) The immune cells in adipose tissue. Diabetes Obes Metab 15(s3):34–38.
doi:10.1111/dom.12154
Fontanesi L, Scotti E, Buttazzoni L, Davoli R, Russo V (2009) The porcine fat mass and obesity
associated (FTO) gene is associated with fat deposition in Italian Duroc pigs. Anim Genet
40(1):90–93. doi:10.1111/j.1365-2052.2008.01777.x
Fontanesi L, Scotti E, Buttazzoni L et al (2010) Confirmed association between a single nucleotide
polymorphism in the FTO gene and obesity-related traits in heavy pigs. Mol Biol Rep
37(1):461–466. doi:10.1007/s11033-009-9638-8
Friedman JM (2002) The function of leptin in nutrition, weight, and physiology. Nutr Rev 60(suppl
10):S1–S14. doi:10.1301/002966402320634878
Galgani J, Ravussin E (2010) Energy metabolism, fuel selection and body weight regulation. Int
J Obes (Lond) 32(Suppl 7):S109–S119. doi:10.1038/ijo.2008.246
Gray H (1918) Anatomy of the human body. Lea & Febiger
Groenen MAM, Archibald AL, Uenishi H et al (2012) Analyses of pig genomes provide insight
into porcine demography and evolution. Nature 491(7424):393–398. doi:10.1038/nature11622
Gurr MI, Kirtland J, Phillip M, Robinson MP (1977) The consequences of early overnutrition for
fat cell size and number: the pig as an experimental model for human obesity. Int J Obes (Lond)
1(2):151–170
Halsted CH (1999) Obesity: effects on the liver and gastrointestinal system. Curr Opin Clin Nutr
Metab Care 2(5):425–429
Hau J (2008) Animal models for human diseases. In: Conn PM (ed) Sourcebook of models for
biomedical research. Humana Press, Totowa, pp 3–8. doi:10.1007/978-1-59745-285-4_1
He Q, Ren P, Kong X et al (2012) Comparison of serum metabolite compositions between obese
and lean growing pigs using an NMR-based metabonomic approach. J Nutr Biochem
23(2):133–139. doi:10.1016/j.jnutbio.2010.11.007
Heber D (2010) An integrative view of obesity. Am J Clin Nutr 91(1):280S–283S. doi:10.3945/
ajcn.2009.28473B
Houpt KA, Houpt TR, Pond WG (1979) The pig as a model for the study of obesity and of control
of food intake: a review. Yale J Biol Med 52(3):307–329
Houston RD, Cameron ND, Rance KA (2004) A melanocortin-4 receptor (MC4R) polymorphism
is associated with performance traits in divergently selected Large White pig populations.
Anim Genet 35(5):386–390. doi:10.1111/j.1365-2052.2004.01182.x
Hwang H, Bowen BP, Lefort N et al (2010) Proteomics analysis of human skeletal muscle reveals novel
abnormalities in obesity and type 2 diabetes. Diabetes 59(1):33–42. doi:10.2337/db09-0214
Jacobsen MJ, Mentzel CMJ, Olesen AS et al (2016) Altered methylation profile of lymphocytes is
concordant with perturbation of lipids metabolism and inflammatory response in obesity.
J Diabet Res 2016:11. doi:10.1155/2016/8539057
Johansen T, Hansen HS, Richelsen B, Malmlöf K (2001) The obese Gottingen minipig as a model
of the metabolic syndrome: dietary effects on obesity, insulin sensitivity, and growth hormone
profile. Comp Med 51(2):150–155
Kim J, Lee T, Kim T-H, Lee K-T, Kim H (2012) An integrated approach of comparative genomics
and heritability analysis of pig and human on obesity trait: evidence for candidate genes on
human chromosome 2. BMC Genomics 13:711. doi:10.1186/1471-2164-13-711
Kogelman LJA, Kadarmideen H (2014) Weighted Interaction SNP Hub (WISH) network method
for building genetic networks for complex diseases and traits using whole genome genotype
data. BMC Syst Biol 8(Suppl 2):S5. doi:10.1186/1752-0509-8-S2-S5
Kogelman LJA, Kadarmideen HN, Mark T et al (2013) An F2 pig resource population as a model
for genetic studies of obesity and obesity-related diseases in humans: design and genetic
parameters. Front Genet 4:29. doi:10.3389/fgene.2013.00029
Kogelman LJA, Cirera S, Zhernakova D, Fredholm M, Franke L, Kadarmideen H (2014a)
Identification of co-expression gene networks, regulatory genes and pathways for obesity
based on adipose tissue RNA Sequencing in a porcine model. BMC Med Genomics 7(1):57.
doi:10.1186/1755-8794-7-57
Kogelman LJA, Pant SD, Fredholm M, Kadarmideen HN (2014b) Systems genetics of obesity in
an F2 pig model by genome-wide association, genetic network and pathway analyses. Front
Genet 5:214. doi:10.3389/fgene.2014.00214
Kogelman LJA, Zhernakova DV, Westra H-J et al (2015) An integrative systems genetics approach
reveals potential causal genes and pathways related to obesity. Genome Med 7(1):1–15.
doi:10.1186/s13073-015-0229-0
Li K, Zhao H, Zhou J-C et al (2011) Differentially expressed genes in subcutaneous fat tissue in an
obese pig model induced by a high-fat diet. J Anim Vet Adv 10(14):1804–1810. doi:10.3923/
javaa.2011.1804.1810
Li A, Mo D, Zhao X et al (2013) Comparison of the longissimus muscle proteome between obese
and lean pigs at 180 days. Mamm Genome 24(1–2):72–79. doi:10.1007/s00335-012-9440-0
Liu J, Divoux A, Sun J et al (2009) Deficiency and pharmacological stabilization of mast cells
reduce diet-induced obesity and diabetes in mice. Nat Med 15(8):940–945. doi:10.1038/
nm.1994
Locke AE, Kahali B, Berndt SI et al (2015) Genetic studies of body mass index yield new insights
for obesity biology. Nature 518(7538):197–206. doi:10.1038/nature14177
Lubrano-Berthelier C, Cavazos M, Dubern B et al (2003) Molecular genetics of human obesity-
associated MC4R mutations. Ann N Y Acad Sci 994:49–57
Lunney JK (2007) Advances in swine biomedical model genomics. Int J Biol Sci 3(3):179–184.
doi:10.7150/ijbs.3.179
Madsen MB, Birck MM, Fredholm M, Cirera S (2009) Expression studies of the obesity candidate
gene FTO in pig. Anim Biotechnol 21(1):51–63. doi:10.1080/10495390903381792
Marrades MP, Gonzalez-Muniesa P, Martinez JA, Moreno-Aliaga MJ (2010) A dysregulation in
CES1, APOE and other lipid metabolism-related genes is associated to cardiovascular risk fac-
tors linked to obesity. Obes Facts 3(5):312–318. doi:10.1159/000321451
McAnulty PA, Dayan AD, Ganderup N-C, Hastings KL (2011) The minipig in biomedical
research. CRC Press, Boca Raton
Mentzel CMJ, Anthon C, Jacobsen MJ et al (2015) Gender and obesity specific MicroRNA expres-
sion in adipose tissue from lean and obese pigs. PLoS One 10(7), e0131650. doi:10.1371/
journal.pone.0131650
Michael Swindle M, Smith A (2008) Swine in biomedical research. In: Conn PM (ed) Sourcebook
of models for biomedical research. Humana Press, Totowa, pp 233–239.
doi:10.1007/978-1-59745-285-4_26
Mitchell AD, Conway JM, Potts WJ (1996) Body composition analysis of pigs by dual-energy
x-ray absorptiometry. J Anim Sci 74(11):2663–2671
Mitchell AD, Scholz AM, Conway JM (1998) Body composition analysis of small pigs by dual-
energy x-ray absorptiometry. J Anim Sci 76(9):2392–2398
Molarius A, Seidell JC (1998) Selection of anthropometric indicators for classification of abdomi-
nal fatness--a critical review. Int J Obes Relat Metab Disord 22(8):719–727
Nowacka-Woszuk J, Szczerbal I, Fijak-Nowak H, Switonski M (2008) Chromosomal localization
of 13 candidate genes for human obesity in the pig genome. J Appl Genet 49(4):373–377.
doi:10.1007/bf03195636
O’Rahilly S, Farooqi I (2006) Genetics of obesity. Philos Trans Royal Soc B Biol Sci
361(1471):1095–1105. doi:10.1098/rstb.2006.1850
Okumura N, Matsumoto T, Hayashi T et al (2013) Genomic regions affecting backfat thickness
and cannon bone circumference identified by genome-wide association study in a Duroc pig
population. Anim Genet 44(4):454–457. doi:10.1111/age.12018
Pant SD, Karlskov-Mortensen P, Jacobsen MJ et al (2015) Comparative analyses of QTLs influencing
obesity and metabolic phenotypes in pigs and humans. PLoS One 10(9), e0137356.
doi:10.1371/journal.pone.0137356
Ponsuksili S, Murani E, Brand B, Schwerin M, Wimmers K (2011) Integrating expression profil-
ing and whole-genome association for dissection of fat traits in a porcine model. J Lipid Res
52(4):668–678. doi:10.1194/jlr.M013342
Rauschert S, Uhl O, Koletzko B, Hellmuth C (2014) Metabolomic biomarkers for obesity in
humans: a short review. Ann Nutr Metab 64(3–4):314–324. doi:10.1159/000365040
Razmaite V, Kerziene S, Jatkauskiene V (2009) Body and carcass measurements and organ weights
of Lithuanian indigenous pigs and their wild boar hybrids. Anim Sci Papers Rep
27(4):331–342
Santini F, Maffei M, Pelosini C, Salvetti G, Scartabelli G, Pinchera A (2009) Melanocortin-4
receptor mutations in obesity. Adv Clin Chem 48:95–109
Scuteri A, Sanna S, Chen W-M et al (2007) Genome-wide association scan shows genetic variants
in the FTO gene are associated with obesity-related traits. PLoS Genet 3(7), e115. doi:10.1371/
journal.pgen.0030115
Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated
models of biomolecular interaction networks. Genome Res 13:2498–2504
Shoelson SE, Herrero L, Naaz A (2007) Obesity, inflammation, and insulin resistance.
Gastroenterology 132(6):2169–2180. doi:10.1053/j.gastro.2007.03.059
Shungin D, Winkler TW, Croteau-Chonka DC et al (2015) New genetic loci link adipose and insu-
lin biology to body fat distribution. Nature 518(7538):187–196. doi:10.1038/nature14132
Skinkyte-Juskiene R, Kogelman LJA, Kadarmideen HN (2015) Construction of transcription fac-
tor networks for obesity using RNAseq transcriptomics. In: Genome Informatics, Cold Spring
Harbor
Speliotes EK, Willer CJ, Berndt SI et al (2010) Association analyses of 249,796 individuals
reveal 18 new loci associated with body mass index. Nat Genet 42(11):937–948.
doi:10.1038/ng.686
Spurlock ME, Gabler NK (2008) The development of porcine models of obesity and the metabolic
syndrome. J Nutr 138(2):397–402
Steibel J, Bates R, Rosa G et al (2011) Genome-wide linkage analysis of global gene expression in
loin muscle tissue identifies candidate genes in pigs. PLoS One 6, e16766. doi:10.1371/journal.
pone.0016766
Suravajhala P, Kogelman LJA, Mazzoni G, Kadarmideen HN (2015) Potential role of lncRNA
cyp2c91-protein interactions on diseases of the immune system. Front Genet 6:255.
doi:10.3389/fgene.2015.00255
Suster D, Leury BJ, Ostrowska E et al (2003) Accuracy of dual energy X-ray absorptiometry
(DXA), weight and P2 back fat to predict whole body and carcass composition in pigs
within and across experiments. Livestock Prod Sci 84(3):231–242. doi:10.1016/
S0301-6226(03)00077-0
te Pas MFW, Koopmans S-J, Kruijt L, Calus MPL, Smits MA (2013) Plasma proteome profiles
associated with diet-induced metabolic syndrome and the early onset of metabolic syndrome in
a pig model. PLoS One 8(9), e73087. doi:10.1371/journal.pone.0073087
Tilg H, Moschen AR (2006) Adipocytokines: mediators linking adipose tissue, inflammation and
immunity. Nat Rev Immunol 6(10):772–783
Tolppanen A-M, Pulkkinen L, Kolehmainen M et al (2007) Tenomodulin is associated with obe-
sity and diabetes risk: the Finnish diabetes prevention study. Obesity 15(5):1082–1088.
doi:10.1038/oby.2007.613
van Dijk SJ, Tellam RL, Morrison JL, Muhlhausler BS, Molloy PL (2015) Recent developments
on the role of epigenetics in obesity and metabolic disease. Clin Epigenet 7(1):1–13.
doi:10.1186/s13148-015-0101-5
Vazquez G, Duval S, Jacobs DR, Silventoinen K (2007) Comparison of body mass index, waist
circumference, and waist/hip ratio in predicting incident diabetes: a meta-analysis. Epidemiol
Rev 29(1):115–128. doi:10.1093/epirev/mxm008
Walewski JL, Ge F, Gagner M et al (2010) Adipocyte accumulation of long-chain fatty acids in
obesity is multifactorial, resulting from increased fatty acid uptake and decreased activity of
genes involved in fat utilization. Obes Surg 20(1):93–107. doi:10.1007/s11695-009-0002-9
World Health Organization (2012) Obesity and overweight, Fact sheet No. 311 updated March
2013. http://www.who.int/mediacentre/factsheets/fs311/en/
Wren AM, Seal LJ, Cohen MA et al (2001) Ghrelin enhances appetite and increases food intake in
humans. J Clin Endocrinol Metabol 86(12):5992. doi:10.1210/jcem.86.12.8111
Xu H, Barnes GT, Yang Q et al (2003) Chronic inflammation in fat plays a crucial role in the
development of obesity-related insulin resistance. J Clin Invest 112(12):1821–1830.
doi:10.1172/jci19451
Xu J, Li Y, Chen WD et al (2014) Hepatic carboxylesterase 1 is essential for both normal and
farnesoid X receptor-controlled lipid homeostasis. Hepatology 59(5):1761–1771. doi:10.1002/
hep.26714

Merging Metabolomics, Genetics, and Genomics in Livestock to Dissect
Complex Production Traits

Luca Fontanesi

Abstract
Metabolomics is a multidisciplinary approach that combines several disciplines
to characterise metabolomes in terms of the identification and quantification of
all detectable metabolites present in a biological sample in a single experimental
design or approach. Merging metabolomics with genetics and genomics in livestock
provides intermediate (or molecular) phenotypes that lie between the genomic space
and the external or final phenotypes (e.g., production traits and disease resistance),
helping to explain the biological bases of complex traits. Heritability estimates have
defined the extent of the genetic contribution to metabotypes (i.e., metabolomics-derived
phenotypes). Metabotypes can be used to predict final phenotypes. Metabolite-based
genome-wide association studies carried out in cattle and pigs have identified
metabolite quantitative trait loci (mQTL) in or close to genes whose function can
directly explain the variability in the level of the corresponding metabolites. Despite
the technological limits of the analytical platforms, which cannot provide a complete
and exhaustive picture of all metabolites present in a biofluid or tissue, metabolomics
provides new traits and biomarkers. Metabolomics might establish next-generation
phenotyping approaches that are needed to refine and improve trait descriptions and, in
turn, the prediction of the breeding values of the animals, to meet the traditional and
new objectives of selection programmes.

L. Fontanesi
Department of Agricultural and Food Sciences (DISTAL), Division of Animal Sciences,
University of Bologna, Viale Fanin 46, 40127 Bologna, Italy
e-mail: luca.fontanesi@unibo.it

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_3
1 Introduction

Beadle and Tatum (1941), writing before DNA was established as the genetic material, set out
concepts that can be regarded as the foundation of the modern interpretation of
the relationships between metabolism (and metabolites) and genetics (or genomics)
in all organisms, including livestock: “From the standpoint of physiological genet-
ics the development and functioning of an organism consist essentially of an inte-
grated system of chemical reactions controlled in some manner by genes. It is
entirely tenable to suppose that these genes which are themselves a part of the sys-
tem, control or regulate specific reactions in the system either by acting directly as
enzymes or by determining the specificities of enzymes. Since the components of
such a system are likely to be interrelated in complex ways, and since the synthesis
of the parts of individual genes are presumably dependent on the functioning of
other genes, it would appear that there must exist orders of directness of gene con-
trol ranging from simple one-to-one relations to relations of great complexity.”
These authors extended the idea of Garrod (1902), who more than 100 years ago described
the first inborn error of metabolism (alkaptonuria) linking genetics and metabolites,
and developed the “one gene–one enzyme” theory (Beadle and Tatum 1941).
Extreme cases of genetic variants affecting metabolism and the related biochemical
products have more recently also been characterised in livestock. Some of them
are relevant in terms of economic impact on production systems; others are only
interesting examples of animal models. It is also worth noting that, in different
species, genetic defects in the same (i.e., homologous) genes produce similar
results. This is the case of the fishy off-flavour in cow's milk caused by elevated
levels of trimethylamine (TMA), which results from a mutation in the flavin-containing
mono-oxygenase 3 (FMO3) gene that affects the function of the encoded enzyme and the
downstream transformation of TMA into the odourless trimethylamine-N-oxide
(Lundén et al. 2002). A similar defect with high accumulation of TMA has been
reported in chicken and quail eggs (fishy taint of eggs), caused by mutations in the
homologous FMO3 genes (Honkatukia et al. 2005; Mo et al. 2013). In humans,
trimethylaminuria or fish-odour syndrome (OMIM #602079) is caused by mutations
in the same gene (Dolphin et al. 1997). Another important example of an inborn
error of metabolism leading to extreme consequences is the deficiency of uridine
monophosphate synthase (DUMPS) in cattle, which causes early embryonic death in
homozygous recessive offspring (Schwenger et al. 1993). This defect was identified
because heterozygous carrier cows had a high level of orotic acid in their milk
(Robinson et al. 1984). Elevated blood plasma cholesterol (hypercholesterolemia) has
been reported in commercial pig populations because of mutations in the APOB and
LDLR genes (Rapacz et al. 1986; Purtell et al. 1993; Hasler-Rapacz et al. 1998),
providing interesting models for human hypercholesterolemia. Several examples of
inborn errors of metabolism (or mutations affecting some metabolic products)
occurring in livestock are reported in Table 1.

Table 1 A few examples of inborn errors of metabolism and major mutations affecting the level
of metabolites in livestock

| Species | Defect/trait | Description of the defect/metabolic pathway | Mutated gene | Reference |
|---|---|---|---|---|
| Cattle | Deficiency of uridine monophosphate synthase (DUMPS) | Early embryonic death of homozygous offspring due to deficiency of uridine monophosphate; high level of orotic acid in the milk of heterozygous cows | UMPS | Schwenger et al. (1993) |
| Cattle | Fishy off-flavour milk | High level of trimethylamine (TMA) in milk | FMO3 | Lundén et al. (2002) |
| Cattle | Yellow milk and adipose tissue colour | Accumulation or higher level of beta carotene in plasma, milk and adipose tissues | BCO2 | Berry et al. (2009); Tian et al. (2010) |
| Cattle | Yellow adipose tissue colour | Accumulation of beta carotene in adipose tissues | RDHE2 | Tian et al. (2012) |
| Sheep | Yellow adipose tissue colour | Accumulation of beta carotene in adipose tissues | BCO2 | Våge and Boman (2010) |
| Chicken (layers) | Fish taint | High level of trimethylamine (TMA) in eggs | FMO3 | Honkatukia et al. (2005) |
| Quail | Fish taint | High level of trimethylamine (TMA) in eggs | FMO3 | Mo et al. (2013) |
| Pig | Hypercholesterolemia | High level of blood cholesterol | APOB | Purtell et al. (1993) |
| Pig | Recessive familial hypercholesterolemia | High level of blood cholesterol | LDLR | Hasler-Rapacz et al. (1998) |
| Rabbit | Watanabe heritable hyperlipidemia | High level of blood cholesterol and triglycerides | LDLR | Yamamoto et al. (1986) |

It is clear that these are extreme examples of altered metabolism due to genetic
mutations. Most metabolites, however, show a continuous range of interindividual
variation influenced by the two classical phenotypic components, genetic factors
(generally considered as the sum of small effects of many genes) and environmental
factors (e.g., diet, treatments, climate conditions, and circadian rhythm), and by the
interaction between the two. This continuous variation is the challenging aspect of the
interpretation of metabolic differences, which are also the foundation of clinical
monitoring to identify health and disease states and of the discovery of metabolic
markers with simple and interpretable biological meanings. For example, a continuous
range of variation in the level of plasma cholesterol is also present in pigs, in
addition to the contribution of a few major genes (Gallardo et al. 2008).
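
The decomposition of metabolite variation into genetic, environmental, and interaction components can be written compactly in standard quantitative-genetic notation. The symbols below are assumptions introduced here for illustration (the chapter itself gives no formula): P is the measured level of a metabolite, G the genetic value, E the environmental deviation, and G × E their interaction.

```latex
% Notation assumed for illustration; not taken from the chapter.
% P: observed metabolite level, \mu: population mean,
% G: genetic value, E: environmental deviation, (G \times E): interaction.
P = \mu + G + E + (G \times E)

% If the three terms are mutually uncorrelated, the variances add,
% and the broad-sense heritability H^2 is the genetic share of the total:
\sigma^2_P = \sigma^2_G + \sigma^2_E + \sigma^2_{G \times E},
\qquad
H^2 = \frac{\sigma^2_G}{\sigma^2_P}
```

Under this reading, the major mutations of Table 1 correspond to cases in which a single locus accounts for most of the genetic variance, whereas for most metabolites the genetic variance is spread over many loci of small effect.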
These studies were useful in establishing the first links between metabolites and
gene variants, demonstrating that in livestock, too, these aspects might impact
production traits. However, they considered only a few metabolites (usually
selected according to previous information) to define the phenotype under investigation
or of interest. New analytical developments in metabolomics are opening the
way to explore in more detail the relationships between the metabolism of the animals
and their genetic background.

2 Metabolites and Metabolomics

Metabolome is the term coined by Oliver et al. (1998) that is used to define all
organic molecules of small molecular mass present in a biological tissue or fluid
produced by different biochemical pathways through different enzymatic reactions
and steps. Metabolites can be defined as any metabolism-originated organic compounds
that do not directly come from gene expression (Junot et al. 2014). They can
be divided into endogenous metabolites, which are produced directly by the
organism through its biochemical machinery, and xenobiotics, which are organic
compounds that are present in an organism but derive from external molecules that
are at least in part processed or transformed in the organism. Examples of xenobiotics
are drugs, drug metabolites, pollutants, or other environment-derived compounds.
Other metabolites, which by their origin cannot be included in either of these two
groups, are produced by the microbiota and then transferred to the host, contributing to
the interplay between these two biological entities (Fig. 1). Endogenous metabolites
can also be classified as primary metabolites, which are simple molecules or monomers
(e.g., sugar phosphates, amino acids, nucleotides, organic acids, and lipid
components), and secondary metabolites, which are derived from primary metabolites
(e.g., small hormones, lipids, and phytochemicals).

Fig. 1 Classification of the metabolites according to their origin. [Diagram: endogenous
metabolites, divided into primary metabolites (sugar phosphates, amino acids, nucleotides,
organic acids and lipid components, etc.) and secondary metabolites (small hormones, lipids,
phytochemicals, etc.); xenobiotics (drugs, drug metabolites, pollutants or other
environment-derived compounds, etc.); and microbiota-derived primary and secondary
metabolites; all under the influence of genetic and environmental factors.]


Metabolomics is a multidisciplinary approach that combines analytical chemistry, to
obtain raw chemical data, with data analysis and interpretation disciplines
such as chemometrics, biostatistics, biochemistry, and bioinformatics to characterise
metabolomes in terms of the identification and quantification of all detectable
metabolites present in a biological sample in a single experimental design or
approach (Adamski and Suhre 2013). Metabolomics can provide a picture of a particular
biochemical state of an organism (through its analysed biosamples) that is
influenced by a specific combination of genetic factors (acting through gene expression
and protein or enzyme production and activities) and environmental factors
(nutrition, environmental conditions, treatments, biological phase of the animals,
etc.). In this context, we can consider metabolite-derived elements (the metabolite
species plus its quantification) as important components of the so-called intermediate
phenotypes (Fiehn 2002; Houle et al. 2010) that lie between
genomic information and complex production traits (also referred to as final or external
phenotypes; Fig. 2). These metabolomic phenotypes can be called metabotypes.
Additional “phenotypic” levels can be identified by considering the different biological
steps (and the related biomolecules investigated by other omic approaches) that lead
to the final level, i.e., complex phenotypes that directly
represent economic traits in livestock (e.g., growth rate, feed efficiency, carcass
traits, and milk production traits). Considering this layered structure of biological
information (Fig. 2), the challenge is to use metabotypes to fill the gaps
in our understanding of the biological mechanisms that, taken together, produce the
differences among animals in performance or other economically relevant traits
(Fontanesi 2016). Metabotypes, due to their position in the biological representation
of the different information levels and their closeness to the final phenotypes, can be
used as predictors or proxies of more complex production traits. In particular, they
might be very advantageous for the prediction of “difficult” traits that can be
recorded only late in the productive life of the animals, that cannot be measured on all
animals due to high cost, or that, like disease resistance, need to be defined in
conditions that cannot be routinely obtained (e.g., challenge with pathogens).

Fig. 2 Different levels of phenotypic information with intermediate phenotypes (modified from
Fontanesi, 2016). [Diagram; labels recoverable from the figure: genome level (genes and gene
variants; genomics); internal, intermediate or molecular phenotypes (gene expression,
transcriptomics; proteins, proteomics; metabotypes, metabolomics); external or final
phenotypes (production traits); environmental factors; microbiome.]

2.1 Analytical Platforms in Metabolomics

It is worth pointing out that, although many studies have contributed to a better
picture of the metabolic pathways and mechanisms (which are the foundations of
subsequent developments and improvements in this field), we are still at the beginning
of a process leading towards the complete characterisation of all metabolites
produced in a complex organism such as an animal (Patti et al. 2012; Fontanesi
2016). This is mainly due to the technology gaps that metabolomics has compared with
genomics and transcriptomics. Genomics and transcriptomics have to collect complex
information that is, however, defined by a very simple alphabet composed of
only four (+ one) letters (the four nucleotides: A, T or U, C, and G). The intrinsic
large heterogeneity of the metabolites and their high variability in terms of stability
have thus far prevented the development of metabolomic analytical platforms with
the same potential as next-generation sequencing technologies. However, recent
advances in bioanalytical approaches, including the development and integration of
mass spectrometry (MS), high-performance liquid chromatography (HPLC),
and nuclear magnetic resonance (NMR) spectroscopy, have substantially increased
the throughput, precision, and sensitivity of the analytical platforms. Each of these
methods and instruments has its own advantages and drawbacks, considering that,
to integrate metabotypes with genetics, it is very important to be able to
analyse the largest possible number of metabolites in a large number of animals at
the lowest possible cost per unit.
Analytical platforms should produce information useful for the chemical identification
and quantification of the metabolites. Chemical identification means assigning an
analyte (analytical signal) to a set of chemical compounds or to a group/class of
compounds; compounds may be “known known,” “known unknown,” or “unknown
unknown” (Milman 2015). The first case is the domain of so-called targeted metabolomics,
which identifies and quantifies at the same time defined groups of chemically
characterised and biochemically annotated metabolites (Roberts et al. 2012). In this
approach, therefore, the compounds to be determined are specified before the
analytical procedures are performed, using standards and internal
or external reference compounds (Milman 2015). Untargeted (or nontargeted)
metabolomics covers “known unknown” analytes (compounds that are unknown before the
analyses but might not be new and can be identified subsequently through informatics
analysis of available databases and libraries) and “unknown unknown” analytes (new
compounds). The choice between the two approaches (targeted or untargeted)
depends on several factors. Targeted metabolomics is useful when the enrichment
of preselected metabolites is considered sufficient to describe the biological
condition that the metabolomic analyses aim to explore. Typically, metabolites
are chosen to cover one or more related pathways of interest (if they are clearly
established) or specific groups of metabolites. The drawback of this approach is that,
in general, only a few tens of metabolites (with a maximum of approximately 150–200)
can be analysed simultaneously, with the risk of missing important non-preselected
bioanalytes. On the other hand, this approach (implemented, for example, with
GC-MS, LC-MS/MS and flow-injection analysis MS/MS) can usually produce a better
and more reproducible quantification of the biochemical molecules with a higher
throughput (Adamski and Suhre 2013). Most of the first studies that measured
metabolites in livestock for many different purposes can be considered targeted
investigations, which are still carried out today for the analysis of general
metabolites (amino acids, free fatty acids, nucleotides, etc.) or other more specific
biomarkers, taking advantage of commercial kits and common applications.
Untargeted metabolomics is a hypothesis-free approach that is open to new discoveries,
providing an unbiased exploratory analysis that measures as many metabolites
as possible (Fuhrer and Zamboni 2015). Determination of the analytes is usually
semiquantitative or relative and based on internal standards used to monitor the
outputs of the analytical apparatus. This approach can potentially quantify thousands
of metabolites, depending on the instruments and the preparation protocols of the
samples to be analysed (Sévin et al. 2015). Untargeted metabolomics is carried out
using either NMR or MS technologies. Liquid chromatography followed by mass
spectrometry (LC/MS) enables the detection of the largest number of metabolites
and requires minimal amounts of sample, and it has therefore been the technique of
choice for comprehensive metabolomic surveys (Patti et al. 2012). The advantage
of the higher sensitivity of MS compared with NMR is obtained at the cost of more
complex procedures for sample preparation, which might reduce cross-laboratory
reproducibility. NMR allows a nondestructive automated analysis of
samples by processing crude extracts. This technique does not require prior
chromatographic separation; thus, the preparation steps are simplified, and the
samples remain available for further investigations (Robinette et al. 2012;
Wolfender et al. 2015).
Many recent metabolomic experiments carried out in livestock and designed to
answer questions in different disciplines (e.g., nutrition, physiology, toxicology)
have been based on untargeted metabolomic approaches. The outputs of
untargeted approaches are raw data consisting of many different peaks (MS) or
spectra (NMR) with related information (e.g., two-dimensional bounded signal,
chromatographic peak or retention time, and m/z ratio that characterise metabolite
features in MS; ppm in NMR). Manual inspection and interpretation of these raw
data are impractical and complicated by several analytical problems, e.g., data
alignment and spectral deconvolution (Alonso et al. 2015). Specific metabolomic
software and tools (apLCMS, MathDAMP, MetaboHunter, MetAlign, MetFrag,
mzMatch, MZMine, OpenMS, XCMS, and xMSanalyzer) and databases (Biological
Magnetic Resonance Data Bank [BMRB], Human Metabolome DataBase [HMDB],
Golm Metabolome Database, Madison-Qingdao Metabolomics Consortium
Database [MQMCD], MassBank, MetFusion, and METLIN) have been developed
to overcome these problems, improving the chemical identification of the analytes
(Ellinger et al. 2013; Alonso et al. 2015; Johnson et al. 2015). Despite these tools
and the continuous development and improvement of methodologies, it is not possible
to characterise, in terms of chemical structure and function, all signals produced
in untargeted metabolomic experiments. This is due to the analytical limits of the
available platforms as well as to the fact that the metabolism of complex organisms
is much more complicated than it might appear from the rigid description given
in classical biochemistry textbooks (Patti et al. 2012).
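
To make the data-alignment problem mentioned above more concrete, the following is a minimal sketch of how MS features from two runs might be paired by m/z and retention time. It is an illustration only and does not reproduce the algorithm of any of the tools listed; the `Feature` class, the function names, the tolerance values, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One MS feature (hypothetical structure used only for this sketch)."""
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in seconds
    intensity: float  # peak intensity (arbitrary units)


def ppm_error(mz_a: float, mz_b: float) -> float:
    """Absolute m/z difference in parts per million, relative to mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def match_features(run_a, run_b, ppm_tol=10.0, rt_tol=15.0):
    """Greedy one-to-one pairing of features from two runs.

    A pair is accepted when the retention times differ by at most rt_tol
    seconds and the m/z values by at most ppm_tol ppm. Both tolerances are
    illustrative values, not recommendations.
    """
    matches = []
    used_b = set()
    for i, fa in enumerate(run_a):
        best_j, best_err = None, None
        for j, fb in enumerate(run_b):
            if j in used_b or abs(fa.rt - fb.rt) > rt_tol:
                continue
            err = ppm_error(fa.mz, fb.mz)
            if err <= ppm_tol and (best_err is None or err < best_err):
                best_j, best_err = j, err
        if best_j is not None:
            used_b.add(best_j)
            matches.append((i, best_j))
    return matches


# Invented example: two runs of the same sample with small m/z and RT shifts.
run_1 = [Feature(180.0634, 62.0, 1.2e5), Feature(132.1019, 240.5, 8.0e4)]
run_2 = [Feature(132.1021, 243.0, 7.5e4), Feature(180.0640, 60.5, 1.1e5),
         Feature(205.0972, 400.0, 3.0e4)]
aligned = match_features(run_1, run_2)
```

Each tuple in `aligned` holds the index of a feature in the first run and the index of its partner in the second; the third feature of `run_2` stays unmatched. Real pipelines typically also correct retention-time drift before matching, which this sketch omits.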

2.2 Data Analysis in Metabolomics

Depending on the experimental design or the hypotheses under investigation,
metabolomic data can be explored using different statistical methods.
Univariate methods are not very informative in exploring multidimensional metabolomic
data sets and are mainly used when few metabolites are measured. Multivariate
methods take into account all metabolomic features at the same time and can extract
information from the complex relationships between different analytes. Principal
component analysis (PCA) is the most widely used unsupervised multivariate statistical
method in metabolomics (Bro and Smilde 2014). Apart from its classical use, it
is an exploratory method that is useful for revealing problems in the data set (e.g.,
outliers and technical variability of the instruments). Supervised multivariate
approaches are much more interesting, as they can be the basis for building prediction
models (Alonso et al. 2015). Partial least squares (PLS) is the most common
supervised technique in metabolomic data analysis. PLS can be used as a binary
classifier (PLS-discriminant analysis [PLS-DA]), while orthogonal PLS (O-PLS
and O-PLS-DA) models or sparse PLS (sPLS) models improve classification in different
contexts (Ren et al. 2015).
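
The use of PCA to reveal outliers can be sketched as follows. This is a minimal illustration under stated assumptions: it extracts only the first principal component by power iteration, uses the Python standard library so that it is self-contained, autoscales the metabolites (one common choice among several), and runs on invented concentrations for six hypothetical animals. A real analysis would normally rely on a numerical library.

```python
import math


def autoscale(rows):
    """Mean-centre each column and scale it to unit variance.

    Assumes every column has non-zero variance.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def first_principal_component(rows, n_iter=500):
    """Return (loadings, scores) of PC1 via power iteration on X'X."""
    x = autoscale(rows)
    p = len(x[0])
    w = [1.0 / math.sqrt(p)] * p  # starting direction (arbitrary)
    for _ in range(n_iter):
        scores = [sum(xi[j] * w[j] for j in range(p)) for xi in x]
        w_new = [sum(scores[i] * x[i][j] for i in range(len(x)))
                 for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w_new))
        w = [v / norm for v in w_new]
    scores = [sum(xi[j] * w[j] for j in range(p)) for xi in x]
    return w, scores


# Invented data: rows are animals, columns are three metabolites.
# The last animal is deliberately extreme.
metabolite_levels = [
    [1.0, 2.1, 0.50],
    [1.2, 2.4, 0.40],
    [0.9, 1.9, 0.60],
    [1.1, 2.2, 0.50],
    [1.0, 2.0, 0.55],
    [3.0, 6.1, 0.10],
]
loadings, scores = first_principal_component(metabolite_levels)
most_extreme = max(range(len(scores)), key=lambda i: abs(scores[i]))
```

In this toy data set the animal with the largest absolute PC1 score is the deliberately extreme one, which is the kind of pattern an exploratory score plot is meant to expose. Whether such a sample reflects biology or a technical problem still has to be judged case by case.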
Statistical analyses (as reported previously) cannot extract all biological infor-
mation attached to metabolomic data, which, by definition, are associated to bio-
chemical pathways that are already described or are not defined yet. Methodologies
in these contexts use pathway and network analyses, respectively. Pathway analysis
is based on previous information that derives from biochemistry or other disciplines,
on groups of related metabolites that are linked to one another through subsequent
enzymatic steps. Biological pathways are organised in manually curated and dedi-
cated databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG;
http://www.genome.jp/kegg/; Kanehisa et al. 2014), MetaCyc (http://metacyc.org/;
Caspi et al. 2014), small molecule pathway database (SMPDB; http://www.smpdb.
ca/; Jewison et al. 2014), and WikiPathways (http://wikipathways.org/; Kelder et al.
2012). The principle of pathway data analysis in metabolomics is based on metabo-
lite set enrichment analysis (MSEA; Xia and Wishart 2010). This methodology is
derived from the gene set enrichment analysis (GSEA), which is a methodology that
was developed to enrich gene functional information in genomics and
Merging Metabolomics, Genetics, and Genomics 51

transcriptomics (Tárraga et al. 2008). MSEA is implemented in a few tools for com-
VetBooks.ir

prehensive metabolomic analyses such as MetaboAnalyst, which includes several


modules for different data analysis purposes and pathway visualisation (Xia et al.
2015). An example of application of MSEA in livestock has been reported by Bovo
et al. (2015) in the metabolomic profile dissection in pigs of different sexes (Fig. 3).
Network analyses are hypothesis-free in terms of relationship or link construction.
Networks are constructed based on correlations between metabolites, as observed
within the data set produced from a specific comparison between individuals in a
defined condition or experimental design. In this way, it is possible to identify new
relationships among metabolites and reconstruct biological pathways without any
predefined biochemical information. An example of this approach is given by the
use of Gaussian graphical modelling (GGM) to reconstruct pathway reactions from
high-throughput metabolomic data (Krumsiek et al. 2011).
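The difference between a plain correlation network and a Gaussian graphical model can be illustrated with a small sketch. It is not the implementation used by Krumsiek et al. (2011); it assumes simulated data for three hypothetical metabolites (a, b, c) arranged in a linear chain, and it derives partial correlations from the inverse of the correlation matrix, which is the quantity a GGM uses to draw edges.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix; `columns` is a list of equal-length lists."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centred = [[x - m for x in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in centred]
    p = len(columns)
    return [[sum(u * v for u, v in zip(centred[i], centred[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for small matrices)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(columns):
    """Partial correlation of each pair given all remaining variables."""
    precision = invert(correlation_matrix(columns))
    p = len(precision)
    return [[1.0 if i == j
             else -precision[i][j] / math.sqrt(precision[i][i] * precision[j][j])
             for j in range(p)] for i in range(p)]


# Simulated chain a -> b -> c (hypothetical metabolites, arbitrary units).
rng = random.Random(1)
n_samples = 2000
a = [rng.gauss(0, 1) for _ in range(n_samples)]
b = [x + 0.5 * rng.gauss(0, 1) for x in a]
c = [x + 0.5 * rng.gauss(0, 1) for x in b]

pearson = correlation_matrix([a, b, c])
partial = partial_correlations([a, b, c])
```

In this simulated chain, a and c are strongly correlated, yet their partial correlation given b is close to zero, so a GGM would link a to b and b to c but not a to c.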

[Fig. 3 plot: scatter of pathways numbered 1 to 9; x-axis: pathway impact (0.00 to 0.30); y-axis: -log(p)]

Fig. 3 An example of a metabolome view of the pathway analysis obtained with MetaboAnalyst
in pigs (modified from Bovo et al. 2015). The impact is the pathway impact value calculated from
pathway topology analysis. Indicated pathways are as follows: (1) valine, leucine, and isoleucine
biosynthesis; (2) valine, leucine, and isoleucine degradation; (3) aminoacyl-tRNA biosynthesis;
(4) beta-alanine metabolism; (5) arginine and proline metabolism; (6) glycerophospholipid metab-
olism; (7) linoleic acid metabolism; (8) tryptophan metabolism; and (9) taurine and hypotaurine
metabolism. Other pathways are not indicated
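One common way to score a pathway in enrichment analysis is an over-representation test based on the hypergeometric distribution. The sketch below is an illustration under that assumption, not a reproduction of the MSEA algorithm or of MetaboAnalyst code; the function name and all counts are hypothetical.

```python
import math


def enrichment_p_value(total, in_pathway, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    total      : metabolites measured in the study (reference set)
    in_pathway : measured metabolites that belong to the pathway
    selected   : metabolites flagged as significant
    overlap    : flagged metabolites that belong to the pathway
    """
    upper = min(in_pathway, selected)
    tail = sum(math.comb(in_pathway, k) * math.comb(total - in_pathway, selected - k)
               for k in range(overlap, upper + 1))
    return tail / math.comb(total, selected)


# Hypothetical counts: 100 measured metabolites, 10 in the pathway,
# 20 flagged as significant, 6 of which fall in the pathway.
p_value = enrichment_p_value(total=100, in_pathway=10, selected=20, overlap=6)
score = -math.log(p_value)  # the kind of quantity plotted on a -log(p) axis
```

A larger overlap than expected by chance gives a smaller p value and therefore a higher score on the vertical axis of a plot like Fig. 3; the pathway impact on the horizontal axis comes from topology analysis and is not computed here.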

3 Metabolomics for the Dissection of Complex Traits in Livestock

The recording and analysis of phenotypes are fundamental elements in animal
breeding. Metabolomics produces metabotypes that can be defined as internal, or
intermediate or molecular phenotypes, according to their use and informativity in a
specific biological context (Fig. 2). Metabotypes identified in animals can be useful
to link genetics on one side with physiology on the other, providing meaningful
information for novel applications in animal breeding and genetics. This is the
advantage of including metabotypes in addition to regular phenotypes (final
phenotypes) to dissect complex traits.
Metabolomic studies in livestock benefit from the possibility of sampling biofluids
and tissues (e.g., milk in dairy species, or muscle and other tissues collected after
slaughter in meat species) that cannot be easily or routinely obtained in other
species (i.e., in humans), and from environmental factors (e.g., feeding and housing)
that can be controlled or defined more easily. On the other hand, in livestock,
especially in field trials and for blood collection, it is usually more difficult to
define standard operating protocols for the sampling and processing of specimens for
subsequent metabolomic analysis.

3.1 Heritability of Metabotypes

The first step to link genetics to metabolites is to estimate the heritability of metabo-
types. The heritability of metabolomically derived information has been investigated
thus far in pigs and dairy cattle. In pigs, in particular, heritability was estimated for
plasma metabolites analysed with a targeted approach in approximately 900 perfor-
mance-tested Italian Large White pigs (Fontanesi et al. 2014). Data were grouped
according to the biochemical classes of the different metabolites. A broad range of
values was reported (from 0.07 to 0.73), suggesting considerable heterogeneity both
within and across metabolite classes (Fontanesi et al. 2014). In dairy species, the
most interesting biofluid for metabolomic analysis is milk. It can be
easily collected, and it provides useful information to assess the animal metabolism
and to evaluate nutritional and cheese-making properties. For these reasons, several
studies in dairy cattle estimated the heritability of specific milk metabolites that were
considered important biomarkers of particular states of the cow or predictors of pro-
duction and functional traits (e.g., urea measured as milk urea nitrogen, lactate,
β-hydroxybutyrate or BHBA, acetone, and glucose; Welper and Freeman 1992;
Mitchell et al. 2005; Miglior et al. 2006, 2007; Stoop et al. 2007; van der Drift et al.
2012) or relevant for nutritional quality of the milk (e.g., fatty acids; Soyeurt et al.
2007; Stoop et al. 2008). Two studies estimated the heritability of quite a large num-
ber of metabolites detected in bovine milk using metabolomic approaches (Buitenhuis
et al. 2013; Wittenburg et al. 2013). Buitenhuis et al. (2013) detected 31 metabolites
in 371 mid-lactating Danish Holstein cows using ¹H-NMR spectroscopy, obtaining
estimates of h² ranging from 0 (lactic acid) to more than 0.8 for orotic acid and
BHBA. Wittenburg et al. (2013) estimated genetic parameters and evaluated the
mode of inheritance of 190 milk metabolites analysed by GC-MS in 1295 Holstein
cows. Heritability ranged from 0 (for 15 metabolites) to approximately 0.7 for
3-(4-hydroxyphenyl)lactic acid, and significant additive genetic variation was calcu-
lated for 55 metabolites. Dominance variation was reported for only two metabolites
(2-oxoglutaric acid and benzoic acid; Wittenburg et al. 2013). These two studies used
different analytical platforms and milk from cows at different lactation periods, and
they shared only a few metabolites (only 10 were listed in both studies), for which
the heritability estimates were only partly concordant. For example, the heritability
of lactic acid was almost the same in the two studies (Buitenhuis
et al. 2013; Wittenburg et al. 2013). As lactic acid level in bovine milk has been sug-
gested as a potential marker for mastitis (Davis et al. 2004), it seems clear from these
estimates that it could be used only as a transient indicator of the status of the cow
and not as a predictor for the susceptibility to this disease (Wittenburg et al. 2013).
On the other hand, the h² of orotic acid, an intermediate in pyrimidine biosynthesis
important in nutrition (Tiemeyer et al. 1984), was very different in the two investigations
(0.86 and 0.21 in Buitenhuis et al. 2013 and Wittenburg et al. 2013, respectively).
The heritability of other important milk metabolites detected in Wittenburg et al.
(2013) that were reported to be indicators of disease states, like subclinical ketosis
(e.g., BHBA; in addition to other ketone bodies; Geishauser et al. 2000) or prognos-
tic markers for risk of ketosis development (e.g., glycerophosphocholine [GPC],
considered as a ratio with phosphocholine [PC]; Klein et al. 2012), was high or
medium, indicating that these biomarkers can be used to breed animals that might be
able to cope with negative energy balance frequently occurring in early lactation
(Klein et al. 2012; Wittenburg et al. 2013). The precision of heritability estimates
is clearly influenced by population structure and the number of observations. It is
important to note that the heritability estimates for some of these metabolites (e.g.,
BHBA) differ from those reported by other studies, which did not confirm the
high heritability values (van der Drift et al. 2012). Furthermore, medium heritability
estimates (Buitenhuis et al. 2013; Wittenburg et al. 2013) could predict the same
breeding potential for other metabolites that are favourable for human nutrition (e.g.,
choline, oligosaccharides, and several precursors of vitamins) or for improved tech-
nological properties of the milk (e.g., citric acid).
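As a reminder of what these estimates represent, narrow-sense heritability is the share of phenotypic variance attributable to additive genetic variance. The sketch below assumes that variance components have already been estimated (for example with a mixed model, which is not shown); the function name and the numbers are illustrative and are not taken from the cited studies.

```python
def narrow_sense_heritability(var_additive, var_residual, var_dominance=0.0):
    """h2 = additive genetic variance / total phenotypic variance."""
    if min(var_additive, var_residual, var_dominance) < 0:
        raise ValueError("variance components must be non-negative")
    total = var_additive + var_dominance + var_residual
    if total == 0:
        raise ValueError("total phenotypic variance must be positive")
    return var_additive / total


# Invented variance components (arbitrary units).
h2_example = narrow_sense_heritability(var_additive=2.0, var_residual=6.0)

# Same total variance, but part of the non-additive share is dominance variance.
h2_with_dominance = narrow_sense_heritability(2.0, 5.0, var_dominance=1.0)
```

Because dominance variance enters only the denominator, a metabolite with dominance variation has a lower h² than its total genetic variance alone would suggest.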
As several analysed milk metabolites are included in the same metabolic pathway
(e.g., glycolysis), a first attempt to evaluate a pathway-derived measure of genetic
variability was made by calculating a combined heritability of correlated
metabolites included in the same pathway (Wittenburg et al. 2013). This
approach tried to link heritability to pathway analysis considering the fact that
metabotypes are biochemically linked together in complex biochemical pathways
that in most cases are not completely described.

3.2 Metabotypes as Predictors of Economically Relevant Traits

Metabolomics provides information on internal or intermediate phenotypes that can
be used to predict production traits based on two main principles: (i) deconstruction
of complex traits into simpler traits (metabotypes), usually interconnected with
one another (in metabolic pathways) and close to the biological mechanisms that
determine, on the whole, a final phenotype, and (ii) identification of biomarkers that
can be used as correlated and convenient proxies or substitutes for traditionally
defined traits. If this were possible, the advantages would be (i) to predict traits
that are difficult or expensive to measure or detect (e.g., disease resistance) or
(ii) to predict as early as possible traits that can be measured or inferred only late
in the productive life of the animals, with or without any other information, such as
pedigree and the related genealogically derived estimated breeding values.
Approaches in these directions were first designed to identify associations
between metabolites and production traits or specific states of the animals as a
starting point to define specific physiological roles related to their presence or
their different levels in some conditions. Several studies in this direction were reported in
dairy cattle, analysing blood and/or milk metabolites without any direct evaluation
of the genetic factors affecting these relationships (e.g., Klein et al. 2010; Ilves et al.
2012; Harzia et al. 2012, 2013; Melzer et al. 2013a; Sundekilde et al. 2014).
Genetic factors affecting metabolite parameters were considered to establish a
prognostic biomarker for risk of ketosis in dairy cattle (Klein et al. 2012). Breeding
values for energy balance and milk fat-to-protein ratio (which were reported to be
associated with liability to metabolic disorders; Buttchereit et al. 2011) were
significantly correlated with the level of milk glycerophosphocholine (GPC),
phosphocholine (PC), or their ratio (GPC/PC). High GPC and low PC values and a GPC/PC
ratio >2.5 were considered as indicators of resistance to ketosis in dairy cattle (Klein
et al. 2012).
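The threshold reported by Klein et al. (2012) lends itself to a simple screening rule. The sketch below is only illustrative: the function names are hypothetical, the input values are invented, and both concentrations are assumed to be expressed in the same units.

```python
def gpc_pc_ratio(gpc, pc):
    """Ratio of milk glycerophosphocholine (GPC) to phosphocholine (PC)."""
    if pc <= 0:
        raise ValueError("PC concentration must be positive")
    return gpc / pc


def indicates_ketosis_resistance(gpc, pc, threshold=2.5):
    """True when the GPC/PC ratio exceeds the threshold cited in the text."""
    return gpc_pc_ratio(gpc, pc) > threshold


# Invented concentrations for two cows (same arbitrary units for GPC and PC).
cow_a = indicates_ketosis_resistance(gpc=3.0, pc=1.0)  # ratio 3.0
cow_b = indicates_ketosis_resistance(gpc=2.0, pc=1.0)  # ratio 2.0
```

Keeping the threshold as a parameter makes it easy to revisit the cut-off if other populations or analytical platforms suggest a different value.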
Melzer et al. (2013a) used milk metabolite profiles to predict milk protein and fat
content and milk pH. Important metabolites were identified using random forests
and PLS. Prediction precision (defined as the correlation between estimated and
observed milk trait values) was higher for milk protein (0.63–0.64) with 16 impor-
tant metabolites identified and lower for milk fat and pH (approximately 0.35) with
11 and 10 different important metabolites identified for the two traits, respectively
(Melzer et al. 2013a). Genetic correlations between milk metabolites and milk pro-
duction traits were reported by Buitenhuis et al. (2013). Several metabotypes were
correlated with one or another milk trait, indicating that some of these metabolites
could be eventually used as biomarkers to disrupt unfavourable correlations between
traits.
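Prediction precision, as defined above, is the correlation between estimated and observed trait values. The sketch below shows how such a value could be obtained with cross-validation. It deliberately swaps the random forests and PLS models used by Melzer et al. (2013a) for a single-predictor least-squares line, and it runs on simulated data, so the names and numbers are assumptions rather than results.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def cross_validated_precision(x, y, folds=5):
    """Correlation between out-of-fold predictions and observed values."""
    n = len(x)
    predictions = [0.0] * n
    for fold in range(folds):
        test = set(range(fold, n, folds))
        train = [i for i in range(n) if i not in test]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = intercept + slope * x[i]
    return pearson(predictions, y)


# Simulated data: one metabolite level and one milk trait (arbitrary units).
rng = random.Random(42)
metabolite = [rng.gauss(0, 1) for _ in range(500)]
trait = [0.8 * m + 0.6 * rng.gauss(0, 1) for m in metabolite]

precision = cross_validated_precision(metabolite, trait)
```

Each animal is predicted by a model that never saw its own record, which avoids the optimistic bias of computing the correlation on the data used for fitting.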
In beef cattle, Karisa et al. (2014) reported that 12 plasma metabolites were sig-
nificantly associated with residual feed intake and accounted for approximately
98% of the variation in this trait. However, metabolite levels fluctuated greatly
across different ages in different steer populations and should be considered only as
potential biomarkers for this important trait.
Prediction power of metabolomic profiles for production traits has been investigated
in performance-tested growing pigs (60 days old) from three breeds using plasma
¹H-NMR fingerprinting (Rohart et al. 2012). This approach relies only indirectly on
dissecting the traits: the contribution of the whole metabolomic profile to a few
traits is summed up, so there is no need to fully characterise the metabolic peaks
and attribute a chemical name to all signals. For this approach, the
biological interpretation might be a secondary objective and could be derived in part
from the selected metabolites with the greatest predictive values (Rohart et al. 2012).
Predicted traits were growth rate, feed efficiency, carcass, and meat quality traits.
Prediction accuracy was highly dependent on the trait and improved by including in
the model the breed of origin of the animals, but not by including the batch of the
animals (probably because micro-environmental effects were not significantly different
in a performance testing structure). Traits that were determined after slaughter
were predicted with high error rates. This might be expected, considering, for example,
that slaughtering conditions are well known to be the most important factors
affecting meat quality parameters. Other traits were well predicted. In particular,
average daily feed intake, which is an expensive and difficult trait to measure,
was predicted with quite good accuracy (Rohart et al. 2012).

3.3 Metabolomics and Genomics

The link between genomic information and the level of metabolites accumulated in
specific tissues, circulating in biofluids or essential for important cell functions in
animals, has been already established for several inborn errors of metabolism, as
already discussed, with the identification of causative mutations for these defects
(Table 1). However, for many metabolites that do not affect in extreme ways visible
animal phenotypes or produce genetic diseases, we are just beginning to establish
relationships at the genomic level. In addition, considering information derived
from the estimation of heritability in livestock, it is clear that genetic factors (i.e.,
gene polymorphisms directly identified or indirectly captured using DNA markers
in linkage disequilibrium) can affect the level of many other metabolites, leading
from minor (if even detectable) to relevant modifications of metabolomic profiles
(Fontanesi 2016). Metabolites whose level is modified by genetic factors have been
called genetically influenced metabotypes or GIM (Suhre and Gieger 2012).
Genomewide association studies using metabotypes (mGWAS) analysed in serum,
plasma, urine, and liver have already identified, in humans and mice, SNP–metabolite
trait associations or mQTL close to or within genes encoding key
components of the metabolic machinery (e.g., enzymes, transporters, or other
related proteins) (reviewed in Suhre and Gieger 2012; Gauguier 2015; Kastenmüller
et al. 2015). In this way, mQTLs establish a direct link between metabotypes and genes
that can explain the biological reasons for these associations. In some cases, the functional
interpretation of the results might be difficult due to the lack of information on the
roles of the genes, even if the association with known and well-defined metabolites
might help to attribute a potential function to uncharacterised genes. On the other
hand, it could be possible to deorphanise uncharacterised metabolites or metabolite
features (peaks or spectra, according to the analytical platforms) if mQTLs are
localised on genes that are already well described (Rueedi et al. 2014). mQTLs usu-
ally explain a relevant fraction of the genetic variance for the associated metabo-
types (10%–30%). The same regions might be associated with increased risks for
complex diseases, suggesting that the colocalised mQTL might be important to
define the disease state or the biological mechanisms underlying the disease state or
susceptibility.
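In its simplest form, a single-marker mGWAS test regresses a metabotype on the allele count at one SNP. The sketch below uses simulated genotypes and a hypothetical effect size; it reports the share of phenotypic variance explained by the SNP, which is related to but not the same as the fraction of genetic variance quoted above, and it omits the corrections for relatedness and population structure that real analyses require.

```python
import random


def snp_association(genotypes, metabotype):
    """Regress a metabotype on allele counts (0, 1, 2).

    Returns (slope, r_squared): the additive allele effect and the share of
    phenotypic variance explained by the SNP.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_m = sum(metabotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - mean_g) * (m - mean_m) for g, m in zip(genotypes, metabotype))
    syy = sum((m - mean_m) ** 2 for m in metabotype)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Simulated data: allele frequency 0.5, additive effect chosen so that the SNP
# explains roughly 20% of the phenotypic variance (hypothetical values).
rng = random.Random(7)
n_animals = 1000
genotypes = [int(rng.random() < 0.5) + int(rng.random() < 0.5)
             for _ in range(n_animals)]
metabotype = [0.7071 * g + rng.gauss(0, 1) for g in genotypes]

slope, r_squared = snp_association(genotypes, metabotype)
```

Repeating this test across all markers, with an appropriate multiple-testing threshold, yields the genome-wide and chromosome-wide significance levels mentioned for the livestock studies.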
Population-based mGWAS have also been carried out in pigs and dairy cattle. In
pigs, Fontanesi et al. (2014, 2015) reported an mGWAS on performance-tested animals
whose plasma was analysed with a targeted metabolomic platform. The level
of several circulating plasma nutrients was associated with several genes, explaining
a relevant fraction of the genetic variability of these metabotypes and thus
creating new possibilities to design nutrigenomic approaches (Fontanesi 2016). In
dairy cattle, Buitenhuis et al. (2013) used an untargeted metabolomic approach
based on NMR to analyse metabotypes in milk of 371 Holstein cows. Eight genome-wide
associations were reported on different bovine chromosomes (BTA): orotic
acid on BTA1, malonate on BTA2 and BTA3, galactose-1-phosphate on BTA2, glucose
on BTA11, urea on BTA12, and carnitine and glycerophosphocholine on
BTA25. Another 21 chromosome-significant associations were reported. Of these
mQTL, a few were located on genes or close to genes that might be involved in
defining the associated milk metabotypes, whereas for others, the function of the
closest genes in relationships with the associated metabolites was not clear. These
unexpected relationships could possibly contribute to assigning novel functions to
these genes. Among the metabolites for which QTLs were identified, it is worth
mentioning glycerophosphocholine, which is considered a biomarker for ketosis
resistance (Klein et al. 2012). Another GWAS that was carried out for a few milk
metabolites related to ketosis resistance (phosphocholine, glycerophosphocholine,
and the ratio between the two metabolites) confirmed an mQTL for glycerophosphocholine
on BTA25 (Tetens et al. 2015). Gene variants in the apolipoprotein B
receptor (APOBR) gene were suggested to be the causative mutations of the
mQTL (Tetens et al. 2015).
Other GWAS were based on one or a few milk or plasma metabolites in dairy
cattle. These metabolites were preselected based on their relevance to human nutri-
tion (considering the milk) or as biomarkers of physiological states of the cows, as
also discussed above. In particular, Poulsen et al. (2015) focused their study on the
level of riboflavin (vitamin B2) in the milk of a total of ~800 Danish Holstein and
Danish Jersey cows. Riboflavin is an essential water-soluble vitamin with many
biological roles, and milk is one of the main sources of this nutrient in the human
diet. Significant markers were reported on BTA14 and BTA17 in Jersey and on
several other chromosomes in Holstein, most of which were on BTA13 and BTA14.
The most promising mQTL for riboflavin content was located on BTA13 in
correspondence with the SLC52A3 gene, coding for a riboflavin transporter, whose
function might directly affect this metabotype. Another GWAS for nonesterified fatty
acid (NEFA), BHBA, and glucose in bovine milk (considered as indicators of metabolic
adaptation of the cows) was reported by Ha et al. (2015). Instead of conventional
single-marker analyses, this study used gene enrichment approaches to
increase the power in obtaining gene sets and pathways that might contribute to
explain the metabolic adaptability of dairy cows in their early lactation periods.
Lu et al. (2015) reported a study that evaluated the milk lipid and metabolome
composition in a few milk samples from cows with different genotypes at the
DGAT1 K232A polymorphism. The differences observed between genotypes may
contribute to understanding the basic biological mechanisms of this mutation having
major effects on milk production and composition.
In addition to these approaches, Weikard et al. (2010) reported a family-based
association study for plasma metabotypes obtained from individuals of an F2 family
produced by crossing Charolais with German Holstein. In this study, two causative
mutations (NCAPG I442M and GDF8 Q204X) modulating pre- and
postnatal growth rate were first used to identify which metabolites were associated
with these markers. Subsequently, all chromosome regions were covered with the
Illumina BovineSNP50 BeadChip, obtaining additional signals of suggestive association
with residual feed intake that, combined with metabolomic data, were used
in a systems biology approach to understand the effect of the NCAPG I442M
mutation on feed efficiency and feed intake in male cattle at the onset of puberty
(Widmann et al. 2013, 2015).

3.4 A Simplified Systems Genetic Approach in Livestock

The inclusion of intermediate phenotypes between the genomic space and the external
phenotypes helps to fill the biological gaps between these two most distant
biological levels (Fig. 2). Among the several possible intermediate levels, metabotypes
seem to be the most promising for developing approximate systems genetic models
to understand the molecular basis of complex traits. This is because
metabotypes are very close to the external phenotypes (i.e., production traits) that
are important in animal breeding. In addition, the biochemical profiles of the animals
can be useful to monitor or to define their physiological states, which in most
cases express the production potential of the animals, if it is possible to distinguish
the genetic components from the environmental influences. Missing biological information
at the other intermediate levels may produce approximations. However, it
seems that a three-level modelling system can potentially be implemented in practice
to clarify the biological steps that produce economically relevant traits and, in
turn, to predict final phenotypes (Fontanesi 2016). Among the intermediate phenotypes,
metabotypes seem much easier to analyse on a routine basis.
For example, it is usually much easier and cheaper to analyse metabolites in milk or
plasma (or serum) from a large number of animals than to obtain gene expression data
at a genomewide level and at the population level. In addition, the collection of
relevant tissues for gene expression analysis might be very complicated in field trials.
It could be interesting to include metabolomic data in addition to SNPs in
novel methodological implementations of genomic selection. The integration of
metabolomics and genomics into genomic selection could be useful when the prediction
accuracy is limited by the low number of animals in the training
population, when the heritability of the investigated trait is low, or when it is
important to use proxies for more complex or difficult traits that cannot be measured
directly on the animals (e.g., disease resistance defined in challenge experiments). As a
first step in this direction, Ehret et al. (2015) described predictive models for
subclinical ketosis risk in approximately 200 cows by merging SNP data and a few
metabotypes measured in milk, already described to be associated with this defect
in dairy cattle. This preliminary attempt may suggest that additional methodological
advances might be needed before large-scale implementations are designed, even if
this strategy seems promising.

Conclusions
mGWAS implemented in livestock reported significant markers even if a lower
number of individuals were analysed than what is common in GWAS carried out
in humans. This might be due, at least in part, to the fact that it is usually much
easier to control or reduce environmental factors affecting the level of metabo-
lites in animals than in humans. On the other hand, large-scale implementations
of metabolomic studies in animals seem intrinsically more difficult in field trials
than in humans, and specific sampling protocols and procedures might be needed.
Despite the technological limitations (limited analytical platforms) that prevent
a complete and exhaustive picture of all metabolites present in a
biofluid or tissue, metabolomics merged with genetics and genomics can contribute
to clarifying the biological bases of complex traits in livestock. New traits
and biomarkers can be defined using metabolomics. Metabolomics can be used
to establish next-generation phenotyping approaches that are needed to refine
and improve trait descriptions and, in turn, prediction of the breeding values of
the animals to cope with traditional and new objectives of selection programmes
(Fontanesi 2016).

Acknowledgements My research work on metabolomics in livestock has been supported by the
Italian Ministry of Agricultural, Food and Forestry Policies (MiPAAF), Innovagen project.

References
Adamski J, Suhre K (2013) Metabolomics platforms for genome wide association studies – linking
the genome to the metabolome. Curr Opin Biotechnol 24:39–47. doi:10.1016/j.
copbio.2012.10.003
Alonso A, Marsal S, Julià A (2015) Analytical methods in untargeted metabolomics: state of the
art in 2015. Front Bioeng Biotechnol 3:23. doi:10.3389/fbioe.2015.00023
Beadle GW, Tatum EL (1941) Genetic control of biochemical reactions in Neurospora. Proc Natl
Acad Sci U S A 27:499–505
Berry SD, Davis SR, Beattie EM et al (2009) Mutation in bovine beta-carotene oxygenase 2
affects milk color. Genetics 182:923–926. doi:10.1534/genetics.109.101741
Bovo S, Mazzoni G, Calò DG et al (2015) Deconstructing the pig sex metabolome: targeted
metabolomics in heavy pigs revealed sexual dimorphisms in plasma biomarkers and metabolic
pathways. J Anim Sci 93:5681–5693. doi:10.2527/jas2015-9528
Bro R, Smilde AK (2014) Principal component analysis. Anal Methods 6:2812–2831. doi:10.1039/
c3ay41907j
Buitenhuis AJ, Sundekilde UK, Poulsen NA et al (2013) Estimation of genetic parameters and
detection of quantitative trait loci for metabolites in Danish Holstein milk. J Dairy Sci 96:3285–
3295. doi:10.3168/jds.2012-5914
Buttchereit N, Stamer E, Junge W et al (2011) Short communication: genetic relationships among
daily energy balance, feed intake, body condition score, and fat to protein ratio of milk in dairy
cows. J Dairy Sci 94:1586–1591. doi:10.3168/jds.2010-3396
Caspi R, Altman T, Billington R et al (2014) The MetaCyc database of metabolic pathways and
enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res
42:D459–D471. doi:10.1093/nar/gkt1103
Davis SR, Farr VC, Prosser CG et al (2004) Milk L-lactate concentration is increased during mas-
titis. J Dairy Res 71:175–181. doi:10.1017/S002202990400007X
Dolphin CT, Janmohamed A, Smith RL et al (1997) Missense mutation in flavin-containing mono-
oxygenase 3 gene, FMO3, underlies fish-odour syndrome. Nat Genet 17:491–494. doi:10.1038/
ng1297-491
Ehret A, Hochstuhl D, Krattenmacher N et al (2015) Short communication: use of genomic and
metabolic information as well as milk performance records for prediction of subclinical
ketosis risk via artificial neural networks. J Dairy Sci 98:322–329. doi:10.3168/
jds.2014-8602
Ellinger JJ, Chylla RA, Ulrich EL et al (2013) Databases and software for NMR-based metabolo-
mics. Curr Metabolomics 1(1). doi:10.2174/2213235X11301010028
Fiehn O (2002) Metabolomics – the link between genotypes and phenotypes. Plant Mol Biol
48:155–171. doi:10.1023/A:1013713905833
Fontanesi L (2016) Metabolomics and livestock genomics: insights into a phenotyping frontier and
its applications in animal breeding. Anim Front 6:73–79. doi:10.2527/af.2016-0011
Fontanesi L, Bovo S, Mazzoni G et al (2014) Genome wide perspective of genetic variation in pig
metabolism and production traits. Manuscript n. 359. Proceedings of 10th world congress on
genetics applied to livestock production, Vancouver, 17–22 Aug 2014
Fontanesi L, Schiavo G, Bovo S et al (2015) Dissecting complex traits in pigs: metabotypes illu-
minate genomics for practical applications. P. 152. Abstract retrieved from the book of abstracts
of the 66th annual meeting of the European Federation of Animal Science. Book of Abstracts
No. 21, Warsaw, 31 Aug–4 Sept 2015
Fuhrer T, Zamboni N (2015) High-throughput discovery metabolomics. Curr Opin Biotechnol
31:73–78. doi:10.1016/j.copbio.2014.08.006
Gallardo D, Pena RN, Amills M et al (2008) Mapping of quantitative trait loci for cholesterol,
LDL, HDL, and triglyceride serum concentrations in pigs. Physiol Genomics 35:199–209.
doi:10.1152/physiolgenomics.90249.2008
Garrod AE (1902) The incidence of alkaptonuria: a study in chemical individuality. Lancet
2:1616–1620
Gauguier D (2015) Application of quantitative metabolomics in systems genetics in rodent models
of complex phenotypes. Arch Biochem Biophys. doi:10.1016/j.abb.2015.09.016
Geishauser T, Leslie K, Tenhag J et al (2000) Evaluation of eight cow-side ketone tests in milk for
detection of subclinical ketosis in dairy cows. J Dairy Sci 83:296–299. doi:10.3168/jds.
S0022-0302(00)74877-6
Ha NT, Gross JJ, van Dorland A et al (2015) Gene-based mapping and pathway analysis of
metabolic traits in dairy cows. PLoS One 10:e0122325. doi:10.1371/journal.pone.0122325
Harzia H, Kilk K, Jõudu I et al (2012) Comparison of the metabolic profiles of noncoagulating and
coagulating bovine milk. J Dairy Sci 95:533–540. doi:10.3168/jds.2011-4468
Harzia H, Ilves A, Ots M et al (2013) Alterations in milk metabolome and coagulation ability dur-
ing the lactation of dairy cows. J Dairy Sci 96:6440–6448. doi:10.3168/jds.2013-6808
Hasler-Rapacz J, Ellegren H, Fridolfsson AK et al (1998) Identification of a mutation in the low
density lipoprotein receptor gene associated with recessive familial hypercholesterolaemia in
swine. Am J Med Genet 76:379–386. doi:10.1002/(SICI)1096-8628(19980413)76:5<379::AID-
AJMG3>3.0.CO;2-I
Honkatukia M, Reese K, Preisinger R et al (2005) Fishy taint in chicken eggs is associated with a
substitution within a conserved motif of the FMO3 gene. Genomics 86:225–232. doi:10.1016/j.
ygeno.2005.04.005
Houle D, Govindaraju DR, Omholt S (2010) Phenomics: the next challenge. Nat Rev Genet
11:855–866. doi:10.1038/nrg2897
Ilves A, Harzia H, Ling K et al (2012) Alterations in milk and blood metabolomes during the first
months of lactation in dairy cows. J Dairy Sci 95:5788–5797. doi:10.3168/jds.2012-5617
Jewison T, Su Y, Disfany FM et al (2014) SMPDB 2.0: big improvements to the Small Molecule
Pathway Database. Nucleic Acids Res 42:D478–D484. doi:10.1093/nar/gkt1067
Johnson CH, Ivanisevic J, Benton HP et al (2015) Bioinformatics: the next frontier of metabolo-
mics. Anal Chem 87:147–156. doi:10.1021/ac5040693
Junot C, Fenaille F, Colsch B et al (2014) High resolution mass spectrometry based techniques at
the crossroads of metabolic pathways. Mass Spectrom Rev 33:471–500. doi:10.1002/
mas.21401
Kanehisa M, Goto S, Sato Y et al (2014) Data, information, knowledge and principle: back to
metabolism in KEGG. Nucleic Acids Res 42:D199–D205. doi:10.1093/nar/gkt1076
Karisa BK, Thomson J, Wang Z et al (2014) Plasma metabolites associated with residual feed
intake and other productivity performance traits in beef cattle. Liv Sci 165:200–211.
doi:10.1016/j.livsci.2014.03.002
Kastenmüller G, Raffler J, Gieger C et al (2015) Genetics of human metabolism: an update. Hum
Mol Genet 24:R93–R101. doi:10.1093/hmg/ddv263
Kelder T, van Iersel MP, Hanspers K et al (2012) WikiPathways: building research communities on
biological pathways. Nucleic Acids Res 40:D1301–D1307. doi:10.1093/nar/gkr1074
Klein MS, Almstetter MF, Schlamberger G et al (2010) Nuclear magnetic resonance and mass
spectrometry-based milk metabolomics in dairy cows during early and late lactation. J Dairy
Sci 93:1539–1550. doi:10.3168/jds.2009-2563
Klein MS, Buttchereit N, Miemczyk SP et al (2012) NMR metabolomic analysis of dairy cows
reveals milk glycerophosphocholine to phosphocholine ratio as prognostic biomarker for risk
of ketosis. J Proteome Res 11:1373–1381. doi:10.1021/pr201017n
Krumsiek J, Suhre K, Illig T et al (2011) Gaussian graphical modelling reconstructs pathway
reactions from high-throughput metabolomics data. BMC Syst Biol 5:21.
doi:10.1186/1752-0509-5-21
Lu J, Boeren S, van Hooijdonk T et al (2015) Effect of the DGAT1 K232A genotype of dairy cows
on the milk metabolome and proteome. J Dairy Sci 98:3460–3469. doi:10.3168/
jds.2014-8872
Lundén A, Marklund S, Gustafsson V et al (2002) A nonsense mutation in the FMO3 gene under-
lies fishy off-flavor in cow’s milk. Genome Res 12:1885–1888. doi:10.1101/gr.240202
Melzer N, Wittenburg D, Hartwig S et al (2013a) Investigating associations between milk metabo-
lite profiles and milk traits of Holstein cows. J Dairy Sci 96:1521–1534. doi:10.3168/
jds.2012-5743
Miglior F, Sewalem A, Jamrozik J (2006) Analysis of milk urea nitrogen and lactose and their
effect on longevity in Canadian dairy cattle. J Dairy Sci 89:4886–4894. doi:10.3168/jds.
S0022-0302(06)72537-1
Miglior F, Sewalem A, Jamrozik J et al (2007) Genetic analysis of milk urea nitrogen and lactose
and their relationships with other production traits in Canadian Holstein cattle. J Dairy Sci
90:2468–2479. doi:10.3168/jds.2006-487
Milman BL (2015) General principles of identification by mass spectrometry. TrAC Trends Anal
Chem 69:24–33. doi:10.1016/j.trac.2014.12.009
Mitchell RG, Rogers GW, Dechow CD et al (2005) Milk urea nitrogen concentration: heritability
and genetic correlations with reproductive performance and disease. J Dairy Sci 88:4434–4440.
doi:10.3168/jds.S0022-0302(05)73130-1
Mo F, Zheng J, Wang P et al (2013) Quail FMO3 gene cloning, tissue expression profiling, poly-
morphism detection and association analysis with fishy taint in eggs. PLoS One 8, e81416.
doi:10.1371/journal.pone.0081416
Oliver SG, Winson MK, Kell DB et al (1998) Systematic functional analysis of the yeast genome.
Trends Biotechnol 16:373–378. doi:10.1016/S0167-7799(98)01214-1
Patti GJ, Yanes O, Siuzdak G (2012) Innovation: metabolomics: the apogee of the omics trilogy.
Nat Rev Mol Cell Biol 13:263–269. doi:10.1038/nrm3314
Poulsen NA, Rybicka I, Larsen LB et al (2015) Short communication: genetic variation of ribofla-
vin content in bovine milk. J Dairy Sci 98:3496–3501. doi:10.3168/jds.2014-8829
Purtell C, Maeda N, Ebert DL et al (1993) Nucleotide sequence encoding the carboxyl-terminal
half of apolipoprotein-B from spontaneously hypercholesterolemic pigs. J Lipid Res
34:1323–1335
Rapacz J, Hasler-Rapacz J, Taylor KM et al (1986) Lipoprotein mutations in pigs are associated
with elevated plasma cholesterol and atherosclerosis. Science 234:1573–1577. doi:10.1126/
science.3787263
Ren S, Hinzman AA, Kang EL et al (2015) Computational and statistical analysis of metabolomics
data. Metabolomics 11:1492–1513. doi:10.1007/s11306-015-0823-6
Roberts LD, Souza AL, Gerszten RE et al (2012) Targeted metabolomics. Curr Protoc Mol Biol
Chapter 30:Unit 30.2.1-24. doi:10.1002/0471142727.mb3002s98
Robinette SL, Brüschweiler R, Schroeder FC et al (2012) NMR in metabolomics and natural prod-
ucts research: two sides of the same coin. Acc Chem Res 45:288–297. doi:10.1021/ar2001606
Robinson JL, Dombrowski DB, Clark JH et al (1984) Orotate in milk and urine of dairy cows with
a partial deficiency of uridine monophosphate synthase. J Dairy Sci 67:1024–1029. doi:10.3168/
jds.S0022-0302(84)81401-0
Rohart F, Paris A, Lauren B et al (2012) Phenotypic prediction based on metabolomic data for
growing pigs from three main European breeds. J Anim Sci 90:4729–4740. doi:10.2527/
jas.2012-5338
Rueedi R, Ledda M, Nicholls AW et al (2014) Genome-wide association study of metabolic traits
reveals novel gene-metabolite-disease links. PLoS Genet 10, e1004132. doi:10.1371/journal.
pgen.1004132
Schwenger B, Schöber S, Simon D (1993) DUMPS cattle carry a point mutation in the uridine
monophosphate synthase gene. Genomics 16:241–244. doi:10.1006/geno.1993.1165
Sévin DC, Kuehne A, Zamboni N et al (2015) Biological insights through nontargeted metabolo-
mics. Curr Opin Biotechnol 34:1–8. doi:10.1016/j.copbio.2014.10.001
Soyeurt H, Gillon A, Vanderick S et al (2007) Estimation of heritability and genetic correlations
for the major fatty acids in bovine milk. J Dairy Sci 90:4435–4442. doi:10.3168/
jds.2007-0054
Stoop WM, Bovenhuis H, Van Arendonk JAM (2007) Genetic parameters for milk urea nitrogen in
relation to milk production traits. J Dairy Sci 90:1981–1986. doi:10.3168/jds.2006-434
Stoop WM, van Arendonk JA, Heck JM et al (2008) Genetic parameters for major milk fatty acids
and milk production traits of Dutch Holstein-Friesians. J Dairy Sci 91:385–394. doi:10.3168/
jds.2007-0181
Suhre K, Gieger C (2012) Genetic variation in metabolic phenotypes: study designs and applica-
tions. Nat Rev Genet 13:759–769. doi:10.1038/nrg3314
Sundekilde UK, Gustavsson F, Poulsen NA et al (2014) Association between the bovine milk
metabolome and rennet-induced coagulation properties of milk. J Dairy Sci 97:6076–6084.
doi:10.3168/jds.2014-8304
Tárraga J, Medina I, Carbonell J et al (2008) GEPAS, a web-based tool for microarray data analy-
sis and interpretation. Nucleic Acids Res 36:W308–W314. doi:10.1093/nar/gkn303
Tetens J, Heuer C, Heyer I et al (2015) Polymorphisms within the APOBR gene are highly associ-
ated with milk levels of prognostic ketosis biomarkers in dairy cows. Physiol Genomics
47:129–137. doi:10.1152/physiolgenomics.00126.2014
Tian R, Pitchford WS, Morris CA et al (2010) Genetic variation in the beta, beta-carotene-9',
10'-dioxygenase gene and association with fat colour in bovine adipose tissue and milk. Anim
Genet 41:253–259. doi:10.1111/j.1365-2052.2009.01990.x
Tian R, Cullen NG, Morris CA et al (2012) Major effect of retinal short-chain dehydrogenase
reductase (RDHE2) on bovine fat colour. Mamm Genome 23:378–386. doi:10.1007/
s00335-012-9396-0
Tiemeyer W, Stohrer M, Giesecke D (1984) Metabolites of nucleic acids in bovine milk. J Dairy
Sci 67:723–728. doi:10.3168/jds.S0022-0302(84)81361-2
Våge DI, Boman IA (2010) A nonsense mutation in the beta-carotene oxygenase 2 (BCO2) gene
is tightly associated with accumulation of carotenoids in adipose tissue in sheep (Ovis aries).
BMC Genet 11:10. doi:10.1186/1471-2156-11-10
Van der Drift SGA, van Hulzen KJE, Teweldemedhn TG et al (2012) Genetic and nongenetic
variation in plasma and milk β-hydroxybutyrate and milk acetone concentrations of early-
lactation dairy cows. J Dairy Sci 95:6781–6787. doi:10.3168/jds.2012-5640
Weikard R, Altmaier E, Suhre K et al (2010) Metabolomic profiles indicate distinct physiological
pathways affected by two loci with major divergent effect on Bos taurus growth and lipid depo-
sition. Physiol Genomics 42A:79–88. doi:10.1152/physiolgenomics.00120.2010
Welper RD, Freeman AE (1992) Genetic parameters for yield traits of Holsteins, including lactose
and somatic cell score. J Dairy Sci 75:1342–1348. doi:10.3168/jds.S0022-0302(84)81361-2
Widmann P, Reverter A, Fortes MR et al (2013) A systems biology approach using metabolomic
data reveals genes and pathways interacting to modulate divergent growth in cattle. BMC
Genomics 14:798. doi:10.1186/1471-2164-14-798
Widmann P, Reverter A, Weikard R et al (2015) Systems biology analysis merging phenotype,
metabolomic and genomic data identifies Non-SMC Condensin I Complex, Subunit G
(NCAPG) and cellular maintenance processes as major contributors to genetic variability in
bovine feed efficiency. PLoS One 10, e0124574. doi:10.1371/journal.pone.0124574
Wittenburg D, Melzer N, Willmitzer L et al (2013) Milk metabolites and their genetic variability.
J Dairy Sci 96:2557–2569. doi:10.3168/jds.2012-5635
Wolfender JL, Marti G, Thomas A et al (2015) Current approaches and challenges for the metabo-
lite profiling of complex natural extracts. J Chromatogr A 1382:136–164. doi:10.1016/j.
chroma.2014.10.091
Xia J, Wishart DS (2010) MSEA: a web-based tool to identify biologically meaningful patterns in
quantitative metabolomic data. Nucleic Acids Res 38:W71–W77. doi:10.1093/nar/gkq329
Xia J, Sinelnikov IV, Han B et al (2015) MetaboAnalyst 3.0 – making metabolomics more mean-
ingful. Nucleic Acids Res 43:W251–W257. doi:10.1093/nar/gkv380
Yamamoto T, Bishop RW, Brown MS (1986) Deletion in cysteine-rich region of LDL receptor
impedes transport to cell surface in WHHL rabbit. Science 232:1230–1237. doi:10.1126/
science.3010466

RNA Sequencing Applied to Livestock Production

Sara de las Heras-Saldana, Hawlader A. Al-Mamun,
Mohammad H. Ferdosi, Majid Khansefid, and Cedric Gondro

Abstract
High-throughput sequencing technology is rapidly replacing expression arrays and
becoming the standard method for global expression profiling studies. The develop-
ment of low-cost, rapid sequencing technologies has enabled detailed quantification
of gene expression levels, affecting almost every field in the life sciences. In this
chapter, we give an overview of the key points for gene expression analysis using
RNA-seq data. First, we discuss the workflows of RNA-seq data analysis, followed by
a discussion about the currently available tools for data analysis and a comparison
between these tools. The chapter concludes with a discussion about the application
of RNA-seq data analysis in livestock. In the appendix, using an example from
livestock RNA-seq data, we show a simple script for RNA-seq data analysis.

S. de las Heras-Saldana • H.A. Al-Mamun • C. Gondro (*)
The Centre for Genetic Analysis and Applications, University of New England,
Armidale, NSW 2351, Australia
e-mail: cgondro2@une.edu.au
M.H. Ferdosi
Animal Genetics and Breeding Unit, University of New England,
Armidale, NSW 2351, Australia
M. Khansefid
Faculty of Veterinary and Agricultural Sciences, University of Melbourne,
Melbourne, VIC 3052, Australia

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_4

1 Introduction

There are two main platforms broadly used for expression profiling: microarrays
and direct sequencing of transcripts. Until a few years ago, microarrays were the
dominant platform, but they are rapidly being superseded by next-generation
sequencing methods. Both platforms aim to study gene expression of different
organisms, tissues, or conditions, but there are some fundamental differences
between them (Table 1). Microarrays have the advantage of being a more mature
technology with well-established analytical methods. On the other hand, the analysis
of RNA-seq data is still an active area of research with no well-defined best
practice approach. Microarray data are notoriously noisy and, somewhat surprisingly,
RNA-seq data are even noisier. Both demand extensive preprocessing of
the raw data to remove unwanted variation and spurious noise.
Microarrays consist of a substrate (array) onto which thousands of probes
(oligonucleotides, cDNA, PCR products) are attached. RNA in the target sample can
then hybridize to the complementary probes on the array. To measure
gene expression, the RNA in the samples is tagged with a fluorescent dye (commonly
Cy3 and Cy5), and expression is quantified based on the levels of fluorescence
produced in the hybridization.
Table 1 Comparison between microarrays and RNA-seq

Features, microarrays:
  Uses the fluorescence intensity produced after the hybridization process to
  determine the expression level of genes or transcripts.
  Gene expression is represented as a continuous variable of the hybridization
  signal intensities.
  A normal distribution of expression is assumed.
  Simultaneous profiling of thousands of genes.

Features, RNA-seq:
  Gene expression is determined by the number of reads mapped to a reference
  genome or transcriptome.
  Expression levels are represented as discrete counts.
  Poisson, binomial, or multinomial distributions are assumed.
  Large dynamic range.
  High sensitivity.
  Able to determine the expression of isoforms.
  Used to discover unannotated genes and isoforms.
  The amount of data generated is large.
  Less bias, lower frequency of false-positive signals, and higher reproducibility.

Cons, microarrays:
  Requires prior information of genes or transcripts to design the arrays.
  Low dynamic range.
  Unable to detect novel alternative splicing, genes, or transcripts not on the array.
  Problems with background and cross hybridization.
  Difficult to design high-specificity probes.
  Expression levels are relative.

Cons, RNA-seq:
  Analysis needs to deal with large amounts of data.
  Differences in sequencing technologies.
  Lack of bioinformatics tools for analyzing differential expression at the
  isoform level.
  Introduction of biases during library construction.
  Overdispersion problems.
  Mapping uncertainties.
  Extensive number and options for preprocessing steps lead to cumulative
  differences in expression counts.

RNA-seq uses next-generation high-throughput sequencing platforms to
sequence RNA transcripts, and expression is based on the number of reads that map
to a reference genome or transcriptome. The main advantage of RNA-seq in
comparison to microarrays is that it is not limited to the probes on the array; this
allows the discovery of new genes, isoforms, transcripts (Huang and Khatib 2010),
and small noncoding RNA (Korpelainen et al. 2014). A summary of the main steps
in an RNA-seq workflow is shown in Fig. 1. There are various sequencing platforms,
but Illumina is the most widely adopted one for RNA-seq because it displays
a low substitution error rate per read, and there are almost no insertion or deletion
(indel) errors (Ramsköld et al. 2012). Sequencing is based on the detection of
fluorescent signals produced by the addition of a single nucleotide during the synthesis
of DNA (Korpelainen et al. 2014).
In broad terms, an RNA-seq experiment involves making a library prior to
sequencing. The basic steps are RNA extraction, enrichment for size/type
of RNA, and fragmentation, followed by synthesis of complementary DNA (cDNA,
which is also what is used for microarray hybridization) using random hexamer
primers. Platform-specific proprietary adaptor sequences are then ligated to the
ends of the cDNA fragments; finally, an amplification
round completes the library preparation step. Barcodes (short sequences of 5–7 base
pairs that are used to tag a library) are sometimes also added to allow multiplexing
of samples (in this way, multiple samples can be sequenced in the same reaction). After a
quality control step, the cDNA library is ready to be sequenced, and it is placed in
the lanes of a flow cell for an amplification step to produce clusters of
double-stranded DNA (dsDNA). During this amplification, the addition of nucleotides is
detected, recorded, and converted into base calls. The number of cycles used in the
amplification determines the length of the reads, whereas the number of clusters defines
the number of reads (depth of sequencing). At the end of the amplification, the raw
data (short reads) are usually exported as FASTQ files.
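As a rough illustration of the raw data described above, the following minimal Python sketch reads FASTQ records and reports the number of reads and their lengths. It assumes the common four-lines-per-record layout (header, sequence, separator, quality string) without line wrapping; the function name `read_fastq` and the two example reads are made up for illustration and do not come from any dataset in this chapter.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) tuples from a FASTQ stream.

    Assumes exactly four lines per record (no wrapped sequence lines).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


# Two invented reads standing in for a real FASTQ file.
example = StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGCA\n+\nIIIII\n")
records = list(read_fastq(example))

number_of_reads = len(records)                      # relates to sequencing depth
read_lengths = [len(seq) for _, seq, _ in records]  # relates to number of cycles
print(number_of_reads, read_lengths)
```

With a real file, `open("sample.fastq")` (a hypothetical path) would replace the `StringIO` object.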
Because of the possible introduction of biases during sequencing, it is important
to take into consideration the research question and experimental design, which will
determine the appropriate number of biological or technical replicates, the depth of
sequencing, and the use of single-end (SE) or paired-end (PE) reads. PE reads are
the preferred option if the objective of the study is to accurately calculate the abun-
dance of alternative splicing (AS) events within single genes. However, if the study
aims to accurately estimate gene abundance, it is better to sequence large numbers
of short SE reads (Li and Dewey 2011).
Fig. 1 RNA-seq analysis flowchart. The flowchart runs from the experimental question
through RNA isolation, library preparation, sequencing, preprocessing, and data
analysis (count-based or transcript/isoform-based detection of differential
expression) to functional analysis and applications.

A proper experimental design will allow us to recognize if the variation in the
RNA-seq data is due to biological or technical factors (Huang and Khatib 2010).
The number of biological replicates (samples, tissues or cell lines) is determined by
the aim of the experiment and the statistical power required, which in turn depends
on the variability between biological replicates (Zhang et al. 2014). Therefore, a
proper number of biological replicates is essential to determine if the differences in
gene expression are consistent and to evaluate the variance in expression of genes
(Ramsköld et al. 2012; Trapnell et al. 2012). The combination of more biological
replicates and an increased number of reads (depth) results in an increase in
statistical power to detect differentially expressed genes (Trapnell et al. 2013; Wang and
Cairns 2013), and it also improves reproducibility (Liu et al. 2014). Anders et al.
(2012) suggested the use of at least three or four biological replicates per group to
do comparisons. However, Wesolowski et al. (2013) recommended selecting
the number of replicates based on a self-consistency test such as cross-validation.
Strategies such as randomization and blocking designs can be used to separate
biological and systematic variability effects, and multiplexing designs
can help to eliminate lane sequencing variation (McIntyre et al. 2011).
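The idea of combining randomization, blocking, and multiplexing can be sketched in a few lines of Python. This is only one possible layout, assuming that each lane acts as a block and that barcoded samples from every group can share a lane; the function `assign_to_lanes`, the group labels, and the sample names are hypothetical.

```python
import random
from collections import Counter


def assign_to_lanes(samples_by_group, n_lanes, seed=0):
    """Spread each group's samples evenly over lanes, in random order.

    Each lane is treated as a block, so lane effects are not confounded
    with the treatment groups.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    lanes = [[] for _ in range(n_lanes)]
    for group in sorted(samples_by_group):
        shuffled = list(samples_by_group[group])
        rng.shuffle(shuffled)  # randomization within the group
        for position, sample in enumerate(shuffled):
            lanes[position % n_lanes].append((group, sample))
    return lanes


# Hypothetical experiment: four biological replicates per group, two lanes.
samples = {
    "control": ["c1", "c2", "c3", "c4"],
    "treated": ["t1", "t2", "t3", "t4"],
}
lanes = assign_to_lanes(samples, n_lanes=2)
group_counts = [Counter(group for group, _ in lane) for lane in lanes]
for lane_number, lane in enumerate(lanes, start=1):
    print(lane_number, lane)
```

Because every lane receives the same number of samples from each group, a difference between lanes cannot be mistaken for a difference between groups.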
Sequencing depth (SD) refers to the number of short reads in a sample. To determine
adequate depth for a project it is important to consider the analysis strategy,
for example, whether the study is focused on differential expression (DE), SNP
detection, splice junction detection, alternative splicing, or transcriptome
reconstruction. Also, the library construction needs to be considered because this step
introduces large variation, even in the same biological specimen; however, accuracy
can be improved by increasing the sequencing depth (Cai et al. 2012). If the study
is focused on detection of differentially expressed genes, a balanced SD is required
between conditions (Tarazona et al. 2011). On the other hand, if the study is centered
on quantifying exons and splice junctions, then a sufficient number of reads is
required (Sims et al. 2014). In the case of more complex transcriptomes, more SD
is required for adequate coverage; a reduction of SD results in a decrease
in the number of detectable genes (Wang et al. 2011), whereas increased depth
improves gene detection and count accuracy for genes with low expression (Tarazona
et al. 2011; Liu et al. 2014). However, highly expressed genes show little benefit for
detection of differential expression when the sequencing depth is increased (Wang
et al. 2011; Rapaport et al. 2013). Moreover, in synthetic data, it was observed that
doubling the depth improved the sensitivity in the identification of junctions but
reduced the specificity, due to an increase in the number of reads with high error
rates that end up incorrectly aligned to the genome (Wang et al. 2010). The additional
noise due to increased depth also makes it more difficult to accurately estimate
differential expression (Tarazona et al. 2011). Recently, a computational tool
called Scotty was developed that can be used to determine the number of reads
required to measure any given number of genes or transcripts as well as to estimate
the variance between replicates (Busby et al. 2013).
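The relationship between depth and the number of detectable genes can be explored with a simple down-sampling sketch. The Python code below is not the method used by Scotty or by any of the cited studies; it assumes that reducing depth can be mimicked by keeping each read independently with a fixed probability (binomial thinning), and the gene names, counts, and detection threshold are invented.

```python
import random


def downsample_counts(counts, fraction, rng):
    """Binomial thinning: keep each read independently with probability `fraction`."""
    return {
        gene: sum(1 for _ in range(n) if rng.random() < fraction)
        for gene, n in counts.items()
    }


def genes_detected(counts, min_reads):
    """Number of genes with at least `min_reads` reads."""
    return sum(1 for n in counts.values() if n >= min_reads)


# Invented read counts per gene at full depth (illustrative only).
full_counts = {"geneA": 5000, "geneB": 300, "geneC": 12, "geneD": 2}
MIN_READS = 5  # arbitrary detection threshold chosen for this sketch

rng = random.Random(1)  # fixed seed so the sketch is repeatable
detected_full = genes_detected(full_counts, MIN_READS)
detected_by_fraction = {
    fraction: genes_detected(downsample_counts(full_counts, fraction, rng), MIN_READS)
    for fraction in (0.5, 0.1)
}
print(detected_full, detected_by_fraction)
```

Highly expressed genes remain detectable at a small fraction of the depth, whereas lowly expressed genes drop below the threshold first, which mirrors the pattern described in the text.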
Sequencing depth varies widely between studies depending on the experimental
design and, probably to a large extent, on the available budget. For example, a study
in chicken needed a depth of 30 million (M) reads to achieve reliable measurement
of mRNA expression across all genes (Wang et al. 2011). Approximately
60 M fragments were used to determine the degree of transcriptomic variation
between in vivo and in vitro bovine embryos (Driver et al. 2012). Thirty-two million
short reads were used to study the bovine mammary gland (Cánovas et al. 2014).
For evaluations of bovine subcutaneous, intramuscular, and omental fat, the
sequencing depth used was 38, 36, and 35 M reads, respectively (Lee et al. 2013).
When the detection of splice junctions is the goal, 23 million paired reads were
sufficient for human brain tissue, but deeper sequencing will be necessary for rare
transcripts (Au et al. 2010).
During library construction, multiple biases can be introduced. One example is
GC content bias, which can arise during PCR in the sequencing step. Fragments
with high GC and AT content are undersequenced because these regions remain
annealed during amplification (Sims et al. 2014). Genes with high and low GC content
are underrepresented, and this affects their expression counts (Risso et al. 2011;
Hansen et al. 2012; Korpelainen et al. 2014), which in turn causes gaps in the
transcript assembly (Martin and Wang 2011). To reduce GC bias, it is necessary to
carefully optimize the PCR methodology (Sims et al. 2014) and utilize an appropriate
normalization method (Risso et al. 2011). Also, during the synthesis of dsDNA, the
use of random hexamer primers introduces bias at the 5′-end. This bias influences the
uniformity of the location of reads along the transcript (Hansen et al. 2010) and
induces a preferential selection of some regions over others (Filloux et al. 2014).
Even though random hexamers have this bias, they are still preferable to oligo(dT)
primers, which are highly biased toward the 3′-end of the transcript (Hansen et al. 2010). The
use of adapters during the library construction can also introduce bias. Some of this
can be removed from the data during the analysis by performing adequate quality
control and the preprocessing steps discussed in the next section (Fig. 1). However,
to adjust for other biases, the sequence data have to be normalized.
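A first check for GC-related bias is simply to look at the GC content of reads or transcripts. The short Python sketch below computes the GC fraction of a sequence; the function name `gc_fraction` and the example sequences are illustrative, and ambiguous bases such as N are ignored by assumption.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0  # no called bases, so report zero rather than divide by zero
    return sum(base in "GC" for base in called) / len(called)


# Invented sequences spanning low, medium, and high GC content.
examples = {"at_rich": "ATATTTAA", "balanced": "ACGTACGT", "gc_rich": "GGCCGCGC"}
gc_by_sequence = {name: gc_fraction(seq) for name, seq in examples.items()}
print(gc_by_sequence)
```

Plotting read counts against such GC fractions per gene is one simple way to see whether GC-poor or GC-rich genes are underrepresented in a sample.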
Normalization is an important step that aims to adjust for the technical variation
introduced during library construction or between sequencing runs (Garber et al.
2011). The goal is to adjust read counts so that they are comparable between genes
and samples. This is important because the transcript length can cause bias, as long
transcripts produce more reads than short transcripts. Also, the depth can be different
between samples or treatments. A plethora of methods were developed for microarray
normalization, and recently some of them have been adapted for RNA-seq. Common
normalization procedures are reads per kilobase of transcript per million mapped
reads (RPKM), quantile normalization, GC normalization, and Bayesian methods.
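RPKM follows directly from its name: for a gene with C mapped reads and length L (in bases), in a sample with N mapped reads in total, RPKM = (C × 10^9) / (N × L). The Python sketch below applies this formula; the function name `rpkm` and the read counts are made up for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Invented example: two genes in a sample with 10 million mapped reads.
TOTAL_MAPPED = 10_000_000
long_gene = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=TOTAL_MAPPED)
short_gene = rpkm(read_count=250, gene_length_bp=1000, total_mapped_reads=TOTAL_MAPPED)
print(long_gene, short_gene)
```

The long gene has twice as many reads as the short one, but after correcting for length both receive the same value, which is the length bias the normalization is meant to remove.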
Another challenge in the analysis of RNA-seq data is due to splicing. Splicing
removes introns and ligates exons to form mRNA; however, many transcripts in
varying abundances can be generated from a single gene by alternative splicing (AS)
of different combinations of exons (Aschoff et al. 2013). AS is carried out by the
spliceosome, which recognizes the consensus sequence of splice sites (5′ and 3′) and is
regulated by splice factors (one splice factor can act on several genes) (Ladomery
2014). There is a complex interaction between trans-acting splicing factors and
cis-acting regulatory elements that act as silencers or enhancers (Aschoff et al. 2013).
Vitting-Seerup et al. (2014) described eight types of AS, with exon skipping as the
most common, followed by alternative exon size as the second most common in
animals (Ladomery 2014). Because AS can affect gene expression and protein coding,
a better understanding of these mechanisms is important, especially because
alterations in AS have been associated with diseases (Sterne-Weiler and Sanford
2014). Our understanding of the processes and the actual information available on
AS is still rather poor and incomplete (Rasche et al. 2014). This is largely due to the
complexity of the transcriptome: different mapping sites can be compatible with
the sequence data, which makes it difficult to map reads correctly. This leads to incorrect estimates
of isoform expression (Aschoff et al. 2013). Tools developed to estimate AS (detailed
in the AS subsection below) can be classified based on two strategies: exon based
(centered on identifying differential exon usage) and isoform based (which estimates
differential expression of isoforms by comparison of biological conditions) (Shi and
Jiang 2013; Wang and Cairns 2013). The advantage of the isoform-based approach
is that it incorporates information from all the isoforms, as it is based on all reads
mapped to exons and exon junctions (Shi and Jiang 2013).

2 Steps in RNA-seq Data Analysis and the Tools Available

Depending on the objective of the study, there are at least six key steps involved in
the data analysis (Fig. 1):

1. Quality control
2. Preprocessing
3. Mapping to a reference genome or transcriptome
4. Assembly consensus of reads
5. Quantification of expression levels
6. Functional analysis

The preprocessing and alignment steps usually involve a combination of different
software, while the other steps can be done with R.

2.1 Quality Control and Preprocessing

Quality problems in RNA-seq experiments can be introduced during the library
preparation or sequencing steps. Typical problems are low-quality bases and GC
content bias. A commonly used tool for quality control is FastQC, but there
are other options such as PRINSEQ. Trimmomatic (Bolger et al. 2014) and Galaxy
(Goecks et al. 2010) are commonly used in the preprocessing step. The ShortRead
package is a good option in R.
The library construction and sequencing process can introduce sequencing
errors, which influence the mapping of the reads (Chen et al. 2011). The
preprocessing step attempts to clean the reads by removing adapters, low-complexity
reads, duplicates, primers, multiplexing identifiers, poly A/T tails, and sequence
contamination. It is also common to trim low-quality bases at the ends of the reads
because they not only affect the alignment but also increase the identification of
false-positive junctions (Wilson and Stein 2015). Once the quality control and
preprocessing are completed, the reads can be aligned to the reference genome or
transcriptome. As a rule of thumb, it is good practice to rerun quality control metrics on
the clean data to check if the filtering parameters were adequate (Gondro 2015).
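Trimming of low-quality bases can be illustrated with a small Python sketch. It assumes Phred+33 quality encoding, where the quality score of a base is the ASCII code of its quality character minus 33, and it uses a deliberately simple rule (remove bases from the 3′ end until one reaches the threshold). Dedicated tools apply more elaborate rules, so this is not a description of how any particular program works; the function names and the threshold of 20 are illustrative choices.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below `min_quality`."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Invented read: 'I' encodes Phred 40, '#' encodes 2, '!' encodes 0.
trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGT", "IIIII##!")
print(trimmed_sequence, trimmed_quality)
```

After trimming, rerunning the same quality summary on the cleaned reads shows whether the chosen threshold was adequate.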

2.2 Alignment of Reads to a Reference Genome or Transcriptome

In this step, the reads are aligned against a reference genome or transcriptome, if
one is available. There are several choices of alignment software, e.g., Bowtie (Langmead
et al. 2009), Genomic Short-Read Nucleotide Alignment Program (GSNAP)
(Wu and Nacu 2010), SOAP2 (Li et al. 2009), SeqMap (Jiang and Wong 2008),
Burrows–Wheeler Alignment (BWA) (Li and Durbin 2009), and Spliced Transcripts
Alignment to a Reference (STAR) (Dobin et al. 2013) (Table 2). In general, the
alignment tools can be classified into unspliced and spliced aligners. The unspliced
aligners map the reads to the transcriptome in a contiguous alignment and use either
the seed or the Burrows–Wheeler method (e.g., Bowtie and BWA). Spliced aligners
use either an exon-first or a seed-and-extend method (Wu and Nacu 2010); reads are
mapped to the reference genome, and the alignments can span introns. TopHat
(Trapnell et al. 2009), STAR (Dobin et al. 2013), SpliceMap (Au et al. 2010),
MapSplice, and GSNAP (Wu and Nacu 2010) are examples of spliced aligners.
Because of these differences, the mapping strategy will determine which software
should be used.
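The Burrows–Wheeler transform that underlies aligners such as Bowtie and BWA can be shown on a toy string. The Python sketch below builds the transform by sorting all rotations of the text, which is fine for illustration but far too slow and memory-hungry for a genome; real aligners build compressed indexes instead. The function names and the `$` end marker are conventions assumed for this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy implementation)."""
    text = text + terminator  # unique end marker, assumed absent from the text
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("GATTACA")
restored = inverse_bwt(transformed)
print(transformed, restored)
```

The transform is reversible and tends to group identical characters together, which is what makes the indexed reference both compressible and searchable.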
An evaluation of the aligners Bowtie, BWA, MAQ, and SOAP2 showed that sen-
sitivity improved as the sequence depth increased from 1× to 20×; however, when
using higher depth, the positive predictive values of all aligners decreased (Liu et al.
2012). The highest performance among the aligners was achieved by MAQ and BWA
in this study.
In another study that evaluated the splice aligners STAR, TopHat, GSNAP, RUM,
and MapSplice, it was found that they all exhibit desirable receiver operating
characteristic (ROC) curves at high threshold detection values, but at the lowest
detection threshold, STAR had the lowest false-positive rate. Except for GSNAP, which had
lower precision and a high number of pseudo-false positives, the other aligners
performed similarly (Dobin et al. 2013). A comparison between SpliceMap and TopHat
showed that SpliceMap detects more annotated junctions than TopHat, and it also
has higher sensitivity without sacrificing specificity (Au et al. 2010). Using the
ARH-seq package, it was reported that the resulting alignments from Bowtie,
TopHat, MapSplice, and SpliceMap had low variation in the prediction of AS
(Rasche et al. 2014). Recently, RNASequel was proposed to remove false positive
junctions in order to refine splice junction detection. RNASequel had the lowest
number of incorrectly spliced alignments, and its realignment had the highest preci-
sion for novel and annotated splice junctions in comparison with STAR (Wilson and
Stein 2015).
An alignment-free approach is Sailfish, which was developed to quantify tran-
script abundance using counts of k-mers. A comparison of Sailfish with RSEM,
eXpress, and Cufflinks showed that it does not sacrifice accuracy and is a robust
approach to evaluate real and synthetic data (Patro et al. 2014).
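The building block of alignment-free approaches, counting k-mers, can be sketched as follows. This Python example only counts k-mers and tallies how many read k-mers are shared with each transcript; it is not the Sailfish algorithm, which involves additional estimation steps not shown here. The transcripts, reads, and the choice of k = 4 are invented for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Invented toy transcripts and reads.
transcripts = {"tx1": "ACGTACGTAC", "tx2": "TTTTACGTTT"}
reads = ["ACGTAC", "CGTACG", "TTTTAC"]
K = 4

# The set of k-mers present in each transcript acts as a simple index.
transcript_kmers = {name: set(count_kmers(seq, K)) for name, seq in transcripts.items()}

# Pool the k-mer counts over all reads.
read_kmers = Counter()
for read in reads:
    read_kmers.update(count_kmers(read, K))

# Number of read k-mers (with multiplicity) that occur in each transcript.
hits = {
    name: sum(n for kmer, n in read_kmers.items() if kmer in kmers)
    for name, kmers in transcript_kmers.items()
}
print(hits)
```

K-mers shared between transcripts (here ACGT and TACG) are counted for both, which is the ambiguity that a full quantification method has to resolve.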
Unlike the methods mentioned above, HISAT bases its alignment on
hierarchical indexing to handle the mapping of short and intermediate reads. Different
versions of HISAT were compared with STAR, GSNAP, OLego, and TopHat: the
two-pass mode, HISATx2, discovered more alignments, but the HISATx1 and
HISAT versions were faster, followed by STAR, HISATx2, STARx2, GSNAP, TopHat2,
and OLego. The two-pass approach had better sensitivity than the one-pass approach (Kim
et al. 2015).

Table 2 Aligner tools

| Type | Software | Package | Utility | Strategy | Comments |
|---|---|---|---|---|---|
| Aligners | Bowtie (Langmead et al. 2009) | http://bowtie-bio.sourceforge.net/index.shtml | Align reads to a genome of reference | Burrows–Wheeler Transform (BWT) | Indexes the reference genome using a technique borrowed from data-compression. |
| Aligners | BWA (Li and Durbin 2009) | http://maq.sourceforge.net | Align short sequence reads to a reference | Burrows–Wheeler Transform (BWT) | Burrows–Wheeler Alignment tool. Allows mismatches and gaps. Generates mapping quality and gives multiple hits if required. |
| Splice aligners | MapSplice (Wang et al. 2010) | http://www.netlab.uky.edu/p/bioinfo/MapSplice | Detect exon splice junctions | Bayesian regression | Uses approximate sequence similarity. It can use Bowtie, BWA, SOAP, BFAST, or MAQ to align the exonic fragments. |
| Splice aligners | TopHat (Trapnell et al. 2009) | Open-source. http://ccb.jhu.edu/software/tophat/index.shtml. C++ and Python. Runs on Linux and Mac OS X | Discovery of splice junctions | | Does not need previous knowledge of splice sites. Uses seed-and-extend strategy to find reads that span junctions. |
| Splice aligners | STAR (Dobin et al. 2013) | Open-source. http://code.google.com/p/rna-star/. Implemented in C++ | Detect splice junctions, find mismatches and indels | Maximal Mappable Prefix (MMP) | Spliced Transcripts Alignment to a Reference (STAR). Detection of novel splice junctions but also has the option to use annotation databases. |
| Splice aligners | RNASequel (Wilson and Stein 2015) | Implemented in C++. https://github.com/GWW/RNASequel | Postprocessing RNA-seq data to improve the accuracy of the alignment software | | Uses BWA for mapping reads to the reference genome but it can use any read mapper. |
| Splice aligners | Alt event finder (Zhou et al. 2012) | Open source software. http://compbio.iupui.edu/group/6/pages/alteventfinder | Generates de novo annotation for alternative splicing events | | Uses BFAST as the primary aligner of short reads. Uses Cufflinks to reconstruct the transcript isoforms. Its power highly depends on the sequencing depth. |
| Splice aligners | HISAT (Kim et al. 2015) | Open source software. http://www.ccb.jhu.edu/software/hisat/ | Splice aligner | Hierarchical indexing (Burrows–Wheeler and FM index) | Uses a global FM index (to represent the genome) and small FM indexes for regions |

After the alignment step, other quality metrics should be evaluated. These include
coverage uniformity along the transcript (i.e., the abundance of poly-A at 3′), the
saturation of sequencing depth, and the distribution of reads among exons, introns, and
intergenic regions. A short reference for postmapping quality control tools is found in
Mazzoni et al. (2015). In this step, it is useful to filter out transcripts with low
expression (<10 reads) (Rasche et al. 2014) and to visualize the data in order to get a
general feeling for it. Principal component analysis (PCA) or a heatmap plot of the
correlation of expression levels is useful for identifying outliers.
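
As a rough illustration of the filtering and outlier check described above, the following Python sketch drops transcripts with fewer than 10 reads and flags the sample that correlates least with the others. The toy count table, the choice to apply the threshold to the total across samples, and all function names are assumptions made for illustration; they are not taken from Rasche et al. (2014) or Mazzoni et al. (2015).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def filter_low_counts(counts, min_reads=10):
    """Keep transcripts whose total count across samples is >= min_reads.

    Applying the threshold to the total across samples is an assumption.
    """
    return {t: c for t, c in counts.items() if sum(c) >= min_reads}


def mean_sample_correlation(counts):
    """Mean correlation of each sample's log2(count + 1) profile with all others."""
    n_samples = len(next(iter(counts.values())))
    logs = [[math.log2(c[s] + 1) for c in counts.values()] for s in range(n_samples)]
    means = []
    for i in range(n_samples):
        r = [pearson(logs[i], logs[j]) for j in range(n_samples) if j != i]
        means.append(sum(r) / len(r))
    return means


# Hypothetical counts: rows are transcripts, columns are four samples.
counts = {
    "tx1": [120, 130, 110, 5],
    "tx2": [30, 28, 35, 200],
    "tx3": [2, 1, 0, 3],  # total of 6 reads, removed by the filter
    "tx4": [500, 520, 480, 40],
    "tx5": [60, 55, 70, 300],
}
kept = filter_low_counts(counts)
mean_r = mean_sample_correlation(kept)
# The sample with the lowest mean correlation is the outlier candidate.
outlier = min(range(len(mean_r)), key=mean_r.__getitem__)
```

In practice, a PCA or a clustered heatmap of the same correlation matrix gives a fuller picture than a single summary value per sample.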

2.3 Assembly

During the assembly step, the previously mapped reads will be used to reconstruct
complete transcripts and identify the different isoforms that come from the same
locus. Some of the tools that are used for assembly purposes are briefly described in
Table 3. In a comparison of assembler packages, Cufflinks and rQuant were
outperformed by RSEM and IsoEM, which were shown to be comparable for paired-end
data; for single-end data, RSEM was slightly more accurate (Li and Dewey 2011).
In another comparison, Bayesembler had higher sensitivity and precision, predicted
the number of expressed splice variants better, and produced a better length
distribution of transcripts than Cufflinks, IsoLasso, CEM, and Traph, which tend
to produce shorter transcripts (Maretty et al. 2014).
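
RSEM is listed in Table 3 as using an expectation–maximization (EM) algorithm to estimate abundances. The Python sketch below shows the general idea on a toy problem: reads that are compatible with several isoforms are shared out in proportion to the current abundance estimates, and the estimates are then updated. It is a deliberately simplified illustration that ignores transcript length, read quality, and fragment-size models; the function name and the input format are hypothetical and do not reflect RSEM's actual implementation.

```python
def em_abundance(read_compat, isoforms, n_iter=100):
    """Estimate relative isoform abundances from read compatibility sets.

    read_compat: list of tuples, each holding the isoforms a read could come from.
    Returns a dict mapping isoform -> estimated fraction of reads.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    n_reads = len(read_compat)
    for _ in range(n_iter):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: assign each read fractionally to its compatible isoforms.
        for compat in read_compat:
            total = sum(theta[iso] for iso in compat)
            for iso in compat:
                expected[iso] += theta[iso] / total
        # M-step: abundances become the normalized expected read counts.
        theta = {iso: expected[iso] / n_reads for iso in isoforms}
    return theta


# Toy data: 3 reads unique to isoform A, 1 unique to B, 4 compatible with both.
reads = [("A",)] * 3 + [("B",)] * 1 + [("A", "B")] * 4
theta = em_abundance(reads, ["A", "B"])
```

With these toy reads the ambiguous reads end up split 3:1, mirroring the ratio of uniquely assigned reads.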
The use of multiple samples improves the prediction of transcripts; specifically,
with ISP (Iterative Shortest Path) and Cuffmerge, an increase of sensitivity and pre-
cision was observed (Tasnim et al. 2015). This study also reported that, in general,
ISP and Cuffmerge had similar sensitivity; however, ISP outperformed Cuffmerge in
precision.
Recently, two Bioconductor packages—Tablemaker and Ballgown—were devel-
oped to improve the link between assembly output formats and the differential
expression analysis packages in R. Ballgown takes the assembly structure and
expression estimates from Tablemaker and converts them into R objects. Ballgown also
performs linear model-based differential expression analysis and can identify more
differentially expressed transcripts than Cuffdiff2 (Frazee et al. 2015).

2.4 Alternative Splicing

There is still a long road ahead for the robust identification of AS events. The com-
plexity of AS events lies not only in splice factors and spliceosome machinery but
also in mutations that produce aberrant splicing, which can affect some genes more
than others depending on their architecture (Sterne-Weiler and Sanford 2014).
Multiple tools that attempt to quantify AS events have been proposed (Table 4). A
good overview of the available methods to study alternative splicing is given by
Alamancos et al. (2014). spliceR is a tool specifically developed to classify tran-
scripts into different isoforms and provide information about the genomic coordi-
nates of splicing regions (Vitting-Seerup et al. 2014). The use of information from
different samples was also proposed in MITIE. This software outperformed Cufflinks
predictions and also Butterfly in terms of sensitivity (Behr et al. 2013).

Table 3 Assembly tools

| Software | Package | Utility | Statistical methods | Comments |
|---|---|---|---|---|
| ISP (Iterative Shortest Path) (Tasnim et al. 2015) | Implemented in C++. Runs in Linux/Unix. http://alumni.cs.ucr.edu/~liw/isp.html | Transcriptome assembly | Linear programming and integer linear programming. | Uses TopHat to identify splice junctions. Uses a multiple sample connectivity graph (MSCG) to represent the splicing connections. |
| Bayesembler (Maretty et al. 2014) | Implemented in C++ | Transcriptome assembly | Gibbs sampling method for inference using a Bayesian method. | Uses TopHat to construct splice graphs from paired-end reads. |
| MITIE (Behr et al. 2013) | Implemented in C++. http://bioweb.me/mitie | Transcript identification and quantification | Defines likelihood function based on the negative binomial distribution. | Mixed Integer Transcript Identification. Constructs isoform structures by solving a mixed integer programming problem defined from multiple samples. Can perform predictions with and without prior annotation. |
| RSEM (Li and Dewey 2011) | Software package. Runs in Linux, Mac OS X, Cygwin. http://deweylab.biostat.wisc.edu/rsem/. Implemented in C++ and Perl | Align reads to a reference transcriptome and count abundance at gene and isoform levels. Visualization and simulation modules | Expectation–maximization algorithm (to estimate abundances). | Requires alignment to a transcriptome sequence and uses Bowtie to align reads. RSEM produces two files, one for isoform-level and the other for gene-level estimates. |

Table 4 Methods used for count data and differential expression considering alternative splicing

| Software | Package | Utility | Distribution | Statistical methods | Comments |
|---|---|---|---|---|---|
| ARH-seq (Rasche et al. 2014) | MPIMG source | Identify differential splicing of genes in case control studies | Weibull | Compares two samples with the information-theoretic concept of entropy. | Uses Bowtie to align the reads to the genome sequence. |
| Cufflinks (Trapnell et al. 2010) | Open source software | Quantify abundance of transcript isoforms | Multivariate normal distribution | t-test | Accounts for alternative splicing events. Generates FPKM. |
| DEXSeq (Anders et al. 2012) | Bioconductor-R | Quantify exons and identify differential exon usage | Negative binomial distribution | Generalized linear models | Takes biological variation into account. Avoids the assembly step and calculates p-values for every annotated exon. |
| MetaDiff (Jia et al. 2015) | JAR package, open source. https://github.com/jiach/MetaDiff | Detect differential isoform expression | Approximately normal (log transformation of FPKM) | Random-effects meta-regression model | Isoform-based approach. Uses TopHat to map to the reference genome and Cufflinks to estimate isoform-specific gene expression. |
| Exon-based strategy (Laiho and Elo 2014) | | Differential expression of isoform at gene level | Log transformed data | Uses limma and edgeR to determine the p-values. | Exon-based strategy. Uses TopHat to align to a reference genome. |
| SUPPA (Alamancos et al. 2015) | Program written in Python. https://bitbucket.org/regulatorygenomicsupf/suppa | Quantify alternative splicing events | | | Uses RefSeq annotation as reference. Estimates the Ψ values. Calculates relative inclusion values of alternative splicing events. |
| SplicingViewer (Liu et al. 2012) | Java | Detect splice junctions, annotate alternative splicing events | | | It can use MAQ, BWA, Bowtie or SOAP2 as aligner to map to the reference genome. |
| MATS (Shen et al. 2012) | http://rnaseq-mats.sourceforge.net/ | Differential alternative splicing | Multivariate uniform distribution | Bayesian statistical framework. Markov chain Monte Carlo | Calculates p-values and false discovery rate using MCMC. |
| MISO (Katz et al. 2010) | https://miso.readthedocs.io/en/fastmiso/ | Estimates differential expression of alternatively spliced exons and isoforms | | Bayesian approach. Inferences based on Monte Carlo sampling | Mixture-of-isoforms (MISO). Provides confidence intervals (CIs) for estimates of exon and isoform abundance. |
| SeqGSEA (Wang and Cairns 2013) | Bioconductor-R | Differential expression and splicing in each gene | Negative binomial distribution | Permutation tests | Exon-centroid method. Alignment with TopHat. |
| SplicingCompass (Aschoff et al. 2013) | R package | Detect differential splicing at a gene level | | Uses geometric angles between the high dimensional vector of exon read counts. | Takes into account all exons of a gene at once. Uses TopHat for a splicing junction sensitive mapping of reads to the reference genome. |
| spliceR (Vitting-Seerup et al. 2014) | Bioconductor-R | Classification of alternative splicing. Coding potential prediction | | | Designed to integrate with Cufflinks and retrieve information from SQL database. |
| BitSeq (Glaus et al. 2012) | Open source software. https://github.com/BitSeq/BitSeq | Identify transcript expression. Analysis of expression changes between conditions | Gamma distribution | Markov chain Monte Carlo algorithm for Bayesian approach | Allows multi aligned reads mapping to different genes. Models biological variance and accounts for the intrinsic technical variance in the data. |
| rSeqDiff (Shi and Jiang 2013) | R package | Detection of differential expression and differential splicing. Gene level | χ2 distribution | Hierarchical likelihood ratio test | Isoform-based approach. For experiments with two biological conditions. Generates a ranking of genes being differentially spliced to compare between two biological conditions. |
| Cuffdiff2 (Trapnell et al. 2013) | Runs in Mac/Linux | Report differentially expressed genes or transcript isoforms. | Beta negative binomial distribution | t-like statistics | Isoform-based approach. |
| Sailfish (Patro et al. 2014) | Open-source software. http://www.cs.cmu.edu/~ckingsf/software/sailfish | Transcript quantification | | Expectation–maximization (EM) algorithm to quantify relative abundance | Avoids mapping reads. Uses previously annotated RNA isoforms from RNA-seq data. |

Quantification of expression levels and differential expression analysis

ARH-seq showed the best performance, based on ROC curves, compared to
Splicing Index, PAC, Correlation, DASI, DEXSeq, Cuffdiff, MISO, and MATS
(Rasche et al. 2014). rSeqDiff is another package useful for the identification of
genes with moderate to high abundance. This package has good control of type I
error rates and good statistical power to detect differential expression and
differential splicing. Also, rSeqDiff can deal with complex isoform structures,
outperforming MATS and Cuffdiff2 (Shi and Jiang 2013). More recently, MetaDiff
incorporated covariates and confounding variables, as they seem to influence gene
expression. When MetaDiff was compared with Cuffdiff, DESeq, DESeq2, edgeR,
and EBSeq, it was observed that Cuffdiff, EBSeq, and DESeq had high false
discovery rates (FDRs) because they cannot adjust for confounders. On the other hand,
the Bartlett-corrected likelihood ratio test (BCLR), t-test, DESeq2, and edgeR have
good control of FDR (Jia et al. 2015).
The integration of differential expression and differential splicing in functional
analysis studies has been proposed because some genes are subjected to both. Wang
and Cairns (2013) developed the SeqGSEA package, which weights and ranks both
scores, leading to more powerful results and a more relevant biological meaning.
On the other hand, SplicingCompass focuses on linking splicing regulation and gene
expression regulation (Aschoff et al. 2013). The authors reported that
SplicingCompass is able to identify new differential splicing events from unknown
exons, in contrast to DEXSeq, which cannot identify them because it uses known
annotations.
With respect to the sequencing depth, the use of Alt Event Finder in combination
with Scripture (used to align reads to the reference genome) was recommended
when there is a need to work with low sequencing coverage. However, with high
sequencing depth, TopHat combined with Scripture has enough power to precisely
map boundaries of known exons and can also identify novel exons (Zhou et al.
2012). Unlike the above-mentioned tools, SUPPA was designed as an alignment-free
pipeline to calculate the relative inclusion of AS events (defined as ψ), which can
lead to faster analysis of RNA-seq data. The use of ψ values from SUPPA with Sailfish
or RSEM offers results comparable to using MISO and MATS. Nevertheless,
incomplete annotation has a significant impact on SUPPA and can result in inaccurate
quantification (Alamancos et al. 2015).
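
To make the relative inclusion value ψ more concrete, the sketch below computes it from transcript abundances as the abundance of the transcripts that include the event divided by the abundance of all transcripts that define the event. This is only a minimal illustration of that ratio, assuming abundances in TPM; the function name, the transcript identifiers, and the input format are hypothetical and are not SUPPA's actual interface.

```python
def psi(tpm, inclusion_ids, exclusion_ids):
    """Relative inclusion of an event from transcript abundances.

    tpm: dict mapping transcript ID -> abundance (e.g., TPM).
    inclusion_ids: transcripts that contain the alternative region.
    exclusion_ids: transcripts that skip it.
    Returns None when none of the transcripts is expressed.
    """
    included = sum(tpm.get(t, 0.0) for t in inclusion_ids)
    excluded = sum(tpm.get(t, 0.0) for t in exclusion_ids)
    total = included + excluded
    if total == 0:
        return None  # psi is undefined without expression
    return included / total


# Hypothetical abundances for three transcripts of one gene.
tpm = {"T1": 30.0, "T2": 10.0, "T3": 5.0}
event_psi = psi(tpm, inclusion_ids=["T1"], exclusion_ids=["T2"])
```

Because the ratio depends entirely on which annotated transcripts are assigned to each side, a missing isoform in the annotation shifts ψ directly, which is consistent with the sensitivity to incomplete annotation noted above.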

Table 5 Methods used for differential expression

| Software | Package | Utility | Distribution | Statistical methods | Comments |
|---|---|---|---|---|---|
| DESeq (Anders and Huber 2010) | Bioconductor-R | Gene level | Negative binomial distribution | Generalized linear models and exact test. Parametric method. Classical hypothesis testing. | Includes biological replicates in the analysis but also allows analysis without biological replicates. Works well in heterogeneous data sets. |
| BBSeq (Zhou et al. 2011) | R package | Gene level | Beta-binomial distribution | Likelihood ratio comparison. | Model recommended for small sample sizes. This package identifies potential outliers among the read counts. |
| baySeq (Hardcastle and Kelly 2010) | Bioconductor-R | Gene level | Negative binomial distribution | Uses log posterior probability ratio to rank the genes. Bayes methods. Parametric method. | Uses count data directly. It can perform parallel processing. Analyses experimental designs involving multiple treatment groups. |
| ShrinkBayes (van de Wiel et al. 2014) | R package (runs in Windows or Unix/Linux) | Gene level | Spike-Gauss, Spike-and-Slab (parametric). Spike-nonparametric | Generalized linear mixed models. | Count-based sequencing data. Useful for small sample sizes or complex designs. |
| edgeR (McCarthy et al. 2012) | Bioconductor-R | Gene expression | Negative binomial distribution | Generalized linear model likelihood ratio test. Bayes approach accounts for gene variability. Parametric method. | Estimates biological variations of few biological replicates. Can be applied to two or more groups (at least one of the groups has replicated measurements). Uses count data directly. |
| easyRNASeq (Delhomme et al. 2012) | Bioconductor-R | Gene, transcript, or exon levels | Depends on the package used | Depends on the package used. | Combines packages to read the sequence reads, retrieve annotations, summarize reads, quantify, and normalize. De-multiplexing possibilities. |
| NOISeq (Tarazona et al. 2011) | Bioconductor-R | Gene, transcript, or exon levels | Noise distribution | Log-ratio and the absolute value of difference. Nonparametric methods. | Allows two group comparisons. Suitable for small number of replicates and samples. Accounts for genes with low expression levels. |
| NPEBseq (Bi and Davuluri 2013) | R package (runs in Unix/Linux, Mac OS X, Windows) | Differentially expressed genes and isoforms | Estimates the prior distribution | Nonparametric Bayes approach. | Analysis across different conditions. |

To evaluate gene expression in RNA-seq data, multiple tools have been developed
as R packages and are available from the Bioconductor project, but other software
is available as well (Table 5). In general, gene expression tools can be classified
as those that include a normalization procedure when testing for differential
expression and those in which the normalization is done separately. It was reported that
RPKM showed a strong dependence between fold change and GC content, but the
conditional quantile normalization (CQN) method eliminates it and also improves
precision without affecting accuracy. CQN is a sample-gene-specific normalization
(Hansen et al. 2012). Seyednasrollah et al. (2013) compared the trimmed mean of
M-values (TMM) and the default normalization using different methods (edgeR,
DESeq, baySeq, NOISeq, SAMseq, limma, Cuffdiff2, and EBSeq) to determine
differential expression and found that the normalization method did not have a
significant impact in their study. Risso et al. (2011) proposed four within-lane GC
content normalization procedures that were shown to reduce the bias and its
dependence on GC content. In particular, the full-quantile GC content normalization
appears to effectively remove the dependence of the proportion of DE genes on GC
content.
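
For reference, RPKM (reads per kilobase of transcript per million mapped reads) scales a raw count only by feature length and library size. The short Python sketch below computes it under that standard definition; the function name and the example numbers are chosen for illustration only.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2 kb long, in a library of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=10_000_000)
```

Nothing in this calculation depends on sequence composition, which is one way to see why a GC-content effect can remain after RPKM scaling and why methods such as CQN or within-lane GC normalization add a separate correction.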
Aside from the technical bias, biological variability is another challenge in RNA-
seq data analysis because it can cause an overdispersion of the data due to heteroge-
neity within a population of cells or the intrinsic biology of the results (Fang et al.
2012). The overdispersion in read counts has been modeled with a negative bino-
mial distribution, beta-binomial, or two-stage Poisson distribution (Fang et al.
2012). The distribution applied generally depends on the software chosen for dif-
ferential expression analysis (Table 5).
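
Overdispersion can be checked directly on replicate counts. Under a Poisson model the variance equals the mean, whereas a commonly used negative binomial parameterization writes the variance as mean + dispersion × mean². The sketch below estimates that dispersion with a simple method-of-moments formula; this is an illustrative simplification, not the estimator used by any of the packages in Table 5, and the function name and example counts are hypothetical.

```python
from statistics import mean, variance


def nb_dispersion_mom(counts):
    """Method-of-moments dispersion for var = mean + dispersion * mean**2.

    Returns 0.0 when the counts are no more variable than a Poisson model implies.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    v = variance(counts)  # sample variance (n - 1 denominator)
    return max((v - m) / m ** 2, 0.0)


# Hypothetical replicate counts for two genes.
overdispersed = nb_dispersion_mom([10, 20, 30, 40])
poisson_like = nb_dispersion_mom([9, 10, 11])
```

With only a handful of replicates such per-gene estimates are very noisy, which is why count-based packages typically share information across genes when estimating dispersion.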
Besides the differences in normalization and distribution procedures, each
method is different in its design; they tend to focus on specific features of the data
and particular statistical methods. Therefore, the performance of these tools can be
different on any given data set. Also, the differentially expressed genes resulting
from the analysis depend on the software and the package version used in the analy-
sis (Glaus et al. 2012; Seyednasrollah et al. 2013; Wang and Cairns 2013). Many
methods have been developed, and their performance evaluated and compared using
data from experiments as well as simulations. Simulated data offer a simple way to
evaluate the performance of different tools, even though they do not capture all the
possible scenarios and errors that exist in a real RNA-seq data set (Wilson and Stein
2015). On the other hand, as the true abundance of transcripts cannot yet be known, a
gold standard data set from qRT-PCR or the microarray quality control
consortium (MAQC) is a good benchmark to compare the various methods.
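
When a gold standard set of truly differentially expressed genes is available, the calls of each method can be summarized with the usual confusion-matrix metrics. The Python sketch below shows one way to do this; the gene identifiers and the function name are hypothetical, and the gold standard is simply assumed to be given as a list of gene IDs.

```python
def benchmark(called, truth, universe):
    """Compare DE calls against a gold standard over a set of tested genes."""
    called, truth, universe = set(called), set(truth), set(universe)
    tp = len(called & truth)             # true positives
    fp = len(called - truth)             # false positives
    fn = len(truth - called)             # false negatives
    tn = len(universe - called - truth)  # true negatives
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "fdr": fp / (tp + fp) if tp + fp else nan,
    }


# Hypothetical example: 10 tested genes, 4 truly DE, 4 called by a method.
universe = [f"g{i}" for i in range(1, 11)]
truth = ["g1", "g2", "g3", "g4"]
called = ["g1", "g2", "g3", "g5"]
metrics = benchmark(called, truth, universe)
```

Repeating this over a range of significance thresholds yields the points of an ROC curve, the comparison used in several of the studies cited in this section.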
In one comparison, Cuffdiff had poor detection of DE genes or differential splicing
(DS) when compared to SeqGSEA (Wang and Cairns 2013). DESeq had a similar
sensitivity to detect DE genes as Cuffdiff 2, but DESeq returned more false-positive
genes. The detection of false positives was lower with edgeR than with DESeq, but
Cuffdiff 2 was the method with the highest precision at gene-level resolution
(Trapnell et al. 2013).
The sequence depth seems to have an effect on the detection of DE genes using
different methods. When more sequence data were incorporated into the analysis,
Fisher’s exact test (FET), edgeR, baySeq, and DESeq included more false positives,
but NOISeq maintained a stable and low false-positive rate. edgeR, DESeq, and
baySeq detected new significant DE genes, whereas NOISeq selected new genes but
also discarded some depending on the changes of the noise distribution when a new
gene was added (Tarazona et al. 2011). On the other hand, the number of differen-
tially expressed genes increases when the number of replicate samples increases.
This is evident with edgeR, DESeq, baySeq, limma, and EBSeq, but the rate of
increase varies across methods. However, the opposite was observed with NOISeq
and Cuffdiff, which seem to be more conservative. In terms of false positives,
DESeq, limma, and Cuffdiff showed the lowest amount, whereas edgeR and EBSeq
reported a high amount of false positives (Seyednasrollah et al. 2013). A two-stage
Poisson model (TSPM) was proposed and compared to edgeR, DESeq, and baySeq.
The latter had the best performance in terms of true positive rates. TSPM performed
poorly compared to the others, but it controlled FDR to the desirable level when
there were four replicates (Kvam et al. 2012). Another Bayesian approach was
shown to be highly sensitive, at the same level of false-positive rates as edgeR and
DESeq. This Bayesian approach performs better for smaller sample sizes; it can
identify transcripts with large treatment effects but low expression levels and
reduces bias due to longer transcripts (Chung et al. 2013). NPEBseq, proposed
by Bi and Davuluri (2013), is a nonparametric Bayesian approach designed to
detect differentially expressed genes and exons across different conditions.
NPEBseq outperformed the other methods (DESeq, edgeR, baySeq, and
NOISeq) in terms of ROC curves, sensitivity, and specificity. In this work,
the authors also argued
that any parametric assumption for the prior distribution was unrealistic because
there are a large number of genes/transcripts with low read counts and a small num-
ber of genes with a significantly high number of reads.
On the basis of the wide range of methods to evaluate differential expression, we
can say that the statistical analysis has not reached maturity yet. It is important to
consider the features of the RNA-seq data before choosing a particular method/
software; it can also be useful to run the analyses with different tools and compare
the results. Seyednasrollah et al. (2013) suggested visualizing results from different
methods and using a range of quality assessment metrics as well.
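
One simple way to compare the output of several tools, as suggested above, is to measure the overlap between their lists of differentially expressed genes and to keep the genes supported by more than one method. The sketch below is a minimal illustration; the tool names, gene IDs, and the two-tool threshold are hypothetical choices, not a recommendation from the cited studies.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard index of two gene lists (size of intersection over size of union)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consensus(calls, min_tools=2):
    """Genes reported as DE by at least min_tools of the methods."""
    votes = Counter(g for genes in calls.values() for g in set(genes))
    return {g for g, n in votes.items() if n >= min_tools}


# Hypothetical DE gene lists from three tools.
calls = {
    "toolA": ["g1", "g2", "g3"],
    "toolB": ["g2", "g3", "g4"],
    "toolC": ["g3", "g5"],
}
overlap_ab = jaccard(calls["toolA"], calls["toolB"])
shared = consensus(calls, min_tools=2)
```

A low Jaccard index between two tools is a prompt to look at how each one normalizes and models the counts rather than a verdict on which list is correct.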

2.5 Functional Analysis


Table 6 Functional analysis tools

| Database/software | Source | Utility | Comments |
|---|---|---|---|
| GO, Gene Ontology (Ashburner et al. 2000) | Web-based application. http://www.geneontology.org | Describe the role of genes and products in any organism | GO produces a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in organisms. |
| DAVID, Database for Annotation Visualization and Integrated Discovery (Huang et al. 2007) | Web-based application. https://david.ncifcrf.gov/ | Provide functional interpretation of genes | Integrates gene-annotation databases from public genomic resources (NCBI, PIR, and UniProt) into a broader secondary gene cluster (DAVID Gene Concept). |
| PANTHER, Protein Analysis Through Evolutionary Relationships (Mi et al. 2010) | Software system and web-based application. http://www.pantherdb.org | Infer the function of genes based on their evolutionary relationship | Results include GO annotations to infer homology, function of genes and the ancestral gene in which the function was inferred to have evolved. |
| GeneMANIA (Warde-Farley et al. 2010) | Web-based application. http://www.genemania.org/ | Generate hypothesis about gene function | It collects data sets from the Gene Expression Omnibus (GEO), BioGRID, Pathway Commons, and I2D. GeneMANIA uses different adaptive network weighting methods. |
| Reactome (Croft et al. 2010) | Web-based application. http://www.reactome.org/ | Visualization, interpretation and analysis of human pathways | Uses PSICQUIC to overlay pathways with molecular interaction data. Uses external databases like IntAct, BioGRID, ChEMBL, iRefIndex, MINT, and STRING. |
| KEGG, Kyoto Encyclopaedia of Genes and Genomes (Kanehisa and Goto 2000) | Web-based application. http://www.genome.jp/kegg/ | Reveal common processes and pathways among differentially expressed genes | The GENES database extracts information from GenBank. The PATHWAY database contains graphical representations of cellular processes, such as metabolism, membrane transport, etc.; KEGG LIGAND holds information about chemical compounds, enzyme molecules, and enzymatic reactions. |
| WEGO, Web Gene Ontology Annotation Tool (Ye et al. 2006) | Web-based application. http://wego.genomics.org.cn | Visualization, comparison and plotting of GO annotation results | WEGO does not have restrictions on species; it supports the comparison between several gene data sets. WEGO is used to visualize the annotation of sets of genes. |
| IPA, Ingenuity pathway analysis (Krämer et al. 2013) | http://www.ingenuity.com | Infer and score regulator networks upstream of gene-expression | Predicts the activation or inhibition of regulators based on the up-/down-regulation pattern of the expressed genes. |

The interpretation of the differentially expressed genes is a key step to
understanding the biological function and interactions between the genes. The tools used to
perform functional analysis are based on the identification of metabolic pathways,
networks, biological processes, and molecular functions from either a raw data set
or a list of DE genes. The most commonly used tools for functional analysis are
KEGG (Kanehisa and Goto 2000), DAVID (Huang et al. 2007), and GO (Ashburner
et al. 2000). However, there are other tools as well (see Table 6). DAVID is a
database that provides functional interpretation for a large number of genes; it has
useful tools such as Functional Annotation, which categorizes the annotation
according to protein–protein interactions, gene ontology (GO) terms, disease
association, gene tissue expression, etc. The list of differentially expressed genes is
used as input in DAVID to obtain the most relevant metabolic pathways in which
DE genes are involved, the GO terms, and the functional annotation clusters
(Huang et al. 2007). On the other hand, KEGG is useful for revealing common
processes and pathways among the DE genes (Kanehisa and Goto 2000). The GO
database provides the molecular function, biological processes, and cellular
components (Ashburner et al. 2000). Apart from the tools for functional analysis
mentioned earlier, other useful tools have been developed and are briefly described in
Table 6. A combination of these tools can be used to compare the results, balance
the specific characteristics of each database, and validate the results by
checking for concordance across them (Ramanan et al. 2012). Once the functional
results are obtained, domain-specific knowledge is usually needed to properly
interpret how they link back to the original experimental question.
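
Many functional analysis tools score a pathway or GO term by asking whether it contains more DE genes than expected by chance. A common statistic for this kind of over-representation analysis is the one-sided hypergeometric test, sketched below in Python. This is a generic illustration with hypothetical numbers and names; individual tools in Table 6 may use modified or different statistics.

```python
from math import comb


def enrichment_p(n_universe, n_in_term, n_de, n_de_in_term):
    """P(X >= n_de_in_term) under the hypergeometric distribution.

    n_universe: number of genes tested.
    n_in_term: genes in the universe annotated to the term or pathway.
    n_de: number of DE genes.
    n_de_in_term: DE genes annotated to the term.
    """
    upper = min(n_in_term, n_de)
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_de - k)
        for k in range(n_de_in_term, upper + 1)
    )
    return tail / comb(n_universe, n_de)


# Hypothetical example: 10 genes tested, 3 in the pathway, 3 DE, all 3 in the pathway.
p_value = enrichment_p(n_universe=10, n_in_term=3, n_de=3, n_de_in_term=3)
```

Because many terms are tested at once, the resulting p-values are normally adjusted for multiple testing before interpretation.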

3 Applications in Livestock

The primary aim of using next-generation sequencing (NGS) technology in DNA
sequencing is to discover variants (SNPs, indels, and other structural DNA changes)
in the genome and find their association with trait variation. However, NGS
RNA-seq technology allows a larger range of applications, including the ability to
measure gene expression levels (Morin et al. 2008). The abundance of RNA for each
gene can be measured by counting its corresponding number of RNA copies or
reads aligned to the reference genome. Therefore, using RNA-seq data, we can
obtain gene expression profiles for each tissue. To illustrate, one study used RNA-seq
to characterize the transcriptome of in vivo- and in vitro-derived bovine
blastocysts. It found 793 differentially expressed genes and 4800 genes expressing more
than one splice variant between in vivo and in vitro embryos (Driver et al. 2012).
The expression pattern of the genes is different across tissue types (Melé et al.
2015) but can be influenced by various other factors such as sampling time and
method. Cánovas et al. (2014) found that the integrity and type of RNA varied from
one sampling method to another, influencing the gene expression results. However,
the general patterns found in highly expressed genes (casein, BLG, LALBA, and
GLYCAM1) were similar between the sampling methods, providing sampling
alternatives that do not require a mammary gland biopsy (Cánovas et al. 2014).
It is also possible to evaluate the effect of conditions, treatments, or chemicals on
gene expression. In Angus beef steers, a transcriptome characterization of muscle
was performed after propionate was supplied in the diet (Baldwin et al. 2012). The
authors reported 153 differentially expressed genes and 5032 intron–exon junctions
differentially expressed between control and propionate-fed steers. These differences
in expression could be related to the increase in rate and absolute body weight in
steers fed with propionate.
The complexity of the transcriptome has been studied through RNA-seq
methodologies that analyze AS. Novel transcripts were reported in embryo and adult
muscle (29,994 and 24,464, respectively). Of these genes, 16,174 were
differentially expressed between these samples, and alternative splicing events were
present in 66.6 % of the genes (He and Liu 2013). Another study looked at different
marbling patterns in muscle where the differentially expressed genes were
involved in networks related to protein synthesis, lipid metabolism, molecular
transport, muscle growth, and function-related pathways. In the case of animals
with low marbling, genes associated with immune system processes were highly
expressed (Chen et al. 2015).
In Korean cattle, a characterization of different tissues (muscle, omental tissue,
subcutaneous tissue, and intramuscular adipose) was performed. Here the muscle
and adipose tissues showed different gene expression patterns. The metabolic gene
networks were different as well; the oxidative metabolism was upregulated in
muscle, whereas it was downregulated in intramuscular tissue. Conversely, pathways
involved in cell adhesion, structure, integrity, and chemokine signaling were
upregulated in intramuscular adipose tissue but downregulated in muscle (Lee
et al. 2014).
From a more practical perspective, genes that are differentially expressed in
animals that are phenotypically or genetically divergent for a trait can be identified.
The association between gene expression and the trait can be measured either by
(1) calculating the correlation between the normalized gene counts and the trait or
by (2) grouping the individuals into two groups based on their phenotypic
records or breeding values for a trait and comparing the expression level of genes in
the groups with software such as Cuffdiff2. In this way, transcriptomic patterns
from different phenotypes can be evaluated using RNA-seq. In Iberian × Landrace
crossbred pigs, high and low levels of intramuscular fat were evaluated. The
high intramuscular fat group showed upregulation of transcripts related to lipid
metabolism and genes involved in lipoprotein synthesis, cholesterol metabolism, oxidation
of lipids, palmitic fatty acids, and induction of lipogenic gene transcription, whereas
in the low intramuscular fat group, these were downregulated. On the other hand, in
the low intramuscular fat group, the expression of genes related to triacylglycerol,
uptake of lipids, and myristic acid and fatty acid biosynthesis was upregulated
(Ramayo-Caldas et al. 2012). In mammary glands, Cui et al. (2014) identified
biological processes such as protein and lipid metabolism, mammary gland
development, differentiation, and immune function. They also recognized VEGFA, TRIB3,
SAA, RPL23A, and PTHLH as potential genes that affect the milk protein
percentage and the fat percentage of dairy cattle. In a study of negative energy balance in
dairy cows, some insight was gained by evaluating gene expression. Genes
associated with the steroid hormone biosynthesis pathway were differentially expressed in
the liver of animals with severe negative energy balance: CYP11A1 was upregulated,
whereas UGT2A1, SULT1E1, and CYP7A1 were downregulated (McCabe
et al. 2012). The authors interpreted the downregulation of these genes as a reduction
in bile synthesis and inactivation of oestrogens in the liver, which helps to conserve
cholesterol in animals with severe negative energy balance.
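
The two ways of relating expression to a trait described at the start of this paragraph can be sketched in a few lines of Python: a correlation between normalized counts and the trait, and a comparison of two groups formed from the trait values. The data, the median-style split, the pseudocount of 1, and the function names are assumptions for illustration; a real analysis would use a dedicated differential expression package for the group comparison.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_by_trait(trait):
    """Indices of the lower and upper halves of animals ranked by trait value."""
    order = sorted(range(len(trait)), key=trait.__getitem__)
    half = len(order) // 2
    return order[:half], order[-half:]


def log2_fold_change(expr, low_idx, high_idx, pseudocount=1.0):
    """log2 ratio of mean expression, high-trait group over low-trait group."""
    low = sum(expr[i] for i in low_idx) / len(low_idx)
    high = sum(expr[i] for i in high_idx) / len(high_idx)
    return math.log2((high + pseudocount) / (low + pseudocount))


# Hypothetical normalized counts of one gene and a trait in six animals.
expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
trait = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

r = pearson(expr, trait)                 # approach (1): correlation
low_idx, high_idx = split_by_trait(trait)
lfc = log2_fold_change(expr, low_idx, high_idx)  # approach (2): group contrast
```

The correlation uses every animal, whereas the group contrast is usually run on the extremes of the distribution, so the two approaches can rank genes differently.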
RNA-seq analysis can also help to elucidate the differences in metabolism
related to different feed efficiencies. For residual feed intake (RFI), 25 SNPs
significantly associated with the trait were identified on chromosomes BTA 1, 2, 4, 7, 15, 18,
20, and 29. The SNPs in the GHR, LRP5, and CYP2B genes had a significant association
with RFI (Karisa et al. 2013). Animals with low RFI showed upregulation of the genes
fatty acid-binding protein 1 (FABP1), uncoupling protein 2 (UCP2), fatty acid
desaturase 2 (FADS2), and RGS2 (Tizioto et al. 2015). This study suggested that a
high ability to absorb glucose is characteristic of animals with low RFI, whereas
inefficient animals express more genes involved in catalysis and intracellular
transport of fatty acids. This variation in gene expression level across individuals can be
useful in finding the genes influencing the traits.
Some studies have investigated the differences and similarities between breeds.
A comparative analysis in muscle between Chinese Luxi and Angus beef cattle
found genes associated with growth and development of muscle (Liu et al. 2015).
They found genes that have different expression levels between breeds as well as
genes that are expressed exclusively in each breed. In a comparison of pig
breeds, Jeju native pigs showed differentially expressed genes from the collagen
family, whereas Berkshire pigs showed significantly higher expression of CD genes
and expressed most of the genes related to immune response, which was taken to
indicate a better immune system (Ghosh et al. 2015). When the
expression patterns of Pietrain and Polish Landrace pig breeds were contrasted,
Ropka‐Molik et al. (2014) found that genes related to reproduction, immune
response, development, and cellular or metabolic processes were highly expressed
in Polish Landrace, whereas genes involved in negative regulation of apoptosis,
immune response, cell–cell signaling, cell growth, and migration were
underexpressed. Variations in mRNA and miRNA expression were also observed between
sheep breeds with different fecundity rates (Dorset sheep and Small Tail Han sheep
with genotypes FecBBFecBB (Han BB) and FecB+FecB+ (Han ++)). The differential
expression of mRNA between the Dorset and Han groups suggested unique gene
profiles associated with different fecundity rates. The authors pointed out the
importance of studying miRNA levels, because they provide more specificity for
comparisons between traits than a comparison of mRNA alone (Miao and Qin 2015).
Garber et al. (2011) pointed out that by providing the sequence of expressed
transcripts, RNA-seq encodes information about allelic variation and RNA processing,
so reconstruction methods should be adapted to account for this variability and report
it. The variants and the number of copies of each allele in RNA-seq data can be found
using samtools mpileup (Li 2011) or Rsamtools pileup (Morgan et al. 2016), and the
total number of reads mapped to a gene can be measured using HTSeq (Anders et al.
2014). The detected mutations can change the primary structure of proteins or affect
the expression of the genes. Most of the variants discovered by genome-wide
association studies (GWAS) are located in noncoding regions. Therefore, mutations that
do not change the protein but regulate the expression of genes seem to be the
causal variants more often than not (Ardlie et al. 2015). For instance, Karim et al.
(2011) reported mutations in the intergenic region near the promoter of the PLAG1
gene, which influence expression of the gene and consequently affect stature in
cattle. Causal mutations in noncoding regions can be detected using RNA-seq
data if the variants near the genes are known beforehand. In addition to varying
between individuals, the expression level of genes depends on the
tissue used in the study and also varies across time. Therefore, it is important to take
the RNA-seq samples from the relevant tissue at the relevant time.
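As a minimal sketch of how per-allele read counts can be turned into candidate heterozygous sites, the following base R code works on a small hand-made table. The column names (seqnames, pos, nucleotide, count) are modelled on the layout of an Rsamtools pileup result, but the data, the helper name hetSites, and the threshold of five reads are assumptions made for illustration only.

```r
# Toy pileup-style table: one row per position and observed nucleotide.
# All values are made up for illustration.
pile <- data.frame(
  seqnames   = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  pos        = c(100, 100, 250, 250, 40),
  nucleotide = c("A", "G", "C", "T", "G"),
  count      = c(18, 15, 30, 1, 22),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the sites at which at least two different
# nucleotides are each supported by at least minReads reads.
hetSites <- function(pile, minReads = 5) {
  keep <- pile[pile$count >= minReads, ]
  site <- paste(keep$seqnames, keep$pos, sep = ":")
  nAlleles <- table(site)
  het <- names(nAlleles)[as.vector(nAlleles) >= 2]
  keep[site %in% het, ]
}

het <- hetSites(pile)
het
```

With these invented counts only the site chr1:100 is retained: the T allele at chr1:250 is supported by a single read, which could be a sequencing error, and only one nucleotide is observed at chr2:40.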
The variants that are associated with the expression level of genes in different
individuals are called expression quantitative trait loci (eQTL). A mutation can
change the expression of a gene on the same chromosome (cis eQTL) or of gene(s)
elsewhere in the genome (trans eQTL). Cis-acting eQTL can be identified using
microarrays or RNA-seq with a traditional eQTL mapping method that compares
the expression level of genes across individuals. With RNA-seq data, however,
cis eQTL can also be found by direct assessment of gene expression
with allele-specific expression (ASE). In the ASE approach, for an individual that
is heterozygous for a variant in RNA-seq, the expression level of alleles is compared
within individuals instead of comparing total mRNA abundance from both alleles
across individuals as in traditional eQTL mapping. Moreover, the environmental
factors affecting gene expression across animals, as well as the effects of trans-acting
eQTL, are eliminated in ASE assessments because the allele counts of one
individual are used at a time (Pastinen 2010). If the parental origin of the alleles at
heterozygous variants is known, imprinted genes are also detectable with the ASE
approach. However, to discriminate the effect of the allele itself from the effect of
the parent of origin of the allele on expression, the allelic imbalance should be
measured in at least two individuals with different paternally inherited alleles.
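This reasoning can be sketched with two hypothetical heterozygous animals that inherited opposite alleles from the sire. The counts and column names below are invented, and the sketch only compares the direction of the imbalance; a real analysis would also test whether each imbalance is statistically significant.

```r
# Made-up allele counts at one heterozygous SNP (alleles A and G).
# animal1 inherited A from the sire, animal2 inherited G from the sire.
ase <- data.frame(
  animal   = c("animal1", "animal2"),
  paternal = c("A", "G"),
  countA   = c(80, 22),
  countG   = c(20, 78),
  stringsAsFactors = FALSE
)

# Fraction of reads carrying allele A and fraction carrying the paternal allele
ase$fracA        <- ase$countA / (ase$countA + ase$countG)
ase$fracPaternal <- ifelse(ase$paternal == "A", ase$fracA, 1 - ase$fracA)

# Same allele favoured in both animals: the imbalance follows the allele.
# Same parental copy favoured in both animals, whichever allele it carries:
# the imbalance follows the parent of origin.
alleleEffect <- all(ase$fracA > 0.5) || all(ase$fracA < 0.5)
parentEffect <- all(ase$fracPaternal > 0.5) || all(ase$fracPaternal < 0.5)
c(alleleEffect = alleleEffect, parentEffect = parentEffect)
```

In this toy case the paternal copy is over-expressed in both animals although it carries a different allele in each, which is the pattern expected for a parent-of-origin effect rather than an effect of the allele itself.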
To find eQTL with the traditional approach, the regression of normalized
gene counts on variant genotypes can be calculated for each gene. The analysis is
thus similar to GWAS for complex traits, but it uses gene expression instead of
phenotypes and includes variants within or near the gene. For testing allelic
imbalance in expression, the null hypothesis is that the alleles are equally
expressed. Therefore, a χ2 test can be used to compare the observed allele counts
with the counts expected from the read depth (half of the reads for each allele).
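Both tests can be sketched in a few lines of base R. The genotypes, expression values, and allele counts below are simulated or invented for illustration, and the additive 0/1/2 genotype coding is an assumption of this sketch rather than a requirement of the method.

```r
set.seed(1)  # reproducible toy data

# Traditional eQTL test: regress normalized expression on genotype.
# Genotypes are coded as the number of copies of the alternative allele.
genotype   <- rep(c(0, 1, 2), each = 10)
# Simulated normalized expression with an effect of 1.5 per allele copy
expression <- 10 + 1.5 * genotype + rnorm(length(genotype), sd = 1)

fit   <- lm(expression ~ genotype)
slope <- coef(summary(fit))["genotype", "Estimate"]
pEqtl <- coef(summary(fit))["genotype", "Pr(>|t|)"]

# ASE test: are the two alleles of one heterozygous animal equally expressed?
alleleCounts <- c(ref = 70, alt = 30)   # reads carrying each allele
aseTest <- chisq.test(alleleCounts, p = c(0.5, 0.5))
aseTest$expected    # 50 reads expected per allele at a depth of 100
aseTest$statistic   # (70 - 50)^2 / 50 + (30 - 50)^2 / 50 = 16
aseTest$p.value
```

In a genome-wide analysis the regression would be repeated for every gene and its nearby variants, and the resulting p-values would need to be corrected for multiple testing.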
Although ASE has advantages over traditional eQTL mapping, there are a few
challenges in measuring allelic imbalance in gene expression. The main difficulties
are sequencing errors in RNA-seq data, the relatively low coverage of reads in
heterozygous regions, and allele counts that are biased toward the alleles
represented in the reference genome (Pastinen 2010). Sequencing errors can be
removed to a great extent by filtering out low-quality sequences and, in the longer
term, by improvements in sequencing technology that yield more accurate reads. The
low coverage problem can be addressed by increasing the overall sequencing depth or
by enriching the transcripts from particular regions of interest (Heap et al. 2010).
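The effect of read depth on the ability to detect allelic imbalance can be illustrated with the χ2 test. In the sketch below the same 60:40 allelic ratio is tested at a depth of 20 reads and of 200 reads; the counts are invented for illustration.

```r
# Same 60:40 allelic ratio observed at two sequencing depths (made-up counts)
lowDepth  <- c(ref = 12,  alt = 8)    # 20 reads over the heterozygous site
highDepth <- c(ref = 120, alt = 80)   # 200 reads over the heterozygous site

# Null hypothesis in both cases: the two alleles are equally expressed
pLow  <- chisq.test(lowDepth,  p = c(0.5, 0.5))$p.value
pHigh <- chisq.test(highDepth, p = c(0.5, 0.5))$p.value
c(depth20 = pLow, depth200 = pHigh)
```

With 20 reads the imbalance is not significant at the 5 % level, whereas with 200 reads it is, which is why low coverage at heterozygous sites limits ASE analyses.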
Mapping the reads to a single reference genome that contains the most frequent
allele at each polymorphic site can bias the results in favor of reads that carry
the reference allele, because these reads have fewer mismatches and consequently
better mappability (Satya et al. 2012). There are three main ways to deal with
reference bias: (1) mapping the reads to the parental genomes (Degner et al. 2009),
(2) excluding the heterozygous sites from the reference genome and allowing more
mismatches in mapping (Stevenson et al. 2013), and (3) enhancing the reference
genome by including the alternative alleles at known polymorphic sites
(Satya et al. 2012). If maternal and paternal reference genomes can be constructed,
the systematic bias should be eliminated. Satya et al. (2012) reported that
including alternative alleles in the reference genome, rather than excluding
heterozygous sites, reduced the reference bias and increased the mappability of the reads.
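One simple way to approximate strategy (2) is to mask known polymorphic positions in the reference sequence so that neither allele matches better than the other. The base R sketch below does this on a toy sequence; the helper name maskSites, the sequence, and the SNP positions are hypothetical, and this is not necessarily the procedure used in the studies cited above. How masked bases are scored depends on the aligner.

```r
# Hypothetical helper: replace the reference base at known polymorphic
# positions (1-based) by "N"
maskSites <- function(refSeq, positions) {
  bases <- strsplit(refSeq, "")[[1]]
  stopifnot(all(positions >= 1), all(positions <= length(bases)))
  bases[positions] <- "N"
  paste(bases, collapse = "")
}

refSeq   <- "ACGTACGTAC"   # toy reference sequence
snpSites <- c(3, 8)        # positions of known SNPs (made up)
masked   <- maskSites(refSeq, snpSites)
masked
```

For real genomes the same operation would be applied to each chromosome with the positions taken from a variant file, and the masked reference would then be indexed again before mapping.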
Identifying the mutations that underlie complex trait variation by changing the
expression of genes, together with functional studies based on differentially
expressed genes, provides important information for understanding the biological
relationships between genes, phenotypes, and the environment. Elucidating
the physiological mechanisms involved in particular traits can
help to develop more efficient production programs.
Acknowledgements This project was funded by an Australian Research Council Discovery
Project (DP130100542), the Next-Generation BioGreen 21 Program (no. PJ01134906), the Rural
Development Administration, the Republic of Korea, and the Cooperative Research Program for
Agriculture Science and Technology Development (PJ006405), RDA, Korea.

4 Appendix: A Simple Example of RNA-seq Gene Expression Data Analysis Using R and Other Software

The data and script are available in the supplementary online material for the book
on the publisher’s website (adapted from Gondro 2015).

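# The Bioconductor packages can be installed with:
# (biocLite() was the installer when this script was written; on current
# Bioconductor releases it has been replaced by BiocManager::install())
source("https://bioconductor.org/biocLite.R")
biocLite(pkgs = c("ShortRead", "GenomicFeatures", "GenomicAlignments",
                  "rtracklayer", "edgeR", "topGO", "Category", "Rsamtools",
                  "bovine.db", "Rgraphviz", "KEGG.db"))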

# STEP 0: READ THE SEQUENCE DATA INTO R


library(ShortRead)
rnaseq <- readFastq("RNASEQexample.fastq")
summary(rnaseq)
#  Length      Class   Mode
#  153557 ShortReadQ     S4

# Length: number of sequences in the FASTQ file
# Class: class of the R object (the ShortRead package overrides the standard R summary function)
# Mode: S4 is an object-oriented system in R

# print the first and last 5 sequences and their lengths
sread(rnaseq)

# STEP 1: QUALITY CONTROL in R


QC <- qa("RNASEQexample.fastq")
QC
# Shows the class of QC and its structure

# create an HTML report and save it


report(QC, dest = "QCreport")

# STEP 2: PREPROCESSING
# Note: This step is performed in the command line (cmd) and Java must be installed on the system.
# It is important to change the directory to your folder (the one that holds your fastq file and the adapters).
# To simplify the trimming you can also put the executable file (trimmomatic-0.36.jar) in this folder.
# Download Trimmomatic from: http://www.usadellab.org/cms/?page=trimmomatic
# In the end you just need to write the following command in the cmd:
java -jar trimmomatic-0.36.jar SE -phred33 RNASEQexample.fastq trimmedRNAexample.fastq ILLUMINACLIP:adapters\TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:60 HEADCROP:10 CROP:80

# or use the "system" command to run the program from inside R
# (the command must be a single string without line breaks):
system("java -jar trimmomatic-0.36.jar SE -phred33 RNASEQexample.fastq trimmedRNAexample.fastq ILLUMINACLIP:adapters\\TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:60 HEADCROP:10 CROP:80")
# -phred33 sets the base quality encoding (Phred+33)
# Please refer to the manual for more information:
# http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf

# STEP 3: MAP TO A REFERENCE GENOME OR TRANSCRIPTOME USING BOWTIE


# Note: This step is performed in the cmd (command line). Before you start, download
# the latest version of Bowtie2. It is easiest if the Bowtie2 folder is inside the folder
# where the fastq file is (or remember to include the Bowtie2 path in the PATH
# environment variable).
# In cmd you should be located in the folder that contains the file
# trimmedRNAexample.fastq; then run the following command (it assumes the bowtie2
# programs are in the bowtie2 folder and the index files are in the BtauIndex folder):
bowtie2\bowtie2-align-s -p 80 -x BtauIndex\bos_taurus -U trimmedRNAexample.fastq -S RNAalignment.sam
# or in R
system("bowtie2\\bowtie2-align-s -p 80 -x BtauIndex\\bos_taurus -U trimmedRNAexample.fastq -S RNAalignment.sam")
# -p is the number of threads used to run the program in parallel (adjust it to your machine)
# -x is the base name of the index of the reference genome
# -U is the file with the unpaired (single-end) reads
# -S is the path of the output file in which the alignments are saved (SAM format)
# Please refer to this site for more information:
# http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
# Various indexes can be downloaded from iGenomes:
# http://support.illumina.com/sequencing/sequencing_software/igenome.html
# (this example uses Bos taurus UMD3.1)
# STEP 4: ASSEMBLY
library(Rsamtools)
library(GenomicFeatures)
library(GenomicAlignments)
library(rtracklayer)

# Make a transcript database for cattle (txdb); the "genome" argument takes a
# UCSC genome abbreviation (bosTau6 here)
txdb <- makeTxDbFromUCSC(genome = "bosTau6", tablename = "ensGene")
# Group the transcripts by gene identifier
txGene <- transcriptsBy(txdb, "gene") # txGene holds the annotation and mapping information

# convert from SAM to BAM using Rsamtools
asBam("RNAalignment.sam", "RNAalignment")
# for more information on samtools: http://www.htslib.org/doc/samtools.html

# Retrieve the counts of transcripts
reads <- readGAlignments("RNAalignment.bam") # read in the BAM file

# To match the names in the BAM file with the names in txGene
newnames <- paste("chr", rname(reads), sep = "")
newnames <- gsub("MT", "M", newnames)

# To rename our reads and ignore the strands
reads <- GRanges(seqnames = newnames,
                 ranges = IRanges(start = start(reads), end = end(reads)),
                 strand = rep("*", length(reads)))
# Create a vector with the number of alignments that overlap each gene identifier
rcounts <- countOverlaps(txGene, reads)
head(rcounts)

# Plot the number of counts
plot(sort(rcounts), col = "blue", pch = 20, main = "sorted RNA-seq counts",
     xlab = "genes", ylab = "counts", cex.main = 0.9)
# Number of genes
length(rcounts)
# Number of expressed genes (at least one read counted)
length(which(rcounts > 0))
# Genes with at least 200 counts
index <- which(rcounts >= 200)
length(index)
# Proportion of all counted reads that fall in the genes with at least 200 counts
sum(rcounts[index])/sum(rcounts)
# Save the count data
write.table(rcounts, "ReadCounts.txt", quote = FALSE, sep = "\t", col.names = FALSE)

# STEP 5: QUANTIFICATION OF EXPRESSION LEVEL

library(edgeR)
# Read in the data
rcounts <- read.table("RNAdat.txt", header = TRUE, sep = "\t")
contrast <- read.table("RNAcontrast.txt", header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

DGE <- DGEList(rcounts, group = contrast$contrast) # a container for RNA-seq data
# Estimate the normalization factors and the dispersions
DGE <- calcNormFactors(DGE)
DGE <- estimateCommonDisp(DGE)
DGE <- estimateTagwiseDisp(DGE)

# Exact test (analogous to Fisher's exact test, adapted to negative binomial counts)
# to perform a contrast between the two groups
difexpEx <- exactTest(DGE, pair = c("ctrl", "treat"))
resEx <- topTags(difexpEx, n = nrow(DGE)) # extract the results from the test
outEx <- resEx$table

# Plot the mean-variance relationship and the biological coefficient of variation
par(mfrow = c(1, 2))
plotMeanVar(DGE, show.tagwise.vars = TRUE, NBline = TRUE)
plotBCV(DGE)

# To see the significantly differentially expressed features (FDR < 0.05)
outEx[outEx$FDR < 0.05, c(1, 2, 4)]

# Use the generalized linear model (GLM)


design <- model.matrix(~contrast, contrast)
# Estimate the design matrix dispersions
DGE2 <- estimateGLMTrendedDisp(DGE, design)
DGE2 <- estimateGLMTagwiseDisp(DGE2, design)
# Fit model
fit <- glmFit(DGE2, design)
# Test contrast (coef) from design matrix
difexpGLM <- glmLRT(fit, coef = 2)


# Same as with "exactTest"
res <- topTags(difexpGLM, n = nrow(DGE))
outGLM <- res$table
outGLM[outGLM$FDR < 0.05, ]

# STEP 6: FUNCTIONAL ANALYSIS


# example with microarray probes
library(topGO)
library(bovine.db)
probes <- names(unlist(as.list(bovineENTREZID)))
deprobes <- read.table("DEprobes.txt", header = TRUE, sep = "\t")
# 1 = differentially expressed probe, 0 = not differentially expressed
genelist <- factor(as.integer(probes %in% deprobes$AFFY))
names(genelist) <- probes
GOdata <- new("topGOdata", ontology = "MF", allGenes = genelist,
              annot = annFUN.db, affyLib = "bovine.db")

test <- new("classicCount", testStatistic = GOFisherTest, name = "Fisher test")
resultFis <- getSigGroups(GOdata, test)

# Retrieve the most significant GO terms for molecular function
GenTable(GOdata, Fisher = resultFis, topNodes = 10)
# Plot the most significant nodes
showSigOfNodes(GOdata, score(resultFis), firstSigNodes = 10)

# Pathway analysis
library(Category)
keggDE <- unique(toTable(bovineENTREZID[as.vector(deprobes$AFFY)])$gene_id)
keggU <- unique(toTable(bovineENTREZID[as.vector(probes)])$gene_id)
params <- new("KEGGHyperGParams", geneIds = keggDE, universeGeneIds = keggU,
              annotation = "bovine", pvalueCutoff = 0.01, testDirection = "over")
keggtest <- hyperGTest(params)
summary(keggtest, pvalue = 0.01)
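
The over-representation test used in the pathway analysis is based on the hypergeometric distribution. As a rough sketch of the underlying calculation for a single pathway, the base R code below uses invented gene numbers; it illustrates the principle and is not a reproduction of the internals of the Category package.

```r
# Made-up numbers for one pathway
universeSize <- 1000   # genes tested (the "universe")
pathwaySize  <- 50     # universe genes annotated to the pathway
deSize       <- 100    # differentially expressed (DE) genes
overlap      <- 12     # DE genes that belong to the pathway

# P(X >= overlap) when deSize genes are drawn at random from the universe
pOver <- phyper(overlap - 1, pathwaySize, universeSize - pathwaySize, deSize,
                lower.tail = FALSE)

# The same one-sided test written as Fisher's exact test on the 2 x 2 table
tab <- matrix(c(overlap, pathwaySize - overlap,
                deSize - overlap,
                universeSize - pathwaySize - deSize + overlap),
              nrow = 2)
pFisher <- fisher.test(tab, alternative = "greater")$p.value
c(hypergeometric = pOver, fisher = pFisher)
```

Five pathway genes would be expected among the 100 DE genes by chance (100 × 50/1000), so an overlap of 12 points to over-representation. The two p-values are identical because the one-sided Fisher's exact test is the hypergeometric test.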
References
Alamancos GP, Agirre E, Eyras E (2014) Methods to study splicing from high-throughput RNA
Sequencing data. In: Spliceosomal pre-mRNA splicing: methods and protocols. Humana Press,
New York, pp 357–397
Alamancos GP, Pagès A, Trincado JL, Bellora N, Eyras E (2015) Leveraging transcript quantifica-
tion for fast computation of alternative splicing profiles. RNA 21:1521–1531
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol
11:R106
Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data.
Genome Res 22:2008–2017
Anders S, Pyl PT, Huber W (2014) HTSeq–A Python framework to work with high-throughput
sequencing data. Bioinformatics btu638
Ardlie KG, Deluca DS, Segrè AV, Sullivan TJ, Young TR, Gelfand ET, Trowbridge CA, Maller JB,
Tukiainen T, Lek M (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitis-
sue gene regulation in humans. Science 348:648–660
Aschoff M, Hotz-Wagenblatt A, Glatting KH, Fischer M, Eils R, König R (2013) SplicingCompass:
differential splicing detection using RNA-Seq data. Bioinformatics btt101
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K,
Dwight SS, Eppig JT (2000) Gene Ontology: tool for the unification of biology. Nat Genet
25:25–29
Au KF, Jiang H, Lin L, Xing Y, Wong WH (2010) Detection of splice junctions from paired-end
RNA-seq data by SpliceMap. Nucleic Acids Res 38:4570–4578
Baldwin RL, Li RW, Li CJ, Thomson JM, Bequette BJ (2012) Characterization of the longissimus
lumborum transcriptome response to adding propionate to the diet of growing Angus beef
steers. Physiol Genomics 44(10):543–550
Behr J, Kahles A, Zhong Y, Sreedharan VT, Drewe P, Rätsch G (2013) MITIE: simultaneous RNA-
Seq-based transcript identification and quantification in multiple samples. Bioinformatics
29:2529–2538
Bi Y, Davuluri RV (2013) NPEBseq: nonparametric empirical bayesian-based procedure for dif-
ferential expression analysis of RNA-seq data. BMC Bioinformatics 14:262
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics btu170
Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT (2013) Scotty: a web tool for designing
RNA-Seq experiments to measure differential gene expression. Bioinformatics 29:656–657
Cai G, Li H, Lu Y, Huang X, Lee J, Müller P, Ji Y, Liang S (2012) Accuracy of RNA-Seq and its
dependence on sequencing depth. BMC Bioinformatics 13:S5
Cánovas A, Rincón G, Bevilacqua C, Islas-Trejo A, Brenaut P, Hovey RC, Boutinaud M,
Morgenthaler C, VanKlompenberg MK, Martin P (2014) Comparison of five different RNA
sources to examine the lactating bovine mammary gland transcriptome using RNA-Sequencing.
Sci Rep 4
Chen G, Wang C, Shi T (2011) Overview of available methods for diverse RNA-Seq data analyses.
Sci China Life Sci 54:1121–1128
Chen D, Li W, Du M, Wu M, Cao B (2015) Sequencing and characterization of divergent marbling
levels in the beef cattle (Longissimus dorsi muscle) transcriptome. Asian-Australas J Anim Sci
28:158
Chung LM, Ferguson JP, Zheng W, Qian F, Bruno V, Montgomery RR, Zhao H (2013) Differential
expression analysis for paired RNA-seq data. BMC Bioinformatics 14:110
Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G,
Jassal B (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic
Acids Res gkq1018
Cui X, Hou Y, Yang S, Xie Y, Zhang S, Zhang Y, Zhang Q, Lu X, Liu GE, Sun D (2014)
Transcriptional profiling of mammary gland in Holstein cows with extremely different milk
protein and fat percentage using RNA sequencing. BMC Genomics 15:226
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK (2009) Effect of
read-mapping biases on detecting allele-specific expression from RNA-sequencing data.
Bioinformatics 25:3207–3212
Delhomme N, Padioleau I, Furlong EE, Steinmetz LM (2012) easyRNASeq: a bioconductor pack-
age for processing RNA-Seq data. Bioinformatics 28:2532–2533
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras
TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15–21
Driver AM, Peñagaricano F, Huang W, Ahmad KR, Hackbart KS, Wiltbank MC, Khatib H (2012)
RNA-Seq analysis uncovers transcriptomic variations between morphologically similar
in vivo-and in vitro-derived bovine blastocysts. BMC Genomics 13:118
Fang Z, Martin J, Wang Z (2012) Statistical methods for identifying differentially expressed genes
in RNA-Seq experiments. Cell Biosci 2:26
Filloux C, Cédric M, Romain P, Lionel F, Christophe K, Dominique R, Abderrahman M, Daniel P
(2014) An integrative method to normalize RNA-Seq data. BMC Bioinformatics 15:188
Frazee AC, Pertea G, Jaffe AE, Langmead B, Salzberg SL, Leek JT (2015) Ballgown bridges the
gap between transcriptome assembly and expression analysis. Nat Biotechnol 33:243–246
Garber M, Grabherr MG, Guttman M, Trapnell C (2011) Computational methods for transcrip-
tome annotation and quantification using RNA-seq. Nat Methods 8:469–477
Ghosh M, Sodhi SS, Song KD, Kim JH, Mongre RK, Sharma N, Singh NK, Kim SW, Lee HK,
Jeong DK (2015) Evaluation of body growth and immunity‐related differentially expressed
genes through deep RNA sequencing in the piglets of Jeju native pig and Berkshire. Anim
Genet 46:255–264
Glaus P, Honkela A, Rattray M (2012) Identifying differentially expressed transcripts from RNA-
seq data with biological variation. Bioinformatics 28:1721–1728
Goecks J, Nekrutenko A, Taylor J (2010) Galaxy: a comprehensive approach for supporting acces-
sible, reproducible, and transparent computational research in the life sciences. Genome Biol
11:R86
Gondro C (2015) Primer to analysis of genomic data using R. Springer, Cham
Hansen KD, Brenner SE, Dudoit S (2010) Biases in Illumina transcriptome sequencing caused by
random hexamer priming. Nucleic Acids Res 38, e131
Hansen KD, Irizarry RA, Wu Z (2012) Removing technical variability in RNA-seq data using
conditional quantile normalization. Biostatistics 13:204–216
Hardcastle TJ, Kelly KA (2010) bayseq: empirical Bayesian methods for identifying differential
expression in sequence count data. BMC Bioinformatics 11:422
He H, Liu X (2013) Characterization of transcriptional complexity during longissimus muscle
development in bovines using high-throughput sequencing. PLoS One 8, e64356
Heap GA, Yang JH, Downes K, Healy BC, Hunt KA, Bockett N, Franke L, Dubois PC, Mein CA,
Dobson RJ (2010) Genome-wide analysis of allelic expression imbalance in human primary
cells by high-throughput transcriptome resequencing. Hum Mol Genet 19:122–134
Huang W, Khatib H (2010) Comparison of transcriptomic landscapes of bovine embryos using
RNA-Seq. BMC Genomics 11:711
Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane
HC (2007) DAVID bioinformatics resources: expanded annotation database and novel algo-
rithms to better extract biology from large gene lists. Nucleic Acids Res 35:W169–W175
Jia C, Guan W, Yang A, Xiao R, Tang WH, Moravec CS, Margulies KB, Cappola TP, Li M, Li C
(2015) MetaDiff: differential isoform expression analysis using random-effects meta-
regression. BMC Bioinformatics 16:208
Jiang H, Wong WH (2008) SeqMap: mapping massive amount of oligonucleotides to the genome.
Bioinformatics 24:2395–2396
Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res
28:27–30
Karim L, Takeda H, Lin L, Druet T, Arias JA, Baurain D, Cambisano N, Davis SR, Farnir F, Grisart
B (2011) Variants modulating the expression of a chromosome domain encompassing PLAG1
influence bovine stature. Nat Genet 43:405–413
Karisa BK, Thomson J, Wang Z, Stothard P, Moore SS, Plastow GS (2013) Candidate genes and
single nucleotide polymorphisms associated with variation in residual feed intake in beef cat-
tle. J Anim Sci 91:3502–3513
Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experi-
ments for identifying isoform regulation. Nat Methods 7:1009–1015
Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory require-
ments. Nat Methods 12:357–360
Korpelainen E, Tuimala J, Somervuo P, Huss M, Wong G (2014) RNA-seq data analysis: a practi-
cal approach. CRC Press, Boca Raton
Krämer A, Green J, Pollard J, Tugendreich S (2013) Causal analysis approaches in ingenuity path-
way analysis (IPA). Bioinformatics btt703
Kvam VM, Liu P, Si Y (2012) A comparison of statistical methods for detecting differentially
expressed genes from RNA-seq data. Am J Bot 99:248–256
Ladomery MR (2014) Targeting alternative splicing in human genetic disease. RNA Nanotechnol
331
Laiho A, Elo LL (2014) A note on an exon-based strategy to identify differentially expressed genes
in RNA-Seq experiments. PLoS One 9, e115964
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10:R25
Lee HJ, Jang M, Kim H, Kwak W, Park WC, Hwang JY, Lee CK, Jang GW, Park MN, Kim HC
(2013) Comparative transcriptome analysis of adipose tissues reveals that ECM-receptor inter-
action is involved in the depot-specific adipogenesis in cattle. PLoS One 8, e66267
Lee HJ, Park HS, Kim W, Yoon D, Seo S (2014) Comparison of metabolic network between
muscle and intramuscular adipose tissues in Hanwoo beef cattle using a systems biology
approach. Int J Genomics 2014:679437
Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and
population genetical parameter estimation from sequencing data. Bioinformatics
27:2987–2993
Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or
without a reference genome. BMC Bioinformatics 12:323
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform.
Bioinformatics 25:1754–1760
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J (2009) SOAP2: an improved ultrafast
tool for short read alignment. Bioinformatics 25:1966–1967
Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J (2012) Detection, annotation and visualization of
alternative splicing from RNA-Seq data with SplicingViewer. Genomics 99:178–182
Liu Y, Zhou J, White KP (2014) RNA-seq differential expression studies: more sequence or more
replication? Bioinformatics 30:301–304
Liu GF, Cheng HJ, You W, Song EL, Liu XM, Wan FC (2015) Transcriptome profiling of muscle
by RNA-Seq reveals significant differences in digital gene expression profiling between Angus
and Luxi cattle. Anim Prod Sci 55:1172–1178
Maretty L, Sibbesen JA, Krogh A (2014) Bayesian transcriptome assembly. Genome Biol 15:501
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682
Mazzoni G, Kogelman LJA, Suravajhala P, Kadarmideen HN (2015) Systems genetics of complex
diseases using RNA-sequencing methods. Int J Biosci Biochem Bioinformatics 5:264
McCabe M, Waters S, Morris D, Kenny D, Lynn D, Creevey C (2012) RNA-seq analysis of dif-
ferential gene expression in liver from lactating dairy cows divergent in negative energy bal-
ance. BMC Genomics 13:193
McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq
experiments with respect to biological variation. Nucleic Acids Res gks042
McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV (2011) RNA-
seq: technical variability and sampling. BMC Genomics 12:293
Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, Young TR, Goldmann JM,
Pervouchine DD, Sullivan TJ (2015) The human transcriptome across tissues and individuals.
Science 348:660–665
Mi H, Dong Q, Muruganujan A, Gaudet P, Lewis S, Thomas PD (2010) PANTHER version 7:
improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium.
Nucleic Acids Res 38:D204–D210
Miao X, Qin QLX (2015) Genome-wide transcriptome analysis of mRNAs and microRNAs in
Dorset and Small Tail Han sheep to explore the regulation of fecundity. Mol Cell Endocrinol
402:32–42
Morgan M, Pagès H, Obenchain V, Hayden N (2016) Rsamtools: binary alignment (BAM),
FASTA, variant call (BCF), and tabix file import
Morin RD, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh TJ, McDonald H, Varhol R,
Jones SJ, Marra MA (2008) Profiling the HeLa S3 transcriptome using randomly primed
cDNA and massively parallel short-read sequencing. Biotechniques 45:81
Pastinen T (2010) Genome-wide allele-specific analysis: insights into regulatory variation. Nat
Rev Genet 11:533–538
Patro R, Mount SM, Kingsford C (2014) Sailfish enables alignment-free isoform quantification
from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32:462–464
Ramanan VK, Shen L, Moore JH, Saykin AJ (2012) Pathway analysis of genomic data: concepts,
methods, and prospects for future development. Trends Genet 28:323–332
Ramayo-Caldas Y, Mach N, Esteve-Codina A, Corominas J, Castelló A, Ballester M, Estellé J,
Ibáñez-Escriche N, Fernández AI, Pérez-Enciso M (2012) Liver transcriptome profile in
pigs with extreme phenotypes of intramuscular fatty acid composition. BMC Genomics
13:547
Ramsköld D, Kavak E, Sandberg R (2012) How to Analyze Gene Expression Using RNA-
Sequencing Data. In: Next Generation Microarray Bioinformatics: Methods and Protocols
(eds. by Wang J, Tan CA and Tian T), Humana Press, Totowa, NJ. Springer, pp 259–274
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D (2013)
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
Genome Biol 14:R95
Rasche A, Lienhard M, Yaspo M-L, Lehrach H, Herwig R (2014) ARH-seq: identification of dif-
ferential splicing in RNA-seq data. Nucleic Acids Res 42:e110
Risso D, Schwartz K, Sherlock G, Dudoit S (2011) GC-content normalization for RNA-Seq data.
BMC Bioinformatics 12:480
Ropka‐Molik K, Żukowski K, Eckert R, Gurgul A, Piórkowska K, Oczkowicz M (2014)
Comprehensive analysis of the whole transcriptomes from two different pig breeds using RNA‐
Seq method. Anim Genet 45:674–684
Satya RV, Zavaljevski N, Reifman J (2012) A new strategy to reduce allelic bias in RNA-Seq read-
mapping. Nucleic Acids Res gks425
Seyednasrollah F, Laiho A, Elo LL (2013) Comparison of software packages for detecting differ-
ential expression in RNA-seq studies. Briefings Bioinf bbt086
Shen S, Park JW, Huang J, Dittmar KA, Lu Z-x, Zhou Q, Carstens RP, Xing Y (2012) MATS: a
Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq
data. Nucleic Acids Res gkr1291
Shi Y, Jiang H (2013) rSeqDiff: detecting differential isoform expression from RNA-Seq data
using hierarchical likelihood ratio test. http://dx.doi.org/10.1371/journal.pone.0079448
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP (2014) Sequencing depth and coverage: key
considerations in genomic analyses. Nat Rev Genet 15:121–132
Sterne-Weiler T, Sanford JR (2014) Exon identity crisis: disease-causing mutations that disrupt the
splicing code. Genome Biol 15:201
Stevenson KR, Coolon JD, Wittkopp PJ (2013) Sources of bias in measures of allele-specific
expression derived from RNA-seq data aligned to a single reference genome. BMC Genomics
14:536
Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in
RNA-seq: a matter of depth. Genome Res 21:2213–2223
Tasnim M, Ma S, Yang EW, Jiang T, Li W (2015) Accurate inference of isoforms from multiple
sample RNA-Seq data. BMC Genomics 16:S15
Tizioto PC, Coutinho LL, Decker JE, Schnabel RD, Rosa KO, Oliveira PS, Souza MM, Mourão
GB, Tullio RR, Chaves AS (2015) Global liver gene expression differences in Nelore steers
with divergent residual feed intake phenotypes. BMC Genomics 16:242
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25:1105–1111
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ,
Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL,
Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments
with TopHat and Cufflinks. Nat Protoc 7:562–578
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013) Differential analy-
sis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31:46–53
van de Wiel MA, Neerincx M, Buffart TE, Sie D, Verheul HMW (2014) ShrinkBayes: a versatile
R-package for analysis of count-based sequencing data in complex study designs. BMC
Bioinformatics 15:116
Vitting-Seerup K, Porse BT, Sandelin A, Waage J (2014) spliceR: an R package for classification
of alternative splicing and prediction of coding potential from RNA-seq data. BMC
Bioinformatics 15:81
Wang X, Cairns MJ (2013) Gene set enrichment analysis of RNA-Seq data: integrating differential
expression and splicing. BMC Bioinformatics 14:S16
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA,
Perou CM (2010) MapSplice: accurate mapping of RNA-seq reads for splice junction discov-
ery. Nucleic Acids Res 38, e178
Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H (2011) Evaluation
of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics
12:S5
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C,
Kazi F, Lopes CT (2010) The GeneMANIA prediction server: biological network integration
for gene prioritization and predicting gene function. Nucleic Acids Res 38:W214–W220
Wesolowski S, Birtwistle MR, Rempala GA (2013) A comparison of methods for RNA-Seq dif-
ferential expression analysis and a new empirical Bayes approach. Biosensors 3:238–258
Wilson GW, Stein LD (2015) RNASequel: accurate and repeat tolerant realignment of RNA-seq
reads. Nucleic Acids Res gkv594
Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short
reads. Bioinformatics 26:873–881
Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L (2006) WEGO:
a web tool for plotting GO annotations. Nucleic Acids Res 34:W293–W297
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, Robinson GJ, Lundberg
AE, Bartlett PF, Wray NR (2014) A comparative study of techniques for differential expression
analysis on RNA-Seq data. PLoS One 9, e103207
Zhou YH, Xia K, Wright FA (2011) A powerful and flexible approach to the analysis of RNA
sequence count data. Bioinformatics 27:2672–2678
Zhou A, Breese MR, Hao Y, Edenberg HJ, Li L, Skaar TC, Liu Y (2012) Alt Event Finder: a tool
for extracting alternative splicing events from RNA-seq data. BMC Genomics 13:S10

Applications of Graphical Models in Quantitative Genetics and Genomics

Guilherme J.M. Rosa, Vivian P.S. Felipe, and Francisco Peñagaricano

Abstract
In this chapter, we provide a brief introduction to graphical models, with an
emphasis on Bayesian networks, and discuss some of their applications in genet-
ics and genomics studies with agricultural and livestock species. First, some key
definitions regarding stochastic graphical models are provided, as well as basic
principles of inference related to graphical structure and model parameters. Next
is a discussion of some examples of applications, which include prediction of
complex traits using genomic information or other correlated traits as well as the
investigation of the flow of information from DNA polymorphisms to endpoint
phenotypes, including intermediate phenotypes such as gene expression. A first
example with prediction refers to the forecasting of total egg production in quails
using early expressed traits (such as weekly body weight, partial egg production,
and egg quality traits) as explanatory variables to support decision making (e.g.,
earlier culling decisions) in production/breeding systems. An additional example
uses genomic information for the estimation of genetic merit of selection candi-
dates for genetic improvement of economically important traits. An example
with causal inference deals with the network underlying carcass fat deposition
and muscularity in pigs by jointly modeling phenotypic, genotypic, and tran-
scriptomic data. Some additional applications of Bayesian networks and other
graphical model techniques are highlighted as well, including multitrait quantitative
trait loci (QTL) analysis and structural equation models with latent variables.
It is shown that graphical models such as Bayesian networks offer a powerful and
insightful approach both for prediction and for causal inference, with a myriad of
applications in the areas of genetics and genomics, and the study of complex
phenotypic traits in agriculture.

G.J.M. Rosa (*)
University of Wisconsin, Madison, WI, USA
e-mail: grosa@wisc.edu
V.P.S. Felipe
Cobb-Vantress Inc., Siloam Springs, AR, USA
F. Peñagaricano
University of Florida, Gainesville, FL, USA

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_5

1 Introduction

Networks can be used to represent biological systems that are composed of inter-
connected components. The components, or subunits, can be molecular elements of
a chemical reaction, cells of a tissue, organs and organ systems, individuals
within populations, species within ecosystems, etc. Links between subunits of a
network may represent different kinds of interactions, such as physical contact,
affinities, and causal effects, among others.
In genetics, networks are used, for example, to represent gene regulation systems,
coexpression, and epistatic interactions. A network modeling approach commonly
used in genetic analyses refers to correlation networks. In such cases, gene expression
data from microarray or RNA-seq assays are used to compute marginal pairwise cor-
relation coefficients, and results are depicted in a graph in which nodes (vertices) rep-
resent the genes or transcripts, and edges represent significant associations determined
by an arbitrary significance level or minimum absolute value. Alternatively, such net-
works can be constructed using partial correlations. Correlation networks derived from
gene expression data are then examined in terms of topology features such as connect-
edness, degree, betweenness, etc. In addition, networks derived from different settings,
such as environmental conditions or genetic backgrounds, can be compared in terms of
their topological structure or specific components of the network.
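The construction of a correlation network described above can be sketched in a few lines. This is only an illustration under stated assumptions: the expression matrix is simulated, and the function name `correlation_network` and the cutoff `min_abs_r` are hypothetical choices, not part of any cited method.

```python
import numpy as np

def correlation_network(expr, min_abs_r=0.5):
    """Return the gene-by-gene correlation matrix and the undirected
    edges whose absolute marginal correlation reaches the cutoff."""
    r = np.corrcoef(expr, rowvar=False)   # columns are genes, rows are samples
    n_genes = r.shape[0]
    edges = [(i, j) for i in range(n_genes) for j in range(i + 1, n_genes)
             if abs(r[i, j]) >= min_abs_r]
    return r, edges

# Hypothetical data: 200 samples, 4 genes; gene 1 is built from gene 0.
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 4))
expr[:, 1] = expr[:, 0] + 0.3 * rng.normal(size=200)

r, edges = correlation_network(expr)
# Node degree, one of the topology features mentioned in the text.
degree = np.bincount(np.array(edges, dtype=int).reshape(-1),
                     minlength=expr.shape[1])
print(edges, degree.tolist())
```

The cutoff plays the role of the arbitrary minimum absolute value mentioned above; a significance test on each coefficient could be used instead.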
A correlation network is an example of an undirected graph, as its edges are symmetrical
with no direction from one node to another. As such, it does not reveal or
express, for example, causal relationships between its nodes, i.e., if the expression
of a specific gene represses or activates the expression of another gene connected to
it. However, some network modeling approaches do explore potential direction of
edges connecting nodes, trying to more parsimoniously represent conditional inde-
pendences encapsulated in the joint distribution of the set of variables, or even as an
attempt to uncover causal links between those variables (Shipley 2002; Pearl 2009).
Such methods belong to a set of data analysis tools termed graphical models,
which include techniques such as path analysis, Bayesian networks (BNs), and
structural equation models (Pearl 2009). Interestingly enough, although graphical
modeling methods trace back to the pioneering work of Sewall Wright on path analysis
in genetics (Wright 1921), such methods have been further developed and applied
more extensively in other areas, including social sciences and econometrics
(Haavelmo 1943).
More recently, however, graphical modeling has caught the attention of geneti-
cists and has been employed in different applications of genomics and genetics
analysis of complex traits (e.g., Rosa et al. 2011; Sinoquet and Mourad 2014).
Earlier applications of Bayesian networks (BNs) to genomic data were in the analysis
of gene expression profiles obtained from DNA microarray experiments
(Friedman et al. 2000). Subsequently, the use of BNs in genetics and genomics studies
expanded rapidly, for example, in protein signaling networks (Sachs et al. 2005),
genomewide association analysis (Sebastiani et al. 2005), and genome-enabled prediction
and classification in livestock (Long et al. 2009). More recently, BNs have also
been used for modeling phenotypic networks (e.g., Valente et al. 2010; Bouwman
et al. 2014) and gene–phenotype relationships and causal effects (e.g., Schadt et al.
2005; Li et al. 2006; Chaibub Neto et al. 2010).
In this chapter, we present a brief overview of Bayesian networks, which are a
specific class of graphical models, and illustrate their usefulness in genetics and
genomics studies, with an emphasis on agricultural research. Examples of applications
include the investigation of the joint distribution of multiple variables with
more parsimonious representations, the prediction of complex phenotypic traits
using correlated traits and genomic information, and the examination of the flow of
information from DNA polymorphisms to intermediate phenotypes (such as gene
expression) to end point phenotypic traits, including the inference of causal links
between variables. Some concluding remarks are presented in a final section.

2 Bayesian Networks

Graphical models are probabilistic models that express conditional dependencies
between random variables through a graph. Graphical modeling encompasses a
number of statistical and data mining techniques well suited to study the joint dis-
tribution of a set of random variables. Graphical models are closely related to path
analysis, a method that has its origins in genetics (Wright 1921) and which has been
used extensively in many fields, including social sciences and economics (e.g.,
Haavelmo 1943). Path analysis is used to describe directed dependencies among a
set of variables and can be viewed as a form of multiple regression focusing on
causality. A more general graphical model approach refers to Bayesian networks
(BNs; Pearl 2009). A main output of a BN application refers to a qualitative descrip-
tion of the joint distribution of a set of variables, with a graphical representation of
conditional independencies among those variables. Such representation is generally
expressed with a directed acyclic graph (DAG), which consists of a group of nodes
(representing variables) connected by directed edges.
An example of a DAG is depicted in Fig. 1, which represents a network involving
five variables. The variables are the nodes (or vertices) of the DAG (in this case, V1,
V2, V3, V4, and V5), and the edges are the arrows connecting such nodes. Some
important concepts related to a DAG refer to the definition of parent, child, and
spouse nodes. The parents of a given node are connected to that node by an edge
directed toward it (e.g., in Fig. 1, V1, V3, and V4 are the parents of V5). The child(ren)
of a given node is(are) connected to that node by an edge directed away from it (e.g.,
V4 and V5 are the children of V1 in Fig. 1). Lastly, two or more nodes are defined as
spouses when they share at least one common child (e.g., V1 is a spouse of V2 in
Fig. 1). It is interesting to note that a pedigree graph, commonly used in quantitative
genetics and animal breeding, is an example of a DAG. Nonetheless, DAGs can
have more general structures, including more than two parents for a given node,
e.g., the node V5 in Fig. 1.

Fig. 1 Directed acyclic graph (DAG) representing a network with five
variables (V1, V2, V3, V4, and V5).

A DAG describes local features of the joint probability distribution of variables,
allowing a useful factorization of it. Given a set of variables {V1, V2, …, Vp} and a
DAG D that is compatible with their joint probability distribution Pr(V1, V2, …, Vp),
the following factorization can be performed (Pearl 2009):
\[
\Pr(V_1, V_2, \ldots, V_p) = \prod_{i=1}^{p} \Pr(V_i \mid Pa_i), \tag{1}
\]

in which Pai are the parents of Vi in D, i.e., all variables with arrows directed toward
Vi.
Hence, using this factorization strategy, the DAG in Fig. 1 can be represented
algebraically as

\[
\Pr(V_1, V_2, V_3, V_4, V_5) = \Pr(V_1)\,\Pr(V_2)\,\Pr(V_3)\,\Pr(V_4 \mid V_1, V_2)\,\Pr(V_5 \mid V_1, V_3, V_4).
\]

This factorization reflects some conditional independencies. For example, V5 and V2
are marginally dependent because the distribution of V5 depends on the value of V4
whose distribution in turn depends on the value of V2. However, V5 and V2 become
independent conditionally on V1 and V4. This and other more complex conditional
independencies can be identified more easily by applying the d-separation criterion
(Pearl 2009) in D.
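The two statements about V5 and V2 can be checked numerically by building a joint distribution from the factorization and enumerating it. The sketch below assumes binary variables and hypothetical conditional probability values; only the parent sets are taken from the text.

```python
from itertools import product

# Hypothetical values of Pr(V = 1 | parents) for binary nodes.
# Parent sets follow the text: V4 <- (V1, V2) and V5 <- (V1, V3, V4).
p1, p2, p3 = 0.3, 0.6, 0.5
p4 = {(a, b): 0.1 + 0.5 * a + 0.3 * b for a, b in product((0, 1), repeat=2)}
p5 = {(a, c, d): 0.05 + 0.2 * a + 0.3 * c + 0.4 * d
      for a, c, d in product((0, 1), repeat=3)}

def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1.0 - p

def joint(v1, v2, v3, v4, v5):
    """Joint probability as the product of the five local distributions."""
    return (bern(p1, v1) * bern(p2, v2) * bern(p3, v3)
            * bern(p4[(v1, v2)], v4) * bern(p5[(v1, v3, v4)], v5))

NAMES = ("v1", "v2", "v3", "v4", "v5")
STATES = list(product((0, 1), repeat=5))

def prob(**fixed):
    """Probability of the event that fixes the named variables."""
    return sum(joint(*s) for s in STATES
               if all(s[NAMES.index(k)] == v for k, v in fixed.items()))

total = sum(joint(*s) for s in STATES)

# Marginal dependence: Pr(V5 = 1 | V2) changes with V2.
marg_v2_1 = prob(v5=1, v2=1) / prob(v2=1)
marg_v2_0 = prob(v5=1, v2=0) / prob(v2=0)

# Conditional independence: given V1 and V4, V2 no longer matters.
cond_v2_1 = prob(v5=1, v1=1, v4=1, v2=1) / prob(v1=1, v4=1, v2=1)
cond_v2_0 = prob(v5=1, v1=1, v4=1, v2=0) / prob(v1=1, v4=1, v2=0)

print(total, marg_v2_1 - marg_v2_0, cond_v2_1 - cond_v2_0)
```

With these values the marginal probabilities differ by 0.4 × 0.3 = 0.12 (the effect of V2 on V4 times the effect of V4 on V5), whereas the two conditional probabilities coincide.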
The conditional independencies entailed in a DAG representation allow a more
parsimonious representation of the joint distribution of a set of variables, which
makes it interesting for prediction purposes. For example, consider that the DAG in
Fig. 1 is compatible with Pr(V1, V2, V3, V4, V5) (i.e., their joint distribution can be
factorized as in Eq. 1). Further, consider V5 as the target variable for prediction. The
information from the BN indicates that not all the variables in the DAG are needed
to predict V5, but only V1, V3, and V4 are sufficient. This is because conditionally on
V1 and V4, as indicated above, V5 is independent of V2, i.e., there is no extra infor-
mation to be learned from V2 after V1 and V4 have been observed.
Another important definition related to DAGs is the concept of Markov blanket
(MB), which is especially useful in the context of prediction models. The MB of a
node is the set of nodes that includes its parent(s), child(ren), and spouse(s) (e.g., in
the DAG of Fig. 1, the MB of V3 is V1, V4, and V5). The MB of a node V is then the
minimum set of variables in a DAG, conditional on which everything else (i.e., the
remaining nodes) is independent of V. As such, the MB of a node is the only infor-
mation needed to predict its behavior.
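As a small illustration, the MB of every node in Fig. 1 can be obtained directly from the list of directed edges. The edge list below is read off the parent sets given in the text; the function names are hypothetical.

```python
# Directed edges (parent, child) of the DAG in Fig. 1.
EDGES = [("V1", "V4"), ("V2", "V4"), ("V1", "V5"), ("V3", "V5"), ("V4", "V5")]

def parents(node):
    return {a for a, b in EDGES if b == node}

def children(node):
    return {b for a, b in EDGES if a == node}

def spouses(node):
    """Other parents of the node's children."""
    kids = children(node)
    return {a for a, b in EDGES if b in kids and a != node}

def markov_blanket(node):
    return parents(node) | children(node) | spouses(node)

for v in ("V1", "V2", "V3", "V4", "V5"):
    print(v, sorted(markov_blanket(v)))
```

For V3 this returns V1, V4, and V5, in agreement with the example above; for V5, which has no children, the MB reduces to its three parents.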
Conditional independencies are not the only information conveyed by a DAG. For
example, in Fig. 1 it can be seen that the nodes V1 and V2 are not only marginally
independent, but that they become dependent conditionally on V4. Such conditional
dependencies occur, for example, in the case of unshielded colliders which consist
of two disconnected nodes with a common child (e.g., V1 → V4 ← V2). Unshielded
colliders are key components for structure learning, especially regarding direction
of edges, as discussed below.
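The behavior of an unshielded collider is easy to reproduce by simulation. The sketch below assumes a linear Gaussian version of V1 → V4 ← V2 with unit effects and unit variances (hypothetical values) and compares the marginal correlation between V1 and V2 with their partial correlation given V4.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
v1 = rng.normal(size=n)
v2 = rng.normal(size=n)                  # generated independently of v1
v4 = v1 + v2 + rng.normal(size=n)        # common child of v1 and v2

def residual(y, x):
    """Residual of y after simple least-squares regression on x."""
    slope = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return (y - y.mean()) - slope * (x - x.mean())

marginal_r = np.corrcoef(v1, v2)[0, 1]
partial_r = np.corrcoef(residual(v1, v4), residual(v2, v4))[0, 1]
print(round(marginal_r, 3), round(partial_r, 3))
```

Under these assumptions the theoretical partial correlation is −0.5, even though the two parents are generated independently. This asymmetry is what constraint-based algorithms exploit to orient edges.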
Bayesian network analysis involves two steps. The first one refers to qualitatively
inferring how variables are related to each other and it is referred to as “structure
learning.” The output of this step is a graphic structure describing the conditional
independencies between the observed variables, so it involves searching for a struc-
ture that is compatible with the joint distribution of the data. Given the inferred struc-
ture, the magnitude of each arrow is then estimated using regression techniques. This
second step is called “parameter learning,” and it can be implemented using tradi-
tional estimation methods such as maximum likelihood or Bayesian approaches.
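Once a structure has been fixed, parameter learning in a linear Gaussian BN reduces to one regression per node on its parents. The following sketch assumes the structure of Fig. 1 and hypothetical path coefficients used only to simulate data; least squares stands in for maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated data from a hypothetical linear Gaussian version of Fig. 1.
v1, v2, v3 = rng.normal(size=(3, n))
v4 = 0.8 * v1 - 0.5 * v2 + rng.normal(size=n)
v5 = 0.6 * v1 + 0.4 * v3 + 0.7 * v4 + rng.normal(size=n)
data = {"V1": v1, "V2": v2, "V3": v3, "V4": v4, "V5": v5}

# Structure assumed known (output of the structure-learning step).
PARENTS = {"V1": [], "V2": [], "V3": [],
           "V4": ["V1", "V2"], "V5": ["V1", "V3", "V4"]}

def fit_node(child):
    """Least-squares regression of a node on its parents."""
    X = np.column_stack([np.ones(n)] + [data[p] for p in PARENTS[child]])
    beta, *_ = np.linalg.lstsq(X, data[child], rcond=None)
    return dict(zip(["intercept"] + PARENTS[child], beta))

estimates = {node: fit_node(node) for node in PARENTS}
print({k: round(float(v), 2) for k, v in estimates["V5"].items()})
```

Each estimated coefficient is the magnitude attached to one arrow of the DAG; nodes without parents contribute only a mean.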
For structure learning, two basic types of algorithms can be applied: “constraint-based”
and “search and score.” Constraint-based algorithms (Spirtes et al. 2000)
infer from the data whether or not certain conditional independencies between vari-
ables hold. The search and score approach attempts to identify the network that
maximizes a score function indicating how well the network fits the data
(Tsamardinos et al. 2006). Other algorithms, e.g., the Max–Min Hill-Climbing
(MMHC; Tsamardinos et al. 2006), are considered hybrids of the two structure
learning approaches. The choice of score functions (for score-based algorithms) or
test statistics for conditional independencies (for constraint-based algorithms) will
depend on the form of the probability or density function assumed for the local
distributions, including discrete and continuous variables.
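To make the idea of a score function concrete, the sketch below computes a BIC-type score for a candidate DAG under the assumption of linear Gaussian local distributions. The choice of BIC, the simulated data, and the function names are assumptions made for illustration; the chapter does not prescribe a particular score.

```python
import numpy as np

def node_bic(y, parent_columns):
    """Gaussian log-likelihood of one node given its parents, penalized
    by the number of free parameters (coefficients plus variance)."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(parent_columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)            # ML residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    n_params = X.shape[1] + 1
    return loglik - 0.5 * n_params * np.log(n)

def dag_bic(data, parents):
    """The score decomposes into a sum of node-wise terms."""
    return sum(node_bic(data[child], [data[p] for p in ps])
               for child, ps in parents.items())

# Hypothetical data generated from the collider V1 -> V4 <- V2.
rng = np.random.default_rng(5)
n = 2000
v1, v2 = rng.normal(size=(2, n))
data = {"V1": v1, "V2": v2, "V4": v1 + v2 + rng.normal(size=n)}

bic_collider = dag_bic(data, {"V1": [], "V2": [], "V4": ["V1", "V2"]})
bic_empty = dag_bic(data, {"V1": [], "V2": [], "V4": []})
print(bic_collider > bic_empty)
```

A search-and-score algorithm would evaluate such a score repeatedly while adding, removing, or reversing edges, retaining the modifications that improve it.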
This chapter is focused on applications of Bayesian networks and other graphical
modeling techniques to genetics and genomics studies, so the reader should refer to
specific literature for more details on additional concepts and properties as well as
learning and inference approaches (e.g., Pearl 2009; Spirtes et al. 2000; Nagarajan
et al. 2013).

3 Examples of Applications of Bayesian Networks

As indicated above, BNs offer an interesting tool for a more parsimonious represen-
tation of the joint distribution of a set of variables. In addition, they are also useful
for prediction purposes through the detection of the MB of the target variable.
Moreover, under certain conditions or assumptions, the selected structure inferred
in a BN can be interpreted as causal relationships between its nodes (Pearl 2009;
Spirtes et al. 2000).
In this section, we illustrate some of these applications of BN. In the first subsec-
tion, as an example of a more parsimonious representation of a multivariate system,
including the conditional independencies entailed by its joint distribution, a BN is
used to investigate the linkage disequilibrium (LD) structure involving single nucle-
otide polymorphism (SNP) loci, as an alternative to the pairwise LD measures such
as D′ or r². In the following subsection, two examples of BN applied for prediction
of target variables are provided. In the first example, the interest is in predicting
total egg production (TEP) in quails using early expressed phenotypes. The BN is
used to identify the MB of TEP, and a prediction model is then developed based on
the selected nodes. In the second example, molecular marker information is used to
predict a complex phenotypic trait, which is a key step for genomic selection (GS)
in plant and animal breeding (Goddard and Hayes 2007; de los Campos et al. 2013).
In a third subsection, a BN strategy is employed to infer causal networks underlying
carcass fat deposition and muscularity in pigs by jointly modeling phenotypic,
genotypic, and transcriptomic data. Some additional applications of BN and other
graphical modeling approaches are discussed in the last part of this section.

3.1 Parsimonious Modeling of Multidimensional Covariance Structures

Bayesian networks can be used to more parsimoniously represent the joint distribu-
tion of a set of variables. For example, let y be a k-dimensional random vector from
a multivariate normal distribution with covariance matrix Σ, i.e., Var[y] = Σ. The
matrix Σ then has k variance parameters (diagonal elements) and k(k − 1)/2 covariances
(off-diagonal elements). The covariances between nodes (variables) in a BN
are described by arrows and paths connecting them. A fully connected BN structure
involving k nodes has k(k − 1)/2 arrows, and it is statistically equivalent to an
unstructured covariance matrix Σ as described above. However, the structure-learning
step of a BN implementation will search for conditional independencies
(d-separations) between its nodes, and each detected d-separation will result in the
removal of the arrow connecting the corresponding pair of nodes. As such, depending
on how many conditional independencies are detected in a BN, the overall
covariance structure involving its nodes could be represented by a much smaller
number of parameters. To illustrate an application of BN in this context, we discuss
here a study that aimed to investigate the linkage disequilibrium (LD) structure
among molecular markers, which was presented by Morota et al. (2012).
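The saving in parameters can be made explicit with a simple count. The sketch assumes a linear Gaussian BN in which each node carries one residual variance and each arrow one path coefficient (means are ignored); the function names are hypothetical.

```python
def n_unstructured_params(k):
    """k variances plus k(k - 1)/2 covariances."""
    return k + k * (k - 1) // 2

def n_bn_params(k, n_arrows):
    """One residual variance per node plus one coefficient per arrow."""
    return k + n_arrows

k = 5
full = n_unstructured_params(k)               # unstructured covariance matrix
saturated = n_bn_params(k, k * (k - 1) // 2)  # fully connected BN
sparse = n_bn_params(k, 5)                    # DAG of Fig. 1, five arrows
print(full, saturated, sparse)                # 15 15 10
```

Each arrow removed by a detected d-separation lowers the count by one, so the five-node DAG of Fig. 1 needs 10 parameters instead of 15.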
Linkage disequilibrium is a nonrandom association of alleles at different loci
within a population. Genome-enabled prediction of complex traits and association
mapping techniques, such as genomewide association studies (GWAS), rely on LD
between molecular markers and quantitative trait loci (QTL) or causative mutations.
Linkage disequilibrium is also exploited when building haplotype blocks in genomic
studies. Traditionally, LD patterns are assessed using metrics such as D′ and r²,
which are computed for each pair of markers or genetic loci. However, such
approaches are not able to fully exploit the complexities of the LD structure involv-
ing multiple loci.
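For reference, the two pairwise metrics can be computed from haplotype and allele frequencies as sketched below, using the standard definitions D = p_AB − p_A p_B, D′ = |D|/D_max, and r² = D²/[p_A(1 − p_A)p_B(1 − p_B)]. The function name and the example frequencies are hypothetical, and both loci are assumed to be polymorphic.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of alleles A and B (both strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

print(ld_measures(0.4, 0.5, 0.5))   # D = 0.15, D' = 0.6, r2 = 0.36
print(ld_measures(0.3, 0.3, 0.6))   # D' = 1 although r2 is about 0.29
```

The second call shows that the two metrics can disagree markedly for the same pair of loci, and neither says anything about how a third locus modifies the association.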
Morota et al. (2012) used genotypic data on 36,778 SNP markers after quality
control and missing data imputation, together with predicted transmitting abilities
(PTA) for milk protein yield from 4898 progeny-tested Holstein bulls. They used a
BN as a representation of choice for the LD structure, with the view that loci associ-
ate and interact together as a system or network, as opposed to in a simple pairwise
manner.
Markers used for the BN analysis were preselected based on results of a genome-enabled
prediction model for milk protein yield using a Bayesian LASSO approach.
Three strategies of subset selection were used, with the 15 or 30 top SNPs in terms
of raw absolute posterior mean of the allele substitution effect, standardized absolute
posterior mean, or locus-specific additive genetic variance. Two different algorithms,
the Tabu search (a local score-based algorithm) and the incremental
association Markov blanket (IAMB; a constraint-based algorithm), coupled with
the chi-square test, were used for learning the structure of the BN and were compared
with the reference r² metric represented as an LD heat map.
Among all combinations of subset of markers and BN algorithm used, an exam-
ple is depicted in Fig. 2 to illustrate the results. The two BNs presented were con-
structed with the Tabu and IAMB algorithms, using the 15 SNPs with the largest
absolute posterior means. Three SNPs on chromosome 14 (A, B, and L) are shown
as gray-filled nodes; these SNPs presented stronger pairwise LD than the remaining
ones. The Tabu search identified one independent network (A, B, C, D, E, F, G, I, K,
L, M, O), with the remaining three SNPs (H, J, N) not associated with any other
group, suggesting their independent segregation. By contrast, the IAMB search
revealed two independent networks (A, B, C, D, E, F, G, H, I, K, L, M, O) and (J,
N). Although the networks constructed by the two algorithms were not the same,
they presented many features in common, including the main cluster of 12 SNPs,
and many consistent connections, such as the trio (A, B, L), the pair (M, E), and the
path connecting the sequence (D, C, I, O, and G, K). In total, the Tabu network had
12 edges and the IAMB had 17, with 10 of the edges common to both networks.
Overall, the BN captured several genetic markers associated as clusters, imply-
ing that these markers are interrelated in a complicated manner. Further, the BN
detected conditionally dependent markers. The results confirm that LD relationships
are of a multivariate nature, and hence r² gives only an incomplete description
and understanding of LD. The observed discrepancies between the Tabu and IAMB
networks are in agreement with the study of Karacaören et al. (2011), who carried out
a similar exercise by fitting a BN to significant markers obtained from the
QTL-MAS 2010 data set. In conclusion, Morota et al. (2012) indicated that a BN
contains LD information that is additional to that conveyed by pairwise measures,
such as conditional independencies among SNPs in the networks. Such information,
as they pointed out, can be useful, for example, for the selection of tag SNPs to be
used in marker-assisted selection strategies.

Fig. 2 Examples of networks inferred using genotypes of 15 SNPs with the largest absolute
posterior means for milk protein yield. Gray-filled nodes are SNPs located on chromosome 14.
Networks in the left and right panels correspond to the outputs of the Tabu and the IAMB
algorithms, respectively, but only in terms of d-separations (Adapted from Morota et al. 2012).

3.2 Prediction of Complex Phenotypic Traits

An application of BN for prediction purposes has been presented by Felipe et al.
(2015), who worked with total egg production (TEP) in quails. TEP is an important
trait in meat-type poultry breeding programs, as it is directly associated with the
number of hatchable eggs. However, early management decisions cannot be based
on TEP because this information is accumulated from 35 to 260 days of age. In this
context, Felipe et al. (2015) attempted to develop a prediction model for TEP using
early expressed traits as explanatory variables, so that predictions could be useful to
support decision making (e.g., earlier culling decisions) in production/breeding
systems.
A classical method used for prediction is regression analysis. Multiple regression
models describe the target variable with a linear function of a set of covariates
(explanatory variables) with parameters fitted by, for example, least squares. Given
a set of observed covariates, many different models could be fitted by including different
subsets of the available covariates, and interaction and polynomial terms
involving them. A challenge in implementing this technique when many covariates
are available is model selection, given the large number of competing models to
be compared. In addition, the inclusion of correlated predictors may result in
increased standard errors of the regression coefficients, which cause predictions to
be very sensitive to model changes (Burnham and Anderson 2002). As discussed
earlier, an alternative in this context refers to BNs, which can be used to construct a
predictive model parsimoniously by factorizing the posterior density distribution
function assuming a set of stable conditional independencies.
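The practical consequence of restricting a regression to the MB of the target can be illustrated with simulated data. In the sketch below, a single covariate stands in for the MB and 30 further covariates are unrelated to the target; all names, sample sizes, and effect sizes are hypothetical and do not reproduce any data set discussed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, n_noise = 60, 2000, 30

def simulate(n):
    mb_cov = rng.normal(size=n)                  # stands in for the MB of the target
    noise = rng.normal(size=(n, n_noise))        # covariates unrelated to the target
    target = 2.0 * mb_cov + rng.normal(size=n)
    return np.column_stack([mb_cov, noise]), target

def pmse(cols, X_tr, y_tr, X_te, y_te):
    """Prediction mean square error of least squares using columns `cols`."""
    design = lambda X: np.column_stack([np.ones(len(X)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design(X_tr), y_tr, rcond=None)
    return float(np.mean((y_te - design(X_te) @ beta) ** 2))

X_tr, y_tr = simulate(n_train)
X_te, y_te = simulate(n_test)

pmse_mb = pmse([0], X_tr, y_tr, X_te, y_te)
pmse_all = pmse(list(range(1 + n_noise)), X_tr, y_tr, X_te, y_te)
print(round(pmse_mb, 2), round(pmse_all, 2))
```

With few training records relative to the number of covariates, the model using all covariates pays for estimating many unnecessary coefficients, which is the situation in which MB-based selection is expected to help.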
Felipe et al. (2015) compared the efficiency of the BN approach to the traditional
multiple (stepwise) regression to predict individual TEP in European quails using
weekly body weight, partial egg production, and egg quality traits. In addition, the
authors also used a machine learning technique, the artificial neural network (ANN),
as a third competing approach.
Data were available from two distinct lines of female European quails (L1 and
L2), which included weekly measurements of body weight from birth to 35 days of
age (BW1 to BW6), weight gain from birth to 35 days of age (WG1) and from 21 to
35 days of age (WG2), age at first egg (AFE), number of eggs produced from 35 to
80 days of age (EP1), and the number of eggs produced from 35 to 260 days of age
(TEP). In addition, egg quality traits were measured at four different ages (125, 170,
215, and 260 days), which included egg weight (Ew1 to Ew4), yolk weight (Y1 to
Y4), eggshell weight (ES1 to ES4), egg white weight (EW1 to EW4), and egg spe-
cific gravity (DENS1 to DENS4).
The network structures obtained for both lines (L1 and L2) comprising the 31
phenotypic traits (TEP and 30 covariates) are presented in Fig. 3. For the L1 line,
results indicate that TEP is directly connected only to partial egg production (EP1).
The remaining observed traits are not expected to contribute to predicting the TEP
in the presence of (i.e., conditionally on) EP1. In further analysis (graphic not
shown), in which EP1 was removed from the data set, AFE became directly con-
nected to TEP. Excluding both EP1 and AFE from the covariates set, the selected
DAG presented TEP as completely disconnected from the other nodes, which
implies independence of the TEP from other traits considered in this specific set for
L1. An interesting aspect of the result obtained is the path
BW1 → BW2 → BW3 → BW4 → BW5, for which the statistical consequence seems
to fit what is expected given growth curve behavior and timeline.
For L2, the DAG also shows that TEP is directly dependent only on EP1.
Similarly to L1, results indicate that, conditionally on EP1, no other earlier observed
trait (such as body weight) brings additional information about the potential TEP of
a bird. Also, similar to the results obtained for L1, excluding EP1 and AFE from the
L2 training set resulted in an output DAG in which the TEP was disconnected from
the other traits.
The structures learned for L1 and L2 present a different number of directed and
undirected edges, including 14 edges present exclusively in the L1 DAG, 18 edges
exclusively in the L2 DAG, and 15 edges in common to both structures. However, in
the vicinities of TEP, both DAGs are equivalent, meaning that the inferred MB for
this node is the same for both lines. This result implies that among the variables
studied, EP1 can be considered the only predictor for TEP, regardless of the line.
Moreover, AFE can be considered a predictor trait in the absence of partial egg pro-
duction measurement, as TEP and AFE are independent conditionally solely on EP1.
Comparing the predictive ability across models, choosing covariates using the
BN (parent: EP1) yielded better results in terms of correlation and prediction mean
square error (PMSE) compared to multiple linear regression (with or without step-
wise approach). However, when AFE was considered as predictor (scenario in
which EP1 was removed from the observed data), the predictive ability was reduced
by 50% in all BN prediction scenarios. In summary, the results showed that using
BN (with EP1 in the set of predictors) to construct regression models provided better
predictions of TEP and superior generalization ability within and across quail
lines when compared to the standard approaches used with multiple linear regression
(Felipe et al. 2015).

Fig. 3 Phenotype structure inferred by the Bayesian network on two lines of quails (panels
Line 1 and Line 2; adapted from Felipe et al. 2015). Traits (nodes) refer to weekly measured
body weight from birth to 35 days of age (BW1 to BW6), weight gain from birth to 35 days of
age (WG1) and from 21 to 35 days of age (WG2), age at first egg (AFE), number of eggs
produced from 35 to 80 days of age (EP1), total egg production (TEP), and egg quality traits
measured in four different life stages of the bird (125, 170, 215, and 260 days of age) (egg
weight (Ew1 to Ew4), yolk weight (Y1 to Y4), egg shell weight (ES1 to ES4), egg white weight
(EW1 to EW4), and egg specific gravity (Dens1 to Dens4)).

The work of Felipe et al. (2015) is an example of prediction of a target pheno-
typic trait of economic interest based on information on early expressed traits. Such
an application can be used, for example, to aid decision making in livestock produc-
tion systems. Another interesting application of BN was presented by Scutari et al.
(2013), who used genotypic information on thousands of markers for prediction of
complex phenotypic traits in barley, rice, and mouse populations, within the context
of genomic selection (GS) in plant and animal breeding (Goddard and Hayes 2007;
de los Campos et al. 2013). The main interest was in feature selection, using the
concept of Markov blanket (MB), for identifying noninformative markers such that
simpler GS models could be fitted using only the informative markers, allowing cost
savings from reduced genotyping (e.g., Vazquez et al. 2010).
The authors used three publicly available data sets, including continuous pheno-
typic traits: (1) barley data with information on yield and genotypes for 810 SNPs
from 227 barley varieties, (2) mouse data consisting of 12,545 SNPs and phenotypic
information on growth rate and weight for 1940 subjects, and (3) rice data with
73,808 SNPs and phenotypic score on the number of seeds per panicle for 413
varieties.
The performance of MB feature selection was investigated using four GS models:
partial least squares (PLS), ridge regression, LASSO, and elastic net penalized
regressions (Gianola et al. 2009). Each GS model was fitted using all the available SNPs as well as only the
SNPs included in the MB. The prediction quality of the GS models was assessed by
calculating the correlation between observed and predicted trait values using a
cross-validation (CV) approach.
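The cross-validated predictive correlation can be sketched as follows. This is only an illustration under simplifying assumptions: a one-predictor least-squares fit stands in for the penalized GS models, folds are assigned by index, and all names and toy data are hypothetical rather than taken from Scutari et al. (2013).

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares intercept and slope (stand-in for a GS model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def cv_predictive_correlation(x, y, k):
    """Correlation between observed y and out-of-fold predictions."""
    n = len(x)
    predicted = [0.0] * n
    for fold in range(k):
        # Records whose index falls in this fold are held out for testing.
        train = [i for i in range(n) if i % k != fold]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in range(n):
            if i % k == fold:
                predicted[i] = intercept + slope * x[i]
    return pearson(y, predicted)

# Hypothetical toy data: a linear signal plus an alternating disturbance.
toy_x = [float(v) for v in range(12)]
toy_y = [2.0 * v + (1.0 if int(v) % 2 else -1.0) for v in toy_x]
print(round(cv_predictive_correlation(toy_x, toy_y, k=4), 3))
```

Because every prediction comes from a model that never saw the corresponding record, the resulting correlation measures generalization rather than goodness of fit.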
When using a significance threshold of α = 0.15 for testing the network correla-
tions, the resulting MBs selected a relatively small number of SNPs, regardless of
the dimension of the SNP data set. The average size of the MBs obtained from cross
validation was 185 for the barley data, 525 and 543 for weight and growth rate,
respectively, for the mouse data, and 293 for the rice data. To assess the consistency
of MB sets across the CV replications, the authors checked how many SNPs
appeared in at least half of the CV folds. Reasonable consistency was found for the
barley data, with 136 SNPs, and for mouse weight and growth rate, with 241 and
276 SNPs, respectively. With the rice data, however, only 15 SNPs were selected by
at least half of the MBs across CV folds, corresponding to 5% of the MB average
size. As indicated by the authors, the latter result may be attributed to the very low
ratio between sample size and number of SNPs in the rice data compared to the
other two data sets.
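The consistency check described above amounts to counting, for each SNP, the number of CV folds whose MB contains it. A small sketch follows; the fold contents and the function name are hypothetical, while the counts used for the shares at the end are the ones quoted in the text.

```python
from collections import Counter

def consistent_features(selected_per_fold, min_fraction=0.5):
    """Features selected in at least `min_fraction` of the folds."""
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    needed = min_fraction * len(selected_per_fold)
    return {f for f, c in counts.items() if c >= needed}

# Hypothetical Markov blankets from four CV folds.
toy_folds = [
    {"snp1", "snp2", "snp3"},
    {"snp1", "snp3"},
    {"snp1", "snp4"},
    {"snp1", "snp2"},
]
stable = consistent_features(toy_folds)
print(sorted(stable))

# Consistent SNPs divided by average MB size, using the quoted counts.
shares = {
    "barley": 136 / 185,
    "mouse weight": 241 / 525,
    "mouse growth rate": 276 / 543,
    "rice": 15 / 293,
}
for name, share in shares.items():
    print(f"{name}: {share:.2f}")
```

With the quoted counts, the share is about 0.74 for barley, 0.46 and 0.51 for the two mouse traits, and 0.05 for rice, which makes the contrast between the rice data and the other two data sets explicit.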
Overall, the MB subsets of SNPs were consistently smaller than the sample size,
thus ensuring the regularity and numerical stability of the GS models. No loss in
predictive power was observed with GS models implemented after MB feature
selection. Actually, the increased numerical stability resulting from the reduced
number of SNPs slightly improved the predictive power of the GS models.
Specifically, the averages of the predictive correlation over the four analyses (barley,
rice, and the two mouse traits) using all SNPs were 0.481, 0.498, 0.471, and 0.521
for PLS, ridge, LASSO and elastic net, respectively, whereas the corresponding
averages when using MB feature selection were 0.502, 0.509, 0.487, and 0.520, all
with an approximate standard deviation of 0.0057. Furthermore, MBs outperformed
other subsampling strategies of the same size, such as random sampling and feature
selection based on SNP effect significance.
In summary, Scutari et al. (2013) have shown that MB feature selection applied
as a preliminary step in GS was able to greatly reduce the size of the SNP set with
no loss (and possibly a small gain) in the predictive ability of different GS models.
As such, MB feature selection may be used as an alternative for reducing genotyping
costs in animal and plant breeding programs.

3.3 Causal Inference

Peñagaricano et al. (2015a) applied a BN technique to infer causal networks underlying
carcass fat deposition and muscularity in pigs by jointly modeling phenotypic,
genotypic, and transcriptomic data. The data were from a three-generation resource
pig population consisting of 954 F2 animals originating from crosses between Duroc
and Pietrain breeds. Among the 60+ different phenotypes available from this popu-
lation, Peñagaricano et al. (2015a) focused on carcass and meat quality traits.
Details of the population development, animal management, and phenotype collec-
tion can be found in Edwards et al. (2008a, b).
In addition to the phenotypic data, genotype information was available for 124
dinucleotide microsatellite genetic markers (3–9 markers per chromosome), which
was used to derive breed-of-origin probabilities across the genome of the F2 animals.
Moreover, muscle tissue sampled from a subset of the F2 individuals was used to
assess gene expression with a 70-mer oligonucleotide microarray, including 20,400
annotated oligonucleotides spanning the whole swine genome. Details regarding
tissue sample collection, sample preparation, microarray hybridization, and data
preprocessing are given in Steibel et al. (2011).
Using the data on phenotypes, genotypes, and gene expression for a total of
171 F2 individuals, Peñagaricano et al. (2015a) performed two complementary
whole-genome scans: a classical phenotypic QTL mapping (pQTL) integrating phe-
notypic and genotypic data, and an expression QTL mapping (eQTL) integrating
transcriptional profiling with genotypic data. Three traits, namely, loin muscle
weight (LOIN), midline 10th-rib backfat thickness (BF10), and average intramuscular
fat percentage (FAT), showed significant pQTL at the 5% genome-wise
significance level. Interestingly, these three significant pQTL mapped to the same
genomic region on chromosome 6 (SSC6) of the pig genome (Fig. 4).
In addition, seven significant eQTLs (FDR ≤ 0.20) were detected in the same
region of the pig genome (Fig. 5). These seven eQTLs were associated with the
level of expression of the following seven genes: zinc finger protein 24 (ZNF24),
aldo-keto reductase family 7, member A2 (AKR7A2), synovial sarcoma, X break-
point 2 interacting protein (SSX2IP), ets variant 2 (ETV2), small integral membrane
protein 12 (SMIM12), peroxisomal biogenesis factor 14 (PEX14), and prostate
tumor overexpressed 1 (PTOV1). It is important to note that all these genes are
located on SSC6 and hence these seven significant eQTLs can be considered as local
or cis-eQTL.
In order to decipher potential functional relationships between these variables
(phenotypic traits, expression levels, and the most significant genetic marker located
in the QTL region), a structure-learning algorithm (more specifically, a constraint-based
approach within an inductive causation algorithm) was employed in conjunction
with Fisher’s Z test to assess conditional independence (α = 0.05).
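For Gaussian variables, a conditional independence test of this kind is commonly based on the partial correlation and Fisher's Z transformation. The sketch below is a generic version of that test with a single conditioning variable. The correlation values are made up for illustration, and only the sample size (171) and α = 0.05 are taken from the text; it is not the implementation used by Peñagaricano et al. (2015a).

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y after adjusting both for Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def fisher_z_pvalue(r, n, n_conditioning):
    """Two-sided p-value for H0: (partial) correlation equals zero."""
    z = math.atanh(r) * math.sqrt(n - n_conditioning - 3)
    # Under H0, z is approximately standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

n = 171        # number of animals with all three data types (from the text)
alpha = 0.05   # significance level (from the text)

# Hypothetical correlations in which X and Y are related only through Z.
r_xy, r_xz, r_yz = 0.25, 0.5, 0.5

p_marginal = fisher_z_pvalue(r_xy, n, 0)
p_partial = fisher_z_pvalue(partial_correlation(r_xy, r_xz, r_yz), n, 1)

print("X, Y marginally independent?", p_marginal > alpha)
print("X, Y independent given Z?  ", p_partial > alpha)
```

A constraint-based algorithm removes the edge between X and Y as soon as some conditioning set renders them independent, which is what happens here once Z is conditioned on.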

[Figure 4: genome scan of F value by chromosome for tenth-rib backfat (mm), fat (%), and loin weight (kg)]

Fig. 4 Genome scan results for loin muscle weight, midline 10th-rib backfat thickness, and aver-
age intramuscular fat percentage. The horizontal line indicates the genome-wise significance level
of 5% (Adapted from Peñagaricano et al. 2015a).

[Figure 5: genome scan of −log(P-value) by chromosome for ZNF24, SSX2IP, SMIM12, ETV2, PTOV1, PEX14, and AKR7A2]

Fig. 5 Genome scan results for seven expression traits that show significant eQTL on chromosome
6 (SSC6). The horizontal line indicates P-value = 1.9 × 10^−5 (FDR ≤ 0.20) (Adapted from
Peñagaricano et al. 2015a).

Interestingly, without using any prior information, the structure-learning algorithm
could reconstruct a partially directed acyclic graph with only three undirected edges
(figure not shown). Specifically, only the links between the genetic marker (labeled
as QTL) and AKR7A2, between SMIM12 and AKR7A2, and between PTOV1 and
SMIM12 remained unresolved (i.e., undirected). Using prior knowledge, the undi-
rected link between the genotype and AKR7A2 was set as QTL → AKR7A2 (i.e.,
genotype may have a causal effect on the gene expression but not the opposite).
With that information, the algorithm could reconstruct a fully directed acyclic graph
(Fig. 6). Remarkably, based on the causal graphical model, the genetic marker is
marginally associated (through direct and indirect paths) with all the other vari-
ables, i.e., phenotypic and expression traits. These findings completely agree with

[Figure 6: directed network over QTL, ZNF24, AKR7A2, SSX2IP, ETV2, SMIM12, PEX14, PTOV1, LOIN, BF10, and FAT; each edge is labeled with an estimate and its standard error in parentheses]

Fig. 6 Network inferred after incorporation of QTL → AKR7A2 as prior knowledge, and maxi-
mum likelihood estimates (and standard errors) of the causal parameters (Adapted from
Peñagaricano et al. 2015a).

the previous pQTL and eQTL results. In addition, even though there is a direct link
between the genotype and one phenotype (QTL → FAT), the graphical causal model
showed that, in general, the effects of the genotype on the phenotypes are mediated
by the expression of several genes.
The stability of the network was evaluated using jackknife resampling
(Peñagaricano et al. 2015a). In each iteration, the network was inferred from a new
data set created by removing one animal from the original data
set. Notably, the majority of the links and directions showed great stability. In fact,
the arrows between phenotypic traits and the links from the genetic marker to the
intermediate variables remained mostly unchanged. Very few connections
were unstable (e.g., SMIM12 ↔ PTOV1), i.e., the removal of a single data
point made the connection between the variables disappear.
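The leave-one-out stability check can be sketched as below. The structure learner is replaced by a deliberately simple stand-in (linking variables whose absolute correlation exceeds a threshold), and the data, threshold, and function names are hypothetical; the point is only the resampling loop, not the learning algorithm used in the study.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def learn_edges(data, threshold=0.6):
    """Toy stand-in for structure learning: link pairs with |r| > threshold."""
    return {(a, b) for a, b in combinations(sorted(data), 2)
            if abs(pearson(data[a], data[b])) > threshold}

def jackknife_edge_stability(data, learner):
    """Fraction of leave-one-out replicates in which each edge is recovered."""
    n = len(next(iter(data.values())))
    counts = {}
    for i in range(n):
        # Drop record i from every variable and relearn the graph.
        reduced = {name: col[:i] + col[i + 1:] for name, col in data.items()}
        for edge in learner(reduced):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n for edge, c in counts.items()}

# Hypothetical data: x and y are tightly linked, whereas the link with w
# rests entirely on the last record.
toy = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 6, 8, 10, 12],
    "w": [0, 0, 0, 0, 0, 10],
}
stability = jackknife_edge_stability(toy, learn_edges)
for edge, freq in sorted(stability.items()):
    print(edge, round(freq, 2))
```

Edges that depend on a single influential record, like the ones involving w here, show up with a recovery frequency below one.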
Knowledge about gene–phenotype networks can be used to predict the behav-
ior of complex systems. For instance, in Peñagaricano et al. (2015a), the network
model predicted that modulation of ZNF24 expression level should lead to a
change in the expression of SSX2IP. Recently, Li et al. (2009) evaluated potential
ZNF24 target genes. For this purpose, the authors transiently overexpressed and
silenced ZNF24 and then applied a microarray assay to identify target
genes. Remarkably, the overexpression of ZNF24 significantly decreased the
expression of SSX2IP, as predicted by the network in Peñagaricano et al. (2015a).
In addition, the silencing of ZNF24 resulted in a significant overexpression of
SSX2IP (Li et al. 2009). Therefore, these results support the causal relations
inferred in Peñagaricano et al. (2015a).

3.4 Additional Applications

In the application of BN presented by Peñagaricano et al. (2015a), the link between
the QTL and the expression level of AKR7A2 remained unresolved in terms of direction
when using only the information provided by the data. The direction of that
edge was then based on the prior information that the flow of information or causal
effect should go from DNA polymorphism to RNA transcription. Interestingly
enough, after including the arrow pointing from the QTL to AKR7A2, the other two
unresolved links in the data-driven network (i.e., links between SMIM12 and AKR7A2,
and between PTOV1 and SMIM12) ended up directed as well, in a domino effect,
generating a fully directed network output.
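One standard rule that produces such a cascade in constraint-based learning orients b → c whenever a → b, b – c, and a and c are not adjacent (otherwise a new v-structure would be implied). The sketch below applies that rule to a generic toy graph. Node names G, X, Y, and Z are hypothetical and the non-adjacencies are assumed, so it illustrates the mechanism rather than reproducing the network in Fig. 6.

```python
def propagate_orientations(directed, undirected):
    """Repeatedly apply: if a -> b, b - c, and a, c not adjacent, set b -> c."""
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(u, v):
        return ((u, v) in directed or (v, u) in directed
                or frozenset((u, v)) in undirected)

    changed = True
    while changed:
        changed = False
        for a, b in list(directed):
            for edge in list(undirected):
                if b not in edge:
                    continue
                (c,) = edge - {b}
                if c != a and not adjacent(a, c):
                    undirected.discard(edge)
                    directed.add((b, c))
                    changed = True
    return directed, undirected

# Toy graph: one arrow fixed by prior knowledge, two unresolved links.
oriented, remaining = propagate_orientations(
    directed={("G", "X")},
    undirected=[("X", "Y"), ("Y", "Z")],
)
print(sorted(oriented))
print(len(remaining))
```

Fixing the single arrow G → X is enough to orient X → Y and then Y → Z, leaving no undirected link in this toy example.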
In fact, the aid that molecular marker and QTL information provides for network
reconstruction is well known. As Thomas and Conti (2004) have pointed out, genet-
ically randomized experimental populations that segregate naturally occurring
allelic variants can provide a basis for the inference of causal relationships among
genetic loci, intermediate, and end point phenotypes. In particular, the randomiza-
tion of alleles that occurs during meiosis provides a setting that is analogous to a
randomized experimental design, such that causality can be inferred within the clas-
sical Fisherian statistical framework (Rosa et al. 2011).
Schadt et al. (2005) exploited this concept and proposed a multistep procedure to
infer causal relationships between two phenotypic traits and a common QTL. More
specifically, they aimed to disentangle the causal path involving the expression of a
particular gene, a cis-acting expression QTL (eQTL), and a complex trait (e.g., a
disease trait), to determine if they are related to each other following a causal, reac-
tive, or independent model. The causal model refers to a network in which allelic
variations in the gene modulate its expression, and such an over- or underexpression
affects the disease outcome. In the reactive model, on the other hand, the expression
of the gene is a consequence of the disease status. Lastly, the independent model
represents a situation in which the QTL controls both the expression and the disease
independently.
The authors proposed a likelihood-based causality model selection test that uses
conditional correlation measures to determine which relationship among a trio of
traits (a transcriptional trait, a complex phenotype, and a common QTL affecting
both) is best supported by the data. Likelihoods associated with each of the models
(causal, reactive, and independent models) were constructed and maximized with
respect to the model parameters, and the Akaike information criterion (AIC) was
used to select the model best supported by the data.
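A minimal sketch of likelihood-based selection among the three models follows. It assumes simple linear Gaussian regressions for each conditional distribution, drops the term for the QTL itself (common to all models), and treats the independent model as expression and trait being conditionally independent given the QTL. These are simplifying assumptions of this illustration and not necessarily the parameterization of Schadt et al. (2005). Data are simulated under the causal model.

```python
import math
import random

def regression_loglik(y, x):
    """Maximized Gaussian log-likelihood of a linear regression of y on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sigma2 = rss / n  # maximum likelihood estimate of the residual variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic_for_trio(qtl, expr, trait):
    """AIC of the causal, reactive, and independent models for one trio."""
    k = 6  # two regressions, each with intercept, slope, residual variance
    loglik = {
        "causal": regression_loglik(expr, qtl) + regression_loglik(trait, expr),
        "reactive": regression_loglik(trait, qtl) + regression_loglik(expr, trait),
        "independent": regression_loglik(expr, qtl) + regression_loglik(trait, qtl),
    }
    return {model: 2 * k - 2 * ll for model, ll in loglik.items()}

# Simulated trio under the causal model: QTL -> expression -> trait.
random.seed(1)
n = 400
qtl = [random.choice([0, 1, 2]) for _ in range(n)]
expr = [2.0 * g + random.gauss(0, 0.5) for g in qtl]
trait = [e + random.gauss(0, 0.5) for e in expr]

aic = aic_for_trio(qtl, expr, trait)
best = min(aic, key=aic.get)
print(best)
```

Since the three models have the same number of parameters in this simplified form, the AIC ranking coincides with the ranking of the maximized likelihoods; the penalty term matters only when the competing models differ in complexity.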
Their methodology was applied to a mouse genetical genomics study comprising
large-scale genotypic, gene expression, and complex trait data to identify genes
related to obesity. The analysis identified known and new susceptibility
genes for fat mass and successfully predicted the transcriptional response to perturbation
in such genes. Their procedure, however, is restricted to simple gene–phenotype
networks, focusing on the identification of genes in the causal-reactive interval
considering a trio of nodes comprising a common QTL affecting the expression of
a specific gene and a complex trait. Evidently, gene and phenotype networks can be
much more complex, as the causal-reactive genes may also be interacting in a
broader network through an intricate cascade of genes and phenotypic traits.
An extension of the work by Schadt et al. (2005) was presented by Li et al. (2006),
who proposed a methodology to analyze multilocus, multitrait genetic data. Their
method allows not only a larger number of nodes (i.e., loci and phenotypic traits)
but also different possible causal relationships among them, such that it provides a
better characterization of the genetic architecture underlying complex traits. Their
method comprises a series of steps, starting with single locus genome scans run for
each individual phenotype. Next, conditional genome scans are performed using
one trait as a covariate in the analysis of another trait. A third step in their procedure
refers to the construction of an initial path model and its respective graph, in which
edges are directed from the QTL to the corresponding traits. Competing path mod-
els are then assessed in terms of goodness of fit by comparing the predicted and
observed covariance matrices and by significance tests for individual path coeffi-
cients. Finally, an additional step is performed to refine the model by proposing and
assessing alternative models, which are generated by adding or removing edges in
the initial model, or by reversing the causal direction of an edge.
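The refinement step can be pictured as a local search over graphs. The sketch below, with hypothetical function names and a toy graph among traits only, enumerates the models reachable from a starting graph by adding, removing, or reversing a single edge. As a simplifying assumption of this illustration, candidates containing a directed cycle are discarded. Scoring each candidate (e.g., by goodness of fit) is left out, and edges from QTL to traits, which keep their direction, are not represented.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Check for directed cycles by repeatedly removing source nodes."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    stack = [v for v in nodes if indegree[v] == 0]
    removed = 0
    while stack:
        v = stack.pop()
        removed += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    stack.append(child)
    return removed == len(nodes)

def neighbor_models(nodes, edges):
    """All acyclic graphs one edge addition, removal, or reversal away."""
    edges = set(edges)
    candidates = []
    for a, b in permutations(nodes, 2):
        if (a, b) in edges:
            candidates.append(edges - {(a, b)})                # remove a -> b
            candidates.append((edges - {(a, b)}) | {(b, a)})   # reverse a -> b
        elif (b, a) not in edges:
            candidates.append(edges | {(a, b)})                # add a -> b
    return [m for m in candidates if is_acyclic(nodes, m)]

# Toy initial model among three traits (hypothetical names).
traits = ["T1", "T2", "T3"]
initial = {("T1", "T2")}
alternatives = neighbor_models(traits, initial)
print(len(alternatives))
```

Each alternative would then be fitted and compared with the current model, and the search repeated from the best-fitting candidate.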
Li et al. (2006) applied their methodology in the analysis of body weight and
weights of the inguinal, gonadal, peritoneal, and mesenteric fat pads of an SM × NZB
intercross population with 260 female and 253 male mice raised on an atherogenic
diet, and they concluded that the proposed method provides an insightful descriptive
approach to the genetic analysis of multiple traits, allowing the characterization of
pleiotropic and heterogeneous genetic effects of multiple loci on multiple traits, as
well as the physiological interactions among traits.
Also using QTL information to orient edges connecting phenotypes, Chaibub
Neto et al. (2010) proposed a methodology that simultaneously infers a causal phe-
notype network and its associated genetic architecture. Their approach is based on
jointly modeling phenotypes and QTL using homogeneous conditional Gaussian
regression models and a graphical criterion for model equivalence. The concept of
randomization of alleles during meiosis and the unidirectional relationship from
genotype to phenotype are explicitly used to infer causal effects of QTL on pheno-
types. Subsequently, causal relationships among phenotypes are inferred using the
QTL nodes, which might make it possible to distinguish among phenotype net-
works that would otherwise be distribution equivalent.
A related approach was proposed by Wang and van Eeuwijk (2014), which also
uses QTL information to orient edges in a prelearned undirected or partially directed
phenotype network and is applicable to a range of population structures, including
offspring populations from crosses between inbred parents and outbred parents,
association panels, and natural populations. The methodology has been used by
Wang et al. (2015) for network inference using 29 metabolites and 24 sensory traits
scored on ripe tomatoes. The data set consisted of four F2 segregating populations
generated from a half-diallel mating design with four contrasting tomato cultivars as
parental lines. From an initial set of 6000 SNPs, a linkage map was constructed
using 600 selected SNP markers evenly spread at about 2 cM and with 50 markers
per chromosome. A total of 21 QTLs were identified for the 29 metabolites and 24
sensory traits. Most of the QTLs were found to have pleiotropic effects, with a few
of them serving as hubs in the inferred dependency network. As indicated by the
authors, the resulting network provided interesting hints on how the sensory traits
depend on the metabolites and further on the detected QTLs.
Another interesting graphical modeling approach refers to structural equation
models (SEM). An advantage of SEM is that they allow the study of latent variables,
i.e., variables that are not observable or are not directly measurable but are charac-
terized in the model from several observed variables (Bollen 1989). This allows
modeling complex phenomena, e.g., propensity to disease and animal well-being,
reducing at the same time the dimensionality of the data, as several variables can be
combined in a model to represent an underlying concept of interest (Peñagaricano
et al. 2015b).
SEM have been used to study relationships among phenotypic traits in the context
of classical quantitative genetics and mixed effect models, in which the genetic
components contributing to variation of phenotypes are nonobservable variables that
must be inferred from the data using information on recorded traits and familial
relationships among individuals. The use of SEM within a mixed effects model
framework was described by Gianola and Sorensen (2004), which allows the study
of recursive and simultaneous relationships among phenotypes in multivariate sys-
tems. Therefore, SEM can produce an interpretation of relationships among traits,
which differs from that obtained with standard multitrait mixed models (MTM),
where all relationships are represented by symmetric linear associations among ran-
dom variables, i.e., as measured by covariances and correlations (Henderson and
Quaas 1976; Mrode 2005). Unlike MTM, in SEM one trait can be treated as a pre-
dictor of another trait, providing a functional (causal) link between them (Rosa and
Valente 2014).
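To make the contrast concrete, consider a two-trait recursive system written in an assumed minimal form without fixed or genetic effects: y1 = e1 and y2 = λ·y1 + e2, with independent residuals. The sketch below computes the covariance matrix implied by the structural coefficient λ; names and values are illustrative only.

```python
def implied_covariance(lam, var_e1, var_e2):
    """Covariance matrix of (y1, y2) when y1 = e1 and y2 = lam * y1 + e2."""
    var_y1 = var_e1
    cov_y1_y2 = lam * var_e1
    var_y2 = lam ** 2 * var_e1 + var_e2
    return [[var_y1, cov_y1_y2], [cov_y1_y2, var_y2]]

# Hypothetical structural coefficient and residual variances.
cov = implied_covariance(lam=0.5, var_e1=1.0, var_e2=1.0)
print(cov)
```

An MTM would describe these two traits only through the symmetric covariance (0.5 here), whereas the recursive model attributes that covariance to the effect of y1 on y2.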
Many studies have used mixed effects SEM in quantitative genetics and animal
breeding applications, including examples with goats (de los Campos et al. 2006),
pigs (Varona et al. 2007), and dairy cattle (e.g., de Maturana et al. 2009, 2010;
Heringstad et al. 2009; Jamrozik et al. 2010; König et al. 2008; Wu et al. 2007). The
phenotypic traits studied span from production (e.g., milk yield in dairy cattle and
body weight in pigs) to reproductive (e.g., gestation length and calving ease in dairy
cattle, and litter size in pigs) and health-related traits (somatic cell score and masti-
tis incidence in dairy cattle). In addition, some extensions of the methodology pro-
posed by Gianola and Sorensen (2004) have been suggested, such as threshold
models with structural coefficients functioning at the level of liabilities (de Maturana
et al. 2009; König et al. 2008; Wu et al. 2008), and models with heterogeneous
structural coefficients, such as time- and yield-dependent coefficients (e.g., de
Maturana et al. 2009, 2010; Wu et al. 2007).
In those earlier applications of mixed effects SEM, the causal structures were
assumed to be known a priori (e.g., de Maturana et al. 2009; Heringstad et al. 2009),
or just a few putative structures selected using some prior knowledge were com-
pared (e.g., de los Campos et al. 2006; Varona et al. 2007; Wu et al. 2008). More
recently, Valente et al. (2010) proposed a methodology that allows searching for
recursive causal structures in the context of mixed models for the genetic analysis
of multiple traits, showing that under certain conditions it is possible to infer phenotype
networks and causal effects even without QTL or marker information. Their
approach involves a first step of data adjustment for genetic effects, which otherwise
act as confounders of causal effects between phenotypic traits. In Valente et al.
(2010), a classical infinitesimal additive genetic model involving a relationship
matrix A constructed from pedigree information has been considered for such a
task. As an alternative, if high-density molecular marker data are available (e.g.,
SNP genotypes), more efficient genetic merit prediction approaches can be
employed, such as Bayesian regression techniques (Gianola et al. 2009) or kernel
methods (de los Campos et al. 2009). Examples of data-driven construction
of causal graphs within the mixed effects SEM context can be found in
Valente et al. (2011), who studied causal relationships among five productive and
reproductive traits in European quail, and in Bouwman et al. (2014), who investigated
bovine milk fatty acids.
In a different application of SEM with mixed effects models, Peñagaricano
et al. (2015b) described a methodology for assessing causal networks involving
latent variables underlying complex phenotypic traits. The first step of their approach
involves the construction of latent variables defined on the basis of prior knowledge
and biological interest, which are jointly evaluated using confirmatory factor analy-
sis. The estimated factor scores are then used as phenotypes for fitting a multivariate
mixed model to obtain the covariance matrix of latent variables conditional on the
genetic effects. Finally, causal relationships between the adjusted latent variables
are evaluated using different SEM with alternative causal specifications.
Peñagaricano et al. (2015b) applied their methodology to a data set with pigs for
which several phenotypes were recorded over time. Five different latent variables
were evaluated to explore causal links between growth, carcass, and meat quality
traits. They found that both growth (−0.160) and carcass traits (−0.500) have a sig-
nificant negative causal effect on quality traits (P-value < 0.001), which may have
important implications for improving pig production.
Other applications of graphical modeling in genetics and genomics studies include
the selection of appropriate covariables. For example, Valente et al. (2015) pointed
out that selection requires learning causal genetic effects, and that traditional model
comparison techniques based on predictive power might be inappropriate; genomic
predictors from some models may capture noncausal correlational signals which,
despite providing good predictive ability, inadequately represent true genetic effects.
Using simulated examples, the authors showed that aiming for predictive ability
might lead to poor modeling decisions. Alternatively, causal inference approaches
such as graphical models can guide the construction of regression models that better
infer the target genetic effect even when these models underperform in terms of pre-
diction quality using cross validation. Similar modeling techniques should also be
used in candidate gene studies or genomewide association analyses, so that spurious
gene–phenotype associations due to inappropriate conditioning are avoided.
As a last application of graphical models, we want to highlight their importance
in livestock production for the analysis of observational data using field-recorded
information, and their potential utility in the study of causal effects, as discussed in
Rosa and Valente (2013). As these authors postulated, there is much to be learned
from such data if carefully mined with appropriate causal models. Such analyses
can be used either to explicitly investigate causal relationships between variables or
simply to generate hypotheses for further investigation using controlled experiments
or additional field-recorded data.

4 Concluding Remarks

Graphical models such as Bayesian networks (BNs) have become extremely popular
in many areas of research for the analysis of data collected from observational
studies, such as those often found in social sciences, epidemiology, and economics.
They have also been increasingly used in genetics and genomics studies, given their
ability to explore some additional features of the data, such as conditional indepen-
dencies between variables, and to provide further insights about the biological sys-
tem under study, such as functional relationships between variables.
A first example discussed in this chapter illustrates how BNs can be used to
describe more parsimoniously the joint dependencies between variables such as the
LD structure involving molecular markers (Morota et al. 2012). Another important
application of BNs refers to prediction. In this context, BNs are able to detect the
minimal set of a pool of available variables that encompass all the relevant informa-
tion for predicting a target variable of interest, i.e., the Markov blanket (MB). Here
we illustrate this approach using two examples, one with phenotypic traits in quails
(Felipe et al. 2015) and another in the framework of whole-genome prediction mod-
els (Scutari et al. 2013). In this context, Valente et al. (2015) pointed out that the
construction of whole-genome prediction models targeting only predictive ability
using cross-validation techniques could be misleading. As discussed by these
authors, genomic selection models should be constructed aiming primarily at the
identifiability of causal genetic effects, not the predictive ability per se. BN tech-
niques should be useful in this regard as well, aiding the selection of variables to be
included in the models to avoid spurious associations between genetic markers and
phenotypic traits.
An example discussed in Section 3.3 illustrates the application of BN techniques
to investigate causal functional relationships between traits as well as the flow of
information from DNA polymorphisms to transcriptional activity and end point
phenotypes. BN applied to observational data for causal inference purposes is, how-
ever, a much harder task than for the development of prediction models. In this
context, causal inferences are possible only under certain conditions and assump-
tions, such as the Markov condition, faithfulness, and causal sufficiency (Pearl
2009). However, even if some of the conditions are not met, a BN analysis may still
produce interesting and useful results such as the generation of causality hypotheses
for further research and investigation. In genetics, for example, a putative causal
mutation or gene pathway identified by a BN could be ultimately tested using gene
knockout or knockdown techniques.
An advantage of the use of BN analysis in genetics and genomics studies is that
causal inference is aided by the concept of Mendelian randomization (Thomas and
Conti 2004), in which allelic variants are randomized to zygotes during meiosis and
eventually passed on from parents to offspring, analogously to a randomized experimental
design. It is shown that applying graphical model techniques to QTL analysis
and gene mapping with multiple traits not only allows inference regarding causal
relationships among phenotypes, but it also enhances detection power and precision
of estimates, with the additional advantage of a distinction between direct and indi-
rect genetic effects of QTL on each trait (Chaibub Neto et al. 2010).
Lastly, graphical models such as BN and structural equation models can be
extremely useful for analysis of farm-recorded data, which are essentially of observa-
tional nature. In this context, abundant data routinely collected in commercial herds
(either for breeding purposes, health control, or general herd management decisions)
could be exploited to investigate causal effects of environment and management fac-
tors in livestock production, well-being, and product quality. As discussed by Rosa
and Valente (2013), there is a huge opportunity for additional learning from such data,
which require, however, careful analysis using graphical model techniques.
In summary, BNs provide a flexible and insightful approach for the development
of prediction models in genetic analyses as well as for investigating causal relation-
ships between DNA polymorphisms and gene activity, and complex end point phe-
notypes. Knowledge regarding functional relationships between phenotypic traits as
well as the flow of information from DNA to phenotypes can aid the development
of more efficient breeding programs and optimal decision-making strategies in live-
stock management practices.

References
Bollen KA (1989) Structural equations with latent variables. Wiley, New York
Bouwman AC, Valente BD, Janss LLG, Bovenhuis H, Rosa GJM (2014) Exploring causal net-
works of bovine milk fatty acids in multivariate mixed model context. Genet Sel Evol 46:2
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical
information-theoretic approach, 2nd edn. Springer, New York
Chaibub Neto E, Keller MP, Attie AD, Yandell BS (2010) Causal graphical models in systems
genetics: a unified framework for joint inference of causal network and genetic architecture for
correlated phenotypes. Ann Appl Stat 4:320–339
de los Campos G, Gianola D, Boettcher P, Moroni P (2006) A structural equation model for
describing relationships between somatic cell score and milk yield in dairy goats. J Anim Sci
84:2934–2941
de los Campos G, Gianola D, Rosa GJM (2009) Reproducing kernel Hilbert spaces regression: a
general framework for genetic evaluation. J Anim Sci 87:1883–1887
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole-genome
regression and prediction methods applied to plant and animal breeding. Genetics
193(2):327–345
de Maturana EL, Wu X-L, Gianola D, Weigel KA, Rosa GJM (2009) Exploring biological relation-
ships between calving traits in primiparous cattle with a Bayesian recursive model. Genetics
181:277–287
de Maturana EL, de los Campos G, Wu X-L, Gianola D, Weigel KA, Rosa GJM (2010) Modeling
relationships between calving traits: a comparison between standard and recursive mixed mod-
els. Genet Sel Evol 42:1
Edwards DB, Ernst CW, Raney NE, Doumit ME, Hoge MD et al (2008a) Quantitative trait locus
mapping in an F2 Duroc x Pietrain resource population: II. Carcass and meat quality traits.
J Anim Sci 86:254–266
Edwards DB, Ernst CW, Tempelman RJ, Rosa GJM, Raney NE et al (2008b) Quantitative trait loci
mapping in an F2 Duroc x Pietrain resource population: I. Growth traits. J Anim Sci
86:241–253
Felipe VPS, Silva MA, Valente BD, Rosa GJM (2015) Using multiple regression, Bayesian networks
and artificial neural networks for prediction of total egg production in European quails
based on earlier expressed phenotypes. Poult Sci 94:772–780
Friedman N, Linial M, Nachman I, Pe’er D (2000) Using Bayesian networks to analyze expression
data. J Comput Biol 7:601–620
Gianola D, Sorensen D (2004) Quantitative genetic models for describing simultaneous and recur-
sive relationships between phenotypes. Genetics 167:1407–1424
Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability
and the Bayesian alphabet. Genetics 183:347–363
Goddard ME, Hayes BJ (2007) Genomic selection. J Anim Breed Genet 124(6):323–330
Haavelmo T (1943) The statistical implications of a system of simultaneous equations.
Econometrica 11:1–12
Henderson CR, Quaas RL (1976) Multiple trait evaluation using relatives records. J Anim Sci
43:1188–1197
Heringstad B, Wu X-L, Gianola D (2009) Inferring relationships between health and fertility in
Norwegian red cows using recursive models. J Dairy Sci 92:1778–1784
Jamrozik J, Bohmanova J, Schaeffer LR (2010) Relationships between milk yield and somatic cell
score in Canadian Holsteins from simultaneous and recursive random regression models.
J Dairy Sci 93:1216–1233
Karacaören B, Silander T, Álvarez-Castro JM, Haley CS, de Koning DJ (2011) Association analy-
ses of the MAS-QTL data set using grammar, principal components and Bayesian network
methodologies. BMC Proc 5(Suppl 3):S8
König S, Wu X-L, Gianola D, Heringstad B, Simianer H (2008) Exploration of relationships
between claw disorders and milk yield in Holstein cows via recursive linear and threshold
models. J Dairy Sci 91:395–406
Li R, Tsaih SW, Shockley K, Stylianou IM, Wergedal J, Paigen B, Churchill GA (2006) Structural
model analysis of multiple quantitative traits. PLoS Genet 2, e114
Li JZ, Chen X, Gong XL, Liu Y, Feng H, Qiu L, Hu ZL, Zhang JP (2009) A transcript profiling
approach reveals the zinc finger transcription factor ZNF191 is a pleiotropic factor. BMC
Genomics 10:241
Long N, Gianola D, Rosa GJM, Weigel KA, Avendaño S (2009) Comparison of classification
methods for detecting associations between SNPs and chick mortality. Genet Sel Evol 41:18
Morota G, Valente BD, Rosa GJM, Weigel KA, Gianola D (2012) An assessment of linkage dis-
equilibrium in Holstein cattle using a Bayesian network. J Anim Breed Genet 129:474–487
Mrode RA (2005) Linear models for the prediction of animal breeding values, 2nd edn. CABI
Publishing, Wallingford
Nagarajan R, Scutari M, Lèbre S (2013) Bayesian networks in R with applications in systems biol-
ogy. Springer, New York
Pearl J (2009) Causality: models, reasoning, and inference, 2nd edn. Cambridge University Press,
Cambridge
Peñagaricano F, Valente BD, Steibel JP, Bates RO, Ernst CW, Khatib H, Rosa GJM (2015a)
Exploring causal networks underlying fat deposition and muscularity in pigs through the inte-
gration of phenotypic, genotypic and transcriptomic data. BMC Syst Biol 9:58
Peñagaricano F, Valente BD, Steibel JP, Bates RO, Ernst CW, Khatib H, Rosa GJM (2015b)
Searching for causal networks involving latent variables in complex traits: application to
growth, carcass, and meat quality traits in pigs. J Anim Sci 93:4617–4623
Rosa GJM, Valente BD (2013) Inferring causal effects from observational data in livestock. J Anim
Sci 91:553–564
Rosa GJM, Valente BD (2014) Structural equation models for studying causal phenotype networks
in quantitative genetics. In: Sinoquet C, Mourad R (eds) Probabilistic graphical models for
genetics, genomics and postgenomics. Oxford University Press, Oxford
Rosa GJM, Valente BD, de los Campos G, Wu X-L, Gianola D, Silva MA (2011) Inferring causal
phenotype networks using structural equation models. Genet Sel Evol 43:6
Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling net-
works derived from multiparameter single-cell data. Science 308:523–529
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman
M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H,
Kash SF, Drake TA, Sachs A, Lusis AJ (2005) An integrative genomics approach to infer
causal associations between gene expression and disease. Nat Genet 37:710–717
Scutari M, Mackay I, Balding D (2013) Improving the efficiency of genomic selection. Stat Appl
Genet Mol Biol 12(4):517–527
Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH (2005) Genetic dissection and
prognostic modeling of overt stroke in sickle cell anemia. Nat Genet 37:435–440
Shipley B (2002) Cause and correlation in biology. Cambridge University Press, Cambridge, UK
Sinoquet C, Mourad R (eds) (2014) Probabilistic graphical models for genetics, genomics and
postgenomics. Oxford University Press, Oxford
Spirtes P, Glymour C, Scheines R (2000) Causation, prediction and search, 2nd edn. The MIT
Press, Cambridge, MA
Steibel JP, Bates RO, Rosa GJM et al (2011) Genome-wide linkage analysis of global gene expres-
sion in loin muscle tissue identifies candidate genes in pigs. PLoS One 6(2), e16766
Thomas DC, Conti DV (2004) Commentary: the concept of ‘Mendelian randomization’. Int
J Epidemiol 33:21–25
Tsamardinos I, Brown LE, Aliferis CF (2006) The max-min hill-climbing Bayesian network struc-
ture learning algorithm. Mach Learn 65:31–78
Valente BD, Rosa GJM, de los Campos G, Gianola D, Silva MA (2010) Searching for recursive
causal structures in multivariate quantitative genetics mixed models. Genetics 185:633–644
Valente BD, Rosa GJM, Teixeira RB, Torres RA (2011) Searching for phenotypic causal networks
involving complex traits: an application to European quails. Genet Sel Evol 43:37
Valente BD, Morota G, Peñagaricano F, Gianola D, Weigel KA, Rosa GJM (2015) The causal
meaning of genomic predictors and how it affects the construction and comparison of genome-enabled
selection models. Genetics 200:483–494
Varona L, Sorensen D, Thompson R (2007) Analysis of litter size and average litter weight in pigs
using recursive model. Genetics 177:1791–1799
Vazquez AI, Rosa GJM, Weigel KA, de los Campos G, Gianola D, Allison DB (2010) Predictive
ability of subsets of single nucleotide polymorphisms with and without parent average in US
Holsteins. J Dairy Sci 93(12):5942–5949
Wang H, van Eeuwijk F (2014) A new method to infer causal phenotype networks using QTL and
phenotypic information. PLoS One 9(8), e103997
Wang H, Paulo J, Kruijer W, Boer M, Jansen H, Tikunov Y, Usadel B, van Heusden S, Bovy A, van
Eeuwijk F (2015) Genotype–phenotype modeling considering intermediate level of biological
variation: a case study involving sensory traits, metabolites and QTLs in ripe tomatoes. Mol
Biosyst 11:3101–3110
Wright S (1921) Correlation and causation. J Agric Res 20:557–585
Wu X-L, Heringstad B, Chang YM, de los Campos G, Gianola D (2007) Inferring relationships
between somatic cell score and milk yield using simultaneous and recursive models. J Dairy
Sci 90:3508–3521
Wu X-L, Heringstad B, Gianola D (2008) Exploration of lagged relationships between mastitis and
milk yield in dairy cows using a Bayesian structural equation Gaussian-threshold model. Genet
Sel Evol 40:333–357

Advanced Computational Methods, NGS Tools, and Software for Mammalian
Systems Biology

Mohamood Adhil, Mahima Agarwal, Prahalad Achutharao, and Asoke K. Talukder

Abstract
Improvements in animal health, productivity, or disease resistance require an
understanding of functional genomics, which can be obtained using multiscale
multi-omics data analysis. Multiscale omics data analysis requires the integra-
tion of bottom-up and top-down approaches to consider both the data available
from an experiment as well as the existing knowledge of functional properties
and interactions in the biological system. Such a comprehensive analysis is now
possible through the integration of multi-omics data generated across multiple
scales from genomics, transcriptomics, proteomics, and metabolomics experi-
ments. This type of systems biology-based analysis requires the use of different
statistical methods and extensive network analysis fine-tuned for the purpose of
identifying the key components associated with the required phenotypes. This
will require integration of powerful statistical techniques with computational
platforms capable of handling the large data volumes and the integrative nature
of the analysis. In this chapter, we discuss all these aspects of systems biology
with some of the tools and techniques.

M. Adhil • M. Agarwal • P. Achutharao • A.K. Talukder (*)
InterpretOmics India Pvt. Ltd., Bangalore, India
e-mail: asoke.talukder@interpretomics.co

© Springer International Publishing Switzerland 2016
H.N. Kadarmideen (ed.), Systems Biology in Animal Production and Health, Vol. 1,
DOI 10.1007/978-3-319-43335-6_6

1 Introduction

A living cell is like a working factory which uses several organelles to produce
thousands of proteins for specific functions. The whole process is very complex and
even minute differences may result in different phenotypic characteristics, including
diseases. To understand the system of a whole cell, we need to study different
kinds of molecular data for DNA, genes, RNA, proteins, metabolites, and their
interactions and interrelationships. A computational systems biology approach
through statistics and mathematical models will help us to integrate these molecular
data sets and obtain meaningful insights. This will enhance our understanding about
the mechanism of life and will enable useful modifications, be it for disease management,
desired agricultural productivity, or desired animal phenotype.
Norman Borlaug, a biologist, was awarded the Nobel Peace Prize in 1970 and is
credited with saving the lives of more than a billion people (Hesser 2006). He led
the Green Revolution through his contribution to the production of high-yield,
disease-resistant wheat.
Borlaug and his team created a special variety of dwarf wheat that produced about
three times more grain than the traditional varieties. It consumed less water and
fewer natural resources and had increased resistance to a wide spectrum
of plant pests and diseases. They were able to do so due to an understanding of
the genetic mechanisms involved in development of the desired phenotypes. Omics
sciences today are able to take us a step further and help us understand, in much
more detail, what is happening inside a living system through generation of massive
quantities of experimental data. These data are generally called “big data” because
they are huge in size and complexity. Supercomputers and complex computational
algorithms are required to process and interpret these biological data. Increasing use
of cloud computing solutions will make the data volume anticipation and handling
problem much more tractable (Stein 2010). Cloud computing with computational
algorithms will be able to make these mechanistic models available and accessible
to all the biologists across the globe.
Figure 1 shows a very simplified model of the basis of omics sciences. The DNA
in the genome produces mRNA, which is converted into protein products. These
proteins then participate directly or indirectly in different complex biochemical
reactions and modulate the production and consumption of metabolites. Through
the complex interplay of these processes, the observable phenotypes develop. To
answer any question in biology, be it for creating a drought-resistant crop, a high-
productivity domesticated animal, or a drug for treating a disease, one needs to
understand the genotype–phenotype relationship. Traditionally, it was only possible
to measure the phenotypic outcome, because it was visible to the naked eye.
However, with the help of omics science, one can now measure phenomena
at the molecular level. It has become possible to study how molecules inside the
cells interact, their shapes, their sizes, what they produce, how they can be modified,
etc., so that we can make changes in some of these molecules, their patterns, and
behaviors and use them to our advantage with manageable risk.
Omics science is transforming biological science into a mechanistic science so that
we can predict outcomes with a higher level of accuracy. In this chapter, we discuss
how mathematics and computational science are used to extract meaningful knowledge
from omics data. As a working example, we use a multi-omics multiscale toolset named iOMICS
(http://iomics.interpretomics.co), which combines these various kinds of biological big data to
extract knowledge and actionable insights. It will help you, a biologist, to understand
how to create hypotheses from genome-scale omics data and establish cause–effect
or genotype–phenotype relationships in a biological system.

[Figure 1: schematic linking Genome (SNP, CNV, LOH, mutation, genomic rearrangement),
Epigenome (DNA methylation, histone modification, transcription factor binding),
Transcriptome (gene expression, small RNA, alternative splicing), Proteome (protein
expression), and Metabolome (metabolites) through transcription, translation, and
post-translational modification (covalent and enzymatic modification). Together with
the environment, these layers lead to the phenotype (cancer and other diseases and
disorders) and are combined through an integration approach.]

Fig. 1 Genome to phenotype relationship

2 Multiscale–Multi-omics Data

Multiple types of data are being generated with the help of high-throughput
technologies like next-generation sequencing (NGS) (Schuster 2007), microar-
ray (Schena et al. 1995), mass spectrometry (Aebersold and Mann 2003), etc.
These data can be for a specified region of interest or at the genome scale. These
experiments generate data that could be (i) genomic—whole genome sequenc-
ing, exome sequencing, targeted sequencing, and genotyping arrays; (ii) tran-
scriptomic—RNA sequencing, small RNA sequencing, gene expression, and
miRNA expression arrays; (iii) epigenomic—transcription factor binding and
chromatin regulators, generated using both sequencing and array techniques;
(iv) proteomic—protein expression array, mass spectrometry, enzyme-linked
immunosorbent assays (ELISA); and (v) metabolomic—mass spectrometry,
chromatography, etc.
Omics data can be grouped into two major categories:

1. Perishable or consumable data
2. Persistent background data

These data types often need to be combined in order to generate valuable infor-
mation regarding the biological system (Talukder 2015). This information can then
be used to develop strategies for modifying system properties (Agarwal et al. 2015;
Adhil et al. 2015). Their properties are described below.

Perishable Data This includes high-throughput data that are generated from biological
experiments. These data are analyzed using various tools, and knowledge
value is derived from them. When there are multiple data sets, they are called replicates.
There are two different types of replicates:

1. Technical replicates—In this type, the same sample is used multiple times to
generate multiple sets of data. This is useful to determine any instrument/
technology-related artifacts and to establish the variability of the technique.
2. Biological replicates—In this type, multiple samples are taken from different
sources, in order to establish the biological variability between samples. This is
useful for discovery of specific information about the specimens.

Cohort level data: In a cohort, multiple samples are analyzed by taking biospeci-
mens from each member of the cohort. Depending on the experimental study design,
the cohort size is chosen. Usually, it is desirable to have 25 or more samples.

Persistent Background Databases In order to make useful and meaningful inferences
from the data, background population-level data from many different databases
are required. These background data span multiple scales,
ranging from reference genomes to interactome data. There are many databases that
contain information on genes, exons, introns, intergenic regions, known disease-causing
mutations, gene functions, drugs, diseases, and associations between them.
Biological databases can also be searched through the special yearly database issue of the journal
Nucleic Acids Research (NAR).

3 Known and Commonly Used Biological Data Representations

Vast amounts of multiple types of omics data are generated and stored in different
formats. To process these data in order to obtain functional meaning, one needs to
understand the format in which the data are stored. The different types of data,
such as nucleotide or protein sequence, annotation of genes, variation (mutations
and structural variations), protein structures, etc., are all stored in different formats,
appropriate for the data type (https://genome.ucsc.edu/FAQ/FAQformat.html).
Biological databases also allow you to download large volumes of data in
formats like tab-delimited text (.txt), XML (Mesiti et al. 2009), SQL or
PostgreSQL dumps of relational databases (Letovsky 1999), and Systems
Biology Markup Language (SBML) (Hucka et al. 2003). Web Ontology Language
(OWL) (McGuinness and Van Harmelen 2004) and Resource Description Framework
(RDF) (McGuinness and Van Harmelen 2004) are knowledge representation
languages used to define classes of objects, arranged in hierarchies, and the relationships
between the objects. The use of standardized data formats enables the transfer
and use of information across groups. Some commonly used data formats are
given below.

FASTA Most genome reference files are in this format. A FASTA file contains
nucleotide or protein sequences, each preceded by a single description line that
gives the name or description of the sequence. The description line is distinguished
from the sequence by a greater-than symbol (>) in the first column. An example
protein sequence in FASTA format is given below:

>gi|296803301|gb|ADH51719.1| interleukin-2 [Camelus dromedarius]
MYKLQFLSCIALTLALVANSAPTLSSTKDTKKQLEPLLLDLQFLLKEVNNYENLKLSRMLTFKFYMPKKA
TELKHLQCLMEELKPLEDVLNLAQSKNSHLTNIKDSMNNINLTVSELKGSETGFTCEYDDETVTVVEFLN
KWITFCQSIYSTLT
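
As a minimal sketch of how such a file can be read, the following Python function groups each description line with the sequence lines that follow it. The function name `parse_fasta` and the in-memory example are illustrative assumptions; they are not part of any tool named in this chapter.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # description line without the ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


example = """>gi|296803301|gb|ADH51719.1| interleukin-2 [Camelus dromedarius]
MYKLQFLSCIALTLALVANSAPTLSSTKDTKKQLEPLLLDLQFLLKEVNNYENLKLSRMLTFKFYMPKKA
TELKHLQCLMEELKPLEDVLNLAQSKNSHLTNIKDSMNNINLTVSELKGSETGFTCEYDDETVTVVEFLN
KWITFCQSIYSTLT
"""

records = parse_fasta(example.splitlines())
for description, sequence in records.items():
    print(description, len(sequence))
```

For large reference genomes, a streaming reader or an established library would normally be preferable to holding every sequence in memory.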

FASTQ Most raw NGS sequencing output is stored in FASTQ format. This
format stores both the sequence, as in FASTA files, and a quality
score for each base in the sequence. FASTQ contains four lines per NGS sequence
read. The first two lines contain the sequence identifier (description), which begins
with “@”, and the sequence. The third line begins with a “+” and may optionally
repeat the sequence identifier, whereas the fourth line contains the quality
scores, encoded as one character per nucleotide. An example FASTQ sequence is given as
follows:

@SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAG
+SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGG
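
The four-line layout can be sketched in Python as follows. The function name `parse_fastq` is hypothetical, and the decoding assumes the Phred+33 (Sanger) quality encoding. This choice is consistent with the example above, where the character "6" would give a negative score under a +64 offset, but it is still an assumption and should be checked for each data set.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, Phred scores) for each four-line record."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        head, seq, plus, qual = cleaned[i:i + 4]
        if not head.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield head[1:], seq, scores


example = """@SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAG
+SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGG
"""

reads = list(parse_fastq(example.splitlines()))
for name, seq, scores in reads:
    print(name, len(seq), min(scores), max(scores))
```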

General Feature Format (GFF) This file contains one line per feature, each containing
nine columns of data: seqname (chromosome), source, feature (e.g., gene),
start, end, score, strand, frame, and attribute (additional information). An example of
this format is given below.

##gff-version 3
##sequence-region 11 1 135086622
11 Ensembl gene 70203163 70207390 . + . ID=ENSG00000168040;Name=ENSG00000168040;biotype=protein_coding
11 Ensembl gene 70206291 70207390 . - . ID=ENSG00000254721;Name=ENSG00000254721;biotype=antisense
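
A short Python sketch of reading one such feature line is given below. The names `Feature` and `parse_gff_line` are hypothetical. The sketch assumes that the nine columns are separated by tab characters, as in GFF files on disk, and that attributes are written as `key=value` pairs separated by semicolons, as in the example above.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_gff_line(line: str) -> Feature:
    """Split one tab-delimited GFF feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 columns, found %d" % len(cols))
    # The ninth column holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)


# First feature of the example above, rebuilt with explicit tab separators.
line = "\t".join([
    "11", "Ensembl", "gene", "70203163", "70207390", ".", "+", ".",
    "ID=ENSG00000168040;Name=ENSG00000168040;biotype=protein_coding",
])
gene = parse_gff_line(line)
# GFF coordinates are 1-based and inclusive, so length = end - start + 1.
print(gene.attributes["biotype"], gene.end - gene.start + 1)
```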

Variant Call Format (VCF) Information regarding sequence variants, such as
mutations and SNPs, is stored in the VCF format. Files in this format carry the
“.vcf” extension and are generated by variant calling and analysis software. A VCF
file describes sequence variation with respect to a particular reference genome. It
contains meta-information lines, followed by a header line and then one line per
variant. The meta-information lines begin with “##”; the “##fileformat” line is
required, and further lines may describe the entries used within the info field.
Each variant line contains the fields chromosome, position, ID, reference allele,
alternate allele, quality, filter, info, and, when genotypes are present, format.
The first five columns describe the variant: chromosome, position, a unique ID
(such as an rsID), and the reference and alternate alleles. The quality and filter
fields provide information regarding the alternate allele call. The info field may
contain various other annotations for the variants, such as alternate allele
frequencies in different populations, allele count, codon change, etc. The format
column specifies the layout of the per-sample columns that follow it, which hold the
sample genotypes together with additional information such as genotype
probabilities, Phred-scaled quality scores, and coverage.
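
The structure described above can be illustrated with a small Python sketch. The function `parse_vcf` and the two-line example are hypothetical. The record is loosely modelled on the example in the VCF specification and is not taken from real data. Production analyses would normally rely on an established VCF library rather than a hand-written parser.

```python
from typing import Dict, List


def parse_vcf(lines: List[str]) -> List[Dict[str, object]]:
    """Parse VCF text lines into one dictionary per variant."""
    records: List[Dict[str, object]] = []
    columns: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#"):
            columns = line[1:].split("\t")  # header line: #CHROM POS ID ...
            continue
        values = line.split("\t")
        record: Dict[str, object] = dict(zip(columns, values))
        info: Dict[str, object] = {}
        for entry in values[7].split(";"):
            key, _, value = entry.partition("=")
            info[key] = value if value else True  # flags carry no value
        record["INFO"] = info
        if len(columns) > 9:
            keys = values[8].split(":")  # FORMAT column, e.g. GT:GQ:DP
            record["samples"] = {
                name: dict(zip(keys, values[idx].split(":")))
                for idx, name in enumerate(columns) if idx > 8
            }
        records.append(record)
    return records


# Hypothetical example with one variant and one sample.
example = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "sample1"]),
    "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
               "DP=14;AF=0.5;DB", "GT:GQ:DP", "0/1:48:8"]),
]
records = parse_vcf(example)
print(records[0]["ID"], records[0]["INFO"], records[0]["samples"])
```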

Protein Data Bank (PDB) This format contains information about the three-dimensional structures
of molecules, such as atomic coordinates, side chain conformers, secondary
structure, and atomic connectivity. The structure information contained within a
PDB file is given under different line types depending on the information. For
example, SEQRES lines give the amino acid sequence of the protein, whereas the
HELIX, SHEET, and TURN lines provide information regarding the secondary
structure elements. A 3-D rendering of the structure in space is achieved by the
information available in the ATOM and HETATM lines, which contain coordinate
information.
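
ATOM and HETATM lines are fixed-width records, so they are read by column position rather than by splitting on whitespace. The Python sketch below uses the column positions of the standard PDB format for the record name, serial number, atom name, residue name, chain, residue number, and x, y, z coordinates. The function name `parse_atom_line` is hypothetical, and the record is synthetic: it is assembled field by field for illustration and is not taken from a deposited structure.

```python
def parse_atom_line(line: str) -> dict:
    """Read the main fixed-width fields of a PDB ATOM/HETATM record."""
    return {
        "record": line[0:6].strip(),        # columns 1-6
        "serial": int(line[6:11]),          # columns 7-11
        "atom": line[12:16].strip(),        # columns 13-16
        "residue": line[17:20].strip(),     # columns 18-20
        "chain": line[21],                  # column 22
        "residue_number": int(line[22:26]), # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
    }


# Synthetic ATOM record built with explicit field widths (illustrative values).
line = (
    f"{'ATOM':<6}{1:>5} {' N':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)
atom = parse_atom_line(line)
print(atom)
```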

BED The BED file contains feature track information, where each row describes
one feature and the columns give the chromosome, start, end, feature name, and
score. It is used to represent intervals on a chromosome and their respective
significance. It is widely used in ChIP-seq analysis to represent peaks (features)
along with their coordinates and scores. An example of a few records is shown
below, where the first line contains a description of the data and the others
contain the chromosome and interval of each feature.

track name="sc_ori" description="Data from OriDB: All available replication origins in S. cerevisiae" db=sacCer1
chr1 650 1791 ARS102
chr1 6136 7136 ARS102.5
chr1 7998 8548 ARS103
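
The records above can be read with a few lines of Python. The function name `parse_bed` is hypothetical. The sketch relies on the BED convention that start coordinates are zero-based and end coordinates are exclusive, so the length of an interval is simply end minus start.

```python
from typing import Iterable, Iterator, Tuple


def parse_bed(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, str]]:
    """Yield (chrom, start, end, name), skipping track and browser lines."""
    for line in lines:
        fields = line.split()
        if not fields or line.startswith("#"):
            continue
        if fields[0] in ("track", "browser"):
            continue  # descriptive lines, not features
        name = fields[3] if len(fields) > 3 else ""
        yield fields[0], int(fields[1]), int(fields[2]), name


example = [
    'track name="sc_ori" description="Data from OriDB: All available '
    'replication origins in S. cerevisiae" db=sacCer1',
    "chr1 650 1791 ARS102",
    "chr1 6136 7136 ARS102.5",
    "chr1 7998 8548 ARS103",
]

intervals = list(parse_bed(example))
lengths = [end - start for _, start, end, _ in intervals]
print(intervals, lengths)
```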

Systems Biology Markup Language (SBML) The systems biology markup language
is an example of a standard data representation that was developed to enable
sharing of models between different research groups. Models such as metabolic models
and cell signaling networks are represented in the SBML format for standardization
and exchange. It is based on XML and is widely used to represent
biological networks.
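
Because SBML is XML, a model can be inspected with a generic XML parser. The Python sketch below reads a deliberately incomplete toy document (it omits compartments and several attributes that a valid SBML model requires) and lists its species and reaction edges. The model content and identifiers are hypothetical. The namespace shown is the SBML Level 3 Version 1 core namespace, and a dedicated SBML library would be the usual choice for real models.

```python
import xml.etree.ElementTree as ET

# Toy, incomplete SBML-like document used only for illustration.
TOY_SBML = """
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_pathway">
    <listOfSpecies>
      <species id="glucose" compartment="cell"/>
      <species id="g6p" compartment="cell"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="hexokinase" reversible="false">
        <listOfReactants><speciesReference species="glucose"/></listOfReactants>
        <listOfProducts><speciesReference species="g6p"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
""".strip()

NS = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

root = ET.fromstring(TOY_SBML)
model = root.find("sbml:model", NS)
species = [s.get("id") for s in model.findall("sbml:listOfSpecies/sbml:species", NS)]

# Each edge is (reactant, reaction, product): a simple network view of the model.
edges = []
for reaction in model.findall("sbml:listOfReactions/sbml:reaction", NS):
    reactants = [r.get("species") for r in
                 reaction.findall("sbml:listOfReactants/sbml:speciesReference", NS)]
    products = [p.get("species") for p in
                reaction.findall("sbml:listOfProducts/sbml:speciesReference", NS)]
    edges.extend((r, reaction.get("id"), p) for r in reactants for p in products)

print(species, edges)
```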

4 Evidence-Based Reasoning

Biological data are mostly empirical, and the results are primarily inferences from
intuition, observed phenomena, and experimental data. This involves inductive reasoning,
which makes broad generalizations from specific observations. In empirical
science or evidence-based practice (EBP), you make many observations, then you
discover a pattern, and then you generalize from these observations. Finally, you combine
the conclusions from different experiments and build a theory that answers the
question “why.” Once the mechanistic model is created using data generated in the
past, you are in a position to apply this model to data that you
have not seen, that is, to answer what is likely to happen in the future.
In evidence-based reasoning, general results are obtained from specific data,
whereas in deductive science, specific results are obtained from general principles. Once
the property of the population is known through inductive reasoning, we can deduce
the property of any member of the population.
In twenty-first century biology, you perform experiments, collect data and use
inductive reasoning with the help of bioinformatics and mathematics to arrive at a
general mechanistic model. Once the mechanism is known, you use deductive rea-
soning to predict the outcome based on the theory (model you created).
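
The two steps can be made concrete with a deliberately simple sketch: a straight line is fitted to observations (induction) and then used to predict an unseen case (deduction). The data values and variable names are hypothetical, and a linear model is only a stand-in for the far richer mechanistic models discussed in this chapter.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical observations: perturbation level and measured response.
dose = [1.0, 2.0, 3.0, 4.0]
response = [3.0, 5.0, 7.0, 9.0]

# Induction: generalize from specific observations to a model.
intercept, slope = fit_line(dose, response)

# Deduction: apply the general model to a case that was not observed.
predicted = intercept + slope * 5.0
print(intercept, slope, predicted)
```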

5 Integration Through Reduction

It is difficult to create a mechanistic model of a complex system, so the complex
system is broken into smaller, manageable constituent parts. In biological systems,
life at the molecular level includes DNA, genes, proteins, metabolites, etc. In
animal breeding, for example, you may want to increase the production of milk in cows.
In agriculture, you may want to change the height of a crop or increase the yield. To
achieve these, you use perturbation techniques where the effect of the perturbation
is the change in phenotype. You study these relationships between perturbation
and phenotype in small manageable constituent parts. Finally, you need to obtain
results for the tissue, the organ, and the species. Therefore, you need to integrate the
knowledge generated from the individual experiments. In the process of reduction,
the whole view is lost; therefore, during integration, you must take all these leaf-level
pieces of information and combine them into a functional phenotype (Fig. 1).

6 iOMICS for Genomics Data Analysis

Various tools are available that are capable of handling large volumes of
“omics” data. These tools run quality checks on experimental data, compute
descriptive statistics, and finally derive insight from the data (Kohl et al. 2014).
These insights then need to be integrated to obtain a complete system-level picture
that links to the phenotype. The iOMICS bioinformatics tool integrates such tools to
obtain isolated insights and then performs a cross-scale integration. The
iOMICS platform is a next-generation genomics software suite
designed to process individual and cohort samples (hundreds of samples) of various
types of omics data such as genome, epigenome, transcriptome, and proteome,
to derive knowledge responsible for the phenotype of interest. The advantage
of iOMICS is that the results are reproducible and can be integrated with other
knowledge at a later date with the proper context. NGS and microarray are the two
high-throughput techniques used to generate omics data sets. In iOMICS, various
applications, or apps (Fig. 2), are available, which can analyze both NGS and
microarray data. Some of the tools that are used in iOMICS apps are well accepted
by the scientific community and have been extensively tested and validated using various
experimental data sets. iOMICS is a functional genomics integrative biology platform.
It combines proprietary bioinformatics algorithms with some of the
well-known tools available in R/Bioconductor, Python, and the open domain.

Fig. 2 iOMICS apps and their higher-level organization (NGS, microarray, and integrative biology)

Advantages of using the iOMICS suite over individual tools include the following:
(i) less time to process cohort samples to derive knowledge; (ii) a graphical user
interface to choose parameters, where help tags are available for different options,
making the application much more user-friendly; (iii) reproducibility of results for the
same samples; (iv) management of raw data, processed data, and results; (v) on-the-fly
filtering of results based on user needs (e.g., significant variation at a specific
gene based on an odds ratio cutoff, or differentially expressed genes based on a
p-value cutoff); (vi) interactive visualization of results using dynamic plots, genome
browser, and networks; (vii) convenience in running entire workflows with minor
changes and tracking the parameters used, compared to running the tools manually one
after another; and (viii) automatic handling of the different formats and files
required by individual tools.
The iOMICS genomics suite is available in on-premise and cloud versions. In the
on-premise version, the software is installed on local servers. The computation power
and memory required grow with the number of samples and the data volume. In most of
the steps, parallel computing is performed
based on the available processors in order to reduce the run time. On-premise versions
are preferred when large amounts (hundreds to thousands of samples) of omics data
sets are generated locally and repeatedly. The cloud version of iOMICS has several
advantages compared to the on-premise version: (i) local hardware
infrastructure is not necessary, (ii) no software installation is needed, (iii) no
software updates are needed at the user end, (iv) it can be accessed at any time and from
anywhere, and (v) instances are scaled up or down automatically based on the data volume.
In the following sections, we describe how you can use the various functionalities in
the iOMICS suite to analyze individual omics data sets.

6.1 Genome

Two major analysis pipelines available in iOMICS for analyzing genome sequence
are assembly and variation analysis. In iOMICS, both NGS and array data can be
analyzed. The array technology is mainly used for the validation of variations—it
cannot be used to find de novo variations. The genome sequence analysis apps
include (i) DNA-seq for analyzing whole genome sequence data and (ii) Exome-seq
for analyzing whole exome sequence data. For microarray data, the genotyping
app is available for SNP and CNV analysis.
These apps have multiple functional modules such as “quality metrics,” “refer-
ence assembly,” “variant detection,” “population genetics,” and “pathway analysis.”
Some modules require a cohort (case/control) data set, such as “population genet-
ics” where the genotype-phenotype association is performed to obtain significant
genotypes associated with the phenotype.
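
The chapter does not specify which association statistics iOMICS computes, so the following Python sketch only illustrates the general idea of a case/control genotype-phenotype association: an odds ratio and a two-sided Fisher's exact test on a 2 x 2 table of carrier counts. The function names and all counts are hypothetical.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the table [[a, b], [c, d]]:
    a, b = carriers, non-carriers among cases;
    c, d = carriers, non-carriers among controls."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for a 2 x 2 table."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers among the cases.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical counts for one variant in 50 cases and 50 controls.
cases_carrier, cases_noncarrier = 30, 20
controls_carrier, controls_noncarrier = 15, 35

or_value = odds_ratio(cases_carrier, cases_noncarrier,
                      controls_carrier, controls_noncarrier)
p_value = fisher_exact_two_sided(cases_carrier, cases_noncarrier,
                                 controls_carrier, controls_noncarrier)
print(or_value, p_value)
```

In a genome-wide setting, the resulting p-values would additionally need correction for multiple testing before variants are filtered by a cutoff.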

6.2 Epigenome

There are various techniques available for mapping genome-wide epigenetic
information, which includes chromatin regulators, transcription factor binding sites, and
DNA–RNA hybrids. This information is measured using NGS (ChIP-seq) and
microarray (ChIP-on-chip) under different experimental conditions.
In iOMICS, the ChIP-seq and ChIP-on-chip apps can be used to analyze all types
of genome-wide epigenetic information. The modules in the ChIP-seq app
include “quality metrics,” “read alignment,” “peak alignment,” and “peak identification.”
The ChIP-on-chip app is based on array technology. It contains modules
such as “data normalization” and “peak calling.” The common modules for both
the apps are “peak annotation,” “common/unique peaks,” and “motif identification.”
Case/control comparisons can also be performed to identify
differentially bound sites and to perform functional annotation of those regions.
Other interesting results from ChIP-seq/ChIP-on-chip analyses include the average
open reading profile of the binding sites, which shows the distribution of peaks
in the intragenic region, and an overall binding site profile near the transcriptional
start site (TSS).

6.3 Transcriptome

Transcriptomics data from NGS can be used to identify the transcripts, genes, coex-
pression, and differential expression across samples. In array-based technology,
only coexpression and differential expression analysis can be performed. In
iOMICS, the RNA-seq (NGS) and Gene Expression (Array) app can be used to
perform transcriptomics data analysis. The modules in the RNA-seq app include
“splice aware alignment,” “assembly and expression quantification,” “gene predic-
tion,” and “differential expression analysis.” The modules in the array-based gene
expression app include “data normalization,” “differential expression,” “coexpres-
sion,” and “functional enrichment.” The differential expression analysis is used to
understand the conditional regulation of the genes based on the phenotype.
Coexpression analysis helps to identify the genes coexpressed under a specific
condition. Functional enrichment helps to identify the biological processes, cellular
components, and molecular functions associated with the significant genes.

6.4 Small RNA

Small RNA are noncoding RNA involved in regulating the translation of target
RNA. The miRNA (NGS) app in iOMICS can be used to identify the miRNA, char-
acterize and find miRNA targets, and compute differential expression of miRNAs
between case and control. The modules in the app include “miRNA identification,”
“conservation study,” “target identification,” and “differential expression analysis.”
Conservation study helps to identify the most conserved miRNAs across the species
and their phylogeny. Target identification helps to identify mRNA-miRNA
interaction.

6.5 Phenotype Modeling

The final goal of any omics data analysis is to arrive at a mechanistic model for the
phenotype. For example, an important application of integrating genotype with phe-
notype is marker-assisted selection (MAS) in breeding processes. MAS is used to
select the trait of interest, based on the molecular marker tightly linked to the trait.
Molecular markers used for this purpose include genes, SNPs, microsatellites, etc.
You can produce a crop that is significantly resistant to a particular disease, or you can
produce a fruit with high nutritional value, or you can do animal breeding to produce
healthy offspring. Once the markers are identified and validated, a statistical method
called quantitative trait loci (QTL) analysis is used to identify significant markers
responsible for the phenotype of interest. First, you construct the genetic map using
the set of molecular markers and then you perform a linkage analysis. The linkage
analysis is performed by using the phenotype, which is a quantitative measurement
(e.g., color intensity, weight, height, and production), along with the genetic linkage
map. The resulting loci can then be used to identify the significant markers
responsible for the trait.
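
The marker-trait association step can be illustrated with a very small sketch. It is not a full QTL or linkage analysis: it simply regresses a quantitative phenotype on the genotype at a single marker. The sample size, allele frequency, and effect size are simulated assumptions chosen for illustration only.

```r
# Simulated example: all values here are illustrative assumptions
set.seed(1)
n <- 200

# Genotype at one marker, coded as the number of copies of allele "A" (0, 1, or 2)
genotype <- rbinom(n, size = 2, prob = 0.4)

# Quantitative phenotype with an additive effect of 1.5 units per allele copy
phenotype <- 50 + 1.5 * genotype + rnorm(n, sd = 2)

# Single-marker regression of the phenotype on the genotype
fit <- lm(phenotype ~ genotype)
effect <- coef(fit)["genotype"]
p.value <- summary(fit)$coefficients["genotype", "Pr(>|t|)"]

print(effect)   # estimated effect per allele copy
print(p.value)  # significance of the marker-trait association
```

In a real study the same test would be repeated for every marker along the genetic map, which is why the multiple-testing corrections described in Section 8.3.3 matter.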

Fig. 3 Integrative top-down and bottom-up approach (diagram labels: experimental
data analysis, top-down approach; systems biology integrative analysis, insights;
reference knowledgebases, bottom-up approach)

Such phenotype-associated analyses can be run for each individual type of
experimental data. These steps lead to the generation of isolated knowledge, all of
which together lead up to a whole system. However, these need to be combined in
order to be able to obtain a complete mechanistic model for the phenotype which
can be applied to solve biological problems. In addition, there may be gaps in the
information which a single experiment can provide. This may be due to experimen-
tal or sampling limitations. Therefore, the data from experiments need to be supple-
mented with existing knowledge in order to enhance the results. Such an analysis is
known as a bottom-up analysis. Because of the complex nature of the biological
system, the use of existing knowledge can refine the results to obtain more
generalizable insights. Therefore, in order to construct a more complete phenotypic
model, a holistic approach that combines experimental knowledge with existing
knowledgebases is required. Such an analysis integrates a top-down exploratory data
analytics approach with a bottom-up integrative approach to combine experimental
results with reference knowledgebases. This type of approach is shown in Fig. 3.
The resulting mechanistic phenotype model can be used to study, predict, and even
manipulate phenotypes.
iOMICS contains downstream apps for integrative analysis from the individual
data sets. This stage combines the knowledge from the different sets of data to
provide useful actionable insights. These include both top-down analyses from
different experimental data sets and bottom-up integrative analyses. Because of
the complex nature of biological systems, this stage cannot follow a “one size fits
all” paradigm. Instead, it must be flexible, allowing the option of selecting the
type of integration required to answer the biological question. For this purpose,
different types of integrative apps are available, which combine different types of
data such as DNA–RNA (Shabalin 2012), RNA–ChIP (Wang et al. 2013), and
miRNA–RNA.
The results obtained from the integrative modeling approaches, such as those
included in iOMICS, need to be finally applied in an experimental setup to obtain
the desired phenotype. This phenotype may be in the form of disease treatment,
increased productivity, improved quality of produce, etc. In addition, in certain
cases, the final outcome from integrative pipelines may need to be further enhanced
before it can be used. In such cases, certain statistical tools and network modeling
approaches can be used. Reference databases and commonly used statistical and
network analysis techniques are described in the following sections.

7 Reference Biological Databases

Biological databases are used for the functional annotation and validation of results
to obtain meaningful biological insights. The biological databases are commonly
classified into the following three categories:

1. Primary databases: These databases contain sequence or structure data directly
   obtained from experiments. Researchers from around the world share their raw
   experimental data for public use. Data can be downloaded from these databases
   in formats such as FASTQ, CEL, TIFF, BAM, etc. Examples include GenBank,
   DDBJ, PDB, etc.
2. Secondary databases: These databases contain information derived from the
   analysis of data in the primary databases. Databases such as OMIM, PROSITE,
   and Pfam are some examples of secondary databases. The data can be downloaded
   in various common formats such as MySQL, XML, SBML, PostgreSQL,
   BioPAX, and tab/comma separated flat files.
3. Composite databases: These databases combine different primary databases.
   Common examples include NCBI and NRDB (nonredundant database). NCBI
   also contains secondary databases such as OMIM in addition to primary sequence
   databases. NRDB is a composite of SWISS-PROT and TrEMBL.

The primary databases usually contain archival information, whereas curated
information is available in secondary databases. These databases are used for
multi-omics annotations. A few important databases commonly used for annotation
and analysis are as follows: (i) dbSNP contains genetic variation information
across different species along with important attributes such as population-based
allele frequency, heterozygosity, PubMed ID, gene location, etc.; (ii) ClinVar
contains the clinical significance (benign, likely pathogenic, pathogenic) of variations
with respect to phenotypes (diseases); (iii) NCBI SRA contains next generation
sequencing data along with phenotype information; (iv) NCBI GEO contains
expression data (mostly microarray based) along with phenotype information; (v) The
Cancer Genome Atlas (TCGA) contains raw and processed genetic data for almost
all cancer types and subtypes; (vi) BioGRID, KEGG, and Reactome contain
curated biological interactions and pathways across different species; (vii) Gene
Ontology Consortium (GO) contains concepts/classes used to describe gene function
such as molecular function, cellular component, and biological process; (viii) UniProt
is a central repository for proteins containing sequence and function information; and
(ix) Expression Atlas contains gene expression patterns under different biological
conditions across species. Many other useful databases are also available at NCBI
and EMBL.

8 Statistical Techniques for Experimental Data Analysis

This section details some of the basic statistical concepts and tools that will come in
handy for analyzing data for insight generation. These insights usually take the form
of biomarkers.

8.1 Variables

In a biological context, the variables of interest are potential biomarkers. These may
be gene names, variants, proteins, or metabolites. The variable being studied repre-
sents some values that may relate to properties such as gene expression, metabolite
consumption, etc. When the variable has a fixed value, which is assumed to be
measured without error and does not change from observation to observation, it is
called a fixed variable. By contrast, the variables of interest usually studied are ran-
dom variables. Their values are defined by a set of possible values following a
particular probability distribution. Depending on the variable being studied, it may
be numeric or categorical.

Numeric Random Variables Numeric random variables refer to numeric values
and can be continuous or discrete. A continuous variable, such as a normalized gene
expression level, can take any value within an interval, whereas a discrete variable,
such as a sequencing read count, can take only distinct, countable values.

Categorical Variables Categorical variables refer to values which fall into
categories. There may be two or more categories, and there is no intrinsic order to
the categories. For example, functional properties of genes are categorical variables.

Independent and Dependent Variables Dependencies may exist between variables.
For example, consider a simple case where the expression of proteins is
dependent on gene expression. Considering no other dependencies, you would expect
the gene expression to be an independent variable, whereas protein expression is
dependent.
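
The variable types described above map directly onto R data types. The following sketch uses made-up values (assumptions, not data from this chapter) to show a continuous numeric variable, a discrete numeric variable, and a categorical variable stored as a factor.

```r
# Illustrative values only (assumptions, not data from the chapter)
expression <- c(2.31, 0.87, 4.02, 1.56)      # continuous numeric variable
read.count <- c(120L, 0L, 43L, 7L)           # discrete numeric variable (counts)
go.category <- factor(c("molecular function", "biological process",
                        "cellular component", "biological process"))  # categorical

# Check how R stores each variable
print(is.numeric(expression))
print(is.integer(read.count))
print(is.factor(go.category))

# A factor records the set of categories (levels) and how often each occurs
print(nlevels(go.category))
print(table(go.category))
```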

8.2 Descriptive Statistics



It is helpful to summarize the data by using quantitative measurements or visuals
(plots). In biology, descriptive statistics are fundamental to summarizing the
processed data or results from omics data sets. Some of the common quantitative
measurements include the mean, median, mode, standard deviation, and interquartile
range. For visuals, the most common plots are box plots, histograms, and scatter
plots. For example, R code is given below to check the pattern of expression of a
gene in two tissues using quantitative measurements and visuals. For illustration, we
have used a simulated data set (“colon.SOX13,” “breast.SOX13”).

# R-code
# Sample data

> colon.SOX13 <- c(1.84, 3.31, 1.61, 1.81, 3.2, 1.72, 2.98, 1.44,
1.26, 1.28, 2.75, 2.52, 2.75, 1.68, 2.45, 2.1, 3.3, 2.49, 4.36,
1.27, 0.39, 2.94, 1.78, 0.91, 3.05, 1.59, 1.68, 1.74, 1.68, 0.46,
1.34, 0.58, 1.89, 1.48, 2.26, 1.85, 4.09, 0.68, 2.09, 1.41, 0.66,
2.32, 1.81, 0.12, 2.03, 1.26, 1.4, 1.57, 2.53, 2.69)
> breast.SOX13 <- c(2.01, 0.56, 0.13, 0.07, 0.45, 1.31, 0.66, 1.27,
0.43, 1.67, 1.54, 0.12, 0.48, 0.61, 0.51, 1.3, 1.65, 0.68, 1.39,
1.14, 1.03, 0.88, 0.49, 0.22, 2.07, 0.61, 0.53, 0.37, 0.22, 0.76,
0.62, 0.78, 1.22, 2.97, 1.92, 0.16, 0.62, 1.46, 1.57, 0.33, 2.46,
0.94, 2.5, 1.78, 0.76, 2.55, 0.92, 2.59, 1.34, 0.72)

> summary(colon.SOX13)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.120   1.402   1.795   1.928   2.512   4.360

> summary(breast.SOX13)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.070   0.515   0.830   1.067   1.520   2.970

The summary clearly shows that the “SOX13” gene is expressed at a higher level
in colon tissue than in breast tissue. The values are higher in colon tissue for
every summary statistic (minimum value, first quartile, median, mean, third quartile,
and maximum value).
For visual representation of these data, you can use a box plot or scatter
plot.

# R-code
# Box plot

> tag = c(rep("Colon-Tissue", length(colon.SOX13)), rep("Breast-Tissue", length(breast.SOX13)))
> boxdata = data.frame(Expression = c(colon.SOX13, breast.SOX13), Tissue = tag)
> boxplot(Expression~Tissue, data=boxdata, col=(c("lightblue", "lightgreen")), xlab = "Tissue", ylab = "SOX13-ExpressionValue")

# Scatter plot
# Each point pairs the i-th colon value (x-axis) with the i-th breast value (y-axis),
# so a single colour is used for all points

> plot(colon.SOX13, breast.SOX13, xlab="SOX13-Colon-Tissue-Expression", ylab="SOX13-Breast-Tissue-Expression", col="blue", pch=19)

Both graphs (Figs. 4 and 5) show a clear difference between the two variables
(colon.SOX13 and breast.SOX13). In the box plot, we can see that one data point
in “Colon-Tissue” lies far above the rest of the values and is drawn as an outlier.
This type of summary plot will also help you to find outliers, if present in the data
sets. In omics data sets, these types of summary plots are helpful for visualizing
overall patterns in the data.

Fig. 4 Box plot between two tissues (breast and colon) using SOX13 gene expression
(x-axis: Tissue; y-axis: SOX13-ExpressionValue)

Fig. 5 Scatter plot between two tissues (breast and colon) using SOX13 gene expression
(x-axis: SOX13-Colon-Tissue-Expression; y-axis: SOX13-Breast-Tissue-Expression)
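
The summary statistics and the outlier rule used by a box plot can also be computed directly. The sketch below uses a short made-up vector (an assumption for illustration) and the common convention that points more than 1.5 times the interquartile range beyond the quartiles are flagged as outliers. Note that boxplot() works from hinges, which can differ slightly from quantile() for small samples.

```r
# Hypothetical expression values with one unusually large measurement
x <- c(1.2, 1.5, 1.7, 1.8, 2.0, 2.1, 2.3, 2.6, 2.9, 6.5)

# Common quantitative summaries
stats <- c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
print(round(stats, 3))

# Outlier fences: 1.5 x IQR below the first quartile and above the third quartile
q <- quantile(x, c(0.25, 0.75))
lower <- q[1] - 1.5 * IQR(x)
upper <- q[2] + 1.5 * IQR(x)
outliers <- x[x < lower | x > upper]
print(outliers)

# The points a box plot would draw individually
print(boxplot.stats(x)$out)
```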

8.3 Deriving Value from Hypothesis Testing and Error Estimation

Insight derivation, or biomarker identification, requires the creation and testing of
hypotheses. The experiments are based on certain hypotheses that are then tested on
the basis of the experimental data. Further, based on the results from the experimen-
tal data, you may create hypotheses that you will then test on new data. For exam-
ple, the identification of a biomarker for a phenotype based on experimental data
involves hypothesis creation from the experimental data. When you want to apply
this new biomarker knowledge to new data sets, you test your hypothesis on new
data sets.
The hypothesis depends on the question asked. This could be as simple as
whether or not two sample populations belong to the same overall population. Based
on this, you need to construct two contrasting hypotheses. The null hypothesis
assumes that the samples belong to the same population, whereas the alternate
hypothesis states that they do not. In mathematical terms, the null hypothesis is
represented as H0, whereas the alternate hypothesis is represented as Ha. The
hypothesis is tested using the p-value. In addition, due to the high dimensionality and
complexity of biological data, you also need to consider possible sources of error
which can lead you to the wrong conclusion.

8.3.1 p-Value
The significance of a hypothesis test is measured by the p-value. Mathematically, it
is the probability of obtaining a result at least as extreme as the one observed,
assuming that the null hypothesis is true. You fail to reject the null hypothesis when
the p-value is large and reject the null hypothesis (in favor of the alternate hypothesis)
when the p-value is close to 0. In this context, we define the significance level, which
is the p-value cutoff below which the null hypothesis is rejected. In biology, the
confidence level is generally 95 %, which corresponds to a significance level of 0.05,
so you reject the null hypothesis when the p-value is less than 0.05. A p-value of 0.05
indicates that, if the null hypothesis were true, a result at least this extreme would be
expected in 5 % of experiments; it is not the probability that the null hypothesis is
correct. However, for high specificity, one may reduce the cutoff to 0.001 or lower.
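
One way to see what a p-value measures is a permutation test, sketched below on simulated data. The group sizes, means, and number of permutations are assumptions chosen for illustration. If the null hypothesis is true, the group labels are interchangeable, so shuffling them shows how large a difference arises by chance alone.

```r
# Simulated data: two groups that truly differ in mean (illustrative assumption)
set.seed(42)
group.a <- rnorm(30, mean = 2.5, sd = 1)
group.b <- rnorm(30, mean = 1.0, sd = 1)
observed <- mean(group.a) - mean(group.b)

# Under the null hypothesis the labels carry no information, so shuffle them
pooled <- c(group.a, group.b)
n.perm <- 5000
perm.diff <- replicate(n.perm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:30]) - mean(shuffled[31:60])
})

# Two-sided p-value: share of shuffled differences at least as extreme as observed
# (the +1 terms count the observed labelling itself as one permutation)
p.value <- (sum(abs(perm.diff) >= abs(observed)) + 1) / (n.perm + 1)
print(p.value)
```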

8.3.2 Type I/Type II Errors


Two types of errors may arise during hypothesis testing. When the null hypothesis
is rejected despite being true, it is called a Type I error. On the other hand, when the
test fails to reject the null hypothesis, despite its being false, it is a Type II error.
Therefore, Type I errors give rise to false positives, whereas Type II errors result in
false negatives. This is illustrated in Table 1.
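
Both error rates can be estimated by simulation. The following sketch assumes normally distributed data, 20 samples per group, and a true difference of one standard deviation in the second scenario; all of these are illustrative choices.

```r
set.seed(123)
alpha <- 0.05
n.sim <- 2000

# Scenario 1: both groups come from the same population (null hypothesis is true).
# Every rejection here is a false positive (Type I error).
p.null <- replicate(n.sim, t.test(rnorm(20), rnorm(20))$p.value)
type1.rate <- mean(p.null < alpha)

# Scenario 2: the groups differ by one standard deviation (null hypothesis is false).
# Every failure to reject here is a false negative (Type II error).
p.alt <- replicate(n.sim, t.test(rnorm(20), rnorm(20, mean = 1))$p.value)
type2.rate <- mean(p.alt >= alpha)

print(c(type1 = type1.rate, type2 = type2.rate))
```

The estimated Type I error rate should sit close to the chosen significance level, whereas the Type II error rate depends on the sample size and on the size of the true difference.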

8.3.3 Multiple-Testing Correction


In omics data analysis, you often need to run multiple hypothesis tests
simultaneously. For example, when analyzing data from two cohorts to identify genes as
biomarkers to differentiate the cohorts, you want to test expression levels across
cohorts for all genes. In such a situation, the probability of observing a significant
(low p-value) result due to chance (Type I error) increases. For example,
when you run even ten independent tests simultaneously at a significance level of
0.05, the probability of observing at least one significant result by chance, when all
null hypotheses are true, is given by 1 − (0.95)^10, which is equal to 0.401. Therefore,
there is a 40 % chance of obtaining at least one p-value below 0.05 purely by chance.
In such a case, you need to use a suitable correction method to reduce the likelihood
of Type I errors. Two commonly used methods for multiple testing correction are the
Bonferroni correction and false discovery rate (FDR).
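
The growth of this probability with the number of tests can be checked in a few lines. The helper name fwer below is a hypothetical name introduced for this sketch, and the calculation assumes independent tests with all null hypotheses true.

```r
# Probability of at least one false positive among m independent tests
# (fwer is an illustrative helper name, not a built-in R function)
fwer <- function(m, alpha = 0.05) 1 - (1 - alpha)^m

tests <- c(1, 10, 100, 1000)
result <- data.frame(tests = tests, prob.any.false.positive = round(fwer(tests), 3))
print(result)
```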

Table 1 Hypothesis testing table

                            True null hypothesis            False null hypothesis
Null hypothesis rejected    False positive (Type I error)   True positive
Null hypothesis accepted    True negative                   False negative (Type II error)

Bonferroni Correction In this method, the p-value cutoff is divided by the number
of tests. Thus, in the previous example of 10 tests, the p-value threshold becomes
0.05/10, i.e., 0.005. The probability of at least one Type I error then becomes
1 − (0.995)^10, i.e., approximately 0.049. Although this method reduces the Type I
error, it is highly conservative and increases the chances of Type II errors.
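
In R, the Bonferroni correction can be applied either by lowering the threshold or by inflating the p-values with p.adjust from the stats package. The p-values below are hypothetical.

```r
p.raw <- c(0.0004, 0.003, 0.012, 0.03, 0.20)   # hypothetical p-values
alpha <- 0.05
m <- length(p.raw)

# Option 1: compare the raw p-values with the reduced threshold alpha / m
sig.threshold <- p.raw < alpha / m

# Option 2: multiply each p-value by m (capped at 1) and compare with alpha
p.bonf <- p.adjust(p.raw, method = "bonferroni")
sig.adjusted <- p.bonf < alpha

print(data.frame(p.raw, p.bonf, significant = sig.adjusted))
```

Both options select the same tests; reporting adjusted p-values is often more convenient because the usual cutoff can still be applied.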

FDR and q-Value An alternative to the Bonferroni correction method is the false
discovery rate. FDR gives the expected proportion of false positives amongst the
significant results. The FDR can be used to adjust the p-value in order to control the
number of false positives. This is done by estimating, for each test, the smallest FDR
at which its result would be called significant. This adjusted p-value, called the
q-value, is a better measure of significance in cases of multiple testing.
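
A minimal sketch of FDR adjustment is given below. It assumes the Benjamini-Hochberg procedure (method = "BH" in p.adjust), which is one common way of obtaining FDR-adjusted p-values; other q-value estimators exist and may give slightly different numbers. The p-values are hypothetical.

```r
p.raw <- c(0.0004, 0.003, 0.012, 0.03, 0.20)   # hypothetical p-values

# FDR-adjusted p-values (Benjamini-Hochberg) next to Bonferroni-adjusted ones
p.bh <- p.adjust(p.raw, method = "BH")
p.bonf <- p.adjust(p.raw, method = "bonferroni")

print(data.frame(p.raw, p.bh, p.bonf))

# Number of tests called significant at 0.05 under each correction
print(c(BH = sum(p.bh < 0.05), Bonferroni = sum(p.bonf < 0.05)))
```

The FDR-adjusted values are never larger than the Bonferroni-adjusted ones, which is why FDR control retains more true positives when many tests are run.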

8.4 Commonly Used Tests for Hypothesis Testing

8.4.1 t-Test
The t-test is applied to one or two random variables to test whether population
means are significantly different. The one-sample t-test is used to test whether the
mean of a sample differs from a specified population mean. The two-sample
t-test compares the means of the values of two random variables to identify
whether these represent a true difference in the populations from which they have
been sampled. The t-test assumes that both random variables are distributed
normally. For example, the t-test can be used to compare the expression value of a
gene between two groups. A statistically significant result indicates that the
difference between the expression levels is unlikely to be due to chance alone. In the
following example, the expression levels of the gene “SOX13” are taken for two
tissues. We wish to identify whether there is any difference in the expression levels
of this gene in these tissues. For each group, we have expression values from 50
samples. A t-statistic of large magnitude and a low p-value indicate that the mean
expression levels for the two tissues differ.

# R-code

# t-test to find if there is any significant difference between two cases
# (by default t.test runs Welch's test, which does not assume equal variances)
> t.test.SOX13 = t.test(colon.SOX13, breast.SOX13, alternative = "two.sided")

# t-statistic: position of the observed difference on the t distribution
> t.test.SOX13$statistic
       t
5.186952

# p-value
> t.test.SOX13$p.value
[1] 1.221483e-06

In this example, we see the p-value is 1.22e−06. This means that the expression
levels of the SOX13 gene in these two tissues are statistically different. SOX13 can
therefore be a candidate biomarker: if we examine the expression level of the SOX13
gene in an unknown sample, we may be able to predict whether the sample comes from
the breast or the colon.
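
To make the test less of a black box, the sketch below recomputes the two-sample (Welch) t-statistic, its degrees of freedom, and the p-value by hand and compares them with t.test. The data are simulated under assumed means and standard deviations, not the SOX13 values used above.

```r
# Simulated expression values for two groups (illustrative assumptions)
set.seed(7)
x <- rnorm(50, mean = 2.0, sd = 0.9)
y <- rnorm(50, mean = 1.0, sd = 0.8)

# Standard error of the difference in means (unequal variances allowed)
se <- sqrt(var(x) / length(x) + var(y) / length(y))
t.manual <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation of the degrees of freedom
df <- se^4 / ((var(x) / length(x))^2 / (length(x) - 1) +
              (var(y) / length(y))^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p.manual <- 2 * pt(-abs(t.manual), df)

# Built-in test for comparison
fit <- t.test(x, y)
print(c(manual = t.manual, builtin = unname(fit$statistic)))
print(c(manual = p.manual, builtin = fit$p.value))
```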

8.4.2 ANOVA
Analysis of variance (ANOVA) also assumes that the random variables are
normally distributed. It measures how much of the variance in the data is
explained by grouping the data into categories, and unlike the t-test,
it can be used to compare more than two groups. For instance,
in the example for the t-test, let us include data for another tissue. We now want
to see if the expression of SOX13 is tissue dependent. ANOVA can be used in
such a case.

# R-code
# Required R library
> library(reshape2)

> skin.SOX13 <- c(1.85, 1.59, 3.66, 1.92, 4.16, 3.91, 3.34, 3.52,
1.51, 2.3, 4.03, 3.91, 4.23, 1.68, 3.34, 2.74, 4.27, 1.91, 2.93,
2.52, 3.63, 5.13, 3.5, 3.88, 1.8, 3.56, 4.94, 3.78, 3.6, 4.21,
2.12, 4.5, 3.94, 2.72, 3.92, 4.2, 3.2, 4.82, 5.27, 2.14, 2.85,
2.66, 3.71, 5.93, 5.23, 2.73, 3.8, 3.43, 4.13, 2.85)

# Construct data-frame with all expression values:
> data1 <- data.frame("Breast" = breast.SOX13, "Colon" = colon.SOX13,
"Skin" = skin.SOX13)
# Reshape dataframe into required format:
> data1 <- melt(data1)
> data1 <- as.data.frame(data1)
> colnames(data1) = c("Tissue", "Expression")

# ANOVA
> result <- aov(Expression ~ Tissue, data=data1)
> summary(result)
             Df Sum Sq Mean Sq F value Pr(>F)
Tissue        2  143.0   71.49   85.74 <2e-16 ***
Residuals   147  122.6    0.83
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here you can see that the p-value (Pr(>F)) is less than 2e−16, which means that
there are significant differences in the expression level of SOX13 with respect to the
tissues (“breast,” “colon,” and “skin”). ANOVA can be carried out for more
than two groups (e.g., more than two tissues or more than two genes). This type of
test can also be used to test for “house-keeping genes” (genes expressed at similar
levels in many tissues), which in most cases cannot be used as biomarkers.
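
The F value reported by ANOVA is the ratio of the variance between groups to the variance within groups. The sketch below computes it by hand on simulated data (three groups of 20 values with assumed means) and compares the result with aov.

```r
# Simulated expression values for three tissues (illustrative assumptions)
set.seed(11)
expr <- c(rnorm(20, mean = 1), rnorm(20, mean = 2), rnorm(20, mean = 3.5))
tissue <- factor(rep(c("breast", "colon", "skin"), each = 20))

grand.mean <- mean(expr)
group.means <- tapply(expr, tissue, mean)
group.sizes <- tapply(expr, tissue, length)

# Between-group and within-group sums of squares
ss.between <- sum(group.sizes * (group.means - grand.mean)^2)
fitted.means <- as.vector(group.means[as.integer(tissue)])
ss.within <- sum((expr - fitted.means)^2)

# Degrees of freedom, F statistic, and p-value
df.between <- nlevels(tissue) - 1
df.within <- length(expr) - nlevels(tissue)
f.manual <- (ss.between / df.between) / (ss.within / df.within)
p.manual <- pf(f.manual, df.between, df.within, lower.tail = FALSE)

# Built-in ANOVA for comparison
tab <- summary(aov(expr ~ tissue))[[1]]
print(c(manual = f.manual, builtin = tab[1, "F value"]))
```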

8.4.3 Chi-Square Test


The chi-square (χ²) test is a “goodness of fit” test between the observed and
expected values, applied to sets of categorical data. The chi-square test assumes
that the probability of obtaining the observed set based on the expected values is
given by a chi-square distribution. In biology, the chi-square test is commonly
used to study deviations from Hardy–Weinberg equilibrium for genetic loci.
Example R code is given below based on a population study of Hardy–Weinberg
equilibrium. Consider a population of 1000 individuals. In this population, gene X
contains 2 alleles (A and a) with respective frequencies p = 0.7 and
q = 0.3. According to Hardy–Weinberg equilibrium, there should be 490
individuals with genotype AA (p²), 90 aa (q²), and 420 Aa (2pq). The observed
genotype counts in this population are 450 AA, 50 aa, and 500 Aa. In
order to see if this population deviates significantly from Hardy–Weinberg
equilibrium, we use the chi-square test. The number of degrees of freedom for the
chi-square test for Hardy–Weinberg equilibrium is (number of genotypes) − (number
of alleles), i.e., 3 − 2 = 1.

# R-code

# Loading the data
> data2 = data.frame(Observed = c(450, 500, 50), Expected = c(490,
420, 90), row.names = c("AA","Aa","aa"))

# Chi-squared test
> chisq.statistic <- sum(((data2$Observed - data2$Expected)^2)/
data2$Expected)
> chisq.statistic
[1] 36.28118

> significance = 1-pchisq(chisq.statistic,1)
> significance
[1] 1.708054e-09

Here you can see that for one degree of freedom (df) the chi-square value is
36.28, which is much greater than the critical value of 3.841 (please refer to degree
of freedom “1” and significance level “0.05” in the table of critical values for the
chi-squared distribution), and the p-value is 1.708054e−09. This gives evidence to
reject the null hypothesis (the population follows the Hardy–Weinberg equilibrium
model) in favor of the alternate hypothesis (the population does not follow the
Hardy–Weinberg equilibrium model).
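
In practice the allele frequencies are usually not known in advance and are estimated from the observed genotype counts. The sketch below derives p and q from the counts used above, builds the expected counts, and repeats the test; the variable names are illustrative.

```r
# Observed genotype counts from the example above
observed <- c(AA = 450, Aa = 500, aa = 50)
n <- sum(observed)

# Allele frequencies estimated from the counts: each AA carries two A alleles,
# each Aa carries one
p <- unname((2 * observed["AA"] + observed["Aa"]) / (2 * n))
q <- 1 - p

# Expected counts under Hardy-Weinberg equilibrium
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2) * n
print(expected)

# Chi-square statistic and p-value with one degree of freedom
chisq.statistic <- sum((observed - expected)^2 / expected)
p.value <- pchisq(chisq.statistic, df = 1, lower.tail = FALSE)
print(c(statistic = chisq.statistic, p.value = p.value))
```

Using lower.tail = FALSE asks pchisq for the upper-tail probability directly, which avoids the loss of precision that 1 − pchisq(...) can suffer for very small p-values.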

8.4.4 Fisher's Exact Test


Fisher’s exact test is similar to the chi-square test but is more suitable when the
sample size is small. It is used for categorical data. This test assumes that the
probability of obtaining the observed values follows a hypergeometric distribution. In
order to perform this test, data have to be represented in the form of a contingency
table. The most common use of Fisher’s exact test in omics analysis is for testing the
overrepresentation of gene ontology (GO) terms in a particular gene set compared
to a reference gene set. A significant result indicates an actual difference between the
two sets. Example R code is given below. Consider a hypothetical microarray
experiment for a case/control study. You found 125 significant genes expressed
differentially in the case compared to the control. Of these, 51 belonged to the GO
term “cell cycle.” Now you need to test whether this GO term is overrepresented in
your experiment and therefore holds some value.

# R-code

# Loading the data
# Considering total genes in background as 8713, and 467 belonging to cell cycle
> data3 = data.frame(Gene_count_in_set=c(51, 125), Gene_count_in_reference=c(467, 8713),
row.names = c("Gene_Count_GO_term", "Gene_Count_without_GO_term"))

# Fisher's exact test
> results = fisher.test(data3)
> results

	Fisher's Exact Test for Count Data

data:  data3
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
  5.309629 10.773256
sample estimates:
odds ratio
  7.610023

The Fisher’s exact test result shows that the p-value is less than 2.2e−16 and the
odds ratio is 7.610023, which means that the particular GO term is significantly
overrepresented among the differentially expressed genes in the experiment. This
type of statistic is useful for cohort-based studies, where the significant genes for a
particular phenotype can be used to find the pathways and biological processes that
are affected.
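
A contingency table for Fisher’s exact test is conventionally built from mutually exclusive counts, i.e., genes with the GO term and genes without it. The sketch below shows this layout for the same numbers, under the assumption that the second row should hold 125 − 51 = 74 and 8713 − 467 = 8246 genes, and that the background is treated as a separate reference set. It uses a one-sided test because only overrepresentation is of interest; the variable names are illustrative.

```r
# Counts taken from the example above
in.set.with <- 51;  in.set.total <- 125     # significant gene set
in.ref.with <- 467; in.ref.total <- 8713    # reference (background) set

# 2 x 2 table of mutually exclusive counts: with vs. without the GO term
tab <- matrix(c(in.set.with, in.set.total - in.set.with,
                in.ref.with, in.ref.total - in.ref.with),
              nrow = 2,
              dimnames = list(c("With_GO_term", "Without_GO_term"),
                              c("Gene_set", "Reference")))
print(tab)

# One-sided test: is the GO term overrepresented in the gene set?
res <- fisher.test(tab, alternative = "greater")
print(res$p.value)
print(res$estimate)   # odds ratio above 1 indicates enrichment
```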

8.5 Correlation

Correlation is a statistical technique used to study the association between two variables
x and y, where x and y represent two data series (x1, x2, …, xn) and (y1, y2, …, yn). The
variables x and y must be numeric variables and may be continuous or discrete. The
correlation (r) ranges from −1 to 1, where an r value closer to 1 represents positive
correlation and closer to −1 represents negative correlation. When r is equal to zero,
there is no correlation or no linear relationship between the two variables. In the case
of positive correlation between x and y, y increases as x increases, whereas in the case
of negative correlation, y decreases as x increases. However, correlation does not contain
directionality information, i.e., whether x is triggering the activity of y or vice versa.
Pearson correlation is commonly used to identify similarities between data series. It
measures the strength of linear relationships. Rank correlation (e.g., Spearman
correlation) is an alternative to Pearson correlation, which calculates the correlation
between data series based on the ranking of values.
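
The difference between Pearson and rank correlation shows up when a relationship is monotonic but not linear. The sketch below uses an artificial exponential relationship (an assumption for illustration) and the cor function from the stats package.

```r
# y increases with x, but not along a straight line
x <- 1:20
y <- exp(x / 3)

# Pearson correlation measures linear association
pearson <- cor(x, y, method = "pearson")

# Spearman correlation is computed on ranks, so any monotonic increase scores 1
spearman <- cor(x, y, method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```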
Correlation is widely used on transcriptomics data for identifying coexpression
patterns for genes. Another common application is the validation of direct target
genes of miRNAs for integration of epigenetic and transcriptomic data (Wang and
Li 2009). Here we demonstrate how Pearson correlation can be used to identify
coexpressed genes. We have taken the “nki” data set from the “breastCancerNKI”
Bioconductor package. We have reduced the data set to 1000 genes and 100 samples
in order to reduce the computational power and time taken by the “rcorr” function
to calculate the correlation and p-value for each gene pair. We have used a correlation
of at least 0.5 and a p-value of at most 0.01 as cutoffs to get the most significant
positively correlated gene pairs, which are stored in the “correlationresult” object.
This object contains four columns: the first column (GeneA) contains the gene names,
the second column (GeneB) also contains the gene names (where the GeneB expression
is correlated with GeneA), the third column contains the correlation value, and the
fourth column contains the p-value. These significant gene pairs tell us that when there
is an increase in expression of GeneA, GeneB also increases.

# R-code

# Required libraries
# (current Bioconductor releases install packages with BiocManager, which replaces biocLite)
> install.packages(c("Hmisc", "BiocManager"))
> BiocManager::install(c("breastCancerNKI", "affy"))
> library("breastCancerNKI")
> library("affy")
> library("Hmisc")

# Load the data
> data(nki)
> data <- exprs(nki)
> dim(data)
[1] 24481 337

# Annotate with gene symbol
> genesymbol <- nki@featureData@data$HUGO.gene.symbol
> row.names(data) <- genesymbol

# Reduce the data set (drop rows without a gene symbol, then subset)
> data <- data[!is.na(row.names(data)),]
> datasub <- data[1:1000,1:100]
> dim(datasub)
[1] 1000 100

# Calculating Pearson correlation for the gene pairs
> sigcorrelation <- rcorr(t(datasub), type = "pearson")
> correlation <- sigcorrelation$r
> pvalue <- sigcorrelation$P

# Getting the significant correlated gene pairs with correlation 0.5 and pvalue 0.01
> correlation[(correlation < 0.5 & correlation > -0.5)] <- 0
> correlation[pvalue > 0.01] <- 0
> diag(correlation) <- 0

# Converting the data into list (only positive correlations are kept)
> datatmp <- which(correlation > 0, arr.ind = TRUE)
# GeneA is the row gene and GeneB the column gene of each retained pair
> genea <- rownames(correlation)[datatmp[,1]]
> geneb <- colnames(correlation)[datatmp[,2]]
> correlationval <- correlation[datatmp]
> pval <- pvalue[datatmp]

# Final result
> correlationresult <- data.frame(GeneA = genea, GeneB = geneb,
Correlation = correlationval, Pvalue = pval, stringsAsFactors = FALSE)
> dim(correlationresult)
[1] 8422 4
> length(unique(c(correlationresult$GeneA, correlationresult$GeneB)))
[1] 781

# Inspect the first gene pairs (columns: GeneA, GeneB, Correlation, Pvalue)
> head(correlationresult)

From the result, you can see there are 781 genes present, containing 8422
interactions (the correlation matrix is symmetric, so each gene pair is counted in both
directions). This result is further studied using the network theory approach in Section
9 for more biological insights, for example, which gene is more connected or
associated with other genes. Then you can compare them with normal cohort
coexpression (reference set) using a database like COXPRESdb to confirm a gene as a
biomarker.

8.6 Clustering

Clustering or cluster analysis is used to classify objects into meaningful groups,
such that the objects in the same group are more similar for some parameters
compared to objects in some other group. Classification and clustering problems are
common in many biological scenarios. They are required for classifying samples
into groups based on their “omics” data. For example, gene expression data is used
to classify samples into distinct classes, such as stratification of patients into disease
classes.

Hierarchical Clustering Hierarchical clustering is a clustering algorithm which
clusters data by building hierarchical relationships between clusters. The
algorithm may follow a bottom-up (agglomerative) approach, which repeatedly merges
clusters, or a top-down (divisive) approach, which repeatedly splits them. This
type of clustering is based on the calculation of distances between objects. The
hclust function of the stats R package performs bottom-up hierarchical clustering.

# R-code
> dat <- data[complete.cases(data),]

# reducing the data into 100 genes
> datmat <- t(dat[1:100,])

# Clustering using distance matrix
> fit <- hclust(dist(datmat), method = "ward.D2")

# Cut the tree into respective groups
> groups <- cutree(fit, k = 2)

# Plotting the result
> plot(fit, cex = 0.3, xlab = '', sub = '', axes = FALSE, ylab = '')
> heatmap(datmat, col = heat.colors(256), cexCol = 0.4, xlab = "Genes",
ylab = "Samples", cexRow = 0.4)

Figure 6 shows that the samples form different clusters when 100 genes are used.
Note that these genes were chosen arbitrarily (the first 100 rows of the data). If
we choose significant genes for clustering, the data set is expected to show better
separation. We can therefore replace these 100 genes with significant genes and
validate them using cluster analysis. For example, if the data set comes from a
treatment intervention, we can compare the survival patterns of the clusters. If
the clusters show a significant difference in survival, then those genes are
candidate “prognostic biomarkers.”
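The effect of informative genes on cluster separation can be sketched on simulated data. The example below is only an illustration under stated assumptions: the matrix toy, its sample names (S1 to S6), and its gene names (G1 to G5) are made up, and the two groups are simulated to differ strongly on every gene.

```r
# Toy data (an assumption for illustration): 6 samples x 5 genes,
# samples S1-S3 drawn around 0 and samples S4-S6 drawn around 5
set.seed(1)
toy <- rbind(matrix(rnorm(15, mean = 0, sd = 0.5), nrow = 3),
             matrix(rnorm(15, mean = 5, sd = 0.5), nrow = 3))
rownames(toy) <- paste0("S", 1:6)
colnames(toy) <- paste0("G", 1:5)

# Bottom-up (agglomerative) clustering on Euclidean distances
toyfit <- hclust(dist(toy), method = "ward.D2")

# Cut the dendrogram into two groups and tabulate the group sizes
toygroups <- cutree(toyfit, k = 2)
table(toygroups)
```

Because every simulated gene separates the two groups, the cut at k = 2 recovers them; with uninformative genes the same call would still return two groups, but they need not match any biological class.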

9 Systems Biology Approach Using Network Theory

Network analysis or graph theory is an important technique for extracting
meaningful insights from data in biology and medicine. It is used to find the
features that are most responsible for, or cause, the specific phenotype of
interest. Graph theory also has a wide range of applications in other fields such
as the World Wide Web (WWW), social networking, ecology, political science, and
history. A network or graph has two kinds of elements: nodes and edges. Nodes are
the features, and edges represent the relationships between the features. Graphs
can be divided into two broad classes: directed graphs and undirected graphs.

Fig. 6 (a) Dendrogram showing the clusters using the “nki” data set with 100 genes.
(b) Heat map showing the expression pattern of the 100 genes, with samples on the
y-axis and genes on the x-axis

9.1 Graph Data Structure

To construct a graph or network, the data has to be in an adjacency list or
adjacency matrix format. In an adjacency list, each vertex (feature) is stored
together with the collection of its neighboring vertices. An example of an
adjacency list is shown in Table 2.

Table 2 Example of an adjacency list that contains five features or nodes (V1, V2,
V3, V4, and V5) and their relationships
Node   Relationships
V1     V2, V3, V5
V2     V1, V5
V3     V1, V4
V4     V3, V5
V5     V1, V2, V4

Table 3 Example of an adjacency matrix which contains four features or nodes (V1,
V2, V3, V4) and their relationships (“0” represents no relationship and “1”
represents a relationship between two nodes)
      V1   V2   V3   V4
V1    0    1    0    1
V2    1    0    1    0
V3    0    1    0    1
V4    1    0    1    0

The adjacency matrix contains the features in both rows and columns. The values in
the matrix represent the relationships between pairs of vertices (features). The
values can be coded or numerical, depending on the type of relationship between
the features. An example of an adjacency matrix is shown in Table 3.
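The two formats carry the same information, so one can be converted into the other. The base R sketch below turns the adjacency list of Table 2 into an adjacency matrix; the object names adjlist and adjmat are chosen here for illustration only.

```r
# Adjacency list from Table 2, stored as a named list
adjlist <- list(V1 = c("V2", "V3", "V5"),
                V2 = c("V1", "V5"),
                V3 = c("V1", "V4"),
                V4 = c("V3", "V5"),
                V5 = c("V1", "V2", "V4"))

# Empty adjacency matrix with the nodes in rows and columns
nodes <- names(adjlist)
adjmat <- matrix(0, nrow = length(nodes), ncol = length(nodes),
                 dimnames = list(nodes, nodes))

# Set a 1 for every (node, neighbor) pair in the list
for (v in nodes) {
  adjmat[v, adjlist[[v]]] <- 1
}

adjmat
isSymmetric(adjmat)   # an undirected graph gives a symmetric matrix
sum(adjmat) / 2       # each undirected edge is stored twice
```

For a weighted network, the 1 would be replaced by the edge weight, for example a correlation value.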

9.2 Graph Analysis

A graph or network consists of a set of vertices and a set of edges connecting the
vertices. The two broad classes of graphs are directed and undirected graphs. The
major difference between them is that no direction information is available in
undirected graphs. For example, from the edge between V1 and V2 in an undirected
graph, we cannot tell whether V1 influences V2 or vice versa. This information is
very important in biological networks and is discussed for each network type in
Section 9.3.
The network or graph can be sparse or dense depending on the following criterion.
For a graph G = (V, E) with n vertices and m edges, the graph is dense when the
number of relationships is close to the square of the number of nodes, i.e., m is
close to n^2, and the graph is sparse when the number of relationships is much
smaller than the square of the number of nodes, i.e., m << n^2. Most biological
networks are sparse overall, although they often contain densely connected modules.
Example R code for the construction of an undirected graph is given below, using
the correlation result from Section 8.5. If the data contains direction
information, then the “mode” parameter of the “graph.adjacency” function has to be
set to “directed.” A subset of the graph (“graphtarg”), containing the selected
genes and their first-order interactions, is created and shown in Fig. 7. These
genes can be replaced with clinically important genes (single or multiple genes) to
find their first-order coexpressed genes. This will help you to study a specific
set of genes and their first-order interactions.
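The sparse or dense criterion can be checked numerically by comparing m with the number of possible edges. The helper edge_fraction below is a hypothetical name introduced for this sketch (it is not an igraph function), and it assumes an undirected graph without self-loops, for which n(n - 1)/2 edges are possible.

```r
# Fraction of the possible undirected edges (no self-loops) that are present
edge_fraction <- function(n, m) {
  m / (n * (n - 1) / 2)
}

# Node and edge counts reported for the coexpression graph in this section
edge_fraction(n = 1000, m = 5426)

# Graph of Table 2: 5 nodes and 6 edges
edge_fraction(n = 5, m = 6)
```

The coexpression graph uses roughly 1% of its possible edges, so it is sparse by the criterion above, whereas the small graph of Table 2 uses 60% of them.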

Fig. 7 Graph plotted for the selected genes and their first-order interactions with
other genes. Red nodes are the selected genes, and yellow nodes are their
first-order interaction partners

# R-code

# Required library
> install.packages("igraph")
> library(igraph)

# Correlation data matrix from Section 8.5, which contains the pairwise
# correlation values for 1000 genes
> undirectednetworkdata <- correlation
> dim(undirectednetworkdata)
[1] 1000 1000

# Creating an undirected graph using the correlation matrix; the edge
# weight corresponds to the respective correlation value
> graph <- graph.adjacency(undirectednetworkdata, mode = "undirected",
weighted = TRUE)

# Remove edges with "NA"
> graph <- delete.edges(graph, which(is.na(E(graph)$weight)))

# There are 1000 nodes and 5426 edges in the graph
> summary(graph)
IGRAPH UNW- 1000 5426 --
+ attr: name (v/c), weight (e/n)

# To list all the node (gene) names
> nodenames <- V(graph)$name

# To get the 1st order coexpressed genes for the selected or important
# genes
> select_genes <- c("C17orf74", "SOX4", "LRFN2", "SLC16A1", "CDH2",
"CDK7", "GSR")
> graphtarg <- induced.subgraph(graph = graph, vids = unlist(neighborhood
(graph = graph, order = 1, nodes = select_genes)))
> V(graphtarg)$color <- "yellow"
> V(graphtarg)[select_genes]$color <- "red"

# Plot the graph
> plot(graphtarg, vertex.size = 8, vertex.label.font = 2,
vertex.label.cex = 0.5, vertex.label.color = "black", vertex.color =
V(graphtarg)$color, edge.width = E(graphtarg)$weight)

Graph traversals are required to find the path (direct or indirect) between two
nodes. Some nodes in the graph are directly connected and others are indirectly
connected. A path exists between a node and any other node in the same connected
component of the graph; a node that has no edges, or that lies in a different
component, cannot be reached. Two widely used graph traversal approaches are
breadth-first search (BFS) and depth-first search (DFS). Example R code for
depth-first search and breadth-first search is given below. You give the root node
for DFS and BFS as an input, and each search returns the order of traversal from
that node, which is stored in “orderdfs” and “orderbfs.”
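To make the two traversal orders concrete, here is a minimal base R sketch on the small undirected graph of Table 2. The helper names bfs_order and dfs_order are hypothetical and written only for this illustration; they are not igraph functions, and neighbors are visited in the order in which they are listed.

```r
# Adjacency list from Table 2
adjlist <- list(V1 = c("V2", "V3", "V5"),
                V2 = c("V1", "V5"),
                V3 = c("V1", "V4"),
                V4 = c("V3", "V5"),
                V5 = c("V1", "V2", "V4"))

# Breadth-first search: visit all neighbors of a node before moving
# one level deeper; returns the nodes in order of first visit
bfs_order <- function(adjlist, root) {
  visited <- root
  queue <- root
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    for (nb in adjlist[[current]]) {
      if (!(nb %in% visited)) {
        visited <- c(visited, nb)
        queue <- c(queue, nb)
      }
    }
  }
  visited
}

# Depth-first search: follow one branch as far as possible before
# backtracking
dfs_order <- function(adjlist, root, visited = character(0)) {
  visited <- c(visited, root)
  for (nb in adjlist[[root]]) {
    if (!(nb %in% visited)) {
      visited <- dfs_order(adjlist, nb, visited)
    }
  }
  visited
}

bfs_order(adjlist, "V1")
dfs_order(adjlist, "V1")
```

Starting from V1, the breadth-first order visits all direct neighbors (V2, V3, V5) before V4, whereas the depth-first order follows the chain V1, V2, V5, V4 before it reaches V3.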

# R-code

# depth first search
> dfs <- graph.dfs(graph, root = 1)
> orderdfs <- V(graph)$name[dfs$order]

# breadth first search
> bfs <- graph.bfs(graph, root = 1)
> orderbfs <- V(graph)$name[bfs$order]

Two network properties commonly used to study the structure of biological networks
are shortest path and centrality.

Shortest Path This is used to find the shortest path connecting two nodes in a
network, and the inference depends on the type of network. If you are studying a
gene coexpression network, the shortest path between two genes indicates the
coexpression pattern and the intermediate players, that is, which genes mediate the
association between your query genes. Multiple paths may be available between two
genes, but the shortest one is usually treated as the most significant. The
shortest path can be identified in both directed and undirected graphs. Similarly,
shortest path analysis can be used to study toxicity and mode of action in the drug
discovery process. If we know the drug target (gene1) and the end-point biomarker
(gene2), we can use the gene regulatory network to find the shortest path
connecting the target and the end-point biomarker.

The genes which are involved in connecting the target and the end-point biomarker
can be used for functional analysis to study the toxicity. Example R code is given
below, where the target is the “MDM4” gene (similarly, multiple targets can also be
used) and the end-point biomarker is “SLC5A2.” The genes connecting these two
(“SOX4,” “CKAP2L,” and “TAF4”) can be further studied using functional analysis.
The shortest path between “MDM4” and “SLC5A2” is shown in Fig. 8.
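The idea behind a shortest-path search on an unweighted graph can be sketched in base R on the small graph of Table 2. This is an illustration, not the igraph implementation: shortest_path is a hypothetical helper that runs a breadth-first search and records, for each node, the node from which it was first reached.

```r
# Adjacency list from Table 2
adjlist <- list(V1 = c("V2", "V3", "V5"),
                V2 = c("V1", "V5"),
                V3 = c("V1", "V4"),
                V4 = c("V3", "V5"),
                V5 = c("V1", "V2", "V4"))

# Breadth-first search from 'from' that stops once 'to' is reached
shortest_path <- function(adjlist, from, to) {
  parent <- setNames(rep(NA_character_, length(adjlist)), names(adjlist))
  visited <- from
  queue <- from
  while (length(queue) > 0 && !(to %in% visited)) {
    current <- queue[1]
    queue <- queue[-1]
    for (nb in adjlist[[current]]) {
      if (!(nb %in% visited)) {
        visited <- c(visited, nb)
        parent[nb] <- current      # remember how nb was reached
        queue <- c(queue, nb)
      }
    }
  }
  if (!(to %in% visited)) return(character(0))  # no path exists
  # Walk back from the end node to the start node
  path <- to
  while (path[1] != from) {
    path <- c(parent[[path[1]]], path)
  }
  path
}

shortest_path(adjlist, "V2", "V3")
```

Between V2 and V3 the search returns the path through V1 rather than the longer route through V5 and V4; the interior nodes of the returned path play the role of the connecting genes in the text above.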

Fig. 8 Shortest path between “MDM4” (red) and “SLC5A2” (blue); the genes connecting
these two (TAF4, CKAP2L, and SOX4) are shown in yellow

# R-code

# target and end-point biomarker
> targets <- c("MDM4", "SLC5A2")
> nodes <- degree(graph)

# finding node id for the targets
> id <- which(names(nodes) %in% targets)

# calculating shortest path
> sp <- get.shortest.paths(graph, id[1], to = id[2], weights = NA,
output = c("vpath"))
> vert <- as.vector(sp$vpath[[1]])
> V(graph)$name[vert]
[1] SLC5A2 SOX4 CKAP2L TAF4 MDM4
> graphshortest <- induced.subgraph(graph = graph, vids = unlist
(neighborhood(graph = graph, order = 0, nodes = vert)))

# assigning colors to the connecting node
> V(graphshortest)$color <- "yellow"

# assigning color to the target and end-point biomarker
> V(graphshortest)["MDM4"]$color <- "red"
> V(graphshortest)["SLC5A2"]$color <- "cadetblue2"

# plotting the shortest path
> plot(graphshortest)

Centrality In any type of biological network analysis, one of the key goals is to
identify the features that are the most critical and control the behavior of the
biological system. These will be the most important components of the mechanistic
model. Centrality analysis is used for this purpose. It provides network
information such as which genes are essential for survival, which are the
housekeeping genes, or which molecular-level properties are the most critical for
phenotype development. Example R code is given below to calculate centrality
measures such as degree, closeness, betweenness, and eigenvector centrality.

# Degree (number of connections a gene has with the other genes)
> cent.degree <- degree(graph)

# Closeness (reciprocal of the sum of the distances from a gene to all
# other genes; the gene with the lowest total distance is the most
# central in the network)
> cent.closeness <- closeness(graph)

# Betweenness (number of times the gene is present in the shortest
# paths between all pairs of genes)
> cent.betweenness <- betweenness(graph)

# Eigenvector (influence of each gene in the network, based on its
# connections to high-scoring genes); the scores are in cent.ev$vector
> cent.ev <- eigen_centrality(graph)

These measures will help you to rank the genes which are more central or impor-
tant in the whole network.
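Two of these measures are simple enough to compute by hand, which helps when interpreting the igraph output. The base R sketch below, an illustration on the graph of Table 2 rather than on the coexpression data, derives degree from the adjacency matrix and closeness from all-pairs shortest-path lengths (Floyd-Warshall algorithm); the object names are chosen here for illustration.

```r
# Adjacency matrix of the Table 2 graph (1 = edge, 0 = no edge)
nodes <- paste0("V", 1:5)
adjmat <- matrix(0, 5, 5, dimnames = list(nodes, nodes))
edges <- rbind(c("V1", "V2"), c("V1", "V3"), c("V1", "V5"),
               c("V2", "V5"), c("V3", "V4"), c("V4", "V5"))
adjmat[edges] <- 1
adjmat[edges[, 2:1]] <- 1   # undirected: store both directions

# Degree: number of edges attached to each node
cent_degree <- rowSums(adjmat)

# Shortest-path lengths between all pairs (Floyd-Warshall algorithm)
distmat <- ifelse(adjmat == 1, 1, Inf)
diag(distmat) <- 0
for (k in nodes) {
  distmat <- pmin(distmat, outer(distmat[, k], distmat[k, ], "+"))
}

# Closeness: reciprocal of the summed distance to all other nodes
cent_closeness <- 1 / rowSums(distmat)

cent_degree
cent_closeness
```

V1 and V5 have the highest degree (3) and the highest closeness (1/5), so both measures rank them as the most central nodes of this small graph.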

9.3 Types of Biological Networks

Here we describe four different types of biomolecular networks constructed on the
basis of information from high-throughput genomics and bibliomics (information
curated from literature) data: gene coexpression network, protein–protein
interaction network, metabolic network, and gene regulatory network. These are
commonly used for integrative analysis of omics data. These networks are analyzed
and integrated using network theory approaches to obtain actionable insights
(Pavlopoulos et al. 2011).

9.3.1 Gene Coexpression Network


Gene coexpression networks are undirected graphs that contain gene–gene interaction
information. The nodes are genes, and the edges represent the relationships between
pairs of genes. The gene coexpression network is built from transcriptomics (gene
expression) data, which can come from either NGS or microarray platforms.
Correlation or mutual information techniques can be applied to the cohort
expression data set to find the coexpressed genes for the specific phenotype of
interest. A gene coexpression network cannot be built from only a few data points
or samples; it needs a sufficient number of samples to calculate the p-values. The
higher the number of samples, the greater the accuracy of the coexpressed genes.
Reference coexpression networks for model organisms are available in the COXPRESdb
database (Obayashi et al. 2013). In COXPRESdb, the data sets used to build the
coexpression networks are taken from publicly available repositories such as GEO
and ArrayExpress, and they are based on the normal healthy state. The coexpression
network can be used to find clusters or communities of genes which have functional
properties.
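The construction described above can be sketched end to end in base R. The example is a minimal illustration on simulated data: the expression matrix expr and the gene names geneA to geneD are made up, and the cutoffs (absolute correlation of at least 0.5, p-value of at most 0.01) are assumed to mirror the thresholds used earlier in this chapter.

```r
# Simulated expression data (an assumption for illustration):
# 30 samples in rows, 4 genes in columns; geneB follows geneA closely
set.seed(42)
geneA <- rnorm(30)
expr <- cbind(geneA = geneA,
              geneB = geneA + rnorm(30, sd = 0.2),
              geneC = rnorm(30),
              geneD = rnorm(30))

# Pairwise Pearson correlations between genes
cormat <- cor(expr)
diag(cormat) <- 0   # ignore the correlation of a gene with itself

# Two-sided p-values from the t-distribution with n - 2 degrees of freedom
n <- nrow(expr)
tstat <- cormat * sqrt((n - 2) / (1 - cormat^2))
pmat <- 2 * pt(-abs(tstat), df = n - 2)

# Keep an edge only for strong and significant correlations
adj <- (abs(cormat) >= 0.5 & pmat <= 0.01) * 1
adj
```

The resulting 0/1 matrix is an adjacency matrix in the sense of Section 9.1; the p-value step also shows why the sample size matters, since n enters both the test statistic and the degrees of freedom.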

9.3.2 Protein–Protein Interaction (PPI) Network


PPI networks may be directed or undirected graphs that contain proteins as nodes,
with edges representing the interaction between two proteins. These networks are
constructed using experiments such as yeast two-hybrid screens, protein expression
arrays, and mass spectrometry techniques. There are well-known open source
databases that contain the protein interactions for different model organisms, such
as IntAct, STRING, BioGRID, and Reactome. These databases use publicly available
literature (bibliomics) to construct the protein–protein interactions. These data
can be integrated with the experimental data to obtain reliable networks.

9.3.3 Metabolic Network


The metabolism of an organism is studied to understand cellular-level processes.
These include the generation of essential components such as amino acids, sugars,
and lipids, as well as the energy (stored as ATP) needed to carry out cellular
functions. The metabolic network contains many complex chemical reactions that
generate such essential cellular components. This network can be used for
whole-cell modeling as well as for predictive purposes given a specific phenotype.
Metabolic networks are directed graphs, which contain multiple components such as
metabolites, reactions, and substrates. A substrate is transformed into a product
through reactions which are catalyzed by enzymes. This type of network analysis
uses a systemic approach rather than a reductionist approach. Analytical techniques
like flux balance analysis (FBA) (Orth et al. 2010) can be used on the metabolic
network; FBA combines mass balance and other regulation constraints to predict
steady-state fluxes for each reaction. FBA can be applied in two different states
(e.g., healthy and disease conditions) to find the reactions whose fluxes differ
most between the two states. Later, the reactions can be annotated to obtain the
proteins or genes involved in each reaction. There are some well-known databases
that contain the metabolic network for an organism, such as ReconX, KEGG, and
BioCyc.
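FBA is usually written as a linear program. The formulation below is the common textbook form, given here as a sketch; the symbols S, v, c, v_min, and v_max are notation introduced for this illustration and do not appear elsewhere in this chapter.

```latex
\begin{aligned}
\max_{v}\quad & Z = c^{\top} v \\
\text{subject to}\quad & S\,v = 0 \\
& v_{\min} \le v \le v_{\max}
\end{aligned}
```

Here S is the stoichiometric matrix (metabolites in rows, reactions in columns), v is the vector of reaction fluxes, and c selects the objective, for example a biomass reaction. The equality S v = 0 is the steady-state mass balance, and the bounds encode reaction reversibility and other capacity or regulation constraints. Solving the program separately with bounds for the healthy and the disease state gives the two flux vectors that are compared reaction by reaction.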

9.3.4 Gene Regulatory Network


Gene regulatory networks are similar to gene coexpression networks, but with
directional information. Here the nodes correspond to regulators such as DNA, RNA,
miRNA, and proteins. The directed edges represent activation or inhibition. From
these networks, you can infer which gene activates or inhibits the activity of
other genes. These networks are modeled using ordinary differential equations
(ODEs) or stochastic differential equations. Usually, gene regulatory networks are
constructed using extensive literature curation and data mining.
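As a small illustration of ODE-based modeling, the base R sketch below simulates a hypothetical two-gene circuit in which X is produced at a constant rate and represses Y through a Hill function. The circuit, the function simulate_grn, and all parameter values are assumptions made up for this example, and the forward Euler scheme is used only for simplicity.

```r
# Hypothetical two-gene circuit (illustration only, not from a real
# network): X is produced at a constant rate and represses Y
simulate_grn <- function(tmax = 50, dt = 0.01,
                         prod_x = 1, prod_y = 2, decay = 0.5, K = 1, h = 2) {
  steps <- round(tmax / dt)
  x <- numeric(steps + 1)   # both genes start at zero expression
  y <- numeric(steps + 1)
  for (i in seq_len(steps)) {
    dx <- prod_x - decay * x[i]                        # dX/dt
    dy <- prod_y / (1 + (x[i] / K)^h) - decay * y[i]   # dY/dt, repressed by X
    x[i + 1] <- x[i] + dt * dx                         # forward Euler update
    y[i + 1] <- y[i] + dt * dy
  }
  data.frame(time = (0:steps) * dt, X = x, Y = y)
}

traj <- simulate_grn()
tail(traj, 1)
```

With these made-up parameters the trajectory settles near X = 2 and Y = 0.8, the values at which production balances degradation. The directed edge from X to Y appears in the model as the dependence of dY/dt on X, which is the directional information that a coexpression network lacks.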

9.4 Network-Driven Knowledge Discovery

The different types of biological networks can be integrated and modeled to extract
knowledge about a specific phenotype (Qabaja et al. 2014). This requires the
integration of multiple levels of omics data as a network for knowledge discovery.
Network-driven knowledge can be used in the drug discovery and development process
to understand the disease mechanism, to find promising targets or interventions,
and also to validate them using in silico knockout experiments. Using these
networks, we can also test a target's efficacy and toxicity using functional or
pathway enrichment. Similarly, they can be used to identify specific molecular
processes that result in improved animal productivity.

10 Summary

It is becoming possible to understand the functioning of biological systems in
greater depth than ever before because of the availability of large quantities of
omics data. However, these data sets have brought with them the challenges of large
data volumes and high data complexity. Appropriate tool sets, with the correct
approach, are required to analyze these data and extract actionable insights. In
addition, multiscale data needs to be integrated for this purpose. To solve this
multi-omics multiscale problem, you first need to analyze individual data sets to
derive knowledge and then combine the knowledge from different data sets to arrive
at the insight. This involves the use of complex mathematical and computational
tools built on some common statistical concepts, combined with advanced algorithms
involving integrative network theory and data modeling. The use of individual tool
sets can become difficult to manage and can create problems with analysis tracking
and reproducibility. Therefore, it is advisable to use integrative analytics suites
such as iOMICS, which combine the various tool sets for a more convenient
experience, along with cloud-based computing solutions capable of handling the
expanding data volumes.

Acknowledgment We would like to thank Santhosh Babu Gandham, Krittika Ghosh, Sushmita
Mookherjee, Sakthi Ganesh and all the other members of the iOMICS team for helping us to pub-
lish this chapter.

References
Adhil M, Gandham S, Talukder AK, Agarwal M, Prahalad HA (2015) CuraEx – clinical expert
system using big-data for precision medicine BT – big data analytics. In: Kumar N, Bhatnagar
V (eds) Proceedings of 4th international conference, BDA 2015, Hyderabad, 15–18 Dec 2015.
Springer International Publishing, Cham, pp 216–227
Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207
[PMID: 12634793]
Agarwal M, Adhil M, Talukder AK (2015) Multi-omics multi-scale big data analytics for cancer
genomics BT – big data analytics. In: Kumar N, Bhatnagar V (eds) Proceedings of 4th interna-
tional conference, BDA 2015, Hyderabad, 15–18 Dec 2015. Springer International Publishing,
Cham, pp 228–243
Hesser LF (2006) The man who fed the world: Nobel Peace Prize laureate Norman Borlaug and his
battle to end world hunger: an authorized biography. Leon Hesser. Durban House Pub Co Inc
Hucka M et al (2003) The systems biology markup language (SBML): a medium for representa-
tion and exchange of biochemical network models. Bioinformatics 19(4):524–531
[PUBMED:12611808]
Kohl M, Megger DA, Trippler M, Meckel H et al (2014) A practical data processing workflow for
multi-OMICS projects. Biochim Biophys Acta 1844:52–62. doi:10.1016/j.bbapap.2013.02.029
[PMID: 23501674]
Letovsky S (1999) Bioinformatics: databases and systems. Springer Science & Business Media.
Springer
McGuinness DL, Van Harmelen F (2004) OWL web ontology language overview. W3C
Recommendation 10.10, 2004
Mesiti M, Jiménez-Ruiz E, Sanz I et al (2009) XML-based approaches for the integration of
heterogeneous bio-molecular data. BMC Bioinformatics 10(Suppl 12):S7.
doi:10.1186/1471-2105-10-S12-S7 [PMID: 19828083]
Obayashi T, Okamura Y, Ito S, Tadaka S, Motoike IN, Kinoshita K (2013) COXPRESdb: a data-
base of comparative gene coexpression networks of eleven species for mammals. Nucleic
Acids Res 41(Database issue):D1014–D1020. doi:10.1093/nar/gks1014 [PMID: 23203868]
Orth JD, Thiele I, Palsson BØ (2010) What is flux balance analysis? Nat Biotechnol 28(3):245–
248 [PMID: 20212490]
Pavlopoulos GA et al (2011) Using graph theory to analyze biological networks. BioData Mining
4:10. PMC. Web. 16 Oct. 2015. PMC: 3101653
Qabaja A, Alshalafa M, Alanazi E, Alhajj R (2014) Prediction of novel drug indications using
network driven biological data prioritization and integration. J Chemoinform 6(1):1.
doi:10.1186/1758-2946-6-1 [PMID: 24397863]
Schena M et al (1995) Quantitative monitoring of gene expression patterns with a complementary
DNA microarray. Science 270(5235):467–470 [PMID: 7569999]
Schuster SC (2007) Next-generation sequencing transforms today’s biology. Nature 200(8):16–18
[PMID: 18165802]
Shabalin AA (2012) Matrix eQTL: ultra fast eQTL analysis via large matrix operations.
Bioinformatics 28(10):1353–1358. doi:10.1093/bioinformatics/bts163, [PMID: 22492648],
Epub 2012 Apr 6
Stein LD (2010) The case for cloud computing in genome informatics. Genome Biol 11:207
[PMID: 20441614]
Talukder AK (2015) Genomics 3.0: big-data in precision medicine BT – big data analytics: In:
Kumar N, Bhatnagar V (eds) Proceedings of 4th international conference, BDA 2015,
Hyderabad, 15–18 Dec 2015. Springer International Publishing, Cham, pp 201–215
Wang YP, Li KB (2009) Correlation of expression profiles between microRNAs and mRNA targets
using NCI-60 data. BMC Genomics 10:218. doi:10.1186/1471-2164-10-218, PMID: 19435500
Wang S, Sun H, Ma J, Zang C, Wang C, Wang J et al (2013) Target analysis by integration of tran-
scriptome and ChIP-seq data with BETA. Nat Protoc 8(12):2502–2515 [PMID: 24263090]
