Systems Biology
in Animal
Production and
Health, Vol. 1
Haja N. Kadarmideen
Editor
Haja N. Kadarmideen
Faculty of Health and Medical Sciences
University of Copenhagen
Frederiksberg C, Denmark
Foreword
The increased prominence of “systems biology” in biological research over the past
two decades is arguably a reaction to the reductionist approach exemplified by the
genome sequencing phase of the Human Genome Project. A simplistic view of the
genome projects was that the genome sequence of a species, whether humans,
model organisms, plants or farmed animals, represents a blueprint for the organism
of interest, and thus characterising the sequence would reveal the relevant instruc-
tions. Subsequent targets for the reductionist or cataloguing approach were com-
plete lists of transcripts (transcriptomes) and proteins (proteomes) for the organism
of interest. The ‘omics approach to the comprehensive characterisation of an organ-
ism, tissue or cell has also been extended to metabolites and hence metabolomes.
A catalogue of parts, however, is insufficient to understand how an organism func-
tions. Thus, a holistic approach that recognises the interactions between compo-
nents of the system was required. Given the size and complexity of the data and the
possible interactions, it was necessary to use advanced mathematical and computa-
tional methods to attempt to make sense of the data. Thus, “systems biology” in the
‘omics era is widely considered to concern the use of mathematical modelling and
analysis together with ‘omics data (genome sequence, transcriptomes, proteomes,
metabolomes) to understand complex biological systems. The predictive aspect of
these models is viewed as particularly important. Moreover, it is desirable that the
models’ predictions can be tested experimentally. Systems biology, therefore, con-
tributes in part to converting large ‘omics data sets from data-driven biology experi-
ments into testable hypotheses.
Systems approaches and the use of predictive mathematical models in biological
systems long pre-date the post genome project (re-)emergence of systems biology.
Population biologists/geneticists, epidemiologists, agricultural scientists, quantita-
tive geneticists and plant and animal breeders have been developing and success-
fully exploiting predictive mathematical models and systems approaches for
decades.
Quantitative geneticists and animal breeders, for example, have been remarkably
successful at developing statistical animal models that are effective predictors of
future performance. For decades, these successes were achieved without any knowl-
edge of the underlying molecular components. The accuracy of these models has
been increased by using high-density molecular (single nucleotide polymorphism,
SNP) genotypes in so-called genomic selection. However, whilst the sequences and
genome locations of the SNP markers are known little is known about the functional
with application for obesity using pig models is reviewed by Kogelman and
Preface
performs such genome-wide association studies (GWAS), but also links
genetic variations (e.g. SNPs, CNVs, QTLs etc.) at the DNA sequence level with
variation in molecular profiles or traits (e.g. gene expression or metabolomic or
proteomic levels etc. in tissues and biological fluids) that we can measure using
high-throughput next- and third-generation biotechnologies. The systems genetics
approach is still “genetics”, because we are looking at those genetic variants that
exert their effects from DNA to phenotypic expression or disease manifestations
through a number of intermediate molecular profiles. Hence, systems genetics
derives its name, as originally proposed in my earlier article (Mammalian Genome,
2006, 17:548–564), by being able to integrate analyses of all underlying genetic
factors acting at different biological levels, namely, QTL, eQTL, mQTL, pQTL and
so on. I have provided a complete up-to-date review and illustration of systems
genetics or systems genomics and multi-omic data integration and analyses in our
review paper published in Genetics Selection Evolution (2016), 48:38. Overall, sys-
tems genetics/genomics leads us to provide a holistic view on complex trait heredity
at different biological layers or levels.
Whether it is systems biology or systems genetics, the gene ontology annotation
is one of the most important and valuable means of assigning functional information
using standardized vocabulary. This would include annotation of genetic variants
falling into functional groups such as trait QTL, eQTL, mQTL, pQTL. Molecular
pathway profiling, signal transduction and gene set enrichment analyses along with
various types of annotations form the “icing on the cake”. For this purpose, several
bioinformatics tools are frequently used. Most chapters in this book and its associ-
ated volume cover these aspects.
I would like to point out that systems biology approaches have been proven to be
very powerful and shown to produce accurate and replicable discoveries of genes,
proteins and metabolites and their networks that are involved in complex diseases or
traits. In very practical terms, it delivers biomarkers, drug targets, vaccine targets,
target transcripts or metabolites, genetic markers, pathway targets etc. to diagnose
and treat diseases better or improve traits or characteristics in animals, plants and
humans. In the world of genomic prediction and genomic selection, there has been
an increasing number of studies that have shown high accuracy and predictive
power when models include functional QTLs such as eQTL, mQTL, pQTL which,
in fact, are results from systems genetics methods.
This book and its associated volume cover the above-mentioned principles, theory
and application of systems biology and systems genetics in livestock and animal
models and provide a comprehensive overview of open source and commercially
available software tools, computer programming code and other reading materials to
learn, use and successfully apply systems biology and systems genetics in animals.
Overall, I believe this book is an extremely valuable source for students interested
in learning the basics and could serve as a textbook in higher educational
institutes and universities around the world. Equally, the book chapters are very
relevant and useful for scientists interested in learning and applying advanced HTO
studies, integrative HTO data analyses (e.g. eQTLs and mQTLs) and computational
systems biology techniques to animal production, health and welfare. One of the
Contents
Abstract
Genetic differences between individuals associated with quantitative phenotypic
traits, including disease states, are usually found in noncoding genomic regions.
These genetic variants are often also associated with differences in expression levels
of nearby genes (they are “expression quantitative trait loci” or eQTLs, for
short) and presumably play a gene regulatory role, affecting the status of molecu-
lar networks of interacting genes, proteins, and metabolites. Computational sys-
tems biology approaches to reconstruct causal gene networks from large-scale
omics data have therefore become essential to understand the structure of net-
works controlled by eQTLs together with other regulatory genes, as well as to
generate detailed hypotheses about the molecular mechanisms that lead from
genotype to phenotype. Here we review the main analytical methods and soft-
ware to identify eQTLs and their associated genes, to reconstruct coexpression
networks and modules, to reconstruct causal Bayesian gene and module net-
works, and to validate predicted networks in silico.
1 Introduction
Genetic differences between individuals are responsible for variation in the observ-
able phenotypes. This principle underpins genomewide association studies (GWAS),
which map the genetic architecture of complex traits by measuring genetic variation
at single-nucleotide polymorphisms (SNPs) on a genomewide scale across many
plant and animal breeding (Goddard and Hayes 2009) and in numerous insights into
the genetic basis of complex diseases in humans (Manolio 2013). However, quantitative
trait loci (QTLs) with large effects are uncommon and a molecular explanation
for their trait association rarely exists (Mackay et al. 2009). The vast majority of
QTLs indeed lie in noncoding genomic regions and presumably play a gene regula-
tory role (Hindorff et al. 2009; Schaub et al. 2012). Consequently, numerous studies
have identified cis- and trans-acting DNA variants that influence gene expression
levels (i.e., “expression QTLs”; eQTLs) in model organisms, plants, farm animals,
and humans (reviewed in Rockman and Kruglyak 2006; Georges 2007; Cookson
et al. 2009; Cheung and Spielman 2009; Cubillos et al. 2012). Gene expression
programs are of course highly tissue- and cell-type specific, and the properties and
complex relations of eQTL associations across multiple tissues are only beginning
to be mapped (Dimas et al. 2009; Foroughi Asl et al. 2015; Greenawalt et al. 2011;
Ardlie et al. 2015). At the molecular level, a mounting body of evidence shows that
cis-eQTLs primarily cause variation in transcription factor (TF) binding to gene
regulatory DNA elements, which then causes changes in histone modifications,
DNA methylation, and mRNA expression of nearby genes; trans-eQTLs in turn can
usually be attributed to coding variants in regulatory genes or cis-eQTLs of such
genes (Albert and Kruglyak 2015).
Taken together, these results motivate and justify a systems biological view of
quantitative genetics (“systems genetics”), where it is hypothesized that genetic
variation, together with environmental perturbations, affects the status of molecular
networks of interacting genes, proteins, and metabolites; these networks act within
and across different tissues and collectively control physiological phenotypes
(Williams 2006; Kadarmideen et al. 2006; Rockman 2008; Schadt 2009; Schadt and
Björkegren 2012; Civelek and Lusis 2014; Björkegren et al. 2015). Studying the
impact of genetic variation on gene regulation networks is of crucial importance in
understanding the fundamental biological mechanisms by which genetic variation
causes variation in phenotypes (Chen et al. 2008), and it is expected to lead to the
discovery of novel disease biomarkers and drug targets in human and veterinary
medicine (Schadt et al. 2009). Because the direct experimental mapping of genetic,
protein–protein, or protein–DNA interactions is an immensely challenging task,
further exacerbated by the cell-type-specific and dynamic nature of these interac-
tions (Walhout 2006), comprehensive, experimentally verified molecular networks
will not become available for multi-cellular organisms in the foreseeable future.
Statistical and computational methods are therefore essential to reconstruct trait-
associated causal networks by integrating diverse omics data (Rockman 2008;
Schadt 2009; Ritchie et al. 2015).
A typical systems genetics study collects genotype and gene, protein, and/or
metabolite expression data from a large number of individuals segregating for one
or more traits of interest. After raw data processing and normalization, eQTLs are
identified for each of the expression data types, and a coexpression matrix is con-
structed. Causal Bayesian gene networks, coexpression modules (i.e., clusters), and/or
causal Bayesian module networks are then reconstructed. The in silico validation
of predicted networks and modules using independent data confirms their overall
validity, ideally followed by the experimental validation of the most promising
findings in a relevant cell line or model organism (Fig. 1).
Detection of Regulator Genes and eQTLs in Gene Networks
Fig. 1 A flow chart for a typical systems genetics study and the corresponding software. Steps in
light yellow are covered in this chapter
Here we review the main
analytic principles behind each of the steps from eQTL identification to in silico
network validation and present a selection of most commonly used methods and
software for each step. Throughout this chapter, we tacitly assume that all data have
been quality controlled, preprocessed, and normalized to suit the assumptions of the
analytic methods presented here. For expression data, this usually means working
with log-transformed data where each gene expression profile is centered around
zero with standard deviation one. We also assume that the data have been corrected
for any confounding factors, either by regressing out known covariates or by esti-
mating hidden factors (Stegle et al. 2012).
4 L. Wang and T. Michoel
$$\sum_{l=1}^{n} X_{il} = \sum_{l=1}^{n} G_{jl} = 0 \quad \text{and} \quad \sum_{l=1}^{n} X_{il}^2 = \sum_{l=1}^{n} G_{jl}^2 = n,$$
$$\bar{X}_i^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l : G_{jl} = m\}} X_{il},$$
Again assuming that the expression data are standardized, the F-test statistic for
testing gene i against SNP j can be written as
$$F_i^{(j)} = \frac{n - \ell}{\ell - 1} \, \frac{SS_i^{(j)}}{n - SS_i^{(j)}},$$
where $\ell$ is the number of genotype groups.
Let us define the $n \times s$ indicator matrix $I^{(m)}$ for genotype group m, i.e., $I^{(m)}_{lj} = 1$
if $G_{jl} = m$ and $0$ otherwise. Then
$$\sum_{\{l : G_{jl} = m\}} X_{il} = \left( X I^{(m)} \right)_{ij}.$$
Hence, for each pair of expression level $X_i$ and SNP $G_j$, the sum of squares matrix
$SS_i^{(j)}$ can be computed via $\ell - 1$ matrix multiplications¹.
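The matrix formulation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used by any of the cited packages: the function name `anova_f_all_pairs` and the parameter `n_groups` (the number of genotype groups) are hypothetical, every genotype group is assumed to be observed for every SNP, and for readability the loop performs one multiplication per group instead of saving one through the standardization.

```python
import numpy as np

def anova_f_all_pairs(X, G, n_groups=3):
    """F statistics for every gene (row of X) against every SNP (row of G).

    X: k x n expression matrix, rows standardized (mean 0, sum of squares n).
    G: s x n integer genotype matrix with values 0, ..., n_groups - 1.
    Assumes every genotype group is observed for every SNP.
    """
    k, n = X.shape
    ss = np.zeros((k, G.shape[0]))
    for m in range(n_groups):
        ind = (G == m).T.astype(float)   # n x s indicator matrix of group m
        sums = X @ ind                   # entry (i, j): sum of X[i, l] over samples in group m of SNP j
        sizes = ind.sum(axis=0)          # group sizes, one per SNP
        # group size times squared group mean equals sums**2 / size
        ss += np.divide(sums**2, sizes, out=np.zeros_like(sums), where=sizes > 0)
    return (n - n_groups) / (n_groups - 1) * ss / (n - ss)

# Simulated example data (not from the chapter)
rng = np.random.default_rng(0)
n = 60
G = rng.integers(0, 3, size=(4, n))                  # 4 SNPs
X = rng.normal(size=(5, n))                          # 5 genes
X -= X.mean(axis=1, keepdims=True)
X /= np.sqrt((X**2).mean(axis=1, keepdims=True))     # sum of squares per row equals n
F = anova_f_all_pairs(X, G)                          # 5 x 4 matrix of F statistics
```

All gene-SNP pairs are handled by a handful of matrix products, which is where the speedup over pair-by-pair testing comes from.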
In the nonparametric ANOVA model, the expression data matrix is converted to
a matrix T of data ranks, independently over each row. In the absence of ties, the
Kruskal–Wallis test statistic is given by
$$S_{ij} = \frac{12}{n(n+1)} \sum_{m=0}^{\ell-1} n^{(m,j)} \left( \bar{T}_i^{(m,j)} \right)^2 - 3(n+1),$$
where $\bar{T}_i^{(m,j)}$ is the average expression rank of gene i in genotype group m of SNP j,
defined as
$$\bar{T}_i^{(m,j)} = \frac{1}{n^{(m,j)}} \sum_{\{l : G_{jl} = m\}} T_{il},$$
which can be similarly obtained from the $\ell - 1$ matrix multiplications.
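The rank-based statistic can be computed for all pairs in the same way as the parametric one. The following is a minimal sketch, assuming no ties in the expression data and that every genotype group is observed; the name `kruskal_wallis_all_pairs` and the parameter `n_groups` are hypothetical, and the data are simulated.

```python
import numpy as np

def kruskal_wallis_all_pairs(X, G, n_groups=3):
    """Kruskal-Wallis statistics for every gene (row of X) against every SNP (row of G).

    Assumes no ties within a row of X and genotype values 0, ..., n_groups - 1.
    """
    k, n = X.shape
    # Rank each row independently; ranks run from 1 to n
    T = X.argsort(axis=1).argsort(axis=1) + 1.0
    S = np.zeros((k, G.shape[0]))
    for m in range(n_groups):
        ind = (G == m).T.astype(float)   # n x s indicator matrix of group m
        rank_sums = T @ ind              # rank sums per gene and SNP within group m
        sizes = ind.sum(axis=0)
        # group size times squared mean rank equals rank_sums**2 / size
        S += np.divide(rank_sums**2, sizes, out=np.zeros_like(rank_sums), where=sizes > 0)
    return 12.0 / (n * (n + 1)) * S - 3 * (n + 1)

# Simulated example data (not from the chapter)
rng = np.random.default_rng(1)
n = 60
G = rng.integers(0, 3, size=(4, n))
X = rng.normal(size=(5, n))
S = kruskal_wallis_all_pairs(X, G)       # 5 x 4 matrix of test statistics
```

Because only ranks enter the computation, a single outlying expression value cannot dominate the statistic, in line with the robustness noted in the text.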
There is as yet no consensus about which statistical model is most appropriate
for eQTL detection. Nonparametric methods were introduced in the earliest eQTL
studies (Brem et al. 2002; Schadt et al. 2008) and have remained popular, as they are
robust against variations in the underlying genetic model and trait distribution.
More recently, the linear model implemented in Matrix eQTL has been used in a
number of large-scale studies (Ardlie et al. 2015; Lappalainen et al. 2013). A com-
parison on a data set of 102 human whole blood samples showed that the parametric
ANOVA method was highly sensitive to the presence of outlying gene expression
¹ There are only $\ell - 1$ matrix multiplications, because the data standardization implies that $X I^{(0)} = -\sum_{m=1}^{\ell-1} X I^{(m)}$.
values and SNPs with singleton genotype groups. Linear models reported the highest
number of eQTL associations after empirical False Discovery Rate (FDR) correc-
tion, with an expected bias toward additive linear associations. The Kruskal–Wallis
test was most robust against data outliers and heterogeneous genotype group sizes
and detected a higher proportion of nonlinear associations but was more conserva-
tive for calling additive linear associations than linear models (Qi et al. 2014).
In summary, when large numbers of traits and markers have to be tested for asso-
ciation, efficient matrix multiplication methods can be used to calculate all test sta-
tistics at once, leading to a dramatic reduction in computation time compared with
calculating these statistics one by one for every pair using traditional methods.
Matrix multiplication is a basic mathematical operation, which has been intensively
studied and optimized for decades (Golub and Van Loan 1996). Highly efficient
packages, such as BLAS (http://www.netlib.org/blas/) and LAPACK
(http://www.netlib.org/lapack/), are available for use on generic CPUs and are indeed used
in most mainstream scientific computing software and programming languages,
such as Matlab and R. In recent years, graphics processing unit (GPU)-accelerated
computing, such as CUDA, has revolutionized scientific calculations that involve
repetitive operations in parallel on bulky data, offering even more speedup than the
existing CPU-based packages. The first applications of GPU computing in eQTL
analysis have already appeared (e.g., Hemani et al. 2014), and more can be expected
in the future.
Lastly, for pairs exceeding a predefined threshold on the test statistic, a p-value
can be computed from the corresponding test distribution, and these p-values can
then be further corrected for multiple testing by common procedures (Shabalin
2012; Qi et al. 2014).
The Pearson correlation is the simplest and computationally most efficient similar-
ity measure for gene expression profiles. For genes i and j, their Pearson correlation
can be written as
$$C_{ij} = \sum_{l=1}^{n} X_{il} X_{jl}. \tag{1}$$
$$C = X X^{T}.$$
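As a minimal sketch of the matrix form, the code below assumes that each row of X is centered and scaled to unit sum of squares, so that the product yields values between -1 and 1; under the convention where each row has sum of squares n, the product would additionally be divided by n. The function names are hypothetical.

```python
import numpy as np

def standardize_rows(X):
    """Center each row and scale it to unit sum of squares (assumed convention)."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt((X**2).sum(axis=1, keepdims=True))

def pearson_matrix(X):
    """All pairwise Pearson correlations between rows of X via one matrix product."""
    Z = standardize_rows(X)
    return Z @ Z.T

# Simulated example: 6 genes measured in 40 samples
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 40))
C = pearson_matrix(X)        # 6 x 6 symmetric matrix with unit diagonal
```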
Gene pairs with large positive or negative correlation values tend to be up- or down-
regulated together due to either a direct regulatory link between them or being
jointly coregulated by a third, often hidden, factor. By filtering for correlation values
exceeding a significance threshold determined by comparison with randomly
degree of coexpression signifies that genes are involved in the same biological pro-
cesses, graph theoretical methods can be used, for instance, to predict gene function
(Sharan et al. 2007).
One drawback of the Pearson correlation is that by definition, it is biased toward
linear associations. To overcome this limitation, other measures are available. The
Spearman correlation uses expression data ranks (cf. Section 2) in Eq. (1) and
gives a high score to monotonic relations. Mutual information is the most general
measure and detects both linear and nonlinear associations. For a pair of discrete
random variables A and B (representing the expression levels of two genes) taking
values al and bm, respectively, the mutual information is defined as
$$MI(A, B) = H(A) + H(B) - H(A, B),$$
where
$$H(A) = -\sum_{l} P(a_l) \log P(a_l),$$
$$H(B) = -\sum_{m} P(b_m) \log P(b_m),$$
$$H(A, B) = -\sum_{l,m} P(a_l, b_m) \log P(a_l, b_m),$$
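The entropy-based definition can be illustrated with a simple plug-in estimate. The sketch below assumes that continuous expression values are discretized into equal-width bins (the bin count of 5 is an arbitrary choice), and the function names are hypothetical; more refined estimators exist and are not shown here.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability table, ignoring empty cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(a, b, bins=5):
    """Plug-in estimate of MI(A, B) = H(A) + H(B) - H(A, B) from binned data."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)               # marginal distribution of A
    p_b = p_ab.sum(axis=0)               # marginal distribution of B
    return entropy(p_a) + entropy(p_b) - entropy(p_ab)

# Simulated example: a nonlinear (quadratic) relation and an unrelated variable
rng = np.random.default_rng(3)
a = rng.uniform(-1, 1, size=2000)
mi_nonlinear = mutual_information(a, a**2)
mi_unrelated = mutual_information(a, rng.uniform(-1, 1, size=2000))
```

In this simulated case the quadratic relation has a Pearson correlation close to zero by symmetry, while its mutual information is clearly larger than that of the unrelated pair.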
$$S(M) = \sum_{l=1}^{N} \left( \frac{W(M_l, M_l)}{W(M_0, M_0)} - \left( \frac{W(M_l, M_0)}{W(M_0, M_0)} \right)^2 \right),$$
where $W(A, B) = \sum_{i \in A,\, j \in B,\, i \neq j} w(C_{ij})$ is a weight function, summing over all the edges
that connect one vertex in A with another vertex in B, and w(x) is a monotonic
function to map correlation values to edge strengths. Common functions are
$w(x) = |x|$, $|x|^{\beta}$ (power law) (Langfelder and Horvath 2008), $e^{\beta |x|}$ (exponential)
(Ayroles et al. 2009), or $1 / (1 + e^{\beta x})$ (sigmoid) (Lee et al. 2009).
• Inflation: The algorithm first contrasts stronger direct connections against weaker
ones, using an element-wise power law transformation, and normalizes each column
separately to sum to one, such that the element $C_{ij}$ corresponds to the
dissipation rate from vertex $X_i$ to $X_j$ in a single step. The inflation operation hence
updates C as $C \to \Gamma_\alpha C$, where the contrast rate $\alpha > 1$ is a predefined parameter
of the algorithm. After operation $\Gamma_\alpha$, each element of C becomes
$$C_{ij} \to \Gamma_\alpha C_{ij} = \frac{|C_{ij}|^{\alpha}}{\sum_{p=1}^{k} |C_{pj}|^{\alpha}}.$$
• Expansion: The probability flow matrix C controls the random walks performed
in the expansion phase. After some integer $\beta \ge 2$ steps of random walk, gene
pairs with strong direct connections and/or strong indirect connections through
other genes tend to see more probability flow exchanges, suggesting higher
probabilities of belonging to the same gene modules. The expansion operation for the
β-step random walk corresponds to the matrix power operation
$$C \to C^{\beta}.$$
The MCL algorithm performs the above two operations iteratively until conver-
gence. Nonzero entries in the convergent matrix C connect gene pairs belonging to
the same cluster, whereas all inter-cluster edges attain the value zero, so that cluster
structure can be obtained directly from this matrix (Van Dongen 2001; Enright et al.
2002).
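The alternation of inflation and expansion can be sketched as follows. This is an illustrative toy version, not the reference MCL implementation: the function names, the default parameters, the convergence tolerance, and the use of absolute correlations as initial edge weights are all assumptions made for the example.

```python
import numpy as np

def mcl(C, alpha=2.0, beta=2, max_iter=100, tol=1e-8):
    """Toy Markov clustering: alternate inflation and expansion until convergence."""
    M = np.abs(np.asarray(C, dtype=float))
    M = M / M.sum(axis=0, keepdims=True)             # columns sum to one
    for _ in range(max_iter):
        inflated = M ** alpha                        # inflation: element-wise power ...
        inflated /= inflated.sum(axis=0, keepdims=True)   # ... followed by column normalization
        M_next = np.linalg.matrix_power(inflated, beta)   # expansion: beta-step random walk
        converged = np.abs(M_next - M).max() < tol
        M = M_next
        if converged:
            break
    return M

def clusters_from_matrix(M, eps=1e-6):
    """Read clusters from the nonzero pattern of the rows of the converged matrix."""
    clusters = []
    for row in M:
        members = frozenset(np.flatnonzero(row > eps))
        if members and members not in clusters:
            clusters.append(members)
    return [sorted(int(i) for i in c) for c in clusters]

# Toy correlation matrix: two groups of three strongly coexpressed genes
C = np.full((6, 6), 0.05)
C[:3, :3] = 0.9
C[3:, 3:] = 0.9
np.fill_diagonal(C, 1.0)
modules = clusters_from_matrix(mcl(C))
```

On this toy input the weak between-group entries shrink toward zero within a few iterations, so the two groups can be read off directly from the converged matrix.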
boring genes, and the remaining terms normalize the output as $0 \le w_{ij} \le 1$. This concept
was later extended to networks with weighted edges by applying a “soft threshold”
preprocessing step to the correlation matrix, for example, as
$$A_{ij} = \left( \frac{1 + C_{ij}}{2} \right)^{\alpha},$$
or
$$A_{ij} = |C_{ij}|^{\alpha},$$
such that $0 \le A_{ij} \le 1$ (Zhang and Horvath 2005). Note that in the first case, only
positive correlations have high edge weight, whereas in the second case, positive
and negative correlations are treated equally. The parameter $\alpha > 1$ is determined
such that the weighted network with adjacency matrix A has approximately a scale-free
degree distribution (Zhang and Horvath 2005).
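The two soft-threshold transformations can be written compactly as below. This is a sketch only: the function name, the `signed` flag, and the default exponent are hypothetical choices for illustration, and the selection of the exponent by the scale-free criterion is not shown.

```python
import numpy as np

def soft_threshold_adjacency(C, alpha=6.0, signed=True):
    """Map a correlation matrix to edge weights in [0, 1].

    signed=True  uses ((1 + C) / 2) ** alpha, favoring positive correlations;
    signed=False uses |C| ** alpha, treating both signs equally.
    """
    C = np.asarray(C, dtype=float)
    if signed:
        return ((1.0 + C) / 2.0) ** alpha
    return np.abs(C) ** alpha

# A strongly negative correlation between two genes
C = np.array([[1.0, -0.8],
              [-0.8, 1.0]])
A_signed = soft_threshold_adjacency(C, alpha=2.0, signed=True)
A_unsigned = soft_threshold_adjacency(C, alpha=2.0, signed=False)
```

For the correlation of -0.8, the first form gives a weight of about 0.01 and the second about 0.64, which illustrates the difference in how negative correlations are treated.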
In principle, any clustering algorithm (including the aforementioned ones) can
be applied to the topological overlap matrix W . In the popular WGCNA software
(http://labs.genetics.ucla.edu/horvath/htdocs/CoexpressionNetwork/Rpackages/
WGCNA/) (Langfelder and Horvath 2008), which is a multipurpose toolbox for
Model-Based Clustering
Model-based clustering approaches assume that the observed data are generated by
a mixture of probability distributions, one for each cluster, and take explicitly into
account the noise of gene expression data. To infer model parameters and cluster
assignments, techniques such as expectation maximization (EM) or Gibbs sampling
are used (Liu 2002). A recently developed method assumes that the expression lev-
els of genes in a cluster are random samples drawn from a mixture of normal distri-
butions, where each mixture component corresponds to a clustering of samples for
that module, i.e., it performs a two-way co-clustering operation (Joshi et al. 2008).
The method is available as part of the Lemon-Tree package (https://github.com/
eb00/lemon-tree) and has been successfully used in a variety of applications (Bonnet
et al. 2015).
The co-clustering is carried out by a Gibbs sampler, which iteratively updates the
assignment of each gene and, within each gene cluster, the assignment of each
experimental condition. The co-clustering operation results in the full posterior
distribution, which can be written as
$$p(C \mid X) \propto \prod_{l=1}^{N} \prod_{u=1}^{L_l} \iint p(\mu, \tau) \prod_{i \in M_l} \prod_{m \in E_{l,u}} p(X_{im} \mid \mu, \tau) \, d\mu \, d\tau,$$
4 Causal Gene Networks
2008; Schadt et al. 2005; Neto et al. 2008, 2013; Millstein et al. 2009). Although
varying in their statistical details, these methods conclude that gene A is causal for
gene B, if the expression of B associates significantly with A’s eQTLs, and this
association is abolished by conditioning on the expression of A and on any other
known confounding factors. In essence, this is the principle of “Mendelian random-
ization,” first introduced in epidemiology as an experimental design to detect causal
effects of environmental exposures on human health (Smith and Ebrahim 2003),
applied to gene expression traits.
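The conditioning step described above can be illustrated with a partial correlation on simulated data. The sketch assumes a linear, additive setting with a true structure E → A → B; the function name and all simulation settings are hypothetical, and the cited methods use more elaborate statistical tests than this single coefficient.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after conditioning on z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_zy = np.corrcoef(z, y)[0, 1]
    return (r_xy - r_xz * r_zy) / np.sqrt((1 - r_xz**2) * (1 - r_zy**2))

# Simulated causal chain E -> A -> B
rng = np.random.default_rng(4)
n = 2000
e = rng.binomial(2, 0.5, size=n).astype(float)   # genotype coded 0, 1, 2
a = e + rng.normal(size=n)                       # A depends on its eQTL E
b = a + rng.normal(size=n)                       # B depends on A only

r_eb = np.corrcoef(e, b)[0, 1]                   # B associates with A's eQTL
r_eb_given_a = partial_correlation(e, b, a)      # association vanishes given A
r_ea_given_b = partial_correlation(e, a, b)      # but E and A stay associated given B
```

The asymmetry between the last two quantities is what allows the direction A → B to be preferred over B → A in this idealized setting.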
To illustrate how these methods work, let A and B be two random variables
representing two gene expression traits, and let E be a random variable representing a
SNP, which is an eQTL for genes A and B. Because genotype cannot be altered by
gene expression (i.e., E cannot have any incoming edges), there are three possible
regulatory models to explain the joint association of E to A and B:
• p ( E , A, B ) = p ( E ) p ( A | E ) p ( B | A ) ,
• p ( E , A, B ) = p ( E ) p ( B | E ) p ( A | B ) ,
• p ( E , A, B ) = p ( E ) p ( A | E ) p ( B | E , A ) ,
where the dependence on A in the last term of the last model indicates that there
may be a residual correlation between B and A not explained by E. To determine if
gene A mediates the effect of SNP E on gene B (model 1), one can test whether
conditioning on A abolishes the correlation between E and B, using the partial
correlation coefficient of E and B given A. The minimal
additive model assumes the distributions are (Schadt et al. 2005)
12 L. Wang and T. Michoel
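$$E \sim \mathrm{Bernoulli}(q),$$
$$A \mid E \sim N\left( \mu_{A|E}, \sigma_A^2 \right),$$
$$B \mid A \sim N\left( \mu_B + \rho \frac{\sigma_B}{\sigma_A} (A - \mu_A), \; (1 - \rho^2) \sigma_B^2 \right),$$
$$B \mid E, A \sim N\left( \mu_{B|E} + \rho \frac{\sigma_B}{\sigma_A} (A - \mu_{A|E}), \; (1 - \rho^2) \sigma_B^2 \right),$$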
so that E follows a Bernoulli distribution, A | E follows a normal distribution
whose mean depends on E, and B | A follows a conditional normal distribution
whose mean and variance are determined in part by A. For (B | E, A), the mean
of B also depends on E. The parameters of all distributions can be estimated by
maximum likelihood, and the model with the highest likelihood is selected as
the most likely causal model. The number of free parameters can be accounted
for using penalties such as the Akaike information criterion (AIC) (Schadt et al.
2005).
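A rough sketch of likelihood-based selection among the three models is given below. It simplifies the parametrization above by fitting each conditional distribution as an ordinary least-squares regression with Gaussian noise, and it omits the common factor p(E), which does not affect the comparison. The function names, model labels, and simulated data are hypothetical.

```python
import numpy as np

def gaussian_loglik(y, predictors):
    """Maximized log-likelihood of y ~ N(linear function of predictors, sigma^2).

    Returns the log-likelihood and the number of free parameters.
    """
    n = y.size
    D = np.column_stack([np.ones(n)] + list(predictors))
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    sigma2 = np.mean((y - D @ coef) ** 2)            # maximum likelihood variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0), D.shape[1] + 1

def causal_model_aic(e, a, b):
    """AIC of the three candidate models; lower values indicate a better model."""
    models = {
        "E->A->B": [(a, [e]), (b, [a])],             # p(A|E) p(B|A)
        "E->B->A": [(b, [e]), (a, [b])],             # p(B|E) p(A|B)
        "E->A, (E,A)->B": [(a, [e]), (b, [e, a])],   # p(A|E) p(B|E,A)
    }
    aic = {}
    for name, factors in models.items():
        loglik, n_par = 0.0, 0
        for y, preds in factors:
            ll, p = gaussian_loglik(y, preds)
            loglik += ll
            n_par += p
        aic[name] = 2 * n_par - 2 * loglik
    return aic

# Simulated data with true structure E -> A -> B
rng = np.random.default_rng(5)
e = rng.binomial(1, 0.5, size=1000).astype(float)
a = 2.0 * e + rng.normal(size=1000)
b = 0.8 * a + rng.normal(size=1000)
aic = causal_model_aic(e, a, b)
best = min(aic, key=aic.get)
```

The third model always attains a likelihood at least as high as the first, since it contains it as a special case; the AIC penalty is what makes the simpler causal model selectable.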
The approach has been extended in various ways. In the study of Chen et al.
(2007), likelihood ratio tests, comparison to randomly permuted data, and false
discovery rate estimation techniques are used to convert the three model scores into a
single probability value $P(A \to B)$ for a causal interaction from gene A to B. This
method is available in the Trigger software
(https://www.bioconductor.org/packages/release/bioc/html/trigger.html). In the studies of Millstein et al. (2009) and
Neto et al. (2013), the model selection task is recast into a single hypothesis test,
using F-tests and Vuong’s model selection test, respectively, resulting in a
significance p-value for each gene–gene causal interaction.
It should be noted that all of these approaches suffer from limitations due to their
inherent model assumptions. In particular, the presence of unequal levels of mea-
surement noise among genes, or of hidden regulatory factors causing additional
correlation among genes, can confuse causal inference. For example, an excessive
error level in the expression data of gene A may cause the true structure
E → A → B to be mistaken for E → B → A. These limitations are discussed by Rockman (2008)
and Li et al. (2010).
We adopt our previous convention in Section 2, where we have the gene expression
data X and genetic markers G. The model contains a total of k vertices (i.e.,
random variables), $X_i$ with $i = 1, \dots, k$, corresponding to the expression level of gene
i. Given a DAG $\mathcal{G}$, and denoting the parental vertex set of $X_i$ by $\mathrm{Pa}^{(\mathcal{G})}(X_i)$, the
acyclic property of $\mathcal{G}$ allows the joint probability distribution function to be defined as
$$p(X_1, \dots, X_k \mid \mathcal{G}) = \prod_{i=1}^{k} p\left( X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i) \right). \tag{2}$$
$$p\left( X_i \mid \mathrm{Pa}^{(\mathcal{G})}(X_i) \right) = N\left( \alpha_i + \sum_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} \beta_{ji} (X_j - \alpha_j), \; \sigma_i^2 \right),$$
where $(\alpha_i, \sigma_i)$ and $\beta_{ji}$ are parameters for vertex $X_i$ and edge $X_j \to X_i$, respectively, as
part of the DAG structure $\mathcal{G}$. Under such modeling, the Bayesian network is called
a linear Gaussian network.
The likelihood of data X given the graph $\mathcal{G}$ is
$$p(X \mid \mathcal{G}) = \prod_{i=1}^{k} \prod_{l=1}^{n} p\left( X_{il} \mid \{ X_{jl}, X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i) \} \right).$$
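The factorized likelihood can be evaluated vertex by vertex. The sketch below assumes a linear Gaussian network whose parameters are set to their maximum likelihood values, obtained by regressing each gene on its parents; the function name and the dictionary representation of the graph are hypothetical, and the code does not check that the supplied parent sets form an acyclic graph.

```python
import numpy as np

def linear_gaussian_loglik(X, parents):
    """Maximized log-likelihood of a linear Gaussian network.

    X: k x n expression matrix (genes by samples).
    parents: dict mapping a vertex index to the list of its parent indices.
    """
    k, n = X.shape
    total = 0.0
    for i in range(k):
        # Design matrix: intercept plus the expression profiles of the parents of gene i
        D = np.column_stack([np.ones(n)] + [X[j] for j in parents.get(i, [])])
        coef, *_ = np.linalg.lstsq(D, X[i], rcond=None)
        sigma2 = np.mean((X[i] - D @ coef) ** 2)
        total += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return total

# Simulated chain X0 -> X1 -> X2
rng = np.random.default_rng(6)
n = 300
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(size=n)
X = np.vstack([x0, x1, x2])

ll_chain = linear_gaussian_loglik(X, {1: [0], 2: [1]})
ll_empty = linear_gaussian_loglik(X, {})
```

Since adding parents can never decrease the maximized likelihood, structure learning needs a complexity penalty or a prior over graphs to avoid always preferring the densest network.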
Using Bayes’ rule, the log-likelihood of the DAG $\mathcal{G}$ based on the gene expression
data X becomes
to use any of the methods in the previous section to calculate the probability
$P(X_i \to X_j)$ of a causal interaction from $X_i$ to $X_j$ (Zhu et al. 2004, 2008, 2012;
Zhang et al. 2013), for example, by defining the prior as
$$p(\mathcal{G}) = \prod_{X_i} \left( \prod_{X_j \in \mathrm{Pa}^{(\mathcal{G})}(X_i)} P(X_j \to X_i) \prod_{X_j \notin \mathrm{Pa}^{(\mathcal{G})}(X_i)} \left( 1 - P(X_j \to X_i) \right) \right).$$
A more ambitious approach is to jointly learn the eQTL associations and causal trait (i.e., gene or
phenotype) networks. In the study of Neto et al. (2010), EM is used to alternately
map eQTLs given the current DAG structure and update the DAG structure and
model parameters given the current eQTL mapping. In the study of Scutari et al.
(2014), Bayesian networks are learned where SNPs and traits both enter as variables
in the model, with the constraint that traits can depend on SNPs, but not vice versa.
However, the additional complexity of both methods means that they are computa-
tionally expensive and have only been applied to problems with a handful of traits
(Neto et al. 2010; Scutari et al. 2014).
A few additional “tips and tricks” are worth mentioning:
• First, when the number of vertices is much larger than the sample count, we may
break the problem into independent subproblems by learning a separate Bayesian
network for each coexpression module (Section 3.1 and Zhang et al. 2013).
Dependencies between modules could then be learned as a Bayesian network
among the module eigengenes (Langfelder and Horvath 2007), although this
does not seem to have been explored.
• Second, Bayesian network learning algorithms inevitably result in locally optimal
models, which may contain a high number of false positives. To address this
problem, we can run the algorithm multiple times and report an averaged network,
consisting only of edges that appear sufficiently frequently.
• Finally, another technique that helps in distinguishing genuine dependencies
from false positives is bootstrapping, where resampling with replacement is executed
on the existing sample pool. A fixed number of samples are randomly
selected and then processed to predict a Bayesian network. This process is
repeated many times, essentially regarding the distribution of the sample pool as the
true PDF, which allows the robustness of each predicted edge to be estimated, so that
only those with high significance are retained (Friedman et al. 1999b). In theory,
even the whole pipeline of Fig. 1 up to the in silico validation could be simulated
in this way. Although bootstrapping is computationally expensive and mostly
suited for small data sets, it could be used in conjunction with the separation into
modules on larger data sets.
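The bootstrap procedure in the last item can be sketched generically. To keep the example short, the network learner is replaced by a hypothetical stand-in, `learn_edges`, which simply connects gene pairs whose absolute Pearson correlation exceeds a threshold; in practice a Bayesian network learning algorithm would take its place. All names, thresholds, and data are assumptions for illustration.

```python
import numpy as np

def learn_edges(X, threshold=0.5):
    """Hypothetical stand-in learner: undirected edge where |Pearson r| > threshold."""
    A = np.abs(np.corrcoef(X)) > threshold
    np.fill_diagonal(A, False)
    return A

def bootstrap_edge_frequency(X, learner, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which each edge is predicted."""
    rng = np.random.default_rng(seed)
    k, n = X.shape
    counts = np.zeros((k, k))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample the samples with replacement
        counts += learner(X[:, idx])          # accumulate the predicted edges
    return counts / n_boot

# Simulated data: genes 0 and 1 are tightly linked, genes 2 and 3 are unrelated
rng = np.random.default_rng(7)
n = 200
g0 = rng.normal(size=n)
g1 = g0 + 0.3 * rng.normal(size=n)
X = np.vstack([g0, g1, rng.normal(size=n), rng.normal(size=n)])

freq = bootstrap_edge_frequency(X, learn_edges)
robust_edges = freq >= 0.9                    # keep only edges predicted in most replicates
```

The same wrapper applies unchanged to any learner that returns an adjacency matrix, which makes it straightforward to combine with the per-module learning strategy from the first item.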
(Bonnet et al. 2015; Segal et al. 2003; Friedman 2004; Qu et al. 2016). The module
We have recently identified genomewide significant eQTLs for 6500 genes in seven
tissues from the Stockholm Atherosclerosis Gene Expression (STAGE) study
(Foroughi Asl et al. 2015) and performed coexpression clustering and causal net-
works reconstruction (Talukdar et al. 2016). To illustrate the above concepts, we
show some results for a coexpression cluster in visceral fat (88 samples, 324 genes),
which was highly enriched for tissue development genes ($P = 5 \times 10^{-10}$) and
contained 10 genomewide significant eQTL genes and 25 transcription factors, including
eight members of the homeobox family (Fig. 2a).
A representative example of an inferred causal interaction is given by the
coexpression interaction between huntingtin-associated protein 1 (HAP1, chr17 q21.2-21.3)
and forkhead box G1 (FOXG1, chr14 q11-q13). The expression of both genes
is highly correlated ($r = 0.85$, $P = 4.4 \times 10^{-24}$, Fig. 2b). HAP1 expression shows a
significant, nonlinear association with its eQTL rs1558285 ($P = 1.2 \times 10^{-4}$); this
SNP also associates significantly with FOXG1 expression in the cross-association
test ($P = 0.0024$), but not anymore after conditioning FOXG1 on HAP1 and its
own eQTL rs7160881 ($P = 0.67$) (Fig. 2c). By contrast, although FOXG1 expression
is significantly associated with its eQTL rs7160881 ($P = 0.0028$), there is no
association between this SNP and HAP1 expression ($P = 0.037$), and conditioning
on FOXG1 and HAP1’s eQTL has only a limited effect ($P = 0.19$) (Fig. 2d). Using
conditional independence tests (Section 4.1), this results in a high-confidence
prediction that HAP1 → FOXG1 is causal.
A standard greedy Bayesian network search algorithm (Schmidt et al. 2007) was
run on the aforementioned cluster of 324 genes. Figure 2e shows the predicted con-
sensus subnetwork of causal interactions between the 10 eQTLs and the 25 TFs.
This illustrates how a sparse Bayesian network can accurately represent the fully
connected coexpression network (all 35 genes have high mutual coexpression, cf.
Fig. 2a).
[Fig. 2 image, panels (a) to (f): heat-map color scale from −2 to 2; gene-labeled heat map and network panels; axis label “Standardized expression”]
Fig. 2 (a) Heat map of standardized expression profiles across 88 visceral fat samples for 10
eQTL genes and 25 TFs belonging to a coexpression cluster inferred from the STAGE data. (b)
Coexpression of HAP1 and FOXG1 across 88 visceral fat samples. (c) Association between
HAP1’s eQTL (rs1558285) and expression of HAP1 (red), FOXG1 (blue), and FOXG1 adjusted
for HAP1 and FOXG1’s eQTL (green). (d) Association between FOXG1’s eQTL (rs7160881) and
expression of FOXG1 (blue), HAP1 (red), and HAP1 adjusted for FOXG1 and HAP1’s eQTL
(green). (e) Causal interactions inferred between the same genes as in (a) using Bayesian network
inference. (f) Example of a regulatory module inferred by Lemon-Tree from the STAGE data. See
Section 4.4 for further details
applications (e.g., discovering novel candidate drug target genes and pathways)
Functional Enrichment
Organism-specific gene ontology databases contain structured functional gene
annotations (Ashburner et al. 2000). These databases can be used to construct gene
signature sets composed of genes annotated to the same biological process, molecu-
lar function or cellular component. Reconstructed gene networks can then be vali-
dated by testing for enriched connectivity of gene signature sets using a method
proposed by Zhu et al. (2008). For a given gene set, this method considers all network
nodes belonging to the set and their nearest neighbors, and from this set of
nodes and edges, the largest connected subnetwork is identified. Then the enrich-
ment of the gene set in this subnetwork is tested using the Fisher exact test and
compared with the enrichment of randomly selected gene sets of the same size.
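The enrichment test in the last step can be illustrated with a one-sided Fisher exact test written with the standard library only. This is a sketch under assumptions, not the implementation of Zhu et al. (2008): the function name and the toy counts are hypothetical, and the comparison against random gene sets of the same size (which would repeat the whole subnetwork construction per random set) is not shown.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_set, n_subnet, n_overlap):
    """One-sided Fisher exact (hypergeometric upper-tail) P-value.

    Probability of observing at least `n_overlap` gene-set members among
    `n_subnet` subnetwork genes, when `n_set` of the `n_universe` network
    genes belong to the gene set.
    """
    total = comb(n_universe, n_subnet)
    upper = min(n_set, n_subnet)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_subnet - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 network genes, 5 in the gene set,
# a connected subnetwork of 5 genes containing 4 gene-set members.
p_example = fisher_enrichment_p(20, 5, 5, 4)
print(p_example)
```

A small P-value indicates that the gene set is over-represented in the subnetwork relative to a random draw of the same size.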
(i.e., performed in a relevant cell or tissue type) gene knockout experiments from
the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) or ArrayExpress
(https://www.ebi.ac.uk/arrayexpress/), and comparing the overlap between gene
sets responding to a gene knockout and network genes predicted to be downstream
of the knockout gene. Overlap significance can be estimated by using randomized
networks with the same degree distribution as the predicted network.
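A minimal sketch of this overlap test is given below. It assumes a directed network stored as a list of (regulator, target) pairs and uses target swapping between edge pairs as the degree-preserving randomization. The toy network, the gene names, and all function names are hypothetical, and the number of swaps and random networks are arbitrary choices.

```python
import random


def downstream(edges, source):
    """Genes reachable from `source` by following directed edges."""
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    seen, stack = set(), [source]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(source)
    return seen


def rewire(edges, n_swaps, rng):
    """Randomize edges while keeping every in- and out-degree fixed."""
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        # Skip swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new1 in present or new2 in present:
            continue
        present -= {edges[i], edges[j]}
        present |= {new1, new2}
        edges[i], edges[j] = new1, new2
    return edges


def overlap_p(edges, knockout, responders, n_rand=200, seed=0):
    """Observed overlap and its empirical P-value under rewiring."""
    rng = random.Random(seed)
    observed = len(downstream(edges, knockout) & responders)
    hits = sum(
        len(downstream(rewire(edges, 10 * len(edges), rng), knockout)
            & responders) >= observed
        for _ in range(n_rand)
    )
    return observed, (hits + 1) / (n_rand + 1)


# Hypothetical predicted network and knockout-responsive gene set.
edges = [("KO", "g1"), ("KO", "g2"), ("g1", "g3"), ("g2", "g4"),
         ("tf", "g5"), ("tf", "g6"), ("g5", "g7"), ("g6", "g8")]
responders = {"g1", "g2", "g3", "g4"}
observed, p_value = overlap_p(edges, "KO", responders)
print(observed, p_value)
```

Adding one to both numerator and denominator keeps the empirical P-value strictly positive, a common convention for permutation-style tests.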
the coming years, along with the availability of similar measurements in other cell
types. Despite the challenging heterogeneity of data and analyses in the integration
of multi-omics data, web-based toolboxes, such as GenomeSpace (http://www.
genomespace.org) (Qu et al. 2016), can prove helpful to nonprogrammer
researchers.
Conclusions
In this chapter, we have reviewed the main methods and software to carry out a
systems genetics analysis, which combines genotype and various omics data to
identify eQTLs and their associated genes, to reconstruct coexpression networks
and modules, to reconstruct causal Bayesian gene and module networks, and to
validate predicted networks in silico. Several method and software options are
available for each of these steps, and by necessity, a subjective choice about
which ones to include had to be made, based largely on their ability to handle
large data sets, their popularity in the field, and our personal experience of using
them. Where methods have been compared in the literature, the comparisons have usually
been performed on a small number of data sets for a specific subset of tasks, and
results have rarely been conclusive. That is, although each of the presented methods
will give somewhat different results, no objective measure will consistently
select one of them as the “best” one. Given this lack of an objective criterion,
the reader may well prefer to use a single software package that performs all of
the presented analyses, but such integrated software does not currently exist.
Nearly all of the examples discussed referred to the integration of genotype
and transcriptome data, reflecting the current dominant availability of these two
data types. However, omics technologies are evolving at a fast pace, and it is
clear that data on the variation of TF binding, histone modifications, and post-
transcriptional and protein expression levels will soon become more widely
available. Developing appropriate statistical models and computational methods
to infer causal gene regulation networks from these multi-omics data sets is
surely the most important challenge for the field.
Acknowledgments The authors’ work is supported by the BBSRC (BB/M020053/1) and Roslin
Institute Strategic Grant funding from the BBSRC (BB/J004235/1).
References
Albert FW, Kruglyak L (2015) The role of regulatory variation in complex traits and disease. Nat
Rev Genet 16:197–212
Ardlie KG et al (2015) The genotype-tissue expression (GTEx) pilot analysis: multitissue gene
regulation in humans. Science 348:648–660
Ashburner M et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology
Consortium. Nat Genet 25:25–29
Aten JE et al (2008) Using genetic markers to orient the edges in quantitative trait networks: the
NEO software. BMC Syst Biol 2:34
Ayroles JF et al (2009) Systems genetics of complex traits in Drosophila melanogaster. Nat Genet
41:299–307
Basso K et al (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet
37:382–390
Björkegren JL et al (2015) Genome-wide significant loci: how important are they?: systems genet-
ics to understand heritability of coronary artery disease and other common complex disorders.
J Am Coll Cardiol 65:830–845
Bonnet E, Calzone L, Michoel T (2015) Integrative multi-omics module network inference with
Lemon-Tree. PLoS Comput Biol 11:e1003983
Brem RB et al (2002) Genetic dissection of transcriptional regulation in budding yeast. Science
296:752–755
Butte A, Kohane I (2000) Mutual information relevance networks: functional genomic clustering
using pairwise entropy measurements. Pac Symp Biocomput 5:415–426
Cenik C et al (2015) Integrative analysis of RNA, translation and protein levels reveals distinct regulatory
variation across humans. Genome Res. doi:10.1101/gr.193342.115
Chatr-Aryamontri A et al (2015) The BioGRID interaction database: 2015 update. Nucleic Acids
Res 43(Database issue):D470–D478. doi:10.1093/nar/gku1204
Chen LS, Emmert-Streib F, Storey JD (2007) Harnessing naturally randomized transcription to
infer regulatory relationships among genes. Genome Biol 8:R219
Chen Y et al (2008) Variations in DNA elucidate molecular networks that cause disease. Nature
452:429–435
Cheung VG, Spielman RS (2009) Genetics of human gene expression: mapping DNA variants that
influence gene expression. Nat Rev Genet 10:595–604
Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev
Genet 15:34–48
Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks.
Phys Rev E 70:066111
Cookson W et al (2009) Mapping complex disease traits with global gene expression. Nat Rev
Genet 10:184–194
Cubillos FA, Coustham V, Loudet O (2012) Lessons from eQTL mapping studies: non-coding
regions and their role behind natural phenotypic variation in plants. Curr Opin Plant Biol
15:192–198
Cusanovich DA et al (2014) The functional consequences of variation in transcription factor binding.
PLoS Genet 10:e1004226
Daub CO et al (2004) Estimating mutual information using B-spline functions – an improved simi-
larity measure for analysing gene expression data. BMC Bioinf 5:118
Dimas AS et al (2009) Common regulatory variation impacts gene expression in a cell type–depen-
dent manner. Science 325:1246–1250
Eisen MB et al (1998) Cluster analysis and display of genome-wide expression patterns. PNAS
95:14863–14868
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of
protein families. Nucleic Acids Res 30:1575–1584
Faith JJ et al (2007) Large-scale mapping and validation of Escherichia coli transcriptional regulation
from a compendium of expression profiles. PLoS Biol 5:e8
Foroughi Asl H et al (2015) Expression quantitative trait loci acting across multiple tissues are
enriched in inherited risk of coronary artery disease. Circulation Cardiovasc Genet 8:305–315
Foss EJ et al (2007) Genetic basis of proteome variation in yeast. Nat Genet 39:1369–1375
Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science
303:799–805
Friedman N, Nachman I, Pe’er D (1999a) Learning Bayesian network structure from massive datasets:
the “sparse candidate” algorithm. In: Proceedings of the fifteenth conference on uncertainty
in artificial intelligence, UAI’99. Morgan Kaufmann Publishers Inc., San Francisco,
pp 206–215
Friedman N, Goldszmidt M, Wyner A (1999b) Data analysis with Bayesian networks: a bootstrap
approach. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence.
Morgan Kaufmann Publishers Inc, San Francisco, pp 196–205
Detection of Regulator Genes and eQTLs in Gene Networks 21
Friedman N et al (2000) Using Bayesian networks to analyze expression data. J Comput Biol
7:601–620
Furey TS (2012) ChIP–seq and beyond: new and improved methodologies to detect and character-
ize protein–DNA interactions. Nat Rev Genet 13:840–852
Georges M (2007) Mapping, fine mapping, and molecular dissection of quantitative trait loci in
domestic animals. Annu Rev Genomics Hum Genet 8:131–162
Gerstein M et al (2010) Integrative analysis of the Caenorhabditis elegans genome by the modEN-
CODE project. Science 330:1775–1787
Goddard ME, Hayes BJ (2009) Mapping genes for complex traits in domestic animals and their
use in breeding programmes. Nat Rev Genet 10:381–391
Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. The Johns Hopkins University
Press, Baltimore
Greenawalt DM et al (2011) A survey of the genetics of stomach, liver, and adipose gene expres-
sion from a morbidly obese cohort. Genome Res 21:1008–1016
Grubert F et al (2015) Genetic control of chromatin states in humans involves local and distal
chromosomal interactions. Cell 162:1051–1065
Hartwell LH et al (1999) From molecular to modular cell biology. Nature 402:C47–C52
Hemani G et al (2014) Detection and replication of epistasis influencing transcription in humans.
Nature 508:249–253
Hindorff LA et al (2009) Potential etiologic and functional implications of genome-wide associa-
tion loci for human diseases and traits. Proc Natl Acad Sci 106:9362–9367
Joshi A, Van de Peer Y, Michoel T (2008) Analysis of a Gibbs sampler for model based clustering
of gene expression data. Bioinformatics 24:176–183
Joshi A et al (2009) Module networks revisited: computational assessment and prioritization of
model predictions. Bioinformatics 25:490–496
Kadarmideen HN, von Rohr P, Janss LL (2006) From genetical genomics to systems genetics: poten-
tial applications in quantitative genomics and animal breeding. Mamm Genome 17:548–564
Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT
Press, Cambridge, MA
Kundaje A et al (2015) Integrative analysis of 111 reference human epigenomes. Nature
518:317–330
Laird N, Lange C (2011) The fundamentals of modern statistical genetics. Springer, New York
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-
expression modules. BMC Syst Biol 1:54
Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis.
BMC Bioinf 9:559
Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the
Dynamic Tree Cut package for R. Bioinformatics 24:719–720
Lappalainen T et al (2013) Transcriptome and genome sequencing uncovers functional variation in
humans. Nature 501:506–511
Lee S et al (2006) Identifying regulatory mechanisms using individual variation reveals key role
for chromatin modification. Proc Natl Acad Sci U S A 103:14062–14067
Lee SI et al (2009) Learning a prior on regulatory potential from eQTL data. PLoS Genet 5:e1000358
Li Y et al (2010) Critical reasoning on causal inference in genome-wide linkage and association
studies. Trends Genet 26:493–498
Liu JS (2002) Monte Carlo strategies in scientific computing. Springer, New York
Lu P et al (2007) Absolute protein expression profiling estimates the relative contributions of tran-
scriptional and translational regulation. Nat Biotech 25:117–124
Mackay TF, Stone EA, Ayroles JF (2009) The genetics of quantitative traits: challenges and pros-
pects. Nat Rev Genet 10:565–577
Manolio TA (2013) Bringing genome-wide association findings into clinical use. Nat Rev Genet
14:549–558
Medvedovic M, Sivaganesan S (2002) Bayesian infinite mixture model based clustering of gene
expression profiles. Bioinformatics 18:1194–1206
Sharan R, Shamir R (2000) CLICK: a clustering algorithm with applications to gene expression
Abstract
In many biomedical research areas, animals have been used as a model to increase
the understanding of molecular mechanisms involved in human diseases. One of
those areas is human obesity, where porcine models are increasingly used. The
pig shows genetic and physiological features that are very similar to those of humans and
has been shown to be an excellent model for human obesity. Using pig populations,
many genetic studies have been performed to unravel the genetic architecture of
human obesity. Most of them point toward single genes, but more and
more studies focus on a systems genetics approach, a branch of systems biology.
In this chapter, we will describe the state of the art of genetic studies on human
obesity, using pig populations. We will describe the features of using the pig as a
model for human obesity and briefly discuss the genetics of obesity, and we will
focus on systems genetic research performed using pigs with their contribution
to human obesity research.
Throughout the history of biomedical research, animals have been extensively used
as a model for human diseases. Animal models have several advantages with respect
to costs, ethical potential, and measurement of phenotypic characteristics. The use
of animal models in biomedical research has been previously described in depth
(Hau 2008), showing that the choice of animal model in biomedical research depends on
both the animal and the disease under study. For human obesity, rodents are a commonly
used animal model, but because of major anatomical and physiological differences,
the translational efforts to human medical science have been limited (Houpt et al.
1979; Spurlock and Gabler 2008). To overcome those major differences, the pig
(Sus scrofa) has successfully been used as a model for human obesity.
The digestive tract is one of the key organs in obesity research (Halsted 1999)
and, therefore, needs to be considered when choosing an animal model for human
obesity. Zooming in on the anatomy of the digestive tract of pigs, it can be shown
that it is very similar to that of humans. Both species are omnivorous, and their
digestive tract consists of the esophagus, stomach, small intestine (consisting of
duodenum, jejunum, and ileum), and large intestine (consisting of cecum, colon,
rectum, and anus). Furthermore, the digestive tract has proportionally the same size:
the stomach has a capacity of 6–8 l in the pig compared to 2–4 l in humans (Curtus
and Barnes 1994), the small intestine is approximately 18 m long in the pig
(Razmaite et al. 2009) compared to 7 m in humans (Gray 1918), and the large intes-
tine is approximately 6 m in the pig (Razmaite et al. 2009) compared to 1.5 m in
humans.
Also genetically, the pig is very similar to humans. Recently (2012), the pig
genome was assembled and analyzed (Groenen et al. 2012), resulting in the annota-
tion of protein-coding genes and gene transcripts, with numbers comparable to the
human genome (Table 1).
Based on the pig's genetic background, Groenen et al. (2012) also showed the
potential of the pig as a biomedical model. For example, they detected 112
positions at which the porcine amino acid was identical to the one
implicated in a human disease. Moreover, several studies have used the pig as a biomedical
model regarding, for example, heart physiology, brain, gut physiology and
nutrition, biomechanical models, respiratory function, and infectious disease mod-
els (Lunney 2007; Michael Swindle and Smith 2008). Here, we will focus on the
use of the pig as a model for human obesity and its application in systems biology
research.
Already in 1979, the use of the pig as a biomedical model to study human obesity
was reviewed (Houpt et al. 1979). Several similarities between pigs and
humans were discussed with respect to obesity-related phenotypes. For example,
both in humans and in pigs, fat is mainly stored in subcutaneous adipose tissue, and
fat cell size and number are similar (Gurr et al. 1977). High-density lipid protein
In contrast to the obese pigs, production pigs (e.g., Duroc and Yorkshire) have
been bred for centuries for less fat or for their lean meat content to live up to the
standards for human consumption, leading to pig breeds that are genetically predisposed
to leanness. Although those animals are less valuable in an experimental
setting because of their size, the normal production setting has great potential. In
research performed by, e.g., animal breeding industries, large amounts of data are
collected to breed animals that grow fast and lean, with a high feed efficiency.
These data offer vast opportunities for human obesity studies, for example to
gain knowledge about the genetic architecture of eating behavior
and the development of lean/fat content.
(several immune functions, e.g., role in allergy) are strongly increased in individu-
For several years, genomewide association studies (GWAS) have been very impor-
tant in the detection of genes associated with diseases. Since 2007, GWAS have
been published regarding obesity. Most commonly, the body mass index (BMI) was
used as a measure of obesity. The first GWAS performed on BMI, using 4741 indi-
viduals and 362,129 SNPs, led to the detection of the fat mass and obesity-associated
(FTO) gene (Scuteri et al. 2007). Approximately 8 years later, the latest GWAS
performed on BMI, composed of 339,224 individuals and approximately 2.5 mil-
lion SNPs (GIANT consortium), detected 97 obesity-related loci (Locke et al.
2015). Although this is a huge increase in detected loci/genes associated with obe-
sity, results thus far are disappointing, as they only explain approximately 2.7% of
the BMI variation. Over the years and with new findings, the understanding of the biological
mechanisms behind obesity has also changed. Whereas FTO was discovered
as an actual fat mass gene (related to feed intake), pathway analysis of the 97
obesity-related loci discovered by Locke et al. (2015) points toward a major
role for the central nervous system.
Furthermore, several other phenotypes have been used to detect obesity-related
genes. For example, it has been shown that central obesity has more negative health
30 L.J.A. Kogelman and H.N. Kadarmideen
might be more informative than the BMI (Vazquez et al. 2007; Molarius and Seidell
1998). Recently, 49 loci related to WHR were detected using the GIANT consor-
tium database, consisting of genes that were highly enriched in adipocyte-related
tissues (Shungin et al. 2015). As with the GWAS on BMI, the variance
explained by those loci is very low: only approximately 1.4%.
Many more studies have been performed to elucidate the genetic background
of obesity and to gain understanding of the biological mechanisms of this complex
phenotype. As mentioned, the use of animal models in biomedical research has
some outstanding advantages, such as costs and ethical potential. Several research
groups have made use of those potentials and investigated obesity using the pig as a
biomedical model, either using data sets coming from the pig industry or by using
an experimental animal model, both having their own (dis)advantages (Fig. 1).
The first gene that has been directly associated with obesity in humans is the FTO
gene. This gene has also been related to obesity-related traits in pigs by several
studies. In 2007, a study focused on the alleles of the FTO gene and studied the
relationship of this gene in seven pig breeds with several measured traits (Fontanesi
et al. 2009). They showed, and reconfirmed, that FTO was significantly associated
with obesity-related phenotypes in Duroc pigs, for example, intramuscular fat
content (Fontanesi et al. 2010). Also in an ISU Berkshire × Yorkshire population,
this gene showed significant association with average daily gain and total lipid
percentage in muscle (Fan et al. 2009). Furthermore, an expression study showed
elevated levels of the FTO gene in brain tissues, with significantly higher levels in
the cerebellum compared with the cortex of pigs fed a high-cholesterol diet
(Madsen et al. 2009).
Another well-known human obesity gene is MC4R. The gene is active in the
hypothalamic leptin–melanocortin signaling pathway and has been associated with
suppression of food intake (Santini et al. 2009). Also in pigs, the gene has been
associated with several obesity-related traits. In Italian Duroc and Italian Large
White pigs, it has been associated with daily gain, feed conversion ratio, and ham
weight (Davoli et al. 2012). Another study using Large White showed the associa-
tion of MC4R with backfat depth, average daily gain, and daily feed intake (Houston
et al. 2004).
Another study that showed the presence of human obesity genes in pigs was
performed at Poznan University of Life Sciences (Poland). They localized seven
previously mapped (INSIG2, LIPIN1, PLIN, NAMPT, ADIPOQ, UCP2, and UCP3)
and six novel (NR3C1, GNB3, ADRB1, ADRB2, ADRB3, and UCP1) candidate
genes for human obesity in the pig genome (Nowacka-Woszuk et al. 2008). All
genes could be localized on one of the pig chromosomes, and several of them were
located within known quantitative trait loci (QTLs) for fatness traits in pigs.
Another expression-based study also showed the presence of LEP, LEPR, NEGR1,
and ADIPOQ together with FTO and MC4R in production pigs and in Göttingen
Minipigs (Cirera et al. 2014).
Fat-related traits in pigs have been studied intensively for the pig industry because
of their commercial effect. Production pigs have been bred for their lean meat con-
tent, which means that it was commercially interesting to detect biomarkers for
fatness. Although many of these studies have never been related to human obesity,
it can be proposed that those genes might be important in a human context. For
example, a QTL study for backfat thickness and intramuscular fat content in an
experimental cross between Meishan and Dutch Large White and Landrace lines
detected several QTLs on several chromosomes. Comparative mapping of the
results showed human homologues for those QTLs, for example, tumor necrosis
factor α (de Koning et al. 1999). Furthermore, in a QTL study using a three-
generation experimental cross between Meishan and Large White pig breeds, seven
chromosomal regions were detected as genomewide significant for fatness traits.
One of the detected QTLs was close to the insulin-like growth factor 2 locus
(Bidanel et al. 2001). A GWAS performed on 669 Duroc pigs across generations
resulted in the detection of a region associated with backfat thickness. In this region,
six genes were identified (PDE4B, LEPR, LEPROT, DNAJC6, AK3L1, and JAK1)
of which both LEPR and PDE4B have previously been associated with backfat-
related traits and obesity.
A GWAS of 820 commercial sows (Large White and Large White × Landrace
cross) identified several regions associated with backfat, including the genes MC4R,
ATP6V1H, OPRK1, LDHD, CHCHD3, and ATP2B3 (Fan et al. 2011). Of those,
MC4R, ATP6V1H, and CHCHD3 have been previously associated with obesity in
human or other animal studies (Hwang et al. 2010; Lubrano-Berthelier et al. 2003;
Walewski et al. 2010). Another GWAS using 651 Duroc samples (three generations)
detected a region associated with backfat thickness containing six genes: PDE4B,
LEPR, LEPROT, DNAJC6, AK3L1, and JAK1 (Okumura et al. 2013). Several of
those have been previously associated with human obesity, such as the previously
discussed LEPR.
In addition to GWAS performed on backfat and other fat-related traits, another
study focused on feeding behavior in production pigs (Duroc boars), which might
be an important trait with respect to feed efficiency (Do et al. 2013). However, find-
ings in this study could also be valuable for human research because of its impact
on eating behavior and subsequent development of obesity. Among the genes that
were discovered were ENPP1, HCRT, and MTTP.
Another study focused on the sequencing of the porcine genome using RNA
sequencing and found a gene associated with cholesterol and triglyceride levels:
CES1 (Chen et al. 2011). Interestingly, a human study has also shown that the
Applications of Systems Genetics and Biology for Obesity Using Pig Models 33
whereas in mice, it was shown that CES1 has an important role in lipid homeostasis
(Xu et al. 2014).
As described, the pig is an excellent model for human obesity with huge potential.
Many researchers have taken this opportunity to study human obesity, using differ-
ent porcine models. In this section, we will describe several of those studies in order
to gain insight into the knowledge gained about human obesity using porcine
models.
One of the ways in which pigs are used as a model for human obesity is by induc-
ing obesity with a high-fat diet. Using this approach, Li et al. (2011) created a por-
cine model of 20 crossbred boars, which were fed a corn-soy basal diet or a diet with
lard for 180 days. There were significant differences between the high-fat diet pigs
and control pigs with respect to, for example, body weight, backfat thickness, and
abdominal fat. They detected 852 genes (387 upregulated and 465 downregulated)
that showed differential expression between the obese pigs and control pigs.
Upregulated genes were mainly associated with metabolic process, immune
response, translation, and cell cycle, whereas downregulated genes were mainly
involved in regulation of transcription, RNA splicing, and transcription.
Porcine models can also be obtained by intercrossing pig breeds, creating a pop-
ulation consisting of different generations that is advantageous for genetic studies.
An excellent example is the UNIK pig resource population (University of
Copenhagen, Denmark), which is an F2 pig population created by intercrossing
Göttingen Minipig boars with Duroc and Yorkshire sows (563 pigs). As described
earlier, the Göttingen Minipig and production pigs (i.e., Duroc and Yorkshire) are
very distinct in some obesity-related features, e.g., body size and body fat content.
By creating an F2 intercross, the F2 animals will show a wide range of values for
each distinct phenotype. This large variation in obesity-related phenotypes was also
obvious in the UNIK pig resource population, shown by genetic parameter calcula-
tions. The high coefficients of variation (15–42%) and moderate to high heritabili-
ties (0.22–0.81) for obesity and obesity-related traits showed that the population
was highly divergent for obesity traits (Kogelman et al. 2013). Furthermore, genetic
correlations between obesity-related traits revealed more of the genetic architecture
in the population. For example, weight and lean mass estimated by DXA scanning
were highly correlated (0.56–0.97), and fat-related traits were strongly correlated
with glucose levels (0.35–0.74). The UNIK resource population thereby proved its
potential for further genetic investigations using, for example, genomics and
transcriptomics.
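For reference, the two summary statistics quoted above can be written out as follows. The numbers are hypothetical, and the variance components are simply taken as inputs: in practice they would be estimated with a mixed (animal) model from pedigree or genomic data, which this sketch does not attempt.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.fmean(values)


def narrow_sense_heritability(additive_var, phenotypic_var):
    """Share of phenotypic variance attributed to additive genetic effects."""
    return additive_var / phenotypic_var


# Hypothetical body weights (kg) and hypothetical variance components.
weights_kg = [80.0, 100.0, 120.0]
cv_percent = coefficient_of_variation(weights_kg)
h2 = narrow_sense_heritability(additive_var=2.0, phenotypic_var=8.0)
print(cv_percent, h2)
```

A trait with both a high coefficient of variation and a moderate to high heritability varies widely in the population, and a substantial part of that variation is genetic, which is what makes such a population useful for mapping studies.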
To detect genomewide associations for obesity-related phenotypes in the UNIK
pig resource population, a combined linkage disequilibrium-linkage analysis was
performed (Pant et al. 2015). This resulted in the detection of 229 QTLs that were
subsequently used for comparative analysis with the human genome. Many different
genes were identified for obesity-related phenotypes, such as BMI (e.g., SMAD6 and
PAX5), fasting glucose levels (STIM1), and cholesterol levels (e.g., STRADA).
Another study within the UNIK pig resource population focused on obesity-specific
microRNA expression in subcutaneous adipose tissue (Mentzel et al. 2015). In total,
two differentially expressed microRNAs were discovered and validated using qPCR:
mir-9 and mir-124a. Both microRNAs have been previously associated with obesity-
related phenotypes in human and/or mouse studies.
backfat thickness in the pig and subscapular skinfold thickness in humans were
estimated, and a correlation of 0.479 was detected. In both the pig and human popu-
lations, the human chromosome 2 was detected as the most important chromosome
based on the percentage of heritability explained and was therefore further investi-
gated. Several genes were suggested as being candidate genes for human obesity,
namely, MRPL33, PARD3B, ERBB4, STK39, and ZNF385B.
In the previously mentioned UNIK pig resource population, several systems
genetics approaches were also conducted. Using the 60K genotype data, genetic
networks were constructed using the weighted interaction SNP hub (WISH) method
(Kogelman et al. 2014b). The WISH method detects interactions between SNPs
based on their genotype correlations and based on the epistatic interaction effect
between SNP pairs (Kogelman and Kadarmideen 2014). First, a GWAS was per-
formed on the obesity index (OI), which is an aggregate genotypic value for the
level of obesity of each individual pig. Several genes were detected (e.g., NPC2,
OR4D10, and CACNA1E), and pathway analysis of all genomewide significant
SNPs (404 SNPs) showed association with, e.g., insulin and immune system path-
ways. Second, 2500 SNPs were selected based on GWAS results (P-value < 0.05)
and their connectivity to construct the WISH network. Pathway analysis of detected
modules resulted in pathways related to, for example, metabolic processes and puri-
nergic receptor activity. Lastly, a differentially wired network was constructed to
detect potential obesity genes based on differential connectivity. Here, genes such
as UBR1, PNPLA8, and CNTNAP2 were detected, all of which have previously been associated
with obesity or obesity-related diseases.
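The idea of a SNP network built from genotype correlations can be sketched as follows. This is not the published WISH implementation: it uses only the genotype-correlation component (no epistatic interaction effects), borrows a WGCNA-style soft-thresholding power `beta` as an assumption, and runs on hypothetical toy genotypes. SNPs are assumed to be polymorphic, since a monomorphic SNP has zero variance and no defined correlation.

```python
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def genotype_adjacency(genotypes, beta=6):
    """Weighted adjacency |r|^beta for every pair of SNPs.

    `genotypes` maps a SNP name to its allele counts (0/1/2) per animal.
    """
    names = sorted(genotypes)
    adjacency = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(genotypes[a], genotypes[b])
            adjacency[(a, b)] = abs(r) ** beta
    return adjacency


def connectivity(adjacency):
    """Sum of adjacency weights per SNP; high values mark hub SNPs."""
    k = {}
    for (a, b), weight in adjacency.items():
        k[a] = k.get(a, 0.0) + weight
        k[b] = k.get(b, 0.0) + weight
    return k


# Hypothetical genotypes for three SNPs in six animals.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [0, 0, 1, 1, 2, 2],
}
k = connectivity(genotype_adjacency(genotypes))
print(k)
```

Raising the absolute correlation to a power suppresses weak correlations while retaining strong ones, so the connectivity of a SNP is dominated by its strongly correlated partners.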
Besides the genotyping in the UNIK pig resource population, a subset of the
population was selected for RNA sequencing. The selection of pigs was obtained
using selective expression profiling, based on the OI, selecting 12 extremely lean,
12 extremely obese, and 12 intermediate pigs. One of the small studies on this data
set was the investigation of interaction between highly differentiating genes (lean
vs. obese animals). They were visualized in Cytoscape (Shannon et al. 2003) and
were further analyzed using the built-in network analyzer (Fig. 2). We detected several
genes that were so-called hub genes, and one clear example was TNMD. TNMD
encodes a protein related to chondromodulin I, a cartilage-specific glycoprotein.
Genetic variation in TNMD has been associated with central obesity and type 2
diabetes (Tolppanen et al. 2007).
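The selection step described at the start of this paragraph can be sketched as a ranking on the obesity index. The rule used here for the intermediate group (the animals around the median rank) is an assumption, as are the function name and the toy data; the text only states that 12 lean, 12 obese, and 12 intermediate pigs were chosen based on the OI.

```python
def select_extremes(obesity_index, n_per_group=12):
    """Pick lean, intermediate, and obese animals from ranked OI values.

    `obesity_index` maps an animal identifier to its OI value.
    """
    ranked = sorted(obesity_index, key=obesity_index.get)
    if len(ranked) < 3 * n_per_group:
        raise ValueError("not enough animals for three disjoint groups")
    lean = ranked[:n_per_group]                      # lowest OI values
    obese = ranked[-n_per_group:]                    # highest OI values
    mid_start = (len(ranked) - n_per_group) // 2     # centered on the median
    intermediate = ranked[mid_start:mid_start + n_per_group]
    return lean, intermediate, obese


# Hypothetical population of 100 pigs whose OI equals their index.
animals = {f"pig{i}": i for i in range(100)}
lean, intermediate, obese = select_extremes(animals)
print(lean[0], intermediate[0], obese[-1])
```

Sampling the tails of the distribution concentrates sequencing effort on the animals that are most informative for a lean versus obese contrast.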
The complete RNA-sequencing data were then used to construct coexpression
networks using the weighted gene coexpression network analysis (WGCNA)
approach, and regulator genes were detected using Lemon-Tree algorithms
(Kogelman et al. 2014a). This resulted in the detection of several obesity-related
pathways, but more interestingly in the detection of three potential causal genes
linking obesity with osteoporosis: CCR1, MSR1, and SPI1. Continuing the sys-
tems genetics approaches using this UNIK pig resource population, one study
focused on lncRNAs present among the genes in the previously detected WGCNA
modules (Suravajhala et al. 2015). That study investigated those genes using the
RNA–protein interaction predictor and determined network properties using Cytoscape
(Shannon et al. 2003). The authors found that cyp2c91 has strong interactions with
several regulator genes, and they showed its importance in transcriptional regulation
localized to the cytoplasm. Furthermore, another study in progress within the UNIK
population focuses on transcriptional regulation networks. This study extracted
known transcription factors and used them as input for WGCNA to detect interaction
patterns between transcription factors (Skinkyte-Juskiene et al. 2015).
Fig. 2 Visualization of coexpression of genes selected based on their association with obesity,
resulting from RNA-sequencing data of subcutaneous adipose tissue in the UNIK pig resource
population. Edges represent Pearson's correlation coefficients between selected genes, colored on
a red–green scale based on negative–positive correlation. Edge thickness represents
the strength of the correlation. Nodes represent selected genes, with size representing the
fold change detected in differential expression analysis and darkness (gray scale) representing the
significance level of the differential expression analysis. Only genes with a coexpression higher than 0.8
are visualized. TNMD, one of the selected hub genes, is colored yellow
The most recent research in the UNIK resource population is a study that integrates
the genomic data with the transcriptomic data, using an eQTL mapping
genes for the degree of obesity. Second, eQTL analysis revealed 987 cis-eQTLs and
73 trans-eQTLs. Data were further integrated by confining the eQTL mapping input
to genomewide associated SNPs and differentially expressed genes. Furthermore,
eQTL results were used for coexpression network construction. Many different
obesity-related pathways and genes were identified, and several obesity candidate
genes were proposed: ENPP1, CTSL, and ABHD12B (Kogelman et al. 2015).
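For readers unfamiliar with eQTL mapping, the core step can be sketched as a regression of a gene's expression on the allele dosage of a SNP, followed by a label of cis or trans according to the distance between SNP and gene. This is a simplified illustration and not the pipeline used in the cited study: the 1 Mb cis window, the function names, and the toy values are assumptions, and a real analysis would also handle covariates, relatedness, and multiple testing.

```python
from math import sqrt


def eqtl_effect(dosage, expression):
    """Least-squares slope of expression on allele dosage and its t statistic.

    dosage:     list of 0/1/2 allele counts, one per animal.
    expression: list of expression values for one gene, same order.
    Assumes the fit is not perfect (residual variance > 0).
    """
    n = len(dosage)
    mx, my = sum(dosage) / n, sum(expression) / n
    sxx = sum((x - mx) ** 2 for x in dosage)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosage, expression))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(dosage, expression))
    std_err = sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def classify_eqtl(snp_chr, snp_pos, gene_chr, gene_pos, window=1_000_000):
    """Label an eQTL as cis (same chromosome, within `window` bp) or trans.

    The 1 Mb default is a hypothetical choice, not taken from the source.
    """
    if snp_chr == gene_chr and abs(snp_pos - gene_pos) <= window:
        return "cis"
    return "trans"


# Invented data: expression rises by about one unit per allele copy.
toy_dosage = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]

slope, t_stat = eqtl_effect(toy_dosage, toy_expression)
print(round(slope, 3), round(t_stat, 1))
print(classify_eqtl("1", 1_500_000, "1", 1_000_000))
print(classify_eqtl("1", 1_500_000, "7", 1_000_000))
```

Restricting the input to genome-wide associated SNPs and differentially expressed genes, as described above, would simply mean calling such a test on a filtered set of SNP and gene pairs.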
8 Future Perspectives
Here we have described a selection of (ongoing) genetic research into obesity, using
the pig as a biomedical model. The complexity of obesity, resulting from interac-
tions between and within genes and environment, requires complex analyses of the
different -omic levels to elucidate underlying genetic and biological mechanisms.
Systems genetics approaches, a genetic branch of systems biology, offer those
possibilities and are therefore an excellent choice for obesity research. Here we have
focused on genomic and transcriptomic research; besides the many opportunities
that remain within those fields, there are also prospects for other
-omics levels such as epigenomics, metabolomics, and proteomics.
In humans, epigenetic studies have shown the importance of detecting epigenetic
changes underlying the development of obesity (van Dijk et al. 2015). A recently
published study using a subset of pigs from the previously mentioned UNIK pig resource
population focused on the epigenetic mechanisms in leukocytes (Jacobsen et al. 2016).
The authors detected several genes that were differentially expressed (lean vs. obese pigs) in
retroperitoneal adipose tissue and peripheral blood mononuclear cells. Several of those
genes had a strong association with inflammatory pathways and fatty acid metabolism,
for example SPPI1, LEP, and INSIG1. Metabolomics has also shown its potential
in human obesity research (Rauschert et al. 2014). A few metabolomic studies have
been performed in pigs with the aim of unraveling metabolic mechanisms. He et al.
(2012) found many differences in metabolite levels between lean and obese pigs,
showing a distinct metabolism in obese pigs (e.g., differences in lipid oxidation and in
fermentation by gastrointestinal microbes). Obesity research could benefit greatly from the
integration of metabolomic and genomic/transcriptomic data to narrow down the
unknown interactions between genetic and environmental factors. Proteomic studies
have already been used more frequently in obesity research with the pig as a model,
and their potential is reviewed by Bendixen et al. (2010). Several
plasma proteins have been shown to correlate well with obesity-related parameters (e.g.,
cholesterol and glucose) among pigs fed different diets (te Pas et al. 2013). The protein
expression of muscle in obese and lean pigs has also been studied to gain knowledge about
growth rate and meat quality in pigs, which can be used in human research (Li et al.
2013). That study found two metabolism-related proteins (COX5A and ATP5B) that were
highly expressed in the obese pigs but not in the lean pigs; both participate in
oxidative phosphorylation. The protein that was highly expressed in the lean animals but not
in obese pigs (ENO3) is involved in glycolysis.
Omic data generation (both microarray and sequence based) is getting easier and
more affordable, and in combination with the ease of collecting samples from
different cells/tissues in a porcine model, there is a huge potential to elucidate genetic
and biological mechanisms of human obesity. Inherent in all these approaches is
the dramatic increase in the use of omic-scale modeling and joint analyses of
disparate multi-omic data sets using highly advanced bioinformatic and computational
systems biology methods. This will become even more important and widespread in
biomedical research in general and in complex diseases, such as obesity, in
particular. As shown throughout this chapter, in the field of systems biology/systems
genetics, there are many untouched areas in obesity research using a porcine model,
which promises further growth and good prospects in the future.
References
Bendixen E, Danielsen M, Larsen K, Bendixen C (2010) Advances in porcine genomics and pro-
teomics—a toolbox for developing the pig as a model organism for molecular biomedical
research. Brief Funct Genomics 9(3):208–219. doi:10.1093/bfgp/elq004
Bidanel J, Milan D, Iannuccelli N et al (2001) Detection of quantitative trait loci for growth and
fatness in pigs. Genet Sel Evol 33:289–309. doi:10.1186/1297-9686-33-3-289
Bollen PJ, Madsen LW, Meyer O, Ritskes-Hoitinga J (2005) Growth differences of male and
female Gottingen minipigs during ad libitum feeding: a pilot study. Lab Anim 39(1):80–93.
doi:10.1258/0023677052886565
Bougnères P (2002) Genetics of obesity and type 2 diabetes. Diabetes 51(suppl 3):S295–S303.
doi:10.2337/diabetes.51.2007.S295
Chen C, Ai H, Ren J et al (2011) A global view of porcine transcriptome in three tissues from a
full-sib pair with extreme phenotypes in growth and fat deposition by paired-end RNA sequenc-
ing. BMC Genomics 12:448. doi:10.1186/1471-2164-12-448
Cirera S, Jensen MS, Elbrønd VS et al (2014) Expression studies of six human obesity-related
genes in seven tissues from divergent pig breeds. Anim Genet 45(1):59–66. doi:10.1111/
age.12082
Curtis H, Barnes NS (1994) Invitation to biology, vol 529, 5th edn. Worth, New York
Davis MA, Henry R, Leslie RB (1974) Comparative studies on porcine and human high density
lipoproteins. Comp Biochem Physiol B 47(4):831–849
Davoli R, Braglia S, Valastro V et al (2012) Analysis of MC4R polymorphism in Italian Large
White and Italian Duroc pigs: association with carcass traits. Meat Sci 90(4):887–892.
doi:10.1016/j.meatsci.2011.11.025
de Koning D, Janss L, Rattink A et al (1999) Detection of quantitative trait loci for backfat thick-
ness and intramuscular fat content in pigs (Sus scrofa). Genetics 152:1679–1690
Diez J, Iglesias P (2003) The role of the novel adipocyte-derived hormone adiponectin in human
disease. Eur J Endocrinol 148(3):293–300. doi:10.1530/eje.0.1480293
Do DN, Strathe AB, Ostersen T, Jensen J, Mark T, Kadarmideen HN (2013) Genome-wide asso-
ciation study reveals genetic architecture of eating behavior in pigs and its implications for
humans obesity by comparative mapping. PLoS One 8(8), e71509. doi:10.1371/journal.
pone.0071509
Dyson MC, Alloosh M, Vuchetich JP, Mokelke EA, Sturek M (2006) Components of metabolic
syndrome and coronary artery disease in female Ossabaw swine fed excess atherogenic diet.
Comp Med 56(1):35–45
Elgazar-Carmon V, Rudich A, Hadad N, Levy R (2008) Neutrophils transiently infiltrate intra-
abdominal fat early in the course of high-fat feeding. J Lipid Res 49(9):1894–1903. doi:10.1194/
jlr.M800132-JLR200
Fan B, Du ZQ, Rothschild MF (2009) The fat mass and obesity-associated (FTO) gene is associ-
ated with intramuscular fat content and growth rate in the pig. Anim Biotechnol 20(2):58–70.
doi:10.1080/10495390902800792
Fan B, Onteru SK, Du Z-Q, Garrick DJ, Stalder KJ, Rothschild MF (2011) Genome-wide associa-
tion study identifies loci for body composition and structural soundness traits in pigs. PLoS
One 6(2), e14726. doi:10.1371/journal.pone.0014726
Fantuzzi G (2005) Adipose tissue, adipokines, and inflammation. J Allergy Clin Immunol
115(5):911–919. doi:10.1016/j.jaci.2005.02.023
Ferrante AW (2013) The immune cells in adipose tissue. Diabetes Obes Metab 15(s3):34–38.
doi:10.1111/dom.12154
Fontanesi L, Scotti E, Buttazzoni L, Davoli R, Russo V (2009) The porcine fat mass and obesity
associated (FTO) gene is associated with fat deposition in Italian Duroc pigs. Anim Genet
40(1):90–93. doi:10.1111/j.1365-2052.2008.01777.x
Fontanesi L, Scotti E, Buttazzoni L et al (2010) Confirmed association between a single nucleotide
polymorphism in the FTO gene and obesity-related traits in heavy pigs. Mol Biol Rep
37(1):461–466. doi:10.1007/s11033-009-9638-8
Friedman JM (2002) The function of leptin in nutrition, weight, and physiology. Nutr Rev 60(suppl
10):S1–S14. doi:10.1301/002966402320634878
Galgani J, Ravussin E (2010) Energy metabolism, fuel selection and body weight regulation. Int
J Obes (Lond) 32(Suppl 7):S109–S119. doi:10.1038/ijo.2008.246
Gray H (1918) Anatomy of the human body. Lea & Febiger
Groenen MAM, Archibald AL, Uenishi H et al (2012) Analyses of pig genomes provide insight
into porcine demography and evolution. Nature 491(7424):393–398. doi:10.1038/nature11622
Gurr MI, Kirtland J, Phillip M, Robinson MP (1977) The consequences of early overnutrition for
fat cell size and number: the pig as an experimental model for human obesity. Int J Obes (Lond)
1(2):151–170
Halsted CH (1999) Obesity: effects on the liver and gastrointestinal system. Curr Opin Clin Nutr
Metab Care 2(5):425–429
Hau J (2008) Animal models for human diseases. In: Conn PM (ed) Sourcebook of models for
biomedical research. Humana Press, Totowa, pp 3–8. doi:10.1007/978-1-59745-285-4_1
He Q, Ren P, Kong X et al (2012) Comparison of serum metabolite compositions between obese
and lean growing pigs using an NMR-based metabonomic approach. J Nutr Biochem
23(2):133–139. doi:10.1016/j.jnutbio.2010.11.007
Heber D (2010) An integrative view of obesity. Am J Clin Nutr 91(1):280S–283S. doi:10.3945/
ajcn.2009.28473B
Houpt KA, Houpt TR, Pond WG (1979) The pig as a model for the study of obesity and of control
of food intake: a review. Yale J Biol Med 52(3):307–329
Houston RD, Cameron ND, Rance KA (2004) A melanocortin-4 receptor (MC4R) polymorphism
is associated with performance traits in divergently selected Large White pig populations.
Anim Genet 35(5):386–390. doi:10.1111/j.1365-2052.2004.01182.x
Hwang H, Bowen BP, Lefort N et al (2010) Proteomics analysis of human skeletal muscle reveals novel
abnormalities in obesity and type 2 diabetes. Diabetes 59(1):33–42. doi:10.2337/db09-0214
Jacobsen MJ, Mentzel CMJ, Olesen AS et al (2016) Altered methylation profile of lymphocytes is
concordant with perturbation of lipids metabolism and inflammatory response in obesity.
J Diabet Res 2016:11. doi:10.1155/2016/8539057
Johansen T, Hansen HS, Richelsen B, Malmlöf K (2001) The obese Gottingen minipig as a model
of the metabolic syndrome: dietary effects on obesity, insulin sensitivity, and growth hormone
profile. Comp Med 51(2):150–155
Kim J, Lee T, Kim T-H, Lee K-T, Kim H (2012) An integrated approach of comparative genomics
and heritability analysis of pig and human on obesity trait: evidence for candidate genes on
human chromosome 2. BMC Genomics 13:711. doi:10.1186/1471-2164-13-711
Kogelman LJA, Kadarmideen H (2014) Weighted Interaction SNP Hub (WISH) network method
for building genetic networks for complex diseases and traits using whole genome genotype
data. BMC Syst Biol 8(Suppl 2):S5. doi:10.1186/1752-0509-8-S2-S5
Kogelman LJA, Kadarmideen HN, Mark T et al (2013) An F2 pig resource population as a model
for genetic studies of obesity and obesity-related diseases in humans: design and genetic
parameters. Front Genet 4:29. doi:10.3389/fgene.2013.00029
Kogelman LJA, Cirera S, Zhernakova D, Fredholm M, Franke L, Kadarmideen H (2014a)
Identification of co-expression gene networks, regulatory genes and pathways for obesity
based on adipose tissue RNA Sequencing in a porcine model. BMC Med Genomics 7(1):57.
doi:10.1186/1755-8794-7-57
Kogelman LJA, Pant SD, Fredholm M, Kadarmideen HN (2014b) Systems genetics of obesity in
an F2 pig model by genome-wide association, genetic network and pathway analyses. Front
Genet 5:214. doi:10.3389/fgene.2014.00214
Kogelman LJA, Zhernakova DV, Westra H-J et al (2015) An integrative systems genetics approach
reveals potential causal genes and pathways related to obesity. Genome Med 7(1):1–15.
doi:10.1186/s13073-015-0229-0
Li K, Zhao H, Zhou J-C et al (2011) Differentially expressed genes in subcutaneous fat tissue in an
obese pig model induced by a high-fat diet. J Anim Vet Adv 10(14):1804–1810. doi:10.3923/
javaa.2011.1804.1810
Li A, Mo D, Zhao X et al (2013) Comparison of the longissimus muscle proteome between obese
and lean pigs at 180 days. Mamm Genome 24(1–2):72–79. doi:10.1007/s00335-012-9440-0
Liu J, Divoux A, Sun J et al (2009) Deficiency and pharmacological stabilization of mast cells
reduce diet-induced obesity and diabetes in mice. Nat Med 15(8):940–945. doi:10.1038/
nm.1994
Locke AE, Kahali B, Berndt SI et al (2015) Genetic studies of body mass index yield new insights
for obesity biology. Nature 518(7538):197–206. doi:10.1038/nature14177
Lubrano-Berthelier C, Cavazos M, Dubern B et al (2003) Molecular genetics of human obesity-
associated MC4R mutations. Ann N Y Acad Sci 994:49–57
Lunney JK (2007) Advances in swine biomedical model genomics. Int J Biol Sci 3(3):179–184.
doi:10.7150/ijbs.3.179
Madsen MB, Birck MM, Fredholm M, Cirera S (2009) Expression studies of the obesity candidate
gene FTO in pig. Anim Biotechnol 21(1):51–63. doi:10.1080/10495390903381792
Marrades MP, Gonzalez-Muniesa P, Martinez JA, Moreno-Aliaga MJ (2010) A dysregulation in
CES1, APOE and other lipid metabolism-related genes is associated to cardiovascular risk fac-
tors linked to obesity. Obes Facts 3(5):312–318. doi:10.1159/000321451
McAnulty PA, Dayan AD, Ganderup N-C, Hastings KL (2011) The minipig in biomedical
research. CRC Press, Boca Raton
Mentzel CMJ, Anthon C, Jacobsen MJ et al (2015) Gender and obesity specific MicroRNA expres-
sion in adipose tissue from lean and obese pigs. PLoS One 10(7), e0131650. doi:10.1371/
journal.pone.0131650
Michael Swindle M, Smith A (2008) Swine in biomedical research. In: Conn PM (ed) Sourcebook
of models for biomedical research. Humana Press, Totowa, pp 233–239.
doi:10.1007/978-1-59745-285-4_26
Mitchell AD, Conway JM, Potts WJ (1996) Body composition analysis of pigs by dual-energy
x-ray absorptiometry. J Anim Sci 74(11):2663–2671
Mitchell AD, Scholz AM, Conway JM (1998) Body composition analysis of small pigs by dual-
energy x-ray absorptiometry. J Anim Sci 76(9):2392–2398
Molarius A, Seidell JC (1998) Selection of anthropometric indicators for classification of abdomi-
nal fatness--a critical review. Int J Obes Relat Metab Disord 22(8):719–727
Nowacka-Woszuk J, Szczerbal I, Fijak-Nowak H, Switonski M (2008) Chromosomal localization
of 13 candidate genes for human obesity in the pig genome. J Appl Genet 49(4):373–377.
doi:10.1007/bf03195636
O’Rahilly S, Farooqi I (2006) Genetics of obesity. Philos Trans Royal Soc B Biol Sci
361(1471):1095–1105. doi:10.1098/rstb.2006.1850
Okumura N, Matsumoto T, Hayashi T et al (2013) Genomic regions affecting backfat thickness
and cannon bone circumference identified by genome-wide association study in a Duroc pig
population. Anim Genet 44(4):454–457. doi:10.1111/age.12018
encing obesity and metabolic phenotypes in pigs and humans. PLoS One 10(9), e0137356.
doi:10.1371/journal.pone.0137356
Ponsuksili S, Murani E, Brand B, Schwerin M, Wimmers K (2011) Integrating expression profil-
ing and whole-genome association for dissection of fat traits in a porcine model. J Lipid Res
52(4):668–678. doi:10.1194/jlr.M013342
Rauschert S, Uhl O, Koletzko B, Hellmuth C (2014) Metabolomic biomarkers for obesity in
humans: a short review. Ann Nutr Metab 64(3–4):314–324. doi:10.1159/000365040
Razmaite V, Kerziene S, Jatkauskiene V (2009) Body and carcass measurements and organ weights
of Lithuanian indigenous pigs and their wild boar hybrids. Anim Sci Papers Rep
27(4):331–342
Santini F, Maffei M, Pelosini C, Salvetti G, Scartabelli G, Pinchera A (2009) Melanocortin-4
receptor mutations in obesity. Adv Clin Chem 48:95–109
Scuteri A, Sanna S, Chen W-M et al (2007) Genome-wide association scan shows genetic variants
in the FTO gene are associated with obesity-related traits. PLoS Genet 3(7), e115. doi:10.1371/
journal.pgen.0030115
Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated
models of biomolecular interaction networks. Genome Res 13:2498–2504
Shoelson SE, Herrero L, Naaz A (2007) Obesity, inflammation, and insulin resistance.
Gastroenterology 132(6):2169–2180. doi:10.1053/j.gastro.2007.03.059
Shungin D, Winkler TW, Croteau-Chonka DC et al (2015) New genetic loci link adipose and insu-
lin biology to body fat distribution. Nature 518(7538):187–196. doi:10.1038/nature14132
Skinkyte-Juskiene R, Kogelman LJA, Kadarmideen HN (2015) Construction of transcription fac-
tor networks for obesity using RNAseq transcriptomics. In: Genome Informatics, Cold Spring
Harbor
Speliotes EK, Willer CJ, Berndt SI et al (2010) Association analyses of 249,796 individuals
reveal 18 new loci associated with body mass index. Nat Genet 42(11):937–948.
doi:10.1038/ng.686
Spurlock ME, Gabler NK (2008) The development of porcine models of obesity and the metabolic
syndrome. J Nutr 138(2):397–402
Steibel J, Bates R, Rosa G et al (2011) Genome-wide linkage analysis of global gene expression in
loin muscle tissue identifies candidate genes in pigs. PLoS One 6, e16766. doi:10.1371/journal.
pone.0016766
Suravajhala P, Kogelman LJA, Mazzoni G, Kadarmideen HN (2015) Potential role of lncRNA
cyp2c91-protein interactions on diseases of the immune system. Front Genet 6:255.
doi:10.3389/fgene.2015.00255
Suster D, Leury BJ, Ostrowska E et al (2003) Accuracy of dual energy X-ray absorptiometry
(DXA), weight and P2 back fat to predict whole body and carcass composition in pigs
within and across experiments. Livestock Prod Sci 84(3):231–242. doi:10.1016/
S0301-6226(03)00077-0
te Pas MFW, Koopmans S-J, Kruijt L, Calus MPL, Smits MA (2013) Plasma proteome profiles
associated with diet-induced metabolic syndrome and the early onset of metabolic syndrome in
a pig model. PLoS One 8(9), e73087. doi:10.1371/journal.pone.0073087
Tilg H, Moschen AR (2006) Adipocytokines: mediators linking adipose tissue, inflammation and
immunity. Nat Rev Immunol 6(10):772–783
Tolppanen A-M, Pulkkinen L, Kolehmainen M et al (2007) Tenomodulin is associated with obe-
sity and diabetes risk: the Finnish diabetes prevention study. Obesity 15(5):1082–1088.
doi:10.1038/oby.2007.613
van Dijk SJ, Tellam RL, Morrison JL, Muhlhausler BS, Molloy PL (2015) Recent developments
on the role of epigenetics in obesity and metabolic disease. Clin Epigenet 7(1):1–13.
doi:10.1186/s13148-015-0101-5
Vazquez G, Duval S, Jacobs DR, Silventoinen K (2007) Comparison of body mass index, waist
circumference, and waist/hip ratio in predicting incident diabetes: a meta-analysis. Epidemiol
Rev 29(1):115–128. doi:10.1093/epirev/mxm008
obesity is multifactorial, resulting from increased fatty acid uptake and decreased activity of
genes involved in fat utilization. Obes Surg 20(1):93–107. doi:10.1007/s11695-009-0002-9
World Health Organization (2012) Obesity and overweight, Fact sheet No. 311 updated March
2013. http://www.who.int/mediacentre/factsheets/fs311/en/
Wren AM, Seal LJ, Cohen MA et al (2001) Ghrelin enhances appetite and increases food intake in
humans. J Clin Endocrinol Metabol 86(12):5992. doi:10.1210/jcem.86.12.8111
Xu H, Barnes GT, Yang Q et al (2003) Chronic inflammation in fat plays a crucial role in the
development of obesity-related insulin resistance. J Clin Invest 112(12):1821–1830.
doi:10.1172/jci19451
Xu J, Li Y, Chen WD et al (2014) Hepatic carboxylesterase 1 is essential for both normal and
farnesoid X receptor-controlled lipid homeostasis. Hepatology 59(5):1761–1771. doi:10.1002/
hep.26714
Luca Fontanesi
Abstract
Metabolomics is a multidisciplinary approach that combines several disciplines
to characterise metabolomes in terms of the identification and quantification of
all detectable metabolites present in a biological sample in a single experimental
design or approach. Merging metabolomics with genetics and genomics in livestock
provides intermediate phenotypes (or molecular phenotypes) that lie
between the genomic space and the external or final phenotypes
(e.g., production traits and disease resistance), contributing to the understanding of the
biological bases of complex traits. Heritability estimates have defined the extent of
the genetic contribution to metabotypes (i.e., metabolomic-derived phenotypes).
Metabotypes can be used to predict final phenotypes. Metabolite-based
genomewide association studies carried out in cattle and pigs have identified
mQTL on genes or close to genes whose function can directly explain the vari-
ability of the level of the corresponding metabolites. Despite the technological
limits of the analytical platforms that cannot provide a complete and exhaustive
picture of all metabolites present in a biofluid or tissue, metabolomics provides
new traits and biomarkers. Metabolomics might establish next generation pheno-
typing approaches that are needed to refine and improve trait descriptions and, in
turn, prediction of the breeding values of the animals to cope with traditional and
new objectives of the selection programmes.
L. Fontanesi
Department of Agricultural and Food Sciences (DISTAL), Division of Animal Sciences,
University of Bologna, Viale Fanin 46, 40127 Bologna, Italy
e-mail: luca.fontanesi@unibo.it
1 Introduction
Beadle and Tatum (1941), even before DNA was identified as the genetic material, wrote these
illuminating concepts that can represent the foundation of the modern interpretation of
the relationships between metabolism (and metabolites) and genetics (or genomics)
in all organisms, including livestock: “From the standpoint of physiological genet-
ics the development and functioning of an organism consist essentially of an inte-
grated system of chemical reactions controlled in some manner by genes. It is
entirely tenable to suppose that these genes which are themselves a part of the sys-
tem, control or regulate specific reactions in the system either by acting directly as
enzymes or by determining the specificities of enzymes. Since the components of
such a system are likely to be interrelated in complex ways, and since the synthesis
of the parts of individual genes are presumably dependent on the functioning of
other genes, it would appear that there must exist orders of directness of gene con-
trol ranging from simple one-to-one relations to relations of great complexity.”
These authors further extended the idea of Garrod (1902), who described more than
100 years ago the first inborn error of metabolism (alkaptonuria) that linked
genetics and metabolites, developing the “one gene–one enzyme” theory
(Beadle and Tatum 1941).
Extreme cases of genetic variants affecting the metabolism and related biochem-
ical products have been more recently also characterised in livestock. Some of them
are relevant in terms of economic impact on the production systems; others are only
interesting examples of animal models. It is also worth noting that in different
species, genetic defects in the same genes (i.e., homologous genes) produce similar
results. This is the case of the fishy off-flavour in cow's milk caused by elevated
levels of trimethylamine (TMA), which results from a mutation in the flavin-containing
mono-oxygenase 3 (FMO3) gene that affects the function of the encoded enzyme and the
downstream transformation of TMA into the odourless trimethylamine-N-oxide
(Lundén et al. 2002). A similar defect with high accumulation of TMA has been
reported in chicken and quail eggs (fishy taint of eggs) caused by mutations in the
homologous FMO3 genes (Honkatukia et al. 2005; Mo et al. 2013). In humans,
trimethylaminuria or fish-odour syndrome (OMIM #602079) is caused by muta-
tions in the same gene (Dolphin et al. 1997). Another important example of inborn
error of metabolism leading to extreme consequences is the deficiency of uridine
monophosphate synthase (DUMPS) in cattle, causing early embryonic death in
homozygous recessive offspring (Schwenger et al. 1993). This defect was identified
because heterozygous carrier cows had a high level of orotic acid in the milk (Robinson
et al. 1984). Elevated blood plasma cholesterol (hypercholesterolemia) has been
reported in commercial pig populations because of mutations in the APOB and
LDLR genes (Rapacz et al. 1986; Purtell et al. 1993; Hasler-Rapacz et al. 1998),
providing interesting models for human hypercholesterolemia. Several examples of
inborn errors of metabolism (or mutations affecting some metabolic products)
occurring in livestock are reported in Table 1.
It is clear that these are extreme examples of altered metabolism due to genetic
mutations. Most metabolites, however, show an interindividual continuous range of
Merging Metabolomics, Genetics, and Genomics 45
Table 1 A few examples of inborn errors of metabolism and major mutations affecting the level of metabolites in livestock
Species | Defect/trait | Description of the defect/metabolic pathway | Mutated gene | Reference
Cattle | Deficiency of uridine monophosphate synthase (DUMPS) | Early embryonic death of homozygous offspring due to deficiency of uridine monophosphate; high level of orotic acid in the milk of heterozygous cows | UMPS | Schwenger et al. (1993)
Cattle | Fishy off-flavour milk | High level of trimethylamine (TMA) in milk | FMO3 | Lundén et al. (2002)
Cattle | Yellow milk and adipose tissue colour | Accumulation or higher level of beta carotene in plasma, milk and adipose tissues | BCO2 | Berry et al. (2009); Tian et al. (2010)
Cattle | Yellow adipose tissue colour | Accumulation of beta carotene in adipose tissues | RDHE2 | Tian et al. (2012)
Sheep | Yellow adipose tissue colour | Accumulation of beta carotene in adipose tissues | BCO2 | Våge and Boman (2010)
Chicken (layers) | Fish taint | High level of trimethylamine (TMA) in eggs | FMO3 | Honkatukia et al. (2005)
Quail | Fish taint | High level of trimethylamine (TMA) in eggs | FMO3 | Mo et al. (2013)
Pig | Hypercholesterolemia | High level of blood cholesterol | APOB | Purtell et al. (1993)
Pig | Recessive familial hypercholesterolemia | High level of blood cholesterol | LDLR | Hasler-Rapacz et al. (1998)
Rabbit | Watanabe heritable hyperlipidemia | High level of blood cholesterol and triglycerides | LDLR | Yamamoto et al. (1986)
These studies were useful in establishing the first links between metabolites and
gene variants, demonstrating that in livestock too these links might affect
production traits. However, they considered only a few metabolites, usually
selected according to previous information, to define the phenotype under
investigation or of interest. New analytical developments in metabolomics are opening the
way to explore in more detail the relationships between the metabolism of the animals
and their genetic background.
Metabolome is the term coined by Oliver et al. (1998) that is used to define all
organic molecules of small molecular mass present in a biological tissue or fluid
produced by different biochemical pathways through different enzymatic reactions
and steps. Metabolites can be defined as any metabolism-originated organic
compounds that do not directly come from gene expression (Junot et al. 2014). They can
be divided into endogenous metabolites, which are produced directly by the
organism through its biochemical machinery, and xenobiotics, which are organic
compounds that are present in an organism but derive from external molecules that
are at least in part processed or transformed in the organism. Examples of
xenobiotics are drugs, drug metabolites, pollutants, or other environment-derived
compounds. Other metabolites, which by their origin cannot be included in either of these two
groups, are produced by the microbiota and then transferred to the host, contributing to
the interplay between these two biological entities (Fig. 1). Endogenous metabolites
can also be classified as primary metabolites, which are simple molecules or
monomers (e.g., sugar phosphates, amino acids, nucleotides, organic acids, and lipid
components), and secondary metabolites, which are derived from primary metabolites
(e.g., small hormones, lipids, and phytochemicals).
[Fig. 1 diagram labels. Endogenous metabolites: primary metabolites (sugar phosphates, amino acids, nucleotides, organic acids and lipid components, etc.) and secondary metabolites (small hormones, lipids, phytochemicals, etc.). Xenobiotics: drugs, drug metabolites, pollutants or other environmental derived compounds, etc. Microbiota: primary and secondary metabolites.]
try to obtain raw chemical data and data analysis and data interpretation disciplines
such as chemometrics, biostatistics, biochemistry, and bioinformatics to character-
ise metabolomes in terms of the identification and quantification of all detectable
metabolites present in a biological sample in a single experimental design or
approach (Adamski and Suhre 2013). Metabolomics can provide a picture of a par-
ticular biochemical state of an organism (through its analysed biosamples) that is
influenced by a specific combination of genetic factors (acting through gene expres-
sion and protein or enzyme production and activities) and environmental factors
(nutrition, environmental conditions, treatments, biological phase of the animals,
etc.). In this context, we can consider metabolite-derived elements (the metabolite
species plus its quantification) as important components of the so-called intermedi-
ate phenotypes (Fiehn 2002; Houle et al. 2010) that lie (in the middle) between
genomic information and complex production traits (indicated also as final or exter-
nal phenotypes; Fig. 2). These metabolomic phenotypes can be called metabotypes.
Additional “phenotypic” levels can be identified, considering the different biologi-
cal steps (and then the related involved biomolecules that are investigated by other
omic approaches) to reach the final levels, i.e., complex phenotypes that directly
represent economic traits in livestock (e.g., growth rate, feed efficiency, carcass
traits, and milk production traits). Considering the structure of the biological infor-
mation with different levels (Fig. 2), the challenge is to use metabotypes to fill gaps
to understand the biological mechanisms that construct, on the whole, the differ-
ences among animals in terms of performance or other economically relevant traits
Fig. 2 Different levels of phenotypic information with intermediate phenotypes (modified from
Fontanesi, 2016). Elements shown in the diagram: genome level, proteins (proteomics),
metabotypes (metabolomics), microbiome, environmental factors, internal phenotypes, and
production traits.
48 L. Fontanesi
of the different information levels and their closeness to the final phenotypes, can be
used as predictors or proxies of more complex production traits. In particular, they
might be very advantageous for the prediction of “difficult” traits that can be
recorded late in the productive life of the animals, or that cannot be measured on all
animals due to high cost, or that, like disease resistance, need to be defined in condi-
tions that cannot be routinely obtained (e.g., challenging with pathogens).
It is worth pointing out that although many studies have contributed to a better
picture of the metabolic pathways and mechanisms (which are the foundations of
subsequent developments and improvements in this field), we are still at the beginning
of a process leading towards the complete characterisation of all metabolites
produced in a complex organism such as an animal (Patti et al. 2012; Fontanesi
2016). This is mainly due to the technology gaps of metabolomics compared with
genomics and transcriptomics. Genomics and transcriptomics have to collect complex
information that is nevertheless defined by a very simple alphabet composed of
only four (+ one) letters (the four nucleotides: A, T or U, C, and G). The intrinsic
large heterogeneity of the metabolites and their high variability in terms of stability
have thus far prevented the development of metabolomic analytical platforms with
the same potential as next-generation sequencing technologies. However, recent
advances in bioanalytical approaches, including the development and integration of
mass spectrometry (MS), high-performance liquid chromatography (HPLC),
and nuclear magnetic resonance (NMR) spectroscopy, have substantially increased
the throughput, precision, and sensitivity of the analytical platforms. Each of these
methods and instruments has its own advantages and drawbacks. These need to be
weighed considering that, to integrate metabotypes with genetics, it is very important
to be able to analyse the largest possible number of metabolites in a large number of
animals at the lowest possible cost per unit.
Analytical platforms should produce information useful for the chemical identifi-
cation and quantification of the metabolites. Chemical identification is assigning an
analyte (analytical signal) to a set of chemical compounds or to a group/class of
compounds: compounds may be “known known,” “known unknown,” or “unknown
unknown” (Milman 2015). The first case applies to so-called targeted metabolomics,
which identifies and quantifies, at the same time, defined groups of chemically characterised
and biochemically annotated metabolites (Roberts et al. 2012). In this
approach, therefore, the selected compounds are determined according to what is
specified before the analytical procedures are performed, by using standards and internal
or external reference compounds (Milman 2015). Untargeted or nontargeted metabolomics
covers “known unknown” analytes (compounds that are unknown before the analyses
but might not be new and can be identified subsequently through informatics analysis
of available database information and libraries) and “unknown unknown” analytes (new
compounds). The decision of which approaches (targeted or untargeted) to
Merging Metabolomics, Genetics, and Genomics 49
use depends on several factors. Targeted metabolomics is useful when the enrich-
transcriptomics (Tárraga et al. 2008). MSEA is implemented in a few tools for com-
[Fig. 3 plot: pathways numbered 1–9, with -log(p) on the vertical axis]
Fig. 3 An example of a metabolome view of the pathway analysis obtained with MetaboAnalyst
in pigs (modified from Bovo et al. 2015). The impact is the pathway impact value calculated from
pathway topology analysis. Indicated pathways are as follows: (1) valine, leucine, and isoleucine
biosynthesis; (2) valine, leucine, and isoleucine degradation; (3) aminoacyl-tRNA biosynthesis;
(4) beta-alanine metabolism; (5) arginine and proline metabolism; (6) glycerophospholipid metab-
olism; (7) linoleic acid metabolism; (8) tryptophan metabolism; and (9) taurine and hypotaurine
metabolism. Other pathways are not indicated
in Livestock
The first step to link genetics to metabolites is to estimate the heritability of metabo-
types. The heritability of metabolomically derived information has been investigated
thus far in pigs and dairy cattle. In pigs, in particular, heritability was estimated for
plasma metabolites analysed with a targeted approach in approximately 900 perfor-
mance-tested Italian Large White pigs (Fontanesi et al. 2014). Data were grouped
according to the biochemical classes of the different metabolites. A broad range
of heritability values was reported (from 0.07 to 0.73), suggesting considerable heterogeneity
both within and across metabolite classes (Fontanesi et al. 2014). In dairy species,
the most interesting biofluid for metabolomic analysis is milk. It can be
easily collected, and it provides useful information to assess animal metabolism
and to evaluate nutritional and cheese-making properties. For these reasons, several
studies in dairy cattle estimated the heritability of specific milk metabolites that were
considered important biomarkers of particular states of the cow or predictors of pro-
duction and functional traits (e.g., urea measured as milk urea nitrogen, lactate,
β-hydroxybutyrate or BHBA, acetone, and glucose; Welper and Freeman 1992;
Mitchell et al. 2005; Miglior et al. 2006; 2007; Stoop et al. 2007; Van der Drift et al.
2012) or relevant for nutritional quality of the milk (e.g., fatty acids; Soyeurt et al.
2007; Stoop et al. 2008). Two studies estimated the heritability of quite a large num-
ber of metabolites detected in bovine milk using metabolomic approaches (Buitenhuis
et al. 2013; Wittenburg et al. 2013). Buitenhuis et al. (2013) detected 31 metabolites
in 371 mid-lactating Danish Holstein cows using ¹H-NMR spectroscopy, obtaining
estimates of heritability (h²) ranging from 0 (lactic acid) to more than 0.8 for orotic acid and
BHBA. Wittenburg et al. (2013) estimated genetic parameters and evaluated the
one another (in metabolic pathways) close to the biological mechanisms determining,
on the whole, a final phenotype, and (ii) identification of biomarkers that can be
used as correlated and convenient proxies or substitutes for traditionally defined
traits. If this were possible, the advantages would be (i) to predict traits that are difficult or
expensive to measure or detect (e.g., disease resistance) or (ii) to predict as
early as possible traits that can be measured or inferred only late in the productive life of
the animals, with or without any other information, such as pedigree and the related
genealogically derived estimated breeding values.
Approaches in these directions were first designed to identify associations
between metabolites and production traits or specific states of the animals as a starting
point to define specific physiological roles related to their presence or their different
levels in some conditions. Several studies in this direction were reported in
dairy cattle, analysing blood and/or milk metabolites without any direct evaluation
of the genetic factors affecting these relationships (e.g., Klein et al. 2010; Ilves et al.
2012; Harzia et al. 2012, 2013; Melzer et al. 2013a; Sundekilde et al. 2014).
Genetic factors affecting metabolite parameters were considered to establish a
prognostic biomarker for risk of ketosis in dairy cattle (Klein et al. 2012). Breeding
values for energy balance and milk fat-to-protein ratio (that were reported to be
associated with liability to metabolic disorders; Buttchereit et al. 2011) were significantly
correlated with the levels of milk glycerophosphocholine (GPC) and phosphocholine
(PC) or with their ratio (GPC/PC). High GPC values, low PC values, and a GPC/PC
ratio >2.5 were considered indicators of resistance to ketosis in dairy cattle (Klein
et al. 2012).
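As a minimal illustration of the ratio rule attributed to Klein et al. (2012), the following Python sketch computes the GPC/PC ratio of a milk sample and flags values above 2.5. The function names, the sample values, and the arbitrary concentration units are assumptions made for illustration only; the text gives no separate cut-offs for "high GPC" or "low PC", so only the ratio is checked here.

```python
RATIO_THRESHOLD = 2.5  # GPC/PC cut-off quoted in the text (Klein et al. 2012)


def gpc_pc_ratio(gpc: float, pc: float) -> float:
    """Return the GPC/PC ratio for one milk sample (both in the same units)."""
    if pc <= 0:
        raise ValueError("PC concentration must be positive")
    return gpc / pc


def indicates_ketosis_resistance(gpc: float, pc: float) -> bool:
    """True when the GPC/PC ratio is strictly above the 2.5 cut-off."""
    return gpc_pc_ratio(gpc, pc) > RATIO_THRESHOLD


# Hypothetical samples: cow id -> (GPC, PC), arbitrary units
samples = {"cow_A": (6.0, 2.0), "cow_B": (3.0, 2.0)}
flags = {cow: indicates_ketosis_resistance(gpc, pc)
         for cow, (gpc, pc) in samples.items()}
```

In practice such a flag would only be one indicator among others, and the cut-off would need validation in the population of interest.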
Melzer et al. (2013a) used milk metabolite profiles to predict milk protein and fat
content and milk pH. Important metabolites were identified using random forests
and PLS. Prediction precision (defined as the correlation between estimated and
observed milk trait values) was higher for milk protein (0.63–0.64) with 16 impor-
tant metabolites identified and lower for milk fat and pH (approximately 0.35) with
11 and 10 different important metabolites identified for the two traits, respectively
(Melzer et al. 2013a). Genetic correlations between milk metabolites and milk pro-
duction traits were reported by Buitenhuis et al. (2013). Several metabotypes were
correlated with one or another milk trait, indicating that some of these metabolites
could be eventually used as biomarkers to disrupt unfavourable correlations between
traits.
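Prediction precision, as defined above, is the correlation between estimated and observed trait values. The sketch below computes a Pearson correlation in plain Python; the milk protein values are invented for illustration and are not taken from Melzer et al. (2013a), and the choice of the Pearson coefficient is an assumption, since the text does not state which correlation was used.

```python
from math import sqrt


def pearson_correlation(predicted, observed):
    """Pearson correlation between predicted and observed trait values."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / sqrt(var_p * var_o)


# Hypothetical milk protein content (%) for five cows
predicted_protein = [3.2, 3.4, 3.1, 3.6, 3.3]
observed_protein = [3.1, 3.5, 3.2, 3.4, 3.3]
precision = pearson_correlation(predicted_protein, observed_protein)
```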
In beef cattle, Karisa et al. (2014) reported that 12 plasma metabolites were sig-
nificantly associated with residual feed intake and accounted for approximately
98% of the variation in this trait. However, metabolite levels fluctuated greatly
across different ages in different steer populations and should be considered only as
potential biomarkers for this important trait.
The predictive power of metabolomic profiles for production traits has been investigated
in performance-tested growing pigs (60 days old) from three breeds using
plasma ¹H-NMR fingerprinting (Rohart et al. 2012). This approach relies only indirectly
on dissecting, and then summing up, the contribution of a metabolomic
profile to a few traits: in this case, there is no need to fully characterise the metabolic
peaks and then attribute a chemical name to all signals. For this approach, the
by selected metabolites with the greatest predictive values (Rohart et al. 2012).
Predicted traits were growth rate, feed efficiency, carcass, and meat quality traits.
Prediction accuracy was highly dependent on the trait and was improved by including
the breed of origin of the animals in the model, but not by including the batch of the
animals (probably because micro-environmental effects were not significantly different
in a performance testing structure). Traits that were determined after slaughtering
were predicted with high error rates. This might be expected, considering, for example,
that slaughtering conditions are well known to be the most important factors
affecting meat quality parameters. Other traits were well predicted. In particular,
average daily feed intake, which is an expensive and difficult trait to measure,
was predicted with quite good accuracy (Rohart et al. 2012).
The link between genomic information and the level of metabolites accumulated in
specific tissues, circulating in biofluids or essential for important cell functions in
animals, has been already established for several inborn errors of metabolism, as
already discussed, with the identification of causative mutations for these defects
(Table 1). However, for many metabolites that do not affect in extreme ways visible
animal phenotypes or produce genetic diseases, we are just beginning to establish
relationships at the genomic level. In addition, considering information derived
from the estimation of heritability in livestock, it is clear that genetic factors (i.e.,
gene polymorphisms directly identified or indirectly captured using DNA markers
in linkage disequilibrium) can affect the level of many other metabolites, leading
from minor (if even detectable) to relevant modifications of metabolomic profiles
(Fontanesi 2016). Metabolites whose level is modified by genetic factors have been
called genetically influenced metabotypes or GIM (Suhre and Gieger 2012).
Genomewide association studies using metabotypes (mGWAS) analysed in serum,
plasma, urine, and liver have already identified, in humans and mice, SNP–metabolite
trait associations or mQTL close to or within genes encoding key
components of the metabolic machineries (e.g., enzymes, transporters, or other
related proteins) (reviewed in Suhre and Gieger 2012; Gauguier 2015; Kastenmüller
et al. 2015). In this way, mQTLs establish a direct link between genes and metabolites that can
explain the biological reasons for these associations. In some cases, the functional
interpretation of the results might be difficult due to the lack of information on the
roles of the genes, even if the association with known and well-defined metabolites
might help to attribute a potential function to uncharacterised genes. On the other
hand, it could be possible to deorphanise uncharacterised metabolites or metabolite
features (peaks or spectra, according to the analytical platforms) if mQTLs are
localised on genes that are already well described (Rueedi et al. 2014). mQTLs usu-
ally explain a relevant fraction of the genetic variance for the associated metabo-
types (10%–30%). The same regions might be associated with increased risks for
complex diseases, suggesting that the colocalised mQTL might be important to
define the disease state or the biological mechanisms underlying the disease state or
susceptibility.
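The idea of a single SNP explaining part of the variation of a metabotype can be sketched with a simple linear regression of metabolite level on allele count. This is only an illustrative simplification: genotypes are assumed to be coded additively as 0, 1, or 2 copies of one allele, no covariates or relatedness corrections are included, the data are invented, and the R² returned is the share of phenotypic variance in the sample, not the fraction of genetic variance discussed in the text.

```python
def snp_regression(genotypes, metabolite):
    """Least-squares fit of metabolite level on allele count (0, 1, 2).

    Returns (slope, intercept, r_squared).
    """
    n = len(genotypes)
    if n != len(metabolite) or n < 3:
        raise ValueError("need two equally long lists with at least 3 values")
    mean_g = sum(genotypes) / n
    mean_m = sum(metabolite) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((m - mean_m) ** 2 for m in metabolite)
    sxy = sum((g - mean_g) * (m - mean_m) for g, m in zip(genotypes, metabolite))
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in genotypes or metabolite levels")
    slope = sxy / sxx                    # estimated allele substitution effect
    intercept = mean_m - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)   # variance explained by this SNP
    return slope, intercept, r_squared


# Hypothetical data: allele counts and plasma metabolite levels for ten animals
genotypes = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
metabolite = [4.1, 3.9, 4.6, 4.4, 4.8, 5.2, 5.0, 4.0, 4.5, 5.3]
slope, intercept, r2 = snp_regression(genotypes, metabolite)
```

A real mGWAS repeats a test of this kind for every marker and metabotype and must correct for multiple testing and population structure.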
Population-based mGWAS have also been carried out in pigs and dairy cattle. In
pigs, Fontanesi et al. (2014, 2015) reported an mGWAS on performance-tested animals
whose plasma was analysed with a targeted metabolomic platform. The levels
of several circulating plasma nutrients were associated with several genes, explaining
a relevant fraction of the genetic variability of these metabotypes and thus
creating new possibilities to design nutrigenomic approaches (Fontanesi 2016). In
dairy cattle, Buitenhuis et al. (2013) used an untargeted metabolomic approach
based on NMR to analyse metabotypes in milk of 371 Holstein cows. Eight genome-
wide associations were reported on different bovine chromosomes (BTA): orotic
acid (BTA1), malonate on BTA2 and BTA3, galactose-1-phosphate on BTA2, glu-
cose on BTA11, urea on BTA12, and carnitine and glycerophosphocholine on
BTA25. Another 21 chromosome-significant associations were reported. Of these
mQTL, a few were located on genes or close to genes that might be involved in
defining the associated milk metabotypes, whereas for others, the function of the
closest genes in relationships with the associated metabolites was not clear. These
unexpected relationships could possibly contribute to assigning novel functions to
these genes. Among the metabolites for which QTLs were identified, it is worth
mentioning glycerophosphocholine, which is considered a biomarker for ketosis
resistance (Klein et al. 2012). Another GWAS that was carried out for a few milk
metabolites related to ketosis resistance (phosphocholine, glycerophosphocholine,
and the ratio between the two metabolites) confirmed an mQTL for glycerophosphocholine
on BTA25 (Tetens et al. 2015). Gene variants in the apolipoprotein B
receptor (APOBR) gene were suggested to be the causative mutations of the
mQTL (Tetens et al. 2015).
Other GWAS were based on one or a few milk or plasma metabolites in dairy
cattle. These metabolites were preselected based on their relevance to human nutri-
tion (considering the milk) or as biomarkers of physiological states of the cows, as
also discussed above. In particular, Poulsen et al. (2015) focused their study on the
level of riboflavin (vitamin B2) in the milk of a total of ~800 Danish Holstein and
Danish Jersey cows. Riboflavin is an essential water-soluble vitamin with many
biological roles, and milk is one of the main sources of this nutrient in the human
diet. Significant markers were reported on BTA14 and BTA17 in Jersey and on
several other chromosomes in Holstein, most of which were on BTA13 and BTA14.
The most promising mQTL for riboflavin content was located on BTA13 at the position
of the SLC52A3 gene, which codes for a riboflavin transporter whose function
might directly affect this metabotype. Another GWAS, for nonesterified fatty
acid (NEFA), BHBA, and glucose in bovine milk (considered as indicators of metabolic
adaptation of the cows), was reported by Ha et al. (2015). Instead of conventional
single-marker analyses, this study used gene enrichment approaches to
increase the power to detect gene sets and pathways that might help to
explain the metabolic adaptability of dairy cows in their early lactation periods.
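Gene set enrichment is often illustrated with an over-representation test based on the hypergeometric distribution. The sketch below is a generic example of that idea and is not claimed to be the procedure used by Ha et al. (2015); the function name and all gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: genes tested in total
    n_in_set:     genes belonging to the pathway or gene set
    n_selected:   genes flagged by the association analysis
    n_overlap:    flagged genes that belong to the set
    """
    if not 0 <= n_overlap <= min(n_in_set, n_selected):
        raise ValueError("overlap is inconsistent with the set sizes")
    total = comb(n_background, n_selected)
    upper = min(n_in_set, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes tested, a pathway of 50 genes,
# 200 associated genes, 5 of which fall in the pathway
p_value = enrichment_p_value(20000, 50, 200, 5)
```

With an expected overlap of only 0.5 genes in this invented example, five shared genes give a small p-value; in a real analysis the p-values of all tested gene sets would also be corrected for multiple testing.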
Lu et al. (2015) reported a study that evaluated the milk lipid and metabolome
composition in a few milk samples from cows with different genotypes at the
The inclusion of intermediate phenotypes between the genomic space and the external
phenotypes helps to fill the biological gaps between these two most distant
biological levels (Fig. 2). Among the several possible intermediate levels, metabotypes
seem to be the most promising for developing approximate systems genetic models
to understand the molecular basis of complex traits. This is due to the fact that
metabotypes are very close to the external phenotypes (i.e., production traits) that
are important in animal breeding. In addition, the biochemical profiles of the ani-
mals can be useful to monitor or to define their physiological states that in most
cases express the production potentials of the animals, if it is possible to distinguish
the genetic components from the environmental influences. Missed biological infor-
mation at the other intermediate levels may produce approximations. However, it
seems that a three-level modelling system can potentially be implemented (in prac-
tice) to clarify the biological steps that produce economically relevant traits and in
turn to predict final phenotypes (Fontanesi 2016). Among the intermediate phenotypes,
metabotypes seem to be much easier to analyse on a routine basis.
For example, it is usually considerably easier and cheaper to analyse metabolites in milk or
plasma (or serum) from a large number of animals than to obtain gene expression data
at a genomewide level and at the population level. In addition, the collection of
relevant tissues for gene expression analysis might be very complicated in field trials.
It could be interesting to include metabolomic data in addition to SNPs in
novel methodological implementations of genomic selection. The integration of
metabolomics and genomics into genomic selection could be useful when the prediction
accuracy is limited by the low number of animals in the training
population, when the heritability of the investigated trait is low, or when it is
important to use proxies for more complex or difficult traits that cannot be measured
directly on the animals (e.g., disease resistance defined in challenge experiments). As a
first step in this direction, Ehret et al. (2015) described predictive models for
subclinical ketosis risk in approximately 200 cows by merging SNP data and a few
Conclusions
mGWAS implemented in livestock reported significant markers even if a lower
number of individuals were analysed than what is common in GWAS carried out
in humans. This might be due, at least in part, to the fact that it is usually much
easier to control or reduce environmental factors affecting the level of metabo-
lites in animals than in humans. On the other hand, large-scale implementations
of metabolomic studies in animals seem intrinsically more difficult in field trials
than in humans, and specific sampling protocols and procedures might be needed.
Despite technological limitations (current analytical platforms cannot
provide a complete and exhaustive picture of all metabolites present in a
biofluid or tissue), metabolomics merged with genetics and genomics can contribute
to clarifying the biological bases of complex traits in livestock. New traits
and biomarkers can be defined using metabolomics. Metabolomics can be used
to establish next-generation phenotyping approaches that are needed to refine
and improve trait descriptions and, in turn, prediction of the breeding values of
the animals to cope with traditional and new objectives of selection programmes
(Fontanesi 2016).
References
Adamski J, Suhre K (2013) Metabolomics platforms for genome wide association studies – linking
the genome to the metabolome. Curr Opin Biotechnol 24:39–47. doi:10.1016/j.
copbio.2012.10.003
Alonso A, Marsal S, Julià A (2015) Analytical methods in untargeted metabolomics: state of the
art in 2015. Front Bioeng Biotechnol 3:23. doi:10.3389/fbioe.2015.00023
Beadle GW, Tatum EL (1941) Genetic control of biochemical reactions in Neurospora. Proc Natl
Acad Sci U S A 27:499–505
Berry SD, Davis SR, Beattie EM et al (2009) Mutation in bovine beta-carotene oxygenase 2
affects milk color. Genetics 182:923–926. doi:10.1534/genetics.109.101741
Bovo S, Mazzoni G, Calò DG et al (2015) Deconstructing the pig sex metabolome: targeted metabolomics
in heavy pigs revealed sexual dimorphisms in plasma biomarkers and metabolic pathways.
J Anim Sci 93:5681–5693. doi:10.2527/jas2015-9528
Bro R, Smilde AK (2014) Principal component analysis. Anal Methods 6:2812–2831. doi:10.1039/
c3ay41907j
Buitenhuis AJ, Sundekilde UK, Poulsen NA et al (2013) Estimation of genetic parameters and
detection of quantitative trait loci for metabolites in Danish Holstein milk. J Dairy Sci 96:3285–
3295. doi:10.3168/jds.2012-5914
daily energy balance, feed intake, body condition score, and fat to protein ratio of milk in dairy
cows. J Dairy Sci 94:1586–1591. doi:10.3168/jds.2010-3396
Caspi R, Altman T, Billington R et al (2014) The MetaCyc database of metabolic pathways and
enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res
42:D459–D471. doi:10.1093/nar/gkt1103
Davis SR, Farr VC, Prosser CG et al (2004) Milk L-lactate concentration is increased during mas-
titis. J Dairy Res 71:175–181. doi:10.1017/S002202990400007X
Dolphin CT, Janmohamed A, Smith RL et al (1997) Missense mutation in flavin-containing mono-
oxygenase 3 gene, FMO3, underlies fish-odour syndrome. Nat Genet 17:491–494. doi:10.1038/
ng1297-491
Ehret A, Hochstuhl D, Krattenmacher N et al (2015) Short communication: use of genomic and
metabolic information as well as milk performance records for prediction of subclinical
ketosis risk via artificial neural networks. J Dairy Sci 98:322–329. doi:10.3168/
jds.2014-8602
Ellinger JJ, Chylla RA, Ulrich EL et al (2013) Databases and software for NMR-based metabolo-
mics. Curr Metabolomics 1(1). doi:10.2174/2213235X11301010028
Fiehn O (2002) Metabolomics – the link between genotypes and phenotypes. Plant Mol Biol
48:155–171. doi:10.1023/A:1013713905833
Fontanesi L (2016) Metabolomics and livestock genomics: insights into a phenotyping frontier and
its applications in animal breeding. Anim Front 6:73–79. doi:10.2527/af.2016-0011
Fontanesi L, Bovo S, Mazzoni G et al (2014) Genome wide perspective of genetic variation in pig
metabolism and production traits. Manuscript n. 359. Proceedings of 10th world congress on
genetics applied to livestock production, Vancouver, 17–22 Aug 2014
Fontanesi L, Schiavo G, Bovo S et al (2015) Dissecting complex traits in pigs: metabotypes illu-
minate genomics for practical applications. P. 152. Abstract retrieved from the book of abstracts
of the 66th annual meeting of the European Federation of Animal Science. Book of Abstracts
No. 21, Warsaw, 31 Aug–4 Sept 2015
Fuhrer T, Zamboni N (2015) High-throughput discovery metabolomics. Curr Opin Biotechnol
31:73–78. doi:10.1016/j.copbio.2014.08.006
Gallardo D, Pena RN, Amills M et al (2008) Mapping of quantitative trait loci for cholesterol,
LDL, HDL, and triglyceride serum concentrations in pigs. Physiol Genomics 3:199–209.
doi:10.1152/physiolgenomics.90249.2008
Garrod AE (1902) The incidence of alkaptonuria: a study in chemical individuality. Lancet
2:1616–1620
Gauguier D (2015) Application of quantitative metabolomics in systems genetics in rodent models
of complex phenotypes. Arch Biochem Biophys. doi:10.1016/j.abb.2015.09.016
Geishauser T, Leslie K, Tenhag J et al (2000) Evaluation of eight cow-side ketone tests in milk for
detection of subclinical ketosis in dairy cows. J Dairy Sci 83:296–299. doi:10.3168/jds.
S0022-0302(00)74877-6
Ha NT, Gross JJ, van Dorland A et al (2015) Gene-based mapping and pathway analysis of metabolic
traits in dairy cows. PLoS One 10:e0122325. doi:10.1371/journal.pone.0122325
Harzia H, Kilk K, Jõudu I et al (2012) Comparison of the metabolic profiles of noncoagulating and
coagulating bovine milk. J Dairy Sci 95:533–540. doi:10.3168/jds.2011-4468
Harzia H, Ilves A, Ots M et al (2013) Alterations in milk metabolome and coagulation ability dur-
ing the lactation of dairy cows. J Dairy Sci 96:6440–6448. doi:10.3168/jds.2013-6808
Hasler-Rapacz J, Ellegren H, Fridolfsson AK et al (1998) Identification of a mutation in the low
density lipoprotein receptor gene associated with recessive familial hypercholesterolaemia in
swine. Am J Med Genet 76:379–386. doi:10.1002/(SICI)1096-8628(19980413)76:5<379::AID-
AJMG3>3.0.CO;2-I
Honkatukia M, Reese K, Preisinger R et al (2005) Fishy taint in chicken eggs is associated with a
substitution within a conserved motif of the FMO3 gene. Genomics 86:225–232. doi:10.1016/j.
ygeno.2005.04.005
Houle D, Govindaraju DR, Omholt S (2010) Phenomics: the next challenge. Nat Rev Genet
11:855–866. doi:10.1038/nrg2897
Ilves A, Harzia H, Ling K et al (2012) Alterations in milk and blood metabolomes during the first
months of lactation in dairy cows. J Dairy Sci 95:5788–5797. doi:10.3168/jds.2012-5617
Jewison T, Su Y, Disfany FM et al (2014) SMPDB 2.0: big improvements to the Small Molecule
Pathway Database. Nucleic Acids Res 42:D478–D484. doi:10.1093/nar/gkt1067
Johnson CH, Ivanisevic J, Benton HP et al (2015) Bioinformatics: the next frontier of metabolo-
mics. Anal Chem 87:147–156. doi:10.1021/ac5040693
Junot C, Fenaille F, Colsch B et al (2014) High resolution mass spectrometry based techniques at
the crossroads of metabolic pathways. Mass Spectrom Rev 33:471–500. doi:10.1002/
mas.21401
Kanehisa M, Goto S, Sato Y et al (2014) Data, information, knowledge and principle: back to
metabolism in KEGG. Nucleic Acids Res 42:D199–D205. doi:10.1093/nar/gkt1076
Karisa BK, Thomson J, Wang Z et al (2014) Plasma metabolites associated with residual feed
intake and other productivity performance traits in beef cattle. Liv Sci 165:200–211.
doi:10.1016/j.livsci.2014.03.002
Kastenmüller G, Raffler J, Gieger C et al (2015) Genetics of human metabolism: an update. Hum
Mol Genet 24:R93–R101. doi:10.1093/hmg/ddv263
Kelder T, van Iersel MP, Hanspers K et al (2012) WikiPathways: building research communities on
biological pathways. Nucleic Acids Res 40:D1301–D1307. doi:10.1093/nar/gkr1074
Klein MS, Almstetter MF, Schlamberger G et al (2010) Nuclear magnetic resonance and mass
spectrometry-based milk metabolomics in dairy cows during early and late lactation. J Dairy
Sci 93:1539–1550. doi:10.3168/jds.2009-2563
Klein MS, Buttchereit N, Miemczyk SP et al (2012) NMR metabolomic analysis of dairy cows
reveals milk glycerophosphocholine to phosphocholine ratio as prognostic biomarker for risk
of ketosis. J Proteome Res 11:1373–1381. doi:10.1021/pr201017n
Krumsiek J, Suhre K, Illig T et al (2011) Gaussian graphical modelling reconstructs pathway
reactions from high-throughput metabolomics data. BMC Syst Biol 5:21.
doi:10.1186/1752-0509-5-21
Lu J, Boeren S, van Hooijdonk T et al (2015) Effect of the DGAT1 K232A genotype of dairy cows
on the milk metabolome and proteome. J Dairy Sci 98:3460–3469. doi:10.3168/
jds.2014-8872
Lundén A, Marklund S, Gustafsson V et al (2002) A nonsense mutation in the FMO3 gene under-
lies fishy off-flavor in cow’s milk. Genome Res 12:1885–1888. doi:10.1101/gr.240202
Melzer N, Wittenburg D, Hartwig S et al (2013a) Investigating associations between milk metabo-
lite profiles and milk traits of Holstein cows. J Dairy Sci 96:1521–1534. doi:10.3168/
jds.2012-5743
Miglior F, Sewalem A, Jamrozik J (2006) Analysis of milk urea nitrogen and lactose and their
effect on longevity in Canadian dairy cattle. J Dairy Sci 89:4886–4894. doi:10.3168/jds.
S0022-0302(06)72537-1
Miglior F, Sewalem A, Jamrozik J et al (2007) Genetic analysis of milk urea nitrogen and lactose
and their relationships with other production traits in Canadian Holstein cattle. J Dairy Sci
90:2468–2479. doi:10.3168/jds.2006-487
Milman BL (2015) General principles of identification by mass spectrometry. TrAC Trends Anal
Chem 69:24–33. doi:10.1016/j.trac.2014.12.009
Mitchell RG, Rogers GW, Dechow CD et al (2005) Milk urea nitrogen concentration: heritability
and genetic correlations with reproductive performance and disease. J Dairy Sci 88:4434–4440.
doi:10.3168/jds.S0022-0302(05)73130-1
Mo F, Zheng J, Wang P et al (2013) Quail FMO3 gene cloning, tissue expression profiling, polymorphism
detection and association analysis with fishy taint in eggs. PLoS One 8:e81416.
doi:10.1371/journal.pone.0081416
Oliver SG, Winson MK, Kell DB et al (1998) Systematic functional analysis of the yeast genome.
Trends Biotechnol 16:373–378. doi:10.1016/S0167-7799(98)01214-1
Patti GJ, Yanes O, Siuzdak G (2012) Innovation: metabolomics: the apogee of the omics trilogy.
Tiemeyer W, Stohrer M, Giesecke D (1984) Metabolites of nucleic acids in bovine milk. J Dairy
Abstract
High-throughput sequencing technology is rapidly replacing expression arrays and
becoming the standard method for global expression profiling studies. The develop-
ment of low-cost, rapid sequencing technologies has enabled detailed quantification
of gene expression levels, affecting almost every field in the life sciences. In this
chapter, we give an overview of the key points of gene expression analysis using RNA-seq
data. First, we will discuss the workflows of RNA-seq data analysis followed by
a discussion about the currently available tools for data analysis and a comparison
between these tools. The chapter concludes with a discussion about the application
of RNA-seq data analysis in livestock. In the appendix, using an example from
livestock RNA-seq data, we show a simple script for RNA-seq data analysis.
1 Introduction
There are two main platforms broadly used for expression profiling: microarrays
and direct sequencing of transcripts. Until a few years ago, microarrays were the
dominant platform, but they are rapidly being superseded by next-generation
comparison to microarrays is that it is not limited to the probes on the array; this
allows the discovery of new genes, isoforms, transcripts (Huang and Khatib 2010),
and small noncoding RNA (Korpelainen et al. 2014). A summary of the main steps
in an RNA-seq workflow is shown in Fig. 1. There are various sequencing plat-
forms, but Illumina is the most widely adopted one for RNA-seq because it displays
a low substitution error rate per read, and there are almost no insertion or deletion
(indel) errors (Ramsköld et al. 2012). Sequencing is based on the detection of fluo-
rescent signals produced by the addition of a single nucleotide during the synthesis
of DNA (Korpelainen et al. 2014).
In broad terms, an RNA-seq experiment involves making a library prior to the
sequencing. The basic steps will require RNA extraction, enrichment for size/type
of RNA, and fragmentation, followed by synthesis of complementary DNA (cDNA,
which is also what is used for microarray hybridization) using random hexamer
primers. The cDNA is then ligated to platform-specific proprietary adaptor
sequences that are attached to the ends of the fragments; finally, an amplification
round completes the library preparation step. Barcodes (short sequences of 5–7 base
pairs that are used to tag a library) are sometimes also added to allow multiplexing
of samples (in this way, multiple samples can be sequenced in the same reaction). After a
quality control step, the cDNA library is ready to be sequenced, and it is placed in
the lanes of a flow cell for an amplification step to produce clusters of
double-stranded DNA (dsDNA). During this amplification, the addition of nucleotides is
detected, recorded, and converted into base calls. The number of cycles used in the
amplification determines the length of the reads, whereas the number of clusters defines
the number of reads (depth of sequencing). At the end of the amplification, the raw
data (short reads) are usually exported as FASTQ files.
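To make the FASTQ output more concrete, the following Python sketch reads four-line FASTQ records and computes the mean base quality of each read. It is only an illustration: it assumes Phred+33 quality encoding and unwrapped four-line records, and the function names and the two example reads are hypothetical.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

def mean_phred(quality, offset=33):
    """Mean Phred score of a read (assumption: Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)

# Hypothetical two-read example: 'I' encodes Phred 40 and '!' encodes Phred 0.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(parse_fastq(example))
qualities = {read_id: mean_phred(qual) for read_id, _, qual in records}
```

In practice, dedicated quality control tools would be used on real data; the sketch only shows what information each record carries.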
Because of the possible introduction of biases during sequencing, it is important
to take into consideration the research question and experimental design, which will
determine the appropriate number of biological or technical replicates, the depth of
sequencing, and the use of single-end (SE) or paired-end reads (PE). PE reads are
the preferred option if the objective of the study is to accurately calculate the abun-
dance of alternative splicing (AS) events within single genes. However, if the study
aims to accurately estimate gene abundance, it is better to sequence large numbers
of short SE reads (Li and Dewey 2011).
A proper experimental design will allow us to recognize if the variation in the
RNA-seq data is due to biological or technical factors (Huang and Khatib 2010).
The number of biological replicates (samples, tissues or cell lines) is determined by
the aim of the experiment and the statistical power required, which in turn depends
on the variability between biological replicates (Zhang et al. 2014). Therefore, a
proper number of biological replicates is essential to determine if the differences in
gene expression are consistent and to evaluate the variance in expression of genes
(Ramsköld et al. 2012; Trapnell et al. 2012). The combination of more biological
replicates and an increased number of reads (depth) results in an increase in statistical
power to detect differentially expressed genes (Trapnell et al. 2013; Wang and
Cairns 2013), and it also improves reproducibility (Liu et al. 2014). Anders et al.
(2012) suggested the use of at least three or four biological replicates per group to
66 S. de las Heras-Saldana et al.
[Fig. 1 Main steps in an RNA-seq workflow: experimental question; RNA isolation, DNase treatment, and RNA quality controls; size selection; RNA fragmentation; adapter ligation; PCR enrichment; quality control; cDNA library placed in 1 of 8 lanes of a flow cell; preprocessing (not in R); alignment to a reference genome or transcriptome (BAM file); quality metrics; assembly of a consensus of reads (Cufflinks, GTF; Scripture, BED; Maq); data analysis with a count-based gene strategy (detection of DE with DESeq, edgeR, baySeq) or a transcript/gene isoform count (FPKM) strategy (detection of DE with CuffDiff, DEXSeq); normalization; functional analysis]
annealed during amplification (Sims et al. 2014). Genes with high and low GC con-
tent are underrepresented, and this affects their expression counts (Risso et al. 2011;
Hansen et al. 2012; Korpelainen et al. 2014), which in turn causes gaps in the tran-
script assembly (Martin and Wang 2011). To reduce GC bias, it is necessary to care-
fully optimize the PCR methodology (Sims et al. 2014) and utilize an appropriate
normalization method (Risso et al. 2011). Also, during the synthesis of cDNA, the
use of random hexamer primers introduces bias at the 5′-end. This bias influences the
uniformity of the location of reads along the transcript (Hansen et al. 2010) and
induces a preferential selection of some regions over others (Filloux et al. 2014).
Even though random hexamers have this bias, they are still preferable to oligo(dT)
primers, which are highly biased toward the 3′-end of the transcript (Hansen et al. 2010). The
use of adapters during the library construction can also introduce bias. Some of this
can be removed from the data during the analysis by performing adequate quality
control and the preprocessing steps discussed in the next section (Fig. 1). However,
to adjust for other biases, the sequence data have to be normalized.
Normalization is an important step that aims to adjust for the technical variation
introduced during library construction or between sequencing runs (Garber et al.
2011). The goal is to adjust read counts so that they are comparable between genes
and samples. This is important because transcript length can cause bias, as long
transcripts produce more reads than short transcripts. Also, the depth can be different
between samples or treatments. A plethora of methods were developed for microarray
normalization, and recently some of them have been adapted for RNA-seq. Common
normalization procedures are reads per kilobase of transcript per million mapped
reads (RPKM), quantile normalization, GC normalization, and Bayesian methods.
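As a small worked example of the RPKM definition given above, the sketch below divides a read count by the transcript length in kilobases and by the number of mapped reads in millions. The gene names, counts, and lengths are hypothetical, and the function is an illustration rather than the implementation used by any particular package.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)

# Hypothetical example: two genes with equal raw counts but different lengths.
counts = {"geneA": 500, "geneB": 500}
lengths_bp = {"geneA": 1_000, "geneB": 4_000}
total_mapped = 10_000_000

rpkm_values = {g: rpkm(counts[g], lengths_bp[g], total_mapped) for g in counts}
```

With equal raw counts, the gene that is four times longer receives an RPKM value four times lower, which is the length adjustment described in the text.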
Another challenge in the analysis of the RNA-seq data is due to splicing. Splicing
removes introns and ligates exons to form mRNA; however, many transcripts in
varying abundances can be generated from a single gene by alternative splicing (AS)
of different combinations of exons (Aschoff et al. 2013). AS is regulated by the spli-
ceosome, which recognizes the consensus sequence of splice sites (5′ and 3′) and is
regulated by splice factors (one splice factor can act on several genes) (Ladomery
2014). There is a complex interaction between trans-acting splicing factors and cis-
acting regulatory elements that act as a silencer or enhancer (Aschoff et al. 2013).
Vitting-Seerup et al. (2014) described eight types of AS with exon skipping as the
most common, followed by alternative exon size as the second most common in
animals (Ladomery 2014). Because AS can affect gene expression and protein cod-
ing, a better understanding of these mechanisms is important, especially because
alterations in AS have been associated with diseases (Sterne-Weiler and Sanford
2014). Our understanding of the processes and the actual information available on
AS is still rather poor and incomplete (Rasche et al. 2014). This is largely due to the
complexity of the transcriptome because different mapping sites are compatible with
the sequence data, and it is difficult to map correctly. This leads to incorrect estimates
of isoform expression (Aschoff et al. 2013). Tools developed to estimate AS (detailed
in the AS subsection below) can be classified based on two strategies: exon based
(centered on identifying differential exon usage) and isoform based (which estimates
differential expression of isoforms by comparison of biological conditions) (Shi and
Jiang 2013; Wang and Cairns 2013). The advantage of the isoform-based approach
RNA Sequencing Applied to Livestock Production 69
is that it incorporates information from all the isoforms, as it is based on all reads
Depending on the objective of the study, there are at least six key steps involved in
the data analysis (Fig. 1):
1. Quality control
2. Preprocessing
3. Mapping to a reference genome or transcriptome
4. Assembly of a consensus of reads
5. Quantification of expression levels
6. Functional analysis
In this step, the reads are aligned against a reference genome or transcriptome, if it
is available. There are several choices of alignment software, e.g., Bowtie (Langmead
et al. 2009), Genomic Short-Read Nucleotide Alignment Program (GSNAP)
(Wu and Nacu 2010), SOAP2 (Li et al. 2009), SeqMap (Jiang and Wong 2008),
Burrows–Wheeler Alignment (BWA) (Li and Durbin 2009), and Spliced Transcripts
Alignment to a Reference (STAR) (Dobin et al. 2013) (Table 2). In general, the
alignment tools can be classified into unspliced and spliced aligners. The unspliced
aligners map the reads to the transcriptome in a contiguous alignment and use either
the seed or the Burrows–Wheeler method (e.g., Bowtie and BWA). Spliced aligners
use either an exon-first or a seed-extended method (Wu and Nacu 2010); reads are
mapped to the reference genome, and the alignments can span introns. TopHat
(Trapnell et al. 2009), STAR (Dobin et al. 2013), SpliceMap (Au et al. 2010),
MapSplice, and GSNAP (Wu and Nacu 2010) are examples of spliced aligners.
Because of these differences, the mapping strategy will determine which software
should be used.
An evaluation of the aligners Bowtie, BWA, MAQ, and SOAP2 showed that sen-
sitivity improved as the sequence depth increased from 1× to 20×; however, when
using higher depth, the positive predictive values of all aligners decreased (Liu et al.
2012). In this study, MAQ and BWA achieved the highest performance among the
aligners.
In another study that evaluated the splice aligners STAR, TopHat, GSNAP, RUM,
and MapSplice, it was found that they all exhibit desirable receiver operating
characteristic (ROC) curves at high detection threshold values, but at the lowest
detection threshold, STAR had the lowest false-positive rate. Except for GSNAP, which had
lower precision and a high number of pseudo-false positives, the aligners
performed similarly (Dobin et al. 2013). A comparison between SpliceMap and TopHat
showed that SpliceMap detects more annotated junctions than TopHat, and it also
has higher sensitivity without sacrificing specificity (Au et al. 2010). Using the
ARH-seq package, it was reported that the resulting alignments from Bowtie,
TopHat, MapSplice, and SpliceMap had low variation in the prediction of AS
(Rasche et al. 2014). Recently, RNASequel was proposed to remove false positive
junctions in order to refine splice junction detection. RNASequel had the lowest
number of incorrectly spliced alignments, and its realignment had the highest preci-
sion for novel and annotated splice junctions in comparison with STAR (Wilson and
Stein 2015).
An alignment-free approach is Sailfish, which was developed to quantify tran-
script abundance using counts of k-mers. A comparison of Sailfish with RSEM,
eXpress, and Cufflinks showed that it does not sacrifice accuracy and is a robust
approach to evaluate real and synthetic data (Patro et al. 2014).
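The idea of working with k-mer counts instead of alignments can be illustrated with a short Python sketch. It is not the Sailfish algorithm: it only counts the k-mers observed in the reads and sums, for each transcript, the counts of the k-mers that the transcript contains. The sequences, the value of k, and the function names are hypothetical, and reverse complements and k-mers shared between transcripts are ignored for simplicity.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_read_kmers(reads, k):
    """Count every k-mer observed across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def kmer_support(transcripts, read_counts, k):
    """For each transcript, sum the read k-mer counts over its distinct k-mers."""
    return {
        name: sum(read_counts[km] for km in set(kmers(seq, k)))
        for name, seq in transcripts.items()
    }

# Hypothetical toy data.
transcripts = {"tx1": "ACGTACGG", "tx2": "TTGCAAGC"}
reads = ["ACGTAC", "GTACGG", "TTGCAA"]

read_counts = count_read_kmers(reads, k=4)
support = kmer_support(transcripts, read_counts, k=4)
```

A real quantifier additionally has to resolve k-mers shared by several transcripts, which is where a statistical model becomes necessary.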
Unlike the methods mentioned above, HISAT alignment is based on
hierarchical indexing to handle the mapping of short and intermediate reads. Different
versions of HISAT were compared with STAR, GSNAP, OLego, and TopHat:
the two-pass mode (HISATx2) discovered more alignments, but the HISATx1 and
HISAT versions were faster, followed by STAR, HISATx2, STARx2, GSNAP, TopHat2,
and OLego. The two-pass approach had better sensitivity than the one-pass approach (Kim
et al. 2015).
After the alignment step, other quality metrics should be evaluated. These include
coverage uniformity along the transcript (i.e., the abundance of poly-A at 3′), the
OS X
STAR (Dobin et al. 2013); open-source, implemented in C++, http://code.google.com/p/rna-star/; detects splice junctions, finds mismatches and indels; Maximal Mappable Prefix (MMP); Spliced Transcripts Alignment to a Reference (STAR). Detection of novel splice junctions but also has the option to use annotation databases.
RNASequel (Wilson and Stein 2015); implemented in C++, https://github.com/GWW/RNASequel; postprocessing of RNA-seq data to improve the accuracy of the alignment software; uses BWA for mapping reads to the reference genome but it can use any read mapper.
Alt Event Finder (Zhou et al. 2012); open-source software, http://compbio.iupui.edu/group/6/pages/alteventfinder; generates de novo annotation for alternative splicing events; uses BFAST as the primary aligner of short reads. Uses Cufflinks to reconstruct the transcript isoforms. Its power highly depends on the sequencing depth.
HISAT (Kim et al. 2015); open-source software, http://www.ccb.jhu.edu/software/hisat/; splice aligner; hierarchical indexing (Burrows–Wheeler and FM index); uses a global FM index (to represent the genome) and small FM indexes for regions
saturation of sequencing depth, and the distribution between exons, introns, and
intergenic regions. A short reference for postmapping quality control tools can be found in
Mazzoni et al. (2015). In this step, it is useful to filter out transcripts with low
expression (<10 reads) (Rasche et al. 2014) and to visualize the data in order to get a
general feeling for it. Principal component analysis (PCA) or a heatmap plot of the
correlation of expression levels is useful for identifying outliers.
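The filtering and outlier check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the threshold of 10 reads is applied to the sum across samples (the text does not say whether it applies per sample or in total), the count table is hypothetical, and a plain Pearson correlation on raw counts stands in for the PCA or heatmap that would normally be produced with a plotting library.

```python
from math import sqrt

def filter_low_counts(counts, min_total=10):
    """Keep genes whose summed count across samples is at least min_total (assumed rule)."""
    return {gene: vals for gene, vals in counts.items() if sum(vals) >= min_total}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def sample_correlations(counts, n_samples):
    """Sample-by-sample correlation matrix computed over all retained genes."""
    columns = [[vals[j] for vals in counts.values()] for j in range(n_samples)]
    return [[pearson(columns[i], columns[j]) for j in range(n_samples)]
            for i in range(n_samples)]

# Hypothetical count table: genes x three samples; sample 3 behaves differently.
raw_counts = {
    "g1": [100, 120, 10],
    "g2": [50, 60, 200],
    "g3": [1, 0, 2],      # low expression, removed by the filter
    "g4": [300, 310, 40],
}
filtered = filter_low_counts(raw_counts)
corr = sample_correlations(filtered, n_samples=3)
```

A sample whose correlations with all other samples are clearly lower, as for the third sample here, is a candidate outlier worth inspecting.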
2.3 Assembly
During the assembly step, the previously mapped reads will be used to reconstruct
complete transcripts and identify the different isoforms that come from the same
locus. Some of the tools that are used for assembly purposes are briefly described in
Table 3. In a comparison of assembler packages, Cufflinks and rQuant were
outperformed by RSEM and IsoEM, which were shown to be comparable for paired-end
data, but for single-end data, RSEM is slightly more accurate (Li and Dewey 2011).
In another comparison, Bayesembler had higher sensitivity and precision, with a
better prediction of the number of expressed splice variants and a better length
distribution of transcripts, than Cufflinks, IsoLasso, CEM, and Traph, which have a tendency
to produce shorter transcripts (Maretty et al. 2014).
The use of multiple samples improves the prediction of transcripts; specifically,
with ISP (Iterative Shortest Path) and Cuffmerge, an increase of sensitivity and pre-
cision was observed (Tasnim et al. 2015). This study also reported that, in general,
ISP and Cuffmerge had similar sensitivity; however, ISP outperformed Cuffmerge in
precision.
Recently, two Bioconductor packages—Tablemaker and Ballgown—were devel-
oped to improve the link between assembly output formats and the differential
expression analysis packages in R. Ballgown uses the assembly structure and
expression estimates from Tablemaker and converts them into R objects. Ballgown also
performs linear model-based differential expression analysis and can identify more
differentially expressed transcripts than Cuffdiff2 (Frazee et al. 2015).
There is still a long road ahead for the robust identification of AS events. The com-
plexity of AS events lies not only in splice factors and spliceosome machinery but
also in mutations that produce aberrant splicing, which can affect some genes more
than others depending on their architecture (Sterne-Weiler and Sanford 2014).
Multiple tools that attempt to quantify AS events have been proposed (Table 4). A
good overview of the available methods to study alternative splicing is given by
Alamancos et al. (2014). spliceR is a tool specifically developed to classify tran-
scripts into different isoforms and provide information about the genomic coordi-
nates of splicing regions (Vitting-Seerup et al. 2014). The use of information from
different samples was also proposed in MITIE. This software outperformed Cufflinks
predictions and also Butterfly in terms of sensitivity (Behr et al. 2013).
Table 4 Methods used for count data and differential expression considering alternative splicing
2012) BitSeq/BitSeq; analysis of expression changes between conditions; Bayesian approach; models biological variance and accounts for the intrinsic technical variance in the data.
rSeqDiff (Shi and Jiang 2013); R package; detection of differential expression and differential splicing, gene level; χ2 distribution; hierarchical likelihood ratio test; isoform-based approach. For experiments with two biological conditions. Generates a ranking of genes being differentially spliced to compare between two biological conditions.
Cuffdiff2 (Trapnell et al. 2013); runs in Mac/Linux; reports differentially expressed genes or transcript isoforms; beta negative binomial distribution; t-like statistics; isoform-based approach.
Sailfish (Patro et al. 2014); open-source software, http://www.cs.cmu.edu/~ckingsf/software/; transcript quantification; expectation–maximization (EM) algorithm to quantify; avoids mapping reads. Uses previously annotated RNA isoforms from RNA-seq data.
ARH-seq showed the best performance (by comparing ROC curves) compared to
Splicing Index, PAC, Correlation, DASI, DEXSeq, Cuffdiff, MISO, and MATS
(Rasche et al. 2014). rSeqDiff is another package useful for the identification of
genes with moderate to high abundance. This package has good control of type I
error rates and good statistical power to detect differential expression and differential
splicing. Also, rSeqDiff can deal with complex isoform structures, outperforming
MATS and Cuffdiff2 (Shi and Jiang 2013). More recently, MetaDiff incorporated
covariates and confounding variables, as they seem to influence gene
expression. When MetaDiff was compared with Cuffdiff, DESeq, DESeq2, edgeR,
and EBSeq, it was observed that Cuffdiff, EBSeq, and DESeq had high false discovery
rates (FDRs) because they cannot adjust for confounders. On the other hand, the
Bartlett-corrected likelihood ratio test (BCLR), t-test, DESeq2, and edgeR have
good control of FDR (Jia et al. 2015).
The integration of differential expression and differential splicing in functional
analysis studies has been proposed because some genes are subjected to both. Wang
and Cairns (2013) developed the SeqGSEA package that performs a weighting and
ranking of both scores, which leads to more powerful results and a more relevant
biological meaning. On the other hand, SplicingCompass focuses on linking splic-
ing regulation and gene expression regulation (Aschoff et al. 2013). They have
reported that SplicingCompass is able to identify new differential splicing events
from unknown exons, in contrast to DEXSeq that cannot identify them because it
uses known annotations.
With respect to the sequencing depth, the use of Alt Event Finder in combination
with Scripture (used to align reads to the reference genome) was recommended
when there is a need to work with low sequencing coverage. However, with high
sequencing depth, TopHat combined with Scripture has enough power to precisely
map boundaries of known exons and can also identify novel exons (Zhou et al.
2012). Unlike the above-mentioned tools, SUPPA was designed as an alignment-free
pipeline to calculate the relative inclusion of AS events (defined as ψ), which can lead
to faster analysis of RNA-seq data. The use of ψ values from SUPPA with Sailfish
or RSEM offers results comparable to using MISO and MATS. Nevertheless, incomplete
annotation has a significant impact on SUPPA and can result in inaccurate
quantification (Alamancos et al. 2015).
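The relative inclusion value ψ can be illustrated with a short calculation from transcript abundances. The sketch below assumes the commonly used definition, in which ψ is the summed abundance of the transcripts that include the event divided by the summed abundance of all transcripts that either include or exclude it; it does not reproduce the SUPPA code, and the transcript names and abundance values are hypothetical.

```python
def psi(abundance, inclusion_ids, exclusion_ids):
    """Relative inclusion of an AS event from transcript abundances (e.g., TPM).

    Returns None when none of the transcripts of the event is expressed.
    """
    included = sum(abundance[t] for t in inclusion_ids)
    excluded = sum(abundance[t] for t in exclusion_ids)
    total = included + excluded
    if total == 0:
        return None
    return included / total

# Hypothetical abundances for one exon-skipping event.
tpm = {"tx_incl_1": 30.0, "tx_incl_2": 10.0, "tx_skip": 60.0}
event_psi = psi(tpm, ["tx_incl_1", "tx_incl_2"], ["tx_skip"])
```

Because ψ depends only on which annotated transcripts are assigned to each side of the event, an incomplete annotation directly changes the numerator or denominator, which is consistent with the sensitivity to annotation noted above.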
To evaluate gene expression in RNA-seq data, multiple tools have been developed
as R packages and are available from the Bioconductor project, but
other software is available as well (Table 5). In general, gene expression tools can be classified
as those that include the normalization procedure to account for differential expres-
sion and software in which the normalization is done separately. It was reported that
RPKM showed a strong dependence between fold change and GC content, but the
conditional quantile normalization (CQN) method eliminates it and also improves
precision without affecting accuracy. CQN is a sample-gene-specific normalization
(Hansen et al. 2012). Seyednasrollah et al. (2013) compared the trimmed mean of
M-values (TMM) and the default normalization using different methods (edgeR,
DESeq, baySeq, NOIseq, SAMseq, limma, Cuffdiff2, and EBSeq) to determine dif-
ferential expression and found that the normalization method did not have a
edgeR (McCarthy et al. 2012); Bioconductor-R; gene expression; negative binomial distribution; generalized linear model likelihood ratio test. Bayes approach accounts for gene variability. Parametric method; estimates biological variations of few biological replicates. Can be applied to two or more groups (at least one of the groups has replicated measurements). Uses count data directly.
easyRNASeq (Delhomme et al. 2012); Bioconductor-R; gene, transcript, or exon levels; depends on the package used; depends on the package used; combines packages to read the sequence reads, retrieve annotations, summarize reads, quantify, and normalize. De-multiplexing possibilities.
NOISeq (Tarazona et al. 2011); Bioconductor-R; gene, transcript, or exon levels; noise distribution; log-ratio and the absolute value of difference. Nonparametric methods; allows two group comparisons. Suitable for small number of replicates and samples. Accounts for genes with low expression levels.
NPEBseq (Bi; R package (runs in; Differentially; Estimates the prior; Nonparametric Bayes; Analysis across different conditions.
significant impact in their study. Risso et al. (2011) proposed four within-lane GC
content normalization procedures that were shown to reduce the bias and its depen-
dence on GC content. In particular, the full-quantile GC content normalization
appears to effectively remove the dependence of the proportion of DE genes on GC
content.
Aside from the technical bias, biological variability is another challenge in RNA-
seq data analysis because it can cause an overdispersion of the data due to heteroge-
neity within a population of cells or the intrinsic biology of the results (Fang et al.
2012). The overdispersion in read counts has been modeled with a negative bino-
mial distribution, beta-binomial, or two-stage Poisson distribution (Fang et al.
2012). The distribution applied generally depends on the software chosen for dif-
ferential expression analysis (Table 5).
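Overdispersion can be made tangible with a small calculation. Under a Poisson model the variance of the counts equals the mean, whereas the negative binomial model is commonly parameterized as variance = mean + dispersion × mean². The Python sketch below uses a simple method-of-moments estimate of the dispersion for one gene; this is an illustrative simplification and not the shrinkage-based estimators implemented in the packages of Table 5, and the replicate counts are hypothetical.

```python
from statistics import mean, variance

def nb_dispersion_mom(counts):
    """Method-of-moments dispersion: (sample variance - mean) / mean^2, floored at 0."""
    m = mean(counts)
    v = variance(counts)  # sample variance with n - 1 in the denominator
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / m ** 2)

# Hypothetical replicate counts for two genes with the same mean (100).
poisson_like = [98, 102, 95, 105]     # variance below the mean: no overdispersion
overdispersed = [40, 160, 90, 110]    # variance far above the mean

phi_poisson_like = nb_dispersion_mom(poisson_like)
phi_overdispersed = nb_dispersion_mom(overdispersed)
```

With only a handful of replicates such per-gene estimates are very noisy, which is one reason why the dedicated packages share information across genes.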
Besides the differences in normalization and distribution procedures, each
method is different in its design; they tend to focus on specific features of the data
and particular statistical methods. Therefore, the performance of these tools can be
different on any given data set. Also, the differentially expressed genes resulting
from the analysis depend on the software and the package version used in the analy-
sis (Glaus et al. 2012; Seyednasrollah et al. 2013; Wang and Cairns 2013). Many
methods have been developed, and their performance evaluated and compared using
data from experiments as well as simulations. Simulated data offer a simple way to
evaluate the performance of different tools, even if it does not capture all the pos-
sible scenarios and errors that exist in a real RNA-seq data set (Wilson and Stein
2015). On the other hand, as we cannot yet know the exact amount of transcripts, the
use of a gold standard dataset from qRT-PCR or the microarray quality control
consortium (MAQC) is a good benchmark to compare the various methods.
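When a gold standard or a simulated truth set is available, the comparisons reported in the following paragraphs reduce to a few standard quantities. The Python sketch below computes sensitivity, specificity, precision, and the false discovery rate for one list of called genes; the gene identifiers and the function name are hypothetical and the example is only meant to fix the definitions.

```python
def evaluate_de_calls(called, truth, universe):
    """Compare genes called DE with a truth set over the universe of tested genes."""
    called, truth, universe = set(called), set(truth), set(universe)
    tp = len(called & truth)             # true positives
    fp = len(called - truth)             # false positives
    fn = len(truth - called)             # false negatives
    tn = len(universe - called - truth)  # true negatives

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "fdr": ratio(fp, tp + fp),
    }

# Hypothetical benchmark: 10 tested genes, 4 truly DE, 4 called by a tool.
universe = [f"g{i}" for i in range(1, 11)]
truth = ["g1", "g2", "g3", "g4"]
called = ["g1", "g2", "g3", "g7"]
metrics = evaluate_de_calls(called, truth, universe)
```

Repeating the calculation over a range of significance thresholds gives the points of a ROC curve (sensitivity against 1 − specificity).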
In one comparison, Cuffdiff had poor detection of DE genes or differential splicing
(DS) when compared to SeqGSEA (Wang and Cairns 2013). DESeq had a similar
sensitivity to detect DE genes as Cuffdiff 2, but DESeq returned more false-positive
genes. The detection of false positives was lower with edgeR than with DESeq, but
Cuffdiff 2 was the method with the highest precision at gene-level resolution
(Trapnell et al. 2013).
The sequence depth seems to have an effect on the detection of DE genes using
different methods. When more sequence data were incorporated into the analysis,
Fisher’s exact test (FET), edgeR, baySeq, and DESeq included more false positives,
but NOISeq maintained a stable and low false-positive rate. edgeR, DESeq, and
baySeq detected new significant DE genes, whereas NOISeq selected new genes but
also discarded some depending on the changes of the noise distribution when a new
gene was added (Tarazona et al. 2011). On the other hand, the number of differen-
tially expressed genes increases when the number of replicate samples increases.
This is evident with edgeR, DESeq, baySeq, limma, and EBSeq, but the rate of
increase varies across methods. However, the opposite was observed with NOISeq
and Cuffdiff, which seem to be more conservative. In terms of false positives,
DESeq, limma, and Cuffdiff showed the lowest amount, whereas edgeR and EBSeq
reported a high amount of false positives (Seyednasrollah et al. 2013). A two-stage
Poisson model (TSPM) was proposed and compared to edgeR, DESeq, and baySeq.
The latter had the best performance in terms of true positive rates. TSPM performed
poorly compared to the others, but it controlled FDR to the desirable level when
there were four replicates (Kvam et al. 2012). Another Bayesian approach was
shown to be highly sensitive, at the same level of false-positive rates as edgeR and
DESeq. This Bayesian approach performs better for smaller sample sizes; it can
identify transcripts with large treatment effects but low expression levels and
reduces bias due to longer transcripts (Chung et al. 2013). Another Bayesian approach,
NPEBseq, suggested by Bi and Davuluri (2013), is a nonparametric Bayesian-based approach
designed to detect differentially expressed genes and exons across different conditions.
NPEBseq outperformed the other methods (DESeq, edgeR, baySeq, and
NOISeq) in terms of ROC curves, sensitivity, and specificity.
In this work, the authors also argued
that any parametric assumption for the prior distribution was unrealistic because
there are a large number of genes/transcripts with low read counts and a small num-
ber of genes with a significantly high number of reads.
On the basis of the wide range of methods to evaluate differential expression, we
can say that the statistical analysis has not reached maturity yet. It is important to
consider the features of the RNA-seq data before choosing a particular method/
software; it can also be useful to run the analyses with different tools and compare
the results. Seyednasrollah et al. (2013) suggested visualizing results from different
methods and using a range of quality assessment metrics as well.
Ingenuity pathway analysis (Krämer et al. 2013); networks upstream of gene-expression; up-/down-regulation pattern of the expressed genes.
checking for concordance across them (Ramanan et al. 2012). Once the functional
3 Applications in Livestock
networks were different as well; the oxidative metabolism was upregulated in mus-
found genes associated with growth and development of muscle (Liu et al. 2015).
They found genes that have different expression levels between breeds as well as
genes that are expressed exclusively in each breed. In other works comparing pig
breeds, it was found that in Jeju native pigs, there were differentially expressed
genes from the collagen family, whereas Berkshire pigs presented significantly
more expression of CD genes, and most of the genes related to immune response
were expressed indicating a better immune system (Ghosh et al. 2015). When the
expression patterns of Pietrain and Polish Landrace pig breeds were contrasted,
Ropka‐Molik et al. (2014) found that genes related to reproduction, immune
response, development, and cellular or metabolic processes were highly expressed
in Polish Landrace, whereas genes involved in negative regulation of apoptosis,
immune response, cell–cell signaling, cell growth, and migration were underex-
pressed. Variations in mRNA and miRNA expression were also observed between
sheep breeds (Dorset sheep and Small Tail Han sheep with genotype FecBBFecBB
(Han BB) and FecB+FecB+ (Han ++)) with different fecundity rates. The different
expression of mRNA between the Dorset and Han groups suggested unique gene
profiles associated with different fecundity rates. In this study, the authors pointed out the
importance of studying miRNA levels because they provide more specificity for comparisons
between traits than just a comparison of mRNA (Miao and Qin 2015).
Garber et al. (2011) pointed out that by providing the sequence of expressed transcripts,
RNA-seq encodes information about allelic variation and RNA processing,
so reconstruction methods should be adapted to account for this variability and report
it. The variants and the number of copies of alleles in RNA-seq data can be found
using samtools mpileup (Li 2011) or Rsamtools pileup (Morgan et al. 2016), and the
total number of reads mapped to a gene can be measured using HTSeq (Anders et al.
2014). The detected mutations can change the primary structure of proteins or affect
the expression of the genes. Most of the variants discovered by genome-wide association
studies (GWAS) are located in noncoding regions. Therefore, the mutations that
are not changing the protein but regulating the expression of genes seem to be the
causal variants more often than not (Ardlie et al. 2015). For instance, Karim et al.
(2011) reported mutations in the intergenic region near the promoter of the PLAG1
gene, which influences expression of the gene and consequently affects the stature in
cattle. The causal mutations in noncoding regions can be detected using RNA-seq
data if the variants near the genes are known beforehand. Along with the variation of
gene expression in different individuals, the expression level of genes depends on the
tissue used in the study and also varies across time. Therefore, it is important to take
the RNA-seq samples from the right tissue (and time).
The variants that are associated with the expression level of genes in different
individuals are called expression quantitative trait loci (eQTL). A mutation can
change the expression of a gene on the same chromosome (cis eQTL) or gene(s)
elsewhere in the genome (trans eQTL). The cis-acting QTL can be identified using
microarrays or RNA-seq with a traditional eQTL mapping method by comparing
the expression level of genes across individuals. However, in eQTL mapping with
RNA-seq data, the cis eQTL can be found by direct assessment of gene expression
with allele-specific expression (ASE). In the ASE approach, for an individual which
within individuals instead of comparing total mRNA abundance from both alleles
across individuals as in traditional eQTL mapping. Moreover, the environmental
factors affecting gene expression across animals as well as the effect of trans-acting
eQTL are eliminated in ASE assessments because of the use of allele counts of one
individual at a time (Pastinen 2010). If the parental origin of the alleles in the het-
erozygote variants is known, the imprinted genes are also detectable with the ASE
approach. However, to discriminate the effect of the allele itself from the effect of
parent of origin of the allele on expression, the allelic imbalance in at least two
individuals with different paternally-inherited alleles should be measured.
In order to find the eQTL with the traditional approach, the regression of normal-
ized gene counts on variant genotypes can be calculated for each gene. Thus, it is
similar to GWAS for complex traits but uses gene expression instead of phenotypes
and includes variants within or near to the gene. For testing the allelic imbalance
expression, the null hypothesis is that the alleles are equally expressed. Therefore, a
χ2 test can be used to compare observed allele counts with the expected counts
according to the read depth.
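The allelic imbalance test described above can be written out in a few lines. The Python sketch below compares the observed reference and alternative allele counts at one heterozygous site with the counts expected under equal expression (half of the read depth each) and converts the χ2 statistic, which has one degree of freedom, into a p-value. The function name and the example counts are hypothetical, and no correction for reference bias or multiple testing is included.

```python
from math import erfc, sqrt

def allelic_imbalance_chi2(ref_count, alt_count):
    """Chi-square test (1 df) of H0: both alleles are equally expressed."""
    depth = ref_count + alt_count
    if depth == 0:
        raise ValueError("no reads cover this site")
    expected = depth / 2  # expected count per allele under H0
    statistic = ((ref_count - expected) ** 2 + (alt_count - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical heterozygous sites, each covered by 100 reads.
stat_imbalanced, p_imbalanced = allelic_imbalance_chi2(70, 30)
stat_balanced, p_balanced = allelic_imbalance_chi2(52, 48)
```

At low read depth the χ2 approximation becomes unreliable, so an exact binomial test may be preferable for sparsely covered sites.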
Although ASE has advantages over traditional eQTL mapping, there are a few
challenges in measuring allelic imbalance in gene expression. The main difficulties
are sequencing errors in RNA-seq data, the relatively low read coverage in
heterozygous regions, and allele counts biased toward the alleles represented in the
reference genome (Pastinen 2010). Sequencing errors can be removed to a great
extent by filtering out low-quality sequences and by improving the sequencing
technology to produce more accurate reads. The low coverage problem can be
addressed by increasing the overall sequencing depth or by enriching the
transcripts from particular regions of interest (Heap et al. 2010).
Mapping the reads to a single reference genome containing the most frequent
alleles at the polymorphic sites can bias the results in favor of reads carrying
the reference alleles, because these reads have fewer mismatches and consequently
better mappability (Satya et al. 2012). There are three main ways to deal with
reference bias: (1) mapping the reads to the parental genomes (Degner et al.
2009), (2) excluding the heterozygous sites from the reference genome and allowing
more mismatches in mapping (Stevenson et al. 2013), and (3) enhancing the reference
genome by including the alternative alleles at known polymorphic sites
(Satya et al. 2012). If maternal and paternal reference genomes can be constructed,
the systematic bias should be eliminated. Satya et al. (2012) reported that
including alternate alleles in the reference genome, rather than excluding
heterozygous sites, reduced the reference bias and increased the mappability of the reads.
Identifying the mutations that underlie complex trait variation through changes in
gene expression, together with functional studies based on differentially
expressed genes, provides important information for understanding the biological
relationships between genes, phenotypes, and the environment. Elucidating the
physiological mechanisms involved in particular traits can
help to develop more efficient production programs.
RNA Sequencing Applied to Livestock Production 85
Project (DP130100542), the Next-Generation BioGreen 21 Program (no. PJ01134906), the Rural
Development Administration, the Republic of Korea, and the Cooperative Research Program for
Agriculture Science and Technology Development (PJ006405), RDA, Korea.
Data and script available in the supplementary online material for the book from the
publisher’s website (adapted from Gondro 2015).
# STEP 2: PREPROCESSING
# Note: This step is performed in the command line (cmd) and java must be
# installed on the system. It is important to change the directory to your
# folder (the same place as your fastq file and the adapters).
# In order to simplify the trim you can also put the executable file
# STEP 4: ASSEMBLY
library(Rsamtools)
library(GenomicFeatures)
library(GenomicAlignments)
library(rtracklayer)
sum(rcounts[index])/sum(rcounts)
# Save the count data
write.table(rcounts, "ReadCounts.txt", quote = FALSE, sep = "\t", col.names = FALSE)
library(edgeR)
# Read in the data
rcounts <- read.table("RNAdat.txt", header = TRUE, sep = "\t")
contrast <- read.table("RNAcontrast.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Pathway analysis
library(Category)
keggDE <- unique(toTable(bovineENTREZID[as.vector(deprobes$AFFY)])$gene_id)
References
Alamancos GP, Agirre E, Eyras E (2014) Methods to study splicing from high-throughput RNA
Sequencing data. In: Spliceosomal pre-mRNA splicing: methods and protocols. Humana Press,
New York, pp 357–397
Alamancos GP, Pagès A, Trincado JL, Bellora N, Eyras E (2015) Leveraging transcript quantifica-
tion for fast computation of alternative splicing profiles. RNA 21:1521–1531
Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome Biol
11:R106
Anders S, Reyes A, Huber W (2012) Detecting differential usage of exons from RNA-seq data.
Genome Res 22:2008–2017
Anders S, Pyl PT, Huber W (2014) HTSeq–A Python framework to work with high-throughput
sequencing data. Bioinformatics btu638
Ardlie KG, Deluca DS, Segrè AV, Sullivan TJ, Young TR, Gelfand ET, Trowbridge CA, Maller JB,
Tukiainen T, Lek M (2015) The Genotype-Tissue Expression (GTEx) pilot analysis: multitis-
sue gene regulation in humans. Science 348:648–660
Aschoff M, Hotz-Wagenblatt A, Glatting KH, Fischer M, Eils R, König R (2013) SplicingCompass:
differential splicing detection using RNA-Seq data. Bioinformatics btt101
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K,
Dwight SS, Eppig JT (2000) Gene Ontology: tool for the unification of biology. Nat Genet
25:25–29
Au KF, Jiang H, Lin L, Xing Y, Wong WH (2010) Detection of splice junctions from paired-end
RNA-seq data by SpliceMap. Nucleic Acids Res 38:4570–4578
Baldwin RL, Li RW, Li CJ, Thomson JM, Bequette BJ (2012) Characterization of the longissimus
lumborum transcriptome response to adding propionate to the diet of growing Angus beef
steers. Physiol Genomics 44(10):543–550
Behr J, Kahles A, Zhong Y, Sreedharan VT, Drewe P, Rätsch G (2013) MITIE: simultaneous RNA-
Seq-based transcript identification and quantification in multiple samples. Bioinformatics
29:2529–2538
Bi Y, Davuluri RV (2013) NPEBseq: nonparametric empirical bayesian-based procedure for dif-
ferential expression analysis of RNA-seq data. BMC Bioinformatics 14:262
Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence
data. Bioinformatics btu170
Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT (2013) Scotty: a web tool for designing
RNA-Seq experiments to measure differential gene expression. Bioinformatics 29:656–657
Cai G, Li H, Lu Y, Huang X, Lee J, Müller P, Ji Y, Liang S (2012) Accuracy of RNA-Seq and its
dependence on sequencing depth. BMC Bioinformatics 13:S5
Cánovas A, Rincón G, Bevilacqua C, Islas-Trejo A, Brenaut P, Hovey RC, Boutinaud M,
Morgenthaler C, VanKlompenberg MK, Martin P (2014) Comparison of five different RNA
sources to examine the lactating bovine mammary gland transcriptome using RNA-Sequencing.
Sci Rep 4
Chen G, Wang C, Shi T (2011) Overview of available methods for diverse RNA-Seq data analyses.
Sci China Life Sci 54:1121–1128
Chen D, Li W, Du M, Wu M, Cao B (2015) Sequencing and characterization of divergent marbling
levels in the beef cattle (Longissimus dorsi muscle) transcriptome. Asian-Australas J Anim Sci
28:158
Chung LM, Ferguson JP, Zheng W, Qian F, Bruno V, Montgomery RR, Zhao H (2013) Differential
expression analysis for paired RNA-seq data. BMC Bioinformatics 14:110
Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G,
Jassal B (2010) Reactome: a database of reactions, pathways and biological processes. Nucleic
Acids Res gkq1018
Cui X, Hou Y, Yang S, Xie Y, Zhang S, Zhang Y, Zhang Q, Lu X, Liu GE, Sun D (2014)
Transcriptional profiling of mammary gland in Holstein cows with extremely different milk
protein and fat percentage using RNA sequencing. BMC Genomics 15:226
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, Pritchard JK (2009) Effect of
Karisa BK, Thomson J, Wang Z, Stothard P, Moore SS, Plastow GS (2013) Candidate genes and
single nucleotide polymorphisms associated with variation in residual feed intake in beef cat-
tle. J Anim Sci 91:3502–3513
Katz Y, Wang ET, Airoldi EM, Burge CB (2010) Analysis and design of RNA sequencing experi-
ments for identifying isoform regulation. Nat Methods 7:1009–1015
Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory require-
ments. Nat Methods 12:357–360
Korpelainen E, Tuimala J, Somervuo P, Huss M, Wong G (2014) RNA-seq data analysis: a practi-
cal approach. CRC Press, Boca Raton
Krämer A, Green J, Pollard J, Tugendreich S (2013) Causal analysis approaches in ingenuity path-
way analysis (IPA). Bioinformatics btt703
Kvam VM, Liu P, Si Y (2012) A comparison of statistical methods for detecting differentially
expressed genes from RNA-seq data. Am J Bot 99:248–256
Ladomery MR (2014) Targeting alternative splicing in human genetic disease. RNA Nanotechnol
331
Laiho A, Elo LL (2014) A note on an exon-based strategy to identify differentially expressed genes
in RNA-Seq experiments. PLoS One 9, e115964
Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10:R25
Lee HJ, Jang M, Kim H, Kwak W, Park WC, Hwang JY, Lee CK, Jang GW, Park MN, Kim HC
(2013) Comparative transcriptome analysis of adipose tissues reveals that ECM-receptor inter-
action is involved in the depot-specific adipogenesis in cattle. PLoS One 8, e66267
Lee HJ, Park HS, Kim W, Yoon D, Seo S (2014) Comparison of metabolic network between
muscle and intramuscular adipose tissues in Hanwoo beef cattle using a systems biology
approach. Int J Genomics 2014:679437
Li H (2011) A statistical framework for SNP calling, mutation discovery, association mapping and
population genetical parameter estimation from sequencing data. Bioinformatics
27:2987–2993
Li B, Dewey CN (2011) RSEM: accurate transcript quantification from RNA-Seq data with or
without a reference genome. BMC Bioinformatics 12:323
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform.
Bioinformatics 25:1754–1760
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J (2009) SOAP2: an improved ultrafast
tool for short read alignment. Bioinformatics 25:1966–1967
Liu Q, Chen C, Shen E, Zhao F, Sun Z, Wu J (2012) Detection, annotation and visualization of
alternative splicing from RNA-Seq data with SplicingViewer. Genomics 99:178–182
Liu Y, Zhou J, White KP (2014) RNA-seq differential expression studies: more sequence or more
replication? Bioinformatics 30:301–304
Liu GF, Cheng HJ, You W, Song EL, Liu XM, Wan FC (2015) Transcriptome profiling of muscle
by RNA-Seq reveals significant differences in digital gene expression profiling between Angus
and Luxi cattle. Anim Prod Sci 55:1172–1178
Maretty L, Sibbesen JA, Krogh A (2014) Bayesian transcriptome assembly. Genome Biol 15:501
Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682
Mazzoni G, Kogelman LJA, Suravajhala P, Kadarmideen HN (2015) Systems genetics of complex
diseases using RNA-sequencing methods. Int J Biosci Biochem Bioinformatics 5:264
McCabe M, Waters S, Morris D, Kenny D, Lynn D, Creevey C (2012) RNA-seq analysis of dif-
ferential gene expression in liver from lactating dairy cows divergent in negative energy bal-
ance. BMC Genomics 13:193
McCarthy DJ, Chen Y, Smyth GK (2012) Differential expression analysis of multifactor RNA-Seq
experiments with respect to biological variation. Nucleic Acids Res gks042
McIntyre LM, Lopiano KK, Morse AM, Amin V, Oberg AL, Young LJ, Nuzhdin SV (2011) RNA-
seq: technical variability and sampling. BMC Genomics 12:293
Melé M, Ferreira PG, Reverter F, DeLuca DS, Monlong J, Sammeth M, Young TR, Goldmann JM,
Pervouchine DD, Sullivan TJ (2015) The human transcriptome across tissues and individuals.
Science 348:660–665
improved phylogenetic trees, orthologs and collaboration with the Gene Ontology Consortium.
Nucleic Acids Res 38:D204–D210
Miao X, Qin QLX (2015) Genome-wide transcriptome analysis of mRNAs and microRNAs in
Dorset and Small Tail Han sheep to explore the regulation of fecundity. Mol Cell Endocrinol
402:32–42
Morgan M, Pagès H, Obenchain V, Hayden N (2016) Rsamtools: binary alignment (BAM),
FASTA, variant call (BCF), and tabix file import
Morin RD, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh TJ, McDonald H, Varhol R,
Jones SJ, Marra MA (2008) Profiling the HeLa S3 transcriptome using randomly primed
cDNA and massively parallel short-read sequencing. Biotechniques 45:81
Pastinen T (2010) Genome-wide allele-specific analysis: insights into regulatory variation. Nat
Rev Genet 11:533–538
Patro R, Mount SM, Kingsford C (2014) Sailfish enables alignment-free isoform quantification
from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32:462–464
Ramanan VK, Shen L, Moore JH, Saykin AJ (2012) Pathway analysis of genomic data: concepts,
methods, and prospects for future development. Trends Genet 28:323–332
Ramayo-Caldas Y, Mach N, Esteve-Codina A, Corominas J, Castelló A, Ballester M, Estellé J,
Ibáñez-Escriche N, Fernández AI, Pérez-Enciso M (2012) Liver transcriptome profile in
pigs with extreme phenotypes of intramuscular fatty acid composition. BMC Genomics
13:547
Ramsköld D, Kavak E, Sandberg R (2012) How to Analyze Gene Expression Using RNA-
Sequencing Data. In: Next Generation Microarray Bioinformatics: Methods and Protocols
(eds. by Wang J, Tan CA and Tian T), Humana Press, Totowa, NJ. Springer, pp 259–274
Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND, Betel D (2013)
Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data.
Genome Biol 14:R95
Rasche A, Lienhard M, Yaspo M-L, Lehrach H, Herwig R (2014) ARH-seq: identification of dif-
ferential splicing in RNA-seq data. Nucleic Acids Res 42:e110
Risso D, Schwartz K, Sherlock G, Dudoit S (2011) GC-content normalization for RNA-Seq data.
BMC Bioinformatics 12:480
Ropka‐Molik K, Żukowski K, Eckert R, Gurgul A, Piórkowska K, Oczkowicz M (2014)
Comprehensive analysis of the whole transcriptomes from two different pig breeds using RNA‐
Seq method. Anim Genet 45:674–684
Satya RV, Zavaljevski N, Reifman J (2012) A new strategy to reduce allelic bias in RNA-Seq read-
mapping. Nucleic Acids Res gks425
Seyednasrollah F, Laiho A, Elo LL (2013) Comparison of software packages for detecting differ-
ential expression in RNA-seq studies. Briefings Bioinf bbt086
Shen S, Park JW, Huang J, Dittmar KA, Lu Z-x, Zhou Q, Carstens RP, Xing Y (2012) MATS: a
Bayesian framework for flexible detection of differential alternative splicing from RNA-Seq
data. Nucleic Acids Res gkr1291
Shi Y, Jiang H (2013) rSeqDiff: detecting differential isoform expression from RNA-Seq data
using hierarchical likelihood ratio test. http://dx.doi.org/10.1371/journal.pone.0079448
Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP (2014) Sequencing depth and coverage: key
considerations in genomic analyses. Nat Rev Genet 15:121–132
Sterne-Weiler T, Sanford JR (2014) Exon identity crisis: disease-causing mutations that disrupt the
splicing code. Genome Biol 15:201
Stevenson KR, Coolon JD, Wittkopp PJ (2013) Sources of bias in measures of allele-specific
expression derived from RNA-seq data aligned to a single reference genome. BMC Genomics
14:536
Tarazona S, García-Alcalde F, Dopazo J, Ferrer A, Conesa A (2011) Differential expression in
RNA-seq: a matter of depth. Genome Res 21:2213–2223
Tasnim M, Ma S, Yang EW, Jiang T, Li W (2015) Accurate inference of isoforms from multiple
sample RNA-Seq data. BMC Genomics 16:S15
Tizioto PC, Coutinho LL, Decker JE, Schnabel RD, Rosa KO, Oliveira PS, Souza MM, Mourão
GB, Tullio RR, Chaves AS (2015) Global liver gene expression differences in Nelore steers
with divergent residual feed intake phenotypes. BMC Genomics 16:242
Trapnell C, Pachter L, Salzberg SL (2009) TopHat: discovering splice junctions with RNA-Seq.
Bioinformatics 25:1105–1111
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ,
Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated
transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511–515
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL,
Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments
with TopHat and Cufflinks. Nat Protoc 7:562–578
Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L (2013) Differential analy-
sis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31:46–53
van de Wiel MA, Neerincx M, Buffart TE, Sie D, Verheul HMW (2014) ShrinkBayes: a versatile
R-package for analysis of count-based sequencing data in complex study designs. BMC
Bioinformatics 15:116
Vitting-Seerup K, Porse BT, Sandelin A, Waage J (2014) spliceR: an R package for classification
of alternative splicing and prediction of coding potential from RNA-seq data. BMC
Bioinformatics 15:81
Wang X, Cairns MJ (2013) Gene set enrichment analysis of RNA-Seq data: integrating differential
expression and splicing. BMC Bioinformatics 14:S16
Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, Mieczkowski P, Grimm SA,
Perou CM (2010) MapSplice: accurate mapping of RNA-seq reads for splice junction discov-
ery. Nucleic Acids Res 38, e178
Wang Y, Ghaffari N, Johnson CD, Braga-Neto UM, Wang H, Chen R, Zhou H (2011) Evaluation
of the coverage and depth of transcriptome by RNA-Seq in chickens. BMC Bioinformatics
12:S5
Warde-Farley D, Donaldson SL, Comes O, Zuberi K, Badrawi R, Chao P, Franz M, Grouios C,
Kazi F, Lopes CT (2010) The GeneMANIA prediction server: biological network integration
for gene prioritization and predicting gene function. Nucleic Acids Res 38:W214–W220
Wesolowski S, Birtwistle MR, Rempala GA (2013) A comparison of methods for RNA-Seq dif-
ferential expression analysis and a new empirical Bayes approach. Biosensors 3:238–258
Wilson GW, Stein LD (2015) RNASequel: accurate and repeat tolerant realignment of RNA-seq
reads. Nucleic Acids Res gkv594
Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short
reads. Bioinformatics 26:873–881
Ye J, Fang L, Zheng H, Zhang Y, Chen J, Zhang Z, Wang J, Li S, Li R, Bolund L (2006) WEGO:
a web tool for plotting GO annotations. Nucleic Acids Res 34:W293–W297
Zhang ZH, Jhaveri DJ, Marshall VM, Bauer DC, Edson J, Narayanan RK, Robinson GJ, Lundberg
AE, Bartlett PF, Wray NR (2014) A comparative study of techniques for differential expression
analysis on RNA-Seq data. PLoS One 9, e103207
Zhou YH, Xia K, Wright FA (2011) A powerful and flexible approach to the analysis of RNA
sequence count data. Bioinformatics 27:2672–2678
Zhou A, Breese MR, Hao Y, Edenberg HJ, Li L, Skaar TC, Liu Y (2012) Alt Event Finder: a tool
for extracting alternative splicing events from RNA-seq data. BMC Genomics 13:S10
Abstract
In this chapter, we provide a brief introduction to graphical models, with an
emphasis on Bayesian networks, and discuss some of their applications in genetics
and genomics studies with agricultural and livestock species. First, some key
definitions regarding stochastic graphical models are provided, as well as basic
principles of inference related to graphical structure and model parameters. Next
is a discussion of some examples of applications, which include prediction of
complex traits using genomic information or other correlated traits as well as the
investigation of the flow of information from DNA polymorphisms to endpoint
phenotypes, including intermediate phenotypes such as gene expression. A first
example with prediction refers to the forecasting of total egg production in quails
using early expressed traits (such as weekly body weight, partial egg production,
and egg quality traits) as explanatory variables to support decision making (e.g.,
earlier culling decisions) in production/breeding systems. An additional example
uses genomic information for the estimation of genetic merit of selection candi-
dates for genetic improvement of economically important traits. An example
with causal inference deals with the network underlying carcass fat deposition
and muscularity in pigs by jointly modeling phenotypic, genotypic, and tran-
scriptomic data. Some additional applications of Bayesian networks and other
graphical model techniques are highlighted as well, including multitrait quantita-
tive trait loci (QTL) analysis and structural equation models with latent variables.
It is shown that graphical models such as Bayesian networks offer a powerful and
insightful approach both for prediction and for causal inference, with a myriad of
applications in the areas of genetics and genomics, and the study of complex
phenotypic traits in agriculture.
1 Introduction
Networks can be used to represent biological systems that are composed of inter-
connected components. The components, or subunits, can be molecular elements of
a chemical reaction, cells of a tissue, organs and organ systems, individuals
within populations, species within ecosystems, etc. Links between subunits of a
network may represent different kinds of interactions, such as physical contact,
affinities, and causal effects, among others.
In genetics, networks are used, for example, to represent gene regulation systems,
coexpression, and epistatic interactions. A network modeling approach commonly
used in genetic analyses is the correlation network. In such cases, gene expression
data from microarray or RNA-seq assays are used to compute marginal pairwise cor-
relation coefficients, and results are depicted in a graph in which nodes (vertices) rep-
resent the genes or transcripts, and edges represent significant associations determined
by an arbitrary significance level or minimum absolute value. Alternatively, such net-
works can be constructed using partial correlations. Correlation networks derived from
gene expression data are then examined in terms of topology features such as connect-
edness, degree, betweenness, etc. In addition, networks derived from different settings,
such as environmental conditions or genetic backgrounds, can be compared in terms of
their topological structure or specific components of the network.
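As a rough illustration of the correlation network idea, the following Python sketch computes marginal pairwise Pearson correlations, keeps the pairs whose absolute correlation exceeds a chosen minimum, and reports the degree of each node. The gene names, expression values, threshold, and function names are all hypothetical, and a real analysis would typically also apply a significance test.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_network(expr, min_abs_r):
    """Undirected edges between genes whose |r| is at least min_abs_r."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= min_abs_r:
            edges.append((g1, g2, r))
    return edges

def degree(edges):
    """Number of edges attached to each node that has at least one edge."""
    deg = {}
    for g1, g2, _ in edges:
        deg[g1] = deg.get(g1, 0) + 1
        deg[g2] = deg.get(g2, 0) + 1
    return deg

# Hypothetical expression values of four genes measured in five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.1, 2.9, 2.2, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = correlation_network(expr, min_abs_r=0.9)
deg = degree(edges)
print(edges, deg)
```

Because the edges are stored as unordered pairs, the resulting graph carries no information on direction, which is the limitation discussed next.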
A correlation network is an example of an undirected graph, as its edges are
symmetrical with no direction from one node to another. As such, it does not reveal or
express, for example, causal relationships between its nodes, i.e., if the expression
of a specific gene represses or activates the expression of another gene connected to
it. However, some network modeling approaches do explore potential direction of
edges connecting nodes, trying to more parsimoniously represent conditional inde-
pendences encapsulated in the joint distribution of the set of variables, or even as an
attempt to uncover causal links between those variables (Shipley 2002; Pearl 2009).
Such methods belong to a set of data analysis tools termed graphical models,
which include techniques such as path analysis, Bayesian networks (BNs), and
structural equation models (Pearl 2009). Interestingly enough, although graphical
modeling methods trace back to the pioneering work of Sewall Wright on path analysis
in genetics (Wright 1921), such methods have been further developed and applied
more extensively in other areas, including social sciences and econometrics
(Haavelmo 1943).
More recently, however, graphical modeling has caught the attention of geneti-
cists and has been employed in different applications of genomics and genetics
analysis of complex traits (e.g., Rosa et al. 2011; Sinoquet and Mourad 2014).
Applications of Graphical Models in Quantitative Genetics and Genomics 97
Earlier applications of Bayesian networks (BNs) to genomic data were in the analy-
2 Bayesian Networks
Fig. 1). It is interesting to note that a pedigree graph, commonly used in quantitative
genetics and animal breeding, is an example of a DAG. Nonetheless, DAGs can
have more general structures, including more than two parents for a given node,
e.g., the node V5 in Fig. 1.
A DAG describes local features of the joint probability distribution of variables,
allowing a useful factorization of it. Given a set of variables {V1, V2, …, Vp} and a
DAG D that is compatible with their joint probability distribution Pr(V1, V2, …, Vp),
the following factorization can be performed (Pearl 2009):
$$\Pr(V_1, V_2, \ldots, V_p) = \prod_{i=1}^{p} \Pr(V_i \mid Pa_i), \tag{1}$$
in which Pai are the parents of Vi in D, i.e., all variables with arrows directed toward
Vi.
Hence, using this factorization strategy, the DAG in Fig. 1 can be represented
algebraically as
$$\Pr(V_1, V_2, V_3, V_4, V_5) = \Pr(V_1)\,\Pr(V_2)\,\Pr(V_3)\,\Pr(V_4 \mid V_1, V_2)\,\Pr(V_5 \mid V_1, V_3, V_4).$$
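To make the factorization concrete, the sketch below evaluates the joint probability of the five variables as a product of local conditional probabilities. It assumes binary variables and uses a made-up rule for the conditional probabilities (`p_one`); only the parent sets are taken from the factorization above, everything else is hypothetical.

```python
from itertools import product

# Parent sets implied by the factorization of the DAG in Fig. 1.
parents = {
    "V1": (),
    "V2": (),
    "V3": (),
    "V4": ("V1", "V2"),
    "V5": ("V1", "V3", "V4"),
}

def p_one(node, parent_values):
    """Hypothetical Pr(node = 1 | parents): 0.2 plus 0.2 per parent equal to 1."""
    return 0.2 + 0.2 * sum(parent_values)

def joint_probability(assignment):
    """Product over nodes of Pr(V_i | Pa_i) for a full 0/1 assignment."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_one(node, [assignment[p] for p in pa])
        prob *= p1 if assignment[node] == 1 else 1.0 - p1
    return prob

nodes = list(parents)
# Summing the factorized joint over all 2^5 configurations should give 1.
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in product([0, 1], repeat=len(nodes))
)
all_zero = joint_probability({node: 0 for node in nodes})
print(total, all_zero)
```

The check that the probabilities sum to one is a quick way to confirm that a set of local distributions defines a valid joint distribution.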
As indicated above, BNs offer an interesting tool for a more parsimonious represen-
tation of the joint distribution of a set of variables. In addition, they are also useful
for prediction purposes through the detection of the MB of the target variable.
100 G.J.M. Rosa et al.
Bayesian networks can be used to more parsimoniously represent the joint distribu-
tion of a set of variables. For example, let y be a k-dimensional random vector from
a multivariate normal distribution with covariance matrix Σ, i.e., Var[y] = Σ. The
matrix Σ then has k variance parameters (diagonal elements) and k(k − 1)/2 covariances
(off-diagonal elements). The covariances between nodes (variables) in a BN
are described by arrows and paths connecting them. A fully connected BN structure
involving k nodes has k(k − 1)/2 arrows, and it is statistically equivalent to an
unstructured covariance matrix Σ as described above. However, the structure-
learning step of a BN implementation will search for conditional independencies
(d-separations) between its nodes, and each detected d-separation will result in the
removal of the arrow connecting the corresponding pair of nodes. As such, depending
on how many conditional independencies are detected in a BN, the overall
covariance structure involving its nodes could be represented by a much smaller
number of parameters. To illustrate an application of BN in this context, we discuss
here a study that aimed to investigate the linkage disequilibrium (LD) structure
among molecular markers, which was presented by Morota et al. (2012).
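The saving in parameters can be illustrated with a small count. The sketch assumes a linear Gaussian BN, in which each node contributes one residual variance and each arrow one regression coefficient (means are ignored); the function names and the example values of k and of the number of arrows are hypothetical.

```python
def unstructured_covariance_params(k):
    """k variances plus k(k - 1)/2 covariances."""
    return k + k * (k - 1) // 2

def gaussian_bn_params(k, n_arrows):
    """k residual variances plus one coefficient per arrow (assumed linear Gaussian BN)."""
    max_arrows = k * (k - 1) // 2
    if not 0 <= n_arrows <= max_arrows:
        raise ValueError("a DAG on k nodes has between 0 and k(k - 1)/2 arrows")
    return k + n_arrows

k = 10
full = unstructured_covariance_params(k)       # unstructured covariance matrix
saturated = gaussian_bn_params(k, 45)          # fully connected BN, 45 arrows
sparse = gaussian_bn_params(k, 12)             # hypothetical sparse BN, 12 arrows
print(full, saturated, sparse)
```

With these assumptions, the fully connected network has as many parameters as the unstructured covariance matrix, and every removed arrow saves one parameter.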
Linkage disequilibrium is a nonrandom association of alleles at different loci
within a population. Genome-enabled prediction of complex traits and association
mapping techniques, such as genomewide association studies (GWAS), rely on LD
between molecular markers and quantitative trait loci (QTL) or causative mutations.
Linkage disequilibrium is also exploited when building haplotype blocks in genomic
studies. Traditionally, LD patterns are assessed using metrics such as D′ and r²,
which are computed for each pair of markers or genetic loci. However, such
approaches are not able to fully exploit the complexities of the LD structure involv-
ing multiple loci.
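For reference, the pairwise metrics mentioned above can be computed from haplotype and allele frequencies as in the following sketch. It uses the standard definitions of D, D′ (reported here in absolute value) and r² for two biallelic loci; the function name and the frequencies in the example are hypothetical.

```python
def ld_metrics(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a, p_b: frequencies of allele A at locus 1 and allele B at locus 2
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # The maximum attainable |D| depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r2

# Hypothetical frequencies for one pair of SNPs.
d, d_prime, r2 = ld_metrics(p_ab=0.4, p_a=0.5, p_b=0.5)
print(d, d_prime, r2)
```

Each call describes one pair of loci only, which is why a set of such values cannot express conditional relationships among three or more loci.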
Morota et al. (2012) used genotypic data on 36,778 SNP markers after quality
control and missing data imputation, together with predicted transmitting abilities
(PTA) for milk protein yield from 4898 progeny-tested Holstein bulls. They used a
BN as a representation of choice for the LD structure, with the view that loci associ-
ate and interact together as a system or network, as opposed to in a simple pairwise
manner.
Markers used for the BN analysis were preselected based on results of a genome-
enabled prediction model for milk protein yield using a Bayesian LASSO approach.
Three strategies of subset selection were used, with the 15 or 30 top SNPs in terms
of raw absolute posterior mean of the allele substitution effect, standardized absolute
posterior mean, or locus-specific additive genetic variance. Two different algorithms,
the Tabu search (a score-based local search algorithm) and the incremental
association Markov blanket (IAMB; a constraint-based algorithm) coupled with
the chi-square test, were used for learning the structure of the BN, and the results
were compared with the reference r² metric represented as an LD heat map.
Among all combinations of marker subsets and BN algorithms used, one example
is depicted in Fig. 2 to illustrate the results. The two BNs presented were constructed
with the Tabu and IAMB algorithms, using the 15 SNPs with the largest
absolute posterior means. Three SNPs on chromosome 14 (A, B, and L) are shown
as gray-filled nodes; these SNPs presented stronger pairwise LD than the remaining
ones. The Tabu search identified one independent network (A, B, C, D, E, F, G, I, K,
L, M, O), with the remaining three SNPs (H, J, N) not associated with any other
group, suggesting their independent segregation. By contrast, the IAMB search
revealed two independent networks (A, B, C, D, E, F, G, H, I, K, L, M, O) and (J,
N). Although the networks constructed by the two algorithms were not the same,
they presented many features in common, including the main cluster of 12 SNPs,
and many consistent connections, such as the trio (A, B, L), the pair (M, E), and the
path connecting the sequence (D, C, I, O, and G, K). In total, the Tabu network had
12 edges and the IAMB had 17, with 10 of the edges common to both networks.
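The kind of comparison described in this paragraph (independent groups of nodes and edges shared by two networks) can be sketched as follows. The node labels and edge lists are toy values chosen for illustration and are not the networks of Fig. 2; the function names are hypothetical, and edge direction is ignored.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, each as a sorted list."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

def shared_edges(edges_1, edges_2):
    """Edges present in both networks, ignoring direction."""
    return {frozenset(e) for e in edges_1} & {frozenset(e) for e in edges_2}

# Toy example with five nodes (not the SNPs of Fig. 2).
nodes = ["A", "B", "C", "D", "E"]
net_1 = [("A", "B"), ("B", "C")]
net_2 = [("B", "A"), ("C", "B"), ("D", "E")]
groups_1 = components(nodes, net_1)
groups_2 = components(nodes, net_2)
common = shared_edges(net_1, net_2)
print(groups_1, groups_2, len(common))
```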
Overall, the BN captured several genetic markers associated as clusters, imply-
ing that these markers are interrelated in a complicated manner. Further, the BN
detected conditionally dependent markers. The results confirm that LD relationships
are of a multivariate nature, and hence r² gives only an incomplete description
of LD. The observed discrepancies between the Tabu and IAMB
networks are in agreement with the study of Karacaören et al. (2011), who carried
out a similar exercise by fitting a BN to significant markers obtained from the
QTL-MAS 2010 data set. In conclusion, Morota et al. (2012) indicated that a BN
contains LD information that is additional to that conveyed by pairwise measures,
such as conditional independencies among SNPs in the networks. Such information,
as they pointed out, can be useful, for example, for the selection of tag SNPs to be
used in marker-assisted selection strategies.
Fig. 2 Examples of networks inferred using genotypes of 15 SNPs with the largest absolute
posterior means for milk protein yield. Gray-filled nodes are SNPs located on chromosome 14.
Networks in the left and right panels correspond to the outputs of the Tabu and the IAMB
algorithms, respectively, but only in terms of d-separations (Adapted from Morota et al. 2012).
Felipe et al. (2015) compared the efficiency of the BN approach to the traditional
Fig. 3 Phenotype structure inferred by the Bayesian network on two lines of quails, shown in
panels Line 1 and Line 2 (Adapted from Felipe et al. 2015). Traits (nodes) refer to weekly
measured body weight from birth to 35 days of age (BW1 to BW6), weight gain from birth to
35 days of age (WG1) and from 21 to 35 days of age (WG2), age at first egg (AFE), number of
eggs produced from 35 to 80 days of age (EP1), total egg production (TEP), and egg quality
traits measured in four different life stages of the bird (125, 170, 215, and 260 days of age)
(egg weight (Ew1 to Ew4), yolk weight (Y1 to Y4), egg shell weight (ES1 to ES4), egg white
weight (EW1 to EW4), and egg specific gravity (Dens1 to Dens4)).
by 50% in all BN prediction scenarios. In summary, the results showed that using
BN (with EP1 in the set of predictors) to construct regression models provided bet-
ter predictions of TEP and superior generalization ability within and across quail
lines when compared to the standard approaches used with multiple linear regres-
sion (Felipe et al. 2015).
The work of Felipe et al. (2015) is an example of prediction of a target pheno-
typic trait of economic interest based on information on early expressed traits. Such
an application can be used, for example, to aid decision making in livestock produc-
tion systems. Another interesting application of BN was presented by Scutari et al.
(2013), who used genotypic information on thousands of markers for prediction of
complex phenotypic traits in barley, rice, and mouse populations, within the context
Applications of Graphical Models in Quantitative Genetics and Genomics 105
of genomic selection (GS) in plant and animal breeding (Goddard and Hayes 2007;
de los Campos et al. 2013). The main interest was in feature selection, using the
concept of Markov blanket (MB), for identifying noninformative markers such that
simpler GS models could be fitted using only the informative markers, allowing cost
savings from reduced genotyping (e.g., Vazquez et al. 2010).
The authors used three publicly available data sets, including continuous pheno-
typic traits: (1) barley data with information on yield and genotypes for 810 SNPs
from 227 barley varieties, (2) mouse data consisting of 12,545 SNPs and phenotypic
information on growth rate and weight for 1940 subjects, and (3) rice data with
73,808 SNPs and phenotypic score on the number of seeds per panicle for 413
varieties.
The performance of MB feature selection was investigated using four GS models:
partial least squares (PLS), ridge regression, LASSO, and elastic net penalized
regressions (Gianola et al. 2009). Each GS model was fitted using all the available SNPs as well as only the
SNPs included in the MB. The prediction quality of the GS models was assessed by
calculating the correlation between observed and predicted trait values using a
cross-validation (CV) approach.
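The assessment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Scutari et al. (2013): the data are simulated, the ridge penalty `lam`, the number of folds `k`, and all function names are hypothetical, and ridge regression stands in for any of the GS models.

```python
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_fit(X, y, lam):
    """Ridge regression on centered data: (X'X + lam*I) beta = X'y."""
    n, p = len(X), len(X[0])
    xm = [sum(row[j] for row in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[row[j] - xm[j] for j in range(p)] for row in X]
    yc = [v - ym for v in y]
    A = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    return solve(A, b), xm, ym

def cv_predictive_correlation(X, y, lam=1.0, k=5, seed=0):
    """Correlation between observed values and out-of-fold predictions."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    y_hat = [0.0] * len(y)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        beta, xm, ym = ridge_fit([X[i] for i in train], [y[i] for i in train], lam)
        for i in test:
            y_hat[i] = ym + sum(b * (x - m) for b, x, m in zip(beta, X[i], xm))
    return pearson(y, y_hat)

# Simulated toy data: 120 individuals, 10 SNPs coded 0/1/2, 3 of them informative.
rng = random.Random(1)
X = [[float(rng.randint(0, 2)) for _ in range(10)] for _ in range(120)]
y = [row[0] + row[1] + row[2] + rng.gauss(0, 1) for row in X]
r_all = cv_predictive_correlation(X, y)                          # all SNPs
r_subset = cv_predictive_correlation([row[:3] for row in X], y)  # selected subset
```

Comparing `r_all` with `r_subset` mirrors the comparison between models fitted with all SNPs and models fitted with a selected subset. In a real analysis the subset would have to be selected within each training fold rather than fixed in advance as it is here.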
When using a significance threshold of α = 0.15 for testing the network correla-
tions, the resulting MBs selected a relatively small number of SNPs, regardless of
the dimension of the SNP data set. The average size of the MBs obtained from cross
validation was 185 for the barley data, 525 and 543 for weight and growth rate,
respectively, for the mouse data, and 293 for the rice data. To assess the consistency
of MB sets across the CV replications, the authors checked how many SNPs
appeared in at least half of the CV folds. Reasonable consistency was found for the
barley data, with 136 SNPs, and for mouse weight and growth rate, with 241 and
276 SNPs, respectively. With the rice data, however, only 15 SNPs were selected by
at least half of the MBs across CV folds, corresponding to 5% of the MB average
size. As indicated by the authors, the latter result may be attributed to the very low
ratio between sample size and number of SNPs in the rice data compared to the
other two data sets.
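The consistency check described above amounts to counting, for each SNP, the number of CV folds in which it was selected. A minimal sketch follows; the SNP names and fold contents are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter

def consistent_features(selected_per_fold, min_fraction=0.5):
    """Return features selected in at least `min_fraction` of the folds."""
    counts = Counter(f for fold in selected_per_fold for f in set(fold))
    threshold = min_fraction * len(selected_per_fold)
    return sorted(f for f, c in counts.items() if c >= threshold)

# Hypothetical Markov blankets obtained in four CV folds.
folds = [
    {"snp1", "snp2", "snp3"},
    {"snp1", "snp3", "snp7"},
    {"snp1", "snp4"},
    {"snp1", "snp3", "snp9"},
]
stable = consistent_features(folds)                 # SNPs present in >= half the folds
mean_size = sum(len(f) for f in folds) / len(folds)  # average MB size across folds
```

The ratio `len(stable) / mean_size` corresponds to the kind of percentage reported above for the rice data.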
Overall, the MB subsets of SNPs were consistently smaller than the sample size,
thus ensuring the regularity and numerical stability of the GS models. No loss in
predictive power was observed with GS models implemented after MB feature
selection. Actually, the increased numerical stability resulting from the reduced
number of SNPs slightly improved the predictive power of the GS models.
Specifically, the averages of the predictive correlation over the four analyses (barley,
rice, and the two mouse traits) using all SNPs were 0.481, 0.498, 0.471, and 0.521
for PLS, ridge, LASSO and elastic net, respectively, whereas the corresponding
averages when using MB feature selection were 0.502, 0.509, 0.487, and 0.520, all
with an approximate standard deviation of 0.0057. Furthermore, MBs outperformed
other subsampling strategies of the same size, such as random sampling and feature
selection based on SNP effect significance.
In summary, Scutari et al. (2013) have shown that MB feature selection applied
as a preliminary step in GS was able to greatly reduce the size of the SNP set with
no loss (and possibly a small gain) in the predictive ability of different GS models.
[Figure: genome scan plot; x-axis: Chromosome; y-axis: F value; traits plotted: tenth-rib backfat (mm), fat (%), loin weight (kg)]
Fig. 4 Genome scan results for loin muscle weight, midline 10th-rib backfat thickness, and
average intramuscular fat percentage. The horizontal line indicates the genome-wise significance
level of 5% (Adapted from Peñagaricano et al. 2015a).
[Figure: genome scan plot; x-axis: Chromosome; y-axis: -log(P-value); expression traits plotted: ZNF24, SSX2IP, SMIM12, ETV2, PTOV1, PEX14, AKR7A2]
Fig. 5 Genome scan results for seven expression traits that show significant eQTL in chromosome
6 (SSC6). The horizontal line indicates P-value = 1.9 × 10^−5 (FDR ≤ 0.20) (Adapted from
Peñagaricano et al. 2015a).
[Figure: network diagram with estimates (and standard errors) of the causal parameters shown on the edges]
Fig. 6 Network inferred after incorporation of QTL → AKR7A2 as prior knowledge, and
maximum likelihood estimates (and standard errors) of the causal parameters (Adapted from
Peñagaricano et al. 2015a).
the previous pQTL and eQTL results. In addition, even though there is a direct link
between the genotype and one phenotype (QTL → FAT), the graphical causal model
showed that, in general, the effects of the genotype on the phenotypes are mediated
by the expression of several genes.
The stability of the network was evaluated using Jackknife resampling
(Peñagaricano et al. 2015a). In each iteration, the network was inferred from a new
data set, which was created by removing one animal at a time from the original data
set. Notably, the majority of the links and directions showed great stability. In fact,
the arrows between phenotypic traits and the links from the genetic marker to the
intermediate variables remained mostly unchanged. There were very few connec-
tions that were unstable (e.g., SMIM12 ↔ PTOV1), i.e., the removal of a single data
point caused the absence of connection between the variables.
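The jackknife procedure described above can be sketched as a leave-one-out loop around any structure-learning routine. In the sketch below, the learner is a deliberately simple stand-in (an edge is drawn when the absolute correlation exceeds a threshold); it is not the algorithm used by Peñagaricano et al. (2015a). The variable names and data values are hypothetical.

```python
from itertools import combinations

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def correlation_edges(data, threshold=0.5):
    """Stand-in structure learner: link variables with |r| above a threshold."""
    return {frozenset((u, v)) for u, v in combinations(sorted(data), 2)
            if abs(pearson(data[u], data[v])) > threshold}

def jackknife_edge_stability(data, learn_edges):
    """Fraction of leave-one-out data sets in which each edge is recovered."""
    n = len(next(iter(data.values())))
    counts = {}
    for i in range(n):
        # Remove one animal (record i) from every variable.
        reduced = {k: v[:i] + v[i + 1:] for k, v in data.items()}
        for edge in learn_edges(reduced):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n for edge, c in counts.items()}

# Hypothetical data for ten animals.
data = {
    "QTL":   [0, 0, 1, 1, 2, 2, 0, 1, 2, 1],
    "EXPR":  [0.1, 0.3, 1.0, 1.2, 2.1, 1.9, 0.2, 0.9, 2.0, 1.1],
    "NOISE": [1, -1, 1, -1, 1, -1, 1, -1, 1, -1],
}
stability = jackknife_edge_stability(data, correlation_edges)
```

An edge with stability close to 1 is recovered in nearly every leave-one-out replicate. An unstable connection is one that disappears when particular records are removed.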
Knowledge about gene–phenotype networks can be used to predict the behav-
ior of complex systems. For instance, in Peñagaricano et al. (2015a), the network
model predicted that modulation of ZNF24 expression level should lead to a
change in the expression of SSX2IP. Li et al. (2009) evaluated potential
ZNF24 target genes: the authors transiently overexpressed and
silenced ZNF24 and then applied a microarray assay to identify target
genes. Remarkably, the overexpression of ZNF24 significantly decreased the
expression of SSX2IP, as predicted by the network in Peñagaricano et al. (2015a).
In addition, the silencing of ZNF24 resulted in a significant overexpression of
SSX2IP (Li et al. 2009). Therefore, these results support the causal relations
inferred in Peñagaricano et al. (2015a).
sensory traits. Most of the QTLs were found to have pleiotropic effects, with a few
of multiple traits, showing that under certain conditions it is possible to infer pheno-
type networks and causal effects even without QTL or marker information. Their
approach involves a first step of data adjustment for genetic effects, which otherwise
act as confounders of causal effects between phenotypic traits. In Valente et al.
(2010), a classical infinitesimal additive genetic model involving a relationship
matrix A constructed from pedigree information has been considered for such a
task. As an alternative, if high-density molecular marker data are available (e.g.,
SNP genotypes), more efficient genetic merit prediction approaches can be
employed, such as Bayesian regression techniques (Gianola et al. 2009) or kernel
methods (de los Campos et al. 2009). Examples of application of a data-driven con-
struction of causal graph within the mixed effects SEM context can be found in
Valente et al. (2011), who studied causal relationships among five productive and
reproductive traits in European quail, and in Bouwman et al. (2014), who investi-
gated bovine milk fatty acids.
On a different application of SEM with mixed effects models, Peñagaricano
et al. (2015b) described a methodology for assessing causal networks involving
latent variables underlying complex phenotypic traits. The first step of their approach
involves the construction of latent variables defined on the basis of prior knowledge
and biological interest, which are jointly evaluated using confirmatory factor analy-
sis. The estimated factor scores are then used as phenotypes for fitting a multivariate
mixed model to obtain the covariance matrix of latent variables conditional on the
genetic effects. Finally, causal relationships between the adjusted latent variables
are evaluated using different SEM with alternative causal specifications.
Peñagaricano et al. (2015b) applied their methodology to a data set with pigs for
which several phenotypes were recorded over time. Five different latent variables
were evaluated to explore causal links between growth, carcass, and meat quality
traits. They found that both growth (−0.160) and carcass traits (−0.500) have a sig-
nificant negative causal effect on quality traits (P-value < 0.001), which may have
important implications for improving pig production.
Other applications of graphical modeling in genetics and genomics studies include
the selection of appropriate covariables. For example, Valente et al. (2015) pointed
out that selection requires learning causal genetic effects, and that traditional model
comparison techniques based on predictive power might be inappropriate; genomic
predictors from some models may capture noncausal correlational signals which,
despite providing good predictive ability, inadequately represent true genetic effects.
Using simulated examples, the authors showed that aiming for predictive ability
might lead to poor modeling decisions. Alternatively, causal inference approaches
such as graphical models can guide the construction of regression models that better
infer the target genetic effect even when these models underperform in terms of pre-
diction quality using cross validation. Similar modeling techniques should also be
used in candidate gene studies or genomewide association analyses, so that spurious
gene–phenotype associations due to inappropriate conditioning are avoided.
As a last application of graphical models, we want to highlight their importance
in livestock production for the analysis of observational data using field-recorded
information, and their potential utility in the study of causal effects, as discussed in
Rosa and Valente (2013). As these authors postulated, there is much to be learned
from such data if carefully mined with appropriate causal models. Such analyses
4 Concluding Remarks
Graphical models such as Bayesian networks (BNs) have become extremely popular
in many areas of research for the analysis of data collected from observational
studies, such as those often found in social sciences, epidemiology, and economics.
They have also been increasingly used in genetics and genomics studies, given their
ability to explore some additional features of the data, such as conditional indepen-
dencies between variables, and to provide further insights about the biological sys-
tem under study, such as functional relationships between variables.
A first example discussed in this chapter illustrates how BNs can be used to
describe more parsimoniously the joint dependencies between variables such as the
LD structure involving molecular markers (Morota et al. 2012). Another important
application of BNs refers to prediction. In this context, BNs are able to detect the
minimal set of a pool of available variables that encompass all the relevant informa-
tion for predicting a target variable of interest, i.e., the Markov blanket (MB). Here
we illustrate this approach using two examples, one with phenotypic traits in quails
(Felipe et al. 2015) and another in the framework of whole-genome prediction mod-
els (Scutari et al. 2013). In this context, Valente et al. (2015) pointed out that the
construction of whole-genome prediction models targeting only predictive ability
using cross-validation techniques could be misleading. As discussed by these
authors, genomic selection models should be constructed aiming primarily at the
identifiability of causal genetic effects, not the predictive ability per se. BN tech-
niques should be useful in this regard as well, aiding the selection of variables to be
included in the models to avoid spurious associations between genetic markers and
phenotypic traits.
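For a known directed acyclic graph, the Markov blanket of a node consists of its parents, its children, and the other parents of its children. The following sketch computes it for a small hypothetical graph. In practice the graph is unknown and the MB is learned from data (for example with algorithms such as IAMB), which this sketch does not attempt.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG given as {node: set of parents}."""
    children = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]} - {target}
    return set(parents.get(target, set())) | children | spouses

# Hypothetical DAG: E -> A -> T -> C <- B, and C -> D.
dag = {
    "E": set(),
    "A": {"E"},
    "B": set(),
    "T": {"A"},
    "C": {"T", "B"},
    "D": {"C"},
}
mb_t = markov_blanket(dag, "T")
```

Here `mb_t` contains A (parent), C (child), and B (the other parent of C). Given these three variables, E and D carry no additional information about T.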
An example discussed in Section 3.3 illustrates the application of BN techniques
to investigate causal functional relationships between traits as well as the flow of
information from DNA polymorphisms to transcriptional activity and end point
phenotypes. Applying BNs to observational data for causal inference is, however,
a much harder task than developing prediction models. In this
context, causal inferences are possible only under certain conditions and assump-
tions, such as the Markov condition, faithfulness, and causal sufficiency (Pearl
2009). However, even if some of the conditions are not met, a BN analysis may still
produce interesting and useful results such as the generation of causality hypotheses
for further research and investigation. In genetics, for example, a putative causal
mutation or gene pathway identified by a BN could be ultimately tested using gene
knockout or knockdown techniques.
An advantage of the use of BN analysis in genetics and genomics studies is that
causal inference is aided by the concept of Mendelian randomization (Thomas and
Conti 2004), in which allelic variants are randomized to zygotes during meiosis and
mental design. It is shown that applying graphical model techniques to QTL analy-
sis and gene mapping with multiple traits not only allows inference regarding causal
relationships among phenotypes, but it also enhances detection power and precision
of estimates, with the additional advantage of a distinction between direct and indi-
rect genetic effects of QTL on each trait (Chaibub Neto et al. 2010).
Lastly, graphical models such as BN and structural equation models can be
extremely useful for analysis of farm-recorded data, which are essentially of observa-
tional nature. In this context, abundant data routinely collected in commercial herds
(either for breeding purposes, health control, or general herd management decisions)
could be exploited to investigate causal effects of environment and management fac-
tors in livestock production, well-being, and product quality. As discussed by Rosa
and Valente (2013), there is a huge opportunity for additional learning from such data,
which require, however, careful analysis using graphical model techniques.
In summary, BNs provide a flexible and insightful approach for the development
of prediction models in genetic analyses as well as for investigating causal relation-
ships between DNA polymorphisms and gene activity, and complex end point phe-
notypes. Knowledge regarding functional relationships between phenotypic traits as
well as the flow of information from DNA to phenotypes can aid the development
of more efficient breeding programs and optimal decision-making strategies in live-
stock management practices.
References
Bollen KA (1989) Structural equations with latent variables. Wiley, New York
Bouwman AC, Valente BD, Janss LLG, Bovenhuis H, Rosa GJM (2014) Exploring causal net-
works of bovine milk fatty acids in multivariate mixed model context. Genet Sel Evol 46:2
Burnham KP, Anderson DR (2002) Model selection and multimodel inference: a practical
information-theoretic approach, 2nd edn. Springer, New York
Chaibub Neto E, Keller MP, Attie AD, Yandell BS (2010) Causal graphical models in systems
genetics: a unified framework for joint inference of causal network and genetic architecture for
correlated phenotypes. Ann Appl Stat 4:320–339
de los Campos G, Gianola D, Boettcher P, Moroni P (2006) A structural equation model for
describing relationships between somatic cell score and milk yield in dairy goats. J Anim Sci
84:2934–2941
de los Campos G, Gianola D, Rosa GJM (2009) Reproducing kernel Hilbert spaces regression: a
general framework for genetic evaluation. J Anim Sci 87:1883–1887
de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MPL (2013) Whole-genome
regression and prediction methods applied to plant and animal breeding. Genetics
193(2):327–345
de Maturana EL, Wu X-L, Gianola D, Weigel KA, Rosa GJM (2009) Exploring biological relation-
ships between calving traits in primiparous cattle with a Bayesian recursive model. Genetics
181:277–287
de Maturana EL, de los Campos G, Wu X-L, Gianola D, Weigel KA, Rosa GJM (2010) Modeling
relationships between calving traits: a comparison between standard and recursive mixed mod-
els. Genet Sel Evol 42:1
Edwards DB, Ernst CW, Raney NE, Doumit ME, Hoge MD et al (2008a) Quantitative trait locus
mapping in an F2 Duroc x Pietrain resource population: II. Carcass and meat quality traits.
J Anim Sci 86:254–266
Edwards DB, Ernst CW, Tempelman RJ, Rosa GJM, Raney NE et al (2008b) Quantitative trait loci
Rosa GJM, Valente BD, de los Campos G, Wu X-L, Gianola D, Silva MA (2011) Inferring causal
phenotype networks using structural equation models. Genet Sel Evol 43:6
Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling net-
works derived from multiparameter single-cell data. Science 308:523–529
Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, GuhaThakurta D, Sieberts SK, Monks S, Reitman
M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H,
Kash SF, Drake TA, Sachs A, Lusis AJ (2005) An integrative genomics approach to infer
causal associations between gene expression and disease. Nat Genet 37:710–717
Scutari M, Mackay I, Balding D (2013) Improving the efficiency of genomic selection. Stat Appl
Genet Mol Biol 12(4):517–527
Sebastiani P, Ramoni MF, Nolan V, Baldwin CT, Steinberg MH (2005) Genetic dissection and
prognostic modeling of overt stroke in sickle cell anemia. Nat Genet 37:435–440
Shipley B (2002) Cause and correlation in biology. Cambridge University Press, Cambridge, UK
Sinoquet C, Mourad R (eds) (2014) Probabilistic graphical models for genetics, genomics and
postgenomics. Oxford University Press, Oxford
Spirtes P, Glymour C, Scheines R (2000) Causation, prediction and search, 2nd edn. The MIT
Press, Cambridge, MA
Steibel JP, Bates RO, Rosa GJM et al (2011) Genome-wide linkage analysis of global gene expres-
sion in loin muscle tissue identifies candidate genes in pigs. PLoS One 6(2), e16766
Thomas DC, Conti DV (2004) Commentary: the concept of ‘Mendelian randomization’. Int
J Epidemiol 33:21–25
Tsamardinos I, Brown LE, Aliferis CF (2006) The max-min hill-climbing Bayesian network struc-
ture learning algorithm. Mach Learn 65:31–78
Valente BD, Rosa GJM, de los Campos G, Gianola D, Silva MA (2010) Searching for recursive
causal structures in multivariate quantitative genetics mixed models. Genetics 185:633–644
Valente BD, Rosa GJM, Teixeira RB, Torres RA (2011) Searching for phenotypic causal networks
involving complex traits: an application to European quails. Genet Sel Evol 43:37
Valente BD, Morota G, Peñagaricano F, Gianola D, Weigel KA, Rosa GJM (2015) The causal
meaning of genomic predictors and how it affects the construction and comparison of genome-
enabled selection models. Genetics 200:483–494
Varona L, Sorensen D, Thompson R (2007) Analysis of litter size and average litter weight in pigs
using recursive model. Genetics 177:1791–1799
Vazquez AI, Rosa GJM, Weigel KA, de los Campos G, Gianola D, Allison DB (2010) Predictive
ability of subsets of single nucleotide polymorphisms with and without parent average in US
Holsteins. J Dairy Sci 93(12):5942–5949
Wang H, van Eeuwijk F (2014) A new method to infer causal phenotype networks using QTL and
phenotypic information. PLoS One 9(8), e103997
Wang H, Paulo J, Kruijer W, Boer M, Jansen H, Tikunov Y, Usadel B, van Heusden S, Bovy A, van
Eeuwijk F (2015) Genotype–phenotype modeling considering intermediate level of biological
variation: a case study involving sensory traits, metabolites and QTLs in ripe tomatoes. Mol
Biosyst 11:3101–3110
Wright S (1921) Correlation and causation. J Agric Res 20:557–585
Wu X-L, Heringstad B, Chang YM, de los Campos G, Gianola D (2007) Inferring relationships
between somatic cell score and milk yield using simultaneous and recursive models. J Dairy
Sci 90:3508–3521
Wu X-L, Heringstad B, Gianola D (2008) Exploration of lagged relationships between mastitis and
milk yield in dairy cows using a Bayesian structural equation Gaussian-threshold model. Genet
Sel Evol 40:333–357
Abstract
Improvements in animal health, productivity, or disease resistance require an
understanding of functional genomics, which can be obtained using multiscale
multi-omics data analysis. Multiscale omics data analysis requires the integra-
tion of bottom-up and top-down approaches to consider both the data available
from an experiment as well as the existing knowledge of functional properties
and interactions in the biological system. Such a comprehensive analysis is now
possible through the integration of multi-omics data generated across multiple
scales from genomics, transcriptomics, proteomics, and metabolomics experi-
ments. This type of systems biology-based analysis requires the use of different
statistical methods and extensive network analysis fine-tuned for the purpose of
identifying the key components associated with the required phenotypes. This
will require integration of powerful statistical techniques with computational
platforms capable of handling the large data volumes and the integrative nature
of the analysis. In this chapter, we discuss all these aspects of systems biology
with some of the tools and techniques.
1 Introduction
A living cell is like a working factory which uses several organelles to produce
thousands of proteins for specific functions. The whole process is very complex and
even minute differences may result in different phenotypic characteristics, includ-
ing diseases. To understand the system of a whole cell, we need to study different
kinds of molecular data for DNA, genes, RNA, proteins, metabolites, and their
through statistics and mathematical models will help us to integrate these molecular
data sets and obtain meaningful insights. This will enhance our understanding about
the mechanism of life and will enable useful modifications, be it for disease man-
agement, desired agricultural productivity, or desired animal phenotype.
Norman Borlaug, a biologist, was awarded the Nobel Peace Prize in 1970 (Hesser
2006). He is credited with the Green Revolution through his contribution to the
production of high-yield, disease-resistant wheat, which is credited with saving over
a billion people from starvation. Borlaug and his team created a special variety of
dwarf wheat that produced about three times more grain than the traditional
varieties. It consumed less water and fewer natural resources, and it resisted a wide
spectrum of plant pests and diseases. They were able to do so due to an understanding of
the genetic mechanisms involved in development of the desired phenotypes. Omics
sciences today are able to take us a step further and help us understand, in much
more detail, what is happening inside a living system through generation of massive
quantities of experimental data. These data are generally called “big data” because
they are huge in size and complexity. Supercomputers and complex computational
algorithms are required to process and interpret these biological data. Increasing use
of cloud computing solutions will make the data volume anticipation and handling
problem much more tractable (Stein 2010). Cloud computing with computational
algorithms will be able to make these mechanistic models available and accessible
to all the biologists across the globe.
Figure 1 shows a very simplified model of the basis of omics sciences. The DNA
in the genome is transcribed into mRNA, which is translated into protein products. These
proteins then participate directly or indirectly in different complex biochemical
reactions and modulate the production and consumption of metabolites. Through
the complex interplay of these processes, the observable phenotypes develop. To
answer any question in biology, be it for creating a drought-resistant crop, a high-
productivity domesticated animal, or a drug for treating a disease, one needs to
understand the genotype–phenotype relationship. Traditionally, it was only possible
to measure the phenotypic outcome, because it was visible to the naked eye.
However, with the help of omics science, one can now even measure the phenome-
non at the molecular level. It has become possible to study how molecules inside the
cells interact, their shapes, their sizes, what they produce, how they can be modified,
etc., so that we can make changes in some of these molecules, their patterns, and
behaviors and use them to our advantage with manageable risk.
Omics science is transforming biological science into a mechanistic science so that
we can predict outcomes with a higher level of accuracy. In this chapter, we will discuss
how mathematics and computational science are used to extract meaningful knowledge
from omics data. As an example, we use a multi-omics multiscale toolset named iOMICS
(http://iomics.interpretomics.co) that combines these various biological big data to
extract knowledge and actionable insights. It will help you, a biologist, to understand
how to create hypotheses from genome-scale omics data and establish cause–effect
or genotype–phenotype relationships in a biological system.
Advanced Computational Methods, NGS Tools, and Software for Mammalian Systems 119
[Fig. 1 diagram labels: Environment; Epigenome (DNA methylation, histone modification, transcription factor binding); Transcription; Translation; Post-translation modification (covalent and enzymatic modification); Phenotype (cancer and other diseases and disorders); Integration approach]
2 Multiscale–Multi-omics Data
Multiple types of data are being generated with the help of high-throughput
technologies like next-generation sequencing (NGS) (Schuster 2007), microar-
ray (Schena et al. 1995), mass spectrometry (Aebersold and Mann 2003), etc.
These data can be for a specified region of interest or at the genome scale. These
experiments generate data that could be (i) genomic—whole genome sequenc-
ing, exome sequencing, targeted sequencing, and genotyping arrays; (ii) tran-
scriptomic—RNA sequencing, small RNA sequencing, gene expression, and
miRNA expression arrays; (iii) epigenomic—transcription factor binding and
chromatin regulators, generated using both sequencing and array techniques;
(iv) proteomic—protein expression array, mass spectrometry, enzyme-linked
immunosorbent assays (ELISA); and (v) metabolomic—mass spectrometry,
chromatography, etc.
Omics data can be grouped into two major categories:
These data types often need to be combined in order to generate valuable infor-
mation regarding the biological system (Talukder 2015). This information can then
be used to develop strategies for modifying system properties (Agarwal et al. 2015;
Adhil et al. 2015). Their properties are described below.
120 M. Adhil et al.
Perishable Data This includes high-throughput data that are generated from bio-
logical experiments. These data are analyzed using various tools, and knowledge
value is derived from them. When there are multiple data sets, they are called repli-
cates. There are two different types of replicates:
1. Technical replicates—In this type, the same sample is used multiple times to
generate multiple sets of data. This is useful to determine any instrument/
technology-related artifacts and to establish the variability of the technique.
2. Biological replicates—In this type, multiple samples are taken from different
sources, in order to establish the biological variability between samples. This is
useful for discovery of specific information about the specimens.
Cohort level data: In a cohort, multiple samples are analyzed by taking biospeci-
mens from each member of the cohort. Depending on the experimental study design,
the cohort size is chosen. Usually, it is desirable to have 25 or more samples.
Vast amounts of multiple types of omics data are generated and stored in different
formats. To process these data in order to obtain functional meaning, one needs to
understand the format in which the data are stored. The different types of data,
such as nucleotide or protein sequence, annotation of genes, variation (mutations
and structural variations), protein structures, etc., are all stored in different
formats appropriate for the data type (https://genome.ucsc.edu/FAQ/FAQformat.html).
Biological databases also allow you to download large volumes of data in formats
such as tab-delimited text (.txt), XML (Mesiti et al. 2009), SQL or PostgreSQL
dumps of relational databases (Letovsky 1999), and Systems Biology Markup
Language (SBML) (Hucka et al. 2003). Web Ontology Language (OWL) and Resource
Description Framework (RDF) (McGuinness and Van Harmelen 2004) are hierarchical
knowledge representation languages used to define classes of objects and the
relationships between them. The use of standardized data formats enables the
transfer and use of information across groups. Some commonly used data formats are
given below.
FASTA Most of the genome reference files are in this format. It contains the
FASTQ Most raw NGS sequencing output is stored in FASTQ format. This format
stores the sequence, as in FASTA files, together with a quality score for each base
in the sequence. FASTQ contains four lines per NGS sequence read. The first line
begins with “@” and contains the sequence identifier (description), and the second
line contains the sequence. The third line begins with a “+” and may optionally
repeat the sequence identifier, whereas the fourth line contains the quality scores,
one character per nucleotide. An example FASTQ sequence is given as follows:
@SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAG
+SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1
HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGG
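The four-line layout can be read with a few lines of Python. The sketch below parses the record shown above and converts the quality characters to numeric scores. It assumes Phred+33 encoding (character code minus 33), which is common but not stated in the text, and the function name is hypothetical.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (identifier, sequence, qualities) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding of the quality characters.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]

record = (
    "@SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1\n"
    "GGGTCATGTGGGCTCATTATTTTCCTCTTTCTTTTACCCAAGTGGACAAG\n"
    "+SRR292250.1 80DYAABXX_000168:7:1:10000:103094:Y/1\n"
    "HHHHHHHHFHHHGHH6GGGHHHHHHHHFHGEHHHHHHGFFDG@BGDEGGG\n"
)
reads = list(read_fastq(StringIO(record)))
name, seq, quals = reads[0]
```

Note that “@” can also occur inside the quality string, as it does in this record, so a reader should rely on the four-line structure rather than on searching for “@” at the start of a line.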
General Feature Format (GFF) This file contains one line per feature, each
containing nine columns of data: seqname (chromosome), source, feature (e.g., gene),
start, end, score, strand, frame, and attribute (additional information). An example of
this format is given below.
##gff-version 3
##sequence-region 11 1 135086622
and analysis software with the “.vcf” extension. The VCF files contain the
sequence variation information with respect to a particular genome. This file contains
meta-information lines, followed by the header and then the variant information. It
contains fields like chromosome, position, ID, reference allele, alternate allele,
quality, filter, info, and format for each variant. It can also capture additional
information about the variants, such as rsID, genotype probabilities, Phred score, and
coverage. The meta-information lines begin with “##.” The “fileformat”
meta-information line is required. In addition, the meta-information lines may contain
a description of the fields contained within the info field. The sample genotypes are
provided in the columns that follow the format column. The first five columns of the
VCF file describe the variant: chromosome, position, unique ID, reference allele, and
alternate alleles. The quality and filter fields provide information regarding the
alternate allele call. The info field may contain various other annotations for the
variants, such as alternate allele frequencies in different populations, allele count,
codon change, etc.
Protein Data Bank This contains information about the three-dimensional struc-
tures of molecules, such as atomic coordinates, side chain conformers, secondary
structure, and atomic connectivity. The structure information contained within a
PDB file is given under different line types depending on the information. For
example, SEQRES lines give the amino acid sequence of the protein, whereas the
HELIX, SHEET, and TURN lines provide information regarding the secondary
structure elements. A 3-D rendering of the structure in space is achieved by the
information available in the ATOM and HETATM lines, which contain coordinate
information.
BED The BED file contains feature track information: each row describes one feature,
and the columns give the chromosome, start, end, feature name, and score. It is used
to represent chromosomal intervals together with a score indicating their
significance. The format is widely used in ChIP-seq analysis to represent peaks
(features) along with their coordinates and scores. An example of a few records is
shown below, where the first line contains the description of the data and the others
contain the chromosome with intervals.
Systems Biology Markup Language (SBML) The systems biology markup lan-
guage is an example of a standard data representation that was developed to enable
data sharing between different research groups. Models such as metabolic models
and cell signaling networks are represented in the SBML format to enable data shar-
ing and for standardization. It is based on XML and is widely used to represent
biological networks.
Advanced Computational Methods, NGS Tools, and Software for Mammalian Systems 123
4 Evidence-Based Reasoning
Biological data are mostly empirical, and the results are primarily inferences from
intuition, observed phenomena, and experimental data. This involves inductive rea-
soning, which makes broad generalizations from specific observations. In empirical
science or evidence based practice (EBP), you make many observations, then you
discover a pattern, then you generalize these observations. Finally, you combine all
different conclusions from different experiments and build a theory—answer the
question “why.” Once the mechanistic model is created using data generated in the
past, you are in a position to use this model to answer the question on data that you
have not seen—answer what is likely to happen in the future.
In evidence-based (inductive) reasoning, general results are obtained from specific
data, whereas in deductive reasoning, specific results are obtained from general
principles. Once the property of the population is known through inductive reasoning,
we can deduce the property of any member of the population.
In twenty-first century biology, you perform experiments, collect data and use
inductive reasoning with the help of bioinformatics and mathematics to arrive at a
general mechanistic model. Once the mechanism is known, you use deductive rea-
soning to predict the outcome based on the theory (model you created).
Various tools are available, which are capable of handling the large volumes of
“omics” data. These tools run quality checks on experimental data, along with
descriptive statistics, and finally derive insight from the data (Kohl et al. 2014).
These then need to be integrated to obtain a complete system level picture, to link
to the phenotype. Integration of these tools, to obtain isolated insights, followed by
a cross-scale integration is implemented in the iOMICS bioinformatics tool. The
iOMICS platform is a sophisticated next-generation genomics software suite,
124 M. Adhil et al.
Fig. 2 iOMICS apps and its higher level organization (NGS, MicroArray, and Integrative Biology)
based on the available processors in order to reduce the time. On-premise versions
are preferred when large amounts (hundreds to thousands of samples) of omics data
sets are generated locally and repeatedly. The cloud version of iOMICS has several
advantages over the on-premise version: (i) local hardware infrastructure is not
necessary, (ii) no software installation is required, (iii) no user-end software
updates are necessary, (iv) it can be accessed at any time and from anywhere, and
(v) instances scale up or down automatically based on the data volume.
In the following section, we describe how you can use the various functionalities in
iOMICS suite to analyze individual omics data sets.
6.1 Genome
Two major analysis pipelines available in iOMICS for analyzing genome sequence
are assembly and variation analysis. In iOMICS, both NGS and array data can be
analyzed. Array technology is mainly used for the validation of variations; it
cannot be used to find de novo variations. The genome sequence analysis apps
include (i) DNA-seq for analyzing whole genome sequence data and (ii) Exome-seq
for analyzing whole exome sequence data. For microarray data, the genotyping app
is available for SNP and CNV analysis.
These apps have multiple functional modules such as “quality metrics,” “refer-
ence assembly,” “variant detection,” “population genetics,” and “pathway analysis.”
Some modules require a cohort (case/control) data set, such as “population genet-
ics” where the genotype-phenotype association is performed to obtain significant
genotypes associated with the phenotype.
6.2 Epigenome
There are various techniques available for mapping the genome-wide epigenetic
information, which includes chromatin regulators, transcriptional binding sites, and
DNA–RNA hybrids. These techniques are studied using NGS (ChIP-seq) and
microarray (ChIP-on-chip) for different experimental conditions.
In iOMICS, ChIP-seq and ChIP-on-chip apps can be used to analyze all types
of genome-wide epigenetic information. The modules in the ChIP-seq app
include “quality metrics,” “read alignment,” “peak alignment,” and “peak iden-
tification.” The ChIP-on-chip app is based on array technology. It contains mod-
ules such as data normalization and peak calling. The common modules for both
the apps are “peak annotation,” “common/unique peaks,” and “motif identifica-
tion.” Case/control comparisons can also be performed to identify the differen-
tially binding sites and to perform functional annotations on those regions.
Other interesting results from ChIP-seq/ChIP-on-chip analyses include average
open reading profiling of the binding site, which gives the distribution of peaks
in the intragenic region, and an overall binding site profile near the
transcriptional start site (TSS).
6.3 Transcriptome
Transcriptomics data from NGS can be used to identify the transcripts, genes, coex-
pression, and differential expression across samples. In array-based technology,
only coexpression and differential expression analysis can be performed. In
iOMICS, the RNA-seq (NGS) and Gene Expression (Array) app can be used to
perform transcriptomics data analysis. The modules in the RNA-seq app include
“splice aware alignment,” “assembly and expression quantification,” “gene predic-
tion,” and “differential expression analysis.” The modules in the array-based gene
expression app include “data normalization,” “differential expression,” “coexpres-
sion,” and “functional enrichment.” The differential expression analysis is used to
understand the conditional regulation of the genes based on the phenotype.
Coexpression analysis helps to identify the genes coexpressed under a specific
condition. Functional enrichment helps to identify the biological process, cellular
component, and molecular function for significant genes.
Small RNAs are noncoding RNAs involved in regulating the translation of target
RNA. The miRNA (NGS) app in iOMICS can be used to identify miRNAs, characterize
them and find their targets, and compute differential expression of miRNAs
between case and control. The modules in the app include “miRNA identification,”
“conservation study,” “target identification,” and “differential expression analysis.”
Conservation study helps to identify the most conserved miRNAs across the species
and their phylogeny. Target identification helps to identify mRNA-miRNA
interaction.
The final goal of any omics data analysis is to arrive at a mechanistic model for the
phenotype. For example, an important application of integrating genotype with phe-
notype is marker-assisted selection (MAS) in breeding processes. MAS is used to
select the trait of interest, based on the molecular marker tightly linked to the trait.
Molecular markers used for this purpose include genes, SNPs, microsatellites, etc.
You can produce a crop that is significantly resistant to a particular disease, or you can
produce a fruit with high nutritional value, or you can do animal breeding to produce
healthy offspring. Once the markers are identified and validated, a statistical method
called quantitative trait locus (QTL) mapping is used to identify significant markers
responsible for the phenotype of interest. First, you construct the genetic map using the set of
molecular markers and then you perform a linkage analysis. The linkage analysis is
performed by using the phenotype, which is a quantitative measurement (e.g., color
intensity, weight, height, and production), along with the genetic linkage map. The
loci can then be used to identify the significant markers responsible for the trait.
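The idea of linking a quantitative phenotype to markers can be sketched with a single-marker regression on simulated data. This is a simplified stand-in for a full QTL linkage analysis, which would also use the genetic map; all values, effect sizes, and names below are hypothetical.

```r
# Single-marker association sketch on simulated data (all values hypothetical).
set.seed(1)
n <- 200
# Genotypes coded as the number of copies of one allele (0, 1, or 2).
linked_marker   <- rbinom(n, size = 2, prob = 0.5)
unlinked_marker <- rbinom(n, size = 2, prob = 0.5)
# Quantitative phenotype: each copy of the linked allele adds 1.5 units.
phenotype <- 10 + 1.5 * linked_marker + rnorm(n, sd = 1)

# Regress the phenotype on a marker and extract the p-value of the slope.
marker_pvalue <- function(genotype, phenotype) {
  fit <- lm(phenotype ~ genotype)
  summary(fit)$coefficients["genotype", "Pr(>|t|)"]
}

p_linked   <- marker_pvalue(linked_marker, phenotype)
p_unlinked <- marker_pvalue(unlinked_marker, phenotype)
c(linked = p_linked, unlinked = p_unlinked)
```

The marker that was simulated to influence the trait yields a very small p-value, whereas the unrelated marker typically does not; in a genome-wide scan, the resulting p-values would additionally require multiple-testing correction.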
[Figure: labels recovered from the original artwork: Experimental Data Analysis; Systems Biology; Integrative Analysis; Insights; Bottom Up Approach; Reference Knowledgebases]
type of integration required to answer the biological question. For this purpose,
different types of integrative apps are available, which combine different types of
data such as DNA–RNA (Shabalin 2012), RNA–ChIP (Wang et al. 2013), and
miRNA–RNA.
The results obtained from the integrative modeling approaches, such as those
included in iOMICS, need to be finally applied in an experimental setup to obtain
the desired phenotype. This phenotype may be in the form of disease treatment,
increased productivity, improved quality of produce, etc. In addition, in certain
cases, the final outcome from integrative pipelines may need to be further enhanced
before it can be used. In such cases, certain statistical tools and network modeling
approaches can be used. Reference databases, commonly used statistical and net-
work analysis techniques are described in the following sections.
Biological databases are used for the functional annotation and validation of results
to obtain meaningful biological insights. The biological databases are commonly
classified into the following three categories:
expression (mostly microarray based) along with phenotype information; (v) The
Cancer Genome Atlas (TCGA) contains raw and processed genetic data for almost
all cancer types and subtypes; (vi) BioGRID, KEGG, and Reactome—contain
curated biological pathways across different species; (vii) Gene Ontology
Consortium (GO)—contains concepts/classes used to describe gene function such
as molecular function, cellular component, and biological process; (viii) UniProt—
central repository for proteins containing sequence and function information; and
(ix) Expression Atlas—contains gene expression pattern under different biological
conditions across species. Many other useful databases are also available at NCBI
and EMBL.
This section details some of the basic statistical concepts and tools that will come in
handy for analyzing data for insight generation. These insights usually take the form
of biomarkers.
8.1 Variables
In a biological context, the variables of interest are potential biomarkers. These may
be gene names, variants, proteins, or metabolites. The variable being studied repre-
sents some values that may relate to properties such as gene expression, metabolite
consumption, etc. When the variable has a fixed value, which is assumed to be
measured without error and does not change from observation to observation, it is
called a fixed variable. By contrast, the variables of interest usually studied are ran-
dom variables. Their values are defined by a set of possible values following a
particular probability distribution. Depending on the variable being studied, it may
be numeric or categorical.
# R- code
# Sample data
> colon.SOX13<- c(1.84, 3.31, 1.61, 1.81, 3.2, 1.72, 2.98, 1.44,
1.26, 1.28, 2.75, 2.52, 2.75, 1.68, 2.45, 2.1, 3.3, 2.49, 4.36,
1.27, 0.39, 2.94, 1.78, 0.91, 3.05, 1.59, 1.68, 1.74, 1.68, 0.46,
1.34, 0.58, 1.89, 1.48, 2.26, 1.85, 4.09, 0.68, 2.09, 1.41, 0.66,
2.32, 1.81, 0.12, 2.03, 1.26, 1.4, 1.57, 2.53, 2.69)
> breast.SOX13<- c(2.01, 0.56, 0.13, 0.07, 0.45, 1.31, 0.66, 1.27,
0.43, 1.67, 1.54, 0.12, 0.48, 0.61, 0.51, 1.3, 1.65, 0.68, 1.39,
1.14, 1.03, 0.88, 0.49, 0.22, 2.07, 0.61, 0.53, 0.37, 0.22, 0.76,
0.62, 0.78, 1.22, 2.97, 1.92, 0.16, 0.62, 1.46, 1.57, 0.33, 2.46,
0.94, 2.5, 1.78, 0.76, 2.55, 0.92, 2.59, 1.34, 0.72)
> summary(colon.SOX13)
> summary(breast.SOX13)
The summary clearly shows that in colon tissue the “SOX13” gene expression
has a higher level compared to breast tissue. The values are higher in colon tissue in
all sections (minimum value, first quartile, median, mean, third quartile, and maxi-
mum value).
For visual representation of these data, you can use a box plot or scatter
plot.
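# R-code
# Box-Plot (start of listing missing in the source; the calls below are an assumed reconstruction, "plotdata" is a hypothetical name)
> plotdata <- data.frame(Expression = c(colon.SOX13, breast.SOX13),
Tissue = c(rep("Colon-Tissue", length(colon.SOX13)),
rep("Breast-Tissue", length(breast.SOX13))))
> boxplot(Expression ~ Tissue, data = plotdata)
# Scatter-plot
> plot(colon.SOX13, breast.SOX13)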
Both graphs (Figs. 4 and 5) show the clear differentiation of the two variables
(colon.SOX13 and breast.SOX13). In the box plot, we can see that one data point
in “Colon-Tissue” lies far from the rest of the data (an outlier). This type of
summary plot will also help you to find outliers, if present in the data sets. In omics
data sets, these types of summary plots are helpful for visualizing overall patterns
in the data.
Fig. 4 Box plot between two tissues (breast and colon) using SOX13 gene expression (x-axis: Tissue; y-axis: SOX13-ExpressionValue)
Fig. 5 Scatter plot between two tissues (breast and colon) using SOX13 gene expression (x-axis: SOX13-Colon-Tissue-Expression; y-axis: SOX13-Breast-Tissue-Expression)
hypothesis states that they do not. In mathematical terms, the null hypothesis is
represented as H0, whereas the alternate hypothesis is represented as Ha. The
hypothesis is tested using the p-value. In addition, due to the high dimensionality and
complexity of biological data, you also need to consider possible sources of error
which can lead you to accept the wrong hypothesis.
8.3.1 p-Value
The significance of a hypothesis test is measured by the p-value. Mathematically, it
is the probability of obtaining results at least as extreme as those observed, given
that the null hypothesis is true. You fail to reject the null hypothesis when the
p-value is large and reject the null hypothesis (accept the alternate hypothesis) when
the p-value is close to 0. In this context, we define the significance level. In
biology, a confidence level of 95 % is generally used, which corresponds to a
significance level of 0.05: you reject the null hypothesis when the p-value is less
than 0.05. A p-value of 0.05 indicates that, if the null hypothesis were true, results
at least this extreme would occur 5 % of the time; it is not the probability that the
null hypothesis is correct. For greater stringency, one may reduce the threshold to
0.001 or lower.
Bonferroni Correction In this method, the p-value cutoff is divided by the number
of tests. Thus, in the previous example of 10 tests, the p-value threshold becomes
0.05/10, i.e., 0.005. The probability of at least one Type I error across the 10 tests
then becomes 1 − (1 − 0.005)^10, i.e., about 0.049. Although this method reduces the
Type I error, it is highly conservative and increases the chances of Type II errors.
FDR and q-Value An alternative to the Bonferroni correction method is the false
discovery rate (FDR). The FDR is the expected proportion of false positives amongst
the significant results. It can be used to adjust the p-value in order to reduce the
number of false positives. This is done by estimating the probability that an
obtained significant result is a false positive. This adjusted p-value, called the
q-value, is a better measure of significance in cases of multiple testing.
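Both corrections can be tried out with the base R function p.adjust. The sketch below uses ten hypothetical raw p-values; the Benjamini-Hochberg ("BH") adjustment is used here as one common way of controlling the FDR, which is an assumption, since the text does not name a specific procedure.

```r
# Ten hypothetical raw p-values from ten tests.
pvals <- c(0.0004, 0.003, 0.006, 0.012, 0.02, 0.04, 0.08, 0.2, 0.5, 0.9)

# Bonferroni: multiply each p-value by the number of tests (capped at 1),
# which is equivalent to comparing the raw p-values with 0.05 / 10.
p_bonf <- p.adjust(pvals, method = "bonferroni")

# Benjamini-Hochberg adjustment, which controls the false discovery rate.
p_bh <- p.adjust(pvals, method = "BH")

n_raw  <- sum(pvals  < 0.05)   # significant without correction
n_bonf <- sum(p_bonf < 0.05)   # significant after Bonferroni correction
n_bh   <- sum(p_bh   < 0.05)   # significant at a 5 % FDR
c(raw = n_raw, bonferroni = n_bonf, BH = n_bh)
```

For these values, six tests pass the uncorrected threshold, two survive the Bonferroni correction, and five remain significant after the FDR adjustment, which illustrates how much less conservative the FDR approach is.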
8.4.1 t-Test
The t-test is applied to one or two random variables to test whether the means of
populations are significantly different. In the one-sample t-test, it is used to test
whether the samples are drawn from a particular population. In the two-sample
t-test, it compares the means of the values of two random variables to identify
whether these represent a true difference in the populations from which they have
been sampled. The t-test assumes that both random variables are distributed nor-
mally. For example, the t-test can be used to compare the expression value of a
gene between two groups. A statistically significant result indicates that the differ-
ence between the expression levels is due to an actual difference in the groups
rather than due to chance. In the following example, the expression levels of the
gene “SOX13” are taken for two tissues. We wish to identify whether there is any
difference in the expression levels of this gene in these tissues. For each group, we
have expression values from 50 samples. A high t-statistic and a low p-value indi-
cate that the expression levels for the two tissues do not belong to the same
distribution.
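# R-Code
# Two-sample t-test (this call is missing in the source listing and is
# assumed here; t.test() performs the Welch test by default)
> t.test.SOX13 <- t.test(colon.SOX13, breast.SOX13)
# p-value
> t.test.SOX13$p.value
[1] 1.221483e-06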
In this example, we see the p-value is 1.22e−06. This means that the expression
levels of the SOX13 gene in these two cancer types are statistically different.
SOX13 could therefore serve as a mechanistic biomarker: by examining the expression
level of the SOX13 gene in an unknown sample, we may be able to predict whether
the cancer is in the breast or colon.
8.4.2 ANOVA
Analysis of variance (ANOVA) also assumes that the random variables are
normally distributed. It measures how much of the variance in the data is
explained by categorizing the data into the different variables, and unlike t-test,
it can even be used to compare more than two random variables. For instance,
in the example for the t-test, let us include data for another tissue. We now want
to see if the expression of SOX13 is tissue dependent. ANOVA can be used in
such a case.
# R-code
# Required R library
> library(reshape2)
> skin.SOX13<- c(1.85, 1.59, 3.66, 1.92, 4.16, 3.91, 3.34, 3.52,
1.51, 2.3, 4.03, 3.91, 4.23, 1.68, 3.34, 2.74, 4.27, 1.91, 2.93,
2.52, 3.63, 5.13, 3.5, 3.88, 1.8, 3.56, 4.94, 3.78, 3.6, 4.21,
2.12, 4.5, 3.94, 2.72, 3.92, 4.2, 3.2, 4.82, 5.27, 2.14, 2.85,
2.66, 3.71, 5.93, 5.23, 2.73, 3.8, 3.43, 4.13, 2.85)
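# Long-format table (this step is missing in the source listing; the call
# below is an assumed reconstruction using melt() from reshape2)
> data1 <- melt(data.frame(breast = breast.SOX13, colon = colon.SOX13,
skin = skin.SOX13), variable.name = "Tissue", value.name = "Expression")
# ANOVA
> result <- aov(Expression ~ Tissue, data=data1)
> summary(result)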
Here you can see that the p-value (Pr(>F)) is less than 2e−16, which means that
there are significant differences in the expression level of SOX13 among the
tissues (“breast,” “colon,” and “skin”). ANOVA can be carried out for more
than two variables (e.g., more than two tissues or two genes). This type of test can
also be used to test for “house-keeping genes” (genes expressed in many tissues),
which in most cases cannot be used as biomarkers.
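# R-code
# Chi-squared test
> chisq.statistic<- sum(((data2$Observed - data2$Expected)^2)/
data2$Expected)
> chisq.statistic
[1] 36.28118
# p-value for one degree of freedom (this call is missing in the source
# listing and is reconstructed from the printed output)
> pchisq(chisq.statistic, df = 1, lower.tail = FALSE)
[1] 1.708054e-09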
Here you can see that for one degree of freedom (df) the chi-square value is
36.28, which is much greater than the critical value of 3.841 (please refer to
degree of freedom “1” and significance level “0.05” in a table of critical values for
the chi-squared distribution), and the p-value is 1.708054e−09. This gives evidence
to reject the null hypothesis (population or data follows the Hardy–Weinberg
equilibrium model) and accept the alternate hypothesis (population or data does not
follow the Hardy–Weinberg equilibrium model).
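A Hardy–Weinberg chi-squared test of this kind can be sketched end to end as follows. The genotype counts are hypothetical and are not the “data2” values used above; the expected counts are derived from the allele frequency estimated from the same sample.

```r
# Hypothetical observed genotype counts at one biallelic locus.
observed <- c(AA = 298, Aa = 489, aa = 213)
n <- sum(observed)

# Frequency of allele "A" estimated from the genotype counts.
p <- (2 * observed[["AA"]] + observed[["Aa"]]) / (2 * n)
q <- 1 - p

# Expected counts under Hardy-Weinberg proportions p^2, 2pq, q^2.
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

chisq.statistic <- sum((observed - expected)^2 / expected)
# One degree of freedom: 3 genotype classes - 1 - 1 estimated allele frequency.
p.value <- pchisq(chisq.statistic, df = 1, lower.tail = FALSE)
c(statistic = chisq.statistic, p.value = p.value)
```

For these counts the statistic is well below the critical value of 3.841, so, unlike in the example above, the null hypothesis of Hardy–Weinberg equilibrium would not be rejected.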
# R-code
data: data3
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval: 5.309629 10.773256
sample estimates:
odds ratio
7.610023
The Fisher’s exact test result shows that the p-value is less than 2.2e−16 and the
odds ratio is 7.610023, which means that the particular GO term is significantly
enriched in the experiment. This type of statistic is useful for cohort-based studies,
where you have significant genes for a particular phenotype that can be used to find
the pathways and biological processes that are affected.
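A GO term enrichment test of this kind can be sketched with the base R function fisher.test on a 2 × 2 table. The counts below are hypothetical and are not the “data3” values from the output above; a one-sided alternative is used here to test specifically for over-representation, whereas the output above is from a two-sided test.

```r
# Hypothetical 2 x 2 table for one GO term.
go_table <- matrix(
  c(40, 160,      # significant genes: annotated / not annotated with the term
    200, 9600),   # remaining genes:   annotated / not annotated with the term
  nrow = 2, byrow = TRUE,
  dimnames = list(Gene.set = c("Significant", "Background"),
                  GO.term  = c("Annotated", "Not.annotated"))
)

# One-sided test: is the GO term over-represented among the significant genes?
enrichment <- fisher.test(go_table, alternative = "greater")
enrichment$p.value    # significance of the enrichment
enrichment$estimate   # conditional maximum-likelihood estimate of the odds ratio
```

An odds ratio well above 1 combined with a small p-value indicates that the term occurs more often among the significant genes than expected by chance. When many GO terms are tested, the resulting p-values should be corrected for multiple testing.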
8.5 Correlation
Correlation is a statistical technique used to study the association between two
variables x and y, where x and y represent two data series (x1, x2, …, xn) and
(y1, y2, …, yn). The variables x and y must be numeric variables and may be
continuous or discrete. The correlation (r) ranges from −1 to 1, where an r value
closer to 1 represents positive correlation and an r value closer to −1 represents
negative correlation. When r is equal to zero, there is no
correlation or no linear relationship between the two variables. In the case of positive
correlation between x and y, y increases as x increases, whereas in the case of negative
correlation, y decreases as x increases. However, correlation does not contain direction-
ality information, i.e., whether x is triggering the activity of y or vice versa. Pearson
correlation is commonly used to identify similarities between data series. It is sensitive
to linear relationships. Rank correlation is an alternative to Pearson correlation, which
calculates the correlation between data series based on the ranking of values.
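The difference between the two measures can be seen with a short sketch using the base R function cor on a hypothetical data series that increases monotonically but not linearly.

```r
# A monotonic but non-linear relationship (hypothetical values).
x <- 1:20
y <- exp(x / 3)

pearson  <- cor(x, y, method = "pearson")    # measures linear association
spearman <- cor(x, y, method = "spearman")   # rank correlation
c(pearson = pearson, spearman = spearman)
```

Because y always increases with x, the ranks of the two series agree perfectly and the rank correlation equals 1, whereas the Pearson correlation is lower because the relationship is not a straight line.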
Correlation is widely used on transcriptomics data for identifying coexpression
patterns for genes. Another common application is the validation of direct target
genes of miRNAs for integration of epigenetic and transcriptomic data (Wang and
Li 2009). Here we demonstrate how Pearson correlation can be used to identify
coexpressed genes. We have taken the “nki” data set from the “breastCancerNKI”
Bioconductor package. We have reduced the data set to 1000 genes and 100 samples
in order to reduce the computational power and time taken by the “rcorr” function
to calculate the gene pair's correlation and p-value. We have used absolute correla-
tion 0.5 and p-value 0.01 as a cutoff to get the most significant gene correlation
pairs, which are stored in the “correlationresult” object. This object contains four
columns: the first column (GeneA) contains the gene names, the second column
(GeneB) also contains the gene names (where the GeneB expression is correlated
with GeneA), and the third column contains the correlation value and the fourth
column contains the p-value. For the significant gene pairs with a positive
correlation, when the expression of GeneA increases, that of GeneB also increases;
for pairs with a negative correlation, it decreases.
# R-code
# Required library
> install.packages("Hmisc")
# Bioconductor packages are installed with BiocManager, which replaces
# the retired biocLite.R script
> install.packages("BiocManager")
> BiocManager::install("breastCancerNKI")
> BiocManager::install("affy")
> library("breastCancerNKI")
> library("affy")
> library("Hmisc")
# Final result
> correlationresult <- data.frame(GeneA = genea, GeneB = geneb,
Correlation = correlationval, Pvalue = pval)
> dim(correlationresult)
[1] 8422 4
> length(unique(c(correlationresult$GeneA, correlationresult$GeneB)))
[1] 781
> head(correlationresult)
From the result, you can see that there are 781 genes involved in 8422 interactions
(correlated gene pairs). This result is further studied using the network theory
approach in Section 9 to gain more biological insight, for example, which gene is
more connected or associated with other genes. You can then compare the pairs with
coexpression in a normal cohort (reference set) using a database such as COXPRESdb
to confirm them as biomarkers.
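The step that turns an expression matrix into a table of significant gene pairs can be sketched in base R as follows. This is not the “rcorr”-based code of the chapter: it uses simulated data with hypothetical gene names, and it derives the p-values from the t-statistic of the Pearson correlation. The cutoffs (absolute correlation 0.5 and p-value 0.01) are the ones given in the text.

```r
# Simulated expression matrix: 30 samples (rows) x 6 genes (columns).
set.seed(42)
expr <- matrix(rnorm(30 * 6), nrow = 30,
               dimnames = list(NULL, paste0("gene", 1:6)))
expr[, "gene2"] <- expr[, "gene1"] + rnorm(30, sd = 0.3)   # built-in coexpression

n    <- nrow(expr)
r    <- cor(expr, method = "pearson")          # gene x gene correlation matrix
idx  <- which(upper.tri(r), arr.ind = TRUE)    # each gene pair exactly once
rval <- r[idx]
# Two-sided p-value from the t-statistic of a Pearson correlation (n - 2 df).
tval <- rval * sqrt((n - 2) / (1 - rval^2))
pval <- 2 * pt(-abs(tval), df = n - 2)

allpairs <- data.frame(GeneA = colnames(expr)[idx[, "row"]],
                       GeneB = colnames(expr)[idx[, "col"]],
                       Correlation = rval, Pvalue = pval)
# Same cutoffs as in the text: |r| >= 0.5 and p-value <= 0.01.
correlationresult <- allpairs[abs(allpairs$Correlation) >= 0.5 &
                                allpairs$Pvalue <= 0.01, ]
correlationresult
```

The resulting data frame has the same four columns as the “correlationresult” object described above, so it can be passed on to the network analysis in the same way.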
8.6 Clustering
# R-code
> dat <- data[complete.cases(data),]
> plot(fit, cex = 0.3, xlab = '', sub='', axes = F, ylab = '')
Figure 6 shows that there are different clusters of samples using 100 genes. Note
that these genes are chosen randomly. If we choose significant genes for clustering,
then the data set will show good separation. We can replace these 100 genes with
significant genes and validate them using cluster analysis. For example, we can
check the survival patterns in each cluster considering the data set is from treatment
intervention. If the clusters show significant difference in survival, then you can use
those genes as “prognostic biomarkers.”
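Since only fragments of the clustering listing survive above, the following base R sketch shows one way such a sample clustering could be produced. It uses simulated data rather than the “nki” data set, and the distance measure and linkage method are assumptions.

```r
# Simulated expression matrix: 20 samples (rows) x 100 genes (columns),
# with two groups of samples that differ in mean expression.
set.seed(7)
group <- rep(c("A", "B"), each = 10)
expr  <- matrix(rnorm(20 * 100), nrow = 20)
expr[group == "B", ] <- expr[group == "B", ] + 2
rownames(expr) <- paste0(group, rep(1:10, times = 2))

d        <- dist(expr, method = "euclidean")   # distances between samples
fit      <- hclust(d, method = "average")      # agglomerative clustering
clusters <- cutree(fit, k = 2)                 # cut the tree into two clusters

# Cross-tabulate the known groups against the recovered clusters.
table(group, clusters)
```

Calling plot(fit) on the result draws the dendrogram. With informative genes, as simulated here, the two sample groups fall into separate clusters; with randomly chosen genes, the separation would be expected to be much weaker.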
Fig. 6 (a) Dendrogram showing the clusters using the “nki” data set with 100 genes.
(b) Heat map showing the expression pattern of the 100 genes, where samples are
on the y-axis and genes are on the x-axis
theory also has a wide range of applications in other fields like World Wide Web
(WWW), social networking, ecology, political science, and history. The network or
graph contains two attributes: nodes and edges. Nodes are the features, and edges
represent the relationships between the features. The graph can be divided into two
broad classes: directed graph and undirected graph.
Table 3 Example of a matrix which contains four features or nodes (V1, V2, V3, V4) and its
relationships (“0” represents no relationship and “1” represents a relationship between two nodes)
      V1   V2   V3   V4
V1    0    1    0    1
V2    1    0    1    0
V3    0    1    0    1
V4    1    0    1    0
The adjacency matrix contains the features in rows and columns. The values in
the matrix represent the relationships between the vertices (features). The values
can be coded or numerical, depending on the type of relationship between the
features. An example of a matrix is shown in Table 3.
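The matrix in Table 3 can be written down directly in base R. The sketch below builds it, checks that it describes an undirected graph, and derives the node degrees and the edge list; the variable names are illustrative only.

```r
# Adjacency matrix from Table 3 (undirected, unweighted).
nodes <- c("V1", "V2", "V3", "V4")
adj <- matrix(c(0, 1, 0, 1,
                1, 0, 1, 0,
                0, 1, 0, 1,
                1, 0, 1, 0),
              nrow = 4, byrow = TRUE, dimnames = list(nodes, nodes))

is_undirected <- isSymmetric(adj)   # an undirected graph has a symmetric matrix
degree  <- rowSums(adj)             # number of neighbours of each node
n_edges <- sum(adj) / 2             # each undirected edge is counted twice

# Edge list: one row per edge, taken from the upper triangle of the matrix.
edges <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
edge_list <- data.frame(from = nodes[edges[, "row"]],
                        to   = nodes[edges[, "col"]])
edge_list
```

Every node in this example has two neighbours, and the four edges form a closed ring V1–V2–V3–V4–V1.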
A graph or network consists of sets of vertices and a set of edges connecting the
vertices. Two broad classifications of the graphs are directed and undirected graph.
The major difference between them is that there is no direction information avail-
able in undirected graphs. For example, from the edge between V1 and V2 in an
undirected graph, we cannot interpret whether V1 influences V2 or vice versa. This
information is very important in biological networks and is explained in the “types
of biological networks” section in detail.
The network or graph can be sparse or dense depending on the following criteria:
For a graph G = (V, E) with n vertices and m edges, the graph is dense when the
number of relationships is close to the maximum possible number, i.e., m is close to
n^2, and the graph is sparse when the number of relationships is much smaller than
this maximum, i.e., m << n^2. Most of the biological networks are densely connected.
Example R code for the construction of an undirected graph is given below using the
correlation result from Section 8.5. If the data contains direction information, then
the “mode” parameter in the “graph.adjacency” function has to be set to
“directed.” The subset of the graph (“graphtarg”) is created with the selected genes
and their first-order interactions and is shown in Fig. 7. These genes can be replaced
with clinically important genes (single or multiple genes) to find their first-order
coexpressed genes. This will help you to study the specific set of genes and their
first-order interaction.
Fig. 7 Graph plotted for the selected genes and their first-order interaction with other genes. Red
nodes are the selected genes, and yellow nodes are the first-order interaction
# R-code
# Required library
> install.packages("igraph")
> library(igraph)
# To get the 1st order coexpressed genes for the selected or important genes
> select_genes = c("C17orf74", "SOX4", "LRFN2", "SLC16A1", "CDH2",
"CDK7", "GSR")
> graphtarg <- induced.subgraph(graph=graph,vids=unlist(neighborhood
(graph=graph,order=1,nodes=select_genes)))
> V(graphtarg)$color <- "yellow"
> V(graphtarg)[select_genes]$color <- "red"
Graph traversals are required to find the path (direct and indirect path) between two
nodes. Some nodes in the graph are directly connected and others are indirectly con-
nected. There will be a path between an arbitrary node and any other node in the graph,
unless the node does not contain any relationship or there is no edge connected to other
nodes. Two widely used graph traversal approaches are breadth-first search (BFS) and
depth-first search (DFS). Example R code is given below for the depth first search and
breadth first search. You should give the root node for DFS and BFS as an input, from
which it gives the order of traversal, which is stored in “orderdfs” and “orderbfs.”
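The two traversal strategies can be sketched in base R on a small adjacency list. This is not the igraph-based listing of the chapter; the graph and node names are hypothetical, and only the result names “orderbfs” and “orderdfs” follow the text.

```r
# Undirected example graph as an adjacency list (hypothetical node names).
adj_list <- list(A = c("B", "C"), B = c("A", "D"), C = c("A", "D"),
                 D = c("B", "C", "E"), E = "D")

# Breadth-first search: visit nodes level by level using a queue.
bfs_order <- function(adj_list, root) {
  visited <- root
  queue <- root
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    for (nb in adj_list[[node]]) {
      if (!(nb %in% visited)) {
        visited <- c(visited, nb)
        queue <- c(queue, nb)
      }
    }
  }
  visited
}

# Depth-first search: follow each branch to its end before backtracking.
dfs_order <- function(adj_list, root) {
  visited <- character(0)
  visit <- function(node) {
    visited <<- c(visited, node)
    for (nb in adj_list[[node]]) {
      if (!(nb %in% visited)) visit(nb)
    }
  }
  visit(root)
  visited
}

orderbfs <- bfs_order(adj_list, "A")
orderdfs <- dfs_order(adj_list, "A")
```

Starting from the root “A”, the breadth-first order is A, B, C, D, E, whereas the depth-first order is A, B, D, C, E, because DFS follows the branch through B and D before returning to C.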
# R-code
Two network properties commonly used to study the structure of biological net-
Shortest Path This is used to find the shortest path connecting two nodes in a net-
work, and the inference will depend on the type of network. If you are studying a
gene coexpression network, then the shortest path between two genes will indicate
the coexpression patterns and the intermediate players. This will tell you which
genes mediate the association between your query genes. There may be multiple
paths available between two genes, but the shortest one has the most significance.
The shortest path can be identified in both directed and undirected graphs. Similarly,
the shortest path analysis can be used to find the toxicity and mode of action in the
drug discovery process. If we know the drug target (gene1) and end-point biomarker
(gene2), we can use the gene regulatory network to find the shortest path connecting
the target and end-point biomarker.
The genes which are involved in connecting the target and end-point bio-
marker can be used for functional analysis to find the toxicity. Example R code
is given below, where the target is “MDM4” gene (similarly, multiple targets
can also be used) and the end-point biomarker is “SLC5A2.” The genes con-
necting (“SOX4,” “CKAP2L,” and “TAF4”) these two can be further studied
using functional analysis. The shortest path between “MDM4” and “SLC5A2”
is shown in Fig. 8.
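For an unweighted network, the shortest path between a target and an end-point biomarker can be found with a breadth-first search that records, for every node, the node from which it was first reached. The base R sketch below uses a hypothetical network with placeholder node names; it does not reproduce the gene network behind Fig. 8.

```r
# Hypothetical undirected gene network as an adjacency list.
network <- list(
  target   = c("g1", "g2"),
  g1       = c("target", "g3"),
  g2       = c("target", "endpoint"),
  g3       = c("g1", "g4"),
  g4       = c("g3", "endpoint"),
  endpoint = c("g2", "g4")
)

# Breadth-first search that stores the predecessor of every visited node.
shortest_path <- function(network, from, to) {
  parent  <- setNames(rep(NA_character_, length(network)), names(network))
  visited <- from
  queue   <- from
  while (length(queue) > 0) {
    node  <- queue[1]
    queue <- queue[-1]
    if (node == to) break
    for (nb in network[[node]]) {
      if (!(nb %in% visited)) {
        visited    <- c(visited, nb)
        parent[nb] <- node
        queue      <- c(queue, nb)
      }
    }
  }
  if (!(to %in% visited)) return(character(0))   # no path exists
  # Walk back from the end point to the start using the predecessor links.
  path <- to
  while (path[1] != from) path <- c(parent[[path[1]]], path)
  path
}

path <- shortest_path(network, "target", "endpoint")
intermediates <- path[-c(1, length(path))]   # genes connecting the two
```

Here the shortest route runs through “g2” only, although a longer route through “g1,” “g3,” and “g4” also exists. The intermediate genes are the candidates for further functional analysis.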
[Fig. 8 Shortest path between “MDM4” and “SLC5A2”; node labels recovered from the figure: MDM4, TAF4, CKAP2L, SOX4]
# R-code
Centrality In any type of biological network analysis, a key goal is to identify
the features that are the most critical and control the behavior of the biological
system. These will be the most important components of the mechanistic model.
For this purpose, centrality analysis is used. This will provide you network informa-
tion such as which genes are essential for survival, which are the housekeeping
genes, or which molecular level properties are the most critical for phenotype devel-
opment. Example R code is given below to calculate the centrality measures such as
degree, closeness, betweenness, and eigenvector.
These measures will help you to rank the genes which are more central or impor-
tant in the whole network.
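Two of these measures, degree and eigenvector centrality, can be computed directly from an adjacency matrix in base R. The sketch below uses a hypothetical five-gene network and does not rely on igraph, which provides ready-made functions such as degree(), closeness(), and betweenness().

```r
# Hypothetical undirected network: "hub" is linked to every other gene.
genes <- c("hub", "g1", "g2", "g3", "g4")
adj <- matrix(0, 5, 5, dimnames = list(genes, genes))
adj["hub", c("g1", "g2", "g3", "g4")] <- 1
adj["g1", "g2"] <- 1
adj <- adj + t(adj)   # mirror the entries to make the matrix symmetric

# Degree centrality: number of direct neighbours.
degree <- rowSums(adj)

# Eigenvector centrality: leading eigenvector of the adjacency matrix,
# rescaled so that the largest score is 1.
ev <- eigen(adj, symmetric = TRUE)
eigen_centrality <- abs(ev$vectors[, 1])
eigen_centrality <- setNames(eigen_centrality / max(eigen_centrality), genes)

# Rank the genes from most to least connected.
ranking <- names(sort(degree, decreasing = TRUE))
```

The hub gene ranks first under both measures. Eigenvector centrality additionally separates “g1” and “g2,” which are linked to each other as well as to the hub, from “g3” and “g4,” which are linked to the hub only.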
databases that contain the protein interactions for different model organisms, such
as IntAct, STRING, BioGRID, and Reactome. These databases use publicly avail-
able literature (bibliomics) to construct the protein–protein interactions. These data
can be integrated along with the experimental data to obtain reliable networks.
The different types of biological networks can be integrated and modeled to extract
the knowledge for the specific phenotype (Qabaja et al. 2014). This requires the
integration of multiple levels of omics data as a network for knowledge discovery.
The network-driven knowledge can be used in the drug discovery and development
process to understand the disease mechanism, to find promising targets or
interventions, and also to validate them using in silico knockout experiments.
Using these networks, we can also test the target's efficacy and toxicity using func-
tional or pathway enrichment. Similarly, they can be used to identify specific molec-
ular processes resulting in improved animal productivity.
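As a rough illustration of an in silico knockout, one purely topological approach is to remove a gene from the network and check how the connection between a target and a biomarker changes. The sketch below assumes the igraph package and a hypothetical toy network (GENE_A to GENE_D are placeholder names, and `path_length` is a helper introduced here for illustration). It is not a substitute for a dynamic or mechanistic knockout model.

```r
# Minimal sketch; assumes the igraph package is installed.
library(igraph)

# Hypothetical toy network (illustrative only, not real interaction data).
edges <- data.frame(
  from = c("MDM4", "TAF4", "CKAP2L", "SOX4",
           "MDM4", "GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  to   = c("TAF4", "CKAP2L", "SOX4", "SLC5A2",
           "GENE_A", "GENE_B", "GENE_C", "GENE_D", "SLC5A2")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Number of steps between two genes; Inf means they are disconnected.
path_length <- function(graph, source, target) {
  distances(graph, v = source, to = target)[1, 1]
}

before <- path_length(g, "MDM4", "SLC5A2")

# Knock out one gene by deleting its vertex and all of its edges.
g_ko  <- delete_vertices(g, "SOX4")
after <- path_length(g_ko, "MDM4", "SLC5A2")

cat("Path length before knockout:", before, "\n")
cat("Path length after knockout: ", after, "\n")
```

A longer path, or an infinite distance, after the knockout suggests that the removed gene sits on an important route between the target and the biomarker in this network.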
10 Summary
Acknowledgment We would like to thank Santhosh Babu Gandham, Krittika Ghosh, Sushmita
Mookherjee, Sakthi Ganesh and all the other members of the iOMICS team for helping us to pub-
lish this chapter.
References
Adhil M, Gandham S, Talukder AK, Agarwal M, Prahalad HA (2015) CuraEx – clinical expert
system using big-data for precision medicine. In: Kumar N, Bhatnagar V (eds) Big data
analytics. Proceedings of 4th international conference, BDA 2015, Hyderabad, 15–18 Dec
2015. Springer International Publishing, Cham, pp 216–227
Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207
[PMID: 12634793]
Agarwal M, Adhil M, Talukder AK (2015) Multi-omics multi-scale big data analytics for cancer
genomics. In: Kumar N, Bhatnagar V (eds) Big data analytics. Proceedings of 4th
international conference, BDA 2015, Hyderabad, 15–18 Dec 2015. Springer International
Publishing, Cham, pp 228–243
Hesser LF (2006) The man who fed the world: Nobel Peace Prize laureate Norman Borlaug and his
battle to end world hunger: an authorized biography. Durban House Pub Co Inc
Hucka M et al (2003) The systems biology markup language (SBML): a medium for representation
and exchange of biochemical network models. Bioinformatics 19(4):524–531
[PMID: 12611808]
Kohl M, Megger DA, Trippler M, Meckel H et al (2014) A practical data processing workflow for
multi-OMICS projects. Biochim Biophys Acta 1844:52–62. doi:10.1016/j.bbapap.2013.02.029
[PMID: 23501674]
Letovsky S (1999) Bioinformatics: databases and systems. Springer Science & Business Media
McGuinness DL, Van Harmelen F (2004) OWL web ontology language overview. W3C
Recommendation 10.10, 2004
Mesiti M, Jiménez-Ruiz E, Sanz I et al (2009) XML-based approaches for the integration of het-