
Johann Bernoulli Institute for Mathematics and Computer Science, University of Groningen

9 October 2015

Presentation by: Ahmad Alsahaf

• Research collaborator at the Hydroinformatics Lab, Politecnico di Milano
• MSc in Automation and Control Engineering
A Long History of Animal Breeding

First step in animal breeding: domestication.

Breeding for desirable traits:
• Longevity, fertility, more wool, milk, eggs, meat, etc.
• Faster race horses, better hunting dogs, cuter kittens.
Early History

Selective breeding relied on observable traits and human intuition.

1900's: Rediscovery of Mendel's laws of inheritance (Gregor Mendel, 1822-1884).
        The biometrician Karl Pearson (1854-1936) and the rejection of Mendel's laws.

1930's: "Animal Breeding Plans", the 1937 book by Lush: the first application of statistics and quantitative genetics to animal breeding (cattle).

1940's: Artificial insemination became common practice in dairy cattle.
The Dairy Cattle Example

• One of the sectors of the animal industry that benefited most from selective breeding and from the use of data.
• Pedigree records have been kept well.
• Few and easily measurable traits (milk/protein/fat yields, feed efficiency).
• Bulls deemed good can be fully utilized.
• Advanced artificial insemination technology.

The Holstein Friesian dairy cattle breed
Genotype vs. Phenotype

Progeny Testing

Test bull, genetic information NOT available:
Artificial insemination → bull's milk-producing daughters → measure the quality of the milk → determine the economic value of the bull.

Test bull, genetic information now AVAILABLE (50,000-70,000 genetic markers):
Artificial insemination → bull's milk-producing daughters → measure the quality of the milk → determine the economic value of the bull.
Machine Learning Examples

1. Using classification models (supervised learning) to detect problems in artificial insemination.

Grzesiak, Wilhelm, et al. "Detection of cows with insemination problems using selected classification models." Computers and Electronics in Agriculture 74.2 (2010): 265-273.
Inputs, in 1,200 cows (genetic information, nominal phenotypes, categorical phenotypes, environmental factors):
• Lactation number
• % HF genome
• Sex of the calf
• Age of cow
• AI season
• Health metric
• % of fat/protein in milk

Classifiers: linear classifiers, logistic regression, artificial neural networks, multivariate adaptive regression splines.

Output: "good cow" vs. "bad cow".

The classification outcome has logistical and economical implications: false positives vs. false negatives.
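The setup of such a supervised classifier can be sketched with scikit-learn. The feature names follow the slide, but the values and "good/bad" labels below are random stand-ins, not the study's data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1200  # number of cows, as in the study

# Synthetic stand-ins for the predictors listed on the slide
X = np.column_stack([
    rng.integers(1, 8, n),        # lactation number
    rng.uniform(50, 100, n),      # % HF genome
    rng.integers(0, 2, n),        # sex of the calf
    rng.uniform(2, 10, n),        # age of cow (years)
    rng.integers(0, 4, n),        # AI season
    rng.uniform(0, 1, n),         # health metric
    rng.uniform(2, 6, n),         # % of fat/protein in milk
])
y = rng.integers(0, 2, n)         # 1 = "good cow", 0 = "bad cow" (random here)

# Hold out some cows to estimate classification accuracy
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```

With random labels the accuracy hovers near chance; on real insemination records the same pipeline would reveal how separable "good" and "bad" cows are.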
Machine Learning Examples

2. Clustering dairy cows based on their phenotypic traits.

"Principal component and clustering analysis of functional traits in Swiss dairy cattle." Turk. J. Vet. Anim. Sci. (2008).

3. Prediction of insemination outcome.

Shahinfar, Saleh, et al. "Prediction of insemination outcomes in Holstein dairy cattle using alternative machine learning algorithms." Journal of Dairy Science (2014).

4. Predicting the lactation yield of dairy cows using multiple regression or neural networks.

Grzesiak, W., et al. "A comparison of neural network and multiple regression predictions for 305-day lactation yield using partial lactation records." Canadian Journal of Animal Science (2003).

Phenotype-phenotype prediction studies:

#  Data                                      ML methods
1  10 phenotypes and environmental factors   ANN, logistic regression
2  5 phenotypes                              Hierarchical clustering, PCA
3  26 phenotypes and environmental factors   Naïve Bayes, decision trees
4  7 phenotypes                              ANN, multiple regression
Machine Learning with High Dimensional Genetic Data

Genome Wide Association Studies

• SNPs (single nucleotide polymorphisms): units of genetic variation, or genetic markers.

• The goal is to associate an SNP (or several) with a phenotype, e.g. a disease.

• This is typically done by GWAS (Genome Wide Association Studies): which SNPs (or other markers) occur frequently within a population that has the trait of interest?
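A minimal sketch of the per-marker association test behind a GWAS: each SNP's 0/1/2 genotype counts are cross-tabulated against trait status and tested with a chi-square test. The genotypes and trait here are simulated for illustration, not from any cited study:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n_animals, n_snps = 200, 50
genotypes = rng.integers(0, 3, size=(n_animals, n_snps))  # minor-allele counts 0/1/2
trait = rng.integers(0, 2, n_animals)                      # has the trait or not

p_values = []
for j in range(n_snps):
    # Contingency table: genotype class (0, 1, 2) vs. trait status (0, 1)
    table = np.zeros((3, 2))
    for g, t in zip(genotypes[:, j], trait):
        table[g, t] += 1
    _, p, _, _ = chi2_contingency(table)
    p_values.append(p)

# Markers most associated with the trait: smallest p-values
top = np.argsort(p_values)[:5]
print(top)
```

Real GWAS additionally correct for multiple testing (tens of thousands of markers) and for population structure; this sketch shows only the core association step.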
Machine Learning with High Dimensional Genetic Data

Why Machine Learning?

• Quantitative traits (e.g. milk yield, disease, longevity) are controlled by multiple markers.

• Machine Learning can associate multiple genetic markers to a phenotype AND find
complex interactions between markers.

• Machine Learning can facilitate dealing with redundant and irrelevant variables.
Example: From Genotype to Milk Yield

1. Using Neural Networks

Gianola, Daniel, et al. "Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat." BMC Genetics 12.1 (2011): 87.

Input: n = 297 cows, p = 35,798 SNPs (i.e. a small-n, large-p problem).
Output: milk yield, protein yield, fat yield.

Dealing with dimensionality:
• Bayesian regularized back-propagation, commonly used to avoid overfitting in BP.
• 297 variables derived from the original 35,798:
  § using genome-derived (SNP) relationships between the cows as inputs instead of the SNPs themselves;
  § by constructing a matrix of genomic relationships that is analogous to a covariance matrix and is based on allele frequencies in the population.

Results (figures): effective number of parameters; mean squared error of the predictions.
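The genomic-relationship reduction described above is commonly done VanRaden-style, G = ZZ'/(2 Σ pᵢ(1 − pᵢ)), with Z the allele-frequency-centered genotype matrix; the paper's exact construction may differ. A sketch with simulated genotypes (p reduced from 35,798 for speed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 297, 1000  # cows x SNPs
M = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 allele counts

# Allele frequencies per SNP, then center the genotype matrix
freq = M.mean(axis=0) / 2.0
Z = M - 2.0 * freq

# Genomic relationship matrix, analogous to a covariance matrix,
# scaled by the allele-frequency heterozygosity term
G = Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

print(G.shape)  # (297, 297): one relationship value per pair of cows
```

The n x n matrix G then replaces the n x p SNP matrix as the network input, turning 35,798 predictors into 297.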


Example: Genotype to Feed Efficiency

1. Using Random Forests (Decision Trees)

Yao, Chen, et al. "Random Forests approach for identifying additive and epistatic single nucleotide polymorphisms associated with residual feed intake in dairy cattle." Journal of Dairy Science 96.10 (2013).

Input: n = 395 Holstein cows, p = 42,275 SNPs (i.e. a small-n, large-p problem).
Output: residual feed intake of the cow, adjusted for environmental and external factors.

Methods:
• Decision trees: predictive models with a tree structure based on if-else statements; at each node, pick the best split (the best question to ask).
• Random Forests algorithm (an ensemble method): the output is the averaged outcome of all the weak learners (decision trees) in the ensemble.

Dealing with dimensionality:
• Bootstrapping for each tree in the forest.
• At each node of each tree, choosing the best split out of a subset of the p variables (e.g. 100 or 1,000), not all of them.

Results and findings:
• Ranking SNPs according to their importance to the phenotype output (the implicit feature-ranking capability of decision trees).
• Identifying pairs of epistatic genes through the RF structure, as they tend to fall into the same branches of trees (parent-child).
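The two Random Forests ingredients above (a bootstrap sample per tree, a random subset of variables at each split) and the implicit feature ranking can be sketched as follows. The genotypes are simulated with two planted "causal" SNPs; they are not the study's data, and p is reduced from 42,275:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n, p = 395, 2000  # cows x SNPs
X = rng.integers(0, 3, size=(n, p)).astype(float)
# Residual feed intake, here driven by two planted SNPs plus noise
y = 1.5 * X[:, 10] - 1.0 * X[:, 20] + rng.normal(0, 0.5, n)

rf = RandomForestRegressor(
    n_estimators=200,
    max_features=100,   # each split considers only 100 of the p variables
    bootstrap=True,     # each tree sees a bootstrap sample of the cows
    random_state=0,
).fit(X, y)

# Implicit feature ranking: importance of each SNP to the phenotype
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5])  # the planted SNPs (10 and 20) should rank near the top
```

Epistasis detection, as in the paper, goes further: it inspects the fitted trees for SNP pairs that repeatedly occur on the same parent-child branches.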
Dipartimento di Elettronica, Informazione e Bioingegneria
Master of Science in Automation and Control Engineering – Dec 2014

Master thesis by: Ahmad Alsahaf
Supervisor: Andrea Castelletti
Co-supervisor: Stefano Galelli
Co-supervisor: Matteo Giuliani
What is Model-Order Reduction (Emulation Modelling)?

Such that:
§ the emulator is less computationally intensive than the PB model;
§ the input-output behaviour accurately reproduces the PB model's behaviour;
§ the emulator is "credible" from the user/analyst's point of view (physically interpretable).
Recursive Variable Selection – a feature selection algorithm

Candidate variables: state variables, exogenous inputs, control variables.
Target: the output variable.
Selection threshold: >2% (variable contribution).
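The slides do not spell out the algorithm, so the following is only a plausible sketch: tree-based importances are recomputed recursively, and candidate variables contributing less than the 2% threshold are dropped. Both the interpretation of ">2%" as a retention criterion and the use of Extra-Trees importances are assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(4)
n, p = 500, 30  # samples x candidate variables (states, exogenous inputs, controls)
X = rng.normal(size=(n, p))
# Output variable: depends on a few candidates (0, 3, 7) plus noise
y = 2.0 * X[:, 0] + X[:, 3] + 0.5 * X[:, 7] + rng.normal(0, 0.1, n)

selected = list(range(p))
while True:
    model = ExtraTreesRegressor(n_estimators=100, random_state=0)
    model.fit(X[:, selected], y)
    imp = model.feature_importances_  # importances sum to 1
    # Drop variables contributing <= 2% of total importance, then re-fit
    keep = [v for v, w in zip(selected, imp) if w > 0.02]
    if len(keep) == len(selected):
        break
    selected = keep

print(sorted(selected))  # variables retained for the emulator
```

Re-fitting after each pruning pass matters: importance mass redistributes over the survivors, so a variable masked by a stronger correlate can still emerge in a later iteration.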
PCA vs. Sparse PCA – coefficients heat map

PCA vs. Sparse and Weighted PCA – emulator performance

[Plot: emulator performance (explained variance, R² from -0.2 to 1) for PCA, WPCA, and SPCA as a function of the number of principal components (1-20).]

Emulator structure: Extra-Trees (Geurts et al., 2006)

Ref: Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3-42.
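Combining the two pieces above, PCA for dimensionality reduction and an Extra-Trees emulator, on simulated low-rank data gives a feel for the performance-vs-components curve. This uses standard PCA only (the weighted and sparse variants are not reproduced), and the data stand in for PB-model snapshots:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, p = 400, 50  # simulated PB-model snapshots x variables

# Low-rank structure: a few latent modes drive both the variables and the output
latent = rng.normal(size=(n, 5))
loadings = rng.normal(size=(5, p))
X = latent @ loadings + 0.05 * rng.normal(size=(n, p))
y = latent[:, 0] + 0.5 * latent[:, 1] + rng.normal(0, 0.05, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (2, 5, 10, 20):
    pca = PCA(n_components=k).fit(X_tr)
    emulator = ExtraTreesRegressor(n_estimators=100, random_state=0)
    emulator.fit(pca.transform(X_tr), y_tr)
    r2 = emulator.score(pca.transform(X_te), y_te)  # emulator R² on held-out data
    print(k, round(r2, 2))
```

As in the plot, R² climbs steeply while the components capture the driving modes and then flattens; once the latent dimension is covered, extra components add little.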
