ESF Genomic Resources Summer School
October 1-7, 2012, Pag Island, Croatia

Livestock Conservation Genomics:
Data, Tools and Trends

Day 3 – Practical workshop:
Phasing and imputation
Gregor Gorjanc
Biotechnical Faculty, Department of Animal Science
University of Ljubljana, Slovenia
October 1-7, 2012, Pag Island, Croatia

Background
• Introduced to phasing and imputation via genomic
selection = genome-wide SNP data used for genomic
prediction

• Collaboration with John Hickey

Whole-genome sequence
haplotypes (human trios) – Chr1

Red - paternal grandfather
Magenta – maternal grandfather
Blue - paternal grandmother
Green – maternal grandmother
Scale – number of variants (1000 variants / Mb)

Roach et al (2011)

Blue – paternal
Brown – maternal
Dark – grandfather
Light – grandmother

Roach et al (2011)

Haplotype 1: h_{i,1} = (A-C-A-T)
Haplotype 2: h_{i,2} = (G-C-A-G)
Genotype: g_i = (AG, CC, AA, TG)

• Haplotype = combination of alleles on chr. (over loci)
• Genotype = combination of alleles at locus (over chr.)
• Diplotype = combination of haplotypes
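
The three definitions above can be illustrated with a minimal Python sketch that pairs the alleles of the two example haplotypes locus by locus. The function name `genotype_from_haplotypes` is an illustrative assumption, not part of any phasing program.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Pair the alleles of two haplotypes locus by locus (a diplotype -> genotype)."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same loci")
    return [a + b for a, b in zip(hap1, hap2)]

# Example haplotypes from the slide
h1 = ["A", "C", "A", "T"]
h2 = ["G", "C", "A", "G"]

genotype = genotype_from_haplotypes(h1, h2)
print(genotype)
```

Going in the other direction, from genotype to haplotypes, is the phasing problem: the genotype alone does not say which allele sits on which chromosome.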

Genotyping platforms
• We obtain genotypes separately for each locus
→ cannot resolve haplotypes directly

• With n SNP loci there are 2^n possible haplotypes

1 locus: 2^1 = 2 haplotypes (0, 1 alleles)
2 loci: 2^2 = 4 haplotypes (0-0, 0-1, 1-0, 1-1)
3 loci: 2^3 = 8 haplotypes (0-0-0, 0-0-1, 0-1-1, …)

10 loci: 2^10 = 1024 haplotypes
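
The counts above can be checked with a short Python sketch that enumerates all haplotypes of biallelic SNPs; the helper name `enumerate_haplotypes` is assumed for illustration only.

```python
from itertools import product

def enumerate_haplotypes(n_loci):
    """List every possible haplotype over n_loci biallelic SNPs (alleles 0 and 1)."""
    return ["-".join(alleles) for alleles in product("01", repeat=n_loci)]

for n in (1, 2, 3, 10):
    print(n, "loci:", len(enumerate_haplotypes(n)), "haplotypes")

print(enumerate_haplotypes(2))
```

The exponential growth is why practical methods never enumerate haplotypes over a whole chromosome and instead work locally or with haplotypes actually observed in the population.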

Classes of methods for phasing
(=„haplotyping“)
• Molecular analysis
isolation of single molecule and sequencing
→ not completely there yet
• Population/statistical analysis
linkage disequilibrium – short regions
• Genetic analysis
genetic principles (simple rules, linkage) – longer regions
• Parsimony analysis
(minimum number of recombinations)
• Pooled DNA
• …

Methods & Programs
• Very active area of developments: several methods,
programs, data formats, options, tricks, …
• Programs: fastPHASE, Beagle, MACH, IMPUTE2, FIMPUTE,
findhap.f90, PHASEBOOK, Alpha{Phase, Impute}, LDMIP, …
• Some review papers:
– Liu et al (2009) Ann Rev Genomics Hum Genet, 10, 387-406
– Marchini & Howie (2010) Nat Rev Genet, 11, 499-511
– Browning & Browning (2011) Nat Rev Genet, 12, 703-714

LD based modelling → HMM
• Hidden Markov Model (HMM) – inferring haplotype
cluster (or reference haplotype) membership locally
along the chromosome (clustering each marker)
allowing for block-like structure and gradual decline of
LD with increasing distance
• Examples
– fastPHASE
– IMPUTE2

Scheet & Stephens (2006)

Beagle model (also LD based)
• Locally variable number of clusters (haplotypes) - K

Browning & Browning (2010)

Long-range phasing
• A fast rule based phasing
method
• Basically a pedigree free linkage approach
• Completely unrelated animals contribute phasing
information
• Even animals from a different breed can contribute
Kong et al (2008); Hickey et al (2011)

What underlies a genotype?
(coded as allele dosage)
Father:  11110222111111111111121021
Mother:  10111121211121212211221121
Proband: 10110122121110212101220022

What underlies a genotype?
(coded as allele dosage)
Father
Pat Hap:  10100111011100111001110011
Mat Hap:  01010011100011000110011010
Genotype: 11110222111111111111121021

Mother
Pat Hap:  00010011110010101100110011
Mat Hap:  10101110101111111111111110
Genotype: 10111121211121212211221121

Proband
Pat Hap:  10100111011100111001110011
Mat Hap:  00010011110010101100110011
Genotype: 10110122121110212101220022
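
The allele dosage coding can be reproduced with a minimal Python sketch: each genotype is the locus-wise sum of the two haplotypes it is made of. The haplotype strings are those shown for the proband and the mother; the function name is an assumption for illustration.

```python
def dosage_from_haplotypes(pat_hap, mat_hap):
    """Sum two 0/1 haplotype strings locus by locus into a 0/1/2 dosage string."""
    if len(pat_hap) != len(mat_hap):
        raise ValueError("haplotypes must have the same number of loci")
    return "".join(str(int(p) + int(m)) for p, m in zip(pat_hap, mat_hap))

# Proband: one haplotype from the father, one from the mother
proband_genotype = dosage_from_haplotypes(
    "10100111011100111001110011",
    "00010011110010101100110011",
)
# Mother: her own two haplotypes
mother_genotype = dosage_from_haplotypes(
    "00010011110010101100110011",
    "10101110101111111111111110",
)
print(proband_genotype)
print(mother_genotype)
```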

Phasing a Trio
Father
Pat Hap:  10100111011100111001110011
Mat Hap:  01010011100011000110011010
Genotype: 11110122111111111111121021

Mother
Pat Hap:  00010011110010101100110011
Mat Hap:  10101110101111111111111110
Genotype: 10111121211121212211221121

Proband
Pat Hap:  10100111011100111001110011
Mat Hap:  00010011110010101100110011
Genotype: 10110122121110212101220022

Can not phase this locus!!
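
A minimal Python sketch of the simple trio rule, assuming dosage-coded genotypes without errors or missing values: at each locus we list the alleles each parent could transmit and keep the combination that adds up to the proband's genotype. If two combinations fit (all three individuals heterozygous), the locus cannot be phased from the trio alone. The function names are illustrative assumptions, not the interface of any program.

```python
def phase_trio_locus(father, mother, child):
    """Return (paternal allele, maternal allele) of the child, or None if ambiguous."""
    transmissible = {0: (0,), 1: (0, 1), 2: (1,)}  # alleles a parent can pass on
    options = [(p, m)
               for p in transmissible[father]
               for m in transmissible[mother]
               if p + m == child]
    if not options:
        raise ValueError("Mendelian inconsistency at this locus")
    return options[0] if len(options) == 1 else None

def phase_trio(father, mother, child):
    """Apply the locus rule along dosage strings of equal length."""
    return [phase_trio_locus(int(f), int(m), int(c))
            for f, m, c in zip(father, mother, child)]

# Genotypes from the slide above
phased = phase_trio("11110122111111111111121021",
                    "10111121211121212211221121",
                    "10110122121110212101220022")
unphased_loci = [i + 1 for i, alleles in enumerate(phased) if alleles is None]
print("Loci that the trio cannot phase:", unphased_loci)
```

Loci left unphased here are exactly where additional relatives, or surrogate parents, are needed.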

Surrogate parents are the drivers of
long range phasing
Proband
Pat Hap:   10100111011100111001110011
Mat Hap:   00010011110010101100110011
Genotype:  10110122121110212101220022

Father:
Pat Hap:   10100111011100111001110011
Mat Hap:   01010111100011000110011010
Genotype:  11110222111111111111121021
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Mother:
Pat Hap:   00010011110010101100110011
Mat Hap:   10101110101111111111111110
Genotype:  10111121211121212211221121
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Other:
Pat Hap:   10111110101011111100111110
Mat Hap:   10101110010110110000111110
Genotype:  20212220111121221100222220
Proband G: 10110122121110212101220022
Opp Homo:  ****X**X**************XX*X
Not a surrogate parent!

Surrogate parents are the drivers of
long range phasing
Proband
Pat Hap:   10100111011100111001110011
Mat Hap:   00010011110010101100110011
Genotype:  10110122121110212101220022

Father:
Pat Hap:   10100111011100111001110011
Mat Hap:   01010111100011000110011010
Genotype:  11110222111111111111121021
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Mother:
Pat Hap:   00010011110010101100110011
Mat Hap:   10101110101111111111111110
Genotype:  10111121211121212211221121
Proband G: 10110122121110212101220022
Opp Homo:  **************************

Other:
Pat Hap:   10100111011100111001110011
Mat Hap:   10101110101010101011100110
Genotype:  20201221112110212012210121
Proband G: 10110122121110212101220022
Opp Homo:  **************************
A surrogate parent!
(Even without pedigree information)
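
The "Opp Homo" rows can be reproduced with a short Python sketch. An opposing homozygote is a locus where one individual has dosage 0 and the other dosage 2, so the two cannot share a haplotype there. The function names and the `max_opposing` threshold are assumptions for illustration; real programs may apply the test within cores and tolerate some genotyping error.

```python
def opposing_homozygotes(genotype_a, genotype_b):
    """Mark loci with opposing homozygotes (0 vs 2) by 'X', all other loci by '*'."""
    return "".join("X" if {a, b} == {"0", "2"} else "*"
                   for a, b in zip(genotype_a, genotype_b))

def is_surrogate_parent(candidate, proband, max_opposing=0):
    """A candidate qualifies if it shows no (or very few) opposing homozygotes."""
    return opposing_homozygotes(candidate, proband).count("X") <= max_opposing

proband = "10110122121110212101220022"
other_1 = "20212220111121221100222220"  # 'Other' from the first slide
other_2 = "20201221112110212012210121"  # 'Other' from the second slide

print(opposing_homozygotes(other_1, proband), is_surrogate_parent(other_1, proband))
print(opposing_homozygotes(other_2, proband), is_surrogate_parent(other_2, proband))
```

Note that the test uses genotypes only, which is why a surrogate parent can be found without pedigree information.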

Phasing a Trio
Surrogate Father
Could be a female
Could be a descendant
Could be many generations distant
Can be ‘unrelated’
Pat Hap:  10100111011100111001110011
Mat Hap:  10101110101010101011100110
Genotype: 20201221112…

Mother
Pat Hap:  00010011110010101100110011
Mat Hap:  10101110101111111111111110
Genotype: 10111121211121212211221121

Proband
Pat Hap:  10100111011100111001110011
Mat Hap:  00010011110010101100110011
Genotype: 10110122121110212101220022

Can now phase this locus!
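
A minimal sketch of how a surrogate father can resolve heterozygous loci of the proband, assuming the surrogate really shares the proband's paternal haplotype over the whole region and that there are no genotyping errors. Where the surrogate is homozygous, the shared haplotype must carry that allele. The function name is hypothetical, and the surrogate genotype is the full string shown on the earlier surrogate slide.

```python
def paternal_alleles_from_surrogate(proband, surrogate_father):
    """Infer the proband's paternal haplotype; '?' marks loci that stay unresolved."""
    paternal = []
    for g, s in zip(proband, surrogate_father):
        if g in "02":            # proband homozygous: both haplotypes are known
            paternal.append("0" if g == "0" else "1")
        elif s in "02":          # surrogate homozygous: the shared haplotype carries this allele
            paternal.append("0" if s == "0" else "1")
        else:                    # both heterozygous: no information at this locus
            paternal.append("?")
    return "".join(paternal)

proband = "10110122121110212101220022"
surrogate = "20201221112110212012210121"
inferred_pat = paternal_alleles_from_surrogate(proband, surrogate)
print(inferred_pat)
```

Loci still marked `?` would need further surrogates, which is why many surrogates per proband are useful.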

Long Range Phasing
• Erdös 1 surrogates are surrogates of the proband.
• Erdös n+1 surrogates are surrogates of Erdös n
surrogates of the proband.

[Figure: haplotypes of the Proband, its Father and Mother, Erdös 1 Surrogate Fathers, Erdös 1 Surrogate Mothers and Erdös 2 Surrogate Mothers. Labels: Surrogate giving phase information; Potentially many meioses separating.]

Kong et al (2008); Hickey et al (2011)

Haplotype library imputation
• Build library of all completely phased
haplotypes

10100111011100111001110011
10100000010000100011110011
10110011001100111001110011

10100111011001001001110011

• Find haplotypes in the library which can
explain an individual's genotype

10100101011100111001110011
111110111011100111001110011
10100100000000111001110011
10100111001100111001110001

• Low error rates
• Computationally fast

• Useful for extremely large data sets
– Strategic use
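
The library idea can be sketched in a few lines of Python: search the library for pairs of haplotypes whose sum agrees with every observed genotype call, and if exactly one pair fits, use it to fill in the missing calls. The tiny library and genotype below are made-up toy values, and the function names are assumptions, not the interface of any program.

```python
from itertools import combinations_with_replacement

def explaining_pairs(genotype, library):
    """Return haplotype pairs from the library consistent with the observed calls ('?' = missing)."""
    pairs = []
    for h1, h2 in combinations_with_replacement(library, 2):
        if all(g == "?" or int(g) == int(a) + int(b)
               for g, a, b in zip(genotype, h1, h2)):
            pairs.append((h1, h2))
    return pairs

def impute_from_pair(h1, h2):
    """Full dosage genotype implied by a pair of haplotypes."""
    return "".join(str(int(a) + int(b)) for a, b in zip(h1, h2))

toy_library = ["0011", "1010", "1100"]   # hypothetical phased haplotypes
toy_genotype = "10?1"                     # hypothetical genotype with one missing call

matches = explaining_pairs(toy_genotype, toy_library)
if len(matches) == 1:
    print(impute_from_pair(*matches[0]))
```

Trying all pairs scales with the square of the library size, so real implementations restrict the search, for example to short cores or to haplotypes carried by relatives.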

Practical - Phasing
1. Get AlphaPhase distribution package (either)
a) https://sites.google.com/site/hickeyjohn/alphaphase
b) DropBox (software directory)

2. Unpack the distribution package to your working
directory
Distribution
|-- Examples
|   |-- PhasingWithoutPedigreeInformation
|   |   |-- GenotypeFormat
|   |   `-- UnorderedFormat
|   |-- PhasingWithPedigreeInformation
|   |   |-- GenotypeFormat
|   |   `-- UnorderedFormat
|   `-- SimulatedScenario
|-- LinuxExecutable
|-- MacOSXexecutable
`-- WindowsExecutable

Practical – Phasing II
3. Open terminal (Windows: Menu Start, Run, cmd) and
change to the following directory (change ???)
cd ???/Distribution/Examples/PhasingWithPedigreeInformation/GenotypeFormat

4. Open the parameter file and change the number of SNP to 500
5. Run AlphaPhase (change OS if needed)
../../../WindowsExecutable/AlphaPhase.exe

6. Open AlphaPhase manual and go to the Output section –
read the meaning of created output files
7. Check the phasing of genotype for individual 1598

Copy the first few values (say 20) of maternal and paternal
haplotype to Excel and use Data → Text to columns → Space +
merge delimiters
Add genotype of individual 1598 and check if the sum of
haplotypes gives genotype

Practical – Phasing III
8. Who is the first (sequentially) Erdös 1 surrogate parent of
individual 1598?
9. Find genotype of the above surrogate parent and copy the
first few values into Excel and find which loci were
informative for the phasing of genotype of individual 1598.
10. What was the phasing yield for individual 1598 and the
overall phasing yield per core? Were there any SNP that
had very low phasing yield?
11. How many haplotypes were found for the first core?
12. Is distribution of haplotype occurrence uniform?
13. We see „high repetition“ of some haplotypes in individuals
1591:1600. Why?

Imputation – Why?
• Genotyping at high density is expensive
• Genotyping at low density is cheaper

• Imputation is free
• A genotype imputation method that allows us to get lots
of genotyped and phenotyped animals at a low cost

Imputation
Empo_er_ng you_ wo_k (e.g. GW_S)

Marchini & Howie (2010)

AlphaImpute vs. IMPUTE2
• AlphaImpute
– Uses pedigree and linkage information

• IMPUTE2
– Pedigree free imputation which uses linkage
disequilibrium
– Similar algorithm to Beagle / fastPHASE

• Data set for comparison
– pig data set
– multiple breed cattle data set

Pig data set
Category     Count  0.5k LD               2.5k LD               5k LD                 7.5k LD
                    AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents     51  0.98         0.77     0.99         0.92     1.00         0.96     1.00         0.96
SireMGS         62  0.93         0.80     0.98         0.92     0.99         0.94     0.99         0.96
DamPGS          47  0.96         0.79     0.98         0.92     0.99         0.95     0.99         0.96
Sire            45  0.89         0.78     0.97         0.92     0.99         0.95     0.99         0.97
Dam             13  0.90         0.76     0.96         0.89     0.98         0.93     0.98         0.95
Other          291  0.86         0.79     0.94         0.91     0.97         0.95     0.97         0.96

Correlation is the statistic that matters

Pedigree ~6500, SNP60K ~3200, SNPLD ~500

Cattle data set
Category     Count  0.5k LD               2.5k LD               5k LD                 7.5k LD
                    AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents     28  0.97         0.64     0.99         0.92     0.99         0.94     1.00         0.95
SireMGS        224  0.87         0.60     0.97         0.91     0.98         0.95     0.99         0.96
DamPGS           7  0.92         0.63     0.97         0.87     0.98         0.91     0.98         0.95
Sire           144  0.86         0.60     0.96         0.90     0.98         0.95     0.98         0.96
Dam              4  0.95         0.63     0.98         0.90     0.99         0.97     0.99         0.95
Other          219  0.84         0.58     0.94         0.90     0.96         0.95     0.97         0.96

Correlation is the statistic that matters

Pedigree ~24000, SNP60K ~4400, SNPLD ~600

Practical - Imputation
1. We have LD genotype for core 1 on full-sib of individual
1600 → 1601:
1?????????????????????????????0??????????????????????
?1?????1??????????????????2???????????????1???2

2. Write down pedigree of 1601 and find out haplotype
codes of parents for core 1.
3. Find haplotypes (01111000…) of parents and set up the
table for inspection of data and imputation (see next
slide).
4. What do the existing LD genotypes of individual 1601
show?
5. Impute missing genotypes.

Practical – Imputation (Solutions)
Parental haplotypes
100 0 0 127    19 0011001001001100000001101000000100000111101110111100101000001000100000000110110110100001001100001001*
100 0 0 20     49 *1001001100100100000011001000000100000100000110111110111000001111000001011100010010100011110001000110
200 0 0 13     34 0110110001001100010001001000000111100111100110111100101000001000100011010000101110100000001100001001
200 0 0 80     17 *0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001*

LD genotype
1601 100 200 / /  1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2

1. The first genotype suggests that 1601 inherited (part of) haplotype 20 from
parent 100, while haplotypes 13 or 80 were inherited from parent 200
2. The second LD genotype is not informative
3. The third LD genotype suggests that haplotype 80 was inherited from parent 200
4. The same as 3.
5. The fifth LD genotype suggests that there has been recombination between parent 100
gametes (haplotypes 127 and 20) as we have allele 1 on both chromosomes in
individual 1601, but not on haplotype 20; no change can be seen for the other
homologous chromosome
6. The sixth LD genotype is not informative for parent 100, while it is for parent 200
7. The seventh genotype again suggests recombination between the 4th and 5th LD genotype
for parent 100, while there is no info on recombination for parent 200
Result:
LD genotype
1601 100 200 / /       1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2
Haplotypes
1601 100 200 20/127 1  1001001100100100000011001000000100000100000110111110111000001111000001000110110110100001001100001001
1601 100 200 80 18     0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001
|
V
Full (imputed) genotype
1601 100 200 / /       1011101110210111110012002000000200011111100111111211111011101121001102110111120220200001011100012002
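
Point 1 of the solution can be checked with a minimal Python sketch: for a single LD genotype call, list which combinations of one haplotype from each parent add up to the observed dosage. Only the first ten alleles of each parental haplotype from the slide are used, and the function name is an assumption for illustration.

```python
def consistent_pairs(ld_call, haps_parent_a, haps_parent_b, locus):
    """Haplotype code pairs (one per parent) whose alleles at `locus` sum to the LD call."""
    return [(code_a, code_b)
            for code_a, hap_a in haps_parent_a.items()
            for code_b, hap_b in haps_parent_b.items()
            if int(hap_a[locus]) + int(hap_b[locus]) == ld_call]

# First ten alleles of the parental haplotypes (codes as on the slide)
parent_100 = {"127": "0011001001", "20": "1001001100"}
parent_200 = {"13": "0110110001", "80": "0010100010"}

# The first LD genotype of individual 1601 is 1 (locus index 0)
first_call = consistent_pairs(1, parent_100, parent_200, locus=0)
print(first_call)
```

Repeating this for each LD call along the core narrows down the inherited haplotypes, and a change in the consistent haplotype of one parent between neighbouring calls points to a recombination.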

Imputation quality
• %Incorrect genotypes → genotype error rate
– true 0, inferred 2 → error (1)
– true 0, inferred 1 → error (1)
– true 0, inferred 0 → correct (0)

• %Correct genotypes → gen. concordance = 1 - gen. error rate
• %Incorrect alleles → allele error rate
– true 0, inferred 2 → error (1)
– true 0, inferred 1 → error/2 (1/2)
– true 0, inferred 0 → correct (0)
– with probabilistic imputation:
½ abs(true allele dosage - inferred allele dosage)

• %Correct alleles → allele concordance = 1 - allele error rate
• Correlation(true allele dosage, inferred allele dosage)
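
The measures above can be written down as a small Python sketch. The dosage vectors are made-up toy values and the function names are assumptions for illustration; the correlation is the ordinary Pearson correlation computed by hand to keep the example self-contained.

```python
from math import sqrt

def genotype_error_rate(true, inferred):
    """Share of genotypes that differ at all between true and inferred."""
    return sum(t != i for t, i in zip(true, inferred)) / len(true)

def allele_error_rate(true, inferred):
    """Mean of 1/2 * |true dosage - inferred dosage|; also works for probabilistic dosages."""
    return sum(abs(t - i) / 2 for t, i in zip(true, inferred)) / len(true)

def dosage_correlation(true, inferred):
    """Pearson correlation between true and inferred allele dosages."""
    n = len(true)
    mean_t, mean_i = sum(true) / n, sum(inferred) / n
    cov = sum((t - mean_t) * (i - mean_i) for t, i in zip(true, inferred))
    var_t = sum((t - mean_t) ** 2 for t in true)
    var_i = sum((i - mean_i) ** 2 for i in inferred)
    return cov / sqrt(var_t * var_i)

true_dosage = [0, 1, 2, 1, 0]       # toy values
inferred_dosage = [0, 1, 2, 2, 0]   # one heterozygote imputed as homozygote

print("genotype error rate:", genotype_error_rate(true_dosage, inferred_dosage))
print("genotype concordance:", 1 - genotype_error_rate(true_dosage, inferred_dosage))
print("allele error rate:", allele_error_rate(true_dosage, inferred_dosage))
print("correlation:", round(dosage_correlation(true_dosage, inferred_dosage), 3))
```

Unlike the concordance measures, the correlation is undefined when either vector has no variation, for example at a monomorphic SNP.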

Example in maize

IMPUTE2, 1227 maize inbred lines

Hickey et al (2012)


Practical – Imputation II
• Evaluate imputation quality for a „pink part“ of the core 1
Result:
LD genotype
1601 100 200 / /       …1?????1??????????????????2???????????????1…
Haplotypes
1601 100 200 20/127 1  …100000111100000100011011011010000100110000…
1601 100 200 80 19     …001110001000110111000101011010000001000001…
Full (imputed) genotype
1601 100 200 / /       …101110112100110211011112022020000101110001…
True genotype
1601 100 200 / /       …101110112100110111011112022020000101110001…

