Data Analysis
Susan H. Slifer
University of Miami Miller School of Medicine, Miami, Florida
Genetic data analysis of large numbers of single nucleotide variants (SNVs),
including genome-wide association studies (GWAS), exome chips, and whole
exome (WES) or whole-genome (WGS) sequencing data, requires well defined
processing steps. As a result, several freely available analytic toolkits have
been developed to streamline these processes. Among these, PLINK is the most
comprehensive in terms of its quality control and analytic modules, although
its focus remains on SNVs. PLINK fulfills two analytic needs—aiding the
process of performing quality control (QC) on large data sets and providing
basic statistical tools to analyze the variants in genetic models. The current
version of PLINK (v1.90b) has incorporated several sophisticated statistical
modeling features, such as those that were introduced by GCTA (genome-
wide complex trait analysis), including mixed-model association analysis and
cluster-based algorithms. Although PLINK is diverse in its applicability to
data management and analysis, in some instances, other available tools offer
more optimal options. Here we provide a practical overview of major PLINK
features with respect to QC, data management, and association mapping, along
with learned shortcuts and limitations to be considered. In cases where PLINK
features are limited, we provide alternative approaches using additional freely
available pipelines. © 2018 by John Wiley & Sons, Inc.
INTRODUCTION
Genotyping strategies changed rapidly between the late 1990s and the first years of
the 21st century. Earlier selection of genetic markers had often targeted known candidate
regions or genes, with final data sets numbering in the hundreds of variants. By the
middle of the first decade of the 21st century, improved genotyping methods enabled
genome-wide sets of markers and large-scale SNV and indel (insertion and deletion)
analysis to be run for hundreds of thousands of variants at once. This required a new
and fast data-processing framework. We now routinely process
datasets from SNV chip arrays that contain 1 to 2 million variants. Currently, through
next-generation sequencing (NGS), whole-exome and whole-genome sequencing can
identify millions of SNVs and indels, putting additional analytic burdens on computa-
tional pipelines. PLINK, introduced in 2007 (Purcell et al., 2007), was one of the first
freely available tools developed to enable the processing of large amounts of genotypic
data, and, along with PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq/), remains a
useful and comprehensive tool to date.
Excellent, detailed documentation on PLINK is available on the PLINK Web site. Most
of the older v1.07 commands are supported in the new v1.90b, and the older manual
PLINK USAGE
Computer Resources
PLINK may be run on Linux, Windows, and OS X platforms. The current PLINK v1.90b
has been optimized to run much faster than previous versions. According to PLINK
documentation, this current version should enable even desktop machines to process
millions of SNVs for tens of thousands of individuals at a time. For more memory-
intensive analytic computations, a parallel option is available that allows distributed
processing across clusters. In these instances, running PLINK on a Linux machine may
be the best option.
File Types
PLINK is designed to accept most file types commonly used for SNV genotypic data as
input. All files read into PLINK are converted to binary format internally before filtering
or calculation is performed by the program. This PLINK format is especially helpful
since it allows a compressed data set to be used for analysis. Depending on the input
format, additional map and/or phenotypic data files may be required. See Background
Information for details on file types; Table 2 summarizes common PLINK files.
Table 2 Common PLINK Input Files
Input file type      Number of input files   Genotypes         Pedigree structure   Map info
PLINK text files     2                       <file>.ped        <file>.ped           <file>.map
PLINK binary files   3                       <file>.bed        <file>.fam           <file>.bim
LGEN file            3                       <file>.lgen       <file>.fam           <file>.map
VCF file             1                       <file>.vcf(.gz)   None                 <file>.vcf(.gz)
Oxford format        2                       <file>.gen        <file>.sample        <file>.gen
Current Protocols in Human Genetics
Merging Files
Files containing different SNVs and/or samples may be merged using PLINK. PLINK
text files can be merged using --merge and PLINK binary files merged using
--bmerge. The flag --merge-list is used to merge a list of files. While there is no
requirement that the files to be merged must contain the same individuals and variants,
if they do, the genotypes are compared before being merged.
The merge command has several options for --merge-mode; the files can be merged
together or simply compared and summarized for differences. There are options to drop
mismatching calls or to use those from one file or the other.
When two (or more) datasets are merged, the name, chromosome, basepair, and alleles
are compared.
a. If files match for chrom, basepair, and alleles, then the merge will combine individuals
and have a single SNP1 position:
Merged file
1 SNP1 0 10000 A C
b. If the chr-basepair and alleles are the same but SNV names are different, then the
merged file will contain records for both “SNP2” and “SNP22,” and PLINK will warn
that this is the case. If the --merge-equal-pos flag is used, then
the two SNVs will be combined into a single SNV in the merged file called “SNP2,”
the name used in the first file.
Merged file (with --merge-equal-pos)
1 SNP2 0 20000 A G
c. If the same SNV name is used in both files, but there is a difference in chr or basepair
position, the merged file will use the basepair location of the first file:
Merged file
1 SNP3 0 30000 A G
Allelic Checks When Merging
If the alleles are the same for SNVs in the files to be merged, the merge can happen.
Otherwise, SNVs with identical location and differing alleles will be flagged:
Example   File1                  File2
(a)       1 SNV1 0 10000 A C     1 SNV1 0 10000 C A
(b)       1 SNV2 0 20000 A G     1 SNV2 0 20000 C T
(c)       1 SNV3 0 30000 A G     1 SNV3 0 30000 C A
(d)       1 SNV4 0 40000 A T     1 SNV4 0 40000 T A
(d)       1 SNV5 0 50000 G C     1 SNV5 0 50000 C G
a. In this case, the two files will merge because they each contain an SNV with A and
C alleles (order of the alleles in the map files used by the PLINK binary format (bim
files) does not matter; this simply indicates that the minor allele is different in each
file before the merge).
b. Merging will fail with a strand mismatch error when the same SNV was typed on
opposite strands in the two files, and PLINK will write a .missnp file listing the
affected SNVs. The --flip command changes the alleles to the opposite strand.
This example shows what two files would look like if the SNVs were typed on opposite
strands for the two sets. In the first file, the alleles possible for the SNV are A and G,
which are the reverse complement of C and T for the second file. Once the --flip
command is used on one of the two files, the two files can be merged.
c. Sometimes the two sets of alleles are not reverse complements of each other. This
could be the case with a tri-allelic variant, or a mistake in genotyping. This will also
be listed as an error in the .missnp file.
In this case, the reverse complement of the second file’s alleles (C,A → G,T) will not
cause them to match with the first file’s A,G. The SNV should be dropped or the correct
alleles determined and only genotypes from that file saved.
d. The last case involves SNVs that have either A,T or C,G for the two alleles of the
SNV. It is impossible to determine on which strand the SNV in each of the two files
was typed. However, you may be able to use --flip-scan by coding the phenotype
field to indicate which data set each sample came from, in place of case-control
status. This compares allele frequency and LD patterns between the two sets of calls.
Often the safest thing in large datasets when A,T or C,G flips may have occurred is to
drop all of those SNVs as part of the QC process.
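The four cases above can be summarized in a small classification routine. This is a minimal sketch and not part of PLINK: the function name and the category labels are hypothetical, and alleles are assumed to be given as single-letter A/C/G/T codes.

```python
# Sketch (not PLINK code): compare the allele pair of one SNV between two files.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_alleles(file1, file2):
    """file1 and file2 are 2-tuples of alleles, e.g. ("A", "C")."""
    set1, set2 = frozenset(file1), frozenset(file2)
    flipped1 = frozenset(COMPLEMENT[a] for a in file1)
    flipped2 = frozenset(COMPLEMENT[a] for a in file2)
    # A/T and C/G pairs equal their own complement, so the strand cannot be inferred.
    if set1 == flipped1:
        return "ambiguous"
    if set1 == set2:
        return "match"      # case (a): same alleles, order does not matter
    if set1 == flipped2:
        return "flip"       # case (b): opposite strand, fixable by flipping one file
    return "mismatch"       # case (c): not reverse complements of each other
```

For the example table, SNV1 is classified as a match, SNV2 as a flip, SNV3 as a mismatch, and SNV4 and SNV5 as ambiguous.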
Extended Family Data
Samples genotyped for GWAS or NGS studies are often restricted to a key set of individ-
uals. In cases where these samples are part of extended families, other connecting family
members may not have genotypic data. While most QC checks do not require these
individuals, any checking of relationships or later data analysis of extended families will
need complete family structure.
The PLINK long (lgen) format uses a family file that may include additional, non-
genotyped individuals who are not listed in the genotype file. Adding connecting
individuals who were not genotyped/sequenced can be done most easily by including them
in this family file.
To add connecting individuals, files can be saved to lgen format and then, after modifica-
tions are made to its fam file, the lgen format can be converted to a text or binary format.
These files will now contain connecting, non-genotyped individuals.
Figure 1 Pedigree plot of a complete family for PLINK analysis.
Key point
Make sure SNV names have been updated if initially input as missing (.) and are being
filtered by name; this is an issue often seen with VCF files. For details of how to add
names, see Background Information, “VCF files.”
X Chromosome Genotypes
PLINK processes X chromosome data for females using 3 genotypes and 2 alleles
(as expected with autosomes). For males, all X chromosome data is expected to be
homozygous. The genotypes are coded as if the sample has two copies of the allele,
instead of the actual single copy plus Y chromosome (C/C rather than C/Y).
Example
Alleles   Female genotypes possible   Male genotypes possible
A,C       A/A, A/C, C/C               A/A or C/C
It is important to remember that although this is the general rule for X genotypes, two
pseudo-autosomal regions exist on the X chromosome. In these two regions, males can
be correctly heterozygous by recombining with the corresponding regions on Y. Often
genotypic data is already coded to reflect this by using “X” or “23” to represent the X
chromosome and “XY” or “25” to represent the pseudo-autosomal part of X.
Each time PLINK runs, it will identify heterozygous haploid X chromosome SNVs
and non-male Y SNVs and list them in an .hh file. These apparent errors may instead
indicate SNVs in the pseudo-autosomal regions of X. The --split-x flag can be used to
separate the pseudo-autosomal regions from the rest of the X chromosome, which is an
important step before assessing gender assignments.
In order to zero out any remaining heterozygous haploid genotypes from the final file
after errors have been corrected, use --set-hh-missing. Once this is addressed, sex
checking options can be run.
Key point
By default, non-founders (those with parents) are not counted by several of PLINK’s
QC steps. This can sometimes result in very few individuals being used for these cal-
culations. Use the --nonfounders flag to include them, which may be necessary for
small sample sets, or output statistics will be NaN (Not a Number) indicating that the
calculations could not be performed.
Table 3 Fundamental Quality Control Steps Used in PLINK
Missingness
Both samples and SNVs can be excluded from the dataset if they fail to meet a missing
data threshold. The rate of missingness can be checked in both samples (% missing for
a sample across your variants) and SNVs (% missing for a particular variant among
your samples) with the command --missing and then may be used to drop samples
(--mind) or SNVs (--geno) based on a threshold.
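The two missingness rates described above can be illustrated with a short calculation. This is a sketch only, assuming genotypes are held in memory as one list per sample with None marking a missing call; the function name is hypothetical and PLINK's own implementation works on its binary files.

```python
def missing_rates(genotypes):
    """genotypes: list of per-sample lists of calls; None marks a missing call.

    Returns (per_sample, per_snv) missing-data fractions.
    """
    n_samples = len(genotypes)
    n_snvs = len(genotypes[0])
    # Fraction of SNVs missing within each sample (compared against --mind).
    per_sample = [sum(call is None for call in row) / n_snvs for row in genotypes]
    # Fraction of samples missing at each SNV (compared against --geno).
    per_snv = [
        sum(row[j] is None for row in genotypes) / n_samples
        for j in range(n_snvs)
    ]
    return per_sample, per_snv

calls = [
    [0, 1, None, 2],
    [None, None, 1, 0],
    [2, 2, 2, 2],
]
sample_rates, snv_rates = missing_rates(calls)
# With a threshold of 0.3, only the second sample (0.5 missing) would be dropped.
failing_samples = [i for i, rate in enumerate(sample_rates) if rate > 0.3]
```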
Key point
“F_MISS” in the .imiss and .lmiss output files is the frequency of missing data,
and is different from the “NMISS” value, the number of non-missing genotypes,
that is output in association analysis.
Differential Missingness
PLINK offers a basic association test between cases and controls to determine whether
the rate of missing data differs significantly between the two groups; this option is
--test-missing. This is an important check in case-control data sets to ensure that any
associations found in the analysis are not based on different amounts of missing data in
the two groups.
Allele Frequency
For older types of GWAS analysis, the minor allele frequency (MAF) threshold was
generally restricted to > 5%. Newer chips and NGS often target rare variants and are
interested in these low frequency SNVs. Regardless, less common SNVs may need to be
identified, as they are not always analyzed in the same way as common variants. The
calling of rare SNVs is also more prone to error, so QC can help identify issues. The
minor allele frequency of the data set can be calculated using the --freq flag. If the
SNVs should be filtered based on the minor allele frequency, use --maf.
Key point
Both of these default to founders-only unless the --nonfounders flag is used.
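As an illustration of the quantity being filtered, the minor allele frequency of one SNV can be computed from per-sample allele counts. This is a minimal sketch, assuming each genotype is coded as 0, 1, or 2 copies of one designated allele with None for missing calls; the function name is hypothetical.

```python
def minor_allele_frequency(allele_counts):
    """allele_counts: copies (0, 1, 2) of one allele per sample; None = missing."""
    called = [count for count in allele_counts if count is not None]
    if not called:
        return None  # no called genotypes, so the frequency is undefined
    freq = sum(called) / (2 * len(called))  # two alleles per called sample
    # The minor allele is whichever of the two alleles is less frequent.
    return min(freq, 1 - freq)
```

For example, counts of [0, 1, 2, None, 0] give 3 copies among 8 called alleles, a frequency of 0.375.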
Hardy-Weinberg Equilibrium
Tests for Hardy-Weinberg equilibrium are a key step in the QC process since they can
indicate problematic or badly-called SNVs that should be removed. Because of the
inclusion of rare variants in newer data sets, tests for these deviations may be more
appropriate for more common SNVs (>5%).
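To make the idea of a Hardy-Weinberg deviation concrete, the sketch below computes the textbook chi-square goodness-of-fit statistic (1 degree of freedom) from genotype counts. It is an illustration under stated assumptions, not PLINK's implementation, which may use a different (for example, exact) test; the function name is hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip cells with zero expectation (monomorphic SNVs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A SNV with counts 25/50/25 fits the expected proportions exactly, whereas counts of 50/0/50 (no heterozygotes) give a very large statistic and would be flagged.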
Key point
Be sure to use the --nonfounders flag if you have related samples and would like to
use an unrelated subset to determine this.
Gender Check
If X chromosome data is present, the sex of the samples can be determined from the input
genotypes using --check-sex. If present, Y chromosome SNVs may also be used
with the --check-sex ycount command. The output from the sex check command
uses a het statistic to indicate how far a sample’s X chromosome genotypes deviate from
the expected heterozygosity for that sex.
Key point
PLINK guidelines define males as having an inbreeding homozygosity estimate of F >
0.8 and females as having F < 0.2. PLINK will label these samples outside of these
thresholds as PROBLEM in the output file. However, there are quite frequently samples
which deviate slightly from these thresholds that should be considered correctly identified
in the data set.
Mendelian Errors
One important difference in running an analysis using related samples versus unrelated
cases and controls is that it is possible to identify incorrectly inherited genotypes—those
that do not match with what is assumed a priori. Since a child inherits one allele at a
locus from each parent, these copies should segregate accordingly. All child genotypes
should have an allele from each parent. These errors may be random genotyping “noise,”
may indicate a problem with the samples, or (rarely) indicate inheriting a deletion from
one parent.
Earlier versions of PLINK (i.e., v. 1.07) would check Mendelian errors only in complete
trios, families where both parents and child were typed. A Mendelian error is detected
when an allele contained in the genotype of the child is not present or is different from
those present in either parent. While it is possible to have de novo alleles (absent in both
parents and newly mutated in the child), this is very rare; thus, most instances would be
considered as Mendelian errors. The current version has flags that allow children with
only one typed parent (--mendel-duos) and extended families (--mendel-multigen) to
be checked as well. Use --mendel to generate a report and --set-me-missing
to filter.
Examples of Mendelian Errors
Family Individual Father Mother Genotype
(a) Incorrect Trio
F1 1 1000 1001 A/C
F1 1000 0 0 C/C
F1 1001 0 0 C/C
Individual F1_1 has an “A” allele that is not present in either parent
(b) Incorrect Duo
F2 1 1000 1001 G/G
F2 1000 0 0 0/0
F2 1001 0 0 A/A
Individual F2_1 does not inherit either of the mother’s “A” alleles
(c) Incorrect Extended Family
F3 1 1000 1001 A/A
F3 1000 0 0 A/C
F3 1001 100 101 0/0
F3 100 0 0 C/C
F3 101 0 0 C/C
Individual F3_1 has a mother who can be inferred to have a “C” allele. This is
inconsistent with F3_1 being homozygous for “A”. See Figure 2.
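The trio and duo checks in examples (a) and (b) can be sketched as follows. This is an illustration only, not PLINK's algorithm: the function names are hypothetical, genotypes are assumed to be written as in the tables above ("A/C", with "0/0" for untyped), and the extended-family inference of example (c) is deliberately not implemented.

```python
def parse(genotype):
    """Return the two alleles of a genotype such as "A/C", or None if untyped."""
    a, b = genotype.split("/")
    return None if "0" in (a, b) else (a, b)

def is_mendelian_error(child, father, mother):
    """True if the child's genotype cannot be explained by the typed parents."""
    c, f, m = parse(child), parse(father), parse(mother)
    if c is None:
        return False
    c1, c2 = c
    if f is not None and m is not None:
        # Trio: one child allele must come from each parent.
        return not ((c1 in f and c2 in m) or (c2 in f and c1 in m))
    parent = f if f is not None else m
    if parent is None:
        return False  # no typed parent, nothing to check
    # Duo: the child must share at least one allele with the typed parent.
    return not (c1 in parent or c2 in parent)
```

Note that example (c) passes this simple check, because detecting it requires inferring the mother's genotype from the grandparents.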
Key points
In some cases, large extended families may not be completely cleaned by PLINK. Once
the outstanding errors are determined by a program such as PEDSTATS (Wigginton &
Abecasis, 2005), they can be zeroed out in PLINK using the --zero and
--zero-cluster commands.
Relatedness
While the --mendel option evaluates each SNV separately, the --genome command
looks at all genotypes simultaneously and can be used to determine if samples are
related correctly within the file. This is especially helpful if you have a set of unrelated
individuals and you want to identify any duplicate samples. The command calculates
identity-by-descent (IBD) between pairs of samples based on the putative relative type
in the pedigree file. IBD is a measure of the type of relatedness between two samples; the
idea is to look at how much sharing of alleles from a common ancestor these individuals
have, with PLINK outputting both the expected and actual sharing between pairs of
individuals. SNVs should be LD-pruned first, before this command is run, to ensure that
independently inherited SNVs are evaluated.
Table 4 Expected Values for Common Types of Relationships
The output from the --genome command approximates the proportion of IBD sharing
overall, representing pairs as sharing, across all sites: zero alleles IBD (z0), one
allele IBD (z1), or two alleles IBD (z2). Unrelated individuals should share
zero alleles IBD (z0 = 1), while MZ twins or duplicate samples would share both
alleles 100% (z2 = 1). PI_HAT is the proportion IBD, defined as P(IBD = 2)
+ 0.5*P(IBD = 1). A PI_HAT threshold above zero may be chosen to exclude
pairs of related samples in an unrelated data set. This number can vary depending on the
nature of the overall data set and how conservative an estimate of relatedness is wanted.
Table 4 lists expected values of IBD sharing and PI_HAT by relationship type.
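The PI_HAT definition above is simple enough to verify by hand. The sketch below applies it to the theoretical sharing proportions of a few standard relationships; the function names, the sample IDs, and the 0.2 cutoff are illustrative assumptions, not PLINK defaults.

```python
def pi_hat(z0, z1, z2):
    """Proportion IBD: P(IBD = 2) + 0.5 * P(IBD = 1)."""
    return z2 + 0.5 * z1

def related_pairs(pairs, threshold):
    """pairs: list of (id1, id2, z0, z1, z2); keep those with PI_HAT above threshold."""
    return [(a, b) for a, b, z0, z1, z2 in pairs if pi_hat(z0, z1, z2) > threshold]

pairs = [
    ("S1", "S2", 1.0, 0.0, 0.0),     # unrelated
    ("S3", "S4", 0.0, 1.0, 0.0),     # parent-offspring
    ("S5", "S6", 0.25, 0.5, 0.25),   # full siblings
    ("S7", "S8", 0.0, 0.0, 1.0),     # MZ twins or duplicate samples
]
flagged = related_pairs(pairs, threshold=0.2)
```

Real estimates are noisy, so observed values scatter around these theoretical proportions.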
These values can be used in the QC process to help decide if certain individuals are
outliers to expected ethnic groups. The PLINK --pca command will generate the top
20 PCs using the algorithm employed by GCTA (Yang et al., 2011). The PC values may
also be used subsequently in the analysis as covariates.
Concordance
Many data sets that are analyzed using GWAS or NGS have been run previously on
other genotyping platforms. To verify the integrity of the samples, genotypes can be
checked for concordance between the two platforms. This can be done by merging the
two data sets, one sample at a time. The option --merge-mode 7 will give a percent
concordance as part of the output, comparing non-zero calls in the two datasets. While
100% would be the optimal concordance, identical samples run on both platforms may
have concordance values somewhat less than 100% due to miscellaneous genotyping
errors. The IDs of the individual to compare must be the same in both files to be merged.
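The concordance figure described above amounts to the fraction of matching genotypes among SNVs called on both platforms. The following is a minimal sketch of that calculation, not PLINK's merge code; it assumes genotypes are two-letter strings with None for missing calls, and the function name is hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of matching genotypes among SNVs called (non-missing) in both sets."""
    compared = matches = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # only non-missing calls in both data sets are compared
        compared += 1
        if sorted(a) == sorted(b):  # allele order within a genotype is ignored
            matches += 1
    return matches / compared if compared else None

platform_1 = ["AA", "AC", "CC", None, "GG"]
platform_2 = ["AA", "CA", "CT", "TT", "GG"]
rate = concordance(platform_1, platform_2)
```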
analysis, may be better run using alternate programs like PLINK/SEQ
(https://atgu.mgh.harvard.edu/plinkseq). Table 5 outlines commonly used analysis options.
The --condition option can also be included, which will test all SNVs but add in
the allelic dosage of the designated variant as a covariate. A --genotypic flag may
be used to look at genotype effects of an SNV in addition to the additive effect.
The output file from the --logistic option will list results for the overall model and
each covariate separately. The TEST column indicates which model the line represents.
Key points
The column NMISS in logistic/linear output is the number of non-missing samples.
The A1 allele is the tested allele (usually the minor allele) and the A2 allele is the
reference allele.
In this case, the line with TEST = “ADD” contains the p value for the additive effect
of the SNV, adjusted for all covariates. If only these results are desired, the
--hide-covar flag will suppress the covariate-specific lines.
The option --no-snp will output effects of just the phenotype and the covariates if an
evaluation of the non-genetic effects is desired.
Plot results with Haploview
The program Haploview (Barrett, Fry, Maller, & Daly, 2005) can be used to visualize asso-
ciation results generated by PLINK. The program will take an output assoc/logistic/linear
file and allow the -log10(p values) to be plotted. There must be a column of SNV names
in the file with the header SNP, and the map information that is present in the PLINK
regression output can be used as an “integrated map file” by Haploview.
PLINK also offers the --mds-plot option, which will perform multidimensional
scaling. This output can be combined with the cluster file to input and visualize in
Haploview (Barrett et al., 2005).
Quantitative traits
Quantitative traits include traits that can be measured on a continuous scale rather than
dichotomized (such as case/control status). These traits can provide more power to an
analysis since they enable all variance of the trait to be used in calculations rather than
its simplification into two categories. For instance, body mass index (BMI) or age can be
analyzed using the PLINK commands --assoc or --linear.
Multiple phenotypes may be read into PLINK using a separate phenotype file. This file
should have the key fields with headers FID and IID, followed by quantitative traits and
covariates. The --linear command allows you to adjust for additional covariates.
This requires a covariate file to be specified. Each covariate must be listed for the sample
matching the main quantitative trait. Importantly, if any covariates being adjusted for
are missing, then that individual cannot be used in the --linear model regression
analysis. The phenotype and covariate files will need to be created only for individuals
with complete data for the trait and covariate.
Currently, PLINK v1.90b does not allow multi-phenotype analysis, but the program
MV-PLINK (Ferreira & Purcell, 2009), based on an earlier PLINK v1.06 version, may be
downloaded separately. In addition, SNPTEST (Marchini, Howie, Myers, McVean, &
Donnelly, 2007) has also implemented the analysis of multiple phenotypes.
Key point
The covariate file can be formatted similarly to the phenotype file. Please note that for
quantitative traits, you will want to ensure that the trait is normally distributed. In some
instances, you will need to transform the trait using log-transformation or other type of
transformation (e.g., sqrt, Box-Cox).
However, there are alternative approaches that adjust for the kinship matrix defining
the genetically determined familial relationships in the cohort. One such program is
EMMAX (Kang et al., 2010), which has been implemented in the EPACTS software
suite (https://genome.sph.umich.edu/wiki/EPACTS). With EMMAX, you need to use the
PLINK --make-kin function to generate the kinship matrix. Then, this kinship matrix
is incorporated as a parameter in the regression-based association mapping. Another
option is the R package GWAF (Chen & Yang, 2010). Both of these programs may
be used with both dichotomous and quantitative traits. GWAF will run generalized
estimating equations using a correlation matrix of each family as a cluster when analyzing
dichotomous traits, and will run a limited mixed model and a kinship matrix when
evaluating a quantitative trait.
Detection of Epistasis
PLINK does not offer comprehensive gene-based association testing, although some tests
have been incorporated into PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq).
PLINK allows for the testing of epistasis, which occurs when the effect of one gene
is dependent on the presence of one or more modifier genes (also see Cole, Hall, Ur-
banowicz, Gilbert-Diamond, & Moore, 2017). The PLINK option for the testing looks at
SNV-SNV-based epistasis using the --fast-epistasis command. This command
will scan based on 3 by 3 joint genotype count tables. Newer commands in
version v1.90b include boost and joint effects options for extended versions of the
original command.
Haplotype Evaluation
Linkage disequilibrium (LD) refers to variants that are inherited together in sets when
transmitted from parents to offspring. Including all such SNVs can inflate analysis
results. A subset of SNVs can be pruned from the overall set to select only those that
are in linkage equilibrium. This means that they are inherited independently rather than
in a set, or haplotype block. This can be done using either the --indep or
--indep-pairwise flags. The --indep-pairwise option is generally preferred, as it uses a
window size and a pairwise r2 threshold.
LD statistics can be output for a dataset in a variety of ways. The --r2 dprime flag
will output the squared correlation coefficient (r2) for each pair of SNVs along with the
D-prime statistic. The --r2 flag defaults to dropping pairs with r2 < 0.2; to include them,
use --ld-window-r2 0.
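As a rough illustration of the pairwise r2 statistic, the sketch below computes the squared Pearson correlation between the allele counts of two SNVs across samples. It assumes unphased genotypes coded as 0, 1, or 2 copies of an allele with None for missing; the function name is hypothetical, and other LD estimators (for example, haplotype-based ones) would give somewhat different values.

```python
def r_squared(x, y):
    """Squared correlation of allele counts at two SNVs; None if undefined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a monomorphic SNV has no defined correlation
    return cov * cov / (var_x * var_y)
```

Two SNVs with identical (or perfectly mirrored) genotypes give r2 = 1, while independently distributed genotypes give values near 0.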
Key point
While PLINK will default to using the phenotype listed in the 6th column of the ped or
fam file, alternate phenotypes may be loaded using the --pheno flag. This file should
have a header, with the first two columns listed as FID and IID to correspond to individuals
listed in the other PLINK file types. Additional phenotypes can be either dichotomous
(1/2) or continuous (age, BMI, etc.). Any covariates needed for analysis can also be read
in using --covar, with a similar file that has a header and the first two FID and IID
columns.
A map file is read in with this file to indicate the names of the SNVs and their chromosome
and basepair locations. The third column, genetic position, may be left as zero if not
needed for linkage or recombination information.
By default, PLINK expects a common prefix for any given file type read in. Thus, if a ped
file is called STUDY.ped, it will expect the map file to be STUDY.map. If the prefix is
consistent, the files can be read in with a single flag, in this case --file:
Key point
If --out <name> is not listed in the PLINK command, the prefix of all output files
will default to plink.
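The prefix and --out conventions described above can be sketched as a small helper that assembles a command line for converting text files to binary format. This is only an illustration: the executable name plink, the helper name, and the STUDY prefixes are assumptions, and the command is built but not executed here.

```python
def plink_convert_command(prefix, out=None):
    """Build a command that reads <prefix>.ped/.map and writes binary files."""
    command = ["plink", "--file", prefix, "--make-bed"]
    if out is not None:
        # Without --out, output files take the default prefix "plink".
        command += ["--out", out]
    return command

command = plink_convert_command("STUDY", out="STUDY_binary")
# The list could then be passed to subprocess.run(command, check=True).
```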
Example PLINK ped file
Family(FID)  Individual(IID)  Father  Mother  Sex  Aff_Status  SNVgenotypes . . .
F1           1                1000    1001    1    2           AA CC
F1           1000             0       0       1    1           AG CC
F1           1001             0       0       2    1           AA CC
The files are read in using the --bfile flag. If data is input from other formats such
as VCF, the --keep-allele-order command will keep A1 and A2 in the same order as the
input. To output PLINK binary file types, use --make-bed.
PLINK identifies the minor allele (in the total dataset), which is always listed as “A1” in
the .bim file. This is also an easy way to determine which SNVs are monomorphic and
should be dropped.
Key point
The third column, genetic position, may be left as zero if not needed for linkage or
recombination analysis.
5. A 6-column family file like the .fam file used with PLINK binary files is also
necessary for this file input type.
VCF Files
VCF files (Variant Call Format) generated from sequencing analysis can be read into
PLINK and converted to binary files. These files are input into PLINK using the --vcf
flag with the input file name. Data may be output from PLINK into VCF format using
--recode vcf.
NGS data often contains annotation information in the INFO field in addition to the
genotypes of the samples. Not all detailed annotation information is kept in PLINK,
but can be kept using PLINK/SEQ. A limited amount of filtering can be done when
reading a VCF file into PLINK; minimum QUAL and GQ values can be set using
--vcf-min-qual and --vcf-min-gq. Further filtering and manipulation of VCF files can be
done using VCFtools (Danecek et al., 2011) or BCFTOOLS
(https://samtools.github.io/bcftools/bcftools.html).
Key point
Values from the INFO flags in the VCF file will be lost after being read into PLINK.
REF and ALT allele assignments will not be conserved unless additional flags are used.
A program such as PLINK/SEQ is better equipped to handle these options.
Alleles are ordered in a VCF file according to REF and ALT annotation designation. When
read into PLINK, this information is lost if the --keep-allele-order command
is not used. VCF files often contain indels as well as SNVs, and may have locations
with multiple alleles. Neither of these scenarios is handled by PLINK unless the indels
are coded as bi-allelic insertions/deletions and multi-allelic sites are excluded using
--biallelic-only. For analysis of more complicated indels and CNVs, PLINK/SEQ is
recommended.
Key point
NGS data for novel SNVs is often missing an SNV name in the third column of a VCF
file. These SNVs when input into PLINK will be given “.” for names. The chromosome
and basepair position (and alleles if needed) can be used for names.
Example
Init.bim
1 . 0 50000 A C
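The naming approach described above can be sketched as a small routine that rewrites the lines of a .bim file. This is a minimal illustration: the function name and the chr:bp:allele1:allele2 naming pattern are assumptions, and PLINK 1.9 also documents a --set-missing-var-ids flag that serves this purpose without external scripting.

```python
def name_missing_variants(bim_lines):
    """Give variants named "." an ID built from chromosome, position, and alleles."""
    renamed = []
    for line in bim_lines:
        chrom, name, genetic_pos, bp, a1, a2 = line.split()
        if name == ".":
            # Including the alleles keeps IDs distinct when variants share a position.
            name = f"{chrom}:{bp}:{a1}:{a2}"
        renamed.append("\t".join([chrom, name, genetic_pos, bp, a1, a2]))
    return renamed

updated = name_missing_variants(["1 . 0 50000 A C", "1 rs123 0 60000 G T"])
```

Applied to the Init.bim line shown above, the variant would be named 1:50000:A:C, while variants that already have names (such as the hypothetical rs123) are left unchanged.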
Table 6 Additional Analysis Software
When reading in imputed data, PLINK converts the genotype probabilities to hard calls
using a threshold that can be set by the user but defaults to 0.1. Other genetic analysis
programs will use all available data instead of converting the genotypes and generating
missing data. Such programs include SNPTEST (Marchini et al., 2007), ProbABEL (Aulchenko,
Struchalin, & van Duijn, 2010), and the R package GWAF (Chen & Yang, 2010).
Key points
For imputed data, PLINK uses a default threshold of 0.1 to convert imputed genotypes
to hard-call genotypes and store them in PLINK binary format. This threshold can be changed
as needed using the --hard-call-threshold flag; in order to keep all imputed genotypes,
it can be set at 0.49.
Both PLINKv1.07 and PLINKv1.90b allow a --dosage command that is poorly inte-
grated with the rest of the PLINK commands. It will run an association test by default
that allows the genotypes to have decimal allele counts for dosage (a continuous number
from 0 to 2), representing the number of minor alleles from imputed data.
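The difference between hard calls and dosages can be illustrated with a short sketch. It assumes, as a simplification that may not match PLINK's internals exactly, that a call is made when the most likely genotype has probability of at least 1 minus the threshold; the function names are hypothetical.

```python
def hard_call(probabilities, threshold=0.1):
    """probabilities: (P(AA), P(AB), P(BB)). Returns copies of B (0, 1, 2) or None."""
    best = max(range(3), key=lambda g: probabilities[g])
    # Assumed rule: uncertainty is 1 minus the largest genotype probability.
    if 1 - probabilities[best] > threshold:
        return None  # too uncertain, so the genotype is set to missing
    return best

def dosage(probabilities):
    """Expected number of B alleles, a continuous value from 0 to 2."""
    _, p_ab, p_bb = probabilities
    return p_ab + 2 * p_bb
```

With the default threshold, a genotype with probabilities (0.8, 0.15, 0.05) becomes missing, whereas a threshold of 0.49 keeps it as a hard call; the dosage retains the uncertainty in either case.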
Additional Software
Although PLINK allows flexibility in the type of input files, it was originally designed
for analysis of GWAS SNVs. Accordingly, file processing, QC, and analysis for
both imputation and NGS data may be performed more robustly using other
software and pipeline options. Both VCFtools (Danecek et al., 2011) and BCFTOOLS
(https://samtools.github.io/bcftools/bcftools.html) offer file processing and filtering of
VCF files. The utility program GTOOL (Marchini et al., 2007) is helpful when working
with Oxford format data. Additionally, in recent years, other analytic suites including
GCTA (genome-wide complex trait analysis; Yang et al., 2011), SNPTEST
(Marchini et al., 2007), and EPACTS (efficient and parallelizable association container
toolbox; https://genome.sph.umich.edu/wiki/EPACTS) have incorporated several
sophisticated modeling features, including mixed model association analysis and
cluster-based algorithms. PLINK (v1.90b) has adapted GCTA into its pipeline. Table 6
presents a summary of additional software for genetic data analysis.
SUMMARY/CONCLUDING REMARKS
The development of PLINK in 2007 was a key turning point in the ability of researchers to
process and analyze the large numbers of SNVs that had started to be routinely generated
by labs in the early years of the 21st century. The program has subsequently been updated
to allow the input of other types of data—including imputed and sequencing data—that
have become more common in the subsequent years. It offers the advantage of a single,
easy-to-use program that can process, perform quality control, and run basic analysis on
genotyping datasets in a single package.
ACKNOWLEDGEMENTS
The author is partially supported by the John P. Hussman Institute for Human Genomics
at the University of Miami Miller School of Medicine, and would like to thank Dr.
Evadnie Rampersaud for valuable input and discussion.
LITERATURE CITED
Aulchenko, Y. S., Struchalin, M. V., & van Duijn, C. M. (2010). ProbABEL package for genome-wide
association analysis of imputed data. BMC Bioinformatics, 11, 134. doi: 10.1186/1471-2105-11-134
Barrett, J. C., Fry, B., Maller, J., & Daly, M. J. (2005). Haploview: Analysis and visualization of LD and
haplotype maps. Bioinformatics, 21, 263–265. doi: 10.1093/bioinformatics/bth457.
Chen, M.-H., & Yang, Q. (2010). GWAF: An R package for genome-wide association analyses with family
data. Bioinformatics, 26(4), 580–581. doi: 10.1093/bioinformatics/btp710.
Cole, B. S., Hall, M. A., Urbanowicz, R. J., Gilbert-Diamond, D., & Moore, J. H. (2017). Analysis of gene-
gene interactions. Current Protocols in Human Genetics, 95, 1.14.1–1.14.10. doi: 10.1002/cphg.45.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... 1000 Genomes Project
Analysis Group (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi:
10.1093/bioinformatics/btr330.
Ferreira, M. A., & Purcell, S. M. (2009). A multivariate test of association. Bioinformatics, 25, 132–133.
doi: 10.1093/bioinformatics/btn563.
Hancock, D. B., & Scott, W. K. (2012). Population-based case-control association studies. Current Protocols
in Human Genetics, 74, 1.17.1–1.17.20. doi: 10.1002/0471142905.hg0117s74.
Hellwege, J. N., Keaton, J. M., Giri, A., Gao, X., Velez Edwards, D. R., . . . Edwards, T. L. (2017). Population
stratification in genetic association studies. Current Protocols in Human Genetics, 95, 1.22.1–1.22.23.
doi: 10.1002/cphg.48.
Howie, B. N., Donnelly, P., & Marchini, J. (2009). A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genetics, 5(6), e1000529. doi:
10.1371/journal.pgen.1000529.
Igo, R. P., Jr., Cooke Bailey, J. N., Romm, J., Haines, J.L., & Wiggs, J.L. (2016). Quality control for
the Illumina HumanExome BeadChip. Current Protocols in Human Genetics, 90, 2.14.1–2.14.16. doi:
10.1002/cphg.15.
Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S. Y., Freimer, N. B., ... Eskin, E. (2010).
Variance component model to account for sample structure in genome-wide association studies. Nature
Genetics, 42, 348–354. doi: 10.1038/ng.548.
Lambert, G., Tsinajinnie, D., & Duggan, D. (2013). Single nucleotide polymorphism genotyping
using BeadChip microarrays. Current Protocols in Human Genetics, 78, 2.9.1–2.9.34. doi:
10.1002/0471142905.hg0209s78.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... 1000 Genome Project Data
Processing Subgroup (2009). The sequence alignment/Map format and SAMtools. Bioinformatics, 25,
2078–2079. doi: 10.1093/bioinformatics/btp352.
Marchini, J., Howie, B., Myers, S., McVean, G., & Donnelly, P. (2007). A new multipoint method for
genome-wide association studies via imputation of genotypes. Nature Genetics, 39, 906–913. doi:
10.1038/ng2088.
Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2(12),
e190. doi: 10.1371/journal.pgen.0020190.
Porcu, E., Sanna, S., Fuchsberger, C., & Fritsche, L. G. (2013). Genotype imputation in
genome-wide association studies. Current Protocols in Human Genetics, 78, 1.25.1–1.25.14. doi:
10.1002/0471142905.hg0125s78.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal
components analysis corrects for stratification in genome-wide association studies. Nature Genetics,
38(8), 904–909. doi: 10.1038/ng1847.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., ... Sham, P. C. (2007).
PLINK: A toolset for whole-genome association and population-based linkage analysis. American Journal
of Human Genetics, 81, 559–575. doi: 10.1086/519795.
Thornton, T. A. (2015). Statistical methods for genome-wide and sequencing association studies of
complex traits in related samples. Current Protocols in Human Genetics, 84, 1.28.1–1.28.9. doi:
10.1002/0471142905.hg0128s84.
Turner, S., Armstrong, L. L., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., . . . Ritchie,
M. D. (2011). Quality control procedures for genome-wide association studies. Current Protocols in
Human Genetics, 68, 1.19.1–1.19.18. doi: 10.1002/0471142905.hg0119s68.
Wigginton, J. E., & Abecasis, G. R. (2005). PEDSTATS: Descriptive statistics, graphics and quality assess-
ment for gene mapping data. Bioinformatics, 21, 3445–3447. doi: 10.1093/bioinformatics/bti529.
Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: A tool for genome-wide complex
trait analysis. American Journal of Human Genetics, 88(1), 76–82. doi: 10.1016/j.ajhg.2010.11.011.
INTERNET RESOURCES
https://samtools.github.io/bcftools/bcftools.html
BCFTOOLS: a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary
counterpart BCF.
https://github.com/DReichLab/EIG
EIGENSOFT: Uses genome-wide data to determine principal components of individuals.
https://genome.sph.umich.edu/wiki/EPACTS
EPACTS: Efficient and Parallelizable Association Container Toolbox.
http://cnsgenomics.com/software/gcta/#Overview
GCTA: A tool for genome-wide complex trait analysis.
https://www.well.ox.ac.uk/cfreeman/software/gwas/gtool.html
GTOOL: A program for transforming sets of genetic data.
https://cran.r-project.org/web/packages/GWAF/index.html
GWAF: an R package for association testing of genotypes and inferred SNPs using binary or continuous
phenotypes.
https://www.broadinstitute.org/haploview/downloads
Haploview: a program designed to simplify and expedite the process of haplotype analysis by providing a
common interface.
http://illumina.com
Illumina Web site containing information about various genotyping arrays, reference files, and manuals,
including how to download and use Illumina GenomeStudio to convert raw data into a Final Report.
https://genepi.qimr.edu.au/staff/manuelF/multivariate/main.html
MV-PLINK: A multivariate test of association.
https://csg.sph.umich.edu/abecasis/pedstats/index.html
PEDSTATS: A tool for the validation and summary of pedigree files.
https://pngu.mgh.harvard.edu/purcell/plink/
PLINK: an open-source whole-genome association analysis toolset, version 1.07.
http://www.cog-genomics.org/plink/1.9/
PLINK Web site for version 1.90b.
https://atgu.mgh.harvard.edu/plinkseq/
PLINK/SEQ: A library for the analysis of genetic variation data including rare variants and gene testing.
https://www.genabel.org/packages/ProbABEL
ProbABEL: A tool for genome-wide association analysis of imputed data.
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
SNPTEST: Analysis of single SNPs in genome association studies.
https://vcftools.github.io/man_latest.html
VCFtools: A set of tools written in Perl and C++ for working with VCF files.