PLINK: Key Functions for Data Analysis

Susan H. Slifer
University of Miami Miller School of Medicine, Miami, Florida

Genetic data analysis of large numbers of single nucleotide variants (SNVs),
including genome-wide association studies (GWAS), exome chips, and whole
exome (WES) or whole-genome (WGS) sequencing data, requires well-defined
processing steps. As a result, several freely available analytic toolkits have
been developed to streamline these processes. Among these, PLINK is the most
comprehensive in terms of its quality control and analytic modules, although
its focus remains on SNVs. PLINK fulfills two analytic needs—aiding the
process of performing quality control (QC) on large data sets and providing
basic statistical tools to analyze the variants in genetic models. The current
version of PLINK (v1.90b) has incorporated several sophisticated statistical
modeling features, such as those that were introduced by GCTA (genome-
wide complex trait analysis), including mixed-model association analysis and
cluster-based algorithms. Although PLINK is broadly applicable to
data management and analysis, in some instances other available tools offer
better options. Here we provide a practical overview of major PLINK
features with respect to QC, data management, and association mapping, along
with learned shortcuts and limitations to be considered. In cases where PLINK
features are limited, we provide alternative approaches using additional freely
available pipelines. © 2018 by John Wiley & Sons, Inc.

Keywords: association • GWAS • NGS • QC • SNV • software

How to cite this article:
Slifer, S. H. (2018). PLINK: Key functions for data analysis.
Current Protocols in Human Genetics, 97, e59. doi: 10.1002/cphg.59

INTRODUCTION
Genotyping strategies changed rapidly between the late 1990s and the first years of
the 21st century. Earlier studies had often selected genetic markers from known candidate
regions or genes, with final data sets numbering in the hundreds of variants. As
genotyping methods improved, by the middle of the first decade of the 21st century it
became possible to genotype genome-wide sets of markers and to run large-scale SNV and
indel (insertion and deletion) analyses on hundreds of thousands of variants at
once. This required a new and fast data-processing framework. We now routinely process
datasets from SNV chip arrays that contain 1 to 2 million variants. Currently, through
next-generation sequencing (NGS), whole-exome and whole-genome sequencing can
identify millions of SNVs and indels, putting additional analytic burdens on computational
pipelines. PLINK, introduced in 2007 (Purcell et al., 2007), was one of the first
freely available tools developed to enable the processing of large amounts of genotypic
data, and, along with PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq/), remains a
useful and comprehensive tool to date.

Excellent, detailed documentation on PLINK is available on the PLINK Web site. Most
of the older v1.07 commands are supported in the new v1.90b, and the older manual
can provide greater depth on some of the key functions such as options for association
tests. The PLINK v1.90b documentation often redirects to the older manual for further
clarification. In addition, there are numerous tutorials available on the Internet that will
walk the user through all steps of the process using an example dataset. The goal of this
document is to provide an overview of the most important capabilities of PLINK, along
with recommendations gathered from an analyst’s perspective.

Current Protocols in Human Genetics e59, Volume 97
Published in Wiley Online Library (wileyonlinelibrary.com).
doi: 10.1002/cphg.59
Copyright © 2018 John Wiley & Sons, Inc.

PLINK USAGE
Computer Resources
PLINK may be run on Linux, Windows, and OS X platforms. The current PLINK v1.90b
has been optimized to run much faster than previous versions. According to PLINK
documentation, this current version should enable even desktop machines to process
millions of SNVs for tens of thousands of individuals at a time. For more memory-
intensive analytic computations, a parallel option is available that allows distributed
processing across clusters. In these instances, running PLINK on a Linux machine may
be the best option.

Command Line Input


PLINK is a command-line program with flags and commands given at the command line
prompt. The basic command is plink followed by a flag --flag, indicating which input
files and parameters to use. You will get an error message at the prompt if no additional
commands are used to indicate what should be done after reading in the file. Either an
output or data-processing command needs to be added to tell PLINK how to process the
data.

PROCESSING GENOTYPIC DATA WITH PLINK


A key feature of PLINK lies in its ability to read many different forms of genotypic
data and allow reformatting for data analysis. In this way, large datasets may be filtered,
subsetted, and readied for further analysis steps. Table 1 summarizes key PLINK options
for processing genotypic data.

File Types
PLINK is designed to accept most file types commonly used for SNV genotypic data as
input. All files read into PLINK are converted to binary format internally before filtering
or calculation is performed by the program. This PLINK format is especially helpful
since it allows a compressed data set to be used for analysis. Depending on the input
format, additional map and/or phenotypic data files may be required. See Background
Information for details on file types; Table 2 summarizes common PLINK files.

Table 1 Key PLINK Options for Processing Genotypic Data

File processing step     PLINK command
Merging files            --merge, --bmerge, --merge-list
Allele mismatches        --flip, --flip-scan
Filtering individuals    --keep, --remove
Filtering SNVs           --extract, --exclude, --thin, --thin-count
X chromosome regions     --split-x, --set-hh-missing

Table 2 Common PLINK Input Files

Input file type       Number of input files   Genotypes          Pedigree structure   Map info
PLINK text files      2                       <file>.ped         <file>.ped           <file>.map
PLINK binary files    3                       <file>.bed         <file>.fam           <file>.bim
LGEN file             3                       <file>.lgen        <file>.fam           <file>.map
VCF file              1                       <file>.vcf(.gz)    None                 <file>.vcf(.gz)
Oxford format         2                       <file>.gen         <file>.sample        <file>.gen

Merging Files
Files containing different SNVs and/or samples may be merged using PLINK. PLINK
text files can be merged using --merge and PLINK binary files merged using
--bmerge. The flag --merge-list is used to merge a list of files. While there is no
requirement that the files to be merged must contain the same individuals and variants,
if they do, the genotypes are compared before being merged.

The merge command has several options for --merge-mode; the files can be merged
together or simply compared and summarized for differences. There are options to drop
mismatching calls or to use those from one file or the other.

When two (or more) datasets are merged, the name, chromosome, basepair, and alleles
are compared.

Example   File1                  File2
(a)       1 SNP1 0 10000 A C     1 SNP1 0 10000 A C
(b)       1 SNP2 0 20000 A G     1 SNP22 0 20000 A G
(c)       1 SNP3 0 30000 A G     3 SNP3 0 33333 A G

a. If the files match for chromosome, basepair, and alleles, then the merge will combine
   individuals and have a single SNP1 position:
   Merged file
   1 SNP1 0 10000 A C

b. If the chromosome, basepair, and alleles are the same but the SNV names are different,
   then the merged file will contain records for both “SNP2” and “SNP22,” with PLINK
   warning that this is the case. If the --merge-equal-pos command is used, then
   the two SNVs will be combined into a single SNV in the merged file called “SNP2,”
   the name used in the first file.
   Merged file (with --merge-equal-pos)
   1 SNP2 0 20000 A G

c. If the same SNV name is used in both files, but there is a difference in chromosome
   or basepair position, the merged file will use the basepair location of the first file:
   Merged file
   1 SNP3 0 30000 A G

Allelic Checks When Merging
If the alleles are the same for SNVs in the files to be merged, the merge can happen.
Otherwise, SNVs with identical location and differing alleles will be flagged:
Example   File1                  File2
(a)       1 SNV1 0 10000 A C     1 SNV1 0 10000 C A
(b)       1 SNV2 0 20000 A G     1 SNV2 0 20000 C T
(c)       1 SNV3 0 30000 A G     1 SNV3 0 30000 C A
(d)       1 SNV4 0 40000 A T     1 SNV4 0 40000 T A
(d)       1 SNV5 0 50000 G C     1 SNV5 0 50000 C G

a. In this case, the two files will merge because they each contain an SNV with A and
C alleles (order of the alleles in the map files used by the PLINK binary format (bim
files) does not matter; this simply indicates that the minor allele is different in each
file before the merge).
b. Merging an SNV that was typed on opposite strands in the two files results in a
mismatch error, and PLINK lists the affected SNVs in a .missnp file. The --flip command
changes the alleles to the opposite strand.
This example shows what two files would look like if the SNVs were typed on opposite
strands for the two sets. In the first file, the alleles possible for the SNV are A and G,
which are the reverse complement of C and T in the second file. Once the --flip
command is used on one of the two files, the two files can be merged.

c. Sometimes the two sets of alleles are not reverse complements of each other. This
could be the case with a tri-allelic variant, or a mistake in genotyping. This will also
be listed as an error in the .missnp file.
In this case, the reverse complement of the second file’s alleles (C,A → G,T) will not
cause them to match with the first file’s A,G. The SNV should be dropped or the correct
alleles determined and only genotypes from that file saved.

d. The last case involves SNVs that have either A,T or C,G for the two alleles of the
SNV. It is impossible to determine from the alleles alone on which strand the SNV in each
of the two files was typed. However, you may be able to use --flip-scan, coding the
phenotype field to indicate which data set each sample came from (as a stand-in for
case-control status). This compares allele frequency and LD patterns between the two
sets of calls.
Often the safest thing in large datasets when A,T or C,G flips may have occurred is to
drop all of those SNVs as part of the QC process.
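
The four cases above can be sketched as a small classifier. This is an illustrative Python sketch of the logic only, not PLINK's implementation; the function name classify_alleles and the category labels are assumptions made for this example.

```python
# Complement of each base, used to model typing on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(alleles1, alleles2):
    """Compare the two alleles of one SNV as coded in two files.

    Returns one of (hypothetical labels):
      "match"     - same alleles, order ignored (case a)
      "flip"      - alleles agree after strand flip (case b)
      "mismatch"  - alleles disagree on either strand (case c)
      "ambiguous" - A/T or C/G SNV, strand cannot be determined (case d)
    """
    set1, set2 = frozenset(alleles1), frozenset(alleles2)
    flipped2 = frozenset(COMPLEMENT[a] for a in alleles2)
    if set1 == set2 and set1 == flipped2:
        return "ambiguous"
    if set1 == set2:
        return "match"
    if set1 == flipped2:
        return "flip"
    return "mismatch"


# The five SNVs from the example table above.
examples = {
    "SNV1": (("A", "C"), ("C", "A")),
    "SNV2": (("A", "G"), ("C", "T")),
    "SNV3": (("A", "G"), ("C", "A")),
    "SNV4": (("A", "T"), ("T", "A")),
    "SNV5": (("G", "C"), ("C", "G")),
}
results = {name: classify_alleles(a, b) for name, (a, b) in examples.items()}
```

SNVs labeled "flip" are candidates for --flip, while those labeled "mismatch" or "ambiguous" are candidates for exclusion or manual review.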
Extended Family Data
Samples genotyped for GWAS or NGS studies are often restricted to a key set of individ-
uals. In cases where these samples are part of extended families, other connecting family
members may not have genotypic data. While most QC checks do not require these
individuals, any checking of relationships or later data analysis of extended families will
need complete family structure.

The PLINK long (lgen) format uses a family file that may include additional, non-genotyped
individuals who are not listed in the genotype file. Adding connecting individuals
who were not genotyped/sequenced can be done most easily by including them
in this family file.

To add connecting individuals, files can be saved to lgen format and then, after modifications
are made to its fam file, the lgen format can be converted to a text or binary format.
These files will now contain connecting, non-genotyped individuals.

Figure 1 Pedigree plot of a complete family for PLINK analysis.

1. Recode a PLINK file to the lgen format.
   The initial PLINK file contains two individuals (F1_1 and F1_100) with no family
   information.
       plink --file init --recode lgen --out temp
   temp.fam
       F1 1 0 0 1 1
       F1 100 0 0 2 1
2. Copy the extended family information to the same prefix as the .lgen file.
   The two individuals are siblings; their parents need to be added to the file.
       cp extended.fam temp.fam
   extended.fam
       F1 1 1000 1001 1 1
       F1 100 1000 1001 2 1
       F1 1000 0 0 1 1
       F1 1001 0 0 2 1
3. Use PLINK to recode the files into a new binary format.
       plink --lfile temp --make-bed --out Merge
   Merge.fam
       F1 1 1000 1001 1 1
       F1 100 1000 1001 2 1
       F1 1000 0 0 1 1
       F1 1001 0 0 2 1
See Figure 1.

Filtering Files
Files read into PLINK can be subset based on sample and/or SNV parameters. Commands
such as --chr will output only the chromosomes listed and --not-chr will exclude
them. SNVs can also be selected randomly with --thin and --thin-count.

Samples from a list can be included using --keep or excluded with --remove. A list
of variants can be subsetted from the overall set by using --extract or dropped with
--exclude.

Key point
Make sure SNV names have been updated if initially input as missing (.) and are being
filtered by name; this is an issue often seen with VCF files. For details of how to add
names, see Background Information, “VCF files.”

X Chromosome Genotypes
PLINK processes X chromosome data for females using 3 genotypes and 2 alleles
(as expected with autosomes). For males, all X chromosome data is expected to be
homozygous. The genotypes are coded as if the sample has two copies of the allele,
instead of the actual single copy plus Y chromosome (C/C rather than C/Y).

Example
Alleles   Female genotypes possible   Male genotypes possible
A,C       A/A, A/C, C/C               A/A or C/C

It is important to remember that although this is the general rule for X genotypes, two
pseudo-autosomal regions exist on the X chromosome. In these two regions, males can
be correctly heterozygous by recombining with the corresponding regions on Y. Often
genotypic data is already coded to reflect this by using “X” or “23” to represent the X
chromosome and “XY” or “25” to represent the pseudo-autosomal part of X.

Each time PLINK runs, it identifies heterozygous haploid calls (such as heterozygous
X chromosome SNVs in males) and non-male Y chromosome calls, and writes them to an .hh
file. Some of these apparent errors may actually lie in the pseudo-autosomal regions of X.
The --split-x flag can be used to parse X chromosome data into its pseudo-autosomal
regions so that these SNVs can be told apart from true errors. This is an important step
before assessing gender assignments.

In order to zero out these genotypes from the final file after any errors have been corrected,
use --set-hh-missing. Once this is addressed, sex checking options can be run.

BASIC QUALITY CONTROL USING PLINK


Quality control (QC) is an essential first step before running data analysis (also see
Turner et al., 2011). After this process, the overall data set should contain data cleaned of
spurious genotyping errors, and any sample or SNV errors should be identified. Several
key QC steps can be run on a file after being read into PLINK. Most of these steps have
two options: to summarize a dataset or to filter the data itself. PLINK documentation
separates these two kinds of processes as “Basic Statistics” and “Input Filtering,” but
it is easier to think of them as related options. PLINK QC steps include checking for a
variety of sample and SNV errors. Such checking is one of the key features of PLINK in
data analysis. Table 3 lists fundamental QC steps.

Key point
By default, non-founders (those with parents) are not counted by several of PLINK’s
QC steps. This can sometimes result in very few individuals being used for these calculations.
Use the --nonfounders flag to include them, which may be necessary for
small sample sets, or output statistics will be NaN (Not a Number) indicating that the
calculations could not be performed.

Table 3 Fundamental Quality Control Steps Used in PLINK

Quality control step               PLINK summary commands                       PLINK filtering commands
Missingness                        --missing                                    --geno, --mind
Differential missingness           --test-missing
Allele frequency check             --freq                                       --maf
Hardy-Weinberg equilibrium check   --hardy                                      --hwe
Gender check                       --check-sex, --check-sex ycount
Mendelian error check              --mendel, --mendel-duos, --mendel-multigen   --set-me-missing, --zero, --zero-cluster
Relatedness checking               --genome
Evaluate principal components      --pca
Concordance check                  --merge-mode 7

Missingness
Both samples and SNVs can be excluded from the dataset if they fail to meet a missing
data threshold. The rate of missingness can be checked in both samples (% missing for
a sample across your variants) and SNVs (% missing for a particular variant among
your samples) with the command --missing and then may be used to drop samples
(--mind) or SNVs (--geno) based on a threshold.
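
To make the two directions of missingness concrete, the following Python sketch computes both rates from a small genotype matrix. It is an illustration of the arithmetic only, not PLINK's code; the function name missing_rates, the toy data, and the thresholds are assumptions for this example, and None stands in for a missing call.

```python
def missing_rates(genotypes):
    """Return (per-sample, per-variant) missing-call rates.

    genotypes maps a sample ID to a list of calls, one per variant
    (same variant order for every sample); None marks a missing call.
    """
    samples = list(genotypes)
    n_variants = len(genotypes[samples[0]]) if samples else 0
    if n_variants == 0:
        return {}, []
    # Fraction of variants missing within each sample (compare --mind).
    sample_rate = {
        s: sum(call is None for call in genotypes[s]) / n_variants
        for s in samples
    }
    # Fraction of samples missing at each variant (compare --geno).
    variant_rate = [
        sum(genotypes[s][j] is None for s in samples) / len(samples)
        for j in range(n_variants)
    ]
    return sample_rate, variant_rate


# Toy data: three samples typed at four variants.
calls = {
    "S1": ["A/A", "A/C", None, "C/C"],
    "S2": ["A/C", None, None, "C/C"],
    "S3": ["A/A", "A/A", None, "A/C"],
}
sample_rate, variant_rate = missing_rates(calls)

# Hypothetical thresholds, chosen only for this toy example.
failed_samples = [s for s, r in sample_rate.items() if r > 0.4]
failed_variants = [j for j, r in enumerate(variant_rate) if r > 0.5]
```

In practice the order of filtering matters: dropping a poorly typed variant first can rescue samples that would otherwise exceed the sample threshold, and vice versa.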

Key point
“F_MISS” in the .imiss and .lmiss output files is the frequency of missing data. It is
different from the “NMISS” value output in association analysis, which is the number of
non-missing genotypes.

Differential Missingness
PLINK offers a basic association test between cases and controls to determine whether
the rate of missing data differs significantly between the two groups; this option is
--test-missing. This is an important check in case-control data sets to ensure that any
associations found in the analysis are not based on different amounts of missing data in
the two groups.

Allele Frequency
For older types of GWAS analysis, the minor allele frequency (MAF) threshold was
generally restricted to > 5%. Newer chips and NGS often target rare variants and are
interested in these low frequency SNVs. Regardless, less common SNVs may need to be
identified, as they are not always analyzed in the same way as common variants. The
calling of rare SNVs is also more prone to error, so QC can help identify issues. The
minor allele frequency of the data set can be calculated using the --freq flag. If the
SNVs should be filtered based on the minor allele frequency, use --maf.

Key point
Both of these default to founders-only unless the --nonfounders flag is used.
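
As a minimal illustration of the quantity being reported and filtered on, the sketch below computes a minor allele frequency from genotype counts. It is not PLINK's code; the function name and the 5% threshold used in the example are assumptions.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from counts of the three genotypes.

    n_aa and n_bb are the two homozygote counts and n_ab the
    heterozygote count among the individuals being counted
    (for example, founders only). Returns NaN if nobody is counted.
    """
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        return float("nan")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    # The minor allele is whichever of the two is less frequent.
    return min(freq_a, 1.0 - freq_a)


maf_common = minor_allele_frequency(10, 20, 70)   # allele A at 40/200
maf_rare = minor_allele_frequency(0, 2, 98)       # allele A at 2/200
keep_common = maf_common >= 0.05                  # hypothetical threshold
keep_rare = maf_rare >= 0.05
no_founders = minor_allele_frequency(0, 0, 0)     # nothing to count
```

The last call mirrors the situation described in the Key point on founders: when no individuals are counted, the statistic cannot be computed and is reported as NaN.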

Hardy-Weinberg Equilibrium
Tests for Hardy-Weinberg equilibrium are a key step in the QC process since they can
indicate problematic or badly-called SNVs that should be removed. Because of the
inclusion of rare variants in newer data sets, these tests may be more appropriate
for more common SNVs (>5%).

Deviations from Hardy-Weinberg can be calculated using the --hardy option. To
exclude SNVs based on Hardy-Weinberg statistics, use --hwe.

Key point
Be sure to use the --nonfounders flag if you have related samples and would like to
use an unrelated subset to determine this.
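
The idea behind the test can be sketched with a 1-df chi-square goodness-of-fit comparison of observed and expected genotype counts. This Python example is a simplified stand-in chosen for illustration; PLINK's own --hardy output may be computed differently (for example, with an exact test), and the function name is an assumption.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for deviation from Hardy-Weinberg proportions.

    Returns (chi2, p_value), or (NaN, NaN) for a monomorphic or empty SNV.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        return float("nan"), float("nan")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return float("nan"), float("nan")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


in_equilibrium = hwe_chi_square(25, 50, 25)   # matches expectations exactly
no_hets = hwe_chi_square(50, 0, 50)           # complete lack of heterozygotes
```

A complete absence of heterozygotes at a common SNV, as in the second call, is the kind of pattern that often points to a genotype-calling problem.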

Gender Check
If X chromosome data is present, the sex of the samples can be determined from the input
genotypes using --check-sex. If present, Y chromosome SNVs may also be used
with the --check-sex ycount command. The output from the sex check command
uses a het statistic to indicate how far a sample’s X chromosome genotypes deviate from
the expected heterozygosity for that sex.

Key point
PLINK guidelines define males as having an inbreeding homozygosity estimate of F >
0.8 and females as having F < 0.2. PLINK will label samples outside these
thresholds as PROBLEM in the output file. However, samples quite frequently
deviate slightly from these thresholds and should still be considered correctly identified
in the data set.
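
The thresholding described above can be sketched as follows. This is an illustration of the decision rule only, under the assumption that sex is coded 1 = male, 2 = female, 0 = unknown as in PLINK pedigree files; the function name and return values are hypothetical and do not reproduce PLINK's output format.

```python
def check_sex(pedigree_sex, f_stat, male_min=0.8, female_max=0.2):
    """Infer sex from the X chromosome F estimate and compare to the pedigree.

    pedigree_sex: 1 = male, 2 = female, 0 = unknown.
    Returns (inferred_sex, status), where status is "OK" or "PROBLEM".
    """
    if f_stat > male_min:
        inferred = 1
    elif f_stat < female_max:
        inferred = 2
    else:
        inferred = 0  # between the thresholds: no call
    status = "OK" if inferred != 0 and inferred == pedigree_sex else "PROBLEM"
    return inferred, status


clear_male = check_sex(1, 0.95)
borderline_female = check_sex(2, 0.25)   # slightly above the female threshold
swapped = check_sex(2, 0.99)             # recorded female, looks male
```

A sample like the borderline one is flagged but deserves a second look before it is treated as an error, whereas the last case looks like a genuine sample or record swap.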

Mendelian Errors
One important difference in running an analysis using related samples versus unrelated
cases and controls is that it is possible to identify incorrectly inherited genotypes—those
that do not match with what is assumed a priori. Since a child inherits one allele at a
locus from each parent, these copies should segregate accordingly. All child genotypes
should have an allele from each parent. These errors may be random genotyping “noise,”
may indicate a problem with the samples, or (rarely) indicate inheriting a deletion from
one parent.

Earlier versions of PLINK (i.e., v. 1.07) would check Mendelian errors only in complete
trios, families where both parents and child were typed. A Mendelian error is detected
when an allele contained in the genotype of the child is not present or is different from
those present in either parent. While it is possible to have de novo alleles (absent in both
parents and newly mutated in the child), this is very rare; thus, most instances would be
considered as Mendelian errors. The current version has flags that allow children with
only one typed parent (--mendel-duos) and extended families (--mendel-multigen) to
be checked as well. Use --mendel to generate a report and --set-me-missing
to filter.
Examples of Mendelian Errors
    Family  Individual  Father  Mother  Genotype
(a) Incorrect Trio
    F1      1           1000    1001    A/C
    F1      1000        0       0       C/C
    F1      1001        0       0       C/C
    Individual F1_1 has an “A” allele that is not present in either parent.
(b) Incorrect Duo
    F2      1           1000    1001    G/G
    F2      1000        0       0       0/0
    F2      1001        0       0       A/A
    Individual F2_1 does not inherit either of the mother’s “A” alleles.
(c) Incorrect Extended Family
    F3      1           1000    1001    A/A
    F3      1000        0       0       A/C
    F3      1001        100     101     0/0
    F3      100         0       0       C/C
    F3      101         0       0       C/C
    Individual F3_1 has a mother who can be inferred to have a “C” allele. This is
    inconsistent with F3_1 being homozygous for “A”. See Figure 2.
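
The trio and duo checks in examples (a) and (b) can be expressed as a short consistency test. The Python sketch below is illustrative and is not PLINK's algorithm; the function names are assumptions, genotypes are written as in the examples ("A/C"), and "0/0" marks an untyped individual. It deliberately does not infer missing parental genotypes from grandparents, so it would not detect example (c).

```python
def parse(genotype):
    """Split 'A/C' into ('A', 'C'); return None for a missing call ('0/0')."""
    alleles = tuple(genotype.split("/"))
    return None if "0" in alleles else alleles


def is_mendelian_consistent(child, father="0/0", mother="0/0"):
    """True if the child's genotype is compatible with the typed parent(s)."""
    kid, dad, mom = parse(child), parse(father), parse(mother)
    if kid is None:
        return True  # nothing to check
    a, b = kid
    if dad is not None and mom is not None:
        # Trio: one allele must come from each parent.
        return (a in dad and b in mom) or (b in dad and a in mom)
    typed = dad if dad is not None else mom
    if typed is None:
        return True  # no typed parent
    # Duo: the child must share at least one allele with the typed parent.
    return a in typed or b in typed


trio_error = is_mendelian_consistent("A/C", father="C/C", mother="C/C")
duo_error = is_mendelian_consistent("G/G", father="0/0", mother="A/A")
trio_ok = is_mendelian_consistent("A/C", father="A/A", mother="C/C")
# Example (c) with the untyped mother: passes here because the
# grandparents are not used to infer her genotype.
extended_missed = is_mendelian_consistent("A/A", father="A/C", mother="0/0")
```

The last call shows why a multigenerational check is needed for extended families: the inconsistency only becomes visible once the mother's genotype is inferred from her parents.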

Key points
In some cases, large extended families may not be completely cleaned by PLINK. Once
the outstanding errors are determined by a program such as PEDSTATS (Wigginton &
Abecasis, 2005), they can be zeroed out in PLINK using the --zero and --zero-cluster
commands.

Use the --nonfounders flag when running these commands.

Relatedness
While the --mendel option evaluates each SNV separately, the --genome command
looks at all genotypes simultaneously and can be used to determine if samples are
related correctly within the file. This is especially helpful if you have a set of unrelated
individuals and you want to identify any duplicate samples. The command calculates
identity-by-descent (IBD) between pairs of samples based on the putative relative type
in the pedigree file. IBD is a measure of the type of relatedness between two samples; the
idea is to look at how much sharing of alleles from a common ancestor these individuals
have, with PLINK outputting both the expected and actual sharing between pairs of
individuals. SNVs should be LD-pruned first, before this command is run, to ensure that
independently inherited SNVs are evaluated.

Figure 2 Pedigree plot of Family 3 with Mendelian inconsistency.

Table 4 Expected Values for Common Types of Relationships

Relationship type        z0     z1     z2     PI_HAT
Unrelated                1      0      0      0
Monozygotic (MZ) twin    0      0      1      1
Full siblings            0.25   0.5    0.25   0.5
Half siblings            0.5    0.5    0      0.25
Parent-offspring         0      1      0      0.5

The output from the --genome command approximates the overall proportion of IBD
sharing, representing each pair as sharing, across all sites, zero alleles IBD (z0), one
allele IBD (z1), or two alleles IBD (z2). Unrelated individuals should share
zero alleles IBD (z0 = 1), while MZ twins or duplicate samples would share both
alleles at all sites (z2 = 1). PI_HAT is the proportion IBD, defined as P(IBD = 2)
+ 0.5*P(IBD = 1). To exclude pairs of related samples from an unrelated data set, pairs
with a PI_HAT above a chosen threshold greater than zero can be removed. This threshold
can vary depending on the nature of the overall data set and how conservative an estimate
of relatedness is wanted.
Table 4 lists expected values of IBD sharing and PI_HAT by relationship type.
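
The PI_HAT column of Table 4 follows directly from the definition given above, as the following Python sketch shows. The function names, the sample IDs, and the 0.2 threshold are assumptions made for this illustration, not recommended values.

```python
# Expected (z0, z1, z2) sharing by relationship type, as listed in Table 4.
EXPECTED_SHARING = {
    "unrelated": (1.0, 0.0, 0.0),
    "mz_twin": (0.0, 0.0, 1.0),
    "full_siblings": (0.25, 0.5, 0.25),
    "half_siblings": (0.5, 0.5, 0.0),
    "parent_offspring": (0.0, 1.0, 0.0),
}


def pi_hat(z1, z2):
    """Proportion IBD: P(IBD = 2) + 0.5 * P(IBD = 1)."""
    return z2 + 0.5 * z1


def related_pairs(pairs, threshold):
    """Return the ID pairs whose PI_HAT exceeds the threshold.

    pairs is an iterable of (id1, id2, z0, z1, z2) tuples.
    """
    return [(a, b) for a, b, z0, z1, z2 in pairs if pi_hat(z1, z2) > threshold]


table4_pi_hat = {
    name: pi_hat(z1, z2) for name, (z0, z1, z2) in EXPECTED_SHARING.items()
}

# Hypothetical pairs from a supposedly unrelated sample set.
pairs = [
    ("S1", "S2", 1.0, 0.0, 0.0),     # unrelated
    ("S3", "S4", 0.25, 0.5, 0.25),   # looks like full siblings
    ("S5", "S6", 0.0, 0.0, 1.0),     # duplicate or MZ twin
]
flagged = related_pairs(pairs, threshold=0.2)
```

Note that full siblings and parent-offspring pairs share the same PI_HAT of 0.5, so the individual z0, z1, and z2 values are needed to tell these relationships apart.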

Evaluation of Principal Components


Principal components (PCs) may be used as an estimate of the genetic ancestry of an
unrelated group of samples (also see Hellwege et al., 2017). This is an important check
in case-control data sets to ensure that any associations found in the analysis are based
on disease status rather than the differences in the genetic background of the two groups.

These values can be used in the QC process to help decide if certain individuals are
outliers to expected ethnic groups. The PLINK --pca command will generate the top
20 PCs using the algorithm employed by GCTA (Yang et al., 2011). The PC values may
also be used subsequently in the analysis as covariates.

Concordance
Many data sets that are analyzed using GWAS or NGS have been run previously on
other genotyping platforms. To verify the integrity of the samples, genotypes can be
checked for concordance between the two platforms. This can be done by merging the
two data sets, one sample at a time. The option --merge-mode 7 will give a percent
concordance as part of the output, comparing non-zero calls in the two datasets. While
100% would be the optimal concordance, identical samples run on both platforms may
have concordance values somewhat less than 100% due to miscellaneous genotyping
errors. The IDs of the individuals to compare must be the same in both files to be merged.
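
The concordance figure can be understood as the fraction of matching calls among sites where both platforms produced a genotype. The Python sketch below illustrates that calculation for one sample; it is not the code behind --merge-mode 7, and the function name and toy genotypes are assumptions.

```python
def concordance(calls_a, calls_b, missing="0/0"):
    """Fraction of matching genotypes among sites called on both platforms.

    calls_a and calls_b list one sample's genotypes at the same SNVs,
    in the same order. Allele order within a genotype is ignored.
    Returns NaN if no site is called in both lists.
    """
    compared = 0
    matches = 0
    for a, b in zip(calls_a, calls_b):
        if a == missing or b == missing:
            continue  # only non-missing calls are compared
        compared += 1
        if sorted(a.split("/")) == sorted(b.split("/")):
            matches += 1
    return matches / compared if compared else float("nan")


platform_1 = ["A/A", "A/C", "0/0", "C/C"]
platform_2 = ["A/A", "C/A", "G/G", "C/T"]
rate = concordance(platform_1, platform_2)   # 2 of 3 comparable sites match
```

A sample with concordance far below that of the rest of the data set is more likely to be a sample swap than an accumulation of genotyping errors.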

COMMONLY USED ANALYSIS OPTIONS


PLINK was initially designed for the analysis of the more common SNVs generated
for early GWAS analysis. Several of its analysis options may continue to be
used on newer data formats as well, especially for analysis of common (MAF >5%)
SNVs. Analysis of rare variants, including options to look at clustering and gene-based
analysis, may be better run using alternate programs like PLINK/SEQ
(https://atgu.mgh.harvard.edu/plinkseq). Table 5 outlines commonly used analysis options.

Table 5 Primary PLINK Analysis Options

Analysis option                        PLINK commands
Single variant association analysis    --assoc, --linear, --logistic
Analysis of related samples            --qfam, --make-kin
Epistasis                              --fast-epistasis
Evaluate population substructure       --cluster, --pca, --mds-plot
Haplotype analysis                     --indep-pairwise

Analysis of Unrelated Samples


Case control studies
Association tests can be run by PLINK to evaluate case-control data to determine whether an
SNV has an effect on disease status (also see Hancock & Scott, 2012). The simplest
model looks at the dosage (additive effect) of the number of copies of the minor allele
of an SNV using the command --assoc. This runs a 1df chi-square allelic test for
binary traits that is essentially a Wald test. The --fisher flag can be substituted for
--assoc in lower-frequency data sets for an exact test. The commands --counts
and --ci will output counts and confidence intervals, respectively.
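
The 1-df allelic test can be illustrated with the standard chi-square statistic for a 2 x 2 table of allele counts. This Python sketch shows the arithmetic under that assumption and is not PLINK's implementation; the function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_a1, case_a2, ctrl_a1, ctrl_a2):
    """1-df chi-square test on a 2 x 2 table of allele counts.

    case_a1/case_a2: counts of the tested (A1) and other (A2) allele in cases.
    ctrl_a1/ctrl_a2: the same counts in controls.
    Returns (chi2, p_value, odds_ratio).
    """
    a, b, c, d = case_a1, case_a2, ctrl_a1, ctrl_a2
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return float("nan"), float("nan"), float("nan")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return chi2, p_value, odds_ratio


# 50 cases and 50 controls, i.e., 100 alleles in each group.
chi2, p_value, odds_ratio = allelic_chi_square(60, 40, 40, 60)
```

Because each individual contributes two alleles, this test treats alleles as independent observations; when expected counts are small, an exact test such as the one requested with --fisher is the more appropriate choice.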

If a more sophisticated regression model is desired, the command --logistic
will run regression for binary traits that includes covariates and/or interactions.
This requires a covariate file to be specified. Importantly, if any covariates
being adjusted for are missing, then that individual cannot be used in the
--logistic model regression analysis. The covariate file will need to be created
only for individuals with complete data for the trait and covariate.

The --condition option can also be included, which will test all SNVs but add in
the allelic dosage of the designated variant as a covariate. A --genotypic flag may
be used to look at genotype effects of an SNV in addition to the additive effect.

The output file from the --logistic option will list results for the SNV term and for
each covariate on separate lines. The TEST column indicates which term the line represents.

Key points
The column NMISS in logistic/linear output is the number of non-missing samples.

The A1 allele is the tested allele (usually the minor allele) and the A2 allele is the
reference allele.

Example PLINK logistic output file

CHR   SNP      BP      A1   TEST   NMISS   OR       STAT     P
1     rs1234   10000   A    ADD    810     0.4932   −1.975   0.0483
1     rs1234   10000   A    SEX    810     1.32     1.812    0.0699
1     rs1234   10000   A    AGE    810     1.954    0.2502   0.8024

In this case, the line with TEST = “ADD” contains the p value for the additive effect of
the SNV, adjusted for all covariates in the model. If only the SNV results are desired,
the --hide-covar flag will suppress the covariate-specific lines.

The option --no-snp will fit the model with only the phenotype and the covariates (no SNV
term) if an evaluation of the non-genetic effects is desired.

Plot results with Haploview
The program Haploview (Barrett, Fry, Maller, & Daly, 2005) can be used to visualize association
results generated by PLINK. The program will take an output assoc/logistic/linear
file and allow the −log10 (p values) to be plotted. There must be a column of SNV names
in the file with the header SNP, and the map information that is present in the PLINK
regression output can be used as an “integrated map file” by Haploview.

Evaluation of population substructure


It is vital in many studies to ensure that case and control groups have similar sub-
structure and that association results are not a product of the genetic architecture (see
also Turner et al., 2011 and Hellwege et al., 2017). Since SNVs often have differing allele
frequencies among different ethnic backgrounds, a case group that is predominantly a
different ethnicity than a control group may result in false positive results that do not
indicate association with the disease but with the genetic background of the populations
instead.

The --cluster option will use an approximation of identity-by-state (IBS), which
looks for alleles that are shared, regardless of whether they are derived from a common
ancestor. Output files will group individuals by identified cluster. It is also possible to
re-use a previously computed IBD genome file to perform the clustering by using the
--read-genome flag.

Another approach to evaluate substructure is to use dimension reduction. One way to do
this is to calculate the principal components (PCs) for the samples. In many analyses
with mixed samples, the PCs are used as covariates in the analysis as a way to account
for substructure. The --pca command will generate the top 20 PCs using the algorithm
employed by GCTA (Lee et al., 2011). EIGENSOFT 6.0 (Patterson, Alkes, & Reich,
2006; Price et al., 2006) can be used instead if more options, such as outlier removal, are
desired.

PLINK also offers the --mds-plot option, which will perform multidimensional
scaling. This output can be combined with the cluster file to input and visualize in
Haploview (Barrett et al., 2005).

Quantitative traits
Quantitative traits include traits that can be measured on a continuous scale rather than
dichotomized (such as case/control status). These traits can provide more power to an
analysis since they enable all variance of the trait to be used in calculations rather than
its simplification into two categories. For instance, body mass index (BMI) or age can be
analyzed using the PLINK commands --assoc or --linear.

Multiple phenotypes may be read into PLINK using a separate phenotype file. This file
should have the key fields with headers FID and IID, followed by quantitative traits and
covariates. The --linear command allows you to adjust for additional covariates.
This requires a covariate file to be specified. Each covariate must be listed for the sample
matching the main quantitative trait. Importantly, if any covariates being adjusted for
are missing, then that individual cannot be used in the --linear model regression
analysis. The phenotype and covariate files will need to be created only for individuals
with complete data for the trait and covariate.

Currently, PLINK v1.90b does not allow multi-phenotype analysis, but the program MV-PLINK
(Ferreira & Purcell, 2009), based on an earlier PLINK v1.06 version, may be
downloaded separately. In addition, SNPTEST (Marchini, Howie, Myers, McVean, &
Donnelly, 2007) has also implemented the analysis of multiple phenotypes.

Key point
The covariate file can be formatted similarly to the phenotype file. Please note that for
quantitative traits, you will want to ensure that the trait is normally distributed. In some
instances, you will need to transform the trait using log-transformation or other type of
transformation (e.g., sqrt, Box-Cox).

Analysis of Related Samples


When individuals in the cohort are related, traditional association tests using PLINK
--assoc, --linear, or --logistic will underestimate the variance and lead to
overestimates in the significance level of the association statistic (also see Thornton,
2015). In this situation, an alternative approach is needed. PLINK currently offers the
--qfam option.

However, there are alternative approaches that adjust for the kinship matrix defining
the genetically determined familial relationships in the cohort. One such program is
EMMAX (Kang et al., 2010), which has been implemented in the EPACTS software
suite (https://genome.sph.umich.edu/wiki/EPACTS). With EMMAX, you need to use the
PLINK --make-kin function to generate the kinship matrix. Then, this kinship matrix
is incorporated as a parameter in the regression-based association mapping. Another
option is the R package GWAF (Chen & Yang, 2010). Both of these programs may
be used with both dichotomous and quantitative traits. GWAF will run generalized
estimating equations, treating each family as a cluster with its own correlation matrix,
when analyzing dichotomous traits, and will fit a linear mixed model with a kinship
matrix when evaluating a quantitative trait.

Detection of Epistasis
PLINK does not offer comprehensive gene-based association testing, although some tests
have been incorporated into PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq).

PLINK allows for the testing of epistasis, which occurs when the effect of one gene
is dependent on the presence of one or more modifier genes (also see Cole, Hall,
Urbanowicz, Gilbert-Diamond, & Moore, 2017). The PLINK option for this testing examines
SNV-by-SNV epistasis using the --fast-epistasis command. This command
scans pairs of SNVs based on 3 by 3 joint genotype count tables. Newer options in
version 1.90b include the boost and joint-effects modifiers, which extend the
original command.
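
To make the 3 by 3 joint genotype count table concrete, here is a minimal Python sketch that tabulates two SNVs coded as minor-allele counts. It shows only the tabulation step and does not reproduce PLINK's test statistic; the function name and the genotype vectors are hypothetical.

```python
def joint_genotype_table(snv_a, snv_b):
    """Count individuals in each cell of the 3 by 3 joint genotype table.

    Genotypes are minor-allele counts (0, 1, 2); None marks a missing call.
    table[i][j] is the number of individuals with count i at the first SNV
    and count j at the second.
    """
    table = [[0, 0, 0] for _ in range(3)]
    for a, b in zip(snv_a, snv_b):
        if a is None or b is None:
            continue  # skip individuals missing either genotype
        table[a][b] += 1
    return table


# Hypothetical genotypes for eight individuals at two SNVs.
snv1 = [0, 0, 1, 1, 2, 2, None, 1]
snv2 = [0, 1, 1, 1, 2, 0, 1, None]
table = joint_genotype_table(snv1, snv2)
```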

Haplotype Evaluation
Linkage disequilibrium (LD) refers to variants that are inherited together in sets when
transmitted from parents to offspring. Including all such SNVs can inflate analysis
results. A subset of SNVs can be pruned from the overall set to select only those that
are in linkage equilibrium. This means that they are inherited independently rather than
in a set, or haplotype block. This can be done using either the --indep or
--indep-pairwise flags. The --indep-pairwise option is generally preferred, as it uses a
window size and a pairwise r2 threshold.

LD statistics can be output for a dataset in a variety of ways. The --r2 dprime flag
will output the squared correlation coefficient (r2) and the D-prime statistic for each
pair of SNVs. By default, --r2 drops pairs with r2 < 0.2; to include them, lower the
cutoff with --ld-window-r2 (e.g., --ld-window-r2 0).
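
As an illustration of the r2 statistic, the sketch below computes the squared Pearson correlation between allele counts at two SNVs and checks it against a 0.2 cutoff. This is an assumption-laden simplification: the function name and data are hypothetical, and PLINK's own calculation (particularly for D-prime, which relies on estimated haplotype frequencies) is more involved.

```python
def r_squared(snv_a, snv_b):
    """Squared Pearson correlation of allele counts (0, 1, 2) at two SNVs."""
    pairs = [(a, b) for a, b in zip(snv_a, snv_b)
             if a is not None and b is not None]
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when either SNV is monomorphic
    return cov * cov / (var_a * var_b)


# Hypothetical allele counts for six individuals.
snv1 = [0, 1, 2, 0, 1, 2]
snv2 = [0, 1, 2, 0, 1, 2]   # identical to snv1: complete LD
snv3 = [1, 1, 1, 0, 2, 1]   # only weakly correlated with snv1

r2_same = r_squared(snv1, snv2)
r2_diff = r_squared(snv1, snv3)
keep_pair = r2_diff >= 0.2  # would this pair pass a 0.2 reporting cutoff?
```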

Additionally, haplotype blocks, as defined by Haploview (Barrett et al., 2005), can be
generated by PLINK using the --blocks command.

BACKGROUND INFORMATION
Common File Types
PLINK is designed to accept most file types commonly used for SNV genotype data.
Depending on the format, these files may require additional maps and/or phenotypic
information. Table 2 summarizes common PLINK file types.

Key point
While PLINK will default to using the phenotype listed in the 6th column of the ped or
fam file, alternate phenotypes may be loaded using the --pheno flag. This file should
have a header, with the first two columns listed as FID and IID to match the individuals
listed in the other PLINK file types. Additional phenotypes can be either dichotomous
(1/2) or continuous (age, BMI, etc.). Any covariates needed for analysis can also be read
in using --covar, with a similar file that has a header and FID and IID as its first two
columns.

PLINK Text Files


The original “PLINK text file” format, called a ped file, consists of a space or tab-
delimited text file (no header) with the first six columns indicating pedigree information
and SNV genotypes in subsequent columns. Default values for affection status are 1 =
Control and 2 = Case, with 0 or -9 indicating a missing value. Sex is represented by 1 =
Male and 2 = Female. Each SNV is represented by two columns, one for each allele.
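
The column layout described above can be sketched as a small Python parser. The function name and the example row are hypothetical, and the sketch assumes the sixth column holds affection status (not a quantitative trait) and that alleles are whitespace-separated.

```python
def parse_ped_line(line):
    """Split one ped row into its six pedigree fields and genotype pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 pedigree columns plus two columns per SNV")
    fid, iid, father, mother, sex, status = fields[:6]
    alleles = fields[6:]
    # Each SNV occupies two consecutive columns, one per allele.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "FID": fid, "IID": iid, "father": father, "mother": mother,
        "sex": {"1": "male", "2": "female"}.get(sex, "unknown"),
        "status": {"1": "control", "2": "case"}.get(status, "missing"),
        "genotypes": genotypes,
    }


# Hypothetical row: an affected male with genotypes A/A and C/C at two SNVs.
record = parse_ped_line("F1 1 1000 1001 1 2 A A C C")
```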

Example PLINK map file


Chr SNV pos bp
1 rs12345 0 50000000
20 rs5678 0 60000000

A map file is read in with this file to indicate the names of the SNVs and their chromosome
and basepair locations. The third column, genetic position, may be left as zero if not
needed for linkage or recombination information.

Example PLINK fam file


Family(FID) Individual(IID) Father Mother Sex Aff_Status
F1 1 1000 1001 1 2
F1 1000 0 0 1 1
F1 1001 0 0 2 1

By default, PLINK expects a common prefix for any given file type read in. Thus, if a ped
file is called STUDY.ped, it will expect the map file to be STUDY.map. If the prefix is
consistent, the files can be read in with a single flag, in this case --file:

plink --file STUDY [--additional flags]


Alternate flags are used if the names of the files do not have the same prefix (--ped and
--map):

plink --ped STUDY1.ped --map STUDY2.map [--additional flags]

To output a PLINK text file, use the --recode flag.

Key point
If --out <name> is not listed in the PLINK command, the prefix of all output files
will default to plink.

Example PLINK ped file
Family(FID) Individual(IID) Father Mother Sex Aff_Status SNVgenotypes . . .
F1 1 1000 1001 1 2 A A C C
F1 1000 0 0 1 1 A G C C
F1 1001 0 0 2 1 A A C C

PLINK Binary Files


PLINK has a binary feature that allows an input format to be saved in a much smaller
binary file. This file type, called a .bed file, has an associated detailed map file (.bim)
that is identical to a PLINK map file but also indicates the two alleles of the SNV. A third
family (.fam) file is needed as well.

The files are read in using the --bfile flag. If data is input from other formats such
as VCF, the --keep-allele-order command will keep A1 and A2 in the same order as the
input. To output PLINK binary file types, use --make-bed.

PLINK identifies the minor allele (in the total dataset), which is always listed as “A1” in
the .bim file. This is also an easy way to determine which SNVs are monomorphic and
should be dropped.

Example PLINK bim file


Chr SNV Pos Bp A1 A2
(a) 1 rs12345 0 50000000 C T
(b) 20 rs5678 0 60000000 0 G
a. The location of the “C” allele in this bim file indicates that it is the minor allele.
b. The SNV is monomorphic with a zero listed as the first allele.
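
A short Python sketch of this check follows, using the two rows from the example bim file. The function name is hypothetical, and the sketch assumes that a monomorphic SNV is always flagged by a 0 in the A1 column, as in the example.

```python
def monomorphic_snvs(bim_lines):
    """Return names of SNVs whose A1 (minor allele) column is coded as 0."""
    names = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, name, cm, bp, a1, a2 = fields
        if a1 == "0":
            names.append(name)
    return names


# The two rows from the example bim file.
bim = [
    "1 rs12345 0 50000000 C T",
    "20 rs5678 0 60000000 0 G",
]
to_drop = monomorphic_snvs(bim)
```

The resulting names could be written one per line to a text file and supplied to PLINK's --exclude flag to drop those SNVs.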

Key point
The third column, genetic position, may be left as zero if not needed for linkage or
recombination analysis.

Long Format (LGEN) Files


Final report files generated by Illumina GenomeStudio (https://illumina.com; also see
Lambert, Tsinajinnie, & Duggan, 2013 and Igo, Cooke Bailey, Romm, Haines, & Wiggs,
2016) can be read into PLINK, after minimal reformatting, using the --lgen option.
These file types may be output using --recode transpose and --recode lgen.

To make an lgen file for PLINK:

1. Choose the Standard Final Report Option


2. Select 4 columns: “Sample Name, SNV Name, Allele 1, Allele 2”
3. Next, select those 4 columns, skipping the header, add in the family name, and change
missing alleles to zero

awk 'NR>10{print $0}' FinalReport.txt | awk '{print "FAM1",$1,$2,$3,$4}' |
  sed 's/ -/ 0/g' > output.lgen

4. Use the SNP_Map.txt file to generate a map file

awk 'NR>1{print $3,$2,"0",$4}' SNP_Map.txt > output.map

5. A 6-column family file like the .fam file used with PLINK binary files is also
necessary for this file input type.

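
The reformatting in step 3 can also be sketched in Python, which makes the handling of missing alleles explicit. This is a hedged illustration rather than a drop-in replacement: it assumes a tab-delimited Final Report with ten header rows and "-" as the missing-allele code, and the function name and sample rows are hypothetical.

```python
def final_report_to_lgen(lines, family_id, header_lines=10):
    """Convert Final Report rows (sample, SNV, allele 1, allele 2) to lgen rows.

    The first header_lines rows are skipped, and missing alleles ("-")
    are recoded as "0".
    """
    out = []
    for line in lines[header_lines:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        sample, snv, a1, a2 = fields[:4]
        a1 = "0" if a1 == "-" else a1
        a2 = "0" if a2 == "-" else a2
        out.append(" ".join([family_id, sample, snv, a1, a2]))
    return out


# Hypothetical report: ten header rows followed by two genotype rows.
report = ["header"] * 10 + ["S1\trs12345\tA\tG\n", "S1\trs5678\t-\t-\n"]
lgen_rows = final_report_to_lgen(report, "FAM1")
```

Recoding each allele field individually avoids accidentally altering sample or SNV names that happen to contain a hyphen.
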
VCF Files
VCF files (Variant Call Format) generated from sequencing analysis can be read into
PLINK and converted to binary files. These files are input into PLINK using the --vcf
flag with the input file name. Data may be output from PLINK into VCF format using
--recode vcf.

NGS data often contains annotation information in the INFO field in addition to the
genotypes of the samples. Not all of this detailed annotation is kept in PLINK,
but it can be retained using PLINK/SEQ. A limited amount of filtering can be done when
reading a VCF file into PLINK; minimum QUAL and GQ values can be set using
--vcf-min-qual and --vcf-min-gq. Further filtering and manipulation of VCF files can be
done using VCFtools (Danecek et al., 2011) or BCFTOOLS
(https://samtools.github.io/bcftools/bcftools.html).

Key point
Values from the INFO flags in the VCF file will be lost after being read into PLINK.
REF and ALT allele assignments will not be conserved unless additional flags are used.
A program such as PLINK/SEQ is better equipped to handle these options.

Alleles are ordered in a VCF file according to their REF and ALT designation. When
read into PLINK, this information is lost if the --keep-allele-order command
is not used. VCF files often contain indels as well as SNVs, and may have locations
with multiple alleles. Neither of these scenarios is handled by PLINK unless the indels
are coded as bi-allelic insertions/deletions and multi-allelic sites are skipped using
--biallelic-only. For analysis of more complicated indels and CNVs, PLINK/SEQ is
recommended.

Key point
NGS data for novel SNVs is often missing an SNV name in the third column of a VCF
file. These SNVs when input into PLINK will be given “.” for names. The chromosome
and basepair position (and alleles if needed) can be used for names.
Example
init.bim
1 . 0 50000 A C

This can be updated in the bim file using awk:

awk '{if($2==".") print $1,$1":"$4":"$5":"$6,$3,$4,$5,$6; else print $0}' init.bim > fix.bim

fix.bim
1 1:50000:A:C 0 50000 A C

Imputed Genotypes/Oxford Format


Imputed data (also see Porcu, Sanna, Fuchsberger, & Fritsche, 2013) that is generated
from programs such as IMPUTE2 (Howie, Donnelly, & Marchini, 2009) will code the
probabilities of the three possible genotypes for an SNV (p = A/A, p = A/B, p = B/B)
rather than outputting a single hard call of the genotype. Imputation data in the Oxford
format file from IMPUTE2 can also be loaded into the program using --data for
STUDY.gen and STUDY.sample or --gen and --sample for files with different prefixes.
Use --recode oxford to output data from PLINK into Oxford format.

Table 6 Additional Analysis Software

Name       Type of analysis                                            Web site
BCFTOOLS   Software used to manipulate VCF files                       https://samtools.github.io/bcftools/bcftools.html
EIGENSOFT  Calculate principal components using genetic data           https://github.com/DReichLab/EIG
EPACTS     Association analysis of related individuals                 https://genome.sph.umich.edu/wiki/EPACTS
GCTA       Genome-wide complex trait analysis                          http://cnsgenomics.com/software/gcta/#Overview
GTOOL      A program to transform different types of genetic data      https://www.well.ox.ac.uk/cfreeman/software/gwas/gtool.html
GWAF       Analyze imputed data and related samples                    https://cran.r-project.org/web/packages/GWAF/index.html
Haploview  Graphically represent association results and LD            https://www.broadinstitute.org/haploview/downloads
MV-PLINK   Multivariate analysis of association                        https://genepi.qimr.edu.au/staff/manuelF/multivariate/main.html
PEDSTATS   Validate and check family pedigree files                    https://csg.sph.umich.edu/abecasis/pedstats/index.html
PLINK/SEQ  Analysis of rare variants and gene testing                  https://atgu.mgh.harvard.edu/plinkseq/
ProbABEL   Association analysis of imputed data                        https://www.genabel.org/packages/ProbABEL
SNPTEST    Analysis of single SNPs in genome-wide association studies  https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
VCFtools   Software for processing VCF files                           https://vcftools.github.io/man_latest.html

PLINK allows a threshold to be set when reading in imputed data; the default is 0.1.
Probabilities that meet the threshold are converted to hard calls, and the remaining
genotypes are set to missing. Other genetic analysis programs will use all available
data instead of converting the genotypes and generating missing data. Such programs
include SNPTEST (Marchini et al., 2007), ProbABEL (Aulchenko, Struchalin, & van Duijn,
2010), and the R package GWAF (Chen & Yang, 2010).

Key points
For imputed data, PLINK uses a default threshold of 0.1 to convert imputed genotypes
to hard-call genotypes and store them in PLINK binary format. This threshold can be
changed as needed, and in order to keep all imputed genotypes, it can be set at 0.49
using the --hard-call-threshold flag.
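
The effect of the threshold can be sketched in Python. Note the assumption, which is not confirmed by the text above: uncertainty is taken here to be 1 minus the largest of the three genotype probabilities, and PLINK's internal definition may differ. The function name and probabilities are hypothetical.

```python
def hard_call(p_aa, p_ab, p_bb, threshold=0.1):
    """Return the most likely genotype index (0, 1, 2) or None if too uncertain.

    ASSUMPTION: uncertainty is 1 minus the largest probability; PLINK's
    internal definition may differ.
    """
    probs = [p_aa, p_ab, p_bb]
    best = max(range(3), key=lambda i: probs[i])
    if 1.0 - probs[best] > threshold:
        return None  # treated as a missing genotype
    return best


confident = hard_call(0.97, 0.02, 0.01)                 # called as A/A
uncertain = hard_call(0.70, 0.25, 0.05)                 # missing at the default
relaxed = hard_call(0.70, 0.25, 0.05, threshold=0.49)   # called as A/A
```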

Both PLINKv1.07 and PLINKv1.90b allow a --dosage command that is poorly inte-
grated with the rest of the PLINK commands. It will run an association test by default
that allows the genotypes to have decimal allele counts for dosage (a continuous number
from 0 to 2), representing the number of minor alleles from imputed data.
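
For reference, a dosage on the 0 to 2 scale can be derived from the three genotype probabilities as an expected allele count. The sketch below assumes that B is the counted allele; the function name and values are hypothetical.

```python
def expected_dosage(p_aa, p_ab, p_bb):
    """Expected count of the B allele: 0*P(A/A) + 1*P(A/B) + 2*P(B/B)."""
    total = p_aa + p_ab + p_bb
    if total <= 0:
        raise ValueError("genotype probabilities sum to zero")
    return (p_ab + 2.0 * p_bb) / total  # renormalize in case of rounding


dosage = expected_dosage(0.70, 0.25, 0.05)
```

Unlike a hard call, the dosage retains the uncertainty of the imputation: the same probabilities that would be set to missing at a strict threshold still contribute a fractional allele count.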

Additional Software
Although PLINK allows flexibility in the type of input files, it was originally
designed for analysis of GWAS SNVs. Accordingly, file processing, QC, and analysis for
both imputation and NGS data may be performed more robustly using other
software and pipeline options. Both VCFtools (Danecek et al., 2011) and BCFTOOLS
(https://samtools.github.io/bcftools/bcftools.html) offer file processing and filtering of
VCF files. The utility program GTOOL (Marchini et al., 2007) is helpful when working
with Oxford format data. Additionally, in recent years, other analytic suites including
GCTA (genome-wide complex trait analysis; Yang et al., 2011), SNPTEST
(Marchini et al., 2007), and EPACTS (efficient and parallelizable association container
toolbox; https://genome.sph.umich.edu/wiki/EPACTS) have incorporated several
sophisticated modeling features, including mixed model association analysis and
cluster-based algorithms. PLINK (v1.90b) has adapted features of GCTA into its pipeline.
Table 6 presents a summary of additional software for genetic data analysis.

SUMMARY/CONCLUDING REMARKS
The development of PLINK in 2007 was a key turning point in the ability of researchers to
process and analyze the large numbers of SNVs that had started to be routinely generated
by labs in the early years of the 21st century. The program has subsequently been updated
to allow the input of other types of data—including imputed and sequencing data—that
have become more common in the subsequent years. It offers the advantage of a single,
easy-to-use program that can process genotyping datasets, perform quality control, and
run basic analyses.

ACKNOWLEDGEMENTS
The author is partially supported by the John P. Hussman Institute for Human Genomics
at the University of Miami Miller School of Medicine, and would like to thank Dr.
Evadnie Rampersaud for valuable input and discussion.

LITERATURE CITED
Aulchenko, Y. S., Struchalin, M. V., & van Duijn, C. M. (2010). ProbABEL package for genome-wide
association analysis of imputed data. BMC Bioinformatics, 11, 134. doi: 10.1186/1471-2105-11-134
Barrett, J. C., Fry, B., Maller, J., & Daly, M. J. (2005). Haploview: Analysis and visualization of LD and
haplotype maps. Bioinformatics, 21, 263-265. doi: 10.1093/bioinformatics/bth457.
Chen, M.-H., & Yang, Q. (2010). GWAF: An R package for genome-wide association analyses with family
data. Bioinformatics, 26(4), 580–581. doi: 10.1093/bioinformatics/btp710.
Cole, B. S., Hall, M. A., Urbanowicz, R. J., Gilbert-Diamond, D., & Moore, J. H. (2017). Analysis of gene-
gene interactions. Current Protocols in Human Genetics, 95, 1.14.1–1.14.10. doi: 10.1002/cphg.45.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... 1000 Genomes Project
Analysis Group (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158. doi:
10.1093/bioinformatics/btr330.
Ferreira, M. A., & Purcell, S. M. (2009). A multivariate test of association. Bioinformatics, 25, 132–133.
doi: 10.1093/bioinformatics/btn563.
Hancock, D. B., & Scott, W. K. (2012). Population-based case-control association studies. Current Protocols
in Human Genetics, 74, 1.17.1–1.17.20. doi: 10.1002/0471142905.hg0117s74.
Hellwege, J. N., Keaton, J. M., Giri, A., Gao, X., Velez Edwards, D. R., . . . Edwards, T. L. (2017). Population
stratification in genetic association studies. Current Protocols in Human Genetics, 95, 1.22.1–1.22.23.
doi: 10.1002/cphg.48.
Howie, B. N., Donnelly, P., & Marchini, J. (2009). A flexible and accurate genotype imputation method for the
next generation of genome-wide association studies. PLoS Genetics, 5(6), e1000529. doi:
10.1371/journal.pgen.1000529.
Igo, R. P., Jr., Cooke Bailey, J. N., Romm, J., Haines, J.L., & Wiggs, J.L. (2016). Quality control for
the Illumina HumanExome BeadChip. Current Protocols in Human Genetics, 90, 2.14.1–2.14.16. doi:
10.1002/cphg.15.
Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S. Y., Freimer, N. B., ... Eskin, E. (2010).
Variance component model to account for sample structure in genome-wide association studies. Nature
Genetics, 42, 348–354. doi: 10.1038/ng.548.
Lambert, G., Tsinajinnie, D., & Duggan, D. (2013). Single nucleotide polymorphism genotyping
using BeadChip microarrays. Current Protocols in Human Genetics, 78, 2.9:2.9.1–2.9.34. doi:
10.1002/0471142905.hg0209s78.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... 1000 Genome Project Data
Processing Subgroup (2009). The sequence alignment/Map format and SAMtools. Bioinformatics, 25,
2078–2079. doi: 10.1093/bioinformatics/btp352.
Marchini, J., Howie, B., Myers, S., McVean, G., & Donnelly, P. (2007). A new multipoint method for
genome-wide association studies via imputation of genotypes. Nature Genetics, 39, 906–913. doi:
10.1038/ng2088.
Patterson, N., Price, A. L., & Reich, D. (2006). Population structure and eigenanalysis. PLoS Genetics, 2(12),
e190. doi: 10.1371/journal.pgen.0020190.
Porcu, E., Sanna, S., Fuchsberger, C., & Fritsche, L. G. (2013). Genotype imputation in
genome-wide association studies. Current Protocols in Human Genetics, 78, 1.25.1–1.25.14. doi:
10.1002/0471142905.hg0125s78.
Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., & Reich, D. (2006). Principal
components analysis corrects for stratification in genome-wide association studies. Nature Genetics,
38(8), 904–909. doi: 10.1038/ng1847.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., ... Sham, P. C. (2007).
PLINK: A toolset for whole-genome association and population-based linkage analysis. American
Journal of Human Genetics, 81, 559–575. doi: 10.1086/519795.
Thornton, T. A. (2015). Statistical methods for genome-wide and sequencing association studies of
complex traits in related samples. Current Protocols in Human Genetics, 84, 1.28.1-1.28.9. doi:
10.1002/0471142905.hg0128s84.
Turner, S., Armstrong, L. L., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., . . . Ritchie,
M. D. (2011). Quality control procedures for genome-wide association studies. Current Protocols in
Human Genetics, 68, 1.19:1.19.1–1.19.18. doi: 10.1002/0471142905.hg0119s68.
Wigginton, J. E., & Abecasis, G. R. (2005). PEDSTATS: Descriptive statistics, graphics and quality assess-
ment for gene mapping data. Bioinformatics, 21, 3445–3447. doi: 10.1093/bioinformatics/bti529.
Yang, J., Lee, S. H., Goddard, M. E., & Visscher, P. M. (2011). GCTA: A tool for Genome-wide complex
trait analysis. American Journal of Human Genetics, 88(1), 76–82. doi: 10.1016/j.ajhg.2010.11.011.
INTERNET RESOURCES
https://samtools.github.io/bcftools/bcftools.html
BCFTOOLS: a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary
counterpart BCF.
https://github.com/DReichLab/EIG
EIGENSOFT: Use genome-wide data to determine Principal Components of individuals.
https://genome.sph.umich.edu/wiki/EPACTS
EPACTS: Efficient and Parallelizable Association Container Toolbox.
http://cnsgenomics.com/software/gcta/#Overview
GCTA: A tool for genome-wide complex trait analysis.
https://www.well.ox.ac.uk/cfreeman/software/gwas/gtool.html
GTOOL: A program for transforming sets of genetic data.
https://cran.r-project.org/web/packages/GWAF/index.html
GWAF: an R package for association testing of genotypes and inferred SNPs using binary or continuous
phenotypes.
https://www.broadinstitute.org/haploview/downloads
Haploview: a program designed to simplify and expedite the process of haplotype analysis by providing a
common interface.
http://illumina.com
Illumina Web site containing information about various genotyping arrays, reference files, and manuals,
including how to download and use Illumina GenomeStudio to convert raw data into a Final Report.
https://genepi.qimr.edu.au/staff/manuelF/multivariate/main.html
MV-PLINK: A multivariate test of association.
https://csg.sph.umich.edu/abecasis/pedstats/index.html
PEDSTATS: A tool for the validation and summary of pedigree files.
https://pngu.mgh.harvard.edu/purcell/plink/
PLINK: an open-source whole-genome association analysis toolset, version 1.07.
http://www.cog-genomics.org/plink/1.9/
PLINK Web site for version 1.90b.
https://atgu.mgh.harvard.edu/plinkseq/
PLINK/SEQ: A library for the analysis of genetic variation data including rare variants and gene testing.
https://www.genabel.org/packages/ProbABEL
ProbABEL: A tool for genome-wide association analysis of imputed data.
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
SNPTEST: Analysis of single SNPs in genome association studies.
https://vcftools.github.io/man_latest.html
VCFtools: A set of tools written in Perl and C++ for working with VCF files.
