Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 157 (2019) 306–312
www.elsevier.com/locate/procedia
4th International Conference on Computer Science and Computational Intelligence 2019 (ICCSCI), 12-13 September 2019
Fast and Effective Clustering Method for Ancestry Estimation
Arif Budiarto a,b,∗, Bharuno Mahesworo b, James Baurley b,c, Teddy Suparyanto b, Bens Pardamean b,d
a Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480
b Bioinformatics and Data Science Research Center, Bina Nusantara University, Jakarta, Indonesia 11480
c BioRealm, LLC, USA
d Computer Science Department, BINUS Graduate Program - Master of Computer Science, Bina Nusantara University, Jakarta, 11480, Indonesia
Abstract
Ancestry estimation, which provides family history information, is one of the most popular services in direct-to-consumer genomic testing. It is also an important task that aims to reduce confounding by ancestry on the relationship between genotypes and disease risk in association studies. Several methods have been developed to generate the best ancestry estimation scores, although some of them still suffer from inefficient computation time. In this paper, a method that combines KMeans clustering and PCA is proposed to estimate ancestry from SNP genotyping data. This method was compared with a baseline model, fastSTRUCTURE, in terms of clustering quality and computation time. Public data from the 1000 Genomes Project were used to train and evaluate both the proposed model and the baseline model. The proposed model successfully generates clusters with better accuracy than fastSTRUCTURE (91.02% versus 90.39%). More importantly, it runs about 100 times faster than fastSTRUCTURE (4.86 seconds versus 490 seconds).
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review Statement: Peer-review under responsibility of the scientific committee of the 4th International Conference on Computer Science and Computational Intelligence 2019.
Keywords: Ancestry Estimation; Population Stratification; Clustering; Bioinformatics; Genomics
1. Introduction
In direct-to-consumer genomic testing, one of the most popular services is providing family history information to users. This ancestry estimation can be done using three different methods, namely Y chromosome testing, mitochondrial deoxyribonucleic acid (DNA) testing, and single nucleotide polymorphism (SNP) genotyping 1 . Despite advances in genotyping technology in recent years, some challenges still hinder the quality of genetic ancestry estimation, such as the lack of reference genetic data for some under-represented populations. This issue significantly decreases the accuracy of the estimation.
∗ Corresponding author. Tel.: +6221-534-5830; fax: +6221-530-0244.
E-mail address: abudiarto@binus.edu
1877-0509 © 2019 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2019.08.171
In genomic association research, ancestry estimation is an important task that aims to reduce confounding by ancestry on the relationship between genotypes and disease risk 2 . Ancestry estimation has also been shown to improve the quality of a Genome-wide Association Study (GWAS) in a South African population that explored ancestry-specific risk for tuberculosis 3 . A recently published GWAS on colorectal cancer in a South Sulawesi population also included an ancestry estimation score as one of the covariates 4 . Several methods have been developed to generate the best ancestry estimation scores, although some of them still suffer from inefficient computation time 5,6,7,8 .
In this paper, an unsupervised model is proposed as a method of ancestry estimation from SNP genotyping data. This method was compared with a variational Bayesian model in terms of clustering quality and computation time. The proposed model was tested by inferring ancestry from genotypes in a population from South Sulawesi, Indonesia.
2. Related Works
Currently, there are two common families of algorithms for ancestry estimation. The first is model-based estimation of global ancestry, such as STRUCTURE 5,9,10 . One challenge of this method is that sampling from the posterior distribution is computationally expensive. A variational Bayesian alternative has been proposed and implemented in software called fastSTRUCTURE 8 . Variational Bayes is a statistical machine learning technique that is usually used to approximate intractable integrals. Unlike standard Bayesian methods, the variational Bayesian technique treats inference as an optimization problem rather than a sampling problem 11 . As a result, fastSTRUCTURE provides a faster and more efficient model for ancestry estimation. Another commonly used model-based method is ADMIXTURE, which offers a faster execution time than STRUCTURE 12 .
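The optimization view of inference mentioned above can be stated in its general textbook form. The identity below is the standard evidence lower bound (ELBO) decomposition for a generic latent-variable model; the symbols x (observed data), z (latent variables), q (approximating distribution) and Q (the family searched over) are illustrative notation and are not taken from the fastSTRUCTURE paper.

```latex
\log p(x)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)}
  + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right),
\qquad
q^{*} = \arg\max_{q \in \mathcal{Q}} \mathrm{ELBO}(q)
```

Because the KL term is non-negative and the left-hand side does not depend on q, maximizing the ELBO over q is equivalent to minimizing the KL divergence to the true posterior, which is why no posterior sampling is required.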
The second approach to genetic ancestry estimation is based on Principal Component Analysis (PCA), which projects high-dimensional data into a lower-dimensional space. PCA uses matrix operations and statistics to extract the leading patterns in the data matrix, which are then used to project the data into a new representation with fewer dimensions 13 . With this method, the important information in the dataset can be represented by a small set of orthogonal variables, for example two 14 . Examples of PCA-based ancestry estimation are EIGENSTRAT 6 and SMARTPCA 7 .
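As a rough illustration of the projection step, the sketch below computes the top principal components of a small genotype matrix using power iteration with deflation on the sample covariance matrix. This is only one way to obtain the components, chosen here because it needs nothing beyond the Python standard library; it is not the procedure used by EIGENSTRAT or SMARTPCA, and all function names and the toy data are hypothetical.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so that every SNP has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (SNP x SNP) of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def leading_eigenpair(matrix, iterations=500, seed=0):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in this matrix
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca(rows, n_components=2):
    """Return (components, scores) for the top n_components directions."""
    centered = center_columns(rows)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        # Deflation: remove the direction just found before searching again.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores


# Toy genotype matrix: 4 samples x 3 SNPs, coded as 0/1/2 allele counts.
genotypes = [[0, 0, 0], [0, 0, 1], [2, 2, 0], [2, 2, 1]]
components, scores = pca(genotypes, n_components=2)
```

In the toy matrix the first two SNPs vary together, so the first component weights them equally and the second component picks up the third SNP. For thousands of SNPs, a library routine based on singular value decomposition would normally be preferred over this explicit covariance matrix.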
3. Material and Methods
3.1. Datasets
To train our clustering method, we used a public dataset from the 1000 Genomes Project 15 . The 1000 Genomes Project is an international research effort that catalogues the most common human genetic variations across multiple populations. Genomic data from 2504 individuals in 26 different populations were collected using sequencing and genotyping methods. These data include more than 80 million SNPs.
To implement our proposed ancestry estimation method, all the SNPs were filtered using a list of Ancestry-Informative Markers (AIMs), which contains only markers associated with human ancestry information. More than 5 thousand SNPs are included in this list.
To evaluate the result, we used genomic data from the first Genome-Wide Association Study (GWAS) in Indonesia, which focused on the genetic risk factors for colorectal cancer in a South Sulawesi population 4 . This dataset contains genomic data for 173 samples from Makassar, South Sulawesi, Indonesia.
3.2. Data Analysis
Several data preprocessing steps were applied to both the training and test data. For the training data, we excluded related individuals, which reduced the sample to 2405 individuals. Basic data cleansing was also done to exclude SNPs that are not present in the test dataset. This step retained 3577 out of 5000 SNPs, which were then fed into the proposed method.
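The SNP cleansing step described above amounts to keeping only the markers that both datasets share and putting their columns in the same order. A minimal sketch follows, assuming genotypes are stored as lists of allele counts with a parallel list of SNP identifiers; the function names and the rs identifiers are hypothetical placeholders, not values from the study.

```python
def shared_snps(train_snp_ids, test_snp_ids):
    """Return the SNP IDs present in both datasets, in training order."""
    test_set = set(test_snp_ids)
    return [snp for snp in train_snp_ids if snp in test_set]


def select_columns(genotypes, snp_ids, keep_ids):
    """Keep only the columns listed in keep_ids, ordered as keep_ids."""
    index = {snp: i for i, snp in enumerate(snp_ids)}
    columns = [index[snp] for snp in keep_ids]
    return [[row[c] for c in columns] for row in genotypes]


# Hypothetical identifiers and genotypes (0/1/2 allele counts).
train_ids = ["rs1", "rs2", "rs3", "rs4"]
test_ids = ["rs4", "rs2", "rs9"]
train_genotypes = [[0, 1, 2, 0], [2, 2, 1, 1]]
test_genotypes = [[1, 0, 2], [0, 2, 1]]

keep = shared_snps(train_ids, test_ids)
train_filtered = select_columns(train_genotypes, train_ids, keep)
test_filtered = select_columns(test_genotypes, test_ids, keep)
```

Reordering the test columns to match the training order matters because distance-based methods such as K-Means compare samples position by position.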
In the proposed method, we use K-Means clustering to assign each subject to one of 5 groups based on all 3577 SNPs. These 5 groups come from the stratification used in the 1000 Genomes Project, which refers to 5 major populations in the world, namely American (AMR), European (EUR), African (AFR), South Asian (SAS), and East Asian (EAS). K-Means is an unsupervised machine learning algorithm that aims to divide a number of observations into k clusters, where each observation is assigned to the cluster with the closest mean 16 . The K-Means algorithm allocates each observation to the closest cluster centroid while keeping the clusters as compact as possible. To plot each subject on a two-dimensional graph, PCA was used to reduce the data to two components. All subjects were then plotted and colored by cluster label.
Fig. 1: Data Analysis Process Diagram
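The K-Means procedure described above can be sketched with Lloyd's algorithm: assign every observation to its nearest centroid, recompute each centroid as the mean of its members, and repeat until the assignments stop changing. The code below is a minimal standard-library version for illustration; the paper does not state which implementation was used, and the random initialization, the toy genotype vectors, and the nearest-centroid rule for a new subject are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial means
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_centroid(p, centroids) for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: 6 subjects x 6 SNPs, coded as 0/1/2 allele counts.
samples = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
    [2, 2, 1, 2, 2, 2],
    [2, 1, 2, 2, 2, 1],
    [1, 2, 2, 2, 1, 2],
]
labels, centroids = kmeans(samples, k=2)

# A new subject can be placed by the same nearest-centroid rule.
new_subject = [2, 2, 2, 2, 1, 2]
predicted_cluster = nearest_centroid(new_subject, centroids)
```

K-Means only finds a local optimum, so results can depend on the initial centroids; practical implementations usually rerun the algorithm from several starting points and keep the most compact solution.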
The proposed method was compared with two other methods. First, we used the fastSTRUCTURE software to generate 5 scores for each subject, which represent the populations. Using the same PCA components, all subjects were then plotted based on the predicted population. We chose fastSTRUCTURE because it is the most recent of the ancestry estimation software packages considered here. Second, we applied K-Means clustering only to the two PCA components and then plotted the subjects based on the predicted cluster. A confusion matrix-style table was used to evaluate how well each method clusters the subjects. The capability to infer the test data and place the subjects into the appropriate cluster was also assessed. The complete end-to-end analysis process is illustrated in Figure 1.
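The evaluation described above needs two small pieces: a hard cluster label for every subject and a table that counts subjects by true population and assigned cluster. The sketch below assumes that per-subject scores are turned into a single label by taking the largest score, which the paper does not state explicitly; the function names and the toy scores are hypothetical.

```python
from collections import Counter


def hard_assignment(scores):
    """Index of the largest score (assumed rule for picking one population)."""
    return max(range(len(scores)), key=lambda k: scores[k])


def confusion_table(true_labels, cluster_ids):
    """Count subjects for every (true population, assigned cluster) pair."""
    counts = Counter(zip(true_labels, cluster_ids))
    populations = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    table = [[counts[(pop, c)] for c in clusters] for pop in populations]
    return populations, clusters, table


# Toy example with three clusters; the scores are made up for illustration.
true_labels = ["EUR", "EUR", "EAS", "AMR", "AMR"]
ancestry_scores = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.60, 0.10, 0.30],
    [0.20, 0.10, 0.70],
]
cluster_ids = [hard_assignment(s) for s in ancestry_scores]
populations, clusters, table = confusion_table(true_labels, cluster_ids)
```

Each row of `table` corresponds to one true population and each column to one cluster, which is the same layout used in the confusion matrices of the results section.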
4. Results and Discussion
The clustering results for the training data from the 1000 Genomes Project are presented as confusion matrices in Tables 1, 2, and 3 below. The KMeans clustering method and the fastSTRUCTURE clustering method gave similar results. The difference between them lies in the clustering of the American samples, for which both methods also gave a poor precision rate. Table 1 shows that the KMeans method spread the American samples across 3 main classes, with the largest share placed in the same class as the South Asian samples. On the other hand, the fastSTRUCTURE clustering method, shown in Table 2, placed the majority of the American samples in the same class as the European samples.
Lastly, in Table 3, we clustered the training data with the KMeans method using only the first two principal components. The accuracy of this method was 82.62%, which is much lower than the 91.02% obtained by the method using all 3508 SNPs. Some of the African samples were clustered into a class of their own, the class that was expected to contain the American samples.
When testing the models, all of them assigned the test samples, the Makassar samples, to the East Asian population with 100% precision (Table 4). The Makassar samples are expected to fall in the same class as the East Asian samples because Makassar is part of the Southeast Asia region, and the East Asian samples in this study include Vietnamese samples, which also come from Southeast Asia.
Table 1: KMeans Method Confusion Matrix
Model    Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Total  Precision
EUR      489        0          0          0          0          489    100%
EAS      0          491        0          0          0          491    100%
AFR      0          0          608        1          3          612    99.35%
SAS      0          0          0          475        0          475    100%
AMR      68         0          2          142        126        338    37.28%
Total    557        491        610        618        129        2405
Recall   87.79%     100%       99.67%     76.86%     97.67%            91.02%
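The rates in Table 1 follow directly from the counts. The sketch below recomputes them, keeping the table's own labels: "Precision" is the matched-cluster count divided by the row total, and "Recall" is the same count divided by the column total. Matching each population to the cluster on the diagonal (EUR to Cluster 0 through AMR to Cluster 4) is an assumption inferred from the table layout, and the function names are hypothetical.

```python
# Counts copied from Table 1 (rows: true population, columns: Cluster 0..4).
table1 = {
    "EUR": [489, 0, 0, 0, 0],
    "EAS": [0, 491, 0, 0, 0],
    "AFR": [0, 0, 608, 1, 3],
    "SAS": [0, 0, 0, 475, 0],
    "AMR": [68, 0, 2, 142, 126],
}
# Assumed matching between populations and clusters (the table diagonal).
matched_cluster = {"EUR": 0, "EAS": 1, "AFR": 2, "SAS": 3, "AMR": 4}


def row_rates(table, matched):
    """Matched count divided by the row total (the table's 'Precision')."""
    return {pop: row[matched[pop]] / sum(row) for pop, row in table.items()}


def column_rates(table, matched):
    """Matched count divided by the column total (the table's 'Recall')."""
    rates = {}
    for pop, cluster in matched.items():
        column_total = sum(row[cluster] for row in table.values())
        rates[pop] = table[pop][cluster] / column_total
    return rates


def accuracy(table, matched):
    """Share of all subjects that fall in their matched cluster."""
    correct = sum(row[matched[pop]] for pop, row in table.items())
    total = sum(sum(row) for row in table.values())
    return correct / total


precision_by_population = row_rates(table1, matched_cluster)
recall_by_cluster = column_rates(table1, matched_cluster)
overall_accuracy = accuracy(table1, matched_cluster)
```

The same three functions can be applied to the counts of Tables 2 and 3 by swapping in the corresponding rows.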
Table 2: fastSTRUCTURE Method Confusion Matrix
Model    Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Total  Precision
EUR      489        0          0          0          0          489    100%
EAS      0          491        0          0          0          491    100%
AFR      1          2          608        0          1          612    99.35%
SAS      0          0          0          475        0          475    100%
AMR      194        29         3          1          111        338    32.84%
Total    684        522        611        476        112        2405
Recall   71.49%     94.06%     99.51%     99.79%     99.11%            90.39%
Table 3: KMeans PCA Method Confusion Matrix
Model    Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Total  Precision
EUR      489        0          0          0          0          489    100%
EAS      0          491        0          0          0          491    100%
AFR      0          2          529        1          80         612    86.44%
SAS      0          0          0          475        0          475    100%
AMR      24         86         0          225        3          338    0.89%
Total    513        579        529        701        83         2405
Recall   95.32%     84.80%     100%       67.76%     3.61%             82.62%
Table 4: Test result
Model          EUR  EAS  AFR  SAS  AMR  Total  Precision
KMeans         0    173  0    0    0    173    100%
fastSTRUCTURE  0    173  0    0    0    173    100%
KMeans PCA     0    173  0    0    0    173    100%
4.1. Data Visualization
To evaluate how the samples are spread, the classification results are visualized in scatter plots. Each sample in the dataset is drawn as a point positioned by its first two principal components. In Figure 2, the samples are colored by their true population. The figure shows that the African samples are separated on the far right of the plot, the East Asian samples are in the top left corner, and the European samples are in the bottom left corner. The American samples spread out across the plot, with the South Asian samples sitting in the middle of them. The test samples from Makassar sit very close to, and almost entirely overlap with, the East Asian samples.
Fig. 2: Ground Truth Scatter Plot
Fig. 3: KMeans Model Scatter Plot
Fig. 4: fastSTRUCTURE Model Scatter Plot
Fig. 5: KMeans PCA Model Scatter Plot
The American samples come from Peruvian, Mexican, Puerto Rican, and Colombian populations. These Hispanic populations descend from both Spanish and Native American ancestors and therefore carry genetic information from both ancestries 17,18,19 . As a result, the American samples are not clustered in one location on the scatter plot.
In Figure 3, the samples are colored by their predicted label from the KMeans clustering method. The test samples from Makassar are entirely predicted to be in the same class as the East Asian samples. In this plot, however, the test samples are colored as class 5 solely to distinguish them from the training samples. The European, African, East Asian, and South Asian samples were successfully classified with a precision rate of at least 99%. On the other hand, the American samples were poorly classified, with the majority of them placed in the same class as the European or South Asian samples.
The fastSTRUCTURE clustering plot, Figure 4, is similar to the KMeans clustering plot in that the European, African, East Asian, and South Asian samples were successfully classified. With this method, the majority of the American samples are placed in the same class as the Europeans. Unexpectedly, some of the American samples are placed in the same class as the East Asians.
In the last plot, Figure 5, the KMeans PCA model gave a very poor result. The European and East Asian samples were successfully classified into their own classes. Meanwhile, the majority of the American samples were placed in the same class as the South Asian samples. In addition, the African samples were divided into two classes.
4.2. Processing Time
To minimize bias, we ran the KMeans clustering model 5 times to measure the processing time. The average processing time was 4.8618818 seconds, with a standard deviation of 0.0455760487 seconds. This is much faster than the fastSTRUCTURE clustering model, which took 490 seconds.
We also measured the processing time of the KMeans PCA clustering model. Since it clusters on only two principal components, the average processing time of this model was 0.0485584 seconds, with a standard deviation of 0.0026308776 seconds. Despite its poor clustering result, this method is very fast.
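The repeated-run measurement described above can be sketched as a small timing harness that reports the mean and standard deviation over several runs. The paper does not say which timer or which standard deviation formula was used; this sketch assumes `time.perf_counter` and the sample standard deviation, and the `workload` function is a hypothetical stand-in for the clustering call.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Run func several times; return (mean, sample std) of the durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def workload():
    """Placeholder for the model fit being timed."""
    return sum(i * i for i in range(100_000))


mean_seconds, std_seconds = time_runs(workload, repeats=5)

# Speed-up implied by the reported timings (490 s versus 4.8618818 s).
reported_speedup = 490 / 4.8618818
```

With the reported figures, the ratio works out to roughly 100.8, which is the basis for describing KMeans as about 100 times faster.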
5. Conclusion
The method proposed in this paper can successfully generate clusters with an accuracy slightly higher than that of fastSTRUCTURE. In addition, the KMeans model's processing time is about 4.86 seconds, which is about 100 times faster than fastSTRUCTURE. This processing time can be decreased even further by reducing the dimensionality of the dataset. In this study, the KMeans model applied to the dataset reduced to two principal components needed only 0.048 seconds of processing time, although its performance was poor, with an accuracy of 82.62%. In further research, this approach can be used with an optimal number of principal components, so that it gives acceptable results with a fast processing time.
All of the models assigned the test samples from Makassar to the same class as the East Asian samples, with a 100% precision rate. These results are in line with expectations. However, all models also struggled to cluster the American samples. This issue can be explored further in future research, either by reducing the number of clusters to leave out the American cluster or by implementing a more complex method. Another future research direction is to implement hierarchical clustering, which can infer more detailed populations based on the 1000 Genomes Project.
Acknowledgement
The colorectal cancer genotyping data used in this study were collected and processed by the Indonesian Colorectal Cancer Consortium (IC3).
References
1. Genetics Home Reference. What is genetic ancestry testing? 2019. URL https://ghr.nlm.nih.gov/primer/dtcgenetictesting/ancestrytesting.
2. Wang, C., Zhan, X., Bragg-Gresham, J., Kang, H.M., Stambolian, D., Chew, E.Y., et al. Ancestry estimation and control of population stratification for sequence-based association studies. Nature Genetics 2014;46(4):409–415. doi:10.1038/ng.2924. URL http://www.nature.com/articles/ng.2924.
3. Price, A.L., Zaitlen, N.A., Reich, D., Patterson, N. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 2010;11(7):459–463. doi:10.1038/nrg2813. URL http://www.ncbi.nlm.nih.gov/pubmed/20548291, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2975875, http://www.nature.com/articles/nrg2813.
4. Yusuf, I., Miskad, U.A., Lusikooy, R.E., Arsyad, A., Irwan, A., Mathew, G., et al. Genetic risk factors for colorectal cancer in multiethnic Indonesians. bioRxiv 2019. doi:10.1101/626739. URL https://www.biorxiv.org/content/early/2019/05/03/626739, https://www.biorxiv.org/content/early/2019/05/03/626739.full.pdf.
5. Pritchard, J.K., Stephens, M., Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 2000;155(2):945–59. URL http://www.ncbi.nlm.nih.gov/pubmed/10835412, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC1461096.
6. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 2006;38(8):904–909. doi:10.1038/ng1847. URL http://www.ncbi.nlm.nih.gov/pubmed/16862161, http://www.nature.com/articles/ng1847.
7. Patterson, N., Price, A.L., Reich, D. Population structure and eigenanalysis. PLoS Genetics 2006;2(12):e190. doi:10.1371/journal.pgen.0020190. URL http://www.ncbi.nlm.nih.gov/pubmed/17194218, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC1713260.
8. Raj, A., Stephens, M., Pritchard, J.K. fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 2014;197(2):573–89. doi:10.1534/genetics.114.164350. URL http://www.ncbi.nlm.nih.gov/pubmed/24700103, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4063916.
9. Falush, D., Stephens, M., Pritchard, J.K. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 2003;164(4):1567–87. URL http://www.ncbi.nlm.nih.gov/pubmed/12930761, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC1462648.
10. Hubisz, M.J., Falush, D., Stephens, M., Pritchard, J.K. Inferring weak population structure with the assistance of sample group information. Molecular Ecology Resources 2009;9(5):1322–32. doi:10.1111/j.1755-0998.2009.02591.x. URL http://www.ncbi.nlm.nih.gov/pubmed/21564903, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3518025.
11. Paisley, J., Blei, D., Jordan, M. Variational Bayesian Inference with Stochastic Search. 2012. arXiv:1206.6430. URL http://arxiv.org/abs/1206.6430.
12. Alexander, D.H., Novembre, J., Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research 2009;19(9):1655–64. doi:10.1101/gr.094052.109. URL http://www.ncbi.nlm.nih.gov/pubmed/19648217, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2752134.
13. Wold, S., Esbensen, K., Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 1987;2(1):37–52. doi:10.1016/0169-7439(87)80084-9. Proceedings of the Multivariate Statistical Workshop for Geologists and Geochemists. URL http://www.sciencedirect.com/science/article/pii/0169743987800849.
14. Abdi, H., Williams, L.J. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics ????;2(4):433–459. doi:10.1002/wics.101. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/wics.101, https://onlinelibrary.wiley.com/doi/pdf/10.1002/wics.101.
15. Gibbs, R.A., Boerwinkle, E., Doddapaneni, H., Han, Y., Korchina, V., Kovar, C., et al. A global reference for human genetic variation. Nature 2015;526(7571):68–74. doi:10.1038/nature15393. URL http://www.nature.com/articles/nature15393.
16. Jain, A.K., Dubes, R.C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.; 1988. ISBN 0-13-022278-X.
17. Comas-Díaz, L. Hispanics, Latinos, or Americanos: The evolution of identity. Cultural Diversity and Ethnic Minority Psychology 2001;7(2):115. URL https://psycnet.apa.org/getdoi.cfm?doi=10.1037/1099-9809.7.2.115.
18. Perez, A.D., Hirschman, C. The Changing Racial and Ethnic Composition of the US Population: Emerging American Identities. Population and Development Review 2009;35(1):1–51. doi:10.1111/j.1728-4457.2009.00260.x. URL http://www.ncbi.nlm.nih.gov/pubmed/20539823, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2882688.
19. Weber, D.J. The Spanish legacy in North America and the historical imagination. Western Historical Quarterly 1992;23(1). URL https://academic.oup.com/whq/article-abstract/23/1/4/1887446.
