Object-Attribute Biclustering For Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data

Fouzi Takelait
Supervisor: Dmitry I. Ignatov, Associate Professor
National Research University Higher School of Economics

Moscow, Russian Federation
June 10, 2020

Outline
Problem Statement
Formal Concept Analysis
Clustering vs. Biclustering
Biclustering
Oa-bicluster Generation
Experiments
Conclusion and future work
Fouzi Takelait Higher School of Economics June 6, 2020 2 / 24

Microarrays Genotyping
Genetic Variations
Haplotype: A set of DNA variations (polymorphisms such as SNPs, single-nucleotide polymorphisms) adjacent to one another at the same locus that tend to be inherited together. This set of alleles is often referred to as linked polymorphisms.

Microarrays Genotyping
Missing SNP Genotypes
Here is an example of an experimental protocol comparing gene expression of two different cells (healthy cells versus pathological cells) to find SNPs responsible for complex diseases such as Ischemic Stroke.
Some causes of missing genotype data:
• The standard human arrays have
from 500 thousand to 2 million
SNPs on the microarray.
• Total sequence variation is
determined only on a small fraction
of individuals.
• SNPs of interest are not on the
genotyping platform.
• Biases can be introduced during
sample preparation or during
hybridization or due to genomic
variation.
Formal Concept Analysis
[Wille, 1982; Ganter & Wille, 1999]
Definition 1
A formal context K is a triple (G, M, I) consisting of a set G, whose elements are called objects, a set M, whose elements are called attributes, and a binary relation I ⊆ G × M between G and M, called the incidence relation. The notation gIm or (g, m) ∈ I means that the object g has the attribute m.
Definition 2
For A ⊆ G and B ⊆ M, we define
A′ ≝ {m ∈ M | gIm for all g ∈ A},
B′ ≝ {g ∈ G | gIm for all m ∈ B}.
These two operators are the derivation operators for K = (G, M, I).
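The two derivation operators can be sketched in a few lines of Python. This is only an illustration, assuming the incidence relation is stored as a set of (object, attribute) pairs; the function names `intent_of` and `extent_of` and the toy context are hypothetical and not taken from the slides.

```python
# Toy formal context (hypothetical): I is a set of (object, attribute) pairs.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}


def intent_of(A, M, I):
    """A' : the attributes shared by all objects in A."""
    return {m for m in M if all((g, m) in I for g in A)}


def extent_of(B, G, I):
    """B' : the objects that have all attributes in B."""
    return {g for g in G if all((g, m) in I for m in B)}


print(intent_of({"g1", "g2"}, M, I))  # attributes common to g1 and g2
print(extent_of({"b"}, G, I))         # objects that have attribute b
```

Note that the empty set of objects is mapped to the whole attribute set M, which follows directly from the "for all" in the definition.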
[Wille, 1982; Ganter & Wille, 1999]
Example context (× means gIm):
     a  b  c  d  e
T1   [one ×; its column is not recoverable from the source]
T2   ×  ×  ×
T3         ×
T4   ×  ×  ×  ×  ×
T5               ×
• {T2}′ = {a, b, c} is the object intent T2′ of T2,
• {T2, T3, T4}′ = {c},
• {e}′ = {T4, T5} is the attribute extent e′ of e,
• {T1, T5}′ = ∅.
Definition 3
(A, B) is a formal concept of the formal context K = (G, M, I) iff
A ⊆ G, B ⊆ M, A′ = B, and A = B′.
A is the concept extent and B is the concept intent of the formal concept (A, B).
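Definition 3 translates into a direct check: apply both derivation operators and compare. The sketch below assumes a context stored as a set of (object, attribute) pairs; the helper names and the toy context are hypothetical.

```python
# Toy formal context (hypothetical data, not from the slides).
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}


def prime_of_objects(A):
    """A' : attributes common to every object in A."""
    return {m for m in M if all((g, m) in I for g in A)}


def prime_of_attributes(B):
    """B' : objects possessing every attribute in B."""
    return {g for g in G if all((g, m) in I for m in B)}


def is_formal_concept(A, B):
    """(A, B) is a formal concept iff A ⊆ G, B ⊆ M, A' = B and B' = A."""
    return (A <= G and B <= M
            and prime_of_objects(A) == B
            and prime_of_attributes(B) == A)


print(is_formal_concept({"g1"}, {"a", "b"}))   # closed pair
print(is_formal_concept({"g1", "g2"}, {"b"}))  # extent is not maximal
```

In the second call the pair fails because {b}′ also contains g3, so the extent {g1, g2} is not closed.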

Clustering vs. Biclustering
Clustering vs. Biclustering [Denitto Matteo, 2017]
Clustering versus Biclustering. Given a general data matrix (left): on the one hand classical clustering approaches
retrieve submatrices where a subset of rows behave coherently in all the columns (middle top), or vice versa
(middle bottom); on the other hand biclustering techniques recover submatrices where a particular subset of rows
behave coherently in a certain subset of columns, and vice versa.

Biclustering
Biclusters [Madeira & Oliveira, 2004]
Consider A = (X, Y) ∈ R^(n×m), a matrix with a set of rows/objects X = {x1, ..., xn} and a set of columns/attributes Y = {y1, ..., ym}.
[Figure: a matrix with rows x1 to x5 and columns y1 to y5; example of grouping objects and attributes (biclusters).]
A submatrix constructed from a subset of rows I ⊆ X and a subset of columns J ⊆ Y is denoted by (I, J) and is called a bicluster of A.

Biclustering
Object-Attribute Biclusters [Ignatov & Kaminskaya et al., 2011; Ignatov
& Watson, 2010]
Definition 4
If (g, m) ∈ I, then (m′, g′) is called an object-attribute bicluster (oa-bicluster) with density ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|), the ratio of the object-attribute pairs that actually belong to I to all possible pairs in m′ × g′.
Proposition 1
1. For any bicluster (A, B) ∈ 2^G × 2^M it is true that 0 ≤ ρ(A, B) ≤ 1,
2. An oa-bicluster (m′, g′) is a formal concept iff ρ = 1,
3. If (m′, g′) is an oa-bicluster, then (g′′, g′) ≤ (m′, m′′).
[Figure: the oa-bicluster (m′, g′) generated by the pair (g, m), with g′′ ⊆ m′ and m′′ ⊆ g′.]
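As a rough illustration of Definition 4, the density of the oa-bicluster generated by a pair (g, m) ∈ I can be computed by counting the pairs of m′ × g′ that belong to I. The names `oa_bicluster` and `density` and the toy context below are assumptions made for this sketch.

```python
# Toy formal context (hypothetical data, not from the slides).
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}


def oa_bicluster(g, m):
    """Return (m', g') for a pair (g, m) that belongs to I."""
    extent = {h for h in G if (h, m) in I}  # m': objects having attribute m
    intent = {n for n in M if (g, n) in I}  # g': attributes of object g
    return extent, intent


def density(extent, intent):
    """rho = |I ∩ (extent × intent)| / (|extent| · |intent|)."""
    if not extent or not intent:
        return 0.0
    hits = sum(1 for h in extent for n in intent if (h, n) in I)
    return hits / (len(extent) * len(intent))


print(density(*oa_bicluster("g1", "b")))  # 4 of the 6 cells are filled
print(density(*oa_bicluster("g3", "b")))  # fully filled rectangle
```

The second bicluster has density 1 and is indeed a formal concept, in line with item 2 of Proposition 1.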
OA-Bicluster Generation
Proposed Algorithm [Ignatov & Kuznetsov et al., 2012]
Algorithm 1: Algorithm for oa-bicluster generation.

Data: K = (G, M, I) is a formal context, ρmin is a threshold value of bicluster density
Result: B = {(Ak, Bk) | (Ak, Bk) is a bicluster}
begin
    Obj.Size = |G|
    Attr.Size = |M|
    B ← ∅
    # Step 1: compute the intent g′ of every object in the context.
    for g ∈ G do
        Obj[g] = g′
    # Step 2: compute the extent m′ of every attribute in the context.
    for m ∈ M do
        Attr[m] = m′
    # Step 3: generate all biclusters that fulfill the minimum density requirement.
    for g ∈ G do
        for m ∈ Obj[g] do
            if ρ(Attr[m], Obj[g]) ≥ ρmin then
                B.Add((Attr[m], Obj[g]))
Proposition
For a given formal context K = (G, M, I) and ρmin > 0, the largest number of oa-biclusters is equal to |I|, and all
oa-biclusters can be generated in time O(|I| · (|G| + |M|)).
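The three steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal, unoptimized illustration rather than the authors' implementation: the density is recomputed from scratch for every pair, and the function name, the set-of-pairs representation, and the toy context are assumptions.

```python
def generate_oa_biclusters(G, M, I, rho_min):
    """Return the set of oa-biclusters (m', g') with density >= rho_min."""
    # Step 1: intent g' of every object.
    obj = {g: frozenset(m for m in M if (g, m) in I) for g in G}
    # Step 2: extent m' of every attribute.
    attr = {m: frozenset(g for g in G if (g, m) in I) for m in M}
    # Step 3: one candidate bicluster per pair (g, m) in I.
    biclusters = set()
    for g in G:
        for m in obj[g]:
            extent, intent = attr[m], obj[g]
            hits = sum(1 for h in extent for n in intent if (h, n) in I)
            if hits / (len(extent) * len(intent)) >= rho_min:
                biclusters.add((extent, intent))
    return biclusters


# Toy formal context (hypothetical data, not from the slides).
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}

all_biclusters = generate_oa_biclusters(G, M, I, rho_min=0.0)
dense_biclusters = generate_oa_biclusters(G, M, I, rho_min=0.9)
print(len(all_biclusters), len(dense_biclusters))
```

Because every candidate is seeded by one pair of I and duplicates collapse in the result set, the number of generated biclusters never exceeds |I|.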

Data Collection
Data Description and Sample [Institute of Molecular Genetics of the
Russian Academy of Science, 2019]
A real-world dataset sample in the form of an object-attribute matrix.

Data Collection
Data Description and Sample [Institute of Molecular Genetics of the
Russian Academy of Science, 2019]
The binary relation patients × SNPs is represented as a formal context K = (G, M, I), the FCA counterpart of an object-attribute table, where objects from G stand for individuals (patients), attributes from M stand for SNPs, and gIm means that an individual g has a SNP m.
The context has the following parameters:
• |G| = 1323 (no. of individuals),
• |M| = 85142 (no. of SNPs),
• |I| = 45075 (the total number of attributes with missing values in the dataset).
Missing SNP genotypes make up 0.491% of the whole data matrix.

Data Collection
Missing SNP values
Distribution of the number of missing SNP values by columns before elimination.
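A per-column count of missing values, as plotted in such a distribution, could be obtained along the following lines. The sketch assumes that a missing genotype is coded as -1 and that the matrix is a plain list of rows; the tiny matrix and the function names are made up for illustration.

```python
MISSING = -1  # assumed code for a missing genotype

# Tiny made-up genotype matrix: rows are individuals, columns are SNPs.
data = [
    [0, 1, MISSING, 2],
    [MISSING, 1, MISSING, 0],
    [2, 0, 1, 1],
]


def missing_per_column(rows):
    """Number of missing genotypes in each SNP column."""
    n_columns = len(rows[0])
    return [sum(1 for row in rows if row[j] == MISSING)
            for j in range(n_columns)]


def missing_fraction(rows):
    """Share of missing cells in the whole matrix."""
    total = sum(len(row) for row in rows)
    missing = sum(row.count(MISSING) for row in rows)
    return missing / total


print(missing_per_column(data))
print(missing_fraction(data))
```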

Experiments
Identification of Biclusters with Missing SNPs
distribution of the number of biclusters by their density (top-left), extent (top-right) and intent sizes (bottom).

Experiments
Identification of Biclusters with Missing SNPs
– For the selection of large dense biclusters, we set the density constraint to ρ = 0.9.
– The number of attributes per object and the number of objects per attribute are bounded as follows: 3 ≤ |g′| ≤ 1500 and 3 ≤ |m′| ≤ 80000, respectively.
Example.
ρmin = 94.08%, ρmax = 100%, |g′ | = 122, and 3 ≤ |m′ | ≤ 80000
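Selecting large dense biclusters under these constraints amounts to a simple filter. In the sketch below each bicluster is assumed to be a triple (extent, intent, density); the function name, the default bounds taken from the constraints above, and the sample biclusters are illustrative only.

```python
def select_large_dense(biclusters, rho_min=0.9,
                       intent_bounds=(3, 1500), extent_bounds=(3, 80000)):
    """Keep biclusters that satisfy the density and size constraints."""
    selected = []
    for extent, intent, rho in biclusters:
        if (rho >= rho_min
                and intent_bounds[0] <= len(intent) <= intent_bounds[1]
                and extent_bounds[0] <= len(extent) <= extent_bounds[1]):
            selected.append((extent, intent, rho))
    return selected


# Made-up biclusters: (extent m', intent g', density).
candidates = [
    (frozenset(range(5)), frozenset("abc"), 0.95),   # passes all constraints
    (frozenset(range(2)), frozenset("abcd"), 1.00),  # extent too small
    (frozenset(range(10)), frozenset("abc"), 0.50),  # not dense enough
]

kept = select_large_dense(candidates)
print(len(kept))
```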

Experiments
Elimination of Large Biclusters with Missing Genotypes
The proposed biclustering algorithm reduced the fraction of SNPs with missing genotypes by 0.172% and shortened the running time of bicluster generation by about 19 min.
                     no. samples   no. SNPs   no. NaNs   NaNs fraction
before elimination   1223          85142      45075      0.49%
after elimination    1472          82690      81360      0.31%
Basic statistics of the datasets before and after elimination of missing values.

                     no. biclusters   Total time, s
before elimination   383733           3433.2
after elimination    1472             2293.7
Total number of biclusters generated by the proposed algorithm and their running time.

Experiments
Elimination of Large Biclusters with Missing Genotypes
Distribution of the number of SNPs with missing genotypes by columns after elimination.

Experiments
Parallel Programming
– Parallel programming allows us to execute similar and independent steps of a standard sequential program simultaneously, which can make our program faster,
– Parallel programming lets us run several instructions at once.
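def biclGen(row, data, Objects, Attributes):
    # Collect the oa-biclusters seeded by the cells of one row that equal -1.
    # Objects[row] and Attributes[column] hold the precomputed g' and m';
    # they are passed in by the caller and must not be reset here.
    B = {}
    idx = 0
    for column in range(data.shape[1]):
        if data[row][column] == -1:
            bicluster = [Attributes[column], Objects[row]]
            # Skip biclusters already generated from another cell of this row.
            if bicluster not in B.values():
                B[(row, column)] = bicluster
                idx = idx + 1
                if idx % 1000 == 0:
                    print(idx)  # progress indicator
    return B
A code chunk, run in parallel over rows, that generates object-attribute biclusters.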
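How the per-row work might be distributed is sketched below with the standard `concurrent.futures` module. This is an assumption about the setup, not the code used in the experiments: the slides do not say which parallel library was used, a thread pool is chosen here only for portability (a `ProcessPoolExecutor` would be the usual swap for CPU-bound work), and the worker `bicl_gen_row` and the tiny matrix are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

MISSING = -1  # assumed code for a missing genotype


def bicl_gen_row(row, data, objects, attributes):
    """Biclusters (m', g') seeded by the missing cells of a single row."""
    found = {}
    for column, value in enumerate(data[row]):
        if value == MISSING:
            found[(row, column)] = (attributes[column], objects[row])
    return found


# Tiny made-up matrix: rows are individuals, columns are SNPs.
data = [[MISSING, 0, MISSING],
        [0, MISSING, MISSING]]

# g' for every row and m' for every column, computed once up front.
objects = {r: frozenset(c for c, v in enumerate(row) if v == MISSING)
           for r, row in enumerate(data)}
attributes = {c: frozenset(r for r, row in enumerate(data) if row[c] == MISSING)
              for c in range(len(data[0]))}

# Each row is an independent task, so the rows can be mapped over a pool.
worker = partial(bicl_gen_row, data=data, objects=objects, attributes=attributes)
with ThreadPoolExecutor(max_workers=2) as pool:
    per_row = list(pool.map(worker, range(len(data))))

# Merge the per-row dictionaries into one result.
merged = {}
for part in per_row:
    merged.update(part)
print(len(merged))
```

Rows share no mutable state in this layout, which is what makes the per-row split safe; duplicates across rows can be removed after merging, for example with `set(merged.values())`.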

Experiments
Large Dense Biclusters Elimination and Classification Influence
first dataset experiments.

Data Shape: (1223, 85142)
We applied the gradient boosting on decision trees (GBDT) algorithm to our initial dataset (before elimination of SNPs with missing genotypes).
The following parameters for the classifier were taken:
• Maximum number of trees: 2;
• Tree depth limit: 2;
• Loss function: binary cross-entropy (log-loss).

Experiments
Large Dense Biclusters Elimination and Classification Influence
second dataset experiments.

Data Shape: (1472, 82690)
• First experiment: CatBoost classifier with a train/test split in the proportion of 8:2.
• Second experiment: CatBoost classifier with 3-fold cross-validation.
• Third experiment: LGBM classifier with 3-fold cross-validation.
For the three experiments, the balance of classes was maintained for model validation.
Note: CatBoostClassifier and LGBMClassifier are classifiers from the Python libraries CatBoost and LightGBM. Both can handle missing values.
classifier           no. trees   depth   accuracy   F1-score   precision   recall
CatBoostClassifier   2           2       0.715      0.834      0.715       1.000
                     5           2       0.773      0.862      0.761       0.995
                     5           3       0.773      0.862      0.761       0.995
CatBoostClassifier   4           3       0.768      0.859      0.990       0.759
                     5           3       0.768      0.859      0.990       0.759
LGBMClassifier       5           3       0.753      0.852      0.997       0.744
                     5           5       0.753      0.852      0.996       0.744
                     4           4       0.751      0.851      0.997       0.742
                     4           3       0.749      0.850      0.997       0.741
                     5           4       0.756      0.854      0.996       0.747
Scores of different machine learning classifiers applied to the dataset after elimination of SNPs with missing genotypes.
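The F1-score column is the harmonic mean of precision and recall, so any row of the table can be checked with a short calculation. The helper name `f1_score` is chosen for this sketch and is not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from two rows of the table.
print(round(f1_score(0.715, 1.000), 3))  # 0.834
print(round(f1_score(0.990, 0.759), 3))  # 0.859
```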

Experiments
Detection of Large Concepts in Large Datasets
In-Close4 is a freely available open-source tool that constructs the set of concepts satisfying given constraints on the sizes of extents and intents. In-Close4 takes as input a context (cxt) and two parameters, the minimal sizes of intent (no. of attributes) and extent (no. of objects), and outputs a reduced concept lattice: all concepts satisfying the constraints given by the parameter values (|intent| ≥ m and |extent| ≥ n, where m, n ∈ N).
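The effect of the two size parameters can be illustrated on a toy context with a brute-force enumeration. This is not the In-Close4 algorithm, only a naive sketch that closes every attribute subset and keeps the concepts meeting the constraints; the function name and the data are hypothetical, and the approach is feasible only for very small attribute sets.

```python
from itertools import combinations


def concepts_with_min_sizes(G, M, I, min_intent=0, min_extent=0):
    """All formal concepts with |intent| >= min_intent and |extent| >= min_extent."""
    found = set()
    attributes = sorted(M)
    for k in range(len(attributes) + 1):
        for B in combinations(attributes, k):
            # B' is an extent; its derivation B'' is the matching closed intent.
            extent = frozenset(g for g in G if all((g, m) in I for m in B))
            intent = frozenset(m for m in M if all((g, m) in I for g in extent))
            if len(intent) >= min_intent and len(extent) >= min_extent:
                found.add((extent, intent))
    return found


# Toy formal context (hypothetical data, not from the slides).
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "b"), ("g2", "c"), ("g3", "b")}

print(len(concepts_with_min_sizes(G, M, I)))
print(len(concepts_with_min_sizes(G, M, I, min_extent=2)))
```

Raising the minimal extent size prunes the small concepts first, which is why the concept counts in such experiments drop sharply as the threshold grows.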
Min intent size   Min extent size   Total time, s   Number of concepts
0                 45                21.2            18617
0                 40                23.6            34400
0                 30                35.8            68477
0                 20                46.1            165864
0                 10                64.3            214007
0                 5                 188.3           1220576
Number of concepts generated by the In-Close4 algorithm and elapsed time before eliminating SNPs with missing genotypes.

Min intent size   Min extent size   Total time, s   Number of concepts
0                 40                10.4            2743
0                 30                10.6            4196
0                 20                12.6            19620
Number of concepts generated by the In-Close4 algorithm and elapsed time after eliminating SNPs with missing genotypes.

Conclusion
• A new approach to processing the missing values in datasets produced by SNP genotyping with DNA microarrays is proposed. It is based on OA-biclustering and allowed us to carefully estimate and eliminate the SNPs with missing genotypes.
• Results of the OA-biclustering algorithm showed the possibility of
detecting relatively large dense biclusters, which significantly helped
in removing the effects of data leaks and overfitting while applying
ML algorithms.
• We compared our algorithm with In-Close4. The findings showed
that the number of OA-biclusters generated by our algorithm is
significantly lower than the number of formal concepts or biclusters
generated by the In-Close4 algorithm.

Future work
• The proposed algorithm needs further comparison with other approaches such as the DeBi algorithm [Akdes Serin & Martin Vingron, 2011]. It also requires time complexity improvements to increase the scalability and quality of the extensive bicluster finding process for massive datasets.
• We suggest developing a new technique to impute missing genotypes. Chowdhury et al. recently applied biclustering to impute the missing values in gene expression data.

Thank you!
Questions?
