
Object-Attribute Biclustering for Elimination of Missing Genotypes
in Ischemic Stroke Genome-Wide Data

Fouzi Takelait
Supervisor: Dmitry I. Ignatov, Associate Professor

National Research University Higher School of Economics


Moscow, Russian Federation

June 10, 2020


Outline

Problem Statement

Formal Concept Analysis

Clustering vs. Biclustering

Biclustering

Oa-bicluster Generation

Experiments

Conclusion and future work

Fouzi Takelait Higher School of Economics June 6, 2020 2 / 24


Microarrays Genotyping
Genetic Variations

Haplotype: a set of DNA variations (polymorphisms such as
single-nucleotide polymorphisms, SNPs) adjacent to one another at the
same locus that tend to be inherited together. This set of alleles is
often referred to as linked polymorphisms.



Microarrays Genotyping
Missing SNP Genotypes

An example of an experimental protocol that compares gene expression
in two kinds of cells (healthy versus pathological) to find SNPs
responsible for complex diseases such as ischemic stroke.
Some causes of missing genotype data:
• The standard human arrays have
from 500 thousand to 2 million
SNPs on the microarray.
• Total sequence variation is
determined only on a small fraction
of individuals.
• SNPs of interest are not on the
genotyping platform.
• Biases can be introduced during
sample preparation or during
hybridization or due to genomic
variation.
Formal Concept Analysis
[Wille, 1982; Ganter & Wille, 1999]

Definition 1
A formal context K is a triple (G, M, I) consisting of a set G, whose
elements are called objects, a set M, whose elements are called
attributes, and a binary relation I ⊆ G × M between G and M, called
the incidence relation. The notation gIm or (g, m) ∈ I means that the
object g has the attribute m.

Definition 2
For A ⊆ G and B ⊆ M, we define
A′ ≝ {m ∈ M | gIm for all g ∈ A},
B′ ≝ {g ∈ G | gIm for all m ∈ B}.
These two operators are the derivation operators for K = (G, M, I).
Formal Concept Analysis
[Wille, 1982; Ganter & Wille, 1999]

a b c d e
T1 ×
T2 × × ×
T3 ×
T4 × × × × ×
T5 ×

• {T2}′ = {a, b, c} is the object intent T2′ of T2,
• {T2, T3, T4}′ = {c},
• {e}′ = {T4, T5} is the attribute extent e′ of e,
• {T1, T5}′ = ∅.
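
The two derivation operators can be sketched in a few lines of Python. The context below is a toy example chosen to reproduce the four derivations listed above. The attribute given to T1 is an assumption (the examples only require that T1 does not have e), and the names `CONTEXT`, `intent`, and `extent` are illustrative.

```python
# Toy formal context: object -> set of attributes it has.
# Assumption: T1 is given attribute "a"; the slide examples only fix
# that T1 does not have "e".
CONTEXT = {
    "T1": {"a"},
    "T2": {"a", "b", "c"},
    "T3": {"c"},
    "T4": {"a", "b", "c", "d", "e"},
    "T5": {"e"},
}
ATTRIBUTES = {"a", "b", "c", "d", "e"}


def intent(objects):
    """A' : the attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def extent(attributes):
    """B' : the objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}


print(sorted(intent({"T2"})))              # intent of T2
print(sorted(intent({"T2", "T3", "T4"})))  # common attributes of T2, T3, T4
print(sorted(extent({"e"})))               # extent of e
print(sorted(intent({"T1", "T5"})))        # no shared attribute
```

Note that `intent(set())` returns all attributes and `extent(set())` returns all objects, which matches the "for all" in Definition 2.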

Definition 3
(A, B) is a formal concept of formal context K = (G, M, I) iff

A ⊆ G, B ⊆ M, A′ = B, and A = B′

A is the concept extent and B is the concept intent of the formal
concept (A, B).
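
Definition 3 translates directly into a check that applies both derivation operators. The sketch below is illustrative only: the toy context is an assumption (in particular the attribute of T1), and `is_formal_concept` is a hypothetical helper name.

```python
# Toy context (assumed for illustration): object -> attributes.
CONTEXT = {
    "T1": {"a"},
    "T2": {"a", "b", "c"},
    "T3": {"c"},
    "T4": {"a", "b", "c", "d", "e"},
    "T5": {"e"},
}
ATTRIBUTES = {"a", "b", "c", "d", "e"}


def intent(objects):
    """A' : attributes common to all objects in `objects`."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def extent(attributes):
    """B' : objects having all attributes in `attributes`."""
    wanted = set(attributes)
    return {g for g, attrs in CONTEXT.items() if wanted <= attrs}


def is_formal_concept(A, B):
    """True iff A is a set of objects, B a set of attributes, A' = B and B' = A."""
    A, B = set(A), set(B)
    return A <= set(CONTEXT) and B <= ATTRIBUTES and intent(A) == B and extent(B) == A


# ({T2, T4}, {a, b, c}) is closed in both directions, so it is a concept.
print(is_formal_concept({"T2", "T4"}, {"a", "b", "c"}))
# ({T2}, {a, b, c}) is not: T4 also has a, b and c, so {a, b, c}' != {T2}.
print(is_formal_concept({"T2"}, {"a", "b", "c"}))
```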



Clustering vs. Biclustering
Clustering vs. Biclustering [Denitto Matteo, 2017]

Figure: Clustering versus biclustering. Given a general data matrix (left), classical clustering approaches retrieve
submatrices where a subset of rows behaves coherently in all the columns (middle top), or vice versa (middle
bottom), whereas biclustering techniques recover submatrices where a particular subset of rows behaves
coherently in a certain subset of columns, and vice versa.



Biclustering
Biclusters [Madeira & Oliveira, 2004]

Consider A = (X, Y) ∈ ℝ^{n×m}, a matrix with a set of rows/objects
X = {x1, . . . , xn} and a set of columns/attributes Y = {y1, . . . , ym}.

Figure: example of grouping objects x1, . . . , x5 and attributes y1, . . . , y5 (biclusters).

A submatrix constructed from a subset of rows I ⊆ X and a subset of
columns J ⊆ Y is denoted by (I, J) and is called a bicluster of A.
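
As a minimal sketch of this definition, a bicluster (I, J) can be extracted from a matrix stored as a list of rows. The matrix values and the helper name `submatrix` are hypothetical and serve only to illustrate the row/column selection.

```python
def submatrix(matrix, row_labels, col_labels, I, J):
    """Return the bicluster (I, J): the entries of `matrix` in rows I and columns J."""
    row_index = {x: i for i, x in enumerate(row_labels)}
    col_index = {y: j for j, y in enumerate(col_labels)}
    return [[matrix[row_index[x]][col_index[y]] for y in J] for x in I]


X = ["x1", "x2", "x3", "x4", "x5"]
Y = ["y1", "y2", "y3", "y4", "y5"]
# Hypothetical 5 x 5 data matrix with two visible blocks.
A = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
]

bicluster = submatrix(A, X, Y, I=["x3", "x4"], J=["y3", "y4", "y5"])
print(bicluster)
```

Unlike a cluster of rows, the selected rows are only required to behave coherently on the chosen columns J, not on all of Y.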



Biclustering
Object-Attribute Biclusters [Ignatov & Kaminskaya et al., 2011; Ignatov
& Watson, 2010]

Definition 4
If (g, m) ∈ I, then (m′, g′) is called an object-attribute bicluster
(oa-bicluster) with density
ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|),
i.e., the ratio of the object-attribute pairs of the bicluster that
actually belong to I to all possible pairs.

Proposition 1
1. For any bicluster (A, B) ∈ 2^G × 2^M it is true that 0 ≤ ρ(A, B) ≤ 1,
2. An oa-bicluster (m′, g′) is a formal concept iff ρ = 1,
3. If (m′, g′) is an oa-bicluster, then (g′′, g′) ≤ (m′, m′′).

Figure: the oa-bicluster (m′, g′) built from the pair (g, m), with g′′ and m′′ marked.
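
The density of an oa-bicluster can be computed by counting the pairs of m′ × g′ that belong to I. The following sketch assumes a small toy context (not the stroke data, and with an assumed attribute for T1); `oa_bicluster` and `density` are illustrative names.

```python
# Toy context (assumed for illustration): object -> attributes.
CONTEXT = {
    "T1": {"a"},
    "T2": {"a", "b", "c"},
    "T3": {"c"},
    "T4": {"a", "b", "c", "d", "e"},
    "T5": {"e"},
}
# The incidence relation I as a set of (object, attribute) pairs.
INCIDENCE = {(g, m) for g, attrs in CONTEXT.items() for m in attrs}


def oa_bicluster(g, m):
    """Return (m', g') for a pair (g, m) that belongs to the incidence relation."""
    if (g, m) not in INCIDENCE:
        raise ValueError("an oa-bicluster is only defined for (g, m) in I")
    m_prime = frozenset(h for h in CONTEXT if (h, m) in INCIDENCE)
    g_prime = frozenset(CONTEXT[g])
    return m_prime, g_prime


def density(extent, intent):
    """Share of the |extent| * |intent| cells of the bicluster that belong to I."""
    filled = sum((h, n) in INCIDENCE for h in extent for n in intent)
    return filled / (len(extent) * len(intent))


ext, inten = oa_bicluster("T2", "c")
print(sorted(ext), sorted(inten), density(ext, inten))

# A bicluster with density 1 is a formal concept (Proposition 1, item 2).
print(density(*oa_bicluster("T4", "d")))
```

In this toy context the bicluster generated by (T2, c) covers a 3 × 3 block with 7 incident pairs, so its density is 7/9, while the one generated by (T4, d) has density 1.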
OA-Bicluster Generation
Proposed Algorithm [Ignatov & Kuznetsov et al., 2012]

Algorithm 1: oa-bicluster generation.

Data: K = (G, M, I) is a formal context, ρmin is a minimum density threshold
Result: B = {(Ak, Bk) | (Ak, Bk) is an oa-bicluster with ρ(Ak, Bk) ≥ ρmin}
begin
    Obj.Size = |G|
    Attr.Size = |M|
    B ← ∅
    # Step 1: compute the intent g′ of every object in the context.
    for g ∈ G do
        Obj[g] = g′
    # Step 2: compute the extent m′ of every attribute in the context.
    for m ∈ M do
        Attr[m] = m′
    # Step 3: generate all biclusters that fulfill the minimum density requirement.
    for g ∈ G do
        for m ∈ Obj[g] do
            if ρ(Attr[m], Obj[g]) ≥ ρmin then
                B.Add((Attr[m], Obj[g]))

Proposition
For a given formal context K = (G, M, I) and ρmin > 0, the largest number of oa-biclusters is equal to |I|, and all
oa-biclusters can be generated in time O(|I| · (|G| + |M|)).
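
The three steps of Algorithm 1 can be written as a short Python function. This is a plain sketch under stated assumptions, not the implementation used in the experiments: the context is a toy example, the function name `generate_oa_biclusters` is hypothetical, and the density is computed by direct counting rather than by any optimized procedure.

```python
def generate_oa_biclusters(objects, attributes, incidence, rho_min=0.0):
    """Return the set of oa-biclusters (m', g') with density >= rho_min."""
    # Step 1: intent g' of every object.
    obj_prime = {g: frozenset(m for m in attributes if (g, m) in incidence)
                 for g in objects}
    # Step 2: extent m' of every attribute.
    attr_prime = {m: frozenset(g for g in objects if (g, m) in incidence)
                  for m in attributes}
    # Step 3: one candidate bicluster per pair (g, m) in I, filtered by density.
    biclusters = set()  # a set, so identical biclusters are stored once
    for g in objects:
        for m in obj_prime[g]:
            extent, intent = attr_prime[m], obj_prime[g]
            filled = sum(1 for h in extent for n in intent if (h, n) in incidence)
            if filled / (len(extent) * len(intent)) >= rho_min:
                biclusters.add((extent, intent))
    return biclusters


# Toy context (assumed for illustration).
context = {
    "T1": {"a"},
    "T2": {"a", "b", "c"},
    "T3": {"c"},
    "T4": {"a", "b", "c", "d", "e"},
    "T5": {"e"},
}
G = sorted(context)
M = ["a", "b", "c", "d", "e"]
I = {(g, m) for g, attrs in context.items() for m in attrs}

all_biclusters = generate_oa_biclusters(G, M, I, rho_min=0.0)
dense_biclusters = generate_oa_biclusters(G, M, I, rho_min=1.0)
print(len(I), len(all_biclusters), len(dense_biclusters))
```

Because every bicluster is generated by some pair (g, m) ∈ I, the result can never contain more than |I| biclusters. Raising `rho_min` keeps only the denser ones.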



Data Collection
Data Description and Sample [Institute of Molecular Genetics of the
Russian Academy of Science, 2019]

Figure: a sample of the real-world dataset in the form of an object-attribute matrix.


Data Collection
Data Description and Sample [Institute of Molecular Genetics of the
Russian Academy of Science, 2019]

The dataset is represented as a formal context K = (G, M, I), i.e.,
the binary relation patients × SNPs written as an object-attribute
table in the sense of FCA, where objects from G stand for individuals
(patients), attributes from M stand for SNPs, and gIm means that an
individual g has a SNP m.
The context has the following parameters:
• |G| = 1323 (no. of individuals),
• |M| = 85142 (no. of SNPs),
• |I| = 45075 (the total number of attributes with missing values
in the dataset).
Missing SNP genotypes make up 0.491% of the whole data matrix.



Data Collection
Missing SNP values

Figure: distribution of the number of missing SNP values by column
before elimination.



Experiments
Identification of Biclusters with Missing SNPs

Figure: distribution of the number of biclusters by density (top-left), extent size (top-right), and intent size (bottom).



Experiments
Identification of Biclusters with Missing SNPs

– For the selection of large dense biclusters, we set the density
constraint to ρmin = 0.9.
– The number of attributes per object and the number of objects per
attribute are bounded as follows: 3 ≤ |g′| ≤ 1500 and 3 ≤ |m′| ≤
80000, respectively.

Example: ρmin = 94.08%, ρmax = 100%, |g′| = 122, and 3 ≤ |m′| ≤ 80000.
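
The selection of large dense biclusters amounts to a filter on density and on the sizes of g′ and m′. The sketch below uses the thresholds stated above as defaults. The tuple layout `(m_prime, g_prime, rho)` and the function name `select_biclusters` are assumptions made for illustration.

```python
def select_biclusters(biclusters, rho_min=0.9,
                      g_prime_bounds=(3, 1500), m_prime_bounds=(3, 80000)):
    """Keep biclusters (m', g', rho) that are dense and within the size bounds."""
    g_low, g_high = g_prime_bounds
    m_low, m_high = m_prime_bounds
    selected = []
    for m_prime, g_prime, rho in biclusters:
        dense_enough = rho >= rho_min
        g_ok = g_low <= len(g_prime) <= g_high
        m_ok = m_low <= len(m_prime) <= m_high
        if dense_enough and g_ok and m_ok:
            selected.append((m_prime, g_prime, rho))
    return selected


# Hypothetical candidates: (m', g', density).
candidates = [
    (frozenset(range(5)), frozenset(range(4)), 0.95),  # dense and large enough
    (frozenset(range(5)), frozenset(range(4)), 0.80),  # below the density threshold
    (frozenset(range(2)), frozenset(range(4)), 1.00),  # |m'| below the lower bound
]
selected = select_biclusters(candidates)
print(len(selected))
```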


Experiments
Elimination of Large Biclusters with Missing Genotypes

Eliminating the large biclusters found by the proposed algorithm
reduced the fraction of SNPs with missing genotypes by 0.172% and the
running time by about 19 min (from 3433.2 s to 2293.7 s).

                     no. samples   no. SNPs   no. NaNs   NaNs fraction
before elimination   1223          85142      45075      0.49%
after elimination    1472          82690      81360      0.31%

Table: basic statistics of the datasets before and after elimination of missing
values.

                     no. biclusters   total time, s
before elimination   383733           3433.2
after elimination    1472             2293.7

Table: total number of biclusters generated by the proposed algorithm and the
running time.



Experiments
Elimination of Large Biclusters with Missing Genotypes

Figure: distribution of the number of SNPs with missing genotypes by
column after elimination.



Experiments
Parallel Programming
– Parallel programming allows us to split a sequential program into similar, independent steps that are executed
simultaneously, which can make the program faster.

– Parallel programming lets us run several instructions at once.

def biclGen(row, data, Objects, Attributes):
    """Generate the oa-biclusters induced by the missing entries of one row."""
    B = {}
    idx = 0
    for column in range(data.shape[1]):
        # -1 is assumed to mark a missing genotype
        if data[row][column] == -1:
            bicluster = [Attributes[column], Objects[row]]
            # store each distinct bicluster only once
            if bicluster not in B.values():
                B[(row, column)] = bicluster
                idx = idx + 1
                if idx % 1000 == 0:
                    print(idx)  # progress report
    return B

Code: a chunk that generates object-attribute biclusters for one row; the rows are processed in parallel.
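
One way to run such a per-row function over all rows concurrently is sketched below with the standard-library `concurrent.futures` module. Several points are assumptions: the data is a plain list of lists, -1 marks a missing genotype, and a thread pool is used only to keep the example portable (for CPU-bound pure-Python work a process pool would be needed to obtain a real speed-up). The names `row_biclusters` and `all_biclusters` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

MISSING = -1  # assumed marker for a missing genotype


def row_biclusters(row, data, objects, attributes):
    """oa-biclusters (m', g') generated by the missing cells of one row."""
    return {
        (row, col): (attributes[col], objects[row])
        for col, value in enumerate(data[row])
        if value == MISSING
    }


def all_biclusters(data, max_workers=4):
    """Compute g' and m' once, then process the rows concurrently."""
    n_rows, n_cols = len(data), len(data[0])
    # g': columns with a missing value in row r; m': rows with a missing value in column c.
    objects = [frozenset(c for c in range(n_cols) if data[r][c] == MISSING)
               for r in range(n_rows)]
    attributes = [frozenset(r for r in range(n_rows) if data[r][c] == MISSING)
                  for c in range(n_cols)]
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = pool.map(
            lambda r: row_biclusters(r, data, objects, attributes), range(n_rows)
        )
        for part in parts:
            merged.update(part)  # keys (row, column) never collide across rows
    return merged


# Hypothetical genotype matrix: 0/1/2 are genotypes, -1 is missing.
data = [
    [0, -1, -1],
    [1, -1, 2],
    [0, 0, 0],
]
result = all_biclusters(data)
print(sorted(result))
```

The rows are independent of one another once g′ and m′ are precomputed, which is what makes the per-row split suitable for parallel execution.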


Experiments
Large Dense Biclusters Elimination and Classification Influence

first dataset experiments.


Data Shape: (1223, 85142)
We applied the gradient boosting on decision trees (GBDT)
algorithm to our initial dataset (before elimination of SNPs with
missing genotypes).
The classifier used the following parameters:
• Maximum number of trees: 2;
• Tree depth limit: 2;
• Loss function: binary cross-entropy (log-loss).



Experiments
Large Dense Biclusters Elimination and Classification Influence

second dataset experiments.


Data Shape: (1472, 82690)
• First experiment: CatBoost classifier with a train/test split in the proportion of 8:2.
• Second experiment: CatBoost classifier with 3-fold cross-validation.
• Third experiment: LGBM classifier with 3-fold cross-validation.

In all three experiments, the class balance was maintained for model validation.
Note: CatBoostClassifier and LGBMClassifier are classifiers provided by the Python libraries CatBoost and LightGBM.
Both can handle missing values.

classifier           no. trees   depth   accuracy   F1-score   precision   recall
CatBoostClassifier   2           2       0.715      0.834      0.715       1.000
CatBoostClassifier   5           2       0.773      0.862      0.761       0.995
CatBoostClassifier   5           3       0.773      0.862      0.761       0.995

CatBoostClassifier   4           3       0.768      0.859      0.990       0.759
CatBoostClassifier   5           3       0.768      0.859      0.990       0.759

LGBMClassifier       5           3       0.753      0.852      0.997       0.744
LGBMClassifier       5           5       0.753      0.852      0.996       0.744
LGBMClassifier       4           4       0.751      0.851      0.997       0.742
LGBMClassifier       4           3       0.749      0.850      0.997       0.741
LGBMClassifier       5           4       0.756      0.854      0.996       0.747

Table: scores of different machine learning classifiers applied to the dataset after elimination of SNPs with missing
genotypes.
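
The F1-score column can be cross-checked against the precision and recall columns, since F1 is the harmonic mean of the two, F1 = 2PR / (P + R). The short sketch below recomputes it for a few rows of the table; the helper name `f1_score` is illustrative and no external library is assumed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-score) taken from rows of the table.
rows = [
    (0.715, 1.000, 0.834),
    (0.761, 0.995, 0.862),
    (0.990, 0.759, 0.859),
    (0.997, 0.744, 0.852),
]
for precision, recall, reported in rows:
    recomputed = f1_score(precision, recall)
    print(f"reported {reported:.3f}, recomputed {recomputed:.3f}")
```

The recomputed values agree with the reported ones to three decimal places, so the three columns are mutually consistent.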



Experiments
Detection of Large Concepts in Large Datasets

In-Close4 is a freely available open-source tool that constructs the set of concepts satisfying given constraints
on the sizes of extents and intents. In-Close4 takes as input a context and two parameters, the minimal sizes of
the intent (no. of attributes) and the extent (no. of objects), and outputs a reduced concept (cxt) lattice: all concepts
satisfying the constraints given by the parameter values (|intent| ≥ m and |extent| ≥ n, where m, n ∈ ℕ).
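
The effect of the two size parameters can be illustrated with a naive enumeration that closes every attribute subset and keeps the concepts that satisfy the constraints. This is explicitly not the In-Close4 algorithm: it is exponential in the number of attributes and only suitable for tiny contexts. The toy context and the function name `concepts_with_min_sizes` are assumptions.

```python
from itertools import combinations


def concepts_with_min_sizes(context, attributes, min_intent=0, min_extent=0):
    """Naively enumerate formal concepts with |intent| >= min_intent and |extent| >= min_extent."""
    found = set()
    ordered = sorted(attributes)
    for size in range(len(ordered) + 1):
        for subset in combinations(ordered, size):
            wanted = set(subset)
            # Extent: objects having all attributes of the subset.
            ext = frozenset(g for g, attrs in context.items() if wanted <= attrs)
            # Intent: closure of the subset, i.e. attributes shared by the extent.
            inten = set(attributes)
            for g in ext:
                inten &= context[g]
            if len(inten) >= min_intent and len(ext) >= min_extent:
                found.add((ext, frozenset(inten)))
    return found


# Toy context (assumed for illustration).
context = {
    "T1": {"a"},
    "T2": {"a", "b", "c"},
    "T3": {"c"},
    "T4": {"a", "b", "c", "d", "e"},
    "T5": {"e"},
}
attributes = {"a", "b", "c", "d", "e"}

all_concepts = concepts_with_min_sizes(context, attributes)
large_concepts = concepts_with_min_sizes(context, attributes, min_extent=2)
print(len(all_concepts), len(large_concepts))
```

Raising the minimal extent size prunes the concepts with few objects, which is why the number of concepts in the tables grows quickly as that threshold is lowered.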

Min intent size   Min extent size   Total time, s   Number of concepts
0                 45                21.2            18617
0                 40                23.6            34400
0                 30                35.8            68477
0                 20                46.1            165864
0                 10                64.3            214007
0                 5                 188.3           1220576

Table: number of concepts generated by the In-Close4 algorithm and the elapsed time before eliminating SNPs with
missing genotypes.

Min intent size   Min extent size   Total time, s   Number of concepts
0                 40                10.4            2743
0                 30                10.6            4196
0                 20                12.6            19620

Table: number of concepts generated by the In-Close4 algorithm and the elapsed time after eliminating SNPs with
missing genotypes.



Conclusion

• We proposed a new approach to processing the missing values in
datasets produced by SNP genotyping with DNA microarrays. It is based
on OA-biclustering and allowed us to carefully estimate and eliminate
the SNPs with missing genotypes.
• The OA-biclustering algorithm was able to detect relatively large
dense biclusters, which significantly helped in removing the effects
of data leaks and overfitting when applying ML algorithms.
• We compared our algorithm with In-Close4. The findings showed
that the number of OA-biclusters generated by our algorithm is
significantly lower than the number of formal concepts (biclusters)
generated by the In-Close4 algorithm.



Future work

• The proposed algorithm needs further comparison with other
approaches such as the DeBi algorithm [Akdes Serin & Martin Vingron,
2011]. It also requires time-complexity improvements to increase the
scalability and quality of the bicluster search on massive datasets.
• We suggest developing a new technique to impute missing genotypes.
Chowdhury et al. recently applied biclustering to impute the missing
values in gene expression data.



Thank you!

Questions?
