Fouzi Takelait
Supervisor: Dmitry I. Ignatov, Associate Professor
Problem Statement
Biclustering
Oa-bicluster Generation
Experiments
Definition 1
A formal context K is a triple (G, M, I) consisting of a set G of
objects (rows), a set M of attributes (columns), and a binary incidence
relation I ⊆ G × M between G and M. The notation gIm, or (g, m) ∈ I,
means that the object g has the attribute m.
Definition 2
For A ⊆ G and B ⊆ M, define
A′ := {m ∈ M | gIm for all g ∈ A},
B′ := {g ∈ G | gIm for all m ∈ B}.
These two operators are the derivation operators for K = (G, M, I).
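The two derivation operators can be illustrated with a minimal sketch. The toy context, the object and attribute names, and the function names below are hypothetical and serve only as an illustration of Definition 2.

```python
# Toy formal context (hypothetical data): each object maps to its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m1", "m2", "m3"},
    "g3": {"m3"},
}
ATTRIBUTES = {"m1", "m2", "m3"}


def prime_of_objects(objects, context=CONTEXT, attributes=ATTRIBUTES):
    """A' = attributes shared by every object in A (all of M if A is empty)."""
    shared = set(attributes)
    for g in objects:
        shared &= context[g]
    return shared


def prime_of_attributes(attrs, context=CONTEXT):
    """B' = objects that have every attribute in B (all of G if B is empty)."""
    wanted = set(attrs)
    return {g for g, row in context.items() if wanted <= row}


print(prime_of_objects({"g1", "g2"}))  # attributes common to g1 and g2
print(prime_of_attributes({"m3"}))     # objects that have m3
```

Storing the context as a mapping from objects to attribute sets keeps A′ a chain of set intersections and B′ a single subset test per object.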
Fouzi Takelait Higher School of Economics June 6, 2020 5 / 24
Formal Concept Analysis
[Wille, 1982; Ganter & Wille, 1999]
Definition 3
(A, B) is a formal concept of formal context K = (G, M, I) iff
A ⊆ G, B ⊆ M, A′ = B, and A = B′. A is called the extent and B the intent of the concept.
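A direct check of the four conditions in Definition 3 might look as follows. This is a sketch on a hypothetical toy context; the helper names `up`, `down`, and `is_formal_concept` are illustrative assumptions.

```python
# Toy formal context (hypothetical data): each object maps to its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m1", "m2", "m3"},
    "g3": {"m3"},
}
OBJECTS = set(CONTEXT)
ATTRIBUTES = set().union(*CONTEXT.values())


def up(objects):
    """A' : attributes shared by all objects in A."""
    shared = set(ATTRIBUTES)
    for g in objects:
        shared &= CONTEXT[g]
    return shared


def down(attrs):
    """B' : objects having all attributes in B."""
    wanted = set(attrs)
    return {g for g in OBJECTS if wanted <= CONTEXT[g]}


def is_formal_concept(extent, intent):
    """Check A ⊆ G, B ⊆ M, A' = B and A = B'."""
    extent, intent = set(extent), set(intent)
    if not (extent <= OBJECTS and intent <= ATTRIBUTES):
        return False
    return up(extent) == intent and down(intent) == extent


print(is_formal_concept({"g1", "g2"}, {"m1", "m2"}))  # both conditions hold
print(is_formal_concept({"g1"}, {"m1", "m2"}))        # g2 also has m1 and m2
```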
Clustering versus Biclustering. Given a general data matrix (left): on the one hand classical clustering approaches
retrieve submatrices where a subset of rows behave coherently in all the columns (middle top), or vice versa
(middle bottom); on the other hand biclustering techniques recover submatrices where a particular subset of rows
Figure: example of grouping objects (rows x1 to x5) and attributes (columns y1 to y5) into biclusters.
Definition 4
If (g, m) ∈ I, then (m′, g′) is called an object-attribute bicluster,
or oa-bicluster, with density ρ(m′, g′) = |I ∩ (m′ × g′)| / (|m′| · |g′|),
i.e. the ratio of the relationships that actually exist to all possible
ones.
Proposition 1
1. For any bicluster (A, B) ∈ 2^G × 2^M it is true that 0 ≤ ρ(A, B) ≤ 1.
2. An oa-bicluster (m′, g′) is a formal concept iff ρ = 1.
3. If (m′, g′) is an oa-bicluster, then (g′′, g′) ≤ (m′, m′′).
Figure: oa-bicluster diagram with labels g, m, g′, m′, g′′, m′′.
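The density of an oa-bicluster can be computed directly from the definition. The sketch below uses a hypothetical toy context; the function names are assumptions made for illustration only.

```python
# Toy formal context (hypothetical data): each object maps to its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m1", "m2", "m3"},
    "g3": {"m3"},
}


def oa_bicluster(g, m):
    """Return (m', g') for a pair (g, m) in I."""
    if m not in CONTEXT[g]:
        raise ValueError("(g, m) must belong to the incidence relation I")
    extent = frozenset(x for x, row in CONTEXT.items() if m in row)  # m'
    intent = frozenset(CONTEXT[g])                                   # g'
    return extent, intent


def density(extent, intent):
    """rho(A, B) = |I ∩ (A × B)| / (|A| · |B|)."""
    if not extent or not intent:
        # Density is undefined for an empty side; this sketch raises.
        raise ValueError("extent and intent must be non-empty")
    hits = sum(len(CONTEXT[g] & intent) for g in extent)
    return hits / (len(extent) * len(intent))


dense = oa_bicluster("g1", "m1")   # ({g1, g2}, {m1, m2}): every cell filled
sparse = oa_bicluster("g2", "m3")  # ({g2, g3}, {m1, m2, m3}): 4 of 6 cells
print(density(*dense), density(*sparse))
```

On this toy context the first bicluster has density 1 and is also a formal concept, while the second has density 4/6 and is not, which agrees with item 2 of Proposition 1.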
OA-Bicluster Generation
Proposed Algorithm [Ignatov & Kuznetsov et al., 2012]
Proposition
For a given formal context K = (G, M, I) and ρ_min > 0, the largest number of oa-biclusters is equal to |I|, and all
oa-biclusters can be generated in time O(|I| · (|G| + |M|)).
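The generation step can be sketched as one pass over the pairs (g, m) ∈ I with precomputed primes. This is an illustrative sketch, not the authors' implementation: the toy context, the function name `generate_oa_biclusters`, and the parameter name `rho_min` are assumptions, and the straightforward density computation used here adds cost beyond the generation bound quoted above.

```python
# Toy formal context (hypothetical data): each object maps to its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m1", "m2", "m3"},
    "g3": {"m3"},
}


def generate_oa_biclusters(context, rho_min=0.0):
    """Map each distinct oa-bicluster (m', g') with density >= rho_min to its density."""
    # Precompute g' for every object and m' for every attribute.
    g_prime = {g: frozenset(row) for g, row in context.items()}
    m_prime_sets = {}
    for g, row in context.items():
        for m in row:
            m_prime_sets.setdefault(m, set()).add(g)
    m_prime = {m: frozenset(objs) for m, objs in m_prime_sets.items()}

    seen = set()
    result = {}
    for g, row in context.items():
        for m in row:  # every pair (g, m) in I yields one candidate
            key = (m_prime[m], g_prime[g])
            if key in seen:
                continue  # different pairs may yield the same bicluster
            seen.add(key)
            extent, intent = key
            hits = sum(len(context[x] & intent) for x in extent)
            rho = hits / (len(extent) * len(intent))
            if rho >= rho_min:
                result[key] = rho
    return result


all_biclusters = generate_oa_biclusters(CONTEXT)
n_pairs = sum(len(row) for row in CONTEXT.values())  # |I|
print(len(all_biclusters), "distinct oa-biclusters from", n_pairs, "pairs")
```

Because each candidate comes from one pair in I, the number of distinct oa-biclusters cannot exceed |I|; raising `rho_min` only removes entries.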
Figure: distribution of the number of biclusters by density (top-left), extent size (top-right), and intent size (bottom).
In all three experiments, class balance was maintained for model validation.
Note: CatBoostClassifier (CatBoost) and LGBMClassifier (LightGBM) are classifiers available in Python libraries. Both
can handle missing values.
Scores of different machine learning classifiers applied to the dataset after eliminating SNPs with missing
genotypes.
In-Close4 is a freely available, open-source tool that constructs the set of concepts satisfying given constraints
on the sizes of extents and intents. It takes as input a context and two parameters, the minimal intent size
(number of attributes) and the minimal extent size (number of objects), and outputs a reduced concept (cxt) lattice:
all concepts satisfying the constraints given by the parameter values (|intent| ≥ m and |extent| ≥ n, where m, n ∈ ℕ).
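The effect of the two size constraints can be illustrated with a naive enumeration. This sketch is not the In-Close4 algorithm: it closes every attribute subset, which is exponential in |M| and only suitable for toy data. The context and all function and parameter names are hypothetical.

```python
from itertools import combinations

# Toy formal context (hypothetical data): each object maps to its attributes.
CONTEXT = {
    "g1": {"m1", "m2"},
    "g2": {"m1", "m2", "m3"},
    "g3": {"m3"},
}


def all_concepts(context):
    """Enumerate all formal concepts by closing every attribute subset (naive)."""
    attributes = set().union(*context.values())
    concepts = set()
    for k in range(len(attributes) + 1):
        for subset in combinations(sorted(attributes), k):
            b = set(subset)
            extent = frozenset(g for g, row in context.items() if b <= row)  # B'
            if extent:
                intent = frozenset(set.intersection(*(context[g] for g in extent)))  # B''
            else:
                intent = frozenset(attributes)  # empty extent: intent is all of M
            concepts.add((extent, intent))
    return concepts


def filter_concepts(concepts, min_intent=0, min_extent=0):
    """Keep concepts with |intent| >= min_intent and |extent| >= min_extent."""
    return {
        (extent, intent)
        for extent, intent in concepts
        if len(intent) >= min_intent and len(extent) >= min_extent
    }


concepts = all_concepts(CONTEXT)
print(len(concepts), "concepts in total")
print(len(filter_concepts(concepts, min_intent=0, min_extent=2)), "with |extent| >= 2")
```

Raising the minimal extent size prunes concepts that cover few objects, which is why the concept counts in the tables grow as the threshold is lowered.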
Min intent size   Min extent size   Total time, s   Number of concepts
0                 45                21.2            18617
0                 40                23.6            34400
0                 30                35.8            68477
0                 20                46.1            165864
0                 10                64.3            214007
0                 5                 188.3           1220576
Number of concepts generated by the In-Close4 algorithm, and elapsed time, before eliminating SNPs with missing
genotypes.
Min intent size   Min extent size   Total time, s   Number of concepts
0                 40                10.4            2743
0                 30                10.6            4196
0                 20                12.6            19620
Number of concepts generated by the In-Close4 algorithm, and elapsed time, after eliminating SNPs with missing
genotypes.
Questions?