
The Analysis of Microarray Data

Interim Report

Basel Abu Jamous


baselabujamous@ieee.org
MSc Student of Information and Intelligence Engineering
Department of Electrical Engineering and Electronics, University of Liverpool, UK

Supervisor
Professor Asoke K Nandi
A.Nandi@liverpool.ac.uk
David Jardine Professor of Signal Processing
Department of Electrical Engineering and Electronics, University of Liverpool, UK

July 2011
Proclamation
This document is prepared upon the request of the Department of Electrical Engineering and
Electronics at the University of Liverpool (Liverpool, UK) as an interim report for the author’s MSc
project. A large portion of this document is expected to be included in the final report of the author’s
MSc Project after performing the appropriate modifications.

The MSc project, which is titled “The Analysis of Microarray Data”, is supervised by Professor
Asoke K. Nandi (David Jardine Professor of Signal Processing, Department of Electrical Engineering
and Electronics, University of Liverpool, UK).

The author proclaims that the entirety of this document is original and that it complies with all
applicable documentation rules and regulations of the Department of Electrical Engineering and
Electronics at the University of Liverpool.

Basel Abu Jamous

Acknowledgements
I would like to take this chance to thank everyone who helped and supported me in reaching this
stage of the project and in producing this interim report.

I would like to thank my supervisor Professor Asoke Nandi, whose standing in the engineering
community and wide professional network were the main reasons this interesting and important
project was proposed in the first instance. I thank him again for his continuous support and for the
trust he has placed in me to carry out the project successfully.

I also thank Dr Rui Fa, researcher at the Department of Electrical Engineering and Electronics
at the University of Liverpool, who advised me on many of the technical details of the project.

I cannot forget to thank my parents, who encouraged me to pursue an MSc degree overseas and
supported me by all available psychological and financial means.

I thank all of my friends in Jordan and in the UK who have never stopped supporting me with
encouraging words and feelings.

Without all of this support and help, this report would never have been as it is today and the
current progress of the project would never have been achieved.

Abstract
This interim report presents the current progress of the MSc project “The Analysis of Microarray
Data”. The project is divided into two independent subprojects by considering two different
microarray data sets: the University of Stanford yeast cell-cycle microarray data set and the
University of Oxford hemoglobin molecules data set. The main objective of analyzing the first data
set is gene discovery through clustering the genes according to the cell-cycle stages at which they
show peaks. The analysis of the second data set aims at identifying the optimal subset of genes that
directly influence the target biological function and then discovering their relative patterns.

With MATLAB as the programming language and tool, 4 clustering techniques have been applied
over the first data set in 19 different configurations. The clustering results were validated using 5
indices, and a shortlist of 11 configurations was extracted. A novel method, called
“combined fuzzy partition matrix formulation and binarization”, is proposed in this research to
combine different clustering results. The proposed method has succeeded in reaching a deeper level of
understanding of the shortlisted clustering results. Investigating the new method and its originality has
been added as an additional theoretical objective of the project.

Currently, the analysis of the first data set has been finished, except for comparing the results with
previous work and finalizing the conclusions. The next phase consists of further investigation of the
proposed method and undertaking the analysis of the second data set.

Keywords: Microarrays, gene clustering, yeast cell-cycle, combined clustering results, fuzzy
partition matrix binarization, k-means, SOMs, hierarchical clustering, SOON clustering.

Table of Contents

PROCLAMATION ............................................................................................................... I
ACKNOWLEDGEMENTS .................................................................................................... II
ABSTRACT...................................................................................................................... III
TABLE OF CONTENTS ..................................................................................................... IV
TABLE OF FIGURES ........................................................................................................ VI
TABLE OF TABLES ......................................................................................................... VIII

CHAPTER 1. INTRODUCTION ............................................................................. 1

1.1. THE DATA SETS ....................................................................................................... 2


1.2. RESEARCH QUESTIONS – PROJECT’S OBJECTIVES .................................................... 3
1.3. THE REPORT’S STRUCTURE APPROACH .................................................................... 4

CHAPTER 2. LITERATURE REVIEW (BACKGROUND) ..................................... 5

2.1. THE HISTORY OF MICROARRAYS ............................................................................... 7


2.2. MICROARRAY STRUCTURE AND ANALYSIS MODEL ...................................................... 8
2.3. GENE SELECTION (GS) – OPEN-LOOP METHODS ..................................................... 10
2.4. GENE SELECTION (GS) – CLOSED LOOP METHODS ................................................. 17
2.5. GENE CLUSTERING ................................................................................................ 20
2.6. SUPERVISED CLASSIFICATION ................................................................................. 27
2.7. LITERATURE REVIEW CONCLUSION ......................................................................... 32

CHAPTER 3. TIME PLAN ................................................................................... 33

3.1. FIRST TIME PLAN (SET IN DECEMBER 2010) ............................................................ 33


3.2. SECOND TIME PLAN (SET IN JUNE 2011) ................................................................. 34
3.3. CURRENT PROGRESS ASSESSMENT........................................................................ 36
3.4. FORWARD WORK PLAN .......................................................................................... 38
3.5. TIME PLAN CONCLUSIONS ...................................................................................... 39

CHAPTER 4. METHODOLOGY .......................................................................... 40

4.1. PARTITION MATRIX FORMAT FOR CLUSTERING RESULTS .......................................... 40


4.2. COMBINED FUZZY PARTITION MATRIX ..................................................................... 42
4.3. COMBINED FUZZY PARTITION MATRIX BINARIZATION ................................................ 44
4.4. MATLAB FOR MICROARRAY DATA ANALYSIS........................................................... 49

CHAPTER 5. RESULTS AND ANALYSIS .......................................................... 50

5.1. CLUSTERING VALIDATION INDICES RESULTS ............................................................ 52


5.2. COMBINED FUZZY PARTITION MATRIX AND BINARIZATION ANALYSIS .......................... 66

CHAPTER 6. CONCLUSIONS ............................................................................ 82

CHAPTER 7. REFERENCES .............................................................................. 83

APPENDIX A. COMBINED FUZZY PARTITION MATRIX RESULTS................. 87

APPENDIX B. COMBINED FUZZY PARTITION MATRIX BINARIZATION


RESULTS ................................................................................................................ 95

Table of Figures
FIGURE ‎2.1: GENERAL MODEL FOR MICROARRAY DATA ANALYSIS .......................................................................................... 8

FIGURE ‎2.2: 2-CLASS PROBLEM. (A) HIGH MEANS DISTANCE AND LOW INTRA-CLASS VARIANCE – THE BEST SEPARABILITY. (B) LOW
MEANS DISTANCE AND LOW INTRA-CLASS VARIANCE. (C) HIGH MEANS DISTANCE AND HIGH INTRA-CLASS VARIANCE. (D) LOW

MEANS DISTANCE AND HIGH INTRA-CLASS VARIANCE – THE WORST SEPARABILITY. .......................................................... 13

FIGURE ‎3.1: GANTT CHART FOR THE PROJECT'S TIME PLAN – SET IN DECEMBER 2010 ............................................................ 34

FIGURE ‎3.2: GANTT CHART FOR THE PROJECT'S TIME PLAN – SET IN JUNE 2011 .................................................................... 35

FIGURE ‎3.3: GANTT CHART FOR THE PROJECT'S CURRENT PROGRESS AND THE REMAINING TIME PLAN – SET ON 15TH JULY 2011 .... 37

FIGURE ‎4.1: SAMPLE OF A CRISP (BINARY) CLUSTERING PARTITION MATRIX ........................................................................... 41

FIGURE ‎4.2: SAMPLE OF A FUZZY CLUSTERING PARTITION MATRIX ....................................................................................... 41

FIGURE ‎4.3: TWO BINARY PARTITION MATRICES TO BE COMBINED ....................................................................................... 42

FIGURE ‎4.4: SAMPLE PAIRWISE MATRIX FOR FUZZY PARTITION MATRICES’ ROWS ALLOCATION................................................... 43

FIGURE ‎4.5: SAMPLE OF A COMBINED FUZZY CLUSTERING PARTITION MATRIX ........................................................................ 44

FIGURE ‎4.6: INTERSECTION BINARIZATION SAMPLE RESULT ................................................................................................ 45

FIGURE ‎4.7: UNION BINARIZATION SAMPLE RESULT .......................................................................................................... 45

FIGURE ‎4.8: MAX BINARIZATION SAMPLE RESULT............................................................................................................. 46

FIGURE ‎4.9: DIFFERENCE THRESHOLD BINARIZATION SAMPLE RESULT FOR (A) Α = 0.7 AND (B) Α = 0.3. ..................................... 46

FIGURE ‎4.10: VALUE THRESHOLD BINARIZATION SAMPLE RESULTS FOR (A) Α = 0.7, (B) Α = 0.5 AND (C) Α = 0.3. ........................ 47

FIGURE ‎4.11: TOP THRESHOLD BINARIZATION SAMPLE RESULT FOR (A) Α = 0.1, (B) Α = 0.2 AND (C) Α = 0.3. ............................. 48

FIGURE ‎5.1: CLUSTERING VALIDATION INDICES VALUES FOR K-MEANS CLUSTERING EXPERIMENTS .............................................. 53

FIGURE ‎5.2: THE NUMBER OF GENERATED CLUSTERS AGAINST THE NUMBER OF REQUESTED CLUSTERS FOR K-MEANS CLUSTERING WITH
UNIFORM RANDOM INITIALIZATION ....................................................................................................................... 54

FIGURE ‎5.3: DB INDEX VALUES FOR THE SOMS CLUSTERING EXPERIMENTS RESULTS ............................................................... 55

FIGURE ‎5.4: CH INDEX VALUES FOR THE SOMS CLUSTERING EXPERIMENTS RESULTS ............................................................... 56

FIGURE ‎5.5: XB INDEX VALUES FOR THE SOMS CLUSTERING EXPERIMENTS RESULTS ............................................................... 57

FIGURE ‎5.6: I INDEX VALUES FOR THE SOMS CLUSTERING EXPERIMENTS RESULTS................................................................... 58

FIGURE ‎5.7: DI INDEX VALUES FOR THE SOMS CLUSTERING EXPERIMENTS RESULTS ................................................................ 59

FIGURE ‎5.8: CLUSTERING VALIDATION INDICES VALUES FOR HIERARCHICAL CLUSTERING EXPERIMENTS ........................................ 60

FIGURE ‎5.9: THE GENERATED NUMBER OF CLUSTERS (K) VERSUS THE CLUSTERS' RADII (D0) FOR THE SOON CLUSTERING EXPERIMENTS
..................................................................................................................................................................... 62
FIGURE ‎5.10: A ZOOMED VERSION OF FIGURE ‎5.9, THE GENERATED NUMBER OF CLUSTERS (K) VERSUS THE CLUSTERS' RADII (D0) FOR
THE SOON CLUSTERING EXPERIMENTS IN THE RANGE OF D0 FROM 4.0 TO 6.5 ............................................................. 63

FIGURE ‎5.11: CLUSTERING VALIDATION INDICES VALUES FOR SOON CLUSTERING EXPERIMENTS ............................................... 64

FIGURE ‎5.12: TIME TAKEN FOR CLUSTERING BY THE SOON CLUSTERING EXPERIMENTS FOR DIFFERENT VALUES OF THE RADIUS (D0) 65

FIGURE ‎5.13: INTERSECTION AND UNION BINARIZATION CLUSTERS MEANS .......................................................................... 67

FIGURE ‎5.14: THE 11 GENES THAT ARE ASSIGNED TO DIFFERENT CLUSTERS BY THE BIOLOGICAL SUGGESTION AND THE INTERSECTION
BINARIZATION. ................................................................................................................................................. 68

FIGURE ‎5.15: MAX BINARIZATION CLUSTERS MEANS ....................................................................................................... 69

FIGURE ‎5.16: GENES OF THE 4TH CLUSTER AS A RESULT OF THE MAX BINARIZATION TECHNIQUE ................................................. 70

FIGURE ‎5.17: THE GENES THAT ARE ASSIGNED TO MULTIPLE CLUSTERS IN THE MAX BINARIZATION TECHNIQUE ............................. 70

FIGURE ‎5.18: VALUE THRESHOLDING BINARIZATION CLUSTERS MEANS ............................................................................... 72

FIGURE ‎5.19: CLUSTER 4 GENES - VALUE THRESHOLDING BINARIZATION (ALPHA = 0.3)........................................................... 73

FIGURE ‎5.20: MULTIPLE-ASSIGNED GENES IN VALUE THRESHOLDING BINARIZATION................................................................ 74

FIGURE ‎5.21: DIFFERENCE THRESHOLDING BINARIZATION CLUSTERS MEANS ........................................................................ 76

FIGURE ‎5.22: THE FIRST 20 GENES (OUT OF 40) THAT ARE UNASSIGNED BY DIFFERENCE BINARIZATION AT ALPHA = 0.3 ................ 77

FIGURE ‎5.23: THE LAST 20 GENES (OUT OF 40) THAT ARE UNASSIGNED BY DIFFERENCE BINARIZATION AT ALPHA = 0.3 ................. 78

FIGURE ‎5.24: TOP THRESHOLDING BINARIZATION CLUSTERS MEANS .................................................................................. 79

FIGURE ‎5.25: THE 9 GENES THAT ARE ASSIGNED TO THE 5 CLUSTERS BY THE TOP THRESHOLDING BINARIZATION WITH ALPHA = 0.4 . 80

FIGURE ‎5.26: THE 15 GENES THAT ARE ASSIGNED TO MORE THAN ONE CLUSTER BY THE TOP THRESHOLDING BINARIZATION AT
..................................................................................................................................................... 81

Table of Tables
TABLE ‎2.1: MATLAB FUNCTIONS FOR OPEN-LOOP GENE SELECTION.................................................................................. 15

TABLE ‎2.2: MATLAB FUNCTIONS FOR CLUSTERING ......................................................................................................... 25

TABLE ‎2.3: MATLAB FUNCTIONS FOR CLASSIFICATION ..................................................................................................... 28

TABLE ‎2.4: SAMPLES OF MICROARRAY DATA ANALYSIS APPLICATIONS AND RESULTS FROM THE LITERATURE ................................. 30

TABLE ‎2.5: THE FULL NAMES OF THE GENE SELECTION, CLASSIFICATION AND VALIDATION METHODS' ABBREVIATIONS THAT ARE USED IN
TABLE ‎2.4 ....................................................................................................................................... 31

TABLE ‎5.1: A SUMMARY OF THE EXPERIMENTS THAT HAVE BEEN CARRIED OUT OVER THE STANFORD UNIVERSITY YEAST CELL-CYCLE
MICROARRAY DATA SET. ..................................................................................................................................... 50

TABLE ‎5.2: SUMMARY OF THE BINARIZATION EXPERIMENTS OF THE COMBINED FUZZY PARTITION MATRIX................................. 66

TABLE ‎A.1: COMBINED FUZZY PARTITION MATRIX FOR THE 11 SHORTLISTED EXPERIMENTS .................................................... 87

TABLE ‎B.1: INTERSECTION BINARIZATION RESULTS........................................................................................................... 96

TABLE ‎B.2: MAX BINARIZATION RESULTS ....................................................................................................................... 97

TABLE ‎B.3: VALUE THRESHOLDING BINARIZATION RESULTS ( ) .............................................................................. 98

TABLE ‎B.4: DIFFERENCE THRESHOLDING BINARIZATION RESULTS ( ) ....................................................................... 99

TABLE ‎B.5: TOP THRESHOLDING BINARIZATION RESULTS ( ) ............................................................................. 100

Chapter 1. Introduction
This interim report aims at presenting and discussing the progress of the MSc project “The Analysis
of Microarray Data” up to this date. In addition to the literature review, which revises the
relevant work previously done by others, and the detailed methodology, results and analysis for the
current outcomes of this project, this report discusses the different versions of the project's time plan,
which have been revised and updated, and compares the most recent one with the actual progress.

This research investigates microarray data sets by applying different machine learning techniques
over them. Microarrays are sets of raw biological genetic data in which the genetic expressions of a
set of genes for a set of biological cells are stored. The basic data structure upon which the analysis is
done is a 2D array known as the sample-gene matrix, which shows the gene expressions of a set of
genes (rows) for a set of samples (columns).

In this project, two microarray data sets are under consideration: the University of Stanford yeast
cell-cycle data set and the University of Oxford hemoglobin data set. The project's plan has been
divided into two main periods of time, in each of which one of the two data sets is analyzed. The
pre-interim-report period has been entirely dedicated to the analysis of the University of Stanford
data set. Thus, the methodology, results and analysis parts of this report exclusively discuss the
processing of that data set. The second data set (i.e., the University of Oxford data set)
has not been investigated yet. The properties of these data sets are given in more detail in section ‎1.1.

The detailed objectives of the research differ between the two data sets. Nevertheless, a general
research objective can be stated at this stage for both sets: this research aims at extracting as much
useful biological information as feasible from the raw microarray data sets provided. This objective is
formulated as a set of research questions in section ‎1.2.

Amongst the machine learning techniques used in this context are: gene selection, in which a
subset of the most informative genes is selected from the entire given gene set; gene clustering, in which the
given genes are grouped into a number of groups according to certain similarity criteria; and sample
classification, in which a supervised classifier is trained with the available samples that are already classified
into known classes so that it can predict the correct class to which a new unseen sample belongs. Up to
this stage of the project, clustering has been the class of machine learning methods used.

MATLAB was chosen as the implementation language and tool for this research project because of
its well-suited environment, which increases productivity by offering easy access to many general-
and special-purpose functions in the field of this research. Machine learning techniques, statistical
functions, mathematical functions and plotting facilities are some examples of what is readily
available to the researcher in MATLAB.

The rest of this introductory chapter is organized in three sections. Section ‎1.1 introduces the two
aforementioned data sets and their main properties, section ‎1.2 details the research questions and
section ‎1.3 sets out the structure and organization of the rest of this report.

1.1. The Data Sets


Two microarray data sets are considered for analysis in this project: the University of Stanford
yeast cell-cycle data set and the University of Oxford hemoglobin data set. Despite this, the way the
project has been prepared leaves it open to analyzing other possible data sets.

Subsections ‎1.1.1 and ‎1.1.2 discuss these two data sets in more detail.

1.1.1. The University of Stanford Yeast Cell-Cycle Data Set


Provided by the University of Stanford, this data set, which consists of 384 gene measurements at
17 data points (samples), represents the yeast cell cycle. The data set was introduced in 1998
[1], and later on analyzed in many publications such as [2; 3; 4].

The 17 time points are spaced 10 minutes apart over a total period of 160 minutes,
covering over 2 complete mitotic cell cycles. The original data set consists of approximately 6000
genes. The biological expectation is that there are 5 main stages in the yeast cell cycle, and based on
this piece of information a subset of 384 genes that peak periodically over the different stages was
selected in [3].

The five stages suggested by the biologists are: early gap 1 (G1), which corresponds to the
beginning of interphase, late G1, synthesis (S), gap 2 (G2) and mitosis (M) [1; 2; 3].

The 384 genes are listed in the first three columns of Table ‎A.1 in ‎Appendix A. The first column
shows the sequential number (index) of the gene with respect to the 384 genes, the second column
gives the name of the gene and the third column gives the biological suggestion for the cluster to
which the gene belongs. The clusters are numbered (1 to 5) reflecting the number of the stage at
which each gene peaks. The rest of the columns of the table are not relevant at this point of the
discussion and will be referenced in the following chapters where suitable.

The data set can be downloaded from http://faculty.washington.edu/kayee/model/ and the
description of the genes can be found at http://genomics.stanford.edu/.
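
As a rough illustration only, the following MATLAB sketch shows one possible way of loading the
downloaded file into the 384 × 17 sample-gene matrix used throughout this report; the file name and
the column layout (a header row, gene names in the first column, followed by 17 expression values
per row) are assumptions about the downloaded file and may need adjusting.

% Hedged sketch: file name and layout are assumptions, not part of the project.
raw = importdata('yeast_cellcycle_384.txt');   % hypothetical tab-delimited file
X = raw.data;                                  % 384 genes x 17 time points
geneNames = raw.textdata(2:end, 1);            % assumed: header row, gene names in column 1
timePoints = 0:10:160;                         % 17 samples spaced 10 minutes apart
fprintf('%d genes x %d samples loaded\n', size(X, 1), size(X, 2));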

1.1.2. The Oxford University Hemoglobin Molecules Data Set


This data set, which is provided by the University of Oxford, consists of the expression levels of
around 20000 probes measured in 12 data samples taken at 4 time points (3 samples per time point).
Up to this point of the project, this data set has not been investigated yet.

1.2. Research Questions – Project’s Objectives
As a general research question, the microarray data sets are analyzed by gene selection, gene
clustering and sample classification to extract as much biological information as possible. The
expected results should give a deeper understanding of the nature of the genes that influence the
target biological functionality addressed by each data set, and this should direct the focus of the
biologists and medics toward a narrower path of study and research.

More specific research questions exist for each of the two data sets discussed in section ‎1.1.

The yeast cell-cycle data set consists of a filtered set of genes (384 genes out of about 6000), so no
gene selection is applied there. The main line of analysis for this data set is gene discovery through
gene clustering. Some of the questions that need to be answered in this research are:

1. How accurate is the biological suggestion of having 5 clusters (stages)? If it is not accurate, what
would be the most accurate number?
2. Which genes are included in each of the clusters? In other words, which are the genes
whose values peak in each of the main cell-cycle stages? And how accurate is the biological
grouping suggestion?
3. Since 2 cell-cycles are covered by the data points, how differently do the genes behave in the
same stages of these 2 cycles?
4. How do the outcomes of this research compare with the outcomes of others' previous
work in terms of the 3 aforementioned questions?

Within the previous progress of the project, a new method for combining the results of different
clustering techniques has been developed and proposed; it is named “combined fuzzy clustering
partition matrix formulation and binarization”. As a consequence, the following research
questions have been added to investigate the methodology rather than the application:

1. How does the proposed method differ from the ensemble clustering methods that exist in the
literature? Can it be claimed to be original?
2. What is the theoretical justification for proposing this method and what theoretical benefits is
the method expected to offer?
3. Practically, what is the performance of this method when applied to the University of Stanford
data set?

For the second data set, the University of Oxford data set, clearer research questions are expected
to be set as the analysis begins. Pre-analysis is needed to discover the aspects that raise
questions worth investigating. Nevertheless, these are some initially proposed questions:

1. Out of the full set of 20000 probe measurements, what is the optimal subset of probes
whose measurements show informative profiles over the 12 samples (4 time points)?
2. How consistent are the measurements of the same probes at the same time point for different
samples?

These questions are subject to revision and modification once this data set starts to be analyzed.

1.3. The Report’s Structure Approach


This interim report aims at documenting what has already been done in this project so far and
drawing the plan for what is expected to be done later on. Due to the existence of two independent
data sets within this project, most of this report can be seen as a semi-final report for the analysis of
the first data set. This justifies the length of this report.

‎Chapter 2 presents a review of the literature related to this project. This chapter includes reviews
about different machine learning families of methods which have already been used in this project and
which are expected to be used in the subsequent phases. Clustering techniques, which are reviewed in
section ‎2.5, are what have been mainly used so far in the analysis of the first microarray data set.
Although some of the other parts of the literature review have not yet been applied, they
are expected to be needed in the analysis of the second data set.

‎Chapter 3 gives the time plan for the project. It discusses the first time plan which was set in
December 2010, an updated plan set in early June 2011 and a newly modified time plan which shows
the current progress. This last time plan has been set at the time of writing this report (Mid July 2011).

‎Chapter 4 discusses the methodology which was followed in the project so far. It presents most of
the details of the newly proposed method, while leaving the discussion of the methods adopted
from the previous literature to ‎Chapter 2 and the description of the data sets to section ‎1.1.

‎Chapter 5 presents the current results of the project with detailed analysis and arguments. In this
interim report, the results in chapter 5 are all related to the University of Stanford data set because the
University of Oxford data set has not been touched yet.

‎Chapter 6 concludes the project’s current progress and ‎Chapter 7 lists the references.

‎Appendix A lists the 384 genes of the University of Stanford’s data set with their names,
biologically suggested classes and the contents of the combined fuzzy partition matrix which results
from applying the new proposed method over the data set. ‎Appendix B lists 5 tables for the results of
5 different clustering experiments. Both appendices are referenced in the course of this report when
appropriate.

Chapter 2. Literature Review (Background)
Biology has always been one of the hottest areas of research because of its strong effect on public
health and the environment. In recent decades, very high-throughput biological data have been
extracted and put under study and analysis. Chemists, statisticians, mathematicians, and information
engineers have collaborated with biologists and medics to help in the analysis of these high-throughput
data. This collaboration has introduced new interdisciplinary fields that are growing at a fast pace due
to their success and their promising undiscovered sides [5].

Bioinformatics is one of the promising and growing fields where biology meets information
technology. Although it is a new discipline, it covers a wide range of research sub-fields such as
microarray data analysis, proteomics, pathway analysis and many others. Microarray technology is
used to examine simultaneous gene expression profiles of different cells and tissues [6; 7] and it is the
main focus of this research.

DNA microarrays are grids of DNA spots (probes) on a substrate used to detect complementary
sequences, where the substrate can be glass, plastic or silicon [6]. The probes measure the levels of
certain chemical structures in the genetic material; many probe readings can represent one gene and one
probe reading might be shared by more than one gene, depending on the particular fabrication
and treatment of the substrate and the genetic data in the particular technology used. The
measurements of the probes show up as different colors and intensities, which can be fed to
the computer by scanning the microarray with high-resolution scanners [6].

The scanned microarray images represent raw unprocessed data which is not ready for actual
machine learning analysis. A preprocessing step is taken at this point to turn the set of raw images
into the standard data structure form of the microarray [7; 8]. The scope of this research project does
not cover preprocessing since microarray data is expected to be received in its standard preprocessed
form.

The standard form of the microarray data structure is what is known in machine learning as
sample-feature matrix. Sample-feature matrix is a matrix of rows representing different features
and columns representing different samples. In the special case of microarrays, the term sample-
gene matrix is used because gene levels represent feature values [5]. A detailed discussion about the
structure of this matrix is presented in section ‎2.2.

As is known in machine learning, there should be more samples than features, even many more
than features in some cases. The situation in microarrays is that samples are far fewer than features
(genes), with ratios on the order of 1:100. This problem is known as the curse of dimensionality [5; 8;
9]. To overcome this problem, a subset of the most influential genes is selected from the entire gene
set in a process called gene selection [5; 10; 11]. Gene selection takes many forms and is performed
using different methods which are discussed in more detail in sections ‎2.3 and ‎2.4.

In addition to gene selection, unsupervised clustering can be applied over genes to group (cluster)
the genes with close profiles into the same groups (clusters) [12; 13]. This type of analysis aims at
extracting useful biological information about gene relations, where genes that fall in the
same cluster are most likely to belong to the same biological pathway [12; 14]. In addition to this, gene
clustering is used to help in gene selection, such that genes are selected from different clusters to
ensure diversity and minimize redundancy [15]. Many clustering techniques have been used in
the context of microarray data analysis and some of them are discussed in more detail in section ‎2.5.

Gene selection ends up with a subset of genes which is expected to best differentiate
between samples from different classes. These genes are considered as the features that are used in
building supervised classifiers. Classifiers are trained and tested using the available samples and are
then expected to generalize such that they correctly classify any new unseen sample into its correct
class [6; 11]. Some classifiers are discussed in section ‎2.6.

For completion, section ‎2.1 gives a brief history about microarray data.

2.1. The History of Microarrays
The first reported use of microarrays might be in 1982 in the analysis of 378 arrayed bacterial
colonies which were assayed in multiple replicas for expression of the genes in multiple normal and
tumor tissues [16]. One of the first experiments in which cDNA clones were arrayed onto a filter then
hybridized with cell lysates was for the analysis of the gene expression profiles of colon cancer and
examined the expression of 4000 genes therein [17]. The use of miniaturized microarrays for gene
expression profiling was first reported in 1995 [18], and a complete eukaryotic genome
(Saccharomyces cerevisiae) on a microarray was published in 1997 [19].

Since then, the identification of genes by the Human Genome Project has allowed for the
expansion of the number of cDNA clones or oligonucleotides spotted on a single slide. The initial
sequencing and analysis of the human genome was reported in 2001 [20; 21]. Today, the average
commercial microarray contains roughly 20,000 clones or oligonucleotides, many of which are
unique. Some companies, such as Agilent Technologies, also make a slide that encompasses genes
from the whole genome with over 44,000 genes spotted on their arrays [6].

Affymetrix GeneChip is among the most common technologies that have evolved to fabricate
microarrays [8]. This technology was invented in late 1980s by a team of scientists led by Stephen
P.A. Fodor and has been widely used for microarrays since 1996 [22].

2.2. Microarray Structure and Analysis Model
To standardize the analysis of microarrays, a commonly accepted form of the microarray data
structure has evolved. The data structure is an (M × N) 2-D matrix of the gene expressions of M genes
for N samples [5]. In some literature it is defined as the transpose of this definition, i.e. (N × M) [9].
This data structure is usually referred to as (X):

X = [ x_{i,t} ],  i = 1, …, M,  t = 1, …, N   (Eq.1)

(Eq.1) shows the mathematical definition of the microarray. The expression x_{i,t} denotes the
value of the gene (i) for the sample (t).

In most cases this set of data is associated with a group labels vector y which maps each
sample's gene expression vector x_t to a group label y_t. Usually the labels are discrete numeric
values that represent different groups. For example, if some of the samples belong to cancer tumors
and the others to normal samples, then y_t might be either 1 or 0, denoting a cancer sample or a normal
sample respectively. (Eq.2) shows the mathematical mapping of x_t to y_t.

x_t → y_t,  y_t ∈ {0, 1},  t = 1, …, N   (Eq.2)
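
As a small illustration of these two structures (not taken from the actual data sets), the following
MATLAB fragment builds a toy sample-gene matrix and an associated label vector; the values are
random placeholders for real expression measurements.

% Toy example of (Eq.1) and (Eq.2); random numbers stand in for expression values.
M = 5;                        % number of genes (rows)
N = 4;                        % number of samples (columns)
X = rand(M, N);               % sample-gene matrix: X(i,t) = expression of gene i in sample t
y = [0 0 1 1];                % label vector: e.g. 0 = normal sample, 1 = cancer sample
x3 = X(:, 3);                 % gene expression vector of sample 3, mapped to the label y(3)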

Such microarray data sets have been analyzed from different points of view and by using different
techniques and methods. Figure ‎2.1 models the main stages of analysis which can be applied over
microarray data sets and are relevant to this project.

Figure ‎2.1: General model for microarray data analysis

As can be seen in Figure ‎2.1 (A), the analysis starts by considering the microarray data matrix
which is discussed above. This matrix results from what is known as preprocessing, which turns the
initial raw data taken from the experiments into the standard form of the sample-gene matrix (X) [7; 8].

After obtaining the matrix (X), the analysis may take many forms. The matrix can be used directly to train a
classifier using the entire gene set. Although this is possible, it has many disadvantages. It demands a
lot of processing resources, and the seriousness of this problem depends on the nature of the classifier
to be used. In addition, using the entire set of genes ignores the empirical biological observation
that, for a particular target issue, most of the genes are irrelevant [5; 23].

The most common way of building classifiers using microarray data is to start with gene selection
to select a subset of genes which is expected to contain the genes most relevant to the particular given
phenotypic issue. Gene selection is usually applied over the sample-gene matrix directly and it tackles
two main issues: discriminative genes and redundant genes. Discriminative genes are those genes
whose profiles have strong statistical differences between different classes, so they are good genes to
use to differentiate between samples that belong to different classes. Redundant genes are those
genes which have close profiles. Even if these genes are strongly discriminative, including all of them
adds no value, since one of them offers almost the same amount of information as all of them [5; 9;
10].

In some cases, to select the best set of genes, it can be useful to start by grouping genes into a number
of groups (clusters) that include genes with close profiles. This process is called clustering and it is
applied using many different clustering methods. After clustering the genes into clusters, genes are
selected for classification purposes by taking a number of genes from each cluster, in a way that these
genes cover different spaces of the classification problem, i.e. come from different clusters [12; 13; 15].

After training the classifier with the samples using the selected subset of genes the classifier needs
to be tested and assigned a numeric performance value. The most common metric to measure
classifiers’ performance is the classification accuracy, which is the percentage of the correctly
classified test samples of the entire test sample set. Many methods have been introduced in the
literature to perform testing and validation for the classifiers [6; 23; 24].

The rest of this chapter discusses these stages in more detail.

2.3. Gene Selection (GS) – Open-Loop Methods
In machine learning, the sample-feature ratio (N/M) is extremely important, where (N) is the number
of samples and (M) is the number of features [5]. For statistically meaningful systems, the ratio needs
to be much greater than 1; even higher orders are needed for some applications. In
microarray data, genes represent features in machine learning terminology, and the dimensionality of
microarrays usually lies in the range of 2000 to 30000 genes for 20 to 200 samples, which results in
ratios of the order of 1:100. This phenomenon of very high dimensionality in microarray data is
known as the curse of dimensionality [5; 8; 9].

According to biologists, most of the genes in the genetic data are irrelevant to the problem being
analyzed. Not only does the curse of dimensionality increase the needed computation cost
dramatically and decrease the statistical significance, but considering the irrelevant genes also adds a
noise term which might misguide the analysis [5; 23].

The aforementioned discussion leads to the necessity of filtering the gene set by picking the most
influential subset of genes to participate in later processing. This process is known as Gene Selection
(GS), which corresponds to Feature Selection in machine learning terms. (Eq.3) mathematically formulates
gene selection:

S = { g_{i_1}, g_{i_2}, …, g_{i_m} } ⊆ G = { g_1, g_2, …, g_M },  m < M   (Eq.3)

As shown in (Eq.3), gene selection chooses m genes from the complete set of M genes. i_j is
the index of the j-th selected gene in the original complete set of genes.

Another problem in the context of gene selection is the fact that even the most informative genes
have a lot of redundancy among them. Selecting two highly informative genes that have
redundant expressions reduces the performance in terms of computation cost and accuracy; it is
equivalent to doubling the weight of one of these two genes. This adds another task to gene
selection methods, which is tackling the problem of redundancy [5; 10; 14].

Many methods have been introduced in the literature for gene selection, for both supervised and
unsupervised learning analyses. These methods are classified into two main classes: filter methods
(open-loop) and wrapper methods (closed-loop) [5; 23]. Filter methods select genes without referring
to the classifier that will be used in the later classification phase. Most filter methods rank the genes
according to some goodness criterion and then select the top genes. Wrapper methods are closed-loop
methods that use feedback from the classifier to help in selecting the best gene subset. Usually, these
methods try to maximize the classification accuracy with respect to the selected subset of genes [5;
23].

In the following subsections, some of the open-loop gene selection methods are presented while
some closed-loop methods are discussed in section ‎2.4.

2.3.1. Consecutive Ranking


This is a class of unsupervised gene selection methods which depend on the information content of
gene expressions. Let the selected subset of genes be S, and let g_i be the gene of index i. The
information contents of the set S and the gene g_i are defined as I(S) and I(g_i) respectively.
Different information formulas are introduced for I(·), which will be discussed in a later stage of this
section.

Consecutive ranking is a recursive process which depends on the information content of the subset
of selected genes as a numeric criterion. The two main classes of consecutive ranking are the Forward
Search (Most Informative First Admitted) and the Backward Elimination (Least Useful First
Eliminated) [5].

2.3.1.1. Forward Search (Most Informative First Admitted)


Forward search starts from an empty list of selected genes, S_0 = ∅, then admits the gene which contains
the largest amount of information. In each round, the gene which adds the largest amount of
information content to the set is admitted [5]. (Eq.4) shows the mathematical notation for the
iteration (t).

S_t = S_{t-1} ∪ { argmax_{g ∉ S_{t-1}} I(S_{t-1} ∪ {g}) }   (Eq.4)

When the size of S reaches m, the process terminates.

2.3.1.2. Backward Elimination (Least Useful First Eliminated)


Backward elimination starts with the complete list of (M) genes contained in S_0 = G, then eliminates
genes iteratively until S has only m genes. In each round, the gene whose elimination causes the
least loss of information content is removed [5; 25]. (Eq.5) represents the mathematical notation
of the iteration (t).

S_t = S_{t-1} \ { argmax_{g ∈ S_{t-1}} I(S_{t-1} \ {g}) }   (Eq.5)

Backward elimination is a form of overselect-then-prune strategy.

Although in the aforementioned description the termination criterion was that the size of S has
reached m, this is not the only criterion to consider. Pre-setting the number of selected genes (m)
is not an easy task and it cannot be chosen arbitrarily. In many cases the number of selected genes
is kept flexible while the quality of the set is taken as the termination criterion; for example, the
process may terminate when I(S) reaches a preset fraction of the information content of the full gene set [5; 25].

More details with examples about consecutive ranking gene selection methods can be found in [5].
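
As a minimal sketch of the forward-search variant described above (and not a prescription of [5]),
the following MATLAB function admits one gene per round; the information function is passed in as
a handle, for example one built from the entropy criterion of section 2.3.2.1, and both the handle and
the target size m are assumptions of this illustration.

function S = forwardSearch(X, m, infoFun)
% Forward search (most informative first admitted) - illustrative sketch.
% X: genes-by-samples matrix, m: number of genes to select,
% infoFun: handle returning the information content of a gene subset.
    M = size(X, 1);
    S = [];                                      % selected gene indices
    candidates = 1:M;
    while numel(S) < m
        bestInfo = -inf;
        bestGene = candidates(1);
        for g = candidates
            info = infoFun(X([S g], :));         % information of S with candidate g added
            if info > bestInfo
                bestInfo = info;
                bestGene = g;
            end
        end
        S = [S bestGene];                        % admit the gene adding the most information
        candidates = setdiff(candidates, bestGene);
    end
end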

2.3.2. Individual Ranking


Individual ranking methods belong to the filter methods class; they rank genes in descending
order and then select the top m genes. Different criteria have been adopted to rank genes, some of which
are for supervised learning and some for unsupervised learning [5]. In the next subsections, some
of the criteria used in research for gene ranking are discussed.

2.3.2.1. Information Content


This criterion is used in unsupervised learning and is defined in different proposed ways. The most
common information content metric is Shannon entropy [5]. (Eq.6) shows the mathematical formula
to calculate Shannon entropy as used in [26].

H(g_i) = − Σ_{t=1..N} p_{i,t} · log2(p_{i,t})   (Eq.6)

p_{i,t} = x_{i,t} / Σ_{s=1..N} x_{i,s}   (Eq.7)

Where H(g_i) is the entropy (i.e. information content) of the gene g_i, N is the number of
samples, p_{i,t} is the relative expression of the gene g_i for the sample (t) and it is calculated as
shown in (Eq.7). x_{i,t} is the value of the gene g_i for the sample (t).

The value of the entropy for each gene ranges from 0 to log2(N), where the value of 0 appears for
gene expressions that appear only in a single sample, and the value of log2(N) for expressions that are
uniformly expressed in all samples. This means that the lower the value of the entropy, the more
sample-specific the gene is.

More information about using Shannon entropy as a gene ranking criterion can be found in [26].
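
A brief MATLAB sketch of entropy-based individual ranking, following (Eq.6) and (Eq.7), is given
below; it assumes non-negative expression values and ranks the lowest-entropy (most sample-specific)
genes first, which is only one of the possible ways of using this criterion.

function [selected, H] = entropyRank(X, m)
% Rank genes by Shannon entropy (Eq.6) and return the m most sample-specific ones.
% X: genes-by-samples matrix with non-negative values, m: number of genes to keep.
    P = bsxfun(@rdivide, X, sum(X, 2));   % relative expression per gene (Eq.7)
    P(P == 0) = eps;                      % avoid log2(0)
    H = -sum(P .* log2(P), 2);            % entropy of each gene (Eq.6)
    [~, order] = sort(H, 'ascend');       % low entropy = expression concentrated in few samples
    selected = order(1:m);
end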

2.3.2.2. Signal-to-Noise Ratio (SNR) Criteria


This class of criteria is applied to supervised learning problems, where the available N samples
are mapped into a number of classes (groups). The argument on which all the SNR criteria rely is that
better (more influential, more informative) genes distribute the data samples such that the means of
the samples of each class are far from each other while the samples of each class are clustered in small
areas around their means (small intra-class variance). Higher mean distances and lower intra-class
variance values lead to better separability between classes [5; 27]. Figure ‎2.2 shows an illustrative example
for a 2-class problem where different pairs of genes are selected, leading to different separability
characteristics.

Using the terms of SNR, the distance between the means is the signal while the intra-class variance
is the noise. The general SNR criteria formula is shown in (Eq.8).

SNR(g_i) = signal(g_i) / noise(g_i)   (Eq.8)

These are some of the most common SNR-based gene ranking criteria for 2-class problems,
considering the “+” class and the “–” class, where μ_{i,+}, μ_{i,−}, σ_{i,+} and σ_{i,−} are the class-conditional
means and standard deviations of the gene g_i for the two classes respectively. Each of the following formulas considers
the SNR for the gene g_i.

Figure ‎2.2: 2-Class problem. (a) High means distance and low intra-class variance – the best separability.
(b) Low means distance and low intra-class variance. (c) High means distance and high intra-class
variance. (d) Low means distance and high intra-class variance – the worst separability.

13
1) Signed Fisher’s Discriminant Ratio (Signed-FDR) [1]:

SNR(g_i) = (μ_{i,+} − μ_{i,−}) / (σ_{i,+} + σ_{i,−})   (Eq.9)

2) Second order FDR [1]:

SNR(g_i) = (μ_{i,+} − μ_{i,−})² / (σ²_{i,+} + σ²_{i,−})   (Eq.10)

3) T-Statistics [1, 7, 12, 24]:

SNR(g_i) = (μ_{i,+} − μ_{i,−}) / sqrt( σ²_{i,+}/N_+ + σ²_{i,−}/N_− )   (Eq.11)

Where N_+ and N_− are the numbers of training samples in the “+” class and the “–” class
respectively.

4) Symmetric Divergence (SD) [5]:

SNR(g_i) = (1/2)·(σ²_{i,+}/σ²_{i,−} + σ²_{i,−}/σ²_{i,+}) + (1/2)·(μ_{i,+} − μ_{i,−})²·(1/σ²_{i,+} + 1/σ²_{i,−}) − 1   (Eq.12)

The idea of t-statistics is generalized for K classes (K ≥ 2) instead of only two classes.
This generalized criterion is known as the F-Ratio [25; 28]. The mathematical formula for the F-ratio is
shown in (Eq.13):

F(g_i) = [ Σ_{k=1..K} N_k·(μ_{i,k} − μ_i)² / (K − 1) ] / [ Σ_{k=1..K} (N_k − 1)·σ²_{i,k} / (N − K) ]   (Eq.13)

Where N_k is the number of samples in class k, N is the total number of samples, μ_{i,k} and σ_{i,k} are
the class-conditional mean and standard deviation for the samples in class k considering the gene g_i, and
μ_i is the global mean over all samples considering the gene g_i.
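
To make the ranking criteria concrete, the following MATLAB sketch computes the signed FDR of
(Eq.9) for every gene of a 2-class problem and selects the top m genes; swapping the line marked
(Eq.9) for one of the other formulas gives the corresponding criterion. The variable names and the
0/1 label coding are assumptions of this illustration.

function [selected, snr] = fdrRank(X, labels, m)
% Signed-FDR ranking (Eq.9) for a 2-class problem.
% X: genes-by-samples matrix, labels: 0/1 class labels, m: number of genes to select.
    Xp = X(:, labels == 1);                  % "+" class samples
    Xn = X(:, labels == 0);                  % "-" class samples
    muP = mean(Xp, 2);   sdP = std(Xp, 0, 2);
    muN = mean(Xn, 2);   sdN = std(Xn, 0, 2);
    snr = (muP - muN) ./ (sdP + sdN);        % signed Fisher's discriminant ratio (Eq.9)
    [~, order] = sort(abs(snr), 'descend');  % rank by magnitude of separability
    selected = order(1:m);
end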

2.3.3. Other Open-Loop GS Methods


Many other open-loop methods have been introduced in the literature, such as principal component
analysis (PCA) [6; 29] and singular value decomposition (SVD) [6; 30].

2.3.4. MATLAB for Open-Loop GS
Table ‎2.1 shows a number of useful MATLAB functions to be used in the implementation of open-
loop gene selection methods. More descriptions and details can easily be found in the help files of
MATLAB.

Table ‎2.1: MATLAB functions for Open-Loop Gene Selection

Function Description
rankfeatures Rank key features by class separability criteria
mahal Mahalanobis distance
pdist Pairwise distance between pairs of objects
mean Mean value
var Variance
std Standard deviation
max Maximum value
min Minimum value
princomp Principal component analysis on data
svd Singular value decomposition
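
As a brief example of the first entry in Table 2.1, the fragment below ranks the genes of a labelled
sample-gene matrix with the Bioinformatics Toolbox function rankfeatures; the variable names X and
labels are placeholders, and 'ttest' is only one of the available criteria (others include 'entropy',
'bhattacharyya' and 'roc').

% X: genes-by-samples matrix, labels: class label vector (placeholders).
[idx, z] = rankfeatures(X, labels, 'Criterion', 'ttest');  % ranked gene indices and scores
top20 = idx(1:20);                                         % the 20 highest-ranked genes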

2.3.5. Discussion and Analysis


The importance of gene selection in microarray data analysis was justified in the introduction of
section ‎2.3 and it was shown that using the entire set of genes in classification is not practical from the
statistical and biological points of view.

The first issue in gene selection is to choose between filter (open-loop) and wrapper (closed-loop)
methods. This point is postponed to section ‎2.4.5, after wrapper methods are introduced. In this
section, comparisons are made between different open-loop methods.

More than one issue should be considered when choosing a gene selection method. For instance, if
the data are not associated with class labels then supervised methods do not apply, but methods like
consecutive ranking and individual ranking with the information content as the comparison criterion
can be used effectively.

In supervised cases the number of classes plays a major role in choosing the ranking criterion; F-
Ratio can handle problems with more than 2 classes while the other SNR criteria are limited to 2-class
problems.

The difference in performance in terms of quality and processing time between methods might be
considered to choose between iterative and non-iterative methods.

Other, more detailed issues might be considered; for example, Signed-FDR considers the sign
while second order FDR takes the sign factor out. Each ranking criterion shows different statistical
characteristics, but the main point is that researchers simply compare the performance of these criteria
by actual implementation and testing, because theoretical comparisons are still not
mature in the context of microarray data analysis.

2.4. Gene Selection (GS) – Closed Loop Methods
The main idea of gene selection was discussed in the previous section with detailed discussion
about some open-loop methods. In this section some closed-loop methods are presented.

2.4.1. SVM Recursive Feature Elimination (SVM-RFE) [5; 10]


This wrapper gene selection method depends on the SVM classifier to rank the features (genes)
recursively and eliminate the least informative ones. The general idea of this method is that
eliminating a large number of genes at once can negatively affect the classification because of the
high dependencies and redundancy among genes. This method suggests eliminating genes recursively
by re-ranking them after every iteration according to their contribution to the SVM classifier.

For linear SVMs the classification boundary equation is:

y = sign( w · x + b )   (Eq.14)

Where y is the expected class for the feature vector x, w is the weight vector and b is
a constant.

The SVM training process takes the training set of feature vectors x_k and their corresponding class labels
y_k as inputs, and gives a set of parameters α_k (k = 1, …, K) as an output, where K
is the number of training points. The weight vector w in the classification boundary equation
(Eq.14) can be calculated from these parameters using:

w = Σ_{k=1..K} α_k · y_k · x_k   (Eq.15)

High weight values indicate high influence in the classification process, while low weight values
indicate low influence. Thus, the ranking criterion which is adopted in this method is:

c_i = (w_i)²   (Eq.16)

SVM-RFE starts with the full set of genes to train an SVM, then calculates the ranking criterion
value c_i for each gene using (Eq.15) and (Eq.16). The gene with the lowest value of c_i is
eliminated from the training gene set and is put at the bottom of the resulting ranked gene list. The
SVM is trained again using the remaining training gene set and the process is repeated until there are
no genes left in the training gene list. At this point, the resulting ranked gene list holds
the entire gene set in order.

The output of SVM-RFE is the ordered gene list, from which the top-ranked genes are selected for
classification.

For further information and discussion about SVM-RFE please refer to [5; 10].
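
The following MATLAB sketch outlines the simplest variant of SVM-RFE (one gene eliminated per
iteration) under the assumption that a linear SVM is trained with fitcsvm from the Statistics and
Machine Learning Toolbox (older releases used svmtrain instead); it is an illustration rather than the
exact implementation referenced in [5; 10].

function ranking = svmRFE(X, labels)
% Illustrative SVM-RFE: returns gene indices ordered from most informative
% (top of the list) to least informative (bottom), one gene removed per loop.
% X: genes-by-samples matrix, labels: class label vector.
    remaining = 1:size(X, 1);
    ranking = [];
    while ~isempty(remaining)
        mdl = fitcsvm(X(remaining, :)', labels, 'KernelFunction', 'linear');
        c = mdl.Beta .^ 2;                       % ranking criterion c_i = w_i^2 (Eq.16)
        [~, worst] = min(c);                     % least informative remaining gene
        ranking = [remaining(worst), ranking];   % earlier eliminations end up lower in the list
        remaining(worst) = [];                   % retrain the SVM without it
    end
end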

2.4.2. Sequential Forward Selection (SFS) [23]
Sequential forward selection is a gene selection method which can be used with any type of
classifier. The first step is to define a comparison metric, such as the classification accuracy, to
compare different sets of selected genes. This metric forms the feedback signal of the
method when it runs. Next, the set of selected genes is initialized to an empty set and the set of
candidate genes is initialized to the full set of genes.

The iterative process runs by moving one gene from the candidate set to the selected set in each loop
until some criterion is met. To choose the gene to be moved in an iteration, each one of the
candidates is tentatively added to the selected set and the comparison metric is calculated using
the entire set of selected genes at that instance. The gene whose addition to the selected gene set
results in the best performance of that set is moved permanently to the selected gene set, and the next
iteration starts.

The stopping criterion can be that a predetermined number of selected genes is reached, that a
specific performance level is reached, or that the rate of performance enhancement falls below a specific value.

2.4.3. Other Closed-Loop GS Methods


Many other closed-loop gene selection methods have been introduced to the literature such as:
Vector-Index-Adaptive SVM (VIA-SVM) [5], shrunken centroids [28], elastic net [28], GA/SVM
[31], GA/KNN [31], consistency based methods [31], genetic algorithm-based feature selection (GA-
FS) [32], prediction risk-based selection for bagging [32], evolving fuzzy gene selection (EFGS) [9]
and fuzzy C-means clustering-based enhanced gene selection method (FCCEGS) [9].

2.4.4. MATLAB for Closed-Loop GS


Closed-loop gene selection methods depend on the chosen type of classifier. MATLAB
functions for classifiers will be presented later in this document, and they represent the base for
implementing both gene selection and classification.

One function, however, is important to mention here because it performs sequential feature
selection. The function is sequentialfs, and it takes the criterion function handle as a parameter, which
makes it flexible enough to be used with different classifiers.

MATLAB help files offer more detailed description with examples about this function.
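
A hedged example of how sequentialfs might be wired up is shown below; the wrapped classifier
(a k-nearest-neighbour model built with fitcknn), the 5-fold cross-validation, the numeric label coding
and the variable names X (samples-by-genes, i.e. the transpose of the sample-gene matrix) and y are
assumptions of this illustration rather than choices made in this project.

% Criterion: number of misclassified test samples, which sequentialfs minimizes.
critFun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= predict(fitcknn(Xtr, ytr, 'NumNeighbors', 3), Xte));
opts = statset('Display', 'iter');
inModel = sequentialfs(critFun, X, y, 'cv', 5, 'options', opts);
selectedGenes = find(inModel);      % indices of the genes chosen by forward selection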

2.4.5. Discussion and Analysis


The first question to be answered in choosing a gene selection method is whether it should be an open-
loop or a closed-loop method. Open-loop methods consume much less processing than closed-loop
methods because they are done in one shot (even if they are iterative). Closed-loop methods include
full training and testing of a selected classifier using the selected gene set in every loop, and this
consumes much more computation and time resources. On the other hand, closed-loop methods take
the classifier and the classification accuracy into consideration, and this is more intuitive because the
classification accuracy is the ultimate goal in most cases. Another complication in the
comparison is that open-loop methods are totally independent of the classifier, while closed-loop
methods cannot be applied before the classifier to be used is determined.

The trend in research is to start with open-loop methods to filter the very large full set of genes by
eliminating the less informative ones. The result is a subset of genes which is still larger than the final
sought gene subset. These resulting genes are then processed by a closed-loop method to obtain the final
subset of selected genes. This 2-stage gene selection approach can be referred to as coarse-grained
gene selection (open-loop) followed by fine-grained gene selection (closed-loop).

Open-loop methods were discussed and compared in section ‎2.3.5, while closed-loop methods are
discussed and compared here.

It is not easy to compare closed-loop methods, because this involves comparisons between the
classifiers on which these methods depend. Even so, one can see that some methods, such as
SVM-RFE, VIA-SVM, shrunken centroids, elastic net and others, are totally dependent on the chosen
classifier. This is not the case for other methods, such as sequential forward selection,
whose algorithm can be used with any classifier.

2.5. Gene Clustering
Unsupervised clustering is a class of methods which aim at grouping (clustering) the given data
into a number of groups (clusters), where the members of each cluster share common characteristics that
differentiate them from the other clusters [12; 13]. The term “unsupervised” indicates that there are no
pre-allocated class labels for the samples.

In microarray data analysis, clustering is usually applied over genes. This has many consequences
such as helping in gene selection or deepening the understanding of the similarities and dissimilarities
between different genes [12]. In gene selection for example, selecting genes from different clusters
implies that these genes have more independence and less redundancy, while the genes that belong to
the same cluster have more overlapping characteristics [15].

Many clustering methods have been applied in the context of microarray analysis, and some of
them are listed in this section.

2.5.1. Self-Organizing Maps (SOMs)


Self-Organizing Maps (SOMs) were introduced by Kohonen [33] as an unsupervised learning
technique for clustering. SOMs define a map of units (neurons) that are organized in a low
dimensional space (usually 2-D) according to a specific topology. The topology can be a simple
rectangular grid, a hexagonal grid, a random topology or any other suitable topology. Depending on
the adopted topology, a distance metric is defined to quantify the distance between any two map units (neurons). The distance between the i-th and the j-th neurons, d_{ij}, can be calculated using the Euclidean, city-block, chessboard or any other suitable distance metric.

Given n-D sample vectors x_1, x_2, …, x_N, SOMs map these vectors to the map units, which are considered as the clusters’ centres. Each map unit i is assigned an n-D model vector m_i(t), where t is the step number. The algorithm starts by setting the initial values of the model vectors and then moves to the iterative process.

In each step, a random sample x(t) is picked, then a winner model vector m_c(t) is identified using (Eq.17), which corresponds to the closest model to the picked sample.

c = \arg\min_i \| x(t) - m_i(t) \|        (Eq.17)

If the model vectors are always normalized, then (Eq.18) can be used instead of (Eq.17):

c = \arg\max_i \left( x(t)^{T} m_i(t) \right)        (Eq.18)

where c is the index of the closest model vector to the picked sample x(t).

Once the winning model vector is identified, all of the model vectors are updated using (Eq.19).

m_i(t+1) = m_i(t) + \alpha(t) \, h_{ci}(t) \, [x(t) - m_i(t)]        (Eq.19)

This equation moves the model vectors towards the picked sample; the closer a model is to the winning model, the larger the update it receives. This is achieved by the effect of the neighborhood function h_{ci}(t), which is equal to 1 when i = c and decreases as unit i becomes more distant from the winner c. To control the learning rate, the scalar α(t) is used. It is common to use a Gaussian function as the neighborhood function, which is also called the smoothing kernel, while sharp-edged (bubble) neighborhood functions are used in the literature as well [34]; see (Eq.20) and (Eq.21).

h_{ci}(t) = \exp\left( -\frac{d_{ci}^{2}}{2\sigma^{2}(t)} \right)        (Eq.20)

h_{ci}(t) = \begin{cases} 1, & d_{ci} \le \sigma(t) \\ 0, & \text{otherwise} \end{cases}        (Eq.21)

In both equations, σ(t) controls the spatial width of the neighborhood function and d_{ci} is the map distance between unit i and the winner c. As time passes (i.e., t increases), α(t) and σ(t) decrease to narrow the search.

For more about SOMs with applications please refer to [13; 15; 30; 33; 35; 36; 37].
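To make the update rule concrete, the following minimal MATLAB sketch implements the sequential update of (Eq.17), (Eq.19) and the Gaussian neighborhood of (Eq.20) for a simple 1-D chain of units; the placeholder data, map size and decay schedules are assumptions for illustration, not the settings used in this project.

% Minimal sketch of the sequential SOM update; placeholder data and parameters.
X = randn(100, 17);                    % 100 genes x 17 time points (placeholder)
K = 10;                                % number of map units on a 1-D chain
idx0 = randperm(size(X, 1));
M = X(idx0(1:K), :);                   % initial model vectors taken from the data
for t = 1:2000
    alpha = 0.5 * exp(-t / 1000);      % decaying learning rate alpha(t)
    sigma = 3.0 * exp(-t / 1000);      % decaying neighborhood width sigma(t)
    x = X(randi(size(X, 1)), :);       % pick a random sample x(t)
    [~, c] = min(sum((M - repmat(x, K, 1)).^2, 2));       % winner unit (Eq.17)
    h = exp(-((1:K)' - c).^2 / (2 * sigma^2));            % Gaussian kernel (Eq.20)
    M = M + alpha * repmat(h, 1, size(M, 2)) .* (repmat(x, K, 1) - M);  % (Eq.19)
end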

2.5.2. Self-Organizing Oscillator Networks (SOON)


Self-Organizing Oscillator Networks (SOON) were introduced in [38] and used in [12]. SOON is an unsupervised clustering method which imitates biological behaviors such as that of fireflies, which, when they exist in groups, tend to synchronize their flashing despite flashing randomly when alone.

The basic component of SOON as a clustering technique is the “integrate and fire” (IF) oscillator. Each oscillator is represented by a state variable x_i which is bounded between 0 and 1. When an oscillator’s state variable reaches 1, it fires, sends an impulse to the neighboring oscillators, and rests (its state variable is set to 0). The impulse updates the state values of the other oscillators such that oscillators that are close in distance to the firing oscillator (i.e., belong to the same cluster) have their state values brought closer to the state value of the winning oscillator, while oscillators that are far in distance are moved farther away in terms of state values. By repeating this step iteratively, the oscillators are eventually clustered into groups where the state values of the oscillators in the same group are always the same, hence synchronized.

Another element is introduced in this model; it is the phase of the oscillator, φ_i. The phase is a variable bounded between 0 and 1 and is mapped to the state variable by a smooth, monotonically increasing mapping function f. As suggested in [38] and used in [12], f can be defined as shown in (Eq.22):

x_i = f(\phi_i) = \frac{1}{b} \ln\left[ 1 + \left( e^{b} - 1 \right) \phi_i \right]        (Eq.22)

where b is a constant that controls the concavity of the mapping function. Positive values of b make it concave down, while negative values make it concave up. Note that f(0) = 0 and f(1) = 1; this is a condition that should be met by the chosen function f.

Different implementations have been used for SOON, the simplest one is called SOON-1
algorithm [12; 38] and the rest of this subsection will be discussing it.

SOON-1 starts by considering that each sample in the data set has a corresponding oscillator. For a sample set of size N and at iteration t, there are N oscillator phase values φ_1(t), φ_2(t), …, φ_N(t). The state values x_i(t) can be calculated from the phases using (Eq.22).

In each step, the following actions are performed consecutively:

1) The winning oscillator j, which is the one whose phase value is closest to 1, is identified by (Eq.23).

j = \arg\max_i \phi_i(t)        (Eq.23)

2) All oscillators’ phase values are increased by the amount that brings the winning oscillator’s phase to 1 (i.e., by 1 − φ_j(t)).


3) The state variables x_i(t) are calculated using (Eq.22).
4) The state variables are updated using (Eq.24):

(Eq.24)

where B(·) bounds its argument to [0, 1] (Eq.25) and ε is the coupling strength of the oscillator at a given phase (Eq.26).

B(x) = \min\left( 1, \max(0, x) \right)        (Eq.25)

(Eq.26)

where d_0 is the radius of the cluster, d_1 is a distance limiting parameter usually set to 5·d_0, C_E is the constant of excitation and is usually small (of the order of 0.1), C_I is the constant of inhibition and is usually set to C_E/N, and d(i, j) is the distance between the oscillators i and j using a distance function such as the Euclidean.

5) The new phase values are calculated using the inverse of the function f (Eq.27).

\phi_i = f^{-1}(x_i) = \frac{e^{b x_i} - 1}{e^{b} - 1}        (Eq.27)

6) The dynamics of the oscillators that have just synchronized are modified to be identical so that
it is guaranteed that they keep synchronized later on. The condition for this to be true for a
group of synchronized oscillators is as in (Eq.28).

(Eq.28)

To achieve this, the distances between each of the synchronized oscillators and any other oscillator
should be identical. This can be achieved by setting these distances to their maximum, minimum or
average. (Eq.29) shows the 3 choices mathematically.

(Eq.29)

7) The oscillators that have fired (i.e. whose phase values have reached 1) are rested by setting
their phase values to 0.

For more details about SOON-1 as well as about other implementations of SOON please refer to
[12; 38].

2.5.3. K-Means Clustering


K-means clustering is discussed in [6; 30; 39; 40]. In [40], 4 different initialization methods are compared, namely: random, Forgy (FA), MacQueen (MA) and Kaufman (KA). The first 3 are stochastic while the 4th is deterministic. The paper suggests, based on empirical results, that deterministic initialization is better.
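As an illustration, MATLAB's kmeans function exposes both stochastic initialization options and the empty-cluster handling strategies that are used later in Chapter 5; the sketch below is a generic, hedged example with placeholder data, and the Kaufman deterministic initialization mentioned above is not built in, so it would have to be implemented separately and passed as a starting matrix.

% Hedged example of MATLAB's kmeans with different options (placeholder data).
X = randn(384, 17);                                     % genes x time points
K = 5;
idx1 = kmeans(X, K, 'start', 'uniform', 'replicates', 20, ...
              'emptyaction', 'drop');                   % uniform random initialization
idx2 = kmeans(X, K, 'start', 'sample', 'replicates', 20, ...
              'emptyaction', 'singleton');              % sample initialization, singletons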

2.5.4. Other Clustering Techniques


Many other clustering techniques have been introduced in the literature such as: hierarchical
clustering [6; 30; 39; 41; 42], particle swarm optimization-based clustering [13], fuzzy clustering [14;
43], simulated annealing based clustering [39], model-based clustering [3] and ensemble clustering
[2].

2.5.5. Clustering Performance Evaluation
A few clustering validity indices have been introduced in the literature to evaluate the performance of different clustering techniques and compare them [12]. In the following subsections, some of these indices are presented.

As a general notation, c_k is the centre of the k-th cluster, c is the centroid of the entire data set, K is the number of clusters, n is the number of data points and u_{ki} is the membership of the i-th data point in the k-th cluster. In crisp clustering, a data point has a membership value of 1 (one) in the cluster to which it belongs and zeros in the other clusters.

2.5.5.1. Davies-Bouldin (DB) Index


Suppose that S_k = \frac{1}{n_k} \sum_{x \in C_k} \| x - c_k \| is the scatter within the k-th cluster C_k, where n_k is the cardinality of this cluster, and d_{kl} = \| c_k - c_l \| is the distance between the centres of the k-th and the l-th clusters; then the Davies-Bouldin (DB) index is defined as in (Eq.30).

DB = \frac{1}{K} \sum_{k=1}^{K} R_k        (Eq.30)

where R_k = \max_{l \neq k} \frac{S_k + S_l}{d_{kl}}. Any good clustering technique aims at minimizing the index DB in order to get the best performance [12].
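A small MATLAB function sketch of the DB index for crisp clusterings, written directly from the definition above, is shown below; the function name and the use of Euclidean distances are assumptions for the example (save it, for instance, as daviesBouldin.m).

function db = daviesBouldin(X, idx)
% Sketch of the Davies-Bouldin index for a crisp clustering.
% X is a (points x features) matrix and idx the cluster label of each point.
K = max(idx);
C = zeros(K, size(X, 2));                       % cluster centres
S = zeros(K, 1);                                % within-cluster scatters
for k = 1:K
    Xk = X(idx == k, :);
    C(k, :) = mean(Xk, 1);
    S(k) = mean(sqrt(sum((Xk - repmat(C(k, :), size(Xk, 1), 1)).^2, 2)));
end
R = zeros(K, 1);
for k = 1:K
    ratios = zeros(K, 1);
    for l = [1:k-1, k+1:K]
        ratios(l) = (S(k) + S(l)) / norm(C(k, :) - C(l, :));
    end
    R(k) = max(ratios);                         % worst-case overlap for cluster k
end
db = mean(R);                                   % lower values indicate better clustering
end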

2.5.5.2. Calinski Harabasz (CH) Index


(Eq.31) shows the mathematical formula to calculate the CH index for n data points partitioned into K clusters.

CH = \frac{\operatorname{trace}(B) / (K-1)}{\operatorname{trace}(W) / (n-K)}        (Eq.31)

In this equation, B and W are the between-cluster and the within-cluster scatter matrices. By substituting the expressions for both traces in (Eq.31), (Eq.32) is obtained.

CH = \frac{\left[ \sum_{k=1}^{K} n_k \| c_k - c \|^2 \right] / (K-1)}{\left[ \sum_{k=1}^{K} \sum_{x \in C_k} \| x - c_k \|^2 \right] / (n-K)}        (Eq.32)

where n_k is the number of points in the k-th cluster. To achieve the best clustering performance, the index CH must be maximized [12].

2.5.5.3. Xie Beni (XB) Index


The XB index computes the ratio between the compactness of the fuzzy K-partition of the data set and its separation. (Eq.33) shows the mathematical formula to calculate the XB index.

XB = \frac{\sum_{k=1}^{K} \sum_{i=1}^{n} u_{ki}^{2} \| x_i - c_k \|^{2}}{n \cdot \min_{k \neq l} \| c_k - c_l \|^{2}}        (Eq.33)

For the best performance, the XB index must be minimized [12].

2.5.5.4. “I” Index
The index “I” is defined as in (Eq.34) below.

(Eq.34)

The parameters of the index are user defined. For best performance, the index “I” must be maximized [12].

2.5.5.5. Dunn’s Index (DI)


Dunn’s Index (DI) is defined as in (Eq.35) below:

DI = \min_{1 \le k \le K} \left\{ \min_{l \neq k} \left( \frac{\delta(C_k, C_l)}{\max_{1 \le m \le K} \Delta(C_m)} \right) \right\}        (Eq.35)

where δ(C_k, C_l) is the distance between the k-th and the l-th clusters and is defined as δ(C_k, C_l) = \min_{x \in C_k, \, y \in C_l} d(x, y). Δ(C_m) is the diameter of the m-th cluster and is defined as Δ(C_m) = \max_{x, y \in C_m} d(x, y). Here, d(x, y) is the distance between the data points x and y. For better clustering performance, DI needs to be maximized [39; 44].
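A corresponding MATLAB sketch of Dunn's index is given below, again assuming crisp cluster labels and Euclidean distances; the function name is an assumption for the example.

function di = dunnIndex(X, idx)
% Sketch of Dunn's index for a crisp clustering.
% X is a (points x features) matrix and idx the cluster label of each point.
K = max(idx);
D = squareform(pdist(X));             % pairwise distances between all points
diam = zeros(K, 1);                   % cluster diameters
for k = 1:K
    Dk = D(idx == k, idx == k);
    diam(k) = max(Dk(:));
end
minInter = inf;                       % smallest between-cluster distance
for k = 1:K-1
    for l = k+1:K
        Dkl = D(idx == k, idx == l);
        minInter = min(minInter, min(Dkl(:)));
    end
end
di = minInter / max(diam);            % higher values indicate better clustering
end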

2.5.6. MATLAB for Clustering


Table 2.2 shows a number of useful MATLAB functions to be used in the implementation of gene clustering. More descriptions and details can easily be found in the MATLAB help files.

Table ‎2.2: MATLAB functions for clustering

Function Description
clusterdata Construct agglomerative clusters from data
pdist Pairwise distance between pairs of objects
kmeans K-means clustering
mahal Mahalanobis distance
dendrogram Dendrogram plot of hierarchical binary cluster tree
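As a brief, hedged illustration of how these functions fit together (alongside the related Statistics Toolbox functions linkage and cluster, which are not listed in the table), the following sketch builds an average-linkage hierarchical clustering with placeholder data:

% Hedged hierarchical clustering sketch with placeholder data.
X = randn(384, 17);                   % genes x time points
K = 5;                                % desired number of clusters
D = pdist(X, 'euclidean');            % pairwise distances between genes
Z = linkage(D, 'average');            % average-linkage hierarchical tree
idx = cluster(Z, 'maxclust', K);      % cut the tree into K clusters
dendrogram(Z);                        % plot the binary cluster tree
% clusterdata(X, 'maxclust', K) wraps the pdist/linkage/cluster steps in one call.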

2.5.7. Discussion and Analysis
Clustering importance and usefulness have been introduced and justified in the introduction of
section ‎2.5.

Many clustering methods are available in the literature. Some clustering methods, such as self-organizing maps (SOMs), need the number of clusters to be predetermined, while this is not the case with other methods such as self-organizing oscillator networks (SOON). Although leaving the number of clusters flexible looks like an advantage and simplifies the process, other parameters, such as the radius of the cluster, need to be tuned in methods such as SOON to give the best result. In this case, incorrect tuning of the parameters may lead to extreme results such as one cluster per sample or a single cluster for the whole sample set. In short, choosing the best method can depend on which parameters are known in advance.

As with gene selection methods, most studies compare different clustering methods by actually implementing and testing them.

2.6. Supervised Classification
Supervised classification is the process of training a system (classifier) using the available labeled
data samples to be able to classify a new unlabeled sample correctly. This is done in two phases; the
training phase and the online phase. The training phase uses the available sample set to tune the
classifier’s parameters to be able to distinguish between the samples that belong to different classes,
while the online phase is when the trained classifier is used to classify an unlabeled sample.

The classifiers should achieve generality when they are designed and trained. This means that they
should be able to classify new unlabeled samples in general. Generalization necessitates the avoidance
of a problem known as overfitting, which is to over fit the classifier to the training data, even to the
exceptional training samples, in a way that makes it weaker in dealing with new unseen samples [10;
45].

Many classifiers have been introduced in the literature of machine learning, and many of them
have been applied in the context of microarray analysis as well. Examples of classifiers are: support
vector machines (SVM) [5; 28; 32; 46; 47], k-nearest-neighbor (KNN) [5; 25; 28; 41], artificial neural
networks (ANN) [9; 38; 48; 49], decision trees [24], weighted voting [5], random forests [28],
shrunken centroids [45; 28], elastic nets [28], genetic algorithm-based selective ensemble learning
(GASEN) [32], mutual-information-based selective ensemble learning (MISEN) [32], ensemble of
neural networks [32], neuro-fuzzy ensemble model [9; 14], genetic-algorithm-based multi-task
learning (GA-MTL) [32], heuristic multi-task learning (H-MTL) [32] and genetically evolved
decision trees [24].

2.6.1. Classification Testing and Validation


When different methods are applied on different data sets, there should be a metric that quantifies
the performance of each method to set up a valid comparison. The classification accuracy is the most
intuitive and most common metric in the context of classification methods. In general, classification
accuracy is the percentage of the correctly classified patterns over the whole pattern set.

Although this looks intuitive and promising, the problem is that the available data points whose target labels (classes) are known are limited. If all of these data points (patterns) are used to train the classifier, then the classifier is expected to be very specific to these points, and using the same points for testing would give high but misleading accuracy percentages which do not measure how general the classifier is. From the training point of view, we need the largest possible number of training points to make the classification as statistically meaningful as possible. But from the testing point of view, we need the largest number of testing points that have not been used for training, so that the measured accuracy gives a better indication of the generality of the classifier.

This problem of selecting training and testing data sets has been tackled by many methods such as
holdout validation [5; 6], k-fold cross-validation [5; 24; 31], leave-one-out cross-validation (LOOCV)
[9] and some methods that depend on bootstrapping [28; 31]. In [28], and in addition to the
classification accuracy, other metrics are suggested which measure the consistency of the gene
selection methods and the classification method.

2.6.2. MATLAB for Classification Implementation and Validation


Table 2.3 shows a number of useful MATLAB functions to be used in the implementation of classification methods and in testing and validating the classification results. More descriptions and details can easily be found in the MATLAB help files.

Table ‎2.3: MATLAB functions for classification

Function Description
classify Discriminant classify
classregtree Construction classification and regression trees
NaiveBayes.fit Construct Naïve Bayes classifier object by fitting
training data
NaiveBayes.predict Predict class label for test data using Naïve Bayes
classifier
knnclassify Classify data using K-nearest neighbor method
svmclassify Classify data using support vector machine
svmtrain Train support vector machine classifier
classperf Evaluate performance of classifier
crossvalind Generate cross-validation indices
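The following hedged sketch shows how several of these functions might be combined for 5-fold cross-validated KNN classification; the placeholder data, the number of folds and the choice of 3 neighbors are assumptions for the example, not settings used in this project.

% Hedged sketch of 5-fold cross-validation with KNN (placeholder data).
X = randn(60, 20);                               % samples x selected genes
y = [ones(30, 1); 2 * ones(30, 1)];              % class labels
cp = classperf(y);                               % performance tracking object
folds = crossvalind('Kfold', y, 5);              % fold index for each sample
for f = 1:5
    testIdx = (folds == f);
    trainIdx = ~testIdx;
    pred = knnclassify(X(testIdx, :), X(trainIdx, :), y(trainIdx), 3);
    classperf(cp, pred, testIdx);                % accumulate this fold's results
end
accuracy = cp.CorrectRate;                       % overall classification accuracy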

In addition to the aforementioned functions, MATLAB offers a number of toolboxes which help in
implementing different classifiers. The most important toolboxes in this context are: neural network
toolbox, fuzzy logic toolbox and optimization toolbox. These toolboxes can be used through either a
user friendly GUI or the MATLAB command window. Full documentation with examples can be
found in the MATLAB help files.

2.6.3. Discussion and Analysis
Building strong and accurate classifiers is one of the ultimate goals of microarray data analysis.
Different classifiers can be compared by different metrics, but at the end and in most of the cases, real
choice is based on real implementation and testing.

K-nearest neighbor is the most intuitive method. The choice of the value of K is not easy and is critical. In the case of small K (K = 1 in the extreme case), the risk of overfitting arises, while very large K values can degrade the performance. One more point about K-nearest-neighbor classifiers is that when the number of samples is large, the classification task is expensive and might not be practical for real-time applications.

Support vector machines (SVM) and artificial neural networks (ANN) differ from KNN in that their training is expensive while the on-line classification process is very fast. SVM follows a sound theoretical basis which leads to one unique and global solution; it searches for the discriminant hyper-plane with the largest margins. SVM results in an understandable geometric solution which might allow for further analysis.

On the other hand, ANN was inspired by the biological concept of neural networks and has been developed over the years through applications and experiments, followed by theoretical analysis later on. It has been shown that ANNs can solve any classification problem given a sufficient number of neurons, layers and processing time. When compared with SVMs, ANNs result in complex geometric solutions that might not allow for further analysis, while they are better than SVMs in that they can solve any linear or non-linear problem.

For SVMs to solve non-linear problems, the kernel trick must be applied, which adds another level of complexity to their theory and implementation.
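As a hedged MATLAB illustration of this point, the Bioinformatics Toolbox svmtrain function accepts a kernel choice, so switching from a linear to a non-linear SVM is a one-argument change; the data below are placeholders.

% Hedged illustration of the kernel trick with svmtrain/svmclassify.
Xtrain = randn(40, 10);  ytrain = [ones(20, 1); 2 * ones(20, 1)];   % placeholders
Xtest  = randn(10, 10);
svmLin = svmtrain(Xtrain, ytrain);                              % linear SVM
svmRbf = svmtrain(Xtrain, ytrain, 'Kernel_Function', 'rbf');    % Gaussian (RBF) kernel
predLin = svmclassify(svmLin, Xtest);
predRbf = svmclassify(svmRbf, Xtest);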

In testing and validation, it is commonly unacceptable to test a classifier using the same samples
that are used for training because this omits the generalization factor and can lead to high but
misleading results.

Most researchers use cross-validation methods such as K-fold and leave-one-out cross-validation (LOOCV). For small sample sets, LOOCV might give better indications, because if a large set of samples were held out, the remaining training samples might not be sufficient for statistically significant training.

Consistency metrics shed light on a different aspect, namely how consistent the results are when different training and testing sets are chosen. Consistency metrics indicate how general the classification is expected to be when the entire set of samples is used for training.

2.6.4. Classification Applications and Results
This section overviews some of the applications of microarray data analysis by classification, together with their results and references. Table 2.4 lists a number of them. The first 5 columns detail the microarray data set in terms of the title of the set, the number of classes (#C), the number of samples (#X), the distinct numbers of samples for each class and the number of genes (m), respectively.

Table ‎2.4: Samples of microarray data analysis applications and results from the literature

Title    #C    #X    Dist. Classes    m    Gene Selection    Classifier    Validation    Class. Acc. (%)    Ref.
Colon 2 62 40, 22 6000 IG NFE LOOCV 100 [9]
Cancer (In SNR NFE LOOCV 97.4
some IG SVM LOOCV 90.3
sets t-test BSVM Holdout 93.54 [11]
only LS Bound- LS-SVM Bootstrap 85 [23]
2000 SFS
are MIFS GP 10-fold CV 100 [24]
used) MIFS GATree 10-fold CV 88.33
t-statistic GP 10-fold CV 98.33
t-statistic GATree 10-fold CV 85.0
SC.s SC.s Bootstrap 87.8 [28]
Leukemia 2 72 25, 47 7129 IG NFE LOOCV 95.85 [9]
Cancer SNR NFE LOOCV 95.85
IG SVM LOOCV 94.1
RFE SVM LOOCV 100 [10]
t-test SVM Holdout 100 [11]
MGS_SOM Rule Based - 100 [15]
LS Bound- LS-SVM Bootstrap 97.6 [23]
SFS
38 - 3051 Elastic Net SVM Bootstrap 95.5 [28]
Lymphoma 2 47 24, 23 4026 IG NFE LOOCV 93.61 [9]
Cancer FCCEGS NFE LOOCV 95.74
SNR SVM LOOCV 76.0
77 58, 19 - t-test SVM Holdout 98.47 [11]
Prostate 2 102 52, 50 6033 t-test BSVM Holdout 98.04 [11]
Cancer RFE RF Bootstrap 93.9 [28]
Lung 2 181 31, 150 12533 MGS_SOM Rule Based - 100 [15]
Cancer
Breast 2 96 - 4869 SC.s SC.s Bootstrap 67.4 [28]
Cancer Many data sets used with different classifiers, the best result is ~70 [29]
reported here
Round blue- 4 63 23, 20, 6567 Special ANN 3-fold CV 100 [48]
cell tumors 12, 8 Filter +
(SRBCTs) wrapper
diagnostic PCA
categories

The 6th, 7th and 8th columns show the gene selection, classification and validation methods. These
methods are presented in an abbreviated form while their full names are detailed in Table ‎2.5. The 9th
column of Table ‎2.4 shows the classification accuracy as given in the relevant reference which is
shown in the 10th column.

Table 2.5: The full names of the gene selection, classification and validation methods' abbreviations that are used in Table 2.4

Type of Method Abbreviation Full Name


Gene Selection IG Information Gain (Ranking)
SNR Signal to Noise Ratio (Ranking)
FCCEGS Fuzzy C-mean Clustering-based Enhanced Gene
Selection
MGS_SOM Microarray Gene Selection by using Self-Organizing
Maps
LS Bound-SFS Least Squares Bound combined with Sequential Forward
Selection
MIFS Mutual Information Feature Selection
RFE Recursive Feature Elimination
SC.s Shrunken Centroids
PCA Principal Component Analysis
Classifier NFE Neuro-Fuzzy Ensemble
SVM Support Vector Machine
BSVM Biased Support Vector Machine
LS-SVM Least Squares Support Vector Machine
GP Genetic Programming
GATree Genetically Evolved Decision Tree
SC.s Shrunken Centroids
RF Random Forests
ANN Artificial Neural Network
Validation LOOCV Leave-One-Out Cross-Validation
K-fold CV K-fold Cross-Validation

Results obtained from different research sources can only be compared when the data sets used are the same. Even so, some results might be questionable, such as the 100% accuracy for both data sets (leukemia cancer and lung cancer) reported in [15]. That result is obtained by using an (If … Then) rule over only one or two selected genes, while more sophisticated methods were applied on the same data sets using larger numbers of selected genes and lower accuracies were obtained [9; 23]. Dealing with results and comparisons should therefore take into account more details than are presented in Table 2.4.

Although different performance metrics other than classification accuracy are found in the
literature, classification accuracy is the only one to be reported in this section because it is the most
common and the easiest one to compare different works with.

2.7. Literature Review Conclusion
Microarray technology is relatively a new technology which allows measuring multi-gene profiles
in parallel and is implemented using many technologies such as Affymetrix and others. Biochemists
extract gene profiles using this technology and hand them as numeric arrays to information engineers
to perform different analysis processes over them and send the results back to biochemists for farther
interpretation.

The standard microarray data structure over which all farther analysis is performed is what is
known as sample-gene matrix, or sample-feature matrix in machine learning terminology, which is a
2D matrix with rows that represent genes and columns that represent samples.

Microarray analysis is done over many stages which can be handled sequentially or in parallel
depending on the targets of the researcher. These stages are:

 Gene selection: reducing the dimensionality of the problem by selecting the subset of the
genes with the highest influence on the classification process.
 Gene clustering: grouping genes with close profiles in groups (clusters) and use that to
help in gene selection and to infer deeper biological understanding.
 Classification: Training classifiers with the available known samples to be able to
correctly classify new unseen samples.
 Testing and Validation: testing the trained classifiers to measure the classification
accuracy taking the generalization factor into consideration.

Gene selection, gene clustering and sample classification can be performed using a wide range of different methods introduced in the literature and summarized in this document. Choosing a method from this long list depends on many heuristic factors, but in most cases the final choice is made by actually implementing and testing several methods with different parameters and comparing their performance.

To conclude, microarray analysis lies within machine learning and is a wide field with no clear theoretical basis that rules the best choices of methods and techniques. Research in the field is still ongoing and most of the results are empirical rather than theoretical. This research project tries to contribute to the literature of microarray data analysis by applying the methods found in the literature in a new context (over newly extracted data sets) and might contribute to the theory of the field as a side effect.

Chapter 3. Time Plan
This chapter discusses the time plan which was set for this project. The chapter presents the first version of the time plan, which was set in December 2010, and the wide modifications to which it was exposed when the new time plan was set in early June 2011. The new time plan has been a good guide for the project’s progress until now, even though the actual progress has not followed it strictly. The actual progress of the project until now is presented and discussed in this chapter as well.

The project’s given time starts on Monday the 6th of June 2011 and ends on Thursday the 15th of
September 2011. The total period of the project is almost 14 weeks with 3 main deadlines that should
be considered in the time plan:

1. Friday 15th July 2011 – Interim report submission.


2. Monday 12th September 2011 – Final report submission.
3. Thursday 15th September 2011 – Project’s presentation and discussion.

Section ‎3.1 discusses the old time plan which was set in December 2010, section ‎3.2 discusses the
new time plan which was set in early June 2011, section ‎3.3 discusses the current progress with
comparisons with the new time plan, section ‎3.4 discusses the forward work plan and section ‎3.5
concludes the chapter.

3.1. First Time Plan (Set in December 2010)


The information which was available in December 2010 was not as clear as in June 2011. The
main incorrect pieces of information which were considered in designing this time plan and which
caused the dramatic modifications to it later on are:

1. The project is a 12-week project in the period from June 2011 to September 2011.
2. The only deadline is the last day of the 12th week on which the final report should be
submitted.
3. One data set is to be analyzed; it is the University of Oxford data set.

Figure ‎3.1 shows the old time plan in a Gantt chart presentation and reflects the aforementioned 3
misleading pieces of information. It is clear in the figure that the total given time is 12 weeks, the
interim report doesn’t exist and the data analysis tasks don’t consider two different and separate data
sets.

Most of the main tasks in this time plan are stated in one way or another in the newer version, so discussing the details of the tasks is better left to the next section. One basic note here is that the actual literature review burden occurred in the couple of months before the date at which this time plan starts, which makes allocating one week for it in this plan a reasonable choice.

Figure ‎3.1: Gantt chart for the project's time plan – Set in December 2010

3.2. Second Time Plan (Set in June 2011)


Before the first day of the project was due, a new time plan had been set to reflect the new pieces
of information that had been revealed. Figure ‎3.2 shows this time plan in a Gantt chart presentation
format.

The first points to notice are the 3 deliverables delivered at the first day of the project. By the time
at which this plan was set, the literature review part had been finished and its documentation was
ready. The MATLAB software was installed successfully and tested and the two microarray data sets
were expected to be received by the first day.

The project’s time line is almost divided into two distinct periods; the first period, which lasts for
about 6 weeks from the 6th of June to the 15th of July 2011, is assigned to the analysis of the first data
set; namely, the Stanford University yeast cell-cycle data set. This period is terminated by the
submission of the interim report whose deadline is the last day of the period. The second period lasts
for 8 weeks from the 18th of July to the 12th of September. This period is assigned to the analysis of
the second data set, the Oxford University microarray data set, and ends by submitting the final report
(dissertation) whose deadline is the 12th of September. One extra task is the preparation for the
presentation and discussion on the 15th of September, the last official day for the project.

As can be seen in Figure ‎3.2, some similar tasks are carried out over both data sets. The first week
for each of them is used for initial analysis (tasks 4.1 and 6.1). Initial analysis is the way to discover
the characteristics of the data sets and their special issues. This helps in determining the machine
learning methods (generally speaking, the computational methods) that are expected to give the
desired analysis results.

Figure ‎3.2: Gantt chart for the project's time plan – Set in June 2011

As the initial analysis is done, the promising machine learning techniques and methods are
designed and implemented in the following 2 weeks (Tasks 4.2 and 6.2). When some methods become
ready, simulation starts by applying these methods over the data sets (Tasks 4.3 and 6.3). Simulation
results start to come out even before all of the needed methods are implemented. After that, the
generated results are analyzed (Tasks 4.4 and 6.4) which might necessitate more simulation to bridge
possible gaps. The last step in the analysis is to conclude the results (Tasks 4.5 and 6.5).

An extra possible task is added to both parts which might and might not take place in reality; it is
to prepare and write a document to publish the results (Tasks 4.6 and 6.6). The significance of the
outcome of each of the data sets’ analysis is what determines if a publication is to be written or not.
Although it is not easy to determine that at the moment, it is expected that the second data set will
need more time to conclude and write publication because it is a new data set to the literature and has
a lot of areas to be investigated. If the analysis of the first data set is to be published, the publication
document should not extend the analysis of the first data set for more than an extra week. For the
second data set, it should be ready by the end of the project’s time line.

The interim report (This report), which presents the literature review in addition to the time plan
and the current progress, should be submitted by the 15th of July. It is expected that the report will
offer most of the results and the analysis of the first data set. A period of about 2 weeks should be
sufficient to write this report (task 5).

The final report (dissertation) is what carries most of the project’s weight. A 4-week period is
assigned to write and finalize that report which should be submitted by the 12th of September 2011
(task 7). Following the dissertation, a presentation and a discussion should be held on the 15th of September 2011. A well worked-out project and a well-written dissertation should make a one-week period, or slightly more, sufficient to prepare for this final task (Task 8).

3.3. Current Progress Assessment


As the first period of the project has finished, a comparison between the expected time plan (Figure
‎3.2) and the actual one should be made. As a summary of the comparison which is stated below, the
time plan in general worked fine.

Figure ‎3.3 shows the updated time plan. The time plan in this figure is based on the time plan
shown in Figure ‎3.2 above but shows the actual time spans taken by the first period’s tasks instead of
the expected ones.

Deliverables 1 and 2 did not change, basically because they had already been ready even before the previous time plan was set. The 3rd deliverable was delivered one week after its expected time, for reasons beyond the author’s control: external delays in receiving the data sets.

Figure ‎3.3: Gantt chart for the project's current progress and the remaining time plan – Set on 15th July 2011

As the reception of the data sets was delayed by one week, there was no way to perform the initial analysis on time. Even so, my supervisor had enough knowledge about the first data set to skip this first step. That knowledge revealed that this data set should be exposed mainly to gene clustering techniques for gene discovery.

What has been discussed justifies the clear modification shown in the time plan. The initial analysis task has been waived and the rest of the tasks of the first period have been shifted one week to the left. The task which was given the general name of “Machine learning methods design and implementation” has been renamed to the more specific name “Clustering methods implementation”. The “design” part has been dropped because these methods are already designed in detail in the literature.

Another main route change occurred while applying the clustering methods over the data set and at
the beginning of the analysis phase. A new method for combining the clustering results of different
methods was revealed. As the time plan has been already shifted to the left and more spare time was
offered, this new method was taken into consideration for further investigation and analysis. Task 4.4
is added to the plan for the design and the implementation of the new method. A new separate one-
week period of simulation is added to the task 4.2 to simulate the results of the new method and the
results analysis task (task 4.3) is extended to cover a 5-week period. These modifications reflect the
actual time taken for each of the tasks.

Simulation results and the new method’s design resulted in a large amount of information and
conclusions and hence the interim report was expected to be larger. For this reason, writing the
interim report started one week before its scheduled date (Task 5).

The unexpected amount of results from the first data set in addition to the issue of the new
proposed method forced more serious progress assessment and rescheduling for the rest of the project.
This task took a significant time and is shown as a new task in the figure (Task 5.1).

As a result of what has been discussed, the analysis of the first data set is done except for finalizing the data analysis and concluding the results. The next section discusses the remaining work needed to close the case of the first data set and analyze the second one.

3.4. Forward Work Plan


The time period after the 15th of July in the time plan in Figure 3.3 bears the plan for the forward work in this project. The forward work can be considered in two parts: finalizing the analysis of the first data set and analyzing the second data set.

According to the discussions in section 3.3, the analysis of the first data set has shown a promising side when a new method for combining different clustering results was revealed. This success necessitates delving into the analysis of this data set to deeper levels, where novel outcomes are expected to appear and to be suitable for publishing. Thus, the next short period is dedicated to the comparison between the results of this research and the results of previous works on the same data set. More research is needed as well to verify the originality of the new method. In addition to that, several theoretical issues and proposals surround the new proposed method, and they should be assessed to identify any potential promising areas for further investigation.

As a consequence of the aforementioned notes, the starting date to analyze the second data set has
been postponed by one week, waiving the task which was called “initial analysis” for the same reason
for which the “initial analysis” of the first data set was waived.

By applying the experience which has been gained in the first period of the project over the second
period’s plan, the results analysis task of the second data set (Task 6.3) has been extended to 3 weeks
instead of 2 weeks and the publication writing task has been shrunk to 2 weeks instead of 3.

3.5. Time Plan Conclusions


The first designed time plan in December 2010 was too far from what happened later on because
of the vagueness of the information which was available at that time. A realistic time plan was set
from scratch in June 2011, just before starting the project, to reflect the clearer view of the project’s
components and goals. This time plan reflected the existence of two main microarray data sets to be
analyzed, the existence of the interim report whose deadline is the 15th of July 2011 and the actual
deadline for the final report (dissertation) which is the 12th of September 2011.

The real progress worked close to the newly set time plan with a good initial sign of saving one
week at the beginning. This spared week allowed for deeper analysis for the data, resulting in a new
proposed method design and analysis, and for more time for the report’s writing.

The second period of the timetable has been exposed to almost no modifications, though slight modifications were forced by the results of the first period’s analysis.

Chapter 4. Methodology
This chapter discusses the methodology which is followed in the analysis of the first data set (i.e.,
the Stanford University yeast cell-cycle data set).

The main type of analysis applied over this data set is gene discovery through clustering. 4
clustering methods were applied over the data set with different parameters. These methods are k-
means clustering, self-organizing maps (SOMs), hierarchical clustering and self-organizing oscillator
networks (SOON). These methods have been thoroughly discussed in the literature, for more details
refer to section ‎2.5 which reviews the relevant literature and points to some useful references when
appropriate.

As the results for the clustering experiments are obtained, a comparison between these results is
required to reach a conclusion. One method of comparison is to compare the values of the clustering
validation indices when calculated for the results of these experiments. 5 clustering validation indices
are introduced and discussed in section ‎2.5.5.

A new method is introduced in this research project which exploits the results of the different experiments by combining them in different ways. This new method, which is based on the fuzzy partition matrix format, creates a single “combined fuzzy partition matrix” from the set of partition matrices produced by the experiments. Most of this chapter is dedicated to presenting this method in detail.

Section ‎4.1 discusses the partition matrix format for clustering results, section ‎4.2 discusses the
generation of a single combined partition matrix from the set of partition matrices available, section
‎4.3 discusses different ways of extracting a binary partition matrix from a fuzzy partition matrix and
section ‎4.4 introduces the usage of MATLAB as an implementation programming language and tool.

4.1. Partition Matrix Format for Clustering Results


The main objective of clustering is to group similar data points into a number of groups (clusters) according to specific similarity criteria. One of the ways to represent the results of a clustering method is to use what is known as the partition matrix.

For a problem of clustering n points into K clusters, the partition matrix U is a K × n matrix where the element u_{ki} represents the membership of the i-th data point in the k-th cluster. In the case of gene clustering, genes represent the data points. In crisp clustering, where each point belongs exclusively to one cluster, the membership value for a point to its cluster is 1 while it is zero for all of the other clusters. In fuzzy clustering, each point might belong to different clusters with different membership values.

Examples of crisp (binary) and fuzzy partition matrices are shown in Figure 4.1 and Figure 4.2. These examples consider a problem of 20 points and 4 clusters.

1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
Clusters

2 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1
3 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.1: Sample of a crisp (binary) clustering partition matrix

1 0.9 0.3 0 0.8 0 0 0 0.4 0 0 0 0.2 0 0.6 0.5 0 0 0.1 1 0.3


Clusters

2 0 0.3 1 0.1 0.2 1 0.1 0.2 0.1 0 0 0.3 0.5 0.1 0.1 0.2 0 0.1 0 0.2
3 0 0.3 0 0 0.2 0 0.5 0 0.9 0 0.5 0.2 0.2 0.2 0.2 0 1 0 0 0.4
4 0.1 0.1 0 0.1 0.6 0 0.4 0.4 0 1 0.5 0.3 0.3 0.1 0.2 0.8 0 0.8 0 0.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.2: Sample of a fuzzy clustering partition matrix

All of the clustering methods’ results should be converted to this common format so that the later analysis can be applied to all of them uniformly.
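As a minimal sketch of this conversion, the label vector returned by a crisp method such as kmeans can be turned into the K × n binary partition matrix as follows; the placeholder data and variable names are assumptions, not the project's actual conversion code.

% Convert a crisp label vector into a K x n binary partition matrix.
X = randn(384, 17);  K = 5;                  % placeholder data and cluster count
idx = kmeans(X, K);                          % idx(i) = cluster of data point (gene) i
n = numel(idx);
U = zeros(K, n);                             % partition matrix (clusters x points)
U(sub2ind([K, n], idx(:)', 1:n)) = 1;        % membership 1 in the assigned cluster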

4.2. Combined Fuzzy Partition Matrix
For partition matrices {U_1, U_2, …, U_R} resulting from R different clustering experiments for the same set of data points (n points) and the same target number of clusters (K), the partition matrix U* is the combined fuzzy partition matrix which fuses the R matrices’ contents.

The most intuitive method of combining is to average the partition matrices by summing them in
an element by element fashion then dividing the result by their count. A serious problem appears in
this method which is the different order of similar clusters in different results. Refer to Figure ‎4.3 and
the following discussion to clarify this problem.

1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
Clusters

2 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 1
3 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
4 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(a)
1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1
Clusters

2 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0
3 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0
4 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(b)
Figure ‎4.3: Two binary partition matrices to be combined

Most of the clustering methods do not respect a specific order for the clusters, which may make the k-th cluster in one result correspond to the l-th cluster in another result. If the clusters (the rows) in the result in Figure 4.3 (b) are reordered, it can be easily noticed that the clustering result is almost identical to the one in Figure 4.3 (a), except for the data point number (11) which shows some difference. This problem should be addressed by reordering the results’ clusters to align them while averaging.

The proposed way in this research project is to consider the first partition matrix out of the R matrices as the reference, then to reorder the second matrix and combine it with the reference. The intermediate combined matrix is then considered as the reference while it is combined with the third matrix. This iterative process builds the final combined fuzzy matrix by fusing the matrices one by one.

In the aforementioned process description it is seen that the reordering step is performed (R − 1) times, each time between two matrices of K clusters. The problem is a combinatorial optimization problem known as the allocation problem, where the optimum permutation of the second result’s clusters should be found in a space of K! possible permutations. This is an NP-hard problem with a complexity of O(K!) and cannot be solved exactly in polynomial time.

Without resorting to some of the relatively complex optimization methods, a simpler method of
allocation is used in this research. This method applies a max-min approach to allocate clusters from
the first result to clusters from the second result. Figure ‎4.4 shows a sample pairwise distance matrix
whose rows represent the clusters of the first result and whose columns represent the clusters of the
second result. Each element is the distance between the clusters represented by the corresponding row
and column.

The allocation is done in K iterations. In each iteration, the minimum value of each of the columns is calculated. The row and the column which produce the maximum of these minima are allocated to each other and are not considered in the later iterations. In the sample in Figure 4.4, the maximum of the minima in the first iteration has the value of (4) and it leads to the allocation of the 2nd row to the 3rd column.

1 2 3 4
1 2 2 6 7
2 5 5 4 5
3 8 3 5 1
4 6 0 8 6
Min 2 0 4 1
Figure ‎4.4: Sample pairwise matrix for fuzzy partition matrices’ rows allocation

Calculating the pairwise matrix has an asymptotic complexity of O(K²n). The algorithm then performs K iterations; in each one it finds K column minima, each of which consumes O(K) time, which amounts to O(K³) in total. Thus, the total asymptotic complexity of the algorithm is O(K²n + K³).

Although in theory this might result in finding a local optimum permutation, in practice it doesn’t
seem to do so because of the relatively far distances between clusters compared to the differences
between different results.

The distance between clusters can be measured by many metrics. Although the Hamming distance looks intuitive for binary matrices, it fails to serve in the general case of fuzzy results. The method described above involves combining clustering results with intermediate matrices that are most likely fuzzy, so using the Euclidean distance looks reasonable. Different distance metrics can be investigated in future work though.
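The following MATLAB sketch follows the max-min allocation step described above; the variable names, the random placeholder matrices and the Euclidean row distances are assumptions made only to keep the example self-contained.

% Sketch of the max-min allocation between two partition matrices U1 (reference)
% and U2, both K x n. D(r, c) is the Euclidean distance between cluster r of the
% reference and cluster c of the result being aligned.
K = 4;  n = 20;
U1 = double(rand(K, n) > 0.5);  U2 = double(rand(K, n) > 0.5);   % placeholders
D = zeros(K);
for r = 1:K
    for c = 1:K
        D(r, c) = norm(U1(r, :) - U2(c, :));
    end
end
perm = zeros(K, 1);                      % perm(r) = column allocated to reference row r
rowFree = true(K, 1);  colFree = true(1, K);
for it = 1:K
    Dw = D;
    Dw(~rowFree, :) = inf;               % hide rows that are already allocated
    colMin = min(Dw, [], 1);             % minimum of each column over the free rows
    colMin(~colFree) = -inf;             % ignore columns that are already allocated
    [~, c] = max(colMin);                % column giving the maximum of these minima
    [~, r] = min(Dw(:, c));              % free row producing that minimum
    perm(r) = c;                         % allocate reference row r to column c
    rowFree(r) = false;  colFree(c) = false;
end
U2aligned = U2(perm, :);                 % reorder the second result's clusters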

4.3. Combined Fuzzy Partition Matrix Binarization
A combined fuzzy partition matrix shows the membership of each data point in each of the clusters. This value can also be interpreted as the probability that the point belongs to the cluster. It can be claimed that higher membership values indicate higher confidence that a certain point belongs to a certain cluster.

Binarization is the process of converting a combined fuzzy clustering partition matrix into a binary partition matrix which maps the data points to the clusters in a distinct way (1 or 0, belongs or does not belong). Let B be the binarized version of the combined fuzzy clustering partition matrix U*.

Figure 4.5 shows a sample combined fuzzy clustering partition matrix which might be generated by combining the results of 10 clustering experiments; it shall be used as an example to illustrate the different binarization methods.

1 0.9 0.3 0 0.8 0 0 0 0.4 0 0 0 0.2 0 0.6 0.5 0 0 0.1 1 0.3


Clusters

2 0 0.3 1 0.1 0.2 1 0.1 0.2 0.1 0 0 0.3 0.5 0.1 0.1 0.2 0 0.1 0 0.2
3 0 0.3 0 0 0.2 0 0.5 0 0.9 0 0.5 0.2 0.2 0.2 0.2 0 1 0 0 0.4
4 0.1 0.1 0 0.1 0.6 0 0.4 0.4 0 1 0.5 0.3 0.3 0.1 0.2 0.8 0 0.8 0 0.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.5: Sample of a combined fuzzy clustering partition matrix

Notice the data points 3, 6, 10, 17 and 19. Their combined membership values indicate that they belong to certain clusters by consensus; all of the 10 experiments voted, for example, for the points 3 and 6 to be in the same cluster. On the other hand, the data point 2 is confused; it has similar (and small) membership values in three different clusters. This point is hard to assign to one of the clusters. The points 1, 4, 9, 16 and 18 have very high membership values in one of the clusters and very small values in the rest. In this case, one might claim that there is enough confidence in mapping these points to certain clusters and in treating the small membership values as noise.

The aforementioned discussion leads to different proposed binarization methods. The following
subsections introduce some possible methods with their results if applied over the matrix in Figure
‎4.5.

Before proceeding to the subsections, recall that the clustering problem involves mapping n data points to K clusters. The results of the methods to be discussed might show one or both of these two observations:

1. Unmapped data points: Some data points might not be mapped to any of the clusters; their new binary membership values are zeros in all of the clusters. Let N_un denote the number of these data points.

2. Multi-mapped data points: Some data points might be mapped to more than one cluster at the same time, with a membership value of 1 in each. Let N_mul denote the number of these data points.

Although the existence of either of these observations breaks the concept of having the membership values of a specific data point sum to 1, this concept can be waived in the course of this analysis for convenience.

4.3.1. Intersection
This binarization method maps a data point to a cluster if all of the clustering experiments agreed
to map it to this cluster. The result of the intersection binarization applied over the matrix in Figure
‎4.5 is shown in Figure ‎4.6.

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Clusters

2 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.6: Intersection binarization sample result

This method results in N_un = 15 and N_mul = 0.

4.3.2. Union
Every data point is mapped to every cluster to which at least one of the experiments mapped it, i.e., to every cluster to which there is any possibility that it belongs. Figure 4.7 shows the result of union binarization applied over the matrix in Figure 4.5.

1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 1
Clusters

2 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1
3 0 1 0 0 1 0 1 0 1 0 1 1 1 1 1 0 1 0 0 1
4 1 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.7: Union binarization sample result

This method results in N_un = 0 and N_mul = 15.

4.3.3. Maximum Value
Every data point is mapped to the appropriate cluster according to the maximum membership value. If more than one cluster shares the maximum value, then the data point is mapped to all of them. Figure 4.8 below shows the result of using this method over the matrix in Figure 4.5.

This method results in N_un = 0 and N_mul = 4.

1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0
Clusters

2 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1
4 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
Figure ‎4.8: Max binarization sample result

4.3.4. Difference Threshold


In this method, a data point is mapped to the appropriate cluster according to the maximum membership value only if the difference between the maximum membership value and the second maximum is not less than a threshold α.

For the extreme case of α = 0, the method becomes identical to max binarization, and for α = 1, the method becomes identical to intersection binarization. Figure 4.9 (a) and (b) show the result of binarizing the matrix in Figure 4.5 with α values of 0.7 and 0.3 respectively.

1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Clusters

2 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(a)
1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
Clusters

2 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(b)
Figure ‎4.9: Difference threshold binarization sample result for (a) α = 0.7 and (b) α = 0.3.

For α = 0.7, N_un = 11 and N_mul = 0, while for α = 0.3, N_un = 7 and N_mul = 0.

4.3.5. Value Threshold
Each point is mapped to all the clusters in which its membership value is not less than a specified threshold α. If α = 0, a trivial unwanted case appears where all of the points are mapped to all of the clusters. If α is an arbitrarily small positive number, this method is identical to union binarization. If α = 1, this method is identical to intersection binarization.

Figure 4.10 (a), (b) and (c) show the results of binarizing the matrix in Figure 4.5 using this method with α = 0.7, α = 0.5 and α = 0.3 respectively.

1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Clusters

2 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(a)
1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
Clusters

2 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
3 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0
4 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(b)
1 1 1 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1
Clusters

2 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 0 1
4 0 0 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(c)
Figure ‎4.10: Value threshold binarization sample results for (a) α = 0.7, (b) α = 0.5 and (c) α = 0.3.

For α = 0.7, N_un = 10 and N_mul = 0; for α = 0.5, N_un = 4 and N_mul = 1; for α = 0.3, N_un = 0 and N_mul = 7.

4.3.6. Top Thresholding
Each point is assigned to the cluster to which its maximum membership value points and to all of the clusters whose membership values are not less than the maximum value by more than a predefined threshold α.

When α = 0, this technique becomes identical to the max binarization technique. As the threshold α is increased, the constraints on adding genes to the clusters are relaxed.

Figure 4.11 (a), (b) and (c) show the results of binarizing the matrix in Figure 4.5 using this method with α = 0.1, α = 0.2 and α = 0.3 respectively.

1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1
Clusters

2 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1
4 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(a)
1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1
Clusters

2 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1
3 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1
4 0 1 0 0 1 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(b)
1 1 1 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1
Clusters

2 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 1
3 0 1 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1
4 0 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 0 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Data Points (Genes)
(c)
Figure ‎4.11: Top threshold binarization sample result for (a) α = 0.1, (b) α = 0.2 and (c) α = 0.3.

This method results in N_un = 0 for all values of α, while N_mul grows as α is increased (6, 7 and 8 multi-mapped points for α = 0.1, 0.2 and 0.3 respectively).
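As a hedged sketch of how some of these binarization rules reduce to simple matrix operations in MATLAB, assuming U is a K × n combined fuzzy partition matrix with values in [0, 1] (the placeholder matrix and variable names below are assumptions for the example):

% Hedged sketch of binarization rules on a combined fuzzy partition matrix U.
K = 4;  U = rand(K, 20);                                 % placeholder matrix
Bint = double(U == 1);                                   % intersection (section 4.3.1)
Bun  = double(U > 0);                                    % union (section 4.3.2)
Bmax = double(U == repmat(max(U, [], 1), K, 1));         % maximum value (section 4.3.3)
alpha = 0.5;
Bval = double(U >= alpha);                               % value threshold (section 4.3.5)
Btop = double(U >= repmat(max(U, [], 1) - alpha, K, 1)); % top thresholding (section 4.3.6)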

4.4. MATLAB for Microarray Data Analysis
MATLAB is a special-purpose programming language produced by MathWorks and is widely used by engineers as a powerful research tool. MATLAB’s syntax is based on matrices, where mathematical operations and functions are performed over large high-dimensional multi-point data sets simultaneously (this is from the researcher’s point of view, where the coding complexity is reduced dramatically; from the computational resources point of view, these operations are still executed as full-length loops of machine code). It also provides the research community with a wide range of built-in functions and user-friendly graphical toolboxes which perform field-specific tasks.

Considering MATLAB for microarray data analysis is justified by the abundance of built-in functions and toolboxes relevant to this field and by the relative ease of implementing new tasks. In addition to that, microarray data are high-dimensional multi-point data sets for which the MATLAB syntax seems to be the best fit.

MATLAB functions and toolboxes that support this research project are distributed over Chapter 2
(the literature review), each in its relevant section. Although it might be that not all of the available
and relevant functions and toolboxes are enumerated in this document, MATLAB help documents are
easy to access and search.

MATLAB is provided freely by the University of Liverpool computing services for students and
staff. No special arrangements are required for the project related to this resource.

Chapter 5. Results and Analysis
This chapter presents and analyzes the empirical results of the experiments that have been carried out in this project over the Stanford University yeast cell-cycle data set.

19 different clustering experiments have been carried out over the Stanford University yeast data set, which contains measurements of 384 genes over 17 samples (time points). 4 clustering methods have been used with different parameters. The methods are: K-means clustering, self-organizing map (SOM) neural networks, hierarchical clustering and self-organizing oscillator networks (SOON).

Table ‎5.1 summarizes the 19 experiments’ details. The first column shows the number of the
experiment, the second one shows the clustering algorithm name, the third shows the number of times
by which the experiment was repeated and the fourth shows the algorithm’s parameters.

Table ‎5.1: A summary of the experiments that have been carried out over the Stanford University yeast
cell-cycle microarray data set.

No. Method Rep. Parameters


1 K-means 20 Uniform random initialization.
Empty clusters are dropped.
K: from 2 to 20.
2 K-means 20 Sample initialization.
Empty clusters are dropped.
K: from 2 to 20.
3 K-means 20 Uniform random initialization.
Empty clusters are mapped to singletons.
K: from 2 to 20.
4 K-means 1 Kaufman deterministic initialization.
Empty clusters are dropped.
K: from 2 to 20.
5 K-means 1 Kaufman deterministic initialization (method 2).
Empty clusters are dropped.
K: from 2 to 20.
6 SOMs 20 Batch Mode.
Bubble neighborhood.
2D hexagonal grids.
Dimensions from (1x2) to (10x10).
7 SOMs 20 Batch Mode.
Gaussian neighborhood.
2D hexagonal grids.
Dimensions from (1x2) to (10x10).
8 SOMs 1 Sequential Mode.
Bubble neighborhood.
2D hexagonal grids.
Dimensions from (1x2) to (10x10).
9 SOMs 1 Sequential Mode.
Gaussian neighborhood.
2D hexagonal grids.
Dimensions from (1x2) to (10x10).

50
10 Hierarchical 1 Single linkage.
K: from 2 to 20.
11 Hierarchical 1 Complete linkage.
K: from 2 to 20.
12 Hierarchical 1 Average linkage.
K: from 2 to 20.
13 Hierarchical 1 Centroid linkage.
K: from 2 to 20.
14 Hierarchical 1 Ward linkage.
K: from 2 to 20.
15 Hierarchical 1 Median linkage.
K: from 2 to 20.
16 SOON 1 b: 1.
CE: 0.1.
CI: CE/N.
d0: from 1 to 10 (0.1 steps).
d1: 5 x d0.
17 SOON 1 b: 2.
CE: 0.1.
CI: CE/N.
d0: from 1 to 10 (0.1 steps).
d1: 5 x d0.
18 SOON 1 b: 0.5.
CE: 0.1.
CI: CE/N.
d0: from 1 to 10 (0.1 steps).
d1: 5 x d0.
19 SOON 1 b: 0.1.
CE: 0.1.
CI: CE/N.
d0: from 1 to 10 (0.1 steps).
d1: 5 x d0.

Section 5.1 presents and discusses the values of the clustering validation indices for the 19 experiments while giving a deeper insight into the nature of these experiments. Certain clustering methods and parameters are shortlisted out of the 19 options for further analysis. Section 5.2 analyzes the shortlisted experiments using the proposed method, "combined fuzzy partition matrix formulation and binarization", which is detailed in Chapter 4.

51
5.1. Clustering Validation Indices Results
As can be seen in Table ‎5.1 above, the 19 experiments are carried out such that they cover large
ranges of the number of clusters targeted by each of them. This is achieved by varying the suitable
parameter in the context of each of the methods. The details of the nature of the 19 experiments are
discussed in this section while presenting and analyzing the values of clustering validation indices
plotted against the varying parameter which controls the number of clusters.

For fair comparison, the results of each index are scaled by the same constant across all of the experiments. The values of the indices DB, CH, XB, I and DI are divided by the constants 1.5, 60, 2.5, 1 and 0.16 respectively.

5.1.1. K-Means Clustering Results


Figure ‎5.1 shows the results for the k-means clustering experiments (Experiments 1 to 5 as
numbered in Table 5.1). The first 3 experiments adopt stochastic initialization, so each of them was repeated 20 times. The last 2 adopt deterministic initialization methods, which makes 1 repetition sufficient. In terms of initialization, both experiments 1 and 3 use uniform random
initialization, experiment 2 uses sample initialization, experiment 4 uses one variant of the
deterministic Kaufman initialization and experiment 5 uses another variant of it. Within the
initialization of experiment 4, the most centred point is chosen as the closest data point to the mean of
the entire set of points. In experiment 5, it is chosen as the point with the minimum mean of distances
to the other points.
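The two variants of the first Kaufman seed can be sketched in MATLAB as follows (a minimal illustration written for this report, assuming X is the genes-by-samples matrix; it shows only the choice of the most centred point, not the full Kaufman procedure):

    % Experiment 4 variant: the data point closest to the mean of all points
    d4 = sum((X - repmat(mean(X, 1), size(X, 1), 1)).^2, 2);
    [~, seed4] = min(d4);

    % Experiment 5 variant: the point with the minimum mean distance to the other points
    D = squareform(pdist(X));        % pairwise Euclidean distances (Statistics Toolbox)
    [~, seed5] = min(mean(D, 2));    % the zero self-distance does not change the ordering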

Another parameter varied between these 5 k-means experiments is the strategy by which empty clusters are treated. In experiments 1, 2, 4 and 5, when an empty cluster appears at any iteration it is dropped; in many cases this results in a final number of clusters smaller than the one given to the algorithm initially. In experiment 3, empty clusters are not dropped; each of them is instead mapped to a single point which is the farthest from the other clusters. This strategy ensures that the same number of clusters is obtained as the one with which the algorithm is initialized.
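For reference, the stochastic setups correspond closely to options exposed by the Statistics Toolbox kmeans function; the following is a sketch only and not necessarily the exact implementation used in these experiments (each experiment here was additionally repeated 20 times in an outer loop, since the built-in 'Replicates' option would keep only the best run):

    K = 5;
    % Experiment 2 style: sample initialization, empty clusters dropped
    [idx2, C2] = kmeans(X, K, 'Start', 'sample',  'EmptyAction', 'drop');
    % Experiment 3 style: uniform random initialization, empty clusters replaced by singletons
    [idx3, C3] = kmeans(X, K, 'Start', 'uniform', 'EmptyAction', 'singleton');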

While analyzing these results, it is important to recall that for better clustering performance, the
indices DB and XB need to be minimized while the indices CH, I and DI need to be maximized. The
experiments were carried out over a varying number of clusters (K) ranging from 2 to 20.

It can be seen from the results of the 5 experiments that the best values for the indices are at K = 4. Biologists suggest that there are 5 clusters for these genes and some studies show 6 [3]. There is a significant difference between the index values at K = 4, 5 and 6, as shown in the figure, which raises questions about the accuracy of the validation indices and the clustering methods on one side, and about the other suggestions (K = 5 and K = 6) on the other.

52
Figure ‎5.1: Clustering validation indices values for k-means clustering experiments

A comparison between the 5 experiments shows that the validation indices for experiments 4 and 5, with the deterministic initialization, perform better than the first 3. In addition, deterministic initialization ensures that the level of performance does not vary between different runs of the algorithm over the same data set. This can be seen from the standard error bars in the first 3 plots, which clearly show the variation of the results across repetitions.

Another notable point is that the index DI shows very high standard deviation values compared with the other 4 indices. This can be explained by the way the index is calculated: the distance between any 2 clusters is taken as the distance between their closest points, and the diameter of any cluster as the distance between the two furthest points within it. This means that single points (which might be odd or noisy) dominate the value while the bulk of the points within the cluster is ignored. This kind of calculation makes small changes to the clusters lead to large changes in the index's value whenever those changes affect the boundary points, which happens frequently.
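A minimal sketch of this definition of DI (single-linkage inter-cluster separation divided by the maximum cluster diameter), written for illustration with hypothetical variable names:

    function di = dunn_index(X, idx)
    % X: n-by-p data matrix, idx: n-by-1 cluster labels
    labels = unique(idx);
    K = numel(labels);
    minSep = inf;  maxDiam = 0;
    for i = 1:K
        Xi = X(idx == labels(i), :);
        if size(Xi, 1) > 1
            maxDiam = max(maxDiam, max(pdist(Xi)));          % furthest pair inside the cluster
        end
        for j = i+1:K
            Xj = X(idx == labels(j), :);
            minSep = min(minSep, min(min(pdist2(Xi, Xj))));  % closest pair across clusters
        end
    end
    di = minSep / maxDiam;
    end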

The slight difference in implementation between experiments 4 and 5 does not change the results significantly; in fact, it changes nothing at all for K values lower than 6.

As mentioned before, experiment 3 never drops any empty cluster, which results in producing the
same number of clusters as requested. The other 4 experiments drop any empty cluster that they face
at any iteration which might lead to lower number of clusters as an outcome. Experimental results
have shown that by using deterministic initialization as in experiments 4 and 5 and sample
initialization, which is stochastic, as in experiment 2, all of the repetitions for all of the K values in the
tested range result in producing the same number of clusters as requested. The case is different for the
uniform random initialization in experiment 1. Figure ‎5.2 shows the number of clusters generated by
the algorithm against the requested number of clusters averaged over the 20 repetitions.

An obvious comparison between the requested number of clusters and the actual generated one shows that this initialization technique is not suitable for this data set. At K = 20, the generated number of clusters has a mean of about 15, which means that 5 clusters are dropped. In addition to this effect, the standard deviation is high, which shows that even the clusters that are dropped change significantly between repetitions.

Figure 5.2: The number of generated clusters against the number of requested clusters for k-means clustering with uniform random initialization

A logical inference from this is that the clusters in this problem are not distributed uniformly in the space. The doubts extend from this result to the results of experiment 3, which adopts the same initialization method but without dropping empty clusters. If the initialization is poor and assigns cluster centers to the wrong regions of the space, then handling empty clusters by converting them to singletons does not solve the problem. For example, handling 5 clusters out of 20 with this method and considering 25% of the clusters as special noisy cases is unacceptable.

54
To summarize and conclude these 5 experiments, experiments 1 and 3 are unreliable because of
the uniform random initialization technique adopted in them. Experiment 2 has better performance in
terms of the relation between the initialization technique and the generated number of clusters, but the
values of the validation indices for its results are poor and vary between runs. Experiments 4 and 5 show the best performance in terms of both the validation indices and consistency. However, these 2 experiments are very similar in implementation and results, which makes keeping both of them somewhat redundant.

5.1.2. Self-Organizing Maps (SOMs) Clustering Results

Figure ‎5.3: DB index values for the SOMs clustering experiments results

55
Experiments 6 to 9 cluster the gene set using different setups of the SOMs method. In each of these experiments, the 2D SOM neuron grid is varied from 1x2 to 10x10, as listed in Table 5.1. Experiments 6 and 7 were repeated 20 times and thus their plots are accompanied by error plots (standard deviation). For clarity, the error plots are presented separately from the value plots. Experiments 8 and 9 were executed only once, so they are not associated with error plots.
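For orientation only, a grid of this kind can be trained with the Neural Network Toolbox selforgmap function; this is a simplified sketch rather than the exact setup used here, since this built-in does not expose the bubble/Gaussian neighborhood choice adopted in these experiments:

    net = selforgmap([4 5], 100, 3, 'hextop', 'linkdist');  % 4x5 hexagonal grid
    net = train(net, X');             % the toolbox expects samples as columns
    assignments = vec2ind(net(X'));   % winning neuron (cluster) index for every gene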

For further clarity, each one of the 5 performance indices is plotted separately; otherwise the plots
would not be clear enough for analysis. Figure ‎5.3, Figure ‎5.4, Figure ‎5.5, Figure ‎5.6 and Figure ‎5.7
respectively show the DB, CH, XB, I and DI indices values for the SOMs experiments.

Figure ‎5.4: CH index values for the SOMs clustering experiments results

56
Figure ‎5.5: XB index values for the SOMs clustering experiments results

Experiments 6 and 7 adopt the batch mode training where the time needed for clustering (for
different grid sizes) ranges from about 0.4 to 2.1 seconds for experiment 6 and from about 0.2 to 1.1
seconds for experiment 7. An enormous jump in clustering time can be seen when these two
experiments are compared with experiments 8 and 9 which adopt the sequential mode training.
Clustering time for experiment 8 ranges from about 43.0 to 55.0 seconds and for experiment 9 from
about 38.0 to 55.0 seconds. For this reason, experiments 8 and 9 were not repeated 20 times for all of the 90 different grid sizes.

In terms of the neighborhood function definition, experiments 6 and 8 use the bubble neighborhood function while experiments 7 and 9 use the Gaussian one. By comparing the time needed for clustering under both definitions, it is clear that although the Gaussian neighborhood function is more complicated in its mathematical definition, it leads the SOM to converge faster than the bubble neighborhood.

A comparison between the DB and the XB values in Figure 5.3 and Figure 5.5 shows that they follow the same pattern; this is because of their closely related mathematical definitions. Recall that these two indices are minimized to obtain better performance.

Figure ‎5.6: I index values for the SOMs clustering experiments results

58
Figure ‎5.7: DI index values for the SOMs clustering experiments results

Compared with the 4 other indices, the DI index plots in Figure 5.7 show very high error (standard deviation) and very small differences between grid sizes. The index "I" also shows a notable pattern of monotonically decreasing values as the grid gets larger. This might indicate that a smaller number of clusters is better, but comparing this with the other indices and with the biological suggestions, that does not seem to be the case. As a result, the indices "DI" and "I" look less informative for this set of experiments and are not considered in the later discussion in this section.

59
For one of the grid sizes in experiment 8, the 5 indices all show the value of zero. This is a special value set by the code when all of the points are clustered into one cluster, because some of these indices (like DI) are not defined for such special cases. The inference is that this grid resulted in one empty cluster and one cluster containing the entire set of genes.

A closer look at Figure 5.3, Figure 5.4 and Figure 5.5 shows which grid sizes give the best index values, with a further size showing good results in some instances. This is reflected both by better values for the indices themselves and by the clearly lower error values.

Comparing the 4 SOMs’ experiments in the context of clustering performance indices’ values,
experiment 6 (batch mode training + bubble neighbourhood) shows the best performance. Considering
clustering time, experiment 7 has the fastest convergence with barely acceptable results. As a result,
both experiments 6 and 7 are nominated for further analysis and discussions later on in this project.

Figure ‎5.8: Clustering validation indices values for hierarchical clustering experiments

60
5.1.3. Hierarchical Clustering Results
Figure ‎5.8 shows the values of the clustering validation indices for the results of the experiments
10 to 15. This set of experiments applies hierarchical clustering over the genes with different linkage
strategies. The experiments 10, 11, 12, 13, 14 and 15 adopt single, complete, average, centroid, ward
and median linkage strategies respectively.

The clusters are identified from the hierarchical cluster tree by finding the horizontal cut that separates K or fewer clusters. In these experiments, the method never generated fewer clusters than requested.
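In MATLAB's Statistics Toolbox this cut can be expressed directly; the following is a sketch under the assumption that the built-in linkage and cluster functions are used (the 'maxclust' option returns K or fewer clusters, matching the behaviour described above):

    Z   = linkage(pdist(X), 'average');   % build the tree once (here with average linkage)
    T5  = cluster(Z, 'maxclust', 5);      % cut so that at most 5 clusters are separated
    T20 = cluster(Z, 'maxclust', 20);     % reuse the same tree for other values of K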

This method is deterministic, so the experiments were executed only once. In terms of clustering time, they are very fast. Most of the time is spent constructing the clustering tree, which means that the time taken for K = 2 and for K = 20 is almost the same. On average, the times taken for clustering by the 6 experiments, in order, are about 13, 8, 8, 26, 8 and 22 milliseconds.

In terms of clustering quality, experiments 10 and 15 show unpromising behaviours while the rest
have good performance signs at different values of K. Experiments 11 and 12 show good indices
values at K = 5. Experiments 11 and 14 show good indices values for K = 4 and experiment 13 shows
K = 6 as the optimum number of clusters.

Although all of them perform clustering in a very short time, experiments 11, 12 and 14 are the fastest with good outcomes. Experiment 13 is the slowest, but its clustering validation index values are promising. Consequently, nominating experiments 11, 12, 13 and 14 for further analysis seems a good choice at this point.

5.1.4. Self-Organizing Oscillator Networks (SOON) Clustering Results


Four clustering experiments (from 16 to 19) were carried out using the SOON clustering method
with different parameters. SOON is different from the previous methods in that the number of clusters
is not pre-specified by the user. A cluster's radius (d0) is the parameter that was varied, such that larger radii lead to smaller numbers of clusters, until d0 is large enough to swallow the entire gene set into one cluster.

The experiments differ from each other in the parameter (b), which is used as an exponent in the definition of the function that maps phase values to state values (see the SOON section of the literature review for more details). If (b) is greater than 1 the function is concave up, if it is less than 1 it is concave down, and if it is equal to 1 the function is a straight line. For experiments 16, 17, 18 and 19, the values b = 1, 2, 0.5 and 0.1 were used respectively.
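Assuming the mapping has the simple power form x = phi^b on [0, 1] (an assumption made here only to illustrate the concavity behaviour just described; the exact function is given in the literature-review chapter), its shape for the four values of b can be inspected as follows:

    phi = linspace(0, 1, 101);
    b   = [1 2 0.5 0.1];                 % the exponents used in experiments 16 to 19
    figure; hold on;
    for k = 1:numel(b)
        plot(phi, phi.^b(k));            % b > 1: concave up, b < 1: concave down, b = 1: a line
    end
    legend('b = 1', 'b = 2', 'b = 0.5', 'b = 0.1');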

The values of the radius d0 were varied from 1 to 10 with steps of 0.2. The generated numbers of
clusters for each of the trials are shown in Figure ‎5.9.

61
Figure ‎5.9: The generated number of clusters (K) versus the clusters' radii (d0) for the SOON clustering
experiments

As shown in Figure ‎5.9, when the radius (d0) is the smallest, each gene is clustered in its own
singleton cluster, resulting in 384 clusters. As (d0) is increased, smaller numbers of clusters are generated, until all of the genes are swallowed into a single cluster just after (d0) passes the value 6. The early range of (d0), where the number of clusters is very high, is not of interest to this project because the biologists do not suggest it as an interesting range for the current objectives. In this respect, the 4 experiments show similar profiles.

For this reason, a zoomed version of Figure 5.9 is plotted in Figure 5.10, showing the number of clusters generated by experiments 16 to 19 for the range of (d0) from 4 to 6.5; the clustering validation indices are then plotted against (d0) over the same range in Figure 5.11.

62
Figure ‎5.10: A zoomed version of Figure ‎5.9, the generated number of clusters (K) versus the clusters' radii (d0) for
the SOON clustering experiments in the range of d0 from 4.0 to 6.5

By looking at Figure ‎5.10, it is clear that there are more differences between the 4 experiments
with 4 different values of (b) in this zoomed range of interest.

Figure ‎5.11 shows the values of the 5 clustering validation indices plotted against the radius (d0) in
the range from d0 = 4.0 to 6.5. It can be seen that once all of the genes are swallowed by a single
cluster, the values of all of the indices are set to zero. As an initial, general observation, the best values of the indices appear when the number of clusters is 2. Although this might be correct from the statistical point of view, it is not what the biologists expect and lies outside the range of their interest. Still, as a recommendation for future work, it is worth analyzing the case of 2 clusters and moving gradually from it toward the case of 6 clusters while monitoring how the clusters split.

63
Figure ‎5.11: Clustering validation indices values for SOON clustering experiments

Figure ‎5.12 shows the time taken for clustering for the experiments 16 to 19 for the smaller range
of d0 from 4.0 to 6.5. The trend in the plots is clear: higher values of d0 (smaller numbers of clusters) lead to longer convergence times. The case where all of the genes are
swallowed by a single cluster shows a sudden jump such that the clustering time exceeds 50 seconds.
Although this method is not fast, it is not too slow for realistic numbers of clusters such as 4, 5 and 6
where clustering takes from 3 to 5 seconds.

Despite the clear differences in results between these 4 experiments, it is not obvious which of them is the best or the worst; thus, all 4 are retained for further discussion and analysis.

64
Figure ‎5.12: Time taken for clustering by the SOON clustering experiments for different values of the radius (d0)

5.1.5. Clustering Experiments Shortlist


After the discussions in the subsections ‎5.1.1 to ‎5.1.4, 11 out of the 19 experiments are shortlisted
for further analysis. The shortlist is summarized here and classified by the clustering method:

 K-means: Experiment 4.
 SOMs: Experiments 6 and 7.
 Hierarchical: Experiments 11, 12, 13 and 14.
 SOON: Experiments 16, 17, 18 and 19.

Of these experiments, only the parts where the number of clusters is less than 8 are considered.

65
5.2. Combined Fuzzy Partition Matrix and Binarization Analysis
The methodology described in ‎Chapter 4 is used to combine the results of the shortlisted group of
experiments listed in section ‎5.1.5 into one combined fuzzy partition matrix. The choice at this stage
of the project is to consider the biological suggestion of having 5 main clusters of genes that peak at
the 5 main cell-cycle stages; early G1, late G1, S, G2 and M.

The way in which the proposed method works allows exceptional genes that peak at more than one stage to be handled more comprehensively than by splitting them into separate clusters. This is due to the concepts of multiple assignment and un-assignment of genes.

The complete combined fuzzy partition matrix that resulted from combining the 11 shortlisted experiments is listed in Table A.1 of Appendix A and analyzed in this section.

The combined fuzzy partition matrix was subjected to 15 different binarization processes and produced different results in each. Some of the results are unrealistic extremes and are not examined in more detail later. Table 5.2 summarizes the results of each binarization process, followed by figures that illustrate the shape of the clusters resulting from each.

Table ‎5.2: Summary of the Binarization Experiments of the Combined Fuzzy Partition Matrix

Columns: No.; binarization technique; threshold (α); number of genes in each cluster by the stage of gene peaking, (1) Early G1, (2) Late G1, (3) S / G2, (4) Early G1 + G2 / M, (5) M; unassigned genes; multiple-assigned genes.
1 Intersection - 0 69 0 0 20 295 0
2 Union - 119 217 213 153 185 0 295
3 Max - 73 155 81 5 72 0 2
4 Value 0.3 76 160 89 11 82 0 34
5 Value 0.5 68 148 78 2 71 17 0
6 Value 0.7 61 130 61 0 65 67 0
7 Value 0.9 0 79 25 0 46 234 0
8 Difference (epsilon) 72 153 80 5 72 2 0
9 Difference 0.3 67 145 66 0 66 40 0
10 Difference 0.6 48 120 60 0 60 96 0
11 Difference 0.9 0 78 23 0 26 257 0
12 Top 0.05 73 157 86 8 75 0 15
13 Top 0.1 75 159 88 9 77 0 23
14 Top 0.2 78 162 88 17 79 0 30
15 Top 0.4 84 169 103 36 91 0 58
Biological Suggestion 67 135 75 (S) 52 (G2) 55 0 0

66
The first column of Table ‎5.2 shows the sequential number for the binarization experiment. The
second and the third columns detail the technique which is used in the experiment by respectively
showing the binarization technique name and the value of the threshold if needed. The fourth to the
eighth columns show the number of genes assigned to each of the 5 main clusters. The titles of these
columns contain the stage names at which each of these clusters peaks, the justification for this choice
is discussed later in this section. The ninth column shows the number of genes that are not assigned to
any of the clusters and the tenth column shows the number of genes that are assigned to multiple
clusters at the same time. The last row of the table shows the biological suggestion’s record.

Figure ‎5.13, Figure ‎5.15, Figure ‎5.18, Figure ‎5.21 and Figure ‎5.24 show the mean values for the 5
clusters as resulted by the 15 binarization experiments shown in Table ‎5.2. Figure ‎5.13 shows the
intersection and the union binarization results, Figure ‎5.15, Figure ‎5.18, Figure ‎5.21 and Figure ‎5.24
show the max, value thresholding, difference thresholding and top thresholding binarization
experiments’ results respectively.

Figure ‎5.13: Intersection and Union Binarization Clusters Means

67
5.2.1. Intersection and Union Binarization Results
A first observation from Figure 5.13, supported by a quick study of Table 5.2, is that the results of the union binarization technique are not very reliable. This is because most of the genes are assigned to multiple clusters without respecting the differences in the membership values between them. For this theoretical reason, added to the empirical results in the table, this binarization technique is not considered further in this research.

The result of the intersection binarization is interesting; although it gives a good view of neither the entire gene set nor all of the clusters, it gives a good insight into the purer clusters and genes. By consensus, 69 genes are assigned to cluster 2, which peaks at late G1, and 20 genes are assigned to cluster 5, which peaks at M. No consensus occurred for 295 genes. The full lists of the genes assigned to clusters by consensus are given in Appendix B, Table B.1.

In a comparison between the clusters to which the genes are assigned in this experiment and the biological suggestions, 58 out of the 69 genes assigned to cluster 2 match the biological suggestion, while the remaining 11 genes are assigned to cluster 3 according to biology. All of the 20 genes assigned by consensus to cluster 5 match the corresponding biological suggestion (Table B.1 in Appendix B).

Figure 5.14 plots the 11 genes whose biological suggestion does not match the consensus of the clustering experiments in this project.

Figure ‎5.14: The 11 genes that are assigned to different clusters by the biological
suggestion and the intersection binarization.

68
It can be seen in Figure 5.14 that 10 out of these 11 genes clearly peak at time point 3, which is in the late G1 stage. The one which does not peak at that point still has a relatively high value there. Most of these genes peak again at time point 11, which is again in late G1, and at time point 12, which is at the beginning of the S stage. These characteristics (a peak at 3 with shorter peaks at 11 and 12) are typical of cluster 2. Thus, the consensus (intersection) binarization produces well-justified results.

5.2.2. Max Binarization Results

Figure ‎5.15: Max Binarization Clusters Means


Figure ‎5.15 shows the mean gene patterns for the results of the max binarization method. The
number of genes in each cluster are listed in Table ‎5.2 above.

The max binarization technique does not leave any gene unassigned and keeps multiple assignments to a minimum; a gene is assigned to more than one cluster only when several clusters share the same maximum membership value. Figure 5.15 gives a good sample of the actual visual patterns of the 5 clusters produced in this project. Table B.2 in Appendix B lists the names of the genes assigned to each of the 5 clusters according to this experiment.
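A minimal sketch of this rule, assuming U denotes the 384-by-5 combined fuzzy partition matrix (the variable name is illustrative, not taken from the project code):

    topValue = max(U, [], 2);                       % the maximum membership of each gene
    B = (U == repmat(topValue, 1, size(U, 2)));     % every cluster sharing that maximum is kept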

The most questionable pattern is that of the 4th cluster. The doubts come from two observations: the first is that the number of genes belonging to this cluster is comparatively small (5 genes in this experiment), and the second is that it peaks highly at G1 at time point 2, peaks again in the region of time points 7, 8 and 9, and peaks a third time in the region of time points 15, 16 and 17.

69
Figure ‎5.16: Genes of the 4th cluster as a result of the max binarization technique

Figure ‎5.16 shows the 5 genes whose maximum membership values point to the 4th cluster. It is
obvious that these 5 genes share the main characteristics of the 4th cluster mentioned above. Even so, they differ in the time point at which they peak at early G1 and in the time points at which they peak between G2 and M in the two cycles.

Figure ‎5.17: The genes that are assigned to multiple clusters in the max binarization technique

70
As the max binarization technique shows, there are 2 genes that are assigned to multiple clusters at
the same time. These genes, which are plotted in Figure ‎5.17, have the indices: 3 and 209, and their
names are: YER111c and YGR140w. Gene 3 is assigned to the clusters 1 and 2 by this technique and
is biologically assigned to cluster 1. As can be seen in Figure 5.17, gene 3 peaks twice, at time points 3 and 10, i.e. at late G1 in the first cycle and at early G1 in the second cycle. These two peaks appear clearly in the second and the first clusters respectively. This multiple assignment gives a good biological insight into this gene, whose peaks appear at a different stage of the cell-cycle each time. Gene 209, which is assigned to clusters 2 and 3 by this experiment and to cluster 3 by biologists, shows peaks at time points 3, 6 and 14 and near-peak values at time points 4, 5 and 13. Peaking high at 3 (late G1) is a strong sign of belonging to the 2nd cluster, while most of the other peaks and near-peaks of this gene are signs of belonging to the 3rd cluster.

As a conclusion of this discussion, both multiple assignments are justified when a deeper look is taken into the gene profiles.

5.2.3. Value Thresholding Binarization Results


Figure ‎5.18 shows the mean gene patterns for the clusters that were generated by the value
thresholding binarization technique. The figure contains 4 sub-plots that reflect the results when α is 0.3, 0.5, 0.7 and 0.9. By referring to Table 5.2, it is seen that the number of genes belonging to each of the clusters in the case of α = 0.3 is slightly higher than its corresponding value in the case of max binarization. This results in 34 genes being assigned to multiple clusters, which is discussed in more detail in this section.

Once the value of α was increased to 0.5, no more multiple assignments appeared, but 17 genes were not assigned to any of the 5 clusters. This case is also discussed in more detail in this section. As α is increased further, more genes move to the unassigned group.
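A minimal sketch of value-thresholding binarization, under the same assumption that U is the 384-by-5 combined fuzzy partition matrix (names are illustrative, not the project code):

    alpha = 0.3;
    B = U >= alpha;                          % assign a gene to every cluster whose membership reaches alpha
    multiAssigned = find(sum(B, 2) > 1);     % genes claimed by more than one cluster
    unassigned    = find(sum(B, 2) == 0);    % genes left without any cluster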

The first point to be discussed in this section is the 4th cluster behaviour at different binarization
conditions. At α = 0.3, 11 genes are assigned to this cluster and only 2 of them are assigned to it distinctly (genes 67 “YLR015w” and 291 “YLR014c”). When the threshold is increased to 0.5, these 2 genes are the only surviving genes in this cluster. For the two higher values of α, this cluster is empty. This cluster is mainly characterized by a high peak at the first or the second time point (in G1) and two more low peaks that are loosely distributed among the stages S, G2 and M in both cycles.

It is worth considering the 4th cluster in the case of α = 0.3 here (the cluster assignments are listed in Table B.3 in Appendix B), as it groups 11 genes compared with only 5 in the previous case of max binarization. Figure 5.19 plots the 11 genes while highlighting one odd-looking gene with a thick red line.

71
Figure ‎5.18: Value Thresholding Binarization Clusters Means

72
Figure ‎5.19: Cluster 4 genes - value thresholding binarization (alpha = 0.3)

As can be seen in Figure ‎5.19, all of the genes that belong to the 4th cluster have relatively high
peaks at the first or the second time point, except the highlighted gene. They also show lower peaks at other time points, which is why most of these genes are assigned to other clusters at the same time. 2 genes are assigned purely to this cluster, 1 gene is also assigned to the 2nd cluster, 5 genes are also assigned to the 3rd cluster and 3 genes are also assigned to the 5th cluster.

The highlighted gene is gene number 82, named “YNL225c”. This gene is assigned to this cluster (cluster 4) and to cluster 2 at the same time; it is the only gene assigned to both of these clusters, which indicates its oddity. The biological classification of this gene is into the 2nd cluster. The gene has a high peak at the 12th time point in stage S and a lower peak at the 3rd time point in late G1. These two peaks often appear in the members of the 2nd cluster. One more peak of this gene is at the 14th time point, which is within stage G2. Peaking at the aforementioned time points justifies this gene's tendency toward the 2nd cluster, but the relative heights of its peaks over the 2 cell-cycles are the opposite of what the average cluster 2 gene shows: it has low levels in the first cycle and high levels in the second. Although this weakens the gene's membership in the 2nd cluster, it does not justify its tendency toward the 4th cluster. This gene is preferably assigned exclusively to cluster 2, with special treatment.

34 genes are assigned to multiple clusters when the membership value threshold is set to 0.3. None
of these 34 genes is assigned to more than 2 clusters, and the 34 genes show 7 different combinations
of clusters pairs. 4 genes are assigned to clusters 1 and 2, 2 genes are assigned to clusters 1 and 5, 7
genes are assigned to clusters 2 and 3, 1 gene is assigned to clusters 2 and 4, 5 genes are assigned to
clusters 3 and 4, 12 genes are assigned to clusters 3 and 5 and 3 genes are assigned to clusters 4 and 5.

73
Figure ‎5.20 plots the genes that belong to each of these 7 cases. It can be seen in these figures that
many of the plotted genes show peaks at different time points which belong to different clusters. In
some of the cases, a gene clearly tends toward one of the two clusters to which it is assigned; this is expected because of the relatively low threshold value (0.3). Even so, this experiment gives a deeper insight into some genes that really do show peaks in multiple stages.

Figure ‎5.20: Multiple-assigned genes in value thresholding binarization

74
From the disappearance of the 4th cluster at α = 0.7 and of the 1st cluster at α = 0.9, the relative purity and distinction of the clusters can be inferred. If α reached the value of 1, this technique would become identical to the intersection binarization discussed in section 5.2.1; in that case the 3rd cluster is missed as well, which adds to the observations made here.

5.2.4. Difference Thresholding Binarization Results


Figure ‎5.21 shows the mean patterns for the genes that are mapped to each of the 5 clusters
according to the difference thresholding binarization technique. 4 difference thresholds were used to produce the 4 sub-plots in the figure, at α = ε, 0.3, 0.6 and 0.9.

The case of α = ε, where ε is an arbitrarily small number (in practice the smallest number that can be represented by the hardware in hand), is the same as the max binarization technique except that if two clusters share the same maximum membership value for a certain gene, then the difference is zero, it falls below the threshold and the gene is not assigned to any cluster. The results show that 2 genes have this property, and they are the same 2 genes that were assigned to multiple clusters in max binarization. Studying this case cannot add more information than what has been discussed in section 5.2.2.

As the difference threshold is increased, only the genes that are distinctly assigned to one cluster, with a margin over the closest competitor larger than the threshold, keep their cluster assignments. If a gene is confused between more than one cluster with an insignificant difference (i.e. a difference smaller than the threshold), the gene is not assigned to any of the clusters. At α = 0.3 the 4th cluster disappeared, and at α = 0.9 the 1st cluster disappeared. This observation again gives an indication of the purity and distinction of each of the clusters.
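A sketch of the difference-thresholding rule under the same assumed variable names (again illustrative only):

    alpha = 0.3;
    [sortedU, order] = sort(U, 2, 'descend');    % memberships of each gene, largest first
    gap = sortedU(:, 1) - sortedU(:, 2);         % margin between the winner and the closest competitor
    B = false(size(U));
    winners = sub2ind(size(U), (1:size(U, 1))', order(:, 1));
    B(winners) = gap >= alpha;                   % assign only when the winner is clear; otherwise leave unassigned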

Table B.4 in Appendix B lists the genes assigned to each of the 5 clusters and the genes left unassigned for the case of α = 0.3. In this case, 40 genes are unassigned, and a deeper look into these genes might be beneficial. Figure 5.22 and Figure 5.23 plot the patterns of the 40 genes against the 17 time points.

Most of the genes shown in these 2 figures bear peaks at different stages, which leaves them confused between multiple clusters without a clear membership in one of them by a margin of at least 0.3 over the closest competitor.

Some of the genes have odd patterns, such as gene 297 (Figure 5.23), which shows only one high peak at time point 16 in stage M of the second cycle and a very shallow hill at the 5th and 6th time points between stages S and G2 of the first cycle.

75
Figure ‎5.21: Difference Thresholding Binarization Clusters Means

76
Figure ‎5.22: The first 20 genes (out of 40) that are unassigned by difference binarization at alpha = 0.3

Another observation about these 40 genes is that many of them follow the pattern of the 4th cluster, since this cluster disappeared at this value of α. Some of the genes that show this pattern in one way or another, as filtered by manual selection, are those with the indices 31, 189, 260, 291, 298, 309 and 368. Most of these genes were included in the 4th cluster in the previous experiments.

Some of these 40 genes show a plateau shape, peaking at multiple consecutive data points over more than one stage instead of showing sharp peaks at specific data points. Examples of genes with this property are those with the indices 125, 203, 209, 219, 243 and 342. The first 3 of these genes show plateaus extending from the 3rd time point to the 6th, which covers the stages from late G1 through S to the beginning of G2. Genes 125 and 219 show low plateaus starting at the 12th time point in stage S and degrading slowly toward the 16th data point in the M stage after passing through G2. These plateaus prevented these genes from being assigned to distinct clusters.

77
Figure ‎5.23: The last 20 genes (out of 40) that are unassigned by difference binarization at alpha = 0.3

5.2.5. Top Thresholding Binarization Results


Figure ‎5.24 shows the mean gene patterns for the 5 clusters produced by the top thresholding
binarization technique. The 4 sub-plots show the results for the 4 different thresholds that were used
which are 0.05, 0.1, 0.2 and 0.4. The nature of this technique means that higher threshold values swallow more genes into the clusters, since they widen the window of permissible cluster assignment.

This technique can be considered a relaxed version of the max binarization technique, where memberships next to the maximum are accepted together with the maximum if they are close enough to it. For this reason, the results here show gradual modifications of the max binarization results as the threshold is increased.
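A sketch of the top-thresholding rule with the same assumed names (U as the combined fuzzy partition matrix; illustrative only):

    alpha = 0.1;
    topValue = max(U, [], 2);                            % the maximum membership of each gene
    B = U >= repmat(topValue - alpha, 1, size(U, 2));    % accept every cluster within alpha of the maximum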

78
Figure ‎5.24: Top Thresholding Binarization Clusters Means

79
A very interesting point in this case study is the visual comparison between the clusters’ means in
the 4 sub-plots in Figure ‎5.24. As more and more genes are swallowed by the clusters (going from the
lowest to the highest thresholds), clusters 1, 2, 3 and 5 keep their general patterns essentially unchanged, while the 4th cluster (again) changes dramatically as it accepts more genes.

The case of α = 0.4 is expected to be far from perfect, since multiple assignments are made even when the difference between the maximum and the next membership value is large. In this case, 9 genes are assigned to all 5 clusters at the same time; they are the genes with the indices 3, 64, 82, 143, 176, 178, 293, 294 and 368. As expected, the highest membership value of each of these 9 genes rarely reaches 0.4, and for most of them it is even less than that, which allows even a membership value of 0 to be considered sufficient for cluster assignment. These 9 genes are plotted in Figure 5.25.

Figure ‎5.25: The 9 genes that are assigned to the 5 clusters by the top thresholding
binarization with alpha = 0.4

It is easy to notice from the figure that each of these 9 genes peaks at multiple stages, in such a way that none of the 5 clusters' patterns dominates with a membership value higher than about 0.4.

The case of α = 0.05 is interesting for analysis. It shows that for 15 genes the competitor(s) of the maximum membership value lie so close to it that it might not be accurate to assign the gene to the maximum alone. Table B.5 in Appendix B lists the genes assigned to each of the 5 clusters by this method at this threshold.

Figure ‎5.26 shows the plots of the 15 genes that are assigned to more than one cluster at the same
time. It can be seen that none of the genes was assigned to more than 2 clusters.

80
Figure 5.26: The 15 genes that are assigned to more than one cluster by the top thresholding binarization at α = 0.05

Many of the genes that are shown in Figure ‎5.26 have been noticed before as outliers. These genes,
in most of the cases, have peaks in multiple stages at the same time such that they are almost equally
assigned to more than one cluster.

A final interesting observation derived from this section as a whole is that the 3rd cluster produced in this research combines the genes that show peaks at S and G2. By looking at these genes' profiles, it was noticed that most of them have peaks in both S and G2, either in the same cycle or in different cycles. The biological suggestion of having 2 separate clusters for S and G2 was therefore not reflected in these results. On the other hand, the 4th cluster generated in this research is not considered a distinct cluster by the biologists, even though it clearly shows a different pattern from all of the others. Although the manual investigation inside the 3rd cluster justifies the outcomes of this project, it is recommended that retaining or splitting this cluster be considered in future work.

81
Chapter 6. Conclusions
The current progress of this project, which considers analyzing two microarray data sets, conforms
to the designed time plan to an acceptable level. Although the original plan indicates that analyzing
the first data set should have been finished, and analyzing the second data set should start
immediately, the emergence of a new promising method to combine the results of different clustering
experiments caused a slight, but justified, postponement.

In the case of the first data set, the University of Stanford yeast cell-cycle data set, the Kaufman deterministic initialization technique for k-means clustering outperforms the stochastic techniques; batch mode learning for SOM clustering gives similar results to the sequential mode with much less running time; for hierarchical clustering, the complete, average, centroid and Ward linkage techniques perform better than the single and median techniques; and SOON clustering gives good results without large variations when the parameter (b) is tuned, provided the correct cluster radius is used.

Although the biological suggestion for the optimal number of clusters for the yeast data set is 5 by
assigning each gene to the cluster which represents the cell-cycle stage at which it shows a peak, it is
hard to state the actual optimal number of distinct clusters. This is due to the fact that many genes
show peaks at multiple stages and many others show odd profiles.

The biological suggestion of 5 clusters can be claimed to be correct while considering loose
clustering rather than distinct clustering. The practice of using loose clustering (i.e., allowing some
genes to be assigned to multiple clusters at the same time or to be left unassigned) views the
clustering results at a deeper level and paves the way for further investigation of the outlier genes.

The new proposed method, which is the “combined fuzzy partition matrix formulation and
binarization”, achieves loose clustering in a very comprehensive way by viewing the combined
clustering results through different binarization techniques to identify the outlier genes in addition to
the pure clusters.

The three clusters whose genes peak at the stages early G1, late G1 and M are the purest clusters.
Most of the genes that are biologically suggested to belong to the stages S and G2 show peaks at both
stages at the same time, either in the same cell-cycle or in different cell cycles. These two clusters, as
a result of this project, are found combined in one cluster called S / G2. A few genes (fewer than 15) constitute the most questionable cluster. This cluster is characterized by very high peaks at early to mid G1 in addition to two more low peaks at G2 and/or M.

82
Chapter 7. References
1. A Genome-Wide Transcriptional Analysis of the Mitotic Cell Cycle. R. J. Cho, M. J. Campbell,
E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D.
Landsman, D. J. Lockhart and R. W. Davis. 1998, Molecular Cell, Vol. 2, pp. 65–73.

2. Performance of an Ensemble Clustering Algorithm on Biological Data Sets. H. Pirim, D.


Gautam, T. Bhowmik, A. D. Perkins, B. Ekşioglu and A. Alkan. 1, 2011, Journal of Mathematical
and Computational Applications, Vol. 16, pp. 87–96.

3. Model-Based Clustering and Data Transformations for Gene Expression Data. K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery and W. L. Ruzzo. 2001, Bioinformatics, Vol. 17, pp. 977–987.

4. A new multiple regression approach for the construction of genetic regulatory networks. Shu-
Qin Zhang, Wai-Ki Ching, Nam-Kiu Tsing, Ho-Yin Leung, Dianjing Guo. 2010, Artificial
Intelligence in Medicine, Vol. 48, pp. 153-160.

5. Kung, S.Y. and M.W. Mak. Chapter 1: Feature Selection for Genomic and Proteomic Data
Mining. [book auth.] & J. C. Rajapakse Y.Q. Zhang. Machine Learning in Bioinformatics. New
Jersey : John Wiley & Sons, Inc., 2009, pp. 1-46.

6. Weeraratna, A.T. and Taub, D.D. Chapter 1: Microarray Data Analysis: An Overview of
Design, Methodology and Analysis. [book auth.] M.J. Korenberg. Microarray Data Analysis:
Methods and Applications. New Jersey : Humana Press Inc., 2007, pp. 1-16.

7. Comparison of microarray pre-processing methods. Shakya K, Ruskin HJ, Kerr G, Crane M,


Becker J. s.l. : Springer, 2009. Advances in Computational Biology, AEMB. Vol. 680.

8. Effect of Pre-processing methods on Microarray-based SVM classifiers in Affymetrix GeneChip.


J.P. Florida, H. Pomares, I. Rojas, J.M. Urquiza, L.J. Herrera and M.G. Claros. Barcelona :
s.n., 2010. The 2010 International Joint Conference on Neural Networks (IJCNN). pp. 1-6.

9. Wang, Z. and Palade, V. Chapter 5: Fuzzy Gene Mining: A Fuzzy-Based Framework for
Cancer Microarray Data Analysis. [book auth.] & J. C. Rajapakse Y.Q. Zhang. Machine Learning in
Bioinformatics. New Jersey : John Wiley & Sons, Inc, 2009, pp. 111-134.

10. Gene selection for cancer classification using support vector machines. Guyon, I.,Weston, J.,
Barnhill, S., and Vapnik. 2002, Machine Learning, Vol. 46, pp. 389-422.

11. Gene Selection for Tumor Classification Using Microarray Gene Expression Data. K.
Tendrapalli, R. Basnet, S. Mukkamala and A.H. Sung. 2007. Proceedings of the World Congress
on Engineering. Vol. 1, pp. 290-295.

83
12. Investigation of Self-Organizing Oscillator Networks for Use in Clustering Microarray Data.
S.A. Salem, L.B. Jack, and A.K. Nandi. 1, 2008, IEEE Trans. Nanobioscience, Vol. 7, pp. 65-79.

13. Gene Clustering Using Self-Organizing Maps and Particle Swarm Optimization. Xiang Xiao,
Ernst R. Dow, Russell Eberhart, Zina Ben Miled and Robert J. Oppelt. 2003. IEEE Parallel and
Distributed Processing Symposium Proceedings. p. 154.

14. A Comprehensive Fuzzy-Based Framework for Cancer Microarray Data Gene Expression
Analysis. Wang, Z. and Palade, V. 2007. Proceedings of the 7th IEEE International Conference on
Bioinformatics and Bioengineering (BIBE 2007). pp. 1003-1010.

15. Microarray Gene Selection Using Self-Organizing Map. S. Vanichayobon, S. Wichaidit and
W. Wettayaprasit. Beijing : s.n., 2007. Proceedings of the 7th WSEAS International Conference on
Simulation, Modelling and Optimization. pp. 239-244.

16. Cloning and screening of sequences expressed in a mouse colon tumor. Augenlicht, L. H. and
Kobrin, D. 1982, Cancer Res., Vol. 42, pp. 1088-1093.

17. Expression of cloned sequences in biopsies of human colonic tissue and in colonic carcinoma
cells induced to differentiate in vitro. Augenlicht, L. H., Wahrman, M. Z., Halsey, H., Anderson,
L., Taylor, J., and Lipkin, M. 22, 1987, Cancer Res., Vol. 47, pp. 6017-6021.

18. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.
Schena M, Shalon D, Davis RW, Brown PO. 1995, Science, Vol. 270, pp. 467–470.

19. Yeast microarrays for genome wide parallel genetic and gene expression analysis. Lashkari
DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO, Davis RW. 1997.
Proc Natl Acad Sci USA. Vol. 94, pp. 13057–13062.

20. Initial sequencing and analysis of the human genome. Lander, E. S., Linton, L. M., Birren,
B., et al. 2001, Nature, Vol. 409, pp. 860–921.

21. Expression monitoring by hybridization to high-density oligonucleotide arrays. Lockhart DJ,


Dong H, Byrne MC, et al. 1996, Nat Biotechnol, Vol. 14, pp. 1675-1680.

22. Affymetrix GeneChip official website. [Online] http://www.affymetrix.com.

23. LS Bound based gene selection for DNA microarray data. Mao, Xin Zhou and K.Z. 8, 2005,
Bioinformatics, Vol. 21, pp. 1559-1564.

24. Colon cancer prediction with genetics profiles using evolutionary techniques. A. Kulkarni, N.
Kumar, V. Ravi, U.S. Murthy. 2011, Expert Systems with Applications, Vol. 38, pp. 2752 – 2757.

84
25. Variable selection from random forests: application to gene expression data. Andres, R. Diaz-
Uriate and S.A. de. Barcelona : s.n., 2004. Proceedings of the 5th Annual Spanish Bioinformatics
Conference. pp. 47-53.

26. ROKU: a novel method for identification of tissue-specific genes. K. Kadota, J. Ye, Y. Nakai,
T. Terada and K. Shimizu. 2006, BMC Bioinformatics, Vol. 7, p. 294.

27. Feature dimension reduction for microarray data analysis using locally linear embedding. Shi,
C. and Chen, L. 2005. The Asia Pacific Bioinformatics Conference. pp. 211–217.

28. Welsch, R.S. Menjoge and R.E. Chapter 2: Comparing and Visualizing Gene Selection and
Classification Methods for Microarray Data. [book auth.] & J. C. Rajapakse Y.Q. Zhang. Machine
Learning in Bioinformatics. New Jersey : John Wiley & Sons, Inc., 2009, pp. 47-68.

29. Prediction of breast cancer prognosis using gene set statistics provides signature stability and
biological context. G. Abraham, A. Kowalczyk, S. Loi, I. Haviv and J. Zobel. 2010, BMC
BioInformatics, Vol. 11, p. 277.

30. Analysis of large-scale gene expression data. Sherlock, G. 2000, Current Opinion in
Immunology, Vol. 12, pp. 201–205.

31. S. Pang, I. Havukkala, Y. Hu, and N. Kasabov. Chapter 4: Bootstrapping Consistency


Method for Optimal Gene Selection from Microarray Gene Expression Data for Classification
Problems. [book auth.] & J. C. Rajapakse Y.Q. Zhang. Machine Learning in Bioinformatics. New
Jersey : John Wiley & Sons, Inc., 2009, pp. 89-110.

32. Yang, G.Z. Li and J.Y. Chapter 6: Feature Selection for Ensemble Learning and Its
Application. [book auth.] & J. C. Rajapakse Y.Q. Zhang. Machine Learning in Bioinformatics. New
Jersey : John Wiley & Sons, Inc., 2009, pp. 135-156.

33. Kohonen, T. Ed. Self-Organizing Maps. New York : Springer-Verlag, 1997.

34. A Parameter in the Learning Rule of SOM That Incorporates Activation Frequency. Neme,
Antonio and Miramontes, Pedro. 2006. Proceedings of ICANN. Vol. 4131, pp. 455-463.

35. Interpreting patterns of gene expression with self-organizing maps: Methods and application
to hematopoietic differentiation. P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E.
Dmitrovsky, E.S. Lander and T.R. Golub. 1999. Proc. Natl. Acad. Sci. Vol. 96, pp. 2907-2912.

36. Analysis and visualisation of gene expression data using self-organizing maps. J. Nikkila, P.
Toronen, S. Kaski, J. Venna, E. Castren, and G. Wong. 2002, Neural Networks, Vol. 15, pp. 953–
966.

85
37. Haykin, Simon. Neural Networks – A Comprehensive Foundation. 2nd Edition. Singapore : Pearson, Prentice Hall, 1999.

38. Self-organization of pulse-coupled oscillators with application to clustering. M. B. H. Rhouma and H. Frigui. 2, 2001, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 23, pp. 1–16.

39. Performance evaluation of some clustering algorithms and validity indices. U. Maulik and S. Bandyopadhyay. 12, 2002, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 24, pp. 1650–1654.

40. An empirical comparison of four initialization methods for the K-Means algorithm. J.M. Pena,
J.A. Lozano and P. Larranaga. 10, 1999, Pattern Recognition Letters, Vol. 20, pp. 1027–1040.

41. Deriving quantitative conclusions from microarray expression data. Olshen, A. B. and Jain, A. N. 7, 2002, Bioinformatics, Vol. 18, pp. 961–970.

42. Cluster analysis and display of genome-wide expression patterns. M. B. Eisen, P. T.


Spellman, P. O. Brown and D. Botstein. 1998. Proc. Natl. Acad. Sci. Vol. 95, pp. 14863–14868.

43. Quantification in functional magnetic resonance imaging: fuzzy clustering vs. correlation
analysis. Baumgartner, R.,Windischberger, C., and Moser, E. 2, 1998, Magnetic Resonance
Imaging, Vol. 16, pp. 115–125.

44. NIFTI: An evolutionary approach for finding number. Sudhakar Jonnalagadda and Rajagopalan Srinivasan. 1, 2009, BMC Bioinformatics, Vol. 10, p. 40.

45. Gene selection and classification of microarray data using random forest. Diaz-Uriate, R.
and Andres, S.A. de. 1, 2006, BMC Bioinformatics, Vol. 7, p. 3.

46. Support-Vector Networks. Corinna Cortes and Vladimir Vapnik. 3, 1995, Machine Learning Journal, Vol. 20, pp. 273-297.

47. Knowledge-based analysis of microarray gene expression data by using support vector
machines. Brown, M. P., Grundy, W. N., Lin, D., et al. 2000. Proc. Natl. Acad. Sci. Vol. 97, pp.
262–267.

48. Classification and diagnostic prediction of cancers using gene expression profiling and
artificial neural networks. Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M.,
Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S.
2001, Nature Medicine, Vol. 7, pp. 673–679.

49. Neural Networks: A General Framework for Nonlinear Function Approximation. Fisher,
Manfred M. 2006, Transactions in GIS, Vol. 10, p. 521.

86
Appendix A. Combined Fuzzy Partition Matrix Results
Table ‎A.1 shows the combined fuzzy partition matrix for the 11 shortlisted experiments listed in
section 5.1.5 at K = 5, where K is the number of clusters. For convenience, the matrix is transposed.
The first column shows the number (index) of the gene with respect to this set of 384 genes, the
second column shows the name of the gene, the third column shows the biological suggestion for the
cluster to which the gene belongs and the fourth to the eighth columns show the membership values
for the gene in the 5 clusters as resulted in this research.

For some genes, the membership values in the 5 clusters sum up to 0.99 or 1.01. This is due to the
accumulated error caused by rounding the results to 2 decimal digits.

The methodology of constructing this matrix from single clustering experiments’ results is
explained in section ‎4.2. This matrix is analyzed and discussed in section ‎5.2.

Table ‎A.1: Combined Fuzzy Partition Matrix for the 11 Shortlisted Experiments

Columns: No.; gene name; biologically suggested cluster; membership values in clusters 1, 2, 3, 4 and 5 as produced by this research.
1 YDL179w 1 0.83 0 0.03 0 0.14
2 YLR079w 1 0.83 0 0.03 0 0.14
3 YER111c 1 0.35 0.35 0.03 0.27 0
4 YBR200w 1 0.08 0 0.03 0 0.89
5 YJL194w 1 0.55 0.08 0.12 0.05 0.21
6 YLR274w 1 0.80 0 0.03 0 0.17
7 YBR202w 1 0.17 0 0.03 0 0.80
8 YPR019w 1 0.44 0 0.03 0 0.53
9 YBL023c 1 0.02 0 0 0 0.98
10 YEL032w 1 0.03 0 0 0 0.97
11 YGR044c 1 0.79 0.05 0.03 0 0.14
12 YML109w 1 0.05 0.88 0.03 0 0.05
13 YJL157c 1 0.83 0 0.03 0 0.14
14 YKL185w 1 0.83 0 0.03 0 0.14
15 YHR005c 1 0.83 0 0.03 0 0.14
16 YNR001c 1 0.83 0 0.03 0 0.14
17 YKL150w 1 0.83 0 0.03 0 0.14
18 YLR395c 1 0.76 0 0.03 0 0.21
19 YOR065w 1 0.76 0 0.03 0 0.21
20 YDL181w 1 0.83 0 0.03 0 0.14
21 YGR183c 1 0.83 0 0.03 0 0.14
22 YLR258w 1 0.76 0.08 0 0 0.17
23 YML110c 1 0.79 0.05 0.03 0 0.14
24 YLR273c 1 0.83 0 0.03 0 0.14
25 YCR005c 1 0.76 0 0 0.08 0.17
26 YCL040w 1 0.80 0 0.03 0 0.17
27 YMR256c 1 0.78 0 0.03 0 0.19
28 YIL009w 1 0.79 0.05 0.03 0 0.14
29 YLL040c 1 0.83 0 0.03 0 0.14

87
30 YNR016c 1 0.05 0.92 0.03 0 0
31 YBR067c 1 0.03 0.26 0.17 0.55 0
32 YPL058c 1 0.14 0 0.03 0 0.83
33 YGL055w 1 0.05 0.95 0 0 0
34 YGR281W 1 0.83 0 0.03 0 0.14
35 YBR083w 1 0.79 0.05 0.03 0 0.14
36 YBR054w 1 0.79 0 0.03 0 0.18
37 YKL116c 1 0.79 0.05 0.03 0 0.14
38 YPR002w 1 0.83 0 0.03 0 0.14
39 YNR067c 1 0.76 0.08 0.03 0 0.14
40 YBR158w 1 0.79 0.05 0.03 0 0.14
41 YDL117w 1 0.79 0.05 0.03 0 0.14
42 YGR035c 1 0.14 0 0.03 0 0.83
43 YHL026c 1 0.19 0.03 0.03 0 0.75
44 YMR007w 1 0.76 0 0.03 0 0.21
45 YMR254c 1 0.18 0 0.03 0 0.79
46 YNL046w 1 0.76 0.08 0.03 0 0.14
47 YOR264w 1 0.79 0.05 0.03 0 0.14
48 YPL066w 1 0.76 0.08 0.03 0 0.14
49 YBR052c 1 0.83 0 0.03 0 0.14
50 YPL158c 1 0.83 0 0.03 0 0.14
51 YHR022c 1 0.83 0 0.03 0 0.14
52 YPL004c 1 0.79 0.05 0.03 0 0.14
53 YBR157c 1 0.83 0 0.03 0 0.14
54 YNL078w 1 0.79 0.05 0.03 0 0.14
55 YOR066w 1 0.83 0 0.03 0 0.14
56 YMR031c 1 0.15 0 0.03 0 0.82
57 YBR053c 1 0.76 0 0.03 0 0.21
58 YDR511w 1 0.83 0 0.03 0 0.14
59 YLR254c 1 0.05 0 0 0 0.95
60 YDR033w 1 0.83 0 0.03 0 0.14
61 YKL163w 1 0.79 0 0.03 0 0.18
62 YBR231c 1 0.72 0.05 0.03 0.07 0.14
63 YDR368w 1 0.76 0.05 0.03 0.03 0.14
64 YLR050c 1 0.39 0.24 0 0.27 0.09
65 YLR049c 1 0.72 0.15 0 0.04 0.09
66 YOR273c 1 0.32 0 0.03 0 0.65
67 YLR015w 1 0 0.08 0.09 0.41 0.42
68 YGR109c 2 0 1 0 0 0
69 YPR120c 2 0 1 0 0 0
70 YDL127w 2 0.03 0.79 0 0.18 0
71 YNL289w 2 0 1 0 0 0
72 YPL256c 2 0 1 0 0 0
73 YMR199w 2 0 0.82 0 0.18 0
74 YJL187c 2 0 1 0 0 0
75 YDL003w 2 0 0.82 0 0.18 0
76 YMR076c 2 0 1 0 0 0
77 YKL042w 2 0 0.82 0 0.18 0
78 YFL008w 2 0 1 0 0 0
79 YPL241c 2 0 0.82 0 0.18 0
80 YMR078c 2 0 1 0 0 0
81 YLR212c 2 0 0.77 0 0.23 0
82 YNL225c 2 0.20 0.36 0.03 0.32 0.09

88
83 YPL209c 2 0 0.82 0.09 0.09 0
84 YJL074c 2 0 0.82 0 0.18 0
85 YNL233w 2 0 1 0 0 0
86 YLR313c 2 0 1 0 0 0
87 YGR041w 2 0.17 0.56 0 0.27 0
88 YGR152c 2 0 1 0 0 0
89 YDR507c 2 0 0.82 0 0.18 0
90 YLR286c 2 0.63 0.17 0.03 0.04 0.14
91 YIL159w 2 0 0.95 0 0 0.05
92 YGL027c 2 0 0.82 0 0.18 0
93 YML102w 2 0 1 0 0 0
94 YPR018w 2 0 1 0 0 0
95 YBL035c 2 0 1 0 0 0
96 YNL102w 2 0 1 0 0 0
97 YBR278w 2 0 1 0 0 0
98 YPR175w 2 0 1 0 0 0
99 YNL262w 2 0.09 0.64 0 0.27 0
100 YBR088c 2 0 0.82 0 0.18 0
101 YLR103c 2 0 1 0 0 0
102 YAR007c 2 0 0.82 0 0.18 0
103 YNL312w 2 0 1 0 0 0
104 YJL173c 2 0 0.82 0 0.18 0
105 YER070w 2 0 1 0 0 0
106 YOR074c 2 0 1 0 0 0
107 YDL164c 2 0 0.82 0 0.18 0
108 YKL045w 2 0 1 0 0 0
109 YBR252w/ 2 0.15 0.63 0 0.22 0
110 YLR032w 2 0 1 0 0 0
111 YML060w 2 0 1 0 0 0
112 YDR097c 2 0 0.82 0 0.18 0
113 YNL082w 2 0.09 0.64 0 0.27 0
114 YOL090w 2 0.09 0.73 0 0.18 0
115 YLR383w 2 0 1 0 0 0
116 YKL113c 2 0 1 0 0 0
117 YLR234w 2 0 1 0 0 0
118 YPL153c 2 0 1 0 0 0
119 YDL101c 2 0 0.82 0 0.18 0
120 YHR038W 2 0.79 0.05 0.03 0 0.14
121 YML021C 2 0 0.82 0 0.18 0
122 YML061c 2 0 1 0 0 0
123 YMR179w 2 0 0.82 0 0.18 0
124 YML027w 2 0 0.83 0 0.17 0
125 YPL127c 2 0 0.47 0.44 0.09 0
126 YGL089c 2 0.74 0.09 0.03 0.05 0.09
127 YPL187w 2 0.76 0.15 0 0 0.09
128 YDL227c 2 0.09 0.66 0 0.24 0
129 YNL173c 2 0.83 0 0.03 0 0.14
130 YLL021w 2 0 0.65 0.26 0.09 0
131 YLR382c 2 0.08 0.89 0.03 0 0
132 YJL196c 2 0 0.82 0 0.18 0
133 YJR148w 2 0.75 0.15 0 0.01 0.09
134 YOR317w 2 0.76 0.12 0.03 0 0.09
135 YKL165c 2 0 1 0 0 0

136 YLL002w 2 0 0.82 0 0.18 0
137 YPL124w 2 0 0.82 0 0.18 0
138 YKL101w 2 0 0.82 0 0.18 0
139 YLR457c 2 0 1 0 0 0
140 YDR297w 2 0 0.48 0.39 0.12 0
141 YBR070c 2 0 1 0 0 0
142 YLR233c 2 0 1 0 0 0
143 YPL057C/ 2 0.38 0.35 0 0.27 0
144 YBR073w 2 0 1 0 0 0
145 YGL200c 2 0.09 0.72 0 0.18 0
146 YHR153c 2 0 0.82 0 0.18 0
147 YNL272C 2 0 0.82 0 0.18 0
148 YOR316C 2 0.71 0.15 0 0 0.14
149 YKR077w 2 0.12 0.61 0 0.27 0
150 YDL105w 2 0.09 0.65 0 0.26 0
151 YPL267w 2 0 1 0 0 0
152 YDL156w 2 0 0.82 0 0.18 0
153 YOL017w 2 0 0.82 0 0.18 0
154 YBR071w 2 0.09 0.64 0 0.27 0
155 YLL022c 2 0 1 0 0 0
156 YLR236c 2 0 1 0 0 0
157 YCL022c 2 0.09 0.64 0 0.27 0
158 YCL024w 2 0 0.82 0 0.18 0
159 YLR121c 2 0 1 0 0 0
160 YHR154w 2 0 0.97 0 0.03 0
161 YHR110w 2 0.09 0.73 0 0.18 0
162 YDL010w 2 0 0.82 0 0.18 0
163 YPR174c 2 0 1 0 0 0
164 YJL181w 2 0 0.82 0 0.18 0
165 YLR183c 2 0 1 0 0 0
166 YOL007c 2 0 1 0 0 0
167 YIL026c 2 0 1 0 0 0
168 YJR043c 2 0 1 0 0 0
169 YKL108w 2 0 1 0 0 0
170 YLR326w 2 0 1 0 0 0
171 YLR381w 2 0 1 0 0 0
172 YNL273w 2 0 0.82 0 0.18 0
173 YPL014w 2 0.18 0.55 0 0.27 0
174 YJL018w 2 0 1 0 0 0
175 YBR089w 2 0 0.82 0 0.18 0
176 YLR349w 2 0.39 0.33 0 0.27 0
177 YCL061c 2 0 0.82 0 0.18 0
178 YDR309c 2 0.39 0.24 0 0.27 0.09
179 YKR013w 2 0 1 0 0 0
180 YJL078c 2 0.69 0.12 0 0.10 0.09
181 YKL161c 2 0.83 0.08 0 0 0.09
182 YDL018c 2 0 1 0 0 0
183 YKR083c 2 0.09 0.64 0 0.27 0
184 YDR383c 2 0.09 0.64 0 0.27 0
185 YDR013w 2 0 1 0 0 0
186 YGR238c 2 0 1 0 0 0
187 YHR113w 2 0.70 0.05 0.03 0.09 0.14
188 YDL119c 2 0.76 0 0.06 0 0.18

189 YDL124w 2 0.56 0 0.12 0.27 0.05
190 YHR039c 2 0.83 0.08 0 0 0.09
191 YDR493w 2 0.76 0.05 0.03 0.03 0.14
192 YNL309w 2 0 1 0 0 0
193 YPL208w 2 0 1 0 0 0
194 YLR376c 2 0 1 0 0 0
195 YNL300w 2 0 0.82 0.09 0.09 0
196 YAR003W 2 0 1 0 0 0
197 YCL060c 2 0 0.82 0 0.18 0
198 YDL103c 2 0 1 0 0 0
199 YGL028c 2 0.75 0.12 0 0.04 0.09
200 YNL303w 2 0 0.82 0 0.18 0
201 YNL310c 2 0 1 0 0 0
202 YOR144c 2 0 1 0 0 0
203 YDR113c 3 0 0.42 0.45 0.12 0
204 YNL126w 3 0 0 0.88 0.12 0
205 YEL061c 3 0 0 0.88 0.12 0
206 YHR172w 3 0 0.64 0.27 0.09 0
207 YPR141c 3 0 0.13 0.71 0.17 0
208 YLR045c 3 0 0 0.88 0.12 0
209 YGR140w 3 0 0.45 0.45 0.09 0
210 YBL063w 3 0 0 0.88 0.12 0
211 YDR488c 3 0 0.82 0 0.18 0
212 YMR198w 3 0 0.05 0.92 0.03 0
213 YOR026w 3 0 1 0 0 0
214 YDR356w 3 0 0.19 0.69 0.12 0
215 YIL140w 3 0 0.97 0.03 0 0
216 YKL067w 3 0.32 0.41 0 0.27 0
217 YJR006w 3 0 0.74 0.17 0.09 0
218 YBL003c/ 3 0 0.29 0.53 0.19 0
219 YBL002w/ 3 0 0.37 0.54 0.09 0
220 YMR190c 3 0 0 0.97 0.03 0
221 YJL115w 3 0 1 0 0 0
222 YDL197c 3 0 1 0 0 0
223 YCR065w 3 0 1 0 0 0
224 YPL016w 3 0 0 0.88 0.12 0
225 YBL052c 3 0 0 0.97 0.03 0
226 YBR275c 3 0 0.97 0.03 0 0
227 YIL126w 3 0 0 0.97 0.03 0
228 YFR037c 3 0 0.41 0.50 0.09 0
229 YKL127W/ 3 0 0.77 0.05 0.18 0
230 YAR008w 3 0 1 0 0 0
231 YER001w 3 0 0.73 0.18 0.09 0
232 YER003c 3 0 0.64 0.27 0.09 0
233 YDL093w 3 0 0.73 0.18 0.09 0
234 YIR017c 3 0 0 0.88 0.12 0
235 YKR001c 3 0 0.12 0.76 0.12 0
236 YDL095w 3 0 0.82 0 0.18 0
237 YNL073w 3 0.74 0.17 0 0 0.09
238 YER118c 3 0 0.73 0.09 0.18 0
239 YJR137c 3 0 0 0.88 0.12 0
240 YER017c 3 0.76 0.08 0.03 0.05 0.09
241 YER016w 3 0 0.68 0.23 0.09 0

242 YCR035c 3 0 0 0.97 0.03 0
243 YOL019w 3 0.27 0.45 0 0.27 0
244 YLR151c 3 0 1 0 0 0
245 YOR284w 3 0 1 0 0 0
246 YMR048w 3 0 1 0 0 0
247 YKR010c 3 0 0 0.88 0.12 0
248 YHR061c 3 0 0 0.88 0.12 0
249 YEL017w 3 0 0 0.88 0.12 0
250 YLL062c 3 0 0 0.88 0.12 0
251 YJR155w 3 0 1 0 0 0
252 YLR126c 3 0 1 0 0 0
253 YJL118w 3 0 0.10 0.78 0.12 0
254 YDR179c 3 0 0 0.88 0.12 0
255 YDR219c 3 0 0 0.97 0.03 0
256 YNL176c 3 0 0 0.97 0.03 0
257 YER018c 3 0 0 0.88 0.12 0
258 YBR156c 3 0 0 0.88 0.12 0
259 YLR455w 3 0 0 0.97 0.03 0
260 YBR184w 3 0 0.05 0.55 0.41 0
261 YLR228c 3 0 0.05 0.59 0.36 0
262 YDR252w 3 0 0.03 0.59 0.38 0
263 YNL304w 3 0 0.82 0 0.18 0
264 YGR189C 3 0 0.97 0.03 0 0
265 YAL034W-a 3 0 0.95 0.05 0 0
266 YBR007c 3 0 0.18 0.70 0.12 0
267 YBR276c 3 0 1 0 0 0
268 YCRX04w 3 0 0 0.58 0 0.42
269 YDL096c 3 0 0.82 0 0.18 0
270 YEL018w 3 0 0 0.83 0.17 0
271 YER019w 3 0 0.59 0.32 0.09 0
272 YFR026C 3 0 0.77 0 0.23 0
273 YFR027W 3 0 0.97 0.03 0 0
274 YFR038w 3 0 0.95 0.05 0 0
275 YKL066W 3 0.58 0.26 0.03 0.04 0.09
276 YLL061w 3 0 0 0.88 0.12 0
277 YNL072W 3 0.09 0.64 0 0.27 0
278 YIL050w 4 0 0 0.97 0 0.03
279 YIL106w 4 0 0 0.29 0 0.71
280 YBL097w 4 0 0 0.97 0 0.03
281 YKL049c 4 0 0 0.88 0.12 0
282 YCL014w 4 0 0 0.48 0 0.52
283 YOR188w 4 0 0 0.83 0.17 0
284 YJR076c 4 0 0 0.97 0 0.03
285 YJL099w 4 0 0 0.88 0 0.12
286 YKL048c 4 0 0 0.97 0 0.03
287 YBR038w 4 0 0 0.23 0 0.77
288 YJL092w 4 0 0 0.97 0.03 0
289 YCR084c 4 0 0 0.88 0.12 0
290 YGL255w 4 0 0 0.52 0 0.48
291 YLR014c 4 0.06 0.05 0.12 0.52 0.26
292 YJL137c 4 0 0 0.97 0 0.03
293 YOR274w 4 0.17 0 0.09 0.38 0.36
294 YBR104w 4 0.18 0.05 0.32 0.15 0.30

295 YJR112w 4 0 0 0.88 0.12 0
296 YCR073c 4 0 0 0.88 0 0.12
297 YDL198c 4 0 0 0.54 0 0.46
298 YLL046c 4 0.05 0 0.50 0.45 0
299 YDR389w 4 0 0 0.61 0 0.39
300 YDR464w 4 0 0 0.83 0.14 0.03
301 YKL068w 4 0 0 0.88 0 0.12
302 YIL131c 4 0 0 0.97 0 0.03
303 YDR451c 4 0 0 0.97 0 0.03
304 YCR085w 4 0 0 0.97 0 0.03
305 YMR003w 4 0 0 0.97 0 0.03
306 YNR009w 4 0 0 0.88 0.09 0.03
307 YOR073w 4 0 0 0.97 0 0.03
308 YMR215w 4 0 0 0.97 0 0.03
309 YLL047w 4 0.08 0 0.45 0.47 0
310 YDR366c 4 0 0.03 0.62 0.14 0.21
311 YIL158w 4 0 0 0.19 0 0.81
312 YGL101w 4 0 0 0.88 0 0.12
313 YJR110w 4 0 0 0.92 0 0.08
314 YOL012c 4 0 0 0.88 0 0.12
315 YBL032w 4 0 0 0.88 0 0.12
316 YBR043c 4 0 0 0.46 0 0.54
317 YCL012w 4 0 0 0.19 0 0.81
318 YCL013w 4 0 0 0.48 0 0.52
319 YCL062w 4 0 0 0.70 0 0.30
320 YCL063w 4 0 0 0.70 0 0.30
321 YCR086w 4 0 0 0.97 0 0.03
322 YDR324C 4 0 0 0.49 0 0.51
323 YDR325W 4 0 0 0.97 0 0.03
324 YIL169C 4 0 0 0.88 0.09 0.03
325 YKL052C 4 0 0 0.88 0.09 0.03
326 YKL053W 4 0 0.09 0.79 0.12 0
327 YKL069W 4 0 0 0.88 0.09 0.03
328 YPL116W 4 0 0 0.97 0 0.03
329 YPL264C 4 0 0 0.97 0 0.03
330 YGR108w 5 0 0 0.19 0 0.81
331 YPR119w 5 0 0 0.05 0 0.95
332 YAL040c 5 0.59 0.11 0 0.09 0.21
333 YGL116w 5 0 0 0.04 0 0.96
334 YOL069w 5 0 0 0.14 0 0.86
335 YGR092w 5 0 0 0 0 1
336 YBR138c 5 0 0 0.05 0 0.95
337 YOR058c 5 0 0 0.10 0 0.90
338 YHR023w 5 0 0 0.05 0 0.95
339 YPL242C 5 0 0 0.05 0 0.95
340 YJR092w 5 0 0 0.05 0 0.95
341 YLR353w 5 0 0 0.05 0 0.95
342 YCL037C 5 0 0 0.53 0 0.47
343 YMR001c 5 0 0 0.05 0 0.95
344 YGL021w 5 0 0 0.14 0 0.86
345 YCR042c 5 0.04 0 0 0 0.96
346 YOR025w 5 0 0 0.05 0 0.95
347 YOR229w 5 0 0 0 0 1

348 YDR146c 5 0 0 0.23 0 0.77
349 YLR131c 5 0 0 0.05 0 0.95
350 YNL053w 5 0 0 0.19 0 0.81
351 YKL130C 5 0 0 0.05 0 0.95
352 YIL162w 5 0 0 0 0 1
353 YDL138W 5 0 0 0 0 1
354 YDL048c 5 0 0 0.19 0.05 0.77
355 YGR143w 5 0 0 0 0 1
356 YKL129c 5 0 0 0 0 1
357 YIL167w 5 0.06 0 0 0 0.94
358 YHR152w 5 0 0 0 0 1
359 YPR167C 5 0 0.03 0.85 0 0.12
360 YJL079c 5 0 0 0 0 1
361 YAR018c 5 0 0 0 0 1
362 YAL022c 5 0 0 0 0 1
363 YGR279c 5 0 0 0 0 1
364 YGL201c 5 0.03 0 0 0 0.97
365 YNL057w 5 0 0 0 0 1
366 YML119w 5 0 0 0 0 1
367 YOL137w 5 0 0 0 0 1
368 YPL186c 5 0.14 0.03 0.12 0.41 0.30
369 YCL038c 5 0 0 0 0 1
370 YML033w 5 0 0 0.14 0 0.86
371 YML034w 5 0 0 0.05 0 0.95
372 YKR021w 5 0 0 0.05 0 0.95
373 YPR157w 5 0 0 0.05 0 0.95
374 YJL051w 5 0 0 0.05 0 0.95
375 YOL014w 5 0 0 0.05 0 0.95
376 YOR315w 5 0 0 0.14 0 0.86
377 YGR230w 5 0 0 0 0 1
378 YLR190w 5 0 0 0.05 0 0.95
379 YMR032w 5 0 0 0.05 0 0.95
380 YOL070c 5 0 0 0 0 1
381 YLR297W 5 0 0 0 0 1
382 YHL028W 5 0 0 0.05 0 0.95
383 YHR151C 5 0 0 0 0 1
384 YNL058C 5 0 0 0 0 1

Appendix B. Combined Fuzzy Partition Matrix Binarization
Results
This appendix lists the genes assigned to each of the 5 main clusters after binarizing the combined fuzzy partition matrix shown in Appendix A. The binarization techniques are explained in section 4.3 and these results are discussed in section 5.2.

Table 5.2 in section 5.2 lists the 15 binarization experiments that were carried out. Not all of their results are listed here; only the most informative ones are included. The following tables are referenced by the text in section 5.2 at the appropriate places for analysis and inference.

All of the tables listed below have the same format. Each table is divided into 6 batches of genes: 5 batches for the 5 clusters and a last batch for the un-clustered (unassigned) genes. The number of genes in each batch is shown just under the batch's title. Each batch listing has 3 columns: the sequential number (index) of the gene with respect to the list of 384 genes, the name of the gene, and the biologically suggested class (cluster).
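
As an illustration of this format, here is a hedged MATLAB sketch that groups genes into the six batches starting from a binarized partition matrix B (384 genes by 5 clusters, entries 0 or 1) produced by one of the binarization techniques of section 4.3; geneNames and bioClass are hypothetical variables standing for the gene names and the biologically suggested classes.

```matlab
% B: binarized partition matrix (384 x 5, entries 0 or 1).
% geneNames: hypothetical cell array of the 384 gene names.
% bioClass:  hypothetical vector of the biologically suggested classes.
% Batch titles as used in the tables below; their order is assumed to
% match the cluster columns of B.
batchTitles = {'Early G1', 'Late G1', 'S / G2', 'Early G1 + G2 / M', 'M', 'Unassigned'};
for c = 1:6
    if c <= 5
        idx = find(B(:, c) == 1);       % genes assigned to cluster c
    else
        idx = find(sum(B, 2) == 0);     % genes assigned to no cluster at all
    end
    fprintf('(%d) %s: %d genes\n', c, batchTitles{c}, numel(idx));
    for k = 1:numel(idx)
        fprintf('%3d  %-10s  %d\n', idx(k), geneNames{idx(k)}, bioClass(idx(k)));
    end
end
```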

Table B.1, Table B.2, Table B.3, Table B.4 and Table B.5 show, respectively, the results of the intersection binarization, the max binarization, the value thresholding binarization, the difference thresholding binarization and the top thresholding binarization, each of the thresholded techniques being applied with its respective threshold value.
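
For concreteness, the following is a minimal MATLAB sketch of one plausible reading of these five rules applied to the combined fuzzy partition matrix U of Appendix A; the exact definitions are those of section 4.3, and the thresholds alpha, delta and t below are placeholders rather than the values used in the experiments of Table 5.2.

```matlab
% U: combined fuzzy partition matrix (384 genes x 5 clusters), rows sum to 1.
% alpha, delta, t: hypothetical threshold values, for illustration only.
alpha = 0.5;  delta = 0.3;  t = 0.4;
[nG, nC] = size(U);

[umax, imax] = max(U, [], 2);                    % largest membership of each gene
B_max = zeros(nG, nC);
B_max(sub2ind([nG nC], (1:nG)', imax)) = 1;      % max binarization: winner takes all

Usec = U;
Usec(sub2ind([nG nC], (1:nG)', imax)) = -Inf;
usec = max(Usec, [], 2);                         % second-largest membership of each gene

B_intersection = double(U == 1);                 % intersection: assign only on full agreement
B_value        = double(U >= alpha);             % value thresholding: may assign a gene to several clusters
B_difference   = B_max .* repmat(double(umax - usec >= delta), 1, nC);  % difference thresholding
B_top          = B_max .* repmat(double(umax >= t), 1, nC);             % top thresholding
```

Under any of these rules, a gene whose resulting binary row is all zeros falls into the "Unassigned" batch of the corresponding table.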

Table ‎B.1: 246 YMR048w 3 21 YGR183c 1 131 YLR382c 2 249 YEL017w 3 333 YGL116w 5
251 YJR155w 3 22 YLR258w 1 132 YJL196c 2 250 YLL062c 3 334 YOL069w 5
Intersection 252 YLR126c 3 23 YML110c 1 133 YJR148w 2 253 YJL118w 3 336 YBR138c 5
Binarization 267 YBR276c 3 24 YLR273c 1 134 YOR317w 2 254 YDR179c 3 337 YOR058c 5
Results (3) 25 YCR005c 1 136 YLL002w 2 255 YDR219c 3 338 YHR023w 5
S / G2 26 YCL040w 1 137 YPL124w 2 256 YNL176c 3 339 YPL242C 5
0 Genes 27 YMR256c 1 138 YKL101w 2 257 YER018c 3 340 YJR092w 5
No Name C
28 YIL009w 1 140 YDR297w 2 258 YBR156c 3 341 YLR353w 5
(1) 29 YLL040c 1 143 YPL057C/ 2 259 YLR455w 3 342 YCL037C 5
Early G1 30 YNR016c 1 145 YGL200c 2 260 YBR184w 3 343 YMR001c 5
0 Genes (4)
Early G1 + G2 / M 31 YBR067c 1 146 YHR153c 2 261 YLR228c 3 344 YGL021w 5
0 Genes 32 YPL058c 1 147 YNL272C 2 262 YDR252w 3 345 YCR042c 5
33 YGL055w 1 148 YOR316C 2 263 YNL304w 3 346 YOR025w 5
(2) 34 YGR281W 1 149 YKR077w 2 264 YGR189C 3 348 YDR146c 5
Late G1 35 YBR083w 1 150 YDL105w 2 265 YAL034W-a 3 349 YLR131c 5
69 Genes (5)
M 36 YBR054w 1 152 YDL156w 2 266 YBR007c 3 350 YNL053w 5
68 YGR109c 2 20 Genes 37 YKL116c 1 153 YOL017w 2 268 YCRX04w 3 351 YKL130C 5
69 YPR120c 2 38 YPR002w 1 154 YBR071w 2 269 YDL096c 3 354 YDL048c 5
335 YGR092w 5
71 YNL289w 2 39 YNR067c 1 157 YCL022c 2 270 YEL018w 3 357 YIL167w 5
347 YOR229w 5
72 YPL256c 2 40 YBR158w 1 158 YCL024w 2 271 YER019w 3 359 YPR167C 5
352 YIL162w 5
74 YJL187c 2 41 YDL117w 1 160 YHR154w 2 272 YFR026C 3 364 YGL201c 5
353 YDL138W 5
76 YMR076c 2 42 YGR035c 1 161 YHR110w 2 273 YFR027W 3 368 YPL186c 5
355 YGR143w 5
78 YFL008w 2 43 YHL026c 1 162 YDL010w 2 274 YFR038w 3 370 YML033w 5
356 YKL129c 5
80 YMR078c 2 44 YMR007w 1 164 YJL181w 2 275 YKL066W 3 371 YML034w 5
358 YHR152w 5
85 YNL233w 2 45 YMR254c 1 172 YNL273w 2 276 YLL061w 3 372 YKR021w 5
360 YJL079c 5
86 YLR313c 2 46 YNL046w 1 173 YPL014w 2 277 YNL072W 3 373 YPR157w 5
361 YAR018c 5
88 YGR152c 2 47 YOR264w 1 175 YBR089w 2 278 YIL050w 4 374 YJL051w 5
362 YAL022c 5
93 YML102w 2 48 YPL066w 1 176 YLR349w 2 279 YIL106w 4 375 YOL014w 5
363 YGR279c 5
94 YPR018w 2 49 YBR052c 1 177 YCL061c 2 280 YBL097w 4 376 YOR315w 5
365 YNL057w 5
95 YBL035c 2 50 YPL158c 1 178 YDR309c 2 281 YKL049c 4 378 YLR190w 5
366 YML119w 5
96 YNL102w 2 51 YHR022c 1 180 YJL078c 2 282 YCL014w 4 379 YMR032w 5
367 YOL137w 5
97 YBR278w 2 52 YPL004c 1 181 YKL161c 2 283 YOR188w 4 382 YHL028W 5
369 YCL038c 5
98 YPR175w 2 53 YBR157c 1 183 YKR083c 2 284 YJR076c 4
377 YGR230w 5
101 YLR103c 2 54 YNL078w 1 184 YDR383c 2 285 YJL099w 4
380 YOL070c 5
103 YNL312w 2 55 YOR066w 1 187 YHR113w 2 286 YKL048c 4
381 YLR297W 5
105 YER070w 2 56 YMR031c 1 188 YDL119c 2 287 YBR038w 4
383 YHR151C 5
106 YOR074c 2 57 YBR053c 1 189 YDL124w 2 288 YJL092w 4
384 YNL058C 5
108 YKL045w 2 58 YDR511w 1 190 YHR039c 2 289 YCR084c 4
335 YGR092w 5
110 YLR032w 2 59 YLR254c 1 191 YDR493w 2 290 YGL255w 4
347 YOR229w 5
111 YML060w 2 60 YDR033w 1 195 YNL300w 2 291 YLR014c 4
352 YIL162w 5
115 YLR383w 2 61 YKL163w 1 197 YCL060c 2 292 YJL137c 4
353 YDL138W 5
116 YKL113c 2 62 YBR231c 1 199 YGL028c 2 293 YOR274w 4
355 YGR143w 5
117 YLR234w 2 63 YDR368w 1 200 YNL303w 2 294 YBR104w 4
356 YKL129c 5
118 YPL153c 2 64 YLR050c 1 203 YDR113c 3 295 YJR112w 4
358 YHR152w 5
122 YML061c 2 65 YLR049c 1 204 YNL126w 3 296 YCR073c 4
360 YJL079c 5
135 YKL165c 2 66 YOR273c 1 205 YEL061c 3 297 YDL198c 4
361 YAR018c 5
139 YLR457c 2 67 YLR015w 1 206 YHR172w 3 298 YLL046c 4
362 YAL022c 5
141 YBR070c 2 70 YDL127w 2 207 YPR141c 3 299 YDR389w 4
363 YGR279c 5
142 YLR233c 2 73 YMR199w 2 208 YLR045c 3 300 YDR464w 4
365 YNL057w 5
144 YBR073w 2 75 YDL003w 2 209 YGR140w 3 301 YKL068w 4
366 YML119w 5
151 YPL267w 2 77 YKL042w 2 210 YBL063w 3 302 YIL131c 4
367 YOL137w 5
155 YLL022c 2 79 YPL241c 2 211 YDR488c 3 303 YDR451c 4
369 YCL038c 5
156 YLR236c 2 81 YLR212c 2 212 YMR198w 3 304 YCR085w 4
377 YGR230w 5
159 YLR121c 2 82 YNL225c 2 214 YDR356w 3 305 YMR003w 4
380 YOL070c 5
163 YPR174c 2 83 YPL209c 2 215 YIL140w 3 306 YNR009w 4
381 YLR297W 5
165 YLR183c 2 84 YJL074c 2 216 YKL067w 3 307 YOR073w 4
383 YHR151C 5
166 YOL007c 2 87 YGR041w 2 217 YJR006w 3 308 YMR215w 4
384 YNL058C 5
167 YIL026c 2 89 YDR507c 2 218 YBL003c/ 3 309 YLL047w 4
168 YJR043c 2 (6)
Unassigned 90 YLR286c 2 219 YBL002w/ 3 310 YDR366c 4
169 YKL108w 2 91 YIL159w 2 220 YMR190c 3 311 YIL158w 4
295 Genes
170 YLR326w 2 92 YGL027c 2 224 YPL016w 3 312 YGL101w 4
171 YLR381w 2 1 YDL179w 1
2 YLR079w 1 99 YNL262w 2 225 YBL052c 3 313 YJR110w 4
174 YJL018w 2 100 YBR088c 2 226 YBR275c 3 314 YOL012c 4
179 YKR013w 2 3 YER111c 1
4 YBR200w 1 102 YAR007c 2 227 YIL126w 3 315 YBL032w 4
182 YDL018c 2 104 YJL173c 2 228 YFR037c 3 316 YBR043c 4
185 YDR013w 2 5 YJL194w 1
6 YLR274w 1 107 YDL164c 2 229 YKL127W/ 3 317 YCL012w 4
186 YGR238c 2 109 YBR252w/ 2 231 YER001w 3 318 YCL013w 4
192 YNL309w 2 7 YBR202w 1
8 YPR019w 1 112 YDR097c 2 232 YER003c 3 319 YCL062w 4
193 YPL208w 2 113 YNL082w 2 233 YDL093w 3 320 YCL063w 4
194 YLR376c 2 9 YBL023c 1
10 YEL032w 1 114 YOL090w 2 234 YIR017c 3 321 YCR086w 4
196 YAR003W 2 119 YDL101c 2 235 YKR001c 3 322 YDR324C 4
198 YDL103c 2 11 YGR044c 1
12 YML109w 1 120 YHR038W 2 236 YDL095w 3 323 YDR325W 4
201 YNL310c 2 121 YML021C 2 237 YNL073w 3 324 YIL169C 4
202 YOR144c 2 13 YJL157c 1
14 YKL185w 1 123 YMR179w 2 238 YER118c 3 325 YKL052C 4
213 YOR026w 3 124 YML027w 2 239 YJR137c 3 326 YKL053W 4
221 YJL115w 3 15 YHR005c 1
16 YNR001c 1 125 YPL127c 2 240 YER017c 3 327 YKL069W 4
222 YDL197c 3 126 YGL089c 2 241 YER016w 3 328 YPL116W 4
223 YCR065w 3 17 YKL150w 1
18 YLR395c 1 127 YPL187w 2 242 YCR035c 3 329 YPL264C 4
230 YAR008w 3 128 YDL227c 2 243 YOL019w 3 330 YGR108w 5
244 YLR151c 3 19 YOR065w 1
20 YDL181w 1 129 YNL173c 2 247 YKR010c 3 331 YPR119w 5
245 YOR284w 3 130 YLL021w 2 248 YHR061c 3 332 YAL040c 5

Table ‎B.2: Max 332 YAL040c 5 150 YDL105w 2 203 YDR113c 3 (4) 383 YHR151C 5
(2) 151 YPL267w 2 204 YNL126w 3 Early G1 + G2 / M 384 YNL058C 5
Binarization Late G1 152 YDL156w 2 205 YEL061c 3 5 Genes (6)
Results 155 Genes 153 YOL017w 2 207 YPR141c 3 31 YBR067c 1 Unassigned
3 YER111c 1 154 YBR071w 2 208 YLR045c 3 291 YLR014c 4 0 Genes
No Name C 12 YML109w 1 155 YLL022c 2 209 YGR140w 3 293 YOR274w 4
(1) 30 YNR016c 1 156 YLR236c 2 210 YBL063w 3 309 YLL047w 4
Early G1 33 YGL055w 1 157 YCL022c 2 212 YMR198w 3 368 YPL186c 5
73 Genes 68 YGR109c 2 158 YCL024w 2 214 YDR356w 3 (5)
1 YDL179w 1 69 YPR120c 2 159 YLR121c 2 218 YBL003c/ 3 M
2 YLR079w 1 70 YDL127w 2 160 YHR154w 2 219 YBL002w/ 3 72 Genes
3 YER111c 1 71 YNL289w 2 161 YHR110w 2 220 YMR190c 3 4 YBR200w 1
5 YJL194w 1 72 YPL256c 2 162 YDL010w 2 224 YPL016w 3 7 YBR202w 1
6 YLR274w 1 73 YMR199w 2 163 YPR174c 2 225 YBL052c 3 8 YPR019w 1
11 YGR044c 1 74 YJL187c 2 164 YJL181w 2 227 YIL126w 3 9 YBL023c 1
13 YJL157c 1 75 YDL003w 2 165 YLR183c 2 228 YFR037c 3 10 YEL032w 1
14 YKL185w 1 76 YMR076c 2 166 YOL007c 2 234 YIR017c 3 32 YPL058c 1
15 YHR005c 1 77 YKL042w 2 167 YIL026c 2 235 YKR001c 3 42 YGR035c 1
16 YNR001c 1 78 YFL008w 2 168 YJR043c 2 239 YJR137c 3 43 YHL026c 1
17 YKL150w 1 79 YPL241c 2 169 YKL108w 2 242 YCR035c 3 45 YMR254c 1
18 YLR395c 1 80 YMR078c 2 170 YLR326w 2 247 YKR010c 3 56 YMR031c 1
19 YOR065w 1 81 YLR212c 2 171 YLR381w 2 248 YHR061c 3 59 YLR254c 1
20 YDL181w 1 82 YNL225c 2 172 YNL273w 2 249 YEL017w 3 66 YOR273c 1
21 YGR183c 1 83 YPL209c 2 173 YPL014w 2 250 YLL062c 3 67 YLR015w 1
22 YLR258w 1 84 YJL074c 2 174 YJL018w 2 253 YJL118w 3 279 YIL106w 4
23 YML110c 1 85 YNL233w 2 175 YBR089w 2 254 YDR179c 3 282 YCL014w 4
24 YLR273c 1 86 YLR313c 2 177 YCL061c 2 255 YDR219c 3 287 YBR038w 4
25 YCR005c 1 87 YGR041w 2 179 YKR013w 2 256 YNL176c 3 311 YIL158w 4
26 YCL040w 1 88 YGR152c 2 182 YDL018c 2 257 YER018c 3 316 YBR043c 4
27 YMR256c 1 89 YDR507c 2 183 YKR083c 2 258 YBR156c 3 317 YCL012w 4
28 YIL009w 1 91 YIL159w 2 184 YDR383c 2 259 YLR455w 3 318 YCL013w 4
29 YLL040c 1 92 YGL027c 2 185 YDR013w 2 260 YBR184w 3 322 YDR324C 4
34 YGR281W 1 93 YML102w 2 186 YGR238c 2 261 YLR228c 3 330 YGR108w 5
35 YBR083w 1 94 YPR018w 2 192 YNL309w 2 262 YDR252w 3 331 YPR119w 5
36 YBR054w 1 95 YBL035c 2 193 YPL208w 2 266 YBR007c 3 333 YGL116w 5
37 YKL116c 1 96 YNL102w 2 194 YLR376c 2 268 YCRX04w 3 334 YOL069w 5
38 YPR002w 1 97 YBR278w 2 195 YNL300w 2 270 YEL018w 3 335 YGR092w 5
39 YNR067c 1 98 YPR175w 2 196 YAR003W 2 276 YLL061w 3 336 YBR138c 5
40 YBR158w 1 99 YNL262w 2 197 YCL060c 2 278 YIL050w 4 337 YOR058c 5
41 YDL117w 1 100 YBR088c 2 198 YDL103c 2 280 YBL097w 4 338 YHR023w 5
44 YMR007w 1 101 YLR103c 2 200 YNL303w 2 281 YKL049c 4 339 YPL242C 5
46 YNL046w 1 102 YAR007c 2 201 YNL310c 2 283 YOR188w 4 340 YJR092w 5
47 YOR264w 1 103 YNL312w 2 202 YOR144c 2 284 YJR076c 4 341 YLR353w 5
48 YPL066w 1 104 YJL173c 2 206 YHR172w 3 285 YJL099w 4 343 YMR001c 5
49 YBR052c 1 105 YER070w 2 209 YGR140w 3 286 YKL048c 4 344 YGL021w 5
50 YPL158c 1 106 YOR074c 2 211 YDR488c 3 288 YJL092w 4 345 YCR042c 5
51 YHR022c 1 107 YDL164c 2 213 YOR026w 3 289 YCR084c 4 346 YOR025w 5
52 YPL004c 1 108 YKL045w 2 215 YIL140w 3 290 YGL255w 4 347 YOR229w 5
53 YBR157c 1 109 YBR252w/ 2 216 YKL067w 3 292 YJL137c 4 348 YDR146c 5
54 YNL078w 1 110 YLR032w 2 217 YJR006w 3 294 YBR104w 4 349 YLR131c 5
55 YOR066w 1 111 YML060w 2 221 YJL115w 3 295 YJR112w 4 350 YNL053w 5
57 YBR053c 1 112 YDR097c 2 222 YDL197c 3 296 YCR073c 4 351 YKL130C 5
58 YDR511w 1 113 YNL082w 2 223 YCR065w 3 297 YDL198c 4 352 YIL162w 5
60 YDR033w 1 114 YOL090w 2 226 YBR275c 3 298 YLL046c 4 353 YDL138W 5
61 YKL163w 1 115 YLR383w 2 229 YKL127W/ 3 299 YDR389w 4 354 YDL048c 5
62 YBR231c 1 116 YKL113c 2 230 YAR008w 3 300 YDR464w 4 355 YGR143w 5
63 YDR368w 1 117 YLR234w 2 231 YER001w 3 301 YKL068w 4 356 YKL129c 5
64 YLR050c 1 118 YPL153c 2 232 YER003c 3 302 YIL131c 4 357 YIL167w 5
65 YLR049c 1 119 YDL101c 2 233 YDL093w 3 303 YDR451c 4 358 YHR152w 5
90 YLR286c 2 121 YML021C 2 236 YDL095w 3 304 YCR085w 4 360 YJL079c 5
120 YHR038W 2 122 YML061c 2 238 YER118c 3 305 YMR003w 4 361 YAR018c 5
126 YGL089c 2 123 YMR179w 2 241 YER016w 3 306 YNR009w 4 362 YAL022c 5
127 YPL187w 2 124 YML027w 2 243 YOL019w 3 307 YOR073w 4 363 YGR279c 5
129 YNL173c 2 125 YPL127c 2 244 YLR151c 3 308 YMR215w 4 364 YGL201c 5
133 YJR148w 2 128 YDL227c 2 245 YOR284w 3 310 YDR366c 4 365 YNL057w 5
134 YOR317w 2 130 YLL021w 2 246 YMR048w 3 312 YGL101w 4 366 YML119w 5
143 YPL057C/ 2 131 YLR382c 2 251 YJR155w 3 313 YJR110w 4 367 YOL137w 5
148 YOR316C 2 132 YJL196c 2 252 YLR126c 3 314 YOL012c 4 369 YCL038c 5
176 YLR349w 2 135 YKL165c 2 263 YNL304w 3 315 YBL032w 4 370 YML033w 5
178 YDR309c 2 136 YLL002w 2 264 YGR189C 3 319 YCL062w 4 371 YML034w 5
180 YJL078c 2 137 YPL124w 2 265 YAL034W-a 3 320 YCL063w 4 372 YKR021w 5
181 YKL161c 2 138 YKL101w 2 267 YBR276c 3 321 YCR086w 4 373 YPR157w 5
187 YHR113w 2 139 YLR457c 2 269 YDL096c 3 323 YDR325W 4 374 YJL051w 5
188 YDL119c 2 140 YDR297w 2 271 YER019w 3 324 YIL169C 4 375 YOL014w 5
189 YDL124w 2 141 YBR070c 2 272 YFR026C 3 325 YKL052C 4 376 YOR315w 5
190 YHR039c 2 142 YLR233c 2 273 YFR027W 3 326 YKL053W 4 377 YGR230w 5
191 YDR493w 2 144 YBR073w 2 274 YFR038w 3 327 YKL069W 4 378 YLR190w 5
199 YGL028c 2 145 YGL200c 2 277 YNL072W 3 328 YPL116W 4 379 YMR032w 5
237 YNL073w 3 146 YHR153c 2 (3) 329 YPL264C 4 380 YOL070c 5
240 YER017c 3 147 YNL272C 2 S / G2 342 YCL037C 5 381 YLR297W 5
275 YKL066W 3 149 YKR077w 2 81 Genes 359 YPR167C 5 382 YHL028W 5

Table ‎B.3: Value 191 YDR493w 2 142 YLR233c 2 265 YAL034W-a 3 312 YGL101w 4 347 YOR229w 5
199 YGL028c 2 143 YPL057C/ 2 267 YBR276c 3 313 YJR110w 4 348 YDR146c 5
Thresholding 216 YKL067w 3 144 YBR073w 2 269 YDL096c 3 314 YOL012c 4 349 YLR131c 5
Binarization 237 YNL073w 3 145 YGL200c 2 271 YER019w 3 315 YBL032w 4 350 YNL053w 5
Results ( 240 YER017c 3 146 YHR153c 2 272 YFR026C 3 316 YBR043c 4 351 YKL130C 5
275 YKL066W 3 147 YNL272C 2 273 YFR027W 3 318 YCL013w 4 352 YIL162w 5
) 332 YAL040c 5 149 YKR077w 2 274 YFR038w 3 319 YCL062w 4 353 YDL138W 5
(2) 150 YDL105w 2 277 YNL072W 3 320 YCL063w 4 354 YDL048c 5
No Name C Late G1 151 YPL267w 2 (3) 321 YCR086w 4 355 YGR143w 5
(1) 160 Genes 152 YDL156w 2 S / G2 322 YDR324C 4 356 YKL129c 5
Early G1 3 YER111c 1 153 YOL017w 2 89 Genes 323 YDR325W 4 357 YIL167w 5
76 Genes 12 YML109w 1 154 YBR071w 2 125 YPL127c 2 324 YIL169C 4 358 YHR152w 5
1 YDL179w 1 30 YNR016c 1 155 YLL022c 2 140 YDR297w 2 325 YKL052C 4 360 YJL079c 5
2 YLR079w 1 33 YGL055w 1 156 YLR236c 2 203 YDR113c 3 326 YKL053W 4 361 YAR018c 5
3 YER111c 1 68 YGR109c 2 157 YCL022c 2 204 YNL126w 3 327 YKL069W 4 362 YAL022c 5
5 YJL194w 1 69 YPR120c 2 158 YCL024w 2 205 YEL061c 3 328 YPL116W 4 363 YGR279c 5
6 YLR274w 1 70 YDL127w 2 159 YLR121c 2 207 YPR141c 3 329 YPL264C 4 364 YGL201c 5
8 YPR019w 1 71 YNL289w 2 160 YHR154w 2 208 YLR045c 3 342 YCL037C 5 365 YNL057w 5
11 YGR044c 1 72 YPL256c 2 161 YHR110w 2 209 YGR140w 3 359 YPR167C 5 366 YML119w 5
13 YJL157c 1 73 YMR199w 2 162 YDL010w 2 210 YBL063w 3 (4) 367 YOL137w 5
14 YKL185w 1 74 YJL187c 2 163 YPR174c 2 212 YMR198w 3 Early G1 + G2 / M 368 YPL186c 5
15 YHR005c 1 75 YDL003w 2 164 YJL181w 2 214 YDR356w 3 11 Genes 369 YCL038c 5
16 YNR001c 1 76 YMR076c 2 165 YLR183c 2 218 YBL003c/ 3 31 YBR067c 1 370 YML033w 5
17 YKL150w 1 77 YKL042w 2 166 YOL007c 2 219 YBL002w/ 3 67 YLR015w 1 371 YML034w 5
18 YLR395c 1 78 YFL008w 2 167 YIL026c 2 220 YMR190c 3 82 YNL225c 2 372 YKR021w 5
19 YOR065w 1 79 YPL241c 2 168 YJR043c 2 224 YPL016w 3 260 YBR184w 3 373 YPR157w 5
20 YDL181w 1 80 YMR078c 2 169 YKL108w 2 225 YBL052c 3 261 YLR228c 3 374 YJL051w 5
21 YGR183c 1 81 YLR212c 2 170 YLR326w 2 227 YIL126w 3 262 YDR252w 3 375 YOL014w 5
22 YLR258w 1 82 YNL225c 2 171 YLR381w 2 228 YFR037c 3 291 YLR014c 4 376 YOR315w 5
23 YML110c 1 83 YPL209c 2 172 YNL273w 2 234 YIR017c 3 293 YOR274w 4 377 YGR230w 5
24 YLR273c 1 84 YJL074c 2 173 YPL014w 2 235 YKR001c 3 298 YLL046c 4 378 YLR190w 5
25 YCR005c 1 85 YNL233w 2 174 YJL018w 2 239 YJR137c 3 309 YLL047w 4 379 YMR032w 5
26 YCL040w 1 86 YLR313c 2 175 YBR089w 2 242 YCR035c 3 368 YPL186c 5 380 YOL070c 5
27 YMR256c 1 87 YGR041w 2 176 YLR349w 2 247 YKR010c 3 (5) 381 YLR297W 5
28 YIL009w 1 88 YGR152c 2 177 YCL061c 2 248 YHR061c 3 M 382 YHL028W 5
29 YLL040c 1 89 YDR507c 2 179 YKR013w 2 249 YEL017w 3 82 Genes 383 YHR151C 5
34 YGR281W 1 91 YIL159w 2 182 YDL018c 2 250 YLL062c 3 4 YBR200w 1 384 YNL058C 5
35 YBR083w 1 92 YGL027c 2 183 YKR083c 2 253 YJL118w 3 7 YBR202w 1 (6)
36 YBR054w 1 93 YML102w 2 184 YDR383c 2 254 YDR179c 3 8 YPR019w 1 Unassigned
37 YKL116c 1 94 YPR018w 2 185 YDR013w 2 255 YDR219c 3 9 YBL023c 1 0 Genes
38 YPR002w 1 95 YBL035c 2 186 YGR238c 2 256 YNL176c 3 10 YEL032w 1
39 YNR067c 1 96 YNL102w 2 192 YNL309w 2 257 YER018c 3 32 YPL058c 1
40 YBR158w 1 97 YBR278w 2 193 YPL208w 2 258 YBR156c 3 42 YGR035c 1
41 YDL117w 1 98 YPR175w 2 194 YLR376c 2 259 YLR455w 3 43 YHL026c 1
44 YMR007w 1 99 YNL262w 2 195 YNL300w 2 260 YBR184w 3 45 YMR254c 1
46 YNL046w 1 100 YBR088c 2 196 YAR003W 2 261 YLR228c 3 56 YMR031c 1
47 YOR264w 1 101 YLR103c 2 197 YCL060c 2 262 YDR252w 3 59 YLR254c 1
48 YPL066w 1 102 YAR007c 2 198 YDL103c 2 266 YBR007c 3 66 YOR273c 1
49 YBR052c 1 103 YNL312w 2 200 YNL303w 2 268 YCRX04w 3 67 YLR015w 1
50 YPL158c 1 104 YJL173c 2 201 YNL310c 2 270 YEL018w 3 268 YCRX04w 3
51 YHR022c 1 105 YER070w 2 202 YOR144c 2 271 YER019w 3 279 YIL106w 4
52 YPL004c 1 106 YOR074c 2 203 YDR113c 3 276 YLL061w 3 282 YCL014w 4
53 YBR157c 1 107 YDL164c 2 206 YHR172w 3 278 YIL050w 4 287 YBR038w 4
54 YNL078w 1 108 YKL045w 2 209 YGR140w 3 280 YBL097w 4 290 YGL255w 4
55 YOR066w 1 109 YBR252w/ 2 211 YDR488c 3 281 YKL049c 4 293 YOR274w 4
57 YBR053c 1 110 YLR032w 2 213 YOR026w 3 282 YCL014w 4 294 YBR104w 4
58 YDR511w 1 111 YML060w 2 215 YIL140w 3 283 YOR188w 4 297 YDL198c 4
60 YDR033w 1 112 YDR097c 2 216 YKL067w 3 284 YJR076c 4 299 YDR389w 4
61 YKL163w 1 113 YNL082w 2 217 YJR006w 3 285 YJL099w 4 311 YIL158w 4
62 YBR231c 1 114 YOL090w 2 219 YBL002w/ 3 286 YKL048c 4 316 YBR043c 4
63 YDR368w 1 115 YLR383w 2 221 YJL115w 3 288 YJL092w 4 317 YCL012w 4
64 YLR050c 1 116 YKL113c 2 222 YDL197c 3 289 YCR084c 4 318 YCL013w 4
65 YLR049c 1 117 YLR234w 2 223 YCR065w 3 290 YGL255w 4 319 YCL062w 4
66 YOR273c 1 118 YPL153c 2 226 YBR275c 3 292 YJL137c 4 320 YCL063w 4
90 YLR286c 2 119 YDL101c 2 228 YFR037c 3 294 YBR104w 4 322 YDR324C 4
120 YHR038W 2 121 YML021C 2 229 YKL127W/ 3 295 YJR112w 4 330 YGR108w 5
126 YGL089c 2 122 YML061c 2 230 YAR008w 3 296 YCR073c 4 331 YPR119w 5
127 YPL187w 2 123 YMR179w 2 231 YER001w 3 297 YDL198c 4 333 YGL116w 5
129 YNL173c 2 124 YML027w 2 232 YER003c 3 298 YLL046c 4 334 YOL069w 5
133 YJR148w 2 125 YPL127c 2 233 YDL093w 3 299 YDR389w 4 335 YGR092w 5
134 YOR317w 2 128 YDL227c 2 236 YDL095w 3 300 YDR464w 4 336 YBR138c 5
143 YPL057C/ 2 130 YLL021w 2 238 YER118c 3 301 YKL068w 4 337 YOR058c 5
148 YOR316C 2 131 YLR382c 2 241 YER016w 3 302 YIL131c 4 338 YHR023w 5
176 YLR349w 2 132 YJL196c 2 243 YOL019w 3 303 YDR451c 4 339 YPL242C 5
178 YDR309c 2 135 YKL165c 2 244 YLR151c 3 304 YCR085w 4 340 YJR092w 5
180 YJL078c 2 136 YLL002w 2 245 YOR284w 3 305 YMR003w 4 341 YLR353w 5
181 YKL161c 2 137 YPL124w 2 246 YMR048w 3 306 YNR009w 4 342 YCL037C 5
187 YHR113w 2 138 YKL101w 2 251 YJR155w 3 307 YOR073w 4 343 YMR001c 5
188 YDL119c 2 139 YLR457c 2 252 YLR126c 3 308 YMR215w 4 344 YGL021w 5
189 YDL124w 2 140 YDR297w 2 263 YNL304w 3 309 YLL047w 4 345 YCR042c 5
190 YHR039c 2 141 YBR070c 2 264 YGR189C 3 310 YDR366c 4 346 YOR025w 5

Table ‎B.4: (2) 156 YLR236c 2 234 YIR017c 3 333 YGL116w 5 293 YOR274w 4
Late G1 157 YCL022c 2 235 YKR001c 3 334 YOL069w 5 294 YBR104w 4
Difference 145 Genes 158 YCL024w 2 239 YJR137c 3 335 YGR092w 5 297 YDL198c 4
Thresholding 12 YML109w 1 159 YLR121c 2 242 YCR035c 3 336 YBR138c 5 298 YLL046c 4
Binarization 30 YNR016c 1 160 YHR154w 2 247 YKR010c 3 337 YOR058c 5 299 YDR389w 4
33 YGL055w 1 161 YHR110w 2 248 YHR061c 3 338 YHR023w 5 309 YLL047w 4
Results ( 68 YGR109c 2 162 YDL010w 2 249 YEL017w 3 339 YPL242C 5 316 YBR043c 4
) 69 YPR120c 2 163 YPR174c 2 250 YLL062c 3 340 YJR092w 5 318 YCL013w 4
70 YDL127w 2 164 YJL181w 2 253 YJL118w 3 341 YLR353w 5 322 YDR324C 4
No Name C 71 YNL289w 2 165 YLR183c 2 254 YDR179c 3 343 YMR001c 5 342 YCL037C 5
72 YPL256c 2 166 YOL007c 2 255 YDR219c 3 344 YGL021w 5 368 YPL186c 5
(1)
Early G1 73 YMR199w 2 167 YIL026c 2 256 YNL176c 3 345 YCR042c 5
67 Genes 74 YJL187c 2 168 YJR043c 2 257 YER018c 3 346 YOR025w 5
1 YDL179w 1 75 YDL003w 2 169 YKL108w 2 258 YBR156c 3 347 YOR229w 5
76 YMR076c 2 170 YLR326w 2 259 YLR455w 3 348 YDR146c 5
2 YLR079w 1
77 YKL042w 2 171 YLR381w 2 266 YBR007c 3 349 YLR131c 5
5 YJL194w 1
78 YFL008w 2 172 YNL273w 2 270 YEL018w 3 350 YNL053w 5
6 YLR274w 1
11 YGR044c 1 79 YPL241c 2 174 YJL018w 2 276 YLL061w 3 351 YKL130C 5
80 YMR078c 2 175 YBR089w 2 278 YIL050w 4 352 YIL162w 5
13 YJL157c 1
81 YLR212c 2 177 YCL061c 2 280 YBL097w 4 353 YDL138W 5
14 YKL185w 1
83 YPL209c 2 179 YKR013w 2 281 YKL049c 4 354 YDL048c 5
15 YHR005c 1
16 YNR001c 1 84 YJL074c 2 182 YDL018c 2 283 YOR188w 4 355 YGR143w 5
85 YNL233w 2 183 YKR083c 2 284 YJR076c 4 356 YKL129c 5
17 YKL150w 1
86 YLR313c 2 184 YDR383c 2 285 YJL099w 4 357 YIL167w 5
18 YLR395c 1
88 YGR152c 2 185 YDR013w 2 286 YKL048c 4 358 YHR152w 5
19 YOR065w 1
89 YDR507c 2 186 YGR238c 2 288 YJL092w 4 360 YJL079c 5
20 YDL181w 1
91 YIL159w 2 192 YNL309w 2 289 YCR084c 4 361 YAR018c 5
21 YGR183c 1
92 YGL027c 2 193 YPL208w 2 292 YJL137c 4 362 YAL022c 5
22 YLR258w 1
93 YML102w 2 194 YLR376c 2 295 YJR112w 4 363 YGR279c 5
23 YML110c 1
24 YLR273c 1 94 YPR018w 2 195 YNL300w 2 296 YCR073c 4 364 YGL201c 5
95 YBL035c 2 196 YAR003W 2 300 YDR464w 4 365 YNL057w 5
25 YCR005c 1
26 YCL040w 1 96 YNL102w 2 197 YCL060c 2 301 YKL068w 4 366 YML119w 5
97 YBR278w 2 198 YDL103c 2 302 YIL131c 4 367 YOL137w 5
27 YMR256c 1
28 YIL009w 1 98 YPR175w 2 200 YNL303w 2 303 YDR451c 4 369 YCL038c 5
99 YNL262w 2 201 YNL310c 2 304 YCR085w 4 370 YML033w 5
29 YLL040c 1
34 YGR281W 1 100 YBR088c 2 202 YOR144c 2 305 YMR003w 4 371 YML034w 5
101 YLR103c 2 206 YHR172w 3 306 YNR009w 4 372 YKR021w 5
35 YBR083w 1
36 YBR054w 1 102 YAR007c 2 211 YDR488c 3 307 YOR073w 4 373 YPR157w 5
103 YNL312w 2 213 YOR026w 3 308 YMR215w 4 374 YJL051w 5
37 YKL116c 1
38 YPR002w 1 104 YJL173c 2 215 YIL140w 3 310 YDR366c 4 375 YOL014w 5
105 YER070w 2 217 YJR006w 3 312 YGL101w 4 376 YOR315w 5
39 YNR067c 1
40 YBR158w 1 106 YOR074c 2 221 YJL115w 3 313 YJR110w 4 377 YGR230w 5
107 YDL164c 2 222 YDL197c 3 314 YOL012c 4 378 YLR190w 5
41 YDL117w 1
108 YKL045w 2 223 YCR065w 3 315 YBL032w 4 379 YMR032w 5
44 YMR007w 1
109 YBR252w/ 2 226 YBR275c 3 319 YCL062w 4 380 YOL070c 5
46 YNL046w 1
110 YLR032w 2 229 YKL127W/ 3 320 YCL063w 4 381 YLR297W 5
47 YOR264w 1
111 YML060w 2 230 YAR008w 3 321 YCR086w 4 382 YHL028W 5
48 YPL066w 1
112 YDR097c 2 231 YER001w 3 323 YDR325W 4 383 YHR151C 5
49 YBR052c 1
113 YNL082w 2 232 YER003c 3 324 YIL169C 4 384 YNL058C 5
50 YPL158c 1
114 YOL090w 2 233 YDL093w 3 325 YKL052C 4 (6)
51 YHR022c 1
52 YPL004c 1 115 YLR383w 2 236 YDL095w 3 326 YKL053W 4 Unassigned
116 YKL113c 2 238 YER118c 3 327 YKL069W 4 40 Genes
53 YBR157c 1
117 YLR234w 2 241 YER016w 3 328 YPL116W 4 3 YER111c 1
54 YNL078w 1
118 YPL153c 2 244 YLR151c 3 329 YPL264C 4 8 YPR019w 1
55 YOR066w 1
57 YBR053c 1 119 YDL101c 2 245 YOR284w 3 359 YPR167C 5 31 YBR067c 1
121 YML021C 2 246 YMR048w 3 (4) 64 YLR050c 1
58 YDR511w 1
122 YML061c 2 251 YJR155w 3 Early G1 + G2 / M 67 YLR015w 1
60 YDR033w 1
123 YMR179w 2 252 YLR126c 3 0 Genes 82 YNL225c 2
61 YKL163w 1
124 YML027w 2 263 YNL304w 3 87 YGR041w 2
62 YBR231c 1
128 YDL227c 2 264 YGR189C 3 125 YPL127c 2
63 YDR368w 1
130 YLL021w 2 265 YAL034W-a 3 140 YDR297w 2
65 YLR049c 1
131 YLR382c 2 267 YBR276c 3 (5) 143 YPL057C 2
90 YLR286c 2
132 YJL196c 2 269 YDL096c 3 M 173 YPL014w 2
120 YHR038W 2
135 YKL165c 2 272 YFR026C 3 66 Genes 176 YLR349w 2
126 YGL089c 2
136 YLL002w 2 273 YFR027W 3 4 YBR200w 1 178 YDR309c 2
127 YPL187w 2
137 YPL124w 2 274 YFR038w 3 7 YBR202w 1 189 YDL124w 2
129 YNL173c 2
138 YKL101w 2 277 YNL072W 3 9 YBL023c 1 203 YDR113c 3
133 YJR148w 2
139 YLR457c 2 (3) 10 YEL032w 1 209 YGR140w 3
134 YOR317w 2
141 YBR070c 2 S / G2 32 YPL058c 1 216 YKL067w 3
148 YOR316C 2
142 YLR233c 2 66 Genes 42 YGR035c 1 218 YBL003c/ 3
180 YJL078c 2
144 YBR073w 2 204 YNL126w 3 43 YHL026c 1 219 YBL002w/ 3
181 YKL161c 2
187 YHR113w 2 145 YGL200c 2 205 YEL061c 3 45 YMR254c 1 228 YFR037c 3
146 YHR153c 2 207 YPR141c 3 56 YMR031c 1 243 YOL019w 3
188 YDL119c 2
147 YNL272C 2 208 YLR045c 3 59 YLR254c 1 260 YBR184w 3
190 YHR039c 2
149 YKR077w 2 210 YBL063w 3 66 YOR273c 1 261 YLR228c 3
191 YDR493w 2
150 YDL105w 2 212 YMR198w 3 279 YIL106w 4 262 YDR252w 3
199 YGL028c 2
151 YPL267w 2 214 YDR356w 3 287 YBR038w 4 268 YCRX04w 3
237 YNL073w 3
152 YDL156w 2 220 YMR190c 3 311 YIL158w 4 271 YER019w 3
240 YER017c 3
153 YOL017w 2 224 YPL016w 3 317 YCL012w 4 282 YCL014w 4
275 YKL066W 3
154 YBR071w 2 225 YBL052c 3 330 YGR108w 5 290 YGL255w 4
332 YAL040c 5
155 YLL022c 2 227 YIL126w 3 331 YPR119w 5 291 YLR014c 4

Table ‎B.5: Top 237 YNL073w 3 145 YGL200c 2 274 YFR038w 3 322 YDR324C 4 366 YML119w 5
240 YER017c 3 146 YHR153c 2 277 YNL072W 3 323 YDR325W 4 367 YOL137w 5
Thresholding 275 YKL066W 3 147 YNL272C 2 324 YIL169C 4 369 YCL038c 5
(3)
Binarization 332 YAL040c 5 149 YKR077w 2 S / G2 325 YKL052C 4 370 YML033w 5
Results ( (2) 150 YDL105w 2 86 Genes 326 YKL053W 4 371 YML034w 5
Late G1 151 YPL267w 2 125 YPL127c 2 327 YKL069W 4 372 YKR021w 5
) 157 Genes 152 YDL156w 2 203 YDR113c 3 328 YPL116W 4 373 YPR157w 5
3 YER111c 1 153 YOL017w 2 204 YNL126w 3 329 YPL264C 4 374 YJL051w 5
No Name C 12 YML109w 1 154 YBR071w 2 205 YEL061c 3 342 YCL037C 5 375 YOL014w 5
(1) 30 YNR016c 1 155 YLL022c 2 207 YPR141c 3 359 YPR167C 5 376 YOR315w 5
Early G1 33 YGL055w 1 156 YLR236c 2 208 YLR045c 3 (4) 377 YGR230w 5
73 Genes 68 YGR109c 2 157 YCL022c 2 209 YGR140w 3 Early G1 + G2 / M 378 YLR190w 5
1 YDL179w 1 69 YPR120c 2 158 YCL024w 2 210 YBL063w 3 8 Genes 379 YMR032w 5
2 YLR079w 1 70 YDL127w 2 159 YLR121c 2 212 YMR198w 3 31 YBR067c 1 380 YOL070c 5
3 YER111c 1 71 YNL289w 2 160 YHR154w 2 214 YDR356w 3 67 YLR015w 1 381 YLR297W 5
5 YJL194w 1 72 YPL256c 2 161 YHR110w 2 218 YBL003c/ 3 82 YNL225c 2 382 YHL028W 5
6 YLR274w 1 73 YMR199w 2 162 YDL010w 2 219 YBL002w/ 3 291 YLR014c 4 383 YHR151C 5
11 YGR044c 1 74 YJL187c 2 163 YPR174c 2 220 YMR190c 3 293 YOR274w 4 384 YNL058C 5
13 YJL157c 1 75 YDL003w 2 164 YJL181w 2 224 YPL016w 3 298 YLL046c 4 (6)
14 YKL185w 1 76 YMR076c 2 165 YLR183c 2 225 YBL052c 3 309 YLL047w 4 Unassigned
15 YHR005c 1 77 YKL042w 2 166 YOL007c 2 227 YIL126w 3 368 YPL186c 5 0 Genes
16 YNR001c 1 78 YFL008w 2 167 YIL026c 2 228 YFR037c 3 (5)
17 YKL150w 1 79 YPL241c 2 168 YJR043c 2 234 YIR017c 3 M
18 YLR395c 1 80 YMR078c 2 169 YKL108w 2 235 YKR001c 3 75 Genes
19 YOR065w 1 81 YLR212c 2 170 YLR326w 2 239 YJR137c 3 4 YBR200w 1
20 YDL181w 1 82 YNL225c 2 171 YLR381w 2 242 YCR035c 3 7 YBR202w 1
21 YGR183c 1 83 YPL209c 2 172 YNL273w 2 247 YKR010c 3 8 YPR019w 1
22 YLR258w 1 84 YJL074c 2 173 YPL014w 2 248 YHR061c 3 9 YBL023c 1
23 YML110c 1 85 YNL233w 2 174 YJL018w 2 249 YEL017w 3 10 YEL032w 1
24 YLR273c 1 86 YLR313c 2 175 YBR089w 2 250 YLL062c 3 32 YPL058c 1
25 YCR005c 1 87 YGR041w 2 177 YCL061c 2 253 YJL118w 3 42 YGR035c 1
26 YCL040w 1 88 YGR152c 2 179 YKR013w 2 254 YDR179c 3 43 YHL026c 1
27 YMR256c 1 89 YDR507c 2 182 YDL018c 2 255 YDR219c 3 45 YMR254c 1
28 YIL009w 1 91 YIL159w 2 183 YKR083c 2 256 YNL176c 3 56 YMR031c 1
29 YLL040c 1 92 YGL027c 2 184 YDR383c 2 257 YER018c 3 59 YLR254c 1
34 YGR281W 1 93 YML102w 2 185 YDR013w 2 258 YBR156c 3 66 YOR273c 1
35 YBR083w 1 94 YPR018w 2 186 YGR238c 2 259 YLR455w 3 67 YLR015w 1
36 YBR054w 1 95 YBL035c 2 192 YNL309w 2 260 YBR184w 3 279 YIL106w 4
37 YKL116c 1 96 YNL102w 2 193 YPL208w 2 261 YLR228c 3 282 YCL014w 4
38 YPR002w 1 97 YBR278w 2 194 YLR376c 2 262 YDR252w 3 287 YBR038w 4
39 YNR067c 1 98 YPR175w 2 195 YNL300w 2 266 YBR007c 3 290 YGL255w 4
40 YBR158w 1 99 YNL262w 2 196 YAR003W 2 268 YCRX04w 3 293 YOR274w 4
41 YDL117w 1 100 YBR088c 2 197 YCL060c 2 270 YEL018w 3 294 YBR104w 4
44 YMR007w 1 101 YLR103c 2 198 YDL103c 2 276 YLL061w 3 311 YIL158w 4
46 YNL046w 1 102 YAR007c 2 200 YNL303w 2 278 YIL050w 4 316 YBR043c 4
47 YOR264w 1 103 YNL312w 2 201 YNL310c 2 280 YBL097w 4 317 YCL012w 4
48 YPL066w 1 104 YJL173c 2 202 YOR144c 2 281 YKL049c 4 318 YCL013w 4
49 YBR052c 1 105 YER070w 2 203 YDR113c 3 282 YCL014w 4 322 YDR324C 4
50 YPL158c 1 106 YOR074c 2 206 YHR172w 3 283 YOR188w 4 330 YGR108w 5
51 YHR022c 1 107 YDL164c 2 209 YGR140w 3 284 YJR076c 4 331 YPR119w 5
52 YPL004c 1 108 YKL045w 2 211 YDR488c 3 285 YJL099w 4 333 YGL116w 5
53 YBR157c 1 109 YBR252w/ 2 213 YOR026w 3 286 YKL048c 4 334 YOL069w 5
54 YNL078w 1 110 YLR032w 2 215 YIL140w 3 288 YJL092w 4 335 YGR092w 5
55 YOR066w 1 111 YML060w 2 216 YKL067w 3 289 YCR084c 4 336 YBR138c 5
57 YBR053c 1 112 YDR097c 2 217 YJR006w 3 290 YGL255w 4 337 YOR058c 5
58 YDR511w 1 113 YNL082w 2 221 YJL115w 3 292 YJL137c 4 338 YHR023w 5
60 YDR033w 1 114 YOL090w 2 222 YDL197c 3 294 YBR104w 4 339 YPL242C 5
61 YKL163w 1 115 YLR383w 2 223 YCR065w 3 295 YJR112w 4 340 YJR092w 5
62 YBR231c 1 116 YKL113c 2 226 YBR275c 3 296 YCR073c 4 341 YLR353w 5
63 YDR368w 1 117 YLR234w 2 229 YKL127W/ 3 297 YDL198c 4 343 YMR001c 5
64 YLR050c 1 118 YPL153c 2 230 YAR008w 3 298 YLL046c 4 344 YGL021w 5
65 YLR049c 1 119 YDL101c 2 231 YER001w 3 299 YDR389w 4 345 YCR042c 5
90 YLR286c 2 121 YML021C 2 232 YER003c 3 300 YDR464w 4 346 YOR025w 5
120 YHR038W 2 122 YML061c 2 233 YDL093w 3 301 YKL068w 4 347 YOR229w 5
126 YGL089c 2 123 YMR179w 2 236 YDL095w 3 302 YIL131c 4 348 YDR146c 5
127 YPL187w 2 124 YML027w 2 238 YER118c 3 303 YDR451c 4 349 YLR131c 5
129 YNL173c 2 125 YPL127c 2 241 YER016w 3 304 YCR085w 4 350 YNL053w 5
133 YJR148w 2 128 YDL227c 2 243 YOL019w 3 305 YMR003w 4 351 YKL130C 5
134 YOR317w 2 130 YLL021w 2 244 YLR151c 3 306 YNR009w 4 352 YIL162w 5
143 YPL057C 2 131 YLR382c 2 245 YOR284w 3 307 YOR073w 4 353 YDL138W 5
148 YOR316C 2 132 YJL196c 2 246 YMR048w 3 308 YMR215w 4 354 YDL048c 5
176 YLR349w 2 135 YKL165c 2 251 YJR155w 3 309 YLL047w 4 355 YGR143w 5
178 YDR309c 2 136 YLL002w 2 252 YLR126c 3 310 YDR366c 4 356 YKL129c 5
180 YJL078c 2 137 YPL124w 2 263 YNL304w 3 312 YGL101w 4 357 YIL167w 5
181 YKL161c 2 138 YKL101w 2 264 YGR189C 3 313 YJR110w 4 358 YHR152w 5
187 YHR113w 2 139 YLR457c 2 265 YAL034W-a 3 314 YOL012c 4 360 YJL079c 5
188 YDL119c 2 140 YDR297w 2 267 YBR276c 3 315 YBL032w 4 361 YAR018c 5
189 YDL124w 2 141 YBR070c 2 269 YDL096c 3 318 YCL013w 4 362 YAL022c 5
190 YHR039c 2 142 YLR233c 2 271 YER019w 3 319 YCL062w 4 363 YGR279c 5
191 YDR493w 2 143 YPL057C 2 272 YFR026C 3 320 YCL063w 4 364 YGL201c 5
199 YGL028c 2 144 YBR073w 2 273 YFR027W 3 321 YCR086w 4 365 YNL057w 5
