Sandeep Kumar BEng (Hons) MEng
Address : Telephone: C-317, Hari Marg, Malviya Nagar, Jaipur-302017, Rajasthan, INDIA +91-141-2521648 Mobile: +91-9660006262 Email :

 


3.4 YEARS

Ongoing         E-Com Systems Inc.
Nov. 2010       Cancer Research UK (Home Fundraising), UK.
Oct. 2007-10    AECOM, Graduate Consultant, Traffic Technology, UK.
Aug. 2007       Primark, Retail Operative, UK.
Jun-Oct 2006    Seedling Academy of Design, Technology & Management, Lecturer in Electronics and Electrical Engineering Department, India.
Aug-Dec 2004    Sterling Telecom & Net Systems Ltd., Information Communication Technology Engineer, India.
Aug. 2004       vCustomer, Technical Support Executive, India.
Jul. 2003       Best Solutions Ltd., Games Developer, India.


Percentile in Behavioral Traits at NIIT National IT Aptitude Test, India. E10B Exchange, BSNL.

SOFTWARE SKILLS: MapInfo, AutoCAD, Matlab 7.1, Java, C++, C, HTML, 8085 & 8086 ASM, Visual Basic, Microsoft Office, Windows Batch file, Flash 5.0/MX, Basic, Adobe Photoshop & Illustrator, Internet.

OTHER QUALIFICATIONS:
 

2007-10 2009-19

2008-10 SEMINARS:

Member of The IET. Full Clean UK Driving License (Valid until 24/03/19) & Pass Plus Certificate. Motorway Pass (Highways Agency).

EDUCATION: 2003 E10B Exchange and Wireless Phones, India.

Jan 2008 Aug. 2004

 

May 2000 June 1998

M.Eng in Data Communications, The University of Sheffield (60.23 %)
B.Eng in Electronics and Comm., The University of Rajasthan (Honors 75.25%)
HSC - Science, CBSE (68.4%)
SSC - CBSE (69.2%)


 

2010 2010 2010 2010 2009

 


2008 2008 2008 2007 2004 2004

 

  

MySQL & PHP Course (Ongoing) City & Guilds Foundation Course in Electrotechnical Technology EAL Level 3 VRQ Certificate in Refrigeration Maintenance CITB F-Gas Safe Handling Transport Planning Society Certificate in Traffic Engineering & Planning ECS Health & Safety Training (JIB for the Electrical Contracting Industry) Motorway Awareness & Safety Training, Mouchel. MapInfo, AECOM. Introduction to System Design, AECOM. Mesh4G, AECOM. UMTS, BSNL. High Aptitude for IT with 81

Lowestoft UTMC Design & Implementation: Produced CAD of lamp columns, MESH4G, power cable at feeder pillars & lamp columns. Determined power consumption of the Urban Traffic and Management Control (UTMC) system. Wrote UTMC report, Design and Check certificates and equipment list. Operator of Variable Message Signs (VMS), CUTLAS UTMC system on behalf of Suffolk County Council.
CENTRO Framework: - Performed site audits of the Real Time Passenger Information (RTPI) signs & bus shelters in Birmingham & Coventry on behalf of CENTRO.
ITS Strategy for Suffolk: - Prepared strategy report in line with 'Local Transport Plan' and 'Suffolk Bus Strategy'.
HA Data Fusion: - Gathered road data for modeling input.
Walsall TRO: - Prepared CAD of road networks using ParkMap and MapInfo.
ITS Radar Newswire: - Wrote Highways Agency ITS Report.

( dar_International_ReportITS_World_Congress_2009_Final.pdf)
Suffolk CATS: - Cycle Activated Traffic Warning Sign.
North West TechMac: - Performed Statutory Enquiry Checks. - Assisted in writing technical notes for power cable specification of the ANPR and Ramp Metering equipment using AMTECH software.
Oakham Road, Tividale Pedestrian Crossing: - Prepared CAD of pedestrian crossing facility using AutoTrack. - Technical, safety and regulatory parameters report.
Hertfordshire LTP: - Website:
TataIndicom BTS: - Supervised commissioning of mobile communication tower, cabins for communication equipment, diesel generator set, civil & electrical work.
Jaipur National University: - Lecturer in Electronics & Electrical Engineering. Taught Circuit Analysis & Communication Theory as core subjects and supervised Electronics Workshop Lab.

ACADEMIC

   

DTMF Remote Control. 3GPP Cryptographic Algorithms. Cell phone signal detector. Hierarchical Clustering of Microarray Data.

 

2003 2002 2002 2002 2000 1999 1999 1998 1996 1991

    

Third at National Level Dance, India. Bronze medal - Swimming, Rajasthan University, India. First at College in Annual Day Sports, India. Certificate College Cultural Activities, India. CBSE Excellence Award in Physical Education, 94%, India. Rajasthan Colts Cricket Team, UK. National Level U-16&19 Cricket, India. Gold Medalist - District Karate, India. Silver Medal - District Gymnastics, India. Second at National Art, India.

 

Hindi: Mother tongue English: Fluent (IELTS: 6.5)

REFERENCES Furnished upon request.


by Sandeep Kumar Bairwa

Under the supervision of Dr Guido Sanguinetti

The University of Sheffield Department of Electronic and Electrical Engineering (2006-07)

This dissertation is a part requirement for the degree of M.Sc. in Data Communications

The advent of microarrays has opened up unprecedented pathways in the research field of Bioinformatics. With the possibility of studying thousands of genes at the same time, automated learning tools are being developed rigorously. Biologists use the gene expression data from microarrays to reveal biologically similar groups of genes that are meaningful and informative. Many standard techniques have been developed to cluster similar genes together, but finding optimal clusters and their number from a given data set still remains an unsolved problem. One of the most prominent techniques employed in this regard is hierarchical clustering. In this thesis we begin with a gentle introduction to microarrays and the associated data analysis techniques. We then present two novel methods that follow the agglomerative hierarchical clustering approach to separate overlapping and closely spaced clusters. The proposed methods are compared with the clustering results of the Bioinformatics toolbox provided by MathWorks in the 'Matlab 7.1' software on a case of overlapping clusters. The proposed methods give correct results in this case, which shows their advantage for such data. The methods are based on Gaussian mixture assumptions. Cluster assignments are done using the empirical covariance and a Gaussian Bayes classifier. We present test results on synthetic as well as real data sets. While clustering the real data, a pruning decision is also made.


Acknowledgements

I am extremely thankful to my supervisor, Dr. Guido Sanguinetti. Without his guidance it would not have been possible for me to keep my work going in the right direction, meet the deadlines in time and gain numerous skills which I would not have otherwise. I want to thank Dr. Peter Rockett for the extended guidance he offered me for the write-up work. I want to thank the technical and administrative staff of the Electronics and Computer Science Departments of my University for their help and assistance throughout the preparation of this thesis, because of whom I have been able to raise the standards of my thesis. I am thankful to my family members for their moral support. Finally I want to thank the University for allowing me to be a part of such a wonderful institution.

Contents

1. Introduction
2. Microarrays : A biological background
   Microarray Technology
   Types of microarrays
   Normalization
   Applications of microarrays
3. Data Analysis
   Analysis Methodology
   Clustering
   Clustering the microarray data
4. Hierarchical clustering
   Hierarchical Clustering Techniques
   Advantages of hierarchical clustering
   Problems with hierarchical clustering
5. Practical Work
   A) Hardware & Software Employed
   B) Novel method for Hierarchical Clustering using Empirical Covariance
      Overview of the approach
      Critical Features of Implementation
      Algorithmic Description of Matlab implementation
      Results
   C) Novel method for Hierarchical Clustering using stereographic projection
      Overview of the approach
      Critical Features of Implementation
      Method used
      Algorithmic description of the Matlab Implementation
      Results
   D) Tests on Real Data
      About the real data set
      Verification of our result
      Critical features of the Matlab code & the pruning decision
   E) A naive method developed at the outset of project work
      Brief Description
      Test Results
6. Conclusion and Evaluation
7. Future Work Areas
8. Diary of major milestones achieved
9. References
10. Abbreviations

Declaration

I certify that all sentences, passages and figures/diagrams quoted in this thesis from other people's work have been specifically acknowledged by clear cross-referencing to the author(s), work and page(s). Furthermore I have read and understood the definition of Unfair Means for assessed work produced in the M.Sc. Data Communications Student Handbook (page 18/19) and have complied with its requirements. I understand that failure to comply with the above amounts to plagiarism and will be considered grounds for failure in this thesis and the degree examination as a whole.

Name : Sandeep Kumar Bairwa
Signature :
Date :

1. Introduction

Microarray technology is amongst the busiest fields in Bioinformatics. Many research techniques are being developed and are currently in the developing stage, but numerous questions about living organisms still remain unanswered. This has led to the birth of an interdisciplinary field named Bioinformatics [Lim, 2003], also known as computational biology. Here various techniques of statistics, computer science, mathematics and physics are employed for enhanced learning in biology. Massive amounts of data can now be handled easily and in good time by employing machines programmed with automated learning procedures. This branch of science is well known as Machine Learning. Another relevant field is data mining, which is specifically related to the study of large data sets.

Microarrays have contributed immensely to reducing the time taken by experiments to complete as well as to the information gained. It is now possible to analyse tens of thousands of genes together, as compared to earlier times when only a single gene could be studied at a time, thus immensely boosting the quality and standard of research and analysis in the field. With the provision of simultaneous study of huge amounts of genetic data as well as the analysing infrastructure, the sole remaining problem lies in the research technique adopted to obtain correct results.

Biologists use the gene expression data from microarrays to reveal biologically similar groups of genes that are meaningful and informative. Many standard techniques have been developed to cluster similar genes together so as to reveal information about unknown genes. Particularly relevant to this thesis, finding optimal groups and their number from a given data set remains a tricky issue. One of the most prominent techniques employed in this regard is hierarchical clustering, which groups similar genes together and provides a simple and manageable visual aid. Various aspects of this clustering method have attracted researchers to use it. Its visual results showing the hierarchy in the data set and the choice of a variable depth of clusters are some of its useful features [Eisen, 1998]. Some problems remain in its applicability, such as deciding how many clusters are present in a data set. Many advanced clustering techniques have been developed too, some examples of which are bi-clustering and fuzzy clustering [Troyanskaya, 2005]. Still, numerous papers currently employ the hierarchical clustering technique to learn from gene expression data.

In this project we have addressed the current problem of determining correct clusters from a data set that comprises closely spaced and overlapping clusters. Further, noise has been added to the synthetic data sets in order to test the robustness of the proposed methods under varying conditions.

This report is organised so that it begins with a gentle introduction to the field, leading on to the current analysis issues. First an essential introduction to the biology behind microarray technology is outlined, followed by the associated microarray features, issues and the data extraction process. In the data analysis section the basic steps in the analysis of gene expression data are elaborated with a simple example. Clustering is then defined, with particular focus on the hierarchical clustering technique and its types. In the following section all the practical work done is presented. Each stage of the practical implementation is followed by its respective results, ending with a discussion of all stages together and an elaboration of future work areas.

2. Microarrays : A biological background

A concise biological background to microarrays is presented below. The material was produced using the references [Brazma, 2001] & [Schena, 2003] :

Cells : All living organisms are composed of at least one cell or more. Cells are extremely small in size and have their own development system.

Molecules : Within cells are molecules that are in turn classified into :
1) Small Molecules
2) Proteins
3) DNA
4) RNA
These are a few nanometres in size and are viewed on an electron microscope. Small molecules either have independent roles or are engaged in the upkeep of proteins, DNA and RNA. Proteins act as the functional units of a cell. DNA houses the genetic instructions that are used in the development and functioning of all living organisms, of which genes are the functional units. Nucleotides are the elementary building blocks that make up DNA and RNA. RNA are similar to DNA, but differ in their structure. RNA is an intermediary in the conversion process resulting in protein formation. Further, RNA is categorised into mRNA, tRNA and rRNA.

Genes are composed of DNA and are engaged in the biological functioning of a cell. Each gene carries directive information for proteins. Proteins are the actual carriers of the instructions issued by genes. An organism's complete set of DNA, i.e. the complete hereditary information, is termed the genome. Results of the Human Genome Project admit that, with current technology, highly repetitive DNA sequences are difficult to sequence, and the sequenced information is not enough to understand the functional properties of genes.

Gene Expression is a conversion process resulting in functional proteins being obtained from the initial DNA level stage. There are various intricacies involved in the process itself and it is beyond the scope of this report to go into its biological details. The gene expression process is outlined as follows:

Genes -> (Transcription) -> mRNA -> (Translation) -> proteins

Transcription is the initial step, which synthesizes mRNA from DNA. The process begins at a site identified by a promoter, where the synthesis is started by the RNA polymerase unwinding the DNA helix. At completion the whole gene is transformed into RNA. This RNA detaches from the respective DNA and thus becomes independent. mRNA processing involves a chain of editing events that are named capping, polyadenylation and splicing. In these processes the mRNA is made stable, a continuous coding sequence is recovered and the efficiency of protein synthesis increases.

Translation is the process of deriving polypeptide chains from the mRNA after the transcription stage. Amino acids are obtained which result in a protein at the end. These proteins are used by the cells to perform their crucial functions.

An expression level is a scalar value indicating the cellular concentration of Messenger Ribonucleic Acid (mRNA) molecules. Scientists try to infer which genes are expressed, thus gaining valuable information as to how the cells respond to varying conditions and to their own needs. In other words, the dynamics of the cell, as to how it responds and how it orients itself in accordance with its needs, are revealed. The study of gene expression has become important since it has led to a greater understanding of diseases. For example, it was revealed recently that human cancer is related to changes in gene expression.

Microarray Technology :

The following text has been referenced from [Schena, 2003]. The general idea of the study is to explore all aspects of life in any living organism, which may involve its functioning, diseases, immunities, ageing, growth, survival needs, evolution etc. We are primarily concerned with studying genes, which determine the characteristics of a living organism.

There are many thousands of genes in a single cell. For instance, humans have about 30,000 genes [WWW1]. To study them simultaneously and quickly in an efficient manner, microarrays are employed. In microarray technology the entire genome is presented on a single glass chip known as a probe, which reveals the expression level for each gene. This has revolutionised the biotechnology world, which encompasses agriculture, medical, chemical and many more areas involving living organisms, by providing rapid and quantitative analysis.

In his book, Mark Schena, the father of microarray technology, describes the advent of his innovation as similar to that of microprocessors in the electronic communication field. Just as microprocessors have evolved in a short time offering high speed computational capabilities, microarrays have similarly made it possible to simultaneously study different genes, many thousands in number, packed close together on a single chip.

As the name suggests, a microarray is an array of very small size, but it holds an enormous amount of information. The array houses spots arranged in a matrix form with fixed distances between them. These spots are always arranged in a uniform pattern. The size of a single spot is 50-350 micrometre [Schena, 2003]. Just as an integrated circuit in electronics has a substrate as a base, a microarray has the chip as its substrate. Basically, in microarrays a continuous arrangement of fluorescent samples (prepared from RNA) affixed on a glass, silicon or plastic chip is studied, whose levels of intensity are directly proportional to their expression.

Figure 1. A microarray. Size <= 1 sq. inch, Spot size ~ 0.1 mm, Spots/chip ~ 60,000 [Brazma, 2001].

Analysis Steps : The basic steps involved in microarray analysis are as follows :

Samples (Cells, tumor samples etc.) -> (Extraction) -> mRNA -> (Conversion) -> cDNA -> (Labelling and Hybridisation) -> DNA microarray -> (Laser Scan) -> Scanned Image

These stages are depicted in the figure below :

Figure 2 [Renkwitz, 2000]

Out of the numerous ways to measure gene expression level, the most common way is to use two samples of a particular cell in two different states, healthy and diseased. Let the first sample, having healthy cells, be labelled with fluorescent green dye and the second one with red dye. Together they are washed over a microarray slide. Hybridization takes place on the surface of the slide, in which complementary parts tend to come together. When excited by a laser, the amount of a particular RNA determines the color of light given out by the spots. So a larger amount of RNA from sample one on a spot radiates green light, whereas for equal and nil (absence of) quantities, yellow and black are radiated respectively. (The meaning of the emitted light color varies with chip type. For instance, gray indicates missing data in some chips.)

Types of microarrays : Broadly they are divided into two categories :
1. One color microarrays, which employ oligonucleotides (short sequences of nucleotides). They are manufactured using numerous methods such as Photolithographic arrays (Affymetrix), Inkjet Arrays (Agilent), Maskless Arrays (Nimblegen) etc.

2. Two color or cDNA microarrays.

Problems associated within the analysis steps :
1. Different amounts of samples are present in the beginning, which produce different amounts of RNA.
2. The labeling dyes have different labeling efficiencies, thus rendering irregular brightness levels even if the RNA amount is the same in two samples.
3. Hybridization is in practice a non-uniform process, thereby resulting in one sample being hybridized better than the other.
4. The signal from the scanner cannot be kept constant either; marginal variations are always present in the signal.

These problems result in slide preparations having different brightness levels in the end, and at each stage a marginal error is induced during processing. Finally this amounts to an incorrect gene expression. To deal with this situation an assumption is made that the average gene expression level does not vary across the chips. Thus the expression levels of all genes on the chips fall within a common comparable range. This way of limiting the data within a bounded region is termed normalization [Guérette, 2001].

Applications of microarrays [WWW2] and [Schena, 2003] :
- Gene Discovery : Complex genetic diseases, expression levels of thousands of genes etc.
- Drug discovery (Pharmacogenomics) : Reasoning as to why some drugs suit a certain group of patients as compared to others, and also the amount of toxication inhibited on administration.
- Toxicological research (Toxicogenomics) : The relationships between toxic compounds and their responses after administration to the subjects under study.
- All human diseases can be studied by microarrays to find their treatments. Some examples of diseases being investigated currently on a large scale are mental illness, aging, stroke, hormonal imbalance, AIDS, Alzheimer, cancer etc.

Broadly, studies relating to genes are well assisted by microarrays and many new findings are constantly being made with this advancement in analysis technology. It is worth mentioning that the technology stretches to the genomic study of bacteria, viruses, plants, fruits and animals, revealing information that was unknown before.
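The normalization assumption described above (the average expression level does not vary across chips) can be illustrated with a minimal sketch. It assumes each chip is given as a list of log-transformed expression values and uses simple mean-centring; the function name `normalize_chip` and the sample values are hypothetical, and this is not the normalization procedure used in the thesis.

```python
from statistics import mean


def normalize_chip(log_values):
    """Centre one chip's log-expression values so that their mean is zero."""
    chip_mean = mean(log_values)
    return [value - chip_mean for value in log_values]


# Two hypothetical chips measuring the same three genes.
# chip_b is uniformly brighter (e.g. a scanner or dye offset).
chip_a = [0.5, -1.0, 2.0]
chip_b = [1.5, 0.0, 3.0]

norm_a = normalize_chip(chip_a)
norm_b = normalize_chip(chip_b)

# After centring, both chips share the same average level,
# so the per-gene values become directly comparable.
print(norm_a)
print(norm_b)
```

A constant brightness offset between chips disappears after centring, which is the sense in which the data fall "within a common comparable range".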

3. Data Analysis

The following text is referenced from [Schena, 2003]. TIFF/tagged format images are recovered from microarrays through scanning. Quantification is the process through which these images are converted into numerical values which represent the concentration levels of mRNA. The numerical values are categorised as signal, that is the value representing the actual microarray data, and a background value indicating the noise level. Noise is induced by inaccurate labelling of the probe and by the scanning process. Numerous segmentation procedures are employed to separate noise from the actual expression values. After segmentation the numerical values obtained are arranged in a regular and systematic manner, resulting in a matrix form of arrangement for the data originally obtained from the microarray slides. The data is saved in text file formats with adjacent numeric values separated by a comma or a space. Usually a space between data points is used. This type of text file is commonly known as a tab delimited file. These files are easily transferable to different spreadsheet programs.

Data transformation : The machine-read microarray data obtained from the scanning process is in the form of 16 bit values. Handling thousands of such numbers can be cumbersome. For convenient analysis this data is transformed into an equivalent form. The most common approach is to use log values. Thus a large number such as 100000 is represented by 5 using log to the base 10.

Data arrangement : Genes are arranged row wise and the columns represent the time variances of the experiment. For observations at times T1, T2, T3 ... TN and the genes G1, G2, G3 ... GM, an N dimensional space is chosen to plot each gene, with the columns as the dimensional coordinates and the rows as the indices. Researchers try to correlate these genes in N dimensional graphs. This requires biologists to employ statistical techniques to establish relationships within the vast population of gene data. Also, due to the massive amounts of data generated in genomic study, researchers have to resort to machines to assist with the repetitive calculations. For this, many computer scientists and statisticians have contributed to the development of novel and efficient methods of genetic data analysis. Many software packages are now widely available for preprocessing data, data mining and image analysis.
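The transformation and arrangement steps described above can be sketched as follows. This is only an illustration: the file contents, the tab separator and the function names `parse_tab_delimited` and `log10_transform` are assumptions made for the example, not part of the thesis code (which was written in Matlab).

```python
import math


def parse_tab_delimited(text):
    """Read a tab delimited block: one gene per row, one sample/time point per column."""
    rows = []
    for line in text.strip().splitlines():
        rows.append([float(field) for field in line.split("\t")])
    return rows


def log10_transform(matrix):
    """Replace every raw intensity by its base-10 logarithm."""
    return [[math.log10(value) for value in row] for row in matrix]


# Hypothetical raw intensities for two genes (rows) at two time points (columns).
raw_text = "100000\t1000\n10\t100\n"

raw = parse_tab_delimited(raw_text)
logged = log10_transform(raw)

# The first entry, 100000, becomes 5 as in the example in the text.
print(logged)
```

Each row of `logged` is then one gene expressed as a point in an N dimensional space, where N is the number of columns.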

Basically, for genetic data a similarity or dissimilarity relationship is established using various distance measurement formulae. The distances found are arranged in N dimensional matrices. These are termed proximity matrices, which are the measures of similarity between genes. A brief outline of the practical analysis procedure for genetic data is presented in the next section.

Analysis Methodology :

Expression data gained is analysed in the form of a matrix as shown in Figure 3. A huge amount of data, as many as 15000 genes [WWW3], may be presented in a single matrix for many samples. Each element in the sample columns is a log ratio of the testing sample to the reference sample.

Genes    Sample 1                  Sample 2
1        Numerical Data 1,1        Numerical Data 1,2
2        Numerical Data 2,1        Numerical Data 2,2
3        Numerical Data 3,1        Numerical Data 3,2
.        .                         .
.        .                         .
15000    Numerical Data 15000,1    Numerical Data 15000,2

Figure 3

The expression data is normalised for meaningful statistical analysis [Shannon, 2003]. Sometimes the data set provided suffers from missing values due to various reasons such as human errors, experimental errors or unavailability of information [Jain, 1988]. This affects the results in further analysis and some considerations might have to be made later in this regard.

In order to study genes it is required to group the data based on similarity. Parameters such as spread, orientation and direction in the genetic data are considered to group functionally similar genes [Eisen, 1998]. Grouping thus helps in revealing information about unknown genes with the help of known genes in the same group. As elaborated in [Bryan, 2004], unexplored genes can be studied with the help of genes related to them via clustering. This way many genes that are difficult to analyse individually can be studied with reference to other genes that are similar to them in functionality and regulatory operations.
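As a small illustration of how one element of the matrix in Figure 3 might be formed, the sketch below computes the log ratio of a testing intensity to a reference intensity and marks unusable measurements as missing. The base-10 logarithm, the use of `None` for missing values and the function name `log_ratio` are assumptions made for the example; the source does not specify them.

```python
import math


def log_ratio(test, reference):
    """Return log10(test / reference), or None when a value is missing or non-positive."""
    if test is None or reference is None or test <= 0 or reference <= 0:
        return None
    return math.log10(test / reference)


# Hypothetical intensities: (testing sample, reference sample) for three genes.
measurements = [(200.0, 20.0), (50.0, 50.0), (None, 75.0)]

column = [log_ratio(test, ref) for test, ref in measurements]

# Over-expressed gene, unchanged gene, gene with a missing value.
print(column)
```

Keeping missing entries explicit makes it possible to decide later how they should be treated, as the text notes.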

Clustering :

Basically, clustering is a process by which many individual items are brought together to form a single group of similar items. In this way a number of groups are produced, each containing items that are similar to each other but dissimilar to the items inside other groups. With regard to microarray data, clustering is the process in which genes are arranged in such a way that similar genes are brought together in one group, away from unlike genes. Grouping or clustering is done on the basis of similarity, specifically the correlation between the genes' behaviour.

General approach taken while clustering data :

a) Assimilation of test data : Gathering data for analysis is an important step before starting the clustering process. It should also be confirmed that the data gathered is well suited to the technological standards available to the researcher for analysis.

b) Pre-processing the data : In order to make the data suitable for coherent analysis, the data is conditioned. For example the data obtained is normalised as discussed before. This process may include transformations of the data into different domains.

c) Conversions into data models : For efficient analysis the data can be modelled as distributions with specific parameters. Many different distributions are employed, such as the Gaussian distribution, the Von-Mises Fisher [Banerjee, 2005] distribution on the sphere, the Kent (FB5) [Peel, 2001] distribution on the sphere etc.

d) Validation : On rare occasions it is checked whether clustering can be done or whether the expression data is thoroughly random. In the second case the researcher has the provision to choose other analysis techniques rather than clustering.

e) Clustering approach : At the outset, the researcher chooses a method for clustering. The type/format of the data decides which clustering methods can be employed on it. For instance it has to be decided whether hierarchical clustering will be more suitable or whether the partitional type would be more appropriate.

f) Verification : In order to check the correctness of the result after clustering, different techniques are employed. For example noise may be added to the data to check the variation in the result. Also multiple methods may be adopted and the results compared to check the correctness. Visual help can be taken, in which the data is projected into a different dimension. This is especially useful when the data is distributed in higher dimensions (>2). Non-linear projections [Jain, 1988] are employed more often than linear ones since they ensure an accurate transformation of the data into lower dimensions. Such approaches help in determining the correct clustering result.

Advantages of model based clustering :

- There has been extensive research in this field as compared to other approaches, so it is backed up by numerous proven results.
- Setting the data set into a model helps in determining the localisation intensity characteristics of the individual groupings.
- More realistic partitioning of the groups is possible with this approach. For example, if two Gaussian distributions are considered then the overlap does not force the tails to be rigidly cut off, since the distribution characteristics are maintained. Thus overlapping clusters can be effectively separated.

Clustering the microarray data :

Since microarrays provide massive amounts of data to be analysed, various automated data mining tools are being developed. Clustering is broadly divided into two types [Shannon, 2003] :
1) Supervised or Discrimination or Extrinsic
2) Unsupervised or Clustering or Intrinsic

In supervised classification two things are presented for analysis, the expressed measurement data and the characteristic of the sample as a priori information. Whereas in unsupervised learning, only the measurement data is available. As little is known about the gene expression for a particular state, the latter method is generally more suited.

Amongst the numerous types of clustering methodologies, some of the most noticeable statistical methods are hierarchical and k-means clustering. In the k-means method the number of cluster divisions is set initially to a desired value. The term k represents the finite number of groups defined in the beginning, before the assignment process begins. The only work to be done is to allocate the respective members of the groups based on the similarity index chosen by the researcher. Whereas in hierarchical clustering, successive clusters are derived in steps from the previous ones.

The gene expression data provided by microarrays can be clustered in two different ways :
1) Clustering the genes.
2) Clustering the time variances/samples.

Both approaches provide information, resulting in the grouping of similar genes in case 1) and the sectioning of similar groups in case 2). Biologists suggest that only some portion of the gene expression data is actively involved in the cellular processes and functioning. Thus it is advantageous to study only the part of the genes that is involved. In the second case of clustering, within the sample groups only some part of the gene groups is evaluated; this particular way of clustering is termed sub-space clustering [Jiang, 2004].
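The k-means procedure described above, in which the number of groups k is fixed in advance and genes are then allocated to the nearest group, can be sketched as below. This is a minimal illustration under stated assumptions: Euclidean distance is taken as the similarity index, the initial centres are supplied by hand, and the gene values (including the fourth gene) are hypothetical. It is not the method developed in the thesis, which follows the hierarchical approach.

```python
import math


def euclidean(p, q):
    """Straight line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(points, centres, iterations=10):
    """Alternate between assigning points to the nearest centre and updating centres."""
    centres = [list(c) for c in centres]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the group with the closest centre.
        labels = [
            min(range(len(centres)), key=lambda k: euclidean(p, centres[k]))
            for p in points
        ]
        # Update step: each centre moves to the mean of its current members.
        for k in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Four hypothetical genes measured in two samples; k = 2 groups.
genes = [(1.5, 1.2), (2.0, 1.0), (0.5, -2.5), (0.4, -2.2)]
labels, centres = k_means(genes, centres=[genes[0], genes[2]])

print(labels)
print(centres)
```

In contrast to this fixed-k allocation, a hierarchical method builds its clusters step by step from the previous ones and does not need k at the start.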

.T .. is given as : Pearson's Correlation coefficient : Also known as product. T mean (μ) = 1. Thus it is not advised to assign genes in a single group... 1998]... This suggests inaccuracy of partitional type clustering approaches [Turner. x 3. .... Mahalanobis Distance : Distance from a group of points having.4.moment coefficient of 17 . Unsupervised/clustering approach is generally suited to gene expression data. A distance matrix is simply the distances between a row with all other rows in the data set.. It is known that genes are not involved in a single process at all times rather may be involved partially in multiple processes.. Most of the clustering methods require a distance or proximity matrix to be worked out from the given data. n .. n-dimensional euclidean space distance between them. Hierarchical clustering Hierarchical clustering of microarray has become a common way to identify patterns in clustering due to its simplicity and visual clarity [Eisen. x n . 2. covariance matrix Σ data (x) =  x 1. 3. 2005].. Some commonly used distance measures are summarised as follows (used to find the similarity between genes) : Euclidean or Straight Line distance : Two points D & E in a n-dimensional euclidean space distance between them. x 2.

correlation. Let two variables, X and Y, be Gaussian. If the joint distribution of X and Y is bivariate normal, the correlation coefficient is given by :
$$\rho_{X,Y} = \frac{\mathrm{COV}(X, Y)}{\sigma_X \, \sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y. It gives a numerical estimate of how strongly the two variables are correlated. It can take values from -1 to 1, representing perfect negative (inverse) and perfect positive correlation respectively, whereas 0 means no correlation.

Generally, for normalised data the Euclidean distance measure is used. For a data set as shown in Figure 3 having 15000 genes, a distance matrix of order 15000 by 15000 will be generated. For example, consider 3 genes whose expression values are taken in two samples :

Genes      Gene1   Gene2   Gene3
Sample1    1.5     0.5     2
Sample2    1.2     -2.5    1

The distances are calculated as follows :
$$\mathrm{distance}(Gene1, Gene2) = \sqrt{(1.5 - 0.5)^2 + (1.2 - (-2.5))^2}$$
$$\mathrm{distance}(Gene1, Gene3) = \sqrt{(1.5 - 2)^2 + (1.2 - 1)^2}$$
$$\mathrm{distance}(Gene2, Gene3) = \sqrt{(0.5 - 2)^2 + (-2.5 - 1)^2}$$
Thus the corresponding distance matrix is :

           Gene1   Gene2   Gene3
Gene1      0       3.83    0.54
Gene2      3.83    0       3.81
Gene3      0.54    3.81    0

Some properties of the distance matrix worth noticing :
− Distances on the diagonal are all zeros.
− All distances are non-negative.
− It is always a symmetric matrix (i.e. distance(Gene1,Gene2) = distance(Gene2,Gene1)). Thus the upper half or the lower half is sufficient to reconstruct the complete matrix.
− The distances of interest are N(N-1)/2 in number for N genes.

Individual genes have now been associated with other genes in terms of distance. A group of points or genes lying close to each other is said to form a cluster. Figure 4 shows three well separated clusters.

Figure 4

In hierarchical clustering, similarity amongst the clusters can be determined by linkage methods such as Single Linkage or Nearest Neighbour, Complete Linkage or Furthest Neighbour, Average Linkage, etc. The single linkage algorithm computes the minimum distance available between two genes in two clusters, best shown pictorially in Figure 5. For a data set S, the minimum distance between two genes is found as :
$$d_{\min} = \min_{x, y \in S,\; x \neq y} d(x, y)$$
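The three-gene example and the matrix properties listed above can be checked with a few lines of code. The following is only an illustrative sketch in Python (the thesis implementation itself is in Matlab); the helper names `euclidean` and `distance_matrix` are hypothetical and are not part of any toolbox.

```python
import math

# Expression values of the three-gene example: (Sample1, Sample2) per gene.
genes = {
    "Gene1": (1.5, 1.2),
    "Gene2": (0.5, -2.5),
    "Gene3": (2.0, 1.0),
}

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(points):
    """Full symmetric matrix of pairwise Euclidean distances."""
    return [[euclidean(p, q) for q in points] for p in points]

names = list(genes)
D = distance_matrix([genes[name] for name in names])

# Print one row per gene, rounded to two decimals.
for name, row in zip(names, D):
    print(name, [round(d, 2) for d in row])

# Only N(N-1)/2 entries are distinct, because the matrix is symmetric
# and its diagonal is zero.
n = len(names)
n_pairs = n * (n - 1) // 2
print("distinct distances:", n_pairs)
```

Because the matrix is symmetric, storing only its upper (or lower) triangle would be sufficient for larger data sets such as the 15000-gene case mentioned earlier.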

Figure 5 [WWW4]

Complete linkage, in contrast, takes the largest distance between two genes in two clusters, and average linkage, as the name suggests, takes the mean of the distances between all pairs of points drawn from the two clusters. The mathematical interpretation is summarised below, taken from [WWW5]. Let $N_r$ be the total number of data points of a grouping named 'r' and $N_s$ the total number of data points of another group named 's'. Let $x_{ri}$ be the i-th data point in the first group and $x_{sj}$ the j-th data point in the second group. Now the measures can be described as follows :

1. Single Linkage :
$$d(r, s) = \min_{i, j} \, \mathrm{dist}(x_{ri}, x_{sj}), \quad i \in (1, \dots, N_r), \; j \in (1, \dots, N_s)$$
2. Complete Linkage :
$$d(r, s) = \max_{i, j} \, \mathrm{dist}(x_{ri}, x_{sj}), \quad i \in (1, \dots, N_r), \; j \in (1, \dots, N_s)$$
3. Average Linkage :
$$d(r, s) = \frac{1}{N_r N_s} \sum_{i=1}^{N_r} \sum_{j=1}^{N_s} \mathrm{dist}(x_{ri}, x_{sj})$$

Discussion : The single linkage method can result in long clusters, some long enough to reach out and join a nearby cluster, so that a smaller number of clusters is generated. The complete linkage method, in contrast, results in tight, isolated clusters, thereby giving a larger number of clusters. Average linkage is mostly employed in clustering because its results are intermediate between the two.

In hierarchical clustering the most common approach is to treat each gene as a cluster, and then to find its nearest gene using the distance matrix. Once the two genes that are nearest (i.e. most similar) have been found, they are joined to form a new cluster. The individual clusters of these two merged genes are replaced with the single new cluster, formed using one of

the linkage methods described previously. Thus this step reduces the data set of size N to a data set of size N-1. This process is continued till all the genes are clustered. It is not mandatory to merge only two genes at a time, i.e. a larger number of genes can be merged at the same instance. For example, the 5 closest genes may be merged together to give the first new cluster. This process results in a hierarchy of genes in the data set according to their distance. The name hierarchy suggests a tree representation, which is an inherent advantage of the hierarchical clustering technique.

Hierarchical clustering is divided into two groups :
1. Divisive
2. Agglomerative
The common approach stated above is agglomerative, in which the whole data finally converges into one big cluster. The divisive approach is just the opposite: the whole data is iteratively divided into fragments until each gene is singled out into one node alone.

To study the result a diagrammatic representation is used, since it is easier for a human being to comprehend a picture. The convenient tree structure drawn using the result of hierarchical clustering is named a dendrogram. The U-shaped lines joining the clustered genes are the most important part of the dendrogram. Their height reflects the distance between the genes joined together and is read on a vertical scale. These lines joining the clusters give a clear view of the clustering topology, from which it can easily be decided where the tree should be cut or pruned to result in a desired clustering.

A basic example of hierarchical clustering illustrates this theory. Given a set of genes/objects X = {X1, X2, X3, X4}, we choose the distance between the genes as the criterion to measure their similarity, i.e. the genes with the least distance between them are most similar and vice-versa. A general progressive hierarchical clustering then results in a dendrogram as shown in Figure 6. The genes X1 and X2 are closest to each other and are therefore merged into a single cluster. Since this resulting cluster is found to be closest to the X3 gene, they are merged together in the next step. The final tree can be clustered in more than one way. One possible result is X1, X2 & X3 in one group/cluster and X4 in the other group. It is equally correct for X3 and X4 to form individual groups, leaving X1 and X2 together in one group.

Either way, it can be noticed that if a horizontal line is drawn at different places over the vertical lines joining the nodes in the dendrogram, then a valid grouping can be found. All clusters obtained as a result of this operation will be valid clusters.

Figure 6

Hierarchical Clustering Techniques :

1) Standard Hierarchical Clustering : The basic algorithm for agglomerative hierarchical clustering is presented below, referenced from [Matteucci, 2003] :
1. Begin with the data set of length N.
2. Assume each data point from 1 to N to be a cluster.
3. Compute the similarity matrix, i.e. a square matrix of size N.
4. Find the distances between all the clusters in the current data set.
5. Join the two closest clusters in the data set to form a single new cluster, thus reducing the size of the current data set by one.
6. Repeat from step 4 with the current data set until all data points lie within one cluster.
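The six steps above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: it uses single linkage, recomputes cluster distances from the raw points at every pass rather than updating a stored matrix, and the function names and the four one-dimensional positions are hypothetical, chosen so that X1 and X2 merge first, then X3, then X4.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2, points):
    """Smallest point-to-point distance between two clusters of indices."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, linkage=single_link):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Steps 1-2: every data point starts as its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Step 6: repeat until all points lie within one cluster.
    while len(clusters) > 1:
        # Steps 3-4: distances between all pairs of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 5: join the two closest clusters; the set shrinks by one.
        merged = tuple(sorted(clusters[a] + clusters[b]))
        merges.append((clusters[a], clusters[b], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical positions for X1..X4 (indices 0..3).
X = [(0.0,), (1.0,), (3.0,), (8.0,)]
history = agglomerate(X)
for left, right, dist in history:
    print(left, "+", right, "at distance", dist)
```

A data set of N points always yields N-1 merges, and the recorded merge distances are the heights at which the U-shaped lines of the dendrogram would be drawn.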

An unavoidable problem associated with the above routine is that, even in the best case, the time complexity is bound to be of order $N^2$ for a data set of length N.

2) Bayesian Hierarchical Clustering : The Bayesian hierarchical clustering used in [Heller, 2005] is outlined below. The paper discusses the limitations of traditional approaches, such as finding a finite number of clusters, the pruning decision and the similarity measure to be chosen; moreover, since the traditional approaches do not employ probability models, the quality of the clustering cannot be tested.
− Marginal likelihoods are used to determine the suitable combination of clusters. A Dirichlet process mixture model is used.
− The algorithm is similar to agglomerative hierarchical clustering. The difference is that the decision to join two clusters is based on a hypothesis test.
Initially the data set is considered to have N trees in it, which are actually all the points contained in it. At each step of grouping all possible tree combinations are tested. The first hypothesis is that the data in the currently joined group belongs to a common probability distribution. The second hypothesis is that the current data set instead has two or more subgroups in it. From the two hypotheses the posterior probability is determined, using which the final grouping is done. This helps in determining the optimality of the formed group. The paper also comments on where to prune the hierarchical tree by using the posterior probability value. The attractive feature is that the grouping or clustering is justified by the probabilistic approach. The method adopted can be adapted to suit many other kinds of data sets by changing the adopted mixture model. The algorithm is presented in [Mulajkar, 2006].

Advantages of hierarchical clustering :
− Genes/objects are ordered based on similarity, thus resulting in easy and manageable comprehension.
− A visually immediate result is obtained.
− A wide variety of clusterings can be achieved by pruning the tree at a desired level.
− A wide variety of cluster types can be formulated by using various linkage methods.
− Unprecedented findings are favoured due to the small size of the clusters produced.
− For vast data such as that from microarrays it is difficult to pre-define partitions or to supply prior classification information. For such data sets hierarchical clustering is well suited to generate an extensive result incorporating the whole data set.

Problems with hierarchical clustering :
− An incorrect assignment at a prior stage cannot be corrected in later stages of the clustering process.
− Re-evaluation of the clustering is not possible.
− The methods used are predisposed to produce cluster outputs of specific shapes. For instance, the single linkage method churns out elongated clusters whereas average linkage gives spherical clusters.
− Since different metrics result in different clusterings, it becomes mandatory to compare all the results and find an optimum result, which is a tedious process. This clearly indicates the unavailability of a standard clustering technique.
− Problems arise in dealing with non-numerical data.
− Noise associated with the data is another problem.
− The results of two or three samples/experiments are easy to study on a two or three dimensional graph, but it is difficult to analyse many experiments together in multiple dimensions.

Moreover, there is little or no provision in standard approaches to deal with overlapping clusters. Say the data is modelled as two overlapping Gaussian mixtures; then the tails of each distribution encroach into the other distribution's region. These tails are rigidly cut off after clustering and assigned to the cluster to which they lie closest in terms of distance. These deficiencies have given way to the development of novel clustering techniques, and this remains an open field demanding the implementation of a standard approach.

5. Practical Work

A) Hardware & Software Employed
During software coding some functions in the Bio-Informatics toolbox of Matlab Software, Version 7.1.0.246 of The MathWorks, Inc were used. A brief description of the functions, as used in the coding part, is presented below :

'pdist' – Using this function, distances between two points can be calculated by choosing an appropriate distance metric such as Euclidean, Mahalanobis, Spearman, Cityblock etc. If a data set with more than two points is provided, the function computes the distances between all combinations. Thus for a data set having 'm' points, the distances are (m-1)m/2 in number. These distances are presented in a row, starting with the distances of all other points from the first point, then the distances of the second point from the rest of the points, and so on. This function is used in combination with others to perform clustering operations.

'linkage' – This function usually follows the pdist function to generate the clustering tree. Thus, if it is used independently, the format of its input should be the same as that of the pdist function output. It determines the similar data points and their distances in order to cluster them together, based on the single link, complete link, average link etc. measure chosen by the user. The result of this function is a matrix of dimension (d-1 X 3), where d is the total number of data points. The first two columns hold the data point indexes and the third their distance. Each time a new cluster is formed it receives a new index, which is the sum of its row of occurrence and the total length of the original data set. Thus, if two data points with indexes 1 and 9 are joined into a new cluster in the n-th row of the matrix returned by the linkage function, then the new index is (n+d), which can be further used to generate a new cluster.

'dendrogram' – It plots the hierarchical tree using the output of the linkage function. It starts with all the data points considered as leaves in individual clusters, which are further joined to form the hierarchical tree. This tree structure consists of U-shaped lines that join two or more clusters. The height of these lines represents the distance between the joined clusters. The user can choose the number of nodes or leaves of the tree to be displayed. Thus, when too many nodes are present (in Matlab, more than 30), one displayed node represents many nodes, which gives a manageable visual result.

The dendrogram function is employed to separate clusters in this thesis. Since the case of overlapping clusters was investigated exclusively during this work, it was easier to separate the conspicuous clusters found initially and then deal with the points which seemed difficult to group. The software implementation successfully separates overlapping clusters, as described in the later sections, which were not separable using solely the functions of Matlab Software.
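The index convention of the 'linkage' output described above (row n of the matrix creates a cluster with index n+d) can be made concrete with a small sketch. The Python fragment below does not call Matlab; the table `Z` is a hypothetical linkage-style result for d = 4 points, written with 1-based indices as in Matlab, and the helper names are assumptions made for illustration.

```python
def n_pairwise(m):
    """Number of distances a pdist-style routine returns for m points."""
    return (m - 1) * m // 2

def clusters_from_linkage(Z, d):
    """Expand a linkage-style table into the members of every cluster.

    Z holds one row per merge: (index_a, index_b, distance), 1-based.
    Row n (1-based) creates a new cluster with index n + d.
    """
    members = {i: [i] for i in range(1, d + 1)}  # leaves 1..d
    for n, (a, b, _dist) in enumerate(Z, start=1):
        members[n + d] = sorted(members[a] + members[b])
    return members

# Hypothetical linkage-style output for d = 4 data points.
Z = [
    (1, 2, 1.0),  # row 1 creates cluster 5 = {1, 2}
    (3, 5, 2.0),  # row 2 creates cluster 6 = {1, 2, 3}
    (4, 6, 5.0),  # row 3 creates cluster 7 = {1, 2, 3, 4}
]
members = clusters_from_linkage(Z, d=4)
for index in sorted(members):
    print(index, members[index])
```

The table has d-1 rows, one per merge, and any index larger than d in the first two columns refers back to a cluster created in an earlier row.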

A personal computer of the following specification was used : Intel CPU T2300 @ 1.66 GHz (Dual Core processor). Hardware requirements have been minimal. Owing to the case-sensitive nature of the study, the issue of computation time has not been dealt with in this work. However, in order to save processing time for the test results, variables were initialised in advance, which requires a large amount of system memory. All coding has been done in the Matlab environment. To generate normally distributed random numbers Matlab's 'randn' function is used, which chooses a seed as per the clock time of the computer to start producing a series of random numbers through multiplicative operations [WWW6]. To plot a particular case of overlapping Gaussian clusters, a constant seed is given to the 'randn' function through its 'seed' option.

B) Novel method for Hierarchical Clustering using Empirical Covariance

Overview of the approach : Biologists look at groups of similar genes to analyse them and gain information. This essentially requires clearly separating the dissimilar genes, which can be done by hierarchical clustering. Since the Matlab Software can easily detect well separated clusters, its 'dendrogram' function can be used. It was noticed, however, that in the case of closely spaced or overlapping clusters the dendrogram is difficult to prune, since there is no clear separation of groups. This is evident from Figure 9, which was generated using Matlab's Bioinformatics toolbox. A successful technique was developed to resolve this problem.

This method is used on overlapping clusters in which the overlapping regions are sparsely populated. First it is assumed that the overlapping clusters are normally distributed. Thus these Gaussian mixture models have mean and variance as their individual characteristic parameters. The most interesting feature of modelling a group of points as normally distributed is that the density of points is highest at the focal point of the distribution and tends to zero towards the boundaries of the distribution. That is, the region near the mean is heavily populated, and further away from the mean the population decreases, giving sparsely populated regions. This is the key to the cluster separation program. First the dense regions near the means are found, ignoring the sparsely populated areas of the distribution. This separation of points results in well separated clusters. Using the 'dendrogram' function, densely populated and well separated clusters were obtained. This minimises the errors that may arise due to false allocations. The only remaining problem is to assign the points in the sparsely populated areas of the distribution, which lie away from the

mean. So, the means and empirical covariances of the dense regions were found. Each remaining point was tested for its association with either region/cluster. This is done by using the Mahalanobis distance formula, which helps in estimating the probability of a point being part of a distribution based on the mean and empirical covariance of the individual distribution. The tested point was assigned to the region that gave the higher probability value according to the Mahalanobis distance formula. Using this approach the whole data is finally divided into conspicuous clusters.

Critical Features of Implementation :
1) Synthetic Data : Generated using Matlab's 'randn' function.
2) Similarity measure : Euclidean Distance.
3) Dense Region separation : Matlab's 'dendrogram' function.
4) Probabilistic Data Point assignment : Mahalanobis Distance.

Synthetic Data : The artificial bivariate normal data set generated using Matlab's command randn is stretched by an appropriate design matrix to obtain the desired elliptical shapes.

Covariance : It is the measure of dependence between two random variables X and Y and is defined as :
COV(X,Y) = E(XY) – E(X)E(Y)
If X and Y are independent then COV(X,Y) = 0.

Empirical Covariance : The distance matrix is generated using the Euclidean distance formula discussed earlier. Each data point is considered as a cluster, thus an agglomerative bottom-up approach is followed. Based on empirical covariance the data points are grouped into clusters. To assign a point to one of, say, two clusters, the Mahalanobis distance is used, which states in terms of probability :
$$P(y \in C_k) \propto \exp\left(-\tfrac{1}{2}\,(y - m_k)^T \, \Sigma_k^{-1} \, (y - m_k)\right), \quad k = 1, 2$$
where y is the unassigned point, C1 and C2 are clusters one and two, m1 and m2 are the means of the two clusters,

$\Sigma_1^{-1}$ and $\Sigma_2^{-1}$ are the inverse covariance matrices. Thus the probability of a point 'y' being in cluster 1 is compared with its probability of being in cluster 2.

Algorithmic Description of Matlab implementation :
− Generate the distance/proximity matrix from the available data set.
− Sort the distance matrix in ascending order, such that in each column the first row shows the distance of the point to itself, i.e. 0, and going downward the distance increases, showing more distant neighbours of the data point represented by the column number.
− Sum up the rows column wise starting from the topmost row, over a number of rows corresponding to 10% of the data length in each column. This results in a row vector holding, for each point (represented by its column number), the sum of the distances to a fixed number of its nearest neighbours.
− The row vector has length equal to the length of the original data set.
− Sort the row vector thus obtained in ascending order, gathering the indexes indicated by the columns, so as to get the nodes with the smallest nearest-neighbour sums arranged at the beginning of the row. In order to get dense regions separated, the indexes of the first 'S' summation values are gathered. 'S' is equivalent to 60% of the data length.
− On the densely populated nodes found above, Matlab's hierarchical clustering is run to obtain well separated clusters of dense regions.
− Find the mean and inverse covariance of these clusters.
− Assign the remaining points to one of these clusters according to the Mahalanobis distance between the point and the cluster.

Screenshots of results : The synthetic data set is shown in Figure 7.

Figure 7 : Case of two overlapping clusters.
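The dense-region selection and the Mahalanobis assignment listed in the algorithmic description above can be sketched as follows. This is a Python illustration and not the thesis's Matlab code. It assumes two-dimensional data, and the split of the dense points into two clusters is done by a simple coordinate test that merely stands in for the 'linkage'/'dendrogram' step. The function names, the ten data points and the fractions 0.3 and 0.8 are hypothetical; the text uses 10% and 60% on much larger data sets.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dense_indices(points, neighbour_frac=0.10, keep_frac=0.60):
    """Indices of the points whose nearest neighbours are closest overall."""
    n = len(points)
    k = max(1, round(neighbour_frac * n))  # rows summed in each column
    s = max(1, round(keep_frac * n))       # 'S' in the text
    scores = []
    for i, p in enumerate(points):
        # One sorted column of the distance matrix; entry 0 is the self-distance.
        column = sorted(euclidean(p, q) for q in points)
        scores.append((sum(column[:k]), i))
    scores.sort()                          # smallest neighbour sums first
    return sorted(i for _, i in scores[:s])

def mean_and_inverse_cov(cluster):
    """Mean and inverse empirical covariance of a set of 2-D points."""
    n = len(cluster)
    mx = sum(p[0] for p in cluster) / n
    my = sum(p[1] for p in cluster) / n
    sxx = sum((p[0] - mx) ** 2 for p in cluster) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in cluster) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in cluster) / (n - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv

def mahalanobis_sq(y, mean, inv):
    """Squared Mahalanobis distance of y from a cluster."""
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def assign(y, clusters):
    """Index of the cluster with the smallest Mahalanobis distance to y."""
    stats = [mean_and_inverse_cov(c) for c in clusters]
    return min(range(len(clusters)), key=lambda k: mahalanobis_sq(y, *stats[k]))

# Hypothetical data: two tight groups plus two sparse points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (2.5, 2.5), (9.0, 0.0)]
dense = dense_indices(points, neighbour_frac=0.3, keep_frac=0.8)
# Stand-in for cutting the dendrogram of the dense points into two clusters.
cluster_a = [points[i] for i in dense if points[i][0] < 2.5]
cluster_b = [points[i] for i in dense if points[i][0] >= 2.5]
remaining = [i for i in range(len(points)) if i not in dense]
labels = [assign(points[i], [cluster_a, cluster_b]) for i in remaining]
print(dense, remaining, labels)
```

Minimising the squared Mahalanobis distance is equivalent to maximising the exponential term of the Gaussian model, so the comparison of probabilities described in the text reduces to a comparison of these distances.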

Result of the proposed novel algorithm on the above data set :

Figure 8 : Two overlapping clusters identified correctly (top). Its dendrogram (below).

The above results show successful determination of two overlapping clusters. The dendrogram clearly shows two clusters far apart from each other, indicated by the large height of the U-shaped line joining the two clusters. The dendrogram can now be easily cut to separate the two clusters. The same data was clustered using Matlab's Bioinformatics toolbox functions alone, without first finding the dense regions as done in our novel method, the result of which is :

Figure 9 : Using only Matlab's functions on the expression data, the resulting tree cannot be pruned to get conspicuous clusters.

This shows the failure of the Matlab functions in this case of overlapping clusters.

C) Novel method for Hierarchical Clustering using stereographic projection.

Overview of the approach : Clustering based on the directionality of the gene expression data is another way of grouping similar genes. This method was recently adopted in [Dortet-Bernadet, 2007], where expression data is distributed based on its directionality. In the paper, expression data is distributed on the surface of a sphere using a density function whose parameters are derived by the EM algorithm. The paper proposes to use various methods of dimensionality reduction so as to perform clustering and to find the number of clusters in a data set. The novel method proposed here uses a similar concept of realising expression data on a sphere, together with a technique for dimensionality reduction by projection of the spherical data onto a plane and a decision to cut the dendrogram tree to gain clusters. The proposed method is advantageous in dealing with the different shapes that a data distribution may take.

Gene expression data is usually available with many samples for a single gene. This essentially suggests that, dimensionally, the gene data is distributed on a hypersphere. The novel method proposed here, based on directionality, assumes the multidimensional gene expression data initially to be distributed on a sphere. The data is normalised so that it is spread on the surface of a sphere of radius one. Using the same normal distribution property as before, the dense regions lying on the surface of the sphere are found. The difference is that the similarity measure chosen to group the data is based on correlation coefficients rather than Euclidean distances. Having the well separated dense areas on the sphere surface, Matlab's 'dendrogram' function can be used to separate each dense region found.

In order to assign the remaining points to their respective clusters, all the remaining points are projected to each dense region based on their means and covariances. Stereographic projection maps the points lying on the curved surface of the sphere onto a 2-dimensional plane surface, where clustering is easier to perform. The dense regions found are first projected onto a plane tangent to a point on the sphere's surface. This point is the mean of the points in the respective dense region. Having projected the remaining points onto the plane surface, the Gaussian Bayes Classifier [Moore, 2001] is used to determine the probability of association of a point with each of the dense regions. By comparing the probabilities found for a test point for each dense region, the assignments are made, completing the clustering operation. This procedure is described mathematically in a later section.

Critical Features of Implementation :
1) Synthetic Data : Generated using Matlab's 'randn' function.
2) Similarity measure : Pearson's correlation coefficient.
3) Dense Region separation : Matlab's 'dendrogram' function.
4) Probabilistic Data Point assignment : Gaussian Bayes Classifier.

Method used : Let the complete data set be represented as a set of vectors :
$$[e_1, e_2, e_3, \dots, e_M]$$
where $e_i \in R^K$, $i = (1, 2, 3, \dots, M)$ and K is the dimension of each vector. Amongst this data set, a dense or highly similar region consists of a subset of points :
$$[e_1, e_2, e_3, \dots, e_N]$$
The mean of the dense region is given by :
$$\bar{e} = \frac{1}{N} \sum_{i=1}^{N} e_i$$
Consider a sphere having a vector $\bar{e}$ drawn from the origin 'o' to a point 'e' on its surface, as shown in Figure 10. Consider a plane tangent to the sphere at the point 'e', as shown in Figure 11. Let a point 'x' be located on this plane. The line joining the point 'e' to 'x' is named $y$. This is shown in Figure 12.

Figure 10          Figure 11          Figure 12          Figure 13

From Figure 12 it can be stated that
$$x = \bar{e} + y \qquad (A)$$
So
$$\bar{e} \cdot x = \bar{e} \cdot (\bar{e} + y) = \bar{e} \cdot \bar{e} + \bar{e} \cdot y = 1 + 0$$
since $\bar{e}$ has unit length and is perpendicular to the plane. Therefore
$$\bar{e} \cdot x = 1 \qquad (B)$$
Consider a point $e_1$ located on the sphere, and let $\tilde{e}_1$ be the projection of $e_1$ on the tangent plane, as shown in Figure 13. Consider
$$\tilde{e}_1 = \alpha \, e_1 \qquad (C)$$
Using result (B) it can be stated that $\tilde{e}_1 \cdot \bar{e} = 1$. Taking the dot product of both sides of (C) with $\bar{e}$ we have :
$$\tilde{e}_1 \cdot \bar{e} = \alpha \, (e_1 \cdot \bar{e})$$
or
$$1 = \alpha \, (e_1 \cdot \bar{e})$$
or
$$\alpha = \frac{1}{e_1 \cdot \bar{e}} \qquad (D)$$
Using result (D) in (C) :

$$\tilde{e}_1 = \frac{e_1}{e_1 \cdot \bar{e}} \qquad (E)$$
Thus all points of the dense regions can be projected to the plane using the result (E), giving :
$$[e_1, e_2, e_3, \dots, e_N] \rightarrow [\tilde{e}_1, \tilde{e}_2, \tilde{e}_3, \dots, \tilde{e}_N] = \left[ \frac{e_1}{e_1 \cdot \bar{e}}, \; \frac{e_2}{e_2 \cdot \bar{e}}, \; \frac{e_3}{e_3 \cdot \bar{e}}, \; \dots, \; \frac{e_N}{e_N \cdot \bar{e}} \right]$$
Since there may be more than one dense region of closely spaced data points, on separating such far spaced individual dense regions the following parameters can be computed for each case :

Dense region one : $\bar{e}^{1}$, $\Sigma^{1}$, $[e^{1}_{1}, e^{1}_{2}, e^{1}_{3}, \dots, e^{1}_{N}]$
Dense region two : $\bar{e}^{2}$, $\Sigma^{2}$, $[e^{2}_{1}, e^{2}_{2}, e^{2}_{3}, \dots, e^{2}_{N}]$
. . .
Dense region S : $\bar{e}^{S}$, $\Sigma^{S}$, $[e^{S}_{1}, e^{S}_{2}, e^{S}_{3}, \dots, e^{S}_{N}]$

where $\bar{e}$ are their respective means, $\Sigma$ their covariances and $[e_1, e_2, e_3, \dots, e_N]$ the sets of dense points.

Now, from the initial 3D data set $[e_1, e_2, e_3, \dots, e_M]$, let the remaining non-dense points be $f_1, f_2, f_3, \dots, f_R$. To assign these remaining points to their most similar groups, they too are first projected to the same plane, giving :
$$f_1, f_2, f_3, \dots, f_R \rightarrow \tilde{f}_1, \tilde{f}_2, \tilde{f}_3, \dots, \tilde{f}_R$$
For S different dense regions compute :
$$P(\tilde{f}_1 \mid \bar{e}^{1}, \Sigma^{1}), \; P(\tilde{f}_1 \mid \bar{e}^{2}, \Sigma^{2}), \; \dots, \; P(\tilde{f}_1 \mid \bar{e}^{S}, \Sigma^{S})$$
$$P(\tilde{f}_2 \mid \bar{e}^{1}, \Sigma^{1}), \; P(\tilde{f}_2 \mid \bar{e}^{2}, \Sigma^{2}), \; \dots, \; P(\tilde{f}_2 \mid \bar{e}^{S}, \Sigma^{S})$$

. . .
$$P(\tilde{f}_R \mid \bar{e}^{1}, \Sigma^{1}), \; P(\tilde{f}_R \mid \bar{e}^{2}, \Sigma^{2}), \; \dots, \; P(\tilde{f}_R \mid \bar{e}^{S}, \Sigma^{S})$$
where
$$P(\tilde{f} \mid \bar{e}, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k} \, |\Sigma|}} \exp\left( -\tfrac{1}{2} \, (\tilde{f} - \bar{e})^{T} \, \Sigma^{-1} \, (\tilde{f} - \bar{e}) \right)$$
is the Gaussian Bayes Classifier [Moore, 2001], with k the dimension of the projected points. By comparing the probabilities computed by the above relation, the rest of the points can be assigned to their respective clusters.

Algorithmic description of the Matlab Implementation :
− Normalise the data so that the data distribution is located on the surface of a sphere.
− Generate the proximity matrix based on Pearson's correlation coefficients.
− Sort the proximity matrix in descending order, such that in each column the first row shows the correlation of a vector with itself, which is the maximum and equal to 1. Going down a column of this sorted matrix, the correlation value decreases, from less similar points to the least similar at the bottom.
− Sum up the rows column wise starting from the topmost row. The number of rows to be summed up is equal to 5% of the total length of the data set. This results in a row vector holding, for each data point (represented by its column number), the sum over a fixed number of its most similar data points or neighbours.
− The row vector has length equal to the length of the original data set.
− Sort the row vector thus obtained in decreasing order, gathering the indexes indicated by the columns, so as to get the nodes with the closest or most similar neighbour sums arranged at the beginning of the row. In order to get dense or most similar regions separated, the indexes of the first 'S' summation values are gathered. 'S' is equivalent to 60% of the complete data length.
− On the densely populated nodes found above, Matlab's hierarchical clustering is run to obtain well separated clusters of dense regions.
− Separate the remaining points from the initial data set.
− Project the two dense regions on a plane tangent to the sphere using the means of each cluster.
− Having projected the two dense regions, find their respective covariances and means.
− Similarly project the remaining points on to the same plane.
− Compute the probability of the remaining points in each case of projection to test their association with either of the two clusters.

− Assign all the remaining points to their respective clusters by comparing the probabilities of each case.

Results :

Figure 14 : Data generated from two Gaussian distributions and stretched by an appropriate design matrix.

Figure 15 : Clustering result of the proposed novel method on the synthetic data. The two clusters are separated without any incorrect assignments.
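The normalisation and projection steps of the algorithm above follow directly from result (E). The Python sketch below is only an illustration under stated assumptions: the four three-dimensional points are hypothetical, the helper names are not from the thesis code, and the arithmetic mean of the dense region is re-normalised to unit length so that the condition used in result (B) holds, which is an assumption the text does not state explicitly.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(v):
    """Scale a vector to unit length so that it lies on the unit sphere."""
    length = math.sqrt(dot(v, v))
    return tuple(a / length for a in v)

def project(e, e_bar):
    """Result (E): scale e so that it lands on the plane tangent at e_bar."""
    scale = dot(e, e_bar)
    return tuple(a / scale for a in e)

# Hypothetical dense region, normalised onto the unit sphere near the z-axis.
raw = [(0.1, 0.0, 1.0), (-0.1, 0.0, 1.0), (0.0, 0.1, 1.0), (0.0, -0.1, 1.0)]
dense = [normalise(p) for p in raw]

# Mean direction of the dense region, re-normalised (assumption, see lead-in).
mean = tuple(sum(p[i] for p in dense) / len(dense) for i in range(3))
e_bar = normalise(mean)

projected = [project(e, e_bar) for e in dense]
for p in projected:
    # Every projected point satisfies result (B): its dot product with e_bar is 1.
    print(p, dot(p, e_bar))
```

The same `project` call would be applied to the remaining non-dense points before their probabilities under each dense region are compared.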

Dendrogram of the resulting data :

Figure : Tree obtained using the proposed novel method, which can be easily cut into two individual clusters.

The dendrogram of the resultant clustering in the figure above shows conspicuous clusters that were separated from the original data set.

Figure : Matlab's result on the same data set. Here it is difficult to cut the tree into conspicuous clusters.

The proximity matrix generated is shown in Figure 16. As discussed before, when correlations are used as the similarity measure the diagonal elements are all ones, since they show the correlation of a point with itself, which is the highest possible value.

Test Condition : If d n − d n−1  > 10% of d n then Clusters=Clusters+1 Decision : Cut dendrogram into number equal to the value of 'Clusters' found. 1883]. Dendrogram is pruned based on the output of Matlab's 'linkage' function. 1998].f . To decide the total number of clusters to be made following method is adopted : Let d 1 . The genes showing variance greater than 2 were assimilated to generate the initial data set. 2000].. investigates the cell cycle regulation processes on mRNA level and information that can help in prediction of cell cycle...Find d n − d n−1  for n=2. d 2 ..Figure 16 : A part of the proximity matrix generated based on Pearson Correlation Coefficient. About the real data set : The yeast Scaaharomyces Cerevisiae is a well known commercial fungi used in making breads and alcohol [Hansen.... Matlab's 'abs' function was used to find the peak expression values for each gene in the cdc15 data set. d f be all the distances generated by the 'linkage' function... 1998] using microarrays.4. The paper [Spellman..3. 2.. D) Tests on Real Data : The novel hierarchical clustering method proposed based on stereographic projections was tested on real data set available over the internet from [Botstein. . Particularly the 37 ..Set Clusters = 1 . Its genes are studied in the paper [Spellman. Critical features of the Matlab code & the pruning decision : 1.. d 3 .

fluctuations of the mRNA measure during the cell cycle and its regulation were studied using microarrays. A cell cycle is a chain of events which results in a 'daughter cell' [WWW7] through a four-staged complex division process. The division into two nuclei within the cell is termed Mitosis. Mitosis starts with the DNA inside the nucleus grouping together to form coil-like structures which are referred to as chromosomes. Along with this process the nuclear membrane disappears and the protein fibres attach themselves to each chromosome. These then separate, resulting in the end of Mitosis. After this the cell divides, resulting in two cells. In our case we selected data with variance greater than 2 from the cdc15 experiment in order to reduce computation time.

Figure 17 : Well separated clusters resulting from application of the novel algorithm on cdc15 data [Botstein, 2000].

Verification of our result : Our results in Figures 17 & 18 show clear peaks in the expression levels during three different time spans. This indicates that the result of our algorithm is similar to the result shown in the paper [Spellman, 1998]. Since the results found by different methods are the same, this supports the conclusion that correct clustering is done and that the result is not a mere output of numerical computations [Eisen, 1998].
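The pruning decision described earlier under "Critical features of the Matlab code & the pruning decision" (a new cluster is counted whenever successive linkage distances differ by more than 10% of the current distance) can be written compactly. The Python sketch below is illustrative only; the function name `count_clusters` and the list of merge heights are hypothetical.

```python
def count_clusters(distances, threshold=0.10):
    """Count clusters from the successive merge distances d_1 .. d_f.

    Starts at one cluster and adds one whenever |d_n - d_(n-1)| exceeds
    the given fraction of d_n.
    """
    clusters = 1
    for prev, cur in zip(distances, distances[1:]):
        if abs(cur - prev) > threshold * cur:
            clusters += 1
    return clusters

# Hypothetical merge heights: small steps, then one large jump at the end.
heights = [1.00, 1.02, 1.05, 1.08, 4.00]
n_clusters = count_clusters(heights)
print("clusters:", n_clusters)
```

With these heights only the final jump exceeds the 10% threshold, so the tree would be cut into two clusters.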

<Time Span 1> <Time Span 2> <Time Span 3>
Figure 18 : Heat map of cdc15 expression data with variance>2.

E) A naive method developed at the outset of project work : At the outset of the project work a clustering implementation was done in Matlab 7.1, which helped to reveal the numerous shortcomings that needed to be addressed in order to implement an efficient clustering code. This code suffered from many weak areas such as coding style, applicability and robustness, which were dealt with in the two later implementations.

Brief Description : Euclidean distance was used as the similarity index to assign points to the individual clusters. In the case of closely placed clusters the overlapping regions were not taken into account; rather, a rigid boundary is drawn between two clusters, separating them. Thus this implementation does not give good results in the case of overlapping clusters, though it works well for clusters that do not lie too close to each other. The result of clustering on a synthetic data set is shown in Figure 18.

Test Results :
Figure 18 : Clustering result of the standard implementation showing correct determination of two well separated clusters in a synthetic data set (on left). A similar result was obtained by using the 'dendrogram' function of Matlab on the same data set (on right).

6) Conclusion and Evaluation
In the novel method based on empirical covariance, similarities were found using the Euclidean distance formula; thus the dense regions were found by arranging the data in ascending order so as to get the nearest neighbour at the beginning and the farthest at the end. Cutting out a small percentage of data from the beginning of the sorted data gave the dense region, that is, the most similar data points. In the case of clustering using stereographic projections, since Pearson's correlation measure is used, the data has to be arranged in descending order to collect the most similar points, those having high correlation with each other, at the beginning of the sorted data. Thus cutting a small percentage of the data from the beginning also gave dense regions. The difference is that, in the case of distances, values going from smaller to bigger ones correspond to most similar to least similar, whereas for correlations a high value shows greater similarity than a low correlation value. Thus the least similar data points have a correlation near or equal to zero.

The initial data sets taken are assumed to have similar groups of points that have a normal distribution. The regions where the clusters overlap are bound to have misassignments, since there is no information available for the data set other than the expression values. This was noticeable in the results when two normally distributed clusters were made to overlap and were separated using the method proposed in 2D clustering. But the wrong assignments of data points were less than 2% in all the different cases, demonstrating the effectiveness of the method proposed.

Even if there is a small group of genes actively involved in a process which may need to be differentiated together through clustering, the percentage of data cut from the sorted list is fairly low compared with the whole subgroup of those genes. Thus the method proposed would not fail in clustering similar groups. Figure 19 elaborates this discussion.

Figure 19

7. Future Work Areas
In the case of hierarchical clustering using stereographic projection we have assumed that the gene expression data lies on a single side of the sphere rather than being spread thoroughly around all sides. In order to deal with data spread uniformly on all sides of the unit sphere a new projection method needs to be developed. The five parameter Kent distribution can be employed to model the gene expression data. Its attractive feature is that it efficiently describes elliptically shaped clusters, while clusters which are rotationally symmetric can also be described.
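To make the mention of the Kent distribution concrete, the following minimal Python sketch evaluates its log-density up to the normalising constant. For a unit vector x on the sphere in three dimensions, the density is proportional to exp(kappa*(g1.x) + beta*((g2.x)^2 - (g3.x)^2)), where g1 is the mean direction, g2 and g3 are the major and minor axes, kappa is the concentration and beta the ellipticity (with 0 <= 2*beta < kappa). The function name, the chosen orientation and the parameter values are assumptions for illustration only; no fitting to gene expression data is attempted here.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def kent_log_density_unnormalised(x, gamma1, gamma2, gamma3, kappa, beta):
    """Kent log-density at unit vector x, up to the additive normalising constant.

    gamma1 is the mean direction; gamma2 and gamma3 are the major and minor
    axes (the three are assumed orthonormal). kappa controls concentration,
    beta controls ellipticity (0 <= 2*beta < kappa).
    """
    return (kappa * dot(gamma1, x)
            + beta * (dot(gamma2, x) ** 2 - dot(gamma3, x) ** 2))

# Illustrative orientation and parameters (assumed values, not fitted to data).
g1, g2, g3 = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
angle = 0.5  # radians away from the mean direction
along_major = (math.sin(angle), 0.0, math.cos(angle))
along_minor = (0.0, math.sin(angle), math.cos(angle))

elliptical = [kent_log_density_unnormalised(x, g1, g2, g3, 10.0, 4.0)
              for x in (along_major, along_minor)]
symmetric = [kent_log_density_unnormalised(x, g1, g2, g3, 10.0, 0.0)
             for x in (along_major, along_minor)]
print(elliptical)  # larger along the major axis than along the minor axis
print(symmetric)   # equal in both directions when beta = 0
```

With beta greater than zero the contours of equal density are elliptical around the mean direction; with beta equal to zero the expression reduces to the rotationally symmetric von Mises-Fisher form.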

doi :10. Parkinson. Botstein. ' Microarray Catalogue '. Ghahramani.ebi. Wicker.cs. BMC Bioinformatics. 1883. pp. 1-39.. [WWW3] Available : www. and Conlon. 2005. B. 2006.14863–14868. [Online] Available : http://www. J. ' Saccharomyces spp.. 1998.. Schlitt. Dhillon. J. Journal of Machine Learning Research 6.S. Meyen ex Hansen...stanford.R. Spellman.. Twenty-second International Conference on Machine Learning. and Botstein. 44-66. Heller.txt Brazma. Dortet-Bernadet. D. European Bioinformatics Institute. [Online] Available : http://genome-www. To appear in 2004.. Goldstein. D.transcriptome.2004. molecules. ' Statistical issues in the clustering of gene expression data '. J. 2005..doctorfungus.. McGill University. ' Yeast Cell Cycle Analysis Project '.B.... functional genomics. Ghosh. P. and Zhang. 'Bayesian Hierarchical Clustering'. Hicks. A. Mitani. 2001.1016/j.jmva..htm Hardin... Shojatalab. Journal of Multivariate Analysis. genes. Eisen.. doi:10..A. M.ens.M.T.02. Futcher. M. VanKoten. K. B. ' Clustering on the Unit Hypersphere using von Mises-Fisher Distributions '.mcgill. 2000. ' A quick introduction to elements of biology . doi:10. Biology Department Genomic Service. Guérette. microarrays '.1093/biostatistics/kxm012. ' A robust measure of correlation between two genes on a microarray '. I. Statistica Sinica.. [Online] Available : www. 2007. ' Cluster analysis and display of genome-wide expression patterns '. 42 . References Banerjee.O..htm Hansen... 2007. D.. 95.cells... D.. A. T.1186/14712105-8-220. Stanford University or Molecular Biology of the Cell. A. [Online] Available : www. '. Brown. Ghosh.. Proc Natl Acad Sci USA. N.8. Sra. ' Model-Based clustering on the unit sphere with an illustration using gene expression profiles '. M. S.. A.html Bryan. H. P. 8:220. Brown. ' Problems in gene clustering based on gene expression data '. P. L.011. E. ' Normalization Techniques for cDNA Microarray Data '. Z.

Available : www.pdf Jain.jde. R. 2002. 50-63. [WWW4] Available : http://obelia. 2001.. 1988. L. 16.Introduction to Engineering Systems. M.polimi. Available : http://www. M.gene-chips. T.. H. and McLachlan. 2003. University of 96.html Moore. ' Learning Gaussian Bayes Classifiers '. . ' Statistics Toolbox : Linkage '. ' Hierarchical Clustering of Microarray Data based on Empirical Covariance '.edu/~engintro/Project3/MATLABtips2_05. Ph.. D.. ' Microarrays : Chipping away at the mysteries of science and medicine '. Laboratory.. G. and Dubes.pdf Mulajkar. Whiten.Sc..... University of Texas.htm MathWorks Inc. A. ' Cluster Analysis '. Renkwitz. Prentice 43 . 2003 [Online]. 2006.aca. Biology 101. 2003. [WWW1] Available : www. Oak Ridge National.fastol.... 2004. Culverhouse.shtml Carnegie Mellon University. and Zhang. Vol. 2005. A. Pharmacogenomics. Tang. 2007. Journal of the American Statistical Association.. 'Analysing microarray data using cluster analysis '. 2003.html Manchester Metropolitan University. D.. C. and Duncan.mathworks. ' How Many Genes Are in the Human Genome? '.dei.autonlab. [Online] Available : www. ' A Tutorial on Clustering Algorithms '.com/~renkwitz/microarray_chips. R. 2004.J. 'Algorithms for Clustering Data'.. University of Notre Dame [WWW6]. ' Fitting Mixtures of Kent Distributions to Aid in Joint Set Identification '. 2001.nd.htm Schena. C. 1370 -1386. ' Matlab Tips #2 '. Politecnico di.ornl. [Online]. 2000.. ' DNA Microarray (Genome Chip) '. Matteucci. [Online] Available : www. ' Microarray Analysis ' John Wiley & Sons. Peel. Jiang. A. [WWW2] Available : www..D. ' Father of Bioinformatics '.. [WWW5] Available : www.J. Available : http://home. IEEE Transactions on knowledge and data engineering. W. Chesapeake College. W. ' Cluster Analysis for Gene Expression Data : A Survey '. Shi.

.warwick.cshl..pdf Turner. 2004. ' Biclustering Microarray Data: Some Extensions of the Plaid Model '. ' Microarray analysis '. Molecular Biology of the Cell.B.pdf 44 . B. P.. Ph. M. 3273–3297.. Sherlock. 42-196. ' Cell Cycle Research '. Eisen. [WWW7] Available : http://www. K. O.cellcycles. Standford University. Iyer..O.. H. Brown. Anders. P.Spellman. Available : www2...Q. D. 2003. M. The Cell Cycle Database.R..D. Zhang. and Ph.T. Troyanskaya..D. [Online] Available: http://stein.. 1998. 2005 [Online]. ' Comprehensive Identification of Cell Cycle–regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization '.. V. G..

9. Abbreviations
2D       Two dimensional
3D       Three dimensional
cDNA     Complementary DNA
cdc15    Cell division control protein
DNA      Deoxyribonucleic acid
mRNA     Messenger RNA
RNA      Ribonucleic acid
rRNA     Ribosomal RNA
tRNA     Transfer RNA
Trans.   Transactions

10. Diary of major milestones achieved
Dates : 6th April, 2007; 28th April, 2007; 26th July, 2007; 25th August, 2007.
Milestones :
Naive clustering code implemented to find rigid clusters.
Pruning the dendrogram based on 'linkage' function output.
Novel method of Hierarchical clustering using Stereographic Projection implemented in Matlab.
Novel method of Hierarchical clustering using Empirical Covariance implemented in Matlab.