
Online available since 2014/Jan/26 at www.oricpub.com © (2014) Copyright ORIC Publications

Journal of Science and Engineering Vol. 3 (2), 2013, 63-75
www.oricpub.com/journal-of-sci-and-eng

Identifying Cancer Patients using DNA Micro-Array Data in Data Mining Environment
Zakaria Suliman Zubi¹, Marim Aboajela Emsaed²

¹ Sirte University, Faculty of Science, Computer Science Department, Sirte, P.O. Box 727, Libya.
² Tripoli University, Faculty of Information Technology, Computer Science Department, Tripoli, P.O. Box 13210, Libya.

Received: 29 June 2013; Accepted: 20 Dec 2013

Keywords: Data Mining, Sequence Mining, Biological Database, Genetic Algorithm, Clustering, Classification, K-means

Abstract

The purpose of this work is to identify cancer patients using DNA micro-array data. DNA chains contain the informational code for the composition of the human body. The methods are based on the idea of selecting a gene subset that distinguishes all classes, which is more effective for solving a multi-class problem, and we propose a genetic programming (GP) based approach to deal with the gene selection and classification tasks for biological datasets. The biological dataset is derived from multiple biological databases by a procedure called DNA-Aggregator. We design a biological aggregator that aggregates various datasets via a DNA micro-array community-developed ontology. The aggregator is composed of modules that retrieve the data from various biological databases, and it also enables queries by other applications to recognize the genes. The genes are categorized into groups by a classification method that collects similar expression patterns. A clustering method such as k-means is also required to discover the groups of similar objects in the biological database and to characterize the underlying data distribution.

1. INTRODUCTION

Data mining techniques are used to make predictions, typically using only recent static data. Sequence mining is a special case of structured data mining concerned with finding statistically relevant patterns between data examples whose values are delivered in a sequence. These values are delivered and then stored in huge collections of data; examples of such collections include biological databases such as DNA sequence databases. Because these data are sequential in nature, they require a technique for discovering sequential patterns, namely a sequence-mining technique. The principle of sequence mining is to discover useful sequential knowledge, which takes the form of insight into the structure of the data. DNA (gene) chip data is unusual in having thousands of attributes, which represent the gene expression values [8].

Cancers are caused by gene mutations and other types of chromosomal or molecular abnormalities. The frequent sporadic cancers, i.e. cancers in individuals with a negative family history for cancer, carry somatic gene mutations acquired at mitosis. The genes involved in cancers are mainly those involved in normal homeostasis of cellular proliferation, differentiation and death.
All rights reserved. No part of contents of this paper may be reproduced or transmitted in any form or by any means without the written permission of ORIC Publications,www.oricpub.com.

Correspondences:

Z. S. Zubi
Sirte University, Faculty of Science, Computer Science Department, Sirte, P.O Box 727, Libya.


Cancer growth usually requires several different gene mutations to accumulate in a cell of origin and in its subclones during the clonal evolution of malignant growth. Gene mutations in cancers invariably lead to alterations of gene expression patterns with respect to normal cellular counterparts, including the mutated genes themselves and their downstream targets [5]. A newer technique called genetic programming (GP) may help to overcome this limitation. GP is an essential method both for feature selection and for generating simple models based on a few genes, as demonstrated on cancer data. It has been widely applied to classification problems because it can discover underlying data relationships, and it is a promising way to discover potentially important genes by generating comprehensible rules for classification.

1.1 Early Diagnosing of Cancer Diseases

A sound body depends on the continuous interplay of thousands of proteins, acting together in just the right amounts and in just the right places, and each properly functioning protein is the product of an intact gene. Many, if not most, diseases have their roots in our genes. More than 4,000 diseases stem from altered genes inherited from one's mother and/or father. Common disorders such as heart disease and most cancers arise from a complex interplay among multiple genes and between genes and factors in the environment [4]. Cancer is a class of diseases distinguished by out-of-control cell growth. There are over 100 different types of cancer, and each is classified by the type of cell that is initially affected [10].

The Beginning of Cancer

All cancers begin in cells, the body's basic unit of life. To understand cancer, it is helpful to know what happens when normal cells become cancer cells. The body is made up of many types of cells. These cells grow and divide in a controlled way to produce more cells as they are needed to keep the body healthy. When cells become old or damaged, they die and are replaced with new cells. However, occasionally this orderly process goes wrong. The genetic material (DNA) of a cell can become damaged or changed, producing mutations that affect normal cell growth and division. When this occurs, cells do not die when they should and new cells form when the body does not need them. The extra cells may form a mass of tissue called a tumor, as shown in Figure 1 [11].

Figure 1 The cancer transformation

Cancer Classifications

Five broad groups are used to classify cancer:

• Carcinomas are characterized by cells that cover internal and external parts of the body; examples are lung, breast, and colon cancer.
• Sarcomas are characterized by cells that are located in bone, cartilage, fat, connective tissue, muscle, and other supportive tissues.
• Lymphomas are cancers that begin in the lymph nodes and immune system tissues.
• Leukemias are cancers that begin in the bone marrow and often accumulate in the bloodstream.
• Adenomas are cancers that arise in the thyroid, the pituitary gland, the adrenal gland, and other glandular tissues [10].

The objectives of early detection are as follows:
a) To detect and remove or arrest all premalignant lesions;
b) To give patients the best treatment available;
c) To reduce the morbidity and mortality of this disease;
d) To help spread awareness among patients.

1.2 Sequence Mining Techniques

Sequences are an important type of data that occur frequently in many fields, such as medicine, business, finance, customer behavior, education, security, and other applications. In these applications, the analysis of the data needs to be carried out in different ways to satisfy different application requirements, and it needs to be implemented efficiently as well. DNA sequences encode the genetic makeup of humans and all other species, and protein sequences describe the amino acid composition of proteins and encode the structure and function of proteins. Moreover, sequences can be used to capture how individual humans behave through various temporal activity histories, such as weblog histories and customer purchase histories. In general, there are various methods to extract information and patterns from databases, such as time series, association rule mining and data mining [11].

2 BASIC DNA PRINCIPLES

The basic element of life is the cell, which is a tiny factory producing the raw materials, energy, and waste removal capabilities necessary to sustain life. Thousands of different proteins, called enzymes, are necessary to keep these cellular factories functioning. An average human being is composed of approximately 100 trillion cells, all of which originated from a single cell. Each cell contains the same genetic structure. Within the nucleus of our cells is a chemical substance known as DNA that contains the informational code for replicating the cell and constructing the needed enzymes. Because the DNA resides in the nucleus of the cell, it is often referred to as nuclear DNA [3]. DNA has two primary purposes: (1) to make copies of itself so cells can divide and carry on the same information; and (2) to carry instructions on how to make proteins so cells can build the machinery of life. Information encoded within the DNA structure itself is passed on from generation to generation, with one-half of a person's DNA information coming from their mother and one-half coming from their father.

2.1 DNA Structure and definition

Nucleic acids, including DNA, are composed of nucleotide units that are made up of three parts: a nucleobase, a sugar, and a phosphate, as shown in Figure 2. The nucleobase or 'base' imparts the variation in each nucleotide unit, while the phosphate and sugar portions form the backbone structure of the DNA molecule. The DNA alphabet is composed of only four characters representing the four nucleobases: A (adenine), T (thymine), C (cytosine), and G (guanine).

Figure 2. Basic components of nucleic acids: (a) phosphate sugar backbone with bases coming off the sugar molecules, (b) chemical structure of phosphates and sugar molecules illustrating numbering scheme on the sugar carbon atoms. DNA sequences are conventionally written from 5’ to 3’.

2.2 Base pairing and hybridization of DNA Strands

In its natural state in the cell, DNA is actually composed of two strands that are bound together through a process known as hybridization. Individual nucleotides pair up with their ‘complementary base’ through hydrogen bonds that form between the bases. The base pairing rules are such that adenine can only hybridize to thymine and cytosine can only hybridize to guanine. Figure 3 illustrates the pairing rules.

Figure 3. Base pairing of DNA strands to form double-helix structure.
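
The pairing rules above can be illustrated with a minimal Python sketch. The function names (`complement`, `reverse_complement`, `hybridizes`) are hypothetical helpers introduced here for illustration and are not part of the system described in this paper; the sketch assumes strands contain only the four characters A, T, C and G.

```python
# Watson-Crick pairing rules from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written in the 5' to 3' direction."""
    return complement(strand)[::-1]


def hybridizes(strand_a: str, strand_b: str) -> bool:
    """True if two strands, both written 5' to 3', are perfect complements."""
    return reverse_complement(strand_a) == strand_b.upper()
```

Because both strands are conventionally written from 5’ to 3’, the partner of a strand is its complement read in reverse, which is why `reverse_complement` rather than `complement` is used when checking hybridization.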

2.3 Chromosomes, genes, and DNA markers

There are approximately three billion base pairs in a single copy of the human genome. Obtaining a full catalog of our genes was the focus of the Human Genome Project. The information from the Human Genome Project will benefit medical science as well as forensic human identity testing and help us better understand our genetic makeup. Within human cells, DNA found in the nucleus of the cell (nuclear DNA) is divided into chromosomes, which are dense packets of DNA and protective proteins called histones. The human genome consists of 22 matched pairs of autosomal chromosomes and two sex-determining chromosomes, as shown in Figure 4. Thus, normal human cells contain 46 chromosomes, or 23 pairs of chromosomes. Males are designated XY because they contain a single copy of the X chromosome and a single copy of the Y chromosome, while females are designated XX because they contain two copies of the X chromosome.

Figure 4. The human genome contained in every cell consists of 23 pairs of chromosomes and a small circular genome known as mitochondrial DNA.

• Designating physical chromosome locations

The basic regions of a chromosome are illustrated in Figure 5. The centre region of a chromosome, known as the centromere, controls the movement of the chromosome during cell division. On either side of the centromere are ‘arms’ that terminate with telomeres. The shorter arm is referred to as ‘p’ while the longer arm is designated ‘q’.

Figure 5. Basic chromosome structure and nomenclature

3 TUMOUR SUPPRESSOR GENE P53

The p53 tumour suppressor gene is the most frequently altered gene in human cancer, including brain tumours. The p53 protein is a transcription factor involved in maintaining genomic integrity by controlling cell cycle progression and cell survival. About 50% of primary human tumours carry mutations in the p53 gene. The function of p53 is critical to the efficiency of many cancer treatment procedures, because radiotherapy and chemotherapy act in part by triggering programmed cell death in response to DNA damage [6]. The p53 gene is a 20 kb gene located on the short arm of chromosome 17, at the 17p13.1 locus.


3.1 Primers for PCR and DNA sequencing

The primers used were oligonucleotides complementary to the sequences flanking the exon/intron junctions of exons 5–9. The sequences of the primers are as follows:

Exon   Forward primer               Reverse primer
5      5’-CTGACTTTCAACTCTG-3’       5’-AGCCCTGTCGTCTCT-3’
6      5’-CTCTGATTCCTCACTG-3’       5’-ACCCCAGTTGCAAACC-3’
7      5’-TGCTTGCCACAGGTCT-3’       5’-ACAGCAGGCCAGTGT-3’
8      5’-AGGACCTGATTTCCTTAC-3’     5’-TCTGAGGCATAACTGC-3’
9      5’-TATGCCTCAGATTCACT-3’      5’-ACTTGATAAGAGGTCC-3’

4 DNA MICRO-ARRAYS DATA CONCEPTS

DNA micro-arrays are produced by placing small drops of liquid containing genes on a glass microscope slide and allowing the spots to dry. Each spot of liquid contains numerous copies of a single gene; the characteristics of each spot are shown in Figure 6.

Figure 6 Cartoon of a DNA micro-array

The mRNA is isolated from each population, and each population of mRNA is converted into coloured cDNA, usually red and green. Once the two populations of cDNA are produced, they are mixed and incubated with the DNA micro-array, and unbound cDNA is washed off; Figure 7 shows the incubation process. The DNA micro-array is then scanned to detect the two colours of cDNA, and the green and red images are stored. Software merges the two colours, and spots bound by both colours of cDNA appear yellow.

Figure 7. The method for producing labeled cDNA
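
The colour-merging step can be sketched in Python as follows. This is only an illustration of the idea, not the scanning software used in this work: the function name `merged_colour`, the normalised intensity scale (0.0 to 1.0) and the threshold of 0.5 are all assumptions made for the example.

```python
def merged_colour(red: float, green: float, threshold: float = 0.5) -> str:
    """Classify one spot from its normalised red and green intensities.

    The 0.0-1.0 scale and the threshold are illustrative assumptions,
    not values taken from the paper.
    """
    red_bound = red >= threshold
    green_bound = green >= threshold
    if red_bound and green_bound:
        return "yellow"  # both cDNA populations bound the spot
    if red_bound:
        return "red"     # only the red-labelled population bound
    if green_bound:
        return "green"   # only the green-labelled population bound
    return "dark"        # neither population bound the spot
```

In practice the two channels carry continuous intensities, so a real analysis would usually keep the numeric values rather than reduce each spot to a label.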

Figure 8 shows some real data, analyzed using an application program [1].


Figure 8. Real micro-array data for three genes.

The major applications of microchips fall into three categories:
1- Gene expression profiling: RNA is extracted from tumour samples and hybridised to the micro-array to assess, concurrently and in a single experiment, the expression of thousands of genes within the sample.
2- Genotyping: Genomic DNA from an individual is tested for hundreds or thousands of genetic markers [notably single nucleotide polymorphisms (SNPs) or ‘snips’, or micro-satellite markers] in a single hybridisation. This yields a genetic fingerprint, which in turn may be linked to the risk of developing single gene disorders or particular common complex diseases.
3- DNA sequencing: Sequence variations of specific genes can be monitored in a test DNA sample, thereby greatly increasing the scope for precise molecular diagnosis in single gene disorders or complex genetic diseases [5].

DNA Sequencing Process:
1- Mapping: identify a set of clones that span the region of the genome to sequence.
2- Library creation: make sets of smaller clones from the mapped clones.
3- Template preparation: purify DNA from the smaller clones; set up and perform sequencing chemistries.
4- Gel electrophoresis: determine sequences from the smaller clones.
5- Pre-finishing and finishing: specialty techniques to produce high-quality sequences.
6- Data editing/annotation: quality assurance; verification; biological annotation; submission to public database [12].

Applications of DNA micro-arrays or ‘chips’ in oncology:
• Global understanding of abnormal gene expression contributing to malignancy, i.e. snapshots of genes either up- or down-regulated in tumours.
• Molecular classification of neoplasms by gene expression signatures, forecasting the tissue of origin of a tumour in the context of multiple cancer classes.
• Classification of novel molecular-based subclasses in tumours with clinical relevance.
• Discovery of new prognostic or predictive indicators and biomarkers of therapeutic response.
• Identification and validation of new molecular targets for drug development.
• Prediction of drug side effects during preclinical development and toxicology studies.
• Identification of genes conferring drug resistance.
• Prediction or selection of patients most likely to benefit from, or suffer particular side effects of, drugs (pharmacogenomics) [5].

5 DNA BIOLOGICAL DATABASES

When starting any research project, it is necessary to gather information on the problem to be investigated. Biological data can be organized in many different ways:
1. Flat text file databases;
2. Relational databases;
3. Object-oriented databases.
Biological databases can be broadly classified into sequence and structure databases. Sequence databases are applicable to both nucleic acid sequences and protein sequences, whereas structure databases are applicable only to proteins. A biological database is a database of sequences; the three kinds of biological sequences are protein, DNA and RNA. In recent years biological data has doubled in size every 15 or 16 months. Because there is so much data in biology, biological databases have developed greatly and become part of the biologist’s everyday toolbox. The number of queries has also increased to 40,000 per day. Good database search methods are therefore needed; otherwise, biological databases cannot be used efficiently.

The Nature of the Data Collected from Patients

To construct a database, samples of DNA must be collected, the samples analyzed, and the resulting data stored in such a way that it can be accessed efficiently. In the systems now in use, blood, saliva, or other tissue or fluid is collected. Databases and the ability to organize data are needed in order to keep research efficient and to get optimal output and information from data obtained in the lab.

5.1 Biological Dataset

A biological dataset is data or measurements collected from biological sources and stored or exchanged in digital form. Biological datasets are regularly stored in files or databases. Examples of biological data are DNA base-pair sequences and population data used in ecology. There are a number of DNA datasets from published cancer gene expression studies, including a leukemia cancer dataset, a colon cancer dataset, a lymphoma dataset, a breast cancer dataset, and an ovarian cancer dataset.
Three of these datasets are used in this work.
• Leukemia cancer dataset: The leukemia dataset consists of 61 samples: 25 samples of Acute Myeloid Leukemia (AML) and 36 samples of Acute Lymphoblastic Leukemia (ALL). The gene expression measurements were taken from 55 bone marrow samples and 6 peripheral blood samples. Of the 61 samples, 34 are leukemia cancer samples and the remainder are normal samples.
• Colon cancer dataset: The colon dataset consists of 68 samples of colon epithelial cells taken from colon-cancer patients. Of the 68 samples, 46 are colon cancer samples and the remainder are normal samples.
• Lymphoma cancer dataset: The lymphoma dataset consists of 35 samples of lymphoma cells taken from lymphoma-cancer patients. Of the 35 samples, 27 are lymphoma cancer samples and the remainder are normal samples.

6 METHODS AND MODELS

6.1 Genetic algorithm

Genetic algorithms (GAs) are adaptive heuristic search algorithms based on the evolutionary ideas of natural selection and genetics. The basic concept of GAs is to simulate the processes in natural systems that are necessary for evolution, specifically those that follow the principle of survival of the fittest first laid down by Charles Darwin [7]. Three operators are used by genetic algorithms:
1. Selection: The selection operator refers to the method used for selecting which chromosomes will reproduce. The fitness function evaluates each of the chromosomes (candidate solutions), and the fitter the chromosome, the more likely it is to be selected to reproduce.
2. Crossover: The crossover operator performs recombination, creating two new offspring by randomly selecting a locus and exchanging the subsequences to the left and right of that locus between two chromosomes chosen during selection. For example, in binary representation, two strings 11111111 and 00000000 could be crossed over at the sixth locus in each to generate the two new offspring 11111000 and 00000111.
3. Mutation: The mutation operator randomly changes the bits or digits at a particular locus in a chromosome, usually with very small probability. For example, after crossover, the 11111000 child string could be mutated at locus two to become 10111000. Mutation introduces new information into the genetic pool and protects against converging too quickly to a local optimum.
Most genetic algorithms function by iteratively updating a collection of possible solutions called a population. Each member of the population is evaluated for fitness on each cycle. A new population then replaces the old population using the operators above, with the fittest members being chosen for reproduction or cloning.
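
The three operators can be sketched in Python as below. This is a minimal illustration rather than the implementation used in this work: the function names are hypothetical, loci are counted from 1 as in the examples above, and fitness-proportionate (roulette-wheel) selection is an assumption, since the text only states that fitter chromosomes are more likely to be selected.

```python
import random


def crossover(parent_a: str, parent_b: str, locus: int) -> tuple[str, str]:
    """Single-point crossover: bits from `locus` (1-based) onward are exchanged."""
    cut = locus - 1
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate_at(chromosome: str, locus: int) -> str:
    """Flip the bit at a 1-based locus of a binary chromosome."""
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


def select(population, fitness, rng):
    """Roulette-wheel selection (an assumed scheme): pick one chromosome
    with probability proportional to its fitness.

    Requires non-negative fitness values with a positive total.
    """
    weights = [fitness(chromosome) for chromosome in population]
    return rng.choices(population, weights=weights, k=1)[0]


# Reproduce the worked examples from the text.
child_a, child_b = crossover("11111111", "00000000", locus=6)
mutated = mutate_at(child_a, locus=2)
```

Here `child_a` and `child_b` are the two offspring of the crossover example and `mutated` is the result of the mutation example. A fitness function such as counting the 1-bits (`lambda c: c.count("1")`) is purely illustrative; in gene selection the fitness would instead score how well a candidate gene subset separates the classes.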
The fitness function f(x) is a real-valued function operating on the chromosome (potential solution), not the gene, so that the x in f(x) refers to the numeric value taken by the chromosome at the time of fitness evaluation [2].

6.2 Clustering

Clustering refers to the grouping of records, observations, or cases into classes of similar objects. A cluster is a collection of records that are similar to one another and dissimilar to records in other clusters. Clustering differs from classification in that there is no target variable for clustering. The clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, clustering algorithms seek to segment the entire data set into relatively homogeneous subgroups or clusters, where the similarity of the records within a cluster is maximized and the similarity to records outside the cluster is minimized.

k-means clustering

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

Algorithm: The k-means algorithm is a simple and effective algorithm for finding clusters in data. It proceeds as follows.
• Step 1: Choose the number of clusters, k.
• Step 2: Randomly assign k records to be the initial cluster center locations.
• Step 3: For each record, find the nearest cluster center. In a sense, each cluster center "owns" a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, . . . , Ck.
• Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
• Step 5: Repeat steps 3 and 4 until convergence or termination.

The "nearest" criterion in step 3 is usually Euclidean distance. The cluster centroid in step 4 is found as follows. Suppose that there are n data points (a1, b1, c1), (a2, b2, c2), . . . , (an, bn, cn). The centroid of these points is their center of gravity and is located at the point $(\sum a_i/n, \sum b_i/n, \sum c_i/n)$. For example, the points (1,1,1), (1,2,1), (1,3,1), and (2,1,1) would have centroid

$$\left(\frac{1+1+1+2}{4},\ \frac{1+2+3+1}{4},\ \frac{1+1+1+1}{4}\right) = (1.25,\ 1.75,\ 1.00) \qquad (1)$$
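
The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in this work: the function names, the fixed random seed and the iteration cap `max_iter` are hypothetical, and an empty cluster simply keeps its previous center. The helper `sse` computes the sum of squared Euclidean distances from each record to its cluster center, a common measure of how tight the clusters are.

```python
import math
import random


def centroid(points):
    """Center of gravity: the coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centers):
    """Index of the closest center by Euclidean distance (step 3)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, labels) after running the k-means steps."""
    rng = random.Random(seed)
    # Step 2: pick k records at random as the initial cluster centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Step 3: each record is "owned" by its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Step 4: move each center to the centroid of its records.
        new_centers = []
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            new_centers.append(centroid(members) if members else centers[i])
        # Step 5: stop once the centroids no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


example = centroid([(1, 1, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1)])
```

The value of `example` is the centroid of the four points used in the worked example above. Because the initial centers are chosen at random, different seeds can lead to different final clusters on less well-separated data.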

The algorithm terminates when the centroids no longer change. In other words, the algorithm terminates when, for all clusters C1, C2, . . . , Ck, all the records "owned" by each cluster center remain in that cluster. Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the sum of squared errors (SSE), given by Equation (2):

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 \qquad (2)$$

where $p \in C_i$ denotes each data point in cluster $C_i$, $m_i$ denotes the centroid of cluster $C_i$, and $d$ is the distance used in step 3.

The proposed system

The chart in Figure 9 shows the phases of the system and the operations performed in each.
Figure 9. The proposed system. Stages shown in the chart: DNA sequence input, MATLAB, DNA-Aggregator (data set), Genetic Algorithm, Cluster, Result, Performance Analysis.


7 IMPLEMENTATION

The system applies several methods, such as the genetic programming method, beginning with the initialization of the GP. The data clustering algorithms used in the system were implemented in MATLAB version 7.9.0.529 (R2009b). The results are written to an Excel file. Figure 10 shows the results of running the match program as they appear in the Excel file.

Figure 10 result in Excel file

8 RESULTS

The results reported in this work were obtained with the proposed processes, which aim at the early and accurate diagnosis of cancer patients.

Figure 11. Result of the Leukemia process

Figure 12. Result of the colon process

Figure 13. Result of the Lymph process

9 CONCLUSION

In this paper, we proposed a genetic algorithm (GA) based approach to deal with the gene selection and classification tasks for multi-class micro-array datasets. The multi-class problem was divided into multiple two-class problems, and a set of sub-ensemble systems was deployed to deal with the respective two-class problems. The procedure responsible for extracting the datasets is called DNA-Aggregator. We designed a biological aggregator, which aggregates various datasets via a DNA micro-array community-developed ontology based upon the concept of the semantic Web for integrating and exchanging biological data. Trees were constructed with different genes, and important genes were selected as references for clinical diagnosis or cancer development. For each dataset, the biological significance of the selected genes was validated against a biological database. The GA-based method, working with k-means clustering, presents a useful alternative for the analysis of complex multi-class micro-array datasets [9]. In our work we have applied GA to the sequencing of DNA molecules. The results produced by the algorithm were very good and in many cases were optimal or close to optimal. Several challenges were faced and solutions found, and the resulting system is used for classifying, clustering and detecting cancer in DNA chip data. The system involves two major modules: the first performs the clustering and the second detects cancer from the DNA chips.

REFERENCES
[1] Malcolm Campbell and Laurie J. Heyer, "DNA Microarrays: Background, Interactive Databases, and Hands-on Data Analysis", page 5.
[2] Daniel T. Larose, "Data Mining Methods and Models", John Wiley & Sons, Inc., Hoboken, New Jersey, 2006, page 241.
[3] John M. Butler, "Forensic DNA Typing", Elsevier (USA), 2005.
[4] Lydia Schindler, Donna Kerrigan, M.S., Jeanne Kelly, Brian Hollen, "Understanding Cancer and Related Topics Understanding Gene Testing".
[5] M. F. Fey, "The impact of chip technology on cancer medicine". DOI: 10.1093/annonc/mdf647.
[6] Pornima Phatak, S. Kalai Selvi, T. Divya, A. S. Hegde, Sridevi Hegde and Kumaravel Somasundaram, "Alterations in tumour suppressor gene p53 in human gliomas from Indian patients", December 2002, © Indian Academy of Sciences.
[7] Tan Jun-shan, He Wei, Qing Yan, "Application of Genetic Algorithm in Data Mining", 2009 First International Workshop on Education Technology and Computer Science, 978-0-7695-3557-9/09 © 2009 IEEE, DOI 10.1109/ETCS.2009.340, page 353.
[8] W. B. Langdon and B. F. Buxton, "Genetic Programming for Mining DNA Chip data from Cancer Patients", Computer Science, University College, Gower Street, London, WC1E 6BT, UK, {W.Langdon, B.Buxton}@cs.ucl.ac.uk, http://www.cs.ucl.ac.uk/staff/W.Langdon, /staff/B.Buxton, page 1.
[9] Zakaria Suliman Zubi, Marim Aboajela Emsaed, 2010, "Sequence mining in DNA chips data for diagnosing cancer patients". In Proceedings of the 10th WSEAS international conference on Applied computer science (ACS'10), Hamido Fujita and Jun Sasaki (Eds.). World Scientific and Engineering Academy and Society (WSEAS), Stevens Point, Wisconsin, USA, 139-151.
[10] http://www.medicalnewstoday.com/info/cancer-oncology/whatiscancer.php. Page header: "What is Cancer?". Accessed 01:37 pm, 06-05-2010.
[11] http://www.cancer.gov/cancertopics/what-is-cance Cancer?". Accessed 11:37 pm, 04-05-2010.
[12] http://www.ornl.gov/sci/techresources/Human_Genome/graphics/DNASeq.Process.pdf. Page header: "DNA Sequencing Process". Accessed 11:03 pm, 16-2-2010.

Please cite this article as: Z. S. Zubi, M. A. Emsaed, (2013), Identifying Cancer Patients using DNA Micro-Array Data in Data Mining Environment, Journal of Science and Engineering, Vol. 3(2), 63-75.