the Pharmaceutical Industry
Aleksandar Mihajlovic1, Goran Rakocevic2, Zoran Babovic3
Technische Universität München (TUM)1, Mathematical Institute: Serbian Academy of Sciences and Arts2, University of Belgrade, Innovation Center of the School of Electrical Engineering3
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org

Abstract- With advancing research in the field of genetics, the medical treatment of various diseases can be improved via new methods of therapy custom tailored to each individual patient, involving the control and monitoring of individual patient genetic reactions to prescribed medication. This survey paper is an overview of several imputation algorithms and a revision of their classifications. The imputation algorithms considered are used primarily in the quality control of two important DNA microarray data types, genotype data and gene expression data, used in genetic research and in the pharmaceutical industry.

I. INTRODUCTION
Personalized medicine is an up and coming medical model beginning to influence the pharmaceutical industry at large. It emphasizes the customization of healthcare, with all decisions and practices being tailored to the individual patient. Methods related to pharmacogenomics, which mainly involve the systematic use of genetic or other information about an individual patient to select or optimize that patient's preventative and therapeutic care, are primarily used. The promise of pharmacogenomics is that both the choice of a drug and its dose will be determined by the individual genetic make-up, leading to personalized, effective and less harmful drug therapy. This is achieved by carefully studying the effects of drugs on gene expression after the appropriate disease associated gene variants have been isolated and identified.

Single nucleotide polymorphisms (SNPs) are the most common genetic variations present in DNA and are thought to account for most of the genetic variation between individuals. In most cases the difference between a healthy gene allele and a disease associated allele is one or more SNPs. SNPs therefore play a major role in Genome Wide Association studies (GWAS), an approach to identifying and associating genetic variants (alleles) with certain non-Mendelian (non-inheritable) diseases. GWAS permits the interrogation of the entire human genome at levels of resolution previously unattainable, in thousands of unrelated individuals, unconstrained by prior hypotheses regarding genetic associations with disease. These studies mainly represent a valuable discovery tool for examining genomic function and clarifying pathophysiologic mechanisms. By identifying and labeling alleles according to their distinct SNPs we can better isolate the target gene allele(s) of interest and expose them to various pharmacogenomics tests.

SNP arrays are the tool of choice for GWA studies. They are a type of DNA microarray. The most important application of SNP arrays is in determining disease susceptibility and, consequently, in pharmacogenomics, by measuring the efficacy of drug therapies specifically for the individual. Reading SNP arrays is an arduous task and requires the aid of computer analysis software. Genotype and gene expression read quality throughout the course of pharmacogenomics assessment is pivotal. Any faulty reads may lead to faulty drug assessments and further to faulty treatment. In many cases the initial reads from the SNP arrays are faulty and must be controlled, and thus modified, in order to meet the quality standards necessary for research and drug assessment. This paper presents a quality issue related to faulty or missing SNP array data, and DNA microarray data in general. Monitoring SNP reads and scanning for such faults by using analysis software and a class of data mining algorithms known as gene imputation algorithms helps control this quality issue.
II. PROBLEM STATEMENT
Using microarray technology, the gene expression levels and genotypes in two or more mRNA1 populations (two or more individuals) can be analyzed. Noting that genes are the recipes for proteins, the purpose of gene expression testing is to visually determine the abundance of specific gene mRNAs, and hence the amount of potential protein(s) synthesized, in two cells: a control cell and a test cell (the control cell is a healthy cell and the test cell is a diseased cell). By comparing the gene expressions of the healthy cell and the diseased cell, one can note which genes are more active and contribute more to the disease and which genes are less active and contribute insignificantly.
1 transcribed versions of genes, floating in the cytoplasm of a cell waiting to be read and synthesized into a protein by a ribosome.
A. Microarrays
In a diseased cell, some genes work more than their healthy cell counterparts. Sometimes there are genes working in a diseased cell that do not even appear to work under the same conditions in a healthy cell. Testing the amount of gene activity between two such cells (a healthy cell and a sick cell) is known as gene expression testing, explained below.

A microarray is a matrix of known gene sequences. In a cDNA2 microarray experiment, two samples of cellular mRNA are extracted: one sample from a healthy cell and one from a diseased cell. The two mRNA samples are directly correlated to the amount of a particular protein being synthesized at a particular time. The mRNA of both cells is reverse transcribed into the more stable cDNA form. The two cDNA sets are then fluorescently labeled so that the researcher can distinguish cDNA (and hence the encoded proteins) belonging to one cell from that belonging to the other. A mixture of the two fluorescently labeled cDNA samples is made. The mixture is then applied to the DNA microarray gene matrix of known gene sequences. The cDNA strands of both cells in the mixture hybridize to the complementary cDNA strands of the known gene cDNA present on the microarray. The amount of hybridization between microarray and cellular cDNA is represented by the diverse fluorescent labeling of the hybridized DNA microarray matrix position due to the two distinctly labeled cellular cDNA sets (test and control).

Gene expression testing is quantifiable and its results can be read from the distinguishable fluorescent intensities and combined hues (made by mixing only two color intensities, one color for healthy cell and one color for sick cell mRNAs) of individual matrix positions on a hybridized microarray. If a hybridized position in the matrix is of darker fluorescent intensity and the color belongs to the healthy cell set of mRNA, then the gene present at this position is very active in the healthy cell. If the fluorescent intensity at this position is weak, then the gene of this position in the corresponding cell is not so active. The opposite goes for the sick cell as well. Mixed colors tell us that a gene is active in both cells, meaning that the gene might not have anything to do with the disease being analyzed.

In testing for SNPs, or SNP discovery, the process is modified a little bit. In genotype determination and analysis, the cDNA samples are tested for SNPs. The DNA microarray prior to hybridization contains known fragments of cDNA with known SNPs. In this type of analysis study, the goal is to discover which cell gene set of the two test cells has which SNP mutation present on the initial un-hybridized matrix of the DNA microarray. When hybridization occurs between the dual mixture of cellular cDNA fragments and the SNP matrix, the hybridized matrix positions are hued by the fluorescently labeled cellular cDNA. Whether a cell has a corresponding matrix SNP in its gene set or not can be determined by the fluorescence of the SNP matrix position after hybridization. If the hybridized matrix position has the color denoted by the test cell, then the test cell has that SNP; if the position has the control cell color, then the control cell has the SNP specified by that matrix position in its gene set; and if the color is mixed between the two, then both cells have the SNP.

Both microarray methods are used in identifying which genes are the disease genes and which specific versions of these genes, or alleles of these genes (results of SNP variations), are associated with the disease being genetically analyzed. Once the alleles associated with the disease at hand are found and their allelic genetic expression data is extracted (expression data is individual for each sick allele, if there is more than one sick allele), an initial patient disease associated genotype profile can be created. The gene expressions of the diseased genotypes will be monitored over a period of time and tested against various therapies and medication dosages. Via monitoring the gene expression reactions of disease alleles and controlling the therapeutic environment, researchers will be able to create allele customized and optimal medication therapies for different patient profiles.

2 complementary DNA: single stranded complementary bases of some template single stranded DNA molecule.

Fig. 1. (left) a spotting system for microarrays and (right) a microarray chip from Affymetrix.

B. The Missing Value Problem
Microarray hue data (fluorescence) is read, analyzed and automated via special laser reading computers (fig. 1) in what is known as the spotting process. The microarray experiment data usually come in the form of large matrices of expression levels of genes, frequently with some values missing. The genes come in rows and the different experimental conditions are represented by columns.

Microarray data often suffers from the missing value problem. Microarray data can contain up to 10% missing values and, in some data sets, up to 90% of genes have one or more missing values. Missing values occur for different reasons. Some of these reasons may be physical in nature, such as insufficient resolution, image corruption, hybridization failure or simply dust or scratches on the slide. Missing data may also occur systematically as a result of the robotic methods used to create the arrays, known as the spotting process. A spot in the matrix that has negative background corrected intensity would normally be declared as missing. In addition, suspicious values are often flagged as missing too. Suspicious data is usually manually flagged and excluded from subsequent analysis.
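As a rough illustration of the matrix layout described in this section (genes in rows, experimental conditions in columns), the following sketch marks spots with a negative background corrected intensity as missing and reports how widespread the gaps are. The intensity values, the use of NaN as the missing marker, and the function names are assumptions made for this example; they do not come from any particular microarray toolkit.

```python
import numpy as np

def flag_missing(intensity):
    """Return a float copy in which negative background-corrected intensities are NaN."""
    flagged = np.asarray(intensity, dtype=float).copy()
    flagged[flagged < 0] = np.nan
    return flagged

def missing_summary(matrix):
    """Return (fraction of missing entries, fraction of genes with at least one missing value)."""
    mask = np.isnan(matrix)
    return mask.mean(), mask.any(axis=1).mean()

# Hypothetical background-corrected intensities: 4 genes (rows) x 3 conditions (columns).
raw = np.array([
    [120.0, 98.5, -3.2],
    [45.1, 47.0, 44.2],
    [-1.0, 300.4, 280.9],
    [12.3, 15.8, 14.1],
])

expr = flag_missing(raw)
entry_rate, gene_rate = missing_summary(expr)
print(f"missing entries: {entry_rate:.1%}, genes affected: {gene_rate:.1%}")
```

The two rates are reported separately because, as noted above, a modest share of missing entries can still touch a large share of genes.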
Missing values in gene expression data negatively affect further gene analysis. There are several analysis methods for analyzing DNA microarrays (tracking expression differences and more) which require complete matrices. Gene expression analysis algorithms such as hierarchical clustering and many analysis methods such as principal component analysis (PCA) cannot use incomplete matrix data. In many cases an unknown or suspicious matrix spot is labeled with a question mark. The quickest solution for completing incomplete matrix images is to repeat the experiment several times until suitable fill-in data for the missing or faulty matrix positions is found. However, this solution is quite expensive. Instead of repeating the experiments, one can attempt to estimate the missing values by imputation using imputation algorithms. Imputation is an inexpensive statistical estimation and control method that controls the quality of the matrix image data by knowledgeably guessing the missing values with high accuracy.

III. SURVEY CRITICISM
This survey is partially an extension of an existing paper on imputation algorithms. Previous papers concerning gene imputation algorithms have not classified them, but rather only elaborated on their workings and success in actual imputation. The authors of the existing paper classified imputation algorithms according to the information they use in the imputation process. Algorithms were efficiently classified based on the type of information used. According to our observations, we argue that this classification can be extended, as presented in this work. Within this work, an attempt to classify the algorithms by method is made. A brief overview is provided of existing imputation algorithms in terms of their classifications and available best performance information.

IV. IMPUTATION ALGORITHMS
The imputation algorithms use different available types of information from various external and internal (matrix oriented) sources in determining the missing data values. Imputation algorithms are primarily categorized based on the information they use, as mentioned above. The four different classes of algorithms on this classification level are: (i) global approach, (ii) local approach, (iii) hybrid approach and (iv) knowledge assisted approach algorithms.

This classification can be extended, as presented below, to include the different algorithmic methods used in processing specific types of information. Imputation algorithms can be classified according to their methods of work. The methods an algorithm uses to impute values depend on the type(s) of available information used. Information type and method type are tightly connected. The above mentioned classes have method subclasses: the global approach, which applies probabilistic methods; the local approach, which applies either NN (Nearest Neighbor) or regression (statistical) methods; the hybrid approach, which applies both local and global methods; and finally, the knowledge assisted approach, which applies either gene data processing or process based methods.

Gene data processing concerns external gene knowledge, both concerning and not concerning the microarray spot image being analyzed. This method usually applies database and data mining techniques which filter through databases of microarray spot information from experiments similar to the one being analyzed. The process based method for the knowledge class uses methods that are well acquainted with the microarray spotting process. These methods attempt to find suspicious data faults in and during the microarray spotting process.

The relationship between the information type classes of algorithms and the proposed method subclasses can be seen in fig. 2 below. An overview of the best performing algorithms from each method category used in the industry is presented below.

Fig. 2. Proposed classification graph shows the relationship between the two imputation algorithm classification schemes: information type based (ovals) and method type based (rectangles).

A. Probability Based Algorithms
Algorithms whose methods are probability based belong to the global approach class. The information type used by this method is the whole data matrix, meaning all of the spotting values. These algorithms are based on "global" correlation information from the data matrix as a whole (all rows and columns of the matrix are taken into consideration). Such algorithms assume a global covariance structure among all genes in the expression matrix. It is only under this assumption that these algorithms can be considered accurate. One of the best known algorithms in this method and information type category is the Bayesian Principal Component Analysis (BPCA) algorithm. The missing value estimation method based on BPCA consists of three elementary processes. The first two are (1) principal component (PC) regression and (2) Bayesian estimation.
The third elementary process of BPCA is (3) an expectation-maximization (EM)-like repetitive algorithm [5, 6]. In BPCA, the D-dimensional gene expression vector $y$ is expressed as a combination of $K$ principal axis vectors $w_l$:

$$y = \sum_{l=1}^{K} x_l w_l + \epsilon \qquad (1)$$

where $x_l$ are the factor scores and $\epsilon$ is the residual error. Both are regarded as normally distributed random variables in the PCA model. Bayesian estimation and an EM algorithm are then used to estimate the posterior distributions of the model parameter and the missing values simultaneously. BPCA works iteratively; the complexity grows with $O(n^3)$ because several matrix inversions are required. The size of the matrices to invert depends on the number of components used for re-estimation. The performance of this algorithm is optimal if genes don't have local dominant similarities.

B. NN and Regression Based Algorithms
The information type used by NN and regression based algorithms is local in nature. Algorithms using these two method types belong to the local approach class of algorithms because they exploit only local similarity structure in the data set for missing value imputation. Only a subset of genes that exhibits high correlation with the gene containing the missing values is used to compute the missing values in the gene. Two well-known algorithms of these method types are IKNNImpute and ARLSimpute. IKNNImpute is of type NN method and ARLSimpute is of type regression method.

IKNNImpute, or the Iterative K Nearest Neighbor Impute algorithm, is a modified version of the KNNImpute algorithm. KNNImpute was one of the earliest solutions to the missing value problem. KNNimpute uses pairwise information between the target gene with missing values and the K nearest reference genes to impute the missing values. The missing value x in the target gene is estimated as the weighted average of the xth component of the K reference genes, with the weights set proportional to the inverse of the Euclidean distance between the target and the reference genes. KNNimpute's performance is average when there is a strong local correlation between genes in the matrix. IKNNImpute uses an iterative process to refine the missing value estimates by iteratively repeating the KNNImpute step. During each iteration, the K closest reference genes are selected from the previous step. Each iterative step is thus a refinement of the previous one. The iteration terminates when the sum of square difference between the current and the previous estimated complete matrix falls below a threshold.

ARLSimpute stands for Autoregressive Least Square Imputation. ARLSimpute is an imputation method that utilizes the correlation primarily between genes for genotype missing values and also correlations between time points in the gene expression profile. For time series expression data, the correlation between time points in a profile and between genes is captured by an autoregressive (AR) model. The AR model generates the profile. The algorithm first selects the K most correlated genes according to their Euclidean distance. It is assumed that the highly correlated genes are generated by the same AR process that generates the gene profile. The selected genes are then used in estimating the AR coefficients. The missing values are then estimated by solving a linear regression problem, such as the least square problem, using the above mentioned AR coefficients. Published experimental results report the imputation accuracy of ARLSimpute in terms of NRMSE. ARLSimpute is significantly better than any imputation algorithm that ignores the within profile correlation.

C. Gene Data and Process Based Algorithms
Algorithms belonging to these two method types generally utilize external data as well as the matrix data at hand. Algorithms using these two methods for imputing values belong to the knowledge approach class of algorithms. These algorithms introduce external data into the imputation process. By external data is meant any data source not concerning the matrix data in question. This data can be found, for example, in some database or file. Algorithms belonging to this class can make use of external knowledge relating to some experimental process or experimental data, such as other gene data related to the gene in the matrix diagram, spot quality information, or microarray spotting process information, such as how the spotting process is performed, e.g. how the fluorescent intensity of matrix spots is translated and quantified digitally. Another useful form of knowledge would be information concerning the biological processes involved in microarray experiments. These algorithms are bulky and cumbersome to implement. Process based algorithms are unfortunately excluded from the survey. Further mention of them will certainly be found in upcoming papers.

A well-known gene knowledge based algorithm is GOimpute. GOimpute utilizes gene ontological information. GO is a well-accepted standard for gene function categorization. The GO contains three independent ontologies that describe gene products according to:
1) associated biological processes (BP)
2) cellular components (CC)
3) molecular functions (MF)
GOimpute uses two ontologies, BP and MF. The goal of GOimpute is to find a gene or genes similar in terms of expression to the one being scrutinized and to base the missing value(s) upon this gene or genes. Two parameters are calculated for two functionally similar genes: first (1) their semantic distance and second (2) their expression level distance. These two parameters are then combined to form a (3) combined distance.
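The neighbor-based estimate used by KNNimpute, and its iterative refinement in IKNNImpute, can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than either published implementation: missing values are encoded as NaN, the starting estimate is a row-mean fill, distances are Euclidean over the target gene's observed columns, and the names `row_mean_fill`, `knn_pass`, `iknn_impute`, `tol` and `max_iter` are invented for this example.

```python
import numpy as np

def row_mean_fill(matrix):
    """Replace each NaN with the mean of the observed values in its row."""
    filled = matrix.copy()
    row_means = np.nanmean(matrix, axis=1)
    rows, cols = np.where(np.isnan(matrix))
    filled[rows, cols] = row_means[rows]
    return filled

def knn_pass(matrix, current, k):
    """One KNN step: re-estimate every missing entry from the k nearest genes in `current`."""
    missing = np.isnan(matrix)
    updated = current.copy()
    for g in np.where(missing.any(axis=1))[0]:
        observed = ~missing[g]
        # Euclidean distance to every gene, measured over the target's observed columns.
        diff = current[:, observed] - current[g, observed]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        dist[g] = np.inf  # never select the target gene as its own neighbor
        neighbors = np.argsort(dist)[:k]
        # Weights proportional to the inverse distance (small constant avoids division by zero).
        weights = 1.0 / (dist[neighbors] + 1e-12)
        weights /= weights.sum()
        updated[g, missing[g]] = weights @ current[neighbors][:, missing[g]]
    return updated

def iknn_impute(matrix, k=2, tol=1e-6, max_iter=50):
    """Repeat the KNN step until the summed squared change falls below `tol`."""
    matrix = np.asarray(matrix, dtype=float)
    current = row_mean_fill(matrix)
    for _ in range(max_iter):
        updated = knn_pass(matrix, current, k)
        change = ((updated - current) ** 2).sum()
        current = updated
        if change < tol:
            break
    return current

# Hypothetical expression matrix: 4 genes x 4 conditions, one missing value.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, np.nan, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [8.0, 7.0, 6.0, 5.0],
])
completed = iknn_impute(expr, k=2)
print(completed[1, 2])
```

In this toy matrix the two nearest genes are fully observed, so the estimate settles after the first pass; the iteration only matters when the neighbors themselves contain imputed values. The distance computation is isolated in a few lines so that a different gene-to-gene distance could be substituted.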
In the combined distance of GOimpute, the parameter α controls the relative contribution of the two distance measures (Equations (2)-(4); the formulas themselves are not recoverable from this copy). The combined distance is used to select neighborhood genes in KNN and LLS imputation. GO improves the imputation accuracy when the number of experimental conditions is small or the proportion of annotated genes is large, and at higher rates of missing values.

V. FUTURE RESEARCH
Concerning gene imputation algorithms, the most interesting method classes are the gene data and process based algorithm classes. In terms of information type they are noted as knowledge assisted algorithms. Algorithms belonging to these method classes are more complicated than most of the other classes assessed within this paper. Due to their dependence on external data in addition to the parameters of the matrix under inspection, these algorithms tend to be complicated and time consuming. However, they deliver the best results. Future research will show us faster and more refined gene and process external knowledge based algorithms. New and open research problems involve improving the accuracy and speed of the existing imputation algorithms while decreasing the amount of information required by the imputation process.

VI. SUMMARY
Personalized medicine is the wave of the future and doctors all over the world are favoring personalized medicine more and more. In this paper, the importance of DNA microarrays in personalized medicine research and application has been established. The missing value problem which occurs during the DNA microarray spotting process has been introduced. An overview of the best imputation algorithms used in solving this problem has been provided. The imputation algorithms have been previously classified and their classification has been extended within the paper. Both the overview and the classifications are very important to researchers in understanding which information types to use with which algorithms, and what new types of algorithms, based on their method classifications, can be made to use what specific information parameters.

ACKNOWLEDGMENTS
The work presented here was supported by the Serbian Ministry of Education and Science (project III44006).

REFERENCES
HL. McLeod and MM. Ameyaw, "Ethnicity and Pharmacogenomics," Pharmacogenomics: The Search for Individualized Therapies, 2002, pp. 489.
AA Alizadeh, MB Eisen, RE Davis, C Ma, IS Lossos, et al., "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Nature 403: pp. 503-511.
AG de Brevern, S Hazout, A Malpertuy, "Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering," BMC Bioinformatics 2004, 5: pp. 114.
A Wee-Chung Liew, N-F Law and H Yan, "Missing value imputation for gene expression data: computational techniques to recover missing data from available information," Briefings in Bioinformatics Advance Access December 14, 2010.
S Oba, MA Sato, I Takemasa, et al., "A Bayesian missing value estimation method for gene expression profile data," Bioinformatics 2003, pp. 2088-2096.
W Stacklies, "R: Documentation - Bayesian PCA missing Value Estimator," http://rss.acs.unt.edu/Rdoc/library/pcaMethods/html/bcpa.htm
LP Bras, JC Menezes, "Improving cluster-based missing value estimation of DNA microarray data," Biomolecular Eng 2007, 24: pp. 273-82.
MK Choong, M Charbit, H Yan, "Autoregressive model based missing value estimation for DNA microarray time series data," IEEE Trans Inform Technol Biomed 2009, 13(1): pp. 131-7.
J Tuikkala, L Elo, O Nevalainen, et al., "Improving missing value estimation in microarray data with gene ontology," Bioinformatics 2006, 22(5): pp. 566-72.
X Gan, AW Liew, H Yan, "Microarray missing data imputation based on a set theoretic framework and biological knowledge," Nucleic Acids Res 2006, 34(5): pp. 1608-19.
Consortium TGO, "Gene ontology: tool for the unification of biology," Nat Genet 2000, 25: 25-9.
http://www.affymetrix.com
K. Elissa, "Title of paper if known," unpublished.