
ANALYSIS GENE EXPRESSION DATA USING THE APPROACH OF FUZZY LOGIC

ABSTRACT

With advanced technologies such as microarrays, a large amount of information on the expression
levels of genes of different living organisms is now easily available. A microarray is a
technology used to monitor and measure the expression levels of thousands of genes
simultaneously in a single experiment. The genes of an organism play a crucial role in the
functioning of various cellular activities. Genes and other biological molecules such as DNA and
RNA do not function alone; they are all interrelated. Their relationships are represented by
networks commonly known as gene regulatory networks (GRNs). Gene regulatory networks are complex
control networks that map the interactions among genes. They make a very useful contribution to
genomic science and improve the understanding of various biological processes. In this paper, a
fuzzy logic based technique is proposed for the reverse engineering of gene regulatory networks
from microarray gene expression datasets. Pre-processing steps are introduced to increase the
efficiency of the method. A clustering technique is also used to divide the problem into
subproblems and thereby reduce the computational complexity to some extent. Finally, the proposed
technique is tested on two different time-course gene expression datasets of yeast with GEO
accession numbers GDS37 and GDS3030. Fuzzy logic has been proposed as a means of analyzing the
relationships between genes as well as their corresponding proteins.

CHAPTER I
INTRODUCTION

1.1 Gene expression

Gene expression is the process by which information from a gene is used in the synthesis of a
functional gene product. These products are often proteins, but in non-protein-coding genes such
as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. The
process of gene expression is used by all known life, that is eukaryotes (including multicellular
organisms) and prokaryotes (bacteria and archaea), and is utilized by viruses to generate the
macromolecular machinery for life. Several steps in the gene expression process may be modulated,
including transcription, RNA splicing, translation, and post-translational modification of a
protein.

Figure 1.1 Gene expression

Gene regulation gives the cell control over structure and function, and is the basis for cellular
differentiation, morphogenesis and the versatility and adaptability of any organism. Gene
regulation may also serve as a substrate for evolutionary change, since control of the timing,
location, and amount of gene expression can have a profound effect on the functions (actions) of
the gene in a cell or in a multicellular organism. In genetics, gene expression is the most
fundamental level at which the genotype gives rise to the phenotype, i.e. the observable trait.

The genetic code stored in DNA is "interpreted" by gene expression, and the properties of the
expression give rise to the organism's phenotype. Such phenotypes are often expressed by the
synthesis of proteins that control the organism's shape, or that act as enzymes catalysing
specific metabolic pathways characterising the organism. Regulation of gene expression is thus
critical to an organism's development.

Transcription

Gene expression is the process by which information from a gene is used in the synthesis of a
functional gene product. These products are often proteins, but in non-protein-coding genes such
as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. The
process of gene expression is used by all known life, that is eukaryotes (including multicellular
organisms) and prokaryotes (bacteria and archaea), and is utilized by viruses to generate the
macromolecular machinery for life.

Several steps in the gene expression process may be modulated, including transcription, RNA
splicing, translation, and post-translational modification of a protein. Gene regulation gives
the cell control over structure and function, and is the basis for cellular differentiation,
morphogenesis and the versatility and adaptability of any organism. Gene regulation may also
serve as a substrate for evolutionary change, since control of the timing, location, and amount
of gene expression can have a profound effect on the functions (actions) of the gene in a cell or
in a multicellular organism.

In genetics, gene expression is the most fundamental level at which the genotype gives rise to
the phenotype, i.e. the observable trait. The genetic code stored in DNA is "interpreted" by gene
expression, and the properties of the expression give rise to the organism's phenotype. Such
phenotypes are often expressed by the synthesis of proteins that control the organism's shape, or
that act as enzymes catalysing specific metabolic pathways characterising the organism.
Regulation of gene expression is therefore critical to an organism's development.

Figure 1.2 RNA processing

While transcription of prokaryotic protein-coding genes creates messenger RNA (mRNA) that is
ready for translation into protein, transcription of eukaryotic genes leaves a primary transcript
of RNA (pre-mRNA), which first has to undergo a series of modifications to become a mature mRNA.

These include 5' capping, which is a set of enzymatic reactions that add 7-methylguanosine (m7G)
to the 5' end of the pre-mRNA and thus protect the RNA from degradation by exonucleases. The m7G
cap is then bound by the cap binding complex heterodimer (CBC20/CBC80), which aids in mRNA export
to the cytoplasm and also protects the RNA from decapping.

Another modification is 3' cleavage and polyadenylation. They occur if the polyadenylation signal
sequence (5'-AAUAAA-3') is present in the pre-mRNA, which is usually between the protein-coding
sequence and the terminator. The pre-mRNA is first cleaved and then a series of about 200
adenines (A) are added to form the poly(A) tail, which protects the RNA from degradation. The
poly(A) tail is bound by multiple poly(A)-binding proteins (PABPs) necessary for mRNA export and
translation re-initiation. A very important
modification of eukaryotic pre-mRNA is RNA splicing. The majority of eukaryotic pre-mRNAs consist
of alternating segments called exons and introns. During the process of splicing, an RNA-protein
catalytic complex known as the spliceosome catalyzes two transesterification reactions, which
remove an intron and release it in the form of a lariat structure, and then splice neighbouring
exons together. In certain cases, some introns or exons can be either removed or retained in the
mature mRNA. This so-called alternative splicing creates a series of different transcripts
originating from a single gene. Because these transcripts can potentially be translated into
different proteins, splicing extends the complexity of eukaryotic gene expression. Extensive RNA
processing may be an evolutionary advantage made possible by the nucleus of eukaryotes. In
prokaryotes, transcription and translation happen together, whilst in eukaryotes the nuclear
membrane separates the two processes, giving time for RNA processing to occur.

Non-coding RNA maturation

In most organisms non-coding genes (ncRNA) are transcribed as precursors that undergo further
processing. In the case of ribosomal RNAs (rRNA), they are often transcribed as a pre-rRNA that
contains one or more rRNAs. The pre-rRNA is cleaved and modified at specific sites by
approximately 150 different small nucleolus-restricted RNA species, called snoRNAs. SnoRNAs
associate with proteins, forming snoRNPs. While the snoRNA part base-pairs with the target RNA
and thus positions the modification at a precise site, the protein part performs the catalytic
reaction. In eukaryotes, in particular, a snoRNP called RNase MRP cleaves the 45S pre-rRNA into
the 28S, 5.8S, and 18S rRNAs. The rRNA and RNA processing factors form large aggregates called
the nucleolus.

In the case of transfer RNA (tRNA), for example, the 5' sequence is removed by RNase P, whereas
the 3' end is removed by the tRNase Z enzyme and the non-templated 3' CCA tail is added by a
nucleotidyl transferase. In the case of micro RNA (miRNA), miRNAs are first transcribed as
primary transcripts, or pri-miRNA, with a cap and poly-A tail and processed to short,
70-nucleotide stem-loop structures known as pre-miRNA in the cell nucleus by the enzymes Drosha
and Pasha. After being exported, it is then processed to mature miRNAs in
the cytoplasm by interaction with the endonuclease Dicer, which also initiates the formation of
the RNA-induced silencing complex (RISC), composed of the Argonaute protein. Even snRNAs and
snoRNAs themselves undergo a series of modifications before they become part of a functional RNP
complex. This is done either in the nucleoplasm or in the specialized compartments called Cajal
bodies.

RNA export

In eukaryotes most mature RNA must be exported to the cytoplasm from the nucleus. While some RNAs
function in the nucleus, many RNAs are transported through the nuclear pores and into the
cytosol. Notably this includes all RNA types involved in protein synthesis. In some cases RNAs
are additionally transported to a specific part of the cytoplasm, such as a synapse; they are
then towed by motor proteins that bind through linker proteins to specific sequences on the RNA.

Translation

For some RNA (non-coding RNA), the mature RNA is the final gene product. In the case of messenger
RNA (mRNA), the RNA is an information carrier coding for the synthesis of one or more proteins.
mRNA carrying a single protein sequence (common in eukaryotes) is monocistronic, whilst mRNA
carrying multiple protein sequences (common in prokaryotes) is known as polycistronic. Every mRNA
consists of three parts: a 5' untranslated region (5'UTR), a protein-coding region or open
reading frame (ORF), and a 3' untranslated region (3'UTR). The coding region carries information
for protein synthesis encoded by the genetic code to form triplets. Each triplet of nucleotides
of the coding region is called a codon and corresponds to a binding site complementary to an
anticodon triplet in transfer RNA. Transfer RNAs with the same anticodon sequence always carry an
identical type of amino acid. Amino acids are then chained together by the ribosome according to
the order of triplets in the coding region.
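
The codon-by-codon reading described above can be illustrated with a minimal Python sketch. The
function name translate and the example sequence are hypothetical, and the table lists only the
few codons of the standard genetic code that the example needs.

```python
# Partial codon table (standard genetic code); only the codons used in the example.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def translate(coding_region):
    """Read an mRNA coding region three nucleotides at a time until a stop codon."""
    peptide = []
    for i in range(0, len(coding_region) - 2, 3):
        amino_acid = CODON_TABLE[coding_region[i:i + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide


# Hypothetical five-codon coding region: start codon, three codons, stop codon.
print(translate("AUGUUUGGCGCUUAA"))
```

A complete implementation would need all 64 codons and would first locate the open reading frame
between the 5'UTR and the 3'UTR; both are omitted here for brevity.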

The ribosome helps transfer RNA to bind to messenger RNA, takes the amino acid from each transfer
RNA and makes a structure-less protein out of it. Each mRNA molecule is translated into many
protein molecules, on average 2800 in
mammals. In prokaryotes translation generally occurs at the point of transcription
(co-transcriptionally), often using a messenger RNA that is still in the process of being
created. In eukaryotes translation can occur in a variety of regions of the cell, depending on
where the protein being written is supposed to be. Major locations are the cytoplasm for soluble
cytoplasmic proteins and the membrane of the endoplasmic reticulum for proteins that are for
export from the cell or insertion into a cell membrane. Proteins that are supposed to be
expressed at the endoplasmic reticulum are recognised part-way through the translation process.
This is governed by the signal recognition particle, a protein that binds to the ribosome and
directs it to the endoplasmic reticulum when it finds a signal peptide on the growing (nascent)
amino acid chain.

1.2 Gene regulatory networks

All living organisms are made up of cells that house genomic information in the form of DNA
within their nuclei. Genes are made up of DNA, the information store of the cell, which carries
the genetic blueprint used to make proteins. Every single gene stores a particular set of
instructions that code for a specific protein. The process of conversion of genes into a
functional product (protein) through transcription and translation is known as gene expression.
The gene expression process is tightly regulated, which lets a cell respond to its changing
environment. Gene expression is considered the most fundamental level at which the genotype gives
rise to the phenotype. Cells regulate the expression of genes in response to changes in the
environment, transcription factors, and epigenetic marks, which gives rise to the regulatory
network. A gene regulatory network (GRN) consists of a set of genes interacting with each other
to control a specific cell function. GRNs play an important role in development,
differentiation, and response to the environment.

A GRN is like a directed graph consisting of nodes and edges, where nodes represent genes and
their regulators, and edges represent their regulatory relationships such as activation or
inhibition. The regulatory genome resembles a logic processing system, which receives multiple
inputs and processes them as a combination of logical functions such as “AND”, “OR”, or “switch”
functions. Identification of GRNs within a cell is essential because it has a direct influence on
the development and survival of living organisms. Recent advancements
in high-throughput techniques such as microarrays and next-generation sequencing (NGS) have
generated a vast amount of transcriptomic data. The last few decades have witnessed the
development of many computational methods for the inference of GRNs from gene expression
profiles, also known as reverse engineering or GRN reconstruction methods. General introductions
to GRNs and to their applications in clinical and personalised medicine can be found in the
literature.
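
As an illustration of the graph view of a GRN, the following minimal Python sketch stores a small
network as a list of signed, directed edges and evaluates one target gene with a Boolean "AND"
style rule. The gene names, the edge list and the rule for geneC are hypothetical and serve only
to make the data structure concrete.

```python
# Hypothetical three-gene network: each edge is (regulator, target, relationship).
EDGES = [
    ("geneA", "geneC", "activation"),
    ("geneB", "geneC", "inhibition"),
    ("geneC", "geneA", "activation"),
]


def regulators_of(target, edges):
    """Return the incoming edges of `target` as {regulator: relationship}."""
    return {reg: rel for reg, tgt, rel in edges if tgt == target}


def gene_c_is_on(state):
    """Assumed "AND" rule: geneC is on if its activator is on AND its repressor is off."""
    return state["geneA"] and not state["geneB"]


print(regulators_of("geneC", EDGES))
print(gene_c_is_on({"geneA": True, "geneB": False}))
```

A Boolean rule of this kind is the crisp special case; the fuzzy approach discussed in the next
section replaces the on/off states with degrees of membership.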

1.3 Fuzzy logic

Fuzzy logic is a form of many-valued logic in which the truth values of variables may be any real
number between 0 and 1, both inclusive. It is employed to handle the concept of partial truth,
where the truth value may range between completely true and completely false. By contrast, in
Boolean logic, the truth values of variables may only be the integer values 0 or 1. The term
fuzzy logic was introduced with the 1965 proposal of fuzzy set theory by Lotfi Zadeh. Fuzzy logic
had, however, been studied since the 1920s, notably as infinite-valued logic.
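
The relationship between fuzzy and Boolean truth values can be sketched in a few lines of Python.
The sketch assumes Zadeh's minimum, maximum and complement operators, which are a common choice
but not the only one, and the function names are illustrative.

```python
def fuzzy_and(a, b):
    """Zadeh conjunction: the minimum of the two truth values."""
    return min(a, b)


def fuzzy_or(a, b):
    """Zadeh disjunction: the maximum of the two truth values."""
    return max(a, b)


def fuzzy_not(a):
    """Zadeh complement."""
    return 1 - a


# Partial truth: "fairly cold" (0.75) AND "slightly warm" (0.25).
print(fuzzy_and(0.75, 0.25), fuzzy_or(0.75, 0.25), fuzzy_not(0.25))

# Restricted to the values 0 and 1, the operators reproduce Boolean logic.
for a in (0, 1):
    for b in (0, 1):
        assert fuzzy_and(a, b) == (a and b)
        assert fuzzy_or(a, b) == (a or b)
```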

Fuzzy logic is based on the observation that people make decisions based on imprecise and
non-numerical information. Fuzzy models or sets are mathematical means of representing vagueness
and imprecise information (hence the term fuzzy). These models have the capability of
recognising, representing, manipulating, interpreting, and utilising data and information that
are vague and lack certainty. Fuzzy logic has been applied to many fields, from control theory to
artificial intelligence. Fuzzy logic is a basic approach to computing that is based on degrees of
truth rather than crisp values, i.e. true or false (1 or 0). It uses 0 and 1 for the extreme
cases. Mamdani and Takagi-Sugeno are two well-known fuzzy logic inference techniques. These
models work on language-based if-then-else fuzzy rules. Fuzzy logic is a simple and easy approach
to solving complex problems with accuracy. It has several distinctive features: it is very robust
because it does not require exact inputs, it can be modified easily in order to improve
performance, and it can produce smooth output over a wide range of inputs.

Zadeh introduced the concept of fuzzy logic (FL) in 1965 with the introduction of fuzzy set
theory. It gave birth to a new mechanism for processing data by using partial set memberships
rather than crisp set membership. Until the late 1970s, fuzzy set theory was not applied to
control systems because of the limited processing power of computers. Fuzzy logic is a
problem-solving system methodology that allows vagueness, uncertainty, imprecision and partial
truth in computing problems and provides an effective practical solution for conflict resolution
of multiple criteria. Fuzzy logic lends itself to systems ranging from small and simple embedded
systems to large and complex control systems, knowledge-based systems, image processing, power
engineering, robotics, industrial automation, consumer electronics, multi-objective optimisation,
prediction, stock trading, medical diagnosis and treatment, bioinformatics and so on. It can be
implemented at both the hardware and the software level. Fuzzy logic uses a simple rule-based
“IF X AND Y THEN Z” approach for solving control problems. It offers several distinctive features
that make it a good choice for many modeling and control applications. Some of the prominent
features of fuzzy systems are: i) it is inherently robust to imprecise and noisy inputs, ii) it
may be programmed to fail safely, iii) it can process any reasonable number of inputs and
generate numerous outputs, and iv) it is capable of efficiently modeling nonlinear systems that
may otherwise be difficult or impossible to model mathematically. Fuzzy logic based modeling and
control consists of three important steps: fuzzification, inference rule, and defuzzification.

Fuzzification

Fuzzification is the process of assigning the numerical input of a system to fuzzy sets with some
degree of membership. This degree of membership may be anywhere within the interval [0,1]. If it
is 0 then the value does not belong to the given fuzzy set, and if it is 1 then the value
completely belongs within the fuzzy set. Any value between 0 and 1 represents the degree of
uncertainty that the value belongs in the set. These fuzzy sets are typically described by words,
and so by assigning the system input to fuzzy sets, we can reason with it in a linguistically
natural manner.

In the image below, the meanings of the expressions cold, warm, and hot are represented by
functions mapping a temperature scale. A point on that scale has three "truth values", one for
each of the three functions. The vertical line in the image represents a particular temperature
that the three arrows (truth values) gauge. Since the red arrow points to zero, this temperature
may be interpreted as "not hot"; i.e. this temperature has zero membership in the fuzzy set
"hot".
The orange arrow (pointing at 0.2) may describe it as "slightly warm" and the blue arrow
(pointing at 0.8) "fairly cold". Therefore, this temperature has 0.2 membership in the fuzzy set
"warm" and 0.8 membership in the fuzzy set "cold". The degree of membership assigned for each
fuzzy set is the result of fuzzification.
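
The temperature example can be reproduced with a minimal Python sketch. The breakpoints (10, 20,
25 and 35 °C) and the function names are assumptions chosen so that a temperature of 12 °C yields
approximately the memberships quoted above (0.8 cold, 0.2 warm, 0 hot); they are not taken from
the figure.

```python
def ramp_up(x, start, end):
    """Membership rising linearly from 0 at `start` to 1 at `end`."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def ramp_down(x, start, end):
    """Membership falling linearly from 1 at `start` to 0 at `end`."""
    return 1.0 - ramp_up(x, start, end)


def fuzzify_temperature(t):
    """Return the degree of membership of temperature `t` (°C) in each fuzzy set."""
    return {
        "cold": ramp_down(t, 10.0, 20.0),
        # "warm" rises between 10 and 20 °C and falls again between 25 and 35 °C.
        "warm": min(ramp_up(t, 10.0, 20.0), ramp_down(t, 25.0, 35.0)),
        "hot": ramp_up(t, 25.0, 35.0),
    }


print(fuzzify_temperature(12.0))
```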

Figure 1.3 Fuzzy logic temperature

Fuzzification provides a way to transform precise quantitative values (e.g. temperature = 40 °C,
25 °C or 10 °C) into qualitative (nominal) values (e.g. temperature = “High”, “Medium” or “Low”).
There are other ways to transform precise values into discrete descriptors, but FL offers a
systematic and unbiased approach without the need for expert knowledge about the system. The
entire range of the input data is first measured, and then it is fuzzified into distinct
subsections using appropriate membership functions (MFs). The MF represents the magnitude of
participation, associates a “weight” with each input, and defines functional overlaps between
inputs. The membership function is used to map the non-fuzzy inputs to fuzzy linguistic terms and
vice versa. Sometimes, before fuzzification, a normalization technique is applied to scale all
the numeric values into a given range, which helps to prevent attributes with large ranges from
dominating those with small ranges. Triangular, trapezoidal and Gaussian shapes are the most
common MFs, but there are other types of MFs such as singleton, piecewise linear and exponential.
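
The three most common membership function shapes, together with a simple normalization step, can
be sketched in Python as follows. The parameter names (a, b, c, d, mean, sigma) and the choice of
min-max scaling as the normalization technique are assumptions made for illustration.

```python
import math


def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (requires a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and plateau from b to c (a < b <= c < d)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def gaussian(x, mean, sigma):
    """Gaussian MF centred at `mean` with width `sigma`."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def min_max_normalize(values):
    """Scale values linearly into [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([10.0, 25.0, 40.0]))
print(triangular(2.5, 0.0, 5.0, 10.0), trapezoidal(5.0, 0.0, 2.0, 4.0, 6.0))
```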

Fuzzy inference

In this step, a rule base in the form of “IF-THEN” rules is constructed in order to control
the output variable. The purpose of the inference engine is to draw conclusions from the rule
base. Fuzzy set operations (AND, OR, or NOT) are applied to evaluate fuzzy rules and combine
their results. After evaluating the result of each rule, individual results are combined (using
accumulation methods such as maximum, bounded sum or normalized sum) to obtain a final
output.
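
The evaluation and accumulation of rules can be sketched as follows. The rule base (a fan
controller driven by temperature and humidity) is hypothetical; AND, OR and NOT are assumed to be
the minimum, maximum and complement operators, and rule results with the same consequent are
combined by maximum accumulation.

```python
def infer(temperature, humidity):
    """Evaluate a small hypothetical rule base on fuzzified inputs.

    Both arguments are dicts that map linguistic terms to degrees of membership.
    """
    rules = [
        # IF temperature is hot AND humidity is high THEN fan is fast
        ("fast", min(temperature["hot"], humidity["high"])),
        # IF temperature is warm OR humidity is high THEN fan is moderate
        ("moderate", max(temperature["warm"], humidity["high"])),
        # IF temperature is cold THEN fan is slow
        ("slow", temperature["cold"]),
        # IF temperature is NOT hot AND humidity is low THEN fan is slow
        ("slow", min(1.0 - temperature["hot"], humidity["low"])),
    ]
    output = {}
    for term, strength in rules:
        # Maximum accumulation: keep the strongest result for each output term.
        output[term] = max(output.get(term, 0.0), strength)
    return output


result = infer(
    {"cold": 0.0, "warm": 0.25, "hot": 0.75},
    {"low": 0.5, "high": 0.5},
)
print(result)
```

The result is still fuzzy: one truth value per output term, which the defuzzification step then
turns into a single crisp number.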

Defuzzification

Defuzzification obtains a continuous (crisp) variable from fuzzy truth values. This would be
straightforward if the output truth values were exactly those obtained from fuzzification of a
given number. Since, however, all output truth values are computed independently, in most cases
they do not represent such a set of numbers. One has then to decide on a number that best matches
the "intention" encoded in the truth value. For example, for several truth values of fan_speed,
an actual speed must be found that best fits the computed truth values of the variables 'slow',
'moderate' and so on. The inference step produces results as fuzzy values that need to be
defuzzified to get a final crisp output. The defuzzifier component performs the defuzzification
according to the MF of the output variable. Some of the commonly used methods for defuzzification
are center of sums (COS), center of gravity (COG), weighted average, and the maxima method.
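
A minimal sketch of center of gravity (COG) defuzzification for the fan_speed example is given
below. The output fuzzy sets (triangles on a 0 to 100 percent speed scale), the clipping of each
set at its rule strength, and the sampling of the scale at integer points are all assumptions
made for illustration.

```python
def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical output sets for fan speed, in percent of full speed.
FAN_SPEED_SETS = {
    "slow": (-50.0, 0.0, 50.0),
    "moderate": (0.0, 50.0, 100.0),
    "fast": (50.0, 100.0, 150.0),
}


def defuzzify_cog(truth_values):
    """Return the center of gravity of the clipped, aggregated output sets."""
    numerator = denominator = 0.0
    for x in range(0, 101):  # sample the output scale at 0, 1, ..., 100
        # Clip each output set at its truth value, then aggregate by maximum.
        mu = max(
            min(truth_values.get(term, 0.0), triangular(x, *params))
            for term, params in FAN_SPEED_SETS.items()
        )
        numerator += x * mu
        denominator += mu
    if denominator == 0.0:
        raise ValueError("no rule fired, so the crisp output is undefined")
    return numerator / denominator


print(defuzzify_cog({"slow": 0.25, "moderate": 0.5, "fast": 0.5}))
```

When only 'moderate' is active the result lies at the center of the scale; the more weight the
'fast' set receives relative to 'slow', the further the crisp speed moves above that center.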

Traditional fuzzy logic based technique

Woolf and Wang used fuzzy logic to depict the interactions between genes in yeast data. This is
one of the basic techniques for finding the relationships between repressors, activators and
target genes. The expression level values of genes are fuzzified according to different
qualifiers such as High, Low and Medium. The fuzzy rules used in this system are described in a
decision matrix. According to these rules, the target value is estimated. Finally, ranking is
done on the basis of the error between the actual target value and the estimated target value.
Genes with low error are scored higher, i.e. have higher ranks.
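
The overall flow of such a technique (fuzzify, apply a decision matrix, estimate the target, rank
by error) can be sketched as below. This is not the algorithm of Woolf and Wang as published: the
decision matrix entries, the triangular fuzzification, the weighted-average defuzzification, the
mean squared error and the gene names are all assumptions chosen to keep the example small.
Expression values are assumed to be normalized to [0, 1].

```python
from itertools import permutations

LEVELS = ("low", "medium", "high")
CENTERS = {"low": 0.0, "medium": 0.5, "high": 1.0}

# Hypothetical decision matrix: DECISION[activator level][repressor level] -> target level.
DECISION = {
    "low": {"low": "medium", "medium": "low", "high": "low"},
    "medium": {"low": "high", "medium": "medium", "high": "low"},
    "high": {"low": "high", "medium": "high", "high": "medium"},
}


def fuzzify(x):
    """Triangular memberships centred at 0, 0.5 and 1 for x in [0, 1]."""
    return {lvl: max(0.0, 1.0 - abs(x - c) / 0.5) for lvl, c in CENTERS.items()}


def estimate_target(activator, repressor):
    """Fire every cell of the decision matrix and take the weighted average."""
    a, r = fuzzify(activator), fuzzify(repressor)
    numerator = denominator = 0.0
    for la in LEVELS:
        for lr in LEVELS:
            weight = min(a[la], r[lr])  # AND of the two antecedents
            numerator += weight * CENTERS[DECISION[la][lr]]
            denominator += weight
    return numerator / denominator


def rank_triplets(profiles):
    """Score every (activator, repressor, target) triplet; lowest error ranks first."""
    scores = []
    for act, rep, tgt in permutations(profiles, 3):
        errors = [
            (estimate_target(a, r) - t) ** 2
            for a, r, t in zip(profiles[act], profiles[rep], profiles[tgt])
        ]
        scores.append((sum(errors) / len(errors), act, rep, tgt))
    return sorted(scores)


# Hypothetical time courses; "T" is generated from "A" and "R" so that it fits exactly.
A = [0.0, 0.5, 1.0]
R = [1.0, 0.5, 0.0]
T = [estimate_target(a, r) for a, r in zip(A, R)]
print(rank_triplets({"A": A, "R": R, "T": T})[0])
```

Because every ordered triplet of genes is scored, the number of combinations grows cubically with
the number of genes, which is the cost that the clustering pre-processing step is meant to reduce.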

Medical Decision Making

Fuzzy logic is an important concept when it comes to medical decision making. Since medical and
healthcare data can be subjective or fuzzy, applications in this domain have great potential to
benefit from fuzzy logic based approaches. One of the common application areas that use fuzzy
logic is computer-aided diagnosis (CAD) in medicine. CAD is a computerized set of inter-related
tools that can be used to aid physicians in their diagnostic decision making. For example, when a
physician finds a lesion that is abnormal but still at a very early stage of development, he/she
may use a CAD approach to characterize the lesion and diagnose its nature. Fuzzy logic can be
highly appropriate for describing key characteristics of this lesion. Fuzzy logic can be used in
many different aspects within the CAD framework. Such aspects include medical image analysis,
biomedical signal analysis, segmentation of images or signals, and feature extraction / selection
of images or signals.

The biggest question in this application area is how much useful information can be derived when
using fuzzy logic. A major challenge is how to derive the required fuzzy data. This is even more
challenging when one has to elicit such data from humans (usually, patients). As has been said,
“The envelope of what can be achieved and what cannot be achieved in medical diagnosis,
ironically, is itself a fuzzy one.” How to elicit fuzzy data, and how to validate the accuracy of
the data, is still an ongoing effort strongly related to the application of fuzzy logic. The
problem of assessing the quality of fuzzy data is a difficult one. This is why fuzzy logic is a
highly promising possibility within the CAD application area but still requires more research to
achieve its full potential. Although the concept of using fuzzy logic in CAD is exciting, there
are still several challenges that fuzzy approaches face within the CAD framework.

Gene regulatory networks model regulation in living organisms. Fuzzy logic can effectively model
gene regulation and interaction to accurately reflect the underlying biology. A new multiscale
fuzzy clustering method allows genes to interact between regulatory pathways and across different
conditions at different levels of detail. Fuzzy cluster centers can be used to quickly discover
causal relationships between groups of coregulated genes.

Fuzzy measures weight expert knowledge and help quantify uncertainty about the functions of genes
using annotations and the Gene Ontology database to confirm some of the interactions. The method
is illustrated using gene expression data from an experiment on carbohydrate metabolism in the
model plant Arabidopsis thaliana. Key gene regulatory relationships were evaluated using
information from the Gene Ontology database. A new regulatory relationship concerning trehalose
regulation of carbohydrate metabolism was also discovered in the extracted network. The basic
fuzzy approach is as follows. Crisp values are given as inputs, and crisp values are also
obtained at the output once the fuzzy inference is accomplished. Fuzzification transforms crisp
inputs and makes them amenable to inference using fuzzy logic operators such as AND, OR, and NOT.
Rules are evaluated and aggregated, then defuzzified to give a crisp result. In summary, any
fuzzy logic algorithm consists of three main traditional stages: fuzzification, inference, and
defuzzification. The fuzzy approach has an essential role in combining biology-based models with
logical techniques for reconstructing the underlying GRN.

The appearance of microarray technology motivated computer scientists to improve algorithms for
inferring GRNs and provided the possibility of simultaneously monitoring the expression levels of
hundreds of thousands of genes. Various stages are used to regulate gene expression, including
DNA transcription, RNA processing and transport, RNA translation, and post-translational
modification of proteins. The use of the fuzzy approach gave the ability to search for potential
regulatory interactions and to reverse engineer the regulatory pathways related to a subset of
interesting genes. When making simulations, missing values were assigned using the
k-nearest-neighbor (kNN) algorithm. Two ways were used to analyze the datasets. First, all genes
with expression levels below the noise threshold, and all genes that did not vary significantly
during the cell cycle, were eliminated.
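
The two pre-processing operations mentioned above can be sketched in Python. The distance measure
(root mean squared difference over the time points that both genes have), the default k = 2, the
use of None for a missing value and the function names are assumptions. The threefold criterion
is the one quoted later in this chapter, and positive expression values are assumed.

```python
import math


def distance(p, q):
    """Root mean squared difference over the time points observed in both profiles."""
    shared = [(a, b) for a, b in zip(p, q) if a is not None and b is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))


def knn_impute(profiles, k=2):
    """Replace each missing value (None) by the mean of that time point in the
    k most similar genes that have it. Assumes at least one such gene exists."""
    filled = {}
    for gene, values in profiles.items():
        row = list(values)
        for t, value in enumerate(values):
            if value is None:
                donors = sorted(
                    (distance(values, other), other[t])
                    for name, other in profiles.items()
                    if name != gene and other[t] is not None
                )[:k]
                row[t] = sum(v for _, v in donors) / len(donors)
        filled[gene] = row
    return filled


def passes_filter(values, noise_threshold, min_fold_change=3.0):
    """Keep a gene if it rises above the noise threshold and varies at least threefold."""
    return max(values) >= noise_threshold and max(values) >= min_fold_change * min(values)


# Hypothetical expression profiles with one missing measurement in g2.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, None, 3.0],
    "g3": [10.0, 20.0, 30.0],
    "g4": [1.2, 2.4, 3.0],
}
filled = knn_impute(profiles, k=2)
kept = [g for g, v in filled.items() if passes_filter(v, noise_threshold=2.5)]
print(filled["g2"], kept)
```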

Biological relations in the best-fitting fuzzy GRN models can successfully recover both direct
and indirect interactions from prior knowledge, providing a better understanding of the
biological background of transcriptional regulation. That is why one of the main methods that
have emerged for inferring GRNs is the fuzzy approach, which can provide a basic rule structure.
These models are flexible and can be adjusted for several regulatory
models and inference rule sets. Furthermore, nonlinear effects between regulatory genes can be
added, based on further knowledge elicited from human experts or derived automatically from
specialised knowledge bases and evolving knowledge. Because the use of fuzzy logic allows the
fusion of high-level, human-like reasoning in constructing and structuring the rule-based
construction of the GRN, it is advantageous for the biologist or subject-matter expert who has
extensive knowledge about existing regulatory mechanisms to express this information in an
intuitive fashion using the linguistics of fuzzy logic.

In addition, DNA (deoxyribonucleic acid) microarray technology allowed the amount of RNA
associated with a number of genes to be measured in parallel and made it possible to determine
which of them were expressed in a specific type of cell. This technology may face different kinds
of difficulties in understanding gene interaction. The basic reasons for these difficulties are
the thousands of genes contained in microarray databases with few time steps, in addition to the
expression values being subject to noise. Thus, discovering GRNs using microarray data has some
further inherent difficulties. More experimental tests are therefore needed to determine the
validity of genetic interactions predicted by network models. These methods were effectively
applied to both the simulated/experimental data associated with the RAF (rapidly accelerated
fibrosarcoma) signalling pathway and the real microarray data associated with the yeast cell
cycle, and the results were compared with other well-known algorithms. In particular, all genes
with a maximum expression level below the first threshold in any of the four datasets, or those
that exhibited less than a threefold difference between maximum and minimum expression values in
all four datasets, were removed.

Fuzzy Clustering Method


Fuzzy logic can also be used efficiently for modeling the interaction and regulation of genes to
precisely reflect the underlying biology and to provide a complete analysis of the GRN, from
clustering to evaluating the quality of the network. A multiscale fuzzy clustering method using
the k-means algorithm allows genes to interact among regulatory pathways across several
conditions at various levels of detail. The causal relationships among groups of coregulated
genes are detected using fuzzy cluster centers. In this context, the fuzzy method can measure the
weight of expert knowledge and assist in quantifying uncertainty
14
ANALYSIS GENE EXPRESSION DATA USING THE APPROACH OF FUZZY LOGIC

concerning the genes functions with the utilization of cistron metaphysics (GO) information so
as to stress sure interactions.
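
To make the idea of fuzzy cluster centers concrete, the sketch below computes the degree to which one expression profile belongs to each of several fixed centers. It uses the standard fuzzy c-means membership formula as a stand-in; it is not the multiscale method described above, and the function name `fuzzy_memberships`, the fuzzifier `m` and the sample centers are assumptions made for illustration.

```python
import math


def fuzzy_memberships(profile, centers, m=2.0):
    """Degree to which one expression profile belongs to each cluster center.

    Uses the standard fuzzy c-means membership formula with fuzzifier m > 1.
    """
    dists = [math.dist(profile, c) for c in centers]
    # A profile that coincides with a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical two-condition profile and two cluster centers.
centers = [[0.0, 0.0], [4.0, 0.0]]
print(fuzzy_memberships([1.0, 0.0], centers))
```

Unlike a hard k-means assignment, the memberships sum to one across centers, so a gene that sits between two regulatory groups contributes to both.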

The experiment on carbohydrate metabolism in the model plant Arabidopsis thaliana provides the
gene expression data used to illustrate the proposed model. Information from the Gene Ontology
(GO) database is used to evaluate the main gene regulatory relationships. A novel regulatory
relationship concerning trehalose regulation of carbohydrate metabolism was detected in the
resulting network. The GRN inference algorithm has been evaluated effectively and provides
time-delay information using the cluster centers. The algorithm also allows feedback in the
networks, which makes the results more reliable. The suggested work can be extended by
integrating the GRN model with available metabolic networks in the simulation of certain cellular
processes.

1.4 Purpose of the Research

This thesis attempts to improve and expand upon the proof-of-concept algorithm
proposed by Woolf and Wang. We propose the following solutions to the problems mentioned in
the previous section. We propose the use of clustering gene expression data as a preprocessing
method to eliminate combinations of genes that are not likely to fit the model. We propose
altering the methods used by Woolf and Wang to conjoin and aggregate fuzzy data to reduce the
sensitivity of the model to small variations in the inputs while still producing valid results. We
propose a generalized version of Woolf and Wang's fuzzy model to accommodate any number of
activators and repressors in the model. The above improvements will improve the performance,
robustness, and generality of the model to make it more viable as a method for the analysis of
gene expression data. For the conditions tested, the predictions made by the algorithm agree
well with experimental data in the literature. The algorithm can also assist in determining the
function of uncharacterized proteins and is able to detect a substantially larger number of
transcription factors than would be found at random. This technology extends current techniques
such as clustering in that it allows the user to generate a connected network of genes using
only expression data.


1.5 Problem Description

DNA microarray technology allows us to analyze the relative expression levels of a group of
genes of an organism simultaneously. Instead of being forced into only examining a few genes at
once, we now have the "whole picture" of the expression of genes. Microarray technology is also
relatively fast, allowing the quick creation of spatial data (i.e., expression levels of
different cells in an organism at a particular time) as well as temporal data (i.e., expression
levels of the same cell population in a time series). With the new wealth of spatiotemporal data
obtained from microarrays, many different methods have been proposed to make sense of the data.
Clustering algorithms have been used to group genes by their expression profiles to find related
genes. Others have attempted to reverse engineer the network of genetic interactions through
methods such as linear matrices, series of differential equations, and Boolean networks. As
another method, Woolf and Wang have developed a fuzzy model of a known gene interaction (an
activator/repressor relationship in this case) using a normalized subset of Saccharomyces
cerevisiae data.

The output of the model, which is the ideal expression of a gene that is regulated by that
activator and repressor, is compared to the expression level of a candidate target gene. Gene
combinations are ranked based upon the mean squared error between the model and the target gene
and the variance between the applications of the fuzzy rules over the time period. Those
combinations of genes that have a low error are the most likely to exhibit an
activator/repressor relationship. The method attempts to simulate what a human would do in
comparing expression levels of genes to find the underlying relationships. Different fuzzy
models can be developed for different models of interaction, including co-activators and
co-repressors as well as the presence of other factors in the cell. The method is intuitively
pleasing and the results are consistent with the literature on genetic networks of Saccharomyces
cerevisiae.
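
The scoring of an activator/repressor/target combination can be illustrated with a simplified sketch. It is not Woolf and Wang's published rule set: the rule table `RULES`, the triangular membership functions, the use of the product as the fuzzy AND, and all function names are assumptions made for this example, and only the mean squared error part of the ranking is shown. Expression values are assumed to be normalized to [0, 1].

```python
LEVELS = {"LO": 0.0, "MD": 0.5, "HI": 1.0}  # centroid of each fuzzy level

# Hypothetical rule table: (activator level, repressor level) -> target level.
RULES = {
    ("LO", "LO"): "MD", ("LO", "MD"): "LO", ("LO", "HI"): "LO",
    ("MD", "LO"): "HI", ("MD", "MD"): "MD", ("MD", "HI"): "LO",
    ("HI", "LO"): "HI", ("HI", "MD"): "HI", ("HI", "HI"): "MD",
}


def fuzzify(x):
    """Triangular memberships of a value normalized to [0, 1]."""
    return {
        "LO": max(0.0, 1.0 - 2.0 * x),
        "MD": max(0.0, 1.0 - abs(2.0 * x - 1.0)),
        "HI": max(0.0, 2.0 * x - 1.0),
    }


def predict_target(activator, repressor):
    """Fire every rule and defuzzify with a weighted average of centroids."""
    a, r = fuzzify(activator), fuzzify(repressor)
    weighted, total = 0.0, 0.0
    for (a_level, r_level), t_level in RULES.items():
        strength = a[a_level] * r[r_level]  # product as the fuzzy AND
        weighted += strength * LEVELS[t_level]
        total += strength
    return weighted / total


def triplet_error(activator, repressor, target):
    """Mean squared error between the predicted and observed target series."""
    preds = [predict_target(a, r) for a, r in zip(activator, repressor)]
    return sum((p - t) ** 2 for p, t in zip(preds, target)) / len(target)


# Hypothetical two-point time series for one candidate triplet.
print(triplet_error([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

Candidate triplets would then be sorted by this error, with the lowest values suggesting the most plausible activator/repressor relationships.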

Data mining, a step in Knowledge Discovery in Databases (KDD), is defined as a process of
information extraction. Clustering is a common descriptive task, also called unsupervised
learning, in which the classes are not predefined. A cluster is a collection of data objects
that are similar to one another and thus can be treated collectively as one group. Gene
expression is the conversion of the genetic information present in a DNA sequence into a unit of
biological function in a living cell. It typically involves the transcription of a gene into RNA
followed by translation of the RNA into protein. Microarray analysis is a technique that is used
to determine whether genes are expressed or not. Fuzzy logic differs from "crisp logic", in
which binary sets have two-valued logic: fuzzy logic variables may have a truth value that
ranges in degree between 0 and 1. Data mining refers to extracting or mining knowledge from
large amounts of data. Data mining is an interdisciplinary field bringing together techniques
from machine learning, pattern recognition, statistics, databases, and visualization to address
the problem of information extraction from large databases.

The two primary goals of data mining are prediction and description. Prediction involves using
some variables or fields in the database to predict unknown or future values of other variables
of interest. Description focuses on finding human-interpretable patterns that describe the data.
Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using
techniques and concepts from informatics, statistics, mathematics, chemistry, and biochemistry.
It has many practical applications in different areas of biology and medicine. Bioinformatics
derives knowledge from computer analysis of biological data. Fuzzy logic is a superset of
conventional (Boolean) logic that has been extended to handle the concept of partial truth. A
fuzzy set is an extension of a crisp set. Crisp sets allow only full membership or no membership
at all, whereas fuzzy sets allow partial membership.
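
The difference between crisp and fuzzy membership can be shown with a minimal sketch. The set "high expression", the function names and the threshold values below are hypothetical and chosen only to make the contrast visible.

```python
def crisp_high(x, threshold=0.5):
    """Crisp set: a value is either fully in the set (1) or out of it (0)."""
    return 1.0 if x >= threshold else 0.0


def fuzzy_high(x, lower=0.25, upper=0.75):
    """Fuzzy set: membership rises linearly from 0 at `lower` to 1 at `upper`."""
    if x <= lower:
        return 0.0
    if x >= upper:
        return 1.0
    return (x - lower) / (upper - lower)


# Two nearly identical values fall on opposite sides of the crisp threshold,
# but receive almost the same fuzzy membership.
for value in (0.49, 0.51):
    print(value, crisp_high(value), fuzzy_high(value))
```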

Expression profiling assays generate huge data sets that are not amenable to simple
analysis. The greatest challenge in maximizing the use of this data is to develop algorithms to
interpret and interconnect results for different genes under different conditions. Currently, most
expression data is analyzed using clustering techniques, algorithms that identify distinct
expression patterns by grouping genes with similar expression patterns (1, 9). Thus clustering
can only distinguish between those genes that have the same and different expression profiles.
However, genes in the cell make up a complex network that cannot be revealed with current
techniques such as clustering. To determine the network describing how the genes interrelate,
more elaborate data mining techniques need to be developed.

Fuzzy logic is an approach drawn from engineering and other applied sciences to control
systems as diverse as washing machines and autofocus cameras (2, 10). It provides a way to
transform precise numbers, such as 32.43, into qualitative descriptors, such as “high”, in a
process called fuzzification. Although other techniques can be used to change precise values into
discrete descriptors, fuzzy logic provides a systematic and unbiased way to perform this
transformation, thereby removing the need for expert knowledge about the system. For example,
is 32.43 a high value? If 32.43 is a measure of the ambient air temperature in degrees celsius,
then most people would say that 32.43°C is a high temperature. But this analysis requires our
own expert knowledge, which can vary from person to person. Someone from a tropical climate
may feel that 32.43°C is a medium temperature, whereas someone from a very cold climate may
take 32.43°C as a very high temperature. When dealing with gene expression data, the problem is
even more complicated, because no expert exists to determine what defines a “high” expression
level. Using fuzzy logic, the full range of data is first measured and is then broken into discrete
subsections based on the observed data. These discrete subsections then provide a qualitative
description of the data. Once transformed, this qualitative data can be analyzed using heuristic
rules, which in turn generate fuzzy solutions.
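
The data-driven fuzzification described above can be sketched as follows. This is a minimal illustration, assuming three evenly spaced triangular descriptors ("low", "medium", "high") scaled to the observed range; the function name `fuzzify_by_range` and the sample temperature readings are hypothetical.

```python
def fuzzify_by_range(values):
    """Label each value with its strongest fuzzy descriptor, scaled to the data range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    labels = []
    for v in values:
        # Rescale to [0, 1] using only the observed data, not expert knowledge.
        x = 0.5 if span == 0 else (v - lo) / span
        memberships = {
            "low": max(0.0, 1.0 - 2.0 * x),
            "medium": max(0.0, 1.0 - abs(2.0 * x - 1.0)),
            "high": max(0.0, 2.0 * x - 1.0),
        }
        labels.append(max(memberships, key=memberships.get))
    return labels


# The same reading, 32.43, is described differently depending on the observed range.
print(fuzzify_by_range([28.0, 32.43, 38.0]))   # readings from a warm climate
print(fuzzify_by_range([-10.0, 5.0, 32.43]))   # readings from a cold climate
```

In the first list 32.43 comes out as "medium" and in the second as "high", which mirrors the temperature example in the text.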

There are three main advantages of applying fuzzy logic to the analysis of gene expression
data. First, fuzzy logic inherently accounts for noise in the data because it extracts trends,
not precise values. Second, in contrast to other automated decision-making algorithms, such as
neural networks or polynomial fits, algorithms in fuzzy logic are cast in the same language used
in day-to-day conversation. As a result, predictions made using fuzzy logic are easily
interpretable and can be extrapolated in predictable ways. Third, fuzzy logic techniques are
computationally efficient and can be scaled to include an unlimited number of components. They
are therefore able to recognize a large number of biologically important patterns.

A fuzzy logic approach to analyzing gene expression data. Physiol Genomics. We have developed
a novel algorithm for analyzing gene expression data. This algorithm uses fuzzy logic to
transform expression values into qualitative descriptors that can be evaluated by using a set of
heuristic rules. In our tests we designed a model to find triplets of activators, repressors,
and targets in a yeast gene expression data set.


CHAPTER II
REVIEW OF LITERATURE

2.1 Sarah M. Ayyad, Ahmed I. Saleh, A New Distributed Feature Selection Technique for
Classifying Gene Expression Data [2019].

In bioinformatics, DNA microarray technology is considered a benchmark technique for the
diagnosis of several diseases, such as cancers, based on gene expression data. Genes are coding
regions, which form the necessary building blocks inside the cell and direct the production of
proteins that carry out a diversity of functions. However, some genes may become mutated. These
genes are responsible for cancer incidence. Researchers have attempted to analyze large amounts
of microarray data to extract important information that can be used in cancer prediction and
diagnosis. However, gene expression data are commonly composed of only dozens of samples
represented by thousands of genes.

Accordingly, the selection of informative genes (features) becomes difficult with respect to
the effectiveness and efficiency of classification. One way to address this situation is to use
feature selection, which minimizes both the redundancy and the dimensionality of gene expression
data in the classification process. Generally, feature selection methods in bioinformatics are
of one of the following three types: wrapper methods, filter methods, and embedded methods.

Wrapper methods require the selection method to be combined with a predictor in order to
evaluate the classification performance of each gene subset. These methods produce subsets of
features that are strongly related to the classifier chosen. Filter methods depend upon the
generic characteristics of the training data and do not depend on the classifier used. Embedded
methods usually employ machine learning algorithms for classification, and the best feature
subset is then produced by the classification algorithm.

A new distributed feature selection (DBFS) technique has been proposed. DBFS tries to
improve the classification accuracy as well as reduce the running time of the feature selection
process for DNA microarrays by selecting features in a vertical manner. DBFS implements
different fast filter methods over several subsets of the data, which can be done in parallel.
The partition of the dataset involves dividing the dataset into different subsets of nearly the
same size that together comprise the whole dataset.
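
The vertical partitioning step can be sketched as below. This is only an illustration under stated assumptions: the data are taken to be a dictionary mapping each gene to its values across samples, a simple variance ranking is swapped in as a stand-in for the fast filters used in DBFS, and the names `partition_features`, `top_by_variance` and `distributed_filter` are hypothetical.

```python
from statistics import pvariance


def partition_features(feature_names, n_subsets):
    """Split the feature list into n_subsets groups whose sizes differ by at most one."""
    return [feature_names[i::n_subsets] for i in range(n_subsets)]


def top_by_variance(data, features, k):
    """Stand-in filter: keep the k features with the largest variance across samples."""
    ranked = sorted(features, key=lambda f: pvariance(data[f]), reverse=True)
    return ranked[:k]


def distributed_filter(data, n_subsets, k):
    """Apply the filter independently to each subset and merge the survivors."""
    selected = []
    for subset in partition_features(sorted(data), n_subsets):
        # Each subset could be handed to a separate worker; it is processed serially here.
        selected.extend(top_by_variance(data, subset, k))
    return sorted(selected)


# Hypothetical expression values: four genes measured in three samples.
data = {"g1": [1, 1, 1], "g2": [0, 5, 10], "g3": [2, 2, 3], "g4": [0, 9, 0]}
print(distributed_filter(data, n_subsets=2, k=1))
```

Because each subset is filtered independently, the per-subset work is what makes the scheme amenable to parallel execution.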

DBFS uses a four-stage procedure to attain the smallest feature subset. In the first stage,
the dataset is partitioned into several subsets of nearly the same size. In the second stage,
different fast filters are implemented over the subsets. In this stage, a new matrix called the
probability of removal (POR) is generated, and a fuzzy inference system is employed to assign an
importance value to every feature. In the third stage, a combination method builds a set of
features from the results of the filter methods, and those features are then ranked. In the last
stage, a wrapper-based feature selection method is employed to achieve the highest accuracy. In
the following subsections, the four stages will be explained in more detail. Fuzzy set theory
was introduced by Zadeh, who presented the idea and first conception of fuzzy sets in his
seminal paper “Fuzzy Sets”. Nowadays, fuzzy set theory is a significant method for control and
modeling in different scientific areas. Modeling using fuzzy sets has been recognized as an
efficient method for handling multi-criteria decision problems. In this step, we try to combine
many filter feature selection methods to select the best features. Several methods can be
applied to achieve such a task; fuzzy inference is the most appropriate one, as it allows
decision making with estimated values under incomplete information.

Fuzzy inference is appropriate for approximate or uncertain reasoning. It supports the
decision-making process with computed values in incomplete or uncertain data. We presented a new
technique for partitioning the gene selection process employed for classification of microarray
gene expression data. This type of data is distinguished by its large number of features
relative to the number of samples; for that reason our goal was to partition the genes.

The proposed technique reduces the overall runtime compared with centralized techniques, and
the number of selected genes is also reduced. Therefore, it delivers a scalable solution able to
handle datasets with a large number of attributes. In addition, with respect to classification
performance, our proposed technique was capable of matching, and in several cases even improving
on, the standard approaches applied to the non-distributed datasets.


It is worth mentioning that the proposed feature selection technique can be used for a
wider variety of machine learning problems. It can also be applied to very high-dimensional
datasets such as those in document classification and spam filtering. Classification of gene
expression data is an important research area that plays a considerable role in the diagnosis
and prediction of diseases.

Generally, feature selection is one of the most extensively used techniques in data mining
approaches, particularly in classification. Gene expression data are usually composed of dozens
of samples characterized by thousands of genes. This increases the dimensionality, along with
the presence of irrelevant and redundant features. Consequently, the selection of informative
genes (features) becomes difficult, which badly affects the gene classification accuracy. We
consider feature selection for classifying gene expression microarray datasets.

The goal is to detect the most likely cancer-related genes in a distributed manner, which
helps in effectively classifying the samples. Initially, the available huge number of considered
features is partitioned and distributed among several processors. Then, a new filter selection
method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all
the resulting features are ranked, and a wrapper-based selection method is applied. Experimental
results showed that our proposed feature selection technique performs better than other
techniques, since it produces lower time latency and improves classification performance.

2.2 Khalid Raza, Fuzzy logic based approaches for gene regulatory network inference
[2018].

Gene regulatory network inference (GRNI) from high-throughput gene expression data has been
a well-posed challenge for the last few decades. Several computational methods have been
proposed, ranging from simple statistical to sophisticated computational intelligence
approaches. Among these approaches, fuzzy logic theory has many potential applications in
different areas of bioinformatics, including GRNI. Fuzzy logic is capable of representing
nonlinear systems and can incorporate domain knowledge using fuzzy rules. Some of the advantages
of fuzzy logic in gene expression studies are that (i) it extracts trends rather than precise
values, and thus can handle inherent noise in the data; (ii) its predicted results are readily
interpretable; and (iii) it is computationally efficient and scalable.

An initial successful attempt to apply fuzzy logic to GRNI was made by Woolf and Wang,
where gene triplets consisting of activators, repressors, and targets were identified. Later,
it was improved by several researchers in terms of fuzzy logic and its hybridization with other
computational techniques for GRNI, such as fuzzy cognitive maps (FCMs), dynamic fuzzy modeling,
neuro-fuzzy, neuro-evolutionary, fuzzy Petri nets, fuzzy answer set programming and other fuzzy
hybrids. FCMs combine features from fuzzy logic and artificial neural networks (ANN) and work
well for modeling complex processes, including GRNs. They can cope with the lack of quantitative
information as to how various biomolecules interact.
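
A single update step of a fuzzy cognitive map can be sketched as follows. Several FCM update variants exist; the one shown here, a weighted sum passed through a sigmoid with no memory term, is chosen for simplicity and is not claimed to be the form used in any particular study. The function name `fcm_step` and the two-gene weight matrix are hypothetical.

```python
import math


def fcm_step(state, weights, steepness=1.0):
    """One synchronous update of a fuzzy cognitive map.

    state   : activation of each gene, each in [0, 1]
    weights : weights[j][i] is the causal influence of gene j on gene i, in [-1, 1]
    """
    n = len(state)
    new_state = []
    for i in range(n):
        total = sum(weights[j][i] * state[j] for j in range(n))
        # The sigmoid squashes the summed influence back into (0, 1).
        new_state.append(1.0 / (1.0 + math.exp(-steepness * total)))
    return new_state


# Hypothetical map: gene 0 activates gene 1 with full strength.
activation = [[0.0, 1.0], [0.0, 0.0]]
print(fcm_step([1.0, 0.0], activation))
```

A positive weight pushes the target gene above the neutral level of 0.5 and a negative weight pushes it below, which is how activation and repression are encoded.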

The genes are represented as causal fuzzy sets, together with the degree to which they
depend on each other. Traditionally, the construction of FCMs relies on domain knowledge, but
FCMs can also be learned automatically to discover the domain knowledge. However, approaches to
learning FCMs automatically are limited to small-scale networks owing to very low accuracy, high
computational complexity, and the very high density of learned FCMs. A few attempts have been
made to learn large-scale FCMs using various decomposition methods, including decomposed genetic
algorithm, ACO, differential evolution, PSO, LASSO, Compressed Sensing, the multi-agent genetic
algorithm (dMAG), the mutual information based two-phase memetic algorithm (MIMA), and the
memetic algorithm combined with ANN (MANN).

The dynamic fuzzy modeling approach has the capability to incorporate prior structural
knowledge into the GRN model and infer gene interactions as fuzzy rules. Neuro-fuzzy is one of
the most widely applied hybrid approaches for GRNI; it combines the learning and adaptation
features of ANN with knowledge representation through fuzzy logic. Several fuzzy-evolutionary
hybrid approaches have been proposed for GRNI and network parameter optimization. Most of these
approaches are multi-layer models, starting with initial layers that perform data preprocessing
and clustering and ending with a final layer that outputs the predicted GRN with optimized
regulatory interaction parameters.


The fuzzy Petri net (FPN) is also a promising modeling technique for large and complex
systems, with capabilities for fuzzy knowledge representation and reasoning. Several FPN-based
approaches have been developed for modeling and simulation of GRNs. Although FPN has the
capability to model concurrent processes, it suffers from a lack of hierarchical structuring,
which limits its application to large-scale networks. Also, it has no types, no modules, and
allows only one kind of token. Hence, the fuzzy color Petri net (FCPN), which supports
hierarchical structure, data types and complex data manipulation, and in which each token has a
data value attached to it, has been explored for applications in GRNI. The lack of learning
capability makes the FPN unsuitable for modeling uncertain knowledge systems.

Although FPN is a graphical tool that allows structuring a rule-based fuzzy reasoning
system, it needs both the confidence degree of each rule and the truth degree of each
proposition a priori, which requires domain expertise. Fuzzy answer set programming (FASP) is a
declarative programming paradigm that is suitable for difficult search problems and useful in
knowledge-intensive applications. FASP has also been explored to model the dynamics of
multi-valued GRNs and to compute the multi-valued activation of each gene. Other fuzzy logic
based hybrid computational methods have also been proposed for GRNI, including union rule
configuration (URC), exhaustive search, network component analysis (NCA), Bayesian networks
(BN), and so on.

The URC was applied to avoid the combinatorial explosion problem in the fuzzy rules, while
exhaustive search recovers both direct and indirect regulations. Similarly, for computing
high-dimensional data, NCA tries to decompose the data and takes advantage of partial network
connectivity knowledge. BN is suitable for small networks; however, if the number of genes is
large, learning a good BN is very difficult due to the exponential growth of the search space.
Therefore, to reduce the search space in BN, fuzzy clustering can be applied as a preprocessing
step. Also, BN learning suffers from dimensionality and over-fitting problems due to the layout
of microarray data. To deal with this problem in GRN modeling, a fuzzy ensemble clustering
approach is used, which outputs small and highly inter-correlated partitions of genes.

One of the severe drawbacks of most of the GRNI methods is that they have been tested on
simulated gene expression data available from the DREAM challenge and therefore lack rigorous
testing on real gene expression data. Also, these methods have been tested only on small- or
mid-size networks.


To deal with large-scale networks, we need better and more robust decomposition approaches.
Further, gene expression data alone would not be enough to infer GRNs accurately. Therefore, we
need to develop better data integration techniques so that multi-omics data (such as miRNA
expression, ChIP-seq/ChIP-ChIP, mutation data such as SNPs, CNVs, GO annotations,
protein-protein interaction, and gene-disease association data) can be easily integrated into
the GRN inference model to achieve accurate and reliable results. Furthermore, in addition to
the development of better methods to integrate known gene regulations into the model for
“knowledge-guided” reconstruction of gene regulatory networks, we also need to parallelize most
of the existing algorithms for efficient and large-scale gene regulatory network
reconstruction. Bioinformatics is one of the youngest, and still growing, modern
interdisciplinary scientific fields.

Computational systems biology is an interdisciplinary research area that deals with the
study of dynamic interactions among biological macromolecules. Over the last few decades, a wide
variety of methods and concepts borrowed from mathematics, computer science, statistics and
probability theory have been applied to solve problems in the area of bioinformatics and
computational systems biology. Among these methods, fuzzy logic theory also finds many potential
applications in different areas of bioinformatics, including microarray gene expression
analysis, gene biomarker identification, and gene regulatory network inference (GRNI). Fuzzy
logic theory is required to solve several challenging problems in bioinformatics that are beyond
the capabilities of other existing approaches.

The rapid advancements in high-throughput techniques have fueled large-scale production of
biological data at very affordable costs. Some of these techniques are microarrays and
next-generation sequencing, which provide genome-level insight into living cells. As a result,
the size of most biological databases, such as NCBI GEO, NCBI SRA, etc., is growing
exponentially. These biological data are analyzed using various computational techniques for
knowledge discovery, which is also one of the objectives of bioinformatics research. A gene
regulatory network (GRN) is a gene-gene interaction network that plays a pivotal role in
understanding gene regulation processes and disease mechanisms at the molecular level.


For the last couple of decades, researchers have been interested in developing computational
algorithms for GRN inference (GRNI) from high-throughput experimental data. Many computational
approaches have been proposed for inferring GRNs from gene expression data, including
statistical techniques (correlation coefficient), information theory (mutual information),
regression-based approaches, probabilistic approaches (Bayesian networks, naïve Bayes),
artificial neural networks and fuzzy logic. Fuzzy logic, along with its coupling with other
intelligent approaches, is a well-studied technique in GRNI because of its many advantages. We
present a consolidated review of fuzzy logic and its hybrid approaches developed during the last
two decades for GRNI.

2.3 Priyakshi Mahanta, Hasin Afzal Ahmed, FUMET: A fuzzy network module extraction
technique for gene expression data [2014].

The large amount of gene expression data produced by revolutionary microarray technology
has given rise to a number of challenges for researchers in the field of gene expression data
analysis. These gene expression datasets are evidence of a number of biological phenomena. This
evidence can be traced by performing reverse engineering on the expression data. An important
one of these biological phenomena is the regulatory relationship among genes. A number of
attempts have been reported in the literature toward detecting the reflection of the gene
regulatory network in gene expression data. Typically, the first step of such an attempt is to
construct a co-expression network. There are basically two types of techniques for constructing
a gene co-expression network: hard thresholding and soft thresholding. In a
hard-thresholding-based scheme, two genes are connected if their expression similarity exceeds
the hard threshold. Loss of information and sensitivity to the choice of the threshold are the
main limitations of hard thresholding. Techniques using soft thresholding weigh each connection
of the network by a number between 0 and 1.
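
The contrast between the two schemes can be sketched as follows. The power function used for soft thresholding is one common choice in the co-expression literature and is swapped in here for illustration; it is not the fuzzy membership function used by FUMET. The function names, the threshold `tau`, the exponent `beta` and the similarity matrix are all hypothetical.

```python
def hard_threshold(similarity, tau=0.8):
    """Unweighted network: edge present (1) only if similarity reaches tau."""
    return [[1 if i != j and s >= tau else 0 for j, s in enumerate(row)]
            for i, row in enumerate(similarity)]


def soft_threshold(similarity, beta=6):
    """Weighted network: each edge gets a weight in [0, 1] by raising similarity to a power."""
    return [[0.0 if i == j else abs(s) ** beta for j, s in enumerate(row)]
            for i, row in enumerate(similarity)]


# Hypothetical pairwise similarity (e.g. absolute correlation) between three genes.
sim = [[1.0, 0.9, 0.79],
       [0.9, 1.0, 0.2],
       [0.79, 0.2, 1.0]]
print(hard_threshold(sim))
print(soft_threshold(sim))
```

With the hard threshold, the pair with similarity 0.79 is discarded entirely even though it is almost as similar as a retained pair; the soft scheme keeps it with a reduced weight, which illustrates the information loss mentioned above.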


After constructing a co-expression network, relatively dense regions in the network are
extracted, which are termed as network modules. Each network module represents a regulated set
of genes, which may also correspond to a set of genes associated with similar biological
functions. In this article, we propose a co-expression network construction technique and a
biologically relevant network module extraction technique based on a fuzzy set theoretical soft
thresholding approach.

The extracted network modules have been validated in terms of the well-known p-value,
Q-value and topological statistics using seven benchmark datasets. A gene co-expression network
construction technique based on a soft-thresholding approach was presented. The technique has
been evaluated on seven publicly available benchmark real-life datasets. From the network,
highly co-expressed modules were extracted using the topological overlap matrix and a fuzzy
membership function. A semi-supervised approach to the construction of gene co-expression
networks, with the support of gene ontology tools for better biological significance and
validated using protein interaction data, is underway. Also, improvement of the existing FUMET
to enable handling of shifted and scaled patterns for a large number of datasets is ongoing.

Construction of co-expression networks and extraction of network modules are an appealing
area of bioinformatics research. This article presents a co-expression network construction
technique and a biologically relevant network module extraction technique based on a fuzzy set
theoretic approach. The technique is able to handle both positive and negative correlations
among genes. The constructed networks for some benchmark gene expression datasets are validated
using topological internal and external measures. The effectiveness of the network module
extraction technique has been established in terms of the well-known p-value, Q-value and
topological statistics.

2.4 S. J. Brintha, V. Bhuvaneswari, Clustering Microarray Gene Expression Data Using
Type 2 Fuzzy Logic [2012].

A microarray, also called a DNA chip, gene chip, or biochip, is used to analyze gene
expression. Fuzzy logic is a multivalued logic that allows intermediate values to be defined
between conventional evaluations like true or false, yes or no, high or low, etc. A Type 2 fuzzy
logic approach is used to fuzzify the microarray gene expression data. Then the clustering of
genes is done by using clustering algorithms, and the cluster results are compared with the
proposed Type 2 fuzzy approach. Data mining, a step in Knowledge Discovery in Databases (KDD),
is defined as a process of information extraction. Description focuses on finding
human-interpretable patterns that describe the data. Clustering is a common descriptive task,
also called unsupervised learning, in which the classes are not predefined. A cluster is a
collection of data objects that are similar to one another and thus can be treated collectively
as one group.
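
What distinguishes a Type 2 fuzzy set from an ordinary (Type 1) fuzzy set can be sketched as below: instead of a single membership degree, each value receives an interval bounded by a lower and an upper membership function. This is a generic interval Type 2 illustration, not the specific membership functions used in the reviewed paper; the function names and the `start`, `end` and `blur` parameters are hypothetical.

```python
def ramp(x, start, end):
    """Membership rising linearly from 0 at `start` to 1 at `end`."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


def interval_type2_high(x, start=0.4, end=0.8, blur=0.1):
    """Return (lower, upper) membership of x in an interval Type 2 set 'high'."""
    upper = ramp(x, start - blur, end - blur)  # optimistic boundary
    lower = ramp(x, start + blur, end + blur)  # pessimistic boundary
    return lower, upper


# A normalized expression level of 0.6 is 'high' to a degree somewhere in an interval.
print(interval_type2_high(0.6))
```

The width of the interval expresses uncertainty about the membership function itself, which a Type 1 set cannot represent.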

Gene expression is the conversion of the genetic information present in a DNA sequence
into a unit of biological function in a living cell. It typically involves the transcription of
a gene into RNA followed by translation of the RNA into protein. Microarray analysis is a
technique that is used to determine whether genes are expressed or not. Fuzzy logic differs from
"crisp logic", in which binary sets have two-valued logic: fuzzy logic variables may have a
truth value that ranges in degree between 0 and 1. The objective of this paper is to convert the
numerical values into fuzzy terms and to group the genes into clusters. The specific objective
is to compare the cluster results with the proposed Type 2 fuzzy approach. Clustering, also
called unsupervised learning, groups the input data into subsets, called 'clusters', within
which the elements are similar.

Clustering is a common technique for statistical data analysis used in many fields,
including machine learning, data mining, pattern recognition, image analysis, information
retrieval, and bioinformatics. First, the similarity between individual profiles is measured.
Second, the measurement is used to form the groups or clusters. The goal of clustering is to
group the objects that are similar to one another. The objects in one group are different from
objects in other groups. The main idea of this work is to apply fuzzy logic to
microarray gene data and to compare the proposed approach with the existing approach and
traditional clustering algorithms. The data mining clustering algorithms k-means and hierarchical
clustering were implemented and tested on the yeast microarray gene data. The results indicate
that the fuzzy logic approach gives higher accuracy than the existing clustering algorithms. The
accuracy is further increased by the proposed Type 2 approach compared with the Type 1
approach. As future work, the fuzzy c-means algorithm can be implemented to group genes
with similar functionalities. The proposed work can also be combined with other machine
learning approaches such as neural networks.
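
The two clustering steps described above (measure similarity, then form groups) can be
sketched with a small k-means loop. This is a minimal illustration under stated assumptions:
Euclidean distance is used as the similarity measure, the first k profiles seed the centroids
so that the run is deterministic, and the four expression profiles are invented for the
example. It is not the implementation evaluated in the reviewed paper.

```python
import math


def euclidean(p, q):
    """Distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(profiles, k, iterations=20):
    # Assumption: seed the centroids with the first k profiles (deterministic demo).
    centroids = [list(p) for p in profiles[:k]]
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Step 1: measure similarity and assign each profile to its nearest centroid.
        assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in profiles
        ]
        # Step 2: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical normalized profiles: two low-expression and two high-expression genes.
profiles = [
    [0.1, 0.2, 0.1],
    [0.9, 0.8, 0.9],
    [0.2, 0.1, 0.2],
    [0.8, 0.9, 0.8],
]
assignment, centroids = kmeans(profiles, k=2)

if __name__ == "__main__":
    print(assignment)
```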

Microarray technology helps biologists monitor the expression of thousands of genes in
a single experiment on a small chip. DNA microarrays are rapidly becoming a
fundamental tool in genomic research. Bioinformatics and data mining offer exciting and
challenging research in many application areas, particularly in computational science.
Bioinformatics is the science of managing, mining, and interpreting information from biological
sequences and structures. Fuzzy reasoning rules are used to transform the gene expression levels
of a given dataset into fuzzy values. Data mining refers to extracting or mining knowledge from
large amounts of data. Data mining is an interdisciplinary field that brings together
techniques from machine learning, pattern recognition, statistics, databases, and visualization to
address the issue of information extraction from large databases. The two primary goals of
data mining are prediction and description. Prediction involves using some variables
or fields in the data to predict unknown or future values of other variables of
interest. Bioinformatics is a rapidly developing branch of biology and is highly
interdisciplinary, using techniques and concepts from computer science, statistics,
mathematics, chemistry, and biochemistry. It has many practical applications in different areas
of biology and medicine.

Bioinformatics derives knowledge from computer analysis of biological data. Fuzzy
logic is a superset of conventional (Boolean) logic that has been extended to handle the
concept of partial truth. A fuzzy set is an extension of a crisp set. Crisp sets
allow only full membership or no membership at all, whereas fuzzy sets
allow partial membership. Data clustering is a technique in which information that is
logically similar is physically stored together. The criterion for checking the similarity is
implementation dependent. Mining biological data is an emerging area at the
intersection of bioinformatics and data mining. The comparison is done for both
Type 1 and Type 2 fuzzy approaches.

2.5 Richard Simon, Amy Lam, Analysis of Gene Expression Data Using BRB-Array Tools
[2007].

BRB ArrayTools is an integrated package for the comprehensive analysis of
DNA microarray experiments. It was developed by professional
biostatisticians experienced in the design and analysis of DNA microarray
studies and incorporates methods developed by leading statistical laboratories. The
software is designed to be used by biomedical scientists who wish to have
access to state-of-the-art statistical methods for the analysis of gene expression
data and to receive training in the statistical analysis of high-dimensional
data.

The software provides the most extensive set of tools available for predictive
classifier development and complete cross-validation. It offers extensive links to genomic
websites for gene annotation and analysis tools for pathway analysis. An archive of over
100 datasets of published microarray data with associated clinical data is
provided, and BRB ArrayTools automatically imports data from the Gene Expression Omnibus
public archive at the National Center for Biotechnology Information. The use of gene
expression profiling has increased dramatically, but serious problems in the
analysis of such data in publications are rife. Valid analysis of DNA
microarray experiments requires substantial statistical knowledge, but statisticians
expert in microarray methods are in short supply and not available to many laboratories.
BRB ArrayTools was developed in an effort to broadly share the knowledge gained by
biostatisticians of the Biometric Research Branch of the National Cancer Institute over
a decade of involvement in collaborative microarray investigations and in the
development of statistical methodology for microarray data.

The primary objectives of BRB ArrayTools are: (1) to provide scientists with software
that guides them to use valid and powerful methods appropriate for their
experimental objectives without requiring them to learn a programming language; (2) to
encapsulate into software the experience of professional statisticians who read and
critically evaluate the extensive published literature of new analytic and computational
methods; and (3) to facilitate education of scientists in statistical methods for the analysis
of DNA microarray data.

The next section describes the design of BRB ArrayTools. The following sections provide
an overview of DNA microarray analysis and some of the analysis tools
included in the system. The software also includes tutorials and microarray
datasets. BRB ArrayTools is available free of charge for non-commercial applications. BRB
ArrayTools contains a plug-in feature that allows statistical users to extend the
software using their own functions written in the R statistical
language. The plug-in facility also provides the plug-in developer with a wizard for creating a
graphical user interface for interacting with the BRB ArrayTools user. Thus a developer
who knows the R statistical language can create a professional user
interface that enables his or her function to be distributed and used by the larger
community of BRB ArrayTools investigators. BRB ArrayTools represents a novel
experiment in enabling scientists to equip themselves to take advantage of the
revolutionary changes taking place in biology.

There are currently over 5000 registered users in over sixty countries. Several
hundred publications have cited BRB ArrayTools as being used for their data
analysis. The program is actively being extended to provide additional tools for gene expression
analysis. Our judgments of what to include in BRB ArrayTools are heavily influenced
by our involvement in experimental and methodological microarray research, in the design and
analysis of a variety of large and small experiments, and by interactions with our users. Being
heavily involved in the development of new statistical methods for this type
of analysis forces us to try to stay current with the extensive statistics,
bioinformatics, and machine learning literature and to critically evaluate which
methods represent important advances for the field and should be included in our
software. As important as the inclusion of new tools, however, is the effectiveness of
the software in helping biomedical scientists educate themselves in the use
of advanced statistical and bioinformatic methods to prepare themselves for the
revolution that is taking place in biology.

2.6 Staal A. Vinterbo, Eun-Young Kim, Small, fuzzy and interpretable gene expression
based classifiers [2005].

Complex data mining algorithms, such as support vector machines, neural networks and
logistic regression have been used in the classification of gene-expression data. Usually, they
produce models that are not easily interpretable by biologists and biomedical researchers, given
the high number of variables and parameters. If simple but accurate rules could be induced from
relatively small training samples, interpretation of the models would be greatly facilitated.
Some rule induction methods rely on experts or the literature to directly determine rules to be
used to construct gene networks or the constraints that need to be considered. Given that expert
knowledge is still limited to a small number of genes, data mining approaches have also been
utilized. There have also been reports of supervised rule induction. Some authors take advantage
of tree-like algorithms that are well developed in the literature and have several different
variants. Others use different formulations of association rule induction algorithms.

In this work, we present a rule induction and filtering strategy based on fuzzy sets and
compare it with benchmark logistic regression models. Several classification algorithms rely on
discretization of values into a small number of categories. Discretization is also useful because it
makes data simpler to interpret. The precision of values that are obtained from microarrays looks
high. However, the values are read from scanners that measure the degree of fluorescence in
different color channels. The process is prone to saturation and other artifacts, and biologists
often rely on the experiments to get an idea of the degree of mRNA abundance rather than to
obtain exact values. Given human interpretation of the values in terms of broad categories such
as upregulation, neutral, or downregulation, it may be possible (and desirable) to simplify the
numerical scale into such categories. Furthermore, it may be possible to
use this simplification to produce simple rules that can be easily interpreted by humans, instead
of real numbers that get multiplied by certain coefficients to generate a classifier (as is the case
in other modeling approaches). By associating values of membership in different categories, it is
possible, for example, to describe a sample that is at the high end of the ‘low’ category. Note that
fuzzy memberships do not correspond to probabilities, and their operators differ from
those used in the probabilistic framework.
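
The difference between fuzzy operators and probabilistic ones can be made concrete with a
short sketch. It assumes the standard minimum, maximum, and complement operators for fuzzy
AND, OR, and NOT; the reviewed paper may use other operators, and the membership degrees
below are invented for illustration.

```python
def fuzzy_and(a, b):
    """Conjunction of two membership degrees (minimum operator, an assumption)."""
    return min(a, b)


def fuzzy_or(a, b):
    """Disjunction of two membership degrees (maximum operator, an assumption)."""
    return max(a, b)


def fuzzy_not(a):
    """Complement of a membership degree."""
    return 1.0 - a


# Hypothetical membership degrees for one sample.
gene_a_up = 0.7      # degree to which gene A is "upregulated"
gene_b_down = 0.4    # degree to which gene B is "downregulated"

rule_strength = fuzzy_and(gene_a_up, gene_b_down)

# For comparison only: treating the same numbers as independent probabilities
# would give a product, which is a different quantity.
product_for_comparison = gene_a_up * gene_b_down

if __name__ == "__main__":
    print(rule_strength, product_for_comparison)
```

With these numbers the fuzzy conjunction is 0.4, the degree of the weaker condition, while
the product is 0.28. Membership degrees in different categories also need not sum to one.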

Fuzzy set based calculations have been referred to as ‘computing with words’ rather than
computing with numeric values. The algorithms we present in this paper combine fuzzy
discretization (assignment of degrees of membership to the discrete categories of values such as
low, medium and high, or benign and malignant) and fuzzy operators, with rule induction and
filtering algorithms that are especially developed to produce a small number of short rules that
are useful for model interpretation. In a way, this work extends earlier work in that the
framework for fuzzy discretization is generalized, the induction and filtering of rules is
automated, and a larger number of datasets is used. A rule represents a projection of a set of
data items onto the subdimensions found in the descriptors of its antecedent. As there are many
equivalent projections for each data item that can be used to construct a particular rule, and
the choice among them is made somewhat arbitrarily, an implementation will sometimes give the
user the option of generating rules from multiple projections and combining the results using a
voting scheme.

The result is a larger coverage of the input space. An implementation of this is found in
the ROSETTA system. One of the main drawbacks of this approach is the large number of rules
it produces. To reduce the number of rules, one can filter them empirically, i.e. keep the
ones that seem to work best on selected data. Several ways of filtering propositional rules have
been presented and discussed in the literature. We employ a simple set-cover-based filtering
approach that is efficient with respect to both computational resources and efficacy. As humans
are better able to interpret simple rules such as ‘if A is upregulated and B is downregulated,
then the probability of disease X is high’ than equations involving many coefficients, interaction
terms, and constants, some authors have proposed the use of rule-based systems for the analysis
of data originating from microarrays. There is a broad range of methods used to induce rules from
training data. Two types of algorithms are necessary for the induction of these rules from the
data: algorithms for categorization of continuous values and algorithms for rule discovery and
filtering. Crisp categorization (as opposed to fuzzy discretization) does not take into account
that values at the borderline between categories may be very similar.
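
The idea of set-cover-based rule filtering can be sketched with a greedy heuristic. The text
does not state which set-cover algorithm is used, so the greedy strategy below is an assumption,
as are the function name and the toy coverage table, in which each hypothetical rule is mapped
to the training samples it classifies correctly.

```python
def greedy_rule_filter(rule_coverage):
    """Pick a small subset of rules that together cover every coverable sample.

    rule_coverage maps a rule name to the set of sample ids the rule classifies
    correctly (hypothetical input format).
    """
    uncovered = set().union(*rule_coverage.values())
    selected = []
    while uncovered:
        # Choose the rule that covers the most still-uncovered samples.
        best = max(rule_coverage, key=lambda r: len(rule_coverage[r] & uncovered))
        gained = rule_coverage[best] & uncovered
        if not gained:
            break  # defensive: no rule adds coverage
        selected.append(best)
        uncovered -= gained
    return selected


# Hypothetical coverage table for four induced rules over six samples.
coverage = {
    "rule_1": {1, 2, 3},
    "rule_2": {3, 4},
    "rule_3": {4, 5, 6},
    "rule_4": {1, 6},
}
selected = greedy_rule_filter(coverage)

if __name__ == "__main__":
    print(selected)
```

Here two of the four rules already cover all six samples, so the other two can be dropped
without losing coverage. Greedy selection does not guarantee the smallest possible rule set.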

We present a fuzzy set framework for implementation of classifiers. We illustrate our
implementation using five different gene expression datasets drawn from different laboratories
and measurement technologies. Our models result in short rules that are easy to interpret and
exhibit classification performances that compare favorably with those of logistic regression
models, one of the standard classification methodologies applied in the biomedical domain. This
study has important limitations. The sample sizes were extremely small, and there may be an
optimistic bias reflected in the results even with the use of a 5×2 cross-validation scheme. We
compared our classifiers with those derived using multivariate logistic regression models. The
latter have theoretical limitations that, although difficult to verify in practice, may come into play
in one or more datasets that we utilized. If, for example, the data are not linearly separable, then
more flexible models such as logistic regression models with interaction terms, artificial neural
networks, or support vector machines with non-linear kernels might be more appropriate. More
importantly, although our rules were cross-validated in previously unseen samples, their
biological significance needs to be verified. It may be the case that the rules have statistical
significance but that their translation into biological significance is meaningless. If this is the
case, our argument would not be true in practice. Extensive investigation involving domain
experts and other measurement instruments would be necessary to demonstrate that the approach
we propose is superior to the current status quo.
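
The 5×2 cross-validation scheme mentioned above repeats a two-fold split five times, using
each half once for training and once for testing. The sketch below only generates the index
splits. It assumes plain random halving without stratification, and the function name and seed
are illustrative choices rather than details taken from the reviewed paper.

```python
import random


def five_by_two_splits(n_samples, seed=0):
    """Yield ten (train_indices, test_indices) pairs: five repetitions of two folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    for _ in range(5):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield first, second   # train on the first half, test on the second
        yield second, first   # swap the roles of the two halves


splits = list(five_by_two_splits(10))

if __name__ == "__main__":
    for train, test in splits:
        print(sorted(train), sorted(test))
```

With very small sample sizes each half contains only a handful of samples, which is one reason
the performance estimates can remain optimistic or unstable.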

2.7 Habtom Ressom, Robert Reynolds, Increasing the efficiency of fuzzy logic-based gene
expression data analysis [2003].

The model’s sensitivity to noise is reduced by implementing appropriate methods of
“fuzzy rule aggregation” and “conjunction” that produce reliable results in the face of minor
changes in model input. Through the use of DNA microarray technology, time series data for
gene expression can easily be prepared on the genome-wide scale, allowing the transcription
levels of many genes to be measured simultaneously. With such data, one can attempt to reverse
engineer a network of gene interaction. The benefits of characterizing gene interaction are many,
for example, the effects of drugs on a regulatory pathway can be characterized; tumor
development in cells can be tracked, etc. However, there are many problems that impede
understanding gene interaction, such as identifying synchronization, working with low
expression levels, the existence of splice variants, etc., of which two are considered here.

First, we consider the significant amount of time that is required because of the large
volume of data; the complexity of a regulatory network model increases as the number of genes
used in the model increases, and as a result this may take a long time to examine. Second, we
discuss the fact that any attempt to model or analyze DNA microarray data is likely to be
affected by measurement error. Several methods have been proposed to develop maps of gene
interaction, including linear equations (6, 31), differential equations (2), Boolean networks (15,
21), and fuzzy cognitive maps (7). Woolf and Wang (33) introduced an approach based on fuzzy
rules of known activator/repressor relationships of gene interaction. Using a normalized subset
of Saccharomyces cerevisiae data (3), they applied every possible combination of activators and
repressors for each gene.
The output from the model was compared with the expression levels of the genes. Gene
combinations were ranked based upon the error between the predicted and actual target gene.
Those combinations of genes that had a low error and are characterized by the fuzzy rule base
are the most likely to exhibit an activator/repressor relationship. The method of Woolf and Wang
(33) attempts to simulate what a human would do in comparing expression levels of genes to find
underlying relationships. Different fuzzy models can be developed with the method for different
models of interaction, including coactivators and corepressors, as well as the presence of other
factors in the cell, such as proteins or other factors necessary for transcription. The method is
intuitive, and the results of Woolf and Wang are consistent with the literature on genetic
networks of S. cerevisiae. The model itself is an interesting generalization of Boolean networks,
where genes are not either “on” or “off” but are often both “on” and “off” at the same time.
Computational time is a significant obstacle in developing more complex fuzzy models. A
solution to the problem may lie in preprocessing the data. If it is possible to find triplets of genes
that are unlikely to fit the model before running the algorithm, then the unlikely triplets can be
ignored when the algorithm is run, thus saving an amount of time approximately proportional to
the number of gene triplets not considered at later stages. To address the issue of robustness, two
approaches can be considered.
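
The exhaustive search described above, in which every activator, repressor, and target
combination is scored by the error between predicted and observed target expression, can be
sketched as follows. The decision matrix, the membership functions, and the weighted-average
defuzzification are hypothetical stand-ins chosen for illustration; they are not the rule base
published by Woolf and Wang. Profiles are assumed to be normalized to [0, 1].

```python
from itertools import permutations


def fuzzify(x):
    """Memberships of a value in [0, 1] in the terms low / medium / high."""
    return {
        "low": max(0.0, 1.0 - 2.0 * x),
        "medium": 1.0 - abs(2.0 * x - 1.0),
        "high": max(0.0, 2.0 * x - 1.0),
    }


# Hypothetical decision matrix: (activator term, repressor term) -> target term.
RULES = {
    ("low", "low"): "medium", ("low", "medium"): "low", ("low", "high"): "low",
    ("medium", "low"): "high", ("medium", "medium"): "medium", ("medium", "high"): "low",
    ("high", "low"): "high", ("high", "medium"): "high", ("high", "high"): "medium",
}
CENTERS = {"low": 0.0, "medium": 0.5, "high": 1.0}  # crisp value standing for each term


def predict(activator, repressor):
    """Fuzzify both inputs, fire all rules, and defuzzify by weighted average."""
    a, r = fuzzify(activator), fuzzify(repressor)
    num = den = 0.0
    for (term_a, term_r), out in RULES.items():
        weight = min(a[term_a], r[term_r])  # conjunction of the two antecedents
        num += weight * CENTERS[out]
        den += weight
    return num / den if den else 0.5


def rank_triplets(profiles):
    """Score every ordered (activator, repressor, target) triplet by mean squared error."""
    scored = []
    for act, rep, tgt in permutations(profiles, 3):
        errors = [
            (predict(a, r) - t) ** 2
            for a, r, t in zip(profiles[act], profiles[rep], profiles[tgt])
        ]
        scored.append((sum(errors) / len(errors), act, rep, tgt))
    return sorted(scored)


# Hypothetical time series for three genes over four time points.
profiles = {
    "A": [0.0, 1.0, 1.0, 0.0],
    "R": [0.0, 0.0, 1.0, 1.0],
    "T": [0.5, 1.0, 0.5, 0.0],
}
ranked = rank_triplets(profiles)

if __name__ == "__main__":
    print(ranked[0])
```

The number of triplets grows roughly with the cube of the number of genes, which is why
discarding unlikely triplets before the search saves so much computation.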

The first approach is to improve the microarray technology to increase the signal-to-noise
ratio. Analysis of variance approaches (12) may also provide more accurate readings and
knowledge of signal-to-noise ratios. However, issues such as the stochastic nature of gene
expression may not be eliminated, and some degree of error will have to be dealt with. The
second approach is to use models that are more resilient to minor variations in the expression
data. Fuzzy logic is a generalization of Boolean logic in which the truth value of a statement can
be in the continuous interval between 1 (completely true) and 0 (completely false). Fuzzy logic
can deal with the subjectivity of qualitative measurements. Using membership functions, one can
establish a qualitative set of rules that can model a control system or a physical process. The
rules are expressed in qualitative values that a human can easily understand, for example, “If
the activator is high and the repressor is low, then the corresponding gene’s expression will be
high.” The membership values for all inputs can be
applied to all the rules to obtain a fuzzy output that can be converted back (defuzzification) to a
quantitative value (in this case, the gene’s expression level) through the membership functions.
Fuzzy logic can thus provide a quickly implemented and easy to understand nonlinear
approximation of a process. We show that the use of clustering as a preprocessing step in
building a fuzzy logic-based model can save a significant amount of computation time. Unlike a
previously used modeling approach that tests all combinations of genes, our method applies
cluster centers formed by SOMs to identify genes that are likely to interact. This approach results
in speeding up the gene interaction modeling process and making it feasible to build complex
models including coactivators and corepressors. This will increase the attractiveness of fuzzy
algorithms to reverse-engineer the gene expression network. At present, there is no exact method
for selecting the number of clusters and the percentage of cluster combinations to keep. In our
investigation, using four publicly available datasets, we found that the number of clusters should
be at least half the number of time points in the dataset.
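
A rough count shows why clustering first reduces the search so sharply. The sketch below
assumes that each candidate is an ordered triplet of distinct items (activator, repressor,
target) and applies the heuristic quoted above for the number of clusters. The gene count and
the number of time points are hypothetical values chosen only for the arithmetic.

```python
import math


def ordered_triplets(n):
    """Number of (activator, repressor, target) assignments among n distinct items."""
    return n * (n - 1) * (n - 2)


def min_clusters(n_time_points):
    """Heuristic from the text: at least half the number of time points."""
    return math.ceil(n_time_points / 2)


n_genes = 1000         # hypothetical number of genes in the dataset
n_time_points = 17     # hypothetical number of time points

all_gene_triplets = ordered_triplets(n_genes)
k = min_clusters(n_time_points)
cluster_triplets = ordered_triplets(k)

if __name__ == "__main__":
    print(all_gene_triplets, k, cluster_triplets)
```

Under these assumptions, testing cluster centers requires 504 model evaluations instead of
997,002,000, after which only genes from the best-fitting cluster combinations need to be
examined individually.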

Increasing the number of clusters beyond that number results in some clusters with
similar properties, which has little effect in adding detail to the clusters. Current microarray
technology provides data that is associated with a substantial amount of noise. Hence, any
analysis method that employs microarray data needs to take the effect of noise into account. In
this paper we show that the selection of appropriate fuzzy operators improves the resilience of
the fuzzy algorithm to noisy data. Using clustering as a preprocessing step and applying
appropriate fuzzy operators, we demonstrated that the speed of computation and resilience to
noise of a fuzzy logic-based gene expression data analysis could be substantially increased.
These two improvements are very essential in light of the large volume and noisy data generated
by existing microarray technology.

Our improvements will increase the feasibility of the use of fuzzy logic in analyzing the
relationships between genes with microarray methodologies. The fuzzy model is not limited to
the particular form described in this paper. The fuzzy sets and rules were implemented as
proposed by Woolf and Wang (33). Our aim was to show the viability of our approach in terms
of reducing computation time through clustering and improving robustness using appropriate
fuzzy operators. Future work should be focused on extending the fuzzy model to improve
accuracy and generalization. To accomplish this, three approaches should be investigated. First,
research should be directed to the use of time delayed and instantaneous expression values of
activators and repressors as input to the fuzzy model. We believe that this approach can improve
the prediction of the expression level of the target gene. Second, coactivators and corepressors
should be included in the fuzzy model. This can pave the way toward the creation of a
generalized gene regulatory model that can accommodate any number of genes.

Finally, more fuzzy sets such as very low, low medium, medium high, and very high will
be added to the fuzzy model to make the model response more continuous. This will, however,
require modifying the decision matrix and studying the appropriate fuzzy rules, which will
increase the model complexity. Other approaches such as artificial neural networks or adaptive
regression schemes can be used to capture the relationships between activators and repressors
from the data itself. The performance of these approaches highly depends on the quality of data
available. We believe combining these approaches with the fuzzy logic-based analysis would be
an interesting direction for future work.

DNA microarray technology can accommodate a multifaceted analysis of the
expression of genes in an organism. The wealth of spatiotemporal data generated by this
technology allows researchers to potentially reverse engineer a particular genetic network. “Fuzzy
logic” has been proposed as a method to analyze the relationships between genes and help
decipher a genetic network. This method can identify interacting genes that fit a
known “fuzzy” model of gene interaction by testing all combinations of gene expression profiles.
This work introduces improvements over previous fuzzy gene regulatory models in terms of
computation time and robustness to noise. Improvement in computation time is achieved by
using cluster analysis as a preprocessing method to reduce the total number of
gene combinations analyzed. This approach speeds up the algorithm by a factor of
fifty with minimal effect on the results.

2.8 Peter J. Woolf, Yixin Wang, A fuzzy logic approach to analyzing gene expression
data [2000].

The pharmaceutical industry is beginning to recognize that gene regulation can be useful
both for assaying drugs and as a source of new molecular targets, assuming the regulatory
network is known. As such, changes in gene expression patterns can be used to assay drug efficacy
throughout the drug discovery process. One assay that takes advantage of the existing level of
sequence information and that is complementary to sequence and genetic analysis is gene
expression profiling. Expression profiling technologies such as GeneChip measure the
expression level of thousands of genes simultaneously using an array of oligonucleotides bound
to a silicon surface. These arrays are hybridized under stringent conditions with a complex
sample representing mRNAs expressed in the test cell or tissue. The results from these
expression profiling technologies are quantitative and highly parallel, thereby allowing us to
take an accurate snapshot of the workings of the cell in a particular state.

Expression profiling assays generate huge data sets that are not amenable to simple
analysis. The greatest challenge in maximizing the use of this data is to develop algorithms to
interpret and interconnect results for different genes under different conditions. Currently, most
expression data is analyzed using clustering techniques, algorithms that identify distinct
expression patterns by grouping genes with similar expression patterns (1, 9). Thus clustering
can only distinguish between those genes that have the same and different expression profiles.
However, genes in the cell make up a complex network that cannot be revealed with current
techniques such as clustering. To determine the network describing how the genes interrelate,
more elaborate data mining techniques need to be developed. Fuzzy logic is a technique drawn
from engineering and other applied sciences, where it is used to control systems as diverse as
washing machines and autofocus cameras (2, 10). It provides a way to transform precise numbers,
such as 32.43, into qualitative descriptors, such as “high”, in a process called “fuzzification.”

Although other techniques can be used to change precise values into discrete descriptors, fuzzy
logic provides a systematic and unbiased way to perform this transformation, thereby removing the
need for expert knowledge about the system. If 32.43 is a measure of the ambient air
temperature in degrees celsius, then most people would say that 32.43°C is a high temperature.
But this analysis requires our own expert knowledge, which can vary from person to person.
Someone from a tropical climate may feel that 32.43°C is a medium temperature, whereas
someone from a very cold climate may take 32.43°C as a very high temperature. When dealing
with gene expression data, the problem is even more complicated, because no expert exists to
determine what defines a “high” expression level. Using fuzzy logic, the full range of data is first
measured and is then broken into discrete subsections based on the observed data.
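
The data-driven step described above, in which the observed range is measured first and then
divided into subsections, can be sketched as follows. The sketch assumes min-max scaling and
equal-width subsections with crisp boundaries; a fuzzy version would replace the hard
boundaries with overlapping membership functions. The function names and the raw readings are
hypothetical.

```python
def normalize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # flat profile: no range to scale by
    return [(v - lo) / (hi - lo) for v in values]


def subsection(x, n_sections=3):
    """Index of the equal-width subsection of [0, 1] that x falls in."""
    return min(int(x * n_sections), n_sections - 1)


readings = [12.0, 32.43, 48.0, 20.5]  # hypothetical raw expression values
scaled = normalize(readings)
sections = [subsection(x) for x in scaled]  # 0 = low, 1 = medium, 2 = high

if __name__ == "__main__":
    print(scaled, sections)
```

Because the boundaries come from the observed minimum and maximum, the same raw value of
32.43 could fall in a different subsection for a dataset with a different range, which mirrors
the temperature example: what counts as "high" depends on the data, not on an external expert.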

These discrete subsections then provide a qualitative description of the data. Once
transformed, this qualitative data can be analyzed using heuristic rules, which in turn generate
fuzzy solutions. In addition, in contrast to other automated decision-making algorithms, such as
neural networks or polynomial fits, algorithms in fuzzy logic are cast in the same language used
in day-to-day conversation. As a result, predictions made using fuzzy logic are easily
interpretable and can be extrapolated in predictable ways. Fuzzy logic algorithms are thus able
to recognize a large number of biologically important patterns. Using fuzzy logic, we have
developed an analysis technique that can identify logical relationships between genes and in
some cases even predict the function of an unknown gene.

This algorithm was validated using yeast expression data gathered from the Affymetrix
GeneChip system. By using yeast gene expression data collected at different time points of the
cell cycle, we were able to identify many regulatory elements and their target genes within the
cell that work together to maintain and control certain cellular processes. Several cases are
validated by available experimental results, including the signaling network controlled by the
transcription factors HAP1 and ROX1, which control the transition from anaerobic to aerobic
growth. These results suggest that our fuzzy logic technique can indeed find biologically relevant
connections between sets of genes, which in turn could help to describe the complex web of
interactions. In general, the findings of the algorithm agree well with experimental results from
the literature.

This should not come as a surprise, seeing as the algorithm searches for relationships that
fit our logical understanding of how an activator, repressor, and target should interact. Thus, by
using essentially the same criteria that an experimenter would use to describe the regulatory
function of a protein, the fuzzy logic algorithm approximates the thought process of an expert.
However, in contrast to an expert, the algorithm is automated, unbiased, and general. Expression data can be
difficult to interpret and, if not analyzed properly, can be easily misconstrued. By applying a
computational algorithm to the analysis of the data, we have provided a “lens” through which the
data can be sorted in an unbiased manner, quickly and efficiently. Although in this study the
algorithm was only used to search for triplets of activator, repressor, and target genes, other
variations of the algorithm are also possible.

The choice of the activator repressor model provided a simple method to demonstrate that
this technology can yield biologically meaningful results. However, the technique is general and
can be applied to other relationships and more complicated systems. To a first approximation, the
interaction between multiple proteins is essentially linear; thus the algorithm searched for linear
behavior. However, in the case of multiple redundant promoter binding sites (such as the two
HAP binding sites on the CYC1 gene), this linear approximation is not accurate, causing the
algorithm to miss the connection. This situation could be remedied by including a more
sophisticated “fuzzification” step to include nonlinear effects; however, this added complexity
may only correct for a few missed connections while edging out many of the more common
near-linear relationships. Also, the goal of this algorithm is not to yield quantitative
predictions, but instead to draw general trends that connect the regulation of multiple genes.

Thus, including specific nonlinear effects would not help to draw many connections, but
would end up adding a significant computational burden to an already difficult problem. The
fuzzy logic algorithm found a disproportionately large number of transcription factors in the
roles of activators and repressors; however, not all of the activators and repressors found were
transcription factors. Two possible reasons for this discrepancy are (1) transcription factors are
expressed at low levels and as such are difficult to detect, and/or (2) other gene products such as
enzymes can indirectly regulate transcription. Transcription factors are generally present only at
a very low concentration; thus changes in transcription factor expression levels can be difficult
to detect using current expression profiling techniques.

Presumably, if expression profiling technology were to become more sensitive, then the
fuzzy logic algorithm would detect an even greater bias of transcription factors in the activator
and repressor roles. However, in many cases the expression level of a particular protein is not
governed by the expression of a transcription factor, but instead by the concentration of some
intracellular compound, such as Ca2+ concentration or cAMP levels, which in turn are
controlled by enzymes inside the cell. In these cases, changes in the expression level of the
enzyme have a “transcription-factor-like” effect and would be detected by the algorithm as an
activator or repressor. From a drug design point of view, these “transcription-factor-like”
enzymes are possibly more interesting than true transcription factors, because it is generally
easier to change the activity of an enzyme in the cytosol with a drug than to block a true
transcription factor in the nucleus. Moreover, the data set used in this study came from a single
experiment in which cell cycle control was the main process of study.

Transcription factors that are not involved in pathways related to this cellular process
might not show significant change in their expression and thus could not be evaluated by the
algorithm. To perform a more comprehensive survey on transcription factors, we are analyzing a
data set that includes gene expression profiles of both wild-type and various mutant yeast cells.
Many more transcription factors can be evaluated because the cellular processes they control have
been perturbed. Although the validation of this algorithm was performed using GeneChip data in
this report, the fuzzy logic algorithm should work equally well with other expression profiling
techniques such as SAGE. SAGE has the advantage that it can detect completely unknown proteins,
whereas GeneChip technologies require that at least the sequence of the protein’s mRNA be known.
This ability to detect unknown proteins would be particularly well suited to the functional
characterization that the fuzzy logic algorithm makes possible. Each new data set provides a
different set of expression levels that can be tested to see whether they fit the proposed
regulatory model.

We have developed a novel algorithm for analyzing gene expression data. This algorithm
uses fuzzy logic to transform expression values into qualitative descriptors that can be
evaluated by using a set of heuristic rules. In our tests we designed a model to find triplets of
activators, repressors, and targets in a yeast gene expression data set. For the conditions
tested, the predictions made by the algorithm agree well with experimental data in the
literature. The algorithm can also assist in determining the function of uncharacterized proteins
and is able to find a much larger number of transcription factors than can be found at random.
This technology extends current techniques such as clustering in that it allows the user to
generate a connected network of genes using only expression data. For example, the heuristic rule
“if high then move fast” takes “high” as a fuzzy input and “fast” as a fuzzy output. There are
three main benefits of applying fuzzy logic to the analysis of gene expression data.

First, fuzzy logic inherently accounts for noise in the data because it extracts trends, not
precise values. Third, fuzzy logic techniques are computationally efficient and can be scaled to
include an unlimited number of components. In this work we present a fuzzy logic based algorithm
for analyzing gene expression data. The approach can be extended to describe complete networks
of gene interactions based on expression data alone. However, using fuzzy logic to analyze
expression data does have some limitations. An additional advantage of the fuzzy logic algorithm
is that data can come from any source within an organism (tissue, cell type, treatment, or
physiological state), and the output is actually improved by a deeper and more diverse data set.
The reason for this improvement is that the algorithm needs to observe changes in the expression
level of a protein relative to changes in other expression levels.

In our studies, several data sets were eliminated solely because they did not sufficiently
explore the combinations of expression levels (too high a letter value), making their predictions
impossible. By including data sets from cells in different states, the algorithm gains more
information about the details of the regulatory network.

2.9 Petri, Mikko Kolehmainen, Analysis of gene expression data using self-organizing maps
[2000].

We have here applied the SOM algorithm to analyze published data of yeast gene
expression and show that SOM is an excellent tool for the analysis and visualization of gene
expression profiles. Our results demonstrate that SOM is a fast and convenient method to organize
and interpret gene expression data. Expression profiles of a group of genes are represented by a
common weight vector and the data are easily visualized as a two-dimensional matrix. Application
of Sammon's algorithm transforms the data into a format where the relationship between individual
neurons is even more clearly visualized. This is particularly the case for those genes which are
clearly regulated by the applied treatment. Similarity reflects participation in a common pathway
or regulation by a common regulatory element in a promoter region, or both. Elucidation of the
functional role of these genes is currently being attempted by examining the phenotype of yeast
with a null mutation of these genes. Our results suggest an alternative approach to identifying a
functional role for an unknown gene. Gene expression profiles from yeast exposed to a variety of
different environmental conditions are becoming publicly available. Localization of this
homologue in yeast SOMs may also suggest a possible functional role. Furthermore, the entire
genomic sequences of Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens are
being finalized.

A self-organizing map (SOM, Kohonen's map) is an unsupervised neural network
algorithm which has successfully been used to analyze very large data files in various
applications, such as process monitoring and visualization, exploratory data analysis and
simulation of brain-like feature maps. We have here investigated whether SOM and Sammon's mapping
algorithms can be applied to the analysis of gene expression data. We have analyzed a publicly
available database on the changes in gene expression during a diauxic shift (shift from anaerobic
to aerobic metabolism) in yeast using prototype software. Our results show that SOM
rapidly and reliably clusters this gene expression data set into groups that not only show similar
gene expression profiles but also contain functionally related genes. DNA microarray
technologies, together with rapidly increasing genomic sequence information, are leading to an
explosion in available gene expression data. Currently there is a great need for efficient methods
to analyze and visualize these massive data sets.

A self-organizing map (SOM) is an unsupervised neural network learning algorithm that has
been successfully used for the analysis and organization of large data files.
Examination of the gene contents of individual neurons demonstrates that in many cases, the
clustering achieved by SOM reliably predicts functional similarity. In
contrast, genes that were not regulated at the transcriptional level during the diauxic shift are
grouped together on a more random basis, and the clustering of non-regulated genes into the
same neuron does not appear to predict functional similarity. The yeast genome still contains
thousands of genes with unknown function. Problems arise when null mutants do not show any
clear phenotype.

Applying SOM to these different data sets may reveal conditions under which this
particular gene is regulated and, since clustering of regulated genes appears to predict
functional similarity, comparison with other genes in the same neurons offers valuable hints
concerning the possible functional role of the unknown gene. Many mammalian genes with an unknown
function have a homologue in the yeast genome. Since SOMs usually work better when more data are
provided, this method may become a valuable tool in the organization and interpretation of
mammalian gene expression data. SOM allows easy visualization of complex data and is robust to
minor experimental variation. Moreover, the algorithm places genes with similar, but not
identical, profiles in neighboring groups, creating a smooth transition of related profiles over
the entire matrix.

CHAPTER III
METHODOLOGY

3.1 DNA Microarray Technology

DNA microarrays attempt to analyze the expression of different genes in parallel on any
scale up to the entire genome of an organism. The construction of microarrays begins with the
production of complementary DNA (cDNA) segments that represent each gene. Each segment
is the complement to the actual DNA sequence of a gene and differs from the corresponding
mRNA sequence only in that thymine in cDNA replaces uracil in mRNA. Each spot on the
microarray is created by depositing copies of one gene's cDNA sequence on a glass slide or
other substrate by a high speed robotic process that physically binds the sequence to a small spot
on the slide. A new spot is created for every gene sequence to be used in the microarray. The
substrate and the spots of DNA sequences are collectively known as the microarray. Each spot is
referred to as a probe.

Figure 3.1 A scanned DNA microarray after hybridization

To measure gene expression for a cell population, mRNA is extracted from the cells and
is reverse-transcribed into complementary DNA (cDNA). This cDNA sequence is identical to the
DNA sequence for the gene found in the nucleus and is thus complementary to the cDNA probes
on the microarray chip.

The concentration of each sequence is multiplied proportionally through chemical
reactions. Chemical dyes (often green and red in microarray experiments) are bound to the
sequences to allow for subsequent analysis of concentration. A solution of this dyed cDNA is
created and exposed to the microarray. On the microarray, the cDNA sequences bind, or
hybridize, to the probes that contain their complementary sequence. After a prescribed amount
of time, the remaining cDNA solution is washed off the chip. What remains are the probes and
the cDNA sequences that hybridized with them. The microarray is scanned with a laser set at the
wavelength of the dye's color. The fluorescent intensity of each spot indicates approximately how
many copies of the gene are bound to the spot, and thus, a relative perspective of the expression
of that gene in the cell. The appearance of a scanned microarray is shown in Figure 3.1.
Unfortunately, the fluorescence alone tells us very little when the gene expression from only one
population is used; we cannot directly correlate the fluorescence of a probe to the copies of a
gene on that probe.

To alleviate the problem, we can add a second population whose cDNA sequences were
treated with a different dye. This second population can be used as a control population; in the
case of time series data, the second (control) population is often the cell population at a fixed
point of time while the first population is the same cell population at a later time. The two dyes
should have colors of significantly different wavelengths to avoid "crosstalk", i.e., a situation
where one dye affects the measured fluorescence of the other. The relative difference in
fluorescence of the two dyes on a particular spot should tell us how much a gene's expression
differs between the two populations.

Expression levels can be reported as some form of difference between the two
fluorescence intensities, such as a ratio. Gene expression profiles can be assembled from a series
of these differential values at different points in time. The technology is young and still has some
problems. First, the fluorescence signal is unlikely to exactly match the level of expression of each
gene. The probe solution used is far from a free solution; the distribution of a certain cDNA
sequence through the solution is not even. This problem may be partially alleviated by devoting
several spots on the microarray to each gene and averaging the results, but it cannot guarantee
the elimination of the problem. cDNA probes with similar, but not identical, sequences to a
particular spot on the microarray may still hybridize to the spot with mixed results, exaggerating
the expression of one gene, possibly at the expense of another. Many other issues may also exist.
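
The ratio-based reporting and the averaging of replicate spots described above can be sketched as follows. This is only an illustrative Python sketch, not the procedure used by any particular scanner software: the function names, the use of a base-2 logarithm, and the `floor` clamp for very weak signals are all assumptions introduced here.

```python
import math

def log_ratios(sample, control, floor=1.0):
    """Return log2(sample / control) for each spot.

    `floor` is an assumed lower clamp that keeps very weak or zero
    intensities from producing undefined or extreme ratios.
    """
    ratios = []
    for s, c in zip(sample, control):
        s = max(s, floor)
        c = max(c, floor)
        ratios.append(math.log2(s / c))
    return ratios

def average_replicates(values_by_gene):
    """Average the values of several spots that measure the same gene."""
    return {gene: sum(v) / len(v) for gene, v in values_by_gene.items()}
```

A positive value means the gene is more strongly expressed in the sample population than in the control, and a negative value means the opposite.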

3.2 Woolf and Wang's Fuzzy Gene Model Algorithm

As previously mentioned, a fuzzy model of gene interaction would be a generalization of
a simple Boolean "on" and "off" model; it realizes that transitionary states exist and attempts to
account for them. Woolf and Wang's algorithm starts by selecting an appropriate subset of genes
to analyze. The differential threshold only accepts genes whose expression changes by a factor
of at least 3; that is, the ratio between the gene's highest and lowest expression level should be at
least 3. This ratio exceeds the minimum detectable change (as determined by the estimated noise
level) and ensures that the genes used are ones that change significantly over the time series and
thus have switched "on" or "off". The minimum expression threshold is set to eliminate genes
whose highest expression level is below a certain level. Differential changes are greater for a
given absolute change if the overall expression level is low; a difference of 30 between local
maxima and minima means a greater differential change when the minimum is 30 than when it is
300. Genes with low expression levels are thus more likely to be distorted by noise and will not
serve well in analysis.
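
The two selection thresholds described above can be illustrated with a short Python sketch. The factor of 3 comes from the text; the value of `min_peak`, the function names, and the decision to reject profiles containing non-positive values (for which a ratio is undefined) are assumptions made for this example only.

```python
def passes_filters(profile, min_fold=3.0, min_peak=100.0):
    """Check one expression profile against both thresholds.

    min_fold: required ratio between highest and lowest expression.
    min_peak: assumed minimum for the highest expression level; the
              text does not state a numerical value.
    """
    lowest, highest = min(profile), max(profile)
    if highest < min_peak:
        return False  # peak expression too low to be trusted
    if lowest <= 0:
        return False  # ratio undefined; rejected here by assumption
    return highest / lowest >= min_fold

def select_genes(dataset, min_fold=3.0, min_peak=100.0):
    """Keep only the genes whose profiles pass both thresholds."""
    return {gene: profile for gene, profile in dataset.items()
            if passes_filters(profile, min_fold, min_peak)}
```

For instance, a change from 30 to 60 is a two-fold change, whereas the same absolute change from 300 to 330 is only a 1.1-fold change, which is why the ratio and the minimum expression level are checked separately.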

Figure 3.2 Fuzzy membership functions for gene expression data

Once the subset of genes has been obtained, the data is fuzzified. Each gene is
normalized to a scale of 0 to 1, where 1 is the highest expression level and 0 is the lowest (these
are the "on" and "off" positions of Boolean data). The normalized information is fuzzified into 3
fuzzy qualifiers, "Low", "Med", and "High". The membership functions are shown in Figure 3.2.
All possible triplets of genes are applied to a model of gene interaction. From the model, we get
two outputs.

Figure 3.3 Fuzzy model of gene interaction from

The first is the model's output, which is compared to the target gene and scored on the basis
of the Mean Squared Error between the modeled target output and the actual target output. The
second output is a variance; the variance returned is the variance between the total degree to
which each rule was fired over the time series. This serves as a confidence value for the model
output; lower variances imply that all rules were fired nearly equally and the model output is
indicative of the model as a whole. This does not imply that a gene triplet with a high model
variance cannot produce a valid output. However, a high variance combination has not been
equally handled by all rules in the model and we cannot be sure that the model would fit the
triplet in other parts of the model's output space.

Thus, the variance serves as a secondary supplement to the error score; it may have some
importance when comparing the validity of two gene triplets with nearly identical error scores,
but may not be as valid in comparing gene triplets with significantly different error scores.
Both the error and the variance are multiplied by 1000 to obtain a score that is easier to read. In
both cases, lower scores are better; low scores imply low error or variance. All possible triplets
of genes (with each gene serving in each role in the model) may be examined in this manner.
The program records the error and variance of each triplet. To save computation time and
memory, error and variance limits can be set. If a triplet has a higher error than the specified
limit, it will not be recorded in the results.
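
The scoring described above can be sketched in Python as follows. The sketch assumes that the modeled target output and the total firing strength of each rule have already been computed; the function names and the population form of the variance are assumptions, and only the multiplication by 1000 and the use of limits are taken from the text.

```python
def error_score(predicted, actual):
    """Mean squared error between modeled and actual target, times 1000."""
    n = len(actual)
    return 1000.0 * sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n

def variance_score(rule_totals):
    """Variance of the total firing strength of each rule, times 1000.

    rule_totals holds, for each rule, the degree to which it fired
    summed over the whole time series.
    """
    n = len(rule_totals)
    mean = sum(rule_totals) / n
    return 1000.0 * sum((t - mean) ** 2 for t in rule_totals) / n

def record_if_within_limits(results, triplet, error, variance,
                            max_error, max_variance):
    """Append the triplet to results only if both scores are within limits."""
    if error <= max_error and variance <= max_variance:
        results.append((triplet, error, variance))
```

A variance score of zero means every rule fired to the same total degree, which is the case in which the error score is most representative of the model as a whole.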

If a particular activator and repressor combination has a variance above the specified
limit, no other triplets with those genes in the activator and repressor positions will be examined.
First, Woolf and Wang attempted to find known gene relationships in the results. As an example in
the paper, they were able to find many known relationships to the gene HAP1. Second, when a pair
of genes appears in many triplets in the same positions (e.g., gene A as an activator and gene C
as a target in many gene triplets), it is likely that the two are related and that the third gene
in those triplets is relatively irrelevant to the model. The most frequently appearing pairs were
usually biologically related. Finally, they examined the presence of transcription factors in the
results.

Transcription factors have a direct effect upon transcription of genes and will thus have a
profound impact on gene expression. Logically, the lowest error results should have a
disproportionate number of transcription factors; that is, the probability of finding a transcription
factor in a low-error gene triplet should be better than the probability of finding a transcription
factor in the input dataset. The appearance of a particular gene triplet in the results does not
necessarily mean that the modeled relationship exists between the two genes. Since there are
fewer time steps than there are genes, there is no way to develop a unique solution. Due to the
stochastic nature of the network, it is unlikely that a unique solution would be right even if one
were found. The validity of the results can be further strengthened through the results of different
datasets and the union of the results of these disparate datasets to find common links.

First, it is assumed that the time from a gene's transcription to its translation into a protein
is negligible; that is, the expression level of a particular gene is directly proportional to the
presence of its corresponding protein at any point in time. This is generally not true; reaction
times of the system are relatively slow and possibly not constant. If we assume that the time from
transcription to translation is generally consistent, then we can simply view the gene expression
data as the protein expression data with a time shift. Second, the model is deterministic, i.e.,
there can be only one model output for a particular expression level input. However, as
discussed, gene interaction is increasingly shown to be a stochastic process: "the number of
transcription factors in a cell is often low; the environment in which the gene regulatory
interactions occur is far from free solution; and the reaction kinetics is relatively slow." The
averaging of expression levels through using a large number of cells may eliminate some of
these effects (by making the overall solution closer to "average" and uniform), but may also
distort regulatory networks to some degree, hindering our ability to extract important
relationships from the genetic network. Care must be taken to ensure that averaging does not
distort these networks. The model is intuitively pleasing; the method is similar to that of a human
expert attempting to find relationships through time series data.

Fuzzy logic deals with uncertainty and ambiguity; it handles qualitative data and may
handle the nonlinear, stochastic nature of the data better than other deterministic models, such as
the Boolean model where the gene is either "on" or "off" and nothing in between. It is also
expandable to model any known relationships between genes and can be modified to handle time
delays or multiple activators and/or repressors. However, as discussed in the introduction, there
are several problems with the model that endanger its viability as an analysis method. First, the
algorithm has a high algorithmic complexity; all permutations of three genes must be analyzed,
giving the algorithm an O(N^3) complexity.

The complexity of the model is directly related to the number of inputs; a model with two
activators and two repressors would have an O(N^5) complexity. With each additional input, the
run time of the algorithm increases by orders of magnitude. A model with two co-activators has
been shown (in our experiments) to take more than 200 times longer than a model with only one
activator. Adding another input would likely increase the run time by a similar factor, making
the analysis of complex relationships nearly impossible without extremely powerful computers.
Second, the output space of the model is highly irregular and thus vulnerable to large changes in
output for a small change in input. Since the error of microarray data can be up to 30% or more,
it is likely that the output of the model can be highly inaccurate.
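
The growth in the number of candidate models discussed above can be made concrete with a small counting sketch in Python. The function name and the example of 100 genes are assumptions for illustration; the count treats every role as distinct, so it slightly over-counts models in which two inputs are interchangeable, but the order of growth is the same.

```python
def ordered_selections(n_genes, n_roles):
    """Number of ways to assign distinct genes to n_roles ordered roles."""
    count = 1
    for i in range(n_roles):
        count *= n_genes - i
    return count

# One activator, one repressor and one target: three roles.
triplets = ordered_selections(100, 3)
# Two activators, two repressors and one target: five roles.
quintuplets = ordered_selections(100, 5)
```

For 100 genes this gives 970200 triplets but 9034502400 five-gene combinations, which illustrates why each additional input raises the run time so sharply.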

3.3 Proposed Work

Figure 3.4 Block diagram of proposed method

Step 1 Pre-processing

High-dimensional microarray gene expression datasets comprise thousands of genes,
out of which some genes do not show any interesting changes during experimentation. Therefore,
these genes are removed, which helps in the effective analysis of the gene expression dataset.
Figure 3.5 represents the filtering steps used in the proposed method for microarray gene
expression datasets. First, the genes with missing values are identified and then indexing
commands are used to remove them. Missing values in gene expression data are caused by
scratches and dust on the microarray glass slide while monitoring gene expression.

Figure 3.4 Decision matrix for Woolf and Wang model

Figure 3.5 Filtering steps for microarray gene expression dataset

If genes containing missing values are not removed, they will lead to wrong
interpretation. Gene expression profiles whose values differ by only a very small amount, i.e.
profiles that have very low variance and are flat in nature, are filtered out as they do not
produce any interesting results. The main aim of this pre-processing step is to ensure that a
wide enough dynamic range of measurements exists in the different gene profiles. The formula used
to calculate the variance of the gene profiles is given below

where N is the total number of genes, Gi is the gene expression for gene i and m
represents the mean. Then, the genes with low absolute expression levels are filtered out. The
low absolute expression of genes is due to poor spot hybridization and large quantisation errors
in the microarray experiment.
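
The first three filtering steps (missing values, low variance, low absolute expression) can be sketched in Python as follows. This is a minimal sketch and not the MATLAB implementation used in this work: missing values are assumed to be represented by `None`, the variance is assumed to be the population form (sum of squared deviations from the mean divided by the number of values), and the thresholds `min_variance` and `min_abs` are hypothetical parameters.

```python
def variance(profile):
    """Population variance of one gene profile (assumed normalisation)."""
    n = len(profile)
    m = sum(profile) / n
    return sum((g - m) ** 2 for g in profile) / n

def filter_genes(profiles, min_variance, min_abs):
    """Drop genes with missing values, flat profiles or low absolute levels."""
    kept = {}
    for gene, profile in profiles.items():
        if any(v is None for v in profile):
            continue  # missing measurement
        if variance(profile) < min_variance:
            continue  # flat profile
        if max(abs(v) for v in profile) < min_abs:
            continue  # low absolute expression
        kept[gene] = profile
    return kept
```

The order of the checks follows the order of the filtering steps in the text, so that the variance is never computed on a profile that contains a missing value.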

The low absolute value filter is implemented because gene profiles having
low gene expression are believed to be less reliable than gene profiles having high gene
expression. Lastly, the genes having low entropy are removed. The effectiveness of the genes is
calculated by using the entropy filter method. Low entropy means less effective genes. The motive
for removing low entropy genes is to mitigate the spiking behaviour shown by some gene profiles
present in a particular gene expression dataset. The formula used to calculate the entropy of the
gene profile is given below

where p() is the probability function and i indexes the samples/time points of the gene
expression dataset.
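
One common way to compute the entropy of a gene profile is to bin its values into a histogram and apply the Shannon formula to the bin frequencies. The Python sketch below takes that approach; the number of bins, the base-2 logarithm, and the function name are assumptions, and the exact formula used in this work may differ.

```python
import math

def profile_entropy(profile, bins=4):
    """Shannon entropy (in bits) of a profile's histogram.

    The values are placed into `bins` equal-width bins between the
    minimum and maximum of the profile (an assumed binning scheme).
    """
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return 0.0  # a constant profile carries no information
    counts = [0] * bins
    for v in profile:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(profile)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)
```

A profile that is flat except for a single spike places almost all of its values in one bin and therefore has low entropy, which is the behaviour the entropy filter is intended to remove.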

Step 2 Clustering

K-means clustering outperforms hierarchical clustering, as the performance of
hierarchical clustering keeps decreasing as the number of genes increases, which leads to an
increase in the execution time. The other benefit of using k-means clustering over hierarchical
clustering is that the genes are evenly distributed among the clusters produced by k-means
clustering, while in hierarchical clustering the genes are mostly concentrated in two or three
clusters. K-means clustering produces better results for high-dimensional microarray gene
expression datasets, whereas hierarchical clustering produces good results for small datasets.

In this work, clusters are generated using the k-means clustering technique. First, the
networks of cluster centroids (CCs) are generated. The cluster centres network is then examined
to check which cluster centres fit the method well. Those clusters whose cluster centres do not fit
in the network well are discarded and the remaining clusters are considered for further examination.
This step helps to decrease the computational time of the proposed method to some extent
and also decreases the complexity of the network. After the pre-processing of the microarray gene
expression dataset, clustering is performed to divide the problem into sub-problems. Two
different clustering techniques, hierarchical clustering and k-means clustering, are implemented
on the input microarray gene expression dataset in order to select the most efficient
technique. Therefore, k-means clustering is used in the proposed method for the
reverse engineering of GRN.
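
The k-means step can be illustrated with a minimal pure-Python sketch of the standard Lloyd iteration. It is not the MATLAB routine used in this work; the function names, the optional `initial` centroids, the squared Euclidean distance and the stopping rule are assumptions chosen to keep the example short.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, initial=None, iterations=100, seed=0):
    """Cluster gene profiles into k groups; returns (labels, centroids)."""
    if initial is None:
        # Assumed initialisation: k profiles drawn at random.
        initial = random.Random(seed).sample(profiles, k)
    centroids = [list(c) for c in initial]
    labels = None
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids
```

The returned centroids play the role of the cluster centres whose network is examined before the individual clusters are processed.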

Step 3 Fuzzy inference system for the proposed method

Further, the genes present in the selected clusters are normalized on the scale of 0-1 using
the Min-Max technique. The output of the proposed fuzzy inference system is classified into five
qualifiers {Medium Increase (MI), High Increase (HI), Insignificant (I), Medium Decrease
(MD), High Decrease (HD)}. After fuzzification of the expression levels of genes, genes are further
classified into Activator (A), Repressor (R) and Target (T) genes. The target gene is estimated
for each pair of activator and repressor for every sample present in the microarray gene
expression dataset using the fuzzy rules defined in the Fuzzy Decision Matrix. Then the normalized
values of the input gene expression levels are fuzzified using the qualifiers High,
Medium and Low based on the following fuzzy membership functions.
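
The Min-Max normalization and the fuzzification of the inputs can be sketched in Python as follows. The exact membership functions are those of Figure 3.6; since their shapes cannot be read from the text, this sketch assumes triangular functions peaking at 0, 0.5 and 1, and the function names are hypothetical.

```python
def min_max(profile):
    """Scale a profile to the range 0-1 (Min-Max normalization)."""
    lo, hi = min(profile), max(profile)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in profile]  # constant profile, by assumption
    return [(x - lo) / span for x in profile]

def fuzzify(x):
    """Degrees of membership of a normalized value in Low, Medium, High.

    Assumed triangular shapes: Low peaks at 0, Medium at 0.5, High at 1.
    """
    low = max(0.0, 1.0 - 2.0 * x)
    medium = max(0.0, 1.0 - abs(2.0 * x - 1.0))
    high = max(0.0, 2.0 * x - 1.0)
    return {"Low": low, "Medium": medium, "High": high}
```

With these assumed shapes, a normalized value of 0.25 is half "Low" and half "Medium", which shows how a fuzzy qualifier captures the transitional states that a Boolean model ignores.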

Step 4 Ranking of triplets

After successful completion of fuzzification, the target value is defuzzified using
the centroid method to get the final target value. After the defuzzification, the Estimated
Target Value (ETV) and Actual Target Value
(ATV) are compared with each other to calculate the error and the variance.
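
Centroid defuzzification can be sketched numerically in Python as shown below. The five output qualifiers are taken from the text, but their membership functions are not specified there; the sketch assumes triangular functions on an output axis from -1 to 1 with the centres listed in `CENTERS`, clips each function at its firing strength, combines them with the maximum, and approximates the centroid on a regular grid.

```python
# Assumed centres of the five output qualifiers on a -1..1 axis.
CENTERS = {"HD": -1.0, "MD": -0.5, "I": 0.0, "MI": 0.5, "HI": 1.0}
HALF_WIDTH = 0.5  # assumed half-width of each triangle

def tri(x, center, half_width=HALF_WIDTH):
    """Triangular membership function centred at `center`."""
    return max(0.0, 1.0 - abs(x - center) / half_width)

def defuzzify_centroid(strengths, steps=401):
    """Centroid of the aggregated output set.

    strengths maps a qualifier name to its firing strength in [0, 1].
    """
    num = den = 0.0
    for i in range(steps):
        x = -1.0 + 2.0 * i / (steps - 1)
        # Clip each output set at its firing strength, then take the maximum.
        mu = max(min(s, tri(x, CENTERS[q])) for q, s in strengths.items())
        num += x * mu
        den += mu
    return num / den if den else 0.0
```

When only "Insignificant" fires, the centroid lies at zero; when "Medium Increase" and "High Increase" fire together, the result lies between their centres.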

Figure 3.6 Fuzzy input membership functions

Figure 3.7 Decision matrix for proposed method

The mathematical formulation for the calculation of the Mean Squared Error (MSE) is as
follows

To find out the map of interactions in genes, triplets are ranked. Firstly, the threshold
limit (a) for the residual score is set to 1%. Triplets whose values lie below
the threshold limit are considered further for ranking. The threshold limit reduces the time
complexity to some extent. Selected gene triplets are ranked on the basis of residual score, which
means that low variance and low error give a higher rank. Resulting triplets are added to the
resultant matrix (Mr) and, on the basis of Mr, a candidate Gene Regulatory Network (cGRN), i.e. a
sub-network, is generated. The above process is repeated for every cluster and all
cGRNs are obtained. Then, for each cGRN, the gene representative of that network is determined.
Finally, all the cGRNs are merged using their gene representatives. A complete network of genes
is formed as the output. The Pseudo Code for carrying out the reverse engineering of GRN for
each cluster.
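
The thresholding and ranking of triplets for one cluster can be sketched in Python as follows. This is an illustrative sketch only: it assumes that the residual score is the error of the triplet expressed as a fraction (so that 1% corresponds to 0.01), that ties are broken by variance, and that each retained triplet contributes one activating and one repressing edge to the candidate network. All names are hypothetical.

```python
def rank_triplets(scored, threshold=0.01):
    """Keep triplets below the threshold and rank them, best first.

    scored: list of (activator, repressor, target, error, variance).
    """
    kept = [t for t in scored if t[3] <= threshold]
    return sorted(kept, key=lambda t: (t[3], t[4]))

def candidate_network(ranked):
    """Build the edge set of a candidate GRN from ranked triplets."""
    edges = set()
    for activator, repressor, target, _error, _variance in ranked:
        edges.add((activator, target, "activates"))
        edges.add((repressor, target, "represses"))
    return edges
```

Running these two functions once per cluster yields one candidate network per cluster, which can then be merged into the complete network.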

CHAPTER IV
RESULT AND DISCUSSION

4.1 Dataset

The time course gene expression datasets of yeast are taken into consideration for the
validation of the proposed method. The first dataset of yeast is taken during its diamide
treatment (GEO Accession Number GDS37) and the other is taken while it is shifting from
anaerobic to aerobic growth (GEO Accession Number GDS3030). The yeast data for the diamide
treatment contain 6385 genes at seven different experimental conditions, whereas the data for the
shift from anaerobic to aerobic growth contain 6307 genes at six different conditions. The proposed
method is validated by employing FunCoup 3.0, a database of genome-wide functional coupling
networks.

The implementation of the proposed method is done in MATLAB R2015a. Different
genes are selected randomly from both datasets using a random function for the reverse
engineering of GRN. The randomly selected genes from GEO Accession Number GDS37 are
YGR066C, YGR201C, PEX14, DBP1, UGX2, UIP4, NQM1, UPS2,
YFL054C, GPX1 and SNA3. The genes selected from the other dataset, i.e. GDS3030, are
YLR108C, ARA1, ADH5, TMT1, ARG3, ARG56, CPA2, ARG4, RTC2, ICY2 and INA1.

To measure the performance of proposed method, Sensitivity, Specificity and F-Score are
the measures that are taken into consideration. These parameters are defined as follows
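
The three measures can be computed from the counts of true positive (TP), false positive (FP), true negative (TN) and false negative (FN) edges, obtained by comparing the predicted network with the reference network. The Python sketch below uses the standard textbook definitions; it is assumed, but not confirmed by the text, that these are the definitions used in this work, and the function name is hypothetical.

```python
def performance(tp, fp, tn, fn):
    """Return (sensitivity, specificity, f_score) from edge counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    # F-Score: harmonic mean of precision and sensitivity.
    f_score = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, specificity, f_score
```

Sensitivity measures how many reference edges are recovered, specificity how many absent edges are correctly left out, and the F-Score balances precision against sensitivity.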

Figure 4.1 GRN generated by FunCoup 3.0 for Dataset with GEO Accession
Number: GDS37.

Figure 4.2 GRN generated by Proposed Method for Dataset with GEO Accession
Number: GDS37

Table 4.1 Performance measures of dataset GDS37

The GRN generated by FunCoup 3.0 for the microarray gene expression dataset with GEO
accession number GDS37 is shown in Figure 4.1, while the network generated by the proposed
method for the same dataset is shown in Figure 4.2. Table 4.1 shows the performance measures
for different numbers of genes in the microarray gene expression dataset with GEO Accession
Number: GDS37.

Figure 4.3 GRN generated for Dataset with GEO Accession Number: GDS3030

Figure 4.4 GRN generated for Dataset with GEO Accession Number: GDS3030.

Similarly, the GRN generated for the other dataset with GEO accession number GDS3030 is
shown in Figure 4.3, whereas that of the proposed method is shown in Figure 4.4. Table 4.2
shows the performance measures for different numbers of genes in the microarray gene
expression dataset with GEO Accession Number: GDS3030. The performance measures of both
microarray gene expression datasets are compared with the existing method. Figure 4.5
represents the F-Score comparison between the proposed method and the existing method for the
microarray gene expression dataset with GEO accession number GDS37. Figure 4.6 represents the
same F-Score comparison for the microarray gene expression dataset with GEO accession number
GDS3030.

Table 4.2 Performance measures of GDS3030.

Figure 4.5 F-Score Comparison of dataset GDS37 with method by Al-Shobaili 2014

Figure 4.6 F-Score Comparison of dataset GDS3030 with method proposed by Al-Shobaili
2014.

CHAPTER V
CONCLUSION

Gene regulatory networks are complex networks and are generated from microarray gene
expression datasets. A fuzzy logic based technique is robust and can tackle incomplete and
imprecise microarray gene expression datasets. The Woolf and Wang fuzzy logic based model is
the most basic technique for the reverse engineering of GRNs, but it has a number of
limitations, such as high time complexity and low efficiency. A novel fuzzy logic based
method is presented which introduces various significant steps, such as filtering and
clustering, that increase its effectiveness. Threshold limits are defined to decrease the
time complexity of the proposed method to some extent. This method is tested on two
different datasets of yeast with GEO accession numbers GDS37 and GDS3030. The results for
both datasets are compared with the help of an online database. The results are further
compared with the existing method. The experimental results show that the proposed fuzzy
logic based method for reverse engineering of GRNs provides results comparable with the
other existing method in terms of the F-Score. In future, the method can be tested for
scalability and the generated GRN can be validated. In this work, some novel interactions
are found among the genes which need to be clinically tested in the laboratory.
