7

Computational Learning Theory

The study of machine learning involves the following questions in seeking general laws that may govern machine learners.
1. Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm?
2. Can one characterize the number of training examples necessary or sufficient to assure successful learning?
3. How is this number affected if the learner is allowed to pose queries to the trainer, versus observing a random sample of training examples?
4. Can one characterize the number of mistakes that a learner will make before learning the target function?
5. Can one characterize the inherent computational complexity of classes of learning problems?
The computational theory of learning provides answers to these questions by presenting key results within particular problem settings. We focus here on the problem of inductively learning an unknown target function, given only training examples of this target function and a space of candidate hypotheses. Within this setting, we will be chiefly concerned with questions such as:
1. How many training examples are sufficient to successfully learn the target function?
2. How many mistakes will the learner make before succeeding?
We can set quantitative bounds on these measures, depending on attributes of the learning problem such as:
• the size or complexity of the hypothesis space considered by the learner
• the accuracy to which the target concept must be approximated
• the probability that the learner will output a successful hypothesis
• the manner in which training examples are presented to the learner
The goal of computational learning is to answer questions such as:
Sample complexity. How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
Computational complexity. How much computational effort is needed for a learner to converge (with high probability) to a successful hypothesis?
Mistake bound. How many training examples will the learner misclassify before converging to a successful hypothesis?


Probably Learning an Approximately Correct Hypothesis

Here, we consider a particular setting for the learning problem, called the probably approximately correct (PAC) learning model. We begin by specifying the problem setting that defines the PAC learning model, then consider the questions of how many training examples and how much computation are required in order to learn various classes of target functions within this PAC model. For the sake of simplicity, we restrict the discussion to the case of learning Boolean-valued concepts from noise-free training data.

The Problem Setting
Let X refer to the set of all possible instances over which target functions may be defined. For example, X might represent the set of all people, each described by the attributes age (e.g., young or old) and height (short or tall).

We assume instances are generated at random from X according to some probability distribution D. For example, D might be the distribution of instances generated by observing people who walk out of the largest sports store in Switzerland. In general, D may be any distribution, and it will not generally be known to the learner. All that we require of D is that it be stationary; that is, that the distribution not change over time. Training examples are generated by drawing an instance x at random according to D, then presenting x along with its target value, c(x), to the learner.

Let C refer to some set of target concepts that our learner might be called upon to learn. Each target concept c in C corresponds to some subset of X, or equivalently to some boolean-valued function c : X → {0, 1}. For example, one target concept c in C might be the concept "people who are skiers." If x is a positive example of c, then we write c(x) = 1; if x is a negative example, c(x) = 0.

The learner L considers some set H of possible hypotheses when attempting to learn the target concept. For example, H might be the set of all hypotheses describable by conjunctions of the attributes age and height. After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H, which is its estimate of c. To be fair, we evaluate the success of L by the performance of h over new instances drawn randomly from X according to D, the same probability distribution used to generate the training data.


Error of a Hypothesis
Because we are interested in how closely the learner's output hypothesis h approximates the actual target concept c, let us begin by defining the true error of a hypothesis h with respect to target concept c and instance distribution D.

Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.

errorD(h) ≡ Pr x∈D [c(x) ≠ h(x)]

Here, the notation Pr x∈D indicates that the probability is taken over the instance distribution D.

Informally, the true error of h is just the error rate we expect when applying h to future instances drawn according to the probability distribution D.

Probably Approximately Correct (PAC) Learnability
The aim of computational learning is to characterize classes of target concepts that can be reliably learned from a reasonable number of randomly drawn training examples and a reasonable amount of computation.

We may try to characterize the number of training examples needed to learn a hypothesis h for which errorD(h) = 0. Unfortunately, it turns out this is futile in the setting we are considering, for two reasons.
1. Unless we provide training examples corresponding to every possible instance in X (an unrealistic assumption), there may be multiple hypotheses consistent with the provided training examples, and the learner cannot be certain to pick the one corresponding to the target concept.
2. Given that the training examples are drawn randomly, there will always be some nonzero probability that the training examples encountered by the learner will be misleading. (For example, although we might frequently see skiers of different heights, on any given day there is some small chance that all observed training examples will happen to be 2 meters tall.)


To accommodate these two difficulties, we weaken our demands on the learner in two ways.
1. We will not require that the learner output a zero-error hypothesis; we will require only that its error be bounded by some constant, ε, that can be made arbitrarily small.
2. We will not require that the learner succeed for every sequence of randomly drawn training examples; we will require only that its probability of failure be bounded by some constant, δ, that can be made arbitrarily small.
In short, we require only that the learner probably learn a hypothesis that is approximately correct, hence the term probably approximately correct learning, or PAC learning.

Consider some class C of possible target concepts and a learner L using hypothesis space H. We can say that the concept class C is PAC-learnable by L using H if, for any target concept c in C, L will with probability (1 - δ) output a hypothesis h with errorD(h) < ε, after observing a reasonable number of training examples and performing a reasonable amount of computation. More precisely:

Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if, for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).

Sample Complexity for Finite Hypothesis Spaces

PAC-learnability is largely determined by the number of training examples required by the learner. The growth in the number of required training examples with problem size is called the sample complexity of the learning problem. Here, we present a general bound on the sample complexity for a very broad class of learners, called consistent learners. A learner is consistent if it outputs hypotheses that perfectly fit the training data, whenever possible. It is quite reasonable to ask that a learning algorithm be consistent, given that we typically prefer a hypothesis that fits the training data over one that does not.

We can derive a bound on the number of training examples required by any consistent learner, independent of the specific algorithm it uses to derive a consistent hypothesis. To accomplish this, we use the version space. We defined the version space, VSH,D, to be the set of all hypotheses h ∈ H that correctly classify the training examples D.

VSH,D = {h ∈ H | (∀⟨x, c(x)⟩ ∈ D) (h(x) = c(x))}
The significance of the version space here is that every consistent learner outputs a hypothesis belonging to the version space, regardless of the instance space X, hypothesis space H, or training data D. The reason is simply that by definition the version space VSH,D contains every consistent hypothesis in H. Therefore, to bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses. The following definition states this condition precisely.

Definition: Consider a hypothesis space H, target concept c, instance distribution D, and set of training examples D of c. The version space VSH,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in VSH,D has error less than ε with respect to c and D.

(∀h ∈ VSH,D) errorD(h) < ε
The version space VSH,D is the subset of hypotheses h ∈ H that have zero training error (say, r = 0). Of course, the true error errorD(h) may be nonzero, even for hypotheses that commit zero errors over the training data. The version space is said to be ε-exhausted when all hypotheses h remaining in VSH,D have errorD(h) < ε.
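These notes stop short of stating the resulting bound, but the standard argument for finite hypothesis spaces (bounding the probability that the version space is not ε-exhausted after m independent examples by |H|e^(-εm), and requiring this to be at most δ) gives m ≥ (1/ε)(ln|H| + ln(1/δ)). The following Python sketch evaluates this bound; the |H| = 3^n count for conjunctions of n boolean literals is an illustrative assumption, not something derived here.

from math import ceil, log

def sample_complexity(h_size, epsilon, delta):
    # Number of i.i.d. training examples sufficient for ANY consistent learner
    # over a finite hypothesis space of size h_size to output, with probability
    # at least (1 - delta), a hypothesis whose true error is below epsilon.
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

# Illustrative case: conjunctions over n = 10 boolean attributes, |H| = 3**10.
print(sample_complexity(3 ** 10, epsilon=0.1, delta=0.05))   # prints 140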

The Mistake Bound of Learning

The theory of computational learning considers a variety of different settings. They differ in:
1. how the training examples are generated (e.g., passive observation of random examples, active querying by the learner)
2. noise in the data (e.g., noisy or error-free)
3. the definition of success (e.g., the target concept must be learned exactly, or only probably and approximately)
4. assumptions made by the learner (e.g., regarding the distribution of instances and whether C ⊆ H)
5. the measure according to which the learner is evaluated (e.g., number of training examples, number of mistakes, total time).


In the mistake bound model of learning, the learner is evaluated by the total number of mistakes it makes before it converges to the correct hypothesis. As in the PAC setting, we assume the learner receives a sequence of training examples. However, here we demand that upon receiving each example x, the learner must predict the target value c(x) before it is shown the correct target value by the trainer. The question considered is "How many mistakes will the learner make in its predictions before it learns the target concept?"

This question is significant in practical settings where learning must be done while the system is in actual use, rather than during some off-line training stage. For example, if the system is to learn to predict which credit card purchases should be approved and which are fraudulent, based on data collected during use, then we are interested in minimizing the total number of mistakes it will make before converging to the correct target function. Here the total number of mistakes can be even more important than the total number of training examples.

This mistake bound learning problem may be studied in various specific settings. For example, we might count the number of mistakes made before PAC learning the target concept. In the examples below, we consider instead the number of mistakes made before learning the target concept exactly. Learning the target concept exactly means converging to a hypothesis h such that (∀x) h(x) = c(x).

Mistake Bound for the FIND-S Algorithm
Consider the hypothesis space H consisting of conjunctions of up to n boolean literals l1, …, ln and their negations (e.g., Rich ^ ¬Handsome). The FIND-S algorithm incrementally computes the maximally specific hypothesis consistent with the training examples. A straightforward implementation of FIND-S for this hypothesis space H is as follows:

FIND-S
• Initialize h to the most specific hypothesis l1 ^ ¬l1 ^ l2 ^ ¬l2 ^ … ^ ln ^ ¬ln
• For each positive training instance x
   o Remove from h any literal that is not satisfied by x
• Output hypothesis h.

FIND-S converges in the limit to a hypothesis that makes no errors, provided C ⊆ H and provided the training data is noise-free. FIND-S begins with the most specific hypothesis (which classifies every instance as a negative example), then incrementally generalizes this hypothesis as needed to cover observed positive training examples.
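A minimal Python sketch of this procedure is given below. The set-of-literals representation and the two attributes (standing in for Rich and Handsome) are illustrative assumptions, not part of the algorithm's definition.

def find_s(positive_examples, n):
    # Start with the maximally specific hypothesis: every literal and its
    # negation. A literal is a pair (i, value) meaning "attribute i == value".
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x in positive_examples:          # x is a tuple of n boolean values
        # Drop every literal that the positive example fails to satisfy.
        h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def classify(h, x):
    # h labels x positive only if x satisfies every remaining literal; the
    # initial contradictory hypothesis therefore labels everything negative.
    return all(x[i] == v for (i, v) in h)

positives = [(True, False), (True, True)]   # hypothetical training data
h = find_s(positives, n=2)                  # leaves only the literal (0, True)
print(h, classify(h, (True, True)))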

Can we prove a bound on the total number of mistakes that FIND-S will make before exactly learning the target concept c? The answer is yes. To see this, note first that if c ∈ H, then FIND-S can never mistakenly classify a negative example as positive. The reason is that its current hypothesis h is always at least as specific as the target concept c. Therefore, to calculate the number of mistakes it will make, we need only count the number of mistakes it makes misclassifying truly positive examples as negative.

How many such mistakes can occur before FIND-S learns c exactly? Consider the first positive example encountered by FIND-S. The learner will certainly make a mistake classifying this example, because its initial hypothesis labels every instance negative. However, the result will be that half of the 2n terms in its initial hypothesis will be eliminated, leaving only n terms. For each subsequent positive example that is mistakenly classified by the current hypothesis, at least one more of the remaining n terms must be eliminated from the hypothesis. Therefore, the total number of mistakes can be at most n + 1. This number of mistakes will be required in the worst case, corresponding to learning the most general possible target concept (∀x) c(x) = 1 and a worst-case sequence of instances that removes only one literal per mistake.


8

Instance-Based Learning

Instance-based learning methods such as nearest neighbor and locally weighted regression are conceptually straightforward approaches to approximating real-valued or discrete-valued target functions. Learning in these algorithms consists of simply storing the presented training data. When a new query instance is encountered, a set of similar related instances is retrieved from memory and used to classify the new query instance.

Instance-based methods can also use more complex, symbolic representations for instances. In case-based learning, instances are represented in this fashion and the process for identifying "neighboring" instances is elaborated accordingly. Case-based reasoning has been applied to tasks such as storing and reusing past experience at a help desk, reasoning about legal cases by referring to previous cases, and solving complex scheduling problems by reusing relevant portions of previously solved problems.

One disadvantage of instance-based approaches is that the cost of classifying new instances can be high. This is due to the fact that nearly all computation takes place at classification time rather than when the training examples are first encountered. Therefore, techniques for efficiently indexing training examples are a significant practical issue in reducing the computation required at query time. A second disadvantage of many instance-based approaches, especially nearest-neighbor approaches, is that they typically consider all attributes of the instances when attempting to retrieve similar training examples from memory. If the target concept depends on only a few of the many available attributes, then the instances that are truly most "similar" may well be a large distance apart.

k-Nearest Neighbor Learning

The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm. This algorithm assumes all instances correspond to points in the n-dimensional space Rn. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance x be described by the feature vector (a1(x), a2(x), …, an(x)), where ar(x) denotes the rth attribute of instance x. Then the distance between two instances xi and xj is defined to be d(xi, xj), where
d(xi, xj) = √( Σ r=1..n (ar(xi) − ar(xj))² )

In nearest-neighbor learning the target function may be either discrete-valued or real-valued. Let us first consider learning discrete-valued target functions of the form f : Rn → V, where V is the finite set {v1, …, vs}. The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued target function is given below.

Training algorithm:
• For each training example (x, f(x)), add the example to the list training-examples
Classification algorithm: Given a query instance xq to be classified,
• Let x1, …, xk denote the k instances from training-examples that are nearest to xq.
• Return
f̂(xq) ← argmax v∈V Σ i=1..k δ(v, f(xi))

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise. As shown here, the value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the most common value of f among the k training examples nearest to xq. If we choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f̂(xq) the value f(xi), where xi is the training instance nearest to xq. For larger values of k, the algorithm assigns the most common value among the k nearest training examples.

The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating continuous-valued target functions. To accomplish this, we have the algorithm calculate the mean value of the k nearest training examples rather than their most common value. More precisely, to approximate a real-valued target function f : Rn → R we replace the final line of the above algorithm by the line

f̂(xq) ← ( Σ i=1..k f(xi) ) / k
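A compact Python sketch of both variants, assuming instances are stored as tuples of real-valued attributes paired with their target values:

from collections import Counter
from math import dist                      # Euclidean distance (Python 3.8+)

def knn_classify(training_examples, xq, k):
    # Discrete-valued target: return the most common value of f among the
    # k training instances nearest to the query xq.
    nearest = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    return Counter(f_x for _, f_x in nearest).most_common(1)[0][0]

def knn_regress(training_examples, xq, k):
    # Continuous-valued target: return the mean of the k nearest target values.
    nearest = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    return sum(f_x for _, f_x in nearest) / k

data = [((1.0, 1.0), 'pos'), ((1.2, 0.8), 'pos'), ((5.0, 5.0), 'neg')]
print(knn_classify(data, (1.1, 1.0), k=3))   # 'pos' wins two votes to one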

Distance Weighted Nearest Neighbor Algorithm
One obvious refinement to the k-NEAREST NEIGHBOR algorithm is to weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors. For example, in the above algorithm, which approximates discrete-valued target functions, we might weight the vote of each neighbor according to the inverse square of its distance from xq.

f̂(xq) ← argmax v∈V Σ i=1..k wi δ(v, f(xi))

where

wi ≡ 1 / d(xq, xi)²

To accommodate the case where the query point xq exactly matches one of the training instances xi and the denominator d(xq, xi)² is therefore zero, we assign f̂(xq) to be f(xi) in this case. If there are several such training examples, we assign the majority classification among them.

We can distance-weight the instances for real-valued target functions in a similar fashion, replacing the final line of the algorithm in this case by

f̂(xq) ← ( Σ i=1..k wi f(xi) ) / ( Σ i=1..k wi )
where wi is the inverse square of its distance from xq.
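A sketch of the real-valued, distance-weighted variant, including the special case where the query coincides with a stored instance:

from math import dist

def distance_weighted_knn(training_examples, xq, k):
    # Each of the k nearest neighbors contributes with weight 1 / d(xq, xi)^2.
    nearest = sorted(training_examples, key=lambda ex: dist(ex[0], xq))[:k]
    for x, f_x in nearest:
        if dist(x, xq) == 0:
            return f_x                       # avoid a zero denominator
    weights = [1.0 / dist(x, xq) ** 2 for x, _ in nearest]
    return sum(w * f_x for w, (_, f_x) in zip(weights, nearest)) / sum(weights)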

Locally Weighted Regression

The nearest-neighbor approaches described in the previous section can be thought of as approximating the target function f(x) at the single query point x = xq. Locally weighted regression is a generalization of this approach. It constructs an explicit approximation to f over a local region surrounding xq, using nearby or distance-weighted training examples to form this local approximation. For example, we might approximate the target function in the neighborhood surrounding xq using a linear function, a quadratic function, a multilayer neural network, or some other functional form. The phrase "locally weighted regression" is used because the approximation is local (based only on data near the query point), weighted (the contribution of each training example is weighted by its distance from the query point), and a regression (the term used widely in the statistical learning community for the problem of approximating real-valued functions).

Given a new query instance xq, the general approach in locally weighted regression is to construct an approximation f̂ that fits the training examples in the neighborhood surrounding xq. This approximation is then used to calculate the value f̂(xq), which is output as the estimated target value for the query instance. The description of f̂ may then be deleted, because a different local approximation will be calculated for each distinct query instance.

Locally Weighted Linear Regression
Let us consider the case of locally weighted regression in which the target function f is approximated near xq using a linear function of the form

f̂(x) = w0 + w1 a1(x) + … + wn an(x)
ai(x) denotes the value of the ith attribute of the instance x. For a global approximation to the target function, we derive methods to choose weights that minimize the squared error summed over the set D of training examples

E ≡ (1/2) Σ x∈D (f(x) − f̂(x))²

which led us to the gradient descent training rule

Δwj = η Σ x∈D (f(x) − f̂(x)) aj(x)

where η is a constant learning rate.

How shall we modify this procedure to derive a local approximation rather than a global one? The simple way is to redefine the error criterion E to emphasize fitting the local training examples. Three possible criteria are given below. Note that we write the error E(xq) to emphasize the fact that the error is now being defined as a function of the query point xq.

1. Minimize the squared error over just the k nearest neighbors:

E1(xq) ≡ (1/2) Σ x ∈ k nearest nbrs of xq (f(x) − f̂(x))²

2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:

E2(xq) ≡ (1/2) Σ x∈D (f(x) − f̂(x))² K(d(xq, x))

3. Combine 1 and 2:

E3(xq) ≡ (1/2) Σ x ∈ k nearest nbrs of xq (f(x) − f̂(x))² K(d(xq, x))
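A sketch of criterion 2 in Python, using a Gaussian kernel K(d) = exp(−d²/2τ²) as the distance-weighting function; the kernel form and the bandwidth τ are assumptions for illustration, since the notes leave K unspecified. The local weights are folded into an ordinary least-squares solve.

import numpy as np

def locally_weighted_linear(X, y, xq, tau=1.0):
    # X: (m, n) array of training instances, y: (m,) targets, xq: query point.
    A = np.hstack([np.ones((X.shape[0], 1)), X])       # column of 1s for w0
    d = np.linalg.norm(X - xq, axis=1)                  # distances d(xq, x)
    sw = np.sqrt(np.exp(-d ** 2 / (2 * tau ** 2)))      # sqrt of kernel weights
    # Weighted least squares: minimize sum_x K(d(xq, x)) * (f(x) - f_hat(x))^2.
    w, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return np.hstack([1.0, xq]) @ w                     # local prediction f_hat(xq)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
print(locally_weighted_linear(X, y, np.array([1.5]), tau=0.5))   # roughly 1.5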


9

Genetic Algorithms

Genetic Algorithms are nondeterministic stochastic search or optimization methods that utilize the theories of evolution and natural selection to solve a problem within a complex solution space. They are computer-based problem-solving systems that use computational models of some of the known mechanisms of evolution as key elements in their design and implementation. Genetic algorithms are loosely based on natural evolution and use a "survival of the fittest" technique, wherein the best solutions survive and are varied until we get a good result.

In a genetic algorithm, a set of candidate solutions to a problem, called chromosomes, is evaluated and ordered; new candidate solutions are then produced by selecting candidates as parents and applying mutation or crossover operators, which combine bits of two parents to produce one or more children. The new set of candidates is evaluated, and this cycle continues until an adequate solution is found.

Preliminaries

Genetic Algorithms are a member of a wider family of algorithms, Evolutionary Algorithms. They perform a multi-point search in the problem space. On one hand this ensures robustness, since getting trapped in a local minimum at one search point does not mean that the whole algorithm fails; on the other hand it may give not just one but several nearly optimal solutions to the problem, from which the user can select. Due to the robustness, flexibility and efficiency of Genetic Algorithms, costly redesigns of artificial systems that are based on them can be avoided. Genetic Algorithms are theoretically and empirically proven to provide robust search in complex problem spaces.

Biological Background
Genetic algorithms are inspired by Darwin's theory: solutions to a problem can be obtained through evolution. All living organisms consist of cells, and in each cell there is the same set of chromosomes. Chromosomes are strings of DNA and serve as a model for the whole organism. The genes determine a chromosome's characteristics. Each gene has several forms or alternatives, which are called alleles, producing differences in the


set of characteristics associated with that gene. The set of chromosomes is called the genotype, which defines a phenotype (the individual) with a certain fitness.

During reproduction, recombination (or crossover) occurs first: genes from the parents combine to form a whole new chromosome. The newly created offspring can then be mutated. Mutation means that elements of the DNA are slightly changed; these changes are mainly caused by errors in copying genes from the parents. The fitness of an organism is measured by the success of the organism in its life. According to Darwinian theory, highly fit individuals are given opportunities to reproduce, whereas the least fit members of the population are less likely to be selected for reproduction, and so die out.

In nature, the genes of living creatures are stored as pairs, and each parent contributes only one gene from each pair. This differs from Genetic Algorithms, in which genes are not stored in pairs. But in both Genetic Algorithms and biological life forms, only a fraction of the parents' genes are passed to each offspring.

Encoding of Chromosome

The first step in a genetic algorithm is to "translate" the real problem into "biological terms". The format of the chromosome is called its encoding. There are four commonly used encoding methods: binary encoding, permutation encoding, direct value encoding and tree encoding.

i. Binary Encoding: Binary encoding is the most common and simplest one. In binary encoding, every chromosome is a string of bits, 0 or 1. For example:
Chromosome A: 0101101100010011
Chromosome B: 1011010110110101

ii. Permutation Encoding: Permutation encoding can be used in "ordering problems", such as the traveling salesman problem or a task ordering problem. In permutation encoding, every chromosome is a string of numbers that represents a position in a sequence. For example:
Chromosome A: 8549102367
Chromosome B: 9102438576

iii. Direct Value Encoding: Direct value encoding can be used in problems where complicated values such as real numbers are needed; using binary encoding for this type of problem would be very difficult. In value encoding, every chromosome is a string of values. The values can be anything connected to the problem, from numbers, real numbers or characters to more complicated objects. For example:
Chromosome A: [red], [black], [blue], [yellow], [red], [green]
Chromosome B: 1.8765, 3.9821, 9.1283, 6.8344, 4.116, 2.192
Chromosome C: ABCKDEIFGHNWLSWWEKPOIKNGVCI

iv. Tree Encoding: Tree encoding is used mainly for evolving programs or expressions, i.e., for genetic programming. In tree encoding every chromosome is a tree of some objects, such as functions or commands in a programming language.
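For illustration, the first three encodings translate directly into simple Python data structures; the lengths and value ranges below are arbitrary.

import random

# Binary encoding: a chromosome is a fixed-length bit string.
binary_chromosome = [random.randint(0, 1) for _ in range(16)]

# Permutation encoding (e.g. a travelling-salesman tour over 10 cities):
# the chromosome is an ordering of the city indices.
permutation_chromosome = random.sample(range(10), 10)

# Direct value encoding: the chromosome is a vector of problem-level values.
value_chromosome = [random.uniform(-5.0, 5.0) for _ in range(6)]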

Genetic Algorithm Operators

Various operators are available in the literature for genetic algorithms. Most of them can be classified into the fundamental categories of reproduction, crossover and mutation.

i. Initialization
There are many ways to initialize and encode the initial population. It can use binary or non-binary, fixed- or variable-length strings, and so on. This operator is not of much significance if the system randomly generates valid chromosomes and evaluates each one.

ii. Reproduction
Between successive generations, the process by which chromosomes of the previous generation are retained in the next generation is reproduction. The two types of reproduction are generational reproduction and steady-state reproduction.

Generational Reproduction
In generational reproduction, the whole population is potentially replaced at each generation. The most often used method is to randomly generate a population of chromosomes. The next step is to decode the chromosomes into

individuals and evaluate the fitness of all individuals, select the fittest individuals, and generate the new population by selection, crossover and mutation. This step is repeated until the termination condition is met.

Steady-State Reproduction
In steady-state reproduction, a population of chromosomes is randomly generated. The next step is to decode the chromosomes into individuals and evaluate the fitness of all the individuals, put the fittest individuals into a mating pool, produce a number of offspring by crossover and mutation, and replace the weakest individuals by the offspring. This step is repeated until the termination condition is met.

iii. Selection
According to Darwin's evolution theory the best individuals should survive and create new offspring. There are many methods for selecting the best chromosomes, for example roulette wheel selection, Boltzmann selection, tournament selection, rank selection, and spatially oriented selection.

Roulette Wheel Selection
Parents are selected according to their fitness. The better the chromosomes are, the more chances they have to be selected. Imagine a roulette wheel (pie chart) where all chromosomes in the population are placed according to their normalized fitness. Then a random number is generated which decides the chromosome to be selected. Chromosomes with bigger fitness values will be selected more often since they occupy more space on the pie. (A code sketch combining this selection with one-point crossover appears after the crossover examples below.)

Rank Selection
The previous selection method has problems when the fitnesses differ very much. For example, if the best chromosome's fitness is 90% of the entire roulette wheel then the other chromosomes will have very few chances to be selected. Rank selection first ranks the population and then every chromosome receives a fitness from this ranking. The worst will have fitness 1, the second worst 2, etc., and the best will have fitness N (the number of chromosomes in the population). After this, all the chromosomes have a chance to be selected. However, this method can lead to slower convergence, because the best chromosomes do not differ so much from the other ones.

Elitism
When creating a new population by crossover and mutation, we have a big chance that we will lose the best chromosome. Elitism is a method which first copies the best chromosome (or a few best chromosomes) to the new population. The rest


is done in the classical way. Elitism can very rapidly increase the performance of Genetic Algorithms, because it prevents losing the best-found solution.

Tournament Selection
Tournament selection chooses K parents at random and returns the fittest one among them. Some other forms of tournament selection exist, such as Boltzmann tournament selection. Marriage tournament selection chooses one parent, makes up to K tries to find a fitter one, and stops at the first try that finds one. If none is better than the initial choice, then the initial choice is returned.

Spatially-Oriented Selection
Spatially-oriented selection is a local selection method rather than a global one. That is, the selection competition is between several small neighboring chromosomes instead of the whole population. This method is based on Wright's shifting balance model of evolution, and the resulting algorithms are termed "Cellular Genetic Algorithms".

iv. Crossover
The crossover operator is the most important operation in genetic algorithms. Crossover is a process of yielding recombination of bit strings via an exchange of segments between pairs of chromosomes. There are many kinds of crossover. Certain crossover operators are applicable to binary chromosomes and others to permutation chromosomes.

One-Point Crossover
A randomly chosen point is taken within the length of the chromosomes. The chromosomes are cut at that point. The first child consists of the sub-chromosome of parent1 up to the cut point concatenated with the sub-chromosome of parent2 after the cut point. The second child is constructed in a similar way. For example:
P1 = 1010101 | 010
P2 = 1110001 | 110
The crossover point is between the 7th and 8th bits. Then the offspring will be
O1 = 1010101 | 110
O2 = 1110001 | 010
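A minimal sketch of roulette wheel selection and one-point crossover over list-of-bits chromosomes (the example strings repeat the one-point illustration above):

import random

def roulette_wheel_select(population, fitnesses):
    # Pick one parent with probability proportional to its (non-negative) fitness.
    r = random.uniform(0, sum(fitnesses))
    running = 0.0
    for chromosome, f in zip(population, fitnesses):
        running += f
        if running >= r:
            return chromosome
    return population[-1]

def one_point_crossover(p1, p2):
    # Cut both parents at one random point and swap the tails.
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

p1 = list("1010101010")
p2 = list("1110001110")
o1, o2 = one_point_crossover(p1, p2)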


Two-Point Crossover
The chromosomes are thought of as rings, with the first and last genes connected in a wrap-around structure. The rings are cut at two randomly chosen sites and the resulting sub-rings are swapped. For example,
P1 = 10 | 1010 | 1010
P2 = 11 | 1000 | 1110
The crossover points are between the 2nd and 3rd bits and between the 6th and 7th bits. The offspring will be
O1 = 11 | 1010 | 1110
O2 = 10 | 1000 | 1010

Uniform Crossover
Each gene of the chromosome is selected randomly from the corresponding gene of one of the parents. For example,
P1 = 11111111
P2 = 00000000
Here the eight random numbers generated are 2, 2, 1, 1, 1, 2, 1, 2. The random number 2 selects P2 for the corresponding bit position, so 0 is selected wherever a 2 was generated in this example. Then the offspring is
O = 00111010

N-Point Crossover
This is similar to one-point and two-point crossover except that we select n positions and only the bits between odd and even crossover positions are swapped. The bits between even and odd crossover positions are unchanged.

Arithmetic Crossover
Here an arithmetic operation such as OR or AND is performed between the two parents, and the resulting chromosome is the offspring. For example:
P1 = 11001011
P2 = 11011111
The arithmetic operation AND is performed between the two parents and the resulting offspring is
O = 11001011


Partially Matched Crossover (PMX)
PMX is useful for permutation GAs. It is similar to two-point crossover, with swaps done so that a legal permutation is finally obtained. For example, a two-point crossover performed on
P1: 1234 | 567 | 8
P2: 8521 | 364 | 7
would give
O1: 1234 | 364 | 8
O2: 8521 | 567 | 7
which is illegal because in O1 the values 3 and 4 are repeated and 5 and 7 do not occur at all, while in O2 the values 3 and 4 do not occur and 5 and 7 occur twice. PMX fixes this problem by noting that the segment exchange made the swaps 3 ↔ 5, 6 ↔ 6 and 4 ↔ 7, and then repeating these swaps on the genes outside the crossover points, giving us
O1': 12573648
O2': 83215674

Cycle Crossover
This time we do not pick a crossover point at all. We choose the first gene from one of the parents:
P1: 12345678
P2: 85213647
Say we pick 1 from P1:
O1 = 1 - - - - - - -
We must pick every element from one of the parents and place it in the position it previously occupied. Since the first position is occupied by 1, the number 8 from P2 cannot go there, so we must pick the 8 from P1:
O1 = 1 - - - - - - 8
This forces us to put the 7 in position 7 and the 4 in position 4, as in P1:
O1 = 1 - - 4 - - 7 8
Since 1, 4, 7 and 8 occupy the same set of positions in P1 and P2, we finish by filling in the blank positions with the elements found in those positions in P2. Thus,
O1 = 15243678
and we get O2 from the complement of O1:
O2 = 82315647

This process ensures that each offspring chromosome is legal. Notice that it is possible for us to end up with the offspring being the same as the parents. This is not a problem, since it will usually only occur if the parents have high fitness, in which case it could still be a good choice.

Order Crossover (OX)
This is more like PMX in that we choose two crossover points and cross over the genes between the two points. However, instead of repairing the chromosome by swapping the repeats of each gene, we simply rearrange the rest of the genes to give a legal permutation. With the chromosomes
P1 = 135 | 762 | 48
P2 = 563 | 821 | 47
we would start by switching the genes between the two crossover points:
O1 = - - - | 821 | - -
O2 = - - - | 762 | - -
We then write down the genes from each parent chromosome starting from the second crossover point:
from P1: 48135762
from P2: 47563821
Then the genes that were between the crossover points are deleted. That is, we delete 8, 2 and 1 from the P1 list and 7, 6 and 2 from the P2 list to give
43576
45381
which are then placed into the child chromosomes, starting at the second crossover point:
O1 = 57682143
O2 = 38176245

Matrix Crossover (MX)
For this we have a matrix representation where the element (i, j) is 1 if there is an edge from node i to node j and 0 otherwise. This is useful in solving the traveling salesperson problem. Matrix crossover is the same as one-point or two-point crossover, with the operation done on matrices instead of linear chromosomes.


Modified Order Crossover (MOX)
This is similar to order crossover. We randomly choose one crossover point in the parents and, as usual, leave the genes before the crossover point as they are. We then reorder the genes after the crossover point in the order in which they appear in the second parent chromosome. This operator is used in our implementation. If we have
P1 = 123 | 456
P2 = 364 | 215
we would get
O1 = 123 | 645
O2 = 364 | 125

v. Mutation
Mutation has the effect of ensuring that all possible chromosomes are reachable. For example, suppose a gene position can take any value from one to twenty, and in the initial population no chromosome has the value 6 in any of its gene positions. Then with only the crossover and reproduction operators, the value 6 will never occur in any future chromosome. The mutation operator overcomes this by randomly selecting a position and changing its value. Mutation is useful in escaping local minima, as it helps explore new regions of the multidimensional solution space. If the mutation rate is too high, however, it can cause well-bred chromosomes to be lost and thus decrease the exploitation of high-fitness regions of the solution space. Some systems that use random (noisy) populations created at the initialization phase do not use mutation operators at all. Some mutation operators are:

Bit Inversion
This operator applies to binary chromosomes. Bits in the chromosome are inverted (0s are made 1s and 1s are made 0s) depending on the probability of mutation. For example,
1000000001 → 1010000000
where the third and 10th bits have been (randomly) mutated.

Order Changing
This operator can be used on both binary and non-binary gene representations. A portion of the chromosome is selected and the genes in that region are randomly permuted. For example,
(5 6 3 4 7 3) → (5 3 4 6 7 3)
where the second, third and fourth values have been randomly scrambled.


Value Changing
The value of a gene is changed within a specific range. For example,
(3.4 4.2 4.6 6.4 3.2) → (3.4 4.2 4.5 6.4 3.2)
where one value has been changed within a specific range.

Reciprocal Exchange
Two randomly selected positions in the chromosome are chosen and the values at those positions are swapped. For example, (5 6 2 4 7 3) gives us (5 6 7 4 2 3) if the randomly selected positions are 3 and 5. A short code sketch of these mutation operators follows this section.

vi. Inversion
In Holland's founding work on Genetic Algorithms he mentioned another operator, besides selection, crossover and mutation, which takes place in biological reproduction. This is known as the inversion operator. An inversion is where a portion of a chromosome detaches from the rest of the chromosome, then changes direction and recombines with the chromosome. Inversion is decidedly more complex to implement than the other operators involved in genetic algorithms, and it has also attracted a substantial amount of research. For example, consider the chromosome
Before inversion: 001001011010100101011010010100101010001010
During inversion, the chromosome splits into three portions:
00100101101010 | 01010110100101 | 00101010001010
The middle portion inverts (its order is reversed) to become 10100101101010.
Recombination then gives
After inversion: 001001011010101010010110101000101010001010
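A sketch of the bit inversion and reciprocal exchange operators described above; the mutation probability is a hypothetical default.

import random

def bit_inversion(chromosome, p_mut=0.01):
    # Flip each bit independently with probability p_mut.
    return [1 - b if random.random() < p_mut else b for b in chromosome]

def reciprocal_exchange(chromosome):
    # Swap the values at two randomly chosen positions; this keeps a
    # permutation legal, so it suits permutation-encoded chromosomes.
    c = list(chromosome)
    i, j = random.sample(range(len(c)), 2)
    c[i], c[j] = c[j], c[i]
    return c

print(reciprocal_exchange([5, 6, 2, 4, 7, 3]))   # e.g. [5, 6, 7, 4, 2, 3]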


Genetic Algorithms and Traditional Optimization and Search Methods

There are three main types of traditional or conventional search methods: calculus-based, enumerative, and random.

i. Calculus-based methods
Calculus-based methods are also referred to as gradient methods. These methods use information about the gradient of the function to guide the direction of search. If the derivative of the function cannot be computed, because it is discontinuous, for example, these methods often fail. Hill climbing is one method that uses gradient information to find the local best by moving in the steepest permissible direction. Calculus-based methods depend upon the existence of derivatives or well-defined slope values, but the real world of search is fraught with discontinuities and vast, multimodal, noisy search spaces.

ii. Enumerative methods
Enumerative methods work within a finite search space, or at least a discretized infinite search space. The algorithm looks at objective function values at every point in the space, one at a time. Because enumerative methods search every point in the space, they become inefficient if the search space grows exponentially or if the problem is NP-hard, like the Traveling Salesman Problem.

iii. Random search methods
Random search methods are strictly random walks through the search space while saving the best solution found.

iv. Differences between Genetic Algorithms and conventional search procedures
Genetic Algorithms differ from conventional optimization/search procedures in that:
• They work with a coding of the parameter set, not the parameters themselves.
• They search from a population of points in the problem domain, not a single point.
• They use payoff information as the objective function rather than derivatives of the problem or auxiliary knowledge.

• They utilize probabilistic transition rules based on fitness rather than deterministic ones.
We can see that both the enumerative and random methods are not efficient when there is a significantly large search space or a significantly difficult problem, as in the realm of NP-complete problems. The calculus-based methods are inadequate when searching a "noisy" search space (one with numerous peaks). Taken together, these four differences (direct use of a coding, search from a population, blindness to auxiliary information, and randomized operators) contribute to a genetic algorithm's robustness and resulting advantage over other more commonly used techniques.

Similarity Templates (Schemata)

A similarity template (schema) describes a subset of strings with similarities at certain string positions. Schemata are helpful in answering questions like how one string can be similar to its fellow strings; they encode useful or promising characteristics found in the population. For the binary alphabet {0, 1}, the similarity template operates over the extended alphabet {0, 1, *}, where * denotes the don't-care symbol. For example, the schema *11* describes a subset with four members {0110, 0111, 1110, 1111}. The number of non-* symbols is called the order of the schema.

A bit string of length N matches 2^N schemata, and there are 3^N different schemata of length N. A population of P bit strings of length N contains between 2^N and min(P·2^N, 3^N) schemata, so the GA operates implicitly on a number of schemata much larger than the size of the population. Chromosomes of length N can be viewed as points in a discrete N-dimensional search space (i.e., vertices of a hypercube), and schemata can be seen as hyperplanes in such a search space. Low-order (i.e., small number of non-* symbols) hyperplanes include more vertices.

Highly fit, short-defining-length schemata (called building blocks) are propagated from generation to generation by giving exponentially increasing numbers of representatives to the observed best; all of this goes on in parallel with no special bookkeeping or special memory other than our population of n strings. This processing leverage is important and we give it a special name, implicit parallelism.
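A small sketch makes the definitions concrete: a string matches a schema if it agrees on every fixed position, and the order is simply the count of fixed positions.

def matches(schema, string):
    # '*' is the don't-care symbol; every fixed position must agree.
    return all(s == '*' or s == b for s, b in zip(schema, string))

def order(schema):
    # Order of a schema: the number of fixed (non-*) positions.
    return sum(1 for s in schema if s != '*')

print([s for s in ['0110', '0111', '1110', '1111', '0011'] if matches('*11*', s)])
print(order('*11*'))   # 2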


Working of Genetic Algorithms

Pseudo-Code for Genetic Algorithms
The following is pseudo-code for the general genetic algorithm approach:
0. [Representation] Define a genetic representation of the system.
1. [Start] Generate a random population of n chromosomes (suitable solutions for the problem).
2. [Fitness] Evaluate the fitness of each chromosome in the population.
3. [New population] Create a new population by repeating the following steps until the new population is complete.
   3.1. [Selection] Select two parent chromosomes from the population according to their fitness (the better the fitness, the bigger the chance to be selected).
   3.2. [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
   3.3. [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
   3.4. [Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run of the algorithm.
5. [Test] If the end condition is satisfied, stop, and return the best solution in the current population.
6. [Loop] Go to step 2.
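A minimal, self-contained Python sketch of this loop over binary chromosomes, using roulette wheel selection, one-point crossover and bitwise mutation. The toy "one-max" fitness (count of 1 bits) and all parameter values are illustrative assumptions; a fixed number of generations stands in for the end condition of step 5.

import random

def genetic_algorithm(fitness, length, pop_size=20, p_cross=0.7,
                      p_mut=0.01, generations=100):
    # 1. Start: random population of binary chromosomes.
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Fitness: evaluate every chromosome.
        scores = [fitness(c) for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # 3.1 Selection: fitness-proportional (roulette wheel).
            p1, p2 = random.choices(pop, weights=scores, k=2)
            # 3.2 Crossover with probability p_cross, else copy the parents.
            if random.random() < p_cross:
                cut = random.randint(1, length - 1)
                o1, o2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                o1, o2 = p1[:], p2[:]
            # 3.3 Mutation: flip each bit with probability p_mut.
            for o in (o1, o2):
                new_pop.append([1 - b if random.random() < p_mut else b for b in o])
        # 4. Replace the old population and loop.
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

# Toy objective: maximise the number of 1 bits ("one-max").
print(genetic_algorithm(sum, length=20))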


Figure: Working of Genetic Algorithms

The pseudo-code is very general, and many things can be implemented differently for different problems. The first question is how to create chromosomes, i.e., what type of encoding to choose. Connected with this is the choice of the two basic operators of Genetic Algorithms, crossover and mutation. Furthermore, the selection of parents from the current population also has to be clearly defined.

Fitness Function
The fitness function is a non-negative figure of merit of the chromosome. The objective function is the basis for the computation of the fitness function, which provides the Genetic Algorithms with feedback from the environment, feedback used to direct the population towards areas of the search space characterized by better solutions. Generally, the only requirement of a fitness function is to return a value indicating the quality of the individual solution under evaluation. This gives the modeler almost unlimited freedom in building the model, therefore a diverse range of modeling structures can be incorporated into the Genetic Algorithms. At every evolutionary step, known as a generation, the individuals in the current population are decoded and evaluated according to some predefined quality criterion,

referred to as the fitness function. To form a new population (the next generation), individuals are selected according to their fitness.

In many problems, the objective is more naturally stated as the minimization of some cost function g(x), rather than the maximization of some utility or profit function u(x). Even if the problem is naturally stated in maximization form, this alone does not guarantee that the utility function will be non-negative for all x, as we require of a fitness function. As a result, it is often necessary to map the underlying natural objective function to a fitness function form through one or more mappings.

In normal Operations Research work, a minimization problem is transformed into a maximization problem by simply multiplying the cost function by minus one. In genetic algorithm work, this operation alone is insufficient because the measure thus obtained is not guaranteed to be non-negative in all instances. With GAs, the following cost-to-fitness transformation is commonly used:
f(x) = Cmax - g(x), when g(x) < Cmax
f(x) = 0, otherwise.
There are a variety of ways to choose the coefficient Cmax. It may be taken as an input coefficient, as the largest g value observed so far, as the largest g value in the current population, or as the largest of the last k generations.

When the natural objective function formulation is a profit or utility function, we have no difficulty with the direction of the function: maximized profit or utility leads to desired performance. We may still have a problem with negative utility function values u(x). To overcome this, we simply transform fitness according to the equation:
f(x) = u(x) + Cmin, when u(x) + Cmin > 0
f(x) = 0, otherwise.
We may choose Cmin as an input coefficient, as the absolute value of the worst u value in the current or last k generations, or as a function of the population variance.

To avoid premature convergence, wherein the best chromosomes have a large number of copies right from the initial population, and to prevent a random walk through mediocre chromosomes when the population average fitness is close to the best fitness, we perform fitness scaling. One way is linear scaling: if the raw fitness is f and the scaled fitness is f', then the linear relationship between them is
f' = a * f + b
where a and b can be chosen in a number of ways.
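The two transformations and the linear scaling translate directly into code; the choice of Cmax, Cmin, a and b is left to the caller, as in the text.

def cost_to_fitness(g, c_max):
    # Minimisation problem: f(x) = Cmax - g(x) when g(x) < Cmax, else 0.
    return c_max - g if g < c_max else 0.0

def utility_to_fitness(u, c_min):
    # Maximisation with possibly negative utility:
    # f(x) = u(x) + Cmin when that sum is positive, else 0.
    return u + c_min if u + c_min > 0 else 0.0

def linear_scaling(f, a, b):
    # Fitness scaling f' = a*f + b, used to control selection pressure.
    return a * f + b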


System Design

Genetic Algorithms Unit
This subsystem is the actual timetable generation unit. However, this unit has no knowledge of constraints; it simply receives a fitness value based on the constraint violations that arise from the data entered. Some design considerations for the genetic algorithms are:

Encoding
Permutation-based encoding is chosen. The representation satisfies some hard constraints by its very nature, and advanced genetic algorithm operations exist for this form of encoding, which motivated the choice of this representation.

Reproduction
Between successive generations, the process by which chromosomes of the previous generation are retained in the next generation is reproduction. The reproduction used in this system is generational reproduction, in which the whole population is potentially replaced at each generation.

Selection
The selection of chromosomes from a given population is based on roulette wheel selection. In this type of selection the parents are selected according to their fitness: the better the chromosomes are, the more chances they have to be selected from the population.

Crossover
The crossover method used in the system is modified order crossover (MOX). MOX is used because of its ease of implementation and simplicity, yet it is powerful. For example, if we have the parents P1, P2 as
P1 = 123 | 456
P2 = 364 | 215
we would get the offspring
O1 = 123 | 645
O2 = 364 | 125


Mutation
The mutation operator used in the system is reciprocal exchange: two randomly selected positions in the chromosome are chosen and the values at those positions are swapped.

Decoding
The chromosome is in an encoded form. To evaluate a chromosome, it is first decoded and then evaluated.

Evaluation
The fitness of a chromosome is evaluated by calculating
F(chr) = Cmax - ( (wh * H(chr)) + (ws * S(chr)) )
where
chr = the chromosome,
F = the function that returns the fitness of the chromosome,
Cmax = a large integer value (to make the fitness positive),
wh = the weight for each hard constraint violation,
ws = the weight for each soft constraint violation,
H(chr) = the number of hard constraints that are violated,
S(chr) = the number of soft constraints that are violated.
All the above values are positive. To choose wh and ws:
wh = (ws * Smax) + 1
where wh is the weight of each hard constraint violated, Smax is the total number of soft constraints, and ws is the weight of each soft constraint. That is, the weight of each hard constraint violation is the total number of soft constraints multiplied by the weight of each soft constraint, plus one. This ensures that chromosomes satisfying the hard constraints are generated first, followed by better ones that also satisfy the soft constraints. The evaluation also becomes easier: it is easy to say when a chromosome is fit.
Example: If Smax = 10 and ws = 1, then wh = 11.
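A sketch of this evaluation; the default Cmax matches the implementation parameters listed later, and the constraint counts passed in are hypothetical.

def timetable_fitness(hard_violations, soft_violations,
                      c_max=2147483647, s_max=10, w_s=1):
    # wh = ws * Smax + 1, so removing one hard-constraint violation always
    # outweighs any possible number of soft-constraint violations.
    w_h = w_s * s_max + 1                    # = 11 for the example above
    return c_max - (w_h * hard_violations + w_s * soft_violations)

print(timetable_fitness(hard_violations=2, soft_violations=3))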


Figure: GeneticAlgorithm


Genetic Algorithms Implementation

Genetic Algorithm operators:
These form the core function of searching through the problem space for a solution. For the GAs to work, some parameters have to be set, for example:
Maximum number of generations = 10000
Maximum population size = 200
Maximum fitness, Cmax = 2147483647
Fitness value (threshold) = 2147483647
Probability of crossover, pcross = 0.7
Probability of mutation, pmutation = 0.01
Initial population size = 10
Scaling factor for hard constraint violation = 1000
Scaling factor for soft constraint violation = 1
Selection operator: Roulette Wheel Selection
Crossover operator: Modified Order Crossover
Mutation operator: Reciprocal Exchange


10

Learning Sets of Rules

In many cases it is useful to learn the target function represented as a set of if-then rules that jointly define the function. One way to learn sets of rules is to first learn a decision tree, then translate the tree into an equivalent set of rules, one rule for each leaf node in the tree. A second method is to use a genetic algorithm that encodes each rule set as a bit string and uses genetic search operators to explore this hypothesis space. Here, we explore a variety of algorithms that directly learn rule sets and that differ from these algorithms in two key respects.
1. First, they are designed to learn sets of first-order rules that contain variables. This is significant because first-order rules are much more expressive than propositional rules.
2. Second, the algorithms discussed here use sequential covering algorithms that learn one rule at a time to incrementally grow the final set of rules.
As an example of first-order rule sets, consider the following two rules that jointly describe the target concept Ancestor. Here we use the predicate Parent(x, y) to indicate that y is the mother or father of x, and the predicate Ancestor(x, y) to indicate that y is an ancestor of x related by an arbitrary number of family generations.
IF Parent(x, y) THEN Ancestor(x, y)
IF Parent(x, z) ^ Ancestor(z, y) THEN Ancestor(x, y)
These two rules compactly describe a recursive function that would be very difficult to represent using a decision tree or other propositional representation.

Sequential Covering Algorithms

Here we consider a family of algorithms for learning rule sets based on the strategy of learning one rule, removing the data it covers, then iterating this process. Such algorithms are called sequential covering algorithms.

To elaborate, imagine we have a subroutine LEARN-ONE-RULE that accepts a set of positive and negative training examples as input, then outputs a single rule that covers many of the positive examples and few of the negative examples. We require that this output rule have high accuracy, but not necessarily high coverage. By high accuracy, we mean the predictions it makes should be correct. By accepting low coverage, we mean it need not make predictions for every training example.

Given this LEARN-ONE-RULE subroutine for learning a single rule, one obvious approach to learning a set of rules is to invoke LEARN-ONE-RULE on all the available training examples, remove any positive examples covered by the rule it learns, then invoke it again to learn a second rule based on the remaining training examples. This

procedure can be iterated as many times as desired to learn a disjunctive set of rules that together cover any desired fraction of the positive examples. This is called a sequential covering algorithm because it sequentially learns a set of rules that together cover the full set of positive examples. The final set of rules can then be sorted so that more accurate rules will be considered first when a new instance must be classified. A prototypical sequential covering algorithm is described below. SEQUENTIAL_COVERING(Target_attribute, Attributes, Examples, Threshold) Learned_rules  {} Rule  LEARN_ONE_RULE(Target_attribute, Attributes, Examples) while PERFORMANCE(Rule, Example) > Threshold, Do • • • Learned_rules Learned_rules + Rule Examples  Examples – {examples correctly classified by Rules} Rule  LEARN_ONE_RULE(Target_attribute, Attributes, Examples)

Learned_rule  sort Learned_rules according to PERFORMANCE over Examples return Learned_rules This sequential covering algorithm is one of the most widespread approaches to learning disjunctive sets of rules. It reduces the problem of learning a disjunctive set of rules to a sequence of simpler problems, each requiring that a single conjunctive rule be learned. Because it performs a greedy search, formulating a sequence of rules without backtracking, it is not guaranteed to find the smallest or best set of rules that cover the training examples. How shall we design LEARN-ONE-RULE to meet the needs of the sequential covering algorithm? We require an algorithm that can formulate a single rule with high accuracy, but that need not cover all of the positive examples.
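The following Python sketch (an illustration, not the text's own code) mirrors the SEQUENTIAL-COVERING loop above. The Rule representation, the accuracy-based PERFORMANCE measure, and the assumption that examples are (attribute-dictionary, label) pairs are all hypothetical; the learn_one_rule subroutine is supplied by the caller (for instance, the beam-search sketch given later).

    from dataclasses import dataclass

    @dataclass
    class Rule:
        preconditions: dict      # conjunction of attribute = value tests
        prediction: object       # value predicted for the target attribute

        def matches(self, x):
            return all(x.get(a) == v for a, v in self.preconditions.items())

    def performance(rule, examples):
        # accuracy of the rule over the examples it matches (0 if it matches none)
        matched = [y for x, y in examples if rule.matches(x)]
        return sum(y == rule.prediction for y in matched) / len(matched) if matched else 0.0

    def sequential_covering(examples, learn_one_rule, threshold):
        all_examples = list(examples)            # keep the full set for the final sort
        learned_rules = []
        rule = learn_one_rule(examples)
        while rule is not None and performance(rule, examples) > threshold:
            learned_rules.append(rule)
            # remove the examples the new rule classifies correctly
            examples = [(x, y) for x, y in examples
                        if not (rule.matches(x) and rule.prediction == y)]
            rule = learn_one_rule(examples)
        # more accurate rules are considered first at classification time
        learned_rules.sort(key=lambda r: performance(r, all_examples), reverse=True)
        return learned_rules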

General to Specific Beam Search
One effective approach to implementing LEARN-ONE-RULE is to organize the hypothesis space search in the same general fashion as the ID3 algorithm, but to follow only the most promising branch in the tree at each step.


Viewed as a search tree, the search begins by considering the most general rule precondition possible (the empty test that matches every instance), then greedily adding the attribute test that most improves rule performance measured over the training examples. Once this test has been added, the process is repeated by greedily adding a second attribute test, and so on. Like ID3, this process grows the hypothesis by greedily adding new attribute tests until the hypothesis reaches an acceptable level of performance. Unlike ID3, this implementation of LEARN-ONE-RULE follows only a single descendant at each search step (the attribute-value pair yielding the best performance) rather than growing a subtree that covers all possible values of the selected attribute.

This approach to implementing LEARN-ONE-RULE performs a general-to-specific search through the space of possible rules, seeking a rule with high accuracy, though perhaps incomplete coverage of the data. As in decision tree learning, there are many ways to define a measure to select the "best" descendant. To follow the lead of ID3, let us for now define the best descendant as the one whose covered examples have the lowest entropy.

The general-to-specific search suggested above for the LEARN-ONE-RULE algorithm is a greedy depth-first search with no backtracking. As with any greedy search, there is a danger that a suboptimal choice will be made at any step. To reduce this risk, we can extend the algorithm to perform a beam search.


In beam search, the algorithm maintains a list of the k best candidates at each step, rather than a single best candidate. On each search step, descendants (specializations) are generated for each of these k best candidates, and the resulting set is again reduced to the k most promising members. Beam search keeps track of the most promising alternatives to the current top-rated hypothesis, so that all of their successors can be considered at each search step. This general-to-specific beam search algorithm is used by the CN2 program described by Clark and Niblett (1989). The algorithm (a generate-and-test approach) is described below.
LEARN-ONE-RULE(Target_attribute, Attributes, Examples, k)
Returns a single rule that covers some of the Examples. Conducts a general-to-specific greedy beam search for the best rule, guided by the PERFORMANCE metric.
  Initialize Best_hypothesis to the most general hypothesis ∅
  Initialize Candidate_hypotheses to the set {Best_hypothesis}
  While Candidate_hypotheses is not empty, do
    1. Generate the next more specific candidate hypotheses:
       • All_constraints ← the set of all constraints of the form (a = v), where a is a member of Attributes and v is a value of a that occurs in the current set of Examples
       • New_candidate_hypotheses ←
           for each h in Candidate_hypotheses,
             for each c in All_constraints,
               create a specialization of h by adding the constraint c
       • Remove from New_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific
    2. Update Best_hypothesis:
       • For all h in New_candidate_hypotheses, do
           If PERFORMANCE(h, Examples, Target_attribute) > PERFORMANCE(Best_hypothesis, Examples, Target_attribute)
           Then Best_hypothesis ← h
    3. Update Candidate_hypotheses:
       • Candidate_hypotheses ← the k best members of New_candidate_hypotheses, according to the PERFORMANCE measure
  Return a rule of the form "IF Best_hypothesis THEN prediction"
  where prediction is the most frequent value of Target_attribute among those Examples that match Best_hypothesis.

PERFORMANCE(h, Examples, Target_attribute)
  h_examples ← the subset of Examples that match h
  Return −Entropy(h_examples), where entropy is computed with respect to Target_attribute
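A condensed Python sketch of this beam search follows; it is an illustration under assumptions, not the CN2 implementation. Examples are assumed to be (attribute-dictionary, label) pairs, hypotheses are sets of (attribute, value) tests, and PERFORMANCE is the negative entropy of the covered examples; duplicate and inconsistent specializations are avoided by never adding two tests on the same attribute.

    from collections import Counter
    from math import log2

    def neg_entropy(hypothesis, examples):
        # PERFORMANCE: -Entropy of the target values among examples matching the hypothesis
        covered = [y for x, y in examples
                   if all(x.get(a) == v for a, v in hypothesis)]
        if not covered:
            return float("-inf")        # a hypothesis covering no examples is useless
        counts = Counter(covered).values()
        return sum((c / len(covered)) * log2(c / len(covered)) for c in counts)

    def learn_one_rule(examples, k=5):
        """General-to-specific greedy beam search for one conjunctive rule."""
        all_constraints = {(a, v) for x, _ in examples for a, v in x.items()}
        best = frozenset()              # most general hypothesis: the empty test
        candidates = [best]
        while candidates:
            # 1. generate the next, more specific candidate hypotheses
            new_candidates = set()
            for h in candidates:
                used = {a for a, _ in h}
                for a, v in all_constraints:
                    if a not in used:   # skip duplicate/inconsistent specializations
                        new_candidates.add(h | {(a, v)})
            # 2. update the best hypothesis found so far
            for h in new_candidates:
                if neg_entropy(h, examples) > neg_entropy(best, examples):
                    best = h
            # 3. keep only the k most promising candidates
            candidates = sorted(new_candidates,
                                key=lambda h: neg_entropy(h, examples),
                                reverse=True)[:k]
        covered = [y for x, y in examples
                   if all(x.get(a) == v for a, v in best)]
        prediction = Counter(covered).most_common(1)[0][0] if covered else None
        return dict(best), prediction   # i.e., IF best THEN prediction

A caller such as the sequential covering sketch shown earlier can use this subroutine through a thin adapter that wraps the returned (preconditions, prediction) pair in its own rule representation.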

Remarks:
1. Each hypothesis considered in the main loop of the algorithm is a conjunction of attribute-value constraints.
2. Each of these conjunctive hypotheses corresponds to a candidate set of preconditions for the rule to be learned and is evaluated by the entropy of the examples it covers.
3. The search considers increasingly specific candidate hypotheses until it reaches a maximally specific hypothesis that contains all available attributes.
4. The rule that is output by the algorithm is the rule encountered during the search whose PERFORMANCE is greatest, not necessarily the final hypothesis generated in the search.
5. The post-condition for the output rule is chosen only in the final step of the algorithm, after its precondition (represented by the variable Best_hypothesis) has been determined.
6. The algorithm constructs the rule post-condition to predict the value of the target attribute that is most common among the examples covered by the rule precondition.

Finally, note that despite the use of beam search to reduce the risk, the greedy search may still produce suboptimal rules. However, even when this occurs, the SEQUENTIAL-COVERING algorithm can still learn a collection of rules that together cover the training examples, because it repeatedly calls LEARN-ONE-RULE on the remaining uncovered examples.

Learning Rule Sets: Summary

There are several key dimensions in the design space of rule learning algorithms.

Dimension-1
Sequential covering algorithms learn one rule at a time, removing the covered examples and repeating the process on the remaining examples. In contrast, decision tree algorithms such as ID3 learn the entire set of disjuncts simultaneously as part of the single search for an acceptable decision tree. We might, therefore, call algorithms such as ID3 simultaneous covering algorithms, in contrast to sequential covering algorithms such as CN2. Which should we prefer?

The key difference occurs in the choice made at the most primitive step in the search. At each search step, ID3 chooses among alternative attributes by comparing the partitions of the data they generate. In contrast, CN2 chooses among alternative attribute-value pairs, by comparing the subsets of data they cover. One way to see the significance of this difference is to compare the number of distinct choices made by the two algorithms in order to learn the same set of rules. To learn a set of n rules, each containing k attribute-value tests in their preconditions, sequential covering algorithms will perform n * k primitive search steps, making an independent decision to select each precondition of each rule. In contrast, simultaneous covering algorithms will make many fewer independent choices, because each choice of a decision node in the decision tree corresponds to choosing the precondition for the multiple rules associated with that node. In other words, if the decision node tests an attribute that has m possible values, the choice of the decision node corresponds to choosing a precondition for each of the m corresponding rules. Thus, sequential covering algorithms such as CN2 make a larger number of independent choices than simultaneous covering algorithms such as ID3.

Still, the question remains, which should we prefer? The answer may depend on how much training data is available. If data is plentiful, then it may support the larger number of independent decisions required by the sequential covering algorithm, whereas if data is scarce, the "sharing" of decisions regarding preconditions of different rules may be more effective. An additional consideration is the task-specific question of whether it is desirable that different rules test the same attributes. In the simultaneous covering decision tree learning algorithms, they will. In sequential covering algorithms, they need not.

Dimension-2
A second dimension is the direction of the search in LEARN-ONE-RULE. In the LEARN-ONE-RULE algorithm described above, the search is from general-to-specific hypotheses. Other algorithms we have discussed (e.g., FIND-S) search from specific-to-general. One advantage of general-to-specific search here is that there is a single maximally general hypothesis from which to begin the search, whereas there are very many maximally specific hypotheses in most hypothesis spaces (i.e., one for each possible instance).


Dimension-3
A third dimension is whether the LEARN-ONE-RULE search is a generate-then-test search through the syntactically legal hypotheses, as it is in our suggested implementation, or whether it is example-driven, so that individual training examples constrain the generation of hypotheses. Prototypical example-driven search algorithms include the FIND-S and CANDIDATE-ELIMINATION algorithms. In each of these algorithms, the generation or revision of hypotheses is driven by the analysis of an individual training example, and the result is a revised hypothesis designed to correct performance for this single example. This contrasts with the generate-and-test search of LEARN-ONE-RULE (discussed earlier), in which successor hypotheses are generated based only on the syntax of the hypothesis representation; the training data is considered only after these candidate hypotheses are generated and is used to choose among the candidates based on their performance over the entire collection of training examples. One important advantage of the generate-and-test approach is that each choice in the search is based on the hypothesis performance over many examples, so the impact of noisy data is minimized. In contrast, example-driven algorithms that refine the hypothesis based on individual examples are more easily misled by a single noisy training example and are therefore less robust to errors in the training data.

Dimension-4
A fourth dimension is whether and how rules are post-pruned. As in decision tree learning, it is possible for LEARN-ONE-RULE to formulate rules that perform very well on the training data, but less well on subsequent data. As in decision tree learning, one way to address this issue is to post-prune each rule after it is learned from the training data. In particular, preconditions can be removed from the rule whenever this leads to improved performance over a set of pruning examples distinct from the training examples. A minimal sketch of such pruning is given below.
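The following Python sketch is an assumption made for illustration, not an algorithm from the text: it post-prunes a rule by repeatedly dropping any precondition whose removal improves the rule's accuracy on a held-out set of pruning examples.

    def accuracy(preconditions, prediction, examples):
        """examples: list of (attribute-dict, label); preconditions: dict of attribute = value tests."""
        matched = [y for x, y in examples
                   if all(x.get(a) == v for a, v in preconditions.items())]
        if not matched:
            return 0.0
        return sum(y == prediction for y in matched) / len(matched)

    def post_prune(preconditions, prediction, pruning_examples):
        preconditions = dict(preconditions)
        improved = True
        while improved and preconditions:
            improved = False
            base = accuracy(preconditions, prediction, pruning_examples)
            for attr in list(preconditions):
                trial = {a: v for a, v in preconditions.items() if a != attr}
                if accuracy(trial, prediction, pruning_examples) > base:
                    preconditions = trial        # removing this test improved pruning-set accuracy
                    improved = True
                    break
        return preconditions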


Dimension-5
A fifth dimension is the particular definition of rule PERFORMANCE used to guide the search in LEARN-ONE-RULE. Various evaluation functions have been used. Some common evaluation functions include:

1. Relative frequency. Let n denote the number of examples the rule matches and let nc denote the number of these that it classifies correctly. The relative frequency estimate of rule performance is nc/n.

2. m-estimate of accuracy. This accuracy estimate is biased toward the default accuracy expected of the rule. It is often preferred when data is scarce and the rule must be evaluated based on few examples. Let n and nc denote the number of examples matched and correctly predicted by the rule. Let p be the prior probability that a randomly drawn example from the entire data set will have the classification assigned by the rule (e.g., if 12 out of 100 examples have the value predicted by the rule, then p = 0.12). Finally, let m be the weight, or equivalent number of examples, for weighting this prior p. The m-estimate of rule accuracy is (nc + mp) / (n + m). Note that if m is set to zero, the m-estimate becomes the relative frequency estimate above. As m is increased, a larger number of examples is needed to override the prior assumed accuracy p.

3. Entropy. This is the measure used by the PERFORMANCE subroutine in the generate-and-test algorithm. Let S be the set of examples that match the rule preconditions. Entropy measures the uniformity of the target function values for this set of examples. We take the negative of the entropy so that better rules will have higher scores:
$-\mathrm{Entropy}(S) = \sum_{i=1}^{c} p_i \log_2 p_i$

where c is the number of distinct values the target function may take on, and where $p_i$ is the proportion of examples from S for which the target function takes on the i-th value.


This entropy measure, combined with a test for statistical significance, is used in the CN2 algorithm of Clark and Niblett (1989). It is also the basis for the information gain measure used by many decision tree learning algorithms.
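For concreteness, here are hedged Python sketches of these three evaluation functions; the function names and the toy numbers in the usage example are assumptions made for illustration.

    from math import log2

    def relative_frequency(nc, n):
        # fraction of the matched examples that the rule classifies correctly
        return nc / n

    def m_estimate(nc, n, p, m):
        # p: prior probability of the class the rule predicts; m: equivalent sample size.
        # With m = 0 this reduces to the relative frequency estimate.
        return (nc + m * p) / (n + m)

    def neg_entropy(class_counts):
        # class_counts: counts of each target value among the examples S covered by the rule
        total = sum(class_counts)
        return sum((c / total) * log2(c / total) for c in class_counts if c > 0)

    # Example: a rule matching 8 examples, 6 correctly; class prior p = 0.12; weight m = 5
    print(relative_frequency(6, 8))     # 0.75
    print(m_estimate(6, 8, 0.12, 5))    # (6 + 0.6) / 13 ≈ 0.508
    print(neg_entropy([6, 2]))          # ≈ -0.811, i.e., the negative of Entropy(S)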
