
Seminar on

“Data Mining Using Genetic Algorithm (DMGA)”

Presented By

Pramod Vishwakarma, M.Tech.[CSE], IIIrd Sem, CET Moradabad, param.vish@gmail.com

Supervisor

Prof. Rajiv Kumar Nath

Contents

What Is Data Mining?

Architecture of Typical Data Mining System

Biological Terminologies

What is Genetic Algorithm (GA)?

Basic Principles of GA

Why Data Mining using Genetic Algorithm?

Functions of Genetic Algorithm

Pseudo Code of GA

Applications of GA

Advantages and Disadvantages

The Tool MATLAB

Conclusion & Future Work

References

What Is Data Mining?

Data mining (knowledge discovery from data)


Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data [1].

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Is everything “data mining”?

Simple search and query processing

Expert systems


Architecture: Typical Data Mining System

[Figure: layered architecture of a typical data mining system [1]. Components, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories).]

Biological Terminologies [2]

Gene - Each gene encodes a particular protein. Put simply, each gene encodes a trait, for example eye color.

Chromosome - A chromosome consists of genes, which are blocks of DNA. Chromosomes are strings of DNA and serve as a model for the whole organism.

Alleles - The possible settings for a trait (e.g. blue, brown) are called alleles.

Locus - Each gene has its own position in the chromosome. This position is called the locus.

Genome - The complete set of genetic material (all chromosomes) is called the genome.

Genotype - A particular set of genes in the genome is called the genotype.


Phenotype - The genotype contains the information required to construct an organism; the organism itself is referred to as the phenotype.


Genetic Algorithm (GA)

Genetic algorithms (GAs) were developed by John Holland in the 1970s.

They are based on the genetic processes of biological organisms.

Over many generations, natural populations evolve according to the principles of natural selection and “survival of the fittest”, first clearly stated by Charles Darwin in the Origin of Species.

GAs are adaptive methods which may be used to solve search and optimization problems.

After a number of new generations have been built with these mechanisms, one obtains a solution that shows no further improvement. This solution is taken as the final one.


Basic Principles of GA

Coding

Fitness function

Reproduction

Selection

Crossover

Mutation

Convergence

Coding

Before a GA can be run, a suitable coding (or representation) for the problem must be devised.

It is assumed that a potential solution to a problem may be represented as a set of parameters (for example, the dimensions of the beams in a bridge design).

For example, if our problem is to maximize a function of three variables, F(x, y, z), we might represent each variable by a 10-bit binary number. Our chromosome would therefore contain three genes, and consist of 30 binary digits.
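The 30-bit layout described above can be sketched in Python as follows. This is only a minimal illustration: the helper names `encode` and `decode` and the constant `BITS_PER_GENE` are assumptions introduced here, and each variable is taken to be an unsigned integer between 0 and 1023.

```python
BITS_PER_GENE = 10  # assumed constant name: each variable uses 10 bits

def encode(x, y, z):
    """Pack three integers in [0, 1023] into one 30-bit chromosome string."""
    for value in (x, y, z):
        if not 0 <= value < 2 ** BITS_PER_GENE:
            raise ValueError("each variable must fit in 10 bits")
    return "".join(f"{value:0{BITS_PER_GENE}b}" for value in (x, y, z))

def decode(chromosome):
    """Split a chromosome back into its genes and convert each to an integer."""
    genes = [chromosome[i:i + BITS_PER_GENE]
             for i in range(0, len(chromosome), BITS_PER_GENE)]
    return tuple(int(gene, 2) for gene in genes)

chromosome = encode(5, 1023, 0)
print(len(chromosome))     # three genes of 10 bits each
print(decode(chromosome))  # recovers the original three values
```

Real-valued variables could be handled the same way by mapping each 10-bit integer onto the interval of interest.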

Fitness Function

A fitness function must be devised for each problem to be solved.

Given a particular chromosome, the fitness function returns a single numerical “fitness” or “figure of merit”, which is supposed to be proportional to the “utility” or “ability” of the individual that the chromosome represents.

Reproduction

During the reproductive phase of the GA, individuals are selected from the population and recombined, producing offspring which will comprise the next generation.

Parents are selected randomly from the population using a scheme which favours the more fit individuals.

Having selected two parents, their chromosomes are recombined, typically using the mechanisms of crossover and mutation.
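One scheme that favours fitter individuals is fitness-proportionate ("roulette wheel") selection. The slides do not say which scheme was intended, so the sketch below shows only one possible choice; the function name `select_parents`, the toy population, and its fitness scores are assumptions made for illustration.

```python
import random

def select_parents(population, fitnesses, rng=random):
    """Pick two parents; the chance of being picked is proportional to fitness.

    Assumes all fitness values are non-negative and at least one is positive.
    Sampling is with replacement, so the same individual may be picked twice.
    """
    return rng.choices(population, weights=fitnesses, k=2)

# Toy population of 4-bit chromosomes with made-up fitness scores.
population = ["0001", "0110", "1011", "1111"]
fitnesses = [1.0, 2.0, 3.0, 4.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {chromosome: 0 for chromosome in population}
for _ in range(10_000):
    for parent in select_parents(population, fitnesses, rng):
        counts[parent] += 1
print(counts)  # fitter chromosomes are drawn more often
```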

Example of Crossover & Mutation

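A small sketch of single-point crossover and bit-flip mutation on binary strings follows. The parent strings, the cut point, and the function names are illustrative assumptions and are not taken from the slide.

```python
def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes after position `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def flip_bit(chromosome, index):
    """Mutate one gene by flipping the bit at `index`."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

child_a, child_b = single_point_crossover("11111111", "00000000", point=3)
print(child_a, child_b)      # 11100000 00011111
print(flip_bit(child_a, 7))  # 11100001
```

Each child inherits the head of one parent and the tail of the other, while mutation changes a single gene in place.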

Convergence

Convergence is the progression towards increasing uniformity.

A gene is said to have converged when 95% of the population share the same value.

The population is said to have converged when all of the genes have converged.

If the GA has been correctly implemented, the population will evolve over successive generations so that the fitness of the best and the average individual in each generation increases towards the global optimum.
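The convergence test can be sketched as below. Two assumptions are made here: each bit position is treated as one gene, and "95% of the population" is read as "at least 95%". The function names are hypothetical.

```python
from collections import Counter

def gene_converged(population, locus, threshold=0.95):
    """True when at least `threshold` of the population share one value at `locus`."""
    values = Counter(chromosome[locus] for chromosome in population)
    most_common_count = values.most_common(1)[0][1]
    return most_common_count / len(population) >= threshold

def population_converged(population, threshold=0.95):
    """True when every gene position has converged."""
    length = len(population[0])
    return all(gene_converged(population, locus, threshold)
               for locus in range(length))

# 20 individuals: 19 identical and one differing in the last bit (19/20 = 95%).
population = ["1010"] * 19 + ["1011"]
print(population_converged(population))

population[0] = "1011"  # now only 18/20 = 90% agree on the last bit
print(population_converged(population))
```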

Why Data Mining using Genetic Algorithm?

There are several reasons to prefer genetic algorithms for data mining:

Robustness.

Ability to work on large and “noisy” datasets.

GAs perform a global search of the solution space, whereas most other algorithms use a greedy approach.

They cope well with attribute interaction.

With parallel approaches to genetic algorithms, scalability can be achieved. This characteristic is of great importance in data mining.

Moreover, genetic algorithms have a high degree of autonomy, which enables the discovery of knowledge previously unknown to the user.

Functions of Genetic Algorithm

The Fitness Function

The “fitness score” is returned as a result.

Parent Selection

Mating Pool

Crossover

The likelihood of crossover being applied is typically between 0.6 and 1.0.

Mutation

Mutation is applied to each child individually after crossover. It randomly alters each gene with a small probability (typically 0.001).
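To get a feel for how rare mutation is at this rate, the short sketch below computes the expected number of altered genes per chromosome and the probability that a chromosome passes through mutation unchanged. It assumes each gene is mutated independently; the function names and the 30-gene chromosome length (borrowed from the coding example) are illustrative.

```python
def expected_mutations(length, p_mutation):
    """Expected number of altered genes in one chromosome."""
    return length * p_mutation

def prob_unchanged(length, p_mutation):
    """Probability that mutation leaves a chromosome completely unchanged."""
    return (1 - p_mutation) ** length

# A 30-gene chromosome with the typical per-gene rate of 0.001.
print(expected_mutations(30, 0.001))
print(prob_unchanged(30, 0.001))
```

With these values roughly 97% of children are left untouched by mutation, so crossover does most of the work of recombining existing material while mutation occasionally introduces new values.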

Pseudo Code of GA [3]

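A compact Python sketch of a simple generational GA is given here. It is one common formulation and not necessarily the pseudo code shown on the slide. Assumptions: chromosomes are bit strings, selection is fitness-proportionate, the loop stops after a fixed number of generations rather than on a convergence test, and the toy fitness function `count_ones` (the "OneMax" problem) stands in for a real problem. All names and default parameter values are illustrative; the crossover and mutation rates follow the typical values quoted earlier.

```python
import random

def run_ga(fitness, length=20, pop_size=30, generations=50,
           p_crossover=0.7, p_mutation=0.001, seed=0):
    """Simple generational GA over bit strings; returns the best individual seen."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Small offset keeps the weights positive even if every fitness is zero.
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_generation = []
        while len(next_generation) < pop_size:
            # Selection biased towards fitter individuals.
            mother, father = rng.choices(population, weights=weights, k=2)
            # Single-point crossover, applied with probability p_crossover.
            if rng.random() < p_crossover:
                point = rng.randint(1, length - 1)
                mother, father = (mother[:point] + father[point:],
                                  father[:point] + mother[point:])
            # Mutation: each gene flips independently with probability p_mutation.
            for child in (mother, father):
                mutated = "".join(
                    ("1" if bit == "0" else "0") if rng.random() < p_mutation else bit
                    for bit in child)
                next_generation.append(mutated)
        population = next_generation[:pop_size]
        best = max(population + [best], key=fitness)
    return best

def count_ones(chromosome):
    """Toy fitness: the number of 1 bits in the chromosome."""
    return chromosome.count("1")

best = run_ga(count_ones)
print(best, count_ones(best))
```

The fitness function is passed in as an argument, which reflects the point that the GA machinery is separate from the application.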

Applications of GA

Domain: Application Types

Control: gas pipeline, pole balancing, missile evasion, pursuit
Design: semiconductor layout, aircraft design, keyboard configuration, communication networks
Scheduling: manufacturing, facility scheduling, resource allocation
Robotics: trajectory planning
Machine Learning: designing neural networks, improving classification algorithms, classifier systems
Signal Processing: filter design
Game Playing: poker, checkers, prisoner’s dilemma
Combinatorial Optimization: set covering, travelling salesman, routing, bin packing, graph colouring and partitioning

Advantages and Disadvantages

Advantages:

Concept is easy to understand

Modular, separate from application

It does not need to know any rules of the problem in advance.

This is very useful for very complex and loosely defined problems.

With a well-defined fitness function and carefully chosen attributes, a genetic algorithm can perform much faster than other algorithms such as the linear method.

Disadvantages:


The definition of the fitness function can sometimes be very complicated.

The fitness function may affect the performance of the process significantly as its complexity increases.

This is because the fitness function is used to compare every individual in the sample population against every record in the training data set.

Sometimes an acceptable solution cannot be derived even after a very large number of iterations if the genetic operators are wrongly chosen.
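The cost of fitness evaluation in a data mining setting can be illustrated with a toy rule-discovery example. Everything here is hypothetical: the training records are made up, and the chromosome is assumed to be a pair `(min_age, min_income)` read as a simple IF-THEN classification rule whose fitness is its accuracy on the training data.

```python
# Made-up training records: (age, income in thousands, buys_product).
TRAINING_DATA = [
    (25, 30, 0), (42, 80, 1), (35, 60, 1), (19, 15, 0),
    (51, 95, 1), (30, 40, 0), (46, 70, 1), (23, 55, 0),
]

def rule_fitness(rule, data=TRAINING_DATA):
    """Fraction of records that the rule classifies correctly.

    `rule` is a hypothetical chromosome (min_age, min_income), read as
    IF age >= min_age AND income >= min_income THEN buys_product = 1.
    """
    min_age, min_income = rule
    correct = 0
    for age, income, label in data:  # one pass over the whole data set
        predicted = 1 if (age >= min_age and income >= min_income) else 0
        correct += (predicted == label)
    return correct / len(data)

population = [(20, 20), (30, 50), (45, 90)]
scores = [rule_fitness(rule) for rule in population]
print(scores)  # [0.625, 1.0, 0.625]

# Comparisons per generation = population size x number of records.
print(len(population) * len(TRAINING_DATA))  # 24
```

Because every rule is checked against every record, the work per generation grows with the product of population size and data set size, which is why a costly fitness function dominates the running time.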

The Tool MATLAB [4]

MATLAB (Matrix Laboratory) -

MATLAB is a high-performance language for technical computing. It integrates computation, visualization and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.

Simulink -

Simulink is an interactive environment for modeling, simulating, and analyzing dynamic, multidomain systems. It lets you build a block diagram, simulate the system’s behavior, evaluate its performance, and refine the design.

Typical Uses of MATLAB

Math and computation

Algorithm development

Data acquisition

Modeling, simulation, and prototyping

Data analysis, exploration, and visualization

Scientific and engineering graphics

Application development, including graphical user interface building

Future Work

In future work, the algorithm presented here will be implemented as a program using MATLAB.

In addition, the study will focus on applying the genetic algorithm to a database.

Finally, the results will be compared with conventional data mining techniques in order to assess the benefit of using genetic algorithms.

Conclusion

This seminar covered the basics of data mining and the most commonly used architecture of a typical data mining system. The genetic algorithm and its operators were then described, and the pros and cons of GAs were discussed.

Finally, MATLAB and Simulink were introduced and future work was outlined.

References

1. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2006.

2. http://www.obitko.com/tutorials/genetic-algorithms/index.php

3. David Beasley et al. (1993), “An Overview of Genetic Algorithms: Part 1, Fundamentals”, University Computing, vol. 15 (2), pp. 58-69.

4. Learning MATLAB, Copyright 1984 - 2004 by The MathWorks, Inc.

Thank You…

Any questions or suggestions?