
Exoplanet hunting using Machine Learning

Hunting worlds beyond our solar system.


Nagesh Singh Chauhan

Source: https://weheartit.com/entry/298477443

Our Solar System formed around 4600 million years ago. We know this from the study of meteorites and
radioactivity. It all began with a cloud of gas and dust. A nearby supernova explosion probably perturbed the
calm cloud, which then started to contract due to gravity, forming a flat, rotating disk with most of the material
concentrated in the center: the protosun. Later, gravity pulled the rest of the material into clumps and rounded
some of them, forming the planets and dwarf planets. The leftovers resulted in comets, asteroids, and
meteoroids.

But what are Exoplanets?

Exoplanets are planets beyond our own solar system. Thousands have been discovered in the past two decades,
mostly with NASA’s Kepler Space Telescope.
These exoplanets come in a huge variety of sizes and orbits. Some are gigantic planets hugging close to their
parent stars; others are icy, some rocky. NASA and other agencies are looking for a special kind of planet: one
that’s the same size as Earth, orbiting a sun-like star in the habitable zone.

The habitable zone is the area around a star where it is not too hot and not too cold for liquid water to exist on
the surface of surrounding planets. Imagine if Earth was where Pluto is. The Sun would be barely visible (about
the size of a pea) and Earth’s ocean and much of its atmosphere would freeze.

Habitable zone. Source: https://www.e-education.psu.edu/astro801/content/l12_p4.html


Why even search for exoplanets?

There are about 100,000,000,000 stars in our Galaxy, the Milky Way. How many exoplanets — planets outside
of the Solar System — do we expect to exist? Why are some stars surrounded by planets? How diverse are
planetary systems? Does this diversity tell us something about the process of planet formation? These are some
of the many questions that motivate the study of exoplanets. Some exoplanets may have the necessary physical
conditions (amount and quality of light from the star, temperature, atmospheric composition) for the existence
of complex organic chemistry and perhaps for the development of Life (which may be quite different from Life
on Earth).

However, detecting exoplanets is no simple task. We may have imagined life on other planets in books and film
for centuries, but detecting actual exoplanets is a recent phenomenon. Planets on their own emit very little if any
light. We can only see Jupiter or Venus in the night sky because they reflect the sun’s light. If we were to look
at an exoplanet (the nearest one is over 4 light-years away), it would be very close to a brilliantly lit star,
making the planet impossible to see.

Source: https://media.giphy.com/media/YA2bZh31eFXi0/giphy.gif

Scientists discovered a very efficient way to study these systems: planets themselves emit little light, but the stars they orbit do. Taking this fact into account, scientists at NASA developed what they call the transit method, in which digital-camera-like technology is used to detect and measure tiny dips in a star's brightness as a planet crosses in front of it. From observations of transiting planets, astronomers can calculate the ratio of a planet's radius to that of its star (essentially the size of the planet's shadow) and, with that ratio, they can calculate the planet's size.
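
As a back-of-the-envelope illustration (not part of the article's dataset or code), the transit depth is roughly the squared ratio of the planetary and stellar radii, which is why even a Jupiter-sized planet dims a Sun-like star by only about 1%:

# Rough transit-depth estimate: depth ≈ (R_planet / R_star)^2
r_jupiter = 69911   # Jupiter's mean radius in km
r_sun = 696000      # Sun's radius in km (approximate)
depth = (r_jupiter / r_sun) ** 2
print("Brightness dip during transit: {:.2%}".format(depth))  # about 1%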

Kepler Space Telescope’s primary method of searching for planets was the “Transit” method.

Transit method: In the diagram below, a star is orbited by a planet. The graph shows that the starlight intensity drops while the star is partially obscured by the planet from our vantage point, and rises back to its original value once the planet is no longer in the line of sight.
Source: https://gfycat.com/viciousthaticelandichorse

Until just a few years ago, astronomers had confirmed fewer than a thousand exoplanets. Then came the Kepler mission, and the number of known exoplanets exploded. The Kepler mission sadly ended in 2018, but the Transiting Exoplanet Survey Satellite (TESS) has taken its place and is regularly finding new exoplanets in the night sky. TESS monitors the brightness of stars for periodic drops caused by planet transits and is finding planets ranging from small, rocky worlds to giant planets, showcasing the diversity of planets in the galaxy.

I wanted to see if I could look at the available exoplanet data and make predictions about which planets might be hospitable to life. The data made publicly available by NASA is beautiful in that it contains many useful features. The goal is to create a model that can predict the presence of an exoplanet, utilizing the flux (light intensity) readings recorded over time for each of the stars in the dataset.

The dataset can be downloaded from here.

Let us start by importing all the libraries:

import os
import warnings
import math
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
from sklearn.metrics import mean_squared_error, mean_absolute_error
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (precision_score, recall_score, roc_curve, auc, f1_score,
                             roc_auc_score, confusion_matrix, accuracy_score, classification_report)
from sklearn.preprocessing import StandardScaler, normalize
from scipy import ndimage
import seaborn as sns
Load the train and test data.

test_data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/kaggle/exoplanet/exoTest.csv').fillna(0)
train_data = pd.read_csv('/Users/nageshsinghchauhan/Downloads/ML/kaggle/exoplanet/exoTrain.csv').fillna(0)
train_data.head()

Dataset

Now the target column LABEL consists of two categories: 1 (does not represent an exoplanet) and 2 (represents the presence of an exoplanet). So, convert them to binary values.

categ = {2: 1,1: 0}


train_data.LABEL = [categ[item] for item in train_data.LABEL]
test_data.LABEL = [categ[item] for item in test_data.LABEL]

Before moving forward let us also reduce the amount of memory used by both test and train data frames.

#Reduce memory
def reduce_memory(df):
    """Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df

test_data = reduce_memory(test_data)

#Output
Memory usage of dataframe is 13.91 MB
Memory usage after optimization is: 6.25 MB
Decreased by 55.1%

This step is purely for memory optimization; you can apply it to the train_data data frame as well.

Now visualize the target column in the train_dataset.

plt.figure(figsize=(6,4))
colors = ["0", "1"]
sns.countplot('LABEL', data=train_data, palette=colors)
plt.title('Class Distributions \n (0: Not Exoplanet || 1: Exoplanet)', fontsize=14)

Class distribution of Target variable.

It's quite clear that the data is highly imbalanced, so let us start with data preprocessing techniques.

Let us plot the first 4 rows of the train data and observe the intensity of flux values.

from pylab import rcParams


rcParams['figure.figsize'] = 13, 8
plt.title('Distribution of flux values', fontsize=10)
plt.xlabel('Flux values')
plt.ylabel('Flux intensity')
plt.plot(train_data.iloc[0,])
plt.plot(train_data.iloc[1,])
plt.plot(train_data.iloc[2,])
plt.plot(train_data.iloc[3,])
plt.show()
Well, our data is clean but is not normalized.

Let us plot the Gaussian histogram of non-exoplanets data.

labels_1 = [100, 200, 300]
for i in labels_1:
    plt.hist(train_data.iloc[i, :], bins=200)
    plt.title("Gaussian Histogram")
    plt.xlabel("Flux values")
    plt.show()
Absence of exoplanets

Now plot Gaussian histogram of the data when exoplanets are present.

labels_1 = [16, 21, 25]
for i in labels_1:
    plt.hist(train_data.iloc[i, :], bins=200)
    plt.title("Gaussian Histogram")
    plt.xlabel("Flux values")
    plt.show()
Presence of exoplanets

So let us first split our dataset and normalize it.

x_train = train_data.drop(["LABEL"],axis=1)
y_train = train_data["LABEL"]
x_test = test_data.drop(["LABEL"],axis=1)
y_test = test_data["LABEL"]

Data Normalization is a technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to a common scale, without distorting
differences in the ranges of values.

x_train = normalize(x_train)
x_test = normalize(x_test)

The next step is to apply a Gaussian filter to both the test and train data; this smooths each light curve, damping high-frequency noise while keeping the broader dips that transits produce.

In probability theory, the normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a very common
continuous probability distribution. Normal distributions are important in statistics and are often used in the
natural and social sciences to represent real-valued random variables whose distributions are not known.

The normal distribution is useful because of the central limit theorem. In its most general form, under some
conditions (which include finite variance), it states that averages of samples of observations of random variables
independently drawn from the same distribution converge in distribution to the normal, that is, they become
normally distributed when the number of observations is sufficiently large. Physical quantities that are expected
to be the sum of many independent processes often have distributions that are nearly normal.

x_train = ndimage.filters.gaussian_filter(x_train, sigma=10)
x_test = ndimage.filters.gaussian_filter(x_test, sigma=10)

We use feature scaling so that all values remain in a comparable range.

#Feature scaling
std_scaler = StandardScaler()
x_train = std_scaler.fit_transform(x_train)
x_test = std_scaler.transform(x_test)   # reuse the scaler fitted on the training data

The number of columns/features that we have been working with is huge. We have 5087 rows and 3198 columns in our training dataset. We therefore need to decrease the number of features (dimensionality reduction) to mitigate the curse of dimensionality.

For reducing the number of dimensions/features we will use the most popular dimensionality reduction algorithm, PCA (Principal Component Analysis).

To perform PCA we have to choose the number of features/dimensions that we want in our data.

#Dimensionality reduction
from sklearn.decomposition import PCA

pca = PCA()
x_train_pca = pca.fit_transform(x_train)   # transformed copies used only for this variance check
x_test_pca = pca.transform(x_test)
total = sum(pca.explained_variance_)
k = 0
current_variance = 0
while current_variance / total < 0.90:
    current_variance += pca.explained_variance_[k]
    k = k + 1

The above code gives k=37.


Now let us take k=37 and apply PCA on our independent variables.

#Apply PCA with n_components=37
pca = PCA(n_components=37)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Exoplanet Dataset Explained Variance')
plt.show()

The above plot tells us that by selecting 37 components we can preserve roughly 99% of the total variance of the data. We deliberately do not keep 100% of the variance, because that would mean keeping every component; we only want the principal ones.
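
As a quick sanity check (a one-liner sketch, assuming the pca object fitted above with n_components=37), the fraction of variance kept by the retained components can be printed directly:

#Cumulative explained variance of the 37 retained components
print(np.cumsum(pca.explained_variance_ratio_)[-1])   # roughly 0.99, matching the plot above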

The number of columns got reduced to 37 in both test and train datasets.

Now moving on to the next step, as we know the target class is not equally distributed and one class dominates
the other. So we need to resample our data so that the target class is equally distributed.

There are four ways of addressing class imbalance problems like these:

- Synthesis of new minority class instances
- Over-sampling of the minority class
- Under-sampling of the majority class
- Tweaking the cost function so that misclassifying a minority instance costs more than misclassifying a majority instance (a small sketch of this option follows the list)
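
As a sketch of the cost-function option (the article itself uses SMOTE over-sampling below), scikit-learn classifiers accept a class_weight argument that makes errors on the rare class more expensive; the random_state value here is arbitrary:

#Cost-sensitive alternative to resampling: weight classes inversely to their frequency
from sklearn.ensemble import RandomForestClassifier

weighted_rf = RandomForestClassifier(class_weight='balanced', random_state=42)
weighted_rf.fit(x_train, y_train)   # trained on the original, imbalanced labels
print(weighted_rf.score(x_test, y_test))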

#Resampling as the data is highly unbalanced.
print("Before OverSampling, counts of label '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train==0)))

sm = SMOTE(random_state=27, ratio=1.0)
x_train_res, y_train_res = sm.fit_sample(x_train, y_train.ravel())

print("After OverSampling, counts of label '1': {}".format(sum(y_train_res==1)))
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res==0)))

We have used SMOTE (Synthetic Minority Over-sampling TEchnique). SMOTE is an over-sampling method: it creates synthetic (not duplicate) samples of the minority class, making the minority class as large as the majority class. SMOTE does this by selecting similar records and altering each record one column at a time by a random amount within the difference to the neighboring records.

Before OverSampling, counts of label '1': 37
Before OverSampling, counts of label '0': 5050

After OverSampling, counts of label '1': 5050
After OverSampling, counts of label '0': 5050
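
The interpolation idea behind SMOTE can be sketched in a few lines of NumPy (an illustration of the principle only, not imblearn's implementation; the feature values are made up):

#A SMOTE-style synthetic sample lies on the segment between a minority sample
#and one of its nearest minority-class neighbours
x_i = np.array([1.0, 2.0, 3.0])          # a minority-class sample (hypothetical)
x_neighbor = np.array([1.5, 2.5, 2.0])   # one of its nearest minority neighbours
lam = np.random.rand()                   # random factor in [0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)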

Now we come to building a model that can classify exoplanets on the test data.

So I'll create a function, model, which will:

1. fit the classifier
2. run cross-validation
3. report accuracy
4. print the classification report
5. plot the confusion matrix

def model(classifier, dtrain_x, dtrain_y, dtest_x, dtest_y):
    #fit the model
    classifier.fit(dtrain_x, dtrain_y)
    predictions = classifier.predict(dtest_x)

    #Cross validation
    accuracies = cross_val_score(estimator=classifier, X=x_train_res, y=y_train_res,
                                 cv=5, n_jobs=-1)
    mean = accuracies.mean()
    variance = accuracies.std()
    print("Accuracy mean: " + str(mean))
    print("Accuracy variance: " + str(variance))

    #Accuracy
    print("\naccuracy_score :", accuracy_score(dtest_y, predictions))

    #Classification report
    print("\nclassification report :\n", (classification_report(dtest_y, predictions)))

    #Confusion matrix
    plt.figure(figsize=(13,10))
    plt.subplot(221)
    sns.heatmap(confusion_matrix(dtest_y, predictions), annot=True, cmap="viridis",
                fmt="d", linecolor="k", linewidths=3)
    plt.title("CONFUSION MATRIX", fontsize=20)

There is always a need to validate the stability of your machine learning model. You just can't fit the model to your training data and hope it will work accurately on real data it has never seen before. You need some kind of assurance that your model has picked up most of the patterns in the data correctly and is not fitting too much of the noise; in other words, that it is low on bias and variance.

Now fit the Support Vector Machine (SVM) algorithm to the training set and do prediction.

from sklearn.svm import SVC


SVM_model=SVC()
model(SVM_model,x_train_res,y_train_res,x_test,y_test)

Also, try the Random Forest model and get the feature importances, but before doing that, include the code below in the model function.

#Display feature importance
df1 = pd.DataFrame.from_records(dtrain_x)
tmp = pd.DataFrame({'Feature': df1.columns,
                    'Feature importance': classifier.feature_importances_})
tmp = tmp.sort_values(by='Feature importance', ascending=False)
plt.figure(figsize=(7,4))
plt.title('Features importance', fontsize=14)
s = sns.barplot(x='Feature', y='Feature importance', data=tmp)
s.set_xticklabels(s.get_xticklabels(), rotation=90)
plt.show()

Then call the Random Forest classifier:

from sklearn.ensemble import RandomForestClassifier


rf_classifier = RandomForestClassifier()
model(rf_classifier,x_train_res,y_train_res,x_test,y_test)

Generally, feature importance provides a score that indicates how useful or valuable each feature was in the construction of the model. The more an attribute is used to make key decisions within the decision trees, the higher its relative importance.
We can see that we are getting pretty good results from the SVM and Random Forest algorithms. However, you can go ahead and tweak the parameters, try other algorithms, and compare the resulting accuracy (a small grid-search sketch follows).
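
For example, one simple way to tweak the SVM's parameters is a small grid search (a sketch only; the parameter grid and scoring choice are arbitrary assumptions):

#Illustrative hyperparameter search over the SVM
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='recall')
grid.fit(x_train_res, y_train_res)
print(grid.best_params_, grid.best_score_)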

Now let us try to solve the same problem using an artificial neural network (ANN).

from tensorflow import set_random_seed
set_random_seed(101)

from sklearn.model_selection import cross_val_score
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential # initialize neural network library
from keras.layers import Dense # build our layers library

def build_classifier():
    classifier = Sequential() # initialize neural network
    classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu',
                         input_dim=x_train_res.shape[1]))
    classifier.add(Dense(units=4, kernel_initializer='uniform', activation='relu'))
    classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
    classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return classifier

classifier = KerasClassifier(build_fn=build_classifier, epochs=40)
accuracies = cross_val_score(estimator=classifier, X=x_train_res, y=y_train_res,
                             cv=5, n_jobs=-1)
mean = accuracies.mean()
variance = accuracies.std()
print("Accuracy mean: " + str(mean))
print("Accuracy variance: " + str(variance))

#Accuracy mean: 0.9186138613861387
#Accuracy variance: 0.07308084375906461

The neural network model gave an accuracy mean of 91.86% and an accuracy variance of 7.30% after cross-validation, which is a pretty good result.

Conclusion: The Future


It’s amazing we are able to gather light from distant stars, study this light that has been traveling for thousands
of years, and make conclusions about what potential worlds these stars might harbor.

Source: https://astrobiology.nasa.gov/news/the-just-approved-european-ariel-mission-will-be-first-dedicated-to-probing-exoplanet-atmospheres/

Within the next 10 years, 30- to 40-m diameter telescopes will operate from the Earth to detect exoplanets by imaging and by measuring velocity variations of their host stars. Satellite telescopes including Cheops, JWST, Plato, and Ariel will be launched to detect planets by the transit method. JWST will also do direct imaging. Large space telescopes 8 to 18 m in diameter (LUVOIR, HabEx) are being designed at NASA to detect signs of life on exoplanets by 2050.

In the more distant future, huge space interferometers will make detailed maps of planets. And possibly,
interstellar probes will be launched towards the nearest exoplanets to take close-up images. Engineers are
already working on propulsion techniques to reach such distant targets.

So in this article, we predicted the presence of an exoplanet using machine learning models and neural
networks.

Well, that's all for this article. I hope you have enjoyed reading it, and I'll be glad if it is of any help. Feel free to share your comments/thoughts/feedback in the comment section.
Source: https://imgur.com/gallery/qcU0h

You can find the code in this Github link: https://github.com/nageshsinghc4/Exoplanet-exploration

Thanks for reading!


Continuous Genetic Algorithm From Scratch With Python

Cahit bartu yazıcı


Oct 29 · 11 min read
Genetic algorithms are a powerful optimization technique inspired by nature. They mimic evolution to find the best solution. Unlike most optimization algorithms, genetic algorithms do not use derivatives to find the minima. One of their most significant advantages is the ability to find a global minimum without getting stuck in local minima. Randomness plays a substantial role in the structure of genetic algorithms, and it is the main reason they keep exploring the search space. The word continuous in the title means the genetic algorithm we are going to create will use floating-point numbers or integers as optimization parameters instead of binary numbers.
Flowchart of genetic algorithms

Genetic algorithms create an initial population of randomly generated candidate solutions; these candidate solutions are evaluated, and their fitness values are calculated. The fitness value of a solution is the numeric value that determines how good the solution is: the higher the fitness value, the better the solution. The figure below shows an example generation with 8 individuals. Each individual is made up of 4 genes, which represent the optimization parameters, and each individual has a fitness value, which in this case is the sum of the values of its genes.
An example of a generation

If the initial population does not meet the termination criteria, the genetic algorithm creates the next generation. The first genetic operation is Selection; in this operation, the individuals that are going to move on to the next generation are selected. After the selection process, the Pairing operation commences. Pairing pairs the selected individuals two by two for the Mating operation. Mating takes the paired parent individuals and creates offspring, which replace the individuals that were not selected in the Selection operation, so the next generation has the same number of individuals as the previous one. This process is repeated until the termination criteria are met (a compact sketch of this loop is given below).
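
Put together, the whole loop looks roughly like this (a schematic sketch only, using the functions defined later in the article; termination_criteria_met is a placeholder for the checks discussed in the Termination Criteria section, and the sizes and limits are illustrative):

#Schematic GA main loop
pop = population(20, 4, upper_limit=10, lower_limit=0)   # 20 individuals, 4 genes each
gen = first_generation(pop)                              # evaluate and sort the initial population
while not termination_criteria_met(gen):                 # placeholder predicate
    gen = next_generation(gen, upper_limit=10, lower_limit=0)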

In this article, the genetic algorithm code is created from scratch using the Python standard library and NumPy. Each of the genetic operations discussed above is implemented as its own function. Before we begin with the genetic algorithm code, we need to import some libraries;
import numpy as np
from numpy.random import randint
from random import random as rnd
from random import gauss, randrange

Initial Population
Genetic algorithms begin the optimization process by creating an initial population of candidate solutions whose
genes are randomly generated. To create the initial population, a function which creates individuals must be
created;

def individual(number_of_genes, upper_limit, lower_limit):
    individual = [round(rnd()*(upper_limit-lower_limit)
                  + lower_limit, 1) for x in range(number_of_genes)]
    return individual

The function takes the number of genes and the upper and lower limits for the genes as inputs and creates an individual. After the function to create individuals is written, another function is needed to create the population. The function to create a population can be written as;

def population(number_of_individuals,
               number_of_genes, upper_limit, lower_limit):
    return [individual(number_of_genes, upper_limit, lower_limit)
            for x in range(number_of_individuals)]

Using these two functions, the initial population can be created. After the genetic algorithm creates the first
generation, the fitness values of the individuals are calculated.
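
For example, a small initial population could be created like this (the sizes and limits are chosen only for illustration):

#8 individuals, 4 genes each, gene values rounded to one decimal in [0, 10]
pop = population(8, 4, 10, 0)
print(pop[0])   # e.g. [3.4, 7.1, 0.2, 9.8]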

Fitness Calculation
The fitness calculation function determines the fitness value of an individual; how the fitness value is calculated depends on the optimization problem. If the problem is to optimize the parameters of a function, that function should be implemented inside the fitness calculation function. The optimization problem can also be very complex and require specific software to solve; in that case, the fitness calculation function should run simulations and collect the results from the software being used. For simplicity, we will go over the generation example given at the beginning of the article.

def fitness_calculation(individual):
    fitness_value = sum(individual)
    return fitness_value

This is a very simple fitness function with only one parameter. The fitness function can also combine multiple parameters. For multiple parameters, normalizing the different parameters is very important: a difference in magnitude between parameters may cause some of them to have a negligible effect on the fitness value. Parameters can be normalized with different methods; one of the normalization methods is rescaling. Rescaling can be shown as;
Function for normalizing parameters

Where m_s is the scaled value of the parameter and m_o is the actual value of the parameter. In this function, the maximum and minimum values of the parameter should be determined according to the nature of the problem.
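
In code, that rescaling step could look like the following sketch (the example limits are assumptions, not values from the article):

def rescale(m_o, m_min, m_max):
    """Min-max rescaling: maps the actual value m_o into [0, 1] as m_s."""
    return (m_o - m_min) / (m_max - m_min)

m_s = rescale(50, 20, 80)   # a parameter known to lie between 20 and 80 -> 0.5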

After the parameters are normalized, the importance of each parameter is determined by the bias (weight) given to it in the fitness function. The sum of the biases given to the parameters should be 1. For multiple parameters, the fitness function can be written as;
Multi-parameter fitness function

Where b represents the biases of the fitness function and p represents the normalized parameters.
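
A minimal sketch of such a multi-parameter fitness function, assuming three normalized parameters and illustrative biases that sum to 1:

def weighted_fitness(p, b=(0.5, 0.3, 0.2)):
    """Fitness as a bias-weighted sum of normalized parameters (sum(b) == 1)."""
    return sum(bi * pi for bi, pi in zip(b, p))

print(weighted_fitness([0.8, 0.4, 0.9]))   # 0.5*0.8 + 0.3*0.4 + 0.2*0.9 = 0.70
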
Selection
The Selection function takes the population of candidate solutions and their fitness values (a generation) and
outputs the individuals that are going to be moving on to the next generation. Elitism can be introduced to the
genetic algorithm, which will automatically select the best individual in a generation, so we do not lose the best
solution. There are a few selection methods that can be used. Selection methods given in this article are;

- Roulette wheel selection: In roulette wheel selection, each individual has a chance of being selected. The chance of an individual being selected is based on its fitness value: fitter individuals have a higher chance of being selected.
Roulette wheel selection figure
The function for roulette wheel selection takes the cumulative sums and a randomly generated value for the selection process and returns the index of the selected individual. By calculating the cumulative sums, each individual gets a unique interval between 0 and 1. To select an individual, a number between 0 and 1 is randomly generated, and the individual whose cumulative sum is the first to exceed that number is selected. The roulette function can be written as;

def roulette(cum_sum, chance):
    variable = list(cum_sum.copy())
    variable.append(chance)
    variable = sorted(variable)
    return variable.index(chance)
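
A quick usage example (the cumulative-sum vector is hypothetical; in practice it comes from the selection function defined next):

cum_sum = np.array([0.4, 0.7, 0.9, 1.0])   # cumulative normalized fitness of 4 individuals
pick = roulette(cum_sum, 0.55)             # 0.55 falls in the second interval
print(pick)                                # 1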

- Fittest half selection: In this selection method, the fittest half of the candidate solutions is selected to move on to the next generation.
Fittest half selection figure

- Random selection: In this method, individuals are selected randomly.

Random selection figure

Selection function can be written as;

def selection(generation, method='Fittest Half'):
    generation['Normalized Fitness'] = \
        sorted([generation['Fitness'][x]/sum(generation['Fitness'])
                for x in range(len(generation['Fitness']))], reverse=True)
    generation['Cumulative Sum'] = np.array(
        generation['Normalized Fitness']).cumsum()
    if method == 'Roulette Wheel':
        selected = []
        for x in range(len(generation['Individuals'])//2):
            selected.append(roulette(generation['Cumulative Sum'], rnd()))
            while len(set(selected)) != len(selected):
                selected[x] = \
                    (roulette(generation['Cumulative Sum'], rnd()))
        selected = {'Individuals':
                        [generation['Individuals'][int(selected[x])]
                         for x in range(len(generation['Individuals'])//2)],
                    'Fitness': [generation['Fitness'][int(selected[x])]
                                for x in range(
                                    len(generation['Individuals'])//2)]}
    elif method == 'Fittest Half':
        selected_individuals = [generation['Individuals'][-x-1]
                                for x in range(int(len(generation['Individuals'])//2))]
        selected_fitnesses = [generation['Fitness'][-x-1]
                              for x in range(int(len(generation['Individuals'])//2))]
        selected = {'Individuals': selected_individuals,
                    'Fitness': selected_fitnesses}
    elif method == 'Random':
        selected_individuals = \
            [generation['Individuals']
             [randint(1, len(generation['Fitness']))]
             for x in range(int(len(generation['Individuals'])//2))]
        selected_fitnesses = [generation['Fitness'][-x-1]
                              for x in range(int(len(generation['Individuals'])//2))]
        selected = {'Individuals': selected_individuals,
                    'Fitness': selected_fitnesses}
    return selected

Pairing
Pairing and mating are used as a single operation in most genetic algorithm applications, but to keep the functions simpler and to allow different mating and pairing algorithms to be swapped in easily, the two genetic operations are separated in this application. If there is elitism in the genetic algorithm, the elite individual must be an input to the function along with the selected individuals. We are going to discuss three different pairing methods;

- Fittest: In this method, individuals are paired two by two, starting from the fittest individual. By doing so, fitter individuals are paired together, but less fit individuals are paired together as well.
Fittest pairing figure

- Random: In this method, individuals are paired two by two randomly.

Random pairing figure

- Weighted random: In this method, individuals are paired two by two randomly, but fitter individuals have a higher chance of being selected for pairing.
Weighted random pairing

Pairing function can be written as;

def pairing(elit, selected, method='Fittest'):
    individuals = [elit['Individuals']] + selected['Individuals']
    fitness = [elit['Fitness']] + selected['Fitness']
    if method == 'Fittest':
        parents = [[individuals[x], individuals[x+1]]
                   for x in range(len(individuals)//2)]
    if method == 'Random':
        parents = []
        for x in range(len(individuals)//2):
            parents.append(
                [individuals[randint(0, (len(individuals)-1))],
                 individuals[randint(0, (len(individuals)-1))]])
            while parents[x][0] == parents[x][1]:
                parents[x][1] = individuals[
                    randint(0, (len(individuals)-1))]
    if method == 'Weighted Random':
        normalized_fitness = sorted(
            [fitness[x]/sum(fitness)
             for x in range(len(individuals)//2)], reverse=True)
        cumulative_sum = np.array(normalized_fitness).cumsum()
        parents = []
        for x in range(len(individuals)//2):
            parents.append(
                [individuals[roulette(cumulative_sum, rnd())],
                 individuals[roulette(cumulative_sum, rnd())]])
            while parents[x][0] == parents[x][1]:
                parents[x][1] = individuals[
                    roulette(cumulative_sum, rnd())]
    return parents

Mating
We will discuss two different mating methods. In the Python code given below, each pair of selected parent individuals creates two offspring.

- Single point: In this method, the genes after a single point are swapped with the genes of the other parent to create two offspring.
Single point mating

- Two points: In this method, the genes between two points are swapped with the genes of the other parent to create two offspring.
Two points mating

Mating function can be coded as;

def mating(parents, method='Single Point'):
    if method == 'Single Point':
        pivot_point = randint(1, len(parents[0]))
        offsprings = [parents[0][0:pivot_point] + parents[1][pivot_point:]]
        offsprings.append(parents[1][0:pivot_point] + parents[0][pivot_point:])
    if method == 'Two Points':
        pivot_point_1 = randint(1, len(parents[0])-1)
        pivot_point_2 = randint(1, len(parents[0]))
        while pivot_point_2 < pivot_point_1:
            pivot_point_2 = randint(1, len(parents[0]))
        offsprings = [parents[0][0:pivot_point_1] +
                      parents[1][pivot_point_1:pivot_point_2] +
                      parents[0][pivot_point_2:]]
        offsprings.append(parents[1][0:pivot_point_1] +
                          parents[0][pivot_point_1:pivot_point_2] +
                          parents[1][pivot_point_2:])
    return offsprings
Mutations
The final genetic operation is random mutation. Random mutations occur in the selected individuals and their offspring to improve the variety of the next generation. If there is elitism in the genetic algorithm, the elite individual does not go through random mutation, so we do not lose the best solution. We are going to discuss two different mutation methods.

- Gauss: In this method, the gene that goes through mutation is replaced with a number generated according to a Gaussian distribution around the original gene value.
- Reset: In this method, the original gene is replaced with a randomly generated gene.
Reset mutation figure

The mutation function can be written as;

def mutation(individual, upper_limit, lower_limit, mutation_rate=2,
             method='Reset', standard_deviation=0.001):
    gene = [randint(0, 7)]
    for x in range(mutation_rate-1):
        gene.append(randint(0, 7))
        while len(set(gene)) < len(gene):
            gene[x] = randint(0, 7)
    mutated_individual = individual.copy()
    if method == 'Gauss':
        for x in range(mutation_rate):
            mutated_individual[x] = \
                round(individual[x] + gauss(0, standard_deviation), 1)
    if method == 'Reset':
        for x in range(mutation_rate):
            mutated_individual[x] = round(rnd() *
                (upper_limit-lower_limit) + lower_limit, 1)
    return mutated_individual
Creating The Next Generation
The next generation is created using the genetic operations we discussed. Elitism can be introduced while creating the next generation: the elite (best) individual of the current generation is carried over unchanged, so the best solution found so far is never lost. The Python code to create the next generation can be written as;

def next_generation(gen, upper_limit, lower_limit):
    elit = {}
    next_gen = {}
    elit['Individuals'] = gen['Individuals'].pop(-1)
    elit['Fitness'] = gen['Fitness'].pop(-1)
    selected = selection(gen)
    parents = pairing(elit, selected)
    offsprings = [[[mating(parents[x])
                    for x in range(len(parents))]
                    [y][z] for z in range(2)]
                    for y in range(len(parents))]
    offsprings1 = [offsprings[x][0]
                   for x in range(len(parents))]
    offsprings2 = [offsprings[x][1]
                   for x in range(len(parents))]
    unmutated = selected['Individuals'] + offsprings1 + offsprings2
    mutated = [mutation(unmutated[x], upper_limit, lower_limit)
               for x in range(len(gen['Individuals']))]
    unsorted_individuals = mutated + [elit['Individuals']]
    unsorted_next_gen = \
        [fitness_calculation(mutated[x])
         for x in range(len(mutated))]
    unsorted_fitness = [unsorted_next_gen[x]
                        for x in range(len(gen['Fitness']))] + [elit['Fitness']]
    sorted_next_gen = \
        sorted([[unsorted_individuals[x], unsorted_fitness[x]]
                for x in range(len(unsorted_individuals))],
               key=lambda x: x[1])
    next_gen['Individuals'] = [sorted_next_gen[x][0]
                               for x in range(len(sorted_next_gen))]
    next_gen['Fitness'] = [sorted_next_gen[x][1]
                           for x in range(len(sorted_next_gen))]
    gen['Individuals'].append(elit['Individuals'])
    gen['Fitness'].append(elit['Fitness'])
    return next_gen

Termination Criteria
After a generation is created, termination criteria are used to determine whether the genetic algorithm should create another generation or stop. Different termination criteria can be used at the same time, and if the genetic algorithm satisfies one of them, it stops. We are going to discuss four termination criteria.

- Maximum fitness: This termination criterion checks whether the fittest individual in the current generation satisfies our requirement. Using this termination method, desired results can be obtained. As seen in the figure below, the maximum fitness limit can be chosen to include some of the local minima.
- Maximum average fitness: If we are interested in a set of solutions, the average fitness of the individuals in the current generation can be checked to determine whether the current generation satisfies our expectations.
- Maximum number of generations: We can limit the maximum number of generations created by the genetic algorithm.
- Maximum similar fitness number: Due to elitism, the best individual in a generation moves on to the next generation without mutating. This individual can be the best individual in the next generation as well. We can limit the number of times the same individual is allowed to remain the best, as this can be a sign that the genetic algorithm is stuck in a local minimum. The function for checking whether the maximum fitness value has changed can be written as;

def fitness_similarity_check(max_fitness, number_of_similarity):
    result = False
    similarity = 0
    for n in range(len(max_fitness)-1):
        if max_fitness[n] == max_fitness[n+1]:
            similarity += 1
        else:
            similarity = 0
    if similarity == number_of_similarity-1:
        result = True
    return result

Running the Algorithm

Now that all of the functions we need for the genetic algorithm are ready, we can begin the optimization process. To run the genetic algorithm with 20 individuals in each generation;

# Generations and fitness values will be written to this file
Result_file = 'GA_Results.txt'

# Creating the First Generation
def first_generation(pop):
    fitness = [fitness_calculation(pop[x])
               for x in range(len(pop))]
    sorted_fitness = sorted([[pop[x], fitness[x]]
                             for x in range(len(pop))], key=lambda x: x[1])
    population = [sorted_fitness[x][0]
                  for x in range(len(sorted_fitness))]
    fitness = [sorted_fitness[x][1]
               for x in range(len(sorted_fitness))]
    return {'Individuals': population, 'Fitness': sorted(fitness)}

pop = population(20, 8, 1, 0)
gen = []
gen.append(first_generation(pop))
fitness_avg = np.array([sum(gen[0]['Fitness'])/
                        len(gen[0]['Fitness'])])
fitness_max = np.array([max(gen[0]['Fitness'])])
res = open(Result_file, 'a')
res.write('\n'+str(gen)+'\n')
res.close()

finish = False
while finish == False:
    if max(fitness_max) > 6:
        break
    if max(fitness_avg) > 5:
        break
    if fitness_similarity_check(fitness_max, 50) == True:
        break
    gen.append(next_generation(gen[-1], 1, 0))
    fitness_avg = np.append(fitness_avg, sum(
        gen[-1]['Fitness'])/len(gen[-1]['Fitness']))
    fitness_max = np.append(fitness_max, max(gen[-1]['Fitness']))
    res = open(Result_file, 'a')
    res.write('\n'+str(gen[-1])+'\n')
    res.close()

Conclusion
Genetic algorithms can be used to solve multi-parameter constrained optimization problems. Like most optimization algorithms, genetic algorithms can be used through existing libraries, but creating the algorithm from scratch gives a perspective on how it works, and the algorithm can be tailored to a specific problem.

Thank you for reading, I hope the article was helpful.

Genetic algorithms were first proposed by John H. Holland, you can find his original work here;

https://mitpress.mit.edu/books/adaptation-natural-and-artificial-systems

You can check out these two books if you want to learn more about genetic algorithms as well;
