
A PROJECT REPORT ON

“CRIMES IN INDIA”
Submitted in partial fulfillment of the requirements for the award of the degree

BSC DATA SCIENCE [DS]

Submitted by
M.HAVILAH K1801526

G.TEJASWI K1801528
SK.RAZIA K1801530

A.HARIKA K1801531

M.RACHANA K1801532

Under the Guidance of

V.T. PAVAN KUMAR [MCA,M.TECH,(PH.D)]


Asst. Professor (Department of Computers)
Submitted to:
Department of Computer Science & Applications

KAKARAPARTI BHAVANARAYANA COLLEGE(Autonomous)

(Sponsored by S.K.P.V.V. Hindu High Schools Committee, Kothapeta, Vijayawada-5200)
(2018-2021)
This is to certify that the project work titled “CRIMES IN INDIA” is the
bona fide work done by M.Havilah (K1801526), G.Tejaswi (K1801528),
SK.Razia (K1801530), A.Harika (K1801531) and M.Rachana (K1801532) in
partial fulfilment of the requirements for the award of the degree B.Sc. Data
Science (DS) at Kakaraparti Bhavanarayana College (Autonomous), affiliated
to Krishna University, during 2018-2021.

PROJECT GUIDE HEAD OF THE DEPARTMENT

EXTERNAL EXAMINER
VIJAYAWADA-1.
Date:
DECLARATION
We hereby declare that this project work titled “CRIMES IN
INDIA”, submitted to the Department of Computer Applications in
partial fulfillment of the requirements for the award of the degree of B.Sc.
Data Science (DS), KBN College (Autonomous), Vijayawada, is done
by us and has not been submitted to any other institution or published
elsewhere.

Place: Vijayawada
Date:

Roll.no. Regd.no Name


185226 K1801526 M.HAVILAH
185228 K1801528 G.TEJASWI
185230 K1801530 SK.RAZIA
185231 K1801531 A.HARIKA
185232 K1801532 M.RACHANA
ACKNOWLEDGMENT

First of all, we are grateful to the Almighty God for enabling us to
complete this project.
We express our sincere thanks to the Management and our beloved
Principal Dr. E. Varaprasad for providing us the wonderful facilities
and encouragement required for the completion of this entire project
work.
We place on record our sincere gratitude to Sri P. RAVINDRA,
HOD of Computer Applications, for his constant encouragement. He
monitored our progress and arranged all facilities to make this project
easier.
We owe our profound gratitude to our Project Guide V.T. PAVAN
KUMAR, who pushed us to our limits. He was always deeply involved in the
entire process, shared his knowledge, and encouraged us to think.
We would like to thank all the other faculty members, technical staff
and supporting staff who have provided their contribution in the
completion of this project work.
Last but not least, our heartfelt thanks to our team members, our
parents and friends for their encouragement and support in the
completion of this project work and of this course.
Finally, this project work has been a golden opportunity for learning and self-
development. We consider ourselves very lucky and honoured to have had
so many people lead us through the completion of this project.
CONTENTS
1. Introduction
a. Title
b. Aim
2. Description
3. Data Science Process
4. System Requirements
a. Hardware Requirements
b. Software Requirements
5. Tool Used
a. Google Colaboratory
6. Coding Methods
7. Screenshots
8. Testing
9. Conclusion
10. Future Scope
11. Bibliography
Introduction

Title: CRIMES IN INDIA


Aim: To predict which months and times of day are safer
for people.
Description:
Crime is one of the biggest and most dominant problems in our
society, and its prevention is an important task. A huge number of
crimes are committed daily. It is necessary to keep track of all the
crimes and maintain a database which may be used for future
reference. The current problem we face is maintaining a proper
dataset of crime and analyzing this data to help predict and
solve crimes in the future. Our task is to predict which category of
crime is most likely to occur at which place and at what time.
The objective of this project is to analyze a dataset which
consists of numerous crimes and to predict the type of crime which
may happen in the future depending upon various
conditions. In this project, we will be using techniques of machine
learning and data science for crime prediction.
In our project, we analyze crime data from the city. It
consists of crime information such as location description, type of
crime, time, latitude and longitude. The Random Forest algorithm,
Naïve Bayes classification, KNN and various other algorithms will
be tested for crime prediction, and the one with better accuracy will be
used for training. The objective of this project is to give an idea of how
machine learning and crime analysis can be used by law
enforcement agencies to detect, predict and solve crimes at a much
faster rate and thus reduce the crime rate.
Module description:
Murders:
Under the common law, murder was an intentional killing that was:
 Unlawful (in other words, not legally justified), and
 Committed with ‘malice aforethought’, or that
 Intentionally inflicted serious bodily harm causing the victim’s
death.
In today’s society, murder is defined by statute rather than by
common law. Though today’s statutes derive from common law,
one has to look to these statutes for important distinctions.

Assault on women:
Violence against women and girls is a major health and human
rights issue. At least one in five of the world’s female population has
been physically or sexually abused by a man or men at some time in
their life. Many, including pregnant women and young girls, are
subject to severe, sustained or repeated attacks. Worldwide, it has
been estimated that violence against women is as serious a cause of
death and incapacity among women of reproductive age as cancer,
and a greater cause of ill-health than traffic accidents and malaria
combined. The abuse of women is effectively repeated in almost
every society of the world.

Kidnapping & Abduction:


Kidnapping and abduction are crimes under the Indian Penal
Code, 1860. They involve the forceful taking of a person or a
child, with or without consent. People have
continued to take advantage of minors, kidnapping them, exploiting them and
forcing them to perform horrendous acts. Such offences are an attack on
the liberty and freedom of citizens and must be prevented.
Robbery:
Robbery is the crime of stealing money or property from a bank,
shop, or vehicle, often by using force or threats, as when the members
of a gang commit dozens of armed robberies.

Victims of Rape :
 On average, there are 463,634 victims of rape and sexual assault
each year.
 Ages 12-34 are the highest risk years for rape and sexual assault.
 Those aged 65 and older are 92% less likely than 12-24 year olds
to be victims of rape or sexual assault, and 83% less likely than
25-49 year olds.

Dacoity:
When five or more persons conjointly commit or attempt to
commit a robbery, or where the whole number of persons
conjointly committing or attempting to commit a robbery, and of
persons present and aiding such commission or attempt, amounts to
five or more, every person so committing, attempting or aiding is
said to commit ‘dacoity’.
This concept of crime has been defined from the social and
legal standpoint. What constitutes criminal behavior depends
upon the legal codes of a particular society.
Data Science Process
The typical data science process consists of six steps.
Step 1: Defining research goals and creating a project
charter
A project starts by understanding the what, the why, and the how of
your project
The outcome should be a clear research goal, a good understanding of
the context, well-defined deliverables, and a plan of action with a
timetable. This information is then best placed in a project charter.
Step 2: Retrieving data
Many companies will have already collected and stored the data for
you, and what they don’t have can often be bought from third parties.
This data can be stored in official data repositories such as
1. databases,
2. data marts,
3. data warehouses, and
4. data lakes
maintained by a team of IT professionals.
A database is designed for data storage,
while a data warehouse is designed for reading and analyzing that
data.
A data mart is a subset of the data warehouse, and
a data lake contains data in its natural or raw format.
If data isn’t available inside your organization, look outside your
organization’s walls.
Step 3: Cleansing, integrating, and transforming data
The data received from the data retrieval phase is likely to be “a
diamond in the rough”.
Data cleansing is a subprocess of the data science process that
focuses on removing errors in your data so your data becomes a true
and consistent representation of the processes it originates from.
Types of errors
1. Interpretation error
2. Inconsistencies

Try to fix the problem early in the data acquisition chain or else fix it
in the program.
Error description and possible solution:

Errors pointing to false values within one data set:
 Mistakes during data entry: manual overrules
 Redundant white space: use string functions
 Impossible values: manual overrules
 Missing values: remove the observation or value
 Outliers: validate and, if erroneous, treat as a missing value (remove or insert)

Errors pointing to inconsistencies between data sets:
 Deviations from a code book: match on keys or else use manual overrules
 Different units of measurement: recalculate
 Different levels of aggregation: bring to the same level of measurement by aggregation or extrapolation
Data Entry Errors

Data collection and data entry are error-prone processes. They often
require human intervention, and because humans make typos or lose their
concentration for a second, they introduce errors into the chain.
But data collected by machines or computers isn’t free from errors
either. Examples of errors originating from machines are transmission
errors or bugs in the extract, transform, and load phase (ETL)
Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other
redundant characters would.
The cleaning during the ETL phase wasn’t well executed, and keys in
one table contained a whitespace at the end of a string. This caused a
mismatch of keys such as “FR” versus “FR ” (with a trailing space), dropping
the observations that couldn’t be matched.
Fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will
remove the leading and trailing whitespaces.
For instance, in Python you can use the strip() function to remove
leading and trailing spaces
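As a minimal sketch (the sample values below are illustrative, not taken from the crime dataset), trimming leading and trailing whitespace in plain Python looks like this:

raw_keys = ["FR ", " DE", "IN"]
clean_keys = [key.strip() for key in raw_keys]   # strip() removes leading/trailing whitespace
print(clean_keys)                                # ['FR', 'DE', 'IN']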
FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are common. Most programming languages
make a distinction between “Brazil” and “brazil”. In this case you can
solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python. "Brazil".lower() ==
"brazil".lower() should result in true.
Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here you
check the value against physically or theoretically impossible values
such as people taller than 3 meters or someone with an age of 299
years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
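As a hedged sketch of applying the same rule to a whole column with pandas (the column name age and the values are assumptions for illustration):

import pandas as pd

ages = pd.DataFrame({"age": [25, 299, 41, -3]})
valid = ages["age"].between(0, 120)   # the sanity check 0 <= age <= 120, vectorized
print(ages[~valid])                   # rows with impossible ages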
Outliers
An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a
different logic or generative process than the other observations. The
easiest way to find outliers is to use a plot or a table with the
minimum and maximum values
The plot on the top shows no outliers, whereas the plot on the bottom
shows possible outliers on the upper side when a normal distribution
is expected. The normal distribution, or Gaussian distribution, is the
most common distribution in natural sciences. It shows most cases
occurring around the average of the distribution and the occurrences
decrease when further away from it. The high values in the bottom
graph can point to outliers when assuming a normal distribution.
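A small illustrative sketch of that minimum/maximum check with pandas (the values are made up; the real columns would come from the crime dataset):

import pandas as pd

values = pd.Series([10, 12, 11, 13, 250])
print(values.min(), values.max())   # 10 250 -> the maximum looks suspicious
print(values.describe())            # quick summary to spot extreme values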
Deviations from a code book
Detecting errors in larger data sets against a code book or against
standardized values can be done with the help of set operations. A
code book is a description of your data, a form of metadata. It
contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable
means. (For instance “0” equals “negative”, “5” stands for “very
positive”.) A code book also tells the type of data you’re looking at: is
it hierarchical, a graph, or something else?
If you have multiple values to check, it’s better to put them from the
code book into a table and use a difference operator to check the
discrepancy between both tables.
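A minimal sketch of such a check using Python set operations (the codes shown are invented for illustration, not taken from an actual code book):

code_book_values = {"0", "1", "2", "3", "4", "5"}
observed_values = {"0", "5", "9", "x"}
unexpected = observed_values - code_book_values   # set difference: codes not in the code book
print(unexpected)                                 # {'9', 'x'}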
Combining data from different data sources
Your data comes from several different places, and in this substep we
focus on integrating these different sources. Data varies in size, type,
and structure, ranging from databases and Excel files to text
documents.
The different ways of combining data
You can perform two operations to combine information from
different data sets.
The first operation is joining: enriching an observation from one table
with information from another table.
The second operation is appending or stacking: adding the
observations of one table to those of another table.

Joining tables
Joining tables allows you to combine the information of one
observation found in one table with the information that you find in
another table.
Appending tables
Appending or stacking tables is effectively adding
observations from one table to another table.
Using views to simulate data joins and
appends
To avoid duplication of data, you virtually combine data with
views.
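A hedged pandas sketch of both combining operations (the table names and contents are invented for illustration; pandas has no database-style views, but merge and concat cover joining and appending):

import pandas as pd

crimes = pd.DataFrame({"state": ["AP", "TN"], "murders": [120, 95]})
population = pd.DataFrame({"state": ["AP", "TN"], "population": [49000000, 72000000]})

joined = crimes.merge(population, on="state", how="left")   # joining: enrich each observation
extra = pd.DataFrame({"state": ["KA"], "murders": [88]})
stacked = pd.concat([crimes, extra], ignore_index=True)     # appending: stack observations
print(joined)
print(stacked)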

Enriching aggregated measures


Data enrichment can also be done by adding calculated information to
the table, such as the total number of sales or what percentage of total
stock has been sold in a certain region
Transforming data
Certain models require their data to be in a certain shape. Now
that you’ve cleansed and integrated the data, this is the next task
you’ll perform: transforming your data so it takes a suitable form for
data modeling.
Transforming data.
Relationships between an input variable and an output variable
aren’t always linear. Take, for instance, a relationship of the
form $y = ae^{bx}$. Taking the log of the independent variables simplifies
the estimation problem dramatically. The below figure shows
how transforming the input variables greatly simplifies the estimation
problem. Other times you might want to combine two variables into a
new variable.
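A small numpy sketch of such a transformation (the values are synthetic, chosen only to show that taking logs linearizes an exponential relationship):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * np.exp(0.5 * x)            # y = a * e^(b*x) with a = 2, b = 0.5
log_y = np.log(y)                    # log(y) = log(a) + b*x, which is linear in x
b, log_a = np.polyfit(x, log_y, 1)   # fit a straight line to the transformed data
print(b, np.exp(log_a))              # recovers b ~ 0.5 and a ~ 2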
EUCLIDEAN DISTANCE
Euclidean distance or “ordinary” distance is an extension to
one of the first things anyone learns in mathematics
about triangles (trigonometry): Pythagoras’s leg theorem.
If you know the length of the two sides next to the 90° angle of
a right-angled triangle you can easily derive the length of the
remaining side (hypotenuse).
The formula for this is:
$$\text{hypotenuse} = \sqrt{a^2 + b^2}$$
The Euclidean distance between two points in a two-dimensional
plane is calculated using a similar formula:

$$\text{distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
If you want to expand this distance calculation to more dimensions,
add the coordinates of the point within
those higher dimensions to the formula. For three dimensions we get:
$$\text{distance} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
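As a short illustrative sketch, the same formula in any number of dimensions with numpy (the two points are arbitrary examples):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])
distance = np.sqrt(np.sum((p - q) ** 2))   # equivalent to np.linalg.norm(p - q)
print(distance)                            # 5.0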
Data scientists use special methods to reduce the number of
variables but retain the maximum amount of data. The below figure
shows how reducing the number of variables makes it easier to
understand the key values. It also shows how two variables account
for 50.6% of the variation within the data set
( component 1 = 27.8% + component 2 = 22.8%)
These variables, called “component1” and “component2,” are both
combinations of the original variables.
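A hedged sketch of one such method, principal component analysis with scikit-learn (the input matrix is random illustrative data, not the crime dataset, so the explained-variance numbers will differ from the 27.8% and 22.8% quoted above):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 observations with 5 original variables
pca = PCA(n_components=2)
components = pca.fit_transform(X)       # the new "component1" and "component2"
print(pca.explained_variance_ratio_)    # share of the variation each component retains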
Turning variables into dummies
Dummy variables can only take two values: true (1) or false (0).
They’re used to indicate the absence or presence of a categorical effect that may
explain the observation. In this case you’ll make a separate column for
each class stored in one variable and indicate it with 1 if the class is
present and 0 otherwise.
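A minimal pandas sketch of creating dummy variables (the crime types listed are examples only):

import pandas as pd

df = pd.DataFrame({"crime_type": ["murder", "robbery", "murder", "dacoity"]})
dummies = pd.get_dummies(df["crime_type"], dtype=int)   # one 0/1 column per class
print(dummies)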
Step 4: Exploratory data analysis
The visualization techniques you use in this phase range from simple
line graphs or histograms, as shown in figure, to more complex
diagrams such as Sankey and network graphs.

Step 5: Build the models


With clean data in place and a good understanding of the content,
you’re ready to build models with the goal of making better
predictions, classifying objects, or gaining an understanding of the
system that you’re modeling.
Building a model consists of the following main steps:
1. Selection of a modeling technique and variables to enter in the
model
2. Execution of the model
3. Diagnosis and model comparison

Step 6: Presenting findings and building applications on top of them
After you’ve successfully analyzed the data and built a well-
performing model, you’re ready to present your findings to the world.
Summary of Data Science Process
 Setting the research goal —
Defining the what, the why, and the how of your project in a
project charter.
 Retrieving data —
Finding and getting access to data needed in your project. This
data is either found within the company or retrieved from a
third party.
 Data preparation —
Checking and remediating data errors, enriching the data with
data from other data sources, and transforming it into a suitable
format for your models.
 Data exploration —
Diving deeper into your data using descriptive statistics and
visual techniques.
 Data modeling —
Using machine learning and statistical techniques to achieve
your project goal.
 Presentation and automation —
Presenting your results to the stakeholders and industrializing
your analysis process for repetitive reuse and integration with
other tools.
SYSTEM REQUIREMENTS:

a. Hardware requirements:

Processor: Intel i5
HDD: 500 GB
RAM: 4 GB

b. Software requirements:

Operating System: Windows 7
Front End: Google Colaboratory
Back End: Python 3.7
Others: Firefox, Internet Explorer
SOFTWARE TOOLS USED:

Python Tutorial with Google Colab:


Python is a great general-purpose programming language on its
own, but with the help of a few popular libraries (numpy, scipy,
matplotlib) it becomes a powerful environment for scientific
computing.
We expect that many of you will have some experience with
Python and numpy; for the rest of you, this section will serve as a
quick crash course both on the Python programming language and on
the use of Python for scientific computing.
 Basic Python: Basic data types, Functions, Classes
 Numpy : Arrays, Array indexing, Data types, Array math,
Broadcasting
 Matplotlib : Plotting, Subplots, Images
 Python : Creating notebooks, Typical workflows

As of January 1, 2020, Python has officially dropped support
for Python 2. We’ll be using Python 3.7 for this iteration of the course.
You can check your Python version at the command line by running
python --version. In Colab, we can enforce the Python version by
clicking Runtime -> Change runtime type and selecting Python 3.
Note that as of April 2020, Colab uses Python 3.6.9, which should run
everything without any errors.
REGRESSION:
Regression analysis is a fundamental concept in the field of
machine learning. It falls under supervised learning wherein the
algorithm is trained with both input features and output labels. It helps
in establishing a relationship among the variables by estimating how
one variable affects the other.
Imagine you're car shopping and have decided that gas mileage
is a deciding factor in your decision to buy. If you wanted to predict
the miles per gallon of some promising rides, how would you do it?
Well, since you know the different features of the car (weight,
horsepower, displacement, etc.) one possible method is regression. By
plotting the average MPG of each car given its features you can then
use regression techniques to find the relationship of the MPG and the
input features. The regression function here could be represented as
$Y = f(X)$, where Y would be the MPG and X would be the input
features like the weight, displacement, horsepower, etc. The target
function is $f$ and this curve helps us predict whether it’s beneficial
to buy or not buy. This mechanism is called regression.

REGRESSION IN MACHINE LEARNING:


Regression in machine learning consists of mathematical
methods that allow data scientists to predict a continuous outcome
(y) based on the value of one or more predictor variables (x). Linear
regression is probably the most popular form of regression
analysis because of its ease of use in predicting and forecasting.

Evaluating a Regression Algorithm:

Let’s say you’ve developed an algorithm which predicts next


week's temperature. The temperature to be predicted depends on
different properties such as humidity, atmospheric pressure, air
temperature and wind speed. But how accurate are your predictions?
How good is your algorithm?
To evaluate your predictions, there are two important metrics to be
considered: variance and bias.

VARIANCE:

Variance is the amount by which the estimate of the target


function changes if different training data were used. The target
function $f$ establishes the relation between the input (properties)
and the output variables (predicted temperature). When a different
dataset is used the target function needs to remain stable with little
variance because, for any given type of data, the model should be
generic. In this case, the predicted temperature changes based on the
variations in the training dataset. To avoid false predictions, we need
to make sure the variance is low. For that reason, the model should be
generalized to accept unseen features of temperature data and produce
better predictions.

BIAS:

Bias is the algorithm’s tendency to consistently learn the wrong


thing by not taking into account all the information in the data. For
the model to be accurate, bias needs to be low. If there are
inconsistencies in the dataset, like missing values, too few data
tuples or errors in the input data, the bias will be high and the
predicted temperature will be wrong.
Accuracy and error are the two other important metrics. The error is
the difference between the actual value and the predicted value
estimated by the model. Accuracy is the fraction of predictions our
model got right.
For a model to be ideal, it’s expected to have low variance,
low bias and low error. To achieve this, we need to partition the
dataset into train and test datasets. The model will then learn patterns
from the training dataset and the performance will be evaluated on the
test dataset. To reduce the error while the model is learning, we come
up with an error function which will be reviewed in the following
section. If the model memorizes/mimics the training data fed to it,
rather than finding patterns, it will give false predictions on unseen
data. The curve derived from such a trained model would pass
through all the training data points, yet the accuracy on the test dataset is low.
This is called overfitting and is caused by high variance.
On the flip side, if the model is too simple to capture the patterns and
performs poorly even on the training data, this leads to underfitting, which is caused by high bias.

Linear Regression:

Linear regression finds the linear relationship between the


dependent variable and one or more independent variables using a
best-fit straight line. Generally, a linear model makes a prediction by
simply computing a weighted sum of the input features, plus a
constant called the bias term (also called the intercept term). In this
technique, the dependent variable is continuous, the independent
variable(s) can be continuous or discrete, and the nature of the
regression line is linear. Mathematically, the prediction using linear
regression is given as:
$$y = \theta_0 + \theta_1x_1 + \theta_2x_2 + … + \theta_nx_n$$
Here, $y$ is the predicted value,
$n$ is the total number of input features,
$x_i$ is the input feature for $i^{th}$ value,
$\theta_i$ is the model parameter ($\theta_0$ is the bias and the
coefficients are $\theta_1, \theta_2, … \theta_n$).
The coefficient is like a volume knob: it varies according to the
corresponding input attribute, which brings change in the final value.
It signifies the contribution of the input variables in determining the
best-fit line.
Bias is a deviation induced to the line equation $y = mx$ for the
predictions we make. We need to tune the bias to vary the position of
the line that can fit best for the given data.
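A hedged scikit-learn sketch of fitting such a model (the feature matrix is random illustrative data rather than the crime dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                               # three input features
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimate of the bias term theta_0
print(model.coef_)        # estimates of theta_1, theta_2, theta_3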

THE BIAS-VARIANCE TRADE-OFF:


Bias and variance are always in a trade-off: when bias is high, the
variance is low, and when the variance is high, the bias is low. The former
case arises when the model is too simple with a fewer number of
parameters and the latter when the model is complex with numerous
parameters. We require both variance and bias to be as small as
possible, and to get there the trade-off needs to be dealt with
carefully; handled well, it yields the desired accuracy.

DRAWING THE BEST-FIT LINE :


Now, let’s see how linear regression adjusts the line between the data
for accurate predictions.
Imagine you’re given a set of data and your goal is to draw the best-
fit line which passes through the data. This is the step-by-step process
you proceed with:
1. Consider your linear equation to be $y = mx + c$, where y is the
dependent data and x is the independent data given in your
dataset.
2. Adjust the line by varying the values of $m$ and $c$, i.e., the
coefficient and the bias.
3. Come up with some random values for the coefficient and the
bias initially and plot the line.
4. Since the line won’t fit well, change the values of ‘m’ and
‘c’. This can be done using the ‘gradient descent algorithm’ or
the ‘least squares method’.

In accordance with the number of input and output variables,


linear regression is divided into three types: simple linear
regression, multiple linear regression and multivariate linear
regression.
LEAST SQUARES METHOD:
First, calculate the error/loss by subtracting the actual value from
the predicted one. Since the predicted values can be on either side of
the line, we square the difference to make it a positive value. The
result is denoted by ‘Q’, which is known as the sum of squared
errors.
Mathematically:

$$Q =\sum_{i=1}^{n}(y_{predicted}-y_{original} )^2$$


Our goal is to minimize the error function ‘Q’. To get there, we
differentiate Q w.r.t. ‘m’ and ‘c’ and equate the derivatives to zero. After a few
mathematical derivations, ‘m’ will be
$$m = \frac{cov(x,y)}{var(x)}$$
And ‘c’ will be
$$c = \bar{y} - m\bar{x}$$
By plugging the above values into the linear equation, we get the best-
fit line.
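A minimal numpy sketch of these closed-form estimates (the x and y values are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

m = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # m = cov(x, y) / var(x)
c = y.mean() - m * x.mean()                     # c = y-bar - m * x-bar
print(m, c)                                     # slope near 2, intercept near 0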

GRADIENT DESCENT :
Gradient descent is an optimization technique used to tune the
coefficient and bias of a linear equation.
Imagine you are on the top left of a u-shaped cliff and moving
blind-folded towards the bottom center. You take small steps in the
direction of the steepest slope. This is what gradient descent does — it
is the derivative or the tangential line to a function that attempts to
find local minima of a function.
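A small sketch of gradient descent tuning m and c for a simple linear model (the data, learning rate and iteration count are illustrative choices):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                          # data generated with m = 2, c = 1

m, c = 0.0, 0.0                            # arbitrary starting values
learning_rate = 0.01
for _ in range(5000):
    error = (m * x + c) - y
    m -= learning_rate * 2 * np.mean(error * x)   # gradient of mean squared error w.r.t. m
    c -= learning_rate * 2 * np.mean(error)       # gradient of mean squared error w.r.t. c
print(m, c)                                # approaches m = 2, c = 1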
Simple Linear Regression:
Simple linear regression is one of the simplest (hence the name) yet
powerful regression techniques. It has one input ($x$) and one output
variable ($y$) and helps us predict the output from trained samples by
fitting a straight line between those variables. For example, we can
predict the grade of a student based upon the number of hours he/she
studies using simple linear regression.
Mathematically, this is represented by the equation:
$$y = mx +c$$
where $x$ is the independent variable (input),
$y$ is the dependent variable (output),
$m$ is slope,
and $c$ is an intercept.
The above mathematical representation is called a linear equation.
Example: Consider a linear equation with two variables, 3x + 2y = 0.
The values which when substituted make the equation right, are the
solutions. For the above equation, (-2, 3) is one solution because
when we replace x with -2 and y with +3 the equation holds true and
we get 0.
$$3 * -2 + 2 * 3 = 0$$
A linear equation is always a straight line when plotted on a graph.

In simple linear regression, we assume the slope and intercept to be


coefficient and bias, respectively. These act as the parameters that
influence the position of the line to be plotted between the data.
Imagine you plotted the data points in various colors, below is the
image that shows the best-fit line drawn using linear regression.
Multiple Linear Regression:

This is similar to simple linear regression, but there is more than one
independent variable. Every value of the independent variable x is
associated with a value of the dependent variable y. As it’s a multi-
dimensional representation, the best-fit line is a plane.
Mathematically, it’s expressed by:
$$y = b_0 + b_1x_1 + b_2x_2 + b_3x_3$$
Imagine you need to predict if a student will pass or fail an exam.
We'd consider multiple inputs like the number of hours he/she spent
studying, total number of subjects and hours he/she slept for the
previous night. Since we have multiple inputs, we would use multiple
linear regression.

Multivariate Linear Regression:

As the name implies, multivariate linear regression deals with


multiple output variables. For example, if a doctor needs to assess a
patient's health using collected blood samples, the diagnosis includes
predicting more than one value, like blood pressure, sugar level and
cholesterol level.

Polynomial Regression:
While the linear regression model is able to understand patterns for a
given dataset by fitting a simple linear equation, it might not
be accurate when dealing with complex data. In those instances
we need to come up with curves which adjust to the data rather than
lines. One approach is to use a polynomial model. Here, the
degree of the equation we derive from the model is greater than one.
Mathematically, a polynomial model is expressed by:
$$Y_{0} = b_{0} + b_{1}x^{1} + \dots + b_{n}x^{n}$$
where $Y_{0}$ is the predicted value for the polynomial model with
regression coefficients $b_{1}$ to $b_{n}$ for each degree and a bias
of $b_{0}$.
If n=1, the polynomial equation is said to be a linear equation
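A short numpy sketch of fitting such a model (the data and degree are illustrative; scikit-learn's PolynomialFeatures with LinearRegression would be a common alternative):

import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2           # data generated from a quadratic

coeffs = np.polyfit(x, y, deg=2)           # returns [b2, b1, b0], highest degree first
print(coeffs)                              # close to [3, 2, 1]
print(np.polyval(coeffs, 1.5))             # prediction at x = 1.5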

REGULARIZATION :
Using polynomial regression, we see how the curved lines
fit flexibly between the data, but sometimes even these result in false
predictions as they fail to interpret the input. For example, if your
model is a fifth-degree polynomial equation that’s trying to fit data
points derived from a quadratic equation, it will try to update all six
coefficients (five coefficients and one bias), which
leads to overfitting. Using regularization, we improve the fit so the
accuracy is better on the test dataset.
Ridge and Lasso Regression:
To avoid overfitting, we use ridge and lasso regression in the
presence of a large number of features. These are the regularization
techniques used in the regression field. They work by penalizing the
magnitude of coefficients of features along with minimizing the error
between the predicted and actual observations. Coefficients evidently
increase to fit with a complex model which might lead to overfitting,
so when penalized, it puts a check on them to avoid such scenarios.
Ridge regression/L2 regularization adds a penalty term
($\lambda{w_{i}^2}$) to the cost function which avoids overfitting,
hence our cost function is now expressed,
$$ J(w) = \frac{1}{n}(\sum_{i=1}^n (\hat{y}(i)-y(i))^2 +
\lambda{w_{i}^2})$$
When lambda = 0, we get back to overfitting, and lambda = infinity
adds too much weight and leads to underfitting. Therefore, $\lambda$
needs to be chosen carefully to avoid both of these.

In lasso regression/L1 regularization, an absolute value


($\lambda{w_{i}}$) is added rather than a squared coefficient. Lasso
stands for least absolute shrinkage and selection operator.
The cost function would then be:
$$ J(w) = \frac{1}{n}(\sum_{i=1}^n (\hat{y}(i)-y(i))^2 +
\lambda{w_{i}})$$
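A hedged scikit-learn sketch of both regularized fits (alpha plays the role of lambda here; the data is random and illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                                       # many features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can drive some coefficients to zero
print(ridge.coef_)
print(lasso.coef_)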
Summary
1. Regression is a supervised machine learning technique which is
used to predict continuous values.

2. The ultimate goal of the regression algorithm is to plot a best-fit


line or a curve between the data.

3. The three main metrics that are used for evaluating the trained
regression model are variance, bias and error. If the variance is
high, it leads to overfitting, and when the bias is high, it leads to
underfitting.

4. Based on the number of input features and output
labels, regression is classified as linear and multivariate.

5. Linear regression allows us to plot a linear equation, i.e., a
straight line. We need to tune the coefficient and bias of the
linear equation over the training data for accurate predictions.
6. The tuning of the coefficient and bias is achieved through gradient
descent or a cost function (the least squares method).
7. Polynomial regression is used when the data is non-linear. In this
model the fit is more flexible as it plots a curve between the data. The
degree of the polynomial needs to vary such that overfitting
doesn’t occur.
Decision Trees for Classification: A
Machine Learning Algorithm

Introduction: Decision Trees are a type of supervised machine
learning (that is, you explain what the input is and what the
corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be
explained by two entities, namely decision nodes and leaves. The
leaves are the decisions or the final outcomes, and the decision nodes
are where the data is split.

An example of a decision tree can be explained using the above
binary tree. Let’s say you want to predict whether a person is fit given
their information like age, eating habits, physical activity, etc. The
decision nodes here are questions like ‘What’s the age?’, ‘Does he
exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are
outcomes like either ‘fit’ or ‘unfit’. In this case it was a binary
classification problem (a yes/no type problem). There are two main
types of Decision Trees:
1. Classification trees (Yes/No types)
What we’ve seen above is an example of a classification tree, where
the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision
variable is categorical.
2. Regression trees (Continuous data types)
Here the decision or the outcome variable is continuous, e.g. a
number like 123. Now that we know what a Decision Tree is, we’ll
see how it works internally. There are many algorithms out there
which construct Decision Trees, but one of the best is called the ID3
Algorithm. ID3 stands for Iterative Dichotomiser 3. Before
discussing the ID3 algorithm, we’ll go through a few definitions.

Entropy: Entropy, also called Shannon entropy and denoted by H(S) for
a finite set S, is the measure of the amount of uncertainty or
randomness in data. Intuitively, it tells us about the predictability of a
certain event. For example, consider a coin toss whose probability of
heads is 0.5 and probability of tails is 0.5. Here the entropy is the
highest possible, since there’s no way of determining what the
outcome might be. Alternatively, consider a coin which has heads on
both sides; the outcome of such an event can be predicted perfectly since
we know beforehand that it’ll always be heads. In other words, this
event has no randomness, hence its entropy is zero. In particular, lower
values imply less uncertainty while higher values imply high uncertainty.
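The formula for entropy does not appear in the text above (it was presumably an image in the original report); the standard definition, supplied here for completeness, is:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $c$ is the number of classes and $p_i$ is the proportion of examples in S belonging to class i.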


Information Gain: Information gain, also called Kullback-Leibler
divergence and denoted by IG(S, A) for a set S, is the effective change in
entropy after deciding on a particular attribute A. It measures the
relative change in entropy with respect to the independent variables.
Here IG(S, A) is the information gained by applying
feature A, H(S) is the entropy of the entire set, while the second term
calculates the entropy after applying the feature A, where P(x) is the
probability of event x. Let’s understand this with the help of an
example. Consider a piece of data collected over the course of 14 days
where the features are Outlook, Temperature, Humidity and Wind, and the
outcome variable is whether golf was played on the day. Now, our
job is to build a predictive model which takes in the above 4
parameters and predicts whether golf will be played on the day. We’ll
build a decision tree to do that using the ID3 algorithm.
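The information gain formula referred to above is also missing from the text (likely an image in the original report); in its usual ID3 form it is:

$$IG(S, A) = H(S) - \sum_{t \in T} P(t)\,H(t)$$

where T is the set of subsets produced by splitting S on attribute A, P(t) is the proportion of examples that fall into subset t, and H(t) is the entropy of that subset.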


Naive Bayes Classifier Algorithm
 The Naïve Bayes algorithm is a supervised learning algorithm,
which is based on Bayes theorem and used for solving
classification problems.
 It is mainly used in text classification that includes a high-
dimensional training dataset.
 The Naive Bayes Classifier is one of the simplest and most
effective classification algorithms, and it helps in building
fast machine learning models that can make quick predictions.
 It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
 Some popular examples of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.

The Naïve Bayes algorithm comprises two words, Naïve and
Bayes, which can be described as:

 Naive:

It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other
features. For example, if a fruit is identified on the basis of color,
shape, and taste, then a red, spherical, and sweet fruit is
recognized as an apple. Hence each feature individually
contributes to identifying that it is an apple, without depending on
the others.

 Bayes:
It is called Bayes because it depends on the principle of Bayes
Theorem
Bayes' Theorem:
 Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
 The formula for Bayes' theorem is given as:
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given
that the probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing
the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help
of the below example:
Suppose we have a dataset of weather conditions and corresponding
target variable "Play". So using this dataset we need to decide
whether we should play or not on a particular day according to the
weather conditions. So to solve this problem, we need to follow the
below steps:
1. Convert the given dataset into frequency tables.
2. Generate a Likelihood table by finding the probabilities of the given
features.
3. Now, use Bayes' theorem to calculate the posterior probability.

Advantages of Naïve Bayes Classifier:


 Naïve Bayes is one of the fastest and easiest ML algorithms for
predicting the class of a dataset.
 It can be used for Binary as well as Multi-class Classifications.
 It performs well in Multi-class predictions as compared to the
other Algorithms.
 It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


 Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:


 It is used for Credit Scoring.
 It is used in medical data classification.
 It can be used in real-time predictions because Naïve Bayes
Classifier is an eager learner.
 It is used in Text classification such as Spam filtering and
Sentiment analysis.

Types of Naïve Bayes Model:


There are three types of Naive Bayes Model, which are given
below:

 Gaussian:

The Gaussian model assumes that features follow a normal


distribution. This means if predictors take continuous values
instead of discrete, then the model assumes that these values are
sampled from the Gaussian distribution.
 Multinomial:
The Multinomial Naïve Bayes classifier is used when the data is
multinomial distributed. It is primarily used for document
classification problems, it means a particular document belongs
to which category such as Sports, Politics, education, etc.
The classifier uses the frequency of words for the predictors.
 Bernoulli:
The Bernoulli classifier works similar to the Multinomial
classifier, but the predictor variables are the independent
Booleans variables. Such as if a particular word is present or not
in a document. This model is also famous for document
classification on tables.
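A hedged scikit-learn sketch of a Gaussian Naïve Bayes classifier (the feature matrix is random illustrative data, not the crime dataset):

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # four numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # two illustrative classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print(model.score(X_test, y_test))                 # accuracy on the held-out test set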
CODING
SCREENSHOTS
Data Properties
#plotting data

 MURDER
 ASSAULT ON WOMEN
 KIDNAPPING AND ABDUCTION
 DACOITY
 ROBBERY
 ARSON
 HURT
 PREVENTION OF ATROCITIES (POA ACT)
 PROTECTION OF CIVIL RIGHT(PCR ACT)
 CRIMES AGAINST SCs
#PLOTTING ON INDIA MAP
Based on the bar graph that I computed, larceny happened the
most during May, June, and December, whereas September,
October, and August appear to be safer.
Here, we can tell that the safest time of day, when larceny is
least likely to happen in India, is 5 am. However, people need
to be more careful from 4 to 6 pm.
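A hedged pandas sketch of the month-wise and hour-wise counts behind these observations (the file name crimes_in_india.csv and the MONTH/HOUR column names are assumptions, not the report's actual schema):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("crimes_in_india.csv")            # assumed file name

df.groupby("MONTH").size().plot(kind="bar", title="Incidents per month")
plt.show()

df.groupby("HOUR").size().plot(kind="bar", title="Incidents per hour of day")
plt.show()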
TESTING
The word "testing" in relation to Machine Learning
models primarily refers to testing the model's performance in terms
of the accuracy/precision of the model. It should be noted that
"testing" means something different for conventional software development
and for Machine Learning model development.
Poor quality in an ML model does not imply the presence of a bug.
Instead, to debug poor performance in a model, you investigate a
broader range of causes than you would in traditional programming.

Testing model performance


This is about testing the models with test data/new data sets and
comparing the model performance, in terms of parameters such as
accuracy/recall etc., to the pre-determined accuracy of the
model already built and moved into production. This is the most
trivial of the different techniques which could be used for black-box
testing.

Dual Coding:
With the dual coding technique, the idea is to build different models
based on different algorithms and compare the predictions from
each of these models for a particular input data set.
Let's say a classification model is built with different algorithms such
as random forest, SVM and a neural network. All of them demonstrate a
comparable accuracy of 90% or so, with random forest showing an
accuracy of 94%. This results in the selection of random forest.
However, for quality-control checks during testing, all of
the above models are preserved and the input is fed into all of the
models. For inputs where the majority of the models other
than random forest give a prediction which does not match that
of the model built with random forest, a bug/defect could be raised in
the defect tracking system. These bugs could later be prioritized and
dealt with by data scientists.
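A small sketch of the dual-coding idea (the models, data and disagreement rule are illustrative; in practice the production model's predictions on new inputs would be compared against the preserved alternative models):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
models = {
    "random_forest": RandomForestClassifier(random_state=0).fit(X, y),
    "svm": SVC().fit(X, y),
    "neural_net": MLPClassifier(max_iter=1000, random_state=0).fit(X, y),
}
preds = {name: model.predict(X) for name, model in models.items()}

# Flag inputs where the other models jointly disagree with the selected random forest.
others = np.vstack([preds["svm"], preds["neural_net"]])
majority_other = (others.mean(axis=0) >= 0.5).astype(int)
disagreements = np.where(majority_other != preds["random_forest"])[0]
print(len(disagreements), "inputs to review as potential defects")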

Coverage Guided Fuzzing:


Coverage guided fuzzing is a technique where the data to be fed into the
Machine Learning models is planned such that
all of the features' activations get tested. Take, for instance,
models built with neural networks, decision trees, random forest, etc.
Let's say the model is built using neural networks.
The idea is to come up with data sets (test cases) which could result in
the activation of each of the neurons present in the neural network.
This technique sounds more like white-box testing. However, it
becomes part of black-box testing through the feedback
obtained from the model, which is then used to guide further
fuzzing; hence the name, coverage guided fuzzing.

Metamorphic Testing:
In metamorphic testing, the test cases that result in success lead to
another set of test cases which could be used for further testing of
Machine Learning models.
One or more properties are identified that represent the metamorphic
relationship between input-output pairs.
Conclusion
By using Pandas, we analyzed and visualized the
open data of India crime incident reports. It turns out Pandas is
indeed a very powerful Python package in terms of extracting,
grouping, sorting, analyzing, and plotting data.

Predicting crimes before they happen is simple to understand, but it
takes a lot more than understanding the concept to make it a reality.
The aim of the project is to make crime prediction a reality and
implement such advanced technology in real life. The use of new
technologies and the implementation of such software can fundamentally
change the way police work, in a much better way. This project
outlined a framework envisaging how aspects of machine learning and
deep learning, along with computer vision, can help create a system
that is much more helpful to the police.

Based on the bar graph that we computed, robbery
happened the most during May, June, and December, whereas
September, October, and August appear to be safer.

Here, we can tell that the safest time of day, when larceny is
least likely to happen in India, is 5 am. However, people
need to be more careful from 4 to 6 pm.
Future Scope
Beyond the present scope of our project, which is the prediction of the
crime an individual criminal is likely to commit, we can also predict
the estimated time for the crime to take place as a future scope.

This report presented the techniques and methods that can
be used to predict crime and help law agencies. The scope of using
different methods for crime prediction and prevention can change the
scenario for law enforcement agencies. Using machine learning can
substantially impact the overall functionality of law enforcement
agencies.

With ML, a machine can learn the pattern of previous crimes,
understand what crime actually is, and predict future crimes
accurately without human intervention. Law enforcement
agencies can be warned and can prevent crime from occurring by
implementing more surveillance within the predicted zone. Although
the current systems have a large impact on crime prevention, this
could be the next big approach and bring about a revolutionary
change in crime rates and crime prediction.
Bibliography
1. National Crime Records Bureau:
https://ncrb.gov.in/en/crime-india

2. Crime records:
https://ncrb.gov.in/en/crime-india-year-2013

3. TensorFlow Model Repository


https://github.com/tensorflow/models

4. ChromeDriver:
https://chromedriver.chromium.org/
