“CRIMES IN INDIA”
Submitted in partial fulfillment of the requirements for the award of the Degree
Submitted by
M.HAVILAH K1801526
G.TEJASWI K1801528
SK.RAZIA K1801530
A.HARIKA K1801531
M.RACHANA K1801532
EXTERNAL EXAMINER
VIJAYAWADA-1.
Date:
DECLARATION
I hereby declare that this project work titled “CRIMES IN
INDIA”, submitted to the Department of Computer Applications in
partial fulfillment of the requirements for the award of the degree of BSC-DATA
SCIENCE [DS], KBN COLLEGE (Autonomous), Vijayawada, is done
by me and has not been submitted to any other institution or published
elsewhere.
Place: Vijayawada
Date:
Assault on women:
Violence against women and girls is a major health and human
rights issue. At least one in five of the world’s female population has
been physically or sexually abused by a man or men at some time in
their life. Many, including pregnant women and young girls, are
subject to severe, sustained or repeated attacks. Worldwide, it has
been estimated that violence against women is as serious a cause of
death and incapacity among women of reproductive age as cancer,
and a greater cause of ill-health than traffic accidents and malaria
combined. The abuse of women occurs in almost
every society of the world.
Victims of Rape :
On average, there are 463,634 victims of rape and sexual assault
each year.
Ages 12-34 are the highest risk years for rape and sexual assault.
Those aged 65 and older are 92% less likely than 12-24 year olds
to be victims of rape or sexual assault, and 83% less likely than
25-49 year olds.
Dacoity:
When five or more persons conjointly commit or attempt to
commit a robbery, or where the whole number of persons
conjointly committing or attempting to commit a robbery, and
persons present and aiding such commission or attempt, amount to
five or more, every person so committing, attempting or aiding is
said to commit ‘dacoity’.
This concept of crime has been defined from the social and
legal standpoint. What constitutes criminal behavior depends
upon the legal codes of a particular society.
Data science process
The typical data science process consists of six steps
Step 1: Defining research goals and creating a project
charter
A project starts by understanding the what, the why, and the how of
your project
The outcome should be a clear research goal, a good understanding of
the context, well-defined deliverables, and a plan of action with a
timetable. This information is then best placed in a project charter.
Step 2: Retrieving data
Many companies will have already collected and stored the data for
you, and what they don’t have can often be bought from third parties.
This data can be stored in official data repositories such as
1. databases,
2. data marts,
3. data warehouses, and
4. data lakes
maintained by a team of IT professionals.
A database is designed for storing data, while a data warehouse is
designed for reading and analyzing that data. A data mart is a subset
of the data warehouse, and a data lake contains data in its natural or
raw format.
If data isn’t available inside your organization, look outside your
organization’s walls.
Step 3: Cleansing, integrating, and transforming data
The data received from the data retrieval phase is likely to be “a
diamond in the rough”.
Data cleansing is a subprocess of the data science process that
focuses on removing errors in your data so your data becomes a true
and consistent representation of the processes it originates from.
Types of errors
1. Interpretation error
2. Inconsistencies
Try to fix the problem early in the data acquisition chain or else fix it
in the program.
Error description: Possible solution
Errors pointing to false values within one data set
  Mistakes during data entry: Manual overrules
  Redundant white space: Use string functions
  Impossible values: Manual overrules
  Missing values: Remove observation or value
  Outliers: Validate and, if erroneous, treat as missing value (remove or insert)
Errors pointing to inconsistencies between data sets
  Deviations from a code book: Match on keys or else use manual overrules
  Different units of measurement: Recalculate
  Different levels of aggregation: Bring to same level of measurement by aggregation or extrapolation
Data Entry Errors
Data collection and data entry are error-prone processes. They often
require human intervention, and because humans make typos or lose
their concentration for a second, errors are introduced into the chain.
But data collected by machines or computers isn’t free from errors
either. Examples of errors originating from machines are transmission
errors or bugs in the extract, transform, and load phase (ETL)
Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other
redundant characters would.
The cleaning during the ETL phase wasn’t well executed, and keys in
one table contained a whitespace at the end of a string. This caused a
mismatch of keys such as “FR ” – “FR”, dropping the observations
that couldn’t be matched.
Fixing redundant whitespaces is luckily easy enough in most
programming languages. They all provide string functions that will
remove the leading and trailing whitespaces.
For instance, in Python you can use the strip() function to remove
leading and trailing spaces
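For example, a minimal sketch of removing leading and trailing whitespace from a key column with pandas (the column name "country_code" below is a hypothetical example):

import pandas as pd

df = pd.DataFrame({"country_code": ["FR ", " DE", "BR"]})
# str.strip() removes leading and trailing whitespace from every value
df["country_code"] = df["country_code"].str.strip()
print(df["country_code"].tolist())   # ['FR', 'DE', 'BR']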
FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are common. Most programming languages
make a distinction between “Brazil” and “brazil”. In this case you can
solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python. "Brazil".lower() ==
"brazil".lower() should result in true.
Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here you
check the value against physically or theoretically impossible values
such as people taller than 3 meters or someone with an age of 299
years. Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
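Applied to a whole column, the same rule flags impossible values; a small sketch with pandas (the column name "age" is an assumption):

import pandas as pd

df = pd.DataFrame({"age": [25, 299, 41, -3]})
# keep only physically possible ages; rows that fail the check are flagged
valid = df["age"].between(0, 120)
print(df[~valid])   # shows the rows with impossible ages (299 and -3)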
Outliers
An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a
different logic or generative process than the other observations. The
easiest way to find outliers is to use a plot or a table with the
minimum and maximum values
The plot on the top shows no outliers, whereas the plot on the bottom
shows possible outliers on the upper side when a normal distribution
is expected. The normal distribution, or Gaussian distribution, is the
most common distribution in natural sciences. It shows most cases
occurring around the average of the distribution and the occurrences
decrease when further away from it. The high values in the bottom
graph can point to outliers when assuming a normal distribution.
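A quick sketch of this minimum/maximum check with pandas (the values are invented; a value far outside the rest of the data shows up immediately):

import pandas as pd

s = pd.Series([10, 12, 11, 9, 13, 250])   # 250 is a likely outlier
print(s.min(), s.max())                    # 9 250
print(s.describe())                        # summary statistics reveal the extreme maximum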
Deviations from a code book
Detecting errors in larger data sets against a code book or against
standardized values can be done with the help of set operations. A
code book is a description of your data, a form of metadata. It
contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable
means. (For instance “0” equals “negative”, “5” stands for “very
positive”.) A code book also tells the type of data you’re looking at: is
it hierarchical, graph, something else
If you have multiple values to check, it’s better to put them from the
code book into a table and use a difference operator to check the
discrepancy between both tables.
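A small sketch using Python set operations to find values that do not appear in the code book (both code lists below are hypothetical):

codebook_values = {"0", "1", "2", "3", "4", "5"}
observed_values = {"0", "3", "5", "9"}
# the difference operator returns codes present in the data but missing from the code book
unknown_codes = observed_values - codebook_values
print(unknown_codes)   # {'9'}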
Combining data from different data sources
Your data comes from several different places, and in this substep we
focus on integrating these different sources. Data varies in size, type,
and structure, ranging from databases and Excel files to text
documents.
The different ways of combining data
You can perform two operations to combine information from
different data sets.
The first operation is joining: enriching an observation from one table
with information from another table.
The second operation is appending or stacking: adding the
observations of one table to those of another table.
Joining tables
Joining tables allows you to combine the information of one
observation found in one table with the information that you find in
another table.
Appending tables
Appending or stacking tables is effectively adding
observations from one table to another table.
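As a sketch, both operations map directly onto pandas (the table and column names below are hypothetical):

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2], "region": ["FR", "DE"]})
orders = pd.DataFrame({"client_id": [1, 2], "amount": [100, 250]})
# joining: enrich each order with the client's region from the other table
joined = orders.merge(clients, on="client_id", how="left")
# appending / stacking: add the rows of one table to another with the same columns
more_orders = pd.DataFrame({"client_id": [3], "amount": [80]})
stacked = pd.concat([orders, more_orders], ignore_index=True)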
Using views to simulate data joins and
appends
To avoid duplication of data, you virtually combine data with
views.
For two observations with coordinates $(x_1, y_1)$ and $(x_2, y_2)$, the Euclidean distance is
$$distance = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
If you want to expand this distance calculation to more dimensions,
add the coordinates of the point within
those higher dimensions to the formula. For three dimensions we get
$$distance = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
Data scientists use special methods to reduce the number of
variables but retain the maximum amount of data. The below figure
shows how reducing the number of variables makes it easier to
understand the key values. It also shows how two variables account
for 50.6% of the variation within the data set
( component 1 = 27.8% + component 2 = 22.8%)
These variables, called “component1” and “component2,” are both
combinations of the original variables.
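As a hedged sketch, a principal component analysis like the one described above can be computed with scikit-learn (the data here is random and only for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)          # 100 observations, 5 original variables
pca = PCA(n_components=2)           # keep the two strongest components
components = pca.fit_transform(X)
# proportion of the variation in the data each component accounts for
print(pca.explained_variance_ratio_)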
Turning variables into dummies
Dummy variables can only take two values: true (1) or false (0).
They’re used to indicate the absence or presence of a categorical effect that may
explain the observation. In this case you’ll make separate columns for
the classes stored in one variable and indicate it with 1 if the class is
present and 0 otherwise.
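A minimal sketch with pandas (the "weekday" column is a hypothetical categorical variable):

import pandas as pd

df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon"]})
# one column per class, with 1 if the class is present for that row and 0 otherwise
dummies = pd.get_dummies(df["weekday"])
print(dummies)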
Step 4: Exploratory data analysis
The visualization techniques you use in this phase range from simple
line graphs or histograms, as shown in figure, to more complex
diagrams such as Sankey and network graphs.
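For instance, a simple histogram of a numeric column can be drawn with pandas and Matplotlib (the column name and values are illustrative):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [22, 25, 31, 35, 41, 43, 52, 60]})
df["age"].hist(bins=5)              # quick look at the distribution of the variable
plt.xlabel("age")
plt.ylabel("frequency")
plt.show()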
a. Hardware requirements:
Processor: Intel i5
HDD: 500 GB
RAM: 4 GB
b. Software requirements:
Operating System: Windows 7
Front End Applications: Google Colaboratory
Back End Applications: Python 3.7
Others: Firefox, Internet Explorer
SOFTWARE TOOLS USED:
VARIANCE:
Variance measures how much the model’s predictions change when it is trained on different samples of the data; a high variance leads to overfitting.
BIAS:
Bias measures how far the model’s average prediction is from the true values; a high bias leads to underfitting.
Linear Regression:
Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
GRADIENT DESCENT :
Gradient descent is an optimization technique used to tune the
coefficient and bias of a linear equation.
Imagine you are at the top left of a U-shaped cliff and moving
blindfolded towards the bottom center. You take small steps in the
direction of the steepest slope. This is what gradient descent does: it
uses the derivative (the slope of the tangent line to the function) to
move step by step towards a local minimum of the function.
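A minimal sketch of gradient descent for a linear equation (the data and learning rate are illustrative, not taken from the project):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # true relation: y = 2x + 1
m, c = 0.0, 0.0                              # coefficient and bias start at zero
lr = 0.01                                    # learning rate (step size)
for _ in range(2000):
    y_pred = m * x + c
    # gradients of the mean squared error with respect to m and c
    dm = (-2 / len(x)) * np.sum(x * (y - y_pred))
    dc = (-2 / len(x)) * np.sum(y - y_pred)
    m -= lr * dm                             # step in the direction of the steepest slope
    c -= lr * dc
print(m, c)                                  # approaches 2 and 1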
Simple Linear Regression:
Simple linear regression is one of the simplest (hence the name) yet
powerful regression techniques. It has one input ($x$) and one output
variable ($y$) and helps us predict the output from trained samples by
fitting a straight line between those variables. For example, we can
predict the grade of a student based upon the number of hours he/she
studies using simple linear regression.
Mathematically, this is represented by the equation:
$$y = mx +c$$
where $x$ is the independent variable (input),
$y$ is the dependent variable (output),
$m$ is slope,
and $c$ is an intercept.
The above mathematical representation is called a linear equation.
Example: Consider a linear equation with two variables, 3x + 2y = 0.
The values which, when substituted, satisfy the equation are its
solutions. For the above equation, (-2, 3) is one solution because
when we replace x with -2 and y with +3 the equation holds true and
we get 0.
$$3 * -2 + 2 * 3 = 0$$
A linear equation is always a straight line when plotted on a graph.
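A short sketch of simple linear regression with scikit-learn, using the hours-studied example from above (the numbers are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5]])      # input x: hours studied
grades = np.array([52, 58, 65, 71, 78])          # output y: grade obtained
model = LinearRegression().fit(hours, grades)
print(model.coef_[0], model.intercept_)          # slope m and intercept c
print(model.predict([[6]]))                      # predicted grade for 6 hours of study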
Multiple Linear Regression:
This is similar to simple linear regression, but there is more than one
independent variable. Every value of the independent variable x is
associated with a value of the dependent variable y. As it’s a multi-
dimensional representation, the best-fit line is a plane.
Mathematically, it’s expressed by:
$$y = b_0 + b_1x_1 + b_2x_2 + b_3x_3$$
Imagine you need to predict if a student will pass or fail an exam.
We’d consider multiple inputs like the number of hours he/she spent
studying, the total number of subjects, and the hours he/she slept the
previous night. Since we have multiple inputs, we would use multiple
linear regression.
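Continuing the pass/fail example as a sketch (the feature values are invented), multiple inputs simply become extra columns:

import numpy as np
from sklearn.linear_model import LinearRegression

# columns: hours studied, number of subjects, hours slept
X = np.array([[2, 5, 8], [6, 5, 7], [1, 6, 5], [8, 4, 8]])
y = np.array([45, 72, 38, 90])                   # exam scores
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)             # b0 and b1..b3 from the equation above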
Polynomial Regression:
While the linear regression model is able to understand patterns for a
given dataset by fitting a simple linear equation, it might not be
accurate when dealing with complex data. In those instances
we need curves that adjust to the data rather than
straight lines. One approach is to use a polynomial model. Here, the
degree of the equation we derive from the model is greater than one.
Mathematically, a polynomial model is expressed by:
$$Y_{0} = b_{0} + b_{1}x^{1} + \dots + b_{n}x^{n}$$
where $Y_{0}$ is the predicted value for the polynomial model with
regression coefficients $b_{1}$ to $b_{n}$ for each degree and a bias
of $b_{0}$.
If n=1, the polynomial equation is said to be a linear equation
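A brief sketch of a second-degree polynomial model with NumPy (the data is illustrative):

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = 2 * x**2 - 3 * x + 1                          # data generated from a quadratic equation
coeffs = np.polyfit(x, y, deg=2)                  # fit a degree-2 polynomial
print(coeffs)                                     # approximately [2, -3, 1]
print(np.polyval(coeffs, 6))                      # predicted value for x = 6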
REGULARIZATION :
Using polynomial regression, we see how the curved lines
fit flexibly between the data, but sometimes even these result in false
predictions as they fail to interpret the input. For example, if your
model is a fifth-degree polynomial equation that’s trying to fit data
points derived from a quadratic equation, it will try to update all six
coefficients (five coefficients and one bias), which leads to
overfitting. Regularization counters this by penalizing large
coefficients, keeping the model from fitting the noise in the data.
The three main metrics that are used for evaluating the trained
regression model are variance, bias and error. If the variance is
high, it leads to overfitting, and when the bias is high, it leads to
underfitting.
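As a sketch, ridge regression is one common form of regularization; it penalizes large coefficients so an over-flexible model does not chase noise (the data below is illustrative):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, 1, 20).reshape(-1, 1)
y = 3 * x.ravel() ** 2 + np.random.normal(0, 0.05, 20)    # quadratic data with noise
X_poly = PolynomialFeatures(degree=5).fit_transform(x)     # deliberately over-flexible model
model = Ridge(alpha=1.0).fit(X_poly, y)                    # alpha controls the penalty strength
print(model.coef_)                                          # large coefficients are shrunk toward zero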
A decision tree is built through supervised learning (that is, you
explain what the input is and what the corresponding output should
be) and can be viewed as a binary tree. Let’s say you want to predict whether a person is fit given
their information like age, eating habits, and physical activity. The
decision nodes here are questions like ‘What’s the age?’, ‘Does he
exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are
outcomes like either ‘fit’ or ‘unfit’. In this case this was a binary
classification problem. There are two main types of Decision Trees:
1. Classification trees (Yes/No types)
What we’ve seen above is an example of a classification tree, where
the outcome was a variable like ‘fit’ or ‘unfit’. Here the decision
variable is Categorical.
2. Regression trees (Continuous data types)
Here the decision or the outcome variable is Continuous, e.g. a
number like 123. Now that we know what a Decision Tree is, we’ll
see how it works internally. There are many algorithms out there
which construct Decision Trees, but one of the best is called the ID3
Algorithm. ID3 builds the tree using Entropy and Information Gain,
where Entropy is a measure of the randomness in data.
For instance, consider a biased coin that always lands heads: you know
beforehand that it’ll always be heads. In other words, this event has
no randomness, so its Entropy is zero. Information Gain measures how
much the Entropy drops after splitting the data on a feature A:
$$IG(S, A) = H(S) - \sum_{x} P(x) \cdot H(x)$$
Here H(S) is the Entropy of the entire set, while the second term
calculates the Entropy after applying the feature A, where P(x) is the
probability of event x. As an example, consider a data set
where the features are Outlook, Temperature, Humidity, Wind and the
outcome variable is whether Golf was played on the day. Now, our
job is to build a tree that uses these weather
parameters and predicts whether Golf will be played on the day.
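A small sketch of a decision tree classifier on the fit/unfit idea (the features and labels are made up; scikit-learn's implementation uses CART rather than ID3, but it can also split on entropy):

from sklearn.tree import DecisionTreeClassifier

# columns: age, exercises (1/0), pizzas per week
X = [[25, 1, 1], [40, 0, 6], [31, 1, 2], [55, 0, 7]]
y = ["fit", "unfit", "fit", "unfit"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)   # entropy-based splits, as in ID3
print(tree.predict([[35, 1, 3]]))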
Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features.
Bayes: It is called Bayes because it depends on the principle of Bayes’
Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law,
which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given
that the probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing
the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help
of the below example:
Suppose we have a dataset of weather conditions and corresponding
target variable "Play". So using this dataset we need to decide
whether we should play or not on a particular day according to the
weather conditions. So to solve this problem, we need to follow the
below steps:
Convert the given dataset into frequency tables.
Generate a Likelihood table by finding the probabilities of the given features.
Use Bayes' theorem to calculate the posterior probability for each class and pick the class with the highest probability.
Gaussian:
The Gaussian Naïve Bayes model assumes that continuous features follow a normal distribution.
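A short sketch with scikit-learn's Gaussian Naïve Bayes (the weather encoding below is hypothetical):

from sklearn.naive_bayes import GaussianNB

# columns: temperature, humidity; label: play (yes/no)
X = [[30, 85], [27, 90], [21, 70], [24, 65]]
y = ["no", "no", "yes", "yes"]
model = GaussianNB().fit(X, y)
print(model.predict([[23, 68]]))                 # predicted decision for a new day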
MURDER
ASSAULT ON WOMEN
KIDNAPPING AND ABDUCTION
DACOITY
ROBBERY
ARSON
HURT
PREVENTION OF ATROCITIES (POA ACT)
PROTECTION OF CIVIL RIGHT(PCR ACT)
CRIMES AGAINST SC’s
#PLOTTING ON INDIA MAP
Based on the bar graph that I computed, larceny happened the
most during May, June, and December, whereas September,
October, and August appear to be safer.
Here, we can tell the safest time of the day when larceny is the
least possible to happen in India is 5 am. However, people need
to be more careful from 4 to 6 pm.
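A hedged sketch of how such a bar graph could be produced with pandas (the file name and the "OFFENSE" and "MONTH" column names are assumptions about the incident-report data):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("crime_reports.csv")                      # hypothetical incident-report file
larceny = df[df["OFFENSE"] == "LARCENY"]
larceny.groupby("MONTH").size().plot(kind="bar")           # incidents per month
plt.show()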
TESTING
The usage of the word "testing" in relation to Machine Learning
models primarily refers to testing the model's performance in terms
of the accuracy/precision of the model. It can be noted that the word
"testing" means something different for conventional software
development and for Machine Learning model development.
Poor quality in an ML model does not imply the presence of a bug.
Instead, to debug poor performance in a model, you investigate a
broader range of causes than you would in traditional programming.
Dual Coding:
With the dual coding technique, the idea is to build different models
based on different algorithms and compare the predictions from
each of these models for a given input data set.
Let's say a classification model is built with different algorithms such
as random forest, SVM, and a neural network. All of them demonstrate a
comparable accuracy of 90% or so, with random forest showing an
accuracy of 94%. This results in the selection of random forest.
However, when testing the model for quality control, all of
the above models are preserved and the input is fed into all of them.
For inputs where the majority of the remaining models give a
prediction that does not match the prediction of the random forest
model, a bug/defect could be raised in the defect tracking system.
These bugs can later be prioritized and dealt with by data scientists.
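A rough sketch of the dual coding idea (the models and data are placeholders): feed the same input to every preserved model and flag cases where the chosen model disagrees with the majority of the others:

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X_train = [[0], [1], [2], [3], [10], [11], [12], [13]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
models = {
    "random_forest": RandomForestClassifier().fit(X_train, y_train),
    "svm": SVC().fit(X_train, y_train),
    "neural_net": MLPClassifier(max_iter=2000).fit(X_train, y_train),
}
sample = [[12]]
predictions = {name: m.predict(sample)[0] for name, m in models.items()}
others = [p for name, p in predictions.items() if name != "random_forest"]
# raise a potential defect when the majority of the other models disagree with random forest
disagree = sum(1 for p in others if p != predictions["random_forest"])
if disagree > len(others) / 2:
    print("Potential defect: random forest disagrees with the other models", predictions)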
Metamorphic Testing:
In metamorphic testing, the test cases that result in success lead to
another set of test cases which could be used for further testing of
Machine Learning models.
One or more properties are identified that represent the metamorphic
relationship between input-output pairs.
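As a sketch, one simple metamorphic relation for a regression model is monotonicity: in the hours-studied example used earlier, increasing the study hours should not lower the predicted grade (the model and data here are the same illustrative ones, re-created for a self-contained test):

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5]])
grades = np.array([52, 58, 65, 71, 78])
model = LinearRegression().fit(hours, grades)
# metamorphic relation: prediction for 6 hours must not be lower than for 4 hours
assert model.predict([[6]])[0] >= model.predict([[4]])[0], "metamorphic relation violated"
print("metamorphic test passed")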
Conclusion
By using Pandas, I analyzed and visualized the
open data of India Crime Incident Reports. Turns out Pandas is
indeed a very powerful Python package in terms of extracting,
grouping, sorting, analyzing, and plotting the data.
Here, we can tell the safest time of the day when larceny is
the least possible to happen in India is 5 am. However, people
need to be more careful from 4 to 6 pm.
Future Scope
The present scope of our project is the prediction of the
crime an individual criminal is likely to commit; as a future scope, we
could also predict the estimated time at which the crime is likely to take place.
2. Crimes records :
https://ncrb.gov.in/en/crime-india-year-2013
4. Chrome Driver –
https://chromedriver.chromium.org/