A Datamining Model For Detection of Fraudulent Behaviour in Water

A
MAIN PROJECT ON
A DATAMINING MODEL FOR DETECTION OF FRAUDULENT BEHAVIOUR IN WATER
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE AWARD OF THE
DEGREE OF
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
OF
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY, HYDERABAD
SUBMITTED BY
S.SWETHA SRAVANTHI (16C51A0546)
K.MADHURI (16C51A0528)
A.SRUTHI (16C55A0504)
D.BHARATH (16C51A0514)
UNDER THE ESTEEMED GUIDANCE OF
Mr. V.V.SIVA PRASAD
Associate Professor
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SAI SPURTHI INSTITUTE OF TECHNOLOGY
(Approved by AICTE, Affiliated to JNTU, Hyderabad, Certified by ISO 9001:2008)
(ACCREDITED BY NAAC-‘B’ Grade)
B.GANGARAM-507303, JNTU-HYDERABAD, TS, 2019-2020

CERTIFICATE
This is to certify that the main project entitled “A DATAMINING MODEL FOR
DETECTION OF FRAUDULENT BEHAVIOUR IN WATER” is a bona fide work done
by S.SWETHA SRAVANTHI (16C51A0546), K.MADHURI (16C51A0524), A.SRUTHI
(16C55A0504), D.BHARATH (16C51A0514) under the guidance and supervision of
Mr. V.V.SIVA PRASAD, Associate Professor in the CSE Department at SAI SPURTHI INSTITUTE
OF TECHNOLOGY, in partial fulfillment of the Bachelor of Technology in Computer
Science and Engineering from JNTU-Hyderabad during the year 2019-2020.
Project Supervisor: Mr. V.V.SIVA PRASAD, Associate Professor
Head of the Department: Mr. N.VENKATESWARA RAO, Associate Professor
Principal: Dr. CH. VIJAYA KUMAR
External Examiner
ACKNOWLEDGEMENT
We express our sincere thanks to our supervisor Mr. V.V.SIVA PRASAD,
Associate Professor in the CSE Department, for his moral support, kind attention
and valuable guidance throughout this project work.
It is our privilege to thank Mr. N. VENKATESWARA RAO, Head of the CSE
Department, for his encouragement during the progress of this project work.
It is our privilege to thank all Project Review Committee members for allowing
us to do this project and providing us with all the facilities needed for it.
We derive great pleasure in expressing our sincere gratitude to our principal
Dr. CH. VIJAYA KUMAR for his timely suggestions, which helped us to complete
this work successfully.
We thank both the teaching and non-teaching staff members of the CSE department
for their kind cooperation and all their help in completing this project work successfully.
In all sincerity,
K.KRISHNASRI (16C51A0523)
M.SPOORTHI (16C51A0528)
S.VAMSI KRISHNA (17C55A0507)
D.MANOJ KUMAR (16C51A0515)

ABSTRACT
Data mining is a powerful tool widely used by organizations to enhance their
businesses and gain a competitive advantage over their competitors. The data
mining process helps in extracting and analysing data patterns,
information or trends from large databases. Various data mining techniques are
available to conduct this process, and they are used
in a variety of applications, one of which is the detection and prevention of
different types of fraud. Although there is existing research on data mining and
on data mining techniques that can be used to detect and identify different
types of fraud, there is little research that synthesizes the various facets of fraud
detection using these techniques. This research explores the use of two
classification techniques (SVM and KNN) to detect water customers suspected of
fraud. The SVM-based approach uses customer load profile attributes to
expose abnormal behaviour that is known to be correlated with non-technical
loss activities. The data has been collected from the historical data of the
company billing system. To deploy the model, a decision tool has been built
using the generated model. The system will help the company to predict
suspicious water customers to be inspected on site.
CONTENTS
S.NO Topic Name
1. Introduction
1.1. Data science
1.2. Machine Learning
1.3. Project Introduction
2. Installations
2.1. Anaconda
2.2. Integrated Development Environment (IDE)
3. Python Libraries
3.1. NumPy
3.2. Pandas
3.3. Matplotlib
3.4. Scikit-learn
4. System Specifications
4.1. Hardware Requirements
4.2. Software Requirements
5. Tools & Technologies
5.1. Spyder
5.2. Python
5.3. Linear Regression
5.4. Support Vector Machine (SVM)
5.5. K-Nearest Neighbour (KNN)
6. Data Flow Diagram
7. Sample Code
7.1. Sample Code
7.2. Using SVM
7.3. Using KNN
8. Screenshots
8.1. Code
8.2. Datasets
8.3. Outputs
9. Conclusion
10. References
SCREENSHOTS
S.NO Figure Name
1. 8.1. Code
2. 8.2. Datasets
3. 8.3. Outputs
1. INTRODUCTION
1.1. Data science
Data science is the process of deriving knowledge and
insights from a huge and diverse set of data by
organizing, processing and analyzing it. It involves
many different disciplines, such as mathematical and
statistical modeling, extracting data from its source and
applying data visualization techniques. Often it also
involves using big data technologies to gather both
structured and unstructured data.
Below are some example scenarios where data science is used.
• Recommendation system: Create models
predicting the shopper’s needs and show the
products the shopper is most likely to buy.
• Financial Risk management: The financial risk
involved in loans and credit is better analysed by
using the customer's past spending habits, past
defaults and other financial commitments. The
outcome is a smaller loss for the financial
organization through the avoidance of bad debt.
• Improvement in Health Care services: The
health care industry deals with a variety of data
which can be classified into technical data,
financial data, patient information, drug
information and legal rules. All this data needs to
be analysed to produce insights that will save cost
both for the health care provider and the care
receiver.
• Computer Vision: The advancement in
recognizing an image by a computer involves
processing large sets of image data from multiple
objects of the same category. For example, face
recognition.
Python in Data Science:
The programming requirements of data science demand a very versatile
yet flexible language in which it is simple to write the code but which can
handle highly complex mathematical processing. Python is most suited
for such requirements, as it has already established itself both as
a language for general computing and for scientific computing. Moreover, it is
being continuously upgraded in the form of new additions to its plethora of
libraries aimed at different programming requirements.
1.2. Machine learning
Machine learning is a discipline that deals with programming systems so that they
automatically learn and improve with experience. Here, learning means recognizing and
understanding the input data and taking informed decisions based on it. It is
rarely feasible to hand-code a decision for every possible input.
To solve this problem, algorithms are developed that build knowledge from specific data
and past experience by applying the principles of statistics, probability, logic,
mathematical optimization, reinforcement learning, and control theory.
For example, machine learning programs can scan and process huge databases, detecting
patterns that are beyond the scope of human perception.
Applications of Machine Learning
Machine learning algorithms are used in various applications such as:
• Vision processing
• Language processing
• Forecasting, for example of stock market trends and weather
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics
Types of machine learning algorithms
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning:
Supervised learning involves building a machine learning model from labeled
samples. The learning data comes with descriptions, labels, targets or desired outputs, and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data.
For example, if we build a system to estimate the price of a plot of land or a house based on
various features, such as size, location, and so on, we first need to create a database and label
it. We need to teach the algorithm which features correspond to which prices. Based on this
data, the algorithm will learn how to calculate the price of real estate using the values of the
input features.
Supervised learning can be further classified into two types: regression and classification.
Regression trains on and predicts a continuous-valued response, for example predicting real
estate prices.
Regression algorithms:
• Linear regression
• Polynomial regression
• Stepwise regression, etc.
Classification attempts to find the appropriate class label, for example distinguishing
positive from negative sentiment, male from female persons, benign from malignant tumors, or
secured from unsecured loans.
Classification algorithms:
• Logistic regression (a classifier despite its name)
• Decision tree algorithms
• K Nearest Neighbor algorithms
• Support Vector Machine algorithms
• Naïve Bayes algorithms, etc.

Unsupervised learning:
Unsupervised learning works without labelled data. When the learning data contains only some
indications without any description or labels, it is up to the coder or to the algorithm to find
the structure of the underlying data, to discover hidden patterns, or to determine how to
describe the data. This kind of learning data is called unlabeled data.
Unsupervised learning algorithms are powerful tools for analyzing data and for
identifying patterns and trends. They are most commonly used for clustering similar inputs
into logical groups. Unsupervised learning algorithms include:
Clustering algorithms
• K-means
• Hierarchical clustering, etc.
Dimensionality reduction algorithms
• PCA (Principal Component Analysis).
Reinforcement Learning
Here the learning data gives feedback so that the system adjusts to dynamic conditions in order
to achieve a certain objective. The system evaluates its performance based on the feedback
responses and reacts accordingly. The best known instances include self-driving cars and
the Go-playing program AlphaGo.
1.3. Project Introduction
Water is an essential resource for households, industry, and agriculture.
Fraudulent behavior in drinking water consumption is a significant problem facing water
supply companies and agencies. This behavior results in a massive loss of income and
forms the highest percentage of non-technical loss. Finding efficient methods for
detecting fraudulent activities has been an active research area in recent years.
Intelligent data mining techniques can help water supply
companies to detect these fraudulent activities and reduce such losses. This research explores the
use of two classification techniques (SVM and KNN) to detect water customers suspected of
fraud. The SVM-based approach uses customer load profile attributes to expose abnormal
behavior that is known to be correlated with non-technical loss activities. The data has been
collected from the historical data of the company billing system. The system will help the
company to predict suspicious water customers to be inspected on site.
A data science project of this kind relies on Python libraries such as:
• NumPy
• Pandas
• Scikit-learn
• Matplotlib
and on IDEs such as:
• Jupyter
• Spyder
2. INSTALLATIONS
2.1 ANACONDA:
Anaconda is a Python distribution that bundles a package manager, an environment manager,
and a collection of many open source packages. This is
advantageous because a data science project typically needs
many different packages (NumPy, scikit-learn, SciPy, pandas, to name a few), which
come preinstalled with Anaconda.
Download and Install Anaconda:
1. Go to the Anaconda website and choose a Python 3.x graphical installer (A) or a Python
2.x graphical installer (B). If you aren't sure which Python version you want to install, choose
Python 3. Do not choose both.
2. Locate your download and double-click it. The installer then starts.
When the welcome screen appears, click Next.
3. Read the license agreement and click I Agree.
4. Click Next.
5. Note your installation location and then click Next.
6. This is an important part of the installation process. The recommended approach is to not
check the box to add Anaconda to your PATH. This means you will have to use Anaconda
Navigator or the Anaconda Prompt when you wish to use Anaconda. If you want
to be able to use Anaconda from your regular command prompt, use the alternative approach and
check the box.
7. Click Next.
8. Click Next.
9. Click Finish.
Anaconda provides tools such as Jupyter and Spyder, which can be launched from Anaconda
Navigator.
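Once the installation has finished, it can be useful to confirm that the libraries used in this report are visible to the Python interpreter. The following is a minimal sketch using only the standard library; the helper name check_packages is our own and is not part of Anaconda.

```python
from importlib import util


def check_packages(names):
    """Return {module_name: True/False}, True when the module can be found for import."""
    return {name: util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names of the libraries used in this report
    # (sklearn is the import name of scikit-learn).
    report = check_packages(["numpy", "pandas", "matplotlib", "sklearn"])
    for name, found in report.items():
        print(f"{name}: {'available' if found else 'MISSING'}")
```

Any library reported as missing can normally be added through the Anaconda package manager before running the sample code.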

2.2. Integrated Development Environment (IDE):
Jupyter:
• The Jupyter Notebook is a powerful tool for interactively developing and
presenting data science projects.
• A notebook integrates code and its output into a single document that combines
visualisations, narrative text, mathematical equations, and other rich media.
• Many different programming languages can be used within Jupyter Notebooks;
this project uses Python, which is the most common use case.
Spyder:
• Spyder is an open source, cross-platform IDE developed specifically for data science.
• Spyder integrates essential data science libraries such as
IPython, SciPy, Matplotlib and NumPy.
• Spyder has features such as code completion, a text editor with syntax highlighting, and
a variable explorer whose values can be edited through a GUI.
• An online help browser allows users to search and view Python and package
documentation inside the IDE.
3. PYTHON LIBRARIES
3.1 NumPy:
• NumPy is an open source extension module for Python.
• NumPy makes it easy to work with large multidimensional arrays and matrices.
• Another advantage of NumPy is that you can apply standard mathematical operations
to an entire data set without having to write loops.
• Even though NumPy does not itself provide powerful data analysis functionality,
understanding NumPy arrays and array-oriented computing will help you use other
Python data analysis tools more effectively.
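The loop-free style described above can be illustrated with a short sketch. The consumption figures and the tariff below are made up for illustration and do not come from the project dataset.

```python
import numpy as np

# Hypothetical monthly consumption readings for four customers.
consumption = np.array([12.0, 30.5, 7.25, 18.0])

# One vectorised expression scales every element; no explicit loop is needed.
rate_per_unit = 2.0  # assumed tariff, for illustration only
bills = consumption * rate_per_unit

# Reductions such as the mean also operate on the whole array at once.
mean_consumption = consumption.mean()
print(bills, mean_consumption)
```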
3.2 Pandas:
• Pandas is a Python module that contains high-level data structures and tools designed
for fast and easy data analysis operations.
• Pandas is built on NumPy, which makes it easy to use in NumPy-centric
applications.
• It is also easy to handle missing data using Pandas, which makes it a very good tool for
data munging.
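As a brief illustration of missing-data handling, the sketch below counts, fills and drops missing values. The column names echo those used later in the sample code, but the rows are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical records; NaN marks a missing reading.
df = pd.DataFrame({
    "speed": [10.0, np.nan, 14.0],
    "users": [3, 4, 5],
})

missing_per_column = df.isna().sum()               # count missing values per column
filled = df.fillna({"speed": df["speed"].mean()})  # replace gaps with the column mean
dropped = df.dropna()                              # or discard incomplete rows
```

Whether to fill or to drop depends on how much data is missing; either choice should be made before the data is passed to a classifier.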
3.3 Matplotlib:
• Matplotlib is a Python module for visualization.
• Matplotlib allows you to quickly make line graphs, pie charts, histograms and other
professional grade figures.
• Using Matplotlib, you can customise every aspect of a figure.
• Matplotlib has interactive features like zooming and panning.
3.4 Scikit-Learn:
• Scikit-Learn is a Python package for machine learning.
• It provides a set of common machine learning algorithms to users through a consistent
interface.
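The consistent interface mentioned above means that different estimators are trained with fit() and queried with predict(). The sketch below shows this for the two classifiers used in this project, on a tiny made-up dataset rather than the project data.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Tiny made-up dataset: two well-separated groups in two dimensions.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Both estimators are trained with fit() and queried with predict().
models = {"SVM": SVC(kernel="linear"), "KNN": KNeighborsClassifier(n_neighbors=3)}
predictions = {}
for name, model in models.items():
    model.fit(X, y)
    predictions[name] = int(model.predict([[5.5, 5.5]])[0])
print(predictions)
```

Because the calls are identical, one classifier can be swapped for another without changing the surrounding code.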
4. SYSTEM SPECIFICATIONS
4.1 Hardware Requirements:
• Processor : Intel Core i5 or higher
• Processor Speed : minimum 1.1 GHz
• Hard Disk : minimum 100 GB
• Input Devices : Keyboard, Mouse
• RAM : 8 GB or higher
4.2 Software Requirements:
• Operating system : Windows 10
• Coding Language : Python
• Libraries : NumPy, Pandas, Matplotlib, Scikit-learn
• Tool : Jupyter, Spyder
• Dataset : Water.csv
5. TOOLS AND TECHNOLOGIES
Tools : Spyder / Jupyter Notebook
Programming Language : Python
Algorithms : Linear Regression, Support Vector Machine (SVM),
K-Nearest Neighbors (KNN).
5.1 Spyder:
Spyder is an open source cross-platform integrated development environment (IDE) for
scientific programming in the Python language. Spyder integrates with a number of prominent
packages in the scientific Python stack
including NumPy, SciPy, Matplotlib, pandas, IPython, SymPy and Cython, as well as other open source
software. It is released under the MIT license. Initially created and developed by Pierre Raybaut in
2009, since 2012 Spyder has been maintained and continuously improved by a team of scientific Python
developers and the community.
Spyder is extensible with first- and third-party plugins, includes support for interactive
tools for data inspection and embeds Python-specific code quality assurance and introspection
instruments, such as Pyflakes, Pylint and Rope. It is available cross-platform through Anaconda, on
Windows, on macOS through MacPorts, and on major Linux distributions such as Arch
Linux, Debian, Fedora, Gentoo Linux, openSUSE and Ubuntu.
Spyder uses Qt for its GUI, and is designed to use either of the PyQt or PySide Python
bindings. QtPy, a thin abstraction layer developed by the Spyder project and later adopted by multiple
other packages, provides the flexibility to use either backend.
Features:
• An editor with syntax highlighting, introspection and code completion
• Support for multiple IPython consoles
• The ability to explore and edit variables from a GUI
5.2 Python:
Python is an interpreted, high-level, general-purpose programming language. Created by Guido
van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its
notable use of significant whitespace. Its language constructs and object-oriented approach aim to help
programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. Python is often described as a
"batteries included" language due to its comprehensive standard library.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features like list comprehensions and a garbage collection system capable
of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is
not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.
The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020 (first
planned for 2015) after which security patches and other improvements will not be released for it. With
Python 2's end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation. A non-profit
organization, the Python Software Foundation, manages and directs resources for Python and CPython
development.
Features:
• Python is a multi-paradigm programming language.
• Object-oriented programming and structured programming are fully supported.
• Functional programming and aspect-oriented programming are also supported.
Algorithms:
• Linear Regression
• Support Vector Machine (SVM)
• K-Nearest Neighbors (KNN)
5.3 Linear Regression:

Linear regression is a linear approach to modelling the relationship
between a scalar dependent variable y and one or more independent variables denoted X.
The case of a single independent variable is called simple linear regression. In linear
regression, the relationships are modeled using linear predictor functions whose unknown
model parameters are estimated from the data. Such models are called linear models.
Linear regression was the first type of regression analysis to be studied
rigorously and to be used extensively in practical applications. This is because models which depend
linearly on their unknown parameters are easier to fit than models which are non-linearly related to their
parameters, and because the statistical properties of the resulting estimators are easier to determine.
Fig: Linear Regression
Fig: Non Linear Regression
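For the simple case of one independent variable, the model parameters can be estimated directly from the data by least squares. The following is a minimal sketch; the function name and the sample points are our own, chosen so that the points lie exactly on a known line.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and variation of x, about their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1.
slope, intercept = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

With several independent variables the same idea is applied in matrix form, which is what library implementations such as scikit-learn's LinearRegression handle.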

5.4 Support Vector Machine(SVM):
In machine learning, support-vector machines (SVMs, also support-vector
networks) are supervised learning models with associated learning algorithms that analyze
data used for classification and regression analysis. Given a set of training examples, each
marked as belonging to one or the other of two categories, an SVM training algorithm builds a
model that assigns new examples to one category or the other, making it a non-probabilistic
binary linear classifier (although methods such as Platt scaling exist to use SVM in a
probabilistic classification setting).
An SVM model is a representation of the examples as points in space,
mapped so that the examples of the separate categories are divided by a clear gap that is as
wide as possible. New examples are then mapped into that same space and predicted to
belong to a category based on the side of the gap on which they fall.
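The idea of predicting a category from the side of the gap on which a point falls can be sketched for the linear case as follows. The weight vector w and offset b below are hypothetical values chosen for illustration; in practice they are learned by the SVM training algorithm.

```python
import math


def classify(w, b, x):
    """Assign x to class +1 or -1 according to the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def margin_width(w):
    """Distance between the margin hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters, not learned from the project's data.
w, b = [3.0, 4.0], -10.0
print(classify(w, b, [2.0, 2.0]), classify(w, b, [1.0, 1.0]), margin_width(w))
```

Training an SVM amounts to choosing w and b so that this margin is as wide as possible while the training examples stay on their correct sides.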
Advantages and applications:
The support vector machine is one of the most widely used classification algorithms. Its advantages
show in applications such as the following:
• SVMs are helpful in text and hypertext categorization, as their application can significantly
reduce the need for labeled training instances in both the standard inductive and transductive
settings.
• Classification of images can also be performed using SVMs.
• Experimental results show that SVMs achieve significantly higher search accuracy than
traditional query refinement schemes after just three to four rounds of relevance feedback.
• This is also true of image segmentation systems, including those using a modified version
of SVM.
5.5 K-Nearest Neighbors (KNN):

K Nearest Neighbor (KNN) is a simple, easy to understand and versatile
algorithm, and one of the most widely used in machine learning. KNN is used in a variety of
applications such as finance, healthcare, political science, handwriting detection, image recognition and
video recognition. In credit rating, for example, financial institutions use it to predict the credit rating of
customers.
KNN is a non-parametric, lazy learning algorithm. Non-parametric means
that no assumption is made about the underlying data distribution. Lazy means that the algorithm
builds no explicit model from the training data; it stores the data, and all of it is used in the testing
phase. This makes training faster and the testing phase slower and costlier in terms of time
and memory. In the worst case, KNN has to scan all data points for each prediction, and
all training data has to be kept in memory.
KNN makes predictions using the training dataset directly. In KNN, K is the
number of nearest neighbors. The number of neighbors is the core deciding factor. Predictions are made
for a new instance (x) by searching through the entire training set for the K most similar instances (the
neighbors) and summarizing the output variable for those K instances. For regression this might be the
mean output variable, in classification this might be the mode (or most common) class value.
To determine which of the K instances in the training dataset are most similar to a new
input a distance measure is used. For real-valued input variables, the most popular distance measure is
Euclidean distance. This is calculated as the square root of the sum of the squared differences between a
new point (x) and an existing point (xi) across all input attributes j.
Euclidean Distance(x, xi) = sqrt( sum_j( (xj - xij)^2 ) )
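The prediction procedure described above, using Euclidean distance and a majority vote among the K nearest neighbours, can be written in a few lines of plain Python. This is an illustrative sketch with made-up points, not the scikit-learn implementation used in the sample code.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Square root of the sum of squared differences across all attributes."""
    return math.sqrt(sum((aj - bj) ** 2 for aj, bj in zip(a, b)))


def knn_predict(train_X, train_y, x, k=3):
    # Sort the training points by distance to x and keep the k closest.
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], x))[:k]
    # The predicted class is the most common label among those neighbours.
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Made-up training points: class 0 near the origin, class 1 further away.
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [2, 2]), knn_predict(train_X, train_y, [7, 7]))
```

Every prediction sorts the whole training set, which is the costly testing phase referred to earlier in this section.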

Other popular distance measures include:
• Hamming Distance: Calculate the distance between binary vectors.
• Manhattan Distance: Calculate the distance between real vectors using the sum of
their absolute differences. Also called City Block Distance.
• Minkowski Distance: Generalization of Euclidean and Manhattan distance.
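The three measures listed above can be sketched as follows. The function names are our own; the Minkowski distance takes an order p, with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))


def manhattan(a, b):
    """Sum of the absolute differences between the coordinates."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def minkowski(a, b, p):
    """p-th root of the sum of absolute differences raised to the power p."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))
print(manhattan([0, 0], [3, 4]))
print(minkowski([0, 0], [3, 4], 2))
```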
6. DATA FLOW DIAGRAM
INPUT -> SELECT PROCESS (LR, SVM, KNN) -> TRAIN -> PREDICT -> OUTPUT
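The stages of the diagram can be read as one small function in which the selected process determines which estimator is trained. The sketch below is one possible reading, with made-up rows standing in for water.csv; the names PROCESSES and run_pipeline are our own.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# SELECT PROCESS: map the three options in the diagram to estimator factories.
PROCESSES = {
    "LR": LinearRegression,
    "SVM": lambda: SVC(kernel="linear"),
    "KNN": lambda: KNeighborsClassifier(n_neighbors=3),
}


def run_pipeline(process, X_train, y_train, X_new):
    model = PROCESSES[process]()   # SELECT PROCESS
    model.fit(X_train, y_train)    # TRAIN
    return model.predict(X_new)    # PREDICT, returned as the OUTPUT


# INPUT: made-up feature rows and fraud labels (1 = fraud, 0 = normal).
X_train = [[1, 1, 1], [2, 1, 2], [1, 2, 1], [9, 8, 9], [8, 9, 9], [9, 9, 8]]
y_train = [0, 0, 0, 1, 1, 1]
outputs = {name: run_pipeline(name, X_train, y_train, [[9, 9, 9]]) for name in PROCESSES}
print(outputs)
```

Note that linear regression returns a continuous score rather than a class label, so its output has to be thresholded before it can be compared with the two classifiers.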
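7. SAMPLE CODE
7.1 Sample code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

def predict(file, impacts, outcome, inps):
    # Load the dataset and separate the feature columns from the target column.
    data = pd.read_csv(file)
    X = data[impacts]
    Y = data[outcome]
    # Fit an ordinary least-squares model on the whole dataset.
    linear_regressor = LinearRegression()
    linear_regressor.fit(X, Y)
    # Wrap the new observation in a one-row DataFrame with matching column names.
    nx = pd.DataFrame([inps], columns=impacts)
    pred = linear_regressor.predict(nx)
    return pred

# Read the three feature values of the customer to be scored.
speed = int(input("Enter speed:"))
time = int(input("Enter time:"))
users = int(input("Enter users:"))
print("Predicted fraud score (linear regression): ",
      predict('water.csv', ["speed", "time", "users"], "fraud", [speed, time, users]))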
7.2 Using SVM Technique:
data = pd.read_csv('water.csv')
X = data[["speed", "time", "users"]]
Y = data["fraud"]
# A linear kernel ignores gamma, so that parameter is omitted here.
clf = svm.SVC(kernel='linear')
clf.fit(X, Y)
# Score the customer entered in section 7.1.
prediction = clf.predict(pd.DataFrame([[speed, time, users]], columns=X.columns))
print("SVC: ", prediction[0])
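7.3 Using KNN Technique:
data = pd.read_csv('water.csv')
X = data[["speed", "time", "users"]]
Y = data["fraud"]
# KNeighborsClassifier uses 5 neighbours by default.
knn = KNeighborsClassifier()
knn.fit(X, Y)
# Score the customer entered in section 7.1.
X_test = pd.DataFrame([[speed, time, users]], columns=X.columns)
prediction = knn.predict(X_test)
print("KNN: ", prediction[0])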
Visual Representation:
# Plot each feature against the fraud label to inspect how the classes separate.
plt.scatter(X["speed"], Y, color='r')
plt.xlabel('Speed')
plt.ylabel('Fraud')
plt.show()
plt.scatter(X["time"], Y, color='g')
plt.xlabel('Time')
plt.ylabel('Fraud')
plt.show()
plt.scatter(X["users"], Y, color='b')
plt.xlabel('Users')
plt.ylabel('Fraud')
plt.show()
8. SCREENSHOTS
8.1 Code:
8.2 Datasets:
8.3 Outputs:
9. CONCLUSION
In this research, we applied data mining classification techniques to
detect fraudulent behaviour in water consumption. We used SVM and KNN classifiers to build
classification models for detecting suspected fraud. The models were built using the
customers’ historical metered consumption data.
Considerable effort and time went into pre-processing and formatting the data to fit
the SVM and KNN classifiers. The experiments showed that
Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) both achieved good
performance, with overall accuracy of around 70%. The model hit rate is 60%-70%, which is
apparently better
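The two measures quoted above can be computed from a list of true labels and a list of predicted labels. The sketch below assumes that hit rate means the share of customers flagged as suspicious who are in fact fraudulent; the report does not define the term, and the labels shown are made up rather than taken from the project's experiments.

```python
def accuracy_and_hit_rate(y_true, y_pred):
    """Overall accuracy, and hit rate taken here as the share of flagged customers that are fraudulent."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # True labels of the customers the model flagged as fraud (prediction 1).
    flagged = [t for t, p in zip(y_true, y_pred) if p == 1]
    accuracy = correct / len(y_true)
    hit_rate = sum(flagged) / len(flagged) if flagged else 0.0
    return accuracy, hit_rate


# Made-up labels (1 = fraud, 0 = normal); not results from the project.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
accuracy, hit_rate = accuracy_and_hit_rate(y_true, y_pred)
print(accuracy, hit_rate)
```

For such figures to be meaningful, the predictions should come from customers that were held out of training.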
10. REFERENCES
1. Approach to Detection of Tampering in Water Meters”, In Procedia Computer
Science, 2015, 60: pp 413-421.
2. Juan Ignacio, Carlos Leon, “Real Application on Nontechnical Losses Detection”, The
2011 World Congress in Computer Science, Computer Engineering, and Applied
Computing (WORLDCOMP 11), Volume: The 2011 International Conference on
Data Mining.
3. “Jordan Water Sector Facts & Figures”, Ministry of Water and Irrigation of
Jordan, Technical Report, 2015.
4. “Water Reallocation Policy”, Ministry of Water and Irrigation of Jordan,
Technical Report, 2016.
5. B. Coma-Puig, J. Carmona, R. Gavaldà, S. Alcoverro, and V. Martin, “Fraud
detection in energy consumption: a supervised approach”, In Proc. IEEE Intl. Conf.
on DSAA, 2016, pp. 120-129.
