
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY

(AUTONOMOUS)

R.V.S.Nagar, Chittoor – 517 127. (A.P)


(Approved by AICTE, New Delhi, Affiliated to JNTUA,
Anantapur)
(Accredited by NBA, New Delhi & NAAC, Bangalore)
(An ISO 9001:2000 Certified Institution)
2022-2023

INTERNSHIP REPORT
A report submitted in partial fulfilment of the requirements for the Award of
Degree of
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE)
By
N SWETHA REDDY
Regd.No.21781A3294
Under supervision of
Mr. Sarvesh Agrawal (Founder & CEO)
(Duration: 11/06/2023 to 22/07/2023)
SRI VENKATESWARA COLLEGE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
R.V.S.NAGAR, CHITTOOR – 517 127. (A.P)
(Approved by AICTE, New Delhi, Affiliated to
JNTUA, Anantapur)
(Accredited by NBA, New Delhi & NAAC, Bangalore)
(An ISO 9001:2000 Certified Institution)
2022-2023

CERTIFICATE
This is to certify that the “Internship report” submitted by N SWETHA
REDDY (Regd. No.: 21781A3294) is work done by her and submitted
during the 2022-2023 academic year, in partial fulfilment of the
requirements for the award of the Degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING
(DATA SCIENCE), at Internshala Trainings.
Mr. Radhakrishna                        Dr. M. Lavanya

Internship Coordinator                  Head of the Department

                                        (DATA SCIENCE)
CERTIFICATE
ACKNOWLEDGEMENT

 Grateful thanks to Dr. R. Venkataswamy, Chairman of Sri
Venkateswara College of Engineering & Technology (Autonomous), for
providing education in this esteemed institution. I wish to record my deep
sense of gratitude and profound thanks to our beloved Vice Chairman, Sri
R. V. Srinivas, for his valuable support throughout the course.
 I express my sincere thanks to Dr. M. MOHAN BABU, our beloved
principal, for his encouragement and suggestions during the course of study.
 With a deep sense of gratefulness, I acknowledge Dr. M. LAVANYA,
Head of the Department, Computer Science & Engineering (CSD), for
giving us inspiring guidance in undertaking this internship.
 I express my sincere thanks to the internship coordinator, Mr.
RADHAKRISHNA, for his keen interest, stimulating guidance, and constant
encouragement of our work at all stages, which brought this report to
fruition.
 I wish to convey my gratitude and sincere thanks to all members for their
support and cooperation rendered for the successful submission of this report.
 Finally, I would like to express my sincere thanks to all teaching and non-
teaching faculty members, our parents, and friends, and all those who
have supported us to complete the internship successfully.

(NAME:N SWETHA REDDY)


(ROLL.NO. 21781A3294)
ORGANISATION INFORMATION:
Internshala is an internship and online training
platform based in Gurgaon, India. Founded in 2010 by
Sarvesh Agrawal, an IIT Madras alumnus, the website
helps students find internships with organizations in
India. The platform started in 2010 as a WordPress blog
that aggregated internships across India along with
articles on education, technology, and the skill gap.
The website was launched in 2013, and Internshala
launched its online trainings in 2014.
ABOUT TRAINING:

The Data Science Training by Internshala is a 6-week
online training program in which Internshala aims to
provide a comprehensive introduction to data science.
In this training program, you will learn the basics of
Python, statistics, predictive modeling, and machine
learning. The program is delivered through video
tutorials and is packed with assignments, assessment
tests, quizzes, and practice exercises so that you get a
hands-on learning experience.
INTRODUCTION TO ORGANIZATION

ABOUT TRAINING

Module-1: Introduction to Data Science


1.1. Data Science Overview

Module-2: Python for Data Science


2.1. Introduction to Python

2.2. Understanding Operators

2.3. Variables and Data Types

2.4. Conditional Statements

2.5. Looping Constructs

2.6. Functions
2.7. Data Structure

2.8. Lists
2.9. Dictionaries
2.10. Understanding Standard Libraries in Python

2.11. Reading a CSV File in Python

2.12. Data Frames and basic operations with Data Frames


2.13. Indexing Data Frame

Module-3: Understanding the Statistics for Data Science


3.1. Introduction to Statistics

3.2. Measures of Central Tendency


3.3. Understanding the spread of data

3.4. Data Distribution

3.5. Introduction to Probability


3.6. Probabilities of Discrete and Continuous Variables

3.7. Central Limit Theorem and Normal Distribution

3.8. Introduction to Inferential Statistics


3.9. Understanding the Confidence Interval and margin of error
3.10. Hypothesis Testing
3.11. T tests

3.12. Chi Squared Tests

3.13. Understanding the concept of Correlation

Module-4: Predictive Modeling and Basics of Machine Learning


4.1. Introduction to Predictive Modeling

4.2. Understanding the types of Predictive Models

4.3. Stages of Predictive Models


4.4. Hypothesis Generation

4.5. Data Extraction

4.6. Data Exploration


4.7. Reading the data into Python

4.8. Variable Identification


4.9. Univariate Analysis for Continuous Variables

4.10. Univariate Analysis for Categorical Variables


4.11. Bivariate Analysis

4.12. Treating Missing Values

4.13. How to treat Outliers

4.14. Transforming the Variables


4.15. Basics of Model Building

4.16. Linear Regression


4.17. Logistic Regression
4.18. Decision Trees

4.19. K-means
WEEKLY OVERVIEW OF INTERNSHIP ACTIVITIES

1ST WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
11/06/2023 Sunday Data science overview
12/06/2023 Monday Introduction to python
13/06/2023 Tuesday Understanding the operators
14/06/2023 Wednesday Variables and data types
15/06/2023 Thursday Conditional statements
16/06/2023 Friday Looping statements
17/06/2023 Saturday Functions

2ND WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
18/06/2023 Sunday Data structure
19/06/2023 Monday Lists and Dictionaries
20/06/2023 Tuesday Understanding standard libraries in python
21/06/2023 Wednesday Reading a CSV file in python
22/06/2023 Thursday Data frames and basic operations with data frames
23/06/2023 Friday Indexing data frame
24/06/2023 Saturday Introduction to statistics
3RD WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
25/06/2023 Sunday Measures of central tendency
26/06/2023 Monday Understanding the spread of data
27/06/2023 Tuesday Data distribution
28/06/2023 Wednesday Introduction to probability
29/06/2023 Thursday Probabilities of discrete and continuous variable
30/06/2023 Friday Central limit theorem and Normal distribution
01/07/2023 Saturday Introduction to inferential statistics

4TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
02/07/2023 Sunday Understanding the confidence interval and margin of error
03/07/2023 Monday Hypothesis testing
04/07/2023 Tuesday T tests and Chi squared tests
05/07/2023 Wednesday Understanding the concept of correlation
06/07/2023 Thursday Introduction to predictive modelling
07/07/2023 Friday Understanding the types of predictive models
08/07/2023 Saturday Stages of predictive models
5TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
09/07/2023 Sunday Hypothesis generation
10/07/2023 Monday Data extraction and Data exploration
11/07/2023 Tuesday Reading the data into python
12/07/2023 Wednesday Variable identification
13/07/2023 Thursday Univariate analysis for continuous variables
14/07/2023 Friday Univariate analysis for categorical variables
15/07/2023 Saturday Bivariate analysis and treating missing values

6TH WEEK:
DATE DAY NAME OF THE MODULE/TOPIC COMPLETED
16/07/2023 Sunday Treating missing values and how to treat outliers
17/07/2023 Monday Transforming the variables
18/07/2023 Tuesday Basics of model building
19/07/2023 Wednesday Linear Regression
20/07/2023 Thursday Logistic Regression
21/07/2023 Friday Decision Trees and K-Means
22/07/2023 Saturday Final Project
MODULE-1: INTRODUCTION TO DATA SCIENCE
DATA SCIENCE OVERVIEW:
Data science is the study of data, just as the biological
sciences are the study of biology and the physical
sciences are the study of physical phenomena. Data is
real, data has real properties, and we need to study
those properties if we are going to work with data. Data
science involves data and science: it is a process, not an
event.
What is statistical modelling?
The statistical modelling process is a way of applying statistical
analysis to datasets in data science. The statistical model
involves a mathematical relationship between random and
non-random variables. A statistical model can provide intuitive
visualizations that aid data scientists in identifying relationships
between variables and making predictions by applying
statistical models to raw data. Examples of common data sets
for statistical analysis include census data, public health data,
and social media data.

What is meant by statistical computing?


Computational statistics, or statistical computing, is the
bond between statistics and computer science. It
means statistical methods that are enabled by using
computational methods. It is the area of computational
science (or scientific computing) specific to the
mathematical science of statistics.

Predictive modelling:
Predictive modelling is a form of artificial intelligence
that uses data mining and probability to forecast or
estimate more granular, specific outcomes. For
example, predictive modelling could help identify
customers who are likely to purchase our new One AI
software over the next 90 days.

Machine Learning:
Machine learning is a branch of artificial intelligence
(AI) in which computers learn to act and adapt to new
data without being explicitly programmed to do so. The
computer is able to act independently of human
interaction.

Forecasting:
Forecasting is a process of predicting or estimating
future events based on past and present data and most
commonly by analysis of trends. "Guessing" doesn't cut
it.

Applications of Data Science:


Data science and big data are making an undeniable
impact on businesses, changing day-to-day operations,
financial analytics, and especially interactions with
customers. It is clear that businesses can gain
enormous value from the insights data science can
provide, but sometimes it is hard to see exactly how, so
let's look at some examples. In this era of big data,
almost everyone generates masses of data every day,
often without being aware of it. This digital trace
reveals the patterns of our online lives. If you have ever
searched for or bought a product on a site like Amazon,
you'll notice that it starts making recommendations
related to your search.
MODULE-2: PYTHON FOR DATA SCIENCE
2.1. Introduction to Python:
Python is a high-level, general-purpose, and very popular
programming language. Python (the latest version being
Python 3) is used in web development, machine learning
applications, and other cutting-edge areas of the software
industry.
Below are some facts about the Python programming language:

 Python is currently the most widely used multi-purpose, high-level
programming language, and it allows programming in object-oriented
and procedural paradigms.

 Python is used by almost all tech-giant companies such as Google,
Amazon, Facebook, Instagram, Dropbox, Uber, etc.

 The biggest strength of Python is its huge collection of standard
libraries, which can be used for machine learning, GUI applications
(like Kivy, Tkinter, PyQt, etc.), and more.

2.2. Understanding operators:

1. ARITHMETIC OPERATORS:
Arithmetic operators are used to perform mathematical operations like
addition, subtraction, multiplication, and division.

2. RELATIONAL OPERATORS:
Relational operators compare values and return either True or False
according to the condition.

3. LOGICAL OPERATORS:
Logical operators perform Logical AND, Logical OR, and Logical NOT
operations.

OPERATOR   DESCRIPTION                                   SYNTAX
and        Logical AND: True if both operands are true   x and y
or         Logical OR: True if either operand is true    x or y
not        Logical NOT: True if the operand is false     not x

4. BITWISE OPERATORS:
Bitwise operators act on bits and perform bit-by-bit operations.

OPERATOR   DESCRIPTION           SYNTAX
&          Bitwise AND           x & y
|          Bitwise OR            x | y
~          Bitwise NOT           ~x
^          Bitwise XOR           x ^ y
>>         Bitwise right shift   x >> y
<<         Bitwise left shift    x << y

5. ASSIGNMENT OPERATORS:
Assignment operators are used to assign values to variables.

OPERATOR   DESCRIPTION                                                          SYNTAX
=          Assign the value of the right-side expression to the left operand    x = y + z
+=         Add AND: add the right operand to the left operand and assign        a += b (a = a + b)
-=         Subtract AND: subtract the right operand and assign                  a -= b (a = a - b)
*=         Multiply AND: multiply by the right operand and assign               a *= b (a = a * b)
//=        Divide (floor) AND: floor-divide by the right operand and assign     a //= b (a = a // b)
**=        Exponent AND: raise to the power of the right operand and assign     a **= b (a = a ** b)
&=         Perform Bitwise AND on the operands and assign                       a &= b (a = a & b)
|=         Perform Bitwise OR on the operands and assign                        a |= b (a = a | b)
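The operator families above can be tried directly in a Python shell. Below is a minimal sketch; the values 10 and 3 are only illustrative.

a, b = 10, 3

# Arithmetic operators
print(a + b, a - b, a * b, a / b, a // b, a % b, a ** b)

# Relational operators return True or False
print(a > b, a == b, a != b)

# Logical operators combine boolean expressions
print(a > 5 and b < 5, a < 5 or b < 5, not (a == b))

# Bitwise operators act on the binary representation
print(a & b, a | b, a ^ b, ~a, a << 1, a >> 1)

# Assignment operators update a variable in place
c = a
c += b   # same as c = c + b
c *= 2   # same as c = c * 2
print(c)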
2.3. Variables and Data Types
Variables in Python, as the name suggests, hold values that can vary.
In a programming language, a variable is a memory location where you
store a value; the stored value may change later according to the
program's requirements.
2.4. Conditional Statements

Conditional statements (if, else, and elif)


are fundamental programming constructs that allow
you to control the flow of your program based on
conditions that you specify. They provide a way to make
decisions in your program and execute different code
based on those decisions.
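A minimal illustrative sketch of if / elif / else; the marks value and thresholds are assumptions for the example.

marks = 72

if marks >= 75:
    grade = "Distinction"
elif marks >= 50:
    grade = "Pass"
else:
    grade = "Fail"

print(grade)   # Pass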

2.5. Looping Constructs

Two types of looping constructs exist in Python: the while
loop and the for loop. The while statement allows repeated
execution of a block of statements as long as a condition is
true.
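A short sketch showing both constructs; the sequence and the countdown value are illustrative.

# for loop: iterate over a known sequence
for number in [1, 2, 3]:
    print(number ** 2)

# while loop: repeat as long as a condition holds
count = 3
while count > 0:
    print("countdown:", count)
    count -= 1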

2.6. Functions

A function is a block of code which only runs when it is called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
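For example, a small sketch of defining and calling a function; the function name add is illustrative.

# A function takes parameters and can return a result
def add(x, y):
    """Return the sum of two numbers."""
    return x + y

result = add(2, 3)   # the function body runs only when called
print(result)        # 5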


2.7. Data Structure

Data structures are a way of organizing data so that it
can be accessed more efficiently depending upon the
situation. Data structures are the fundamentals of any
programming language, around which a program is
built. Python makes it simpler to learn these
fundamentals compared to many other programming
languages.

2.8. Lists
Lists are used to store multiple items in a single variable. Lists
are one of the 4 built-in data types in Python used to store
collections of data; the other 3 are Tuple, Set, and Dictionary,
all with different qualities and usage. Lists are created using
square brackets.
2.9. Dictionaries

Dictionaries are used to store data values in key:value pairs.
A dictionary is a collection which is ordered (as of Python 3.7),
changeable, and does not allow duplicate keys.
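A brief sketch showing both structures in use; the list and dictionary contents are illustrative.

# Lists: ordered collections created with square brackets
fruits = ["apple", "banana", "cherry"]
fruits.append("mango")         # add an item
print(fruits[0], len(fruits))  # indexing and length

# Dictionaries: key-value pairs, no duplicate keys
student = {"name": "Swetha", "dept": "CSD", "year": 2}
student["year"] = 3            # values can be changed
print(student["name"], list(student.keys()))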

2.10. Understanding Standard Libraries in Python

The Python Standard Library contains the exact syntax,
semantics, and tokens of Python. It contains built-in
modules that provide access to basic system functionality
such as I/O, along with other core modules. Many of the
standard library modules are written in the C programming
language.

2.11. Reading a CSV File in Python


There are various ways to read a CSV file, using
either the csv module or the pandas library.
 csv module: the csv module provides classes for reading and
writing tabular information in CSV file format.
 pandas library: pandas is an open-source Python library that
provides high-performance, convenient data structures and
data analysis tools and techniques for Python programming.
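A minimal sketch of both approaches, assuming a file named data.csv exists in the working directory.

import csv
import pandas as pd

# Using the csv module: each row comes back as a list of strings
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

# Using pandas: the file is loaded into a DataFrame ready for analysis
df = pd.read_csv("data.csv")
print(df.head())   # first five rows
print(df.shape)    # (rows, columns)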
2.12. Data Frames and basic operations with Data
Frames
Data frames are generic data objects used to store tabular data
(the concept comes from R and is provided in Python by the
pandas library). Data frames are among the most popular data
objects because it is more comfortable to analyse data in
tabular form. Data frames can be thought of as matrices in
which each column can be of a different data type. A data
frame is made up of three principal components: the data, the
rows, and the columns.

Operations that can be performed on a Data Frame are:

 Creating a Data Frame


 Accessing rows and columns
 Selecting the subset of the data frame
 Editing data frames
 Adding extra rows and columns to the data frame
 Add new variables to data frame based on existing ones
 Delete rows and columns in a data frame

2.13. Indexing Data Frame


Indexing in pandas means simply selecting particular rows
and columns of data from a Data Frame. Indexing could mean
selecting all the rows and some of the columns, some of the
rows and all of the columns, or some of each of the rows and
columns. Indexing can also be known as Subset Selection.
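A small pandas sketch of the data frame operations and indexing described above; the column names and values are illustrative.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({"name": ["A", "B", "C"],
                   "age": [21, 22, 20],
                   "marks": [78, 85, 90]})

# Accessing rows and columns
print(df["age"])           # a single column
print(df.loc[0])           # a row selected by label
print(df.iloc[1:3, 0:2])   # rows/columns selected by position (subset selection)

# Selecting a subset and adding a derived column
adults = df[df["age"] >= 21]
df["passed"] = df["marks"] >= 40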

MODULE-3: UNDERSTANDING THE STATISTICS FOR DATA SCIENCE

3.1. Introduction to Statistics


Statistics simply means numerical data; it is the field of
mathematics that deals with the collection, tabulation, and
interpretation of numerical data. It is a form of mathematical
analysis that uses different quantitative models to analyse a set
of experimental data or real-life studies.
Basic terminology of statistics:
Population – a collection or set of individuals, objects, or events
whose properties are to be analysed.
Sample – a subset of the population.
Types of Statistics: descriptive statistics and inferential statistics.

3.2. Measures of Central Tendency


Mean: the measure of the average of all values in a sample set.
Median: the measure of the central value of a sample set. The
data set is ordered from lowest to highest value and the exact
middle value is taken.
Mode: the value that occurs most frequently in the sample set;
the value repeated most often is the mode.
Range: a measure of how spread apart the values in a sample
or data set are.
Range = Maximum value - Minimum value
3.3. Understanding the spread of data
The spread in data is the measure of how far
the numbers in a data set are away from the
mean or the median. The spread in data can
show us how much variation there is in the
values of the data set. It is useful for
identifying if the values in the data set are
relatively close together or spread apart.
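A minimal sketch computing these measures of central tendency and spread with Python's built-in statistics module; the sample values are illustrative.

import statistics as st

data = [12, 15, 15, 18, 20, 22, 35]

print(st.mean(data))           # average of all values
print(st.median(data))         # middle value of the sorted data
print(st.mode(data))           # most frequent value (15)
print(max(data) - min(data))   # range
print(st.stdev(data))          # sample standard deviation (spread)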
3.4. Data Distribution
Data distribution is a function that specifies all
possible values for a variable and also quantifies
the relative frequency (probability of how often
they occur). Distributions are considered to be
any population that has a scattering of data. It’s
important to determine the population’s
distribution so we can apply the correct statistical
methods when analysing it.
Boxplot:
It is based on the percentiles of the data. The top and bottom of
the box are the 75th and 25th percentiles of the data, and the
extended lines, known as whiskers, cover the range of the rest
of the data.
Frequency Table:
It is a tool that distributes the data into equally spaced ranges
(segments) and tells us how many values fall in each segment.

Histogram:

It is a way of visualizing a data distribution through a frequency
table, with bins on the x-axis and the data count on the y-axis.
Density Plot:

It is related to the histogram, as it shows the data values
distributed as a continuous line; it is a smoothed version of the
histogram.
3.5. Introduction to Probability
Probability refers to the extent of occurrence of events.
When an event occurs, like throwing a ball or picking a
card from a deck, then there must be some probability
associated with that event.

3.6. Probabilities of Discrete and Continuous Variables
Discrete distribution is a probability distribution
where the random variable can only take on a
finite or countable number of values. In contrast,
continuous distribution refers to a probability
distribution where the random variable can take
on any value within a certain range or interval.
3.7. Central Limit Theorem and Normal
Distribution
The central limit theorem (CLT) states that the
distribution of sample means approximates a
normal distribution as the sample size gets
larger, regardless of the population's distribution.

3.8. Introduction to Inferential Statistics


In Inferential statistics, we make an inference from a
sample about the population. The main aim of
inferential statistics is to draw some conclusions from
the sample and generalise them for the population
data.

3.9. Understanding the Confidence Interval and margin of error

Confidence Interval = x̄ ± z * (s / √n)
The margin of error can be calculated in two
ways, depending on whether you
have parameters from a population
or statistics from a sample:
1. Margin of error (parameter) = Critical value
x Standard deviation for the population.
2. Margin of error (statistic) = Critical value
x Standard error of the sample.
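A minimal sketch of the formula above for a 95% confidence interval; the sample mean, standard deviation, and sample size are illustrative values.

import math

x_bar = 52.0    # sample mean
s = 8.0         # sample standard deviation
n = 100         # sample size
z = 1.96        # z value for 95% confidence

margin_of_error = z * (s / math.sqrt(n))
print(x_bar - margin_of_error, x_bar + margin_of_error)   # (50.432, 53.568)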

3.10. Hypothesis Testing

Hypothesis Testing is a type of statistical analysis in


which you put your assumptions about a population
parameter to the test. It is used to estimate the
relationship between 2 statistical variables.

3.11. T tests
A t test is a statistical test that is used to compare the
means of two groups. It is often used in hypothesis
testing to determine whether a process or treatment
actually has an effect on the population of interest, or
whether two groups are different from one another.
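A short sketch of an independent two-sample t test using scipy; the two groups are illustrative data.

from scipy import stats

group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 33, 29, 35, 32, 34]

# Compare the means of the two groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)   # a small p value suggests the means differ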

3.12. Chi Squared Tests


A chi-square test is a statistical test that is used
to compare observed and expected results. The
goal of this test is to identify whether a disparity
between actual and predicted data is due to
chance or to a link between the variables under
consideration. As a result, the chi-square test is
an ideal choice for aiding in our understanding
and interpretation of the connection between our
two categorical variables.

3.13. Understanding the concept of Correlation

Correlation is a statistical measure that


expresses the extent to which two variables are
linearly related (meaning they change together at
a constant rate). It’s a common tool for describing
simple relationships without making a statement
about cause and effect.
MODULE-4: PREDICTIVE MODELING
AND BASICS OF MACHINE LEARNING

4.1. Introduction to Predictive Modelling

Predictive analytics involves certain manipulations on data
from existing data sets with the goal of identifying new
trends and patterns. These trends and patterns are then
used to predict future outcomes and trends. By performing
predictive analysis, we can predict future trends and
performance. It is also called prognostic analysis; the word
prognostic means prediction.

4.2. Understanding the types of Predictive Models

Supervised learning:
Supervised learning, as the name indicates, involves the
presence of a supervisor acting as a teacher. Basically,
supervised learning is learning in which we teach or train the
machine using data that is well labelled, which means some of
the data is already tagged with the correct answer.

Unsupervised learning:
Unsupervised learning is the training of a machine using
information that is neither classified nor labelled, allowing the
algorithm to act on that information without guidance.

4.3. Stages of Predictive Models


Stages to perform predictive analysis:

Some basic steps should be performed in order to perform predictive


analysis.

1.Define Problem Statement:

Define the project outcomes, the scope of the effort, and the
objectives, and identify the datasets that are going to be used.
2.Data Collection:

Data collection involves gathering the necessary details


required for the analysis. It involves the historical or past data
from an authorized source over which predictive analysis is to
be performed.
3.Data Cleaning:
Data Cleaning is the process in which we refine our data sets.
In the process of data cleaning, we remove un-necessary and
erroneous data. It involves removing the redundant data and
duplicate data from our data sets.
4.Build Predictive Model:

In this stage of predictive analysis, we use various algorithms


to build predictive models based on the patterns observed. It
requires knowledge of python, Statistics and MATLAB and so
on.
5.Model Monitoring:

Regularly monitor your models to check performance and
ensure that you have proper results. This means seeing how
model predictions perform against actual data sets.

4.4. Hypothesis Generation


A hypothesis is a function that best describes the target in
supervised machine learning. The hypothesis that an algorithm
comes up with depends upon the data and also upon the
restrictions and bias that we have imposed on the data.

4.5. Data Extraction


In general terms, “Mining” is the process of extraction of some
valuable material from the earth e.g., coal mining, diamond
mining etc. In the context of computer science, “Data Mining”
refers to the extraction of useful information from a bulk of
data or data warehouses. One can see that the term itself is a
little bit confusing. In the case of coal or diamond mining, the
result of the extraction process is coal or diamond, but in the
case of data mining, the result of the extraction process is not
data! Instead, the result is the patterns and knowledge that we
gain at the end of the extraction process. In that sense, data
mining is also known as Knowledge Discovery or Knowledge
Extraction.

4.6.Data Exploration

Data exploration is the first step in data analysis involving the


use of data visualization tools and statistical techniques to
uncover data set characteristics and initial patterns.

During exploration, raw data is typically reviewed with a


combination of manual workflows and automated data-
exploration techniques to visually explore data sets, look for
similarities, patterns and outliers and to identify the
relationships between different variables.

Steps of Data Exploration and Preparation:

Remember, the quality of your inputs decides the quality of your
output. So, once you have your business hypothesis ready, it
makes sense to spend a lot of time and effort here. By one
estimate, data exploration, cleaning, and preparation can take
up to 70% of your total project time. Below are the steps
involved in understanding, cleaning, and preparing your data
for building your predictive model:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Finally, we will need to iterate over steps 4–7 multiple times
before we come up with our refined model.

4.7. Reading the data into python


Python provides inbuilt functions for creating, writing and
reading files. There are two types of files that can be handled
in python, normal text files and binary files(written in binary
language, 0s and 1s).
Text files:
In this type of file, each line of text is terminated with a special
character called EOL (End of Line), which is the new line character (‘\n’)
in python by default.
Binary files:
In this type of file, there is no terminator for a line, and the data is
stored after converting it into machine-understandable binary
language.
Access modes govern the type of operations possible in the opened
file. They refer to how the file will be used once it is opened. These
modes also define the location of the file handle in the file. A file
handle is like a cursor, which defines from where the data has to be
read or written in the file. Different access modes for reading a file are:
1. Read Only (‘r’):
Opens a text file for reading. The handle is positioned at the beginning
of the file. If the file does not exist, an I/O error is raised. This is also
the default mode in which a file is opened.
2. Read and Write (‘r+’):
Opens the file for reading and writing. The handle is positioned at the
beginning of the file. Raises an I/O error if the file does not exist.
3. Append and Read (‘a+’):
Opens the file for reading and writing. The file is created if it does not
exist. The handle is positioned at the end of the file, so data being
written will be inserted at the end, after the existing data.
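A minimal sketch of these access modes, assuming a text file named notes.txt ('a+' creates it if it is missing).

# Append and read: the file is created if it does not exist
with open("notes.txt", "a+") as f:
    f.write("a new line at the end\n")
    f.seek(0)                 # move the handle back to the start
    print(f.read())

# Read only (the default mode); raises an error if the file is missing
with open("notes.txt", "r") as f:
    content = f.read()
    print(content)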

4.8. Variable Identification

First, identify the Predictor (input) and Target (output) variables.
Next, identify the data type and category of the variables.
Example: suppose we want to predict whether students will play
cricket or not (refer to the data set below).
4.9. Univariate Analysis for Continuous Variables:
In the case of continuous variables, we need to understand the
central tendency and spread of the variable. These are
measured using various statistical metrics and visualization
methods.
Note:
Univariate analysis is also used to highlight missing and
outlier values. Later sections look at methods to handle missing
and outlier values.
4.10. Univariate Analysis for Categorical
Variables:
For categorical variables, we use a frequency table to understand the
distribution of each category. We can also read it as the percentage of
values under each category. It can be measured using two metrics,
Count and Count%, against each category. A bar chart can be used for
visualization.
4.11.Bivariate Analysis:
Bi-variate Analysis finds out the relationship between two variables.
Here, we look for association and disassociation between variables at a
pre-defined significance level. We can perform bi-variate analysis for
any combination of categorical and continuous variables.
Continuous & Continuous:

While doing bi-variate analysis between two continuous
variables, we should look at the scatter plot. It is a handy way
to find out the relationship between two variables, and the
pattern of the scatter plot indicates the nature of that
relationship.

To find the strength of the relationship, we use correlation.
Correlation varies between -1 and +1: -1 means a perfect
negative linear correlation, +1 a perfect positive linear
correlation, and 0 no correlation. Correlation can be derived
using the following formula:
Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))

Various tools have a function or functionality to identify the
correlation between variables. In Excel, the function CORREL()
returns the correlation between two variables, and SAS uses the
procedure PROC CORR. These functions return the Pearson
correlation value to identify the relationship between two
variables:

X: 65 72 78 65 72 70 65 68
Y: 72 69 79 69 84 75 60 73

Metric            Formula                   Value
Covariance(X,Y)   =COVAR(E6:L6,E7:L7)       18.77
Variance(X)       =VAR.P(E6:L6)             18.48
Variance(Y)       =VAR.P(E7:L7)             45.23
Correlation       =G10/SQRT(G11*G12)        0.65

In the above example, we have a good positive relationship (0.65)
between the two variables X and Y.
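The same computation can be reproduced in Python; a short sketch using NumPy on the X and Y values above.

import numpy as np

x = [65, 72, 78, 65, 72, 70, 65, 68]
y = [72, 69, 79, 69, 84, 75, 60, 73]

# Population covariance and variances (matching Excel's COVAR / VAR.P)
cov_xy = np.cov(x, y, bias=True)[0, 1]
var_x, var_y = np.var(x), np.var(y)

correlation = cov_xy / np.sqrt(var_x * var_y)
print(round(correlation, 2))               # ≈ 0.65
print(round(np.corrcoef(x, y)[0, 1], 2))   # same value from NumPy directly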
Categorical & Categorical:
To find the relationship between two categorical variables, we
can use the following methods:
• Two-way table: We can start analysing the relationship by
creating a two-way table of count and count%. The rows
represent the categories of one variable and the columns
represent the categories of the other variable.
• Stacked Column Chart: This method is a more visual form of
the two-way table.
• ANOVA: It assesses whether the averages of more than two
groups are statistically different.

4.12. Treating Missing Values

Missing values can be handled by deleting the rows or columns
having null values. If a column has more than half of its rows
null, the entire column can be dropped. Rows that have one or
more column values as null can also be dropped.
Missing completely at random:
This is a case when the probability of a missing value is the same
for all observations. For example, respondents in a data
collection process decide that they will declare their earnings
after tossing a fair coin: if a head occurs, the respondent declares
his or her earnings, and vice versa. Here each observation has an
equal chance of having a missing value.
Missing at random:
This is a case when a variable is missing at random and the
missing ratio varies for different values or levels of other input
variables. For example, when collecting data on age, females may
have a higher missing-value rate than males.
Missing that depends on unobserved predictors:
This is a case when the missing values are not random and are
related to an unobserved input variable. For example, in a
medical study, if a particular diagnostic test causes discomfort,
there is a higher chance of dropping out of the study. This
missing value is not at random unless we have included
“discomfort” as an input variable for all patients.
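A small pandas sketch of the deletion and imputation options described above; the data frame is illustrative.

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 30, 28],
                   "income": [40000, 52000, np.nan, 61000]})

print(df.isnull().sum())   # count missing values per column

df_dropped = df.dropna()   # delete rows with any null value
df_filled = df.fillna({"age": df["age"].median(),
                       "income": df["income"].mean()})   # impute instead of deleting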
4.13. How to treat Outliers:
An Outlier is an observation in a given dataset that lies
far from the rest of the observations. That means an
outlier is vastly larger or smaller than the remaining
values in the set.
Outliers are treated by either deleting them or
replacing the outlier values with a logical value as per
business and similar data.
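A short sketch of the IQR rule, one common way to detect and treat outliers; the series values are illustrative, and the 1.5 × IQR threshold is a convention rather than the only choice.

import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 95])   # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: delete the outliers
trimmed = s[(s >= lower) & (s <= upper)]

# Option 2: cap (replace) them with the boundary values
capped = s.clip(lower, upper)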
4.14. Transforming the variables:

A variable transformation defines a function that is applied to
the values of a variable. In other words, for every object, the
transformation is applied to the value of the variable for that
object (for example, taking the logarithm or square root of a
skewed variable).
4.15. Basics of Model Building:

Steps involved in basic model building:

1. Loading the dataset

2. Understanding the dataset

3. Data preprocessing

4. Data visualization

5. Building a regression model

6. Model evaluation

7. Model prediction

4.16. Linear Regression

Linear regression is an approach for predicting


a response using a single feature. It is one of the most
basic machine learning models that a machine learning
enthusiast gets to know about. In linear regression, we
assume that the two variables i.e. dependent and independent
variables are linearly related.
It is a machine learning algorithm based on supervised
regression algorithm . Regression models a target prediction
value based on independent variables. It is mostly used for
finding out the relationship between variables and forecasting.
Different regression models differ based on the kind of
relationship they assume between the dependent and
independent variables, and on the number of independent
variables used.
EXAMPLE:
X 0 1 2 3 4 5 6 7 8
Y 1 3 2 5 7 8 8 9 10

For generality, we define:


x as feature vector, x = [x_1, x_2, …., x_n],
y as response vector, y = [y_1, y_2, …., y_n]
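A minimal sketch fitting the example data above with scikit-learn (one possible library choice; the same fit can be done with plain NumPy).

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)   # feature vector
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10])                 # response vector

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # fitted line y = b0 + b1 * x
print(model.predict([[9]]))               # predicted response for x = 9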
4.17. Logistic Regression:

Logistic regression aims to solve classification


problems. It does this by predicting categorical
outcomes, unlike linear regression that predicts a
continuous outcome.
Logistic regression is basically a supervised
classification algorithm . In a classification problem, the
target variable(or output), y, can take only discrete
values for a given set of features(or inputs), X.

EXAMPLE:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Assumes a classifier has already been trained on two scaled features
# (e.g. Age and Estimated Salary) and that xtest / ytest are the test split.
X_set, y_set = xtest, ytest

# Build a fine grid that covers the whole feature space
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1,
                               stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1,
                               stop=X_set[:, 1].max() + 1, step=0.01))

# Colour every grid point by the class the classifier predicts (decision regions)
plt.contourf(X1, X2,
             classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

# Overlay the actual test points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)

plt.title('Classifier (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
OUTPUT:

4.18. Decision Trees:


Decision Trees (DTs) are a non-parametric supervised
learning method used for classification and regression. The
goal is to create a model that predicts the value of a target
variable by learning simple decision rules inferred from the
data features.

Decision Tree Algorithm


In a decision tree, which resembles a flowchart, an inner
node represents a variable (or a feature) of the dataset,
a tree branch indicates a decision rule, and every leaf
node indicates the outcome of the specific decision. The
first node from the top of a decision tree diagram is the
root node. We can split up the data based on attribute
values that correspond to the independent
characteristics. The recursive partitioning method is used
to divide the tree into distinct elements. This
comprehensive, flowchart-like structure aids decision
making: it offers a diagrammatic model that mirrors how
individuals reason and choose, which is why decision
trees are easy to understand and interpret.
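A short scikit-learn sketch; the iris dataset and the max_depth value are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(max_depth=3)   # simple decision rules, limited depth
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))            # accuracy on unseen data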
4.19. K-Means:
K-means is an unsupervised learning method for
clustering data points. The algorithm iteratively divides
data points into K clusters by minimizing the variance in
each cluster.

Figure 1: representation of the data of different items.
Figure 2: the items grouped together.
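A minimal scikit-learn sketch of K-means on a few illustrative 2-D points.

import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points that visually form two groups
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # the two cluster centres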
FINAL PROJECT
Problem Statement:
Your client is a retail banking institution. Term deposits are a
major source of income for a bank.
A term deposit is a cash investment held at a financial
institution. Your money is invested for an agreed rate of
interest over a fixed amount of time, or term.
The bank has various outreach plans to sell term deposits to
their customers such as email marketing, advertisements,
telephonic marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most
effective ways to reach out to people. However, they require
huge investment as large call centers are hired to actually
execute these campaigns. Hence, it is crucial to identify the
customers most likely to convert beforehand so that they can
be specifically targeted via call.
You are provided with the client data such as: age of the client,
their job type, their marital status, etc. Along with the client
data, you are also provided with the information of the call
such as the duration of the call, day and month of the call, etc.
Given this information, your task is to predict whether the client
will subscribe to a term deposit.
DATA:

You are provided with following files:


1. train.csv:

Use this dataset to train the model. This file contains all the
client and call details as well as the target variable
“subscribed”. You have to train your model using this file.
2. test.csv:

Use the trained model to predict whether a new set of clients
will subscribe to the term deposit.
DATA DICTIONARY

Here is the description of all the variables.


Variable              Definition
ID                    Unique client ID
Age                   Age of the client
Job                   Type of job
Marital               Marital status of the client
Education             Education level
Default               Credit in default
Housing               Housing loan
Loan                  Personal loan
Contact               Type of communication
Month                 Contact month
Day of week           Day of week of contact
Duration              Contact duration
Campaign              Number of contacts performed during this campaign for this client
P days                Number of days that passed since the client was last contacted
Previous              Number of contacts performed before this campaign
Outcome               Outcome of the previous marketing campaign
Subscribed (target)   Has the client subscribed to a term deposit

SOLUTION:
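The submitted solution files are not reproduced here. The following is only a minimal sketch of one possible approach, assuming train.csv and test.csv contain the columns listed in the data dictionary and that the target "subscribed" is recorded as yes/no.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Encode categorical columns and separate the target
X = pd.get_dummies(train.drop(columns=["ID", "subscribed"]))
y = train["subscribed"].map({"yes": 1, "no": 0})           # assumes yes/no labels
X_test = pd.get_dummies(test.drop(columns=["ID"])).reindex(columns=X.columns,
                                                           fill_value=0)

# Hold out part of the training data to check the model
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("validation accuracy:", model.score(X_val, y_val))

# Predict for the unseen clients and save a submission file
test["subscribed"] = pd.Series(model.predict(X_test)).map({1: "yes", 0: "no"})
test[["ID", "subscribed"]].to_csv("submission.csv", index=False)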
CONCLUSION
In conclusion, I can say that the internship was a great
experience. Thanks to this project, I acquired deeper
knowledge concerning my technical skills and am now
able to build and assess data-based models.
A few factors that point to the future of data science
are:
• Companies' inability to handle data:
Data is regularly collected by businesses and companies
through transactions and website interactions. Many
companies face the common challenge of analysing and
categorizing the data that is collected and stored.
Companies can progress a lot with proper and efficient
handling of data, which results in greater productivity.
• Data science is constantly evolving:
Data science is a broad career path that is undergoing
development and thus promises abundant opportunities
in the future.
TRAINING CERTIFICATE
