
GALGOTIAS UNIVERSITY

School of Computing Sciences & Engineering

Lab Manual

SUBJECT: Data Science Lab            PROGRAMME: B. Tech. CSE
SUBJECT CODE: BCSE3056               SEMESTER: IV
CREDITS: 2                           DURATION OF SEMESTER: 24 Weeks
PREREQUISITE SUBJECTS: -             SESSION DURATION: 2 Hrs per Week
1. VISION AND MISSION OF GALGOTIAS UNIVERSITY

VISION

 To be known globally for value-based education, research, creativity and innovation

MISSION

 Establish state-of-the-art facilities for world class education and research.


 Collaborate with industry and society to align the curriculum.
 Involve in societal outreach programs to identify concerns and provide sustainable
ethical solutions.
 Encourage life-long learning and team-based problem solving through an enabling
environment.

2. VISION AND MISSION OF DEPARTMENT

VISION

 To be known globally as a premier department of computer science and engineering
for value-based education, multi-disciplinary research and innovation.

MISSION

 Create a strong foundation on fundamentals of computer science and engineering
through an outcome-based teaching-learning process.
 Establish state-of-the-art facilities for analysis, design and implementation to develop
sustainable ethical solutions.
 Conduct multi-disciplinary research for developing innovative solutions.
 Involve the students in group activities, including those of professional bodies, to develop
leadership and communication skills.
3. PROGRAM EDUCATIONAL OBJECTIVES

PEO1 Graduates of Computer Science and Engineering will be globally competent and provide
sustainable solutions for interdisciplinary problems as team players.
PEO2 Graduates of Computer Science and Engineering will engage in professional activities
with ethical practices in the field of Computer Science and Engineering to enhance their
own stature and to contribute towards society.
PEO3 Graduates of Computer Science and Engineering will acquire specialized knowledge in
emerging technologies for research, innovation and product development.

4. PROGRAMME OUTCOMES
PO1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
PO2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
PO3 Design/development of solutions: Design solutions for complex engineering problems
and design system components or processes that meet the specified needs with
appropriate consideration for the public health and safety, and the cultural, societal, and
environmental considerations.
PO4 Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data,
and synthesis of the information to provide valid conclusions.
PO5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6 The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for, sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities
and norms of the engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.

Programme Specific Outcomes (PSO)


PSO1 Able to analyze, design and implement sustainable and ethical solutions in the field of
computer science.

PSO2 Able to use problem solving skills to develop efficient algorithmic solutions.

Course Objectives:
The primary objective of this course is to develop theoretical knowledge of data analysis that can be
applied to practical problems, to explain how mathematics and the information sciences contribute to
building better algorithms and software, and to develop applied experience with data science
software, programming, applications and processes.

Course Outcomes

CO1 Acquire a sound introductory knowledge of the statistical fundamentals used in data
science.

CO2 Apply algorithmic principles and programming knowledge, using the Python language,
to data science problems.

CO3 Understand the fundamentals of statistics and probability used in data science.

CO4 Establish basic knowledge of optimization techniques in data visualization.

CO5 Apply and implement ML processing principles using probability and statistics.

CO6 Gain knowledge of new research and development in the field of data science.

Text Book (s)


1 Data Science from Scratch: First Principles with Python, 1st Edition, by Joel Grus, O'Reilly
Publications, 2020.

2 James, G., Witten, D., Hastie, T., Tibshirani, R. An introduction to statistical learning with
applications in R. Springer, 2013.

3 Han, J., Kamber, M., Pei, J. Data mining concepts and techniques. Morgan Kaufmann, 2011.
Reference Book (s)
1 "Data Science for Business", F. Provost, T. Fawcett, 2013.

2 Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data, EMC Education Services, 2015.

3 Murphy, K. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Course Contents:
Unit-1: Introduction with Statistical Fundamentals 8 hours

Statistics introduction: need, advantages and disadvantages. Applications of statistics. Case study of a
statistical application. Types of statistics: descriptive and inferential statistics. Variables and types of
data. Sampling techniques. Descriptive measures: measures of central tendency, measures of
variation, measures of position.

Unit II: Python for Data Science 8 hours

Introduction to NumPy, different NumPy operations, broadcasting with NumPy. Introduction to
Pandas, reading or loading data into a dataframe, Pandas dataframe manipulations, data loading/reading
in different formats (CSV, Excel, JSON, HTML).

Unit III : Data science with Tableau and BI 8 hours

Tableau introduction: bringing in the data - connecting and visualization. Analysing the data - adding
worksheets, creating dashboards and building stories. Publishing and sharing - publishing workbooks,
sharing files with Tableau Reader.

Introduction to BI.

Unit IV : Data Visualizations & Data Cleaning 8 Hours

Introduction to data visualization, principles behind data visualization, histograms and box plots for
visualizing the distribution of continuous numerical variables (bar plots, pie charts, line charts). Data
visualization using R: line plots and regression.

Unit V : Understanding Machine learning 8 Hours

Machine Learning: types of machine learning, advantages and applications. Supervised machine
learning techniques: regression and classification. Unsupervised machine learning techniques:
clustering and different types of clustering. Reinforcement learning.
Prediction and overfitting.
Unit VI : Advanced topics in Data science 5 Hours

Research done in data science in the last 3 years. Discussion of new tools and languages for data science.
Case studies of data science.

LIST OF PRACTICALS:

1. Perform basic statistical operations using suitable tools: measures of central tendency, etc.
2. Perform sampling of data and determine the distribution.
3. Write Python code to demonstrate basic NumPy functions: broadcasting, etc.
4. Write Python code to read and display data in different formats, like CSV, XLS, web link, etc.
5. Write Python code to create dataframes in different ways.
6. Write Python code to reshape data using pandas.
7. Write Python code to perform subset operations using pandas (both row and column).
8. Write Python code to perform method chaining using pandas.
9. Write a program in Python to predict whether a loan will get approved or not.
10. Write a program in Python to predict the traffic on a new mode of transport.
11. Write a program in Python to predict the class of the user.
12. Write a program in Python to identify which tweets are hate tweets and which are not.
13. Write a program in Python to predict the age of the actors.
14. Mini project to predict the time taken to solve a problem given the current status of the user.

LAB EVALUATION SCHEME

Component of evaluation   Internal/External   Rubric for CO                               Marks

Continuous evaluation     Internal            R0                                          20
Pre-final lab test        Internal            R1                                          20
Internal viva             Internal            R3 (Theory, tools, Result interpretation)   10 (2+4+4)
ETE Lab test              External            R1                                          20
Lab Report                External            R2                                          20
External Viva             External            R3 (Theory, tools, Result interpretation)   10 (2+4+4)

Total: 100

R0 : Rubrics for continuous evaluation

S. No.  Rubrics - Parts                                        Marks
1       Able to define the problem statement                   2
2       Able to convert the problem statement into a program   3
3       Implementation                                         5
4       Report Submission                                      5
5       Viva - voce                                            5
        Total                                                  20

R1 : Internal Lab Test


Maximum Marks: 20

Each assessment parameter is graded Excellent (4), Very Good (3), Fair (2) or Poor (1):

a. Identify appropriate tests, procedures and tools
   Excellent (4): Demonstrates deep knowledge of Data science; answers the related questions with explanations and elaboration.
   Very Good (3): Adequate knowledge of most Data science components; answers the related questions but fails to elaborate.
   Fair (2): Superficial knowledge of Data science components; able to answer only some of the related basic questions.
   Poor (1): Lacks information about most of the Data science components; cannot answer even the basic related questions.

b. Implementation of problem statement
   Excellent (4): Defines the Data science components with full justification and implements a system that works perfectly.
   Very Good (3): Defines the Data science components with full justification and implements a system that does not give 100% results.
   Fair (2): Defines the Data science components with full justification and implements a system that does not give results.
   Poor (1): Defines the Data science components with insufficient knowledge and implements a system that does not give results.

c. Result Analysis and Data Interpretation
   Excellent (4): Excellent insight and well-focused result and discussion; data completely and appropriately interpreted, with no over-interpretation.
   Very Good (3): Adequate insight but missed some important points in results and discussion; most data interpreted correctly, but some conclusions may be suspect or over-interpreted.
   Fair (2): Little insight; analyzed only the most basic points; interpreted some data correctly, but significant errors and omissions remain.
   Poor (1): No insight; entirely missed the point of the experiment; little or no attempt to interpret data, or over-interpreted data.

R2 : Lab Report


Maximum Marks: 20

Each assessment parameter is graded Excellent (4), Very Good (3), Fair (2) or Poor (1):

a. Program and Result Representation (Mapped CO: CO4)
   Excellent (4): Scripting/queries are complete and relevant (achieved all the requirements) and executed successfully.
   Very Good (3): Scripting/queries are complete and executed successfully, but fail to represent the result discussion.
   Fair (2): Scripting/queries are brief and miss significant requirements of the problem statement; result discussion was clear.
   Poor (1): No scripting/queries are reported.

b. Organization of Report and Timely Submission; Result Representation and Discussion (Mapped CO: CO5)
   Excellent (4): Lab report is well organized as directed and submitted on time.
   Very Good (3): Lab report is well organized but not submitted on time.
   Fair (2): Report contains a few errors and is not submitted on time.
   Poor (1): Poor organization and late submission.

R3: Viva

Each assessment parameter is graded Excellent (4), Very Good (3), Fair (2) or Poor (1):

a. Knowledge of theory of practical problems (Mapped CO: CO1)
b. Implementation logics (Mapped CO: CO2)
c. Execution (Mapped CO: CO5)

Internal lab Assessment format

GALGOTIAS UNIVERSITY
Department of Computer Science and Engineering
Assessment of Internal lab Test

Subject Code : Subject Name:


Session : Class :
Date : Max. Marks :
S. No.  Enrollment No.  Name of the Student  Knowledge of Algorithms and Procedures (10)  Execution and Result (10)  Viva voce (10)  Total (30)
1.
2.
3.
4.
5.
6.
7.

Continuous Assessment Format


Internal Lab Assessment (End Semester)

S. No.  Enrol. No.  Name of the Student  Continuous assessment (20)  Viva (10)  Internal Assessment Test (20)  Total (50)  Marks (in words)
1.
2.
3.
4.
5.
6.
7.
8.

Experiment Solution:
1. Perform basic statistical operations using suitable tools: measures of
central tendency, etc.
We consider a random variable x and a data set S = {x1, x2, …, xn} of size n which contains possible
values of x. The data set can represent either the population being studied or a sample drawn from the
population. We can also view the data as defining a distribution, as described in Discrete Probability
Distributions.

We seek a single measure (i.e. a statistic) which somehow represents the center of the entire data
set S. The commonly used measures of central tendency are the mean, median and mode. Besides the
normally studied mean (also called the arithmetic mean) we also consider two other types of mean:
the geometric mean and the harmonic mean.
Excel Functions: If R is an Excel range that contains the data elements in S then the Excel formula
which calculates each of these statistics is shown in Figure 1.
Statistic Excel 2007 Excel 2010/2013/2016

Arithmetic Mean AVERAGE(R) AVERAGE(R)

Median MEDIAN(R) MEDIAN(R)

Mode MODE(R) MODE.SNGL(R), MODE.MULT(R)

Geometric Mean GEOMEAN(R) GEOMEAN(R)

Harmonic Mean HARMEAN(R) HARMEAN(R)


Figure 1 – Measures of central tendency
Observation: All these functions ignore any empty or non-numeric cells.
While formulas such as AVERAGE(R) (as well as VAR(R), STDEV(R), etc. described on other
webpages) ignore any empty or non-numeric cells, they return an error value if R contains an error
value such as #NUM! or #DIV/0!. This limitation can often be overcome by using the following
approach:
=AVERAGE(IF(ISERROR(R),"",R))
This array formula returns the mean of all the cells in R, ignoring any cells that contain an error
value. Since this is an array formula, you must press Ctrl-Shift-Enter. An alternative approach is to
use the following function.

Real Statistics Function: The Real Statistics Resource Pack provides the following array function:
DELErr(R) = the array of the same size and shape as R consisting of all the elements in R where any
cells with an error value are replaced by a blank (i.e. an empty cell).
E.g. to find the average of a range R which may contain error cells, you can use the formula
=AVERAGE(DELErr(R))
Real Statistics Data Analysis Tool: The Remove error cells option of the Reformatting a Data
Range data analysis tool described in Reformatting Tools makes a copy of the inputted range where
all cells that contain error values are replaced by empty cells.
To use this capability, press Ctrl-m and double click on Reformatting a Data Range. When the
dialog box shown in Figure 2 of Reformatting Tools appears, fill in the Input Range, choose the Remove error
cells option and leave the # of Rows and # of Columns fields blank. The output will have the same
size and shape as the input range.
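The same statistics can also be computed directly in Python, which may be more convenient in this lab. A minimal sketch (the data set below is the one from Example 1; numpy and scipy are assumed to be installed):

# Measures of central tendency in Python
import numpy as np
from scipy import stats

S = [5, 2, -1, 3, 7, 5, 0, 2]        # data set from Example 1

print("mean   :", np.mean(S))        # arithmetic mean -> 2.875
print("median :", np.median(S))      # median -> 2.5
print("mode   :", stats.mode(S))     # ModeResult; on ties the smallest mode is returned
# geometric and harmonic means are defined for positive values only
print("gmean  :", stats.gmean([1.05, 1.05, 1.10, 1.10]))
print("hmean  :", stats.hmean([50, 70]))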
Mean
We begin with the most commonly used measure of central tendency, the mean.
Definition 1: The mean (also called the arithmetic mean) of the data set S is defined by

    x̄ = (x1 + x2 + ⋯ + xn) / n
Excel Function: The mean is calculated in Excel using the function AVERAGE.


Example 1: The mean of S = {5, 2, -1, 3, 7, 5, 0, 2} is (2 + 5 – 1 + 3 + 7 + 5 + 0 + 2) / 8 = 2.875. We
achieve the same result by using the formula =AVERAGE(C3:C10) in Figure 2.

Figure 2 – Excel examples of central tendency


Observation: When the data set S is a population the Greek letter µ is used for the mean. When S is a
sample, then the symbol x̄ is used.
Observation: When data is expressed in the form of frequency tables then the following property is
useful.
Property 1: If x̄ is the mean of sample {x1, x2, …, xm} and ȳ is the mean of sample {y1, y2, …,
yn} then the mean of the combined sample is

    (m·x̄ + n·ȳ) / (m + n)

Similarly, if µx is the mean of population {x1, x2, …, xm} and µy is the mean of population {y1, y2, …,
yn} then the mean of the combined population is

    (m·µx + n·µy) / (m + n)

Real Statistics Functions: The Real Statistics Resource Pack furnishes the following array functions:
COUNTCOL(R1) = a row range which contains the number of numeric elements in each of the
columns in R1
SUMCOL(R1) = a row range which contains the sums of each of the columns in R1
MEANCOL(R1) = a row range which contains the means of each of the columns in R1
COUNTROW(R1) = a column range which contains the number of numeric elements in each of the
rows in R1
SUMROW(R1) = a column range which contains the sums of each of the rows in R1
MEANROW(R1) = a column range which contains the means of each of the rows in R1
Example 2: Use the COUNTCOL and MEANCOL functions to calculate the number of cells in each
of the three columns in the range L4:N11 of Figure 3 as well as their means.

Figure 3 – Count, Sum and Mean by Column


The array formula =COUNTCOL(L4:N11) produces the first result (in range L13:N13), while the
formula =MEANCOL(L4:N11) produces the second result (in range L14:N14) and the formula
=SUMCOL(L4:N11) produces the third result (in range L15:N15).
Remember that after entering any of these formulas you must press Ctrl-Shift-Enter.
Median
Definition 2: The median of the data set S is the middle value in S. If you arrange the data in
increasing order the middle value is the median. When S has an even number of elements there are
two such values; the average of these two values is the median.
Excel Function: The median is calculated in Excel using the function MEDIAN.
Example 3: The median of S = {5, 2, -1, 3, 7, 5, 0} is 3 since 3 is the middle value (i.e. the 4th of 7
values) in -1, 0, 2, 3, 5, 5, 7. We achieve the same result by using the formula =MEDIAN(B3:B10) in
Figure 2.
Note that each of the functions in Figure 2 ignores any non-numeric values, including blanks. Thus
the value obtained for =MEDIAN(B3:B10) is the same as that for =MEDIAN(B3:B9).
The median of S = {5, 2, -1, 3, 7, 5, 0, 2} is 2.5 since 2.5 is the average of the two middle values 2 and
3 of -1, 0, 2, 2, 3, 5, 5, 7. This is the same result as =MEDIAN(C3:C10) in Figure 2.
Mode
Definition 3: The mode of the data set S is the value of the data element that occurs most often.
Example 4: The mode of S = {5, 2, -1, 3, 7, 5, 0} is 5 since 5 occurs twice, more than any other data
element. This is the result we obtain from the formula =MODE(B3:B10) in Figure 2. When there is
only one mode, as in this example, we say that S is unimodal.
If S = {5, 2, -1, 3, 7, 5, 0, 2}, the mode of S consists of both 2 and 5 since they each occur twice, more
than any other data element. When there are two modes, as in this case, we say that S is bimodal.
Excel Function: The mode is calculated in Excel by the formula MODE. If range R contains
unimodal data then MODE(R) returns this unique mode. For the first data set in Example 4 this is 5.
When R contains data with more than one mode, MODE(R) returns the first of these modes. For the
second data set in Example 4 this is 5 (since 5 occurs before 2, the other mode, in the data set). Thus
MODE(C3:C10) = 5.
As remarked above, if there is more than one mode, MODE returns only the first, although if all the
values occur only once then MODE returns an error value. This is the case for S = {5, 2, -1, 3, 7, 4, 0,
6}. Thus MODE(D3:D10) = #N/A.
Starting with Excel 2010 the array function MODE.MULT is provided, which is useful for
multimodal data by returning a vertical list of modes. When we highlight C19:C20, enter the array
formula =MODE.MULT(C3:C10) and then press Ctrl-Shift-Enter, we see that both modes are
displayed.
The function MODE.SNGL is also provided with versions of Excel starting with Excel 2010. This
function is equivalent to MODE.
Geometric Mean
Definition 4: The geometric mean of the data set S is calculated by

    GM = (x1 · x2 · ⋯ · xn)^(1/n)
This statistic is commonly used to provide a measure of the average rate of growth as described in
Example 5.
Example 5: Suppose the sales of a certain product grow 5% in the first two years and 10% in the next
two years, what is the average rate of growth over the 4 years?
If sales in year 1 are $1 then sales at the end of the 4 years are (1 + .05)(1 + .05)(1 + .1)(1 + .1) =
1.334. The annual growth rate r is the amount such that (1 + r)^4 = 1.334. Thus r = 1.334^(1/4) - 1 = .0747.
The same annual growth rate of 7.47% can be obtained in Excel using the
formula GEOMEAN(H7:H10) - 1 = .0747.
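The same computation can be reproduced in Python (a sketch; scipy.stats.gmean is the geometric-mean counterpart of Excel's GEOMEAN, and the growth factors below come from Example 5):

from scipy import stats

growth_factors = [1.05, 1.05, 1.10, 1.10]   # yearly growth factors from Example 5
r = stats.gmean(growth_factors) - 1         # average annual growth rate
print(round(r, 4))                          # -> 0.0747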
Harmonic Mean
Definition 5: The harmonic mean of the data set S is calculated by the formula

    HM = n / (1/x1 + 1/x2 + ⋯ + 1/xn)
The harmonic mean can be used to calculate an average speed, as described in Example 6.
Example 6: If you go to your destination at 50 mph and return at 70 mph, what is your average rate of
speed?
Assuming the distance to your destination is d, the time it takes to reach your destination is d/50 hours
and the time it takes to return is d/70, for a total of d/50 + d/70 hours. Since the distance for the whole
trip is 2d, your average speed for the whole trip is

    2d / (d/50 + d/70) = 2 / (1/50 + 1/70) ≈ 58.33 mph

This is equivalent to the harmonic mean of 50 and 70, and so can be calculated in Excel as
HARMEAN(50,70), which is HARMEAN(G7:G8) from Figure 2.
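The Python counterpart (a sketch; scipy.stats.hmean plays the role of Excel's HARMEAN):

from scipy import stats

avg_speed = stats.hmean([50, 70])   # harmonic mean of the two leg speeds
print(round(avg_speed, 2))          # -> 58.33 mph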
2. Perform sampling of data and determine the distribution.
What is sampling distribution?

A sampling distribution is a distribution that plots the values of a statistic for a given random sample
that is part of a larger set of data. When data scientists work with large quantities of data they
sometimes use sampling distributions to determine parameters of the group of data, like what the mean
or standard deviation might be. Parameters are facts about data in the form of statistical values.

This is also called a probability distribution because it relies on probability to inform the data scientist
of sample statistics. Using a sampling distribution simplifies the process of making inferences about
large amounts of data. For this reason, it is used often as a statistical resource in data science.

Understanding sampling distribution: 3 variability factors

 The number observed in a population: This variable is represented by "N". It is the measure of
observed activity in a given group of data.
 The number observed in the sample: This variable is represented by "n". It is the measure of
observed activity in a random sample of data that is part of the larger grouping.
 The method of choosing the sample: How the samples were chosen can account for variability,
in some cases.

Types of Sampling

The most common techniques you’ll likely meet in elementary statistics or AP statistics include taking
a sample with and without replacement. Specific techniques include:

Bernoulli samples have independent Bernoulli trials on population elements. The trials decide whether
the element becomes part of the sample. All population elements have an equal chance of being
included in each choice of a single sample. The sample sizes in Bernoulli samples follow a binomial
distribution. Poisson samples (less common): An independent Bernoulli trial decides if each population
element makes it to the sample.

Cluster samples divide the population into groups (clusters). Then a random sample is chosen from the
clusters. It’s used when researchers don’t know the individuals in a population but do know the
population subsets or groups.

In systematic sampling, you select sample elements from an ordered frame. A sampling frame is just a
list of participants that you want to get a sample from. For example, in the equal-probability method,
choose an element from the list and then choose every kth element using the equation k = N/n. Small "n"
denotes the sample size and capital "N" equals the size of the population.

SRS (simple random sampling): Select items completely randomly, so that each element has the same
probability of being chosen as any other element, and each subset of elements has the same probability
of being chosen as any other subset of the same size.

In stratified sampling, sample each subpopulation independently. First, divide the population into
homogeneous (very similar) subgroups before getting the sample. Each population member only
belongs to one group. Then apply simple random or a systematic method within each group to choose
the sample. Stratified Randomization: a sub-type of stratified used in clinical trials. First, divide
patients into strata, then randomize with permuted block randomization.
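A short Python sketch of three of these techniques using pandas (the population below is hypothetical and only illustrates the mechanics):

import pandas as pd

# Hypothetical population: 1,000 units in three strata
population = pd.DataFrame({
    'id': range(1000),
    'group': ['A'] * 500 + ['B'] * 300 + ['C'] * 200,
})

# Simple random sampling (SRS) without replacement
srs = population.sample(n=100, random_state=42)

# Systematic sampling: every kth element, k = N/n
k = len(population) // 100
systematic = population.iloc[::k]

# Stratified sampling: 10% from each stratum, sampled independently
stratified = population.groupby('group', group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=42))

print(len(srs), len(systematic), len(stratified))   # 100 100 100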
Types of distributions

There are two standard types of sampling distributions you are likely to meet, described below along
with their applications:

 T-distribution
 Normal distribution

T-distribution

A T-distribution is a sampling distribution that helps data professionals estimate population
parameters when a normal distribution is not appropriate, for instance when analyzing a very small
sample. The T-distribution uses a t-score to evaluate such data. The formula for the t-score looks
like this:

t = [ x̄ - μ ] / [ s / sqrt( n ) ]

In the above formula, "x̄" is the sample mean, "μ" is the population mean, "s" is the sample standard
deviation and "n" is the sample size.

Normal distribution

A normal distribution is also called a "bell curve". It is a symmetrical, bell-shaped distribution in
which the mean and median are the same number, positioned at the center of the curve. If you have
a lot of data and you create a sampling distribution, it is most likely to approximate a normal
distribution, from which you can infer statistical values, unless a model like t-scoring is applied.
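A minimal Python sketch of building a sampling distribution of the mean (the skewed population below is simulated; by the central limit theorem the distribution of sample means still comes out roughly normal):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # a skewed population

# Draw many samples of size n and record each sample mean
n, draws = 50, 2000
sample_means = [rng.choice(population, size=n).mean() for _ in range(draws)]

print("population mean      :", population.mean())
print("mean of sample means :", np.mean(sample_means))
print("std of sample means  :", np.std(sample_means))   # approx population std / sqrt(n)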

3. Write Python code to display basic NumPy functions: broadcasting, etc.


Broadcasting:

import numpy as np

A = np.array([5, 7, 3, 1])

B = np.array([90, 50, 0, 30])

# the arrays are compatible because they have the same shape

C = A * B

print(C)
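The two arrays above have the same shape, so the multiplication is simply element-wise. Broadcasting proper applies when the shapes differ; a sketch:

import numpy as np

A = np.array([[1], [2], [3]])     # shape (3, 1)
B = np.array([10, 20, 30, 40])    # shape (4,)

# A is stretched across columns and B across rows -> result shape (3, 4)
C = A * B
print(C)

# a scalar also broadcasts over every element
print(A + 100)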

Functions:

Syntax: numpy.sqrt()

# Python program explaining

# numpy.sqrt() method
# importing numpy

import numpy as geek

# applying sqrt() method on integer numbers

arr1 = geek.sqrt([1, 4, 9, 16])

arr2 = geek.sqrt([6, 10, 18])

print("square-root of an array1 : ", arr1)

print("square-root of an array2 : ", arr2)

Syntax : numpy.add(arr1, arr2, /, out=None, *, where=True, casting='same_kind', order='K',
dtype=None, subok=True[, signature, extobj], ufunc 'add')

# Python program explaining

# numpy.add() function

# when inputs are scalar

import numpy as geek

in_num1 = 10

in_num2 = 15

print ("1st Input number : ", in_num1)

print ("2nd Input number : ", in_num2)

out_num = geek.add(in_num1, in_num2)

print ("output number after addition : ", out_num)

4. Write a Python code to read and display data in different formats, like
CSV, XLS, web link, etc.
CSV:
# Import pandas
import pandas as pd
# reading csv file
pd.read_csv("filename.csv")
Xls:
# import pandas lib as pd
import pandas as pd
# read by default 1st sheet of an excel file
dataframe1 = pd.read_excel('SampleWork.xlsx')
print(dataframe1)

Web link:
# Import pandas
import pandas as pd
# reading tables from a web page
dfs = pd.read_html('http://weburl.html')
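JSON (Unit II also lists this format; the file name below is illustrative):
# Import pandas
import pandas as pd
# reading a JSON file into a dataframe
df_json = pd.read_json('sample.json')
print(df_json)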

5. Write Python code to create dataframes in different ways.


Method #1: Creating Pandas DataFrame from lists of lists.
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe
df

Method #2: Creating DataFrame from dict of narray/lists


# Python code demonstrating creating a
# DataFrame from a dict of narray / lists.
# By default, the index will be range(n).
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Method #3: Creating an indexed DataFrame using arrays.
# Python code demonstrating creation of a
# pandas DataFrame with a custom index,
# using arrays.


import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'Jack', 'nick', 'juli'],
'marks':[99, 98, 95, 90]}
# Creates pandas DataFrame.
df = pd.DataFrame(data, index =['rank1', 'rank2', 'rank3', 'rank4'])
# print the data
df

6. Write a Python code to reshape data using pandas.

# import pandas library

import pandas as pd

# make an array

array = [2, 4, 6, 8, 10, 12]

# create a series

series_obj = pd.Series(array)

# convert series object into array

arr = series_obj.values

# reshaping series
reshaped_arr = arr.reshape((3, 2))

# show

reshaped_arr

# import pandas library

import pandas as pd

# make an array

array = ["ankit","shaurya",

"shivangi", "priya",

"jeet","ananya"]

# create a series

series_obj = pd.Series(array)

print("Given Series:\n", series_obj)

# convert series object into array

arr = series_obj.values

# reshaping series

reshaped_arr = arr.reshape((2, 3))

# show

print("After Reshaping: \n", reshaped_arr)

7. Write Python code to perform subset operations using pandas (both row
and column).
DataFrame[ ] : the indexing operator (also known as the subscript operator)
DataFrame.loc[ ] : used for label-based selection
DataFrame.iloc[ ] : used for position (integer)-based selection
DataFrame.ix[ ] : used for both label- and integer-based selection (deprecated in recent pandas versions)

Selecting a single columns


# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving columns by indexing operator
first = data["Age"]
print(first)

Selecting multiple columns


# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]

Selecting a single row


# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)

Selecting multiple rows


import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)

Selecting two rows and three columns


import pandas as pd

# making data frame from csv file


data = pd.read_csv("nba.csv", index_col ="Name")

# retrieving two rows and three columns by loc method


first = data.loc[["Avery Bradley", "R.J. Hunter"],
["Team", "Number", "Position"]]
print(first)
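Selecting rows and columns by position (a sketch with iloc on the same nba.csv file):

# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving the first two rows and first three columns by iloc method
first = data.iloc[0:2, 0:3]
print(first)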

8. Write a Python code to perform method chaining using pandas.


Pandas Chaining : Method chaining, in which methods are called on an object sequentially, one after
another, has always been a programming style that's possible with pandas, and over the past few
releases, many methods have been introduced that allow even more chaining.

Here we use the concept of chaining in pandas to filter a dataframe in a single expression. This can
be easily explained with the help of examples. Here we use a dataframe which consists of some data
of persons, as shown below (a chaining example follows the data definition):

# import package

import pandas as pd

# define data

data = pd.DataFrame(
    {'ID': {0: 105, 1: 102, 2: 101, 3: 106, 4: 103, 5: 104, 6: 107},
     'Name': {0: 'Ram Kumar', 1: 'Jack Wills', 2: 'Deepanshu Rustagi',
              3: 'Thomas James', 4: 'Jenny Advekar', 5: 'Yash Raj',
              6: 'Raman Dutt Mishra'},
     'Age': {0: 40, 1: 23, 2: 20, 3: 34, 4: 18, 5: 56, 6: 35},
     'Country': {0: 'India', 1: 'Uk', 2: 'India', 3: 'Australia',
                 4: 'Uk', 5: 'India', 6: 'India'}})

# view data

data
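A chained filtering example on this dataframe (a sketch; the derived column name is illustrative):

# filter, derive and sort in one chained expression, with no intermediate variables
result = (
    data
    .query('Country == "India"')                   # keep only Indian persons
    .assign(AgeNextYear=lambda d: d['Age'] + 1)    # add a derived column
    .sort_values('Age')
    .head(3)
)
print(result)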
9. Write a program in Python to predict if a loan will get approved or not.
1.1 Problem
• A company wants to automate the loan eligibility process (real time) based on
customer details provided while filling the online application form. These details are
Gender, Marital Status, Education, Number of Dependents, Income, Loan
Amount, Credit History and others. To automate this process, they have set the
problem of identifying the customer segments that are eligible for the loan amount,
so that they can specifically target these customers. Here they have provided a
data set.

1.2 Data
• Variable Descriptions:

Variable Description
Loan_ID Unique Loan ID
Gender Male/ Female
Married Applicant married (Y/N)
Dependents Number of dependents
Education Applicant Education (Graduate/ Under Graduate)
Self_Employed Self employed (Y/N)
ApplicantIncome Applicant income
CoapplicantIncome Coapplicant income
LoanAmount Loan amount in thousands
Loan_Amount_Term Term of loan in months
Credit_History credit history meets guidelines
Property_Area Urban/ Semi Urban/ Rural
Loan_Status Loan approved (Y/N)
• Rows: 615
• Source: Datahack
• Jupyter Notebook: Github - Parth Shandilya

In [45]: # Importing libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# Reading the training dataset in a dataframe using Pandas
df = pd.read_csv("train.csv")

# Reading the test dataset in a dataframe using Pandas
test = pd.read_csv("test.csv")
In [48]: # First 10 Rows of training Dataset
df.head(10)

Out[48]:    Loan_ID Gender Married Dependents    Education Self_Employed \
0  LP001002   Male      No          0     Graduate            No
1  LP001003   Male     Yes          1     Graduate            No
2  LP001005   Male     Yes          0     Graduate           Yes
3  LP001006   Male     Yes          0 Not Graduate            No
4  LP001008   Male      No          0     Graduate            No
5  LP001011   Male     Yes          2     Graduate           Yes
6  LP001013   Male     Yes          0 Not Graduate            No
7  LP001014   Male     Yes         3+     Graduate            No
8  LP001018   Male     Yes          2     Graduate            No
9  LP001020   Male     Yes          1     Graduate            No

   ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term \
0             5849                0.0         NaN             360.0
1             4583             1508.0       128.0             360.0
2             3000                0.0        66.0             360.0
3             2583             2358.0       120.0             360.0
4             6000                0.0       141.0             360.0
5             5417             4196.0       267.0             360.0
6             2333             1516.0        95.0             360.0
7             3036             2504.0       158.0             360.0
8             4006             1526.0       168.0             360.0
9            12841            10968.0       349.0             360.0

   Credit_History Property_Area Loan_Status
0             1.0         Urban           Y
1             1.0         Rural           N
2             1.0         Urban           Y
3             1.0         Urban           Y
4             1.0         Urban           Y
5             1.0         Urban           Y
6             1.0         Urban           Y
7             0.0     Semiurban           N
8             1.0         Urban           Y
9             1.0     Semiurban           N

In [206]: # Store total number of observations in training dataset
df_length = len(df)

# Store total number of columns in testing data set
test_col = len(test.columns)

2 Understanding the various features (columns) of the dataset.


In [50]: # Summary of numerical variables for training data set
df.describe()
Out[50]: ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term \
count 614.000000 614.000000 592.000000 600.00000
mean 5403.459283 1621.245798 146.412162 342.00000
std 6109.041673 2926.248369 85.587325 65.12041
min 150.000000 0.000000 9.000000 12.00000
25% 2877.500000 0.000000 100.000000 360.00000
50% 3812.500000 1188.500000 128.000000 360.00000
75% 5795.000000 2297.250000 168.000000 360.00000
max 81000.000000 41667.000000 700.000000 480.00000

Credit_History
count 564.000000
mean 0.842199
std 0.364878
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
1. For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can
look at frequency distribution to understand whether they make sense or not.

In [51]: # Get the unique values and their frequency of variable Property_Area
df['Property_Area'].value_counts()

Out[51]: Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64

2. Understanding Distribution of Numerical Variables

• ApplicantIncome
• LoanAmount
In [53]: # Box Plot for understanding the distributions and to observe the outliers.
%matplotlib inline

# Histogram of variable ApplicantIncome
df['ApplicantIncome'].hist()
Out[53]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc932780>

In [54]: # Box Plot for variable ApplicantIncome of training data set
df.boxplot(column='ApplicantIncome')
Out[54]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc85e278>
3. The above Box Plot confirms the presence of a lot of outliers/extreme values.
This can be attributed to the income disparity in the society.

In [55]: # Box Plot for variable ApplicantIncome by variable Education of training data set
df.boxplot(column='ApplicantIncome', by='Education')
Out[55]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc82e588>

4. We can see that there is no substantial difference between the mean income of
graduates and non-graduates. But there is a higher number of graduates with
very high incomes, which appear to be the outliers.

In [56]: # Histogram of variable LoanAmount
df['LoanAmount'].hist(bins=50)
Out[56]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc73e2e8>

In [57]: # Box Plot for variable LoanAmount of training data set
df.boxplot(column='LoanAmount')
Out[57]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc728be0>

In [58]: # Box Plot for variable LoanAmount by variable Gender of training data set
df.boxplot(column='LoanAmount', by='Gender')
Out[58]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc79acc0>

5. LoanAmount has missing as well as extreme values, while ApplicantIncome
has a few extreme values.

3 Understanding Distribution of Categorical Variables


In [15]: # Loan approval rates in absolute numbers
loan_approval = df['Loan_Status'].value_counts()['Y']
print(loan_approval)

422

• 422 loans were approved.


In [37]: # Credit History and Loan Status
pd.crosstab(df['Credit_History'], df['Loan_Status'], margins=True)

Out[37]: Loan_Status      N    Y  All
Credit_History
0.0              82    7   89
1.0              97  378  475
All             179  385  564
In [204]: # Function to output percentages row-wise in a cross table
def percentageConvert(ser):
    return ser / float(ser[-1])

# Loan approval rate for customers having Credit_History (1)
df = pd.crosstab(df["Credit_History"], df["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
loan_approval_with_Credit_1 = df['Y'][1]
print(loan_approval_with_Credit_1 * 100)

79.04761904761905

• 79.58 % of the applicants whose loans were approved have Credit_History
equal to 1.

In [39]: df['Y']

Out[39]: Credit_History
0.0    0.078652
1.0    0.795789
All    0.682624
Name: Y, dtype: float64

In [591]: # Replace missing value of Self_Employed with the more frequent category
df['Self_Employed'].fillna('No', inplace=True)

4 Outliers of LoanAmount and Applicant Income


In [588]: # Add both ApplicantIncome and CoapplicantIncome to TotalIncome
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Looking at the distribution of LoanAmount
df['LoanAmount'].hist(bins=20)

Out[588]: <matplotlib.axes._subplots.AxesSubplot at 0x7f6fadc7ff98>

• The extreme values are practically possible, i.e. some people might apply for
high value loans due to specific needs. So instead of treating them as outliers,
let’s try a log transformation to nullify their effect:

In [112]: # Perform log transformation of LoanAmount to make it closer to normal
df['LoanAmount_log'] = np.log(df['LoanAmount'])

# Looking at the distribution of LoanAmount_log
df['LoanAmount_log'].hist(bins=20)

Out[112]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bbecec50>


5 Data Preparation for Model Building
• sklearn requires all inputs to be numeric; we should convert all our categorical
variables into numeric by encoding the categories. Before that, we will fill all
the missing values in the dataset.

In [62]: # Impute missing values for Gender
df['Gender'].fillna(df['Gender'].mode()[0], inplace=True)

# Impute missing values for Married
df['Married'].fillna(df['Married'].mode()[0], inplace=True)

# Impute missing values for Dependents
df['Dependents'].fillna(df['Dependents'].mode()[0], inplace=True)

# Impute missing values for Credit_History
df['Credit_History'].fillna(df['Credit_History'].mode()[0], inplace=True)

# Convert all non-numeric values to numbers
cat = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area']

for var in cat:
    le = preprocessing.LabelEncoder()
    df[var] = le.fit_transform(df[var].astype('str'))
df.dtypes
df.dtypes
Out[62]: Loan_ID              object
Gender                int64
Married               int64
Dependents            int64
Education             int64
Self_Employed         int64
ApplicantIncome       int64
CoapplicantIncome   float64
LoanAmount          float64
Loan_Amount_Term    float64
Credit_History        int64
Property_Area         int64
Loan_Status          object
dtype: object
6 Generic Classification Function
In [208]: # Import models from scikit-learn module:
from sklearn import metrics
from sklearn.cross_validation import KFold   # in modern scikit-learn this module is sklearn.model_selection

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set:
    predictions = model.predict(data[predictors])

    # Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(data.shape[0], n_folds=5)
    error = []
    for train, test in kf:
        # Filter training data
        train_predictors = (data[predictors].iloc[train, :])

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record error from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
In [186]: # Combining both train and test dataset

# Create a flag for Train and Test Data set
df['Type'] = 'Train'
test['Type'] = 'Test'
fullData = pd.concat([df, test], axis=0, sort=True)

# Look at the available missing values in the dataset
fullData.isnull().sum()
Out[186]: ApplicantIncome       0
CoapplicantIncome      0
Credit_History        29
Dependents            10
Education              0
Gender                11
LoanAmount            27
LoanAmount_log       389
Loan_Amount_Term      20
Loan_ID                0
Loan_Status          367
Married                0
Property_Area          0
Self_Employed         23
Type                   0
dtype: int64
In [187]: # Identify categorical and continuous variables
ID_col = ['Loan_ID']
target_col = ["Loan_Status"]
cat_cols = ['Credit_History', 'Dependents', 'Gender', 'Married', 'Education', 'Property_Area']
In [200]: # Imputing missing values with mean for continuous variables
fullData['LoanAmount'].fillna(fullData['LoanAmount'].mean(), inplace=True)
fullData['LoanAmount_log'].fillna(fullData['LoanAmount_log'].mean(), inplace=True)
fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mean(), inplace=True)
fullData['ApplicantIncome'].fillna(fullData['ApplicantIncome'].mean(), inplace=True)
fullData['CoapplicantIncome'].fillna(fullData['CoapplicantIncome'].mean(), inplace=True)

7 Model Building
# Imputing missing values with mode for categorical variables
fullData['Gender'].fillna(fullData['Gender'].mode()[0], inplace=True)
fullData['Married'].fillna(fullData['Married'].mode()[0], inplace=True)
fullData['Dependents'].fillna(fullData['Dependents'].mode()[0], inplace=True)
fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mode()[0], inplace=True)
fullData['Credit_History'].fillna(fullData['Credit_History'].mode()[0], inplace=True)
In [202]: # Create a new column as Total Income
fullData['TotalIncome'] = fullData['ApplicantIncome'] + fullData['CoapplicantIncome']

fullData['TotalIncome_log'] = np.log(fullData['TotalIncome'])

# Histogram for Total Income
fullData['TotalIncome_log'].hist(bins=20)
Out[202]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bbd93a20>

In [197]: # Create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

train_modified = fullData[fullData['Type'] == 'Train']
test_modified = fullData[fullData['Type'] == 'Test']
train_modified["Loan_Status"] = number.fit_transform(train_modified["Loan_Status"].astype('str'))

/home/parths007/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#

7.1 Logistic Regression Model


1. The chances of getting a loan will be higher for:

• Applicants having a credit history (we observed this in exploration.)


• Applicants with higher applicant and co-applicant incomes
• Applicants with higher education level
• Properties in urban areas with high growth perspectives

So let’s make our model with ‘Credit_History’, ’Education’ & ’Gender’

In [198]: from sklearn.linear_model import LogisticRegression

predictors_Logistic = ['Credit_History', 'Education', 'Gender']

x_train = train_modified[list(predictors_Logistic)].values
y_train = train_modified["Loan_Status"].values
x_test = test_modified[list(predictors_Logistic)].values

In [203]: # Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets
model.fit(x_train, y_train)

# Predict Output
predicted = model.predict(x_test)

# Reverse encoding for predicted outcome
predicted = number.inverse_transform(predicted)

# Store it to test dataset
test_modified['Loan_Status'] = predicted
outcome_var = 'Loan_Status'

classification_model(model, df, predictors_Logistic, outcome_var)

test_modified.to_csv("Logistic_Prediction.csv", columns=['Loan_ID', 'Loan_Status'])

Accuracy : 80.945%
Cross-Validation Score : 80.946%

/home/parths007/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning:
if diff:
/home/parths007/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using
.loc[row_indexer,col_indexer] = value instead
