Lab Manual
Credits: 2
Duration of Semester: 24 Weeks
Session: 2 Hrs per Week
Prerequisite Subjects: -
1. VISION AND MISSION OF GALGOTIAS UNIVERSITY
VISION
MISSION
PEO1: Graduates of Computer Science and Engineering will be globally competent and provide sustainable solutions for interdisciplinary problems as team players.
PEO2: Graduates of Computer Science and Engineering will engage in professional activities with ethical practices in the field of Computer Science and Engineering to enhance their own stature and to contribute towards society.
PEO3: Graduates of Computer Science and Engineering will acquire specialized knowledge in emerging technologies for research, innovation and product development.
4. PROGRAMME OUTCOMES
PO1: Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools, including prediction and modeling, to complex engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to professional engineering practice.
PO7: Environment and sustainability: Understand the impact of professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to engage in independent and life-long learning in the broadest context of technological change.
PSO2: Able to use problem-solving skills to develop efficient algorithmic solutions.
Course Objectives:
The primary objective of this course is to develop both theoretical knowledge and data analysis skills that can be applied to practical problems, and to explain how mathematics and the information sciences contribute to building better algorithms and software. A further objective is to develop applied experience with data science software, programming, applications and processes.
Course Outcomes
CO1: Acquire a sound introductory knowledge of the statistical fundamentals used in data science.
CO2: Apply algorithmic principles and programming knowledge, using the Python language, to data science problems.
CO3: Understand the fundamentals of statistics and probability used in data science.
CO5: Apply and implement ML processing principles using probability and statistics.
CO6: Gain knowledge of new research and development in the field of data science.
2. James, G., Witten, D., Hastie, T., Tibshirani, R. An Introduction to Statistical Learning with Applications in R. Springer, 2013.
3. Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
Reference Book(s)
1. Provost, F., Fawcett, T. Data Science for Business. 2013.
2. EMC Education Services. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. 2015.
Course Contents:
Unit-1: Introduction with Statistical Fundamentals (8 hours)
Statistics introduction: need, advantages and disadvantages. Applications of statistics. Case study of a statistical application. Types of statistics: descriptive and inferential statistics. Variables and types of data. Sampling techniques. Descriptive measures: measures of central tendency, measures of variation, measures of position.
Introduction to NumPy, different NumPy operations, broadcasting with NumPy. Introduction to Pandas, reading or loading data into a DataFrame, Pandas DataFrame manipulations, data loading/reading in different formats (CSV, Excel, JSON, HTML).
Tableau introduction: bringing in the data - connecting and visualization. Analysing the data - adding worksheets, creating dashboards and building stories. Publishing and sharing - publishing workbooks, sharing files with Tableau readers.
Introduction to BI.
Introduction to data visualization, principles behind data visualization. Histograms and box plots to visualize the distribution of continuous numerical variables (bar plots, pie charts, line charts). Data visualization using R: line plots and regression.
Research done in data science in the last 3 years. Discussion of new tools and languages for data science. Case studies of data science.
LIST OF PRACTICALS:
1. Perform basic statistical operations (measures of central tendency, etc.) using suitable tools.
2. Perform sampling of data and determine the distribution.
3. Write Python code to demonstrate basic NumPy functions, broadcasting, etc.
4. Write Python code to read and display data in different formats (CSV, XLS, web link, etc.).
5. Write Python code to create DataFrames in different ways.
6. Write Python code to reshape data using Pandas.
7. Write Python code to perform subset operations (both row and column) using Pandas.
8. Write Python code to perform method chaining using Pandas.
9. Write a programme in Python to predict if a loan will get approved or not.
10. Write a programme in Python to predict the traffic on a new mode of transport.
11. Write a programme in Python to predict the class of the user.
12. Write a programme in Python to identify which tweets are hate tweets and which are not.
13. Write a programme in Python to predict the age of the actors.
14. Mini project to predict the time taken to solve a problem given the current status of the user.
Internal:
Continuous evaluation (R0): 20
Pre-final lab test (R1): 20
Internal viva (R3) (theory, tools, result interpretation): 10 (2+4+4)
Total: 100
Rubrics - Parts and Marks:
1. Able to define the problem statement - 2
2. Able to convert the problem statement into a program - 3
3. Implementation - 5
4. Report Submission - 5
5. Viva-voce - 5
Total - 20
Level of Achievement
Assessment Parameter (a): Program and Result Representation and result discussion (Mapped CO: CO4)
- Excellent (4): Scripting/queries are complete, relevant (achieved all the requirements) and executed successfully; result discussion was clear.
- Very Good (3): Scripting/queries are complete and executed successfully, but failed in representing the result.
- Fair (2): Scripting/queries are brief and missing a significant requirement of the problem statement.
- Poor (1): No scripting/queries are reported.
Assessment Parameter (b): Organization of Report and Timely Submission; Result Representation and Discussion (Mapped CO: CO5)
- Excellent (4): Lab report is well organized as directed and submitted on time.
- Very Good (3): Lab report is well organized but not submitted on time.
- Fair (2): Report contains a few errors and was not submitted on time.
- Poor (1): Poor organization and late submission.
R3: Viva
Level of Achievement: Excellent (4), Very Good (3), Fair (2), Poor (1)
Assessment Parameters:
a. Knowledge of theory of practical problems (Mapped CO: CO1)
b. Implementation logics (Mapped CO: CO2)
c. Execution (Mapped CO: CO5)
GALGOTIAS UNIVERSITY
Department of Computer Science and Engineering
Assessment of Internal lab Test
Experiment Solution:
1. Perform basic statistical operations using suitable tools. Measures of central tendency etc.
We consider a random variable x and a data set S = {x1, x2, …, xn} of size n which contains possible
values of x. The data set can represent either the population being studied or a sample drawn from the
population. We can also view the data as defining a distribution, as described in Discrete Probability
Distributions.
We seek a single measure (i.e. a statistic) which somehow represents the center of the entire data
set S. The commonly used measures of central tendency are the mean, median and mode. Besides the
normally studied mean (also called the arithmetic mean) we also consider two other types of mean:
the geometric mean and the harmonic mean.
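As a Python counterpart to the Excel formulas discussed below, the measures of central tendency can be sketched with the standard library's statistics module (the data values here are made up for illustration):

```python
# Measures of central tendency on a small illustrative data set S
import statistics

S = [3, 7, 7, 19, 24, 1, 2, 5]

mean = statistics.mean(S)             # arithmetic mean: sum / n
median = statistics.median(S)         # middle value of the sorted data
mode = statistics.mode(S)             # most frequent value
gmean = statistics.geometric_mean(S)  # nth root of the product
hmean = statistics.harmonic_mean(S)   # n / (sum of reciprocals)

print(mean, median, mode)  # 8.5 6.0 7
```

For positive, non-constant data the three means always satisfy harmonic < geometric < arithmetic.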
Excel Functions: If R is an Excel range that contains the data elements in S then the Excel formula
which calculates each of these statistics is shown in Figure 1.
Figure 1 (column headings): Statistic | Excel 2007 | Excel 2010/2013/2016
Real Statistics Function: The Real Statistics Resource Pack provides the following array function:
DELErr(R) = the array of the same size and shape as R consisting of all the elements in R where any
cells with an error value are replaced by a blank (i.e. an empty cell).
E.g. to find the average of a range R which may contain error cells, you can use the formula
=AVERAGE(DELErr(R))
Real Statistics Data Analysis Tool: The Remove error cells option of the Reformatting a Data
Range data analysis tool described in Reformatting Tools makes a copy of the inputted range where
all cells that contain error values are replaced by empty cells.
To use this capability, press Ctrl-m and double-click on Reformatting a Data Range. When the dialog box shown in Figure 2 of Reformatting Tools appears, fill in the Input Range, choose the Remove error cells option and leave the # of Rows and # of Columns fields blank. The output will have the same size and shape as the input range.
Mean
We begin with the most commonly used measure of central tendency, the mean.
Definition 1: The mean (also called the arithmetic mean) of the data set S is defined by x̄ = (x1 + x2 + … + xn) / n.
Similarly, if µx is the mean of population {x1, x2, …, xm} and µy is the mean of population {y1, y2, …, yn}, then the mean of the combined population is µ = (m·µx + n·µy) / (m + n).
Real Statistics Functions: The Real Statistics Resource Pack furnishes the following array functions:
COUNTCOL(R1) = a row range which contains the number of numeric elements in each of the
columns in R1
SUMCOL(R1) = a row range which contains the sums of each of the columns in R1
MEANCOL(R1) = a row range which contains the means of each of the columns in R1
COUNTROW(R1) = a column range which contains the number of numeric elements in each of the
rows in R1
SUMROW(R1) = a column range which contains the sums of each of the rows in R1
MEANROW(R1) = a column range which contains the means of each of the rows in R1
Example 2: Use the COUNTCOL and MEANCOL functions to calculate the number of cells in each
of the three columns in the range L4:N11 of Figure 3 as well as their means.
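A rough pandas analogue of these column-wise functions (COUNTCOL, SUMCOL, MEANCOL), assuming a small made-up range with some empty cells in place of the Figure 3 data:

```python
# Count, sum and mean of the numeric entries in each column;
# NaN plays the role of an empty cell in the range.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "L": [2, 4, np.nan, 6],
    "M": [1, 3, 5, 7],
    "N": [np.nan, 10, 20, 30],
})

counts = df.count()  # non-missing elements per column (like COUNTCOL)
sums = df.sum()      # column sums, missing values skipped (like SUMCOL)
means = df.mean()    # column means, missing values skipped (like MEANCOL)
print(counts.tolist(), sums.tolist(), means.tolist())
```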
This statistic is commonly used to provide a measure of the average rate of growth as described in
Example 5.
Example 5: Suppose the sales of a certain product grow 5% in the first two years and 10% in the next
two years, what is the average rate of growth over the 4 years?
If sales in year 1 are $1 then sales at the end of the 4 years are (1 + .05)(1 + .05)(1 + .1)(1 + .1) = 1.334. The annual growth rate r is that amount such that (1 + r)^4 = 1.334. Thus r = 1.334^(1/4) – 1 = .0747.
The same annual growth rate of 7.47% can be obtained in Excel using the
formula GEOMEAN(H7:H10) – 1 = .0747.
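The same calculation can be sketched in Python with statistics.geometric_mean, using the growth factors from Example 5 in place of the Excel range:

```python
# Average annual growth rate via the geometric mean
# (5% growth for two years, then 10% for two years)
from statistics import geometric_mean

growth_factors = [1.05, 1.05, 1.10, 1.10]
r = geometric_mean(growth_factors) - 1  # average annual growth rate
print(round(r, 4))  # 0.0747
```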
Harmonic Mean
Definition 5: The harmonic mean of the data set S is calculated by the formula
The harmonic mean can be used to calculate an average speed, as described in Example 6.
Example 6: If you go to your destination at 50 mph and return at 70 mph, what is your average rate of
speed?
Assuming the distance to your destination is d, the time it takes to reach your destination is d/50 hours
and the time it takes to return is d/70, for a total of d/50 + d/70 hours. Since the distance for the whole
trip is 2d, your average speed for the whole trip is
This is equivalent to the harmonic mean of 50 and 70, and so can be calculated in Excel as
HARMEAN(50,70), which is HARMEAN(G7:G8) from Figure 2.
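In Python, the same result can be sketched with statistics.harmonic_mean in place of the Excel HARMEAN call:

```python
# Average round-trip speed via the harmonic mean
# (Example 6: out at 50 mph, back at 70 mph over the same distance)
from statistics import harmonic_mean

avg_speed = harmonic_mean([50, 70])  # 2 / (1/50 + 1/70)
print(round(avg_speed, 2))  # 58.33
```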
2. Perform sampling of data and determine the distribution.
What is sampling distribution?
A sampling distribution is a distribution that plots the values of a statistic for a given random sample that is part of a larger set of data. When data scientists work with large quantities of data, they sometimes use sampling distributions to determine parameters of the group of data, like what the mean or standard deviation might be. Parameters are facts about data in the form of statistical values.
This is also called a probability distribution because it relies on probability to inform the data scientist
of sample statistics. Using a sampling distribution simplifies the process of making inferences about
large amounts of data. For this reason, it is used often as a statistical resource in data science.
Types of Sampling
The most common techniques you’ll likely meet in elementary statistics or AP statistics include taking
a sample with and without replacement. Specific techniques include:
Bernoulli samples have independent Bernoulli trials on population elements. The trials decide whether the element becomes part of the sample. All population elements have an equal chance of being included in each choice of a single sample. The sample sizes in Bernoulli samples follow a binomial distribution.
Poisson samples (less common): an independent Bernoulli trial decides if each population element makes it into the sample.
Cluster samples divide the population into groups (clusters). Then a random sample is chosen from the clusters. It’s used when researchers don’t know the individuals in a population but do know the population subsets or groups.
In systematic sampling, you select sample elements from an ordered frame. A sampling frame is just a list of participants that you want to get a sample from. For example, in the equal-probability method, choose an element from the list and then choose every kth element using the equation k = N/n. Small “n” denotes the sample size and capital “N” equals the size of the population.
SRS (simple random sampling): select items completely randomly, so that each element has the same probability of being chosen as any other element. Each subset of k elements has the same probability of being chosen as any other subset of k elements.
In stratified sampling, sample each subpopulation independently. First, divide the population into
homogeneous (very similar) subgroups before getting the sample. Each population member only
belongs to one group. Then apply simple random or a systematic method within each group to choose
the sample. Stratified Randomization: a sub-type of stratified used in clinical trials. First, divide
patients into strata, then randomize with permuted block randomization.
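A few of the techniques above can be sketched with Python's standard random module; the population, strata and sample sizes here are made up for illustration:

```python
# Sketches of simple random, systematic and stratified sampling
import random

random.seed(0)
population = list(range(100))  # N = 100
n = 10                         # desired sample size

# Simple random sampling (SRS): every element equally likely
srs = random.sample(population, n)

# Systematic sampling: every kth element from a random start, k = N/n
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: split into two strata, sample each independently
strata = [population[:50], population[50:]]
stratified = [x for s in strata for x in random.sample(s, n // 2)]

print(len(srs), len(systematic), len(stratified))  # 10 10 10
```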
Types of distributions
There are a couple of standard types of sampling distributions. Read on to learn about the types of sampling distributions and their applications:
T-distribution
Normal distribution
T-distribution
A T-distribution is a sampling distribution that helps data professionals determine the population size
or the population variance. The T-distribution uses a t-score to evaluate data that wouldn't be
appropriate for a normal distribution, for instance when analyzing a very small sample. The formula for the t-score looks like this:
t = (x̄ – μ) / (s / √n)
In the above formula, x̄ is the sample mean, μ is the population mean, s signifies the sample standard deviation and n is the sample size.
Normal distribution
A normal distribution is also called a "bell curve". These are distributions with a symmetrical bell-shaped curve in which the mean and median are the same number, positioned at the center of the curve. If you have a lot of data and you create a sampling distribution, it is most likely to model a normal distribution, from which you can infer statistical values, unless a model like t-scoring is applied.
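This bell-curve behaviour can be illustrated with a small simulation: drawing repeated samples from a skewed (exponential) population, the sample means still cluster symmetrically around the population mean. The population, sample size and repetition count below are arbitrary choices for the sketch:

```python
# Simulating the sampling distribution of the mean
import random
import statistics

random.seed(42)

# Draw 2000 samples of size 50 from an exponential population
# whose true mean is 1.0, and record each sample's mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

center = statistics.mean(sample_means)   # close to the population mean 1.0
spread = statistics.stdev(sample_means)  # close to 1 / sqrt(50) ≈ 0.14
print(round(center, 2), round(spread, 2))
```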
import numpy as np
a = np.array([5, 7, 3, 1])
b = np.array([2, 4, 6, 8])  # second array added so the product is defined
c = a * b  # element-wise multiplication
print(c)
Functions:
Syntax: numpy.sqrt()
# importing numpy
import numpy as np
# numpy.sqrt() method
print(np.sqrt(25))
# numpy.add() function
in_num1 = 10
in_num2 = 15
print(np.add(in_num1, in_num2))
Web link:
# Import pandas
import pandas as pd
# reading HTML tables from a web page into a list of DataFrames
dfs = pd.read_html('http://weburl.html')
import pandas as pd
# make an array
array = ["ankit", "shaurya",
         "shivangi", "priya",
         "jeet", "ananya"]
# create a series
series_obj = pd.Series(array)
arr = series_obj.values
# reshaping the series values into a 3x2 array
reshaped_arr = arr.reshape((3, 2))
# show
print(reshaped_arr)
Here we use the concept of chaining in Pandas to filter the DataFrame and get the desired rows as output. This is most easily explained with the help of examples. Here we use a DataFrame which consists of some data of persons, as shown below:
# import package
import pandas as pd
# define data (illustrative values; the original cell's data was not preserved)
data = pd.DataFrame({
    "Name": ["John", "Priya", "Ankit"],
    "Age": [25, 30, 22],
})
# view data
print(data)
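A minimal sketch of such a chained filter, on made-up person data (not the DataFrame shown above): each step returns a DataFrame, so selection, filtering and sorting can be written as one chained expression.

```python
# Method chaining in pandas: filter, then query, then sort in one chain
import pandas as pd

data = pd.DataFrame({
    "Name": ["Asha", "Ravi", "Meena", "Karan"],
    "Age": [24, 31, 28, 35],
    "City": ["Delhi", "Pune", "Delhi", "Mumbai"],
})

# Chain: keep Delhi rows, then those aged over 25, then sort by Age
result = (
    data[data["City"] == "Delhi"]
        .query("Age > 25")
        .sort_values("Age")
)
print(result["Name"].tolist())  # ['Meena']
```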
9. Write a programme in Python to predict if a loan will get approved or not.
1.1 Problem
• A company wants to automate the loan eligibility process (real time) based on customer details provided while filling in the online application form. These details are Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History and others. To automate this process, they have posed the problem of identifying the customer segments that are eligible for a loan amount, so that they can specifically target these customers. They have provided a data set for this purpose.
1.2 Data
• Variable Descriptions:
Variable Description
Loan_ID Unique Loan ID
Gender Male/ Female
Married Applicant married (Y/N)
Dependents Number of dependents
Education Applicant Education (Graduate/ Under Graduate)
Self_Employed Self employed (Y/N)
ApplicantIncome Applicant income
CoapplicantIncome Coapplicant income
LoanAmount Loan amount in thousands
Loan_Amount_Term Term of loan in months
Credit_History credit history meets guidelines
Property_Area Urban/ Semi Urban/ Rural
Loan_Status Loan approved (Y/N)
• Rows: 615
• Source: Datahack
• Jupyter Notebook: Github - Parth Shandilya
In [45]: # Importing libraries
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
df.head(10)
Out[48]: Loan_ID Gender Married Dependents Education Self_Employed \
0 LP001002 Male No 0 Graduate No
1 LP001003 Male Yes 1 Graduate No
2 LP001005 Male Yes 0 Graduate Yes
3 LP001006 Male Yes 0 Not Graduate No
4 LP001008 Male No 0 Graduate No
5 LP001011 Male Yes 2 Graduate Yes
6 LP001013 Male Yes 0 Not Graduate No
7 LP001014 Male Yes 3+ Graduate No
8 LP001018 Male Yes 2 Graduate No
9 LP001020 Male Yes 1 Graduate No
Credit_History
count 564.000000
mean 0.842199
std 0.364878
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
1. For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can
look at frequency distribution to understand whether they make sense or not.
In [51]: # Get the unique values and their frequency of variable Property_Area
df['Property_Area'].value_counts()
Out[51]: Semiurban 233
Urban 202
Rural 179
Name: Property_Area, dtype: int64
• ApplicantIncome
• LoanAmount
In [53]: # Box plot for understanding the distributions and to observe the outliers
%matplotlib inline
df.boxplot(column='ApplicantIncome')
Out[54]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc85e278>
3. The above Box Plot confirms the presence of a lot of outliers/extreme values.
This can be attributed to the income disparity in the society.
In [55]: # Box plot for variable ApplicantIncome by variable Education of training data set
df.boxplot(column='ApplicantIncome', by='Education')
Out[55]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc82e588>
4. We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear to be outliers.
df['LoanAmount'].hist(bins=50)
df.boxplot(column='LoanAmount')
Out[57]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc728be0>
In [58]: # Box plot for variable LoanAmount by variable Gender of training data set
df.boxplot(column='LoanAmount', by='Gender')
Out[58]: <matplotlib.axes._subplots.AxesSubplot at 0x7f93bc79acc0>
In [39]: df['Y']
Out[39]: Credit_History
0.0    0.078652
1.0    0.795789
All    0.682624
Name: Y, dtype: float64
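An approval-rate table of this shape (mean of the outcome grouped by Credit_History, with an All margin) can be produced with a pandas pivot table; a sketch on a tiny made-up data set, not the actual loan data:

```python
# Approval rate by Credit_History via pivot_table with margins
import pandas as pd

df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Y": [1, 1, 0, 0, 0, 1],  # 1 = loan approved
})

rate = df.pivot_table(values="Y", index="Credit_History",
                      aggfunc="mean", margins=True)
print(rate)
```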
• The extreme values are practically possible, i.e. some people might apply for
high value loans due to specific needs. So instead of treating them as outliers,
let’s try a log transformation to nullify their effect:
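A minimal sketch of what the log transformation does, on made-up loan amounts rather than the actual column: the gap between extreme and typical values shrinks sharply after taking logs.

```python
# Log transformation to damp the effect of extreme values
import numpy as np

loan_amounts = np.array([100, 120, 150, 700])  # one extreme value
log_amounts = np.log(loan_amounts)

# The ratio between the largest and smallest value shrinks sharply
print(loan_amounts.max() / loan_amounts.min())        # 7.0
print(round(log_amounts.max() / log_amounts.min(), 2))  # 1.42
```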
#Print accuracy
accuracy = metrics.accuracy_score(predictions, data[outcome])
print("Accuracy : %s" % "{0:.3%}".format(accuracy))

#Perform k-fold cross-validation with 5 folds
kf = KFold(data.shape[0], n_folds=5)
error = []
for train, test in kf:
    # Filter training data
    train_predictors = (data[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = data[outcome].iloc[train]
Out[186]: ApplicantIncome 0
CoapplicantIncome 0
Credit_History 29
Dependents 10
Education 0
Gender 11
LoanAmount 27
LoanAmount_log 389
Loan_Amount_Term 20
Loan_ID 0
Loan_Status 367
Married 0
Property_Area 0
Self_Employed 23
Type 0
dtype: int64
In [187]: #Identify categorical and continuous variables
ID_col = ['Loan_ID']
target_col = ["Loan_Status"]
cat_cols = ['Credit_History', 'Dependents', 'Gender', 'Married', 'Education', 'Property_Area']
In [200]: #Imputing missing values with mean for continuous variables
fullData['LoanAmount'].fillna(fullData['LoanAmount'].mean(), inplace=True)
fullData['LoanAmount_log'].fillna(fullData['LoanAmount_log'].mean(), inplace=True)
fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mean(), inplace=True)
fullData['ApplicantIncome'].fillna(fullData['ApplicantIncome'].mean(), inplace=True)
fullData['CoapplicantIncome'].fillna(fullData['CoapplicantIncome'].mean(), inplace=True)
    error.append(model.score(data[predictors].iloc[test,:], data[outcome].iloc[test]))
#Fit the model again so that it can be referred to outside the function:
model.fit(data[predictors], data[outcome])
7 Model Building
#Imputing missing values with mode for categorical variables
fullData['Gender'].fillna(fullData['Gender'].mode()[0], inplace=True)
fullData['Married'].fillna(fullData['Married'].mode()[0], inplace=True)
fullData['Dependents'].fillna(fullData['Dependents'].mode()[0], inplace=True)
fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mode()[0], inplace=True)
fullData['Credit_History'].fillna(fullData['Credit_History'].mode()[0], inplace=True)
In [202]: #Create a new column as Total Income
fullData['TotalIncome'] = fullData['ApplicantIncome'] + fullData['CoapplicantIncome']
fullData['TotalIncome_log'] = np.log(fullData['TotalIncome'])
predictors_Logistic = ['Credit_History', 'Education', 'Gender']
x_train = train_modified[list(predictors_Logistic)].values
y_train = train_modified["Loan_Status"].values
x_test = test_modified[list(predictors_Logistic)].values
#Predict output
predicted = model.predict(x_test)
classification_model(model, df, predictors_Logistic, outcome_var)
test_modified.to_csv("Logistic_Prediction.csv", columns=['Loan_ID', 'Loan_Status'])
Accuracy : 80.945%
Cross-Validation Score : 80.946%
/home/parths007/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/label.py:151: DeprecationWarning: if diff:
/home/parths007/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead