TRAINING REPORT
ON
“DATA ANALYSIS”
Submitted By:
Name: ARYAN VERMA
Roll No.: 20001003027
Branch: ECE
Year: 4th Year (7th Sem)
Submitted To:
Department of Electronics and Communication
Engineering
DCRUST (Murthal)
INDEX
Sr. No. Topic
1 Title
2 Index
3 Training Certificate
4 Declaration
5 Acknowledgement
6 About Training
7 About Internshala
8 Objectives
9 Data Science
10 My Learnings
11 Final Project
12 Reason for choosing Data Science
13 Learning Outcome
14 Scope in Data Science
15 Results
DECLARATION
I hereby certify that the work presented in the report entitled
“Data Science”, in fulfilment of the requirement for completion of six weeks of
industrial training in the Department of Mechanical Engineering of "Deenbandhu
Chhotu Ram University of Science and Technology, Murthal", is an authentic
record of my own work carried out during industrial training.
Aryan Verma
20001003027
ECED, 7th Sem
DCRUST MURTHAL
ACKNOWLEDGEMENT
The work in this report is an outcome of continuous work over a period and
drew intellectual support from MED TOUR EASY and other sources. I would
like to articulate my profound gratitude and indebtedness to MED TOUR EASY,
which helped me in completion of the training. I am thankful to the MED TOUR EASY
Training Associates for teaching and assisting me in making the training successful.
Aryan Verma
20001003027
ECED, 7th Sem
DCRUST MURTHAL
1. ABOUT TRAINING
• NAME OF TRAINING: DATA SCIENCE
• HOSTING INSTITUTION: INTERNSHALA
• DATES: From 1st July 2021 to 12th August 2021
3. OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to
reach conclusions that optimize business processes and support decision making.
Examples include predictive machine maintenance, and sales forecasting based on
weather in the fields of marketing and sales.
4. DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer
science to study and evaluate data. The key objective of Data Science is to extract valuable
information for use in strategic decision making, product development, trend analysis, and
forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and
data retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data science
practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more
to produce artificial intelligence (AI) systems to perform tasks that ordinarily require
human intelligence. In turn, these systems generate insights which analysts and business users
can translate into tangible business value.
DATA SCIENCE PROCESS:
1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner. The
result is data in its raw form, which probably needs polishing and transformation before
it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in your
models. To achieve this, you’ll detect and correct different kinds of errors in the data,
combine data from different data sources, and transform it. If you have successfully
completed this step, you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of
the data. You’ll look for patterns, correlations, and deviations based on visual and
descriptive techniques. The insights you gain from this phase will enable you to start
modeling.
5. The fifth step is model building (often referred to as “data
modeling”). It is now that you attempt to gain the insights or make
the predictions stated in your project charter. Now is the time to bring out the heavy guns,
but remember that research has shown that often (but not always) a combination of
simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science model is presenting your results and automating the
analysis, if needed. One goal of a project is to change a process and/or make better
decisions. You may still need to convince the business that your findings will indeed
change the business process as expected. This is where you can shine in your influencer
role. The importance of this step is more apparent in projects on a strategic and tactical
level. Certain projects require you to perform the business process over and over again,
so automating the project will save time.
5. MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
• Applications of Data Science
➢ Anomaly detection (fraud, disease, etc.)
Data Science
The field of bringing insights from data using scientific techniques is called data science.
Applications
Computer Vision - Image recognition by a computer relies on processing large
sets of image data from multiple objects of the same category. For example, face recognition.
[Figure: analytics spectrum in order of increasing complexity: Reporting (What happened?), Detective Analysis (Why did it happen?), Dashboards (What’s happening now?), Predictive Analysis (What is likely to happen?), Big Data]
Detective Analysis
Asking questions based on the data we are seeing, such as: why did something happen?
Predictive Modelling
Big Data
The stage where the complexity of handling data goes beyond what traditional systems can manage.
This can be caused by the volume, variety or velocity of the data. Specific tools are needed to analyse data at such scale.
• Recommendation System
Example- On Amazon, recommendations differ from user to user according to their past searches.
• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting right products from e-commerce companies
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
• How do Google and other search engines know which results are most relevant to our search query?
1. Apply ML and Data Science
2. Fraud Detection
3. AD placement
4. Personalized search results
Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax
and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.
Why Python???
• CONDITIONAL STATEMENTS:
If-else statements (Single condition)
If- elif- else statements (Multiple Condition)
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be called (reused) any number of times.
LISTS: A list is an ordered data structure with elements separated by comma and enclosed
within square brackets.
DICTIONARY: A dictionary is a data structure with elements separated by comma and
stored as key: value pairs, enclosed within curly braces {}. Keys are unique, and since
Python 3.7 a dictionary preserves the order in which keys were inserted.
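The constructs listed above can be seen together in a short sketch. The names grade, marks and subjects are made up for illustration and do not come from the training material.

```python
# Illustrative sketch only: grade, marks and subjects are hypothetical names.

def grade(score):
    """User-defined function: map a numeric score to a letter grade."""
    if score >= 80:            # if / elif / else handles multiple conditions
        return "A"
    elif score >= 60:
        return "B"
    else:
        return "C"

marks = [85, 72, 40]                                     # list: ordered, square brackets
subjects = {"maths": 85, "physics": 72, "english": 40}   # dictionary: key: value pairs

grades = []
for m in marks:                # for loop: the same function is reused for each element
    grades.append(grade(m))

print(grades)
print(len(marks))              # len() is a built-in function
print(subjects["physics"])     # dictionary values are looked up by key
```

The function is defined once and called three times inside the loop, which is what makes it reusable.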
Statistics
Descriptive Statistic
Mode
It is the value which occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")       # read data from the CSV file
data.head()                          # show the first five rows
mode_data = data['Subject'].mode()   # mode of the 'Subject' column
print(mode_data)
Mean
import pandas as pd
data = pd.read_csv("mean.csv")            # read data from the CSV file
data.head()                               # show the first five rows
mean_data = data['Overallmarks'].mean()   # mean of the 'Overallmarks' column
print(mean_data)
Median
Absolute central value of data set.
import pandas as pd
data = pd.read_csv("data.csv")                # read data from the CSV file
data.head()                                   # show the first five rows
median_data = data['Overallmarks'].median()   # median of the 'Overallmarks' column
print(median_data)
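The three measures can also be tried without a CSV file, using Python's built-in statistics module. The marks below are made-up sample values, not data from the training, and the 9700 entry imitates a typing error.

```python
import statistics

# Hypothetical sample of overall marks (not taken from the training data).
overall_marks = [55, 60, 60, 72, 98]

mean_value = statistics.mean(overall_marks)      # arithmetic average
median_value = statistics.median(overall_marks)  # middle value of the sorted data
mode_value = statistics.mode(overall_marks)      # most frequent value

# Add one extreme value (a typo such as 9700 instead of 97) and recompute.
with_typo = overall_marks + [9700]
mean_with_typo = statistics.mean(with_typo)
median_with_typo = statistics.median(with_typo)
mode_with_typo = statistics.mode(with_typo)

print(mean_value, median_value, mode_value)
print(mean_with_typo, median_with_typo, mode_with_typo)
```

A single extreme value moves the mean a long way, while the median shifts only slightly and the mode stays the same.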
Types of variables
• Continuous – Which takes continuous numeric values. Eg- marks
• Categorical – Which have discrete values. Eg- Gender
• Ordinal – Ordered categorical variables. Eg- Teacher feedback
• Nominal – Unordered categorical variables. Eg- Gender
Outliers
Any value which falls far outside the typical range of the data is termed an outlier. Eg- 9700 instead of 97.
Reasons of Outliers
• Typos-During collection. Eg-adding extra zero by mistake.
• Measurement Error-Outliers in data due to measurement operator being faulty.
T Tests
Used when we have just a sample, not population statistics.
The sample standard deviation is used to estimate the population standard deviation.
A t test is more prone to errors, because we only have samples.
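As a minimal sketch of the idea, the one-sample t statistic divides the gap between the sample mean and a hypothesised population mean by the estimated standard error. The function name one_sample_t, the sample values and the hypothesised mean of 4 are assumptions made for illustration. Turning the statistic into a p-value needs the t distribution, which is not covered here.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 in the denominator)
    return (sample_mean - mu0) / (s / math.sqrt(n))

sample = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical measurements
t_value = one_sample_t(sample, mu0=4)
print(round(t_value, 3))
```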
Z Score
The z score (or standard score) is the distance of an observed value from the mean,
measured in number of standard deviations.
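In symbols, z = (x - mean) / standard deviation. A small sketch, assuming the population standard deviation is used and with made-up values:

```python
import statistics

def z_score(x, values):
    """Number of standard deviations by which x lies away from the mean of values."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)   # population standard deviation (assumption)
    return (x - mean) / std_dev

values = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical data: mean 5, standard deviation 2
z_of_nine = z_score(9, values)
z_of_mean = z_score(5, values)
print(z_of_nine, z_of_mean)
```

A value equal to the mean has a z score of 0, and 9 lies two standard deviations above the mean.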
Predictive Modelling
Predictive modelling uses past data and its attributes to predict future outcomes.
Eg-
Past Horror Movies
Future Unwatched Horror Movies
• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
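The rule "people that buy X also tend to buy Y" is usually judged by its support and confidence. The sketch below computes both on a made-up list of shopping baskets. The function names and the items are illustrative assumptions only.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(transactions, x, y):
    """Fraction of all transactions that contain both x and y."""
    both = [t for t in transactions if x in t and y in t]
    return len(both) / len(transactions)

def confidence(transactions, x, y):
    """Among transactions containing x, the fraction that also contain y."""
    with_x = [t for t in transactions if x in t]
    with_both = [t for t in with_x if y in t]
    return len(with_both) / len(with_x)

rule_support = support(transactions, "bread", "butter")
rule_confidence = confidence(transactions, "bread", "butter")
print(rule_support, rule_confidence)
```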
Problem Definition
Identify the right problem statement, ideally formulate the problem mathematically.
Hypothesis Generation
List down all possible variables, which might influence problem objective. These variables should be
free from personal bias and preferences.
Quality of model is directly proportional to quality of hypothesis.
Data Extraction/Collection
Collect data from different sources and combine those for exploration and model building.
While looking at the data we might come across new hypotheses.
Data Exploration and Transformation
Data extraction is a process that involves retrieval of data from various sources for further data processing
or data storage.
Steps of Data Exploration
Variable Identification
It is the process of identifying whether variable is
1. Independent or dependent variable
2. Continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of dependent variable.
2. Different data processing techniques for categorical and continuous data.
Categorical variable- Stored as object.
Continuous variable-Stored as int or float.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
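A minimal sketch of both kinds of analysis on made-up data: a univariate summary of one variable, followed by the Pearson correlation coefficient as one possible measure of association between two variables. The variable names and values are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and marks obtained by five students.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 70]

# Univariate analysis: summarise one variable at a time.
summary = {"min": min(marks), "max": max(marks), "mean": sum(marks) / len(marks)}

# Bivariate analysis: are the two variables associated?
r = pearson_r(hours, marks)
print(summary, round(r, 3))
```

A value of r close to +1 indicates a strong positive linear association, and a value close to 0 indicates little linear association.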
Identifying Outlier
Graphical Method
• Box Plot
• Scatter Plot
Formula Method
Using Box Plot
A value is an outlier if it is < Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
Where IQR = Q3 - Q1
Q3 = Value of 3rd quartile
Q1 = Value of 1st quartile
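The box plot rule can be sketched with Python's statistics module. The marks are made up, with 9700 standing in for a typing error. Note that quartiles can be computed by several conventions, so Q1 and Q3 may differ slightly between tools.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

marks = [91, 85, 97, 88, 9700, 79, 93]   # hypothetical marks containing one typo
outliers = iqr_outliers(marks)
print(outliers)
```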
Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them separately
Variable Transformation
Is the process by which-
1. We replace a variable with some function of that variable. Eg – Replacing a variable x with its log.
2. We change the distribution or relationship of a variable with others.
Used to –
1. Change the scale of a variable
2. Transform non-linear relationships into linear relationships
3. Create a symmetric distribution from a skewed distribution.
Common methods of Variable Transformation – Logarithm, Square root, Cube root, Binning, etc.
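As a small illustration of the logarithm method, the sketch below replaces a strongly skewed variable with its base-10 log. The income figures are invented for the example.

```python
import math

# Hypothetical incomes spanning several orders of magnitude (right-skewed).
incomes = [1_000, 10_000, 100_000, 1_000_000]

# Replace the variable x with log10(x): the scale changes and the spread is compressed.
log_incomes = [math.log10(x) for x in incomes]

original_range = max(incomes) - min(incomes)
transformed_range = max(log_incomes) - min(log_incomes)
print(log_incomes)
print(original_range, transformed_range)
```

After the transformation the values are evenly spaced, which is why a log is often tried first on skewed, strictly positive variables.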
Model Building
It is a process to create a mathematical model for estimating / predicting the future based on past data.
Eg-
A retail bank wants to know the default behaviour of its credit card customers. It wants to predict the
probability of default for each customer in the next three months.
• Probability of default would lie between 0 and 1.
Algorithm Selection
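Example-
Is there a dependent variable?
  Yes: Supervised Learning
  No: Unsupervised Learning
For supervised learning: is the dependent variable continuous?
  Yes: Regression
  No: Classification
Eg- Predicting whether the customer will buy the product or not is a classification problem.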
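The selection logic can be written as a small function. This is only a sketch: it assumes the first question is whether a dependent variable is available at all, and the function and parameter names are hypothetical.

```python
def choose_learning_task(has_dependent_variable, dependent_is_continuous=False):
    """Pick the kind of learning problem from two yes/no questions."""
    if not has_dependent_variable:
        return "unsupervised learning"    # no target to predict, e.g. clustering
    if dependent_is_continuous:
        return "regression"               # supervised, numeric target
    return "classification"               # supervised, categorical target

# Will the customer buy the product or not? A yes/no target is categorical.
task = choose_learning_task(has_dependent_variable=True, dependent_is_continuous=False)
print(task)
```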
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is a process to learn relationship / correlation between independent and dependent variables.
We use dependent variable of train data set to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train model.
• Test
Future data (unknown dependent variable)
Used to score.
Prediction / Scoring
It is the process to estimate/predict the dependent variable of the test data set by applying model rules.
We apply what was learnt in training to the test data set for prediction/estimation.
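The train and score steps can be sketched with a very small model. The example assumes a one-variable least-squares line as the model, and the data is made up. The training material does not say which model was used at this point.

```python
def fit_line(xs, ys):
    """Training: learn slope and intercept from data with a known dependent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Train set: past data, dependent variable known (hypothetical values).
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

# Test set: new data, dependent variable unknown.
test_x = [5, 6]

slope, intercept = fit_line(train_x, train_y)            # training
predictions = [slope * x + intercept for x in test_x]    # prediction / scoring
print(slope, intercept, predictions)
```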
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary
dependent variable, although many more complex extensions exist.
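A minimal sketch of the idea in plain Python: the logistic (sigmoid) function turns a weighted input into a probability between 0 and 1, and the weight and bias are fitted by gradient descent. The data, learning rate and number of epochs are assumptions chosen for illustration. In practice a library implementation would normally be used.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    return sigmoid(weight * x + bias)

def fit(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [predict_probability(x, weight, bias) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias

# Hypothetical data: call duration in minutes and whether the client subscribed (1) or not (0).
durations = [1, 2, 3, 4, 5, 6]
subscribed = [0, 0, 0, 1, 1, 1]

weight, bias = fit(durations, subscribed)
p_short = predict_probability(1, weight, bias)
p_long = predict_probability(6, weight, bias)
print(round(p_short, 3), round(p_long, 3))
```

A predicted probability above 0.5 is usually read as class 1 and below 0.5 as class 0.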
K-Means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the
number of groups represented by the variable K. The algorithm works iteratively to assign each data point
to one of K groups based on the features that are provided. Data points are clustered based on feature
similarity.
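The iterative assignment described above can be sketched in plain Python for two-dimensional points. To keep the result reproducible, the sketch simply takes the first K points as the starting centroids, which is a simplification. The points themselves are made up.

```python
def k_means(points, k, iterations=10):
    """Very small K-means for 2-D points; returns (centroids, clusters)."""
    centroids = list(points[:k])          # assumption: first k points as initial centroids
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

# Hypothetical data: two visibly separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Library implementations usually choose the starting centroids more carefully and stop once the assignments no longer change.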
6. FINAL PROJECT
PREDICTING IF CUSTOMER BUYS TERM DEPOSIT
Problem Statement:
Your client is a retail banking institution. Term deposits are a major source of income for a
bank. A term deposit is a cash investment held at a financial institution. Your money is invested
for an agreed rate of interest over a fixed amount of time, or term. The bank has various
outreach plans to sell term deposits to their customers such as email marketing,
advertisements, telephonic marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most effective ways to reach out to
people. However, they require huge investment as large call centers are hired to actually
execute these campaigns. Hence, it is crucial to identify the customers most likely to
convert beforehand so that they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their job type, their marital
status, etc. Along with the client data, you are also provided with the information of the
call such as the duration of the call, day and month of the call, etc. Given this information,
your task is to predict if the client will subscribe to term deposit.
Data Dictionary: -
Prerequisites:
We have the following files:
• train.csv: This dataset will be used to train the model. This file contains all the
client and call details as well as the target variable “subscribed”.
• test.csv: The trained model will be used to predict whether the new set of clients
in this dataset will subscribe to the term deposit or not.
TEST.csv file: -
TRAIN.csv file: -
Problem Description
Data Science has become a revolutionary technology that everyone seems to talk about. Although hailed as the
‘sexiest job of the 21st century’, Data Science remains a buzzword, with very few people knowing about the
technology in its true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science
and give a real picture. This section lists these points to provide the
necessary insights about Data Science.
Advantages: -
1. It’s in Demand
2. Abundance of Positions
3. A Highly Paid Career
4. Data Science is Versatile
Disadvantages: -
1. Mastering Data Science is nearly impossible
2. A large Amount of Domain Knowledge Required
3. Arbitrary Data May Yield Unexpected Results
4. The problem of Data Privacy
Learning Outcome
• Apply data science concepts and methods to solve problems in real-world contexts and
communicate these solutions effectively.
Data is being regularly collected by businesses and companies for transactions and through
website interactions. Many companies face a common challenge – to analyze and categorize
the data that is collected and stored. A data scientist becomes the savior in a situation of
mayhem like this. Companies can progress a lot with proper and efficient handling of data,
which results in productivity.
• Revised Data Privacy Regulations
The General Data Protection Regulation (GDPR) came into force across the European Union
in May 2018. California followed with a similar data protection law, the California Consumer
Privacy Act (CCPA), which took effect in January 2020. Such regulations create co-dependency
between companies and data scientists, because data must be stored adequately and responsibly.
People today are generally more cautious about sharing data with businesses and giving up a
certain amount of control to them, as awareness of data breaches and their harmful
consequences is rising. Companies can no longer afford to be careless and irresponsible about
their data. The GDPR is intended to ensure some amount of data privacy in the future.
• Data Science is constantly evolving
Career areas that do not carry any growth potential in them run the risk of stagnating. This
indicates that the respective fields need to constantly evolve and undergo a change for
opportunities to arise and flourish in the industry. Data science is a broad career path that is
undergoing developments and thus promises abundant opportunities in the future. Data
science job roles are likely to get more specific, which in turn will lead to specializations in
the field. People inclined towards this stream can exploit their opportunities and pursue what
suits them best through these specifications and specializations.
• An astonishing increase in data growth
Data is generated by everyone on a daily basis with and without our notice. The interaction
we have with data daily will only keep increasing as time passes. In addition, the amount of
data existing in the world will increase at lightning speed. As data production will be on the
rise, the demand for data scientists will be crucial to help enterprises use and manage it well.
• Virtual Reality will be friendlier
In today’s world, we are witnessing how Artificial Intelligence is spreading across the
globe and how companies increasingly rely on it. Big data prospects with its current
innovations will flourish more with advanced concepts like Deep Learning and neural
networking. Currently, machine learning is being introduced and implemented in almost every
application. Virtual Reality (VR) and Augmented Reality (AR) are undergoing monumental
modifications too. In addition, human and machine interaction, as well as dependency, is
likely to improve and increase drastically.
• Blockchain updating with Data science
Blockchain is the technology underlying cryptocurrencies such as Bitcoin. Data security
benefits here, because detailed transactions are secured and recorded. If big data
flourishes, then IoT will grow and gain popularity too. Edge computing will be responsible
for addressing data issues.
8. RESULTS
During this six-week training I successfully learnt about DATA SCIENCE and am now
able to perform data analysis using Python. I also attempted the quizzes and assignments
provided for periodic evaluation during the six weeks, and completed the training with a
100% score in the Final Test.
9. TRAINING CERTIFICATE