Topic
1 Title
2 Index
3 Training Certificate
4 Declaration
5 Acknowledgement
6 About Internshala
7 About Training
8 Objectives
9 Data Science
10 My Learnings
11 Final Project
12 Learning Outcome
To explore, sort, and analyse large volumes of data from various sources, to take advantage of them, and to reach conclusions that optimize business processes and support decision making.
Examples include predictive machine maintenance and, in the field of marketing and sales, sales forecasting based on the weather.
DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer science to study and evaluate data. The key objective of Data Science is to extract valuable information for use in strategic decision making, product development, trend analysis, and forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and data
retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data science
practitioners apply machine learning algorithms to numbers, text, images, video, audio, and more to
produce artificial intelligence (AI) systems to perform tasks that ordinarily require human
intelligence. In turn, these systems generate insights which analysts and business users can translate
into tangible business value.
DATA SCIENCE PROCESS:
1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so
this step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming
the data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from different
data sources, and transform it. If you have successfully completed this step, you can
progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations based on
visual and descriptive techniques. The insights you gain from this phase will enable you to
start modeling.
5. Finally, we get to the most interesting part: model building (often referred to as “data
modeling”). It is now that you attempt to gain the insights or make
the predictions stated in your project charter. Now is the time to bring out the heavy guns,
but remember research has taught us that often (but not always) a combination of simple
models tends to outperform one complicated model. If you’ve done this phase right, you’re
almost done.
6. The last step of the data science model is presenting your results and automating
the analysis, if needed. One goal of a project is to change a process and/or make better
decisions. You may still need to convince the business that your findings will indeed change
the business process as expected. This is where you can shine in your influencer role. The
importance of this step is more apparent in projects on a strategic and tactical level.
Certain projects require you to perform the business process over and over again, so
automating the project will save time.
MY LEARNINGS
Data Science
The field of bringing insights from data using scientific techniques is called data science. The given data is collected and used across these applications.
Data science generally has a five-stage life cycle, with complexity increasing at each stage:
1. Reporting – What happened?
2. Detective Analysis – Why did it happen?
3. Dashboards – What’s happening now?
4. Predictive Analysis – What is likely to happen?
5. Big Data
Detective Analysis
Asking questions based on the data we are seeing, e.g. why did something happen?
Predictive Modelling
Big Data
The stage where the complexity of handling data goes beyond traditional systems. It can be caused by the volume, variety, or velocity of the data; specific tools are used to analyse data at such scale.
Recommendation System
Example – on Amazon, recommendations are different for different users according to their past searches.
Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
1. Recommendation System
How do Google and other search engines know which results are more relevant for our search query?
1. Apply ML and Data Science
2. Fraud Detection
3. Ad placement
Why Python?
One key reason: its extensive ecosystem of packages.
UNDERSTANDING OPERATORS:
Variables are names bound to objects. Data types in Python include int (integer), float, Boolean, and string.
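A short sketch of variables and these built-in types (the variable names are made up for illustration):

```python
# Variables are names bound to objects; the object carries the type.
count = 10            # int
price = 99.5          # float
is_valid = True       # bool
name = "Internshala"  # str

print(type(count).__name__, type(price).__name__)
print(type(is_valid).__name__, type(name).__name__)
```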
FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem. There are two types: built-in functions and user-defined functions.
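Both kinds can be sketched in a few lines (the function `mean_of` is a made-up example):

```python
# Built-in function: len() ships with Python.
print(len("data science"))  # 12

# User-defined function: written to solve a specific problem.
def mean_of(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

print(mean_of([10, 20, 30]))  # 20.0
```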
DATA STRUCTURES:
LISTS: A list is an ordered data structure with elements separated by commas and enclosed within square brackets.
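A minimal sketch of list creation, indexing, and mutation (the values are made up):

```python
marks = [72, 88, 95, 60]   # ordered, comma-separated, in square brackets
marks.append(81)           # lists are mutable
print(marks[0])            # indexing: first element -> 72
print(marks[-1])           # negative indexing: last element -> 81
print(sorted(marks))       # [60, 72, 81, 88, 95]
```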
Descriptive Statistics
Mode
The mode is robust: it is generally not much affected by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("data.csv")          # placeholder file name
mode_data = data["column_name"].mode()  # placeholder column name
print(mode_data)
Mean
import pandas as pd
data = pd.read_csv("data.csv")          # placeholder file name
mean_data = data["column_name"].mean()  # placeholder column name
print(mean_data)
Median
import pandas as pd
data = pd.read_csv("data.csv")              # placeholder file name
median_data = data["column_name"].median()  # placeholder column name
print(median_data)
Types of variables
Outliers
Any value that falls far outside the range of the rest of the data is termed an outlier, e.g. 9700 recorded instead of 97.
Reasons for Outliers
Legit outliers – values which are not actually errors but are in the data for legitimate reasons, e.g. a CEO’s salary may genuinely be high compared to other employees.
Interquartile Range (IQR): the difference between the third quartile and the first quartile. It is robust to outliers.
Histograms
Histograms depict the underlying frequency distribution of a set of discrete or continuous data measured on an interval scale.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
histogram = pd.read_csv("histogram.csv")  # placeholder file name
plt.hist(histogram["column_name"])        # placeholder column name
plt.show()
Inferential Statistics
Inferential statistics allows us to make inferences about the population from the sample data.
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tell us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.
T Tests
T-tests use the sample standard deviation to estimate the population standard deviation. A t-test is more prone to error because we only have a sample.
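As a sketch, the one-sample t statistic, (x̄ − μ0) / (s / √n), can be computed directly; the sample values and the hypothesised mean of 50 below are assumptions for illustration:

```python
import math
import statistics

# Hypothetical sample; null hypothesis H0: population mean = 50
sample = [51.2, 49.8, 52.5, 50.9, 48.7, 53.1, 50.2, 51.7]
mu0 = 50

n = len(sample)
x_bar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)      # sample standard deviation (n - 1)

# t statistic: how many standard errors the sample mean lies from mu0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t_stat, 2))
```

The resulting t value would then be compared against the t distribution with n − 1 degrees of freedom to decide whether to reject H0.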
Z Score
The distance, in terms of the number of standard deviations, that an observed value lies from the mean is its standard score or z-score.
A distribution converted to z-scores always has the same shape as the original distribution.
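A small sketch of the z-score computation on made-up observations:

```python
import statistics

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]  # made-up observations
mean = statistics.mean(data)           # 5.2
sd = statistics.pstdev(data)           # population std dev: 2.4

# z-score: distance from the mean in units of standard deviations
z_scores = [(x - mean) / sd for x in data]
print(round(z_scores[0], 2))           # (4 - 5.2) / 2.4 = -0.5
```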
Correlation
Syntax
import pandas as pd
import numpy as np
data=pd.read_csv("data.csv")
data.corr()
Predictive Modelling
Making use of past data and its attributes, we predict the future.
Types
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Development/Implementation
Problem Definition
Identify the right problem statement; ideally, formulate the problem mathematically.
Hypothesis Generation
List down all possible variables, which might influence problem objective. These variables should be
free from personal bias and preferences.
Data Extraction/Collection
Collect data from different sources and combine those for exploration and model building. While looking at the data we might come across new hypotheses.
Data extraction is a process that involves retrieval of data from various sources for further data
processing or data storage.
Variable identification
Univariate Analysis
Bivariate Analysis
Outlier treatment
Variable Transformation
Variable Treatment
Univariate Analysis
Bivariate Analysis
1. When two variables are studied together for their empirical relationship.
2. When you want to see whether the two variables are associated with each other.
Missing Value Treatment
Reasons for missing values:
1. Non-response – e.g. when you collect data on people’s income and many choose not to answer.
Types
1. MCAR (Missing Completely At Random): the missing values have no relation to the variable in which they occur, nor to the other variables in the dataset.
2. MAR (Missing At Random): the missing values have no relation to the variable in which they occur, but do have a relation to other variables in the dataset.
3. MNAR (Missing Not At Random): the missing values have a relation to the variable in which they occur.
Identifying
Syntax: -
1. describe()
2. isnull()
Treating: -
1. Imputation
2. Deletion
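A small sketch of identifying and treating missing values on a hypothetical DataFrame (the column names and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing income value (non-response)
df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "income": [40000, np.nan, 52000, 61000]})

# Identifying: describe() shows counts per column; isnull() flags gaps
print(df.isnull().sum())           # income: 1

# Treating by imputation (column mean) or by deletion
imputed = df["income"].fillna(df["income"].mean())
dropped = df.dropna()
print(imputed.tolist())            # [40000.0, 51000.0, 52000.0, 61000.0]
print(len(dropped))                # 3 rows remain after deletion
```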
Outlier Treatment
Reasons of Outliers
2. Measurement Errors
3. Processing Errors
Types of Outlier
Univariate
Bivariate
E.g. in a scatter plot of height against weight, both variables are analysed together.
Identifying Outlier
Graphical Method
Box Plot
Scatter Plot
Formula Method
A value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier, where IQR = Q3 − Q1.
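The formula method can be sketched with pandas on made-up values:

```python
import pandas as pd

values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 97])  # 97 is suspect

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1                                # IQR = Q3 - Q1

# Formula method: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())                     # [97]
```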
Treating Outlier
1. Deleting observations
Variable Transformation
Used to –
Common methods of variable transformation: logarithm, square root, cube root, binning, etc.
Model Building
It is a process to create a mathematical model for estimating/predicting the future based on past data.
E.g. a retailer wants to know the default behaviour of its credit card customers: they want to predict the probability of default for each customer in the next three months.
The model moves the probability towards one of the extremes based on attributes of past information. A customer with a healthy credit history for the last few years has a low chance of default (closer to 0).
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring
Algorithm Selection
Algorithms
Logistic Regression
Decision Tree
Random Forest
Example – choosing an algorithm:
1. Is there a dependent variable? Yes: Supervised Learning; No: Unsupervised Learning.
2. Is the dependent variable continuous? Yes: Regression; No: Classification.
The dataset is then split into a Train set and a Test set.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable by applying the rules learned by the model on the training data set. We then apply what was learned in training to the test data set for prediction/estimation.
Linear Regression
Linear regression is a statistical approach for modelling relationship between a dependent variable
with a given set of independent variables.
It is assumed that the two variables are linearly related; hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).
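A minimal sketch of fitting such a line by least squares; the toy data below are made up so that y is roughly 2x + 1:

```python
import numpy as np

# Toy data: y is roughly 2x + 1 (values assumed for illustration)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0, 11.1])

# Least-squares fit of the line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))

# Use the fitted line to predict the response for a new x
y_new = slope * 6 + intercept
print(round(y_new, 1))
```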
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a
binary dependent variable, although many more complex extensions exist.
K-Means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the
data, with the number of groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity.
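The iterative assign-then-update loop can be sketched by hand on toy one-dimensional data (the points and initial centroids are made up; K = 2):

```python
import numpy as np

# Toy unlabelled 1-D data forming two obvious groups (made up)
points = np.array([1.0, 1.2, 0.8, 8.0, 8.3, 7.9])
k = 2
centroids = np.array([0.0, 10.0])   # initial guesses

for _ in range(10):
    # Assignment step: each point joins its nearest centroid
    labels = np.array([np.argmin(np.abs(centroids - p)) for p in points])
    # Update step: each centroid moves to the mean of its points
    centroids = np.array([points[labels == j].mean() for j in range(k)])

print([round(float(c), 1) for c in sorted(centroids)])  # [1.0, 8.1]
```

In practice a library implementation (e.g. scikit-learn's KMeans) would be used; this loop only illustrates the idea.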
FINAL PROJECT
Problem Statement:
Your client is a retail banking institution. Term deposits are a major source of income for a bank.
A term deposit is a cash investment held at a financial institution. Your money is invested for an
agreed rate of interest over a fixed amount of time, or term.
The bank has various outreach plans to sell term deposits to their customers such as email
marketing, advertisements, telephonic marketing and digital marketing.
Telephonic marketing campaigns still remain one of the most effective ways to reach out to people.
However, they require huge investment as large call centers are hired to actually execute these
campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand so that
they can be specifically targeted via call.
You are provided with the client data such as: age of the client, their job type, their marital status,
etc. Along with the client data, you are also provided with the information of the call such as the
duration of the call, day and month of the call, etc. Given this information, your task is to predict if
the client will subscribe to the term deposit.
Data Dictionary:-
Prerequisites:
train.csv: This dataset will be used to train the model. This file contains all the client and call details as well as the target variable “subscribed”.
test.csv: The trained model will be used to predict whether a new set of clients will subscribe to the term deposit or not for this dataset.
TEST.csv file: -
TRAIN.csv file: -
Problem Description
Use the train.csv dataset to train the model. This file contains all the client and call details as well as the target variable “subscribed”. Then use the trained model to predict whether a new set of clients will subscribe to the term deposit.
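A minimal sketch of that workflow with scikit-learn, using tiny inline stand-ins for train.csv and test.csv (the column names and values here are assumptions; the real files hold many more client and call attributes):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Inline stand-ins for train.csv / test.csv (hypothetical values)
train = pd.DataFrame({
    "age":        [25, 40, 35, 50, 28, 60],
    "duration":   [90, 400, 120, 380, 60, 420],   # call duration
    "subscribed": ["no", "yes", "no", "yes", "no", "yes"],
})
test = pd.DataFrame({"age": [30, 55], "duration": [100, 410]})

X = train[["age", "duration"]]
y = (train["subscribed"] == "yes").astype(int)   # target variable

model = LogisticRegression().fit(X, y)           # train the model
pred = model.predict(test[["age", "duration"]])  # score new clients
print(pred.tolist())
```

With the real data, categorical columns such as job type and marital status would first need encoding before fitting.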
Reason for choosing data science
Data Science has become a revolutionary technology that everyone seems to talk about. Hailed as the ‘sexiest job of the 21st century’, Data Science is a buzzword, with very few people knowing the technology in its true sense.
While many people wish to become Data Scientists, it is essential to weigh the pros and cons of data science and give out a real picture. In this section, we discuss these points in detail and provide the necessary insights about Data Science.
Advantages: -
1. It’s in Demand
2. Abundance of Positions
Disadvantages: -
Learning Outcome
Apply data science concepts and methods to solve problems in real-world contexts, and communicate these solutions effectively.
SCOPE IN DATA SCIENCE FIELD
A few factors that point to data science’s future, demonstrating compelling reasons why it is crucial to today’s business needs, are listed below:
Data is being regularly collected by businesses and companies for transactions and through website
interactions. Many companies face a common challenge – to analyze and categorize the data that is
collected and stored. A data scientist becomes the savior in a situation of mayhem like this.
Companies can progress a lot with proper and efficient handling of data, which results in
productivity.
Countries of the European Union witnessed the passing of the General Data Protection Regulation
(GDPR) in May 2018. A similar regulation for data protection will be passed by California in 2020.
This will create co-dependency between companies and data scientists for the need of storing data
adequately and responsibly. In today’s times, people are generally more cautious and alert about
sharing data with businesses and giving up a certain amount of control to them, as there is rising
awareness about data breaches and their malefic consequences. Companies can no longer afford to
be careless and irresponsible about their data. The GDPR will ensure some amount of data privacy in
the coming future.
Career areas that do not carry any growth potential in them run the risk of stagnating. This indicates
that the respective fields need to constantly evolve and undergo a change for opportunities to arise
and flourish in the industry. Data science is a broad career path that is undergoing developments
and thus promises abundant opportunities in the future. Data science job roles are likely to get more
specific, which in turn will lead to specializations in the field. People inclined towards this stream can
exploit their opportunities and pursue what suits them best through these specifications and
specializations.
Data is generated by everyone on a daily basis with and without our notice. The interaction we have
with data daily will only keep increasing as time passes. In addition, the amount of data existing in
the world will increase at lightning speed. As data production will be on the rise, the demand for
data scientists will be crucial to help enterprises use and manage it well.
In today’s world, we can witness and are in fact witnessing how Artificial Intelligence is spreading
across the globe and companies’ reliance on it. Big data prospects with its current innovations will
flourish more with advanced concepts like Deep Learning and neural networking. Currently, machine
learning is being introduced and implemented in almost every
application. Virtual Reality (VR) and Augmented Reality (AR) are undergoing monumental
modifications too. In addition, human and machine interaction, as well as dependency, is likely to
improve and increase drastically.
The main popular technology dealing with cryptocurrencies like Bitcoin is referred to as Blockchain.
Data security will live true to its function in this aspect as the detailed transactions will be secured
and made note of. If big data flourishes, then IoT will witness growth too and gain popularity. Edge computing will be responsible for dealing with data issues and addressing them.
Conclusion
In this complete 6-week training I successfully learnt about DATA SCIENCE. I am now able to perform data analysis using Python. I also attempted the various quizzes and assignments provided for periodic evaluation during the 6 weeks and completed the training with a 100% score in the Final Test.