Foundation of Data Science Unit 1 - 1
FOUNDATIONS OF DATA SCIENCE
Syllabus
INTRODUCTION
Data Science: Benefits and uses - facets of data - Data Science Process:
Overview - Defining research goals - Retrieving data - Data preparation
- Exploratory Data analysis - build the model - presenting findings and
building applications - Data Mining - Data Warehousing - Basic Statistical
descriptions of Data
DESCRIBING DATA
Types of Data - Types of Variables - Describing Data with Tables and
Graphs - Describing Data with Averages - Describing Variability - Normal
Distributions and Standard (z) Scores
DESCRIBING RELATIONSHIPS
Correlation - Scatter plots - correlation coefficient for quantitative data
- computational formula for correlation coefficient - Regression -
regression line - least squares regression line - Standard error of estimate
- interpretation of r² - multiple regression equations - regression towards
the mean
PYTHON LIBRARIES FOR DATA WRANGLING
Basics of Numpy arrays - aggregations - computations on arrays -
comparisons, masks, boolean logic - fancy indexing - structured arrays
- Data manipulation with Pandas - data indexing and selection - operating
on data - missing data - Hierarchical indexing - combining datasets -
aggregation and grouping - pivot tables
DATA VISUALIZATION
Importing Matplotlib - Line plots - Scatter plots - visualizing errors -
density and contour plots - Histograms - legends - colors - subplots -
text and annotation - customization - three dimensional plotting -
Geographic Data with Basemap - Visualization with Seaborn.
Contents
Unit - 1 : Introduction
1.1 Data Science
1.1.1 Data Science Components
1.1.2 Benefits and uses of data science
1.2 Facets of Data
1.3 Data Science Process: Overview
1.4 Defining Research Goals
1.5 Retrieving Data
1.6 Data Preparation: Cleansing, Integrating, and Transforming Data
1.6.1 Cleansing Data
1.6.2 Combining Data from different Data Sources
1.6.3 Transforming Data
1.7 Exploratory Data Analysis
1.8 Build the Model
1.8.1 Model and Variable Selection
1.9 Presenting Findings and Building Applications on top of them
1.10 Data Mining and Data Warehousing
1.10.1 Data warehousing
1.10.2 Data Mining
Basic Statistical Descriptions of Data
Part A
Part B
Unit - 2 : Describing Data
2.1 Types of Data
2.1.1 Qualitative or Categorical data
2.1.2 Quantitative Data
2.2 Types of Variables
2.2.1 Discrete and Continuous Variables
2.2.2 Experimen...
2.2.3 Confounding variable (Confounders or Confounding Factor)
2.3 Describing Data with Tables and Graphs
2.3.1 Tables
2.3.2 Graphs
2.4 Describing Data with Averages
2.4.1 Mode
2.4.2 Median
2.4.3 Mean
2.4.4 Suitable measures of central tendency
2.4.5 Averages for Qualitative and Ranked Data
2.5 Describing Variability
Importance of Variability
Standard Deviation
Sum of Squares
Degrees of Freedom (DF)
Interquartile Range (IQR)
Measures of Variability for Qualitative and Ranked Data
2.6 Normal Distributions and Standard (Z) Scores
2.6.1 The Normal Curve
2.6.2 z Scores
2.6.3 Standard Normal Curve
2.6.4 Solving Normal Curve Problems
2.6.5 Finding Proportions
2.6.6 Finding Scores
2.6.7 z Scores for non-normal distribution
Part A
Part B & C
Unit - 3 : Describing Relationships
3.1 Correlation
3.2 Scatter Plots
3.3 Correlation Coefficient for Quantitative Data
3.4 Computational Formula for Correlation Coefficient
3.5 Regression
3.6 Regression Line
3.7 Least Squares Regression Line
3.8 Standard Error of Estimate (SEE)
3.9 Interpretation of r²
3.10 Multiple Regression Equations
3.11 Regression towards the Mean
Part A
Part B
Unit - 4 : Python Libraries for Data Wrangling
4.1 Introduction to Numpy Arrays
4.1.1 Dynamic Data Types in Python
4.1.2 Python Lists
4.1.3 Fixed-Type Arrays in Python
4.1.4 Creating NumPy Arrays
4.2 The Basics of NumPy Arrays
4.2 Aggregations
4.3 Computation on NumPy Arrays: Universal Functions
4.3.1 Characteristics of ufunc
4.3.2 Computations on arrays: Broadcasting
4.4 Comparisons, Masks, Boolean logic
4.4.1 Boolean Arrays
4.4.2 Boolean Arrays as Masks
4.5 Fancy Indexing
4.5.1 Combined Indexing
4.5.2 Modifying Values with Fancy Indexing
4.5.3 Sorting Arrays
4.6 Structured Arrays
4.7 Data Manipulation with Pandas
4.7.1 Pandas Objects
4.7.2 Create a Pandas Series from a list
4.7.3 Pandas DataFrame Object
4.7.4 Pandas Index Object
4.8 Data Indexing and Selection
4.8.1 Data Selection in Series
4.8.2 Indexing & Description
4.8.3 Data Selection in DataFrame
4.9 Operating on Data
4.9.1 Ufuncs: Index Preservation
4.9.2 UFuncs: Index Alignment
4.9.3 Fill_value
4.10 Missing Data
4.10.1 Missing Data in Pandas
4.11 Hierarchical Indexing
4.11.1 MultiIndexed Series
4.11.2 MultiIndex Creation using Data frames
4.11.3 Explicit MultiIndex constructors
4.11.4 MultiIndex level names
4.11.5 Sorted and unsorted indices
4.11.6 Rearranging Multi-Indices
4.12 Combining Datasets
4.12.1 Concatenating objects
4.12.2 The append() method
4.12.3 Combining Datasets: Merge and Join
4.13 Aggregation and Grouping
4.13.1 Grouping in Pandas
4.13.2 Transformation
4.13.3 The apply() method
4.13.4 Split key
4.14 Pivot Tables
4.14.1 Multi-level pivot tables
Unit - 5 : Data Visualization
5.1 Importing Matplotlib
5.1.1 Plotting from a script
5.1.2 Plotting from an IPython shell
5.1.3 Plotting from an IPython notebook
5.1.4 Saving Figures to File
5.1.5 Two Interfaces of Matplotlib
5.2 Line Plots
5.2.1 Setting Line Colours and Styles
5.2.2 Setting the Axes Limits of plots
5.2.3 Labeling Plots
5.3 Scatter Plots
5.3.1 Scatter Plots with [Link]
5.3.2 Scatter Plots with [Link]
5.4 Visualizing Errors
5.4.1 Matplotlib chart error bars in x values
5.4.2 Matplotlib chart error bars in y values
5.4.3 Continuous Errors
5.6 Density and Contour Plots
5.7 Histograms
5.7.1 Two-Dimensional Histograms and Binnings
5.7.2 Kernel density estimation
5.8 Legends
5.8.1 Choosing Elements for the Legend
5.8.2 Legend for Size of Points
5.9 Colors
5.9.1 Choosing the colormap
5.10 Subplots
5.11 Text and Annotation
5.11.1 Transforms and Text Position
5.11.2 Arrows and Annotation
5.12 Customization
5.12.1 Customizing Ticks
5.12.2 Customizing Matplotlib
5.12.3 Using style sheets
5.13 Three Dimensional Plotting
5.13.1 Three-Dimensional Contour Plots
5.13.2 Wireframes and Surface Plots
5.14 Geographic Data with Basemap
5.14.1 Map Projections
5.14.2 Drawing a Map Background
5.14.3 Plotting Data on Maps
5.15 Visualization with Seaborn
5.15.1 Exploring Seaborn Plots
5.15.2 Histograms, KDE, and densities
5.15.3 Pair plots
Faceted histograms
Factor plots
Joint distributions
Bar plots
Part B
Model Question Paper
Lab Program
INTRODUCTION
Data Science: Benefits and uses - facets of data - Data Science
Process: Overview - Defining research goals - Retrieving data - Data
Preparation - Exploratory Data analysis - build the model - presenting
findings and building applications - Data Mining - Data Warehousing -
Basic Statistical descriptions of Data
1.1 DATA SCIENCE
Data Science is the area of study that involves extracting insights from vast
amounts of data using various scientific methods, algorithms, and processes. It is
useful for discovering hidden patterns in voluminous raw data. The term Data Science
has emerged because of the evolution of mathematical statistics, data analysis, and big
data.
Data science is an extension of statistics dealing with the massive amounts of
data produced in the real world. It adds methods from computer science to the
statistical computations to analyze the voluminous data and extract the knowledge
from the data.
Data science focuses on processing the huge volume of heterogeneous data known
as big data. Big data refers to the data sets that are large and complex in nature and
difficult to process using traditional data-processing application software.
The characteristics of big data are referred to as the three Vs:
Volume - How much data is there?
Variety - How diverse are different types of data?
Velocity - At what speed is new data generated?
And complemented with a fourth V,
Veracity - How accurate is the data?
These four V's of big data make the data capturing, cleaning, pre-processing,
storing, searching, sharing, transferring and visualization processes a complex task.
Specialized techniques are required to extract the insights from this huge volume of
data.
Data Science is an interdisciplinary field that makes it possible to extract
knowledge from structured or unstructured data. It enables a business problem to be
translated into a research project and then translated back into a practical solution.
With the right tools, technologies, and algorithms, significant knowledge and insights
can be captured from data and converted into distinct business advantages for better
and faster decisions.
1.1.1. Data Science Components
Fig. 1.1: Convergence of mathematics, computer science and domain expertise
for data science
The above figure depicts the convergence of mathematics, computer science and
domain expertise for data science.
The main components of Data Science are given in Fig. 1.2.
1. Statistics
Statistics is one of the important components of data science. It provides
a way to collect and analyze the data and to draw meaningful insights from it.
2. Domain Expertise
Domain expertise is expert knowledge or skill in a particular area such as health
care, automobiles, or the retail industry.
3. Data engineering
Data engineering involves acquiring, storing, retrieving, and transforming the data.
Data engineering also includes adding metadata (data about data) to the data.
4. Visualization
Visualization represents data in a visual context so that the data and the insights
drawn from it are easy to understand, even when the amount of data is huge.
5. Advanced computing
Advanced computing involves designing, writing, debugging, and maintaining the
source code of computer programs to perform complex analysis.
6. Mathematics
Mathematics involves the study of quantity, structure, space, and change.
7. Machine learning
Machine learning trains a software model so that it can perform tasks as
a human expert would (Fig. 1.2).
Fig. 1.2: Main components of Data Science
1.1.2. Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and
non-commercial settings:
Commercial applications
* Commercial companies in almost every industry use data science and big
data to gain insights into their customers, processes, staff, order completion,
and products.
* Many companies use data science to offer customers a better user experience,
as well as to cross-sell, up-sell, and personalize their offerings.
Example: Google AdSense collects data from internet users to match
relevant commercial messages to the person browsing the internet.
Human resource management
* Human resource professionals use people analytics and text mining to screen
candidates, monitor the mood of employees, and study informal networks
among coworkers.
* In American baseball, analysis of correlated signals applied statistics to
hire the right players and pit them against the opponents where they would
have the biggest advantage.
Financial applications
* Financial institutions use data science to predict stock markets, determine
the risk of lending money, and learn how to attract new clients for their
services.
Government Sector
* Governmental organizations are also aware of data's value. Many
governmental organizations also share their data with the public, who can use
it to gain insights or build data-driven applications.
* A data scientist in a government organization works on diverse projects such
as detecting fraud and other criminal activity or optimizing project funding.
Example: The British Government Communications Headquarters used data science
and big data to monitor millions of individuals. A total of 5 billion data records
were collected from widespread applications such as Google Maps, Angry Birds, email,
and text messages, among many other data sources, and data science techniques were
then applied to distill information.
Non-governmental organizations
* NGOs use data science to raise money and defend their causes.
* The World Wildlife Fund (WWF), for instance, employed data scientists to
increase the effectiveness of its fundraising efforts.
* DataKind is a data scientist group that devotes its time to the benefit of
mankind.
Education
* Universities use data science in their research and to enhance the study
experience of their students.
* The rise of massive open online courses (MOOCs) produces a lot of data,
which allows universities to study how this type of learning can complement
traditional classes.
* MOOCs are an invaluable asset for anyone who wants to become a data scientist
or big data professional. A few of the online course portals are Coursera,
Udacity, and edX.
A few application domains of data science
* Fraud and risk detection in banking
* Healthcare prediction and analysis
* Internet search and recommendations
* Market analysis
* Customer analysis
* Targeted advertising
* Website recommendations
* Advanced image recognition
* Speech recognition
* Airline route planning
1.2 FACETS OF DATA
Data science is focused on processing complex datasets and building
predictive models from those data. It includes a wide array of activities,
from the upstream processes of acquiring, cleaning and integrating data to
the downstream processes of analysis, modeling and prediction.
There are many facets of data science, including:
* Identifying the structure of data
* Cleaning, filtering, reorganizing, augmenting, and aggregating data
* Visualizing data
* Data analysis, statistics, and modeling
* Machine learning
* Assembling data processing pipelines to link these steps
* Leveraging high-end computational resources for large-scale problems
Different tools are used in different parts of this process.
Therefore, interoperability among tools, based on common data structures and
interfaces, is an important element in enabling the construction of complex,
multifaceted data analysis pipelines. In data science and big data, many different
types of data are handled, and each of them requires different tools and techniques.
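The idea of linking steps through a common data structure can be sketched in a few lines of Python. This is only an illustrative sketch: the records, field names, and the helper functions clean, aggregate, and run_pipeline are assumptions made for the example, not part of any particular tool. Every step accepts and returns plain Python records, which is what lets the steps be chained.

```python
from collections import defaultdict

# Hypothetical raw records; the field names are assumptions for illustration.
raw_records = [
    {"region": "Central", "item": "Pencil", "units": "7"},
    {"region": "Central", "item": "Binder", "units": "28"},
    {"region": "central ", "item": "Pencil", "units": "14"},
    {"region": "Central", "item": "Pen", "units": None},  # missing value
]


def clean(records):
    """Normalize text fields and drop records whose units are missing."""
    for rec in records:
        if rec["units"] is None:
            continue
        yield {
            "region": rec["region"].strip().title(),
            "item": rec["item"].strip().title(),
            "units": int(rec["units"]),
        }


def aggregate(records):
    """Sum the units sold per item."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["item"]] += rec["units"]
    return dict(totals)


def run_pipeline(records, steps):
    """Pass the data through each step in order."""
    data = records
    for step in steps:
        data = step(data)
    return data


totals = run_pipeline(raw_records, [clean, aggregate])
print(totals)  # {'Pencil': 21, 'Binder': 28}
```

Because each step only relies on the shared record format, a step can be swapped or a new one (for example a filter) inserted without changing the others.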
Main categories of data
1. Structured
2. Unstructured
3. Natural language
4. Machine-generated
5. Graph-based
6. Audio, video, and images
7. Streaming
Structured Data
* Structured data is data that depends on a data model and resides in a
fixed field within a record.
* It is easy to store structured data in tables within databases or Excel files
(Fig. 1.3).
* SQL is the preferred way to manage and query data that resides in databases.
* Hierarchical data such as a family tree is also called structured data, as
it is stored in a particular structure.
Example: Examples of structured data include dates, names, addresses, credit
card numbers, etc.
Order Date | Region  | Rep     | Item    | Units | Unit Cost | Total
9-1-14     | Central | Smith   |         | 2     | 125.00    | 250.00
6-17-15    | Central | Kivell  |         | 5     | 125.00    | 625.00
9-10-15    | Central | Gill    | Pencil  | 7     | 1.29      | 9.03
11-17-15   | Central | Jardine | Binder  | 11    | 4.99      | 54.89
10-31-15   | Central | Andrews | Pencil  | 14    | 1.29      | 18.06
2-26-14    | Central | Gill    | Pen     | 27    | 19.99     | 539.73
10-5-14    | Central | Morgan  | Binder  | 28    | 8.99      | 251.72
12-21-15   | Central | Andrews | Binder  | 28    | 4.99      | 139.72
2-9-14     | Central | Jardine | Pencil  | 36    | 4.99      | 179.64
S715       | Central | Kivell  | Pen Set | 42    | 23.95     | 1,005.90
1-15-15    | Central | Gill    | Binder  | 46    | 8.99      | 413.54
1-23-14    | Central | Kivell  | Binder  | 50    | 19.99     | 999.50
Fig. 1.3: An Excel table is an example of structured data.
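As a minimal sketch of why structured data is easy to query with SQL, the snippet below loads four rows taken from Fig. 1.3 into an in-memory SQLite database using Python's built-in sqlite3 module. The table name orders and the column names are assumptions chosen for this example.

```python
import sqlite3

# A few rows copied from Fig. 1.3; the table and column names are assumptions.
rows = [
    ("9-10-15", "Central", "Gill", "Pencil", 7, 1.29),
    ("2-26-14", "Central", "Gill", "Pen", 27, 19.99),
    ("10-5-14", "Central", "Morgan", "Binder", 28, 8.99),
    ("1-15-15", "Central", "Gill", "Binder", 46, 8.99),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_date TEXT, region TEXT, rep TEXT, item TEXT, "
    "units INTEGER, unit_cost REAL)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", rows)

# Because every value sits in a fixed field, one query answers
# "how many units did each representative sell?"
query = "SELECT rep, SUM(units) FROM orders GROUP BY rep ORDER BY rep"
units_by_rep = conn.execute(query).fetchall()
print(units_by_rep)  # [('Gill', 80), ('Morgan', 28)]
conn.close()
```

The query works only because each record follows the same data model; the same question asked of free text would need far more effort.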
Unstructured Data
* Unstructured data is data that isn't easy to fit into a data model because
the content is context-specific or varying.
* Unstructured data, typically categorized as qualitative data, cannot be
processed and analyzed via conventional data tools and methods.
* It is best managed in non-relational (NoSQL) databases or data lakes to
preserve it in raw form.
* The importance of unstructured data is rapidly increasing. Recent
projections indicate that unstructured data is over 80% of all enterprise data,
while 95% of businesses prioritize unstructured data management.
Example of unstructured data: email (Fig 1.4)
* Although email contains structured elements such as the sender, title, and
body text, it's a challenge to find the number of people who have written
an email complaint about a specific employee because so many ways exist
to refer to a person, for example. The thousands of different languages and
dialects out there further complicate this.
New team of UI engineers

An investment banking client of mine has had the go ahead to build a new team of UI engineers to work on
various areas of a cutting-edge single-dealer trading platform.
They will be recruiting at all levels and paying between 40k & 85k (+ all the usual benefits of the banking
world). I understand you may not be looking. I also understand you may be a contractor. Of the last 3 hires
they brought into the team, two were contractors of 10 years who I honestly thought would never turn to
what they considered "the dark side."
This is a genuine opportunity to work in an environment that's built up for best in industry and allows you to
gain commercial experience with all the latest tools, tech, and processes.
There is more information below. I appreciate the spec is rather loose - They are not looking for specialists
in Angular / Node / Backbone or any of the other buzz words in particular, rather an "engineer" who can
wear many hats and is in touch with current tech & tinkers in their own time.
For more information and a confidential chat, please drop me a reply email. Appreciate you may not have
an updated CV, but if you do that would be handy to have a look through if you don't mind sending.
Fig 1.4: Email is an example of unstructured data and natural language data.
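The difference between the structured and unstructured parts of an email can be sketched with Python's standard email and re modules. The message below, including its addresses, the employee name, and the complaint text, is invented purely for illustration. The header fields can be read by name, while finding the employee in the body depends on guessing how the writer referred to the person.

```python
import re
from email import message_from_string

# Hypothetical message; addresses, names and text are invented for illustration.
raw_email = """\
From: customer@example.com
To: support@example.com
Subject: Complaint about my last visit

I was not happy with how Mr. J. Smith handled my request.
Jon was rude, and Smith, Jonathan never called me back.
"""

msg = message_from_string(raw_email)

# Structured part: header fields are addressed by name.
sender = msg["From"]
subject = msg["Subject"]

# Unstructured part: the body is free text.
body = msg.get_payload()

# A naive exact-name search finds nothing, although the person is mentioned.
exact_hits = re.findall(r"Jonathan Smith", body)

# A hand-written pattern covers only the variants someone thought of;
# the nickname "Jon" is still missed.
variant_pattern = r"J\. Smith|Smith, Jonathan|Jonathan Smith"
variant_hits = re.findall(variant_pattern, body)

print(sender, "|", subject)
print(len(exact_hits), len(variant_hits))  # 0 2
```

Even in this tiny example one of the three mentions is missed, which hints at why counting complaints about a specific employee is hard in practice.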
Natural language
Natural language is a special type of unstructured data; it's challenging to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models
trained in one domain don't generalize well to other domains. Even state-of-the-art
techniques aren't able to decipher the meaning of every piece of text.
Humans struggle with natural language as well. It's ambiguous by nature.
The meaning of the same words can vary based on the context.
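The effect of context on meaning can be illustrated with a deliberately simple keyword-based sentiment scorer. This is a toy sketch, not a real NLP technique: the word lists and the function keyword_sentiment are assumptions made for the example. The scorer treats the word "sick" as negative, which suits a medical note but fails for casual slang.

```python
# Hypothetical word lists; a real system would learn these from labeled data.
POSITIVE = {"good", "great", "fast"}
NEGATIVE = {"bad", "slow", "sick"}


def keyword_sentiment(text):
    """Score text by counting positive and negative keywords."""
    words = text.lower().replace(".", "").replace(",", "").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# "sick" signals illness in a medical note but praise in casual slang.
medical = keyword_sentiment("The patient is sick")
slang = keyword_sentiment("That guitar solo was sick")
print(medical)  # negative
print(slang)    # negative, although the speaker means praise
```

The second result is wrong for the intended meaning, which is the kind of failure that appears when rules or models built for one domain are applied to another.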
Many areas such as healthcare, finance, media, and human resources use
NLP to make use of the data available in the form of text and speech. Many text and
speech recognition applications are built using NLP, for example personal voice
assistants such as Siri, Cortana, and Alexa.