21CS2213RA
AI for Data Science
Session -19
Contents: Data science-an introduction
Session Objective
• An ability to understand Data Science.
• An ability to understand the real-life applications and uses of
Data Science.
Data Science
• Data science combines the scientific method, math and statistics,
specialized programming, advanced analytics, AI, and even storytelling to
uncover and explain the business insights buried in data.
• Data science is a multidisciplinary approach to extracting actionable
insights from the large and ever-increasing volumes of data collected and
created by organizations.
• Data science is all about using data to solve problems.
Data science involves:
• preparing data for analysis and processing;
• performing advanced data analysis;
• presenting the results to reveal patterns and enable stakeholders to draw
informed conclusions.
Data science enables businesses to process huge amounts of structured and
unstructured big data to detect patterns.
Data science lifecycle
• The data science lifecycle is also called the data science pipeline. It involves
the following steps.
Step 1: Define Problem Statement: Creating a well-defined problem
statement is the first and a critical step in data science.
Step 2: Data Collection: Collect, through a systematic approach, the data
that can help to solve the problem.
Step 3: Data Quality Check and Remediation: Ensure that the data
used for analysis and interpretation is of good quality.
Step 4: Exploratory Data Analysis: Before you model the steps to arrive
at a solution, it is important to analyse the data.
Step 5: Data Modelling: Modelling means formulating every step and
gathering the techniques required to achieve the solution.
Step 6: Data Communication: This is the final step, where you present the
results of your analysis to the stakeholders. You explain to them how you
came to a specific conclusion and what your critical findings are.
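The six steps above can be sketched end to end in a few lines of Python. This is only a minimal illustration under stated assumptions, not a prescribed workflow: the problem statement, the sales values, and the helper names (`collect`, `check_quality`, `explore`, `model`, `communicate`) are all hypothetical, and the "model" is just a simple threshold above the mean.

```python
from statistics import mean, stdev

# Step 1: problem statement (hypothetical example).
PROBLEM = "Which days had unusually high sales?"

def collect():
    # Step 2: data collection. Made-up values stand in for a real data source.
    return [120, 135, None, 128, 410, 131, -5, 126]

def check_quality(raw):
    # Step 3: quality check and remediation. Drop missing and negative entries.
    return [x for x in raw if x is not None and x >= 0]

def explore(data):
    # Step 4: exploratory analysis. Summarize centre, spread, and range.
    return {"n": len(data), "mean": mean(data), "stdev": stdev(data),
            "min": min(data), "max": max(data)}

def model(data, summary, k=1.5):
    # Step 5: modelling. Flag values more than k standard deviations above the mean.
    cutoff = summary["mean"] + k * summary["stdev"]
    return [x for x in data if x > cutoff]

def communicate(summary, flagged):
    # Step 6: communication. Produce a short, readable report for stakeholders.
    return (f"{PROBLEM} Analysed {summary['n']} days; "
            f"{len(flagged)} flagged as unusually high: {flagged}")

raw = collect()
clean = check_quality(raw)
summary = explore(clean)
flagged = model(clean, summary)
report = communicate(summary, flagged)
print(report)
```

Each function maps to one step, so a real project could replace any of them (for example, reading from a database in `collect`) without touching the others.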
Data science lifecycle (given by IBM)
• The data science lifecycle includes anywhere from five to sixteen steps.
• The processes common to just about everyone’s definition of the lifecycle
include the following:
Capture: This is the gathering of raw structured and unstructured data
from all relevant sources via just about any method.
Prepare and maintain: This involves putting the raw data into a
consistent format for analytics or machine learning or deep learning
models.
Preprocess or process: This involves examining biases, patterns, ranges, and
distributions of values within the data to determine the data's suitability
for use with predictive analytics, machine learning, and/or deep learning
algorithms.
Analyze: This is where data scientists perform statistical analysis, predictive
analytics, regression, machine learning and deep learning algorithms, and more to
extract insights from the prepared data.
Communicate: Finally, the insights are presented as reports, charts, and
other data visualizations that make the insights—and their impact on the
business—easier for decision-makers to understand.
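The "prepare" and "preprocess" stages above can be illustrated with a short Python sketch. It assumes a made-up list of weight readings in inconsistent formats; the function names `prepare` and `profile` and the 5 kg bin width are hypothetical choices for illustration only.

```python
from collections import Counter

# Capture: made-up raw readings arriving in inconsistent formats.
raw = ["72 kg", "70", " 68.5 ", "81kg", "n/a", "75 kg"]

def prepare(values):
    # Prepare: put raw values into one consistent numeric format (kilograms).
    cleaned = []
    for v in values:
        text = v.strip().lower().replace("kg", "").strip()
        try:
            cleaned.append(float(text))
        except ValueError:
            continue  # skip entries that cannot be parsed, e.g. "n/a"
    return cleaned

def profile(values, bin_width=5):
    # Preprocess: inspect the range and a coarse distribution of the values.
    bins = Counter(int(v // bin_width) * bin_width for v in values)
    return {"min": min(values), "max": max(values),
            "bins": dict(sorted(bins.items()))}

weights = prepare(raw)
stats = profile(weights)

# Communicate: a text histogram is the simplest possible visualization.
for start, count in stats["bins"].items():
    print(f"{start}-{start + 5} kg: {'#' * count}")
```

Looking at the range and the bins before modelling is a quick way to spot outliers or gaps that would make the data unsuitable for a given algorithm.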
Types of data
• Always look at what
types of data are involved.
Known Data
Unknown Data
Others’ decisions
Your decisions
Data Science Tools
• To build and run code in order to create models, the most popular programming
languages are open-source tools that include or support pre-built statistical, machine
learning and graphics capabilities. These languages include:
R: An open-source programming language and environment for statistical
computing and graphics.
Python: A general-purpose, object-oriented, high-level programming
language that emphasizes code readability through its distinctive, generous use of
white space.
SQL Analysis Services: Used to perform
in-database analytics using common data
mining functions and basic predictive
models.
SAS/ACCESS: Can be used to access
data from Hadoop and is used for creating
repeatable and reusable model flow
diagrams.
SAS: Statistical Analysis System
Data Science Applications
Identifying and predicting disease
Personalized healthcare recommendations
Optimizing shipping routes in real-time
Getting the most value out of soccer rosters
Finding the next slew of world-class athletes
Stamping out tax fraud
Automating digital ad placement
Algorithms that help you find love
Predicting incarceration rates
Big Data
• Big data is a collection of massive and complex data sets.
• It includes huge quantities of data, data management capabilities, social
media analytics and real-time data.
• Big data is about data volume: large data sets measured in terms of
terabytes or petabytes.
• Examining big data is known as big data analytics, that is, the process of
examining large amounts of data.
5 Vs in Big Data
• Doug Laney introduced the concept of the 3 Vs of Big Data, viz. Volume, Variety, and
Velocity.
Volume: refers to the amount of data that is being collected (the data could be
structured or unstructured).
Velocity: refers to the rate at which data is coming in.
Variety: refers to the different kinds of data (data types, formats, etc.) that are
coming in for analysis.
Over the last few years, two additional Vs of data have also
emerged, namely value and veracity.
Value refers to the usefulness of the collected data.
Veracity refers to the quality of data that is coming in from
different sources.
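As a rough illustration of the Vs above, the sketch below computes crude indicators for volume, velocity, variety, and veracity over a tiny, made-up batch of records. The record layout, the `describe` function, and the rule "an empty body counts as low quality" are assumptions made for this example. Value is left out because usefulness depends on the business question.

```python
import json

# Made-up batch of incoming records: (arrival time in seconds, payload).
batch = [
    (0.0, {"type": "text", "body": "order placed"}),
    (0.5, {"type": "image", "body": "<bytes>"}),
    (1.0, {"type": "text", "body": ""}),        # empty body: treated as low quality
    (2.0, {"type": "sensor", "body": "21.5"}),
]

def describe(batch):
    # Volume: total size of the payloads when serialized as UTF-8 JSON.
    sizes = [len(json.dumps(p).encode("utf-8")) for _, p in batch]
    # Velocity: records per second over the observed time window.
    duration = batch[-1][0] - batch[0][0]
    # Veracity: share of records that pass a (very simple) quality rule.
    valid = [p for _, p in batch if p["body"]]
    return {
        "volume_bytes": sum(sizes),
        "velocity_per_sec": len(batch) / duration if duration else float("inf"),
        "variety": sorted({p["type"] for _, p in batch}),
        "veracity": len(valid) / len(batch),
    }

summary_vs = describe(batch)
print(summary_vs)
```

Real big data systems measure these properties at a very different scale, but the definitions stay the same.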
Types of Data Science
Data Analytics
• Data analytics is the science of analyzing raw data to draw
conclusions about that information.
• The techniques and processes of data analytics have been
automated into mechanical processes and algorithms that work
over raw data for human consumption.
• Data analytics helps a business optimize its performance.
Data Science and Data Analytics (Two sides of the same coin)
• Data science is an umbrella term that encompasses data
analytics, data mining, machine learning, and several other
related disciplines.
Data Science and Data Analytics utilize data in different ways.
Data Science and Data Analytics deal with Big Data, each
taking a unique approach.
Data analytics is mainly concerned with Statistics,
Mathematics, and Statistical Analysis.
Data Science focuses on finding meaningful correlations
between large datasets.
Data Analytics is designed to uncover the specifics of extracted
insights.
Note: Data Analytics is a branch of Data Science that focuses on
more specific answers to the questions that Data Science brings
forth.
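One concrete way to look for a correlation between two data sets is Pearson's correlation coefficient. The sketch below is a minimal pure-Python version; the `hours` and `scores` values are made up for illustration, and Pearson's coefficient is only one of several possible correlation measures.

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson(hours, scores)
print(f"correlation = {r:.3f}")
```

A value close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none. A correlation alone does not show that one quantity causes the other.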
Key Points
• Data science and data analytics are both ways of understanding big data, and
both often involve analyzing massive databases using R and Python.
• SAS/ACCESS engines are tightly integrated and used by all SAS solutions for
third-party data integration. Supported integration standards include ODBC, JDBC,
Spark SQL (on SAS Viya) and OLE DB.
• Internet users generate about 2.5 quintillion bytes of data every day. By 2020, every
person on Earth was projected to generate about 1.7 MB of data every second, which
is about 146,880 MB (roughly 147 GB) every day, and by 2025 the total is projected to
be 165 zettabytes every year.
Lab/Skilling
Case Study: Diabetes Prevention
What if we could predict the occurrence of diabetes and take appropriate measures
beforehand to prevent it?
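As a starting point for this case study, the sketch below trains a very small logistic regression model in pure Python. Everything in it is an assumption made for illustration: the two features (fasting glucose and BMI), the ten synthetic training rows, the scaling constants, and the function names. It is not clinical data or a validated method; a real lab would use a proper dataset and an established library.

```python
from math import exp

# Synthetic, made-up training data: (fasting glucose in mg/dL, BMI) -> 1 if diabetic.
# These numbers are illustrative only and are NOT clinical data.
data = [
    ((85, 22.0), 0), ((90, 24.5), 0), ((95, 23.0), 0), ((100, 27.0), 0),
    ((105, 26.0), 0), ((130, 31.0), 1), ((140, 33.5), 1), ((150, 29.0), 1),
    ((160, 35.0), 1), ((170, 32.0), 1),
]

def scale(features):
    # Rescale both features to a similar magnitude (helps gradient descent).
    glucose, bmi = features
    return ((glucose - 120) / 30, (bmi - 28) / 5)

def sigmoid(z):
    return 1 / (1 + exp(-z))

def train(rows, lr=0.5, epochs=500):
    # Batch gradient descent on the logistic (cross-entropy) loss.
    w, b, n = [0.0, 0.0], 0.0, len(rows)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for features, label in rows:
            x = scale(features)
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - label
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def predict_risk(fitted, features):
    # Returns the estimated probability of diabetes for one person.
    w, b = fitted
    x = scale(features)
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

fitted = train(data)
high = predict_risk(fitted, (165, 34.0))
low = predict_risk(fitted, (88, 23.0))
print(f"high-risk example: {high:.2f}, low-risk example: {low:.2f}")
```

People whose predicted risk is high could then be offered preventive measures early, which is the idea behind the case study.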
Conclusion
• We should be careful not to link data analytics and data science directly to artificial
intelligence and machine learning.
• There are different types of data to consider when we face a complex problem with
lots of data.
• Apache Spark, Tableau, Snowflake, Google's machine learning stack (TensorFlow),
NLP training and deep learning experience are all part of the data science
toolkit.
Placement Related/Industry Oriented
• Data preparation and analysis are the most important data science skills, but data
preparation alone typically consumes 60 to 70 percent of a data scientist’s time.
• By 2020, there were projected to be around 40 zettabytes of data, that is, 40 trillion
gigabytes.
• The amount of data that exists grows exponentially.
• At any given time, about 90 percent of all existing data has been generated in the most
recent two years, according to sources like IBM and SINTEF.
• This means there is a huge amount of work in data science.
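The conversion from 40 zettabytes to 40 trillion gigabytes can be checked with a short script. It assumes decimal (SI) units, where each unit is 1,000 times the previous one; the `UNITS` table and the `convert` function are names chosen for this example.

```python
# Decimal (SI) byte units: each step up is a factor of 1,000.
UNITS = {"B": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4, "PB": 5, "EB": 6, "ZB": 7}

def convert(amount, src, dst):
    # Multiply when going to a smaller unit, divide when going to a larger one.
    power = UNITS[src] - UNITS[dst]
    if power >= 0:
        return amount * 1000 ** power
    return amount / 1000 ** (-power)

gb = convert(40, "ZB", "GB")
print(f"{gb:,} GB")  # prints "40,000,000,000,000 GB", i.e. 40 trillion GB
```

The same function covers the terabyte and petabyte scales mentioned for big data, for example `convert(1, "PB", "TB")` gives 1000.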
Next Class Topic
In the next class, I will cover the following topics:
Data pre-processing
Feature extraction technique
Thank you