
DS211 - THEORY OF DATA SCIENCE

SECTION: DATA SCIENCE


SPRING 2023

Engr. Abinta Mehmood Mir

FABRIKAM
OUTLINE

• Data Analysis Process/Life Cycle

DATA SCIENCE PROCESS

DATA SCIENCE PROCESS
• First, we have the Real World. Inside the Real World are lots of people busy at various activities, and
those activities generate data.

• Specifically, we’ll start with raw data—logs, Olympics records, Enron employee emails, or recorded genetic
material.

• We want to process this to make it clean for analysis. So we build and use pipelines of data munging:
joining, scraping, wrangling, or whatever you want to call it. To do this we use tools such as Python, shell
scripts, R, or SQL, or all of the above.

• Eventually we get the data down to a nice format, like something with columns:

name | event | year | gender | event time

• Once we have this clean dataset, we should do some exploratory data analysis (EDA).
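
The munging step described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the raw lines, the semicolon delimiter, and the helper names `parse_line`, `munge`, and `to_table` are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical raw records, invented for illustration; real logs would differ.
RAW_LINES = [
    "  Alice Example ; 100m ; 2012 ; F ; 11.02 ",
    "bob example;200m;2016;m;20.45",
    "",  # blank line that should be skipped
]

COLUMNS = ["name", "event", "year", "gender", "event_time"]


def parse_line(line):
    """Turn one semicolon-delimited raw line into a clean row dict, or None."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != len(COLUMNS):
        return None  # blank or malformed line
    name, event, year, gender, event_time = parts
    return {
        "name": name.title(),
        "event": event,
        "year": int(year),
        "gender": gender.upper(),
        "event_time": float(event_time),
    }


def munge(lines):
    """Parse every line and keep only the rows that parsed successfully."""
    rows = (parse_line(line) for line in lines)
    return [row for row in rows if row is not None]


def to_table(rows):
    """Render the rows as pipe-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="|", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


clean_rows = munge(RAW_LINES)
print(to_table(clean_rows))
```

Each raw line is trimmed, typed (year as an integer, time as a float), and normalized, so the result has the same columns as the format shown above.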

DATA SCIENCE PROCESS
• In EDA, we may realize that the data isn’t actually clean because of duplicates, missing values, absurd outliers, and data that
wasn’t logged or was incorrectly logged. If that’s the case, we may have to go back to collect more data or spend more time
cleaning the dataset.

• Next, we design our model to use some algorithm like k-nearest neighbors (k-NN), linear regression, Naive Bayes, or
something else. The model we choose depends on the type of problem we’re trying to solve, which could be a
classification problem, a prediction problem, or a basic description problem.

• We can then interpret, visualize, report, or communicate our results.

• Alternatively, our goal may be to build or prototype a “data product”; e.g., a spam classifier, or a search ranking algorithm,
or a recommendation system.

• The key point that makes data science distinct from statistics is that this data product gets incorporated back into
the real world. Users interact with the product, and that interaction generates more data, which creates
a feedback loop.
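
As an illustration of the modeling step, here is a minimal sketch of k-NN used for a classification problem. The training points, the two-feature representation, and the function name `knn_predict` are invented for the example. A real spam classifier would need real features and far more data.

```python
import math
from collections import Counter

# Toy training set, invented for illustration: (feature vector, label).
TRAINING = [
    ((1.0, 1.0), "ham"),
    ((1.2, 0.8), "ham"),
    ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"),
    ((5.2, 4.9), "spam"),
    ((4.8, 5.1), "spam"),
]


def knn_predict(training, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], point))
    # Count the labels of the k closest examples and return the most common.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(TRAINING, (1.1, 0.9)))
print(knn_predict(TRAINING, (4.9, 5.0)))
```

The choice of `k` and of the distance measure are modeling decisions. The values used here are only defaults for the sketch.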

EXPLORATORY DATA ANALYSIS
• “Exploratory data analysis” is an attitude, a state of flexibility, a willingness to look
for those things that we believe are not there, as well as those we believe to be
there. — John Tukey

• The “exploratory” aspect means that your understanding of the problem you are
solving, or might solve, is changing as you go.

• The basic tools are plots, graphs and summary statistics.

• You want to understand the data—gain intuition, understand the shape of it, and
try to connect your understanding of the process that generated the data to the
data itself.
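
A first pass at EDA often starts with summary statistics and a rough look at the shape of the data. The sketch below uses only the Python standard library. The event times and the helper names `summarize` and `histogram_counts` are assumptions made up for the example.

```python
import statistics

# Hypothetical event times in seconds, invented for illustration.
event_times = [10.1, 10.3, 10.2, 10.4, 10.2, 10.6, 14.9]


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
    }


def histogram_counts(values, bins=4):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width if all values match
    counts = [0] * bins
    for value in values:
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


summary = summarize(event_times)
counts = histogram_counts(event_times)

for key, value in summary.items():
    print(f"{key}: {value}")
for index, count in enumerate(counts):
    print(f"bin {index}: {'#' * count}")
```

Here the text histogram shows most values in the first bin and one value far away in the last bin, which is the kind of shape that prompts a closer look at how the data was generated.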

DATA PRE-PROCESSING

• Data preprocessing is the process of converting raw data into clean data that is suitable for modeling.

• Models fail for various reasons. One is that the data was not correctly preprocessed before modeling.
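
A minimal sketch of three common preprocessing steps follows: dropping exact duplicates, dropping rows with missing values, and flagging outliers. The rows are invented for illustration. The outlier rule (distance from the median measured in median absolute deviations) and its cutoff of 5 are assumptions chosen for the example, since the slides do not name a specific rule.

```python
import statistics

# Hypothetical rows, invented for illustration. None marks a value not logged.
rows = [
    {"name": "A", "event_time": 10.2},
    {"name": "B", "event_time": 10.4},
    {"name": "B", "event_time": 10.4},  # exact duplicate
    {"name": "C", "event_time": None},  # missing value
    {"name": "D", "event_time": 10.1},
    {"name": "E", "event_time": 10.3},
    {"name": "F", "event_time": 99.0},  # absurd outlier
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_missing(rows, field):
    """Remove rows whose `field` is None."""
    return [row for row in rows if row[field] is not None]


def split_outliers(rows, field, cutoff=5.0):
    """Split rows into (kept, flagged) using median absolute deviation."""
    values = [row[field] for row in rows]
    center = statistics.median(values)
    mad = statistics.median(abs(value - center) for value in values)
    if mad == 0:
        return list(rows), []  # no spread, so nothing can be flagged
    kept, flagged = [], []
    for row in rows:
        is_outlier = abs(row[field] - center) > cutoff * mad
        (flagged if is_outlier else kept).append(row)
    return kept, flagged


deduplicated = drop_duplicates(rows)
complete = drop_missing(deduplicated, "event_time")
clean, flagged = split_outliers(complete, "event_time")
print(clean)
print(flagged)
```

Flagged rows are returned rather than silently deleted, so a person can decide whether each one is a logging error or a genuine extreme value.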

TYPES OF DATA
• When we talk about data or analytics, the terms structured, unstructured, and semi-structured data often get
discussed. These are the three forms of data that have now become relevant for all types of business applications.

• Structured data is information that has been formatted and transformed into a well-defined data model. CSV is a
popular format; XML is also widely used, although it is often classed as semi-structured.

• The raw data is mapped into predesigned fields that can then be extracted and read through SQL easily.

• SQL relational databases, consisting of tables with rows and columns, are the perfect example of structured data.

• Unstructured data is data in its raw form, with no predefined data model. It is difficult to process because of its
complex arrangement and formatting. Unstructured data management takes data in many forms, including social media
posts, chats, satellite imagery, IoT sensor data, emails, and presentations, and organizes it in a logical, predefined manner
in a data store.
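
To make the idea of predesigned fields read through SQL concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `results`, its columns, and the inserted rows are invented for the example.

```python
import sqlite3

# An in-memory database, so the example leaves no file behind.
connection = sqlite3.connect(":memory:")

# The schema is the predesigned set of fields that each record must fit.
connection.execute(
    "CREATE TABLE results ("
    "name TEXT, event TEXT, year INTEGER, gender TEXT, event_time REAL)"
)

# Hypothetical rows, invented for illustration.
connection.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
    [
        ("Alice Example", "100m", 2012, "F", 11.02),
        ("Bob Example", "200m", 2016, "M", 20.45),
        ("Cara Example", "100m", 2016, "F", 10.97),
    ],
)

# Because every row has the same fields, a query can filter and sort on them.
fastest = connection.execute(
    "SELECT name, event_time FROM results "
    "WHERE event = ? ORDER BY event_time LIMIT 1",
    ("100m",),
).fetchone()

print(fastest)
connection.close()
```

The query works only because the data was mapped into fixed columns first. The same question asked of raw emails or images would need extra processing before any SQL could be applied.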

TYPES OF DATA
• We may encounter semi-structured data:

A digital photograph is one example: the image itself has no predefined structure, but it carries certain structural
attributes that make it semi-structured.

For instance, an image taken with a smartphone has structured attributes such as a geotag, a device ID, and a
date-time stamp.
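
The photograph example can be sketched as a record with two parts: metadata that has named fields, and a pixel payload that does not. The field names, device IDs, coordinates, and byte values below are all invented for illustration and do not follow any real image format.

```python
from datetime import datetime

# Hypothetical photo records, invented for illustration.
photos = [
    {
        "metadata": {
            "device_id": "phone-A",
            "taken_at": "2023-03-01T10:15:00",
            "geotag": (33.68, 73.05),
        },
        "pixels": bytes([12, 200, 34, 90]),  # opaque payload, no named fields
    },
    {
        "metadata": {
            "device_id": "phone-B",
            "taken_at": "2023-04-12T18:40:00",
            "geotag": None,  # attributes may be absent in semi-structured data
        },
        "pixels": bytes([0, 0, 255, 17]),
    },
]


def taken_after(photos, cutoff):
    """Return the photos whose timestamp is later than `cutoff`."""
    return [
        photo
        for photo in photos
        if datetime.fromisoformat(photo["metadata"]["taken_at"]) > cutoff
    ]


recent = taken_after(photos, datetime(2023, 4, 1))
recent_devices = [photo["metadata"]["device_id"] for photo in recent]
print(recent_devices)
```

The metadata can be filtered like structured data, while the pixel bytes would need image processing before they could answer any question.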

DATA VISUALIZATION
• Data visualization conveys information through graphical representations of data.

Questions?
