TO ANALYTICS
2022 - 2023
LESSON 5.
https://www.youtube.com/watch?v=X3paOmcrTjQ
DATA SCIENCE
LIFE CYCLE
Data Science Life Cycle
[Cycle diagram; visible phase labels: Business problem, Predictive modeling, Data wrangling]
Data Science Life Cycle
Business problem
• Ask questions
• Define business problem
• Define analytics goals
Data Science Life Cycle
Data preparation
• Reformat data
• Consolidate & validate
• Transform & normalize
• Cleanse data
• Store data
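The data preparation steps listed above (reformat, validate, cleanse, normalize) could look roughly like this in Python. This is only a sketch: the records, field names, accepted date layouts, and the min-max scaling rule are all invented for illustration.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are invented.
raw_rows = [
    {"customer": " Alice ", "signup": "2022-09-01", "spend": "120.50"},
    {"customer": "bob", "signup": "01/10/2022", "spend": "80"},
    {"customer": "", "signup": "2022-11-15", "spend": "n/a"},
]

def parse_date(text):
    """Reformat: accept two assumed date layouts and return an ISO date string."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue
    return None

def prepare(rows):
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()   # cleanse: trim and fix casing
        signup = parse_date(row["signup"])       # reformat: one date layout
        try:
            spend = float(row["spend"])          # validate: must be numeric
        except ValueError:
            spend = None
        if name and signup and spend is not None:
            cleaned.append({"customer": name, "signup": signup, "spend": spend})
    if not cleaned:
        return cleaned
    # normalize: min-max scale spend into the range 0..1
    spends = [r["spend"] for r in cleaned]
    low, high = min(spends), max(spends)
    for r in cleaned:
        r["spend_scaled"] = (r["spend"] - low) / (high - low) if high > low else 0.0
    return cleaned

prepared = prepare(raw_rows)
```

The third record is dropped because it has no customer name and a non-numeric spend; in a real project such rows would more likely be logged for review than silently discarded.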
Data Science Life Cycle
Predictive modeling
• Explore data
• Build & train machine learning models
• Evaluate model performance
• Deploy models
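As a rough illustration of the build, train, and evaluate steps, the sketch below fits a straight line by ordinary least squares in plain Python and checks it on held-out points. The data values are invented, and the `predict` function only stands in for deployment; the slides do not prescribe any particular model or library.

```python
# Hypothetical data: (advertising spend, sales); values are invented.
data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8), (5.0, 11.1), (6.0, 13.0)]

# Hold out the last two points to evaluate the model on unseen data.
train, test = data[:4], data[4:]

def fit_line(points):
    """Train: ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(points, slope, intercept):
    """Evaluate: average absolute gap between actual and predicted y."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

slope, intercept = fit_line(train)
test_mae = mean_absolute_error(test, slope, intercept)

def predict(x):
    """Stand-in for deployment: score a new input with the trained model."""
    return slope * x + intercept
```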
Data Science Life Cycle
Visualization & Communication
• Visualize data
• Publish & communicate to stakeholders
• Incorporate analytics into business process
Data Science Life Cycle
What phases of the data science lifecycle take up most time and effort?
Data Science Life Cycle
Module 7: Analytics project basics
What phases of the data science lifecycle
take up most time and effort?
SYSTEMS IN
DATA LIFE CYCLE
Module 2
Data Life Cycle
• Gather, integrate and transform data from systems of record (SORs) into consistent, conformed, comprehensive, clean and current information
• Used to be synonymous with ETL (Extract-Transform-Load) processes; however, may now fulfill many additional functions
• Includes solutions to support current demands for real-time integration, high volumes, variety and velocity
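A minimal ETL sketch, assuming a CSV export from a system of record and an in-memory SQLite database as the target. The column names, sample rows, and cleaning rules are invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical export from a system of record; columns and rows are invented.
SOURCE_CSV = """order_id,customer,amount,currency
1, alice ,10.50,usd
2,BOB,7.25,USD
2,BOB,7.25,USD
3,carol,,usd
"""

def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop duplicates and incomplete rows, conform the formats."""
    seen, clean = set(), []
    for row in rows:
        order_id = int(row["order_id"])
        if order_id in seen or not row["amount"].strip():
            continue
        seen.add(order_id)
        clean.append((order_id, row["customer"].strip().title(),
                      float(row["amount"]), row["currency"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: write the conformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders ("
                 "order_id INTEGER PRIMARY KEY, customer TEXT, "
                 "amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount, currency FROM orders ORDER BY order_id"
).fetchall()
```

Of the four source rows, the duplicate of order 2 and the order with no amount are filtered out before loading.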
Provide business information that has been integrated and prepared for BI applications
Data warehouse: stores key data from operational systems that needs to be kept for a significant time (including timestamped history of data changes). Data is protected and loaded in a controlled way. Powerful data processing and storage capabilities, not meant for analytical work
Analytical "sandbox": copies of selected data from the data warehouse for "playing with" and studying (avoids compromising the data warehouse). Greater flexibility, reduced risk of data loss or corruption
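The sandbox idea can be sketched with two separate SQLite databases: selected rows are copied out of the warehouse, and any experiments touch only the copy. The table layout, sample rows, and the `build_sandbox` helper are hypothetical.

```python
import sqlite3

# Hypothetical warehouse with a small sales table; rows are invented.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales ("
                  "sale_id INTEGER PRIMARY KEY, region TEXT, "
                  "amount REAL, loaded_at TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "North", 100.0, "2023-01-01"),
    (2, "South", 250.0, "2023-01-01"),
    (3, "North", 50.0, "2023-01-02"),
])
warehouse.commit()

def build_sandbox(source, region):
    """Copy only the selected rows into a separate database."""
    sandbox = sqlite3.connect(":memory:")
    sandbox.execute("CREATE TABLE sales ("
                    "sale_id INTEGER, region TEXT, amount REAL, loaded_at TEXT)")
    rows = source.execute(
        "SELECT sale_id, region, amount, loaded_at FROM sales WHERE region = ?",
        (region,),
    ).fetchall()
    sandbox.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    sandbox.commit()
    return sandbox

sandbox = build_sandbox(warehouse, "North")

# Destructive experiments run against the copy only.
sandbox.execute("UPDATE sales SET amount = amount * 1.1")
sandbox.execute("DELETE FROM sales WHERE amount < 100")
sandbox.commit()

warehouse_rows = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
sandbox_rows = sandbox.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```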
Class quiz
OPERATIONAL VS ANALYTICAL
SYSTEMS
Data Science Life Cycle
Relational: Entities; Entity Relationship Diagram (ERD)
Dimensional: Facts; Dimensions; Dimension hierarchies; Star schema; Snowflake schema
https://tdan.com/crows-feet-are-best/7474
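To make the dimensional terms concrete, here is a small star schema sketch in SQLite: one fact table surrounded by two dimension tables. The table names, columns, and rows are invented; in a snowflake schema the `category` column would typically move out into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive context, with hierarchies kept in the same table
-- (product -> category, day -> month -> year).
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Fact: numeric measures plus a key into each dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    revenue REAL);

INSERT INTO dim_product VALUES
    (1, 'Espresso', 'Coffee'), (2, 'Latte', 'Coffee'), (3, 'Green tea', 'Tea');
INSERT INTO dim_date VALUES
    (20230105, '2023-01-05', '2023-01', 2023),
    (20230210, '2023-02-10', '2023-02', 2023);
INSERT INTO fact_sales VALUES
    (1, 20230105, 2, 6.0), (2, 20230105, 1, 4.5),
    (3, 20230210, 3, 9.0), (1, 20230210, 1, 3.0);
""")

# Roll revenue up the two hierarchies: product -> category, day -> month.
revenue_by_category = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```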
Types of Entities and Attributes
Independent: can exist on their own
Dependent: child records that need a parent record to exist
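One way to see the difference: in the SQLite sketch below, `customer` is an independent entity and `customer_order` is a dependent one, so an order that points at a missing customer is rejected. The table and column names are invented, and SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite
conn.executescript("""
-- Independent entity: can exist on its own.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL);

-- Dependent entity: every row needs an existing parent customer.
CREATE TABLE customer_order (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total REAL);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 25.0)")  # parent exists

try:
    # Customer 99 does not exist, so this child record has no parent.
    conn.execute("INSERT INTO customer_order VALUES (11, 99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
```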
Conceptual data model
• Goal: Communicate structured business view of data
• Level of detail: Names key entities; business relationships between entities
• Application dependence: Independent of applications and databases (application-agnostic)
Logical data model
• Goal: Understand the details of the data
• Level of detail: Captures ERD elements: entities and attributes to be implemented; business rules and relationships between data objects; primary and foreign keys
• Application dependence: Application-agnostic
Physical data model
• Goal: Capture detailed database design
• Level of detail: Physical object definitions (DBMS-specific): tables, columns; referential integrity rules (foreign keys, constraints, triggers etc.); performance & optimization entities (indexes, procedures, partitions, views etc.)
• Application dependence: Application-specific
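At the most detailed, DBMS-specific level, a data model becomes concrete tables, columns, constraints, indexes, and views. The sketch below shows this in SQLite; the schema is hypothetical, and a different DBMS would need different syntax and options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Tables and columns with DBMS-specific types and constraints.
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE);
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    department_id INTEGER NOT NULL REFERENCES department(department_id),
    full_name TEXT NOT NULL,
    salary REAL CHECK (salary >= 0));

-- Performance and convenience objects: an index and a view.
CREATE INDEX idx_employee_department ON employee(department_id);
CREATE VIEW department_headcount AS
    SELECT d.name AS department, COUNT(e.employee_id) AS headcount
    FROM department AS d
    LEFT JOIN employee AS e ON e.department_id = d.department_id
    GROUP BY d.department_id;
""")

# List the objects we created (SQLite's own internal indexes are skipped).
objects = conn.execute(
    "SELECT type, name FROM sqlite_master "
    "WHERE name NOT LIKE 'sqlite_%' ORDER BY type, name"
).fetchall()
```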
Data Model Example
https://www.edrawsoft.com/simple-chen-erd-example.html
https://towardsdatascience.com/coding-and-implementing-a-relational-database-using-mysql-d9bc69be90f5
What is normalization?
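1st normal form (1NF): Eliminate repeating groups of data; create a separate table for each entity, with a single value in each column
2nd normal form (2NF): Eliminate redundant data stored in different entities; every non-key attribute depends on the whole primary key
3rd normal form (3NF): Eliminate data not related to the entity key; non-key attributes depend only on the key, not on other non-key attributes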
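A small worked example may help. The sketch below starts from flat order records that pack several items into one field and repeat the customer's city on every order, then splits them into three structures. The data and field names are invented, and the plain Python containers only stand in for database tables.

```python
# Hypothetical unnormalized records: "items" is a repeating group, and the
# customer's city is repeated on every order.
orders_flat = [
    {"order_id": 1, "customer_id": 7, "customer_city": "Riga",
     "items": "pen:2;notebook:1"},
    {"order_id": 2, "customer_id": 7, "customer_city": "Riga",
     "items": "pen:5"},
    {"order_id": 3, "customer_id": 9, "customer_city": "Oslo",
     "items": "stapler:1"},
]

order_items = []   # key: (order_id, product)
orders = {}        # key: order_id
customers = {}     # key: customer_id

for row in orders_flat:
    # 1NF: one row per order and product instead of a packed list.
    for item in row["items"].split(";"):
        product, quantity = item.split(":")
        order_items.append((row["order_id"], product, int(quantity)))
    # 2NF: customer_id depends on order_id alone, not on (order_id, product),
    # so it is stored once per order.
    orders[row["order_id"]] = row["customer_id"]
    # 3NF: the city depends on the customer, not on the order,
    # so it is stored once per customer.
    customers[row["customer_id"]] = row["customer_city"]
```

After the split, a customer's city is stored once, so changing it no longer requires updating every order.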