
INTRODUCTION TO ANALYTICS
2022 - 2023

LESSON 5. DATA SCIENCE LIFE CYCLE. DATA MODELING BASICS
Learning Objectives

• Examine the data science life cycle
• Recognize activities in each phase of the life cycle
• Describe systems in each data life cycle
• Distinguish between operational & analytical systems
• Understand the foundations of data modeling
• Recognize entities, attributes and relationships as they relate to real-life concepts
• Create simple entity relationship diagrams
Agenda

1. Data science life cycle
2. Systems of Record/Integration/Analytics/BI
3. Operational vs. analytic systems
4. Data modelling; types of data models
5. Relational data models and their elements
6. ERD practice
Data science life cycle - video

https://www.youtube.com/watch?v=X3paOmcrTjQ
DATA SCIENCE
LIFE CYCLE
Data Science Life Cycle

[Figure: circular diagram of seven phases. Each phase and its activities are listed below in diagram order.]

1. Business problem
• Ask questions
• Define business problem
• Define analytics goals

2. Data acquisition
• Gather & scrape data
• Mine data from sources

3. Data preparation
• Reformat data
• Consolidate & validate
• Transform & normalize
• Cleanse data
• Store data

4. Data wrangling
• Gather, filter & subset
• Restructure & de-normalize for target schemas
• Advanced cleaning & validation
• Enrich data
• Publish wrangled data for analytics tools

5. Predictive modeling
• Explore data
• Build & train machine learning models
• Evaluate model performance
• Deploy models

6. Visualization & Communication
• Visualize data
• Publish & communicate to stakeholders
• Incorporate analytics into business process

7. Monitoring & Maintenance
• Monitor model performance
• Manage the models

Course references shown on the diagram: Module 7: Analytics project basics (next to Business problem); Module 4, Lesson 6; Module 4, Lesson 5

What phases of the data science lifecycle take up most time and effort?
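To make the data preparation phase more concrete, here is a minimal sketch of the "reformat", "validate" and "cleanse" activities. The records, field names and the two accepted date layouts are assumptions made up for illustration; they do not come from the lesson.

```python
from datetime import datetime

# Hypothetical raw records, as they might arrive from two different sources.
RAW_RECORDS = [
    {"customer": " Alice ", "order_date": "2023-01-15", "amount": "120.50"},
    {"customer": "BOB", "order_date": "15/02/2023", "amount": "80"},
    {"customer": "", "order_date": "2023-03-01", "amount": "n/a"},
]

# Assumed date layouts; a real source would need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Reformat: accept either assumed layout and return an ISO date string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def prepare(records):
    """Cleanse and validate records; return (clean_rows, rejected_rows)."""
    clean, rejected = [], []
    for rec in records:
        name = rec["customer"].strip().title()   # cleanse whitespace and casing
        date = parse_date(rec["order_date"])     # reformat dates
        try:
            amount = float(rec["amount"])        # validate the numeric field
        except ValueError:
            amount = None
        if name and date and amount is not None:
            clean.append({"customer": name, "order_date": date, "amount": amount})
        else:
            rejected.append(rec)
    return clean, rejected


clean_rows, rejected_rows = prepare(RAW_RECORDS)
print(clean_rows)
print(rejected_rows)
```

Keeping the rejected rows, rather than silently dropping them, makes it possible to review what failed validation.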
SYSTEMS IN
DATA LIFE CYCLE
Module 2
Data Life Cycle

[Figure: the data life cycle, with stages labelled Sourcing, Storage & Protection, Preparation & Usage, Sharing, Archiving and Destruction. Textbook Chapter 4 Figure 4.3]

[Figure: the same data life cycle diagram. Textbook Chapter 4 Figure 4.4]

Systems of Record (SOR)

• Capture and update data in operational and transactional systems
• Data is generated from operating the enterprise
• Include business applications as well as manual and semi-manual systems of maintaining business records (documents, spreadsheets etc.)
• Data can only be changed in a very controlled and auditable way

Textbook Chapter 4 Figure 4.10


Systems of Record (SOR) examples

Think of systems that a business needs to manage its business data and support day-to-day operations
Systems of Integration (SOI)

• Gather, integrate and transform data from SORs into consistent, conformed, comprehensive, clean and current information
• Used to be synonymous with ETL (Extract-Transform-Load) processes; however, may fulfill many additional functions
• Include solutions to support current demands for real-time integration, high volumes, variety and velocity

Textbook Chapter 4 Figure 4.9
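The extract, transform and load steps of a system of integration can be sketched in a few lines. This is only an illustration under stated assumptions: the two source layouts, the country mapping and the target table are hypothetical, and in-memory lists and SQLite stand in for real systems of record and a real target store.

```python
import sqlite3

# Hypothetical extracts from two systems of record with inconsistent conventions.
CRM_CUSTOMERS = [{"id": "C1", "name": "alice smith", "country": "CA"}]
BILLING_CUSTOMERS = [{"cust_no": 2, "full_name": "Bob Jones ", "country": "Canada"}]

# Assumed conformance rule: map every country spelling to one code.
COUNTRY_CODES = {"CA": "CA", "Canada": "CA"}


def extract():
    """Extract: pull raw rows from each source."""
    return CRM_CUSTOMERS, BILLING_CUSTOMERS


def transform(crm_rows, billing_rows):
    """Transform: map both layouts onto one conformed row shape."""
    conformed = []
    for row in crm_rows:
        conformed.append(("crm", row["id"], row["name"].strip().title(),
                          COUNTRY_CODES[row["country"]]))
    for row in billing_rows:
        conformed.append(("billing", str(row["cust_no"]),
                          row["full_name"].strip().title(),
                          COUNTRY_CODES[row["country"]]))
    return conformed


def load(conn, rows):
    """Load: write conformed rows into the target table."""
    conn.execute(
        "CREATE TABLE customer (source TEXT, source_key TEXT, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(*extract()))
loaded = conn.execute("SELECT name, country FROM customer ORDER BY name").fetchall()
print(loaded)
```

After loading, both customers share the same name casing and country code, which is what "consistent" and "conformed" mean in practice.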


Systems of Analytics (SOA)

Provide business information that has been integrated and prepared for BI applications
• Data warehouse: stores key data from operational systems that needs to be kept for a significant time (including timestamped history of data changes). Data is protected and loaded in a controlled way. Powerful data processing and storage capabilities; not meant for analytical work
• Analytical "sandbox": copies of selected data from the data warehouse for "playing with" and studying (to avoid compromising the data warehouse). Greater flexibility, reduced risk of data loss or corruption

Textbook Chapter 4 Figure 4.8


Business Intelligence (BI)

Provide business reporting, analytics and insights for businesspeople
Interpret and present insights to decision makers: turn data into actionable information
Perform best when designed to satisfy business requirements for analytics

Textbook Chapter 4 Figure 4.7


SOR, SOI, SOA

Class quiz
OPERATIONAL VS ANALYTICAL
SYSTEMS
Data Science Life Cycle

[Figure: the data life cycle (Sourcing, Storage & Protection, Preparation & Usage, Sharing, Archiving, Destruction), annotated with the labels Operational and Analytical. Textbook Chapter 4 Figure 4.4]


Operational vs. Analytic Systems

Difference | Operational System (SOR) | Analytic System (SOA)
Purpose | Support business process; complete operational tasks | Provide access to information and insights that lead to improved decisions for managing business
Historic data | Current information with very little history; older data may be periodically purged to improve performance | Larger volumes of historical data; allow trend analysis and year-over-year comparisons
Timeliness | Real-time information | Data is periodically extracted from operational systems to load into analytic systems (near real-time, daily, monthly)
Level of detail | Detailed data as required for operational tasks | Selected, aggregated or summarized data as required for analysis

Source: Cindi Howson (2013). Successful Business Intelligence, Second Edition: Unlock the Value of BI & Big Data. McGraw-Hill Education
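The contrast between detailed operational data and summarized analytic data can be illustrated with a small sketch. The `sale` table, its rows and the yearly summary are hypothetical; SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational-style table: one detailed row per sale (hypothetical data).
conn.execute(
    "CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, sold_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sale (sold_on, amount) VALUES (?, ?)",
    [
        ("2022-03-01", 100.0),
        ("2022-07-15", 150.0),
        ("2023-03-02", 200.0),
        ("2023-08-20", 175.0),
    ],
)

# Analytic-style table: extracted from the detail and summarized by year.
conn.execute(
    "CREATE TABLE sales_by_year AS "
    "SELECT substr(sold_on, 1, 4) AS year, SUM(amount) AS total "
    "FROM sale GROUP BY year ORDER BY year"
)

# Year-over-year comparison on the summarized data.
totals = dict(conn.execute("SELECT year, total FROM sales_by_year").fetchall())
growth = (totals["2023"] - totals["2022"]) / totals["2022"]
print(totals, f"{growth:.0%}")
```

The operational table answers "what happened in this sale?", while the summary table answers "how did this year compare with the last?".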
Operational vs. Analytic Systems (cont'd)

Difference | Operational System (System of Record, SOR) | Analytic System (System of Analytics, SOA)
Response time | Fast inputs, slow queries | Read-only; optimized for fast queries
Table structure | Normalized* tables; large number of tables | May be partially normalized, partially de-normalized to optimize querying; fewer tables than operational systems
Dimensions* | Rarely hierarchical groupings | Hierarchical groupings (time periods, accounts, product groups, customer groups etc.)
Reporting & analysis | Fixed operational reports; often by only one dimension | Rich reporting capabilities and analysis by multiple dimensions

* See data modeling section of the lesson

Source: Cindi Howson (2013). Successful Business Intelligence, Second Edition: Unlock the Value of BI & Big Data. McGraw-Hill Education
DATA MODELLING
https://dama.org/sites/default/files/download/DAMA-DMBOK2-Framework-V2-20140317-FINAL.pdf
What is a Data Model?
Data model: An abstract model that organizes elements of data and standardizes how they
relate to one another and to the properties of real-world entities. (Wikipedia)

Textbook Chapter 8 Figure 8.1


Three Levels of Data Models

Difference | Conceptual | Logical | Physical
View | High-level business view | Design view | Implementation blueprint
Audience | Business stakeholders, project team | Architect, designer | Database administrator, developer
Goal | Communicate structured business view of data | Understand the details of the data | Capture detailed database design
Data Modeling Workflow

Textbook Chapter 8 Figure 8.2


Data Modeling Approaches

Relational: entities; modeled with an Entity Relationship Diagram (ERD)
Dimensional: facts, dimensions and dimension hierarchies; modeled with a star schema or a snowflake schema

Examples from the Textbook Chapters 8-9


Relational vs. Dimensional Modeling

Difference | Relational | Dimensional
Used for | Operational systems (SOR); Online Transaction Processing (OLTP) | BI and DW applications (SOA); Online Analytical Processing (OLAP)
Units of storage | Tables (relations) | Cubes
Normalization | Data is normalized*: optimized for OLTP | Data is de-normalized*: optimized for OLAP
Number of tables | Many tables with relationships between them | Few tables; fact tables connected to dimension tables
Elements | Entities, attributes and relationships | Facts and dimensions
Models | Entity Relationship Diagrams (ERD) | Snowflake schema; star schema
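A star schema can be sketched as one fact table connected to a few dimension tables. The table names, columns and rows below are hypothetical, and SQLite tables stand in for a real data warehouse or cube.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product TEXT,
    product_group TEXT
);
-- The fact table holds the measure plus a key into each dimension.
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);
INSERT INTO dim_date VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'),
    (2, 'Mouse', 'Hardware'),
    (3, 'Editor', 'Software');
INSERT INTO fact_sales VALUES (1, 1, 1000), (1, 2, 25), (2, 3, 300), (2, 1, 1200);
""")

# Analyze one fact (amount) by two dimensions at once: quarter and product group.
rows = conn.execute("""
    SELECT d.quarter, p.product_group, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.quarter, p.product_group
    ORDER BY d.quarter, p.product_group
""").fetchall()
print(rows)
```

Grouping by `product_group` instead of `product` uses the dimension hierarchy: individual products roll up into groups.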


ERD ELEMENTS
ER Building Blocks

Building block | Definition | Examples
Entity | A person, object, event or concept about which the business keeps data | Customer, product, order, supplier, employee, department, interaction
Relationship | Logical links between entities that represent how entities relate to each other via business rules or constraints | Product is supplied by a supplier; Order contains products; Department is composed of employees
Attribute | Distinct characteristic of an entity for which data is maintained | A customer has a name, address and cell number; A product has a price, colour and size

Constraint: a limitation or restriction on the relationships between entities
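The three building blocks can be mirrored in code to make them tangible. This is a loose analogy rather than a modeling technique from the textbook: each class stands for an entity, each field for an attribute, and a field that refers to another class for a relationship. The class and field names are chosen to match the examples in the table.

```python
from dataclasses import dataclass, field


@dataclass
class Supplier:              # entity
    supplier_id: int         # attribute that identifies the supplier
    name: str                # descriptive attribute


@dataclass
class Product:               # entity
    product_id: int
    price: float
    colour: str
    size: str
    supplier: Supplier       # relationship: product is supplied by a supplier


@dataclass
class Order:                 # entity
    order_id: int
    # relationship: order contains products
    products: list = field(default_factory=list)

    def total(self):
        """Sum the price attribute across the related products."""
        return sum(p.price for p in self.products)


acme = Supplier(1, "Acme")
shirt = Product(10, 20.0, "blue", "M", acme)
mug = Product(11, 5.0, "white", "one size", acme)
order = Order(100, [shirt, mug])
print(order.total(), shirt.supplier.name)
```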


Entity Relationship (ER) Modeling
ER Modeling: logical design modeling technique
ER building blocks: Entity, Relationship, Attribute

Textbook Chapter 8 Figure 8.4


Relationship Cardinality

Crow's foot notation, or Barker's notation

https://tdan.com/crows-feet-are-best/7474
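Cardinality describes how many instances of one entity can relate to one instance of another. As a rough illustration, the hypothetical helper below classifies a relationship from observed pairs of identifiers. It only reports what the sample data shows; the real cardinality is set by the business rules, which data alone cannot prove.

```python
from collections import defaultdict


def cardinality(pairs):
    """Classify a relationship from observed (left_id, right_id) pairs."""
    rights_per_left = defaultdict(set)
    lefts_per_right = defaultdict(set)
    for left, right in pairs:
        rights_per_left[left].add(right)
        lefts_per_right[right].add(left)
    # Does any left instance relate to more than one right instance, and vice versa?
    left_has_many = any(len(rights) > 1 for rights in rights_per_left.values())
    right_has_many = any(len(lefts) > 1 for lefts in lefts_per_right.values())
    if left_has_many and right_has_many:
        return "many-to-many"
    if left_has_many:
        return "one-to-many"
    if right_has_many:
        return "many-to-one"
    return "one-to-one"


# Department is composed of employees: one department, many employees.
dept_employee = [("Sales", "Ann"), ("Sales", "Bo"), ("IT", "Cy")]
# Order contains products: an order has many products, a product is on many orders.
order_product = [(1, "Laptop"), (1, "Mouse"), (2, "Laptop")]

print(cardinality(dept_employee))
print(cardinality(order_product))
```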
Types of Entities and Attributes

Entities
• Independent: can exist on their own
• Dependent: child records that need a parent record to exist

Attributes
• Key: uniquely identify an entity
• Non-Key: do not uniquely identify an entity
• Foreign Keys*: used to link an entity to another entity

Textbook Chapter 8 Figure 8.5
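The idea that a dependent entity needs a parent record can be demonstrated with a foreign key. The sketch below uses SQLite with hypothetical `customer` and `customer_order` tables; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and other database systems behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (               -- independent entity
    customer_id INTEGER PRIMARY KEY,  -- key attribute
    name TEXT NOT NULL                -- non-key attribute
);
CREATE TABLE customer_order (         -- dependent entity
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id)  -- foreign key
);
INSERT INTO customer VALUES (1, 'Alice');
INSERT INTO customer_order VALUES (100, 1);
""")

# An order that points at a customer that does not exist should be rejected.
try:
    conn.execute("INSERT INTO customer_order VALUES (101, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
print(orphan_rejected, order_count)
```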


Three Levels of Data Models (cont'd)

Difference | Conceptual | Logical | Physical
View | High-level business view | Design view | Implementation blueprint
Audience | Business stakeholders, project team | Architect, designer | Database administrator, developer
Goal | Communicate structured business view of data | Understand the details of the data | Capture detailed database design
Level of detail | Names key entities; business relationships between entities | Captures ERD elements: entities and attributes to be implemented; business rules and relationships between data objects; primary, foreign keys; attributes | Physical objects definitions (DBMS-specific): tables, columns; referential integrity rules (foreign keys, constraints, triggers etc.); performance & optimization entities (indexes, procedures, partitions, views etc.)
Application dependence | Independent of applications and databases (application-agnostic) | Application agnostic | Application-specific
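One way to see the step from a logical to a physical model is to generate DBMS-specific table definitions from a DBMS-independent description. The sketch below is a toy illustration under stated assumptions: the dictionary layout, the `to_ddl` helper and the choice to store every column as TEXT are all hypothetical, and real physical design involves many more decisions (data types, indexes, partitions).

```python
import sqlite3

# Hypothetical logical model: entities, attributes and keys, with no DBMS details.
LOGICAL_MODEL = {
    "department": {
        "attributes": ["department_id", "name"],
        "primary_key": "department_id",
        "foreign_keys": {},
    },
    "employee": {
        "attributes": ["employee_id", "name", "department_id"],
        "primary_key": "employee_id",
        "foreign_keys": {"department_id": "department"},
    },
}


def to_ddl(model):
    """Turn the logical model into CREATE TABLE statements (assumed type: TEXT)."""
    statements = []
    for table, spec in model.items():
        columns = []
        for attr in spec["attributes"]:
            column = f"{attr} TEXT"
            if attr == spec["primary_key"]:
                column += " PRIMARY KEY"
            if attr in spec["foreign_keys"]:
                parent = spec["foreign_keys"][attr]
                parent_key = model[parent]["primary_key"]
                column += f" REFERENCES {parent}({parent_key})"
            columns.append(column)
        statements.append(f"CREATE TABLE {table} ({', '.join(columns)});")
    return statements


ddl = to_ddl(LOGICAL_MODEL)

# Run the statements against SQLite to confirm they are valid definitions.
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)
print("\n".join(ddl))
```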
Data Model Example

Textbook Chapter 8 Figure 8.11


Other ERD notations: example
Chen’s notation: NOT USED IN THIS COURSE – DO NOT USE FOR ASSIGNMENTS

https://www.edrawsoft.com/simple-chen-erd-example.html

https://towardsdatascience.com/coding-and-implementing-a-relational-database-using-mysql-d9bc69be90f5
What is normalization?

Normalization is an approach to data modelling aimed at reducing data redundancy

Normal form | Requires
1st normal form (1NF) | Eliminate repeating groups of data; create a separate table for each entity
2nd normal form (2NF) | Eliminate redundant data stored in different entities
3rd normal form (3NF) | Eliminate data not related to the entity key

Normalization usually requires more tables to reduce data redundancy

Textbook Chapter 8 p. 189
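The effect of normalization can be shown on a small flat data set. The sketch below is an informal illustration, not the textbook's worked example: the flat order rows and column names are hypothetical. It removes a repeating group (the `product_1`, `product_2` columns) and moves customer data, which describes the customer rather than the order, into its own table.

```python
# Hypothetical flat order rows with a repeating group (product_1, product_2).
FLAT_ORDERS = [
    {"order_id": 1, "customer_id": "C1", "customer_name": "Alice",
     "product_1": "Laptop", "product_2": "Mouse"},
    {"order_id": 2, "customer_id": "C1", "customer_name": "Alice",
     "product_1": "Editor", "product_2": None},
]


def normalize(flat_rows):
    """Split flat rows into customer, order and order-line tables."""
    customers, orders, order_lines = {}, [], []
    for row in flat_rows:
        # The customer name is stored once per customer instead of once per order.
        customers[row["customer_id"]] = row["customer_name"]
        orders.append({"order_id": row["order_id"], "customer_id": row["customer_id"]})
        # The repeating product columns become one row per product.
        for column in ("product_1", "product_2"):
            if row[column] is not None:
                order_lines.append({"order_id": row["order_id"], "product": row[column]})
    return customers, orders, order_lines


customers, orders, order_lines = normalize(FLAT_ORDERS)
print(customers)
print(orders)
print(order_lines)
```

One flat table becomes three, which matches the point that normalization usually requires more tables to reduce data redundancy.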


First normal form – 1NF

Textbook Chapter 8 Figure 8.15


Second normal form – 2NF

Textbook Chapter 8 Figure 8.16


Third normal form – 3NF

Textbook Chapter 8 Figure 8.17


Later in the program

Module 6: Business Intelligence Architecture
• Databases, data warehouses and data marts
• Data architecture choices

Module 7: Analytics project basics
• Role of data modeling in an analytics project

Introduction to Database and SQL course
• Database management systems (DBMS)
• Database models
• Entity relationship diagrams (ERD)
• Data normalization; first, second and third normal form
ERD PRACTICE
