Chapter 02
Data Management
Copyright 2022 © McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill
Education.
The Era of Big Data Is Here
The pace of data collection is accelerating due to web and mobile apps,
social media, and embedded sensors.
• Large datasets are a core organizational asset.
• A data infrastructure enabling real-time decision making requires a
strong data strategy aligned with overall business strategy.
Big data is a relative term describing massive amounts of data with the following characteristics:
• Volume is data per time unit.
• Variety is structured or unstructured.
• Veracity is accuracy or missing data.
• Velocity is the speed at which data arrives.
• Value is the usefulness of the data in making accurate decisions.
Many are moving away from the term “big data” and adopting “smart data.”
Characteristics of Big Data
Database Management Systems (DBMS)
Relational Databases
How is big data organized to create smart data that provides value?
• A database contains current data from company operations.
A relational database is a type of DBMS that stores data in tables of rows and columns.
• Columns (features, predictors, variables) each hold one attribute across many records.
• Rows (records) have a unique primary key.
• A foreign key is a set of columns that refers to a primary key in another
table.
• Both keys are important to combine data from different tables.
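As a minimal sketch of how the two keys work together, the snippet below uses Python's standard-library sqlite3 module with an in-memory database. The table names (supplier, shipment), the column names, and the sample rows are hypothetical and chosen only for illustration; they do not come from the text.

```python
import sqlite3

# In-memory database; all table names, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE supplier (
        supplier_id INTEGER PRIMARY KEY,  -- primary key: unique for every row
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE shipment (
        shipment_id INTEGER PRIMARY KEY,
        supplier_id INTEGER NOT NULL
            REFERENCES supplier(supplier_id),  -- foreign key to supplier
        units       INTEGER NOT NULL
    )
""")

conn.executemany(
    "INSERT INTO supplier VALUES (?, ?)",
    [(1, "Supplier A"), (2, "Supplier B")],
)
conn.executemany(
    "INSERT INTO shipment VALUES (?, ?, ?)",
    [(10, 1, 500), (11, 1, 250), (12, 2, 300)],
)

# The foreign key matches the primary key, which lets the tables be combined.
joined = conn.execute("""
    SELECT s.name, SUM(sh.units)
    FROM shipment AS sh
    JOIN supplier AS s ON s.supplier_id = sh.supplier_id
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()

# A shipment that points at a supplier that does not exist is rejected.
try:
    conn.execute("INSERT INTO shipment VALUES (13, 99, 100)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Here the join returns one total per supplier, and the attempted insert with an unknown supplier_id fails because the foreign key has no matching primary key.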
Database Management Systems (DBMS)
Non-Relational Databases
Enterprise Data Architecture
Traditional ETL
ETL Using Hadoop
Exhibit 2-7 Simple Architecture of a Data Repository
A Closer Look at Data Storage
A data lake is a storage repository that holds a large amount of data in its
native format.
Data Quality
Data Understanding, Preparation, and Transformation
Data Understanding
Data Preparation
Feature selection.
• Pay attention to the variables or features and avoid overfitting.
Sample size.
• Sample size may be determined using a power calculation.
Unit of analysis.
• A unit of analysis is the what, when, and who of the analysis.
Missing values.
• Common in datasets – resolve with imputation, omission, or exclusion.
Outliers.
• A method of identifying outliers is the use of cluster analysis.
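To make the missing-values options above concrete, here is a small sketch of omission and imputation in plain Python. The records and field names are hypothetical, and mean imputation is just one common choice assumed here for illustration; the slides do not prescribe a specific method.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"region": "West", "price": 1.20},
    {"region": "East", "price": None},
    {"region": "South", "price": 1.50},
    {"region": "North", "price": 0.90},
]

# Omission: drop every record whose price is missing.
complete = [r for r in records if r["price"] is not None]

# Imputation (assumed here: mean imputation): fill each missing price
# with the mean of the observed prices, keeping every record.
observed_mean = mean(r["price"] for r in complete)
imputed = [
    {**r, "price": observed_mean if r["price"] is None else r["price"]}
    for r in records
]
```

Omission shrinks the sample, while imputation keeps the sample size but replaces the gap with an estimate, so the choice interacts with the sample-size consideration listed above.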
Data Transformation
Case Study
Avocado Toast: A Recipe to Learn SQL
Using SQLite and data from the Hass Avocado Board, this case study
explores the data in 20 steps.
• The text guides users through importing the data into SQLite and then
answering several questions through guided SQL queries.
• For numeric values, you can use SQL aggregate functions such as
SUM, MIN, MAX, and AVG (average).
• For categorical data, you can use the count function.
• The exercise guides the user through building their own supplier table
and adding data to the table, then merging the table with another.
• Updating data is covered with a guided exercise to change a
supplier’s country of origin.
• Finally, the exercise describes the important act of deleting unneeded
data to prevent it from being included in future analysis.
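The kinds of queries described above can be sketched with Python's standard-library sqlite3 module. This is not the case study's actual 20 steps: the avocado and supplier tables, their columns, and every row below are hypothetical stand-ins, not the Hass Avocado Board data.

```python
import sqlite3

# In-memory database with hypothetical tables, columns, and rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE avocado "
    "(region TEXT, type TEXT, average_price REAL, supplier_id INTEGER)"
)
conn.executemany(
    "INSERT INTO avocado VALUES (?, ?, ?, ?)",
    [
        ("West", "organic", 1.50, 1),
        ("West", "conventional", 1.00, 2),
        ("East", "organic", 2.00, 1),
    ],
)

# Numeric column: aggregate functions.
min_p, max_p, avg_p = conn.execute(
    "SELECT MIN(average_price), MAX(average_price), AVG(average_price) "
    "FROM avocado"
).fetchone()

# Categorical column: COUNT per category.
type_counts = conn.execute(
    "SELECT type, COUNT(*) FROM avocado GROUP BY type ORDER BY type"
).fetchall()

# Build a supplier table, add rows, then merge (join) it with avocado.
conn.execute(
    "CREATE TABLE supplier "
    "(supplier_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO supplier VALUES (?, ?, ?)",
    [(1, "Supplier A", "Mexico"), (2, "Supplier B", "Peru")],
)
merged = conn.execute(
    "SELECT a.region, a.type, s.country "
    "FROM avocado AS a JOIN supplier AS s ON s.supplier_id = a.supplier_id "
    "ORDER BY a.region, a.type"
).fetchall()

# Update a supplier's country of origin.
conn.execute(
    "UPDATE supplier SET country = ? WHERE supplier_id = ?", ("Chile", 2)
)
updated_country = conn.execute(
    "SELECT country FROM supplier WHERE supplier_id = 2"
).fetchone()[0]

# Delete unneeded rows so they are excluded from later analysis.
conn.execute("DELETE FROM avocado WHERE type = ?", ("conventional",))
remaining = conn.execute("SELECT COUNT(*) FROM avocado").fetchone()[0]
```

Using `?` placeholders rather than string formatting keeps the values separate from the SQL text, which is the usual practice when running queries from Python.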
www.mheducation.com