
Because learning changes everything.

Chapter 02
Data Management

Copyright 2022 © McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill
Education.
The Era of Big Data Is Here

The pace of data collection is accelerating due to web and mobile apps,
social media, and embedded sensors.
• Large datasets are a core organizational asset.
• A data infrastructure enabling real-time decision making requires a
strong data strategy aligned with overall business strategy.
Big data is a relative term describing massive amounts of data with the
following characteristics:
• Volume is the amount of data.
• Variety is structured or unstructured data.
• Veracity is the accuracy of the data or the presence of missing data.
• Velocity is the speed at which data arrives.
• Value is the usefulness of the data in making accurate decisions.
Many are moving away from the term “big data” and adopting “smart data.”

Characteristics of Big Data

Volume – companies now store and analyze petabytes of data.
• Data is collected from many sources and shows the customer journey.
Variety – structured and unstructured data provides a holistic view.
• There are strengths and challenges to each format when integrating.
Veracity – data quality issues increase complexity and reduce
confidence in the data.
• There may be missing values, inconsistencies in units of measure,
erroneous information, and a lack of reliability.
Velocity – data inundates companies at a rapid pace.
• Data arrives in milliseconds to support real-time response strategies.
Value – data must be converted into quality insights that provide
benefits.
• Achieving value requires an understanding of business goals and
objectives.

Database Management Systems (DBMS)
Relational Databases

How is big data organized to create smart data that provides value?
• A database contains current data from company operations.
A relational database is a DBMS storing data in rows and columns.
• Columns (features, predictors, variables) each store an attribute
across many records.
• Rows (records) are each identified by a unique primary key.
• A foreign key is a set of columns that refers to a primary key in
another table.
• Both keys are important to combine data from different tables.

Relational data is accessible using a database management language
called Structured Query Language (SQL).
• A query can be used to join, select, manipulate, retrieve, and analyze
data from relational databases – see the sketch below.
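As a minimal sketch (the table and column names are hypothetical), the
SQLite statements below show a primary key, a foreign key referencing
it, and a query that joins the two tables:

-- Hypothetical tables illustrating primary and foreign keys.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,  -- primary key: unique per row
    name        TEXT
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,              -- foreign key: refers to customers
    amount      REAL,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

-- A query joining the tables on the shared key to combine their data.
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id;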

Database Management Systems (DBMS)
Non-Relational Databases

Non-relational databases, or NoSQL databases, can store large volumes
of structured or unstructured data.
• They display data vertically, with related data combined rather than
separated into structured tables.
• They allow greater flexibility for storing ever-changing data and new
data types, but drilling down to specific types of data is more
difficult.
• Most companies use both relational and non-relational databases.
• The difficulty of maintaining multiple databases is compounded by
inappropriate data storage architecture.
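As a hedged illustration, a document-oriented NoSQL database might
store all of a customer's related data in one flexible record (the
field names below are hypothetical) instead of splitting it across
tables:

{
  "customer_id": 17,
  "name": "Dana",
  "orders": [
    { "order_id": 501, "amount": 12.50 },
    { "order_id": 502, "amount": 30.00 }
  ],
  "social": { "twitter": "@dana" }
}

Because each document carries its own structure, new fields can be
added at any time without redesigning a table.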

Enterprise Data Architecture

Data storage architecture allows a company to organize, understand, and
use its data to make both small and large decisions.

Data analytics can analyze any kind of data repository, including:
• Customer Relationship Management (CRM) software.
• Enterprise Resource Planning (ERP) software.
• Other Online Transaction Processing (OLTP) software.

A CRM database may store recent customer transactions and allows
marketers to monitor developments in real time.
• Internal data can be combined with other sources like social media.
• Streaming data is the continuous transfer of data from numerous
sources in different formats.

Traditional ETL

Extract, Transform, and Load (ETL) is an integration process designed
to consolidate data from a variety of sources into a single location.
• The process begins with extracting key data from the source and
converting it into the appropriate format.
• Transformation requires conforming to the appropriate data storage
format for where the data will be stored.
• The third ETL step is load, in which the data is loaded into a
storage system such as a data warehouse, data marts, or a data lake.
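A minimal sketch of the transform and load steps in SQL, assuming the
extract step has already landed raw data in a hypothetical
staging_sales table:

-- Transform the raw records and load them into the warehouse table.
INSERT INTO warehouse_sales (sale_date, region, amount_usd)
SELECT
    DATE(raw_timestamp)   AS sale_date,  -- transform: timestamp to date
    UPPER(TRIM(region))   AS region,     -- transform: standardize text
    amount_cents / 100.0  AS amount_usd  -- transform: cents to dollars
FROM staging_sales
WHERE amount_cents IS NOT NULL;          -- transform: drop bad records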

ETL Using Hadoop

Hadoop is open-source software that divides big data processing over
multiple computers.
• This allows large amounts of data to be handled simultaneously at
reduced cost.
Hadoop facilitates analysis using MapReduce programming, which
manages two steps with the data.
• The first step is to map the data by dividing it into subsets and
distributing them to a group of computers for storing and processing.
• The second step is to reduce – combining the answers from the
computer nodes into one answer (see the sketch below).
The ETL process on Hadoop can handle structured, semi-structured, and
unstructured data.
• After ETL, data can be stored in a warehouse, a mart, or a lake.
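MapReduce jobs are typically written in Java or generated by
higher-level tools, but the two-step pattern can be pictured with a
familiar SQL aggregation over the hypothetical warehouse table above:
the map step emits a key–value pair (here, region and amount) from each
subset, and the reduce step combines per-node results into one answer
per key.

-- A GROUP BY aggregation mirrors the map (emit key–value pairs) and
-- reduce (combine results per key) pattern of MapReduce.
SELECT region, SUM(amount_usd) AS total_sales
FROM warehouse_sales
GROUP BY region;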

Exhibit 2-7 Simple Architecture of a Data Repository

A Closer Look at Data Storage

A data warehouse holds historical data from various company databases
and provides a structure for high-speed querying.

A data mart provides a specific value to a group of users – can be
dependent or independent.

A data lake is a storage repository that holds a large amount of data
in its native format.

Data management as a process is the lifecycle management of data from
acquisition to disposal.

Data Quality

Numerous success stories show the use of data in decision making.
• Not all companies have the same success.
“Garbage in, garbage out” refers to deficient data quality – you can
follow the trail of bad data.
• When data are of poor quality, insights from marketing analytics will
be unreliable.
• Data are important, but high-quality data is critical.
Although data quality can be measured in numerous dimensions, the most
common are:
• Timeliness.
• Completeness.
• Accuracy.
• Consistency.
• Format.
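As a sketch, simple SQL checks against a hypothetical sales table can
surface several of these dimensions before any analysis begins:

-- Completeness: how many records are missing a value?
SELECT COUNT(*) AS missing_amounts
FROM sales
WHERE amount IS NULL;

-- Consistency: do units of measure vary across records?
SELECT unit_of_measure, COUNT(*) AS n
FROM sales
GROUP BY unit_of_measure;

-- Timeliness: how recent is the newest record?
SELECT MAX(sale_date) AS most_recent_record
FROM sales;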

Data Understanding, Preparation, and Transformation

Data inspires curiosity for marketers because they are eager to use
insights to strategize the next move.

There is no tangible benefit to data in raw form.

Data is messy and must be tidied up before being used.

Data Understanding

Understanding the data is critical to correctly addressing the business
problems and reducing the chance of inaccurately reporting results.
• Individual data fields are easily overlooked when dealing with large
datasets.
Suppose your supervisor requests an analysis of sales data.
• You notice the unit of measure for reporting sales is monthly.
• You create an annual column for each year.
• You do not realize the third year’s data only contains six months.
• Your report shows sales plummeting in the third year.
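A quick sanity check like the hypothetical query below would have
caught the problem by counting how many months of data each year
actually contains:

-- A year reporting fewer than 12 months should not be compared
-- directly against complete years.
SELECT year, COUNT(DISTINCT month) AS months_reported
FROM monthly_sales
GROUP BY year;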

Data Preparation

Feature selection.
• Pay attention to the variables or features and avoid overfitting.
Sample size.
• Sample size may be determined using a power calculation.
Unit of analysis.
• A unit of analysis is the what, when, and who of the analysis.
Missing values.
• Common in datasets – resolve with imputation, omission, or exclusion
(one option is sketched below).
Outliers.
• A method of identifying outliers is the use of cluster analysis.
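As one sketch of handling missing values, mean imputation can be done
directly in SQL (hypothetical table and column names):

-- Replace missing amounts with the mean of the observed amounts.
UPDATE sales
SET amount = (SELECT AVG(amount) FROM sales WHERE amount IS NOT NULL)
WHERE amount IS NULL;

Imputation keeps every record but flattens variation; omission or
exclusion may be preferable when missingness is rare or systematic.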

Data Transformation

Aggregation is a key process to prepare data for valuable insights.
• Aggregate weekly sales to calculate sales by month, quarter, or year.
Normalization brings all variables onto the same scale.
• Scale a variable by subtracting the mean from each value and dividing
by the standard deviation.
New column (feature) construction.
• Using the sales data, construct new columns for the day of the week,
month, quarter, and year.
Dummy coding is useful when considering nominal categorical variables.
• Geographic location is nonmetric and needs dummy coding (see the
sketch below).
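A minimal sketch of aggregation, new-column construction, and dummy
coding against a hypothetical weekly_sales table (core SQLite has no
STDDEV function, so the normalization step is easiest in a dialect
that provides one):

-- Aggregation plus new columns: roll weekly sales up to monthly
-- totals, deriving year and month features from the date.
SELECT
    strftime('%Y', week_start) AS year,
    strftime('%m', week_start) AS month,
    SUM(amount)                AS monthly_sales
FROM weekly_sales
GROUP BY year, month;

-- Dummy coding: turn a nominal variable (region) into 0/1 indicators.
SELECT
    CASE WHEN region = 'East' THEN 1 ELSE 0 END AS region_east,
    CASE WHEN region = 'West' THEN 1 ELSE 0 END AS region_west,
    amount
FROM weekly_sales;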

Case Study
Avocado Toast: A Recipe to Learn SQL

Using SQLite and data from the Hass Avocado Board, this case study
explores the data in 20 steps.
• The text guides users through importing the data into SQLite and then
answering several questions through guided SQL queries.
• For numeric values, you can use SQL aggregate functions such as
SUM, MIN, MAX, and AVG (see the sample queries below).
• For categorical data, you can use the COUNT function.
• The exercise guides the user through building their own supplier table
and adding data to the table, then merging the table with another.
• Updating data is covered with a guided exercise to change a
supplier’s country of origin.
• Finally, the exercise describes the important act of deleting unneeded
data to prevent it from being included in future analysis.
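The queries below are hypothetical stand-ins in the spirit of the case
study (the actual table and column names come from the guided steps in
the text):

-- Numeric values: aggregate functions.
SELECT MIN(average_price), MAX(average_price), AVG(average_price)
FROM avocado;

-- Categorical data: COUNT by group.
SELECT region, COUNT(*) AS n_records
FROM avocado
GROUP BY region;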

Because learning changes everything. ®

www.mheducation.com
