Hair EOMA 1e Chap002 PPT

Because learning changes everything.
Chapter 02
Data Management
Copyright 2022 © McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill
Education.
The Era of Big Data Is Here
The pace of data collection is accelerating due to web and mobile apps,
social media, and embedded sensors.
• Large datasets are a core organizational asset.
• A data infrastructure enabling real-time decision making requires a
strong data strategy aligned with overall business strategy.
Big data is a relative term describing massive
amounts of data with the following characteristics:
• Volume is the amount of data per time unit.
• Variety is the mix of structured and unstructured data.
• Veracity is the accuracy and completeness of the data.
• Velocity is the speed at which data arrives.
• Value is the usefulness of the data in making
accurate decisions.
Many are moving away from the term “big data” and adopting “smart data.”
Characteristics of Big Data
Volume – companies now store and analyze petabytes of data.

• Data is collected from many sources and shows the customer journey.
Variety – structured and unstructured data provides a holistic view.
• Each format has strengths and challenges when integrating.
Veracity – unreliable data increases complexity and reduces confidence in
the data.
• There may be missing values, inconsistencies in units of measure,
erroneous information, and lack of reliability.
Velocity – data inundates companies at a rapid pace.
• Data arrives in milliseconds to support real-time response strategies.
Value – data must be converted into quality insights that provide benefits.
• Achieving value requires an understanding of business goals and
objectives.
Database Management Systems (DBMS)
Relational Databases
How is big data organized to create smart data that provides value?
• A database contains current data from company operations.
A relational database is a type of DBMS that stores data in tables of rows
and columns.
• Columns (features, predictors, variables) each hold one attribute
across many records.
• Each row (record) is identified by a unique primary key.
• A foreign key is a set of columns that refers to a primary key in another
table.
• Both keys are important to combine data from different tables.
Relational data is accessible through a database management language
called Structured Query Language (SQL).
• A query can be used to join, select, manipulate, retrieve, and analyze
data from relational databases.
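As a minimal sketch of how primary and foreign keys let a query combine tables, the example below uses Python's built-in sqlite3 module. The customers and orders tables, their columns, and all values are hypothetical and are not taken from the text.

```python
import sqlite3

# Hypothetical tables for illustration; names and values are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Join the two tables on the key columns, then total the orders per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

The join condition matches each foreign key value in orders to the primary key in customers, which is why both keys matter when combining tables.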
Database Management Systems (DBMS)
Non-Relational Databases
Non-relational databases, or NoSQL databases, can store large volumes
of structured or unstructured data.
• They show data vertically combined rather than in structured tables.
• They allow greater flexibility for storing ever-changing data and new
data types but drilling down to specific types of data is more difficult.
• Most companies use both relational and non-relational databases.
• The difficulty of maintaining multiple databases is compounded by
inappropriate data storage architecture.
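The trade-off between flexibility and drill-down can be illustrated with a rough sketch that treats JSON documents as a stand-in for a document-style NoSQL store. This is not the API of any particular NoSQL product, and the field names are hypothetical.

```python
import json

# Hypothetical documents; each one carries a different set of fields.
documents = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "Ben", "reviews": [{"stars": 5, "text": "Great"}]},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "loyalty": {"tier": "gold"}},
]

# Flexible storage: each document is serialized as-is, with no shared schema.
stored = [json.dumps(doc) for doc in documents]

def with_field(raw_docs, field):
    """Return ids of documents that contain the given top-level field."""
    # Drilling down is harder: the query must tolerate missing fields.
    return [doc["id"] for doc in map(json.loads, raw_docs) if field in doc]

emails = with_field(stored, "email")
print(emails)
```

New fields such as reviews or loyalty can be added without altering a table definition, but every query has to check whether a field is present.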
Enterprise Data Architecture
Data storage architecture allows a company to organize, understand, and
use its data to make both small and large decisions.
Data analytics can analyze any kind of data repository, including:
• Customer Relationship Management (CRM).
• Enterprise Resource Planning (ERP).
• Other Online Transaction Processing (OLTP) software.
A CRM database may store recent customer transactions and allows
marketers to monitor developments in real time.
• Internal data can be combined with other sources like social media.
• Streaming data is the continuous transfer of data from numerous
sources in different formats.
Traditional ETL
Extract, Transform, and Load (ETL) is an integration process designed to
consolidate data from a variety of sources into a single location.
• Functions begin with extracting key data from the source and
converting it into the appropriate format.
• Transformation requires conforming to the appropriate data storage
format for where data will be stored.
• The third ETL step is load, in which the data is loaded into a storage
system such as a data warehouse, a data mart, or a data lake.
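The three ETL steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the CSV text, the column names, and the sales table are hypothetical, and an in-memory SQLite database stands in for the storage system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs, or databases.
SOURCE_CSV = """order_id,order_date,amount_usd
1,2021-01-05,19.99
2,2021-01-06,5.00
3,2021-01-06,not available
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types and drop rows that do not fit the target format."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), row["order_date"], float(row["amount_usd"])))
        except ValueError:
            continue  # skip rows with values that cannot be parsed
    return clean

def load(rows, conn):
    """Load: write the conformed rows into the storage system."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for a warehouse, mart, or lake
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(loaded)
```

Dropping the unparseable row is only one possible transformation policy; a pipeline could instead flag it for review.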
ETL Using Hadoop
Hadoop is open-source software that divides big data processing across
multiple computers.
• This allows large amounts of data to be handled simultaneously at
reduced cost.
Hadoop facilitates analysis using MapReduce programming, which
manages two steps with the data.
• The first step is to map the data by dividing it into subsets and
distributing it to a group of computers for storing and processing.
• The second step is to combine the answers from the computer nodes
into one answer.
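The two MapReduce steps can be imitated on a single machine. The sketch below is a simplified illustration in plain Python, not the Hadoop API: threads stand in for computer nodes, and the (region, amount) records are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales records: (region, amount).
records = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8), ("east", 1)]

def split(data, n_parts):
    """Divide the data into roughly equal subsets, one per worker."""
    return [data[i::n_parts] for i in range(n_parts)]

def map_subset(subset):
    """Map step: each worker computes partial totals for its own subset."""
    partial = defaultdict(int)
    for region, amount in subset:
        partial[region] += amount
    return dict(partial)

def reduce_partials(partials):
    """Reduce step: combine the workers' partial answers into one answer."""
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)

with ThreadPoolExecutor(max_workers=3) as pool:  # threads stand in for cluster nodes
    partials = list(pool.map(map_subset, split(records, 3)))

totals = reduce_partials(partials)
print(totals)
```

Because each subset is processed independently, adding workers lets more data be handled at the same time, which is the property the slide attributes to Hadoop.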
The ETL process on Hadoop can handle structured, semi-structured, and
unstructured data.
• After ETL, data can be stored in a warehouse, a mart, or a lake.
Exhibit 2-7 Simple Architecture of a Data Repository
A Closer Look at Data Storage
A data warehouse holds historical data from various company databases
and provides a structure for high-speed querying.
A data mart provides specific value to a particular group of users and
can be dependent or independent.
A data lake is a storage repository that holds a large amount of data in its
native format.
Data management as a process is the lifecycle management of data
from acquisition to disposal.
Data Quality
Numerous success stories show the use of data in decision making.
• Not all companies have the same success.
“Garbage in, garbage out” refers to deficient data quality – you can
follow the trail of bad data.
• When data are of poor quality, insights from marketing analytics will
be unreliable.
• Data are important, but high-quality data are critical.
Although data quality can be measured in numerous dimensions, the most
common are:
• Timeliness.
• Completeness.
• Accuracy.
• Consistency.
• Format.
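Several of the quality dimensions named on this slide can be scored with simple checks. The sketch below is illustrative only: the records, the one-year timeliness threshold, and the list of valid country codes are all assumptions. Accuracy is left out because it needs a trusted reference to compare against.

```python
from datetime import date

# Hypothetical customer records with deliberate quality problems.
records = [
    {"id": 1, "email": "ana@example.com", "country": "US", "updated": date(2021, 6, 1)},
    {"id": 2, "email": None, "country": "USA", "updated": date(2021, 5, 20)},
    {"id": 3, "email": "ben-at-example", "country": "US", "updated": date(2018, 1, 15)},
]
AS_OF = date(2021, 6, 30)       # assumed reporting date
VALID_COUNTRIES = {"US", "CA"}  # assumed code list

def share(flags):
    """Fraction of records that pass a check."""
    flags = list(flags)
    return sum(flags) / len(flags)

report = {
    # Completeness: is the field populated at all?
    "completeness": share(r["email"] is not None for r in records),
    # Format: does a populated email look like an email address?
    "format": share(r["email"] is not None and "@" in r["email"] for r in records),
    # Consistency: does the country use the agreed code list?
    "consistency": share(r["country"] in VALID_COUNTRIES for r in records),
    # Timeliness: was the record updated within the last year?
    "timeliness": share((AS_OF - r["updated"]).days <= 365 for r in records),
}
print(report)
```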
Data Understanding, Preparation, and Transformation
Data inspires curiosity for marketers because they are eager to use
insights to strategize the next move.
There is no tangible benefit to data in raw form.
Data is messy and must be tidied up before being used.
Data Understanding
Understanding the data is critical to correctly addressing the business
problems and reducing the chance of inaccurately reporting results.
• Individual data fields are easily overlooked when dealing with large
datasets.
Suppose your supervisor requests an analysis of sales data.
• You notice the unit of measure for reporting sales is monthly.
• You create an annual column for each year.
• You do not realize the third year’s data only contains six months.
• Your report shows sales plummeting in the third year.
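The pitfall in this scenario shows up in a small worked example. The years and figures below are hypothetical; the point is that counting the months behind each total exposes the partial year.

```python
# Hypothetical monthly sales; the third year has only six months of data.
monthly_sales = {
    2019: [100] * 12,
    2020: [110] * 12,
    2021: [120] * 6,
}

# Naive annual totals make the third year look like a collapse.
naive_totals = {year: sum(months) for year, months in monthly_sales.items()}

# Checking the number of months and the monthly average tells a different story.
months_present = {year: len(months) for year, months in monthly_sales.items()}
monthly_avgs = {year: sum(months) / len(months) for year, months in monthly_sales.items()}

for year in monthly_sales:
    print(year, months_present[year], naive_totals[year], monthly_avgs[year])
```

In this made-up data the annual total falls from 1320 to 720, yet average monthly sales rise from 110 to 120.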
Data Preparation
Feature selection.
• Pay attention to the variables or features selected and avoid overfitting.
Sample size.
• Sample size may be determined using a power calculation.
Unit of analysis.
• A unit of analysis is the what, when, and who of the analysis.
Missing values.
• Common in datasets – resolve with imputation, omission, or exclusion.
Outliers.
• A method of identifying outliers is the use of cluster analysis.
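Two of the missing-value options above, omission and imputation, can be compared in a short sketch. The ages are hypothetical, and mean imputation is only one of several imputation choices.

```python
from statistics import mean

# Hypothetical observations; None marks a missing value.
ages = [34, None, 29, 41, None, 36]

observed = [a for a in ages if a is not None]

# Omission: drop the records with missing values.
omitted = observed

# Imputation: replace each missing value with the mean of the observed values.
fill = mean(observed)
imputed = [a if a is not None else fill for a in ages]
print(len(omitted), imputed)
```

Omission shrinks the sample, while mean imputation keeps every record but understates the variability of the variable.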
Data Transformation
Aggregation is a key process to prepare data for valuable insights.

• Aggregate weekly sales to calculate sales by month, quarter, or year.
Normalization brings all variables into the same scale.
• Scale a variable by subtracting the mean from it and dividing by the
standard deviation.
New column (feature) construction.
• Using the sales data, construct new columns for the day of the week,
month, quarter, and year.
Dummy coding is useful when considering nominal categorical
variables.
• Geographic location is nonmetric and needs dummy coding.
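The four transformations on this slide can be sketched together using only the Python standard library. The weekly sales records are hypothetical. Using the population standard deviation and dropping the first region as the dummy-coding baseline are assumptions, not choices stated in the text.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Hypothetical weekly sales records: (week_ending, region, sales).
sales = [
    (date(2021, 1, 3), "east", 100.0),
    (date(2021, 1, 10), "west", 120.0),
    (date(2021, 2, 7), "east", 80.0),
    (date(2021, 2, 14), "south", 140.0),
]

# Aggregation: roll weekly sales up to monthly totals.
monthly = defaultdict(float)
for day, _, amount in sales:
    monthly[(day.year, day.month)] += amount

# Normalization (z-score): subtract the mean, then divide by the standard deviation.
amounts = [amount for _, _, amount in sales]
mu, sigma = mean(amounts), pstdev(amounts)  # population standard deviation (assumption)
z_scores = [(a - mu) / sigma for a in amounts]

# New column construction: derive calendar features from the date.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
features = [
    {"weekday": DAYS[d.weekday()], "month": d.month,
     "quarter": (d.month - 1) // 3 + 1, "year": d.year}
    for d, _, _ in sales
]

# Dummy coding: one 0/1 column per region, with the first level as the baseline.
levels = sorted({region for _, region, _ in sales})
dummies = [
    {f"region_{lvl}": int(region == lvl) for lvl in levels[1:]}
    for _, region, _ in sales
]
print(dict(monthly), z_scores, features[0], dummies[1])
```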
Case Study
Avocado Toast: A Recipe to Learn SQL
Using SQLite and data from the Hass Avocado Board, this case study
explores the data in 20 steps.
• The text guides users through importing the data into SQLite and then
answering several questions through guided SQL queries.
• For numeric values, you can use SQL aggregate functions such as
sum, min, max, and average.
• For categorical data, you can use the count function.
• The exercise guides the user through building their own supplier table
and adding data to the table, then merging the table with another.
• Updating data is covered with a guided exercise to change a
supplier’s country of origin.
• Finally, the exercise describes the important act of deleting unneeded
data to prevent it from being included in future analysis.
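The kinds of queries the case study walks through can be previewed with Python's built-in sqlite3 module. This is a sketch on made-up data: the avocado and supplier tables, their columns, and every value are hypothetical, and the case study's actual column names may differ.

```python
import sqlite3

# Hypothetical miniature tables; not the Hass Avocado Board data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE avocado (region TEXT, type TEXT, avg_price REAL, supplier_id INTEGER);
INSERT INTO avocado VALUES
    ('Boston', 'organic', 2.0, 1),
    ('Boston', 'conventional', 1.0, 2),
    ('Denver', 'organic', 1.5, 1);
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
INSERT INTO supplier VALUES (1, 'Supplier A', 'Mexico'), (2, 'Supplier B', 'Peru');
""")

# Aggregate functions for numeric values.
stats = conn.execute(
    "SELECT SUM(avg_price), MIN(avg_price), MAX(avg_price), AVG(avg_price) FROM avocado"
).fetchone()

# COUNT for categorical data.
counts = conn.execute(
    "SELECT type, COUNT(*) FROM avocado GROUP BY type ORDER BY type"
).fetchall()

# Merge (join) the supplier table with the avocado table.
joined = conn.execute("""
    SELECT a.region, s.name
    FROM avocado AS a
    JOIN supplier AS s ON s.supplier_id = a.supplier_id
    ORDER BY a.region, s.name
""").fetchall()

# Update a supplier's country of origin.
conn.execute("UPDATE supplier SET country = ? WHERE supplier_id = ?", ("Chile", 2))
country = conn.execute("SELECT country FROM supplier WHERE supplier_id = 2").fetchone()[0]

# Delete unneeded rows so they are excluded from later analysis.
conn.execute("DELETE FROM avocado WHERE region = ?", ("Denver",))
remaining = conn.execute("SELECT COUNT(*) FROM avocado").fetchone()[0]
print(stats, counts, joined, country, remaining)
```

Note that SQL spells the average function AVG, and that the `?` placeholders pass values separately from the query text.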
www.mheducation.com