Chapter 9: Big Data, Cloud Computing, and Location Analytics: Concepts and Tools
1. Learn what Big Data is and how it is changing the world of analytics
2. Understand the motivation for and business drivers of Big Data analytics
3. Become familiar with the wide range of enabling technologies for Big Data
analytics
4. Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data
analytics
5. Compare and contrast the complementary uses of data warehousing and Big Data
technologies
6. Become familiar with in-memory analytics and Spark applications
7. Become familiar with select Big Data platforms and services
8. Understand the need for and appreciate the capabilities of stream analytics
9. Learn about the applications of stream analytics
10. Describe the current and future use of cloud computing in business analytics
11. Describe how geospatial and location-based analytics are assisting organizations
CHAPTER OVERVIEW
Big Data, which means many things to many people, is not a new technological
fad. It has become a business priority that has the potential to profoundly change the
competitive landscape in today’s globally integrated economy. In addition to providing
innovative solutions to enduring business challenges, Big Data and analytics instigate
new ways to transform processes, organizations, entire industries, and even society
altogether. Yet extensive media coverage makes it hard to distinguish hype from reality.
This chapter aims to provide comprehensive coverage of Big Data, its enabling
technologies, and related analytics concepts to help readers understand the capabilities and
limitations of this emerging technology. The chapter starts with a definition and related
concepts of Big Data followed by the technical details of the enabling technologies,
including Hadoop, MapReduce, and NoSQL. We provide a comparative analysis between
data warehousing and Big Data analytics. The last part of the chapter is dedicated to
stream analytics, which is one of the most promising value propositions of Big Data analytics.
This chapter contains the following sections:
Copyright © 2019 Pearson Education, Inc.
9.1 Opening Vignette: Analyzing Customer Churn in a Telecom Company Using Big
Data Methods
1. What problem did customer service cancellation pose to AT’s business survival?
The company was churning through customers too quickly, and there were concerns about
the financial impacts.
2. Identify and explain the technical hurdles presented by the nature and characteristics of
AT’s data.
The company allowed cancellations at many different contact points and wanted to
record information about why customers were canceling at each of these points.
Unfortunately this resulted in a multitude of different types of data that would be
collected. Additionally due to the large number of cancellations a large amount of data
was being generated.
While not discussed in the case, sessionizing is ordering customer interactions into a flow
so that the flow of actions can be visualized and analyzed. This is what is seen in Figure
9.2. It was important for the company to sessionize the data so that it could visualize
the path the customers took to cancellation.
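The sessionizing idea in this answer can be illustrated with a minimal Python sketch. The event log, the channel names, and the 30-minute inactivity cutoff are all assumptions made for illustration; none of them come from the case, and this is not the method the company used.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical interaction log: (customer_id, timestamp, channel:action).
events = [
    ("c1", "2019-01-05 10:02", "web:view_bill"),
    ("c2", "2019-01-05 09:30", "store:visit"),
    ("c1", "2019-01-05 10:10", "call_center:complaint"),
    ("c1", "2019-01-07 15:00", "web:cancel_service"),
    ("c2", "2019-01-05 09:45", "call_center:cancel_service"),
]

SESSION_GAP = timedelta(minutes=30)  # assumed cutoff, not taken from the case


def sessionize(rows, gap=SESSION_GAP):
    """Return {customer_id: [[action, ...], ...]} with actions in time order."""
    by_customer = defaultdict(list)
    for customer, stamp, action in rows:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        by_customer[customer].append((when, action))

    sessions = {}
    for customer, items in by_customer.items():
        items.sort(key=lambda pair: pair[0])  # order interactions by time
        paths, last_time = [], None
        for when, action in items:
            # Start a new session on the first event or after a long pause.
            if last_time is None or when - last_time > gap:
                paths.append([])
            paths[-1].append(action)
            last_time = when
        sessions[customer] = paths
    return sessions


customer_paths = sessionize(events)
```

Each inner list is one ordered path of actions, which is the form a path visualization would need as input.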
4. Research other studies where customer churn models have been employed. What types
of variables were used in those studies? How is this vignette different?
5. Besides Teradata Vantage, identify other popular Big Data analytics platforms that
could handle the analysis described in the preceding case. (Hint: see Section 9.8.)
1. Why is Big Data important? What has changed to put it in the center of the
analytics world?
As more and more data becomes available in various forms and fashions, timely
processing of the data with traditional means becomes impractical. The
exponential growth, availability, and use of information, both structured and
unstructured, brings Big Data to the center of the analytics world. Pushing the
boundaries of data analytics uncovers new insights and opportunities for the use
of Big Data.
Big Data means different things to people with different backgrounds and
interests, which is one reason it is hard to define. Traditionally, the term “Big
Data” has been used to describe the massive volumes of data analyzed by huge
organizations such as Google or research science projects at NASA. Big Data
includes both structured and unstructured data, and it comes from everywhere:
data sources include Web logs, RFID, GPS systems, sensor networks, social
networks, Internet-based text documents, Internet search indexes, detailed call
records, to name just a few. Big data is not just about volume, but also variety,
velocity, veracity, and value proposition.
3. Out of the Vs that are used to define Big Data, in your opinion, which one is the
most important? Why?
methods or ad hoc query and reporting tools, Big Data means “big” analytics. Big
analytics means greater insight and better decisions, something that every
organization needs nowadays. (Different students may have different answers.)
4. What do you think the future of Big Data will be like? Will it lose its popularity to
something else? If so, what will it be?
Big Data could evolve at a rapid pace. The buzzword “Big Data” might change to
something else, but the trend toward increased computing capabilities, analytics
methodologies, and data management of high volume heterogeneous information
will continue. (Different students may have different answers.)
1. What is Big Data analytics? How does it differ from regular analytics?
Big Data analytics is analytics applied to Big Data architectures. This is a new
paradigm; in order to keep up with the computational needs of Big Data, a
number of new and innovative analytics computational techniques and platforms
have been developed. These techniques are collectively called high-performance
computing, and include in-memory analytics, in-database analytics, grid
computing, and appliances. They differ from regular analytics, which tends to focus
on relational database technologies.
2. What are the critical success factors for Big Data analytics?
Critical factors include a clear business need, strong and committed sponsorship,
alignment between the business and IT strategies, a fact-based decision culture, a
strong data infrastructure, the right analytics tools, and personnel with advanced
analytic skills.
3. What are the big challenges that one should be mindful of when considering
implementation of Big Data analytics?
Traditional ways of capturing, storing, and analyzing data are not sufficient for
Big Data. Major challenges are the vast amount of data volume, the need for data
integration to combine data of different structures in a cost-effective manner, the
need to process data quickly, data governance issues, skill availability, and
solution costs.
4. What are the common business problems addressed by Big Data analytics?
Here is a list of problems that can be addressed using Big Data analytics:
• Enhanced customer experience
• Churn identification, customer recruiting
• Improved customer service
• Identifying new products and market opportunities
• Risk management
• Regulatory compliance
• Enhanced security capabilities
4. What are the main Hadoop components? What functions do they perform?
Major components of Hadoop are the HDFS, a Job Tracker operating on the
master node, Name Nodes, Secondary Nodes, and Slave Nodes. The HDFS is the
default storage layer in any given Hadoop cluster. A Name Node is a node in a
Hadoop cluster that provides the client information on where in the cluster
particular data is stored and if any nodes fail. Secondary nodes are backup name
nodes. The Job Tracker is the node of a Hadoop cluster that initiates and
coordinates MapReduce jobs or the processing of the data. Slave nodes store data
and take direction to process it from the Job Tracker.
Querying for data in the distributed system is accomplished via MapReduce. The
client query is handled in a Map job, which is submitted to the Job Tracker. The
Job Tracker refers to the Name Node to determine which data it needs to access to
complete the job and where in the cluster that data is located, then submits the
query to the relevant nodes which operate in parallel. A Name Node acts as
facilitator, communicating back to the client information such as which nodes are
available, where in the cluster certain data resides, and which nodes have failed.
When each node completes its task, it stores its result. The client submits a
Reduce job to the Job Tracker, which then collects and aggregates the results from
each of the nodes.
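The Map and Reduce steps described above can be sketched in plain Python. This is a single-process illustration only: it does not use the Hadoop API, the "blocks" are hypothetical stand-ins for data held on separate slave nodes, and word counting is chosen simply because it is a familiar example.

```python
from collections import defaultdict

# Hypothetical data blocks, standing in for file blocks stored on separate nodes.
blocks = [
    "big data big analytics",
    "data warehouse and big data",
]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group intermediate pairs by key before they are reduced."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(data_blocks):
    # On a cluster, each block would be mapped in parallel on the node that holds it.
    mapped = [map_phase(block) for block in data_blocks]
    return reduce_phase(shuffle(mapped))


counts = word_count(blocks)
```

The point of the split is that `map_phase` needs only its own block, so the work can be spread across nodes, while `reduce_phase` combines the partial results.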
5. What is NoSQL? How does it fit into the Big Data analytics picture?
NoSQL, also known as “Not Only SQL,” is a new style of database for processing
large volumes of multi-structured data. Whereas Hadoop is adept at supporting
large-scale, batch-style historical analysis, NoSQL databases are mostly aimed at
serving up discrete data stored among large volumes of multi-structured data to
end-user and automated Big Data applications. NoSQL databases trade ACID
(atomicity, consistency, isolation, durability) compliance for performance and
scalability.
1. What are the challenges facing data warehousing and Big Data? Are we
witnessing the end of the data warehousing era? Why or why not?
What has changed the landscape in recent years is the variety and complexity of
data, which made data warehouses incapable of keeping up. It is not the volume
of the structured data but the variety and the velocity that forced the world of IT
to develop a new paradigm, which we now call “Big Data.” But this does not
mean the end of data warehousing. Data warehousing and RDBMS still bring
many strengths that make them relevant for BI and that Big Data techniques do
not currently provide.
2. What are the use cases for Big Data and Hadoop?
In terms of its use cases, Hadoop is differentiated in two ways: first, as the
repository and refinery of raw data, and second, as an active archive of historical
data. Hadoop, with its distributed file system and flexibility of data formats
(allowing both structured and unstructured data), is advantageous when working
with information commonly found on the Web, including social media,
multimedia, and text. Also, because it can handle such huge volumes of data (and
because storage costs are minimized due to the distributed nature of the file
system), historical (archive) data can be managed easily with this approach.
3. What are the use cases for data warehousing and RDBMS?
Three main use cases for data warehousing are performance, integration, and the
availability of a wide variety of BI tools. The relational data warehouse approach
is quite mature, and database vendors are constantly adding new index types,
partitioning, statistics, and optimizer features. This enables complex queries to be
done quickly, a must for any BI application. Data warehousing, and the ETL
process, provide a robust mechanism for collecting, cleaning, and integrating data.
And, it is increasingly easy for end users to create reports, graphs, and
visualizations of the data.
There are several possible scenarios under which using a combination of Hadoop
and relational DBMS-based data warehousing technologies makes sense. For
example, you can use Hadoop for storing and archiving multi-structured data,
with a connector to a relational DBMS that extracts required data from Hadoop
for analysis by the relational DBMS. Hadoop can also be used to filter and
transform multi-structured data for transporting to a data warehouse, and can also
be used to analyze multi-structured data for publishing into the data warehouse
environment. Combining SQL and MapReduce query functions enables data
scientists to analyze both structured and unstructured data. Also, front end query
tools are available for both platforms.
Section 9.6 Review Questions
Hadoop relies on a batch processing framework and lacks real-time processing
capabilities, whereas Spark is a unified analytics engine that can process both batch and
streaming data.
2. Give examples of companies that have adopted Apache Spark. Find new examples
online.
Examples include Uber, Pinterest, Netflix, Yahoo, and eBay. Uber uses Apache Spark™
to detect fraudulent trips at scale. Pinterest measures user engagement in real time using
Apache Spark™. The recommendation engine of Netflix also utilizes the capabilities of
Apache Spark™. Yahoo, one of the early adopters of Apache Spark™, has used it for
creating business intelligence applications. Finally, eBay has used Apache Spark™ for
data management and stream processing.
Searches for updated users will vary based on the date of the search.
3. Run the exercise as described in this section. What do you learn from this exercise?
use to the organization. Therefore it is critical for a number of business situations to
analyze the data as soon as it is streamed into the analytics system.
5. Define data stream mining. What additional challenges are posed by data stream
mining?
Data stream mining is the process of extracting novel patterns and knowledge structures
from continuous, rapid data records. Processing data streams, as opposed to more
permanent data storages, is a challenge. Traditional data mining techniques can process
data recursively and repetitively because the data is permanent. By contrast, a data stream
is a continuous, ordered sequence of instances that can only be read once and must
be processed immediately as they arrive.
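The read-once constraint can be made concrete with a small Python sketch. It keeps a running mean and variance using Welford's single-pass method, which is a technique chosen here for illustration and is not named in the text; the sensor readings are hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's method); no record is stored."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new record into the summary, then discard the record."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two records have arrived."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_stream():
    """Hypothetical stream; a generator can be consumed only once."""
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield reading


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # each record is seen exactly once
```

A traditional approach could recompute the mean by rescanning stored data; here the summary must be updated incrementally because the records are gone once they have passed.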
Many industries can benefit from stream analytics. Some prominent examples include e-
commerce, telecommunications, law enforcement, cyber security, the power industry,
health sciences, and the government.
Companies such as Amazon and eBay use stream analytics to analyze customer behavior
in real time. Every page visit, every product looked at, every search conducted, and every
click made is recorded and analyzed to maximize the value gained from a user’s visit.
Behind the scenes, advanced analytics are crunching the real-time data coming from our
clicks, and the clicks of thousands of others, to “understand” what it is that we are
interested in (in some cases, even we do not know that) and make the most of that
information by creative offerings.
8. In addition to what is listed in this section, can you think of other industries and/or
application areas where stream analytics can be used?
Stream analytics could be of great benefit to any industry that faces an influx of relevant
real-time data and needs to make quick decisions. One example is the news industry. By
rapidly sifting through data streaming in, a news organization can recognize
“newsworthy” themes (i.e., critical events). Another benefit would be for weather
tracking in order to better predict tornados or other natural disasters. (Different students
will have different answers.)
9. Compared to regular analytics, do you think stream analytics will have more (or less)
use cases in the era of Big Data analytics? Why?
Stream analytics can be thought of as a subset of analytics in general, just like “regular”
analytics. The question is, what does “regular” mean? Regular analytics may refer to
traditional data warehousing approaches, which does constrain the types of data sources
and hence the use cases. Or, “regular” may mean analytics on any type of permanent
stored architecture (as opposed to transient streams). In this case, you have more use
cases for “regular” (including Big Data) than in the previous definition. In either case,
there will probably be plenty of times when “regular” use cases will continue to play a
role, even in the era of Big Data analytics. (Different students will have different
answers.)
1. Identify some of the key Big Data technology vendors whose key focus is on-premise
Hadoop platforms.
Student search results will vary, but may include Qubole, Yash, and Hortonworks.
2. What is special about the Big Data vendor landscape? Who are the big players?
The Big Data vendor landscape is developing very rapidly. It is in a special period of
evolution where entrepreneurial startup firms bring innovative solutions to the
marketplace. Cloudera is a market leader in the Hadoop space. MapR and Hortonworks
are two other Hadoop startups. DataStax is an example of a NoSQL vendor. Informatica,
Pervasive Software, Syncsort, and MicroStrategy are also players. Most of the growth in
the industry is with Hadoop and NoSQL distributors and analytics providers. There is still
very little in terms of Big Data application vendors. Meanwhile, the next-generation data
warehouse market has experienced significant consolidation. Four leading vendors in this
space—Netezza, Greenplum, Vertica, and Aster Data—were acquired by IBM, EMC,
HP, and Teradata, respectively. Mega-vendors Oracle and IBM also play in the Big Data
space, connecting and consolidating their products with Hadoop and NoSQL engines.
3. Search and identify the key similarities and differences between cloud providers’
analytics offerings and analytics providers’ presence on specific cloud platforms.
The results of student research will vary based on the date of the research and the
provider selected.
Some results will vary based on the data search, but current features listed at
https://www.teradata.com/Products/Software/Vantage include:
• tool integration
• scalability
1. Define cloud computing. How does it relate to PaaS, SaaS, and IaaS?
Cloud computing offers the possibility of using software, hardware, platform, and
infrastructure, all on a service-subscription basis. Cloud computing enables a more
scalable investment on the part of a user. Like PaaS, etc., cloud computing offers
organizations the latest technologies without significant upfront investment.
In some ways, cloud computing is a new name for many previous related trends:
utility computing, application service provider, grid computing, on-demand
computing, software as a service (SaaS), and even older centralized computing with dumb
terminals. But the term cloud computing originates from a reference to the Internet as a
“cloud” and represents an evolution of all previous shared/centralized computing
trends.
2. Give examples of companies offering cloud services.
3. How does cloud computing affect BI?
Companies offering such services include 1010data, LogiXML, and Lucid Era. These
companies offer extract, transform, and load (ETL) capabilities as well as advanced
data analysis tools. Other companies, such as Elastra and Rightscale, offer dashboard
and data management tools that follow the SaaS and DaaS models.
In the DaaS model, the actual platform on which the data resides doesn’t matter. Data
can reside in a local computer or in a server at a server farm inside a cloud-computing
environment. With DaaS, any business process can access data wherever it resides.
Customers can move quickly thanks to the simplicity of the data access and the fact
that they don’t need extensive knowledge of the underlying data.
5. What are the different types of cloud platforms?
The three different types of cloud platforms are IaaS, PaaS and SaaS.
AaaS in the cloud has economies of scale and scope by providing many virtual
analytical applications with better scalability and higher cost savings. The capabilities
that a service orientation (along with cloud computing, pooled resources, and parallel
processing) brings to the analytic world enable cost-effective data/text mining,
large-scale optimization, highly complex multi-criteria decision problems, and distributed
simulation models.
Examples include Amazon, IBM Cloud, Microsoft Azure, Google App Engine, and
OpenShift.
Examples include IBM Cloud, MineMyText.com, SAS Viya, Tableau and Snowflake.
Traditional analytics produce visual maps that are geographically mapped and based
on the traditional location data, usually grouped by the postal codes. The use of postal
codes to represent the data is a somewhat static approach for achieving a higher level
view of things.
They help the user in understanding “true location-based” impacts, and allow them to
view at higher granularities than that offered by the traditional postal code
aggregations. Addition of location components based on latitudinal and longitudinal
attributes to the traditional analytical techniques enables organizations to add a new
dimension of “where” to their traditional business analyses, which currently answer
questions of “who,” “what,” “when,” and “how much.” By integrating information
about the location with other critical business data, organizations are now creating
location intelligence (LI).
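The difference in granularity between postal-code aggregation and latitude/longitude-based analysis can be sketched in Python. The transactions, the placeholder postal code, and the 0.01-degree grid size are all hypothetical values chosen for illustration.

```python
import math
from collections import defaultdict

# Hypothetical transactions: (postal_code, latitude, longitude, sales_amount).
transactions = [
    ("ZIP-A", 36.1270, -97.0737, 20.0),
    ("ZIP-A", 36.1275, -97.0731, 35.0),
    ("ZIP-A", 36.1556, -97.0584, 50.0),
]


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a grid cell; cell_deg is an assumed cell size in degrees."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def sales_by_postal_code(rows):
    """Traditional view: one total per postal code."""
    totals = defaultdict(float)
    for postal_code, _lat, _lon, amount in rows:
        totals[postal_code] += amount
    return dict(totals)


def sales_by_grid_cell(rows, cell_deg=0.01):
    """Location-based view: one total per latitude/longitude grid cell."""
    totals = defaultdict(float)
    for _postal_code, lat, lon, amount in rows:
        totals[grid_cell(lat, lon, cell_deg)] += amount
    return dict(totals)


by_postal = sales_by_postal_code(transactions)
by_cell = sales_by_grid_cell(transactions)
```

With this data the postal-code view shows a single total, while the grid view separates the same sales into two distinct areas inside that postal code.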
3. What is the value provided by geospatial analytics?
4. Explore the use of geospatial analytics further by investigating its use across various
sectors like government census tracking, consumer marketing, and so forth.
If a user on a smart phone enters data, the location sensors of the phone can help
find others in that location who are facing similar circumstances, as well as local
companies providing services and products that the consumer desires. The user
can thus see what others in their location are choosing, and the opportunities for
meeting his or her needs. Conversely, the user’s behaviors and choices can then
contribute information to other consumers in the same location.
8. What other applications can you imagine if you were able to access cell phone location
data?
ANSWERS TO APPLICATION CASE QUESTIONS FOR DISCUSSION
Application Case 9.1: Big Data Analytics Helps Luxottica Improve Its Marketing
Effectiveness
For Luxottica, Big Data includes everything they can find about their customer
interactions (in the form of transactions, click streams, product reviews, and social
media postings). They see this as constituting a massive source of business
intelligence for potential product, marketing, and sales opportunities.
There are a wide variety of different types of marketing campaigns, but some of those
mentioned within the case include targeted online and direct mail campaigns, advertising
through various channels, growing the loyalty program by providing different customer
incentives and so on.
2. By visualizing the most common customer paths to sales, how would you use that
information to make decisions on the future marketing campaigns?
By visualizing these paths you would be able to determine the effectiveness or necessity
of each of the different types of marketing campaigns and determine if there was
important interplay between them to generate purchase events.
3. What other applications of such path analysis techniques can you think of?
Student opinions and ideas will vary, but many retail businesses use a similar marketing
mix with a variety of different advertising methods.
eBay is the world’s largest online marketplace, and its success requires the ability
to turn the enormous volumes of data it generates into useful insights for
customers. Big Data is essential for this effort.
2. What were the challenges, the proposed solution, and the obtained results?
eBay was experiencing explosive data growth and needed a solution that did not
have the typical bottlenecks, scalability issues, and transactional constraints
associated with common relational database approaches. The company also
needed to perform rapid analysis on a broad assortment of the structured and
unstructured data it captured. eBay’s solution includes NoSQL via Apache
Cassandra and DataStax Enterprise. It also uses integrated Apache Hadoop
analytics that come with DataStax. The solution incorporates a scale-out
architecture that enables eBay to deploy multiple DataStax Enterprise clusters
across several different data centers using commodity hardware. Now, eBay can
more cost effectively process massive amounts of data at very high speeds. The
new architecture serves a wide variety of new use cases, and its reliability and
fault tolerance have been greatly enhanced.
3. Can you think of other e-commerce businesses that may have Big Data challenges
comparable to that of eBay?
Any large company with a significant online presence will have challenges
similar to eBay's. Amazon and Walmart are two of the largest.
Application Case 9.4: Understanding Quality and Reliability of Healthcare Support
Information on Twitter
1. What was the data scientists’ main concern regarding health information that is
disseminated on the Twitter platform?
The primary concern was the quality and accuracy of the information that was being
distributed.
2. How did the data scientists ensure that nonexpert information disseminated on social
media could indeed contain valuable health information?
The scientists evaluated the type of information that was distributed and the size of the
following of the user for both influential and non-influential users. They found that
influential users typically provided higher quality information with more discussion of
facts.
3. Does it make sense that influential users would share more objective information
whereas less influential users could focus more on subjective information? Why?
Student opinions will vary, but it may be that users gain influence by providing higher
quality information and other users are able to identify this information themselves.
The company was able to use historical information on user reviews to predict the type of
review they would give a particular restaurant.
Spark was used because of its parallel processing and in-memory processing capabilities.
2. Besides customer retention, what are other benefits of using predictive analytics?
Companies are able to make predictions and decisions about their consumers more
rapidly. This ensures that businesses target, attract, and retain the right customers and
maximize their value.
Application Case 9.7: Using Social Media for Nowcasting Flu Activity
1. Why would social media be able to serve as an early predictor of flu outbreaks?
Social media can be used because individuals will mention if they become sick and this
information, along with the users' geographic location, can be analyzed with models that
have historically been used for disease propagation.
Student opinions will vary, but some examples might include the density level of
individuals in metropolitan areas and the percentage of individuals active on the
particular social media platform.
3. Why would this problem be a good problem to solve using Big Data technologies
mentioned in this chapter?
This is a very complex problem and process since data needs to be extracted for a
secondary use from a social media platform, geographic information tagged and then an
analysis performed.
1. What could be the reasons behind the health disparities across gender?
Student opinions on these topics will vary. One possible answer is that individuals have
different lifestyles related to their gender that affect these results.
A network is comprised of a defined set of items called nodes, which are linked to each
other through edges.
Application Case 9.9: Major West Coast Utility Uses Cloud-Mobile Technology to
Provide Real-Time Incident Reporting
1. How does cloud technology impact enterprise software for small and mid-size
businesses?
2. What are some of the areas where businesses can use mobile technology?
Mobile technology is ideal when employees themselves are mobile either due to the
nature of their work or because they are away from the office.
In addition to businesses that have compelling needs to communicate with employees who
are away from an office location, businesses that have needs for distributed
applications without large budgets or IT staffs may benefit.
4. What are the advantages of cloud-based enterprise software instead of the traditional
on-premise model?
There are a number of benefits, including the ability to use familiar hardware that does not
need to be standardized, the ability to easily perform software updates, and store data in
the cloud (for both ease of access and backup reasons).
5. What are the likely risks of cloud versus traditional on-premise applications?
Student opinions will vary, but there will always be a concern about data security and
privacy.
Application Case 9.10: Great Clips Employs Spatial Analytics to Shave Time in
Location Decisions
Great Clips depends on a growth strategy that is driven by rapidly opening new stores in
the right locations and markets. They use geospatial analysis to help analyze the locations
based on the requirements for a potential customer base, demographic trends, and sales
impact on existing franchises in the target location. They use their Alteryx-based solution
to evaluate each new location based on demographics and consumer behavior data,
aligning with existing Great Clips customer profiles and the potential revenue impact of
the new site on the existing sites.
2. What criteria should a company consider in evaluating sites for future locations?
Major criteria include potential customer base, demographic trends, and sales
impact on existing franchises in the target location.
3. Can you think of other applications where such geospatial data might be useful?
Geospatial data can be used to help customers find the right location (for example, the
closest Great Clips location). It is certainly relevant for other companies in a variety of
industries. Analyzing customer profiles and applying these to geographic information can
assist many retail firms. Another possibility is utilizing geospatial analysis to find
locations for manufacturing facilities; in this case you would be looking for supplier and
raw materials’ locations more than customer locations. In the consumer market,
geospatial analysis can help users in a variety of applications; for example finding the
best locations for restaurants or stores catering to the customer’s desires. (Student
answers will vary.)
Application Case 9.11: Starbucks Exploits GIS and Analytics to Grow Worldwide
1. What type of demographics and GIS information would be relevant for deciding on a
store location?
Information such as trade areas, retail clusters and generators, traffic, and demographics
is important in deciding the next store’s location.
2. It has been mentioned that Starbucks encourages its customers to use its mobile app.
What type of information might the company gather from the app to help it better plan
operations?
The company will be able to determine where the user is ordering from, and the distance
to the nearest Starbucks from that location.
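As a rough illustration of the distance calculation this answer implies, the Python sketch below finds the nearest store to an order location using the haversine great-circle formula. The store names and coordinates are hypothetical, and the formula is a common approximation chosen here; the case does not say how the company computes distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a standard approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical store locations: name -> (latitude, longitude).
stores = {
    "store_a": (47.6101, -122.3421),
    "store_b": (47.6205, -122.3493),
    "store_c": (47.6062, -122.3321),
}


def nearest_store(order_lat, order_lon, store_locations):
    """Return (store_name, distance_km) for the store closest to the order location."""
    distances = {
        name: haversine_km(order_lat, order_lon, lat, lon)
        for name, (lat, lon) in store_locations.items()
    }
    name = min(distances, key=distances.get)
    return name, distances[name]


closest_name, closest_km = nearest_store(47.6100, -122.3420, stores)
```

Aggregating such distances over many orders is one way the app data could indicate where demand is poorly served by existing stores.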
3. Will the availability of free Wi-Fi at Starbucks stores provide any information to
Starbucks for better analytics?
Traditionally, the term “Big Data” has been used to describe the massive volumes
of data analyzed by huge organizations like Google or research science projects at
NASA. But for most businesses, it’s a relative term: “Big” depends on an
organization’s size. In general, Big Data exceeds the reach of commonly used
hardware environments and/or capabilities of software tools to capture, manage,
and process it within a tolerable time span for its user population. Big Data has
become a popular term to describe the exponential growth, availability, and use of
information, both structured and unstructured. Big data is important because it is
there and because it is rich in information and insight that, if effectively tapped,
can lead to better business decisions and improved company performance. Much
has been written on the Big Data trend and how it can serve as the basis for
innovation, differentiation, and growth. Big Data includes both structured and
unstructured data, and it comes from everywhere: data sources include Web logs,
RFID, GPS systems, sensor networks, social networks, Internet-based text
documents, Internet search indexes, and detailed call records, to name just a few.
Big Data is not just about volume, but also variety, velocity, veracity, and value
proposition.
2. What do you think the future of Big Data will be? Will it lose its popularity to
something else? If so, what will it be?
Big Data could evolve at a rapid pace. The buzzword “Big Data” might change to
something else, but the trend toward increased computing capabilities, analytics
methodologies, and data management of high volume heterogeneous information
will continue. (Different students may have different answers.)
3. What is Big Data analytics? How does it differ from regular analytics?
Big Data analytics is analytics applied to Big Data architectures. This is a new
paradigm; in order to keep up with the computational needs of Big Data, a
number of new and innovative analytics computational techniques and platforms
have been developed. These techniques are collectively called high-performance
computing, and include in-memory analytics, in-database analytics, grid
computing, and appliances. They differ from regular analytics, which tends to focus
on relational database technologies.
4. What are the critical success factors for Big Data analytics?
Critical factors include a clear business need, strong and committed sponsorship,
alignment between the business and IT strategies, a fact-based decision culture, a
strong data infrastructure, the right analytics tools, and personnel with advanced
analytic skills.
5. What are the big challenges that one should be mindful of when considering
implementation of Big Data analytics?
Traditional ways of capturing, storing, and analyzing data are not sufficient for
Big Data. Major challenges are the vast amount of data volume, the need for data
integration to combine data of different structures in a cost-effective manner, the
need to process data quickly, data governance issues, skill availability, and
solution costs.
6. What are the common business problems addressed by Big Data analytics?
Here is a list of problems that can be addressed using Big Data analytics:
7. In the era of Big Data, are we about to witness the end of data warehousing?
Why?
What has changed the landscape in recent years is the variety and complexity of
data, which made data warehouses incapable of keeping up. It is not the volume
of the structured data but the variety and the velocity that forced the world of IT
to develop a new paradigm, which we now call “Big Data.” But this does not
mean the end of data warehousing. Data warehousing and RDBMS still bring
many strengths that make them relevant for BI and that Big Data techniques do
not currently provide.
8. What are the use cases for Big Data/Hadoop and data warehousing/RDBMS?
In terms of its use cases, Hadoop is differentiated in two ways: first, as the
repository and refinery of raw data, and second, as an active archive of historical
data. Hadoop, with its distributed file system and flexibility of data formats
(allowing both structured and unstructured data), is advantageous when working
with information commonly found on the Web, including social media,
multimedia, and text. Also, because it can handle such huge volumes of data (and
because storage costs are minimized due to the distributed nature of the file
system), historical (archive) data can be managed easily with this approach.
Three main use cases for data warehousing are performance, integration, and the
availability of a wide variety of BI tools. The relational data warehouse approach
is quite mature, and database vendors are constantly adding new index types,
partitioning, statistics, and optimizer features. This enables complex queries to be
done quickly, a must for any BI application. Data warehousing, and the ETL
process, provide a robust mechanism for collecting, cleaning, and integrating data.
In addition, it is increasingly easy for end users to create reports, graphs, and
visualizations of the data.
9. Is cloud computing “just an old wine in a new bottle?” How is it similar to other
initiatives? How is it different?
In some ways, cloud computing is a new name for many previous related trends:
utility computing, application service providers, grid computing, on-demand
computing, software as a service (SaaS), and even older centralized computing
with dumb terminals. But the term cloud computing originates from a reference to
the Internet as a “cloud” and represents an evolution of all previous
shared/centralized computing trends.
Cloud computing offers the possibility of using software, hardware, platform, and
infrastructure, all on a service-subscription basis. Cloud computing enables a
more scalable investment on the part of a user. Like PaaS and related service models, cloud computing
offers organizations the latest technologies without significant upfront investment.
10. What is stream analytics? How does it differ from regular analytics?
11. What are the most fruitful industries for stream analytics? What is common to
those industries?
Many industries can benefit from stream analytics. Some prominent examples
include e-commerce, telecommunications, law enforcement, cyber security, the
power industry, health sciences, and the government.
12. Compared to regular analytics, do you think stream analytics will have more (or
fewer) use cases in the era of Big Data analytics? Why?
13. What are the potential benefits of using geospatial data in analytics? Give examples.
of integrated sensor technologies and global positioning systems installed in their
smartphones. Using geospatial data, companies can identify the specific needs of the
customers and customer complaint locations, and easily trace them back to the
products. Another example is in the telecommunications industry, where geospatial
analysis can enable communication companies to capture daily transactions from a
network to identify the geographic areas experiencing a large number of failed
connection attempts of voice, data, text, or Internet.
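The telecommunications example can be sketched as a simple aggregation: bucket each connection attempt into a latitude/longitude grid cell and rank cells by failure rate. The event format, the grid size, and the sample log below are assumptions made for illustration, not a description of any carrier's system.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def failure_hotspots(events, min_attempts=3, top_n=3, cell_deg=0.01):
    """Rank grid cells by failed-connection rate.

    events: iterable of (lat, lon, succeeded) tuples.
    Cells with fewer than min_attempts are skipped to avoid noisy rates.
    Returns a list of (cell, failures, attempts, failure_rate) tuples.
    """
    attempts, failures = Counter(), Counter()
    for lat, lon, succeeded in events:
        cell = grid_cell(lat, lon, cell_deg)
        attempts[cell] += 1
        if not succeeded:
            failures[cell] += 1
    rates = [
        (cell, failures[cell], attempts[cell], failures[cell] / attempts[cell])
        for cell in attempts
        if attempts[cell] >= min_attempts
    ]
    # Highest failure rate first; ties broken by raw failure count.
    return sorted(rates, key=lambda r: (r[3], r[1]), reverse=True)[:top_n]


# Synthetic connection log (assumption): (lat, lon, succeeded)
events = [
    (40.7128, -74.0060, False),
    (40.7129, -74.0061, False),
    (40.7127, -74.0065, True),
    (40.7131, -74.0068, False),
    (40.7306, -73.9866, True),
    (40.7308, -73.9861, True),
    (40.7305, -73.9869, False),
    (40.7580, -73.9855, True),
]

hotspots = failure_hotspots(events)
for cell, failed, total, rate in hotspots:
    print(f"cell {cell}: {failed}/{total} failed ({rate:.0%})")
```

A fixed degree-based grid is the simplest possible binning; a production analysis would more likely use cell-tower coverage areas or a proper spatial index.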
14. What types of new applications can emerge from knowing locations of users in real
time? What if you also knew what they have in their shopping cart, for example?
One prominent application is in the emerging area of reality mining, which uses
location-enabled devices for finding nearby services, locating friends and family,
navigating, tracking of assets and pets, dispatching, and engaging in sports, games,
and hobbies. Adding shopping cart knowledge will enhance the application’s ability
to provide targeted information to a customer; for example, the app could find prices
for similar products in nearby stores.
15. How can consumers benefit from using analytics, especially based on location
information?
Students’ answers will differ. Privacy threats relate to user-profiling, intrusive use of
personal information, and not being able to control what is being collected.
17. Is cloud computing “just an old wine in a new bottle?” How is it similar to other
initiatives? How is it different?
In some ways, cloud computing is a new name for many previous related trends:
utility computing, application service providers, grid computing, on-demand
computing, software as a service (SaaS), and even older centralized computing with dumb
terminals. But the term cloud computing originates from a reference to the Internet as a
“cloud” and represents an evolution of all previous shared/centralized computing
trends.
Cloud computing offers the possibility of using software, hardware, platform, and
infrastructure, all on a service-subscription basis. Cloud computing enables a more
scalable investment on the part of a user. Like PaaS and related service models, cloud computing offers
organizations the latest technologies without significant upfront investment.
18. Discuss the relationship between mobile devices and social networking.
Mobile social networking enables social networking where members converse and
connect with one another using cell phones or other mobile devices.
4. Go to teradatauniversitynetwork.com, and search for BSI Videos that talk about Big
Data. Review these BSI videos, and answer the case questions related to them.
5. Go to the teradata.com and/or asterdata.com Web sites. Find at least three customer
case studies on Big Data, and write a report where you discuss the commonalities and
differences of these cases.
Student selection of case studies and the resulting reports will differ.
6. Go to IBM.com. Find at least three customer case studies on Big Data, and write a
report where you discuss the commonalities and differences of these cases.
8. Go to mapr.com. Find at least three customer case studies on Hadoop implementation,
and write a report where you discuss the commonalities and differences of these cases.
11. Go to youtube.com. Search for videos on Big Data computing. Watch at least two.
Summarize your findings.
12. Go to google.com/scholar, and search for articles on stream analytics. Find at least
three related articles. Read and summarize your findings.
Student summaries will vary based on the data search and the articles found.
13. Enter google.com/scholar, and search for articles on data stream mining. Find at least
three related articles. Read and summarize your findings.
14. Enter google.com/scholar, and search for articles that talk about Big Data versus data
warehousing. Find at least five articles. Read and summarize your findings.
16. Enter YouTube.com. Search for videos on cloud computing, and watch at least two.
Summarize your findings.
Solution Manual for Analytics, Data Science, & Artificial Intelligence Systems for Decision
17. Enter Pandora.com. Find out how you can create and share music with friends.
Explore how the site analyzes user preferences.
18. Enter Humanyze.com. Review various case studies and summarize one interesting
application of sensors in understanding social exchanges in organizations.
Students will select different case studies and have different reactions.
19. The objective of the exercise is to familiarize you with the capabilities of
smartphones to identify human activity. The data set is available at
archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones. It
contains accelerometer and gyroscope readings on 30 subjects who had the smartphone
on their waist. The data is available in a raw format and involves some data preparation
efforts. Your objective is to identify and classify these readings into activities like
walking, running, climbing, and such. More information on the data set is available on
the download page. You may use clustering for initial exploration and to gain an
understanding of the data. You may use tools like R to prepare and analyze this data.
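For instructors who want a starting point for the clustering step, here is a minimal k-means sketch in Python. It does not load the UCI data set; the two features per window (mean and standard deviation of acceleration magnitude) and all values are invented for illustration, and the farthest-point initialization is a simplification chosen only to keep the run deterministic.

```python
def dist2(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centroids(points, k):
    """Farthest-point initialization: start at the first point, then
    repeatedly add the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (labels, centroids)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Hypothetical per-window features: (mean magnitude, std of magnitude).
windows = [
    (1.00, 0.02), (1.01, 0.03), (0.99, 0.02),   # low-motion windows
    (1.10, 0.30), (1.12, 0.35), (1.08, 0.28),   # moderate-motion windows
    (1.40, 0.90), (1.45, 0.95), (1.38, 0.85),   # high-motion windows
]

labels, centroids = kmeans(windows, k=3)
for window, label in zip(windows, labels):
    print(window, "-> cluster", label)
```

Cluster numbers carry no meaning on their own; students would still need to compare clusters against the labeled activities before moving on to a supervised classifier.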