
2020 46th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)

Big-Data/Analytics Projects Failure: A Literature Review

Gianna Reggio (DIBRIS - University of Genoa, Genoa, Italy, gianna.reggio@unige.it)
Egidio Astesiano (DIBRIS - University of Genoa, Genoa, Italy, astes@unige.it)

Abstract—Nowadays, no one doubts that big data and analytics are fundamental tools for successfully running an enterprise. However, big data/analytics projects do not seem to be all sunshine and rainbows, and this is documented by many sources providing startling figures of failed projects. Motivated by such figures, we conducted a (grey and scientific) literature review aimed to answer:
RQ1) Which are the documented cases of failed BD/A projects, and which are the root causes of their failures?
RQ2) What is assumed useful to reduce the chance of failure of BD/A projects?
We collected and examined 188 sources. The survey resulted first in a list of hints for helping avoid the failure of BD/A projects (RQ2), and in a list of 21 cases of failed BD/A projects documented in the literature (RQ1). The analysis of those failures resulted in confirming the hints of RQ2 and in prompting other relevant ones.
The result of this study will be useful not only to people developing a BDA project but also to researchers investigating the challenges posed by a BDA project to the traditional methods of software engineering.

Keywords-big data project; data analytics project; big data project failure; data analytics project failure

I. INTRODUCTION

Nowadays, no one doubts that big data and analytics are fundamental tools for successfully running an enterprise, but big data/analytics projects (BDA projects for short) do not seem to be all sunshine and rainbows, as reported below. [32] (2013) states that "55% of Big Data projects don't get completed, and many others fall short of their objectives". "Through 2017, 60% of big data projects will fail to go beyond piloting and experimentation, and will be abandoned" [12] (2015). Later, on Twitter, Nick Heudecker said about that statement: "We were too conservative. The failure rate is closer to 85%. And the problem isn't technology" [13]. "However, too many executives have assumed that because they've made such big moves, the main challenges to becoming analytics-driven are behind them. But frustrations are beginning to surface; it's starting to dawn on company executives that they've failed to convert their analytics pilots into scalable solutions. (A recent McKinsey survey found that only 8 percent of 1,000 respondents with analytics initiatives engaged in effective scaling practices.)" [11] (2018). "The technical factors of a successful analytics initiative only account for 20-25% of all of the success factors." [26] (2018). "Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization. Through 2022, only 20% of analytic insights will deliver business outcomes" [35] (2019).
The figures reported above may be just anecdotal evidence without any scientific relevance, also because they have been published in non-scientific venues, and thus they cannot be a definitive answer about the success of big data/analytics projects. Perhaps we might get completely different figures concerning the success rate of projects of this kind.
There are notorious cases of failed BDA projects, such as the Google Flu project for detecting the outbreak of a flu epidemic by analysing web searches for flu-related terms, or the results of the USA 2016 Presidential Election (all the data analyses forecasted the victory of Hillary Clinton with extremely high percentages), to mention just two of them. But again, these may be just anecdotal evidence, and these failures may be due to reasons not related to big data/analytics. On the other hand, there is a long list of successful big data/analytics projects; for example, companies such as Netflix and Amazon thrive on big data/analytics.
Since "where there is smoke, there is fire," we decided to investigate the topic – the failure of Big Data/Analytics projects – by conducting a literature review.
We use "big data/analytics project" in a quite general sense, encompassing any project involving big data or some form of data analysis (e.g. business intelligence, data mining, machine learning, artificial intelligence and deep learning). Trying to precisely define what big data are and how to distinguish between the different brands of data analysis is outside the scope of this paper. That issue is thoroughly discussed in [1], where it is stated that "there is no single unified definition of big data."
Our literature review aims to answer the following research questions:
RQ1: Which are the documented cases of failed BD/A projects, and which are the root causes of their failures?
RQ2: What is assumed useful to reduce the chance of failure of BD/A projects?

We decided to consider both scientific and grey literature because the first is still scarce (most conferences and journals started in the last five years) and to get the view of the practitioners. We collected and examined 188 sources: all agree that it is difficult to run successful BD/A projects and that their failures are not related to technical issues. The survey resulted in two lists:
i) one containing hints for helping avoid the failure of BD/A projects (answer to RQ2),
ii) and another one of 21 cases of failed BD/A projects, each one documented in at least one source in the literature (answer to RQ1), sorted by the failures genuinely related to the management of big data/analytics and by those due to other reasons (e.g. forgetting the basics of software testing).
The analysis of the cases in ii) resulted in confirming the relevance of the hints in i) and in prompting other ones.
The result of this study will help people develop a BDA project and researchers in the field of software engineering interested in investigating the challenges posed by a BDA project to the traditional methods of software engineering.
We detail the sources collection process in Sect. II. Then, we present and discuss the found hints in Sect. III (RQ2), and list and analyse the found failed BDA projects in Sect. IV, assessing if they are true failures and if some lessons may be learned (RQ1). The related work and the conclusion are in Sect. V and VI, respectively.

II. COLLECTING THE SOURCES

Big data/analytics is a very new field, and thus the existing scientific literature is scarce with respect to other fields (for example, the main conferences and journals have appeared in the last five years). Thus, we decided to consider also the grey literature (e.g. technical web sites, online technical journals, and blogs), to get the view of the practitioners.
We used the search strings reported in Fig. 1. Notice that many of them are multi-word search strings. Indeed, using, for example, "analytics" together with "failure" we got a huge number of results (in some cases a five-digit number), but most of them concern the use of analytics to discover the failures of something (e.g. a device) or the failure of analytics algorithms.

Figure 1. Search strings and number of selected sources

First, we searched the web using Google and Google Scholar (considering only web pages appearing after 2010 and the first ten pages of results, obviously disregarding the advertised items). Later, we searched the main repositories of scientific literature, precisely IEEE Xplore, ACM Digital Library, Springer Open, Science Direct and ResearchGate (https://www.researchgate.net/, a professional network for scientists and researchers), again considering only papers after 2010. Altogether, we found 3289 different papers; [25] reports the numbers of papers found with the different searches.
Then, we examined those papers and selected the ones discussing failed BDA projects or presenting suggestions for avoiding the failure of BDA projects (inclusion criteria). At the end, we collected 188 papers (see [25] for their bibliographic references). Fig. 1 also shows the numbers of papers selected by the various searches (we counted each paper only the first time it was selected), illustrating how the scientific literature is still quite limited. The searches and the selection procedure were performed during January 2020.
In our procedure for collecting sources, neither in defining the search strings nor in the selection phase did we look for hints for successful BDA projects or for success stories. In our opinion, the point of view of analysing the failures could result in a sharper analysis of the underlying problems, because we can learn more from a failed case than from a success story.
Fig. 2 presents the kinds and the years of production of the selected sources, showing the prominence of the grey literature. The scientific production started later than the grey literature, both increased in 2016/17, and then the grey literature seems stable (3 items in January 2020 suggest that 2020 will reach at least the number of 2019).

Figure 2. Selected Sources: Kinds and Production Dates

All the sources support the view that BDA projects fail and that the problems are not in the technicalities of big data and analytics.

We have then read the selected sources to answer the two research questions, collecting:
– any reference to cases of failed BDA projects (RQ1), see Sect. IV,
– any proposed hint for reducing the chance of failure of BDA projects (RQ2), see Sect. III.

III. WHAT IS CONSIDERED USEFUL TO REDUCE THE CHANCE OF FAILURE OF BDA PROJECTS

A. Hints for avoiding the failure of BDA projects

To answer RQ2, we examined the selected sources looking for hints on what to do when putting together a BDA project to avoid its failure. Since many hints are supported by many sources, we have also recorded which sources propose each hint, to discover which are the most agreed upon (in [25] there is the link to a spreadsheet showing the hints supported by each of the 188 sources). Hints prompted by a single source are not considered here, except one involved in a well-known failure case.
The discovered hints are presented below in alphabetical order. In the following subsection, we will briefly discuss their role in a project and their perceived relevance in the survey.
• Agile Adhere to the agile approach in BDA projects, favouring iteration with short cycles and many interactions with users to validate the results, and keep the data/analytics activities synchronized with the business.
• Data Access Make accessible the data needed for a BDA project.
• Data Culture Ensure a data-driven culture.
"A data-driven culture is a workplace environment that employs a consistent, repeatable approach to tactical and strategic decision-making through emphatic and empirical data proof. Put simply, it is an organization that bases decisions on data, not gut instinct." [23]
• Data Governance Establish data governance.
• Data Production Design Whenever possible, design which data to collect and how, as well as which data should be generated.
"The big data mantra is to store everything . . . Not every data set is important. Just because you can collect data doesn't mean you should. You can't just collect all the data and expect someone to figure it out someday" [28]
• Data Quality Ensure the best level of quality of the used data.
• Data Scientists, Business and IT People Together Bring together IT, data science, and line of business perspectives to run a BDA project.
• Data Security Protect the data against internal and external attacks.
• Deployment Carefully consider the deployment phase. The deployment phase of an analytics task is the set of activities needed to use the result of the task, e.g., the choice between on-line and off-line computation, visualization of results, reporting, modification of some business processes, and related aspects.
• Domain Knowledge Get a good knowledge of the domain. That is imperative in any project; but in the case of BDA projects, it is especially important to relate the used data to the domain to assess their context and their quality, while keeping in mind that the domain may change over time.
• Executives' Support Ensure the support of the executives, and mainly of the top executives.
• Interaction with other Analytics Programs Look for possible interactions with other analytics-based programs.
• No Data Silos Avoid data silos.
• Planning Carefully plan a BDA project, taking into account all possible alternatives, the costs, the needed resources, and all possible risks.
• Possible irrelevance of derived facts Keep in mind that facts or knowledge derived from data may be useless.
• Problem Identification The problem to solve or the question to answer by a BDA project must not only be known before starting but also be quite precisely defined, and its relevance assessed by appropriate key indicators.
"Begin your efforts with a defined goal that is measurable and shared with others. Simply to improve efficiency or enhance the customer experience are not measurable or sharable." [3]
• Short Time To Value Ensure that a BDA project provides value in a short time.
• Specific Skills Analytics/big data skills must be available.
• Start Small Especially within a large organization, start with one department or business unit, execute with focus and resolve, get tangible results, and then use that momentum to spread to other divisions.
Fig. 3 lists the above hints together with the number of supporting sources, sorted first by type of source (total, grey and scientific), and then by chronology (before 2016, after 2017 and from 2016-2017).

B. Discussion

Some of the elicited hints apply to any software development project, and the software engineering world largely agrees on their value: Agile, Deployment, Domain Knowledge, Problem Identification and Planning. They have been incorporated in the best-known and most used software development process models and methods. The lack of attention to those issues may be explained by the fact that the novelty of big data/analytics and the possibility to get awesome results from the data resulted in projects organized and managed (if they were at all) in an ad hoc way, completely forgetting the wealth of results of the software engineering community.

Figure 3. Hints for avoiding the failure of BDA projects: number of sources sponsoring them and production years

However, Agile, Deployment, Domain Knowledge, Problem Identification and Planning are, and should be, adapted differently in the context of BDA projects, as supported by [22].
Surprisingly, at least in our opinion, testing and maintenance/evolution are never mentioned as means to reduce the chance of failure of BDA projects. Now, we cannot say whether these aspects are managed using traditional approaches in a successful way or they are just ignored. We only know that testing in this context poses new hard challenges and has to be rethought completely, and that some BDA projects require answering some questions about data only once (e.g. to make a strategic decision in a company), and so maintenance/evolution is not an issue.
Executives' Support is an essential issue for any project based on very innovative techniques. In the case of BDA projects, it is more relevant and harder to secure, since executives must decide to trust the data more than their gut, and that may result in self-esteem issues.
Data privacy is not present among the found hints, and Data Security is one of the less supported hints (3 sources prompt it, none after 2016). We can interpret this result as meaning that data privacy and security are now standard issues in any BDA project, agreed upon by everyone.
Now we can consider the hints specifically related to BDA projects. The major hint emerging from the survey is Problem Identification (mainly found in the grey literature and strongly prompted also after 2017). Namely, a BDA project should be motivated by a clear (i.e. neither ambiguous nor generic) question or problem to solve, and such question/problem must be valuable for whoever (organization, company) pays for the project. E.g., a project motivated by a generic "increasing the income" or "what the data say about my clients" (which is not sure to result in something actionable) is not following that hint.
Possible irrelevance of derived facts (6 supporters, only one after 2017) suggests being aware that a fact derived from data may be useless, as in a famous case reported by [19]. A retailer, by mining its stock and purchase data, had found that a particular bottle of wine sold exceptionally well on a Tuesday, and even more so if it was raining. But so what? The issue is that shelf-space is pre-assigned and cannot be increased for this brand for just this one day.
A related hint is Domain Knowledge (moderately supported, with an increase after 2017): the domain of a BDA project should be well known and understood. In this case, it is essential also to know well how it evolves and to keep it under control during all the life of the project. Good knowledge of the domain is also relevant for assessing the quality of the data and for helping to consider also the "small data" (en.wikipedia.org/wiki/Small_data).

The hints Problem Identification, Possible irrelevance of derived facts and Domain Knowledge may be summarized as follows: – get a good knowledge of the domain, – define a relevant problem whose solution has verifiable value, – recap the available data and those that can be produced (big, small and thick) and – assess their relationships with the domain.
Data Culture and Executives' Support (moderately supported, with an increase after 2017) show that an organization embracing a BDA project should at any level, and absolutely at the top management level, agree that decisions should be based on data.
A group of hints directly concerns the data: – Data Access, scarcely relevant; – No Data Silos, whose relevance is decreasing over time (since it is agreed upon by everyone); – Data Governance, scarcely relevant; – Data Quality, moderately relevant and increasing over time; – Data Production Design, marginal (and this is surprising in our opinion).
Data Scientists, Business and IT People Together (highly relevant and increasing over time) and Specific Skills (moderately relevant) concern the way a BDA project is managed, and require organizing the team in a way that brings all these people together and, obviously, includes experts in data science.
Start Small (moderately relevant) and Short Time To Value (scarcely relevant) again concern BDA project management and should help successfully introduce the new big data/analytics technologies in an organization.
The scientific and the grey literature support the various hints more or less similarly; thus, both communities agree about which are the most relevant hints.

IV. FAILURE CASES

We summarize the found documented failure cases in Fig. 4 (in [25] you can find the references to all the sources citing them).

Figure 4. Documented failure cases (the last column is the number of citing sources)

We analysed each case to determine first if it is a real failure related to big data/analytics, then we tried to determine the root causes of the failure, listing the hints, among those reported in Sect. III-A, that have been disregarded. The analysis of the various cases also prompted new hints that, except one, were never mentioned in any of the examined sources. We are aware that the list in Fig. 4 contains only the cases that have become public either because a "victim" had some connection with the media, or because the failed BDA project was heavily publicized before starting, e.g. cases related to politics and sport, or because the failure of the involved company went public. However, we think they are interesting to derive suggestions for successfully developing BDA projects.
Blockbuster In 2000 Blockbuster was at the top of its game when Netflix proposed a strategic brand partnership where they wanted to run Blockbuster's online brand in exchange for the latter promoting Netflix in their retail chains. Netflix offered the use of data analysis to gather insights about customers and the market; Blockbuster refused. A decade later, Blockbuster filed for bankruptcy. [9]
That is one of the most frequently cited failure cases, but it is not one. We should read it the other way around. Indeed, it is a case where big data/analytics technologies were mistakenly not used by Blockbuster, and conversely their use resulted in a "success story" for Netflix. This case shows the relevance of Data Culture, because the refusal to use the data resulted in the company's failure.
"jobs" versus "Jobs" "A big data project was set out to use Twitter feeds to predict the U.S. unemployment rate. The researchers devised a category of many words that pertained to unemployment, including jobs. They culled tweets that contained these words then looked for correlations between the total number of words per month in this category and the monthly unemployment rate. The work crept along, when all of a sudden there was a tremendous spike in the number of tweets containing the type of words that fell into this category. Maybe the researchers were really onto something. What they hadn't noticed was that Steve Jobs died." [31]
The disregarded hint is Domain Knowledge, which also suggests considering that the domain may change over time and how the analysed data are related to the domain itself.

NHS The UK National Health Service undertook a project to integrate all patient medical records into a central healthcare database. The analytics this project promised would have been significant and valuable to the NHS. The failure of the project is directly related to several challenges. The effort to build this analytical platform began with a lack of clear business objectives. Each health care provider within the system had different requirements, yet they needed each of their systems to work together. Because the leadership team did not clearly understand, define or communicate the benefit of this effort to those affected, health care providers continued to work within their respective silos. The program was also challenged by technical difficulties stemming from a lack of expertise within the analytics team. [15] In this case, the disregarded hints were: Problem Identification, Specific Skills, and No Data Silos.
The Making of a Fly Price In April 2011, a biology book was being sold by two book retailers under the Amazon umbrella. Both were using big data solutions that would change the prices of the books based on market demand and book availability. Unfortunately, in this particular case, both retailers added a rule by which, if a book has certain demand and availability patterns, they will charge for it a little bit more than the market. Thus one system increased the book price a little bit, then the other system in turn responded shortly after in the same way, and both systems kept increasing the book price until it reached $23 million. [4] The disregarded hints are Interaction with other Analytics Programs, which invites considering the existence of competing analytics programs, and Domain Knowledge. That failure could have been avoided if the system had been designed to signal when the automatically generated price was not sensible anymore, and this prompts a new hint:
• Sensible Results: The sensible results of an analytics task should be determined, and any non-sensible outcome should be signalled.
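To make the Sensible Results hint concrete, the following is a minimal, hypothetical sketch (not the retailers' actual systems; the list price, the 1.27 markup factor and the sanity bound are assumptions) of how a runaway pricing loop like the one above can be caught by an explicit plausibility check:

```python
# Toy model of two repricing bots reacting to each other, with a sanity bound
# that flags non-sensible prices instead of letting them grow unchecked.
LIST_PRICE = 35.00
SANITY_CEILING = 10 * LIST_PRICE          # assumed plausibility bound for one book

def reprice(competitor_price: float) -> float:
    # "charge a little bit more than the market"
    return round(competitor_price * 1.27, 2)

price_a = price_b = LIST_PRICE
for step in range(100):
    price_a = reprice(price_b)            # each system reacts to the other's last price
    price_b = reprice(price_a)
    if max(price_a, price_b) > SANITY_CEILING:
        print(f"step {step}: price {price_b:.2f} is not sensible - stop and alert a human")
        break
```

Without the final check, the loop above multiplies the price by roughly 1.6 at every step and never stops, which is exactly the behaviour reported for the two book-selling systems.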
Canadian Bank The case involved an analytic model for acquiring new customers for a new product of a well-known Canadian bank. The model was built and worked very well, and then it was implemented and made operational within a future marketing campaign. Unfortunately, the system did not automatically generate the final result. Indeed, the user had to perform various operations manually over the equation resulting from the model, including multiplying the entire equation by -1. Multiplying the equation by -1 was forgotten by the user when scoring the list of eligible customers. As a result, clients with a high score were considered the less eligible for the new product and vice-versa, and the marketing campaign was a complete failure. [5] This is just a plain software engineering failure: it is not sensible not to have automated and tested the entire procedure.
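A minimal sketch of what "automate and test the entire procedure" could look like here; the feature names, weights and test profiles are hypothetical and are not the bank's actual model:

```python
# The sign flip is part of the automated pipeline, and a regression test pins
# down the intended ordering, so a forgotten manual step cannot go unnoticed.
def score(customer: dict, weights: dict) -> float:
    raw = sum(w * customer[feature] for feature, w in weights.items())
    return -raw                 # model output must be negated: higher score = more eligible

def rank_by_eligibility(customers: list, weights: dict) -> list:
    return sorted(customers, key=lambda c: score(c, weights), reverse=True)

def test_eligibility_ordering():
    weights = {"risk": 0.5, "churn": 1.0}
    known_eligible = {"risk": 0.2, "churn": 0.1}      # profile the business knows is good
    known_ineligible = {"risk": 5.0, "churn": 4.0}    # profile the business knows is bad
    assert score(known_eligible, weights) > score(known_ineligible, weights), \
        "eligibility ordering is inverted"

test_eligibility_ordering()
```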
JC Penney From a leading light in the retail consumer space in 2012, JC Penney witnessed a mass exodus of customers in under a year. The culprit? Poor deployment of a large scale merchandising retail analytics solution. A few months into a complete store revamp, a complete, open and integrated suite of retail analytics solutions was implemented. The hope was that JC Penney would gain new customer shopping insights to respond faster to customer preferences. On paper, this would simplify processes and capture structured and unstructured data (e.g., customer sentiment) so it could deliver the best possible customer experience. Nevertheless, no meaningful insight from those analytics solutions was gained. Instead, sales started to decline. [21] Problem Identification was disregarded in this case, and the result was a big failure.
Romney's Failed Bid During the 2012 US presidential election campaign, Romney's campaign utilized a big data system called Orca. It was supposed to read data from various sources, and from the campaign volunteers' field feedback, to generate insights about voters who supported Mr. Romney and had not voted yet, so the campaign volunteers could target them on election day and convince them to participate. The Orca system failed to find the right voters and was useless on election day, due to its complexity that left volunteers confused about how to feed it with their field data and how to use its outcome. [10]
That is a paradigmatic example of a software engineering & project management failure: no concern for possible distributed denial of service attacks, no stress test, no redundancy to avoid unavailability, no user training, 60 pages of instructions sent the evening before the election, a helpline not working, a system accessible only via a web site (not an app) and available only on the morning of election day. Hence the analytics part was not used in practice.
Target Target set off a maelstrom of outrage and privacy concerns when the retail behemoth angered an unsuspecting father by sending discount coupons for cribs and baby clothes to his teenage daughter, who had not yet revealed her pregnancy to her parents. Target clients' purchase data were mined to look for patterns (for example, pregnant women buy unscented lotion at the start of their second trimester) to send them special deals and coupons for baby items. [24]
Technically that is a big success of analytics, because expecting women were spotted precisely, but it also prompts us to modify the hint Problem Identification by adding to its definition: "Moreover, any possible side effect due to its solution should be investigated."
In this case, it should have been considered that discovering a pregnancy is a sensitive issue and not always a happy event.
Sears The Sears decline is also related to the failure to exploit the data related to their membership program; see [18] for a detailed presentation of the case. The hints not followed in that case are Data Scientists, Business and IT People Together and Data Culture.

Moreover, it seems that they forgot to consider also thick data, i.e. qualitative information that provides insights into the everyday emotional lives of consumers and goes beyond big data to explain why consumers have certain preferences, the reasons they behave the way they do, why certain trends stick, and so on. Thick data could have helped to avoid the failure: "The purchase data we used reflected a subset of the entire customer experience, missing data undermined our delivery of a complete customer value proposition. Our data analytics team mostly focused on transactional data and product-related behavior such as online browse paths. However, delivery of a full retail business value proposition depends on creating an experience for the customer, not just understanding the products they buy/consider buying. Our corporate team should have augmented our transactional data with qualitative data from focus groups and surveys to understand customer motives for shopping our owned channels, as this may have helped us address key issues in our business model." [18] This last point prompts a new hint:
• Thick Data: Whenever human behaviour is concerned, consider complementing quantitative data with thick data.
Chicago Police The Chicago Police, to reduce violent crime, produced a list of the 400 individuals most likely to break the law. The index of violent individuals was the result of a predictive analytics program that used a mathematical algorithm to sift through crime data. However, the algorithm ran into a firestorm of controversy in late 2013 when a Chicago Tribune article told the story of a man on the list who had no criminal arrests. [17]
That is a case of unethical use of data and analytics, which is currently a rising topic for researchers and practitioners. Concerning our hints, Domain Knowledge was disregarded (recall that it also requires checking the consistency of the data with the domain), and here the data were not truly representing the crime in Chicago.
Keep Calm T-shirts "Amazon was caught up in a big data fiasco when poor programming and analytics led it to sell T-shirts on its site with offensive messages such as Keep Calm and Rape a Lot and Keep Calm and Punch Her. The seller company failed to check the batch phrases generated via a scripted computer program and sent it for publishing." [8] This project is wrongly considered a failure related to big data/analytics, because it is a case where the results of a non-analytic algorithm were assumed valuable and used without a second thought. The hint Sensible Results perhaps could have helped, driving the developers to think about which combinations would be acceptable.
Google Flu The Google Flu project was launched to provide real-time monitoring of flu cases. While the premise seemed a good way to prove the power of big data, Google got it wrong. In building its analytics, Google assumed that there was a close correlation between people who search for flu-related information on the web and the number of people that had the flu. However, Google Flu overestimated the incidence of flu in 2011-2012 and 2012-2013 by more than 50%. The reason for the failure was the data selected for analytics. Google wrongly assumed that its search data was an accurate measurement of flu infection. It was missing what researchers call "small data," i.e. traditional predictors such as past flu seasons and regional factors, and by using its own search analytics as a metric, Google was modifying its own findings as it measured results, creating a feedback loop that would skew the findings. [33]
Google Flu is a true big data/analytics failure. The disregarded hint is Domain Knowledge, which suggests controlling the consistency of the data with the domain and keeping in mind that the domain may change. Moreover, a good knowledge of the domain may reduce the chance of overlooking the small data.
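The feedback-loop mechanism described above can be illustrated with a tiny, purely hypothetical simulation (the incidence numbers and the 0.4 "attention" coefficient are assumptions, not Google's model): a predictor whose published output influences its own input signal drifts away from the ground truth.

```python
# Toy simulation of a self-reinforcing estimate: search volume reflects real
# illness plus attention driven by the previously published estimate, and the
# model takes the search volume at face value.
true_flu = [10, 12, 11, 13, 12, 14]        # hypothetical true incidence per week
estimates = []
for flu in true_flu:
    attention = 0.4 * estimates[-1] if estimates else 0.0
    searches = flu + attention             # signal contaminated by the model's own output
    estimates.append(searches)
print([round(e, 1) for e in estimates])    # drifts well above true_flu over time
```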
Nokia Smart Phone Failure Tricia Wang, a technology ethnographer, provides a clue on the Nokia failure, mainly due to the fact that the company decided not to embrace the smartphone wave [34]. She tells how, in 2009, her ethnographic research in China suggested the rising interest in smartphones, and that Nokia dismissed the suggestion since their sales data was confirming the growing trend for traditional mobile phones. Thus she too suggests complementing big data with thick data (see the Sears case) whenever the project is concerned with human behaviour.
Hotel Prices A hotel chain used some pretty sophisticated mathematics, data mining, and time series analysis to coordinate its yield management pricing and promotion efforts. The system worked fine for about a third of the hotels but was wildly, destructively off for another third. After long and careful investigations, the most likely explanation was that the analysts had priced against the hotel group's peer competitors, but they had not weighted discount hotels into either pricing or room availability. [29]
Not taking into account budget hotels is a clear case of disregarding Domain Knowledge.
OfficeMax OfficeMax sent discount coupons to a customer, and the envelope was addressed to "Mike Seay, Daughter Killed in Car Crash". Seay's daughter Ashley had died in a car crash a year before. Even after a company executive apologized, they still did not explain why an office supply company knew that his daughter had died. [24] The disregarded hint is Data Quality; indeed, OfficeMax likely used data from an insurance company without evaluating their quality.
Suicide-Prevention App Samaritans developed a free app to notify people whenever someone they followed on Twitter posted potentially suicidal phrases like "hate myself" or "tired of being alone." The group quickly removed the app after complaints from people who warned that it could be misused to harass users at their most vulnerable moments. [16]

Here Problem Identification was disregarded; again, similarly to the Target case, they should have carefully guessed the effects of using the answer. Also, Domain Knowledge may have helped, making the developers aware of the potential distress caused in such situations.
Scottish Independence Vote During a report on the referendum, CNN displayed a graphic that showed that 110% of Scotland's population had been polled. According to the graphic, 58% of Scots had said Yes, and 52% voted No. The hint Sensible Results could have saved this tiny failure.
Tay Chat Bot Tay, modeled to learn from conversations and speak like a millennial, was promptly taken offline less than a day after its debut. In less than 24 hours, Tay had gained more than 50,000 followers and produced nearly 100,000 tweets, slurred with bigotry and racism. [14]
Technically, it has been a success, since the chatbot learned from its users. However, a better understanding of the domain (Domain Knowledge) and some preliminary thoughts about the effect of echoing current contents on social media (the Problem Identification part on side effects) may have avoided such a failure.
USA 2016 Election All data-based forecasts for the 2016 presidential elections were in accord: Clinton will win, and the percentages ranged from 70% to 95%, but the voters chose Trump instead.
As in other cases considered before, this is not exactly a failure, since the analytics answers were accompanied by percentages, as stated in [7]: "The technology can be, and is, enormously useful. But the key thing to understand is that data science is a tool that is not necessarily going to give you answers, but probabilities, . . . people often do not understand that if the chance that something will happen is 70 percent, that means there is a 30 percent chance it will not occur. The election performance is not really a shock to data science and statistics. It's how it works." [7]. This case study prompted the new hint:
• Analytics' Accuracy: State which is a suitable error margin for the analytics computations in a BDA project.
In our opinion, Data Culture should also include a clear understanding of what a probability means, and that, in general, an analytics computation has an estimated margin of error, which may play a role in a BDA project. If a project intends to find all customers suitable for an email-based marketing campaign, an error of 10% may be ok, whereas if it is about managing the safety of a nuclear plant, a very low error margin is needed.
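A minimal sketch of the Analytics' Accuracy hint, assuming the acceptable error margin is stated per use case as an explicit project parameter (the thresholds and the measured error below are illustrative assumptions):

```python
# The acceptable error margin is declared up front and checked before the
# analytics results are acted upon.
ACCEPTABLE_ERROR = {
    "email_campaign_targeting": 0.10,   # 10% misclassification is tolerable here
    "plant_safety_monitoring": 0.001,   # near-zero tolerance here
}

def fit_for_purpose(use_case: str, estimated_error: float) -> bool:
    return estimated_error <= ACCEPTABLE_ERROR[use_case]

estimated_error = 0.07                  # e.g. measured on a held-out validation set
print(fit_for_purpose("email_campaign_targeting", estimated_error))   # True
print(fit_for_purpose("plant_safety_monitoring", estimated_error))    # False
```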
But, on the other hand, in this case it seems that there have also been technical errors (see e.g. the detailed analysis in [36]) and the use of flawed poll data [16] (a Data Quality issue). A similar case is the 2018 Football World Cup, where analytics forecasted the victory of Brazil defeating Germany with a relevant percentage, but in the end France won the cup. [6]
Project Shakti The Indian National Congress Party launched Project Shakti gearing up for the 2019 national elections, which in the end it lost. Shakti aimed to link party supporters to their mobile numbers, through a simple SMS. The idea was to know who worked for the party and to be able to connect with them. The biggest problem was data quality, since 50%-70% of the data, depending on the regions, were fake [2] (party volunteers were encouraged to insert as many people as possible into the system, and the result was that they added random names). Then, the data were managed in a siloed style, preventing access by those who might get something out of them. Finally, when the system was used to survey the connected people, questions were poorly posed, so that respondents echoed the view of the chiefs of the party.
The analysis presented above of the 21 projects, see Fig. 4, documented in the (grey and scientific) literature as big data/analytics failures led us to the following results. Six of them cannot be considered big data/analytics failures (5 and 7 failed for software engineering issues; 1 and 11 can hardly be considered BDA projects; 11 is a standard case of automatic generation of combinations of words; and 19 and 20 are considered failures because people cannot understand percentages and probabilities). The most frequently disregarded hints in the real failures are Domain Knowledge, Data Quality and No Data Silos, which confirms their relevance for the success of BDA projects.
Finally, the case analysis prompted three new hints: Sensible Results, Analytics' Accuracy and Thick Data, and the refinement of Problem Identification.

V. RELATED WORK

Saltz and others in [27] present a systematic literature survey run in 2016. Among other results, it provides a list of success factors for big data projects, some of which are included in our list (Data Scientists, Business and IT People Together, Problem Identification, Data Culture, Specific Skills, Executives' Support, Data Quality and Data Governance), while the others concern some tools.
In the literature, there are other papers investigating the success factors for BDA projects by means of literature surveys. In general, they examined a smaller set of papers, and the found factors are included in our list of hints. For example, [30] presents the results of a scientific literature review related to success factors and methodologies for big data projects. Differently from us, their search was driven by success-related keywords; moreover, they do not detail which repositories were explored and they consider only papers after 2016, 10 altogether. Some of the reported success factors are included in our list.
[20] reports on a limited personal opinion survey whose results agree on the relevance of three of our hints.

[4] is an interesting (scientific and grey literature) review primarily aimed to understand which are the "big data challenges" for a company to get a competitive advantage. It also partially investigates the causes of failures and agrees on the relevance of Specific Skills, followed by issues about collecting and integrating the data and the complexity of the used technologies. There is only a tiny intersection of cases with ours (two failed projects).
The review [1] presents a list of challenges for big data projects; most of those not related to the technology are matched by some of our hints, precisely: Specific Skills, Data Scientists, Business and IT People Together, Data Governance, Data Quality, and Data Security.

VI. CONCLUSION

This paper reports the results of a (scientific and grey) literature review aimed to shed light on the high rate of failure of BDA projects. We collected and analyzed documented failure cases to discover the root causes (RQ1), and collected what is assumed to reduce the chance of failure of those projects (RQ2). The survey in the end selected and examined 188 sources dated from 2011 (140 grey and 48 scientific; this is not surprising since the field is really new).
The answer to RQ2 is a set of hints to decrease the chance of failure of BDA projects, distinguished into those highly, moderately and scarcely supported by the examined sources; perhaps surprisingly, they concern the project conception, organization and management, but not technical issues.
The answer to RQ1 is a set of 21 projects documented as big data/analytics failures. However, their analysis revealed that six could not be considered failures at all or failed for other reasons. The analysis of the remaining ones resulted in improving one hint and in prompting further hints, two of which were, to the best of our knowledge, never proposed before. The results of this study could help people develop a BDA project and researchers in the field of software engineering interested in adapting current methods or devising new ones to cope with the challenges of big data and analytics.

REFERENCES

[1] Z. A. Al-Sai, R. Abdullah, and M. H. Husin. Big data impacts and challenges: A review. In IEEE JEEIT 2019, pages 150-155, 2019.
[2] V. Ananth. When a project to empower left Congress stranded. The Economic Times Web Site, 2019.
[3] P. Bajis. Before starting, consider 5 reasons your big data project will fail. SmartData Collective Web Site, 2018. www.smartdatacollective.com/before-starting-consider-5-reasons-your-big-data-project-will-fail/.
[4] H. Barham. Achieving competitive advantage through big data: A literature review. In PICMET 2017, pages 1-7, 2017.
[5] R. Boire and T. Senator. What not to do: Examples of business analytics failures. KDnuggets Web Site, 2012. www.kdnuggets.com/2012/06/business-analytics-failures.html.
[6] K. Budek. When predictive analytics in football fall short (an example). deepsense.ai Web Site, 2018. deepsense.ai/when-predictive-analytics-in-football-fall-short-an-example/.
[7] S. Butt. Election 2016 forecast - a big data failure? LinkedIn Web Site, 2016. www.linkedin.com/pulse/election-2016-forecast-big-data-failure-shahid-butt.
[8] Y. Chowdhury. 5 cases when big data failed. LinkedIn Web Site, 2016. www.linkedin.com/pulse/5-cases-when-big-data-failed-cybyte-webmaster/.
[9] Y. Chowdhury. 5 places where big data projects failed. CyByte Blog Web Site, 2016. blog.cybyte.com/5-places-where-big-data-projects-failed/print/.
[10] J. Ekdahl. Mitt Romney's "Project ORCA" was a disaster, and it may have cost him the election. Business Insider Web Site, 2012.
[11] O. Fleming, T. Fountaine, N. Henke, and T. Saleh. Ten red flags signaling your analytics program will fail. McKinsey Web Site, 2018. www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/ten-red-flags-signaling-your-analytics-program-will-fail.
[12] Gartner. Gartner says business intelligence and analytics leaders must focus on mindsets and culture to kick start advanced analytics. Gartner Web Site, 2015. www.gartner.com/newsroom/id/3130017.
[13] N. Heudecker. Tweet. Twitter Web Site, 2017. twitter.com/nheudecker/status/928720268662530048.
[14] E. Khoury. 7 big data blunders of 2016. VCloud News Web Site, 2017. www.vcloudnews.com/7-big-data-blunders-of-2016/.
[15] Y. Liu, H. Han, and J. DeBello. The challenges of business analytics: Successes and failures. In Proceedings of HICSS-51, 2018.
[16] S. Lohr and N. Singer. How data failed us in calling an election. The New York Times Web Site, 2016.
[17] K. Lum and W. Isaac. To predict and serve? Significance Magazine Web Site, 2016. rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1740-9713.2016.00960.x.
[18] mahylights. Can data save us all? Lessons from attempting to transform Sears and Kmart through data & analytics. Harvard Business School Web Site, 2017. digital.hbs.edu/platform-digit/submission/can-data-save-us-all-lessons-from-attempting-to-transform-sears-and-kmart-through-data-analytics/#.
[19] B. Marr. Where big data projects fail. Forbes Web Site, 2015. www.forbes.com/sites/bernardmarr/2015/03/17/where-big-data-projects-fail/#4ca727c4239f.
[20] G. J. Miller. Comparative analysis of big data analytics and BI projects. In FedCSIS 2018, pages 701-705, 2018.
[21] S. Narayanan. From big data to botched data: 5 steps to total big data failure. MIS Asia Web Site, 2014. www.mis-asia.com/resource/applications/from-big-data-to-botched-data-5-steps-to-total-big-data-failure/.
[22] F. Rabhi, M. Bandara, A. Namvar, and O. Demirors. Big data analytics has little to do with analytics. In A. Beheshti, M. Hashmi, H. Dong, and W. Zhang, editors, Service Research and Innovation, pages 3-17. Springer, 2018.
[23] P. Ramaswamy. How to create a data culture. Cognizant Web Site, 2015. www.cognizant.com/InsightsWhitepapers/how-to-create-a-data-culture-codex1408.pdf.
[24] T. Reddy. 7 big data blunders you're thankful your company didn't make. Umbel Web Site, 2014. www.umbel.com/blog/big-data/7-big-data-blunders/.
[25] G. Reggio and E. Astesiano. Big data analytics projects failure literature review: Detailed information, 2020. Web page available at https://sepl.dibris.unige.it/2020-FAILURE.php.
[26] S. Remington. How to avoid analytics failure: 10 factors for analytics success from research and practice. Minerra Web Site, 2018. www.minerra.net/business-analytics/how-to-avoid-analytics-failure-10-factors-for-analytics-success-from-research-and-practice/.
[27] J. S. Saltz and I. Shamshurin. Big data team process methodologies: A literature review and the identification of key factors for a project's success. In Big Data 2016, pages 2872-2879, 2016.
[28] G. Satell. 4 things every leader should know before starting a big data project. Inc. Web Site, 2018. www.inc.com/greg-satell/4-things-every-leader-should-know-before-starting-a-big-data-project.html.
[29] M. Schrage. Learn from your analytics failures. Harvard Business Review Web Site, 2014.
[30] M. Soukaina, H. Anoun, M. Ridouani, and L. Hassouni. A study of the factors and methodologies to drive successfully a big data project. In ICDS 2019, pages 1-6, 2019.
[31] L. Tucci. Big data can mean bad analytics. TechTarget - SearchCIO Web Site, 2013. searchcio.techtarget.com/opinion/Big-data-bad-analytics.
[32] Unknown. Report: CIOs & big data: What your IT team wants you to know. Infochimps Web Site, 2013. www.infochimps.com/resources/report-cios-big-data-what-your-it-team-wants-you-to-know-6/.
[33] Unknown. How to avoid big data project failures. Ingram Micro Inc. Web Site, 2014. www.ingrammicroadvisor.com/data-center/how-to-avoid-big-data-project-failures.
[34] T. Wang. The human insights missing from big data. TED Talk Web Site (video), 2016.
[35] A. White. Our top data and analytics predicts for 2019. Gartner Web Site, 2019. blogs.gartner.com/andrew_white/2019/01/03/our-top-data-and-analytics-predicts-for-2019/.
[36] A. Woodie. Six data science lessons from the epic polling failure. Datanami Web Site, 2016. www.datanami.com/2016/11/11/data-science-lessons-epic-polling-failure/.