
Industrial Marketing Management 86 (2020) 16–29

journal homepage: www.elsevier.com/locate/indmarman

You have not been archiving emails for no reason! Using big data analytics to cluster B2B interest in products and services and link clusters to financial performance
Yang Yang (a), Eric W.K. See-To (b), Savvas Papagiannidis (c)

(a) Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
(b) Department of Computing and Decision Sciences, Faculty of Business, Lingnan University, Tuen Mun, Hong Kong
(c) Business School, Newcastle University, 5 Barrack Road, Newcastle upon Tyne NE1 4SE, UK

ARTICLE INFO

Keywords: Big data; Analytics; Emails; Internal; Market research; Sales

ABSTRACT

The potential of big data analytics when it comes to gaining business insights, such as market trends and consumer preferences, has captured the interest of both scholars and business practitioners. However, the extant literature has so far provided limited empirical evidence to demonstrate how big data analytics can create business value. To address this research gap, this paper followed a novel big data analytical approach that involved analysing email archives about product/services demand clusters in a B2B setting. We analysed 621k emails exchanged between 2009 and 2018. We identified a number of discussion clusters that were considered proxies for the interest buyers expressed in the products/services on offer. These clusters and associated discussion trends were linked to the company's revenues and financial performance, showing good predictive power. In doing this, we have demonstrated how widely available data, such as emails, which all companies have, can be used to underpin new methods for the early identification and monitoring of product demand trends, informing marketing strategies.

Corresponding author.
E-mail address: savvas.papagiannidis@ncl.ac.uk (S. Papagiannidis).

https://doi.org/10.1016/j.indmarman.2019.01.016
Received 30 July 2018; Received in revised form 28 December 2018; Accepted 25 January 2019
Available online 01 February 2019
0019-8501/ © 2019 Elsevier Inc. All rights reserved.

1. Introduction

For practitioners, and especially marketers, the environment has been rapidly changing due to the unprecedented volume, velocity, and variety of data available from consumers, competitors, and partners (John Walker, 2014). The speed with which information is generated requires different data management and faster market analyses than traditional market analytics can handle. Emerging literature has aimed at understanding the value of big data analytics in exploiting business insights (Gordon & Mohammad, 2012; Mithas, Lee, Earley, Murugesan, & Djavanshir, 2013) and the implications these can have. For example, Wiersema (2013) believed that the potential power of big data techniques can make current B2B models obsolete. Lilien (2016) also agreed on the importance of data-driven decisions in the B2B field, highlighting the need to harness the potential of big data and analytics. Companies that perform well against the competition have been found to rely more on big data analytics (Asay, 2014). Scholars have recognised that big data analytics can lead to sustainable competitive advantage by identifying underlying business insights (Erevelles, Fukawa, & Swayne, 2016; Lycett, 2013). Despite the above, traditional market analytical techniques and tools are still the dominant way of gaining business insights (Xu, Frankwick, & Ramirez, 2016). Unlike big data analytics, which aim to analyse massive volumes of data in real time, traditional marketing analytics focus mainly on improving key performance indicators such as market share, customer relationship, and revenues (Sathi, 2014). It is true that traditional marketing analytics can still provide useful business insights (Rusetski, 2014). Still, they can often be limiting when it comes to capturing information and making sense of large and unstructured data, which may result in unreliable or biased decision-making (Xu et al., 2016). With the increasing volume and complexity of business information, the data-driven decision-making process is expected to play an increasingly important role, with big data analytics being the underpinning tool to support this evolution (Gustke, 2013).

Due to the novelty of big data analytics in the B2B context, studies assessing its effect and performance are scarce. Most of the studies solely focus on the development of conceptual frameworks to understand the impact of big data analytics on marketing activities (Erevelles et al., 2016; Gunasekaran et al., 2017; Xu et al., 2016). Such limited demonstrations of big data analytics may also be due to the difficulties in
collecting data from diverse information sources (Fan, Lau, & Zhao, 2015). Only a few studies have presented applications of big data techniques in the B2B context. For example, Bohanec, Borstnar, and Robnik-Sikonja (2015) developed machine learning models to better understand B2B sales forecasting. Similarly, D'Haen and Van den Poel (2013) presented an iterative three-phased automated machine learning model to help acquire clients in a B2B environment. It is worth noting that typically such studies consider data acquired from outside data providers or collected from the Internet (Bohanec et al., 2015; D'Haen & Van den Poel, 2013; Lackman, 2007). Despite the increasing importance and relatively higher accessibility of data generated in companies' internal environments (McAfee, Brynjolfsson, Davenport, Patil, & Barton, 2012), it is not often the case that such data sets are utilised. The increasing availability of electronic information generated by daily operations, such as emails, trading records or consumer reviews, has created an accessible data pool for management and provides several business opportunities. Identifying an approach to extract the underlying knowledge contained in heterogeneous information sources is a difficult, but worthwhile, challenge for both marketing strategists and scholars. Part of the challenge relates to dealing with information overload (Eppler & Mengis, 2004). In the B2B context, as business and marketing-related data become larger and more inexplicable, the limited processing capacity of decision-makers causes bottlenecks in deciphering and interpreting a complicated environment (Gordon & Mohammad, 2012). This problem is widely recognised for its adverse effect, causing loss of productivity or creating stress, in turn reducing the performance of individuals and organisations (Eppler & Mengis, 2004). Considering the high complexity and dynamics of B2B markets, the negative influence of information overload can be amplified by the complexity of decision-making (Edmunds & Morris, 2000). Thus, there is a pressing need to present information in a manageable and easy-to-comprehend manner.

Given the above, the objective of this paper is threefold. First, we provide empirical evidence in the form of a novel big data methodology of how big data analytics can be leveraged in the B2B context to obtain marketing insights from a new vantage point. The presented approach can help transform data that is often considered redundant into input for marketing and managerial decision-making. We apply our methodology to the email archives of Company X, a third-party assurance firm that specialises in testing, inspection and certification services. Second, given that the major barrier to applying big data analytics in the B2B environment is the great difficulty of collecting and storing data, we considered an internal data stream to address this challenge. Instead of acquiring costly data from the external environment, we propose utilising the most conventional, but often unexploited, information source inside a company: business emails. Finally, the identified clusters of interest are then related to financial performance, showcasing the potential that the information extracted can have when it comes to producing tangible results for organisations. Systematising the collection, analysis and presentation can offer marketing managers a demand/performance forecasting tool that complements their existing arsenal.

2. Literature review

2.1. Big Data and analytics in the B2B context

Nearly 2.5 billion gigabytes of data are produced every day (Sivarajah, Kamal, Irani, & Weerakkody, 2017), and this number is doubling every 40 months or so (George, Haas, & Pentland, 2014). Laney (2001) used volume, velocity and variety, known as the 3Vs, to capture the key dimensions of big data. Volume refers to the vast amounts of data. Velocity refers to the speed at which new data is generated and the speed at which it moves around. Finally, variety refers to the range of data types and sources. Big data may involve a number of different sources such as web pages, user-generated content, social media, and even business data like transactional records or daily emails (McAfee et al., 2012). This suggests that companies should look not just externally for data, but also explore making good use of the internal data. This is especially true given that the increasingly digitised business environment and activities are creating new streams of data and information that could be leveraged to gain significant advantage by enhancing productivity and competitiveness (Manyika et al., 2011). Manyika et al. (2011) describe big data as “the next frontier for innovation, competition, and productivity”. Big data can change competition by transforming processes, altering corporate ecosystems, and facilitating innovation (Brown, Chui, & Manyika, 2011). Not surprisingly, and given the potential benefits that big data can result in, there is a “widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy” (Boyd & Crawford, 2012, p.663). Research and business applications should be driven by the big questions, not the big data as such (Papagiannidis, See-To, Assimakopoulos, & Yang, 2018). The ability to analyse massive amounts of data should not be an end in itself but, instead, analytics should focus on applying statistics for gaining practical insight (Wang, Gunasekaran, Ngai, & Papadopoulos, 2016). Big data can be leveraged effectively to support business decisions. It has been recognised that big data analytics can be embraced as a disruptive technology that will reshape marketing intelligence (Fan et al., 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011). For example, McAfee et al. (2012) put forward empirical evidence to show that companies relying more on data-driven decision-making are performing better when it comes to productivity and profitability. Similarly, Davenport (2014) presented a number of cases of organisations drawing competitive advantages from the use of data and analytics. For instance, Wal-Mart uses dynamic big data analytics to improve its supply chain strategy and pricing models. In addition, LaValle et al. (2011) found that top-performing companies use analytics five times more than lower performers, and half of the participating companies stated that the improvement of information and analytics was a top priority in their organisations. However, they also argue that the process of transforming big data analytics into business value is still far from mature.

Three challenges have been identified in relation to B2B data analytics (Martin et al., 2015). First, establishing a long-term mechanism to continually collect data from multiple information sources could be a costly task. This requires investment in new approaches, applications, and frameworks for effective data management (Fan et al., 2015). Data in the B2B context can be less voluminous and more challenging to collect than data from consumer sources. Secondly, there is a trend in big data analytics towards mixing first-party, reasonably verified data, with public and third-party external data, which has largely not been validated and checked by any standard methodology. The reliability of any finding based on such mixed data sources may vary widely (Kaisler, Armour, Espinosa, & Money, 2013). Considering the importance of basing decisions on data analytics, for instance, when it comes to informing marketing strategies and campaigns, this limitation can be critical. Thirdly, changing the experience-based decision-making culture into data-driven thinking can be challenging. As McAfee et al. (2012) stated in their business review, “exploiting vast new flows of information can radically improve your company's performance. But first you'll have to change your decision-making culture”. This is particularly true when data is expensive to obtain or is in an unstructured format. To support a data-driven culture, data analytics professionals need to map the available data and devise a plan as to how to effectively communicate this knowledge to the domain experts (Chen, Chiang, & Storey, 2012). Additionally, using big data as essential components of business decision-making requires new capabilities, which include IT infrastructure, as well as organisational and cultural changes (Bughin, Chui, & Manyika, 2010). As such, even when companies can access large data sets, they still cannot analyse them in meaningful ways, resulting in sub-optimal outputs (Griffin et al., 2013;
Lilien, 2016). Tackling the challenges mentioned above can form the main pillars of a strategy to address common issues related to big data in B2B settings and maximise returns. The business value of big data analytics stems from its ability to create a data-driven decision culture in B2B companies, helping them gain competitive advantages (Morabito, 2015; Wedel & Kannan, 2016).

2.2. Using emails for B2B marketing insights

A possible way to alleviate the data collection difficulty for big data analytics is to reuse the data generated in the internal environment of the companies, such as business emails. Email is the most prevalent channel and collaboration tool in both business and social environments. Since the mid-90s, when its popularity exploded, email has continued to be a quick and user-friendly communication and decision-making tool within organisations (Lahiri, Mihalcea, & Lai, 2017). Compared to other communication channels, email is cost-effective (Whittaker, Bellotti, & Moody, 2005) and provides a traceable record via quoted messages (Tedmori, Jackson, & Bouchlaghem, 2006). A survey conducted in 2017 (The Radicati Group, 2017) estimated that there are over 6.3 billion email accounts. This figure is predicted to reach 7.7 billion by 2021, i.e. a growth of more than 22%. When it comes to business communications, email continues to play a dominant role, bringing tangible results. More than 60% of consumers would prefer to be contacted by brands via email, with 55% of companies generating more than 10% of their sales by email (Bawm & Nath, 2014).

Given the importance of emails in the business environment, there is a rich body of literature studying the impact of emails at the organisational or individual level. When it comes to the business use of emails, three main literature streams exist. The first one examines knowledge source identification (or “expert location”) (Jackson, Tedmori, Hinde, & Bani-Hani, 2012; Paul, 2016; Tedmori et al., 2006), and aims to identify users who hold the knowledge in an organisation. The second type is spam email detection by using text-mining techniques (Ajaz, Nafis, & Sharma, 2017; Basavaraju & Prabhakar, 2010). The final one considers a variety of algorithms that have been developed to extract keywords from emails (Lahiri et al., 2017; Rose, Engel, Cramer, & Cowley, 2010; Shah, Perez-Iratxeta, Bork, & Andrade, 2003). However, among the studies in these streams, only a few of them attempted to use big data analytics on emails to create practical business value in real-world cases. For example, Li, Sen, and Zaman (2015) attempted to retrieve keywords from business emails, concluding that they could be utilised for business management solutions. Although this study is relatively close to business information extraction, it did not demonstrate how the extracted information can support decision-making. From a marketing communications perspective, marketers consistently ranked email as the single-most-effective tactic for awareness, acquisition, conversion, and retention (Deal, 2014). Email can provide several important, often unexploited, opportunities for knowledge-finding, and the knowledge extracted from emails can be accessed and reused directly (Janine, Ton, & Van Joolingen Wouter, 2004). However, as organisations adopt email as the primary method of communication, they often neglect the fact that email content contains information about business decisions, actions and transactions (Jackson & Tedmori, 2004). These business emails become documents and records with legal requirements and restrictions. The challenge is that due to the unstructured format of emails and the large volumes often accumulated, the knowledge that they contain is difficult to extract (Jackson & Tedmori, 2004).

The business information contained in emails is critical because it can be used for several purposes including market trend identification (Kok & Yih, 2009). Losing or failing to optimally exploit the underlying business information in these unstructured texts can lead to a significant loss of business opportunity (Duff, 1996). More importantly, companies rarely recognise email as an essential information source to support big data analytics (Jackson & Tedmori, 2004). In the previous section, we have suggested a different way of addressing the difficulties in collecting data (Chen et al., 2012; Kaisler et al., 2013). Given that email is profoundly embedded in the daily operation and communication of a company, it can be continuously generated and added to an ever-growing and traceable repository (Campbell, Maglio, Cozzi, & Dom, 2003). Additionally, companies do not need to invest resources to collect email data. The automatic archiving feature of emails solves the data collection problem in applying big data analytics and enables dynamic and real-time analysis (Alrashed, Awadallah, & Dumais, 2018). In view of the cost-effectiveness and content richness, we suggest that email should be deemed an important data source for big data analytics in B2B companies. A possible limitation of using email as the data source is privacy. For example, mining emails, even when results are displayed in aggregated forms, might expose private information (Jackson, Yates, & Orlikowski, 2007). Pre-processing to distinguish personal emails and business emails may be necessary (Goodpaster, 2015).

Given the above and to contribute towards addressing the challenges outlined, this study will explore adopting big data techniques to extract business information from a corporate email archive containing B2B communications. In doing so it will showcase how to utilise the contained business information to support management decision-making.

3. Methodology

3.1. The data

In this study, we obtained an email archive from Company X. Company X is a third-party assurance firm that specialises in testing, inspection and certification services. This company has several teams of experts to meet the testing and inspection needs of manufacturers, traders, and buyers around the world. Over the past few years, the number of professionals in Company X has been growing, reaching 300 testing staff with a qualification certificate, while its service scope and service sectors are expanding rapidly (7 divisions providing more than 200 testing services). In 2017, the revenue of Company X reached 105 million HK dollars and the company served more than 1200 customers worldwide. To meet the increasingly diversified testing requirements of customers, Company X has established a team of experts. This team consists of multiple senior testing staff from different departments and is responsible for providing customers with professional and timely testing recommendations. According to the instruction from the CEO, the expert team represents the best that Company X can offer and as such their professional opinion is highly valued by both customers and Company X management. In this study, the expert team provided the necessary information about the company's products and services. They also suggested possible topics that staff may discuss in daily operations, based on their experience with customer interactions. Their suggested topics were used as a valuable reference for our analysis, as they offered a benchmark for the methodology's findings.

Company X provided an archive containing 18 GB (originally 292 GB of Microsoft PST format files that included attachments) of text emails generated or received by employees of the Food and Pharmaceutical (F&P) department. The archive covered a period of about 10 years (from June 2007 to March 2018). The raw dataset contained 621,090 unique messages (i.e. considering only the new messages sent and not previous replies that may have been contained in the same email), all in English and UTF-8 encoded. Each data point had four features: (a) sender, (b) receiver, (c) email subject and (d) email body. As we extracted data from the inbox of Outlook software, the emails collected from each employee were actually the emails received by them. On average, each employee received 15,925 emails with a standard deviation of 24,702 emails. Fig. 1 shows the distribution of emails among the 39 employees. It should be noted that a few employees have a small number of emails received or available (e.g., employee ID 7, 14, 18, 19,
20). This is because these employees joined the company after July 2017, thus having fewer emails compared to other colleagues. The difference among individuals does not affect the analysis as we considered the entire department as a whole.

Fig. 1. Distribution of incoming emails by employee.

3.2. Email cleaning and discussion thread identification

Before exploring the latent business topics in this archive, we applied a number of pre-processing techniques to eliminate noise and unnecessary information. Firstly, we removed all non-business emails from the dataset. The non-business email extraction was based on the detection of keywords. A “non-business dictionary” was manually established to support a rule-based classifier (Androutsopoulos, Koutsias, Chandrinos, & Spyropoulos, 2000). An example of the simple rule is: IF “word in the dictionary appears in the subject” OR “word in the dictionary appears in email body” THEN “the email is a non-business email”. The dictionary contained 27 words or phrases like “fax notification”, “automatic reply”, “alert email”, and “sick leave”. The entries added to this dictionary were provided by the supervisors of the IT and human resource departments. In total, we removed 76,881 non-business emails (544,209 remained).

We then extracted the body text from each email and deleted all the greetings, closings, sign-offs, and personal information as these do not contribute to our study's objectives. In the next step, we combined emails with the same subject. For instance, if an employee had issued a test enquiry and received a reply from an expert, these two emails were combined as one data point under a subject generated by the employee. One issue was that it was possible to have different threads sharing the same subject title, e.g. “test enquiry”. To address this issue, we checked the raw data and consulted with the department's expert team. As a professional certification institution, this firm has strict regulations for business email writing, and the use of ambiguous language in email subjects is forbidden. General subject titles like “sales enquiry” or “test enquiry” are seldom used in this email archive. Although we still found a small number of emails with general subjects such as “enquiry” and “result”, they were not expected to significantly influence our analysis. After the above steps we ended up with a total of 544,209 emails. We then combined emails according to their subjects as planned. After this step, the dataset contained 61,435 threads, with 8.8 emails per subject on average (the standard deviation was 12.3).

3.3. Content pre-processing

The content pre-processing phase converted the original email texts into a data-mining-ready structure. Firstly, we removed stop-words from the subject line and body, based on a combined dictionary. This 870-word dictionary was built based on the complementary set of several English stop-words lists, which included the NLTK stop-words list (Madnani, 2007), Fox stop-words list (Fox, 1989), GoogleStopWords (Rose et al., 2010) and MySQL stop-word list (Moh & Bhagvat, 2012). Following the application of stop-words, all the names (based on the NLTK name corpus), numbers, symbols, punctuation marks and measurement units were removed. As a last step, in order to avoid the data sparsity problem in topic modelling (Hong & Davison, 2010), we removed emails that had a length shorter than five words. The refined dataset contained 43,137 threads.

3.4. Topic modelling

In extant studies of topic modelling, four techniques are well recognised: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Explicit Semantic Analysis (ESA) and Latent Dirichlet Allocation (LDA). LSA is a classic technique to cluster documents and extract latent topics. The core idea of LSA is to generate a documents-terms matrix and use singular value decomposition (SVD) to decompose the matrix into a separate document-topic matrix and a topic-term matrix. The basic rationale behind LSA is that documents which share frequently co-occurring terms will have a similar representation in the vector space. LSA is a simple and easy-to-use approach, but it does have a primary limitation: low efficiency. The size of the documents-terms matrix can significantly increase with a large dataset. Accordingly, the high computational complexity for the SVD process can make an analysis unfeasible. To address the efficiency problem of LSA, Hofmann (2001) proposed pLSA, which uses a probabilistic generative process to infer hidden topics using posterior inference. However, one shortcoming of pLSA is that it contains many free parameters and the number of parameters grows linearly with the size of training documents (Hoffmann, 2011). Again, for a large document set, pLSA can become very complex. LDA extends pLSA to deal with the weakness of free parameters. LDA uses Dirichlet priors for the document-topic and word-topic distributions, lending itself to better generalization and much
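The cleaning, threading and pre-processing steps of Sections 3.2 and 3.3 can be illustrated with a minimal sketch. It is not the authors' script: the four dictionary phrases are the only entries quoted in the text, while the stop-word set, the helper names (`is_non_business`, `thread_key`, `build_threads`, `preprocess`), the stripping of reply prefixes and the four sample emails are assumptions made for illustration. Removal of personal names and measurement units is omitted, and the five-word threshold is applied after stop-word removal, which the text does not specify.

```python
import re
from collections import defaultdict

# Four of the 27 dictionary entries quoted in Section 3.2; the full list is not published.
NON_BUSINESS_PHRASES = ("fax notification", "automatic reply", "alert email", "sick leave")

# Tiny illustrative stand-in for the 870-word combined stop-word dictionary (assumption).
STOP_WORDS = {"a", "an", "and", "for", "i", "is", "of", "please", "the", "to", "we"}

MIN_WORDS = 5  # threads shorter than five words are dropped (Section 3.3)


def is_non_business(subject, body):
    """Section 3.2 rule: a dictionary entry in the subject OR the body flags the email."""
    subject, body = subject.lower(), body.lower()
    return any(p in subject or p in body for p in NON_BUSINESS_PHRASES)


def thread_key(subject):
    """Assumption: strip reply/forward prefixes so replies share the original subject."""
    stripped = re.sub(r"^\s*((re|fw|fwd)\s*:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def build_threads(emails):
    """Drop non-business emails, then merge the bodies of emails with the same subject."""
    threads = defaultdict(list)
    for email in emails:
        if not is_non_business(email["subject"], email["body"]):
            threads[thread_key(email["subject"])].append(email["body"])
    return {subject: " ".join(bodies) for subject, bodies in threads.items()}


def preprocess(threads):
    """Keep alphabetic tokens only, remove stop-words, and drop very short threads."""
    docs = {}
    for subject, text in threads.items():
        tokens = re.findall(r"[a-z]+", f"{subject} {text}".lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if len(tokens) >= MIN_WORDS:
            docs[subject] = tokens
    return docs


# Invented sample emails, for illustration only.
emails = [
    {"subject": "Heavy metal test for rice sample",
     "body": "Please quote lead and cadmium testing for the rice sample."},
    {"subject": "RE: Heavy metal test for rice sample",
     "body": "Quotation attached, turnaround is 5 working days."},
    {"subject": "Automatic reply: out of office", "body": "I am away."},
    {"subject": "Result", "body": "Thanks."},
]

threads = build_threads(emails)
docs = preprocess(threads)
print(len(threads), "threads;", len(docs), "kept after pre-processing")
```

In this toy run the automatic reply is filtered by the dictionary rule, the enquiry and its reply collapse into one thread, and the two-word “Result” thread falls below the length threshold. The token lists in `docs` are the kind of input a topic model would then consume.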

19
Y. Yang et al. Industrial Marketing Management 86 (2020) 16–29

fewer parameters (Chien & Wu, 2008, Chang & Chien, 2009). Finally, 4. Results and discussion
ESA is not considered in our study as this technique cannot discover
latent topics. Instead it can produce labelled concepts based on an ex- Before embarking on the analysis, several parameters for the LDA
isting knowledge base (e.g., Wikipedia). Thus, ESA cannot address our model needed to be determined. First, the number of topics had to be
research objective: to discover the underlying information that is not specified. We consulted with the expert team. They provided us with a
identified in the business emails. list showing the 15 topics they commonly discussed in the F&P de-
Given the above, we employed LDA to identify the latent topics in partment (Table 1). The topic selection was based on the service in-
this dataset. LDA is a popular topic modelling technique for exploring troductory brochure of Company X (9 types of testing services). We also
document collections (Blei, Ng, & Jordan, 2003). LDA topics are for- asked the expert team to recall any other possible topics they commonly
mally a multinomial distribution over words, and by convention the top discussed in daily operations (6 topics). These 15 topics were then
ten words are used to identify the subject area or give an interpretation considered as a reference for our analysis. If our algorithm could extract
of a topic. The generative probabilistic model of LDA can be described topics that matched the majority of the manually assigned labels, then
as a three-step process: first, LDA assigns a set of topics (t) for each we would have had a way of validating the topic extraction accuracy of
document (d) with Dirichlet distribution, i.e., P(t| d). Second, each topic LDA algorithm and its relevance to Company X business operations. In
defines some unique multinomial probability over the words (w) in the this case, a perfect match between the known topics and the generated
dictionary, i.e., P(w| t). Third, the product of P(t| d) and P(w| t) is cal- ones may have not been desirable either, as if such a match was to be
culated to find the probability of a word i existing in a given document achieved then no new business insights could be generated. Given the
d and the word i belonging to the mixture word set of topic t. This above, to maintain the comparison between expert suggested topics and
process can be formalised as: machine-extracted topics from daily emails, we also fixed 15 as the
number of topics for our LDA model. This setting provided us with an
T
opportunity to compare the topics generated from emails vs. the topics
P (w | d ) = P (w | t = k ) P (t = k | d )
k=1
given by the firm experts, which may generate more insights from the
latent business topics in the emails.
Where P(w | t = k) is the probability of word i belonging to the word mixture of topic k, and P(t = k | d) is the probability that a word in document d is drawn from topic k. These two probabilities are estimated from the training document sets using Dirichlet priors and a fixed number of topics. Blei et al. (2003) chose to use the Gibbs sampling approach (Darling, 2018) to iterate multiple times over each word i in document d, and sample a new topic k for the word based on the probability P(w | d), until the LDA model parameters converge. The outcomes of the LDA model are a document-topic matrix and a topic-word matrix. Leveraging these two matrices, we can cluster documents into semantically meaningful groupings and understand the content. This technique matches our research purpose, as the main aim of using LDA is to explain the nature of data by producing interpretable clusters. For example, Griffiths and Steyvers (2004) used LDA to identify scientific topics in a large set of academic documents. Similarly, we would like to identify business activities in the email archive. Compared to traditional topic modelling techniques like Latent Semantic Analysis or Probabilistic Latent Semantic Analysis, the number of parameters that LDA has to estimate only scales with the number of topics, making it much better suited to working with large datasets (Tirunillai & Tellis, 2014). We coded a Python (a high-level programming language for general-purpose programming) script to implement LDA on the target dataset. We leveraged the well-developed topic modelling package, gensim, to apply LDA on the refined dataset.

Table 1. Topics (B2B products and services) suggested by the expert team.
Topic ID | Topic label
1 | Hokla enquiry, information and feedback
2 | Microbiological testing
3 | Heavy metal testing
4 | Acid migration testing
5 | Promotional events and business development
6 | Fabric and textile material testing
7 | Food webinars and training
8 | Food inspection, audit and certification
9 | Food contact material and food grade testing
10 | Proprietary Chinese medicines testing and pharmaceutical services
11 | Invoices and quotations
12 | Customer service enquiry
13 | Genetically modified organisms and genetically modified food testing
14 | Shelf life determination
15 | Others

Secondly, we used an asymmetric Dirichlet prior in the LDA estimation, with α = “auto” and β = 0.01, which are common settings in the literature (Wei & Croft, 2006). Setting α = “auto” enables the model to learn an asymmetric document–topic prior directly from the training data (Řehůřek & Sojka, 2010). Setting an asymmetric Dirichlet prior over the document–topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic–word distributions provides no real benefit (Wallach, Mimno, & McCallum, 2009). Then, following the instructions of Blei et al. (2003), the number of learning iterations for the training model is the number of passes used to find the optimising values of the variational parameters for each document. In this case, we fixed the iterations at 4000, in order to ensure that the LDA model parameters can converge (the log file of gensim confirmed that all the parameters converged within 3700 iterations). Table 2 presents the result generated using the aforementioned settings. LDA requires manual labelling, according to the distribution of top words. Thus, we generated labels together with the company to understand the business nature of each email cluster, namely how emails related to the products and services offered by Company X.

To assess the performance of the LDA model, we compared the B2B product and service topics given by the expert team and the topics generated by LDA. Table 3 shows the results of topic matching. We used the statistical measures of precision and recall to evaluate the accuracy of LDA. Out of the 15 topics identified by LDA, 12 were true positives, that is, they exactly match 12 of the manually assigned topics. There were three false positives in the set of LDA topics, resulting in a precision of 80%. Comparing the 12 true positives to the total of 15 manually assigned topics, the recall was also 80%. As an unsupervised technique that needs no domain knowledge, LDA still reached 80% precision and recall, showing an acceptable performance. Additionally, LDA identified three underlying topics that may have been neglected by the company, which were “promotion and marketing emails”, “application emails” and “test advisory”. These three topics, especially promotion emails, can still contribute to a better understanding of the business condition of the firm. This was particularly true for the “other” cluster, as our approach opens up a potential “black box” of areas that, at a given moment in time, are not necessarily of significant interest. Still, Company X could explore opportunities among them for potential future growth.

Below are the word-cloud visualizations of four topics (see Fig. 2). Such a visualisation technique can help the practitioners easily identify the business information for each cluster of emails. The weight of words


Table 2. Top ten words of the 15 clusters of B2B products and services on offer.
Cluster ID | Cluster label | Top 10 words
1 | Heavy metal tests | Metal, rsi, heavy, content, cadmium, phthalate, accessory, jewelry, hair, application
2 | Water and microbiology tests | Water, method, aatcc, microbiology, srv, cushion, standard, legionella, iso, textile
3 | Fabric related tests | Fabric, plate, light, green, textile, quotation, advise, swab, black, composition
4 | Application emails | Application, contact, question, report, provide, advise, item, office, origin, letter
5 | Invoice related emails | Invoice, payment, account, industry, bank, create, pdf, term, outstand, webinar
6 | Webinars and trainings | Webinar, message, training, seafood, instrument, contain, transmission, industry, premium, icp
7 | Customer service | Client, contact, iaq, measure, provide, report, air, application, schedule, advise
8 | Hokla related emails | Hokla, method, compliance, apc, coli, bag, cfu, detect, special, intern
9 | Others | Ice, pesticide, tpc, zone, branch, item, provide, dose, opinion, addition
10 | Acid migration tests | Migration, acid, acet, oil, article, phthalate, ethanol, metal, heavy, simulation
11 | Test advisory | Feedback, order, regulation, advise, express, leachable, style, proceed, Europe, cordial
12 | Marketing and promotions | Invite, market, astm, promotion, course, provide, event, university, cost, nutrition
13 | Audit related emails | Audit, document, schedule, record, author, contact, preparation, rate, hospital, suggest
14 | Material related tests | Material, plastic, silicon, steel, quotation, fda, advise, stainless, coat, lid
15 | Pharmaceutical related tests | Pharmacopoeia, volume, bottle, tablet, republic, apc, vitamin, coli, regulation, capsule

Table 3. Topic matching.
Matching status | 15 topics given by the expert team | 15 topics generated by LDA
Topics that matched | Hokla enquiry, information and feedback | Hokla related emails
Topics that matched | Microbiological testing | Water and microbiology tests
Topics that matched | Heavy metal testing | Heavy metal tests
Topics that matched | Acid migration testing | Acid migration tests
Topics that matched | Fabric and textile material testing | Fabric related tests
Topics that matched | Food webinars and training | Webinars and training
Topics that matched | Food inspection, audit and certification | Audit related emails
Topics that matched | Food contact material and food grade testing | Material related tests
Topics that matched | Proprietary Chinese medicines testing and pharmaceutical services | Pharmaceutical related tests
Topics that matched | Invoices and quotations | Invoice related emails
Topics that matched | Customer service enquiry | Customer service
Topics that matched | Others | Others
Topics that could not be matched | Allergen identification | Promotion and marketing emails
Topics that could not be matched | Genetically modified organisms and genetically modified food testing | Application emails
Topics that could not be matched | Shelf life determination | Test advisory
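
The precision and recall figures derived from the topic matching in Table 3 can be reproduced with a short calculation. The sketch below is illustrative only: the function name and the encoding of the counts as true positives, false positives and false negatives are assumptions, not the script used in the study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw match counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Counts read off Table 3: 12 matched topics, 3 unmatched LDA topics
# (false positives) and 3 unmatched expert topics (false negatives).
precision, recall = precision_recall(true_pos=12, false_pos=3, false_neg=3)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

Because both topic lists contain 15 entries, the numbers of false positives and false negatives coincide, which is why precision and recall take the same value here.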

Fig. 2. Word-cloud visualizations of four topics.
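
As an illustration of how word weights such as those in Fig. 2 and Table 2 can be read off a topic-word matrix, the following minimal sketch ranks words by P(w | t) and combines the topic-word and document-topic probabilities into P(w | d). The vocabulary and probabilities are made-up toy values and the function names are hypothetical; this is not the gensim-based script used in the study.

```python
def top_words(topic_word_row, vocab, n=10):
    """Rank the vocabulary by P(w | t) for one topic and keep the top n words."""
    ranked = sorted(zip(vocab, topic_word_row), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def word_prob_in_doc(word_index, topic_word, doc_topic):
    """P(w | d) = sum over topics k of P(w | t = k) * P(t = k | d)."""
    return sum(row[word_index] * p_topic for row, p_topic in zip(topic_word, doc_topic))


# Toy example (assumed values): 2 topics over a 3-word vocabulary.
vocab = ["metal", "invoice", "water"]
topic_word = [
    [0.7, 0.1, 0.2],  # P(w | t = 0)
    [0.1, 0.8, 0.1],  # P(w | t = 1)
]
doc_topic = [0.5, 0.5]  # P(t = k | d) for one document

print(top_words(topic_word[0], vocab, n=2))
print(word_prob_in_doc(0, topic_word, doc_topic))
```

In a word cloud, the font size of each word would be scaled by the same P(w | t) values used for the ranking.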

in the diagrams depends on P(w | t), the probability of word w belonging to topic t. This probability is given by the topic-word matrix of the LDA model.

Leveraging on the matrix of top words, we clustered 43,137 threads into 15 business clusters, with an average of 2876 threads in each. Below is the thread distribution diagram for each cluster (see Fig. 3).

4.1. Trend analysis

We extracted the sending time of the first email in each thread and labelled it as the “starting time stamp” for the corresponding thread. If we deem each email thread as an information stream about the firm activity, then this time stamp can be used to record when this inner activity started. Given that we had the time stamps and business labels


Fig. 3. Cluster distribution.
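
A cluster distribution of the kind shown in Fig. 3 can be obtained by assigning each thread to one topic and counting the threads per topic. The sketch below assumes that a thread is assigned to its most probable topic in the document-topic matrix; this assignment rule is an assumption made for illustration, the function name is hypothetical, and the probabilities are toy values.

```python
from collections import Counter


def assign_clusters(doc_topic):
    """Return the index of the most probable topic for each thread."""
    return [max(range(len(probs)), key=probs.__getitem__) for probs in doc_topic]


# Toy document-topic matrix (assumed values): 4 threads, 3 topics.
doc_topic = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
labels = assign_clusters(doc_topic)
cluster_sizes = Counter(labels)
print(labels, dict(cluster_sizes))

# Average cluster size reported in the text: 43,137 threads over 15 clusters.
average_threads = round(43137 / 15)
print(average_threads)
```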

Fig. 4. Monthly distribution of threads.
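
The monthly thread counts behind a figure such as Fig. 4 can be derived from the starting time stamps of the threads. The following sketch is a minimal illustration under assumed data structures: each thread is represented as a (cluster id, list of email sending times) pair, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_thread_counts(threads):
    """Count threads per (cluster id, 'YYYY-MM') using the first email's sending time."""
    counts = Counter()
    for cluster_id, sent_times in threads:
        start = min(sent_times)  # starting time stamp of the thread
        counts[(cluster_id, start.strftime("%Y-%m"))] += 1
    return counts


# Toy threads (assumed values).
threads = [
    (1, [datetime(2016, 1, 5), datetime(2016, 2, 1)]),
    (1, [datetime(2016, 1, 20)]),
    (5, [datetime(2016, 2, 3)]),
]
counts = monthly_thread_counts(threads)
print(counts[(1, "2016-01")], counts[(5, "2016-02")])
```

A thread is counted once, in the month of its first email, even when later replies fall into subsequent months.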

of threads, we conducted a longitudinal analysis. For each business cluster, we counted the cumulative number of threads under each topic from January 2009 to March 2018. According to the monthly distribution statistics for the refined dataset (Fig. 4), for the trend analysis, we used the data points after January 2009, as the data points during 2007 to 2008 were too sparse to show a trend. By plotting the thread count for each cluster by month, we can see a business trend for this cluster. Below is an example diagram (Fig. 5) generated based on cluster 1, heavy metal related tests.

We employed a Moving Average (MA) to smooth out short-term fluctuations and to highlight longer-term trends, with the time-window of the averaging process set to 6 months. This technique is widely used in economics and signal processing, for instance, for a trend monitor or low-pass filter. According to the chronological diagram of heavy metal test emails, we can see that the department experienced a significant peak season from January 2016 to January 2017. Conversely, for topic 5 (Fig. 6), the invoice related emails show a continuously upward trend from 2014. These kinds of business trends may provide management with a roadmap or an overview of the interest in a particular type of product/service, provide valuable insights into their demand, and serve as a proxy for their potential in the financial performance of the firm.

4.2. Predictive power of business trend on revenues

As we had the department monthly revenue data from 2007 to 2017, we tested if there was any relationship between the topic trends and revenues. Given that email is now one of the primary business


Fig. 5. Business trend figure of topic 1 (heavy metal tests).

Fig. 6. Business trend figure of topic 5 (invoice related emails).
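
The moving-average smoothing used for the business trend figures can be sketched as follows. The study states only the window length (6 months); the use of a trailing window, the handling of the first months and the function name are assumptions made for this illustration.

```python
def moving_average(counts, window=6):
    """Trailing moving average; months without a full window are returned as None."""
    smoothed = []
    for i in range(len(counts)):
        if i + 1 < window:
            smoothed.append(None)  # not enough history yet
        else:
            smoothed.append(sum(counts[i + 1 - window:i + 1]) / window)
    return smoothed


# Toy monthly thread counts (assumed values), smoothed with a 3-month window.
monthly_counts = [1, 2, 3, 4, 5, 6, 7]
print(moving_average(monthly_counts, window=3))
```

Calling the same function with window values of 3, 6 and 9 gives the short-term to long-term views compared in the later figures.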

Table 4. The results of Granger causality tests (1-month lag).
Cluster id | Cluster label | F-statistics | P-value
12 | Marketing and promotions | 32.6818 | < .0001***
7 | Customer service | 26.5975 | < .0001***
10 | Acid migration tests | 13.8779 | .0003***
2 | Water and microbiology tests | 13.0549 | .0005***
3 | Fabric related tests | 12.9089 | .0005***
5 | Invoice related emails | 11.7687 | .0009***
4 | Application emails | 10.601 | .0015***
14 | Material related tests | 10.2468 | .0018***
13 | Audit related emails | 9.0333 | .0033***
9 | Others | 8.3826 | .0046***
11 | Test advisory | 7.7159 | .0065***
15 | Pharmaceutical related tests | 7.6113 | .0068***
8 | Hokla related emails | 7.331 | .0079***
6 | Webinars and trainings | 7.0751 | .012**
1 | Heavy metal tests | 2.7251 | .1017
Notes: **p < .05, ***p < .01.

productivity applications and is considered as the most frequently used communication tool in a firm (Ciccio & Mecella, 2013), it can serve indirectly as an overview map for the inner activities of a firm, ranging from knowledge transfer to business development process. As we successfully clustered emails, we were able to explore further which cluster has the most significant impact on a firm's business success (reflected in monthly revenues).

The focal dependent variable “monthly revenue” was a dynamic time-series. We tested which business trend improved short-term or long-term forecasts of future movements in monthly revenue, and vice versa. To tackle this question, we employed Granger causality. This method is a well-known test for bivariate causality, and involves estimating a linear reduced-form vector auto regression (VAR). This can be expressed as:

$$y_t = L(\beta)_i\, y_{t-1} + L(\gamma)_j\, x_{t-1} + \varepsilon_{t1}$$
$$x_t = L(\delta)_i\, x_{t-1} + L(\theta)_j\, y_{t-1} + \varepsilon_{t2}$$

In the maintained structure, x and y are jointly determined


Fig. 7. Sales and business trend of topic 12 (marketing and promotions) – 3, 6 and 9 months moving average comparison.
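
The Granger causality test used in this section compares a regression of a series on its own lags with one that also includes lags of the other series. The following sketch illustrates the resulting F-statistic with a plain least-squares solver. It is not the software used in the study: the function names are hypothetical, an intercept is included by assumption, and the two series are synthetic data in which the "emails" series drives the "revenue" series by construction.

```python
import random


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit of y on the columns of X."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * v for b, v in zip(coef, r))) ** 2 for r, yi in zip(X, y))


def granger_f(y, x, lag=1):
    """F-statistic for H0: lagged x adds nothing to an autoregression of y."""
    rows = range(lag, len(y))
    own = [[1.0] + [y[t - l] for l in range(1, lag + 1)] for t in rows]
    full = [o + [x[t - l] for l in range(1, lag + 1)] for o, t in zip(own, rows)]
    target = [y[t] for t in rows]
    rss_restricted = ols_rss(own, target)
    rss_unrestricted = ols_rss(full, target)
    dof = len(target) - (2 * lag + 1)  # observations minus estimated coefficients
    return ((rss_restricted - rss_unrestricted) / lag) / (rss_unrestricted / dof)


# Synthetic series (assumed, for illustration): revenue depends on last month's emails.
rng = random.Random(0)
emails = [rng.gauss(0, 1) for _ in range(120)]
revenue = [0.0]
for t in range(1, 120):
    revenue.append(0.5 * revenue[t - 1] + 2.0 * emails[t - 1] + rng.gauss(0, 0.1))

f_emails_to_revenue = granger_f(revenue, emails, lag=1)
f_revenue_to_emails = granger_f(emails, revenue, lag=1)
print(f_emails_to_revenue, f_revenue_to_emails)
```

A large F-statistic in one direction and a small one in the other is the pattern that would indicate that the email trend helps to forecast revenue rather than the reverse.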

endogenous variables, and $\varepsilon_{t1}$ and $\varepsilon_{t2}$ are assumed to be i.i.d. $(0, \sigma^2)$. $L(\cdot)_i$ denotes the polynomial lag operator of order i; for instance, $L(\beta)_i y_{t-1} = \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_i y_{t-i}$. To examine Granger causality between x and y, the following hypotheses are tested, where $L(\gamma)_j$ is the lag polynomial applied to x in the equation for y:

$$H_0: L(\gamma)_j = 0$$
$$H_1: L(\gamma)_j \neq 0$$

If H0 is rejected, then we can say x provides statistically significant information about future values of y (Cromwell, Hannan, Labys, & Terraza, 1994). The F-statistics can also be used to rank the explanatory power of x on y.

In the current study, the business trend of cluster j was operationalised by the dynamic number of emails of cluster j in month t, denoted as $N_{j,t}$. Revenue of month t was denoted as $R_t$, thus we had the VAR model to test:

$$R_t = L(\beta)_i\, R_{t-1} + L(\gamma)_k\, N_{j,t-1} + \varepsilon_t$$

The polynomial lag operator was fixed at 5 at this stage, in order to test the impact of the business trend in a maximal time window of 5 months. The results of the Granger causality test are shown in Table 4.

The above results were based on a 1-month lag operator setting. We rejected lag operators 2 to 5, as their F-statistics were all lower than that of the 1-month lag. Thus, we adopted the 1-month lag as the best operator. From Table 4, several insights were obtained.

Firstly, at the 0.01 significance level, most of the business clusters showed significant predictive power on monthly sales (except heavy metal tests and training). This finding confirms that the internal activities, or the business model of a firm can influence the financial performance (revenue generation). Leveraging on the big data techniques, we provide empirical evidence to support this.

Secondly, we found that marketing and promotion activities (topic 12, rank 1) were very important in maintaining revenues in this B2B context. The predictive power of topic 12 can be seen in Fig. 7. Here we set the time-window of the averaging process to three, six and nine months separately and compare the synchronism of sales and topic trend under different time window settings. The trend diagram of this topic indeed showed synchronism with monthly revenues. From a short-term view (3-month moving average) to a long-term view (9-month moving average), the trend of topic 12 illustrates synchronization with monthly sales, showing its good predictive power and robustness across different settings of time-window values.

The F-statistics for this topic were much higher than those of the test-related emails, showing its influence on revenues. It also showed that the marketing and promotion activities of this firm were successful and effective. The management may consider investing more resources in this part to stimulate further growth in revenues.

Third, general customer service (topic 7, rank 2) was much more important than specific testing enquiries. Similar to topic 12, the predictive power of topic 7 can also be seen in Fig. 8.

By checking the top word list of topic 7, we found the word “report”, which is the final product provided to customers. This finding implies that service quality, whether pre-sale or after-sale, was vital for the business success of the firm, especially given the B2B context. After discussing this finding with the expert team, we identified an alternative explanation for this insight, that is, that customers who had more


Fig. 8. Sales and business trend of topic 7 (customer service) – 3, 6 and 9 months moving average comparison.

general enquiries were not as knowledgeable and price sensitive as customers who had specific testing enquiries. Thus, transactions with such customers were more likely to be successful.

Fourth, acid migration tests, water and microbiology tests and fabric related tests had a relatively larger impact on revenues. This finding implies that the pre-communications of these tests were more easily successful. In other words, the manpower and resource investment in these tests can obtain higher ROI. A possible explanation for this finding is that the “acid migration tests” and “water and microbiology tests” are two market-leading services provided by Company X. Therefore, these two business activities are relatively more trusted by customers compared to other services. We are surprised to find that the “fabric related tests” is also identified as an influential activity. This service is not one of the major services provided by this department, but its profitability is significant. Thus, its commercial potential is worth further investigating.

Conversely, not all topics have good synchronism with monthly sales. As shown in Table 4, Topic 1 (heavy metal test related emails) has an insignificant impact on sales (p-value = .1017). In Fig. 9, we can observe that the business trend of heavy metal related tests cannot accurately predict the future monthly sales of the F&P department. This finding implies the low effectiveness of communication about this service. This could be considered as a warning sign for the firm. That is, resources spent on this service may not contribute to revenue increase. The management needs to further investigate the reason why this is the case. (See Tables 5 and 6.)

Fifth, it was surprising to find that training and webinar activities (topic 6, rank 14) were relatively less important than other clusters. Given that the firm is a professional certification company, this was expected to be a core element of competitiveness. An explanation for this finding could be that the training of employees is a considerably long-term investment. A 1-month scrutiny window is too short for us to observe its benefit. As Bartel (1995) remarked, the outcomes of the first-year training (e.g., performance or productivity improvement) will not appear until the second year.

To quantify the impact of business clusters, we redid the VAR analysis following the model proposed previously:

$$R_t = L(\beta)_i\, R_{t-1} + L(\gamma)_k\, N_{j,t-1} + \varepsilon_t$$

Table 5 shows the result of this VAR model.

Here we ranked the clusters by the standardised beta, showing their impact strength in descending order. As shown in Table 5, the rank of clusters was similar to the Granger causality test. Given the coefficient beta, we can quantitatively interpret the impact of business clusters on revenues. Taking topic 12 as an example, and assuming the other variables (e.g., sales of last month) held constant, with an increase of one standard deviation in marketing and promotions related emails, revenues rose 0.466 standard deviations. More generally, in terms of unstandardised beta, with an increase of one email in the marketing and promotions cluster, the revenue will increase by HK$4280. Note that we could not compare unstandardised betas directly to rank their importance, as the email scales for each cluster are different.

Trend analysis can provide strategists and marketers with a more intuitive understanding of the internal operation situation and emerging market trends. As emails are dynamically updated through the daily operation of firms, it is possible and feasible for us to capture the short-term information about market movements, which can be valuable to support future decision making. For example, if we observe that the number of emails about “fabric related tests” is rapidly increasing, then management should consider whether it is necessary to invest more resources (staff or facilities) to meet the possible surge in the capability needs of fabric testing. Also, the result of the proposed LDA


Fig. 9. Sales and business trend of topic 1 (heavy metal related tests) – 3, 6 and 9 months moving average comparison.

Table 5. The results of vector auto regression model (1-month lag).
Topic | Cluster label | Delta adjusted R2 | Beta | Standardised beta | Significance
Topic 12 | Marketing and promotions | 0.121 | 0.428 | 0.466 | < .001***
Topic 7 | Customer service | 0.102 | 0.23 | 0.438 | < .001***
Topic 10 | Acid migration tests | 0.057 | 0.529 | 0.327 | < .001***
Topic 2 | Water and microbiology tests | 0.054 | 0.362 | 0.324 | < .001***
Topic 3 | Fabric related tests | 0.053 | 0.752 | 0.314 | < .001***
Topic 4 | Application emails | 0.044 | 0.288 | 0.314 | .002***
Topic 5 | Invoice related emails | 0.048 | 0.026 | 0.313 | .001***
Topic 14 | Material related tests | 0.042 | 0.238 | 0.293 | .002***
Topic 13 | Audit related emails | 0.037 | 0.178 | 0.276 | .003***
Topic 9 | Others | 0.034 | 0.376 | 0.257 | < .001***
Topic 8 | Hokla related emails | 0.03 | 0.339 | 0.252 | < .001***
Topic 6 | Webinars and training | 0.028 | 0.44 | 0.23 | .009***
Topic 15 | Pharmaceutical related tests | 0.031 | 0.695 | 0.23 | .007***
Topic 11 | Test advisory | 0.031 | 0.153 | 0.227 | .006***
Topic 1 | Heavy metal tests | 0.009 | 0.136 | 0.127 | .102

Notes: ***p < .01.
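
The difference between the unstandardised and standardised betas reported in Table 5 can be illustrated with a single-predictor regression. The sketch below is a simplification, not the VAR model estimated in the study: it uses one predictor without lagged revenue, toy data, and hypothetical function names. It shows that the unstandardised slope changes with the scale of the email counts, whereas the standardised beta does not.

```python
from statistics import mean, stdev


def slope(x, y):
    """Unstandardised least-squares slope of y on x (single predictor)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardised_beta(x, y):
    """Slope expressed in standard deviations of y per standard deviation of x."""
    return slope(x, y) * stdev(x) / stdev(y)


# Toy data (assumed values): monthly email counts and revenue.
emails = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]
emails_rescaled = [10 * e for e in emails]  # same trend on a larger email scale

print(slope(emails, revenue), slope(emails_rescaled, revenue))
print(standardised_beta(emails, revenue), standardised_beta(emails_rescaled, revenue))
```

This scale dependence is the reason the clusters are ranked by standardised beta rather than by the raw coefficient.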

approach highlights three more business activities that may be neglected by the expert team (promotion and marketing emails, application emails and test advisory emails), showing that big data analytics can indeed complement the experience-based decision process. This finding supports the statement of McAfee et al. (2012) that “data-driven decisions tend to be better decisions”, and we suggest that companies should consider leveraging both domain expertise and data science when making final business decisions. Finally, the impact rank of business trends on monthly sales provides a quantitative basis for the firm's resource distribution plan. Given the limited resources of a firm, management should investigate the business activities that cannot contribute to revenue increase, understanding the reasons for their low effectiveness.

Finally, in the above analysis we have consistently considered the period 2009–2018. Still, as the volume of emails in the period between 2009 and 2012 was significantly lower than the volume of emails between 2013 and 2018, we could have potentially focused our analytical effort on the latter period. The analysis (Tables 6 and 7) returns a significant performance improvement for the VAR model, if we only use the data after 2013. Taking topics 12 and 13 as examples, the delta adjusted R2


Table 6. The results of Granger causality tests (1-month lag) (2013–2018).
Cluster id | Cluster label | F-statistics | P-value
12 | Marketing and promotions | 45.4364 | < .001***
7 | Customer service | 34.736 | < .001***
3 | Fabric related tests | 11.206 | .0014***
10 | Acid migration tests | 10.2637 | .0022***
2 | Water and microbiology tests | 10.1039 | .0024***
5 | Invoice related emails | 9.2015 | .0036***
14 | Material related tests | 6.9465 | .0107**
4 | Application emails | 5.7972 | .0192**
13 | Audit related emails | 4.9 | .0307**
11 | Test advisory | 3.9476 | .0516*
9 | Others | 3.9469 | .0515*
15 | Pharmaceutical related tests | 3.8939 | .0532*
8 | Hokla related emails | 3.4545 | .0681*
6 | Webinars and training | 3.1738 | .0755*
1 | Heavy metal tests | 0.9374 | .3369
Notes: *p < .1, **p < .05, ***p < .01.

of topic 12 increased from 0.121 to 0.296 and the delta adjusted R2 of topic 13 increased from 0.037 to 0.043. Such a difference meant a different ranking of the clusters when sorting them by the delta adjusted R2. All clusters that were found to be statistically significant in the full-period analysis remained significant in the 2013–2018 period, although several only at the 0.1 level. Heavy metal test results were found to be insignificant in both cases.

5. Conclusion

The underlying business information, especially the market trend and consumer preferences, has captured the interest of scholars and business practitioners in the last few decades. Gaining such information in advance can provide companies with considerable competitive advantage compared to peers. The era of web 2.0 has brought both opportunities and challenges for business practitioners. The data revolution underway demands a broader appreciation of the variety of emerging data sources and types, and a more comprehensive set of skills, including those being developed in the digital humanities, as well as basic coding and simulation. Such a situation results in something of a paradox in that, despite the emerging data deluge, information extracted from the data is still highly limited. As McAfee et al. (2012) describe, despite the increasing importance and relatively higher accessibility of data generated in companies' internal environments (e.g., emails, transaction records), seldom do companies try to extract information from it. To cope with this paradox, the concept “big data analytics” has been recognised by scholars in the business and IT domains. They believe that the proper leverage of big data techniques can solve the information overload problem and the knowledge extracted can support marketing decisions (Morabito, 2015; Wedel & Kannan, 2016). To address the above, we conducted an empirical study to showcase how a big data approach can transform mundane emails into valuable information using data analytics practices. Leveraging an advanced big data analytic technique, we identified a number of thematic clusters. These could be used to monitor the demand trend in a regular manner. The business information extracted was then compared to revenues, showing good predictive power for explaining the performance of a company.

Given the evolving nature of the volume of data and the increasing importance of data-driven strategies, this study contributes to the extant body of literature in several ways. We have showcased a detailed process of using a big data NLP method to achieve effective knowledge extraction. Our approach demonstrates the transformation process from data to business value, producing a better understanding of the application of big data analytics and the knowledge discovery process. Additionally, the proposed approach using an internal and cost-effective information source (e.g., daily business emails) to achieve business trend analysis of companies can help address the significant problem of data acquisition and management in the B2B context (Fan et al., 2015). Given that email is profoundly embedded in the daily operation and communication of a company, it can be sustainably generated and added to a massive and traceable repository. Our study showcases how email can be deemed an important data source in firms and how the underlying business information in emails can offer valuable insights. Also, as emails can be automatically updated and kept in dynamic form, the chronological analysis of the short-term trend of these business activities can generate profound insights that would have been difficult to obtain otherwise. We suggest that companies should consider leveraging both domain expertise and data science when making business decisions.

6. Practical implications

This study also has several practical implications for B2B strategists, marketers, as well as IT professionals. First, practitioners should realise that the growth of the digital universe continues to outpace the growth of human analytic ability. The ability to utilise the ever-increasing information is critical to seize the opportunity in competition, and the data utilisation relies on big data analytics. Using the case in this study as an example, we identified the most “profitable” business activity in this company as marketing and promotion events, while “heavy-metal related testing service” is not related to revenue growth. Such information can be critical for business strategists as they now know how to distribute company resources and how to estimate return-on-

Table 7. The results of vector auto regression model (1-month lag) (2013–2018).
Topic | Cluster label | Delta adjusted R2 | Beta | Standardised beta | Significance
Topic 12 | Marketing and promotions | 0.296 | 0.575 | 0.728 | < .01***
Topic 7 | Customer service | 0.251 | 0.317 | 0.702 | < .01***
Topic 3 | Fabric related tests | 0.102 | 0.884 | 0.44 | < .01***
Topic 10 | Acid migration tests | 0.094 | 0.601 | 0.425 | < .01***
Topic 2 | Water and microbiology tests | 0.092 | 0.435 | 0.441 | < .01***
Topic 5 | Invoice related emails | 0.085 | 0.324 | 0.438 | < .01***
Topic 14 | Material related tests | 0.063 | 0.261 | 0.371 | .012**
Topic 4 | Application emails | 0.052 | 0.29 | 0.362 | .017**
Topic 13 | Audit related emails | 0.043 | 0.185 | 0.318 | .028**
Topic 6 | Webinars and training | 0.035 | 0.415 | 0.27 | .042**
Topic 9 | Others | 0.033 | 0.355 | 0.27 | .047**
Topic 15 | Pharmaceutical related tests | 0.033 | 0.593 | 0.244 | .049**
Topic 11 | Test advisory | 0.033 | 0.129 | 0.24 | .048**
Topic 8 | Hokla related emails | 0.028 | 0.331 | 0.277 | .063*
Topic 1 | Heavy metal tests | −0.001 | 0.087 | 0.113 | .307

Notes: *p < .1, **p < .05, ***p < .01.


investment more holistically. Second, there is a growing consensus in the business domain that using big data techniques will take hard work and significant investment. This is true when data is expensive to obtain or manage. Nevertheless, we argue that this problem can be alleviated by using daily emails as an analytic information source. As the majority of companies neglect the fact that email content contains information about business decisions, actions, and transactions (Jackson & Tedmori, 2004), the email archive has the potential to become a cost-effective tool to support big data techniques. It holds out the prospect of gaining detailed knowledge about company business activities not captured by expensive business intelligence software.

7. Future research

This study has some limitations that can be leveraged to inform future research. First, owing to data and time restrictions, we were unable to capture email data from other companies/industries, especially those which are outside of the B2B context. Due to the diverse business nature and company culture, the predictive power of the identified business trends may change on a cross-organisational basis. Therefore, future research could apply our big data NLP methodology to other industries and compare the results against different contextual settings. Second, although we address the importance of the internal data, it may still be possible to include other useful information sources that could be obtained from the external environment, such as the web-pages of competitors. Mining the business information from their websites may provide some insights for mapping the entire market of a specific industry. Also, having internal sales by product type would make it possible to test against the corresponding clusters. Third, in terms of the validation of the LDA method, we utilise the manual labels as the benchmark to calculate the precision and recall of this approach. Future work may consider incorporating a more solid validation process to increase the confidence in the use of LDA for the identification of business clusters. Fourth, for the Granger causality test in this paper, we fixed the polynomial lag operator at 5 to test the impact of the business trend in a maximal time window of 5 months. This setting may result in the relatively lower significance of some long-term orientated activities (e.g., training and webinars for employees). There is a possibility that these business activities need sufficient time before their outcomes, such as performance improvement, can be harvested (Bartel, 1995). To address this concern, we encourage future research to examine these activities and further explore their potential from the long-term perspective.

References

sales forecasting. Proceedings of 13th International Symposium on Operational Research, SOR, Vol. 15.
Boyd, D., & Crawford, K. (2012). Critical questions for big data. Information, Communication & Society, 15(5), 662–679. https://doi.org/10.1080/1369118X.2012.678878.
Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. Mckinsey Quarterly, 4(1), 24–35.
Bughin, J., Chui, M., & Manyika, J. (2010). Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. Mckinsey Quarterly, 56(1), 75–86.
Campbell, C. S., Maglio, P. P., Cozzi, A., & Dom, B. (2003). Expertise identification using email communications. Paper presented at the Proceedings of the twelfth international conference on Information and knowledge management, New Orleans, LA, USA.
Chang, Y. L., & Chien, J. T. (2009). Latent Dirichlet learning for document summarization. In Acoustics, Speech and Signal Processing, 1689–1692 ICASSP 2009. IEEE International Conference.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 1165–1188.
Chien, J. T., & Wu, M. S. (2008). Adaptive Bayesian latent semantic analysis. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 198–207.
Ciccio, C. D., & Mecella, M. (2013). Mining artful processes from knowledge workers' emails. IEEE Internet Computing, 17(5), 10–20. https://doi.org/10.1109/MIC.2013.60.
Cromwell, J. B., Hannan, M. J., Labys, W. C., & Terraza, M. (1994). Multivariate tests for time series models. Thousand Oaks, CA: Sage.
Darling, W. (2018). A theoretical and practical implementation tutorial on topic modeling and gibbs sampling.
Davenport, T. (2014). Big data at work: Dispelling the myths, uncovering the opportunities. Harvard Business Review Press.
Deal, D. (2014). Workhorses and dark horses: Digital tactics for customer acquisition. Gigaom Research Raport.
D'Haen, J., & Van den Poel, D. (2013). Model-supported business-to-business prospect prediction based on an iterative customer acquisition framework. Industrial Marketing Management, 42(4), 544–551.
Duff, B. (1996). Document management offers security and order for intranet information. IIE Solutions, 28(12), 28–31.
Edmunds, A., & Morris, A. (2000). The problem of information overload in business organisations: A review of the literature. International Journal of Information Management, 20(1), 17–28.
Eppler, M. J., & Mengis, J. (2004). The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines. The Information Society, 20(5), 325–344.
Erevelles, S., Fukawa, N., & Swayne, L. (2016). Big data consumer analytics and the transformation of marketing. Journal of Business Research, 69(2), 897–904.
Fan, S., Lau, R. Y., & Zhao, J. L. (2015). Demystifying big data analytics for business intelligence through the lens of marketing mix. Big Data Research, 2(1), 28–32.
Fox, C. (1989). A stop list for general text. SIGIR Forum, 24(1–2), 19–21. https://doi.org/10.1145/378881.378888.
George, G., Haas, M. R., & Pentland, A. (2014). Big data and management. Academy of Management Journal, 57(2), 321–326.
Goodpaster, K. E. (2015). Business ethics. Wiley Encyclopedia of Management.
Gordon, S., & Mohammad, S. (2012). Perspective-taking and the attribution of ignorance. Journal for the Theory of Social Behaviour, 42(2), 181–200.
Griffin, A., Josephson, B. W., Lilien, G., Wiersema, F., Bayus, B., Chandy, R., & Spanjol, J. (2013). Marketing's roles in innovation in business-to-business firms: Status, issues, and research agenda. Marketing Letters, 24(4), 323–337.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1), 5228–5235.
Gunasekaran, A., Papadopoulos, T., Dubey, R., Wamba, S. F., Childe, S. J., Hazen, B., & Akter, S. (2017). Big data and predictive analytics for supply chain and organiza-
tional performance. Journal of Business Research, 70, 308–317. https://doi.org/10.
Ajaz, S., Nafis, M. T., & Sharma, V. (2017). Spam mail detection using hybrid secure hash 1016/j.jbusres.2016.08.004.
based naive classifier. International Journal of Advanced Research in Computer Science, Gustke, C. (2013). Big data takes turn as market darling. Retrieved from https://www.
8(5), 5. https://doi.org/10.26483/ijarcs.v8i5.3675. cnbc.com/id/100638376.
Alrashed, T., Awadallah, A. H., & Dumais, S. (2018). The lifetime of email messages: A large- Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis.
scale analysis of email revisitation. Paper presented at the Proceedings of the 2018 Machine learning, 42(1-2), 177–196.
Conference on Human Information Interaction & Retrieval, New Brunswick, NJ, USA. Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in Twitter. Paper
Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, C. D. (2000). An presented at the Proceedings of the first Workshop on Social Media Analytics,
experimental comparison of naive bayesian and keyword-based anti-spam filtering with (Washington D.C., District of Columbia).
personal e-mail messages. (Paper presented at the Proceedings of the 23rd annual Jackson, A., Yates, J., & Orlikowski, W. (2007). Corporate blogging: building community
international ACM SIGIR conference on Research and development in information through persistent digital talk. Jan. 2007. Paper presented at the System Sciences,
retrieval, Athens, Greece). 2007. HICSS 2007. 40th Annual Hawaii International Conference on.
Asay, M. (2014). 8 reasons why big data projects fail. Retrieved from https://www. Jackson, T., & Tedmori, S. (2004). Capturing and managing electronic knowledge: The de-
informationweek.com/big-data/big-data-analytics/8-reasons-big-data-projects-fail/ velopment of the email knowledge extraction system.
a/d-id/1297842. Jackson, T. W., Tedmori, S., Hinde, C. J., & Bani-Hani, A. I. (2012). The boundaries of
Bartel, A. P. (1995). Training, wage growth, and job performance: Evidence from a natural language processing techniques in extracting knowledge from emails. Journal
company database. Journal of Labor Economics, 13(3), 401–425. of Emerging Technologies in Web Intelligence, 4(2), 119–127.
Basavaraju, M., & Prabhakar, D. R. (2010). A novel method of spam mail detection using Janine, S., Ton, D. J., & Van Joolingen Wouter, R. (2004). The effects of discovery
text based clustering approach. International Journal of Computer Applications, 5(4), learning and expository instruction on the acquisition of definitional and intuitive
15–25. knowledge. Journal of Computer Assisted Learning, 20(4), 225–234.
Bawm, Z. L., & Nath, R. P. D. (2014). A conceptual model for effective email marketing. John Walker, S. (2014). Big data: A revolution that will transform how we live, work, and
22–23 Dec. 2014. Paper presented at the 2014 17th International Conference on think. International Journal of Advertising, 33(1), 181–183.
Computer and Information Technology (ICCIT). Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: Issues and challenges
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of moving forward. Paper presented at the 46th Hawaii international conference on
Machine Learning Research, 3(Jan), 993–1022. System sciences (HICSS).
Bohanec, M., Borstnar, M., & Robnik-Sikonja, M. (2015). Feature subset selection for B2B Kok, S., & Yih, W. T. (2009). Extracting product information from email receipts using markov
logic. Paper presented at the 6th Conference on Email and Anti-Spam, Mountain


View, California, USA.
Lackman, L. C. (2007). Forecasting sales for a B2B product category: Case of auto component product. Journal of Business & Industrial Marketing, 22(4), 228–235.
Lahiri, S., Mihalcea, R., & Lai, P. H. (2017). Keyword extraction from emails. Natural Language Engineering, 23(2), 295–317.
Laney, D. (2001). 3D data management: Controlling data volume, velocity and variety. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21.
Li, J., Sen, S., & Zaman, N. (2015). Entity extraction from business emails. International Journal of Information Technology and Computer Science, 7(9), 15–22.
Lilien, G. L. (2016). The B2B knowledge gap. International Journal of Research in Marketing, 33(3), 543–556.
Lycett, M. (2013). ‘Datafication’: Making sense of (big) data in a complex world. European Journal of Information Systems, 22(4), 381–386.
Madnani, N. (2007). Getting started on natural language processing with Python. Crossroads, 13(4), 5.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity.
Martin, B., Elin, C.-E., Dorota, K., Marek, K., John, S., & Yu, Z. (2015). Internal migration data around the world: assessing contemporary practice. Population, Space and Place, 21(1), 1–17.
McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data: The management revolution. Harvard Business Review, 90(10), 60–68.
Mithas, S., Lee, M. R., Earley, S., Murugesan, S., & Djavanshir, R. (2013). Leveraging big data and business analytics. IT Professional, 15(6), 18–20.
Moh, T.-S., & Bhagvat, S. (2012). Clustering of technology Tweets and the impact of stop words on clusters. Paper presented at the Proceedings of the 50th Annual Southeast Regional Conference, Tuscaloosa, Alabama.
Morabito, V. (2015). Big data and analytics: Strategic and organizational impacts. Springer.
Papagiannidis, S., See-to, E., Assimakopoulos, D., & Yang, Y. (2018). Identifying industrial clusters with a novel big-data methodology: Are SIC codes (not) fit for purpose in the internet age? Computers & Operations Research, 98(October 2018), 355–366.
Paul, S. A. (2016). Find an expert: Designing expert selection interfaces for formal help-giving. Paper presented at the 2016 CHI Conference on Human Factors in Computing Systems.
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. Text mining (pp. 1–20).
Rusetski, A. (2014). Pricing by intuition: Managerial choices with limited information. Journal of Business Research, 67(8), 1733–1743.
Sathi, A. (2014). Engaging customers using big data: How marketing analytics are transforming business. Palgrave Macmillan.
Shah, P. K., Perez-Iratxeta, C., Bork, P., & Andrade, M. A. (2003). Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics, 4(1), 20. https://doi.org/10.1186/1471-2105-4-20.
Sivarajah, U., Kamal, M. M., Irani, Z., & Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70, 263–286.
Tedmori, S., Jackson, T. W., & Bouchlaghem, D. (2006). Locating knowledge sources through keyphrase extraction. Knowledge and Process Management, 13(2), 100–107.
The Radicati Group (2017). Email statistics report, 2017–2021. Retrieved from https://www.radicati.com/wp/wp-content/uploads/2017/01/Email-Statistics-Report-2017-2021-Executive-Summary.pdf.
Tirunillai, S., & Tellis, G. J. (2014). Mining marketing meaning from online chatter: strategic brand analysis of big data using Latent Dirichlet allocation. Journal of Marketing Research, 51(4), 463–479.
Wallach, H. M., Mimno, D. M., & McCallum, A. (2009). Rethinking LDA: Why priors matter. Advances in Neural Information Processing Systems, 1973–1981.
Wang, G., Gunasekaran, A., Ngai, E. W. T., & Papadopoulos, T. (2016). Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176, 98–110.
Wedel, M., & Kannan, P. K. (2016). Marketing analytics for data-rich environments. Journal of Marketing, 80(6), 97–121.
Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. Paper presented at the Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, Seattle, Washington, USA.
Whittaker, S., Bellotti, V., & Moody, P. (2005). Revisiting and reinventing email. Human Computer Interaction, 20(1), 1–9.
Wiersema, F. (2013). The B2B agenda: The current state of B2B marketing and a look ahead. Vol. 42.
Xu, Z., Frankwick, G. L., & Ramirez, E. (2016). Effects of big data analytics and traditional marketing analytics on new product success: A knowledge fusion perspective. Journal of Business Research, 69(5), 1562–1566.
