
Study Unit 1: Business Analytics Research & Text Mining Overview
ANL312  Business Analytics Research & Text Mining Overview

Learning Outcomes

By the end of this study unit, you should be able to:

1. Explain what Business Analytics research involves and the sources of information
2. Develop report writing skills by using the guidelines of report writing
3. Apply APA style to credit sources
4. Describe text mining and its importance in the current world
5. Describe the challenges of text mining
6. Describe the general text mining process flow
7. Apply the CRISP-DM framework in the text mining context


Overview

Business Analytics involves the use of data mining and knowledge discovery techniques
to provide business users with critical insights into the operational and performance
characteristics of various aspects of business. It spans disciplinary areas such as Statistics,
Machine Learning and Artificial Intelligence. Research into new techniques and areas of
applications is expanding rapidly.

To stay relevant in this field, one has to keep pace with the latest developments as things
that are learnt today may no longer be applicable in the future. Furthermore, one needs to
expand one’s knowledge so that one is aware of the various tools and techniques available.
Having a broad knowledge will help one to employ appropriate tools for problems that
need to be solved. Thus, it is important to develop some basic skills in research and
writing.

Chapter 1 of this study unit gives an overview of research and writing in the Business
Analytics field. The main goal is to sharpen the skills needed to learn about the latest
developments in Business Analytics. To this end, the aims of chapter 1 are two-fold.
Firstly, it provides the basic research skills needed to expand one’s knowledge of
Business Analytics. One will learn how to go beyond textbooks to learn new methods
and applications of Business Analytics, and to identify existing problems and the solutions
proposed for them. Secondly, it is hoped that through the exercises in research and
writing, one will be better prepared for the ANL488 Business Analytics Applied Project.

ANL488 is a 10-credit unit course that is to be completed at the end of the Business
Analytics programme. In this project, it is expected that the student will identify a specific
business problem and propose a Business Analytics solution to solve the problem. The focus
of chapter 1 of this study unit is on information collection and report writing. Firstly, the
sources of Business Analytics research will be explained. Then, where to look up Business
Analytics information will be covered. Lastly, the guidelines of report writing will be
explained.


Chapter 2 of this study unit gives an overview of text mining. Text mining or text analytics
refers to a broad range of technologies that can process unstructured text data and convert
unstructured text data to structured data. The common theme behind these technologies
is converting text into numbers so that different algorithms can be applied. Similar to
data mining, text mining seeks to extract useful information from large data sources by
identifying interesting patterns. The difference is that the data is now unstructured rather
than structured data in numeric or categorical format.

As there is far more unstructured text data than structured data in the real world, text
mining has become a very important tool for companies to better understand their internal
business processes, their customers and their competitors. There are many applications of
text mining in the business world. For example, in Customer Relationship Management,
the text sources can be call centre logs that detail the nature of the calls. Interesting
patterns can be derived by categorising the logs into, for example, billing complaints,
frustration with the use of certain services, or service interruptions that take too long
to resolve. The outputs of a text mining application may be used directly for business
decision making or as additional inputs to predictive models, such as a customer churn
model.
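
The call centre example above can be sketched in a few lines of Python. This is only a
minimal illustration under stated assumptions: the keyword lists, the category names and
the four sample logs are all made up for this sketch, and simple substring matching is far
cruder than the techniques covered later in the course.

```python
from collections import Counter

# Hypothetical keyword lists; a real project would derive these from the data.
CATEGORY_KEYWORDS = {
    "billing complaint": ["bill", "charge", "invoice", "refund"],
    "service usage frustration": ["confusing", "cannot use", "difficult", "how to"],
    "service interruption": ["outage", "down", "no signal", "interrupted"],
}

def categorise(log_text):
    """Return the first category whose keyword appears in the text, else 'other'."""
    text = log_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"

# Made-up call centre logs, for illustration only.
logs = {
    1: "Customer disputes an extra charge on the March bill",
    2: "Caller says broadband has been down for two days",
    3: "Customer finds the mobile app confusing to navigate",
    4: "Asked about store opening hours",
}

# Structured output: one (document id, category) row per log.
structured_rows = [(doc_id, categorise(text)) for doc_id, text in logs.items()]

# A simple pattern: how many calls fall into each category.
category_counts = Counter(category for _, category in structured_rows)
```

The resulting rows could be reported directly or, as the text suggests, joined to customer
records as an extra input to a predictive model.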


Chapter 1: Business Analytics Research & Report Writing


1.1 Typical Information Required


Basically, there are three types of information required for a typical business analytics
project: (i) data, (ii) method and (iii) application of data mining methods.

Data

Data is essential for data mining, without which no mining can be carried out. Usually,
data is available in the company operational database. For example, many ANL488 project
students get data from their employers. Some companies invest in data warehouses which
provide a consistent and less complicated view of data from various data sources.

It is also possible to obtain data from research agencies that already hold the data. One
typical example is the UCI Machine Learning Data Repository (Dua & Graff, 2019). Many
datasets are freely available for research purposes.

Another possible source of data is government organisations. For example, the Singapore
Department of Statistics’ website (http://www.singstat.gov.sg/) contains data that may
be used in projects. In some cases, getting data from organisations may not be possible
because the data is classified. One example is data from the Ministry of Defence.

When existing data cannot be obtained readily, one may need to collect or generate the
data required for mining. One typical data collection option is using a survey. Conducting
a survey is time-consuming; it involves questionnaire design, sampling of respondents
and so on. It is usually not recommended if the project time frame is short. In addition, the
quality of survey data depends on the proper use of measurement instruments. A survey
is more suitable for collecting data on respondents’ perceptions, which typically are not
available in a company’s database. If the data mining project requires behavioural data, it
is more accurate to retrieve the data from a company database that stores employee or
customer activity information, if available. This is because activity records capture
actual behaviour, while survey responses can be inaccurate for many reasons: respondents
may not remember exactly, or may simply not want to give truthful information. When
internal behavioural data is not available, especially when a company wants to analyse its
competitors, a common approach is to collect publicly available data using data/web
crawling tools or by writing some code.
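
As a rough idea of what "writing some code" to collect public text might look like, the
sketch below extracts review text from a page using only the Python standard library. The
HTML snippet, the `review` class name and the `ReviewTextExtractor` class are hypothetical;
a real site will have its own markup, and its terms of use should be checked before crawling.

```python
from html.parser import HTMLParser

class ReviewTextExtractor(HTMLParser):
    """Collect the text inside <p class="review"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._inside_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "p" and ("class", "review") in attrs:
            self._inside_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_review = False

    def handle_data(self, data):
        if self._inside_review:
            self.reviews[-1] += data

# In a real crawl this HTML would come from a downloaded page, for example via
# urllib.request.urlopen(url).read().decode(); a fixed string keeps the sketch offline.
sample_html = """
<html><body>
  <h1>Product X</h1>
  <p class="review">Battery life is excellent.</p>
  <p class="note">Sponsored</p>
  <p class="review">Delivery was <b>slow</b> and the box was damaged.</p>
</body></html>
"""

extractor = ReviewTextExtractor()
extractor.feed(sample_html)
reviews = [text.strip() for text in extractor.reviews]
```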

Methods

For a data mining project to be successful, it is necessary to know the appropriate method
to solve a specific problem. If the choice of method is not obvious, it is important to search
the literature extensively to find the best possible method to solve a particular problem at
hand.

Sometimes finding the right method may involve an extensive empirical
evaluation of different data mining applications. In this case, one may need to install
different software packages and conduct a comparative study of these packages in a
controlled environment.
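
One way to read "controlled environment" is that every candidate is evaluated on the same
data split, with the same random seed and the same metric. The sketch below illustrates
that idea with two deliberately simple stand-in methods (a majority-class baseline and a
1-nearest-neighbour rule) rather than real software packages. The toy data and all function
names are assumptions made for this illustration.

```python
import random
from collections import Counter

def fit_majority(train):
    """Baseline: always predict the most frequent training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda features: most_common_label

def fit_nearest_neighbour(train):
    """1-nearest-neighbour: predict the label of the closest training row."""
    def predict(features):
        def squared_distance(row):
            return sum((a - b) ** 2 for a, b in zip(row[0], features))
        return min(train, key=squared_distance)[1]
    return predict

def compare_methods(data, methods, test_fraction=0.3, seed=42):
    """Score every method on the same hold-out split with the same metric (accuracy)."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]
    scores = {}
    for name, fit in methods.items():
        predict = fit(train)
        correct = sum(predict(features) == label for features, label in test)
        scores[name] = correct / len(test)
    return scores

# Toy data: two well-separated groups of (features, label) rows.
toy_data = (
    [([float(i % 3), float(i % 2)], "low") for i in range(10)]
    + [([10.0 + i % 3, 10.0 + i % 2], "high") for i in range(10)]
)
methods = {"majority baseline": fit_majority, "nearest neighbour": fit_nearest_neighbour}
scores = compare_methods(toy_data, methods)
```

Because the split and the metric are fixed, any difference in `scores` can be attributed to
the methods themselves rather than to the evaluation setup.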

In some cases, there is indeed no method available to solve a specific problem at hand. In
such cases, it will be necessary to reassess the empirical needs and propose an alternative
solution to the problem.

Applications of data mining

In some projects, there may be a need to learn from other researchers who have completed
projects in a similar domain. For example, an insurance company planning to analyse its
policy holders’ profiles may want to know the factors considered by others working in
insurance analytics.


Usually, commercial data mining software companies publish white papers describing
innovative solutions to difficult problems. These companies also run education forums to
update their users and prospective customers on the latest trends and applications of their
products.

Many research conferences include an application track. Under this track, researchers
report innovative applications of data mining in difficult or new problem domains.

1.2 Sources of Information

1.2.1 Literature
Among the different types of information sources, the literature is the most important
source of information. There are two types of literature: (i) primary literature and (ii)
secondary literature.

Primary literature includes technical reports, theses and dissertations generated from the
research conducted by universities and research organisations. It also includes research
conference proceedings, journal articles and patents. These are formal publications that
have undergone rigorous review by experts in the field. Primary literature provides
information related to the latest development of different subjects and is very useful for
reviewing the state-of-the-art.

Following are several examples of conferences and journals in Business Analytics.

Conferences

• Pacific-Asia Conference on Knowledge Discovery and Data Mining
• Annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining
• Other conferences:
◦ ICML (International Conference on Machine Learning)
◦ ECML (European Conference on Machine Learning)
◦ ICDM (International Conference on Data Mining)
◦ SDM (SIAM Data Mining Conference)
◦ Many AI conferences accept data mining papers. For example, AI2010
(Australasian Joint Conference on Artificial Intelligence)

Journals
• JMLR (Journal of Machine Learning Research)
• MLJ (Machine Learning Journal)
• TKDE (IEEE Transactions on Knowledge and Data Engineering)
• DMKD (Data Mining and Knowledge Discovery, Springer)
• JAIR (Journal of AI Research)
• PRJ (Pattern Recognition Journal)
• Many others

Secondary literature includes textbooks, encyclopaedias, dictionaries, handbooks and
Wikipedia (http://en.wikipedia.org/wiki/Main_Page). Secondary literature is good
for learning proven concepts. Information in secondary literature is normally very
systematically organised.

The following are several examples of encyclopaedias and books on data mining.

Examples of encyclopaedias on data mining:

• “Encyclopedia of Data Warehousing and Mining” (4 volumes), edited by John Wang
from Montclair State University (Wang, 2009).
• Wikipedia on data mining (http://en.wikipedia.org/wiki/Data_mining)

Examples of books on data mining:

University libraries usually have an up-to-date collection of books on Business Analytics.


1.2.2 Other Sources


Apart from the literature, one can get information from people. For example, if one knows
experts in a particular subject, one could ask them for certain information or opinion.
One advantage of consulting experts is that one may get information that is otherwise
unavailable in the literature. Another advantage is that experts are easy to reach, either
by email, phone or face-to-face meetings. The information exchange is flexible because it
allows one to expand or narrow the topic of discussion.

One potential problem with getting information from people is that the information may
be inaccurate or even erroneous. Human error and subjectivity are always an issue, and
sometimes it is necessary to check the validity or accuracy of such information.

In Business Analytics projects, it is usually necessary to get in touch with organisations to
get information. For example, ANL488 students often contact their employers, commercial
companies or government organisations (e.g., Singapore General Hospital) to discuss
Business Analytics project requirements and to get datasets.

When beginning a new research topic, it is usually useful to read bibliographies, which
list books and papers related to a given topic. Another important source of information is
a survey paper/review paper, which summarises several research papers within a specific
research area or on a specific topic. This is especially useful when one is new to the field
and wants to have a good grasp of the latest development within a short time.

1.3 Ways to Search for Information


There are many ways to search for the information required for research. Broadly
speaking, there are two ways: online and offline.

Online search engines such as Google, Yahoo, and Bing are some of the more common
tools that can be used for information searching. When searching for academic papers, one
can use Google Scholar, a public search engine designed to find academic publications.
Another example is CiteSeerX, a digital library from which the scientific community can
download academic papers.


When searching online for general information concerning an unfamiliar topic, one can
start with a general internet search engine such as Google or Yahoo. At this stage, the
search term would be general. For example, the search term could be ‘Cluster Analysis’.
After one has a better idea of the common terms used in the topic of interest, one can
use more specific terms. Examples of such terms are ‘cluster validity scores’, ‘cluster
validation’, ‘cluster validity index’, etc.

When one has a specific question in mind, it is useful to find websites that host FAQs
(Frequently Asked Questions) and forums. For example, the data mining software Weka
has a technical team that supports an online forum to answer users’ questions pertaining
to the software. KDNuggets, a website hosting resources for data mining
(https://www.kdnuggets.com/faq/index.html), has forums for beginners and experts, as well as
FAQs for data mining users.

Searching online may sometimes lead to research articles that cannot be accessed
readily. This may be because one is not a journal subscriber or not a member of
a professional organisation. In some cases, one could buy an article online if one finds
it essential to one’s research. Interested readers may also contact the author to request a
complimentary copy of an article. One can also choose to attend a conference. This will
allow one to attend presentations, network with people, and get a copy of the proceedings,
which contain all the papers published in that conference.

While online search for information is easy, there is a caveat against the use of such
information. One needs to exercise judgement as to the value of information found on the
Internet. This is because most information on the Internet is not refereed.

If a document on the Internet is important to one’s work, it is advisable to download and
save the document on one’s local computer. When citing a document obtained from the
Internet, it is always good to state the date of accessing that document. This is because
web pages or documents can be removed without notice.

Apart from searching online, one can also search for information offline. There are many
possible reasons to go offline. One main reason is that old articles may not be
available online. In fact, many research articles published before the 1990s are not
available online. These articles can, however, be found in libraries’ archives.

Libraries are a common place to search for information offline. Many libraries offer
catalogue search engines for users to find relevant books and articles.

1.4 Report Writing


Students of this course would be expected to have basic writing skills. The aim of
this section is to guide and orientate students on report writing for the End-of-Course
Assessment for ANL312 Text Mining and Applied Project Formulation as well as the
project proposal and project final report for ANL488 Business Analytics Applied Project.

First of all, the importance of having good report writing skills cannot be emphasised
enough. Many students fail to recognise the importance of report writing. They tend to
spend more time on the project and devote little time to reporting on the project. As a
result, the report is poorly written although much effort has been put into the project.

It is important to note that the task of report writing is a thinking process. The process starts
with ideas being jotted down as brief notes; the ideas are then further organised
into a draft. Thinking and writing go hand in hand as the ideas get written up. Parts of the
writing may be deleted after some deliberation, and existing paragraphs are rephrased
to improve the clarity and completeness of the ideas. In fact, many good reports are not
written in one go. They usually undergo numerous rounds of revision before taking final
shape.

1.4.1 Guidelines on Report Writing


This section will provide guidelines to address common weaknesses found in most
students’ reports.


Correct grammatical and typographical errors

It is commonly observed that many students’ reports come with rather conspicuous
grammatical and typographical errors. This is surprising because most students actually
possess a reasonable command of English. So, a possible reason could be that the reports
have not been proof-read. And what could be the reason for not proof-reading? Well, it
could be that the report was written at the last minute!

In fact, proofreading is not a difficult task. Typographical errors are usually quite easy
to fix using the spell-checking feature that is common in word processing software.
Furthermore, proofreading can also be done by enlisting the help of someone with a
fresh pair of eyes to look for errors.

Make report coherent

A basic rule of writing is to use appropriate but simple words. Avoid using big words.
Keep sentences short. Avoid mingling different points into a single long paragraph.
Instead, express one point in each paragraph.

If there is a big idea, break it down into smaller points, and link up these points to ensure
a coherent presentation of the big idea.

Proper reference citation

Proper reference citation shows that the report builds upon relevant knowledge contributed
by others in the field. Citing the works of others also helps readers to find more
information if they need it, and gives due acknowledgement to the authors for their
contributions. Please refer to section 1.5 for details on how to credit sources using the
American Psychological Association (APA) style.

A business analytics project normally involves using a dataset. It is necessary to cite the
source of this dataset. At the same time, acknowledgement should be given to those who
provided help in the project. This could be a friend who helped to proofread the
report or someone who gave information that led to the successful implementation
of the project. It is also necessary to acknowledge the funding organisation, if any.

Proper use of figures and tables

Tables and figures are important elements in reports. They can be used to communicate
ideas or results in an intuitive manner. Poorly constructed tables or unclear figures reflect
poorly upon the report, and should be avoided at all costs.

One way to make tables and figures more intuitive is to provide captions that are
descriptive and meaningful. For example, an informative caption such as ‘Table 1:
Customer Data; Group 1 has the highest propensity to return’, is definitely better than a
meaningless caption such as ‘Table 1: data’.

Sometimes, tables and figures in reports are too small. This can happen
when authors shrink figures and tables to meet page-limit requirements for
publications. Small tables and figures are unreadable, so it is suggested that the report be
reorganised to make room for important tables and figures.

It is also important to ensure that figures are of consistent size and the details are clearly
visible when printed on a black and white printer. For tabulated data, it is usually
necessary to highlight parts of the table so as to catch the readers’ attention. Usually,
textual effects such as boldface and underline can be used to enhance the visual impact of
tabulations.

Ensure consistency

Consistency is a simple and general rule, yet it is very important to every aspect of report
writing.

For textual presentation, a consistent font, style and size must be used for the respective
parts of a report. For example, if a font size of 14 is used for a section title, make sure
that this is applied consistently to all the section titles in the report.


Consistency in textual presentations can also be achieved using word processing software.
For example, Microsoft Word has style and formatting features that one can use to achieve
the consistency required.

In terms of language, consistency means that Standard English should be used; and there
is definitely no place for ‘Singlish’ and slang. For terms that may have more than one way
of writing them, it is still important to stay consistent. For example, the word ‘dataset’ can
also be written as ‘data set’. It is important to ensure that this word is written in the same
way throughout the entire report.

In terms of citations and reference list, it is also important to ensure a consistent style is
used and in-text citations match references at the end of a report. While revising a report,
one may decide to take out certain paragraphs and add in new ones. This could leave old
references that are no longer cited in the reference list. On the other hand, there might
be new citations which are not in the reference list. Hence, checking for consistency is
important in report writing.
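
The cross-check between in-text citations and the reference list can be partly automated.
The following is a minimal sketch, assuming author-date citations in the forms shown in
section 1.5 and one reference per line. The regular expressions, the `check_citations`
function and the sample body text are assumptions made for this illustration, and they will
miss less regular citation forms (group authors, multiple works in one bracket, and so on).

```python
import re

NAME = r"[A-Z][A-Za-z'-]+"
CO_AUTHORS = rf"(?: et al\.| (?:and|&) {NAME})?"
# Narrative form, e.g. "Olson and Shi (2007)" or "Miller (2015)".
NARRATIVE = re.compile(rf"\b({NAME}){CO_AUTHORS} \((\d{{4}})\)")
# Parenthetical form, e.g. "(Lau et al., 2005)".
PARENTHETICAL = re.compile(rf"\(({NAME}){CO_AUTHORS}, (\d{{4}})\)")
# Reference entry, e.g. "Olson, D., & Shi, Y. (2007). Title..."
REFERENCE = re.compile(rf"^({NAME}),.*?\((\d{{4}})\)", re.MULTILINE)

def check_citations(body_text, reference_list):
    """Return (cited but not listed, listed but not cited) as sorted (surname, year) pairs."""
    cited = set(NARRATIVE.findall(body_text)) | set(PARENTHETICAL.findall(body_text))
    listed = set(REFERENCE.findall(reference_list))
    return sorted(cited - listed), sorted(listed - cited)

# Sample body text written for this sketch; the references are taken from Table 1.2.
body_text = (
    "Olson and Shi (2007) introduce business data mining. "
    "Text mining has been applied to the hotel industry (Lau et al., 2005). "
    "Miller (2015) defines text mining as the automated or partially automated "
    "processing of text."
)
reference_list = (
    "Lau, K., Lee, K., & Ho, Y. (2005). Text mining for the hotel industry.\n"
    "Leedy, P. D., & Ormrod, J. E. (2010). Practical research: Planning and design (9th ed.).\n"
    "Olson, D., & Shi, Y. (2007). Introduction to business data mining. McGraw-Hill."
)

missing_from_list, never_cited = check_citations(body_text, reference_list)
```

Here `missing_from_list` flags the Miller (2015) citation, which has no matching entry, and
`never_cited` flags the Leedy entry, which is no longer cited in the text. A check like this
supplements, but does not replace, a manual read-through.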

1.4.2 Final Remarks


It is important to note that there is no shortcut to becoming a good writer. Writing cannot
be improved simply by reading and understanding the guidelines detailed in section 1.4
of this study unit. It improves only through constant practice and repeated revision
of the report.

In short, good report writing involves a lot of hard work and common sense. Good report
writers are meticulous. They pay attention to every word, sentence, paragraph, section
and chapter, and use words with care. They think critically and write clearly. They do not
write their reports just once; they review and correct them repeatedly! They
normally start writing their reports at an early stage of their projects, and they
allocate sufficient time at the end of the project to ensure that the report is vetted
and thoroughly revised before final submission.


1.5 Crediting Sources Using the APA Style

1.5.1 Introduction
In report writing, there is a need to credit the sources from which ideas, concepts, statistics
and findings are drawn to support the report that is being written. Citations are used
extensively throughout the report, especially in the Introduction and Literature Review
sections. In the Introduction section, general concepts and ideas are introduced and it is
common to cite statistics to support these concepts and ideas. In the Literature Review
section, it is normal to describe the details, findings and conclusions of several journal
articles that relate to the topic chosen for the report.

In this section, the guidelines for crediting sources according to the Publication Manual
of the American Psychological Association (APA) (7th edition) will be used and it will be
referred to as the “APA style”.

There are essentially two parts to crediting sources in a report. They are placing a brief
reference citation in the text of the report and giving complete citation details in the
reference list. Section 1.5.2 will address how to cite references in text while section 1.5.3
will cover the reference list.

1.5.2 Citing References in Text


The author-date method is used to cite references in text and there are two ways in which
it can be done. They are:

Part of the narrative

Examples are given below:

In this light, Miller (2015) defined text mining as “the automated or partially automated
processing of text. It involves imposing structure upon text and extracting relevant
information from text”.

In this study, Leong et al. (2012) explored the use of sentiment mining to analyse
teaching feedback from students.


Not part of the narrative – using the parenthetical format

Examples are given below:

“Text mining is defined as the automated or partially automated processing of text.
It involves imposing structure upon text and extracting relevant information from
text” (Miller, 2015).

Sentiment mining is a useful tool to analyse teaching feedback from students (Leong et al.,
2012).

Both ways can be used in the same report. Table 1.1 below shows the basic citation formats
using the APA style (American Psychological Association, 2020). The citation format
depends on the type of citation and whether the citation is the first or subsequent citation.

Table 1.1 Basic Citation Formats for Citing References in Text

| Type of citation | Narrative citation | Parenthetical citation |
| --- | --- | --- |
| One work by one author | Cokins (2009) | (Cokins, 2009) |
| One work by two authors | Olson and Shi (2007) | (Olson & Shi, 2007) |
| One work by three or more authors | Chan et al. (2010) | (Chan et al., 2010) |
| Groups (with abbreviation) as authors: first citation | Singapore University of Social Sciences (SUSS, 2017) | (Singapore University of Social Sciences [SUSS], 2017) |
| Groups (with abbreviation) as authors: subsequent citation | SUSS (2017) | (SUSS, 2017) |
| Group without abbreviation as authors | Stanford University (2020) | (Stanford University, 2020) |
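
The author-date patterns used in the examples of section 1.5.2, such as "Miller (2015)" and
"(Leong et al., 2012)", are regular enough to express as a small helper. This is only a
sketch: the function names are made up for illustration, group authors with abbreviations
are not handled, and the use of "and" in narrative citations versus "&" in parenthetical
ones follows the APA 7th edition convention.

```python
def author_text(surnames, joiner):
    """Render the author part: one name, two names joined, or 'First et al.'."""
    if len(surnames) == 1:
        return surnames[0]
    if len(surnames) == 2:
        return f"{surnames[0]} {joiner} {surnames[1]}"
    return f"{surnames[0]} et al."

def narrative_citation(surnames, year):
    """Citation that forms part of the sentence, e.g. 'Olson and Shi (2007)'."""
    return f"{author_text(surnames, 'and')} ({year})"

def parenthetical_citation(surnames, year):
    """Citation placed in brackets after the statement, e.g. '(Olson & Shi, 2007)'."""
    return f"({author_text(surnames, '&')}, {year})"

one_author = narrative_citation(["Cokins"], 2009)
two_authors = parenthetical_citation(["Olson", "Shi"], 2007)
three_authors = parenthetical_citation(["Lau", "Lee", "Ho"], 2005)
group_author = parenthetical_citation(["Stanford University"], 2020)
```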

1.5.3 Reference List


A reference list cites works that specifically support a report or article. It is unlike a
bibliography which cites works for background information or further reading and may
also include descriptive notes. Each reference that is cited in text needs to be listed in the
reference list at the end of a report or article.

Each item in the reference list is made up of the following basic elements: author, year
of publication, title and publishing data (American Psychological Association, 2020). The
general APA style for items in a reference list is given in Table 1.2.

Table 1.2 General Formats for Items in a Reference List

| Item | General format | Example |
| --- | --- | --- |
| Books | Author, A. A., & Author, B. B. (YYYY). Title of book. Publisher. | Olson, D., & Shi, Y. (2007). Introduction to business data mining. McGraw-Hill. |
| Books with subsequent editions | Author, A. A., & Author, B. B. (YYYY). Title of book (Xth ed.). Publisher. | Leedy, P. D., & Ormrod, J. E. (2010). Practical research: Planning and design (9th ed.). Pearson Education Inc. |
| Journal articles | Author, A. A., & Author, B. B. (YYYY). Title of article. Title of Journal, vol(issue, when appropriate), page-number/s. | Garrett, J. M. (1997). Graphical assessment of the Cox model proportional hazards assumption. Stata Technical Bulletin, 35, 9-14. |
| Course reports | Author, A. A., & Author, B. B. (YYYY). Title of report (Report No.). Publisher. | Chee, A. (2010). Text mining on product satisfaction and feedback: A case study of car model Citroën C4 (ANL311 End of Course Assessment Report). Singapore University of Social Sciences, School of Business. |
| Online documents from websites | Author, A. A., & Author, B. B. (year if available, if not, n.d.). Title of document. Retrieved from statement with retrieval date. | Victoria University. (2019). APA referencing: Sample APA reference list. Retrieved September 12, 2019, from http://libraryguides.vu.edu.au/apa-referencing/sample-apa-reference-list |
| Online documents from websites with DOI assigned | Author, A. A., & Author, B. B. (year if available, if not, n.d.). Title of article. Title of Journal, vol(issue, when appropriate), page-number/s. https://doi.org/number | Lau, K., Lee, K., & Ho, Y. (2005). Text mining for the hotel industry. Cornell Hotel and Restaurant Administration Quarterly, 46(3), 344-362. https://doi.org/10.1177/0010880405275966 |
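
The general formats for books and journal articles in Table 1.2 can be read as simple
templates. The sketch below fills in those two templates. The function names and parameters
are hypothetical, italics cannot be represented in plain strings, and the other item types
in the table are not covered.

```python
def join_authors(authors):
    """Join 'Surname, I.' strings with commas and '&' before the last author."""
    if len(authors) == 1:
        return authors[0]
    return ", ".join(authors[:-1]) + ", & " + authors[-1]

def book_reference(authors, year, title, publisher, edition=None):
    """Authors (YYYY). Title of book (Xth ed.). Publisher."""
    edition_part = f" ({edition} ed.)" if edition else ""
    # rstrip avoids a doubled full stop when the publisher name ends with one.
    return f"{join_authors(authors)} ({year}). {title}{edition_part}. {publisher.rstrip('.')}."

def journal_reference(authors, year, title, journal, volume, pages, issue=None):
    """Authors (YYYY). Title of article. Title of Journal, vol(issue), pages."""
    issue_part = f"({issue})" if issue else ""
    return (
        f"{join_authors(authors)} ({year}). {title}. "
        f"{journal}, {volume}{issue_part}, {pages}."
    )

book = book_reference(
    ["Olson, D.", "Shi, Y."], 2007, "Introduction to business data mining", "McGraw-Hill"
)
later_edition = book_reference(
    ["Leedy, P. D.", "Ormrod, J. E."], 2010,
    "Practical research: Planning and design", "Pearson Education Inc.", edition="9th",
)
article = journal_reference(
    ["Garrett, J. M."], 1997,
    "Graphical assessment of the Cox model proportional hazards assumption",
    "Stata Technical Bulletin", 35, "9-14",
)
```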

The items in the reference list are to be sorted according to the following rules:
• Alphabetical order by the surname of the first author, followed by the initials of the
author’s given name
• Several works by the same first author
◦ One-author entries by the same author are arranged by year of publication, with the
earliest first
    Cokins, G. (2007)…
    Cokins, G. (2009)…
◦ One-author entries precede multiple-author entries beginning with the same
surname, even if the multiple-author entries were published earlier
    Olson, D. (2009)…
    Olson, D., & Shi, Y. (2007)…
◦ Multiple-author entries with the same first author are arranged alphabetically by
the surname of the second author (or, if the second author is the same, the next
author that differs)
    Olson, D., DeFrain, J., & Skogrand, L. (2006)…
    Olson, D., & Shi, Y. (2007)…
• Several works by different first authors with the same surname
◦ Arrange works by different authors with the same surname alphabetically
by first and subsequent initials
    Olson, D., & Shi, Y. (2007)…
    Olson, D. H. L., DeFrain, J., & Skogrand, L. (2006)…
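
These ordering rules map naturally onto a sort key: compare the authors one by one (surname,
then initials), and use the year only to break ties. The sketch below is a simplified
illustration. The `sort_key` function and the tuple layout of each entry are assumptions,
and it ignores finer points of APA ordering such as prefixes in surnames or works with no
date.

```python
def sort_key(entry):
    """Order by each author's (surname, initials) in turn, then by year."""
    authors, year = entry
    normalised = tuple(
        (surname.lower(), initials.replace(".", "").replace(" ", "").lower())
        for surname, initials in authors
    )
    # A shorter author tuple sorts before a longer one that starts the same way,
    # so one-author entries come before multiple-author entries.
    return (normalised, year)

# Each entry is ([(surname, initials), ...], year), using the examples from the rules.
entries = [
    ([("Olson", "D."), ("Shi", "Y.")], 2007),
    ([("Cokins", "G.")], 2009),
    ([("Olson", "D. H. L."), ("DeFrain", "J."), ("Skogrand", "L.")], 2006),
    ([("Olson", "D.")], 2009),
    ([("Cokins", "G.")], 2007),
]

ordered = sorted(entries, key=sort_key)
# Short labels for display: first author's surname, number of authors, year.
labels = [(authors[0][0], len(authors), year) for authors, year in ordered]
```

Putting the year last in the key is what lets Olson, D. (2009) precede Olson, D., & Shi, Y.
(2007) even though the two-author work was published earlier.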


Chapter 2: Text Mining Overview


2.1 What is Text Mining?


Text mining is a broad concept that involves different technologies aimed at achieving
different text processing objectives. In general, text mining can be defined as “the
automated or partially automated processing of text. It involves imposing structure upon
text and extracting relevant information from text” (Miller, 2015). Data in the world can
be roughly classified as structured data and unstructured data. Structured data is
usually represented by numbers and/or categories with defined data types in a formal
data model such as a table or a database. For example, the transaction records
stored in a company’s database are structured data. In contrast, unstructured data
has no predefined data model or standard format of presentation. Text is one type of
unstructured data. However, unstructured data is not limited to text but also includes
images, audio, video and other data without a standard format.

In fact, the amount of unstructured data is larger than one might imagine. The 80
Percent Rule states that unstructured information is estimated to make up as much as 80%
of the information within an organisation (Shilakes, 1998). Although this rule of thumb is
not based on any primary or quantitative research, it is generally accepted that there is
more unstructured data than structured data, and business organisations are realising the
importance of unstructured data, both within the organisation and on the Internet. The
focus of this course is unstructured text data.

Here are some examples of unstructured text data.

• Emails
• Reports
• Call centre logs
• Product online reviews
• Contracts
• News articles
• Social media posts
• …

The value of text data comes from the rich information contained in various text documents,
which supplements the structured data. For business organisations, for example, structured
data only records customers’ purchase behaviour, such as who purchased which product
at what time. However, text data like online reviews and social media posts provides
opportunities for business organisations to understand their customers’ evaluations and
expectations of their products. Because the amount of text data worth exploring is so
large, processing it manually is very costly in time and manpower. Therefore, automated or
partially automated technologies for processing text data are preferred, and text mining is
well recognised as an effective and efficient tool to extract insights from text data.

The basic task of text mining is to convert unstructured text data to structured data. For
example, the text mining objective could be to classify text documents so that each
document is assigned a category that represents its features. The unstructured text data
can then be presented as structured data with a column of text document ids and a column
of category names. This is why the overview section of this study unit describes text
mining as a broad range of technologies that process unstructured text data and convert
it to structured data. The common theme behind these technologies is converting text into
numbers so that different algorithms can be applied. How unstructured text data can be
converted to numbers will be covered in study unit 4.
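
The idea of turning documents into a table with a document id column and a category column can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the category names and the keyword lists are all invented, and a simple keyword-overlap rule stands in for the classification techniques covered later in the course.

```python
# Hypothetical documents, invented for illustration only.
documents = {
    "D1": "My phone bill is higher than the plan I signed up for.",
    "D2": "The network keeps dropping calls in my area.",
    "D3": "I would like to upgrade my contract to a new plan.",
}

# Assumed category keywords; a real project would learn these from labelled data.
category_keywords = {
    "billing": {"bill", "charge", "refund"},
    "network": {"network", "signal", "calls"},
    "contract": {"contract", "upgrade", "renew"},
}


def assign_category(text):
    """Return the category whose keywords overlap most with the text."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    scores = {cat: len(words & kws) for cat, kws in category_keywords.items()}
    best = max(scores, key=scores.get)
    # Fall back to "other" when no keyword matches at all.
    return best if scores[best] > 0 else "other"


# Structured output: one row per document with an id column and a category column.
structured_rows = [(doc_id, assign_category(text)) for doc_id, text in documents.items()]
for row in structured_rows:
    print(row)
```

Each printed row is a record that could be stored in an ordinary database table alongside other structured data.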

2.2 Challenges in Dealing with Text


In this section, various challenges in dealing with text are discussed.

No standard way of writing text

Text that is written, captured or recorded in files and databases may not follow a
standard format, as the format depends on the purpose of the text, the recording medium
and the person doing the recording. Figure 1.1 shows a sample of call centre logs. The
logs are written to identify the issues raised in the call without details of the caller,
as seen in the text coloured in green. The text is a summary of the call written either
by the call centre representative who answered the call or by another call centre staff
member reading the verbatim logs or listening to a recording of the conversation.

Figure 1.1 No Standard Way of Writing Text

There can also be different formats between different types of text. Analysts’ reports,
news articles and companies’ press releases are usually written with the highest standard
of adherence to grammatical rules and clarity so that they are comprehensible to their
respective audiences. Text in airline incident reports, patient records and police case
files may be written in the respective field’s jargon with less adherence to grammatical
rules. At the other extreme is text in instant messaging systems that operate through
phones, the Internet and computer networks. Such text tends to be short, not in proper
sentences, and populated with emoticons that denote emotions such as sad or happy
(e.g., ;-( and :-)).

The challenge that this variety of formats poses is that the text mining process must be
robust enough to handle all of them.

Abbreviations and Short-forms

The use of abbreviations and short-forms is closely tied to the formats discussed in the
previous challenge. More formal texts contain fewer of them. Their use increases in
technical texts and is rampant in instant messaging. In Figure 1.2, the highlighted term
“rep” could stand for “repetitive” or “representative”. Only after reading the whole text
can it be concluded that “rep” stands for “representative”.

Figure 1.2 Abbreviations and Short-forms

Spelling Errors

Spelling errors are another challenge when it comes to text. Not all spelling errors are
easy to resolve. Figure 1.3 highlights the word “to” as being misspelled. However, it is
difficult to detect the misspelling without reading the whole sentence for contextual clues.

Figure 1.3 Spelling Errors

Synonyms

Another challenge is dealing with synonyms in text. In Figure 1.4, “plan” is synonymous
with “phone plan”, “contract”, “phone contract”, etc. These terms need to be mapped
together to reduce the number of terms that one has to deal with in a collection of text
data.

Figure 1.4 Synonyms

Redundant Terms

Some terms are redundant. Examples are those highlighted in green in Figure 1.5 such as
the, is, etc. Removal of these terms would not involve any loss of information.

Figure 1.5 Redundant Terms

Because of the various challenges in dealing with text, it is very important and also
a common step to preprocess the text data before further applying any text mining
technique. Preprocessing text is just like cleaning structured data when we do data mining.
The purpose of preprocessing text is to produce clean text data so that further text mining
steps can generate more reliable results. The details of preprocessing text will be covered
in study unit 2.
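
As a rough illustration of how some of these challenges might be handled in preprocessing, the sketch below expands abbreviations, maps synonyms to one term and removes redundant terms. The lookup tables and the sample sentence are hypothetical, and the approach is deliberately naive: a plain dictionary lookup cannot use context, so it would wrongly expand “rep” in a text where it means “repetitive”, and spelling errors are not handled at all.

```python
import re

# Hypothetical lookup tables; real projects build these with domain experts.
ABBREVIATIONS = {"rep": "representative", "cust": "customer"}
SYNONYMS = {"contract": "plan"}
STOP_WORDS = {"the", "is", "a", "to", "of", "and", "his", "about"}


def preprocess(text):
    """Lowercase, tokenise, expand abbreviations, map synonyms, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # context-free expansion
    tokens = [SYNONYMS.get(t, t) for t in tokens]       # map synonyms to one term
    return [t for t in tokens if t not in STOP_WORDS]   # remove redundant terms


print(preprocess("Cust called the rep about his contract."))
```

The order of the steps matters here: abbreviations are expanded before synonyms are mapped so that an expanded term can still be matched against the synonym table.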

2.3 General Text Mining Process

2.3.1 Text Mining Terminology


Here is a list of text mining terminology that you need to understand before the general
text mining process is introduced:

• Text is usually referred to as unstructured data. A text or document is minimally a
sentence but it can also be a paragraph, several paragraphs, a section or several
sections, a chapter or several chapters, up to a full report or article.
• Corpus is a collection of text documents.
• Natural language processing (NLP) is the computer processing of text using the
language’s linguistic rules.
• Parsing is the process of extracting, cleaning and creating a dictionary of words from
the documents using NLP.

2.3.2 Text Mining Process


A typical predictive modelling project deals with data in numerical or categorical form,
so statistical methods can be used to generate insights from the data. Likewise, the
fundamental idea of text mining is to derive a quantitative representation of text
documents. Once the text is transformed into a set of numbers that properly represents
the features of the text, any traditional statistical method can be applied to generate
insights from the text data.

A typical text mining project in Business Analytics includes data collection, text parsing, text
filtering, transformation/vectorisation, and functioning/application. Figure 1.6 shows the text
mining process flow adapted from Chakraborty et al. (2014).

Figure 1.6 Text Mining Process Flow

• Data collection: The first step of any text mining project is to collect the text
data required for the analysis. The text data can come from internal or external
documents, web pages, user/customer comments from social media or be collected
by survey or other sources containing text data that is useful to the project.

• Text parsing: The next step is to extract, clean and create a dictionary of words
from the documents using NLP. This step normally involves identifying sentences,
splitting sentences into words (tokens) based on some rules, determining parts of
speech and stemming words. This step also includes removing general stop words
and spell checking. Most of the challenges in dealing with text that are discussed
above are addressed in this step. After text parsing, the text data is converted to a
collection of words (terms) for further processing. The details of text parsing will
be covered in the text preprocessing section of study unit 2.
• Text filtering: Among all the terms generated in the text parsing step, many are
likely to be irrelevant or unimportant for differentiating documents from each
other or for summarising the documents. One option is to browse through the
terms manually and eliminate irrelevant ones by creating customised start/stop
lists, which requires domain knowledge. Another option is to remove unimportant
terms by setting a threshold that eliminates terms with extremely high frequency.
• Transformation/Vectorisation: The next step is text transformation. This step deals
with the numerical representation of the text using linear algebra-based methods.
The terms remaining after text filtering are transformed to vector format. A
commonly used format is the document-term matrix or the term-document matrix,
a mathematical matrix that describes the frequency of terms that occur in a
collection of documents. In a document-term matrix, each row corresponds to a
document in the corpus and each column corresponds to a term, while in a
term-document matrix, each column corresponds to a document and each row
corresponds to a term. The two matrices are transposes of each other: they hold the
same information in different presentation formats. Table 1.3 shows an example of
a document-term matrix.

Table 1.3 Document-Term Matrix

Document/Term   book   sport   command   love   Athletes   must   Fan

Document 1      0      0       3         1      0          2      0
Document 2      0      2       2         0      0          1      0
Document 3      1      0       0         2      0          3      1
Document N      0      1       1         2      0          0      2

• Functioning/Analysis: Text data is converted to numerical data in the
transformation step. Therefore, in the last step, traditional data mining algorithms
such as clustering, classification and association analysis, as well as other
statistical methods, can be applied.
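
The parsing, filtering and transformation steps above can be sketched with the Python standard library alone. The three-sentence corpus and the stop list are invented for illustration, and stripping a trailing “s” is only a crude stand-in for proper stemming; the software used in this course may work quite differently.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for illustration.
corpus = [
    "The fan loves the sport.",
    "Athletes love sport and fans love athletes.",
    "The book is about a sport fan.",
]
STOP_WORDS = {"the", "and", "is", "about", "a"}  # assumed stop list


def parse(text):
    """Text parsing: lowercase, tokenise, and crudely strip a trailing 's'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def text_filter(tokens):
    """Text filtering: drop terms that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]


parsed = [text_filter(parse(doc)) for doc in corpus]

# Transformation: one row per document, one column per term in the vocabulary.
vocabulary = sorted({t for tokens in parsed for t in tokens})
dtm = []
for tokens in parsed:
    counts = Counter(tokens)
    dtm.append([counts[term] for term in vocabulary])

# The term-document matrix is the transpose of the document-term matrix.
tdm = [list(column) for column in zip(*dtm)]

print(vocabulary)
for row in dtm:
    print(row)
```

Once the documents are rows of numbers like these, any algorithm that accepts a numeric table can be applied to them.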

The next section introduces the seven text mining practice areas and popular applications.
The remaining study units of this course cover some of the important and popular text
mining applications in the business world in detail.

2.4 Text Mining Practice Areas and Applications

2.4.1 Seven Text Mining Practice Areas


According to the framework proposed by Miner et al. (2012), text mining has seven practice
areas. Though distinct, these areas are highly interrelated; a typical text mining project
will require techniques from multiple areas. The seven practice areas are as follows:

• Search and information retrieval (IR): Search and information retrieval includes
the tasks of indexing, searching and retrieving text from a large collection of text
documents stored in a database or on the Internet using keyword queries. Search
and information retrieval is the core task of a search engine.

• Information extraction (IE): Information extraction is the identification and
extraction of relevant facts and relationships from unstructured text. It is a process
of making structured data from unstructured text.
• Natural language processing (NLP): Natural language processing refers to low-
level language processing and understanding tasks (e.g., tagging part of speech)
and is often used synonymously with computational linguistics.
• IE and NLP are usually applied in the text parsing and text filtering process of a
text mining project in business analytics.
• Document clustering: Document clustering groups and categorises terms,
snippets, paragraphs or documents using data mining clustering methods. Text
data is converted to numeric formats so that traditional clustering algorithms can
be applied to cluster the documents.
• Document classification: Document classification groups and categorises terms,
snippets, paragraphs or documents using data mining classification methods,
based on models trained on labelled examples. Text data is converted to numeric
formats so that traditional classification algorithms can be applied to classify the
documents.
• Web mining: Web mining refers to data and text mining on the Internet, with a
specific focus on web hyperlink structure, page content and usage data.
• Concept extraction: Concept extraction groups words and phrases into
semantically similar groups. The high-level groups summarise the text documents
and provide an effective and efficient way for humans to understand a collection
of text documents.

2.4.2 Text Mining Applications and Related Practice Areas


You may have heard of other terms such as entity extraction, sentiment analysis or topic
modelling and found that they are not among the practice areas just introduced. The
practice areas are sub-disciplines of text mining, and each practice area has many
topics/applications that aim to achieve a concrete objective. For example, sentiment
analysis is one application of document classification and topic modelling is one
application of concept extraction. Table 1.4 lists some of the popular text mining
topics/applications and the corresponding practice areas (Miner et al., 2012).

Table 1.4 Text Mining Topics and Related Practice Areas

Topic/Application        Practice Area

Keyword search           Search and information retrieval
Inverted index           Search and information retrieval
Document clustering      Document clustering
Document similarity      Document clustering
Feature selection        Document classification
Sentiment analysis       Document classification
Web crawling             Web mining
Link analytics           Web mining
Entity extraction        Information extraction
Link extraction          Information extraction
Part of speech tagging   Natural language processing
Tokenisation             Natural language processing
Topic modelling          Concept extraction
Synonym identification   Concept extraction
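
To make two of the search and information retrieval topics in Table 1.4 more concrete, the sketch below builds a small inverted index (a mapping from each term to the documents that contain it) and uses it for a keyword search. The documents are invented, and the function names `build_inverted_index` and `keyword_search` are hypothetical; real search engines add ranking, stemming and much more.

```python
import re
from collections import defaultdict

# Hypothetical call centre notes keyed by document id.
documents = {
    1: "Customer asked about phone plan upgrade",
    2: "Network outage reported by customer",
    3: "Phone contract renewal and plan change",
}


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def keyword_search(index, query):
    """Return sorted ids of documents containing every query term (AND search)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)


index = build_inverted_index(documents)
print(keyword_search(index, "phone plan"))
print(keyword_search(index, "customer plan"))
```

Looking terms up in the index avoids scanning every document for every query, which is why this structure sits at the core of keyword search.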

There have been various successful applications of text mining in the literature. These
serve to motivate data/text miners in applying text mining to new applications.

Customer Relationship Management

In banks and the telecommunications industry, text mining has been used on email, call
centre logs and surveys to uncover patterns in customer preference and complaints.

Fault Detection

In the airline industry, text mining has been used to detect trends in incident reports.
Particular airports may be experiencing heavy occurrences of certain mechanical problems
that are caused by certain fixable defective procedures.

Police/Security

Text mining can be used on police reports to detect trends that relate to how crimes were
committed. This information can be used to solve future crimes as well as to schedule
police patrols.

Web Mining

Text mining can be used to extract useful information from web sites, discussion forums
and blogs about various topics of interest and provide insights into current trends and
themes.

Medical/Health

Mining text of patient records can uncover symptoms associated with certain drugs.

Investment Research

Text mining can be used on daily analyst reports, news articles and company press releases
to uncover trends that affect stock price movements for certain companies or industries
over time.

2.5 Text Mining Using CRISP-DM


CRISP-DM is the cross-industry standard process for data mining. Table 1.5 spells out the
different activities typically carried out in a data mining project (Lee et al., 2014). The
same framework can also be used for text mining. Each phase is examined below to look at
the corresponding activities that are carried out in text mining.

Table 1.5 CRISP-DM Framework

Phase                      Descriptions

1 Business Understanding   • Determine the business goals and objectives
                           • Translate business goals into clear definitions
                             of the data mining problem
                           • Prepare a preliminary plan to achieve these
                             data mining objectives

2 Data Understanding       • Collect the data
                           • Perform exploratory data analysis to
                             understand the data
                           • Evaluate the quality of the data
                           • Identify subset of the data for modelling

3 Data Preparation         • Prepare the raw dataset
                           • Select the relevant variables and cases for the
                             data mining objectives
                           • Apply the necessary transformations to the
                             data
                           • Deal with missing values
                           • Most labour-intensive phase

4 Modelling                • Select and apply the appropriate modelling
                             techniques
                           • Build and calibrate several models where
                             appropriate
                           • Loop back to the data preparation phase if
                             necessary

5 Evaluation               • Compare the different models built in
                             modelling phase
                           • Evaluate whether the results produced have
                             achieved the business objectives

6 Deployment               • Produce final report and presentation
                           • Identify how the results can be translated into
                             appropriate business strategies to achieve the
                             business goals
                           • Implement the selected model to score new
                             data

Business Understanding

There are typically three scenarios when embarking on text mining. They are:

1. Embarking on a pure text mining project;
2. Considering the use of text as additional inputs in a completed data mining
project;
3. Embarking on a data mining project with the intention to use both structured
data and text.

Irrespective of the scenario, the availability of text data must be determined in this phase.
If it is not available, then the text mining project cannot proceed, especially for scenarios
1 and 2.

Data Understanding

This phase includes identification of the text data source, text data collection and
assessment of the quantity and quality of the text data. For text mining, domain expertise
and a good understanding of the text are needed, especially if special terms and
vocabulary are used.

Data Preparation

For text mining, data preparation refers to text data pre-processing, which normally
includes identifying sentences, splitting sentences into words (tokens) based on some
rules, determining parts of speech and stemming words. This step also includes removing
stop words and spell checking.

Modelling

Similar to data mining, one can build text mining models with different algorithms.
For example, to build a document clustering model, one can select different clustering
techniques to build the model. Outputs of text mining models can be used to score new
text or as inputs to improve the accuracy of the predictive models.

If the model accuracy is not acceptable, then one can return to the text mining model
to modify the parameters in the selected algorithm or try a different algorithm. The text
mining results can then be fed as inputs to the predictive model to see if the accuracy is
improved.

Evaluation

There is no formal, objective way to evaluate text mining models. This is unlike models
using structured data, where the actual target and predicted target can easily be
compared with accuracy measures. For text mining models, it is fairly time consuming to
mimic what is done with structured data, since this entails reading a sample of text to
verify that the correct results are generated.

However, evaluation is straightforward when dealing with scenarios 2 and 3, where a
predictive model is involved and the actual and predicted targets are present in the
data. The usual accuracy and other statistics can be calculated.
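
When actual and predicted targets are both available, the comparison can be computed directly. The sketch below uses six invented label pairs; accuracy is the statistic named above, while precision and recall are included as examples of what “other statistics” might mean, which is an assumption rather than something this study unit prescribes.

```python
# Hypothetical actual and predicted targets for six scored records.
actual = ["churn", "stay", "stay", "churn", "stay", "churn"]
predicted = ["churn", "stay", "churn", "churn", "stay", "stay"]


def accuracy(actual, predicted):
    """Share of records whose predicted target equals the actual target."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def precision_recall(actual, predicted, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(f"accuracy = {accuracy(actual, predicted):.2f}")
precision, recall = precision_recall(actual, predicted, positive="churn")
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

For a pure text mining model with no labelled target, no such table of actual values exists, which is why evaluation then falls back on reading samples of text.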

Deployment

Once the text mining models have been finalised, they can be deployed. Depending on
the business objectives, the deployment phase can be as simple as writing a report or as
complex as implementing the models in a business intelligence system. Implementing the
models means scoring new text data. For example, to implement a document clustering
model, each new text document is assigned to one of the clusters by applying the model
developed in the modelling phase. The purpose of scoring new data is to estimate the
response, class or value that is valuable to business decision making. There are three
ways of model scoring:

• Interactive scoring: Model scoring is done in the data mining software that was
used to develop the model initially.
• Batch scoring: Model scoring is carried out on a regular basis, with scores stored in
a data warehouse.
• Real-time scoring: Model scoring is carried out on an ad hoc basis.
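
Scoring a new document with a clustering model can be pictured as assigning it to the closest existing cluster. The sketch below assumes the modelling phase produced a fixed vocabulary and one centroid vector per cluster, and it uses cosine similarity to pick the nearest centroid; the vocabulary, centroid values and cluster names are all invented, and actual data mining software may score documents differently.

```python
import math

# Hypothetical output of the modelling phase: a fixed vocabulary and one
# centroid (average term-count vector) per cluster. All values are invented.
vocabulary = ["bill", "network", "plan", "signal"]
centroids = {
    "billing issues": [2.0, 0.0, 1.0, 0.0],
    "network issues": [0.0, 2.0, 0.0, 1.5],
}


def vectorise(text):
    """Count how often each vocabulary term occurs in the new document."""
    tokens = text.lower().split()
    return [float(tokens.count(term)) for term in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score(text):
    """Assign a new document to the cluster with the most similar centroid."""
    vector = vectorise(text)
    return max(centroids, key=lambda name: cosine(vector, centroids[name]))


# Batch-style scoring: apply the same model to a list of new documents.
new_documents = ["no signal and the network is down", "my bill for this plan"]
for doc in new_documents:
    print(doc, "->", score(doc))
```

A document that shares no terms with the vocabulary has zero similarity to every centroid, so a production system would need an explicit rule for such cases rather than the arbitrary first cluster this sketch would return.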

Summary

Chapter 1 of this study unit gives an overview of typical information required for data
mining research, and where and how such information can be obtained. It also provides
guidelines on good report writing. These guidelines are not exhaustive, and they are
provided with the aim of addressing common weaknesses found in students’ reports.

Chapter 2 of this study unit covers the background needed in text mining, such as
definition and terminology, challenges in dealing with text, the text mining process, and
text mining practice areas and applications. Conducting a text mining project using the
CRISP-DM framework is also introduced.

Formative Assessment

1. Visit the KDNuggets website and list three data repository websites.

2. Visit the UCI Machine Learning Data Repository website. Download the most
popular data set and briefly describe the content of the dataset.

3. Read the following paragraph. Apply APA style and construct the citations (source
1 and source 2) in boldface.

A few researchers in the linguistics field have developed training programmes
designed to improve native speakers' ability to understand accented speech (source
1 and source 2). Their training techniques are based on the research described above
indicating that comprehension improves with exposure to non-native speech. Source
1 conducted their training with students preparing to be social workers, but note that
other professionals who work with non-native speakers could benefit from a similar
programme.

Information of source 1 and source 2 is given below.


Source 1:
Authors:
Derwing, T. M., Rossiter, M. J., and Munro, M. J.
Year: 2002

Source 2:
Author(s):
Krech Thomas, H.
Year: 2004

4. Apply APA style and write the reference in correct format.

Author: Derwing, T. M., Rossiter, M. J., and Munro, M. J.
Year of publication: 2002
Title of article: Teaching native speakers to listen to foreign accented speech
Title of journal: Journal of Multilingual and Multicultural Development
Volume: 23
Page: 245-259

5. Apply APA style to order the following references.

Zohar, D. (2003). XXX.
James, L. R., Demaree, R. G., & Wolf, G. (1984). XXX.
James, L. A., & James, L. R. (1989). XXX.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). XXX.
Blau, P. (1964). XXX.
Zohar, D. (2002). XXX.
Neal, A., & Griffin, M. A. (2004). XXX.

6. Explain the limitations of text mining.

7. Which one is not a text parsing task?


a. Tokenisation
b. Document-Term Matrix
c. Stemming
d. Stop words

8. Explain how statistical algorithms can be applied to text data.

9. Suggest two other applications of text mining besides those discussed in section 2.4.2
of this study unit.

10. Compare the application of the CRISP-DM framework in a data mining project and in a
text mining project.

Solutions or Suggested Answers

Formative Assessment
1. Visit the KDNuggets website and list three data repository websites.
Suggested Answer: Here are three examples:

AWS Public Data Sets: https://aws.amazon.com/opendata/public-datasets/

Kaggle Datasets: https://www.kaggle.com/datasets

Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets

2. Visit the UCI Machine Learning Data Repository website. Download the most
popular data set and briefly describe the content of the dataset.
Suggested Answer: The most popular dataset is Iris as of 25 June 2020. This dataset
is for pattern recognition and is used to build classification models. This dataset
has 150 instances. Students can discuss more about this dataset based on the
information on the page https://archive.ics.uci.edu/ml/datasets/Iris

3. Read the following paragraph. Apply APA style and construct the citations (source
1 and source 2) in boldface.

A few researchers in the linguistics field have developed training programmes
designed to improve native speakers' ability to understand accented speech (source
1 and source 2). Their training techniques are based on the research described above
indicating that comprehension improves with exposure to non-native speech. Source
1 conducted their training with students preparing to be social workers, but note that
other professionals who work with non-native speakers could benefit from a similar
programme.

Information of source 1 and source 2 is given below.

Source 1:
Authors:
Derwing, T. M., Rossiter, M. J., and Munro, M. J.
Year: 2002

Source 2:
Author(s):
Krech Thomas, H.
Year: 2004

Suggested Answer: A few researchers in the linguistics field have developed
training programmes designed to improve native speakers' ability to understand
accented speech (Derwing et al., 2002; Krech Thomas, 2004). Their training
techniques are based on the research described above indicating that
comprehension improves with exposure to non-native speech. Derwing et al. (2002)
conducted their training with students preparing to be social workers, but note
that other professionals who work with non-native speakers could benefit from a
similar programme.

4. Apply APA style and write the reference in correct format.

Author: Derwing, T. M., Rossiter, M. J., and Munro, M. J.
Year of publication: 2002
Title of article: Teaching native speakers to listen to foreign accented speech
Title of journal: Journal of Multilingual and Multicultural Development
Volume: 23
Page: 245-259

Suggested Answer: Derwing, T. M., Rossiter, M. J., & Munro, M. J. (2002). Teaching
native speakers to listen to foreign-accented speech. Journal of Multilingual and
Multicultural Development, 23, 245-259.

5. Apply APA style to order the following references.

Zohar, D. (2003). XXX.
James, L. R., Demaree, R. G., & Wolf, G. (1984). XXX.
James, L. A., & James, L. R. (1989). XXX.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). XXX.
Blau, P. (1964). XXX.
Zohar, D. (2002). XXX.
Neal, A., & Griffin, M. A. (2004). XXX.

Suggested Answer:

Blau, P. (1964). XXX.
James, L. A., & James, L. R. (1989). XXX.
James, L. R., Demaree, R. G., & Wolf, G. (1984). XXX.
Neal, A., & Griffin, M. A. (2004). XXX.
Neal, A., Griffin, M. A., & Hart, P. M. (2000). XXX.
Zohar, D. (2002). XXX.
Zohar, D. (2003). XXX.

6. Explain the limitations of text mining.

Suggested Answer: There are several challenges in dealing with text, including no
standard way of writing text, abbreviations and short-forms, spelling errors,
synonyms and redundant terms. These challenges are not easy to handle with
100% accuracy.

7. Which one is not a text parsing task?

a. Tokenisation
Incorrect.

b. Document-Term Matrix
Correct.

c. Stemming
Incorrect.

d. Stop words
Incorrect.

8. Explain how statistical algorithms can be applied to text data.

Suggested Answer: In the text mining process, text transformation converts text to a
vector format such as the document-term matrix, in which each row corresponds to a
document in the corpus and each column corresponds to a term. Text data is thus
converted to numerical data, so statistical algorithms can be applied to the numerical
data that represents the text.

9. Suggest two other applications of text mining besides those discussed in section 2.4.2
of this study unit.
Suggested Answer:
1. Education: Text mining can be used to analyse student feedback and
extract useful information to improve teaching and management.
2. Human resource: Text mining can be used to analyse employee feedback
and extract useful information to improve internal process and increase
employee satisfaction.

10. Compare the application of the CRISP-DM framework in a data mining project and in a
text mining project.
Suggested Answer: The differences between applying the CRISP-DM framework in a data
mining project and in a text mining project are reflected mainly in the following phases.

Business understanding: There are typically three scenarios when embarking on text
mining. They are:
1. Embarking on a pure text mining project;
2. Considering the use of text as additional inputs in a completed data
mining project;
3. Embarking on a data mining project with the intention to use both
structured data and text.

Data understanding: The main difference is that text mining requires more domain
knowledge to understand the text, especially if special terms and vocabulary are
used.

Data preparation: The typical data preparation methods are different for data
mining and text mining. For text mining, data preparation refers to text data
pre-processing, which normally includes identifying sentences, splitting sentences into
words (tokens) based on some rules, determining parts of speech and stemming
words. This step also includes removing stop words and spell checking.

Evaluation: There is no formal, objective way to evaluate text mining models. This
is unlike models using structured data, where the actual target and predicted
target can easily be compared with accuracy measures.

References

American Psychological Association. (2020). Publication manual of the American
Psychological Association (7th ed.). American Psychological Association.

Chakraborty, G., Pagolu, M., & Garla, S. (2014). Text mining and analysis: Practical methods,
examples, and case studies using SAS. SAS Institute.

Dua, D., & Graff, C. (2019). UCI machine learning repository. Retrieved from
https://archive.ics.uci.edu/ml/index.php

Lee, Y., Chan, S., Leong, C., Tan, G., & Tan, S. (2014). ANL311e Selected topics in business
analytics (study guide). SIM University.

Leong, C. K., Lee, Y. H., & Mak, W. K. (2012). Mining sentiments in SMS texts for
teaching evaluation. Expert Systems with Applications, 39(3), 2584-2589.

Miller, T. W. (2015). Web and network data science: Modeling techniques in predictive
analytics. Pearson Education.

Miner, G., Elder IV, J., Fast, A., Hill, T., Nisbet, R., & Delen, D. (2012). Practical text mining
and statistical analysis for non-structured text data applications. Academic Press.

Shilakes, C. C. (1998). Enterprise information portals. IGI Global.

Wang, J. (2009). Encyclopedia of data warehousing and mining. IGI Global.
