Study Unit
Learning Outcomes
ANL312 Business Analytics Research & Text Mining Overview
Overview
Business Analytics involves the use of data mining and knowledge discovery techniques
to provide business users with critical insights into the operational and performance
characteristics of various aspects of business. It spans disciplinary areas such as Statistics,
Machine Learning and Artificial Intelligence. Research into new techniques and areas of
applications is expanding rapidly.
To stay relevant in this field, one has to keep pace with the latest developments as things
that are learnt today may no longer be applicable in the future. Furthermore, one needs to
expand one’s knowledge so that one is aware of the various tools and techniques available.
Having a broad knowledge will help one to employ appropriate tools for problems that
need to be solved. Thus, it is important to develop some basic skills in research and
writing.
Chapter 1 of this study unit gives an overview of research and writing in the Business
Analytics field. The main goal is to sharpen skills in learning the latest developments
in Business Analytics. To this end, the aims of Chapter 1 are twofold. Firstly, it provides
the basic research skills that are needed to expand knowledge in Business Analytics. One
will learn how to go beyond textbooks to learn new methods and applications of Business
Analytics, and to identify existing problems and the solutions proposed for them.
Secondly, it is hoped that through the exercises of research and writing, one will be
better prepared for the ANL488 Business Analytics Applied Project.
ANL488 is a 10-credit unit course that is to be completed at the end of the Business
Analytics programme. In this project, it is expected that the student will identify a specific
business problem and propose a Business Analytics solution to solve the problem. The focus
of chapter 1 of this study unit is on information collection and report writing. Firstly, the
sources of Business Analytics research will be explained. Then, where to look up Business
Analytics information will be covered. Lastly, the guidelines of report writing will be
explained.
Chapter 2 of this study unit gives an overview of text mining. Text mining or text analytics
refers to a broad range of technologies that can process unstructured text data and convert
unstructured text data to structured data. The common theme behind these technologies
is converting text into numbers so that different algorithms can be applied. Similar to
data mining, text mining seeks to extract useful information from large data sources by
identifying interesting patterns. The difference is that the data is now unstructured rather
than structured data in numeric or categorical format.
As there is far more unstructured text data than structured data in the real world, text
mining has become a very important tool for companies to better understand their internal
business processes, their customers and their competitors. There are many applications of
text mining in the business world. For example, in Customer Relationship Management,
the text sources can be the call centre logs that detail the nature of the calls. Interesting
patterns can be derived by categorising the logs into, for example, billing complaints,
frustration with the use of certain services, or service interruptions that take too long
to resolve. The outputs of a text mining application may be used directly for business
decision making or as additional inputs to predictive models, such as a customer churn
model.
Data
Data is essential for data mining, without which no mining can be carried out. Usually,
data is available in the company operational database. For example, many ANL488 project
students get data from their employers. Some companies invest in data warehouses which
provide a consistent and less complicated view of data from various data sources.
It is also possible to obtain data from research agencies that already hold the data. One
typical example is the UCI Machine Learning Data Repository (Dua & Graff, 2019). Many
datasets are freely available for research purposes.
Another possible source of data is government organisations. For example, the Singapore
Department of Statistics’ website (http://www.singstat.gov.sg/) contains data that may
be used in projects. In some cases, getting data from organisations may not be possible
because the data is classified. One example is data from the Ministry of Defence.
When existing data cannot be obtained readily, one may need to collect or generate the
data required for mining. One typical data collection option is using a survey. Conducting
a survey is time-consuming; it involves questionnaire design, sampling of respondents
and so on. It is usually not recommended if the project time frame is short. In addition, the
quality of survey data depends on the proper use of measurement instruments. A survey
is more suitable for collecting data on respondents’ perceptions, which typically are not
available in a company’s database. If the data mining project requires behavioural data, it
is more accurate to retrieve the data from a company database that stores employee or
customer activity information, if available. This is because activity information records
actual behaviour, while survey responses can be inaccurate for many reasons:
respondents may not remember exactly, or may simply be unwilling to give truthful
answers. When internal behavioural data is not available, especially when a company
wants to do competitor analysis, a common approach is to crawl publicly available data
using data/web crawling tools or by writing code.
Methods
For a data mining project to be successful, it is necessary to know the appropriate method
to solve a specific problem. If the choice of method is not obvious, it is important to search
the literature extensively to find the best possible method to solve a particular problem at
hand.
Sometimes, finding the right method may involve an extensive empirical
evaluation of different data mining applications. In this case, one may need to install
different software packages and conduct a comparative study of these packages in a
controlled environment.
In some cases, there is indeed no method available to solve a specific problem at hand. In
such cases, it will be necessary to reassess the empirical needs and propose an alternative
solution to the problem.
In some projects, there may be a need to learn from other researchers who have completed
projects in a similar domain. For example, an insurance company planning to analyse its
policy holders’ profiles may want to know the factors considered by others working in
insurance analytics.
Commercial data mining software companies often publish white papers describing
innovative solutions to difficult problems. These companies also run education forums to
update their users and prospective customers on the latest trends and applications of their
products.
Many research conferences include an application track. Under this track, researchers
report innovative applications of data mining in difficult or new problem domains.
1.2.1 Literature
Among the different types of information sources, the literature is the most important
source of information. There are two types of literature: (i) primary literature and (ii)
secondary literature.
Primary literature includes technical reports, theses and dissertations generated from the
research conducted by universities and research organisations. It also includes research
conference proceedings, journal articles and patents. These are formal publications that
have undergone rigorous review by experts in the field. Primary literature provides
information on the latest developments in different subjects and is very useful for
reviewing the state of the art.
Conferences
Journals
• JMLR (Journal of Machine Learning Research)
• MLJ (Machine Learning Journal)
• TKDE (IEEE Transactions on Knowledge and Data Engineering)
• DMKD (Data Mining and Knowledge Discovery, Springer)
• JAIR (Journal of AI Research)
• PRJ (Pattern Recognition Journal)
• Many others
http://en.wikipedia.org/wiki/Data_mining
One potential problem with getting information from people is that the information may
be inaccurate or even erroneous. Human error and subjectivity are always an issue, and
it is sometimes necessary to check the validity or accuracy of such information.
When beginning a new research topic, it is usually useful to read bibliographies, which
list books and papers related to a given topic. Another important source of information is
a survey paper/review paper, which summarises several research papers within a specific
research area or on a specific topic. This is especially useful when one is new to the field
and wants to have a good grasp of the latest developments within a short time.
Online search engines such as Google, Yahoo, and Bing are some of the more common
tools that can be used for information searching. When searching for academic papers, one
can use Google Scholar, a public search engine designed to find academic publications.
Another example is CiteSeerX, a digital library from which the scientific community can
download academic papers.
When searching online for general information concerning an unfamiliar topic, one can
start with a general internet search engine such as Google or Yahoo. At this stage, the
search term would be general. For example, the search term could be ‘Cluster Analysis’.
After one has a better idea of the common terms used in the topic of interest, one can
use more specific terms. Examples of such terms are ‘cluster validity scores’, ‘cluster
validation’, ‘cluster validity index’, etc.
When one has a specific question in mind, it is useful to find websites that host FAQs
(Frequently Asked Questions) and forums. For example, the data mining software Weka
has a technical team that supports an online forum to answer users’ questions about
the software. KDNuggets, a website hosting resources for data mining
(https://www.kdnuggets.com/faq/index.html), has forums for beginners and experts, as well
as FAQs for data mining users.
Searching online may sometimes lead to research articles that cannot be accessed
readily. This may be because one is not a journal subscriber or not a member of
a professional organisation. In some cases, one can buy an article online if it is
essential to one’s research. Interested readers may also contact the author to request a
complimentary copy of an article. One can also choose to attend a conference. This will
allow one to attend presentations, network with people, and get a copy of the proceedings,
which contain all the papers published in that conference.
While online search for information is easy, there is a caveat against the use of such
information. One needs to exercise judgement as to the value of information found on the
Internet. This is because most information on the Internet is not refereed.
Apart from searching online, one can also search for information offline. There are many
possible reasons to go offline. One main reason is that old articles may no longer be
available online. In fact, many research articles published before the 1990s are not
available online. These articles can, however, be found in libraries’ archives.
Libraries are a common place to search for information offline. Many libraries offer
catalogue search engines for users to find relevant books and articles.
First of all, the importance of having good report writing skills cannot be emphasised
enough. Many students fail to recognise the importance of report writing. They tend to
spend more time on the project and devote little time to reporting on the project. As a
result, the report is poorly written although much effort has been put into the project.
It is important to note that report writing is a thinking process. The process starts
with ideas being jotted down as short notes; the ideas are then organised
into a draft. Thinking and writing go hand in hand as the ideas get written up. Parts of the
writing may be deleted after some deliberation, and existing paragraphs are rephrased
to improve the clarity and completeness of the ideas. In fact, many good reports are not
written in one go. They usually undergo numerous rounds of revision before taking final
shape.
It is commonly observed that many students’ reports come with rather conspicuous
grammatical and typographical errors. This is surprising because most students actually
possess a reasonable command of English. So, a possible reason could be that the reports
have not been proof-read. And what could be the reason for not proof-reading? Well, it
could be that the report was written at the last minute!
In fact, proofreading is not a difficult task. Typographical errors are usually quite easy
to fix using the spell-checking feature that is common in word processing software.
Furthermore, proofreading can also be done by enlisting the help of someone with a
fresh pair of eyes to look for errors.
A basic rule of writing is to use appropriate but simple words. Avoid using big words.
Keep sentences short. Avoid mingling different points into a single long paragraph.
Instead, express one point in each paragraph.
If there is a big idea, break it down into smaller points, and link up these points to ensure
a coherent presentation of the big idea.
Proper reference citation shows that the report is built upon the knowledge contributed
by others who are relevant to the field. Citing the works of others also helps
readers to find out more information if they need to. Reference citation also gives due
acknowledgement to the authors for their contributions. Please refer to section 1.5 for
details on how to credit sources using the American Psychological Association (APA)
style.
A business analytics project normally involves using a dataset. It is necessary to cite the
source of this dataset. At the same time, acknowledgement should be given to those who
provided help in the project. The person could be a friend who has helped to proof-read the
report or someone who has given information that led to the successful implementation
of the project. It is also necessary to acknowledge the funding organisation, if any.
Tables and figures are important elements in reports. They can be used to communicate
ideas or results in an intuitive manner. Poorly constructed tables or unclear figures reflect
poorly upon the report, and should be avoided at all costs.
One way to make tables and figures more intuitive is to provide captions that are
descriptive and meaningful. For example, an informative caption such as ‘Table 1:
Customer Data; Group 1 has the highest propensity to return’, is definitely better than a
meaningless caption such as ‘Table 1: data’.
Sometimes, tables and figures in reports could be too small in size. This could happen
when authors reduce the size of the figures and tables to meet page-limit requirement for
publications. Small tables and figures are unreadable – it is suggested that the report be
reorganised to make room for important tables and figures.
It is also important to ensure that figures are of consistent size and the details are clearly
visible when printed on a black and white printer. For tabulated data, it is usually
necessary to highlight parts of the table so as to catch the readers’ attention. Usually,
textual effects such as boldface and underline can be used to enhance the visual impact of
tabulations.
Ensure consistency
Consistency is a simple and general rule, yet it is very important to every aspect of report
writing.
For textual presentation, a consistent text font, style and size must be used for the
respective parts of a report. For example, if a font size of 14 is used for a section title,
make sure that this is consistently applied to all the section titles in the report.
Consistency in textual presentations can also be achieved using word processing software.
For example, Microsoft Word has style and formatting features that one can use to achieve
the consistency required.
In terms of language, consistency means that Standard English should be used; and there
is definitely no place for ‘Singlish’ and slang. For terms that may have more than one way
of writing them, it is still important to stay consistent. For example, the word ‘dataset’ can
also be written as ‘data set’. It is important to ensure that this word is written in the same
way throughout the entire report.
In terms of citations and reference list, it is also important to ensure a consistent style is
used and in-text citations match references at the end of a report. While revising a report,
one may decide to take out certain paragraphs and add in new ones. This could leave old
references that are no longer cited in the reference list. On the other hand, there might
be new citations which are not in the reference list. Hence, checking for consistency is
important in report writing.
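The consistency check between in-text citations and the reference list can be partly automated. The following is only a minimal sketch: the draft text, the set of reference keys and the regular expression are illustrative assumptions, and the pattern handles only simple author-date citations, so it may miss or wrongly match citations in a real report.

```python
import re

# Hypothetical draft text containing narrative and parenthetical citations.
draft_text = (
    "Miller (2015) defined text mining. Sentiment mining is useful "
    "(Leong et al., 2012). See also Olson and Shi (2007)."
)

# Hypothetical reference list, reduced to (first author surname, year) keys.
reference_keys = {("Leong", 2012), ("Miller", 2015), ("Chan", 2010)}

# Assumed pattern: a capitalised surname, optionally followed by "et al." or
# "and Surname", then a four-digit year with or without brackets.
CITATION = re.compile(r"([A-Z][a-z]+)(?: et al\.| and [A-Z][a-z]+)?,? \(?(\d{4})\)?")

cited_keys = {(name, int(year)) for name, year in CITATION.findall(draft_text)}

# Citations that appear in the text but not in the reference list.
missing_from_reference_list = cited_keys - reference_keys
# References that are listed but no longer cited in the text.
uncited_references = reference_keys - cited_keys

print("Missing from reference list:", missing_from_reference_list)
print("Listed but not cited:", uncited_references)
```

A check like this only flags candidates; each flagged item still needs to be verified by reading the report.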
In short, good report writing involves a lot of hard work and common sense. Good report
writers are meticulous. They pay attention to every word, sentence, paragraph, section
and chapter, and use words with care. They think critically and write clearly. They do not
write their reports just once; they review and correct them repeatedly. They
normally start writing at an early stage of their projects, and they
allocate sufficient time at the end of the project to ensure that the report is vetted
and thoroughly revised before final submission.
1.5.1 Introduction
In report writing, there is a need to credit the sources from which ideas, concepts,
statistics and findings are drawn to support the report being written. Citations are used
extensively throughout a report, especially in the Introduction and Literature Review
sections. In the Introduction section, general concepts and ideas are introduced, and it is
common to cite statistics to support these concepts and ideas. In the Literature Review
section, it is normal to describe the details, findings and conclusions of several journal
articles that relate to the topic chosen for the report.
In this section, the guidelines for crediting sources according to the Publication Manual
of the American Psychological Association (APA) (7th edition) will be used and it will be
referred to as the “APA style”.
There are essentially two parts to crediting sources in a report. They are placing a brief
reference citation in the text of the report and giving complete citation details in the
reference list. Section 1.5.2 will address how to cite references in text while section 1.5.3
will cover the reference list.
In this light, Miller (2015) defined text mining as “the automated or partially automated
processing of text. It involves imposing structure upon text and extracting relevant
information from text”. In this study, Leong et al. (2012) explored the application of
analysing teaching feedback from students using sentiment mining.
Sentiment mining is a useful tool to analyse teaching feedback from students (Leong et al.,
2012).
Both ways can be used in the same report. Table 1.1 below shows the basic citation formats
using the APA style (American Psychological Association, 2020). The citation format
depends on the type of citation and whether the citation is the first or subsequent citation.
Type of citation | First citation in text | Subsequent citations in text
One work by two authors | Olson and Shi (2007) | Olson and Shi (2007)
One work by three or more authors | Chan et al. (2010) | Chan et al. (2010)
Each item in the reference list is made up of the following basic elements: author, year
of publication, title and publishing data (American Psychological Association, 2020). The
general APA style for items in a reference list is given in Table 1.2.
Books with subsequent editions
Format: Author, A. A., & Author, B. B. (YYYY). Title of book (Xth ed.). Publisher.
Example: Leedy, P. D., & Ormrod, J. E. (2010). Practical research: Planning and
design (9th ed.). Pearson Education Inc.
Journal articles
Format: Author, A. A., & Author, B. B. (YYYY). Title of article. Title of
Journal, vol (issue, when appropriate), page-number/s.
Course reports
Format: Author, A. A., & Author, B. B. (YYYY). Title of report (Report No.).
Publisher.
Online documents from websites
Format: Author, A. A., & Author, B. B. (year if available, if not, n.d.). Title
of document. Retrieved from statement with retrieval date.
Example: Victoria University. (2019). APA referencing: Sample APA reference
list. Retrieved September 12, 2019, from
http://libraryguides.vu.edu.au/apa-referencing/sample-apa-reference-list
doi: 10.1177/0010880405275966
The items in the reference list are to be sorted according to the following rules:
• Alphabetical order by the surname of the first author followed by initials of the
author’s given name
• Several works by same first author
◦ One-author entries by same author arranged by year of publication with the
earliest first
Cokins, G. (2007)…
Cokins, G. (2009)…
◦ One-author entries precede multiple-author entries beginning with same
surname even if multiple-author entries are published earlier
Olson, D. (2009)…
Olson, D., & Shi, Y. (2007)…
◦ Multiple-author entries with the same first author are arranged alphabetically
by the surname of the second author or, if the second author is also the same,
by the surname of the first author that differs
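The sorting rules above can be sketched in a few lines of code. This is only an illustrative approximation, not the full APA ordering procedure: each entry is assumed to be stored as a list of ‘Surname, Initials’ strings plus a year, and the sample entries are the ones used in the examples above.

```python
# Each entry: (list of "Surname, Initials" author strings, year of publication).
# This data layout is an assumption made for illustration.
references = [
    (["Olson, D.", "Shi, Y."], 2007),
    (["Cokins, G."], 2009),
    (["Olson, D."], 2009),
    (["Cokins, G."], 2007),
]

def apa_sort_key(entry):
    """Sort by the author list first, then by year.

    Lists compare element by element, so entries are ordered by first author,
    then second author, and so on. A one-author list is a prefix of a
    multiple-author list that starts with the same author, so it sorts first.
    """
    authors, year = entry
    return ([author.lower() for author in authors], year)

sorted_references = sorted(references, key=apa_sort_key)
for authors, year in sorted_references:
    print(" & ".join(authors), f"({year})")
```

Note that the one-author Olson (2009) entry is placed before the Olson and Shi (2007) entry even though it was published later, as the rule requires.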
In fact, there is more unstructured data than one might imagine. The ‘80 Percent Rule’
estimates that unstructured information makes up as much as 80% of the information
within an organisation (Shilakes, 1998). Although this rule of thumb is not based on any
primary or quantitative research, it is generally acknowledged that there is more
unstructured data than structured data, and business organisations are realising the
importance of unstructured data within the organisation or over the Internet. The focus of
this course is unstructured text data.
• Emails
• Reports
• Call centre logs
• Product online reviews
• Contracts
• News articles
• Social media posts
• …
The value of text data comes from the rich information contained in various text documents
that supplements the structured data. For business organisations, for example, structured
data only records customers’ purchase behaviour, such as who purchased which product
at what time. However, text data such as online reviews and social media posts provides
opportunities for business organisations to understand their customers’ evaluations and
expectations of their products. Because the amount of text data worth exploring is huge,
processing it manually is very costly in time and manpower. Therefore, automated or
partially automated technologies to process text data are preferred, and text mining is
well recognised as an effective and efficient tool to extract insights from text data.
The basic task of text mining is to convert unstructured text data to structured data. For
example, the text mining objective could be to classify text documents so that each
document is assigned a category that represents its features.
The unstructured text data can then be presented as structured data with a column of
text document ids and a column of category names. This is why the overview section of
this study unit describes text mining as a broad range of technologies that can process
unstructured text data and convert it to structured data. The
common theme behind these technologies is converting text into numbers so that different
algorithms can be applied. How unstructured text data can be converted to numbers will
be covered in study unit 4.
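To make the idea of a document id column and a category column concrete, the following is a minimal sketch. The log entries, category names and keyword lists are invented for illustration, and simple keyword matching stands in for a real classification technique.

```python
# Hypothetical call centre log entries; the ids and text are invented.
documents = {
    "D1": "Customer was charged twice on the monthly bill",
    "D2": "Broadband service interrupted for three days",
    "D3": "Caller asked about invoice amount and late payment fee",
}

# Assumed keyword lists per category; a real project would derive these from data.
CATEGORY_KEYWORDS = {
    "billing": {"bill", "charged", "invoice", "payment", "fee"},
    "service interruption": {"interrupted", "outage", "down"},
}

def categorise(text):
    """Return the category whose keywords overlap most with the text."""
    words = set(text.lower().split())
    scores = {cat: len(words & keywords) for cat, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

# Structured output: one row per document, with its id and assigned category.
structured_rows = [(doc_id, categorise(text)) for doc_id, text in documents.items()]
for row in structured_rows:
    print(row)
```

Each free-text entry ends up as one row of a two-column table, which can then be joined with other structured data.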
Text that is written, captured or recorded in files and databases may not be written in a
standard way, as this depends on the purpose for which the text was written, the recording
medium and the person doing the recording. Figure 1.1 shows a sample of call
centre logs. The logs are written to identify the issues raised in the call without details of
the caller, as seen in the text coloured in green. The text is a summary of the call written
by either the call centre representative who answered the call or another call centre staff
member reading from the verbatim logs or a recording of the conversation.
There can also be different formats between different types of text. Analysts’ reports,
news articles and companies’ press releases are usually written with the highest standard
of adherence to grammatical rules and clarity so that they are comprehensible to the
respective audiences. Text in airline incident reports, patient records and police case files
may be written in the respective field’s jargon with less adherence to grammatical
rules. The other extreme is text in the various instant messaging systems that operate
through phones, the internet and within computer networks that tend to be short, not in
proper sentences and populated with emoticons that denote emotions such as sadness or
happiness (e.g., ;-( and :-)).
The challenge that this variety of formats poses to text mining is that the text mining
process must be robust enough to support all of them.
The use of abbreviations and short forms is closely tied to the formats discussed in the
previous challenge. More formal texts contain fewer of them. Their use increases
in technical texts and is rampant in text used for instant messaging. In Figure 1.2,
the highlighted term ‘rep’ could stand for ‘repetitive’ or ‘representative’. Only after
reading the whole text can it be concluded that ‘rep’ stands for ‘representative’.
Spelling Errors
Spelling errors are another challenge when it comes to text. Not all spelling errors are easy
to resolve. Figure 1.3 highlights the word ‘to’ as being misspelled. However, it is difficult
to detect the misspelling without reading the whole sentence for contextual clues.
Synonyms
Another challenge is dealing with synonyms in text. In Figure 1.4, ‘plan’ is
synonymous with ‘phone plan’, ‘contract’, ‘phone contract’, etc. These terms need to be
mapped to a single term to reduce the number of terms that one has to deal with in a
collection of text data.
Figure 1.4 Synonyms
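Mapping synonyms to a single term can be sketched as a simple lookup. The synonym map below is an assumption based on the example terms above, and the function name is hypothetical; dedicated text mining tools typically provide their own synonym lists.

```python
import re

# Assumed synonym map: every variant is mapped to one canonical term.
SYNONYMS = {
    "phone plan": "plan",
    "phone contract": "plan",
    "contract": "plan",
}

def normalise_synonyms(text):
    """Replace each known variant with its canonical term.

    Longer variants are replaced first so that 'phone contract' is handled
    as a whole before the shorter variant 'contract' is considered.
    """
    result = text.lower()
    for variant in sorted(SYNONYMS, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        result = re.sub(pattern, SYNONYMS[variant], result)
    return result

print(normalise_synonyms("Customer wants to change phone contract"))
print(normalise_synonyms("Asked about contract and phone plan"))
```

After normalisation, all the variants contribute to the count of a single term instead of being spread over several.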
Redundant Terms
Some terms are redundant. Examples are those highlighted in green in Figure 1.5, such as
‘the’ and ‘is’. Removing these terms does not involve any loss of information.
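Removing redundant terms can be illustrated with a short sketch. The stop list here is a small assumed example; practical stop lists are much longer and are usually supplied by the text mining tool or library in use.

```python
# Assumed, deliberately short stop list for illustration.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and"}

def remove_stop_words(text):
    """Split the text into lower-case words and drop those on the stop list."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The plan is too expensive"))
```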
Because of the various challenges in dealing with text, preprocessing the text data is an
important and common step before applying any text mining technique.
Preprocessing text is just like cleaning structured data in data mining.
The purpose of preprocessing text is to produce clean text data so that further text mining
steps can generate more reliable results. The details of preprocessing text will be covered
in study unit 2.
• Parsing is the process of extracting, cleaning and creating a dictionary of words from
the documents using NLP.
A typical text mining project in Business Analytics includes data collection, text parsing, text
filtering, transformation/vectorisation, and functioning/application. Figure 1.6 shows the text
mining process flow adapted from Chakraborty et al. (2014).
• Data collection: The first step of any text mining project is to collect the text
data required for the analysis. The text data can come from internal or external
documents, web pages, user/customer comments from social media or be collected
by survey or other sources containing text data that is useful to the project.
• Text parsing: The next step is to extract, clean and create a dictionary of words
from the documents using NLP. This step normally involves identifying sentences,
splitting sentences into words (tokens) based on some rules, determining parts of
speech and stemming words. This step also includes removing general stop words
and spell checking. Most of the challenges in dealing with text that are discussed
above will be addressed in this step. After text parsing, text data is converted to a
collection of words (terms) for further processing. The details of text parsing will
be covered in the text preprocessing section of study unit 2.
• Text filtering: Among all the terms generated in the text parsing step, you will
likely have many terms that are irrelevant or not important to either differentiating
documents from each other or summarising the documents. You can
manually browse through the terms and eliminate irrelevant ones by creating
customised start/stop lists; creating these lists requires domain knowledge.
Alternatively, you can remove unimportant terms by setting a
threshold to eliminate terms with extremely high frequency.
• Transformation/Vectorisation: The next step is text transformation. This step deals
with the numerical representation of the text using linear algebra-based methods.
The terms remaining after text filtering are transformed to vector format.
The commonly used format is a document-term matrix or a term-document matrix,
which is a mathematical matrix that
describes the frequency of terms that occur in a collection of documents. In a
document-term matrix, each row corresponds to a document in the corpus and
each column corresponds to a term, while in a term-document matrix, each column
corresponds to a document in the corpus and each row corresponds to a term.
The two matrices hold the same information and are transposes of each
other. Table 1.3 shows an example of a document-term
matrix.
Document 1 0 0 3 1 0 2 0
Document 2 0 2 2 0 0 1 0
Document 3 1 0 0 2 0 3 1
Document N 0 1 1 2 0 0 2
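The parsing, filtering and transformation steps described above can be sketched end to end with the Python standard library. This is a simplified illustration rather than the process used by any particular text mining tool: the mini-corpus is invented, tokenisation is a plain regular expression, and the filtering threshold is an assumed rule that drops terms appearing in more than 90% of the documents.

```python
import re
from collections import Counter

# Invented mini-corpus for illustration only.
corpus = [
    "The bill is wrong and the bill is late",
    "The service is slow",
    "The bill shows a late fee",
]

def tokenise(text):
    """Text parsing: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokenised = [tokenise(doc) for doc in corpus]

# Text filtering: count in how many documents each term occurs, then drop
# terms that occur in more than MAX_DOC_RATIO of the documents (assumed rule).
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
MAX_DOC_RATIO = 0.9
vocabulary = sorted(
    term for term, df in doc_freq.items() if df / len(corpus) <= MAX_DOC_RATIO
)

# Transformation: document-term matrix with one row per document and one
# column per term; each cell is the number of times the term occurs.
counts = [Counter(tokens) for tokens in tokenised]
dtm = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

# The term-document matrix is the transpose of the document-term matrix.
tdm = [list(column) for column in zip(*dtm)]

print(vocabulary)
for row in dtm:
    print(row)
```

In this sketch only the term ‘the’ is filtered out, because it is the only term that appears in every document.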
The next section introduces the seven text mining practice areas and popular applications,
and the remaining study units of this course cover some of the important and popular
text mining applications in the business world in detail.
• Search and information retrieval (IR): Search and information retrieval includes
the tasks of indexing, searching, and retrieving text from a large collection of text
documents stored in a database or on the Internet using keyword queries. Search
and information retrieval is the core task of a search engine.
application of concept extraction. Table 1.4 lists some of the popular text mining topics/
applications and corresponding practice areas (Miner et al., 2012).
There have been various successful applications of text mining in the literature. These
serve to motivate data/text miners in applying text mining to new applications.
In banks and the telecommunications industry, text mining has been used on email, call
centre logs and surveys to uncover patterns in customer preference and complaints.
Fault Detection
In the airline industry, text mining has been used to detect trends in incident reports.
For example, particular airports may be experiencing frequent occurrences of certain
mechanical problems caused by defective procedures that can be fixed.
Police/Security
Text mining can be used on police reports to detect trends that relate to how crimes were
committed. This information can be used to solve future crimes as well as to schedule
police patrols.
Web Mining
Text mining can be used to extract useful information from web sites, discussion forums
and blogs about various topics of interest and provide insights into current trends and
themes.
Medical/Health
Mining the text of patient records can uncover symptoms associated with certain drugs.
Investment Research
Text mining can be used on daily analyst reports, news articles and company press releases
to uncover trends that affect stock price movements for certain companies or industries
over time.
Phase Descriptions
Business Understanding
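There are typically three scenarios when embarking on text mining. They are:
1. Embarking on a pure text mining project;
2. Considering the use of text as additional inputs in a completed data mining project;
3. Embarking on a data mining project with the intention to use both structured data
and text.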
Irrespective of the scenario, the availability of text data must be determined in this phase.
If it is not available, the text mining project cannot proceed, especially in scenarios
1 and 2.
Data Understanding
This phase includes identifying the text data sources, collecting the text data and
assessing its quantity and quality. For text mining, an expert with a good
understanding of the text is needed, especially if special terms and vocabulary are used.
Data Preparation
For text mining, data preparation refers to text data pre-processing which normally
includes identifying sentences, splitting sentences into words (tokens) based on some
rules, determining parts of speech and stemming words. This step also includes removing
stop words and spell-checking.
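Some of these pre-processing steps can be sketched in Python using only the standard library. The sentence rule, the stop list and the suffix-stripping `naive_stem` function below are simplifying assumptions for illustration and are far cruder than a real stemmer. Part-of-speech tagging and spell-checking are left out because they normally rely on a dedicated language library.

```python
import re

# Illustrative stop list only; a real project would use a fuller, customised list.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "of", "to", "in"}


def split_sentences(text):
    """Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenise(sentence):
    """Split a sentence into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def naive_stem(token):
    """Strip one common suffix, keeping at least three letters (not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    return [
        [naive_stem(token) for token in tokenise(sentence) if token not in STOP_WORDS]
        for sentence in split_sentences(text)
    ]


sample = "The customers were complaining. Billing errors delayed the refunds!"
print(preprocess(sample))
# [['customer', 'complain'], ['bill', 'error', 'delay', 'refund']]
```

The output of this step is what the text filtering and transformation steps take as input.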
Modelling
Similar to data mining, one can build text mining models with different algorithms.
For example, to build a document clustering model, one can select different clustering
techniques to build the model. Outputs of text mining models can be used to score new
text or as inputs to improve the accuracy of the predictive models.
If the model accuracy is not acceptable, then one can return to the text mining model
to modify the parameters in the selected algorithm or try a different algorithm. The text
mining results can then be fed as inputs to the predictive model to see if the accuracy is
improved.
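As an illustration of building a document clustering model, the sketch below applies a basic k-means procedure to the rows of a small document-term matrix. K-means is only one of the clustering techniques that could be selected. The sample matrix, the choice of the first k rows as starting centroids and the iteration limit are assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Column-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, iterations=20):
    """Cluster row vectors; the first k rows seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c])) for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids


# Hypothetical document-term matrix: rows are documents, columns are terms.
dtm_rows = [
    [3, 2, 0, 0],
    [0, 0, 2, 3],
    [2, 3, 0, 1],
    [0, 1, 3, 2],
]
labels, centroids = k_means(dtm_rows, k=2)
print(labels)  # [0, 1, 0, 1]
```

The cluster labels could then be added as an extra input column to a predictive model built on structured data, and the number of clusters or the algorithm changed if the accuracy does not improve.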
Evaluation
There is no formal, objective way to evaluate text mining models. This is unlike
models built on structured data, where the actual target and predicted target can
easily be compared using accuracy measures. For text mining models, mimicking
what is done with structured data is fairly time-consuming, since it entails reading
a sample of text to verify that the correct results are generated.
However, evaluation is straightforward when dealing with scenarios 2 and 3 where there
is a predictive model involved, with the actual and predicted targets being present in the
data. The usual accuracy and other statistics can be calculated.
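When both the actual and predicted targets are available, the usual statistics can be computed directly, as in the following Python sketch. The target labels ("complaint" and "praise") and the sample values are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(actual, predicted):
    """Share of records whose predicted target equals the actual target."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical targets for five scored records.
actual = ["complaint", "complaint", "praise", "praise", "complaint"]
predicted = ["complaint", "praise", "praise", "praise", "complaint"]

counts = confusion_counts(actual, predicted, positive="complaint")
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(accuracy(actual, predicted))  # 0.8
print(counts)                       # {'tp': 2, 'fp': 0, 'tn': 2, 'fn': 1}
```

Comparing these statistics before and after adding text-derived inputs shows whether the text has improved the predictive model.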
Deployment
Once the text mining models have been finalised, they can be deployed. Depending on
the business objectives, the deployment phase can be as simple as writing a report or as
complex as implementing the models in a business intelligence system. Implementing the
models means scoring the new text data. For example, to implement a document clustering
model, a new text document is assigned to one of the clusters by applying the model
developed in the modelling phase. The purpose of scoring new data is to estimate the
response, class, or value that is valuable to business decision making. There are three
ways of scoring a model:
• Interactive scoring: Model scoring is done in the data mining software that was
used to develop the model initially.
• Batch scoring: Model scoring is carried out on a regular basis with scores stored in
a data warehouse.
• Real-time scoring: Model scoring is carried out on an ad hoc basis.
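Scoring a new document with a clustering model can be sketched as follows. The vocabulary and the cluster centroids are hypothetical stand-ins for the output of the modelling phase, and cosine similarity is assumed as the measure for choosing the nearest cluster.

```python
import math


def vectorise(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score_document(text, vocabulary, centroids):
    """Return the index of the cluster whose centroid is most similar to the text."""
    vector = vectorise(text, vocabulary)
    return max(
        range(len(centroids)),
        key=lambda c: cosine_similarity(vector, centroids[c]),
    )


# Hypothetical model artefacts saved from the modelling phase.
model_vocabulary = ["billing", "refund", "network", "signal"]
model_centroids = [
    [2.5, 2.0, 0.0, 0.5],  # cluster 0: billing-related documents
    [0.0, 0.5, 2.5, 2.5],  # cluster 1: network-related documents
]

# Real-time style: score one document on request.
print(score_document("network signal dropped again", model_vocabulary, model_centroids))  # 1

# Batch style: score a set of documents in one run; the scores could then be stored.
new_documents = ["refund for wrong billing", "weak signal on the network"]
batch_scores = [
    score_document(doc, model_vocabulary, model_centroids) for doc in new_documents
]
print(batch_scores)  # [0, 1]
```

The same scoring function serves both batch and real-time use. What differs is when it is called and where the scores are kept.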
Summary
Chapter 1 of this study unit gives an overview of typical information required for data
mining research, and where and how such information can be obtained. It also provides
guidelines on good report writing. These guidelines are not exhaustive, and they are
provided with the aim of addressing common weaknesses found in students’ reports.
Chapter 2 of this study unit covers the background needed for text mining, such as
definitions and terminology, challenges in dealing with text, the text mining process,
and text mining practice areas and applications. It also introduces how to conduct a
text mining project using the CRISP-DM framework.
Formative Assessment
2. Visit the UCI Machine Learning Data Repository website. Download the most
popular data set and briefly describe the content of the dataset.
3. Read the following paragraph. Apply APA style and construct the citations (source
1 and source 2) in boldface.
Source 2:
Author(s):
Krech Thomas, H.
Year: 2004
9. Suggest two other applications of text mining besides those discussed in section 2.3.2
of this study unit.
10. Compare the application of the CRISP-DM framework in data mining and text
mining projects.
Formative Assessment
1. Visit the KDNuggets website and list three data repository websites.
Suggested Answer: Here are three examples:
2. Visit the UCI Machine Learning Data Repository website. Download the most
popular data set and briefly describe the content of the dataset.
Suggested Answer: As of 25 June 2020, the most popular dataset is Iris. This dataset
is used for pattern recognition, specifically for building classification models, and
has 150 instances. Students can discuss the dataset further based on the
information on the page https://archive.ics.uci.edu/ml/datasets/Iris
3. Read the following paragraph. Apply APA style and construct the citations (source
1 and source 2) in boldface.
Authors:
Derwing, T. M., Rossiter, M. J., and Munro, M. J.
Year: 2002
Source 2:
Author(s):
Krech Thomas, H.
Year: 2004
Suggested Answer: Derwing, T. M., Rossiter, M. J., & Munro, M. J. (2002). Teaching
native speakers to listen to foreign-accented speech. Journal of Multilingual and
Multicultural Development, 23, 245-259.
Suggested Answer:
b. Document-Term Matrix
Correct.
c. Stemming
Incorrect.
d. Stop words
Incorrect.
9. Suggest two other applications of text mining besides those discussed in section 2.3.2
of this study unit.
Suggested Answer:
1. Education: Text mining can be used to analyse student feedback and
extract useful information to improve teaching and management.
2. Human resource: Text mining can be used to analyse employee feedback
and extract useful information to improve internal process and increase
employee satisfaction.
10. Compare the application of the CRISP-DM framework in data mining and text
mining projects.
Suggested Answer: The differences in applying the CRISP-DM framework to data
mining and text mining projects are reflected mainly in the following phases.
Business understanding: There are typically three scenarios when embarking on
text mining. They are:
1. Embarking on a pure text mining project;
2. Considering the use of text as additional inputs in a completed data
mining project;
3. Embarking on a data mining project with the intention to use both
structured data and text.
Data understanding: The main difference is that text mining requires more domain
knowledge to understand the text, especially if special terms and vocabulary are
used.
Data preparation: The typical data preparation methods are different for data
mining and text mining. For text mining, data preparation refers to text data pre-
processing which normally includes identifying sentences, splitting sentences into
words (tokens) based on some rules, determining parts of speech and stemming
words. This step also includes removing stop words and spell-checking.
Evaluation: There is no formal, objective way to evaluate text mining models.
This is unlike models built on structured data, where the actual target and
predicted target can easily be compared using accuracy measures.
References
Chakraborty, G., Pagolu, M., & Garla, S. (2014). Text mining and analysis: Practical methods,
Dua, D., & Graff, C. (2019). UCI machine learning repository. Retrieved from https://archive.ics.uci.edu/ml/index.php
Lee, Y., Chan, S., Leong, C., Tan, G., & Tan, S. (2014). ANL311e Selected topics in business
Leong, C. K., Lee, Y. H., & Mak, W. K. (2012). Mining sentiments in SMS texts for
Miller, T. W. (2015). Web and network data science: Modeling techniques in predictive
Miner, G., Elder IV, J., Fast, A., Hill, T., Nisbet, R., & Delen, D. (2012). Practical text mining
and statistical analysis for non-structured text data applications. Academic Press.