
2nd Part 2019 Mohammad Iftakhar Alam

ASSESSMENT BRIEF Lecturer of School IT


Unit 21: Data Mining

LO3 Illustrate how a data mining algorithm performs text mining to identify relationships
within text.

P5 Discuss what is meant by text mining and explain with appropriate examples.
P6 Analyse how data mining algorithms, techniques, methods and approaches work.
M4 Show how text mining works using a tool or programming language.
D3 Develop a complete text mining application for a real world issue.

A. Text mining algorithms

1. What is text mining?

2. Explain how text mining works.

3. Investigate text mining techniques, methods and approaches.

4. Demonstrate text mining with at least two different real world examples.

5. Identify the most common text mining algorithms used in industry.

6. Explain how these text mining algorithms work.

7. Demonstrate these algorithms using an appropriate programming language or text mining tool.

LO4 Evaluate a range of graph data mining techniques that recognise patterns and
relationships in graph based technologies.

P7 Discuss what is meant by graph data mining and explain with appropriate examples.
P8 Assess how graph mining algorithms work and identify appropriate programming
languages and tools used by industry for graph data mining.
M5 Demonstrate how graph data mining works using a tool or programming language.
D4 Develop a complete graph data mining application for a real world scenario.

B. Graph mining

1. What is graph mining?


2. Explain how graph mining works.

3. Investigate graph mining techniques, methods and approaches.

4. Demonstrate graph mining with at least two different real world examples.

5. Identify the most common graph mining algorithms used in industry.

6. Explain how these graph mining algorithms work.

7. Demonstrate these algorithms using an appropriate programming language or graph mining tool.


LO3 Illustrate how a data mining algorithm performs text mining to identify relationships
within text.

P5. Discuss what is meant by text mining and explain with appropriate examples.

1. What is text mining?


Text mining is also known as text data mining. Its purpose is to process unstructured information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to various data mining algorithms. Information can be extracted to derive summaries of the documents, so you can analyze words and clusters of words used in the documents. In the most general terms, text mining "turns text into numbers", which can then be used in predictive data mining projects or with unsupervised learning methods.

"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting and relating information from different written resources, to reveal otherwise "hidden" meanings."

The Concept:
Text mining is a burgeoning new field that tries to extract meaningful information from natural language text. It may be characterized as the process of analyzing text to extract information that is useful for a specific purpose. Compared with the kind of data stored in databases, text is unstructured, ambiguous, and difficult to process. Nevertheless, in modern culture, text is the most common medium for the formal exchange of information. Text mining usually deals with texts whose function is the communication of factual information or opinions, and the motivation for trying to extract information from such text automatically is compelling, even if success is only partial. Text mining was first performed during the 1980s using manual techniques. It quickly became apparent that these manual techniques were labor intensive and therefore expensive, and that they took too much time to process the ever-growing quantity of information. Over time, programs were developed to process the information automatically, and in the last few years there has been great progress.
The study of text mining concerns the development of various mathematical, statistical, linguistic and pattern-recognition techniques which allow automatic analysis of unstructured information, the extraction of high-quality and relevant data, and making the text as a whole more searchable.
A text document contains characters which together form words, which can be further combined to generate phrases. These are all syntactic properties that together represent defined categories, concepts, senses or meanings. Text mining must recognize, extract and use this information. Instead of searching for words, we can search for semantic patterns, which is searching at a higher level.

2. Explain how text mining works.


Text mining extracts precise information based on much more than just keywords. Instead,
you search for entities or concepts, relationships, phrases, sentences – even numerical
information in context.

Text mining software tools often use computational algorithms based on Natural Language Processing (NLP) to enable a computer to "read" and analyse textual information. NLP interprets the meaning of the text and identifies, extracts, synthesizes and analyses relevant facts and relationships that directly answer your question.

Text can be mined in a systematic, comprehensive and reproducible way, and business
critical information can be captured and harvested automatically.
Powerful NLP-based queries can be run in real time across millions of documents. These can be pre-written queries provided by Linguamatics, or queries created and refined on the fly by you.
Using linguistic and other wildcards, you can ask open questions without even having to
know the keywords for which you're looking. You still get back high quality, structured
results.

Text mining is a process which involves many technological areas: information retrieval, data mining, artificial intelligence and computational linguistics all play a role. Nevertheless, there are two main phases in the text mining process, as depicted in Figure 1.

Figure 1: Text Mining Process

The first phase is document pre-processing. This phase's output can have two kinds of formats: document-based and concept-based. A document-based representation is concerned with a better representation of the documents themselves. This can mean transforming them into an intermediate semi-structured format, applying some kind of index over them, or any other desired representation; here each entity in the representation is a document. The second kind of refining extracts concepts from documents, the relations between these concepts, and whatever other concept-based information can be drawn from a single document; here each entity is a concept. It is nevertheless possible to transform a document-based representation into a concept-based one.

The next step is extracting knowledge from these intermediate representations. Knowledge extraction differs according to the representation a document has. A document-based representation is used for clustering, categorization, visualization and similar tasks, while a concept-based representation is appropriate for association detection, automatic thesaurus building and other tasks that relate to the concepts rather than the documents themselves.

How to perform Text Mining?

"Python and R are the most popular tools for text mining."

The following steps are typically followed for text mining in Python and in R:

Information Retrieval | Data Preparation and Cleaning | Segmentation | Tokenization | Stop-word, number and punctuation removal | Stemming | Convert to lowercase | POS tagging | Create text corpus | Term-Document matrix
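A minimal sketch of this pipeline in Python is shown below. It assumes NLTK and scikit-learn are installed (pip install nltk scikit-learn); the two sample documents and the small stop-word list are invented for the example.

```python
import re

from nltk.stem import PorterStemmer  # classic stemming algorithm
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The resigned manager announced his resignation on Monday.",
    "Two managers resign; the company announces new openings.",
]

stemmer = PorterStemmer()
stop_words = {"the", "a", "an", "in", "is", "and", "on", "his", "two", "new"}

def preprocess(text):
    """Lowercase, keep letter-only tokens, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# Term-document matrix: one row per document, one column per stemmed term.
vectorizer = CountVectorizer(analyzer=preprocess)
tdm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tdm.toarray())
```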

3. Investigate text mining techniques, methods and approaches.

a. What are text mining techniques?

Text mining, also referred to as text data mining, roughly equivalent to text analytics, is
the process of deriving high-quality information from text. The overarching goal is,
essentially, to turn text into data for analysis, via application of natural language processing
(NLP) and analytical methods.
Text mining techniques can be understood as the processes that go into mining text and discovering insights from it. These techniques generally employ different text mining tools and applications for their execution. Let us now look at the various text mining techniques:

Figure 2: Processing of Text Mining

The six fundamental steps involved in text mining are:

• Gather unstructured data from multiple data sources such as plain text, web pages, PDF files, emails, and blogs, to name a few.
• Detect and remove anomalies from the data by conducting pre-processing and cleansing operations. Data cleansing allows you to extract and retain the valuable information hidden within the data and helps identify the roots of specific words.
• A number of text mining tools and text mining applications are available to perform these operations.
• Convert all the relevant information extracted from unstructured data into structured formats.
• Analyze the patterns within the data via the Management Information System (MIS).
• Store all the valuable information in a secure database to drive trend analysis and enhance the decision-making process of the organization.

The most common techniques used in text mining are:

1. Information Extraction (IE)
This is the most famous text mining technique. Information extraction refers to the process of extracting meaningful information from vast chunks of textual data. This text mining technique focuses on identifying and extracting entities, attributes, and their relationships from semi-structured or unstructured texts. Whatever information is extracted is then stored in a database for future access and retrieval. The efficacy and relevancy of the outcomes are checked and evaluated using precision and recall measures.
Information extraction is thus the task of automatically obtaining structured information from unstructured data; in most cases, this activity involves processing human language texts by means of NLP.

Figure 3: Information Extraction
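As a small illustration of entity extraction, the sketch below uses the spaCy library; it assumes spaCy and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm), and the sentence is invented.

```python
import spacy

# Load a small pre-trained English pipeline (tokenizer, tagger, NER).
nlp = spacy.load("en_core_web_sm")

text = "Apple acquired the London-based startup for $50 million in 2019."
doc = nlp(text)

# Print each recognized entity with its type (ORG, GPE, MONEY, DATE, ...).
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
```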

2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated patterns based on a specific set of words or phrases. In this text mining technique, IR systems make use of different algorithms to track and monitor user behavior and discover relevant data accordingly. The Google and Yahoo search engines are the two most renowned IR systems.
Information retrieval can be regarded as an extension of document retrieval, in which the documents that are returned are processed to condense them; document retrieval is thus followed by a text summarization stage that focuses on the query posed by the user. IR systems help narrow down the set of documents that are relevant to a particular problem. Since text mining involves applying very complex algorithms to large document collections, IR can speed up the analysis significantly by reducing the number of documents to be examined.
“You can’t analyze text without retrieving it in the first place, which is why
information retrieval is the essential preliminary step to text mining.”
Information can come from many sources, and it all depends on the objective you’re trying
to achieve with text mining.
For example, social media is often a hot target for information retrieval during election
season to measure how social media users feel about politicians. Databases and internal
systems are common sources for interpreting customer and employee sentiment.
After text is retrieved, it’s time to begin structuring it.

3. Categorization
This text mining technique is a form of "supervised" learning wherein natural language texts are assigned to a predefined set of topics depending upon their content. Categorization, supported by Natural Language Processing (NLP), is thus a process of gathering text documents and processing and analyzing them to uncover the right topics or indexes for each document. The co-referencing method is commonly used as a part of NLP to extract relevant synonyms and abbreviations from textual data. Today, NLP has become an automated process used in a host of contexts, ranging from personalized advertisement delivery to spam filtering and categorizing web pages under hierarchical definitions, and much more.

Categorization is the process of assigning a given text into groups of entities whose
members are in some way similar to each other. Recognition of resemblance across
entities and the subsequent aggregation of like entities into categories lead the individual
to discover order in a complex environment. Without the ability to group entities based on
perceived similarities, the individual’s experience of any one entity would be totally unique
and could not be extended to subsequent encounters with similar entities in the
environment. This process is considered as a supervised classification technique since a set
of pre-classified documents is provided as a training set. The goal of TC is to assign a
category to a new document. By reducing the load on memory, facilitating the efficient
storage and retrieval of information, categorization serves as the fundamental cognitive
mechanism that simplifies the individual’s experience of the environment.
Figure 4: Classification
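The sketch below illustrates supervised categorization with scikit-learn; the training sentences and their "sports"/"politics" labels stand in for the pre-classified training set mentioned above and are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: pre-classified documents.
train_texts = [
    "The striker scored twice in the final match",
    "The team won the league championship",
    "Parliament passed the new budget bill",
    "The minister announced an election date",
]
train_labels = ["sports", "sports", "politics", "politics"]

# Vectorize the text and fit a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Assign a category to a new, unseen document.
print(model.predict(["An amazing striker scored in the match"]))  # expected: ['sports']
```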

4. Clustering
Clustering is one of the most crucial text mining techniques. It seeks to identify intrinsic structures in textual information and organize them into relevant subgroups, or 'clusters', for further analysis. A significant challenge in the clustering process is to form meaningful clusters from unlabeled textual data without any prior information about it. Cluster analysis is a standard text mining tool that assists in data distribution or acts as a pre-processing step for other text mining algorithms running on the detected clusters.

Clustering is a process of forming groups (clusters) of similar objects from a given set of inputs. Good clusters have the characteristic that objects belonging to the same cluster are "similar" to each other, while objects from two different clusters are "dissimilar". The idea of clustering originates from statistics, where it was applied to numerical data. However, computer science, and data mining in particular, extended the notion to other types of data such as text or multimedia.
Clustering is an unsupervised process through which objects are classified into groups called clusters. In the case of clustering, the problem is to group the given unlabeled collection into meaningful clusters without any prior information; any labels associated with objects are obtained solely from the data. An advantage of clustering is that documents can appear in multiple subtopics, thus ensuring that a useful document will not be absent from the search results.
Figure 5: Clustering

5. Summarization

Text summarization refers to the process of automatically generating a compressed version of a specific text that holds valuable information for the end user. The aim of this text mining technique is to browse through multiple text sources to craft summaries of texts containing a considerable proportion of information in a concise format, keeping the overall meaning and intent of the original documents essentially the same. Text summarization integrates and combines methods that employ text categorization, such as decision trees, neural networks, regression models, and swarm intelligence.

This definition emphasizes the fact that summarizing is in general a hard task, because we have to characterize the source text as a whole and capture its important content. The content is a matter of both information and its expression, and importance is a matter of what is essential as well as what is salient. Summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document. As the problem of information overload has grown, and as the quantity of data has increased, so has interest in automatic summarization. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Examples of the use of summarization technology include search engines such as Google and document summarization tools. Summarization systems are able to create both query-relevant text summaries and generic machine-generated summaries, depending on what the user needs. Summarization of multimedia documents, e.g. pictures or movies, is also possible. Some systems generate a summary based on a single source document, while others can use multiple source documents; the latter are known as multi-document summarization systems.
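A minimal sketch of frequency-based extractive summarization in plain Python follows; the scoring scheme (sum of content-word frequencies per sentence) and the short example document are simplifications invented for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "is", "and", "of", "to", "it", "was", "its"}

def summarize(text, n_sentences=2):
    """Return the n sentences whose content words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower()))

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

doc = ("Text mining extracts useful information from text. "
       "Summarization reduces a document to its key points. "
       "The weather was pleasant yesterday. "
       "Good summaries keep the important information from the document.")
print(summarize(doc))
```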

6. Data Mining (DM)

Data mining can be loosely described as looking for patterns in data, or more precisely as the extraction of hidden information from data. Data mining tools can predict behaviors and future trends, allowing businesses to make proactive, knowledge-based decisions. Data mining tools can answer business questions that have traditionally been too time-consuming to resolve; they search databases for hidden and previously unknown patterns. In the text mining context, the same tools can predict responses and future trends and help businesses answer business questions from textual data.
7. Natural Language Processing (NLP)
NLP is one of the oldest and most challenging problems in computer science. It is the study of human language, so that computers can understand natural languages as humans do. NLP research pursues the elusive question of how we understand the meaning of a sentence or a document: what are the clues we use to understand who did what to whom? The role of NLP in text mining is to provide input to the system in the information extraction and knowledge retrieval phases.

Table 1: Comparison of text mining techniques


Different Types of Text Mining
1. Topic Tracking
A topic tracking system works by keeping user profiles and, based on the documents the user views, predicting other documents of interest to the user. Yahoo offers a free topic tracking tool that permits users to choose keywords and notifies them when news relating to those topics becomes available. Topic tracking methodology has its own limitations, however. For example, if a user sets up an alert for "text mining", s/he will receive numerous news stories on mining for minerals, and very few that are really about text mining. Some of the better text mining tools let users select specific categories of interest, or the software can even infer the user's interests based on his/her reading history and click-through information.
2. Concept Linkage
Concept linkage tools connect related documents by identifying their commonly shared concepts and help users find information that they perhaps wouldn't have found using conventional searching methods. Concept linkage promotes browsing for information rather than searching for it. It is a valuable idea in text mining, especially in the biomedical fields, where so much research has been done that it is impossible for researchers to read all the material and make connections to other research. Ideally, concept linking software can identify links between diseases and treatments when humans cannot. For example, a text mining solution may easily identify a link between topics X and Y, and between Y and Z, which are known relations. But the text mining tool could also detect a potential link between X and Z, something that a human researcher has not yet come across because of the large volume of information s/he would have to sort through to make the connection.

TEXT MINING PROCESS


The term text mining is commonly used to denote any system that analyzes large
quantities of natural language text and detects lexical or linguistic usage patterns in an
attempt to extract probably useful (although only probably correct) information.

Figure 6: Text Mining Process



1. Document Gathering
In the first step, the text documents, which exist in different formats, are collected. A document might be in the form of PDF, Word, HTML, CSS, etc.
2. Document/Text Pre-Processing
In this step, the given input document is processed to remove redundancies and inconsistencies, to separate words, and to perform stemming; the documents are thereby prepared for the next step. The stages performed are as follows:
2.1 Tokenization
The given document is treated as a string, and single words in it are identified; i.e., the document string is divided into units, or tokens. Tokenizing is most simply achieved by splitting the text on white space.

2.2 Part of Speech Tagging (POS Tagging)
Part-of-Speech (POS) tagging means assigning a word class to each token. Its input is the tokenized text. Taggers have to cope with unknown words (the out-of-vocabulary, or OOV, problem) and with ambiguous word-tag mappings.
2.3 Stemming
A stem is a natural group of words with equal (or very similar) meaning; stemming reduces each word to such a stem. Inflectional and derivational stemming are the two types of method. One of the most popular algorithms for stemming is Porter's algorithm. For example, if a document contains words like "resignation", "resigned" and "resigns", they will all be reduced to "resign" after the stemming method is applied.
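A quick check of this example with NLTK's implementation of the Porter stemmer (a sketch, assuming NLTK is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["resignation", "resigned", "resigns"]:
    # Each inflected or derived form reduces to the common stem "resign".
    print(word, "->", stemmer.stem(word))
```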

2.4 Removal of Stop Words

In this step, common words like "a", "an", "but", "and", "of", "the", etc. are removed.
To go from raw text to fitting a deep learning model, we have to clean the text first, which means splitting it into words and handling punctuation and case (by converting to lower case).
Stop words: search engines are programmed to ignore stop words when indexing entries and when retrieving them as results. Stop words are of no use in analytics; they include words like "the", "a", "an", "in", "is", "and", etc.
Figure 7: Sample text with stop words

3. Text Transformation
A text document is a collection of words (features) and their occurrences. There are two important ways of representing such documents: the Vector Space Model and Bag of Words.

4. Feature Selection (attribute selection)
This step reduces the required database space and makes search more efficient by removing irrelevant features from the input document. There are two approaches to feature selection, namely filtering and wrapping methods.

5. Data Mining / Pattern Discovery
At this stage, the conventional data mining process merges with the text mining process: classic data mining techniques are applied to the structured database that resulted from the previous stages.
6. Evaluate
This stage measures the outcome. The result can be stored or used for the next set of sequences.
7. Applications of Text Mining
Text mining is applied in a variety of areas. Some of the most common areas are:

• Web Mining
These days the web contains a treasure trove of information about subjects such as persons, companies, organizations, products, etc. that may be of wide interest. Web mining is an application of data mining techniques that aims to discover hidden and unknown patterns from the web. Web mining is the activity of identifying the terms implied in a large document collection C, denoted by a mapping C → p [10].
• Medical
Users exchange information with others about subjects of interest. Everyone wants to understand specific diseases and to be informed about new therapies. Expert forums also act as seismographs for medical needs. E-mails, e-consultations, and requests for medical advice via the internet have been analyzed using quantitative or qualitative methods.
For Example,
Users actively exchange information with others about subjects of interest or send requests to web-based expert forums, or so-called ask-the-doctor services. Everyone wants to understand specific diseases (what they have), to be informed about new therapies, and to ask for a second opinion before deciding on a treatment. In addition, these expert forums also represent seismographs for medical and/or psychological requirements which are apparently not met by existing health care systems.
E-mails, e-consultations, and requests for medical advice via the internet have been manually analyzed using quantitative or qualitative methods. To help the medical experts and to make full use of the seismograph function of expert forums, it would be helpful to categorize visitors' requests automatically. Specific requests could then be directed to the right expert or even answered semi-automatically, thereby providing complete monitoring. By generating frequently asked questions (FAQs), similar patient requests and their corresponding answers could be aggregated, even before the actual expert responds. Machine-based analyses could help both the public, to better handle the mass of information, and medical experts, to give expert feedback. An automatic classification of amateur requests to medical expert internet forums is a challenging task, because these requests can be very long and unstructured as a result of mixing, for example, personal experiences with laboratory data.
• Resume Filtering
Big enterprises and headhunters receive thousands of resumes from job applicants every day. Extracting information from resumes with high precision and recall is not easy, so automatically extracting this information can be the first step in filtering resumes. Hence, automating the process of resume selection is an important task.
For example,
Big enterprises and headhunters receive thousands of resumes from job applicants every
day. Extracting information from resumes with high precision and recall is not an easy task.
In spite of constituting a restricted domain, resumes can be written in a multitude of
formats (e.g. structured tables or plain texts), in different languages (e.g. Japanese and
English) and in different file types (e.g. plain text, PDF, Word, etc.). Moreover, writing styles can also be highly diverse. In the initial manual scan of a resume, a recruiter looks for
mistakes, educational qualifications, buzzwords, employment history, job titles, frequency
of job changes, and other personal information. Automatically extracting this information
can be the first step in filtering resumes. Hence, automating the process of resume
selection is an important task.

Text mining techniques and text mining tools are rapidly penetrating the industry, right
from academia and healthcare to businesses and social media platforms. This is giving rise
to a number of text mining applications. Here are a few text mining applications used
across the globe today:

Association Rule Mining


Association rule mining (ARM) is a technique used to discover relationships among a large set of variables in a data set. It has been applied in a variety of industry settings and disciplines but has, to date, not been widely used in the social sciences, specifically in education, counseling, and associated disciplines. ARM refers to the discovery of relationships among a large set of variables; that is, given a database of records, each containing two or more variables and their respective values, ARM determines variable-value combinations that occur frequently. It is similar to the idea of correlation analysis (although the two are theoretically different), in which relationships between two variables are uncovered, but in ARM each discovered relationship (also known as an association rule) may involve two or more variables. This section has provided an overview of the text mining techniques and methodologies by which text data becomes classifiable; next we discuss the data mining algorithms that are often used in text mining and classification tasks.

METHODS AND MODELS USED IN TEXT MINING


http://guides.library.illinois.edu/c.php?g=405110&p=5804542
Traditionally, many techniques have been developed to solve the problem of text mining, which is essentially the retrieval of relevant information according to the user's requirements. For information retrieval, four methods are mainly used:
1) Term Based Method (TBM)
2) Phrase Based Method (PBM)
3) Concept Based Method (CBM)
4) Pattern Taxonomy Method (PTM)

1) Term Based Method

A term in a document is a word having semantic meaning. In the term-based method, a document is analysed on the basis of its terms; the method has the advantages of efficient computational performance and mature theories for term weighting. These techniques have emerged over the last couple of decades from the information retrieval and machine learning communities. Term-based methods suffer from the problems of polysemy and synonymy [1]: polysemy means that a word has multiple meanings, and synonymy means that multiple words have the same meaning. The semantic meaning of many discovered terms is therefore uncertain for answering what users want; information retrieval has provided many term-based methods to address this challenge.

2) Phrase Based Method

A phrase carries more semantic information and is less ambiguous than a single term. In the phrase-based method, a document is analysed on a phrase basis, as phrases are less ambiguous and more discriminative than individual terms [2]. The likely reasons for the method's disappointing performance in practice include: 1) phrases have inferior statistical properties to terms, 2) they have a low frequency of occurrence, and 3) large numbers of redundant and noisy phrases are present among them.

3) Concept Based Method

In the concept-based method, terms are analysed at the sentence and document level. Text mining techniques are mostly based on statistical analysis of a word or phrase, but statistical analysis of term frequency captures the importance of a word only within a document. Two terms can have the same frequency in the same document while one of them contributes more to the meaning than the other [7]. The terms that capture the semantics of the text should be given more importance, so a new concept-based mining model was introduced.
This model includes three components. The first component analyses the semantic structure of sentences. The second component constructs a conceptual ontological graph (COG) to describe the semantic structures, and the last component extracts top concepts, based on the first two components, to build feature vectors using the standard vector space model. A concept-based model can effectively discriminate between unimportant terms and the meaningful terms which describe a sentence's meaning [8]. The concept-based model usually relies upon natural language processing techniques. Feature selection is applied to the query concepts to optimize the representation and remove noise and ambiguity.

4) Pattern Taxonomy Method

In the pattern taxonomy method, documents are analysed on a pattern basis. Patterns can be structured into a taxonomy by using the is-a relation. Pattern mining has been extensively studied in the data mining community for many years. Patterns can be discovered by data mining techniques such as association rule mining, frequent itemset mining, sequential pattern mining and closed pattern mining [5]. Using the discovered knowledge (patterns) in the field of text mining is difficult and often ineffective, because some useful long patterns with high specificity lack support (the low-frequency problem), and not all frequent short patterns are useful (the misinterpretation problem), which leads to ineffective performance. In the research literature, an effective pattern discovery technique has been proposed to overcome the low-frequency and misinterpretation problems in text mining. The pattern-based technique uses two processes, pattern deploying and pattern evolving [6], and refines the discovered patterns in text documents. Experimental results show that the pattern-based model performs better not only than other pure data mining-based methods and the concept-based model, but also than term-based models.

Approaches to Text Mining


http://www.statsoft.com/Textbook/Text-Mining#approaches
To reiterate, text mining can be summarized as a process of "numericizing" text. At the
simplest level, all words found in the input documents will be indexed and counted in order
to compute a table of documents and words, i.e., a matrix of frequencies that enumerates
the number of times that each word occurs in each document. This basic process can be
further refined to exclude certain common words such as "the" and "a" (stop word lists)
and to combine different grammatical forms of the same words such as "traveling,"
"traveled," "travel," etc. (stemming). However, once a table of (unique) words (terms) by
documents has been derived, all standard statistical and data mining techniques can be
applied to derive dimensions or clusters of words or documents, or to identify "important"
words or terms that best predict another outcome variable of interest.
Using well-tested methods and understanding the results of text mining. Once a data
matrix has been computed from the input documents and words found in those
documents, various well-known analytic techniques can be used for further processing
those data including methods for clustering, factoring, or predictive data mining (see, for
example, Manning and Schütze, 2002).
"Black-box" approaches to text mining and extraction of concepts. There are text
mining applications which offer "black-box" methods to extract "deep meaning" from
documents with little human effort (to first read and understand those documents). These
text mining applications rely on proprietary algorithms for presumably extracting
"concepts" from text, and may even claim to be able to summarize large numbers of text
documents automatically, retaining the core and most important meaning of those
documents. While there are numerous algorithmic approaches to extracting "meaning
from documents," this type of technology is very much still in its infancy, and the aspiration
to provide meaningful automated summaries of large numbers of documents may forever
remain elusive. We urge skepticism when using such algorithms because 1) if it is not clear
to the user how those algorithms work, it cannot possibly be clear how to interpret the
results of those algorithms, and 2) the methods used in those programs are not open to
scrutiny, for example by the academic community and peer review and, hence, we simply
don't know how well they might perform in different domains. As a final thought on this
subject, you may consider this concrete example: Try the various automated translation
services available via the Web that can translate entire paragraphs of text from one
language into another. Then translate some text, even simple text, from your native
language to some other language and back, and review the results. Almost every time, the
attempt to translate even short sentences to other languages and back while retaining the
original meaning of the sentence produces humorous rather than accurate results. This
illustrates the difficulty of automatically interpreting the meaning of text.
Text mining as document search. There is another type of application that is often
described and referred to as "text mining" - the automatic search of large numbers of
documents based on key words or key phrases. This is the domain of, for example, the
popular internet search engines that have been developed over the last decade to provide
efficient access to Web pages with certain content. While this is obviously an important
type of application with many uses in any organization that needs to search very large
document repositories based on varying criteria, it is very different from what has been
described here.

Issues and Considerations for "Numericizing" Text


Large numbers of small documents vs. small numbers of large documents. Examples
of scenarios using large numbers of small or moderate sized documents were given earlier
(e.g., analyzing warranty or insurance claims, diagnostic interviews, etc.). On the other
hand, if your intent is to extract "concepts" from only a few documents that are very large
(e.g., two lengthy books), then statistical analyses are generally less powerful because the
"number of cases" (documents) in this case is very small while the "number of variables"
(extracted words) is very large.
Excluding certain characters, short words, numbers, etc. Excluding numbers, certain
characters, or sequences of characters, or words that are shorter or longer than a certain
number of letters can be done before the indexing of the input documents starts. You may
also want to exclude "rare words," defined as those that only occur in a small percentage of
the processed documents.
Include lists, exclude lists (stop-words). Specific list of words to be indexed can be
defined; this is useful when you want to search explicitly for particular words, and classify
the input documents based on the frequencies with which those words occur. Also, "stop-
words," i.e., terms that are to be excluded from the indexing can be defined. Typically, a
default list of English stop words includes "the", "a", "of", "since," etc, i.e., words that are
used in the respective language very frequently, but communicate very little unique
information about the contents of the document.
Synonyms and phrases. Synonyms, such as "sick" or "ill", or words that are used in
particular phrases where they denote unique meaning can be combined for indexing. For
example, "Microsoft Windows" might be such a phrase, which is a specific reference to the
computer operating system, but has nothing to do with the common use of the term
"Windows" as it might, for example, be used in descriptions of home improvement
projects.
Stemming algorithms. An important pre-processing step before indexing of input
documents begins is the stemming of words. The term "stemming" refers to the reduction
of words to their roots so that, for example, different grammatical forms or declinations of
verbs are identified and indexed (counted) as the same word. For example, stemming will
ensure that both "traveling" and "traveled" will be recognized by the text mining program
as the same word.
Support for different languages. Stemming, synonyms, the letters that are permitted in
words, etc. are highly language dependent operations. Therefore, support for different
languages is important.

Transforming Word Frequencies


Once the input documents have been indexed and the initial word frequencies (by
document) computed, a number of additional transformations can be performed to
summarize and aggregate the information that was extracted.
Log-frequencies. First, various transformations of the frequency counts can be performed.
The raw word or term frequencies generally reflect on how salient or important a word is in
each document. Specifically, words that occur with greater frequency in a document are
better descriptors of the contents of that document. However, it is not reasonable to
assume that the word counts themselves are proportional to their importance as
descriptors of the documents. For example, if a word occurs 1 time in document A, but 3
times in document B, then it is not necessarily reasonable to conclude that this word is 3
times as important a descriptor of document B as compared to document A. Thus, a
common transformation of the raw word frequency counts (wf) is to compute:
f(wf) = 1 + log(wf), for wf > 0

This transformation will "dampen" the raw frequencies and how they will affect the results
of subsequent computations.
Binary frequencies. Likewise, an even simpler transformation can be used that
enumerates whether a term is used in a document; i.e.:
f(wf) = 1, for wf > 0

The resulting documents-by-words matrix will contain only 1s and 0s to indicate the
presence or absence of the respective words. Again, this transformation will dampen the
effect of the raw frequency counts on subsequent computations and analyses.
Inverse document frequencies. Another issue that you may want to consider more
carefully and reflect in the indices used in further analyses are the relative document
frequencies (df) of different words. For example, a term such as "guess" may occur
frequently in all documents, while another term such as "software" may only occur in a
few. The reason is that we might make "guesses" in various contexts, regardless of the
specific topic, while "software" is a more semantically focused term that is only likely to
occur in documents that deal with computer software. A common and very useful
transformation that reflects both the specificity of words (document frequencies) as well as
the overall frequencies of their occurrences (word frequencies) is the so-called inverse
document frequency (for the i'th word and j'th document):

idf(i,j) = 0, for wf(i,j) = 0
idf(i,j) = (1 + log(wf(i,j))) * log(N / df(i)), for wf(i,j) ≥ 1
In this formula (see also formula 15.5 in Manning and Schütze, 2002), N is the total number
of documents, and df is the document frequency for the i'th word (the number of
documents that include this word). Hence, it can be seen that this formula includes both
the dampening of the simple word frequencies via the log function (described above), and
also includes a weighting factor that evaluates to 0 if the word occurs in all documents (log(N/N) = log(1) = 0), and to the maximum value when a word only occurs in a single document (log(N/1) = log(N)). It can easily be seen how this transformation will create indices
that both reflect the relative frequencies of occurrences of words, as well as their semantic
specificities over the documents included in the analysis.
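A small numerical sketch of this transformation, computed directly with NumPy so that the formula above stays visible in the code; the count matrix is invented for the example (note that the first word occurs in every document, so its weight becomes 0).

```python
import numpy as np

# Rows = documents, columns = words: raw word frequencies wf(i, j).
wf = np.array([[1, 3, 0],
               [2, 0, 1],
               [4, 1, 1]], dtype=float)

N = wf.shape[0]            # total number of documents
df = (wf > 0).sum(axis=0)  # document frequency df(i) of each word

# idf(i,j) = (1 + log(wf)) * log(N/df) where wf >= 1, and 0 otherwise.
with np.errstate(divide="ignore"):
    weights = np.where(wf > 0, (1 + np.log(wf)) * np.log(N / df), 0.0)
print(weights.round(3))
```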

Latent Semantic Indexing via Singular Value Decomposition


As described above, the most basic result of the initial indexing of words found in the input
documents is a frequency table with simple counts, i.e., the number of times that different
words occur in each input document. Usually, we would transform those raw counts to
indices that better reflect the (relative) "importance" of words and/or their semantic
specificity in the context of the set of input documents (see the discussion of inverse
document frequencies, above).
A common analytic tool for interpreting the "meaning" or "semantic space" described by
the words that were extracted, and hence by the documents that were analyzed, is to
create a mapping of the word and documents into a common space, computed from the
word frequencies or transformed word frequencies (e.g., inverse document frequencies). In
general, here is how it works:
Suppose you indexed a collection of customer reviews of their new automobiles (e.g., for
different makes and models). You may find that every time a review includes the word
"gas-mileage," it also includes the term "economy." Further, when reports include the word
"reliability" they also include the term "defects" (e.g., make reference to "no defects").
However, there is no consistent pattern regarding the use of the terms "economy" and
"reliability," i.e., some documents include either one or both. In other words, these four
words "gas-mileage" and "economy," and "reliability" and "defects," describe two
independent dimensions - the first having to do with the overall operating cost of the
vehicle, the other with the quality and workmanship. The idea of latent semantic indexing is
to identify such underlying dimensions (of "meaning"), into which the words and
documents can be mapped. As a result, we may identify the underlying (latent) themes
described or discussed in the input documents, and also identify the documents that
mostly deal with economy, reliability, or both. Hence, we want to map the extracted words
or terms and input documents into a common latent semantic space.
Singular value decomposition. The use of singular value decomposition in order to
extract a common space for the variables and cases (observations) is used in various
statistical techniques, most notably in Correspondence Analysis. The technique is also
closely related to Principal Components Analysis and Factor Analysis. In general, the
purpose of this technique is to reduce the overall dimensionality of the input matrix
(number of input documents by number of extracted words) to a lower-dimensional space,
where each consecutive dimension represents the largest degree of variability (between
words and documents) possible. Ideally, you might identify the two or three most salient
dimensions, accounting for most of the variability (differences) between the words and
documents and, hence, identify the latent semantic space that organizes the words and
documents in the analysis. In some way, once such dimensions can be identified, you have
extracted the underlying "meaning" of what is contained (discussed, described) in the
documents.
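Below is a minimal sketch of this idea using scikit-learn's TruncatedSVD, which performs the singular value decomposition on a term-document matrix; the four short "review" documents are invented to mirror the economy/reliability example above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great gas-mileage and economy",
    "excellent economy and low operating cost",
    "no defects and very good reliability",
    "reliability was poor with many defects",
]

# Build a (documents x words) tf-idf matrix, then reduce it to 2 latent dimensions.
X = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(X)

# Each document is now a point in a 2-D "semantic space"; documents about
# economy and documents about reliability should separate along the axes.
for doc, (d1, d2) in zip(docs, coords):
    print(f"{d1:+.2f} {d2:+.2f}  {doc}")
```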

Incorporating Text Mining Results in Data Mining Projects


After significant (e.g., frequent) words have been extracted from a set of input documents,
and/or after singular value decomposition has been applied to extract salient semantic
dimensions, typically the next and most important step is to use the extracted information
in a data mining project.
Graphics (visual data mining methods). Depending on the purpose of the analyses, in
some instances the extraction of semantic dimensions alone can be a useful outcome if it
clarifies the underlying structure of what is contained in the input documents. For example,
a study of new car owners' comments about their vehicles may uncover the salient
dimensions in the minds of those drivers when they think about or consider their
automobile (or how they "feel" about it). For marketing research purposes, that in itself can
be a useful and significant result. You can use the graphics (e.g., 2D scatterplots or 3D
scatterplots) to help you visualize and identify the semantic space extracted from the input
documents.
Clustering and factoring. You can use cluster analysis methods to identify groups of
documents (e.g., vehicle owners who described their new cars), to identify groups of similar
input texts. This type of analysis also could be extremely useful in the context of market
research studies, for example of new car owners. You can also use Factor Analysis and
Principal Components and Classification Analysis (to factor analyze words or documents).
Predictive data mining. Another possibility is to use the raw or transformed word counts
as predictor variables in predictive data mining projects.

4. Demonstrate text mining with at least two different real world examples.

https://expertsystem.com/text-mining-algorithms/

The following 10 text mining examples demonstrate how the practical application of unstructured data management techniques can impact not only your organizational processes, but also your ability to be competitive.

Text mining applications: 10 examples today


Text mining is a relatively new area of computer science, and its use has grown as the
unstructured data available continues to increase exponentially in both relevance and
quantity.
Text mining can be used to make the large quantities of unstructured data accessible and
useful, thereby generating not only value, but delivering ROI from unstructured data
management as we’ve seen with applications of text mining for Risk Management Software
and Cybercrime applications.

Through techniques such as categorization, entity extraction, sentiment analysis and others, text mining extracts the useful information and knowledge hidden in text content. In the business world, this translates into being able to reveal insights, patterns and trends in even large volumes of unstructured data. In fact, it's this ability to push aside all of the non-relevant material and provide answers that is leading to its rapid adoption, especially in large organizations.

These 10 text mining examples can give you an idea of how this technology is helping
organizations today.

1 – Risk management
No matter the industry, insufficient risk analysis is often a leading cause of failure. This is
especially true in the financial industry where adoption of Risk Management Software
based on text mining technology can dramatically increase the ability to mitigate risk,
enabling complete management of thousands of sources and petabytes of text documents,
and providing the ability to link together information and be able to access the right
information at the right time.

2 – Knowledge management
Not being able to find important information quickly is always a challenge when managing
large volumes of text documents—just ask anyone in the healthcare industry. Here,
organizations are challenged with a tremendous amount of information—decades of
research in genomics and molecular techniques, for example, as well as volumes of clinical
patient data—that could potentially be useful for their largest profit center: new product
development. Here, knowledge management software based on text mining offers a clear and reliable solution to the "info-glut" problem.

3 – Cybercrime prevention
The anonymous nature of the internet and the many communication features operated
through it contribute to the increased risk of internet-based crimes. Today, text mining
intelligence and anti-crime applications are making internet crime prevention easier for any
enterprise and law enforcement or intelligence agencies.

4 – Customer care service


Text mining and natural language processing are frequent applications for customer care. Today, text analytics software is frequently adopted to improve customer experience
using different sources of valuable information such as surveys, trouble tickets, and
customer call notes to improve the quality, effectiveness and speed in resolving problems.
Text analysis is used to provide a rapid, automated response to the customer, dramatically
reducing their reliance on call center operators to solve problems.

5 – Fraud detection through claims investigation


Text analytics is a tremendously effective technology in any domain where the majority of
information is collected as text. Insurance companies are taking advantage of text mining
technologies by combining the results of text analysis with structured data to prevent
frauds and swiftly process claims.

6 – Contextual Advertising
Digital advertising is a moderately new and growing field of application for text analytics.
Here, companies such as Admantx have made text mining the core engine for contextual
retargeting with great success. Compared to the traditional cookie-based approach, contextual advertising provides better accuracy and completely preserves the user's privacy.

7 – Business intelligence
This process is used by large companies to uphold and support decision making. Here, text mining really makes the difference, enabling the analyst to quickly jump to the answer even when analysing petabytes of internal and open source data. Applications such as the Cogito Intelligence Platform are able to monitor thousands of sources and analyze large data volumes to extract from them only the relevant content.

8 – Content enrichment
While it’s true that working with text content still requires a bit of human effort, text
analytics techniques make a significant difference when it comes to being able to more
effectively manage large volumes of information. Text mining techniques enrich content,
providing a scalable layer to tag, organize and summarize the available content that makes
it suitable for a variety of purposes.

9 – Spam filtering
E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a
dark side: spam. Today, spam is a major issue for internet service providers, increasing
their costs for service management and hardware/software updates; for users, spam is an
entry point for viruses and impacts productivity. Text mining techniques can be
implemented to improve the effectiveness of statistical-based filtering methods.
10 – Social media data analysis
Today, social media is one of the most prolific sources of unstructured data; organizations
have taken notice. Social media is increasingly being recognized as a valuable source of
market and customer intelligence, and companies are using it to analyse or predict customer needs and to understand the perception of their brand. Text analytics can address both needs by analysing large volumes of unstructured data, extracting opinions, emotions and sentiment and their relations with brands and products.


5. Identify the most common text mining algorithms used in industry.

Text Mining Algorithms List


Text mining algorithms are nothing more than specific data mining algorithms applied in the domain of natural language text. The text can be any type of content: postings on social media, email, business documents, web content, articles, news, blog posts, and other types of unstructured data.

Algorithms for text analytics incorporate a variety of techniques such as text classification,
categorization, and clustering. All of them aim to uncover hidden relationships, trends, and
patterns which are a solid base for business decision-making.
1. K-Means Clustering

K-means clustering is a popular data analysis algorithm that aims to find groups in a given data set. The number of groups is represented by a variable called K.
It is one of the simplest unsupervised learning algorithms that solve clustering problems. The key idea is to define K centroids, which are then used to label new data.
K-means clustering is a classical way to categorize text. It is widely used for document classification, building clusters on social media text data, clustering search keywords, and so on.
Using k-means clustering for text data requires a text-to-numeric transformation of the content, as shown in the sketch below. If you work with R, you might know that it has various packages to simplify the process.
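A minimal sketch of that text-to-numeric transformation followed by k-means, using Python and scikit-learn rather than R; the documents are toy data and K = 2.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the match ended with a late goal",
    "the striker scored in the final",
    "stock prices fell on weak earnings",
    "the market rallied after the earnings report",
]

# Text-to-numeric transformation: tf-idf vectors for each document.
X = TfidfVectorizer().fit_transform(docs)

# K = 2 centroids; each document is labelled with its nearest centroid.
km = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = km.fit_predict(X)
print(labels)  # e.g. [0 0 1 1] -- a sports cluster and a finance cluster
```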

2. Naive Bayes Classifier

Naive Bayes is considered one of the most effective data mining algorithms. It is a simple probabilistic algorithm for classification tasks.
The Naive Bayes classifier is based on Bayes' theorem and gives reliable results when used for text data analytics.
Naive Bayes is not a single algorithm but a family of algorithms which assume that the values of the features used in the classification are independent of one another.
It is very easy to code in standard programming languages such as PHP, Java, C#, etc.

As one of the best text classification techniques, Naive Bayes has a variety of applications in
email spam detection, document categorization, email sorting, age/gender identification,
language detection and sentiment analysis.
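One of those applications, spam detection, can be sketched in a few lines with scikit-learn's multinomial Naive Bayes; the tiny training corpus is invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now",
    "claim your free money",
    "meeting agenda for tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Word counts as features; MultinomialNB applies Bayes' theorem under the
# "naive" assumption that word occurrences are independent of one another.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["free money prize"]))  # expected: ['spam']
```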

3. K-Nearest Neighbor (KNN)

K-Nearest Neighbor (KNN) is also one of the most used text mining algorithms because of
its simplicity and efficiency.
KNN is a non-parametric method used for classification.
In a few words, KNN is a simple algorithm that stores all existing data objects and classifies new data objects based on a similarity measure.
In the text analysis domain, it is used to measure the similarity between a new document and the k most similar training documents, in order to determine the category of the test document.
One of the biggest text mining applications of KNN is in “Concept Search” (i.e. searching for
semantically similar documents) – a feature in software tools, which is used for helping
businesses find their emails, business correspondence, reports, contacts, etc.
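A sketch of such a similarity search using cosine distance with scikit-learn's NearestNeighbors follows; the stored documents and the query are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

corpus = [
    "invoice for consulting services",
    "quarterly sales report attached",
    "payment reminder for outstanding invoice",
    "team lunch on friday",
]

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)

# Index the corpus; cosine distance suits sparse text vectors.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

# Find the 2 stored documents most similar to a new query.
dist, idx = nn.kneighbors(vec.transform(["unpaid invoice"]))
for d, i in zip(dist[0], idx[0]):
    print(f"{d:.2f}  {corpus[i]}")
```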

4. Support Vector Machines (SVM)

This approach is one of the most accurate classification text mining algorithms.
Practically, SVM is a supervised machine learning algorithm mainly used for classification problems and outlier detection. It can also be used for regression challenges.
SVM is used to separate two data sets by class. This data analysis algorithm draws boundaries (known as hyperplanes) that separate the groups according to patterns in the data. The goal of SVM is to find the hyperplane with the maximum margin from both groups. In the real world, SVM can model complex problems such as text and image classification, handwriting recognition, face detection, and biological sequence analysis.
When it comes to text mining, SVM is widely used for text classification activities such as detecting spam, sentiment analysis, and document classification into categories such as news, emails, articles, web pages, etc.

5. Decision Tree

Decision Tree algorithm is a well-known machine learning technique for data mining that
creates classification or regression models in the shape of a tree structure.
The structure includes a root node, branches, and leaf nodes. Each internal node indicates
a test on an attribute and each branch indicates the result of a test. Finally, each leaf node
indicates a class label.
The Decision Tree algorithm is nonlinear and simple.
As a text mining algorithm, the decision tree has many applications, such as analyzing the text that comes from customer relationship management systems. It is also used to make medical predictions based on medical history documents.

6. Generalized Linear Models (GLM)


Generalized Linear Models are a popular statistical technique used for linear modelling. GLMs encompass a large number of models, including linear regression, logistic regression, Poisson regression, ANOVA, log-linear models, etc.
Combining the linear approach with data mining tools has many advantages, such as accelerating the modelling process and achieving better accuracy.
Some of the best content analysis software providers (such as Oracle) use GLM as one of
the key text mining algorithms.

7. Neural Networks

Neural networks are nonlinear models inspired, as a metaphor, by the functioning of the human brain.
Although neural networks have a complex structure and long training times, they have their place among data analysis and text mining algorithms.
In the domain of text analytics, neural networks can be used for grouping similar patterns, classifying patterns, and so on.
Neural networks are important in data mining because of characteristics such as self-organizing adaptiveness, parallel performance, fault tolerance, and robustness.
When it comes to text data analysis, neural networks are popular in mining medical research documents, finance, and marketing content.

8. Association Rules

Association rules are just if/then statements that aim to uncover some relationships
between unrelated data in a given database.
They can find relationships between the items which are regularly used together.
Popular applications of association rules are basket data analysis, cross-marketing,
clustering, classification, catalog design, etc. For example, if the customer buys eggs then
he may also buy milk.
Using this approach in the area of text data mining can help users gain knowledge from collections of different types of content, such as web documents (decreasing the time needed to read all those documents).
Another example is the use of association rules for identifying positive or negative associations between symptoms, medications, and laboratory results in medical text reports.
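The sketch below shows association rule mining with the Apriori algorithm from the third-party mlxtend library (assuming pip install mlxtend pandas); the transactions are invented, and each one could equally be the set of terms in a document.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

transactions = [
    ["eggs", "milk", "bread"],
    ["eggs", "milk"],
    ["milk", "bread"],
    ["eggs", "milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Find frequent itemsets, then derive if/then rules from them.
itemsets = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```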

9. Genetic Algorithms

Genetic algorithms (GAs), or evolutionary algorithms, are a family of stochastic search algorithms whose mechanism is inspired by the process of neo-Darwinian evolution. GAs typically use binary strings (chromosomes) to encode the features that form an individual; they essentially try to imitate natural evolution.
The reason for using GAs for data mining is that they are adaptive and robust search
techniques.
GAs can solve several text data mining problems such as clustering, the discovery of
classification rules, attribute selection and construction.

10. Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation is one of the techniques currently used in topic modelling of text.
In fact, latent Dirichlet allocation (LDA) is a generative probabilistic model designed for collections of discrete data.
To put it another way, LDA is a method that automatically finds the topics that given documents contain.
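A minimal sketch of topic discovery with scikit-learn's LatentDirichletAllocation (the four toy documents and the choice of 2 topics are invented for the example):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the patient was given medication for the infection",
    "the doctor prescribed a new drug for the disease",
    "the team scored a goal in the second half",
    "the striker and the goalkeeper trained before the match",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Fit a 2-topic LDA model on the word-count matrix.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words per discovered topic.
words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```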

6. Explain how these text mining algorithms work.

https://www.wave-access.com/public_en/blog/2019/february/04/text-mining-what-it-is-and-how-it-works-for-business.aspx
https://www.sciencedirect.com/topics/mathematics/text-mining-algorithm
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
https://www.geeksforgeeks.org/k-means-clustering-introduction/
7. Demonstrate these algorithms using an appropriate programming language or text mining tool.

https://www.sciencedirect.com/topics/mathematics/text-mining-algorithm
Text Mining Solver Example: https://www.solver.com/text-mining-example
Top 63 Software for Text Analysis, Text Mining, Text Analytics: https://www.predictiveanalyticstoday.com/top-software-for-text-analysis-text-mining-text-analytics/
Basic text mining tools (read more at wave-access.com): https://www.wave-access.com/public_en/blog/2019/february/04/text-mining-what-it-is-and-how-it-works-for-business.aspx
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2607-x
https://www.educba.com/text-mining/
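In addition to the tools linked above, the algorithms from section 5 can be demonstrated directly in Python. The sketch below compares four of them (Naive Bayes, KNN, SVM and a decision tree) on two categories of the public 20 Newsgroups corpus via scikit-learn; the corpus is downloaded on first run, and the exact accuracy figures will vary by environment.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

cats = ["rec.sport.hockey", "sci.med"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# Same term-document matrix for every classifier.
vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "SVM": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Train each algorithm and compare test-set accuracy.
for name, clf in classifiers.items():
    clf.fit(X_train, train.target)
    acc = accuracy_score(test.target, clf.predict(X_test))
    print(f"{name:14s} accuracy: {acc:.3f}")
```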
