LO3 Illustrate how a data mining algorithm performs text mining to identify relationships
within text.
P5 Discuss what is meant by text mining and explain with appropriate examples.
P6 Analyse how data mining algorithms, techniques, methods and approaches work.
M4 Show how text mining works using a tool or programming language.
D3 Develop a complete text mining application for a real world issue.
4. Demonstrate text mining with at least two different real world examples.
LO4 Evaluate a range of graph data mining techniques that recognise patterns and
relationships in graph based technologies.
P7 Discuss what is meant by graph data mining and explain with appropriate examples.
P8 Assess how graph mining algorithms work and identify appropriate programming
languages and tools used by industry for graph data mining.
M5 Demonstrate how graph data mining works using a tool or programming language.
D4 Develop a complete graph data mining application for a real world scenario.
B. Graph mining
4. Demonstrate graph mining with at least two different real world examples.
LO3 Illustrate how a data mining algorithm performs text mining to identify relationships
within text.
P5. Discuss what is meant by text mining and explain with appropriate examples.
The Concept:
Text mining is a burgeoning field that tries to extract meaningful information from
natural language text. It may be characterized as the process of analyzing text to extract
information that is useful for a specific purpose. Compared with the kind of data stored in
databases, text is unstructured, ambiguous, and difficult to process. Nevertheless, in
modern culture, text is the most common medium for the formal exchange of information.
Text mining usually deals with texts whose function is the communication of factual
information or opinions, and the motivation for trying to extract information from such text
automatically is compelling—even if success is only partial. Text mining was first performed
during the 1980s using manual techniques. It quickly became apparent that these manual
techniques were labor-intensive and therefore expensive, and too slow to keep up with the
already growing quantity of information. Over time, programs were developed to process
the information automatically, and in the last few years there has been great progress.
The study of text mining concerns the development of various mathematical, statistical,
linguistic and pattern-recognition techniques which allow the automatic analysis of
unstructured information, the extraction of high-quality and relevant data, and making the
text as a whole better searchable.
A text document contains characters which together form words, which can be further
combined to generate phrases. These are all syntactic properties that together represent
defined categories, concepts, senses or meanings. Text mining must recognize, extract and
use this information. Instead of searching for words, we can search for semantic patterns,
which is searching at a higher level.
Text mining software tools often use computational algorithms based on Natural Language
Processing (NLP) to enable a computer to "read" and analyse textual information. NLP
interprets the meaning of the text and identifies, extracts, synthesizes and analyses
relevant facts and relationships that directly answer your question.
Text can be mined in a systematic, comprehensive and reproducible way, and business
critical information can be captured and harvested automatically.
Powerful NLP-based queries can be run in real time across millions of documents. These
can be pre-written queries provided by Linguamatics, or queries created and refined, on-
the-fly, by you.
Using linguistic and other wildcards, you can ask open questions without even having to
know the keywords for which you're looking. You still get back high quality, structured
results.
Text mining is a process which involves many technological areas. Information Retrieval,
Data Mining, Artificial Intelligence and computational linguistics all play some role in it.
Nevertheless, there are two main phases in the text mining process, depicted in figure 1.
The first phase is document pre-processing. This phase's output can take two kinds of
formats: document-based and concept-based. Document-based representation is
concerned with a better representation of documents—for example transforming them into
an intermediate semi-structured format, applying some kind of index over them, or any
other desired representation that can be applied to a document. Here each entity in the
representation is a document. The second kind of refining extracts concepts from
documents, the relations between these concepts, and whatever other concept-based
information can be extracted from a single document. Here each entity is a concept.
Nevertheless, it is possible to transform a document-based representation into a
concept-based representation.
“Python and R are the most famous tools out there for text mining.”
The steps described below apply to text mining in both Python and R.
Text mining, also referred to as text data mining, roughly equivalent to text analytics, is
the process of deriving high-quality information from text. The overarching goal is,
essentially, to turn text into data for analysis, via application of natural language processing
(NLP) and analytical methods.
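The first step of turning text into data is almost always pre-processing: lowercasing, tokenization and stop-word removal. A minimal sketch in Python (the stop-word list here is a tiny invented sample; real toolkits such as NLTK or spaCy ship much larger ones):

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; production systems use larger curated lists.
STOP_WORDS = {"the", "is", "a", "of", "to", "and", "for", "in"}

def preprocess(text):
    """Lowercase, tokenize and remove stop words -- the first step of most pipelines."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

doc = "Text mining is the process of deriving high-quality information from text."
tokens = preprocess(doc)
print(tokens)
print(Counter(tokens).most_common(2))  # most frequent remaining terms
```

The resulting token counts are exactly the "words as data" that later techniques (retrieval, categorization, clustering) operate on.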
Text mining techniques can be understood as the processes that go into mining the text
and discovering insights from it. These text mining techniques generally employ different
text mining tools and applications for their execution. Let us now look at the various
text mining techniques:
2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated
patterns based on a specific set of words or phrases. In this text mining technique, IR
systems make use of different algorithms to track and monitor user behaviors and discover
relevant data accordingly. Google and Yahoo search engines are the two most renowned IR
systems.
Information retrieval is regarded as an extension of document retrieval, in which the
documents that are returned are processed to condense them: document retrieval is
followed by a text summarization stage that focuses on the query posed by the user. IR
systems help narrow down the set of documents that are relevant to a particular problem.
Because text mining involves applying very complex algorithms to large document
collections, IR can speed up the analysis significantly by reducing the number of
documents that need to be considered.
“You can’t analyze text without retrieving it in the first place, which is why
information retrieval is the essential preliminary step to text mining.”
Information can come from many sources, and it all depends on the objective you’re trying
to achieve with text mining.
For example, social media is often a hot target for information retrieval during election
season to measure how social media users feel about politicians. Databases and internal
systems are common sources for interpreting customer and employee sentiment.
After text is retrieved, it’s time to begin structuring it.
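As a minimal illustration of the retrieval step, documents can be ranked by how many query terms they contain (a real IR system would use an inverted index and tf-idf weighting; the documents below are invented):

```python
# Rank documents by query-term overlap: a bare-bones retrieval sketch.
docs = {
    "d1": "election polls show the candidate leading on social media",
    "d2": "new therapies for the disease were discussed in the forum",
    "d3": "voters discuss the election results on social media platforms",
}

def retrieve(query, docs):
    q_terms = set(query.lower().split())
    scores = {name: len(q_terms & set(text.split())) for name, text in docs.items()}
    # Keep only documents with at least one matching term, best match first.
    return sorted((n for n, s in scores.items() if s > 0), key=lambda n: -scores[n])

print(retrieve("election social media", docs))  # -> ['d1', 'd3']
```

Only the matching documents are passed on to the (more expensive) mining stage, which is exactly how IR speeds up the analysis.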
3. Categorization
This is one of those text mining techniques that is a form of “supervised” learning wherein
natural language texts are assigned to a predefined set of topics depending upon their
content. Thus, categorization—typically implemented with Natural Language Processing
(NLP)—is a process of gathering text documents and processing and analyzing them to
uncover the right topics or indexes for each document. The co-referencing method is
commonly used as a part of NLP to extract relevant synonyms and abbreviations from
textual data. Today, NLP has become an automated process used in a host of contexts
ranging from personalized advertisement delivery to spam filtering and categorizing web
pages under hierarchical definitions, and much more.
Categorization is the process of assigning a given text into groups of entities whose
members are in some way similar to each other. Recognition of resemblance across
entities and the subsequent aggregation of like entities into categories lead the individual
to discover order in a complex environment. Without the ability to group entities based on
perceived similarities, the individual’s experience of any one entity would be totally unique
and could not be extended to subsequent encounters with similar entities in the
environment. This process is considered a supervised classification technique, since a set
of pre-classified documents is provided as a training set. The goal of text categorization
(TC) is to assign a category to a new document. By reducing the load on memory and facilitating the efficient
storage and retrieval of information, categorization serves as the fundamental cognitive
mechanism that simplifies the individual’s experience of the environment.
Fig.3: Classification
4. Clustering
Clustering is one of the most crucial text mining techniques. It seeks to identify intrinsic
structures in textual information and organize them into relevant subgroups or ‘clusters’
for further analysis. A significant challenge in the clustering process is to form meaningful
clusters from the unlabeled textual data without having any prior information on them.
Cluster analysis is a standard text mining tool that assists in data distribution or acts as a
pre-processing step for other text mining algorithms running on detected clusters.
Clustering is a process of forming groups (clusters) of similar objects from a given set of
inputs. Good clusters have the characteristic that objects belonging to the same cluster are
"similar" to each other, while objects from two different clusters are "dissimilar". The idea
of clustering originates from statistics where it was applied to numerical data. However,
computer science and data mining in particular, extended the notion to other types of data
such as text or multimedia.
Clustering is an unsupervised process through which objects are classified into groups
called clusters. In the case of clustering, the problem is to group the given unlabeled
collection into meaningful clusters without any prior information. Any labels associated
with objects are obtained solely from the data. An advantage of clustering is that
documents can emerge in multiple subtopics, thus ensuring that a useful document will
not be absent from search results.
Fig.4: Clustering
5. Summarization
Defining a summary is harder than it may appear: summarizing is in general a difficult task
because we have to characterize the source text as a whole and capture its important
content. Content is a matter of both information and its expression, and importance is a
matter of what is essential as well as what is salient.
Summarization is the process of reducing a text document with a computer program in
order to create a summary that retains the most important points of the original
document. As the problem of information overload has grown and as the quantity of data
has increased, so has interest in automatic summarization. Technologies that can make a
coherent summary take into account variables such as length, writing style and syntax. One
example of the use of summarization technology is in search engines such as Google;
another is automatic document summarization. Summarization systems are able to create both
query relevant text summaries and generic machine-generated summaries depending on
what the user needs. Summarization of multimedia documents, e.g. pictures or movies is
also possible. Some systems will generate a summary based on a single source document,
while others can use multiple source documents. These systems are known as multi-
document summarization systems.
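A common baseline for extractive summarization is to score each sentence by the corpus frequency of its words and keep the top-scoring sentences. A minimal sketch (toy text invented for illustration):

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Frequency-based extractive summary: score each sentence by the
    overall frequency of its words and keep the top-scoring ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()))
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]  # keep the original sentence order

text = ("Text mining extracts information from text. "
        "Summarization keeps the important points. "
        "Text mining and summarization reduce information overload.")
print(summarize(text))
```

Production systems add sentence position, length normalization and redundancy removal on top of this raw frequency scoring.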
• Web Mining
These days the web contains a treasure of information about subjects such as persons,
companies, organizations, products, etc. that may be of wide interest. Web mining is an
application of data mining techniques that aims to discover hidden and unknown patterns
from the Web. Web mining is the activity of identifying terms implied in a large document
collection, say C, which denotes a mapping C → p [10].
• Medical
Users exchange information with others about subjects of interest. Everyone wants to
understand specific diseases and to be informed about new therapies. These expert
forums also act as seismographs for medical needs. E-mails, e-consultations, and requests
for medical advice via the internet have been analyzed using quantitative or qualitative
methods.
For Example,
Users actively exchange information with others about subjects of interest or send
requests to web-based expert forums, or so-called the doctor services. Everyone wants to
understand specific diseases (what they have), to be informed about new therapies, ask for
a second opinion before one can decide a treatment. In addition, these expert forums also
represent seismographs for medical and/or psychological requirements, which are
apparently not met by existing health care systems.
E-mails, e-consultations, and requests for medical advice via the Internet have been
manually analyzed using quantitative or qualitative methods. To help the medical experts
and to make full use of the seismograph function of expert forums, it would be helpful to
categorize visitors’ requests automatically. So, specific requests could be directed to the
expert or even answered semi-automatically, thereby providing complete monitoring. By
generating frequently asked questions (FAQs) similar patient requests and their
corresponding answers could be congregated, even before the actual expert responses.
Machine-based analyses could help both the public to better handle the mass of
information and medical experts to give expert feedback. An automatic classification of
amateur requests to medical expert internet forums is a challenging task because these
requests can be very long and unstructured as a result of mixing, for example, personal
experiences with laboratory data.
• Resume Filtering
Big enterprises and headhunters receive thousands of resumes from job applicants every
day. Extracting information from resumes with high precision and recall is not easy.
Automatically extracting this information can be the first step in filtering resumes. Hence,
automating the process of resume selection is an important task.
For Example,
Big enterprises and headhunters receive thousands of resumes from job applicants every
day. Extracting information from resumes with high precision and recall is not an easy task.
In spite of constituting a restricted domain, resumes can be written in a multitude of
formats (e.g. structured tables or plain texts), in different languages (e.g. Japanese and
English) and in different file types (e.g. Plain Text, PDF, Word etc.). Moreover, writing styles
can also be much diversified. In the initial manual scan of the resume, a recruiter looks for
mistakes, educational qualifications, buzzwords, employment history, job titles, frequency
of job changes, and other personal information. Automatically extracting this information
can be the first step in filtering resumes. Hence, automating the process of resume
selection is an important task.
Text mining techniques and text mining tools are rapidly penetrating the industry, right
from academia and healthcare to businesses and social media platforms. This is giving rise
to a number of text mining applications. Here are a few text mining applications used
across the globe today:
A term in a document is a word having semantic meaning. In the term-based method, a
document is analysed on the basis of terms; this has the advantages of efficient
computational performance as well as mature theories for term weighting. These
techniques have emerged over the last couple of decades from the information retrieval
and machine learning communities. Term-based methods suffer from the problems of
polysemy and synonymy [1]: polysemy means a word has multiple meanings, and synonymy
means multiple words have the same meaning. The semantic meaning of many discovered
terms is uncertain for answering what users want; information retrieval has provided many
term-based methods to address this challenge.
A phrase carries more semantic information and is less ambiguous. In the phrase-based
method, a document is analysed on a phrase basis, as phrases are less ambiguous and
more discriminative than individual terms [2]. The likely reasons for the disappointing
performance of phrase-based methods include: 1) phrases have inferior statistical
properties to terms, 2) they have a low frequency of occurrence, and 3) large numbers of
redundant and noisy phrases are present among them.
In the concept-based method, terms are analysed at the sentence and document level.
Text mining techniques are mostly based on statistical analysis of words or phrases.
Statistical analysis of term frequency captures the importance of a word within a
document. However, two terms can have the same frequency in the same document while
one term contributes more to the meaning than the other [7]. The terms that capture the
semantics of the text should be given more importance, so a new concept-based mining
model was introduced.
This model includes three components. The first component analyses the semantic
structure of sentences. The second component constructs a conceptual ontological graph
(COG) to describe the semantic structures, and the last component extracts top concepts,
based on the first two components, to build feature vectors using the standard vector space
model. The concept-based model can effectively discriminate between unimportant terms
and meaningful terms which describe a sentence's meaning [8]. The concept-based model
usually relies upon natural language processing techniques. Feature selection is applied to
the query concepts to optimize the representation and remove noise and ambiguity.
In the pattern taxonomy method, documents are analysed on a pattern basis. Patterns can
be structured into a taxonomy by using the is-a relation. Pattern mining has been
extensively studied in the data mining community for many years. Patterns can be
discovered by data mining techniques like association rule mining, frequent itemset
mining, sequential pattern mining and closed pattern mining [5]. Using discovered
knowledge (patterns) in the field of text mining is difficult and ineffective, because some
useful long patterns with high specificity lack support (the low-frequency problem), and not
all frequent short patterns are useful (the misinterpretation problem), which leads to
ineffective performance. In research work, an effective pattern discovery technique has
been proposed to overcome the low-frequency and misinterpretation problems for text
mining. The pattern-based technique uses two processes, pattern deploying and pattern
evolving [6], which refine the patterns discovered in text documents. Experimental results
show that the pattern-based model performs better than not only other pure data
mining-based methods and the concept-based model, but also term-based models.
Log-frequencies. A common first transformation is to take the logarithm of the raw
frequency counts, i.e. f(wf) = 1 + log(wf), for wf > 0. This transformation will "dampen" the
raw frequencies and the way they affect the results of subsequent computations.
Binary frequencies. Likewise, an even simpler transformation can be used that records
only whether a term is used in a document; i.e.:
f(wf) = 1, for wf > 0
The resulting documents-by-words matrix will contain only 1s and 0s to indicate the
presence or absence of the respective words. Again, this transformation will dampen the
effect of the raw frequency counts on subsequent computations and analyses.
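The binary documents-by-words matrix can be built in a few lines of Python (the three short documents are invented for illustration):

```python
# Binary (presence/absence) document-term matrix: f(wf) = 1 for wf > 0, else 0.
docs = ["software bugs in software", "a lucky guess", "guess the software version"]
vocab = sorted({w for d in docs for w in d.split()})

matrix = [[1 if w in d.split() else 0 for w in vocab] for d in docs]
print(vocab)
for row in matrix:
    print(row)
```

Note that "software" appears twice in the first document but its matrix entry is still 1: only presence or absence is recorded, which is exactly the dampening described above.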
Inverse document frequencies. Another issue that you may want to consider more
carefully and reflect in the indices used in further analyses are the relative document
frequencies (df) of different words. For example, a term such as "guess" may occur
frequently in all documents, while another term such as "software" may only occur in a
few. The reason is that we might make "guesses" in various contexts, regardless of the
specific topic, while "software" is a more semantically focused term that is only likely to
occur in documents that deal with computer software. A common and very useful
transformation that reflects both the specificity of words (document frequencies) as well as
the overall frequencies of their occurrences (word frequencies) is the so-called inverse
document frequency (for the i'th word and j'th document):

idf(i,j) = 0, if wf(i,j) = 0
idf(i,j) = (1 + log(wf(i,j))) * log(N / df(i)), if wf(i,j) >= 1

In this formula (see also formula 15.5 in Manning and Schütze, 2002), N is the total number
of documents, and df(i) is the document frequency for the i'th word (the number of
documents that include this word). Hence, it can be seen that this formula includes both
the dampening of the simple word frequencies via the log function (described above), and
also includes a weighting factor that evaluates to 0 if the word occurs in all
documents (log(N/N) = log(1) = 0), and to the maximum value when a word occurs in only a
single document (log(N/1) = log(N)). It can easily be seen how this transformation creates
indices that reflect both the relative frequencies of occurrence of words and their semantic
specificity over the documents included in the analysis.
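The inverse-document-frequency weighting can be sketched directly from the formula (the three tiny tokenized documents are invented for illustration):

```python
import math

# tf-idf as described above: dampened frequency (1 + log wf) weighted by
# log(N / df_i); a word present in every document gets weight 0.
docs = [["guess", "software", "software"], ["guess", "data"], ["guess", "mining"]]
N = len(docs)

def df(word):
    # Document frequency: in how many documents does the word appear?
    return sum(word in d for d in docs)

def tfidf(word, doc):
    wf = doc.count(word)
    if wf == 0:
        return 0.0
    return (1 + math.log(wf)) * math.log(N / df(word))

print(tfidf("guess", docs[0]))    # occurs in all documents -> weight 0.0
print(tfidf("software", docs[0]))  # rare and repeated -> large weight
```

"guess" occurs in every document, so log(N/df) = log(1) = 0 and it gets no weight, while "software" (frequent in one document, absent elsewhere) is weighted highly.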
4. Demonstrate text mining with at least two different real world examples.
https://expertsystem.com/text-mining-algorithms/
These 10 text mining examples can give you an idea of how this technology is helping
organizations today.
1 – Risk management
No matter the industry, insufficient risk analysis is often a leading cause of failure. This is
especially true in the financial industry where adoption of Risk Management Software
based on text mining technology can dramatically increase the ability to mitigate risk,
enabling complete management of thousands of sources and petabytes of text documents,
and providing the ability to link together information and be able to access the right
information at the right time.
2 – Knowledge management
Not being able to find important information quickly is always a challenge when managing
large volumes of text documents—just ask anyone in the healthcare industry. Here,
organizations are challenged with a tremendous amount of information—decades of
research in genomics and molecular techniques, for example, as well as volumes of clinical
patient data—that could potentially be useful for their largest profit center: new product
development. Here, knowledge management software based on text mining offer a clear
and reliable solution for the “info-glut” problem.
3 – Cybercrime prevention
The anonymous nature of the internet and the many communication features operated
through it contribute to the increased risk of internet-based crimes. Today, text mining
intelligence and anti-crime applications are making internet crime prevention easier for any
enterprise and law enforcement or intelligence agencies.
6 – Contextual Advertising
Digital advertising is a moderately new and growing field of application for text analytics.
Here, companies such as Admantx have made text mining the core engine for contextual
retargeting with great success. Compared to the traditional cookie-based approach,
contextual advertising provides better accuracy and completely preserves the user’s privacy.
7 – Business intelligence
This process is used by large companies to uphold and support decision making. Here, text
mining really makes the difference, enabling the analyst to quickly jump to the answer even
when analysing petabytes of internal and open source data. Applications such as the
Cogito Intelligence Platform (link to CIP) are able to monitor thousands of sources and
analyze large data volumes to extract from them only the relevant content.
8 – Content enrichment
While it’s true that working with text content still requires a bit of human effort, text
analytics techniques make a significant difference when it comes to being able to more
effectively manage large volumes of information. Text mining techniques enrich content,
providing a scalable layer to tag, organize and summarize the available content that makes
it suitable for a variety of purposes.
9 – Spam filtering
E-mail is an effective, fast and reasonably cheap way to communicate, but it comes with a
dark side: spam. Today, spam is a major issue for internet service providers, increasing
their costs for service management and hardware software updating; for users, spam is an
entry point for viruses and impacts productivity. Text mining techniques can be
implemented to improve the effectiveness of statistical-based filtering methods.
10 – Social media data analysis
Today, social media is one of the most prolific sources of unstructured data; organizations
have taken notice. Social media is increasingly being recognized as a valuable source of
market and customer intelligence, and companies are using it to analyse or predict
customer needs and understand the perception of their brand. Text analytics can address
both needs by analysing large volumes of unstructured data, extracting opinions, emotions
and sentiment, and their relations with brands and products.
Algorithms for text analytics incorporate a variety of techniques such as text classification,
categorization, and clustering. All of them aim to uncover hidden relationships, trends, and
patterns which are a solid base for business decision-making.
1. K-Means Clustering
K-means clustering is a popular data analysis algorithm that aims to find groups in a given
data set. The number of groups is represented by a variable called K.
It is one of the simplest unsupervised learning algorithms that solve clustering problems.
The key idea is to define k centroids which are used to label new data.
K-means clustering is a classical method for text categorization. It is widely used for
document classification, building clusters on social media text data, clustering search
keywords, etc.
Using k-means clustering for text data requires doing some text-to-numeric transformation
of our content data. If you work with R, you might know that it has various packages to
simplify the process.
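The two k-means steps (assign each point to its nearest centroid, then move each centroid to the mean of its cluster) can be sketched in plain Python. The four "documents" here are invented toy count vectors, each reduced to two hypothetical features (count of a sports word, count of a politics word):

```python
# A minimal k-means sketch on tiny bag-of-words count vectors.
docs = [(3, 0), (4, 1), (0, 3), (1, 4)]  # (sport-word count, politics-word count)

def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else ctr
                     for cl, ctr in zip(clusters, centroids)]
    return clusters

clusters = kmeans(docs, centroids=[(3, 0), (0, 3)])
print(clusters)  # the two "sport" documents end up in one cluster, "politics" in the other
```

For real text, each document would first be converted to a tf-idf vector (as shown earlier), and libraries such as scikit-learn in Python or the relevant R packages handle that transformation and the clustering at scale.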
2. Naive Bayes
Naive Bayes is considered one of the most effective data mining algorithms. It is a simple
probabilistic algorithm for the classification tasks.
The Naive Bayes Classifier is based on the so-called Bayesian theorem and gives great and
reliable results when it is used for text data analytics.
Naive Bayes classifier is not a single algorithm but a family of algorithms which assume that
values of the features used in the classification are independent.
It is very easy to code with the standard programming languages such as PHP, JAVA, C#,
etc.
As one of the best text classification techniques, Naive Bayes has a variety of applications in
email spam detection, document categorization, email sorting, age/gender identification,
language detection and sentiment analysis.
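A minimal multinomial Naive Bayes spam detector, with Laplace smoothing, can be written from the Bayes rule directly (the four training subject lines are invented for illustration):

```python
import math
from collections import Counter

# Toy training data: subject lines labelled spam/ham, invented for illustration.
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("project meeting tomorrow", "ham"),
]

counts = {"spam": Counter(), "ham": Counter()}   # word counts per class
class_docs = Counter()                           # documents per class
for text, label in train:
    counts[label].update(text.split())
    class_docs[label] += 1
vocab = {w for c in counts.values() for w in c}

def predict(text):
    def log_posterior(label):
        total = sum(counts[label].values())
        prior = math.log(class_docs[label] / len(train))
        # "Naive" independence assumption: sum the log-likelihood of each word,
        # with add-one (Laplace) smoothing for unseen words.
        return prior + sum(
            math.log((counts[label][w] + 1) / (total + len(vocab)))
            for w in text.split())
    return max(("spam", "ham"), key=log_posterior)

print(predict("free money"))        # -> spam
print(predict("meeting tomorrow"))  # -> ham
```

Despite the unrealistic independence assumption, this simple model is a strong baseline for spam detection and document categorization.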
3. K-Nearest Neighbor (KNN)
K-Nearest Neighbor (KNN) is also one of the most used text mining algorithms because of
its simplicity and efficiency.
KNN is a non-parametric method that we use for classification.
In a few words, KNN is a simple algorithm that stores all existing data objects and classifies
the new data objects based on a similarity measure.
In the text analysis domain, it is used to measure the similarity between a test document
and the k nearest training documents in order to determine the test document's category.
One of the biggest text mining applications of KNN is in “Concept Search” (i.e. searching for
semantically similar documents) – a feature in software tools, which is used for helping
businesses find their emails, business correspondence, reports, contacts, etc.
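The "store everything, compare by similarity" idea behind KNN can be sketched with cosine similarity over word-count vectors (the four labelled training sentences are invented for illustration):

```python
import math
from collections import Counter

# Toy labelled documents, invented for illustration.
train = [
    ("stocks rose on market optimism", "finance"),
    ("the market rallied as stocks gained", "finance"),
    ("the team won the final match", "sport"),
    ("a late goal won the match", "sport"),
]

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def knn_predict(text, k=3):
    vec = Counter(text.split())
    # Keep the k most similar training documents, vote by majority label.
    neighbours = sorted(train, key=lambda t: cosine(vec, Counter(t[0].split())),
                        reverse=True)[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

print(knn_predict("stocks gained in the market"))  # -> finance
```

This same document-to-document similarity measure is what powers "concept search" style features: the nearest neighbours of a query document are its semantically closest matches.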
4. Support Vector Machine (SVM)
The Support Vector Machine (SVM) is one of the most accurate classification text mining algorithms.
Practically, SVM is a supervised machine learning algorithm mainly used for classification
problems and outlier detection. It can also be used for regression challenges.
SVM separates two classes of data. The algorithm draws lines (known as hyperplanes) that
separate the groups according to patterns in the data.
The goal of SVM is to create this hyperplane; the hyperplane with the maximum margin
from both groups is best. In the real world, SVM can model complex problems such as text
and image classification, hand-writing recognition, face detection, and bio sequence
analysis.
When it comes to text mining, SVM is widely used for text classification activities such as
detecting spam, sentiment analysis, document classification into categories as news,
emails, articles, web pages, etc.
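A linear SVM can be trained by sub-gradient descent on the hinge loss (the Pegasos approach); in practice one would use a library such as scikit-learn, but the core idea fits in a short sketch. The data here are invented: each point is a pair of hypothetical features (spam-word count, ham-word count) with labels +1 = spam, -1 = ham:

```python
# Minimal linear SVM via sub-gradient descent on the hinge loss.
data = [((3.0, 0.0), 1), ((2.0, 1.0), 1), ((0.0, 3.0), -1), ((1.0, 2.0), -1)]

def train_svm(data, lam=0.01, lr=0.1, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:  # point inside the margin: hinge loss is active
                w = [w[0] + lr * (y * x[0] - lam * w[0]),
                     w[1] + lr * (y * x[1] - lam * w[1])]
                b += lr * y
            else:           # only the regularizer pulls w toward 0
                w = [w[0] - lr * lam * w[0], w[1] - lr * lam * w[1]]
    return w, b

w, b = train_svm(data)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1
print([predict(x) for x, _ in data])  # -> [1, 1, -1, -1]
```

Maximizing the margin (the `margin < 1` condition combined with the regularizer) is what distinguishes SVM from a plain linear classifier and gives it its robustness on text data.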
5. Decision Tree
Decision Tree algorithm is a well-known machine learning technique for data mining that
creates classification or regression models in the shape of a tree structure.
The structure includes a root node, branches, and leaf nodes. Each internal node indicates
a test on an attribute and each branch indicates the result of a test. Finally, each leaf node
indicates a class label.
Decision Tree algorithm is nonlinear and simple.
As a text mining algorithm, the Decision Tree has many applications, such as analyzing the
text that comes from customer relationship management. It is also used to make medical
predictions based on medical history documents, etc.
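The core of decision tree learning is picking the attribute test that best separates the labels. A one-level tree (a "stump") makes this concrete: choose the single word whose presence/absence gives the highest information gain. The labelled subject lines are invented for illustration:

```python
import math
from collections import Counter

# Toy training data: spam vs. ham subject lines, invented for illustration.
train = [
    ("win a free prize", "spam"),
    ("free money inside", "spam"),
    ("team meeting notes", "ham"),
    ("notes on the project", "ham"),
]

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    if not labels:
        return 0.0
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(train):
    """Find the word whose presence test yields the highest information gain."""
    labels = [y for _, y in train]
    base = entropy(labels)
    best, best_gain = None, -1.0
    for word in {w for text, _ in train for w in text.split()}:
        yes = [y for text, y in train if word in text.split()]
        no = [y for text, y in train if word not in text.split()]
        gain = base - (len(yes) / len(train)) * entropy(yes) \
                    - (len(no) / len(train)) * entropy(no)
        if gain > best_gain:
            best, best_gain = word, gain
    return best

print(best_split(train), "is the best splitting word")
```

A full decision tree simply applies this split selection recursively to each branch until the leaves are pure or a depth limit is reached.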
7. Neural Networks
Neural networks are nonlinear models which represent a metaphor for the functioning of
the human brain.
Although neural networks have a complex structure and long training times, they have
their place among data analysis and text mining algorithms.
In the domain of text analytics, Neural network can be used for grouping similar patterns,
for classifying patterns, and etc.
The application of neural networks is important in data mining because of characteristics
such as self-organizing adaptiveness, parallel processing, fault tolerance,
and robustness.
When it comes to text data analysis, neural networks are popular in the area of medical
research documents, finance, and marketing content mining.
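The simplest neural network is a single neuron (a perceptron), and it already shows the error-driven weight updates that deep networks scale up. The data here are invented: each document is reduced to two hypothetical features, (positive-word count, negative-word count), labelled 1 = positive sentiment, 0 = negative:

```python
# A single-neuron (perceptron) classifier: the simplest neural network.
data = [((3, 0), 1), ((2, 1), 1), ((0, 3), 0), ((1, 2), 0)]

def train_perceptron(data, lr=0.1, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out  # error-driven weight update (the delta rule)
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

w, b = train_perceptron(data)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
print([predict(x) for x, _ in data])  # -> [1, 1, 0, 0]
```

Modern text models stack many such units in layers and replace the hard threshold with differentiable activations, but the learn-from-error principle is the same.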
8. Association Rules
Association rules are if/then statements that aim to uncover relationships between
seemingly unrelated data in a given database.
They can find relationships between the items which are regularly used together.
Popular applications of association rules are basket data analysis, cross-marketing,
clustering, classification, catalog design, etc. For example, if the customer buys eggs then
he may also buy milk.
Using this approach in the area of text data mining can help users gain knowledge from
collections of different types of content, such as web documents (decreasing the time
needed to read all those documents).
Another example is the use of association rules to identify positive or negative
associations between symptoms, medications, and laboratory results in medical text
reports.
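The two quantities behind every association rule, support and confidence, can be computed directly from a set of transactions (the basket data below is invented for illustration):

```python
# Support and confidence for "if X then Y" rules over toy transactions.
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"milk", "bread"},
    {"eggs", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """P(Y | X): how often Y appears in the transactions that contain X."""
    return support(x | y) / support(x)

print(support({"eggs", "milk"}))       # -> 0.5
print(confidence({"eggs"}, {"milk"}))  # "if eggs then milk"
```

Algorithms such as Apriori use exactly these measures, pruning itemsets whose support falls below a threshold before generating rules; in text mining the "transactions" are documents and the "items" are terms.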
9. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation is one of the techniques which currently is used in topic text
modelling.
In fact, latent Dirichlet Allocation (LDA) is a generative probabilistic model designed for
collections of discrete data (to know what is discrete data see our discrete vs continuous
data post).
To put it another way, LDA is a method that automatically finds the topics that given
documents contain.
https://www.wave-access.com/public_en/blog/2019/february/04/text-mining-what-it-is-and-
how-it-works-for-business.aspx
https://www.sciencedirect.com/topics/mathematics/text-mining-algorithm
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-
6a6e67336aa1
https://www.geeksforgeeks.org/k-means-clustering-introduction/
7. Demonstrate these algorithms using an appropriate programming language or text mining
tool.
Text Mining Solver Example
https://www.solver.com/text-mining-example
Top 63 Software for Text Analysis, Text Mining, Text Analytics
https://www.predictiveanalyticstoday.com/top-software-for-text-analysis-text-mining-text-
analytics/
Basic text mining tools Read more at wave-access.com
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2607-x
https://www.educba.com/text-mining/