You are on page 1of 6

SIMAD UNIVERSITY

Chapter 7: Text and Web Mining


Define Text mining
Text mining: is a semi-automated process of extracting knowledge from large amount of
unstructured data sources.
Data Mining versus Text Mining
Data mining is a process of identifying valid, novel, potentially useful and ultimately
understandable patterns in the data stored in structured databases, where data are organized in
records structured by categorical, ordinal or continuous variables. Text mining is the same as
data mining in that it has same purpose and uses the same processes, but text mining the input to
the process is a collection of unstructured or less structured data files such as word documents,
PDF, text excerpts, xml files and so on.
Text mining can be thought of as a process that starts first imposing structure on text based data
sources second extracting relevant information and knowledge from this structured text-based
data using data mining techniques and tools.
Benefits of Text Mining
The benefits of text mining are obvious in the areas where very large amounts of textual data are
being generated, such as law (court orders), academic research (research articles), finance
(quarterly reports), medicine (discharge summaries), biology (molecular interactions),
technology and marketing. Also very important in Electronic communization records (e.g., Email):
Spam filtering, Email prioritization and categorization And Automatic response generation
Text Mining Application Area
1. Information extraction: identification of key phrases and relationships within text by
looking for predefined sequences in text via pattern matching.
2. Topic tracking: based on a user profile and documents that a user views, text mining can
predict other documents of interest to the user.
3. Summarization: summarizing documents to save time on the part of the reader.
4. Categorization: identifying the main themes of a document and then placing the
documentation into a predefined set of categories based on those themes.
5. Clustering: grouping similar documents without having a predefined set of categories.

DSS Chapter 7 Class: BIT14B Page 1


SIMAD UNIVERSITY

6. Concept linking: connects related documents by identifying their shared concepts and by
donning so, help users find information that they perhaps would not have found using
traditional search methods.
7. Question answering: finding the best answer to a given question through knowledge -
driven pattern matching.
Text Mining Terminology
1. Unstructured: means text mining does not have predefined format and is stored in the
form of textual documents.
2. Corpus: is large and structured set of texts prepared for the purpose of conducting
knowledge discovery.
3. Terms: is a single word or multiword phrase extracted directly from the corpus of a
specific domain by means of natural language processing (NLP) methods.
4. Concepts: are features generated from a collection of documents by means of manual,
statistical, rule-based, or hybrid categorization methodology.
5. Stemming: is the process of reducing inflected words to their stem form.
6. Stop words: are words that are filtered out prior to or after processing of natural
language data.
7. Synonyms and polysemes: Synonyms are syntactically different words with identical or
at least similar meaning. polysemes syntactically identical words with different meaning.
8. Tokenizing: a token is a categorized block of text in a sentence.
9. Term dictionary: a collection of terms specific to a narrow field that can be used to
restrict the extracted terms within a corpus.
10. Word frequency: the number of times a word is found in a specific document.
11. Part of speech tagging: the process of making up the words in text as corresponding to a
particular part of speech based on a word's definition and context in which it is used.
12. Morphology: a Brach of the field of linguistics and a part of NLP that studies the internal
structure of words.
What is a patent?
Patent is an exclusive rights granted by a country to an inventor for a limited period of time in
exchange for a disclosure of an invention.
Natural Language Processing (NLP)

DSS Chapter 7 Class: BIT14B Page 2


SIMAD UNIVERSITY

NLP: is important component of text mining and is a subfield of artificial intelligence and
computational linguistics. It studies the problem of understanding the natural human language.
Challenges in NLP
 Part-of-speech tagging: it is difficult to mark up terms in a text as corresponding to
a particular part of speech, because the part of speech depends not only on the
definition of the term but also on the context within which it is used.
 Text segmentation: some written languages like Chinese, Japanese and Thai, do not
have single word boundaries.
 Word sense disambiguation: many words have more then one meaning. Selecting
the meaning that makes the most sense can only be accomplished by taking into
account the context within which word is used.
 Syntax ambiguity: the grammar for natural language is ambiguous.
 Imperfect or irregular input: foreign or regional accents and vocal impediments in
speech and typographical or grammatical errors in texts make the processing of the
language an even more difficult task.
 Speech acts: a sentence can often considered an action by the speaker. the sentence
structure alone may not contain enough information to define this action.
Dream of AI community
The dream of AI community is to have algorithms that are capable of automatically reading and
obtaining knowledge from text
WordNet
WordNet: laboriously hand-coded database of English words, their definitions, sets of synonyms,
and various semantic relations between synonym sets, A major resource for NLP and Need
automation to be completed.
Sentiment Analysis
Sentiment Analysis: A technique used to detect favorable and unfavorable opinions toward
specific products and services
NLP Task Categories
 Information retrieval: is the science of searching for relevant documents, finding
specific information within them generating metadata as to their contents.

DSS Chapter 7 Class: BIT14B Page 3


SIMAD UNIVERSITY

 Information extraction: is type of information retrieval whose goal is to automatically


extract structured information, such as categorized and contextually and semantically well-
defined data from a certain domain, using unstructured machine readable documents.
 Named-entity recognition: also known as entity identification and entity extraction, this
subtask of information extraction seeks to locate and classify atomic elements in text into
predefined categories, such as the names of persons, organizations, locations, expressions
of times, quantities, monetary values, percentages and so on.
 Question answering: the task answering automatically questions posed in natural
language.
 Automatic summarization: the creation of shortened version of a textual document by a
computer program that contains the most important points of the original document.
 Natural language generation and understanding: systems convert information from
computer databases into readable human language.
 Machine translation: the automatic translation of one human language to another.
 Foreign language reading and writing: Computer program that assists a nonnative
language speaker to read foreign language with correct pronunciation accents on different
parts of the words.
 Speech recognition: converts spoken words to machine readable input.
 Text proofing
 Optical character recognition: the automatic translation of images of handwritten,
typewritten or printed text.
Text Mining Applications
 Marketing applications: Enables better CRM
 Security applications: ECHELON, OASIS, Deception detection (…)
 Medicine and biology: Literature-based gene identification (…)
 Academic applications: Research stream analysis
Text Mining Process
 Step 1: Establish the corpus
 Collect all relevant unstructured data
 Digitize, standardize the collection
 Place the collection in a common place

DSS Chapter 7 Class: BIT14B Page 4


SIMAD UNIVERSITY

 Step 2: Create the Term–by–Document Matrix (TDM): Should all terms be included?,
Stop words, include words, Synonyms, homonyms and Stemming
 Step 3: Extract patterns/knowledge
 Classification: the most common knowledge discovery topic in analyzing
complex data sources of certain objects. The task is to classify a given data
instance into a predetermined set of categories.
 Clustering (natural groupings of text): is an unsupervised process whereby objects
are classified into groups called clusters. Compared to categorization, where a
collection of pre-classified training examples is used to develop a model based on
the descriptive features.
 Improve search recall
 Improve search precision
 Scatter/gather
 Query-specific clustering
 Association
 Trend Analysis

Text Mining Tools


1. Commercial Software Tools include SPSS PASW Text Miner, SAS Enterprise Miner,
Statistical Data Miner and Clear Forest.
2. Free Software Tools: includes RapidMiner, GATE and Spy-EM
Web Mining Overview
 Web is the largest repository of data. Data is in HTML, XML, and text format.

Challenges (of processing Web data) of Web Mining


1. The Web is too big for effective data mining
2. The Web is too complex
3. The Web is too dynamic
4. The Web is not specific to a domain

DSS Chapter 7 Class: BIT14B Page 5


SIMAD UNIVERSITY

5. The Web has everything


Define Web Mining
Web mining (or Web data mining) is the process of discovering intrinsic relationships from
Web data (textual, linkage, or usage)

Three different branches of Web mining

Web Mining

Web Content Mining Web Structure Mining Web Usage Mining


Source: unstructured Source: the unified Source: the detailed
textual content of the resource locator (URL) description of a Web
Web pages (usually in links contained in the site’s visits (sequence
HTML format) Web pages of clicks by sessions)

DSS Chapter 7 Class: BIT14B Page 6

You might also like