
IIM TRICHY

Submitted by: Abhinandan Gupta Amey Patole Dibya Darshan Shailesh Yadav Subhajit Das

Text Mining & its Business Application


Term Paper DMBD

2013

Executive Summary
Text mining is a relatively new mining technique that has excited researchers about its implications for quite some time. It has shown promising results as a powerful tool for resolving information overload, drawing on techniques such as information retrieval (IR), natural language processing (NLP), and knowledge-based learning. Text mining begins with the preprocessing of a document collection (information extraction, text categorization, etc.), proceeds to intermediate representations, and finally applies techniques to analyze these representations (such as association analysis, clustering, distribution analysis, and trend analysis). In this document we look at the general architecture of text mining and the categorization tools typically used by text mining systems. This serves as an aid to understanding the implications and applications of text mining techniques in real-life scenarios. We then take examples from various industries to show how these different techniques are used in practice.

Introduction
A basic definition of text mining can be summed up as a knowledge-driven process in which a user interacts with a collection of documents over a period of time with the help of suitable analysis tools. In a way similar to data mining, text mining seeks to derive important information from data sources by identifying and exploring interesting patterns. In this case, however, the data sources are document collections, and the essential patterns are found not in formalized database records but in the unstructured textual data of the documents themselves. Text mining is inspired by the seminal research on data mining, so it is no surprise that text mining and data mining systems share many similarities in their high-level architecture. For example, both types of system depend on pattern-discovery algorithms, preprocessing routines, and presentation-layer elements, including visualization tools that enhance the browsing of answer sets. In addition, text mining employs in its core knowledge discovery operations various types of specific patterns that were previously introduced and vetted in data mining research. In data mining, most preprocessing is concentrated on two important tasks, since the data is assumed to already be stored in a structured format: (1) scrubbing and normalizing data, and (2) creating extensive numbers of table joins. Text mining systems are quite different: their preprocessing operations focus on the identification and extraction of representative features from natural language documents. The responsibility of transforming unstructured data stored in document collections into a more explicitly structured intermediate format rests with these preprocessing operations, a concern not shared by most data mining systems. Moreover, because of the centrality of natural language text to its mission, text mining also draws on advances in other computer science disciplines concerned with the handling of natural language, most notably information extraction, information retrieval, and corpus-based computational linguistics. A central element of text mining is its focus on the document collection. Simply put, a document collection is any grouping of text-based documents, and most text mining solutions are geared toward finding patterns across very large sets of documents. The number of documents in such collections can range from many thousands to tens of millions. A collection is either static, in which case the initial complement of documents remains unchanged, or dynamic, a term applied to collections characterized by the inclusion of new or updated documents over time. Extremely large collections, as well as collections with very high rates of document change, can pose performance challenges for various components of a text mining system. An example of a typical real-world document collection appropriate as starting input for text mining is PubMed, the National Library of Medicine's online repository of citation-related information for biomedical research papers.
PubMed has attracted attention from computer scientists interested in text mining techniques because this online service contains text-based document abstracts covering the life sciences, with more than 12 million research papers estimated to be indexed in the area. PubMed represents the largest online collection of biomedical research papers published in the English language, and it also houses information on a substantial range of publications in other languages. The publication dates for the main body of PubMed's collected papers range from 1966 to the present. The collection is dynamic and growing: by one estimate, more than 40,000 new biomedical abstracts are added every month. Even subsets of PubMed's data can constitute large document collections for particular text mining applications. For example, a relatively recent PubMed search for research papers containing the words protein or gene returned more than 2,800,000 documents, more than 66 percent of which were published recently. Even a narrowly defined search for abstracts containing the phrase epidermal growth factor receptor returned more than 10,000 documents. The sheer size of PubMed's collection makes manual attempts to correlate data across documents or recognize trends at best extremely labor intensive and at worst nearly impossible. Automatic methods for finding and exploring inter-document relationships greatly enhance the speed and efficiency of research activities.

Document
In simple terms, a document can be defined as a unit of discrete textual data within a collection that usually, but not necessarily, correlates with some real-world document such as a legal memorandum, e-mail, business report, research paper, press release, manuscript, article, or news story. A document can be a member of several different document collections, or of different subsets of the same collection, and can exist in multiple collections at the same time. For instance, a document concerning Microsoft's antitrust proceedings might exist in entirely different collections related to legal affairs, current affairs, antitrust-related legal affairs, and software company news.

Weakly Structured and Semi structured Documents


Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure, such as most scientific research papers, legal memoranda, business reports, and news stories, are generally referred to as free-format or weakly structured documents. By contrast, documents with extensive and consistent format elements from which field-type metadata can be easily extracted, such as some e-mail messages, HTML Web pages, PDF files, and word-processing files with significant document templating or style-sheet constraints, are usually described as semistructured documents.

Document Features
The preprocessing operations that support text mining attempt to leverage many different elements contained in a natural language document in order to transform it from an irregular and implicitly structured representation into an explicitly structured one. Given the potentially large number of words, phrases, sentences, typographical elements, and layout artifacts that even a short document may contain, not to mention the potentially vast number of different senses each of these elements may have in different contexts and combinations, an essential task for most text mining systems is the selection of a simplified subset of document features that can be used to represent a particular document as a whole.

For even the most modest document collections, the number of word-level features required to represent the documents can be exceedingly large. For example, in a relatively small collection of 15,000 documents culled from Reuters news feeds, more than 25,000 nontrivial word stems could be identified. Another characteristic of natural language documents is what might be described as feature sparsity: only a small percentage of all possible features for a document collection as a whole appears in any single document, and thus when a document is represented as a binary vector of features, nearly all values of the vector are zero.
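The sparsity point above can be made concrete with a small sketch (illustrative only; the documents, vocabulary, and `binary_vector` helper are invented for this example):

```python
def binary_vector(doc_tokens, vocabulary):
    """Represent a document as a binary vector over the collection vocabulary."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]

docs = [
    "the gene encodes a growth factor receptor".split(),
    "the court reviewed the antitrust case".split(),
    "protein expression profiling with microarrays".split(),
]

# Vocabulary = all distinct words across the collection, sorted for a stable order
vocabulary = sorted({w for d in docs for w in d})
vec = binary_vector(docs[0], vocabulary)

# Most entries are zero: each document uses only a small slice of the vocabulary
sparsity = vec.count(0) / len(vec)
```

Even in this three-document toy collection, over half of the first document's vector entries are zero; with realistic collections the fraction approaches one.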


Commonly Used Document Features: Characters, Words, Terms, and Concepts


Because text mining algorithms operate on feature-based representations of documents rather than on the underlying documents themselves, there is often a trade-off between two important goals. The first goal is to achieve the correct calibration of the volume and semantic level of features to portray the meaning of a document accurately, which inclines text mining preprocessing operations toward selecting or extracting relatively more features to represent documents. The second goal is to identify features in a way that is most computationally efficient and practical for pattern discovery, which emphasizes the streamlining of representative feature sets; such streamlining is sometimes supported by the validation, normalization, or cross-referencing of features against controlled vocabularies or external knowledge sources such as dictionaries, thesauri, ontologies, or knowledge bases, all of which help generate smaller sets of more semantically rich features. Although many potential features can be employed to represent documents, the following four types are most commonly used:

Characters: The individual component-level letters, numerals, special characters, and spaces are the building blocks of higher-level semantic features such as words, terms, and concepts. A character-level representation can include the full set of characters in a document or some filtered subset. Character-based representations without positional information (i.e., bag-of-characters approaches) are often of very limited utility in text mining applications. Character-based representations that include some level of positional information (e.g., bigrams or trigrams) are somewhat more useful and common. In general, however, character-based representations can be unwieldy for some types of text processing because the feature space for a document is largely unoptimized. On the other hand, this feature space can in many ways be viewed as the most complete of any representation of a real-world text document.
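A minimal sketch of character n-gram extraction as described above (the `char_ngrams` helper is hypothetical, not taken from any particular text mining toolkit):

```python
def char_ngrams(text, n):
    """Extract overlapping character n-grams, preserving positional adjacency."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = char_ngrams("mining", 2)   # ['mi', 'in', 'ni', 'in', 'ng']
trigrams = char_ngrams("mining", 3)  # ['min', 'ini', 'nin', 'ing']
```

Unlike a bag of characters, these n-grams retain local ordering, which is what makes them useful for tasks such as language identification or fuzzy matching.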

Words: Specific words selected directly from a native document sit at what might be described as the basic level of semantic richness. For this reason, word-level features are sometimes said to exist in the native feature space of a document. In general, a single word-level feature should equate with, or have the value of, no more than one linguistic token; phrases, multiword expressions, or even multiword hyphenates do not constitute single word-level features.

Terms: Terms are single words and multiword phrases selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be representative. Hence, a term-based representation of a document is necessarily composed of a subset of the terms in that document. Several term-extraction methodologies can convert the raw text of a native document into a series of normalized terms, that is, sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags. Sometimes an external lexicon is also used to provide a controlled vocabulary for term normalization. Term-extraction methodologies employ various approaches for generating and filtering an abbreviated list of the most meaningful candidate terms from among a set of normalized terms. This culling process results in a smaller but relatively more semantically rich document representation than that found in word-level representations.
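The term-extraction pipeline described above can be sketched as follows. This is a deliberately crude stand-in: the `normalize` function merely strips a plural 's' in place of true lemmatization, and the stopword list and candidate-term rules are invented for illustration:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "are", "for", "and", "to", "in"}

def normalize(token):
    """Crude stand-in for lemmatization: lowercase and strip a plural 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def extract_terms(text, min_count=1):
    tokens = [normalize(t) for t in re.findall(r"[a-zA-Z]+", text)]
    content = [t for t in tokens if t not in STOPWORDS]
    # Candidate terms: unigrams plus adjacent-word bigrams (after stopword removal)
    bigrams = [" ".join(p) for p in zip(content, content[1:])]
    counts = Counter(content + bigrams)
    return {term: c for term, c in counts.items() if c >= min_count}

terms = extract_terms("Growth factors and growth factor receptors are proteins.")
```

Note how "growth factors" and "growth factor" collapse to the same normalized term, illustrating why the term-level representation is smaller but semantically richer than the raw word list.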

Concepts: Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies. Concept-level features can be generated manually but are now more commonly extracted from documents using complex preprocessing routines that identify single words, multiword expressions, whole clauses, or even larger syntactical units, which are then related to specific concept identifiers. For instance, a document collection that includes reviews of sports cars may not actually contain the specific word automotive or the specific phrase test drives, but the concepts automotive and test drives might nevertheless be found among the set of concepts used to identify and represent the collection.
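A rule-based concept mapper along these lines might be sketched as below; the trigger phrases and concept identifiers are entirely hypothetical:

```python
# Hypothetical concept lexicon: surface expressions -> concept identifiers
CONCEPT_LEXICON = {
    "roadster": "AUTOMOTIVE",
    "sports car": "AUTOMOTIVE",
    "convertible": "AUTOMOTIVE",
    "test drive": "TEST_DRIVES",
    "took it for a spin": "TEST_DRIVES",
}

def map_concepts(text):
    """Return the concept identifiers whose trigger phrases occur in the text."""
    text = text.lower()
    return {concept for phrase, concept in CONCEPT_LEXICON.items() if phrase in text}

concepts = map_concepts("We took it for a spin in the new roadster.")
```

Neither the word automotive nor the phrase test drive appears literally in the sentence, yet both concepts fire, which is exactly the behaviour the paragraph above describes.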

Domains and Background Knowledge


In text mining systems, concepts belong not only to the descriptive attributes of a particular document but generally also to domains. With respect to text mining, a domain has come to be loosely defined as a specialized area of interest for which dedicated ontologies, lexicons, and taxonomies of information may be developed. Domains can cover very broad areas of subject matter (e.g., biology) or more narrowly defined specialisms (e.g., genomics or proteomics). Other noteworthy domains for text mining applications include financial services (with significant subdomains such as corporate finance, securities trading, and commodities), world affairs, international law, counterterrorism studies, patent research, and materials science.

Text Mining Preprocessing Techniques


Preparing raw unstructured data for text mining requires different preprocessing techniques from those traditionally used to prepare structured data sources for classic data mining operations. A large variety of text mining preprocessing techniques exist. All in some way attempt to structure documents and, by extension, document collections, and different techniques are quite commonly used in tandem to create structured document representations from raw textual data. Each preprocessing technique starts with a partially structured document and enriches the structure by refining the present features and adding new ones. In the end, the most advanced and meaning-representing features are used for the text mining, whereas the rest are discarded. The nature of the input representation and the output features is the principal difference between the preprocessing techniques. There are natural language processing (NLP) techniques, which use and produce domain-independent linguistic features, and there are text categorization and information extraction (IE) techniques, which deal directly with domain-specific knowledge. General-purpose NLP tasks process text documents using general knowledge about natural language; these tasks may include tokenization, morphological analysis, part-of-speech (POS) tagging, and shallow or deep syntactic parsing. Domain-related knowledge, however, can often enhance the performance of general-purpose NLP tasks and is used at different levels of processing (Value and benefits of text mining | Jisc, n.d.). The final stages of document structuring create representations that are meaningful either for later (or more sophisticated) processing phases or for direct interaction with the text mining system user. Text mining techniques normally expect documents to be represented as sets of features, which are treated as structureless atomic entities, possibly organized into a taxonomy (an is-a hierarchy).
The nature of the features sharply distinguishes the two main techniques: text categorization and information extraction (IE). Both techniques are also popularly referred to as tagging (because of the tag-formatted structures they introduce in a processed document), and they enable one to obtain formal, structured representations of documents. Text categorization and IE enable users to move from a machine-readable representation of the documents to a machine-understandable form. This view of the tagging approach is depicted below.

Categorization
Three common TC applications are text indexing, document sorting and text filtering, and Web page categorization. These are only a small set of possible applications, but they demonstrate the diversity of the domain and the variety of TC subcases. Text filtering can be seen as document sorting with only two bins: relevant and irrelevant documents. Examples of text filtering abound. A sports-related online magazine should filter out all non-sport stories it receives from the news feed. An e-mail client should filter away spam. A personalized ad filtering system should block any ads that are uninteresting to the particular user.

In personalized filtering systems it is common for the user to provide feedback by marking received documents as relevant or irrelevant. Because it is usually computationally unfeasible to fully retrain the system after each document, adaptive learning techniques are required.
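One way to sketch such an adaptive filter is as an incrementally updated naive-Bayes-style word model; this is one possible technique among many, and the class and its scoring scheme are our own illustrative construction:

```python
import math
from collections import defaultdict

class AdaptiveFilter:
    """Incrementally trained relevance filter: updates per-word counts from
    user feedback instead of retraining on the full collection."""
    def __init__(self):
        self.counts = {"relevant": defaultdict(int), "irrelevant": defaultdict(int)}
        self.totals = {"relevant": 0, "irrelevant": 0}

    def feedback(self, tokens, label):
        """User marks a document as relevant or irrelevant; counts update in O(doc)."""
        for t in tokens:
            self.counts[label][t] += 1
        self.totals[label] += len(tokens)

    def score(self, tokens):
        """Log-likelihood ratio of relevant vs. irrelevant (add-one smoothing);
        positive means the document looks relevant."""
        s = 0.0
        for t in tokens:
            p_rel = (self.counts["relevant"][t] + 1) / (self.totals["relevant"] + 2)
            p_irr = (self.counts["irrelevant"][t] + 1) / (self.totals["irrelevant"] + 2)
            s += math.log(p_rel / p_irr)
        return s

f = AdaptiveFilter()
f.feedback("goal match striker".split(), "relevant")        # sports story kept
f.feedback("election budget minister".split(), "irrelevant")
```

Each feedback event costs time proportional to one document, which is what makes the scheme adaptive rather than requiring full retraining.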

Web Page Categorization


Web page categorization is the automatic classification of Web pages under the hierarchical catalogues posted by popular Internet portals such as Yahoo. Such catalogues are very useful for direct browsing and for restricting query-based search to pages belonging to a particular topic.

MACHINE LEARNING APPROACH TO TC


In ML terminology, the learning process is an instance of supervised learning because it is guided by applying the known true category assignment function to the training set. The unsupervised version of the classification task, called clustering, is described later.
Decision Tree Classiers

Many categorization methods share a certain drawback: the classifiers cannot be easily understood by humans. Symbolic classifiers, of which decision tree classifiers are the most prominent example, do not suffer from this problem. A decision tree (DT) classifier is a tree in which the internal nodes are labeled by features, the edges leaving a node are labeled by tests on the feature's weight, and the leaves are labeled by categories. The performance of a DT classifier is mixed and generally inferior to that of the top-ranking classifiers; thus it is rarely used alone in tasks for which human understanding of the classifier is not essential. DT classifiers, however, are often used as a baseline for comparison with other classifiers and as members of classifier committees.
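A toy decision tree classifier over feature weights, matching the node/edge/leaf description above, might look like this (the tree, thresholds, and categories are invented):

```python
def classify(doc_weights, node):
    """Walk a decision tree whose internal nodes test a feature's weight."""
    while isinstance(node, tuple):
        feature, threshold, low_branch, high_branch = node
        node = high_branch if doc_weights.get(feature, 0.0) > threshold else low_branch
    return node  # a leaf: the category label

# Hand-built illustrative tree: test the weight of 'wheat' first, then 'export'
tree = ("wheat", 0.0,
        ("export", 0.5, "other", "trade"),  # taken when 'wheat' is absent
        "grain")                            # taken when 'wheat' is present

category = classify({"wheat": 0.8}, tree)  # walks to the 'grain' leaf
```

The tree is trivially human-readable: each path from root to leaf is an if-then rule, which is exactly the interpretability advantage the paragraph above describes.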
Regression Methods

Regression is a technique for approximating a real-valued function from its known values on a set of points. It can be applied to TC, which is the problem of approximating the category assignment function. For this method to work, the assignment function must be considered a member of a suitable family of continuous real-valued functions; the regression techniques can then be applied to generate the (real-valued) classifier.
Neural Networks

A neural network (NN) can be built to perform text categorization. Usually, the input nodes of the network receive the feature values, the output nodes produce the categorization status values, and the link weights represent dependence relations. To classify a document, its feature weights are loaded into the input nodes; the activation of the nodes is propagated forward through the network, and the final values on the output nodes determine the categorization decisions. Neural networks are trained by backpropagation: the training documents are loaded into the input nodes, and if a misclassification occurs, the error is propagated back through the network, modifying the link weights so as to minimize the error. The simplest kind of neural network is a perceptron, which has only two layers, the input and the output nodes. Such a network is equivalent to a linear classifier. More complex networks contain one or more hidden layers between the input and output layers. However, experiments have shown very small or no improvement of nonlinear networks over their linear counterparts in the text categorization task.
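The two-layer perceptron described above can be sketched in a few lines; the feature vectors and labels below are invented, and `lr` is the learning rate:

```python
def train_perceptron(examples, epochs=20, lr=1.0):
    """Two-layer (input/output) perceptron = a linear classifier over features."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:          # y is +1 or -1
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:              # on error, nudge weights toward the label
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Features: [count of spam-like words, count of ordinary words]; +1 = spam
data = [([3, 0], 1), ([2, 1], 1), ([0, 3], -1), ([1, 2], -1)]
w, b = train_perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Because the decision is a thresholded weighted sum, the trained network is exactly a linear classifier, which is the equivalence the text notes.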
Example-Based Classiers

Example-based classifiers do not build explicit declarative representations of categories but instead rely on directly computing the similarity between the document to be classified and the training documents. These methods have thus been called lazy learners because they defer the decision on how to generalize beyond the training data until each new query instance is encountered. Training for such classifiers consists simply of storing the representations of the training documents together with their category labels.
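A lazy, example-based classifier can be sketched as k-nearest-neighbours with cosine similarity over bag-of-words vectors; the training documents and labels are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_tokens, training, k=3):
    """Lazy learner: training is just stored (token list, label) pairs."""
    q = Counter(query_tokens)
    ranked = sorted(training, key=lambda ex: cosine(q, Counter(ex[0])), reverse=True)
    top = [label for _, label in ranked[:k]]
    return Counter(top).most_common(1)[0][0]  # majority vote among the k nearest

training = [
    ("stock market shares".split(), "finance"),
    ("bank interest rates".split(), "finance"),
    ("match goal striker".split(), "sport"),
    ("league season coach".split(), "sport"),
]
label = knn_classify("market shares rates".split(), training, k=3)
```

All generalization happens at query time: nothing is precomputed beyond storing the labelled examples, which is what "lazy" means here.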
CLUSTERING

Clustering is commonly used to group similar documents, but it differs from categorization in that it groups documents on the fly instead of using predefined topics. Clustering can also produce multiple clusters with various subtopics, which helps ensure that no useful document is lost or omitted from the results. A basic clustering algorithm uses a vector-based technique to cluster topics for each document and then ranks the documents by measuring how well they fit within each cluster. Clustering is among the text mining tools most commonly used in organizations for management information systems (MIS).
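The vector-based clustering step can be sketched with plain k-means; here each document is reduced to a hypothetical two-dimensional topic-weight vector, and the initial centers are chosen by hand:

```python
import math

def kmeans(points, centers, iters=10):
    """Plain k-means: assign each vector to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its members (keep it if empty)
        centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers, clusters

# Each point = (weight of 'sport' terms, weight of 'finance' terms) for one document
docs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
centers, clusters = kmeans(docs, centers=[(1.0, 0.0), (0.0, 1.0)])
```

No topic labels are given in advance: the two groups emerge from the geometry of the document vectors alone, which is the "on the fly" behaviour described above.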

Text Mining Applications


Bioinformatics
In bioinformatics, the biomedical research literature has been a prime target for text mining. Textbooks on biomedical text mining with a strong genomics focus report industry estimates that the majority of drug targets are derived from the literature. The aim of text mining in this field is to allow researchers to extract knowledge from the biomedical literature and thus facilitate new drug discovery in a more resourceful manner. Information extraction research has increased drastically in this domain (MedMiner: An Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling, n.d.); one such work discusses the mammoth amount of biomedical information on the Internet and its projected growth. Data mining tools such as text mining are capable of filtering the public databases and organizing the relevant information in a coherent manner. These tools have played a significant role in analyzing gene-gene relationships observed in mRNA expression profiling experiments using oligonucleotide chips and cDNA microarrays. Among thousands of such relationships, text mining techniques such as clustering have proved useful in finding apparent correlations.
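One simple instance of mining gene-gene relationships from the literature is counting co-mentions of gene symbols within abstracts. The gene list and abstracts below are invented, and real systems use curated lexicons and proper entity recognition rather than substring matching:

```python
from itertools import combinations
from collections import Counter

GENES = {"EGFR", "TP53", "KRAS", "BRCA1"}

def gene_cooccurrence(abstracts):
    """Count how often two gene symbols are mentioned in the same abstract."""
    pairs = Counter()
    for text in abstracts:
        mentioned = sorted(g for g in GENES if g in text)
        pairs.update(combinations(mentioned, 2))
    return pairs

abstracts = [
    "EGFR signalling interacts with KRAS in lung tumours",
    "Mutations in TP53 and KRAS were profiled",
    "EGFR and KRAS co-expression on cDNA microarrays",
]
pairs = gene_cooccurrence(abstracts)
```

Pairs that co-occur far more often than chance would predict are the candidate relationships that clustering and correlation analyses then examine.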

National Security
A major use of text mining analytics can be seen in homeland defense, where security has become an important issue. News that government bodies such as the NSA are using these resources for the surveillance of all kinds of communication, such as e-mail and chat-room conversations, to monitor suspicious activity has led to debate over their usage. Personal mail on platforms such as Google and Facebook is used daily to exchange business-related messages and documents, but such channels are also used for illegitimate purposes such as the distribution of unsolicited junk mail, spam, or threatening material. After the 9/11 attacks on America, a more watchful government began monitoring e-mails and chat rooms using automatic text mining tools, which have offered considerable success in these areas. Text mining technology is becoming a key intelligence technology for security and defense, and the NSA has gained significant know-how in using text mining tools in the national defense domain.

Intellectual Patent analysis using commercial text mining platform


Innovation in recent years has led to a rise in patent research, which has become the business focus of an increasing number of professionals who help organizations understand how best to use intellectual property and avoid conflicts with other organizations' IPR in their business activities. Patent mining includes a broad range of somewhat related activities involving the investigation of the ownership rights, registration, and implications of patents. Most of these activities require organizing, collecting, and evaluating very large sets of detailed, technical text-based documents. The business solutions considered here can be called horizontal applications because, even though they have a narrow functional focus on patent-related documents, they have wide application across many different businesses. Professionals in the intellectual property (IP) departments of both private companies and public firms, as well as of consultancies and law firms, are accountable for providing in-depth insight into corporate patent strategies. These analyses need to take into consideration not just the potentially usable IP that the parent company may hold, but all of the published IP rights in its field of interest that other companies may already possess.

Text mining can support queries that permit a user to request all patents issued in numerous broad areas of interest over some period. Such queries can reveal which areas of intellectual property are trending and which are not, all in relation to other areas of IP, based on patterns of patent application and approval. These text mining applications support a wide range of attributes on nearly all query types, such as syntactical, redundancy, background, and quality-threshold constraints. They also support time-based variables, allowing trend analysis queries and flexibility in comparing distributions over different time-series divisions of the document collection. For example, an intellectual property manager might be interested in exploring the distribution of patents among assignees in the field of external incubators. To begin, he would run a distribution analysis query for all assignees with relevant patents in his area of interest.

On execution, the manager would obtain a table view of all assignees ordered by the number of patents they had been issued in the document collection. From this screen, a histogram graphically demonstrating the distribution pattern could quickly be generated. The user can then click on a patent either to see an annotated full-text version or to be directed to the URL of the official patent text on the U.S. Patent and Trademark Office Web site.
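At its core, the distribution-analysis query described above reduces to counting patents per assignee; a sketch with invented patent records:

```python
from collections import Counter

def assignee_distribution(patents):
    """Distribution-analysis query: patents per assignee, most prolific first."""
    return Counter(p["assignee"] for p in patents).most_common()

# Hypothetical patent records in the analyst's area of interest
patents = [
    {"id": "US1", "assignee": "Acme Medical"},
    {"id": "US2", "assignee": "Acme Medical"},
    {"id": "US3", "assignee": "BioIncubate"},
    {"id": "US4", "assignee": "Acme Medical"},
]
table = assignee_distribution(patents)  # ordered table view, ready for a histogram
```

The resulting ordered list is exactly the table view the manager inspects, and its counts feed the histogram directly.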

Trends in Patent Issuance


The pace of change in the technology landscape can be understood by exploring how patterns of patent application and issuance for new technologies evolve. By understanding these trends, a company can decide whether it should innovate by developing its own technology, defend its own patents, or attempt to license another company's patents (Powerful Tool to Expand Business Intelligence: Text Mining, n.d.). Trend analysis can reveal whether the number of patents in a new technology is rising, plateauing, or decreasing, indicating the current strength of market interest. The job of a patent manager is to interpret these signals and, for example, flag a steadily increasing year-by-year trend in patents related to a technology that his or her client company is developing as encouraging, since it represents growing interest in related business areas.
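Such a trend analysis can be sketched as a least-squares slope over yearly issuance counts; the counts below are invented, and a positive slope suggests growing interest:

```python
def trend_slope(yearly_counts):
    """Least-squares slope of patents-per-year; positive = rising interest."""
    years = sorted(yearly_counts)
    n = len(years)
    xm = sum(years) / n
    ym = sum(yearly_counts[y] for y in years) / n
    num = sum((y - xm) * (yearly_counts[y] - ym) for y in years)
    den = sum((y - xm) ** 2 for y in years)
    return num / den

# Hypothetical issuance counts for one technology area
counts = {2008: 12, 2009: 18, 2010: 25, 2011: 31, 2012: 40}
slope = trend_slope(counts)  # average patents gained per year
```

A slope near zero would signal a plateau and a negative slope waning interest, the three regimes the paragraph above distinguishes.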

AIRLINE INDUSTRY
Airline travel tends to raise the stress level of most passengers because of various safety and comfort issues, yet most of us are unaware of the amount of data we create that is gathered by the airline industry, and of how it is collected and analyzed to improve aviation safety. The majority of this data is structured, but a small amount of in-flight data exists in text form, generated through audio recordings in black boxes, warning signals, and written reports. SSAT (System-Wide Safety and Assurance Technology), developed by NASA, makes use of this data. Based on reports published by the Department of Transportation, large amounts of data were recorded from 10 million commercial flights by national and international carriers operating within the country in 2011 alone. The team analyzes operational data using text mining techniques to determine what happened in an anomalous state. An anomalous state indicates that something different enough is happening to warrant attention; it also marks a junction in the road at which an operation can either return to a safe state or turn into an accident. That analysis can shed light on causal factors, that is, why something happened. The methods are similar to those used in other industries, so the challenge is to understand why something is happening rather than focusing only on what is happening. As an example, consider an incident at JFK International Airport a little over a year earlier, when Lufthansa Flight 411 and EgyptAir Flight 986 almost collided in a "runway incursion." Figuring out why something happened is more difficult with numeric data alone. Chief scientist for data sciences Ashok Srivastava describes how, as part of the investigation, analyzing the recordings of audio conversations between the pilots and the control tower using data mining techniques helped reveal that the EgyptAir aircraft had made a wrong turn.
By understanding both the what and the why, NASA will be able to stop similar instances from happening in the future. The approach adopted by NASA engineers is distinctive in that they have trained their system to analyze all of the data together by combining it at one source, called the kernel. This is a significant move away from the traditional approach, in which the numeric and text data were analyzed separately and the two results then combined to obtain a better perspective.

Automotive Industry
It's been assessed that major chunk of expenditure worn by auto mobile companies are in after sales services like Warranties which cost auto organizations more than $35 billion in the every twelve-months. Recognizing this issue, it is basic that auto organizations explore every possibility to reduce such expenses. Optimizing this cost is a very important challenge in the cost analysis for automobile manufacturers. In this situation if one has the ability to get the cost down marginally will have a multiplier effect on this cost? One of such underutilized parameter of optimizing warranty cost is review from service technicians. From those remarks, the text mining process can bring into picture pieces of information defect insights thus yielding precautions for preventing them in future. In order to optimize the warranty process, it is exceptionally critical to detail a percentage of the business inquiries, which are presently unanswered dependent upon professionals' remarks. Here are a couple of demonstrative business inquiries What are the prominent problem areas to be concentrated upon at individual dealer levels based on comments from the technicians? 1. What are the most frequent car components mentioned by the technician in terms of frequency of repetition in service comments in the last three months, and what can we interpret that measure suppliers and/or internal manufacturing processes?\ 2. What is the noticeable issue ranges to be gathered upon at singular dealer levels dependent upon remarks from the technician? 3. Is their seasonal cyclecity in occurrence of keywords related to component failure? Is there a sudden spike in watchwords, for example brake covering, fuel pump, oil spillage or wobbly throughout the winter season 4. Is there a strong association between the keyword reoccurrence and a component suppliers rank in terms of warranty cost? 
A typical text mining solution to answer the above questions incorporates four kinds of unstructured input about the vehicle from internal and external sources. Once the data is received, it is fed to the text mining process, which produces three outputs: first, a list of keywords; second, a higher abstraction of keywords into major vehicle defect themes; and third, a list of occurrences where certain high-risk keywords, such as "over-heating", were encountered.

[Figure: A text mining solution for warranty analytics. Auto data sources (dealer management system technician comments; CRM customer comments on vehicles; user-generated content such as blogs; influential vehicle reviews and trade journals) feed the unstructured text mining process. Its analytical outputs (keyword frequency analysis, theme analysis, early warning system) drive optimized decisions: optimize component sourcing, re-engineer manufacturing, optimize spare-part inventory, and reorganize the defect taxonomy.]

Conceivable Activities to Trigger after Mining Technicians' Comments
Once we have the answers to these questions through a structured text mining analysis, automobile firms can start working on follow-up measures that will trim warranty-related cost attrition, enhance market stock for spare parts and help suppliers deliver quality components:

Component sourcing: Auto manufacturers can decide to share the insights gained from mining technicians' comments with specific component suppliers and undergo joint training drills to reduce the number of defective components.
Early warning framework: Automobile companies can also develop an early warning system based on the frequency of occurrence, within a stipulated period, of specific issue keywords such as "short circuit" or "brake lining", which could be life-threatening and cause legal suits in some cases.
Optimize internal assembly management: If the troublesome component is manufactured in-house, the specific manufacturing process responsible for the faulty component can be redeveloped to eliminate mishaps.
Inventory optimization: The frequency of occurrence of keywords related to select spare parts can act as a trigger to forecast regional surges in spare-part demand.
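The early warning framework reduces to a threshold check: flag any safety-critical keyword whose count in the current reporting window crosses its alert limit. The keywords and thresholds below are illustrative assumptions, not figures from the paper.

```python
from collections import Counter

# Hypothetical safety-critical keywords and their alert thresholds
# (counts per reporting window, e.g. per fortnight).
CRITICAL_KEYWORDS = {"short circuit": 5, "brake lining": 10}

def early_warnings(window_counts, thresholds):
    """Return keywords whose count in the current window crosses its threshold."""
    return [kw for kw, limit in thresholds.items()
            if window_counts.get(kw, 0) >= limit]

this_fortnight = Counter({"short circuit": 7, "brake lining": 4, "wiper": 12})
print(early_warnings(this_fortnight, CRITICAL_KEYWORDS))
# "short circuit" (7 >= 5) triggers; "brake lining" (4 < 10) does not.
```

A production system would tune the thresholds per region and season rather than hard-code them, but the trigger logic stays this simple.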

Text Mining Healthcare Industry


Most countries regularly spend somewhere between 4% and 9% of their GDP on healthcare. The healthcare industry is among the leaders in adopting technology and among the highest spenders; with the burgeoning of hospital management systems and portable low-cost devices that log patient health data, there is a sudden surge in the volume and depth of patient data. Text mining the contents of doctors' diagnosis transcripts can lead to pattern recognition that lets us view the health industry in a bigger frame. This has various benefits:
1. Frequency analysis or cluster analysis can reveal the seasonal trend of the top 10 diseases. These findings can help forecast and optimize the medicines that dealers should stock.
2. Based on physicians' comments, an early warning system can be woven into the text mining outputs to detect sudden changes in consultations regarding a specific ailment. For example, if the frequency of the keyword "cough" or "breathing" exceeds a certain number of appearances in the last fortnight for a given area or region, it can be a sign of unfavorable environmental conditions in that area which are causing respiratory problems.
The components of such a text mining solution can be seen below.
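The seasonal and regional trend analysis described above can be sketched as an aggregation of keyword hits per (month, region) pair. The transcript snippets and the respiratory keyword list are hypothetical examples for illustration.

```python
from collections import defaultdict

# Hypothetical doctor-transcript excerpts tagged (month, region, text).
transcripts = [
    ("Jan", "North", "persistent cough and mild fever"),
    ("Jan", "North", "breathing difficulty, suspected asthma"),
    ("Jul", "North", "skin rash after sun exposure"),
    ("Jan", "South", "dry cough, no fever"),
]

RESPIRATORY = ("cough", "breathing")

def monthly_keyword_counts(records, keywords):
    """Count transcripts mentioning any keyword, per (month, region) pair."""
    counts = defaultdict(int)
    for month, region, text in records:
        if any(kw in text.lower() for kw in keywords):
            counts[(month, region)] += 1
    return dict(counts)

print(monthly_keyword_counts(transcripts, RESPIRATORY))
# {('Jan', 'North'): 2, ('Jan', 'South'): 1}
```

Comparing each cell against its historical baseline is what turns this table into the early warning signal the text describes.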

Credit Card Industry


The explosion of credit card companies (Three Real-World Applications of Text Mining to Solve Specific Business Problems by Derick Jose - BeyeNETWORK, n.d.) has made it difficult to identify the right balance of card features that are currently in demand among customers while also carrying the lowest risk of defaults and recovery-related interactions. Text mining again comes to the aid of this industry, helping firms both identify the right feature assortment and package it with the most favorable customer experience. One usage scenario is a top-10 complaint keyword query generated by mining the inbound CSR (customer service representative) transcripts on a regular basis. Text mining can also be used for comparative analysis of call-center staff performance.
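The top-10 complaint keyword query can be approximated by tokenizing CSR transcripts, dropping stopwords, and ranking the remainder. The transcripts and the tiny stopword set below are illustrative assumptions; a real system would use a fuller stopword list and phrase extraction.

```python
import re
from collections import Counter

# Minimal illustrative stopword set.
STOPWORDS = {"the", "a", "an", "my", "i", "was", "is", "and",
             "to", "of", "on", "not", "without"}

def top_complaint_keywords(transcripts, n=10):
    """Rank non-stopword terms across inbound CSR transcripts."""
    words = Counter()
    for text in transcripts:
        for w in re.findall(r"[a-z]+", text.lower()):
            if w not in STOPWORDS:
                words[w] += 1
    return words.most_common(n)

calls = [
    "My card was charged an annual fee twice",
    "Annual fee waiver promised on the card was not applied",
    "Reward points on my card expired without notice",
]
print(top_complaint_keywords(calls, n=3))
```

Run daily over the inbound transcript feed, the same function yields the regular top-10 complaint report the text describes.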

GENERAL APPLICATION
Customer Relationship Management (CRM): CRM is a prerequisite for every firm, and in this field text mining can be applied to manage the contents of client messages. Within such a system, an analysis of each client query automatically reroutes specific requests to the appropriate department and flags them for immediate attention (Seminar Report On Text Mining Submitted By Jisha A.K., n.d.). Companies have already implemented such systems and are ever keener to understand how to use these technologies to derive business insights and operational efficiency.
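The automatic rerouting of client messages can be sketched as a keyword-based router. The departments and keyword lists below are hypothetical; production systems typically replace the rules with a trained classifier.

```python
# Hypothetical routing rules: department names and keywords are illustrative.
ROUTES = {
    "billing":   ["invoice", "refund", "charge"],
    "technical": ["error", "crash", "login"],
    "sales":     ["upgrade", "quote", "pricing"],
}

def route_message(message, routes, default="general"):
    """Route a client message to the first department whose keywords match."""
    text = message.lower()
    for dept, keywords in routes.items():
        if any(kw in text for kw in keywords):
            return dept
    return default

print(route_message("I was charged twice, please refund", ROUTES))  # billing
print(route_message("The app crashes on login", ROUTES))            # technical
print(route_message("Where is my parcel?", ROUTES))                 # general
```

Flagging for immediate attention is the same check with an "urgent" keyword list layered on top of the routing decision.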

Human resource management: Text mining techniques are being implemented to better manage human resources. These tools help monitor the level of employee satisfaction and better analyze the structure of staff compensation demands, and they assist the recruitment process by reading and storing CVs for the selection of new employees. Text mining techniques are now even being utilized as an indicator of the state of health of a firm by measuring the engagement of its employees.

As we have seen above, text mining is making its impact in a diverse set of industries ranging from defence to healthcare to automotive and beyond; the text mining process has become the tool of the century for mining big unstructured data. In today's tough economic environment and low-margin situations, the pressure to optimize the efficiency of business processes is tremendous, and applying unstructured text mining techniques to previously ignored resources, such as comments from doctors or airline service records, can provide competitive differentiation. This competitive advantage due to technological edge will soon be as common as any other technical tool today; the only differentiator then will be the insight into where to use it.

Future directions
Multilingual text refining: While data mining is basically language independent, text mining is constrained by a significant language component. We need to develop text refining algorithms that can process multilingual text documents and mine language-independent intermediate forms. As most current text mining tools work only on English documents, bringing the same technical know-how to other languages can help various regions adopt the technology faster.
Domain knowledge: Domain knowledge integration should be used as early as the text refining stage. It can also play a part in knowledge distillation, classification, or predictive modeling tasks, to explore how a customer's knowledge can influence a knowledge structure and make interpreting the knowledge more efficient.
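A first step toward multilingual text refining is identifying the language of each document before routing it to a language-specific pipeline. The toy sketch below guesses the language from stopword overlap; the tiny stopword profiles are illustrative assumptions, and real systems use trained character-n-gram models instead.

```python
# Illustrative stopword samples per language; real profiles are far larger.
STOPWORD_PROFILES = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "la", "et", "est", "de"},
    "de": {"der", "die", "und", "ist", "das"},
}

def guess_language(text, profiles):
    """Pick the language whose stopword set overlaps the text the most."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & stops) for lang, stops in profiles.items()}
    return max(scores, key=scores.get)

print(guess_language("the cat is on the mat", STOPWORD_PROFILES))        # en
print(guess_language("la maison est grande et belle", STOPWORD_PROFILES))  # fr
```

Once the language is known, each document can be refined into the language-independent intermediate form the text calls for.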

Personalized autonomous mining

Current text mining applications are basically tools designed for specifically trained knowledge specialists. In the future, these text mining tools will become part of knowledge management systems used in daily activities, not only by technical users but by management executives as well. Efforts to develop systems that can interpret natural language queries and automatically perform the desired mining operations are already underway. Text mining tools could also appear in the form of intelligent personal assistants like Siri. In the future, text mining tools will be automated and self-reliant: they will take user preferences into consideration, conduct mining operations autonomously, and deliver information without needing explicit involvement from the user.

Bibliography

MedMiner: An Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling. (n.d.). Retrieved November 5, 2013, from http://137.187.213.210/host/medminer_manuscript.pdf
Powerful Tool to Expand Business Intelligence: Text Mining. (n.d.). Retrieved November 5, 2013, from http://www.waset.org/journals/waset/v8/v8-103.pdf
Feldman, R. (Bar-Ilan University, Israel), & Sanger, J. (ABS Ventures, Waltham, Massachusetts). Advanced Approaches in Analyzing Unstructured Data. Retrieved November 5, 2013.
Laskowski, N. NASA uses text analytics to bolster aviation safety.
Seminar Report On Text Mining Submitted By Jisha A.K. (n.d.). Retrieved November 5, 2013, from http://dspace.cusat.ac.in/jspui/bitstream/123456789/3614/1/Textmining.pdf
Three Real-World Applications of Text Mining to Solve Specific Business Problems by Derick Jose - BeyeNETWORK. (n.d.). Retrieved November 5, 2013, from http://www.b-eyenetwork.com/view/12783
Value and benefits of text mining | Jisc. (n.d.). Retrieved November 5, 2013, from http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining#a2
