Table Of Contents

about this book
about the cover illustration
Getting started taming text
1.1 Why taming text is important
1.2 Preview: A fact-based question answering system
1.2.1 Hello, Dr. Frankenstein
1.3 Understanding text is hard
Foundations of taming text
2.1 Foundations of language
2.1.1 Words and their categories
2.2 Common tools for text processing
2.2.1 String manipulation tools
2.2.2 Tokens and tokenization
2.2.3 Part of speech assignment
2.2.5 Sentence detection
2.2.6 Parsing and grammar
2.2.7 Sequence modeling
2.3 Preprocessing and extracting content from common file formats
2.3.1 The importance of preprocessing
3.1 Search and faceting example: Amazon.com
3.2 Introduction to search concepts
3.2.1 Indexing content
3.3 Introducing the Apache Solr search server
3.3.1 Running Solr for the first time
3.4 Indexing content with Apache Solr
3.5 Searching content with Apache Solr
3.5.1 Solr query input parameters
3.7.1 Hardware improvements
3.7.2 Analysis improvements
3.7.3 Query performance improvements
3.7.4 Alternative scoring models
3.7.5 Techniques for improving Solr performance
3.8 Search alternatives
Fuzzy string matching
4.1 Approaches to fuzzy string matching
4.1.1 Character overlap measures
4.1.2 Edit distance measures
4.1.3 N-gram edit distance
4.2 Finding fuzzy string matches
4.2.1 Using prefixes for matching with Solr
4.2.3 Using n-grams for matching
4.3 Building fuzzy string matching applications
4.3.1 Adding type-ahead to search
Identifying people, places, and things
5.1 Approaches to named-entity recognition
5.1.1 Using rules to identify names
5.2 Basic entity identification with OpenNLP
5.2.1 Finding names with OpenNLP
5.3 In-depth entity identification with OpenNLP
5.3.1 Identifying multiple entity types with OpenNLP
5.3.2 Under the hood: how OpenNLP identifies names
5.4 Performance of OpenNLP
5.4.1 Quality of results
5.4.2 Runtime performance
5.5 Customizing OpenNLP entity identification for a new domain
5.5.1 The whys and hows of training a model
5.5.3 Altering modeling inputs
5.5.4 A new way to model names
5.7 Further reading
Clustering text
6.1 Google News document clustering
6.2 Clustering foundations
6.2.1 Three types of text to cluster
6.2.3 Determining similarity
6.2.5 How to evaluate clustering results
6.3 Setting up a simple clustering application
6.4 Clustering search results using Carrot2
6.5 Clustering document collections with Apache Mahout
6.5.1 Preparing the data for clustering
6.6 Topic modeling using Apache Mahout
6.7 Examining clustering performance
6.7.1 Feature selection and reduction
6.7.3 Mahout clustering benchmarks
Classification, categorization, and tagging
7.1 Introduction to classification and categorization
7.2 The classification process
7.2.1 Choosing a classification scheme
7.2.4 Evaluating classifier performance
7.3 Building document categorizers using Apache Lucene
7.3.1 Categorizing text with Lucene
7.4 Training a naive Bayes classifier using Apache Mahout
7.4.1 Categorizing text using naive Bayes classification
7.4.2 Preparing the training data
7.4.3 Withholding test data
7.4.4 Training the classifier
7.4.5 Testing the classifier
7.4.6 Improving the bootstrapping process
7.5 Categorizing documents with OpenNLP
7.6 Building a tag recommender using Apache Solr
7.6.1 Collecting training data for tag recommendations
7.6.2 Preparing the training data
7.6.4 Creating tag recommendations
7.6.5 Evaluating the tag recommender
Building an example question answering system
8.1 Basics of a question answering system
8.2 Installing and running the QA code
8.4.1 Training the answer type classifier
8.4.3 Computing the answer type
8.4.5 Ranking candidate passages
8.5 Steps to improve the system
Untamed text: exploring the next frontier
9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP
9.2 Document and collection summarization
9.3 Relationship extraction
9.3.1 Overview of approaches
9.4 Identifying important content and people
9.4.1 Global importance and authoritativeness
9.4.2 Personal importance
9.4.3 Resources and pointers on importance
9.5 Detecting emotions via sentiment analysis
9.5.1 History and review
9.5.2 Tools and data needs
9.6 Cross-language information retrieval
Taming Text

text processing
Published by: cumin on Apr 20, 2014
Copyright:Traditional Copyright: All rights reserved