• History
• Architecture
• Configuration
Nutch Web Crawler
History
• Runs on Hadoop
• Reliable, scalable, distributed computing
• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
• Uses Apache Lucene for index storage
Nutch Web Crawler
Architecture - Crawl Breakdown
• Inject
• Generate
• Fetch
• Updatedb
• Mergesegs
• Invertlinks
• Builds the linkdb of inbound links and anchor text
• Solrindex (or index)
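The steps above correspond to Nutch 1.x CLI commands; a minimal sketch of one crawl cycle (the directory layout and the Solr URL are assumptions, not part of the original deck):

```shell
# Seed the crawldb with start URLs (urls/ holds a seed list)
bin/nutch inject crawl/crawldb urls

# Generate a fetch list, then fetch and parse the newest segment
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/2* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"

# Fold fetch results back into the crawldb
bin/nutch updatedb crawl/crawldb "$SEGMENT"

# Merge segments and invert links for anchor-text indexing
bin/nutch mergesegs crawl/merged -dir crawl/segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# Push documents into Solr (URL is an assumption)
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
```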
Nutch Web Crawler
Summary
• History
• Originally written by Doug Cutting
• Apache foundation project
• Architecture
• Runs either on Hadoop or locally
• Uses Apache Lucene as its index storage
• Parses a large number of file formats using Apache Tika
• Configuration
• nutch-site.xml
• Crawl breakdown
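A minimal nutch-site.xml sketch: http.agent.name is the one property a crawl refuses to run without; the other values here are illustrative assumptions.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Required: Nutch will not fetch without an agent name -->
  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <!-- Illustrative: cap the outlinks followed per page -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
  </property>
</configuration>
```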
Drupal Nutch Module
Introduction
• Architecture
• Configuration
Drupal Nutch Module
Configuration
• Architecture
• Seed Crawl URLs
• Starting and Stopping of Nutch jobs
• Basic statistics of Nutch Crawls
• Configuration
• Number and Depth of Crawls
• How and when crawls occur
• URL regular expression to define crawl boundaries
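Crawl boundaries are typically drawn in conf/regex-urlfilter.txt: one rule per line, first match wins, `+` includes and `-` excludes (the domain below is a hypothetical example):

```
# Skip image and archive file suffixes
-\.(gif|jpg|png|zip|gz)$
# Stay inside the target site (hypothetical domain)
+^https?://([a-z0-9-]+\.)*example\.com/
# Reject everything else
-.
```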
Technical Design
Introduction
• Nutch Setup
• Mapping and Crawling
• Understanding data
• Search results and displaying data
Technical Design
Nutch Setup
• Setting up Nutch
• Hadoop
• Local instance
Technical Design
Mapping
• Crawling data
• There’s a lot of s#*t out there (working title)
• Data filtering
• Crawl now filter later
• Filter before crawl
• Apache Tika data filtering
• Strips markup from documents
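Tika itself is a Java library; as a language-neutral illustration of what "stripping markup" means here — not Tika's actual API — a small Python stand-in using only the standard library:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, discarding tags and script/style bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def text(self):
        # collapse runs of whitespace left behind by removed tags
        return " ".join(" ".join(self.parts).split())

def strip_markup(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()

print(strip_markup("<html><body><h1>Jobs</h1><script>x()</script><p>Apply now</p></body></html>"))
# → Jobs Apply now
```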
Technical Design
Search Results
• Nutch Setup
• Hadoop vs Local
• Mapping and Crawling
• solrindex-mapping.xml
• regex url-filters
• Understanding data
• Tika strips all markup/structure
• Search results and displaying data
• Apache Solr Views and Views 3 are great for data display
• Launch out to site or display in virtual page?
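solrindex-mapping.xml maps Nutch document fields onto Solr schema fields; a minimal sketch (the field names follow the stock Nutch/Solr schema, but verify against your own):

```xml
<mapping>
  <fields>
    <!-- copy Nutch fields into Solr fields of the same name -->
    <field dest="title" source="title"/>
    <field dest="content" source="content"/>
    <field dest="url" source="url"/>
  </fields>
  <!-- Solr document key -->
  <uniqueKey>id</uniqueKey>
</mapping>
```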
Jobs Aggregation Demo
What Next?
• Hadoop integration
• Geoparsing Support
• Lucene API module support
• QueryPath Support
Questions?