
How to build a jobs aggregation engine with Nutch, Solr and Views 3 in about an hour
Housekeeping

• I’m not here


• Questions posted to http://axistwelve.com/nutch_presentation
Agenda
• Apache Nutch crawler
• Drupal Nutch module
• Technical Design decisions on
combining crawled data with your
Drupal data in Apache Solr
• Bringing it all together with a demo of a
jobs aggregation search engine
• Questions
Nutch Web Crawler
Introduction

• History

• Architecture

• Configuration
Nutch Web Crawler
History

• Created by Doug Cutting


• Apache foundation project
• Written in Java
• Current version 1.1
Nutch Web Crawler
Architecture

• Runs on Hadoop
• Reliable, scalable, distributed computing
• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
• Uses Apache Lucene for storage
Nutch Web Crawler
Architecture - Tika

• Uses Apache Tika since version 1.1


• Used to parse incoming documents
• Supports
• HTML, XML, MS Office, OpenOffice, EPUB, audio, video, images, etc.
Nutch Web Crawler

• Integrates with Apache Solr via RESTful web services since version 1.0
Nutch Web Crawler
Highly configurable

• Allows you to control


• Number of crawl threads
• Depth of crawl
• Pages retrieved per depth level
• Many other options (see conf/nutch-default.xml and the sketch below)
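As a rough sketch, overrides go in conf/nutch-site.xml; the property names below come from nutch-default.xml, while the agent name and thread count are purely illustrative:

<?xml version="1.0"?>
<configuration>
  <!-- identify the crawler; Nutch will not fetch without an agent name -->
  <property>
    <name>http.agent.name</name>
    <value>MyJobsCrawler</value>
  </property>
  <!-- number of fetcher threads (the "crawl threads" setting above) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
</configuration>

Depth of crawl and pages retrieved per depth level are passed on the crawl command line (-depth and -topN) rather than set here.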
Nutch Web Crawler
Crawl Control

• Uses URL regex filters to decide crawl criteria (sketch below)
• All DrupalCons: +http://[a-z0-9].drupalcon.org/
• Just SF: +http://sf2010.drupalcon.org/
• Everything else: -^. (ignore)
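Those rules live in conf/regex-urlfilter.txt, one per line, and the sign of the first matching pattern decides whether a URL is fetched. A hedged sketch of the DrupalCon example above:

# conf/regex-urlfilter.txt: first matching pattern wins
# accept any DrupalCon site
+^http://[a-z0-9]+\.drupalcon\.org/
# ...or accept only San Francisco instead
# +^http://sf2010\.drupalcon\.org/
# ignore everything else
-^.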
Nutch Web Crawler
Crawl Breakdown

• Inject
• Generate
• Fetch
• Updatedb
• Mergesegs
• Invertlinks
• Handles link de-duplication
• Solrindex (or index)
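Each of those steps maps onto a Nutch command; roughly, one crawl pass looks like this (the paths, -topN value and Solr URL are illustrative):

bin/nutch inject crawl/crawldb urls

# repeat this block once per depth level
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/* | tail -1`    # the segment generate just created
bin/nutch fetch $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT

# after the loop
# bin/nutch mergesegs crawl/merged -dir crawl/segments   (optional consolidation)
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

The all-in-one bin/nutch crawl command wraps the same inject/generate/fetch/updatedb cycle.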
Nutch Web Crawler
Summary

• History
• Originally written by Doug Cutting
• Apache foundation project
• Architecture
• Runs either on Hadoop or locally
• Uses Apache Lucene as its storage
• Parses a large number of file formats using Apache Tika
• Configuration
• nutch-site.xml
• Crawl breakdown
Drupal Nutch Module
Introduction

• Architecture

• Configuration
Drupal Nutch Module
Configuration

• Set seed URLs for your crawl


• This can be one or many
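Under the hood Nutch reads its seeds from a plain-text URL list in the directory passed to inject, one URL per line. A sketch, where the file name and the second URL are hypothetical:

urls/seed.txt:
http://sf2010.drupalcon.org/
http://jobs.example.com/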
Drupal Nutch Module
Admin Settings Page

• Define the number of pages and depth to crawl per run
• Define the location of the Nutch crawler
• Define the location of the Java home
• Define merge control from Nutch into Solr
Drupal Nutch Module
Configuration

• Start and stop Nutch crawls from Drupal

• Or set up a cron run

Drupal Nutch Module
Configuration

• Allows you to define the mapping of Nutch fields to Solr fields
Drupal Nutch Module
Configuration

• Basic crawl statistics


• Number of pages crawled
• Current crawl status (if run in single mode)
Drupal Nutch Module
Summary

• Architecture
• Seed Crawl URLs
• Starting and Stopping of Nutch jobs
• Basic statistics of Nutch Crawls
• Configuration
• Number and Depth of Crawls
• How and when crawls occur
• URL regular expression to define crawl boundaries
Technical Design
Introduction

• Nutch Setup
• Mapping and Crawling
• Understanding data
• Search results and displaying data
Technical Design
Nutch Setup

• Setting up Nutch
• Hadoop
• Local instance
Technical Design
Mapping

• Deciding on field mappings between Nutch and the Drupal Solr schema
<mapping>
  <!-- Any fields in NutchDocument that match a name defined in
       field/@source will be renamed to the corresponding field/@dest.
       Additionally, if a field name (before mapping) matches a
       copyField/@source then its values will be copied to the
       corresponding copyField/@dest.
       uniqueKey has the same meaning as in Solr schema.xml
       and defaults to "id" if not defined. -->
  <fields>
    <field dest="teaser" source="content"/>
    <field dest="title" source="title"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <copyField source="content" dest="text"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
Technical Design
Crawling

• Crawling data
• There’s a lot of s#*t out there (working title)
• Data filtering
• Crawl now filter later
• Filter before crawl
• Apache Tika data filtering
• Strips markup from documents
Technical Design
Search Results

• Search results mixed with Drupal results
• Launch out?
• The Solr Virtual Page with Apache Solr Views
• Search results in a separate core
• Apache Solr Views and Multicore
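For the separate-core option, a hedged sketch of a Solr 1.4-era multicore solr.xml; the core names and instance directories are illustrative:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- the site's normal Drupal search index -->
    <core name="drupal" instanceDir="drupal"/>
    <!-- the index Nutch writes into via solrindex -->
    <core name="nutch" instanceDir="nutch"/>
  </cores>
</solr>

Keeping the crawled pages in their own core keeps them out of the site's own search index, and Apache Solr Views can then query either core.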
Technical Design
Summary

• Nutch Setup
• Hadoop vs Local
• Mapping and Crawling
• solrindex-mapping.xml
• regex url-filters
• Understanding data
• Tika strips all markup/structure
• Search results and displaying data
• Apache Solr Views and Views 3 are great for data display
• Launch out to site or display in virtual page?
Jobs Aggregation Demo
What Next?

• Hadoop integration
• Geoparsing Support
• Lucene API module support
• QueryPath Support
Questions?
