
How to build a jobs aggregation engine with Nutch, Solr and Views 3 in about an hour
Housekeeping

• I’m not here


• Questions posted to http://axistwelve.com/nutch_presentation
Agenda
• Apache Nutch crawler
• Drupal Nutch module
• Technical Design decisions on
combining crawled data with your
Drupal data in Apache Solr
• Bringing it all together with a demo of a
jobs aggregation search engine
• Questions
Nutch Web Crawler
Introduction

• History

• Architecture

• Configuration
Nutch Web Crawler
History

• Created by Doug Cutting


• Apache foundation project
• Written in Java
• Current version 1.1
Nutch Web Crawler
Architecture

• Runs on Hadoop
• Reliable, scalable, distributed computing
• Local (Standalone) Mode
• Pseudo-Distributed Mode
• Fully-Distributed Mode
• Uses Apache Lucene for storage
Nutch Web Crawler
Architecture - Tika

• Uses Apache Tika since version 1.1


• Used to parse incoming documents
• Supports
• HTML, XML, MS Office, OpenOffice, EPUB, audio, video, images, etc.
Nutch Web Crawler

• Integrates with Apache Solr via RESTful web services since version 1.0
Nutch Web Crawler
Highly configurable

• Allows you to control


• Number of crawl threads
• Depth of crawl
• Pages retrieved per depth level
• Many other options (see conf/nutch-default.xml and the sketch below)
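As a rough sketch, overrides go in conf/nutch-site.xml; the property names below come from nutch-default.xml, while the agent name and thread count are purely illustrative:

<?xml version="1.0"?>
<configuration>
  <!-- identify the crawler; Nutch will not fetch without an agent name -->
  <property>
    <name>http.agent.name</name>
    <value>MyJobsCrawler</value>
  </property>
  <!-- number of fetcher threads (the "crawl threads" setting above) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
</configuration>

Depth of crawl and pages retrieved per depth level are passed on the crawl command line (-depth and -topN) rather than set here.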
Nutch Web Crawler
Crawl Control

• Uses URL regex filters to decide crawl criteria (sketch below)
• All DrupalCons: +http://[a-z0-9].drupalcon.org/
• Just SF: +http://sf2010.drupalcon.org/
• Everything else: -^. (ignore)
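Those rules live in conf/regex-urlfilter.txt, one per line, and the sign of the first matching pattern decides whether a URL is fetched. A hedged sketch of the DrupalCon example above:

# conf/regex-urlfilter.txt: first matching pattern wins
# accept any DrupalCon site
+^http://[a-z0-9]+\.drupalcon\.org/
# ...or accept only San Francisco instead
# +^http://sf2010\.drupalcon\.org/
# ignore everything else
-^.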
Nutch Web Crawler
Crawl Breakdown

• Inject
• Generate
• Fetch
• Updatedb
• Mergesegs
• Invertlinks
• Handles link de-duplication
• Solrindex (or index)
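Each of those steps maps onto a Nutch command; roughly, one crawl pass looks like this (the paths, -topN value and Solr URL are illustrative):

bin/nutch inject crawl/crawldb urls

# repeat this block once per depth level
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/* | tail -1`    # the segment generate just created
bin/nutch fetch $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT

# after the loop
# bin/nutch mergesegs crawl/merged -dir crawl/segments   (optional consolidation)
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

The all-in-one bin/nutch crawl command wraps the same inject/generate/fetch/updatedb cycle.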
Nutch Web Crawler
Summary

• History
• Originally written by Doug Cutting
• Apache foundation project
• Architecture
• Runs either on Hadoop or locally
• Uses Apache Lucene as its storage
• Parses a large number of file formats using Apache Tika
• Configuration
• nutch-site.xml
• Crawl breakdown
Drupal Nutch Module
Introduction

• Architecture

• Configuration
Drupal Nutch Module
Configuration

• Set seed URLs for your crawl


• This can be one or many
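Under the hood Nutch reads its seeds from a plain-text URL list in the directory passed to inject, one URL per line. A sketch, where the file name and the second URL are hypothetical:

urls/seed.txt:
http://sf2010.drupalcon.org/
http://jobs.example.com/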
Drupal Nutch Module
Admin Settings Page

• Define the number of pages and depth to crawl per run
• Define the location of the Nutch crawler
• Define the location of the Java home
• Define merge control from Nutch into Solr
Drupal Nutch Module
Configuration

• Start and stop Nutch crawls from Drupal

• Or set up a cron run

Drupal Nutch Module
Configuration

• Allows you to define the mapping of Nutch fields to Solr fields
Drupal Nutch Module
Configuration

• Basic crawl statistics


• Number of pages crawled
• Current crawl status (if run in single mode)
Drupal Nutch Module
Summary

• Architecture
• Seed Crawl URLs
• Starting and Stopping of Nutch jobs
• Basic statistics of Nutch Crawls
• Configuration
• Number and Depth of Crawls
• How and when crawls occur
• URL regular expression to define crawl boundaries
Technical Design
Introduction

• Nutch Setup
• Mapping and Crawling
• Understanding data
• Search results and displaying data
Technical Design
Nutch Setup

• Setting up Nutch
• Hadoop
• Local instance
Technical Design
Mapping

• Deciding on field mappings between Nutch and the Drupal Solr schema
<mapping>
  <!-- Any fields in NutchDocument that match a name defined in
       field/@source will be renamed to the corresponding field/@dest.
       Additionally, if a field name (before mapping) matches a
       copyField/@source then its values will be copied to the
       corresponding copyField/@dest.
       uniqueKey has the same meaning as in Solr schema.xml
       and defaults to "id" if not defined. -->
  <fields>
    <field dest="teaser" source="content"/>
    <field dest="title" source="title"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="id" source="url"/>
    <copyField source="content" dest="text"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
Technical Design
Crawling

• Crawling data
• There’s a lot of s#*t out there (working title)
• Data filtering
• Crawl now filter later
• Filter before crawl
• Apache Tika data filtering
• Strips markup from documents
Technical Design
Search Results

• Search results mixed with Drupal results
• Launch out?
• The Solr Virtual Page with Apache Solr Views
• Search results in a separate core
• Apache Solr Views and Multicore
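For the separate-core option, a hedged sketch of a Solr 1.4-era multicore solr.xml; the core names and instance directories are illustrative:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- the site's normal Drupal search index -->
    <core name="drupal" instanceDir="drupal"/>
    <!-- the index Nutch writes into via solrindex -->
    <core name="nutch" instanceDir="nutch"/>
  </cores>
</solr>

Keeping the crawled pages in their own core keeps them out of the site's own search index, and Apache Solr Views can then query either core.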
Technical Design
Summary

• Nutch Setup
• Hadoop vs Local
• Mapping and Crawling
• solrindex-mapping.xml
• regex url-filters
• Understanding data
• Tika strips all markup/structure
• Search results and displaying data
• Apache Solr Views and Views 3 are great for data display
• Launch out to site or display in virtual page?
Jobs Aggregation Demo
What Next?

• Hadoop integration
• Geoparsing Support
• Lucene API module support
• QueryPath Support
Questions?
