
Learning To Rank For Solr

Michael Nilsson Software Engineer


Diego Ceccarelli Software Engineer
Joshua Pantony Software Engineer
Bloomberg LP
OUTLINE
Search at Bloomberg
Why do we need machine learning for search?

Learning to Rank
Solr Learning to Rank Plugin
8 million searches PER DAY
1 million PER DAY
400 million stories in the index
SOLR IN BLOOMBERG
Search engine of choice at Bloomberg
Large community / Well distributed committers
Open source Apache Project
Used within many commercial products
Large feature set and rapid growth

Committed to open-source
Ability to contribute to core engine
Ability to fix bugs ourselves
Contributions in almost every Solr release since 4.5.0
PROBLEM SETUP

[Figure: two example documents, one scored 30 and one scored 1.0]
PROBLEM SETUP

[Figure: documents scored 52.2 and 30.8 by a hand-tuned formula of the form score = 100 × (…) + 10 × (…)]
PROBLEM SETUP

[Figure: the hand-tuned formula again, score = 100 × (…) + 10 × (…)]
PROBLEM SETUP

[Figure: the formula grows as more weighted terms are added]

PROBLEM SETUP

[Figure: a longer hand-tuned formula with several weighted terms, ending in + 5 × timeElapsedFromLastUpdate]
PROBLEM SETUP
It's hard to manually tweak the ranking
You must be an expert in the domain
or a magician

[Figure: the same formula (… + 5 × timeElapsedFromLastUpdate) must be re-tuned separately for each query: solr, lucene, austin, bloomberg, …]
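A rough reconstruction of the kind of formula these slides sketch; the weights 100, 10, and 5 appear on the slides, but the feature names attached to 100 and 10 are illustrative:

score(q, d) = 100 · titleMatch(q, d) + 10 · popularity(d) + … + 5 · timeElapsedFromLastUpdate(d)

Every coefficient is a knob someone has to set by hand, and the right settings differ per query and per collection.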
PROBLEM SETUP
It's easier with Machine Learning
2,000+ parameters (non-linear, a factorially larger space than the linear form)
8,000+ queries that are regularly tuned

Early on, we spent many days hand-tuning


SEARCH PIPELINE (ONLINE)
[Diagram: user query → index (Commodities, News, People, other sources) → top-x retrieval → reranking model → top-k reranked results, with x >> k]
TRAINING PIPELINE (OFFLINE)
[Diagram: training query–document pairs → feature extraction against the index (Commodities, News, People, other sources) → learning algorithm + ranking metrics → ranking model]
TRAINING PIPELINE (OFFLINE)
[Pipeline diagram repeated, highlighting where the training query–document pairs come from]
TRAINING DATA: IMPLICIT VS EXPLICIT
What is explicit data?
A set of judges manually assesses the search results for a given query
Experts
Crowd
Pros: Data is very clean
Cons: Can be very expensive!

What is implicit data?
Infer user preferences from user behavior
Aggregated result clicks
Query reformulations
Dwell time
Pros: A lot of data!
Cons: Extremely noisy; Privacy concerns
TRAINING PIPELINE (OFFLINE)
[Pipeline diagram repeated, highlighting the feature extraction stage]
FEATURES
A feature is an individual measurable property
Given a query and a collection, we can produce many features for each document in the collection
Whether the query matches the title
Length of the document
Number of views
How old is the document?
Can it be viewed on a mobile device?
FEATURES
Extract features

Features are signals that give an indication of a result's importance

Was the result a cofounder? → 0
FEATURES
Extract features

Features are signals that give an indication of a result's importance

Query: AAPL US
Was the result a cofounder? → 0
Does the document have an exec. position? → 1
FEATURES
Extract features

Features are signals that give an indication of a result's importance

Was the result a cofounder? → 0
Does the document have an exec. position? → 1
Does the query match the document title? → 0
FEATURES
Extract features

Features are signals that give an indication of a result's importance

Was the result a cofounder? → 0
Does the document have an exec. position? → 1
Does the query match the document title? → 0
Popularity (%) → 0.9
FEATURES
Extract features

Features are signals that give an indication of a result's importance

Was the result a cofounder? → 0
Does the document have an exec. position? → 0
Does the query match the document title? → 1
Popularity (%) → 0.6
TRAINING PIPELINE (OFFLINE)
[Pipeline diagram repeated, highlighting the ranking metrics stage]
METRICS
How do we know if our model is doing better?
Offline metrics
Precision/Recall/F1 score
nDCG (Normalized Discounted Cumulative Gain)
Other metrics (e.g., ERR, MAP, …)

Online Metrics
Click-through rate (higher is better)
Time to first click (lower is better)
Interleaving1

1O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
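For reference, the standard definitions (not specific to this talk): given a graded relevance label rel_i for the result at rank i,

DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1)
nDCG@k = DCG@k / IDCG@k

where IDCG@k is the DCG@k of the ideal ordering of the same results, so nDCG ranges from 0 to 1 and is comparable across queries.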
TRAINING PIPELINE (OFFLINE)
[Pipeline diagram repeated, highlighting the learning algorithm stage]
LEARNING TO RANK
Learn how to combine features to optimize one or more metrics
Many learning algorithms
RankSVM1
LambdaMART2

1T. Joachims, Optimizing Search Engines Using Clickthrough Data, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
2C.J.C. Burges, "From RankNet to LambdaRank to LambdaMART: An Overview", Microsoft Research Technical Report MSR-TR-2010-82, 2010.
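As an illustration (not from the talk): most learning-to-rank toolkits, e.g. SVMrank or RankLib, consume judged query–document pairs in a LETOR-style text format, one line per pair, here filled in with the feature values from the earlier slides:

3 qid:1 1:0.0 2:1.0 3:0.0 4:0.9  # relevant document for query 1
1 qid:1 1:0.0 2:0.0 3:1.0 4:0.6  # less relevant document, same query

The first number is the relevance label, qid groups pairs by query, and the featureId:value entries carry the extracted features.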
SEARCH PIPELINE: STANDARD
[Diagram: user query → Solr (Commodities, News, People index, other sources) → top-k retrieval]
SEARCH PIPELINE: STANDARD
[Diagram: the same top-k retrieval pipeline; offline, training data feeds a learning algorithm that produces a ranking model]
SEARCH PIPELINE: STANDARD
[Diagram: top-k retrieval from Solr, with an online ranking model reranking the retrieved results]
SEARCH PIPELINE: SOLR INTEGRATION
[Diagram: the same pipeline, with the ranking model now integrated inside Solr]
SOLR RELEVANCY
Pros
Simple and quick scoring computation
Phrase matching
Function query boosting on time, distance, popularity, etc
Customized fields for stemming, synonyms, etc

Cons
Lots of manual time to create a well-tuned query
Weights are brittle, and may not remain valid as more documents or fields are added
LTR PLUGIN: GOALS
Don't tune relevancy manually!
Uses machine learning to power automatic relevancy tuning

Significant relevancy improvements

Allow comparable scores across collections


Collections of different sizes

Maintaining low latency


Re-use the vast Solr search functionality that is already built-in
Less data transport

Makes it simple to use domain knowledge to rapidly create features


Features are no longer coded but rather scripted
STANDARD SOLR SEARCH REQUEST
[Diagram: user query → index (Commodities, News, People, other sources) → top-k retrieval]
STANDARD SOLR SEARCH REQUEST
[Diagram: a Solr query matches 10k of the 10 million indexed documents; all 10k matches are scored and the top 10 are returned]
LTR SOLR SEARCH REQUEST
[Diagram: the same query again matches and scores 10k documents, but now the top 1000 are retrieved and an LTR query applies the ranking model to rerank them, returning the top 10]
LTR PLUGIN: RERANKING
<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />

LTRQuery extends Solr's RankQuery


Wraps main query to fetch initial results
Returns custom TopDocsCollector for reranked ordered results

Solr rerank request parameter


rq={!ltr model=myModel1 reRankDocs=100 efi.user_query=james efi.my_var=123}
!ltr: name registered in solrconfig.xml for the LTRQParserPlugin
model: name of the deployed model to use for reranking
reRankDocs: total number of documents to rerank
efi.*: custom parameters passing external feature information to your features, e.g. query intent or personalization signals
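Putting it together, a full request might look like this (hypothetical server, collection, and query; the rq syntax is from the slide above). The -G/--data-urlencode flags make curl URL-encode the spaces and braces in the local params:

curl 'http://yoursolrserver/solr/collection/select' -G \
  --data-urlencode 'q=james' \
  --data-urlencode 'fl=*,score' \
  --data-urlencode 'rq={!ltr model=myModel1 reRankDocs=100 efi.user_query=james}'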
SEARCH PIPELINE (ONLINE)
[Diagram: the query matches 10k of 10 million documents; all 10k are scored, the top 1000 are retrieved, feature extraction runs on them, and the ranking model returns the top 10 reranked]
FEATURES
Extract features

Features are signals that give an indication of a result's importance

{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
}

Was the result a cofounder? → 0
Does the document have an exec. position? → 1
Does the query match the document title? → 0
Popularity (%) → 0.9
LTR PLUGIN: FEATURES BEFORE
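[Slide image lost; per the goals slide, features previously had to be coded and compiled rather than scripted]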
LTR PLUGIN: FEATURES AFTER
[
  {
    "name": "isPersonAndExecutive",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "fq": [
        "{!terms f=category}person",
        "{!terms f=primary_position}ceo, cto, cfo, president"
      ]
    }
  },
  …
]
LTR PLUGIN: FUNCTION QUERIES
[
  {
    "name": "documentRecency",
    "type": "org.apache.solr.ltr.feature.impl.SolrFeature",
    "params": {
      "q": "{!func}recip( ms(NOW,publish_date), 3.16e-11, 1, 1)"
    }
  },
  …
]

1 for docs dated now, 1/2 for docs dated 1 year ago, 1/3 for docs dated 2 years ago, etc.
See http://wiki.apache.org/solr/FunctionQuery#Date_Boosting
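Why those constants work: Solr's recip(x, m, a, b) computes a / (m·x + b), and one year is roughly 3.16 × 10^10 milliseconds, so

recip(ms(NOW,publish_date), 3.16e-11, 1, 1) = 1 / (3.16e-11 · ageInMs + 1)

which evaluates to 1 for a document published now, 1/2 at one year old (3.16e-11 · 3.16e10 ≈ 1), and 1/3 at two years old.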
LTR PLUGIN: FEATURE STORE
FeatureStore is a Solr Managed Resource
REST API endpoint for performing CRUD operations on Solr objects
Stored and maintained in ZooKeeper

Deploy
curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore'
--data-binary @./features.json -H 'Content-type:application/json'

View
http://yoursolrserver/solr/collection/config/fstore
LTR PLUGIN: FEATURES
Simplifies feature engineering through configuration file
Utilizes rich search functionality built into Solr
Phrase matching
Synonyms, Stemming, etc

Inherit the Feature class for specialized features


SEARCH PIPELINE (ONLINE)
[Online pipeline diagram repeated: 10k matches scored, top 1000 retrieved, feature extraction, ranking model returns the top 10 reranked]
TRAINING PIPELINE (OFFLINE)
[Diagram: training queries run through the same retrieval and feature extraction path; the extracted features feed the learning algorithm, which produces the ranking model]
FEATURES
Extract features

Features are signals that give an indication of a result's importance

{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
}

Was the result a cofounder? → 0
Does the document have an exec. position? → 1
Does the query match the document title? → 0
Popularity (%) → 0.9
LTR PLUGIN: FEATURE EXTRACTION
<!-- Document transformer adding feature vectors with each retrieved document -->
<transformer name="fv" class="org.apache.solr.ltr.ranking.LTRFeatureTransformer" />

Feature extraction uses Solr's TransformerFactory


Returns a custom field with each document

fl=*,[fv]

{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  …
  "[fv]": "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
}
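A request combining reranking with feature-vector output might look like this (hypothetical server, collection, and query; parameters as defined above):

curl 'http://yoursolrserver/solr/collection/select' -G \
  --data-urlencode 'q=tim cook' \
  --data-urlencode 'fl=*,[fv]' \
  --data-urlencode 'rq={!ltr model=mymodel1 reRankDocs=100}'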
LTR PLUGIN: MODEL
{
  "type": "org.apache.solr.ltr.ranking.LambdaMARTModel",
  "name": "mymodel1",
  "features": [
    { "name": "matchedTitle" },
    { "name": "isPersonAndExecutive" }
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "tree": {
          "feature": "matchedTitle",
          "threshold": 0.5,
          "left": { "value": -100 },
          "right": {
            "feature": "isPersonAndExecutive",
            "threshold": 0.5,
            "left": { "value": 50 },
            "right": { "value": 75 }
          }
        }
      }
    ]
  }
}
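Reading the tree: at each node the document's feature value is compared against the threshold, going left when it falls below and right otherwise (the usual decision-tree convention, assumed here). A document with matchedTitle = 0.0 goes left and this tree contributes -100 × weight 1 = -100; a document that matches the title and has an exec position goes right twice and scores 75. With multiple trees, the weighted contributions are summed.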
LTR PLUGIN: MODEL
ModelStore is also a Solr Managed Resource
Deploy
curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore'
--data-binary @./model.json -H 'Content-type:application/json'
View
http://yoursolrserver/solr/collection/config/mstore

Inherit from the model class for new scoring algorithms


score()
explain()
LTR PLUGIN: EVALUATION
Offline Metrics
nDCG increased approximately 10% after reranking

Online Metrics
Clicks @ 1 up by approximately 10%
BEFORE AND AFTER
Query: unemployment
[Screenshots: default Solr ranking vs. machine-learned reranking]
LTR PLUGIN: EVALUATION
Offline Metrics
nDCG increased approximately 10% after reranking

Online Metrics
Clicks @ 1 up by approximately 10%

Performance
About 30% faster than the previous external ranking system
10 million documents in collection
100k queries
1k features
1k documents/query reranked
LTR PLUGIN: BENEFITS
Simpler feature engineering, without compiling
Access to rich internal Solr search functionality for feature building

Search result relevancy improvements vs. regular Solr relevancy


Automatic relevancy tuning

Comparable scores across collections

Performance benefits vs. an external ranking system


FUTURE WORK
Continue work to open source the plugin
Support pipelining multiple reranking models

Allow a simple ranking model to be used in the first pass


QUESTIONS?
