Learning to Rank
Solr Learning to Rank Plugin
8 million searches per day
1 million per day
400 million stories in the index
SOLR IN BLOOMBERG
Search engine of choice at Bloomberg
Large community / Well distributed committers
Open source Apache Project
Used within many commercial products
Large feature set and rapid growth
Committed to open-source
Ability to contribute to core engine
Ability to fix bugs ourselves
Contributions in almost every Solr release since 4.5.0
PROBLEM SETUP
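[Figure: search results with score: 30 and score: 1.0]
PROBLEM SETUP
[Figure: search results with score: 52.2 and score: 30.8]
score = 100 [term lost in extraction] + 10 [term lost in extraction]
PROBLEM SETUP
score = [weighted sum of terms; weights and terms lost in extraction]
PROBLEM SETUP
score = [weighted terms lost in extraction] + 5 timeElapsedFromLastUpdate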
PROBLEM SETUP
It's hard to manually tweak the ranking
You must be an expert in the domain
or a magician
score = [weighted terms lost in extraction] + 5 timeElapsedFromLastUpdate
Example queries: solr, lucene, austin, bloomberg
PROBLEM SETUP
It's easier with Machine Learning
2,000+ parameters (non-linear, factorially larger than linear form)
8,000+ queries that are regularly tuned
[Diagram: Index (People, Commodities, News, Other Sources) → top-x retrieval (x >> k) → ranking model → top-k reranked]
TRAINING PIPELINE (OFFLINE)
[Diagram: Training query-document pairs + Index (People, Commodities, News, Other Sources) → Feature extraction → Learning algorithm → Ranking model, evaluated with Metrics]
TRAINING DATA: IMPLICIT VS EXPLICIT
What is explicit data?
A set of judges will assess the search results manually given a query
Experts
Crowd
Pros:
Data is very clean
Cons:
Can be very expensive!
FEATURES
A feature is an individual measurable property
Given a query and a collection, we can produce many features for each document in the collection
If the query matches the title
Length of the document
Number of views
How old is it?
Can it be viewed on a mobile device?
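The feature list above can be sketched as a small function that maps one query-document pair to named numeric values. This is only an illustration: the document fields (`title`, `body`, `views`, `published`, `mobile_friendly`) and the feature names are hypothetical and are not taken from the plugin.

```python
from datetime import datetime, timezone

def extract_features(query, doc, now):
    """Return a dict of feature name -> float for one query-document pair.

    All field and feature names here are hypothetical placeholders.
    """
    query_terms = set(query.lower().split())
    title_terms = set(doc["title"].lower().split())
    age_days = (now - doc["published"]).total_seconds() / 86400.0
    return {
        # 1.0 if any query term appears in the title
        "titleMatch": 1.0 if query_terms & title_terms else 0.0,
        # document length in whitespace-separated tokens
        "docLength": float(len(doc["body"].split())),
        "views": float(doc["views"]),
        "ageDays": age_days,
        "mobileFriendly": 1.0 if doc["mobile_friendly"] else 0.0,
    }

doc = {
    "title": "Solr Learning to Rank Plugin",
    "body": "Rerank the top documents with a machine learned model",
    "views": 1200,
    "published": datetime(2016, 1, 1, tzinfo=timezone.utc),
    "mobile_friendly": True,
}
now = datetime(2016, 1, 11, tzinfo=timezone.utc)
features = extract_features("solr ranking", doc, now)
print(features)
```

Each document retrieved for a query would get one such vector, and the vectors are what the learning algorithm sees.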
FEATURES
Extract features
Query: AAPL US
Was the result a cofounder? 0
Does the document have an exec. position? 1
METRICS
How do we know if our model is doing better?
Offline metrics
Precision/Recall/F1 score
nDCG (Normalized Discounted Cumulative Gain)
Other metrics (e.g., ERR, MAP)
Online Metrics
Click-through rates higher
Time to first click lower
Interleaving [1]
[1] O. Chapelle, T. Joachims, F. Radlinski, and Y. Yue. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems, 30(1), 2012.
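As an illustration of the offline metrics above, here is a minimal nDCG@k sketch. It assumes graded relevance judgments listed in ranked order and uses the common gain form 2^rel - 1 with a log2 position discount; other gain and discount choices exist, and the slides do not say which variant was used.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades of the results in the order each ranker returned them.
before = [1, 0, 3]
after = [3, 1, 0]
print(ndcg(before, 3), ndcg(after, 3))
```

A ranking that places the most relevant documents first reaches 1.0, so the second ordering scores higher than the first.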
LEARNING TO RANK
Learn how to combine the features for optimizing one or more metrics
Many learning algorithms
RankSVM [1]
LambdaMART [2]
[1] T. Joachims. Optimizing Search Engines Using Clickthrough Data. Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2002.
[2] C. J. C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82, 2010.
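To make "learning how to combine the features" concrete, the sketch below learns linear weights from preference pairs with a simple hinge-loss update. It is in the spirit of pairwise methods such as RankSVM but is a simplified stand-in (no regularization, plain per-pair updates), not the RankSVM solver or LambdaMART. The two features and the training pairs are made up for illustration.

```python
def train_pairwise(pairs, n_features, epochs=100, lr=0.1):
    """Learn weights w so that w.better exceeds w.worse by a margin of 1.

    pairs: list of (better_features, worse_features) for the same query.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:  # hinge loss is active: push the pair apart
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, x):
    """Linear ranking score for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical features: [titleMatch, popularity]
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([1.0, 0.5], [0.0, 0.4]),
    ([0.0, 0.8], [0.0, 0.1]),
]
w = train_pairwise(pairs, n_features=2)
print(w)
```

After training, every preferred document in `pairs` scores above its less relevant counterpart, which is the property a pairwise ranker optimizes for.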
SEARCH PIPELINE: STANDARD
[Diagram: User query → Solr index (People, Commodities, News, Other Sources) → top-k retrieval]
SEARCH PIPELINE: STANDARD
[Diagram: User query → Solr index (People, Commodities, News, Other Sources) → top-k retrieval. Offline: training data → learning algorithm → ranking model]
SEARCH PIPELINE: STANDARD
[Diagram: User query → Solr index (People, Commodities, News, Other Sources) → top-k retrieval → online ranking model → top-x reranked]
SEARCH PIPELINE: SOLR INTEGRATION
[Diagram: User query → Solr index (People, Commodities, News, Other Sources) → top-k retrieval → online ranking model → top-x reranked]
SOLR RELEVANCY
Pros
Simple and quick scoring computation
Phrase matching
Function query boosting on time, distance, popularity, etc.
Customized fields for stemming, synonyms, etc.
Cons
Lots of manual time for creating a well-tuned query
Weights are brittle, and may not be compatible in the future with more documents or fields added
LTR PLUGIN: GOALS
Don't tune the relevancy manually!
Uses machine learning to power automatic relevancy tuning
[Diagram: Index (People, Commodities, News, Other Sources) → top-k retrieval]
STANDARD SOLR SEARCH REQUEST
[Diagram: User query → Solr query → Index (People, Commodities, News, Other Sources) [10 million] → Matches [10k] → Score [10k] → Top-10 retrieval]
LTR SOLR SEARCH REQUEST
[Diagram: User query → Solr query → Index (People, Commodities, News, Other Sources) [10 million] → Matches [10k] → Score [10k] → Top-1000 retrieval → LTR query → Ranking model → Top-10 reranked]
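The two-phase request above (cheap Solr scoring for all matches, then the model applied only to the top window) can be sketched as follows. This is a toy illustration of the idea, not the plugin's implementation; `rerank`, the candidate tuples, and the stand-in model are hypothetical.

```python
def rerank(candidates, model_score, rerank_docs, rows):
    """Rerank only the top `rerank_docs` candidates with the model.

    candidates: list of (doc_id, solr_score, features), already sorted by
    solr_score in descending order. Documents outside the rerank window
    keep their original relative order behind the reranked ones.
    """
    head = candidates[:rerank_docs]
    tail = candidates[rerank_docs:]
    reranked = sorted(head, key=lambda c: model_score(c[2]), reverse=True)
    return [doc_id for doc_id, _, _ in (reranked + tail)[:rows]]

candidates = [
    ("a", 52.2, {"popularity": 0.1}),
    ("b", 30.8, {"popularity": 0.9}),
    ("c", 12.0, {"popularity": 0.5}),
    ("d", 3.0, {"popularity": 1.0}),
]
# Stand-in "model": rank by a single hypothetical feature.
top = rerank(candidates, lambda f: f["popularity"], rerank_docs=3, rows=2)
print(top)
```

Document "d" would score highest under the model but is never considered, because it falls outside the rerank window. That is the trade-off of reranking a top-1000 window instead of scoring all 10k matches.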
LTR PLUGIN: RERANKING
<!-- Query parser used to rerank top docs with a provided model -->
<queryParser name="ltr" class="org.apache.solr.ltr.ranking.LTRQParserPlugin" />
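With the `ltr` query parser registered, a rerank request might be built as in the sketch below. The `rq` parameter and the `model` and `reRankDocs` local parameters follow the LTR module distributed with Apache Solr; the plugin version shown in these slides may use different names, so treat them as assumptions. The helper `ltr_query_url` is hypothetical.

```python
from urllib.parse import urlencode

def ltr_query_url(base_url, user_query, model, rerank_docs, rows=10):
    """Build a select URL that reranks the top `rerank_docs` hits with `model`."""
    params = {
        "q": user_query,
        # Assumed local-params syntax for the "ltr" query parser.
        "rq": "{!ltr model=%s reRankDocs=%d}" % (model, rerank_docs),
        "fl": "*,score",
        "rows": rows,
    }
    return base_url + "/select?" + urlencode(params)

url = ltr_query_url(
    "http://yoursolrserver/solr/collection", "unemployment", "mymodel1", 1000
)
print(url)
```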
Query intent
Personalization
SEARCH PIPELINE (ONLINE)
[Diagram: User query → Index (People, Commodities, News, Other Sources) [10 million] → Matches [10k] → Ranking model → Top-10 reranked]
FEATURES
Extract features
Deploy
curl -XPUT 'http://yoursolrserver/solr/collection/config/fstore' \
  --data-binary @./features.json -H 'Content-type:application/json'
View
http://yoursolrserver/solr/collection/config/fstore
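The same upload can be scripted. The sketch below builds the equivalent PUT request with Python's standard library without sending it. The helper name `build_store_upload` is hypothetical, and the payload is a placeholder, not the plugin's feature schema.

```python
import json
import urllib.request

def build_store_upload(base_url, store, payload):
    """Build a PUT request for a managed store such as 'fstore' or 'mstore'."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url="%s/config/%s" % (base_url, store),
        data=body,
        method="PUT",
        headers={"Content-type": "application/json"},
    )

# Placeholder payload; the real contents would come from features.json.
request = build_store_upload(
    "http://yoursolrserver/solr/collection", "fstore", [{"name": "popularity"}]
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would actually send it.
```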
LTR PLUGIN: FEATURES
Simplifies feature engineering through configuration file
Utilizes rich search functionality built-in to Solr
Phrase matching
Synonyms, stemming, etc.
TRAINING PIPELINE (OFFLINE)
[Diagram: Training queries → Index (People, Commodities, News, Other Sources) [10 million] → Matches [10k] → Learning algorithm → Ranking model]
FEATURES
Extract features
fl = *, [fv]
{
  "name": "Tim Cook",
  "primary_position": "ceo",
  "category": "person",
  "[fv]": "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
}
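For the offline training pipeline, the `[fv]` string returned with each document has to be turned back into numbers. A minimal parsing sketch, assuming the comma-separated `name:value` layout shown in the response above (the helper name `parse_feature_vector` is hypothetical):

```python
def parse_feature_vector(fv):
    """Parse 'name:value, name:value, ...' into a dict of floats."""
    features = {}
    for item in fv.split(","):
        name, _, value = item.strip().partition(":")
        features[name] = float(value)
    return features

fv = "isCofounder:0.0, isPersonAndExecutive:1.0, matchTitle:0.0, popularity:0.9"
parsed = parse_feature_vector(fv)
print(parsed)
```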
LTR PLUGIN: MODEL
{
  "type": "org.apache.solr.ltr.ranking.LambdaMARTModel",
  "name": "mymodel1",
  "features": [
    {"name": "matchedTitle"},
    {"name": "isPersonAndExecutive"}
  ],
  "params": {
    "trees": [
      {
        "weight": 1,
        "tree": {
          "feature": "matchedTitle",
          "threshold": 0.5,
          "left": {"value": -100},
          "right": {
            "feature": "isPersonAndExecutive",
            "threshold": 0.5,
            "left": {"value": 50},
            "right": {"value": 75}
          }
        }
      }
    ]
  }
}
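To show how a tree model like the one above turns features into a score, here is a small evaluator. It assumes a document goes to the "left" branch when its feature value is less than or equal to the threshold and to the "right" branch otherwise, and that the final score is the weighted sum over trees. Both are assumptions about the plugin's semantics, and the function names are hypothetical.

```python
model = {
    "params": {
        "trees": [
            {
                "weight": 1,
                "tree": {
                    "feature": "matchedTitle",
                    "threshold": 0.5,
                    "left": {"value": -100},
                    "right": {
                        "feature": "isPersonAndExecutive",
                        "threshold": 0.5,
                        "left": {"value": 50},
                        "right": {"value": 75},
                    },
                },
            }
        ]
    }
}

def score_tree(node, features):
    """Walk one tree until a leaf ('value') is reached."""
    while "value" not in node:
        value = features.get(node["feature"], 0.0)
        # Assumed convention: value <= threshold goes left, otherwise right.
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["value"]

def score_model(model, features):
    """Weighted sum of the leaf values selected in each tree."""
    return sum(
        entry["weight"] * score_tree(entry["tree"], features)
        for entry in model["params"]["trees"]
    )

print(score_model(model, {"matchedTitle": 1.0, "isPersonAndExecutive": 1.0}))
```

Under these assumptions, a document whose title does not match is pushed down with -100, while a title match on a person with an executive position gets the highest leaf value, 75.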
LTR PLUGIN: MODEL
ModelStore is also a Solr Managed Resource
Deploy
curl -XPUT 'http://yoursolrserver/solr/collection/config/mstore' \
  --data-binary @./model.json -H 'Content-type:application/json'
View
http://yoursolrserver/solr/collection/config/mstore
BEFORE AND AFTER
Query: unemployment
[Screenshots: Solr ranking vs. machine-learned reranking]
LTR PLUGIN: EVALUATION
Offline Metrics
nDCG increased approximately 10% after reranking
Online Metrics
Clicks @ 1 up by approximately 10%
Performance
About 30% faster than previous external ranking system
10 million documents in collection
100k queries
1k features
1k documents/query reranked
LTR PLUGIN: BENEFITS
Simpler feature engineering, without compiling
Access to rich internal Solr search functionality for feature building