Better Term Centric Scoring In Elasticsearch

September 27, 2021
Elasticsearch version 7.13 introduced a new query, combined_fields, that brings better term-centric scoring to relevance engineers. Under the hood it uses the new Lucene query CombinedFieldsQuery (formerly known as BM25FQuery), which implements BM25F, a widely accepted extension of BM25 for multi-field search with weighting. Before 7.13, the multi_match query with "type": "cross_fields" (referred to as cross_fields for the remainder of this post) was the best option in Elasticsearch. This post discusses term-centric versus field-centric scoring and does a bake-off between the scoring of the old (cross_fields) and the new (combined_fields).

Term vs Field-centric is important for scoring

Term-centric and field-centric are two alternative strategies for token-based scoring in ranking. In term-centric scoring the entire document is treated as one large field. This puts less importance on the sections within the document; the goal is better matching when tokens are spread out or repeated across multiple sections.

In field-centric scoring the original sections are scored independently, each field with its own term statistics. The goal here is to reflect the varying importance of different sections, but this can create unevenness because IDF can vary widely between fields.

The behaviour of the commonly used minimum_should_match setting illustrates the difference between the two approaches. With "minimum_should_match": "100%", a field-centric query requires all tokens to match within a single field, whereas a term-centric query is more relaxed, requiring only that all tokens appear in the document – and these tokens could be in different fields.
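The difference can be sketched in a few lines of Python. This is an illustrative model of the matching logic only (not Elasticsearch internals), using a hypothetical document with two analyzed fields:

```python
# A toy document: each field holds its analyzed tokens.
doc = {
    "title": ["green", "lantern"],
    "overview": ["a", "marvel", "hero"],
}
query_tokens = ["green", "marvel", "hero"]

# Field-centric with minimum_should_match 100%:
# all query tokens must match within one single field.
field_centric_match = any(
    all(tok in tokens for tok in query_tokens) for tokens in doc.values()
)

# Term-centric with minimum_should_match 100%:
# each query token may match in any field of the document.
term_centric_match = all(
    any(tok in tokens for tokens in doc.values()) for tok in query_tokens
)

# field_centric_match is False: no single field contains all three tokens.
# term_centric_match is True: every token appears somewhere in the document.
```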

Old vs New in Elasticsearch (and Lucene)

In the old days (before v7.13) there was only one way to do term-centric scoring with field weighting: querying with multi_match { ..., "type": "cross_fields" }, a.k.a. cross_fields. In Lucene the scoring for cross_fields was done by the BlendedTermQuery, which would mix the scores from individual fields based on user-supplied field weights.

As Elasticsearch expert Mark Harwood writes:

“Searching for Mark Harwood across firstname and lastname fields should certainly favour any firstname:Mark over a lastname:Mark. Cross-fields was originally created because in these sorts of scenarios IDF would (annoyingly) ensure exactly the wrong field for a term was ranked highest.”

The cross_fields query would negate IDF for the most part, in order to ensure that scoring was similar across fields. Because it was originally conceived in the context of multi_match, there was also a desire to reward the "correct" field. To achieve this, the scoring function would add 1 to the document frequency of the most frequent field. While this worked in practice, the scoring was confusing and not grounded in theory. Let's consider some example queries:
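A rough sketch of the blending idea described above (this is a simplification for illustration, not Lucene's actual BlendedTermQuery code): every field scores the term with one shared, homogenized IDF derived from the highest per-field document frequency, plus one.

```python
import math

def blended_idf(doc_freqs_by_field, num_docs):
    """Homogenized IDF shared by all fields: the maximum document
    frequency across fields, plus 1, fed into the BM25 idf formula."""
    blended_df = max(doc_freqs_by_field.values()) + 1
    return math.log(1 + (num_docs - blended_df + 0.5) / (blended_df + 0.5))

# Hypothetical term statistics for "marvel" across two fields:
doc_freqs = {"title": 4, "overview": 3}
idf = blended_idf(doc_freqs, num_docs=8514)
# Both title:marvel and overview:marvel now score with this single IDF,
# so a rare copy of the term in one field no longer wins on IDF alone.
```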

cross_fields query

GET tmdb/_search
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}

combined_fields query

GET tmdb/_search
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title^3",
        "overview^2",
        "tagline"
      ]
    }
  }
}


The syntax for combined_fields is similar, but the scoring is different and is done by the new Lucene CombinedFieldsQuery, which implements BM25F. This is a variant of BM25 that adds the ability to weight individual fields. The field weights multiply the raw term frequency of each field before the individual field statistics are combined into document-level statistics. This does two big things: it captures relative field importance, and it establishes a more generalizable ranking formula than the one used by the cross_fields query.
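The BM25F idea can be sketched as follows. This is a minimal model of the single-term case under the assumptions stated in the comments; the exact normalization Lucene applies may differ in detail, and the statistics below are hypothetical:

```python
import math

def bm25f_term_score(tf_by_field, weight_by_field, dl, avgdl, df, num_docs,
                     k1=1.2, b=0.75):
    # The BM25F step: per-field term frequencies are scaled by field
    # weights and summed into one combined, document-level frequency.
    combined_tf = sum(weight_by_field[f] * tf for f, tf in tf_by_field.items())
    # Then the usual BM25 idf and saturation are applied once, at the
    # document level, rather than once per field.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    tf_norm = combined_tf / (combined_tf + k1 * (1 - b + b * dl / avgdl))
    return idf * tf_norm

# Hypothetical statistics for one term, with the boosts from the example
# query above (title^3, overview^2, tagline^1):
score = bm25f_term_score(
    tf_by_field={"title": 1, "overview": 2, "tagline": 0},
    weight_by_field={"title": 3.0, "overview": 2.0, "tagline": 1.0},
    dl=40.0, avgdl=38.0, df=5, num_docs=8514,
)
```

Because the weights scale term frequency before saturation, a boost on a short field like title raises the combined frequency rather than multiplying a separately computed per-field score.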

An example query

Using a version of The Movie Database (TMDB) that we have in this Elasticsearch sandbox on Github, I want to show the difference between combined_fields and cross_fields.

The _explain API

First, let's look at what the _explain API tells us about the document "Captain Marvel" (id 299537) for the query "green Marvel hero" in each case:

cross_fields

Request:

GET tmdb/_explain/299537
{
  "query": {
    "multi_match": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ],
      "type": "cross_fields"
    }
  }
}
Response:

{
"_index" : "tmdb",
"_type" : "_doc",
"_id" : "299537",
"matched" : true,
"explanation" : {
"value" : 14.744863,
"description" : "sum of:",
"details" : [
{
"value" : 10.636592,
"description" : "max of:",
"details" : [
{
"value" : 10.636592,
"description" : "weight(overview:marvel in 1190) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 10.636592,
"description" : "score(freq=2.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 7.7968216,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 3,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 8514,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.62010074,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 2.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 36.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 35.016327,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 7.7574196,
"description" : "weight(title:marvel in 1190) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 7.7574196,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 7.5453897,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 4,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 8513,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.46731842,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.1431928,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 4.1082706,
"description" : "max of:",
"details" : [
{
"value" : 4.1082706,
"description" : "weight(overview:hero in 1190) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 4.1082706,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 4.1554832,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 133,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 8514,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.4493811,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 36.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 35.016327,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}

In the _explain response for cross_fields, we can see that scoring is still done per field for each term before the per-field scores are rolled up (here with a max). With combined_fields this doesn't happen, because each term is scored just once against a synthetic field representing the combination of "title", "tagline" and "overview". Scoring each term once against the synthetic field homogenizes term statistics that may have varied drastically between fields with cross_fields.
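For reference, the equivalent _explain request for combined_fields (same document, unweighted fields as in the cross_fields request above) would look like this:

```json
GET tmdb/_explain/299537
{
  "query": {
    "combined_fields": {
      "query": "green Marvel hero",
      "fields": [
        "title",
        "overview",
        "tagline"
      ]
    }
  }
}
```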

First page of results

Next, I compare the first page of results (size: 30) from each query as tables. I added the Jaccard set similarity to show how much overlap there is between the two result sets. A Jaccard similarity of 1.0 is perfect overlap: the same 30 items in both result sets. A Jaccard similarity of 0.0 is no overlap: 60 different items between the two queries. Remember that Jaccard similarity is set-based and does not factor in position.
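The computation is simple: intersection size over union size of the two sets of returned documents. A small sketch, using hypothetical document ids:

```python
def jaccard(results_a, results_b):
    """Jaccard set similarity: |A ∩ B| / |A ∪ B|, ignoring position."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b)

# With 30 results per page and 22 shared documents:
# 22 / (30 + 30 - 22) = 22 / 38 ≈ 0.579
page_a = set(range(30))     # hypothetical ids from combined_fields
page_b = set(range(8, 38))  # hypothetical ids from cross_fields (22 shared)
similarity = jaccard(page_a, page_b)
```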

Jaccard similarity: 0.579

combined_fields
score      title      rank
20.483114 Captain Marvel 1
16.441910 Green Lantern: First Flight 2
15.406511 Jimmy Vestvood: Amerikan Hero 3
13.150019 Hulk 4
12.759342 The Man Who Killed Don Quixote 5
12.038399 Justice League: War 6
10.916338 Maverick 7
10.763498 The Extra Man 8
10.158279 Green Lantern: Emerald Knights 9
10.123980 Rambo 10
9.909670 The Odd Life of Timothy Green 11
9.797913 The Green Inferno 12
9.777215 Green Lantern 13
9.688647 The Green Berets 14
9.402530 Revenge of the Green Dragons 15
9.362038 The Punisher 16
9.341401 Green Book 17
9.081026 Green Street Hooligans 2 18
8.764002 Blinky Bill the Movie 19
8.744556 Chain Reaction 20
8.648538 Green Room 21
8.553925 How Green Was My Valley 22
8.370777 Fried Green Tomatoes 23
8.282112 Green Mansions 24
8.211758 Big Trouble in Little China 25
8.195307 The Green Mile 26
8.099191 Hardball 27
7.975277 Taxi 28
7.816612 Last Action Hero 29
7.787214 Green Zone 30
cross_fields
title score
Captain Marvel 31.48880
Jimmy Vestvood: Amerikan Hero 21.73690
Green Lantern: First Flight 18.18846
Hulk 17.29990
Green Mansions 16.13631
The Green Berets 16.13631
Green Zone 16.13631
The Green Hornet 16.13631
Green Room 16.13631
Green Lantern 16.13631
The Green Mile 16.13631
Green Book 16.13631
The Green Inferno 16.13631
Heroes 15.88856
Hero 15.88856
Green Lantern: Emerald Knights 15.71288
Justice League: War 15.45730
Maverick 15.06839
Chain Reaction 14.37306
Blinky Bill the Movie 13.89266
The Extra Man 13.58447
The Punisher 13.53983
Revenge of the Green Dragons 13.48916
Fried Green Tomatoes 13.48916
The Odd Life of Timothy Green 13.11556
Hero Wanted 12.77054
Heroes for Sale 12.77054
Almost Heroes 12.77054
Everyone’s Hero 12.77054
Kelly’s Heroes 12.77054

The Jaccard similarity of 0.579 highlights that a lot of different documents are surfaced by the combined_fields query compared to cross_fields. In this example 22 results are shared between the queries, while 16 appear in only one of them. This doesn't mean the differences are bad (or good), but it does mean there is some major churn in rankings between the two queries.

Another view of that same data, with a scatter plot, better shows the changes in position and scores for individual movies. The x-axis is the score from the cross_fields query and the y-axis is the score from the combined_fields query. Each dot is a document, and the dot color represents the positional shift when switching from cross_fields to combined_fields. Documents retrieved by only one of the two queries are represented as a tick mark along the axis of the query that retrieved them.

The top several results are consistent and the golden result “Hulk” is returned in position #4 for both queries. Note the score plateau in cross_fields at a score of 16.13. All of those documents got identical scores, so their relative position in the final ranked list is decided by the order they were indexed. This arbitrary tie-breaking doesn’t happen in combined_fields because there isn’t the same plateau effect with a single large field.

Visualizing search data like this is a great way to glean insights you might miss in bigger tables. Tables are great for inspecting individual records or comparing a handful of items, but graphics are a better form of communication when many data points are involved. Search is a "medium" data problem, with lots of queries and lots of results, so getting a good graphical grip on how it is performing will always help.

To the future with term-centric scoring

If you were using cross_fields, switching to combined_fields will shake up your results. But the benefits (general acceptance and scoring interpretability) of BM25F might make it worth it.

Besides differences in scoring, introducing combined_fields clarifies the split between term-centric and field-centric in the Elasticsearch API. Now we have multi_match for field-centric and combined_fields for term-centric. Having a clear API is a big reason why I think Elasticsearch has been so successful, so I'm really happy to see this trend continue.

I’m also pleased to see the effort Elastic is committing to keeping Elasticsearch (and Lucene) current with the best methods from academic publications. HNSW approximate nearest neighbor search – vector search – is right around the corner for Lucene, and Elastic is active in that effort too.

Do join us in Relevance Slack and let me know your comments or feedback – and if we can help you with these tricky scoring issues on your Elasticsearch cluster, get in touch.
