The next ElasticSearch 0.90 feature we would like to discuss is again tied to what Lucene 4.0 introduced: the changes in the API of the classes responsible for the scoring formula. In addition to the changed API, Apache Lucene 4.0 introduced a few relevance calculation formulas that are available out of the box. Starting from ElasticSearch 0.90.0.Beta1, we can use those new scoring formulas as well.
Introduction
We won’t be talking about the API changes, because from the user’s perspective they don’t matter until you want to develop your own custom Similarity class. Apart from scripts, we haven’t talked about writing custom plug-ins for ElasticSearch, and we will stick with that, at least for now. Let’s focus on what is available from the ElasticSearch user’s perspective.
Introduced Similarities
Apache Lucene 4.0 (and thus 4.1, on which ElasticSearch 0.90.0.Beta1 is based) provides the following similarity implementations:
- TF/IDF – the Similarity class used by default, both by Lucene library and by ElasticSearch. The score of the documents returned for a query is tied to the term frequency (TF) and inverse document frequency (IDF). The whole explanation of how this works can be found in Lucene Javadocs (http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html). This Similarity can be used in ElasticSearch by using the default name.
- BM25 – a Similarity class based on a probabilistic model that estimates the probability that a document is relevant to a given query. More information about this Similarity and the maths behind it can be found on the Wikipedia page dedicated to it (http://en.wikipedia.org/wiki/Okapi_BM25). This Similarity can be used in ElasticSearch by using the BM25 name.
- DFR – Divergence from randomness Similarity based on the probabilistic model of the same name. More information about the implementation can be found in Apache Lucene Javadocs (http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/search/similarities/DFRSimilarity.html). This Similarity can be used in ElasticSearch by using the DFR name.
- IB – the last of the introduced Similarity classes, which, as the Lucene Javadocs say, is similar to divergence from randomness (http://lucene.apache.org/core/4_1_0/core/org/apache/lucene/search/similarities/IBSimilarity.html). In order to use this Similarity in ElasticSearch use the IB name.
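To make the BM25 entry a bit more concrete, here is a minimal sketch of the per-term scoring formula as it is written on the Wikipedia page linked above. The function names are ours, and the k1 and b values are only the commonly quoted ones; Lucene's own implementation may differ in details, so treat this as an illustration of the idea and not as the code ElasticSearch runs.

```python
import math


def bm25_idf(num_docs, doc_freq):
    # IDF component in the form given on the Wikipedia Okapi BM25 page.
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(term_freq, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    # k1 and b are illustrative values (assumption), not read from any index.
    # Length normalization: documents longer than average are penalized.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates as it grows instead of increasing linearly.
    tf_part = term_freq * (k1 + 1.0) / (term_freq + k1 * length_norm)
    return bm25_idf(num_docs, doc_freq) * tf_part


# Same term frequency, but one document is shorter than average
# and the other is longer than average.
short_doc = bm25_term_score(term_freq=2, doc_len=50, avg_doc_len=100,
                            num_docs=1000, doc_freq=10)
long_doc = bm25_term_score(term_freq=2, doc_len=200, avg_doc_len=100,
                           num_docs=1000, doc_freq=10)
print(short_doc > long_doc)
```

With the same number of term occurrences, the shorter document gets the higher score, which is the length normalization at work.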
I know, too much, not so clear information, but I promise I’ll stop now and get to the point – show you how to use those in ElasticSearch.
Similarities Available in ElasticSearch
All the above-mentioned Similarities are available in ElasticSearch, however some of them require additional configuration. The TF/IDF and the BM25 similarities can be used without any additional configuration, just by adding them to your field definition. The ones that require additional configuration are the last two, the DFR and IB similarities. We will show you how to configure both of them at the end of this post.
Mappings Once Again
Before continuing, let’s recall the mappings from the first chapter of the book. The mappings for the post type were as follows:
{
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
"published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
}
}
}
}
Specifying Similarity on per-field Basis
In order to tell ElasticSearch that we want to use a similarity other than the default TF/IDF, we need to add the similarity property to our field definition. So, if we would like to use the BM25 similarity for our name field, we would have the following field definition:
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }
So the whole mappings definition would look like this:
{
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" },
"published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
}
}
}
}
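If you prefer to generate the file instead of editing it by hand, the same change can be sketched in a few lines of Python. The with_similarity helper is our own name, not part of any ElasticSearch client; it only edits a dictionary that mirrors the mappings above and does not talk to the server.

```python
import copy
import json


def with_similarity(index_body, doc_type, field, similarity):
    # Work on a copy so the original definition stays untouched.
    updated = copy.deepcopy(index_body)
    updated["mappings"][doc_type]["properties"][field]["similarity"] = similarity
    return updated


# The original mappings for the post type, as a Python dictionary.
base_body = {
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes", "precision_step": "0"},
                "name": {"type": "string", "store": "yes", "index": "analyzed"},
                "published": {"type": "date", "store": "yes", "precision_step": "0"},
                "contents": {"type": "string", "store": "no", "index": "analyzed"},
            }
        }
    }
}

posts_body = with_similarity(base_body, "post", "name", "BM25")
# The printed JSON can be saved to a file and sent to ElasticSearch.
print(json.dumps(posts_body, indent=2))
```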
Let’s Test It
So we store our modified mappings in the posts.json file and now we want to see if it works. We do that by sending the following command to create a new index called posts:
$ curl -XPOST 'localhost:9200/posts' -d @posts.json
And then we check the mappings by running:
$ curl -XGET 'localhost:9200/posts/_mapping?pretty'
The response returned by ElasticSearch was as follows:
{
"posts" : {
"post" : {
"properties" : {
"contents" : {
"type" : "string"
},
"id" : {
"type" : "long",
"store" : true,
"precision_step" : 2147483647
},
"name" : {
"type" : "string",
"store" : true,
"similarity" : "BM25"
},
"published" : {
"type" : "date",
"store" : true,
"precision_step" : 2147483647,
"format" : "dateOptionalTime"
}
}
}
}
}
As you can see in the name field mappings our BM25 similarity was taken into consideration by ElasticSearch.
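The same check can be done without reading the response by eye. Below is a small sketch that parses a trimmed copy of the response shown above and looks up the similarity of a field. The field_similarity helper is a hypothetical name of ours, and it assumes the response layout of index name, then type name, then properties, exactly as in the output above.

```python
import json


def field_similarity(mapping_response, index, doc_type, field):
    # Returns the configured similarity name, or None when the field
    # definition carries no similarity property.
    field_def = mapping_response[index][doc_type]["properties"][field]
    return field_def.get("similarity")


# A trimmed copy of the _mapping response from above.
response_text = """
{
  "posts": {
    "post": {
      "properties": {
        "contents": {"type": "string"},
        "name": {"type": "string", "store": true, "similarity": "BM25"}
      }
    }
  }
}
"""
mapping_response = json.loads(response_text)

print(field_similarity(mapping_response, "posts", "post", "name"))
print(field_similarity(mapping_response, "posts", "post", "contents"))
```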
Configuring DFR and IB Similarities
As mentioned above, in order to use the DFR or IB similarities we need to configure them. We can do it in the index settings section, which is placed next to our mappings.
DFR Similarity
Let’s start with the DFR similarity configuration. We need to add the similarity section containing our similarity configuration to the index settings:
"similarity" : {
"esserverbook_dfr_similarity" : {
"type" : "DFR",
"basic_model" : "g",
"after_effect" : "l",
"normalization" : "h2",
"normalization.h2.c" : "2.0"
}
}
Our configured DFR similarity will be available to use under the name esserverbook_dfr_similarity. The possible options for basic_model property are: be, d, g, if, in, ine and p. The possible options for after_effect property are no, b and l. The normalization can be no, h1, h2, h3 or z. In addition to that for the h1 normalization we can specify the normalization.h1.c property, for the h2 we can specify the normalization.h2.c, for h3 we can specify the normalization.h3.c property and for the z normalization we can specify the normalization.z.z property.
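With that many options it is easy to mistype one of them, so here is a small sketch of a checker built only from the values listed above. The validate_dfr function is our own helper, not an ElasticSearch API. It assumes that type, basic_model, after_effect and normalization are all given, and it flags a normalization parameter that doesn't match the chosen normalization as a likely mistake, even though ElasticSearch itself may simply ignore such a key.

```python
# Allowed values, taken from the description above.
BASIC_MODELS = {"be", "d", "g", "if", "in", "ine", "p"}
AFTER_EFFECTS = {"no", "b", "l"}
# Each normalization and the extra parameter it accepts (None: no parameter).
NORMALIZATION_PARAMS = {
    "no": None,
    "h1": "normalization.h1.c",
    "h2": "normalization.h2.c",
    "h3": "normalization.h3.c",
    "z": "normalization.z.z",
}


def validate_dfr(settings):
    """Return a list of problems found in a DFR similarity definition."""
    problems = []
    if settings.get("type") != "DFR":
        problems.append("type must be DFR")
    if settings.get("basic_model") not in BASIC_MODELS:
        problems.append("unknown basic_model: %r" % settings.get("basic_model"))
    if settings.get("after_effect") not in AFTER_EFFECTS:
        problems.append("unknown after_effect: %r" % settings.get("after_effect"))
    normalization = settings.get("normalization")
    if normalization not in NORMALIZATION_PARAMS:
        problems.append("unknown normalization: %r" % normalization)
    # A normalization.* key should belong to the chosen normalization.
    allowed_param = NORMALIZATION_PARAMS.get(normalization)
    for key in settings:
        if key.startswith("normalization.") and key != allowed_param:
            problems.append("%s does not match normalization %r" % (key, normalization))
    return problems


dfr_settings = {
    "type": "DFR",
    "basic_model": "g",
    "after_effect": "l",
    "normalization": "h2",
    "normalization.h2.c": "2.0",
}
print(validate_dfr(dfr_settings))
```

An empty list means that no problems were found in the definition.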
Now let’s see what the whole modified mappings would look like:
{
"settings" : {
"index" : {
"similarity" : {
"esserverbook_dfr_similarity" : {
"type" : "DFR",
"basic_model" : "g",
"after_effect" : "l",
"normalization" : "h2",
"normalization.h2.c" : "2.0"
}
}
}
},
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "esserverbook_dfr_similarity" },
"published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
}
}
}
}
IB Similarity
Now let’s have a look at the IB similarity configuration. Just as with the DFR similarity, we need to add the similarity section containing our similarity configuration to the index settings:
"similarity" : {
"esserverbook_ib_similarity" : {
"type" : "IB",
"distribution" : "ll",
"lambda" : "df",
"normalization" : "z",
"normalization.z.z" : "0.25"
}
}
Our configured IB similarity will be available to use under the name esserverbook_ib_similarity. The possible options for distribution property are: ll and spl. The possible options for lambda property are df and ttf. The normalization can be no, h1, h2, h3 or z. As with the DFR similarity, for the h1 normalization we can specify the normalization.h1.c property, for the h2 we can specify the normalization.h2.c, for h3 we can specify the normalization.h3.c property and for the z normalization we can specify the normalization.z.z property.
Now let’s see what the whole modified mappings would look like:
{
"settings" : {
"index" : {
"similarity" : {
"esserverbook_ib_similarity" : {
"type" : "IB",
"distribution" : "ll",
"lambda" : "df",
"normalization" : "z",
"normalization.z.z" : "0.25"
}
}
}
},
"mappings" : {
"post" : {
"properties" : {
"id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "esserverbook_ib_similarity" },
"published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
}
}
}
}
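Because a field refers to a configured similarity only by its name, the settings and the mappings have to agree on that name. The sketch below checks that agreement on a dictionary shaped like the definition above. The undefined_similarities helper is hypothetical, and the set of names that need no configuration (default and BM25) is taken from the text of this post, not from the ElasticSearch sources.

```python
# Similarities usable without configuration, according to this post.
BUILT_IN = {"default", "BM25"}


def undefined_similarities(index_body):
    """List (type, field, name) for fields that use an unconfigured similarity."""
    settings = index_body.get("settings", {})
    defined = set(settings.get("index", {}).get("similarity", {}))
    problems = []
    for doc_type, type_def in index_body.get("mappings", {}).items():
        for field, field_def in type_def.get("properties", {}).items():
            name = field_def.get("similarity")
            if name is not None and name not in BUILT_IN and name not in defined:
                problems.append((doc_type, field, name))
    return problems


ib_body = {
    "settings": {"index": {"similarity": {
        "esserverbook_ib_similarity": {
            "type": "IB", "distribution": "ll", "lambda": "df",
            "normalization": "z", "normalization.z.z": "0.25",
        }
    }}},
    "mappings": {"post": {"properties": {
        "name": {"type": "string", "store": "yes", "index": "analyzed",
                 "similarity": "esserverbook_ib_similarity"},
        "contents": {"type": "string", "store": "no", "index": "analyzed"},
    }}},
}

print(undefined_similarities(ib_body))
# Dropping the settings section leaves the name field pointing at nothing.
print(undefined_similarities({"mappings": ib_body["mappings"]}))
```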
One Last Check
Let’s do one last sanity check by verifying that the settings for the IB similarity were taken into consideration. To do that we send the following command:
$ curl -XGET 'localhost:9200/posts/_settings?pretty'
And the response from ElasticSearch is as follows:
{
"posts" : {
"settings" : {
"index.similarity.esserverbook_ib_similarity.distribution" : "ll",
"index.similarity.esserverbook_ib_similarity.normalization.z.z" : "0.25",
"index.similarity.esserverbook_ib_similarity.type" : "IB",
"index.similarity.esserverbook_ib_similarity.lambda" : "df",
"index.similarity.esserverbook_ib_similarity.normalization" : "z",
"index.number_of_shards" : "5",
"index.number_of_replicas" : "1",
"index.version.created" : "900001"
}
}
}
As you can see everything is in order.
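Note that the response lists the settings as flat, dot-separated keys, while we sent them as nested JSON. The following sketch flattens the nested form the same way and checks that every key we sent shows up in a copy of the response above. The flatten_settings function is our own illustration of the relation between the two forms, not a description of how ElasticSearch does it internally.

```python
def flatten_settings(nested, prefix=""):
    # Join nested keys with dots, e.g. index -> similarity -> name -> type.
    flat = {}
    for key, value in nested.items():
        full_key = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_settings(value, full_key))
        else:
            flat[full_key] = str(value)
    return flat


# The settings we sent, in nested form.
sent_settings = {"index": {"similarity": {"esserverbook_ib_similarity": {
    "type": "IB",
    "distribution": "ll",
    "lambda": "df",
    "normalization": "z",
    "normalization.z.z": "0.25",
}}}}

# The settings part of the response shown above.
returned_settings = {
    "index.similarity.esserverbook_ib_similarity.distribution": "ll",
    "index.similarity.esserverbook_ib_similarity.normalization.z.z": "0.25",
    "index.similarity.esserverbook_ib_similarity.type": "IB",
    "index.similarity.esserverbook_ib_similarity.lambda": "df",
    "index.similarity.esserverbook_ib_similarity.normalization": "z",
    "index.number_of_shards": "5",
    "index.number_of_replicas": "1",
    "index.version.created": "900001",
}

flat_sent = flatten_settings(sent_settings)
# Every setting we sent should be present in the response with the same value.
missing = {k: v for k, v in flat_sent.items() if returned_settings.get(k) != v}
print(missing)
```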