ElasticSearch 0.90 – Similarities

The next ElasticSearch 0.90 feature we would like to discuss is again tied to what Lucene 4.0 introduced: changes to the API of the classes responsible for the scoring formula. In addition to the changed API, Apache Lucene 4.0 introduced a few relevance calculation formulas that are available out of the box. Starting with ElasticSearch 0.90.0.Beta1, we can use those new scoring formulas as well.

Introduction

We won’t talk about the API changes, because from the user’s perspective they don’t matter unless you want to develop your own custom Similarity class. Apart from scripts, we haven’t covered writing custom plug-ins for ElasticSearch, and we will stick with that, at least for now. Let’s focus on what is available from the ElasticSearch user’s perspective.

Introduced Similarities

Apache Lucene 4.0 (and thus 4.1, on which ElasticSearch 0.90.0.Beta1 is based) makes the following similarity implementations available: TF/IDF (the default), BM25, DFR (divergence from randomness) and IB (information-based).

I know, that is a lot of acronyms, but I promise I’ll stop now and get to the point: showing you how to use them in ElasticSearch.

Similarities Available in ElasticSearch

All of the similarities mentioned above are available in ElasticSearch, but some of them require additional configuration. The TF/IDF and BM25 similarities can be used without any additional configuration, just by adding them to your field definition. The ones that require additional configuration are the last two: the DFR and IB similarities. We will show you how to configure both of them at the end of this post.

Mappings Once Again

Before continuing, let’s recall the mappings presented in the first chapter of the book. The mappings for the post type were as follows:

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

Specifying Similarity on per-field Basis

To tell ElasticSearch that we want to use a similarity other than the default TF/IDF, we need to add the similarity property to our field definition. So, if we would like to use the BM25 similarity for our name field, we would have the following field definition:

"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }

So the whole mappings definition would look like this:

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}
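If you generate your mappings from code rather than editing posts.json by hand, the same change can be sketched in a few lines of Python. This is only an illustration: the with_similarity helper and the BASE_MAPPINGS name are made up for this example and are not part of ElasticSearch or any client library.

```python
import copy
import json

# The base mappings from this post, written as a Python dict.
BASE_MAPPINGS = {
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes", "precision_step": "0"},
                "name": {"type": "string", "store": "yes", "index": "analyzed"},
                "published": {"type": "date", "store": "yes", "precision_step": "0"},
                "contents": {"type": "string", "store": "no", "index": "analyzed"},
            }
        }
    }
}


def with_similarity(mappings, doc_type, field, similarity):
    """Return a copy of `mappings` with `similarity` set on a single field."""
    result = copy.deepcopy(mappings)  # leave the original mappings untouched
    result["mappings"][doc_type]["properties"][field]["similarity"] = similarity
    return result


# Hypothetical usage: switch only the name field to BM25.
bm25_mappings = with_similarity(BASE_MAPPINGS, "post", "name", "BM25")

# The printed JSON could be saved as posts.json.
print(json.dumps(bm25_mappings, indent=1))
```

Working on a deep copy keeps the base mappings reusable, so the same starting point can later be given a different similarity.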

Let’s Test It

We store the modified mappings in the posts.json file and now we want to see if they work. We do that by sending the following command, which creates a new index called posts:

$ curl -XPOST 'localhost:9200/posts' -d @posts.json

And then we check the mappings by running:

$ curl -XGET 'localhost:9200/posts/_mapping?pretty'

The response returned by ElasticSearch was as follows:

{
 "posts" : {
  "post" : {
   "properties" : {
    "contents" : {
     "type" : "string"
    },
    "id" : {
     "type" : "long",
     "store" : true,
     "precision_step" : 2147483647
    },
    "name" : {
     "type" : "string",
     "store" : true,
     "similarity" : "BM25"
    },
    "published" : {
     "type" : "date",
     "store" : true,
     "precision_step" : 2147483647,
     "format" : "dateOptionalTime"
    }
   }
  }
 }
}

As you can see in the name field mappings, our BM25 similarity was taken into consideration by ElasticSearch.
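If you would rather not read the response by eye, the same check can be sketched in Python. The snippet below works on a trimmed copy of the response pasted in as text, so it does not need a running cluster. The field_similarity helper is a hypothetical name introduced for this example.

```python
import json

# A trimmed copy of the _mapping response shown above, pasted as text.
RESPONSE_TEXT = """
{
 "posts" : {
  "post" : {
   "properties" : {
    "contents" : { "type" : "string" },
    "name" : { "type" : "string", "store" : true, "similarity" : "BM25" }
   }
  }
 }
}
"""


def field_similarity(response, index, doc_type, field):
    """Return the similarity set on a field, or None if none is reported."""
    properties = response[index][doc_type]["properties"]
    return properties[field].get("similarity")


response = json.loads(RESPONSE_TEXT)
print(field_similarity(response, "posts", "post", "name"))
print(field_similarity(response, "posts", "post", "contents"))
```

A field without an explicit similarity simply has no similarity key in the response, which is why the helper returns None for contents.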

Configuring DFR and IB Similarities

As mentioned above, in order to use the DFR or IB similarities we need to configure them first. We can do that in the index settings section of the file that holds our mappings.

DFR Similarity

Let’s start with the DFR similarity configuration. We need to add the similarity section containing our similarity configuration to the index settings:

"similarity" : {
 "esserverbook_dfr_similarity" : {
  "type" : "DFR",
  "basic_model" : "g",
  "after_effect" : "l",
  "normalization" : "h2",
  "normalization.h2.c" : "2.0"
 }
}

Our configured DFR similarity will be available under the name esserverbook_dfr_similarity. The possible values of the basic_model property are be, d, g, if, in, ine and p. The possible values of the after_effect property are no, b and l. The normalization property can be set to no, h1, h2, h3 or z. In addition, each normalization except no accepts its own parameter: normalization.h1.c for h1, normalization.h2.c for h2, normalization.h3.c for h3 and normalization.z.z for z.
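With that many option values it is easy to mistype one, so here is a minimal Python sketch that builds a DFR configuration and rejects values not listed above. The dfr_similarity function is a hypothetical helper written for this post, and it only knows the option values named in the previous paragraph.

```python
# Option values as listed in this post.
BASIC_MODELS = {"be", "d", "g", "if", "in", "ine", "p"}
AFTER_EFFECTS = {"no", "b", "l"}
# Each normalization mapped to the name of its parameter (None = no parameter).
NORMALIZATION_PARAMS = {
    "no": None,
    "h1": "normalization.h1.c",
    "h2": "normalization.h2.c",
    "h3": "normalization.h3.c",
    "z": "normalization.z.z",
}


def dfr_similarity(basic_model, after_effect, normalization, param=None):
    """Build a DFR similarity configuration, validating every option value."""
    if basic_model not in BASIC_MODELS:
        raise ValueError("unknown basic_model: %r" % basic_model)
    if after_effect not in AFTER_EFFECTS:
        raise ValueError("unknown after_effect: %r" % after_effect)
    if normalization not in NORMALIZATION_PARAMS:
        raise ValueError("unknown normalization: %r" % normalization)

    config = {
        "type": "DFR",
        "basic_model": basic_model,
        "after_effect": after_effect,
        "normalization": normalization,
    }
    if param is not None:
        key = NORMALIZATION_PARAMS[normalization]
        if key is None:
            raise ValueError("normalization 'no' takes no parameter")
        config[key] = str(param)  # the post passes the value as a string
    return config


# Rebuilds the configuration shown above.
esserverbook_dfr_similarity = dfr_similarity("g", "l", "h2", 2.0)
print(esserverbook_dfr_similarity)
```

The returned dict can be placed under the similarity name in the index settings.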

Now let’s see what the whole modified mappings file would look like:

{
 "settings" : {
  "index" : {
   "similarity" : {
    "esserverbook_dfr_similarity" : {
     "type" : "DFR",
     "basic_model" : "g",
     "after_effect" : "l",
     "normalization" : "h2",
     "normalization.h2.c" : "2.0"
    }
   }
  }
 }, 
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "esserverbook_dfr_similarity" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

IB Similarity

Now let’s have a look at the IB similarity configuration. Just as with the DFR similarity, we need to add the similarity section containing our similarity configuration to the index settings:

"similarity" : {
 "esserverbook_ib_similarity" : {
  "type" : "IB",
  "distribution" : "ll",
  "lambda" : "df",
  "normalization" : "z",
  "normalization.z.z" : "0.25"
 }
}

Our configured IB similarity will be available under the name esserverbook_ib_similarity. The possible values of the distribution property are ll and spl. The possible values of the lambda property are df and ttf. The normalization property can be set to no, h1, h2, h3 or z. As with DFR, each normalization except no accepts its own parameter: normalization.h1.c for h1, normalization.h2.c for h2, normalization.h3.c for h3 and normalization.z.z for z.

Now let’s see what the whole modified mappings file would look like:

{
 "settings" : {
  "index" : {
   "similarity" : {
    "esserverbook_ib_similarity" : {
     "type" : "IB",
     "distribution" : "ll",
     "lambda" : "df",
     "normalization" : "z",
     "normalization.z.z" : "0.25"
    }
   }
  }
 }, 
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "esserverbook_ib_similarity" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}
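Because a field refers to a custom similarity only by name, a typo in either place is easy to miss. The following Python sketch checks that every similarity named in the mappings is either built in or configured in the index settings. The undefined_similarities helper is hypothetical. BM25 as a built-in name comes from this post, while treating default as the built-in name for TF/IDF is an assumption of this sketch.

```python
# Names usable without configuration. "BM25" is used that way in this post;
# "default" (for TF/IDF) is an assumption made by this sketch.
BUILT_IN = {"default", "BM25"}

# The index definition from this post, written as a Python dict.
INDEX_BODY = {
    "settings": {
        "index": {
            "similarity": {
                "esserverbook_ib_similarity": {
                    "type": "IB",
                    "distribution": "ll",
                    "lambda": "df",
                    "normalization": "z",
                    "normalization.z.z": "0.25",
                }
            }
        }
    },
    "mappings": {
        "post": {
            "properties": {
                "id": {"type": "long", "store": "yes", "precision_step": "0"},
                "name": {"type": "string", "store": "yes", "index": "analyzed",
                         "similarity": "esserverbook_ib_similarity"},
                "published": {"type": "date", "store": "yes", "precision_step": "0"},
                "contents": {"type": "string", "store": "no", "index": "analyzed"},
            }
        }
    },
}


def undefined_similarities(body):
    """List (type, field, name) for each similarity that is not defined."""
    configured = set(body.get("settings", {}).get("index", {}).get("similarity", {}))
    known = BUILT_IN | configured
    missing = []
    for doc_type, mapping in body.get("mappings", {}).items():
        for field, definition in mapping.get("properties", {}).items():
            name = definition.get("similarity")
            if name is not None and name not in known:
                missing.append((doc_type, field, name))
    return missing


# An empty list means every referenced similarity is defined.
print(undefined_similarities(INDEX_BODY))
```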

One Last Check

Let’s do one last sanity check by verifying that the settings for the IB similarity were taken into consideration. To do that, we send the following command:

$ curl -XGET 'localhost:9200/posts/_settings?pretty'

And the response from ElasticSearch is as follows:

{
  "posts" : {
    "settings" : {
      "index.similarity.esserverbook_ib_similarity.distribution" : "ll",
      "index.similarity.esserverbook_ib_similarity.normalization.z.z" : "0.25",
      "index.similarity.esserverbook_ib_similarity.type" : "IB",
      "index.similarity.esserverbook_ib_similarity.lambda" : "df",
      "index.similarity.esserverbook_ib_similarity.normalization" : "z",
      "index.number_of_shards" : "5",
      "index.number_of_replicas" : "1",
      "index.version.created" : "900001"
    }
  }
}

As you can see, everything is in order.
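Note that the response lists the settings as flat, dot-separated keys, while we sent them as nested JSON. A small Python sketch of that flattening makes the two forms easy to compare. The flatten function is a hypothetical helper that only illustrates the key layout seen in the response above, not how ElasticSearch implements it.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into a flat dict with dot-separated keys."""
    flat = {}
    for key, value in settings.items():
        dotted = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))  # descend into nested sections
        else:
            flat[dotted] = value
    return flat


# The settings section we sent when creating the index.
SENT_SETTINGS = {
    "index": {
        "similarity": {
            "esserverbook_ib_similarity": {
                "type": "IB",
                "distribution": "ll",
                "lambda": "df",
                "normalization": "z",
                "normalization.z.z": "0.25",
            }
        }
    }
}

for key, value in sorted(flatten(SENT_SETTINGS).items()):
    print(key, "=", value)
```

Each flattened key should match one of the index.similarity lines in the response, which gives a quick way to confirm that nothing was dropped.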
