ElasticSearch 0.90 – Using codecs


As we promised to the readers of the ElasticSearch Server book (and not only to them), we will expand on the topics covered in the book. This is the first article focused on the new functionality of the recently released ElasticSearch 0.90.0 beta.

The first functionality we’ve decided to discuss is the ability to use codecs (the ones provided by the Apache Lucene library, but not only those). More features will come over time.

What is a Codec?

Many of you may not be familiar with what a Lucene codec actually is. With the release of Lucene 4.0, so-called flexible indexing was introduced. It allows developers to alter the way the data is written to and read from the Lucene index. This is not about whether a field is analyzed or not, but about the low-level format Lucene uses to store the data. In addition, we were given the possibility of writing our own codecs and telling Lucene to use them, just like we were able to develop custom tokenizers, filters and so on. The introduction of codecs allowed the index structure to change between minor versions of Lucene, while still letting you just upgrade your Lucene from 4.0 to 4.1 and have your index working, even though it was written with a different codec. Isn’t that nice?

Should I Use Codecs?

The answer to such a question depends on your actual needs. For example, you can use a bloom codec to have a bloom filter maintained for a given field. A bloom filter is a probabilistic structure that provides a very fast and efficient test of whether a value definitely doesn’t exist in a field. For high cardinality fields, like the ones holding document identifiers, this usually means faster queries on such fields. Look at your document structure, look at the possibilities ElasticSearch gives you, and then decide whether in your case a codec other than the default one should be used.
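To get a feeling for how a bloom filter can answer “this value is definitely not here” so quickly, here is a minimal, illustrative sketch using only the Python standard library. The class name, sizes and hashing scheme are our own invention for illustration; the implementation Lucene actually uses differs in hashing and sizing strategy.

```python
import hashlib

class BloomFilter:
    """A toy bloom filter: k hash functions set k bits per added value."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, value):
        # Derive `hashes` bit positions by salting the value with an index.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means "definitely absent"; True only means "possibly present".
        return all(self.bits[pos] for pos in self._positions(value))
```

A negative answer requires only a few bit lookups, with no disk seek into the terms dictionary at all, which is exactly why it pays off on high cardinality fields such as document identifiers.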

Mappings

Let’s recall the mappings that were present in the first chapter of the book, the one for the post type:

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

As you can see, they are quite simple: we have four fields, of which one is the unique identifier of our post. We will use these mappings and modify them by adding codecs.

Let’s Introduce the Codecs

ElasticSearch 0.90 brings us a few codecs, some of which are the ones available in the Apache Lucene library, while others are custom developed. The list is as follows:

  • default – the codec used when no explicit codec is defined. It provides on the fly compression of stored fields. You can read about the differences in index size with and without compression and how it affected Apache Solr index size in this post - http://solr.pl/en/2012/11/19/solr-4-1-stored-fields-compression/. Although it is about Solr, it gives a good picture of what to expect.
  • pulsing – a codec that inlines the postings list for low frequency terms into the terms dictionary, which results in one less seek Lucene needs to do when retrieving a document. Using this codec for high cardinality fields can speed up queries on such fields.
  • direct – a codec that during reads loads terms into arrays, which are held in memory. This codec may give you a performance boost on commonly used fields, but should be used with caution, as it is very memory intensive, because the terms and postings arrays need to be stored in memory.
  • memory – as its name suggests, this codec writes and reads the terms and postings lists to and from memory, using a structure called an FST (Finite State Transducer; more about this structure can be found in a great post by Mike McCandless: http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html).
  • bloom_default – an extension of the default codec which, in addition to all the data, stores a bloom filter on disk. This codec allows fast and efficient tests for non-existing values in a field, which is useful for high cardinality fields.
  • bloom_pulsing – similar to the bloom_default codec, but instead of extending the default codec it extends the pulsing one, bringing its functionality in addition to the bloom filter.
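The trick behind the pulsing codec from the list above can be sketched in a few lines. This is a toy model, not Lucene’s actual implementation: the function names, the threshold and the data layout are invented for illustration. The point is that for rare terms the postings are stored inline in the terms dictionary entry itself, so looking them up needs no second lookup into a separate postings store.

```python
INLINE_THRESHOLD = 1  # inline postings for terms appearing in at most 1 doc

def build_index(docs):
    """docs: {doc_id: text}. Returns (terms dictionary, postings store)."""
    postings = {}
    for doc_id, text in docs.items():
        for term in text.split():
            postings.setdefault(term, []).append(doc_id)

    term_dict = {}
    postings_file = []  # stands in for the separate on-disk postings file
    for term, plist in postings.items():
        if len(plist) <= INLINE_THRESHOLD:
            # Rare term: "pulse" the postings into the dictionary entry.
            term_dict[term] = ("inline", plist)
        else:
            # Common term: dictionary holds only a pointer into the store.
            term_dict[term] = ("ref", len(postings_file))
            postings_file.append(plist)
    return term_dict, postings_file

def lookup(term, term_dict, postings_file):
    kind, payload = term_dict[term]
    if kind == "inline":
        return payload              # one lookup, no extra "seek"
    return postings_file[payload]   # second lookup into the postings store
```

On a field where almost every term is unique, such as a document identifier, nearly every lookup takes the inline path, which is why pulsing pays off there.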

Adding Codec to Field Definition

Adding a codec to your field definition is very simple: one just needs to add the postings_format property to the field definition, with the proper codec name. For example, if we would like to use the pulsing codec for our id field, we would have a definition like this:

"id" : { "type" : "long", "store" : "yes", "precision_step" : "0", "postings_format" : "pulsing" }

So, the whole mappings definition would look like this:

{
 "mappings" : {
  "post" : {
   "properties" : {
    "id" : { "type" : "long", "store" : "yes", "precision_step" : "0", "postings_format" : "pulsing" },
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
    "published" : { "type" : "date", "store" : "yes", "precision_step" : "0" },
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }
   }
  }
 }
}

Let’s Test It

So we’ve modified the posts.json file by adding the codec definition to the id field and now we want to see if it works :) We use the following command to create a new index called posts:

$ curl -XPOST 'localhost:9200/posts' -d @posts.json

And then we check the mappings by running:

$ curl -XGET 'localhost:9200/posts/_mapping?pretty'

The response returned by ElasticSearch was as follows:

{
 "posts" : {
  "post" : {
   "properties" : {
    "contents" : {
     "type" : "string"
    },
    "id" : {
     "type" : "long",
     "store" : true,
     "postings_format" : "pulsing",
     "precision_step" : 2147483647
    },
    "name" : {
     "type" : "string",
     "store" : true
    },
    "published" : {
     "type" : "date",
     "store" : true,
     "precision_step" : 2147483647,
     "format" : "dateOptionalTime"
    }
   }
  }
 }
}

As you can see, the pulsing codec was registered for the id field, so everything is working as it should :)

One Last Thing

Starting with ElasticSearch 0.90.0 beta, the old method of compressing the _source field is no longer relevant, because the default codec compresses all the stored fields in the Lucene index by default. Please keep this in mind, as it is one of the changes compared to pre-0.90 ElasticSearch versions.
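For readers migrating older mappings: the pre-0.90 way of enabling compression looked roughly like the fragment below (a recollection of the old mapping option, shown only so you can recognize and drop it). With the 0.90 default codec it is simply no longer needed.

```json
{
 "mappings" : {
  "post" : {
   "_source" : { "compress" : true }
  }
 }
}
```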

One thought on “ElasticSearch 0.90 – Using codecs”

  1. Aditya Tripathi says:

    Hi,
    Nice post. Thanks.
    Had few Qs about defining postingsFormat at the field definition level.

    I am assuming this information will sit in the “attributes” HashMap of FieldInfo.

    Q1: When postingsFormat is defined at the field level, will the codec for SegmentInfo still be the Lucene default – Lucene4xCodec? (Of course, for the segment containing this field.)

    Q2:Will this PulsingPostingsFormat be picked up from the FieldInfo while writing/reading a segment?

    Q3: If Lucene4xCodec changes the PostingsFormat from PerFieldPostingsFormat to something else, will it still work?
