Elasticsearch aggregations – from 1.1 to 1.4

With the recent release of Elasticsearch 1.4.0 Beta1, we decided that it is time to describe the aggregations that were added to Elasticsearch since 1.0 – the version that we used while writing Elasticsearch Server Second Edition. In this blog entry we will look at all the aggregations added between Elasticsearch 1.1 and Elasticsearch 1.4.

Cardinality aggregation

Introduced in Elasticsearch 1.1.0, the cardinality aggregation is a single valued metric aggregation that allows us to approximate the number of unique values present in a field. To estimate unique counts Elasticsearch uses the HyperLogLog++ algorithm and allows us to control the trade-off using the precision_threshold property (the higher the value, the higher the memory usage and the better the precision).

For example, to approximate the number of unique tags in our data, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "tags_count" : {
   "cardinality" : {
    "field" : "tags",
    "precision_threshold" : 200
   }
  }
 }
}'

Significant terms aggregation

The second aggregation introduced in Elasticsearch 1.1.0 allows us to get buckets of terms that have an unusual or interesting number of occurrences in a subset of documents. Of course, that doesn't mean that Elasticsearch will simply return the most popular terms from the index. To give you an example: a term that occurs in 10 documents out of 1 million in the index, but in 5 of the documents returned by a query, is a good candidate to be called significant.

It is hard to show a convincing example of the significant terms aggregation on a small data set, so for a detailed one please refer to the Elasticsearch documentation (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html).
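Still, a minimal sketch of such a query can show the shape of the request. Assuming our library index from the earlier examples, the following compares the tags of the available books (the foreground set defined by the query) against the tags of the whole index (the background set):

```shell
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "query" : {
  "term" : { "available" : true }
 },
 "aggregations" : {
  "significant_tags" : {
   "significant_terms" : {
    "field" : "tags"
   }
  }
 }
}'
```

The returned buckets are the tags that are unusually frequent among the available books compared to the index as a whole, not simply the most popular ones.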

Percentiles aggregation

The last aggregation added in Elasticsearch 1.1.0 allows us to calculate percentiles over numeric data taken from a field or generated by a script. By default Elasticsearch will return the 1st, 5th, 25th, 50th, 75th, 95th and 99th percentiles.

To calculate the percentiles of the price field of our books we would run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "price_percentiles" : {
   "percentiles" : {
    "field" : "price"
   }
  }
 }
}'
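The set of returned percentiles can be changed using the percents property. For example, a sketch of a query returning only the 10th, 50th and 90th percentiles of the price field could look like this:

```shell
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "price_percentiles" : {
   "percentiles" : {
    "field" : "price",
    "percents" : [ 10, 50, 90 ]
   }
  }
 }
}'
```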

Reverse nested aggregation

Introduced in Elasticsearch 1.2.0, reverse nested is a single bucket aggregation that allows us to aggregate on parent documents from within nested documents. It needs to be placed inside a nested aggregation.

To illustrate how reverse nested aggregation works let’s create an index using the following command:

curl -XPOST 'localhost:9200/images' -d '{
 "mappings" : {
  "image" : {
   "properties" : {
    "name" : { "type" : "string", "index" : "not_analyzed" },
    "author" : { "type" : "string", "index" : "not_analyzed" },
    "comments" : {
     "type" : "nested",
     "properties" : {
      "text" : { "type" : "string" },
      "nick" : { "type" : "string", "index" : "not_analyzed" }
     }
    }
   }
  }
 }
}'

Now, if we would like to show the number of images commented on by each nick, we could run the following query:

curl -XGET 'localhost:9200/images/_search?pretty' -d '{
 "aggregations" : {
  "comments" : {
   "nested" : {
    "path" : "comments"
   },
   "aggregations" : {
    "commenters" : {
     "terms" : {
      "field" : "comments.nick"
     },
     "aggregations" : {
      "reversed_example" : {
       "reverse_nested" : {}
      }
     }
    }
   }
  }
 }
}'

Top hits aggregation

One of the most anticipated features in Elasticsearch, introduced in version 1.3.0. It allows us to track the top hits for each aggregation bucket and, as such, lets us perform so-called field collapsing on our data.

For example, if we would like to get the top scoring book for each author in our library, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "top_books": {
   "terms": {
    "field": "author",
    "order": {
     "top_hit": "desc"
    }
   },
   "aggs": {
    "top_books_hits": {
     "top_hits": {}
    },
    "top_hit" : {
     "max": {
      "script": "_score"
     }
    }
   }
  }
 }
}'

We use the terms aggregation on the author field and then the top hits aggregation as a nested aggregation, so that the maximum score and its document are kept in memory and returned. The top_hit sub-aggregation, which computes the maximum _score, is also used to order the author buckets.
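The top hits aggregation also accepts properties such as size, sort and _source filtering. As an illustrative sketch (the title field here is an assumption about our index, not something shown earlier), the following would return the single cheapest book per author:

```shell
curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "authors" : {
   "terms" : {
    "field" : "author"
   },
   "aggregations" : {
    "cheapest_book" : {
     "top_hits" : {
      "size" : 1,
      "sort" : [ { "price" : { "order" : "asc" } } ],
      "_source" : { "include" : [ "title", "price" ] }
     }
    }
   }
  }
 }
}'
```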

Percentile ranks aggregation

Introduced in Elasticsearch 1.3.0, the percentile ranks aggregation gives us the percentage of documents whose field value is equal to or lower than a given value. For example, if 90% of the books have a price equal to or lower than $20, we would say that the price of $20 corresponds to the 90th percentile rank.

To check the percentile ranks for prices equal to or lower than $50 and equal to or lower than $30, we could run:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "price_percentiles_rank" : {
   "percentile_ranks" : {
    "field" : "price",
    "values" : [ 50, 30 ]
   }
  }
 }
}'

Geo bounds aggregation

Yet another aggregation added in Elasticsearch 1.3.0, allowing us to compute the bounding box containing all the values of a geo_point field. It can be very useful when trying to build a bounding box around the area covered by the search results.

An example usage of that aggregation could look as follows:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "libraries_with_books" : {
   "geo_bounds" : {
    "field" : "location"
   }
  }
 }
}'

Filters aggregation

The first of the three aggregations introduced in Elasticsearch 1.4.0. It is an extension of the filter aggregation and allows us to define multiple filters instead of a single one.

For example, if we would like to get the price statistics both for books that are available and for those that are not, we could run the following query:

curl -XGET 'localhost:9200/library/_search?pretty' -d '{
 "aggregations" : {
  "books" : {
   "filters" : {
    "filters" : {
     "available" : {
      "term" : { "available" : true }
     },
     "notavailable" : {
      "term" : { "available" : false }
     }
    }
   },
   "aggregations" : {
    "price" : {
     "stats" : {
      "field" : "price"
     }
    }
   }
  }
 }
}'

Children aggregation

The second aggregation introduced in Elasticsearch 1.4.0 allows us to aggregate buckets on parent documents from the context of child documents. Let's recall the example about images and their comments that we used when describing the reverse nested aggregation, but this time we will not use nested documents, but the parent-child relationship: our parent documents will be images, and each image can have a number of child documents, the comments for it.

We could create our index by running the following command:

curl -XPOST 'localhost:9200/images' -d '{
 "mappings" : {
  "image" : {
   "properties" : {
    "name" : { "type" : "string", "index" : "not_analyzed" },
    "author" : { "type" : "string", "index" : "not_analyzed" }
   }
  },
  "comment" : {
   "_parent" : {
    "type" : "image"
   },
   "properties" : {
    "text" : { "type" : "string" },
    "nick" : { "type" : "string", "index" : "not_analyzed" }
   }
  }
 }
}'

Now, if we would like to get information about the number of comments per image, we could run the following query:

curl -XGET 'localhost:9200/images/_search?pretty' -d '{
 "aggregations" : {
  "images" : {
   "terms" : {
    "field" : "name"
   },
   "aggregations" : {
    "number_of_comments" : {
     "children" : {
      "type" : "comment"
     }
    }
   }
  }
 }
}'
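Sub-aggregations can also be nested inside the children aggregation. For example, a sketch of a query returning, for each image, the nicks of the users commenting on it (using the name and nick fields from the mappings above) could look as follows:

```shell
curl -XGET 'localhost:9200/images/_search?pretty' -d '{
 "aggregations" : {
  "images" : {
   "terms" : {
    "field" : "name"
   },
   "aggregations" : {
    "comments" : {
     "children" : {
      "type" : "comment"
     },
     "aggregations" : {
      "commenters" : {
       "terms" : {
        "field" : "nick"
       }
      }
     }
    }
   }
  }
 }
}'
```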

Scripted metric aggregation

The last aggregation added in Elasticsearch 1.4.0, the scripted metric aggregation, allows us to execute scripts to produce a metric output. For this aggregation we can provide the following scripts:

  • init_script – a script run during initialization that allows us to set up the initial state of the calculation,
  • map_script – the only required script, executed once for every document, which needs to store the calculation in an object called _agg,
  • combine_script – a script executed once on each shard after Elasticsearch has finished document collection on that shard,
  • reduce_script – a script executed once on the node coordinating the query execution; it has access to the _aggs variable, which is an array of the values returned by the combine_script.

Let’s assume that we have the following index structure:

curl -XPOST 'localhost:9200/images' -d '{
 "mappings" : {
  "image" : {
   "properties" : {
    "name" : { "type" : "string", "index" : "not_analyzed" },
    "author" : { "type" : "string", "index" : "not_analyzed" },
    "shares": { "type" : "integer" }
   }
  }
 }
}'

Now, let's try to use the scripted metric aggregation to give us the total number of shares across all the images in our index. The query doing that could look as follows:

curl -XGET 'localhost:9200/images/_search?pretty' -d '{
 "aggregations" : {
  "shares" : {
   "scripted_metric" : {
    "init_script" : "_agg[\"number_of_shares\"] = 0",
    "map_script" : "_agg.number_of_shares += doc.shares.value",
    "combine_script" : "return _agg.number_of_shares",
    "reduce_script" : "sum = 0; for (number in _aggs) { sum += number }; return sum"
   }
  }
 }
}'

Materials

Some of the materials for this blog post are available at https://github.com/solrpl/essb.
