Thursday, 5 May 2016

How does Map-side aggregation work in Hive?

What is map-side aggregation in Hive?

Hive tries to optimize query execution by performing map-side aggregation when certain conditions are met.
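As a point of reference, here is a minimal Python sketch (conceptual only, not Hive's actual code) of what a mapper for a word-count style GROUP BY does without this optimization: it emits one pair per input row, and every one of those pairs has to be sorted and shuffled to the reducers.

```python
# Conceptual baseline, not Hive's implementation: without map-side
# aggregation, a word-count style GROUP BY emits one (key, 1) pair
# per input row, and all of them are shuffled to the reducers.
def naive_mapper(tokens):
    for token in tokens:
        yield (token, 1)   # one output pair per input row
```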

What exactly is the optimization being performed?

The optimization does a partial aggregation inside the mapper, which results in the mapper emitting fewer rows and reduces the data that Hadoop needs to sort and distribute to the reducers.

How does the optimization/map-side aggregation work?

Each mapper stores its map output keys in a hash map along with the corresponding partial aggregation. Periodically, the mapper emits these pairs, e.g. (token, token_count) for a word count. The Hadoop framework then sorts these pairs and the reducers merge the partial aggregates into the final result.
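As a rough illustration, a Python sketch of the idea (not Hive's implementation; the token/word-count example follows the pairs mentioned above):

```python
from collections import defaultdict

def aggregating_mapper(tokens, flush_every=100000):
    """Conceptual sketch of map-side aggregation (not Hive's code):
    keep a hash map of token -> partial count and periodically emit
    its contents, instead of emitting one pair per input row."""
    partial = defaultdict(int)
    for i, token in enumerate(tokens, start=1):
        partial[token] += 1                # partial aggregation in memory
        if i % flush_every == 0:           # periodic flush bounds memory use
            yield from partial.items()     # emit (token, partial_count) pairs
            partial.clear()
    yield from partial.items()             # emit whatever is left at the end
```

With this scheme the reducers only have to sum a few partial counts per token instead of one pair per occurrence.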

So what is the trade-off of using map-side aggregation?

The mapper needs to keep a hash map of all keys (tokens) in memory, which increases the main memory required to run the mapper.

How does Hive decide to apply the map-side aggregation optimization?


By default, Hive will try to use the map-side aggregation optimization, but it falls back to the vanilla MapReduce approach (plain reduce-side aggregation) if the hash map is not producing enough of a memory saving.

Decision policy:

After processing the first 100,000 records in the mapper (hive.groupby.mapaggr.checkinterval), Hive checks whether the number of entries in the hash map exceeds 50% of the number of records processed (hive.map.aggr.hash.min.reduction). If it does, Hive aborts map-side aggregation for that query. Both configuration parameters can be tuned to control this behaviour.
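A hedged Python sketch of that check (not Hive's source; the defaults shown mirror hive.groupby.mapaggr.checkinterval = 100000 and hive.map.aggr.hash.min.reduction = 0.5):

```python
def keep_hash_aggregation(rows_processed, hash_map_entries,
                          check_interval=100000,  # hive.groupby.mapaggr.checkinterval
                          min_reduction=0.5):     # hive.map.aggr.hash.min.reduction
    """Sketch of the fallback decision: once check_interval rows have been
    read, keep map-side aggregation only if the hash map has stayed small
    enough relative to the number of rows processed."""
    if rows_processed < check_interval:
        return True                               # too early to decide
    return hash_map_entries <= min_reduction * rows_processed
```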

So what is hive.map.aggr.hash.percentmemory for?

We know the trade-off with map-side aggregation is that the hash map has to be held in memory. To avoid an out-of-memory error in the mapper caused by the aggregation hash map growing too large, hive.map.aggr.hash.percentmemory controls when the hash map is flushed to the reducers: once its size exceeds the specified percentage of the total memory available to the mapper, the accumulated partial aggregates are emitted and the map is cleared.

This, however, is an estimate based on the number of rows and the expected size of each row, so if the memory usage per row is unexpectedly high, the mapper may still run out of memory before the hash map is flushed to the reducers.
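Roughly, and again only as a Python sketch with assumed names (Hive estimates the footprint from row counts and an expected per-row size rather than measuring it exactly):

```python
def should_flush(hash_map_entries, estimated_bytes_per_entry,
                 mapper_memory_bytes,
                 percent_memory=0.5):  # hive.map.aggr.hash.percentmemory
    """Sketch of the flush trigger: the hash map size is *estimated* from
    the number of entries and an expected per-entry size, so an
    unexpectedly large entry can exhaust memory before a flush happens."""
    estimated_size = hash_map_entries * estimated_bytes_per_entry
    return estimated_size > percent_memory * mapper_memory_bytes
```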
