Thursday, 4 February 2016

Bucket Map Join in Hive

Shuffle Join and Map Joins are most common joins used in Hive, these joins might just not be the best solution to all the cases.

understanding Bucket joins can be handy to improve join execution speeds based on the join columns.

Bucket Map Joins as the name indicates is a Mapper only joins. (not to be confused with map joins/broadcast joins).

So when do we use Bucket Map join?

  1. We have Large tables as part of our n-way join.
  2. The tables are bucketed on join columns.
  3. Data is unsorted
  4. The number of buckets in the tables are multiples of eachother. 


So , Large unsorted tables which have been bucketed on join columns are good candidates. Other way is to design the table to be bucketed based on the use case. 

Bucket map joins are not the default behavior of hive, hence we need to suggest hive by setting the bucketmapjoin to true.

set hive.optimize.bucketmapjoin = true;
Select T1.key, T1.value FROM T1 JOIN T2 ON T1.key = T2.key;


We need to bucket Map join on table T1 and T2. The tables must be bucketed on the column "Key" and the number of buckets in both the tables should be multiples of each other.

Bucket join now has an advantage of choosing only the required Buckets from T2 instead of the entire T2 table.

The Mapper will be able to access only the required buckets from the n-1 tables in the n-way join. This results to join happening in the Mapper, the expensive shuffle part is skipped.


No comments:

Post a Comment