Monday, January 23, 2012

Hive outer join issue

Recently we faced a memory issue (java heap space) when joining two partitioned tables.
Both the tables are huge in millions of rows and each partitioned on hourly basis.

select a.*,b.*
from a left outer join b
on a.id = b.id
where
a.part > '2012-01-01'
and b.part > '2012-01-01';

this fails during peak times.

issue:
when we checked the logs we found that the mapper is trying to access all the partitions of table b. Since the system is scarce in resources the job failed due to heap space issue.

resolution:
Change the right table to a inline..

select a.*,b.*
from a left outer join (select * from b where b.part > '2012-01-01') c
on a.id = c.id
where
a.part > '2012-01-01';

the number of mappers reduced dramatically and execution is scanning only the required partitions from table b.

Learning:
Always check the number of mappers while running. If the numbers are not realistic take a peak.

Divide and conquer approach when writing complex hive queries.
Since Hive is not yet a completely matured SQL, we need to put extra effort to examine the execution plan.











2 comments: