BigDW: Hive outer join issue

Monday, January 23, 2012

Hive outer join issue

Recently we faced a memory issue (java heap space) when joining two partitioned tables.

Both the tables are huge in millions of rows and each partitioned on hourly basis.

select a.*,b.*

from a left outer join b

on a.id = b.id

where

a.part > '2012-01-01'

and b.part > '2012-01-01';

this fails during peak times.

issue:

when we checked the logs we found that the mapper is trying to access all the partitions of table b. Since the system is scarce in resources the job failed due to heap space issue.

resolution:

Change the right table to a inline..

select a.*,b.*

from a left outer join (select * from b where b.part > '2012-01-01') c

on a.id = c.id

where

a.part > '2012-01-01';

the number of mappers reduced dramatically and execution is scanning only the required partitions from table b.

Learning:

Always check the number of mappers while running. If the numbers are not realistic take a peak.

Divide and conquer approach when writing complex hive queries.

Since Hive is not yet a completely matured SQL, we need to put extra effort to examine the execution plan.

3 comments:

AnonymousNovember 19, 2012 at 11:56 AM
select a.*, b.*
from a
left outer join b on (a.id = b.id and b.part > '2012-01-01')
where
a.part > '2012-01-01';
ReplyDelete
Replies
İsaAugust 7, 2023 at 7:54 PM
yurtdışı kargo
resimli magnet
instagram takipçi satın al
yurtdışı kargo
sms onay
dijital kartvizit
dijital kartvizit
https://nobetci-eczane.org/
F4NH
ReplyDelete
Replies
AnonymousOctober 15, 2024 at 6:28 AM
رقم مصلحة المجاري بالاحساء 0w3bTyihtX
ReplyDelete
Replies

Add comment