Both the tables are huge in millions of rows and each partitioned on hourly basis.
select a.*,b.*
from a left outer join b
on a.id = b.id
where
a.part > '2012-01-01'
and b.part > '2012-01-01';
this fails during peak times.
issue:
when we checked the logs we found that the mapper is trying to access all the partitions of table b. Since the system is scarce in resources the job failed due to heap space issue.
resolution:
Change the right table to a inline..
select a.*,b.*
from a left outer join (select * from b where b.part > '2012-01-01') c
on a.id = c.id
where
a.part > '2012-01-01';
the number of mappers reduced dramatically and execution is scanning only the required partitions from table b.
Learning:
Always check the number of mappers while running. If the numbers are not realistic take a peak.
Divide and conquer approach when writing complex hive queries.
Since Hive is not yet a completely matured SQL, we need to put extra effort to examine the execution plan.
select a.*, b.*
ReplyDeletefrom a
left outer join b on (a.id = b.id and b.part > '2012-01-01')
where
a.part > '2012-01-01';
yurtdışı kargo
ReplyDeleteresimli magnet
instagram takipçi satın al
yurtdışı kargo
sms onay
dijital kartvizit
dijital kartvizit
https://nobetci-eczane.org/
F4NH
رقم مصلحة المجاري بالاحساء 0w3bTyihtX
ReplyDelete