Tuesday, April 10, 2012

DW in Hive - handling big dimensions

This is an issue that has been bothering our reporting data quality: how to handle big dimension tables in a Hive data warehouse, and how to balance performance against data quality.

Problem statement:
Currently the dimension Hive table (dim_customer) is partitioned by date. The daily incremental load creates a new partition so that we can improve performance on the reporting side (a sketch of this layout follows the list below). This poses three critical issues:
1. Slowly changing dimension history is lost
2. Data quality is compromised
3. Every reporting query has to filter the dimension table down to the latest partition
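
To make this concrete, here is a rough sketch of the kind of layout described above. The column names, the staging table and the fact_sales table are assumptions for illustration, not taken from the actual system.

-- Current layout (assumed): one partition of dim_customer per load date
CREATE TABLE dim_customer (
  customer_id   BIGINT,
  customer_name STRING,
  segment       STRING
)
PARTITIONED BY (load_date STRING);

-- Daily incremental load lands in a fresh partition
-- (staging table name is hypothetical)
INSERT OVERWRITE TABLE dim_customer PARTITION (load_date = '2012-04-10')
SELECT customer_id, customer_name, segment
FROM   staging_dim_customer;

-- Every reporting join must remember to pin the latest partition,
-- otherwise a customer shows up once per load date (issue 3 above)
SELECT d.segment, SUM(f.amount)
FROM   fact_sales f
JOIN   dim_customer d ON f.customer_id = d.customer_id
WHERE  d.load_date = '2012-04-10'
GROUP BY d.segment;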

Solution:
The only way to solve this problem is to keep the dimension as one big Hive table instead of date partitions. But this creates its own issues: the refresh strategy becomes harder, and the reporting Hive queries carry the overhead of the full table.
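
For contrast, a sketch of what the single-table refresh would look like with no further help: because there is no partition to swap in, a naive refresh has to rewrite the whole table every run, which is exactly the refresh-strategy issue that the steps below try to contain. Names are the same assumptions as in the earlier sketch.

-- Single, unpartitioned dimension table (same assumed columns as above)
CREATE TABLE dim_customer (
  customer_id   BIGINT,
  customer_name STRING,
  segment       STRING
);

-- Without partitions there is nothing to swap in: a naive refresh rewrites
-- the full table from the latest extract on every run
INSERT OVERWRITE TABLE dim_customer
SELECT customer_id, customer_name, segment
FROM   staging_dim_customer;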

The following is a recipe to get past this block; step 1 is certainly the priority.

1. Increase processing power
Hadoop is not only about mega storage, it is also about mega processing, so if we process big files we have to have a good number of nodes. Say we have 30 nodes to process the partitioned dimension table; we would have to move to something like 120 nodes for the single-dimension-file strategy. That is four times the processing power, because the requirement grows in direct proportion to the data each query has to scan.

2. Use SQOOP merge
We cannot extract the whole table from the transactional system every time; the source transactional systems might not allow it, so we can only capture the change data. Sqoop merge comes in handy for this purpose: we overwrite only the changed records in the Hive table (type 1 SCD). Again, this runs as a MapReduce job, so we need the processing power from step 1.
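
A rough sketch of how a Sqoop merge run might look for this table; the HDFS paths, the merge key and the generated class/jar (produced earlier by sqoop import or sqoop codegen) are assumptions for illustration, not the actual job.

# Merge the day's delta onto the existing dim_customer data set,
# keeping the newest record per customer_id (type 1 behaviour).
# Paths, class name and jar name are hypothetical.
sqoop merge \
  --new-data   /staging/dim_customer/2012-04-10 \
  --onto       /warehouse/dim_customer/current \
  --target-dir /warehouse/dim_customer/merged \
  --jar-file   dim_customer.jar \
  --class-name dim_customer \
  --merge-key  customer_id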

3. Use Bucketed Hive tables
The Hive performance bottleneck shows up while joining. We can create the tables with bucketing, which acts like a hash index on customer_id (a DDL sketch follows the note below).


If the tables being joined are bucketized, and the buckets are a multiple of each other, the buckets can be joined with each other. If table A has 8 buckets and table B has 4 buckets, the following join

SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 of A will only fetch bucket 1 of B. This is not the default behavior, and it is governed by the following parameter:

set hive.optimize.bucketmapjoin = true;
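
Applying this to our case, here is a rough sketch of how dim_customer and the fact table it joins to might be bucketed on customer_id. The bucket counts and the fact_sales table are assumptions; the only requirement is that one bucket count is a multiple of the other.

-- Bucket both sides of the join on the join key
CREATE TABLE dim_customer (
  customer_id   BIGINT,
  customer_name STRING,
  segment       STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

CREATE TABLE fact_sales (
  customer_id BIGINT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) INTO 128 BUCKETS;

-- Make Hive honour the bucket definition while loading,
-- and enable the bucketed map join at query time
set hive.enforce.bucketing = true;
set hive.optimize.bucketmapjoin = true;

-- Each mapper over a fact_sales bucket fetches only the matching
-- dim_customer bucket instead of the whole dimension
SELECT /*+ MAPJOIN(d) */ d.segment, f.amount
FROM   fact_sales f
JOIN   dim_customer d ON f.customer_id = d.customer_id;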

Bucketing and Sqoop merge require good planning and metadata management.
Managing Hadoop from scratch is challenging because we bump into the limits sooner than expected, and we need to adapt quickly or the data will grow beyond them. One advantage (and complexity) is that the internal processing (MapReduce) is open, and it is up to the developer to improve it. And the biggest advantage of all is scaling out: you can add nodes easily to really make a difference.





