Tuesday, January 24, 2012

Slowly changing dimensions in Hive

If you keep your data warehouse in Hive, how do you implement slowly changing dimensions (SCDs)?


This is not a solution post; actually, I'm looking for an effective way to implement SCDs in Hive.
Every SCD type relies on an update one way or another. Type 2, for example, would be a good candidate, but it requires updating the end date of the history record. The only option I can think of is to insert the new rows into the dimension and query on the max surrogate key or max effective date to get the latest dimension version at the analytics layer (which seems like an overhead on MapReduce). A sketch of that query follows.
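For illustration, here is a minimal HiveQL sketch of that insert-only approach. The table customer_dim and its columns customer_id, name, and eff_date are hypothetical names, not from any real schema: the subquery finds each customer's latest effective date, and the join keeps only that version.

    -- Hypothetical dimension: every change is appended as a new row.
    -- Keep only the latest version of each customer at query time.
    SELECT d.customer_id, d.name, d.eff_date
    FROM customer_dim d
    JOIN (
        SELECT customer_id, MAX(eff_date) AS max_eff_date
        FROM customer_dim
        GROUP BY customer_id
    ) latest
      ON d.customer_id = latest.customer_id
     AND d.eff_date = latest.max_eff_date;

The extra self-join is exactly the overhead mentioned above: every query against the dimension pays for an additional MapReduce job.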

I posted this question to the Cloudera professionals group on LinkedIn, and Jasper suggested an approach that does not use Hive queries; he prefers MapReduce scripts for handling SCD situations. I modified the flow to suit Hive tables.

The idea is this (a sketch follows below):
1. Open the underlying HDFS file, or select all rows using HiveQL
2. Pass the data through a mapper as key/value pairs
3. In the script, end-date the old record and leave the new record with a NULL end_date (for Type 2)
4. Overwrite the Hive table

But the process could get tricky if the data is spread across multiple partitions.
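Here is a minimal HiveQL sketch of that flow, using the same hypothetical customer_dim table (now with columns customer_id, name, start_date, end_date) and a hypothetical streaming script scd2_script.py standing in for the MapReduce logic. One detail the steps above gloss over: end-dating needs every version of a key in one place, so the rows are distributed and sorted by key before the script runs, which makes it effectively a reduce step rather than a plain map.

    -- scd2_script.py is hypothetical: it reads rows grouped by customer_id
    -- and sorted by start_date, end-dates each superseded version, and
    -- leaves the newest version's end_date as NULL (step 3).
    ADD FILE scd2_script.py;

    -- Steps 1-2: read every row and bring all versions of a business key
    -- to the same script instance, ordered by start_date.
    FROM (
        SELECT customer_id, name, start_date, end_date
        FROM customer_dim
        DISTRIBUTE BY customer_id
        SORT BY customer_id, start_date
    ) versions
    -- Step 4: overwrite the table with the script's output; Hive stages
    -- the result in a scratch directory before replacing the table data.
    INSERT OVERWRITE TABLE customer_dim
    SELECT TRANSFORM (customer_id, name, start_date, end_date)
        USING 'python scd2_script.py'
        AS (customer_id, name, start_date, end_date);

For a partitioned table, the INSERT OVERWRITE would need a partition specification (or dynamic partitioning), which is where the trickiness mentioned above comes in.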


4 comments:

  1. That may help:
    http://afancy-lxf.blogspot.com/2012/03/process-type-1-scd-in-hive.html

  2. Hi, could you provide a link to Jasper's M/R scripts idea?
    We are currently trying to implement type 1 updates in Hive and are seeing huge performance problems with big fact tables, so hopefully this can help.

    Thanks!

  3. Please refer to the following link, which explains, with examples, the schema-on-read concept for implementing Hive-based SCD Type 2 functionality:
    http://amintor.com/1/post/2014/07/implement-scd-type-2-in-hadoop-using-hive-transforms.html
