Monday, April 22, 2013

Dimensional modeling using Hive (is it effective?)

I strongly believe that a dimensional model cannot/should not be built (effectively) in Hive. First of all, Hive is not designed for slicing and dicing (as expected in a data mart); Hive is a SQL wrapper that makes executing complex MapReduce jobs much easier.

Still, there are many projects (small and large) trying to accomplish dimensional modeling in Hive.
With an open framework like Hive, it's certainly not impossible. But the question is: how effective is it?

Effective in the sense of: How scalable? How flexible? And how easy is it for a BI team to implement?
Mostly, the answers to all of the above are negative. My belief is that a BI team should spend most of its time solving business problems, not technical problems. If we implement a dimensional model in Hive, we will end up writing code for everything Hive lacks compared to any conventional database. And that is a big overhead, especially for smaller BI teams with high expectations.
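To make that overhead concrete, here is a minimal sketch (with hypothetical table and column names) of just one of those "not-haves": maintaining a Type 2 slowly changing dimension. Since Hive has no UPDATE, even closing out expired dimension rows means rewriting the whole table on every load — something a single UPDATE statement does in a regular database.

```sql
-- Hypothetical SCD Type 2 close-out in Hive: no UPDATE available,
-- so the entire dimension table is rewritten via INSERT OVERWRITE.
INSERT OVERWRITE TABLE dim_customer
SELECT
  d.customer_key,
  d.customer_id,
  d.name,
  d.valid_from,
  -- close out current rows that have a newer version in the incoming batch
  CASE WHEN s.customer_id IS NOT NULL AND d.valid_to IS NULL
       THEN s.load_date
       ELSE d.valid_to
  END AS valid_to
FROM dim_customer d
LEFT OUTER JOIN staging_customer s
  ON d.customer_id = s.customer_id;
```

And this is only half the job — inserting the new row versions from staging needs a second pass. In a database this is routine BI work; in Hive it is custom plumbing the team has to build and maintain.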

This is a very good paper on how to implement it: http://dbtr.cs.aau.dk/DBPublications/DBTR-31.pdf

So what can Hive's role be in data warehouse reporting and ETL? Hive can play three roles.
1. It can act as a staging layer that mirrors the source, with incrementally captured partitions. Data from this layer can later be ETLed into a data mart in the data warehouse.
2. It can hold a flattened table that merges all the information into a single denormalized table.
3. Reporting databases typically serve only a certain window of history. The older dimensional data can be flattened and imported into Hive partitions, which gives great scalability for archiving. You can actually query and report out of this archive (certainly not in real time!).
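The staging and archive roles above both come down to the same pattern: a partitioned Hive table where each load lands in its own partition. A minimal sketch, assuming a hypothetical orders feed and 2013-era storage formats:

```sql
-- Hypothetical staging table partitioned by capture date; each nightly
-- extract lands in its own partition, mirroring the source system.
CREATE TABLE staging_orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  amount       DOUBLE,
  status       STRING
)
PARTITIONED BY (capture_date STRING)
STORED AS RCFILE;

-- Load one incremental capture into its own partition.
LOAD DATA INPATH '/landing/orders/2013-04-22'
INTO TABLE staging_orders
PARTITION (capture_date = '2013-04-22');
```

An archive table for role 3 looks the same, except the partitions hold flattened historical data exported from the reporting database, and queries against it filter on the partition column to avoid full scans.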

All these solutions play to Hive's biggest advantages, scalability and performance on large data, while taking into account that Hive is not suitable for updates or historical revisits.