Friday, June 14, 2013

HBase Conference 2013

At my current workplace we are using Hadoop and related technologies for data warehousing. The main advantage we get out of this is, of course, cost and scalability. But Hadoop and related technologies (especially Hive and Pig) are not perfect for DW and BI development projects. The major issues we faced are updates and capturing incremental data, since we are trying to build a dimensional model with slowly changing dimensions.

I have heard many people say that HBase can do updates in Hadoop. But my questions are: Is HBase suitable for DW-related updates? Is it suitable for dimensional modeling? And how about maintenance, since DW systems constantly undergo change?

So I got an opportunity to attend the HBase conference to get my answer.. and it is certainly a no. There are some use cases where HBase is used to serve reports or an online reporting portal, but it is certainly unsuitable for bus-architecture-based dimensional models.

Anyway, HBase seems to be serving magnificently for online applications, and it was nice to see case studies from products we use every day, like Groupon and JW Player.

Some notes from the sessions I was able to attend..

Apache Hive and HBase

Data retrieval and updates seem very programmatic and hard in HBase.
Since Hive is becoming very popular because of its SQL nature, this presentation talks about how to use Hive on top of HBase:
  Hive over HBase - online, unstructured, and used by programmers
  Hive over HDFS - offline, structured, and used by analysts
  Use case 1 -- keep dimensions in HBase and facts in Hive; a single query can derive data from both (see the sketch below)
  Use case 2 -- continuous updates go to HBase with a periodic dump to HDFS, and a single Hive query joins the two
Actual presentation link
http://www.slideshare.net/hortonworks/integration-of-hive-and-hbase
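
As a rough sketch of use case 1 (not from the talk itself; the table names, columns, and HBase column mapping are made up), a dimension stored in HBase can be exposed to Hive through the HBase storage handler and joined to a fact table in HDFS with one query:

  -- dimension lives in HBase, mapped into Hive via the HBase storage handler
  CREATE EXTERNAL TABLE customer_dim (
    customer_id STRING,
    name        STRING,
    city        STRING
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:name,d:city")
  TBLPROPERTIES ("hbase.table.name" = "customer_dim");

  -- fact table stays as a regular Hive table on HDFS
  CREATE TABLE sales_fact (
    customer_id STRING,
    sale_date   STRING,
    amount      DOUBLE
  );

  -- one Hive query pulls from both the HBase-backed dimension and the HDFS fact
  SELECT d.city, SUM(f.amount) AS total_amount
  FROM sales_fact f
  JOIN customer_dim d ON (f.customer_id = d.customer_id)
  GROUP BY d.city;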

Phoenix Project

A SQL skin for HBase
An embedded JDBC driver to run SQL queries against HBase

presentation link
http://www.slideshare.net/dmitrymakarchuk/phoenix-h-basemeetup
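
A minimal sketch of what the SQL skin looks like (the table and columns are illustrative; in practice the statements are issued through Phoenix's JDBC driver):

  -- Phoenix maps a SQL schema onto an HBase table; UPSERT is its insert-or-update statement
  CREATE TABLE page_views (
    host       VARCHAR NOT NULL,
    view_date  DATE    NOT NULL,
    view_count BIGINT
    CONSTRAINT pk PRIMARY KEY (host, view_date)
  );

  UPSERT INTO page_views (host, view_date, view_count)
    VALUES ('example.com', TO_DATE('2013-06-14 00:00:00'), 1);

  SELECT host, SUM(view_count) AS total_views
  FROM page_views
  GROUP BY host;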

Impala and HBase

Cloudera is trying to help the analyst community by returning query results faster
The concern could be efficiency
It may not be usable for a batch process like ETL, as there is a possibility of failure
It can be used for analyst queries
Not the best yet -- in other words, not production ready yet
I'm skeptical about the optimizer these tools use

Couldn't find the actual presentation, but this deck has most of the info
http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-hadoop

Apache Drill

Another SQL layer for HBase. I liked this talk since the presenter from MapR admitted that the product is not yet production ready.
The design has a distributed cache in each data node, which means more data transfer to build that cache.
Maybe suitable for small, quick analytical queries; have to wait and see for production use cases.

http://www.slideshare.net/ApacheDrill/apache-drill-technical-overview

HBase in Pinterest - Use case

Interesting talk about pinning, unpinning, following, etc. Couldn't find the actual presentation.

And there is WibiData's Kiji.. maybe this link will give some idea:
http://gigaom.com/2012/11/14/wibidata-open-sources-kiji-to-make-hbase-more-useful/



Tuesday, June 4, 2013

Cloudera Forum Live

I'm watching the live stream at http://cloudera.com/content/cloudera/en/campaign/unaccept-the-status-quo-live-feed.html

"Center of gravity is shifting" -- the message is that the center of gravity of a Data warehouse is moving to Hadoop from Relational DB.

A Hadoop distribution is only the beginning; we need other services like metadata management, a high-level query language, and faster response times.

Moving Hadoop from a batch processing mode to real-time querying.

Cloudera Search -- the latest addition from Cloudera to the Hadoop ecosystem
                         interactive speed at big data scale
                         no demo yet!

This classification looks good:
Enterprise users - Search
Developers       - MapReduce / Pig / Hive
Analysts         - Impala

An analyst from Dell put it right -- "It's not about having Hadoop in our environment and having data in it.. it's all about how we use Hadoop to our advantage and get real value out of it."

Oracle is popping up in many of Cloudera's events; is something cooking?!

Monday, April 22, 2013

Dimensional modeling using Hive (is it effective?)

I strongly believe that a dimensional model cannot/should not be built (effectively) in Hive. First of all, Hive is not meant for slicing and dicing (as expected of a data mart); Hive is a SQL wrapper which makes executing complex MapReduce jobs much easier.

Still, there are many projects (small and large) which are trying to accomplish dimensional modeling in Hive.
With an open framework like Hive, it's certainly not impossible. But the question is: how effective is it?

Effective in the sense of: How scalable? How flexible? And how easy is it for a BI team to implement?
Mostly the answers to all of the above are negative. My belief is that a BI team should spend most of its time solving business problems, not technical problems. If we implement a dimensional model in Hive, we will end up writing code for everything Hive lacks compared to any database, and that is a big overhead, especially for smaller BI teams with high expectations.
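
For example, what would be a one-line UPDATE of a dimension attribute in an RDBMS becomes a full rewrite of the dimension table in Hive, which has no UPDATE statement. A rough, illustrative sketch with made-up table names:

  -- apply changed attributes from a staging extract by rewriting the whole dimension
  INSERT OVERWRITE TABLE customer_dim_new
  SELECT
    d.customer_id,
    COALESCE(s.city, d.city) AS city,       -- pick up the new value when the source sent a change
    COALESCE(s.status, d.status) AS status
  FROM customer_dim d
  LEFT OUTER JOIN stg_customer s
    ON (d.customer_id = s.customer_id);
  -- ...and then swap customer_dim_new in place of customer_dim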

This is a very good paper on how to implement it: http://dbtr.cs.aau.dk/DBPublications/DBTR-31.pdf

So what can Hive's role be in data warehouse reporting and ETL? Hive can play three roles (sketched below).
1. It can act as a staging layer which resembles the source, with incrementally captured partitions. Data from this layer can later be ETLed to a data mart in the data warehouse.
2. It can hold a flattened table which merges all the information into a single wide table.
3. Reporting databases will serve only a certain window of history; the historical dimensional model can be flattened and imported into Hive partitions. This gives greater scalability for archiving, and you can actually query and report out of this archive (certainly not in real time!).
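
A minimal sketch of roles 1 and 3 (the table and column names are just for illustration):

  -- Role 1: a partitioned staging table that mirrors the source, one partition per extract
  CREATE TABLE stg_orders (
    order_id    STRING,
    customer_id STRING,
    amount      DOUBLE
  )
  PARTITIONED BY (load_date STRING);

  LOAD DATA INPATH '/landing/orders/2013-04-22'
    INTO TABLE stg_orders PARTITION (load_date = '2013-04-22');

  -- Role 3: history flattened into one wide table, partitioned by year, queryable as an archive
  CREATE TABLE sales_history_flat (
    order_id      STRING,
    customer_name STRING,
    product_name  STRING,
    amount        DOUBLE
  )
  PARTITIONED BY (sale_year STRING);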

All these solutions build on Hive's biggest advantages of scalability and performance, while taking into consideration that Hive is not suitable for updates and historical revisits.