Friday, June 14, 2013

HBase Conference 2013

At my current workplace we use Hadoop and related technologies for data warehousing. The main advantages we get out of this are, of course, cost and scalability. But Hadoop and related technologies (especially Hive and Pig) are not perfect for DW and BI development projects. The major issues we face are updates and capturing incremental data, since we are trying to build a dimensional model with slowly changing dimensions.
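To show why updates matter so much here, below is a minimal Python sketch of a Type 2 slowly changing dimension load. It is only an illustration, not our actual ETL: the `DimRow` fields, the `apply_scd2` helper and the sample customer data are all made up for this example. The point is that every change needs an in-place update (closing the old version) plus an insert (the new version), and that in-place update is the part that is awkward on append-only HDFS files.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    """One version of a customer in a hypothetical customer dimension."""
    customer_id: int          # natural (business) key
    city: str                 # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]  # None marks the current version


def apply_scd2(dim: List[DimRow], customer_id: int, city: str, as_of: date) -> None:
    """Apply one incoming source record to the dimension, Type 2 style."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return  # attribute unchanged, keep the current version as it is
    if current is not None:
        current.valid_to = as_of  # in-place update: close the old version
    dim.append(DimRow(customer_id, city, as_of, None))  # insert the new version


dim: List[DimRow] = []
apply_scd2(dim, 1, "Austin", date(2013, 1, 1))  # first load
apply_scd2(dim, 1, "Austin", date(2013, 3, 1))  # no change, nothing written
apply_scd2(dim, 1, "Boston", date(2013, 6, 1))  # change: expire old row, add new row
for row in dim:
    print(row)
```

After the three calls the dimension holds two rows for customer 1: the Austin version closed on the date of the change, and the Boston version still open.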

I have heard many people say that HBase can do updates in Hadoop. But my questions are: Is HBase suitable for DW-related updates? Is it suitable for dimensional modeling? How about maintenance, since DW systems constantly undergo change?

So I got an opportunity to attend the HBase conference to get my answer.. and it's certainly a no. There are some use cases where HBase is used to serve reports or an online reporting portal, but it is certainly unsuitable for dimensional models based on a bus architecture.

Anyway, HBase seems to be serving magnificently for online applications, and it was nice to see case studies from products we use every day, like Groupon and jws player.

Some notes from the sessions I was able to attend..

Apache Hive and HBase

Seems like data retrieval and updates are very programmatic and hard in HBase
Since Hive is becoming very popular because of its SQL nature, this presentation talks about how to use Hive on top of HBase
  Hive over HBase - online, unstructured and used by programmers
  Hive over HDFS - offline, structured and used by analysts
  use case 1 -- keep dimensions in HBase .. facts in Hive, and a query can derive data from both
  use case 2 -- continuous updates to HBase and a periodic dump to HDFS
                          join the above with a single Hive query
Actual presentation link
http://www.slideshare.net/hortonworks/integration-of-hive-and-hbase
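To make the two use cases a bit more concrete, here is a toy Python model of them. This is not Hive or HBase code and nothing here comes from the presentation: a plain dict stands in for the HBase table (rows can be updated in place), a list stands in for the Hive fact table on HDFS (append-only), and the table names, columns and helper functions are all invented for illustration.

```python
from typing import Any, Dict, List

# Stand-in for an HBase table: row key -> latest column values (updatable).
hbase_products: Dict[str, Dict[str, str]] = {
    "p1": {"name": "Widget", "category": "Tools"},
    "p2": {"name": "Gadget", "category": "Toys"},
}

# Stand-in for a Hive table on HDFS: append-only fact rows.
hive_sales: List[Dict[str, Any]] = [
    {"product_id": "p1", "amount": 10.0},
    {"product_id": "p2", "amount": 4.5},
    {"product_id": "p1", "amount": 2.5},
]


def sales_by_category(facts: List[Dict[str, Any]],
                      dims: Dict[str, Dict[str, str]]) -> Dict[str, float]:
    """Use case 1: join each fact row to its dimension row and aggregate."""
    totals: Dict[str, float] = {}
    for fact in facts:
        dim = dims.get(fact["product_id"])
        if dim is None:
            continue  # behave like an inner join: skip facts with no dimension row
        totals[dim["category"]] = totals.get(dim["category"], 0.0) + fact["amount"]
    return totals


def dump_snapshot(live: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Use case 2: periodic dump of the live table into a static copy."""
    return {key: dict(cols) for key, cols in live.items()}


snapshot = dump_snapshot(hbase_products)             # periodic dump "to HDFS"
hbase_products["p2"]["category"] = "Electronics"     # update arrives after the dump

live_totals = sales_by_category(hive_sales, hbase_products)
dump_totals = sales_by_category(hive_sales, snapshot)
print(live_totals)
print(dump_totals)
```

The same query gives different answers depending on which copy of the dimension it reads: the live table already reflects the category change, while the dump stays as it was until the next dump runs.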

Phoenix Project

A SQL skin for HBase
embedded JDBC driver to run SQL queries against HBase

presentation link
http://www.slideshare.net/dmitrymakarchuk/phoenix-h-basemeetup

Impala and HBase

Cloudera is trying to help the analyst community by returning query results faster
efficiency could be a problem
it may not be usable for a batch process like ETL - as there is a possibility of failure
can be used for analyst queries
not the best yet -- in other words not production ready yet
I'm skeptical about the optimizers these tools use

Couldn't find the actual presentation, but this PPT has most of the info
http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-hadoop

Apache Drill

Another SQL layer for HBase. I liked this talk since the presenter from MapR accepted that this product is not yet production ready.
The design has a distributed cache in each data node, which means more data transfer to build this cache.
Maybe suitable for small, quick analytical queries. We have to wait and see for production use cases.

http://www.slideshare.net/ApacheDrill/apache-drill-technical-overview

HBase in Pinterest - Use case

Interesting talk about pinning, unpinning, follow etc. Couldn't find the actual presentation

And there is WibiData's Kiji .. maybe this link will give some idea
http://gigaom.com/2012/11/14/wibidata-open-sources-kiji-to-make-hbase-more-useful/



Tuesday, June 4, 2013

Cloudera Forum Live

I'm watching the live stream at http://cloudera.com/content/cloudera/en/campaign/unaccept-the-status-quo-live-feed.html

"Center of gravity is shifting" -- the message is that the center of gravity of a data warehouse is moving from relational databases to Hadoop.

A Hadoop distribution is only the beginning; we need other services like metadata management, a high-level query language, and faster response.

Moving Hadoop from a batch processing mode to real-time querying.

Cloudera Search -- latest addition from Cloudera to the Hadoop ecosystem
                         interactive speed at big data scale
                         no demo yet!

This classification looks good
Enterprise users      - Search
Developer              - MapReduce / Pig / Hive
Analyst                   - Impala

An analyst from Dell puts it right -- "It's not about having Hadoop in our environment and having data in it.. it's all about how we use Hadoop to our advantage and get real value out of it"

Oracle is popping up at many of Cloudera's events, is something cooking?!