At my current workplace we use Hadoop and related technologies for data warehousing. The main advantages we get out of this are, of course, cost and scalability. But Hadoop and related technologies (especially Hive and Pig) are not perfect for DW and BI development projects. The major issues we faced are updates and capturing incremental data, since we are trying to build a dimensional model with slowly changing dimensions.
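To make the update problem concrete, here is a minimal sketch of one slowly changing dimension change, assuming the common "Type 2" style where history is kept as versioned rows. The `CustomerRow` class, the `apply_change` function, and the sample values are all hypothetical and only for illustration. The point is that one logical change needs an update of an existing row plus an insert, and the update is the awkward part on write-once files.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    """One version of a customer in a hypothetical Type 2 dimension."""
    customer_id: int           # natural (business) key
    city: str                  # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]   # None marks the current version


def apply_change(dim: List[CustomerRow], customer_id: int,
                 new_city: str, as_of: date) -> List[CustomerRow]:
    """Return a new copy of the dimension with one change applied."""
    result = []
    changed = False
    for row in dim:
        is_current = row.customer_id == customer_id and row.valid_to is None
        if is_current and row.city != new_city:
            # The "update" half: close off the existing current row.
            # On write-once storage this means rewriting the data.
            result.append(replace(row, valid_to=as_of))
            changed = True
        else:
            result.append(row)
    is_new_customer = not any(r.customer_id == customer_id for r in dim)
    if changed or is_new_customer:
        # The "insert" half: add the new current version.
        result.append(CustomerRow(customer_id, new_city, as_of, None))
    return result


dim = [CustomerRow(1, "Chennai", date(2012, 1, 1), None)]
dim = apply_change(dim, 1, "Bangalore", date(2013, 6, 1))
for row in dim:
    print(row)
```

Note that the sketch rebuilds the whole list to change a single row, which roughly mirrors what we end up doing with files on HDFS.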
I heard many people say that HBase can do updates in Hadoop. But my questions are: Is HBase suitable for DW-related updates? Is it suitable for dimensional modeling? And how about maintenance, since DW systems constantly undergo change?
So I got an opportunity to attend an HBase conference to get my answer, and it is certainly a no. There are some use cases where HBase is used to serve reports or an online reporting portal, but it is certainly unsuitable for dimensional models based on a bus architecture.
Anyway, HBase seems to be serving magnificently for online applications, and it was nice to see some case studies from services we use every day, like Groupon and jws player.
Some notes from the sessions I was able to attend:
Apache Hive and HBase
Data retrieval and updates seem very programmatic and hard in HBase. Since Hive is becoming very popular because of its SQL nature, this presentation talks about how to use Hive on top of HBase.
Hive over HBase - online, unstructured, and used by programmers
Hive over HDFS - offline, structured, and used by analysts
Use case 1 -- keep dimensions in HBase and facts in Hive; a query can derive data from both
Use case 2 -- continuous updates to HBase and a periodic dump to HDFS
Join the above with a single Hive query
Actual presentation link:
http://www.slideshare.net/hortonworks/integration-of-hive-and-hbase
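Use case 1 from the Hive and HBase notes can be sketched roughly as below. This is not Hive or HBase code: a plain dict stands in for the HBase dimension table (latest values by row key) and a list of tuples stands in for the fact table on HDFS. The names `product_dim`, `sales_fact`, and `sales_by_category` and the sample data are made up for illustration. In the real setup the join would be a single Hive query across both tables.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Stand-in for an HBase table: row key -> latest column values.
# Writing to an existing key overwrites it, so a dimension update is cheap.
product_dim: Dict[str, Dict[str, str]] = {
    "p1": {"name": "Widget", "category": "Tools"},
    "p2": {"name": "Gadget", "category": "Toys"},
}

# Stand-in for a fact table kept as files on HDFS: (product_key, amount).
sales_fact: List[Tuple[str, float]] = [
    ("p1", 10.0),
    ("p2", 5.0),
    ("p1", 2.5),
]


def sales_by_category(facts: List[Tuple[str, float]],
                      dim: Dict[str, Dict[str, str]]) -> Dict[str, float]:
    """Join each fact row to its dimension row and total amounts by category."""
    totals: Dict[str, float] = defaultdict(float)
    for product_key, amount in facts:
        dim_row = dim.get(product_key)
        if dim_row is None:
            continue  # inner-join behaviour: skip facts with no dimension row
        totals[dim_row["category"]] += amount
    return dict(totals)


print(sales_by_category(sales_fact, product_dim))

# A dimension change is one overwrite by key ...
product_dim["p2"]["category"] = "Tools"
# ... and the next query picks it up without touching the fact data.
print(sales_by_category(sales_fact, product_dim))
```

This overwrites the old value, so it only shows the "latest value" style of dimension; keeping history the way a slowly changing dimension needs is a separate problem.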
Phoenix Project
A SQL skin for HBase
Embedded JDBC driver to run SQL on HBase
Presentation link:
http://www.slideshare.net/dmitrymakarchuk/phoenix-h-basemeetup
Impala and HBase
Cloudera is trying to help the analyst community by returning query data faster
The problem could be efficiency
It may not be usable for a batch process like ETL, as there is a possibility of failure
Can be used for analyst queries
Not the best yet -- in other words, not production ready yet
I'm skeptical about the optimizers these tools use
Couldn't find the actual presentation, but this PPT has most of the info:
http://www.slideshare.net/cloudera/cloudera-impala-a-modern-sql-engine-for-hadoop
Apache Drill
Another SQL engine for HBase. I liked this talk since the presenter from MapR accepted that this product is not yet production ready.
The design has a distributed cache in each data node, which means more data transfer to build this cache.
Maybe suitable for small, quick analytical queries. Have to wait and see for production use cases.
http://www.slideshare.net/ApacheDrill/apache-drill-technical-overview
HBase in Pinterest - Use case
Interesting talk about pinning, unpinning, follow, etc. Couldn't find the actual presentation.
And there is WibiData's Kiji; maybe this link will give some idea:
http://gigaom.com/2012/11/14/wibidata-open-sources-kiji-to-make-hbase-more-useful/