Friday, June 6, 2014

Hadoop summit 2014 - commentary

Three days of continuous presentations: some really cool ones, many sales pitches, and a little useful networking, food and drinks :). Apart from these, my agenda was to get the data warehousing side of Hadoop and to learn some more tricks for my long-running quest to implement a BI solution in Hadoop. I did learn some, and got some benchmarks to back up some designs. I feel confident that a lot of effort is being put into this space by the Hadoop community.

As one of the presenters said, quoting Churchill:
"Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning"

This conference was organized by Hortonworks and not Cloudera. YARN and Tez dominated the talks. I saw "MapReduce is dead" in a couple of presentations; I hope it's for the good. All enterprise users must move to YARN in the near future, or else they will miss out on future tools and essential enhancements to current tools like Hive and Spark.

In a way, these conferences give an idea of how mature the technology is and how serious businesses are about using it. But you can always relate to this quote: "Big data is like teen sex. Everybody is talking about it, everyone thinks everyone else is doing it, so everyone claims they are doing it."

Coming to the overall picture, I could see 4 themes from my perspective.

Performance

  • Parallelism and MapReduce are an old story; now getting results fast is the coolest thing.
  • Key concepts to look for in this area are YARN, Tez, ORC (a compressed columnar format), Parquet (another columnar format), and in-memory query engines (Impala, Spark). There were also talks about collecting statistics to make Hive behave more like a database.


Security
Most enterprises have brought their data into Hadoop, and there is a big gap to be filled in securing it. Some concepts along these lines:

  • Keeping Hadoop behind the database and restricting direct access to Hadoop. This way you can use the database's security features for Hadoop objects (Teradata and Oracle already have appliances for this)
  • Hive 0.13 - improved security features like GRANT/REVOKE, roles, column- and row-level access, and permissions for views.
  • LUKS (Linux Unified Key Setup)
  • Apache Sentry
  • Apache Knox


Federation
Data blending: this is where the big players want to play a role, along the lines of "We too do Hadoop", by making Hadoop coexist with the traditional data warehouse ecosystem.

  • Teradata Hadoop connector
  • Oracle smart something!
  • Microsoft PDW (parallel data warehouse) and PolyBase
  • My favorite way to do data blending is with Tableau (of course, on a smaller scale)


ACID
This is the topic I'm most interested in: how to do ACID on Hive and, if that's not possible, how to mock it, because I feel updates are critical to truly delivering a data warehouse in Hadoop.

  • Store data in a columnar format (ORC, Parquet)
  • Maintain a single partition using compaction (this table is going to be huge, but that's what Hadoop is made for)
  • Keep appending the history (don't delete).
  • Get the latest data while reading. There is a new AcidInputFormat and an accompanying ACID reader to accomplish this.
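To make the "keep appending, read the latest" idea from the list above concrete, here is a minimal sketch in plain Python. It is not Hive's implementation: the RowVersion record, the txn_id field, the deleted marker and the read_latest function are all names I made up for illustration. The only assumption is that every change is appended as a new row version with an increasing transaction id, and that the reader picks the newest version per key.

```python
from typing import Dict, Iterable, NamedTuple


class RowVersion(NamedTuple):
    """One appended version of a row (hypothetical layout, not Hive's)."""
    key: str
    txn_id: int            # assumed to increase with every write
    value: str
    deleted: bool = False  # a delete is appended as a marker, nothing is removed


def read_latest(history: Iterable[RowVersion]) -> Dict[str, str]:
    """Return the newest value per key, skipping keys whose newest version is a delete."""
    latest: Dict[str, RowVersion] = {}
    for row in history:
        current = latest.get(row.key)
        if current is None or row.txn_id > current.txn_id:
            latest[row.key] = row
    return {key: row.value for key, row in latest.items() if not row.deleted}


# The history only ever grows: an update and a delete are just more rows.
history = [
    RowVersion("cust-1", 1, "Austin"),
    RowVersion("cust-2", 1, "Boston"),
    RowVersion("cust-1", 2, "Chicago"),          # update to cust-1
    RowVersion("cust-2", 3, "", deleted=True),   # delete of cust-2
]

print(read_latest(history))  # {'cust-1': 'Chicago'}
```

In this picture, compaction would amount to periodically rewriting the history so that it holds only what read_latest returns, which keeps reads from slowing down as the history grows.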
I'll have to write a separate blog post on this topic. I need to get deeper into the architecture and do a POC on this.