Friday, June 6, 2014

Hadoop Summit 2014 - commentary

Three days of back-to-back presentations: some really cool ones, many sales pitches, a little useful networking, and food and drinks :). Beyond that, my agenda was to get the data warehousing side of Hadoop - to pick up some more tricks for my long-running quest to implement a BI solution on Hadoop. I did learn some things, and got some benchmarks to validate some designs. I came away confident that the Hadoop community is putting a lot of effort into this space.

As one presenter quoted Churchill:
"Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning"

This conference was organized by Hortonworks, not Cloudera. YARN and Tez dominated the talks; I saw "MapReduce is dead" in a couple of presentations, and I hope it's for good. Enterprise users should move to YARN in the near future, or they will miss out on upcoming tools and on essential enhancements to current ones like Hive and Spark.

In a way, these conferences give an idea of how mature the technology is and how serious businesses are about using it. But you can always relate to this quote: "Big data is like teen sex. Everybody is talking about it, everyone thinks everyone else is doing it, so everyone claims they are doing it."

Coming to the overall themes - I could see four, from my perspective.

Performance

  • Parallelism and MapReduce are old news; now getting results fast is the coolest thing.
  • Key concepts to look for in this area are YARN, Tez, ORC (a compressed columnar format), Parquet (another columnar format), and in-memory query engines (Impala, Spark). There were also talks about collecting statistics to make Hive behave more like a database (a small sketch follows this list).
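For example, here is a minimal Hive sketch of the ORC-plus-statistics idea. The table, columns, and settings are invented for illustration, not taken from any particular talk:

  SET hive.execution.engine=tez;   -- run queries on Tez instead of MapReduce

  -- Store the data in compressed columnar ORC
  CREATE TABLE sales_orc (
    sale_id  BIGINT,
    store_id INT,
    amount   DECIMAL(10,2)
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB');

  -- Gather table- and column-level statistics so the optimizer
  -- can plan more like a traditional database
  ANALYZE TABLE sales_orc COMPUTE STATISTICS;
  ANALYZE TABLE sales_orc COMPUTE STATISTICS FOR COLUMNS store_id, amount;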


Security
Most enterprises have brought their data into Hadoop, but there is a big gap to be filled in securing that data. Some concepts along these lines:

  • Keeping Hadoop behind the database and restricting direct access to Hadoop. This way you can use the database's security features for Hadoop objects (Teradata and Oracle already have appliances for this).
  • Hive 0.13 - improved security features like GRANT/REVOKE, roles, column- and row-level access, and permissions on views (a sketch follows this list).
  • LUKS (Linux Unified Key Setup)
  • Apache Sentry
  • Apache Knox
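To give a flavor of the Hive 0.13 piece, here is a rough sketch of grant/revoke with roles, using a view for column- and row-level restriction. The role, user, and table names are made up:

  CREATE ROLE analyst;
  GRANT ROLE analyst TO USER alice;

  -- Table-level access
  GRANT SELECT ON TABLE sales_orc TO ROLE analyst;
  REVOKE SELECT ON TABLE sales_orc FROM ROLE analyst;

  -- Column- and row-level restriction via a view: expose only
  -- some columns and rows, then grant on the view alone
  CREATE VIEW sales_east AS
  SELECT sale_id, amount
  FROM sales_orc
  WHERE store_id < 100;

  GRANT SELECT ON TABLE sales_east TO ROLE analyst;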


Federation
Data blending - this is where the big players want a role, as in "we too do Hadoop," making Hadoop coexist with the traditional data warehouse ecosystem.

  • Teradata Hadoop connector
  • Oracle smart something!
  • Microsoft PDW (parallel data warehouse) and PolyBase
  • My favorite data blending is with Tableau (of course, on a smaller scale)


ACID
This is the topic I'm most interested in: how to do ACID on Hive, and if not, how to mock it, because I feel updates are critical to truly deliver a data warehouse in Hadoop.

  • Store data in a columnar format (ORC, Parquet).
  • Maintain a single partition using compaction (the table is going to be huge, but that's what Hadoop is made for).
  • Keep appending the history (don't delete).
  • Get the latest data while reading; there is a new ACID input format and reader to accomplish this (a manual version is sketched below).
I have to write a separate blog post on this topic; I need to get deeper into the architecture and do a POC.
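Until that post, here is a minimal Hive sketch of the mock-ACID pattern from the bullets above: append-only history plus a "latest row wins" read. This is the manual version, not the new ACID reader, and the table and column names are invented:

  -- Append every change with a load timestamp; never update in place
  INSERT INTO TABLE customer_hist
  SELECT customer_id, name, city, unix_timestamp() AS load_ts
  FROM customer_updates;

  -- Reads reconstruct the current state: newest row per key wins
  SELECT customer_id, name, city
  FROM (
    SELECT customer_id, name, city,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY load_ts DESC) AS rn
    FROM customer_hist
  ) latest
  WHERE rn = 1;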


Tuesday, May 13, 2014

Learning something new? Try varying your methods

Nowadays learning is largely democratized, and nothing stops you from learning or mastering a totally new subject, thanks to MOOCs (read: Massive Open Online Courses). These learning systems are already disrupting the traditional university structure. I'm pretty sure we are just at the beginning of a great education revolution. (Very excited for my 3-year-old daughter and her learning options!)
Although free is great, three things are missing in these learning platforms -- the planning, control, and validation processes of traditional methods. (These three are available in some MOOCs, like Coursera.) So it's up to the learner to acquire these must-have three of any learning venture.
Let's assume you have found the subject area, be it microeconomics, a programming language, a new human language, or natural history. Not sure what's right for you or your career? I cannot answer that question, as it is a personal thing. But have the attitude of "What's the worst that might happen?" -- "You may end up learning something new, however minuscule it might be" -- "You might lose 3-4 hours of your life; take it as a wrong movie choice".
The road ahead is not going to be easy, especially at the beginning; as I said, be an explorer.
"All change is hard at first, messy in the middle and gorgeous at the end." - Robin Sharma
This formula might help you reach the goal you are capable of, or more.
Warm-up
1. If it's totally new, or something you're scared of, cover the basics. Spend a lot of time understanding the fundamentals. It's worth it, as it will ease the learning of the complex things that come later in the process.
2. Don't target expert problems first - they may break your courage to venture very early in the process.
Finding
1. The method that suits your style (YouTube, webinars, books, online courses, community courses, podcasts, audiobooks, SlideShare, meetups, niche learning blogs).
2. The material that suits your knowledge. Not all learning materials are created equal; they should match your level of competency.
3. Vary your learning methods - accept that Google's or Amazon's top result may not be the one you are looking for, and try different methods. The method should also suit your lifestyle: can you watch YouTube at the office?! Read a PDF on your reader while commuting, or listen to a podcast?
4. Fail faster - move on; maybe to a different author, a different YouTube playlist, a different course.
5. Find the "one" - the above steps will lead you to it. Stick to it, follow it, and go as far as that method can take you.
Execution:
1. Create a "plan" and an achievable target.
2. Make incremental progress. Don't set targets like "learn and build a social site in a month"; instead, "learn to create a dynamic website using PHP in a week/month".
3. Track your progress against the plan. This will prevent you from getting distracted.
4. "Cost it" - there might be some $ involved in some learning methods; you might be surprised how cheap learning is compared to the returns. The biggest cost you will incur is your time.
5. "Try it" - create a sandbox, execute some sample code, or try to solve an existing problem.
6. "Do it" - a small project; get certified if a certification is available, or publish on your personal blog. The options are limitless.
Stuck:
You are fortunate if you can find a human who can mentor you and is good at mentoring. The other options: joining a meetup, following a niche blog, identifying a dedicated forum... and your "GRIT" to get out of a stuck situation.
“When the pupil is ready, the Master appears”
Happy learning!

"Big Data" - Im a believer!

What is the use case where Hadoop is "the" best technology to use? I have been working in Hadoop and big data for some years, using it for production analytics, learning and improving every day. I've attended numerous big data webinars and conferences, and talked up the might of big data to those who have never used it or are scared of it. But I hadn't felt the real benefit of using Hadoop or big data (I'm talking about my own case!) apart from its cost effectiveness; most of the work can be done using some other existing technology. With the traction Hadoop is getting, there must be some really BIG use case here. Yes, there are stories about Facebook, Twitter, and other sites, but there must be something that only Hadoop can do. I couldn't find that unavoidable purpose for this technology, until...
I recently enrolled in the "Machine Learning" course on Coursera - an awesome course from an awesome teacher (Andrew Ng). I never imagined I would understand why we had all those loads of math in school and college (why the heck differentiation and integration?); I thought I could have just learned languages and computer languages instead. But this course changed my perspective on math like a slap across the face: everything we see, use, and consume is developed using some form of math. This course is an eye-opener. I would recommend it to anyone with a data background; you will no longer see data as a waste of memory, you will see it as a gold mine waiting for the right explorer.

Now, in machine learning there are lots of techniques to predict something (give it 'x' and it will give you 'y' - not quite that simple :0). I'm still a beginner, but what I found is that we need thousands, in some cases even millions, of iterations just to find a single parameter - and there could be thousands of parameters in some cases. That was the aha! moment for me: Hadoop/big data is the only place "where you can store and process humongous data" in a cost-effective way. Previously, engineers might have limited the number of iterations because of resources; not anymore, with the power of HDFS and MapReduce to store and process, respectively.
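To put that in symbols, take the batch gradient descent update from the course, in its standard textbook form (alpha is the learning rate, m the number of training examples):

  \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Every single iteration of that sum sweeps the entire training set of m examples, once per parameter theta_j. Repeat that for thousands of iterations and thousands of parameters, and you can see why cheap, scalable storage and processing matter.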

I have seen many compelling use cases in numerous webinars, conferences, and whitepapers, but machine learning / predictive modeling is the most compelling reason (at least for me) that Hadoop is indispensable for the future of analytics - especially since we live in a digital, social world where the factors (x1, x2, x3... xn) that could affect any outcome (y) are ever increasing. Now I'm not just a practitioner but a believer!