Tuesday, January 24, 2012

Slowly changing dimensions in Hive

If you keep your data warehouse in Hive, how do you implement slowly changing dimensions?


This is not a solution blog; actually I'm looking for an effective way to implement SCD in Hive.
The problem is that all the SCD types use an update one way or another. For example, Type 2 could be a good candidate, but it requires updating the end date of the history record. The only option I can think of is to insert the new rows into the dimension and query on the max id or max date to get the latest dimension rows in the analytics layer (which seems like an overhead on MapReduce).
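
A minimal sketch of that insert-only idea, in Python rather than HiveQL, just to show the logic: every change is appended as a new row, and the reader picks the row with the max date per key. The row layout (key, load_date, attrs) and the function name are assumptions for illustration, not taken from an actual table.

```python
def latest_per_key(rows):
    """Pick the most recent row for each dimension key.

    rows: iterable of (key, load_date, attrs) tuples (hypothetical layout).
    load_date is assumed to be an ISO 'YYYY-MM-DD' string, so plain string
    comparison orders the dates correctly.
    """
    latest = {}
    for key, load_date, attrs in rows:
        # Keep the row only if it is newer than what we have seen for this key.
        if key not in latest or load_date > latest[key][0]:
            latest[key] = (load_date, attrs)
    return latest


history = [
    (1, "2011-06-01", "Chennai"),
    (1, "2012-01-24", "Bangalore"),
    (2, "2011-09-15", "Delhi"),
]
current = latest_per_key(history)
```

The cost the post worries about is visible here: every read has to scan the whole history to find the latest row per key.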

I posted this question to the Cloudera professionals group on LinkedIn, and Jasper recommended an idea that does not use Hive queries; he prefers MapReduce scripts to handle SCD situations. I modified the flow to suit Hive tables.

The idea is this:
1. Open the underlying HDFS file or select all rows using HiveQL
2. Pass the data through a mapper as key/value pairs
3. In the script, end-date the old record and leave the new record with a NULL end_date (for Type 2)
4. Overwrite the Hive table
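
Steps 2 and 3 could look roughly like the Python sketch below. It is only an illustration under assumptions: the DimRow layout, the apply_scd2 name and passing the incoming snapshot as a dict are all made up here, and the whole dimension is assumed to be processed in one pass (no partitions). The returned rows are what step 4 would write back over the Hive table.

```python
from collections import namedtuple

# Hypothetical row layout; the post itself only names end_date.
DimRow = namedtuple("DimRow", ["key", "attrs", "start_date", "end_date"])


def apply_scd2(existing, incoming, load_date):
    """Return the full new table contents after applying Type 2 changes.

    existing:  list of DimRow (history rows plus current rows)
    incoming:  dict mapping key -> attrs from the latest source snapshot
    load_date: date string used to close old rows and open new ones
    """
    result = []
    unchanged_keys = set()
    for row in existing:
        is_current = row.end_date is None
        if is_current and row.key in incoming and incoming[row.key] != row.attrs:
            # Attributes changed: end-date the old record.
            result.append(row._replace(end_date=load_date))
        else:
            # History rows and untouched current rows pass through as-is.
            result.append(row)
            if is_current:
                unchanged_keys.add(row.key)
    for key, attrs in incoming.items():
        if key not in unchanged_keys:
            # Brand new key, or a key whose old record was just closed:
            # open a new current record with a NULL (None) end_date.
            result.append(DimRow(key, attrs, load_date, None))
    return result


existing = [
    DimRow(1, "Chennai", "2011-01-01", None),
    DimRow(3, "Pune", "2011-01-01", None),
]
incoming = {1: "Bangalore", 2: "Delhi", 3: "Pune"}
new_table = apply_scd2(existing, incoming, "2012-01-24")
```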

But the process could be tricky if the data is spread across different partitions.


Tipping Point - Book Journal

Malcolm Gladwell is an expert at writing books about things we already know (very vaguely) and turning them into a compelling read with storytelling and statistics.. he must be the perfect author for the current information-overloaded culture.

Tipping Point is said to be his best, but for me, listening in audio format, Outliers felt more interesting, and its stories stayed in memory for relatively longer.

The law of the few - it is true that a few people make a big difference, and also that some critical pieces of work make a big difference in a big project. Here the point is not about innovation but marketing. I feel we all play these roles at times in different contexts. Gladwell categorizes these people, which makes it easier to find who is who in the future...

Stickiness factor - we are living in an era of short memory.. stickiness could last a day to a week. In the case of viral videos it could be an hour of fame. There is a war out there in advertising, web and entertainment to make a thing sticky. The point is to make your stuff sticky specifically for your target audience. Sometimes the stickiness can come from a funny context, or intelligence, or experience.

New York crime example - this is a fascinating story of how the New York subway handled crime by simply cleaning the graffiti off the cars every day. You can make a big difference by making small, spirited changes. Likewise, if a manager lets small mistakes in a project go by and doesn't appreciate the people taking care of such things, then the manager is setting a wrong example. It's all about defined values. And values are always simple and to the core.

Connectivity - You are connected to anyone in this world within six steps. This may have come down after Facebook and LinkedIn. And 150 is the size up to which a group or organization can work effectively. I see now why my company splits organizations for no apparent reason and gives meaningless names to those groups.

Teenage smoking issue - simple in one line: "experimenting may not become a habit". We can protect our kids from smoking only to a certain extent. They are going to try it at least once anyway, but that doesn't mean that they are hooked forever. More than 80% of kids who tried smoking quit along the way. Chippers is the name for occasional, non-addicted smokers.

Prelude - Identifying mavens.. identifying the right context.. identifying minor signs before they tip..

How small things can have a big impact.. small things are mostly in our control..

Monday, January 23, 2012

Hive outer join issue

Recently we faced a memory issue (Java heap space) when joining two partitioned tables.
Both tables are huge, with millions of rows, and each is partitioned on an hourly basis.

select a.*,b.*
from a left outer join b
on a.id = b.id
where
a.part > '2012-01-01'
and b.part > '2012-01-01';

This fails during peak times.

Issue:
When we checked the logs we found that the mappers were trying to access all the partitions of table b. Since the system is scarce in resources, the job failed with a heap space error.

Resolution:
Change the right table to an inline subquery..

select a.*,b.*
from a left outer join (select * from b where b.part > '2012-01-01') c
on a.id = c.id
where
a.part > '2012-01-01';

The number of mappers reduced dramatically, and the execution now scans only the required partitions from table b.
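
One thing worth checking when making this change (my observation, not something from the logs): in standard SQL the two queries are not strictly equivalent. A WHERE filter on the right table of a left outer join drops the rows of a that have no match in b, while filtering inside the subquery keeps them with NULLs. The sketch below uses Python's sqlite3 module purely as a stand-in for Hive, with made-up two-row tables, so it shows the row semantics only and says nothing about mappers or partition scans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny stand-in tables: a has ids 1 and 2, b only has id 1.
conn.executescript("""
    CREATE TABLE a (id INTEGER, part TEXT);
    CREATE TABLE b (id INTEGER, part TEXT);
    INSERT INTO a VALUES (1, '2012-01-02'), (2, '2012-01-02');
    INSERT INTO b VALUES (1, '2012-01-02');
""")

# Original form: filter on b in the WHERE clause.
where_filter = conn.execute("""
    SELECT a.id, b.id
    FROM a LEFT OUTER JOIN b ON a.id = b.id
    WHERE a.part > '2012-01-01' AND b.part > '2012-01-01'
    ORDER BY a.id
""").fetchall()

# Rewritten form: filter on b inside the inline subquery.
subquery_filter = conn.execute("""
    SELECT a.id, c.id
    FROM a LEFT OUTER JOIN (SELECT * FROM b WHERE b.part > '2012-01-01') c
      ON a.id = c.id
    WHERE a.part > '2012-01-01'
    ORDER BY a.id
""").fetchall()

print(where_filter)     # [(1, 1)]
print(subquery_filter)  # [(1, 1), (2, None)]
```

So the subquery form is usually what a left outer join is meant to return, but it is worth comparing row counts before and after the rewrite.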

Learning:
Always check the number of mappers while the query is running. If the numbers are not realistic, take a peek.

Take a divide and conquer approach when writing complex Hive queries.
Since Hive is not yet a completely mature SQL engine, we need to put in extra effort to examine the execution plan.

Tuesday, January 10, 2012

Moonwalking with Einstein - Book Journal

The book cover itself says it's not another "mind power" book.. the boring, technical and mostly impractical kind, sometimes with petty techniques which I already possess..

Author Foer takes us along on the journey he took to win the American memory championship, sprinkling in some of the techniques he used to remember different stuff. The book is certainly not for learning new memory techniques, as the author gives just a glimpse and feel of those techniques; it's up to the reader to dig deeper into them. Rather, the book succeeds in being interesting, funny and informative. Since I'm reading the Malcolm Gladwell series, MWE's narration comes close to Outliers. And some of the information is there in both books, which makes me think neither Foer nor Gladwell is the originator of the original idea (like the 10,000 hours of deliberate work); they are just aggregators and presenters of existing facts..

Memory palace - even though this is a technique we all use to keep things in memory, we can deliberately build memory palaces (from places we know well) for particular groups of things. This is helpful to me in remembering to-dos, passwords and some numbers. The trick is to assign weird images/persons, or persons doing weird things.

Remembering cards - the PAO method (Person, Action, Object) seems to be very effective. For each card in the deck, assign a personality doing some action on an object

A-spade - Rajnikanth wearing coolers
2-spade - Kamal kissing a girl
3-spade - Vijay dancing on a table
4-spade - Ajith raising his red shirt's collar

And when you get a 3-card combination you can create a single PAO mental image: the person from the first card, the action from the second and the object from the third.
Ajith dancing with a girl represents 4-spade, 3-spade, 2-spade..
It's hard to remember the image for each card, but once they are assigned, with practice it's possible to remember at least 10 cards I think !!

Remembering numbers - this can be done with the same technique, by assigning an image to each 2-digit combination (this is hard I think, but possible with deliberate practice)

Extreme people -
The other interesting part is about a man who forgets the breakfast he had 5 minutes ago, and some info on the famous Rain Man. The interesting insight is that time seems longer if you have more memories of that period, like vacations, job changes, important events and successful events.

Role of memory in education -
I feel education needs some memory in order to analyze data.. the future is of course about how effectively we analyze data, but memory also plays a part, it seems..

Overall, an interesting read, unlike the boring self-improvement books which always portray an ideal personality (which in reality can't exist)