Saturday, November 17, 2012

Black Friday 2012 deals - Google spreadsheet / Excel

Excel is a fantastic tool for searching and maintaining a list. Instead of scanning through multiple print ads, a single spreadsheet is a handy way to compare and find the real deals.

I will keep this file updated with the latest deals until Black Friday.

I have created a spreadsheet of the popular Black Friday deals across the major stores.

Popular deals only (about 500 items):


https://docs.google.com/spreadsheet/ccc?key=0AlP4buxzoaxsdExjMkM2WmJBcFZ5TlExTFMyWU1Oc1E


If you are looking for a particular product, store, or category, use this detailed list (more than 5,000 items).

Complete list:


https://docs.google.com/spreadsheet/ccc?key=0AlP4buxzoaxsdFBteHJwMjFlczVESUlTNXNNbEtuMXc

Google Spreadsheets has a search/filter option for every column, so it is very easy to search and compare the price of a particular item between different stores.

Please download a local copy as XLS, PDF, or any other format from "File > Download as" in the menu bar.


Friday, November 2, 2012

Top travel movies -- my favorite genre

One of my colleagues is actually from the Mediterranean, but he usually talks about India in such detail that it seems he has studied the country. Later he revealed that he had traveled through India in his early 20s with his friends. They had minimal money, and certainly not enough to travel by plane or stay in palaces. That is what makes their trip intriguing and interesting.

Me? I have always made planned trips; maybe I don't have the heart to face the uncertainty of unplanned travel. And it is exactly this uncertainty and exploration that make a travel movie so interesting.


Motorcycle Diaries
       This movie easily takes the top spot among travel movies. I love the South American scenery and Che's friend who traveled with him. How about crossing a continent with a close friend on a bike?


Zindagi Na Milegi Dobara (Hindi)
       Although it has the Hindi masala factor, it is still an enjoyable travel movie. Friends discovering themselves on an aimless trip to Spain.


Kikujiro (Japanese)
       What about traveling aimlessly with a kid? It is funny, and I am sure it touches everyone who watches it; I think that is because the boy's life resembles parts of our own childhood.


Le Grand Voyage (French)
       How special it is to travel with your father on his pilgrimage. I know the idea sounds like no fun while he is alive, but I never had the opportunity to spend this kind of quality time with my father. I am longing for a journey I can never start.


Up
       Risking a journey past 60 for a small promise he made to his wife when they were kids. Although it is animation, it has a lot of heart.

Into the Wild
       A slow movie that can be a time for self-exploration, although I hate the idea of running away from humanity.

127 Hours 
        Good music and plot

 Finding Nemo
        A father's search for his son with the help of a weird friend. A must-watch, I would say.

Thursday, November 1, 2012

Pentaho solution for Sqoop call with dynamic partitions

The Pentaho big data release does not have a step for Sqoop as of this writing.
The simple solution is to use a "Shell script" stage to call Sqoop.

But our requirement has a twist, and most Sqoop users probably have it as well:
- Capture incremental data for certain tables and keep it in a new Hive partition
- Run the Sqoop extract for a window of dates where each day's data goes to its own partition (DAILY, HOURLY, MONTHLY, etc.)

Design


The solution is to use a database for configuration and Pentaho to frame the Sqoop calls.

JOB


The job calls the pensqoop transformation to frame the list of Sqoop calls to run.
The shell stage actually runs the Sqoop script; check "execute for each input row" here so that the script is called once for every Sqoop call framed in the previous step.

TRANSFORMATION



1. Get the table configurations (DB connection, the incremental column to use for incremental data capture, etc.) - a sketch of such a configuration table is shown below
2. Switch the flow between DAILY partitions and FULL refresh
3. Frame the Sqoop call using string manipulation in a JavaScript stage
4. Use "copy rows to result" to send the output to the calling job
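As an illustration of step 1, here is a minimal sketch of what the configuration table could look like. The table and column names are hypothetical (they are not from our actual job); the point is just that each row carries enough detail for the JavaScript stage to frame one Sqoop command per table and date window.

CREATE TABLE sqoop_config (
  source_table     VARCHAR(100),   -- table to extract from the transactional system
  jdbc_url         VARCHAR(500),   -- source DB connection string
  incremental_col  VARCHAR(100),   -- column used for incremental data capture
  partition_type   VARCHAR(20),    -- DAILY, HOURLY, MONTHLY or FULL
  hive_table       VARCHAR(100)    -- target Hive table
);

-- step 1 of the transformation simply reads the active configurations
SELECT source_table, jdbc_url, incremental_col, partition_type, hive_table
FROM sqoop_config;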

Wednesday, June 20, 2012

Hive Maintenance

Hive log filling up tmp space

/tmp is usually the neglected folder in any Unix environment, but that is where Hive places all its log files. If you do not have enough space allocated for it, your scheduled Sqoop or Hive queries are going to fail for a silly reason - no temp space. Not worth it.


1. Create a cron job to remove the /tmp/<username>/*.txt files frequently
2. Change the Hive log directory to a different location that is monitored by ops

I prefer the second option, since that change does not remove any files in the process.
To make the configuration change, go to /etc/hive/conf/hive-site.xml and add the following property:

<property>
  <name>hive.querylog.location</name>
  <value>/var/<username>/tmp</value>
  <description>Directory where structured hive query logs are created</description>
</property>



Freeing up unused HDFS space

There are two directories we can concentrate on to free up a lot of HDFS space:

1. /tmp/hive-<username>/
2.  /local/hadoop/mapred/staging/user/.staging/job*

If you have Hue, you can also clean up some long-unused saved reports:
   1. /tmp/hive-beeswax-<username>/

If you limit yourself to files dated before the current month, you are safe (since this is a delete operation).
Hadoop, Hive, and Hue may have daemons or code to clean up these folders (I am not sure), but usually these files are left behind by orphaned MapReduce jobs - for example when you Ctrl+C a job instead of formally killing it with "hadoop job -kill <job_id>".

And of course, scanning the HDFS and Hive directories for junk files will also help.

The following commands assume the default configuration directories:

For Beeswax cleanup:  hadoop fs -rmr /tmp/hive-beeswax-*/hive*

For Hive tmp:         hadoop fs -rmr /tmp/hive-bhchandr/hive*
Older job files:      hadoop fs -rmr /local/hadoop/mapred/staging/<application_user>/.staging/job_2012*

Free Sqoop temp space

Remove the compile folders for Sqoop jobs:
/tmp/sqoop-<username>/compile/

Remove log files older than half a day at the JobTracker and the data nodes:

 /var/log/hadoop-*-*/userlogs/*            

Recover from safemode

Hadoop can go into safemode when the local directories backing HDFS are full. This usually happens when your HDFS files use up all the space, but the catch is that you cannot remove HDFS files until you recover from safemode.

So first remove some local files from
/opt/local/hadoop/mapred/local/taskTracker
/opt/local/hadoop/mapred/local/taskTracker/distcache
/opt/local/hadoop/mapred/local/taskTracker/<username>

then run

hadoop dfsadmin -safemode leave

Once Hadoop is out of safemode, you can clean up some HDFS space using the steps in the earlier sections.


Old job history files also pile up under /var/log/hadoop-0.20/history/done and are worth checking.

Thursday, May 24, 2012

Hive - digging deeper into metastore

In traditional databases we usually use information_schema or the database metadata tables to query for tables, columns, indexes, and so on. How about doing the same in Hive?

Hive has a metastore, which acts as the metadata repository: Hive uses this database to store its databases, tables, partitions, and SerDe information. Say you want to know the tables in a database, the physical (HDFS) location of the tables, or similar column names across tables.

I have not explored the default Derby DB, so let me talk about the metastore kept in a traditional database like MySQL. (If you do not have your metastore in an external DB, you can follow this link to set one up: https://ccp.cloudera.com/display/CDHDOC/Hive+Installation#HiveInstallation-ConfiguringtheHiveMetastore)

The Hive metastore has minimal tables compared to the metadata layer of a traditional database, but I am sure the metadata schema will get bigger and more complex in the future.

Let's go through the most useful tables:

COLUMNS -- the columns of each table
DBS -- the list of schemas/databases
PARTITIONS -- the partitions of each table
TBLS -- the tables

Querying these tables will give a basic idea of the data model of the Hive metastore (I could not find one documented anywhere on the internet). Understanding these tables is essential if you are seriously into Hive and have some production data.

For example, in one weird case we found that some partitions of one database were being loaded into the HDFS location of another database (maybe a code issue), but using the following query I was able to narrow down the affected tables:

select distinct a.tbl_id,a.tbl_name
    from TBLS a
        join PARTITIONS c on (a.TBL_ID=c.TBL_ID)
        join SDS b on (c.SD_ID=b.SD_ID)
 where b.location like '%sbx.db%' and a.DB_ID=1;
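Another query I find handy lists every table along with its database and HDFS location (column names are as they appear in the CDH MySQL metastore; verify them against your own metastore version):

select d.NAME as db_name, a.TBL_NAME, b.LOCATION
    from TBLS a
        join DBS d on (a.DB_ID = d.DB_ID)
        join SDS b on (a.SD_ID = b.SD_ID)
 order by d.NAME, a.TBL_NAME;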


Wednesday, April 25, 2012

Secondary Sort using Python and MRJob

MRJob is an excellent module developed as an open source project by Yelp. I chose mrjob because of the following features:


  1. It provides a seamless JSON reader and writer (i.e. the mapper can read JSON lines and convert them into lists)
  2. We can test a Hadoop job locally (in Windows or Unix) on a small dataset without actually using huge HDFS files (quick!)
  3. It can orchestrate many mappers and reducers in the same code

My task is to parse JSON-formatted web log files; say the columns are sessionid, stepno, and data. The pseudo-code:
  1. Read the json files using mrjob protocol
        DEFAULT_INPUT_PROTOCOL = 'json_value'
        DEFAULT_OUTPUT_PROTOCOL = 'repr_value'  #  output is delimited

  2. Yield sessionid, (sessionid, stepno, data) from the mapper.
                      MapReduce will make sure that all records with the same sessionid (the key) go to the same reducer, with the remaining values sent as a tuple (the value) to make it easier for us to sort in the reducer.

 3. Use Python's built-in sorted() to sort by step number in the reducer.

 def reducer(self, sessionId, details):
     sdetail = sorted(details, key=lambda x: x[1])  # sorting by stepno for each session
     for d in sdetail:
         line_data = '\t'.join(str(n) for n in d)

We are doing the secondary sort so that we can scan through the events in order, since the sequence is very important for funnel analysis of the logs.

Complete code :

import sys
#sys.path.append('/usr/lib/python2.4/site-packages/')
from mrjob.job import MRJob


class uet(MRJob):
    DEFAULT_INPUT_PROTOCOL = 'json_value'    # each input line is parsed as JSON
    DEFAULT_OUTPUT_PROTOCOL = 'repr_value'

    def mapper(self, _, line):
        # line is already a parsed JSON object (dict)
        sessionId = line['sessionId']
        data = line['data']
        if len(sessionId) < 13:
            for i in range(len(data)):
                no = data[i]['no']
                # key = sessionId, so all steps of a session reach the same reducer
                yield sessionId, (sessionId, no, data)

    def reducer(self, sessionId, details):
        # secondary sort: order the session's records by step number
        sdetail = sorted(details, key=lambda x: x[1])
        for d in sdetail:
            line_data = '\t'.join(str(n) for n in d)
            print str(line_data)


if __name__ == '__main__':
    uet.run()



Tuesday, April 17, 2012

Top NBA Players - by twitter followers

I really do not have any idea how advertising companies decide on the price for certain celebrities, because it is very hard to measure the direct relationship to sales and to derive an ROI. I am also not sure whether any celebrity with an enormous following can make a big impact. I did this data collection for fun, to see who is actually more popular on Twitter and has the bigger following. I used infochimps REST API calls to get the aggregated information and formatted the files to make them readable by Tableau (Python for the ETL). It seems SHAQ, even after leaving the league, has a greater fan following than the active players. I have not included Kobe and Rose, and I tried my best to get the official IDs of each player.


Tuesday, April 10, 2012

DW in Hive - handling big dimensions

This is an issue affecting our reporting data quality: how to handle big dimension tables in a Hive data warehouse, and how to balance performance and data quality.

Problem statement:
Currently the dimension Hive table (dim_customer) is partitioned by date. The daily incremental load creates a new partition so that we can improve performance on the reporting side. This poses three critical issues:
1. The slowly changing dimension history is lost
2. Data quality is compromised
3. Reports have to filter on the dimension table's partition

Solution:
The only way to solve this is to keep the dimension as one big Hive table instead of date partitions. But this creates issues with the refresh strategy and adds overhead to the reporting Hive queries.

The following is a recipe to work through this block; step 1 is certainly the priority.

1. Increase processing power
Hadoop is not only about mega storage, it is also about mega processing. So if we process big files, then we have to have a good number of nodes. Say we had 30 nodes to process the partitioned dimension table; we might have to move to 120 nodes for the single-dimension-file strategy. The required processing power grows roughly in proportion to the data scanned.

2. Use SQOOP merge
We cannot extract the whole table from the transactional system every time - the source systems might not allow it - so we can only capture the changed data. Sqoop merge comes in handy for this purpose: we can overwrite only the incremental records in the Hive table (type 1 SCD). Again, this is a MapReduce program, so we need processing power.

3. Use Bucketed Hive tables
The Hive performance bottleneck shows up while joining. We can create the table with bucketing - like a hash index on customer_id.


If the tables being joined are bucketed, and the bucket counts are multiples of each other, the buckets can be joined with each other. If table A has 8 buckets and table B has 4 buckets, the following join

SELECT /*+ MAPJOIN(b) */ a.KEY, a.value
FROM a JOIN b ON a.KEY = b.KEY
can be done on the map side only. Instead of fetching B completely for each mapper of A, only the required buckets are fetched. For the query above, the mapper processing bucket 1 of A will only fetch bucket 1 of B. This is not the default behavior; it is governed by a parameter (see the sketch below).
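To the best of my knowledge the parameter in question is hive.optimize.bucketmapjoin. Here is a minimal sketch - the table names, columns, and bucket counts are illustrative only, not from our warehouse:

set hive.enforce.bucketing = true;        -- inserts honour the bucket definition
set hive.optimize.bucketmapjoin = true;   -- enable the bucketed map join described above

CREATE TABLE dim_customer_b (
  customer_id BIGINT,
  customer_name STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;

-- join against a fact table bucketed on the same key (bucket counts being multiples of each other)
SELECT /*+ MAPJOIN(d) */ f.customer_id, f.amount, d.customer_name
FROM fact_sales f JOIN dim_customer_b d ON (f.customer_id = d.customer_id);

With both tables bucketed on customer_id, each mapper should pull only the matching bucket of the dimension instead of the whole table.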

Bucketing and Sqoop merge require good planning and metadata management.
Managing Hadoop from scratch is challenging, as we bump into the limits sooner than expected; we need to adapt quickly or the data might grow beyond those limits. One advantage (and complexity) is that the internal processing (MapReduce) is open, and it is up to the developer to improve it. And the biggest advantage of all is scaling out - you can add nodes easily and really make a difference.






Thursday, March 29, 2012

Data science -- the cool scientist without white gowns

"Scientist" was a cool word during my school days; I wanted to become one, but did not know in what. All the scientists I saw wore white gowns with colorful liquids around them. But later, during college and working days, scientists seemed to be boring people with no personal life, clustered in educational institutes with college kids helping around.

Recently a profile called "Data scientist" is all over the BI market and has started appearing in every article that mentions Hadoop or big data. It must be partly what a BI/DW person already does, with some specialization. For me, the formula is:

BI + big-data + statistics + scripting + visualization = Data scientist

OK, so can you be a scientist, work for a corporation, and still invent something new under your own name? Possibly.
It seems like a lot to cover, learn, and experience at work - maybe not, if you are in the right job. I am just listing a very high-level outline (and it is not limited to this); my intention is not to oversimplify, but certainly to simplify the puzzle.

BI
- DW work like ETL, databases, and SQL with exposure to an enterprise setup. Collecting data from heterogeneous data sources. Log analysis. Dimensional modelling, DW architecture. DB performance.

Big-data
Hadoop is the first thing that comes to mind for big data, but it is also good to know about NoSQL DBs.
MapReduce - shared-nothing architecture - the need for MR - use cases - available tools - pros and cons

Statistics
Basics - application of statistics in real-world - R programming

Scripting
Perl, Python, Java

Visualization
Reporting (I like Tableau), complex SQL, and the ability to tell a story with data - delivered in whatever way works best.

I would like to list some of the coolest learning materials available for the above topics.

I believe it is the thirst for discovery, and an appreciation of the secrets hidden in a boring pile of data, that make a data scientist.









Tuesday, February 21, 2012

Dual table in Hive

Since there is no dual table in Hive, I have created a dummy dual table with the dummy value 'X'.

hive> CREATE TABLE dual (dummy STRING);

hive> load data local inpath '/local/user/dw/hive/dual.txt' overwrite into table dual;

Now we can use this table to select literal values, like:

hive> select 'name','place','age' from dual;

To get the current timestamp:

hive> select unix_timestamp() from dual;
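The dual table is also handy for quickly checking built-in functions, for example:

hive> select from_unixtime(unix_timestamp(), 'yyyy-MM-dd') from dual;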

Friday, February 17, 2012

Talent is overrated - Geoff Colvin

Simply put, this book is good and can be life-changing for you, or maybe for your kid.

I get some new idea every day - while taking a bath, watching a movie, during a half-sleep at the office, or even at a workout. Some of them are great and turned out to be great things that someone else has already implemented; some already exist and I am just ignorant of them. Many self-analyzing thoughts made me what I am now. But the point is that I am not great, at least by the scale of this society. Then how do you define greatness? A person who has achieved enviable status in a field. The word field is the key here.

Being a software engineer, I cannot become maddeningly wealthy like Buffett just by reading 'The Intelligent Investor'. I am already a hard-wired employee in a field, whose time is controlled by someone else. And if you are married with kids, personal time is a joke. There are success stories everywhere in tennis, basketball, investing, and especially entrepreneurship. Just looking at them or researching them will not make us the same. You might have known this, yet human greed takes over sometimes and we start day-dreaming. Mostly the end result is disappointment.

If I had to become a cricketer, I should have had that thought by at least age 10 and played relentlessly until my skin was totally tanned. At this point the book makes the 10,000 hours argument clear, the same one Malcolm Gladwell's 'Outliers' talks about. Reading "Talent is Overrated", I recall Arnold Schwarzenegger's quote:

"Number one, come to America. Number two, work your butt off. And number three, marry a Kennedy."

The book says these very clearly

1. Choose your field. If you are already in a field and earning from it, it is hard to leave and try something new. But children have the advantage and can excel enormously.
2. Deliberate practice - it must be conscious, measurable, and improving
3. Do not get stuck on the OK plateau if you want to become world-class
4. Do not chase excellence in other areas you cannot afford to spend time on
5. Get family support for what you are doing

It is a revealing read, and one worth rereading if you fail to succeed in some area. The world is so competitive that only the extraordinary can win (even in a job interview). It can be a great book if you are a parent, one that could change your kid's life altogether.


Friday, February 10, 2012

Shell script - snippets

Loop through dates

# loop month by month from startdate (inclusive) up to enddate (exclusive)
startdate=`/bin/date --date="2007-07-01" +%Y-%m-%d`
enddate=`/bin/date --date="2011-07-01" +%Y-%m-%d`

foldate="$startdate"
until [ "$foldate" == "$enddate" ]
do
echo $foldate
# advance by one month
foldate=`/bin/date --date="$foldate 1 month" +%Y-%m-%d`
done

Tuesday, January 24, 2012

Slowly changing dimensions in Hive

If you keep your data warehouse in Hive, how do you implement slowly changing dimensions?


This is not a solution post; I am actually looking for an effective way to implement SCD in Hive.
All the SCD types use an update one way or another. For example, type 2 would be a good candidate, but it requires updating the end date of the history record. The only option I can think of is to insert the rows into the dimension and query on the max id or max date to get the latest dimension rows at the analytics layer (which seems like an overhead on MapReduce).

I posted this question to the Cloudera professionals group on LinkedIn, and Jasper recommended an idea that does not use Hive queries - he prefers MapReduce scripts to handle SCD situations. I modified the flow to suit Hive tables.

The idea is this:
1. Open the underlying HDFS file or select all rows using Hive QL
2. Pass the data through a mapper as key,value pairs
3. End-date the old record and leave the new record with a NULL end_date (for type 2) in the script
4. Overwrite the Hive table

But the process could be tricky if the data is present in different partitions.
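For the record, here is how steps 3 and 4 might be expressed directly in Hive QL instead of a MapReduce script. This is only a sketch: dim_customer and stg_customer are hypothetical table names, the columns and the hard-coded load date are illustrative, and you should verify on your Hive version that reading and overwriting the same table in one statement behaves as expected.

INSERT OVERWRITE TABLE dim_customer
SELECT u.customer_id, u.name, u.start_date, u.end_date
FROM (
    -- existing rows: close the open record when a new version has arrived
    SELECT d.customer_id, d.name, d.start_date,
           CASE WHEN s.customer_id IS NOT NULL AND d.end_date IS NULL
                THEN '2012-01-24' ELSE d.end_date END AS end_date
      FROM dim_customer d
      LEFT OUTER JOIN stg_customer s ON (d.customer_id = s.customer_id)
  UNION ALL
    -- new versions from the incremental load, left open-ended (NULL end_date)
    SELECT s.customer_id, s.name, '2012-01-24' AS start_date,
           CAST(NULL AS STRING) AS end_date
      FROM stg_customer s
) u;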


tipping point - Book journal

Malcolm Gladwell is an expert at writing books about things we already know (very vaguely) and making them compelling with storytelling and statistics. He must be the perfect author for the current information-overloaded culture.

The Tipping Point is said to be his best, but for me, listening in audio format, 'Outliers' felt more interesting and its stories stayed in memory for a relatively longer time.

The law of the few - it is true that a few people make a big difference, and likewise a few critical pieces of work make a big difference in a big project. The point here is not innovation but marketing. I feel we all play these roles at times, in different contexts. Gladwell categorizes these people, which makes it easier to spot who is who in the future.

Stickiness factor - we are living in an era of short memories; stickiness could last a day to a week. In the case of viral videos it could be an hour of fame. There is a war out there in advertising, the web, and entertainment to make things sticky. The point is to make your stuff sticky specifically for your target audience. Sometimes the stickiness can come from a funny context, or from intelligence or experience.

New York crime example - this is a fascinating story of how the New York subway handled crime by simply clearing the cars of graffiti every day. You can make a big difference with small, spirited changes. For example, if a manager lets small mistakes in a project slide and does not appreciate the people taking care of those things, then the manager is setting the wrong example. It is all about defined values - and values are always simple and to the core.

Connectivity - you are connected to anyone in this world within a factor of six; this may have shrunk after Facebook and LinkedIn. And 150 is the number up to which a group or organization can work effectively. Now I see why my company splits organizations for no apparent reason and gives meaningless names to those groups.

Teenage smoking issue - in one line, "experimenting may not become a habit". We can protect our kids from smoking only to a certain extent. They are going to try it at least once anyway, but that does not mean they are hooked forever. More than 80% of kids who try smoking quit along the way. 'Chippers' is the name for occasional, non-addicted smokers.

Prelude - identifying mavens, identifying the right context, identifying minor signs before they tip.

How small things can bring a big impact - and small things are mostly within our control.

Monday, January 23, 2012

Hive outer join issue

Recently we faced a memory issue (Java heap space) when joining two partitioned tables.
Both tables are huge - millions of rows - and each is partitioned on an hourly basis.

select a.*,b.*
from a left outer join b
on a.id = b.id
where
a.part > '2012-01-01'
and b.part > '2012-01-01';

This fails during peak times.

Issue:
When we checked the logs, we found that the mappers were trying to access all the partitions of table b. It appears the partition filter on b in the WHERE clause was not being pushed down past the outer join, so every partition of b was scanned. Since the system was short on resources, the job failed with a heap space error.

Resolution:
Change the right table to an inline view (subquery) so that the partition filter is applied before the join:

select a.*,c.*
from a left outer join (select * from b where b.part > '2012-01-01') c
on a.id = c.id
where
a.part > '2012-01-01';

The number of mappers dropped dramatically and the execution now scans only the required partitions of table b.

Learning:
Always check the number of mappers while the query is running. If the number does not look realistic, take a peek at the plan.

Take a divide-and-conquer approach when writing complex Hive queries.
Since Hive is not yet a completely mature SQL engine, we need to put in extra effort to examine the execution plan (see the EXPLAIN sketch below).
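One cheap way to check this before a long run is to put EXPLAIN (or EXPLAIN EXTENDED, which in my experience also lists the input paths and partitions) in front of the query and confirm it touches only the partitions you expect:

EXPLAIN EXTENDED
select a.*,c.*
from a left outer join (select * from b where b.part > '2012-01-01') c
on a.id = c.id
where a.part > '2012-01-01';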











Tuesday, January 10, 2012

Moonwalking with Einstein - Book Journal

The book cover itself says it is not another "mind power" book - the kind filled with boring technicalities and mostly impractical, sometimes petty techniques that I already possess.

Author Foer takes us on the journey he took to win the American memory championship, sprinkling in some of the techniques he used to remember different things. The book is certainly not for learning new memory techniques - the author just gives a glimpse and a feel of them, and it is up to the reader to dig deeper into those sprinkled techniques. Rather, the book succeeds in being interesting, funny, and informative. Since I am reading through the Malcolm Gladwell series, MWE's narration comes closest to Outliers. Some of the information appears in both books, which makes me think that neither Foer nor Gladwell originated the underlying ideas (like the 10,000 hours of deliberate work); they are aggregators and presenters of existing facts.

Memory palace - even though this is a technique we all use to keep things in memory, we can deliberately build memory palaces (with places we know well) for certain groups of things. This is helpful to me for keeping track of things to do, passwords, and some numbers. The trick is to assign weird images/persons, or persons doing weird things.

Remembering cards - the POA method (Person, Object, Activity) seems to be very effective. For each card in the deck, assign a personality doing some action on an object:

A-spade - Rajnikanth wearing a coolers
2-spade - Kamal kissing a girl
3-spade - vijay dancing on a table
4-spade - Ajith raising his red-shirts collar

And when you get a 3-card combination, we can create a single mental image of POA:
Ajith dancing with a girl represents 4-spade, 3-spade, 2-spade.
It is hard to remember the image for each card, but once they are assigned, with practice it should be possible to remember at least 10 cards, I think!

Remembering numbers - this can be done with the same technique as above, by assigning an image to each 2-digit combination (this is hard, I think, but possible with deliberate practice).

Extreme people -
The other interesting part is about a man who forgets the breakfast he had 5 minutes earlier, and some information on the famous Rain Man. The interesting takeaway is that a period of time will seem longer if you have more memories about it - like vacations, job changes, important events, and successful events.

Role of memory in education -
I feel education needs some memory in order to analyze data. The future is of course about how effectively we analyze data, but it seems memory also plays a part.

Overall, an interesting read, unlike the boring self-improvement books that always portray an ideal personality (which in reality cannot exist).