Thursday, August 11, 2011

Book review - Hadoop in Action

As a newbie trying to learn Hadoop, my biggest obstacle was "where to begin?": should I brush up the Java I learnt years back in college, or go through the numerous videos and blogs available on the internet? I tried a variety of books, mainly "Hadoop: The Definitive Guide", and skimmed through "Hadoop in Action".

Since the aura around Hadoop says it is high-tech and complex, we expect a book about it to be a tome. At first this book didn't give a good impression because of its small size. But what caught me was this text in the first chapter: "I won’t focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice." And the book lives up to this promise.

I'm sure that even if you are a newbie, as long as you have good programming knowledge in any language, you can come out writing some useful MapReduce programs. The book is comparatively small, so you can read through it and do some practice programs within a week.

Most of the example programs are written in Java, with some introduction to Python and Streaming programs. After reading this book I'm inclined to code in Java, but currently my job demands Python (which is so cool!).

If you are a newbie to Hadoop I would strongly recommend this book, but if you want to master Hadoop and are looking for reference material, this is not for you.

Monday, August 8, 2011

UDF in Pig - calculate hash code for a column

In most cases we cannot use Pig only as a simple query language; we have to use it as an analytic tool and also as a data-processing tool. To do that we need many powerful functions which are not built into Pig.

For example, I want to calculate the hash code of a particular column in one file and join it with the hash code of another column in a different file. There is no direct hash function in Pig to do that, so we have to go for a UDF.

First create the function in Java or Python. In my case it's Python.

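sha2.py
--------------------------------------------------
#!/usr/bin/python

import re
from sys import stdin
from hashlib import sha1

# Pig streams one record per line on stdin.
for title in stdin:
    # Lowercase, replace anything that is not a-z, 0-9 or space
    # (including the trailing newline) with a space, then collapse
    # runs of spaces into one.
    title = re.sub('[^a-z0-9 ]', ' ', title.lower())
    title = re.sub(' +', ' ', title)

    # Sort the words so that word order does not affect the hash.
    tokens = title.split(' ')
    tokens.sort()
    stitle = ' '.join(tokens)

    # hashlib replaces the deprecated sha module; encode() keeps
    # the call working on Python 3 as well.
    print(sha1(stitle.encode('utf-8')).hexdigest())

--------------------------------------------------------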
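To see what the normalisation buys us, here is a small sketch that wraps the same steps in a function and hashes a few made-up titles. The name title_hash and the sample titles are only for illustration, they are not part of the script above.

```python
import re
from hashlib import sha1


def title_hash(line):
    """Hash one input line using the same steps as the loop in sha2.py (hypothetical helper)."""
    # Lowercase, turn everything except a-z, 0-9 and space into a space,
    # then collapse repeated spaces.
    text = re.sub('[^a-z0-9 ]', ' ', line.lower())
    text = re.sub(' +', ' ', text)
    # Sort the words so their order no longer matters.
    tokens = sorted(text.split(' '))
    return sha1(' '.join(tokens).encode('utf-8')).hexdigest()


# Lines arrive from stdin with a trailing newline, so the samples keep it.
a = title_hash("Red Apple, 1kg\n")
b = title_hash("apple RED 1KG!\n")
c = title_hash("Green Apple 1kg\n")
print(a == b)  # same words, different case, punctuation and order
print(a == c)  # different words
```

The first two titles contain the same words, so they should end up with the same hash, which is what makes the hash usable as a join key.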

We can use this Python function in Pig scripts through the "define" command:

data = LOAD '/sys/edw/data';
item_title = foreach data generate $1,$2;
DEFINE Cmd `sha2.py` ship('/export/home/braja/work/sha2.py');
-- item_title has only two fields ($0 and $1); the title is assumed to be the second one
bfore_hashes = foreach item_title generate $1;
hashes = stream bfore_hashes through Cmd;
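
The join on the hashes is not shown in the script above. As a rough local illustration of the idea (plain Python, not Pig), the sketch below hashes the title column of two small made-up data sets and joins them on the hash. The names title_hash and join_on_title_hash and the sample rows are assumptions for the example. It also uses split() without arguments, which drops empty tokens, so its digests will not match the ones sha2.py prints.

```python
import re
from hashlib import sha1


def title_hash(title):
    """Normalise a title and return its SHA-1 hex digest (hypothetical helper)."""
    text = re.sub('[^a-z0-9 ]', ' ', title.lower())
    # split() with no argument drops empty tokens, a small deviation from sha2.py.
    tokens = sorted(text.split())
    return sha1(' '.join(tokens).encode('utf-8')).hexdigest()


def join_on_title_hash(left_rows, right_rows):
    """Inner-join two lists of (id, title) rows on the hash of the title."""
    # Index the right side by hash so each left row is a dictionary lookup.
    index = {}
    for right_id, right_title in right_rows:
        index.setdefault(title_hash(right_title), []).append(right_id)
    matches = []
    for left_id, left_title in left_rows:
        for right_id in index.get(title_hash(left_title), []):
            matches.append((left_id, right_id))
    return matches


# Made-up sample rows standing in for the two files.
file_a = [(1, "Red Apple 1kg"), (2, "Blue Pen")]
file_b = [("x", "apple, red 1KG"), ("y", "Green Pen")]
print(join_on_title_hash(file_a, file_b))
```

In Pig the same thing would be done by streaming both relations through the script and then joining on the hash column.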