Monday, August 8, 2011

UDF in Pig - calculate hash code for a column

In most cases we cannot use Pig only as a simple query language; we also have to use it as an analytics and data-processing tool. To do that we need many powerful functions that are not built into Pig.

For example, I want to calculate the hash code of a particular column in one file and join it with the hash code of another column in a different file. There is no direct hash function in Pig to do that, so we have to go for a UDF.

First create the function in Java or Python. In my case it's Python.

sha2.py
--------------------------------------------------
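#!/usr/bin/python

import re
from sys import stdin
from hashlib import sha1

# Pig streams each record to this script as one line on stdin.
for title in stdin:
    # Lowercase, then replace everything except a-z, 0-9 and space with a space.
    title = re.sub('[^a-z0-9 ]', ' ', title.lower())
    # Collapse runs of spaces into a single space.
    title = re.sub(' +', ' ', title)

    # Sort the tokens so that word order does not change the hash.
    tokens = title.split(' ')
    tokens.sort()
    stitle = ' '.join(tokens)
    # hashlib.sha1 replaces the deprecated sha module; output is one hex digest per line.
    print(sha1(stitle.encode('ascii')).hexdigest())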

--------------------------------------------------------
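To see why the tokens are sorted before hashing, the same steps can be wrapped in a function and tried on two titles that differ only in case, punctuation and word order. This is just a rough sketch: title_hash and the two sample titles are made up for illustration and are not part of the script that gets shipped to Pig.

```python
import re
from hashlib import sha1


def title_hash(line):
    """Hypothetical helper: same normalization steps as sha2.py, for one input line."""
    # Lowercase and replace anything that is not a-z, 0-9 or space with a space.
    title = re.sub('[^a-z0-9 ]', ' ', line.lower())
    # Collapse runs of spaces.
    title = re.sub(' +', ' ', title)
    # Sort tokens so word order does not matter.
    tokens = title.split(' ')
    tokens.sort()
    return sha1(' '.join(tokens).encode('ascii')).hexdigest()


# Made-up sample lines, each ending in a newline as they would when read from stdin.
a = title_hash("Apple iPod Nano 8GB\n")
b = title_hash("nano 8gb, apple IPOD\n")

# Both titles normalize to the same sorted token string, so the digests match.
print(a == b)
```

Since both files would go through the same normalization, titles that differ only in these ways end up with the same join key.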

We can use this Python function in Pig scripts with the "define" command:

data = LOAD '/sys/edw/data';
item_title = foreach data generate $1, $2;
DEFINE Cmd `sha2.py` ship('/export/home/braja/work/sha2.py');
-- item_title has only two fields ($0 and $1), so the original $2 is $1 here
bfore_hashes = foreach item_title generate $1;
hashes = stream bfore_hashes through Cmd;

