Pig is rarely used only as a simple query language; more often it has to serve as an analytics and data-processing tool. That requires functions that are not built into Pig.
For example, suppose I want to calculate the hash code of a particular column in one file and join it with the hash code of another column in a different file. There is no direct hash function in Pig to do that, so we have to write a UDF.
First, create the function in Java or Python. In my case it is Python:
sha2.py
--------------------------------------------------
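#!/usr/bin/python
import re
from sys import stdin, stdout
from hashlib import sha1

for title in stdin:
    # lower-case, then replace everything except a-z, 0-9 and space with a space
    title = re.sub('[^a-z0-9 ]', ' ', title.lower())
    # collapse runs of spaces into a single space
    title = re.sub(' +', ' ', title)
    # sort the tokens so that word order does not affect the hash
    tokens = title.split(' ')
    tokens.sort()
    stitle = ' '.join(tokens)
    # hashlib replaces the deprecated sha module; this is SHA-1 despite the file name
    stdout.write(sha1(stitle.encode('utf-8')).hexdigest() + '\n')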
--------------------------------------------------------
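To see why the script sorts the tokens before hashing, here is a minimal sketch that wraps the same per-line steps in a function and compares a few titles. The name title_hash and the sample titles are made up for illustration; they are not part of the original script.

```python
import re
from hashlib import sha1

def title_hash(line):
    """Hypothetical helper mirroring the per-line steps of sha2.py."""
    line = re.sub('[^a-z0-9 ]', ' ', line.lower())
    line = re.sub(' +', ' ', line)
    tokens = sorted(line.split(' '))
    return sha1(' '.join(tokens).encode('utf-8')).hexdigest()

# Sample titles (made up); each ends with a newline, as lines read from stdin do.
a = title_hash("Red Apple iPhone!\n")
b = title_hash("iphone, apple red\n")
c = title_hash("Red Apple iPad\n")

print(a == b)  # same words, different order and punctuation
print(a == c)  # different words
```

The first comparison should print True and the second False. Note that the trailing newline is turned into a space and ends up as an empty token, so the hashes only line up when both sides of a join are run through exactly the same steps.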
We can use this Python script from a Pig script with the DEFINE command, which gives the streaming command an alias and ships the script file to the cluster:
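-- load the input with the default loader (tab-delimited)
data = LOAD '/sys/edw/data';
item_title = FOREACH data GENERATE $1, $2;
-- alias the script as a streaming command and ship it to the cluster
DEFINE Cmd `sha2.py` SHIP('/export/home/braja/work/sha2.py');
-- item_title has only two fields ($0, $1), so $2 of data is $1 here
bfore_hashes = FOREACH item_title GENERATE $1;
hashes = STREAM bfore_hashes THROUGH Cmd;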
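The Pig script above stops at producing the hashes; the join mentioned at the start is not shown. As a rough illustration of the matching logic only, here is a plain Python sketch on in-memory lists (not Pig). The names title_key and hash_join and the sample titles are hypothetical, and title_key is a simplified normaliser, so its hash values will not be byte-identical to the ones sha2.py prints.

```python
import re
from collections import defaultdict
from hashlib import sha1

def title_key(title):
    # Simplified version of the sha2.py idea: lower-case, drop punctuation,
    # sort the tokens, then hash the result with SHA-1.
    tokens = sorted(re.sub('[^a-z0-9 ]', ' ', title.lower()).split())
    return sha1(' '.join(tokens).encode('utf-8')).hexdigest()

def hash_join(left, right):
    # Index the left side by hash, then probe it with each right-side title.
    index = defaultdict(list)
    for title in left:
        index[title_key(title)].append(title)
    return [(l, r) for r in right for l in index.get(title_key(r), [])]

# Made-up sample data standing in for the title columns of two files.
left_titles = ["Red Apple iPhone", "Blue Widget"]
right_titles = ["iphone apple RED!", "Green Gadget"]

pairs = hash_join(left_titles, right_titles)
print(pairs)
```

Only the two iPhone titles pair up, because they reduce to the same sorted token list. In Pig the equivalent step would presumably be a JOIN of the two hashed relations on the hash field.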