Friday, December 23, 2011

SVN notes

The simplest and most to-the-point cheat sheet for SVN:

http://www.abbeyworkshop.com/howto/misc/svn01/

One command that didn't work for me when I tried to commit ("add a file to the SVN repository"):

svn commit -m "Saving recent changes" http://localhost/svn_dir/repository/project_dir

I used the file name / folder instead of the URL:

svn commit -m "log for committing" filename.sql
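
For a brand-new file, the usual flow in a working copy looks something like this (a quick sketch; the file name is just a placeholder):

svn add filename.sql
svn commit -m "Adding filename.sql" filename.sql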

MySQL useful notes

To get only the DDL for all tables in a database, without data:

mysqldump -u user -p -h host --no-data --skip-opt db_name

A terrible issue I faced with the above command: the AUTO_INCREMENT and other table options don't show up in the dump, so I'm not going to use --skip-opt anymore:

mysqldump -u user -p -h host --no-data db_name

To get only the data dump, as INSERTs, for a single table:

mysqldump -u user -p -h host --no-create-info --skip-opt db_name table_name

Sometimes your user account doesn't have the privileges mysqldump needs (e.g. LOCK TABLES); in that scenario add the --single-transaction option so it dumps without locking the tables.
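
A full example with that option would look something like this (same placeholders as above; note that --single-transaction only gives a consistent snapshot for transactional engines like InnoDB):

mysqldump -u user -p -h host --single-transaction --no-data db_name > schema.sql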

To get the tables ordered by row count (note that table_rows is only an estimate for InnoDB tables):

select table_schema, table_name, table_rows from information_schema.tables where table_schema not in ('information_schema','performance_schema','mysql') order by table_schema, table_rows desc;

Run SQL as a string from the command line:
mysql -u user -p -e 'SQL Query' Database_Name
To export the result to a file:
mysql -u user -p -e 'SQL Query' Database_Name > filename
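
For example, the row-count query from above can be run non-interactively and saved to a file (a sketch; adjust the credentials and output path to your setup):

mysql -u user -p -e "select table_schema, table_name, table_rows from information_schema.tables where table_schema not in ('information_schema','performance_schema','mysql') order by table_rows desc" > table_counts.txt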


Performance pointers
Beware of running ANALYZE while there are long-running queries on the table.
In MySQL, if a long-running query is accessing the table and you run ANALYZE TABLE, the ANALYZE cannot get the lock it needs until the first query completes, so it will simply wait.
There is a good chance that a long-running query is active right when we try to analyze the table, so it is better to schedule all ANALYZE runs, especially on the tables serving end users, during off-peak hours.
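
A quick way to check for long-running statements before kicking off the analyze (a small sketch; it assumes you have the PROCESS privilege, and the db/table names are placeholders):

mysql -u user -p -e 'SHOW FULL PROCESSLIST'
mysql -u user -p -e 'ANALYZE TABLE db_name.table_name'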

Wednesday, December 21, 2011

Extract / Get DDL of Hive tables

A recent issue I came across: we needed to migrate Hive tables from one server to another (even just from a dev to a prod environment), and as usual we hadn't preserved the DDL for the existing Hive tables.

I Googled and came across this JIRA: https://issues.apache.org/jira/browse/HIVE-967

Step by step

1. Download HiveShowCreateTable.jar from the JIRA.
2. Copy it to the server that has Hive (even your local directory is fine).
3. Run it as follows (the syntax in the txt file attached to the JIRA doesn't work because it references a non-existent jar file):

hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=poc_detail 2>/dev/null

ta..da.. we get the DDL of the provided table :-)
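
If you need the DDL for every table in a database, the same command can be wrapped in a small loop, something along these lines (an untested sketch; it assumes hive is on the PATH and the jar is in the current directory):

for t in $(hive -S -e 'show tables' 2>/dev/null); do
  hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable \
    -D db.name=default -D table.name="$t" 2>/dev/null > "${t}.ddl"
done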

Thanks to Ed Capriolo, who created this code.

Wednesday, December 7, 2011

Outliers by Malcolm Gladwell - book journal

Just to record my understanding and perspective after listening to the book 'Outliers'.

The beginning of the book, about an Italian immigrant community in Roseto, Pennsylvania, is a good start: how one particular community managed to evade heart attacks, the prominent killer of modern times. I'm especially interested in this story since I have high cholesterol myself and my dad died of a stroke. The answer is typical of the author, it seems. His perspective that belonging to a close community is what wards off heart attacks and other diseases appealed to me, because I too have seen that men who are happier, who have a don't-care attitude or self-sufficient lives, live longer. Anyway, my takeaway: have friends, stay close to your community, have a laugh, and become content at some point in life :-)

The other story I liked is the 10,000 hours of work all the successful people put in to become one, like the Beatles, Bill Gates, Bill Joy and all the star sportspeople.
Although 10,000 hours may seem distant, it is possible over a long enough timeframe. I'm turning 31 soon; say I spent an hour every working day during my past 8 years of professional life (8 years x 12 months x 24 working days is about 2,300 hours), that leaves a deficit of about 7,700 hours:
@ 2 hrs / day = 13 yrs
@ 3 hrs / day = 9 yrs
@ 4 hrs / day = 6.5 yrs
Being married and having a kid, even 4 hrs a day is a high cost (with so much distraction around), but being an ambitious man I have to. The only way is to make my job a learning arena, since that is the only time I can concentrate; the trick is to have a job which gives that. Let's see if I have the opportunity and luck to keep that up for the next 7 years from now. I would be 38 by that time :-)

Malcolm is a strong advocate throughout the book for the impact of culture and ethnicity on success. I've got to think about mine and see what is there in my story, although I'm not an outlier (maybe I am one, considering my schoolmates and most of my childhood neighbours).
I'm fortunate to have a father and mother who both graduated, when most kids around me had one or neither parent educated. My mom, being an intelligent woman, understood the need for education and put so much time and money into teaching me, even though I didn't have good teachers all through my life. Although I never scored much in examinations, I certainly felt bright and intelligent. Then there was choosing IT for my engineering instead of the more fashionable aerospace, marine or electronics, being thankful for the IT industry boom in India, and landing the right job, which would later send me to the US for a project.

The other very important lesson I got is why certain people are so outspoken and have the knack of getting anything they want: it comes from their parents and their environment / ethnicity.
It's good to expose children to many talents at an early stage; they might pick something up, or at least learn how to cope with multiple things. Parents should set an example, such as giving freedom to children and spouse, and showing good listening manners. If possible, send them to a private school so they get good learning and the opportunity for more learning time.

According to Outliers, success is this: the right man (willing to work, and work hard), at the right time (grabbing the occasion), with the right environment (the tools and opportunity) and the right mindset (heredity and ethnicity related), will seize success!!


Thursday, September 8, 2011

Article on how Wal-mart uses Hadoop

This article from Businessweek gives an insight into how big companies like Walmart, Nokia and Disney use Hadoop for their revenue growth. Big data analysis is not optional anymore; it has become a necessity for tweaking the customer experience and providing better recommendations. If you need some more inspiration for learning Hadoop, read it through.

Thursday, August 11, 2011

Book review - Hadoop in action

As a newbie trying to learn Hadoop, the main obstacle to cross is "where to begin?" - whether to brush up the Java I learnt years back in college, or go over the numerous videos and blogs available on the internet. I tried to read a variety of books, especially "Hadoop: The Definitive Guide", and skimmed through "Hadoop in Action".

Since the aura around Hadoop says it's high-tech and complex, we expect the book to be a tome, so at first this book didn't give a good impression because of its size. But what caught me is the line in the first chapter: "I won't focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice." And the book lives up to this promise.

I'm sure that even if you are a newbie, as long as you have good programming knowledge in any language, you can come out writing some useful map-reduce programs. The book is comparatively small, so you can read through it and do some practice programs within a week.

Most of the example programs are written in Java, with some introduction to Python and streaming programs. After reading this book I'm inclined to code in Java, but currently my job demands Python (which is so cool!).

If you are a newbie to Hadoop I would strongly recommend this book, but if you want to master Hadoop and are looking for reference material, this is not for you.

Monday, August 8, 2011

UDF in Pig - calculate hash code for a column

In most cases we cannot use Pig as just a simple querying language; we have to use it as an analytic tool and also as a data-processing tool. To do that we need many powerful functions which are not available in Pig out of the box.

For example, I want to calculate the hash code of a particular column in a file and join it with the hash code of another column in a different file. There is no direct hash function in Pig to do that, so we have to go for a UDF (or, as below, a script streamed through Pig).

First create the function in Java / Python; in my case it's Python:

sha2.py
--------------------------------------------------
#!/usr/bin/python
# Streamed through Pig: reads one title per line on stdin and prints
# a SHA-1 hash of the normalized (lower-cased, cleaned, token-sorted) title.

import re
from sys import stdin
from hashlib import sha1

for title in stdin:
    # lower-case and replace anything that is not a letter, digit or space
    title = re.sub('[^a-z0-9 ]', ' ', title.strip().lower())
    # collapse repeated spaces
    title = re.sub(' +', ' ', title)

    # sort the tokens so that word order does not change the hash
    tokens = title.split(' ')
    tokens.sort()
    stitle = ' '.join(tokens)

    print sha1(stitle).hexdigest()

--------------------------------------------------------
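
To sanity-check the script outside Pig, you can pipe a sample title through it on the command line (a quick local test; the sample text is arbitrary):

chmod +x sha2.py
echo "Some Sample Title" | python sha2.py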

We can use this Python script in Pig scripts by declaring it with a DEFINE command and streaming the data through it:

data = LOAD '/sys/edw/data';
item_title = FOREACH data GENERATE $1, $2;        -- keep just the id and title columns
DEFINE Cmd `sha2.py` SHIP('/export/home/braja/work/sha2.py');
bfore_hashes = FOREACH item_title GENERATE $1;    -- after the projection the title is column $1
hashes = STREAM bfore_hashes THROUGH Cmd;


Tuesday, July 26, 2011

Get anyone's information

Came across a site, "RapLeaf", which provides information like gender, age, county and possibly more about most of us.

Sounds interesting; it might be very useful for user segmentation. Usually I thought this kind of information was mined and kept by big corporations who can afford to store and effectively use it. RapLeaf brings that to everybody, specifically to small and medium businesses, so they can also effectively mine and use their data.

Alright, I checked my own data - mostly relevant - and some of my friends'. Some are relevant, some are not; that's OK, soon RapLeaf will improve their algorithms. But the information is enough to achieve some segmentation. There are many thoughts going through my mind on how to use this information, maybe joining it with the census data (available on Amazon EC2). I don't know, but this is filed away in some corner of my memory.

OK, now security. I mentioned this to my friend and he is not happy about it, and I'm a little skeptical too; a little more thought and a criminal intention could be dangerous in this regard. I don't know if I'm being paranoid, or whether, since this data is available to everybody anyway, there is no point worrying about it.

All this apart, RapLeaf gives me the power to mine some valuable data...

Java Package for string manipulation

Most applications developed on Hadoop involve string manipulation: machine learning, crawling, indexing and matching algorithms. As usual with Hadoop, the data is going to be unstructured, crappy, and will not follow any rules.

So we need to do extensive and effective string manipulation to strip, clean and filter the string values. I found the following class has many handy methods for most of the needed actions:

http://ws.apache.org/axis/java/apiDocs/org/apache/axis/utils/StringUtils.html

like stripStart and stripEnd

When you use this package in your map-reduce program, the jar has to be available at run-time. You have two options:
  1. Install the jar in the lib directory of every node (not feasible in most cases).
  2. Ship it to the nodes where your tasks run.
For the second option you can pass -libjars while executing the job (note that -libjars is handled by GenericOptionsParser, so your driver should go through ToolRunner), and use the complete classpath while compiling:

javac -classpath /apache/hadoop/hadoop-core-0.20.security-wilma-14.jar:/home/invidx/axis.jar wc.java

hadoop jar wc.jar wc -libjars /home/invidx/axis.jar /apps/traffic/learn/countries.seq /apps/traffic/outp/

Monday, July 25, 2011

Deleting 0 byte files in HDFS

After running map-reduce jobs on large clusters you might have noticed that numerous output files are generated. This is because map-reduce creates one output file per reducer.

You can avoid this by setting the number of reducers to 1 using the "-D mapred.reduce.tasks=1" parameter while running the job. In the case of Pig you can run
set default_parallel 1;

in grunt.

With this technique you get a single output file, at the cost of performance, since we are using one reducer instead of hundreds.

The other option is to leave the number of reducers to whatever the job decides; then we need to deal with the numerous 0-byte files. I found this shell command handy to clean them up:

hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-".



Friday, July 22, 2011

Hadoop Small files problem

I'm writing an inverted-indexing program in Hadoop. It is easy to write a simple program to index a single file; it is nothing more than a wordcount program plus some string manipulation.

But at actual scale, indexing is going to run on thousands of files. The problem is not the count, since Hadoop can handle any number of files given a big enough cluster; the real problem is the size of the files. Hadoop is inefficient with small files: it will create a separate split (and hence a map task) for each file smaller than the specified split size.

The best option is to use a sequence file, where the key is the filename and the value is the content of the file. I found this code, http://stuartsierra.com/2008/04/24/a-million-little-files, which converts a tar file to a sequence file. The source file should be a bzip2-compressed tar; the syntax to create one is:

tar cvfj countries.tar.bz2 *.txt
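
Once the tar has been converted to a sequence file with that tool, copy it into HDFS (the paths here are only examples):

hadoop fs -put countries.seq /apps/traffic/learn/countries.seq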

When you read this sequence file in map-reduce you need to use the SequenceFile input format:

job.setInputFormat(SequenceFileInputFormat.class);

The other thing to note is that the key (filename) will be a Text and the value (file content) a BytesWritable; the output types below (Text, IntWritable) are what you would use for a wordcount-style job:

public static class MapClass extends MapReduceBase implements Mapper<Text, BytesWritable, Text, IntWritable>

public void map(Text key, BytesWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException

In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and emit each word as a key (so that I can do a wordcount in the reducer at a later stage):

String line = new String(value.getBytes(), 0, value.getLength()); // getBytes() alone returns the whole backing buffer, which may be longer than the content

This code skeleton is efficient enough that I'm able to squeeze GBs of files into a tar, then into a sequence file, and then index it.


Friday, July 15, 2011

Hadoop certified

Happy that I'm a certified Hadoop developer!

I just got certified yesterday, after a failure 2 weeks before. I misjudged the test at first and wasn't well prepared. The questions are quite diverse - Hadoop, DFS concepts, MapReduce, Java, a little administration, Pig, Hive, Flume etc. - but mostly on MapReduce and practical implementations.

The test certainly tests your practical knowledge of Hadoop; someone with good hands-on experience and a solid grasp of the concepts can pass.

Wednesday, July 13, 2011

Common HDFS shell commands

Just the list of very commonly used HDFS shell commands...

ls

hadoop fs -ls /

hadoop fs -ls /user/

lsr

Usage: hadoop fs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.

cat

Usage: hadoop fs -cat URI [URI …]

Copies source paths to stdout.

hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

hadoop fs -cat file:///file3 /user/hadoop/file4

put

Usage: hadoop fs -put <localsrc> ... <dst>

Copy single src, or multiple srcs, from the local file system to the destination file system.
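
Example (the local file name is just for illustration):

  • hadoop fs -put localfile.txt /user/hadoop/localfile.txt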

get

Usage: hadoop fs -get <src> <localdst>

Copy files to the local file system.
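
Example (paths for illustration only):

  • hadoop fs -get /user/hadoop/file1 localfile1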

cp

Usage: hadoop fs -cp URI [URI …]

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
Example:

  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

mkdir

Usage: hadoop fs -mkdir <paths>

Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p creating parent directories along the path.

Example:

  • hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

mv

Usage: hadoop fs -mv URI [URI …]

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:

  • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2

rm

Usage: hadoop fs -rm URI [URI …]

Delete files specified as args. Does not delete directories recursively; refer to rmr for recursive deletes.
Example:

  • hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

rmr

Usage: hadoop fs -rmr URI [URI …]

Recursive version of delete.
Example:

  • hadoop fs -rmr /user/hadoop/dir

tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout. -f option can be used as in Unix.

  • hadoop fs -tail pathname

Hadoop Video learning from cloudera

Cloudera provides excellent video resources for Hadoop, map-reduce and other components like Pig and Hive.

If you are fond of video learning like me, you might find these videos very useful:

  1. Hadoop : thinking at scale
  2. MapReduce and HDFS
  3. Hadoop Ecosystem
  4. Programming with Hadoop
  5. Mapreduce Algorithm

Starting with very simple Hadoop streaming

Hadoop is not all about Java (even though it mostly is); nothing stops you from trying some map-reduce or exploring the data available in HDFS even if you don't know anything about Java.
If you know even very basic Python it is possible to write basic map-reduce jobs; in this case we are going to use just shell commands (cat and wc) to get the word count for a given file.

Hadoop streaming works like Unix pipes: we stack programs so that each reads from the stdout of the previous one, left to right. We specify what the mapper should do and what the reducer should do, and of course the input file path and the output directory.

To read and understand Hadoop streaming in detail go to the following link..
http://hadoop.apache.org/common/docs/current/streaming.html

Problem statement:
To get the wordcount of the given file using map-reduce

Map-reduce design:
Mapper : just pass the file through with cat (each input line becomes a key)
Reducer : count the lines, words and characters of whatever reaches it, using wc

input HDFS file : /user/training/notes
output HDFS directory : /user/training/output/
Note : refer to the basic HDFS shell commands in a different post

command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -input /user/training/notes -output /user/training/output/ -mapper /bin/cat -reducer wc

This command uses the streaming jar provided by Hadoop at $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar; check the exact name of this file before running, because each version has a different file name.

Output:
You can verify the output from the files created in the output directory:

hadoop fs -lsr /user/training/output

The output from the reducer is always written to files named part-*.
To view the contents of a file:

hadoop fs -cat /user/training/output/part-00000

Tracking the execution:
While executing the code you might notice the following

1. the percentage of mappers and reducers completed
2. the link to the jobtracker (a web UI to track the job and view logs); the jobtracker web UI runs on port 50030,
in our case
http://localhost.localdomain:50030
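
You can also keep an eye on the job from the shell; for example (the job id below is just a placeholder taken from the jobtracker page):

hadoop job -list
hadoop job -status job_201107130000_0001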


tips:
If you are running the program a second time with the same output directory, you must remove the output directory first (the job fails if the output directory already exists):

hadoop fs -rmr /user/training/output

Configuring your Hadoop

Double-click the Cloudera image file; this will open the image in VMware Player.
Log in with the username and password (cloudera / cloudera).

Open the terminal using the icon in the top menu bar. Now we need to check a couple of environment variables before jumping into real coding:

echo $JAVA_HOME - this is where your Java installation is
echo $HADOOP_HOME - this is where the jar and lib files used by Hadoop are

If either of these prints nothing, set them:
export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop

Note: these paths might be different if you are using a different image or a different installation of Hadoop.

Once the parameters are set, you can check that Hadoop is available by just typing hadoop at the prompt; if it prints the help for hadoop then you are all set:

$hadoop

Start your own Hadoop cluster

Even if your organization has a Hadoop installation, they might not give you access to learn and try some code, and for a beginner an enterprise cluster with thousands of nodes is unnecessary anyway. So how do you try, feel and learn Hadoop for yourself?

The best and easiest option available is to use Cloudera's VMware image along with VMware Player. For beginners: VMware runs another virtual machine on top of your desktop, so say you have a Windows machine, you can install VMware Player on it and get a UNIX box inside it (pretty cool!!). VMware Player is freeware from www.vmware.com (precisely: https://www.vmware.com/tryvmware/index.php).

Next, regarding the image from Cloudera: anyone can configure a system through VMware, save it as a file and distribute it, and anyone else can use the same configuration on their machine just by copying the image file onto their system. We can download the VMware image for Cloudera Hadoop from https://ccp.cloudera.com/display/SUPPORT/Downloads

So that's it: install VMware Player, download the Cloudera image, and just double-click the *.vmx file in the downloaded directory; a fully configured Hadoop (UNIX) machine is available for you. At the time of this writing the username and password for the image are "cloudera" and "cloudera"; check their site while downloading.