Tuesday, July 26, 2011

Get anyone's information

I came across a site called "Rapleaf" which provides information like gender, age, county and possibly more about most of us..

Sounds interesting; it could be very useful for user segmentation. I used to think this kind of information was mined and kept by big corporations that could afford to store and use it effectively. Rapleaf brings that to everybody, especially to small and medium businesses, so they too can mine and use such data effectively.

Alright, I checked my data -- mostly relevant -- and some of my friends'. Some entries are accurate, some are not; that's OK, Rapleaf will improve their algorithms soon. The information is already enough to achieve some segmentation. Many ideas are going through my mind on how to use it, maybe joining it with the census data (available on Amazon EC2).. I don't know yet, but it is filed away in a corner of my memory..

OK, now security: I mentioned this to a friend and he is not happy about it, and I'm a little skeptical too.. a little more thought plus criminal intent could be dangerous in this regard.. I don't know whether I'm being paranoid, or whether this data is available to everybody anyway, so why worry about it..

All that apart, Rapleaf gives me the power to mine some valuable data...

Java Package for string manipulation

Most applications developed on Hadoop involve heavy string manipulation -- machine learning, crawling, indexing and matching algorithms.. And as usual with Hadoop, the data is going to be unstructured, crappy and will not follow any rules.

So we need extensive and effective string manipulation to strip, clean and filter the string values. I found that the following package has handy methods for most of the needed operations..

http://ws.apache.org/axis/java/apiDocs/org/apache/axis/utils/StringUtils.html

like stripStart and stripEnd
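A quick sketch of how these might be used inside a mapper to clean a raw field before using it (the sample string and the characters being stripped are just for illustration; check the Javadoc above for the exact method signatures):

import org.apache.axis.utils.StringUtils;

// a raw field pulled out of a record, with junk at both ends
String raw = "  ***united states***  ";
// strip spaces and '*' characters from the start and the end
String cleaned = StringUtils.stripEnd(StringUtils.stripStart(raw, " *"), " *");
// cleaned is now "united states"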

When you use this package in your mapreduce program, the jar has to be available at run-time. You have two options:
  1. Install the jar in the lib directory of every node in the cluster (not feasible in most cases)
  2. Ship it to the nodes where your tasks run.
For the second option, pass the jar with -libjars while executing the job, and put it on the classpath while compiling..

javac -classpath /apache/hadoop/hadoop-core-0.20.security-wilma-14.jar:/home/invidx/axis.jar wc.java
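In between, the compiled classes have to be bundled into wc.jar before Hadoop can ship them around; something like the following, assuming the generated class files start with wc:

jar cvf wc.jar wc*.class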

hadoop jar wc.jar wc -libjars /home/invidx/axis.jar /apps/traffic/learn/countries.seq /apps/traffic/outp/
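One thing to keep in mind: -libjars is a generic Hadoop option handled by GenericOptionsParser, so the driver class (wc here) should go through ToolRunner (or otherwise parse the generic options) for -libjars to take effect.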

Monday, July 25, 2011

Deleting 0 byte files in HDFS

After running map-reduce jobs on large clusters you might have noticed that numerous output files are generated. This is because map-reduce creates one output file per reducer.

You can avoid this by setting the number of reducers to 1 with the "-D mapred.reduce.tasks=1" parameter while running the job. In the case of Pig you can run

set default_parallel 1;

in grunt.

With this technique you get a single output file at the cost of performance, since you are using one reducer instead of hundreds.

The other option is to leave the number of reducers as it is; then we need to deal with the numerous 0-byte files that some reducers produce.
I found this shell command handy to clean those files:

hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-"
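If you would rather do the same cleanup from Java, here is a rough sketch using the HDFS FileSystem API (the class name is mine, and unlike the -lsr one-liner above it only looks at a single directory):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanEmptyParts {
  public static void main(String[] args) throws Exception {
    // args[0] is the job output directory, e.g. /apps/bhraja/metric_m005
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      String name = status.getPath().getName();
      // remove only the empty part- files, leave everything else alone
      if (name.startsWith("part-") && status.getLen() == 0) {
        fs.delete(status.getPath(), false);
      }
    }
  }
}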



Friday, July 22, 2011

Hadoop Small files problem

I'm writing an inverted indexing program in Hadoop. It is easy to write a simple program to index a single file; it is nothing more than a wordcount program plus some string manipulation.

But at actual scale, indexing is going to run on thousands of files. The problem is not the count, since Hadoop can handle any number of files given a big enough cluster. The real problem is the size of the files: Hadoop is inefficient on small files, because it creates a separate split for every file smaller than the specified split size.

The best option is to use a sequence file, where the key is the filename and the value is the content of the file. I found this code http://stuartsierra.com/2008/04/24/a-million-little-files which converts a tar file into a sequence file. The source should be a bzip2-compressed tar, created like this:

tar cvfj countries.tar.bz2 *.txt

When you read this sequence file in map-reduce you need to use the SequenceFileInputFormat:

job.setInputFormat(SequenceFileInputFormat.class);
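For context, the relevant driver lines with the old JobConf API might look roughly like this (the class names and the Text output types are placeholders for my indexer, not a definitive recipe):

JobConf conf = new JobConf(Indexer.class);
conf.setJobName("invidx");
// keys are file names (Text), values are file contents (BytesWritable)
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setMapperClass(MapClass.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);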

The other thing to note is that the key (the filename) will be a Text and the value (the file content) a BytesWritable.

public static class MapClass extends MapReduceBase implements Mapper

public void map(Text key, BytesWritable value, OutputCollector output, Reporter reporter) throws IOException

In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and emit each word as a key (so that I can do a word count in the reducer at a later stage):

String line = new String(value.getBytes(), 0, value.getLength());

This skeleton is efficient enough that I'm able to squeeze GBs of files into a tar, then into a sequence file, and then index it..
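Putting the fragments above together, a minimal version of the mapper might look like this (the whitespace tokenizing and the choice to emit the file name as the value are simplifications for the sketch; a real indexer would do more string cleanup):

// imports for the enclosing source file:
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class MapClass extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  public void map(Text key, BytesWritable value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // key = file name, value = entire file content;
    // use getLength() because getBytes() may return a padded buffer
    String content = new String(value.getBytes(), 0, value.getLength());
    for (String word : content.split("\\s+")) {
      if (word.length() > 0) {
        // emit (word, file name) so the reducer can build the postings list
        output.collect(new Text(word), key);
      }
    }
  }
}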

Friday, July 15, 2011

Hadoop certified

Happy that I'm a certified Hadoop developer!

Just got certified yesterday, after a failure two weeks before.. I misjudged the test at first and wasn't well prepared. The questions are quite diverse: Hadoop and DFS concepts, mapreduce, Java, a little administration, Pig, Hive, Flume etc., but mostly on mapreduce and practical implementations.

The test certainly probes your practical knowledge of Hadoop; anyone with good hands-on experience and a solid grasp of the concepts can pass...

Wednesday, July 13, 2011

Common HDFS shell commands

Just a list of the most commonly used HDFS shell commands...

ls

hadoop fs -ls /

hadoop fs -ls /user/

lsr

Usage: hadoop fs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.

cat

Usage: hadoop fs -cat URI [URI …]

Copies source paths to stdout.

hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2

hadoop fs -cat file:///file3 /user/hadoop/file4

put

Usage: hadoop fs -put <localsrc> ... <dst>

Copy a single src, or multiple srcs, from the local file system to the destination file system.
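Example:

  • hadoop fs -put localfile /user/hadoop/hadoopfile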

get

Usage: hadoop fs -get <src> <localdst>

Copy files from HDFS to the local file system.
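Example:

  • hadoop fs -get /user/hadoop/file localfile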

cp

Usage: hadoop fs -cp URI [URI …]

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
Example:

  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

mkdir

Usage: hadoop fs -mkdir <paths>

Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

  • hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2

mv

Usage: hadoop fs -mv URI [URI …]

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:

  • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2

rm

Usage: hadoop fs -rm URI [URI …]

Delete files specified as args. Only deletes files and empty directories. Refer to rmr for recursive deletes.
Example:

  • hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

rmr

Usage: hadoop fs -rmr URI [URI …]

Recursive version of delete.
Example:

  • hadoop fs -rmr /user/hadoop/dir

tail

Usage: hadoop fs -tail [-f] URI

Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.

  • hadoop fs -tail pathname

Hadoop Video learning from cloudera

Cloudera provides excellent video resources for Hadoop, map-reduce and other components like Pig and Hive..

If you are fond of video learning like me, you might find these videos very useful..

  1. Hadoop : thinking at scale
  2. MapReduce and HDFS
  3. Hadoop Ecosystem
  4. Programming with Hadoop
  5. Mapreduce Algorithm

Starting with very simple Hadoop streaming

Hadoop is not all about Java (even though it mostly is).. nothing stops you from trying some map-reduce or exploring the data available in HDFS even if you have no idea about Java..
If you know even very basic Python it is possible to write a basic map-reduce.. but in this case we are going to use just shell commands (cat and wc) to get the word count for a given file..

Hadoop provides streaming, which works like Unix pipes: programs are chained so that the stdout of one stage feeds the next. We specify what the mapper should do and what the reducer should do.. and of course the input file path and the output directory..

To read and understand Hadoop streaming in detail go to the following link..
http://hadoop.apache.org/common/docs/current/streaming.html

Problem statement:
To get the wordcount of the given file using map-reduce

Map-reduce design:
Mapper : /bin/cat, which simply streams each input line through unchanged
Reducer : wc, which counts the lines, words and characters it receives, giving an overall word count rather than a per-word count

input HDFS file : /user/training/notes
output HDFS directory : /user/training/output/
Note: refer to the basic HDFS shell commands in a different post

command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -input /user/training/notes -output /user/training/output/ -mapper /bin/cat -reducer wc

This command uses the streaming framework provided by Hadoop in $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar .. check the name of this file before running, because each version has a different file name..

Output:
You can verify the output from the files created in the output directory:

hadoop fs -lsr /user/training/output

The output from the reducer is always written to files named part-*; since our reducer is wc, the part file will just contain its line, word and character counts.
To view the contents of the file:

hadoop fs -cat /user/training/output/part-00000

Tracking the execution:
While executing the code you might notice the following

1. the percentage of mappers and reducers completed
2. the link to the JobTracker (a web UI to track jobs and view logs).. the JobTracker web UI runs on port 50030
in our case
http://localhost.localdomain:50030
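Besides the web UI, running jobs can also be listed from the command line with hadoop job -list, and a specific job can be killed with hadoop job -kill <job_id>.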

tips:
If you are running the program a second time with the same output directory, you must remove the output directory first:

hadoop fs -rmr /user/training/output

Configuring your Hadoop

Double-click the Cloudera image file.. this will open the image in VMware Player.
Log in with the username and password (cloudera, cloudera).

Open the terminal using the icon in the top menu bar.. Now we need to check a couple of environment variables before jumping into real coding.

echo $JAVA_HOME - this is where your Java installation lives..
echo $HADOOP_HOME - this points to the jar and lib files used by Hadoop

If either of them prints nothing, set them:
export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop
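To make these settings persist across sessions, the same two export lines can also be added to ~/.bashrc.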

Note: these paths might be different if you are using a different image or a different installation of Hadoop.

Once the variables are set, you can check whether Hadoop is working by just typing hadoop at the prompt.. if it prints the hadoop help, you are all set..

$hadoop

Start your own Hadoop cluster

Even if your organization has a Hadoop installation, they might not give you access to learn and try out code. And for a beginner, an enterprise cluster with thousands of nodes is unnecessary. So how do you try, feel and learn Hadoop for yourself..

The best and easiest option is to use Cloudera's VMware image along with VMware Player. For beginners: VMware Player runs another virtual machine on your desktop -- say you have a Windows machine, you can install VMware Player and turn it into a UNIX box (pretty cool!!). VMware Player is freeware from www.vmware.com (precisely: https://www.vmware.com/tryvmware/index.php)

Next, regarding the image from Cloudera: anyone can configure a system through VMware, save it as a file and distribute it, and anyone else can use the same configuration on their machine just by copying the image file. The VMware image for Cloudera Hadoop can be downloaded from https://ccp.cloudera.com/display/SUPPORT/Downloads

So that's it: install VMware Player, download the Cloudera image and just double-click the *.vmx file in the downloaded directory.. a fully configured Hadoop (UNIX) machine is available to you. At the time of this writing the username and password for the image are "cloudera" and "cloudera"; check their site while downloading.