Tuesday, July 26, 2011
Get anyone's information
Sounds interesting, and it could be very useful for user segmentation. I used to think this kind of information was mined and kept by big corporations who can afford to store and effectively use it. Rapleaf brings that to everybody, specifically to small and medium businesses, so they too can effectively mine and use this data.
Alright, I checked my own data -- mostly relevant, and some of my friends' too. Some results are relevant, some are not, but that's OK; Rapleaf will improve their algorithm soon. The information is already enough to achieve some segmentation. There are many thoughts going through my mind on how to use it, maybe join it with the census data (available on Amazon EC2).. I don't know, but I've filed it away in some corner of my memory..
OK, now security: I mentioned this to a friend and he is not happy about it, and I'm a little skeptical too.. a little more thought combined with criminal intent could be dangerous here.. I don't know whether I'm being paranoid, or whether this data is available to everybody anyway, so why worry about it..
All that apart, Rapleaf gave me the power to mine some valuable data...
Java Package for string manipulation
So we need to do extensive and effective string manipulation to strip, clean and filter string values. I found that the following package has many handy features for most of the needed actions..
http://ws.apache.org/axis/java/apiDocs/org/apache/axis/utils/StringUtils.html
like stripStart and stripEnd
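A quick illustrative snippet of how those two helpers can be used (just a sketch: the class name StripDemo and the sample string are mine, and it assumes the (String str, String stripChars) signatures these strip helpers usually have):
import org.apache.axis.utils.StringUtils;

public class StripDemo {
    public static void main(String[] args) {
        String raw = "###hadoop,pig,hive;;;";
        String noPrefix = StringUtils.stripStart(raw, "#");  // drop leading '#' characters
        String clean = StringUtils.stripEnd(noPrefix, ";");  // drop trailing ';' characters
        System.out.println(clean);                           // prints: hadoop,pig,hive
    }
}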
When you use this package in your MapReduce program, the program will look for the package at run time. You have two options:
- Include the jar in the lib directory of every node in the cluster (not feasible in most cases)
- Pass it along with the job to the nodes where your data is, using -libjars:
javac -classpath /apache/hadoop/hadoop-core-0.20.security-wilma-14.jar:/home/invidx/axis.jar wc.java
hadoop jar wc.jar wc -libjars /home/invidx/axis.jar /apps/traffic/learn/countries.seq /apps/traffic/outp/
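One thing to note: -libjars is a generic option handled by GenericOptionsParser, so the driver class (wc in the command above) has to pick it up, typically by implementing Tool and launching through ToolRunner; otherwise -libjars just arrives in main() as an ordinary argument. A minimal driver sketch along those lines (the mapper/reducer wiring is omitted, and the class name simply matches the command above):
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class wc extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already carries whatever -libjars / -D options were passed on the command line
        JobConf conf = new JobConf(getConf(), wc.class);
        conf.setJobName("wordcount");
        // set mapper/reducer classes, output key/value types etc. here as needed
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options before calling run()
        System.exit(ToolRunner.run(new wc(), args));
    }
}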
Monday, July 25, 2011
Deleting 0 byte files in HDFS
Each reducer writes its own part file, so a job with many reducers can leave a lot of empty (0 byte) output files behind. You can avoid this by setting the number of reducers to 1 using the "-D mapred.reduce.tasks=1" parameter while running the job. In the case of Pig you can set
set default_parallel 1;
in grunt.
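For a plain MapReduce job the same thing can be done from the Java driver. A minimal sketch using the old JobConf API (MyJob is just a placeholder driver class):
import org.apache.hadoop.mapred.JobConf;

// inside the driver, before submitting the job
JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder for your job class
conf.setNumReduceTasks(1);                 // same effect as -D mapred.reduce.tasks=1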
With this technique you get a single output file, at the cost of performance, since we are using one reducer instead of hundreds.
The other option is to let the framework decide the number of reducers; in that case we need to deal with the numerous 0 byte files..
I found this shell one-liner handy for cleaning them up:
hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-" ($5 is the size column and $8 is the path column in the lsr listing)
Friday, July 22, 2011
Hadoop Small files problem
But when you think at real scale, indexing is going to run on thousands of files. The problem is not the count, since Hadoop can handle any number of files given that your cluster is big enough. The real problem is the size of the files. Hadoop will be inefficient on small files: it creates a separate split (and hence a separate map task) for each file smaller than the specified split size.
The best option is to use a sequence file, where the key is the filename and the value is the content of the file. I found this code http://stuartsierra.com/2008/04/24/a-million-little-files which converts a tar file to a sequence file. The source file should be a bzip2-compressed tar; the syntax to create one is
tar cvfj countries.tar.bz2 *.txt
When you read this sequence file in MapReduce you need to use the SequenceFile input format.
job.setInputFormat(SequenceFileInputFormat.class);
The other thing to note is that the key (the filename) comes in as a Text and the value (the file content) as a BytesWritable.
public static class MapClass extends MapReduceBase implements Mapper<Text, BytesWritable, Text, IntWritable>
public void map(Text key, BytesWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter)
In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and emit each word as a key (so that I can do a word count at the reducer in a later stage)
String line = new String(value.getBytes(), 0, value.getLength()); // getBytes() returns the padded backing buffer, so limit it to getLength()
This code skeleton is so efficient that I'm able to squeeze GBs of files into a tar, then into a sequence file, and then index it..
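For reference, pulling the fragments above together, a minimal mapper for this kind of sequence file could look like the following (a sketch against the old org.apache.hadoop.mapred API, written here as a standalone class; the class name is illustrative, and the Text/IntWritable output types assume the word-count job described above):
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SeqFileWordCountMapper extends MapReduceBase
        implements Mapper<Text, BytesWritable, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Text key, BytesWritable value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // key = original file name inside the tar, value = raw file contents
        String line = new String(value.getBytes(), 0, value.getLength());
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);   // emit (word, 1) for the reducer to sum
        }
    }
}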
Friday, July 15, 2011
Hadoop certified
Just got certified yesterday, after a failure 2 weeks before.. I misjudged the test at first and wasn't well prepared. The questions are quite diverse: Hadoop and DFS concepts, MapReduce, Java, a little administration, Pig, Hive, Flume etc.. but mostly on MapReduce and practical implementations.
The test certainly checks your practical knowledge of Hadoop; anyone with good hands-on experience and a solid grasp of the concepts can pass...
Wednesday, July 13, 2011
Common HDFS shell commands
ls
hadoop fs -ls /
hadoop fs -ls /user/
lsr
Usage: hadoop fs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
cat
Usage: hadoop fs -cat URI [URI …]
Copies source paths to stdout.
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4
put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination file system.
get
Usage: hadoop fs -get <src> <localdst>
Copy files to the local file system.
cp
Usage: hadoop fs -cp URI [URI …] <dest>
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
Example:
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
mkdir
Usage: hadoop fs -mkdir <paths>
Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p creating parent directories along the path.
Example:
- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
mv
Usage: hadoop fs -mv URI [URI …] <dest>
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
- hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
rm
Usage: hadoop fs -rm URI [URI …]
Delete files specified as args. To delete directories, refer to rmr for recursive deletes.
Example:
- hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
rmr
Usage: hadoop fs -rmr URI [URI …]
Recursive version of delete.
Example:
- hadoop fs -rmr /user/hadoop/dir
tail
Usage: hadoop fs -tail [-f] URI
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.
- hadoop fs -tail pathname
Hadoop Video learning from cloudera
If you are fond of video learning like me, you might find these videos very useful..
Starting with very simple Hadoop streaming
If you know very basic Python it is possible to write a basic map-reduce.. but in this case we are going to use just shell commands (cat and wc) to get the word count for a given file..
Hadoop streaming works like Unix pipes: each program reads the stdout of the previous one, left to right. We specify what the mapper should do and what the reducer should do.. and of course the input file path and the output directory..
To read and understand Hadoop streaming in detail go to the following link..
http://hadoop.apache.org/common/docs/current/streaming.html
Problem statement:
To get the wordcount of the given file using map-reduce
Map-reduce design:
Mapper : just pass each line of the file through using cat
Reducer : count the lines, words and characters of the mapper output using wc
input HDFS file : /user/training/notes
output HDFS directory : /user/training/output/
Note : refer to basic HDFS shell commands in different post
command:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -input /user/training/notes -output /user/training/output/ -mapper /bin/cat -reducer wc
This command uses the streaming jar provided by Hadoop at $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar .. check the exact name of this file before running, because each version has a different file name..
Output:
You can verify the output from the files created in the output directory
hadoop fs -lsr /user/training/output
The output from the reducer is always written to files named part-*
To view the contents of the file:
hadoop fs -cat /user/training/output/part-00000
Tracking the execution:
While executing the code you might notice the following
1. the percentage of mappers and reducers completed
2. the link to the jobtracker (a web UI to track the job and view logs).. the jobtracker web UI runs on port 50030
in our case
http://localhost.localdomain:50030
tips:
If you are running the program a second time with the same output directory, you must remove the output directory first:
hadoop fs -rmr /user/training/output
Configuring your Hadoop
Log in with the username and password (cloudera, cloudera).
Open the terminal using the icon on the top menu bar.. Now we need to check a couple of environment variables before jumping into real coding.
echo $JAVA_HOME - this is where your Java installation lives..
echo $HADOOP_HOME - this points to the jar and lib files used by Hadoop
If either of these prints nothing, set them:
export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop
note: these paths might be different if you are using a different image or a different installation of Hadoop.
Once the parameters are set you can check whether Hadoop is working by just typing hadoop at the prompt.. if it prints the hadoop help then you are all set..
$hadoop
Start your own Hadoop cluster
The best and easiest option available is to use Cloudera's VMware image along with VMware Player. For beginners: VMware creates another virtual machine on your desktop. Say you have a Windows machine; you can install VMware Player on it and run a UNIX box inside it (pretty cool!!). VMware Player is freeware from www.vmware.com (precisely: https://www.vmware.com/tryvmware/index.php)
Next, regarding the image from Cloudera: anyone can configure a system through VMware, save it as a file and distribute it, and anyone else can use the same configuration on their machine just by copying the image file to their system. You can download the VMware image for Cloudera Hadoop from https://ccp.cloudera.com/display/SUPPORT/Downloads
So that's it: install VMware Player, download the Cloudera image and just double-click the *.vmx file in the downloaded directory.. a fully configured Hadoop (UNIX) machine is available for you. At the time of this writing the username and password for the image are "cloudera" and "cloudera"; check with their site while downloading.