Friday, December 23, 2011
SVN notes
http://www.abbeyworkshop.com/howto/misc/svn01/
One command which didn't work for me.. when I tried to commit for the "add a file to svn repository" step:
svn commit -m "Saving recent changes" http://localhost/svn_dir/repository/project_dir
I used the file name / folder instead of the URL:
svn commit -m "log for committing" filename.sql
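For a brand-new file the svn add has to happen before the commit. Roughly, the sequence looks like this (the file name here is just a placeholder):
svn add filename.sql
svn commit -m "adding filename.sql" filename.sql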
MySQL useful notes
mysqldump -u
A terrible issue I faced with the above command.. the AUTO_INCREMENT and other table options were not showing up in the dump.. I'm not going to use --skip-opt..
mysqldump -u
To get only the data dump, as INSERT statements, for a table:
mysqldump -u
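Roughly, the full commands look something like this (user, database and table names are placeholders): leaving the default options in place keeps AUTO_INCREMENT and the other table options in the dump, and --no-create-info limits a table's dump to the INSERT statements.
mysqldump -u myuser -p mydb > mydb_dump.sql
mysqldump -u myuser -p --no-create-info mydb mytable > mytable_data.sql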
Performance pointers
Beware of running ANALYZE while there are long-running queries on the table.
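A quick way to check first, roughly (user, database and table names are placeholders):
# look for long-running queries touching the table
mysql -u myuser -p -e "SHOW FULL PROCESSLIST;"
# run the analyze only when the table is quiet
mysql -u myuser -p -e "ANALYZE TABLE mydb.mytable;"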
Wednesday, December 21, 2011
Extract / Get DDL of Hive tables
Googled and came across this jira https://issues.apache.org/jira/browse/HIVE-967
Step by step
1. Download the HiveShowCreateTable.jar from the Jira
2. Copy it to the server which has Hive (even your local directory is fine)
3. Run as follows (the syntax in the txt file on the Jira doesn't work because it references a non-existent jar file):
hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=poc_detail 2>/dev/null
ta..da.. we get the DDL of the provided table :-)
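To dump the DDL for every table in a database, the same jar can be driven from a small shell loop.. something like this (a sketch, assuming the default database; adjust db.name as needed):
hive -S -e 'use default; show tables;' | while read t; do
  hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable \
    -D db.name=default -D table.name="$t" 2>/dev/null
done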
Thanks to Ed Capriolo who created this code..
Wednesday, December 7, 2011
Outliers by Malcolm Gladwell - book journal
Thursday, September 8, 2011
Article on how Wal-mart uses Hadoop
Thursday, August 11, 2011
Book review - Hadoop in action
Since the aura around Hadoop says it's high-tech and complex, we expect the book to be a tome. At first this book didn't give a good impression because of its size.. But what caught me is the text "I won't focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice." in the first chapter. And the book lives up to this promise.
I'm sure even if you are a newbie, as long as you have good programming knowledge in any language, you can come out writing some useful map-reduce programs. The book is comparatively small, so you can read through it and do some practice programs within a week.
Most of the example programs are written in Java, with some introduction to Python and streaming programs. After reading this book I'm inclined to code in Java, but currently my job demands Python (which is so cool!).
If you are a newbie to Hadoop I would strongly recommend this, but if you want to master Hadoop and are looking for reference material, this is not for you..
Monday, August 8, 2011
UDF in Pig - calculate hash code for a column
For example, I want to calculate the hash code of a particular column in a file and join it with the hash code of another column in a different file. There is no direct hash function in Pig to do that, so we have to go for a UDF.
First create the function in Java / Python.. in my case it's Python.
sha2.py
--------------------------------------------------
#!/usr/bin/python
# reads titles from stdin, normalizes each one and prints its SHA-1 hex digest
import re
from sys import stdin
from hashlib import sha1

for title in stdin:
    # keep only lowercase letters, digits and spaces, then collapse repeated spaces
    title = re.sub('[^a-z0-9 ]', ' ', title.lower())
    title = re.sub(' +', ' ', title)
    tokens = title.split(' ')
    tokens.sort()
    stitle = ' '.join(tokens)
    print sha1(stitle).hexdigest()
--------------------------------------------------------
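A quick local test of the script before wiring it into Pig (the sample title is made up):
chmod +x sha2.py
echo "Some Product Title 42" | ./sha2.py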
We can use this Python script in Pig by declaring it with the DEFINE command:
data = LOAD '/sys/edw/data';
item_title = FOREACH data GENERATE $1, $2;
DEFINE Cmd `sha2.py` SHIP('/export/home/braja/work/sha2.py');
bfore_hashes = FOREACH item_title GENERATE $1;
hashes = STREAM bfore_hashes THROUGH Cmd;
DUMP hashes;
Tuesday, July 26, 2011
Get anyone's information
Sounds interesting; it might be very useful for user segmentation. I used to think this kind of information is mined and preserved by big corporations who can afford to store and effectively use it. Rapleaf brings that to everybody, specifically to small and medium businesses, so they can also effectively mine and use their data.
Alright, I checked my data -- mostly relevant, and some of my friends' too. Some are relevant, some are not; that's OK, soon Rapleaf will improve their algorithm. But the information is enough to achieve some segmentation. There are many thoughts going through my mind on how to use this information, maybe join it with the census data (available on Amazon EC2).. I dunno, but this is kept in some corner of my memory space..
OK, now security. I mentioned this to my friend and he is not happy about it, and I'm a little skeptical too.. a little more thought and a criminal intention could be dangerous in this regard.. I don't know if I'm being paranoid, or since this data is available to everybody, why worry about it..
All this apart, Rapleaf gave me the power to mine valuable data...
Java Package for string manipulation
So we need to do extensive and effective string manipulation to strip, clean and filter string values. I found the following package has many handy features for most of the needed actions..
http://ws.apache.org/axis/java/apiDocs/org/apache/axis/utils/StringUtils.html
like stripStart, stripEnd
When you use this package in your mapreduce program, it will look for the package at run-time. You have two options:
- Include the package at lib directory of all the nodes available (not feasible in most cases)
- Pass it to the respective nodes where your data is.
javac -classpath /apache/hadoop/hadoop-core-0.20.security-wilma-14.jar:/home/invidx/axis.jar wc.java
hadoop jar wc.jar wc -libjars /home/invidx/axis.jar /apps/traffic/learn/countries.seq /apps/traffic/outp/
Monday, July 25, 2011
Deleting 0 byte files in HDFS
You can resolve this issue by setting the number of reducers to 1 using the "-D mapred.reduce.tasks=1" parameter while running the job. In the case of Pig you can set
set default_parallel 1;
in grunt.
With this technique you can achieve the single-file output at the sacrifice of performance, since we are utilizing one reducer instead of hundreds. The other option is to let the jobtracker decide the number of reducers; then we need to deal with the numerous 0 byte files..
I found this shell command handy to clean those files
hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-"
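Before wiring in the delete, it is worth running the same pipeline without the final xargs, just to eyeball the list of files that would be removed:
hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }'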
Friday, July 22, 2011
Hadoop Small files problem
But when you think at actual scale, indexing is going to run on thousands of files. The problem is not with the count, since Hadoop can handle any number of files given your cluster is big enough. The real problem is with the size of the files: Hadoop will be inefficient on smaller files; it will create a split for each file smaller than the specified split size.
The best option is to use a sequence file, where the key is the filename and the value is the content of the file. I found this code http://stuartsierra.com/2008/04/24/a-million-little-files which converts a tar file to a sequence file. The source file should be a bzip2-compressed tar; the syntax to create one:
tar cvfj countries.tar.bz2 *.txt
When you read this sequence file in MapReduce you need to use SequenceFileInputFormat:
job.setInputFormat(SequenceFileInputFormat.class);
The other thing to note is that the key (the filename) will be Text and the value (the file content) is BytesWritable:
public static class MapClass extends MapReduceBase implements Mapper<Text, BytesWritable, Text, IntWritable>
public void map(Text key, BytesWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter)
In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and assign each word as a key (so that I can do a wordcount at the reducer in a later stage):
String line = new String(value.getBytes(), 0, value.getLength()); // use getLength() so buffer padding beyond the valid bytes is not included
This code skeleton is so efficient that I'm able to squeeze GBs of files into a tar, then into a sequence file, and then index it..
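To sanity-check a generated sequence file, hadoop fs -text can deserialize it and print the keys (the filenames) along with the raw value bytes; the path here is just an example:
hadoop fs -text /apps/traffic/learn/countries.seq | head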
Friday, July 15, 2011
Hadoop certified
Just got certified yesterday, after a failure 2 weeks before.. I misjudged the test at first and wasn't well prepared. The questions are quite diverse: Hadoop, DFS concepts, mapreduce, Java, a little administration, Pig, Hive, Flume etc.. but mostly on mapreduce and practical implementations.
The test certainly tests your practical knowledge of Hadoop; one who has good hands-on experience and is very good on the concepts can pass...
Wednesday, July 13, 2011
Common HDFS shell commands
ls
hadoop fs -ls /
hadoop fs -ls /user/
lsr
Usage: hadoop fs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
cat
Usage: hadoop fs -cat URI [URI …]
Copies source paths to stdout.
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4
put
Usage: hadoop fs -put <localsrc> ... <dst>
Copy single src, or multiple srcs, from the local file system to the destination file system.
get
Usage: hadoop fs -get <src> <localdst>
Copy files to the local file system.
cp
Usage: hadoop fs -cp URI [URI …]
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
Example:
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
mkdir
Usage: hadoop fs -mkdir <paths>
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
mv
Usage: hadoop fs -mv URI [URI …]
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted.
Example:
- hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
rm
Usage: hadoop fs -rm URI [URI …]
Delete files specified as args. Only deletes files and empty directories; refer to rmr for recursive deletes.
Example:
- hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
rmr
Usage: hadoop fs -rmr URI [URI …]
Recursive version of delete.
Example:
- hadoop fs -rmr /user/hadoop/dir
tail
Usage: hadoop fs -tail [-f] URI
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
- hadoop fs -tail pathname
Hadoop Video learning from cloudera
If you are fond of video learning like me, you might find these videos very useful..
Starting with very simple Hadoop streaming
If you know very basic Python it is possible to write basic map-reduce.. in this case we are gonna use just shell commands (cat and wc) to get the word count for the given file..
Hadoop provides streaming, which acts like Unix pipes: we stack programs left to right, each reading from the stdout of the previous one. We can specify what a mapper should do and what a reducer should do.. and of course the input file path and the output directory..
To read and understand Hadoop streaming in detail go to the following link..
http://hadoop.apache.org/common/docs/current/streaming.html
Problem statement:
To get the wordcount of the given file using map-reduce
Map-reduce design:
Mapper : just pass the file through using cat, so each line reaches the reducer
Reducer : count the lines, words and characters using wc
input HDFS file : /user/training/notes
output HDFS directory : /user/training/output/
Note: refer to the basic HDFS shell commands in a different post
command:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -input /user/training/notes -output /user/training/output/ -mapper /bin/cat -reducer wc
This command uses the streaming framework provided by Hadoop in $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar .. check the name of this file before running, because each version has a different file name.
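If you are not sure of the exact jar name on your box, a quick listing will show it:
ls $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar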
Output:
You can verify the output from the files created in the output directory:
hadoop fs -lsr /user/training/output
The output from the reducer is always written to files named part-*.
To view the contents of the file:
hadoop fs -cat /user/training/output/part-00000
Tracking the execution:
While executing the code you might notice the following:
1. The percentage of mappers and reducers completed
2. The link to the jobtracker (a web UI to track jobs and view logs).. the jobtracker runs on port 50030
in our case
http://localhost.localdomain:50030
Tips:
If you are running the program a second time with the same output directory, you are required to remove the output directory first:
hadoop fs -rmr /user/training/output
Configuring your Hadoop
Log in with the username and password (cloudera, cloudera).
Open the terminal using the icon on the top menu bar.. Now we need to check for environment variables before jumping into real coding.
echo $JAVA_HOME - this is where your java files are..
echo $HADOOP_HOME - for the jar and lib files used by Hadoop
If these emit nothing, set the parameters as follows:
export JAVA_HOME=/usr
export HADOOP_HOME=/usr/lib/hadoop
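If you want these settings to stick across terminal sessions (just a convenience, not something the image requires), append them to your shell profile:
echo 'export JAVA_HOME=/usr' >> ~/.bashrc
echo 'export HADOOP_HOME=/usr/lib/hadoop' >> ~/.bashrc
source ~/.bashrc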
Note: these paths might be different if you are using a different image or a different installation of Hadoop.
Once the parameters are set you can check whether Hadoop is working by just typing hadoop at the prompt.. if it throws the help for hadoop then you are all set..
$hadoop
Start your own Hadoop cluster
The best and easiest option available is to use Cloudera's VMware image along with VMware Player. For beginners: VMware creates another virtual machine on your desktop. Say you have a Windows machine; you can install VMware Player on it and run a UNIX box inside it (pretty cool!!). VMware Player is freeware from www.vmware.com (precisely: https://www.vmware.com/tryvmware/index.php)
Next, regarding the image from Cloudera: anyone can configure a system through VMware, save it as a file and distribute it, and anyone can use the same configuration on their machine just by copying the image file to their system. We can download the VMware image for Cloudera Hadoop from https://ccp.cloudera.com/display/SUPPORT/Downloads
So that's it: install VMware Player, download the Cloudera image and just double-click the *.vmx file in the downloaded directory.. a fully configured Hadoop (UNIX) machine is available for you. At the time of this writing the username and password for the image are "cloudera" and "cloudera"; check with their site while downloading.