Tuesday, July 26, 2011

Java Package for string manipulation

Most applications developed on Hadoop involve string manipulation: machine learning, crawling, indexing, and matching algorithms all depend on it. As usual with Hadoop, the data is going to be unstructured, messy, and will not follow any rules.

So we need extensive and effective string manipulation to strip, clean, and filter the string values. I found that the following class has handy methods for most of the operations needed:

http://ws.apache.org/axis/java/apiDocs/org/apache/axis/utils/StringUtils.html

For example, it provides stripStart and stripEnd.
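To make the idea concrete, here is a minimal sketch of what this kind of stripping does, written against the plain JDK so it runs without axis.jar. The class name StripSketch and the clean helper are made up for illustration, and the two strip methods only imitate the behaviour of the Axis methods as I understand it (a null set of strip characters means "strip whitespace"); check the API docs linked above before relying on the details.

```java
// Hypothetical stand-in for the Axis helpers, using only the JDK.
final class StripSketch {

    // Remove leading characters found in stripChars (null = whitespace).
    static String stripStart(String s, String stripChars) {
        if (s == null) {
            return null;
        }
        int start = 0;
        while (start < s.length() && matches(s.charAt(start), stripChars)) {
            start++;
        }
        return s.substring(start);
    }

    // Remove trailing characters found in stripChars (null = whitespace).
    static String stripEnd(String s, String stripChars) {
        if (s == null) {
            return null;
        }
        int end = s.length();
        while (end > 0 && matches(s.charAt(end - 1), stripChars)) {
            end--;
        }
        return s.substring(0, end);
    }

    private static boolean matches(char c, String stripChars) {
        return stripChars == null
                ? Character.isWhitespace(c)
                : stripChars.indexOf(c) >= 0;
    }

    // Illustrative cleanup for one messy field: drop surrounding
    // blanks and quotes, plus trailing separators.
    static String clean(String raw) {
        return stripEnd(stripStart(raw, " \t\"'"), " \t\"',;");
    }

    public static void main(String[] args) {
        System.out.println(clean("  \"United States\",; "));
    }
}
```

In a mapper you would call something like clean on each field before emitting it, so that variants of the same value end up under the same key.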

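When you use this package in your MapReduce program, its classes must be available at run time on every node that runs a task. You have two options:
  1. Install the jar in the lib directory of every node in the cluster (not feasible in most cases).
  2. Ship the jar with the job so that it reaches the nodes where your data is processed.
For the second option, pass -libjars when you launch the job, and put the jar on the classpath when you compile. Note that -libjars is a generic Hadoop option, so it only takes effect if the driver parses its arguments through GenericOptionsParser (for example by running through ToolRunner).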

javac -classpath /apache/hadoop/hadoop-core-0.20.security-wilma-14.jar:/home/invidx/axis.jar wc.java

hadoop jar wc.jar wc -libjars /home/invidx/axis.jar /apps/traffic/learn/countries.seq /apps/traffic/outp/
