Monday, July 25, 2011

Deleting 0 byte files in HDFS

After running Map-reduce jobs on large clusters you might have noticed that numerous output files are generated. This is because Map-reduce creates one output file for each reducer.

You can resolve this issue by setting the number of reducers to 1 using the "-D mapred.reduce.tasks=1" parameter while running the job. In the case of Pig you can run

set default_parallel 1;

in grunt.
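
For example, assuming a job whose driver class goes through ToolRunner (the jar name, class name and paths below are only placeholders), the option is passed on the command line like this:

hadoop jar my-analytics.jar com.example.MetricJob -D mapred.reduce.tasks=1 /user/bhraja/input /user/bhraja/output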

With this technique you achieve single-file output at the cost of performance, since we are utilizing one reducer instead of hundreds.

The other option is to let the JobTracker decide the number of reducers; then we need to deal with the numerous 0 byte files. I found this shell command handy to clean those files:

hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-".
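
The awk step relies on the column layout of hadoop fs -lsr output, where the fifth field is the file size in bytes and the eighth field is the full path. An illustrative listing line (the timestamp and part number here are made up) looks like:

-rw-r--r--   3 bhraja hdfs          0 2011-07-25 10:15 /apps/bhraja/metric_m005/part-00042

Only entries whose size field is 0 are piped to hadoop fs -rm.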


