You can resolve this issue by setting the number of reducers to 1 with the "-D mapred.reduce.tasks=1" parameter when running the job. In the case of Pig, you can run
set default_parallel 1;
in the grunt shell.
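For example, a complete invocation might look like the following (the jar name, class name, and paths here are just placeholders for your own job; the -D option is picked up by GenericOptionsParser, so the job driver needs to run through ToolRunner for it to take effect):

hadoop jar myjob.jar com.example.MyJob -D mapred.reduce.tasks=1 /input/path /output/path

Equivalently, you can hard-code it in the driver with job.setNumReduceTasks(1). On Hadoop 2.x and later the same property is named mapreduce.job.reduces.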
With this technique you get single-file output at the cost of performance, since you are using one reducer instead of hundreds.
The other option is to let the JobTracker decide the number of reducers; then you have to deal with the numerous 0-byte files.
I found this shell command handy for cleaning up those files:
hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }' | xargs hadoop fs -rm
assuming that all the output files begin with "part-".
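If you want to verify what would be deleted before running the full pipeline, drop the final xargs stage and just print the zero-length candidates first:

hadoop fs -lsr /apps/bhraja/metric_m005/ | grep part- | awk '{ if ($5 == 0) print $8 }'

In the -lsr listing, column 5 is the file size and column 8 is the full path, which is exactly what the awk filter keys on.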