Friday, July 22, 2011

Hadoop Small files problem

I'm writing an inverted indexing program in Hadoop. It is easy to write a simple program that indexes a single file: nothing more than a wordcount program plus some string manipulation.

But when you think at actual scale, indexing is going to run on thousands of files. The problem is not the number of files, since Hadoop can handle any number of files as long as your cluster is big enough. The real problem is the size of the files: Hadoop is inefficient with small files, because it creates a separate input split (and therefore a separate map task) for every file smaller than the specified split size.

The best option is to use a SequenceFile, where the key is the filename and the value is the content of the file. I found this code, http://stuartsierra.com/2008/04/24/a-million-little-files, which converts a tar file into a SequenceFile. The source file should be a bzip2-compressed tar, which you can create with:

tar -cjvf countries.tar.bz2 *.txt
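
That post also ships a small tar-to-seq converter that does the actual tar-to-SequenceFile step. If I recall its usage correctly, it is run roughly like this (the jar name and argument order are from memory, so check the linked post before relying on them):

java -jar tar-to-seq.jar countries.tar.bz2 countries.seq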

When you read this SequenceFile in MapReduce you need to use SequenceFileInputFormat:

job.setInputFormat(SequenceFileInputFormat.class);
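
For context, here is roughly how I wire up the whole job with the old mapred API. The class names InvertedIndex and ReduceClass, the Text output types, and the command-line argument handling are placeholders from my own setup, not anything mandated by Hadoop:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class InvertedIndex {
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(InvertedIndex.class);
    job.setJobName("inverted-index");

    // Read the SequenceFile built from the tar:
    // key = file name (Text), value = file contents (BytesWritable)
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(MapClass.class);      // mapper shown below
    job.setReducerClass(ReduceClass.class);  // placeholder reducer class

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));   // e.g. countries.seq
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory

    JobClient.runJob(job);
  }
}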

The other thing to note is that the key (the filename) will be a Text and the value (the file content) a BytesWritable.

public static class MapClass extends MapReduceBase implements Mapper<Text, BytesWritable, Text, Text>

public void map(Text key, BytesWritable value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException

In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and assign each word to a key (so that I can do a word count at the reducer in a later stage):

String line = new String(value.getBytes(), 0, value.getLength()); // only the first getLength() bytes are valid file content
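
Putting the pieces together, the mapper ends up looking roughly like this. It nests inside the driver class above, and emitting (word, filename) pairs as Text/Text is my own choice of intermediate output, not something fixed by the SequenceFile input:

// Needs org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.Text,
// org.apache.hadoop.mapred.* and java.util.StringTokenizer imported in the driver file.
public static class MapClass extends MapReduceBase
    implements Mapper<Text, BytesWritable, Text, Text> {

  public void map(Text key, BytesWritable value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Only the first getLength() bytes of the backing array hold file content
    String line = new String(value.getBytes(), 0, value.getLength());
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // word -> the file it came from
      output.collect(new Text(tokenizer.nextToken()), key);
    }
  }
}

Emitting the filename as the value is what lets the reducer group together all the files that contain a given word and count occurrences per file.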

This code skeleton is efficient enough that I'm able to squeeze GBs of files into a tar, then into a SequenceFile, and then index it.
