But when you think at actual scale, indexing is going to run on thousands of files. The problem is not the count, since Hadoop can handle any number of files given a big enough cluster. The real problem is the size of the files: Hadoop is inefficient with small files, because it creates a split for each file that is smaller than the specified split size.
The best option is to use a sequence file, where the key is the filename and the value is the content of the file. I found this code http://stuartsierra.com/2008/04/24/a-million-little-files which converts a tar file into a sequence file. The source file should be a bz2 tar; the syntax to create one is:
tar cvfj countries.tar.bz2 *.txt
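If you would rather skip the tar step, a sequence file with the same layout can also be written directly. The sketch below is my own illustration (not the tar-to-seq code from the link above); the local input directory "countries" and the output path "countries.seq" are assumptions.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class FilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // hypothetical output path for the sequence file
    Path out = new Path("countries.seq");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // hypothetical local directory holding the small text files
      for (File f : new File("countries").listFiles()) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        // key = filename, value = raw file content
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}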
When you read this sequence file in MapReduce you need to use SequenceFileInputFormat:
job.setInputFormat(SequenceFileInputFormat.class);
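For context, the surrounding old-API job setup might look something like this; apart from the setInputFormat line above, the class names and paths are placeholders of mine, not code from this post.

JobConf job = new JobConf(IndexDriver.class);     // hypothetical driver class
job.setJobName("index-from-sequencefile");
job.setInputFormat(SequenceFileInputFormat.class); // keys: Text, values: BytesWritable
job.setMapperClass(MapClass.class);
job.setReducerClass(ReduceClass.class);            // hypothetical reducer doing the word count
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path("countries.seq"));
FileOutputFormat.setOutputPath(job, new Path("index-out"));
JobClient.runJob(job);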
The other thing to note is that the key (the filename) will be a Text and the value (the file content) will be a BytesWritable.
public static class MapClass extends MapReduceBase implements Mapper<Text, BytesWritable, Text, IntWritable>
public void map(Text key, BytesWritable value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
In my case I need to convert the BytesWritable to a String so that I can do some string manipulation and assign each word to a key (so that I can do a word count at the reducer in a later stage):
String line = new String(value.getBytes(), 0, value.getLength()); // getBytes() can return padded bytes, so only read up to getLength()
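From there, the per-word emission could look like the lines below; the tokenizer and the (word, 1) output are my own sketch of the word count step, not code from the original post.

// hypothetical continuation inside map(): emit (word, 1) for each token
// so the reducer can sum the counts per word
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
    output.collect(new Text(itr.nextToken()), new IntWritable(1));
}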
This code skeleton is efficient enough that I'm able to squeeze GBs of files into a tar, then into a sequence file, and then index it.