Wednesday, July 13, 2011

Starting with very simple Hadoop streaming

Hadoop is written in Java, but nothing stops you from trying some map-reduce or exploring the data available in HDFS even if you have no idea about Java.
If you know very basic Python you can write a basic map-reduce job. In this case we are going to use just shell commands (cat and wc) to get the word count for a given file.

Hadoop provides streaming, which acts like Unix pipes: we chain programs left to right, each one reading the stdout of the previous one. We can specify what the mapper should do and what the reducer should do, and of course the input file path and the output directory.
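The pipe analogy can be tried on a local machine without a cluster. This is only a rough sketch: the sample file and its contents are made up for illustration, and sort stands in for the shuffle step that Hadoop performs between the mapper and the reducer.

```shell
# Hypothetical sample input; the real job reads its input from HDFS.
workdir=$(mktemp -d)
printf 'hello hadoop\nhello streaming\nbye\n' > "$workdir/notes.txt"

# mapper (cat) | shuffle stand-in (sort) | reducer (wc)
cat "$workdir/notes.txt" | sort | wc
```

For this sample, wc reports 3 lines, 5 words and 33 bytes. The column spacing of the wc output differs between platforms.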

To read about Hadoop streaming in detail, go to the following link:
http://hadoop.apache.org/common/docs/current/streaming.html

Problem statement:
Get the word count of a given file using map-reduce

Map-reduce design:
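Mapper : just pass the file through with cat, so each line is emitted unchanged (streaming treats the text up to the first tab of a line as the key)
Reducer : count the lines, words and bytes it receives using wc (this gives totals for the file, not the number of occurrences of each word)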
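Note that wc on its own reports totals for its whole input. If a count per distinct word is wanted instead, one possible variant (not the command used in this post) is a mapper that puts one word on each line and a reducer that counts adjacent duplicates. The sketch below tries the idea locally; the sample file is made up, and sort again stands in for Hadoop's shuffle.

```shell
# Hypothetical sample input, used only for this local experiment.
workdir=$(mktemp -d)
printf 'hello hadoop\nhello streaming\nbye\n' > "$workdir/notes.txt"

# mapper: tr puts each word on its own line
# sort:   groups identical words together, as the shuffle would
# reducer: uniq -c prints each distinct word with its count
tr -s ' ' '\n' < "$workdir/notes.txt" | sort | uniq -c
```

Here "hello" is counted twice and every other word once.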

input HDFS file : /user/training/notes
output HDFS directory : /user/training/output/
Note : refer to the basic HDFS shell commands in a different post

command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar -input /user/training/notes -output /user/training/output/ -mapper /bin/cat -reducer wc

This command uses the streaming framework that Hadoop ships in $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar. Check the name of this file before running, because each version has a different file name.
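One way to check which file the wildcard matches is to list it first. This is a small sketch that assumes the contrib/streaming layout used in the command above; find_streaming_jar is a hypothetical helper name, not something Hadoop provides.

```shell
# Hypothetical helper: print the streaming jar(s) matched by the wildcard.
# $1 is the Hadoop install directory; it defaults to $HADOOP_HOME.
find_streaming_jar() {
  ls "${1:-$HADOOP_HOME}"/contrib/streaming/hadoop-streaming*.jar 2>/dev/null
}
```

Calling find_streaming_jar with no argument looks under $HADOOP_HOME. If it prints nothing, the jar is somewhere else in your version and the path in the command needs adjusting.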

Output:
You can verify the output from the files created in the output directory:

hadoop fs -lsr /user/training/output

The output from the reducer is always written to a file named part-*.
To view the contents of the file:

hadoop fs -cat /user/training/output/part-00000

Tracking the execution:
While the job is running you might notice the following:

1. the percentage of mappers and reducers completed
2. the link to the jobtracker (a web UI to track the job and view logs). The jobtracker web UI runs on port 50030,
in our case
http://localhost.localdomain:50030

tips:
If you run the program a second time with the same output directory, you must remove the output directory first, because the job fails if it already exists:

hadoop fs -rmr /user/training/output
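The removal can also be made conditional, so that it only runs when the directory is really there. The sketch below assumes that your Hadoop version supports the `fs -test -d` check; clean_output is a hypothetical helper name.

```shell
# Hypothetical helper: remove an HDFS output directory only if it exists.
clean_output() {
  out_dir="$1"
  # "fs -test -d" exits with status 0 when the path is an existing directory
  if hadoop fs -test -d "$out_dir"; then
    hadoop fs -rmr "$out_dir"
  fi
}
```

Running clean_output /user/training/output before the streaming command makes the job safe to repeat.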
