Hadoop Streaming

Hadoop Streaming is a generic API that allows Mappers and Reducers to be written in any language. The basic concept remains the same as in a Java job: Mappers and Reducers read their input from stdin and write their output to stdout as (key, value) pairs.
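As a rough illustration of the stdin/stdout contract, here is a minimal word-count sketch in Python. It assumes the default streaming convention of one record per line with key and value separated by a tab, and that the reducer receives its lines sorted by key. The script name wordcount.py, the function names and the "map"/"reduce" argument are made up for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts per key; assumes the lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Hypothetical entry point: "wordcount.py map" or "wordcount.py reduce".
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

A script like this would typically be handed to the streaming jar through its -mapper and -reducer options. It can also be tried locally, without Hadoop, by piping a text file through the map step, then sort, then the reduce step.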

Hadoop Streaming is advantageous when the developer has little Java experience and can write the Mapper/Reducer faster in a scripting language.

Compared to custom jar jobs, a streaming job has the additional overhead of starting a scripting-language (Python/Ruby/Perl) interpreter for each task. Every record must also pass between the Java task and that external process, and this inter-process communication reduces the efficiency of the job in most cases.

Hadoop Streaming also brings restrictions on the input/output formats. When you need custom input/output formats, custom jars are the natural choice. Java also lets you override or extend much of Hadoop's functionality to suit your needs.
