Hadoop Streaming

Hadoop Streaming is a generic API that allows Mappers and Reducers to be written in any language. The basic concept remains the same: Mappers and Reducers read their input from stdin and write their output to stdout as (key, value) pairs.
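As a rough illustration of the stdin/stdout contract described above, here is a minimal word-count sketch in Python. It assumes the default Hadoop Streaming convention that each record is one line, with the key and value separated by a tab, and that the framework sorts the mapper output by key before the reducer sees it. The function names, the script name wc.py, and the "map"/"reduce" command-line argument are hypothetical choices made for this example, not part of Hadoop.

```python
#!/usr/bin/env python3
"""Word-count mapper and reducer in the Hadoop Streaming style (illustrative sketch)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Sum the counts for each key. The input must already be sorted by key."""
    # Skip blank lines, then split each record into (key, value) at the first tab.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention for this sketch: the first argument picks the role.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Because both roles only read stdin and write stdout, the job can be tried locally without a cluster, for example with `cat input.txt | python3 wc.py map | sort | python3 wc.py reduce`, where `sort` stands in for the shuffle step that Hadoop performs between the two phases.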

Hadoop Installation

You can set up Hadoop on your Windows machine in two ways.

  1. Download VirtualBox or VMware Player, install Linux, and then install Hadoop. A very good tutorial is available here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
  2. Virtualized apps: Download VirtualBox or VMware Player, then install the Cloudera QuickStart VM, the Hortonworks Sandbox, or the MapR Sandbox. This approach minimizes the time spent installing and configuring Hadoop, then Pig, Hive, and so on. These VMs contain a single-node Apache Hadoop cluster and Eclipse for Java, complete with example data, queries, and scripts. You can download them from the vendors' websites.
    https://www.youtube.com/watch?v=oNQ8f2My5Hs (for installing Cloudera QuickStart VM)


Apache Spark Vs Hadoop

Introduction to Apache Spark

Apache Spark is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. It provides in-memory computation, which makes data processing faster than with MapReduce. It runs on top of an existing Hadoop cluster and accesses the Hadoop data store (HDFS); it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
Spark Architecture

Is Apache Spark going to replace Hadoop?


By Sriramjithendra Posted in Big Data

Learning Hadoop – Resources


  1. Free videos – MapR Academia
  2. Udacity course
  3. Hortonworks Sandbox
  4. Hadoop Ecosystem
  5. Running Hadoop Map-Reduce
  6. Hadoop Screencasts
  7. Reza Shiftehfar’s blog I
  8. Reza Shiftehfar’s blog II
  9. Reza Shiftehfar’s blog III
  10. Reza Shiftehfar’s blog IV
  11. Reza Shiftehfar’s blog V
  12. Reza Shiftehfar’s blog VI
  13. Reza Shiftehfar’s blog VII
  14. Deploying Storm on Hadoop for Advertising Analysis
  15. Hadoop classes by Cloudera
  16. EMC classes: Big Data, Analytics, Data Science
  17. Simulated Hadoop

Hadoop on Windows: HDInsight – Getting Started


Hadoop has been all the rage for the last year or so, and anyone who does not know that Microsoft is very serious about Hadoop has clearly not been paying attention. HDInsight is what Microsoft is calling its suite of 100% Apache Hadoop compatible software. Microsoft refers to it as part of its “end-to-end roadmap for Big Data”, and they’re not kidding: it’s integral.

A few things may jump out from this as odd or funny. One would be ‘what is Microsoft doing in the open source world?’. If this is a surprise to you, then you really have been living under a rock. Microsoft is working very closely with Hortonworks and contributing heavily to Hadoop. It has also been contributing heavily to the Linux kernel since 2009.

Like them or not, you have to give Microsoft credit for making working with technology easier. Their work with Hadoop has been much the…

