Introduction to Apache Spark
Spark is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. It provides in-memory computation, which increases the speed of data processing over MapReduce. It runs on top of an existing Hadoop cluster and can access the Hadoop data store (HDFS); it can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
Is Apache Spark going to replace Hadoop?
Hadoop is a parallel data processing framework that has traditionally been used to run map/reduce jobs. These are long-running jobs that take minutes or hours to complete. Spark is designed to run on top of Hadoop, and it is an alternative to the traditional batch map/reduce model that can be used for real-time stream processing and fast interactive queries that finish within seconds. So, Hadoop supports both traditional map/reduce and Spark.
We should look at Hadoop as a general-purpose framework that supports multiple models, and we should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop.
Hadoop MapReduce vs. Spark – Which One to Choose?
Spark uses more RAM in place of network and disk I/O, so it is relatively fast compared to Hadoop. But because it uses a lot of RAM, it needs dedicated high-end physical machines to produce effective results.
It all depends on the workload, and the variables on which this decision rests keep changing over time.
Difference between Hadoop Mapreduce and Apache Spark
Spark stores data in memory, whereas Hadoop stores data on disk. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, the resilient distributed dataset (RDD), which guarantees fault tolerance in a clever way that minimizes network I/O. For details, see UC Berkeley's paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.
From the Spark academic paper: “RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition.” This removes the need for replication to achieve fault tolerance.
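The lineage idea can be illustrated with a conceptual sketch in plain Python (this is not Spark's implementation; the class and method names are invented for illustration). The object records where its data came from and how it was derived, so a lost partition can be recomputed rather than restored from a replica:

```python
# Conceptual sketch (plain Python, NOT Spark's implementation): an RDD-like
# object that records its lineage -- the parent dataset and the transformation
# -- so a lost partition can be rebuilt instead of replicated.
class SketchRDD:
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions   # list of lists, one per partition
        self.parent = parent           # lineage: where the data came from
        self.transform = transform     # lineage: how the data was derived

    def map(self, fn):
        new_parts = [[fn(x) for x in part] for part in self.partitions]
        return SketchRDD(new_parts, parent=self, transform=fn)

    def recover_partition(self, i):
        # Rebuild just partition i from the parent's data and the recorded
        # transformation -- no replica needed.
        return [self.transform(x) for x in self.parent.partitions[i]]

base = SketchRDD([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
squared.partitions[1] = None                          # simulate a lost partition
squared.partitions[1] = squared.recover_partition(1)  # rebuild it from lineage
print(squared.partitions)                             # [[1, 4], [9, 16]]
```

Only the lost partition is recomputed; the surviving partition is untouched, which is what keeps recovery cheap compared to re-replicating whole datasets over the network.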
Do I need to learn Hadoop first to learn Apache Spark?
No, you don’t need to learn Hadoop first to learn Spark. Spark started as an independent project. But after YARN and Hadoop 2.0, Spark became popular because it can run on top of HDFS alongside other Hadoop components. Spark has become another data processing engine in the Hadoop ecosystem, which is good for businesses and the community alike, as it adds capability to the Hadoop stack.
For developers, there is almost no overlap between the two. Hadoop is a framework in which you write MapReduce jobs by inheriting from Java classes. Spark is a library that enables parallel computation via function calls.
For operators running a cluster, there is an overlap in general skills, such as monitoring, configuration, and code deployment.
Apache Spark’s features
Let’s go through some of the features that make Spark stand out in the Big Data world!
i) Speed:
Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and up to 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk: it stores intermediate processing data in memory. It uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed. This eliminates most of the disk reads and writes – the main time-consuming factors – of data processing.
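The benefit of keeping intermediate data in memory can be shown with a small sketch in plain Python (not Spark's API; the function and counter names are invented for illustration). Without caching, the expensive transformation is recomputed on every use, like MapReduce re-reading intermediate results from disk; with caching, it runs once:

```python
# Conceptual sketch (plain Python, NOT Spark's API): why in-memory caching of
# intermediate results pays off.
compute_calls = {"count": 0}

def expensive_transform(data):
    compute_calls["count"] += 1          # stands in for a disk read + parse
    return [x * 2 for x in data]

source = [1, 2, 3]

# Uncached: every use recomputes the transformation, like MapReduce
# writing intermediate results to disk and re-reading them each pass.
for _ in range(3):
    result = expensive_transform(source)
calls_without_cache = compute_calls["count"]   # 3 recomputations

# Cached: compute once and reuse the in-memory result, which is roughly
# what persisting an RDD in memory buys you.
compute_calls["count"] = 0
cached = expensive_transform(source)
for _ in range(3):
    result = cached
calls_with_cache = compute_calls["count"]      # 1 computation
```

Three uses cost three recomputations without the cache and only one with it; Spark applies the same idea across a cluster, spilling to disk only when memory runs out.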
ii) Ease of Use:
Spark lets you quickly write applications in Java, Scala, or Python. This lets developers create and run their applications in a programming language they already know, and makes it easy to build parallel apps. It comes with a built-in set of over 80 high-level operators. We can also use it interactively to query data from the shell.
Word count in Spark’s Python API
# Run in the pyspark shell, where the SparkContext is predefined as `sc`
datafile = sc.textFile("hdfs://...")
counts = (datafile.flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda x, y: x + y))
iii) Combines SQL, streaming, and complex analytics.
In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
iv) Runs Everywhere
Spark runs on Hadoop, on Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
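For example, the same application can be pointed at different cluster managers just by changing the --master flag of spark-submit (illustrative invocations; app.py, host names, and ports are placeholders):

```shell
spark-submit --master local[4] app.py           # run locally with 4 worker threads
spark-submit --master spark://host:7077 app.py  # standalone Spark cluster
spark-submit --master mesos://host:5050 app.py  # Apache Mesos cluster
spark-submit --master yarn app.py               # Hadoop YARN cluster
```

The application code itself does not change between these modes; only the master URL does.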
Spark’s major use cases over Hadoop
- Iterative Algorithms in Machine Learning
- Interactive Data Mining and Data Processing
- Data warehousing: a fully Apache Hive-compatible warehousing layer on Spark can run queries up to 100x faster than Hive.
- Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis
- Sensor data processing: where data is fetched and joined from multiple sources, in-memory datasets are really helpful, as they are easy and fast to process.
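The iterative machine-learning case above is the canonical one: the same dataset is scanned on every iteration, so holding it in memory avoids re-reading it from disk each pass. A minimal sketch in plain Python (not Spark's API; the data points, learning rate, and iteration count are illustrative):

```python
# Minimal sketch of an iterative algorithm: 1-D linear regression fit by
# gradient descent. Every iteration re-scans the same points -- with Spark,
# the dataset would be a cached RDD so each pass reads from memory, not disk.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (x, y) pairs lying on y = 2x
w = 0.0                                          # model parameter to learn
learning_rate = 0.05

for _ in range(200):
    # Gradient of the mean squared error (w*x - y)^2, averaged over the data.
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    w -= learning_rate * grad

print(round(w, 3))   # converges toward 2.0
```

In MapReduce, each of those 200 passes would be a separate job reading its input from disk; keeping the points in memory is what makes Spark attractive for this workload.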
Note: Spark is still working out bugs as it matures.
Go Get Started
It is very easy to get started writing powerful Big Data applications with Spark. Your existing Hadoop and/or programming skills will have you productively interacting with your data in minutes. Get started today:
Quick Start: http://spark.incubator.apache.org/docs/latest/quick-start.html
Spark Summit 2013 (Dec. 2, 2013): http://spark-summit.org
Amazon Web Services documentation: https://aws.amazon.com/articles/4926593393724923