Kick-start Spark

Spark is one of the most admired distributed processing frameworks. Since its inception, Spark has been designed to run in memory, which lets it process data far more quickly than alternatives such as Hadoop’s MapReduce, which writes data to disk and reads it back between stages of processing. Thanks to its in-memory distributed processing capabilities, it is now widely regarded as the successor to Hadoop MapReduce.

In 2014, Spark set a new record in the Daytona GraySort benchmark by sorting 100 TB of data. Here is a sort comparison of the two frameworks for the same volume of data –

Comparison of Spark and MapReduce

Apart from its lightning-fast runtime performance, Spark significantly reduces development time: the 50+ lines of MapReduce code needed for “Word Count” shrink to about 5 lines of Spark code. It provides APIs in Scala, Python and Java, and has therefore seen wide adoption across different developer communities.

Spark complements the batch nature of Hadoop with near-real-time processing capabilities. It integrates with most Hadoop ecosystem projects, such as HDFS and Hive. The same Spark Core APIs underpin machine learning, graph processing, stream processing, ad hoc querying, integration with R and more.

Spark Overview

This article shows how to get started with Spark quickly.

Setting up Spark

The steps below were executed on Mac OS X 10.10.2 –

  1. Install Spark
    1.1 Download Spark 1.4.1
    1.2 Untar source to ~/spark-1.4.1
    1.3 Change directory using cd ~/spark-1.4.1
    1.4 Install Scala Build Tool (SBT): brew install sbt
    1.5 Build Spark Code: sbt assembly
  2. Configure environment variables in the .bash_profile file
    2.1 export SPARK_HOME="$HOME/spark-1.4.1"
    2.2 export PATH=$PATH:$SPARK_HOME/bin

Spark Programming

Every Spark application consists of a driver program that runs the application’s main function and executes various operations in parallel on a cluster. Spark’s main abstraction is the Resilient Distributed Dataset (RDD), a distributed collection that can be operated on in parallel across the cluster and kept in memory. An RDD can be created from a file or directly from an existing Scala collection in the driver program. RDDs are also fault tolerant: if a node goes down, the data that was lost is recovered automatically.

As its very first step, every Spark program must create a SparkContext object, which tells Spark how to access a cluster. In the Spark shell, a SparkContext is created automatically and is available as the sc variable.
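Outside the shell, a standalone application has to build the SparkContext itself. The following is a minimal sketch, assuming Spark 1.x is on the classpath and a local master (local[2]) is acceptable; the object name ParallelizeApp, the helper run and the application name are hypothetical and chosen only for illustration. It also shows an RDD being created from an existing Scala collection with parallelize.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object; the name is illustrative only.
object ParallelizeApp {

  // Creates a SparkContext, counts the elements of a small RDD, then stops the context.
  def run(): Long = {
    // Assumption: run locally with 2 worker threads. On a real cluster the
    // master is usually supplied via spark-submit instead of being hard-coded.
    val conf = new SparkConf().setAppName("ParallelizeApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Create an RDD from an existing Scala collection in the driver program.
      val numbers = sc.parallelize(1 to 100)
      // count() is an action: it returns the number of elements to the driver.
      numbers.count()
    } finally {
      sc.stop() // release the context's resources
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Wrapping the work in try/finally is a design choice here so that the context is stopped even if the job fails.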

Spark’s interactive shell can be launched using “spark-shell”. A text file RDD can be created using the textFile method of SparkContext.

val lines = sc.textFile("/spark-1.4.1/README.md")

There are many other methods available for loading different file formats, including SequenceFile, JSON etc. The statement above only creates a reference to the file; it does not physically load the file into memory.

RDDs support two types of operations –

• Transformations – A transformation creates a new RDD from an existing RDD. The new RDD is not computed immediately; evaluation is deferred until the first action is called on it. An example is the map operation, which applies a function to each element of the input RDD.

val lineLengths = lines.map(line => line.length)

This defines the length of each line in the RDD, but nothing is computed at this point.

• Actions – An action runs a computation on the dataset and returns a value to the driver program, e.g. the reduce method, which aggregates all elements of the RDD using a function and returns the result to the driver.

val totalLength = lineLengths.reduce((a, b) => a + b)
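To see transformations and actions working together without depending on a file, here is a small sketch meant for spark-shell (where sc already exists). The sample words and the variable names are assumptions made for illustration; filter and collect are further examples of a transformation and an action, respectively.

```scala
// Run inside spark-shell, where `sc` is already defined.
// Hypothetical sample data created from a local Scala collection.
val words = sc.parallelize(Seq("spark", "is", "fast", "and", "simple"))

// Transformations: these only describe new RDDs, nothing is computed yet.
val longWords = words.filter(w => w.length > 3)
val lengths = longWords.map(w => w.length)

// Actions: each one triggers the computation and returns a value to the driver.
val total = lengths.reduce((a, b) => a + b) // 5 + 4 + 6 = 15
val kept = longWords.collect()              // Array(spark, fast, simple)

println(total)
println(kept.mkString(", "))
```

Because longWords is used by two actions, it is evaluated twice in this sketch; nothing is retained between actions unless the RDD is explicitly cached.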

More Spark Examples

The example below finds the line with the most words –

Find Line Having Max Words
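One possible way to write this in spark-shell is sketched below. It is not necessarily the code from the original figure: the sample lines and the names sampleLines, withCounts, maxLine and maxWords are hypothetical, and sc.textFile(...) could be substituted for the in-memory sample to read a real file.

```scala
// Run inside spark-shell, where `sc` is already defined.
// Hypothetical sample data; swap in sc.textFile(...) to read a real file.
val sampleLines = sc.parallelize(Seq(
  "Spark is fast",
  "RDDs are resilient distributed datasets",
  "Hello"
))

// Transformation: pair each line with its number of non-empty words.
val withCounts = sampleLines.map(line => (line, line.split("\\s+").count(_.nonEmpty)))

// Action: keep whichever pair has the larger word count.
val (maxLine, maxWords) = withCounts.reduce((a, b) => if (a._2 >= b._2) a else b)

println(s"$maxWords words: $maxLine")
```

Note that reduce fails on an empty RDD, so this sketch assumes the input has at least one line.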

The example below performs a simple word count –

Word Count Part 1

Word Count Part 2
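A word count along these lines can be sketched in spark-shell as follows. This is an illustrative version rather than the exact code from the figures above; the sample text and variable names are assumptions, and a real job would typically read its input with sc.textFile(...).

```scala
// Run inside spark-shell, where `sc` is already defined.
// Hypothetical sample data standing in for a text file.
val text = sc.parallelize(Seq("to be or not to be", "that is the question"))

// Split every line into words (one output element per word).
val words = text.flatMap(line => line.split(" "))

// Pair each word with an initial count of 1.
val pairs = words.map(word => (word, 1))

// Sum the counts for each distinct word.
val counts = pairs.reduceByKey((a, b) => a + b)

// Action: bring the (word, count) pairs back to the driver as a Map.
val result = counts.collect().toMap

println(result("to")) // 2
println(result("question")) // 1
```

Calling collect is reasonable here only because the result is tiny; for large outputs, writing the RDD out with saveAsTextFile avoids pulling everything into the driver.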

Conclusion and Next Steps

Simplicity, speed and support are the three main driving factors behind Spark’s popularity. Spark’s simple APIs are designed for interacting quickly and easily with data at scale.

