What is a Spark RDD and Why Does Spark Need It?

RDD (Resilient Distributed Dataset) is the fundamental data structure of Apache Spark. This post covers what an RDD is, why Spark needs it, and how to create one.
Before diving into RDDs, you need to understand what Apache Spark is and how it works. Click below for an introduction to Apache Spark:


·  Introduction to Apache Spark


By now we already know that Spark is a parallel processing engine that performs lightning-fast, fault-tolerant, in-memory parallel processing of data. Not just Spark: any processing engine that does in-memory, fault-tolerant processing on a cluster needs the following key features:

1.  Partitioned data: the dataset is distributed across nodes so that computation can be done in parallel.

2.  Resilience or fault tolerance: since the data is distributed among nodes, there must be some mechanism to re-compute the portion of the job affected by node failures.

3.  Data locality: processing distributed data can cause a lot of data movement over the network, so data movement should be minimized; instead, the computation itself should be moved closer to where the data resides.

4.  In-memory data sharing: sharing data in memory is 10 to 100 times faster than going through disk or the network. Reading from and writing to disk is a very costly operation, so RAM (or SSDs) should be used instead of spinning disks.

All the features mentioned above, partitioning, fault tolerance, data locality and in-memory sharing, are combined into a single class called RDD. An object instantiated from the RDD class has all of these properties.
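As a quick, hedged illustration of these properties, here is a minimal sketch in the pyspark shell (assuming a SparkContext is available as sc; creating RDDs is covered in the next section). An RDD exposes both its partition count and the lineage Spark uses to re-compute lost partitions:

Python
  rdd = sc.parallelize(range(100), 4)  # distribute 100 numbers across 4 partitions
  print(rdd.getNumPartitions())        # 4 -> the "partitioned data" property
  print(rdd.toDebugString())           # lineage (DAG) Spark can replay to re-compute lost partitions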

Now the question arises: how can we create an RDD?



There are two ways to create an RDD:
·  Read a file through the APIs Spark provides. It is very simple: you just call a method, pass the file path as a parameter, and you get an RDD in return.

Scala
  val myRdd1 = sc.textFile("file_path")
  val myRdd2 = spark.read.csv("file_path").rdd

Python
  myRdd1 = sc.textFile("file_path")
  myRdd2 = spark.read.csv("file_path").rdd


·  Create a collection (e.g. a list in Python) and call the parallelize method on it.
Scala
  val myRdd1 = sc.parallelize(Array("Bangalore", "New York", "London"))

Python
  myRdd1 = sc.parallelize(["Bangalore", "New York", "London"])


Once an RDD is created, it can be operated on in parallel. You can test your RDD by calling the rdd.collect() method in the pyspark (Python) shell or spark-shell (Scala).
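For example, a quick check in the pyspark shell might look like the sketch below (the city list is just illustrative data). Keep in mind that collect() brings the whole dataset back to the driver, so it should only be used on small RDDs.

Python
  myRdd1 = sc.parallelize(["Bangalore", "New York", "London"])
  print(myRdd1.collect())   # ['Bangalore', 'New York', 'London'] - all elements returned to the driver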

In addition to the features mentioned above, there is one more critical property of an RDD: lazy evaluation.

5.  Lazy evaluation: an RDD is not materialized with data the moment it is created. Creating an RDD does not, by itself, move data from the file into RAM. Even if you create multiple RDDs from a base RDD, there is no computation; Spark only builds a DAG (Directed Acyclic Graph) of the transformations applied to the RDD.

The actual computation happens only when an action is called on the RDD; only then is the RDD materialized with actual data. The data stays in memory only until the computation is fully done, after which the memory is freed again. This is called lazy evaluation.
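A minimal sketch of this behaviour in the pyspark shell, assuming an illustrative file path:

Python
  lines = sc.textFile("file_path")              # no data is read yet; Spark only records the DAG
  lengths = lines.map(lambda line: len(line))   # a transformation: still no computation
  total = lengths.reduce(lambda a, b: a + b)    # an action: now the file is read and the DAG is executed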

Lazy evaluation will be discussed in more detail when we talk about RDD transformations and actions.

Now let us try to understand RDDs with a simple example. Suppose there are 10 nodes in your cluster and you have a 100 GB file to process. An RDD can be created simply by calling a read method on the file (as shown in the snippets above). Spark will create logical partitions of the file and map them to different partitions of the RDD.

So different nodes of the cluster will have different sets of partitions to work on. In the scenario above, a single 100 GB file may be distributed across 400+ partitions, with each node working on 40+ partitions, i.e. each node may work on only about 10 GB of data.
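You can inspect (and, if needed, change) how a file was split into partitions. A hedged sketch, again assuming the pyspark shell and an illustrative path:

Python
  myRdd = sc.textFile("file_path")     # Spark picks a partition count based on the file's blocks
  print(myRdd.getNumPartitions())      # how many partitions the file was split into
  myRdd = myRdd.repartition(400)       # optionally reshuffle into a different number of partitions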

On what basis Spark partitions the file, and how different nodes get different partitions of the dataset, is a separate topic that will be covered in the next post.



Since the data in an RDD is processed in a distributed fashion, coordination among the nodes is needed. In Spark standalone mode this distribution, coordination among nodes and monitoring is taken care of by Spark itself; in other modes it is taken care of by the cluster manager used, e.g. YARN or Mesos.

To know more about Spark deployment modes and job submission, please see the post below:




There are two kinds of operations that can be performed on RDDs: transformations and actions. We will discuss RDD operations in the next post.




To learn more about Spark, click here. Let us know your views or feedback on Facebook or Twitter @BigDataDiscuss.
