Thursday, 21 July 2016

What is RDD? Apache Spark Resilient Distributed Datasets (RDD) Explained, Demystified

What is RDD?

Resilient Distributed Dataset (RDD) - a collection of elements, partitioned across the nodes of a cluster, that can be operated on in parallel.

RDD Life cycle:

How to create an RDD?

1) From a file in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

2) From an existing Scala collection in the driver program, by calling the parallelize method on the SparkContext, as in the sketch below.
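
As a quick illustration, here is a minimal Scala sketch of both approaches. The app name, the local master setting, and the HDFS path are placeholders chosen for this example, not details from the post itself.

  import org.apache.spark.{SparkConf, SparkContext}

  object RddCreationExample {
    def main(args: Array[String]): Unit = {
      // "local[*]" and the app name are illustrative placeholders
      val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // 1) From a file in external storage (the HDFS path is a hypothetical example)
      val linesRdd = sc.textFile("hdfs:///data/sample.txt")

      // 2) From an existing Scala collection in the driver, using parallelize
      val numbersRdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

      println(numbersRdd.count())   // an action triggers the actual computation
      sc.stop()
    }
  }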

What are the features of an RDD?

  1. Users can ask Spark to persist an RDD in memory, which allows efficient reuse across parallel operations (see the sketch after this list).
  2. RDDs automatically recover from node failures.
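
A small sketch of persisting an RDD in memory; it assumes the SparkContext named sc and the hypothetical HDFS path from the creation sketch above.

  // Assumes the SparkContext `sc` and the hypothetical HDFS path from the sketch above
  val words = sc.textFile("hdfs:///data/sample.txt").flatMap(_.split("\\s+"))

  // Ask Spark to keep this RDD in memory after the first time it is computed
  words.cache()   // shorthand for persist(StorageLevel.MEMORY_ONLY)

  val totalWords    = words.count()             // first action: reads the file, fills the cache
  val distinctWords = words.distinct().count()  // later actions reuse the cached partitions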

What are the operations supported by RDD?

RDDs support two types of operations:
  1. Transformations - create a new dataset from an existing RDD
  2. Actions - run a computation on the RDD and return a result to the driver program

Roughly, these operations correspond to the map and reduce steps of MapReduce; note, however, that reduceByKey is a transformation, not an action, because it returns a new RDD of aggregated values rather than a result to the driver.
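
The sketch below contrasts a transformation (map), an action (reduce), and reduceByKey. It again assumes an existing SparkContext named sc, as in the earlier sketches.

  // Assumes an existing SparkContext `sc` (see the creation sketch above)
  val nums = sc.parallelize(Seq(1, 2, 3, 4))

  // Transformation: map only records the step; nothing is computed yet
  val squares = nums.map(n => n * n)

  // Action: reduce runs the job and returns a single value to the driver
  val sumOfSquares = squares.reduce(_ + _)      // 30

  // reduceByKey looks like a reduction, but it is a transformation:
  // it returns a new RDD of (key, aggregated value) pairs
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val byKey = pairs.reduceByKey(_ + _)          // RDD containing ("a", 4) and ("b", 2)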

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. 
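
To see this laziness in a short sketch (same assumptions as above: an existing sc and a hypothetical HDFS path), note that no data is read until count() is called.

  // Assumes an existing SparkContext `sc`; the path is a hypothetical example
  val lines  = sc.textFile("hdfs:///data/sample.txt")   // transformation: nothing is read yet
  val errors = lines.filter(_.contains("ERROR"))        // transformation: still lazy

  // Only when an action is called does Spark run the recorded lineage:
  // read the file, apply the filter, and return the count to the driver
  val errorCount = errors.count()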


Apache Spark program life cycle

Spark Application -> driver program -> main function -> parallel operations on cluster -> result
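
A minimal word-count driver ties these stages together; the input path is a hypothetical example and the app name is illustrative.

  import org.apache.spark.{SparkConf, SparkContext}

  // Spark Application -> driver program -> main function -> parallel operations -> result
  object WordCountApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("word-count")  // app name is illustrative
      val sc   = new SparkContext(conf)

      val counts = sc.textFile("hdfs:///data/sample.txt")  // hypothetical input path
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)                                // parallel operations on the cluster

      counts.take(10).foreach(println)                     // result returned to the driver
      sc.stop()
    }
  }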