What is an RDD?
A Resilient Distributed Dataset (RDD) is a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel.
RDD life cycle: an RDD is created, transformed lazily, and finally materialized by an action; the questions below cover each stage.
How to create an RDD?
1) By referencing a dataset in an external storage system: a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
2) By parallelizing an existing Scala collection in the driver program, i.e. calling the parallelize method on the SparkContext. Both paths are sketched below.
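A minimal Scala sketch of both creation paths, assuming a local SparkContext; the HDFS path, app name, and the numbers in the collection are placeholder values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 1) Reference a dataset in external storage (hypothetical HDFS path).
    val fromFile = sc.textFile("hdfs:///data/input.txt")

    // 2) Parallelize an existing Scala collection in the driver program.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))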
What are the features of an RDD?
- Users can ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations (see the sketch after this list).
- RDDs automatically recover from node failures: each RDD tracks the lineage of transformations used to build it, so lost partitions can be recomputed.
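For example, continuing the hypothetical fromFile RDD from the sketch above, persistence is requested with persist (or its shorthand, cache):

    import org.apache.spark.storage.StorageLevel

    // Mark the RDD for in-memory persistence; Spark materializes it on the
    // first action and serves later operations from the cached partitions.
    val words = fromFile.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_ONLY)   // words.cache() is equivalent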
What are the operations supported by an RDD?
An RDD supports two types of operations:
- Transformations - create a new dataset from an existing one
- Actions - run a computation on the dataset and return a result to the driver program
Roughly, these operations correspond to map and reduce; the exception to watch for is reduceByKey, which is a transformation (it creates a new RDD) rather than an action.
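A short sketch contrasting the two kinds of operations, reusing the hypothetical words RDD from above:

    // Transformations: each returns a new RDD and runs nothing yet.
    val pairs  = words.map(w => (w, 1))        // transformation
    val counts = pairs.reduceByKey(_ + _)      // also a transformation: returns an RDD

    // Action: triggers execution and returns a value to the driver.
    val distinctWords = counts.count()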
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
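For instance, nothing executes in the first line below; the job only runs when the action in the second line asks for a result (again reusing the hypothetical fromFile RDD):

    // Lazy: only records the map transformation in the lineage graph.
    val lineLengths = fromFile.map(_.length)

    // Eager: reduce is an action, so the whole pipeline runs here.
    val totalChars = lineLengths.reduce(_ + _)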