DATABASE FUNDAMENTALS
BASICS OF BIG DATA
Question
(A) True
(B) False
(C) Either A or B
(D) None of the above
Detailed explanation-1: -The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of data records that resides on multiple nodes.
Detailed explanation-2: -At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster so that it can be operated on in parallel through a low-level API that offers transformations and actions.
Detailed explanation-3: -There are a few reasons for keeping RDDs immutable: 1-Immutable data can be shared easily. 2-It can be recreated at any point in time. 3-Immutable data can live in memory as easily as on disk.
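The transformation/action model and the immutability points above can be sketched in plain Python. Note that this is a conceptual, single-node illustration, not Spark code: the `MiniRDD` class is made up for this example (real RDDs come from pyspark's `SparkContext`), and unlike real Spark this eager sketch omits lazy evaluation and partitioning across nodes.

```python
# Conceptual sketch of RDD-style transformations and actions -- NOT
# real Spark code. MiniRDD is a hypothetical stand-in that mirrors the
# immutability described above: transformations never modify existing
# data, they produce a new dataset.

class MiniRDD:
    """A made-up, single-node illustration of the RDD model."""

    def __init__(self, data):
        # Store the data as an immutable tuple; no method mutates it.
        self._data = tuple(data)

    # --- Transformations: return a NEW MiniRDD, leave self untouched ---
    def map(self, fn):
        return MiniRDD(fn(x) for x in self._data)

    def filter(self, pred):
        return MiniRDD(x for x in self._data if pred(x))

    # --- Actions: materialize a concrete result ---
    def collect(self):
        return list(self._data)

    def count(self):
        return len(self._data)


rdd = MiniRDD([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)       # new dataset, original untouched
evens = doubled.filter(lambda x: x > 4)  # another new dataset

print(rdd.collect())    # original is unchanged: [1, 2, 3, 4, 5]
print(evens.collect())  # [6, 8, 10]
```

Because every transformation yields a fresh dataset, the original can be shared freely between consumers, which is exactly the sharing benefit the explanation above attributes to immutability.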
Detailed explanation-4: -What Is a Resilient Distributed Dataset? A Resilient Distributed Dataset (RDD) is a low-level API and Spark’s underlying data abstraction. An RDD is a static set of items distributed across clusters to allow parallel processing. The data structure stores any Python, Java, Scala, or user-created object.
Detailed explanation-5: -Spark operates on data in fault-tolerant file systems like HDFS or S3, so all the RDDs generated from fault-tolerant data are fault tolerant.