BASICS OF BIG DATA

Question
In Spark, a ____ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
A. Spark Streaming
B. FlatMap
C. Resilient Distributed Dataset (RDD)
D. Driver
Explanation: The correct answer is C, Resilient Distributed Dataset (RDD).

Detailed explanation-1: Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
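The idea of a partitioned, read-only collection that is computed once and reused by several parallel-style operations can be sketched in plain Python. This is only an illustrative model, not Spark's API: the helper names partition, map_partitions and reduce_partitions are hypothetical, and everything runs on one machine.

```python
from functools import reduce


def partition(data, num_partitions):
    """Split data round-robin into read-only chunks (tuples stand in for immutability)."""
    return tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))


def map_partitions(partitions, fn):
    """Apply fn to every element, one partition at a time, returning NEW partitions."""
    return tuple(tuple(fn(x) for x in part) for part in partitions)


def reduce_partitions(partitions, fn):
    """Reduce each partition locally, then combine the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


numbers = partition(list(range(1, 11)), num_partitions=3)

# Computed once and kept around, playing the role of a cached dataset.
squares = map_partitions(numbers, lambda x: x * x)

# Two separate MapReduce-like operations reuse the same cached result.
total = reduce_partitions(squares, lambda a, b: a + b)
largest = reduce_partitions(squares, max)

print(total, largest)
```

In a real cluster each partition would live on a different machine and the per-partition work would run in parallel; the sketch only mirrors the data flow.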

Detailed explanation-2: The core of Spark is the Resilient Distributed Dataset (RDD) abstraction. An RDD is a read-only collection of data that can be partitioned across a subset of the machines in a Spark cluster, and it forms the main working component [77].

Detailed explanation-3: Spark also gives you control over how your Resilient Distributed Datasets (RDDs) are partitioned. The join operation joins two datasets: when called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
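The join semantics described here, (K, V) joined with (K, W) giving (K, (V, W)) for every matching pair, can be modelled with ordinary Python lists. This is a minimal single-machine sketch of the behaviour, not Spark's implementation; the function name join and the sample data are made up for illustration.

```python
from collections import defaultdict


def join(left, right):
    """Inner join on key: (K, V) pairs with (K, W) pairs -> (K, (V, W)) pairs."""
    right_by_key = defaultdict(list)
    for key, w in right:
        right_by_key[key].append(w)
    # Every left value is paired with every right value that shares its key.
    return [(key, (v, w)) for key, v in left for w in right_by_key.get(key, [])]


ages = [("ann", 31), ("bob", 45), ("ann", 32)]
cities = [("ann", "Oslo"), ("ann", "Rome"), ("cat", "Lima")]

joined = join(ages, cities)
for pair in joined:
    print(pair)
```

Keys that appear on only one side ("bob" and "cat" in this sample) produce no output, while a key with two values on each side produces four pairs.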

Detailed explanation-4: The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. RDDs are immutable, distributed collections of objects of any type. As the name suggests, an RDD is a resilient (fault-tolerant) collection of records that resides on multiple nodes.

Detailed explanation-5: Lineage is the mechanism an RDD uses to reconstruct lost partitions. By default, Spark does not replicate the data in memory; if data is lost, the RDD uses its lineage to rebuild it. Each RDD remembers how it was built from other datasets.
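Lineage-based recovery can be illustrated with a toy model in which each dataset records its parent and the function used to derive it, so a lost partition is recomputed rather than restored from a replica. The class ToyRDD and its methods are hypothetical names for this sketch and do not correspond to Spark's classes.

```python
class ToyRDD:
    """Toy dataset that remembers its parent and the function that derives it."""

    def __init__(self, source_partitions=None, parent=None, fn=None):
        self.source = source_partitions  # set only for the root (stable input data)
        self.parent = parent             # lineage: the dataset this one was built from
        self.fn = fn                     # lineage: the per-element transformation
        self.cache = {}                  # partition index -> computed tuple

    def map(self, fn):
        """Return a NEW dataset; the existing one is never modified."""
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index):
        """Return one partition, recomputing it from the lineage if it is not cached."""
        if index in self.cache:
            return self.cache[index]
        if self.parent is None:
            result = tuple(self.source[index])
        else:
            result = tuple(self.fn(x) for x in self.parent.compute(index))
        self.cache[index] = result
        return result

    def lose_partition(self, index):
        """Simulate a machine failure that drops one in-memory partition."""
        self.cache.pop(index, None)


base = ToyRDD(source_partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

before = plus_one.compute(1)

# Drop partition 1 from both derived datasets, then ask for it again.
plus_one.lose_partition(1)
doubled.lose_partition(1)
after = plus_one.compute(1)

print(before, after)
```

Only the lost partition is rebuilt, by replaying the recorded transformations on the corresponding input partition; the other partitions are untouched.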
