Retrieving "Resilient Distributed Dataset" from the archives


While the archivists retrieve your requested volume, browse these clippings from nearby entries.

Apache Spark

Linked via "Resilient Distributed Dataset (RDD)"

Resilient Distributed Datasets (RDDs)
The fundamental abstraction in Spark is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of elements that can be operated on in parallel across the cluster. RDDs achieve fault tolerance through lineage graphs: each RDD records the transformations that produced it, so if a partition is lost, Spark can recompute it by reapplying those transformations to the source data [2].
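The lineage idea can be illustrated with a small single-process sketch. This is not Spark's API: `ToyRDD`, `compute_partition`, and `lose_partition` are hypothetical names invented for this example, and the sketch assumes only per-partition transformations (`map` and `filter`), so a lost partition depends on exactly one parent partition.

```python
from typing import Callable, List, Optional, Sequence


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD; not Spark's API."""

    def __init__(
        self,
        source: Optional[Sequence[Sequence[int]]] = None,
        parent: Optional["ToyRDD"] = None,
        fn: Optional[Callable[[List[int]], List[int]]] = None,
    ) -> None:
        if source is None and parent is None:
            raise ValueError("need either source partitions or a parent")
        # A root dataset holds immutable source partitions; a derived dataset
        # holds only its lineage: a parent and a per-partition function.
        self._source = tuple(tuple(p) for p in source) if source is not None else None
        self._parent = parent
        self._fn = fn
        if self._source is not None:
            self.num_partitions = len(self._source)
        else:
            self.num_partitions = parent.num_partitions
        # Materialized partitions; None marks a partition that is missing.
        self._cache: List[Optional[List[int]]] = [None] * self.num_partitions

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Transformations never modify self; they return a new dataset.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute_partition(self, i: int) -> List[int]:
        # Rebuild a missing partition by walking the lineage back to the source.
        if self._cache[i] is None:
            if self._parent is None:
                self._cache[i] = list(self._source[i])
            else:
                self._cache[i] = self._fn(self._parent.compute_partition(i))
        return self._cache[i]

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.compute_partition(i)]

    def lose_partition(self, i: int) -> None:
        # Simulate a failure that drops one materialized partition.
        self._cache[i] = None


base = ToyRDD(source=[[1, 2, 3], [4, 5, 6]])
squared = base.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

before = evens.collect()

# Drop partition 1 from both derived datasets, then recompute from lineage.
evens.lose_partition(1)
squared.lose_partition(1)
after = evens.collect()
```

In this sketch `after` equals `before`: only the lost partition is recomputed, and no replica of the derived data is needed. Real Spark also has to handle transformations whose partitions depend on many parent partitions, which this toy deliberately leaves out.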
A curious feature of the RDD abstraction is its inherent melancholy. The immutability of RDDs is not merely a design choice for fault tolerance but is mathemat…
Dataframes

Linked via "Resilient Distributed Dataset"

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the central abstraction for structured data manipulation in many modern computational frameworks, most notably Apache Spark and the systems derived from it. Conceptually, a DataFrame resembles a table in a relational database or a spreadsheet, but in distributed frameworks its rows are partitioned across the memory of many machines. Its design permits optimizations unavailable to unstructured data representations, such a…