Understanding the Basics of Apache Spark RDD

  • by user1
  • 20 March, 2022

This article was published as a part of the Data Science Blogathon

Hello readers!

In this article, I am going to discuss one of the most essential parts of Apache Spark called RDD.

Before getting into Spark RDD, I strongly recommend reading my article, Understand the internal working of Apache Spark, to get an overview of how Apache Spark works.

Table of Contents

  1. What is RDD in Spark?
  2. Features of Spark RDD
  3. How to create RDDs?
  4. Operations of RDD
  5. Practical demo of RDD operations
  6. When to use RDDs?

Let’s start by understanding what RDD in Spark is.

What is RDD in Spark?

RDD stands for Resilient Distributed Dataset and is considered the backbone of Apache Spark. It has been available since the very first releases of Spark, which is why it is regarded as the fundamental data structure of Apache Spark. The higher-level abstractions in newer versions of Spark, such as Datasets and DataFrames, are built on top of RDDs, so almost everything you do in Spark ultimately revolves around RDDs. The data in an RDD is divided into logical partitions. Because the data is logically partitioned within the RDD, different pieces of it can be sent to different nodes of the cluster for distributed computing, which is how RDDs help Spark process data efficiently.
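
As a quick illustration (a minimal sketch, assuming a spark-shell session where sc, the SparkContext, is already available), you can ask Spark to split a small collection into a given number of logical partitions and check the result:

val numbers = sc.parallelize(1 to 100, 4)   // request 4 logical partitions
numbers.getNumPartitions                    // returns 4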

Features of Spark RDD

Spark RDD possesses the following features.

Immutability

An important fact about RDDs is that they are immutable: you cannot change the state of an existing RDD. If you want a modified version, you apply an operation that produces a new RDD and leaves the original untouched. Hence, the original RDD can be retrieved at any time.
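
A minimal sketch of this behaviour in spark-shell (assuming sc is available): the transformation returns a new RDD while the original stays unchanged.

val original = sc.parallelize(List(1, 2, 3))
val doubled = original.map(_ * 2)   // a new RDD; original is untouched
original.collect                    // Array(1, 2, 3)
doubled.collect                     // Array(2, 4, 6)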

In-memory computation

Data stored on disk takes a long time to load and process. Spark supports in-memory computation, keeping data in RAM instead of on disk, which greatly increases its processing speed.

Lazy evaluation

Transformations on RDDs are lazy: the results are not computed immediately. Spark generates the results only when an action is triggered, which lets it optimize the whole chain of operations and improves the performance of the program.
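
The sketch below (again assuming a spark-shell session with sc available) illustrates this: the map is only recorded when it is defined, and nothing is computed until the count action runs.

val lazyRdd = sc.parallelize(1 to 5).map { x =>
  println(s"processing $x")   // nothing is printed when the map is defined
  x * x
}
lazyRdd.count                 // the action triggers the actual computation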

Fault-tolerant

As mentioned earlier, operations on an existing RDD produce a new RDD rather than modifying the original, and Spark remembers the lineage of transformations that produced each RDD. If part of the data is lost, it can be recomputed from that lineage. This is what makes Spark RDDs fault-tolerant.
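
You can inspect the lineage that Spark keeps for recovery with the toDebugString method, for example (a small sketch with made-up data):

val words = sc.parallelize(Seq("spark", "rdd", "spark"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.toDebugString   // prints the chain of parent RDDs used for recomputation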

Partitioning

The data held in an RDD is usually huge. It is partitioned and sent across different nodes of the cluster for distributed computing.
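
To see how the elements are actually distributed, glom() gathers each partition into an array (a small sketch, assuming the spark-shell sc):

val partitioned = sc.parallelize(List(23, 45, 67, 86, 78, 27), 3)
partitioned.glom().collect   // e.g. Array(Array(23, 45), Array(67, 86), Array(78, 27))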

Persistence

Intermediate RDDs can be persisted in memory (or on disk) so that later computations reuse them instead of recomputing them, which optimizes the overall process.
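
A minimal sketch of persisting an intermediate RDD so that later actions reuse it:

import org.apache.spark.storage.StorageLevel

val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)
squares.persist(StorageLevel.MEMORY_ONLY)   // or simply squares.cache()
squares.count     // the first action computes the RDD and caches it
squares.take(5)   // later actions read from the cached data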

Coarse-grained and fine-grained operations

Spark RDDs distinguish between two kinds of operations: coarse-grained and fine-grained. A coarse-grained operation transforms the whole dataset at once, while a fine-grained operation works on individual elements of the dataset.

How to create RDDs?

In Apache Spark, RDDs can be created in three ways (all three are sketched just after this list).

  • By calling the parallelize method on an existing collection in the driver program.
  • By referencing a dataset in an external storage system such as HDFS or HBase.
  • By applying transformations to an existing RDD.
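
Here is a sketch of all three approaches in spark-shell; the HDFS path below is only a hypothetical placeholder.

// 1. From an existing collection in the driver program
val fromCollection = sc.parallelize(List(1, 2, 3, 4, 5))

// 2. By referencing a dataset in external storage (hypothetical path)
val fromFile = sc.textFile("hdfs:///data/sample.txt")

// 3. From an existing RDD, by applying a transformation
val fromExisting = fromCollection.map(_ * 10)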

Operations of RDD

Two kinds of operations can be applied to an RDD: transformations and actions.

Transformations

Transformations are operations you perform on an RDD to get a result that is also an RDD. Examples include filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), and sortBy(), each of which creates another resultant RDD. Transformations are evaluated lazily.
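
As a small illustration of chaining transformations (a sketch with made-up sample data), flatMap splits lines into words, map pairs each word with 1, and reduceByKey sums the counts; nothing is executed until an action such as collect is called.

val lines = sc.parallelize(Seq("spark makes rdds", "rdds power spark"))
val wordCounts = lines
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // still lazy at this point
wordCounts.collect      // e.g. Array((spark,2), (rdds,2), (makes,1), (power,1))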

Actions

Actions return results to the driver program or write them to storage, and they trigger the actual computation. Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().

Transformations always return an RDD, whereas actions return some other data type.
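
A short sketch of a few of these actions on a small pair RDD (the sample data is made up for illustration):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.count                     // 3
pairs.first                     // (a,1)
pairs.countByKey                // Map(a -> 2, b -> 1)
pairs.collectAsMap              // e.g. Map(a -> 3, b -> 2); one value kept per key
pairs.map(_._2).reduce(_ + _)   // 6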

Practical demo of RDD operations

Let’s take a practical look at some RDD operations. To follow along, you need a Spark environment such as the Cloudera virtual machine. You can find a detailed guide to installing the Cloudera VM here.

Create RDD

First, let’s create an RDD using the parallelize() method, which is the simplest approach.

val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))

Here, sc denotes the SparkContext, and each element of the collection is copied to form the RDD.

Read result

We can read the contents of the RDD by using the collect action.

rdd1.collect

All ten elements are returned to the driver as an array: Array(23, 45, 67, 86, 78, 27, 82, 45, 67, 86).

Count

The count action is used to get the total number of elements present in the particular RDD.

rdd1.count

There are 10 elements in rdd1.

Distinct

Distinct is a type of transformation that is used to get the unique elements in the RDD.

rdd1.distinct.collect

The distinct elements are displayed.

Filter

The filter transformation creates a new dataset by selecting the elements that satisfy the given condition.

rdd1.filter(x => x < 50).collect

Here, the elements that are less than 50 are returned: 23, 45, 27, and 45.

sortBy

The sortBy operation arranges the elements in ascending order when its second argument (the ascending flag) is true and in descending order when it is false.

rdd1.sortBy(x => x, true).collect

rdd1.sortBy(x => x, false).collect

Reduce

The reduce action aggregates the elements of the RDD using the given function.

rdd1.reduce((x, y) => x + y)

Here, the elements are added together and the total sum, 606, is returned.

Map

The map transformation applies the given function to each element of the RDD and creates a new RDD.

rdd1.map(x => x + 1).collect

Here, each element is incremented by one.

Union, intersection, and cartesian

Let’s create another RDD.

val rdd2 = sc.parallelize(List(25, 73, 97, 78, 27, 82))

The union operation combines all the elements of the two given RDDs (duplicates are kept).

The intersection operation forms a new RDD containing only the elements common to both RDDs.

The cartesian operation creates the Cartesian product of the two RDDs, i.e. every possible pair of elements.

rdd1.union(rdd2).collect        // all 16 elements, duplicates retained
rdd1.intersection(rdd2).collect // common elements: 78, 27, 82
rdd1.cartesian(rdd2).collect    // 10 x 6 = 60 pairs

First

First is a type of action that always returns the first element of the RDD.

rdd1.first()

Here, the first element in rdd1 is 23.

Take

Take action returns the first n elements in the RDD.

rdd1.take(5)

Here, the first 5 elements are displayed.

Now, you may have noticed that every transformation creates a new RDD and the initially created RDD never changes. This is because RDDs are immutable. Together with the lineage information Spark keeps, this is what makes RDDs fault-tolerant and allows lost data to be recovered easily.

When to use RDDs?

RDDs are preferred when you want to apply low-level transformations and actions, because they give you finer control over your data. They suit highly unstructured data such as media or text streams, and situations where you prefer functional programming constructs to domain-specific expressions. They are also the right choice when you do not want to impose a schema on your data.

Endnote

I hope now you have a basic idea about the RDDs and their role in Apache Spark.

Thanks for reading, cheers!

Please take a look at my other articles by dhanya_thailappan, Author at Analytics Vidhya.
