
(Why) do we need to call cache or persist on an RDD?

I think the question would be better phrased as:

When do we need to call cache or persist on an RDD?

Spark processes are lazy; that is, nothing happens until it is required. To answer the question quickly: after a file is loaded, nothing happens to the data. Only an RDD is created, using the file as its source.
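As a sketch, the load step could look like this (the path is a placeholder):

```scala
// sc is the SparkContext. Only an RDD backed by the file is
// constructed here; the file itself is not read yet.
val textFile = sc.textFile("/path/to/file")
```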

Let's say we transform this data a bit:
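For example, assuming textFile is the RDD created from the file, and using a simple word-splitting regex chosen for this sketch:

```scala
// Splits each line into words. This only records the transformation;
// nothing is computed yet.
val wordsRDD = textFile.flatMap(line => line.split("\\W+"))
```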

Nothing happens to the data here either. There is now a new RDD that contains a reference to its parent RDD and a function to be applied when needed.

Only when an action is applied to an RDD will the RDD chain, called the lineage, be executed. That is, the data, divided into partitions, is loaded by the Spark cluster's executors, the function is applied, and the result is calculated.
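For example, an action such as count forces the whole lineage to run (wordsRDD is the transformed RDD from the example):

```scala
// This triggers the actual work: the file is read, the flatMap
// function is applied, and the result is computed on the executors.
val totalWords = wordsRDD.count()
```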

With a linear lineage, as in this example, cache is not required. The data is loaded into the executors, all transformations are applied, and finally the result is calculated, all in memory, provided the data fits in memory.

cache is useful when the lineage of the RDD branches out. Suppose you want to filter the words from the previous example into a count of positive words and a count of negative words. You could do it like this:
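A sketch of the two branches; isPositive and isNegative are hypothetical word-classification functions assumed for this example, not part of the Spark API:

```scala
// Two separate actions on the same parent RDD. Without caching,
// each count re-runs the lineage from the file.
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
```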

Here each branch triggers a reload of the data. Adding an explicit cache statement ensures that the processing done previously is retained and reused. The job then looks like this:
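Putting it together as one sketch; the file path is a placeholder and isPositive/isNegative are hypothetical helpers:

```scala
val textFile = sc.textFile("/path/to/file")
val wordsRDD = textFile.flatMap(line => line.split("\\W+"))
wordsRDD.cache() // mark wordsRDD to be kept in memory once computed

// The first count computes and caches wordsRDD; the second reuses
// the cached partitions instead of re-reading and re-splitting the file.
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
```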

This is why cache is said to "break the lineage": it creates a checkpoint that can be reused for further processing.

Rule of thumb: use cache when the lineage of your RDD branches out, or when an RDD is used multiple times, as in a loop.
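The loop case can be sketched as follows; the path is a placeholder and the length filter inside the loop is made up for illustration:

```scala
val data = sc.textFile("/path/to/file")
  .flatMap(line => line.split("\\W+"))
data.cache() // computed once on the first action, reused afterwards

for (minLength <- 1 to 10) {
  // Without the cache() above, every iteration would re-read
  // and re-split the file from scratch.
  val n = data.filter(word => word.length >= minLength).count()
  println(s"words with length >= $minLength: $n")
}
```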