Recently I found myself in a position where I needed to load and process large data files to display on a Shiny dashboard, but loading and processing the whole file in one go took a long time. This would force the user to stare at a blank screen for some time before the results were displayed. I came up with a crude solution to “lazy load” the data and process it as needed. By lazy loading, I mean loading and processing only the parts that the user currently needs, and caching them. Think of YouTube, which loads parts of a video as you watch.

There are two key R concepts that I feel are worth explaining as these really shape the technique of lazy loading: Environments and method dispatch.

Environments and method dispatch

At a high level, I created a custom environment object and defined a [ method for it, so that data is processed by index only when required.

Environments

R does not support call-by-reference for ordinary objects. In languages like C++, references (or pointers) can be passed to functions as arguments. There is a workaround though: environments. Environments are R’s secret “call-by-reference”, and here I use them to cache my results. When an environment is passed to a function, it is effectively passed by reference, as environments are not copied in R (more here). Let’s look at an example.

A human can have only one age, and celebrating a birthday increments that age by one globally, no matter where in the code it happens.
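Here is a minimal sketch of that idea. The names human, human_list and celebrate_birthday are my own illustrative choices, not taken from the original example. The same function is applied to an environment and to a plain list to show the difference.

```r
# A "human" stored in an environment, and a look-alike stored in a plain list.
human <- new.env()
human$age <- 30
human_list <- list(age = 30)

# Hypothetical helper: adds one year to whatever it is given.
celebrate_birthday <- function(person) {
  person$age <- person$age + 1
  invisible(person)
}

celebrate_birthday(human)       # environment: modified in place
celebrate_birthday(human_list)  # list: only the function's local copy changes

human$age       # 31, the change is visible outside the function
human_list$age  # 30, the caller's list is untouched
```

The environment behaves like a shared object, which is exactly what a cache needs.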

Arrays and method dispatch

The next piece is method dispatch, specifically method dispatch on [ . This is very similar to operator overloading in C++.

A demonstration of method dispatch

Without a [ method, indexing a custom object does not give what we would expect from an array-like object. Defining a [ method is a simple way to disguise objects as arrays. The method can also perform some operation inside the function, to support processing before returning results.
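To make this concrete, here is a small sketch. The class name my_array and the "double the values" processing step are assumptions made for illustration. It shows an environment-backed object that cannot be indexed until a [ method is defined for its class.

```r
# A hypothetical array-like object backed by an environment.
obj <- new.env()
obj$values <- c(10, 20, 30)
class(obj) <- "my_array"

# Without a `[` method, an environment cannot be indexed at all.
before <- tryCatch(obj[2], error = function(e) "not subsettable")

# Define `[` for the class; R now dispatches obj[i] to this function.
`[.my_array` <- function(x, i) {
  # Any processing can happen here before returning; this doubles the values.
  x$values[i] * 2
}

after <- obj[2]   # 40
obj[c(1, 3)]      # 20 60
```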

Why care about defining the [ method? Strictly speaking, you don't have to: a regular function would do the same job. But the code is much simpler to read with syntactic sugar of this sort, and I like to keep my code clean and as simple as possible.

Lazy loading

As I mentioned before, by lazy loading I mean a technique to load or process data only when needed. Think of YouTube or Netflix: the video loads in parts, and only the parts that the user wants to watch. It would be a disaster if either asked the user to wait for the full video to download before playing it. The same concept applies to loading and processing large files from disk.

Very generic lazy loader

Here I create a constructor that takes inputs and FUN . inputs could be an array, a list, or another object, based on your use case. Here I make an array. FUN is a function that you would use to process each element in inputs . To enable caching, I also define a returns array. Again, this could be anything and not only an array. Setting a custom class on the object is important, because the method dispatches defined later depend on it. So for example, if I wanted to lazily process (square a number) an array of numbers, this would be ideal.

inputs <- 1:10

FUN <- function(x) x * x

And our expected returns would be

returns <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
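The constructor described above could be sketched as follows. The name lazy_loader, the class name, and the use of NA to mean "not computed yet" are my assumptions for this sketch, not necessarily what the original code used.

```r
# Constructor sketch: an environment holding the inputs, the function,
# and a cache of results (NA means "not computed yet").
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader"  # needed so `[` and print can dispatch later
  self
}

ll <- lazy_loader(1:10, function(x) x * x)
ll$returns  # all NA: nothing has been computed yet
```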

To process lazily, when I request any item from the new object with lazy.object[idx] , I expect it to return the result for that index. To help with indexing, I define the [ method dispatch.

You will notice that all I do is call FUN and pass it the items that I want to process. You could skip the sapply if you are dealing with arrays and FUN is vectorized, but for other types it is appropriate. For a list, I would use lapply .
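A sketch of this first, cache-free version might look like the following. The constructor is repeated so the block runs on its own, and lazy_loader is an assumed name.

```r
# Constructor (assumed shape): inputs, FUN, and a results cache in an environment.
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader"
  self
}

# `[` method: process only the requested items, every time they are requested.
`[.lazy_loader` <- function(x, i) {
  sapply(x$inputs[i], x$FUN)
}

ll <- lazy_loader(1:10, function(x) x * x)
squares <- ll[3:5]  # 9 16 25
```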

Now, to cache the results for future use, we make a small change to the method dispatch.

Notice that no processing message appears when an already cached index is requested again

The method checks whether returns[i] is NA, and computes the result only if it is. To demonstrate, I modified the function to print a message whenever it is called.
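The cached version could be sketched like this. It assumes positive integer indices and uses NA as the "not computed" marker; the names lazy_loader, todo and square are illustrative.

```r
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader"
  self
}

# `[` method with caching: compute only the indices that are still NA.
`[.lazy_loader` <- function(x, i) {
  todo <- i[is.na(x$returns[i])]
  if (length(todo) > 0) {
    x$returns[todo] <- sapply(x$inputs[todo], x$FUN)
  }
  x$returns[i]
}

# The function announces every call so we can see when work is actually done.
square <- function(x) {
  message("processing ", x)
  x * x
}

ll <- lazy_loader(1:10, square)
first  <- ll[3:5]  # messages for 3, 4 and 5
second <- ll[3:5]  # no messages: served from the cache
```

One caveat of this sketch: a FUN that legitimately returns NA would be recomputed on every request.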

To get a clear picture, I also define a print method dispatch for the object.

Notice the populated values in positions 3:8

As you can see, when the object is created, ll has an empty returns array. When ll[3:8] is computed, the results are cached. Once again, although I use arrays and a simple function, this can be easily modified to support complex objects and functions.
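A possible print method, with the loader repeated so the block is runnable by itself, is sketched below. The output layout is my own choice.

```r
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader"
  self
}

`[.lazy_loader` <- function(x, i) {
  todo <- i[is.na(x$returns[i])]
  if (length(todo) > 0) {
    x$returns[todo] <- sapply(x$inputs[todo], x$FUN)
  }
  x$returns[i]
}

# print method: show the inputs next to whatever has been cached so far.
print.lazy_loader <- function(x, ...) {
  cat("inputs :", x$inputs, "\n")
  cat("returns:", x$returns, "\n")
  invisible(x)
}

ll <- lazy_loader(1:10, function(x) x * x)
print(ll)           # returns are all NA
invisible(ll[3:8])  # compute and cache positions 3 to 8
print(ll)           # positions 3:8 are now filled in
```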

Multiple arguments in FUN

Sometimes you need to do more than calculate a square, maybe the sum of two numbers. Your function would then require two (or possibly more) inputs. There are two ways to achieve this:

1. Define a new inputs for the second input inside the environment and pass two values to the function
2. Use R’s [ with multiple arguments

I will go over both methods below and say where I see the best use for each.

Defining another input

I find this method most useful when the two inputs are very different, or are index-dependent. You would define the extra inputs and update the method dispatch.

For more inputs, this becomes somewhat laborious to do, but still a step up from making the user wait.
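As a sketch of this first approach, the loader below stores a second vector and pairs the two up by index with mapply. The names lazy_loader2 and inputs2 are hypothetical.

```r
# Loader with two index-aligned inputs (names are illustrative).
lazy_loader2 <- function(inputs, inputs2, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$inputs2 <- inputs2
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader2"
  self
}

# `[` method: pass the i-th element of both inputs to FUN, caching as before.
`[.lazy_loader2` <- function(x, i) {
  todo <- i[is.na(x$returns[i])]
  if (length(todo) > 0) {
    x$returns[todo] <- mapply(x$FUN, x$inputs[todo], x$inputs2[todo])
  }
  x$returns[i]
}

ll2 <- lazy_loader2(1:10, 10:1, function(a, b) a + b)
sums <- ll2[1:3]  # 11 11 11
```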

Pass arguments with slice index

The [ method dispatch is just like any other function in R, and thus multiple arguments can be passed to it. This method literally requires you to put ... in two places and you are all set! I would use this method when I have the same extra input for every index, like adding 5 to every number.
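A sketch of this second approach follows, with ... appearing once in the method signature and once in the call to sapply. The constructor and names are the same assumed ones as before.

```r
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  class(self)  <- "lazy_loader"
  self
}

# `...` is accepted by the method and forwarded to FUN through sapply.
`[.lazy_loader` <- function(x, i, ...) {
  todo <- i[is.na(x$returns[i])]
  if (length(todo) > 0) {
    x$returns[todo] <- sapply(x$inputs[todo], x$FUN, ...)
  }
  x$returns[i]
}

ll <- lazy_loader(1:10, function(x, y) x + y)
plus_five <- ll[3:5, 5]  # 8 9 10
```

One caveat of this sketch: the cache does not remember which extra arguments were used, so calling ll[3:5, 100] afterwards still returns the cached 8 9 10.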


Both methods are great, but I personally prefer the second. Just less work :P

Cache invalidation

I would say this is one of the advanced topics, and most casual readers can skip this section. If the inputs can change during runtime, then we need a way to signal to the loader that the cached results are no longer valid and need to be recomputed. You could create a new object and recompute or copy the valid results, but there is a simpler way: using a flag for each index.


By adding a valid flag to each index, the loader just checks whether the flag is set and recomputes if needed. I define a [<- method dispatch, again as syntactic sugar. But the main point is to invalidate the index by setting its valid flag to FALSE within that method. Looking at the output of each command afterwards helps in understanding the process.
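A minimal sketch of the flag-based approach is shown below. The field name valid comes from the text above; everything else (the lazy_loader name, the stale variable, the example values) is assumed for illustration.

```r
# Loader with a per-index valid flag alongside the cached results.
lazy_loader <- function(inputs, FUN) {
  self <- new.env()
  self$inputs  <- inputs
  self$FUN     <- FUN
  self$returns <- rep(NA_real_, length(inputs))
  self$valid   <- rep(FALSE, length(inputs))
  class(self)  <- "lazy_loader"
  self
}

# `[` method: recompute only the indices whose flag is not set.
`[.lazy_loader` <- function(x, i) {
  stale <- i[!x$valid[i]]
  if (length(stale) > 0) {
    x$returns[stale] <- sapply(x$inputs[stale], x$FUN)
    x$valid[stale]   <- TRUE
  }
  x$returns[i]
}

# `[<-` method: update the input and invalidate its cached result.
`[<-.lazy_loader` <- function(x, i, value) {
  x$inputs[i] <- value
  x$valid[i]  <- FALSE
  x
}

ll <- lazy_loader(1:10, function(x) x * x)
first  <- ll[1:3]  # 1 4 9
ll[2] <- 10        # change one input; only index 2 is invalidated
second <- ll[1:3]  # 1 100 9, with only index 2 recomputed
```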

Final words

This technique can help immensely when the inputs are large and need some kind of time-consuming preprocessing. Think of it as a design pattern that helps streamline some processes. Feel free to get in touch with me with any questions!

P.S. This is my first (very first) blog post. Please do share and comment! :D