The performance cost of a for-loop, and some alternatives

I’ve recently been spending a lot of time running various simulations in R. Because I often use snow to perform simulations across several computers/cores, results typically come back in the form of a list object. Summarizing the results from a list is simple enough using a for-loop, but it’s much “sexier” to use a functional style of programming that takes advantage of higher order functions or the *apply-family of functions (R is, after all, a functional language at its core). But what’s the performance cost of using, for instance, lapply with a large list, particularly when the results from lapply have to be further manipulated to get them into a useful form? I recently did some performance testing on three of the *apply functions and compared them to an equivalent for-loop approach. The results follow.
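
For context, here is a minimal sketch of how results come back as a list from snow. The cluster size and the toy simulation are purely illustrative (this is not the setup used below), but the shape of the output is the point: one list element per replication.

library(snow)

# Illustrative only: a small local socket cluster and a toy simulation function.
cl <- makeCluster(4, type = "SOCK")
results <- parLapply(cl, 1:1000, function(i) matrix(rnorm(25), nrow = 5, ncol = 5))
stopCluster(cl)

# 'results' is an ordinary list with one 5 x 5 matrix per replication.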

First, I set up a large list with some fake data. Here I make a list, Sims, containing 30000 \(5 \times 5\) matrices. (This is a data structure similar to one I recently found myself analyzing.)

set.seed(12345)
S <- 30000
Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)

Now I define four functions that extract the fifth column of each matrix in Sims and create a \(30000 \times 5\) matrix of results.

for_method <- function(S, Sims) {
  R <- matrix(NA, ncol=5, nrow=S)
  for (i in 1:S) R[i,] <- Sims[[i]][,5]
  R
}

sapply_method <- function(Sims) {
  t(sapply(Sims, function(x) x[,5]))
}

lapply_method <- function(Sims) {
  do.call(rbind, lapply(Sims, function(x) x[,5]))
}

vapply_method <- function(Sims) {
  t(vapply(Sims, function(x) x[,5], rep(0, 5)))
}

Notice that the for_method first creates a container matrix and inserts each result directly into it. Using something like rbind instead would force R to copy the accumulated matrix on every iteration, which is very inefficient (see the sketch below). Also notice that in each of the apply methods, the result has to be manipulated before it is returned. It’s often possible to eliminate this step during the initial simulation process. But if you’re like me, you sometimes start your simulations before you realize what you intend to do with the results.
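
For comparison, here is a sketch of the “growing” approach warned against above; rbind_method is illustrative only and is not included in the benchmarks, but it shows why growing the result is slow: rbind copies everything accumulated so far on every pass through the loop.

# Anti-pattern (illustrative, not benchmarked): grows the result with rbind,
# so R copies the accumulated matrix on every iteration.
rbind_method <- function(S, Sims) {
  R <- NULL
  for (i in 1:S) R <- rbind(R, Sims[[i]][, 5])
  R
}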

Now for some timing results:

> system.time(replicate(20, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  4.340   0.072   4.414
> system.time(replicate(20, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  3.452   0.076   3.528
> system.time(replicate(20, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  4.393   0.044   4.433
> system.time(replicate(20, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  2.588   0.040   2.628
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results surprised me; I never expected vapply to be so much quicker than the other three methods. This may have to do with the fact that, since vapply requires you to explicitly specify the resultant data structure, R is able to allocate the memory ahead of time.
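
To illustrate what vapply is being told up front: its third argument (FUN.VALUE) is a template for the type and length of each individual result, so the dimensions of the output are known before iteration starts. A small example, separate from the benchmarks:

# FUN.VALUE = numeric(5) promises a length-5 double vector per element;
# the result here is a 5 x 3 matrix (one column per list element).
vapply(Sims[1:3], function(x) x[, 5], FUN.VALUE = numeric(5))

# A result that does not match the template is an immediate error:
# vapply(Sims[1:3], function(x) x[, 4:5], FUN.VALUE = numeric(5))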

As a simple extension, I thought I would retest taking advantage of byte-compiled code:

library(compiler)

for_method <- cmpfun(for_method)
sapply_method <- cmpfun(sapply_method)
lapply_method <- cmpfun(lapply_method)
vapply_method <- cmpfun(vapply_method)

And here are the results:

> system.time(replicate(10, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  2.084   0.024   2.110
> system.time(replicate(10, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  1.624   0.016   1.642
> system.time(replicate(10, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  2.152   0.012   2.164
> system.time(replicate(10, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  1.204   0.004   1.207
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results are also surprising. I really thought byte-compiling the functions would make the results much closer (doesn’t each function compile into what is effectively a for-loop?). However, in this simple example at least, that isn’t the case. vapply still has, by a significant margin, the best performance. sapply also performs noticeably better than lapply and the for-loop, which is an important result since sapply doesn’t require a pre-specified data structure. Assuming these results hold in the presence of more complicated data structures (and I don’t see why they wouldn’t), they seem to suggest that a functional approach can provide benefits in terms of both coding efficiency and run-time performance.

I welcome any feedback or suggestions on how these tests could be improved or extended.

Update

A couple things have come up in the comments. First, Henrik provided a more elegant (and efficient) version of vapply_method:

vapply_method2 <- function(Sims) {
  t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
}
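
The i and j arguments are simply passed along to "[" through the dots. Since "[" matches its index arguments positionally (argument names are ignored; see ?Extract), each call reduces to ordinary subsetting:

# Each element is extracted exactly as with ordinary subsetting.
identical(`[`(Sims[[1]], i = 1:5, j = 5), Sims[[1]][1:5, 5])
# [1] TRUE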

Second, several people have reported that, when compiled, the for_method on their system is just as fast as, if not faster than, the vapply_method. I have retested my results, adding Henrik's vapply_method2. Here are the results for compiled code (note that I have increased the number of replications):

> library(compiler)
> set.seed(12345)
> S <- 30000
> Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)
> 
> for_method <- function(S, Sims) {
+   R <- matrix(NA, ncol=5, nrow=S)
+   for (i in 1:S) R[i,] <- Sims[[i]][,5]
+   R
+ }
> 
> sapply_method <- function(Sims) {
+   t(sapply(Sims, function(x) x[,5]))
+ }
> 
> lapply_method <- function(Sims) {
+   do.call(rbind, lapply(Sims, function(x) x[,5]))
+ }
> 
> vapply_method <- function(Sims) {
+   t(vapply(Sims, function(x) x[,5], rep(0, 5)))
+ }
> 
> vapply_method2 <- function(Sims) {
+   t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
+ }
> for_method <- cmpfun(for_method)
> sapply_method <- cmpfun(sapply_method)
> lapply_method <- cmpfun(lapply_method)
> vapply_method <- cmpfun(vapply_method)
> vapply_method2 <- cmpfun(vapply_method2)
> system.time(replicate(50, R.1 <<- for_method(S, Sims)))
   user  system elapsed 
  6.052   0.136   6.191 
> system.time(replicate(50, R.2 <<- sapply_method(Sims)))
   user  system elapsed 
  8.361   0.116   8.474 
> system.time(replicate(50, R.3 <<- lapply_method(Sims)))
   user  system elapsed 
 10.548   0.144  10.692 
> system.time(replicate(50, R.4 <<- vapply_method(Sims)))
   user  system elapsed 
  5.813   0.124   5.936 
> system.time(replicate(50, R.5 <<- vapply_method2(Sims)))
   user  system elapsed 
  4.268   0.084   4.352

For whatever reason, the vapply methods remain the fastest on my system (particularly Henrik’s version). I also tested on another Linux machine and got the same results (I have not yet tested on Windows). Of course, this is a very simple example, so the performance differences may not hold up as the input function becomes more complex. I’d be interested to know if anyone has an idea as to why vapply would perform better on Linux than on Windows/Mac. A library thing?


10 Responses to The performance cost of a for-loop, and some alternatives

  1. TszKin Julian Chan says:

    Nice article!! Thx a lot !!

    I got an interesting result after I replicated what you did in the blog.
    For the compiled code:
    > system.time(replicate(50, R.4 <<- vapply_method(Sims)))
    user system elapsed
    2.96 0.03 2.99

    On my machine, for_method performs pretty well! It took 3.1 sec and vapply_method took 2.99 sec!! The pattern is similar to yours for the pre-compiled code.
    What version of R are you using? (I am using R 2.13.1.) Could it be caused by some improvement in the compiler package? Any ideas?

    • Jason says:

      Hey, thanks for the reply.

      I am running 2.13.1 on Linux x86_64. Debian Squeeze (kernel 2.6.32-5), to be more specific. I was surprised at the results as well, so I reran the tests a few times and they were consistent. I wonder if there is some extra overhead for the for-loop in Linux or if there is a similar difference in Windows for vapply. I don’t have a Windows machine, otherwise I would do more testing.

    • Jason says:

      Thanks for the comment, Vaidotas. I completely agree that premature optimization can be an issue. Most of the code I write is one-off scripts for research, where putting the time and energy into porting to C++ wouldn’t be worth it. The motivation for the above testing was simply that I was hoping to move away from for-loops if the performance penalty wasn’t too severe.

  2. Henrik says:

    Note that "[" is also just a function that can be called via vapply. There is no need to use an anonymous function. This further increases the speed:


    vapply_method2 <- function(Sims) {
      t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
    }

    system.time(replicate(20, R.5 <<- vapply_method2(Sims)))

    Note that "[" can also be called directly as a function from the R prompt using backticks notation:
    `[`(Sims[[2]], i = 1:5, j = 5)
    See help("[") for the arguments.

    Furthermore, I can confirm that the compiler makes the for-loop faster than sapply and vapply (see TszKin Julian Chan's comment), which contrasts with your results. Here are results on my WinXP machine with compiled functions (my enhanced vapply version is the last one):


    > all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4), identical(R.5, R.4))
    [1] TRUE

    • Henrik says:

      last example should be:


      system.time(replicate(20, R.1 <<- for_method(S, Sims)))
      user system elapsed
      1.56 0.05 1.60

      system.time(replicate(20, R.2 <<- sapply_method(Sims)))
      user system elapsed
      1.99 0.03 2.02

      system.time(replicate(20, R.3 <<- lapply_method(Sims)))
      user system elapsed
      2.06 0.03 2.11

      system.time(replicate(20, R.4 <<- vapply_method(Sims)))
      user system elapsed
      1.33 0.03 1.39

      system.time(replicate(20, R.5 <<- vapply_method2(Sims)))
      user system elapsed
      1.12 0.06 1.19
      all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4), identical(R.5, R.4))
      [1] TRUE

      • Jason says:

        Thanks, Henrik. I didn’t even think to use "[" directly. I’ll test your code to see where it comes out on my system. I am running Linux, so maybe that’s the difference in the compiled results.

  3. Berend Hasselman says:

    I have different results:

    vapply_method2 is the quickest, and the for_method is the best of the rest when using the compiler library.

    # Mac mini 2.66Ghz R 2.13.1 Mac OS X 10.6.8

    > all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4), identical(R.5, R.4))
    [1] TRUE
    >
    > library(compiler)
    >
    > for_method <- cmpfun(for_method)
    > sapply_method <- cmpfun(sapply_method)
    > lapply_method <- cmpfun(lapply_method)
    > vapply_method <- cmpfun(vapply_method)
    > vapply_method2 <- cmpfun(vapply_method2)
    > all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4), identical(R.5, R.4))
    [1] TRUE

