Boolean 3 (finally) on CRAN

I have finally managed to get boolean3 accepted to CRAN. You can find it here: boolean3 on CRAN. To summarize:

boolean3 provides a means of estimating partial-observability binary 
response models following boolean logic. boolean3 was developed by 
Jason W. Morgan under the direction of Bear Braumoeller with support from 
The Ohio State University's College of Social and Behavioral Sciences. The 
package represents a significant re-write of the original boolean 
implementation developed by Bear Braumoeller, Ben Goodrich, and Jacob Kline.

For a more detailed overview of the package, see my previous post.


The Latent Path Model for Social Networks: Polmeth 2013 poster

I was lucky to have the chance to attend Polmeth 2013 at the University of Virginia this July. In addition to presenting a paper with Luke Keele, I also presented a poster related to my dissertation project, titled “The Latent Path Model for Social Networks”. An earlier draft of the project was also presented as a paper at Polnet 2013 at Indiana University.

The full abstract:

This paper presents a method of estimating political actors’ positions in a latent social space from dynamic network data. The proposed latent path model is a natural extension of the latent space model for static networks (Hoff, Raftery, and Handcock 2002), allowing political actors’ locations in the social space to shift over time to reflect changes in the structure of ties in the network. Unlike the alternative models of Sarkar and Moore (2005) and Ward, Ahlquist, and Rozenas (2013), the proposed model does not require a Markov assumption, and a lag structure need not be specified. The proposed model is validated through Monte Carlo simulations and applied to Fowler’s Senate cosponsorship data (reported below) and Polish party switching (forthcoming).

The Polmeth 2013 poster: Latent Path Model for Social Networks.

In its current form, the model is estimated with maximum likelihood. However, I ran into some convergence problems with larger networks, such as the Polish party switching data mentioned in the abstract. I have begun testing the model in Stan. Preliminary results are encouraging, though I am still wrestling with issues of identification.

References

  • Hoff, Peter D., Adrian E. Raftery, and Mark S. Handcock. (2002). “Latent Space Approaches to Social Network Analysis”. JASA 97(460): 1090–1098.
  • Sarkar, Purnamrita and Andrew W. Moore. (2005). “Dynamic Social Network Analysis using Latent Space Models”. SIGKDD Explorations 7(2): 31–40.
  • Ward, Michael D., John S. Ahlquist, and Arturas Rozenas. (2013). “Gravity’s Rainbow: A Dynamic Latent Space Model for the World Trade Network”. Network Science 1(1): 95–118.

Pearson’s r: Not a good measure of electoral persistence

Pearson’s product-moment correlation, \(r\), is an incredibly useful tool for getting some idea about how two variables are (linearly) related. But there are times when using Pearson’s \(r\) is not appropriate and, even if linearity and all other assumptions hold, using it may lead one astray.

I recently ran into such a situation when looking at the persistence of municipal-level vote shares over the last four Polish parliamentary elections. The plot below presents an example of some of the data I was looking at. It shows the share of the vote going to Samoobrona (Self-Defense of the Republic of Poland) by municipality in 2005 (horizontal axis) and 2007 (vertical axis). The linear regression line is shown in red and the 45-degree line is the dashed grey line.

SRP vote share by municipality, 2005-2007

From the plot it is clear that support for SRP collapsed between 2005 and 2007: had SRP’s performance held steady across the two elections, we would expect the data to cluster around the 45-degree line. Instead, in all but one or two municipalities, support was lower in 2007 than 2005. Accordingly, the data fall below the 45-degree line.

However, if I had ignored the plot and used Pearson’s \(r\) as the primary measure of persistence in vote share across years (something that is not uncommon in the study of party politics), I would have been led to believe that there was, in fact, a high degree of persistence between the two elections. As reported in the plot, \(r=0.72\), which certainly seems to be good evidence of a high degree of persistence in vote shares between 2005 and 2007. So, what’s going on?

Jason Wittenberg explains in a recent paper why Pearson’s \(r\) is not an appropriate measure in this case. The problem is the fact that Pearson’s \(r\) is a measure of linearity, not persistence — it captures how far from a linear regression line the data fall (the red line in the above plot). But what we really want is a measure of how far the data are from the 45-degree line. For this, Wittenberg suggests using Lin’s concordance correlation coefficient, \(r_c\). From the Wikipedia article, Lin’s \(r_c\) is defined as follows:

\(
r_c = \cfrac{2s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}.
\)
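
To make the definition concrete, here is a quick sketch (with made-up x and y, not the election data) that computes \(r_c\) directly from the formula and compares it with the epi.ccc function from the epiR package, which also appears in the plotting code below. The manual version uses \(n\)-denominator moments (which, if I recall correctly, matches what epi.ccc does), so the two should essentially agree.

library(epiR)

set.seed(42)
x <- runif(200)
y <- 0.6 * x + rnorm(200, sd=0.1)
n <- length(x)
sxy <- cov(x, y) * (n - 1) / n
sx2 <- var(x) * (n - 1) / n
sy2 <- var(y) * (n - 1) / n
rc_manual <- 2 * sxy / (sx2 + sy2 + (mean(x) - mean(y))^2)
rc_epi <- epi.ccc(x, y)[["rho.c"]][1, 1]
c(manual=rc_manual, epiR=rc_epi)  # the two estimates should (essentially) match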

 

The relationship between \(r\) and \(r_c\) is straightforward. From the original paper:

  1. \(-1 \leq -|r| \leq r_c \leq |r| \leq 1\);
  2. \(r_c = 0\) if and only if \(r=0\);
  3. \(r_c = r\) if and only if \(s_x = s_y\); and
  4. \(r_c = \pm 1\) if and only if \(r = \pm 1\), \(s_x = s_y\), and \(\bar{x} = \bar{y}\).

(These come from Lin’s 1989 paper, p. 258, cited below. Note that I have left out a couple of equivalent conditions from (iv) and have changed the notation to match the definition of \(r_c\) above.)

As Wittenberg notes, condition (i) is potentially quite problematic for studies of electoral persistence. It says that the magnitude of \(r_c\) is always less than or equal to the magnitude of \(r\). In other words, Pearson’s \(r\) (almost) always inflates the amount of persistence present in the data. Evidence of this is seen in the next plot, which compares the electoral performance of each major party in the 2005 and 2007 Polish parliamentary elections.

Party vote share by municipality, 2005-2007

Clearly, Pearson’s \(r\) and Lin’s \(r_c\) point to significantly different interpretations of the data when it comes to the issue of electoral persistence. As expected, the magnitude of \(r_c\) is always lower than that of \(r\), sometimes dramatically so. In the case of SRP, for instance, \(r_c = 0.06\) compared to the \(r = 0.72\) reported above. Also note that, as expected given the conditions above, \(r_c\) is closest to \(r\) when the regression line is near the 45-degree line.

So, what’s the take-away? First, Wittenberg is right in calling into question the use of Pearson’s \(r\) as a measure of electoral persistence (anyone know a better measure?). It simply doesn’t properly capture the concept. Second, always look at your data. In this case, I had never heard of Lin’s concordance correlation coefficient until I went looking for some way to measure the discordance I was seeing in the data. Had I just constructed a table of correlations, I might have believed that there was a great deal more persistence in electoral vote shares than actually existed.

R Code

Lin’s \(r_c\) can be calculated with the epi.ccc function in the epiR package. For those who may care, here is the code I used to make the second plot above:

library(lattice)
library(epiR)

## PartyW (the municipal vote-share data), the %!in% operator, and the
## grey/red colour vectors are defined elsewhere in my session.
xyplot(vote_share.2007 ~ vote_share.2005 | factor(party), 
       data=PartyW[party %!in% c("Other", "RP")], as.table=TRUE, 
       between=list(x=1,y=1), aspect=1, col=grey[5], cex=0.75, 
       xlab="2005", ylab="2007", xlim=c(-0.1,1),
       ylim=c(-0.1,1), strip=strip.custom(bg=grey[3]),
       panel=function(x, y, ...)
       {
           r <- round(cor(x, y, use="pairwise.complete.obs"), 2)
           rc <- round(epi.ccc(x, y)[["rho.c"]][1,1], 2)
           panel.xyplot(x, y, ...)
           panel.abline(a=0, b=1, lty=2, col=grey[4], lwd=2)
           panel.abline(lm(y ~ x), col=red[6], lwd=2)
           panel.text(x=0, y=0.9, labels=bquote(r == .(r)), 
                      cex=0.75, pos=4)
           panel.text(x=0, y=0.8, labels=bquote(r[c] == .(rc)), 
                      cex=0.75, pos=4)
       })

References

  • Lin, Lawrence I-Kuei. (1989). “A Concordance Correlation Coefficient to Evaluate Reproducibility.” Biometrics 45: 255–268.
  • Lin, Lawrence I-Kuei. (2000). “Correction: A Note on the Concordance Correlation Coefficient.” Biometrics 56: 324–325.
  • Wittenberg, Jason. (2011). “How Similar Are They? Rethinking Electoral Congruence.” Quality & Quantity.

R Tip: Avoid using T and F as synonyms for TRUE and FALSE

By default when you start R, T and F are defined as TRUE and FALSE. When I review other people’s code, I often see functions defined with arguments set to these values by default. This is a very bad idea. T and F are symbols that can be redefined, so it shouldn’t be assumed that they will always evaluate to TRUE and FALSE. Making that assumption can introduce bugs to the code that are very hard to track down.

For example, imagine you have defined a function to sample from a vector after transforming the data in some way:

my_sample <- function(x, size, rep=F) {
    x <- x^2  # a simple transform
    sample(x, size, replace=rep)
}

In a fresh R session, my_sample will work as intended because F is FALSE:

> my_sample(1:10, 8)
 [1]   4   1   9 100  49  16  36  81

But if you call this function from inside another function, or after hours of hacking at the console, this may not be the case. For instance, what if at some point in your R session you had redefined F, say with F <- 2?

> my_sample(1:10, 8)
 [1]   9  49 100   4  64  64  81  36

Notice the repeated values: with F redefined, sample() is now drawing with replacement. This type of bug is particularly dangerous because it is very hard to replicate and, worse yet, you may never even notice it. Luckily, such bugs are easy to avoid: never use T and F as synonyms for TRUE and FALSE. It doesn’t take much more effort to type out TRUE and FALSE, but if you find it onerous, get an editor with tab-completion.

Update

Via email, Ananda Mahto suggests adding the following bit of code to the top of your scripts to provide a warning when T and F have been redefined.

if (!identical(T, TRUE)) 
    stop("'T' has been remapped to '", T, "'")
if (!identical(F, FALSE)) 
    stop("'F' has been remapped to '", F, "'")

This doesn’t solve the problem, of course, but at least you get some warning.


Closures in R: A useful abstraction

People who have been using R for any length of time have probably become accustomed to passing functions as arguments to other functions. From my experience, however, people are much less likely to return functions from their own custom code. This is too bad because doing so can open up a whole new world of abstraction that can greatly decrease the quantity and complexity of the code necessary to complete certain types of tasks. Here I provide some brief examples of how R programmers can utilize lexical closures to encapsulate both data and methods.

To begin with a simple example, suppose you want a function that adds 2 to its argument. You would likely write something like this:

add_2 <- function(y) { 2 + y }

Which does exactly what you would expect:

> add_2(1:10)
 [1] 3 4 5 6 7 8 9 10 11 12

Now suppose you need another function that instead adds 7 to its argument. The natural thing to do would be to write another function, just like add_2, where the 2 is replaced with a 7. But this would be grossly inefficient: if in the future you discover that you made a mistake and you in fact need to multiply the values instead of add them, you would be forced to change the code in two places. In this trivial example, that may not be much trouble, but for more complicated projects, duplicating code is a recipe for disaster.

A better idea would be to write a function that takes one argument, x, that returns another function which adds its argument, y, to x. In other words, something like this:

add_x <- function(x) {
    function(y) { x + y }
}

Now, when you call add_x with an argument, you will get back a function that does exactly what you want:

add_2 <- add_x(2)
add_7 <- add_x(7)

> add_2(1:10)
 [1] 3 4 5 6 7 8 9 10 11 12
> add_7(1:10)
 [1] 8 9 10 11 12 13 14 15 16 17

So far, this doesn’t appear too earth-shattering. But if you look closely at the definition of add_x, you may notice something odd: how does the returned function know where to find x when it’s called at a later point?

It turns out that R is lexically scoped, meaning that functions carry with them a reference to the environment within which they were defined. In this case, when you call add_x, the x argument you provide gets attached to the environment of the returned function. In other words, in this simple example, you can think of R as replacing every instance of the x variable in the returned function with the value you specified when you called add_x.
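
If you want to see this for yourself, you can inspect the environment attached to each returned function (everything here is plain base R):

environment(add_2)            # the environment created by the call add_x(2)
get("x", environment(add_2))  # 2
get("x", environment(add_7))  # 7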

OK, so this may be a neat trick, but how can it be used more productively? For a slightly more complicated example, suppose you are performing some complex bootstrapping and, for efficiency, you pre-allocate container vectors to store the results. This is straightforward when you have just a single vector of results: all you have to do is remember to increment an index counter each time you add a result to the vector.

## (assuming data, nboot, and a pre-allocated bootmeans <- numeric(nboot)
##  have already been defined)
for (i in 1:nboot) {
  bootmeans[i] <- mean(sample(data, length(data), replace=TRUE))
}

> mean(data)
 [1] 0.0196
> mean(bootmeans)
 [1] 0.0188

But suppose you want to track several different statistics, each requiring you to keep track of a different index variable. If your bootstrapping routine is even a little bit complicated, this can be tedious and prone to error. By using closures, you can abstract away all of this bookkeeping. Here is a constructor function that wraps a pre-allocated container vector:

make_container <- function(n) {
    x <- numeric(n)
    i <- 1

    function(value=NULL) {
        if (is.null(value)) {
            return(x)
        }
        else {
            x[i] <<- value
            i <<- i + 1
        } 
    }
}

When you call make_container with an argument, it pre-allocates a numeric vector of the specified length, n, and returns a function that allows you to add data to that vector without having to worry about keeping track of an index. If you call that returned function with no argument (so that value is NULL), it returns the full vector.

bootmeans <- make_container(nboot)

for (i in 1:nboot)
 bootmeans(mean(sample(data, length(data), replace=TRUE)))

> mean(data)
 [1] 0.0196
> mean(bootmeans())
 [1] 0.0207

Here make_container is relatively simple, but it can be as complicated as you need. For instance, you may want to have the constructor function perform some expensive calculations that you would rather not do every time the function is called. In fact, this is what I have done in the boolean3 package to minimize the number of calculations done at every iteration of the optimization routine.
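
To give a flavor of that idea, here is a hypothetical sketch (not the actual boolean3 code): the constructor factors a covariance matrix once, and the returned function reuses the factorization on every call.

## Hypothetical example: do the expensive work (a Cholesky factorization)
## once, then reuse it for every draw.
make_mvn_sampler <- function(mu, Sigma) {
    L <- chol(Sigma)
    function(n) {
        z <- matrix(rnorm(n * length(mu)), nrow=n)
        sweep(z %*% L, 2, mu, "+")
    }
}

rmvn <- make_mvn_sampler(c(0, 0), diag(2))
draws <- rmvn(1000)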


Filtering a list with the Filter higher-order function

Last week markbulling over at Drunks & Lampposts posted a method of using sapply to filter a list by a predicate. Today the @RLangTip tip of the day was to use sapply similarly. This makes me wonder whether R’s very useful higher-order functions aren’t as well known as they should be. In this case, the Filter higher-order function would be the tool to use. Filter works more or less like the *apply family of functions, but it performs the subsetting (the filtering) of a list based on a predicate in a single step.

As an example, let’s say we have a list of 1000 vectors, each of length 2 with \(x_1,\,x_2 \in [0,\,1]\), and we want to select only those vectors whose elements sum to a value greater than 1. With Filter, this is all we have to do:

mylist <- lapply(1:1000, function(i) c(runif(1), runif(1)))
method.1 <- Filter(function(x) sum(x) > 1, mylist)

Which is at least a bit more transparent than the sapply alternative:

method.2 <- mylist[sapply(mylist, function(x) sum(x) > 1)]

In some very quick tests, I found no performance difference between the two approaches.

There are other useful higher-order functions. If you are interested, check out ?Filter.
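
For a taste of what else is available, here are a few of the others (Find, Position, Map, and Reduce) applied to the same mylist:

first_match <- Find(function(x) sum(x) > 1, mylist)      # first element that passes
match_index <- Position(function(x) sum(x) > 1, mylist)  # its index
elem_sums   <- Map(sum, mylist)                          # list of element sums
grand_total <- Reduce(`+`, elem_sums)                    # fold the sums together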


Announcing boolean3 (beta)

After entirely too long, I am happy to announce the beta release of boolean3, an R package for modeling causal complexity. The package can be downloaded at the following links:

(Please let me know if you have any trouble installing the Windows version. I didn’t have a Windows system handy when I built the package.)

The theoretical foundation for the package was developed by Bear Braumoeller in a 2003 Political Analysis piece:

Braumoeller, Bear F. (2003) 'Causal Complexity and the Study of Politics'. 
  Political Analysis 11(3): 209-233.

Additional theoretical content can be found at the Boolean Statistics homepage. Braumoeller and Carson have a 2011 paper titled “Political Irrelevance, Democracy, and the Limits of Militarized Conflict” in the Journal of Conflict Resolution that provides a useful example of the approach.

Summarizing from the boolean3 documentation and package description:

boolean3 provides a means of estimating partial-observability binary 
response models following boolean logic. boolean3 was developed by 
Jason W. Morgan under the direction of Bear Braumoeller with support from 
The Ohio State University's College of Social and Behavioral Sciences. The 
package represents a significant re-write of the original boolean 
implementation developed by Bear Braumoeller, Ben Goodrich, and Jacob Kline.

As is typically the case with “significant re-writes”, boolean3 breaks compatibility with previous versions (which can still be downloaded and installed from CRAN). Some of the many changes and enhancements include:

  • Removed dependence on Zelig.
  • Changed the method of specifying the model to be more consistent with the style of other R packages.
  • Improved performance. Many models can be estimated in 1/10th the time, or better, when compared to the original boolean package.
  • Added support for the optimizing methods available in the optimx package. This includes the ability to apply box constraints, which can be useful in maximizing boolean likelihoods.
  • Added multiprocessor/cluster support for bootstrapping using the snow package.
  • Genetic optimization using rgenoud also provides support for clustering with snow.
  • Plotting of predicted probabilities and likelihood profiles is now done with lattice.

Some things that are still missing from the package:

  • Documentation is still thin. There is enough in the help pages to get people started using the package, but more is needed.
  • Skewed logit (scobit) is not yet implemented. Most of the code has been written, but there are a few complications yet to be worked out.

If you are a current user of boolean, we’d love for you to try out boolean3. While a lot of bug-squashing has been done, there has been fairly little “real-world” testing. Bug reports and suggestions for changes and enhancements would be quite welcome, and can be emailed to Bear Braumoeller, me, or posted below in the comments. I will continue to post updates to the package as they are available.


Welford’s method for calculating the sample variance: An implementation in Scheme

John Cook has three entries up on his blog discussing the pitfalls of calculating the sample variance using the mathematical textbook definitions. He provides a Monte Carlo comparison of methods here, and a theoretical discussion here. He also provides a C++ implementation of Welford’s method here.

I recently started hacking on Scheme, writing a basic statistics library as a way to better understand the language (which is the most fun I have ever had programming). Below is a Scheme implementation of Welford’s method. For comparison, I have included a version that uses the textbook definition

\(
s^2 = \cfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.
\)

Here is the code:

;;; Textbook definition (`mean' and `sum' are defined elsewhere in the
;;; library)
(define (variance x)
  (let ((mu (mean x)))
    (/ (sum (f64vector-map
             (lambda (x) (expt (- x mu) 2)) x))
       (- (f64vector-length x) 1))))

;;; Welford's method: m carries the running mean and s the running sum of
;;; squared deviations; with zero-based index k, the updates are
;;; m_k = m + (x_k - m)/(k+1) and s_k = s + (x_k - m)(x_k - m_k).
(define (variance-welford x)
  (let loop ((n (f64vector-length x))
             (k 1)
             (m (f64vector-ref x 0))
             (s 0))
    (if (= k n)
        (/ s (fx- n 1))
        (let* ((xk (f64vector-ref x k))
               (mk (+ m (/ (- xk m) (fx+ k 1))))
               (sk (+ s (* (- xk m) (- xk mk)))))
          (loop n (fx+ k 1) mk sk)))))

I implemented these in Chicken Scheme, though the code should work in any Scheme that supports SRFI-4 homogeneous numeric vectors. The code could easily be modified to support standard Scheme vectors or lists.

I make no claims that this is the most efficient code, but it seems to work.
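
For comparison, the same recurrence translates directly into R (a quick sketch for illustration, not part of the Scheme library):

## Welford's method in R: m is the running mean, s the running sum of
## squared deviations.
variance_welford <- function(x) {
    m <- x[1]
    s <- 0
    for (k in 2:length(x)) {
        m_new <- m + (x[k] - m) / k
        s <- s + (x[k] - m) * (x[k] - m_new)
        m <- m_new
    }
    s / (length(x) - 1)
}

x <- rnorm(1000)
all.equal(variance_welford(x), var(x))  # TRUE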


Three free books for better programming in R (and any other language)

Like many users and producers of R packages, I have never had any formal training in computer science. I’ve come to the conclusion that this is a serious omission in a professional researcher’s training. Computer scientists and professional hackers have learned a lot about effective, efficient programming over the last five decades, and it’s past time academic researchers began to learn from this experience.

To that end, here are three books which I think all users of R could benefit from reading:

Yes, I realize all of these are books that use Lisp/Scheme, and I realize the structure of Lisp can be a bit off-putting ((at '(first))). But, unless you want to learn Lisp, you shouldn’t try to read these books with the goal of understanding every nuance of the syntax; rather, you should read them with an eye toward understanding how a programming language should work for and not against you. For instance, each of these books shows, in a step-by-step approach, how seemingly complex programs can be effectively broken down into simple procedures that anyone with some programming experience can understand. A great deal of emphasis in these books is placed on the important concepts of abstraction and non-duplication. If you’re like me and spend any significant amount of time going through various R packages, you’ll know what I mean when I say that the R community could benefit from less duplication and more abstraction in contributed packages (not to mention more comments!).


The performance cost of a for-loop, and some alternatives

I’ve recently been spending a lot of time running various simulations in R. Because I often use snow to perform simulations across several computers/cores, results typically come back in the form of a list object. Summarizing the results from a list is simple enough using a for-loop, but it’s much “sexier” to use a functional style of programming that takes advantage of higher order functions or the *apply-family of functions (R is, after all, a functional language at its core). But what’s the performance cost of using, for instance, lapply with a large list, particularly when the results from lapply have to be further manipulated to get them into a useful form? I recently did some performance testing on three of the *apply functions and compared them to an equivalent for-loop approach. The results follow.

First, I set up a large list with some fake data. Here I make a list, Sims, containing 30000 \(5 \times 5\) matrices. (This is a data structure similar to one I recently found myself analyzing.)

set.seed(12345)
S <- 30000
Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)

Now I define four functions that extract the fifth column of each matrix in Sims and create a \(30000 \times 5 \) matrix of results.

for_method <- function(S, Sims) {
  R <- matrix(NA, ncol=5, nrow=S)
  for (i in 1:S) R[i,] <- Sims[[i]][,5]
  R
}

sapply_method <- function(Sims) {
  t(sapply(Sims, function(x) x[,5]))
}

lapply_method <- function(Sims) {
  do.call(rbind, lapply(Sims, function(x) x[,5]))
}

vapply_method <- function(Sims) {
  t(vapply(Sims, function(x) x[,5], rep(0, 5)))
}

Notice that the for_method first creates a container matrix and inserts the results directly into it. Using something like rbind instead would force R to copy the matrix at each iteration of the loop, which is very inefficient. Also notice that in each of the apply methods, the result has to be manipulated before it is returned. It’s often possible to eliminate this step during the initial simulation process. But if you’re like me, you sometimes start your simulations before you realize what you intend to do with the results.
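
For reference, the rbind anti-pattern mentioned above would look something like this (shown only for illustration; it is not included in the timings):

rbind_method <- function(Sims) {
  R <- NULL
  for (i in seq_along(Sims)) R <- rbind(R, Sims[[i]][,5])
  R
}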

Now for some timing results:

> system.time(replicate(20, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  4.340   0.072   4.414
> system.time(replicate(20, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  3.452   0.076   3.528
> system.time(replicate(20, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  4.393   0.044   4.433
> system.time(replicate(20, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  2.588   0.040   2.628
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results surprised me; I never expected vapply to be so much quicker than the other three methods. This may have to do with the fact that, since vapply requires you to explicitly specify the resulting data structure, R is able to allocate the memory ahead of time.
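
As an aside, the FUN.VALUE template also doubles as a sanity check: vapply throws an error whenever a result doesn’t match the declared type and length. A quick illustration:

## Matching template: each result is a length-5 numeric, so this works.
ok <- vapply(Sims, function(x) x[,5], FUN.VALUE=rep(0, 5))

## Mismatched template: vapply stops with an error rather than silently
## returning the wrong shape.
try(vapply(Sims, function(x) x[,5], FUN.VALUE=numeric(1)))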

As a simple extension, I thought I would retest taking advantage of byte-compiled code:

library(compiler)

for_method <- cmpfun(for_method)
sapply_method <- cmpfun(sapply_method)
lapply_method <- cmpfun(lapply_method)
vapply_method <- cmpfun(vapply_method)

And here are the results:

> system.time(replicate(10, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  2.084   0.024   2.110
> system.time(replicate(10, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  1.624   0.016   1.642
> system.time(replicate(10, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  2.152   0.012   2.164
> system.time(replicate(10, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  1.204   0.004   1.207
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results are also surprising. I really thought byte-compiling the functions would make the results much closer (doesn’t each function compile down to what is effectively a for-loop?). However, in this simple example at least, that isn’t the case. vapply still has, by a significant margin, the best performance. sapply also performs noticeably better than lapply and the for-loop, which is an important result since sapply doesn’t require a pre-specified data structure. Assuming these results hold in the presence of more complicated data structures (and I don’t see why they wouldn’t), they suggest that a functional approach can provide benefits in terms of both coding efficiency and run-time performance.

I welcome any feedback or suggestions on how these tests could be improved or extended.

Update

A couple of things have come up in the comments. First, Henrik provided a more elegant (and efficient) version of the vapply_method:

vapply_method2 <- function(Sims) {
  t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
}

Second, several people have reported that the for_method on their system is just as fast as, if not faster than, the vapply_method when compiled. I have retested my results, adding Henrik's vapply_method2. Here are the results for compiled code (note that I have increased the number of replications):

> library(compiler)
> set.seed(12345)
> S <- 30000
> Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)
> 
> for_method <- function(S, Sims) {
+   R <- matrix(NA, ncol=5, nrow=S)
+   for (i in 1:S) R[i,] <- Sims[[i]][,5]
+   R
+ }
> 
> sapply_method <- function(Sims) {
+   t(sapply(Sims, function(x) x[,5]))
+ }
> 
> lapply_method <- function(Sims) {
+   do.call(rbind, lapply(Sims, function(x) x[,5]))
+ }
> 
> vapply_method <- function(Sims) {
+   t(vapply(Sims, function(x) x[,5], rep(0, 5)))
+ }
> 
> vapply_method2 <- function(Sims) {
+   t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
+ }
> for_method <- cmpfun(for_method)
> sapply_method <- cmpfun(sapply_method)
> lapply_method <- cmpfun(lapply_method)
> vapply_method <- cmpfun(vapply_method)
> vapply_method2 <- cmpfun(vapply_method2)
> system.time(replicate(50, R.1 <<- for_method(S, Sims)))
   user  system elapsed 
  6.052   0.136   6.191 
> system.time(replicate(50, R.2 <<- sapply_method(Sims)))
   user  system elapsed 
  8.361   0.116   8.474 
> system.time(replicate(50, R.3 <<- lapply_method(Sims)))
   user  system elapsed 
 10.548   0.144  10.692 
> system.time(replicate(50, R.4 <<- vapply_method(Sims)))
   user  system elapsed 
  5.813   0.124   5.936 
> system.time(replicate(50, R.5 <<- vapply_method2(Sims)))
   user  system elapsed 
  4.268   0.084   4.352

For whatever reason, the vapply methods remain the fastest on my system (particularly Henrik’s version). I also tested on another Linux machine and got the same results (I have not yet tested on Windows). Of course, this is a very simple example, so the performance differences may not hold up to the increasing complexity of the input function. I’d be interested to know if anyone has an idea as to why vapply would perform better in Linux than in Windows/Mac. A library thing?
