Filtering a list with the Filter higher-order function

Last week markbulling over at Drunks & Lampposts posted a method of using sapply to filter a list by a predicate. Today the @RLangTip tip of the day was to use sapply similarly. This made makes me wonder if R‘s very useful higher-order functions aren’t as well known as they should be. In this case, the Filter higher-order function would be the tool to use. Filter works more or less like the *apply family of functions, but it performs the subsetting (the filtering) of a list based on a predicate in a single step.

As an example, let’s say we have a list of 1000 vectors, each of length 2 with x_1,\,x_2 \in [0,\,1], and we want to select only those vectors where the elements of the list sum to a value greater than 1. With Filter, this is all we have to do:

mylist <- lapply(1:1000, function(i) c(runif(1), runif(1)))
method.1 <- Filter(function(x) sum(x) > 1, mylist)

Which is at least a bit more transparent than the sapply alternative:

method.2 <- mylist[sapply(mylist, function(x) sum(x) > 1)]

In some very quick tests, I found no performance difference between the two approaches.

There are other useful higher-order functions. If you are interested, check out ?Filter.

Posted in R | Leave a comment

Announcing boolean3 (beta)

After entirely too long, I am happy to announce the beta release of boolean3, an R package for modeling causal complexity. The package can be downloaded at the following links:

(Please let me know if you have any trouble installing the Windows version. I didn’t have a Windows system handy when I built the package.)

The theoretical foundation for the package was developed by Bear Braumoeller in a 2003 Political Analysis piece:

Braumoeller, Bear F. (2003) 'Causal Complexity and the Study of Politics'.
  Political Analysis 11(3): 209-233.

Additional theoretical content can be found at the Boolean Statistics homepage. Braumoeller and Carson have a 2011 paper titled “Political Irrelevance, Democracy, and the Limits of Militarized Conflict” in the Journal of Conflict Resolution that provides a useful example of the approach.

Summarizing from the boolean3 documentation and package description:

boolean3 provides a means of estimating partial-observability binary
response models following boolean logic. boolean3 was developed by
Jason W. Morgan under the direction of Bear Braumoeller with support from
The Ohio State University's College of Social and Behavioral Sciences. The
package represents a significant re-write of the original boolean
implementation developed by Bear Braumoeller, Ben Goodrich, and Jacob Kline.

As is typically the case with “significant re-writes”, boolean3 breaks compatibility with previous versions (which can still be downloaded and installed from CRAN). Some of the many changes and enhancements include:

  • Removed dependence on Zelig.
  • Changed the method of specifying the model to me more consistent with the style of other R packages.
  • Improved performance. Many models can be estimated in 1/10th the time, or better, when compared to the original boolean package.
  • Added support for the optimizing methods available in the optimx package. This includes the ability to apply box constraints, which can be useful in maximizing boolean likelihoods.
  • Added multiprocessor/cluster support for bootstrapping using the snow package.
  • Genetic optimization using rgenoud also provides support for clustering with snow.
  • Plotting of predicted probabilities and likelihood profiles is now done with lattice.

Some things that are still missing from the package:

  • Documentation is still thin. There is enough in the help pages to get people started using the package, but more is needed.
  • Skewed logit (scobit) is not yet implemented. Most of the code has been written, but there are a few complications yet to be worked out.

If you are a current user of boolean, we’d love for you to try out boolean3. While a lot of bug-squashing has been done, there has been fairly little “real-world” testing. Bug reports and suggestions for changes and enhancements would be quite welcome, and can be emailed to Bear Braumoeller, me, or posted below in the comments. I will continue to post updates to the package as they are available.

Posted in Code, R | Leave a comment

Welford’s method for calculating the sample variance: An implementation in Scheme

John Cook has three entries up on his blog discussing the pitfalls of calculating the sample variance using the mathematical textbook definitions. He provides a Monte Carlo comparison of methods here, and a theoretical discussion here. He also provides a C++ implementation of Welford’s method here.

I recently started hacking on Scheme, writing a basic statistics library as a way to better understand the language (which is the most fun I have ever had programming). Below is a Scheme implementation of Welford’s method. For comparison, I have included a version that uses the textbook definition

  s^2 = \cfrac{1}{n-1} \sum_{i=0}^n (x_i - \bar{x})^2.

 

Here is the code:

;;; Textbook definition
(define (variance x)
  (let ((mu (mean x)))
    (/ (sum (f64vector-map
             (lambda (x) (expt (- x mu) 2)) x))
        (- (f64vector-length x) 1))))

;;; Welford's method
(define (variance-welford x)
  (let loop ((n (f64vector-length x))
             (k 1)
             (m (f64vector-ref x 0))
             (s 0))
    (if (= k n)
        (/ s (fx- n 1))
        (let* ((xk (f64vector-ref x k))
               (mk (+ m (/ (- xk m) (fx+ k 1))))
               (sk (+ s (* (- xk m) (- xk mk)))))
          (loop n (fx+ k 1) mk sk)))))

I implemented these in Chicken Scheme, though the code should work in any Scheme that supports SRFI-4 homogeneous numeric vectors. The code could easily be modified to support standard Scheme vectors or lists.

I make no claims that this is the most efficient code, but it seems to work.

Posted in Code, Data analysis, Scheme | Leave a comment

Three free books for better programming in R (and any other language)

Like many users and producers of R packages, I have never had any formal training in computer science. I’ve come to to the conclusion that this is a serious omission in a professional researcher’s training. Computer scientists and professional hackers have learned a lot about effective, efficient programming over the last five decades and it’s past time academic researchers begin to learn from this experience.

To that end, here are three books which I think all users of R could benefit from reading:

Yes, I realize all of these are books that use Lisp/Scheme, and I realize the structure of Lisp can be a bit off-putting ((at '(first))). But, unless you want to learn Lisp, you shouldn’t try to read these books with the goal of understanding ever nuance of the syntax; rather, you should read them with an eye toward understand how a programming language should work for and not against you. For instance, each of these books shows, in a step-by-step approach, how seemingly complex programs can be effectively broken down into simple procedures that anyone with some programming experience can understand. A great deal of emphasis in these books is the important concepts of abstraction and non-duplication. If you’re like me and spend any significant amount of time going through various R packages, you’ll know what I mean when I say that the R community could benefit from less duplication and more abstraction in contributed packages (not to mention more comments!).

Posted in Hacking, R | 4 Comments

The performance cost of a for-loop, and some alternatives

I’ve recently been spending a lot of time running various simulations in R. Because I often use snow to perform simulations across several computers/cores, results typically come back in the form of a list object. Summarizing the results from a list is simple enough using a for-loop, but it’s much “sexier” to use a functional style of programming that takes advantage of higher order functions or the *apply-family of functions (R is, after all, a functional language at its core). But what’s the performance cost of using, for instance, lapply with a large list, particularly when the results from lapply have to be further manipulated to get them into a useful form? I recently did some performance testing on three of the *apply functions and compared them to an equivalent for-loop approach. The results follow.

First, I set up a large list with some fake data. Here I make a list, Sims, containing 30000 5 \times 5 matrices. (This is a data structure similar to one I recently found myself analyzing.)

set.seed(12345)
S <- 30000
Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)

Now I define four functions that extract the fifth column of each matrix in Sims and create a 30000 \times 5 matrix of results.

for_method <- function(S, Sims) {
  R <- matrix(NA, ncol=5, nrow=S)
  for (i in 1:S) R[i,] <- Sims[[i]][,5]
  R
}

sapply_method <- function(Sims) {
  t(sapply(Sims, function(x) x[,5]))
}

lapply_method <- function(Sims) {
  do.call(rbind, lapply(Sims, function(x) x[,5]))
}

vapply_method <- function(Sims) {
  t(vapply(Sims, function(x) x[,5], rep(0, 5)))
}

Notice that the for_method first creates a container matrix and inserts the results directly into it. Using something like rbind would force R to copy the matrix in each for-iteration, which is very inefficient. Also notice that in each of the apply methods, the result has to be manipulated before it is returned. It’s often possible to eliminate this step during the initial simulation process. But if you’re like me, you sometimes start your simulations before you realize what you intend on doing with the results.

Now for some timing results:

> system.time(replicate(20, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  4.340   0.072   4.414
> system.time(replicate(20, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  3.452   0.076   3.528
> system.time(replicate(20, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  4.393   0.044   4.433
> system.time(replicate(20, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  2.588   0.040   2.628
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results surprised me—I never expected vapply to be so much quicker than the other three methods. This may have to do with that fact that, since vapply requires you to explicitly specify the resultant data structure, R is able allocate the memory ahead of time.

As a simple extension, I thought I would retest taking advantage of byte-compiled code:

library(compiler)

for_method <- cmpfun(for_method)
sapply_method <- cmpfun(sapply_method)
lapply_method <- cmpfun(lapply_method)
vapply_method <- cmpfun(vapply_method)

And here are the results:

> system.time(replicate(10, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  2.084   0.024   2.110
> system.time(replicate(10, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  1.624   0.016   1.642
> system.time(replicate(10, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  2.152   0.012   2.164
> system.time(replicate(10, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  1.204   0.004   1.207
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results are also surprising. I really thought byte-compiling the functions would make the results much closer (doesn’t each function compile into what is effectively a for-loop?). However, in this simple example at least, that isn’t the case. vapply still has, by a significant margin, the best performance. sapply also performs noticeably better than lapply and the for-loop, which is an important result since sapply doesn’t require a pre-specified data structure. Assuming they hold in the presence of more complicated data structures (and I don’t see what they wouldn’t), these results seem to suggest that a functional approach can provide benefits in terms of coding efficiency and run-time performance.

I welcome any feedback or suggestions on how these tests could be improved or extended.

Update

A couple things have come up in the comments. First, Henrik provided a more elegant (and efficient) version of vapply_method:

vapply_method2 <- function(Sims) {
  t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
}

Second, several people have reported that the for_method on their system is just as fast, if not faster, than the vapply_method when compiled. I have retested my results, adding Henrik's vapply_method2. Here are the results for compiled code (note that I have increased the number of replications):

> library(compiler)
> set.seed(12345)
> S <- 30000
> Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)
>
> for_method <- function(S, Sims) {
+   R <- matrix(NA, ncol=5, nrow=S)
+   for (i in 1:S) R[i,] <- Sims[[i]][,5]
+   R
+ }
>
> sapply_method <- function(Sims) {
+   t(sapply(Sims, function(x) x[,5]))
+ }
>
> lapply_method <- function(Sims) {
+   do.call(rbind, lapply(Sims, function(x) x[,5]))
+ }
>
> vapply_method <- function(Sims) {
+   t(vapply(Sims, function(x) x[,5], rep(0, 5)))
+ }
>
> vapply_method2 <- function(Sims) {
+   t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
+ }
> for_method <- cmpfun(for_method)
> sapply_method <- cmpfun(sapply_method)
> lapply_method <- cmpfun(lapply_method)
> vapply_method <- cmpfun(vapply_method)
> vapply_method2 <- cmpfun(vapply_method2)
> system.time(replicate(50, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  6.052   0.136   6.191
> system.time(replicate(50, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  8.361   0.116   8.474
> system.time(replicate(50, R.3 <<- lapply_method(Sims)))
   user  system elapsed
 10.548   0.144  10.692
> system.time(replicate(50, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  5.813   0.124   5.936
> system.time(replicate(50, R.5 <<- vapply_method2(Sims)))
   user  system elapsed
  4.268   0.084   4.352

For whatever reason, the vapply methods remain the fastest on my system (particularly Henrik’s version). I also tested on another Linux machine and got the same results (I have not yet tested on Windows). Of course, this is a very simple example, so the performance differences may not hold up to the increasing complexity of the input function. I’d be interested to know if anyone has an idea as to why vapply would perform better in Linux than in Windows/Mac. A library thing?

Posted in R | 10 Comments

Emacs-fu: The place to go for useful Emacs tips

After 5 years or so of using Emacs almost exclusively as my “text editor”, somehow I just ran across Emacs-fu for the first time today. It’s an incredibly useful site with numerous Emacs tips and tricks. If you are an Emacs user, you should check the site out if you don’t already know about it. And if you aren’t an Emacs user, you should still check out the site to see what your editor is missing.

Posted in Emacs, Hacking | Leave a comment

Stronger instruments by design: Polmeth 2011 poster

I just got back from Polmeth 2011 at Princeton, where I presented a poster for a project Luke Keele and I are working on. It deals with the problem of statistical inference in the presence of a weak instrument and the non-random assignment of the instrument. We employ a technique developed by Baoicchi, Small, Lorch, & Rosenbaum in a recent JASA article to effectively strengthen the instrument, thus increasing the robustness of effect estimates to deviations from random assignment.

I think the poster turned out pretty well and the feedback I received will be very useful as we move forward with project. I have included a link to the poster below the draft abstract. We hope to complete the draft paper in the next few months, but we would appreciate any comments people may have.

There is growing interest in natural experiments in political science. Natural experiments are often analyzed with instrumental variable estimators reflecting a belief that combining the power of natural random assignment with an instrumental variable approach will solve many of the research design problems endemic to social science. Here, we highlight how weak instruments can interact with the assumption of random assignment of the instrument. When the instrument is not randomly assigned, weak instruments produce bias that is not alleviated by additional data. We demonstrate how matching combined with a reverse caliper can be used to strengthen an instrument within a subset of the overall study. We start by presenting an alternative non-parametric instrumental variable estimator first proposed by Rosenbaum (1996) that allows us to combine matching with an IV estimator. Unlike the standard 2SLS IV estimator, this non-parametric approach provides accurate confidence intervals and consistent causal estimates even when the instrument is weak. A further advantage of this non-parametric method is the opportunity it provides to probe the random assignment assumption with a sensitivity test. We provide substantive examples of the proposed approach with a reevaluation of a recent paper that uses rainfall as an instrument for voter turnout in US counties Hansford & Gomez (2010).

The Polmeth 2011 poster: Stronger Instruments by Design.

Posted in Causal inference, Instrumental variables, Matching | Leave a comment

Code: mtable-ext updated

I have fixed a small bug in mtable-ext that prevented asterisks from being printed for negative coefficients in mixed effects models output by lme4. Thanks to Reinhold Kliegl and Martin Elff for pointing out the bug and for providing the fix. The updated code can be found below. However, I do want to warn users that the way the code calculates p-values is not strictly correct. For more information, see R FAQ 7.35; Doug Bates’ mailing list post and the ensuing discussion linked from the FAQ is a must-read for anyone using lme4 or estimating mixed effects models more generally.

Updated version: mtable-ext-20110621

Posted in Code, LaTeX, R | 1 Comment

Principal component analysis: some introductory resources

Principal component analysis (along with exploratory and confirmatory factor analysis) isn’t something we were exposed to in our Political Science methods sequence here at OSU. However, because they seem to come up from time to time, it’s probably a good idea to get some exposure to the basic theory underlying them. At a minimum, it seems that PCA would be valuable in a preliminary exploratory data analysis. I ran across a thread on Cross Validated today that provides a useful discussion as well as links to some valuable resources. Below are some links to the Cross Validated thread as well as a few of the resources mentioned there. As I run across more resources, I will be sure to add them to this list.

Posted in Data analysis | Leave a comment

A simple frequency plot

I’m currently working on a paper that uses Polish survey data (EVS 2008). I am specifically looking at regional variation in particular responses. Because there are only around 1800 observations in the survey, which are split across 66 subregions of Poland (NUTS-3, specifically), I suspected there would be a large degree of variation in how these interviews were distributed across regions. Typically, I would just use densityplot from the lattice package to get some idea of how a continuous variable is distributed. Of course, with discrete data, table would work well when the variable only takes on few possible values. When the variable can take on a larger number of values, barchart (also from lattice) may also work. However, none of these seemed to provide the type of information I wanted. densityplot obscured the distribution of the data, there were too many categories for table to be all that useful, and I found the barchart to be ugly and not that informative. What I came up with was the following (click the image to get the PDF version):

This is a simple variation of a frequency plot (I like simple plots), but I found it to be much more informative than the alternatives. I hope it’s obvious that each dot represents a NUTS-3 region in Poland, the x-axis, as the label states, is the number of interviews conducted in each region. The function I used to create this plot is as follows:

distplot <- function(x, ...) {
   d <- table(x)
   d <- do.call(rbind, tapply(d, d, function(x) cbind(x, 1:length(x))))
   xyplot(d[,2] ~ d[,1], ...)
}

And this is how it was called:

distplot(Data$subreg.id, ylab = NULL,
   xlab = "Number of interviews conducted",
   scales = list(x = list(at = seq(0, 80, 5)),
   y = list(at = seq(0, 20, 5))), col = red[7], pch = 16,
   ylim = c(0,15))

It is quite possible (even likely) that there is a more elegant way to produce a similar plot—there may even be a built-in function somewhere. But sometimes it’s just quicker to code something yourself than spending a bunch of time looking for “a better way”.

Posted in Data analysis, Graphics, R | 1 Comment