`boolean3` has been accepted to CRAN. You can find it here: boolean3 on CRAN. To summarize:

`boolean3` provides a means of estimating partial-observability binary response models following boolean logic. `boolean3` was developed by Jason W. Morgan under the direction of Bear Braumoeller with support from The Ohio State University's College of Social and Behavioral Sciences. The package represents a significant re-write of the original boolean implementation developed by Bear Braumoeller, Ben Goodrich, and Jacob Kline.

For a more detailed overview of the package, see my previous post.

The full abstract:

This paper presents a method of estimating political actors’ positions in a latent social space from dynamic network data. The proposed latent path model is a natural extension of the latent space model for static networks (Hoff, Raftery, and Handcock 2002), allowing political actors’ locations in the social space to shift over time to reflect changes in the structure of ties in the network. Unlike the alternative models of Sarkar and Moore (2005) and Ward, Ahlquist, and Rozenas (2013), the proposed model does not require a Markov assumption, and a lag structure need not be specified. The proposed model is validated through Monte Carlo simulations and applied to Fowler’s Senate cosponsorship data (reported below) and Polish party switching (forthcoming).

The Polmeth 2013 poster: Latent Path Model for Social Networks.

In its current form, the model is estimated with maximum likelihood. However, I ran into some convergence problems with larger networks, such as the Polish party switching data mentioned in the abstract. I have begun testing the model in Stan. Preliminary results are encouraging, though I am still wrestling with issues of identification.

- Hoff, Peter D., Adrian E. Raftery, and Mark S. Handcock. (2002). “Latent Space Approaches to Social Network Analysis”. JASA 97(460): 1090–1098.
- Sarkar, Purnamrita and Andrew W. Moore. (2005). “Dynamic Social Network Analysis using Latent Space Models”. SIGKDD Explorations 7(2): 31–40.
- Ward, Michael D., John S. Ahlquist, and Arturas Rozenas. (2013). “Gravity’s Rainbow: A Dynamic Latent Space Model for the World Trade Network”. Network Science 1(1): 95–118.

I recently ran into such a situation when looking at the persistence of municipal-level vote shares over the last four Polish parliamentary elections. The plot below presents an example of some of the data I was looking at. It shows the share of the vote going to Samoobrona (Self-Defense of the Republic of Poland) by municipality in 2005 (horizontal axis) and 2007 (vertical axis). The linear regression line is shown in red and the 45-degree line is the dashed grey line.

From the plot it is clear that support for SRP collapsed between 2005 and 2007: had SRP’s performance held steady across the two elections, we would expect the data to cluster around the 45-degree line. Instead, in all but one or two municipalities, support was lower in 2007 than 2005. Accordingly, the data fall below the 45-degree line.

However, if I had ignored the plot and used Pearson’s \(r\) as the primary measure of persistence in vote share across years — something that is not uncommon in the study of party politics — I would have been led to believe that there was, in fact, a high degree of persistence between the two elections. As reported in the plot, \(r=0.72\), which certainly seems to be good evidence of a high degree of persistence in vote shares between 2005 and 2007. So, what’s going on?

Jason Wittenberg explains in a recent paper why Pearson’s \(r\) is not an appropriate measure in this case. The problem is the fact that Pearson’s \(r\) is a measure of linearity, not persistence — it captures how far from a linear regression line the data fall (the red line in the above plot). But what we really want is a measure of how far the data are *from the 45-degree line*. For this, Wittenberg suggests using Lin’s concordance correlation coefficient, \(r_c\). From the Wikipedia article, Lin’s \(r_c\) is defined as follows:

\(r_c = \cfrac{2s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}.\)

The relationship between \(r\) and \(r_c\) is straightforward. From the original paper:

- \(-1 \leq -|r| \leq r_c \leq |r| \leq 1\);
- \(r_c = 0\) if and only if \(r=0\);
- \(r_c = r\) if and only if \(s_x = s_y\); and
- \(r_c = \pm 1\) if and only if \(r = \pm 1\), \(s_x = s_y\), and \(\bar{x} = \bar{y}\).

(These come from Lin’s 1989 paper, p. 258, cited below. Note I have left out a couple equivalent conditions from (iv) and have changed the notation to match the definition of \(r_c\).)
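To make the definition concrete, here is a minimal sketch in R (the function name `ccc` is mine) that computes \(r_c\) directly from the formula above, for data that are perfectly linear but far from the 45-degree line:

```r
## Lin's r_c computed directly from the definition above
## (using the biased, 1/n, variance and covariance terms, as in Lin 1989).
ccc <- function(x, y) {
  n   <- length(x)
  sxy <- sum((x - mean(x)) * (y - mean(y))) / n
  sx2 <- sum((x - mean(x))^2) / n
  sy2 <- sum((y - mean(y))^2) / n
  2 * sxy / (sx2 + sy2 + (mean(x) - mean(y))^2)
}

x <- seq(0, 1, length.out = 50)
y <- 0.2 * x    # perfectly linear, but far below the 45-degree line
cor(x, y)       # Pearson's r = 1
ccc(x, y)       # r_c is roughly 0.14: very little concordance
```

On real data, a function like this should agree with `epi.ccc` from the `epiR` package.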

As Wittenberg notes, condition (i) is potentially quite problematic for studies of electoral persistence. It says that the magnitude of \(r_c\) is always less than or equal to the magnitude of \(r\). In other words, Pearson’s \(r\) (almost) always inflates the amount of persistence present in the data. Evidence of this is seen in the next plot, which compares the electoral performance of each major party in the 2005 and 2007 Polish parliamentary elections.

Clearly, Pearson’s \(r\) and Lin’s \(r_c\) point to significantly different interpretations of the data when it comes to the issue of electoral persistence. As expected, the magnitude of \(r_c\) is always lower than that of \(r\) — sometimes dramatically lower. In the case of SRP, for instance, \(r_c = 0.06\) compared to the \(r = 0.72\) reported above. Also note that, as expected given the conditions above, \(r_c\) is closest to \(r\) when the regression line is near the 45-degree line.

So, what’s the take-away? First, Wittenberg is right in calling into question the use of Pearson’s \(r\) as a measure of electoral persistence (anyone know a better measure?). It simply doesn’t properly capture the concept. Second, always look at your data. In this case, I had never heard of Lin’s concordance correlation coefficient until I went looking for some way to measure the discordance I was seeing in the data. Had I just constructed a table of correlations, I might have believed that there was a great deal more persistence in electoral vote shares than actually existed.

Lin’s \(r_c\) can be calculated with the `epi.ccc` function in the `epiR` package. For those who may care, here is the code I used to make the second plot above:

xyplot(vote_share.2007 ~ vote_share.2005 | factor(party),
       data=PartyW[party %!in% c("Other", "RP")],
       as.table=TRUE, between=list(x=1, y=1), aspect=1,
       col=grey[5], cex=0.75, xlab="2005", ylab="2007",
       xlim=c(-0.1, 1), ylim=c(-0.1, 1),
       strip=strip.custom(bg=grey[3]),
       panel=function(x, y, ...) {
           r  <- round(cor(x, y, use="pairwise.complete.obs"), 2)
           rc <- round(epi.ccc(x, y)[["rho.c"]][1, 1], 2)
           panel.xyplot(x, y, ...)
           panel.abline(a=0, b=1, lty=2, col=grey[4], lwd=2)
           panel.abline(lm(y ~ x), col=red[6], lwd=2)
           panel.text(x=0, y=0.9, labels=bquote(r == .(r)), cex=0.75, pos=4)
           panel.text(x=0, y=0.8, labels=bquote(r[c] == .(rc)), cex=0.75, pos=4)
       })

- Lin, Lawrence I-Kuei. (1989). “A Concordance Correlation Coefficient to Evaluate Reproducibility.” *Biometrics* 45: 255–268.
- Lin, Lawrence I-Kuei. (2000). “Correction: A Note on the Concordance Correlation Coefficient.” *Biometrics* 56: 324–325.
- Wittenberg, Jason. (Oct. 2011). “How Similar Are They? Rethinking Electoral Congruence.” *Quality & Quantity*.

`T` and `F` are defined as `TRUE` and `FALSE`. When I review other people’s code, I often see functions defined with arguments set to these values by default. But `T` and `F` are symbols that can be redefined, so it shouldn’t be assumed that they will always evaluate to `TRUE` and `FALSE`. Making that assumption can introduce bugs that are very hard to track down.

For example, imagine you have defined a function to sample from a vector after transforming the data in some way:

my_sample <- function(x, size, rep=F) {
    x <- x^2  # a simple transform
    sample(x, size, replace=rep)
}

When you just start **R**, `my_sample` will work as intended because `F` is `FALSE`:

> my_sample(1:10, 8)
[1]   4   1   9 100  49  16  36  81

But if you call this function from inside another function, or after hours of hacking at the console, this may not be the case. For instance, what if at some point in your **R** session you redefined `F <- 2`:

> my_sample(1:10, 8)
[1]  9 49 100  4 64 64 81 36

This type of bug can be very dangerous because it is very hard to replicate and, worse yet, you may never even notice you have it. Luckily, such bugs are easy to avoid: *never use* `T` *and* `F` *as synonyms for* `TRUE` *and* `FALSE`. It doesn’t take much more effort to type out `TRUE` and `FALSE`, but if you find it onerous, get an editor with tab-completion.

Via email, Ananda Mahto suggests adding the following bit of code to the top of your scripts to provide a warning when `T` and `F` have been redefined.

if (!identical(T, TRUE)) stop("'T' has been remapped to '", T, "'")
if (!identical(F, FALSE)) stop("'F' has been remapped to '", F, "'")

This doesn’t solve the problem, of course, but at least you get some warning.

To begin with a simple example, suppose you want a function that adds `2` to its argument. You would likely write something like this:

add_2 <- function(y) {
    2 + y
}

Which does exactly what you would expect:

> add_2(1:10)
 [1]  3  4  5  6  7  8  9 10 11 12

Now suppose you need another function that instead adds `7` to its argument. The natural thing to do would be to write another function, just like `add_2`, where the `2` is replaced with a `7`. But this would be grossly inefficient: if in the future you discover that you made a mistake and you in fact need to *multiply* the values instead of adding them, you would be forced to change the code in two places. In this trivial example, that may not be much trouble, but for more complicated projects, duplicating code is a recipe for disaster.

A better idea would be to write a function that takes one argument, `x`, and returns another function which adds its argument, `y`, to `x`. In other words, something like this:

add_x <- function(x) {
    function(y) {
        x + y
    }
}

Now, when you call `add_x` with an argument, you will get back a function that does exactly what you want:

add_2 <- add_x(2)
add_7 <- add_x(7)

> add_2(1:10)
 [1]  3  4  5  6  7  8  9 10 11 12
> add_7(1:10)
 [1]  8  9 10 11 12 13 14 15 16 17

So far, this doesn’t appear too earth-shattering. But if you look closely at the definition of `add_x`, you may notice something odd: how does the returned function know where to find `x` when it’s called at a later point?

It turns out that **R** is lexically scoped, meaning that functions carry with them a reference to the environment within which they were defined. In this case, when you call `add_x`, the `x` argument you provide gets attached to the environment of the returned function. In other words, in this simple example, you can think of **R** as just replacing all instances of the `x` variable in the function to be returned with the value you specify when you call `add_x`.
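You can inspect this captured environment directly. A quick diagnostic sketch, reusing the `add_x` definition from above:

```r
add_x <- function(x) {
  function(y) {
    x + y
  }
}

add_2 <- add_x(2)
add_7 <- add_x(7)

## Each returned function carries its own enclosing environment,
## each with its own binding for x:
get("x", envir = environment(add_2))   # 2
get("x", envir = environment(add_7))   # 7
```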

Ok, so this may be a neat trick, but how can it be used more productively? For a slightly more complicated example, suppose you are performing some complex bootstrapping and, for efficiency, you pre-allocate container vectors to store the results. This is straightforward when you have just a single vector of results—all you have to do is remember to increment an index counter each time you add a result to the vector.

for (i in 1:nboot) {
    bootmeans[i] <- mean(sample(data, length(data), replace=TRUE))
}

> mean(data)
[1] 0.0196
> mean(bootmeans)
[1] 0.0188

But suppose you want to track several different statistics, each requiring you to keep track of a different index variable. If your bootstrapping routine is even a little bit complicated, this can be tedious and prone to error. By using closures, you can abstract away all of this bookkeeping. Here is a constructor function that wraps a pre-allocated container vector:

make_container <- function(n) {
    x <- numeric(n)
    i <- 1

    function(value=NULL) {
        if (is.null(value)) {
            return(x)
        } else {
            x[i] <<- value
            i <<- i + 1
        }
    }
}

When you call `make_container` with an argument, it pre-allocates a numeric vector of the specified length, `n`, and returns a function that allows you to add data to that vector without having to worry about keeping track of an index. If the argument to that returned function is `NULL`, the full vector is returned.

bootmeans <- make_container(nboot)
for (i in 1:nboot)
    bootmeans(mean(sample(data, length(data), replace=TRUE)))

> mean(data)
[1] 0.0196
> mean(bootmeans())
[1] 0.0207

Here `make_container` is relatively simple, but it can be as complicated as you need. For instance, you may want to have the constructor function perform some expensive calculations that you would rather not do every time the function is called. In fact, this is what I have done in the boolean3 package to minimize the number of calculations done at every iteration of the optimization routine.

Today the @RLangTip tip of the day was to use `sapply` to filter a list by a predicate. This makes me wonder whether the `Filter` higher-order function would be the better tool to use. `Filter` works more or less like the `*apply` family of functions, but it performs the subsetting (the filtering) of a list based on a predicate in a single step.

As an example, let’s say we have a list of 1000 vectors, each of length 2 with \(x_1,\,x_2 \in [0,\,1]\), and we want to select only those vectors whose elements sum to a value greater than 1. With `Filter`, this is all we have to do:

mylist <- lapply(1:1000, function(i) c(runif(1), runif(1)))
method.1 <- Filter(function(x) sum(x) > 1, mylist)

Which is at least a bit more transparent than the `sapply` alternative:

method.2 <- mylist[sapply(mylist, function(x) sum(x) > 1)]

In some very quick tests, I found no performance difference between the two approaches.
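For the curious, here is a rough version of the kind of quick test I mean (the helper names are mine; exact timings will vary by machine):

```r
set.seed(1)
mylist <- lapply(1:1000, function(i) c(runif(1), runif(1)))

filter_way <- function() Filter(function(x) sum(x) > 1, mylist)
sapply_way <- function() mylist[sapply(mylist, function(x) sum(x) > 1)]

## Both approaches select exactly the same elements:
identical(filter_way(), sapply_way())   # TRUE

system.time(replicate(100, filter_way()))
system.time(replicate(100, sapply_way()))
```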

There are other useful higher-order functions. If you are interested, check out `?Filter`.
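That help page also documents `Reduce`, `Find`, `Map`, and `Position`, which cover many common list-manipulation patterns. A few quick examples:

```r
## Reduce folds a binary function over a list:
Reduce(`+`, list(1, 2, 3, 4))               # 10

## Position gives the index of the first element matching a predicate:
Position(function(x) x > 2, c(1, 5, 3))     # 2

## Find returns the first matching element itself:
Find(function(x) x > 2, c(1, 5, 3))         # 5
```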

`boolean3` can be downloaded here:

- Unix/Linux: boolean3_3.0.20.tar.gz
- Windows: boolean3_3.0.20.zip

(Please let me know if you have any trouble installing the Windows version. I didn’t have a Windows system handy when I built the package.)

The theoretical foundation for the package was developed by Bear Braumoeller in a 2003 *Political Analysis* piece:

Braumoeller, Bear F. (2003). “Causal Complexity and the Study of Politics.” *Political Analysis* 11(3): 209–233.

Additional theoretical content can be found at the Boolean Statistics homepage. Braumoeller and Carson have a 2011 paper titled “Political Irrelevance, Democracy, and the Limits of Militarized Conflict” in the *Journal of Conflict Resolution* that provides a useful example of the approach.

Summarizing from the `boolean3` documentation and package description:

`boolean3` provides a means of estimating partial-observability binary response models following boolean logic. `boolean3` was developed by Jason W. Morgan under the direction of Bear Braumoeller with support from The Ohio State University's College of Social and Behavioral Sciences. The package represents a significant re-write of the original boolean implementation developed by Bear Braumoeller, Ben Goodrich, and Jacob Kline.

As is typically the case with “significant re-writes”, `boolean3` breaks compatibility with previous versions (which can still be downloaded and installed from CRAN). Some of the many changes and enhancements include:

- Removed dependence on Zelig.
- Changed the method of specifying the model to be more consistent with the style of other **R** packages.
- Improved performance. Many models can be estimated in 1/10th the time, or better, when compared to the original `boolean` package.
- Added support for the optimization methods available in the optimx package. This includes the ability to apply box constraints, which can be useful in maximizing boolean likelihoods.
- Added multiprocessor/cluster support for bootstrapping using the snow package.
- Genetic optimization using rgenoud also provides support for clustering with `snow`.
- Plotting of predicted probabilities and likelihood profiles is now done with lattice.

Some things that are still missing from the package:

- Documentation is still thin. There is enough in the help pages to get people started using the package, but more is needed.
- Skewed logit (scobit) is not yet implemented. Most of the code has been written, but there are a few complications yet to be worked out.

If you are a current user of `boolean`, we’d love for you to try out `boolean3`. While a lot of bug-squashing has been done, there has been fairly little “real-world” testing. Bug reports and suggestions for changes and enhancements would be quite welcome, and can be emailed to Bear Braumoeller, me, or posted below in the comments. I will continue to post updates to the package as they are available.

I recently started hacking on Scheme, writing a basic statistics library as a way to better understand the language (which is the most fun I have ever had programming). Below is a Scheme implementation of Welford’s method. For comparison, I have included a version that uses the textbook definition

\(s^2 = \cfrac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2.\)

Here is the code:

;;; Textbook definition
(define (variance x)
  (let ((mu (mean x)))
    (/ (sum (f64vector-map (lambda (x) (expt (- x mu) 2)) x))
       (- (f64vector-length x) 1))))

;;; Welford's method
(define (variance-welford x)
  (let loop ((n (f64vector-length x))
             (k 1)
             (m (f64vector-ref x 0))
             (s 0))
    (if (= k n)
        (/ s (fx- n 1))
        (let* ((xk (f64vector-ref x k))
               (mk (+ m (/ (- xk m) (fx+ k 1))))
               (sk (+ s (* (- xk m) (- xk mk)))))
          (loop n (fx+ k 1) mk sk)))))

I implemented these in Chicken Scheme, though the code should work in any Scheme that supports SRFI-4 homogeneous numeric vectors. The code could easily be modified to support standard Scheme vectors or lists.

I make no claims that this is the most efficient code, but it seems to work.
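For readers who would rather stay in R, here is my rough translation of the same single-pass algorithm (not part of the Scheme library above):

```r
## Welford's method in R: one pass over the data, avoiding the
## catastrophic cancellation that naive sum(x^2) accumulation can suffer.
variance_welford <- function(x) {
  m <- x[1]   # running mean
  s <- 0      # running sum of squared deviations
  for (k in 2:length(x)) {
    m_new <- m + (x[k] - m) / k
    s     <- s + (x[k] - m) * (x[k] - m_new)
    m     <- m_new
  }
  s / (length(x) - 1)
}

x <- rnorm(1000)
all.equal(variance_welford(x), var(x))   # TRUE
```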

To that end, here are three books which I think all users of R could benefit from reading:

- *Practical Common Lisp* by Peter Seibel. Free online version.
- *Structure and Interpretation of Computer Programs* (SICP) by Harold Abelson, Gerald Jay Sussman, and Julie Sussman. Free online version.
- *On Lisp* by Paul Graham. Free PDF version.

Yes, I realize all of these are books that use Lisp/Scheme, and I realize the structure of Lisp can be a bit off-putting `((at '(first)))`. But, unless you want to learn Lisp, you shouldn’t try to read these books with the goal of understanding every nuance of the syntax; rather, you should read them with an eye toward understanding how a programming language should work for and not against you. For instance, each of these books shows, in a step-by-step approach, how seemingly complex programs can be effectively broken down into simple procedures that anyone with some programming experience can understand. A great deal of emphasis in these books is placed on the important concepts of abstraction and non-duplication. If you’re like me and spend any significant amount of time going through various **R** packages, you’ll know what I mean when I say that the **R** community could benefit from less duplication and more abstraction in contributed packages (not to mention more comments!).

You can write a `for`-loop, but it’s much “sexier” to use a functional style of programming that takes advantage of higher-order functions or the `*apply` family of functions (`lapply`, `sapply`, and the like). But how do these functions perform with a large list, particularly when the results from `lapply` have to be further manipulated to get them into a useful form? I recently did some performance testing on three of the `*apply` functions and compared them to an equivalent `for`-loop approach. The results follow.

First, I set up a large list with some fake data. Here I make a list, `Sims`, containing 30000 \(5 \times 5\) matrices. (This is a data structure similar to one I recently found myself analyzing.)

set.seed(12345)
S <- 30000
Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)

Now I define four functions that extract the fifth column of each matrix in `Sims` and create a \(30000 \times 5\) matrix of results.

for_method <- function(S, Sims) {
    R <- matrix(NA, ncol=5, nrow=S)
    for (i in 1:S) R[i,] <- Sims[[i]][,5]
    R
}

sapply_method <- function(Sims) {
    t(sapply(Sims, function(x) x[,5]))
}

lapply_method <- function(Sims) {
    do.call(rbind, lapply(Sims, function(x) x[,5]))
}

vapply_method <- function(Sims) {
    t(vapply(Sims, function(x) x[,5], rep(0, 5)))
}

Notice that the `for_method` first creates a container matrix and inserts the results directly into it. Using something like `rbind` would force **R** to copy the matrix in each iteration of the loop, which is very inefficient. Also notice that in each of the apply methods, the result has to be manipulated before it is returned. It’s often possible to eliminate this step during the initial simulation process. But if you’re like me, you sometimes start your simulations before you realize what you intend on doing with the results.

Now for some timing results:

> system.time(replicate(20, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  4.340   0.072   4.414
> system.time(replicate(20, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  3.452   0.076   3.528
> system.time(replicate(20, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  4.393   0.044   4.433
> system.time(replicate(20, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  2.588   0.040   2.628
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results surprised me—I never expected `vapply` to be so much quicker than the other three methods. This may have to do with the fact that, since `vapply` requires you to explicitly specify the resulting data structure, **R** is able to allocate the memory ahead of time.
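Speed aside, the explicit template passed to `vapply` also buys some safety: if any element’s result doesn’t match the declared shape, `vapply` stops with an error, whereas `sapply` silently falls back to returning a list. A small illustration:

```r
lst <- list(1:5, 1:5, 1:3)   # the last element has the wrong length

## sapply cannot simplify, so it quietly returns a list:
class(sapply(lst, function(x) x))           # "list"

## vapply refuses outright, since each result must match rep(0, 5):
res <- try(vapply(lst, function(x) x, rep(0, 5)), silent = TRUE)
inherits(res, "try-error")                  # TRUE
```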

As a simple extension, I thought I would retest taking advantage of byte-compiled code:

library(compiler)
for_method <- cmpfun(for_method)
sapply_method <- cmpfun(sapply_method)
lapply_method <- cmpfun(lapply_method)
vapply_method <- cmpfun(vapply_method)

And here are the results:

> system.time(replicate(10, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  2.084   0.024   2.110
> system.time(replicate(10, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  1.624   0.016   1.642
> system.time(replicate(10, R.3 <<- lapply_method(Sims)))
   user  system elapsed
  2.152   0.012   2.164
> system.time(replicate(10, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  1.204   0.004   1.207
> all(identical(R.1, R.2), identical(R.1, R.3), identical(R.1, R.4))
[1] TRUE

These results are also surprising. I really thought byte-compiling the functions would make the results much closer (doesn’t each function compile down to what is effectively a `for`-loop?). However, in this simple example at least, that isn’t the case. `vapply` still has, by a significant margin, the best performance. `sapply` also performs noticeably better than `lapply` and the `for`-loop, which is an important result since `sapply` doesn’t require a pre-specified data structure. Assuming they hold in the presence of more complicated data structures (and I don’t see why they wouldn’t), these results seem to suggest that a functional approach can provide benefits in terms of both coding efficiency and run-time performance.

I welcome any feedback or suggestions on how these tests could be improved or extended.

A couple things have come up in the comments. First, Henrik provided a more elegant (and efficient) version of the `vapply_method`:

vapply_method2 <- function(Sims) {
    t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
}

Second, several people have reported that the `for_method` on their system is just as fast, if not faster, than the `vapply_method` when compiled. I have retested my results, adding Henrik's `vapply_method2`. Here are the results for compiled code (note that I have increased the number of replications):

> library(compiler)
> set.seed(12345)
> S <- 30000
> Sims <- Map(function(x) matrix(rnorm(25), nrow=5, ncol=5), 1:S)
>
> for_method <- function(S, Sims) {
+     R <- matrix(NA, ncol=5, nrow=S)
+     for (i in 1:S) R[i,] <- Sims[[i]][,5]
+     R
+ }
>
> sapply_method <- function(Sims) {
+     t(sapply(Sims, function(x) x[,5]))
+ }
>
> lapply_method <- function(Sims) {
+     do.call(rbind, lapply(Sims, function(x) x[,5]))
+ }
>
> vapply_method <- function(Sims) {
+     t(vapply(Sims, function(x) x[,5], rep(0, 5)))
+ }
>
> vapply_method2 <- function(Sims) {
+     t(vapply(Sims, "[", rep(0, 5), i = 1:5, j = 5))
+ }
> for_method <- cmpfun(for_method)
> sapply_method <- cmpfun(sapply_method)
> lapply_method <- cmpfun(lapply_method)
> vapply_method <- cmpfun(vapply_method)
> vapply_method2 <- cmpfun(vapply_method2)
> system.time(replicate(50, R.1 <<- for_method(S, Sims)))
   user  system elapsed
  6.052   0.136   6.191
> system.time(replicate(50, R.2 <<- sapply_method(Sims)))
   user  system elapsed
  8.361   0.116   8.474
> system.time(replicate(50, R.3 <<- lapply_method(Sims)))
   user  system elapsed
 10.548   0.144  10.692
> system.time(replicate(50, R.4 <<- vapply_method(Sims)))
   user  system elapsed
  5.813   0.124   5.936
> system.time(replicate(50, R.5 <<- vapply_method2(Sims)))
   user  system elapsed
  4.268   0.084   4.352

For whatever reason, the `vapply` methods remain the fastest on my system (particularly Henrik’s version). I also tested on another Linux machine and got the same results (I have not yet tested on Windows). Of course, this is a very simple example, so the performance differences may not hold up to the increasing complexity of the input function. I’d be interested to know if anyone has an idea as to why `vapply` would perform better in Linux than in Windows/Mac. A library thing?