A very short and unoriginal introduction to snow

As Jian-Feng rightly pointed out in a comment on my guide to setting up snow on the OSC cluster, it was probably somewhat cavalier of me to say:

Getting snow to run properly on single machines, or even with a cluster of machines via ssh connections, is fairly trivial.

In an effort to redeem myself, I provide this very short and unoriginal introduction to using snow. But first, a caveat: to make the most of parallel processing in R, or in any other environment, the problem you are trying to solve must be amenable to being broken up into smaller, (mostly) independent pieces. In other words, the results from one piece should not depend on the results from another. In statistics, this may or may not apply, depending on the problem at hand. Bootstrapping, a simple example of which I provide below, is one place where parallel processing can provide excellent returns. On the other hand, a typical maximum likelihood estimate using, for instance, a BFGS optimization routine would gain little from parallel processing, since step \(n+1\) depends on the results of step \(n\). (Unsurprisingly, things are a bit more complicated than this, and if you are really interested in learning about parallel processing, you may want to start with reading the Wikipedia entry.)
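To make the caveat concrete, here is a minimal base-R sketch with no cluster involved. The names v, boot, and est are my own, and a plain Newton iteration stands in for BFGS purely for illustration; it is not how optim implements BFGS. The first computation consists of pieces that never look at each other, while the second cannot start step \(n+1\) until step \(n\) has finished.

```r
# Independent pieces: each replicate needs only the data,
# never the result of another replicate, so they could run anywhere.
set.seed(1)
v <- c(2, 4, 4, 5, 7)
boot <- vapply(1:4,
               function(i) mean(sample(v, length(v), replace = TRUE)),
               numeric(1))

# Dependent steps: every update uses the previous estimate,
# so the iterations must run one after another.
# (Newton's method for sqrt(2), used here as a stand-in for BFGS.)
est <- 1
for (n in 1:5) {
  est <- est - (est^2 - 2) / (2 * est)
}
```

Only the first pattern can be split across several R instances without changing the algorithm.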

This simple example demonstrates how to calculate bootstrapped sample means of a given vector in parallel across a cluster. First, load the snow and rlecuyer libraries. Of course, snow is what provides the parallel processing, but rlecuyer is equally important as it guarantees the random numbers generated in each process are independent (snow also supports the rsprng library).

> library(snow)
> library(rlecuyer)

Now set up some sample data. Here I take 100 random draws, with replacement, from the integers in \([0,5]\).

> x <- sample(0:5, 100, replace = TRUE)
> mean(x)
[1] 2.64

Define a simple function to calculate a single bootstrapped mean from a given vector:

> bs.mean <- function(v) {
+   s <- sample(v, length(v), replace = TRUE)
+   mean(s)
+ }
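Before involving a cluster, it can help to see what the serial version of the whole job looks like. The following is only a sketch in base R: the seed, the choice of 1000 replicates, and the name boot.means are assumptions of mine, not part of the example above.

```r
# Same helper as above: one bootstrapped mean of a vector.
bs.mean <- function(v) {
  s <- sample(v, length(v), replace = TRUE)
  mean(s)
}

set.seed(42)                          # arbitrary seed, only for repeatability
x <- sample(0:5, 100, replace = TRUE)

# Serial bootstrap: call bs.mean 1000 times, one after another.
boot.means <- replicate(1000, bs.mean(x))

# Typical summaries of the bootstrap distribution.
boot.se <- sd(boot.means)                       # bootstrap standard error
boot.ci <- quantile(boot.means, c(0.025, 0.975)) # percentile interval
```

Each of the 1000 calls is independent of the others, which is exactly what makes this loop a good candidate for splitting across several R instances.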

Now it’s time to set up the cluster. Here I set up a SOCK-type connection, which can be used to set up multiple R instances on the local machine and/or to set up R instances on remote machines through ssh connections. snow offers other connection options that may be more convenient or necessary depending on your environment (for instance, MPI was needed on the OSC cluster).

> cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")

Here, c("localhost", "localhost") tells snow where to set up the R instances, while type = "SOCK" is obviously the connection type. If I also wanted to run a single instance on a remote machine named chuck, I could specify c("localhost", "localhost", "chuck"). In this case, I would be prompted for my ssh password for chuck, though snow would take care of the rest once the connection was authenticated.

Once the connections are set up, you will want to provide unique random seeds on each of the instances.

> clusterSetupRNG(cl)
[1] "RNGstream"

The return value, RNGstream, just tells you what type of RNG was set up. Finally, it’s time to do some work.

> clusterCall(cl, bs.mean, x)
[[1]]
[1] 2.81

[[2]]
[1] 2.61

clusterCall instructs all instances in cl to execute the function bs.mean on the vector x, both of which we defined above. The results are returned in a list with a length equal to the number of instances; e.g., had we included chuck in our call to makeCluster, clusterCall would have returned a list of three bootstrapped means. Because bs.mean doesn’t depend on anything calculated by the other processes, these bootstrapped means are calculated in parallel.
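One bootstrapped mean per instance is not much of a bootstrap. A simple way to scale this up, sketched here under the assumption that snow and rlecuyer are installed and that two local SOCK workers can be started, is to have each instance draw many means in a single call. The helper bs.means and the figure of 500 draws per worker are hypothetical choices of mine.

```r
library(snow)
library(rlecuyer)

x <- sample(0:5, 100, replace = TRUE)

# Hypothetical helper: draw n bootstrapped means on whichever worker runs it.
bs.means <- function(v, n) {
  replicate(n, mean(sample(v, length(v), replace = TRUE)))
}

cl <- makeCluster(c("localhost", "localhost"), type = "SOCK")
clusterSetupRNG(cl)                        # independent random streams per worker

# Every worker runs bs.means(x, 500); the result is a list with
# one numeric vector of length 500 per worker.
res <- clusterCall(cl, bs.means, x, 500)

stopCluster(cl)

# Pool the per-worker draws into one vector of bootstrapped means.
boot.means <- unlist(res)
```

Because the helper uses only base R functions and receives its data as arguments, nothing else needs to be sent to the workers beforehand.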

When you are done with the cluster, you should always stop it. Otherwise, you may have to kill R instances by hand.

> stopCluster(cl)
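If an error occurs before stopCluster is reached, the worker instances are left running. One way to guard against that, again only a sketch assuming snow and rlecuyer are installed, is to wrap the work in a function and register the cleanup with on.exit. The names one.mean and par.boot.means are hypothetical.

```r
library(snow)
library(rlecuyer)

# Worker function defined at top level so that sending it to the
# workers does not drag along the local variables of the wrapper below.
one.mean <- function(i, v) {
  mean(sample(v, length(v), replace = TRUE))
}

# Hypothetical wrapper: start a cluster, compute n bootstrapped means,
# and stop the cluster even if something fails along the way.
par.boot.means <- function(v, n, hosts = c("localhost", "localhost")) {
  cl <- makeCluster(hosts, type = "SOCK")
  on.exit(stopCluster(cl), add = TRUE)   # runs on normal return and on error
  clusterSetupRNG(cl)
  parSapply(cl, seq_len(n), one.mean, v)
}

x <- sample(0:5, 100, replace = TRUE)
boot.means <- par.boot.means(x, 200)
```

Here parSapply splits the indices 1 to n across the workers and returns the results as a single vector, so no manual pooling is needed.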

Like I said at the outset, this was just a very short and unoriginal introduction to parallel processing with snow. There are many other examples available online, a couple of which I provide links to below.


4 Responses to A very short and unoriginal introduction to snow

  1. Tal Galili says:

    Hi there,

    I wrote about this topic (for windows) some time ago, here:
    http://www.r-statistics.com/2010/04/parallel-multicore-processing-with-r-on-windows/

    And I have a question: is there an (easy) way for having snow using different cores?
    I remember that when I played with it, it simply streamed all of my SOCKets into the same CPU.

    p.s: please add the “subscribe to comments” plugin :)

    Cheers,
    Tal

    • Jason says:

      Great. Thanks for the reference, Tal. I’ve added a link to your page. As for guaranteeing that R uses all available cores, I am not sure. I operate almost exclusively in a Unix environment, where cores are used as needed (the OS does the scheduling). Fairly recently, I successfully used snow with rgenoud and 4 cores on Windows without any modification to the batch file I run on Unix systems. I believe the above example would work on Windows as well. If I remember, I will try it out some time this week.

  2. Dear Jason, this is great. Your introduction will make it easier for me to implement parallel computation with R, though I do not have much computing knowledge. Thanks a lot.

    • Jason says:

      No problem. Though it’s quite introductory, it should get you started with the basics. I will also try to add more links as I run across them. I am sure there are many very good introductions available.
