Skip to content

Associative arrays in R

April 6, 2011

I learnt this from here.  The following R function gives you an associative array-like object with the given set of keys (initialised to NA):

hash <- function( keys ) {
    result <- new.env( hash = TRUE, parent = emptyenv(), size = length( keys ) )
    for( key in keys ) {
        result[[ key ]] <- NA
    }
    return( result )
}

It is a kind of hash table.  Why would you want it?  R already provides value lookup in lists and vectors using the

my_list[[ key ]]

syntax.  Unfortunately, this key lookup is done using a linear search through the keys.  I find this to be rather a misfeature because, being so easy to use, it encourages you to use code that ends up being very slow.  Try this for example:

start_time = proc.time()["elapsed"]
L = list()
N = 20000
for( i in 1:N ) {
    L[[ sprintf( "key%d", i ) ]] = NA
}
print( proc.time()[ "elapsed" ] - start_time )

With a linear key lookup, this loop is quadratic in the number of elements; on my laptop this takes about 16 seconds.  (16 seconds!  For 20000 elements!  Quadratic or not, it’s hard to see how R can achieve this incredible slowness.)  Now try this code which uses the hash() function above:

start_time = proc.time()["elapsed"]
L = hash( keys = sprintf( "key%d", 1:20000 ))
for( i in 1:20000 ) {
    L[[ sprintf( "key%d", i ) ]] = NA
}
print( proc.time()[ "elapsed" ] - start_time )

This takes about 0.4s, that is to say, it’s 40 times faster.  With more elements, the difference gets bigger.  You can coerce the result back to a list using as.list().

Warning: the objects returned by hash are “environments” and, unlike other objects, they get passed around by reference, not by copying.  If you want two different hash objects, use hash() twice, rather than trying to copy the first object.

The hash package

Although there’s a hash package on CRAN, described here, which ostensibly does what the hash() function above does only better, I found it also to be slow for the examples above.  (In fact, a call to

hash::hash( keys = sprintf( "key%d", 1:N ), values = 1:N )

with N = 100000 seems to take a very long time.  Not sure what causes this).  So I used the hash() function above instead.

 

Advertisements
3 Comments leave one →
  1. July 2, 2013 10:24 am

    Thanks a lot for this post. It saved my hours of waiting time using the linear lookup internal to R.

    Regards,
    Sven

  2. July 2, 2013 10:25 am

    Reblogged this on Sven Logan's Blog.

Trackbacks

  1. Associative arrays in R | Slept like a… | Tidbits from the Web

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: