# Associative arrays in R

I learnt this from here. The following R function gives you an associative array-like object with the given set of keys (initialised to NA):

hash <- function( keys ) { result <- new.env( hash = TRUE, parent = emptyenv(), size = length( keys ) ) for( key in keys ) { result[[ key ]] <- NA } return( result ) }

It is a kind of hash table. Why would you want it? R already provides value lookup in lists and vectors using the

my_list[[ key ]]

syntax. Unfortunately, this key lookup is done using a linear search through the keys. I find this to be rather a misfeature because, being so easy to use, it encourages you to use code that ends up being very slow. Try this for example:

start_time = proc.time()["elapsed"] L = list() N = 20000 for( i in 1:N ) { L[[ sprintf( "key%d", i ) ]] = NA } print( proc.time()[ "elapsed" ] - start_time )

With a linear key lookup, this loop is quadratic in the number of elements; on my laptop this takes about 16 seconds. (*16 seconds*! For 20000 elements! Quadratic or not, it’s hard to see how R can achieve this incredible slowness.) Now try this code which uses the hash() function above:

start_time = proc.time()["elapsed"] L = hash( keys = sprintf( "key%d", 1:20000 )) for( i in 1:20000 ) { L[[ sprintf( "key%d", i ) ]] = NA }

print( proc.time()[ "elapsed" ] - start_time )

This takes about 0.4s, that is to say, it’s 40 times faster. With more elements, the difference gets bigger. You can coerce the result back to a list using as.list().

*Warning*: the objects returned by hash are “environments” and, unlike other objects, they get passed around by reference, not by copying. If you want two different hash objects, use hash() twice, rather than trying to copy the first object.

### The hash package

Although there’s a hash package on CRAN, described here, which ostensibly does what the hash() function above does only better, I found it also to be slow for the examples above. (In fact, a call to

hash::hash( keys = sprintf( "key%d", 1:N ), values = 1:N )

with N = 100000 seems to take a very long time. Not sure what causes this). So I used the hash() function above instead.

Thanks a lot for this post. It saved my hours of waiting time using the linear lookup internal to R.

Regards,

Sven

Reblogged this on Sven Logan's Blog.