Computing LD in R
Suppose you have a matrix of 0’s and 1’s representing haplotypes at a list of SNPs (or other biallelic variants) and you want to compute pairwise LD statistics in R. Computing r^2 is easy, just use
r = cor( t( data ) )
and then square it. (I am assuming the SNPs are in rows and chromosomes in columns, the way I usually arrange things; if not, the transpose isn't needed.) However, computing D and D' is a bit more tricky, especially if you want to compute them efficiently across a whole list of variants at once. So I wrote this function (also downloadable here):
compute.ld <- function( haps, focus.i = 1:nrow(haps), variant.names = rownames( haps ) ) {
    # haps is a 0-1 matrix with L SNPs (in rows) and N haplotypes (in columns).
    # Since the values are 0, 1, we rely on the fact that a*b=1 iff a and b are 1.
    # assume focus.i contains l values in range 1..L
    L = nrow( haps )
    l = length( focus.i )
    # p11 = lxL matrix.  i,jth entry is probability of 11 haplotype for ith focal SNP against jth SNP.
    focus.hap = haps[ focus.i, , drop = FALSE ]
    p11 <- ( focus.hap %*% t( haps )) / ncol( haps )
    # p1. = lxL matrix.  ith row is filled with the frequency of ith focal SNP.
    p1. <- matrix( rep( rowSums( focus.hap ) / ncol( haps ), L ), length( focus.i ), L, byrow = FALSE )
    # p.1 = lxL matrix.  jth column is filled with the frequency of jth SNP.
    frequency = rowSums( haps ) / ncol( haps )
    p.1 <- matrix( rep( frequency, length( focus.i ) ), length( focus.i ), L, byrow = TRUE )
    # Compute D
    D <- p11 - p1. * p.1
    # Compute D'
    denominator = pmin( p1.*(1-p.1), (1-p1.)*p.1 )
    wNeg = (D < 0)
    denominator[ wNeg ] = pmin( p1.*p.1, (1-p1.)*(1-p.1) )[wNeg]
    denominator[ denominator == 0 ] = NA
    Dprime = D / denominator
    # Compute correlation, this result should agree with cor( t(haps ))
    R = D / sqrt( p1. * ( 1 - p1. ) * p.1 * ( 1 - p.1 ))
    R[ frequency[ focus.i ] == 0 | frequency[ focus.i ] == 1, frequency == 0 | frequency == 1 ] = NA
    if( !is.null( variant.names )) {
        rownames( D ) = rownames( Dprime ) = rownames( R ) = variant.names[ focus.i ]
        colnames( D ) = colnames( Dprime ) = colnames( R ) = variant.names
        names( frequency ) = variant.names
    }
    return( list( D = D, Dprime = Dprime, frequency = frequency, R = R ) )
}
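Here's a quick sketch of calling it on a made-up toy matrix (three SNPs on four haplotypes; the data and SNP names are invented purely for illustration). Squaring the returned R gives r^2, which should match cor():

# toy example: 3 SNPs (rows) by 4 haplotypes (columns), invented data
haps = rbind(
    snp1 = c( 0, 1, 1, 0 ),
    snp2 = c( 0, 1, 1, 1 ),
    snp3 = c( 1, 0, 0, 1 )
)
ld = compute.ld( haps )
ld$R^2                          # r^2 between all pairs of SNPs
ld$R^2 - cor( t( haps ) )^2     # should be numerically zero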
Big deal.
Well, what’s interesting about this function is that it’s quite fast (relatively speaking):
> dim(A)
[1] 1002 1000
> system.time( { r = cor( t(A)) } )
   user  system elapsed 
  1.268   0.020   1.296 
> system.time( { d = compute.ld( A ) } )
   user  system elapsed 
  1.340   0.128   1.154 
…which says that, on this dataset of 1002 SNPs and 1000 samples, compute.ld() takes no more time than cor() (slightly less elapsed time, in fact), despite computing more stuff.
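(If you want to try a comparison like this yourself, a matrix of random 0s and 1s of the same size will do; this is just a sketch with an arbitrary allele frequency, not the data used above.)

set.seed( 1 )
# 1002 SNPs x 1000 haplotypes of random 0/1 data, allele frequency 0.3
A = matrix( rbinom( 1002 * 1000, size = 1, prob = 0.3 ), nrow = 1002, ncol = 1000 )
system.time( { r = cor( t(A) ) } )
system.time( { d = compute.ld( A ) } )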
You can also use it to compute LD between a subset of SNPs and all the others:
> d = compute.ld( A, focus.i = c( 500,501 ))
> dim(d$Dprime)
[1]    2 1002
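For example (a sketch, using the same matrix A), you could pull out the variants in strongest LD with a single focal SNP by squaring the returned R:

d = compute.ld( A, focus.i = 500 )
rsq = d$R[1,]^2                          # r^2 between SNP 500 and all 1002 SNPs (including itself)
head( sort( rsq, decreasing = TRUE ) )   # the SNPs in strongest LD with SNP 500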
Example
Aha, you say, but does it work? One way to test it is to try the simplest set of haplotypes that gives interesting results – so I tested it on every possible variant on five chromosomes, i.e. all 32 patterns of 0s and 1s. (|D'| = 1 unless all four possible two-SNP haplotypes appear, so we need at least four chromosomes in this test anyway.)
data = as.matrix( expand.grid( c(0,1),c(0,1),c(0,1),c(0,1),c(0,1) ) )
ld = compute.ld( data )
Dprime = ld$Dprime
R = ld$R
The function computes D, correlation (R), and D’. Let’s check it gets R right first:
max( cor( t( data ) ) - R, na.rm = T )
[1] 8.881784e-16
So this agrees with cor(), although there is some numerical error. Also, there's a symmetry in the data, namely that row i and row 33-i are allele flips of each other (0s and 1s swapped); flipping the alleles at one of the two SNPs changes the sign of D', while flipping at both leaves it unchanged. Let's check that symmetry is reflected in the D' values:
reversed = 32:1
stopifnot( is.na( Dprime ) == is.na( Dprime[reversed, reversed]))
stopifnot( is.na( Dprime ) == is.na( Dprime[reversed,]))
stopifnot( is.na( Dprime ) == is.na( Dprime[, reversed]))
max( ( Dprime - Dprime[reversed,reversed] ), na.rm = T )
[1] 4.218847e-15
max( ( Dprime + Dprime[reversed,] ), na.rm = T )
[1] 3.330669e-16
max( ( Dprime + Dprime[,reversed] ), na.rm = T )
[1] 3.330669e-16
Also, D' should be NA when either variant is monomorphic (frequency 0 or 1) and properly computed otherwise:
wNA = which( rowSums( data ) == 0 | rowSums( data ) == ncol( data ) )
stopifnot( length( which( !is.na( Dprime[wNA,wNA] ) ) ) == 0 )
stopifnot( length( which( is.na( Dprime[-wNA,-wNA] ) ) ) == 0 )
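As an extra check (my addition, not one of the original tests), a D value can also be compared against a by-hand computation of the haplotype and allele frequencies for a single pair of SNPs; rows 4 and 7 of data are an arbitrary choice here:

i = 4 ; j = 7
p11 = mean( data[i,] == 1 & data[j,] == 1 )          # frequency of the "11" haplotype
D.byhand = p11 - mean( data[i,] ) * mean( data[j,] )
stopifnot( abs( ld$D[i,j] - D.byhand ) < 1e-12 )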
To test that the values are actually correct, let's figure out what the answer should be. With 5 chromosomes, the possible nonzero (and non-unit) frequencies are 1/5, 2/5, 3/5 and 4/5. If either SNP has frequency 1/5 or 4/5 then D' = 1 or -1, since there is only one copy of one of the alleles and so one of the four possible haplotypes must be missing:
w = which( rowSums( data ) == 1 | rowSums( data ) == 4 )
table( compute.ld( data, focus.i = w )$Dprime )
which gives
 -1   1 
150 150 
That is 300 comparisons in all (the 10 focal SNPs against the 30 polymorphic SNPs), every one equal to 1 or -1 as expected. Suppose then that both SNPs have frequency 2/5 (the other cases are symmetrical up to flipping alleles). The denominator in D' is then min( p1(1-p2), (1-p1)p2 ) = 6/25 if D > 0 and min( p1 p2, (1-p1)(1-p2) ) = 4/25 otherwise. The numerator is D = p11 - p1 p2 = p11 - 4/25, where p11 is 0, 1/5 or 2/5. Hence D' takes the values -1, 1/6 or 1 accordingly as the “1” alleles at the two SNPs either never coincide, coincide once, or coincide twice. Let's try:
rownames( data ) = paste( data[,1], data[,2], data[,3], data[,4], data[,5], sep = '' )
w = which( rowSums( data ) == 2 )
compute.ld( data[w,] )$Dprime
which outputs
        11000   10100   01100   10010   01010   00110   10001   01001   00101   00011
11000  1.0000  0.1667  0.1667  0.1667  0.1667 -1.0000  0.1667  0.1667 -1.0000 -1.0000
10100  0.1667  1.0000  0.1667  0.1667 -1.0000  0.1667  0.1667 -1.0000  0.1667 -1.0000
01100  0.1667  0.1667  1.0000 -1.0000  0.1667  0.1667 -1.0000  0.1667  0.1667 -1.0000
10010  0.1667  0.1667 -1.0000  1.0000  0.1667  0.1667  0.1667 -1.0000 -1.0000  0.1667
01010  0.1667 -1.0000  0.1667  0.1667  1.0000  0.1667 -1.0000  0.1667 -1.0000  0.1667
00110 -1.0000  0.1667  0.1667  0.1667  0.1667  1.0000 -1.0000 -1.0000  0.1667  0.1667
10001  0.1667  0.1667 -1.0000  0.1667 -1.0000 -1.0000  1.0000  0.1667  0.1667  0.1667
01001  0.1667 -1.0000  0.1667 -1.0000  0.1667 -1.0000  0.1667  1.0000  0.1667  0.1667
00101 -1.0000  0.1667  0.1667 -1.0000 -1.0000  0.1667  0.1667  0.1667  1.0000  0.1667
00011 -1.0000 -1.0000 -1.0000  0.1667  0.1667  0.1667  0.1667  0.1667  0.1667  1.0000
Each row contains the diagonal 1, six values of 1/6 (pairs whose “1” alleles coincide on exactly one of the five chromosomes) and three values of -1 (pairs whose “1” alleles never coincide), which is just what the counting above predicts… so I basically think it does work.
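For good measure, here's a programmatic version of that last check (my addition, not part of the original tests; rounding to 4 digits just makes the floating-point comparisons safe):

Dprime2 = compute.ld( data[w,] )$Dprime
# each row should contain one 1 (the diagonal), six values of 1/6 and three of -1
counts = t( apply( round( Dprime2, 4 ), 1, function(x) {
    c( minus.one = sum( x == -1 ), one.sixth = sum( x == round( 1/6, 4 ) ), one = sum( x == 1 ) )
} ) )
stopifnot( all( counts[, "minus.one"] == 3 ), all( counts[, "one.sixth"] == 6 ), all( counts[, "one"] == 1 ) )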