Monday, April 13, 2015

Comparing the execution time between foverlaps and findOverlaps [data.table vs GenomicRanges]

Both of these functions find overlaps between genomic intervals. The findOverlaps function is from the Bioconductor package GenomicRanges (or IRanges if you don't need to compare intervals with an associated chromosome and strand). foverlaps is from the data.table package and is inspired by findOvelaps.

In genomics, we often have one large data set X with small interval ranges (usually sequenced reads) and another smaller data set Y with larger interval spans (usually exons, introns etc.). Generally, we are tasked with finding which intervals in X overlap with which intervals in Y.

In the foverlaps function Y has to be indexed using the setkey function (we don't have to do it on X). The key is intended to speed-up finding overlaps.

Which one is faster?

To check this we used the benchmark function from the rbenchmark package. It's a simple wrapper of the system.time function.

The code below plots the execution time of both functions for increasing numbers of rows of data set X.

Interestingly,  foverlaps is the fastest way to solve the problem of finding overlaps, but only when the large data set has less than 200k rows.

We also plotted situation when we exchanged the place of X and Y in arguments of both functions. In this case you can see that almost from the beginning foverlaps is much slower than findOverlaps.

Information about my R session:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.4 (Mavericks)

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] data.table_1.9.4     rbenchmark_1.0.0     GenomicRanges_1.18.4
[4] GenomeInfoDb_1.2.4   IRanges_2.0.1        S4Vectors_0.4.0     
[7] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):
[1] chron_2.3-45   plyr_1.8.1     Rcpp_0.11.5    reshape2_1.4.1 stringr_0.6.2 
[6] tools_3.1.3    XVector_0.6.0


  1. Hi Katarzyna, nice post! Thanks for the comparison. I've a couple of corrections and suggestions:

    1. I'd suggest re-running with the first two lines changed to:

    start <- seq(1L, 20e8L, 1000L)
    end <- start + 3000L

    so that intervals are integers. Your ranges are currently numeric, and allocation/joins on numeric ranges would be slower, of course, compared to integers. `foverlaps()` is designed with all numeric types, for e.g., double, int64, Date, POSIXct etc.. in mind, and therefore intervals are not automatically coerced.

    2. You missed a "nomatch = 0L" argument to 'foverlaps', since 'foverlaps' by default also returns no matches, whereas 'findOverlaps' doesn't.

    3. It'd be nice to benchmark on a more realistic dataset (under more realistic sizes as well). If you're interested, here's one of the same data that I've tested on:

    On just "Chr1", I get 0.8 vs 3.9 seconds and 3.9 vs 17 seconds on entire dataset (foverlaps vs findOverlaps).

    Note that on just one chromosome, the comparison is really foverlaps vs IRanges (more or less).

    4. It'd also be nice to compare when `query` isn't sorted. Testing on the sample data linked above by randomly reordering query resulted in 10s vs 50s (foverlaps vs findOverlaps).

    There are several improvements I'd like to get done in `foverlaps()`. I'll write a gist (or a vignette) and update with a link here, when I manage.

    Co-developer, data.table

  2. Hi Arun,
    Thanks for your reply! I've changed code according to your suggestions - changed intervals from numeric to integer and added "nomatch = 0L" argument. Thanks for the link. So why is such difference between our results? I've checked that in your dataset there is more overlapped intervals than in this generated by me. Is it because of width of intervals?

    1. Hi Katarzyna, I'm not sure if you've updated the plots as well?

      I know what happens in foverlaps, and why it doesn't perform as well when we revert "query" and "subject" (your 2nd plot). It's specifically designed for problems where 1) "query" is large, and "subject" is << than nrow("query"), and 2) to obtain overlaps without having to sort on "query".

      Optimising the other case (your 2nd plot) is on the to-do list, and will be done, but prioritised depending on whether there are domains / problems where such a scenario actually exists.. (in the case, for overlap type = "any", for example, foverlaps(x,y) and foverlaps(y,x) are the same with columns reversed).

      As to why this difference, I'm not sure why IRanges performs so much better on your artificial data, but I plan to look at that code in more detail when I find some time. I'll write back when I manage (and with any updates to foverlaps).

    2. The intervals are still numeric because you missed a `L` in `start`. It should be:

      `start <- seq(1L, 20e8L, 1000L)` # 1L, not 1.

    3. The code is correct now, but the explanation and plots aren't updated, and that's misleading.. Are you planning to update it?

    4. Yes, sorry, I've just updated plots. They look almost the same in comparison to those before update

    5. Hi Katarzyna, hm, that's strange because this is what I get:

    6. Do you run exactly the same code that is in this post (if not could you share it)? and do you have the same versions of these libraries?

    7. Same code as here. Tested on both 1.9.4 and 1.9.5 (current devel) of data.table, and latest version of GenomicRanges.

  3. This comment has been removed by the author.

  4. Lots of Good information in your post, I favorited your blog post so I can visit again in the future, Thanks.