Querying intervals efficiently in SQL + Python revisited

homeblogtwitterthingiverse



My previous post on querying intervals generated quite a lot of interest, but my solution was a hideous thing involving trinary numbers. I now have a better solution.

To recap: I am interested in efficiently finding, from a set of intervals, all intervals that contain a given value. I have quite a lot of intervals, so I want to use an SQL database to store and query these intervals.

By efficiently, I mean faster than O(n). Simply scanning the list is too slow, and indexing the upper and/or lower bound still yields an O(n) search.

It would furthermore be nice to be able to find all intervals that intersect a given query interval. My new solution allows this.

My new solution:

For each interval, store the upper and lower bounds, and the size of the interval rounded down to the nearest power of two (I'll call this value size). Create indexes on (size,lower) and (size,upper).

To perform a query, we consider each possible value of size (up to the maximum size interval you expect). Since size is at least half the true size of the interval, intervals that contain a certain value must satisfy either

lower <= value <= lower+size

or

upper-size <= value <= upper

or possibly both. We can efficiently find intervals satisfying these two conditions using the two indexes we have created. No mess, no fuss!

If your query itself is an interval, the two conditions become:

lower <= query_upper and query_lower <= lower+size

or

upper-size <= query_upper and query_lower <= upper


Implementation in Python/SQLite:

SQLite needs a little coddling when implementing this. You need to convert value <= lower+size to value-size <= lower, etc, for it to use the indexes properly. explain query plan is your friend.


Update: Istvan Albert pointed me to this paper on Nested Containment Lists, which solve the same problem and are also amenable to SQL database implementation, and also this discussion of python interval database/query implementations on the pygr mailing list.


Performance notes:
The python file above generates a test data set of 1,000,000 intervals and performs 100 queries on it. On my machine, my method is 20 times faster than just using an index on (upper, lower).

I've also tried this on set of 15,000,000 alignments of Illumina short reads to a reference bacterial genome. On my machine, with my method a query retrieving 500 alignments takes 0.009 seconds, as compared to 27 seconds just using an index on (lower, upper), or 49 seconds with no index at all.




[æ]