25 June 2009, 13:52 UTCA nasty story pattern
Here is a story pattern that is depressingly common. The most recent example of it I have encountered is the new seasons of Doctor Who, however it is omnipresent and cross-cultural.
1. Young women have special moral insight and judgement due to their innocence, expressed in the form of strong emotions.
2. In their innocence they might do things that endanger themselves. Therefore they must be protected from this, against their will.
Satisfying stories put their main characters to ultimate tests, in order to demonstrate their good qualities. Here, part one is the good quality, and part two allows it to be tested without actually destroying the young woman in question. It also generally gives a gentleman a chance to prove his character (possibly after having been shown the path of rightness by said young woman). So this gives an appealing story structure.
So. Both parts of this are actively evil. Part one is encourages women to be kept innocent, which might lead them to take actions with consequences counter to their expectations. It also denies moral insight to men and older women. Part two argues for overriding someone's preferences. There are rare cases where this is justified, however it is so open to abuse it must be viewed with great suspicion as to the true motives of the person doing the overriding. It's the kind of thing which people are rarely honest about even to themselves.
There is also the monstrously stupid notion that the most morally perfect people are so suicidal as to need protection from themselves. No, you can not both have and eat your cake. Morality is all about hard decisions, and the innocent are the least suited to making these. Innocence is not a praiseworthy state.
I don't know that I have this entirely right. But it's everywhere, and starting to bug me.
I note an interesting counter pattern in some recent anime, roughly corresponding to the story of the rabbit and dragon in the Chinese zodiac. Not entirely sure this is an improvement, but possibly different enough to throw a spanner in the works.
18 June 2009, 6:57 UTCDo I win a prize?
See Table 2 of "Common genetic variants on 5p14.1 associate with autism spectrum disorders", Wang et al, Nature, April 2009.
SNP location My Genotype
rs4307059 TT
rs7704909 TT
rs12518194 AA
rs4327572 CC
rs1896731 CC
rs10038113 CC
One of the problems with this type of research, indeed with many of the newer techniques in biology, is that each data point has high dimension. Here, about half a million dimensions per subject. If you are trying to work out which of those dimensions has significant predictive power, you need to perform half a million signficance tests. The correction for multiple testing is brutal. Furthermore, each data point is expensive to obtain. So you need a lot of data points for any significant result, but often can't afford them.
Ideally, you would want to not just pick out a few interesting dimensions, but construct a full linear model. However this would require several million subjects. Prohibitively expensive. Even when the price comes down, finding that many people with the condition in question may be impossible.
I suspect that principal components analysis requires a lot less data points than constructing a linear model. Or perhaps one can estimate a covariance matrix from a healthy population, and then only require a few people with the condition in question to get useful predictions. I lack a theoretical framework to fit these kinds of tricks into. Suggestions welcome.
6 June 2009, 22:38 UTCA key-value data warehouse
Welcome to Paul's crazy data warehouse.
Everything reduced!
So here we are in 2009. Disks got big, but they didn't get fast. We have so much memory now that it takes a titanic amount of data to even start hitting the disk. But when you hit that wall, oh boy do you feel it.
I happen to be working with sequencing data that's just brushing up against that wall. Annoying.
Scattered reads and writes are agonizingly slow now. It's like the era of tape all over again. Flash storage solves this for the read side of things, but you still want coherent writes.
So, I've written myself a key-value store with this in mind. It's the flavour of the month thing to do.
Basic concept:
Merge sort can be performed with serial reads and writes. We'll maintain a collections of sorted lists, progressively merging lists together in the background. Writing data is simply a matter of adding a new list to this mix. Reading is a matter of checking each list in turn, from latest to earliest, for the key you want.
Refining things a little:
For random access to a database, you want items that are accessed with about the same likelihood to be stored near each other (ie in the same disk block). This way, your memory cache mostly contains frequently accessed items.
So I store each list as a balanced binary tree, with each level of the tree stored in a separate file. The list merging operation only ever needs to append to the end of each of these files, so this is still coherent writing.
Levels of the tree nearer the root will be accessed more often, and should end up in the memory cache. In-order traversal of the list remains fairly nice, since each level remains in order.
This scheme is cache oblivious, it works well with any size disk block. Which is handy, because you don't know jack about disks.
There is some room for improvement here, perhaps a trie.
Some nice properties:
Writing is fast. Chuck it on disk, merge at your leisure. In parallel, because this is 2009 and CPUs have stopped getting faster but there are plenty of them.
Multi-Version Concurrency Control is trivially easy. With only a little work, it will be impossible for the database to become corrupted. And it's backup-friendly.
You don't need to look up a record in order to modify it. For example, if you want to append to a record or add to a counter, you can set things up so this is deferred until merge-time. Each key effectively becomes its own little reduce. For example, you can implement a very nice word counting system this way. Which is exactly what I need for my sequencing data.
Implementation:
My implementation is not terribly optimized, but it is written in Cython, and is at least within the ballpark of reasonable. You'll need a recent Cython, and for Python 2.5 you'll also need the "processing" library.
I batch up key-value pairs in an in-memory dict until I have one million of them, before sorting them and writing them to disk. Python hashtables are fast, it would be silly not to use them until we are forced not to. One million items seems to be about optimal. This is a long way from filling available memory... something to do with cache size perhaps?
The performance on my machine is about 10x faster for writing than bsddb, 2x faster for reading, but still about 10-20x slower than an in-memory Python dict.
It's far from polished. Just proof of concept for the moment.
As examples, version 0.2 contains:
- General_store, supporting set, delete, and append operations. {set x} + {append y} -> {set xy}, {append x} + {set y} -> {set y}, {set/append x} + {delete} -> {delete}, {delete} + {set/append y} -> {set y}
- Counting_store, for word counting.
5 May 2009, 22:37 UTCTordion and Galliard beat patterns
Arbeau says:
The tune of a tordion and that of a galliard are the same, and there is no difference between them, save that the tordion is danced close to the ground to a light, lively beat, and the galliard is danced higher off the ground to a slower, stronger beat.
Here are two different possible beat patterns for the tune Arbeau gives:
My hypothesis is that the strong beat is at the start in the tordion and in the middle (fourth kick) in the galliard. The effect in the tordion is to never finish phrases conclusively, with a delicious pause in the middle of the phrase as the dancers leap lightly through the air, like a run on sentence with commas rather than full stops.
A similar effect is possible with the 15th century Italian saltarello and piva steps, if you are a young metro like Cornazano.
11 April 2009, 14:04 UTCOld Measures
I just added a complete set of Old Measures to my SCA music page. There's some interesting theory behind these, which I will blog about at some point.
The most ambitious of these was my "arrangement" of Tinternell. You can judge for yourself whether I managed to pull it off. I am also particularly happy with how the Brownswyke Alman turned out.
To save these MP3 files, right click and select "Save Link As..."
8 March 2009, 8:34 UTCMusic synthesis by physical modelling: ChucK

It's a little raw about the edges, but very nice to use. I was able to whip together this galliard from Arbeau's Orchesography in an afternoon, using the Clarinet and Mandolin models: arbeau_galliard.mp3
The Clarinet is a bit... quirky, which I think is awesome. I had to write my own little script to do adjust the breath pressure so the notes tail off nicely. I absolutely love having that level of fine grained control.
Moar:
21 February 2009, 6:24 UTCQuerying intervals efficiently in SQL + Python revisited
My previous post on querying intervals generated quite a lot of interest, but my solution was a hideous thing involving trinary numbers. I now have a better solution.
To recap: I am interested in efficiently finding, from a set of intervals, all intervals that contain a given value. I have quite a lot of intervals, so I want to use an SQL database to store and query these intervals.
By efficiently, I mean faster than O(n). Simply scanning the list is too slow, and indexing the upper and/or lower bound still yields an O(n) search.
It would furthermore be nice to be able to find all intervals that intersect a given query interval. My new solution allows this.
My new solution:
For each interval, store the upper and lower bounds, and the size of the interval rounded down to the nearest power of two (I'll call this value size). Create indexes on (size,lower) and (size,upper).
To perform a query, we consider each possible value of size (up to the maximum size interval you expect). Since size is at least half the true size of the interval, intervals that contain a certain value must satisfy either
lower <= value <= lower+size
or
upper-size <= value <= upper
or possibly both. We can efficiently find intervals satisfying these two conditions using the two indexes we have created. No mess, no fuss!
If your query itself is an interval, the two conditions become:
lower <= query_upper and query_lower <= lower+size
or
upper-size <= query_upper and query_lower <= upper
Implementation in Python/SQLite:
SQLite needs a little coddling when implementing this. You need to convert value <= lower+size to value-size <= lower, etc, for it to use the indexes properly. explain query plan is your friend.
Update: Istvan Albert pointed me to this paper on Nested Containment Lists, which solve the same problem and are also amenable to SQL database implementation, and also this discussion of python interval database/query implementations on the pygr mailing list.
Performance notes:
The python file above generates a test data set of 1,000,000 intervals and performs 100 queries on it. On my machine, my method is 20 times faster than just using an index on (upper, lower).
I've also tried this on set of 15,000,000 alignments of Illumina short reads to a reference bacterial genome. On my machine, with my method a query retrieving 500 alignments takes 0.009 seconds, as compared to 27 seconds just using an index on (lower, upper), or 49 seconds with no index at all.
18 February 2009, 6:17 UTCQuerying intervals efficiently in SQL + Python
Update 21/2/2009: See my new solution to this problem here.
I have a problem where I need to be able to efficiently[1] find all of the intervals that contain a particular (integer) value. There are data-structures to do this efficiently, for example the interval tree. However, I have quite a lot of intervals, and I don't want to have to code up a complicated disk-based data structure, so I am constrained to use a database.
The solution I came up with is to assign each interval a particular numerical code, such that a query of a small set of such codes will guarantee finding each interval that contains a given value.
I'm not sure if this is novel. It seems a fairly obvious thing to do, so probably not. If not, can anyone tell me what it is called?
The code is this: I produce a trinary number from the binary representation of the lower and upper bounds of the interval. Starting with the most significant digit I map 0->1 and 1->2 until the first digit that varies between the lower and upper bounds. From there on I just produce 0s.
An example may make this clear:
Lower 00010101010 Upper 00010101111 Output 00021212000
To perform a query, I similarly convert the binary query number to trinary. The query set is then this trinary number with each least significant digit converted to zero in turn.
An example:
Query 00010101100
Output 00021212211
00021212210
00021212200
00021212000
00021210000
00021200000
00021000000
00020000000
00000000000
Here's the source code:
The source code includes code to test this idea using SQLite.
Update: This is hideous. Sorry. I am working on a much nicer scheme, stay tuned.
[1] Clarification: By efficiently I mean faster than O(n).
18 January 2009, 20:50 UTCSketchifier
Click image to see original.
Requires Python 2, Python Imaging Library, Numpy.
The basic principle is to draw lines but not corners or dots. Lines, and corners and dots can be detected by the eigenvalues and determinant of the Hessian respectively. I also need to blur the image for various reasons, which I attempt to do in as scale-free a way as possible by using a long tailed convolution kernel (of course, you know I'm obsessed by these things), a t-distribution to be precise.
12 January 2009, 8:51 UTCThis is not a tilt-shift photograph
Using tilt-shift photography to make a scene look like a miniature version of itself has been doing the rounds recently. Again. This technique simulates the look of a photograph taken with a macro lens by blurring the top and/or bottom of the image. The effect only works if the scene being photographed recedes into the distance in a fairly uniform way -- otherwise the wrong areas are blurred.
The image above is not a tilt-shift photograph, nor is it a fake tilt-shift photograph.
29 December 2008, 3:53 UTCThe Problem of Good, a Platform Game
21 December 2008, 22:39 UTCThe rise and fall of Christianity in Europe
26 November 2008, 3:37 UTCReligious logic gate
17 November 2008, 9:02 UTCGame theory's influence
4 November 2008, 5:48 UTCKin selection: a worked example
18 September 2008, 22:36 UTCFluff
17 August 2008, 10:11 UTCFurther violence done to time
9 August 2008, 8:20 UTCMythical
24 June 2008, 21:43 UTCMore drum beats
18 May 2008, 9:24 UTCDividing a beat, a la Arbeau
4 May 2008, 0:17 UTCInsane prisoner's dilemma
25 April 2008, 13:53 UTCIntelligent design for the evangelical atheist
13 April 2008, 10:39 UTCRobust topological sorting and Tarjan's algorithm in Python
12 April 2008, 0:28 UTCA Nokia phone has crashed
7 April 2008, 6:27 UTCAMOS message reader/writer module for Python
26 February 2008, 22:30 UTCModern anti-depressants such as Prozac are ineffective
23 February 2008, 10:51 UTCIntroducing Myrialign - align short reads to a reference genome
27 January 2008, 21:34 UTCWhy go insane? Some preliminary notes
15 January 2008, 7:39 UTCRandom color scheme generator
11 December 2007, 8:31 UTCA rest from opponent process
