Do I win a prize?




See Table 2 of "Common genetic variants on 5p14.1 associate with autism spectrum disorders", Wang et al, Nature, April 2009.

    SNP location    My Genotype

    rs4307059       TT
    rs7704909       TT
    rs12518194      AA
    rs4327572       CC
    rs1896731       CC
    rs10038113      CC


One of the problems with this type of research, and indeed with many of the newer techniques in biology, is that each data point has high dimension: here, about half a million dimensions per subject. If you are trying to work out which of those dimensions have significant predictive power, you need to perform half a million significance tests, and the correction for multiple testing is brutal. Furthermore, each data point is expensive to obtain. So you need a lot of data points for any significant result, but often can't afford them.
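To put a rough number on "brutal", here is a minimal sketch using the Bonferroni correction, the simplest and most conservative multiple-testing correction. The paper may well use something different. The 0.05 significance level is my assumption, and 500,000 is just the "about half a million" from above.

```python
# Bonferroni correction: to keep the chance of even one false positive
# across m tests at or below alpha, each individual test must pass at alpha / m.

def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test p-value threshold for a family-wise error rate of alpha."""
    return alpha / m

m_tests = 500_000  # "about half a million" dimensions per subject
alpha = 0.05       # assumed conventional significance level

per_test = bonferroni_threshold(alpha, m_tests)

# Without any correction, this many tests would be expected to pass
# at the 0.05 level by chance alone, even if no SNP had any real effect.
expected_false_positives = alpha * m_tests

print(f"per-test threshold: {per_test:.1e}")
print(f"expected chance hits without correction: {expected_false_positives:.0f}")
```

So each individual SNP has to clear a threshold on the order of one in ten million, which is why so many subjects are needed before anything shows up.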

Ideally, you would want not just to pick out a few interesting dimensions, but to construct a full linear model. However, with half a million coefficients to fit, this would require several million subjects. Prohibitively expensive. Even when the price comes down, finding that many people with the condition in question may be impossible.
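A toy illustration of why a full linear model needs more subjects than dimensions: with fewer subjects than coefficients, least squares can fit any labels perfectly, including pure noise. This is only a sketch with made-up sizes and random data (50 subjects, 500 SNPs, genotypes coded 0/1/2), not real genotypes, and it assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_snps = 50, 500  # toy sizes: far fewer subjects than dimensions

# Random genotypes coded as 0, 1 or 2 copies of an allele (an assumed encoding).
X = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)

# Random case/control labels: by construction there is no signal to find.
y = rng.integers(0, 2, size=n_subjects).astype(float)

# Minimum-norm least-squares fit of y = X @ coef.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Largest error on the training data itself.
train_error = np.max(np.abs(X @ coef - y))

print("rank of X:", rank)
print("worst training error:", train_error)
```

The fit reproduces the random labels essentially exactly, so a perfect fit on the training subjects tells you nothing. The model only starts to mean something once subjects comfortably outnumber coefficients.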

I suspect that principal components analysis requires far fewer data points than constructing a linear model. Or perhaps one can estimate a covariance matrix from a healthy population, and then only require a few people with the condition in question to get useful predictions. I lack a theoretical framework to fit these kinds of tricks into. Suggestions welcome.
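For concreteness, the mechanics of the second idea might look something like the sketch below: find principal directions from a large healthy group, then describe each of a handful of cases by a few scores along those directions. Everything here is an assumption for illustration (random data, toy sizes, 10 components, NumPy), and it says nothing about whether the resulting scores would actually be predictive.

```python
import numpy as np

rng = np.random.default_rng(1)
n_controls, n_cases, n_snps, k = 200, 5, 1000, 10  # toy sizes, chosen arbitrarily

# Random stand-ins for genotypes coded 0/1/2; real data would go here.
controls = rng.integers(0, 3, size=(n_controls, n_snps)).astype(float)
cases = rng.integers(0, 3, size=(n_cases, n_snps)).astype(float)

# Centre on the healthy population's mean genotype.
mean = controls.mean(axis=0)

# SVD of the centred controls: the rows of vt are the principal directions
# of the control covariance matrix, strongest first.
_, s, vt = np.linalg.svd(controls - mean, full_matrices=False)
components = vt[:k]

# Each case is reduced from n_snps numbers to k scores.
case_scores = (cases - mean) @ components.T

print(case_scores.shape)
```

The point of the reduction is that any later model only has to fit k numbers per subject rather than half a million, which is where the hoped-for saving in subjects would come from.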



