SNP location    My Genotype
rs4307059       TT
rs7704909       TT
rs12518194      AA
rs4327572       CC
rs1896731       CC
rs10038113      CC

One of the problems with this type of research, and indeed with many of the newer techniques in biology, is that each data point has high dimension: here, about half a million dimensions per subject. If you are trying to work out which of those dimensions have significant predictive power, you need to perform half a million significance tests, and the correction for multiple testing is brutal. Furthermore, each data point is expensive to obtain. So you need a lot of data points for any significant result, but often can't afford them.
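To put a rough number on the multiple-testing penalty, here is a minimal sketch using the simplest correction, Bonferroni, which divides the significance level by the number of tests. The 0.05 level and the choice of Bonferroni are assumptions made for illustration; other corrections exist and may be less severe.

```python
# Bonferroni correction for roughly half a million SNP tests (illustrative numbers).
n_tests = 500_000   # about half a million dimensions per subject
alpha = 0.05        # assumed family-wise significance level

# Each individual test must clear this threshold to keep the overall level at alpha.
per_test_threshold = alpha / n_tests

# If no SNP had any real effect, this many would still pass an uncorrected 0.05 test.
expected_false_positives = alpha * n_tests

print(f"per-test p-value threshold: {per_test_threshold:.1e}")
print(f"expected false positives with no correction: {expected_false_positives:.0f}")
```

Under these assumptions each SNP has to reach roughly p < 1e-7, and without any correction about 25,000 SNPs would pass at the 0.05 level by chance alone.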

Ideally, you would want not just to pick out a few interesting dimensions, but to construct a full linear model. However, this would require several million subjects, which is prohibitively expensive. Even when the price comes down, finding that many people with the condition in question may be impossible.
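A small synthetic sketch of why a full linear model needs more subjects than dimensions: with fewer subjects than SNPs, ordinary least squares can fit a pure-noise phenotype exactly, so a perfect fit tells you nothing. The toy sizes, the 0/1/2 genotype coding, and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_snps = 20, 200   # toy sizes: far fewer subjects than SNPs

# Synthetic genotypes coded as 0, 1 or 2 copies of an allele (assumed coding).
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)

# The phenotype is pure noise: by construction no SNP has any predictive power.
phenotype = rng.normal(size=n_subjects)

# Minimum-norm least-squares solution of genotypes @ weights = phenotype.
weights, _, rank, _ = np.linalg.lstsq(genotypes, phenotype, rcond=None)

# Largest error of the fitted model on the very data it was fitted to.
max_residual = np.max(np.abs(genotypes @ weights - phenotype))

print("rank of genotype matrix:", rank)
print("largest training residual:", max_residual)
```

The training residual comes out at the level of floating-point rounding even though there is nothing to find, which is the overfitting that a much larger number of subjects is meant to rule out.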

I suspect that principal components analysis requires far fewer data points than constructing a linear model. Or perhaps one can estimate a covariance matrix from a healthy population, and then need only a few people with the condition in question to get useful predictions. I lack a theoretical framework to fit these kinds of tricks into. Suggestions welcome.
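As a sketch of the first idea only: principal components can at least be computed when subjects are far fewer than dimensions, by taking an SVD of the centered data matrix. With n subjects there are at most n - 1 components with nonzero variance. This shows computability on synthetic data, not that the components would be statistically reliable at that sample size; the sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_snps = 50, 2000   # toy sizes: far fewer subjects than SNPs

# Synthetic genotypes coded as 0, 1 or 2 copies of an allele (assumed coding).
genotypes = rng.integers(0, 3, size=(n_subjects, n_snps)).astype(float)

# PCA via SVD of the column-centered matrix; the full SNP-by-SNP covariance
# matrix is never formed.
centered = genotypes - genotypes.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

scores = U * s                      # coordinates of each subject on each component
explained = s**2 / np.sum(s**2)     # fraction of total variance per component

# Centering removes one degree of freedom, so at most n_subjects - 1 components
# carry variance; the rest are numerically zero.
n_nonzero = int(np.sum(s > 1e-8 * s[0]))

print("components with nonzero variance:", n_nonzero)
print("variance explained by first component:", explained[0])
```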