16 February 2019, 21:03 UTC Lorne Genome 2019 poster - weighted principal components and canonical correlation with single cell data

Poster for the Lorne Genome conference 2019, looking at single cell data where each gene can produce two measurements: RNA expression level and choice of polyadenylation site. We're not exactly sure yet what the correct tools for analysing this data are; this poster plays with weighted principal components and canonical correlation. I'm interested in expanding my use of multivariate techniques. There are whole histories of unfamiliar techniques, such as techniques from ecology and Exploratory Factor Analysis methods used in psychology and marketing. Apparently multivariate techniques are particularly popular in France.

27 December 2018, 6:32 UTC Recommender systems and the viral bubble

People worry about being trapped in a filter bubble, but I have a different concern. Amongst content with a viral coefficient close to one, the amount of attention equivalent content receives is highly variable. That is, we are all sucked into the same viral bubble, collectively seeing some things and missing others of equal merit. Furthermore we tend to see viral content over content of more specific interest to us.

Recommender systems -- now commonly called "algorithms" -- have the potential to enhance or reduce this effect. Recommender systems as applied to social network content are a little creepy, but also necessary as people build up large numbers of people to follow over time. It is important to see the announcement that a distant acquaintance has cancer, but not the latest cat picture they found funny. With this necessity, perhaps the best we can aim for is that people have control over their algorithm, rather than being forced to take what Facebook or Twitter (etc) provide.

Recommender systems come in two forms:

**Explicit** recommender systems learn from a collection of ratings, and learn to predict the rating of any given content.

**Implicit** recommender systems learn only from observing what content a person consumes, and learn to predict what of all possible content they may consume in future.

I include systems which only have a "like" but no "dislike" rating such as Facebook among implicit systems, even though they take direct user input. However it might be that Facebook tracks exactly what it has shown a user, which would bring it closer to an explicit recommender system.

The problem with implicit recommender systems is that they are necessarily biassed by exposure: you can only like or consume something you see. Explicit recommender systems do not *necessarily* have this problem.

Some regularization is probably needed in a practical explicit recommender system to avoid being swamped by new content with few ratings. Compare "hot" and "new" on Reddit. Without regularization, a newish post on Reddit with a single vote (other than by the author) will have an unbiassed estimate of the upvote proportion that is either 0% or 100%. Regularization introduces bias, but this can at least be dialled up or down.
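A minimal sketch of such a bias dial in Python, assuming the regularization takes the form of pseudo-votes added to the real votes. The function name and the `prior_up` / `prior_down` parameters are hypothetical, chosen for illustration; this is not how Reddit ranks posts.

```python
def upvote_proportion(upvotes, downvotes, prior_up=0.0, prior_down=0.0):
    """Estimate the upvote proportion, optionally shrunk by pseudo-votes.

    prior_up and prior_down are hypothetical pseudo-vote counts. Zero gives
    the unbiased raw proportion; larger values pull the estimate towards
    prior_up / (prior_up + prior_down).
    """
    total = upvotes + downvotes + prior_up + prior_down
    if total == 0:
        raise ValueError("no votes and no pseudo-votes")
    return (upvotes + prior_up) / total


# A newish post with a single upvote.
raw = upvote_proportion(1, 0)                 # no regularization: all or nothing
mild = upvote_proportion(1, 0, 1.0, 1.0)      # one pseudo-vote each way
strong = upvote_proportion(1, 0, 10.0, 10.0)  # dial turned up
print(raw, mild, strong)
```

Turning the pseudo-vote counts up moves the estimate for a lightly rated post towards one half, and turning them down to zero recovers the unbiased but noisy raw proportion.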

One useful observation is that an explicit recommender system can use implicit data from other people and still have low bias. The dependent variable is a specific user's ratings. We need "Missing At Random" (MAR) data for this, which means data in which the missingness is not dependent on the rating *given the independent variables*. Any information that helps predict missingness can be used as an independent variable to reduce bias and increase accuracy.

Having the choice on social networks to use an explicit recommender system algorithm with a bias dial is an important freedom we currently lack.

**Notes**

-- The terms "bias", "regularization", and "Missing At Random" here have technical meanings.

-- njh points out these systems are often thought of in terms of multi-armed bandits. A multi-armed bandit has a record of exactly what it has shown the user (what levers it has pulled), so it is an explicit system with the potential to manage bias. The bandit concept of exploration/exploitation trade-off may be a better way of thinking about what I've called regularization.

13 October 2018, 0:27 UTC Ball hypothesis tests

Short note on ball hypothesis tests as a generalization of interval hypothesis tests.

5 October 2018, 11:17 UTC Weighted least squares

Short note on choosing weights for weighted least squares, mostly for my own future reference.

9 March 2018, 10:59 UTC Determining the sign of an effect size is quite similar from Frequentist and Bayesian perspectives

p-values and confidence intervals on an effect size have this correspondence: if p<0.05, the 95% confidence interval does not contain zero (or choose whatever α cutoff and corresponding 100(1-α)% confidence interval you prefer). This means the interval is either entirely above zero or entirely below zero, which is to say we have determined the sign of the effect size (see previous blog entry).

*Clarification:* The precise guarantee here is "whatever the effect size may be, we will only make a false claim about its sign with probability at most 0.05." We may make no claim at all, and this is counted as not making a false claim.
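The correspondence can be checked numerically. This is a minimal sketch in Python that assumes the simplest case, a normally distributed estimate with known standard error (a z-test rather than a t-test); the function names are made up for illustration.

```python
from statistics import NormalDist


def z_test(estimate, std_error, alpha=0.05):
    """Two-sided p-value and matching confidence interval for an effect.

    Assumes normally distributed noise with a known standard error.
    """
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * std_error
    return p_value, (estimate - half_width, estimate + half_width)


def sign_claim(estimate, std_error, alpha=0.05):
    """Return +1 or -1 if the interval excludes zero, else None (no claim)."""
    _, (low, high) = z_test(estimate, std_error, alpha)
    if low > 0.0:
        return 1
    if high < 0.0:
        return -1
    return None


for estimate in (2.5, 1.0, -3.0):
    p_value, interval = z_test(estimate, 1.0)
    print(estimate, p_value, interval, sign_claim(estimate, 1.0))
```

Whenever the p-value falls below α the interval lies entirely on one side of zero, and `sign_claim` reports that side; otherwise it makes no claim.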

Formally, the p-value is a means of rejecting the hypothesis that the effect size is zero, but it seems it is often more than this. Significant p-values, at least such as can have an associated confidence interval, allow us to reject fully half of the number line of effect sizes.

Where Frequentists like to talk of p-values, Bayesians like to talk of posterior probabilities. It had always seemed to me that this failed at the first hurdle: trying to replicate the t-test. If we take as H0 that the effect size is zero, and as H1 that the effect size is non-zero and hence drawn from some prior distribution, P(H0|y) and P(H1|y) will be dependent on the prior distribution associated with H1, with an overly wide distribution leading to smaller P(H1|y). This seems hopelessly subjective. Furthermore it requires the machinery of measure theory to even represent these peculiar prior beliefs, with a point mass of probability at zero within a continuous distribution.

But now consider an H1 of an effect size less than zero, and an H2 of an effect size greater than zero. A perfectly natural prior belief is that the distribution of the effect size is symmetric around zero. We no longer need a point mass. This still corresponds to the Frequentist test in that we are attempting to determine the sign of the effect size.

For the t-test, there is a choice of prior* such that the p-value is simply twice the Bayesian posterior probability of the less likely hypothesis.

* Improper, but choose a proper prior to get as close as you like.
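A minimal numerical check of this in Python, assuming the simpler normal case with known standard error rather than the t-test proper. With an improper flat prior the posterior for the effect is then normal, centered on the estimate. The function names are hypothetical.

```python
from statistics import NormalDist


def posterior_sign_probabilities(estimate, std_error):
    """P(effect < 0 | y) and P(effect > 0 | y) under a flat prior.

    Assumes normal noise with known standard error, so the posterior
    is Normal(estimate, std_error).
    """
    posterior = NormalDist(mu=estimate, sigma=std_error)
    p_negative = posterior.cdf(0.0)
    return p_negative, 1.0 - p_negative


def two_sided_p_value(estimate, std_error):
    """Frequentist two-sided p-value for a true effect of zero."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - NormalDist().cdf(z))


estimate, std_error = 1.7, 1.0
p_negative, p_positive = posterior_sign_probabilities(estimate, std_error)
print(two_sided_p_value(estimate, std_error), 2.0 * min(p_negative, p_positive))
```

The two printed numbers agree up to floating point error, which is the claimed relationship between the p-value and the posterior probability of the less likely sign.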

**Update:** @higherfiveprime notes that Andrew Gelman (of course) and Francis Tuerlinckx have a paper somewhat related to this. Errors determining the sign *conditional on having confidently determined the sign* are referred to as "Type S" errors, and their point is that these are not controlled by the Frequentist procedure. Frequentist "Type I" errors, which are not conditional on a determination of the sign being made, are still controlled.

For Frequentist Type S error control, it appears you need to perform a False Discovery Rate (FDR) correction (eg Benjamini & Hochberg's method). So now we also have a nice Bayesian equivalent of FDR control!


8 November 2017, 5:34 UTC Topconfects talk

I gave an informal talk today about my Topconfects R package. If you do RNA-seq Differential Expression analysis it may be of interest.

28 October 2017, 3:55 UTC Scatter plots with density quartiles

I think this is a better way to show density in scatter plots.

24 June 2017, 6:44 UTC A Bayesian walks into a bar and observes that a statistical hypothesis has been rejected

A classical statistical test controls the probability P(R|H0) that H0 will be falsely rejected. Conventionally this is controlled to be α=0.05 or α=0.01.

( As a small complication, H0 may be a set of hypotheses. α is the probability of rejection of the hypothesis in the set that is most likely to be rejected. A Bayesian may believe some of the hypotheses are more likely than others, but whatever they believe they will still have that P(R|H0) ≤ α. )

A Bayesian, in order to update their belief in the odds of H1 as opposed to H0, P(H1)/P(H0), also needs to know the statistical power, P(R|H1), the probability that H0 will be rejected if it is actually false. The Bayes Factor is then at least P(R|H1)/P(R|H0), and the odds of the competing hypotheses can be conservatively updated as:

P(H1|R)/P(H0|R) = P(R|H1)/P(R|H0) * P(H1)/P(H0)

So we have the curious conclusion that a Bayesian will pay more heed to a statistical test they believe to have high statistical power. If the Bayesian believes that H0 had little chance of being correctly rejected if false, they will be surprised by the rejection but it will not update their beliefs much.
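To put numbers on this, here is a minimal sketch in Python. It uses α as an upper bound on P(R|H0), so `power / alpha` is a lower bound on the Bayes factor. The function name and the example power values are assumptions for illustration.

```python
def posterior_odds(prior_odds, power, alpha):
    """Conservative posterior odds of H1 against H0 after a rejection.

    power is the Bayesian's P(R|H1). alpha bounds P(R|H0) from above,
    so power / alpha is a lower bound on the Bayes factor.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be in [0, 1]")
    return (power / alpha) * prior_odds


# Same even prior odds and same alpha, different beliefs about power.
well_powered = posterior_odds(1.0, power=0.8, alpha=0.05)
under_powered = posterior_odds(1.0, power=0.06, alpha=0.05)
print(well_powered, under_powered)
```

With even prior odds, the well powered test moves the odds to roughly 16 to 1 in favour of H1, while the under-powered test barely moves them from 1 to 1.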

( The classical test sought only to reject H0, not to confirm H1. If the test rejects H0, perhaps the Bayesian should consider that their H1 was not sufficiently broad an alternative hypothesis. )

Confidence intervals are increasingly accepted as a superior alternative to p-values. A Bayesian argument for this can be given: A confidence interval gives an indication of the accuracy to which an effect has been measured, and hence the statistical power. A Bayesian may use a confidence interval to update their beliefs immediately, whereas they would require further information if only provided with a p-value. ( Leaving aside the technical distinction between confidence intervals and credible intervals, which does not exist for simple tests such as the t-test. )

( The probabilities above are Bayesian -- personal and subjective quantities used to make decisions. If we were talking frequency probabilities, the above would earn a "No!" )

29 May 2017, 23:28 UTC Melbourne Datathon 2017 - my Kaggle entry

- make_db.R - load data into SQLite

- predict.R - make predictions (except for Stan model)

- sparse_glm_csr.stan - Stan model code

- stan.R - make input for Stan model, read results and make final prediction

This is a description of my Melbourne Datathon 2017 Kaggle entry, which came third. See also: Yuan Li's winning entry.

The data this year was drug purchases at pharmacies from 2011-2015. The objective was then to maximize the ROC AUC when predicting the probability of a person purchasing diabetes medication in 2016.

This entry spent quite a lot of time at the top of the leaderboard, however this was merely because the experienced Kaggle players only submitted in the last couple of days. On the plus side, I guess this gave new players a chance to compete with each other rather than be faced with a seemingly insurmountable AUC from the start.

One thing I wondered was how much I should discuss my entry with other people or keep it secret. Well, it was much more fun when I chatted than when I kept silent, and I needn't have worried anyway. The whole thing was a great learning exercise, and I've learned some things sure to be relevant to my own work.

**Features**

All models I used followed a pattern of predicting the binary outcome for each patient (diabetes medication purchased in 2016) from a collection of predictor variables. These predictors were stored in a large, sparse matrix.

I actually used two predictor matrices, one for logistic regression prediction, and a larger one for tree prediction with XGBoost. XGBoost seems able to handle a very large number of predictor variables.

The logistic regression predictors were:

- Purchased a diabetes drug prior to 2016.

- Year of birth, missing values set to 1921.

- Year of birth, missing values set to 2016.

- Known to be female.

- Known to be male.

- Patient location code.

- For each drug, number of purchases, transformed log(1+x).

- For each drug, number of purchases in 2015, transformed log(1+x).

- For each chronic illness, number of drug purchases, transformed log(1+x).

- For each chronic illness, number of drug purchases in 2015, transformed log(1+x).

- For each prescriber, 1 if had a prescription from prescriber else 0.

- For each store, proportion of purchases from store (ie rows normalized to sum to 1).

The log(1+x) transformation worked better than either using the count as-is or an indicator variable of whether the count was non-zero. I view it as an intermediate between these extremes.
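A small illustration in Python of why log(1+x) sits between the two extremes. The counts are made up; the actual entry was written in R.

```python
import math

# Hypothetical purchase counts for one drug across five patients.
counts = [0, 1, 2, 10, 100]

as_is = list(counts)                              # the count unchanged
indicator = [1 if c > 0 else 0 for c in counts]   # bought at all?
logged = [math.log1p(c) for c in counts]          # log(1 + x)

for row in zip(as_is, indicator, logged):
    print(row)
```

Zero stays at zero under all three encodings. The raw count lets a patient with 100 purchases count 100 times as much as a patient with one, the indicator treats them identically, and log(1+x) keeps the ordering while compressing the ratio to single digits.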

Additionally the XGBoost predictor matrix contained number of drug purchases in the second half of 2015, from 2014, from 2013, and from 2012.

All features were standardized to have standard deviation one.

**Principal Component augmentation**

I used the R package irlba to augment the feature matrices with their principal components. Irlba is able to perform SVD on large sparse matrices.

I added 10 principal component columns to the matrix for tree models, and 20 to the matrix for logistic regression models. The main limitation here was that irlba started breaking or running for a very long time when asked for more than this.

One interesting thing here is that this adds an aspect of unsupervised learning that includes the test-set patients.

**Bootstrap aggregation ("bagging")**

I wanted to take model uncertainty into account. If one possible model makes a very strong prediction, and another makes a weaker prediction, I wanted an intermediate prediction. The most likely model might not well characterize the full spread of possibilities.

To this end, each model was bootstrapped 60 times. Probabilities from these 60 bootstraps were averaged. The idea here is to obtain marginal probabilities, integrating over the posterior distribution of models. (Someone more familiar with the theory of bootstrapping could maybe tell me if this is a correct use of it. It certainly improved my AUC.)

Bootstrapping is based on sampling with replacement. My bootstrapper function had one slightly fancy feature: overall, each data point is used the same number of times. Within this constraint, it mimics the distribution produced by sampling with replacement as closely as possible. I call this a "balanced bootstrap".

Since bootstraps always leave out about a third of the data, averaging over predictions from the left-out data provides a way to get probabilities for training-set patients that are not over-fitted. These are then used when blending different models. My "balanced bootstrap" scheme ensured every training-set patient was left out of the same proportion of bootstraps, and I could dial the whole thing down to 2 bootstraps and still have it work if I wanted.
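A minimal sketch of a balanced bootstrap in Python. This is one standard construction (shuffle a pool holding every index once per bootstrap, then cut it into samples) and is not necessarily the scheme used in this entry. In particular it equalizes total usage per data point but does not guarantee that every point is left out of the same proportion of bootstraps. The function names are hypothetical.

```python
import random


def balanced_bootstrap(n_items, n_bootstraps, seed=0):
    """Return n_bootstraps index lists, each of length n_items.

    Across all lists together, every index appears exactly
    n_bootstraps times.
    """
    rng = random.Random(seed)
    pool = list(range(n_items)) * n_bootstraps
    rng.shuffle(pool)
    return [pool[i * n_items:(i + 1) * n_items] for i in range(n_bootstraps)]


def left_out(sample, n_items):
    """Indices not drawn in this bootstrap sample (the out-of-bag items)."""
    drawn = set(sample)
    return [i for i in range(n_items) if i not in drawn]


samples = balanced_bootstrap(n_items=1000, n_bootstraps=60)

# Count how often each data point is used over all bootstraps.
usage = [0] * 1000
for sample in samples:
    for index in sample:
        usage[index] += 1

# Average fraction of the data left out of a bootstrap sample.
mean_left_out = sum(len(left_out(s, 1000)) for s in samples) / (60 * 1000)
print(set(usage), mean_left_out)
```

The left-out indices of each sample are the ones whose predictions would be averaged to get probabilities that are not over-fitted.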

**Blending models**

The probabilities produced by different models were linearly blended so as to maximize AUC. I again used my bootstrapper to perform this, so I had a good prediction of how accurate it would be without needing to make a Kaggle submission. (Once you have a hammer...)
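The blending step can be sketched as follows in Python, assuming just two models and a simple grid search over the blend weight. The toy scores and labels are invented, and in practice the blend would be chosen using left-out predictions as described above. The function names are hypothetical.

```python
def auc(scores, labels):
    """ROC AUC: chance a random positive outscores a random negative.

    Ties count as half a win. Quadratic time, fine for a small example.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def best_blend(scores_a, scores_b, labels, steps=20):
    """Grid search the weight w in w * a + (1 - w) * b to maximize AUC."""
    best_weight, best_auc = 0.0, -1.0
    for k in range(steps + 1):
        w = k / steps
        blended = [w * a + (1.0 - w) * b for a, b in zip(scores_a, scores_b)]
        value = auc(blended, labels)
        if value > best_auc:
            best_weight, best_auc = w, value
    return best_weight, best_auc


labels = [0, 0, 0, 1, 1, 1]
model_a = [0.1, 0.6, 0.3, 0.5, 0.9, 0.7]
model_b = [0.2, 0.1, 0.7, 0.8, 0.6, 0.9]
weight, blended_auc = best_blend(model_a, model_b, labels)
print(auc(model_a, labels), auc(model_b, labels), weight, blended_auc)
```

In this toy data each model makes one ranking mistake on its own, but they make different mistakes, so a blend ranks every positive above every negative.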

On to the models themselves.

**Regularized logistic regression with glmnet**

glmnet fits generalized linear models with a mixture of L2 (ridge) and L1 (lasso) regularization. I used glmnet to perform logistic regression.

glmnet's alpha parameter determines the mixture of L2 and L1 regularization. alpha=0 means entirely L2 and alpha=1 means entirely L1. I found a tiny bit of L1 improved predictions slightly over purely L2 regularization. In the end I made one set of predictions using alpha=0.01 and one with alpha=0.

A neat feature of glmnet is that it fits a range of regularization amounts (lambda parameter) in one go, producing a path through coefficient space. As described above, I am always fitting models to bootstrapped data sets. I choose an optimal value for lambda each time to maximize the AUC for the patients left out by the bootstrap.

I want to mention one strength of glmnet that is not so relevant to a Kaggle competition: if you turn alpha up close or equal to 1, and turn lambda up high enough, it will produce a *sparse* set of coefficients with few enough non-zero coefficients that the model is *interpretable*. Here I've used a low value of alpha to get as accurate predictions as possible, but if I wanted an interpretable model I would do the opposite.

**Gradient boosted decision trees with XGBoost**

Speaking of non-interpretable models, my next and individually most accurate set of predictions came from XGBoost.

XGBoost creates a stack of decision trees, each tree additively refining the predicted log odds ratio for each patient. By piling up trees like this, and rather like glmnet, we have a path from high regularization to low regularization. Again I used the patients left out of the bootstrap sample to decide at what point along this path to stop.

I used a learning rate of eta=0.1, and allowed a maximum tree depth of 12. XGBoost also has lambda and alpha parameters, controlling L2 and L1 regularization respectively. However the meaning of these parameters is different to glmnet: lambda is the amount of L2 regularization and alpha is the amount of L1 regularization. I'm not sure of the units here. I made two sets of predictions, one with lambda=50, alpha=10, and one with lambda=0, alpha=40. L1 regularization can stop tree splitting before the maximum depth of 12 is reached (I think).

Limiting the learning rate and number of trees also constitutes a form of regularization, even without these explicit regularization parameters. I'm not 100% clear on how all these parameters interact.

**Split glmnet**

The most important predictor of buying diabetes drugs in 2016 was buying diabetes drugs prior to 2016. I split the data into two parts on this basis, and fitted glmnet models to each.

**XGBoost on top of glmnet**

I tried running XGBoost starting from half the log odds ratio predicted by glmnet, as an attempt to blend these two approaches more intimately.

**Logistic regression with t distribution priors on coefficients with Stan**

Stan is a language for writing Bayesian models. A Stan model can then be compiled to a C++ program that performs Hamiltonian Monte Carlo sampling.

This was a late addition. I didn't bootstrap this because it does its own sampling from the posterior, just blended it in 10%-90% with the blend of the predictions described above.

L2 regularization can be viewed as a prior belief that coefficients have a normal distribution, and L1 regularization can be viewed as a prior belief that coefficients have a Laplace distribution. However my favourite distribution is Student's t distribution, so I wanted to try this as well. The t distribution looks like a normal distribution, but with fat tails. The Laplace distribution has fatter tails than the normal distribution, but the t distribution's fat tails are fatter still.
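The tail ordering can be checked directly. This is a minimal sketch in Python comparing the three densities far from zero, each at unit scale. The choice of 3 degrees of freedom for the t distribution is an assumption for illustration, not a value taken from the Stan model.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def laplace_pdf(x):
    """Laplace density with location 0 and scale 1."""
    return 0.5 * math.exp(-abs(x))


def student_t_pdf(x, df):
    """Student's t density with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1.0 + x * x / df) ** (-(df + 1) / 2)


for x in (0.0, 2.0, 5.0, 10.0):
    print(x, normal_pdf(x), laplace_pdf(x), student_t_pdf(x, df=3))
```

Far out in the tail the normal density is smallest and the t density is largest, so a t prior objects least to an occasional very large coefficient.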

I constructed a very simple Stan model for logistic regression with t distributed coefficients. To my surprise this worked, and once I figured out how to use the sparse matrix multiplication function it ran within a bearable time. I ran-in the sampler over 100 iterations, then sampled 100 models from the posterior distribution and averaged the predicted probabilities from these models.

I was impressed by Stan, and if I were doing this exercise over would have spent more time on it.

So there you go, that's my Melbourne Datathon 2017 Kaggle entry in all its ~~horror~~ glory.

**Post-competition notes:**

I came third on the final leaderboard.

Yuan Li has described their winning solution, which is xgboost based with much better feature engineering than I've used.

The competition allowed me to submit two final entries, and I submitted one with the Stan model contribution and one without. The one without turned out to be the best of the two, by a hair.

A small attempt at interpretable modelling of the data, using L1 regularized logistic regression.

Better at predicting cessation of diabetes medication than starting it:

1 April 2017, 22:58 UTC Diagrams of classical statistical procedures

No matter what the truth may be, classical statistical procedures only ever reject it with some small specified probability α. This all-possible-worlds counterfactualism requires a kind of thinking filled with 90-degree turns, and it does my head in. Never fear, however, for I have devised the diagrams above to make what is going on clear. While looking at these diagrams I experience the sensation of understanding. I hope to now share this sensation with you.

We shall be concerned with estimating a single parameter, or "effect". There is a true effect, and we have an estimate of it, but we know the estimate is inaccurate and have a probabilistic model describing exactly how inaccurate it will be, P(estimate|true). Here we will assume it has added t-distributed noise with zero mean and known scale.

Consider a diagram with the true value of a parameter as the x-axis, and the estimate as the y-axis. We will color in all points that are non-rejected.

Such a diagram represents a **valid** procedure if, for each true value, the non-rejected region will contain the estimate with probability 1-α. So in assessing validity we look at each **vertical** slice of the diagram.

To **apply** the diagram, given an estimated effect, we look at the corresponding **horizontal** slice of the diagram, and obtain a set of non-rejected true effect sizes.

**Confidence intervals, and lower and upper confidence bounds**

The first diagram shows the smallest possible non-rejection region. This is the diagram for computing confidence intervals. Looking at each vertical slice, the non-rejection region is centered on the point where the estimate is equal to the true value. It covers the densest region, so it can be quite compact and still contain the estimate with probability 1-α. Now looking at each horizontal slice, we see that we will always obtain non-rejection regions centered on the estimate.

The second diagram shows a procedure for obtaining a lower confidence bound on the effect. Looking at each vertical slice, the non-rejection region is no longer centered -- it goes down to negative infinity. However this means the top of the non-rejection region can be moved down slightly compared to the confidence interval diagram and still contain the estimate with probability 1-α. Now looking at each horizontal slice, we see that we will obtain a non-rejection region from slightly below the estimate up to positive infinity, so this diagram is for giving lower bounds on the true effect. It gives a slightly tighter lower bound than the confidence region diagram.

In a similar way we may obtain an upper confidence bound. (However it would not be valid to apply both the lower bound and upper bound procedure to the same data -- in doing so we would risk rejecting the truth with probability 2α. So in this case we would need to use the confidence interval procedure.)
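The vertical and horizontal readings of the first two diagrams can be sketched in Python. For simplicity this assumes normally distributed noise with known scale, rather than the t-distributed noise used in the diagrams; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def confidence_interval(estimate, scale, alpha=0.05):
    """Horizontal slice of the first diagram: non-rejected true effects."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) * scale
    return estimate - c, estimate + c


def lower_bound(estimate, scale, alpha=0.05):
    """Horizontal slice of the second diagram: lower confidence bound."""
    return estimate - STD_NORMAL.inv_cdf(1.0 - alpha) * scale


def vertical_slice_coverage(true_effect, scale, alpha=0.05):
    """Probability the estimate lands in the interval procedure's
    non-rejection region, for one given true effect."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) * scale
    noise = NormalDist(mu=true_effect, sigma=scale)
    return noise.cdf(true_effect + c) - noise.cdf(true_effect - c)


print(confidence_interval(2.0, 1.0), lower_bound(2.0, 1.0))
print([vertical_slice_coverage(t, 1.0) for t in (-3.0, 0.0, 4.0)])
```

Every vertical slice has coverage 1-α whatever the true effect, which is the validity condition, and the lower bound sits a little above the lower end of the confidence interval.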

**t-tests to determine the sign of an effect**

The t-test only tests whether a true effect of zero can be rejected. Having performed this test, what can we say about other effect values?

We have only tested our estimate against the boundaries at a true effect of zero, so we have only compared our estimate to the boundaries in this vertical slice. This splits the confidence interval diagram into three layers, and the confidence bound diagrams into two layers each. Looking at vertical segments within these layers, if any point is non-rejected all must be non-rejected. Filling out the confidence interval diagram and bound diagrams in this way gives us the diagrams for the two-sided and one-sided t-tests.

Examining horizontal lines through these diagrams, possible outcomes are that non-rejected effects are restricted to positive values, or to negative values, or that no rejection occurs. So when the two-sided t-test rejects a true effect of zero it also tells us the sign of the effect.

Similar to the lower and upper bounds, one-sided t-tests need a smaller estimate than the two-sided test to determine if the effect has a certain sign, at the cost of not testing the opposite sign.

From these diagrams we can see that confidence intervals and bounds tell us more than the corresponding t-test. One small virtue of the t-test is that when it is reported a p-value for the test can be quoted, allowing the reader to set their own α and rejecting an effect of zero if p ≤ α. Confidence intervals and bounds need α to be specified before performing the procedure.

**TREAT**

The TREAT procedure (McCarthy and Smyth, 2009) represents a third kind of statistical procedure, a blend between the confidence interval and lower and upper bounds. The authors apply this to microarray and RNA-Seq data analysis, and it has implementations in their limma and edgeR packages. However there is nothing stopping it from being used more generally.

As described in the paper, the TREAT procedure calculates a test statistic and from this a p-value. Translating this into diagram-form: For each true effect size, a non-rejection interval centered on *zero* is found. For effect sizes close to zero, this interval is similar to our confidence interval, but further from zero it resembles an upper or lower bound.

The acceptance intervals are not centered on the true effect as in confidence intervals, but are also not infinitely off-center as in the upper and lower bounds.

Looking at horizontal slices through this diagram, we see that absolute effect sizes smaller than a certain amount will be rejected. This is what TREAT does: shows that the effect size is larger than a specified amount.
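The shape of the TREAT non-rejection region can be computed numerically. This is a minimal sketch in Python that assumes unit-scale normal noise instead of t-distributed noise and finds the zero-centered interval for each true effect by bisection. It illustrates the diagram only and is not the limma or edgeR implementation; the function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def treat_half_width(true_effect, alpha=0.05, tol=1e-10):
    """Half-width c of the zero-centered non-rejection interval [-c, c].

    Solves P(|estimate| <= c | true_effect) = 1 - alpha by bisection,
    assuming estimate = true_effect + standard normal noise.
    """
    def coverage(c):
        return STD_NORMAL.cdf(c - true_effect) - STD_NORMAL.cdf(-c - true_effect)

    low, high = 0.0, abs(true_effect) + 10.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if coverage(mid) < 1.0 - alpha:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


for true_effect in (0.0, 1.0, 2.0, 5.0):
    c = treat_half_width(true_effect)
    print(true_effect, c, c - true_effect)
```

At a true effect of zero the half-width matches the two-sided confidence interval, and far from zero the margin beyond the true effect shrinks to the one-sided bound, matching the blend described above.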

(As TREAT is implemented in limma and edgeR, one specifies a minimum effect size and obtains a p-value. Similar to our conversion from confidence interval to t-test, this would give us a squared up H-shaped region.)

Similarly to the t-test, having obtained statistical significance by TREAT, can we say anything about the sign of the effect? It seems not. There is no horizontal line which entirely rejects one or other sign. However it would be possible to fix this. This is the final diagram, my proposal for a modified TREAT, in which you learn not only that the absolute effect size is larger than some amount, but also whether it is positive or negative. Looking vertically, the non-rejection intervals are about half as far off-center. Looking horizontally, we are now able to determine the sign.

The plot below shows this with t-distributed errors (df=5) and α=0.05. The diagonal lines show the boundaries for confidence intervals, lower and upper bounds, and the central line of estimate=true. You can see there is only a tiny loss of power from this modification. (One would not reasonably attempt statistical analysis with a df smaller than this. For higher df the difference becomes even smaller, but does not entirely disappear.)

**Conclusion**

I hope these diagrams have given you a clearer understanding of some commonly used classical statistical procedures. They've certainly been necessary for me in order to think clearly about the TREAT procedure.

**References**

McCarthy, D. J., and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. *Bioinformatics*, 25(6), 765-771.

16 March 2017, 23:29 UTC Finding your ut-re-mi-fa-sol-la on a monochord, and making simple drinking straw reedpipes

7 December 2016, 0:50 UTC Briefly,

18 October 2016, 5:53 UTC Shiny interactivity with grid graphics

7 August 2016, 2:17 UTC Crash course in R a la 2016 with a biological flavour

17 July 2016, 1:45 UTC Vectors are enough

28 May 2016, 22:04 UTC Sci-Hub over JSTOR

4 November 2015, 23:41 UTC Composable Shiny apps

14 September 2015, 4:39 UTC Recorder technique and divisions in the 16th century

30 July 2015, 0:04 UTC Linear models, a practical introduction in R

7 May 2015, 5:47 UTC When I was a young lad in the '90s

3 November 2014, 11:41 UTC Virtualenv Python+R

24 October 2014, 22:08 UTC Sexism: spreading from computer science to biology

21 August 2014, 6:03 UTC First-past-the-post voting outcomes tend to surprise the candidates

21 August 2014, 2:59 UTC Dates in Google Search aren't trustworthy

27 June 2014, 2:22 UTC Reading "Practical Foundations of Mathematics"

18 May 2014, 10:34 UTC Cellular automaton tiles revisited

7 April 2014, 8:27 UTC Selfish sweep

3 April 2014, 2:40 UTC Bagpipes kickstarter

1 March 2014, 5:36 UTC Tabor pipes on thingiverse

14 February 2014, 7:19 UTC Demakein: introducing --tweak-emission