pfh blog: Ideas for free software projects, politics, philosophy, amateur psychology. Paul Harrison, pfh@logarithmic.net
<p><br><b>Diagrams of classical statistical procedures</b> (2017-04-01) http://www.logarithmic.net/pfh/blog/01491087484
<img src="http://www.logarithmic.net/pfh-files/blog/01491087484/acceptance.svg">
<p><br>No matter what the truth may be, classical statistical procedures only ever reject it with some small specified probability α. This all-possible-worlds counterfactualism requires a kind of thinking filled with 90-degree turns, and it does my head in. Never fear, however, for I have devised the diagrams above to make what is going on clear. While looking at these diagrams I experience the sensation of understanding. I hope to now share this sensation with you.
<p>We shall be concerned with estimating a single parameter, or "effect". There is a true effect, and we have an estimate of it, but we know the estimate is inaccurate and have a probabilistic model describing exactly how inaccurate it will be, P(estimate|true). Here we will assume the estimate is the true effect plus t-distributed noise with zero mean and known scale.
<p>Consider a diagram with the true value of a parameter as the x-axis, and the estimate as the y-axis. We will color in all points that are non-rejected.
<p>Such a diagram represents a <b>valid</b> procedure if, for each true value, the non-rejected region will contain the estimate with probability 1-α. So in assessing validity we look at each <b>vertical</b> slice of the diagram.
<p>To <b>apply</b> the diagram, given an estimated effect, we look at the corresponding <b>horizontal</b> slice of the diagram, and obtain a set of non-rejected true effect sizes.
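<p>This vertical/horizontal duality can be checked concretely by simulation: fix a true effect, draw many estimates, and count how often the estimate lands in that vertical slice's non-rejection region. A minimal sketch for the confidence interval procedure, assuming t-distributed noise with df=5 and known scale (the particular numbers are made up for illustration):

```r
# Check vertical-slice validity by simulation: for a fixed true effect,
# the non-rejection region should contain the estimate with probability 1-alpha.
alpha <- 0.05
df    <- 5      # degrees of freedom of the t-distributed noise
scale <- 1      # known noise scale
true  <- 2      # any fixed true effect will do

set.seed(1)
estimates <- true + scale * rt(100000, df=df)

# Non-rejection region for this vertical slice: true +/- qt(1-alpha/2, df)*scale
half_width <- qt(1 - alpha/2, df) * scale
covered <- abs(estimates - true) <= half_width

mean(covered)   # should be close to 1 - alpha = 0.95
```

The same coverage holds whatever value of <span class="mono">true</span> is chosen, which is exactly what validity requires.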
<p><br><b>Confidence intervals, and lower and upper confidence bounds</b>
<p>The first diagram shows the smallest possible non-rejection region. This is the diagram for computing confidence intervals. Looking at each vertical slice, the non-rejection region is centered on the point where the estimate is equal to the true value. It covers the densest region, so it can be quite compact and still contain the estimate with probability 1-α. Now looking at each horizontal slice, we see that we will always obtain non-rejection regions centered on the estimate.
<p>The second diagram shows a procedure for obtaining a lower confidence bound on the effect. Looking at each vertical slice, the non-rejection region is no longer centered -- it goes down to negative infinity. However this means the top of the non-rejection region can be moved down slightly compared to the confidence interval diagram and still contain the estimate with probability 1-α. Now looking at each horizontal slice, we see that we will obtain a non-rejection region from slightly below the estimate up to positive infinity, so this diagram is for giving lower bounds on the true effect. It gives a slightly tighter lower bound than the confidence region diagram.
<p>In a similar way we may obtain an upper confidence bound. <font size=-2>(However it would not be valid to apply both the lower bound and upper bound procedure to the same data -- in doing so we would risk rejecting the truth with probability 2α. So in this case we would need to use the confidence interval procedure.)</font>
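<p>All three procedures are one-liners in R; the only difference is how the rejection probability α is allocated between the two tails. A sketch, again assuming t-distributed noise with known scale and made-up numbers:

```r
alpha <- 0.05
df    <- 5
scale <- 1
est   <- 1.7   # the observed estimate

# Two-sided confidence interval: alpha split evenly between the tails.
ci <- est + c(-1, 1) * qt(1 - alpha/2, df) * scale

# One-sided bounds: all of alpha in one tail, giving a slightly tighter
# bound on that side at the cost of saying nothing about the other.
lower <- est - qt(1 - alpha, df) * scale   # true effect > lower
upper <- est + qt(1 - alpha, df) * scale   # true effect < upper

lower > ci[1]   # TRUE: the one-sided lower bound is tighter
```

(As noted above, it would not be valid to use both one-sided bounds on the same data.)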
<p><br><b>t-tests to determine the sign of an effect</b>
<p>The t-test only tests whether a true effect of zero can be rejected. Having performed this test, what can we say about other effect values?
<p>We have only tested our estimate against the boundaries at a true effect of zero, so we have only compared our estimate to the boundaries in this vertical slice. This splits the confidence interval diagram into three layers, and the confidence bound diagrams into two layers each. Looking at vertical segments within these layers, if any point is non-rejected all must be non-rejected. Filling out the confidence interval diagram and bound diagrams in this way gives us the diagrams for the two-sided and one-sided t-tests.
<p>Examining horizontal lines through these diagrams, possible outcomes are that non-rejected effects are restricted to positive values, or to negative values, or that no rejection occurs. So when the two-sided t-test rejects a true effect of zero it also tells us the sign of the effect.
<p>Similar to the lower and upper bounds, one-sided t-tests need a smaller estimate than the two-sided test to determine if the effect has a certain sign, at the cost of not testing the opposite sign.
<p>From these diagrams we can see that confidence intervals and bounds tell us more than the corresponding t-test. One small virtue of the t-test is that a p-value can be quoted when it is reported, allowing the reader to set their own α and reject an effect of zero if p ≤ α. Confidence intervals and bounds need α to be specified before performing the procedure.
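<p>The correspondence between the two can be seen directly in R: the two-sided t-test rejects a true effect of zero exactly when zero falls outside the 1-α confidence interval. A sketch with made-up numbers:

```r
alpha <- 0.05
df    <- 5
scale <- 1
est   <- 2.6   # the observed estimate

# Two-sided p-value for a true effect of zero.
t_stat <- est / scale
p <- 2 * pt(-abs(t_stat), df)

# The t-test rejects zero exactly when zero lies outside the
# 1-alpha confidence interval.
ci <- est + c(-1, 1) * qt(1 - alpha/2, df) * scale
(p <= alpha) == (ci[1] > 0 | ci[2] < 0)   # TRUE, for any est
```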
<p><br><b>TREAT</b>
<p><a href="https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp053">The TREAT procedure</a>
(McCarthy and Smyth, 2009) represents a third kind of statistical procedure, a blend between the confidence interval and lower and upper bounds. The authors apply this to microarray and RNA-Seq data analysis, and it has implementations in their <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">limma</a>
and
<a href="http://www.bioconductor.org/packages/release/bioc/html/edgeR.html">edgeR</a>
packages. However there is nothing stopping it from being used more generally.
<p>As described in the paper, the TREAT procedure calculates a test statistic and from this a p-value. Translating this into diagram-form: For each true effect size, a non-rejection interval centered on <i>zero</i> is found. For effect sizes close to zero, this interval is similar to our confidence interval, but further from zero it resembles an upper or lower bound.
<p>The acceptance intervals are not centered on the true effect as in confidence intervals, but are also not infinitely off-center as in the upper and lower bounds.
<p>Looking at horizontal slices through this diagram, we see that absolute effect sizes smaller than a certain amount will be rejected. This is what TREAT does: shows that the effect size is larger than a specified amount.
<p>(As TREAT is implemented in limma and edgeR, one specifies a minimum effect size and obtains a p-value. Similar to our conversion from confidence interval to t-test, this would give us a squared up H-shaped region.)
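<p>In diagram-free form, the TREAT p-value can be sketched in a few lines of R. This is a simplified standalone version of the computation described in the paper (limma's actual implementation additionally uses empirical-Bayes moderated standard errors); the function name <span class="mono">treat_pvalue</span> is made up for illustration:

```r
# Simplified TREAT-style p-value for H0: |true effect| <= lfc,
# given an estimate with standard error se and df degrees of freedom.
# The two terms are the upper-tail probabilities relative to the
# null effects at +lfc and -lfc.
treat_pvalue <- function(est, se, lfc, df) {
    t_right <- (abs(est) - lfc) / se
    t_left  <- (abs(est) + lfc) / se
    pt(t_right, df, lower.tail=FALSE) + pt(t_left, df, lower.tail=FALSE)
}

# With lfc=0 this reduces to the ordinary two-sided t-test p-value;
# larger thresholds give larger (more conservative) p-values.
treat_pvalue(2.6, se=1, lfc=0, df=5)
2 * pt(-2.6, 5)
```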
<p>Similarly to the t-test, having obtained statistical significance by TREAT, can we say anything about the sign of the effect? It seems not. There is no horizontal line which entirely rejects one or other sign. However it would be possible to fix this. This is the final diagram, my proposal for a modified TREAT, in which you learn not only that the absolute effect size is larger than some amount, but also whether it is positive or negative. Looking vertically, the non-rejection intervals are about half as far off-center. Looking horizontally, we are now able to determine the sign.
<p>The plot below shows this with t-distributed errors (df=5) and α=0.05. The diagonal lines show the boundaries for confidence intervals, lower and upper bounds, and the central line of estimate=true. You can see there is only a tiny loss of power from this modification. <font size=-2>(One would not reasonably attempt statistical analysis with a df smaller than this. For higher df the difference becomes even smaller, but does not entirely disappear.)</font>
<p><center><img src="/pfh-files/blog/01491087484/plot.svg" height=400></center>
<p><b>Conclusion</b>
<p>I hope these diagrams have given you a clearer understanding of some commonly used classical statistical procedures. They've certainly been necessary for me in order to think clearly about the TREAT procedure.
<p><br><b>References</b>
<p>McCarthy, D. J., and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. <i>Bioinformatics</i>, 25(6), 765-771.
<p><br><b>Finding your ut-re-mi-fa-sol-la on a monochord, and making simple drinking straw reedpipes</b> (2017-03-16) http://www.logarithmic.net/pfh/blog/01489706991
<p>I gave a class at Rowany Festival in 2013 on Guido of Arezzo's music textbook "Micrologus". It also covered a little bit on reedpipes. I don't seem to have posted the handout online. Rectifying that now:
<p>* <a href="http://www.logarithmic.net/pfh-files/blog/01489706991/monochords_and_reedpipes.pdf">Handout (PDF)</a>
<p>* <a href="http://www.logarithmic.net/pfh-files/blog/01489706991/reedpipes2.svg">Reedpipe template (SVG)</a>
<p><br><b>Briefly,</b> (2016-12-07) http://www.logarithmic.net/pfh/blog/01481071828
<p>
<center style="font-size: 400%">MAKE<br><strike>WORK</strike></center>
<p><br><b>Shiny interactivity with grid graphics</b> (2016-10-18) http://www.logarithmic.net/pfh/blog/01476769999
<p>This post may be of interest to a very small number of people. Using a fancy <a href="https://stat.ethz.ch/R-manual/R-devel/library/grid/html/grid-package.html"><span class="mono">grid</span></a>-based graphics library in R such as
<a href="https://cran.r-project.org/web/packages/lattice/index.html"><span class="mono">lattice</span></a> and want some
<a href="http://shiny.rstudio.com/articles/plot-interaction.html">Shiny brushing goodness</a>? Sadly, Shiny only supports base graphics and ggplot2.
<p>Solution: First work out the name of the relevant viewport (<span class="mono">grid.ls(viewports=T,grobs=F)</span>), then use the <a href="https://cran.r-project.org/web/packages/gridBase/index.html"><span class="mono">gridBase</span></a> package to set up a corresponding base graphics viewport.
<p><pre>
output$plot <- renderPlot({
    # ... draw some grid graphics ...

    # Overlay an empty base-graphics plot aligned with the grid viewport,
    # so that Shiny's brush/click coordinates map onto the grid plot.
    seekViewport("...someviewport...")
    par(new=TRUE, plt=gridBase::gridPLT())
    plot(1, type="n", axes=FALSE,
         xlab="", ylab="",
         xlim=c(0,1), ylim=c(0,1),
         xaxs="i", yaxs="i")
})
</pre>
<p>If the viewport isn't named, tell the developer of the package to name their viewports.
<p><br><b>Crash course in R a la 2016 with a biological flavour</b> (2016-08-07) http://www.logarithmic.net/pfh/blog/01470536251
<p><ul><li><a href="https://monashbioinformaticsplatform.github.io/r-more/">Course notes</a></li></ul>
<p>Monash Bioinformatics Platform gave this day-long course on Friday 5 August, 2016 (following an <a href="https://monashbioinformaticsplatform.github.io/r-intro/">introductory course</a> on Thursday). The course style is inspired by Software Carpentry, but currently sacrifices some interaction in order to get where I wanted to go: someone able to do useful bioinformatics work at a grad-student level.
<p>We definitely need to work on getting more interaction going -- everyone was too quiet. This may just have been an effect of packing in so much material. However, plenty of solutions to challenges came in via the chat system, so I'm fairly sure most people were more or less following. The anonymity of chat introduces an interesting dynamic -- students can propose solutions without worrying about defending their ego or propriety, or worrying about stereotype threat.
<p><br><b>Vectors are enough</b> (2016-07-17) http://www.logarithmic.net/pfh/blog/01468719924
<p>Numpy goes to some lengths to support multi-dimensional arrays ("tensors"). This isn't necessary: vectors are enough.
<p>R has a different solution. A data frame (similar to a table in a database) with column vectors for the dimensions and for the value stores the same information, and a data frame can store many other things besides. Sparse tensors are easy, for example.
<p>As Judea Pearl might say, statistics is not about causality. A tensor implies causality, the dimensions of the tensor <i>cause</i> the values it holds. R, being a statistical language, rightly does not require causality to be specified and gains flexibility. Or we might say that the data frame approach does not require the primary key columns to be made explicit. So for example a bijective map need not privilege one direction of mapping (as a Python dict would).
<p>A data frame dense in some of its columns (eg from a full factorial experiment) can be stored more compactly as a tensor, but this is a detail of implementation that could be hidden (eg in a new flavour of <a href="https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html">tibble</a>). Similarly, one might want various indexes to be maintained alongside a data frame.
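<p>As a small illustration, base R already moves freely between the two representations: a "long" data frame with one column per dimension, and a dense tensor rebuilt with <span class="mono">xtabs</span> only when actually needed:

```r
# A 2x3 "tensor" stored as a long data frame: one row per cell,
# with a column for each dimension plus a column for the value.
df <- expand.grid(i=1:2, j=1:3)
df$value <- as.vector(matrix(1:6, nrow=2))

# Rebuild the dense tensor form on demand.
tensor <- xtabs(value ~ i + j, data=df)

# Dropping rows gives a sparse tensor, with no extra machinery needed.
sparse <- df[df$value %% 2 == 1, ]
```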
<p><br><b>Sci-Hub over JSTOR</b> (2016-05-28) http://www.logarithmic.net/pfh/blog/01464473090
<p>I use <a href="http://sci-hub.bz">Sci-Hub</a> as a source for academic literature.
<p>Sci-Hub has the massive advantage over JSTOR that they don't <a href="http://www.theguardian.com/commentisfree/2015/feb/07/aaron-swartz-suicide-internets-own-boy">claim to own the work of millions of academics who worked for the general betterment of humanity, and then hound people who challenge this to suicide</a>. For me this is a compelling advantage.
<p>A basic process of science, publication of results, has become an exercise in restricting access to knowledge for profit. Sci-Hub is the only way most people have of accessing much of the academic literature without encountering endless punitive tolls.
<p><br><b>Composable Shiny apps</b> (2015-11-04) http://www.logarithmic.net/pfh/blog/01446680465
<p><b>Update April 2016:</b> <a href="http://shiny.rstudio.com/articles/modules.html">RStudio now has an official way to write composable shiny "modules"</a>.
<p><br><br><a href="http://shiny.rstudio.com/">Shiny</a> is a library for writing web applications in R. If you use R, I highly recommend it -- it's very clever. If you haven't used R and Shiny, this blog post is going to be pretty incomprehensible.
<p>I wanted a way to <i>compose</i> Shiny apps. I haven't found an official way to do this, so this post details my solution. Maybe some day the clever people at RStudio will release an official way that makes this all moot.
<p><br>You may want to review the tutorial on <a href="http://shiny.rstudio.com/articles/app-formats.html">the various ways a Shiny app may be constructed and run</a>. I will be constructing Shiny apps using the shinyApp function. This can either be done interactively, or from a file "app.R" in a Shiny app directory. Working interactively, an app is launched when it is print()ed, similar to ggplot. So, working interactively, calling a function to create a shiny app will launch it immediately (unless you save it to a variable). Only in R!
<p>You may also want to review <a href="http://shiny.rstudio.com/articles/reactivity-overview.html">how reactive expressions in Shiny work</a>. Recall that it is the job of the server function in a Shiny app to create all the reactive expressions.
<p><br><br>My code can be found in my <a href="https://github.com/MonashBioinformaticsPlatform/varistran">Varistran package</a>, in particular the file <a href="https://github.com/MonashBioinformaticsPlatform/varistran/blob/master/R/shiny_util.R">shiny_util.R</a>.
<p>What I call a "composable Shiny app" is a function in R returning a Shiny app object (so it can be used in the normal ways), but with that app object having some extra fields, <span class="mono">$component_ui</span> containing the UI part of the app and <span class="mono">$component_server</span> containing the server function.
<p><span class="mono">varistran::shiny_plot</span> is an example of a composable Shiny app. Such a function constructs <span class="mono">ui</span> and <span class="mono">server</span> and then calls <span class="mono">varistran::composable_shiny_app(ui,server)</span> to create the app.
<p>Considerations when writing a composable Shiny app:
<p><ul><li>To avoid namespace collisions, the app should be able to give names used in the input and output objects a specified prefix.</li></ul>
<p><ul><li>To provide output usable by other apps, the server function needs somewhere to store reactive expressions. Therefore <span class="mono">composable_shiny_app(ui,server)</span> wraps the server function in code that creates an R environment (ie the type in R that is <i>mutable</i>) and passes this to the server function. This is the only argument to the server function, the usual parameters <span class="mono">input</span> and <span class="mono">output</span> are placed inside the environment.</li></ul>
<p><ul><li>To use output from other apps, arguments to the function should be allowed to themselves be functions. I call these "reactables". These take a single argument, the environment, and hence are able to access reactive expressions that have been placed in the environment.</li></ul>
<p><ul><li>An app incorporating other apps needs to call their server function in its own. This ensures all the apps have a chance to set up their outputs and place reactive expressions in the environment.</li></ul>
<p><h3>Example</h3>
<p>This is all a bit confusing. Reading over the <span class="mono">varistran::shiny_plot</span> function should give you a better idea of how it all fits together. You will note in this function:
<p><ul><li>The argument "callback" should be a function taking env as an argument, and producing a plot. Hence this plot might be based on inputs and reactive expressions from elsewhere.</li></ul>
<p><ul><li>All of the inputs and outputs have a prefix pasted to the front of them.</li></ul>
<p><ul><li>The return value is produced by the function <span class="mono">varistran::composable_shiny_app</span>. This means it can be used immediately by printing out the value, or in the app.R of a Shiny app directory, or as a component in a larger app.</li></ul>
<p><br><br>Example usage:
<p><pre>
install.packages(c("devtools", "shiny"))
devtools::install_github("MonashBioinformaticsPlatform/varistran")

library(shiny)

plot_app <- varistran::shiny_plot(
    function(env) {
        plot((1:10)**env$input$power)
    },
    prefix="myplot"
)

ui <- fluidPage(
    titlePanel("Composable app demo"),
    numericInput("power", "Power", 2),
    plot_app$component_ui
)

server <- function(env) {
    plot_app$component_server(env)
}

varistran::composable_shiny_app(ui, server)
</pre>
<p><a href="http://rnasystems.erc.monash.edu:3838/pfh/2015/example/">Here is the result.</a>
<p>The plot component is able to react to changes from outside, namely the value of <span class="mono">input$power</span>.
<p><br><b>Recorder technique and divisions in the 16th century</b> (2015-09-14) http://www.logarithmic.net/pfh/blog/01442205560
<p>I gave a talk on the book <i>Fontegara</i> (1535) by Sylvestro Ganassi at this year's St. Vitas Dance & Music weekend. This book is a manual on recorder technique and making divisions (improvisation on a ground). Here are my class notes:
<p><ul><li><a href="/pfh-extra/ganassi/">Class notes</a></li></ul>
<p>The actual book and its translation are <a href="http://imslp.org/wiki/Opera_Intitulata_Fontegara_%28Ganassi,_Sylvestro%29">available online from IMSLP</a>, and are quite readable if you are into that sort of thing.
<p><br><b>Linear models, a practical introduction in R</b> (2015-07-30) http://www.logarithmic.net/pfh/blog/01438214698
<p><ul><li><a href="http://nbviewer.ipython.org/url/rnasystems.erc.monash.edu/~pfh/linear-models.ipynb">IPython notebook</a></li></ul>
<p>These are slides to a talk I gave introducing linear models and nested model hypothesis testing in R. These are the basis of many common statistical tests.