Outline

These slides: https://tinyurl.com/y3nmue36


RNA-Seq as a typical bioinformatics data type

False Discovery Rates

The wider debate around p-values

False Coverage-statement Rates

topconfects package

RNA-Seq introduction

Biological samples containing mRNA molecules

  ↓ RNA to DNA reverse transcription

  ↓ Fragmentation

  ↓ High-throughput shotgun sequencing (Illumina)

Millions of short DNA sequences per sample, called “reads”

  ↓ “Align” reads to reference genome (approximate string search)

  ↓ Count number of reads associated with each gene

Matrix of read counts

RNA-Seq introduction

~20,000 genes.

Often only 2 or 3 biological samples per experimental group.


Which genes differ in expression level between two groups?

  • log transform read counts
  • perform 20,000 t-tests
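A minimal sketch of this naive approach in R (a counts matrix with genes as rows and a two-level factor group are assumed; names are illustrative):

log_counts <- log2(counts + 1)  # log transform, with a pseudocount for zero counts
p_values <- apply(log_counts, 1, function(y)
  t.test(y[group == "A"], y[group == "B"])$p.value)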

RNA-Seq introduction

Typically done using limma Bioconductor package.

Experimental design may be complicated, so allow any linear model.
Many people’s first encounter with linear models!

limma’s novel feature:

  • genes have similar but not identical variability
  • use an “empirical Bayes” prior for the residual variance,
    as if we have some extra “prior” residual degrees of freedom
Smyth, G. K. (2004). Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):1–25.
(More advanced methods use negative-binomial GLM on counts directly: edgeR, DESeq2)
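A minimal limma sketch (the log2 expression matrix, design matrix, and coefficient index are illustrative):

library(limma)
fit <- lmFit(log2_expr, design)  # a linear model per gene
fit <- eBayes(fit)               # empirical Bayes moderation of residual variances
topTable(fit, coef = 2)          # moderated t-statistics and p-values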

RNA-Seq introduction

We have ~20,000 p-values, one for each gene.

We want to select the significantly differentially expressed genes.

If we select genes with \(p \leq 0.05\), we will get ~1000 “discoveries” purely by chance.

False Discovery Rate (FDR)

Assume the set of true discoveries to be made is much smaller than \(n_\text{gene}\).

For p-value cutoff \(\alpha\) and total discoveries \(k\), the FDR \(q\) will be approximately

\[ q = { n_\text{gene}\alpha \over k } \]
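Worked example: with \(n_\text{gene} = 20000\), cutoff \(\alpha = 0.005\) and \(k = 2000\) discoveries, \(q \approx {20000 \times 0.005 \over 2000} = 0.05\).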

False Discovery Rate (FDR)

\[ q = { n_\text{gene}\alpha \over k } \]

So to achieve a specified FDR \(q\), we need

\[ \alpha = { k \over n_\text{gene} } q \]

The larger the set of discoveries \(k\), the larger the \(\alpha\). Weirdly circular!

Greedily choose the largest \(\alpha\) possible.


“For whoever has, to him more will be given, and he will have abundance; but whoever does not have, even what he has will be taken away from him.”
Matthew 13:12

False Discovery Rate (FDR)

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.


Benjamini and Hochberg proved this works, assuming the tests are independent or positively correlated.

In practice, provide “FDR-adjusted p-values” that let the reader select their desired FDR.

  • sort results by p-value
  • reader takes the head of the sorted table, down to their desired FDR
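In R (pvals assumed to be the vector of per-gene p-values):

fdr <- p.adjust(pvals, method = "BH")  # FDR-adjusted p-values
ord <- order(pvals)                    # sort results by p-value
discoveries <- ord[fdr[ord] <= 0.05]   # head of the sorted table, at a desired FDR of 0.05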

Path dependence

Example RNA-Seq volcano plot (red points significant at FDR 0.05).

\(n_\text{sample}\)-poor, \(n_\text{gene}\)-rich situation, struggling to make any discoveries.

It has made sense to sort results by p-value or plot p-values.

Distributional assumptions are needed: rank-based methods can’t produce small enough p-values!

What if we do make a lot of discoveries?

  • Choose a smaller FDR
  • Ad-hoc combination of p-value and fold-change cutoff

Path dependence

Example GWAS Manhattan plot
(from https://en.wikipedia.org/wiki/Genome-wide_association_study)

Many ’omics datasets follow this \(n_\text{sample}\)-poor, \(n_\text{feature}\)-rich pattern:

  • microarrays
  • Genome Wide Association Study
    (which mutations cause a disease?)
  • protein mass-spectrometry
  • 16S sequencing
    (bacterial species ecology)
  • Bisulfite sequencing, Hi-C, ATAC-seq, ChIP-seq, …
    (DNA epigenetics, physical layout, openness to transcription, associated proteins)

Analysis almost always focussed on p-values.

The FDR of science

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124.

Ioannidis estimates that the Positive Predictive Value (PPV = 1 - FDR) of most fields is less than 50%.

  • Tests performed with \(\alpha=0.05\)
  • Mostly tests where pre-study odds of a true effect are low
  • Mostly underpowered, with a high false-negative probability \(\beta\)

Biases:

  • Hypotheses tested multiple times by different labs
  • Flexibility in analysis
  • Tests selectively reported

Ongoing controversy about p-values

Meanwhile,
“Scientists rise up against statistical significance”

800 scientists signed a statement published in Nature, March 2019

Mostly reasonable, but…

“How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see?”

“This is why we urge authors to discuss the point estimate, even when they have a large P value or a wide interval, as well as discussing the limits of that interval.”



Gelman broadly liked it, Ioannidis disliked the undercurrent

Comment: in bioinformatics, the best-looking noise can be very convincing.

Some specific problems with p-values

  • An “insignificant” result may be due to no effect, a lack of power, or luck

  • A “significant” result may be due to a large effect, a powerful experiment, selective reporting, or luck

  • Dichotomization leads to apparent paradoxes


ASA statement on p-values, March 2016

“4. Proper inference requires full reporting and transparency. P-values and related analyses should not be reported selectively.”

“5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.”

Confidence Intervals are better

\(p \leq 0.05\) iff 95% Confidence Interval doesn’t pass through zero.

Confidence Intervals are better

Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Taylor & Francis Group, New York and London.

“New Statistics” approach:

  • Stop dichotomising
  • Instead think in terms of measurement accuracy
  • Report everything, combine using meta-analysis



But giving up selection isn’t an option for us!

False Coverage-statement Rates

Benjamini, Y. and Yekutieli, D. (2005). False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters. Journal of the American Statistical Association, 100(469):71–81.


After selecting \(k\) of \(n_\text{gene}\) results,

for an FCR of \(q\),

provide \(1 - {k \over n_\text{gene}} q\) confidence intervals.


Any selection rule may be used!



Caveat: point estimates remain biased
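A minimal sketch for normally distributed estimates (function and variable names are illustrative):

fcr_intervals <- function(est, se, selected, q = 0.05) {
  k <- sum(selected)
  n <- length(est)
  level <- 1 - (k / n) * q  # FCR-adjusted confidence level
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(
    estimate = est[selected],
    lower = est[selected] - z * se[selected],
    upper = est[selected] + z * se[selected])
}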

Back to RNA-Seq

I want to move us away from p-values and use confidence bounds.

Sometimes we are not \(n_\text{sample}\)-poor. Sometimes we have too many discoveries!

  • larger scale RNA-Seq studies
  • single cell RNA-Seq

I want to rank the results and let the reader select some to follow up.

\(n_\text{sample}\)-large example

TCGA breast cancer data.

97 RNA-seq tumor-normal pairs.

Shown is a random sample of 20 genes.

Highly heteroscedastic.


Too many discoveries:
  • 13,784 DE genes at 5% FDR

Do we care about:

  • smallest p-values?
  • largest signal-to-noise ratios?
  • largest log2 fold changes (LFC)?

Top table by FDR adjusted \(p\)-value

Focusses on high signal-to-noise ratio.

Top enriched gene-sets to do with cell division.

To follow up a reasonable number of genes, choose a tiny FDR?

TREAT: using an interval null hypothesis

McCarthy, D. J. and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771.

Null hypothesis for each gene: the absolute LFC is smaller than a threshold \(e\).

  • Specify the threshold to be a biologically significant amount
  • Table sorted by adjusted \(p\)-values, again allowing the reader to choose an FDR

But

  • An FDR of 5% is fine, don’t need to give me this choice!
  • Not clear how to choose the threshold

(Could also use the worst-case p-value from a t-test over null hypotheses within the interval. Compared to this, TREAT shrinks the p-value by up to half.)
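In limma this looks something like the following sketch (the fold-change threshold and coefficient name are illustrative):

library(limma)
tfit <- treat(fit, lfc = log2(1.2))     # H0: |LFC| <= log2(1.2)
topTreat(tfit, coef = "treatmentTRUE")  # table sorted by adjusted p-value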

Introducing Topconfects

Harrison, P. F., Pattison, A. D., Powell, D. R., and Beilharz, T. H. (2019). Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biology, 20(1):67.

TOP CONFident efFECT Sizes

An alternative presentation of TREAT:

  • Fixed FDR
  • Ranking by “confect” values, allowing the reader to select the threshold \(e\)

Bonus: “confects” also work as FCR-adjusted confidence bounds.

Topconfects method

Let \(S(e)\) be the set of genes selected by TREAT at threshold \(e\), for some fixed FDR (e.g. 5%).

These sets nest:
if \(e_1 > e_2\) then \(S(e_1) \subseteq S(e_2)\).

Call the largest \(e\) for which a gene is a member of \(S(e)\) its “confect”, with appropriate sign.
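A minimal sketch of this search for normally distributed estimates, using the conservative worst-case interval-null p-value mentioned on the TREAT slide rather than TREAT’s exact p-value (names and the threshold grid are illustrative):

confects_sketch <- function(est, se, fdr = 0.05, es = seq(0, 5, by = 0.01)) {
  confect <- rep(NA_real_, length(est))
  for (e in es) {
    # worst-case two-sided p-value for H0: |effect| <= e (normal approximation)
    p <- pmin(1, 2 * pnorm((abs(est) - e) / se, lower.tail = FALSE))
    selected <- p.adjust(p, method = "BH") <= fdr
    confect[selected] <- e  # overwritten for as long as the gene remains selected
  }
  sign(est) * confect  # signed confect, NA if never selected
}

Because the selection sets nest, the final overwrite records each gene’s largest surviving \(e\).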

Top table by confect at 5% FDR

Top enriched gene-sets to do with blood vessels, cell-cell signalling, cell adhesion.

Topconfects for \(n_\text{sample}\)-small

Sub-sample of 10 patients.

Confects much smaller than effects indicate a lack of power.

Bioconductor package

# ... filter out low expression genes ...
# ... transform to log2 Reads Per Million ...

# Fit linear models with limma
library(limma)
fit <- lmFit(log2_rpm_matrix, design_matrix)
# ... depending on design matrix, might also need contrasts.fit ...

# Calculate confects
library(topconfects)
result <- limma_confects(fit, coef="treatmentTRUE", trend=TRUE)

edger_confects and deseq2_confects are also available, but slower!

normal_confects if you have a collection of estimates with standard errors.
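For example (a sketch; argument order and names assumed, est and se illustrative vectors):

library(topconfects)
normal_confects(est, se, fdr = 0.05)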

Discussion

FCR adjustment provides a very flexible fix for selective reporting of CIs.

Topconfects provides FCR control when selecting the head of a ranked list.

Topconfects tends to be conservative. A Bayesian credible interval method may be more efficient. If the prior distribution includes everything the selector knows, it will be immune from selection.

The FCR concept gives a way to test methods.

Discussion

The methods we use are cultural and path dependent:

  • We started \(n_\text{sample}\)-small, concentrated on significance and controlling FDR
  • GWAS uses a fixed p-value cutoff of 5e-8 rather than FDR!
  • Problems in mass-spectrometry and ecology are similar, but different methods are used

We may be overlooking useful \(n_\text{sample}\)-large methods:

  • sandwich estimator
  • bootstrap

Discussion

Why have current methods worked?

  • Significance implies more than \(\neq 0\):
    it also implies the sign of the effect is known
  • Some effect sizes line up with p-values:
    Cohen’s \(d\), \(R^2\), alignment scores
  • Some things that look like p-values are good effect sizes:
    gene set enrichment

We now have a new degree of freedom: what effect size?

  • For interactions, ratio of ratios or difference of proportions?
  • For gene set enrichment?

Acknowledgements

RNA Systems Biology Lab

Traude Beilharz

Andrew Pattison


Monash Bioinformatics Platform

David Powell

Extra slides

Simulation

\(df_\mathrm{within}=2\), \(s_\mathrm{within}=0.75\), \(df_\mathrm{between}=3\), \(s_\mathrm{between}=0.5\)