topconfects

2020-03-10

Dissatisfaction with p-value usage in science

American Statistical Association statement on p-values, 2016

what p-values aren’t, what they shouldn’t be used for

Special issue of the ASA’s journal, 2019

many interesting ideas, no overall consensus

Meanwhile, “Scientists rise up against statistical significance”, 2019

800 scientists signed a statement published in Nature.

What goes wrong with p-values?

Selective reporting invalidates Type I error control.

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8):e124.

Argues that amongst significant results reported, a majority are false.
(Note: this is about Positive Predictive Value, PPV=1-FDR, not Type I error rate.)

People read more into a p-value than they should:

p-values are not an effect size.
Dichotomization (p<0.05?) leads to apprarent paradoxes.
“Insignificant” taken as evidence of no effect, but may be lack of power or bad luck.
“Significant” taken to mean large effect, but may be a powerful experiment or luck.

Confidence Intervals are better

Interval given in meaningful units, can judge importance.
Cause of dichotomization “paradoxes” are clear.
Rejects more hypotheses — reject everything outside the interval!

Confidence Intervals are better

Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Taylor & Francis Group, New York and London.

“New Statistics” approach:

Stop dichotomising.
Instead think in terms of measurement accuracy.
Report everything, combine using meta-analysis.

Confidence Intervals are better

Cumming, Geoff. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Taylor & Francis Group, New York and London.

“New Statistics” approach:

Stop dichotomising.
Instead think in terms of measurement accuracy.
Report everything, combine using meta-analysis.

Good prescription for clinical trials or psychology experiments.

Not so great for bioinformatics.

We often consider thousands of p-values, selective reporting is the whole point.

False Discovery Rate

Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.

To select a set of hypotheses to reject, \(S\), from \(n\), with an FDR of \(q\),

choose the largest set satisfying:

\[ S = \left\{ i : p_i \leq {|S| \over n} q \right\} \]

Sets for smaller FDR nest within sets for larger FDR, so can be presented in a way that leaves choice of FDR to reader (idea attributed to Gordon Smyth):

Calculate an “adjusted p-value”.
Reader can read down a sorted list to desired cutoff value.

What about Confidence Intervals?

False Coverage-statement Rate

Benjamini, Y. and Yekutieli, D. (2005). False Discovery Rate–Adjusted Multiple Confidence Intervals for Selected Parameters. Journal of the American Statistical Association, 100(469):71– 81.

After selecting a subset \(S\) out of \(n\) parameters, for an FCR of \(q\),

provide intervals with coverage probability \(1-{{|S| \over n} q}\).

Very broadly applicable:

Example of CIs reported in an abstract.

Caveat:

Point estimates remain biassed by selection.

RNA-seq example with many samples

TCGA breast cancer data.

97 RNA-seq tumor-normal pairs.

Shown is a random sample of 20 genes.

Highly heteroscedastic.

Ranking by p-value

Using this table:
Read down the table until desired FDR reached.

eg 13,784 DE genes at 5% FDR.

Too many discoveries!

What would choosing a stricter FDR actually do?

Sorting by p-value focusses on high signal-noise ratio.

Like using Cohen’s d effect size, but not comparable between experiments.

TREAT solves the problem of too many results

McCarthy, D. J. and Smyth, G. K. (2009). Testing significance relative to a fold-change threshold is a TREAT. Bioinformatics, 25(6):765–771.

Null hypothesis for each gene that absolute LFC is smaller than threshold \(e\).

Specify threshold to be a biologically significant amount.
Table sorted by adjusted \(p\)-values, again allows reader to choose FDR.

But:

FDR of 5% is fine, don’t need to give me this choice!
Not clear how to choose threshold.

(Could also use worst case p-value from a t-test over null hypotheses within the interval. Compared to this, TREAT shrinks p-value by up to half.)

Topconfects improves TREAT usability

TOP CONFident efFECT Sizes

Harrison, P. F., Pattison, A. D., Powell, D. R., and Beilharz, T. H. (2019). Topconfects: a package for confident effect sizes in differential expression analysis provides a more biologically useful ranked gene list. Genome Biology, 20(1):67.

An alternative presentation of TREAT.

Fixed FDR.
Table ranked by “confect” values, allows reader to select threshold \(e\).
Once threshold chosen, confect values are valid FCR adjusted confidence bounds (slightly conservative).

Topconfects method

Let \(S(e)\) be the set of genes selected by TREAT at threshold \(e\), for some fixed FDR (eg 5%).

These sets nest:

If \(e_1 > e_2\) then \(S(e_1) \subseteq S(e_2)\).

Try \(e=0, 0.01, 0.02, 0.03, \dots\)

Call the largest \(e\) for which a gene is a member of \(S(e)\) its “confect”, with appropriate sign.

Topconfects at 5% FDR

Top ranked genes have large LFCs.

Changes are less consistent than top-ranked by p-value, but confidently large on average.

Sorting by p-value, a common default in software, we might have missed some of the largest effects.

Topconfects at 5% FDR, small number of samples

Sub-sample of 10 patients.

Confects much smaller than effects indicates lack of power.

Using `topconfects`

BiocManager::install("topconfects")

As a further step in limma analysis:

fit <- lmFit(y, design)
fit <- contrasts.fit(fit, contrasts)

limma_confects(fit, "contrast_to_test", fdr=0.05)

Also has (slower) functions to follow edgeR or DESeq2 analysis.
Normal- or \(t\)-distributed errors:

normal_confects(estimates, standard_errors, df, fdr=0.05)

Any TREAT-like test:

nest_confects(pvalue_function, n, fdr=0.05)

FCR with simulated data

Select a LFC threshold, obtain a set. What FDR and FCR are achieved?

(15,000 simulated genes, scaled inverse chi-square noise, \(df_\mathrm{within}=2\), \(s_\mathrm{within}=0.75\), t-distributed LFCs, \(df_\mathrm{between}=3\), \(s_\mathrm{between}=0.5\))

FDR: Proportion selected genes actually below threshold.
(Exactly the same as for TREAT test.)

FCR is below allowed level,
confects are conservative.

FCR: Proportion selected genes outside confidence bound.

Runs of confect values

Think about what happens as we dial up the TREAT threshold.

Some gene’s p-value is now above the p-value threshold.
→ The proportion of significant genes goes down.
→ p-value threshold for other genes becomes stricter.

library(topconfects)
normal_confects(effect=c(2.7, 2.6, 2.5, 2.3, 1.5, 0.5), se=1, fdr=0.05)

## $table
##  rank index confect effect
##  1    1     0.517   2.7   
##  2    2     0.517   2.6   
##  3    3     0.517   2.5   
##  4    4     0.420   2.3   
##  5    5        NA   1.5   
##  6    6        NA   0.5   
## 4 of 6 non-zero effect size at FDR 0.05

Several genes can drop out simultaneously.

Runs of FDR adjusted p-values

Think about what happens as we dial down the allowed FDR.

Some gene’s p-value is now above the p-value threshold.
→ The proportion of significant genes goes down.
→ p-value threshold for other genes becomes stricter.

p.adjust(c(0.1, 0.125, 0.15, 0.4, 0.5), method="fdr")

## [1] 0.25 0.25 0.25 0.50 0.50

Several genes can drop out simultaneously.

Thinking in terms of measuremment

~~What is the null hypothesis?~~

What are we measuring?

How accurately was it measured?

Example: gene set enrichment

Tumor-normal LFC as a vector in gene-space.

For each gene-set:

Measure the length of the projection onto a centered indicator variable for the gene set. (Centering makes this “competitive” enrichment.)

Projection is linear, so each patient gives an unbiassed estimate of the population average projection.

Calculate mean and standard error of mean, using all patients.

Having cake: Choose a measurement that selects results most interesting to me.

Eating it too: Assess the accuracy of measurement with patients as the sampling unit, not genes!

If a “p-value” was the only output, would need to give up one or other of these.

Gene set enrichment by confect

The length of the tumor-normal LFC vector is 143. (Largest possible effect size.)

Cancer may have something to do with cell division.

GO BP Term	Confect	Effect (LFC length)	Size	Up	Down
chromosome organization	11.4	15.5	1155	595	320
DNA conformation change	10.0	12.8	292	182	59
system process	-10.0	-13.3	1656	400	884
DNA packaging	9.7	12.1	176	117	36
mitotic cell cycle process	9.6	12.9	827	430	248
mitotic cell cycle	9.6	12.8	967	488	306
nucleosome assembly	9.3	11.6	118	85	20
chromosome segregation	9.2	12.1	311	182	77
DNA metabolic process	9.2	12.1	974	515	259
chromatin assembly	9.1	11.3	137	94	26

(Packages GSVA and singscore also offer sample-level scores, can use with limma then topconfects. Choice of scale/normalization may need tweaking — similar results with singscore after scaling scores by \(\sqrt{\textrm{gene set size}}\) .)

Gene set enrichment by p-value

These sets undoubtedly contain a gene with expression differing from the background.

Top ranked terms miss the importance of cell division in cancer.

GO BP Term	Adjusted p-value	Size	Up	Down
parturition	1.5e-41	14	0	10
regulation of NAD+ ADP-ribosyltransferase activity	1.5e-41	3	3	0
regulation of long-term synaptic potentiation	1.5e-41	43	9	27
negative regulation of intrinsic apoptotic signaling pathway in response to DNA damage by p53 class mediator	2.1e-40	16	12	3
negative regulation of cytosolic calcium ion concentration	5.2e-40	11	1	9
maintenance of rDNA	5.2e-40	2	2	0
negative regulation of insulin secretion	2.8e-39	36	9	21
regulation of systemic arterial blood pressure by vasopressin	2.8e-39	4	0	4
bicarbonate transport	2.8e-39	42	9	24
negative regulation of peptide hormone secretion	3.6e-39	42	10	25

Transition to measurement

More choice:

My choice to use geometric projection implies scaling the indicator variable by \(\approx 1/\sqrt{\textrm{gene set size}}\) then taking the dot product with LFC. Other scaling choices possible.
Interaction effects: Ratio of ratios or difference of proportions?
- Will affect confect ranking.
- No difference for p-values.

Thinking about measurement leads to thinking about bias and reproducibility.

Effect sizes with unbiassed estimators seem preferable for sparse large-\(n\) data such as single cell.

Naive calculation of confects or CIs by test inversion is computationally intensive.

Likelihood ratio, permutation/rotation methods slower.

Summary

Greater use of confidence intervals in science is desirable.
FCR does for CIs what FDR does for p-values.
Described topconfects, a method that gives a useful ranking of results and provides confidence bounds once top results selected.
- Good usability.
- Conservative in simulation.
- FCR concept can be used to validate similar methods (eg Bayesian).
Choice of measurement gives more freedom than choice of null hypothesis.

Acknowledgements: Traude Beilharz, David Powell, Andrew Pattison

Extra slides

Mixed-sign enrichment using squared LFCs

\(E(X_1) = E(X_2) = \mu\) with \(X_1, X_2\) independent \(\implies E(X_1 X_2) = \mu^2\)

Multiplying estimated LFC of a gene from two different patients provides an unbiassed estimate of the squared LFC. Average over all pairs of patients. Projection to measure enrichment as before.

Jacknife to estimate standard error.

Term	Confect	Effect (LFC² length)	Size	Up	Down
system process	28.1	39.4	1656	400	884
multicellular organismal process	28.1	38.2	6614	2173	2932
extracellular structure organization	25.2	32.5	404	143	179
tendon development	24.0	33.9	6	4	2
extracellular matrix organization	23.3	29.9	357	129	155
anatomical structure development	23.1	30.5	5426	1814	2386
system development	22.5	29.8	4430	1478	1970
developmental process	22.5	29.8	5789	1950	2529
response to chemical	22.4	31.5	4077	1442	1733
nervous system process	22.4	29.4	952	236	487

Dissatisfaction with p-value usage in science

What goes wrong with p-values?

Confidence Intervals are better

Confidence Intervals are better

Confidence Intervals are better

False Discovery Rate

False Coverage-statement Rate

RNA-seq example with many samples

Ranking by p-value

TREAT solves the problem of too many results

Topconfects improves TREAT usability

Topconfects method

Topconfects at 5% FDR

Topconfects at 5% FDR, small number of samples

Using topconfects

FCR with simulated data

Runs of confect values

Runs of FDR adjusted p-values

Thinking in terms of measuremment

Example: gene set enrichment

Gene set enrichment by confect

Gene set enrichment by p-value

Transition to measurement

Summary

Acknowledgements: Traude Beilharz, David Powell, Andrew Pattison

Extra slides

Mixed-sign enrichment using squared LFCs

Using `topconfects`