pfh blog2021-04-10T05:53:57ZIdeas for free software projects, politics, philosophy, amateur psychology.Paul Harrisonpfh@logarithmic.netWe've been doing k-means wrong for more than half a century2021-04-10T05:53:57Z2021-04-10T05:53:57Zhttps://www.logarithmic.net/pfh/blog/01618034037
<p><ul><li><a href="https://www.logarithmic.net/pfh-files/blog/01618034037/k-means-better.html">We've been doing k-means wrong for more than half a century</a></li></ul>
<p><a href="https://www.logarithmic.net/pfh/blog/01601166184">(previously)</a>
<p><b>Updated 2021-06-04:</b> The k-means++ implementation I was using previously appears to have been flawed. I've updated results using a better implementation.
<p>The above report focusses on R. <a href="https://twitter.com/ctwardy">@ctwardy</a> has <a href="https://ctwardy.micro.blog/2021/06/10/on-kmeans-clustering.html">replicated the basic result here and done some further exploration in Python</a>.
<p><b>Updated 2021-06-19:</b> Added Appendix 2, sketching an argument that the asymptotic density of k-means++ is optimal.k-means the diversifier, the deviralizer2020-09-27T00:23:04Z2020-09-27T00:23:04Zhttps://www.logarithmic.net/pfh/blog/01601166184
<p>For a collection of points, the k-means algorithm seeks a set of k "mean" points minimizing the sum of squared distances from each point to its nearest mean. k-means is a simple way of clustering data. It has a fast approximate algorithm to find a local optimum, but this might not be sufficient for the application I am talking about here, which needs something like a truly global optimum. It can also be viewed as a way of approximating a dataset using a smaller number of points, even if it does not consist of distinct "clusters".
<p>I've recently become <a href="http://logarithmic.net/weitrix/reference/well_knotted_spline.html">interested</a> in the behaviour of k-means for large k. What is the distribution of the means compared to the original distribution of vectors, as k becomes large but assuming n is always much larger?
<p>In one dimension <a href="https://dspace.mit.edu/handle/1721.1/46876">Wong (1982)</a> has shown that the density distribution of means is proportional the cube root of the original density distribution. Raising a density to a fractional power such as here 1/3 has a flattening, widening effect. Peaks are lowered and tails are fattened. After some rough calculation (see end), in d dimensions I believe the distribution will be proportional to the original distribution to the power d/(2+d). Altering the distance metric (k-medians, etc) I think will result in a different power.
<p>So, for any collection of things where we have a notion of distance, k-means provides a way to flatten and summarize the distribution. (If we don't want to interpolate, we could limit means to members of the original collection.)
<p>In a collection in which most of the variation is only in a subset of dimensions I think the effective d will depend on k. For example, for moderate k the effective dimension might be 1 or 2, but for large k one gets into the fine structure, the effective d rises, and the flattening effect is reduced.
<p>This idea of flattening a distribution seems useful, so an algorithm that does it is exciting:
<p><ul><li>From a news feed, pick items covering the diversity of opinions. Do not concentrate too much on wild outliers, but also do not concentrate too much on the most common themes.</li></ul>
<p><ul><li>In a shop, stock items to maximize the average ability to meet the needs of customers. Some stocked items will have higher turnover than others.</li></ul>
<p><ul><li>Looking to employ k people, choose a set of people who will bring diverse skills. For tasks that arise we want to ensure we've employed someone with skills fairly well matching the task. We have an anticipated distribution of tasks. The possible "means" are the available applicants.</li></ul>
<p>It also seems reminiscent of Wikipedia pages, which tend to cover all the major opinions, not including wild theories but also not entirely focussing on the dominant theory.
<p><br>
<p><b>Update 2020-10-24:</b> I've applied this idea by <a href="/2020/biorxiv/">clustering 2020 bioRxiv abstracts</a>. This uses a "greedy" variant of k-means where means are optimized one after the other. In other words, it is an ordered list in which topics become progressively more specialized. The algorithm I used also has fairly good ability to escape local optima.
<p><br>
<p><b>Further note 2021-03-23:</b> The usual k-means algorithm performs <i>very poorly</i> at finding the global optimum, or even at producing clusterings with the expected properties I have described the global optimum as having. Some improvement may be obtained by initializing cluster membership using Ward agglomerative clustering. In R, use <span class="mono">fastcluster::hclust.vector()</span> followed by <span class="mono">cutree()</span>.
<p>In one dimension, obtain the exact global optimum with <span class="mono">Ckmeans.1d.dp::Ckmeans.1d.dp()</span>.
<p><br>
<p><b>Appendix:</b> A very rough examination of what happens in d dimensions.
<p>Consider two d-dimensional unit hypercubes, containing n1 and n2 points respectively.
<p>Out of k, how shall we allocate the means to the two hypercubes, k1, k2=k-k1?
<p>Within a hypercube, the distance to the nearest mean will typically go proportional to k^(-1/d).
<p>So within a hypercube the sum of squared distances will go approximately like
<p><pre>
SS = c (k^(-1/d))^2 n
= c k^(-2/d) n
</pre>
<p>where c is some constant.
<p>Within two hypercubes we would have
<p><pre>
SS = c k1^(-2/d) n1 + c (k-k1)^(-2/d) n2
</pre>
<p>Assuming k is large enough that we can treat it as effectively continuous, find the minimum by differentiation:
<p><pre>
dSS/dk1 = c n1 (-2/d) k1 ^(-2/d-1) - c n2 (-1/d) (k-k1)^(-2/d-1)
Set dSS/dk1 = 0
=> n1 k1^(-2/d-1) = n2 (k-k1)^(-2/d-1)
=> (k1/(k-k1))^(-2/d-1) = n2/n1
=> ((k-k1)/k1)^(2/d+1) = n2/n1
=> (k2/k1)^(2/d+1) = n2/n1
=> k2/k1 = (n2/n1)^(1/(2/d+1))
=> k2/k1 = (n2/n1)^(d/(2+d))
</pre>
<p>This shows how k means will be allocated between two regions of differing density. Between more regions, each pair of regions will be balanced in this way.
Next steps for video conferencing2020-05-03T00:00:46Z2020-05-03T00:00:46Zhttps://www.logarithmic.net/pfh/blog/01588464046
<p>With everyone using video conferencing a lot more, I have a couple of things on my wishlist.
<p>1. Get the latency down. For my home DSL, it's around 250-500ms. In air, the latency is about 3ms per meter. So there's about two orders of magnitude improvement possible to match in-person meetings. I think each order of magnitude improvement will give qualitative gains. This might need a rethink right down to the hardware level. Our current packet-based approach to networking could be replaced by circuit switching. CPUs could be more about having many cores dedicatied to specific tasks. Audio compression can be done at low latency using prediction rather than block-based Fourier transforms*. Maybe something similar is possible for video?
<p>2. Shared virtual space. The monitors should show the projection of this virtual space from the point of view of the viewer. Depth of field estimation seems to be solved, and for example Apple's new phones seem to be able to do it even with a single camera.
<p><br><br>* I know from my old <a href="https://www.logarithmic.net/pfh/bonk">Bonk</a> project that a predictor based approach can perform similarly to tranformation based lossy compression -- i.e. errors have the same frequency envelope as the signal and are perceptually hidden. Bonk is still packet based, but I'm sure a very low latency streaming version is possible.Slides from a topconfects talk at WEHI2020-03-10T19:22:08Z2020-03-10T19:22:08Zhttps://www.logarithmic.net/pfh/blog/01583868128
<p><ul><li><a href="https://www.logarithmic.net/pfh-files/blog/01583868128/wehi-talk.html">Slides</a></li></ul>
<p>This is a somewhat extended talk I gave at the Walter and Eliza Hall Institute. It goes into some more details about how confect values behave. The gene-set enrichment section is also an improved method that uses an effect size that is a linear function and no longer needs bootstrapping.Slides for topconfects talk at BiocAsia 20192019-11-30T23:14:49Z2019-11-30T23:14:49Zhttps://www.logarithmic.net/pfh/blog/01575155689
<p><ul><li><a href="https://www.logarithmic.net/pfh-files/blog/01575155689/topconfects-biocasia2019.pdf">Slides</a></li></ul>
<p>These are slides for a 15 minute presentation on my <a href="https://bioconductor.org/packages/release/bioc/html/topconfects.html">topconfects</a> Bioconductor package, for <a href="https://bioconductor.github.io/BiocAsia/">BiocAsia 2019</a>.
<p>New material for this presentation is the application to gene set enrichment measurement.
Slides from a talk putting Topconfects in context2019-08-21T23:43:57Z2019-08-21T23:43:57Zhttps://www.logarithmic.net/pfh/blog/01566431037
<p><ul><li><a href="https://www.logarithmic.net/pfh-files/blog/01566431037/numbats-talk.html">Slides</a></li></ul>
<p>This slideshow places my <a href="https://bioconductor.org/packages/release/bioc/html/topconfects.html">Topconfects</a> method in the wider context of the current debate over the use of p-values.E(f(rarefied count)) for consistently biassed transformation2019-08-07T22:25:10Z2019-08-07T22:25:10Zhttps://www.logarithmic.net/pfh/blog/01565216710
<p>This is a small improvement on the log transformation we use in RNA-Seq and scRNA-Seq.
<p><ul><li><a href="https://twitter.com/paulfharrison/status/1157421482093338624">The idea and discussion</a> on Twitter.</li></ul>
<p><ul><li><a href="https://www.logarithmic.net/pfh-files/blog/01565216710/efrare.html">R code implementing the idea.</a></li></ul>Lorne Genome 2019 poster - weighted principal components and canonical correlation with single cell data2019-02-16T21:03:45Z2019-02-16T21:03:45Zhttps://www.logarithmic.net/pfh/blog/01550351025
<p><a href="https://www.logarithmic.net/pfh-files/blog/01550351025/poster-lorne-genome-harrison-et-al-2019.pdf"><img src="https://www.logarithmic.net/pfh-files/blog/01550351025/preview.png"> <ul><li>PDF</li></ul></a>
<p>Poster for <a href="http://www.lornegenome.org/">Lorne Genome conference 2019</a>, looking at single cell data where each gene can produce two measurements: RNA expression level and choice of polyadenylation site. We're not exactly sure what the correct tools for analysing this data are yet, this poster is plays with weighted principal components and canonical correlation. I'm interested in expanding my use of multivariate techniques, there are whole histories of unfamiliar techniques, such as techniques from ecology and Exploratory Factor Analysis methods used in psychology and marketing. Apparently multivariate techniques are particularly <a href="https://projecteuclid.org/euclid.imsc/1207580085">popular in France</a>.
Recommender systems and the viral bubble2018-12-27T06:32:38Z2018-12-27T06:32:38Zhttps://www.logarithmic.net/pfh/blog/01545892358
<p>People worry about being trapped in a filter bubble, but I have a different concern. Amongst content with a viral coefficient close to one, the amount of attention equivalent content receives is highly variable. That is, we are all sucked into the same viral bubble, collectively seeing some things and missing others of equal merit. Furthermore we tend to see viral content over content of more specific interest to us.
<p><a href="https://en.wikipedia.org/wiki/Recommender_system">Recommender systems</a> -- now commonly called "algorithms" -- have the potential to enhance or reduce this effect. Recommender systems as applied to social network content are a little creepy, but also necessary as people build up large numbers of people to follow over time. It is important to see the announcement that a distant acquaintance has cancer, but not the latest cat picture they found funny. With this necessity, perhaps the best we can aim for is that people to have control over their algorithm, rather than being forced to take what Facebook or Twitter (etc) provide.
<p>Recommender systems come in two forms:
<p><ul><li><b>Explicit</b> recommender systems learn from a collection of ratings, and learn to predict the rating of any given content.</li></ul>
<p><ul><li><a href="http://yifanhu.net/PUB/cf.pdf"><b>Implicit</b></a> recommender systems learn only from observing what content a person consumes, and learn to predict what of all possible content they may consume in future.</li></ul>
<p>I include systems which only have a "like" but no "dislike" rating such as Facebook among implicit systems, even though they take direct user input. However it might be that Facebook tracks exactly what it has shown a user, which would bring it closer to an explicit recommender system.
<p>The problem with implicit recommender systems is that they are necessarily biassed by exposure: you can only like or consume something you see. Explicit recommender systems do not <i>necessarily</i> have this problem.
<p>Some regularization is probably needed in a practical explicit recommender system to avoid being swamped by new content with few ratings. Compare <a href="https://www.reddit.com/">"hot"</a> and <a href="https://www.reddit.com/new/">"new"</a> on Reddit. Without regularization, a newish post on reddit with a single vote (other than by the author) will have an unbiassed estimate of the upvote proportion that is either 0% or 100%. Regularization introduces bias, but this can at least be dialled up or down.
<p>One useful observation is that an explict recommender system can use implicit data from other people and still have low bias. The dependent variable is a specific user's ratings. We need "Missing At Random" (MAR) data for this, which means data in which the missingness is not dependent on the rating <i>given the independent variables</i>. Any information that helps predict missingness can be used as an independent variable to reduce bias and increase accuracy.
<p>Having the choice on social networks to use an explicit recommender system algorithm with a bias dial is an important freedom we currently lack.
<p><br>
<p><b>Notes</b>
<p>-- The terms "bias", "regularization", and "Missing At Random" here have technical meanings.
<p>-- njh points out these systems are often thought of in terms of multi-armed bandits. A multi-armed bandit has a record of exactly what it has shown the user (what levers it has pulled), so it is an explicit system with the potential to manage bias. The bandit concept of exploration/exploitation trade-off may be a better way of thinking about what I've called regularization.
<p><br>Ball hypothesis tests2018-10-13T00:27:57Z2018-10-13T00:27:57Zhttps://www.logarithmic.net/pfh/blog/01539390477
<p><a href="https://www.logarithmic.net/pfh-files/blog/01539390477/ballhyp.html">Short note on ball hypothesis tests as a generalization of interval hypothesis tests.</a>
<p><img src="https://www.logarithmic.net/pfh-files/blog/01539390477/ball_small.png">