It has been another busy week for TidyBlocks:
we have a beautiful new layout courtesy of Maya Gans,
we can download SVG versions of all our blocks at once thanks to Justin Singh,
and we have a new Twitter account @tidyblocks.
We have also started to add more statistical blocks
so that people can use TidyBlocks in high school and college classrooms.
We are basing our work on the Simple Statistics library
created by Tom MacWright:
it is well documented,
only 20kb in compressed form,
and has most of the functions we need.
But "most" isn't "all",
so I submitted our first pull request this week
to add k-means clustering.
I learned something interesting along the way:
there's a big difference between algorithms and implementations.
The naïve k-means clustering algorithm is pretty simple:
pick k initial centroid locations,
assign each point to its nearest centroid,
move each centroid to the average of the points assigned to it,
and repeat the last two steps until the assignments stop changing.
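In plain JavaScript, the naive algorithm might look like this
(a throwaway sketch of my own, not the Simple Statistics implementation;
`kMeans` and `nearest` are invented names):

```javascript
// Index of the centroid nearest to a (1-D) point.
const nearest = (point, centroids) => {
  let best = 0;
  for (let i = 1; i < centroids.length; i += 1) {
    if (Math.abs(point - centroids[i]) < Math.abs(point - centroids[best])) {
      best = i;
    }
  }
  return best;
};

// Naive k-means: assign points, move each centroid to the mean of
// its points, and repeat until the assignments stop changing.
const kMeans = (points, centroids) => {
  let labels = points.map((p) => nearest(p, centroids));
  for (;;) {
    centroids = centroids.map((c, i) => {
      const mine = points.filter((p, j) => labels[j] === i);
      // NOTE: if `mine` is empty, this divides zero by zero --
      // exactly the problem discussed below.
      return mine.reduce((sum, p) => sum + p, 0) / mine.length;
    });
    const next = points.map((p) => nearest(p, centroids));
    if (next.every((l, j) => l === labels[j])) {
      return { labels, centroids };
    }
    labels = next;
  }
};
```

Running `kMeans([0, 1, 9, 10], [2, 8])` splits the points into `[0, 1]` and `[9, 10]`
with final centroids `0.5` and `9.5`.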
But consider this one-dimensional case on the unit interval:
the points are at 0.0 and 0.5,
and your initial centroids are at 0.25 and 1.0.
Both points belong to Group 0,
so Group 1 is empty.
How do you calculate its new position?
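We can work that case through directly in plain JavaScript
(no library assumed) to see what goes wrong:

```javascript
// The two points and the initial centroids from the example above.
const points = [0.0, 0.5];
const centroids = [0.25, 1.0];

// Each point picks its nearest centroid: both are closer to 0.25.
const labels = points.map((p) =>
  Math.abs(p - centroids[0]) <= Math.abs(p - centroids[1]) ? 0 : 1
);
// labels is [0, 0], so Group 1 has no members.

// Naively moving centroid 1 to the mean of its (zero) members
// divides zero by zero, which is NaN in JavaScript.
const members = points.filter((_, i) => labels[i] === 1);
const updated = members.reduce((sum, p) => sum + p, 0) / members.length;
```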
So far as I can tell by looking at a couple of other implementations (thanks, Julia)
and asking some questions (thanks, Dhavide),
the answer is one of the usual strategies:
leave the friendless centroid where it is,
or re-seed it at a randomly chosen data point.
Neither strategy is well documented (or documented at all).
Libraries do try to prevent this problem from arising in the first place,
typically by choosing distinct data points as the initial centroids
and/or by spreading the initial centroids far apart from one another.
The first strategy clearly prevents friendless centroids on the first pass,
but I cannot find a proof that the second does,
or that either guarantees friendless centroids won't arise eventually.
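Here is a quick sanity check of the first-pass guarantee:
if the initial centroids are seeded at distinct data points,
every data point is at distance zero from the centroid seeded on it,
so each centroid attracts at least that one point in the first assignment.
(The `assign` helper below is my own throwaway code, not Simple Statistics.)

```javascript
// Assign each point to the index of its nearest centroid
// (lowest index wins ties).
const assign = (points, centroids) =>
  points.map((p) => {
    let best = 0;
    for (let i = 1; i < centroids.length; i += 1) {
      if (Math.abs(p - centroids[i]) < Math.abs(p - centroids[best])) {
        best = i;
      }
    }
    return best;
  });

// Seed the centroids at the (distinct) data points themselves:
// each point is distance 0 from its own seed, so no group is empty.
const points = [0.0, 0.5];
const seeds = [0.0, 0.5];
const labels = assign(points, seeds);
const counts = seeds.map((_, i) => labels.filter((l) => l === i).length);
// counts is [1, 1]: neither centroid is friendless on the first pass.
```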
On the other hand,
I can't construct a case that does produce friendless centroids,
which means I can't test the error-handling code in the function I wrote.
K-means clustering should be deployed within the week.
We hope you'll find it useful;
if there are other simple statistical methods you'd like us to add,
please let us know.
— Greg Wilson / 2020-08-15