TidyBlocks: A New Design for Join

Our previous post talked about the problem of representing join operations with blocks. Our existing implementation uses a "listen for a signal" approach, which means there is no visual clue that pipeline C depends on pipelines A and B. In other words, one of the most complex operations we support is invisible in the user interface.
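To make the "listen for a signal" idea concrete, here is a minimal sketch in Python. It is not TidyBlocks' actual code: the `Scheduler` class, its `wait_for` and `notify` methods, and the `inner_join` helper are all hypothetical names invented for illustration. It shows why the dependency is invisible: pipeline C is linked to A and B only through the strings it waits for.

```python
class Scheduler:
    """Hypothetical coordinator: a pipeline waits until named results are signalled."""

    def __init__(self):
        self.results = {}   # name -> finished table
        self.waiting = []   # (names needed, callback) pairs that cannot run yet

    def wait_for(self, names, callback):
        # Register a pipeline that needs every result listed in `names`.
        self.waiting.append((tuple(names), callback))
        self._run_ready()

    def notify(self, name, table):
        # A pipeline signals that its result is available under `name`.
        self.results[name] = table
        self._run_ready()

    def _run_ready(self):
        # Run every waiting pipeline whose inputs have all arrived.
        ready = [w for w in self.waiting if all(n in self.results for n in w[0])]
        self.waiting = [w for w in self.waiting if w not in ready]
        for names, callback in ready:
            callback(*(self.results[n] for n in names))


def inner_join(left, right, key):
    # Tables are lists of dicts; keep row pairs whose `key` values match.
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


scheduler = Scheduler()
joined = []

# Pipeline C: nothing here except the names "A" and "B" ties it to the others.
scheduler.wait_for(("A", "B"), lambda a, b: joined.extend(inner_join(a, b, "id")))

# Pipelines A and B finish independently and signal their results.
scheduler.notify("A", [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}])
scheduler.notify("B", [{"id": 1, "score": 10}])

print(joined)
```

The join only runs once both signals have arrived, which is exactly the relationship a block-based layout would ideally show on the canvas.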

A join combines two tables, so the most obvious visual representation uses connectors like this:

Join using connectors

However, Blockly doesn't support multi-input blocks like this:

Multi-input blocks

Even if it did, multiple joins would quickly become unwieldy:

Multi-input double join

We could instead combine vertical and left-to-right control flow:

Sideways join

but making blocks work both top-to-bottom and left-to-right on the same canvas would be complicated. This model also doesn't scale visually to more complicated joins like (A+B)+(C+D), but we probably don't care: TidyBlocks is only meant to handle cases that come up during first contact with data science, and only a handful of the examples in the introductory stats lessons we've looked at require more than one table.

An alternative is to nest one pipeline in another like the body of a for loop. Maya Gans has prototyped an E-shaped block that treats both incoming data streams as equals:

E-block joins

and we're also considering a C-shaped block where one data stream comes in the top and the other is nested:

C-block joins

But there's a problem. One of the datasets in the C-block diagram above is a top block while the other is nested. To make this work, we could:

  1. Provide a single uncapped Data block to use in either situation. All pipelines would then start without a top (capped) block. This would be easy to implement, but isn't consistent with other blocks-based systems.

  2. Provide both capped and uncapped Data blocks. This will probably be confusing.

  3. Have the Data block change shape depending on where it's placed. We think this is doable, but would probably also be confusing: if a block has a cap, it's signalling pretty clearly that it can't be nested.

  4. Provide a new top block called "Name" to give pipelines names. All Data blocks would then click together on both top and bottom edges no matter where they were placed.

Option #4 appeals at the moment because it would be easy to build and because it would solve the problem of unnamed results: users currently have to add a "Save As" block to the end of a pipeline to name their results. Putting the name at the start and carrying it through might be more intuitive, and would provide a bit of documentation (e.g., a student could call a pipeline "Question 3").
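As a rough illustration of option #4, the Python sketch below models a pipeline as a list of blocks. Everything in it is an assumption made for the example rather than TidyBlocks' design: the `Name`, `Data`, and `Join` classes, the `run` and `run_named` functions, and the list-of-dicts tables are all hypothetical. The point is that the same `Data` block appears both at the top level (after `Name`) and inside the nested pipeline of a C-shaped join.

```python
from dataclasses import dataclass


@dataclass
class Name:
    label: str      # hypothetical top block that names the pipeline


@dataclass
class Data:
    table: list     # a table is a list of dicts in this sketch


@dataclass
class Join:
    key: str
    nested: list    # nested pipeline, like the body of a C-shaped block


def inner_join(left, right, key):
    # Keep row pairs whose `key` values match.
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def run(blocks):
    # Run blocks top to bottom, passing the current table along.
    table = None
    for block in blocks:
        if isinstance(block, Data):
            table = block.table
        elif isinstance(block, Join):
            table = inner_join(table, run(block.nested), block.key)
    return table


def run_named(pipeline):
    # The first block must be a Name; its label is carried through to the result.
    name, *rest = pipeline
    if not isinstance(name, Name):
        raise ValueError("pipeline must start with a Name block")
    return {name.label: run(rest)}


people = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
scores = [{"id": 1, "score": 10}]

pipeline = [
    Name("Question 3"),
    Data(people),                    # Data block at the top level
    Join("id", [Data(scores)]),      # the same kind of Data block, nested
]

print(run_named(pipeline))
```

Because the name is attached at the start, no separate "Save As" step is needed at the end of the pipeline in this model.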

Adding a Name block

We'd be very grateful for input: if you have any suggestions or better ideas, please email us or reach out on Twitter. Thanks in advance: we hope you and yours have a safe and happy New Year.

— Greg Wilson / 2020-12-20