TidyBlocks: Guide

A blocks-based tool for tidy data manipulation and analysis. Please see for a free online version or visit [our GitHub repository]().

This work is made freely available under the Hippocratic License. Contributions of all kinds are welcome: please see our contributors' guide to get started, and please note that all contributors are required to abide by our Code of Conduct.

DATA

Colors

color block

The colors dataset has red-green-blue (RGB) values for eleven standard colors: black, red, maroon, lime, green, blue, navy, yellow, fuchsia, aqua, and white Each value is an integer in the range 0…255.

Column Datatype Value
name text color name
red integer (0…255) red value
green integer (0…255) green value
blue integer (0…255) blue value

Earthquakes

earthquakes block

This block provides a subset of US Geological Survey data on earthquakes from 2016.

Column Datatype Value
Time datetime Universal Time Coordinates
Latitude number fractional degrees
Longitude number fractional degrees
Depth_Km number (km) depth in fractional kilometers
Magnitude number Richter scale

Penguins

penguins block
Column Datatype Value
species text type of penguin
island text where penguin was found
bill_length_mm number (mm) length of bill
bill_depth_mm number (mm) depth of bill
flipper_length_mm number (mm) length of flippers
body_mass_g number (g) mass
sex text sex

Sequence

sequence block

Create a sequence of numbers from 1 to N inclusive.

  • name: The name of the column holding the values.
  • range: The upper limit of the range.

User Data

user data block

Use a previously-loaded dataset.

  • drop down: Select dataset by name.

TRANSFORMS

Bin

bin block

Divide data into evenly-spaced bins.

  • column: Name of column containing data.
  • first space: Number of bins (must be 1 or greater).
  • label: Name of new column for bin labels.

Create

create block

Add new columns while keeping existing ones. A column can be replaced if a new column is given the same name as an existing column.

  • new_column: Name for new column.
  • first space: A value or the result of an operation.

Drop

drop block

Discard one or more columns from the data. This block isn't strictly necessary—you can just ignore a column if you don't need it—but dropping columns often makes the display easier to read. This block is the opposite of select.

  • column, column: A comma-separated list of the names of the columns to drop.

Filter

filter block

Keep a subset of rows that pass some test such as age > 65 or country = "Iceland". The test is checked independently for each row, and tests can be combined using the and, or, and not blocks.

  • expression: the test each row must pass to be included in the result.

Group By

grouping block

Most data operations are done on groups of records that share values, such as people from the same country. This block adds a new column to the table called _group_ that has a unique value for each group. Grouping can be removed using the ungroup block.

  • column, column: A comma-separated list of the names of the columns to group by. Every unique combination of values in these columns produces one group.

Save As

save as block

Save a result for later inspection or to be combined with a result from another pipeline.

  • name Result name

Select

select block

Choose columns from a table: columns that are not named will be dropped. This block is not strictly necessary, since unneeded columns can simply be ignored, but discarding unneeded columns can make the display easier to read. This block is the opposite of drop.

  • column, column: One or more columns to keep.

Sort

sort block

Sort the rows in a table according to the values in one or more columns.

  • column, column: A comma-separated list of the names of the columns to sort by.
  • descending: If checked, sort in descending order (i.e., greatest value first).

Summarize

summarize block

Summarize the values in one or more columns. If the data has been grouped, a summary value is created for each group. The summary values are put in a new column op_col, e.g., mean_age.

  • drop down: which summarization operation to use.
  • column: which column to summarize.

Running Values

running block

Calculate a running value for a column, such as a running sum. If the data has been grouped, separate running values are calculated for each group. The running values are put in a new column op_col, e.g., index_age.

  • drop down: which summarization operation to use.
  • column: which column to summarize.

Ungroup

ungroup block

Undo grouping created by group by removing the special _group_ column.

Unique

unique block

Discard rows containing redundant values. If several rows have the same values in the specified columns but different values in other columns, one row from that group will be chosen arbitrarily and kept.

  • column, column: One or more columns to check for distinct values.

PLOTS

Bar

bar block

The bar block makes the height of the bar proportional to the number of cases in each group. A bar chart uses height to represent a value, and so the base of the bar must always be shown to produce a valid visual comparison.

  • X_axis: Which column to use for the X axis.
  • Y_axis: Which column to use for the Y axis.

Box

box block

The Tukey box plot block summarizes a distribution of quantitative values using a set of summary statistics. The middle tick in the box represents the median. The lower and upper parts of the box represent the first and third quartile respectively. The whisker spans from the smallest data to the largest data within the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] where Q1 and Q3 are the first and third quartiles while IQR is the interquartile range (Q3-Q1). Any outlier points beyond the whisker are displayed using point marks.

  • X_axis: Which column to use for the X axis.
  • Y_axis: Which column to use for the Y axis.

Dot

dot block

Display a dot plot.

  • X_axis: Which column to use for the X axis.

Histogram

histogram block

Visualise the distribution of a single continuous variable by dividing the X axis into bins and counting the number of observations in each bin. Histograms display the counts with bars.

  • column: Which column to bin.
  • bins (10): The number of bins.

Scatter

scatter block

Display a scatter plot.

  • X_axis: Which column to use for X coordinates.
  • Y_axis: Which column to use for Y coordinates.
  • color: Which column to use for colors (optional).
  • Add line: Show a linear regression line?

STATISTICS

One-sample T test

ttest_one block

Run a one-sample t-test.

  • column: The column containing the values of interest.
  • mean: The mean to test against.
  • significance: The significance threshold.

Two-sample T test

ttest_two block

Run a paired t-test.

  • column_a: The column containing one set of values.
  • column_b: The column containing the other set of values.
  • significance: The significance threshold.

K-means clustering

k-means clustering block

Use k-means clustering to group data in two dimensions.

  • X_axis: Which column to use for the X axis.
  • Y_axis: Which column to use for the Y axis.
  • number (2): The number of clusters to create.
  • label: The new column containing each row's cluster ID.

Silhouette

silhouette block

Calculate the silhouette for every two-dimensional point in clustered data.

  • X_axis: Which column to use for the X axis.
  • Y_axis: Which column to use for the Y axis.
  • label: The column containing each row's cluster ID.
  • score: The silhouette score for that point.

OPERATIONS

Arithmetic

arithmetic block

This block implements mathematical computations on two values. It accepts numbers, column names, and nested operation blocks.

  • left space: The left-hand side of the operation.
  • drop down: Select addition, subtraction, multiplication, division, remainder, or exponentiation.
  • right space: The right-hand side of the operation.

Negate

negate block

Negate a number.

  • space: The value to negate.

Comparison

comparison block

This block compares two values. It accepts any values on the left and right side and produces either true or false.

  • left space: The left-hand side of the operation.
  • drop down: Select equal, not equal, greater, less or equal, and so on.
  • right space: The right-hand side of the operation.

Maximum and Minimum

maximum/minimum block

This block selects the maximum or minimum value from two different columns.

  • left space: The left-hand side of the operation.
  • drop down: Select maximum or minimum.
  • right space: The right-hand side of the operation.

Logical

logical operation block

This block implements logical operations on two values. It accepts any values on the left and right side and produces either true or false.

  • left space: The left-hand side of the operation.
  • drop down: Select logical AND or logical OR.
  • right space: The right-hand side of the operation.

Note that logical AND is only true if both sides are true, while logical OR is true if either or both sides are true: it is not either-or-both rather than one-or-the-other.

Not

not block

Produce true if the value is false or false if the value is true.

  • space: The value to invert.

Type

type checking block

Check if a value is of a particular type.

  • space: The value to check.
  • drop down: Select the type to convert for.
type conversion block

Convert a value from one type to another.

  • space: The value to convert.
  • drop down: Select the type to convert to.

Dates/Times

datetime block

Extract the year, month, or day from a date/time value.

  • space: The date/time value to convert.
  • drop down: Select the sub-value to extract.

Conditional

conditional block

Select one of two values based on a condition. Any value can be used for the condition or for the results if the condition is true or false, but the values used for the true and false cases must have the same type.

  • first space: The condition to test.
  • second space: The value if the condition is true.
  • third space: The value if the condition is false.

Shift

shift block

Shift values up or down in a column.

  • column: Which column to shift
  • number: How much to shift by (positive to shift up that many places, negative to shift down).

VALUES

Column

column block

Specify the name of a single column in the data.

  • column: The name of the column whose value is desired.

Datetime

datetime block

Specify a fixed date and time.

  • YYYY-MM-DD: The 4-digit year, month, and day joined with dashes.

Logical

logical block

Select a constant logical value.

  • pulldown: Select true or false.

Number

number block

Specify a fixed number.

  • number: The desired number.

Text

text block

Specify a fixed text. The value should not be quoted: any single or double quotes provided will be included in the text.

  • text: The desired text.

Missing Value

missing value block

The special value representing missing data.

Row Number

row number block

Generate the row number, starting from 1.

Exponential Random Value

exponential random block

Generate a random value from the exponential distribution with the rate parameter λ.

  • rate: the rate parameter.

Normal Random Variable

normal random value block

Generate a random value from the normal distribution with mean μ and standard deviation σ.

  • mean: the center of the distribution.
  • std dev: the spread of the distribution.

Uniform Random Variable

uniforn random value block

Generate a random value from the uniform distribution across the given range.

  • low: the low end of the range.
  • high: the high end of the range.

COMBINING

Join

join block

Join two tables by matching values in column X of table A to the values in column Y of table B. If table A contains:

num name
1 p1
2 p2
3 p3

and table B contains:

val label
1 q11
1 q12
3 q3
4 q4

and the tables are joined on num and val, then the final table contains rows for the matches 1 and 3:

_join_ A_name B_label
1 p1 q11
1 p1 q12
3 p3 q3

The new column _join_ contains the values that matched, while the other columns appear as table_column.

  • left_table: The name used to identify a table in a report block.
  • left_column: The column to join on from that table.
  • right_table: The name used to identify a table in a report block.
  • right_column: The column to join on from that table.

Glue

glue block

Combine the rows of two tables to create a new table. The input tables must have the same number of columns, and those columns must have the same names. A new column is added to show where each row came from. For example, if table A contains:

num name
1 p1
2 p2

and table B contains:

num name
2 q2
3 q3

and the label column is called source, then the final table contains:

num name source
1 p1 A
2 p2 A
2 q2 B
3 q3 B
  • left_table: The name used to identify a table in a report block.
  • right_table: The name used to identify a table in a report block.
  • label: The name of the column holding the row's source.

CONTROL

Seed

seed block

Re-seed random number generation. Only one seed block is allowed per program, and seeds are reset before any other pipelines run.

  • text: A random phrase to use as a seed.