# TidyBlocks: Guide

This work is made freely available under the Hippocratic License. Contributions of all kinds are welcome: please see our contributors' guide to get started, and please note that all contributors are required to abide by our Code of Conduct.

# DATA

### Colors

The `colors`

dataset has red-green-blue (RGB) values for eleven standard colors:
black, red, maroon, lime, green, blue, navy, yellow, fuchsia, aqua, and white
Each value is an integer in the range 0…255.

Column | Datatype | Value |
---|---|---|

name | text | color name |

red | integer (0…255) | red value |

green | integer (0…255) | green value |

blue | integer (0…255) | blue value |

### Earthquakes

This block provides a subset of US Geological Survey data on earthquakes from 2016.

Column | Datatype | Value |
---|---|---|

Time | datetime | Universal Time Coordinates |

Latitude | number | fractional degrees |

Longitude | number | fractional degrees |

Depth_Km | number (km) | depth in fractional kilometers |

Magnitude | number | Richter scale |

### Penguins

Column | Datatype | Value |
---|---|---|

species | text | type of penguin |

island | text | where penguin was found |

bill_length_mm | number (mm) | length of bill |

bill_depth_mm | number (mm) | depth of bill |

flipper_length_mm | number (mm) | length of flippers |

body_mass_g | number (g) | mass |

sex | text | sex |

### Sequence

Create a sequence of numbers from 1 to N inclusive.

**name**: The name of the column holding the values.**range**: The upper limit of the range.

### User Data

Use a previously-loaded dataset.

*drop down*: Select dataset by name.

# TRANSFORMS

### Bin

Divide data into evenly-spaced bins.

**column**: Name of column containing data.*first space*: Number of bins (must be 1 or greater).**label**: Name of new column for bin labels.

### Create

Add new columns while keeping existing ones. A column can be replaced if a new column is given the same name as an existing column.

### Drop

Discard one or more columns from the data. This block isn't strictly necessary—you can just ignore a column if you don't need it—but dropping columns often makes the display easier to read. This block is the opposite of select.

**column, column**: A comma-separated list of the names of the columns to drop.

### Filter

Keep a subset of rows that pass some test such as `age > 65`

or `country = "Iceland"`

.
The test is checked independently for each row,
and tests can be combined using the and,
or,
and not blocks.

*expression*: the test each row must pass to be included in the result.

### Group By

Most data operations are done on groups of records that share values, such as people from the same country.
This block adds a new column to the table called `_group_`

that has a unique value for each group.
Grouping can be removed using the ungroup block.

**column, column**: A comma-separated list of the names of the columns to group by. Every unique combination of values in these columns produces one group.

### Save As

Save a result for later inspection or to be combined with a result from another pipeline.

**name**Result name

### Select

Choose columns from a table: columns that are not named will be dropped. This block is not strictly necessary, since unneeded columns can simply be ignored, but discarding unneeded columns can make the display easier to read. This block is the opposite of drop.

**column, column**: One or more columns to keep.

### Sort

Sort the rows in a table according to the values in one or more columns.

**column, column**: A comma-separated list of the names of the columns to sort by.**descending**: If checked, sort in descending order (i.e., greatest value first).

### Summarize

Summarize the values in one or more columns.
If the data has been grouped,
a summary value is created for each group.
The summary values are put in a new column

,
e.g., *op*_*col*`mean_age`

.

*drop down*: which summarization operation to use.**column**: which column to summarize.

### Running Values

Calculate a running value for a column,
such as a running sum.
If the data has been grouped,
separate running values are calculated for each group.
The running values are put in a new column

,
e.g., *op*_*col*`index_age`

.

*drop down*: which summarization operation to use.**column**: which column to summarize.

### Ungroup

Undo grouping created by group by removing the special _group_ column.

### Unique

Discard rows containing redundant values. If several rows have the same values in the specified columns but different values in other columns, one row from that group will be chosen arbitrarily and kept.

**column, column**: One or more columns to check for distinct values.

# PLOTS

### Bar

The bar block makes the height of the bar proportional to the number of cases in each group. A bar chart uses height to represent a value, and so the base of the bar must always be shown to produce a valid visual comparison.

**X_axis**: Which column to use for the X axis.**Y_axis**: Which column to use for the Y axis.

### Box

The Tukey box plot block summarizes a distribution of quantitative values using a set of summary statistics. The middle tick in the box represents the median. The lower and upper parts of the box represent the first and third quartile respectively. The whisker spans from the smallest data to the largest data within the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] where Q1 and Q3 are the first and third quartiles while IQR is the interquartile range (Q3-Q1). Any outlier points beyond the whisker are displayed using point marks.

**X_axis**: Which column to use for the X axis.**Y_axis**: Which column to use for the Y axis.

### Dot

Display a dot plot.

**X_axis**: Which column to use for the X axis.

### Histogram

Visualise the distribution of a single continuous variable by dividing the X axis into bins and counting the number of observations in each bin. Histograms display the counts with bars.

**column**: Which column to bin.**bins (10)**: The number of bins.

### Scatter

Display a scatter plot.

**X_axis**: Which column to use for X coordinates.**Y_axis**: Which column to use for Y coordinates.**color**: Which column to use for colors (optional).**Add line**: Show a linear regression line?

# STATISTICS

### One-sample T test

Run a one-sample t-test.

**column**: The column containing the values of interest.**mean**: The mean to test against.**significance**: The significance threshold.

### Two-sample T test

Run a paired t-test.

**column_a**: The column containing one set of values.**column_b**: The column containing the other set of values.**significance**: The significance threshold.

### K-means clustering

Use k-means clustering to group data in two dimensions.

**X_axis**: Which column to use for the X axis.**Y_axis**: Which column to use for the Y axis.**number (2)**: The number of clusters to create.**label**: The new column containing each row's cluster ID.

### Silhouette

Calculate the silhouette for every two-dimensional point in clustered data.

**X_axis**: Which column to use for the X axis.**Y_axis**: Which column to use for the Y axis.**label**: The column containing each row's cluster ID.**score**: The silhouette score for that point.

# OPERATIONS

### Arithmetic

This block implements mathematical computations on two values. It accepts numbers, column names, and nested operation blocks.

*left space*: The left-hand side of the operation.*drop down*: Select addition, subtraction, multiplication, division, remainder, or exponentiation.*right space*: The right-hand side of the operation.

### Negate

Negate a number.

*space*: The value to negate.

### Comparison

This block compares two values.
It accepts any values on the left and right side
and produces either `true`

or `false`

.

*left space*: The left-hand side of the operation.*drop down*: Select equal, not equal, greater, less or equal, and so on.*right space*: The right-hand side of the operation.

### Maximum and Minimum

This block selects the maximum or minimum value from two different columns.

*left space*: The left-hand side of the operation.*drop down*: Select maximum or minimum.*right space*: The right-hand side of the operation.

### Logical

This block implements logical operations on two values.
It accepts any values on the left and right side
and produces either `true`

or `false`

.

*left space*: The left-hand side of the operation.*drop down*: Select logical AND or logical OR.*right space*: The right-hand side of the operation.

Note that logical AND is only true if *both* sides are true,
while logical OR is true if *either or both* sides are true:
it is not either-or-both rather than one-or-the-other.

### Not

Produce `true`

if the value is `false`

or `false`

if the value is `true`

.

*space*: The value to invert.

### Type

Check if a value is of a particular type.

*space*: The value to check.*drop down*: Select the type to convert for.

Convert a value from one type to another.

*space*: The value to convert.*drop down*: Select the type to convert to.

### Dates/Times

Extract the year, month, or day from a date/time value.

*space*: The date/time value to convert.*drop down*: Select the sub-value to extract.

### Conditional

Select one of two values based on a condition. Any value can be used for the condition or for the results if the condition is true or false, but the values used for the true and false cases must have the same type.

*first space*: The condition to test.*second space*: The value if the condition is true.*third space*: The value if the condition is false.

### Shift

Shift values up or down in a column.

**column**: Which column to shift*number*: How much to shift by (positive to shift up that many places, negative to shift down).

# VALUES

### Column

Specify the name of a single column in the data.

**column**: The name of the column whose value is desired.

### Datetime

Specify a fixed date and time.

**YYYY-MM-DD**: The 4-digit year, month, and day joined with dashes.

### Logical

Select a constant logical value.

*pulldown*: Select`true`

or`false`

.

### Number

Specify a fixed number.

**number**: The desired number.

### Text

Specify a fixed text.
The value should *not* be quoted:
any single or double quotes provided will be included in the text.

**text**: The desired text.

### Missing Value

The special value representing missing data.

### Row Number

Generate the row number, starting from 1.

### Exponential Random Value

Generate a random value from the exponential distribution with the rate parameter λ.

**rate**: the rate parameter.

### Normal Random Variable

Generate a random value from the normal distribution with mean μ and standard deviation σ.

**mean**: the center of the distribution.**std dev**: the spread of the distribution.

### Uniform Random Variable

Generate a random value from the uniform distribution across the given range.

**low**: the low end of the range.**high**: the high end of the range.

# COMBINING

### Join

Join two tables by matching values in column X of table A to the values in column Y of table B. If table A contains:

num | name |
---|---|

1 | p1 |

2 | p2 |

3 | p3 |

and table B contains:

val | label |
---|---|

1 | q11 |

1 | q12 |

3 | q3 |

4 | q4 |

and the tables are joined on `num`

and `val`

,
then the final table contains rows for the matches 1 and 3:

_join_ | A_name | B_label |
---|---|---|

1 | p1 | q11 |

1 | p1 | q12 |

3 | p3 | q3 |

The new column `_join_`

contains the values that matched,
while the other columns appear as

.*table*_*column*

### Glue

Combine the rows of two tables to create a new table. The input tables must have the same number of columns, and those columns must have the same names. A new column is added to show where each row came from. For example, if table A contains:

num | name |
---|---|

1 | p1 |

2 | p2 |

and table B contains:

num | name |
---|---|

2 | q2 |

3 | q3 |

and the label column is called `source`

,
then the final table contains:

num | name | source |
---|---|---|

1 | p1 | A |

2 | p2 | A |

2 | q2 | B |

3 | q3 | B |

# CONTROL

### Seed

Re-seed random number generation. Only one seed block is allowed per program, and seeds are reset before any other pipelines run.

**text**: A random phrase to use as a seed.