These blocks can be used after you’ve brought a data block into the workspace and want to tidy or transform the data. If you need to create a new variable, summarize the data, or just rename a column, these are the blocks for you.
Discard one or more columns from the data. This block isn’t strictly necessary—you can just ignore a column if you don’t need it—but dropping columns often makes the display easier to read. This block is the opposite of select.
- column, column: A comma-separated list of the names of the columns to drop.
Keep a subset of rows that pass some test, such as age > 65 or country = "Iceland". The test is checked independently for each row, and tests can be combined using the and, or, and not blocks.
- expression: The test each row must pass to be included in the result.
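As a rough illustration of row-by-row filtering, here is a minimal Python sketch. It models a table as a list of dicts, which is an assumption made for this example and not how the blocks are implemented; `filter_rows` and the sample data are hypothetical.

```python
# Illustration only: a table is modelled as a list of dicts (one dict per row).
rows = [
    {"name": "Ana", "age": 70, "country": "Iceland"},
    {"name": "Bo", "age": 40, "country": "Iceland"},
    {"name": "Cy", "age": 81, "country": "Norway"},
]


def filter_rows(rows, test):
    """Keep only rows for which test(row) is true; each row is checked on its own."""
    return [row for row in rows if test(row)]


# Equivalent of combining tests: age > 65 and not (country = "Iceland")
kept = filter_rows(
    rows,
    lambda r: r["age"] > 65 and not (r["country"] == "Iceland"),
)
print(kept)
```

Because the test only ever sees one row at a time, it cannot compare a row with its neighbours.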
Most data operations are done on groups of records that share values, such as people from the same country. This block adds a new column to the table called _group_ that has a unique value for each group. Grouping can be removed using the ungroup block.
- column, column: A comma-separated list of the names of the columns to group by. Every unique combination of values in these columns produces one group.
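The effect of grouping can be sketched in Python as follows. This is an illustration under assumptions: the table is a list of dicts, group values are numbered from 1 in order of first appearance (the text only says they are unique), and `group_by` and `ungroup` are hypothetical helper names.

```python
def group_by(rows, columns):
    """Add a _group_ column: one distinct value per unique combination of `columns`."""
    group_ids = {}
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        # Assumption: groups are numbered 1, 2, 3, ... in order of first appearance.
        group_id = group_ids.setdefault(key, len(group_ids) + 1)
        result.append({**row, "_group_": group_id})
    return result


def ungroup(rows):
    """Remove the _group_ column, leaving all other columns untouched."""
    return [{k: v for k, v in row.items() if k != "_group_"} for row in rows]


rows = [
    {"name": "Ana", "country": "Iceland"},
    {"name": "Bo", "country": "Iceland"},
    {"name": "Cy", "country": "Norway"},
]
grouped = group_by(rows, ["country"])
print(grouped)
```

Here the two rows for Iceland share one group value and the Norway row gets another; ungrouping returns the original table.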
Calculate new values
Add new columns while keeping existing ones. A column can be replaced if a new column is given the same name as an existing column.
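A minimal Python sketch of this behavior, assuming a table is a list of dicts; `mutate` is a hypothetical helper, not part of the tool:

```python
def mutate(rows, name, compute):
    """Return rows with column `name` set to compute(row).

    Existing columns are kept; if `name` already exists, its values are replaced.
    """
    return [{**row, name: compute(row)} for row in rows]


rows = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]

# A new name adds a column alongside the existing ones.
added = mutate(rows, "total", lambda r: r["x"] + r["y"])

# Reusing an existing name replaces that column.
replaced = mutate(rows, "x", lambda r: r["x"] * 10)
print(added, replaced)
```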
Choose columns from a table: columns that are not named will be dropped. This block is not strictly necessary, since unneeded columns can simply be ignored, but discarding unneeded columns can make the display easier to read. This block is the opposite of drop.
- column, column: One or more columns to keep.
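To show how keeping and dropping columns mirror each other, here is a small Python sketch. It assumes a table is a list of dicts; `select` and `drop` are hypothetical helpers written for illustration.

```python
def select(rows, columns):
    """Keep only the named columns; every other column is discarded."""
    return [{c: row[c] for c in columns} for row in rows]


def drop(rows, columns):
    """Discard the named columns; every other column is kept."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


rows = [{"name": "Ana", "age": 70, "country": "Iceland"}]

kept = select(rows, ["name", "age"])
dropped = drop(rows, ["country"])
print(kept, dropped)
```

For this table, selecting name and age gives the same result as dropping country.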
Sort the rows in a table according to the values in one or more columns.
- column, column: A comma-separated list of the names of the columns to sort by.
- descending: If checked, sort in descending order (i.e., greatest value first).
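Sorting by several columns can be sketched in Python like this. The list-of-dicts table and the `sort_rows` helper are assumptions for illustration; the sketch also assumes the descending option applies to all listed columns at once, which the description above does not spell out.

```python
def sort_rows(rows, columns, descending=False):
    """Sort rows by the listed columns; earlier columns take priority over later ones."""
    return sorted(
        rows,
        key=lambda row: tuple(row[c] for c in columns),
        reverse=descending,
    )


rows = [
    {"name": "Ana", "age": 70},
    {"name": "Bo", "age": 40},
    {"name": "Cy", "age": 81},
]

ascending = sort_rows(rows, ["age"])
descending = sort_rows(rows, ["age"], descending=True)
print([r["name"] for r in ascending], [r["name"] for r in descending])
```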
Summarize many columns
Summarize the values in one or more columns. Each summary is specified by a nested block. If the data has been grouped, one summary row is created for each group.
- contains: One or more nested blocks that specify which columns to summarize and how to summarize them.
Summarize a column
When placed inside a summarize block, each of these blocks specifies a column to summarize and a summarization function to use. If the data has been grouped, one summary row is created for each group.
- column: Which column of the table to summarize.
- drop down: What summarization function to use.
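The way summaries interact with grouping can be sketched in Python. Several things here are assumptions made for the example: the table is a list of dicts, the set of available functions is a guess (the drop down's actual choices are not listed above), and the output column names such as `age_mean` are invented for illustration.

```python
from statistics import mean

# Assumed set of summarization functions; the real drop down may differ.
FUNCTIONS = {"count": len, "sum": sum, "mean": mean, "max": max, "min": min}


def summarize(rows, specs):
    """Produce one summary row per group, or a single row if the data is ungrouped.

    `specs` is a list of (function_name, column) pairs, one per nested block.
    """
    groups = {}
    for row in rows:
        # Rows without a _group_ column all fall into a single group keyed by None.
        groups.setdefault(row.get("_group_"), []).append(row)

    result = []
    for group_id, members in groups.items():
        summary = {} if group_id is None else {"_group_": group_id}
        for func_name, column in specs:
            values = [m[column] for m in members]
            # Hypothetical naming scheme for the summary columns.
            summary[f"{column}_{func_name}"] = FUNCTIONS[func_name](values)
        result.append(summary)
    return result


grouped = [
    {"age": 70, "_group_": 1},
    {"age": 40, "_group_": 1},
    {"age": 81, "_group_": 2},
]
by_group = summarize(grouped, [("mean", "age"), ("max", "age")])

ungrouped = [{"age": 70}, {"age": 40}, {"age": 81}]
overall = summarize(ungrouped, [("count", "age")])
print(by_group, overall)
```

With grouped data the result has one row per group; with ungrouped data it collapses to a single row.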
Undo grouping by removing the special _group_ column that the grouping block added.
Discard rows containing redundant values. If several rows have the same values in the specified columns but different values in other columns, one row from that group will be chosen arbitrarily and kept.
- column, column: One or more columns to check for distinct values.
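A Python sketch of discarding redundant rows is shown below. It assumes a table is a list of dicts, and it keeps the first row seen for each combination of values; that is only one possible choice, since the description says the surviving row is picked arbitrarily. `unique` is a hypothetical helper.

```python
def unique(rows, columns):
    """Keep one row per distinct combination of values in `columns`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            # This sketch keeps the first row encountered; any row would be valid.
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"name": "Ana", "country": "Iceland"},
    {"name": "Bo", "country": "Iceland"},
    {"name": "Cy", "country": "Norway"},
]
distinct = unique(rows, ["country"])
print(distinct)
```

Ana and Bo agree on country but differ on name, so only one of them survives when checking country alone.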