Dataviz.Garden

Powered by šŸŒ±Roam Garden

Data Organization in Spreadsheets

An article by Karl W. Broman & Kara H. Woo in The American Statistician. spreadsheet

Related to tidy data.


"Dangers of spreadsheets are real": that's why a web documenting spreadsheet horror stories exists! (By the European Spreadsheet Risks Interest Group. )

1 It's better to analyse and visualise separately

We believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.

2 Be consistent

Stay consistent with: codes for categorical variables & missing values, variable names, subject identifiers, data layouts (in multiple files), file names.

We prefer to have every cell filled in, so that one can distinguish between truly missing values and unintentionally missing values.

Also, use consistent date formats. Ideally use YYYY-MM-DD.

3 Choose good names for things

Choose short but meaningful names.

Do not use spaces in variable names or file names.

Generally, don't throw extra spaces around.

Avoid special characters, except for underscores and hyphens.

4 Write dates as YYYY-MM-DD

Excel can cause problems. Might be better to use other format column for dates.

You could create three separate columns with year, month, and day. Those will be ordinary numbers, and so Excel will not mess them up. Finally, you could represent dates as an 8-digit integer of the form YYYYMMDD, for example, 20140614 for 2014-06-14

5 No empty cells

Fill in all cells. Use some common code for missing data.

6 Put just one thing in a cell

The cells in your spreadsheet should each contain one piece of data.

Related to DRY.

7 Make it a rectangle

Single big rectangle with rows corresponding to subjects and columns corresponding to variables.

Do not use more than one row for the variable names.

Some datasets will not fit nicely into a single rectangle, but they will usually fit into a set of rectangles, in which case you can make a set of Excel files, each with a rectangle of data.

8 Create a data dictionary

Separate file that explains what all of the variables are.

For example:

The exact variable name as in the data file.

A version of the variable name that might be used in data visualizations.

A longer explanation of what the variable means.

The measurement units.

Expected minimum and maximum values.

9 No calculations in the raw data files

Primary data file should contain just the dataĀ and nothing else.

No calculations, no graphs, ...

Your primary data file should be a pristine store of data. Write-protect it, back it up, and do not touch it.

10 Don't use font color or highlighting as data

You might be tempted to highlight particular cells with suspicious data, or rows that should be ignored. Or the font or font color might have some meaning. Instead, add another column with an indicator variable (e.g., ā€trustedā€ with values TRUE or FALSE).

11 Make back-ups

Make regular backups of your data. In multiple locations.

Consider using a formal version control system, like git, though it is not ideal for data files.

Before you start inserting more data, make a copy of the file with a new version number.

file_v1.xlsx, file_v2.xlsx,...

12 Use data validation to avoid errors

Choose appropriate validation criteria.

For example:

A whole number in some range

A decimal number in some range

A list of possible values

Text, but with a limit on length

13 Save the data in plain text files

You can open CSV anywhere, anyhow.

Nonproprietary file format does not and never will require any sort of special software.