Dataviz.Garden

Data Organization in Spreadsheets

An article by Karl W. Broman & Kara H. Woo in The American Statistician. spreadsheet

"Dangers of spreadsheets are real": that's why a web documenting spreadsheet horror stories exists! (By the European Spreadsheet Risks Interest Group. )

1 It's better to analyse and visualise separately

We believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.

2 Be consistent

Stay consistent with: codes for categorical variables & missing values, variable names, subject identifiers, data layouts (in multiple files), file names.

We prefer to have every cell filled in, so that one can distinguish between truly missing values and unintentionally missing values.

Also, use consistent date formats. Ideally use YYYY-MM-DD.

3 Choose good names for things

Choose short but meaningful names.

Do not use spaces in variable names or file names.

Generally, don't throw extra spaces around.

Avoid special characters, except for underscores and hyphens.

4 Write dates as YYYY-MM-DD

Excel can cause problems. Might be better to use other format column for dates.

You could create three separate columns with year, month, and day. Those will be ordinary numbers, and so Excel will not mess them up. Finally, you could represent dates as an 8-digit integer of the form YYYYMMDD, for example, 20140614 for 2014-06-14

5 No empty cells

Fill in all cells. Use some common code for missing data.

6 Put just one thing in a cell

The cells in your spreadsheet should each contain one piece of data.

Related to DRY.

7 Make it a rectangle

Single big rectangle with rows corresponding to subjects and columns corresponding to variables.

Do not use more than one row for the variable names.

Some datasets will not fit nicely into a single rectangle, but they will usually fit into a set of rectangles, in which case you can make a set of Excel files, each with a rectangle of data.

8 Create a data dictionary

Separate file that explains what all of the variables are.

For example:

The exact variable name as in the data file.

A version of the variable name that might be used in data visualizations.

A longer explanation of what the variable means.

The measurement units.

Expected minimum and maximum values.

Aka codebook.

9 No calculations in the raw data files

Primary data file should contain just the data and nothing else.

No calculations, no graphs, ...

Your primary data file should be a pristine store of data. Write-protect it, back it up, and do not touch it.

10 Don't use font color or highlighting as data

You might be tempted to highlight particular cells with suspicious data, or rows that should be ignored. Or the font or font color might have some meaning. Instead, add another column with an indicator variable (e.g., ”trusted” with values TRUE or FALSE).