An article by Karl W. Broman & Kara H. Woo in The American Statistician. spreadsheet
Related to tidy data.
"Dangers of spreadsheets are real": that's why a web documenting spreadsheet horror stories exists! (By the European Spreadsheet Risks Interest Group. )
We believe that spreadsheets are best suited to data entry and storage, and that analysis and visualization should happen separately. Analyzing and visualizing data in a separate program, or at least in a separate copy of the data file, reduces the risk of contaminating or destroying the raw data in the spreadsheet.
Stay consistent with: codes for categorical variables & missing values, variable names, subject identifiers, data layouts (in multiple files), file names.
We prefer to have every cell filled in, so that one can distinguish between truly missing values and unintentionally missing values.
Also, use consistent date formats. Ideally use YYYY-MM-DD.
Choose short but meaningful names.
Do not use spaces in variable names or file names.
Generally, don't throw extra spaces around.
Avoid special characters, except for underscores and hyphens.
Excel can cause problems. Might be better to use other format column for dates.
You could create three separate columns with year, month, and day. Those will be ordinary numbers, and so Excel will not mess them up. Finally, you could represent dates as an 8-digit integer of the form YYYYMMDD, for example, 20140614 for 2014-06-14
Fill in all cells. Use some common code for missing data.
The cells in your spreadsheet should each contain one piece of data.
Related to DRY.
Single big rectangle with rows corresponding to subjects and columns corresponding to variables.
Do not use more than one row for the variable names.
Some datasets will not fit nicely into a single rectangle, but they will usually fit into a set of rectangles, in which case you can make a set of Excel files, each with a rectangle of data.
Separate file that explains what all of the variables are.
For example:
The exact variable name as in the data file.
A version of the variable name that might be used in data visualizations.
A longer explanation of what the variable means.
The measurement units.
Expected minimum and maximum values.
Aka codebook.
Primary data file should contain just the dataĀ and nothing else.
No calculations, no graphs, ...
Your primary data file should be a pristine store of data. Write-protect it, back it up, and do not touch it.
You might be tempted to highlight particular cells with suspicious data, or rows that should be ignored. Or the font or font color might have some meaning. Instead, add another column with an indicator variable (e.g., ātrustedā with values TRUE or FALSE).
Make regular backups of your data. In multiple locations.
Consider using a formal version control system, like git, though it is not ideal for data files.
Before you start inserting more data, make a copy of the file with a new version number.
file_v1.xlsx, file_v2.xlsx,...
Choose appropriate validation criteria.
For example:
A whole number in some range
A decimal number in some range
A list of possible values
Text, but with a limit on length
You can open CSV anywhere, anyhow.
Nonproprietary file format does not and never will require any sort of special software.