LESEM Lab Standard Operating Procedures for File and Data Management
Always, always, always have your files located in three places at any given time. Research is expensive, both personally and monetarily (think: your blood, sweat, and tears; your salary, benefits, and tuition; potentially vehicle, lodging, and equipment costs). Beyond that, data collection is often not repeatable – it’s a one-time opportunity – and thus the value of your data is priceless. In the case of files and data, redundancy is your best friend. As a lab, we’ve decided that this is a good storage plan for the moment:
- Storage on your lab desktop computer and personal laptop computers – these are the files you directly work with. This creates some personal redundancy – these files will be easiest to access if you accidentally delete a file or if your hard drive malfunctions.
- Storage on the cloud through CyBox Sync. You can choose to use the Sync feature with both your desktop and laptop so that you are always working on the most up-to-date version. CyBox has a version control feature so that past versions are also accessible for a certain period of time. Using CyBox Sync also creates some institutional redundancy – this gets your data out of Science II.
- Backup through Crash Plan. All computers in the NREM department are backed up every 15 minutes through this software.
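If you want to confirm that two or three copies of a file really are identical, a checksum comparison is one way to do it. The Python sketch below is only an illustration: it assumes each storage location is reachable as an ordinary folder (for example, your working folder and your synced CyBox folder), and the function names and folder layout are hypothetical, not part of any lab tool.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(relative_path: str, roots: Iterable[Path]) -> bool:
    """True only if the file exists under every root with identical contents.

    `roots` are hypothetical top-level folders, one per storage location.
    """
    digests = set()
    for root in roots:
        candidate = Path(root) / relative_path
        if not candidate.is_file():
            return False  # a copy is missing from this location
        digests.add(sha256_of(candidate))
    return len(digests) == 1
```

Calling `copies_match` with a file's relative path and a list of folders returns `False` when any copy is missing or differs, which is a prompt to check which location is out of date.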
Read, reread, and memorize Chapter 8 in Gotelli and Ellison’s A Primer of Ecological Statistics (2nd edition). There’s loads of good advice in there on the topics of managing and curating data, as well as data Quality Assurance-Quality Control (QAQC) procedures. QA inspects the process and QC controls the process and the end product; or more specifically:
- QA constitutes “planned and systematic activities implemented in a quality system so that quality requirements for a product or service will be fulfilled.” In other words, QA makes sure you are doing the right things, the right way. QA procedures should be developed prior to any data collection.
- QC includes “observation techniques and activities used to fulfill requirements for quality” and makes sure your results are within the bounds of your expectations. QC should precede any real data analysis.
Another simple and good guide on data management can be found here.
Metadata, or “data about your data,” should be formally recorded and stored alongside it. For an introduction to metadata, read the Gotelli and Ellison chapter cited above and also a paper by Michener and colleagues (1997), titled “Nongeospatial Metadata for the Ecological Sciences”, published in Ecological Applications. Metadata can be effectively stored alongside data within Access or in a separate Excel, OneNote, or Word file. All data fields should be listed along with text descriptions of the fields, instruments used to collect the data, timing of collection, units, and expected range of values. In the past, we’ve worked on metadata descriptions at the end of a project, but that’s really too late. Metadata should be developed along with QA procedures, updated during QC, and always stored alongside data files.
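As a rough illustration of how the metadata items listed above can also serve QC, the Python sketch below stores one record per data field and uses the expected range to flag suspect values. It writes the data dictionary as a CSV file, which is our substitution for the Access, Excel, OneNote, or Word options named above. The field name, instrument, units, range, and values are all made up for the example.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class FieldMetadata:
    name: str            # data field name as it appears in the data file
    description: str     # plain-text description of the field
    instrument: str      # instrument used to collect the data
    collected: str       # timing of collection
    units: str
    expected_min: float  # lower bound of the expected range
    expected_max: float  # upper bound of the expected range


def write_data_dictionary(entries, path):
    """Write one row per data field to a CSV file kept alongside the data."""
    columns = [f.name for f in fields(FieldMetadata)]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for entry in entries:
            writer.writerow(asdict(entry))


def out_of_range(values, meta):
    """Return (row index, value) pairs that fall outside the expected range."""
    return [(i, v) for i, v in enumerate(values)
            if not (meta.expected_min <= v <= meta.expected_max)]


# Hypothetical field and values, for illustration only.
root_mass = FieldMetadata(
    name="root_mass", description="Dry root mass per core",
    instrument="bench scale", collected="20130521",
    units="g", expected_min=0.0, expected_max=100.0,
)
print(out_of_range([12.5, 250.0, -1.0], root_mass))  # [(1, 250.0), (2, -1.0)]
```

Flagged values are not necessarily wrong. They are simply the ones to check against field sheets before any real analysis.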
Your first and foremost consideration when naming files and data fields should be: “would this make sense to someone besides me?” Specific file naming conventions may vary based on file type, but we always recommend including authorship and date in all file names. We find this to be the clearest form of version control, as internal metadata associated with files may change with emailing, copying, or saving. Here are some examples of file naming conventions we think work well:
- For thesis files: “LastName_Thesis_Date”.
- We recommend recording date as such for easiest file sorting/finding: “YearMonthDay” (e.g., “20130521” for the 21st of May, 2013).
- If your specific project is a part of a larger and/or long-term effort (e.g., PEWI, LandscapeBiomass [or LandBio], RedPine, STRIPS), we recommend including the project name in the file name.
- We recommend always noting the final version of any of your files within the file name (e.g., “LastName_Thesis_Final”, “LandBio_Ontl_RootMassData_Final”).
- Avoid getting in the habit of putting spaces in folder or file names. If you’re going to call anything into another program – like data files into R or SAS – spaces in folder or file names can be a problem, so use the underscore character “_” instead.
- For PDFs of published papers, we recommend storing them in the following form: “Author(s)_Year_Journal – 3-5 word description”.
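The naming conventions above are easy to apply by hand, but a small helper can keep them consistent. The Python sketch below is one possible way to do that, not a lab requirement. The function name `make_file_name` is hypothetical, and placing the date before an optional “Final” tag is our assumption about ordering.

```python
import re
from datetime import date


def make_file_name(parts, when, final=False):
    """Join name parts with underscores, then add a YearMonthDay date.

    Whitespace inside any part is replaced with "_" so the name is safe
    to call into other programs. "Final" is appended when requested.
    """
    cleaned = [re.sub(r"\s+", "_", part.strip()) for part in parts if part.strip()]
    cleaned.append(when.strftime("%Y%m%d"))
    if final:
        cleaned.append("Final")
    return "_".join(cleaned)


print(make_file_name(["LastName", "Thesis"], date(2013, 5, 21)))
# LastName_Thesis_20130521
print(make_file_name(["LandBio", "Root Mass Data"], date(2013, 5, 21), final=True))
# LandBio_Root_Mass_Data_20130521_Final
```

Because the date is written as YearMonthDay, files that share a prefix sort chronologically in any file browser.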
Keep a “lab notebook” documenting detailed methods involved with your data collection and manipulation. This is SOP in any chemistry or physics experiment, and should be in our work as well. Given the character of our work, however, our lab notebooks might take a slightly different form than standard ruled paper notebooks. For example, the lab notebook may be an Excel, OneNote, or Word file kept alongside data, statistical code, and results in a folder containing a particular analysis. Record the date, the goal of the analysis in plain English, the name of data set(s) involved, potentially data fields involved, specific tests or manipulations (e.g., PROC MIXED, “clipped GIS_File_A with GIS_File_B”), the name(s) of output files, brief descriptions of any issues that emerged or troubleshooting involved, and your brief conclusions on the results (e.g., “test didn’t work as expected”, “perfect! I believe this is my final run”).
Use copious commenting in your statistical code. As with file naming, your first and foremost consideration when coding should be: “would this make sense to someone besides me?” Within-code commenting also doubles as a lab notebook.
Don’t underestimate the importance of and time required by file and data management. This is a major part of your job and should be given due diligence. Clean up and prune files at thoughtful intervals. “Thoughtful intervals” means after you’ve completed a specific task but while all aspects of the analysis are fresh in your mind – or before the specifics involved with each intermediate data set or data step get fuzzy in your head. For example, I like to clean up (a) analysis files once I know I have all the results I need to write a paper, (b) manuscript files after a paper is accepted for publication, and (c) course files at the end of the semester. That said, always keep the following files and clearly label them as such:
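If you run your analyses from scripts, the items in a lab notebook entry can be recorded at the same time. The Python sketch below appends entries to a plain text file kept in the analysis folder. That format is our assumption for the example (the notebook could equally be an Excel, OneNote, or Word file), and the function names are hypothetical.

```python
from datetime import date


def format_entry(when, goal, datasets, steps, outputs, issues, conclusion):
    """Build one notebook entry covering the items a lab notebook should record."""
    lines = [
        "Date: " + when.strftime("%Y%m%d"),
        "Goal: " + goal,
        "Data sets: " + ", ".join(datasets),
        "Tests or manipulations: " + "; ".join(steps),
        "Output files: " + ", ".join(outputs),
        "Issues and troubleshooting: " + (issues or "none"),
        "Conclusion: " + conclusion,
    ]
    return "\n".join(lines) + "\n"


def append_entry(notebook_path, entry):
    """Append an entry to the notebook file, leaving a blank line after it."""
    with open(notebook_path, "a", encoding="utf-8") as handle:
        handle.write(entry + "\n")
```

Appending rather than overwriting keeps earlier runs in the record, including the ones that didn’t work as expected.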
- Your original data in their rawest form (immediate post-entry);
- Your post QAQC dataset (used for your analysis);
- Your final statistical code, test results, graphics, lab notebook, presentations (posters and talks), and paper document files.
In terms of a filing system, we suggest starting with a file structure somewhat like the following:
- Clubs (e.g., NREM GSO, SASA, Grebe)
- Courses (e.g., NREM507, SUSTAG509)
- Chapter 1
Time to move on? OK, then follow this transition plan so it’s easy for someone else to pick up right where you left off.
- Create an “Everything” folder – Copy all documents and files used for graduate research.
- Create a “Transition” folder – Duplicate the “Everything” folder. Then organize and clean the folder using SOP filing and naming conventions, eliminating unnecessary documents and files.
- Create a “Transition Document” – By “transition document”, we mean a OneNote or Word file that outlines any important details regarding the project that heretofore is only in your head. For example, where to find important files, where you left off in your data analysis, questions you wished you’d been able to pursue but ran out of time, important sources of information (e.g., stats methods consultants), any methodological details that aren’t stored anywhere else, your specific file naming convention, your subsequent contact information, etc. Include a table of contents document for the clean Transition folder. Specifically, the transition document should be organized like this:
· Name of graduate research assistant (yourself)
· Project name
· Date range of research assistantship
· Contact information
· Table of contents
    * Names of each sub-folder in the “Transition” folder
    * Brief description of each sub-folder
    * Sub-folder contents; for each document/file:
        · Brief description
        · Status – Final, In Progress
        · Action items
- Upload files to the LESEM Lab shared space on the NREM server and burn a CD/DVD, naming the CD/folder: “<YYYYMMDD>_<ProjectName1>_<ProjectName2>_Files_<LastName>_<FirstName>”. The contents of the CD/DVD: Everything folder, Transition folder, and the Transition Document.
- Schedule a closeout meeting with Lisa and others to whom you are transitioning your work. During this meeting, briefly walk through the transition document, highlighting any action items and important files.
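Two pieces of the transition plan above lend themselves to a small script: building the CD/folder name and drafting the table of contents for the Transition folder. The Python sketch below is one way to get a starting point. The function names are hypothetical, and the placeholders in angle brackets are meant to be filled in by hand.

```python
from datetime import date
from pathlib import Path


def archive_name(when, projects, last_name, first_name):
    """Build <YYYYMMDD>_<ProjectName...>_Files_<LastName>_<FirstName>."""
    return "_".join([when.strftime("%Y%m%d"), *projects, "Files", last_name, first_name])


def draft_table_of_contents(transition_dir):
    """List each sub-folder and its files, with placeholders to fill in by hand."""
    lines = []
    subfolders = sorted(p for p in Path(transition_dir).iterdir() if p.is_dir())
    for sub in subfolders:
        lines.append("* " + sub.name + ": <brief description>")
        for item in sorted(p for p in sub.iterdir() if p.is_file()):
            lines.append(
                "    · " + item.name
                + ": <brief description>; Status: <Final or In Progress>; Action items: <list>"
            )
    return "\n".join(lines)


print(archive_name(date(2015, 2, 16), ["STRIPS"], "LastName", "FirstName"))
# 20150216_STRIPS_Files_LastName_FirstName
```

The draft only lists what is in the folder. The descriptions, status, and action items still have to come from you, since those are exactly the details that otherwise live only in your head.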
If you haven’t ever used OneNote before, we highly recommend you check it out! This is likely to become the “LESEM Lab-preferred” software for note taking and maintaining lab notebooks. It has a lot more functionality than Word in terms of embedding images and graphics; sketching and otherwise marking up; inserting file pathways and associating related files with one another; tagging entries and being searchable; etc.
Draft: 16 June 2013; updated 16 February 2015