Lab Data and File Management SOPs

LESEM Lab Standard Operating Procedures for File and Data Management

This data management plan addresses two major themes: data integrity and reproducible research. Many best practices serve both goals. Research data can lose integrity or be lost completely, wasting valuable time and effort and, in some cases, rendering the research non-reproducible. Fortunately, it is easy to protect your data by developing a solid data management plan at the outset of a project and adhering to that plan throughout the project lifecycle.

The LESEM Lab is committed to the principles of Reproducible Research. Research is reproducible when others can reproduce the results of a scientific study given only the original data, code, and documentation (Essawy et al. 2020). Alston & Rick (2021) give an excellent introduction to reproducible research, but I will summarize here the personal and community benefits and the barriers they describe:

Personal benefits

  1. Remember how and why you performed specific analyses
  2. Quickly and simply modify analyses and figures long after you have moved on to other projects
  3. Quickly reconfigure previous coding tasks so you don’t have to reinvent processes
  4. Signal rigor, trustworthiness, and transparency to other professionals
  5. Increase paper citation rates and allow your data and code to be cited in addition to your manuscripts
  6. Meet journal requirements

Research community benefits

  1. Allows others to learn from your work
  2. Allows others to understand and reproduce your work
  3. Protects you and the community from mistakes

Barriers to reproducible research

  1. Complexity of software or analytical approach
  2. Technological change over time
  3. Intellectual property rights

To achieve reproducible research, the LESEM Lab has a set of data best practices:

Before data collection

Data management systems

As you design your experiments and decide what data you need to collect, you should also design your data management system. Tabular data can be stored in a database program such as Microsoft Access or SQLite, or in “flat” formats such as .csv files, Excel files, or Google Sheets. Every data table (especially tables stored flat) should have a variable that serves as a primary key, uniquely identifying every row in the table, as well as foreign keys that relate rows to the primary keys of other tables.
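
A few lines of R run before analysis can confirm that keys behave as intended. The sketch below is a minimal example only; the file and column names are hypothetical.

    # Minimal sketch of key checks for flat files (hypothetical names)
    plots  <- read.csv("plots.csv")        # primary key: plot_id
    counts <- read.csv("bird_counts.csv")  # foreign key: plot_id

    # The primary key should uniquely identify every row
    stopifnot(!anyDuplicated(plots$plot_id))

    # Every foreign key value should match a primary key in the parent table
    stopifnot(all(counts$plot_id %in% plots$plot_id))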

Directory structures should be clearly named and logically organized. An example would be a parent directory for each project (e.g., “Bird Abundance” or “Landowner Surveys”), each containing folders labeled manuscript, data, analyses, and analysis products (Noble 2009).

Develop or adopt a consistent set of naming conventions. Naming conventions may already be established by prior work or by partner labs/organizations. The first consideration in naming files should be, “Would this make sense to someone besides me?” Examples include always formatting dates as YYYYMMDD for easy sorting and searching, using underscores instead of spaces in directory, file, and variable names for compatibility with the widest variety of software, and always including units in variable names. In general, variable names should err on the side of being long and descriptive rather than abbreviated and cryptic. Autocomplete features in modern IDEs make long variable names less tedious to type, and the additional clarity helps avoid misunderstandings. Finally, your surname should appear in the file name of any file shared with others.
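
The sketch below shows one way these conventions might look in an R workflow; the surname, table name, and variable names are hypothetical placeholders.

    # Hypothetical example of the naming conventions applied in R
    bird_abundance <- data.frame(
      site_id       = c("A01", "A02"),
      veg_height_cm = c(12.5, 30.1),   # units included in the variable name
      air_temp_c    = c(18.2, 21.4)
    )

    date_stamp <- format(Sys.Date(), "%Y%m%d")   # YYYYMMDD sorts chronologically
    file_name  <- paste("yoursurname", "bird_abundance", date_stamp, sep = "_")
    write.csv(bird_abundance, paste0(file_name, ".csv"), row.names = FALSE)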

“Quality assurance” means designing a system in such a way that mistakes are unlikely to happen in the first place. Design your data-entry portal with safeguards such as drop-down menus instead of free-text boxes, or restrict entry of continuous data to previously defined acceptable ranges (e.g., 0-100 for percentage data).
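
If a restriction cannot be built into the entry portal itself, the same rule can be enforced when the data are imported. Below is a minimal R sketch of such a range check; the 0-100 range and the test values are illustrative.

    # Hypothetical range check for percentage data entered by hand
    validate_percent <- function(x) {
      ok <- !is.na(x) & x >= 0 & x <= 100
      if (!all(ok)) {
        warning("Values outside 0-100 at rows: ", paste(which(!ok), collapse = ", "))
      }
      ok
    }

    validate_percent(c(15, 102, NA, 87))  # flags row 2 (out of range) and row 3 (missing)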

Define data access and roles ahead of time. For instance, the graduate student might be the database owner, advisors might have editorial access, technicians might have entry permission but not editing rights, and collaborators might be granted read-only access. As part of the “leaving ISU” process, the graduate student should ensure that “owner” status of databases is passed to advisors.

If your project collects Personally Identifiable Information (PII), store it separately from other research data so that it will not be shared or publicly archived at the completion of the project. Remember that PII can be retrieved from a Git repository’s history even after it has been removed from the current files. Avoid collecting PII whenever possible, and destroy any PII that was collected at the end of the project if possible.

Metadata

Metadata (“data about data”) includes information such as what the data is, who collected it, when and where it was collected, why it was collected, and how. Draft metadata should be created during the study design phase and then updated after every field season to reflect changes made during implementation. Typically, each table and each variable (column) will have a piece of metadata describing it. Metadata for finished data products should also include a description of how the data have been changed from the raw version.

Metadata should be stored and shared with the dataset. Combine it with the data whenever possible, such as in a few lines at the top of a .csv file (these can be skipped during import into statistical software with a simple argument), in the metadata fields of an MS Access database, or in the documentation of an R package. If there is no good way to combine the metadata with the data itself, it can be contained in a clearly labeled .txt file within the data directory.
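
For example, if a few metadata lines sit at the top of a .csv file, they can be skipped at import with a single argument. In the sketch below, the file name and the number of metadata lines (three) are hypothetical.

    # Skip three metadata lines stored at the top of a hypothetical .csv
    counts <- read.csv("bird_counts.csv", skip = 3)

    # readr offers the same argument, and lines prefixed with "#" can be
    # treated as comments instead:
    # counts <- readr::read_csv("bird_counts.csv", skip = 3, comment = "#")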

Statistician Consultation

Before finalizing methods and beginning data collection, always meet with a statistician to review your plans. It is far better to make changes to your methods at this stage than to discover a critical error in your methods after years of data collection.

Before data analysis

Data security

Always back up your data (both physical and digital) as soon as possible and in multiple locations. Physical data (such as paper datasheets) should be digitized immediately upon completion, for example by copying them with an office copier/scanner or photographing/scanning them with a mobile device. Digital data should be backed up in at least three network locations spanning at least two distinct physical sites. For example, you might have a database hosted on the NREM network storage (1), backed up by NREM IT (2), and uploaded to GitHub (3). This meets the requirement: three network locations and two physical locations (two within Science II and one at GitHub’s data center).

Physical data should be stored in a secure location (e.g., in the LESEM lab space). Consider the risk of fire, flood, or theft in your storage location. Physical data should never be sent home with technicians for data entry and should only go home with graduate students when absolutely necessary.

Physical data should be organized in a way that others besides you could locate a particular data sheet on demand. Physical data is not actually “available” if the needed datasheet is lost among a thousand sheets of paper.

Data should be entered periodically as it is captured; do not wait until the end of data collection. Entering data in a timely manner helps you catch errors and track down missing or misplaced datasheets while the details are still fresh, and it avoids the daunting task of entering a towering pile of datasheets by yourself after technicians have left.

Quality control

Quality control means reviewing your data to ensure there are no missing or incorrect values. This can be accomplished programmatically by computing a series of summary statistics or viewing figures such as histograms and boxplots and then investigating suspect values. This should be done as a regular part of your data preparation leading up to analysis, not just when you notice a problem late in the process. Programs such as Google Dashboard may provide data summary and visualization functions.
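
A minimal sketch of this kind of programmatic check in R is shown below; the file name, columns, and thresholds are hypothetical.

    # Hypothetical quality-control pass over an entered dataset
    counts <- read.csv("bird_counts.csv")

    summary(counts)                     # ranges, quartiles, and NA counts by column
    colSums(is.na(counts))              # missing values by variable

    hist(counts$veg_height_cm)          # check distributions for odd values
    boxplot(count ~ species, data = counts)

    # Investigate suspect values, e.g., negative or implausibly large counts
    subset(counts, count < 0 | count > 500)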

Data should be audited soon after capture, for example on the drive home from a field location. Pay particular attention to whether datasheets have the correct location, date, and observers noted and whether any data fields are missing. Physical datasheets should be kept in an organized system in the field vehicle and transferred to the data entry or storage location quickly to avoid loss or destruction in the field vehicle.

Data Digitization

Capturing data digitally has many advantages over paper datasheets, but if you record data on paper, care should be taken during the data entry step. In addition to the safeguards built into the database or data-entry portal, having each datasheet entered twice by two different people is the gold standard for avoiding data-entry errors. Double entry allows you to computationally compare the two entries of the same datasheet and investigate any differences. If double entry is prohibitively expensive, then auditing a small subset can give you insight into your data entry error rate and identify any systematic problems (e.g., an unreliable transcriber).
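
One simple way to compare two independent entries is sketched below in R; the file names are hypothetical, and the comparison assumes both entries contain the same columns.

    # Compare two independent entries of the same datasheets (hypothetical files)
    library(dplyr)

    entry_a <- read.csv("counts_entry_tech1.csv")
    entry_b <- read.csv("counts_entry_tech2.csv")

    # Rows present in one entry but not the other point to transcription errors
    anti_join(entry_a, entry_b, by = names(entry_a))
    anti_join(entry_b, entry_a, by = names(entry_b))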

Data storage should be write-protected so that the native version cannot be altered once entered. When mistakes are discovered, they should be corrected programmatically within a cleaning script, with notations explaining why the change was made. This prevents changes from being made in error with no record of what was changed or why. Changes made to data on the physical datasheet should be avoided whenever possible, but if a piece of data needs to be altered in the field, the mistake should be crossed out with a single line so that it is still readable, and the new value noted along with your initials and the date. This preserves a record of when the change was made in case it was made in error.
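
A cleaning script might record such a correction as in the sketch below; the file names, values, and reason given are hypothetical examples.

    # Hypothetical cleaning script: raw file stays write-protected and untouched
    raw     <- read.csv("bird_counts_raw.csv")
    cleaned <- raw

    # 2021-06-14, site A02: height entered as 310 cm; field notes confirm 31.0 cm
    # (decimal point dropped during data entry)
    cleaned$veg_height_cm[cleaned$site_id == "A02" &
                          cleaned$visit_date == "2021-06-14"] <- 31.0

    write.csv(cleaned, "bird_counts_clean.csv", row.names = FALSE)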

Data collected digitally by a device should be transferred into your data storage system digitally and not transcribed by reading off a screen and punching numbers into a separate system. Take the time to figure out how to make the conversion digitally to guard against transcription errors and provide a digital record.

Data collected digitally (e.g., GPS unit, temperature probe, photographs) should be copied to a non-proprietary format as soon as possible after collection, or at least to an industry-standard format if a non-proprietary format is not available or is insufficient. For example, at the time of this writing, photographs taken with an Apple iPhone are saved in a newer format (HEIC) that is not widely supported on Windows and so should be copied to the more widely supported JPEG format. A version of the original proprietary file should also be kept to guard against conversion errors.

During data analysis

Coding scripts

Whenever possible, data munging and analysis should be completed using coding scripts; avoid point-and-click graphical user interfaces (GUIs). The ability to show (and reproduce) exactly what you did is at the heart of reproducible research. The first few lines of a script should state the purpose of the code, who authored it and when, and what other data or scripts are needed to make it work. Comment your code copiously throughout: comments serve not only as a guide to others who might use your code, but also as a reminder to your future self (who will thank you!). Many organizations provide coding style guides to improve the quality and interpretability of code, such as the tidyverse style guide for R.
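
A script header along these lines might look like the sketch below; the file paths, author, and analysis are illustrative, not a prescribed lab template.

    # ----------------------------------------------------------------------
    # Purpose : Summarize bird abundance by site (hypothetical example)
    # Author  : A. Student, 2022-04-07
    # Inputs  : data/bird_counts_clean.csv (produced by clean_counts.R)
    # Outputs : analysis_products/abundance_by_site.csv
    # ----------------------------------------------------------------------
    library(dplyr)

    counts <- read.csv("data/bird_counts_clean.csv")

    # Mean count per site across all visits
    abundance <- counts |>
      group_by(site_id) |>
      summarise(mean_count = mean(count, na.rm = TRUE))

    write.csv(abundance, "analysis_products/abundance_by_site.csv", row.names = FALSE)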

Version control

Version control software such as Git (conveniently implemented through GitHub) serves both as documentation of what changes were made to a script and as a way to roll back changes if a mistake was introduced. It also serves as a convenient collaboration platform between you and technicians or other collaborators.

If GitHub is not a practical solution, a file-syncing service such as Box (provided to ISU users as CyBox) should be used to keep important data and code synced between computers. This avoids situations where a change is made while working on one computer and then forgotten and lost when work is picked up on another computer with a different version of the document. A syncing service also serves as a backup system, but beware that it only backs up data while the app is running, so it is vulnerable to gaps.

Using networked storage such as the lab server space or your personal ISU U: drive also eases versioning issues between separate computers. These storage services also have automatic backups, meeting one of the backup requirements.

If manually versioning your document or code is necessary, adopt a naming/versioning system and apply it consistently (e.g., “name_analysis_v3.2.R”). This will avoid the “name_thesis_final_initials_v2.2_FINAL.doc” situation. Papers and reports that are in the publication process should be clearly labeled as “submitted,” “accepted,” and “published” versions.

Keep a lab journal

Keep a lab journal describing the work of the day or week, including the date, the goal of the work, the data files involved, the outcome, and a brief conclusion or discussion. This will help you capture and organize your thoughts and will be helpful if you cannot remember why or exactly how a change was made. Less formal methods of journaling with some of the same functionality include detailed Git commit messages, R Markdown files, README files, R Notebooks, and standardized log files.

After data analysis

Review results

After completing the first draft of an analysis, always have your code and results reviewed by at least one collaborator or colleague. Finding errors at this stage is much less painful than during peer review, or worse, discovering an error after publication that requires a retraction.

Manuscript drafting

Manuscript writing can benefit from many of the same best practices as coding. The document preparation system LaTeX is a powerful way to extend the principles of reproducible research to your paper writing. PhD students and master’s students who intend to continue in academia should consider learning LaTeX to streamline the document drafting and editing process. ISU has an institutional subscription to Overleaf, a hosted LaTeX platform geared toward beginners. LaTeX makes updating results, tables, and figures much easier than traditional programs such as MS Word do.

Depending on the requirements of the targeted journal, you might also consider posting your final pre-publication draft to a pre-print server such as arXiv or bioRxiv. Pre-print servers host non-peer-reviewed versions of manuscripts with little editorial oversight as a method for distributing work quickly and freely. They are not a replacement for the peer-reviewed journal process but are an extension of that system intended to increase speed and transparency.

Data and code dissemination and archiving

Reproducible research requires sharing data and code in the rawest version possible so that others may check or build on your work. ISU Libraries is the preferred data repository for the lab because the data is ultimately owned by the University and this allows tighter control over the data once published. Other data repositories include Dryad and Zenodo, or formatting your code and data as an R package for distribution on GitHub.

When preparing your data for archiving, be sure that your data is in a durable, non-proprietary format, preferably “tidy” (Wickham 2014) tables in .csv format, and that all appropriate metadata is integrated with the data or packaged alongside it. Remember to scrub or mask any Personally Identifiable Information, including names, addresses, and spatial coordinates, before publication. Your data will need a distribution license attached to it before publication; consult your co-authors and funders to choose the most appropriate license. Your final published manuscript should include durable links to the archived data and code. Many archives will issue a Digital Object Identifier (DOI) for this purpose.
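
The sketch below shows one way a “wide” field table might be reshaped into a tidy table in R before archiving; the column names and values are hypothetical.

    # Reshape a hypothetical wide table into tidy (one observation per row) form
    library(tidyr)

    wide <- data.frame(
      site_id    = c("A01", "A02"),
      count_2020 = c(12, 8),
      count_2021 = c(15, 11)
    )

    tidy <- pivot_longer(wide,
                         cols         = starts_with("count_"),
                         names_to     = "year",
                         names_prefix = "count_",
                         values_to    = "count")

    write.csv(tidy, "bird_counts_tidy.csv", row.names = FALSE)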

In addition to meeting reproducible research requirements, data publication may be required by project funders and published data can be cited in your CV.

Moving on from ISU

When it is time to move on from ISU, following this transition plan will make it easy for someone else to pick up right where you left off. Remember that this someone may be a new student who doesn’t know you or any details of your prior work.

  1. Create an “Everything” folder – Copy all documents and files used for graduate research.
  2. Create a “Transition” folder – Duplicate the “Everything” folder. Then organize and clean the folder using SOP filing and naming conventions, eliminating unnecessary documents and files.
  3. Create a “Transition Document” – a OneNote or Word file that outlines any important project details that heretofore exist only in your head. For example: where to find important files, where you left off in your data analysis, questions you wished you had been able to pursue but ran out of time for, important sources of information (e.g., stats methods consultants), any methodological details that are not stored anywhere else, your specific file naming convention, your forwarding contact information, etc. Include a table of contents for the clean “Transition” folder. Specifically, the transition document should be organized like this:

    Name of graduate research assistant (yourself)

  • Project name
  • Date range of research assistantship
  • Contact information
  • Table of contents
          • Names of each sub-folder in the "Transition" folder
          • Brief description of each sub-folder
          • Sub-folder contents; for each document/file:
            • Name
            • Brief description
            • Status – Final, In Progress
            • Action items
  4. Upload files to the LESEM Lab shared space on the NREM server in the “People” directory.
  5. Schedule a closeout meeting with Lisa and others to whom you are transitioning your work. During this meeting, briefly walk through the transition document, highlighting any action items and important files.

Remember that the data you collected is owned by the University (and possibly co-owned by other entities) and you cannot take the original data with you. After discussion with Lisa, you may take a version of the data for future work, but the original data must remain with ISU.

 

Authored by Matt Stephenson with input from current lab members

Posted 4/7/2022

 

References

Alston, J. M., and J. A. Rick. 2021. A beginner’s guide to conducting reproducible research. Bulletin of the Ecological Society of America 102(2).

Essawy, B. T., J. L. Goodall, D. Voce, M. M. Morsy, J. M. Sadler, Y. D. Choi, D. G. Tarboton, and T. Malik. 2020. A taxonomy for reproducible and replicable research in environmental modelling. Environmental Modelling & Software 134:104753.

Noble, W. S. 2009. A quick guide to organizing computational biology projects. PLOS Computational Biology 5:e1000424.

Wickham, H. 2014. Tidy data. Journal of Statistical Software 59(10):1–23.