Table 6. Example of a tabular metadata file (each row describes a subject and its image file; see ‘Creating metadata files from tables’ below).

| id | img_location | sex | condition | treatment |
|---|---|---|---|---|
| subject_04 | dummyfolder/subject_04_imgfile.tiff | female | A | control |
| subject_93 | dummyfolder/subject_93_imgfile.tiff | female | B | control |
| subject_12 | dummyfolder/subject_12_imgfile.tiff | female | B | treat_3 |
| subject_65 | dummyfolder/subject_65_imgfile.tiff | male | B | control |
| subject_98 | dummyfolder/subject_98_imgfile.tiff | female | A | treat_1 |
| subject_71 | dummyfolder/subject_71_imgfile.tiff | male | A | treat_3 |
| subject_24 | dummyfolder/subject_24_imgfile.tiff | female | B | control |
Data Organization, documentation and metadata
Essential elements to make data reusable
Standards
There are many field- and community-specific standards describing how to structure data and metadata, file-naming conventions, and file formats. We recommend searching for what is available and reusing or modifying it only as much as necessary. Adopting available community standards results in a much lower workload than defining a new standard from scratch. We recommend adopting standards that are actively maintained, well documented, and used by a broad community for some time. If no standard is available for a specific data type or community, it is preferable to adapt standards for similar data types than to create a new one. Some popular standards offer the possibility of submitting extension proposals for additional data types.
Finding data structure standards
Data stewards or researchers involved in planning data management may need to conduct a systematic search for field-specific metadata standards.
The FAIRsharing (https://fairsharing.org/) resource is an extensive list of data and metadata standards, databases, and policies, and it is kept up-to-date by the field-specific research communities (Sansone et al. 2019).
The Digital Curation Centre (DCC) of the UK also offers a list of metadata standards organized by discipline: https://www.dcc.ac.uk/guidance/standards/metadata
Imaging standards
Brain Imaging Data Structure (BIDS)
The Brain Imaging Data Structure (BIDS) (Gorgolewski et al. 2016) describes a series of data organization principles to facilitate large-scale data sharing between researchers, while also making metadata machine-readable to allow automation. It started in 2015 and it can be considered one of the most complete guidelines in neuroimaging. It covers a range of imaging modalities (MRI, EEG, iEEG, MEG, PET, CT, microscopy). The BIDS extension proposals (BEP) process allows researchers to propose BIDS specifications for their data modality, which are then reviewed and integrated into the standards. BIDS guidelines cover multiple aspects of data management including:
- Data structure
- Metadata
- File names and formats
- Preprocessing and analysis workflows
The defining features of BIDS are those desirable in most standards:
- Flexible and easy to adopt. BIDS combines technically complex documentation with more accessible guides and apps to facilitate adherence, particularly for early-career researchers. The BIDS Starter Kit collects tutorials, wikis, and templates to help any lab get started with BIDS.
- Community-driven, platform-independent, and growing. BIDS was developed by researchers and has evolved from its initial focus on human functional magnetic resonance imaging to accommodate other data modalities (EEG, MEG, PET, CT, microscopy).
- An ecosystem of resources. BIDS has fostered an ecosystem of tools and resources, including validation tools, apps for data preprocessing, and databases like OpenNeuro.
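As a minimal illustration of the machine-readable naming that BIDS enables, the sketch below checks a BIDS-style anatomical filename against a simplified pattern. The pattern is a toy approximation written for this example; the official bids-validator implements the full specification.

```python
import re

# Toy check of a BIDS-style anatomical filename. The pattern below is a
# simplified approximation of the naming scheme (subject label, optional
# session label, suffix); the official bids-validator covers far more rules.
pattern = re.compile(r"^sub-[0-9A-Za-z]+(_ses-[0-9A-Za-z]+)?_T1w\.nii(\.gz)?$")

print(bool(pattern.match("sub-01_T1w.nii.gz")))    # True
print(bool(pattern.match("subject01_T1.nii.gz")))  # False
```

Because the naming convention is regular, scripts like this can iterate over a whole dataset and flag files that tools downstream would fail to recognize.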
Documentation, protocols and reporting
Documentation is metadata, that is, information about our data objects, mostly aimed at human readers and usually consisting of text and images. We recommend producing useful documentation from early on and throughout the entire project. It should preferably be written in a way that can be understood by other collaborators from the start, to prevent additional effort translating internal documentation into more publishable documents. We also recommend using discipline-specific guidelines when available, to reduce effort and to facilitate reuse by others in the community.
Readme files
README files are text files usually written in plain-text (.txt) or Markdown (.md) formats. They are intended for humans rather than for machine-readability. It is good practice to have at least a README file at the root directory (the topmost folder) of a project folder or code repository. They are usually concise and can be placed at multiple levels. For example, at the project level they provide information about the project authors, institution, goals, etc. At the data level, they can provide information about the datasets and help users navigate the folder or perform actions. When accessing a code repository on GitLab or GitHub, the user is presented with the content of a README file that provides the main instructions and information about the repository.
An example of the content in a README file based on https://rdmkit.elixir-europe.org/metadata_management:
PROJECT TITLE
- Project Unique ID
- Funding Grant Nr and period
- Description: <provide a short description of the study/project>
- Principal Investigator:
- Data contact person
- Link to Data management plan
ORGANIZATION <in large projects, this can also describe subfolders>
- Folder structure
- File naming conventions (with examples)
- File formats

Documenting data collection
Data collection is the basis for research findings. Procedures and quality measures need to be implemented and documented to have reliable data.
Key considerations
As described in https://rdmkit.elixir-europe.org/collecting :
- Capture the provenance, e.g., of samples, researchers, and instruments.
- Ensure data quality, whether data are generated by yourself or by a specialised infrastructure or facility.
- Check options for reusing existing data instead of generating new data.
- Define the experimental design, including a collection plan (e.g., repetitions, controls, randomisation), in advance.
- Calibrate the instruments.
- Check data protection and security issues if you work with sensitive or confidential data.
- Define how to store the data, e.g., format and volume.
- Find a suitable repository to store the data.
- Identify suitable metadata standards.
Tools for data management and documentation during data collection
It is necessary to document what happens in the lab in an electronic format, with version control and tracked changes. If handwritten notes cannot be avoided, they should be digitized in an interoperable format as soon as possible. Suitable tools are:
- Electronic Lab Notebooks (ELNs)
- Electronic Data Capture (EDC) systems
- Laboratory Information Management Systems (LIMS)
Some of these tools can be complex, expensive and have a broader scope, integrating workflows for analysis and file storage. Research groups need to find a balance between their needs and available resources (financial or in user expertise) to choose one.
Researchers should also consider:
- Long-term sustainability of the tools (e.g., is the tool likely to be maintained in the long run?)
- Dependencies (e.g., do researchers need a lot of technical support from external organizations? Are they accessible?)
- Format lock-in (e.g., the tool should allow exporting data into a format that can be imported into other platforms without losing information)
- Scalability (e.g., can it be adapted to new lab collaborators?)
- Portability (e.g., how easy it will be to switch or transfer the data to another tool if the current runs out of maintenance?)
- Version control (e.g., can I track changes and revert to previous versions?)
Electronic lab notebooks (ELNs)
There is a very complete section with information and resources to help select an ELN suitable to your needs in The Turing Way - Electronic Lab Notebooks. We highlight the following resources:
- Article detailing considerations for choosing an ELN: Higgins (2003).
- Comparison matrix of ELNs: Harvard Longwood Medical Area Research Data Management Working Group (2021) (spreadsheet version here).
- ELN finder online tool.
Consider that tools not specifically designed as ELNs can also have similar functionalities.
Standard Operating Procedures (SOPs) and lab protocols
SOPs and lab protocols are essential to ensure that data collection is implemented according to good practices, in a reliable manner. They can also be used to introduce new researchers to a project, facilitating collaboration. They are also key to reproducing and replicating experiments. For these reasons they constitute an important asset to share and publish. SOPs usually contain text and images. A recommended format is standardized PDF (PDF/A-1, ISO 19005-1), which can be opened with many software applications. It is also recommended to explore reusing or adapting any available SOPs from relevant institutions in a specific field.
SOPs and protocols are often living documents that undergo relatively frequent changes and updates, thus requiring version tracking. They are necessary to comply with Good Laboratory Practices.
Reporting guidelines: the EQUATOR Network
A reporting guideline is a simple, structured tool for researchers to use while writing manuscripts. It provides a minimum list of information needed to ensure a manuscript can be understood, replicated, or used for a practical purpose or in a meta-analysis. The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network is an international initiative that seeks to improve the reliability and value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines.
Tables and spreadsheets
Almost all research projects use some form of tabular data. Tables may contain descriptive information, experimental task performance, filenames of image or scan files, recording parameters, measurements, or any other information for analysis. Furthermore, metadata tables are also essential in a project as a form of structured metadata with the information needed to interpret the data.
Guidelines for increasing the machine-readability of spreadsheets are given in (Perkel 2022) and (Broman and Woo 2018). We recommend these guidelines as they are concise and accessible reads. They highlight the need for consistency in formatting and machine-readability to enable automation of data validation, preprocessing, synthesis, and analysis of tabular data or metadata. Additional online material is provided in the chapter data organisation in spreadsheets of the e-book for reproducible research from ‘The Turing way’(The Turing Way Community 2022). This page summarizes key points from the recommended guidelines and offers additional examples and information on common actions for working with tabular data.
Organizing tables or spreadsheets
1. Make tables machine-readable

Tables MUST be structured and formatted to be machine-readable. For example: do not encode any information with formatting (e.g., in color-coding schemes), but create additional columns instead that can be used as filters; do not merge cells, and do not use empty cells (see examples below).
2. Use interoperable file formats
Publish and share them in an interoperable format, that is, a format compatible with most tools and programming languages. Comma-separated values (.CSV) is recommended. Working on spreadsheet software like MS Excel or Google Spreadsheets can be convenient and in some cases it may be preferred for collaborative editing of tabular data. When using such spreadsheet software, researchers should ensure that the table can be exported to more interoperable formats like CSV and read programmatically without losing or distorting the information.
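As a sketch of what ‘exported without losing information’ can mean in practice, the following Python snippet writes a small table to CSV and reads it back unchanged using only the standard library (the column names and values are hypothetical):

```python
import csv
import io

# Hypothetical tidy table: one header row, ISO 8601 dates, no merged
# or empty cells, values stored as text.
rows = [
    {"subject_id": "S001", "scan_date": "2023-04-01", "weight_kg": "20.5"},
    {"subject_id": "S002", "scan_date": "2023-04-02", "weight_kg": "18.0"},
]

# Write the table to CSV (here to an in-memory buffer instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["subject_id", "scan_date", "weight_kg"])
writer.writeheader()
writer.writerows(rows)

# Reading the CSV back restores exactly the same values.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)  # True
```

A quick round-trip check like this is an easy way to verify that an export from spreadsheet software did not distort dates, decimals, or labels.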
3. Include a codebook or data dictionary
The tables MUST be shared/published together with a codebook or data dictionary that explains the meaning of the variables, abbreviations, naming conventions and units. This can be documented in a table format (e.g., variable names as rows in the first column and description in a second column) or in text format (e.g., in a README file).
4. Keep data ‘tidy’
- Include a single piece of information in each cell (e.g., instead of a cell entry `20_kg` in a column `weight`, better make an entry `20` in a column `weight_kg`). More on tidy data in Wickham (2014).
- Keep formatting and naming consistent and follow broadly established standards when possible (e.g., write dates following the ISO 8601 standard `YYYY-MM-DD`; this should be described in the data dictionary). There can also be standards referring to file or variable naming, units, or number formats, among others.
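The tidy-data rule above can be sketched in a few lines of Python (the `20_kg` cell values and the column names are hypothetical examples):

```python
from datetime import date

# Hypothetical untidy cells mixing value and unit in a single entry.
untidy_weights = ["20_kg", "18.5_kg", "21_kg"]

# Tidy version: plain numeric values in a dedicated 'weight_kg' column.
weight_kg = [float(v.removesuffix("_kg")) for v in untidy_weights]

# Dates written consistently following the ISO 8601 standard.
scan_date = date(2023, 4, 1).isoformat()

print(weight_kg)  # [20.0, 18.5, 21.0]
print(scan_date)  # 2023-04-01
```

With the unit moved into the column name, the values can be averaged, plotted, or validated directly, which is not possible with mixed text entries.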
5. Keep raw data
Researchers may want to use functions in spreadsheet software or manual transformations to change the source data into a more reusable format. If such transformations are necessary, at a minimum keep track of them (document them in a README file and/or share the script if one is available) and make sure the results are read correctly when the files are exported into more interoperable formats like .CSV. Always keep the unedited raw data for future reference.
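A minimal sketch of this practice in Python, assuming a hypothetical raw table and a hypothetical script name in the provenance note:

```python
# Transform a copy of the raw data, record the step, and leave the raw
# rows untouched (the script name in the note is hypothetical).
raw = [{"id": "S001", "weight": "20_kg"}]

# Derived table: the raw 'weight' entry is parsed into a numeric column.
derived = [
    {**row, "weight_kg": float(row["weight"].removesuffix("_kg"))}
    for row in raw
]

# Provenance note to be added to a README file.
provenance = "Derived 'weight_kg' from raw 'weight' column (see clean_weights.py)"

print(derived[0]["weight_kg"])  # 20.0
print(raw[0])  # raw data unchanged: {'id': 'S001', 'weight': '20_kg'}
```

Keeping the transformation in a script, rather than editing cells by hand, means the derived table can always be regenerated from the untouched raw file.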
Examples
A poorly formatted table
The example in Table 1 illustrates errors that are often found in tables, especially when they are edited by multiple researchers. The structure is not adequate for data processing or analysis, and there are inconsistencies in the cell values (e.g., subject labels) and formats (e.g., dates, decimals in weight). White spaces in variable names led to a dot ‘.’ being added automatically when reading the table with R code. A color code seems to be used to highlight images with artifacts.
A better formatted version
The example in Table 2 can be easily read using code, without ambiguities. It is accompanied by a codebook (Table 3) describing the variables.
| Variable | Description |
|---|---|
| Subject_ID | Unique subject identifier. Possible values: S001, S002, S010, etc. |
| Scan_Date | Date when the scan was performed. Possible values: YYYY-MM-DD. |
| Modality | Imaging modality used. Possible values: MRI, fMRI, DTI, or T1. |
| Sex | Biological sex of the subject. Possible values: M (male), F (female). |
| Age_years | Age of the subject in years. Possible values: 25, 30, 34. |
| Weight_kg | Subject’s weight in kilograms. Possible values: e.g., 10.00. |
| Exp_Group | Experimental group assignment. Possible values: Control, Exp (experimental). |
| Scanner_Model | MRI scanner brand. Possible values: Siemens, Philips, or GE Discovery. |
| Discard | Whether to discard images with artifacts. Possible values: TRUE, FALSE. |
| Notes | Additional notes on scan quality or issues. Possible values: Free text. |
Frequent actions with tables
Researchers often need to do some minor transformations or operations with tabular data. The main recommendations are:
- Use code, as it will be a more reproducible and transparent process than doing it manually. Irrespective of the programming language, the code for these operations is usually easy to implement for users without advanced programming skills. In the `R` language, `dplyr` and `tidyr` are two useful packages for handling and manipulating tables.
- Keep the code and preferably keep track of changes in the code and the tables: version control.
Data validation
There are a few initial checks and data validation actions that should be conducted before archiving, publishing, sharing, or conducting any analysis on tabular data to ensure they do not contain errors. Some frequent checks are:
- Variable types (e.g., numeric or character) and variable names (do they conform with the documented naming convention and codebook?)
- Missing values and complete cases (e.g., are they as expected?)
- Number and/or date formats (e.g., are they consistent?)
- Unique values (e.g., is sex consistently described with the same label?).
- Implausible values (e.g., negative age value, percentiles above 100)
- The ‘tail’ of the data (e.g., researchers may have calculated the mean of a column in the last row)
Some useful functions in R programming language are:
- `str()` (shows the structure of the data, including variable types and lengths)
- `unique()` (for unique values)
- `is.na()` (for missing values)
- `summary()` (summary statistics)
There are also dedicated packages for data validation. For a more detailed guideline, together with the code (in R) to perform different data validation actions we recommend the cookbook from the validate package for R. For python users, a recent package for data validation is the pointblank python package.
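For readers without those packages at hand, a few of the checks listed above can be sketched in plain Python; the table and the rules below are hypothetical, and packages such as validate (R) or pointblank (Python) express the same rules declaratively:

```python
# Hypothetical table as a list of records, with one deliberately
# implausible value.
records = [
    {"subject_id": "S001", "sex": "F", "age_years": 25},
    {"subject_id": "S002", "sex": "M", "age_years": 30},
    {"subject_id": "S003", "sex": "F", "age_years": -4},
]

errors = []
for i, rec in enumerate(records):
    # Unique values: sex must use the labels documented in the codebook.
    if rec["sex"] not in {"M", "F"}:
        errors.append((i, "sex uses an undocumented label"))
    # Implausible values: negative ages are not possible.
    if not isinstance(rec["age_years"], (int, float)) or rec["age_years"] < 0:
        errors.append((i, "implausible age value"))
    # Missing values: every record needs a subject identifier.
    if not rec.get("subject_id"):
        errors.append((i, "missing subject identifier"))

print(errors)  # [(2, 'implausible age value')]
```

Collecting the failures, rather than stopping at the first one, gives a complete report that can be attached to the data documentation.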
For a more conceptual overview, without code, we recommend the summary tables in the framework for initial data analysis (Huebner et al. 2018) used in the working group 3 from the STRATOS initiative (STRengthening Analytical Thinking for Observational Studies).
Combining tables
Researchers often need to combine information from two tables. A common scenario is to have a table with as many rows as subjects (e.g., subject characteristics) and another table with multiple rows per subject (e.g., indicating image files, slices, scans, etc.). If we want to merge these two tables we need a key variable that is consistent in both tables, for example, a subjectID variable uniquely identifying each subject. Here it is essential that each subject is labeled consistently and that the key variables have identical names in each table.
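A minimal Python sketch of such a merge on a key variable, without external packages (subject IDs and filenames are made up); on real tables, pandas `merge` or the dplyr join functions perform the same operation:

```python
# Hypothetical subject-level table (one row per subject) and image-level
# table (multiple rows per subject), merged on the shared key 'subject_id'.
subjects = [
    {"subject_id": "S001", "sex": "F"},
    {"subject_id": "S002", "sex": "M"},
]
images = [
    {"subject_id": "S001", "img": "S001_scan1.tiff"},
    {"subject_id": "S001", "img": "S001_scan2.tiff"},
    {"subject_id": "S002", "img": "S002_scan1.tiff"},
]

# Index the subject table by the key variable, then attach subject
# characteristics to every image record (a simple left join).
by_id = {row["subject_id"]: row for row in subjects}
merged = [{**by_id[img["subject_id"]], **img} for img in images]

print(merged[0])  # {'subject_id': 'S001', 'sex': 'F', 'img': 'S001_scan1.tiff'}
```

Note that the merge fails immediately if a subject label is misspelled in one of the tables, which is exactly why consistent key variables matter.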
Reshaping tables: wide and long formats
We can distinguish two basic data formats:
Wide format: values that do not repeat are usually in the first column and every measure that varies occupies a set of columns. E.g., One row per subject, columns specifying features of the subjects.
Long format: multiple records for each individual. Some variables do not vary and are identical in each record, whereas other variables vary across the records. E.g., multiple rows per subject, listing filenames of images associated with each.
Research projects often need to combine both types of tables, and some statistical analyses or visualizations may require one specific table format. Researchers often need to ‘reshape’ the tables. These transformations can be very straightforward, but they can also become quite complex when manipulating large tables with many variables involved. Always check the output of these operations (e.g., running validation or descriptive summaries) to make sure that no information was misplaced or data was lost.
Example of a wide-format table:

| Subject_ID | Test_A | Test_B |
|---|---|---|
| S001 | 5.4 | 3.2 |
| S002 | 6.1 | 5.8 |
| S003 | 7.3 | 6.5 |
| S004 | 4.8 | 4.1 |
The same data reshaped to long format:

| Subject_ID | Test | Value |
|---|---|---|
| S001 | Test_A | 5.4 |
| S001 | Test_B | 3.2 |
| S002 | Test_A | 6.1 |
| S002 | Test_B | 5.8 |
| S003 | Test_A | 7.3 |
| S003 | Test_B | 6.5 |
| S004 | Test_A | 4.8 |
| S004 | Test_B | 4.1 |
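The wide-to-long transformation shown in the tables above can be sketched in plain Python (on real tables, tidyr `pivot_longer` or pandas `melt` perform this reshape):

```python
# The wide-format rows from the table above.
wide = [
    {"Subject_ID": "S001", "Test_A": 5.4, "Test_B": 3.2},
    {"Subject_ID": "S002", "Test_A": 6.1, "Test_B": 5.8},
    {"Subject_ID": "S003", "Test_A": 7.3, "Test_B": 6.5},
    {"Subject_ID": "S004", "Test_A": 4.8, "Test_B": 4.1},
]

# Reshape to long format: one row per subject-and-test combination.
long = [
    {"Subject_ID": row["Subject_ID"], "Test": test, "Value": row[test]}
    for row in wide
    for test in ("Test_A", "Test_B")
]

print(long[0])    # {'Subject_ID': 'S001', 'Test': 'Test_A', 'Value': 5.4}
print(len(long))  # 8
```

A simple sanity check after reshaping is to confirm the row count: four subjects with two tests each must yield exactly eight long-format rows.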
Metadata companion files
Metadata are essential information about data. Metadata are as important as the data themselves, providing the context to the data. Metadata include documents intended for human use, consisting mainly of text and images, but mostly, when referring to metadata we mean machine-readable information in the form of tables and structured files that accompany data. ‘Machine-readable’ means that it can be easily processed by computers without human intervention. These files are also sometimes referred to as ‘sidecar’ or ‘companion’ files.
We recommend thinking about content and format of these files early on in a project. When possible, it is preferable to adopt or adjust discipline-specific standards instead of creating new standards.
Format: JavaScript Object Notation (JSON)
A popular format for metadata companion files is JavaScript Object Notation (JSON). JSON is an open and text-based standard format for data interchange. It is often recommended as a format for metadata files because it is easy to read by humans and by many programming languages. JSON is also a good format for larger (metadata) data that have a hierarchical structured relationship. The structure of a JSON object is:
- The data are in name/value pairs, and data objects are separated by commas
- Curly braces {} hold objects
- Square brackets [] hold arrays
- Each data element is enclosed in double quotes ("") if it is a character value, or written without quotes if it is a numeric value
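These rules can be seen in action with Python's standard json module (the field names below are hypothetical):

```python
import json

# Hypothetical metadata object illustrating the building blocks: curly
# braces for objects, square brackets for arrays, quoted strings, and
# unquoted numeric values.
metadata = {
    "Manufacturer": "ExampleVendor",  # character value -> quoted
    "PixelSize": [0.23, 0.23],        # array of numeric values
    "Magnification": 40,              # numeric value -> no quotes
}

text = json.dumps(metadata, indent=2)
print(text)

# Reading the text back yields an equivalent object.
assert json.loads(text) == metadata
```

Because JSON round-trips losslessly between text and data structures in most languages, it works well for metadata exchanged between tools.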
The ‘fields’ or name/value pairs in JSON files can vary widely depending on the data and file type they accompany. The following examples are provided by the Brain Imaging Data Structure (BIDS) for microscopy data:
Example of a sidecar JSON file (*_<suffix>.json):

```json
{
  "Manufacturer": "Hamamatsu",
  "ManufacturersModelName": "C9600-12",
  "PixelSize": [0.23, 0.23],
  "PixelSizeUnits": "um",
  "Magnification": 40,
  "BodyPart": "BRAIN",
  "BodyPartDetails": "corpus callosum",
  "SampleEnvironment": "ex vivo",
  "SampleFixation": "4% paraformaldehyde, 2% glutaraldehyde",
  "SampleStaining": "LFB",
  "SliceThickness": 5,
  "TissueDeformationScaling": 97
}
```

Example of participants.json:
```json
{
  "species": {
    "Description": "binomial species name from the NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)"
  },
  "strain": {
    "Description": "name of the strain of the species"
  },
  "strain_rrid": {
    "Description": "research resource identifier (RRID) of the strain (https://rrid.site/data/source/nlx_154697-1/search)"
  }
}
```

Creating metadata files from tables
Data acquisition instruments often generate metadata files automatically. However, this is not always the case, and sometimes researchers need to customize the content of these files. Adopting existing metadata standards is usually recommended. Some projects may require their own JSON schemas, although developing these schemas can be labor-intensive and may require data stewards or expert data managers. Often, simpler solutions are sufficient, and researchers may, for example, generate a JSON file from the content of a table (see Table 6). This can be done in many programming languages; for example, the R package jsonlite is simple to use and offers several encoding and formatting options. Automatically generating these files is recommended, rather than resorting to manual edits that can lead to errors that go unnoticed.
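A Python equivalent of this table-to-JSON conversion using only the standard library (the CSV content mirrors the first row of Table 6; in R, jsonlite offers a similar conversion from a data frame):

```python
import csv
import io
import json

# CSV content mirroring the first row of Table 6 (read here from an
# in-memory string instead of a file on disk).
csv_text = """id,img_location,sex,condition,treatment
subject_04,dummyfolder/subject_04_imgfile.tiff,female,A,control
"""

# Read the table and encode the first row as a JSON object.
rows = list(csv.DictReader(io.StringIO(csv_text)))
json_text = json.dumps(rows[0], indent=2)
print(json_text)
```

Generating the sidecar file from the table in this way keeps the two in sync and avoids the silent errors that manual copying can introduce.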
The first row of Table 6, represented in a JSON file, would look like this:
```json
{
  "id": "subject_04",
  "img_location": "dummyfolder/subject_04_imgfile.tiff",
  "sex": "female",
  "condition": "A",
  "treatment": "control"
}
```

Terminologies and ontologies
For data to be reusable, the metadata and documentation must use consistent terminology that is clearly explained so that different users can understand its meaning unambiguously. We recommend starting with simple, controlled vocabularies before moving on to more complex structures, such as taxonomies or ontologies. As with other metadata and documentation elements, standard or widely used vocabularies or ontologies should be used where available.
Controlled vocabularies
Controlled vocabularies are standardized and organized arrangements of words and phrases presented as alphabetical lists of terms or as thesauri and taxonomies with a hierarchical structure of broader and narrower terms (see ‘controlled vocabulary’ concept in EU vocabulary).
- Consistency: they provide a common language for researchers, minimizing variations and ambiguities.
- Interoperability: they facilitate integration and sharing of data across systems, disciplines, and organizations, enabling seamless collaboration (human- and machine-actionable).
- Reusability: research data annotated with standard vocabularies are easier to interpret and reuse.
Taxonomies
A taxonomy is a controlled vocabulary in which all the terms belong to a single hierarchical structure and have parent/child or broader/narrower relationships to other terms. The structure is sometimes referred to as a ‘tree’. The addition of non-preferred terms/synonyms may or may not be part of a taxonomy. (see ‘taxonomy’ concept in EU vocabularies).
Ontologies
An ontology is a formal way of representing knowledge about a specific domain, using a structured framework of concepts and their relationships (see ‘ontology’ in EU vocabularies). It can also be seen as a hierarchical graph covering a specific subject area or domain. An ontology has classes (i.e., concepts, terms, etc.) as basic units and relations or links between the classes.
Selected resources
Open Biological and Biomedical Ontology Foundry is a community that develops interoperable ontologies for the biological sciences
Ontobee is a server that aims to facilitate ontology data sharing, visualization, query integration and analysis.
FAIRsharing is a platform to look for standards, policies, and databases. It can also be used to search for ontologies and controlled vocabularies.

