Data Organization, documentation and metadata

Essential elements to make data reusable

Published

March 31, 2026

Standards

There are many field- and community-specific standards describing how to structure data and metadata, file-naming conventions and file formats. We recommend searching for what is available and reusing or modifying it only as much as necessary. Adopting available community standards results in a much lower workload than defining a new standard from scratch. We recommend adopting standards that are actively maintained, well documented and that have been used by a broad community for some time. If no standard is available for a specific data type or community, it is preferable to adapt standards for similar data types than to create a new one. Some popular standards offer the possibility of submitting extension proposals for additional data types.

Finding data structure standards

Data stewards or researchers involved in planning data management may need to conduct a systematic search for field-specific metadata standards.

Imaging standards

Brain Imaging Data Structure (BIDS)

The Brain Imaging Data Structure (BIDS) (Gorgolewski et al. 2016) describes a series of data organization principles to facilitate large-scale data sharing between researchers, while also making metadata machine-readable to allow automation. Started in 2015, it can be considered one of the most complete guidelines in neuroimaging. It covers a range of imaging modalities (MRI, EEG, iEEG, MEG, PET, CT, microscopy). The BIDS extension proposals (BEP) process allows researchers to propose BIDS specifications for their data modality, which are then reviewed and integrated into the standard. BIDS guidelines cover multiple aspects of data management including:

  • Data structure
  • Metadata
  • File names and formats
  • Preprocessing and analysis workflows

The defining features of BIDS are those desirable in most standards:

  • Flexible and easy to adopt. BIDS combines technically complex documentation with more accessible guides and apps to facilitate adherence, particularly for early-career researchers. The BIDS Starter Kit collects tutorials, wikis, and templates to help any lab get started with BIDS.

  • Community-driven, platform-independent, and growing. BIDS was developed by researchers and it has evolved from its initial focus on human functional magnetic resonance imaging to accommodate other data modalities (EEG, MEG, PET, CT, microscopy).

  • An ecosystem of resources. BIDS has fostered an ecosystem of tools and resources, including validation tools, apps for data preprocessing, and databases like OpenNeuro.

Documentation, protocols and reporting

Documentation is metadata, that is, information about our data objects, mostly aimed at human readers and usually consisting of text and images. We recommend producing useful documentation from early on and throughout the entire project. It should preferably be written from the start in a way that other collaborators can understand, to avoid the extra effort of translating internal documentation into more publishable documents. We also recommend using discipline-specific guidelines when available, to reduce effort and to facilitate reuse by others in the community.

Readme files

README files are text files usually written in plain-text (.txt) or Markdown (.md) formats. They are intended for humans rather than for machines. It is good practice to have at least one README file at the root directory (the topmost folder) of a project folder or code repository. READMEs are usually concise and can be placed at multiple levels. For example, at project level they provide information about the project authors, institution, goals, etc. At data level, they can describe the data sets and help users navigate the folder or perform actions. When accessing a code repository on GitLab or GitHub, the user is presented with the content of a README file that provides the main instructions and information about the repository.

An example of README file content, based on https://rdmkit.elixir-europe.org/metadata_management:

PROJECT TITLE 
- Project Unique ID 
- Funding Grant Nr and period 
- Description: <provide a short description of the study/project>
- Principal Investigator:
- Data contact person 
- Link to Data management plan

ORGANIZATION <in large projects, this can also describe subfolders>
- Folder structure 
- File naming conventions (with examples) 
- File formats

Documenting data collection

Data collection is the basis for research findings. Procedures and quality measures need to be implemented and documented to have reliable data.

Key considerations

As described in https://rdmkit.elixir-europe.org/collecting:

  • Capture the provenance e.g. of samples, researchers and instruments.

  • Ensure data quality, whether the data are generated by yourself or by another infrastructure or facility specialised in this type of data collection.

  • Check options for reusing data instead of generating new data.

  • Define the experimental design including a collection plan (e.g. repetitions, controls, randomisation) in advance.

  • Calibrate the instruments.

  • Check data protection and security issues if you work with sensitive or confidential data.

  • Define how to store the data e.g. format and volume.

  • Find a suitable repository to store the data.

  • Identify suitable metadata standards.

Tools for data management and documentation during data collection

It is necessary to document what happens in the lab in an electronic format, with version control and tracked changes. If handwritten notes cannot be avoided, they should be digitized in an interoperable format as soon as possible. Suitable tools are:

  • Electronic Lab Notebooks (ELNs)
  • Electronic Data Capture (EDC) systems
  • Laboratory Information Management Systems (LIMS)

Some of these tools can be complex, expensive and have a broader scope, integrating workflows for analysis and file storage. Research groups need to find a balance between their needs and available resources (financial or in user expertise) to choose one.

Researchers should also consider:

  • Long-term sustainability of the tools (e.g., is the tool likely to be maintained in the long run?)
  • Dependencies (e.g., do researchers need a lot of technical support from external organizations? Are they accessible?)
  • Format lock-in (e.g., the tool should allow exporting data into a format that can be imported into other platforms without losing information)
  • Scalability (e.g., can it be adapted to new lab collaborators?)
  • Portability (e.g., how easy will it be to switch or transfer the data to another tool if the current one is no longer maintained?)
  • Version control (e.g., can I track changes and revert to previous versions?)

Electronic lab notebooks (ELNs)

The Turing Way - Electronic Lab Notebooks provides a very complete section with information and resources to help select an ELN suitable to your needs. We highlight the following resources:

  • Article detailing considerations for choosing an ELN: Higgins (2003).

  • Comparison matrix of ELNs: Harvard Longwood Medical Area Research Data Management Working Group (2021) (spreadsheet version here).

  • ELN finder online tool.

Note that tools not specifically designed as ELNs can also offer similar functionality.

Standard Operating Procedures (SOPs) and lab protocols

SOPs and lab protocols are essential to ensure that data collection has been implemented according to good practices, in a reliable manner. They can also be used to introduce new researchers to a project, facilitating collaboration. They are also key to reproducing and replicating experiments. For these reasons they constitute an important asset to share and publish. SOPs usually contain text and images. A recommended format is the standardized PDF (PDF/A-1, ISO 19005-1), which can be opened with many software applications. It is also recommended to explore reusing or adapting any available SOPs from relevant institutions in a specific field.

SOPs and protocols are often living documents that undergo relatively frequent changes and updates, thus requiring version tracking. They are necessary to comply with Good Laboratory Practice (GLP).

Reporting guidelines: the EQUATOR Network

A reporting guideline is a simple, structured tool for researchers to use while writing manuscripts. It provides a minimum list of information needed to ensure a manuscript can be understood, replicated, or used for a practical purpose or in a meta-analysis. The EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network is an international initiative that seeks to improve the reliability and value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines.

Tables and spreadsheets

Almost all research projects use some form of tabular data. Tables may contain descriptive information, experimental task performance, filenames of image or scan files, recording parameters, measurements or any other information for analysis. Metadata tables are also essential in a project as a form of structured metadata carrying the information needed to interpret the data.

Guidelines for increasing the machine-readability of spreadsheets are given in (Perkel 2022) and (Broman and Woo 2018). We recommend these guidelines as they are concise and accessible reads. They highlight the need for consistency in formatting and machine-readability to enable automation of data validation, preprocessing, synthesis, and analysis of tabular data or metadata. Additional online material is provided in the chapter on data organisation in spreadsheets of the e-book for reproducible research from ‘The Turing Way’ (The Turing Way Community 2022). This page summarizes key points from the recommended guidelines and offers additional examples and information on common actions for working with tabular data.

Organizing tables or spreadsheets

1. Structure tables for machine-readability

Tables MUST be structured and formatted to be machine-readable. For example: do not encode any information with formatting (e.g., in color-coding schemes), but create additional columns instead that can be used as filters; do not merge cells and do not use empty cells (see examples below).

2. Use interoperable file formats

Publish and share them in an interoperable format, that is, a format compatible with most tools and programming languages. Comma-separated values (.CSV) is recommended. Working on spreadsheet software like MS Excel or Google Spreadsheets can be convenient and in some cases it may be preferred for collaborative editing of tabular data. When using such spreadsheet software, researchers should ensure that the table can be exported to more interoperable formats like CSV and read programmatically without losing or distorting the information.
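A quick way to verify interoperability is to round-trip a table through CSV and check that nothing was lost or distorted. A minimal sketch in Python with pandas (the column names and values are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical example table
df = pd.DataFrame({
    "Subject_ID": ["S001", "S002"],
    "Weight_kg": [70.5, 82.0],
})

# Export to CSV (an in-memory buffer stands in for a file on disk)
buffer = StringIO()
df.to_csv(buffer, index=False)

# Read it back and verify the round trip preserved the data
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df.equals(df_back))  # True
```

The same check applies when exporting from spreadsheet software: read the exported CSV programmatically and compare it against the original content.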

3. Include a codebook or data dictionary

The tables MUST be shared/published together with a codebook or data dictionary that explains the meaning of the variables, abbreviations, naming conventions and units. This can be documented in a table format (e.g., variable names as rows in the first column and description in a second column) or in text format (e.g., in a README file).

4. Keep data ‘tidy’

  • Include a single piece of information in a cell (e.g., instead of an entry 20_kg in a column weight, make an entry 20 in a column weight_kg). More on tidy data in Wickham (2014).

  • Keep formatting and naming consistent and follow broadly established standards when possible (e.g., write dates following the ISO 8601 standard YYYY-MM-DD; this should be described in the data dictionary). There can also be standards referring to file or variable naming, units or number formats, among others.
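Both recommendations can be illustrated with a short sketch in Python with pandas. The values (20_kg weights, a dd/mm/yyyy date) are hypothetical, chosen to show splitting value and unit into separate fields and normalizing dates to ISO 8601:

```python
import pandas as pd

# Hypothetical untidy column mixing value and unit in one cell
df = pd.DataFrame({"weight": ["20_kg", "35_kg"]})

# Tidy version: one piece of information per cell, unit moved to the column name
df["weight_kg"] = df["weight"].str.replace("_kg", "", regex=False).astype(float)
df = df.drop(columns="weight")

# Normalize a hypothetical dd/mm/yyyy date to the ISO 8601 YYYY-MM-DD standard
iso = pd.to_datetime(["31/03/2026"], format="%d/%m/%Y").strftime("%Y-%m-%d")

print(df["weight_kg"].tolist(), list(iso))  # [20.0, 35.0] ['2026-03-31']
```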

5. Keep raw data

Researchers may want to use functions in spreadsheet software or manual transformations to change the source data into more reusable format. If such transformations are necessary, at a minimum, keep track of them (document in a README file and/or share the script if there is one available) and make sure the results are read correctly when the files are exported into more interoperable formats like .CSV. Always keep the unedited raw data for future reference.

Examples

A poorly formatted table

The example in Table 1 illustrates errors that are often found in tables, especially when they are edited by multiple researchers. The structure is not adequate for data processing or analysis and there are inconsistencies in the cell values (e.g., subject labels) and formats (e.g., dates, decimals in weight). White spaces in variable names lead to a dot ‘.’ being added automatically when the table is read with R code. A color code seems to highlight images with artifacts.

Table 1: A poorly formatted table

A better formatted version

The example in Table 2 can be easily read using code, without ambiguities. It is accompanied by a codebook (Table 3) describing the variables.

Table 2: A table formatted for machine-readability
Table 3: Example of a codebook
Variable Description
Subject_ID Unique subject identifier. Possible values: S001, S002, S010, etc.
Scan_Date Date when the scan was performed. Possible values: YYYY-MM-DD.
Modality Imaging modality used. Possible values: MRI, fMRI, DTI, or T1.
Sex Biological sex of the subject. Possible values: M (male), F (female).
Age_years Age of the subject in years. Possible values: 25, 30, 34.
Weight_kg Subject’s weight in kilograms. Possible values: e.g., 10.00.
Exp_Group Experimental group assignment. Possible values: Control, Exp (experimental).
Scanner_Model MRI scanner brand. Possible values: Siemens, Philips, or GE Discovery.
Discard Discard images with artifacts TRUE or FALSE.
Notes Additional notes on scan quality or issues. Possible values: Free text.

Frequent actions with tables

Researchers often need to do some minor transformations or operations with tabular data. The main recommendations are:

  • Use code, as it will be a more reproducible and transparent process than doing it manually. Irrespective of the programming language, the code for these operations is usually easy to implement for users without advanced programming skills. In the R language, dplyr and tidyr are two useful packages for handling and manipulating tables.
  • Keep the code and preferably track changes in the code and the tables: version control.

Data validation

There are a few initial checks and data validation actions that should be conducted before archiving, publishing, sharing or conducting any analysis on tabular data, to ensure they do not contain errors. Some frequent checks are:

  • Variable types (e.g., numeric or character) and variable names (do they conform with the documented naming convention and codebook?)
  • Missing values and complete cases (e.g., are they as expected?)
  • Number and/or date formats (e.g., are they consistent?)
  • Unique values (e.g., is sex consistently described with the same label?).
  • Implausible values (e.g., negative age value, percentiles above 100)
  • The ‘tail’ of the data (e.g., researchers may have calculated the mean of a column in the last row)
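Several of these checks can be automated in a few lines. A minimal sketch in Python with pandas (the table, labels and problems are invented for illustration; the R functions and validate package mentioned below cover the same ground):

```python
import pandas as pd

# Hypothetical table with deliberate problems
df = pd.DataFrame({
    "Subject_ID": ["S001", "S002", "S003"],
    "Sex": ["M", "F", "f"],       # inconsistent label
    "Age_years": [25, -3, 34],    # implausible negative value
})

# Variable types: Age_years should be numeric
types_ok = pd.api.types.is_numeric_dtype(df["Age_years"])

# Missing values across the whole table
n_missing = int(df.isna().sum().sum())

# Unique values: is sex consistently described with the same labels?
sex_ok = set(df["Sex"].unique()) <= {"M", "F"}

# Implausible values: negative ages
implausible = df[df["Age_years"] < 0]

print(types_ok, n_missing, sex_ok, len(implausible))  # True 0 False 1
```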

Some useful functions in R programming language are:

  • str() (shows the structure of the data, including variable types and lengths)
  • unique() (for unique values)
  • is.na() (for missing values)
  • summary() (summary statistics)

There are also dedicated packages for data validation. For a more detailed guideline, together with R code to perform different data validation actions, we recommend the cookbook from the validate package for R. For Python users, a recent option is the pointblank Python package.

For a more conceptual overview, without code, we recommend the summary tables in the framework for initial data analysis (Huebner et al. 2018) used in the working group 3 from the STRATOS initiative (STRengthening Analytical Thinking for Observational Studies).

Combining tables

Researchers often need to combine information from two tables. A common scenario is to have a table with as many rows as subjects (e.g., subject characteristics) and another table with multiple rows per subject (e.g., indicating image files, slices, scans, etc.). If we want to merge these two tables we need a key variable that is consistent in both tables, for example, a subjectID variable uniquely identifying each subject. Here it is essential that each subject is labeled consistently and that the key variables have identical names in each table.
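The scenario above can be sketched in Python with pandas (subject IDs and filenames are hypothetical; the validate="many_to_one" argument makes the expected one-row-per-subject relationship explicit so the merge fails loudly if the key is not unique):

```python
import pandas as pd

# One row per subject (hypothetical subject characteristics)
subjects = pd.DataFrame({
    "Subject_ID": ["S001", "S002"],
    "Age_years": [25, 30],
})

# Multiple rows per subject, one per image file (hypothetical filenames)
scans = pd.DataFrame({
    "Subject_ID": ["S001", "S001", "S002"],
    "Filename": ["s001_t1.nii", "s001_t2.nii", "s002_t1.nii"],
})

# The key variable must be labeled consistently and named identically in both tables
merged = scans.merge(subjects, on="Subject_ID", how="left", validate="many_to_one")
print(merged["Age_years"].tolist())  # [25, 25, 30]
```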

Reshaping tables: wide and long formats

We can distinguish two basic data formats:

  • Wide format: identifiers that do not repeat are usually in the first column, and every measure that varies occupies its own set of columns. E.g., one row per subject, with columns specifying features of the subjects.

  • Long format: multiple records for each individual. Some variables do not vary and are identical in each record, whereas others vary across records. E.g., multiple rows per subject, listing filenames of images associated with each.

Research projects often need to combine both types of tables, and some statistical analyses or visualizations may require one specific table format, so researchers often need to ‘reshape’ the tables. These transformations can be very straightforward, but they can also become quite complex when manipulating large tables with many variables. Always check the output of these operations (e.g., running validation or descriptive summaries) to make sure that no information was misplaced or lost.

Table 4: A table in wide format
Subject_ID Test_A Test_B
S001 5.4 3.2
S002 6.1 5.8
S003 7.3 6.5
S004 4.8 4.1
Table 5: Table converted to long format
Subject_ID Test Value
S001 Test_A 5.4
S001 Test_B 3.2
S002 Test_A 6.1
S002 Test_B 5.8
S003 Test_A 7.3
S003 Test_B 6.5
S004 Test_A 4.8
S004 Test_B 4.1
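The conversion from Table 4 to Table 5 can be reproduced programmatically. The sketch below uses the melt function from Python's pandas (in R, tidyr offers the equivalent pivot_longer), followed by the recommended check of the output:

```python
import pandas as pd

# Table 4: wide format, one row per subject
wide = pd.DataFrame({
    "Subject_ID": ["S001", "S002", "S003", "S004"],
    "Test_A": [5.4, 6.1, 7.3, 4.8],
    "Test_B": [3.2, 5.8, 6.5, 4.1],
})

# Reshape to long format (Table 5): one row per subject-test combination
long_df = wide.melt(id_vars="Subject_ID", var_name="Test", value_name="Value")
long_df = long_df.sort_values(["Subject_ID", "Test"]).reset_index(drop=True)

# Always check the output: 4 subjects x 2 tests should give 8 rows
print(len(long_df))  # 8
```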

Metadata companion files

Metadata are essential information about data: as important as the data itself, they provide its context. Metadata include documents intended for human use, consisting mainly of text and images, but most often the term refers to machine-readable information in the form of tables and structured files that accompany data. ‘Machine-readable’ means that it can be easily processed by computers without human intervention. These files are also sometimes referred to as ‘sidecar’ or ‘companion’ files.

We recommend thinking about content and format of these files early on in a project. When possible, it is preferable to adopt or adjust discipline-specific standards instead of creating new standards.

Format: JavaScript Object Notation (JSON)

A popular format for metadata companion files is JavaScript Object Notation (JSON). JSON is an open and text-based standard format for data interchange. It is often recommended as a format for metadata files because it is easy to read by humans and by many programming languages. JSON is also a good format for larger (metadata) data that have a hierarchical structured relationship. The structure of a JSON object is:

  • The data are in name/value pairs and data objects are separated by commas

  • Curly braces {} hold objects

  • Square brackets [] hold arrays

  • Each data element is enclosed in double quotes "" if it is a character value, or written without quotes if it is a numeric value
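These building blocks can be seen in action with Python's standard json module. The record below is a small hypothetical example, serialized and parsed back:

```python
import json

# A minimal record illustrating the building blocks listed above
record = {
    "Subject_ID": "S001",       # character values are quoted
    "Age_years": 25,            # numeric values are not
    "PixelSize": [0.23, 0.23],  # square brackets hold arrays
}

text = json.dumps(record, indent=4)  # curly braces hold the object
parsed = json.loads(text)            # parsing recovers the original structure
print(parsed == record)  # True
```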

The ‘fields’ or name/value pairs in JSON files can vary widely depending on the data and file type they accompany. The following examples are provided by the Brain Imaging Data Structure (BIDS) for microscopy data (Bourget et al. 2022):


Example of sidecar JSON file (*_<suffix>.json)
{
        "Manufacturer": "Hamamatsu",
        "ManufacturersModelName": "C9600-12",
        "PixelSize": [0.23, 0.23],
        "PixelSizeUnits": "um",
        "Magnification": 40,
        "BodyPart": "BRAIN",
        "BodyPartDetails": "corpus callosum",
        "SampleEnvironment": "ex vivo",
        "SampleFixation": "4% paraformaldehyde, 2% glutaraldehyde",
        "SampleStaining": "LFB",
        "SliceThickness": 5,
        "TissueDeformationScaling": 97
}

Example of participants.json:

{
    "species": {
        "Description": "binomial species name from the NCBI Taxonomy 
        (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)"
    },
    "strain": {
        "Description": "name of the strain of the species"
    },
    "strain_rrid": {
        "Description": "research resource identifier (RRID) of the strain 
        (https://rrid.site/data/source/nlx_154697-1/search)"
    }
}

Creating metadata files from tables

Data acquisition instruments often generate metadata files automatically. However, this is not always the case and sometimes researchers need to customize the content of these files. Adopting existing metadata standards is usually recommended. Some projects may require their own JSON schemas, although developing these schemas may be labor intensive and require data stewards or expert data managers. Often, simpler solutions are sufficient, and researchers may, for example, generate a JSON file from the content of a table (see Table 6). This can be done in many programming languages; for example, the R package jsonlite is simple to use and offers several encoding and formatting options. Automatically generating these files is recommended, rather than resorting to manual edits that can lead to errors that go unnoticed.

Table 6: Example of metadata in tabular format.
id img_location sex condition treatment
subject_04 dummyfolder/subject_04_imgfile.tiff female A control
subject_93 dummyfolder/subject_93_imgfile.tiff female B control
subject_12 dummyfolder/subject_12_imgfile.tiff female B treat_3
subject_65 dummyfolder/subject_65_imgfile.tiff male B control
subject_98 dummyfolder/subject_98_imgfile.tiff female A treat_1
subject_71 dummyfolder/subject_71_imgfile.tiff male A treat_3
subject_24 dummyfolder/subject_24_imgfile.tiff female B control

The first row of Table 6 represented in a JSON file would look like this:

{
  "id": "subject_04",
  "img_location": "dummyfolder/subject_04_imgfile.tiff",
  "sex": "female",
  "condition": "A",
  "treatment": "control"
}
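A sketch of how such a file could be generated programmatically, here in Python with pandas and the standard json module (the document suggests R's jsonlite as an alternative). The table values mirror the first row of Table 6:

```python
import json
import pandas as pd

# Hypothetical metadata table mirroring Table 6 (first row only)
meta = pd.DataFrame({
    "id": ["subject_04"],
    "img_location": ["dummyfolder/subject_04_imgfile.tiff"],
    "sex": ["female"],
    "condition": ["A"],
    "treatment": ["control"],
})

# Convert the first row to a JSON string, one name/value pair per column
row_json = json.dumps(meta.iloc[0].to_dict(), indent=2)
print(row_json)
```

In practice one would loop over the rows and write one JSON file per record, which keeps the metadata generation automatic and reproducible.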

Terminologies and ontologies

For data to be reusable, the metadata and documentation must use consistent terminology that is clearly explained so that different users can understand its meaning unambiguously. We recommend starting with simple, controlled vocabularies before moving on to more complex structures, such as taxonomies or ontologies. As with other metadata and documentation elements, standard or widely used vocabularies or ontologies should be used where available.

Controlled vocabularies

Controlled vocabularies are standardized and organized arrangements of words and phrases, presented as alphabetical lists of terms or as thesauri and taxonomies with a hierarchical structure of broader and narrower terms (see the ‘controlled vocabulary’ concept in EU Vocabularies). Their main benefits are:

  • Consistency: they provide a common language for researchers, minimizing variations and ambiguities

  • Interoperability: they facilitate integration and sharing of data across systems, disciplines, and organizations, enabling seamless collaboration (human and machine actionable).

  • Reusability: Research data annotated with standard vocabularies are easier to interpret and reuse
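In practice, annotations can be checked against a controlled vocabulary programmatically. A minimal sketch in Python with pandas, using an invented modality vocabulary:

```python
import pandas as pd

# Invented controlled vocabulary for an imaging-modality field
modality_vocabulary = {"MRI", "fMRI", "DTI", "PET"}

# Annotations to validate; 'fmri' violates the vocabulary's capitalization
annotations = pd.Series(["MRI", "fmri", "PET"])

# Flag terms that are not in the controlled vocabulary
invalid = annotations[~annotations.isin(modality_vocabulary)]
print(invalid.tolist())  # ['fmri']
```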

Taxonomies

A taxonomy is a controlled vocabulary in which all the terms belong to a single hierarchical structure and have parent/child or broader/narrower relationships to other terms. The structure is sometimes referred to as a ‘tree’. The addition of non-preferred terms/synonyms may or may not be part of a taxonomy. (see ‘taxonomy’ concept in EU vocabularies).

Ontologies

An ontology is a formal way of representing knowledge about a specific domain, using a structured framework of concepts and their relationships (see ‘ontology’ in EU Vocabularies). It can also be seen as a hierarchical graph covering a specific subject area or domain. An ontology has classes (i.e., concepts, terms, etc.) as basic units, and relations or links between the classes.

Selected resources

  • Open Biological and Biomedical Ontology Foundry is a community that develops interoperable ontologies for the biological sciences

  • Ontobee is a server that aims to facilitate ontology data sharing, visualization, query integration and analysis.

  • FairSharing is a platform to look for standards, policies and databases. It can also be used to search for ontologies and controlled vocabularies


References

Bourget, Marie-Hélène, Lee Kamentsky, Satrajit S. Ghosh, Giacomo Mazzamuto, Alberto Lazari, Christopher J. Markiewicz, Robert Oostenveld, et al. 2022. “Microscopy-BIDS: An Extension to the Brain Imaging Data Structure for Microscopy Data.” Frontiers in Neuroscience 16. https://doi.org/10.3389/fnins.2022.871228.
Broman, Karl W., and Kara H. Woo. 2018. “Data Organization in Spreadsheets.” The American Statistician 72 (1): 2–10. https://doi.org/10.1080/00031305.2017.1375989.
Goldberg, Ilya G, Chris Allan, Jean-Marie Burel, Doug Creager, Andrea Falconi, Harry Hochheiser, Josiah Johnston, Jeff Mellen, Peter K Sorger, and Jason R Swedlow. 2005. “The Open Microscopy Environment (OME) Data Model and XML File: Open Tools for Informatics and Quantitative Analysis in Biological Imaging.” Genome Biology 6 (5): R47. https://doi.org/10.1186/gb-2005-6-5-r47.
Gorgolewski, Krzysztof J., Tibor Auer, Vince D. Calhoun, R. Cameron Craddock, Samir Das, Eugene P. Duff, Guillaume Flandin, et al. 2016. “The Brain Imaging Data Structure, a Format for Organizing and Describing Outputs of Neuroimaging Experiments.” Scientific Data 3 (1): 160044. https://doi.org/10.1038/sdata.2016.44.
Hammer, Mathias, Maximiliaan Huisman, Alessandro Rigano, Ulrike Boehm, James J. Chambers, Nathalie Gaudreault, Alison J. North, et al. 2021. “Towards Community-Driven Metadata Standards for Light Microscopy: Tiered Specifications Extending the OME Model.” Nature Methods 18 (12): 1427–40. https://doi.org/10.1038/s41592-021-01327-9.
Harvard Longwood Medical Area Research Data Management Working Group. 2021. “Electronic Lab Notebook Comparison Matrix,” May. https://doi.org/10.5281/ZENODO.4723753.
Higgins, J. P. T. 2003. “Measuring Inconsistency in Meta-Analyses.” BMJ 327 (7414): 557–60. https://doi.org/10.1136/bmj.327.7414.557.
Huebner, Marianne, Saskia Le Cessie, Carsten O. Schmidt, and Werner Vach. 2018. “A Contemporary Conceptual Framework for Initial Data Analysis.” Observational Studies 4 (1): 171–92. https://doi.org/10.1353/obs.2018.0014.
Perkel, Jeffrey M. 2022. “Six Tips for Better Spreadsheets.” Nature 608 (7921): 229–30. https://doi.org/10.1038/d41586-022-02076-1.
Ropelewski, Alexander J., Megan A. Rizzo, Jason R. Swedlow, Jan Huisken, Pavel Osten, Neda Khanjani, Kurt Weiss, et al. 2022. “Standard Metadata for 3D Microscopy.” Scientific Data 9 (1): 449. https://doi.org/10.1038/s41597-022-01562-5.
Sansone, Susanna Assunta, Peter McQuilton, Philippe Rocca-Serra, Alejandra Gonzalez Beltran, Massimiliano Izzo, Allyson L. Lister, and Milo Thurston. 2019. “FAIRsharing as a Community Approach to Standards, Repositories and Policies.” Nature Biotechnology 37 (4): 358–67. https://doi.org/10.1038/s41587-019-0080-8.
Sarkans, Ugis, Wah Chiu, Lucy Collinson, Michele C. Darrow, Jan Ellenberg, David Grunwald, Jean-Karim Hériché, et al. 2021. “REMBI: Recommended Metadata for Biological Images: Enabling Reuse of Microscopy Data in Biology.” Nature Methods 18 (12): 1418–22. https://doi.org/10.1038/s41592-021-01166-8.
The Turing Way Community. 2022. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research (1.0.2). Zenodo. https://doi.org/10.5281/ZENODO.3233853.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.