Sharing and Publishing Data

Long-term preservation and interoperability of the data

Published

March 31, 2026

File formats

The choice of a format will determine how data can be accessed throughout its lifecycle, including sharing and reuse. In general, we need to consider long-term preservation when choosing a format. Files in formats with the following features will be more likely to be accessible in the future:

  • Non-proprietary

  • Open, documented standard

  • Popular format

  • Standard representation

  • Unencrypted

  • Uncompressed

We recommend selecting file formats based on these features and following existing guidelines for general and domain-specific file formats. The sections below summarize the file formats recommended by various institutions.

File formats in the general domain

There are many institutions providing recommendations on file formats in general. This section shows a couple of representative examples.

Recommendations from the Faculty of Biology and Medicine, University of Lausanne (Medical Library):

Category Recommended Formats
Text PDF/A – PDF/X, Plain text (.txt), Open Office (.odt), XML / HTML (with schema), Word XML (.docx), RTF, LaTeX
Images Bitmap: TIFF (uncompressed), PNG, JPEG2000, (GIF). Vector: SVG
Tabular Data CSV (comma, tab, semi-colon), Open Office (.ods), XML / HTML (with schema), Excel (.xlsx), .SQL
Video MPEG-4 (H.264) (~ MP4), Motion JPEG 2000, MPEG-1/2
Audio WAV (preferably Broadcast Wave Format, LPCM), AIFF (LPCM), OGG Vorbis, MP3 (MPEG Layer III), AAC (MPEG-4)

Recommendations from the open data publishing platform (Dryad):

Category Recommended formats
README files Markdown (MD) or text (TXT)
Tabular data Comma-separated values (CSV)
Non-tabular data Semi-structured plain text (e.g., protein sequences)
Structured plain text XML, JSON (e.g., metadata companion files)
Images PDF, JPEG, PNG, TIFF, SVG
Audio FLAC, AIFF, WAV, MP3, OGG
Video AVI, MPEG, MP4
Compressed file archives TAR.GZ, 7Z, ZIP
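Both tables above recommend CSV for tabular data because it can be written and read with standard tooling in virtually any language. A minimal round-trip sketch in Python's standard library (the column names and values are purely illustrative):

```python
import csv
import io

# Write a small table to CSV: a header row followed by data rows.
rows = [
    {"subject_id": "sub-01", "age": 12, "group": "control"},
    {"subject_id": "sub-02", "age": 14, "group": "patient"},
]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["subject_id", "age", "group"])
writer.writeheader()
writer.writerows(rows)

# Read it back: any CSV-aware tool recovers the same table
# (note that plain CSV stores every cell as text).
buffer.seek(0)
recovered = list(csv.DictReader(buffer))
print(recovered[0]["subject_id"])  # sub-01
```

The same round trip works across R, Python, spreadsheets and command-line tools, which is exactly the interoperability property the recommendations are after.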

Interoperable formats in neuroimaging

There are many modalities in neuroimaging research and therefore many different file formats used in this field. This section covers some of the most popular formats in neuroimaging. The Brain Imaging Data Structure (BIDS) standard also provides recommendations for some of these formats across the different data modalities.

Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Positron Emission Tomography (PET) and similar

  • Neuroimaging Informatics Technology Initiative (NIfTI). An open file format commonly used to store imaging data obtained with magnetic resonance imaging methods. The associated file extensions are .nii, .nii.gz and .hdr/.img.

  • Digital Imaging and Communications in Medicine (DICOM) is a technical standard for the digital storage and transmission of biomedical images. Developed in the 1980s, it remains widely adopted by hospitals and medical software today. The file extension is usually .dcm.

Microscopy

  • OME data model. The OME Model is a specification from the Open Microscopy Environment for storing biological imaging data. The model includes parameters and extensive metadata. OME-XML is a file format used to store data according to the model. OME-TIFF is a multi-plane TIFF file with OME metadata in the header (as OME-XML). See more details in their publication (Goldberg et al. 2005). Further, OME-ZARR (sometimes referred to as OME-NGFF or NGFF) is a cloud-optimized format developed to improve access and storage for large data (Moore et al. 2023).

  • In BIDS (v1.10.0), microscopy raw data MUST be stored in one of the formats listed in the specification's microscopy section (these include OME-TIFF; consult the BIDS specification for the complete, current list).

Electroencephalography (EEG) and intracranial electroencephalography (iEEG)

There is no single standard. The recommendations from BIDS for raw data are the following:

  • European Data Format (.edf). Each recording is a single file. EDF+ files are permitted. The capital .EDF extension MUST NOT be used.

  • BrainVision Core Data Format (.vhdr, .vmrk, .eeg). Each dataset recording consists of a file triplet.

  • EEGLAB (.set, .fdt). The format used by the MATLAB toolbox EEGLAB. Each recording consists of a .set file with an OPTIONAL .fdt file.

  • Neurodata Without Borders (.nwb). For iEEG data only. Each recording consists of a single .nwb file.

  • MEF3 (.mefd). For iEEG data only. Each recording consists of a .mefd directory.

  • Biosemi (.bdf). For EEG data only. Each recording consists of a single .bdf file. bdf+ files are permitted. The capital .BDF extension MUST NOT be used.

Metadata, markup and data serialization

Markup languages specify the structure and formatting of a document. Data serialization languages are used to create plain-text configuration and data files that are portable between systems.

  • JavaScript Object Notation (JSON) (.json). In the BIDS specification, JSON files are primarily used as “sidecar” files accompanying the data files. There are also a few special cases of JSON files being first-order data files, such as genetic_info.json.

  • Extensible Markup Language (XML) (.xml) is a markup language and file format for storing, transmitting, and reconstructing data.

  • YAML Ain’t Markup Language (.yaml, .yml) is a human-friendly data serialization language, supported across programming languages and often used for configuration files.

  • Markdown (.md), a lightweight markup language with a simple syntax that adds formatting elements to plain text. This format is widely used for technical documentation online.
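The sidecar pattern mentioned above can be sketched with Python's standard json module. The field values below are illustrative only, and the dictionary is not a complete BIDS sidecar; consult the BIDS specification for the fields a given modality requires:

```python
import json

# Minimal metadata "sidecar" describing an imaging file.
# Values are hypothetical; see the BIDS specification for required fields.
sidecar = {
    "Manufacturer": "ExampleVendor",
    "RepetitionTime": 2.0,  # seconds
    "TaskName": "rest",
}

# Serialize with indentation so the file stays human-readable,
# which matters for long-term preservation and manual inspection.
text = json.dumps(sidecar, indent=4, sort_keys=True)

# Round-trip check: any JSON parser recovers the same key/value pairs.
assert json.loads(text) == sidecar
```

Because JSON is an open, documented standard, such sidecars remain readable by any language long after the software that wrote them is gone.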

File format conversion

There are multiple reasons why researchers often have to convert files to a different format. For example, imaging acquisition hardware may generate files in a proprietary format; researchers then convert these files to a standardized format like NIfTI so the images can be analyzed, shared and published in a more universally interoperable form. Files are also often compressed to facilitate sharing.

These are some important considerations when converting files:

  • Quality should be preserved: there should be no data loss (e.g., lossy compression methods where some data is lost may have an impact when reusing the data) or data corruption (data that is still present but unreadable).

  • When possible, the source and converted files should be kept, in case errors in the process were to be found in the future.

  • Conversion is better done automatically with a script logging any conversion parameters.

  • Long-time preservation and reusability should be considered when choosing the conversion format.
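The considerations above can be sketched in a small script. This example uses lossless gzip compression from the Python standard library as the "conversion", logs the conversion parameters, and verifies integrity by comparing checksums of the original and round-tripped data; the function name and log fields are illustrative:

```python
import gzip
import hashlib

def compress_with_log(payload: bytes) -> tuple:
    """Losslessly compress `payload` and record the conversion parameters.

    Illustrates scripted conversion: parameters are logged for
    reproducibility, and integrity is verified with checksums.
    """
    level = 6  # gzip compression level; logged so the step is reproducible
    compressed = gzip.compress(payload, compresslevel=level)
    log = {
        "tool": "gzip (Python standard library)",
        "compresslevel": level,
        "sha256_original": hashlib.sha256(payload).hexdigest(),
        "sha256_roundtrip": hashlib.sha256(
            gzip.decompress(compressed)
        ).hexdigest(),
    }
    # Lossless conversion: the round-tripped checksum must match the original.
    assert log["sha256_original"] == log["sha256_roundtrip"]
    return compressed, log

data = b"example raw data " * 1000
compressed, log = compress_with_log(data)
```

Keeping such a log next to the converted files documents exactly how each file was produced, and the checksum comparison catches silent data corruption. A lossy conversion (e.g., JPEG compression of images) would fail this checksum test by design, which is precisely why it should be flagged.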

Version control

Version control refers to recording the changes made to a file or set of files over time. The record should allow collaborators to track the history, and to review and revert changes. Versioning refers to the management of changes in a file or project (read more in https://book.the-turing-way.org/reproducible-research/vcs).

Versioning is very important for scientific reproducibility and transparency because it preserves provenance information, i.e., the record of how results are derived from data. Data and code evolve over time, and it can become very difficult to track which version of them generated a figure or a results table in a publication.

We recommend adopting Git, introduced below, to apply version control in documents, data and code. We suggest beginning with simple workflows; there is no need to master all of Git’s advanced features right away.

Version control in documents

There are many widely used tools for version control and collaborative editing of text documents, for example MS Word, Google Docs and Overleaf. These tools have some limitations that need to be considered:

  • They require a master document (a single source document where we edit changes)

  • They do not scale up well to a large number of collaborators

  • These tools do not show the complete history of changes

  • The changes and master document are stored in a company’s cloud

  • They only work for specific types of documents

In view of these limitations, these tools cannot provide a reliable and manageable version control system for many research outputs like metadata, documentation, tables and especially code.

Git version control system for data science

Git is a version control system that tracks changes in files, including timestamps. It is the de facto standard in software development and is increasingly used in research and data science in general. It is open source and was originally developed by Linus Torvalds in 2005. These are some characteristics of the Git system:

  • It is known to be fast, efficient, and reliable

  • Runs on all major operating systems

  • It is integrated into most IDEs (Integrated Development Environments): RStudio, Visual Studio Code, Eclipse, PyCharm, Spyder, MATLAB®…

Git is a distributed version control system (DVCS), which means collaborators create local copies of a project, make changes locally and then commit those changes to a remote repository (see Figure 1).

Figure 1: Centralized vs distributed version control systems

Main concepts in Git

  • Repository: A collection of files, typically organized as a project, managed with version control.

  • Commit: The action of recording changes in a repository. It is like a snapshot of the entire repository at a given time.

  • History: A registry (“log”) and collection of all the snapshots (“commits”) of a repository, allowing us to revert changes.
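The "snapshot" idea above is literal: internally, Git identifies every piece of content by a hash. A file's object ID ("blob" ID) is the SHA-1 of the header "blob <size>\0" followed by the file's bytes, so identical content always gets the same ID regardless of filename or date. A minimal reimplementation of this documented scheme (not a call into Git itself):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's content (a 'blob').

    Git hashes the header b"blob <size>\\0" followed by the raw bytes,
    so the ID depends only on content, never on filename or timestamp.
    """
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# The empty file has a well-known Git object ID:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This content-addressing is what makes the history trustworthy: if a tracked file changes by a single byte, its ID changes, and every commit that references it records exactly which version was present.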

Platforms to host Git repositories online

  • GitLab: An open-source web-based tool that provides a Git repository manager (www.gitlab.com).
    Many institutions and organizations have their own GitLab instance (e.g., UZH https://gitlab.uzh.ch/).

  • GitHub: Another provider, owned by Microsoft (potential data governance issues) (www.github.com).

An overview of the Git system

The main areas and commands in Git are summarized in Figure 2.

Git ‘areas’:

  • Remote repository: the online repository with all files and the history of changes, hosted on platforms like GitLab or GitHub.

  • Workspace: this is the “visible” folder. In most cases this is a local folder with a copy of the online repository (including history of changes).

  • Index: (not directly visible) a staging area used to prepare changes before committing them to the project history.

  • Local repository: (usually a hidden .git folder) a copy of the entire project’s history and codebase.

Main actions:

  • Pull: Pull changes from the remote to update the local copy.

  • Commit & Push: Commit local changes and push them to the remote.

  • Revert: Revert changes to a previous commit.

    Figure 2: Overview of Git commands

Workflows

Git can be used in the context of complex application development and in smaller settings. Although there may be a steep learning curve at the beginning, scientists can gradually incorporate Git into their routine workflow and it can be kept relatively simple without needing to use all the advanced features.

  • Branches: a repository can have multiple ‘branches’, which allows working in parallel. The same user or multiple collaborators can work on parallel versions of the repository. Changes from a branch (e.g., a new feature) can eventually be merged into the ‘main’ branch.
Figure 3: Illustration of git branches from the Turing Way

Publishing and sharing

  • Releases. The content of a Git repository changes frequently. When scientists publish the code associated with a publication, the version of the repository containing the files involved in that published analysis needs to be shared as well. This can be done by publishing a release, which is a snapshot of a repository with a version tag.
  • Persistent identifiers. Sharing only the URL of a GitLab or GitHub repository associated with a publication is not recommended: the URL may change over time and the code version may become unavailable. Instead, it is recommended to publish a Git repository release on a platform that assigns a persistent identifier, for example a digital object identifier (DOI), so that the repository becomes a citable resource that can be found in the future. Platforms like Zenodo, for example, offer direct integration with GitHub, assigning a DOI to every release of the repository. Alternatively, one can simply upload to the platform a .zip folder with all the files in the repository (the ‘snapshot’ created when making a release) together with a version tag.
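Building such a versioned .zip snapshot can itself be scripted. The sketch below creates a small throwaway project and packs it into an archive whose name carries the version tag; the project layout, file names and tag are all hypothetical:

```python
import tempfile
import zipfile
from pathlib import Path

version = "v1.0.0"  # hypothetical version tag for the snapshot

with tempfile.TemporaryDirectory() as tmp:
    # Create a toy project folder standing in for a real repository checkout.
    project = Path(tmp) / "my-analysis"
    project.mkdir()
    (project / "README.md").write_text("# My analysis\n")
    (project / "run_model.py").write_text("print('hello')\n")

    # Pack every file into a zip named after the version tag.
    archive = Path(tmp) / f"my-analysis-{version}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(project.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent, so the zip
                # unpacks into a single "my-analysis/" folder.
                zf.write(path, path.relative_to(project.parent))

    # Inspect the snapshot's contents.
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
    print(names)  # ['my-analysis/README.md', 'my-analysis/run_model.py']
```

Uploading an archive like this to a repository that mints DOIs gives the snapshot a citable, persistent identity independent of the hosting platform's URLs.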

Best practices

Setting up a repository

  • “Documentation is a love letter that you write to your future self.”

  • Add a README file in the main folder (root folder) and in other folders if needed.

  • Use a naming convention with a balance between concise and informative names. Avoid too cryptic or ambiguous folder or file names. For example: run_model_AB.r instead of run_scripts.r

Code writing

  • Consider numbering scripts that need to be run sequentially.

  • Use relative paths.

  • Log software libraries and system dependencies (e.g., with sessionInfo() in R), and hardware when applicable.

  • Scripts should include concise comments.

  • Consider scripts to generate reports (dynamic reporting).
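The dependency-logging advice has a direct Python analogue to R's sessionInfo(). A minimal sketch using only the standard library (importlib.metadata is available from Python 3.8; the function name is illustrative):

```python
import json
import platform
import sys
from importlib import metadata

def session_info() -> dict:
    """Collect a minimal software-environment log, akin to R's sessionInfo()."""
    return {
        "python": sys.version.split()[0],          # interpreter version
        "platform": platform.platform(),           # OS and architecture
        "packages": {                              # installed distributions
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

info = session_info()
# Writing the log next to the analysis outputs (e.g., as JSON) lets
# future readers trace results back to the environment that produced them.
log_text = json.dumps({"python": info["python"], "platform": info["platform"]})
```

Saving this log with every run costs one extra file but makes it far easier to reproduce, or at least diagnose, results years later.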

Sharing and Publishing

  • Describe the project in the main README file (dates, project ID, funding, affiliations, and DOI).

  • When the repository contains the code associated with a publication, include the DOI of the publication in the main README file.

  • Create a release, i.e., take a snapshot with a version tag.

  • Assign a DOI to the release.

  • Include a license file to explain reusability (agree with PI, e.g., GPL, MIT).

  • Specify the location of the source data.

Selected resources on Git

Archives and online repositories

Making research data available to others for reuse is essential to scientific progress and is the main purpose of archiving open research data (ORD). This is recognized in the funding regulations of the Swiss National Science Foundation (SNSF).

We recommend planning early for the archiving of research outputs: deciding what will be archived, how, when, and where. At a minimum, a data archive should include the raw data as well as any relevant derivatives, metadata files and documentation, and, when applicable, any code involved in processing the data or making it usable. Without the adequate context provided by metadata and documentation, archiving data results in a data dump of little scientific value.

Types of data archives

Data temperature by frequency of access

We can broadly classify data archives by how frequently we expect the data to be accessed. Table 1 summarizes the different ‘temperatures’ of data defined by access frequency (Pernet et al. 2023). Irrespective of the temperature, any data archiving should comply with the pertinent legal governance requirements. The contents and steps described in this primer can apply to any type of archive, but the presented use case fits into the warm archive category. It should be noted that an archive does not necessarily make data publicly available immediately: most repositories allow restricting access, for example by requiring an application or applying an embargo until a publication is released.

Table 1: An overview of different types of data archives depending on the frequency of access.
Feature \ Temperature Cold Warm Hot
Access frequency Very low Regular Constant
Back-ups No Yes Yes
Costs Cheap Expensive Expensive
Content All High-utility High-utility
Duration Long Long Short/Medium
Medium Tape Disk/Server/Cloud Disk/Server/Cloud
Online No Yes Yes
  • Cold data are not expected to be accessed for a long time, but they should still be curated, unlike data considered disposable (Pernet et al. 2023). In some contexts, rarely accessed data can be essential. Collecting new data is usually much more expensive than archiving existing data, so it is generally advisable to cold-archive everything by default. Tapes are relatively cheap and robust to deterioration if stored in facilities that ensure data integrity. Cold data tend to involve a larger storage volume than warm and hot data, as they include all data, including data of limited utility. It is therefore often not cost-effective to make cold data available online.

  • Warm data are expected to be accessed regularly and they are long-term archived in online servers or cloud repositories. Usually there are multiple back-up copies in physically separated locations to protect data integrity. Warm data are considered of high-utility for research reuse.

  • Hot data are accessed constantly for a specific period of time, for example, while a project is running to share it with collaborators. Hot data are stored in relatively expensive online servers or cloud repositories, and are more likely to move location than cold and warm data.

Open Research Data (ORD) online repositories

Open research data repositories aim at preserving data and making them freely accessible to anyone, promoting transparency and scientific collaboration. Online repositories are meant to host warm data that will be accessed regularly; they are too expensive for, and not conceived for, storing cold data that will remain unused for a long time. Nonetheless, a good online repository should offer security and guarantee that the data will remain findable and accessible for a long period of time.

SNSF requirements and FAIR repositories

A key requirement for a repository is that it provides a persistent identifier for the data and metadata and facilitates making the archived data “FAIR enough”. These can be determining factors when evaluating the suitability of a repository, although assessing its FAIRness may not be a straightforward task. The Swiss National Science Foundation, for example, defined a set of minimal criteria that repositories must fulfill to be considered ‘FAIR enough’ (Milzow et al. 2020). The criteria in that report were:

  • Globally unique and persistent identifiers are attributed to data sets (e.g. DOI)

  • Upload of intrinsic (e.g. author’s name, content of dataset, associated publication, etc.) and submitter-defined (e.g. definition of variable names, etc.) metadata possible

  • Reuse defined (e.g. Creative Commons, Open Data Commons, etc.)

  • Citation information and metadata always (even in the case of datasets with restricted access) publicly accessible

  • Intrinsic metadata submitted via standardized template/mask (to ensure machine readability and interoperability)

  • Long-term preservation plan for the archived data in place

Initiatives like the CoreTrustSeal make these criteria more explicit by defining and providing certifications based on how well repositories can ensure long-term storage and accessibility of data. These certifications also try to integrate FAIR-enabling assessments, taking into account repository features that help publishing FAIRer data.

Finding a repository

The SNSF requires researchers to deposit the data associated with a project’s publications in public repositories. Researchers are therefore encouraged to consider the field-specific options available in the early stages of project planning:

  • The SNSF provides an overview of data repositories and links to relevant institutional, generalist and discipline-specific repositories.
  • The website https://www.re3data.org/ (Pampel et al. 2013), also listed on the SNSF page, allows exploring the available options for a variety of domains. This database allows browsing by subject, content type and country, as well as other filters (e.g., versioning, metadata standards, data licenses and many others).

Repositories of special interest for preclinical neuroimaging

  • EBRAINS. EBRAINS is a platform and repository created as a core aim of the EU-funded Human Brain Project to promote advances in scientific and industrial research in neuroscience, computing, and brain-related medicine. It gathers a broad range of data and tools, including computing resources, and is designed to accommodate large volumes of data. The platform offers EBRAINS Curation Services, accessible via a request form.

  • Image Data Resource (IDR) is a public repository of reference (microscopy) image datasets from published studies. It is led by PIs from the University of Dundee and the European Molecular Biology Laboratory - European Bioinformatics Institute. The accepted images must be reference images according to the criteria of the Euro-BioImaging - Elixir Image Data Strategy. For other images, the IDR website points to the BioStudies and Dryad project repositories. Metadata guidelines are provided together with examples and templates of annotations.

  • OpenNeuro. OpenNeuro is the successor of the OpenfMRI repository, from which the Brain Imaging Data Structure (BIDS) guidelines emerged. It assigns DOIs to dataset snapshots (versions) and, although it has a strong focus on human neuroimaging, animal imaging datasets are also published there. It focuses exclusively on neuroimaging data of different modalities (MRI, PET, MEG, EEG, and iEEG). It is free, maintained by the Stanford Center for Reproducible Neuroscience, and endorsed by the NIH as a data-sharing resource.


References

Fraga Gonzalez, Gorka. n.d. “Reproducibility in Practice: Version Control and Dynamic Reporting - shareCTD Schooling Event.” https://doi.org/10.5281/ZENODO.14754696.
Goldberg, Ilya G, Chris Allan, Jean-Marie Burel, Doug Creager, Andrea Falconi, Harry Hochheiser, Josiah Johnston, Jeff Mellen, Peter K Sorger, and Jason R Swedlow. 2005. “The Open Microscopy Environment (OME) Data Model and XML File: Open Tools for Informatics and Quantitative Analysis in Biological Imaging.” Genome Biology 6 (5): R47. https://doi.org/10.1186/gb-2005-6-5-r47.
Hofmann, Felix, Samuel Pawel, Melanie Röthlisberger, and Leonhard Held. 2023. “Primer: Digital Collaboration.” https://doi.org/10.5281/ZENODO.8354375.
Milzow, Katrin, Martin von Arx, Cornélia Sommer, Julia Cahenzli, and Lionel Perini. 2020. “Open Research Data: SNSF Monitoring Report 2017-2018.” https://doi.org/10.5281/zenodo.3618123.
Moore, Josh, Daniela Basurto-Lozada, Sébastien Besson, John Bogovic, Jordão Bragantini, Eva M. Brown, Jean-Marie Burel, et al. 2023. “OME-Zarr: A Cloud-Optimized Bioimaging File Format with International Community Support.” Histochemistry and Cell Biology 160 (3): 223–51. https://doi.org/10.1007/s00418-023-02209-1.
Pampel, Heinz, Paul Vierkant, Frank Scholze, Roland Bertelmann, Maxi Kindling, Jens Klump, Hans-Jürgen Goebelbecker, Jens Gundlach, Peter Schirmbacher, and Uwe Dierolf. 2013. “Making Research Data Repositories Visible: The Re3data.org Registry.” PLOS ONE 8 (11): e78080. https://doi.org/10.1371/journal.pone.0078080.
Pernet, Cyril, Claus Svarer, Ross Blair, John D. Van Horn, and Russell A. Poldrack. 2023. “On the Long-Term Archiving of Research Data.” Neuroinformatics 21 (2): 243–46. https://doi.org/10.1007/s12021-023-09621-x.
Ram, Karthik. 2013. “Git Can Facilitate Greater Reproducibility and Increased Transparency in Science.” Source Code for Biology and Medicine 8 (1): 7. https://doi.org/10.1186/1751-0473-8-7.