Sharing and Publishing Data
Long-term preservation and interoperability of the data
File formats
The choice of a format will determine how data can be accessed throughout its lifecycle, including sharing and reuse. In general, we need to consider long-term preservation when choosing a format. Files in formats with the following features will be more likely to be accessible in the future:
Non-proprietary
Open, documented standard
Popular format
Standard representation
Unencrypted
Uncompressed
We recommend selecting file formats based on these features and following existing guidelines for general and domain-specific file formats. The section below summarizes the file formats recommended by various institutions.
File formats in the general domain
There are many institutions providing recommendations on file formats in general. This section shows a couple of representative examples.
Recommendations from the Faculty of Biology and Medicine, University of Lausanne (Medical Library):
| Category | Recommended Formats |
|---|---|
| Text | PDF/A – PDF/X, Plain text (.txt), Open Office (.odt), XML / HTML (with schema), Word XML (.docx), RTF, LaTeX |
| Images | Bitmap: TIFF (uncompressed), PNG, JPEG2000, (GIF). Vector: SVG |
| Tabular Data | CSV (comma, tab, semi-colon), Open Office (.ods), XML / HTML (with schema), Excel (.xlsx), .SQL |
| Video | MPEG-4 (H.264) (~ MP4), Motion JPEG 2000, MPEG-1/2 |
| Audio | WAV (preferably Broadcast Wave Format, LPCM), AIFF (LPCM), OGG Vorbis, MP3 (MPEG Layer III), AAC (MPEG-4) |
Recommendations from the open data publishing platform (Dryad):
| Category | Recommended formats |
|---|---|
| README files | Markdown (MD) or text (TXT) |
| Tabular data | Comma-separated values (CSV) |
| Non-tabular data | Semi-structured plain text (e.g., protein sequences) |
| Structured plain text | XML, JSON (e.g., metadata companion files) |
| Images | PDF, JPEG, PNG, TIFF, SVG |
| Audio | FLAC, AIFF, WAV, MP3, OGG |
| Video | AVI, MPEG, MP4 |
| Compressed file archives | TAR.GZ, 7Z, ZIP |
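As a minimal illustration of the recommendations above, the following Python sketch writes a small, hypothetical participants table in the open CSV format using only the standard library (file name and values are examples only):

```python
import csv

# Hypothetical tabular data describing study participants
rows = [
    {"subject_id": "sub-01", "age": 24, "group": "control"},
    {"subject_id": "sub-02", "age": 31, "group": "patient"},
]

# Write the rows to a CSV file (comma-separated, with a header line)
with open("participants.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject_id", "age", "group"])
    writer.writeheader()
    writer.writerows(rows)
```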
Interoperable formats in neuroimaging
There are many modalities of neuroimaging research and therefore many different file formats used in this field. This section covers some of the most popular formats in neuroimaging. The Brain Imaging Data Structure (BIDS) standard also provides recommendations for some of these formats across the different data modalities.
Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Positron Emission Tomography (PET) and similar
Neuroimaging Informatics Technology Initiative (NIfTI) is an open file format commonly used to store imaging data obtained with magnetic resonance imaging methods. The associated file extensions are .nii, .nii.gz and .hdr/.img.
Digital Imaging and Communications in Medicine (DICOM) is a technical standard for digital storage and transmission of biomedical images. It was developed in the 1980s and is widely adopted by hospitals and medical software today. The file extension is usually .dcm.
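As a minimal illustration (not part of any standard), the following Python sketch reads both formats, assuming the third-party packages nibabel and pydicom are installed and that example.nii.gz and example.dcm are hypothetical local files:

```python
import nibabel as nib   # third-party package for NIfTI files
import pydicom          # third-party package for DICOM files

# NIfTI: load a (hypothetical) image and inspect basic properties
img = nib.load("example.nii.gz")
print(img.shape)         # voxel grid dimensions
print(img.affine)        # voxel-to-world coordinate transform
data = img.get_fdata()   # image data as a NumPy array

# DICOM: read a (hypothetical) single file and inspect standard header fields
ds = pydicom.dcmread("example.dcm")
print(ds.Modality, ds.get("SeriesDescription", "n/a"))
```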
Microscopy
OME data model. The OME Model is a specification from the Open Microscopy Environment for storing biological imaging data. The model includes parameters and extensive metadata. OME-XML is a file format used to store data according to the model. OME-TIFF is a multi-plane TIFF file with OME metadata in the header (as OME-XML). See more details in their publication (Goldberg et al. 2005). Further, OME-ZARR (sometimes referred to as OME-NGFF or NGFF) is a cloud-optimized format developed to improve access and storage for large data (Moore et al. 2023).
In BIDS (v1.10.0) microscopy raw data MUST be stored in one of the following formats:
Portable Network Graphics (.png)
Tag Image File Format (.tif)
OME-TIFF (.ome.tif for standard TIFF files or .ome.btf for BigTIFF files)
OME-ZARR/NGFF (.ome.zarr directories)
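As an illustration only, the following Python sketch reads an OME-TIFF file, assuming the third-party package tifffile is installed and that example.ome.tif is a hypothetical local file; the exact attribute names should be checked against the tifffile documentation:

```python
import tifffile  # third-party package for reading TIFF/OME-TIFF files

# Open a (hypothetical) OME-TIFF file and access pixels and embedded OME-XML metadata
with tifffile.TiffFile("example.ome.tif") as tif:
    data = tif.asarray()        # pixel data as a NumPy array
    ome_xml = tif.ome_metadata  # OME-XML string stored in the TIFF header, if present

print(data.shape)
if ome_xml is not None:
    print(ome_xml[:200])        # first characters of the OME-XML metadata
```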
Electroencephalography (EEG) and intracranial electroencephalography (iEEG)
There is no single standard. The recommendations from BIDS for raw data are the following:
European Data Format (.edf). Each recording is a single file; edf+ files are permitted. The capital .EDF extension MUST NOT be used.
BrainVision Core Data Format (.vhdr, .vmrk, .eeg). Each recording consists of a file triplet.
EEGLAB (.set, .fdt). The format used by the MATLAB toolbox EEGLAB. Each recording consists of a .set file with an OPTIONAL .fdt file.
Neurodata Without Borders (.nwb). For iEEG data only. Each recording consists of a single .nwb file.
MEF3 (.mefd). For iEEG data only. Each recording consists of a .mefd directory.
Biosemi (.bdf). For EEG data only. Each recording consists of a single .bdf file; bdf+ files are permitted. The capital .BDF extension MUST NOT be used.
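As a minimal, hedged illustration of how such recordings can be read, the following Python sketch uses the third-party package MNE-Python (an assumption, not mandated by BIDS) with hypothetical file names:

```python
import mne  # third-party package for EEG/MEG analysis

# European Data Format: each recording is a single (hypothetical) .edf file
raw_edf = mne.io.read_raw_edf("recording.edf", preload=False)

# BrainVision: pass the .vhdr header file; the .vmrk and .eeg files of the
# triplet are located automatically
raw_bv = mne.io.read_raw_brainvision("recording.vhdr", preload=False)

print(raw_edf.info["sfreq"])   # sampling frequency
print(raw_edf.ch_names[:5])    # first channel names
```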
Metadata, markup and data serialization
Markup languages specify the structure and formatting of a document. Data serialization languages are used to create plain-text configuration files that are compatible across systems.
JavaScript Object Notation (JSON) (.json). In the BIDS specification, JSON files are primarily used as “sidecar” files accompanying the data files. There are also a few special cases of JSON files being first-order data files, such as genetic_info.json (see the sketch after this list).
Extensible Markup Language (XML) (.xml) is a markup language and file format for storing, transmitting, and reconstructing data.
YAML Ain’t Markup Language (.yaml, .yml) is a human-friendly data serialization language for all programming languages which is often used for configuration files.
Markdown (.md), a lightweight markup language with simple syntax that adds formatting elements to plain text. This format is widely used for technical documentation online.
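As a minimal illustration, the sketch below writes and reads a hypothetical BIDS-style JSON sidecar using only the Python standard library (the file name and field values are examples only):

```python
import json

# Hypothetical BIDS-style sidecar content for a functional MRI run
sidecar = {
    "TaskName": "rest",
    "RepetitionTime": 2.0,
}

# Write the sidecar next to the data file it describes
with open("sub-01_task-rest_bold.json", "w") as f:
    json.dump(sidecar, f, indent=4)

# Read it back and access a field
with open("sub-01_task-rest_bold.json") as f:
    print(json.load(f)["RepetitionTime"])
```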
File format conversion
There are multiple reasons why researchers often have to convert files to a different format. For example, imaging acquisition hardware may generate files in a proprietary format, which researchers then convert to a standardized format like NIfTI in order to analyze, share and publish the images in a more universally interoperable format. File compression is also often done to facilitate sharing.
These are some important considerations when converting files:
Quality should be preserved: there should be no data loss (e.g., lossy compression methods where some data is lost may have an impact when reusing the data) or data corruption (data that is still present but unreadable).
When possible, the source and converted files should be kept, in case errors in the process were to be found in the future.
Conversion is best done automatically with a script that logs any conversion parameters (see the sketch after this list).
Long-term preservation and reusability should be considered when choosing the conversion format.
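As a hedged sketch of a scripted, logged conversion, the example below wraps the dcm2niix DICOM-to-NIfTI converter (assumed to be installed and on the PATH; directory names and flags are illustrative and should be checked against the converter's documentation):

```python
import logging
import os
import subprocess

logging.basicConfig(filename="conversion.log", level=logging.INFO)

os.makedirs("nifti_out", exist_ok=True)  # output directory for converted files

# Command-line call to the converter; flags shown here are assumptions and
# should be verified against the dcm2niix documentation
cmd = [
    "dcm2niix",
    "-z", "y",          # write compressed .nii.gz output
    "-o", "nifti_out",  # output directory
    "dicom_in",         # input directory with the source DICOM files
]

logging.info("Conversion command: %s", " ".join(cmd))
result = subprocess.run(cmd, capture_output=True, text=True)
logging.info("Converter output:\n%s", result.stdout)
# Keep the source DICOM files alongside the converted NIfTI files.
```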
Version control
Version control refers to the record of changes made in a file or set of files over time. The record should allow collaborators to track history, to review and to revert changes. Versioning refers to the management of changes in a file or project (read more in https://book.the-turing-way.org/reproducible-research/vcs).
Versioning is very important for scientific reproducibility and transparency, as it allows following provenance information, i.e., the record of how results are derived from data. Data and code evolve over time, and it can become very difficult to track which version of them generated a figure or a results table in a publication.
We recommend adopting Git, introduced below, to apply version control in documents, data and code. We suggest beginning with simple workflows; there is no need to master all of Git’s advanced features right away.
Version control in documents
There are many widely used tools for version control and collaborative editing of text documents, for example MS Word, Google Docs and Overleaf. These tools have some limitations that need to be considered:
They require a master document (a single source document where we edit changes)
They do not scale well to a large number of collaborators
These tools do not show the complete history of changes
The changes and master document are stored in a company’s cloud
They only work for specific types of documents
In view of these limitations, these tools cannot provide a reliable and manageable version control system for many research outputs like metadata, documentation, tables and especially code.
Git version control system for data science
Git is a version control system that tracks changes in files, including timestamps. It is the de facto standard in code development and is increasingly used in research and data science in general. It is open source and was originally developed by Linus Torvalds in 2005. These are some characteristics of the Git system:
It is known to be fast, efficient, and reliable
Runs on all major operating systems
It is integrated in most IDEs (Integrated Development Environments): RStudio, Visual Studio Code, Eclipse, PyCharm, Spyder, MATLAB®…
Git is a distributed version control system (DVCS), which means collaborators create local copies of a project, make changes locally and then commit those changes to a remote repository (see Figure 1)
Main concepts in Git
Repository: A collection of files, typically organized as a project, managed with version control.
Commit: The action of recording changes in a repository. It is like a snapshot of the entire repository at a given time.
History: A registry (“log”) and collection of all the snapshots (“commits”) of a repository, allowing us to revert changes.
Platforms to host Git repositories online
GitLab: An open-source web-based tool that provides a Git repository manager (www.gitlab.com). Many institutions and organizations have their own GitLab instance (e.g., UZH: https://gitlab.uzh.ch/).
GitHub: Another provider, owned by Microsoft (potential data governance issues) (www.github.com).
An overview of the Git system
The main areas and commands in Git are summarized in Figure 2.
Git ‘areas’:
Remote repository: the online repository with all files and history of changes, hosted in platforms like Gitlab or Github.
Workspace: this is the “visible” folder. In most cases this is a local folder with a copy of the online repository (including history of changes).
Index: (not directly visible) is a staging area used to prepare changes before committing them to the project history
Local repository: (usually a hidden .git folder) is a copy of the entire project’s history and codebase
Main actions:
Pull: Pull changes from the remote to update the local copy.
Commit & Push: Commit and push local changes into the remote.
Revert: Revert changes to a previous commit.
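As a minimal illustration of this cycle, the sketch below wraps the standard git commands with Python's subprocess module; the file path and commit message are hypothetical, and in practice these commands are usually typed directly in a terminal or run from an IDE:

```python
import subprocess

def git(*args):
    """Run a git command in the current workspace and raise on failure."""
    subprocess.run(["git", *args], check=True)

git("pull")                                  # update the local copy from the remote
git("add", "scripts/run_model_AB.r")         # stage a changed (hypothetical) file in the index
git("commit", "-m", "Update model script")   # record a snapshot in the local repository
git("push")                                  # publish the commit to the remote repository
```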
Workflows
Git can be used in the context of complex application development and in smaller settings. Although there may be a steep learning curve at the beginning, scientists can gradually incorporate Git into their routine workflow and it can be kept relatively simple without needing to use all the advanced features.
- Branches: a repository can have multiple ‘branches’, which allows working in parallel. The same user or multiple collaborators can work on parallel versions of the repository. Some of the changes from the branches (e.g., new features into the main code) can be eventually merged into the ‘main’ code branch.
Publishing and sharing
- Releases. The content of a Git repository changes frequently. When scientists publish the code associated with a publication, the version of the repository with the files involved in that published analysis also needs to be shared. That can be done by publishing a GitLab release, which is a snapshot of a repository with a version tag.
- Persistent identifiers. Sharing the URL to a GitLab or GitHub repository associated with a publication is not recommended: the URL may change over time and the code version become unavailable. It is recommended to publish a Git repository release in a platform that allows assigning a persistent identifier, for example a digital object identifier (DOI), so that the repository becomes a citable resource that can be found in the future. Platforms like Zenodo, for example, offer direct integration with GitHub, assigning a DOI to every release of the repository. Another way is to simply upload to the platform a .zip archive of all the files in the repository (the ‘snapshot’ created when making a release) together with a version tag.
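As a minimal sketch of the first step (creating the version-tagged snapshot), the example below creates and pushes an annotated Git tag via Python's subprocess module; the tag name is hypothetical, and the release and DOI assignment are then done on the hosting platform (e.g., GitLab, GitHub, Zenodo):

```python
import subprocess

# Create an annotated tag marking the snapshot used for the publication,
# then push it to the remote so a release can be created from it
subprocess.run(["git", "tag", "-a", "v1.0.0", "-m", "Version used for the publication"], check=True)
subprocess.run(["git", "push", "origin", "v1.0.0"], check=True)
```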
Best practices
Setting up a repository
“Documentation is a love letter that you write to your future self.”
Add a README file in the main folder (root folder) and in other folders if needed.
Use a naming convention that balances concise and informative names. Avoid overly cryptic or ambiguous folder or file names. For example:
run_model_AB.r instead of run_scripts.r
Code writing
Consider numbering scripts that need to be run sequentially.
Use relative paths.
Log software libraries and system dependencies, e.g., sessionInfo() in R, and hardware when applicable (see the sketch after this list).
Scripts should include concise comments.
Consider scripts to generate reports (dynamic reporting).
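As a minimal Python analogue of sessionInfo() (the package name numpy is only an example), the sketch below logs interpreter, system and package versions using the standard library:

```python
import platform
import sys
from importlib import metadata

# Record interpreter, operating system and package versions alongside the results
print("Python:", sys.version)
print("System:", platform.platform())
print("numpy:", metadata.version("numpy"))  # repeat for each package the analysis uses
```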
Sharing and Publishing
Describe the project in the main README file (dates, project ID, funding, affiliations, and DOI).
When the repository contains the code associated with a publication, include the DOI of the publication in the main README file.
Create a release, i.e., take a snapshot with a version tag.
Assign a DOI to the release.
Include a license file to clarify reuse terms (agree with the PI; e.g., GPL, MIT).
Specify the location of the source data.
Selected resources on Git
Article on Git and reproducibility and transparency in science: Ram (2013): https://doi.org/10.1186/1751-0473-8-7.
Primer on digital collaboration from the Center for Reproducible Science, Hofmann et al. (2023): https://doi.org/10.5281/ZENODO.8354375.
Reproducibility in Practice: Version Control and Dynamic Reporting - workshop materials Fraga Gonzalez (n.d.): https://doi.org/10.5281/ZENODO.14754696.
Archives and online repositories
Making research data available to others for reuse is essential to scientific progress and the main purpose of archiving open research data (ORD). This is recognized in the funding regulations of the Swiss National Science Foundation (SNSF).
We recommend planning early for the archiving of research outputs: deciding what will be archived, how, when, and where. At a minimum, a data archive should include the raw data as well as any relevant derivatives, metadata files and documentation, and, when applicable, any code involved in processing the data or making it usable. Without the adequate context provided by metadata and documentation, archiving data results in a data dump of little scientific value.
Types of data archives
Data temperature by frequency of access
We can broadly classify data archives by how frequently we expect the data to be accessed. Table 1 summarizes different ‘temperatures’ of the data defined by access frequency (Pernet et al. 2023). Irrespective of the temperature, any data archiving should comply with the pertinent legal governance requirements. The contents and steps described in this primer can apply to any type of archive, but the presented use case fits into the warm archive category. It should be noted that an archive does not necessarily make data publicly available immediately: most repositories allow restricting access, for example by requiring an application or applying an embargo until a publication is released.
| Feature \ Temperature | Cold | Warm | Hot |
|---|---|---|---|
| Access frequency | Very low | Regular | Constant |
| Back-ups | No | Yes | Yes |
| Costs | Cheap | Expensive | Expensive |
| Content | All | High-utility | High-utility |
| Duration | Long | Long | Short/Medium |
| Medium | Tape | Disk/Server/Cloud | Disk/Server/Cloud |
| Online | No | Yes | Yes |
Cold data are not expected to be accessed for a long time, but they should still be curated, unlike data considered disposable (Pernet et al. 2023). In some contexts, rarely accessed data can be essential. Collecting new data is usually much more expensive than archiving existing data, so it is generally advisable to cold-archive everything by default. Tapes are relatively cheap and robust to deterioration if stored in facilities that ensure data integrity. Cold data tend to involve larger storage volumes than warm and hot data, as they include all data, including that of limited utility. Therefore it is often not cost-effective to make cold data available online.
Warm data are expected to be accessed regularly and they are long-term archived in online servers or cloud repositories. Usually there are multiple back-up copies in physically separated locations to protect data integrity. Warm data are considered of high-utility for research reuse.
Hot data are accessed constantly for a specific period of time, for example, while a project is running to share it with collaborators. Hot data are stored in relatively expensive online servers or cloud repositories, and are more likely to move location than cold and warm data.
Open Research Data (ORD) online repositories
Open research data repositories aim at preserving data and making them freely accessible to anyone, promoting transparency and scientific collaboration. Online repositories are meant to host warm data that will be accessed; they are too expensive and not conceived for storing cold data that will remain unused for a long time. Nonetheless, a good online repository should offer security and guarantee that the data will be findable and accessible for a long period of time.
SNSF requirements and FAIR repositories
A key requirement for a repository is that it should provide a persistent identifier for the data and metadata and facilitate making the archived data “FAIR enough”. These can be determining factors when evaluating the suitability of a repository, although assessing its FAIRness may not be a straightforward task. For example, the Swiss National Science Foundation defined a set of minimal criteria for repositories to fulfill in order to be considered ‘FAIR enough’ (Milzow et al. 2020). The criteria in that report were:
Globally unique and persistent identifiers are attributed to data sets (e.g. DOI)
Upload of intrinsic (e.g. author’s name, content of dataset, associated publication, etc.) and submitter-defined (e.g. definition of variable names, etc.) metadata possible
Reuse defined (e.g. Creative Commons, Open Data Commons, etc.)
Citation information and metadata always (even in the case of datasets with restricted access) publicly accessible
Intrinsic metadata submitted via standardized template/mask (to ensure machine readability and interoperability)
Long-term preservation plan for the archived data in place
Initiatives like the CoreTrustSeal make these criteria more explicit by defining and providing certifications based on how well repositories can ensure long-term storage and accessibility of data. These certifications also try to integrate FAIR-enabling assessments, taking into account repository features that help publishing FAIRer data.
Finding a repository
The SNSF requires researchers to deposit data associated with a project’s publications in public repositories. Researchers are therefore encouraged to think about the field-specific options available in the early stages of project planning:
- The SNSF provides an overview of data repositories and links to relevant institutional, generalist and discipline-specific repositories.
- The website https://www.re3data.org/ (Pampel et al. 2013), also listed on the SNSF page, allows exploring the available options for a variety of domains. This database allows browsing by subject, content type and country, as well as other filters (e.g., versioning, metadata standards, data licenses and many others).
Popular generalist repositories
Open Science Framework (OSF). OSF is a free and open-source software platform to facilitate open collaboration. It is multidisciplinary; besides DOIs, it offers synchronization with Git platforms, and it has its own platform for preprints and study registrations. The default storage location is the United States, but it allows choosing other locations from those available, including Canada, Germany, and Australia.
Zenodo. Zenodo is an Open Science platform built and developed by researchers, and hosted, developed, and operated by OpenAIRE and CERN (Conseil Européen pour la Recherche Nucléaire). It is free and accommodates any type of research data. It offers DOI versioning and direct integration with GitHub. It also offers the possibility of creating communities.
Repositories of special interest for preclinical neuroimaging
EBRAINS. EBRAINS is a platform and repository created as a core aim of the EU-funded Human Brain Project to promote advancements in scientific and industrial research in neuroscience, computing, and brain-related medicine. It gathers a broad range of data and tools, including computing resources, and is designed to accommodate large volumes of data. The platform offers EBRAINS Curation Services, accessible via a request form.
Image Data Resource (IDR) is a public repository of reference (microscopy) image datasets from published studies. It is led by PIs from the University of Dundee and the European Molecular Biology Laboratory - European Bioinformatics Institute. The accepted images must be reference images according to the criteria of the Euro-BioImaging - Elixir Image Data Strategy. For other images, the IDR website points to the BioStudies and Dryad repositories. Metadata guidelines are provided together with examples and templates of annotations.
OpenNeuro. OpenNeuro is the successor of the openfMRI repository, from which the Brain Imaging Data Structure (BIDS) guidelines emerged. It offers DOIs for dataset snapshots (versions), and although it has a strong focus on human neuroimaging, animal imaging datasets are also published there. It is focused exclusively on neuroimaging data of different modalities (MRI, PET, MEG, EEG, and iEEG). It is free, maintained by the Stanford Center for Reproducible Neuroscience, and endorsed by the NIH as a data-sharing resource.


