Environmental impact of data management and openness
Data and open science
Open science, i.e. free and open access to research outputs (publications, data, protocols, source code, software), has progressively become an integral part of European calls for proposals (H2020, then Horizon Europe), based on the principle of "as open as possible, as closed as necessary".
This approach is also reflected in the calls for proposals issued by the French National Research Agency (ANR), in application of the French law for a Digital Republic and the National Plan for Open Science. Opening up research data is therefore not only a legal obligation, but also the result of a national and international political will to promote transparency.
Data is a crucial issue for higher education establishments and other organizations. Beyond the strategic importance of data storage and the costs inherent in securing it, the environmental impact generated by data must be taken into account at every phase of its lifecycle: use, backup, archiving, deletion and restoration.
In addition to the issues linked to data volumes, research data is fragmented and heterogeneous in several respects: its nature (biological, environmental, social, geographical, etc.), its format (text, numerical, audio, images), its dispersion across storage media (laptops, desktops, removable media such as hard disks and USB sticks, clouds, etc.), its quality and reliability, its accessibility, and its value, which depends both on the intended use of the data and on historical, social and geographical circumstances.
Managing research data according to the FAIR (Findable, Accessible, Interoperable and Reusable) principles helps to ensure that data is preserved under optimized conditions of interoperability and reusability. Nevertheless, applying the FAIR principles can also require significant IT resources for data preservation and dissemination, with major drawbacks for the environment. These include:
- Exponential growth in digital data volumes, driven by the increase in data production capacity that technological advances bring.
- The risk of saving "anything and everything" on the assumption that "it may or may not be useful", without any prior thought.
Ecological best practices for data management
To reduce the impact of data management on the environment, there are several ways of improving practices throughout the data life cycle:
1. Reduce the impact of data acquisition:
- Question the usefulness of producing a given dataset. Start by checking whether equivalent data, accompanied by rich metadata, already exists in a repository and could be reused.
- Favor harvesting and reusing existing data: this avoids duplicated and redundant datasets, and therefore reduces data volume.
- If no reusable data exists for the project and data collection is unavoidable, use low-tech, reusable sensors to collect or produce the data.
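The point about avoiding duplicated datasets can be illustrated locally. The following is a minimal sketch, not a prescribed tool: it groups byte-identical files in a directory tree by comparing SHA-256 digests. The function names `file_digest` and `find_duplicates` are hypothetical, and the sketch only detects exact copies, not datasets that are equivalent but encoded differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group the files under `root` whose content is byte-identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only the digests shared by at least two files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicates(Path("my_project_data"))` (a hypothetical directory) returns one list per group of identical files, which can then be reviewed by hand before anything is deleted.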
2. Reduce the impact of data processing, analysis and distribution:
- Reduce the "physical" distance involved in data handling as much as possible, so that data processing takes place as close as possible to the storage location.
- Choose open file formats that take up less space (e.g. .csv instead of .xls): a file holding the same data can be up to 10 times smaller in .csv than in .xls. In addition, limit the number of formats offered. Avoiding redundant files, allowing partial downloads and offering compressed files for download all reduce the environmental impact of data storage and distribution, which are the primary negative externalities of data because they require infrastructure (networks and data centers) and user terminals.
- Avoid sending attachments to share data: prefer a link to the source of the files (link to a cloud or a file-sharing/sending tool such as FileSender).
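The benefit of offering compressed files can be estimated before publishing a dataset. The sketch below, written under stated assumptions, serializes a table to CSV and compares its size with a gzip-compressed copy using only the Python standard library. The sample rows and the helper names `csv_bytes` and `compression_report` are hypothetical, and the ratio obtained depends entirely on how repetitive the data is.

```python
import csv
import gzip
import io


def csv_bytes(rows):
    """Serialize rows (lists of values) to UTF-8 encoded CSV bytes."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode("utf-8")


def compression_report(rows):
    """Compare the size of a CSV export with its gzip-compressed form."""
    raw = csv_bytes(rows)
    packed = gzip.compress(raw)
    return {
        "csv_bytes": len(raw),
        "gzip_bytes": len(packed),
        "ratio": len(raw) / len(packed),
    }


# Hypothetical sample: repeated temperature readings from one station.
header = ["station", "date", "temperature_c"]
readings = [["S1", f"2023-01-{day:02d}", "12.5"] for day in range(1, 29)]
rows = [header] + readings * 50

report = compression_report(rows)
print(report)
```

Real measurements are usually less repetitive than this sample, so the gain should be checked on the actual dataset rather than assumed.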
3. Reduce the impact of data storage and backup:
- Thinking about how the data produced will be used makes it possible to decide whether a given dataset should be kept "hot" or "cold", or kept at all. "Hot data" is data that is regularly requested and used, and should be stored on fast, immediately accessible media (hot storage): local network, synchronized cloud, or internal PC hard drives, provided a rigorous backup policy is in place. On the other hand, data that is only accessed occasionally, known as "lukewarm" or even "cold" data, should be stored on slower, less energy-intensive media (cold storage): hard disks, magnetic tapes, optical disks (CDs), non-synchronized cloud or data centers, etc. Cold storage thus covers short-term backup and recovery as well as long-term archiving of data that is rarely used or no longer needed.
- Pool services to create (preferably local) "cold storage" infrastructures for rarely used data (see above), accessible on demand within reasonable and acceptable retrieval times. AgroDataRing, for example, is a shared infrastructure for long-term storage.
4. Reduce the impact of data archiving:
- Not all data needs to be archived. It is important to sort the data and keep only what is deemed relevant, on the basis of criteria such as scientific value recognized by the community, evidential value for publications, legal interest, historical (heritage) interest, non-reproducibility, and intelligibility of the data (thanks to rigorous documentation). In some cases, it may be appropriate and sufficient to archive a sample of the dataset in order to limit its volume.
- Always favor open and durable formats to ensure long-term archiving and reusability of data.
- The Centre Informatique National de l'Enseignement Supérieur (CINES), the Data Terra research infrastructure or the TGIR Huma-Num, for example, can support you in this archiving process, while ensuring data reusability.
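Where archiving a sample of a dataset is judged sufficient, the sample should be reproducible so that the selection can be documented. The sketch below is one possible approach, assuming a tabular dataset whose first row is a header; the function name `sample_rows`, the fraction and the seed are hypothetical choices, and simple random sampling may not suit every discipline.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return the header plus a reproducible random subset of the data rows.

    The selected rows keep their original order. The same seed always
    yields the same selection for the same input.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    header, data = rows[0], rows[1:]
    if not data:
        return [header]
    count = max(1, round(len(data) * fraction))
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(data)), count))
    return [header] + [data[index] for index in kept]


# Hypothetical dataset: 100 measurements, of which 10 % are archived.
table = [["id", "value"]] + [[i, i * 2] for i in range(100)]
archived = sample_rows(table, fraction=0.1, seed=42)
```

Recording the fraction and the seed alongside the archived file keeps the sampling step documented and repeatable.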
NB: All these eco-responsible best practices must be considered and implemented in accordance with the FAIR principles.
References:
1- Didier Mallarino, Sylvie Le Bras, Cyrille Bonamy. Les impacts environnementaux et sociétaux des données : un défi pour l’avenir. Congrès JRES : Les Journées Réseaux de l’Enseignement et de la Recherche, RENATER, May 2022, Marseille, France. HAL Id: hal-03702208; https://hal.archives-ouvertes.fr/hal-03702
2- Christine Hadrossek, Joanna Janik, Maurice Libes, Violaine Louvet, Marie-Claude Quidoz, et al. Guide de bonnes pratiques sur la gestion des données de la Recherche. 2023. HAL Id: hal-03152732; https://hal.science/hal-03152732v2
3- https://ecoresponsable.numerique.gouv.fr/docs/2022/guide-de-bonnes-prat…
4- https://opendatafrance.gitbook.io/greendata-pour-un-impact-maitrise-des…
5- https://www.cnil.fr/sites/cnil/files/atoms/files/guide_durees_de_conser…