SAEON ODP Preservation Policy
Purpose
The purpose of this document is to describe the data preservation framework governing the operations of the SAEON Open Data Platform (ODP). This policy covers all data and metadata archived and published in the ODP with the exception of community portals hosted on behalf of stakeholders external to SAEON that have different preservation requirements.
The principles of this policy are informed by:
- Trusted Digital Repository Standards and Frameworks
- ISO 16363
- Trustworthy Repositories Audit & Certification (TRAC)
- Reference Model for an Open Archival Information System (OAIS)
- FAIR Data Principles
This policy may be revised if the framework governing the SAEON ODP changes.
Definitions
AIP - Archival Information Package: This is a package containing data and the metadata that describes it. It is created by a data curator from the Submission Information Package (SIP) supplied by the data provider, after any necessary format migrations have been applied and any additional information has been added to the metadata.
DOI - Digital Object Identifier: A Digital Object Identifier (DOI) is a unique persistent identifier assigned to an object. This links to the metadata record for the object as well as to a digital location, where details about the object can be found. SAEON makes use of DataCite’s DOI system.
Data curators: The uLwazi team members responsible for the management of data throughout its lifecycle, from ingestion to dissemination and long-term archiving.
Data providers: The people and organisations who are submitting data to be archived and published in the SAEON ODP.
DSI - Department of Science and Innovation: A South African government department whose mission is to provide leadership, an enabling environment, and resources for science, technology and innovation in support of South Africa’s development.
FAIR - Findable, Accessible, Interoperable, Reusable Data Principles: These are a set of principles that describe how to make data Findable, Accessible, Interoperable and Reusable (FAIR). The principles were created through a collaborative effort of stakeholders representing academia, industry, funding agencies and scholarly publishers.
NRF - National Research Foundation: The mandate of the National Research Foundation (NRF) is to promote and support research through funding, human resource development and the provision of the necessary research facilities in order to facilitate the creation of knowledge, innovation and development in all fields of science and technology, including indigenous knowledge, and thereby contribute to the improvement of the quality of life of all South Africans.
OAIS - Open Archival Information System Reference Model: SAEON makes use of the Reference Model for an Open Archival Information System (OAIS), developed by the Consultative Committee for Space Data Systems (CCSDS), as a best-practice standard to work towards.
QA - Quality Assurance: In this context quality assurance is performed by data curators to ensure that the SIPs provided by the data providers contain data that falls within SAEON’s collection policy in an acceptable format and that sufficient metadata has been provided to describe the data.
QC - Quality Control: Quality control is performed by data curators to check that all the necessary quality assurance steps were taken.
SAEON - South African Environmental Observation Network: The South African Environmental Observation Network (SAEON) is a business unit of the NRF and serves as a national platform for detecting, translating and predicting environmental change through scientifically designed observation systems and research. SAEON also captures and makes long-term datasets freely accessible, and runs an education outreach programme. SAEON has six nodes dispersed geographically across the country.
SAEON ODP - SAEON Open Data Platform: The Open Data Platform is SAEON’s system of systems that includes a number of data and metadata infrastructures and community portals that are customised for particular stakeholder communities.
SLA - Service Level Agreement: These detail the agreements between SAEON and its stakeholders and define the roles and responsibilities of both parties, the service levels and issue resolution procedures and the duration of the agreements.
SIP - Submission Information Package: This is the package of data and metadata that the data provider sends to SAEON for archiving and publishing.
TRAC - Trustworthy Repositories Audit & Certification: The Trustworthy Repositories Audit & Certification was created through an international collaboration. It provides tools for the audit, assessment and potential certification of digital repositories, establishes the documentation required for audit, delineates a process for certification, and establishes appropriate methodologies for determining the soundness and sustainability of digital repositories.
uLwazi: The uLwazi node is one of the seven nodes of the South African Environmental Observation Network (SAEON). uLwazi means ‘knowledge’ in Nguni languages. The SAEON uLwazi node is made up of four teams, Infrastructure Management, Systems Development, Data Curation and Data Science, which provide infrastructure and support, development, curation and data science services to SAEON and external organisations. uLwazi hosts and develops data systems, which are distributed on an Open Data Platform (ODP), and online tools for research data infrastructure and associated decision making.
Background Information on the SAEON Open Data Platform (ODP)
Mission and Organisational Mandate
SAEON is a sustained, coordinated, responsive and comprehensive in situ Earth observation network that delivers long-term reliable data for scientific research and informs decision-making for a knowledge society and improved quality of life.
SAEON received a portfolio of funding from the Department of Science and Innovation (DSI) to preserve and provide access to earth and environmental observation data for South Africa, and so archives any publicly funded or open data captured in this domain.
uLwazi is the node within SAEON that is responsible for acquiring, enhancing, storing, maintaining and disseminating the data and is made up of four teams focused on IT infrastructure management, systems development, data curation and data science. The uLwazi node also runs various data related projects for government stakeholders and so continuously works on sourcing datasets that can supplement decision support and policy making in areas relevant to these projects.
Data Infrastructure
The ODP is SAEON’s overall research data infrastructure that includes a number of data and metadata collections and portals that are customised for particular stakeholder communities.
Data Collection Policy
The datasets hosted by SAEON include spatial data, multidimensional data, time series data and general digital objects and media data for the earth and environmental observation domain.
Appraisal and selection of data
Once a data submission has been verified as compliant with the SAEON Data Policy, the data curation team creates a Submission Information Package (SIP) consisting of the data and metadata submitted by the data provider and uploads it to the SAEON ODP file repository.
Once the SIP has been created, Quality Assurance (QA) and selection of an appropriate data store are executed. Available data store options depend on the data formats. These are:
- Geospatial databases and servers - for vector and raster spatial datasets.
- THREDDS/OPeNDAP - for multidimensional datasets (NetCDF).
- Relational database management systems - for time series observations.
- File system (for managing text files, images, video and audio) - for any other digital object or unstructured data.
At this stage the SIP may be reassigned to a data curator with the relevant domain expertise. The purpose of this QA step is to check if any information is needed from the data provider prior to publishing the dataset. If no further data management actions are required, an Archival Information Package (AIP) is generated and passed on to another data curator for Quality Control (QC) and publication.
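The choice of data store described above can be sketched as a simple lookup. This is only an illustrative sketch: the `DataKind` categories, the `select_data_store` function and the store labels are hypothetical names introduced here, not part of the ODP software.

```python
from enum import Enum


class DataKind(Enum):
    """Hypothetical categories mirroring the data store options listed above."""
    VECTOR = "vector"
    RASTER = "raster"
    MULTIDIMENSIONAL = "multidimensional"
    TIME_SERIES = "time_series"
    OTHER = "other"


# Assumed labels for the four data store options named in this policy.
DATA_STORES = {
    DataKind.VECTOR: "Geospatial databases and servers",
    DataKind.RASTER: "Geospatial databases and servers",
    DataKind.MULTIDIMENSIONAL: "THREDDS/OPeNDAP",
    DataKind.TIME_SERIES: "Relational database management systems",
}


def select_data_store(kind: DataKind) -> str:
    """Return the data store for a kind of data; anything else goes to the file system."""
    return DATA_STORES.get(kind, "File system")


print(select_data_store(DataKind.MULTIDIMENSIONAL))
print(select_data_store(DataKind.OTHER))
```

In practice a data curator makes this decision during QA; the lookup only records the rule that the data format determines the store.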
The Archival Information Package (AIP) generation is initiated by uploading the data in the SIP into the correct data store. If the data are not in the correct format for long-term preservation, additional curation steps such as format migration, further updates to metadata and additional quality assurance are added to the workflow. See Table 1 below for preferred preservation file formats.
Table 1: Recommended file formats for long-term data preservation
Data Type | Recommended File Formats |
Documents | Plain text (.txt), PDF (.pdf) |
Tabular data | Comma separated values (.csv) |
Geospatial data | Shapefile (.shp, including .shx and .dbf), GeoTIFF (.tiff and .tfw) |
Multidimensional | NetCDF (.nc) |
Time series data | Relational database (SQL), Comma separated values (.csv), Plain text (.txt) |
Images | TIFF (.tiff) |
Audio | Wav (.wav) |
Video | Quicktime (.mov), Mpeg 4 (.mp4) |
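As an illustration of how Table 1 might be applied during AIP generation, the sketch below flags files whose extension is not among the recommended formats for their data type. The dictionary keys, the function name `needs_format_migration` and the example file names are assumptions made for this sketch; the extensions are taken from Table 1.

```python
from pathlib import Path

# File extensions from Table 1, keyed by hypothetical data type labels.
# Relational database (SQL) storage for time series has no file extension
# and is therefore not represented here.
PREFERRED_FORMATS = {
    "documents": {".txt", ".pdf"},
    "tabular": {".csv"},
    "geospatial": {".shp", ".shx", ".dbf", ".tiff", ".tfw"},
    "multidimensional": {".nc"},
    "time_series": {".csv", ".txt"},
    "images": {".tiff"},
    "audio": {".wav"},
    "video": {".mov", ".mp4"},
}


def needs_format_migration(filename: str, data_type: str) -> bool:
    """Return True if the file extension is not a recommended format for its data type."""
    extension = Path(filename).suffix.lower()
    return extension not in PREFERRED_FORMATS[data_type]


print(needs_format_migration("rainfall.xlsx", "tabular"))
print(needs_format_migration("rainfall.csv", "tabular"))
```

A file flagged in this way would follow the additional curation steps (format migration, metadata updates and further quality assurance) described above.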
Approaches towards data that do not fall within the mission/collection profile
If the data do not fall within the collection profile for a specific ODP repository, the data curators will check them against the criteria for SAEON’s other repositories. If the data do not fall within any of these collection profiles, then, once a curator has reviewed the intended data submission or SIP, the data provider will be informed that SAEON will not be able to archive or publish their data.
Archival Storage
Preservation levels
Once AIPs have been published, there are monitored systems in place to check for broken links, while changes to metadata records are recorded using the DataCite metadata schema related identifiers.
The SAEON ODP has three preservation levels, one of which is currently in use and two of which are available for future use. The first is online storage, which ensures that the data are immediately accessible. The second is nearline storage, in which data are moved to magnetic tape and could be made available within 24 hours. The third is offline storage. The latter two are currently not in use, but they may be needed as the repository expands. The access policy and workflow for these two preservation levels, as well as the migration policy for them, will be determined prior to their use.
Security and hosting
SAEON hosts the ODP at its Cape Town offices on IBM enterprise infrastructure, with virtual servers provided through VMware. Physical access to the server and network infrastructure is controlled with biometrics. This infrastructure is replicated at the National Research Foundation offices in Pretoria and synchronised daily. Internally, all SAEON servers and end user machines have Bitdefender deployed as protection against intrusions.
SAEON employs a multi-pronged security approach. Next-generation firewalls that utilise AI and smart technology add extra layers of protection to the network perimeter. Within the network, a siloed approach separates networks from one another and servers from one another. This minimises the attack surface and would isolate a compromise to contain any potential damage.
Constant vulnerability checks are performed against the latest identified vulnerabilities and remediations are put in place should any vulnerabilities be identified. The systems are also routinely checked by outside organisations to ensure that SAEON stays constantly informed on the integrity of its systems.
All client machines are closely monitored to ensure that they stay as secure as possible. Client firewalls are utilised to ensure that all client machines stay protected even while outside SAEON’s networks. VPNs are used on all client machines to ensure that they have secure access to SAEON networks while away from the office, and that they fall under SAEON’s umbrella of security while at home or travelling.
Data Management
Version management
The version control of the data is conducted through metadata using the DataCite schema, which makes use of related identifiers to link versions that are derived from the AIP, or which provide major or minor versions of the original AIP. Changes in datasets will trigger a major version whereas additional details about the dataset will trigger a minor version.
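The versioning rule above can be sketched as follows. This is a minimal sketch under stated assumptions: the "major.minor" numbering format, the function names and the example DOI are hypothetical, while `relatedIdentifier`, `relatedIdentifierType` and the `IsNewVersionOf` relation type are properties of the DataCite metadata schema.

```python
def next_version(current: str, data_changed: bool) -> str:
    """Bump an assumed 'major.minor' version string.

    A change to the dataset itself triggers a major version; additional
    details about the dataset trigger a minor version.
    """
    major, minor = (int(part) for part in current.split("."))
    if data_changed:
        return f"{major + 1}.0"
    return f"{major}.{minor + 1}"


def version_link(previous_doi: str) -> dict:
    """Build a DataCite related identifier pointing back to the previous version."""
    return {
        "relatedIdentifier": previous_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsNewVersionOf",
    }


print(next_version("1.2", data_changed=True))
print(next_version("1.2", data_changed=False))
print(version_link("10.1234/example"))  # hypothetical DOI
```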
Data transformation
All data providers are required to comply with the SAEON Data Policy, which grants SAEON staff the necessary rights to convert data archived in the ODP to new formats when the need arises. For institutional data providers there are Service Level Agreements (SLAs) in place that allow for format migration.
Data retention
SAEON is mandated to maintain and store all the data in the ODP indefinitely. Currently, copies of all the data managed in the ODP on behalf of data providers are available online. However, as the data holdings expand and storage space becomes more of a consideration, some data may be moved to less expensive nearline or offline media. All DOIs will be maintained for proper resolution to metadata landing pages and details on data migration will be provided in the metadata.
Data retention checklist
The checklist in Table 2, adapted from the Natural Environment Research Council (NERC) data value checklist, is intended to guide decision-making on data accessioning and data retention. If any of the legal considerations are applicable then the data must be accessioned and retained, and if any of the criteria in the other sections are applicable then the data should probably be accessioned and retained.
Table 2: Data retention checklist adapted from NERC (1)
Legal considerations | Yes | No |
Are there laws or legislation in place that dictate that the data should be retained? | ||
Are there any reasons the data might need to be kept for legal reasons that do not fall under laws or legislation e.g. to be used in litigation? | ||
Are there any contractual or financial agreements in place that require us to keep the data? | ||
Policy | ||
Was the data collection funded by SAEON, either fully or partially? | ||
Do the data fall within the SAEON ODP’s collection policy? | ||
Scientific or Historical Value | ||
Are the data a unique unrepeatable measurement of the environment? | ||
Do the data have a broad geographical or temporal extent that makes them useful to others? | ||
Do the data have historic value i.e. do they represent a landmark in scientific discovery? | ||
Do the data include changes in processing methods, new standards or set any precedents? | ||
Do the data support current projects or trends in science? | ||
Are the data likely to meet the future needs/direction of the scientific community? | ||
Do the data contribute to a pre-existing collection? | ||
Is there potential for re-use of the data? | ||
Is it likely that the data will be cited or referenced in a publication? |
(1. NERC. (n.d.). Data value checklist. [online] Available at: https://nerc.ukri.org/research/sites/data/policy/data-value-checklist/ [Accessed 20 Feb. 2020].)
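The decision rule attached to the checklist in Table 2 can be expressed as a short function. This is an illustrative sketch only: the function name, its arguments and the returned wording are assumptions, and the checklist is intended to guide a curator's judgement rather than replace it.

```python
from typing import Iterable


def retention_recommendation(legal_answers: Iterable[bool],
                             other_answers: Iterable[bool]) -> str:
    """Apply the checklist rule to lists of Yes (True) / No (False) answers.

    legal_answers: answers to the 'Legal considerations' questions.
    other_answers: answers to the 'Policy' and 'Scientific or Historical
    Value' questions.
    """
    if any(legal_answers):
        return "must be accessioned and retained"
    if any(other_answers):
        return "should probably be accessioned and retained"
    return "no checklist criterion applies"


# Example: no legal requirement, but the data fall within the collection policy.
print(retention_recommendation([False, False, False], [False, True]))
```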
Data access
SAEON is committed to the principles of free and open access and to keeping the ODP as accessible as possible. However, if datasets are listed under restrictive licenses, registration will be required to access the data and the data user will need to confirm that they are aware of the restrictions on the use of the data.
The data users are able to browse and search the metadata records; make use of the data services; and download the metadata, data and supplementary information that is available. There is a user feedback form that allows them to comment on data downloads or provide general feedback.
Continuity of Access to Data Holdings
SAEON currently receives a portfolio of funding for development and maintenance of its research data infrastructure. Should SAEON fail to receive long-term funding in the future, the National Research Foundation (NRF), its host organisation, will take over hosting responsibilities for the ODP.
Preservation Policy Review
This policy is to be reviewed on an annual basis, or as needed, at the discretion of the SAEON Data Committee or Managing Director.