"Vocabulary Management" and actual data product focused data modeling

Is the aim of the Vocabulary Hub or "vocabulary management component" to actually enable the creation, administration and distribution of "vocabularies" (or ontologies) that would serve as a basis for the description of the actual data in a data product?

In the context of data spaces, it seems as if "data descriptions" are only concerned with DCAT-AP types of metadata about the data product, whereas the actual data description - like an actual JSON or JSON-LD schema for the data product itself - is always missing. "Data" is also often referred to as just massive dumps of data that somehow magically explain themselves?

If you look at how UN/CEFACT is developing the description of "business documents", which in my mind are "data products" just like any other data to be exchanged or shared, they are focusing on creating JSON-LD vocabularies (https://vocabulary.uncefact.org/) that describe, in a pretty understandable form, the actual content of the information in a specific business document, which builds its semantics on the common vocabulary.

In the context of Gaia-X, IDSA or DSSC I've not encountered anything similar. Regarding Gaia-X, when asked, Pierre Grosselier stated a couple of years ago that I was talking about "domain specific matters" that Gaia-X is not occupied with. Then again, in the DSSC Blueprint there's now lots of stuff in the Technical building blocks part about data modeling with the W3C Semantic Web (Linked Data) stack - but is the outcome of all this use of ontologies and vocabularies only the production of semantically interoperable METAdata descriptions about a data product, and not a description of the actual data itself?

I'm starting to get really lost :-) At least in the context of data sharing in the form of W3C VC 2.0 conformant verifiable credentials, the use of "semantics" is pretty clear: you create an ontology to describe the concepts behind your data, use the ontology to create the actual data models of the data to be shared, and then turn these into physical data sharing artifacts in the Verifiable Credential format. We've already done that in the eIDAS 2.0 digital wallet large-scale pilot EWC, focusing on data stemming from national business registries (company certificate, signatory rights, beneficial owners, power of attorney, tax-related information etc.).
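To make those three steps concrete, here is a minimal Python sketch of the ontology, data model and credential flow. It is only an illustration under stated assumptions: the `https://example.org/vocab/business-registry#` namespace, the term names and the `build_credential` helper are hypothetical and are not taken from the EWC pilot. Only the base context URL and the top-level property names come from the W3C VC Data Model 2.0. The credential is left unsigned, since securing it (a proof or an enveloping format) is a separate step.

```python
import json

# W3C VC Data Model 2.0 base context.
VC_V2_CONTEXT = "https://www.w3.org/ns/credentials/v2"

# Hypothetical vocabulary namespace, standing in for a published ontology.
VOCAB = "https://example.org/vocab/business-registry#"

# Step 1: the ontology terms, exposed as a JSON-LD context (term -> IRI).
DOMAIN_CONTEXT = {
    "CompanyCertificate": VOCAB + "CompanyCertificate",
    "legalName": VOCAB + "legalName",
    "registrationNumber": VOCAB + "registrationNumber",
}

# Step 2: the data model, reduced here to the claims a certificate must carry.
REQUIRED_CLAIMS = ("legalName", "registrationNumber")


def build_credential(issuer, subject_id, claims, valid_from):
    """Step 3: shape the modeled data as an (unsigned) VC 2.0 document."""
    missing = [name for name in REQUIRED_CLAIMS if name not in claims]
    if missing:
        raise ValueError(f"missing required claims: {missing}")
    unknown = [name for name in claims if name not in DOMAIN_CONTEXT]
    if unknown:
        raise ValueError(f"claims not defined in the vocabulary: {unknown}")
    return {
        "@context": [VC_V2_CONTEXT, DOMAIN_CONTEXT],
        "type": ["VerifiableCredential", "CompanyCertificate"],
        "issuer": issuer,
        "validFrom": valid_from,
        "credentialSubject": {"id": subject_id, **claims},
    }


credential = build_credential(
    issuer="https://registry.example.org",
    subject_id="https://registry.example.org/companies/1234567-8",
    claims={"legalName": "Example Oy", "registrationNumber": "1234567-8"},
    valid_from="2025-01-01T00:00:00Z",
)
print(json.dumps(credential, indent=2))
```

The point of the sketch is that the same vocabulary that defines the concepts also types the payload itself, so the description of the data travels with the data rather than only with a catalogue entry about it.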

I would assume that this approach would fit at least certain types of data spaces, including for instance the Health Data in the present SIMPL context?

Johannes Stofberg • 2 months ago Moderator

Thank you for this insightful question, and our apologies for the very delayed response.

The aim of vocabulary management is to maintain the vocabularies used to create the ontologies or schemas for resource descriptions within the data space. Within Simpl-Open, the data space is governed by a data space governance authority. The overall goal is to harmonize the semantics of the metadata used across the data space.

Below are examples of the current schemas for data description (resource description). Please note that these can be configured and extended based on the specific requirements of each data space:

https://code.europa.eu/simpl/simpl-open/development/data1/sdtooling-sd-schemas/-/tree/main/yaml2shape/service-offering?ref_type=heads

The description of the actual product is provided by the resource providers and can currently be referenced from the metadata schema. In the base configuration, this can be done using a general-purpose property ("additional info") to include further information about the data set. However, a specific data space is expected to define its own additional properties to capture the specific information required as part of the resource descriptions.
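As a rough illustration of the difference between the two options, the Python sketch below contrasts a free-text "additional info" value with a dedicated, data-space-specific property. The key names `additionalInfo` and `dataSchema`, the example URLs and the `schema_reference` helper are assumptions made for this sketch only; they are not the property names used in the Simpl-Open schemas linked above.

```python
from urllib.parse import urlparse

SCHEMA_URL = "https://example.org/schemas/company-certificate.schema.json"

# Base configuration: the data model is only mentioned in free text
# (hypothetical key name).
base_offering = {
    "title": "Company register extract",
    "additionalInfo": f"Data model documented at {SCHEMA_URL}",
}

# Data-space-specific extension: a dedicated property (hypothetical name)
# carries a machine-readable reference to the data model.
extended_offering = {**base_offering, "dataSchema": SCHEMA_URL}


def schema_reference(description):
    """Return the data-model URL if a dedicated property provides one."""
    value = description.get("dataSchema")
    if not value:
        return None
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return value
    return None


print(schema_reference(base_offering))
print(schema_reference(extended_offering))
```

With only the general-purpose property, a consumer has to read the text to find the data model; with a dedicated property, tooling can resolve the reference directly.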

Simpl-Open is the enabler of interoperability between data space participants; it supports discoverability and data exchange. This requires agreement on the vocabularies and ontologies that describe the offering, which Simpl-Open enforces (part of making the data offerings discoverable). It also requires that the participants agree on the vocabularies and ontologies that describe the domain-specific data (this impacts semantic interoperability). Simpl-Open does not provide guidelines for this; it is left to the domain experts and participants of the specific data space. For true interoperability, the data space participants must engage with each other and define the vocabularies and schemas of their domain-specific data.

Ultimately, how the data is defined is up to the provider. The existence of a description of the data can be considered a quality criterion of the service offering.

Domain-specific matters may require specific vocabularies and harmonization of the data, which are typically handled by the data providers themselves.

They may provide references to any descriptive resources, but the aim of Simpl-Open is to harmonize the metadata exchange on service offerings and contracts between data space participants to establish trust, preserve sovereignty and maintain interoperability.

The resource description is published as a verifiable credential and, as you noted, it concerns only metadata and not the data model. It serves as a basis for establishing contract agreements between participants. This is based on the Dataspace Protocol and requires standardized exchange of metadata on datasets, usage control and data transfer/access.

https://docs.internationaldataspaces.org/ids-knowledgebase/dataspace-protocol

