ShareDoc:PublicDocumentation/LODPlatform/ClusterKnowledgeBase

Revision as of 14:49, 14 April 2025

THIS PAGE IS WORK IN PROGRESS

The CKB (Cluster Knowledge Base, also called Entity Knowledge Base) is the LOD Platform database of linked data entities. It is a source of high-quality data that includes the clusters of entities created during the reconciliation and conversion of bibliographic and authority data to the entity-relationship model used in the LOD Platform framework.

It serves as a source of truth for managing high-quality bibliographic and authority data, offering a format-agnostic and highly interoperable solution that also integrates with environments external to the LOD Platform, such as ILS/LSP systems and other linked data environments.

The data output of the CKB is available in RDF (Resource Description Framework), the data model for metadata defined by the World Wide Web Consortium (W3C).

Clusters, Prisms, Entities

The CKB in the LOD Platform workflow

The Cluster Knowledge Base in the PostgreSQL database and the corresponding RDF version are the result of data processing and of the enrichment of each entity with external data sources. Typical contents are clusters of Agents (authorized and variant forms of the names of Persons, Organisations, Conferences, Families) and clusters of Works (authorized access points and variant forms of the titles of Works and Instances). The CKB is populated with clusters of all the linked data entities created by LOD Platform processes. These clusters derive from the reconciliation and clustering of bibliographic and authority records (both records internal to the library system and records from external sources) into groups of resources that are converted to linked data, each group representing one real-world object.
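The idea of grouping variant name forms into one cluster can be illustrated with a toy example. This is only a sketch under stated assumptions: the source does not describe the actual matching rules of the LOD Platform, so the `normalize` key, the `cluster_agents` function and the sample records are all hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize(name):
    """Build a crude match key: drop accents, punctuation, case and word order."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    kept = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(sorted(kept.lower().split()))


def cluster_agents(records):
    """Group (source, name) records whose match keys coincide."""
    clusters = defaultdict(list)
    for source, name in records:
        clusters[normalize(name)].append((source, name))
    return dict(clusters)


# Hypothetical input: the same person described by three sources, plus another agent.
records = [
    ("library_a", "Eco, Umberto"),
    ("library_b", "Umberto Eco"),
    ("external_authority", "ECO, Umberto."),
    ("library_a", "Calvino, Italo"),
]
clusters = cluster_agents(records)
```

Here the three variant forms of the first name end up in one cluster and the fourth record in another. A real reconciliation process would rely on far richer evidence (identifiers, dates, related titles) than a string key.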

The CKB is the pool where new entities are collected as the clustering processes proceed. It is the authoritative source of the system and is available both in the PostgreSQL relational database (mostly for internal maintenance, reports, etc.) and in RDF, which is used for the Entity Discovery Interface (the end-user portal) and for public exposure. The CKB is updated both through automated procedures and through manual actions via JCricket, the entity editor. Each change made to the CKB, manual or automatic, is recorded in the Entity registry, whose key role is to keep track of every variation of a resource URI so that resources can be shared effectively and broadly.
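The role of a registry that tracks URI variations can be sketched as follows. This is a minimal illustration, not the Entity registry's actual design: the `EntityRegistry` class, its method names, and the `example.org` URIs are assumptions made for the example.

```python
class EntityRegistry:
    """Hypothetical registry that remembers which URI replaced which."""

    def __init__(self):
        self._replaced_by = {}  # old URI -> URI that superseded it
        self.log = []           # (old URI, new URI, origin) in order of arrival

    def record_change(self, old_uri, new_uri, origin):
        """Register that old_uri is now represented by new_uri."""
        # Walk the chain starting at new_uri; reaching old_uri would mean a loop.
        cursor = new_uri
        while cursor is not None:
            if cursor == old_uri:
                raise ValueError("change would create a redirect loop")
            cursor = self._replaced_by.get(cursor)
        self._replaced_by[old_uri] = new_uri
        self.log.append((old_uri, new_uri, origin))

    def resolve(self, uri):
        """Follow recorded changes until the current URI is reached."""
        while uri in self._replaced_by:
            uri = self._replaced_by[uri]
        return uri


registry = EntityRegistry()
# Hypothetical URIs: one automatic merge followed by one manual merge.
registry.record_change("https://example.org/agent/1", "https://example.org/agent/2", "automatic")
registry.record_change("https://example.org/agent/2", "https://example.org/agent/3", "manual")
current = registry.resolve("https://example.org/agent/1")
```

A consumer that still holds the oldest URI can resolve it to the current one, which is the property that makes sharing resources across systems reliable.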

The role of the CKB in the LOD Platform workflow is outlined in the sections:

CKB pipelines and the evolution of the conversion model

Until now, supporting linked data conversion from multiple input formats has come at the high cost of maintaining multiple conversion pipelines.

As the Share-VDE system evolved, it became clear that a single “source of truth” is needed to make the conversion workflow as frictionless as possible. The current conversion model therefore aims to streamline this process by extending input data capabilities and making the RDFizer component format-agnostic.

To achieve this, several analysis and implementation actions must be undertaken to merge the two current conversion pipelines.

The major tasks involved are summarized below; the granularization of the CKB is a crucial success factor:

comparison of existing conversion pipelines and sources of input data (i.e. MARC files enriched with URIs and the data elements stored in the Cluster Knowledge Base);
extension of the CKB data structure and mapping tool to accept new data elements in a format-agnostic fashion;
enrichment of the CKB data structure with data elements reflecting the wealth of information and granularity of the MARC format (the so-called granularization);
enhancement of the RDFizer conversion component, which will read data only from the CKB, where all the Share-VDE entities reside.

As a result, the new conversion model will support:

a finer granularity level of the CKB in compliance with BIBFRAME granularity;
a “format-agnostic” CKB with extended input data capabilities to converge all input formats into one conversion source (e.g. MARC21, UNIMARC, native BIBFRAME/RDF such as LD4P Sinopia application profiles, etc.);
one single conversion pipeline from the CKB – removing the conversion pipeline based on MARC.
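The shape of such a model, several input adapters converging on one store and a single conversion step reading only from that store, can be sketched as below. This is an illustrative assumption, not the RDFizer's implementation: `CkbEntry`, `ingest` and `rdfize` are hypothetical names, records are simplified to dictionaries keyed by (tag, subfield), the sample values are placeholders, and Dublin Core properties are used for brevity where the platform targets BIBFRAME.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CkbEntry:
    """Hypothetical format-neutral record standing in for a CKB entry."""
    title: str
    isbn: str


def from_marc21(fields):
    # MARC21: 245 $a holds the title, 020 $a the ISBN.
    return CkbEntry(title=fields[("245", "a")], isbn=fields[("020", "a")])


def from_unimarc(fields):
    # UNIMARC: 200 $a holds the title proper, 010 $a the ISBN.
    return CkbEntry(title=fields[("200", "a")], isbn=fields[("010", "a")])


ADAPTERS = {"marc21": from_marc21, "unimarc": from_unimarc}


def ingest(input_format, fields):
    """Converge any supported input format into one CkbEntry."""
    return ADAPTERS[input_format](fields)


def literal(text):
    """Quote a string as a simple literal (only backslash and quote are escaped)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rdfize(entry, uri):
    """Single conversion step: it sees only the CkbEntry, never the source format."""
    dc = "http://purl.org/dc/elements/1.1/"
    return [
        f"<{uri}> <{dc}title> {literal(entry.title)} .",
        f"<{uri}> <{dc}identifier> {literal(entry.isbn)} .",
    ]


# Placeholder values: the same resource arriving in two different formats.
marc21_entry = ingest("marc21", {("245", "a"): "Example title", ("020", "a"): "0000000000"})
unimarc_entry = ingest("unimarc", {("200", "a"): "Example title", ("010", "a"): "0000000000"})
triples = rdfize(marc21_entry, "https://example.org/instance/1")
```

Because both adapters yield the same `CkbEntry`, the conversion step produces identical output regardless of the original format. Supporting a new input format then means adding an adapter, not a new pipeline.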

Granularization of the CKB

A key aspect of LOD Platform data management is the ability to manage data at a very fine level of granularity. The aim is to make the CKB the single source of truth for the LOD Platform and Share Family installations. This will provide:

a “format-agnostic” CKB where all input formats converge into one conversion source (e.g. MARC21, UNIMARC, native BIBFRAME/RDF such as LD4P Sinopia application profiles…);
improved conversion from MARC to BIBFRAME and vice versa;
a single conversion pipeline from the CKB – removing the conversion pipeline based on MARC.

One of the main lines of work in this context is the management of attributes through controlled vocabularies instead of literals. This is being achieved by:

defining each controlled vocabulary in collaboration with the Sapientia Entity Identification Working Group;
enriching each vocabulary with external authoritative sources (RDA, Library of Congress, FINTO…);
clustering of controlled vocabularies and assignment of a URI to each value.
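Replacing literals with controlled vocabulary values can be illustrated with a small lookup. This is a sketch only: the vocabulary, its labels and the `example.org` URIs are invented for the example and do not reflect the vocabularies actually being defined.

```python
# Hypothetical vocabulary: every URI and label below is invented for illustration.
VOCABULARY = {
    "https://example.org/vocab/carrier/volume": ["volume", "vol.", "printed volume"],
    "https://example.org/vocab/carrier/online": ["online resource", "online"],
}


def build_index(vocabulary):
    """Map each case-folded label to the URI of its vocabulary value."""
    index = {}
    for uri, labels in vocabulary.items():
        for label in labels:
            index[label.casefold()] = uri
    return index


def to_uri(literal, index):
    """Return the URI for a literal, or None when it is not in the vocabulary."""
    key = " ".join(literal.split()).casefold()  # collapse whitespace, ignore case
    return index.get(key)


INDEX = build_index(VOCABULARY)
resolved = to_uri("  Printed   Volume ", INDEX)
unresolved = to_uri("microfiche", INDEX)
```

Variant labels that were clustered under one value all resolve to the same URI, while unknown literals return `None` and can be flagged for review instead of being guessed.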

Once the work on these vocabularies is complete, they will be made available on JCricket.

The expected outcome of this model is the advancement of the CKB technology to enable simpler data processing and more effective conversion results, and to allow clustering procedures to perform more efficiently, ultimately increasing the added value of one of the key features of the Share-VDE system.