Open access peer-reviewed chapter

Ontologies as a Tool for Formalizing Data Validation Rules

Written By

Nicholas Nicholson and Iztok Štotl

Submitted: 13 January 2023 Reviewed: 02 March 2023 Published: 17 April 2023

DOI: 10.5772/intechopen.110757

From the Edited Volume

Latest Advances and New Visions of Ontology in Information Science

Edited by Morteza SaberiKamarposhti and Mahdi Sahlabadi

Abstract

Comparison of health data across national or even regional boundaries is a challenging task. Data sources, data collection methods, and data quality can vary widely, and the quality of the indicators themselves is dependent upon the veracity of the underlying data. For any trans-regional or trans-national comparison of indicators, it is imperative to ensure data are appropriately validated. Ontologies provide a number of functionalities to help in this process. Data rules can be formalized using the ontology axioms, which are useful for removing the ambiguities of rules expressed in natural language. In addition, the axioms serve to identify the metadata and their corresponding semantic relationships, which can in turn be linked to standard data dictionaries or other ontologies. Moreover, ontologies provide the means for encapsulating the underlying data model of the domain, allowing the rules and the data model to be maintained in a single application. Finally, the expression of the axioms in description logic, as supported for example by the web ontology language, allows machine reasoning to validate data sets automatically against the formalized rules.

Keywords

  • web ontology language
  • data harmonization
  • data validation
  • data rules
  • description logic
  • linked metadata

1. Introduction

Data validation is a key part of the overall data harmonization process that allows meaningful comparison or integration of different data sets. This is particularly important for the derivation of indicators, which may be used for comparison or benchmarking purposes across countries or regions. Prime examples are population-based disease surveillance programs and environmental monitoring and control programs.

Disease monitoring and surveillance is a particular focus of the European Union and a number of pan-European registry networks exist for this purpose. The European Network of Cancer Registries (ENCR) is the most established surveillance network incorporating over 150 separate regional or national registries [1]. A similar initiative in the United States is the Surveillance, Epidemiology, and End Results (SEER) program [2].

In order to help harmonize the data, which may be collected via different processes from different sources, registry networks generally agree on a core or common data set comprising the most accessible, important, and well-defined variables. As an example, the ENCR common data set consists of about 50 variables [3]. Even though the common data set variables are generally well defined, they may not necessarily be described in a manner that easily allows semantic linkage or cross-reference. Furthermore, they may depend on domain-specific knowledge not readily available to data users outside the domain.

Indicators for comparison purposes tend to be derived from common data sets since they constitute the variables that are the most harmonized within a disease domain. It is particularly important that the underlying data of the indicators are consistent and complete to avoid erroneous conclusions or bias in the results [4]. Ensuring an adequate level of consistency, however, is difficult to achieve in practice given the heterogeneity of data sources and data-collection processes.

Assuming a pre-defined level of quality, data consistency can nevertheless be verified using rule-based systems to check that the individual data fields are present and within the expected ranges. More complex, inter-variable rules check data consistencies between variables and their values. Other consistency checks can compare the frequency of occurrences of specific values of data. All these checks provide greater confidence in the fidelity of data sets for comparison purposes [5].

2. Specification of the rule base

Specifying the data-validation rules in an optimal way is itself a challenge. Rules are often described using natural language which, whilst having the advantage of making them more readable, leads to ambiguities for anything other than the simplest rules. Complex rules with dependencies on multiple variables can be illustrated more easily via a series of tables, each of which constrains the values of the variables not forming the major focus of that table. Ensuring the consistency and verifying the accuracy of the rules across multiple tables is not straightforward, however, and leads to considerable maintenance overheads.

The ENCR common data set comprises variables describing a tumor, such as: morphology (type of tumor); behavior (how the tumor acts in the body); topography (organ affected); basis of diagnosis (how the tumor was diagnosed); grade (how the tumor cells compare with normal cells under the microscope); and stage (extent of the tumor). Morphology, behavior, topography, and grade are specified by codes adhering to the international classification of diseases for oncology, edition 3 (ICD-O-3) [6]. Stage for solid tumors is generally specified according to the globally recognized TNM staging system describing the extent of cancer disease, where the “T” component is related to the size of the tumor or its invasion into local structures; the “N” component is related to the number and nature of lymph node groups adjacent to the tumor with evidence of tumor spread; and the “M” component is related to the presence of local or distant metastatic sites. The rule interdependencies of all these tumor-description variables in the ENCR rules are illustrated in Table 1. To manage more easily the complexity of the interdependencies, the rules are divided into nine separate sets of tables, namely:

  1. age/morphology/topography;

  2. sex/topography;

  3. sex/morphology;

  4. basis of diagnosis/morphology/topography/age;

  5. grade/morphology/behavior;

  6. morphology/topography;

  7. topography/stage-grouping/TNM;

  8. topography/topography-grouping (for multiple primary tumor conditions);

  9. morphology/morphology-grouping (for multiple primary tumor conditions).

[Matrix of the variables Morph, Topog, Age, Sex, BoD, Grade, Beh, and Stage, with an “X” marking each pair of variables linked by an interdependency rule.]

Table 1.

Rule interdependencies (marked with an “X”) of some of the main variables within the ENCR common data set. Morph = morphology; Topog = topography; BoD = basis of diagnosis; Beh = behavior. The shaded cells indicate no interdependencies.

Given the size of the tables, only a few excerpts are shown for illustrative purposes in Tables 2–6. Whereas they are specific to the ENCR common data set, they are nevertheless indicative of the sorts of difficulties faced by other rule sets defined in a similar fashion.

Age group (years) | Morphology | Topography
0–2 | Hodgkin lymphoma: 9650–9667 |
>7 | Malignant extra-cranial and extra-gonadal germ cell: 9060–9065, 9070–9072, 9080–9085, 9100–9105 | C00-C55, C57-C61, C63-C69, C73-C750, C754-C768, C80
0–14 | Mesothelial neoplasms: 9050–9053 | Any
<40 | Adenocarcinoma: 8140 | C61

Table 2.

Unlikely and rare combinations of age and tumor type (excerpt from table 3 in [3]).

Basis of diagnosis | Morphology (and topography, age, and sex where indicated)
2 | 8000, 8720, 8800, 8960 (age 0–8), 9140, 9380 (C717), 9384/1, 9500 (age 0–9), 9510 (age 0–5), 9530–9539 (C70), 9590, 9800
4 | 8000, 8150–8154, 8170, 8270–8281 (C751), 9100 (female age 15–49), 9500 (age 0–9), 9732 (and age 40+), 9761 (and age 50+)
6 | ≠ 8000; 9590–9731; ≠ 9732; ≠ 9733–9760; ≠ 9761; ≠ 9762–9992

Table 3.

Valid combinations for basis of diagnosis and morphology (excerpt from figure 2 in [3]).

Sex | Topography
Female | C60, C61, C62, C63
Male | C51, C52, C53, C54, C55, C56, C57, C58

Table 4.

Invalid combinations for sex and topography (excerpt from table 4 in [3]).

Morphology | Allowed topography | Disallowed topography
8010–8589 | | C38, C40-C42, C47, C480, C49, C70-C72, C77
8090–8095, 8097, 8100–8103, 8110 | C300, C44, C51, C60, C632 |
8800–8811, 8814–8831, 8840–8921, 8963, 8990, 8991, 9040–9043, 9120–9150, 9170, 9540, 9550, 9561, 9580, 9581 | | C420, C421, C77

Table 5.

Morphology codes and allowed/refused topography codes (excerpt from table 8 in [3]).

Stage | T | N | M
Thyroid gland – papillary or follicular, < 45 years
I | Any T | Any N | M0
II | Any T | Any N | M1
Thyroid gland – papillary or follicular, ≥ 45 years
I | T1a, T1b | N0 | M0
II | T2 | N0 | M0
III | T3 | N0 | M0
    | T1, T2, T3 | N1a | M0
IVA | T1, T2, T3 | N1b | M0
    | T4a | N0, N1 | M0
IVB | T4b | Any N | M0
IVC | Any T | Any N | M1

Table 6.

TNM edition 7 stage grouping and T, N, M values for thyroid gland (C73) papillary or follicular (excerpt from appendix III in [3]).

Apart from the difficulty of ensuring consistency across the rule tables, a further drawback to defining rules in this way relates to the intricacy they impose on compiling a test data set. A comprehensive test data set is important for verifying the ability of data-checking software to trap the different types of errors against the rules. In constructing a test data set, it is necessary to keep a record of the variables set incorrectly for each individual test case.

Creating a test record using the tabular rules requires one first to establish a valid morphology/topography combination (one table look-up), then a correct morphology/behavior combination (second table look-up), and thereafter multiple table look-ups for all the other variable interdependencies. Given that not all possible morphology/topography combinations lead to defined combinations of the other variables, it becomes an arduous task to follow this process to completion. In practice, what is done is to start from a real cancer registry data set and systematically set the variables to incorrect values. However, such an approach does not guarantee all possible record combination conditions are thereby tested, potentially leading to undetected bugs in the validation software.

For many practical reasons therefore, a more formal representation of the data rules is necessary. Ontologies are interesting since they provide the basis for doing this in a way that is also integrated with the underlying data model.

3. The relationship between ontologies and description logics

Computational ontologies describe and categorize classes of objects and specify the relationships associated with those classes and categories. This information is captured using axiomatic constructs that provide an appropriate mechanism for describing the majority of the ENCR data rules.

There is in fact a very close relationship between the axiom constructs and description logics (DLs) [7], which are themselves closely related to first-order and modal logics. Since first-order logic draws from a well-established mathematical foundation, DLs provide a solid formal framework for representing axioms that can be developed using the more readily understandable ontology constructs.

DLs form a family of knowledge representation languages that are distinguished by their level of expressivity [8]. Expressivity refers to the expressive power of the language, governed by the types of operation it supports. The base language is the attributive language (AL), supporting concept intersection (⊓), atomic negation (¬), universal restrictions (∀), and existential restrictions (∃) with limited quantification. The restriction operators ∀ and ∃ are used for qualifying the entities on which a given role acts, with ∃ specifying the notion of an “at-least-one” relationship and ∀ the notion of an “only” relationship; for example, ∃hasMorphology.Carcinoma describes the class of things having at least one morphology that is a carcinoma, whereas ∀hasMorphology.Carcinoma describes the class of things whose morphologies, if any, are all carcinomas. They are similar to the existential and universal quantifiers of first-order logic.

The addition of complex concept negation (C), which includes concept disjunction (⊔), increases the expressivity to attributive language with complements (ALC), which already provides sufficient expressive power to handle many types of data rules. A language of higher expressivity is SHOIN, where S refers to ALC with transitive roles, H to role hierarchies, O to nominals, I to inverse properties, and N to cardinality restrictions. Higher expressivities are also possible, but there is a trade-off between expressivity and the computational cost of automatic reasoning.

In DL terminology, a knowledge base has two distinct components – a terminological part or TBox, and an assertional part or ABox. An additional term RBox is sometimes used to denote an extended set of role axioms that are described by the letter R in higher expressivities such as SROIQ [8].

The distinction between the TBox and ABox is sometimes also made in the division between ontologies and knowledge graphs [9]. An ontology is considered as a schema that captures the semantic data model using classes, relationships, and attributes (i.e. the TBox, where concepts stand for classes and roles for relationships). A knowledge graph in contrast contains specific instances following the semantic data model represented by the ontology (i.e. the ABox).

3.1 Web ontology language

The World Wide Web Consortium (W3C) describes the web ontology language (OWL) as “a semantic web language designed to represent rich and complex knowledge about things, groups of things, and relations between things”. It refers to OWL documents as ontologies [10]. OWL is structured closely along the lines of DLs and provides support for automatic reasoning. It uses the terminology of classes and properties (instead of concepts and roles) for the TBox and represents the ABox as a set of individuals instanced (or asserted) from the TBox axioms.
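As a minimal illustration of this terminology (a sketch only, with a hypothetical ontology IRI and individual names, and class and property names taken from the examples used later in this chapter), the following Manchester-syntax fragment declares a small TBox and asserts a small ABox:

    Prefix: : <http://example.org/encr-rules#>
    Ontology: <http://example.org/encr-rules>

    ObjectProperty: hasMorphology

    Class: M_8090
    Class: C300

    Individual: morphology01
        Types: M_8090

    Individual: tumourCase01
        Types: C300
        Facts: hasMorphology morphology01

The Class and ObjectProperty frames correspond to the TBox; the Individual frames, with their type and property assertions, correspond to the ABox.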

A number of free, open-source graphical user interface OWL editors are available (e.g. Protégé [11]) that greatly ease the task of ontology development. It is generally more straightforward to define classes and relationships from an ontological point of view than construct them from scratch using DL. The DL expressions can afterwards be determined from the resulting OWL axioms.

4. OWL: A formal framework for the specification of the data rules

OWL’s roots in DL allow a formal context to be established for data rules that can overcome the inherent ambiguities associated with their formulation in natural language. Given the relatively rich set of logic operators available however, care is required in deciding how best to formulate the axioms. Unfortunately, there is no simple set of guidelines to help with this task since it is very much dependent on how the ontology will be used. Moreover, DL expressivity comes at the cost of computational speed [12] and where this is important, it is preferable to restrict the DL expressivity to the extent necessary.

4.1 Representation of the data rules

By way of illustration, the following simple examples are only intended to show how some of the rules depicted in Tables 2–6 can be encoded in DL. With reference to Table 5 (morphology/topography), to capture the fact that the topography code C300 (nasal cavity) with a morphology code of 8090 (basal cell carcinoma) is a permissible combination, one can create an OWL axiom stating that C300 is a subclass of an existential restriction on the object property hasMorphology with the filler class M_8090 (where the prefix M_ has been added for convenience to denote morphology). This statement is represented in DL by:

C300 ⊑ ∃hasMorphology.M_8090        (1)
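In an ontology editor such as Protégé, this axiom appears as a SubClassOf entry on the C300 class; continuing the hypothetical Manchester-syntax fragment introduced above, it would read:

    Class: C300
        SubClassOf: hasMorphology some M_8090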

In a similar manner, one can capture the rule in the last row of Table 2 that an ICD-O-3 topography code C61 (prostate gland) together with a morphology of 8140 (adenocarcinoma) is unlikely in men aged less than forty years at diagnosis. This rule, which requires use of an OWL data property, can be framed in such a way as to say that, for a combination of topography and morphology, the expected age of patients is above thirty-nine years:

C61 ⊓ M_8140 ⊑ ∃expectedAge.(>39)        (2)

The introduction of another axiom stating that the conjunction of an expected age of more than thirty-nine years and a patient age at diagnosis of less than forty years is an improbable scenario, Eq. (3), would flag a potential coding error (via subsumption under the class ImprobableAge) for any prostate tumor cases with morphology code 8140 for patients younger than forty years of age.

∃expectedAge.(>39) ⊓ ∃patientAgeAtDiagnosis.(<40) ⊑ ImprobableAge        (3)

Clearly such a rule would have to be replicated for all the relevant upper age restrictions provided in the rule table. To avoid logic conflicts, a modified set of axioms would need to be created for the rules with lower age restrictions, c.f. row 2 in Table 2.
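A sketch of Eqs. (2) and (3) in the same Manchester-style notation is given below, assuming that expectedAge and patientAgeAtDiagnosis are integer-valued data properties and that the classes C61 and M_8140 are declared elsewhere in the ontology. Because both axioms have complex class expressions on their left-hand sides, they would be entered as general class axioms (for instance via Protégé's general class axiom editor) rather than on a single class frame:

    DataProperty: expectedAge
    DataProperty: patientAgeAtDiagnosis

    Class: ImprobableAge

    C61 and M_8140 SubClassOf expectedAge some xsd:integer[> 39]

    (expectedAge some xsd:integer[> 39])
        and (patientAgeAtDiagnosis some xsd:integer[< 40])
        SubClassOf ImprobableAge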

By building up axioms in this manner, all the rules relevant to a given class or hierarchy of classes can be defined. The advantage is that each rule governing a class of objects is visible on the ontology editor's view of the class, unlike the representation of the rules in Tables 2–6, where one has to search between various tables to ascertain all the rules pertinent to a particular entity. As observed earlier, this greatly simplifies the task of building up test cases of data, both to validate the behavior of the rules and to construct comprehensive test data sets.

4.2 Automatic reasoning

Owing to its DL foundations, OWL provides the possibility for automatic reasoning. Automatic reasoning is a valuable tool for detecting rule violations in a set of data records. Eq. (3) provided an example where a reasoner could flag a potential coding error in a cancer case.

In designing error-trapping axioms, it is important to be aware of the issues relating to the open world assumption of DL. The open world assumption holds that anything not explicitly stated can only be assumed to be unknown. This is in contrast to the closed world assumption, in which anything not explicitly stated is assumed to be false (typical of rules expressed, for instance, in Datalog). The open world assumption has implications for the subsumption of classes in a hierarchy and can dictate the structure of the ontology depending on the reasoning requirements.

Data rules, which by definition are prescriptive in the dependencies between data variables, are more suited to the closed world assumption. Axioms may therefore have to be written in such a way that serves to force class subsumption in an otherwise open world view. One means for achieving this is to “invert” the class tree – which may be more easily clarified by the following simple practical example. Say we wished to subsume a class with certain attributes (e.g. a class having a topography code of C40 and a morphology with code 919) under a general classification class of Osteosarcoma. Following the traditional approach of constructing classes using an ontology editor such as Protégé, we might declare an axiom such as:

Osteosarcoma ⊑ C40 ⊓ M_919        (4)

If we were to declare a class TumorCase also subclassed from an intersection of C40 and M_919 and then run the reasoner, we would find that our TumorCase class had not been classified under (i.e. subsumed by) the class Osteosarcoma. This is due to the open world assumption since it cannot be assumed that the class Osteosarcoma is not subclassed from other classes that have not been explicitly stated. It cannot therefore be assumed that the TumorCase class is contained by the Osteosarcoma class – there is not enough information to say.

The problem can be circumvented either by creating an equivalence (using defined classes) or by inverting the subclass definition. Creating many equivalences with complex classes can, however, lead to unintended consequences. For example, if the containment operator (⊑) in Eq. (3) were to be replaced by an equivalence (≡), and if this approach were to be replicated for the whole set of axioms modeling each of the age-restricted rules (cf. Table 3), then all the expressions on the left-hand side of the equivalence would also become equivalent (since they are all equivalent to the class ImprobableAge), which would be erroneous. Alternatively, the subclass definition of Eq. (4) can be inverted as indicated in Eq. (5):

C40 ⊓ M_919 ⊑ Osteosarcoma        (5)

Running the reasoner now would result in the subsumption of the class TumorCase under the class Osteosarcoma.

This method of axiom formulation has been coined “being complex on the left-hand side” [13]. Ontology editors such as Protégé lead developers to put the complexity on the right-hand side of the class containment relation (i.e. subclassing from complex classes). Although moving the complexity to the left-hand side can overcome the subsumption issues of the open world view, it tends to obfuscate the ontology structure. Eq. (3) is a further example of defining axioms following this approach.
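The contrast between the two formulations can be seen directly in Manchester-style notation. The first axiom below is the conventional frame-based form of Eq. (4), with the complexity on the right-hand side; the second is the inverted form of Eq. (5), which, having a complex class expression on the left-hand side, would be entered as a general class axiom:

    Class: Osteosarcoma
        SubClassOf: C40 and M_919

    C40 and M_919 SubClassOf Osteosarcoma

With only the first axiom asserted, a class declared as a subclass of C40 and M_919 (such as TumorCase) is not classified under Osteosarcoma; with the second, it is.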

Regarding the different formulations for expressing the rule illustrated in Eqs. (4) and (5), it is instructive to note that the equivalence expression:

C40 ⊓ M_919 ≡ Osteosarcoma        (6)

is in fact a short-hand way of writing the implied DL expression:

C40 ⊓ M_919 ⊑ Osteosarcoma,    Osteosarcoma ⊑ C40 ⊓ M_919        (7)

Figure 1 is a view from the Protégé application showing the result of reasoning based on the classes and properties given in an imaginary cancer test case. The non-highlighted lines indicate the information passed into the reasoner, and the lines highlighted with a yellow background show the extra information returned by the reasoner. Noting that the topography class C619 is a subclass of C61 and the morphology class M_8140_3 is a subclass of M_8140, and in accordance with the rules provided in Table 2 (row 4), Table 3 (row 3), and Table 4 (row 1), the reasoner has ascertained that: the age at diagnosis is improbable for the morphology/topography combination; the basis of diagnosis is correct; and the combination of sex and topography is incorrect. The question mark in the gray circle on the highlighted lines provides the means of polling the reasoner to understand why it has subsumed the class under the identified class.

Figure 1.

Information added from the reasoning process (highlighted lines) based on the prior information of classes asserted in a test case (non-highlighted lines).
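Although the exact class names of the test case in Figure 1 are not reproduced here, a TBox test case of the kind shown in the figure could be asserted roughly as follows, where TestCase001, SexFemale, and BoD_6 are illustrative names and the patient age is expressed as a datatype restriction:

    Class: C619
        SubClassOf: C61

    Class: M_8140_3
        SubClassOf: M_8140

    Class: TestCase001
        SubClassOf: C619,
            M_8140_3,
            SexFemale,
            BoD_6,
            patientAgeAtDiagnosis some xsd:integer[< 40]

With rule axioms such as Eqs. (2) and (3), together with corresponding sex/topography and basis-of-diagnosis axioms, loaded into the ontology, a DL reasoner would be expected to subsume TestCase001 under classes such as ImprobableAge, mirroring the inferences highlighted in Figure 1.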

Figure 2.

Graphical view of the classification structure (containing both asserted and inferred classes) of the cancer test case shown in Figure 1.

Protégé also provides a graphical view of the inferred classification tree for the named classes (unnamed classes are not visible). Figure 2 provides an expanded view of the classification tree summarized in Figure 1.

Figure 3.

Thyroid cancer TNM test case to verify the class subsumption results from the reasoner.

The reasoner can be polled to understand the reasoning applied for class subsumption. Figure 3 shows a cancer test case for the thyroid gland (C739) restricted to TNM information to check whether the test case is subsumed under stage III (cf. Table 6, row 7b). Figure 4 is the classification tree resulting from the automatic reasoning process on the TNM test case of Figure 3. It can be seen that the reasoner has correctly subsumed the test case under the stage III class. Figure 5 shows the results from polling the reasoner to understand why it subsumed the test class under the TNMStageIII class. The specific rule is stated in line 11 of the figure, and the other lines provide the reasons for subsuming the classes asserted in the test case under the various classes in the rule itself.

Figure 4.

Classification tree of the thyroid cancer test case of Figure 3, showing that the reasoner has correctly identified the stage III class (the top-most class in the figure) as required from the rule table shown in Table 6. The classes shaded in the darker color represent defined classes (classes with some equivalence conditions).

Figure 5.

Reasoner justification for the subsumption of the thyroid cancer test case under the TNMStageIII class.
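As an indication of how the two stage III rows of Table 6 might be axiomatized in the style of Eq. (5), the following two general class axioms could be used, where the T, N, and M category classes (T1, T2, T3, N0, N1a, M0) are assumed names and the age and histology qualifiers of Table 6 are omitted for brevity:

    C73 and T3 and N0 and M0
        SubClassOf TNMStageIII

    C73 and (T1 or T2 or T3) and N1a and M0
        SubClassOf TNMStageIII

A test case asserted, for example, as a subclass of C739 (itself a subclass of C73), T2, N1a, and M0 would then be subsumed under TNMStageIII, as in Figure 4.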

Automatic reasoning can be performed using both TBox axioms and ABox axioms. Since data rules are more often associated with classes of objects, TBox reasoning is in most cases sufficient and can reduce computational costs. Most of the ENCR data rules can be modeled by TBox axioms apart from those, for example, that pertain to multiple tumors (where a person has more than one type of cancer). These rules specify the topography and morphology combinations under which any two tumors are to be considered distinct; since two entities with the same class attributes have to be compared, the use of ABox axioms is necessary. The modeling of the multiple-primary rules on the basis of the DLs supported by OWL has been addressed at length in [14].

The ability to include closed world reasoning in OWL would be ideal and has been made possible to a certain degree via the incorporation of the semantic web rule language (SWRL) into the semantic web stack. SWRL is based on first-order Horn logic, in which Datalog rules are also expressed [15], but it requires an ABox. Another expressive logic formalism allowing some integration of open- and closed-world reasoning is minimal knowledge and negation as failure (MKNF) [16]. This formalism is being developed in a unifying framework in the KAON2 infrastructure [17].

4.3 Encapsulation of the data model

The axiomatic constructs of an ontology are useful for capturing many of the different aspects of a data model that for relational database models have traditionally been divided across three independent levels of abstraction: the conceptual schema (describing the semantics of the domain and the scope of the model); the logical schema (describing the structure of the information, as for example a relational database schema); and the physical schema (describing the physical means of storing the data) [18].

One of the strengths of OWL is its relationship with the resource description framework (RDF), which serves as the data interchange layer of the semantic web stack [19]. RDF data is in essence a network of connected triples of resources, in which the resources at the two ends of each triple (subject and object) are related by the resource in the middle (predicate). Each resource is identified by a uniform resource identifier (URI). All OWL constructs are described in terms of RDF data, allowing ontologies to bridge the traditional divide between conceptual and logical levels of abstraction and providing a richer, more integrated data model description framework.

The flexibility and descriptive power of an ontology present their own set of challenges, however. While the usefulness of ontologies is widely acknowledged, the task of building a good ontology is a particularly hard one and falls within the developing domain of ontology engineering [20]. Designing an appropriate ontology does not only depend on a thorough understanding of the domain to be modeled, but must be performed circumspectly in view of the ontology's purpose and future extensibility. There are pitfalls in making an ontology too granular or not granular enough – the result is either a multiplication of application-specific ontologies that cannot easily be integrated, or an ontology too generic to be useful to any particular application. OWL provides the functionality for importing ontologies that allows larger ontologies to be built up in a modular fashion, and this can aid the design process if performed carefully [21].

There are also certain design aspects to take into account that can affect the overall structure of the ontology. One important consideration relates to the extent to which the ontology is to be used in a pre-coordinated or post-coordinated way [22]. Pre-coordination refers to the situation in which all the terms and relationships are stated explicitly in the axioms and leads to a static use of the ontology, whereas post-coordination refers to the more dynamic situation in which new relationships are determined by the automatic reasoning process on the basis of the pre-defined axioms. The pitfalls are exacerbated in applications that need to tweak the normal approach to structuring class hierarchies to overcome restrictions in post-coordination that the open world assumption places on class subsumption.

If the axioms describing the data rules are developed circumspectly however, the advantage is that the data model falls out almost by default – the data rules necessarily identify all the concepts within the domain as well as their inter-relations. This may require an iterative process combining both the bottom-up approach of developing axioms in DL and the top-down approach of structuring the ontology, while testing each stage of the development with the reasoner.

The task of developing a data model in an ontology used in a predominantly pre-coordinated way is perhaps more straightforward and does not require too much juggling in defining the axioms. Moreover, the axioms can be constructed in the more usual manner of subclassing from complex classes. The intelligence of validating data sets would however need to be moved from the ontology to a computer program (for instance via the OWL-API) thereby compounding maintenance issues. The advantage of encapsulating the intelligence in the ontology is that all the knowledge is contained in one application and maintenance aspects are thereby confined to that one application.

4.4 Metadata by default

Elements in an ontology are described in terms of their semantic relations to other elements in the ontology thereby providing a description and context, or in other words the metadata, of the element. Moreover, since each element in an OWL ontology is uniquely defined by a uniform resource identifier (URI), it is readily linkable with other web resources. This allows any element to be associated with other relevant resources via linked open data (LOD) principles. Using knowledge organization schemes, such as simple knowledge organization system (SKOS), it becomes a straightforward matter to link OWL resources semantically with other web-based resources such as data-dictionary or thesauri elements.
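As a small illustration of such linkage, a topography class could be annotated with a SKOS mapping property pointing to the corresponding entry of an external data dictionary. In the sketch below, skos:exactMatch is declared as an annotation property in order to stay within OWL DL, and the target IRI is purely hypothetical:

    Prefix: skos: <http://www.w3.org/2004/02/skos/core#>

    AnnotationProperty: skos:exactMatch

    Class: C61
        Annotations: skos:exactMatch <http://example.org/data-dictionary/topography/C61>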

The interlinking of any OWL resource to other web resources, especially to other RDF resources, provides a powerful and extensible means of capturing all the necessary metadata components for comprehensively describing a data model element. This aspect has been exploited to create extensive frameworks of distributed metadata registries that allow the reuse of existing metadata resources [23].

It is important to emphasize that a number of complementary tools exist that can be used together to provide a more comprehensive toolkit for validating different types of data rules. Included in the semantic web standards are the shape languages, shape expressions (ShEx) and the shapes constraint language (SHACL), for providing structural schemas for RDF data. There are also tools for querying knowledge bases, such as the SPARQL protocol and RDF query language (SPARQL), as well as tools for extending the expressivity of OWL DLs, such as SWRL. Depending on the type of rule, some of these tools may be more suitable than others; however, since they are agreed or proposed semantic web standards based on the standard model for data interchange (RDF), they can all reference the elements of a data model described in RDF. This provides a highly flexible and versatile environment in which to develop an integrated toolkit. Table 7 gives a summary breakdown of these applications with the sorts of operations they support and the components of a knowledge base to which they are applicable.

Application | Scope | Inference mechanism | Types of operations supported | TBox/ABox focus
OWL | Knowledge bases, DL | Yes | Complex inter-variable checks supported by DL expressivity | TBox, ABox
SWRL | Extension of logic to OWL | Yes | Complex inter-variable checks | ABox
OWL-API | Programming interface to OWL ontologies | Yes | Complex inter-variable checks supported by DL expressivity and additional computer logic | TBox, ABox
ShEx | Grammar check of RDF graphs | No, although can be used in post-coordination | Ensuring RDF data conform to an expected template; can perform some inter-variable dependency checks and verify whether values of variables are in range | TBox, ABox
SHACL | Constraint requirements of RDF graphs | Some | Ensuring RDF data conform to a given set of constraints; can compare date fields; more suitable for validating RDF graphs than conformance with a specific template (for which ShEx is better) | TBox, ABox
SPARQL | Query language for RDF data | No, although can be used in post-coordination | Querying of data by user-defined query-language constructs | ABox

Table 7.

Summary breakdown of some of the semantic web standard applications with the sorts of operations they support.

Whereas other tools and languages (e.g. Datalog) are also available for validating data, and may arguably be more appropriate for defining rules predominantly based on closed-world scenarios, they fall down in this aspect of unifying the rules with the data model and the metadata, especially in the LOD sense. For federated data-validation processes, the unification of all these elements brings many advantages in terms of data linkage, maintenance, and collaborative development. Having said that, OWL is not able to handle all types of validation checks – such as those, for example, requiring comparison of dates, checking of frequencies of occurrence, or expressing certain relations between individuals. ShEx, SWRL, and SPARQL can all go some way to handling such checks. SWRL and SPARQL, however, require an ABox, and SWRL has implications for decidability [24]. Moreover, introducing an ABox can create performance issues for DL reasoning when many hundreds of thousands of individuals are involved and requires careful consideration in the ontology design phase. An alternative is to create an ABox and use SPARQL querying instead of DL reasoning, but this would move the rule logic out of the ontology and into the SPARQL query scripts.

An example for handling the axioms of Eqs. (2) and (3) using a simple SPARQL script to list all the associated erroneous cancer-case records is shown in Figure 6. A ShEx script for checking the same condition is shown in Figure 7. The same rule using SWRL could be expressed as shown in Figure 8.

Figure 6.

An example of a SPARQL script to list all the erroneous patient-age related cancer-case records associated with a particular combination of topography and morphology codes.

Figure 7.

An example of a ShEx script to trap any erroneous patient-age related cancer-case records associated with a particular combination of topography and morphology codes.

Figure 8.

An example of an SWRL rule to catch the same validation errors as for Figures 6 and 7.

The effort required to maintain the rule base developed with such tools however would be considerable and it would make more sense to use them in a pre-processing stage on the data to be validated (translated beforehand into RDF) for those types of checks that cannot be handled within the ontology itself. ShEx in particular provides a valuable pre-processing tool to check the ranges and formats of variables.

5. Role of ontologies in data harmonization

The focus until this point has been on how ontologies can provide many advantages in the task of data validation against a set of specific data-validation rules. Checking the conformity of data against such rules is just one element in the whole process of data harmonization.

Data harmonization is a term that eludes a clear and concise definition, perhaps partly due to its dependence on the context to which it is applied [25, 26] as well as the fact that it is a multistep activity involving both technical and social processes [5, 26]. An idealized breakdown of these steps has been provided in [5] based on the accumulated experience gained by the Comprehensive Center for the Advancement of Scientific Strategies (COMPASS) resulting from multiple data-harmonization projects across widely different types of data, collaborators, and scientific questions. Whereas not all projects were found to follow all steps and the order of the steps might vary, the six most common steps identified were:

  1. Identification of the questions that the harmonized data set is required to answer

  2. Identification of the high-level data concepts required to answer those questions

  3. Assessment of the data availability for the data concepts

  4. Development of common data elements (CDEs) for each data concept

  5. Mapping and transformation of individual data points to CDEs

  6. Quality-control procedures

In this breakdown, the process of data validation falls mainly under steps 5 and 6, although it should be stressed that validation forms only part of the quality-control procedures of step 6. Other fundamental quality metrics consist of the following dimensions: completeness, consistency, accuracy, timeliness, uniqueness, and auditability [27]. Moreover, different entities in the data process may be responsible for ensuring the quality of the data associated with these separate dimensions. They are nevertheless all important for ensuring an appropriate level of harmonization that allows meaningful comparison or integration of data, and it would not be correct to state that data validated solely against a set of validation rules have the requisite level of quality for purposes of data comparison.

The degree to which data are harmonized depends ultimately on the specific end use, but the step can never entirely be ignored. In the field of health for example, data harmonization is a critical step in pooling data sets for increasing the power of individual epidemiological studies [5]. It is also a necessary part of health management decision-making, particularly with regard to: clinical decision-making for individual patient clinical management or clinical support and quality improvement tools; operational and strategic decision-making for health system managers and policy-makers; and population-level decision-making for disease surveillance and outbreak management [26].

The point is that ontologies can play an important part in all stages of data harmonization. Starting from the highest levels of abstraction in the six-step harmonization process presented above, ontologies provide the means to capture and organize the high-level data concepts needed to address the questions the harmonized data are required to answer. Ontologies would moreover be able to formalize the questions in direct reference to the high-level data concepts, help identify any missing concepts, and verify the underlying logic of the relationships between them. The next steps are to identify the availability of the data and to develop the CDEs. The data may be in an unstructured format. The development of CDEs is a process of structuring the data, and the semantic relations described in a domain ontology can help identify the relevant information. The role of ontologies in ETL (extract, transform, load) processes has been extensively reviewed in [28]. In particular, the authors point to the efficacy of ontologies: (a) to formalize the needs and requirements of users and resolve semantic ambiguity; (b) to discover concepts and their relationships; (c) to enrich source data, provide mappings (also generating them automatically), and increase ETL performance and efficiency; and (d) to support the configuration and instantiation of ETL patterns. Moreover, the validation rule base for the data can itself be derived automatically from the data using ontological methods [29], allowing verification of any pre-defined set of validation rules.

6. Conclusions

Data validation is an essential step in the task of ascertaining the veracity and homogeneity of data for data comparison purposes. In the case of structured data, validation is often performed using a set of data validation rules. Using the ontology layer (OWL) of the semantic web stack to perform this task brings a number of major advantages. First, it provides the means of formalizing the rules in DL, thereby removing the ambiguities and redundancies inherent in natural language. Second, it helps encapsulate the data model and integrate the conceptual and logical schemas that have traditionally been separated. The encapsulation of the data model and the definition of the rules in DL is a mutually supportive step that allows the integration of a bottom-up approach (rule definitions) with a top-down approach (classification and semantic context), from which the data model is the result. Third, the data model expressed in OWL automatically incorporates the metadata. All named entities (classes, properties, and individuals) have their own URIs that can be accessed and linked individually. Accessing an OWL link provides the whole semantic context of the entity, which may in turn be annotated with links to other semantic resources to enrich further the contextual information. Other advantages include the possibility of reasoning on the ontology, allowing inferences to be made automatically and providing other semantic relations not explicitly stated a priori in the ontology. Ontologies can also play an important role in more general data harmonization steps. In particular, they can help in defining and formalizing user needs, discovering semantic contexts in unstructured data, and generating semantic mappings.

Whereas ontologies do suffer some drawbacks (such as issues relating to the open world assumption), the fact that they can to a large extent unify the underlying data model with the data rules, as well as capture metadata that can be linked semantically to other metadata dictionaries and classification schemes, makes them an interesting solution. These considerations are of particular importance for applications that need to harmonize data across multiple data providers and heterogeneous data-collection procedures, as well as for improved contextualization of the data that is useful for downstream processes.

Acknowledgments

This work was partly conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.

Conflict of interest

The authors declare no conflict of interest.

Nomenclature

CDE: common data element
COMPASS: Comprehensive Center for the Advancement of Scientific Strategies
DL: description logic
AL, ALC, SHOIN, SROIQ: DL expressivities, where:
  AL: attributive language
  ALC: AL with complements
  S: ALC with transitive roles
  H: role hierarchy
  O: nominals
  I: inverse properties
  N: cardinality restrictions
  R: extended set of role axioms
  Q: qualified cardinality restrictions
ENCR: European Network of Cancer Registries
ETL: extract, transform, load
ICD-O-3: International Classification of Diseases for Oncology, third edition
KB: knowledge base
ABox: assertional part of a KB
TBox: terminological part of a KB
RBox: extended set of role axioms in a KB
LOD: linked open data
MKNF: minimal knowledge and negation as failure
RDF: resource description framework
SKOS: simple knowledge organization system
SWRL: semantic web rule language
SHACL: shapes constraint language
ShEx: shape expressions
SPARQL: SPARQL protocol and RDF query language
TNM: TNM classification of malignant tumors
  T: size of tumor
  N: involvement of regional lymph nodes
  M: presence of distant metastasis
URI: uniform resource identifier
OWL: web ontology language
OWL-API: web ontology language application program interface
W3C: World Wide Web Consortium

References

  1. European Network of Cancer Registries (ENCR). Available from: https://www.encr.eu/ [Accessed: December 26, 2022]
  2. National Cancer Institute. Surveillance, Epidemiology, and End Results Program (SEER). Available from: https://seer.cancer.gov/ [Accessed: December 26, 2022]
  3. Martos C, Crocetti E, Visser O, Rous B, Giusti F. A proposal on cancer data quality checks: one common procedure for European cancer registries. JRC Technical Report, p. 1-99. DOI: 10.2760/429053
  4. Tijhuis M, Finger JD, Slobbe L, Sund R, Tolonen H. In: Verschuuren M, van Oers H, editors. Population Health Monitoring. Climbing the Information Pyramid. Cham: Springer; 2019. p. 59-81. DOI: 10.1007/978-3-319-76562-4_4
  5. Rolland B, Reid S, Stelling D, Warnick G, Thornquist M, Feng Z, et al. Toward rigorous data harmonization in cancer epidemiology research: One approach. American Journal of Epidemiology. 2015;182(12):1033-1038. DOI: 10.1093/aje/kwv133
  6. World Health Organization. International Classification of Diseases for Oncology (ICD-O) – 3rd Edition, 1st Revision. 2013. Available from: https://apps.who.int/iris/handle/10665/96612 [Accessed: December 26, 2022]
  7. Calvanese D, Guarino N. Ontologies and description logics. Intelligenza Artificiale. 2006;3:21-27
  8. Baader F, Horrocks I, Lutz C, Sattler U. An Introduction to Description Logic. Cambridge: Cambridge University Press; 2017. DOI: 10.1017/9781139025355
  9. Schrader B. Enterprise Knowledge. White paper: What’s the Difference Between an Ontology and a Knowledge Graph? 2020. Available from: https://enterprise-knowledge.com/whats-the-difference-between-an-ontology-and-a-knowledge-graph/ [Accessed: December 26, 2022]
  10. W3C. Web Ontology Language (OWL). 2012. Available from: https://www.w3.org/OWL/ [Accessed: December 26, 2022]
  11. Protégé. A Free, Open-Source Ontology Editor and Framework for Building Intelligent Systems. Available from: https://protege.stanford.edu/ [Accessed: December 26, 2022]
  12. Calvanese D, De Giacomo G, Lembo D, Lenzerini M, Rosati R. Data complexity of query answering in description logics. Artificial Intelligence. 2013;195:335-360. DOI: 10.1016/j.artint.2012.10.003
  13. Sattler U, Stevens R. Being complex on the left-hand side: General concept inclusions. Ontogenesis. 2012. Available from: http://ontogenesis.knowledgeblog.org/1288 [Accessed: December 26, 2022]
  14. Nicholson NC, Giusti F, Bettio M, Negrao Carvalho R, Dimitrova N, Dyba T, et al. An ontology to model the international rules for multiple primary malignant tumours in cancer registration. Applied Sciences. 2021;11:7233. DOI: 10.3390/app11167233
  15. Krötzsch M, Rudolph S, Schmitt PH. On the semantic relationship between Datalog and description logics. In: Hitzler P, Lukasiewicz T, editors. Web Reasoning and Rule Systems. RR 2010. Lecture Notes in Computer Science. Vol. 6333. Berlin, Heidelberg: Springer; 2010. pp. 88-102. DOI: 10.1007/978-3-642-15918-3_8
  16. Motik B, Rosati R. Closing Semantic Web Ontologies. 2006. Available from: http://www.cs.ox.ac.uk/boris.motik/pubs/mr06closing-report.pdf [Accessed: January 10, 2023]
  17. KAON2. Available from: http://kaon2.semanticweb.org/ [Accessed: January 10, 2023]
  18. TopQuadrant. Ontologies and Data Models – are They the Same? 2011. Available from: https://topquadrantblog.blogspot.com/2011/09/ontologies-and-data-models-are-they.html [Accessed: December 26, 2022]
  19. W3C. Resource Description Framework (RDF). 2014. Available from: https://www.w3.org/RDF/ [Accessed: December 26, 2022]
  20. Mizoguchi R. Ontology engineering environments. In: Staab S, Studer R, editors. Handbook on Ontologies. International Handbooks on Information Systems. Berlin, Heidelberg: Springer; 2004. pp. 275-295. DOI: 10.1007/978-3-540-24750-0_14
  21. Cuenca Grau B, Horrocks I, Kazakov Y. Modular reuse of ontologies: Theory and practice. Journal of Artificial Intelligence Research. 2008;31:273-318. DOI: 10.1613/jair.2375
  22. Stevens R, Sattler U. Post-coordination: Making things up as you go along. Ontogenesis. 2013. Available from: http://ontogenesis.knowledgeblog.org/1305 [Accessed: December 26, 2022]
  23. Sinaci AA, Laleci Erturkmen GB. A federated semantic metadata registry framework for enabling interoperability across clinical research and care domains. Journal of Biomedical Informatics. 2013;46:784-794. DOI: 10.1016/j.jbi.2013.05.009
  24. Hitzler P, Krötzsch M, Rudolph S. Knowledge Representation for the Semantic Web Part II: Rules for OWL. KI 2009, Paderborn; 2009. p. 8-14. Available from: https://www.semantic-web-book.org/w/images/5/5e/KI09-OWL-Rules-2.pdf [Accessed: February 21, 2023]
  25. Paquette J. The Many Marvelous Meanings of “Data Harmonization”. Towards Data Science. Canada: Towards Data Science Inc.; 2021. Available from: https://towardsdatascience.com/about-towards-data-science-d691af11cc2f [Accessed: November 16, 2022]
  26. Schmidt BM, Colvin CJ, Hohlfeld A, Leon N. Definitions, components and processes of data harmonisation in healthcare: A scoping review. BMC Medical Informatics and Decision Making. 2020;20(1):222. DOI: 10.1186/s12911-020-01218-7
  27. Nicholson N, Giusti F, Neamtiu L, Randi G, Dyba T, Bettio M, et al. Dotting the “i” of interoperability in FAIR cancer-registry data sets. In: Kais G, Hamdi Y, editors. Cancer Bioinformatics [Internet]. London: IntechOpen; 2021. pp. 131-156. Available from: https://www.intechopen.com/chapters/79580. DOI: 10.5772/intechopen.101330
  28. Lorvão Antunes A, Cardoso E, Barateiro J. Incorporation of ontologies in data warehouse/business intelligence systems - a systematic literature review. International Journal of Information Management Data Insights. 2022;2(2):100131. DOI: 10.1016/j.jjimei.2022.100131
  29. Brüggemann S, Aden T. Ontology based data validation and cleaning: Restructuring operations for ontology maintenance. In: Koschke R, Herzog O, Rödiger K-H, Ronthaler M, editors. Informatik 2007 – Informatik trifft Logistik – Band 1. Bonn: Gesellschaft für Informatik e.V.; 2007. p. 207-211. Available from: https://dl.gi.de/handle/20.500.12116/22581 [Accessed: January 10, 2023]
