Parking observations (number of cars) subset.
Due to the increasing number of IoT devices, the amount of data gathered nowadays is large and continuously growing. The availability of new sensors in IoT devices and of open data platforms provides new possibilities for innovative applications and use cases. However, the dependence on data for the provision of services makes it necessary to assure the quality of the data in order to ensure the viability of those services. To support the evaluation of this valuable information, this chapter presents a series of metrics that have been defined as indicators of data quality in a quantifiable, fast, reliable, and human-understandable way. The metrics are based on sound statistical indicators; statistical analysis, machine learning algorithms, and contextual information are some of the methods used to create them. The developed framework is also suitable for deciding between different datasets that hold similar information, since until now no way of rapidly discovering which one is best in terms of quality had been developed. These metrics have been applied to two real scenarios, smart parking and environmental sensing for smart buildings, and in both cases the methods have proved representative of the quality of the data.
- data quality
- data integrity
The emergence of Internet of Things (IoT) deployments has allowed millions of connected, communicating, and exchanging objects to be embedded seamlessly around the world, generating large amounts of data through sensor monitoring on a timely basis.
The data flow between the physical and the digital world through artificial intelligence can expand the computer’s awareness of the surrounding environment, thereby obtaining the ability to act on behalf of humans through ubiquitous services.
In this IoT-based environment, the basis for making wise decisions and providing services is the data collected by sensors and actuators. If the data quality is poor, whether because of a sensor failure or because false information is deliberately provided with malicious intent, these automated decisions may be incorrect. Data quality (DQ) is therefore needed to attract users to participate in and accept IoT paradigms and services.
Data Quality refers to how well data meet the requirements of data consumers. In a similar manner, Quality of Information (QoI) relates to the ability to judge whether information is adequate for a particular purpose [2, 3].
From such a well-known and accepted definition, we understand that it refers to a perception or an evaluation of the suitability of the data to fulfill its purpose in a given context, subject to the requirements of the consumer. In the literature, the quality of the data is determined by factors such as availability, usability, reliability, accuracy, completeness, relevance, and novelty.
According to the literature, ensuring data quality is crucial when deploying and leveraging devices, given that:
Decision-making is only possible if the data available are correct and appropriate.
Serious problems are practically unapproachable without an adequate data source.
The way to tackle this problem is through the use of so-called data quality metrics which are calculated in order to validate the Quality of the Information (QoI).
The aim of this chapter is to define some metrics for DQ and calculate them in IoT scenarios in order to test their viability.
2. Data integrity
Data integrity refers to the accuracy and reliability of data. The data must be complete, without variations or compromises from the original, which is considered reliable and accurate. Therefore, this term is closely related to the quality of the data and this in turn to the quality metrics.
There are several types of data integrity:
Physical integrity is the protection of the integrity and accuracy of data as they are stored and extracted. That is, it is related to the physical layer of the systems. In the context of IoT, a physical integrity problem comes from the physical degradation of the sensors, whether due to a breakdown or sabotage.
Logical integrity preserves the data without any change as it is used in different ways in a relational database. Logical integrity protects data from human errors and also from hackers, but in a very different way than physical integrity.
Entity integrity is based on the creation of primary keys, or unique values, that identify data to ensure that it is not listed more than once and that no field in a table is null. It is a feature of relational systems that store data in tables that can be linked and used in very different ways. In an IoT scenario, an entity integrity problem can arise from a sensor failure that produces redundant measurements, or from a human error in which two different sensors are assigned the same identifier, which produces redundancies in databases.
Referential integrity is a series of processes that ensure that data is stored and used consistently. Rules built into the database structure about how foreign keys are used ensure that only appropriate data changes, additions, or deletions occur.
Domain integrity is the set of processes that guarantee the veracity of each piece of data in a domain. In this context, a domain is a set of acceptable values that a column can contain. Restrictions and other measures can be incorporated to limit the format, type, and amount of data entered. Due to an error in the IoT devices, one of them could be entering data that does not correspond to the correct type in a column of a database, such as saving a number when a date should be saved, or saving a date in an inadequate format.
User-defined integrity comprises the rules and constraints created by the user to suit their particular needs. Sometimes entity, referential, and domain integrity are not enough to safeguard data. Often, specific corporate rules need to be considered and incorporated into measures regarding data integrity. In an IoT scenario, a sensor may be giving acceptable values, in the sense that they respect the rest of the integrity criteria, and yet fail to meet a criterion that is necessary for the correct functioning of the system. An example is a sensor that collects percentage values and reports a value greater than 100.
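The entity, domain, and user-defined checks described above can be illustrated with a small validation routine. This is only a minimal sketch: the record layout (`sensor_id`, `timestamp`, `value`) and the rule that values are percentages are assumptions made for the example, not part of the chapter.

```python
def check_integrity(records):
    """Return (index, problem) pairs for a batch of sensor records.

    Assumed record layout (hypothetical): a dict with the keys
    'sensor_id', 'timestamp' and 'value', where 'value' is a percentage.
    """
    problems = []
    seen_keys = set()
    for index, record in enumerate(records):
        # Entity integrity: (sensor_id, timestamp) acts as the primary key.
        key = (record.get("sensor_id"), record.get("timestamp"))
        if key in seen_keys:
            problems.append((index, "entity: duplicate (sensor_id, timestamp)"))
        seen_keys.add(key)

        value = record.get("value")
        # Domain integrity: the column must contain a number.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append((index, "domain: value is not numeric"))
        # User-defined integrity: a percentage must lie between 0 and 100.
        elif not 0 <= value <= 100:
            problems.append((index, "user-defined: percentage outside 0-100"))
    return problems


records = [
    {"sensor_id": "s1", "timestamp": 0, "value": 42.0},
    {"sensor_id": "s1", "timestamp": 0, "value": 43.0},          # same key twice
    {"sensor_id": "s2", "timestamp": 0, "value": "2021-01-01"},  # wrong type
    {"sensor_id": "s3", "timestamp": 0, "value": 130},           # out of range
]
for index, problem in check_integrity(records):
    print(index, problem)
```

In a relational database these rules would normally be enforced by primary keys, column types, and check constraints; the routine only mirrors them on data that has already been received.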
In this section, Data Integrity has been defined. However, it is necessary to note the difference between this term and the term Data Quality. Data quality is related to the reliability of the information, which is necessary for planning and decision making for a specific operation, whereas the integrity of the data guarantees the reliability of the data in physical and logical terms.
3. Data quality metrics
In this section, we describe the metrics that have been defined to calculate and annotate the QoI for IoT data. These were described in previous work.
3.1 QoI basic metrics
The first set of metrics is based on a descriptive analysis. This approach was also used in the IoTCrawler framework, which proposes to integrate quality measures and analysis modules that rate data sources in order to identify the ones that best fit the information needed. The first step before implementing any quality analysis module is to identify quality measures that can be used to rate data sources and the delivered or produced data for their Quality of Information. To measure the QoI, we propose to use the so-called QoI Vector, which is defined in Eq. (1) and gathers the information belonging to all the metrics proposed in this framework.
The elements of the vector are defined as follows:
Completeness: it reflects the percentage of missing or unusable data. It is computed from the number of missing values and the number of expected values of an incoming dataset.
Timeliness: refers to the expected time of accessibility and availability of information. In other words, it represents the time difference between the real-world event and the moment its data is captured. It is crucial in critical IoT applications such as traffic safety. It is defined from the difference between the expected time and the time taken by the sensor, relative to the proper time of the system, which is chosen arbitrarily.
Plausibility: shows whether received data is coherent according to the probabilistic knowledge of the variables that are being measured. Sensor annotations or meta-data are used to determine an expected value range of an incoming measurement.
The range of Plausibility value is defined between 0 and 1.
Artificiality: this metric determines the inverse degree of the sensor fusion techniques used, and indicates whether a value is a direct measurement of a single sensor, an aggregated value from multiple sources, or an artificially interpolated value.
Concordance: describes the agreement between the information of the data source and the information of other independent data sources that report correlating effects. The concordance analysis takes any given sensor and computes its individual concordances with a finite set of other sensors. These individual concordances are combined using a weight function and a propagation- and infrastructure-based distance function between the locations of the sensors involved.
All the metrics exposed in this section take values between 0 and 1, with the value 1 being the ideal case in which the quality of the data is maximum and 0 the opposite case.
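As an illustration of the first three elements of the vector, the sketch below computes completeness, timeliness, and plausibility for plain Python values. The exact formulas are assumptions chosen so that each metric lies between 0 and 1 with 1 as the ideal case: completeness is taken as one minus the ratio of missing to expected values, timeliness as one minus the ratio of the delay to the system time (clipped at 0), and plausibility as a binary range check. All function and parameter names are hypothetical.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing observations."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness(values, expected_count=None):
    """Assumed form: 1 - (missing values / expected values)."""
    if expected_count is None:
        expected_count = len(values)
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    missing = sum(1 for v in values if _is_missing(v))
    # Values that never arrived at all also count as missing.
    missing += max(expected_count - len(values), 0)
    return 1.0 - missing / expected_count


def timeliness(delay_seconds, system_time_seconds):
    """Assumed form: 1 - (delay / system time), clipped to [0, 1]."""
    if system_time_seconds <= 0:
        raise ValueError("system_time_seconds must be positive")
    return max(0.0, 1.0 - abs(delay_seconds) / system_time_seconds)


def plausibility(value, lower, upper):
    """1.0 if the value lies in the expected range [lower, upper], else 0.0."""
    if _is_missing(value):
        return 0.0
    return 1.0 if lower <= value <= upper else 0.0


readings = [12, None, 15, 14, None, 13, 16, 15]
print("completeness:", completeness(readings))
print("timeliness:", timeliness(delay_seconds=150, system_time_seconds=600))
print("plausibility:", plausibility(120, lower=0, upper=100))
```

The expected range passed to `plausibility` would come from the sensor annotations or meta-data mentioned above.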
These metrics are the simplest ones that can be calculated in this kind of IoT scenario. However, it is possible to go further and compute metrics that give us a deeper knowledge of the IoT system.
3.2 Outlier-based metrics using heuristics
Since the previous metrics provide only basic information, it is possible to go further and obtain a series of additional useful metrics. These new metrics rely on Machine Learning (ML), in this case on the search for outliers.
In machine learning, an outlier is an observation that diverges from an overall pattern. The number of outliers is an indicator of data quality.
In the literature, four basic types of outliers are usually considered for time series: additive outliers, level shifts, temporary changes and innovational outliers; see [10, 11] for a complete description.
A metric similar to completeness can be defined by taking into account the values that are considered outliers instead of the missing ones: the percentage of outliers in the studied sensor (see Eq. (7)). To determine which values are outliers, an Autoregressive Integrated Moving Average (ARIMA) based framework is useful. It can also determine whether the outlier is innovational, additive, a level shift, a temporary change or a seasonal level shift. The metric is computed from the number of outlier values in the features of the sensor and the total number of features.
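The outlier percentage can be sketched once a detector has flagged the anomalous values. The example below does not implement the ARIMA-based framework; as a stand-in it uses a simple modified z-score based on the median absolute deviation (MAD), with the commonly used threshold of 3.5. The detector, the threshold, and the function names are assumptions for illustration only.

```python
from statistics import median


def mad_outlier_flags(series, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    Stand-in detector (not the ARIMA framework): the modified z-score is
    0.6745 * |x - median| / MAD.
    """
    center = median(series)
    mad = median(abs(x - center) for x in series)
    if mad == 0:
        # More than half of the values are identical; flag nothing.
        return [False] * len(series)
    return [0.6745 * abs(x - center) / mad > threshold for x in series]


def outlier_rate(flags):
    """Fraction of values flagged as outliers."""
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)


series = [10, 11, 10, 12, 11, 95, 10, 11, 12, 10]
flags = mad_outlier_flags(series)
print("flagged indices:", [i for i, flagged in enumerate(flags) if flagged])
print("outlier rate:", outlier_rate(flags))
```

Any other detector that returns one boolean flag per value, including an ARIMA-based one, could be plugged into `outlier_rate` in the same way.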
Knowing how much an instance deviates from the expected value corresponding to the normal behavior of the time series is as important as determining whether it is an outlier. For that purpose, the values of the time series that are considered outliers must be imputed as if they were missing values, in order to know what the expected value would be. The difference between the observed value and the imputation then yields another metric: the difference for each sensor value is divided by the mean, median or mode of the values, and the mean, median or mode of these normalized differences is then calculated.
Here the index i runs over the features that present an anomalous behavior, and each term compares the imputed value, which follows the expected behavior, with the value of the outlier. This metric takes values between 0 and 1, with 1 being the ideal case.
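A minimal sketch of this deviation metric follows. The original equation is not reproduced here, so the form used is an assumption: each absolute difference between imputed and observed value is divided by a reference statistic of the series (the median by default), the normalized differences are averaged, and the result is subtracted from 1 and clipped at 0 so that 1 is the ideal case. The imputed values are taken as given; how they are obtained is outside the sketch.

```python
from statistics import mean, median


def deviation_quality(observed, imputed, outlier_indices,
                      reference=median, aggregate=mean):
    """Quality score in [0, 1] based on how far outliers lie from imputations.

    observed        -- list of measured values
    imputed         -- dict mapping outlier index -> imputed (expected) value
    outlier_indices -- indices of the values flagged as outliers
    reference       -- statistic used to normalise the differences (assumed)
    aggregate       -- statistic used to combine them (assumed)
    """
    if not outlier_indices:
        return 1.0  # no outliers: ideal case
    scale = abs(reference(observed))
    if scale == 0:
        raise ValueError("reference statistic is zero; cannot normalise")
    relative = [abs(imputed[i] - observed[i]) / scale for i in outlier_indices]
    return max(0.0, 1.0 - aggregate(relative))


observed = [20, 21, 19, 30, 20]
imputed = {3: 19.5}  # hypothetical imputation for the flagged value
print(deviation_quality(observed, imputed, outlier_indices=[3]))
```

Passing `reference=mean` or `aggregate=median` gives the other variants mentioned in the text.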
Unsupervised methods are also adequate for outlier detection, so we propose a probabilistic metric. This metric corresponds to the probability of belonging to a certain cluster that has been computed using Gaussian Mixture Models (GMM), which represent the data points as faithfully as possible as a sum of several Gaussian distributions. It informs quantitatively about the anomalous values. The number of clusters or Gaussian distributions is a hyperparameter and it can be chosen in different ways. In the experiments we used the silhouette coefficient.
The metric depends on the number of clusters or distributions used, on each individual distribution, and on the vector measured by the sensor. Because this metric is probabilistic, it takes values between 0 and 1, in such a way that the closer the value is to 1, the higher the quality of the instance.
Another way to determine whether the data series exhibits anomalous behavior is by using so-called AutoEncoders (AE). An AutoEncoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The objective of these autoencoders is to learn a representation of the data to be studied, with the aim of eliminating noise; however, it is also possible to use this tool to detect anomalous values. AEs are a specific type of feedforward neural network in which the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation.
The metric based on AE informs us about how the correlations between the different variables of the system behave. The metric is based on the difference between the input and the output value of the AE, in such a way that the greater the reconstruction error, the less concordance there will be between the variables.
The metric compares the features of the data taken by the sensors with the vector of variables reconstructed by the AE, taking into account the total number of features. This difference is sometimes known as the reconstruction error. Since this metric is based on a difference between two values, it can take any non-negative real value, with 0 corresponding to the highest quality.
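The reconstruction-based metric can be sketched independently of the network itself. In the example below the reconstructed vector is simply given as input (it would come from any trained autoencoder), and the error is assumed to be the mean squared difference over the features; the original equation may use a different norm.

```python
def reconstruction_error(features, reconstructed):
    """Mean squared difference between an input vector and its reconstruction.

    Assumption: the per-feature differences are squared and averaged.
    Returns 0.0 for a perfect reconstruction; larger means lower quality.
    """
    if len(features) != len(reconstructed):
        raise ValueError("vectors must have the same number of features")
    if not features:
        raise ValueError("vectors must not be empty")
    squared = [(x - x_hat) ** 2 for x, x_hat in zip(features, reconstructed)]
    return sum(squared) / len(features)


x = [0.2, 0.5, 0.9]        # scaled sensor features (illustrative values)
x_hat = [0.25, 0.45, 0.6]  # hypothetical autoencoder output
print(reconstruction_error(x, x_hat))
```

Because the autoencoder is trained mostly on normal data, instances whose variables do not follow the learned correlations tend to be reconstructed poorly and therefore receive a larger error.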
3.3 Geospatial-based metrics
Considering the sensors’ location is also highly relevant for knowledge extraction. In this sense, we also provide two metrics that use interpolation methods for assessing how well a sensor is coordinated and correlated with its peers according to their distance. The models used are Inverse Distance Weighting (IDW) and Bayesian Maximum Entropy (BME). IDW is a deterministic estimation method in which, assuming that nearby sensors are more similar, a weighted average of the available values at known points is used to calculate unknown data points. BME is a knowledge-based probabilistic modeling framework for spatial and temporal information. It allows various knowledge bases to be used as prior information, and the determination rules for hard (high precision) and soft (low precision) data are logically incorporated into the modeling. As before, we calculate the difference between the interpolated and the real measurement, and the average value becomes the metric, which is computed separately for IDW and for BME.
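The IDW-based metric can be sketched as a leave-one-out comparison: each sensor is estimated from its peers and the absolute difference with its real measurement is recorded. The use of planar Euclidean distances, the power parameter of 2, and the function names are assumptions for this sketch; BME is not covered.

```python
import math


def idw_estimate(target_xy, neighbours, power=2.0):
    """Inverse Distance Weighting estimate at target_xy.

    neighbours -- list of ((x, y), value) pairs for the known sensors
    power      -- exponent applied to the distance (2 is a common choice)
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in neighbours:
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0:
            return float(value)  # a sensor at the same location
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0:
        raise ValueError("at least one neighbour is required")
    return weighted_sum / weight_total


def idw_errors(sensors, power=2.0):
    """Leave-one-out absolute errors: interpolated minus measured value."""
    errors = []
    for i, (xy, value) in enumerate(sensors):
        others = sensors[:i] + sensors[i + 1:]
        errors.append(abs(idw_estimate(xy, others, power) - value))
    return errors


# Four hypothetical sensors on a unit square; the last one disagrees.
sensors = [((0, 0), 10.0), ((1, 0), 12.0), ((0, 1), 12.0), ((1, 1), 30.0)]
errors = idw_errors(sensors)
print("per-sensor errors:", errors)
print("average error:", sum(errors) / len(errors))
```

A sensor that disagrees with its neighbours obtains a large error, so lower values indicate better agreement.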
4. Examples of implementation
In this section, three different IoT scenarios are introduced, in which the previous metrics are computed and the possible drawbacks are highlighted.
4.1 Parking data
This data was collected from 5 private parking sensors located in the city of Murcia, Spain (their locations are available at http://mapamurcia.inf.um.es/).
First, the variables that are useful for our goal were chosen, namely the timestamp and the parking occupation measurements, and the data were aggregated in 10-minute intervals.
This aggregation can place several measurements in the same time interval, so the values within each interval have been averaged. Storing information about this aggregation process will be useful for the Artificiality metric.
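The aggregation step can be sketched as follows. Timestamps are represented as integer epoch seconds to keep the example self-contained, and the artificiality score is assumed to be the inverse of the number of measurements averaged in each window, which gives 1 for a single direct measurement. Both choices, as well as the function name, are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute aggregation interval


def aggregate_readings(readings, window=WINDOW_SECONDS):
    """Average readings per time window and keep track of how many were used.

    readings -- iterable of (epoch_seconds, value) pairs
    Returns a dict: window start -> (mean value, count, artificiality),
    where artificiality is assumed to be 1 / count.
    """
    buckets = defaultdict(list)
    for timestamp, value in readings:
        window_start = (timestamp // window) * window
        buckets[window_start].append(value)

    aggregated = {}
    for window_start in sorted(buckets):
        values = buckets[window_start]
        count = len(values)
        aggregated[window_start] = (sum(values) / count, count, 1.0 / count)
    return aggregated


readings = [(0, 100), (120, 110), (650, 90)]
for start, (avg, count, artificiality) in aggregate_readings(readings).items():
    print(start, avg, count, artificiality)
```

Windows with no readings do not appear in the result; in the chapter's workflow those would be kept as NA instances so that completeness can be computed.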
NA (not available) instances have been kept due to their importance in obtaining some quality metrics (Completeness). Given that the data is not measured periodically, a lot of missing values are generated at this point. For illustrative purposes, a new variable called real_time was computed, which adds a random delay to the timestamps, simulating that the data needs some time to be stored. These are some highlights:
Completeness: it consists of computing, instance by instance, the percentage of non-absent values.
Timeliness: the random time lag included in the data is used. Dividing it by the arbitrary aggregation time (600 seconds, in this case) shows the time that the data takes to become available.
Plausibility: if the data of each parking lot belongs to the expected interval for that lot, the measurement is said to be plausible and receives a value of 1. The values that define the interval for the five parking lots are 330, 312, 305, 162 and 220, respectively.
Artificiality: due to the aggregation over time, the number of instances used for computing the mean, and therefore the aggregated value, was considered. Thus, a value obtained by averaging two data points taken in the same time frame receives a lower artificiality score than one based on a single measurement.
Concordance: the geostatistical metrics have been used for covering this concept.
Outliers: given the amount of missing data, the ARIMA framework could not be used for detecting outliers in this dataset.
A subset of the quality metrics and data values is shown in Table 1, where Park101, …, Park105 are the parking identifiers. As can be seen, there are many instances that cannot be correct, and that information is condensed in the quality metrics. Figure 1 shows the histograms of all basic metrics that could be computed for the parking dataset, and Figure 2 shows the histogram of the outlier-based metrics.
The histograms of the geospatial-based metrics for the parking data are shown in Figure 3. As stated above, these metrics replace the concordance metric, because they provide information about the correlation between the different sensors. In this case, the lower the value of the metric, the better.
4.2 Luminosity data
In this section, the monitored luminosity from 4 sensors located in the Pleiades building of the University of Murcia was studied.
First, the data is aggregated using the timestamp as in the previous section, choosing a 10-minute aggregation time. Table 2 shows the aggregated values and also some of the computed metrics.
Figure 4 shows the histograms of all metrics that could be computed for the luminosity dataset together with basic statistics. The timeliness metric could not be calculated, since there are no signs of any lag in the storage of the data. Also, the artificiality metric always takes the value of 1 because the timestamps of the data are far apart. The rest of the metrics are included in Figure 5.
Finally, Figure 6 shows the geospatial metrics for the luminosity data. As in the case of parking, these metrics replace the concordance metric.
4.3 Pollution data
Given that, owing to poor dataset quality, the only way to calculate concordance on the previous datasets has been through spatial interpolation, a high-quality dataset has been used to compare the values that this metric takes on the original data and after some imperfections are added to it.
As can be seen in Table 3, this dataset has five variables that report on atmospheric pollution every five minutes. The data values are scaled.
Table 3. Variables of the pollution dataset (scaled values): ozone, particulate matter, carbon monoxide, sulfur dioxide and nitrogen dioxide.
Given these data, one way to calculate the concordance metric is to compute the correlation between each value and the previous one: when the data is collected properly this value will be very close to 1, whereas if the data suffers from any problem it will move away from 1. This is shown in Figure 7, with the original dataset on the left and, on the right, the same dataset after anomalous values have been added at random. As can be seen, the concordance values change significantly.
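The correlation between each value and the previous one is the lag-1 autocorrelation of the series. The sketch below computes it as a Pearson correlation between the series and its shifted copy, and compares a smooth synthetic series with a copy into which two anomalous values have been injected. The synthetic data and the function name are illustrative assumptions, not the chapter's pollution dataset.

```python
import math
from statistics import mean


def lag1_correlation(series):
    """Pearson correlation between each value and the previous one."""
    if len(series) < 3:
        raise ValueError("at least three values are required")
    previous, current = series[:-1], series[1:]
    mean_prev, mean_curr = mean(previous), mean(current)
    covariance = sum((p - mean_prev) * (c - mean_curr)
                     for p, c in zip(previous, current))
    var_prev = sum((p - mean_prev) ** 2 for p in previous)
    var_curr = sum((c - mean_curr) ** 2 for c in current)
    if var_prev == 0 or var_curr == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_prev * var_curr)


clean = [0.1 * i for i in range(20)]  # smooth, slowly varying series
corrupted = list(clean)
corrupted[5] = 9.0                     # injected anomalous values
corrupted[12] = -7.0

print("clean:", lag1_correlation(clean))
print("corrupted:", lag1_correlation(corrupted))
```

A sliding window over the series would give one concordance value per window, which is one possible way to build a histogram of the metric.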
It should be noted that if the rest of the metrics are calculated in the case of the unaltered dataset, they will take perfect values, that is, they will always indicate a high quality of the dataset.
For this dataset, the rest of the metrics have also been calculated. However, the results are not included, since the dataset is of high quality and the histograms of the metrics therefore show the ideal behavior, which is of little interest.
4.4 Data without context
For demonstration purposes, we propose to compute the quality metrics on a dataset whose context, origin and meaning are unknown. It is a dataset for which we have no knowledge about what the columns represent, how the data was collected or the timestamps of the observations. In such a scenario, the only basic metric that can be computed is completeness. However, outlier-based metrics are very useful, since they treat the variables as plain time series without taking into account their physical meaning. Table 4 shows a subset of the dataset, which has 5 unknown variables and 1200 instances.
Similarly, the probabilistic and reconstruction metrics can be calculated here, since they do not assume any kind of knowledge of the data. Figure 9 shows the histograms of both metrics.
The proliferation of datasets brought about by the Internet of Things paradigm is populating repositories and open data platforms with data that could be of great use to the scientific community and to technologists, catalyzing the growth of scientific knowledge and fostering the creation of new technological solutions. Although all data has value, a point has been reached at which it is necessary to rapidly recognize the quality of a dataset or a data stream, ideally in an online manner.
In this chapter, several concepts have been combined in order to measure the quality of data from IoT-based real-time streams, and they have been tested on real-world sensor systems.
Three sets of quality assurance methods (descriptive, analytic and geometrical) have been developed. They can be used as levels of a given evaluation, or independently, depending on the nature of the datasets to be evaluated.
It has been shown that the metrics can be a standard for the calculation of data quality and that the majority can be applied independently of the problem context. At the same time, basic concepts that must be present in any system in which the quality of the data is to be guaranteed have been reviewed. Furthermore, it has been shown how it is possible to obtain quality metrics when knowledge about the data is limited.
The applications of this technology are linked to the proliferation of open data portals. There exist many initiatives and organizations that are working towards publishing data as open. The main funding body for engineering and physical sciences research in the UK, the Engineering and Physical Sciences Research Council (EPSRC), is supporting the management and provision of access to research data. They claim that publicly funded research data should generally be made as widely and freely available as possible in a timely and responsible manner. Other initiatives are the EU Open Data Portal at European level or the national-level ones such as Open Data Aarhus. In that sense, the selection of data sources becomes more complicated given the great amount of data that researchers and practitioners have access to. Our system provides an easy, understandable and quick way to make an informed decision when choosing between several data sources based on data quality.
As future work, we are considering several technologies in order to make our metrics available to researchers and businesses. We consider that they have the potential to become a standard for measuring data quality.
This work has been sponsored by MINECO through the PERSEIDES project (ref. TIN2017-86885-R), by ERDF funds of project UMU-CAMPUS LIVING LAB EQC2019-006176-P, and by the European Commission through the H2020 IoTCrawler (contract 779852) and DEMETER (grant agreement 857202) EU Projects. It was also co-financed by the European Social Fund (ESF) and the Youth European Initiative (YEI) under the Spanish Seneca Foundation (CARM).
- Their locations are stored in the following web address http://mapamurcia.inf.um.es/