Open access peer-reviewed chapter - ONLINE FIRST

High-Fidelity Synthetic Data Applications for Data Augmentation

Written By

Zhenchen Wang, Barbara Draghi, Ylenia Rotalinti, Darren Lunn and Puja Myles

Submitted: 14 July 2023 Reviewed: 02 November 2023 Published: 12 January 2024

DOI: 10.5772/intechopen.113884

From the Edited Volume

Deep Learning - Recent Findings and Research [Working Title]

Ph.D. Manuel Jesus Domínguez-Morales, Dr. Javier Civit-Masot, Mr. Luis Muñoz-Saavedra and Dr. Robertas Damaševičius

Abstract

The use of high-fidelity synthetic data for data augmentation is an area of growing interest in data science. In this chapter, the concept of synthetic data is introduced, and different types of synthetic data are discussed in terms of their utility or fidelity. Approaches to synthetic data generation are presented and compared with computer modelling and simulation approaches, highlighting the unique benefits of high-fidelity synthetic data. One of the main applications of high-fidelity synthetic data is supporting the training and validation of machine learning algorithms, where it can provide a virtually unlimited amount of diverse and high-quality data to improve the accuracy and robustness of models. Furthermore, high-fidelity synthetic data can address missing data and biases due to under-sampling using techniques such as BayesBoost, and can boost sample sizes in scenarios where the real data is based on a small sample. Another important application is generating virtual patient cohorts, such as digital twins, to estimate counterfactuals in in silico trials, allowing for better prediction of treatment outcomes and supporting personalised medicine. The chapter concludes by identifying areas for further research in the field, including developing more efficient and accurate synthetic data generation methods and exploring the ethical implications of using synthetic data.

Keywords

  • synthetic data
  • external validation
  • sample size boosting
  • in silico trials
  • clinical trials
  • virtual populations
  • BayesBoost
  • bias correction

1. Introduction

In the rapidly evolving field of data science, one of the most critical challenges faced by researchers is the availability of training data. Data forms the bedrock upon which accurate and robust machine learning models are built. However, acquiring a sufficient amount of high-quality data can often be a challenging and resource-intensive task. This is where the concept of data augmentation comes into play, offering a transformative solution to address this persistent hurdle.

Data augmentation, in its simplest form, is a technique used to enhance the existing dataset by generating variations of the available samples using external data sources or synthetic data. By doing so, it aims to increase the diversity, volume, and quality of the training data. This augmentation process is important in improving the model’s ability to generalise and make accurate predictions on unseen instances. This expanded dataset helps the model learn to handle different scenarios and improve its performance [1].

In the realm of data science, data augmentation allows researchers to overcome the limitations imposed by limited training data. By introducing variability through augmentation, it helps mitigate overfitting, a common challenge where the model becomes too specialized to the training data and performs poorly on new data. Additionally, data augmentation assists in addressing class imbalance issues, especially when certain classes are underrepresented in the dataset. It achieves this by generating additional samples for minority classes, thereby ensuring a more balanced representation of different classes in the augmented dataset [2, 3, 4].

Recent advancements in machine learning have paved the way for innovative data augmentation methods based on high-fidelity synthetic data generation, offering richer and more informative enhancements. For example, tabular data augmentation techniques, as described in [5], showcase the potential for improving disease prediction. Methods such as generative adversarial networks (GANs) [6], variational autoencoders (VAEs) [7], rule-based models, and physics-based simulations are used to craft highly realistic synthetic data. In conjunction with these, graphical models such as Bayesian networks and Markov random fields capture the dependencies and relationships among variables within a dataset, enabling the generation of synthetic data that closely mirrors the characteristics of the original data. By combining these techniques, ‘high-fidelity’ synthetic data can be created that exhibits complex patterns, integrates domain knowledge, and simulates real-world interactions.

The use of high-fidelity synthetic data for data augmentation presents a myriad of advantages. Firstly, it addresses the limitations of traditional augmentation techniques by providing more diverse and representative samples, especially in cases where the original dataset is small or lacks variability. Secondly, synthetic data enables researchers to explore hypothetical scenarios and to understand the behaviour of their models under different conditions. Additionally, it can be an invaluable asset in situations where the collection of real-world data is prohibitively expensive, time-consuming, or ethically challenging. By simulating data with similar statistical and relational properties, synthetic data augments the training set, expanding the model’s ability to handle a wide range of scenarios.

In this chapter, we will explore the definitions, generation approaches, and considerations surrounding high-fidelity synthetic data. Furthermore, we will examine the fidelity and utility of various types of synthetic data, drawing comparisons between different generation approaches and techniques. Additionally, we will explore how high-fidelity synthetic data can be applied to support the training and validation of machine learning algorithms. Specifically, we will investigate how it can address challenges such as missing data in situations where real data is either randomly or non-randomly missing. Moreover, we will explore how high-fidelity synthetic data can help address biases resulting from under-sampling, employing innovative techniques like BayesBoost [8]. As we conclude this chapter, we will also identify areas that hold promise for further research in the field, including the development of more efficient and precise methods for generating and evaluating synthetic data, and the ethical implications associated with the use of synthetic data.

2. Synthetic data: definitions, approaches, and considerations

Synthetic data is artificial data that mimics the statistical properties, patterns, and relationships observed in real-world data. It is generated or simulated rather than collected directly from authentic sources. The quality of synthetic data is primarily determined by the approach used for its generation and can be described in terms of its fidelity, i.e., how effectively the synthetic data captures the relevant features and characteristics of the original data. The fidelity in turn determines its utility, i.e., its practical usefulness for various applications.

2.1 High-fidelity synthetic data

High-fidelity synthetic data refers to synthetic data capable of capturing the intricate interrelationships that exist between various data fields, replicating the complex patterns observed in real data.

In the field of financial transactions, a high-fidelity synthetic dataset would accurately emulate the intricate relationships and patterns observed in real financial data. It would possess the same statistical characteristics, transactional structures, and market trends, making it virtually indistinguishable from genuine financial data [9]. In the realm of transportation planning, a high-fidelity synthetic dataset would replicate the complex interactions and dynamics within transportation systems, including traffic flows, travel patterns, and infrastructure utilisation [10]. This synthetic dataset would mirror the statistical properties and intricate relationships found in real-world transportation data, making it practically indistinguishable from genuine data. In the context of patient health care data, a high-fidelity synthetic dataset would be able to capture complex clinical relationships and be clinically indistinguishable from real patient data [11, 12, 13]. Within the realm of social network analysis, a high-fidelity synthetic dataset would accurately capture the intricate connections, community structures, and communication patterns present in real social networks [14]. It would possess the same statistical properties, network topologies, and user behaviours, rendering it virtually indistinguishable from genuine social network data.

Generating a high-fidelity synthetic dataset can be demanding in terms of resources. Generating synthetic datasets with lower or moderate utility, which is less resource-intensive, might be deemed sufficient depending on the application. It is also important to note that there is a trade-off between utility and privacy in synthetic data generation. Higher fidelity synthetic data, which closely resembles real data, may come with increased privacy risks [15].

2.2 Approaches to generate synthetic data

We categorize the generation of synthetic data (see Figure 1) into two distinct methods that tackle the various challenges associated with data generation: the model-based approach and the simulation-based approach.

Figure 1.

Synthetic data generation approach categories.

2.2.1 Model-based approach

The model-based approach to generating synthetic data includes statistical-based, noise-based, and machine learning-based generation methods.

Statistical-based generation (see Figure 2) involves creating synthetic data by leveraging the statistical properties found in real-world data. This approach aims to capture the statistical characteristics, distributions, and dependencies observed in the original data.

Figure 2.

Statistical-based synthetic data generation. E.g., Gaussian mixture model components are designed to capture the statistical properties of the real data; the arrows illustrate how the Gaussian components influence the generation of synthetic data.

Various techniques and models are used in this category to generate synthetic data that closely resembles the statistical properties of the real data [16]. For example, Gaussian mixture models are commonly employed in statistical-based generation to capture the underlying distributions and generate synthetic data points [17]. These models allow researchers to replicate the patterns and variations observed in the original data. Similarly, Markov chains [18] are used to model the dependencies and transitions between different states or variables, enabling the generation of synthetic sequences that mimic the temporal behaviour of the real data.
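As an illustrative sketch only (not drawn from the cited studies), the following Python snippet fits a Gaussian mixture model to a toy dataset of two loosely correlated features and samples synthetic records from it; the feature names, distributions, and number of mixture components are assumptions made for demonstration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "real" data: two loosely correlated features (age and systolic blood pressure).
age = rng.normal(50, 12, 500)
sbp = 0.4 * age + rng.normal(100, 10, 500)
real = np.column_stack([age, sbp])

# Fit a mixture of Gaussians to approximate the joint distribution of the real data.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(real)

# Draw synthetic records from the fitted model.
synthetic, _ = gmm.sample(n_samples=500)

# A quick fidelity check: compare the means of the real and synthetic data.
print("real means:     ", real.mean(axis=0).round(1))
print("synthetic means:", synthetic.mean(axis=0).round(1))
```

In practice, the number of components and the covariance structure would be chosen to match the real data, for example via held-out likelihood or an information criterion.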

However, the statistical-based approach has limitations in capturing complex relationships and dynamics within the data [19]. While it reproduces statistical properties well, it struggles with non-linear relationships that are not explicitly reflected in the statistical properties. In the context of body mass index (BMI) and health outcomes, a linear relationship might suggest that as BMI increases, the risk of certain health conditions, such as diabetes, also increases linearly. However, the relationship between BMI and health outcomes may not be linear. It could be that initially, as BMI increases, the risk of health conditions rises sharply, but beyond a certain BMI threshold, the effect on health outcomes levels off or even declines. This non-linear relationship is challenging to capture solely through statistical properties.

Furthermore, the statistical-based approach may struggle to capture complex interactions between variables. For example, the relationship between BMI and health conditions may vary depending on other factors such as age, gender, or genetic predispositions. These complex interactions, where the effect of BMI on health outcomes is influenced by other variables, are not easily captured through statistical properties alone.

Thus, the fidelity of the synthetic data generated using statistical-based approaches will largely depend on the assumptions and limitations of the data generation model, as well as the specific characteristics of the dataset. Researchers need to consider the specific nature of the data and the level of detail required for their analysis, as the statistical-based approach may not be able to fully replicate the more nuanced aspects of the real data.

The noise-based approach (see Figure 3) involves adding noise to a small sample of data. This approach, while useful when regenerating a portion of the real-world data, is also part of a broader category known as rule-based synthetic data [20]. This involves defining rules or mathematical functions that generate synthetic data adhering to specific patterns or distributions. One example of a noise-based approach is the use of jitter, where small random perturbations are added to the data points. Jitter introduces slight variations to the data values, simulating the inherent randomness or measurement errors observed in real-world data [21]. By applying the jitter approach within the noise-based category, researchers can generate synthetic datasets that capture the desired characteristics and patterns while incorporating random fluctuations. For instance, in a study investigating the relationship between body mass index (BMI) and blood pressure, researchers can use jitter to introduce random variations to the BMI values within a certain range. This ensures that the synthetic data closely mirrors the statistical properties and patterns observed in real data, accounting for the inherent variability in BMI measurements. Additionally, techniques like data imputation, commonly used to address missing values, also fall under this category.

Figure 3.

Noise-based synthetic data generation. A noise source, e.g., a jitter function, introduces random noise or perturbations to the real data.
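A minimal sketch of the jitter approach described above, assuming a toy array of BMI measurements; the noise scale and the number of copies are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy BMI measurements from a small "real" sample.
bmi_real = np.array([21.4, 23.0, 25.7, 28.1, 30.3, 33.8])

def jitter(values, scale=0.5, n_copies=10, rng=rng):
    """Replicate each record n_copies times and add small Gaussian perturbations,
    mimicking measurement noise while keeping the overall distribution intact."""
    replicated = np.repeat(values, n_copies)
    noise = rng.normal(loc=0.0, scale=scale, size=replicated.shape)
    return replicated + noise

bmi_synthetic = jitter(bmi_real)
print(bmi_synthetic[:5])
```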

Often, an important consideration in synthetic data generation is the protection of privacy. This becomes particularly relevant when dealing with sensitive or personally identifiable information. The noise-based approach, driven by privacy-preserving techniques such as differential privacy [22], is used to safeguard individual privacy while still allowing for meaningful analysis and data utilisation. By incorporating such techniques into the synthetic data generation process, researchers can ensure that the generated datasets maintain a high level of privacy protection. This can be achieved by adding carefully calibrated noise to the data, minimising the risk of re-identification while preserving the statistical properties and utility of the synthetic data. By prioritising privacy alongside utility and fidelity, synthetic data can serve as a valuable resource for various applications while respecting individual privacy rights.
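To make the idea of carefully calibrated noise concrete, the sketch below applies the textbook Laplace mechanism for ε-differential privacy [22] to a simple counting query; the query, its sensitivity of 1, and the ε value are illustrative assumptions rather than a recipe for any specific dataset.

```python
import numpy as np

rng = np.random.default_rng(7)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=rng):
    """Release a noisy answer satisfying epsilon-differential privacy:
    noise is drawn from Laplace(0, sensitivity / epsilon)."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: a counting query ("how many patients have BMI > 30?") has sensitivity 1,
# because adding or removing one individual changes the count by at most 1.
true_count = 132
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(round(noisy_count))
```

Smaller values of ε give stronger privacy but noisier answers, which is exactly the privacy-utility trade-off discussed above.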

However, similar to the statistical-based approach, the noise-based approach has limitations in capturing intricate relationships and dependencies present in the real data, such as in healthcare datasets [23]. While it can reproduce specific patterns and distributions, it may struggle to capture the complexity of non-linear relationships or intricate dependencies that may exist within the original dataset. This can affect the fidelity of the synthetic data, as it may not fully capture the nuances and interconnections present in the real data.

Machine learning-based generation methods (see Figure 4) use machine learning techniques for prediction and inference to generate synthetic data. This approach encompasses various models, including GANs, VAEs, Bayesian network-based generative models, and other state-of-the-art generative models. These models can learn the underlying structure of real data and generate synthetic data that closely resembles it.

Figure 4.

Machine learning-based synthetic data generation. E.g., Bayesian networks model relationships between variables via latent variables.

GANs consist of two components, a generator (G) and a discriminator (D), trained together to produce realistic synthetic data through “adversarial learning.” A VAE likewise consists of two components: an encoder that maps high-dimensional data to a lower-dimensional latent space, and a decoder that takes samples from this latent space and maps them back to the data space. The VAE’s advantage lies in generating synthetic data that matches the distribution of the original data using latent variables. Similarly, Bayesian networks model relationships between variables; these networks represent dependencies and generate consistent synthetic data using techniques such as Markov chain Monte Carlo (MCMC) sampling. For instance, in medical data, a Bayesian network can model dependencies between conditions, symptoms, and treatments to generate synthetic records matching observed data.
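The snippet below is a deliberately minimal adversarial-training sketch on a toy two-dimensional dataset, written in PyTorch; the network sizes, learning rates, training length, and toy data are assumptions for illustration and do not reflect the scale of the generators used in the cited work.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy "real" data: a 2-D Gaussian blob standing in for a tabular dataset.
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.3]) + torch.tensor([2.0, -1.0])

latent_dim = 8

# Generator G: maps latent noise to the data space.
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
# Discriminator D: scores how "real" a sample looks (logit output).
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # --- Discriminator update: real samples labelled 1, generated samples labelled 0.
    idx = torch.randint(0, real_data.shape[0], (64,))
    real_batch = real_data[idx]
    fake_batch = G(torch.randn(64, latent_dim)).detach()
    d_loss = bce(D(real_batch), torch.ones(64, 1)) + bce(D(fake_batch), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update: try to make D label generated samples as real.
    fake_batch = G(torch.randn(64, latent_dim))
    g_loss = bce(D(fake_batch), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After (brief) training, the generator produces synthetic rows resembling the real blob.
with torch.no_grad():
    synthetic = G(torch.randn(500, latent_dim))
print(synthetic.mean(dim=0), real_data.mean(dim=0))
```

A VAE or Bayesian network generator would slot into the same workflow: fit the model on real records, then sample from it to obtain synthetic rows.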

However, the machine learning-based approach for generating synthetic data has certain limitations that need to be considered [24]. One challenge is the requirement of substantial computational resources and a large amount of training data. Training generative models like VAEs or GANs can be computationally intensive, particularly when working with large and intricate datasets. Another limitation is the possibility of model collapse, where the generative model fails to capture the full complexity of the data and produces limited variations or repetitive samples. This can result in synthetic data that lacks diversity and fails to capture the full range of patterns present in the real data. Additionally, the quality and fidelity of the generated synthetic data heavily rely on the quantity and representativeness of the training data. Inadequate or biased training data can lead to synthetic data that deviates from the true underlying distribution of the real data, reducing its fidelity and usefulness.

2.2.2 Simulation-based approach

Simulation-based approaches (see Figure 5), on the other hand, include agent-based, compartmental model-based, and discrete event-based methods for synthetic data generation.

Figure 5.

Simulation-based synthetic data generation. E.g., with domain knowledge, an agent-based model can simulate interactions between agents to generate data.

In transportation planning, agent-based modelling techniques can capture traffic flows and travel patterns. By simulating realistic scenarios considering road networks, traffic signals, and driver behaviours, these approaches generate synthetic data for analysing traffic congestion, designing efficient transportation systems, and evaluating infrastructure projects [25].

In epidemiology, epidemic models such as compartmental models (e.g., Susceptible-Infectious-Recovered (SIR) or Susceptible-Exposed-Infectious-Recovered (SEIR) models), spatial models, and network models simulate disease transmission dynamics to generate synthetic data [26]. These models incorporate population demographics, contact networks, and intervention strategies to study control measures, predict outbreak impacts, and plan healthcare resource allocation.
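As a hedged illustration of compartmental modelling, the following sketch integrates a deterministic SIR model with a simple daily update to produce a synthetic epidemic curve; the transmission rate, recovery rate, and population size are arbitrary example values, not estimates for any real outbreak.

```python
import numpy as np

def simulate_sir(n_days=160, population=1_000_000, beta=0.3, gamma=0.1, i0=10):
    """Deterministic SIR compartmental model integrated with a simple daily Euler step.
    Returns per-day counts that can serve as a synthetic epidemic time series."""
    s, i, r = population - i0, i0, 0
    history = []
    for day in range(n_days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((day, s, i, r))
    return np.array(history)

synthetic_epidemic = simulate_sir()
peak_day = synthetic_epidemic[:, 2].argmax()   # column 2 holds the infectious count
print(f"Synthetic outbreak peaks on day {int(peak_day)} "
      f"with {int(synthetic_epidemic[peak_day, 2])} infectious individuals")
```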

Simulation-based approaches also find applications in other domains. Ecological research employs ecological models to simulate species interactions, population dynamics, and environmental factors [27]. In social sciences, agent-based models simulate human behaviour, social networks, and economic systems [28]. In manufacturing and supply chain management, discrete event simulation models replicate production processes, logistics networks, and inventory systems [29]. These techniques generate synthetic data to support analysis, informed decision-making, and policy evaluation in their respective fields.
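Similarly, a discrete event simulation can be sketched with nothing more than an event list; the single-machine queue below, with assumed exponential inter-arrival and service times, produces a synthetic log of job waiting times of the kind used in manufacturing studies. It is a simplified illustration, not a substitute for a dedicated simulation library.

```python
import heapq
import random

random.seed(1)

def simulate_machine(n_jobs=200, mean_interarrival=5.0, mean_service=4.0):
    """Minimal discrete-event simulation of a single machine with a FIFO queue.
    Events are processed in chronological order; the output is a synthetic log
    of job arrival, start, and finish times."""
    events = []  # priority queue of (time, job_id, kind)
    t = 0.0
    for job in range(n_jobs):
        t += random.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, job, "arrival"))

    machine_free_at = 0.0
    log = []
    while events:
        time, job, kind = heapq.heappop(events)
        if kind == "arrival":
            start = max(time, machine_free_at)          # wait if the machine is busy
            service = random.expovariate(1.0 / mean_service)
            machine_free_at = start + service
            log.append({"job": job, "arrival": time, "start": start,
                        "finish": machine_free_at})
    return log

log = simulate_machine()
waits = [row["start"] - row["arrival"] for row in log]
print(f"Mean synthetic waiting time: {sum(waits) / len(waits):.2f}")
```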

However, the obvious limitation of this approach is that it may require domain-specific knowledge and expertise. Designing accurate simulations that replicate the intricacies of the system often necessitates a deep understanding of the underlying mechanisms and processes [30]. Developing reliable simulation models requires careful calibration, validation, and refinement to ensure that the generated synthetic data accurately reflects the real-world phenomenon [31].

The methods discussed earlier have their own strengths and limitations (see Table 1). The choice of method usually depends on the specific requirements of the application, the available resources, and the desired level of similarity between the synthetic and real data.

| Method | Strengths | Limitations |
| --- | --- | --- |
| A. Model-based | | |
| Statistical-based | Good for reproduction of statistics and characteristics [11, 12, 13] | Limited capture of intricate relationships and dynamics [19] |
| Noise-based | Good flexibility and control over the generation process [20] | Partial capture of complex relationships and dependencies [23] |
| Machine learning-based | Able to capture complex relationships and patterns [6, 7] | Requires substantial computational resources; prone to model collapse; dependent on training data [24] |
| B. Simulation-based | | |
| Agent-based | Accurate replication of system interactions; high fidelity in capturing interactions [25] | Domain knowledge required [30] |
| Compartmental model-based | Good for epidemiological research [26] | Limited capture of fine-grained aspects of dynamics [27] |
| Discrete event-based | Good for modelling discrete events; high fidelity in capturing event-based behaviours [29] | Reliability depends on calibration, validation, and refinement [31] |

Table 1.

Comparison of synthetic data generation approaches.

3. Applications of high-fidelity synthetic data

The use of high-fidelity synthetic data has found applications in a wide range of fields, notably computer vision and natural language processing. However, the healthcare domain stands out as a particularly compelling area of focus. This emphasis on healthcare is justified by the significant impact that high-fidelity synthetic data can have in this field. By generating realistic and privacy-preserving synthetic healthcare data, researchers can overcome challenges related to data availability and data bias. This use of high-fidelity synthetic data in healthcare enables the development and validation of machine learning models, facilitating advancements in disease prediction and personalised healthcare. Moreover, high-fidelity synthetic data plays a crucial role in generating virtual patient cohorts, such as digital twins [32], for analysis when the sample size is small and for estimating counterfactuals in in silico trials, allowing for better prediction of treatment outcomes and supporting personalised medicine. In this section, we delve into the applications of high-fidelity synthetic data, emphasising its role in supporting the training and validation of machine learning algorithms, and highlight specific examples that showcase the transformative potential of this approach in healthcare.

3.1 Addressing missing data

One of the most noteworthy applications of synthetic data lies in the opportunity to effectively tackle missing data due to the absence or unavailability of certain observations or values in a dataset. Failing to appropriately address missing data can jeopardize the reliability of the results by introducing bias into statistical analysis and modelling, leading to results that are not generalisable to real world populations or scenarios [33, 34]. In addition, missing data decreases the sample size available for analysis, inevitably reducing statistical power as it becomes more challenging to detect true relationships and make reliable predictions.

To mitigate these risks and select the most suitable solution, we usually carry out a careful a priori analysis, considering factors such as the specific characteristics of the dataset (i.e., data types), the proportion of missing data, and the underlying mechanisms causing the missingness [35]. Indeed, missing data can be classified into three different categories depending on the missing data mechanism.

Missing Completely at Random (MCAR) data embodies the scenario where the missingness is a random process that occurs independently of any measured or unmeasured feature. MCAR is an ideal scenario as it implies that the missing information does not introduce bias into the analysis. Missing at Random (MAR) occurs when the probability of missingness depends entirely on the observed data; MAR assumes that the missingness can be explained by observable characteristics. Both MCAR and MAR can be handled through a set of approaches as surveyed in [36]. Traditional statistical and machine learning imputation techniques, including mean, regression, k-nearest-neighbour, and ensemble-based imputation, have been proposed in the literature to handle these scenarios [37].
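For illustration, the sketch below applies two standard imputation techniques from scikit-learn to a toy clinical table whose values are assumed to be missing at random; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy clinical table with values assumed to be missing at random.
df = pd.DataFrame({
    "age": [34, 51, 47, 62, np.nan, 29, 58, np.nan],
    "bmi": [22.1, np.nan, 27.3, 31.0, 25.4, np.nan, 29.8, 24.0],
    "sbp": [118, 135, np.nan, 142, 128, 115, np.nan, 121],
})

# Simple mean imputation: fast, but flattens variability and correlations.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

# K-nearest-neighbour imputation: borrows values from similar rows,
# preserving more of the joint structure.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=3).fit_transform(df), columns=df.columns)

print(knn_imputed.round(1))
```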

More sophisticated techniques are required to handle more complex data patterns such as Missing Not at Random (MNAR) data. This category describes a situation where the missingness is related to the values that are missing. MNAR is considered the most challenging type of missing data because the missing data is systematically related to unobserved variables. Handling MNAR requires techniques [38] considering the missing data mechanism and the relationship between the missing values and the unobserved variables to make imputations and draw valid inferences.

Recent research on modelling MNAR demonstrates the use of Bayesian networks to generate high-fidelity synthetic data from large-scale UK primary care datasets. These datasets usually contain noise, structurally missing data, and numerous non-linear relationships [39]. In this context, three approaches can be employed to model MNAR. The first applies to discrete nodes: a “missing state” can be included among the possible node states. Alternatively, for continuous nodes, a new binary parent known as a “missing node” can be added to each node, indicating whether the data point is missing or not. Additionally, the Fast Causal Inference (FCI) algorithm [40] offers a third approach that can be applied to both discrete and continuous nodes. FCI aids in inferring the position and inclusion of latent variables in the network, effectively capturing MNAR and other unmeasured effects. By incorporating robust latent variables, this approach aims to improve the accuracy of the underlying distributions and account for any MNAR effects.
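A minimal pandas sketch of the “missing node” idea for continuous variables: an explicit binary indicator is added alongside each affected column so that a downstream generative model (for example, a Bayesian network) can learn how the missingness itself depends on other variables. The column names and toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy extract where HbA1c tends to be recorded only for patients tested for
# diabetes, so its missingness is informative (an MNAR-like pattern).
df = pd.DataFrame({
    "age":   [34, 51, 47, 62, 45],
    "hba1c": [np.nan, 7.2, np.nan, 8.1, np.nan],
})

# Add an explicit binary "missing node" alongside each continuous variable with
# informative missingness, so the generative model can represent that mechanism.
for col in ["hba1c"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

print(df)
```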

Another example [41] is using GANs to model missing data in retailing and healthcare datasets and to generate synthetic data. It starts by creating unique identifiers for missing patterns in the original dataset. Then, it uses these identifiers to fill in the missing values and learns from the data to generate synthetic samples with similar missing patterns. The final synthetic dataset resembles the original data but includes missing values in the same way.

Even though synthetic data represents a solution for handling missing data, we must recognise the complexity of this task and the potential risks it entails [42]. Meticulous attention and thoughtful planning are required. For instance, incorrectly handling missing data can disrupt correlations and relationships among features in the dataset. This can result in misleading associations and invalid inferences, affecting the accuracy of predictive models and the ability to draw meaningful conclusions. Also, an improper solution can increase the variability of the results, making them less precise and less generalisable to the population. In certain situations, such as with temporal correlations in the distribution of missing data, additional modelling approaches or sensitivity analyses may be required to handle missing data more appropriately.

3.2 Addressing data bias

Real-world data often suffer from underrepresentation or inadequate representation of certain groups. Specific groups may be under-represented due to cultural sensitivities amongst some communities, institutionalised data collection procedures, or research involving small patient cohorts for rare diseases and outcomes, leading to bias in analyses and decision-making processes [21]. Advanced model-based synthetic data generation techniques such as BayesBoost and Importance Sampling can be employed to mitigate these biases.

BayesBoost is an algorithmic technique that leverages high-fidelity synthetic data to correct biases due to under-sampling: it boosts underrepresented groups by generating synthetic data points that augment their representation. This technique has recently been used to correct bias in COVID-19 and cardiovascular disease datasets [8]. In this study, the approach’s effectiveness was confirmed by applying it to a deliberately biased subset of a dataset and comparing the bias-corrected synthetic data with the original “full” dataset. The findings demonstrated that synthetic data can effectively correct biases and enhance the generalisability of machine learning algorithms to population subgroups that are underrepresented in the real data.
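The sketch below is not the published BayesBoost algorithm [8]; it is a stripped-down illustration of the underlying boosting idea, in which a simple generative model (here a Gaussian mixture) is fitted to an under-represented subgroup and sampled to augment that subgroup. The subgroup definition, model choice, and target sample size are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Toy dataset in which patients over 70 are under-sampled.
young = pd.DataFrame({"age": rng.normal(45, 8, 900), "sbp": rng.normal(125, 12, 900)})
old = pd.DataFrame({"age": rng.normal(78, 4, 50), "sbp": rng.normal(140, 14, 50)})
data = pd.concat([young, old], ignore_index=True)

# 1. Identify the under-represented subgroup.
subgroup = data[data["age"] > 70]

# 2. Fit a simple generative model to that subgroup only.
gmm = GaussianMixture(n_components=1, random_state=0).fit(subgroup)

# 3. Sample synthetic records to bring the subgroup closer to parity.
n_extra = 400
synthetic_subgroup = pd.DataFrame(gmm.sample(n_extra)[0], columns=subgroup.columns)

augmented = pd.concat([data, synthetic_subgroup], ignore_index=True)
print("share over 70 before:", round((data["age"] > 70).mean(), 3))
print("share over 70 after: ", round((augmented["age"] > 70).mean(), 3))
```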

Importance Sampling [43] is a statistical technique utilised in the process of creating high-fidelity synthetic data. Its primary purpose is to alleviate biases caused by under-sampling in the generated datasets. By assigning appropriate weights to each sample drawn from the importance distribution, Importance Sampling can adjust the synthetic data points, giving more significance to those that align well with the target population’s distribution. In recent work [44], Importance Sampling helped handle uncertainty in small-sample scenarios during classification tasks by selecting and weighting data points from an alternative distribution. This approach enabled more accurate estimation of classification errors, providing a robust and reliable assessment of classifier performance, even in situations with limited training data.
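A worked numerical sketch of the idea: samples drawn from a shifted “proposal” distribution q are reweighted by p(x)/q(x) to estimate a quantity under the target population p. The distributions and the estimated quantity are illustrative choices only.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)

# Target population p: standard normal. The available (biased) sample instead comes
# from q, a normal shifted towards higher values, mimicking an over-sampled subgroup.
x = rng.normal(loc=1.5, scale=1.0, size=10_000)                          # draws from q
weights = norm.pdf(x, loc=0, scale=1) / norm.pdf(x, loc=1.5, scale=1)    # p(x) / q(x)

# Self-normalised importance-sampling estimate of Pr[X > 1] under the target p.
estimate = np.sum(weights * (x > 1.0)) / np.sum(weights)
print(f"importance-sampling estimate: {estimate:.3f}")
print(f"true value under p:           {1 - norm.cdf(1.0):.3f}")
```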

Despite the importance of addressing bias, both BayesBoost and Importance Sampling encounter common limitations and challenges. Both approaches can be computationally intensive, especially when dealing with large datasets or complex models. The iterative nature of BayesBoost and the need for multiple samples in Importance Sampling contribute to significant computational overhead. Moreover, overfitting is a potential challenge for both approaches. BayesBoost may overfit if the model becomes too complex or if the training data contains noise. Similarly, improper choice of the importance distribution in Importance Sampling can increase variance and result in overfitting. Furthermore, parameter tuning is essential for both methods. Adjusting parameters such as the number of boosting iterations and learning rates in BayesBoost, or selecting the appropriate importance distribution in Importance Sampling, requires careful consideration to achieve optimal results.

3.3 Supporting in silico clinical trials

Clinical trials are systematic investigations conducted on human subjects to evaluate the safety, efficacy, and potential benefits of medical interventions, treatments, therapies, and diagnostic procedures. The traditional approach to conducting clinical trials can be time-consuming, resource-intensive, and costly, often hindering the progress of medical innovation. To address these challenges, an emerging and promising strategy involves leveraging synthetic data generation methodologies to create virtual patient cohorts, fostering the concept of digital twins [32] in the context of healthcare.

The core idea behind generating virtual patient cohorts is to simulate the effects and responses of interventions or treatments on these digital twins. In essence, this process transforms a traditional clinical trial into an in silico trial [45, 46], which is conducted through computer simulation and modelling techniques.

In [47], the researchers demonstrated the use of electronic health records to create synthetic patient populations and personalised, predictive models of response to therapy, incorporating in silico clinical trials to accelerate the development of new drugs. Another study [48] successfully generated synthetic radiological images for the evaluation of novel medical devices in in silico trials, although some anatomical distinctions persist between synthetic and real images.

Despite early success in using synthetic data for in silico trials, several potential limitations and challenges need to be addressed. Ensuring the accuracy and representativeness of synthetic data is of utmost importance. The validity of results heavily relies on the quality and appropriateness of the data used to construct virtual patient cohorts. Additionally, thorough validation of the simulation models against real-world clinical data, considering their complexity and incorporation of various physiological, pharmacological, and disease-specific parameters, is essential for ensuring their validity. In [49], the authors explored existing challenges and research opportunities to enhance both synthetic data generation methods and in silico trial techniques.

4. Areas for further research

There are several emerging areas of research associated with the generation of synthetic data. These include potential advancements in synthetic data generation and metrics that can be utilised for evaluating both the usefulness and privacy of the data that has been generated. Towards the end of this section, we will also discuss the general concerns related to synthetic data that we believe will persist in further research.

4.1 Synthetic data generation

Techniques such as GANs and VAEs enable the creation of synthetic data that closely resembles real-world data, thereby enhancing the effectiveness of downstream applications and research. However, there are new approaches that are emerging related to data augmentation and privacy preservation that may further enhance the generation of synthetic data.

4.1.1 Enhanced data augmentation techniques

Using model-based synthetic data generation approaches can facilitate the generation of augmented synthetic data that introduces occlusions, transformations, missing parts, or combinations of different samples. For example, mixup [50] is a technique that blends pairs of samples, incorporating their features and labels. This process fosters smooth transitions between instances, generating synthetic samples with interpolated characteristics. CutMix [51] augments synthetic data by patching fragments of one sample onto another, creating combined instances that exhibit properties of both samples. This technique introduces spatial relationships and fine-grained details into the synthetic data, making it more representative of real-world scenarios.
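A minimal sketch of mixup [50] on a toy tabular batch; the Beta parameter, the batch, and the one-hot labels are assumptions, and a single mixing coefficient is used per batch for simplicity.

```python
import numpy as np

rng = np.random.default_rng(5)

def mixup(x, y, alpha=0.4, rng=rng):
    """Blend random pairs of samples and their one-hot labels with a
    Beta-distributed mixing coefficient, in the spirit of mixup [50]."""
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))        # random pairing of samples
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed

# Toy tabular batch: 4 samples, 3 features, 2 one-hot classes.
x = rng.normal(size=(4, 3))
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

x_aug, y_aug = mixup(x, y)
print(x_aug.round(2), y_aug.round(2), sep="\n")
```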

Self-supervised learning techniques [52] have shown promise in augmenting synthetic data by leveraging pretext tasks or auxiliary tasks. By defining surrogate tasks that encourage the model to learn meaningful representations from unlabelled data, self-supervised learning enhances the richness and diversity of the synthetic data. These techniques allow the synthetic data to capture intricate patterns and structures present in real-world data, enabling more effective model training and enhancing the quality of downstream applications.

While techniques like mixup and CutMix enhance diversity, they may introduce artifacts and unrealistic combinations, impacting the model’s performance in practical applications. Moreover, self-supervised learning techniques heavily depend on the quality of pretext tasks, posing a challenge in obtaining meaningful representations for effective data augmentation. Hence, addressing these new challenges is essential to create reliable and representative synthetic data for robust machine learning models in real-world scenarios.

4.1.2 Advancements in privacy-preserving synthetic data generation

With increasing public concerns surrounding privacy, privacy-preserving synthetic data generation has become increasingly important. Techniques such as personalised privacy, privacy accounting, and privacy amplification play a role in ensuring robust privacy protection.

Personalised privacy [53] focuses on tailoring privacy protection measures to individuals or specific data subjects, considering their unique privacy requirements. This approach allows for a more fine-grained privacy control, ensuring that every individual’s privacy needs are adequately addressed during the generation of synthetic data.

Advanced privacy accounting [54] involves systematically measuring and evaluating the privacy guarantees provided by synthetic data generation methods. This technique assesses the level of privacy protection offered by synthetic datasets and quantifies the potential risks of re-identification or privacy breaches.

Privacy amplification techniques [55] aim to strengthen privacy guarantees by incorporating additional privacy-enhancing mechanisms. These mechanisms introduce extra noise or perturbations to the synthetic data, making it even more challenging for attackers to re-identify individuals or extract sensitive information.

That said, finding the right balance between privacy and utility is crucial in privacy-preserving synthetic data generation. Tailoring privacy measures and using advanced privacy accounting can enhance protection but may complicate data generation and affect utility. Privacy amplification techniques strengthen security but can introduce data distortion. Striking an optimal balance remains a challenging yet essential goal for researchers and practitioners in this field.

4.2 Synthetic data validation

Metrics to validate the utility and privacy of synthetic datasets have evolved over time. Historically, utility assessment focused on statistical measures such as mean squared error and correlation coefficients to measure similarity between the synthetic and original datasets. As privacy concerns heightened, metrics were developed to evaluate the risk of identity disclosure in synthetic datasets, including measures like re-identification risk and information disclosure. More recently, with the advent of deep learning and generative models, evaluation metrics have incorporated domain-specific measures such as fidelity, diversity, and semantic consistency to assess the validity and usefulness of synthetic data. Furthermore, privacy evaluation has expanded to include differential privacy mechanisms and privacy-preserving techniques that quantitatively measure the risk of identity disclosure and ensure data protection.
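As an illustration of simple utility metrics, the sketch below computes per-variable Kolmogorov-Smirnov distances and the largest discrepancy in pairwise correlations between a toy real table and a toy synthetic table; the data, columns, and any threshold for “acceptable” similarity are assumptions that would be application-specific in practice.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)

# Toy real and synthetic tables with the same columns.
real = pd.DataFrame({"age": rng.normal(50, 12, 2000), "bmi": rng.normal(27, 4, 2000)})
synth = pd.DataFrame({"age": rng.normal(51, 12, 2000), "bmi": rng.normal(27, 4.5, 2000)})

# Per-variable Kolmogorov-Smirnov distance: 0 means identical marginal distributions.
for col in real.columns:
    stat, _ = ks_2samp(real[col], synth[col])
    print(f"KS distance for {col}: {stat:.3f}")

# Difference in correlation structure: how well pairwise relationships are preserved.
corr_gap = (real.corr() - synth.corr()).abs().to_numpy().max()
print(f"Max absolute difference in pairwise correlations: {corr_gap:.3f}")
```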

Table 2 summarises these evaluation metrics and provides a comprehensive overview of the measures used to evaluate the similarity and privacy protection of synthetic datasets.

| Evaluation metric | Description |
| --- | --- |
| Statistical measures | Measures such as Mean Squared Error (MSE), correlation coefficients, and Kolmogorov-Smirnov distance |
| Fidelity | Evaluation of the overall similarity and quality of synthetic data compared to the original dataset [56] |
| Diversity | Measurement of the variability and representation of different data patterns in the synthetic dataset [57] |
| Semantic consistency | Assessment of the preservation of relationships, patterns, and semantic meaning in the synthetic data [58] |
| Re-identification risk | Assessment of the likelihood of re-identifying individuals in the synthetic dataset [59] |
| Information disclosure | Quantification of the amount of sensitive information leaked in the synthetic dataset [60] |
| Differential privacy | Incorporation of privacy-preserving mechanisms to provide formal privacy guarantees, such as ε-differential privacy [22] |
| Context-specific utility | Development of domain-specific metrics that capture the unique requirements and characteristics of different applications [61] |
| Fairness considerations | Incorporation of fairness metrics to assess the impact of synthetic data generation on bias and discrimination [62] |
| Robust privacy guarantees | Advancement of privacy-preserving techniques, such as secure multiparty computation (SMC), homomorphic encryption, and federated learning [63] |
| Model-based evaluation | Leveraging generative models to assess the quality and utility of synthetic data based on the performance of downstream tasks or predictive models [6] |

Table 2.

Overview of evaluation metrics.

It is worth noting that traditional data augmentation methods primarily focus on enhancing the existing dataset by generating variations of the available samples. They do not involve a separate validation process to ensure similarity and privacy protection, as is required for synthetic datasets [31].

In addition, when choosing the validation metrics of synthetic data, it is essential to consider its planned applications. For example, if the intended use of high-fidelity synthetic data is for sample size boosting in clinical trials, validity could be assessed by comparing a boosted data extract with the full data. On the other hand, when synthetic data is utilised as a privacy-enhancing technology and a proxy for real (ground truth) data, validity can be evaluated by comparing statistical distributions of variables and ML model results in the synthetic and ground truth data.
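One hedged way to operationalise the “ML model results” comparison mentioned above is a train-on-synthetic, test-on-real check: a model trained on the synthetic data should perform close to one trained on the real data when both are evaluated on held-out real records. The toy cohorts, the logistic regression classifier, and the AUC metric below are illustrative assumptions, not a prescribed validation protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(13)

def make_cohort(n, shift=0.0):
    """Toy cohort: outcome risk rises with age and BMI; `shift` lets the
    synthetic cohort deviate slightly from the real one."""
    age = rng.normal(50 + shift, 12, n)
    bmi = rng.normal(27, 4, n)
    logit = 0.04 * (age - 50) + 0.10 * (bmi - 27) - 0.5
    y = rng.random(n) < 1 / (1 + np.exp(-logit))
    return np.column_stack([age, bmi]), y.astype(int)

X_real, y_real = make_cohort(3000)
X_synth, y_synth = make_cohort(3000, shift=1.0)

# Reference: train on real. Comparison: train on synthetic, test on real (TSTR).
ref = LogisticRegression().fit(X_real[:2000], y_real[:2000])
tstr = LogisticRegression().fit(X_synth, y_synth)

X_test, y_test = X_real[2000:], y_real[2000:]
print("AUC trained on real:     ", round(roc_auc_score(y_test, ref.predict_proba(X_test)[:, 1]), 3))
print("AUC trained on synthetic:", round(roc_auc_score(y_test, tstr.predict_proba(X_test)[:, 1]), 3))
```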

4.3 General concerns

When generating synthetic data, bias-related challenges should be considered. Overfitting to biased patterns can occur when the synthetic data generation process closely mirrors biased real data, leading to skewed results. Moreover, synthetic data methods may not fully capture the diverse characteristics of the real data, particularly for underrepresented groups, thus introducing bias. Selection bias may also arise if synthetic data is generated from a biased subset of real data. An incorrect selection of bias correction techniques during synthetic data generation can result in residual bias in the synthetic data, failing to adequately address complex biases in the real data. If the data generation process does not account for confounding variables, i.e., variables that are not considered or controlled for during data analysis but that influence bias in the real data, bias may also be introduced into the synthetic data. Finally, making unrealistic assumptions about data distribution in synthetic data generation models can introduce bias and impact the validity of the generated data.

Apart from the bias, another consideration is the potential for unintended re-identification of individuals. Despite efforts made to anonymise the data during the synthetic data generation process, there is still a possibility of re-identification, especially when combined with other external data sources. Synthetic data containing unique or rare attributes may increase the risk of re-identification, compromising individuals’ privacy and confidentiality. In such cases, ethical guidelines and robust privacy-preserving techniques must be implemented to minimize the re-identification risk and protect individuals’ privacy.

Addressing the transparency and accountability of synthetic data generation methods is crucial. Synthetic data generation often involves complex algorithms and models, making it challenging to understand and interpret the underlying processes. This lack of transparency can raise concerns about the fairness, interpretability, and accountability of the generated synthetic data. Researchers must be transparent about the methods used, document the assumptions and limitations, and provide clear explanations of how the synthetic data aligns with the original dataset.

The potential implications of using synthetic data in high-stakes decision-making contexts should not be overlooked. If synthetic data is used to train or test algorithms that have a direct impact on individuals’ lives, such as in healthcare or finance, the ethical implications are amplified. Rigorous assessment of the performance and generalisability of models trained on synthetic data is crucial to avoid biases, unfair outcomes, or adverse effects on marginalised populations. Regular monitoring, validation, and auditing of the synthetic data generation process can help identify and mitigate potential biases and ethical concerns.

These concerns highlight the need for careful data curation, ensuring sufficient and diverse training data, and addressing the risk of model collapse to achieve high-fidelity synthetic data generation.

5. Conclusion

This chapter has explored the concept of synthetic data and its significance in data augmentation. Synthetic data, which effectively replicates real-world data while safeguarding privacy, presents valuable opportunities for research and practical applications. The selection of an appropriate synthetic data generation method relies on specific requirements, available resources, and the desired resemblance to real data.

Notably, high-fidelity synthetic data has emerged as a potent tool with transformative potential, particularly in the realm of healthcare. It effectively addresses challenges related to missing data and biases stemming from under-sampling, thereby propelling advancements in disease prediction, drug discovery, and personalised healthcare. Through the imputation of missing values and the generation of additional synthetic samples, high-fidelity synthetic data empowers researchers to surmount data scarcity and enhance inference accuracy. Furthermore, it plays a vital role in the creation of virtual patient cohorts for in silico trials, enabling superior predictions of treatment effectiveness, personalised medicine, and the estimation of counterfactual scenarios. Moreover, high-fidelity synthetic data finds practical utility in rare event analysis, facilitating the study of uncommon diseases or adverse drug reactions.

The progress made in synthetic data generation techniques, such as GANs and VAEs, has considerably bolstered the capacity to create synthetic data that closely approximates real-world data. Furthermore, augmentation techniques have the potential to expand datasets and furnish a more diverse set of samples for training machine learning models.

The evaluation metrics for synthetic datasets have evolved to encompass measures of utility, privacy, and domain-specific characteristics. These metrics now include fidelity, diversity, semantic consistency, and privacy assessment techniques. Anticipated future trends involve the development of more sophisticated metrics that account for context-specific utility, robust privacy guarantees, and considerations of fairness.

Synthetic data raises general concerns related to biases, representativeness, unintended re-identification, transparency, and accountability. Overcoming these concerns requires careful evaluation of the fidelity and quality of synthetic data, implementation of privacy-preserving techniques, transparent documentation of the generation methods, and rigorous assessment of model performance and generalisability.

References

  1. 1. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. Journal of Big Data. 2019;6:60. DOI: 10.1186/s40537-019-0197-0
  2. 2. Antoniou A et al. Data augmentation for time series classification using convolutional neural networks. Data Mining and Knowledge Discovery. 2018;32:914-945. DOI: 10.1007/s10618-018-0595-8
  3. 3. Miotto R et al. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Scientific Reports. 2016;6:26094. DOI: 10.1038/srep26094
  4. 4. Yan L et al. Data augmentation in ECG-based deep cardiac arrhythmia classification. Computers in Biology and Medicine. 2018;102:411-420. DOI: 10.1016/j.compbiomed.2018.10.006
  5. 5. Abayomi-Alli R, Damaševičius RM, Abayomi-Alli A. BiLSTM with data augmentation using interpolation methods to improve early detection of Parkinson disease. In: 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria. IEEE. 2020. pp. 371-380. DOI: 10.15439/2020F188
  6. 6. Goodfellow IJ et al. Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, 8-13 December 2014; Montreal, Canada. Cambridge, MA, USA: MIT Press. pp. 2672-2680
  7. 7. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Proceedings of the International Conference on Learning Representations (ICLR). 2014
  8. 8. Draghi B, Wang Z, Myles P, Tucker A. BayesBoost: Identifying and handling bias using synthetic data generators. In: Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research. Vol. 154. Bilbao, Spain: ECML-PKDD 2021, 2021. pp. 49-62. Available from https://proceedings.mlr.press/v154/draghi21a.html
  9. 9. Assefa SA et al. Generating synthetic data in finance: Opportunities, challenges and pitfalls. In: Proceedings of the First ACM International Conference on AI in Finance (ICAIF ‘20). New York, NY, USA: Association for Computing Machinery; 2021. pp. 1-8 Article 44. DOI: 10.1145/3383455.3422554
  10. 10. Li G, Chen Y, Wang Y, et al. City-scale synthetic individual-level vehicle trip data. Scientific Data. 2023;10:96. DOI: 10.1038/s41597-023-01997-4
  11. 11. Wang Z, Myles P, Tucker A. Generating and evaluating cross-sectional synthetic electronic healthcare data: Preserving data utility and patient privacy. Computational Intelligence. 2021;37:1-33. DOI: 10.1111/coin.12427
  12. 12. Wang Z et al. Evaluating a longitudinal synthetic data generator using real world data. In: Proceedings of the IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), 7-9 June 2021. Aveiro, Portugal; pp. 259-264
  13. 13. El Emam K, Mosquera L, Jonker E, Sood H. Evaluating the utility of synthetic COVID-19 case data. JAMIA Open. 2021;4(1):ooab012. DOI: 10.1093/jamiaopen/ooab012
  14. 14. Shirzadian P, Antony B, Gattani AG, et al. A time evolving online social network generation algorithm. Scientific Reports. 2023;13:2395. DOI: 10.1038/s41598-023-29443-w
  15. 15. Appenzeller A et al. Privacy and utility of private synthetic data for medical data analyses. Applied Sciences. 2022;12:12320. DOI: 10.3390/app122312320
  16. 16. Buczak AL, Babin S, Moniz L. Data-driven approach for creating synthetic electronic medical records. BMC Medical Informatics and Decision Making. 2010;10:59. DOI: 10.1186/1472-6947-10-59
  17. 17. Figueira A, Vaz B. Survey on synthetic data generation, evaluation methods and GANs. Mathematics. 2022;10(15):2733. DOI: 10.3390/math10152733
  18. 18. Sonnenberg FA, Beck JR. Markov models in medical decision making: A practical guide. Medical Decision Making. 1993;13(4):322-338. DOI: 10.1177/0272989X9301300409
  19. 19. Levy JJ, O’Malley AJ. Don’t dismiss logistic regression: the case for sensible extraction of interactions in the era of machine learning. BMC Medical Research Methodology. 2020;20:171. DOI: 10.1186/s12874-020-01046-3
  20. 20. Momeny M et al. Learning-to-augment strategy using noisy and denoised data: Improving generalizability of deep CNN for the detection of COVID-19 in X-ray images. Computers in Biology and Medicine. 2021;136:104704. DOI: 10.1016/j.compbiomed.2021.104704
  21. 21. Chambers JM. Graphical Methods for Data Analysis. Boca Raton, FL: Chapman and Hall/CRC; 1983. DOI: 10.1201/9781351072304
  22. 22. Dwork C. Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I, editors. Automata, Languages and Programming. ICALP 2006. Lecture Notes in Computer Science. Vol. 4052. Berlin, Heidelberg: Springer; 2006. DOI: 10.1007/11787006_1
  23. 23. Shuryak I. Advantages of synthetic noise and machine learning for analyzing radioecological data sets. PLoS One. 2017;12(1):e0170007. DOI: 10.1371/journal.pone.0170007. PMID: 28068401; PMCID: PMC5222373
  24. 24. Sarker IH. Deep learning: A comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science. 2021;2:420. DOI: 10.1007/s42979-021-00815-1
  25. 25. Huang J et al. An overview of agent-based models for transport simulation and analysis. Journal of Advanced Transportation. 2022;2022:1252534. DOI: 10.1155/2022/1252534
  26. 26. Ferguson NM et al. Strategies for mitigating an influenza pandemic. Nature. 2006;442(7101):448-452. DOI: 10.1038/nature04795
  27. 27. Ovaskainen O, Roy DB, Fox R. Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models. Methods in Ecology and Evolution. 2016;7(4):428-436
  28. 28. Steinbacher M, Raddant M, Karimi F, et al. Advances in the agent-based modeling of economic and social behavior. SN Business Economy. 2021;1:99. DOI: 10.1007/s43546-021-00103-3
  29. 29. Chan KC, Rabaev M, Pratama H. Generation of synthetic manufacturing datasets for machine learning using discrete-event simulation. Production & Manufacturing Research. 2022;10(1):337-353. DOI: 10.1080/21693277.2022.2086642
  30. 30. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Statistics in Medicine. 2019;38(11):2074-2102. DOI: 10.1002/sim.8086
  31. 31. Mumuni A, Mumuni F. Data augmentation: A comprehensive survey of modern approaches. Array. 2022;16:100258. DOI: 10.1016/j.array.2022.100258
  32. 32. Jones DE et al. Characterising the digital twin: A systematic literature review. CIRP Journal of Manufacturing Science and Technology. 2020;29:36-52
  33. 33. McKnight PE et al. Missing Data: A Gentle Introduction. New York: Guilford Press; 2007
  34. 34. Nakagawa S, Freckleton RP. Missing inaction: The dangers of ignoring missing data. Trends in Ecology & Evolution. 2008;23(11):592-596
  35. 35. Kleinberg G, Diaz MJ, Batchu S, Lucke-Wold B. Racial underrepresentation in dermatological datasets leads to biased machine learning models and inequitable healthcare. Journal of Biomedical Research. 2022;3(1):42-47
  36. 36. Emmanuel T, Maupong T, Mpoeleng D, et al. A survey on missing data in machine learning. Journal of Big Data. 2021;8:140. DOI: 10.1186/s40537-021-00516-9
  37. 37. Baraldi AN, Enders CK. An introduction to modern missing data analyses. Journal of School Psychology. 2010;48(1):5-37
  38. 38. Iddrisu AK, Gumedze F. An application of a pattern-mixture model with multiple imputation for the analysis of longitudinal trials with protocol deviations. BMC Medical Research Methodology. 2019;19:10. DOI: 10.1186/s12874-018-0639-y
  39. 39. Tucker A et al. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. NPJ Digital Medicine. 2020;3(1):1-13
  40. 40. Colombo D et al. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics. 2012;40:294-321
  41. 41. Wang X, Asif H, Vaidya J. Preserving missing data distribution in synthetic data. In: Proceedings of the ACM Web Conference 2023 (WWW ‘23), April 30–May 04, 2023; Austin, TX, USA. New York, NY, USA: ACM; 2023. p. 12
  42. 42. Stavseth MR, Clausen T, Røislien J. How handling missing data may impact conclusions: A comparison of six different imputation methods for categorical questionnaire data. SAGE Open Medicine. 2019;7:2050312118822912. DOI: 10.1177/2050312118822912
  43. 43. Tokdar ST, Kass RE. Importance sampling: A review. WIREs Computational Statistics. 2010;2:54-60. DOI: 10.1002/wics.56
  44. 44. Maddouri O, Qian X, Alexander FJ, Dougherty ER, Yoon BJ. Robust importance sampling for error estimation in the context of optimal Bayesian transfer learning. Patterns (N Y). 2022;3(3):100428. DOI: 10.1016/j.patter.2021.100428
  45. 45. Wang Z, Gao C, Glass L, Sun J. Artificial intelligence for in silico clinical trials: A review. ArXiv, abs/2209.09023. 2022
  46. 46. Badano A. In silico imaging clinical trials: cheaper, faster, better, safer, and more scalable. Trials. 2021;22:64. DOI: 10.1186/s13063-020-05002-w
  47. 47. Zand R, Abedi V, Hontecillas R, Lu P, Noorbakhsh-Sabet N, Verma M, et al. Development of synthetic patient populations and in silico clinical trials. In: Bassaganya-Riera, editor. Accelerated Path to Cures. Cham: Springer; 2018. pp. 57-77
  48. 48. Galbusera F et al. Exploring the potential of generative adversarial networks for synthesizing radiological images of the spine to be used in in silico trials. Frontiers in Bioengineering and Biotechnology. 2018;6:53. DOI: 10.3389/fbioe.2018.00053
  49. 49. Myles P et al. Synthetic data and the innovation, assessment, and regulation of AI medical devices. Progress in Biomedical Engineering. 2023;5:013001
  50. 50. Zhang H, Cisse M, Dauphin YN, et al. Mixup: Beyond empirical risk minimization. In: Proceedings of International Conference on Learning Representations, April 2018. BC, Canada: Vancouver; 2018. pp. 1-13
  51. 51. Yun S, Han D, Chun S, Oh SJ, Yoo Y, Choe J. CutMix: Regularization strategy to train strong classifiers with localizable features. In: IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, South Korea; 2019. pp. 6022-6031
  52. 52. Chen T et al. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning (ICML’20). Vol. 119. Virtual Conference; 2020. pp. 1597-1607 JMLR.org, Article 149
  53. 53. Canhoto AI, Keegan BJ, Ryzhikh M. Snakes and ladders: Unpacking the personalisation-privacy paradox in the context of AI-enabled personalisation in the physical retail environment. Information Systems Frontiers. 2023;25. DOI: 10.1007/s10796-023-10369-7
  54. 54. Doroshenko V, Ghazi B, Kamath P, Kumar R, Manurangsi P. Connect the dots: Tighter discrete approximations of privacy loss distributions. Proceedings on Privacy Enhancing Technologies. 2022;2022:552-570
  55. 55. Bennett CH, Brassard G, Crepeau C, Maurer UM. Generalized privacy amplification. IEEE Transactions on Information Theory. 1995;41(6):1915-1923. DOI: 10.1109/18.476316
  56. 56. Raghunathan TE et al. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics. 2003;19:1
  57. 57. Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach research participants’ privacy. Journal of the American Medical Informatics Association. 2010;17(3):322-327. DOI: 10.1136/jamia.2009.002725
  58. 58. Vaidya J, Clifton C. Privacy-preserving k-means clustering over vertically partitioned data. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘03). New York, NY, USA: ACM; 2003. pp. 206-215
  59. 59. Machanavajjhala A et al. L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data. 2007;1:3–es. DOI: 10.1145/1217299.1217302
  60. 60. Domingo-Ferrer J, Torra V. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery. 2005;11:195-212. DOI: 10.1007/s10618-005-0007-5
  61. 61. El Emam K et al. A globally optimal k-anonymity method for the de-identification of health data. Journal of the American Medical Informatics Association. 2009;16(5):670-682. DOI: 10.1197/jamia.M3144
  62. 62. Zemel R et al. Learning fair representations. In: Proceedings of the 30th International Conference on International Conference on Machine Learning – Volume 28 (ICML’13). GA, USA: Atlanta; 2013 JMLR.org, III–325–III–333
  63. 63. Shokri R et al. Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP). CA, USA: San Jose; 2017. pp. 3-18
