The Synthetic Data Landscape


The discussion around synthetic data has become one of the most important conversations in market research and insights. It's particularly crucial because there's no universal agreement on key terminology, and the topic itself remains contentious.
I've previously published views on synthetic data and guidance materials, contributed to the ESOMAR questions framework for synthetic data, and participated in numerous community discussions, face-to-face meetings, training sessions, and LinkedIn exchanges.
This document sets out my perspective on the landscape, defines key terms, and suggests next steps for developing more structured, safer, and more reliable use of synthetic data. This is very much a work in progress, shared to stimulate further discussion.
What is Synthetic Data?
I'll start with the two definitions from the ICC/ESOMAR Code:
Synthetic data
"Information that has been generated to replicate the characteristics of real-world data"
Synthetic persona
"A digital representation of a person generated to mimic the behaviours, preferences, and characteristics of real people or groups"
These definitions show that the topic encompasses two distinct entities: personas (representations of people or groups) and synthetic data itself (a data set or artefact).
Defining Personas
Personas are entities created to simulate, estimate, or substitute for people. Some vendors use the term to represent groups or archetypes, whilst others use it to represent individuals (either actual individuals or free-standing synthetic cases).
Personas are neither inherently qualitative nor quantitative. They can be queried and generate answers. One persona might engage in qualitative discussion with a user, whilst another might complete a questionnaire.
Two key uses of personas:
Interacting with users to explore responses
Generating synthetic data for use in place of human-generated data (though there are other ways of generating synthetic data)
Alternative names for personas include agents, simulacra, virtual respondents, and synthetic people, along with many variants.
One alternative with precise meaning is Digital Twin: a persona created to model a specific, genuine person.
Defining Synthetic Data
Synthetic data refers to data that has been created. It's relatively static information, an artefact. We can examine synthetic data and handle it in conventional ways. It's mostly quantitative, though it needn't be.
Synthetic data can comprise extra responses to a survey (augmented data), or it might be an entirely created data set.
Synthetic data broadly falls into these categories:
1. Boosting/Augmenting and Imputation
Additional data for an existing study, creating a new data set combining real and synthetic data
2. Wholly or Fully Synthetic
All data is synthetic
3. Anonymised Synthetic Data
A real dataset adjusted to ensure anonymity whilst retaining the same statistical properties
4. Randomised Synthetic Data
Generated data for testing theories, software, surveys, etc.
Categories 1 and 2 are of particular interest to the market research and insights industry at present.
A Broad or Narrow Definition of Synthetic Data?
One issue to resolve is whether to adopt the naming approach of Eurostat and the ONS, which define synthetic data as something based on real data and matching its properties. For example: "Synthetic data is artificial data generated from original data and a model trained to reproduce the characteristics and structure of the original data. This means synthetic data and original data should deliver very similar results when undergoing the same statistical analysis."
The problem with a narrow definition is that it could exclude from guidelines and advice substantial amounts of data currently being bought and sold as synthetic data. Therefore, I propose that the definition of synthetic data should be broad. However, there should be a clear framework allowing observers and buyers to understand precisely what sort of synthetic data they're dealing with.
The suggested working definition for synthetic data: "Data that has been created."
The Blurring of Definitions and Meanings
We need a collective view on synthetic data to create agreed definitions. We must also consider when it should be used, what questions we should ask about it, when to avoid using it, and how to evaluate it.
There's considerable blurring of definitions at present. Some people and companies use different words for the same thing; others use the same word for different things. This makes it difficult for buyers, regulators, and others to gain a clear picture of the issues.
Because synthetic data in its current form is relatively new, it bleeds into other categories and definitions. This newness can lead us to retrofit the term "synthetic data" to practices used for many years.
A good, relatively uncontentious example of retrofitting is data imputation. Techniques for estimating what people would have said had they answered questions where we have missing data have existed for some time. There's little dispute that this now falls under the synthetic data umbrella.
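As a minimal sketch of this kind of imputation (using scikit-learn's KNNImputer; the survey columns and values are invented for illustration):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative survey data; None/NaN marks questions a respondent skipped
df = pd.DataFrame({
    "satisfaction": [4, 5, None, 3, 4],
    "recommend": [8, 9, 7, None, 8],
    "value": [3, None, 4, 3, 4],
})

# Estimate each missing value from the two most similar respondents.
# The filled-in cells are, in effect, synthetic data points.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```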
A more contested topic is whether weighting counts as synthetic data. Weighting is sometimes seen as a form of synthetic data because up-weighting a response creates an identical "synthetic mirror image" or clone of an existing participant, mechanically altering the raw data composition. However, many emphatically argue this is not synthetic data.
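To make the "synthetic mirror image" argument concrete, here is a small sketch with made-up numbers showing that giving a respondent a weight of 2 yields exactly the same mean as duplicating (cloning) that respondent's row:

```python
import numpy as np

scores = np.array([7, 3, 5])         # three respondents' ratings
weights = np.array([2.0, 1.0, 1.0])  # first respondent up-weighted

weighted_mean = np.average(scores, weights=weights)

# Equivalent: clone the up-weighted respondent, then take an unweighted mean
cloned = np.array([7, 7, 3, 5])
cloned_mean = cloned.mean()

print(weighted_mean, cloned_mean)  # both 5.5
```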
How Are Personas Created?
There's less agreement about how personas are created than about how synthetic data is created, and it's probably a more contentious discussion.
Simple prompt crafting
At the simplest (and probably least satisfactory) level, we can create personas by crafting prompts that represent a person. This can be done simply by describing to an LLM what one thinks the persona should look like.
Data-informed prompts
The prompt can be crafted from substantial data, making it more likely to be built on project-specific knowledge rather than drawn solely from the large language model.
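As an illustration of the difference between these first two approaches, here is a minimal sketch using the OpenAI Python client (the persona descriptions, segment data, and model name are all invented for the example; any LLM could be substituted):

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Simple prompt crafting: the persona is whatever we assert it to be
simple_persona = "You are Maria, a 34-year-old teacher in Leeds who shops online weekly."

# Data-informed prompt: the persona is grounded in (illustrative) project data
survey_summary = (
    "Past responses from this segment: satisfaction mean 3.2/5; "
    "top concern: delivery cost; 68% prefer email contact."
)
informed_persona = (
    "You are a synthetic respondent representing the segment described below. "
    "Base every answer on this data, not on general knowledge.\n" + survey_summary
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": informed_persona},
        {"role": "user", "content": "How do you feel about a £4.99 delivery charge?"},
    ],
)
print(response.choices[0].message.content)
```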
Agent building
Another approach builds agents within generative AI that represent background data, either from a specific individual (for digital twins) or from large databases containing knowledge of expected ranges of views.
Dynamic techniques
Various techniques leverage stored data and generative AI in dynamic ways that allow personas to operate.
How Is Synthetic Data Made?
This is likely to change substantially over the next two to three years, but it's still useful to outline key approaches, as they affect how we assess synthetic data and its concerns and benefits.
Five ways to generate synthetic data:
Statistical methods
Used for many years, typically for augmented data where we're adding additional cases, filling holes in data, or adding columns such as cluster membership (a minimal sketch of this approach follows the list).
Machine learning
Taking existing knowledge and data so the AI learns what additional cases should look like.
Deep learning techniques
Using generative AI and similar approaches.
Creating synthetic data from personas
Asking personas to answer questions and collating their responses into a data set.
Using an LLM directly
Generally considered a poor option, but done by some.
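As a minimal sketch of the first, statistical approach (all data here is invented): fit a multivariate normal distribution to existing rating data, then draw additional synthetic cases that preserve its means and correlations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for real data: 200 respondents x 3 rating-scale questions
real = rng.multivariate_normal(
    mean=[3.5, 4.0, 2.8],
    cov=[[1.0, 0.6, 0.2], [0.6, 1.0, 0.3], [0.2, 0.3, 1.0]],
    size=200,
)

# Fit the simplest possible model: the sample mean and covariance
mu, sigma = real.mean(axis=0), np.cov(real, rowvar=False)

# Draw 100 additional synthetic respondents from the fitted distribution
synthetic = rng.multivariate_normal(mu, sigma, size=100)

print(real.mean(axis=0).round(2), synthetic.mean(axis=0).round(2))
```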
Evaluating Synthetic Data and Personas
How to evaluate synthetic data and personas is one of the biggest current issues and is likely to remain so. At the most trivial level, particularly with augmented data, we see people comparing the means of withheld data (the holdback sample) with the synthetic data means; this is often quite reassuring.
However, relying solely on means is statistically inadequate as it overlooks multivariate properties such as correlations and co-dependencies necessary for advanced analysis like segmentation, predictive modelling, or driver analysis.
We should examine correlations within data, data structure, and statistical properties of synthetic data to see when it matches real data, when it doesn't, and what the implications are.
For example, if all generated means are comparable, we could use it for simple concept evaluations. However, if underlying statistical properties differ, we might not be able to use it for more advanced purposes. Indeed, some have found synthetic data problematic for creating segments and clusters within the data.
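To make this concrete, here is a hedged sketch of such a check: comparing not just the means but the full correlation matrices of a holdback sample and a synthetic data set (column names and data are invented):

```python
import numpy as np
import pandas as pd

def compare(holdback: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    """Compare means and correlation structure of two datasets."""
    mean_gap = (holdback.mean() - synthetic.mean()).abs()
    print("Largest mean difference:", mean_gap.max().round(3))

    # The multivariate check: differences between correlation matrices
    corr_gap = (holdback.corr() - synthetic.corr()).abs()
    print("Largest correlation difference:", corr_gap.to_numpy().max().round(3))

rng = np.random.default_rng(0)
cov = [[1.0, 0.6, 0.6], [0.6, 1.0, 0.6], [0.6, 0.6, 1.0]]
holdback = pd.DataFrame(
    rng.multivariate_normal([3.5] * 3, cov, size=200), columns=["q1", "q2", "q3"]
)
synthetic = pd.DataFrame(
    rng.normal(3.5, 1.0, size=(200, 3)), columns=["q1", "q2", "q3"]
)
compare(holdback, synthetic)
```

In this invented example the means agree while the correlation structure does not, which is precisely the failure mode that means-only comparisons miss.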
As an industry, we should create an evaluation framework that examines the type of synthetic data, starting with questions like:
Is it augmented data?
Is it fully synthetic data?
Is it a persona?
The framework could then apply an agreed set of evaluation principles examining necessary elements:
Is the synthetic data broadly in the right area?
Does it give the right sort of messages?
Does it give similar answers each time the same questions are asked?
Then the framework could address key points such as:
I. Methodology & Fidelity (How was it made and how accurate is it?)
These questions focus on the creation process and statistical resemblance to original data.
Source Data
Was the synthetic data or persona derived solely from real-world data? If not, what other sources were used?
Model Transparency
How does your system operate? Provide a complete, accessible description of the methodology and AI platforms used (avoiding unnecessary jargon whilst maintaining technical clarity).
Statistical Retention (Fidelity)
To what extent does the synthetic data reproduce the statistical properties of the donor data?
Specifically, how do means, standard deviations, clusters, and correlations (and other measures of interdependence) compare to real data?
Which statistical relationships are not retained (and why)? (e.g., links between gender, age, and risk-taking)
Temporal Handling
How does the model account for and reproduce temporal issues such as trends, seasonality, and changes in underlying data over time?
Preventing Synthetic Input Contamination
How do you ensure synthetic data is prevented from being used as input in creating new synthetic data or personas?
II. Utility & Performance (Does it work for its intended purpose?)
These questions address the usefulness of synthetic data for downstream tasks.
Predictive Equivalence
Does the synthetic data, when subjected to the same analytical processes as real data, produce the same outputs (e.g., in a predictive model)? (A sketch of such a test follows this section.)
Performance Benchmarks
How does your model's performance compare against known external industry benchmarks (e.g., labour force surveys, common predictive model metrics)?
Output Types
What sort of outputs can your system produce (e.g., synthetic data only, personas)? Is the data quantitative, qualitative, or both?
Validation Evidence
Provide examples of situations where your system works well, backed by external validation or proven use cases.
Failure Analysis
Provide examples of where your system has failed or not worked, explaining how this was discovered and what measures prevent similar occurrences.
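As a sketch of what the predictive-equivalence test (the first question in this section) might look like in practice, assuming scikit-learn and invented stand-in data: train the same model once on real data and once on synthetic data, then score both against the same real holdout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_data(n, noise):
    """Invented stand-in data: four predictors, one binary outcome."""
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, noise, n) > 0).astype(int)
    return X, y

X_real, y_real = make_data(500, noise=1.0)  # stand-in for real training data
X_syn, y_syn = make_data(500, noise=0.3)    # stand-in for synthetic training data
X_test, y_test = make_data(300, noise=1.0)  # real holdout used to score both

for name, (X, y) in {"real": (X_real, y_real), "synthetic": (X_syn, y_syn)}.items():
    model = LogisticRegression().fit(X, y)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Trained on {name} data: holdout AUC = {auc:.3f}")
```

If the two figures diverge materially, the synthetic data is not a safe substitute for that modelling task, whatever its means look like.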
III. Privacy & Ethics (Is it safe, fair, and responsible?)
These questions focus on data security, re-identification risk, bias mitigation, and ethical review.
Confidentiality & Re-identification
Does the synthetic data/persona prevent disclosure of confidential or personal information from underlying data?
Is re-identification of original individuals made provably impossible or sufficiently minimised?
How have you addressed potential issues related to anonymity, privacy, and cross-border data transfer compliance?
Bias Mitigation and Representation
How are biases in underlying data identified?
How are these biases compensated for or mitigated in synthetic data/personas?
How are minorities and vulnerable groups represented in the final data output?
Security & Compliance
Give a complete description of data security, data privacy, and regulatory compliance (e.g., GDPR) applying to both the input data you use and the output data you create.
Ethical Review
What ethical review has been conducted into your processes, algorithms, and final outputs?
Standard AI Challenges
How do you address common AI problems such as bias and hallucinations?
IV. Vendor Transparency & Viability (Can the vendor be trusted and will they be around?)
These questions are crucial for evaluating vendor business practices and long-term reliability.
Input Data and Permissions
What data does your system require as inputs, and what legal permissions (licences, consents, etc.) are necessary to use that information?
Accessibility of Evaluation
How can a buyer or user who is not a statistician or AI expert evaluate and understand your system and its output?
Scalability and Confidence
How can I be confident you'll offer this service long-term (in a few months and next year)?
How can I be sure you can scale up to much larger throughput?
How do you guarantee the required compute power for this?
How will you guarantee model updates with new information as needed?
Statistical Assessment
How and to what extent can the statistical accuracy and reliability of the data you produce be assessed by the buyer? Do you agree that conventional sampling statistics are not possible with synthetic data?
Case Studies
There are now numerous case studies, which is fantastic, and the number keeps increasing. It would be valuable if ESOMAR offered to host a list of case studies and perhaps create an open-review system for them.
However, some case studies demonstrate synthetic data works, some show it doesn't work, and some prove it works for certain purposes but not others. I'm using "prove" ironically here: these tests reflect particular times, places, data sets, and findings, and none are generalisable. This will be one of the problems in understanding synthetic data's strengths and weaknesses.
I've noticed that case studies showing synthetic data doesn't work tend to be fairly badly designed and executed, or set out to achieve unrealistic results. They're not the sort of work one sees from leading agencies selling these techniques.
However, case studies showing synthetic data works very well often examine fairly easy cases, looking at average means of rating scales on straightforward, predictable questions.
We need to be careful about how we assess case studies. We should collect and continue conducting them, but they won't solve the problem on their own.
Known Issues With Synthetic Data
From case studies to date, we can probably draw several inferences about problems synthetic data tends to face:
Dependency on good real information
It's hard, perhaps impossible, to create synthetic data or personas simply from a large language model.
Risk of Synthetic Data Collapse
If we shift too much towards synthetic data, we won't have enough real data to build the next generation of synthetic data. This creates the risk of Synthetic Data Collapse, where future models are trained only on data generated by previous, potentially limited models, compounding inaccuracies and accelerating divergence from true human behaviour.
Continued need for real data
There's a good case for continuing to collect real data, even if synthetic data works well.
Data type performance
Synthetic data seems to work better with scale data, less well with contested or categorical data, and less well with open-ended comments.
Trimming of extremes
Most types of synthetic data seem to trim extremes: for example, standard deviations tend to be smaller, and the range of views and values collected is narrower than in real data (a quick check is sketched after this list).
Reduced emotional depth
There's evidence that synthetic respondents are less empathetic than real people and less emotionally driven, giving more logical answers to questions.
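The trimming of extremes noted above is straightforward to test for. A minimal sketch, using invented data with the same mean but a narrower spread:

```python
import numpy as np

rng = np.random.default_rng(7)
real = rng.normal(5.0, 2.0, 1000)       # stand-in for real responses
synthetic = rng.normal(5.0, 1.4, 1000)  # stand-in: same mean, trimmed spread

for name, x in [("real", real), ("synthetic", synthetic)]:
    p5, p95 = np.percentile(x, [5, 95])
    print(f"{name}: sd={x.std():.2f}, 5th-95th percentile range={p95 - p5:.2f}")
```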
Suggested Definitions
These suggested definitions build on the ICC/ESOMAR Code definitions and are not intended to replace them.
Synthetic data: Data that has been created in whole or in part and which is a static entity once created. By static entity, we mean that if different people are given the same synthetic data set, they will see the same numbers, the same words, and it will behave like a traditional data file.
Augmented data: Real data that has had synthetic data added to it. This additional, synthetic data could be:
Additional columns (new questions)
Additional rows (new participants)
Filling in missing values within the dataset
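A minimal sketch of these three forms of augmentation in pandas (all data and column names invented):

```python
import pandas as pd

real = pd.DataFrame({
    "age": [25, 41, 33],
    "satisfaction": [4, None, 5],
})

# 1. Additional columns (new questions), e.g. a model-predicted score
augmented = real.assign(predicted_nps=[8, 6, 9])

# 2. Additional rows (new, synthetic participants)
synthetic_rows = pd.DataFrame({"age": [29], "satisfaction": [4], "predicted_nps": [7]})
augmented = pd.concat([augmented, synthetic_rows], ignore_index=True)

# 3. Filling in missing values (imputation)
augmented["satisfaction"] = augmented["satisfaction"].fillna(
    augmented["satisfaction"].mean()
)

print(augmented)
```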
Personas: Systems created to represent a person or a type/group/archetype of people. Personas differ from synthetic data in that they are not the same each time you use them. A persona can generate answers and, if used again, can generate different answers. At present, some (such as Toluna) use the term "personas" to represent numerous digital individuals that have been created. Other companies (such as Signoi) use it to represent entities representing groups of people rather than theoretical individuals.
An alternative term for personas is agents, used by some academics researching the creation of digital representations of people or groups. These dynamic entities (often referred to as AI Agents) are now seeing rapid investment acceleration in the enterprise sector.
Digital Twins: A specific type of persona. With digital twins, each digital agent is created to be a replica of a real human being. For example, you might take members of an online community and, for each member, use their past survey data and qualitative information to create an agent representing that individual.
Next Steps
I look forward to discussing the ideas I've put forward in this document with others. I don't imagine every point I'm making will be agreed upon.
What I'm trying to do is set out a framework people can examine and then add to, delete from, change, debate, discuss, and contest.
Also, let's all try to be respectful of different points of view. Synthetic data has real enthusiasts and absolute rejectionists, with most people somewhere in between. But I'm sure everyone approaches it with the best intentions, and we should bear that in mind when someone holds a very different view to our own.
Discussion
What are your questions, suggestions, and additional points?