The Synthetic Data Landscape

Ray Poynter, Founder
13 November 2025 · 14 min read

The discussion around synthetic data has become one of the most important conversations in market research and insights. It's particularly pressing because there's no universal agreement on key terminology, and the topic itself remains contentious.

I've previously published views on synthetic data and guidance materials, contributed to the ESOMAR questions framework for synthetic data, and participated in numerous community discussions, face-to-face meetings, training sessions, and LinkedIn exchanges.

This document sets out my perspective on the landscape, defines key terms, and suggests next steps for developing more structured, safer, and more reliable use of synthetic data. This is very much work in progress, shared to stimulate further discussion.

What is Synthetic Data?

I'll start with the two definitions from the ICC/ESOMAR Code:

  • Synthetic data

    "Information that has been generated to replicate the characteristics of real-world data"

  • Synthetic persona

    "A digital representation of a person generated to mimic the behaviours, preferences, and characteristics of real people or groups"

These definitions reveal that the topic encompasses two distinct entities: personas (representations of people or groups) and synthetic data itself (a data set or artefact).

Defining Personas

Personas are entities created to simulate, estimate, or substitute for people. Some vendors use the term to represent groups or archetypes, whilst others use it to represent individuals (either actual individuals or free-standing synthetic cases).

Personas are neither inherently qualitative nor quantitative. They can be queried and generate answers. One persona might engage in qualitative discussion with a user, whilst another might complete a questionnaire.

Two key uses of personas:

  1. Interacting with users to explore responses

  2. Generating synthetic data for use in place of human-generated data (though there are other ways of generating synthetic data)

Alternative names for personas include agents, simulacra, virtual respondents, and synthetic people, along with many variants.

One alternative with precise meaning is Digital Twin: a persona created to model a specific, genuine person.

Defining Synthetic Data

Synthetic data refers to data that has been created. It's relatively static information, an artefact. We can examine synthetic data and handle it in conventional ways. It's mostly quantitative, though it needn't be.

Synthetic data can comprise extra responses to a survey (augmented data), or it might be an entirely created data set.

Synthetic data broadly falls into these categories:

  1. Boosting/Augmenting and Imputation

    Additional data for an existing study, creating a new data set combining real and synthetic data

  2. Wholly or Fully Synthetic

    All data is synthetic

  3. Anonymised Synthetic Data

    A real dataset adjusted to ensure anonymity whilst retaining the same statistical properties

  4. Randomised Synthetic Data

    Generated data for testing theories, software, surveys, etc.

Categories 1 and 2 are of particular interest to the market research and insights industry at present.

A Broad or Narrow Definition of Synthetic Data?

One issue to resolve is whether to adopt a naming approach like that of Eurostat and the ONS, which define synthetic data as something based on real data, matching its properties. For example: "Synthetic data is artificial data generated from original data and a model trained to reproduce the characteristics and structure of the original data. This means synthetic data and original data should deliver very similar results when undergoing the same statistical analysis."

The problem with a narrow definition is that it could exclude substantial amounts of data being sold and bought as synthetic data from guidelines and advice. Therefore, I propose the definition of synthetic data should be broad. However, there should be a clear framework allowing observers and buyers to understand precisely what sort of synthetic data they're dealing with.

The suggested working definition for synthetic data: "Data that has been created."

The Blurring of Definitions and Meanings

We need a collective view on synthetic data to create agreed definitions. We must also consider when it should be used, what questions we should ask about it, when to avoid using it, and how to evaluate it.

There's considerable blurring of definitions at present. Some people and companies use different words for the same thing; others use the same word for different things. This makes it difficult for buyers, regulators, and others to gain a clear picture of the issues.

Because synthetic data in its current form is relatively new, it bleeds into other categories and definitions. This newness can lead us to retrofit the term "synthetic data" to practices used for many years.

A good, relatively uncontentious example of retrofitting is data imputation. Techniques for estimating what people would have said had they answered questions where we have missing data have existed for some time. There's little dispute that this now falls under the synthetic data umbrella.
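
As a concrete illustration, here is a minimal sketch of the simplest form of imputation, using pandas and scikit-learn. The survey columns and values are invented, and real projects would typically prefer model-based imputation over column means.

```python
# A minimal, illustrative sketch of imputation: the filled-in values are
# created rather than observed, which is why imputation now sits under
# the synthetic data umbrella. Columns and values are invented.
import pandas as pd
from sklearn.impute import SimpleImputer

survey = pd.DataFrame({
    "age": [34, 51, None, 29, 45],
    "satisfaction_1_to_10": [7, None, 8, 6, None],
})

# Replace each missing value with the column mean -- the crudest option;
# regression- or model-based imputation is more common in practice.
imputer = SimpleImputer(strategy="mean")
completed = pd.DataFrame(imputer.fit_transform(survey), columns=survey.columns)
print(completed)
```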

A more contested topic is whether weighting counts as synthetic data. Weighting is sometimes seen as a form of synthetic data because up-weighting a response creates an identical "synthetic mirror image" or clone of an existing participant, mechanically altering the raw data composition. However, many emphatically argue this is not synthetic data.
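
The "clone" argument can be shown with a toy calculation: giving one respondent a weight of 2 produces exactly the same mean as duplicating their row, which is why some see weighting as creating synthetic cases. The numbers below are invented for illustration.

```python
# A toy illustration of the "weighting creates clones" argument.
import numpy as np

scores = np.array([6, 7, 9])    # three respondents' ratings
weights = np.array([1, 1, 2])   # up-weight the third respondent

weighted_mean = np.average(scores, weights=weights)
cloned_mean = np.mean(np.append(scores, 9))  # duplicate the third row instead

print(weighted_mean, cloned_mean)  # both print 7.75
```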

How Are Personas Created?

There's less agreement about how personas are created than about how synthetic data is created, and it's probably a more contentious discussion.

  1. Simple prompt crafting

    At the simplest (and probably least satisfactory) level, we can create personas by crafting prompts that represent a person. This can be done simply by describing to an LLM what one thinks the persona should look like.

  2. Data-informed prompts

    The prompt can be crafted from substantial data so it's more likely built upon project-specific knowledge rather than drawn solely from the large language model (a minimal sketch of this approach follows this list).

  3. Agent building

    Another approach builds agents within generative AI that represent background data, either from a specific individual (for digital twins) or from large databases containing knowledge of expected ranges of views.

  4. Dynamic techniques

    Various techniques leverage stored data and generative AI in dynamic ways that allow personas to operate.
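
To make approach 2 more tangible, here is a minimal sketch of a data-informed persona prompt. The respondent profile is invented and the call_llm() helper is hypothetical; a production system would draw on far richer project data.

```python
# A minimal sketch of a data-informed persona prompt (approach 2 above).
# The profile is invented and call_llm() is a hypothetical helper.
def build_persona_prompt(profile: dict) -> str:
    facts = "\n".join(f"- {key}: {value}" for key, value in profile.items())
    return (
        "You are role-playing a single survey respondent.\n"
        "Stay consistent with these known facts about them:\n"
        f"{facts}\n"
        "Answer questions in the first person, briefly and plausibly."
    )

respondent = {
    "age": 42,
    "location": "Leeds",
    "grocery habits": "weekly shop, mostly own-label brands",
    "stated attitude": "price-conscious, sceptical of advertising",
}

prompt = build_persona_prompt(respondent)
# response = call_llm(system=prompt, user="How do you choose a supermarket?")
print(prompt)
```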

How Is Synthetic Data Made?

This is likely to change substantially over the next two to three years, but it's still useful to outline key approaches, as they affect how we assess synthetic data and its concerns and benefits.

Five ways to generate synthetic data:

  1. Statistical methods

    Used for many years, typically for augmented data where we're adding additional cases, filling holes in data, or adding columns such as cluster membership (a minimal sketch follows this list).

  2. Machine learning

    Taking existing knowledge and data so the AI learns what additional cases should look like.

  3. Deep learning techniques

    Using generative AI and similar approaches.

  4. Creating synthetic data from personas

  5. Using an LLM directly

    Generally considered a poor option, but done by some.
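
As a sketch of method 1, the classic statistical route is to fit a distribution to real data and sample new cases from it, so the synthetic cases inherit the means and correlations by construction. The data below are simulated stand-ins for real survey scores.

```python
# A minimal sketch of statistical synthesis (method 1 above): fit a
# multivariate normal to real rating-scale data, then sample synthetic
# respondents that inherit its means and correlation structure.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for 200 real respondents answering 3 rating-scale questions.
real = rng.multivariate_normal(
    mean=[6.0, 5.0, 7.0],
    cov=[[1.0, 0.6, 0.3], [0.6, 1.0, 0.4], [0.3, 0.4, 1.0]],
    size=200,
)

# Fit: estimate the means and covariance from the real data.
mu, sigma = real.mean(axis=0), np.cov(real, rowvar=False)

# Sample: draw 100 synthetic respondents from the fitted distribution.
synthetic = rng.multivariate_normal(mu, sigma, size=100)

print(real.mean(axis=0).round(2), synthetic.mean(axis=0).round(2))
```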

Evaluating Synthetic Data and Personas

How to evaluate synthetic data and personas is one of the biggest current issues and is likely to remain so. At the most trivial level, particularly with augmented data, we see people comparing the means of withheld data (the holdback sample) with synthetic data means; this is often quite reassuring.

However, relying solely on means is statistically inadequate as it overlooks multivariate properties such as correlations and co-dependencies necessary for advanced analysis like segmentation, predictive modelling, or driver analysis.

We should examine correlations within data, data structure, and statistical properties of synthetic data to see when it matches real data, when it doesn't, and what the implications are.

For example, if all generated means are comparable, we could use it for simple concept evaluations. However, if underlying statistical properties differ, we might not be able to use it for more advanced purposes. Indeed, some have found synthetic data problematic for creating segments and clusters within the data.
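
A hedged sketch of what such an evaluation might look like follows: compare not just means but standard deviations and the full correlation matrix of a holdback sample against the synthetic data. The arrays here are simulated stand-ins for real project data.

```python
# A sketch of checking fidelity beyond means: means can match while the
# correlation structure diverges, breaking segmentation or driver analysis.
import numpy as np

def fidelity_report(holdback: np.ndarray, synthetic: np.ndarray) -> None:
    """Compare means, standard deviations, and correlation matrices."""
    print("mean gap:", np.abs(holdback.mean(0) - synthetic.mean(0)).round(3))
    print("std gap: ", np.abs(holdback.std(0) - synthetic.std(0)).round(3))
    corr_gap = np.abs(np.corrcoef(holdback, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False))
    print("max correlation gap:", corr_gap.max().round(3))

rng = np.random.default_rng(7)
holdback = rng.multivariate_normal([6, 5], [[1, 0.6], [0.6, 1]], 300)
synthetic = rng.multivariate_normal([6, 5], [[1, 0.1], [0.1, 1]], 300)

# Means match here, but the weaker correlation in the synthetic set would
# mislead any analysis that depends on relationships between variables.
fidelity_report(holdback, synthetic)
```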

As an industry, we should create an evaluation framework that examines the type of synthetic data, starting with questions like:

  • Is it augmented data?

  • Is it fully synthetic data?

  • Is it a persona?

The framework could then apply an agreed set of evaluation principles examining necessary elements:

  • Is the synthetic data broadly in the right area?

  • Does it give the right sort of messages?

  • Does it give similar answers each time the same questions are asked? (A toy check of this is sketched below.)
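
For the third question, a toy consistency check might look like the sketch below. The ask_persona callable is an assumed stand-in for whatever persona system is being evaluated.

```python
# A toy consistency check for a persona: ask the same question several
# times and tally how much the answers vary. ask_persona is hypothetical.
from collections import Counter

def consistency_check(ask_persona, question: str, runs: int = 10) -> Counter:
    """Tally the answers a persona gives to the same repeated question."""
    return Counter(ask_persona(question) for _ in range(runs))

# Stand-in persona that always answers the same way; a real one may not.
fake_persona = lambda q: "I'd choose the cheaper brand."
print(consistency_check(fake_persona, "Which brand would you buy?"))
```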

Then the framework could address key points such as:

I. Methodology & Fidelity (How was it made and how accurate is it?)

These questions focus on the creation process and statistical resemblance to original data.

  • Source Data

    Was the synthetic data or persona derived solely from real-world data? If not, what other sources were used?

  • Model Transparency

    How does your system operate? Provide a complete, accessible description of the methodology and AI platforms used (avoiding unnecessary jargon whilst maintaining technical clarity).

  • Statistical Retention (Fidelity)

    To what extent does the synthetic data reproduce the statistical properties of the donor data?

    Specifically, how do means, standard deviations, clusters, and correlations (and other measures of interdependence) compare to real data?

    Which statistical relationships are not retained, and why (e.g., links between gender, age, and risk-taking)?

  • Temporal Handling

    How does the model account for and reproduce temporal issues such as trends, seasonality, and changes in underlying data over time?

  • Preventing Synthetic Input Contamination

    How do you ensure synthetic data is prevented from being used as input in creating new synthetic data or personas?

II. Utility & Performance (Does it work for its intended purpose?)

These questions address the usefulness of synthetic data for downstream tasks.

  • Predictive Equivalence

    Does the synthetic data, when subjected to the same analytical processes as real data, produce the same outputs (e.g., in a predictive model)?

  • Performance Benchmarks

    How does your model's performance compare against known external industry benchmarks (e.g., labour force surveys, common predictive model metrics)?

  • Output Types

    What sort of outputs can your system produce (e.g., synthetic data only, personas)? Is the data quantitative, qualitative, or both?

  • Validation Evidence

    Provide examples of situations where your system works well, backed by external validation or proven use cases.

  • Failure Analysis

    Provide examples of where your system has failed or not worked, explaining how this was discovered and what measures prevent similar occurrences.

III. Privacy & Ethics (Is it safe, fair, and responsible?)

These questions focus on data security, re-identification risk, bias mitigation, and ethical review.

  • Confidentiality & Re-identification

    • Does the synthetic data/persona prevent disclosure of confidential or personal information from underlying data?

    • Is re-identification of original individuals made provably impossible or sufficiently minimised?

    • How have you addressed potential issues related to anonymity, privacy, and cross-border data transfer compliance?

  • Bias Mitigation and Representation

    • How are biases in underlying data identified?

    • How are these biases compensated for or mitigated in synthetic data/personas?

    • How are minorities and vulnerable groups represented in the final data output?

  • Security & Compliance

    Give a complete description of data security, data privacy, and regulatory compliance (e.g., GDPR) applying to both the input data you use and the output data you create.

  • Ethical Review

    What ethical review has been conducted into your processes, algorithms, and final outputs?

  • Standard AI Challenges

    How do you address common AI problems such as bias and hallucinations?

IV. Vendor Transparency & Viability (Can the vendor be trusted and will they be around?)

These questions are crucial for evaluating vendor business practices and long-term reliability.

  • Input Data and Permissions

    What data does your system require as inputs, and what legal permissions (licences, consents, etc.) are necessary to use that information?

  • Accessibility of Evaluation

    How can a buyer or user who is not a statistician or AI expert evaluate and understand your system and its output?

  • Scalability and Confidence

    How can I be confident you'll offer this service long-term (in a few months and next year)?

    How can I be sure you can scale up to much larger throughput?

    How do you guarantee the required compute power for this?

    How will you guarantee model updates with new information as needed?

  • Statistical Assessment

    How and to what extent can the statistical accuracy and reliability of the data you produce be assessed by the buyer? Do you agree that conventional sampling statistics are not possible with synthetic data?

Case Studies

There are now numerous case studies, which is fantastic, and the number keeps increasing. It would be valuable if ESOMAR offered to host a list of case studies and perhaps create an open-review system for them.

However, some case studies demonstrate synthetic data works, some show it doesn't work, and some prove it works for certain purposes but not others. I'm using "prove" ironically here. These tests show particular times, places, data sets, and findings. None are generalisable. This will be one of the problems in understanding synthetic data's strengths and weaknesses.

I've noticed that case studies showing synthetic data doesn't work tend to be fairly badly structured and executed, or set out to achieve unrealistic results. They're not the sort of work one sees from the leading agencies selling these techniques.

However, case studies showing synthetic data works very well often examine fairly easy cases, looking at average means of rating scales on straightforward, predictable questions.

We need to be careful about how we assess case studies. We should collect and continue conducting them, but they won't solve the problem on their own.

Known Issues With Synthetic Data

From case studies to date, we can probably draw several inferences about problems synthetic data tends to face:

  • Dependency on good real information

    It's hard, perhaps impossible, to create synthetic data or personas simply from a large language model.

  • Risk of Synthetic Data Collapse

    If we shift too much towards synthetic data, we won't have enough real data to build the next generation of synthetic data. This creates the risk of Synthetic Data Collapse, where future models are trained only on data generated by previous, potentially limited models, compounding inaccuracies and accelerating divergence from true human behaviour.

  • Continued need for real data

    There's a good case for continuing to collect real data, even if synthetic data works well.

  • Data type performance

    Synthetic data seems to work better with scale data, less well with contested or categorical data, and less well with open-ended comments.

  • Trimming of extremes

    Most types of synthetic data seem to trim extremes; for example, standard deviations tend to be smaller. They narrow the range of views and values collected compared to real data.

  • Reduced emotional depth

    There's evidence they're less empathetic than real people, less emotionally driven, and more logical in their answers to questions.

Suggested Definitions

These suggested definitions build on the ICC/ESOMAR Code definitions and are not intended to replace them.

Synthetic data: Data that has been created in whole or in part and which is a static entity once created. By static entity, we mean that if different people are given the same synthetic data set, they will see the same numbers, the same words, and it will behave like a traditional data file.

Augmented data: Real data that has had synthetic data added to it. This additional, synthetic data could be:

  • Additional columns (new questions)

  • Additional rows (new participants)

  • Filling in missing values within the dataset

Personas: Systems created to represent a person or a type/group/archetype of people. Personas differ from synthetic data in that they are not the same each time you use them. A persona can generate answers and, if used again, can generate different answers. At present, some (such as Toluna) use the term "personas" to represent numerous digital individuals that have been created. Other companies (such as Signoi) use it to represent entities representing groups of people rather than theoretical individuals.

An alternative term for personas is agents, used by some academics researching the creation of digital representations of people or groups. These dynamic entities (often referred to as AI Agents) are now seeing rapid investment acceleration in the enterprise sector.

Digital Twins: A specific type of persona. With digital twins, each digital agent is created to be a replica of a real human being. For example, you might take members of an online community and, for each member, use their past survey data and qualitative information to create an agent representing that individual.

Next Steps

I look forward to discussing the ideas I've put forward in this document with others. I don't imagine every point I'm making will be agreed upon.

What I'm trying to do is set out a framework people can examine and then add to, delete from, change, debate, discuss, and contest.

Also, let's all try to be respectful of different points of view. Synthetic data has real enthusiasts and absolute rejectionists, with most people somewhere in between. But I'm sure everyone approaches it with the best intentions, and we should bear that in mind when someone holds a very different view to our own.

Discussion

What are your questions, suggestions, and additional points?
