MemSafeEval Part 3: Building a Synthetic Dataset for a Vulnerability Classifier

Generating thousands of verified vulnerabilities with sanitizers as the source of truth

CVE
Software Weakness
Memory Vulnerabilities
Embeddings
Synthetic Data
Author: Brian Williams

Published: January 4, 2026

Part 3 of the MemSafeEval series.

← Part 2: When Are Two Bugs the Same Bug

In Part 2, we ended with a set of 67 broad vulnerability signature types. Each signature type included a structured description, at least one vulnerable function, a narrative explanation of the vulnerability, and a patch showing how the vulnerability was fixed.

At that point, we had at least one code sample for each of the 67 signature types, but only a small number of training pairs.

Fewer than 10 signature types had enough code samples to support meaningful comparison, and the majority had fewer than five training pairs. To train a model to correctly identify all of these signature types, we need on the order of 50–100 code sample pairs per category.

This kind of imbalance is expected. Real-world CVEs cluster heavily around a small number of bug patterns, leaving many categories underrepresented. Fortunately, this is a problem we can solve by using a generative model to create synthetic code sample pairs for training.

What Makes a Good Code Sample

Not all code samples are equally useful for training embeddings. A good synthetic code sample needs to satisfy several constraints simultaneously:

  • It embodies the same vulnerability mechanism as the original code sample.
  • It looks like it came from a real production codebase:
    • non-trivial control flow
    • custom data structures
    • realistic naming and structure
  • It has no obvious tells that it is synthetic.
  • It is easy to compile, run, and instrument.
  • Most importantly, it can be validated algorithmically:
    • no human judgment
    • no “looks right” reasoning
    • the bug either triggers or it doesn’t

If a code sample cannot be verified automatically, it is not suitable as training data.
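
To make "validated algorithmically" concrete, the check at the heart of the pipeline is a sanitizer run: compile the candidate with AddressSanitizer and UndefinedBehaviorSanitizer, execute it, and look for a sanitizer report. Below is a minimal sketch of that check; the function name and file handling are illustrative, not the project's actual harness.

```python
import subprocess
import tempfile
from pathlib import Path

def triggers_sanitizer(source: Path) -> bool:
    """Compile a C sample with ASan/UBSan and report whether the bug fires.

    Returns True if a sanitizer report appears at runtime, False if the
    program runs cleanly. (Illustrative sketch, not the real harness.)
    """
    with tempfile.TemporaryDirectory() as tmp:
        binary = Path(tmp) / "sample"
        subprocess.run(
            [
                "clang", "-g", "-O1",
                "-fsanitize=address,undefined",
                "-fno-sanitize-recover=all",
                "-fno-omit-frame-pointer",
                str(source), "-o", str(binary),
            ],
            check=True,
        )
        # Sanitizers write their report to stderr when the bug triggers.
        run = subprocess.run([str(binary)], capture_output=True, text=True)
        return "AddressSanitizer" in run.stderr or "runtime error:" in run.stderr
```

A complete harness runs this check twice per pair: the vulnerable sample must produce a report, and the patched version must run cleanly under the same flags.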

Why Pairs — and Why Negatives Matter

The goal of this dataset is to train embeddings, not a simple classifier. Embeddings learn by comparing code samples to one another. As described in Part 2, the final pairs are determined by using an LLM to judge whether two code samples share the same vulnerability.

That means we need:

  • Positive pairs: two code samples of the same vulnerability
  • Negative pairs: two code samples of different vulnerabilities

But there is a third, critical category: hard negative pairs.

A hard negative pair consists of a vulnerable sample and a sample that is identical to it in every meaningful way except for the bug itself. In this project, the most important hard negatives are the patched versions of the vulnerable functions.

These hard negative pairs force the model to learn what actually matters. They prevent superficially similar but fundamentally different bugs from collapsing into the same cluster. Without them, the model learns the wrong signal.

In short:

  • positives pull embeddings together
  • negatives push them apart
  • hard negatives sharpen the boundary
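
To make the pairing concrete, here is a minimal sketch of how positive, negative, and hard negative pairs can be assembled from a pool of validated samples. The `samples` structure and the label convention (1 for same vulnerability, 0 for different) are illustrative.

```python
import itertools
import random

def build_pairs(samples: dict, seed: int = 0):
    """Build (code_a, code_b, label) pairs from validated samples.

    `samples` maps each signature type to a list of
    (vulnerable_code, fixed_code) tuples.
    """
    rng = random.Random(seed)
    positives, negatives, hard_negatives = [], [], []

    for sig, pairs in samples.items():
        vulns = [v for v, _ in pairs]

        # Positives: two vulnerable samples of the same signature type.
        positives += [(a, b, 1) for a, b in itertools.combinations(vulns, 2)]

        # Hard negatives: a vulnerable sample and its own patched version.
        hard_negatives += [(v, f, 0) for v, f in pairs]

        # Negatives: a vulnerable sample paired with one from a different
        # signature type.
        others = [v for other_sig, ps in samples.items() if other_sig != sig
                  for v, _ in ps]
        for v in vulns:
            if others:
                negatives.append((v, rng.choice(others), 0))

    return positives, negatives, hard_negatives
```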

How Do We Know the Code Samples Are Correct

Bad labels poison embedding training. Empirically, even modest label corruption measurably degrades learned representations and downstream accuracy (Li et al. 2022).

Because of this, every code sample used for training must be generated and validated through a repeatable, unambiguous process.

That process starts with precise definitions.

Defining the Vulnerability Precisely

As in earlier stages of the project, manually writing detailed definitions for all 67 signature types was not practical. Instead, we used a structured process to derive canonical vulnerability definitions from the existing code samples.

For each signature type, an LLM was used to:

  • summarize the defining characteristics of the vulnerability
  • identify what must be present for the bug to exist
  • identify what must not be present
  • describe how the vulnerability is typically fixed

These definitions were captured in what we call a vulnerability card.

Each vulnerability card includes:

  • a natural-language definition of the bug
  • constraints on how it should appear in code
  • instructions for creating both a vulnerable version and a fixed version
  • guidance on how the vulnerability should be triggered and detected
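
The card itself is written in natural language, but its structure maps cleanly onto a small data type. A minimal sketch (the field names are illustrative, not the exact schema used in the pipeline):

```python
from dataclasses import dataclass, field

@dataclass
class VulnerabilityCard:
    """Canonical definition of one vulnerability signature type (sketch)."""
    signature_id: str
    definition: str                  # natural-language definition of the bug
    must_be_present: list = field(default_factory=list)      # required ingredients
    must_not_be_present: list = field(default_factory=list)  # disqualifying patterns
    typical_fix: str = ""            # how the vulnerability is usually patched
    appearance_constraints: list = field(default_factory=list)  # how it should show up in code
    trigger_guidance: str = ""       # how to trigger and detect it at runtime
```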

Based on this vulnerability card, a local coding agent generates:

  • a new vulnerable code sample
  • the corresponding fix
  • a script to compile, instrument, and verify both

The script is the source of truth: if it passes, the code sample is accepted. Additional human spot-checking is still needed to verify the accuracy of the script and the code samples.

Single Code Sample Generation Loop

The initial setup for each signature type follows a tight feedback loop:

  1. The agent reads the vulnerability card.
  2. The agent produces a code sample containing the vulnerability.
  3. The agent adds code to reliably trigger the vulnerability.
  4. The agent writes an evaluation harness to compile, instrument, and validate the code sample.
  5. The evaluation harness is executed to confirm the vulnerability triggers.
  6. If validation fails too frequently, the vulnerability card or code sample is updated.

This loop continues until one fully validated code sample exists for the signature type, along with a reliable validation script.

At this point, the signature type is ready for scaling.

All the evaluation harnesses across the signature types are then compared, and a set of canonical evaluation harnesses is created. We ended up with eight classes of evaluation harness for the 67 signature types.

Batch Code Sample Generation

Once the environment is stable, batch generation begins.

A secondary, cheaper model is invoked with a simplified prompt:

  • generate a new vulnerable code sample matching the vulnerability card
  • generate the corresponding fix

Each generated pair is placed in its own directory and validated using the same script. If validation passes, the code sample is accepted. If it fails, the code sample is rejected.

When failure rates are high, the vulnerability card is refined and the process repeats. This continues until most generated code samples are accepted. Once a signature type reaches its target number of code samples, it is considered complete.
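
A hedged sketch of that batch loop, assuming a `generate_candidate(card)` helper that calls the cheaper model and the `triggers_sanitizer` check sketched earlier (the names, target count, and threshold are illustrative):

```python
from pathlib import Path

TARGET_SAMPLES = 30      # illustrative target per signature type
MIN_ACCEPT_RATE = 0.5    # below this, stop and refine the vulnerability card

def batch_generate(card, out_dir: Path, generate_candidate) -> float:
    """Generate and validate candidate pairs; return the acceptance rate."""
    accepted, attempted = 0, 0
    while accepted < TARGET_SAMPLES:
        attempted += 1
        vuln_code, fixed_code = generate_candidate(card)

        # Each generated pair gets its own directory.
        sample_dir = out_dir / f"sample_{attempted:04d}"
        sample_dir.mkdir(parents=True, exist_ok=True)
        vuln_path = sample_dir / "vulnerable.c"
        fixed_path = sample_dir / "fixed.c"
        vuln_path.write_text(vuln_code)
        fixed_path.write_text(fixed_code)

        # Accept only if the bug fires in the vulnerable version and the
        # patched version runs cleanly under the same sanitizers.
        if triggers_sanitizer(vuln_path) and not triggers_sanitizer(fixed_path):
            accepted += 1

        # If too many candidates fail, stop and refine the card rather than
        # burning more generation calls.
        if attempted >= 10 and accepted / attempted < MIN_ACCEPT_RATE:
            break

    return accepted / attempted
```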

Expanding Coverage by Enumerating Subtypes Within a Signature Type

Even with a stable signature definition, each signature type usually contains multiple subtypes (distinct ways the same underlying vulnerability shows up in real code). If we don’t cover those subtypes, the embedding model risks learning only limited instances of the signature type.

The generation strategy is then:

  1. For each signature type, generate a subtype checklist.
  2. Require the generator to produce code samples for each subtype, not just “another random sample.”
  3. Validate each candidate with the same evaluation harness (sanitizers + trigger), and accept only code samples that pass.
  4. Track subtype coverage in the dashboard, so “30 code samples per signature type” is now “30 code samples per subtype”.

This turns synthetic data generation into a deliberate coverage plan: not just more data, but representative data across the realistic ways a vulnerability occurs.
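
In code, the coverage plan is mostly bookkeeping: count validated samples per (signature type, subtype) and ask the generator only for what is still missing. A minimal sketch, with a made-up checklist entry for illustration:

```python
from collections import Counter

TARGET_PER_SUBTYPE = 30  # illustrative per-subtype target

# One entry of an LLM-generated subtype checklist (illustrative).
subtype_checklist = {
    "heap-buffer-overflow": [
        "off-by-one on loop bound",
        "unchecked length from caller",
        "size arithmetic overflow before allocation",
    ],
}

def remaining_work(accepted: Counter) -> dict:
    """Return how many more samples each (signature, subtype) still needs.

    `accepted` counts validated samples keyed by (signature, subtype).
    """
    todo = {}
    for signature, subtypes in subtype_checklist.items():
        for subtype in subtypes:
            missing = TARGET_PER_SUBTYPE - accepted[(signature, subtype)]
            if missing > 0:
                todo[(signature, subtype)] = missing
    return todo
```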

Increasing Diversity Without Excess LLM Calls

After a stable pool of validated code samples exists, diversity can be expanded further by:

  • renaming variables and data structures using AST-based transformations
  • reorganizing code layout
  • slicing functions from larger real-world codebases and injecting the vulnerability
  • template seeding from real code: pull small, non-vulnerable functions from other codebases as “style templates,” then inject the same vulnerability pattern into those templates and fold the resulting functions into the dataset.

These transformations preserve the vulnerability semantics while increasing stylistic variation, reducing the need for additional LLM calls.
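
As one example, the renaming transform can be driven by the libclang Python bindings (`clang.cindex`), which make it easy to rewrite identifier tokens while leaving string literals and comments untouched. This is a simplified, token-level sketch rather than the project's actual tooling, and the rename map is illustrative.

```python
from clang import cindex

def rename_identifiers(source_path: str, rename_map: dict) -> str:
    """Rename identifiers in a C source file using libclang's tokenizer."""
    index = cindex.Index.create()
    tu = index.parse(source_path, args=["-std=c11"])
    with open(source_path) as f:
        src = f.read()

    # Collect edit spans for identifier tokens in the rename map, then apply
    # them back to front so earlier offsets stay valid.
    edits = []
    for tok in tu.cursor.get_tokens():
        if tok.kind == cindex.TokenKind.IDENTIFIER and tok.spelling in rename_map:
            start = tok.extent.start.offset
            edits.append((start, start + len(tok.spelling), rename_map[tok.spelling]))

    for start, end, replacement in sorted(edits, reverse=True):
        src = src[:start] + replacement + src[end:]
    return src

# Usage (illustrative):
# new_src = rename_identifiers("vulnerable.c",
#                              {"buf": "packet_payload", "len": "payload_len"})
```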

Human in the Loop

Although the system is largely automated, human oversight is still essential.

A Streamlit-based inspection tool was built to:

  • track success and failure rates per signature type
  • inspect validated and rejected code samples
  • spot-check generated code for subtle errors

Spot-checking a small number of code samples per signature type provides confidence that the system is behaving as intended.
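
Such a tool does not need to be elaborate. A hedged sketch of one Streamlit page, assuming a `load_results()` helper that returns one row per generated sample (the module and column names are illustrative):

```python
import streamlit as st

from results_store import load_results  # illustrative helper module

# Expected DataFrame columns: signature_type, sample_dir, accepted (bool), code (str).
df = load_results()

st.title("Synthetic sample inspection")

# Success/failure rates per signature type.
rates = df.groupby("signature_type")["accepted"].agg(["mean", "count"])
rates = rates.rename(columns={"mean": "accept_rate", "count": "samples"})
st.dataframe(rates.sort_values("accept_rate"))

# Spot-check individual accepted or rejected samples.
sig = st.selectbox("Signature type", sorted(df["signature_type"].unique()))
show_accepted = st.radio("Show", ["accepted", "rejected"]) == "accepted"
subset = df[(df["signature_type"] == sig) & (df["accepted"] == show_accepted)]
if not subset.empty:
    sample = st.selectbox("Sample", subset["sample_dir"].tolist())
    st.code(subset.loc[subset["sample_dir"] == sample, "code"].iloc[0], language="c")
```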

Wrapping Up

By the end of this process, each vulnerability signature type has:

  • a precise definition
  • a reliable evaluation harness
  • a growing pool of validated vulnerable and fixed code samples
  • positive, negative, and hard-negative pairs suitable for embedding training

This synthetic dataset forms the foundation for the next step: training and evaluating embeddings that capture the true structure of software vulnerabilities.

I’m turning this into a repeatable playbook for custom embeddings + evaluation loops. If you want to sanity-check your own pipeline or talk through an approach, book a call below.

References

Li, Shikun, Xiaobo Xia, Shiming Ge, and Tongliang Liu. 2022. “Selective-Supervised Contrastive Learning with Noisy Labels.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 316–25. https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Selective-Supervised_Contrastive_Learning_With_Noisy_Labels_CVPR_2022_paper.pdf.
