MemSafeEval Part 2: When Are Two Bugs the Same Bug?

How to classify vulnerabilities when the ground truth doesn’t exist

Categories: CVE, Software Weakness, Memory Vulnerabilities

Author: Brian Williams

Published: December 23, 2025


Part 2 of the Vartia Memory Safety Eval (MemSafeEval) series.

← Part 1: A bottom-up classification of memory safety vulnerabilities in C and C++

In part 1, we showed that for the purpose of identifying weaknesses in C code, existing CWE labels are poorly aligned with code-centric analysis. The problem is not the labels themselves, but the ambiguity introduced by hierarchical specificity and by mixing causes with consequences. The same vulnerable code can be described correctly at multiple levels of the CWE hierarchy, while different failure mechanisms can share a label.

The result is a dataset that does not reliably capture the properties of the code needed to evaluate or train a classifier. If labels do not define a stable equivalence relation, classification quality cannot be measured because it is unclear what it even means for two vulnerabilities to be “the same.”

At its core, building a classifier is a question of sameness.

For the purpose of identifying memory safety vulnerabilities, we define same as follows:

Definition (Bug Sameness)
Two vulnerabilities are the same if they exhibit the same failure mechanism as visible in code.

This definition is:

  1. Operationally defined (what the code does)
  2. Non-hierarchical (flat)
  3. Independent of reporting convention and taxonomy depth

In this study, we analyze approximately 1,300 CVEs. At this scale, fully manual analysis is infeasible, so LLMs are used to assist with judgment, and human auditors spot-check the LLMs' judgments.

LLM as a Judge

Prior work has explored the use of large language models as evaluators of model outputs and annotations. In particular, Judging LLM-as-a-Judge (Zheng et al. 2023) reports high agreement between LLM judgments and human raters across a range of evaluation tasks.

These results do not imply that LLMs are universally reliable judges. However, they suggest that LLMs can plausibly be used for constrained judgment tasks when the task definition is clear and the output space is limited.

The task in this work, deciding whether two code examples exhibit the same underlying vulnerability, is intentionally framed as a pairwise judgment. Pairwise decisions of same versus not-same are substantially more direct than assigning a vulnerability to a position in a predefined taxonomy. Hierarchical ambiguity and cause-effect conflation, which complicate CWE labeling, do not arise in the same way.

By treating pairwise comparison as the primitive operation, we avoid committing to a rigid taxonomy upfront. Instead, equivalence judgments can serve as the foundation for later structure, rather than being constrained by it.
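
To make the primitive concrete, here is a minimal sketch of what a pairwise judgment call could look like. The prompt wording, the call_llm stub, and the SAME / NOT_SAME / UNSURE labels are assumptions for illustration, not the implementation used in MemSafeEval.

    from dataclasses import dataclass

    # Illustrative sketch only: the prompt wording, call_llm stub, and label set
    # are assumptions, not the production implementation.

    @dataclass
    class PairVerdict:
        label: str      # "SAME", "NOT_SAME", or "UNSURE"
        rationale: str  # free-text explanation, kept for human audit

    def call_llm(prompt: str) -> str:
        """Stub: wire this to an LLM provider of your choice."""
        raise NotImplementedError

    def build_prompt(sample_a: str, sample_b: str) -> str:
        return (
            "You are comparing two vulnerable C/C++ code samples.\n"
            "Decide whether they exhibit the same failure mechanism as visible in code.\n"
            "Answer on the first line with exactly one of: SAME, NOT_SAME, UNSURE,\n"
            "then give a short rationale.\n\n"
            f"Sample A:\n{sample_a}\n\nSample B:\n{sample_b}\n"
        )

    def judge_pair(sample_a: str, sample_b: str) -> PairVerdict:
        reply = call_llm(build_prompt(sample_a, sample_b))
        first_line, _, rest = reply.partition("\n")
        label = first_line.strip().upper()
        if label not in {"SAME", "NOT_SAME", "UNSURE"}:
            label = "UNSURE"  # treat malformed output as a refusal, not a guess
        return PairVerdict(label=label, rationale=rest.strip())

Keeping the output space to three labels is what makes the judgment constrained and auditable: a malformed reply degrades to UNSURE rather than to a silent guess.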

Trust, but Verify

Any use of automated judgment requires explicit verification. LLM-based analysis was therefore audited at each stage of the process. Custom UI tooling was built to support efficient human review, allowing us to inspect code, patches, and model rationales side by side.

After the initial analysis pass, results were spot-checked across multiple groupings. These checks were broadly aligned with human judgment and did not reveal systematic disagreement. While this does not establish correctness in an absolute sense, it provides confidence that the judgments are usable when combined with explicit refusal outcomes and human oversight.

Reducing the Comparison Space

Even with automated judgment, exhaustive pairwise comparison is infeasible. Comparing 1,300 items yields 844,350 possible pairs, which is prohibitively expensive and unnecessary. Most comparisons are uninformative, as many vulnerabilities are clearly unrelated.

To make pairwise judgment tractable, we introduce a grouping strategy. This reintroduces a form of taxonomy, but with a different purpose and constraint. The groups are not intended to define vulnerability classes. They serve only as lightweight scaffolding to group plausibly related items and reduce the comparison space.

Crucially, this structure is defined loosely and refined iteratively, allowing the data to guide its evolution. Taxonomy here is an instrument for efficiency, not a claim about ground truth.
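
To see the payoff in numbers, a small back-of-the-envelope sketch follows. The exhaustive count reproduces the figure above; the bucket sizes are invented purely for illustration.

    import math

    n = 1300
    exhaustive = math.comb(n, 2)  # n * (n - 1) / 2
    print(exhaustive)             # 844350 pairs if every sample is compared with every other

    # Invented bucket sizes, summing to 1,300, purely to illustrate the effect of grouping.
    bucket_sizes = [400, 250, 150, 100] + [20] * 20
    within_buckets = sum(math.comb(size, 2) for size in bucket_sizes)
    print(within_buckets)         # 130850 pairs, roughly 15% of the exhaustive count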

Unit of Analysis

Diagram 1

Although we start from CVEs, the unit of analysis in MemSafeEval is a vulnerable code sample: a patch-local, function-level region sufficient to explain the bug and its fix. This distinction is important. CVEs are reporting artifacts; code samples are the objects that exhibit failure mechanisms and can be compared meaningfully.

The process starts by querying all CVEs whose weaknesses fall into any of the memory vulnerability categories. If a CVE references a git commit and that commit touches C or C++ code, the source is downloaded. The vulnerable files, together with the patch that fixes them, are then given to an LLM to extract code samples. A single CVE may yield zero, one, or multiple vulnerable samples, depending on how many distinct vulnerable sites are present. If no suitable vulnerable function can be identified, the model is instructed to return an insufficient_information message. The validated samples are saved into a database for use in the next step of the process.
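
The sketch below outlines this extraction step. Every helper (fetch_memory_safety_cves, has_c_or_cpp_changes, download_sources, extract_samples_with_llm, save_samples) is a hypothetical stand-in for the actual tooling; only the control flow mirrors the description above.

    # Hypothetical outline of the extraction step. Every helper below is an
    # illustrative stand-in; only the control flow mirrors the text above.

    INSUFFICIENT = "insufficient_information"

    def fetch_memory_safety_cves() -> list[dict]:
        """Return CVE records whose weakness falls in a memory-vulnerability category."""
        raise NotImplementedError

    def has_c_or_cpp_changes(commit: dict) -> bool:
        """True if the fixing commit touches C or C++ sources."""
        raise NotImplementedError

    def download_sources(commit: dict) -> tuple[list[str], str]:
        """Return the pre-patch vulnerable files and the patch text."""
        raise NotImplementedError

    def extract_samples_with_llm(files: list[str], patch: str) -> list[dict] | str:
        """Ask the model for function-level vulnerable samples, or INSUFFICIENT."""
        raise NotImplementedError

    def save_samples(cve_id: str, samples: list[dict]) -> None:
        raise NotImplementedError

    def build_samples() -> None:
        for cve in fetch_memory_safety_cves():
            commit = cve.get("fix_commit")
            if not commit or not has_c_or_cpp_changes(commit):
                continue
            files, patch = download_sources(commit)
            result = extract_samples_with_llm(files, patch)
            if result == INSUFFICIENT:
                continue  # a CVE may legitimately yield zero samples
            save_samples(cve["id"], result)  # zero, one, or many samples per CVE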

Primary Analysis

Diagram 2

Each sample is then analyzed along the axes of a provisional schema. For each axis, the LLM selects the schema element that best matches the code. If no existing element is appropriate, the model returns OTHER along with a brief explanation or a suggested new element. These OTHER cases are reviewed periodically, and recurring patterns are promoted into the schema.

The intent of this process is iterative refinement. As more CVEs are analyzed, the schema should stabilize and the number of OTHER classifications should decrease.

In addition to structured labels, the analysis phase records the vulnerable function or functions, a concise summary, and free-form tags. Crucially, the LLM is explicitly encouraged to return unevaluatable when insufficient evidence is available to make a reliable judgment.
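
For concreteness, here is a minimal sketch of what one analysis record might look like, assuming a simple axis-to-element mapping. The field names and the needs_review rule are illustrative assumptions, not the exact production schema.

    from dataclasses import dataclass, field

    OTHER = "OTHER"
    UNEVALUATABLE = "unevaluatable"

    @dataclass
    class SampleAnalysis:
        """Illustrative shape of one analysis record; the field names are assumptions."""
        sample_id: str
        axes: dict[str, str]          # axis name -> schema element, or OTHER
        other_notes: dict[str, str] = field(default_factory=dict)  # axis -> suggested new element
        vulnerable_functions: list[str] = field(default_factory=list)
        summary: str = ""
        tags: list[str] = field(default_factory=list)
        outcome: str = "analyzed"     # or UNEVALUATABLE when the evidence is insufficient

    def needs_review(analysis: SampleAnalysis) -> bool:
        """OTHER and unevaluatable results are queued for periodic human review."""
        return analysis.outcome == UNEVALUATABLE or OTHER in analysis.axes.values()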

I ran this process expecting the schema to converge quickly on a small set of category elements, with the distribution of labels following an 80/20 pattern in which a handful of categories cover most cases. Instead, the first schema was too granular: many samples required special-case labels, and far too many analyses came back as OTHER.

The problem was one of granularity. The categories were neither stable nor reusable.

I reset the schema to be deliberately broader and less precise, aiming for a mid-level granularity. Repeating the analysis under the revised schema led to rapid convergence: the number of OTHER classifications dropped sharply, and the remaining enumeration values filled in quickly.

The Stabilized Schema

The final schema converged on three categories:

  • Primary Vulnerability Category: the class of invariant that is violated (e.g., buffer overflow, use after free, type confusion)

  • Trigger: the immediate causal mistake that leads to the violation (e.g., missing bounds check, integer truncation, stale pointer, misuse of a length field)

  • Sink: the function, operation, or language construct through which the violation manifests (e.g., pointer dereference, array indexing, free())

If any of these properties cannot be determined without speculation, the analysis is rejected.

This decomposition separates the vulnerability mechanism from irrelevant details (patch shape, naming conventions, code style), and provides a basis for judgment that doesn’t depend on taxonomy labels.

These axes are semi-orthogonal, limited in scope, and expressive enough to group similar vulnerabilities without being predetermined. Together, they constrain classification while leaving room for judgment where the code itself is ambiguous.

For the purposes of bucketing and comparison, the three values are combined into a vulnerability signature. The signature is not intended to define sameness. It serves only to group plausibly related code samples and reduce the space of pairwise comparisons.
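
As a concrete illustration, the signature can be treated as a simple composite key over the three axes. The normalization and separator below are assumptions; only the idea of a coarse bucketing key comes from the text.

    def _normalize(value: str) -> str:
        return value.strip().lower().replace(" ", "_")

    def vulnerability_signature(category: str, trigger: str, sink: str) -> str:
        """Compose a coarse bucketing key from the three axes (illustrative format)."""
        return "|".join(_normalize(v) for v in (category, trigger, sink))

    # Two samples sharing this signature are only *candidates* for sameness;
    # the pairwise judgment still decides whether they share a failure mechanism.
    print(vulnerability_signature("out-of-bounds write", "missing bounds check", "memcpy"))
    # -> out-of-bounds_write|missing_bounds_check|memcpy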

To validate the analysis process, we spot-checked 30 analyzed samples across 10 distinct buckets. No material disagreements were found. Custom tooling was built to support this review, including Streamlit-based interfaces that allow auditors to quickly inspect code, metadata, and model outputs in a searchable format.
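
For a sense of what such a review interface involves, here is a minimal Streamlit sketch. The load_bucket stub and the record layout are hypothetical; the real tooling is richer.

    # Minimal review sketch (hypothetical data layout); run with: streamlit run review.py
    import streamlit as st

    def load_bucket(signature: str) -> list[dict]:
        """Stub: replace with a real database query keyed on the signature."""
        return [{
            "cve_id": "CVE-XXXX-XXXXX",
            "vulnerable_code": "/* vulnerable function goes here */",
            "patch": "--- a/file.c\n+++ b/file.c",
            "category": "out-of-bounds write",
            "trigger": "missing bounds check",
            "sink": "memcpy",
            "summary": "Length taken from attacker input without validation.",
            "rationale": "Model explanation goes here.",
        }]

    st.title("MemSafeEval sample review")
    signature = st.text_input("Signature bucket", "out-of-bounds_write|missing_bounds_check|memcpy")

    for sample in load_bucket(signature):
        st.subheader(sample["cve_id"])
        code_col, meta_col = st.columns(2)
        with code_col:
            st.code(sample["vulnerable_code"], language="c")  # vulnerable function
            st.code(sample["patch"], language="diff")         # the fix, side by side
        with meta_col:
            st.json({k: sample[k] for k in ("category", "trigger", "sink", "summary")})
            st.write(sample["rationale"])                     # model rationale for audit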

Signature Is Not Sameness

CVE Match Review

This example illustrates a central constraint of the approach: signatures are not intended to define equivalence. They are deliberately coarse, designed to group plausibly related vulnerabilities while preserving the need for direct judgment. Two samples may share a signature and still represent meaningfully different bugs.

Analyses Marked as OTHER

Equally important, the analysis process allows explicit refusal. When a code sample does not clearly demonstrate a memory-safety failure, it is marked as OTHER rather than coerced into an ill-fitting category. This prevents noise from propagating into later stages of comparison and training.

This process resulted in 67 broad signature types. As expected, the distribution of examples across signatures followed a power-law pattern, with the majority of samples concentrated in a small number of categories.

Signature outcomes overview

At this point, the problem was no longer defining sameness, but achieving sufficient coverage across signature types. While out-of-bounds access vulnerabilities were well represented, most other signatures lacked enough positive pairs for training. The next article addresses this bottleneck by introducing synthetic, automatically verified examples.

Part 3: Building a Synthetic Dataset for Weakness Embeddings

I’m turning this into a repeatable playbook for custom embeddings + evaluation loops. If you want to sanity-check your own pipeline or talk through an approach, book a call below.

References

Zheng, L. et al. 2023. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” https://arxiv.org/abs/2306.05685.

We help teams with AI engineering. Book a consult to find out more.
