Vartia Memory Safety Eval: A bottom-up classification of memory safety vulnerabilities in C and C++
Why frontier LLMs fail at vulnerability classification—and what it reveals about CWE labels
Part 1 of the Vartia Memory Safety Eval (MemSafeEval) series.
Memory-related bugs are some of the most common and costly security failures in software.
The industry has been burned by these bugs since before the Morris Worm incident made headlines in 1988. Here we are, 37 years later, and the same class of security bugs is still shipping today.
Memory safety vulnerabilities consistently make up a large fraction of high-severity CVEs year after year, across operating systems, browsers, databases, and embedded systems. Despite decades of tooling and mitigation efforts, out-of-bounds accesses, use-after-free errors, and related defects remain among the most reliable ways to turn ordinary bugs into security exploits.
To prevent these vulnerabilities, we first need to understand them: what the failure mode is, what conditions trigger it, and how to prevent the next one from shipping. This requires a way to detect vulnerable code and evaluate the accuracy of our detectors.
I am building Vartia Memory Safety Eval (MemSafeEval), a dataset + evaluation harness for training and testing code embeddings on root-cause equivalence across memory-safety vulnerabilities.
Classifying Vulnerabilities: The Easy Baseline That Wasn’t
I began by running a simple baseline experiment: test whether state-of-the-art LLMs can correctly classify real security vulnerabilities.
The source data was the Common Vulnerabilities and Exposures (CVE) database. CVEs include natural-language descriptions, one or more Common Weakness Enumeration (CWE) labels, and often the code patch that fixed the issue.
I downloaded roughly 20 years of CVE data from NIST and filtered for memory-safety-related CWEs. CVEs with available GitHub commits were converted into a dataset of vulnerable functions paired with their assigned weaknesses.
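The filtering step can be sketched roughly as follows. The record shape and field names here (`cwe_ids`, `patch_commit`) are illustrative stand-ins, not the actual NVD schema, and the CWE set is a small sample of the memory-safety categories, not the full filter list.

```python
# Illustrative sketch of the CVE filtering step.
# Field names are hypothetical; real NVD records need more unpacking.
MEMORY_SAFETY_CWES = {
    "CWE-119", "CWE-120", "CWE-121", "CWE-122",
    "CWE-125", "CWE-787", "CWE-416",
}

def select_usable(records):
    """Keep CVEs that carry a memory-safety CWE and link a GitHub fix commit."""
    out = []
    for rec in records:
        cwes = set(rec.get("cwe_ids", []))
        commit = rec.get("patch_commit")
        if cwes & MEMORY_SAFETY_CWES and commit and "github.com" in commit:
            out.append(rec)
    return out
```

Each surviving record can then be expanded into (vulnerable function, assigned CWEs) pairs by walking the linked commit diff.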
The result:
- ~1300 CVEs
- 79% had a single CWE label
The task given to the LLMs was intentionally simple:
Given the vulnerable code and its patch, return one or more CWEs that describe the weakness.
If the model returned any CWE listed in the CVE record, it was scored correct. Matching was exact: returning a parent or child of a listed CWE earned no credit. This was meant to be a loose baseline, not a strict evaluation.
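The scoring rule reduces to a set intersection. A minimal sketch:

```python
def is_correct(predicted_cwes, gold_cwes):
    """Loose baseline scoring: any exact overlap with the CVE's listed CWEs counts.
    No hierarchy credit: a parent or child of a gold label is still wrong."""
    return bool(set(predicted_cwes) & set(gold_cwes))
```

Under this rule, `is_correct(["CWE-125", "CWE-787"], ["CWE-787"])` is `True`, but `is_correct(["CWE-787"], ["CWE-122"])` is `False`, even though the two labels are closely related.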
I assumed modern LLMs would perform extremely well.
They did not.
Frontier models returned a correct CWE less than 50% of the time.
Are We Asking the Wrong Question?
This failure raises a deeper question:
Are top-down CWE taxonomies compatible with code-centric evaluation at all?
After digging into the dataset, the answer became clear: the problem is not the models—it is the labels.
CWE labels are assigned manually, drawing on a taxonomy built through a community-driven process spanning decades. That process is valuable, but it struggles to produce consistent, fine-grained, code-level classifications.
This is not an argument that CWEs are incorrect or useless.
It is an argument that they are misaligned with the task of classifying vulnerabilities directly from code.
Look at Your Data
To understand what was happening, I looked at the label distribution.

The dataset is extremely imbalanced. A small number of CWEs dominate the majority of examples.
Next, I examined where the model was wrong.

The confusion matrix compares the dataset label with the LLM-predicted label. Correct predictions lie on the diagonal; each off-diagonal cell counts one kind of misprediction.
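With the (dataset label, predicted label) pairs in hand, the clustering can be read off programmatically. A sketch that surfaces the dominant off-diagonal cells:

```python
from collections import Counter

def top_confusions(pairs, n=3):
    """pairs: iterable of (dataset_label, predicted_label) tuples.
    Returns the n most frequent off-diagonal cells, i.e. the dominant error modes."""
    counts = Counter((gold, pred) for gold, pred in pairs if gold != pred)
    return counts.most_common(n)
```

Sorting the off-diagonal cells this way is what makes the pattern below jump out instead of staying buried in a large matrix.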
The errors are not random. They cluster.
CWE-119 is both:
- the most over-predicted label (predicted when the dataset label was something else)
- the most under-predicted label (missed when the dataset label was CWE-119)
This makes sense. CWE-119 (“Improper Restriction of Operations within the Bounds of a Memory Buffer”) is a parent of CWE-120, CWE-125, CWE-787, and a grandparent of others like CWE-121 and CWE-122.
It acts as a semantic catch-all.
When Two Labels Are Both Correct
Consider CWE-122 (Heap-based Buffer Overflow) and CWE-787 (Out-of-Bounds Write).
For CVE-2025-54949, the official label is CWE-122. The LLM predicted CWE-787.
The code uses memcpy to copy data into a heap buffer without verifying that the buffer has sufficient space.
Both labels are correct.
- The mechanism is an out-of-bounds write.
- The consequence is a heap-based buffer overflow.
The CWE hierarchy forces a choice. The code does not.
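To make the overlap concrete, here is a sketch encoding the parent relations mentioned earlier. The CWE-121/122 → CWE-787 edges are an assumption consistent with the grandparent relationship noted above; this is a fragment, not the full CWE graph.

```python
# Fragment of the CWE parent relations discussed in this article (not the full hierarchy).
PARENT = {
    "CWE-120": "CWE-119",
    "CWE-125": "CWE-119",
    "CWE-787": "CWE-119",
    "CWE-121": "CWE-787",  # assumed edge: stack overflow as a kind of out-of-bounds write
    "CWE-122": "CWE-787",  # assumed edge: heap overflow as a kind of out-of-bounds write
}

def ancestors(cwe):
    """Walk the parent chain for a CWE within this fragment."""
    chain = []
    while cwe in PARENT:
        cwe = PARENT[cwe]
        chain.append(cwe)
    return chain

def related(a, b):
    """True if the two labels sit on the same branch or share an ancestor."""
    return bool(({a, *ancestors(a)}) & ({b, *ancestors(b)}))
```

In this fragment, CWE-122 and CWE-787 lie on the same branch, so treating the model's answer as simply wrong discards exactly the relationship the flat scoring rule ignores.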
Why Bottom-Up Classification Matters
The goal was never to reproduce CWE assignments.
The goal was to classify vulnerabilities as they manifest in code.
Top-down labeling forces models to guess which level of abstraction the human annotator chose. That information is often absent from the code itself.
To build a code-centric evaluation system, weakness categories must be derived bottom-up, from concrete failure modes observed in real code—not imposed from a predefined hierarchy.
What’s Next
In the next article, I’ll describe how I built a bottom-up taxonomy for memory safety vulnerabilities, derived directly from observed code patterns rather than inherited labels.
That taxonomy is the foundation of MemSafeEval.