Discovering Concept-Editing Algorithms with LLM Agents

1 Introduction

To create safe models, we must be able to control what they know and use. Concept erasure is one tool for modifying a model’s representations. Rather than retraining or fine-tuning, concept erasure reaches into a model’s activations and removes a target concept, rendering it unusable.

The ability to erase concepts supports a range of AI-safety efforts. Concept erasure can help models unlearn knowledge needed to build bioweapons, remove sensitive attributes that enable discrimination, and ablate concepts that drive misalignment. But concepts are rarely isolated, detachable features, making perfect erasure difficult to achieve.

One prevalent method, LEAst-squares Concept Erasure (LEACE) (Belrose et al. 2023), removes the linear component of a concept in closed form. Given activations labeled by a binary concept (e.g., “sarcastic” vs. “not sarcastic”), LEACE removes the target concept’s linear component with the smallest possible edit. Afterwards, no linear classifier should be able to recover the target concept above chance, a property known as linear guardedness. LEACE’s successor, Quadratic LEAst-squares Concept Erasure (QLEACE) (Quirke and Belrose 2025), additionally equalizes the two classes’ covariances, so that no quadratic classifier can recover the target concept.
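To make the closed form concrete, here is a minimal NumPy sketch of LEACE for a binary label, based on our reading of the formula in Belrose et al. (2023): whiten the activations, project out the whitened concept direction, and unwhiten. The function names `fit_leace` and `erase` and the toy data are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def fit_leace(X, z, tol=1e-10):
    """Return (mu, A) so that erased activations are X - (X - mu) @ A.T."""
    n = len(X)
    mu = X.mean(axis=0)
    Xc = X - mu
    zc = z - z.mean()
    sigma_xx = Xc.T @ Xc / n   # (d, d) covariance of the activations
    sigma_xz = Xc.T @ zc / n   # (d,) cross-covariance with the label

    # Whitening matrix W = sigma_xx^(-1/2) and its pseudo-inverse, via eigh.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, s = evecs[:, keep], np.sqrt(evals[keep])
    W = (V / s) @ V.T
    W_pinv = (V * s) @ V.T

    # Orthogonal projection onto the whitened concept direction W @ sigma_xz.
    u = W @ sigma_xz
    u = u / np.linalg.norm(u)
    P = np.outer(u, u)
    return mu, W_pinv @ P @ W

def erase(X, mu, A):
    """Apply the affine edit x -> x - A (x - mu) to every row of X."""
    return X - (X - mu) @ A.T

# Toy data (assumption): class 1 is shifted along two coordinates.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)
shift = np.array([1.0, -0.5, 0, 0, 0, 0, 0, 0])
X = rng.normal(size=(2000, 8)) + 1.5 * z[:, None] * shift

mu, A = fit_leace(X, z)
X_erased = erase(X, mu, A)
gap_before = np.linalg.norm(X[z == 1].mean(0) - X[z == 0].mean(0))
gap_after = np.linalg.norm(X_erased[z == 1].mean(0) - X_erased[z == 0].mean(0))
```

On the data used to fit it, the edit drives the gap between class means to numerical zero, which is what leaves a linear probe with nothing to separate. Differences in covariance and higher-order structure are untouched.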

However, as linear methods modifying nonlinear representations, LEACE and QLEACE cannot fully erase target concepts in the model. LEACE matches the two classes’ means, but not their covariances. QLEACE matches the means and covariances but leaves higher-order structure untouched. An RBF-kernel SVM trained on LEACE-erased activations still recovered target concepts with 70%–95% accuracy. Nonlinear erasure, in turn, struggles to generalize: Kernelized Concept Erasure (Ravfogel et al. 2022) showed that protecting against one nonlinear classifier can leave the concept fully readable by another.
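The gap between matching means and matching covariances can be seen in a toy example. The sketch below uses synthetic Gaussian data (an assumption, not the activations from this post): both classes share the same mean, so there is no mean difference left to exploit, yet a single threshold on a quadratic feature still separates them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 16
z = rng.integers(0, 2, size=n)

# Class 1 has twice the spread of class 0.
scale = np.where(z == 1, 2.0, 1.0)[:, None]
X = rng.normal(size=(n, d)) * scale

# Center each class so the class means match exactly, as they would
# after a mean-matching erasure.
for c in (0, 1):
    X[z == c] -= X[z == c].mean(axis=0)

mean_gap = np.linalg.norm(X[z == 1].mean(0) - X[z == 0].mean(0))

# A quadratic feature (squared norm) with one threshold fit on the first half.
sq = (X ** 2).sum(axis=1)
train, test = slice(0, n // 2), slice(n // 2, n)
thr = 0.5 * (sq[train][z[train] == 0].mean() + sq[train][z[train] == 1].mean())
quadratic_acc = ((sq[test] > thr).astype(int) == z[test]).mean()
```

Here `mean_gap` is numerically zero while `quadratic_acc` sits well above chance. Equalizing covariances would close this particular channel, and any remaining higher-order differences would still be available to a sufficiently flexible probe.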

We tasked agents trained on our data with inventing concept-erasure algorithms for neural network activations that outperform LEACE and QLEACE. Given only a description of the task and an L2 edit-distance budget, our agents discovered six distinct families of erasure algorithms across 50 concepts. The best algorithm reduced the average accuracy of a nonlinear probe from 99% to 70%, whereas LEACE only reduced accuracy to 88%. Additionally, on a held-out random forest classifier, the best algorithm reduced accuracy from 99% to 72%, whereas LEACE only reduced accuracy to 82%.

Concepts tested: 50

Algorithm families discovered: 6

Best algorithm beats LEACE (SVM): 50/50

Avg best SVM accuracy: 70%

2 The Experiment

Each agent was given a target concept, a small labeled sample, and a Gemma-3 270M model, and was asked the following:

Create a procedure that edits the model’s activations to remove a concept such that nonlinear classifiers cannot recover it. Your modification must stay within the L2 budget of LEACE.

The agents did not receive an algorithm or a published solution. They could not view the grader, which evaluated their method on held-out activations with a fresh nonlinear probe. Agents had to:

Analyze the activation geometry to understand why LEACE leaves a nonlinear signal.

Devise an algorithm to remove it.

Implement, debug, and optimize hyperparameters.

Keep the edit within the specified L2 budget (same amount of modification LEACE uses).
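The budget constraint in the last step can be sketched as follows. This post does not spell out how the budget is measured, so the sketch assumes it is the mean squared L2 distance between original and edited activations; `edit_cost` and `fit_to_budget` are hypothetical helper names, and the budget value is arbitrary.

```python
import numpy as np

def edit_cost(X, X_edited):
    """Mean squared L2 distance between original and edited activations."""
    return float(((X_edited - X) ** 2).sum(axis=1).mean())

def fit_to_budget(X, X_edited, budget):
    """Shrink the edit uniformly toward X until its cost is within budget."""
    cost = edit_cost(X, X_edited)
    if cost <= budget:
        return X_edited
    shrink = np.sqrt(budget / cost)   # cost scales with shrink squared
    return X + shrink * (X_edited - X)

# Toy example (assumption): a candidate edit that overshoots a budget of 1.0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X_candidate = X + rng.normal(size=(100, 4))
X_clipped = fit_to_budget(X, X_candidate, budget=1.0)
```

In practice the budget would be set to the cost of the LEACE edit on the same activations. Uniform shrinking is only the simplest way to comply; it also weakens the erasure, so an algorithm would more plausibly build the budget into its objective.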

We ran 560 independent rollouts across 50 concepts, ranging from “sarcasm” and “humor” to “metacognition” and “hypotheticality.” We used two metrics to evaluate performance against the LEACE baseline: SVM Accuracy and Random Forest Accuracy.

3 Results

3.1 SVM Accuracy

SVM Accuracy was our primary evaluation metric. An RBF-kernel SVM is trained on data the agent never saw and evaluated on a separate test set. An SVM Accuracy score of 50% is equal to random chance (i.e., perfect erasure). A score higher than 50% means the target concept is still recoverable.

Average SVM Accuracy — Baseline vs. LEACE vs. Agents

For every concept, the best agent solution beat LEACE on SVM Accuracy. On average, LEACE reduced SVM Accuracy to 88%, while our agents reduced it to 70.1%. The Appendix shows the per-concept SVM results for all 50 concepts.

3.2 Random Forest Accuracy

As a test of generalization, we also evaluated with a Random Forest classifier (100 trees, max depth 10). This is a different nonlinear classifier family, one the agents were not optimizing against. If a Random Forest also fails to recover the target concept, that indicates the agents found an erasure that transfers beyond the specific SVM they were graded on, which would suggest genuine distribution-matching rather than overfitting to one kernel.
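The two probes described in this section can be sketched with scikit-learn. Only the Random Forest settings (100 trees, max depth 10) come from the text; the SVM hyperparameters (library defaults), the train/test split, the helper name `probe_accuracies`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def probe_accuracies(X_train, z_train, X_test, z_test, seed=0):
    """Fit fresh nonlinear probes on edited activations and score them."""
    probes = {
        "svm_rbf": SVC(kernel="rbf"),
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=10, random_state=seed
        ),
    }
    return {
        name: clf.fit(X_train, z_train).score(X_test, z_test)
        for name, clf in probes.items()
    }

# Toy stand-in for erased activations (assumption): the classes share a
# mean but differ in spread, so nonlinear signal remains.
rng = np.random.default_rng(0)
n, d = 2000, 8
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) * np.where(z == 1, 2.0, 1.0)[:, None]

half = n // 2
scores = probe_accuracies(X[:half], z[:half], X[half:], z[half:])
```

Scores near 0.5 would indicate erasure that holds up against both probe families. On this toy data both probes stay well above chance, mirroring the residual signal left by mean-only matching.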

Average Random Forest Accuracy — Baseline vs. LEACE vs. Agents

Random Forest recovered some concepts the SVM could not, but our best agent solutions generally beat LEACE. On average, LEACE reduces Random Forest accuracy to...
