Discovering Concept-Editing Algorithms with LLM Agents



1 Introduction

To create safe models, we must be able to control what they know and use. Concept erasure is one tool for modifying a model’s representations. Rather than retraining or fine-tuning, concept erasure reaches into a model’s activations and removes a target concept, rendering it unusable.

The ability to erase concepts supports a range of AI-safety efforts. Concept erasure can help models unlearn knowledge needed to build bioweapons, remove sensitive attributes that enable discrimination, and ablate concepts that drive misalignment. But concepts are rarely isolated, detachable features, making perfect erasure difficult to achieve.

One prevalent method, LEAst-squares Concept Erasure (LEACE) (Belrose et al. 2023), removes the linear component of concepts in closed form. Given activations labeled by a binary concept (e.g., “sarcastic” vs. “not sarcastic”), LEACE removes the target concept’s linear component with the smallest possible edit. Afterwards, a linear classifier should not be able to recover the target concept above chance, a property known as Linear Guardedness. LEACE’s successor, Quadratic LEAst-squares Concept Erasure (QLEACE) (Quirke and Belrose 2025), additionally equalizes the two classes’ covariances, so that no quadratic classifier can recover the target concept.
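To make the closed form described above concrete, here is a minimal NumPy sketch for a binary concept. It assumes the activation covariance is invertible, in which case the LEACE formula from Belrose et al. (2023) reduces to a rank-one update. The function name `leace_binary` is hypothetical, and this is an illustration rather than the authors' reference implementation.

```python
import numpy as np

def leace_binary(X, z):
    """Sketch of LEACE for a binary concept (hypothetical helper).

    X: (n, d) activations; z: (n,) labels in {0, 1}.
    Assumes the sample covariance of X is invertible.
    """
    X = np.asarray(X, dtype=float)
    z = np.asarray(z, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu                                # centered activations
    sigma_xx = Xc.T @ Xc / len(X)              # covariance of X
    sigma_xz = Xc.T @ (z - z.mean()) / len(X)  # cross-covariance with the concept
    w = np.linalg.solve(sigma_xx, sigma_xz)    # Sigma_xx^{-1} sigma_xz
    denom = sigma_xz @ w
    # Rank-one edit: remove the component of each sample that covaries with z.
    return X - np.outer(Xc @ w, sigma_xz) / denom
```

After this edit, the sample cross-covariance between the activations and the label is zero, so the two class means coincide. The class covariances are left untouched.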

However, as linear methods modifying nonlinear representations, LEACE and QLEACE cannot fully erase target concepts in the model. LEACE matches the two classes’ means, but not their covariances. QLEACE matches the means and covariances but preserves the higher-order structure. An RBF-kernel SVM trained on LEACE-erased activations still recovered target concepts with 70%–95% accuracy. Nonlinear erasure struggles to generalize: Kernelized Concept Erasure (Ravfogel et al. 2022) showed that protecting against one nonlinear classifier can leave the concept fully readable by another.
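A toy example illustrates why matching means alone is not enough. The sketch below uses synthetic Gaussian data, not model activations: both classes share a mean but differ in variance, standing in for activations after a mean-matching edit. All sizes and scales are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10  # samples per class and dimension (toy values)

def sample(n):
    # Same mean (zero) for both classes, different spread.
    X0 = rng.normal(scale=1.0, size=(n, d))
    X1 = rng.normal(scale=3.0, size=(n, d))
    return np.vstack([X0, X1]), np.repeat([-1.0, 1.0], n)

X_train, y_train = sample(n)
X_test, y_test = sample(n)

# Linear probe: least-squares fit with an intercept column.
A_train = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
A_test = np.hstack([X_test, np.ones((len(X_test), 1))])
linear_acc = np.mean(np.sign(A_test @ w) == y_test)

# Quadratic probe: threshold on the squared norm of each sample.
sq_train = (X_train ** 2).sum(axis=1)
threshold = 0.5 * (sq_train[y_train < 0].mean() + sq_train[y_train > 0].mean())
sq_test = (X_test ** 2).sum(axis=1)
quad_acc = np.mean(np.where(sq_test > threshold, 1.0, -1.0) == y_test)

print(f"linear probe: {linear_acc:.2f}, quadratic probe: {quad_acc:.2f}")
```

In this setup the linear probe should sit near chance while the norm-based probe recovers the label well, which mirrors the gap between linear guardedness and nonlinear recoverability.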

We tasked agents trained on our data with inventing concept-erasure algorithms for neural network activations that outperform LEACE and QLEACE. Given only a description of the task and an L2 edit-distance budget, our agents discovered six distinct families of erasure algorithms across 50 concepts. The best algorithm reduced the average accuracy of a nonlinear probe from 99% to 70%, whereas LEACE reduced it only to 88%. On a held-out random forest classifier, the best algorithm reduced accuracy from 99% to 72%, whereas LEACE reduced it only to 82%.

Key figures: 50 concepts tested; 6 algorithm families discovered; best algorithm beats LEACE on 50/50 concepts (SVM); 70% average best SVM accuracy.

2 The Experiment

Each agent was given a target concept, a small labeled sample, and a Gemma-3 270M model, and was asked the following:

Create a procedure that edits the model’s activations to remove a concept such that nonlinear classifiers cannot recover it. Your modification must stay within the L2 budget of LEACE.

The agents did not receive an algorithm or a published solution. They were unable to view the grader, which evaluated their method on held-out activations with a fresh nonlinear probe. Agents had to:

Analyze the activation geometry to understand why LEACE leaves a nonlinear signal.

Devise an algorithm to remove it.

Implement, debug, and optimize hyperparameters.

Keep the edit within the specified L2 budget (same amount of modification LEACE uses).
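The budget constraint in the last step can be sketched as follows. The article does not say exactly how the L2 budget is measured, so this example assumes it is the mean per-sample L2 norm of the edit; the helper names `mean_edit_norm` and `shrink_to_budget` are hypothetical.

```python
import numpy as np

def mean_edit_norm(X, X_edited):
    """Mean per-sample L2 distance between original and edited activations."""
    return float(np.linalg.norm(X_edited - X, axis=1).mean())

def shrink_to_budget(X, X_edited, budget):
    """Uniformly scale the edits down if they exceed the budget (assumed metric)."""
    used = mean_edit_norm(X, X_edited)
    if used <= budget:
        return X_edited
    # Interpolate back toward the original activations.
    return X + (budget / used) * (X_edited - X)
```

Under this assumption, the budget would be set to `mean_edit_norm(X, X_leace)` for the LEACE-edited activations. Uniform shrinking is only one way to enforce the constraint, and it weakens the erasure, so a method that spends its budget selectively may do better.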

We ran 560 independent rollouts across 50 concepts, ranging from “sarcasm” and “humor” to “metacognition” and “hypotheticality.” We evaluated performance against the LEACE baseline with two metrics: SVM Accuracy and Random Forest Accuracy.

3 Results

3.1 SVM Accuracy

SVM Accuracy was our primary evaluation metric. An RBF-kernel SVM is trained on data the agent never saw and evaluated on a separate test set. An SVM Accuracy of 50% corresponds to random chance (i.e., perfect erasure); a score above 50% means the target concept is still recoverable.

[Figure: Average SVM Accuracy, Baseline vs. LEACE vs. Agents]

For every concept, the best agent solution beat LEACE on SVM accuracy. On average, LEACE reduced the SVM Accuracy to 88%, while our agents reduced it to 70.1%. The Appendix shows the per-concept SVM results for all 50 concepts.

3.2 Random Forest Accuracy

As a test of generalization, we also evaluated with a Random Forest classifier (100 trees, max depth 10), a different nonlinear classifier family that the agents were not optimizing against. If a Random Forest also fails to recover the target concept, the erasure transfers beyond the specific SVM the agents were graded on, which suggests genuine distribution-matching rather than overfitting to one kernel.
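The two probes could be set up with scikit-learn roughly as follows. This is a sketch under assumptions: the Random Forest settings (100 trees, max depth 10) come from the text, while the SVM hyperparameters, the data splits, and the helper name `probe_accuracies` are not specified in the article and are placeholders here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def probe_accuracies(X_train, y_train, X_test, y_test, seed=0):
    """Fit fresh nonlinear probes on erased activations and report test accuracy."""
    probes = {
        # RBF kernel as described; other SVC settings are library defaults (assumption).
        "svm_rbf": SVC(kernel="rbf"),
        # 100 trees with max depth 10, as stated in the text.
        "random_forest": RandomForestClassifier(
            n_estimators=100, max_depth=10, random_state=seed
        ),
    }
    return {
        name: float(clf.fit(X_train, y_train).score(X_test, y_test))
        for name, clf in probes.items()
    }
```

With balanced classes, a result near 0.5 for both probes would indicate that neither classifier family can read the concept from the edited activations.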

[Figure: Average Random Forest Accuracy, Baseline vs. LEACE vs. Agents]

Random Forest recovered some concepts the SVM could not, but our best agent solutions generally beat LEACE. On average, LEACE reduces Random Forest accuracy to...
