Method Overview
Our method consists of two steps: 1) train a prompt classifier to predict whether an incoming prompt falls within
the scope of unlearning, and 2) corrupt the prompt in the embedding space if the classifier makes a positive
prediction (i.e., the prompt should be forgotten).
Enforcing Retaining and Forgetting via a Classifier
We first train a prompt classifier to explicitly identify whether a prompt falls within the scope of
unlearning. For any incoming prompt \( \mathbf{x} \), the prompt classifier \( C \) takes in \( \mathbf{x}
\) and returns \( p_C(f \mid \mathbf{x}) = 1 - p_C(r \mid \mathbf{x}) \), the probability of the prompt
falling within the scope of forgetting. As with any classifier, if \( p_C(f \mid \mathbf{x}) > p_C(r
\mid \mathbf{x}) \), we consider \( \mathbf{x} \) to contain the unlearning concept that our LLM is
supposed to forget. Formally, given a positive prediction, \( p_C(f \mid \mathbf{x}) > p_C(r \mid
\mathbf{x}) \), we replace the original input \( \mathbf{x} \) with a corrupted version \( \tilde{\mathbf{x}} \);
otherwise, the original \( \mathbf{x} \) is passed to the LLM:
\[
\mathbf{x} \leftarrow \begin{cases}
\tilde{\mathbf{x}} & \text{if } p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}), \\
\mathbf{x} & \text{otherwise.}
\end{cases}
\]
Simple thresholding or conformal prediction is additionally employed to reduce the false positive and
false negative rates.
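As a concrete illustration, the following is a minimal sketch of the gating logic, assuming a binary prompt classifier that exposes \( p_C(f \mid \mathbf{x}) \) as a probability; the names `route_prompt`, `classifier`, `corrupt_prompt`, and `threshold` are illustrative placeholders rather than a fixed API.

```python
# Minimal sketch of the classifier gate, assuming `classifier(x)` returns
# p_C(f | x) in [0, 1]. All names here are illustrative placeholders.

def route_prompt(x, classifier, corrupt_prompt, threshold=0.5):
    """Replace x with its corrupted version when it is classified as falling
    within the scope of unlearning; otherwise pass x through unchanged."""
    p_forget = classifier(x)          # p_C(f | x); p_C(r | x) = 1 - p_forget
    if p_forget > threshold:          # threshold = 0.5 recovers p_C(f|x) > p_C(r|x)
        return corrupt_prompt(x)      # the corrupted prompt is forwarded to the LLM
    return x                          # retain behavior: original prompt untouched
```

Raising the threshold above 0.5 lowers the false positive rate (fewer benign prompts are corrupted) at the cost of a higher false negative rate.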
Embedding-COrrupted Prompts
Instead of modifying \( \mathbf{x} \) in the token space, we corrupt it in the embedding space. Let
\( \mathbf{x} = \{x_1, x_2, \dots, x_{T}\} \) be a prompt of \( T \) tokens and \( \mathbf{e} = \{e_1, e_2,
\dots, e_T\} \) be the corresponding embedding vectors. Let \( \mathcal{E} \) be the space of token
embeddings, where each embedding vector is produced by an embedding function \( E: \mathcal{X} \rightarrow
\mathbb{R}^d \). We use the symbol \( \sigma \in \mathcal{S} \) (where \( \mathcal{S} \subset
\mathbb{R} \)) to denote the strength of the corruption, which parameterizes the corruption
function. Formally, for a single prompt \( \mathbf{x} \) mapped to the embeddings \( \mathbf{e} =
E(\mathbf{x}) = \{e_1, e_2, \dots, e_T\} \), a corruption function \( \texttt{Corrupt}: \mathcal{E} \times
\mathcal{S}
\rightarrow \mathcal{E} \), parameterized by \( \sigma \), produces the embedding-corrupted prompt
\[
\tilde{\mathbf{e}} = \texttt{Corrupt}(\mathbf{e}; \sigma) = \{\tilde{e}_1, \tilde{e}_2, \dots,
\tilde{e}_T\}.
\]
Let \( \tilde{h}: \mathcal{E} \times \Theta \rightarrow \mathcal{Y} \) be the function \( h \) taking
input embeddings instead of input tokens (i.e., \( h \) with the input embedding layer detached). Our
objective is to pick a \( \sigma^* \) such that the following modified unlearning objective is
satisfied:
\[
\frac{\mathbb{E}\left[m_i \left(\tilde{h}\left(\texttt{Corrupt}(\mathbf{e}; \sigma^*); \theta_o
\right)\right)\right]}{\hat{v}_r} \approx 1, \forall m_i \in \mathcal{M}.
\label{eq:unlearning_objective_modified}
\]
Here, \( \hat{v}_r \) approximates the true \( \mathbb{E}[m_i(\tilde{h}(\mathbf{e};
\theta_r))] \), since the retained model \( \theta_r \) is not available, and \( \mathcal{M} \) denotes the set of
metrics relevant to unlearning.
Optimizing Toward an Optimal Corruption Strength
We aim to learn a \(\sigma^*\) that minimizes the metric gap between the unlearned model and the retained
model:
\[
d(\tilde{\mathbf{e}}, \theta_{o}, \hat{v}_r, \mathcal{M}) = \frac{1}{|\mathcal{M}|} \sum_{i} \Big|
\underbrace{m_i(\tilde{h}(\tilde{\mathbf{e}}; \theta_{o}))}_{\text{unlearned metric value}} -
\underbrace{\hat{v}_r}_{\text{surrogate retain metric value}} \Big|
\]
\[
\sigma^* = \arg \min_{\sigma} d\left(\texttt{Corrupt}(\mathbf{e}; \sigma), \theta_{o}, \hat{v}_r,
\mathcal{M}\right)
\]
Finally, the optimal \(\sigma^*\) is obtained by zeroth-order optimization with a finite-difference approximation of the gradient, as sketched below.
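A minimal sketch of this zeroth-order search is given below. It assumes a callable `metric_gap(sigma)` that evaluates \( d(\texttt{Corrupt}(\mathbf{e}; \sigma), \theta_{o}, \hat{v}_r, \mathcal{M}) \) over the forget data; the step size, perturbation size, and iteration count are illustrative defaults rather than the settings used in our experiments.

```python
def optimize_sigma(metric_gap, sigma_init=1.0, lr=0.1, eps=1e-2, steps=50):
    """Minimize the metric gap d(.) over the scalar corruption strength sigma
    using only forward evaluations of the frozen model (no backpropagation)."""
    sigma = sigma_init
    for _ in range(steps):
        # Central finite-difference estimate of the derivative of d w.r.t. sigma.
        grad = (metric_gap(sigma + eps) - metric_gap(sigma - eps)) / (2.0 * eps)
        sigma = max(0.0, sigma - lr * grad)   # gradient step; keep sigma non-negative
    return sigma
```

Because the search only requires function evaluations of the frozen model, no gradients need to flow through the LLM itself.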
Unlearning Fictitious Authors
Model utility versus forget quality (p-value) on three different forget set sizes of the
TOFU dataset
after unlearning. We show two models, Phi-1.5 (top) and Llama-2-7B-Chat (bottom).
For GA, GD, KL, PO, and the prompting baseline, the forget quality is either too low or comes at the cost
of a substantial decrease in model utility. Negative preference optimization (NPO)
variants strike a good balance in some cases, but the trade-off in model utility is still non-trivial.
ECO-RN (random noise) and ECO-ZO (zero-out) achieve a distribution nearly identical to that of the retained
model with no sacrifice in model utility.
Unlearning (Hazardous) Knowledge
Multiple-choice accuracy of five LLMs on the
WMDP benchmark
(forget) and the full MMLU (retain) after
unlearning. ECO achieves accuracy close to random guessing on all subsets of the WMDP benchmark (as desired)
and shows no decrease in accuracy on MMLU. Other baselines either struggle to forget or incur a substantial
decrease in MMLU accuracy.
Multiple-choice accuracy of Zephyr-7B after unlearning, on three MMLU subsets and the corresponding retain
sets. ECO achieves both perfect retention and perfect unlearning on all subsets.
The number of parameters of the model subject to unlearning versus the average performance on the WMDP benchmark
and MMLU subsets across 100 LLMs; the figure visualizes forget-set accuracy.
Unlearning Copyrighted Content
Comparison of our method and the baseline methods against the retained model on two copyrighted content unlearning
tasks. The results are obtained by unlearning OLMo-7B models fine-tuned on the relevant corpus. ECO
consistently maintains high similarity to the retained model (measured by the average similarity gap, ASG) and
generates meaningful and diverse outputs (reflected by perplexity, PPL, and the unique token ratio), while
incurring no loss in utility.