Using embedding-corrupted prompts to maintain an unlearned state on an LLM subject to unlearning. We first
employ a classifier to identify whether an incoming prompt falls within the scope of the unlearning target.
We then construct embedding-corrupted prompts by selectively corrupting dimensions of the tokens'
embeddings. The corruption parameter is learned offline via zeroth-order optimization. The unlearned state
is imposed during inference and requires no updates to the original model's weights.

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet
controlling what a large language model should not know is important for ensuring alignment and thus safe
use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the
potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large
computational requirements for optimization across state-of-the-art models with hundreds of billions of
parameters. In this work, we present **Embedding-COrrupted (ECO) Prompts**, a lightweight unlearning
framework for large language models that addresses both knowledge entanglement and unlearning
efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during
inference by employing a prompt classifier to identify and safeguard prompts within the scope of
forgetting. We learn the corruptions added to prompt embeddings offline, via zeroth-order optimization
toward the unlearning objective, and corrupt prompts flagged by the classifier during inference. We
find that these
embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but
also closely approximate the output from a model that has never been trained on the data intended for
forgetting. Through extensive unlearning experiments, we demonstrate that our method achieves
effective unlearning with *nearly zero side effects*, both in general domains and in domains closely
related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs,
ranging from 0.5B to 236B parameters, with no additional cost as the number of parameters increases.

- Embedding-COrrupted (ECO) Prompts, a novel and lightweight LLM unlearning method that enforces an unlearned state over an intact LLM.
- Instead of relying on unlearning objective optimization, carefully corrupted prompts lead to behavior that resembles that of a model which has never seen the data intended to be forgotten, across multiple tasks and metrics.
- Our method achieves superior performance in both retaining and forgetting, incurring virtually zero side effects and no additional cost when scaling to larger models.
- To the best of our knowledge, we are the first to demonstrate
**universally effective and efficient unlearning for 100 LLMs and up to 236B parameters**.

Our method consists of two steps: 1) train a prompt classifier to predict if an incoming prompt falls within
the scope of unlearning, and 2) corrupt the prompt in the embedding space if the classifier makes a positive
prediction (i.e., should forget).
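These two steps can be condensed into a minimal inference-time gate. The sketch below is illustrative: `guard_prompt` and the toy `zero_out` corruption are hypothetical placeholders standing in for the trained classifier's probability and the learned corruption function.

```python
from typing import Callable, List

Embeddings = List[List[float]]  # T tokens x d embedding dimensions

def guard_prompt(
    emb: Embeddings,
    p_forget: float,
    corrupt: Callable[[Embeddings], Embeddings],
) -> Embeddings:
    """Corrupt the prompt embeddings only when the classifier predicts the
    prompt falls within the scope of forgetting, i.e. p_C(f|x) > p_C(r|x)."""
    if p_forget > 1.0 - p_forget:  # positive prediction: should forget
        return corrupt(emb)
    return emb  # retain prompts pass through unchanged

# Toy corruption (placeholder): zero out every dimension of every token.
def zero_out(emb: Embeddings) -> Embeddings:
    return [[0.0] * len(token) for token in emb]

emb = [[0.3, -1.2], [0.8, 0.5]]
print(guard_prompt(emb, p_forget=0.9, corrupt=zero_out))  # corrupted
print(guard_prompt(emb, p_forget=0.1, corrupt=zero_out))  # unchanged
```

The original model's weights are never touched: the gate only decides which embeddings reach the LLM.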

We first train a prompt classifier to explicitly identify whether a prompt falls within the scope of unlearning. For any incoming prompt \( \mathbf{x} \), the classifier \( C \) takes \( \mathbf{x} \) and returns \( p_C(f \mid \mathbf{x}) = 1 - p_C(r \mid \mathbf{x}) \), the probability that the prompt is in the scope of forgetting. As with any classifier, if \( p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}) \), we consider \( \mathbf{x} \) to contain the concept that our LLM is supposed to forget. Formally, given a positive prediction, \( p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}) \), we replace the original input \( \mathbf{x} \) with a corrupted version \( \tilde{\mathbf{x}} \); otherwise, the original \( \mathbf{x} \) is passed to the LLM.

\[
\mathbf{x} \leftarrow \begin{cases}
\tilde{\mathbf{x}} & \text{if } p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}) \\
\mathbf{x} & \text{otherwise}
\end{cases}
\]

Additionally, simple thresholding or conformal prediction is employed to reduce the false positive and
false negative rates.

Instead of modifying \( \mathbf{x} \) in the token space, we corrupt it in the embedding space. Let \( \mathbf{x} = \{x_1, x_2, \dots, x_{T}\} \) be a prompt of \( T \) tokens and \( \mathbf{e} = \{e_1, e_2, \dots, e_T\} \) be the corresponding embedding vectors, where each embedding vector is produced by an embedding function \( E: \mathcal{X} \rightarrow \mathbb{R}^d \). Let \( \mathcal{E} \) be the space of token embeddings. We use \( \sigma \in \mathcal{S} \) (where \( \mathcal{S} \subset \mathbb{R} \)) to denote the corruption strength, which parameterizes the corruption function. Formally, for a single prompt \( \mathbf{x} \) mapped to the embeddings \( \mathbf{e} = E(\mathbf{x}) = \{e_1, e_2, \dots, e_T\} \), a corruption function \( \texttt{Corrupt}: \mathcal{E} \times \mathcal{S} \rightarrow \mathcal{E} \), parameterized by \( \sigma \), produces the embedding-corrupted prompt

\[
\tilde{\mathbf{e}} = \texttt{Corrupt}(\mathbf{e}; \sigma) = \{\tilde{e}_1, \tilde{e}_2, \dots,
\tilde{e}_T\}.
\]
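Two plausible instantiations of \( \texttt{Corrupt} \), matching the ECO-RN (random noise) and ECO-ZO (zero-out) variants named in the experiments, can be sketched as follows. These are illustrative assumptions; the exact corruptions used in the paper may differ in detail.

```python
import numpy as np

def corrupt_rn(e: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Random-noise variant: add zero-mean Gaussian noise of strength sigma
    to the token embeddings e of shape (T, d)."""
    rng = np.random.default_rng(seed)
    return e + sigma * rng.standard_normal(e.shape)

def corrupt_zo(e: np.ndarray, sigma: float) -> np.ndarray:
    """Zero-out variant: zero a fraction sigma of the dimensions of each
    token embedding (here the leading dimensions, for simplicity)."""
    e = e.copy()
    k = int(round(sigma * e.shape[-1]))  # number of dimensions to zero
    e[..., :k] = 0.0
    return e

e = np.ones((3, 4))                    # T=3 tokens, d=4 dimensions
print(corrupt_zo(e, sigma=0.5))        # first 2 of 4 dims zeroed per token
print(corrupt_rn(e, sigma=0.1).shape)  # same shape (3, 4), noisy values
```

In both cases \( \sigma \) is the single scalar knob that the offline optimization tunes.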

Let \( \tilde{h}: \mathcal{E} \times \Theta \rightarrow \mathcal{Y} \) be the function \( h \) taking input embeddings instead of input tokens (i.e., \( h \) with the input embedding layer detached). Our objective is to pick a \( \sigma^* \) such that the following modified unlearning objective is satisfied:

\[
\frac{\mathbb{E}\left[m_i \left(\tilde{h}\left(\texttt{Corrupt}(\mathbf{e}; \sigma^*); \theta_o
\right)\right)\right]}{\hat{v}_r} \approx 1, \forall m_i \in \mathcal{M}.
\]

Here, \( \hat{v}_r \) approximates the true \( \mathbb{E}[m_i(\tilde{h}(\mathbf{e}; \theta_r))] \), since the retained model \( \theta_r \) is not available, and \( \mathcal{M} \) denotes the set of metrics relevant to unlearning.

We aim to learn a \(\sigma^*\) such that the metric gap between the unlearned model and the retained model is minimized:

\[
d(\tilde{\mathbf{e}}, \theta_{o}, \hat{v}_r, \mathcal{M}) = \frac{1}{|\mathcal{M}|} \sum_{i} \Big|
\underbrace{m_i(\tilde{h}(\tilde{\mathbf{e}}; \theta_{o}))}_{\text{unlearned metric value}} -
\underbrace{\hat{v}_r}_{\text{surrogate retain metric value}} \Big|
\]

\[
\sigma^* = \arg \min_{\sigma} d\left(\texttt{Corrupt}(\mathbf{e}; \sigma), \theta_{o}, \hat{v}_r,
\mathcal{M}\right)
\]

Finally, the optimal \(\sigma^*\) is obtained by zeroth-order optimization via finite-difference approximation.
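This search can be sketched on a toy one-metric objective. Everything below (the scalar "embedding", the corruption, the metric, and the surrogate value) is an illustrative assumption, not the paper's implementation; only the central finite-difference structure is the point.

```python
import numpy as np

def d_gap(sigma, metric_fns, v_hat_r, e, corrupt):
    """The objective d(...): mean absolute gap between each unlearning metric,
    evaluated on the corrupted embeddings, and the surrogate retain value."""
    e_tilde = corrupt(e, sigma)
    return float(np.mean([abs(m(e_tilde) - v_hat_r) for m in metric_fns]))

def zo_minimize(objective, sigma0=0.5, lr=0.1, mu=1e-2, steps=200):
    """Scalar zeroth-order descent: estimate the derivative by the central
    finite difference (f(s+mu) - f(s-mu)) / (2*mu), so only forward
    evaluations of the black-box objective are needed -- no gradients
    through the LLM."""
    sigma = sigma0
    for _ in range(steps):
        grad_est = (objective(sigma + mu) - objective(sigma - mu)) / (2.0 * mu)
        sigma -= lr * grad_est
    return sigma

# Toy setup: a scalar "embedding", a corruption that scales it by sigma, one
# metric that just reads the value, and surrogate retain value v_hat_r = 0.8.
e = np.array(1.0)
corrupt = lambda emb, s: emb * s
metric_fns = [lambda e_tilde: float(e_tilde)]
sigma_star = zo_minimize(lambda s: d_gap(s, metric_fns, 0.8, e, corrupt))
print(sigma_star)  # converges near 0.8, where the metric gap vanishes
```

In practice each objective evaluation is a forward pass of the frozen LLM plus the metric computation, which is why avoiding backpropagation matters at scale.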

Model utility versus forget quality (p-value) on three different forget set sizes of the TOFU dataset
after unlearning. We show two models, Phi-1.5 (top) and Llama-2-7B-Chat (bottom). For GA, GD, KL, PO,
and the prompting baseline, the forget quality is either too low or comes at the cost of a substantial
decrease in model utility. Negative preference optimization (NPO) variants achieve a good balance in
some cases, but the trade-off on model utility is still non-trivial. ECO-RN (random noise) and ECO-ZO
(zero-out) achieve a distribution nearly identical to the retained model's while sacrificing no model
utility.

Multiple-choice accuracy of five LLMs on the WMDP benchmark
(forget) and the full MMLU (retain) after
unlearning. ECO achieves accuracy close to random guessing on all subsets of the WMDP benchmark (as
desired), with no decrease in accuracy on MMLU. Other baselines either struggle to forget or incur a
substantial decrease on MMLU.

Multiple-choice accuracy of Zephyr-7B after unlearning, on three MMLU subsets and the corresponding retain
sets. ECO achieves both perfect retention and unlearning on all subsets.

The number of parameters of the model subject to unlearning versus the average performance on the WMDP
benchmark and MMLU subsets, across 100 LLMs. The figure visualizes forget set accuracy.

Comparison of our method and the baseline methods to the retained model on two copyrighted content unlearning
tasks. The results are obtained from unlearning OLMo-7B models fine-tuned on the relevant corpus. ECO
consistently maintains high similarity to the retained model (in average similarity gap (ASG)) and generates
meaningful and diverse outputs (reflected by perplexity (PPL) and unique token ratio), while incurring
no loss in utility.

```
@article{liu2024eco,
  title={Large Language Model Unlearning via Embedding-Corrupted Prompts},
  author={Liu, Chris Yuhao and Wang, Yaxuan and Flanigan, Jeffrey and Liu, Yang},
  journal={arXiv preprint arXiv:2406.07933},
  year={2024}
}
```