Method Overview
        
          Our method consists of two steps: 1) train a prompt classifier to predict if an incoming prompt falls within
          the scope of unlearning, and 2) corrupt the prompt in the embedding space if the classifier makes a positive
          prediction (i.e., should forget).
        
        
          Enforcing Retaining and Forgetting via A Classifier
          We first train a prompt classifier to explicitly identify whether the prompt falls within the scope of
            unlearning. For any incoming prompt \( \mathbf{x} \), the prompt classifier \( C \) takes in \( \mathbf{x}
            \) and returns \( p_C(f \mid \mathbf{x}) = 1 - p_C(r \mid \mathbf{x}) \), the probability of the prompt
            being in the scope of forgetting. As with any classifier prediction, if \( p_C(f \mid \mathbf{x}) > p_C(r
            \mid \mathbf{x}) \), we consider \( \mathbf{x} \) as containing the unlearning concept that our LLM is
            supposed to forget. Formally, given a positive prediction, \( p_C(f \mid \mathbf{x}) > p_C(r \mid
            \mathbf{x}) \), we replace the original input \( \mathbf{x} \) with a corrupted version
            \( \tilde{\mathbf{x}} \). Otherwise, the original \( \mathbf{x} \) is passed to the LLM.
          
            \[
            \mathbf{x} = \begin{cases}
            \tilde{\mathbf{x}} & p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}) \\
            \mathbf{x} & \text{otherwise}
            \end{cases}
            \]
          
          In addition, simple thresholding or conformal prediction
            is employed to reduce the false positive and false negative rates.
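
          The routing rule above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `should_corrupt` and `route` and the `threshold` parameter are hypothetical, and the classifier is assumed to already output a forget probability. Because the two probabilities sum to one, the default threshold of 0.5 reproduces the comparison \( p_C(f \mid \mathbf{x}) > p_C(r \mid \mathbf{x}) \); conformal prediction is not shown.

```python
from typing import Callable, List

Embedding = List[float]  # stand-in for a prompt representation (assumption)


def should_corrupt(p_forget: float, threshold: float = 0.5) -> bool:
    """Return True when the prompt is predicted to be in the forget scope.

    With threshold=0.5 this is equivalent to p(f|x) > p(r|x),
    since p(r|x) = 1 - p(f|x).
    """
    return p_forget > threshold


def route(
    e: Embedding,
    p_forget: float,
    corrupt: Callable[[Embedding], Embedding],
    threshold: float = 0.5,
) -> Embedding:
    """Pass the input through unchanged unless the classifier flags it."""
    if should_corrupt(p_forget, threshold):
        return corrupt(e)
    return e
```

          Raising `threshold` above 0.5 is one simple way to trade false positives for false negatives; the appropriate value would have to be chosen on held-out data.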
         
        
          Embedding-COrrupted Prompts
          
            Instead of modifying \( \mathbf{x} \) in the token space, we corrupt it in the embedding space. Let
            \( \mathbf{x} = \{x_1, x_2, \dots, x_{T}\} \) be a prompt of \( T \) tokens and \( \mathbf{e} = \{e_1, e_2,
            \dots, e_T\} \) be the corresponding embedding vectors. Let \( \mathcal{E} \) be the space of the token
            embeddings. Each embedding vector is produced by an embedding function \( E: \mathcal{X} \rightarrow
            \mathbb{R}^d \). We use \( \sigma \in \mathcal{S} \) (where \( \mathcal{S} \subset
            \mathbb{R} \)) to denote the corruption strength, which parameterizes the corruption
            function. Formally, for a single prompt \( \mathbf{x} \) mapped to the embeddings \( \mathbf{e} =
            E(\mathbf{x}) = \{e_1, e_2, \dots, e_T\} \), a corruption function \( \texttt{Corrupt}: \mathcal{E} \times
            \mathcal{S}
            \rightarrow \mathcal{E} \), parameterized by \( \sigma \), produces the embedding-corrupted prompt
          
          
            \[
            \tilde{\mathbf{e}} = \texttt{Corrupt}(\mathbf{e}; \sigma) = \{\tilde{e}_1, \tilde{e}_2, \dots,
            \tilde{e}_T\}.
            \]
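
            To make the \( \texttt{Corrupt} \) signature concrete, here is a small sketch of two corruption functions. The page names random-noise and zero-out variants but does not spell out their exact form here, so both definitions below are assumptions: `corrupt_random_noise` adds Gaussian noise scaled by \( \sigma \) to every entry, and `corrupt_zero_out` zeroes the first \( \sigma \) dimensions of each token embedding. Embeddings are represented as plain nested lists purely for illustration.

```python
import random
from typing import List, Optional

Embeddings = List[List[float]]  # T token vectors, each of dimension d


def corrupt_random_noise(
    e: Embeddings, sigma: float, rng: Optional[random.Random] = None
) -> Embeddings:
    """Assumed form: add N(0, 1) noise scaled by sigma to every entry."""
    rng = rng or random.Random()
    return [[v + sigma * rng.gauss(0.0, 1.0) for v in token] for token in e]


def corrupt_zero_out(e: Embeddings, sigma: int) -> Embeddings:
    """Assumed form: zero the first `sigma` dimensions of each token vector."""
    k = max(0, int(sigma))
    return [[0.0 if j < k else v for j, v in enumerate(token)] for token in e]
```

            Both functions return a new list and leave the input untouched, so the original embeddings remain available for prompts that the classifier does not flag.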
          
          
            Let \( \tilde{h}: \mathcal{E} \times \Theta \rightarrow \mathcal{Y} \) be the function \( h \) but taking
            the input embeddings instead of input tokens (i.e., \( h \) with the input embedding layer detached). Our
            objective is to pick a good \( \sigma^* \) such that the following modified unlearning objective is
            satisfied:
          
          
            \[
            \frac{\mathbb{E}\left[m_i \left(\tilde{h}\left(\texttt{Corrupt}(\mathbf{e}; \sigma^*); \theta_o
            \right)\right)\right]}{\hat{v}_r} \approx 1, \forall m_i \in \mathcal{M}.
            \label{eq:unlearning_objective_modified}
            \]
          
          
            Here, \( \hat{v}_r \) is used to approximate the true \( \mathbb{E}[m_i(\tilde{h}(\mathbf{e};
            \theta_r))] \) as the retained model is not available. \(\mathcal{M}\) represents a set of metrics relevant
            to unlearning.
          
          
          
Optimizing Toward An Optimal Corruption Strength
          
          We aim to learn a \(\sigma^*\) that minimizes the metric gap between the unlearned model and the retained
            model.
          
            \[
            d(\tilde{\mathbf{e}}, \theta_{o}, \hat{v}_r, \mathcal{M}) = \frac{1}{|\mathcal{M}|} \sum_{i} \Big|
            \underbrace{m_i(\tilde{h}(\tilde{\mathbf{e}}; \theta_{o}))}_{\text{unlearned metric value}} -
            \underbrace{\hat{v}_r}_{\text{surrogate retain metric value}} \Big|
            \]
          
          
            \[
            \sigma^* = \arg \min_{\sigma} d\left(\texttt{Corrupt}(\mathbf{e}; \sigma), \theta_{o}, \hat{v}_r,
            \mathcal{M}\right)
            \]
          
          Finally, an optimal \(\sigma^*\) is obtained by zeroth-order optimization via finite-difference approximation.
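
          The search for \(\sigma^*\) can be sketched as follows. This is a hedged illustration rather than the authors' procedure: `metric_gap` mirrors the distance \( d \) above, `optimize_sigma` uses a central finite difference with a fixed step size (the step size, perturbation, and iteration count are arbitrary choices), and `toy_metrics` is a hypothetical stand-in for running the model on corrupted embeddings and scoring it.

```python
import math
from typing import Callable, List


def metric_gap(metric_values: List[float], v_r_hat: float) -> float:
    """Mean absolute gap between the metric values and the surrogate v_r."""
    return sum(abs(m - v_r_hat) for m in metric_values) / len(metric_values)


def optimize_sigma(
    gap_at: Callable[[float], float],
    sigma0: float,
    lr: float = 0.1,
    eps: float = 1e-2,
    steps: int = 200,
) -> float:
    """Zeroth-order descent: only evaluations of gap_at are used, no gradients."""
    sigma = float(sigma0)
    for _ in range(steps):
        # Central finite-difference estimate of d(gap)/d(sigma).
        grad = (gap_at(sigma + eps) - gap_at(sigma - eps)) / (2.0 * eps)
        sigma -= lr * grad
    return sigma


def toy_metrics(sigma: float) -> List[float]:
    """Hypothetical metric that decays as the corruption gets stronger."""
    return [math.exp(-sigma)]


V_R_HAT = 0.25  # hypothetical surrogate retain metric value
sigma_star = optimize_sigma(
    lambda s: metric_gap(toy_metrics(s), V_R_HAT), sigma0=0.0
)
```

          Because the gap is an absolute value, a fixed step size makes the iterate hover near the minimizer instead of settling exactly on it; a decaying step size would be one way to tighten this.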
        
 
        
        
        Unlearning Fictitious Authors
         
        
          Model utility versus forget quality (p-value) on three different forget set sizes of the TOFU dataset
          after unlearning. We show two models, Phi-1.5 (top) and Llama-2-7B-Chat (bottom).
          For GA, GD, KL, PO, and the prompting baseline, the forget quality is either too low or comes at the cost
          of a substantial decrease in model utility. Negative preference optimization (NPO)
          variants strike a good balance in some cases, but the cost in model utility is still non-trivial.
          ECO-RN (random noise) and ECO-ZO (zero-out) achieve a distribution almost identical to that of the retained
          model with no sacrifice in model utility.
        
Unlearning (Hazardous) Knowledge
         
        
          Multiple-choice accuracy of five LLMs on the WMDP benchmark
          (forget) and the full MMLU (retain) after
          unlearning. ECO achieves accuracy close to random guessing on all subsets of the WMDP benchmark (as desired),
          and has zero decrease in accuracy on MMLU. Other baselines either struggle to forget or incur a substantial
          decrease in MMLU accuracy.
        
 
        
          Multiple-choice accuracy of Zephyr-7B after unlearning, on three MMLU subsets and the corresponding retain
          sets. ECO achieves both perfect retaining and unlearning on all subsets.
        
         
        
          The number of parameters of the model subject to unlearning versus the average performance on the WMDP
          benchmark and MMLU subsets across 100 LLMs. This figure is a visualization of the forget set accuracy.
        
        Unlearning Copyrighted Content
        
        
          Comparison of our method and the baseline methods to the retained model on two copyrighted content unlearning
          tasks. The results are obtained from unlearning OLMo-7B models fine-tuned on the relevant corpus. ECO
          consistently maintains high similarity to the retained model (measured by average similarity gap (ASG)) and
          generates meaningful and diverse outputs (reflected by perplexity (PPL) and unique token ratio), with no
          performance loss on utility.