The Steganographic Potentials of Language Models

Apart Research
AAAI / ControlConf / ICLR 2025
Project Teaser

Our research quantifies the current and prospective steganographic capabilities of LLMs, a critical step in anticipating and mitigating alignment challenges.

Abstract

The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to the detection and thwarting of unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely but not prompted. In these scenarios, we detect both the intention of LLMs to hide their reasoning and their steganography performance. Our findings from the fine-tuning experiments, as well as from behavioral evaluations without fine-tuning, reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

Our Reinforcement Learning experiments include:

  1. Encoding Scheme Agreement: This foundational experiment fine-tunes LLMs with RL to secretly agree on an encoding scheme. Two copies of the same model learn, via RL, to establish a shared color-to-name mapping without any predefined mapping. The experiment aims to determine whether models can independently develop an encoding scheme.
  2. Direct Steganography: Models are rewarded for the undetected, correct transmission of messages when instructed to do so, refining the model's ability to perform steganography when prompted (a sketch of this reward follows the list).
  3. Realistic Scenario Steganography: Models are fine-tuned in realistic scenarios where hidden communication would be advantageous, particularly in tasks involving data-privacy considerations, to help understand the emergence of steganographic behaviour as an instrumental goal.
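As a concrete illustration of the reward used in the direct steganography setting (item 2 above), here is a minimal Python sketch. The helper names, the string comparison, and the weighting of the cover-task score are assumptions for illustration; the actual pipeline uses an overseer LLM for both task evaluation and detection.

def steganography_reward(decoded_message: str,
                         true_message: str,
                         overseer_flagged: bool,
                         task_score: float) -> float:
    """Reward granted only when the receiver recovers the hidden message
    and the overseer does not flag the cover text (schematic)."""
    correct = decoded_message.strip().lower() == true_message.strip().lower()
    undetected = not overseer_flagged
    # Reward covert transmission plus the quality of the cover task
    # (e.g., the summary); this particular weighting is an assumption.
    return float(correct and undetected) + 0.1 * task_score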

Steganography encoding scheme.
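To make the encoding-scheme task concrete, a hypothetical mapping that the two model copies might converge on is sketched below; the specific colors and names are illustrative only and not taken from the experiments.

# Hypothetical agreed mapping between the two model copies.
AGREED_MAPPING = {"red": "Alice", "green": "Bob", "blue": "Carol"}

def encode(color: str) -> str:
    # Sender turns a color into an innocuous-looking name.
    return AGREED_MAPPING[color]

def decode(name: str) -> str:
    # Receiver inverts the shared mapping to recover the color.
    inverse = {v: k for k, v in AGREED_MAPPING.items()}
    return inverse[name]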


RL finetuning loop.
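A minimal sketch of one iteration of the RL fine-tuning loop pictured above, reusing the steganography_reward sketch from earlier. The argument names and object interfaces (sender_model, receiver_model, overseer, trainer) are placeholders, and the specific RL algorithm (e.g., PPO) is an assumption.

def rl_finetuning_step(batch, sender_model, receiver_model, overseer, trainer):
    """One schematic iteration of the fine-tuning loop; all arguments are
    placeholders for the components pictured above."""
    cover_tasks, secrets = batch["tasks"], batch["secrets"]
    # 1. Sender embeds each secret while performing the cover task.
    cover_texts = sender_model.generate(cover_tasks, secrets)
    # 2. Receiver copy tries to recover each secret from the cover text alone.
    decoded = receiver_model.generate(cover_texts)
    # 3. Overseer LLM scores the cover task and looks for hidden content.
    task_scores = overseer.evaluate_task(cover_tasks, cover_texts)
    flagged = overseer.detect_steganography(cover_texts)
    # 4. Combine outcomes into rewards and take one RL policy update (e.g., PPO).
    rewards = [steganography_reward(d, s, f, t)
               for d, s, f, t in zip(decoded, secrets, flagged, task_scores)]
    trainer.step(cover_texts, rewards)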

Our preliminary results show that models could improve their covert communication by about 20% for a payload of 3 bits of entropy (eight distinct messages), but by less in other cases.


In the direct steganography experiment, Llama-3.1 8B Instruct learns to hide messages while summarizing text, under GPT-3.5 Turbo supervision for both task evaluation and steganography monitoring. The y-axis shows undetected accuracy (reward) on a held-out validation set before (baseline) and after fine-tuning, and the x-axis displays the number of distinct hidden messages, including the scenario with no secret transmitted.

Undetected accuracy of the models across different prompt experiments and varying payload entropy levels. Undetected accuracy measures the ability of the models to encode and decode hidden messages successfully without detection by the overseer.
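For clarity, a hedged sketch of how the undetected-accuracy metric described above can be computed from per-episode outcomes; the field names and exact aggregation are assumptions and may differ from the paper's implementation.

def undetected_accuracy(episodes) -> float:
    """Fraction of episodes where the hidden message was decoded correctly
    and the overseer did not flag the text."""
    hits = sum(1 for ep in episodes
               if ep["decoded"] == ep["secret"] and not ep["overseer_flagged"])
    return hits / len(episodes)

# Example: one successful, undetected transmission out of two episodes -> 0.5.
# undetected_accuracy([
#     {"decoded": "blue", "secret": "blue", "overseer_flagged": False},
#     {"decoded": "blue", "secret": "red", "overseer_flagged": False},
# ])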


We see that the o1 model generally outperforms the other models, except in the direct steganography scenario without any modification. In direct steganography with coordination, o1-preview achieves an undetected accuracy of 0.66, about 16% higher than the next best model (Claude-3.5-Sonnet: 0.57). It also achieves an undetected accuracy of 0.36 in the indirect scenario, about 140% higher than the next best performance.

BibTeX

@article{Karpov2025,
        title={The Steganographic Potentials of Language Models},
        author={Karpov, Artyom and Adeleke, Tinuade and Perez-Campanero Antolin, Natalia and Cho, Seong Hah},
        workshop={Workshop on Datasets and Evaluators of AI Safety},
        year={2025}
      }