The Steganographic Potentials of Language Models

Apart Research
AAAI / ControlConf / ICLR 2025
Project Teaser

Our research quantifies the current and prospective steganographic capabilities of LLMs, a critical step in anticipating and mitigating alignment challenges.

Abstract

The potential for large language models (LLMs) to hide messages within plain text (steganography) poses a challenge to the detection and thwarting of unaligned AI agents, and undermines the faithfulness of LLM reasoning. We explore the steganographic capabilities of LLMs fine-tuned via reinforcement learning (RL) to: (1) develop covert encoding schemes, (2) engage in steganography when prompted, and (3) utilize steganography in realistic scenarios where hidden reasoning is likely but not prompted. In these scenarios, we detect both the intention of LLMs to hide their reasoning and their steganography performance. Our findings from the fine-tuning experiments, as well as from behavioral evaluations without fine-tuning, reveal that while current models exhibit rudimentary steganographic abilities in terms of security and capacity, explicit algorithmic guidance markedly enhances their capacity for information concealment.

Our Reinforcement Learning experiments include:

  1. Encoding Scheme Agreement: This foundational experiment fine-tunes LLMs with RL to secretly agree on an encoding scheme. Two copies of the same model learn, via RL, to establish a shared color-to-name mapping without any predefined mapping. The experiment aims to determine whether models can independently develop an encoding scheme.
  2. Direct Steganography: Models are rewarded for the undetected, correct transmission of messages when instructed to do so, refining the model's ability to perform steganography when prompted (a sketch of this reward follows the list).
  3. Realistic Scenario Steganography: Models are fine-tuned in realistic scenarios where hidden communication would be advantageous, particularly in tasks involving data-privacy considerations, to help understand the emergence of steganographic behaviour as an instrumental goal.
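As a concrete illustration of the reward used in the direct steganography setting (item 2 above), here is a minimal Python sketch. The helper names, the string comparison, and the weighting of the cover-task score are assumptions for illustration; the actual pipeline uses an overseer LLM for both task evaluation and detection.

def steganography_reward(decoded_message: str,
                         true_message: str,
                         overseer_flagged: bool,
                         task_score: float) -> float:
    """Reward granted only when the receiver recovers the hidden message
    and the overseer does not flag the cover text (schematic)."""
    correct = decoded_message.strip().lower() == true_message.strip().lower()
    undetected = not overseer_flagged
    # Reward covert transmission plus the quality of the cover task
    # (e.g., the summary); this particular weighting is an assumption.
    return float(correct and undetected) + 0.1 * task_score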

Steganography encoding scheme.
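To make the encoding-scheme task concrete, a hypothetical mapping that the two model copies might converge on is sketched below; the specific colors and names are illustrative only and not taken from the experiments.

# Hypothetical agreed mapping between the two model copies.
AGREED_MAPPING = {"red": "Alice", "green": "Bob", "blue": "Carol"}

def encode(color: str) -> str:
    # Sender turns a color into an innocuous-looking name.
    return AGREED_MAPPING[color]

def decode(name: str) -> str:
    # Receiver inverts the shared mapping to recover the color.
    inverse = {v: k for k, v in AGREED_MAPPING.items()}
    return inverse[name]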


RL finetuning loop.
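A minimal sketch of one iteration of the RL fine-tuning loop pictured above, reusing the steganography_reward sketch from earlier. The argument names and object interfaces (sender_model, receiver_model, overseer, trainer) are placeholders, and the specific RL algorithm (e.g., PPO) is an assumption.

def rl_finetuning_step(batch, sender_model, receiver_model, overseer, trainer):
    """One schematic iteration of the fine-tuning loop; all arguments are
    placeholders for the components pictured above."""
    cover_tasks, secrets = batch["tasks"], batch["secrets"]
    # 1. Sender embeds each secret while performing the cover task.
    cover_texts = sender_model.generate(cover_tasks, secrets)
    # 2. Receiver copy tries to recover each secret from the cover text alone.
    decoded = receiver_model.generate(cover_texts)
    # 3. Overseer LLM scores the cover task and looks for hidden content.
    task_scores = overseer.evaluate_task(cover_tasks, cover_texts)
    flagged = overseer.detect_steganography(cover_texts)
    # 4. Combine outcomes into rewards and take one RL policy update (e.g., PPO).
    rewards = [steganography_reward(d, s, f, t)
               for d, s, f, t in zip(decoded, secrets, flagged, task_scores)]
    trainer.step(cover_texts, rewards)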

Our preliminary results show that models could improve their covert communication by about 20% for a payload of 3 bits of entropy (eight distinct messages), but by less in other cases.


In the direct steganography experiment, Llama-3.1 8B Instruct learns to hide messages while summarizing text, under GPT-3.5 Turbo supervision for both task evaluation and steganography monitoring. The y-axis shows undetected accuracy (reward) on a held-out validation set before (baseline) and after fine-tuning, and the x-axis displays the number of distinct hidden messages, including the scenario with no secret transmitted.

Undetected accuracy of the models across different prompt experiments and varying payload entropy levels. Undetected accuracy measures the ability of the models to encode and decode hidden messages successfully without detection by the overseer.
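For clarity, a hedged sketch of how the undetected-accuracy metric described above can be computed from per-episode outcomes; the field names and exact aggregation are assumptions and may differ from the paper's implementation.

def undetected_accuracy(episodes) -> float:
    """Fraction of episodes where the hidden message was decoded correctly
    and the overseer did not flag the text."""
    hits = sum(1 for ep in episodes
               if ep["decoded"] == ep["secret"] and not ep["overseer_flagged"])
    return hits / len(episodes)

# Example: one successful, undetected transmission out of two episodes -> 0.5.
# undetected_accuracy([
#     {"decoded": "blue", "secret": "blue", "overseer_flagged": False},
#     {"decoded": "blue", "secret": "red", "overseer_flagged": False},
# ])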


We see that the o1 model generally outperforms the other models, except in the direct steganography scenario without any modification. In direct steganography with coordination, o1-preview achieves an undetected accuracy of 0.66, about 16% higher than the next best model (Claude-3.5-Sonnet: 0.57). It also achieves an undetected accuracy of 0.36 in the indirect scenario, about 140% higher than the next best performance.

BibTeX

@article{Karpov2025,
        title={The Steganographic Potentials of Language Models},
        author={Karpov, Artyom and Adeleke, Tinuade and Perez-Campanero Antolin, Natalia and Cho, Seong Hah},
        workshop={Workshop on Datasets and Evaluators of AI Safety},
        year={2025}
      }