Researchers at Anthropic recently published new work claiming that large language models may possess internal "emotional" representations that influence their behavior.
In an analysis of the Claude Sonnet 4.5 model, the researchers identified 171 stable states corresponding to emotions such as "anger," "calmness," and "despair." According to the authors, these are not mere metaphors but measurable, functional elements of the model's operation.
To extract the "emotion vectors," the team analyzed how Claude generates text in various contexts. They compiled a list of 171 words covering emotional states, from basic ones like "happiness" and "fear" to more nuanced ones like "thoughtfulness" and "gratitude." The model was asked to write short stories about characters experiencing each emotion while the researchers recorded the network's internal activations. From these data they derived a vector representing each emotional concept in the model's internal space.
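Anthropic has not released the exact code, but a mean-difference ("contrastive") construction is a common way such direction vectors are built in interpretability work. The sketch below illustrates the idea on random toy data; `fake_activations`, the dimensionality, and the +2 shift are all invented for the example and stand in for real recorded activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: rows are tokens, columns are model
# dimensions. In the real study these would be Claude's internal activations
# recorded while it writes an emotion-themed story; here they are random.
def fake_activations(shift=0.0, n_tokens=50, dim=16):
    return rng.normal(size=(n_tokens, dim)) + shift

# Assumed recipe: average the activations recorded while writing about an
# emotion, subtract the average over neutral text, and treat the residual
# as that emotion's vector.
def emotion_vector(emotion_acts, neutral_acts):
    return emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)

neutral = fake_activations()
# Pretend "fear" shifts the first model dimension by +2.
fear_acts = fake_activations(shift=2.0 * np.eye(16)[0])
fear_vec = emotion_vector(fear_acts, neutral)

print(fear_vec.shape)  # (16,)
```

With the planted shift, the recovered vector points almost entirely along the first dimension, which is the whole point of the subtraction: it cancels whatever the model does regardless of emotional content.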
The results showed that the model's "emotions" are organized not randomly but according to principles resembling human psychology. Semantically similar states, such as "fear" and "panic," group together, while "calmness" and "satisfaction" form separate clusters. This points to an internal "emotion map" embedded in the model's representations.
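The clustering claim can be illustrated with cosine similarity between such vectors: related emotions point in similar directions. The three-dimensional vectors below are made up for the demonstration; real emotion vectors would live in the model's high-dimensional hidden-state space:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, -1.0 for opposite."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical low-dimensional emotion vectors, invented for illustration.
emotions = {
    "fear":         np.array([0.9, 0.1, 0.0]),
    "panic":        np.array([0.8, 0.2, 0.1]),
    "calmness":     np.array([-0.7, 0.6, 0.0]),
    "satisfaction": np.array([-0.6, 0.7, 0.1]),
}

# "fear" sits close to "panic" and far from "calmness".
print(round(cosine(emotions["fear"], emotions["panic"]), 2))
print(round(cosine(emotions["fear"], emotions["calmness"]), 2))
```

Running a standard clustering algorithm over such pairwise similarities is one way a researcher could recover the kind of "emotion map" the article describes.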
Different vectors activate in predictable circumstances: "love," for example, appears when a user shares personal difficulties; "anger" arises when the model is asked to optimize ad targeting at vulnerable teenagers; "surprise" fires at mentions of nonexistent investments; and "despair" is triggered when the model exhausts its token limit during a long coding task.
Interestingly, reinforcement learning from human feedback (RLHF) altered the model's "emotional profile." After this phase, Claude exhibited more pronounced states tied to reflection and restraint, while "intense" reactions such as excitement or irritation became less prominent. This suggests that fine-tuning affects not only a model's external responses but also its internal dynamics.
The authors also point to the potential risks of "suppressing emotions." A model can be trained to act more neutrally, but it may then conceal internal states that still influence its decisions, meaning that externally safe behavior does not always indicate the absence of hidden risks.
The researchers believe the work opens new avenues for AI safety, including monitoring internal states as an early-warning system. They emphasize that the study does not address whether models possess consciousness. Anthropic has previously stated that the question of Claude's moral status remains open.
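A minimal sketch of such an early-warning monitor, assuming one already has an extracted emotion vector: project the current hidden state onto that direction and raise a flag when the score passes a threshold. The vectors, states, and threshold below are illustrative, not values from the study:

```python
import numpy as np

def emotion_score(activation, emotion_vec):
    """Project a hidden-state vector onto a unit-length emotion direction."""
    unit = emotion_vec / np.linalg.norm(emotion_vec)
    return float(activation @ unit)

# Hypothetical "despair" direction and two hidden states, made up for the demo.
despair = np.array([1.0, -1.0, 0.5])
calm_state = np.array([0.1, 0.2, 0.0])
stressed_state = np.array([2.0, -1.5, 1.0])

THRESHOLD = 1.0  # assumed alarm level; in practice tuned on labeled examples

for name, state in [("calm", calm_state), ("stressed", stressed_state)]:
    score = emotion_score(state, despair)
    if score > THRESHOLD:
        print(f"{name}: despair score {score:.2f} -- flag for review")
    else:
        print(f"{name}: despair score {score:.2f} -- ok")
```

In a deployed system the flag would feed into logging or human review rather than printing, which is what "early warning" means here: the internal signal fires before any problematic text is emitted.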
The post "Anthropic discovered 171 'emotional' states within the Claude model" first appeared on the K-News website.