Learning Systems and Innate Behavior
Abstract<br>Most contemporary work on artificial agents — including the current<br>generation of large language model agents — treats motivation as<br>something to be specified at runtime, and treats learning as something<br>that ends at deployment. We argue that what makes a creature feel alive<br>is not the sophistication of its behavior but the presence of innate<br>stakes: internal states it did not choose, cannot disable, and must<br>work to keep within viable ranges. For humans, physical stress is a<br>canonical example. And what enables a creature to grow into its own<br>intelligence is not a finished pretrained model, but an architecture<br>that keeps learning from its own innate experience, driven by those<br>stakes. We ground the argument in a working 2019 reinforcement-learning<br>prototype — with documentation framing the reward signal as "pain and<br>pleasure" — and we argue that the transformer and recent LLM<br>architectures are the first ones flexible enough to play the role of a<br>generalized training mass at initialization. Combining these two<br>observations gives a concrete research vision: smaller models that<br>learn during their lifespan, in receptacles that have something at<br>stake, rather than ever-larger scaling of models on internet data. As<br>of mid-2026, this gap remains open. We sketch the implementation path,<br>a four-direction research program, and the direction we intend to<br>explore.
1. The question that won't go away
A familiar question runs through much of the recent discussion of<br>artificial agents: what is missing for an artificial system to be<br>experienced as alive, rather than as a very capable machine?
The most common answer involves scale. If the model were larger, if<br>the planning horizon were longer, if the multimodal fusion were tighter,<br>then perhaps some threshold would be crossed.
That answer may be incomplete. We have crossed several thresholds<br>people in 2013 would have called impossible, and the resulting systems<br>are extraordinarily useful, but they still tend to be described as<br>sophisticated machines rather than as something alive. The gap may not<br>sit on the intelligence axis at all. It may sit somewhere else.
One possibility worth considering is that what is missing is<br>the innate: a kind of internal state that an agent does not<br>choose and cannot turn off, and that asserts itself against its<br>reasoning rather than emerging from it. A familiar example, and the one<br>this paper builds on, is physical stress: the family of signals that<br>includes pain, hunger, and fatigue. A second possibility, complementary<br>to the first, is that what is also missing is the architecture and<br>posture to learn from those signals over time, not only in a training<br>run that ends at deployment but during the agent's own operating life.
2. An early implementation
A useful way to introduce the argument is through an early example.<br>A small reinforcement-learning prototype from April 2019 placed an agent<br>in a 2D grid world populated with self-moving food, self-moving hazards,<br>and adversarial agents running their own epsilon-greedy policies. The<br>agent had one internal variable (life) and one objective (keep it above<br>zero). The reward signal was a single line of arithmetic — the change in<br>life between two consecutive steps:
    def step(self):
        # 1. existing costs life: substrate decay, every step
        self.life = self.life - 1
        ...
        # 4.1. reward is the change in life
        reward = self.life - self.life_before_step

From the author's 2019 reinforcement-learning prototype. Identifiers translated from Portuguese for readability; the original variable is vida ("life").
The learning algorithm was a standard reinforcement-learning method of<br>the time. What is worth noting is the framing of the reward signal,<br>captured in the docstring at the top of the agent file:
The idea is to make life the reward of this problem — more precisely,<br>gaining or losing life. Philosophically speaking, this can resemble the<br>concept of pain and pleasure, as two sensations directly related to the<br>quality of life of the Agent[...], whose objective in the end is always<br>to get more pleasure.
The prototype was a few hundred lines of code. What made it useful<br>was the combination of two structural features: an internal variable<br>the agent did not control, and a learned response to that variable that<br>lived in the agent's parameters rather than in an externally written<br>rule. Both features will reappear in the proposal that follows.
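The life-delta reward described in this section can be sketched in a few self-contained lines. This is an illustrative reconstruction under stated assumptions, not the 2019 prototype itself: the class name `Agent`, the helper `run_episode`, the `life_from_world` parameter, and all numeric values are hypothetical, and the grid world and the learning algorithm are omitted entirely.

```python
class Agent:
    """Hypothetical minimal agent with one internal variable, `life`."""

    def __init__(self, life: int = 20) -> None:
        self.life = life
        self.life_before_step = life

    @property
    def alive(self) -> bool:
        # The single objective: keep life above zero.
        return self.life > 0

    def step(self, life_from_world: int) -> int:
        """Advance one step and return the reward for that step.

        `life_from_world` is an assumed stand-in for whatever the
        environment did this step (food > 0, hazard < 0, nothing = 0).
        """
        self.life_before_step = self.life
        # Existing costs life: substrate decay, every step.
        self.life -= 1
        # Effect of the environment on life (assumption, see docstring).
        self.life += life_from_world
        # Reward is the change in life across the step.
        return self.life - self.life_before_step


def run_episode(agent: Agent, events: list[int]) -> list[int]:
    """Feed a fixed list of world events to the agent until it dies."""
    rewards = []
    for life_from_world in events:
        if not agent.alive:
            break
        rewards.append(agent.step(life_from_world))
    return rewards


if __name__ == "__main__":
    agent = Agent(life=5)
    print(run_episode(agent, [0, 3, -4, 0, 0, 0]))  # prints [-1, 2, -5, -1]
```

Note that doing nothing is never neutral in this sketch: an uneventful step still yields a reward of -1, because decay is applied before the reward is computed. The agent has no method for setting `life` directly; it can only change it through what happens in the world.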
3. Where the response lives
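The point above is not specific to small reinforcement-learning<br>experiments. The same question, whether the response lives in the<br>agent's parameters or in an externally written rule, seems worth asking<br>of agents more broadly.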
In a typical LLM-based agent, motivation is introduced through a system<br>prompt: "You are a helpful assistant whose goal is X." The agent<br>reads this string at the start of every session and behaves accordingly.<br>If the string changes, the goal changes. If the string is removed, the<br>goal disappears. If the agent is instructed to ignore the string, it<br>often will.
This is configuration the agent has been asked to treat...