Why Does Reinforcement Learning Fail?
And how do we make it succeed?
Despite being one of the core paradigms of machine learning, reinforcement learning (RL) is often treated as its own individual discipline rather than a tool to be applied when the situation arises. There are several reasons for this:
Unlike supervised and unsupervised learning, which optimize over an exogenous, fixed dataset, RL concerns sequential decision-making problems in which data are generated by an agent–environment interaction and depend on the agent’s policy.
Challenges in RL require distinct mathematical tools, many drawn from control theory, game theory, and stochastic processes.
The engineering complexity is typically higher, requiring complex simulation environments and longer training times. This also means the computational bottlenecks are different.
This perception of RL as a self-contained discipline cemented during the late 2010s, when a slew of breakthroughs in Deep RL brought the field into mainstream view. The success of Deep Q-Networks (DQN) on Atari games in 2015, AlphaGo defeating world champion Go players in 2016, OpenAI Five in Dota 2 and AlphaStar in StarCraft II demonstrated that RL could achieve superhuman performance in highly complex, dynamic environments. This led to (renewed) interest in RL, both in academia and industry.

Very quickly, however, the gap in performance between RL in virtual environments and practical settings became apparent. One problem in particular that throttled performance was non-stationarity [1], where dynamic and evolving environments shift the optimal action distribution over time, and not necessarily in predictable ways. This meant that any hopes for immediate returns were dashed, and people in the field now had to contend with both the limitations of deep learning and of RL itself. Such limitations have confined it to relatively narrow domains: in many industrial or scientific applications, practitioners still prefer supervised or heuristic approaches even when the underlying problem is sequential, largely because RL remains perceived as complex, unstable, and computationally expensive. Indeed, high-profile figures in the AI community such as Andrej Karpathy and Yann LeCun have been vocal in their skepticism about the usefulness of RL.
Despite this, work on the aforementioned challenges has proceeded steadily and RL has been increasingly applied to a variety of control problems. However, the core question remains: how, if at all, can RL be made ready for large scale, real world deployment?
What is Reinforcement Learning?
To answer the above question, we need to delve a bit further into what reinforcement learning actually is. Fundamentally, it is a class of optimization techniques for sequential decision-making tasks. Importantly, it assumes that these tasks can be represented as a Markov game.
A Markov Game is a tuple ⟨S, A, T, R, γ, N⟩ where:
S is the set of all joint states of the game. This could be a collection of specific cells in a grid, for example.
A is the joint action space, the set of all possible combinations of actions for all players in the game.
T is the transition dynamics: a probability distribution over successor states, given that a player takes a particular action in a particular state.
R is the reward function. This maps a transition (taking an action from a state and ending up in another) to a scalar value. For example, reaching a goal state gives +1 reward.
γ is a discount factor between 0 and 1 that ensures sums over arbitrarily long sequences of rewards remain finite.
N is the number of players in the game.
A Markov game with N = 1 is called a Markov Decision Process (MDP).
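To make the tuple concrete, here is a minimal sketch of an MDP as a plain Python structure. The class and field names are my own for illustration, not from any RL library:

```python
from dataclasses import dataclass

# A tiny two-state MDP (N = 1): states {0, 1}, actions {"left", "right"}.
# T[s][a] maps to a dict of successor-state probabilities;
# R[(s, a, s_next)] is the scalar reward for that transition.
@dataclass
class MDP:
    states: tuple
    actions: tuple
    T: dict       # transition dynamics
    R: dict       # reward function
    gamma: float  # discount factor in (0, 1)

mdp = MDP(
    states=(0, 1),
    actions=("left", "right"),
    T={0: {"left": {0: 1.0}, "right": {1: 1.0}},
       1: {"left": {0: 1.0}, "right": {1: 1.0}}},
    R={(0, "right", 1): 1.0},  # reaching the goal cell gives +1
    gamma=0.9,
)
```

Here T encodes deterministic movement between two grid cells, and R gives +1 for the single goal-reaching transition, matching the definitions above.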
A Markov game assumes the Markov property, which states that the evolution of a system depends only on the current state and the history is irrelevant. In other words, regardless of how an agent reaches a specific state, its action probability distribution remains the same under a particular policy. The policy is a conditional probability distribution π(⋅|s) which maps each state to an action probability distribution. For example, State s₁ → π(·|s₁) might be 0.8 for action “left”, 0.2 for action “right” and State s₂ → π(·|s₂) might be 0.3 for action “left”, 0.7 for action “right”.
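The example policy above can be written down directly, and sampling from π(·|s) is then a single call (a toy sketch, not a library API):

```python
import random

# pi[s] is the action distribution pi(.|s) for state s,
# using the numbers from the example above.
pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def sample_action(policy, state, rng=random):
    """Draw one action from the policy's distribution at this state."""
    actions, probs = zip(*policy[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]
```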
The Markov property ensures that this mapping is computationally tractable, since otherwise the policy would be a mapping from histories to actions which rapidly becomes intractable to work with as S increases. It also provides theoretical convergence guarantees to many RL algorithms.
Note: As one might expect, in some practical settings the history may be very relevant to a decision which needs to be made. This means that either the environment or the framework needs to be adapted to make sure the Markov property is upheld.
The policy is updated by interaction with the environment. At each timestep, the agent takes an action in the environment (sampled from the policy) and the environment communicates a reward signal to the agent that weights said action positively or negatively. The agent can be trained using data generated by the current policy (online learning) or accumulated data from past interactions (offline learning) - training meaning an update to (the probability distribution that represents) the policy. After many training steps, in theory, the agent’s behaviour should be optimal.
Note: In practice, online learning is computationally expensive as training updates are performed every timestep and can bias the learned policy.
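The interaction loop itself is simple to sketch. The Env class below is a toy stand-in of my own, and the rollout only accumulates reward; an actual learner would also update the policy at the commented line:

```python
import random

class Env:
    """Toy stand-in environment: +1 reward for "right", 0 otherwise."""
    def __init__(self):
        self.state = "s1"

    def step(self, action):
        reward = 1.0 if action == "right" else 0.0
        return "s1", reward  # single-state toy dynamics

def run_episode(policy, env, horizon=10):
    """One online rollout: sample an action, observe the reward signal."""
    total = 0.0
    for _ in range(horizon):
        dist = policy[env.state]
        action = random.choices(list(dist), weights=list(dist.values()))[0]
        env.state, reward = env.step(action)
        total += reward  # a learner would also update the policy here
    return total
```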
As you can imagine, there are a bunch of algorithms that translate this high level process into something that can be applied to different problem spaces. The taxonomy of these algorithms is explained well in this article.

For the purposes of our discussion, we’ll just make the high level distinction between model-free and model-based algorithms.
If we know the transition dynamics T of the Markov game (i.e. if we know exactly what state the agent will end up in by taking an action from any state), or we can estimate them, then we can use model-based algorithms, which plan by simulating and evaluating different trajectories.
But very often we don’t know T, and trying to estimate it is costly or impractical, so model-free algorithms are more widely used. By balancing exploration of unseen states and actions with exploitation of those already seen, the agent bootstraps value estimates of state–action pairs (taking a given action from a given state), which it then uses to select actions.
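A minimal model-free example is tabular Q-learning, which bootstraps state–action values without ever representing T. The chain environment below is a toy of my own invention, used only to exercise the update rule:

```python
import random
from collections import defaultdict

def q_learning(step, start, actions, episodes=500, alpha=0.5,
               gamma=0.9, eps=0.3, horizon=20):
    """step(s, a) -> (s_next, reward, done). Returns the learned Q-table."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            # epsilon-greedy: explore with probability eps, else exploit
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = step(s, a)
            # bootstrap toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[(s2, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if done:
                break
    return Q

# Toy chain: states 0..3, "right" moves toward the goal state 3 (+1 reward).
def chain_step(s, a):
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3
```

After training on the chain, the value of moving toward the goal should dominate the value of moving away, even though the algorithm never saw the transition function itself.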
The distinction between model-based and model-free RL highlights the central question of what information the agent relies on when improving its behavior. That same question clarifies how RL relates to supervised and unsupervised learning: in those paradigms, the learner passively receives data, whereas in RL the learner generates its own data through actions.
So, what is the point of RL? To answer this, we need to distinguish RL more concretely from the other machine learning paradigms. The key distinction is that RL addresses problems where we don’t know what the ‘expert’ solution is.
Firstly, unlike supervised learning, we are not dealing with labeled data. Even if such data were available, it would only represent expert behaviour under a limited set of states. Producing accurate labels that reflect long-term consequences is exceptionally challenging, and even small errors can compound catastrophically. This difficulty forces the agent to learn by interacting with the environment, as discussed earlier, which is a fundamentally distinct challenge of RL.
Second, unlike unsupervised learning, the goal is not to uncover structure in a fixed dataset. The agent is actively trying to optimize a policy through interaction, and its behaviour determines which states it encounters. The data is neither static nor passive; it is generated as a consequence of the agent’s decisions. RL therefore focuses on exploiting this interaction to maximize cumulative reward, not on describing the underlying distribution of observations.
RL seeks to build an ‘expert’ solution from first principles through interaction with the environment, particularly in settings where long-term consequences are not readily observable and preset control sequences or heuristic algorithms are ineffective. Indeed, if we already knew the optimal sequence of actions, there wouldn’t be much point in using machine learning in the first place - which is why it very often isn’t used.
The application of deep learning to RL, or deep reinforcement learning (DRL) is simply RL in which deep neural networks are used as function approximators for the policy, the value function, the model or combinations of these. This presents its own challenges, such as exploding gradients and violation of theoretical convergence guarantees, but ultimately the biggest bottlenecks to performance are still fundamentally RL ones.
Why Reinforcement Learning Fails
RL can in principle be applied to a wide range of control problems, but several factors severely limit its practical performance.
Sample inefficiency: Most RL algorithms require a huge amount of experience to learn anything reliable. The core issue is not simply that rewards are unknown until an action is taken, but that estimates of long-term returns have high variance and depend on trajectories rather than isolated samples. Sparse rewards, unstable bootstrapping, and continual distributional shift further amplify this problem. As a result, RL typically needs far more data than other learning paradigms.
Credit assignment: Determining how much each decision contributed to an eventual outcome is intrinsically difficult. Temporal credit assignment deals with rewards that may arrive long after the causally relevant action, while in multi-agent settings the problem is compounded by the need to apportion credit across interacting policies. In both cases, the learning signal becomes noisy and impedes effective optimization.
Non-stationarity: The transition dynamics in many real environments can change over time, and even when the underlying system is stable, the agent’s own learning induces non-stationarity by interaction with the environment. As the policy changes, so does the distribution of states the agent visits. In multi-agent settings this becomes more severe: each agent’s policy updates alter the effective environment faced by the others, breaking assumptions required for convergence.
Hyper-parameter sensitivity: While hyper-parameter tuning is common in machine learning generally, RL is unusually sensitive to these choices. The interaction between value estimation (of states) and policy updates is highly nonlinear, and exploration–exploitation trade-offs cannot be tuned via cross-validation because the data distribution evolves with the policy. Minor adjustments in learning rates, discount factors, or exploration parameters can produce radically different outcomes.
Safety and robustness: Policies that perform well in simulation often fail under modest distribution shifts in the real system. RL agents tend to over-fit to the trajectories they experienced during training and extrapolate poorly to unseen states. This brittleness can lead to catastrophic actions when deployed in safety-critical settings such as autonomous driving or robotics. Ensuring reliability under uncertainty is substantially harder than with traditional control methods that provide structural guarantees.
All of these challenges must be addressed for RL to be viable for a given control problem, and in many domains they remain unresolved. Consequently, in practice, most industrial control problems continue to rely on tried-and-tested techniques such as Proportional-Integral-Derivative (PID) controllers or Model Predictive Control (MPC). These methods are widely deployed because they are robust, interpretable, and computationally efficient. RL, in contrast, is typically reserved for carefully selected tasks where classical approaches struggle — for example, fine-tuning locomotion on Boston Dynamics’ Spot, autonomous navigation in unstructured environments [2], or learning manipulation strategies that are difficult to hand-engineer.
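For contrast, the classical baseline is strikingly small. A discrete-time PID controller in textbook form looks roughly like this (a sketch, not any particular vendor's implementation):

```python
class PID:
    """Discrete-time PID: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd, dt=0.01):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt            # accumulate error over time
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The three gains are the only tuning knobs, which is a large part of why such controllers remain the industrial default.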
But, even in these niche applications, RL is rarely applied in isolation. It is combined with simulation, model-based planning, or safety constraints to mitigate sample inefficiency, fragility, and safety risks. This highlights that RL’s strength is not presently as a general-purpose control method, but as a complementary tool in domains where state-of-the-art controllers are impractical or sub-optimal. To ground this discussion, let’s consider the application of RL to robotic control.
Training an RL agent from scratch on physical hardware is often prohibitively expensive and risky: even a single fall can damage components. Instead, developers typically train policies in high-fidelity simulations and then transfer them to the robot, using techniques such as domain randomization or sim-to-real adaptation to bridge the gap between virtual and real environments. This pipeline is resource-intensive and requires careful engineering, which is why RL is not the default choice.
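Domain randomization, in its simplest form, just resamples simulator parameters each episode so the policy cannot over-fit to one set of dynamics. The parameter names and ranges below are illustrative assumptions, not values from any real pipeline:

```python
import random

def randomized_sim_params(rng=random):
    """Sample one episode's physics parameters around nominal values."""
    return {
        "mass": rng.uniform(0.8, 1.2),          # +/- 20% around 1.0 kg
        "friction": rng.uniform(0.5, 1.5),
        "motor_delay": rng.uniform(0.0, 0.02),  # seconds of actuation latency
        "sensor_noise": rng.uniform(0.0, 0.05), # observation-noise std-dev
    }

# Each training episode would then build its simulator from a fresh sample,
# e.g. env = make_sim(**randomized_sim_params()) for some simulator factory.
```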
In contrast, PID controllers or MPC can produce reliable, predictable behavior with minimal tuning and without extensive data collection. They are interpretable, provably stable under certain conditions, and easier to certify for safety-critical systems. RL excels only when the control task involves complex, high-dimensional dynamics or long-horizon objectives that are difficult to encode manually — for example, optimizing gait patterns for uneven terrain, adaptive manipulation of objects with unknown physical properties, or multi-agent coordination tasks where emergent strategies outperform hand-engineered heuristics. But all of these applications are individual components of a larger robotic system, which underscores the niche role of RL in robotics.
So, given all these issues that are inherent to RL, how do we overcome them?
How to Make Reinforcement Learning Succeed
Obviously, we should work on the problems outlined above. But the more fundamental issue is how we approach RL—both conceptually and practically. Several deep philosophical shifts may be necessary:
Conceptual
A major barrier to mainstream adoption is that many of the domains where RL is applied are already well understood. For most industrial processes we know exactly how we want controllers to behave, and classical tools like PID or MPC already work very well.
The real opportunity lies in domains where control objectives are not fully known and where traditional engineering provides no closed-form solution. As new technologies like autonomous vehicles and blockchain develop, we will increasingly encounter decision-making problems that cannot be solved feasibly using analytical or classical control methods. In these cases, RL becomes a method for discovering effective strategies from first principles instead of replacing controllers we already know how to design. Specifically, RL becomes valuable when entering domains where either:
the optimal strategy is unknown, or
hand-engineering a controller is prohibitively complex, or
the environment is too high-dimensional, volatile, or poorly modeled
Since many domains will most likely host a combination of these, modern RL needs to focus on the complexity gap between its most commonly used test-beds and real applications. Games like Go have every property that makes learning easy: deterministic, discrete, static, fully observable, single-agent, episodic. Real-world problems have none of these; instead they come with partial observability, multi-agent dynamics, non-stationarity, and continuous action spaces, all of which remain largely unsolved. Success requires explicitly designing for real-world complexity.
Practical
Even in promising domains, RL will only succeed if its deployment pipeline becomes as disciplined and scalable as classical control engineering. Real systems cannot tolerate millions of unsafe or random interactions, so sim-to-real transfer needs to be treated as an engineering sub-field in its own right.
In addition, complete end-to-end RL policies are inherently brittle and uninterpretable. For real systems, modular architectures — hierarchical RL, options frameworks, RL combined with classical control loops — are more stable and far easier to certify.
For example, a framework for robotic control could look like:
MPC handles stability and constraint satisfaction;
RL optimizes high-level strategy or adapts parameters;
safety monitors detect out-of-distribution states and fall back to known-safe controllers.
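The three components above can be sketched as a single dispatch step; every function name here is a placeholder of my own, not a real framework:

```python
def hybrid_control_step(state, rl_policy, mpc_solve,
                        is_in_distribution, safe_controller):
    """One step of the MPC + RL + safety-monitor architecture.

    The safety monitor runs first; RL proposes a high-level target;
    MPC turns that target into a constraint-satisfying command.
    """
    if not is_in_distribution(state):
        return safe_controller(state)  # fall back to the known-safe controller
    target = rl_policy(state)          # high-level strategy from RL
    return mpc_solve(state, target)    # stability + constraints via MPC
```

The design point is that RL never commands actuators directly: its output is always filtered through a certifiable layer.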
This hybrid approach is already common and appears integral to the growing application of RL to control problems. Equally integral is anchoring performance metrics in real-world cost structures. Most RL papers optimize for rewards defined by researchers, not actual operational metrics. Real deployments bring additional considerations: safety costs dominate, downtime matters more than asymptotic optimality, and human interaction imposes its own constraints, to name a few. For RL to be more readily applicable, it has to be evaluated the way controllers are evaluated: worst-case performance, energy usage, maintenance cost, failure probability, recovery behavior.
Conclusions
Fundamentally, RL is a single tool within a toolbox that contains many others. If a particular sequential decision making problem can be more efficiently solved with a simple heuristic or a basic algorithm then of course they should be tried first.
Unlike other machine learning paradigms, RL has developed its own dedicated community due to the unique problems that arise in its practice. This has perhaps led to a familiar trap: if you have a hammer, you are primed to see everything as a nail. This attitude isn’t unique to RL, but in my opinion it is uniquely unproductive here because of the specific type of problems that RL solves.
Sequential decision-making problems are a niche, but they also happen to be one of the most important ones. They underpin almost any system that must act autonomously in complex, dynamic, or poorly understood environments. In other words, while RL may be ‘niche’ in terms of current industrial applications, it’s foundational for future AI systems that must operate in the real world, including AGI. Intelligent behaviour doesn’t come about from next-token prediction but from real interaction with the world.
References
[1] Hamadanian, P., Schwarzkopf, M. and Sen, S., 2022, April. How reinforcement learning systems fail and what to do about it. In The 2nd Workshop on Machine Learning and Systems (EuroMLSys).
[2] Raj, R. and Kos, A., 2024. Intelligent mobile robot navigation in unknown and complex environment using reinforcement learning technique. Scientific Reports, 14(1), p.22852.


