Nikola Milosevic
Ph.D. Candidate @ MPI CBS

I'm a researcher working on making Reinforcement Learning (RL) more reliable and ready for real-world use.

While RL has shown impressive results in games and simulations, applying it safely and effectively outside the lab is still a major challenge. My work focuses on building algorithms that help RL systems learn in ways that are not just powerful, but also safe, stable, and aligned with real-world goals and constraints. This includes areas like safety-critical decision-making, learning from imperfect or limited data (offline RL), and developing new ways to guide exploration and generalization. Ultimately, I’m interested in closing the gap between what RL can do in theory and what it needs to do in practice.


Education
  • Max Planck Institute for Human Cognitive and Brain Sciences
    Neural Data Science Lab
    Ph.D. Candidate
    Sep. 2022 - present
  • University of Applied Sciences Leipzig
    M.S. in Electrical Engineering
    Sep. 2019 - Jul. 2021
Honors & Awards
  • Master's Thesis Award @ University of Applied Sciences Leipzig
    2021
News
2025
Embedding Safety into RL: My first Ph.D. work was accepted as a conference paper at ICML 2025 in Vancouver, Canada!
May 30
Selected Publications
Embedding Safety into RL: A New Take on Trust Region Methods

Nikola Milosevic, Johannes Müller, Nico Scherf

International Conference on Machine Learning (ICML) 2025 Spotlight

Reinforcement Learning (RL) agents can solve diverse tasks but often exhibit unsafe behavior. Constrained Markov Decision Processes (CMDPs) address this by enforcing safety constraints, yet existing methods either sacrifice reward maximization or allow unsafe training. We introduce Constrained Trust Region Policy Optimization (C-TRPO), which reshapes the policy space geometry to ensure trust regions contain only safe policies, guaranteeing constraint satisfaction throughout training. We analyze its theoretical properties and connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Experiments show that C-TRPO reduces constraint violations while maintaining competitive returns.
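
For orientation (my own generic notation, not taken from the paper): methods in this family solve, at each policy update, a constrained trust-region subproblem of roughly the form

\[
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right]
\quad \text{s.t.} \quad \bar{D}_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right) \le \delta, \qquad J_C(\pi_\theta) \le d,
\]

where \(J_C\) denotes the expected cumulative cost and \(d\) the safety budget. C-TRPO, in essence, reshapes the divergence defining the trust region so that the region contains only constraint-satisfying policies, which is what yields constraint satisfaction throughout training.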

All publications