Reward Biased Maximum Likelihood Estimation:
The exploration-exploitation trade-off remains a challenging issue in reinforcement learning. RBMLE is a class of model-based learning algorithms that addresses this trade-off by biasing the maximum likelihood estimate toward parameters that promise higher rewards. It can be applied to a variety of learning tasks, such as Markov decision processes, stochastic bandits, linear quadratic systems, and contextual bandits. Theoretical analysis shows that RBMLE achieves regret bounds comparable to state-of-the-art methods, and empirical results show that it outperforms existing techniques, including Upper Confidence Bound (UCB) and Thompson Sampling.
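The reward-biasing principle is easy to illustrate in the simplest setting. Below is a minimal sketch for a unit-variance Gaussian multi-armed bandit, where biasing each arm's log-likelihood by α(t)·θ and re-maximizing yields the closed-form index μ̂_i + α(t)/N_i; the bias schedule α(t) used here is an assumption for illustration, not the exact choice analyzed in the papers below.

```python
import numpy as np

# Illustrative RBMLE for a Gaussian multi-armed bandit (unit variance).
# Biasing arm i's log-likelihood of its mean theta by alpha(t) * theta
# and re-maximizing gives the closed-form index  mu_hat_i + alpha(t) / N_i,
# where alpha(t) is an increasing bias sequence (alpha(t) ~ sqrt(t) here;
# the schedules analyzed in the papers may differ).

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])    # unknown to the learner
n_arms, horizon = len(true_means), 5000

counts = np.ones(n_arms)                  # one initial pull per arm
sums = rng.normal(true_means, 1.0)        # rewards from the initial pulls

for t in range(n_arms, horizon):
    alpha = np.sqrt(t + 1)                  # reward-bias schedule (assumed)
    index = sums / counts + alpha / counts  # biased-MLE index per arm
    arm = int(np.argmax(index))
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    sums[arm] += reward

print("pull counts:", counts)             # most pulls should go to arm 2
```

Because the bonus α(t)/N_i shrinks as an arm is pulled more often, the growing bias keeps under-explored arms attractive, driving exploration without explicit confidence bounds or posterior sampling.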
Publications:
Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems,
Akshay Mete, R. Singh & P. R. Kumar, NeurIPS 2022
Link: https://arxiv.org/abs/2201.10542
Reward Biased Maximum Likelihood Estimation,
Akshay Mete, R. Singh & P. R. Kumar, CSS 2022 (Invited Paper)
Link: https://ieeexplore.ieee.org/abstract/document/9751189
Reward Biased Maximum Likelihood Estimation for Reinforcement Learning,
Akshay Mete, R. Singh & P. R. Kumar, L4DC 2021
Link: http://proceedings.mlr.press/v144/mete21a.html
Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits,
Yu-Heng Hung, Ping-Chun Hsieh, Xi Liu, P. R. Kumar, AAAI 2021
Link: https://ojs.aaai.org/index.php/AAAI/article/view/16961/16768
Exploration Through Reward Biasing: Reward-Biased Maximum Likelihood Estimation for Stochastic Multi-Armed Bandits,
Xi Liu, Ping-Chun Hsieh, Yu-Heng Hung, A. Bhattacharya, P. R. Kumar, ICML 2020
Link: https://proceedings.mlr.press/v119/liu20g.html
Provable Learning with Real-World Constraints:
(Reinforcement) learning methods play important roles in many fields, e.g., robotic manipulation, computer game playing, chip placement design, and recommendation systems. However, it is challenging to extend this success to real-world applications by directly applying existing approaches. Real-world systems are typically subject to several constraints, including safety, multiple objectives, the simulation-to-reality gap, multi-agent cooperation/competition, small sample sizes, etc. To address these challenges, we have developed a series of works on safe RL, multi-objective RL, robust RL, multi-agent RL, and learning from few samples.
Safe and Multi-Objective Reinforcement Learning:
The goal of safe RL is to learn a policy that maximizes the objective while satisfying the safety constraints. We have solved several open problems in this area.
1. Previous OFU (optimism in the face of uncertainty)-style algorithms can learn a safe policy, but naïve OFU algorithms violate the constraints during learning. We propose two new algorithms that achieve O(√K) regret over K episodes with respect to the performance objective, while guaranteeing zero or bounded safety-constraint violation with arbitrarily high probability.
2. Previous policy-gradient-based algorithms for constrained MDPs suffer from an O(1/√T) convergence rate. We propose a new policy mirror descent primal-dual approach that provably achieves a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation, and that enjoys superior empirical performance compared to previous methods; a toy sketch of the primal-dual idea follows this list.
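As a toy sketch of the primal-dual idea in item 2 (a single-state constrained MDP, i.e., a constrained bandit, with assumed step sizes; the paper's algorithm handles full multi-state CMDPs with provable rates):

```python
import numpy as np

# Toy primal-dual policy mirror descent on a single-state constrained MDP:
# maximize E_pi[r]  subject to  E_pi[c] >= b.
# The KL mirror-descent step on the Lagrangian r + lam * c is the familiar
# multiplicative-weights update; multi-state dynamics and the regularization
# used in the paper are omitted here.

r = np.array([1.0, 0.8, 0.2])   # reward per action
c = np.array([0.0, 0.5, 1.0])   # "utility" per action (constraint quantity)
b = 0.4                          # require E_pi[c] >= b

pi = np.ones(3) / 3              # initial policy (uniform)
lam = 0.0                        # dual variable
eta_pi, eta_lam = 0.1, 0.1       # step sizes (assumed, untuned)

for _ in range(2000):
    # Primal: mirror-descent (exponentiated-gradient) step on r + lam * c.
    pi = pi * np.exp(eta_pi * (r + lam * c))
    pi /= pi.sum()
    # Dual: raise lam when the constraint is violated, lower it otherwise.
    lam = max(0.0, lam + eta_lam * (b - pi @ c))

print("policy:", np.round(pi, 3), " E[r]:", round(pi @ r, 3),
      " E[c]:", round(pi @ c, 3), "(target >=", b, ")")
```

On the probability simplex, the KL mirror step is exactly a multiplicative-weights update, which is why policy mirror descent takes this exponentiated form.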
The goal of multi-objective RL is to optimize a policy for multiple objectives simultaneously. Depending on the system requirements, the agent can employ different criteria on the value vector, e.g., proportional fairness, hard constraints, or a max-min trade-off. Previously, there was no systematic way to design policy-gradient-based algorithms for multi-objective RL, nor a systematic theory to analyze their convergence rates across all of these scenarios. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which achieves O(log(T)/T) global convergence with exact gradients and enjoys superior empirical performance compared to some existing policy-gradient-based approaches in both exact-gradient and sample-based deep RL scenarios.
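The anchor-changing mechanism can likewise be sketched in a single-state example with a proportional-fairness criterion f(π) = Σ_i log(π·r_i); the step size, regularization weight, and epoch length below are assumed, and the full ARNPG framework covers general MDPs and criteria.

```python
import numpy as np

# Toy anchor-changing regularized NPG (single state, softmax policy) for a
# multi-objective criterion: proportional fairness f(pi) = sum_i log(pi . r_i).
# Each step is a KL-regularized mirror update toward the current anchor, and
# the anchor is reset to the current policy every `epoch` steps. This is an
# illustration of the scheme, not the paper's exact algorithm.

r = np.array([[1.0, 0.1, 0.1],     # objective 1 favors action 0
              [0.1, 0.1, 1.0]])    # objective 2 favors action 2

pi = np.ones(3) / 3
anchor = pi.copy()
eta, tau, epoch = 0.2, 0.5, 20     # step size / regularization (assumed)

for t in range(1000):
    values = r @ pi                              # V_i = pi . r_i
    grad = (r / values[:, None]).sum(axis=0)     # d f / d pi(a)
    # Mirror step on f(pi) - tau * KL(pi || anchor):
    pi = pi * np.exp(eta * (grad - tau * (np.log(pi) - np.log(anchor))))
    pi /= pi.sum()
    if (t + 1) % epoch == 0:
        anchor = pi.copy()                       # anchor change

print("policy:", np.round(pi, 3), " values:", np.round(r @ pi, 3))
# Expect a policy balancing actions 0 and 2, trading the two objectives fairly.
```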
Publications:
Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, and Chao Tian. “Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs.” Advances in Neural Information Processing Systems 34 (2021): 17183-17193. [link]
Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, and Chao Tian. “Policy Optimization for Constrained MDPs with Provable Fast Global Convergence.” arXiv preprint arXiv:2111.00552 (2021). [link] [code]
Ruida Zhou, Tao Liu, Dileep Kalathil, P. R. Kumar, and Chao Tian. “Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning.” Advances in Neural Information Processing Systems 35 (2022): 13584-13596. [link] [code]
Robust Reinforcement Learning:
The goal of robust RL is to determine a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but these are no longer tractable when the number of states scales up. To address this, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one has access only to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function-approximation error. Finally, we demonstrate the robust performance of the policy learned by RNAC in multiple MuJoCo environments and on a real-world TurtleBot navigation task.
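The flavor of robust evaluation can be conveyed with a simpler uncertainty set than those in the paper. The sketch below uses a δ-contamination set, for which the worst-case transition has a closed form, to run a robust Bellman iteration on a toy tabular MDP; the double-sampling and IPM sets and the function-approximation machinery of RNAC are omitted.

```python
import numpy as np

# Robust policy evaluation on a toy tabular MDP under a delta-contamination
# uncertainty set {(1-delta) * p + delta * q : q any distribution}, whose
# worst case has the closed form (1-delta) * E_p[V] + delta * min_s' V(s').
# RNAC instead uses double-sampling / IPM sets with function approximation;
# this sketch only conveys the robust Bellman operator.

n_states, gamma, delta = 3, 0.9, 0.1
P = np.array([[0.8, 0.1, 0.1],    # nominal P(s' | s) under a fixed policy
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
r = np.array([1.0, 0.0, 0.5])     # reward per state under that policy

V = np.zeros(n_states)
for _ in range(500):               # robust Bellman iteration (a contraction)
    worst = (1 - delta) * (P @ V) + delta * V.min()
    V = r + gamma * worst

print("robust values:", np.round(V, 3))
```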
Publication:
Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, P. R. Kumar, and Chao Tian. “Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation.” Advances in Neural Information Processing Systems 36 (2023). [link] [code]
Multi-Agent Reinforcement Learning:
The goal of multi-agent RL is to learn policies for multiple agents that cooperate or compete. We study an independent natural policy gradient (NPG) algorithm for multi-agent reinforcement learning in Markov potential games. We show that, under mild technical assumptions and the introduction of a suboptimality gap, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an ε-Nash equilibrium (NE) within O(1/ε) iterations. This improves upon the previous best result of O(1/ε²) iterations and matches the O(1/ε) order achievable in the single-agent case. Empirical results for a synthetic potential game and a congestion game verify the theoretical bounds.
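In the single-state special case, independent NPG with softmax policies and exact evaluation reduces to each player independently running multiplicative weights on its own expected payoffs. The sketch below shows this for a two-player congestion game (an example of a potential game); the step size and initialization are assumptions for illustration.

```python
import numpy as np

# Independent NPG in a one-shot two-player congestion game (a potential game).
# Each player picks one of two facilities; the cost of a facility equals the
# number of players using it, and payoff = -cost. With softmax policies and
# exact evaluation, the independent NPG update is multiplicative weights on
# each player's own expected payoffs -- no communication between players.

eta, T = 0.5, 300
# Slightly asymmetric initial policies so learning can break the symmetry.
p = np.array([0.6, 0.4])   # player 1's distribution over the two facilities
q = np.array([0.4, 0.6])   # player 2's distribution

for _ in range(T):
    # Expected payoff of each facility: -(own load 1 + prob. opponent is there).
    u1 = -(1.0 + q)
    u2 = -(1.0 + p)
    # Independent multiplicative-weights (NPG) updates.
    p = p * np.exp(eta * u1); p /= p.sum()
    q = q * np.exp(eta * u2); q /= q.sum()

print("player 1:", np.round(p, 3), " player 2:", np.round(q, 3))
# Expect convergence toward an anti-coordinated Nash equilibrium
# (the players concentrate on different facilities).
```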
Publication:
Youbang Sun, Tao Liu, Ruida Zhou, P. R. Kumar, and Shahin Shahrampour. “Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games.” Advances in Neural Information Processing Systems 36 (2023). [link] [code]
Learning from Few Samples:
Motivated by the problem of learning with small sample sizes, we show how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the maximum similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds with high probability for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
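The max-similarity construction can be sketched with 1-D signals and cyclic shifts standing in for images and translations; the base kernel, its parameters, and the toy data below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Max-similarity kernel over a transformation group: K(x, y) = max_g k(g(x), y),
# here with 1-D signals, cyclic shifts as the group, and an RBF base kernel.
# Such kernels are not positive definite in general, but (as discussed above)
# they are PD with high probability in the small-sample regime. The parameters
# (gamma, signal construction) are illustrative.

rng = np.random.default_rng(0)

def max_sim_kernel(X, Y, gamma=0.5):
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        # All cyclic shifts of x: the transformation group acting on x.
        shifts = np.stack([np.roll(x, s) for s in range(len(x))])
        for j, y in enumerate(Y):
            d2 = ((shifts - y) ** 2).sum(axis=1)      # squared distances
            K[i, j] = np.exp(-gamma * d2).max()       # best over the group
    return K

# Tiny dataset: two template "signals", randomly shifted and noised.
templates = {0: np.array([3.0, 0, 0, 0, 0, 0]), 1: np.array([1.5, 1.5, 0, 0, 0, 0])}
X = np.stack([np.roll(templates[c], rng.integers(6)) + 0.1 * rng.normal(size=6)
              for c in (0, 1) * 10])
y = np.array([0, 1] * 10)

X_tr, y_tr, X_te, y_te = X[:12], y[:12], X[12:], y[12:]
clf = SVC(kernel="precomputed").fit(max_sim_kernel(X_tr, X_tr), y_tr)
acc = (clf.predict(max_sim_kernel(X_te, X_tr)) == y_te).mean()
print("shift-invariant SVM test accuracy:", acc)
```

Taking the maximum over the group makes the classifier's decision invariant to which shift of a signal it happens to see, which is the property the plain RBF kernel lacks.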
Publication:
Tao Liu, P. R. Kumar, Ruida Zhou, and Xi Liu. “Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales.” Advances in Neural Information Processing Systems 35 (2022): 9151-9163. [link] [code]