Symbolic Visual Reinforcement Learning:
A Scalable Framework with Object-Level Abstraction
and Differentiable Expression Search

The University of Texas at Austin
*Equal Contribution

Overview


This paper presents DiffSES, a framework that learns and generates symbolic policies for visual environments. An example of a symbolic policy generated for the Pong-Atari2600 environment is shown below:

[Figure: an example symbolic tree generated for Pong-Atari2600]

The generated symbolic policy takes the form of a forest of N_action symbolic trees (the figure above shows one such tree; the other trees have similar structures). Each tree corresponds to one action, and its execution value represents the (pre-normalization) probability of that action, just as in neural-network-based agents.
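As a minimal sketch of this readout (the helper below is illustrative, not the released implementation), each tree maps the leaf-node feature vector to a scalar score, and the per-action scores are softmax-normalized into an action distribution:

```python
import numpy as np

def forest_policy(trees, features):
    """Evaluate a forest of symbolic trees into an action distribution.

    trees    : one callable per action; each maps the leaf-node feature
               vector to a scalar score (the pre-normalization value).
    features : 1-D array of object-level features (leaf nodes X0, X1, ...).
    """
    scores = np.array([tree(features) for tree in trees])
    scores -= scores.max()                        # numerical stability
    return np.exp(scores) / np.exp(scores).sum()  # softmax, as in a neural head

# Toy example for a 3-action environment.
toy_trees = [
    lambda x: x[0] - x[1],   # e.g. "move up" scores high when X0 > X1
    lambda x: x[1] - x[0],   # "move down" in the opposite case
    lambda x: 0.1,           # small constant score for "no-op"
]
print(forest_policy(toy_trees, np.array([0.7, 0.4])))
```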

DiffSES is built on an object-level abstraction of the scene, rather than on raw pixels or on the higher level of text planning used in previous works. It uses an unsupervised object detection module to discover objects in the current frame, and learns/searches symbolic expressions over them to identify meaningful objects and features.
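As an illustration of what this abstraction produces (the detector interface and feature layout below are assumptions made for the sketch, not the exact implementation), detected objects from two consecutive frames can be relabeled into a flat feature vector of positions and velocities:

```python
import numpy as np

def objects_to_features(objects_prev, objects_curr, dt=1.0):
    """Relabel unsupervised-detected objects into leaf-node features.

    objects_prev / objects_curr : dict mapping an object id to its (x, y)
        center in the previous / current frame (assumed interface).
    Returns [x, y, vx, vy] per object, concatenated in a fixed order,
    which the symbolic expressions consume as leaf nodes X0, X1, ...
    """
    feats = []
    for oid, (x, y) in sorted(objects_curr.items()):
        px, py = objects_prev.get(oid, (x, y))   # unseen object -> zero velocity
        feats.extend([x, y, (x - px) / dt, (y - py) / dt])
    return np.array(feats)

# Toy example: a ball and a racket tracked across two frames.
prev = {"ball": (0.30, 0.50), "racket": (0.90, 0.40)}
curr = {"ball": (0.35, 0.47), "racket": (0.90, 0.45)}
print(objects_to_features(prev, curr))
```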

In this way, the inputs to the learned symbolic expression are object features: the leaf nodes X0, X1, ... are the relabeled objects' positions and velocities. Such symbolic expressions offer potential explainability of the control policy, since some subtrees may happen to carry geometrically interpretable meanings. For example, suppose the geometric features (leaf nodes) X0, X1, ... are x_pong, x_racket, y_pong, and v_{y,pong}: the x/y locations of the ball and the racket, and the vertical velocity of the ball. Then one subtree might appear as y_pong + v_{y,pong} * (x_racket - x_pong) * c, where c is some constant that converts the horizontal distance into a time gap. This can be read as the y coordinate of the ball's aiming point on the racket, i.e., the point the racket should ultimately move to. Finding such sub-expressions would hint that similar logic has been learned.
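As a hedged, self-contained version of that interpretation (the variable names follow the example above; the constant c and default arguments are hypothetical), the aiming-point subtree can be written out as a plain function:

```python
def aiming_point_y(y_pong, vy_pong, x_pong, x_racket, c=1.0):
    """Evaluates y_pong + vy_pong * (x_racket - x_pong) * c.

    (x_racket - x_pong) * c is read as the time until the ball reaches the
    racket's column; extrapolating the ball's y position over that time
    gives the point the racket should ultimately move to.
    """
    return y_pong + vy_pong * (x_racket - x_pong) * c

print(aiming_point_y(y_pong=0.47, vy_pong=-0.03, x_pong=0.35, x_racket=0.90))
```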

The inference procedure of the learned symbolic policy is shown in the figure below.

[Figure: inference procedure of the learned symbolic policy]

Symbolic Agent Behavior Illustrations



We mainly compare the agents in the environments shown below.

[Figure: the environments used for comparison]

Some observations are shown in the videos below.

This video illustrates the behaviors of the neural network agent, a CNN trained via PPO, and of the learned symbolic policy, which is initialized from the PPO agent. The video displays three example environments, with the action distribution shown at the bottom.

Here is a comparison of the models in a transfer learning setting. The teacher DRL model is trained on AdventureIsland3, and the symbolic agent is learned from it. Both agents are then applied to AdventureIsland2 without fine-tuning. The performance of the symbolic policy drops less than that of the DRL model.

Abstract

Learning efficient and interpretable policies has been a challenging task in reinforcement learning (RL), particularly in the visual RL setting with complex scenes. While deep neural networks have achieved competitive performance, the resulting policies are often over-parameterized black boxes that are difficult to interpret and deploy efficiently. More recent symbolic RL frameworks have shown that high-level domain-specific programming logic can be designed to handle both policy learning and symbolic planning. However, these approaches often rely on human-coded primitives with little feature learning, and when applied to high-dimensional continuous observations such as visual scenes, they can suffer from scalability issues and perform poorly when images have complicated compositions and object interactions.

To address these challenges, we propose Differentiable Symbolic Expression Search (DiffSES), a novel symbolic learning approach that discovers discrete symbolic policies using partially differentiable optimization. By using object-level abstractions instead of raw pixel-level inputs, DiffSES is able to leverage the simplicity and scalability advantages of symbolic expressions, while also incorporating the strengths of neural networks for feature learning and optimization. Our experiments demonstrate that DiffSES is able to generate symbolic policies that are more interpretable and scalable than state-of-the-art symbolic RL methods, even with a reduced amount of symbolic prior knowledge.


Method



Motivations

Symbolic expressions demonstrate simplicity and effectiveness compared with over-parameterized DRL models. Previous works on symbolic RL either fail to scale to complex visual scenes, or evolve symbolic expressions directly from pixel-level inputs, yielding expressions that are often over-complex and require a carefully curated operator set.

We assume that object-level abstraction can yield simpler policies than pixel-level methods. To discover object information in new scenes with minimal human interference, we adopt an unsupervised object detection module.

Furthermore, to make the symbolic expression search procedure more flexible, we introduce neural-network-guided differentiable search into vanilla genetic programming.

Learning steps

The symbolic policy consists of the unsupervised object detection module and the learned symbolic expressions. We employ symbolic knowledge distillation followed by fine-tuning; the pipeline consists of three steps.

Step I: A traditional CNN-based DRL agent is trained using PPO.

Step II: The DRL model is distilled into symbolic expressions using genetic programming based symbolic regression.

Step III: The symbolic expressions are fine-tuned with the proposed neural-guided search, which integrates gradient descent into the genetic programming mutation procedure for symbolic expressions.
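Below is a minimal sketch of Steps II and III under stated assumptions: the teacher's pre-normalization scores for one action are assumed to be already collected into arrays, gplearn stands in for the genetic-programming distillation, and the gradient step shown fine-tunes the constant of a hand-written expression template rather than the full neural-guided mutation described above.

```python
import numpy as np
import torch
from gplearn.genetic import SymbolicRegressor

# Step II: distill one action's teacher scores into a symbolic tree.
# X holds object-level features from teacher rollouts (Step I); y holds the
# teacher's pre-normalization score for this action on the same states.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # toy stand-in features
y = X[:, 1] + 0.8 * X[:, 3] * (X[:, 2] - X[:, 0])   # toy stand-in teacher scores

gp = SymbolicRegressor(population_size=500, generations=10,
                       function_set=("add", "sub", "mul"), random_state=0)
gp.fit(X, y)
print(gp._program)      # the distilled expression over X0..X3

# Step III (simplified): fine-tune an expression's constant by gradient
# descent on the same distillation objective.
c = torch.tensor(0.5, requires_grad=True)
Xt = torch.tensor(X, dtype=torch.float32)
yt = torch.tensor(y, dtype=torch.float32)
opt = torch.optim.Adam([c], lr=0.05)
for _ in range(200):
    pred = Xt[:, 1] + c * Xt[:, 3] * (Xt[:, 2] - Xt[:, 0])
    loss = torch.mean((pred - yt) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("fine-tuned constant:", c.item())
```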


Here is a subset of environments that we found the proposed DiffSES to be compatible with. Effective control in these environments requires handling geometric relationships between objects and the protagonist. These relationships can usually be expressed as intersections of straight lines, which is exactly what symbolic expressions are good at, as sketched below.
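As a purely illustrative example of that kind of geometry (not taken from the learned expressions), the underlying primitive is a straight-line intersection, e.g. where a projectile's trajectory crosses the protagonist's line of motion:

```python
def line_intersection(p1, p2, q1, q2):
    """Intersection of the line through p1, p2 with the line through q1, q2
    (each point an (x, y) tuple). Returns None for parallel lines.
    """
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, q1, q2
    denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    if abs(denom) < 1e-12:
        return None
    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / denom
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

# Ball positions in two frames vs. the vertical line the protagonist moves on.
print(line_intersection((0.30, 0.50), (0.35, 0.47), (0.90, 0.0), (0.90, 1.0)))
```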

BibTeX


   
Coming soon.