NS-Gym Benchmark Algorithms

class ns_gym.benchmark_algorithms.MCTS(env, state, d, m, c, gamma)[source]

Bases: Agent

Vanilla MCTS with Chance Nodes. Compatible with OpenAI Gym environments.

Selection and expansion are combined in the tree-policy method; the rollout/simulation is the default policy.

Parameters:
  • env (gym.Env) – The environment to run the MCTS on.

  • state (Union[int, np.ndarray]) – The state to start the MCTS from.

  • d (int) – The depth of the MCTS.

  • m (int) – The number of simulations to run.

  • c (float) – The exploration constant.

  • gamma (float) – The discount factor.

v0

The root node of the tree.

Type:

DecisionNode

possible_actions

List of possible actions in the environment.

Type:

list

Qsa

Dictionary to store Q values for state-action pairs.

Type:

dict

Nsa

Dictionary to store visit counts for state-action pairs.

Type:

dict

Ns

Dictionary to store visit counts for states.

Type:

dict
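
A minimal usage sketch (hedged: the CartPole-v1 environment, the seed, and the use of copy.deepcopy to give the planner its own environment copy are illustrative choices, not part of the documented API):

    import copy
    import gymnasium as gym
    from ns_gym.benchmark_algorithms import MCTS

    env = gym.make("CartPole-v1")              # illustrative environment choice
    obs, _ = env.reset(seed=0)

    # Plan with 50 iterations per decision, depth 10, c = sqrt(2), discount 0.99.
    agent = MCTS(env=copy.deepcopy(env), state=obs, d=10, m=50, c=1.414, gamma=0.99)
    action = agent.act(obs, copy.deepcopy(env))
    obs, reward, terminated, truncated, _ = env.step(action)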

act(observation, env)[source]

Decide on an action using the MCTS search, reinitializing the tree structure.

Parameters:

observation (Union[int, np.ndarray]) – The current state or observation of the environment.

Returns:

The selected action.

Return type:

int

best_action(v)[source]

Select the best action based on the Q values of the state-action pairs.

Returns:

The best action to take.

Return type:

int

best_child(v)[source]

Find the best child nodes based on the UCT value.

This method is only called for decision nodes.

Parameters:

v (DecisionNode) – The decision node whose children are evaluated.

Returns:

The best child node based on the UCT value, together with the action that leads to that child.

Return type:

Node
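
For reference, the UCT score that typically drives this selection can be written as in the sketch below; this is the generic formulation, not necessarily the exact expression used in the source:

    import math

    def uct_score(q_sa: float, n_sa: int, n_s: int, c: float = math.sqrt(2)) -> float:
        """Generic UCT score: mean action value plus an exploration bonus.

        q_sa -- estimated mean return for the state-action pair (cf. Qsa)
        n_sa -- visit count for the state-action pair (cf. Nsa)
        n_s  -- visit count for the parent state (cf. Ns)
        """
        if n_sa == 0:
            return float("inf")        # unvisited actions are tried first
        return q_sa + c * math.sqrt(math.log(n_s) / n_sa)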

search()[source]

Run MCTS by performing m simulations from the current state s. After the m simulations, the action that maximizes the estimate of Q(s, a) is chosen.

Returns:

best_action (int): The best action to take. action_values (list): The Q value estimates for each action.

Return type:

Tuple[int, list]

type_checker(observation, reward)[source]

Converts the observation and reward from dict and base.Reward type to the correct type if they are not already.

Parameters:
  • observation (Union[dict, np.ndarray]) – Observation to convert.

  • reward (Union[float, base.Reward]) – Reward to convert.

Returns:

The converted observation and the converted reward (float).

Return type:

Tuple[Union[int, np.ndarray], float]

update_metrics_chance_node(state, action, reward)[source]

Update the Q values and visit counts for state-action pairs and states.

Parameters:
  • state (Union[int,np.ndarray]) – The state.

  • action (Union[int,float,np.ndarray]) – action taken at the state.

  • reward (float) – The reward received after taking the action at the state.

update_metrics_decision_node(state)[source]

Update the visit counts for states.

class ns_gym.benchmark_algorithms.DQN(state_size, action_size, num_layers, num_hidden_units, seed)[source]

Bases: Module

Deep Q Network: a simple feedforward neural network.

Simple Deep Q Network (DQN) algorithm for benchmarking. Follows this tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Parameters:
  • state_size (int) – Dimension of each state

  • action_size (int) – Dimension of each action

  • num_layers (int) – Number of hidden layers

  • num_hidden_units (int) – Number of units in each hidden layer

  • seed (int) – Random seed

Warning

This implementation works, though the StableBaselines3 implementation is likely better optimized.
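
A minimal sketch of constructing the network and querying Q-values for a batch of states (the CartPole-like dimensions and the assumption that the output has one Q-value per action are illustrative):

    import torch
    from ns_gym.benchmark_algorithms import DQN

    q_net = DQN(state_size=4, action_size=2, num_layers=2, num_hidden_units=64, seed=0)

    states = torch.randn(32, 4)        # batch of 32 four-dimensional states
    q_values = q_net(states)           # assumed shape: (32, 2), one Q-value per action
    greedy_actions = q_values.argmax(dim=1)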

forward(state)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

type_checker(x)[source]
class ns_gym.benchmark_algorithms.DQNAgent(state_size, action_size, seed, model=None, model_path=None, buffer_size=100000, batch_size=64, gamma=0.99, lr=0.001, update_every=4, do_update=False)[source]

Bases: Agent

Simple Deep Q Network (DQN) algorithm for benchmarking

This implementation is based on the PyTorch tutorial found at https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

Parameters:
  • state_size (int) – dimension of each state

  • action_size (int) – dimension of each action

  • seed (int) – random seed

  • model (DQN, optional) – Predefined model architecture. Defaults to None.

  • model_path (str, optional) – Path to pretrained model weights. Defaults to None.

  • buffer_size (int, optional) – replay buffer size. Defaults to int(1e5).

  • batch_size (int, optional) – minibatch size. Defaults to 64.

  • gamma (float, optional) – discount factor. Defaults to 0.99.

  • lr (float, optional) – learning rate. Defaults to 0.001.

  • update_every (int, optional) – how often to update the network. Defaults to 4.

  • do_update (bool, optional) – Whether to perform gradient updates during environment interaction. Defaults to False.
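
A hedged sketch of an interaction loop with online updates enabled (the environment name, epsilon value, and episode handling are illustrative):

    import gymnasium as gym
    from ns_gym.benchmark_algorithms import DQNAgent

    env = gym.make("CartPole-v1")                       # illustrative environment
    agent = DQNAgent(state_size=4, action_size=2, seed=0, do_update=True)

    for episode in range(10):
        state, _ = env.reset(seed=episode)
        done = False
        while not done:
            action = agent.act(state, eps=0.1)          # epsilon-greedy action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            agent.step(state, action, reward, next_state, done)  # store and possibly learn
            state = next_state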

act(state, eps=0.0)[source]

Returns the action for the given state according to the current policy.

Parameters:
  • state (Union[int, np.ndarray, dict]) – current state

  • eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0

learn(experiences, gamma)[source]

Update value parameters using given batch of experience tuples.

Parameters:
  • experiences (Tuple[torch.Tensor]) – tuple of (s, a, r, s’, done) tuples

  • gamma (float) – discount factor

search(state, eps=0.0)[source]

Returns the action for the given state according to the current policy.

Parameters:
  • state (Union[int, np.ndarray, dict]) – current state

  • eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0

soft_update(local_model, target_model, tau)[source]

Soft update model parameters.

Parameters:
  • local_model (nn.Module) – Model from which weights are copied

  • target_model (nn.Module) – Model to which weights are copied

  • tau (float) – interpolation parameter
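
The soft update follows the usual Polyak-averaging rule theta_target = tau * theta_local + (1 - tau) * theta_target; a generic sketch (not the verbatim source) is:

    import torch

    @torch.no_grad()
    def polyak_update(local_model, target_model, tau: float) -> None:
        """Blend local weights into the target network in place."""
        for target_param, local_param in zip(target_model.parameters(),
                                             local_model.parameters()):
            target_param.data.copy_(tau * local_param.data
                                    + (1.0 - tau) * target_param.data)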

step(state, action, reward, next_state, done)[source]

Add experience to memory and potentially learn.

Parameters:
  • state (Union[int, np.ndarray, dict]) – current state

  • action (int) – action taken

  • reward (float) – reward received

  • next_state (Union[int, np.ndarray, dict]) – next state

  • done (bool) – whether the episode has ended

ns_gym.benchmark_algorithms.train_ddqn(env, model, n_episodes=1000, max_t=200, eps_start=1.0, eps_end=0.01, eps_decay=0.999, seed=0)[source]

DDQN Training Loop

Parameters:
  • env (gym.Env) – environment to interact with

  • model (DQN) – model architecture to use

  • n_episodes (int, optional) – maximum number of training episodes. Defaults to 1000.

  • max_t (int, optional) – maximum number of timesteps per episode. Defaults to 200.

  • eps_start (float, optional) – starting value of epsilon, for epsilon-greedy action selection. Defaults to 1.0.

  • eps_end (float, optional) – minimum value of epsilon. Defaults to 0.01.

  • eps_decay (float, optional) – multiplicative factor (per episode) for decreasing epsilon. Defaults to 0.999.

  • seed (int, optional) – random seed. Defaults to 0.
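
A usage sketch combining the network and the training loop (the environment and dimensions are illustrative; what train_ddqn returns or logs depends on the implementation):

    import gymnasium as gym
    from ns_gym.benchmark_algorithms import DQN, train_ddqn

    env = gym.make("CartPole-v1")                        # illustrative environment
    model = DQN(state_size=4, action_size=2, num_layers=2, num_hidden_units=64, seed=0)

    train_ddqn(env, model, n_episodes=500, max_t=200,
               eps_start=1.0, eps_end=0.01, eps_decay=0.995, seed=0)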

class ns_gym.benchmark_algorithms.PAMCTS(alpha, mcts_iter, mcts_search_depth, mcts_discount_factor, mcts_exploration_constant, state_space_size, action_space_size, DDQN_model=None, DDQN_model_path=None, seed=0)[source]

Bases: Agent

Policy-Augmented MCTS algorithm. Uses a convex combination of DDQN policy values and MCTS values to select actions.

Parameters:
  • alpha (float) – PAMCTS convex combination parameter

  • env (gym.Env) – Gymnasium style environment object

  • mcts_iter (int) – Total number of MCTS iterations

  • mcts_search_depth (int) – MCTS search depth

  • mcts_discount_factor (float) – MCTS discount factor

  • mcts_exploration_constant (float) – UCT exploration constant c

  • state_space_size (int) – Size of environment state space. For Q-value networks.

  • action_space_size (int) – Size of environment action space. For Q-value networks.

  • DDQN_model (torch.nn.Module, optional) – DDQN torch neural network object. Defaults to None.

  • DDQN_model_path (str, optional) – Path to DDQN model weights. Defaults to None.
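
The action score PA-MCTS maximizes is this convex combination of the two estimates; the scoring rule and a constructor call are sketched below (the weights path is a placeholder and the scoring function is schematic, not the verbatim source):

    from ns_gym.benchmark_algorithms import PAMCTS

    def pamcts_score(alpha: float, q_ddqn: float, q_mcts: float) -> float:
        """Convex combination of the (optionally normalized) DDQN and MCTS estimates."""
        return alpha * q_ddqn + (1.0 - alpha) * q_mcts

    agent = PAMCTS(alpha=0.5, mcts_iter=100, mcts_search_depth=20,
                   mcts_discount_factor=0.99, mcts_exploration_constant=1.414,
                   state_space_size=4, action_space_size=2,
                   DDQN_model_path="path/to/ddqn_weights.pt", seed=0)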

act(state, env, normalize=True)[source]

Agent decision-making function. Subclasses must implement this method.

Parameters:

obs – Observation from the environment

Returns:

Action to be taken by the agent

Return type:

Any

search(state, env, normalize=True)[source]
class ns_gym.benchmark_algorithms.AlphaZeroAgent(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, gamma, c, num_mcts_simulations, max_mcts_search_depth, model_checkpoint_path=None, model=<class 'ns_gym.benchmark_algorithms.AlphaZero.alphazero.AlphaZeroNetwork'>, alpha=1.0, epsilon=0.0)[source]

Bases: object

act(obs, env, temp=1)[source]

Use the trained model to select an action

Parameters:
  • obs (Union[np.array,int,dict]) – observation from the environment

  • env (gym.Env) – The current environment.

Returns:

best_action (int)

train(env, n_episodes, max_episode_len, lr, batch_size, n_epochs, experiment_name, eval_window_size=100, weight_decay=0.0001, temp_start=2, temp_end=0.8, temp_decay=0.95)[source]

Train the AlphaZero agent

Parameters:
  • env (gym.Env) – The environment to train on

  • n_episodes (int) – Number of training episodes

  • max_episode_len (int) – Maximum number of steps per episode

  • lr (float) – Learning rate for the neural network

  • batch_size (int) – Batch size for training

  • n_epochs (int) – Number of epochs per training iteration

  • experiment_name (str) – Name for saving models and logs

  • eval_window_size (int, optional) – Size of the evaluation window. Defaults to 100.

  • weight_decay (float, optional) – Weight decay for optimizer. Defaults to 1e-4.

  • temp_start (float, optional) – Starting temperature for exploration. Defaults to 2.

  • temp_end (float, optional) – Ending temperature for exploration. Defaults to 0.8.

  • temp_decay (float, optional) – Decay rate for temperature. Defaults to 0.95.

Returns:

List of episode returns

Return type:

List[float]
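
A hedged sketch of training the agent and then acting with it (the environment choice and all hyperparameters are illustrative):

    import gymnasium as gym
    from ns_gym.benchmark_algorithms import AlphaZeroAgent

    env = gym.make("CartPole-v1")                        # illustrative environment
    agent = AlphaZeroAgent(action_space_dim=2, observation_space_dim=4,
                           n_hidden_layers=2, n_hidden_units=64,
                           gamma=0.99, c=1.414,
                           num_mcts_simulations=50, max_mcts_search_depth=20)

    returns = agent.train(env, n_episodes=200, max_episode_len=200, lr=1e-3,
                          batch_size=64, n_epochs=4, experiment_name="cartpole_az")

    obs, _ = env.reset(seed=0)
    action = agent.act(obs, env, temp=1)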

class ns_gym.benchmark_algorithms.AlphaZeroNetwork(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, activation='relu')[source]

Bases: Module

Overview:

This is a simple MLP that predicts the policy and value of a particular state.

Parameters:
  • action_space_dim (int) – Size of the action space

  • observation_space_dim (int) – Size of the observation space

  • lr (float) – Learning rate

  • n_hidden_layers (int) – Number of hidden layers

  • n_hidden_units (int) – Number of units in each hidden layer

  • activation (str, optional) – Activation function. Defaults to ‘relu’.

forward(obs)[source]
Overview:

A single forward pass of the observations from the environment

Returns:

Tuple containing the policy and value of the state

Return type:

Tuple[torch.Tensor,torch.Tensor]
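
A small forward-pass sketch (the dimensions and the batch dimension on the observation are assumptions):

    import torch
    from ns_gym.benchmark_algorithms import AlphaZeroNetwork

    net = AlphaZeroNetwork(action_space_dim=2, observation_space_dim=4,
                           n_hidden_layers=2, n_hidden_units=64)

    obs = torch.randn(1, 4)            # single observation with an assumed batch dimension
    policy, value = net(obs)           # policy over the 2 actions, scalar state value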

input_check(obs)[source]
class ns_gym.benchmark_algorithms.PPO(actor, critic, lr_policy=0.0003, lr_critic=0.0004, max_grad_norm=0.5, ent_weight=0.0, clip_val=0.2, sample_n_epoch=10, sample_mb_size=32, device='cpu')[source]

Bases: Agent

PPO class

Warning

You can use this implementation if you want, but the StableBaselines3 implementation is recommended.

Parameters:
  • actor – Actor network.

  • critic – Critic network.

  • lr_policy – Learning rate for the policy network.

  • lr_critic – Learning rate for the critic network.

  • max_grad_norm – Maximum gradient norm for clipping.

  • ent_weight – Entropy weight for exploration.

  • clip_val – Clipping value for PPO.

  • sample_n_epoch – Number of epochs to sample minibatches.

  • sample_mb_size – Size of each minibatch.

  • device – Device to run the computations on.
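
A construction sketch wiring the actor and critic into the agent (the state/action dimensions are illustrative; since PPOActor outputs the mean and standard deviation of a Gaussian, a continuous action space is assumed):

    from ns_gym.benchmark_algorithms import PPO, PPOActor, PPOCritic

    s_dim, a_dim = 8, 2                                  # illustrative dimensions
    actor = PPOActor(s_dim=s_dim, a_dim=a_dim, hidden_size=64)
    critic = PPOCritic(s_dim=s_dim, hidden_size=64)

    agent = PPO(actor, critic, lr_policy=3e-4, lr_critic=4e-4,
                clip_val=0.2, ent_weight=0.0,
                sample_n_epoch=10, sample_mb_size=32, device="cpu")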

act(obs, *args, **kwargs)[source]

Agent decision-making function. Subclasses must implement this method.

Parameters:

obs – Observation from the environment

Returns:

Action to be taken by the agent

Return type:

Any

train(states, actions, prev_val, advantages, returns, prev_lobprobs)[source]

Train the PPO model using provided experience.

Parameters:
  • states – State samples.

  • actions – Action samples.

  • prev_val – Previous state value estimates.

  • advantages – Advantage estimates.

  • returns – Discounted return estimates.

  • prev_lobprobs – Previous log probabilities of actions.

Returns:

pg_loss: The policy loss. v_loss: The value loss. entropy: The average entropy.

Return type:

Tuple

train_ppo(env, config)[source]

Main training loop for the PPO algorithm.

Saves the best model based on the running average reward over 100 episodes.

Parameters:
  • env – Gym environment.

  • config – Configuration dictionary.

Returns:

best_reward: The best running average reward over 100 episodes.

Return type:

float

class ns_gym.benchmark_algorithms.PPOActor(s_dim, a_dim, hidden_size=64)[source]

Bases: Module

Actor network for policy approximation.

Outputs mean and standard deviation of the action distribution. A simple MLP.

Parameters:
  • s_dim – State dimension.

  • a_dim – Action dimension.

  • hidden_size – Number of hidden units in each layer.

evaluate(state, action)[source]
forward(state, deterministic=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class ns_gym.benchmark_algorithms.PPOCritic(s_dim, hidden_size=64)[source]

Bases: Module

Critic network to estimate the state value function. A simple MLP.

Parameters:
  • s_dim – State dimension.

  • hidden_size – Number of hidden units in each layer.

forward(state)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class ns_gym.benchmark_algorithms.DDPG(state_dim=8, action_dim=2, hidden_size=256, lr_actor=0.001, lr_critic=0.001)[source]

Bases: Agent

Deep Deterministic Policy Gradient (DDPG) algorithm.

Parameters:
  • state_dim (int) – Dimension of the state space.

  • action_dim (int) – Dimension of the action space.

  • hidden_size (int) – Number of hidden units in each layer of the networks.

  • lr_actor (float) – Learning rate for the actor network.

  • lr_critic (float) – Learning rate for the critic network.

Warning

This implementation works, though the StableBaselines3 implementation is likely better optimized.
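
A hedged usage sketch (LunarLanderContinuous-v2 matches the default 8-dimensional state and 2-dimensional action sizes, but the environment choice and hyperparameters are assumptions):

    import gymnasium as gym
    from ns_gym.benchmark_algorithms import DDPG

    env = gym.make("LunarLanderContinuous-v2")           # matches the default dimensions
    agent = DDPG(state_dim=8, action_dim=2, hidden_size=256,
                 lr_actor=1e-3, lr_critic=1e-3)

    agent.train(env, num_episodes=1000, batch_size=64, gamma=0.99,
                tau=0.005, warmup_episodes=100, save_path="models/")

    obs, _ = env.reset(seed=0)
    action = agent.act(obs, clip=True)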

act(observation, clip=False)[source]

Agent decision-making function. Subclasses must implement this method.

Parameters:

obs – Observation from the environment

Returns:

Action to be taken by the agent

Return type:

Any

run_eval_episode(env, T=100, visualize=False)[source]
train(env, num_episodes=10000, batch_size=64, gamma=0.99, tau=0.005, warmup_episodes=300, save_path='models/')[source]
update(states, actions, rewards, next_states, dones, gamma=0.99, tau=0.001)[source]

Update the actor and critic networks for one training step in DDPG.

Parameters:
  • states – Batch of current states.

  • actions – Batch of actions taken.

  • rewards – Batch of rewards received.

  • next_states – Batch of next states.

  • dones – Batch of done flags (indicating episode termination).

  • gamma – Discount factor.

  • tau – Target network soft update parameter.

warmup(env, warmup_episodes=300)[source]