NS-Gym Benchmark Algorithms¶
- class ns_gym.benchmark_algorithms.MCTS(env, state, d, m, c, gamma)[source]¶
Bases:
Agent
Vanilla MCTS with Chance Nodes. Compatible with OpenAI Gym environments.
Selection and expansion are combined in the tree-policy method; the rollout/simulation is the default policy.
- Parameters:
env (gym.Env) – The environment to run the MCTS on.
state (Union[int, np.ndarray]) – The state to start the MCTS from.
d (int) – The depth of the MCTS.
m (int) – The number of simulations to run.
c (float) – The exploration constant.
gamma (float) – The discount factor.
- v0¶
The root node of the tree.
- Type:
DecisionNode
- possible_actions¶
List of possible actions in the environment.
- Type:
list
- Qsa¶
Dictionary to store Q values for state-action pairs.
- Type:
dict
- Nsa¶
Dictionary to store visit counts for state-action pairs.
- Type:
dict
- Ns¶
Dictionary to store visit counts for states.
- Type:
dict
- act(observation, env)[source]¶
Decide on an action using the MCTS search, reinitializing the tree structure.
- Parameters:
observation (Union[int, np.ndarray]) – The current state or observation of the environment.
- Returns:
The selected action.
- Return type:
int
- best_action(v)[source]¶
Select the best action based on the Q values of the state-action pairs.
- Returns:
The best action to take.
- Return type:
int
- best_child(v)[source]¶
Find the best child nodes based on the UCT value.
This method is only called for decision nodes.
- Parameters:
exploration_constant (float, optional) – The UCT exploration constant. Defaults to math.sqrt(2).
- Returns:
The best child node according to the UCT value, together with the action that leads to it.
- Return type:
Node
- search()[source]¶
Run the MCTS procedure by performing m simulations from the current state s. After the m simulations, choose the action that maximizes the estimated Q(s, a).
- Returns:
best_action (int) – The best action to take.
action_values (list) – The Q value estimates for each action.
- type_checker(observation, reward)[source]¶
Converts the observation and reward from dict and base.Reward type to the correct type if they are not already.
- Parameters:
observation (Union[dict, np.ndarray]) – Observation to convert.
reward (Union[float, base.Reward]) – Reward to convert.
- Returns:
observation (Union[int, np.ndarray]) – The converted observation.
reward (float) – The converted reward.
- update_metrics_chance_node(state, action, reward)[source]¶
Update the Q values and visit counts for state-action pairs and states.
- Parameters:
state (Union[int,np.ndarray]) – The state.
action (Union[int, float, np.ndarray]) – The action taken at the state.
reward (float) – The reward received after taking the action at the state.
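As a usage illustration, here is a minimal planning sketch with the MCTS agent. It assumes a plain Gymnasium environment (CartPole-v1) can be passed directly; in practice an ns_gym environment wrapper may be required, and the hyperparameter values are illustrative only.

import gymnasium as gym
from ns_gym.benchmark_algorithms import MCTS

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)

# d: search depth, m: simulations per decision, c: UCT exploration constant
agent = MCTS(env, obs, d=50, m=200, c=1.41, gamma=0.99)

done = False
while not done:
    action = agent.act(obs, env)  # re-plans from the current observation each step
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated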
- class ns_gym.benchmark_algorithms.DQN(state_size, action_size, num_layers, num_hidden_units, seed)[source]¶
Bases:
Module
Deep Q network, a simple feedforward neural network.
Simple Deep Q Network (DQN) algorithm for benchmarking. Follows this tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- Parameters:
state_size (int) – Dimension of each state
action_size (int) – Dimension of each action
num_layers (int) – Number of hidden layers
num_hidden_units (int) – Number of units in each hidden layer
seed (int) – Random seed
Warning
This implementation works, though the StableBaselines3 implementation is likely better optimized.
- forward(state)[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
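A minimal sketch of instantiating the network and running a forward pass. It assumes CartPole-like dimensions (4-dimensional state, 2 actions) and that the network returns one Q value per action; both are assumptions, not part of the documented API.

import torch
from ns_gym.benchmark_algorithms import DQN

net = DQN(state_size=4, action_size=2, num_layers=2, num_hidden_units=64, seed=0)

state = torch.randn(1, 4)  # a batch containing a single 4-dimensional state
q_values = net(state)      # call the Module instance rather than .forward() so hooks run
print(q_values.shape)      # expected: torch.Size([1, 2]), one Q value per action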
- class ns_gym.benchmark_algorithms.DQNAgent(state_size, action_size, seed, model=None, model_path=None, buffer_size=100000, batch_size=64, gamma=0.99, lr=0.001, update_every=4, do_update=False)[source]¶
Bases:
Agent
Simple Deep Q Network (DQN) algorithm for benchmarking.
This implementation is based on the PyTorch tutorial found at https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- Parameters:
state_size (int) – dimension of each state
action_size (int) – dimension of each action
seed (int) – random seed
model (DQN, optional) – Predefined model architecture. Defaults to None.
model_path (str, optional) – Path to pretrained model weights. Defaults to None.
buffer_size (int, optional) – replay buffer size. Defaults to int(1e5).
batch_size (int, optional) – minibatch size. Defaults to 64.
gamma (float, optional) – discount factor. Defaults to 0.99.
lr (float, optional) – learning rate. Defaults to 0.001.
update_every (int, optional) – how often to update the network. Defaults to 4.
do_update (bool, optional) – Whether to perform gradient updates during environment interaction. Defaults to False.
- act(state, eps=0.0)[source]¶
Returns actions for given state as per current policy.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0
- learn(experiences, gamma)[source]¶
Update value parameters using given batch of experience tuples.
- Parameters:
experiences (Tuple[torch.Tensor]) – tuple of (s, a, r, s’, done) tuples
gamma (float) – discount factor
- search(state, eps=0.0)[source]¶
Returns actions for given state as per current policy.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0
- soft_update(local_model, target_model, tau)[source]¶
Soft update model parameters.
- Parameters:
local_model (nn.Module) – weights will be copied from
target_model (nn.Module) – weights will be copied to
tau (float) – interpolation parameter
- step(state, action, reward, next_state, done)[source]¶
Add experience to memory and potentially learn.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
action (int) – action taken
reward (float) – reward received
next_state (Union[int, np.ndarray, dict]) – next state
done (bool) – whether the episode has ended
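The sketch below shows the act/step interaction loop for DQNAgent. It assumes raw Gymnasium CartPole observations are accepted directly (ns_gym observation wrappers may be needed in practice) and that do_update=True enables learning during interaction, as the parameter description suggests.

import gymnasium as gym
from ns_gym.benchmark_algorithms import DQNAgent

env = gym.make("CartPole-v1")
agent = DQNAgent(state_size=4, action_size=2, seed=0, do_update=True)

obs, _ = env.reset(seed=0)
for t in range(200):
    action = agent.act(obs, eps=0.1)                 # epsilon-greedy action selection
    next_obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    agent.step(obs, action, reward, next_obs, done)  # store transition; learn every update_every steps
    obs = next_obs
    if done:
        break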
- ns_gym.benchmark_algorithms.train_ddqn(env, model, n_episodes=1000, max_t=200, eps_start=1.0, eps_end=0.01, eps_decay=0.999, seed=0)[source]¶
DDQN Training Loop
- Parameters:
env (gym.Env) – environment to interact with
model (DQN) – model architecture to use
n_episodes (int, optional) – maximum number of training episodes. Defaults to 1000.
max_t (int, optional) – maximum number of timesteps per episode. Defaults to 200.
eps_start (float, optional) – starting value of epsilon, for epsilon-greedy action selection. Defaults to 1.0.
eps_end (float, optional) – minimum value of epsilon. Defaults to 0.01.
eps_decay (float, optional) – multiplicative factor (per episode) for decreasing epsilon. Defaults to 0.999.
seed (int, optional) – random seed. Defaults to 0.
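A minimal call sketch for the training loop, assuming a plain Gymnasium CartPole environment and a DQN model sized to match it; the return value of train_ddqn is not documented here, so the sketch does not rely on it.

import gymnasium as gym
from ns_gym.benchmark_algorithms import DQN, train_ddqn

env = gym.make("CartPole-v1")
model = DQN(state_size=4, action_size=2, num_layers=2, num_hidden_units=64, seed=0)

# Run the DDQN training loop with a slightly faster epsilon decay than the default
train_ddqn(env, model, n_episodes=500, max_t=200, eps_decay=0.995, seed=0)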
- class ns_gym.benchmark_algorithms.PAMCTS(alpha, mcts_iter, mcts_search_depth, mcts_discount_factor, mcts_exploration_constant, state_space_size, action_space_size, DDQN_model=None, DDQN_model_path=None, seed=0)[source]¶
Bases:
Agent
Policy-Augmented MCTS (PAMCTS) algorithm. Uses a convex combination of DDQN policy values and MCTS values to select actions.
- Parameters:
alpha (float) – PAMCTS convex combination parameter
env (gym.Env) – Gymnasium style environment object
mcts_iter (int) – Total number of MCTS iterations
mcts_search_depth (int) – MCTS search depth
mcts_discount_factor (float) – MCTS discount factor
mcts_exploration_constant (float) – UCT exploration constant c
state_space_size (int) – Size of environment state space. For Q-value networks.
action_space_size (int) – Size of environment action space. For Q-value networks.
DDQN_model (torch.nn.Module, optional) – DDQN torch neural network object. Defaults to None.
DDQN_model_path (str, optional) – Path to DDQN model weights. Defaults to None.
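A construction sketch, assuming a CartPole-sized state/action space; the alpha value is illustrative and the DDQN weights path is hypothetical. As an Agent subclass, the object is presumably used through the same act-style interface shown for MCTS above.

from ns_gym.benchmark_algorithms import PAMCTS

agent = PAMCTS(
    alpha=0.5,                        # convex combination of DDQN and MCTS values
    mcts_iter=200,
    mcts_search_depth=50,
    mcts_discount_factor=0.99,
    mcts_exploration_constant=1.41,
    state_space_size=4,
    action_space_size=2,
    DDQN_model_path="weights/cartpole_ddqn.pt",  # hypothetical path to pretrained weights
    seed=0,
)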
- class ns_gym.benchmark_algorithms.AlphaZeroAgent(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, gamma, c, num_mcts_simulations, max_mcts_search_depth, model_checkpoint_path=None, model=<class 'ns_gym.benchmark_algorithms.AlphaZero.alphazero.AlphaZeroNetwork'>, alpha=1.0, epsilon=0.0)[source]¶
Bases:
object
- act(obs, env, temp=1)[source]¶
Use the trained model to select an action.
- Parameters:
obs (Union[np.array,int,dict]) – observation from the environment
env (gym.Env) – The current environment.
- Returns:
best_action (int)
- train(env, n_episodes, max_episode_len, lr, batch_size, n_epochs, experiment_name, eval_window_size=100, weight_decay=0.0001, temp_start=2, temp_end=0.8, temp_decay=0.95)[source]¶
Train the AlphaZero agent
- Parameters:
env (gym.Env) – The environment to train on
n_episodes (int) – Number of training episodes
max_episode_len (int) – Maximum number of steps per episode
lr (float) – Learning rate for the neural network
batch_size (int) – Batch size for training
n_epochs (int) – Number of epochs per training iteration
experiment_name (str) – Name for saving models and logs
eval_window_size (int, optional) – Size of the evaluation window. Defaults to 100.
weight_decay (float, optional) – Weight decay for optimizer. Defaults to 1e-4.
temp_start (float, optional) – Starting temperature for exploration. Defaults to 2.
temp_end (float, optional) – Ending temperature for exploration. Defaults to 0.8.
temp_decay (float, optional) – Decay rate for temperature. Defaults to 0.95.
- Returns:
List of episode returns
- Return type:
List[float]
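A training-and-acting sketch, assuming a CartPole-style environment, illustrative hyperparameters, and a hypothetical experiment name; the default AlphaZeroNetwork architecture is used since no model or checkpoint is supplied.

import gymnasium as gym
from ns_gym.benchmark_algorithms import AlphaZeroAgent

env = gym.make("CartPole-v1")
agent = AlphaZeroAgent(
    action_space_dim=2,
    observation_space_dim=4,
    n_hidden_layers=2,
    n_hidden_units=64,
    gamma=0.99,
    c=1.41,
    num_mcts_simulations=100,
    max_mcts_search_depth=50,
)

# Train, then act with a lower temperature for more greedy play
returns = agent.train(env, n_episodes=200, max_episode_len=200, lr=1e-3,
                      batch_size=64, n_epochs=4, experiment_name="cartpole_az")  # hypothetical name
obs, _ = env.reset(seed=0)
action = agent.act(obs, env, temp=0.5)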
- class ns_gym.benchmark_algorithms.AlphaZeroNetwork(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, activation='relu')[source]¶
Bases:
Module
- Overview:
A simple MLP that predicts the policy and value of a particular state.
- Parameters:
action_space_dim (int) – Size of the action space
observation_space_dim (int) – Size of the observation space
lr (float) – Learning rate
n_hidden_layers (int) – Number of hidden layers
n_hidden_units (int) – Number of units in each hidden layer
activation (str, optional) – Activation function. Defaults to ‘relu’.
- class ns_gym.benchmark_algorithms.PPO(actor, critic, lr_policy=0.0003, lr_critic=0.0004, max_grad_norm=0.5, ent_weight=0.0, clip_val=0.2, sample_n_epoch=10, sample_mb_size=32, device='cpu')[source]¶
Bases:
Agent
Proximal Policy Optimization (PPO) agent.
Warning
You can use this if you want, but the StableBaselines3 implementation is recommended.
- Parameters:
actor – Actor network.
critic – Critic network.
lr_policy – Learning rate for the policy network.
lr_critic – Learning rate for the critic network.
max_grad_norm – Maximum gradient norm for clipping.
ent_weight – Entropy weight for exploration.
clip_val – Clipping value for PPO.
sample_n_epoch – Number of epochs to sample minibatches.
sample_mb_size – Size of each minibatch.
device – Device to run the computations on.
- act(obs, *args, **kwargs)[source]¶
Agent decision making function. Subclasses must implement this method.
- Parameters:
obs – Observation from the environment
- Returns:
Action to be taken by the agent
- Return type:
Any
- train(states, actions, prev_val, advantages, returns, prev_lobprobs)[source]¶
Train the PPO model using provided experience.
- Parameters:
states – State samples.
actions – Action samples.
prev_val – Previous state value estimates.
advantages – Advantage estimates.
returns – Discounted return estimates.
prev_lobprobs – Previous log probabilities of actions.
- Returns:
pg_loss – Policy loss.
v_loss – Value loss.
entropy – Average entropy.
- class ns_gym.benchmark_algorithms.PPOActor(s_dim, a_dim, hidden_size=64)[source]¶
Bases:
Module
Actor network for policy approximation. A simple MLP that outputs the mean and standard deviation of the action distribution.
- Parameters:
s_dim – State dimension.
a_dim – Action dimension.
hidden_size – Number of hidden units in each layer.
- forward(state, deterministic=False)[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- class ns_gym.benchmark_algorithms.PPOCritic(s_dim, hidden_size=64)[source]¶
Bases:
Module
Critic network that estimates the state value function. A simple MLP.
- Parameters:
s_dim – State dimension.
hidden_size – Number of hidden units in each layer.
- forward(state)[source]¶
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
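Putting the three PPO classes above together, here is a construction sketch assuming a continuous-action Gymnasium environment (Pendulum-v1) to match the Gaussian actor. Whether act accepts a raw NumPy observation or requires a tensor is an assumption, and the rollout collection needed to call train is not shown.

import gymnasium as gym
import torch
from ns_gym.benchmark_algorithms import PPO, PPOActor, PPOCritic

env = gym.make("Pendulum-v1")           # continuous actions suit the Gaussian actor
s_dim = env.observation_space.shape[0]  # 3
a_dim = env.action_space.shape[0]       # 1

actor = PPOActor(s_dim, a_dim, hidden_size=64)
critic = PPOCritic(s_dim, hidden_size=64)
agent = PPO(actor, critic, lr_policy=3e-4, lr_critic=4e-4, device="cpu")

obs, _ = env.reset(seed=0)
action = agent.act(torch.as_tensor(obs, dtype=torch.float32))  # tensor input is an assumption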
- class ns_gym.benchmark_algorithms.DDPG(state_dim=8, action_dim=2, hidden_size=256, lr_actor=0.001, lr_critic=0.001)[source]¶
Bases:
Agent
Deep Deterministic Policy Gradient (DDPG) algorithm.
- Parameters:
state_dim (int) – Dimension of the state space.
action_dim (int) – Dimension of the action space.
hidden_size (int) – Number of hidden units in each layer of the networks.
lr_actor (float) – Learning rate for the actor network.
lr_critic (float) – Learning rate for the critic network.
Warning
This implementation works, though the StableBaselines3 implementation is likely better optimized.
- act(observation, clip=False)[source]¶
Agent decision making function. Subclasses must implement this method.
- Parameters:
obs – Observation from the environment
- Returns:
Action to be taken by the agent
- Return type:
Any
- train(env, num_episodes=10000, batch_size=64, gamma=0.99, tau=0.005, warmup_episodes=300, save_path='models/')[source]¶
Train the DDPG agent in the given environment.
- update(states, actions, rewards, next_states, dones, gamma=0.99, tau=0.001)[source]¶
Update the actor and critic networks for one training step in DDPG.
- Parameters:
states – Batch of current states.
actions – Batch of actions taken.
rewards – Batch of rewards received.
next_states – Batch of next states.
dones – Batch of done flags (indicating episode termination).
gamma – Discount factor.
tau – Target network soft update parameter.
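A usage sketch for DDPG on a small continuous-control task. Pendulum-v1 dimensions (3-dimensional state, 1-dimensional action) are used here, so the class defaults (which appear sized for an 8/2 environment) are overridden; the save_path directory is assumed to exist or be creatable by the class.

import gymnasium as gym
from ns_gym.benchmark_algorithms import DDPG

env = gym.make("Pendulum-v1")
agent = DDPG(state_dim=3, action_dim=1, hidden_size=256, lr_actor=1e-3, lr_critic=1e-3)

# Built-in training loop provided by the class
agent.train(env, num_episodes=500, batch_size=64, gamma=0.99,
            tau=0.005, warmup_episodes=50, save_path="models/")

obs, _ = env.reset(seed=0)
action = agent.act(obs, clip=True)  # clip flag per the documented signature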