NS-Gym Benchmark Algorithms
- class ns_gym.benchmark_algorithms.MCTS(env, state, d, m, c, gamma)
Bases:
Agent
Vanilla MCTS with chance nodes. Compatible with OpenAI Gym environments.
Selection and expansion are combined into the “tree policy” method. The rollout/simulation is the “default” policy.
- Parameters:
env (Env) – The environment to run the MCTS on.
state (Union[int, np.ndarray]) – The state to start the MCTS from.
d (int) – The depth of the MCTS.
m (int) – The number of simulations to run.
c (float) – The exploration constant.
gamma (float) – The discount factor.
- v0
The root node of the tree.
- Type:
DecisionNode
- possible_actions
List of possible actions in the environment.
- Type:
list
- Qsa
Dictionary to store Q values for state-action pairs.
- Type:
dict
- Nsa
Dictionary to store visit counts for state-action pairs.
- Type:
dict
- Ns
Dictionary to store visit counts for states.
- Type:
dict
- search()
Run m simulations from the current state s, then choose the action that maximizes the estimate of Q(s, a).
- Returns:
best_action (int): The best action to take.
action_values (list): The Q-value estimate for each action.
- update_metrics_chance_node(state, action, reward)
Update the Q values and visit counts for state-action pairs and states.
- Parameters:
state (Union[int,np.ndarray]) – The state.
action (Union[int,float,np.ndarray]) – action taken at the state.
reward (float) – The reward received after taking the action at the state.
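The bookkeeping described above can be sketched as a running-mean update. This is a minimal illustration, not the library's code: the module-level dictionaries stand in for the Qsa, Nsa and Ns attributes, the function name `update_metrics` is hypothetical, and the use of an incremental mean (rather than some other estimator) is an assumption.

```python
from collections import defaultdict

# Hypothetical stand-ins for the Qsa, Nsa and Ns dictionaries documented above.
Qsa = defaultdict(float)  # (state, action) -> running mean of sampled returns
Nsa = defaultdict(int)    # (state, action) -> visit count
Ns = defaultdict(int)     # state -> visit count


def update_metrics(state, action, reward):
    """Fold one sampled return into the statistics for (state, action)."""
    key = (state, action)
    Ns[state] += 1
    Nsa[key] += 1
    # Incremental mean (assumed): Q <- Q + (reward - Q) / N
    Qsa[key] += (reward - Qsa[key]) / Nsa[key]


for r in (1.0, 0.0, 2.0):
    update_metrics(state=3, action=1, reward=r)
```

After the three calls the stored value equals the plain average of the three returns, which is the property the incremental form is meant to preserve.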
- type_checker(observation, reward)
Convert the observation and reward from dict and base.Reward to plain types, if they are not plain types already.
- Parameters:
observation (Union[dict, np.ndarray]) – Observation to convert.
reward (Union[float, base.Reward]) – Reward to convert.
- Returns:
(int, np.ndarray): Converted observation.
(float): Converted reward.
- best_child(v)
Find the best child node based on the UCT value.
This method is only called for decision nodes.
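- Parameters:
v – The decision node whose children are scored.
exploration_constant (optional) – Exploration constant. Documented with a default of math.sqrt(2), although it does not appear in the signature above.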
- Returns:
Node: The best child node based on the UCT value.
action: The action that leads to the best child node.
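For readers unfamiliar with UCT, the selection rule can be sketched as follows. This uses the textbook UCB1 form, Q + c * sqrt(ln N_parent / N_child); whether ns_gym uses exactly this expression is an assumption, and the helper names `uct_score` and `best_action` are hypothetical.

```python
import math


def uct_score(q, n_parent, n_child, c=math.sqrt(2)):
    """Textbook UCB1 score: exploitation term plus exploration bonus."""
    if n_child == 0:
        return math.inf  # unvisited actions are tried first
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def best_action(q_values, visit_counts, c=math.sqrt(2)):
    """Return the index of the action with the highest UCT score."""
    n_parent = sum(visit_counts)
    scores = [uct_score(q, n_parent, n, c) for q, n in zip(q_values, visit_counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Setting c to 0 reduces the rule to a greedy choice over the Q estimates, while larger values of c favor rarely visited actions.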
- class ns_gym.benchmark_algorithms.DQN(state_size, action_size, num_layers, num_hidden_units, seed)
Bases:
Module
Deep Q-network: a simple feedforward neural network.
Simple Deep Q Network (DQN) algorithm for benchmarking. Follows this tutorial: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- Parameters:
state_size (int) – Dimension of each state
action_size (int) – Dimension of each action
num_layers (int) – Number of hidden layers
num_hidden_units (int) – Number of units in each hidden layer
seed (int) – Random seed
Warning
This implementation works, though the StableBaselines3 implementation is likely better optimized.
- forward(state)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class ns_gym.benchmark_algorithms.DQNAgent(state_size, action_size, seed, model=None, model_path=None, buffer_size=100000, batch_size=64, gamma=0.99, lr=0.001, update_every=4, do_update=False)
Bases:
Agent
Simple Deep Q Network (DQN) algorithm for benchmarking.
This implementation is based on the PyTorch tutorial found at https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- Parameters:
state_size (int) – dimension of each state
action_size (int) – dimension of each action
seed (int) – random seed
model (DQN, optional) – Predefined model architecture. Defaults to None.
model_path (str, optional) – Path to pretrained model weights. Defaults to None.
buffer_size (int, optional) – replay buffer size. Defaults to int(1e5).
batch_size (int, optional) – minibatch size. Defaults to 64.
gamma (float, optional) – discount factor. Defaults to 0.99.
lr (float, optional) – learning rate. Defaults to 0.001.
update_every (int, optional) – how often to update the network. Defaults to 4.
do_update (bool, optional) – Whether to perform gradient updates during environment interaction. Defaults to False.
- step(state, action, reward, next_state, done)
Add experience to memory and potentially learn.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
action (int) – action taken
reward (float) – reward received
next_state (Union[int, np.ndarray, dict]) – next state
done (bool) – whether the episode has ended
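The "add experience to memory and potentially learn" pattern can be sketched with a bounded buffer and a step counter. This is an illustration under assumptions: the class name `ReplaySketch` is hypothetical, the rule "learn every update_every steps once batch_size transitions are stored" is the common DQN convention rather than something the text above confirms, and the do_update flag is left out.

```python
import random
from collections import deque


class ReplaySketch:
    """Hypothetical replay memory that signals when a learning step is due."""

    def __init__(self, buffer_size=100_000, batch_size=64, update_every=4, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest transitions drop out first
        self.batch_size = batch_size
        self.update_every = update_every
        self.t_step = 0
        self.rng = random.Random(seed)

    def step(self, state, action, reward, next_state, done):
        """Store a transition; return a sampled minibatch when learning is due."""
        self.memory.append((state, action, reward, next_state, done))
        self.t_step = (self.t_step + 1) % self.update_every
        if self.t_step == 0 and len(self.memory) >= self.batch_size:
            return self.rng.sample(list(self.memory), self.batch_size)
        return None


buf = ReplaySketch(batch_size=4, update_every=2)
results = [buf.step(i, 0, 0.0, i + 1, False) for i in range(6)]
```

In a full agent the returned minibatch would be passed to the gradient update instead of being handed back to the caller.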
- search(state, eps=0.0)
Return the action for the given state under the current policy.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0
- act(state, eps=0.0)
Return the action for the given state under the current policy.
- Parameters:
state (Union[int, np.ndarray, dict]) – current state
eps (float, optional) – epsilon, for epsilon-greedy action selection. Defaults to 0.0
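Epsilon-greedy selection, as controlled by the eps parameter, can be sketched as below. The function name `epsilon_greedy` and the plain list of Q-values are illustrative assumptions; the agent itself obtains its Q-values from the network.

```python
import random


def epsilon_greedy(q_values, eps=0.0, rng=random):
    """With probability eps pick a uniformly random action, else the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With the default eps=0.0 the random branch is never taken, so the call is purely greedy, which matches the default used for evaluation above.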
- ns_gym.benchmark_algorithms.train_ddqn(env, model, n_episodes=1000, max_t=200, eps_start=1.0, eps_end=0.01, eps_decay=0.999, seed=0)
DDQN Training Loop
- Parameters:
env (gym.Env) – environment to interact with
model (DQN) – model architecture to use
n_episodes (int, optional) – maximum number of training episodes. Defaults to 1000.
max_t (int, optional) – maximum number of timesteps per episode. Defaults to 200.
eps_start (float, optional) – starting value of epsilon, for epsilon-greedy action selection. Defaults to 1.0.
eps_end (float, optional) – minimum value of epsilon. Defaults to 0.01.
eps_decay (float, optional) – multiplicative factor (per episode) for decreasing epsilon. Defaults to 0.999.
seed (int, optional) – random seed. Defaults to 0.
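To see what the default exploration schedule implies, the sketch below applies the decay once per episode with a floor at eps_end. The update rule eps = max(eps_end, eps_decay * eps) is an assumption based on the parameter descriptions, and `epsilon_schedule` is a hypothetical helper.

```python
def epsilon_schedule(n_episodes=1000, eps_start=1.0, eps_end=0.01, eps_decay=0.999):
    """Return the epsilon used in each episode under multiplicative decay."""
    eps = eps_start
    values = []
    for _ in range(n_episodes):
        values.append(eps)
        eps = max(eps_end, eps_decay * eps)  # assumed per-episode update
    return values


eps_values = epsilon_schedule()
```

Under this rule the defaults leave epsilon at roughly 0.37 by the end of 1000 episodes, and reaching eps_end takes about 4,600 episodes, so a smaller eps_decay or a larger n_episodes may be worth considering when near-greedy behavior is needed by the end of training.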
- class ns_gym.benchmark_algorithms.PAMCTS(alpha, mcts_iter, mcts_search_depth, mcts_discount_factor, mcts_exploration_constant, state_space_size, action_space_size, DDQN_model=None, DDQN_model_path=None, q_value_fn=None, seed=0)
Bases:
Agent
Policy-Augmented MCTS.
- Parameters:
alpha (float) – Convex combination weight on the learned policy’s Q-values. alpha = 0 -> pure MCTS, alpha = 1 -> pure learned head.
mcts_iter (int) – Total MCTS rollouts (m).
mcts_search_depth (int) – MCTS rollout depth (d).
mcts_discount_factor (float) – MCTS discount factor (gamma).
mcts_exploration_constant (float) – UCB1 exploration constant (c).
state_space_size (int) – Discrete state-space size, only used by the legacy DDQN wiring path.
action_space_size (int) – Number of discrete actions, only used by the legacy DDQN wiring path.
DDQN_model (torch.nn.Module, optional) – Legacy. Architecture instance whose weights will be loaded from DDQN_model_path.
DDQN_model_path (str, optional) – Legacy. Path to a state-dict .pth file matching DDQN_model.
q_value_fn (Callable[[state, env], np.ndarray], optional) – Recommended. A function that returns per-action Q-values for a given (state, env) pair. Bypasses the legacy DDQN wiring entirely.
seed (int) – Random seed forwarded to the legacy DQNAgent. Unused when q_value_fn is supplied.
Examples
>>> # Legacy: ns_gym DQN architecture + .pth state dict
>>> from ns_gym.benchmark_algorithms import PAMCTS
>>> from ns_gym.benchmark_algorithms.DDQN.DDQN import DQN as DDQNNet
>>> arch = DDQNNet(state_size=16, action_size=4,
...                num_layers=3, num_hidden_units=64, seed=0)
>>> agent = PAMCTS(alpha=0.75, mcts_iter=30, mcts_search_depth=20,
...                mcts_discount_factor=0.95,
...                mcts_exploration_constant=1.4,
...                state_space_size=16, action_space_size=4,
...                DDQN_model=arch,
...                DDQN_model_path="weights.pth")

>>> # Recommended: Stable-Baselines3 DQN via StableBaselineWrapper
>>> import numpy as np
>>> from stable_baselines3 import DQN
>>> from ns_gym.base import StableBaselineWrapper
>>> sb3 = DQN.load("contextual_ddqn.zip")
>>> def obs_fn(state, env):
...     # One-hot state concatenated with the current transition probabilities
...     return np.concatenate([
...         np.eye(16, dtype=np.float32)[int(state)],
...         np.asarray(env.transition_prob, dtype=np.float32),
...     ])
>>> wrap = StableBaselineWrapper(sb3, obs_fn=obs_fn)
>>> agent = PAMCTS(alpha=0.75, mcts_iter=30, mcts_search_depth=20,
...                mcts_discount_factor=0.95,
...                mcts_exploration_constant=1.4,
...                state_space_size=16, action_space_size=4,
...                q_value_fn=wrap.q_values)
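The role of alpha can be made concrete with a small sketch of the convex combination. It assumes the two Q-value vectors are already on comparable scales; whether PAMCTS normalizes them before mixing is not stated above, and the helper names `combine_q_values` and `select_action` are hypothetical.

```python
def combine_q_values(q_learned, q_mcts, alpha):
    """Blend per-action values: alpha weights the learned head, 1 - alpha weights MCTS."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * ql + (1.0 - alpha) * qm for ql, qm in zip(q_learned, q_mcts)]


def select_action(q_learned, q_mcts, alpha):
    """Pick the action with the highest blended value."""
    combined = combine_q_values(q_learned, q_mcts, alpha)
    return max(range(len(combined)), key=combined.__getitem__)
```

With alpha = 0 the choice follows the MCTS estimates alone, and with alpha = 1 it follows the learned Q-values alone, matching the two limiting cases listed under Parameters.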
- class ns_gym.benchmark_algorithms.AlphaZeroAgent(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, gamma, c, num_mcts_simulations, max_mcts_search_depth, model_checkpoint_path=None, model=<class 'ns_gym.benchmark_algorithms.AlphaZero.alphazero.AlphaZeroNetwork'>, alpha=1.0, epsilon=0.0)
Bases:
object
- train(env, n_episodes, max_episode_len, lr, batch_size, n_epochs, experiment_name, eval_window_size=100, weight_decay=0.0001, temp_start=2, temp_end=0.8, temp_decay=0.95)
Train the AlphaZero agent.
- Parameters:
env (gym.Env) – The environment to train on
n_episodes (int) – Number of training episodes
max_episode_len (int) – Maximum number of steps per episode
lr (float) – Learning rate for the neural network
batch_size (int) – Batch size for training
n_epochs (int) – Number of epochs per training iteration
experiment_name (str) – Name for saving models and logs
eval_window_size (int, optional) – Size of the evaluation window. Defaults to 100.
weight_decay (float, optional) – Weight decay for optimizer. Defaults to 1e-4.
temp_start (float, optional) – Starting temperature for exploration. Defaults to 2.
temp_end (float, optional) – Ending temperature for exploration. Defaults to 0.8.
temp_decay (float, optional) – Decay rate for temperature. Defaults to 0.95.
- Returns:
List of episode returns
- Return type:
List[float]
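The temperature parameters control how sharply the search result is turned into an action distribution. A common AlphaZero-style rule, shown here as an assumption rather than as this class's confirmed behavior, makes the probability of each action proportional to its root visit count raised to 1/T. The helper name `visit_count_policy` is hypothetical, and the sketch assumes at least one visit has been recorded.

```python
def visit_count_policy(visit_counts, temperature=1.0):
    """Map root visit counts to probabilities proportional to N(a) ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(weights)  # assumes at least one action has been visited
    return [w / total for w in weights]
```

Higher temperatures flatten the distribution and encourage exploration; lower temperatures concentrate probability on the most visited action.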
- class ns_gym.benchmark_algorithms.AlphaZeroNetwork(action_space_dim, observation_space_dim, n_hidden_layers, n_hidden_units, activation='relu')
Bases:
Module
- Overview:
This is a simple MLP that predicts the policy and value of a particular state.
- Args:
action_space_dim (int): Size of the action space.
observation_space_dim (int): Size of the observation space.
lr (float): Learning rate.
n_hidden_layers (int): Number of hidden layers.
n_hidden_units (int): Number of units in each hidden layer.
activation (str, optional): Activation function. Defaults to ‘relu’.
- class ns_gym.benchmark_algorithms.PPO(actor, critic, lr_policy=0.0003, lr_critic=0.0004, max_grad_norm=0.5, ent_weight=0.0, clip_val=0.2, sample_n_epoch=10, sample_mb_size=32, device='cpu')
Bases:
Agent
PPO class
Warning
You can use this if you want but honestly just use the StableBaselines3 implementation.
- Parameters:
actor – Actor network.
critic – Critic network.
lr_policy – Learning rate for the policy network.
lr_critic – Learning rate for the critic network.
max_grad_norm – Maximum gradient norm for clipping.
ent_weight – Entropy weight for exploration.
clip_val – Clipping value for PPO.
sample_n_epoch – Number of epochs to sample minibatches.
sample_mb_size – Size of each minibatch.
device – Device to run the computations on.
- train(states, actions, prev_val, advantages, returns, prev_lobprobs)
Train the PPO model using provided experience.
- Parameters:
states – State samples.
actions – Action samples.
prev_val – Previous state value estimates.
advantages – Advantage estimates.
returns – Discounted return estimates.
prev_lobprobs – Previous log probabilities of actions.
- Returns:
pg_loss: Policy loss.
v_loss: Value loss.
entropy: Average entropy.
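The policy loss is governed by clip_val. The sketch below shows the standard PPO clipped surrogate objective in plain Python; it is not taken from this class, the function name `clipped_surrogate_loss` is hypothetical, and how the class combines this term with the value loss and entropy bonus is not shown here.

```python
import math


def clipped_surrogate_loss(logprobs, prev_logprobs, advantages, clip_val=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for lp, old_lp, adv in zip(logprobs, prev_logprobs, advantages):
        ratio = math.exp(lp - old_lp)  # new policy probability / old policy probability
        clipped = min(max(ratio, 1.0 - clip_val), 1.0 + clip_val)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)
```

When the new and old log probabilities coincide, the ratio is 1 and the loss reduces to the negative mean advantage; clipping only takes effect once the policy has moved more than clip_val away from the old one in the direction the advantage rewards.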
- class ns_gym.benchmark_algorithms.PPOActor(s_dim, a_dim, hidden_size=64, is_discrete=False)
Bases:
Module
Actor network for policy approximation.
Supports both continuous and discrete action spaces.
- Parameters:
s_dim – State dimension.
a_dim – Action dimension.
hidden_size – Number of hidden units in each layer.
is_discrete – Whether the action space is discrete.
- forward(state, deterministic=False)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class ns_gym.benchmark_algorithms.PPOCritic(s_dim, hidden_size=64)
Bases:
Module
Critic network to estimate the state value function. A simple MLP.
- Parameters:
s_dim – State dimension.
hidden_size – Number of hidden units in each layer.
- forward(state)
Define the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class ns_gym.benchmark_algorithms.DDPG(state_dim=8, action_dim=2, hidden_size=256, lr_actor=0.001, lr_critic=0.001)
Bases:
Agent
Deep Deterministic Policy Gradient (DDPG) algorithm.
- Parameters:
state_dim (int) – Dimension of the state space.
action_dim (int) – Dimension of the action space.
hidden_size (int) – Number of hidden units in each layer of the networks.
lr_actor (float) – Learning rate for the actor network.
lr_critic (float) – Learning rate for the critic network.
Warning
This implementation works, though the StableBaselines3 implementation is likely better optimized.
- train(env, num_episodes=10000, batch_size=64, gamma=0.99, tau=0.005, warmup_episodes=300, save_path='models/')
- update(states, actions, rewards, next_states, dones, gamma=0.99, tau=0.001)
Update the actor and critic networks for one training step in DDPG.
- Parameters:
states – Batch of current states.
actions – Batch of actions taken.
rewards – Batch of rewards received.
next_states – Batch of next states.
dones – Batch of done flags (indicating episode termination).
gamma – Discount factor.
tau – Target network soft update parameter.
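The gamma and tau parameters correspond to two standard DDPG ingredients: the bootstrapped critic target and the soft (Polyak) target-network update. The sketch below shows both on plain floats; the function names are hypothetical and the real method operates on network parameters and batches of tensors.

```python
def td_target(reward, next_q, done, gamma=0.99):
    """Critic target: reward plus discounted next value, with no bootstrap at episode end."""
    return reward + gamma * (0.0 if done else 1.0) * next_q


def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]
```

A small tau makes the target networks trail the online networks slowly, which is the usual way to stabilize the critic's regression target.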
- class ns_gym.benchmark_algorithms.RATS(action_space, gamma=0.9, max_depth=4, L_p=1.0, L_r=0.0, tau=1.0)
Bases:
Agent
Risk-Averse Tree Search agent.
- Parameters:
action_space (gymnasium.spaces.Discrete) – action space of the env.
gamma (float) – discount factor.
max_depth (int) – planning depth of the minimax tree.
L_p (float) – Lipschitz constant of the transition kernel in time.
L_r (float) – Lipschitz constant of the reward in time.
tau (float) – non-stationarity time scale.
- act(observation=None, env=None, done=False)
Run the RATS planning procedure and return the chosen action.
- Parameters:
observation – ignored (state is read from env.unwrapped.s); kept for signature parity with other ns_gym agents.
env – NS-Gym env exposing env.unwrapped.s and env.unwrapped.P.
done (bool) – True if the current state is terminal.