Evaluating Your Agent¶
Welcome! This is the fourth tutorial in a series covering the full submission workflow:
Environment Setup: create your repository, configure Python, and build Docker images.
Build Your Agent: implement a model-based or model-free agent and register it.
Create a Custom Environment: define non-stationarity with schedulers and update functions.
This tutorial: understand how submissions are scored and ranked.
Submit Your Agent: run final checks and send your repository for evaluation.
“Jade must be polished to become a gem.” - Three Character Classic (San Zi Jing)
Each submission is evaluated and ranked according to four criteria. When you run the evaluator through Docker, you can see your scoring metrics on your own machine; your official result is determined after submission.
We mainly evaluate based on:
Adaptability: a measure of how quickly an agent can adapt to change. At an unknown timestep, we change a non-stationary environment parameter. Your algorithm needs to recover from the initial failures this causes and adapt to find a good solution.
To evaluate, we consider the following:
Regret. The difference between our oracle solution and your submission.
Recovery Time. The time your algorithm takes to adapt to the change.
Unnotify category only.
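The exact scoring formulas are not published here, but the two quantities above can be illustrated concretely. The sketch below is an assumption-laden example, not the official evaluator: the function name, the per-timestep reward inputs, and the 5% tolerance convention for "recovered" are all made up for illustration.

```python
def regret_and_recovery(oracle_rewards, agent_rewards, change_step, tolerance=0.05):
    """Illustrative regret and recovery-time computation (not the official metric).

    oracle_rewards, agent_rewards: per-timestep rewards of equal length.
    change_step: timestep at which the environment parameter changed.
    tolerance: assumed fraction of the oracle reward within which the
               agent counts as having recovered.
    """
    # Regret: accumulated gap between the oracle solution and the agent.
    regret = sum(o - a for o, a in zip(oracle_rewards, agent_rewards))

    # Recovery time: steps after the change until the agent's reward
    # first comes within `tolerance` of the oracle's reward.
    recovery_time = None
    for t in range(change_step, len(agent_rewards)):
        if agent_rewards[t] >= (1.0 - tolerance) * oracle_rewards[t]:
            recovery_time = t - change_step
            break
    return regret, recovery_time
```

Under this toy convention, an agent whose reward dips after the change and climbs back within 5% of the oracle three steps later would score a recovery time of 3.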
Performance: the average undiscounted episodic reward achieved under non-stationary conditions. The environment used for this evaluation exhibits stronger non-stationarity.
Unnotify and partial-notify categories.
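"Average undiscounted episodic reward" has a simple definition worth pinning down: sum each episode's rewards without a discount factor, then average across episodes. A minimal sketch (the function name and list-of-lists input format are assumptions for illustration):

```python
def average_episodic_reward(episode_rewards):
    """Mean undiscounted return across evaluation episodes.

    episode_rewards: a list of per-episode reward sequences,
                     e.g. [[r_0, r_1, ...], [r_0, ...], ...].
    """
    # Undiscounted return of each episode: a plain sum, no gamma applied.
    returns = [sum(ep) for ep in episode_rewards]
    return sum(returns) / len(returns)
```

For example, two episodes with returns 6 and 9 yield a performance score of 7.5 under this definition.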
Resilience: a good algorithm needs a robust policy that still performs well under slight perturbations. We measure the agent’s performance immediately after the change, before the agent has time to adapt. To do so, we freeze your policy (or your current estimate of the MDP, or both) and perturb the non-stationarity settings. We consider both the retention ratio and the performance relative to our oracle solution. To be ranked on this leaderboard, agents must pass a specific performance threshold.
Fully-notify only.
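The two resilience quantities can be sketched as simple ratios. This is an illustrative guess at their shape, not the official computation; the function name and scalar-reward inputs are assumptions.

```python
def resilience_metrics(pre_change_reward, post_change_reward, oracle_post_change_reward):
    """Illustrative resilience ratios for a frozen policy.

    pre_change_reward: frozen policy's reward before the perturbation.
    post_change_reward: the same frozen policy's reward immediately after it.
    oracle_post_change_reward: the oracle solution's reward in the perturbed setting.
    """
    # Retention ratio: how much of its own pre-change performance the
    # frozen policy keeps.
    retention_ratio = post_change_reward / pre_change_reward
    # Relative performance: how the frozen policy compares to the oracle
    # in the perturbed environment.
    relative_to_oracle = post_change_reward / oracle_post_change_reward
    return retention_ratio, relative_to_oracle
```

A frozen policy that drops from reward 10 to 8 after a perturbation, against an oracle achieving 16, would retain 80% of its own performance but reach only 50% of the oracle's.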
Efficiency: measured in two aspects: the number of timesteps and the amount of wall-clock (real-world) time the agent consumes in finding a solution. Submissions are ranked by the ratio between the two. To be ranked on this leaderboard, agents must pass a specific performance threshold.
Unnotify and partial-notify categories.
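Profiling both aspects locally is straightforward. The sketch below assumes a hypothetical `run_agent` callable that runs your agent to a solution and returns the number of environment timesteps consumed; the interface and the timesteps-per-second ratio are illustrative assumptions, not the official evaluator's API.

```python
import time

def efficiency_profile(run_agent):
    """Measure timesteps and wall-clock time for one evaluation run.

    run_agent: hypothetical callable that runs the agent until it finds
               a solution and returns the timesteps it consumed.
    """
    start = time.perf_counter()
    timesteps = run_agent()
    wall_clock = time.perf_counter() - start
    # One plausible reading of "the ratio between the two":
    # environment timesteps per second of wall-clock time.
    return timesteps, wall_clock, timesteps / wall_clock
```

Running this around your local evaluation loop gives a rough sense of where you stand on both axes before submitting.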
Next step¶
Think you can beat one or more categories? Go to Submit Your Agent for the final checklist and submission steps!