.. currentmodule:: grid2op.Reward

.. _reward-module:

Reward
===================================

This page is organized as follows:

.. contents:: Table of Contents
    :depth: 3

Objectives
-----------
This module implements some utilities to compute rewards given a :class:`grid2op.Action`, a :class:`grid2op.Environment`
and some associated context (for example, whether an error occurred).

It is possible to modify the reward to better suit a training scheme, or to better take into account
some phenomenon by simulating the effect of some :class:`grid2op.Action` using
:func:`grid2op.Observation.BaseObservation.simulate`.

Doing so only requires deriving from :class:`BaseReward` and overriding, most notably, the three methods
:func:`BaseReward.__init__`, :func:`BaseReward.initialize` and :func:`BaseReward.__call__`.

Customization of the reward
-----------------------------

In grid2op you can customize the reward function / reward kernel used by your agent. By default, when you create an
environment, a reward has already been specified for you by the creator of the environment and you have nothing to do:

.. code-block:: python

    import grid2op
    env_name = "l2rpn_case14_sandbox"

    env = grid2op.make(env_name)

    obs = env.reset()
    an_action = env.action_space()
    obs, reward_value, done, info = env.step(an_action)

The value of the reward above is computed by a default function that depends on
the environment you are using. In the example above, the "l2rpn_case14_sandbox" environment
uses the :class:`RedispReward`.

Using a reward function available in grid2op
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to customize your environment by using another reward available in grid2op,
it is rather simple: you need to specify it in the `make` command:

.. code-block:: python

    import grid2op
    from grid2op.Reward import EpisodeDurationReward
    env_name = "l2rpn_case14_sandbox"

    env = grid2op.make(env_name, reward_class=EpisodeDurationReward)

    obs = env.reset()
    an_action = env.action_space()
    obs, reward_value, done, info = env.step(an_action)

In this example the `reward_value` is computed using the formula defined in the :class:`EpisodeDurationReward`.

.. note::
    There is no error in the syntax. You need to provide the class and not an object of the class
    (see the next paragraph for more information about that).

At the time of writing, the available reward functions are:

- :class:`AlarmReward`
- :class:`AlertReward`
- :class:`BridgeReward`
- :class:`CloseToOverflowReward`
- :class:`ConstantReward`
- :class:`DistanceReward`
- :class:`EconomicReward`
- :class:`EpisodeDurationReward`
- :class:`FlatReward`
- :class:`GameplayReward`
- :class:`IncreasingFlatReward`
- :class:`L2RPNReward`
- :class:`LinesCapacityReward`
- :class:`LinesReconnectedReward`
- :class:`N1Reward`
- :class:`RedispReward`

The provided rewards also include some convenience classes to combine different rewards. These are:

- :class:`CombinedReward`
- :class:`CombinedScaledReward`

Basically, these two classes allow you to combine (sum) different rewards into a single one.

Passing an instance instead of a class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On some occasions, it might be easier to work with instances of classes (objects)
rather than with the classes themselves (especially if you want to customize the implementation used).
You can do this without any issue:

.. code-block:: python

    import grid2op
    from grid2op.Reward import N1Reward
    env_name = "l2rpn_case14_sandbox"

    n1_l1_reward = N1Reward(l_id=1)  # this is an object and not a class.
    env = grid2op.make(env_name, reward_class=n1_l1_reward)

    obs = env.reset()
    an_action = env.action_space()
    obs, reward_value, done, info = env.step(an_action)

In this example `reward_value` is computed as the maximum flow on all the powerlines after
the disconnection of powerline `1` (because we specified `l_id=1` at creation). If you
want to know the maximum flow after the disconnection of powerline `5` you can use:

.. code-block:: python

    import grid2op
    from grid2op.Reward import N1Reward
    env_name = "l2rpn_case14_sandbox"

    n1_l5_reward = N1Reward(l_id=5)  # this is an object and not a class.
    env = grid2op.make(env_name, reward_class=n1_l5_reward)

Customizing the reward for the "simulate"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In grid2op, you have the possibility to `simulate` the impact of an action
on some future steps with the use of `obs.simulate(...)` (see :func:`grid2op.Observation.BaseObservation.simulate`)
or `obs.get_forecast_env()` (see :func:`grid2op.Observation.BaseObservation.get_forecast_env`).

These methods compute some rewards. Grid2op lets you customize how these rewards
are computed. You can change this in several ways:

.. code-block:: python

    import grid2op
    from grid2op.Reward import EpisodeDurationReward
    env_name = "l2rpn_case14_sandbox"

    env = grid2op.make(env_name, reward_class=EpisodeDurationReward)
    obs = env.reset()

    an_action = env.action_space()
    sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)

By default `sim_reward` is computed with the same function as the environment, in this
example :class:`EpisodeDurationReward`.

If for some reason you want to customize the formula used to compute `sim_reward` and cannot (or
do not want to) modify the reward of the environment you can:

.. code-block:: python

    import grid2op
    from grid2op.Reward import EpisodeDurationReward
    env_name = "l2rpn_case14_sandbox"

    env = grid2op.make(env_name)
    obs = env.reset()

    # only the reward used by `simulate` is changed, not the one used by `env.step`
    env.observation_space.change_reward(EpisodeDurationReward)
    an_action = env.action_space()

    sim_obs, sim_reward, sim_d, sim_i = obs.simulate(an_action)

    next_obs, reward_value, done, info = env.step(an_action)

In this example, `sim_reward` is computed using the `EpisodeDurationReward` (on forecast data)
and `reward_value` is computed using the default reward of "l2rpn_case14_sandbox" on the
"real" time series data.

Creating a new reward
~~~~~~~~~~~~~~~~~~~~~~

If you don't find any suitable reward function in grid2op (or in another package) you might
want to implement one yourself.

To that end, you need to implement a class that derives from :class:`BaseReward`, like this:

.. code-block:: python

    import grid2op
    from grid2op.Reward import BaseReward
    from grid2op.Action import BaseAction
    from grid2op.Environment import BaseEnv


    class MyCustomReward(BaseReward):
        def __init__(self, whatever, you, want, logger=None):
            # store whatever your reward needs (placeholder arguments)
            self.whatever = whatever
            # some code needed
            ...
            super().__init__(logger)

        def __call__(self,
                     action: BaseAction,
                     env: BaseEnv,
                     has_error: bool,
                     is_done: bool,
                     is_illegal: bool,
                     is_ambiguous: bool) -> float:
            # only method really required.
            # called at each step to compute the reward.
            # this is where you need to code the "formula" of your reward
            ...

        def initialize(self, env: BaseEnv):
            # optional
            # called once, the first time the reward is used
            pass

        def reset(self, env: BaseEnv):
            # optional
            # called by the environment each time it is "reset"
            pass

        def close(self):
            # optional
            # called once when the environment is deleted
            pass

And then you can use your (custom) reward like any other:

.. code-block:: python

    import grid2op
    from the_above_script import MyCustomReward
    env_name = "l2rpn_case14_sandbox"

    custom_reward = MyCustomReward(whatever=1, you=2, want=42)
    env = grid2op.make(env_name, reward_class=custom_reward)
    obs = env.reset()
    an_action = env.action_space()
    obs, reward_value, done, info = env.step(an_action)

And now `reward_value` is computed using the formula you defined in `__call__`.

Training with multiple rewards
-------------------------------
In the standard reinforcement learning framework the reward is unique. In grid2op, we didn't want to modify that.

However, power grids are complex environments with some specific and unusual dynamics. For these reasons it can be
difficult to compress all these signals into one single scalar. To speed up the learning process, to force the
Agent to adopt more resilient strategies etc. it can be useful to look at different aspects, thus using different
rewards. Grid2op allows you to do so. At each time step (and also when using the `simulate` function) it is possible
to compute different rewards. These rewards must inherit from :class:`BaseReward` and be provided at the
initialization of the Environment. This can be done as follows:

.. code-block:: python

    import grid2op
    from grid2op.Reward import GameplayReward, L2RPNReward
    env = grid2op.make("case14_realistic", reward_class=L2RPNReward, other_rewards={"gameplay": GameplayReward})
    obs = env.reset()
    act = env.action_space()  # the do nothing action
    obs, reward, done, info = env.step(act)  # implement the do nothing action on the environment

In this example, "reward" comes from the :class:`L2RPNReward` and the result of the reward computed with the
:class:`GameplayReward` is accessible with `info["rewards"]["gameplay"]`. For this example we chose to name the other
reward "gameplay", which is related to the class name "GameplayReward", for convenience. The name
can be absolutely any string you want.

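
As a minimal sketch (not part of grid2op), the helper below shows one way an agent could read the extra
rewards stored in `info["rewards"]` and fold them into a single weighted training signal. The function name
`weighted_reward`, the weights and the hand-written `info` dictionary are illustrative assumptions: in practice
`info` would be the dictionary returned by `env.step`.

.. code-block:: python

    from typing import Dict, Mapping


    def weighted_reward(main_reward: float,
                        info: Mapping[str, Mapping[str, float]],
                        weights: Dict[str, float]) -> float:
        """Add weighted extra rewards (read from info["rewards"]) to the main reward."""
        other = info.get("rewards", {})
        total = float(main_reward)
        for name, weight in weights.items():
            # a name that is missing from info["rewards"] contributes nothing
            total += weight * float(other.get(name, 0.0))
        return total


    # hand-written stand-in for the `info` dictionary returned by `env.step`
    info = {"rewards": {"gameplay": 1.0}}
    shaped = weighted_reward(main_reward=10.0, info=info, weights={"gameplay": 0.5})
    print(shaped)

Keeping the combination outside the environment leaves the reward reported by `env.step` untouched, so the
weights can be changed during training without recreating the environment.
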

**NB** In the case of L2RPN competitions, the reward can be modified by the competitors, and so can the
"other_rewards" keyword argument. The only restriction is that the key "__score" will be used by the organizers to
compute the score of the agent. Any attempt to modify it will be erased by the score function used by the organizers
without any warning.

.. _reward-module-reset-focus:

What happens in the "reset"
------------------------------

TODO

Detailed Documentation by class
--------------------------------
.. automodule:: grid2op.Reward
    :members:
    :special-members:
    :autosummary:

.. include:: final.rst