.. _model_based_rl:

Model Based / Planning methods
====================================

This page is organized as follows:

.. contents:: Table of Contents
    :depth: 3

Objectives
----------------

.. warning::
    This page is in progress. We welcome any contribution :-)

There are 4 standard methods currently in grid2op to apply "model based" / "planning" methods:

1) use "obs.simulate" (see :func:`grid2op.Observation.BaseObservation.simulate`)
2) use the "Simulator" (see :ref:`simulator_page` and :mod:`grid2op.simulator`)
3) use the "forecast env" (see :func:`grid2op.Observation.BaseObservation.get_forecast_env`)
4) use the "forecast env" built from your own forecasts
   (see :func:`grid2op.Observation.BaseObservation.get_env_from_external_forecasts`)

.. note::
    The main difference between :func:`grid2op.Observation.BaseObservation.get_forecast_env`
    and :func:`grid2op.Observation.BaseObservation.get_env_from_external_forecasts` is that
    the first one relies on the forecasts provided by the environment, whereas with
    :func:`grid2op.Observation.BaseObservation.get_env_from_external_forecasts` you are
    responsible for providing these forecasts.

    This has some implications:

    - you cannot use ``obs.get_forecast_env()`` if the forecasts are deactivated, or if
      the environment does not provide any forecast
    - the number of steps possible in ``obs.get_forecast_env()`` is fixed and determined by
      the environment.
    - "garbage in" = "garbage out" is especially true for ``obs.get_env_from_external_forecasts``.
      If you provide forecasts of poor quality (*eg* that do not contain any useful
      information about the future, or such that the total generation is lower than the
      total demand etc.) then you will most likely not get any useful information from
      their usage.

You can use these methods for different strategies, among which:

- *Decide when to act or not*: A successful technique is to "do nothing" or to "get back to
  a reference configuration" when the grid is safe.
  It is only when the grid is declared "not safe" that an action is taken. You can
  declare a grid safe if you can "do nothing" without overload for a certain
  number of steps, or test whether there is still no overload even if the grid is "under stress"
  (line disconnected by the opponent, more loads / renewables etc.)
- *Choose the best action among a short list*: in this use case you have a short list of actions
  (hard coded, given by a heuristic, by domain knowledge or by a neural network, etc.)
  and you want to keep the most promising one.

.. _mb_simulate:

obs.simulate
-------------

The idea here is to "simulate" the impact of an action on "future" grid state(s) before taking this action "for real".

You can use it, for example, to select the "best action among *k*" (the *k* actions can come, for example,
from the output of a neural net, keeping the *k* actions with the highest q-value).

In this first example you "simulate" the effect of each action for each of the next 3 steps, and take the
action with the best "score".

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent1(BaseAgent):
        def act(self, observation, reward, done=False):
            k_actions = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -99999999
            for act in k_actions:
                _, sim_reward1, sim_done, sim_info = observation.simulate(act, time_step=1)
                # time_step=2 and time_step=3 only if supported by the environment
                _, sim_reward2, sim_done, sim_info = observation.simulate(act, time_step=2)
                _, sim_reward3, sim_done, sim_info = observation.simulate(act, time_step=3)
                # function_to_combine_rewards is yours to define
                this_score = function_to_combine_rewards(sim_reward1, sim_reward2, sim_reward3)

                # select the action with the best score
                if this_score > highest_score:
                    res = act
                    highest_score = this_score
            return res

You can also use it to select the action that keeps the grid in a "correct" state for the longest time:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent2(BaseAgent):
        def act(self, observation, reward, done=False):
            k_actions = ...
            # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -1
            for act in k_actions:
                ts_survived = 0
                sim_obs, sim_r, sim_done, sim_info = observation.simulate(act)
                # you can then start to see how long you survive by "doing nothing"
                while not sim_done:
                    ts_survived += 1
                    sim_obs, sim_r, sim_done, sim_info = sim_obs.simulate(self.action_space())

                # select the action with the best score
                if ts_survived > highest_score:
                    res = act
                    highest_score = ts_survived
            return res

.. note::
    In both cases above, you can evaluate the impact of an entire "strategy" (*here* encoded as
    "a list of actions", the simplest one being "do an action then do nothing as long as you can")
    if you chain the calls to simulate.

This would give, for the example 1:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent1Bis(BaseAgent):
        def act(self, observation, reward, done=False):
            k_strategies = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -99999999
            for strat in k_strategies:
                act1, act2, act3 = strat
                s_o1, sim_reward1, sim_done, sim_info = observation.simulate(act1)
                # rewards stay at None for the steps that are not reached (game over before)
                sim_reward2 = None
                sim_reward3 = None
                if not sim_done:
                    s_o2, sim_reward2, sim_done, sim_info = s_o1.simulate(act2)
                if not sim_done:
                    s_o3, sim_reward3, sim_done, sim_info = s_o2.simulate(act3)
                # function_to_combine_rewards is yours to define (it has to handle None)
                this_score = function_to_combine_rewards(sim_reward1, sim_reward2, sim_reward3)

                # select the action with the best score
                if this_score > highest_score:
                    res = strat[0]  # action will be the first one of the strategy of course
                    highest_score = this_score
            return res

And for the ExampleAgent2:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent2Bis(BaseAgent):
        def act(self, observation, reward, done=False):
            k_strategies = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -1
            for strat in k_strategies:
                ts_survived = 0
                sim_obs, sim_r, sim_done, sim_info = observation.simulate(strat[0])
                # you can then start to see how long you survive, as long as the
                # strategy still has actions to play
                while not sim_done and ts_survived + 1 < len(strat):
                    ts_survived += 1
                    sim_obs, sim_r, sim_done, sim_info = sim_obs.simulate(strat[ts_survived])

                # select the action with the best score
                if ts_survived > highest_score:
                    res = strat[0]  # action is the first one of the best strategy
                    highest_score = ts_survived
            return res

.. note::
    We are sure there are lots of other ways to use "obs.simulate". If you have some ideas let us know, for example
    by starting a conversation here https://github.com/Grid2Op/grid2op/discussions or in our discord.

Simulator
--------------

The idea of the :class:`grid2op.simulator.Simulator` is to give you more control over the "grid state" you
want to simulate. Instead of relying on the pre-computed "time series" of the environment (the so-called "*forecasts*")
you can define your own "load" and "generation".

This can be useful if you want to see whether an action still works when the grid is "more stressed". In a setting
similar to the one above, you could select the "best action" among a list of *k* as the
"most robust action if the loads increase".

There are lots of ways to "stress" a powergrid: you can increase the amount of renewables, the total demand,
the demand in a particular area, disconnect some powerlines etc. We keep it simple here and
just increase the demand by a certain factor. The generation is increased by the same factor, because
``sum generation = sum load + losses``.

This can give a code looking like:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent3(BaseAgent):
        def act(self, observation, reward, done=False):
            k_actions = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_stress = -1
            simulator = observation.get_simulator()
            init_load_p = 1. * observation.load_p
            init_load_q = 1. * observation.load_q
            init_gen_p = 1. * observation.gen_p
            for act in k_actions:
                # highest stress factor for which the grid stays safe (0 if there is none)
                prev_stress = 0.
                # you can stress the grid the way you want, disconnecting some powerline
                # increase demand / generation in certain area etc. etc.
                # we just do a simple "heuristic" here
                for this_stress in [1, 1.01, 1.02, 1.03, 1.05, 1.06, 1.07, 1.08, 1.09, 1.1]:
                    this_load_p = init_load_p * this_stress
                    this_load_q = init_load_q * this_stress
                    this_gen_p = init_gen_p * this_stress
                    sim_res = simulator.predict(act,
                                                new_gen_p=this_gen_p,
                                                new_load_p=this_load_p,
                                                new_load_q=this_load_q,
                                                )
                    if not sim_res.converged:
                        # simulation could not be made, this would correspond to a "game over"
                        break
                    sim_obs = sim_res.current_obs
                    if (sim_obs.rho > 1.).any():
                        # grid is not safe, action is not "robust enough":
                        # at least one powerline is overloaded
                        break
                    prev_stress = this_stress

                # select the action with the best score
                if prev_stress > highest_stress:
                    res = act
                    highest_stress = prev_stress
            return res

This way of looking at the problem is related to the "forecast error": you "stress" the grid in the
direction where you expect the forecast to be inaccurate, to know whether your "strategy" is robust
to these uncertainties.

If you would rather disconnect some powerlines as a way to stress the grid, you can end up with something like:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent3Bis(BaseAgent):
        def act(self, observation, reward, done=False):
            k_actions = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_stress = -1
            simulator = observation.get_simulator()
            for act in k_actions:
                this_stress_pass = 0
                # you can stress the grid the way you want, disconnecting some powerline
                # increase demand / generation in certain area etc. etc.
                # here we simulate the impact of your action after disconnection of line 1, 2, 7, 12 and 42
                for this_stress_id in [1, 2, 7, 12, 42]:
                    this_act = act.copy()
                    this_act += self.action_space({"set_line_status": [(this_stress_id, -1)]})
                    # some code that ignores the "topology" ways (if any) to reconnect the line
                    # in the original action
                    this_act.remove_line_status_from_topo(check_cooldown=False)
                    # loads and generations are left unchanged here
                    sim_res = simulator.predict(this_act)
                    if not sim_res.converged:
                        # simulation could not be made, this would correspond to a "game over"
                        continue
                    sim_obs = sim_res.current_obs
                    if (sim_obs.rho > 1.).any():
                        # grid is not safe, action is not "robust enough":
                        # at least one powerline is overloaded
                        continue
                    this_stress_pass += 1

                # select the action with the best score
                # in this case the highest number of "safe disconnection"
                if this_stress_pass > highest_stress:
                    res = act
                    highest_stress = this_stress_pass
            return res

.. note::
    We are sure there are lots of other ways to use the "Simulator". If you have some ideas let us know, for example
    by starting a conversation here https://github.com/Grid2Op/grid2op/discussions or in our discord.

Forecast env
---------------

Finally you can use the :func:`grid2op.Observation.BaseObservation.get_forecast_env` to retrieve an actual
environment already loaded with the "forecast" data available.
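
Whatever planning method you build on top of such an environment, the inner loop is often the same: play a list
of actions until the strategy is over or a game over occurs. Here is a minimal sketch of this loop. It only
assumes that the environment exposes the usual ``step`` method returning ``(obs, reward, done, info)``. The name
``rollout_strategy`` is hypothetical and is not part of grid2op.

.. code-block:: python

    def rollout_strategy(env, strategy):
        """Play the actions of `strategy`, in order, in `env`.

        Stops at the first game over or when the strategy has no action left.
        Returns the list of rewards collected and the last `done` flag.
        """
        rewards = []
        done = False
        for action in strategy:
            _, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                # game over: the remaining actions cannot be played
                break
        return rewards, done

With grid2op, ``env`` would typically be the result of ``observation.get_forecast_env()``, after a call to its
``reset()`` method, and ``len(rewards)`` gives the number of steps actually played.
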
Alternatively, if you want to use this feature but the environment does not provide such forecasts, you can
have a look at :func:`grid2op.Observation.BaseObservation.get_env_from_external_forecasts` (if you can
generate your own forecasts) or at the :ref:`tshandler-module` section of the documentation (to still be able to
"generate" forecasts).

Lots of methods can be used in this setting, for example MCTS or any other "planning strategy". If we take
again the example of the section :ref:`mb_simulate` above, the forecast env also allows you to evaluate the impact of
more than 1 action already planned, or of an action followed by "do nothing" etc.

This could give, for the ``ExampleAgent1``:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent4(BaseAgent):
        def act(self, observation, reward, done=False):
            k_strategies = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -99999999
            for strat in k_strategies:
                f_env = observation.get_forecast_env()
                f_obs = f_env.reset()
                f_done = False
                ts_survived = 0
                strat_rewards = []
                # play the strategy until a game over or until it has no action left
                while not f_done and ts_survived < len(strat):
                    f_obs, f_r, f_done, f_info = f_env.step(strat[ts_survived])
                    strat_rewards.append(f_r)
                    ts_survived += 1
                # function_to_combine_rewards is yours to define
                this_score = function_to_combine_rewards(strat_rewards)

                # select the strategy with the best score
                if this_score > highest_score:
                    res = strat[0]  # action will be the first one of the strategy of course
                    highest_score = this_score
            return res

And for the ``ExampleAgent2``:

.. code-block:: python

    from grid2op.Agent import BaseAgent

    class ExampleAgent5(BaseAgent):
        def act(self, observation, reward, done=False):
            k_strategies = ...  # whatever you want, hard coded, heuristics, output of a NN etc.
            res = None
            highest_score = -1
            for strat in k_strategies:
                f_env = observation.get_forecast_env()
                f_obs = f_env.reset()
                ts_survived = 0
                f_obs, f_r, f_done, f_info = f_env.step(strat[0])
                # you can then start to see how long you survive, as long as the
                # strategy still has actions to play
                while not f_done and ts_survived + 1 < len(strat):
                    ts_survived += 1
                    f_obs, f_r, f_done, f_info = f_env.step(strat[ts_survived])

                # select the action with the best score
                if ts_survived > highest_score:
                    res = strat[0]  # action is the first one of the best strategy
                    highest_score = ts_survived
            return res

.. include:: final.rst