Dive into grid2op sequential decision process



The goal of this page of the documentation is to provide you with a relatively extensive description of the mathematical model behind grid2op.

Grid2op is a piece of software whose aim is to make experiments on powergrids, mainly sequential decision making, as easy as possible.

We chose to model this sequential decision making problem as a “Markov Decision Process” (MDP) and, in some cases, as a “Partially Observable Markov Decision Process” (POMDP), a “Constrained Markov Decision Process” (CMDP) or even (work in progress) a “Decentralized (Partially Observable) Markov Decision Process” (Dec-(PO)MDP).

General notations

There are different ways to define an MDP. In this paragraph we introduce the notations that we will use.

In an MDP an “agent” / “automaton” / “algorithm” / “policy” takes some action a_t \in \mathcal{A}. This action is processed by the environment, which updates its internal state from s_t \in \mathcal{S} to s_{t+1} \in \mathcal{S} and computes a so-called reward r_{t+1} \in [0, 1].


By stating the dynamic of the environment this way, we ensure the “Markovian” property: the state s_{t+1} is determined by the knowledge of the previous state s_{t} and the action a_{t}.

The tuple (s_{t+1}, r_{t+1}) is then given to the “agent” / “automaton” / “algorithm” which in turn produces the action a_{t+1}.


More formally even, everything written can be stochastic:

  • a_t \sim \pi_{\theta}(s_t) where \pi_{\theta}(\cdot) is the “policy” parametrized by some parameters \theta that outputs here a probability distribution (depending on the state of the environment s_t) over all the actions \mathcal{A}

  • s_{t+1} \sim \mathcal{L}_S(s_t, a_t) where \mathcal{L}_S(s_t, a_t) is a probability distribution over \mathcal{S} representing the likelihood of the “next state” given the current state and the action of the “policy”

  • r_{t+1} \sim \mathcal{L}_R(s_t, s_{t+1}, a_t) is the reward function indicating “how good” was the transition from s_{t} to s_{t+1} by taking action a_t

This alternation \dots \to a \to (s, r) \to a \to \dots is done for a certain number of “steps” called T.

We will call the list s_{1} \to a_1 \to (s_2, r_2) \to \dots \to a_{T-1} \to (s_{T}, r_T) an “episode”.
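The alternation described above can be sketched with a toy example. The two-state model below, its probabilities and its function names are purely illustrative assumptions (it is not the grid2op MDP); it only shows how a policy, a transition kernel and a reward kernel combine into an episode.

```python
import random

# Hypothetical toy MDP (NOT grid2op): two states, two actions.
T = 5  # the episode visits the states s_1 ... s_T


def policy(state, rng):
    """pi(s): sample an action from a distribution that depends on the state."""
    p_act = 0.8 if state == "stressed" else 0.1
    return "act" if rng.random() < p_act else "do_nothing"


def transition(state, action, rng):
    """L_S(s, a): sample the next state (here it only depends on the action)."""
    p_ok = 0.9 if action == "act" else 0.5
    return "ok" if rng.random() < p_ok else "stressed"


def reward(state, next_state, action):
    """L_R(s, s', a): a deterministic (Dirac) reward with values in [0, 1]."""
    base = 1.0 if next_state == "ok" else 0.2
    cost = 0.1 if action == "act" else 0.0
    return base - cost


def run_episode(seed=0):
    rng = random.Random(seed)
    s = "ok"  # s_1
    episode = []  # one tuple (s_t, a_t, s_{t+1}, r_{t+1}) per transition
    for _ in range(T - 1):
        a = policy(s, rng)
        s_next = transition(s, a, rng)
        r = reward(s, s_next, a)
        episode.append((s, a, s_next, r))
        s = s_next
    return episode


episode = run_episode()
for step in episode:
    print(step)
```

An episode with T states contains T - 1 transitions, which is why the loop runs T - 1 times.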

Formally the knowledge of:

  • \mathcal{S}, the “state space”

  • \mathcal{A}, the “action space”

  • \mathcal{L}_s(s, a), sometimes called “transition kernel”, is the probability distribution (over \mathcal{S}) that gives the next state after taking action a in state s

  • \mathcal{L}_r(s, s', a), sometimes called “reward kernel”, is the probability distribution (over [0, 1]) that gives the reward r after taking action a in state s which leads to state s'

  • T \in \mathbb{N}^* the maximum number of steps for an episode

defines an MDP. We will detail all of them in the section Modeling sequential decisions below.

In grid2op, there is a special case where a grid state cannot be computed (either due to some physical infeasibilities or because the resulting state would be unrealistic). This can be modeled relatively easily in the MDP formulation above if we add a “terminal state” s_{\emptyset} to the state space, \mathcal{S}_{new} := \mathcal{S} \cup \left\{ s_{\emptyset} \right\}, and add the transitions \mathcal{L}_s(s_{\emptyset}, a) = \text{Dirac}(s_{\emptyset}) \forall a \in \mathcal{A}, stating that once the agent lands in this “terminal state” the game is over: it stays there until the end of the scenario.

We can also define the reward kernel in this state, for example with \mathcal{L}_r(s_{\emptyset}, s', a) = \text{Dirac}(0) \forall s' \in \mathcal{S}, a \in \mathcal{A} and \mathcal{L}_r(s, s_{\emptyset}, a) = \text{Dirac}(0) \forall s \in \mathcal{S}, a \in \mathcal{A} which states that there is nothing to be gained in being in this terminal state.
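As an illustration, here is a minimal sketch of how an absorbing terminal state can be added on top of any deterministic transition and reward function. The names `TERMINAL`, `with_terminal_state` and the toy “counter” model are hypothetical and are not part of grid2op.

```python
TERMINAL = "s_emptyset"  # stands for the terminal state of the text above


def with_terminal_state(transition, reward):
    """Wrap toy (deterministic) kernels so that TERMINAL is absorbing."""

    def new_transition(state, action):
        if state == TERMINAL:
            return TERMINAL  # Dirac on the terminal state: the agent never leaves
        return transition(state, action)

    def new_reward(state, next_state, action):
        if state == TERMINAL or next_state == TERMINAL:
            return 0.0  # Dirac(0): nothing to gain
        return reward(state, next_state, action)

    return new_transition, new_reward


# Hypothetical base model: a counter that "fails" once it reaches 3.
def base_transition(state, action):
    nxt = state + action
    return TERMINAL if nxt >= 3 else nxt


def base_reward(state, next_state, action):
    return 1.0


step_fn, reward_fn = with_terminal_state(base_transition, base_reward)

s = 0
rewards = []
for a in [1, 1, 1, 1, 1]:
    s_next = step_fn(s, a)
    rewards.append(reward_fn(s, s_next, a))
    s = s_next
print(s, rewards)  # s_emptyset [1.0, 1.0, 0.0, 0.0, 0.0]
```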

Unless specified otherwise, we will not go into these details in the following explanation and take them as a prerequisite, since they can be defined in general. We will focus on the definition of \mathcal{S}, \mathcal{A}, \mathcal{L}_s(s, a) and \mathcal{L}_r(s, s', a) by leaving out the “terminal state”.


In the grid2op implementation, this “terminal state” is not directly implemented. Instead, the first Observation leading to this state is marked as “done” (flag obs.done is set to True).

No other “observation” will be given by grid2op after an observation with obs.done set to True and the environment needs to be “reset”.

This is consistent with the gymnasium implementation.

The main goal of a finite horizon MDP is then to find a policy \pi \in \Pi that, given states s and rewards r, outputs actions a solving the following problem (NB here \Pi denotes the set of all considered policies for this MDP):

    \max_{\pi \in \Pi}  ~& \sum_{t=1}^T \mathbb{E} \left( r_t \right) \\
    \text{s.t.} ~ \\
                   & \forall t, a_t \sim  \pi (s_{t}) & \text{policy produces the action} \\
                   & \forall t, s_{t+1} \sim \mathcal{L}_S(s_t, a_t) & \text{environment produces next state} \\
                   & \forall t, r_{t+1} \sim \mathcal{L}_R(s_t, s_{t+1}, a_t) & \text{environment produces next reward} \\
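For a very small MDP whose kernels are known, this kind of finite horizon problem can be solved exactly by backward induction. The sketch below assumes that the reward is something to maximise and uses a hypothetical two-state model (all names and probabilities are made up for the illustration); it is of course not tractable for a real powergrid.

```python
# Hypothetical toy MDP (NOT grid2op): P[s][a] lists (probability, next_state, reward).
P = {
    "ok": {
        "do_nothing": [(0.9, "ok", 1.0), (0.1, "stressed", 0.2)],
        "act": [(0.95, "ok", 0.9), (0.05, "stressed", 0.1)],
    },
    "stressed": {
        "do_nothing": [(0.2, "ok", 1.0), (0.8, "stressed", 0.2)],
        "act": [(0.7, "ok", 0.9), (0.3, "stressed", 0.1)],
    },
}


def backward_induction(P, n_transitions):
    """Return (value, plan).

    value[s] is the best expected sum of rewards when starting in s,
    plan[k][s] is the action to take in state s at the k-th transition.
    """
    value = {s: 0.0 for s in P}  # nothing left to collect after the last step
    plan = []
    for _ in range(n_transitions):
        new_value, decision = {}, {}
        for s, actions in P.items():
            # expected (immediate reward + value of the remaining steps) per action
            q = {
                a: sum(p * (r + value[s2]) for p, s2, r in outcomes)
                for a, outcomes in actions.items()
            }
            best = max(q, key=q.get)
            decision[s], new_value[s] = best, q[best]
        value = new_value
        plan.insert(0, decision)  # we iterate backward in time
    return value, plan


value, plan = backward_induction(P, n_transitions=3)
print(value)
print(plan[0])
```

The returned plan depends on the time step as well as on the state, which is expected for a finite horizon problem.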

Specific notations

To define “the” MDP modeled by grid2op, we also need to define some other concepts that will be used to define the state space \mathcal{S} or transition kernel \mathcal{L}_s(s, a) for example.

A Simulator

We need a so called “simulator”.

In the code, this is represented by the Backend inside the grid2op environment (more information is given in the Backend section of the documentation).

This simulator is able to compute some information that is part of the state space \mathcal{S} (eg flows on powerlines, active production value of generators etc.) and is thus used in the computation of the transition kernel.

We can model this simulator with a function \text{Sim} that takes as input some data from an “input space” \mathcal{S}_{\text{im}}^{(\text{in})} and returns data in \mathcal{S}_{\text{im}}^{(\text{out})}.


In grid2op we don’t force the “shape” of \mathcal{S}_{\text{im}}^{(\text{in})}, including the format used to read the grid file from the hard drive, the solved equations, the way these equations are used. Everything here is “free” and grid2op only needs that the simulator (wrapped in a Backend) understands the “format” sent by grid2op (through a grid2op.Action._backendAction._BackendAction) and is able to expose to grid2op some of its internal variables (accessed with the ***_infos() methods of the backend)

TODO do I emphasize that the simulator also contains the grid iteself ?

To draw a parallel with similar concepts, the “simulator” represents the physics, as in all “mujoco” environments (eg Ant or Inverted Pendulum). This is the same concept here, except that it solves powerflows.

Some Time Series

Another type of data that we need to define “the” grid2op MDP is the “time series”, implemented in the chronics grid2op module documented on the page Time series (formerly called “chronics”) with some complements given in the Input data of an environment page as well.

These time series define what exactly would happen if the grid was a “copper plate” without any constraints. Said differently, they provide what each consumer would consume and what each producer would produce if they could all be connected together with infinite “bandwidth”, without any constraints on the powerlines etc.

In particular, grid2op supposes that these “time series” are balanced, in the sense that the producers produce just the right amount for the consumers to consume (electrical power cannot really be stored), and this at each step. It also supposes that all the “constraints” of the producers are met.

These time series are typically generated outside of grid2op, for example using chronix2grid python package (or anything else).

Formally, we denote by \mathcal{X}_t the value of all these time series at time t. These exogenous data consist of:

  • generator active production (in MW), for each generator

  • load active power consumption (in MW), for each load

  • load reactive consumption (in MVAr), for each load

  • * generator voltage setpoint / target (in kV)


* for this last part, this can be adapted “on demand” by the environment through the voltage controller module. But for the sake of modeling, this can be modeled as being external / exogenous data.
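To fix ideas, one “row” \mathcal{X}_t can be pictured as a small record. The sketch below is only an illustration: the class `TimeSeriesStep` and the helper `is_balanced` are hypothetical names (not grid2op objects), and the balance check ignores losses, which is a simplifying assumption consistent with the “copper plate” picture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimeSeriesStep:
    """One row X_t of the exogenous data (illustrative container)."""

    gen_p: List[float]   # active production (MW), one value per generator
    load_p: List[float]  # active consumption (MW), one value per load
    load_q: List[float]  # reactive consumption (MVAr), one value per load
    gen_v: List[float]   # voltage setpoint (kV), one value per generator


def is_balanced(x, tol=1e-6):
    """Copper-plate balance: total production equals total consumption.

    Losses are ignored here (assumption made for this sketch).
    """
    return abs(sum(x.gen_p) - sum(x.load_p)) <= tol


x_t = TimeSeriesStep(
    gen_p=[40.0, 60.0],
    load_p=[30.0, 50.0, 20.0],
    load_q=[5.0, 8.0, 3.0],
    gen_v=[142.1, 142.1],
)
print(is_balanced(x_t))  # True
```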

And, to draw a parallel with similar concepts in other RL environments, these “time series” can represent the layout of the maze in pacman, the positions of the platforms in “mario-like” 2d games, the different turns and the width of the road in a car game etc. This is the “base” of the levels in most games.

Finally, for most released environments, a lot of different \mathcal{X} are available. By default, each time the environment is “reset” (the user wants to move to the next scenario), a new \mathcal{X} is used (this behaviour can be changed, more information in the section Time series Customization of the documentation).

Modeling sequential decisions

As we said in the introduction of this page, we will model a given scenario in grid2op. We have at our disposal:

  • a simulator, which is represented as a function \text{Sim} : \mathcal{S}_{\text{im}}^{(\text{in})} \to \mathcal{S}_{\text{im}}^{(\text{out})}

  • some time series \mathcal{X} = \left\{ \mathcal{X}_t \right\}_{1 \leq t \leq T}

In order to define the MDP we need to define:

  • \mathcal{S}, the “state space”

  • \mathcal{A}, the “action space”

  • \mathcal{L}_s(s, a), sometimes called “transition kernel”, is the probability distribution (over \mathcal{S}) that gives the next state after taking action a in state s

  • \mathcal{L}_r(s, s', a), sometimes called “reward kernel”, is the probability distribution (over [0, 1]) that gives the reward r after taking action a in state s which leads to state s'

We will do that for a single episode (all episodes follow the same process)


To make the reading of this MDP easier, for this section of the documentation, we adopted the following convention:

  • text in green will refer to elements that are read directly from the grid by the simulator \text{Sim} at the creation of the environment.

  • text in orange will refer to elements that are related to time series \mathcal{X}

  • text in blue will refer to elements that can be modified in the code by the user at the creation of the environment.

In the pure definition of the MDP all text in green, orange or blue are exogenous and constant: once the episode starts they cannot be changed by anything (including the agent).

We differentiate between these 3 types of “variables” only to clarify what can be modified by “whom”:

  • green variables depend only on the controlled powergrid

  • orange variables depend only on the time series

  • blue variables depend only on the way the environment is loaded


Not all these variables are independent though. If there are for example 3 loads on the grid, then you need to use time series that can generate 3 values at each step for load active values and 3 values at each step for load reactive values. So the dimension of the orange variables is related to the dimension of the green variables: you cannot use any time series you want on any grid you want.

Structural informations

To define the MDP mathematically, we first need to define some notations about the grid manipulated in this episode.

We suppose that the structure of the grid does not change during the episode, with:

  • n_line being the number of “powerlines” (and transformers) which are elements that allow the power flows to actually move from one place to another

  • n_gen being the number of generators, which are elements that produce the power

  • n_load being the number of consumers, which are elements that consume the power (typically a city or a large industrial manufacturing plant)

  • n_storage being the number of storage units on the grid, which are elements that allow to convert the power into a form of energy that can be stored (eg chemical)

All these elements (side of powerlines, generators, loads and storage units) are connected together at so-called “substations”. The grid counts n_sub such substations. We will call dim_topo := 2 \times n_line + n_gen + n_load + n_storage the total number of elements in the grid.


This “substation” concept only means that if two elements do not belong to the same substation, they cannot be directly connected at the same “node” of the graph.

They can be in the same “connected component” of the graph (meaning that there is a path of edges connecting them) but they cannot be part of the same “node”.

Each substation can be divided into n_busbar_per_sub busbars (this was only 2 in grid2op <= 1.9.8 and can be any integer > 0 in grid2op version >= 1.9.9).

This n_busbar_per_sub parameter gives the maximum number of independent nodes there can be in a given substation. So to count the total maximum number of nodes in the grid, you can do \text{n\_busbar\_per\_sub} \times \text{n\_sub}

When the grid is loaded, the backend also informs the environment about the ***_to_subid vectors (eg gen_to_subid) which give, for each element, the substation to which it is connected. This is how the “constraint” of



With these notations, two elements are connected together if (and only if, that’s a definition after all):

  • they belong to the same substation

  • they are connected to the same busbar

In this case, we can also say that these two elements are connected to the same “bus”.

These “buses” are the “nodes” in “the” graph you thought about when looking at a powergrid.


Definition (“disconnected bus”): A bus is said to be disconnected if there are no elements connected to it.


Definition (“disconnected element”): An element (side of powerlines, generators, loads or storage units) is said to be disconnected if it is not connected to anything.
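These definitions can be written down in a few lines. In the sketch below each element is described by a pair (substation id, busbar id); the function names are hypothetical, and using -1 to encode a disconnected element is an assumption borrowed from the set_bus convention described in the Action space section of this page.

```python
def same_bus(sub_a, busbar_a, sub_b, busbar_b):
    """Two elements are connected together iff they are both connected,
    belong to the same substation and are assigned to the same busbar."""
    if busbar_a == -1 or busbar_b == -1:  # -1: disconnected element (assumption)
        return False
    return sub_a == sub_b and busbar_a == busbar_b


def disconnected_buses(elements, n_sub, n_busbar_per_sub):
    """List the buses (sub_id, busbar_id) to which no element is connected."""
    used = {(sub, busbar) for sub, busbar in elements if busbar != -1}
    return [
        (sub, busbar)
        for sub in range(n_sub)
        for busbar in range(1, n_busbar_per_sub + 1)
        if (sub, busbar) not in used
    ]


# 4 elements on a toy grid with 2 substations and 2 busbars per substation
elements = [(0, 1), (0, 1), (1, 2), (1, -1)]
print(same_bus(0, 1, 0, 1))                # True
print(same_bus(0, 1, 1, 1))                # False: different substations
print(disconnected_buses(elements, 2, 2))  # [(0, 2), (1, 1)]
```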

Extra references:

You can modify n_busbar_per_sub in the grid2op.make function. For example, by default, if you call grid2op.make("l2rpn_case14_sandbox") you will have n_busbar_per_sub = 2, but if you call grid2op.make("l2rpn_case14_sandbox", n_busbar=3) you will have n_busbar_per_sub = 3. See Substations for more information.

n_line, n_gen, n_load, n_storage and n_sub depend on the environment you loaded when calling grid2op.make. For example, calling grid2op.make("l2rpn_case14_sandbox") will lead to an environment with n_line = 20, n_gen = 6, n_load = 11 and n_storage = 0.
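As a quick worked example, the quantities defined above can be computed by hand for this environment. The value n_sub = 14 is not stated above: it is an assumption based on the fact that this environment is built on the IEEE 14-bus test case.

```python
# Values quoted above for "l2rpn_case14_sandbox"
n_line, n_gen, n_load, n_storage = 20, 6, 11, 0
n_sub = 14            # assumption: IEEE 14-bus test case
n_busbar_per_sub = 2  # default value

# each powerline has two sides, every other element has only one
dim_topo = 2 * n_line + n_gen + n_load + n_storage
# maximum number of nodes ("buses") the grid can have
max_nodes = n_busbar_per_sub * n_sub

print(dim_topo, max_nodes)  # 57 28
```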

Other informations

When loading the environment, there are also some other static data that are loaded which includes:

  • min_storage_p and max_storage_p: the minimum and maximum power that can be injected by each storage unit (typically min_storage_p < 0). These are vectors (of real numbers) of size n_storage

  • is_gen_renewable: a vector of True / False indicating for each generator whether it comes from renewable (and intermittent) energy sources (eg solar or wind)

  • is_gen_controlable: a vector of True / False indicating for each generator whether it can be controlled by the agent to produce either more or less power at any given step. This is usually the case for generators whose primary energy is coal, gas, nuclear or water (hydro powerplant)

  • min_ramp and max_ramp: two vectors giving the maximum amount of power by which each generator can be adjusted to produce more / less. Typically, min_ramp = max_ramp = 0 for non controlable generators.


These elements are marked green because they are loaded by the backend, but strictly speaking they can be specified in other files than the one representing the powergrid.

Action space

At the time of writing, grid2op supports different types of actions:

  • change_line_status: that will change the line status (if it is disconnected this action will attempt to connect it). It lives in \left\{0,1\right\}^{\text{n\_line}}

  • set_line_status: that will set the line status to a particular state regardless of the previous state (+1 to attempt a forced reconnection of the powerline and -1 to attempt a forced disconnection). There is also a special case where the agent does not want to modify a given line, in which case it outputs “0”. It lives in \left\{-1, 0, 1\right\}^{\text{n\_line}}

  • * change_bus: that will, for each element of the grid, change the busbar to which it is connected (eg if it was connected on busbar 1 it will attempt to connect it on busbar 2). This lives in \left\{0,1\right\}^{\text{dim\_topo}}

  • set_bus: that will, for each element, control the busbar to which you want to assign it (1, 2, …, n_busbar_per_sub). Two special cases have been added: -1 means “disconnect” this element and 0 means “I don’t want to affect” this element. This part of the action space then lives in \left\{-1, 0, 1, 2, ..., \text{n\_busbar\_per\_sub} \right\}^{\text{dim\_topo}}

  • storage_p: the agent can choose the setpoint / target power for each storage unit. It lives in [\text{min\_storage\_p}, \text{max\_storage\_p}] \subset \mathbb{R}^{\text{n\_storage}}

  • curtail: corresponds to the action where the agent asks a generator (using renewable energy sources) to produce less than what would be possible given the current weather. This type of action can only be performed on renewable generators. It lives in [0, 1]^{\text{n\_gen}} (to avoid making the notations even more complex, we won’t define exactly the space of this action. Indeed, writing [0, 1]^{\text{n\_gen}} is not entirely true as a non renewable generator will not be affected by this type of action)

  • redisp: corresponds to the action where the agent is able to modify (to increase or decrease) the generator output values (asking some producers to produce more and some to produce less). It lives in [\text{min\_ramp}, \text{max\_ramp}] \subset \mathbb{R}^{\text{n\_gen}} (remember that for non controlable generators, by definition we suppose that min_ramp = max_ramp = 0)


* The change_bus type of action is only available in environments where n_busbar_per_sub = 2, otherwise it would not make sense. The action space does not include this type of action if n_busbar_per_sub != 2
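The difference between the “set” and the “change” way of modifying the topology can be illustrated on a plain vector of busbar ids (one per element). This is only a sketch of the semantics described above, not the grid2op implementation: the function names are hypothetical, and leaving a disconnected element untouched by change_bus is an assumption made for this example.

```python
def apply_set_bus(topo, set_bus):
    """0 leaves the element untouched, -1 disconnects it,
    k >= 1 assigns it to busbar k."""
    return [old if s == 0 else s for old, s in zip(topo, set_bus)]


def apply_change_bus(topo, change_bus):
    """Only meaningful with 2 busbars per substation: flips 1 <-> 2.
    Disconnected elements (-1) are left as they are (assumption)."""
    flip = {1: 2, 2: 1}
    return [flip.get(old, old) if c else old for old, c in zip(topo, change_bus)]


topo = [1, 1, 2, -1]  # current busbar of 4 elements (-1: disconnected)
print(apply_set_bus(topo, [0, 2, -1, 1]))    # [1, 2, -1, 1]
print(apply_change_bus(topo, [1, 0, 1, 1]))  # [2, 1, 1, -1]
```

With more than 2 busbars there is no unique “other” busbar to flip to, which is why change_bus is restricted to n_busbar_per_sub = 2.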

You might have noticed that every type of action is written in blue. This is because the action space can be defined at the creation of the environment, by specifying in the call to grid2op.make the action_class to be used.

Let’s call 1_{\text{change\_line\_status}} either \left\{0,1\right\}^{\text{n\_line}} (corresponding to the definition of change_line_status briefly described above) if change_line_status has been selected by the user (for the entire scenario) or \emptyset otherwise (and we do similarly for all other types of actions of course, for example: 1_{\text{redisp}} \in \left\{[\text{min\_ramp}, \text{max\_ramp}], \emptyset\right\})

Formally, the action space can then be defined as:

\mathcal{A}\text{space\_type} =&\left\{\text{change\_line\_status}, \text{set\_line\_status},  \right. \\
                               &~\left.\text{change\_bus}, \text{set\_bus}, \right.\\
                               &~\left.\text{storage\_p}, \text{curtail}, \text{redisp} \right\} \\
\mathcal{A} =&\Pi_{\text{a\_type} \in  \mathcal{A}\text{space\_type} } 1_{\text{a\_type}}\\


In the grid2op documentation, the words “topological modification” are often used. When that is the case, unless told otherwise it means set_bus or change_bus type of actions.

Extra references:

In the code, 1_{\text{change\_line\_status}} can be defined in the call to grid2op.make when the environment is created (and cannot be changed afterwards).

For example, if the user builds the environment like this:

import grid2op
from grid2op.Action import PlayableAction
env_name = ... # whatever, eg "l2rpn_case14_sandbox"
env = grid2op.make(env_name, action_class=PlayableAction)

Then all types of actions are selected and:

\mathcal{A} =& \left\{0,1\right\}^{\text{n\_line}}~ \times & \text{change\_line\_status} \\
             & \left\{-1, 0, 1\right\}^{\text{n\_line}}~ \times & \text{set\_line\_status} \\
             & \left\{0,1\right\}^{\text{dim\_topo}}~ \times & \text{change\_bus} \\
             & \left\{-1, 0, 1, 2, ..., \text{n\_busbar\_per\_sub} \right\}^{\text{dim\_topo}}~ \times & \text{set\_bus} \\
             & ~[\text{min\_storage\_p}, \text{max\_storage\_p}]~ \times & \text{storage\_p} \\
             & ~[0, 1]^{\text{n\_gen}} \times & \text{curtail} \\
             & ~[\text{min\_ramp}, \text{max\_ramp}] & \text{redisp}

You can also build the same environment like this:

import grid2op
from grid2op.Action import TopologySetAction
same_env_name = ... # whatever, eg "l2rpn_case14_sandbox"
env = grid2op.make(same_env_name, action_class=TopologySetAction)

This leads to the following action space, because the user asks to use only “topological actions” (including line status) with only the “set” way of modifying them.

\mathcal{A} =& \left\{-1, 0, 1\right\}^{\text{n\_line}}~ \times & \text{set\_line\_status} \\
             & \left\{-1, 0, 1, 2, ..., \text{n\_busbar\_per\_sub} \right\}^{\text{dim\_topo}}~ & \text{set\_bus} \\
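To get a feeling for the size of this product set, we can simply count its elements. The helper below is a hypothetical illustration; it counts every vector of the Cartesian product written above and does not try to exclude the actions that the rules of the environment would consider illegal or ambiguous. The value dim_topo = 57 corresponds to 2 × 20 + 6 + 11 + 0 for "l2rpn_case14_sandbox".

```python
def n_topology_set_vectors(n_line, dim_topo, n_busbar_per_sub):
    """Number of vectors in {-1,0,1}^n_line x {-1,0,1,...,n_busbar_per_sub}^dim_topo."""
    n_set_line_status = 3 ** n_line
    n_set_bus = (n_busbar_per_sub + 2) ** dim_topo
    return n_set_line_status * n_set_bus


size = n_topology_set_vectors(n_line=20, dim_topo=57, n_busbar_per_sub=2)
print(f"{size:.3e}")
```

Even on this small grid the set is far too large to be enumerated, which is one of the reasons why agents usually work on a reduced subset of actions.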

The page Action of the documentation provides you with all types of actions you can use in grid2op.


If you use the compatibility layer with the popular gymnasium (previously gym) package, you can also specify the action space with the “attr_to_keep” keyword argument.

State space

By default in grid2op, what is shown to the agent is the so-called “observation”. In this part of the documentation, we will describe something slightly different: the “state space” of the MDP.

The main difference is that this “state space” will include future data about the environment (eg the \mathcal{X} matrix). You can refer to the sections Partial Observatibility or Or not partial observatibility ? of this page of the documentation.


We found it easier to show the MDP without the introduction of the “observation kernel”, so keep in mind that this paragraph is not representative of the observation in grid2op but is “purely theoretical”.

The state space is defined by different types of attributes and we will not list them all here (you can find a detailed list of everything available to the agent in the Observation page of the documentation). The “state space” is then made of:

And, to make it “Markovian” we also need to include :

  • the (constant) values of \mathcal{S}_{\text{im}}^{(\text{in})} that are not “part of” \mathcal{X} (more information about that in the paragraph “Step 4: Call the simulator” of this documentation). This might include some physical parameters of some elements of the grid (like transformers or powerlines) or some other parameters of the solver controlling either the equations to be solved or the solver to use etc. *

  • the complete matrix \mathcal{X} which includes the exact knowledge of past, present and future loads and generation for the entire scenario (which is not possible in practice). The matrix itself is constant.

  • the index representing at which “step” of the matrix \mathcal{X} the current data are being used by the environment.


* grid2op is built to be “simulator agnostic” so all this part of the “state space” is not easily accessible through the grid2op API. To access (or to modify) it you need to be aware of the implementation of the grid2op.Backend.Backend you are using.


In this modeling, by design, the agent sees everything that will happen in the future, without uncertainties. To draw a parallel with a “maze” environment, the agent would see the full maze and its position at each step.

This is of course not fully representative of the daily powergrid operations, where the operators cannot see exactly the future. To make this modeling closer to reality, you can refer to the paragraphs Partial Observatibility and Or not partial observatibility ? below.

Transition Kernel

In this subsection we will describe the so-called transition kernel. This is the function that, given a state s and an action a, gives a probability distribution over all possible next states s' \in \mathcal{S}.

In this subsection, we chose to model this transition kernel as a deterministic function (which is equivalent to saying that the probability distribution over \mathcal{S} is a Dirac distribution).


The removal of the \mathcal{X} matrix from the “observation space” (see section Partial Observatibility), the rewriting of the MDP to stay in the “fully observable setting” (see section Or not partial observatibility ?) or the introduction of the “opponent” described in section Adversarial attacks are all things that “make” this “transition kernel” probabilistic. We chose simplicity by presenting it in a fully deterministic fashion.

So let’s write what the next state is given the current state s \in \mathcal{S} and the action of the agent a \in \mathcal{A}. To do that we split the computation in different steps explained below.


To be exhaustive, if the current state is s = s_{\emptyset} then s' = s_{\emptyset} is returned regardless of the action and the steps described below are skipped.

If the end of the episode is reached then s' = s_{\emptyset} is returned.

Step 2: load next environment values

This is also rather straightforward: the current index is updated (+1 is added) and this new index is used to find the “optimal” (from a market or a central authority perspective) value each producer produces to satisfy the demand of each consumer (in this case large cities or companies). This information is stored in the \mathcal{X} matrix.
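In code, this step boils down to an index increment followed by a lookup. The sketch below represents \mathcal{X} as a plain list of rows; the function name and the dictionary keys are hypothetical and only serve the illustration.

```python
def load_next_values(X, index):
    """Advance the index by one and return it with the matching row of X.

    Returns None as the row when the end of the scenario is reached.
    """
    new_index = index + 1
    if new_index >= len(X):
        return new_index, None  # no more data: the episode is over
    return new_index, X[new_index]


# A toy scenario with two steps, one generator and one load
X = [
    {"gen_p": [50.0], "load_p": [50.0]},
    {"gen_p": [55.0], "load_p": [55.0]},
]
index, x_t = load_next_values(X, 0)
print(index, x_t)
```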

Step 3: Compute the generators setpoints and handle storage units

The next step of the environment is to handle the “continuous” part of the action (eg “storage_p”, “curtail” or “redisp”) and to make sure a suitable setpoint can be reached for each generator (you can refer to the pages Storage units (optional) and Generators of this documentation for more information).

There are two alternatives:

  1. either the physical constraints cannot be met (there exists no feasible solution for at least one generator), and in this case the next state is the terminal state s_{\emptyset} (ignore all the steps below)

  2. or they can be met. In this case the “target generator values” are computed as well as the “target storage unit values”


There is a parameter called LIMIT_INFEASIBLE_CURTAILMENT_STORAGE_ACTION that tries, as best as possible, to avoid falling into infeasible solutions. It does so by limiting the amount of power that is curtailed or injected in the grid from the storage units: it modifies the action a.

Step 4: Call the simulator

At this stage then (assuming the physical constraints can be met), the setpoint for the following variables is known:

  • the status of the lines is deduced from the “change_line_status” and “set_line_status” parts of the action and from their status in s (the current state). Maintenance operations (or attacks, see section Adversarial attacks) can also disconnect powerlines.

  • the busbar to which each element is connected is also deduced from the “change_bus” and “set_bus” parts of the action

  • the consumption active and reactive values have been computed from the \mathcal{X} values at the previous step

  • the generator active values have just been computed after taking into account the redispatching, curtailment and storage (at this step)

  • the voltage setpoint for each generator is either read from \mathcal{X} or deduced from the above data by the “voltage controller” (more information on Voltage Controler)

All this should be part of the input solver data \mathcal{S}_{\text{im}}^{(\text{in})}. If not, then the solver cannot be used unfortunately…

With that (and the other data used by the solver and included in the space, see paragraph State space of this documentation), the necessary data is shaped (by the Backend) into a valid s_{\text{im}}^{(\text{in})} \in \mathcal{S}_{\text{im}}^{(\text{in})}.

The solver is then called and there are 2 alternatives (again):

  1. either the solver cannot find a feasible solution (it “diverges”), and in this case the next state is the terminal state s_{\emptyset} (ignore all the steps below)

  2. or a physical solution is found and the process carries on with the next steps

Step 5: Emulation of the “protections”

At this stage an object s_{\text{im}}^{(\text{out})} \in \mathcal{S}_{\text{im}}^{(\text{out})} has been computed by the solver.

The first step performed by grid2op is to look at the flows (in Amps) on the powerlines (these data are part of s_{\text{im}}^{(\text{out})}) and to check whether they meet some constraints defined in the parameters (mainly if for some powerline the flow is too high, or if it has been too high for too long, see HARD_OVERFLOW_THRESHOLD, NB_TIMESTEP_OVERFLOW_ALLOWED and NO_OVERFLOW_DISCONNECTION). If some powerlines are disconnected at this step, then the “setpoint” sent to the backend at the previous step is modified and the process goes back to Step 4: Call the simulator.


The simulator may already handle a real simulation of these “protections”. This “outer loop” exists because some simulators do not.


For the purist, this “outer loop” necessarily terminates. It is triggered only when at least one powerline needs to be disconnected, and there is a finite number (n_line) of powerlines.
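The “outer loop” described above can be sketched as follows. Everything here is a simplified assumption made for the illustration: the function names, the toy simulator, the default values of the two thresholds and the exact comparison used to decide that a line has been in overflow “for too long” may all differ from the actual grid2op implementation.

```python
def emulate_protections(line_status, timestep_overflow, thermal_limit, simulate,
                        hard_overflow_threshold=2.0,
                        nb_timestep_overflow_allowed=2):
    """Call the simulator until no more powerline needs to be disconnected.

    Returns (status, flows), or (None, None) if the simulator diverges.
    """
    status = list(line_status)
    while True:
        flows = simulate(status)
        if flows is None:  # divergence: the next state is the terminal state
            return None, None
        to_cut = [
            i for i, connected in enumerate(status)
            if connected and (
                # flow much too high: immediate disconnection
                flows[i] > hard_overflow_threshold * thermal_limit[i]
                # flow too high for too long
                or (flows[i] > thermal_limit[i]
                    and timestep_overflow[i] >= nb_timestep_overflow_allowed)
            )
        ]
        if not to_cut:
            return status, flows
        for i in to_cut:  # at least one line is removed per iteration,
            status[i] = False  # so there are at most n_line iterations


def toy_simulate(status):
    """Hypothetical simulator: 300 A shared equally by the connected lines."""
    n_connected = sum(status)
    if n_connected == 0:
        return None
    return [300.0 / n_connected if on else 0.0 for on in status]


status, flows = emulate_protections(
    line_status=[True, True, True],
    timestep_overflow=[0, 0, 0],
    thermal_limit=[40.0, 200.0, 200.0],
    simulate=toy_simulate,
)
print(status, flows)  # [False, True, True] [0.0, 150.0, 150.0]
```

In this example the first line carries 100 A for a limit of 40 A, so it is disconnected at the first pass and the simulator is called a second time on the modified grid.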

Step 6: Reading back the “grid dependent” attributes

At this stage an object s_{\text{im}}^{(\text{out})} \in \mathcal{S}_{\text{im}}^{(\text{out})} has been computed by the solver and all the “rules” / “parameters” regarding powerlines are met.

As discussed in the section about “state space” (see State space for more information), the next state s' includes some part of the outcome of the solver. These data are then read from s_{\text{im}}^{(\text{out})}, which includes but is not limited to the loads active values load_p, loads reactive values load_q, voltage magnitude at each load load_v, the same kind of attributes for generators gen_p, gen_q, gen_v, gen_theta and also for powerlines p_or, q_or, v_or, a_or, theta_or, p_ex, q_ex, v_ex, a_ex, theta_ex, rho etc.

Step 7: update the other attributes of the state space

Finally, the environment takes care of updating all the other “parts” of the state space, which are:

As for the attributes related to the rules of the game, they are updated in the following way:

  • timestep_overflow is set to 0 for all powerlines not in overflow and increased by 1 for all the others

  • time_before_cooldown_line is reduced by 1 for all lines that have not been impacted by the action a, otherwise it is set to param.NB_TIMESTEP_COOLDOWN_LINE

  • time_before_cooldown_sub is reduced by 1 for all substations that have not been impacted by the action a, otherwise it is set to param.NB_TIMESTEP_COOLDOWN_SUB
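The update rules listed above can be sketched as below for the powerlines (the substations follow the same logic). The function name is hypothetical, and preventing the cooldown counter from going below 0 is an assumption made for this sketch.

```python
def update_rule_attributes(timestep_overflow, in_overflow,
                           cooldown_line, line_impacted,
                           nb_timestep_cooldown_line):
    """Return the updated (timestep_overflow, time_before_cooldown_line)."""
    new_overflow = [
        count + 1 if over else 0
        for count, over in zip(timestep_overflow, in_overflow)
    ]
    new_cooldown = [
        # impacted by the action: the counter restarts,
        # otherwise it decreases (never below 0, assumption)
        nb_timestep_cooldown_line if hit else max(cooldown - 1, 0)
        for cooldown, hit in zip(cooldown_line, line_impacted)
    ]
    return new_overflow, new_cooldown


new_overflow, new_cooldown = update_rule_attributes(
    timestep_overflow=[0, 1, 2],
    in_overflow=[False, True, False],
    cooldown_line=[0, 3, 1],
    line_impacted=[True, False, False],
    nb_timestep_cooldown_line=3,
)
print(new_overflow, new_cooldown)  # [0, 2, 0] [3, 2, 0]
```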

The new state s' is then passed to the agent.


Remember that this process might have terminated before reaching the last step described above, for example at Step 3: Compute the generators setpoints and handle storage units, at Step 4: Call the simulator or during the emulation of the protections described at Step 5: Emulation of the “protections”.

Reward Kernel

And to finish this (rather long) description of grid2op’s MDP we need to mention the “reward kernel”.

This “kernel” computes the reward associated with taking the action a in state s that leads to state s'. In most cases, the reward in grid2op is a deterministic function and depends only on the grid state.

In grid2op, every environment comes with a pre-defined reward function that can be fully customized by the user when the environment is created or even afterwards (but is still constant during an entire episode of course).

For more information, you might want to have a look at the Reward page of this documentation.


In this last section of this page of the documentation, we dive deeper into some aspects of the grid2op MDP.


TODO: This part of the section is still an ongoing work.

Let us know if you want to contribute !

Partial Observatibility

This is the case in most grid2op environments: only some parts of the environment state s_t at time t are given to the agent in the observation o_t at time t.

Mathematically this can be modeled with the introduction of an “observation space” and an “observation kernel”. This kernel will only expose part of the “state space” to the agent and (in grid2op) is a deterministic function that depends on the environment state s'.

More specifically, in most grid2op environments (by default at least), none of the physical parameters of the solvers are provided. Also, to better represent the daily operation of power systems, only the t-th row \mathcal{X}_t of the matrix \mathcal{X} is given in the observation o_t. The components \mathcal{X}_{t', i} (for all t' > t) are not given. The observation kernel in grid2op will mask out some part of the “environment state” to the agent.
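A deterministic observation kernel of this kind is simply a function that copies part of the state. In the sketch below the state is a plain dictionary whose keys (`index`, `X`, `grid`, `solver_params`) are hypothetical names chosen for the illustration; the real grid2op observation contains many more attributes.

```python
def observe(state):
    """Observation kernel (sketch): hide the solver parameters and
    every row of X except the current one."""
    t = state["index"]
    return {
        "index": t,
        "grid": state["grid"],  # outcome of the solver, visible to the agent
        "x_t": state["X"][t],   # only the current row of X is exposed
    }


state = {
    "index": 1,
    "X": [{"load_p": [50.0]}, {"load_p": [55.0]}, {"load_p": [60.0]}],
    "grid": {"rho": [0.4, 0.7]},
    "solver_params": {"max_iter": 10},
}
obs = observe(state)
print(obs)
```

The future row of \mathcal{X} (here the one with a load of 60 MW) and the solver parameters are part of the state but never reach the agent.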

Or not partial observatibility ?

If we consider that the agent is aware of the simulator used and of all its “constant” parts (see paragraph State space) of \mathcal{S}_{\text{im}}^{(\text{in})} (the parts of the simulator that are affected neither by the actions of the agent nor by the environment) then we can model the grid2op MDP without the need to use an observation kernel: it can be a regular MDP.

To “remove” the need for partial observability without supposing that the agent sees all the future, we can slightly adapt the modeling, which allows us to completely remove the \mathcal{X} matrix:

  • the observation space / state space (which are equal in this setting) are the same as the ones used in Partial Observatibility

  • the transition kernel is now stochastic. Indeed, the “next” values of the loads and generators are, in this modeling, not read from a \mathcal{X} matrix but sampled from a given distribution, which replaces Step 2: load next environment values of subsection Transition Kernel. Once the values of these variables are sampled, the rest of the steps described there are unchanged.


The above holds as long as there exists a way to sample new values for gen_p, load_p, gen_v and load_q that is Markovian. We suppose it exists here and will not write it down.


Sampling from these distributions can be quite challenging and will not be covered here.

One of the challenging parts is that the sampled generations need to meet the demand (and the losses) as well as all the constraints on the generators (p_min, p_max and ramps).

Adversarial attacks

TODO: explain the model of the environment

Forecast and simulation on future states

TODO : explain the model the forecast and the fact that the “observation” also includes a model of the world that can be different from the grid of the environment

Simulator dynamics can be more complex

TODO, Backend does not need to “exactly map the simulator” there are some examples below:

Hide elements from the grid2op environment

TODO only a part of the grid would be “exposed” in the grid2op environment.

Contain elements not modeled by grid2op

TODO: speak about HVDC or “pq” generators, or 3 winding transformers

Contain embedded controls

TODO for example automatic setpoint for HVDC or limit on Q for generators

Time domain simulation

TODO: we can plug in simulator that solves more accurate description of the grid and only “subsample” (eg at a frequency of every 5 mins) provide grid2op with some information.

Handle the topology differently

Backend can operate switches, only requirement from grid2op is to map the topology to switches.

Some constraints


Operator attention: alarm and alert


Still trouble finding the information ? Do not hesitate to send a github issue about the documentation at this link: Documentation issue template