.. _environment-module-data-pipeline:

Optimize the data pipeline
============================

This page is organized as follows:

.. contents:: Table of Contents
    :depth: 3

Objectives
--------------------------

Optimizing the data pipeline can be crucial if you want to learn fast, especially at the beginning of the training.
There are multiple ways to do this.

First, let's start with a summary of the timings. For this test, I ran the following code on my personal computer
to compare the different methods.

.. code-block:: python

    import re
    import time
    import grid2op
    from grid2op.Chronics import MultifolderWithCache

    ##############################
    # this part changes depending on the method
    env = grid2op.make("l2rpn_neurips_2020_track1_small")
    env.chronics_handler.set_filter(lambda path: re.match(".*37.*", path) is not None)
    kept = env.chronics_handler.reset()  # if you don't do that it will not have any effect
    ##############################

    episode_count = 100
    reward = 0
    done = False
    total_reward = 0

    # only the time of the following loop is measured
    # (the reported run used the `%%time` notebook magic for this)
    start = time.perf_counter()
    for i in range(episode_count):
        ob = env.reset()
        if i % 10 == 0:
            print("10 more")
        while True:
            action = env.action_space.sample()
            ob, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                # in this case the episode is over
                break
    print(f"elapsed: {time.perf_counter() - start:.1f}s")

Results are reported in the table below:

====================================  ================  ===================
Method used                           memory footprint  time to perform (s)
====================================  ================  ===================
Nothing (see Basic Usage)             low               44.6
set_chunk_size (see `Chunk size`_)    ultra low         26.8
`MultifolderWithCache`_               high              11.0
====================================  ================  ===================

As you can see, the default usage uses relatively little memory but takes a while to compute (almost 45s to perform
the 100 episodes). By contrast, the `Chunk size`_ method uses less memory and takes about 40% less time.
Storing all data in memory using the `MultifolderWithCache`_ leads to a large memory footprint, but is also
significantly faster. On this benchmark, it takes about 75% less time than the default method (only 25% of the
initial time).

Chunk size
+++++++++++

The first thing you can do, without changing anything else in the code, is to ask grid2op to read the input grid data
by "chunk". This means that, when you call `env.reset()`, instead of reading all the data representing a full month,
grid2op reads only a subset of it, which speeds up the I/O considerably.

In the following example the data are read by "chunks" of 100: the hard drive is accessed to read the data 100 time
steps at a time, instead of reading the full dataset at once. Note that this technique can also be used to reduce
the memory footprint (less RAM taken).

.. code-block:: python

    import numpy as np
    import re
    import grid2op
    from grid2op.Agent import RandomAgent
    env = grid2op.make("l2rpn_case14_sandbox")
    agent = RandomAgent(env.action_space)
    env.seed(0)  # for reproducible experiments

    ###################################
    env.chronics_handler.set_chunk_size(100)
    ###################################

    episode_count = 10000  # I want to run lots of episodes

    # I initialize some useful variables
    reward = 0
    done = False
    total_reward = 0

    # and now the loop starts
    for i in range(episode_count):
        ob = env.reset()

        # now play the episode as usual
        while True:
            action = agent.act(ob, reward, done)
            ob, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                # in this case the episode is over
                break

(as always, the lines added compared to the base code are highlighted: they are surrounded by `#####`)

.. note::
    Not all environments support "chunk size". For example, if the data are generated "on the fly", for now an
    entire episode has to be generated at once: it cannot be generated "piece by piece".
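
To make the idea of reading "by chunk" concrete outside of grid2op, here is a minimal sketch using only the Python
standard library. It is not how grid2op implements `set_chunk_size`: the helper `read_in_chunks`, the column names
and the in-memory fake file are all hypothetical and only serve the illustration. It shows that only `chunk_size`
rows are held in memory at any time.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` data rows from a CSV file handle."""
    reader = csv.reader(handle, delimiter=";")
    next(reader)  # skip the header line, it is not a time step
    while True:
        # islice pulls at most `chunk_size` rows without reading the rest of the file
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# hypothetical data: 250 fake time steps with two columns, kept in memory for the demo
fake_file = io.StringIO(
    "load_p;prod_p\n" + "".join(f"{t};{2 * t}\n" for t in range(250))
)
chunk_lengths = [len(chunk) for chunk in read_in_chunks(fake_file, chunk_size=100)]
print(chunk_lengths)
```

With 250 time steps and a chunk size of 100, the file is consumed in three pieces, the last one being shorter
than the others.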
MultifolderWithCache
+++++++++++++++++++++

Another way is to use a dedicated class that stores the data in memory. This is particularly useful to avoid long and
inefficient I/O: the complete dataset is read once and then kept in memory.

.. seealso::
    The documentation of :class:`grid2op.Chronics.MultifolderWithCache` for more details.

.. versionchanged:: 1.9.0
    Any call to `env.reset()` or `env.step()` without a previous call to `env.chronics_handler.real_data.reset()`
    will raise an error preventing any use of the environment. (It is no longer assumed people read, at least
    partially, the documentation.)

.. danger::
    When you create an environment with this chronics class (*eg* by doing
    `env = make(...,chronics_class=MultifolderWithCache)`), the "cache" is not preloaded: only the first scenario
    is loaded in memory (to save loading time).

    In order to load everything, you NEED to call `env.chronics_handler.reset()`, which, by default,
    will load every scenario into memory. If you want to filter some data, for example by reading only the
    scenarios of December, you can use the `set_filter` method.

    A typical workflow (at the start of your program) when using this class is then:

    1) create the environment: `env = make(...,chronics_class=MultifolderWithCache)`
    2) (optional but recommended) select some scenarios:
       `env.chronics_handler.real_data.set_filter(lambda x: re.match(".*december.*", x) is not None)`
    3) load the data in memory: `env.chronics_handler.reset()`
       (see *eg* :func:`grid2op.Chronics.MultifolderWithCache.reset`)
    4) do whatever you want using `env`

This can be achieved with:

.. code-block:: python

    import numpy as np
    import re
    import grid2op
    from grid2op.Agent import RandomAgent
    from grid2op.Chronics import MultifolderWithCache

    ###################################
    env = grid2op.make(chronics_class=MultifolderWithCache)
    # I select only part of the data, it's unlikely the whole dataset can fit into memory...
    env.chronics_handler.set_filter(lambda path: re.match(".*00[0-9].*", path) is not None)
    # you need to do that
    kept = env.chronics_handler.real_data.reset()
    ###################################

    agent = RandomAgent(env.action_space)
    env.seed(0)  # for reproducible experiments

    episode_count = 10000  # I want to run lots of episodes

    # I initialize some useful variables
    reward = 0
    done = False
    total_reward = 0

    # and now the loop starts
    # only the selected chronics will be used
    for i in range(episode_count):
        ob = env.reset()

        # now play the episode as usual
        while True:
            action = agent.act(ob, reward, done)
            ob, reward, done, info = env.step(action)
            total_reward += reward
            if done:
                # in this case the episode is over
                break

(as always, the lines added compared to the base code are highlighted: they are surrounded by `#####`)

Note that by default the `MultifolderWithCache` class will only load the **first** chronics it sees. You need to
set a filter and then call `env.chronics_handler.real_data.reset()` for it to work properly.
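
The reason why the filter has no effect until `reset()` is called can be illustrated with a toy model. The sketch
below is not grid2op code: the class `ToyScenarioCache`, its `loaded` attribute and the scenario names are
hypothetical, and it only mimics the behaviour described above (first scenario loaded at creation, filter recorded
by `set_filter`, cache actually filled by `reset`).

```python
import re


class ToyScenarioCache:
    """Toy stand-in (NOT grid2op code) for a cache that is only filled by reset()."""

    def __init__(self, paths, loader):
        self._paths = sorted(paths)
        self._loader = loader
        self._filter = lambda path: True  # by default every scenario is kept
        # only the first scenario is loaded at creation, to keep start-up cheap
        self.loaded = {self._paths[0]: loader(self._paths[0])}

    def set_filter(self, filter_fun):
        # the filter is only recorded here: nothing is loaded or dropped yet
        self._filter = filter_fun

    def reset(self):
        # the filter is applied now, and every kept scenario is loaded in memory
        kept = [path for path in self._paths if self._filter(path)]
        self.loaded = {path: self._loader(path) for path in kept}
        return kept


# hypothetical scenario names and a fake loader returning a string
paths = ["april_0", "december_0", "december_1", "july_0"]
cache = ToyScenarioCache(paths, loader=lambda path: f"data of {path}")
cache.set_filter(lambda path: re.match(".*december.*", path) is not None)
before = sorted(cache.loaded)  # the filter has not been applied yet
kept = cache.reset()
print(before, kept)
```

In this toy model, `before` still contains only the first scenario even though a filter has been set, and the
selected scenarios are available only after the call to `reset()`.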