Optimize the data pipeline
This page is organized as follows:
- Objectives
- Chunk size
- MultifolderWithCache
Objectives
Optimizing the data pipeline can be crucial if you want to learn fast, especially at the beginning of training. There exist multiple ways to perform this task.
First, let's start with a summary of the timings. For this test, I ran the following code on my personal computer to compare the different methods.
import time
import re
import grid2op
from grid2op.Chronics import MultifolderWithCache

##############################
# this part changes depending on the method
env = grid2op.make("l2rpn_neurips_2020_track1_small")
env.chronics_handler.set_filter(lambda path: re.match(".*37.*", path) is not None)
kept = env.chronics_handler.reset()  # if you don't do that it will not have any effect
##############################

episode_count = 100
reward = 0
done = False
total_reward = 0

# only the time of the following loop is measured
start = time.perf_counter()
for i in range(episode_count):
    ob = env.reset()
    if i % 10 == 0:
        print("10 more")
    while True:
        action = env.action_space.sample()
        ob, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            # in this case the episode is over
            break
print(f"Time to perform {episode_count} episodes: {time.perf_counter() - start:.1f}s")
Results are reported in the table below:
Method used | Memory footprint | Time to perform (s)
---|---|---
Nothing (see `Basic Usage`_) | low | 44.6
set_chunk (see Chunk size) | ultra low | 26.8
MultifolderWithCache (see MultifolderWithCache) | high | 11.0
As you can see, the default usage uses relatively little memory but takes a while to compute (almost 45s to perform the 100 episodes). The chunk size method uses even less memory and is about 40% faster. Storing all the data in memory with MultifolderWithCache leads to a large memory footprint, but is also significantly faster: on this benchmark it is 75% faster (it takes only 25% of the initial time) than the default method.
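If you want to reproduce this comparison yourself, here is a minimal sketch of how the three setups can be timed side by side. The benchmark helper and the setup_* functions are hypothetical names introduced for illustration; only the grid2op calls inside them come from the examples on this page.
import time
import re
import grid2op
from grid2op.Chronics import MultifolderWithCache

def benchmark(setup_env, episode_count=100):
    """Hypothetical helper: time `episode_count` random episodes on a fresh env."""
    env = setup_env()
    start = time.perf_counter()
    for _ in range(episode_count):
        ob = env.reset()
        while True:
            action = env.action_space.sample()
            ob, reward, done, info = env.step(action)
            if done:
                break
    return time.perf_counter() - start

def setup_default():
    # "Nothing" method: plain environment
    return grid2op.make("l2rpn_neurips_2020_track1_small")

def setup_chunk():
    # "set_chunk" method: read the data 100 time steps at a time
    env = grid2op.make("l2rpn_neurips_2020_track1_small")
    env.chronics_handler.set_chunk_size(100)
    return env

def setup_cache():
    # "MultifolderWithCache" method: load the (filtered) data into memory once
    env = grid2op.make("l2rpn_neurips_2020_track1_small",
                       chronics_class=MultifolderWithCache)
    env.chronics_handler.set_filter(lambda path: re.match(".*37.*", path) is not None)
    env.chronics_handler.reset()  # actually load the cache
    return env

for name, setup in [("nothing", setup_default),
                    ("set_chunk", setup_chunk),
                    ("MultifolderWithCache", setup_cache)]:
    print(f"{name}: {benchmark(setup):.1f}s")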
Chunk size
The first thing you can do, without changing anything else in the code, is to ask grid2op to read the input grid data by “chunk”. This means that when you call env.reset(), instead of reading all the data representing a full month, you will read only a subset of it, speeding up the I/O by a large amount. In the following example we read data by “chunk” of 100: the hard drive is accessed to read data 100 time steps at a time, instead of reading the full dataset at once. Note that this technique can also be used to reduce the memory footprint (less RAM taken).
import grid2op
from grid2op.Agent import RandomAgent
env = grid2op.make("l2rpn_case14_sandbox")
agent = RandomAgent(env.action_space)
env.seed(0)  # for reproducible experiments

###################################
env.chronics_handler.set_chunk_size(100)
###################################

episode_count = 10000  # I want to run many episodes
# I initialize some useful variables
reward = 0
done = False
total_reward = 0

# and now the loop starts
for i in range(episode_count):
    ob = env.reset()
    # now play the episode as usual
    while True:
        action = agent.act(ob, reward, done)
        ob, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            # in this case the episode is over
            break
(As always, the lines added compared to the base code are highlighted: they are surrounded with #####.)
Note
Not all environments support “chunk size”. For example, if data are generated “on the fly”, you are currently forced to generate an entire episode; you cannot generate it “piece by piece”.
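If you are unsure whether a given environment supports chunked reading, a defensive guard such as the following sketch can help. Whether (and how) set_chunk_size fails on an unsupported backend is an assumption here, so the sketch deliberately catches a generic Exception:
import grid2op

env = grid2op.make("l2rpn_case14_sandbox")
try:
    # chunked reading may not be available for "on the fly" chronics backends
    env.chronics_handler.set_chunk_size(100)
except Exception as exc:  # assumption: the backend rejects chunked reading this way
    print(f"Chunked reading not supported here: {exc}")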
MultifolderWithCache
Another way is to use a dedicated class that stores the data in memory. This is particularly useful to avoid long and inefficient I/O: the complete dataset is read once and stored in memory.
See also
The documentation of grid2op.Chronics.MultifolderWithCache for more details.
Changed in version 1.9.0: Any call to env.reset() or env.step() without a previous call to env.chronics_handler.real_data.reset() will raise an error, preventing any use of the environment. (It is no longer assumed that people have read, at least partially, the documentation.)
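To make this concrete, here is a minimal sketch of that failure mode. The exact exception class is not stated on this page, so the sketch deliberately catches a generic Exception:
import grid2op
from grid2op.Chronics import MultifolderWithCache

env = grid2op.make("l2rpn_case14_sandbox", chronics_class=MultifolderWithCache)
try:
    env.reset()  # the cache was never loaded: no env.chronics_handler.real_data.reset()
except Exception as exc:  # the exact exception class is an assumption, kept generic here
    print(f"As of grid2op 1.9.0 this raises: {exc}")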
Danger
When you create an environment with this chronics class (e.g. by doing env = make(…, chronics_class=MultifolderWithCache)), the “cache” is not preloaded: only the first scenario is loaded into memory (to save loading time).
In order to load everything, you NEED to call env.chronics_handler.reset(), which, by default, will load every scenario into memory. If you want to filter some data, for example by reading only the scenarios of December, you can use the set_filter method.
A typical workflow (at the start of your program) when using this class is then:
1. create the environment: env = make(…, chronics_class=MultifolderWithCache)
2. (optional but recommended) select some scenarios: env.chronics_handler.real_data.set_filter(lambda x: re.match(".*december.*", x) is not None)
3. load the data in memory: env.chronics_handler.reset() (see e.g. grid2op.Chronics.MultifolderWithCache.reset())
4. do whatever you want using env
This can be achieved with:
import re
import grid2op
from grid2op.Agent import RandomAgent
from grid2op.Chronics import MultifolderWithCache

###################################
env = grid2op.make("l2rpn_case14_sandbox", chronics_class=MultifolderWithCache)
# I select only part of the data: it's unlikely the whole dataset can fit into memory...
env.chronics_handler.set_filter(lambda path: re.match(".*00[0-9].*", path) is not None)
# you need to do that to actually load the cache
kept = env.chronics_handler.real_data.reset()
###################################

agent = RandomAgent(env.action_space)
env.seed(0)  # for reproducible experiments

episode_count = 10000  # I want to run many episodes
# I initialize some useful variables
reward = 0
done = False
total_reward = 0

# and now the loop starts
# it will only use the selected chronics
for i in range(episode_count):
    ob = env.reset()
    # now play the episode as usual
    while True:
        action = agent.act(ob, reward, done)
        ob, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            # in this case the episode is over
            break
(As always, the lines added compared to the base code are highlighted: they are surrounded with #####.)
Note that by default the MultifolderWithCache class will only load the first chronics it sees. You need to set a filter (if you want one) and call env.chronics_handler.real_data.reset() for the cache to actually be loaded.
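If the set of scenarios you want cached changes during a long experiment, the same two calls can presumably be repeated. This sketch, which continues from the example above, assumes that calling reset() again rebuilds the cache according to the current filter and that it returns the list of kept scenarios (as the examples on this page suggest); the "january" pattern is a hypothetical example.
import re

# hypothetical: keep only the scenarios whose path contains "january"
env.chronics_handler.real_data.set_filter(
    lambda path: re.match(".*january.*", path) is not None
)
# assumption: reset() reloads the cache with the new filter and returns the kept scenarios
kept = env.chronics_handler.real_data.reset()
print(f"{len(kept)} scenarios now cached")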