commonpower.control.runners.MAPPOTrainer

class MAPPOTrainer(sys: ~commonpower.core.System, alg_config: ~commonpower.control.configs.algorithms.MAPPOBaseConfig, global_controller: ~commonpower.control.controllers.OptimalController = <commonpower.control.controllers.OptimalController object>, wrapper: ~gymnasium.core.Wrapper | None = None, logger: ~commonpower.control.logging_utils.loggers.BaseLogger | None = None, horizon: ~datetime.timedelta = datetime.timedelta(days=1), episode_length: int = 24, dt: ~datetime.timedelta = datetime.timedelta(seconds=3600), continuous_control: bool = False, history: ~commonpower.modeling.history.ModelHistory | None = None, solver: ~pyomo.opt.base.solvers.OptSolver = <pyomo.solvers.plugins.solvers.gurobi_direct.GurobiDirect object>, save_path: str = './saved_models/test_model', seed: int | None = None, normalize_actions: bool = True, limited_date_range: ~typing.List[~datetime.datetime] | None = None)[source]

Bases: BaseTrainer

Runner for training multiple heterogeneous agents with MAPPO/IPPO from the on-policy repository (https://github.com/marlbenchmark/on-policy/tree/main/onpolicy). Based on our BaseTrainer and our logging framework, as well as the BaseRunner from the on-policy repository.

Parameters:
  • sys (System) – power system to be controlled

  • global_controller (OptimalController) – instance of the controller that takes over control of all nodes that have not yet been assigned a controller. Mostly used to balance the system using a market node or a generator. Defaults to OptimalController(“global”).

  • alg_config (MAPPOBaseConfig) – configuration for the RL algorithm and policy to be trained

  • wrapper (gym.Wrapper) – wrapper for the environment that handles the RL agents during training (used for example for single-agent RL control).

  • logger (BaseLogger) – object for handling training logs

  • horizon (timedelta) – amount of time that the controller looks into the future

  • episode_length (int) – number of time steps to simulate before the system is reset during RL training if continuous_control=False

  • dt (timedelta) – control time interval

  • continuous_control (bool) – whether to use an infinite control horizon

  • history (ModelHistory) – object used for logging the model history

  • solver (OptSolver) – solver for optimization problem

  • save_path (str) – local path to folder in which the trained policy will be stored (as .zip file) after the training is finished

  • seed (int) – seed for the global random number generator of numpy (we use np.random.seed(seed) instead of instantiating our own generator)

  • normalize_actions (bool) – whether or not to normalize the action space

  • limited_date_range (list) – limits the system’s date range such that we only train over a specific interval
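The default values of horizon, dt, and episode_length fit together if an episode is meant to cover exactly one horizon. The sketch below only illustrates that arithmetic with the standard library; the helper steps_in is hypothetical, and the source does not state that episode_length must equal horizon / dt.

```python
from datetime import timedelta

# Default values taken from the class signature above.
horizon = timedelta(days=1)
dt = timedelta(seconds=3600)


def steps_in(span: timedelta, step: timedelta) -> int:
    """Number of control steps of length `step` that fit into `span`.

    Hypothetical helper for illustration; not part of commonpower.
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    if span % step != timedelta(0):
        raise ValueError("span is not an integer multiple of step")
    return span // step


# One day at an hourly control interval gives 24 steps,
# which matches the default episode_length of 24.
steps_per_horizon = steps_in(horizon, dt)
```

With a 15-minute control interval, a six-hour span would also yield 24 steps, so episode_length would have to be adjusted whenever dt changes if episodes should keep covering the same time span.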

Methods

collect

Obtain actions for the current step based on current policies, observations, shared observations, and hidden states.

compute

Compute returns based on next value (will be needed for loss)

eval

Evaluates current policies on a separate eval environment (currently not used).

finish_run

Finish run, mostly needed for deleting global controller and terminating Weights&Biases logger

insert

Write information collected during rollout to buffers (one per agent) in the appropriate format (the "SeparatedReplayBuffer" from the on-policy repository logs some information we do not require, like masks for terminated agents).

log_train

prepare_run

In addition to the preparation in BaseRunner, we also instantiate an environment function as an API for the RL training.

render

run

Simulates the scenario for a given number of time steps.

save

Save the actor and critic parameters for each agent

set_start_time

Set start time from external.

system_feasible

Check whether the current system set-up is feasible.

train

Perform updates of actor and critic parameters for each agent

warmup

Pre-training preparations specific to MAPPO/IPPO

_check_alg_config()[source]

Sanity check for the algorithm configuration: If we use any variant of MAPPO, we want a shared observation space which means that use_centralized_V has to be true. If we use a recurrent policy (RMAPPO), the respective arguments have to be true.

Returns:

None
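The two rules described above can be sketched as follows. ToyAlgConfig and check_alg_config are hypothetical stand-ins, not the actual MAPPOBaseConfig or method body; the field names algorithm_name and use_recurrent_policy are borrowed from the on-policy repository's argument names and are assumed here, while use_centralized_V is the flag named in the description.

```python
from dataclasses import dataclass


@dataclass
class ToyAlgConfig:
    """Hypothetical stand-in for the algorithm configuration."""
    algorithm_name: str = "mappo"      # assumed values: "mappo", "rmappo", "ippo"
    use_centralized_V: bool = True     # shared observations for the critic
    use_recurrent_policy: bool = False


def check_alg_config(cfg: ToyAlgConfig) -> None:
    """Raise ValueError if the configuration contradicts the chosen algorithm."""
    name = cfg.algorithm_name.lower()
    # Any MAPPO variant needs a centralized value function.
    if "mappo" in name and not cfg.use_centralized_V:
        raise ValueError("MAPPO variants require use_centralized_V=True")
    # The recurrent variant needs the recurrent-policy flag to be set.
    if name == "rmappo" and not cfg.use_recurrent_policy:
        raise ValueError("RMAPPO requires use_recurrent_policy=True")
```

Raising an exception is a design choice of this sketch; the real method may warn or correct the configuration instead.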

_parse_alg_config()[source]

Write algorithm configuration to class attributes

Returns:

None

_run(n_steps: int = 24)[source]

Runs the multi-agent RL training algorithm (MAPPO or IPPO) for a given number of time steps and saves the trained policies.

Returns:

None
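To show how the public methods documented on this page might fit together inside a training run, here is a minimal sketch of the call order. It is modeled on the usual structure of runners in the on-policy repository and is an assumption, not the actual body of _run; ToyRunner and toy_run are hypothetical and only record which method is called when.

```python
from typing import List


class ToyRunner:
    """Hypothetical stand-in that records method calls; not MAPPOTrainer."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def warmup(self) -> None:
        self.calls.append("warmup")

    def collect(self, step: int) -> int:
        self.calls.append(f"collect({step})")
        return step  # placeholder for values, actions, log-probs, ...

    def insert(self, data: int) -> None:
        self.calls.append("insert")

    def compute(self) -> None:
        self.calls.append("compute")

    def train(self) -> List[dict]:
        self.calls.append("train")
        return [{}]

    def save(self) -> None:
        self.calls.append("save")


def toy_run(runner: ToyRunner, n_steps: int, episode_length: int) -> List[str]:
    """Assumed loop: roll out an episode, compute returns, update, repeat."""
    runner.warmup()
    for _ in range(n_steps // episode_length):
        for step in range(episode_length):
            data = runner.collect(step)
            # A real trainer would step the environment here.
            runner.insert(data)
        runner.compute()
        runner.train()
    runner.save()
    return runner.calls
```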

_set_device()[source]

Set computing device according to algorithm configuration

Returns:

None

collect(step: int) → Tuple[array, List[array], List[array], array, array, List[array]][source]

Obtain actions for the current step based on current policies, observations, shared observations, and hidden states. The masks are not necessary in our case, because all agents terminate at the same time.

Parameters:

step (int) – The current step within the episode

Returns:

Tuple

tuple containing:
  • values (np.array)

  • actions (List[np.array])

  • action probabilities, logarithmic (List[np.array])

  • hidden states of recurrent NN actor. Only needed for recurrent policies (np.array)

  • hidden states of recurrent NN critic. Only needed for recurrent policies (np.array)

  • environment actions; their exact purpose is unclear, as this was adapted from the on-policy BaseRunner (List[np.array])

compute()[source]

Compute returns based on next value (will be needed for loss)

Returns:

None
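As an illustration of what "returns based on next value" means, the sketch below computes bootstrapped discounted returns for a single agent by working backwards from the critic's estimate of the state after the last step. The function name and the discount factor gamma are assumptions for this example; the on-policy repository can also use generalized advantage estimation, which is not shown here.

```python
from typing import List


def bootstrapped_returns(
    rewards: List[float], next_value: float, gamma: float = 0.99
) -> List[float]:
    """Discounted returns R_t = r_t + gamma * R_{t+1}, seeded with next_value.

    `next_value` is the value estimate for the state that follows the
    final reward in the rollout.
    """
    returns = [0.0] * len(rewards)
    running = next_value
    # Walk backwards so each return can reuse the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, with rewards [1.0, 1.0], a next value of 0.0, and gamma = 0.5, the returns are [1.5, 1.0].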

eval(total_num_steps: int)[source]

Evaluates current policies on a separate eval environment (currently not used).

Parameters:

total_num_steps (int) – Current training progress

Returns:

None

finish_run()[source]

Finish run, mostly needed for deleting global controller and terminating Weights&Biases logger

Returns:

None

insert(data: Tuple[array, array, array, array, array, List[array], List[array], array, array]) → None[source]

Write information collected during rollout to buffers (one per agent) in the appropriate format (the “SeparatedReplayBuffer” from the on-policy repository logs some information we do not require, like masks for terminated agents).

Parameters:

data (Tuple) – data collected during rollout which should be inserted into buffers

Returns:

None
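The idea of keeping one buffer per agent can be sketched with plain Python lists. ToyAgentBuffer and insert_rollout_step are hypothetical and far simpler than the SeparatedReplayBuffer from the on-policy repository; the sketch assumes that joint rollout data is indexed by agent along its first axis.

```python
from typing import List, Sequence


class ToyAgentBuffer:
    """Hypothetical per-agent buffer storing one entry per time step."""

    def __init__(self) -> None:
        self.obs: List[Sequence[float]] = []
        self.actions: List[Sequence[float]] = []
        self.rewards: List[float] = []

    def insert(
        self, obs: Sequence[float], action: Sequence[float], reward: float
    ) -> None:
        self.obs.append(obs)
        self.actions.append(action)
        self.rewards.append(reward)


def insert_rollout_step(
    buffers: List[ToyAgentBuffer],
    obs: Sequence[Sequence[float]],
    actions: Sequence[Sequence[float]],
    rewards: Sequence[float],
) -> None:
    """Split joint data for one time step and append it to each agent's buffer."""
    for agent_id, buffer in enumerate(buffers):
        buffer.insert(obs[agent_id], actions[agent_id], rewards[agent_id])
```

Separate buffers allow agents with different observation and action dimensions, which matches the heterogeneous-agent setting this trainer targets.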

log_train(train_infos: List[dict], total_num_steps: int, start: time | None = None, end: time | None = None)[source]

Parameters:
  • train_infos (List[dict]) – training metrics for each agent

  • total_num_steps (int) – current training progress

  • start (time.time) – start time of training episode

  • end (time.time) – end time of training episode

Returns:

None

prepare_run()[source]

In addition to the preparation in BaseRunner, we also instantiate an environment function as an API for the RL training.

Returns:

None

save()[source]

Save the actor and critic parameters for each agent

Returns:

None

train() → List[dict][source]

Perform updates of actor and critic parameters for each agent

Returns:

List[dict] – list of training metrics dictionary (one list entry per agent)

warmup()[source]

Pre-training preparations specific to MAPPO/IPPO

Returns:

None