commonpower.control.wrappers.MultiAgentWrapper

class MultiAgentWrapper(env)[source]

Bases: Wrapper

Wrapper that adapts ControlEnv to the API expected by the MAPPO/IPPO implementations of the on-policy repository (https://github.com/marlbenchmark/on-policy/tree/main/onpolicy). NOTE: We use our own fork of this repository; see the Readme file.

Parameters: env (ControlEnv) – power system environment with multi-agent API
Returns: MultiAgentWrapper

Methods

`act_space_dict_to_list`	Transforms an action space in the form of a nested dictionary into a list of Box spaces for each agent.
`class_name`	Returns the class name of the wrapper.
`close`	Closes the wrapper and `env`.
`get_wrapper_attr`	Gets an attribute from the wrapper and lower environments if name doesn't exist in this object.
`obs_space_dict_to_list`	Transforms the observation space in the form of a nested dictionary into a list of Box spaces for each agent.
`render`	Uses the `render()` of the `env` that can be overwritten to change the returned data.
`reset`	Resets the environment.
`step`	Advance the environment (in our case, the power system) by one step in time by applying control actions to discrete-time dynamics and updating data sources.
`wrapper_spec`	Generates a WrapperSpec for the wrappers.

Attributes

`action_space`	Returns the `Env` `action_space` unless it is overwritten, in which case the wrapper `action_space` is used.
`metadata`	Returns the `Env` `metadata`.
`np_random`	Returns the `Env` `np_random` attribute.
`observation_space`	Returns the `Env` `observation_space` unless it is overwritten, in which case the wrapper `observation_space` is used.
`render_mode`	Returns the `Env` `render_mode`.
`reward_range`	Returns the `Env` `reward_range` unless it is overwritten, in which case the wrapper `reward_range` is used.
`spec`	Returns the `Env` `spec` attribute with the WrapperSpec if the wrapper inherits from EzPickle.
`unwrapped`	Returns the base environment of the wrapper.

_unpack_obs(obs: dict) → ndarray[source]

Converts a dictionary of {agent_id: observation_dict} to a dictionary of {agent_id: flattened observation arrays}.

Parameters: obs (dict) – observation dictionary {agent_id: observation_dict}
Returns: np.ndarray – flat array of observations
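The flattening described above can be illustrated with a minimal sketch. It is not the wrapper's implementation: the function name `flatten_agent_obs` is hypothetical, the observation dictionary is assumed to mirror the {agent_id: {node_id: {element_id: values}}} nesting used for the spaces, and plain Python lists stand in for numpy arrays so the example has no dependencies.

```python
from typing import Dict, List


def flatten_agent_obs(obs: Dict[str, dict]) -> Dict[str, List[float]]:
    """Flatten a nested observation dict into one flat list per agent.

    Assumption: each agent's observation is nested as
    {node_id: {element_id: sequence_of_values}}.
    """
    flat: Dict[str, List[float]] = {}
    for agent_id, node_obs in obs.items():
        values: List[float] = []
        # Iterate in insertion order so the flat layout is deterministic.
        for element_obs in node_obs.values():
            for element_values in element_obs.values():
                values.extend(float(v) for v in element_values)
        flat[agent_id] = values
    return flat


# Hypothetical observations for two agents.
example_obs = {
    "agent0": {"n1": {"soc": [0.5], "p": [1.0, 2.0]}},
    "agent1": {"n2": {"p": [3.0]}},
}
flat_obs = flatten_agent_obs(example_obs)
```

Each agent ends up with a single flat vector whose element order follows the insertion order of the nested dictionaries.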

act_space_dict_to_list(action_space: dict) → Tuple[List[Box], dict][source]

Transforms an action space in the form of a nested dictionary into a list of Box spaces for each agent. Returns the original keys to allow re-transformation.

Parameters:

action_space (dict) – nested dictionary of {agent_id: {node_id: {element_id: el_action_space}}}

Returns:

Tuple –

tuple containing:

list of flattened agent action spaces (List[gym.spaces.Box])
dictionary with original actions keys from the action space received as an input (dict)
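To make the role of the returned keys concrete, here is a hedged sketch of the flatten-and-restore idea. It is not the library's code: `flatten_action_space` and `unflatten_action` are hypothetical names, `(low, high)` tuples of lists stand in for `gym.spaces.Box` so the example runs without gymnasium, and the layout of the stored keys is an assumption.

```python
from typing import Dict, List, Tuple

# Stand-in for gym.spaces.Box: (low bounds, high bounds).
Bounds = Tuple[List[float], List[float]]


def flatten_action_space(action_space: Dict[str, dict]) -> Tuple[List[Bounds], dict]:
    """Build one flat (low, high) pair per agent and remember the original keys."""
    flat_spaces: List[Bounds] = []
    keys: Dict[str, list] = {}
    for agent_id, node_spaces in action_space.items():
        low: List[float] = []
        high: List[float] = []
        keys[agent_id] = []
        for node_id, element_spaces in node_spaces.items():
            for element_id, (el_low, el_high) in element_spaces.items():
                low.extend(el_low)
                high.extend(el_high)
                # Store the size so a flat action can be split up again.
                keys[agent_id].append((node_id, element_id, len(el_low)))
        flat_spaces.append((low, high))
    return flat_spaces, keys


def unflatten_action(agent_id: str, flat_action: List[float], keys: dict) -> dict:
    """Rebuild {node_id: {element_id: values}} from a flat action vector."""
    nested: dict = {}
    start = 0
    for node_id, element_id, size in keys[agent_id]:
        nested.setdefault(node_id, {})[element_id] = list(flat_action[start:start + size])
        start += size
    return nested


# Hypothetical action space for a single agent controlling two elements.
space = {
    "agent0": {"n1": {"battery": ([-1.0], [1.0]), "heat_pump": ([0.0, 0.0], [2.0, 3.0])}}
}
flat_spaces, keys = flatten_action_space(space)
nested_action = unflatten_action("agent0", [0.5, 1.0, 1.5], keys)
```

Keeping the keys next to the flat spaces is what lets a flat action vector produced by the training algorithm be mapped back onto the nested structure.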

obs_space_dict_to_list(observation_space: dict) → List[Box][source]

Transforms the observation space in the form of a nested dictionary into a list of Box spaces for each agent.

Parameters: observation_space (dict) – nested dictionary of {agent_id: {node_id: {element_id: el_obs_space}}}
Returns: List[gym.spaces.Box] – list of flattened agent observation spaces

reset(*, seed=None, options=None)[source]

Resets the environment.

Parameters:

seed – seed for the random number generator
options – not needed here

Returns:

None

step(action: List[ndarray]) → Tuple[List[ndarray], List[float], bool, bool, dict][source]

Advances the environment (in our case, the power system) by one step in time by applying control actions to discrete-time dynamics and updating data sources. The update itself is handled within the System class. The actions of the RL agents are selected within the RL training algorithm and are passed on to the power system using a callback. After the system update, a reward is computed which indicates how good the selected action was in the current state. This reward is passed to the training algorithm to gradually improve the policies of the RL agents.

Parameters:

action (List[np.ndarray]) – actions of RL agents (here as a list of numpy arrays)

Returns:

Tuple –

tuple containing:

observations of all RL agents, here as a list of observations of each agent as numpy arrays (list).
rewards of all RL agents (list).
whether the episode has terminated (bool). We assume that all agents terminate an episode at the same time, as we have centralized time management. Always false for continuous control.
whether the episode was truncated (bool). The gymnasium API distinguishes between terminated and truncated, which can be useful for other environments but is not needed in our case.
additional information (dict)
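The return signature above can be exercised with a short interaction loop. The sketch below does not use the real wrapper: `StubMultiAgentEnv` is a hypothetical stand-in that only mimics the documented shapes (a list of per-agent observations, a list of per-agent rewards, two flags and an info dict), its reward rule is made up, and plain lists replace numpy arrays.

```python
from typing import List, Tuple


class StubMultiAgentEnv:
    """Hypothetical stand-in that mimics the documented step() return shapes."""

    def __init__(self, n_agents: int = 2) -> None:
        self.n_agents = n_agents
        self.t = 0

    def reset(self, *, seed=None, options=None) -> None:
        self.t = 0

    def step(self, action: List[List[float]]) -> Tuple[list, list, bool, bool, dict]:
        self.t += 1
        obs = [[float(self.t)] for _ in range(self.n_agents)]
        # Made-up reward: penalize the magnitude of each agent's action.
        rewards = [-abs(a[0]) for a in action]
        # Both flags stay False, matching the continuous-control description.
        return obs, rewards, False, False, {}


env = StubMultiAgentEnv(n_agents=2)
env.reset(seed=0)

n_steps = 3
returns = [0.0, 0.0]
for _ in range(n_steps):
    # One action vector per agent; a trained policy would supply these.
    actions = [[0.5], [1.0]]
    obs, rewards, terminated, truncated, info = env.step(actions)
    returns = [total + r for total, r in zip(returns, rewards)]
    if terminated or truncated:
        break
```

Because both flags are expected to stay false for continuous control, the loop length is set by the caller through `n_steps` rather than by the environment.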