The decentralized partially observable Markov decision process (Dec-POMDP)[1][2] is a model for coordination and decision-making among multiple agents. It is a probabilistic model that accounts for uncertainty in action outcomes, in sensing, and in communication (which may be costly, delayed, noisy or nonexistent).
It is a generalization of a Markov decision process (MDP) and a partially observable Markov decision process (POMDP) to settings with multiple decentralized agents.[3]
A Dec-POMDP is a 7-tuple $(S, \{A_i\}, T, R, \{\Omega_i\}, O, \gamma)$, where

- $S$ is a set of states,
- $A_i$ is a set of actions for agent $i$, with $A = \times_i A_i$ the set of joint actions,
- $T$ is a set of conditional transition probabilities between states, $T(s, a, s') = P(s' \mid s, a)$,
- $R : S \times A \to \mathbb{R}$ is the reward function,
- $\Omega_i$ is a set of observations for agent $i$, with $\Omega = \times_i \Omega_i$ the set of joint observations,
- $O$ is a set of conditional observation probabilities, $O(s', a, o) = P(o \mid s', a)$, and
- $\gamma \in [0, 1]$ is the discount factor.
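As a rough illustration of the tuple for small, finite problems, the following Python sketch encodes each component directly; the container name, the use of string labels for states/actions/observations, and the tabular dictionaries for $T$, $R$ and $O$ are assumptions made here for concreteness, not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical tabular encoding of a Dec-POMDP (S, {A_i}, T, R, {Omega_i}, O, gamma).
@dataclass
class DecPOMDP:
    states: List[str]                          # S: finite set of states
    actions: List[List[str]]                   # A_i: one action set per agent
    transition: Dict[Tuple[str, Tuple[str, ...]], Dict[str, float]]
                                               # T(s, a, s') = P(s' | s, a)
    reward: Dict[Tuple[str, Tuple[str, ...]], float]
                                               # R(s, a): team reward
    observations: List[List[str]]              # Omega_i: one observation set per agent
    observation_fn: Dict[Tuple[str, Tuple[str, ...]], Dict[Tuple[str, ...], float]]
                                               # O(s', a, o) = P(o | s', a), o a joint observation
    gamma: float                               # discount factor
```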
At each time step, each agent takes an action $a_i \in A_i$; the state updates according to the transition function $T(s, a, s')$ (using the current state and the joint action); each agent receives an observation according to the observation function $O(s', a, o)$ (using the next state and the joint action); and a reward $R(s, a)$ is generated for the whole team. The goal is to maximize the expected cumulative reward, either over some given number of steps (the finite-horizon case) or forever (the infinite-horizon case). The discount factor $\gamma \in [0, 1)$ keeps the cumulative reward finite in the infinite-horizon case.
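To make these dynamics concrete, the sketch below simulates one time step and accumulates the discounted return, assuming a tabular container like the hypothetical `DecPOMDP` above; the helper names and encoding are illustrative assumptions, not a standard API.

```python
import random
from typing import Dict, Tuple


def sample(dist: Dict, rng: random.Random):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return rng.choices(outcomes, weights=probs, k=1)[0]


def step(model: "DecPOMDP", state: str, joint_action: Tuple[str, ...], rng: random.Random):
    """One Dec-POMDP time step: state transition, team reward, and joint observation."""
    next_state = sample(model.transition[(state, joint_action)], rng)          # s' ~ T(s, a, .)
    reward = model.reward[(state, joint_action)]                               # R(s, a), shared by the team
    joint_obs = sample(model.observation_fn[(next_state, joint_action)], rng)  # o ~ O(s', a, .)
    return next_state, reward, joint_obs


def discounted_return(rewards, gamma: float) -> float:
    """Cumulative reward sum_t gamma^t * r_t; gamma < 1 keeps it finite over long horizons."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```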