Chapter 1
Introduction
Communication in multi-agent systems and the emergence of language have been part of an
active research field for some time now. Recent advances in machine learning offer new
ways of studying and resolving these questions. Developing agents that are capable of
learning to communicate has been a constant challenge for the academic community, and
recent results using deep reinforcement learning show great progress in this field [5], [1].
Hardware advances, especially in GPUs, have allowed machine learning algorithms to be
trained on large datasets over reasonable spans of time.
The role of the environment in the emergence of language and communication has been
a topic of intense debate. We want our agents' communication to emerge from necessity,
imitating humans, who communicate out of necessity. In order to create this necessity,
agents must have only partial observability of the environment. We explore cooperative
tasks in our work in order to create good premises for language emergence. The analysis
in evolutionary linguistics (Nowak et al., 2000) finds that composition emerges only when
the number of concepts to be expressed grows beyond a factor of the agent's symbol
vocabulary capacity. We consider the problem of multiple agents acting in environments
with the goal of maximizing their shared utility. In these environments, agents must
learn to communicate in order to share the information needed to solve their tasks.
Previously, many solutions had the agents share parameters during the training process.
Since not all real-world problems can be solved in this way, we aim to decentralize both
the training process and the execution. In recent years a new approach called DIAL [5]
trained agents in a centralized way while keeping the execution decentralized. In order
to decentralize the learning as well, our approach uses discrete messages rather than
gradients, as DIAL does. We propose a decentralized system where agents learn to
choose their own actions and to communicate in order to coordinate.
An intelligent agent in a multi-agent system should have social abilities. A natural
way for an agent to express its social abilities is to communicate with other agents.
Messages between agents are usually predefined, which restricts agent communication
to human knowledge of the environment. Enabling agents to learn to communicate
through deep reinforcement learning removes this restriction and allows an agent to
gain an understanding of the environment potentially beyond the human level.
In recent years agents have achieved better or equal performance compared to humans [7].
The emergence of language in multi-agent systems could help humans and agents
communicate, empowering humans to gain a better understanding of different problems
and their solutions. An interpretable compositional structure, which in general assigns
symbols that separately refer to environment landmarks, has been presented in [1]. We
aim to create agents that exchange interpretable messages.
Through this work, we aim to achieve two goals. Firstly, we want to find an environment
where partial observability makes it impossible to find an efficient solution without
sharing information between different agents. We are going to test this hypothesis by
comparing centralized learning with full observability against decentralized learning
with partial observability. The second goal that we set out to achieve in this work is
to observe whether communication between agents can resolve the partial observability
issue. Additionally, we aim to map our messages from continuous values to discrete
encodings without harming the accuracy of our approach.
In Chapter 2 we focus on work that uses the same proposed method to solve a different
problem and on work that proposes a different method to solve the same problem. In
Chapter 3 we provide the background knowledge needed to understand our solution and
our experiments. In Chapter 4 we describe our solution and our testing environments.
In Chapter 5 we present our experiments and their results. In the last two chapters we
present the conclusions of our work (Chapter 6) and propose future work (Chapter 7).
Chapter 2
Related work
A significant number of problems in Artificial Intelligence require the cooperative work
of multiple agents on an activity. Usually those problems are too difficult to be solved
efficiently by an individual agent. Multi-agent populations offer a good environment
for approaches that tackle the emergence of communication and the emergence of
language (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]).
Typically, a manually specified protocol, which is not altered during training, is used as
the communication protocol between agents. Recent advances in machine learning offer
ways of establishing communication protocols between agents instead. Recently,
reinforcement learning approaches have been used for learning a communication
protocol (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]; Sukhbaatar
et al., 2016 [3]; Lazaridou et al., 2016 [4]). Giles et al., 2001 [8] used an evolutionary
algorithm in order to bootstrap a multi-agent communication protocol. Their results
show that during training agents were able to find superior communication protocols
that are not understood by a human.
The use of deep learning as a function approximator for Q-values is essential for
scalability and further progress. Recent work has successfully developed deep Q-networks
that learn policies directly using end-to-end reinforcement learning (Volodymyr Mnih
et al., 2017 [7]). Their results show that agents achieve better or equal performance
compared to human players on Atari games. Many of the games used in [7] offer full
observability of the environment and lack communication between agents. This approach
cannot be applied to many real-world problems because it makes strong assumptions
about the environment and about the agents' observability. Some new approaches involve
learning to communicate and the emergence of language among agents that have only
partial observability (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]).
Among multi-agent systems that use reinforcement learning in order to learn a
communication protocol, the approach of Foerster et al., 2016 [5] is the closest to ours.
Like our approach, they consider multiple agents acting in an environment, where agents
need to learn a communication protocol in order to share information about their state,
with the single purpose of maximizing the shared utility of the system. It is important
to mention that their work introduces new techniques such as DIAL and channel noise.
DIAL enables agents not only to share parameters during learning but also to push
gradients through the communication channel. This technique simulates human
communication, where all the persons who are part of the conversation receive feedback
about the listener's understanding. Naturally, DIAL uses continuous protocols; in order
to address this issue, they add noise on the communication channel. During their work
they discovered that adding noise to the communication channel forces messages to split
into discrete categories, and this was essential for their method to train successfully.
However, their experiments use centralized learning and decentralized execution, while
our approach builds on technological developments that enable decentralized learning
and execution.
Figure 2.1: DIAL – Differentiable communication [5]

Figure 2.2: DIAL – Architecture [5]
Currently, large amounts of text are used in order to train natural language systems.
If we want to train a conversational agent, this exposure is problematic because the
agent only retains statistical information and does not have the capacity to understand
the meaning of the words. However, recent work relies on a multi-agent environment
instead (Igor Mordatch, Pieter Abbeel, 2017 [1]; Lazaridou et al., 2016 [4]). Agent
communication in such an environment creates a framework for language emergence.
Both works focus on the environment features that lead to symbols that have meaning
and could be understood by a human. Another interesting point is that both papers use
a fully cooperative environment. Our approach also needs cooperation between agents
in order to find the best solution, but it also has a competitive part, where agents have
to keep their workload as low as possible.
A long-standing challenge in artificial intelligence is the development of agents that are
able to communicate with humans, and there have been steps towards this goal. One of
the hot topics is how and why languages formed. One popular idea is that the emergence
of languages must come out of necessity (Igor Mordatch, Pieter Abbeel, 2017 [1];
Lazaridou et al., 2016 [4]). The fact that an agent uses a language does not mean that
it understands it; in our view, an agent has a good understanding of the language if it
achieves its goals in the environment using a communication protocol with other agents.
Igor Mordatch and Pieter Abbeel, 2017 [1] offer a few starting points on how language
emerges in multi-agent systems. Their work confirms that the vocabulary size should be
smaller than the number of concepts. In the past, communication with discrete messages
presented difficulties for backpropagation. A notable achievement of their paper is the
successful usage of Gumbel-Softmax, a continuous relaxation of a discrete categorical
distribution, in language emergence. This distribution was independently discovered by
Maddison et al., 2016 [12]. Jang et al., 2016 [10] used the same approach as Maddison
et al., 2016 [12], which is similar to ours. However, the approach of Igor Mordatch and
Pieter Abbeel, 2017 [1] uses a changing environment and ours does not. We consider
that a changing environment may help the emergence of language, but it is not a
requirement.

Figure 2.3: The Gumbel-Softmax distribution interpolates between discrete one-hot-encoded
categorical distributions and continuous categorical densities. (a) For low temperatures
(τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches
the expected value of a categorical random variable with the same logits. As the
temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform
distribution over the categories. (b) Samples from Gumbel-Softmax distributions are
identical to samples from a categorical distribution as τ → 0. At higher temperatures,
Gumbel-Softmax samples are no longer one-hot, and become uniform as τ → ∞. [10]
Chapter 3
Theoretical background
3.1 Multi-agent system
A multi-agent system [9] is a system of multiple intelligent agents interacting within
an environment. Wooldridge and Jennings define an agent as a system (software or
hardware) with the following properties: autonomy, reactivity, proactivity and social
abilities. The social abilities help agents to synchronize, communicate and organize in
order to achieve their goals. Agents can be split into two categories: cognitive agents
and reactive agents. Reactive agents only observe the environment and react when it
changes. Cognitive agents not only observe the environment but also interact with other
agents and use their social abilities. A rational agent is an agent that does the right
thing. Stuart Russell and Peter Norvig [9] define a rational agent as follows: "For each
possible percept sequence, a rational agent should select an action that is expected to
maximize its performance measure, given the evidence provided by the percept sequence
and whatever built-in knowledge the agent has". Depending on the environment, agents
can compete (e.g. chess), cooperate (e.g. a beehive) or partially cooperate (e.g. taxi
drivers).
3.2 Neural networks
A neural network is a computational model that contains a number of nodes connected
similarly to the neurons of the human brain. Neural networks are organized in layers,
each containing a number of interconnected neurons. Every network has an input layer,
an output layer and one or more hidden layers.

Figure 3.1: Neural network with two hidden layers.

A neuron is a node in the neural network that receives multiple inputs and outputs a
single value. Every neuron has an activation function; the role of activation functions
is to make neural networks non-linear. The most commonly used activation functions
are Sigmoid, TanH and ReLU.
σ(x) = 1 / (1 + e^(−x))    tanh(x) = 2 ∗ σ(2x) − 1    relu(x) = max(0, x)
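As an illustration, the three activation functions above can be written in a few lines (a minimal NumPy sketch; the function names are our own):

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(−x)), squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # TanH expressed through the sigmoid: tanh(x) = 2σ(2x) − 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # ReLU keeps positive inputs and zeroes out the rest
    return np.maximum(0.0, x)
```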
A feed-forward neural network is a network in which the connections between the units
do not form a cycle. A recurrent neural network is a network in which connections
between units do form a cycle. This helps the network create an internal state, which
allows it to exhibit dynamic temporal behavior.
Figure 3.2: Sigmoid    Figure 3.3: TanH    Figure 3.4: ReLU
3.3 Reinforcement learning
Reinforcement learning [11] (RL) is an area of machine learning where the agent learns
how to map situations to actions in order to maximize the cumulative reward. The
learner is not told which actions to take, but instead must discover by trial and error
which actions yield the most reward. Solving a reinforcement learning task means
finding a policy that achieves a large amount of reward over the long run. The optimal
policy is one that is always better than or equal to all other policies.
One of the challenges is the trade-off between exploration and exploitation. To maximize
the reward, the agent should pick the action that it has already explored and that returns
the most reward, but on the other hand the agent must try a variety of actions in order
to be able to pick the best action later. The most commonly used algorithm for balancing
exploration and exploitation is ε-greedy. The ε value lies between 0 and 1: with
probability ε the agent acts randomly, exploring, and with probability 1 − ε it acts
based on its policy, exploiting what it has already discovered. A value close to 1
therefore means a lot of exploration, while a value close to 0 means mostly exploitation.
During the episodes the ε value can be fixed or can decrease over time.
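The ε-greedy rule described above can be sketched as follows (a minimal illustration; the function name is our own):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    # exploit: action with the highest estimated Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```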
A reinforcement learning problem can be formalized as a Markov decision process. The
agent is in an environment which is in a certain state. The agent performs actions that
move the environment to a new state, and these actions result in a reward received by
the agent. One episode of this process forms a finite sequence of states, actions and
rewards:

s0, a0, r1, s1, a1, r2, s2, . . . , sn−1, an−1, rn, sn.
Figure 3.5: The agent–environment interaction.
The agent's goal is to pick the actions that maximize the return in the future. Given a
Markov process, the total future reward from time point t onward can be expressed as:
Gt = Rt+1 + Rt+2 + Rt+3 + …
Our environment is stochastic: we can never be sure that we will get the same rewards
the next time we perform the same actions, and the further into the future we go, the
more the outcomes may diverge. For that reason it is common to use the discounted
future reward instead:

Gt = Rt+1 + γ ∗ Rt+2 + γ² ∗ Rt+3 + …
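The discounted return above can be computed directly from a sequence of rewards (a small illustrative helper of our own, not part of the cited works):

```python
def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...
    # `rewards` holds R_{t+1}, R_{t+2}, ... in order.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g
```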
where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate.

3.4 Deep Reinforcement learning
One of the most important breakthroughs in reinforcement learning was the development
of an off-policy Temporal Difference control algorithm known as Q-learning (Watkins,
1989). Q-learning is defined by:

Q(St, At) ← Q(St, At) + α ∗ [Rt+1 + γ ∗ maxa Q(St+1, a) − Q(St, At)].

The main idea in Q-learning is that we can iteratively approximate the Q-function using
the Bellman equation. In the simplest case the Q-function is implemented as a table,
with states as rows and actions as columns. The Q-learning algorithm is the following:
initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    choose a from s using policy derived from Q (e.g., ε-greedy)
    observe reward r and new state s'
    Q[s,a] = Q[s,a] + α ∗ (r + γ ∗ maxa' Q[s',a'] − Q[s,a])
    s = s'
until terminated

Listing 3.1: Q-learning: An off-policy TD control algorithm.
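As an illustration of Listing 3.1, a tabular Q-learning loop might look like the following Python sketch (the environment interface and the toy chain task are our own assumptions, not from the cited works):

```python
import random

def q_learning(step, n_states, n_actions, episodes=300,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; `step(s, a) -> (s2, r, done)` is the environment."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False            # every episode starts in state 0
        while not done:
            if random.random() < epsilon:          # ε-greedy action choice
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # TD update from Listing 3.1
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Toy 4-state chain: action 1 moves right (reward 1 on reaching the last
# state), action 0 stays put.
def chain_step(s, a):
    s2 = min(s + 1, 3) if a == 1 else s
    return s2, (1.0 if s2 == 3 and s != 3 else 0.0), s2 == 3
```

After training, the learned table prefers moving right in every state, since only the rightmost state yields reward.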
Deep Q-learning [7] uses a neural network parameterized by θ to represent Q(s, a; θ).
DQN is used when the number of states of the environment is very high and the Q-table
would take a very long time to converge; the neural network approximates the value
function Q(s, a; θ) instead. DQN also uses experience replay: during learning, the agent
builds a dataset of episodic experiences and is then trained by sampling mini-batches
of experiences.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialise sequence s1 = x1 and preprocessed sequence φ1 = φ(s1)
    for t = 1, T do
        With probability ε select a random action at
        otherwise select at = maxa Q(φ(st), a; θ)
        Execute action at in emulator and observe reward rt and state xt+1
        Set st+1 = st, at, xt+1 and preprocess φt+1 = φ(st+1)
        Store transition (φt, at, rt, φt+1) in D
        Sample random minibatch of transitions (φj, aj, rj, φj+1) from D
        Set yj = rj                                  for terminal φj+1
            yj = rj + γ ∗ maxa' Q(φj+1, a'; θ)       for non-terminal φj+1
        Perform a gradient descent step on (yj − Q(φj, aj; θ))²
    end for
end for

Listing 3.2: Deep Q-learning with experience replay.
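The replay memory D from Listing 3.2 can be sketched as a simple fixed-capacity buffer (an illustrative class of our own; real DQN implementations differ in detail):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions for experience replay."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlations between
        # consecutive transitions, stabilizing learning.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```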
DQN uses an action selector for picking an action a in state s given the parameters θ.
Typically the action selector is an ε-greedy implementation that selects the action that
maximizes the Q-value with a probability of 1 − ε and chooses randomly with a
probability of ε. During the episodes, ε may have a fixed value or may decrease over
time.
3.5 Gumbel-Softmax
The Gumbel-Softmax distribution is a continuous relaxation of a discrete categorical
distribution. One of its most important properties is that it is reparameterizable,
allowing gradients to flow during backpropagation; the main contribution of this work
is a reparameterization trick for the categorical distribution. The Gumbel-Max trick
(Gumbel, 1954 [6]) provides an efficient way to draw samples z from the categorical
distribution with class probabilities πi:

z = one_hot(argmaxi [gi + log(πi)])

where g1, …, gk are independent and identically distributed samples drawn from
Gumbel(0, 1). Argmax is not differentiable, so we approximate it continuously using
the softmax function:

yi = exp((log(πi) + gi)/τ) / Σj=1..k exp((log(πj) + gj)/τ)    for i = 1, …, k.
τ is a temperature parameter that allows us to control how closely samples from the
Gumbel-Softmax distribution approximate those from the categorical distribution. As
the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform
distribution over the categories. As the softmax temperature τ approaches 0, samples
from the Gumbel-Softmax distribution become one-hot and the Gumbel-Softmax
distribution becomes identical to the categorical distribution p(z) (Maddison et al.,
2016 [12]; Jang et al., 2016 [10]).
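Both the Gumbel-Max trick and its softmax relaxation can be sketched in NumPy (an illustrative sample-level implementation of our own; in practice the relaxation is used inside an automatic differentiation framework so that gradients can flow through y):

```python
import numpy as np

def sample_gumbel(shape, eps=1e-20):
    # Gumbel(0, 1) via inverse transform: -log(-log(U)), U ~ Uniform(0, 1)
    u = np.random.uniform(0.0, 1.0, shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_max(log_pi):
    # Gumbel-Max trick: exact one-hot sample from the categorical distribution
    g = sample_gumbel(log_pi.shape)
    z = np.zeros_like(log_pi)
    z[np.argmax(log_pi + g)] = 1.0
    return z

def gumbel_softmax(log_pi, tau):
    # Continuous relaxation: softmax((log π + g) / τ); as τ → 0 the sample
    # approaches one-hot, as τ → ∞ it approaches uniform.
    g = sample_gumbel(log_pi.shape)
    y = np.exp((log_pi + g) / tau)
    return y / y.sum()
```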