Chapter 1
Introduction
Communication in multi-agent systems and the emergence of language have been part of an
active research field for some time now. Recent advances in machine learning offer new
ways of studying and resolving these questions. Developing agents that are capable of
learning to communicate has been a constant challenge for the academic community, and
recent results using deep reinforcement learning show great progress in this field [5], [1].
Hardware advances, especially in GPUs, have allowed machine learning algorithms to be
trained on large datasets over reasonable spans of time.
The role of the environment in the emergence of language and communication has been
a topic of intense debate. We want our agents' communication to emerge from necessity,
imitating humans, who communicate out of necessity. In order to create this necessity,
agents must have only partial observability of the environment. We explore cooperative
tasks in our work in order to create good premises for language emergence. The analysis
in evolutionary linguistics (Nowak et al., 2000) finds that composition emerges only when
the number of concepts to be expressed grows beyond a factor of the agent's symbol
vocabulary capacity. We consider the problem of multiple agents acting in environments
with the goal of maximizing their shared utility. In these environments, agents must
learn to communicate in order to share the information needed to solve their tasks.
Previously, many solutions had the agents share parameters during the training process.
Since not all real-world problems can be solved in this way, we aim to decentralize both
the training process and the execution. In recent years a new approach called DIAL [5]
trained agents in a centralized way while keeping the execution decentralized. In order
to decentralize the learning as well, our approach uses discrete messages rather than
gradients, as DIAL does. We propose a decentralized system where agents learn to
choose their own actions and to communicate in order to coordinate.
An intelligent agent in a multi-agent system should have social abilities. A natural
way for an agent to express its social abilities is to communicate with other agents.
Messages between agents are usually predefined, which restricts agent communication
to human knowledge of the environment. Enabling agents to learn to communicate
through deep reinforcement learning removes this restriction and allows an agent to
gain an understanding of the environment potentially beyond the human level.
In recent years agents have achieved better or equal performance compared to humans [7].
The emergence of language in multi-agent systems could help humans and agents
communicate, empowering humans to gain a better understanding of different problems
and their solutions. An interpretable compositional structure, which in general assigns
symbols that separately refer to environment landmarks, has been presented in [1]. We
aim to create agents that exchange interpretable messages.
Through this work, we aim to achieve two goals. Firstly, we want to find an environment
where partial observability makes it impossible to find an efficient solution without
sharing information between different agents. We are going to test this hypothesis by
comparing centralized learning with full observability against decentralized learning
with partial observability. The second goal that we set out to achieve in this work is
to observe whether communication between agents can resolve the partial observability
issue. Additionally, we aim to map our messages from continuous values to discrete
encodings without harming the accuracy of our approach.
In Chapter 2 we focus on work that uses the same proposed method to solve a different
problem and on work that proposes a different method to solve the same problem. In
Chapter 3 we provide the background knowledge needed to understand our solution and
our experiments. In Chapter 4 we describe our solution and our testing environments.
In Chapter 5 we present our experiments and their results. In the last two chapters we
present the conclusions of our work (Chapter 6) and propose future work (Chapter 7).
Chapter 2
Related work
A significant number of problems in Artificial Intelligence require the cooperative work
of multiple agents on an activity. Usually those problems are too difficult to be solved
efficiently by an individual agent. Multi-agent populations offer a good environment
for approaches that tackle the emergence of communication and the emergence of
language (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]).
Typically, a manually specified protocol, which is not altered during training, is used as
the communication protocol between agents. Recent advances in machine learning offer
ways of establishing communication protocols between agents instead. Recently,
reinforcement learning approaches have been used for learning a communication
protocol (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]; Sukhbaatar
et al., 2016 [3]; Lazaridou et al., 2016 [4]). Giles et al., 2001 [8] used an evolutionary
algorithm in order to bootstrap a multi-agent communication protocol. Their results
show that during training agents were able to find superior communication protocols
that are not understood by a human.
The use of deep learning as a function approximator for Q-values is essential for
scalability and further progress. Recent work has successfully developed deep Q-networks
that learn policies directly using end-to-end reinforcement learning (Volodymyr Mnih
et al., 2017 [7]). Their results show that agents achieve better or equal performance
compared to human players on Atari games. Many of the games used in [7] offer full
observability of the environment and lack communication between agents. This approach
cannot be applied to many real-world problems because it makes strong assumptions
about the environment and about the agents' observability. Some new approaches involve
learning to communicate and the emergence of language among agents that have only
partial observability (Igor Mordatch, Pieter Abbeel, 2017 [1]; Foerster et al., 2016 [5]).
Among multi-agent systems that use reinforcement learning in order to learn a
communication protocol, the approach of Foerster et al., 2016 [5] is the closest to ours.
Like our approach, they consider multiple agents acting in an environment, where agents
need to learn a communication protocol in order to share information about their state,
with the single purpose of maximizing the shared utility of the system. It is important
to mention that their work introduces new techniques such as DIAL and channel noise.
DIAL enables agents not only to share parameters during learning but also to push
gradients through the communication channel. This technique simulates human
communication, where all the persons who are part of the conversation receive feedback
about the listener's understanding. Naturally, DIAL uses continuous protocols; in order
to address this issue, they add noise on the communication channel. During their work
they discovered that adding noise to the communication channel forces messages to split
into discrete categories, and this was essential for their method to train successfully.
However, their experiments use centralized learning and decentralized execution, while
our approach builds on technological developments that enable decentralized learning
and execution.
Figure 2.1: DIAL – Differentiable communication [5]

Figure 2.2: DIAL – Architecture [5]
Currently, large amounts of text are used in order to train natural language systems.
If we want to train a conversational agent, this exposure is problematic because the
agent only retains statistical information and does not have the capacity to understand
the meaning of the words. However, recent work relies on a multi-agent environment
instead (Igor Mordatch, Pieter Abbeel, 2017 [1]; Lazaridou et al., 2016 [4]). Agent
communication in such an environment creates a framework for language emergence.
Both works focus on the environment features that lead to symbols that have meaning
and could be understood by a human. Another interesting point is that both papers use
a fully cooperative environment. Our approach also needs cooperation between agents
in order to find the best solution, but it also has a competitive part, where agents have
to keep their workload as low as possible.
A long-standing challenge in artificial intelligence is the development of agents that are
able to communicate with humans, and there have been steps towards this goal. One of
the hot topics is how and why languages formed. One popular idea is that the emergence
of languages must come out of necessity (Igor Mordatch, Pieter Abbeel, 2017 [1];
Lazaridou et al., 2016 [4]). The fact that an agent uses a language does not mean that
it understands it; in our view, an agent has a good understanding of the language if it
achieves its goals in the environment using a communication protocol with other agents.
Igor Mordatch and Pieter Abbeel, 2017 [1] offer a few starting points on how language
emerges in multi-agent systems. Their work confirms that the vocabulary size should be
smaller than the number of concepts. In the past, communication with discrete messages
presented difficulties for backpropagation. A notable achievement of their paper is the
successful usage of Gumbel-Softmax, a continuous relaxation of a discrete categorical
distribution, in language emergence. This distribution was independently discovered by
Maddison et al., 2016 [12]. Jang et al., 2016 [10] used the same approach as Maddison
et al., 2016 [12], which is similar to ours. However, the approach of Igor Mordatch and
Pieter Abbeel, 2017 [1] uses a changing environment and ours does not. We consider
that a changing environment may help the emergence of language, but it is not a
requirement.

Figure 2.3: The Gumbel-Softmax distribution interpolates between discrete one-hot-encoded
categorical distributions and continuous categorical densities. (a) For low temperatures
(τ = 0.1, τ = 0.5), the expected value of a Gumbel-Softmax random variable approaches
the expected value of a categorical random variable with the same logits. As the
temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform
distribution over the categories. (b) Samples from Gumbel-Softmax distributions are
identical to samples from a categorical distribution as τ → 0. At higher temperatures,
Gumbel-Softmax samples are no longer one-hot, and become uniform as τ → ∞. [10]
Chapter 3
Theoretical background
3.1 Multi-agent system
A multi-agent system [9] is a system of multiple intelligent agents interacting within
an environment. Wooldridge and Jennings define an agent as a system (software or
hardware) with the following properties: autonomy, reactivity, proactivity and social
abilities. The social abilities help agents to synchronize, communicate and organize in
order to achieve their goals. Agents can be split into two categories: cognitive agents
and reactive agents. Reactive agents only observe the environment and react when it
changes. Cognitive agents not only observe the environment but also interact with other
agents and use their social abilities. A rational agent is an agent that does the right
thing. Stuart Russell and Peter Norvig [9] define a rational agent as follows: "For each
possible percept sequence, a rational agent should select an action that is expected to
maximize its performance measure, given the evidence provided by the percept sequence
and whatever built-in knowledge the agent has". Depending on the environment, agents
can compete (e.g. chess), cooperate (e.g. a beehive) or partially cooperate (e.g. taxi
drivers).
3.2 Neural networks
A neural network is a computational model that contains a number of nodes connected
similarly to the neurons of the human brain. Neural networks are organized in layers,
each containing a number of interconnected neurons. Every network has an input layer,
an output layer and one or more hidden layers.

Figure 3.1: Neural network with two hidden layers.

A neuron is a node in the neural network that receives multiple inputs and outputs a
single value. Every neuron has an activation function; the role of activation functions
is to make neural networks non-linear. The most commonly used activation functions
are Sigmoid, TanH and ReLU.
σ(x) = 1 / (1 + e^(−x))    tanh(x) = 2 ∗ σ(2x) − 1    relu(x) = max(0, x)
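As an illustration, the three activation functions above can be written in a few lines (a minimal NumPy sketch; the function names are our own):

```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^(−x)), squashes inputs into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # TanH expressed through the sigmoid: tanh(x) = 2σ(2x) − 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x):
    # ReLU keeps positive inputs and zeroes out the rest
    return np.maximum(0.0, x)
```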
A feed-forward neural network is a network in which the connections between the units
do not form a cycle. A recurrent neural network is a network in which connections
between units do form a cycle. This helps the network create an internal state, which
allows it to exhibit dynamic temporal behavior.
Figure 3.2: Sigmoid    Figure 3.3: TanH    Figure 3.4: ReLU
3.3 Reinforcement learning
Reinforcement learning [11] (RL) is an area of machine learning where the agent learns
how to map situations to actions in order to maximize the cumulative reward. The
learner is not told which actions to take, but instead must discover by trial and error
which actions yield the most reward. Solving a reinforcement learning task means
finding a policy that achieves a large amount of reward over the long run. The optimal
policy is one that is always better than or equal to all other policies.
One of the challenges is the trade-off between exploration and exploitation. To maximize
the reward, the agent should pick the action that it has already explored and that returns
the most reward, but on the other hand the agent must try a variety of actions in order
to be able to pick the best action later. The most commonly used algorithm for balancing
exploration and exploitation is ε-greedy. The ε value lies between 0 and 1: with
probability ε the agent acts randomly, exploring, and with probability 1 − ε it acts
based on its policy, exploiting what it has already discovered. A value close to 1
therefore means a lot of exploration, while a value close to 0 means mostly exploitation.
During the episodes the ε value can be fixed or can decrease over time.
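The ε-greedy rule described above can be sketched as follows (a minimal illustration; the function name is our own):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    # exploit: action with the highest estimated Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```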
A reinforcement learning problem can be formalized as a Markov decision process. The
agent is in an environment which is in a certain state. The agent performs actions that
move the environment to a new state, and these actions result in a reward received by
the agent. One episode of this process forms a finite sequence of states, actions and
rewards:

s0, a0, r1, s1, a1, r2, s2, . . . , sn−1, an−1, rn, sn.
Figure 3.5: The agent–environment interaction.
The agent's goal is to pick the actions that maximize the return in the future. Given a
Markov process, the total future reward from time point t onward can be expressed as:
Gt = Rt+1 + Rt+2 + Rt+3 + …
Our environment is stochastic: we can never be sure that we will get the same rewards
the next time we perform the same actions, and the further into the future we go, the
more the outcomes may diverge. For that reason it is common to use the discounted
future reward instead:

Gt = Rt+1 + γ ∗ Rt+2 + γ² ∗ Rt+3 + …
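The discounted return above can be computed directly from a sequence of rewards (a small illustrative helper of our own, not part of the cited works):

```python
def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...
    # `rewards` holds R_{t+1}, R_{t+2}, ... in order.
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g
```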
where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate.

3.4 Deep Reinforcement learning
One of the most important breakthroughs in reinforcement learning was the development
of an off-policy Temporal Difference control algorithm known as Q-learning (Watkins,
1989). Q-learning is defined by:

Q(St, At) ← Q(St, At) + α ∗ [Rt+1 + γ ∗ maxa Q(St+1, a) − Q(St, At)].

The main idea in Q-learning is that we can iteratively approximate the Q-function using
the Bellman equation. In the simplest case the Q-function is implemented as a table,
with states as rows and actions as columns. The Q-learning algorithm is the following:
initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    choose a from s using policy derived from Q (e.g., ε-greedy)
    observe reward r and new state s'
    Q[s,a] = Q[s,a] + α ∗ (r + γ ∗ maxa' Q[s',a'] − Q[s,a])
    s = s'
until terminated

Listing 3.1: Q-learning: An off-policy TD control algorithm.
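As an illustration of Listing 3.1, a tabular Q-learning loop might look like the following Python sketch (the environment interface and the toy chain task are our own assumptions, not from the cited works):

```python
import random

def q_learning(step, n_states, n_actions, episodes=300,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; `step(s, a) -> (s2, r, done)` is the environment."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False            # every episode starts in state 0
        while not done:
            if random.random() < epsilon:          # ε-greedy action choice
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # TD update from Listing 3.1
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Toy 4-state chain: action 1 moves right (reward 1 on reaching the last
# state), action 0 stays put.
def chain_step(s, a):
    s2 = min(s + 1, 3) if a == 1 else s
    return s2, (1.0 if s2 == 3 and s != 3 else 0.0), s2 == 3
```

After training, the learned table prefers moving right in every state, since only the rightmost state yields reward.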
Deep Q-learning [7] uses a neural network parameterized by θ to represent Q(s, a; θ).
DQN is used when the number of states of the environment is very high and the Q-table
would take a very long time to converge; the neural network approximates the value
function Q(s, a; θ) instead. DQN also uses experience replay: during learning, the agent
builds a dataset of episodic experiences and is then trained by sampling mini-batches
of experiences.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialise sequence s1 = x1 and preprocessed sequence φ1 = φ(s1)
    for t = 1, T do
        With probability ε select a random action at
        otherwise select at = maxa Q(φ(st), a; θ)
        Execute action at in emulator and observe reward rt and state xt+1
        Set st+1 = st, at, xt+1 and preprocess φt+1 = φ(st+1)
        Store transition (φt, at, rt, φt+1) in D
        Sample random minibatch of transitions (φj, aj, rj, φj+1) from D
        Set yj = rj                                  for terminal φj+1
            yj = rj + γ ∗ maxa' Q(φj+1, a'; θ)       for non-terminal φj+1
        Perform a gradient descent step on (yj − Q(φj, aj; θ))²
    end for
end for

Listing 3.2: Deep Q-learning with experience replay.
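The replay memory D from Listing 3.2 can be sketched as a simple fixed-capacity buffer (an illustrative class of our own; real DQN implementations differ in detail):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of transitions for experience replay."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlations between
        # consecutive transitions, stabilizing learning.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```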
DQN uses an action selector for picking an action a in state s given the parameters θ.
Typically the action selector is an ε-greedy implementation that selects the action that
maximizes the Q-value with a probability of 1 − ε and chooses randomly with a
probability of ε. During the episodes, ε may have a fixed value or may decrease over
time.
3.5 Gumbel-Softmax
The Gumbel-Softmax distribution is a continuous relaxation of a discrete categorical
distribution. One of its most important properties is that it is reparameterizable,
allowing gradients to flow during backpropagation; the main contribution of this work
is a reparameterization trick for the categorical distribution. The Gumbel-Max trick
(Gumbel, 1954 [6]) provides an efficient way to draw samples z from the categorical
distribution with class probabilities πi:

z = one_hot(argmaxi [gi + log(πi)])

where g1, …, gk are independent and identically distributed samples drawn from
Gumbel(0, 1). Argmax is not differentiable, so we approximate it continuously using
the softmax function:

yi = exp((log(πi) + gi)/τ) / Σj=1..k exp((log(πj) + gj)/τ)    for i = 1, …, k.
τ is a temperature parameter that allows us to control how closely samples from the
Gumbel-Softmax distribution approximate those from the categorical distribution. As
the temperature increases (τ = 1.0, τ = 10.0), the expected value converges to a uniform
distribution over the categories. As the softmax temperature τ approaches 0, samples
from the Gumbel-Softmax distribution become one-hot and the Gumbel-Softmax
distribution becomes identical to the categorical distribution p(z) (Maddison et al.,
2016 [12]; Jang et al., 2016 [10]).
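Both the Gumbel-Max trick and its softmax relaxation can be sketched in NumPy (an illustrative sample-level implementation of our own; in practice the relaxation is used inside an automatic differentiation framework so that gradients can flow through y):

```python
import numpy as np

def sample_gumbel(shape, eps=1e-20):
    # Gumbel(0, 1) via inverse transform: -log(-log(U)), U ~ Uniform(0, 1)
    u = np.random.uniform(0.0, 1.0, shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_max(log_pi):
    # Gumbel-Max trick: exact one-hot sample from the categorical distribution
    g = sample_gumbel(log_pi.shape)
    z = np.zeros_like(log_pi)
    z[np.argmax(log_pi + g)] = 1.0
    return z

def gumbel_softmax(log_pi, tau):
    # Continuous relaxation: softmax((log π + g) / τ); as τ → 0 the sample
    # approaches one-hot, as τ → ∞ it approaches uniform.
    g = sample_gumbel(log_pi.shape)
    y = np.exp((log_pi + g) / tau)
    return y / y.sum()
```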