
A SEMINAR REPORT ON

IMPLEMENTATION OF MONTE-CARLO SEARCH TREE METHODS IN DECISION MAKING AND RTS GAMES

SUBMITTED TO SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE

IN THE PARTIAL FULFILLMENT OF THE REQUIREMENTS

OF

THIRD YEAR OF COMPUTER ENGINEERING

BY

GHARTI GAURAV TULARAM Exam No: T120574234

DEPARTMENT OF COMPUTER ENGINEERING

STES’S SINHGAD INSTITUTE OF TECHNOLOGY AND SCIENCE

NARHE,

PUNE- 411041

April 2017

CERTIFICATE

This is to certify that the seminar report entitled

“IMPLEMENTATION OF MONTE-CARLO SEARCH TREE METHODS IN DECISION MAKING AND RTS GAMES”

Submitted by

Gharti Gaurav Tularam Exam No: T120574234

is a bonafide work carried out by the candidate under the supervision of Prof. Nancy Peter and is approved for the partial fulfillment of the requirement of Savitribai Phule Pune University, Pune for the award of the third year of Computer Engineering.

(Prof. Nancy Peter)

Guide,

Department of Computer Engineering

(Prof. Mrs. G. S. Navale)

Head,

Department of Computer Engineering

(Dr. S. N. Mali)

Principal,

SINHGAD INSTITUTE OF TECHNOLOGY AND SCIENCE, NARHE – 41

Place: Pune

Date:

Abbreviations

1. MCTS Monte Carlo Tree Search

2. RTS Real Time Strategy

3. MDP Markov Decision Process

4. UCB Upper Confidence Bounds

5. UCT Upper Confidence Bounds for Trees

6. TDL Temporal Difference Learning

7. TDMC Temporal Difference with Monte-Carlo

8. BAAL Bandit-Based Active Learner

9. ISUCT Information Set Upper Confidence bounds for trees

10. NRPA Nested Rollout Policy Adaptation

11. FSSS Forward Search Sparse Sampling

12. FSM Finite State Machine

13. NPC Non-player Character

14. AI Artificial Intelligence

List of Figures

Figure No. Title

1.1 Monte Carlo Tree Search

2.1 Monte Carlo Tree Search Stages with probabilities of nodes

Abstract

We discuss how computational intelligence and artificial intelligence are implemented in games by applying Monte Carlo Tree Search (MCTS) methods. The field of computational intelligence and artificial intelligence in games has seen great success over the past 10 years, and several enhancements have been introduced in order to adapt MCTS to the real-time domain.

Computational intelligence is the study of the design of intelligent agents. Artificial intelligence can be defined as the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision making, and translation between languages.

Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in game play. A leading example of Monte Carlo tree search is current computer Go programs, but it has also been used in other board games, as well as in real-time video games and non-deterministic games.

The purpose of this report is to apply the Monte Carlo Tree Search algorithm to analyze the real-time decision making of a game character in diverse situations with different outcomes, including the costs, benefits, and drawbacks of each decision. We also discuss game theory, path finding, genetic programming, neural networks and RTS gaming.

Keywords: Monte Carlo tree search, Artificial Intelligence, Computational Intelligence, RTS Gaming, Game Theory

Acknowledgement

I take this opportunity to thank my internal guide, Prof. Nancy Peter, for giving me guidance and support throughout the seminar. The valuable guidelines and suggestions I received were very helpful.

I wish to express my thanks to Prof. G. S. Navale, Head of the Computer Engineering Department, Sinhgad Institute of Technology and Science, Narhe, for all the help and important suggestions throughout the seminar work. I thank all the teaching and non-teaching staff members for their indispensable support and valuable suggestions.

I also thank my friends and family for their help in collecting data, without which this seminar report could not have been completed. Finally, my special thanks go to Dr. S. N. Mali, Principal, Sinhgad Institute of Technology and Science, Narhe, for providing an ambience in the college that motivates us to work.

Gharti Gaurav Tularam

Table of Contents

Certificate

Certificate of Plagiarism

List of Abbreviations

List of Figures

Abstract

CHAPTER TITLE

1. INTRODUCTION

1.1 BACKGROUND

1.2 OBJECTIVE

1.3 RELEVANCE

1.4 ORGANIZATION OF SEMINAR REPORT

2. LITERATURE SURVEY

2.1 INTRODUCTION

2.2 EXISTING METHODOLOGIES

2.2.1 Bandit-Based Methods

2.2.2 Upper Confidence Bounds for Trees (UCT)

2.2.3 Hybrid Navigation System

2.2.4 Backpropagation and Move Selection

3. PROBLEM DEFINITION

3.1 PROBLEM STATEMENT

3.2 VISION DOCUMENT

4. MATHEMATICAL MODEL

5. PROPOSED SOLUTION

5.1 Simulation Enhancements

5.2 Backpropagation

5.3 Parallelization

5.4 Considerations for Using Enhancements

6. LIMITATION AND FUTURE SCOPE

6.1 Limitations

6.2 Future Scope

7. CONCLUSION

8. REFERENCES

1. INTRODUCTION

1.1 BACKGROUND:

This section presents the background on Computational Intelligence, Artificial Intelligence, and Monte-Carlo Tree Search and its different methods. All of these methods are used for decision making and are related to game theory.

The section covers Monte-Carlo Tree Search, game theory, decision making, and bandit-based methods. The main focus, however, is on the Monte-Carlo Tree Search algorithm and its variants, such as the bandit-based and UCT algorithms.

The Monte-Carlo Tree Search algorithm is used for optimal decision making by a game agent. It is an iterative algorithm with four stages: first the Selection stage, second the Expansion stage, third the Simulation stage, and last the Backpropagation stage, after which the next iteration begins.

Decision Theory

Decision theory is defined as “the mathematical study of policies for optimal decision making between options involving diverse risks or opportunities of gain or loss depending on the outcome”.

Game Theory

Game theory is the branch of mathematics concerned with the study of strategies for dealing with competitive situations in which the outcome of a participant's choice of action depends critically on the actions of other game agents. Game theory has been applied in contexts such as war, business, and biology.

1.2 OBJECTIVE:

The main objectives of this survey are —

• To study how the Monte-Carlo Tree Search algorithm makes optimal decisions for the Ghosts in Pac-Man and for other RTS-game agents.

• To analyze the current state of Artificial Intelligence and Computational Intelligence in RTS games and various other fields.

1.3 RELEVANCE:

This report focuses on the Monte-Carlo Tree Search algorithm, which is used for optimal decision making: its different methodologies, how the next decision is made, which parameters are considered, when it stops, its upper and lower bounds, and its benefits and limitations. The algorithm iterates over four stages. First is the selection stage, in which a node is selected. Second is the expansion stage, in which the next node is chosen for expansion. Third is the simulation stage, in which the outcome of choosing that node is estimated by simulating various possibilities so that the optimal one can be preferred. Last is the backpropagation stage, in which the result is propagated back along the path and stored for use in future iterations.

Figure 1.1: Monte-Carlo Tree Search

1.4 ORGANIZATION OF SEMINAR REPORT:

The remaining chapters of the report are structured as follows. Chapter 2, Literature Survey, gives an overview of the existing methodologies used in the domain. Chapter 3, Problem Definition, presents the inferences drawn from the literature review, shaped into a statement that can be used to address various problems related to the topic. Chapter 4, Mathematical Model, gives a mathematical expression derived for the problem statement. Chapter 5, Proposed Solution, presents what we consider the most feasible solution for the topic. Chapter 6 covers Limitations and Future Scope, Chapter 7 gives the Conclusion, and the report ends with the References.

2. LITERATURE SURVEY

2.1 INTRODUCTION:

In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in game play. A leading example of Monte Carlo tree search is recent computer Go programs, but it has also been used in other board games, as well as in real-time video games and non-deterministic games such as poker. Monte Carlo tree search focuses on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. Its application in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts.

The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player and then choose the move that led to the most victories. The efficiency of this method, called Pure Monte Carlo Game Search, often increases with time as more playouts are assigned to the moves that have frequently resulted in the player's victory in previous playouts. Full Monte Carlo tree search applies this principle recursively at many depths of the game tree. Each round of Monte Carlo tree search consists of four steps:

• Selection: start from root R and select successive child nodes down to a leaf node L. The section below says more about a way of choosing child nodes that lets the game tree expand towards most promising moves, which is the essence of Monte Carlo tree search.

• Expansion: unless L ends the game with a win/loss for either player, create one or more child nodes and choose node C from among them.

• Simulation: play a random playout from node C. This step is sometimes also called playout or rollout.

• Backpropagation: use the result of the playout to update information in the nodes on the path from C to R.

Sample steps from one round are shown in the figure below. Each tree node stores the number of won/played playouts.

Figure 2.1: Monte-Carlo Tree Search Stages with Probabilities of nodes
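The four steps listed above can be sketched in code. The following is a minimal illustration rather than a reference implementation: the game (single-pile Nim, where a player removes 1 to 3 stones and whoever takes the last stone wins), the class name Node, and all function names are assumptions chosen for this example. Selection uses the UCB1-style score discussed in Section 2.2.2.

```python
import math
import random

class Node:
    def __init__(self, stones, player, parent=None, move=None):
        self.stones = stones      # stones left in the pile
        self.player = player      # player to move at this node (0 or 1)
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node

def select(node, c=1 / math.sqrt(2)):
    # Selection: descend while the node is fully expanded and not terminal.
    while not node.untried and node.children:
        node = max(node.children,
                   key=lambda ch: ch.wins / ch.visits
                   + 2 * c * math.sqrt(2 * math.log(node.visits) / ch.visits))
    return node

def expand(node):
    # Expansion: add one child for an untried move (terminal nodes are returned as-is).
    if node.untried:
        move = node.untried.pop()
        child = Node(node.stones - move, 1 - node.player, parent=node, move=move)
        node.children.append(child)
        return child
    return node

def simulate(node, rng):
    # Simulation: random playout; returns the player who takes the last stone.
    stones, player = node.stones, node.player
    if stones == 0:
        return 1 - player         # the previous mover already took the last stone
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return player
        player = 1 - player

def backpropagate(node, winner):
    # Backpropagation: update statistics on the path from the new node to the root.
    while node is not None:
        node.visits += 1
        if node.parent is not None and node.parent.player == winner:
            node.wins += 1
        node = node.parent

def mcts(stones, iterations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(stones, player=0)
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf, rng))
    # Return the move of the most-visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).move
```

For example, mcts(5) searches a pile of five stones for the first player. Each node stores the won/played counts mentioned above as wins and visits.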

2.2 EXISTING METHODOLOGIES:

This section gives a detailed description of the methods and approaches currently associated with Monte-Carlo Tree Search, which helps in understanding the technical aspects of the topic. The following are some of the methodologies that drew our attention:

2.2.1 Bandit-Based Methods

Bandit problems are a well-known class of sequential decision problems, in which one needs to choose among $K$ actions (e.g., the $K$ arms of a multiarmed bandit slot machine) in order to maximize the cumulative reward by consistently taking the optimal action. The choice of action is difficult because the underlying reward distributions are unknown, and potential rewards must be estimated based on past observations. This leads to the exploitation-exploration dilemma: one needs to balance the exploitation of the action currently believed to be optimal with the exploration of other actions that currently appear suboptimal but may turn out to be superior in the long run.

A $K$-armed bandit is defined by random variables $X_{i,n}$ for $1 \le i \le K$ and $n \ge 1$, where $i$ indicates the arm of the bandit. Successive plays of bandit $i$ yield $X_{i,1}, X_{i,2}, \ldots$, which are independently and identically distributed according to an unknown law with unknown expectation $\mu_i$. The $K$-armed bandit problem may be approached using a policy that determines which bandit to play, based on past rewards.

1) Regret: The policy should aim to minimize the player's regret, which is defined after $n$ plays as

$$R_n = \mu^{*} n - \sum_{j=1}^{K} \mu_j \, \mathbb{E}[T_j(n)]$$

where $\mu^{*}$ is the best possible expected reward and $\mathbb{E}[T_j(n)]$ denotes the expected number of plays for arm $j$ in the first $n$ trials. In other words, the regret is the expected loss due to not playing the best bandit. It is important to highlight the necessity of assigning nonzero probabilities to all arms at all times, in order to ensure that the optimal arm is not missed because of temporarily promising rewards from a suboptimal arm. It is therefore important to place an upper confidence bound on the rewards observed so far that ensures this.
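As an illustration of the regret definition, the sketch below plays a simulated bandit with the UCB1 policy and reports the regret. It is a minimal example under stated assumptions: the arms pay Bernoulli rewards with the given means, the function name ucb1_bandit is hypothetical, and the regret is computed from the realised play counts of a single run rather than from their expectation.

```python
import math
import random

def ucb1_bandit(means, plays, seed=0):
    """Play a Bernoulli K-armed bandit with UCB1; return (play counts, regret)."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # T_j: number of plays of arm j so far
    totals = [0.0] * k    # summed reward collected from arm j
    for t in range(plays):            # t plays have been made so far
        if t < k:
            arm = t                   # play every arm once to get a first estimate
        else:
            # Mean reward plus an upper confidence term that shrinks with T_j.
            arm = max(range(k), key=lambda j: totals[j] / counts[j]
                      + math.sqrt(2 * math.log(t) / counts[j]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    # Regret: mu* n - sum_j mu_j T_j(n), using the realised counts of this run.
    regret = max(means) * plays - sum(m * c for m, c in zip(means, counts))
    return counts, regret
```

Calling ucb1_bandit([0.1, 0.9], 1000) returns how often each arm was played. Because the confidence term never vanishes, the weaker arm continues to be tried occasionally, which is the nonzero-probability property described above.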

2.2.2 Upper Confidence Bounds for Trees (UCT)

This section describes the most popular algorithm in the MCTS family, the upper confidence bound for trees (UCT) algorithm, in detail.

1) The UCT Algorithm: The goal of MCTS is to approximate the (true) game-theoretic value of the actions that may be taken from the current state. This is accomplished by iteratively building a partial search tree. How the tree is built depends on how nodes in the tree are selected. The success of MCTS, especially in Go, is primarily due to this tree policy. In particular, Kocsis and Szepesvári proposed the use of UCB1 as the tree policy. In treating the choice of child node as a multiarmed bandit problem, the value of a child node is the expected reward approximated by the Monte Carlo simulations, and hence these rewards correspond to random variables with unknown distributions. UCB1 has some promising properties: it is very simple and efficient, and it is guaranteed to be within a constant factor of the best possible bound on the growth of regret. It is thus a promising candidate to address the exploration-exploitation dilemma in MCTS: every time a node (action) is to be selected within the existing tree, the choice can be modeled as an independent multiarmed bandit problem. A child node $j$ is selected to maximize

$$UCT = \bar{X}_j + 2 C_p \sqrt{\frac{2 \ln n}{n_j}}$$

where $\bar{X}_j$ is the average reward of child $j$, $n$ is the number of times the current (parent) node has been visited, $n_j$ is the number of times child $j$ has been visited, and $C_p > 0$ is a constant. If more than one child node has the same maximal value, the tie is usually broken randomly. The values of $X_{i,t}$ and thus of $\bar{X}_j$ are understood to be within [0,1] (this holds true for both the UCB1 and the UCT proofs). It is generally understood that $n_j = 0$ yields a UCT value of $\infty$, so that previously unvisited children are assigned the largest possible value, to ensure that all children of a node are considered at least once before any child is expanded further. This results in a powerful form of iterated local search.

There is an essential balance between the first (exploitation) and second (exploration) terms of the UCB equation. As each node is visited, the denominator of the exploration term increases, which decreases its contribution. On the other hand, if another child of the parent node is visited, the numerator increases and hence the exploration values of unvisited siblings increase. The exploration term ensures that each child has a nonzero probability of selection, which is essential given the random nature of the playouts. This also imparts an inherent restart property to the algorithm, as even low-reward children are guaranteed to be chosen eventually (given sufficient time), and hence different lines of play are explored. The constant $C_p$ in the exploration term can be adjusted to lower or increase the amount of exploration performed. The value $C_p = 1/\sqrt{2}$ was shown by Kocsis and Szepesvári to satisfy the Hoeffding inequality with rewards in the range [0,1]. With rewards outside this range, a different value of $C_p$ may be needed, and certain enhancements also work better with a different value for $C_p$.

The rest of the algorithm proceeds as described in Section 2.1: if the node selected by UCB descent has children that are not yet part of the tree, one of those is chosen randomly and added to the tree. The default policy is then used until a terminal state has been reached. In the simplest case, this default policy is uniformly random. The value of the terminal state is then backpropagated to all nodes visited during this iteration, from the newly added node to the root. Each node holds two values, the number N(v) of times it has been visited and a value Q(v) that corresponds to the total reward of all playouts that passed through this state (so that Q(v)/N(v) is an approximation of the node's game-theoretic value). Every time a node is part of a playout from the root, its values are updated. Once some computational budget has been reached, the algorithm terminates and returns the best move found, corresponding to the child of the root with the highest visit count. This description summarizes UCT as presented in several sources, adapted to remove the two-player, zero-sum, and turn-order constraints typically found in the existing literature.
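The UCT value and the final move choice can be written compactly. The sketch below is illustrative only: the function names and the representation of a child as a (Q, N) pair are assumptions made for this example, and ties are resolved by taking the first maximum rather than randomly.

```python
import math

def uct_value(total_reward, child_visits, parent_visits, cp=1 / math.sqrt(2)):
    """UCT score of one child: mean reward plus the exploration term."""
    if child_visits == 0:
        return math.inf                      # unvisited children are tried first
    exploit = total_reward / child_visits    # Q(v) / N(v)
    explore = 2 * cp * math.sqrt(2 * math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits, cp=1 / math.sqrt(2)):
    """children is a list of (Q, N) pairs; returns the index maximising UCT."""
    return max(range(len(children)),
               key=lambda j: uct_value(children[j][0], children[j][1],
                                       parent_visits, cp))

def best_move(children):
    """Final choice once the budget is spent: the most-visited child."""
    return max(range(len(children)), key=lambda j: children[j][1])
```

With children [(5.0, 10), (0.0, 0)] and a parent visited 10 times, select_child returns the unvisited second child, while best_move on the same list returns the first child because it has the higher visit count. Setting cp to 0 removes exploration and leaves only the mean reward.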

2.2.3 Hybrid Navigation System:

The hybrid navigation system has two parts. When no enemy unit or building is within sight range, agents navigate using A*. Local navigation methods such as flocking and potential fields tend to get stuck in complex terrain, since they do not “backtrack” when they reach a dead end. We avoid this problem by using A* to calculate the shortest path to the goal position.

A* is, on the other hand, not very suitable for positioning units. Agents move towards the goal without considering how to engage the enemy effectively in a combat situation. To solve this, the navigation system switches to flocking with the boids algorithm as soon as an enemy unit or building is within sight range. The boids algorithm is modified so that agents try to keep a distance to the enemy close to the maximum shooting distance of their own weapons, while at the same time keeping the squad (the group of units an agent belongs to) gathered together. Agents should also avoid colliding with other friendly agents and with obstacles.
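The switching rule and the A* part of the hybrid system can be sketched as follows. This is a simplified illustration under stated assumptions: the map is a small 4-connected grid of free (0) and blocked (1) cells, the function names astar and navigation_mode are hypothetical, and the modified boids behaviour itself is not implemented.

```python
import heapq
import math

def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of 0 (free) / 1 (blocked) cells."""
    def h(cell):
        # Manhattan distance: an admissible heuristic on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]        # entries are (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while frontier:
        _, g, cur = heapq.heappop(frontier)
        if cur == goal:
            path = []
            while cur is not None:           # walk back to the start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        if g > cost[cur]:
            continue                         # stale queue entry
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if inside and grid[nxt[0]][nxt[1]] == 0:
                new_cost = g + 1
                if new_cost < cost.get(nxt, math.inf):
                    cost[nxt] = new_cost
                    came_from[nxt] = cur
                    heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None                              # goal unreachable

def navigation_mode(agent, enemies, sight_range):
    """A* while no enemy is in sight; flocking as soon as one is."""
    in_sight = any(math.dist(agent, e) <= sight_range for e in enemies)
    return "flocking" if in_sight else "astar"
```

On the grid [[0, 0, 0], [1, 1, 0], [0, 0, 0]], a path from (0, 0) to (2, 0) must go around the wall in the middle row, which is the kind of detour that a purely local method can fail to find.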

2.2.4 Backpropagation and Move Selection:

Results are backpropagated to the root from either the expanded leaf node or the internal node at which the selection step stopped, using maximum backpropagation. Scores stored at each internal node represent the maximum scores of its children under the current tactic. Whereas MCTS implementations for games traditionally use average backpropagation, maximization is applied to MCTS for Ms Pac-Man. During the search, when a state is reached, the maximum score over its children is returned to the parent. The maximum is used because every move Pac-Man can make at a junction can have altogether different results [15]. For example, suppose that at a given junction Pac-Man has two options to move. A decision to go left leads to a loss of life for Pac-Man in all simulations, whereas a decision to go right is determined to be safe in all simulations. When using average values, the resulting score is 0.5, whereas maximum backpropagation results in the correct estimate of 1. After each simulation, results are incremented starting at the node from which the playout step was initiated. Next, the values of the parent node are set to the maximum values over all its children.
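The junction example can be reproduced with a few lines of code. The sketch is a simplified illustration, not the implementation from [15]: the Node class, its field names, and the single survival score per playout are assumptions made for this example.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.total = 0.0        # summed playout scores (for the average)
        self.visits = 0
        self.max_score = 0.0    # value maintained by maximum backpropagation
        if parent is not None:
            parent.children.append(self)

    @property
    def mean(self):
        return self.total / self.visits if self.visits else 0.0

def backpropagate(leaf, score):
    # The node where the playout started accumulates the result.
    leaf.total += score
    leaf.visits += 1
    leaf.max_score = leaf.mean
    # Every ancestor takes the maximum over its children; the running
    # average is kept alongside it only for comparison.
    node = leaf.parent
    while node is not None:
        node.total += score
        node.visits += 1
        node.max_score = max(ch.max_score for ch in node.children)
        node = node.parent

junction = Node()
left, right = Node(junction), Node(junction)
for _ in range(10):
    backpropagate(left, 0.0)    # going left always loses a life
    backpropagate(right, 1.0)   # going right is always safe
```

After these twenty playouts, junction.mean is 0.5 while junction.max_score is 1.0, matching the two estimates discussed in the text.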

3. PROBLEM DEFINITION

3.1 PROBLEM STATEMENT:

To find an optimal path so that correct decisions can be made efficiently, to address the exploration-exploitation problem, and to enhance Monte-Carlo Tree Search so that it produces better results.

3.2 VISION DOCUMENT:

From the literature survey, we noticed some key issues and challenges that have not been addressed yet and that require attention for a better implementation. These can be listed as follows:

• Failure to identify and tune a balancing factor.

• Inefficient interaction between a parent node and a child node.

• Functional aspects of a child node not utilized to their fullest.

• Ensuring that the search is both time-efficient and space-efficient within a single environment at any instant of time (the main challenge).

Instead of concentrating on any one aspect, the requirement at the moment is to aim at providing a feasible solution encompassing almost all the aspects. The advantage of using a holistic approach is that overall improvement would be achieved in a shorter time span.

4. MATHEMATICAL MODEL

This section describes the Monte-Carlo Tree Search system in mathematical terms. The ontology is developed around the core concept of decision making, which is composed of one or more connected parent nodes, each of which is in turn composed of multiple parent-to-child connections. The ontology provides a shared vocabulary for describing concepts including nodes, depth, weight, state, policy and action. It also defines a set of attributes (i.e., time, path, number of times visited) and relationships (i.e., between parent node and child node) that hold between different concepts.

Following is the formal definition of a solution perspective ‘S’ used in developing the ontology:

S = {S1, S2, S3}

Where,

S1: MCTS: subClassOf: SC(C1) ⊆ SC(C2),

The semantic scope of C1 is narrower than that of C2.

The Decision-making class has 4 different subclasses.

Decision-making ⊆ {state, policy, action, balance-factor}

S2: P(i1, i2) states that i1 is related to i2 through property P.

belongTo(node)

embedO(parent, root)

publisherOf(child, terminal)

S3: i1, i2, ..., in: SC(C1), instances i1, i2, ..., in belong to class C1.

Following are examples of different instances:

{root, parent, child, terminal}: node

{state, action, policy}: MCTS
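One possible way to turn this vocabulary into a data structure is sketched below. It is an assumption-laden illustration rather than part of the formal model: the class name TreeNode and its methods are hypothetical, and only the concepts named above (state, action, weight, depth, path, visit count, and the parent-child relationship) are represented.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass(eq=False)
class TreeNode:
    state: Any
    action: Any = None                       # action that led here from the parent
    parent: Optional["TreeNode"] = field(default=None, repr=False)
    children: List["TreeNode"] = field(default_factory=list)
    visits: int = 0                          # attribute: number of times visited
    weight: float = 0.0
    terminal: bool = False

    def add_child(self, state, action, terminal=False):
        child = TreeNode(state, action, parent=self, terminal=terminal)
        self.children.append(child)
        return child

    @property
    def depth(self):
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d

    def path(self):
        """Actions taken from the root to reach this node."""
        actions, node = [], self
        while node.parent is not None:
            actions.append(node.action)
            node = node.parent
        return actions[::-1]

    def kinds(self):
        """Which of the instance labels {root, parent, child, terminal} apply."""
        labels = {"root"} if self.parent is None else {"child"}
        if self.children:
            labels.add("parent")
        if self.terminal:
            labels.add("terminal")
        return labels
```

A node can carry several labels at once: an inner node is both a child of its parent and a parent of its own children.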

5. PROPOSED SOLUTION

This section describes enhancements to features of the core MCTS algorithm other than its tree policy. These include modifications to the default policy (which are typically domain dependent and involve heuristic knowledge of the problem being modeled) and other, more general modifications related to the backpropagation phase and parallelization.

5.1 Simulation Enhancements: The default simulation policy for MCTS is to select randomly among the available actions. This has the advantage that it is simple, requires no domain knowledge, and repeated trials will most likely cover different areas of the search space, but the games played are not likely to be realistic compared to games played by rational players. A popular class of enhancements makes the simulations more realistic by incorporating domain knowledge into the playouts.
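One simple way to incorporate domain knowledge into a playout is an epsilon-greedy policy, which follows a heuristic most of the time and falls back to the uniformly random default otherwise. The sketch below is only an illustration of this idea: the function names, the epsilon parameter, and the Nim heuristic (prefer leaving a multiple of four stones) are assumptions made for this example.

```python
import random

def playout_move(moves, heuristic, rng, epsilon=0.1):
    """Epsilon-greedy playout policy over a list of legal moves."""
    if rng.random() < epsilon:
        return rng.choice(moves)         # uniformly random, as in the default policy
    return max(moves, key=heuristic)     # domain knowledge picks the move

def nim_heuristic(stones):
    """Hypothetical heuristic for single-pile Nim: prefer leaving a multiple of 4."""
    return lambda take: 1.0 if (stones - take) % 4 == 0 else 0.0
```

For a pile of six stones, playout_move([1, 2, 3], nim_heuristic(6), random.Random(0), epsilon=0.0) picks 2, which leaves four stones. A nonzero epsilon keeps some of the diversity of purely random playouts.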

5.2 Backpropagation Enhancements: Modifications to the backpropagation step typically involve special node updates required by other enhancement methods for advanced planning, but some constitute enhancements in their own right.

5.3 Parallelization: The independent nature of each simulation in MCTS means that the algorithm is a good target for parallelization. Parallelization has the advantage that more simulations can be performed in a given amount of time and that the wide availability of multicore processors can be exploited. However, parallelization raises issues such as the combination of results from different sources in a single search tree, and the synchronization of threads of different speeds over a network.
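One scheme from the MCTS literature, usually called root parallelization, lets several workers search independently and then merges their statistics for the moves at the root. The sketch below illustrates only the merging step, under stated assumptions: each worker runs flat random playouts on single-pile Nim instead of a full tree search, the function names are hypothetical, and threads are used for brevity (in CPython a process pool would be needed for a real speed-up on CPU-bound playouts).

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker(stones, seed, playouts):
    """One independent searcher: random playouts, statistics per first move."""
    rng = random.Random(seed)            # each worker has its own generator
    plays, wins = Counter(), Counter()
    for _ in range(playouts):
        first = rng.randint(1, min(3, stones))
        left, mover = stones - first, 0  # player 0 has just made the first move
        while left > 0:
            mover = 1 - mover
            left -= rng.randint(1, min(3, left))
        plays[first] += 1
        wins[first] += int(mover == 0)   # the last mover took the final stone
    return plays, wins

def root_parallel_search(stones, workers=4, playouts=1000):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda s: worker(stones, s, playouts),
                                range(workers)))
    # Combine the results from the different sources into one set of statistics.
    plays, wins = Counter(), Counter()
    for p, w in results:
        plays.update(p)
        wins.update(w)
    best = max(plays, key=lambda m: wins[m] / plays[m])
    return best, plays, wins
```

Because the workers share nothing until the merge, no locking is required, at the cost of each worker exploring without knowledge of what the others have found.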

5.4 Considerations for Using Enhancements: MCTS works well in some domains but not in others. The many enhancements described in this section and the previous one also have different levels of applicability to different domains. This section describes efforts to understand the situations in which MCTS and its enhancements may or may not work, and what conditions might cause problems.

6. LIMITATION AND FUTURE SCOPE

6.1 Limitations:

Combining the accuracy of tree search with the generality of random sampling in MCTS has provided stronger decision making in a wide variety of games. However, there are clear challenges for domains where the branching factor and depth of the graph to be searched make naive application of MCTS, or indeed any other search algorithm, infeasible. This is particularly the case for video game and real-time control applications, where a systematic way to incorporate knowledge is required in order to restrict the subtree to be searched. Another issue arises when simulations are very CPU intensive and MCTS must learn from relatively few samples. Work on Bridge and Scrabble shows the potential of very shallow searches in this case, but it remains an open question whether MCTS is the best way to direct simulations when relatively few can be carried out. Although basic implementations of MCTS provide effective play for some domains, results can be weak if the basic algorithm is not enhanced. This review presents the wide range of enhancements considered in the short time to date. There is at present no better way than a manual, empirical study of the effect of enhancements to obtain acceptable performance in a particular domain. A primary weakness of MCTS, shared by most search heuristics, is that the dynamics of search are not yet fully understood, and the impact of decisions concerning parameter settings and enhancements to basic algorithms is hard to predict. Work to date shows promise.

6.2 Future Scope:

Future research in MCTS will likely be directed toward:

• improving MCTS performance in general;

• improving MCTS performance in specific domains;

• understanding the behavior of MCTS.

It seems likely that there will continue to be considerable effort on game-specific enhancements to MCTS for Go and other games.

7. CONCLUSION

Ms Pac-Man is an interesting subject for AI research because of many of its characteristics. Agents need to consider both long-term and short-term goals. Moreover, its real-time nature makes it hard for algorithms such as MCTS, which have to consider a multitude of simulations. Based on observations, our agent made between 200 and 400 simulations for each search. Investigating domain-independent methods and enhancements for working with such games could lead to a better understanding of other real-time domains.

There are two main points we want to make in this conclusion: good abstractions are vital to RTS AI, and dealing with incomplete information is mandatory. Both are addressed directly by Bayesian models.

MCTS has become the preeminent approach for many challenging games, and its application to a broader range of domains has also been demonstrated. This report has surveyed MCTS methods, describing the fundamentals of the algorithm, major variations and enhancements, and a representative set of problems to which it has been applied.

8. REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. II: Approximate Dynamic Programming, 4th ed. Belmont, MA, USA: Athena Scientific, 2012.

[2] M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone, “A neuroevolution approach to general Atari game playing,” IEEE Trans. Comput. Intell. AI in Games, 2014, (in press).

[3] A. Liapis, G. N. Yannakakis, and J. Togelius, “Sentient world: Human-based procedural cartography,” in Evolutionary and Biologically Inspired Music, Sound, Art and Design. Berlin, Germany: Springer-Verlag, 2013, pp. 180–191.

[4] H. Campos, J. Campos, J. Cabral, C. Martinho, J. H. Nielsen, and A. Paiva, “My dream theatre,” in Proc. 2013 Int. Conf. Autonom. Agents Multi-Agent Syst., 2013, pp. 1357–1358, International Foundation for Autonomous Agents and Multiagent Systems.

[5] C. W. Reynolds, “Flocks, herds, and schools: A distributed behavioral model,” in Proc. Comput. Graph. (SIGGRAPH '87), 1987, vol. 21, no. 4, pp. 25–34.

[6] “QueryPerformanceCounter function,” 2014 [Online]. Available: msdn.microsoft.com/en-us/library/windows/desktop/ms644904(v=vs.85).aspx
