{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "73c6d255-0c32-4895-9a22-e95eadb25103",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pygame 2.5.1 (SDL 2.28.2, Python 3.11.5)\n",
"Hello from the pygame community. https://www.pygame.org/contribute.html\n"
]
}
],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from collections import namedtuple\n",
"from IPython.core.debugger import Pdb\n",
"from IPython.display import display, clear_output\n",
"\n",
"from QNetwork import neuralnetwork_regression as nn\n",
"from GameEngine import multiplayer\n",
"from QTable import qtsnake\n",
"\n",
"Point = namedtuple('Point', 'x, y')"
]
},
{
"cell_type": "markdown",
"id": "b3aab739-e016-4700-89c9-41f3c2f536cf",
"metadata": {},
"source": [
"### Representing Q-function using Neural Networks\n",
"\n",
    "In the last notebook, I represented my Q-function in a simple lookup table. This notebook offers a different approach by using a neural network. The function we want the neural network to learn is, of course, the snake's Q-function, which maps a state-action pair onto an expected reward.\n",
"\n",
"#### Benefits of a Q-network\n",
"\n",
    "The distinction between a Q-table and a Q-network is that a Q-network contains and updates a set of parameters (weights) which summarize previously seen data. A Q-table cannot do this, and thus is completely clueless in situations where it receives an input it has not seen or has not been trained on. In theory, this allows a neural network not only to represent environments with many more states, but also to make guesses about 'gaps' in its learning.\n",
    "\n",
    "This notebook will go over the training of a simple q-network, which maps a total of 32 different combinations of states and actions onto rewards, much like the previous q-table implementation from ***one_revised_snake_q_table.ipynb***.\n",
"\n",
"First, I will set up the game environment:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "682a7036-4f0d-4f3d-b147-6355c0a2f93e",
"metadata": {},
"outputs": [],
"source": [
"# defines game window size and block size, in pixels\n",
"WINDOW_WIDTH = 480\n",
"WINDOW_HEIGHT = 320\n",
"GAME_UNITS = 80"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "41cfbec9-e14e-4c58-95dd-2e3fb1788e72",
"metadata": {},
"outputs": [],
"source": [
    "game_engine = multiplayer.Playfield(window_width=WINDOW_WIDTH,\n",
    "                                    window_height=WINDOW_HEIGHT,\n",
    "                                    units=GAME_UNITS,\n",
    "                                    g_speed=35,\n",
    "                                    s_size=1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "804a13dc-7dd4-43f0-bc47-e781bc022075",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Game starting with 1 players.\n"
]
},
{
"data": {
"text/plain": [
"0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p1 = game_engine.add_player()\n",
"game_engine.start_game()\n",
"p1"
]
},
{
"cell_type": "markdown",
"id": "34efdb66-7a8e-4b48-a015-d1eb8a029915",
"metadata": {},
"source": [
    "Training for thousands of steps is a little slow with the graphics on. The difference is small here, but the drawing provides little information anyway, so I introduced a function to turn it off."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "acabac69-a92d-4415-b4ef-251fd1e965f7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Draw is now False.\n"
]
}
],
"source": [
"game_engine.toggle_draw()"
]
},
{
"cell_type": "markdown",
"id": "43cefedf-e005-4910-9b4c-953697aa3f26",
"metadata": {},
"source": [
"### State-sensing methods, defining reinforcement and greedy-action selector\n",
"\n",
    "I have also imported the aforementioned q_table implementation as qtsnake. It will come back at the end of the notebook when I pit the q_table and q_network against each other, but to make the game fair, I'll use the exact same state-sensing method:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "71c97804-74d3-4248-bdb7-5519aa02b556",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<function QTable.qtsnake.sense_goal(head, goal)>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qtsnake.sense_goal"
]
},
{
"cell_type": "markdown",
"id": "e065f223-9e19-4f21-ba75-8d44fc62d353",
"metadata": {},
"source": [
"Even though I plan to only call it when selecting a greedy_action, I'll wrap it in a neat 'query_state' function:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "26b8f8bf-ad08-40f8-847f-88351e262c1d",
"metadata": {},
"outputs": [],
"source": [
    "def query_state(id):\n",
    "    '''\n",
    "    given a player's id,\n",
    "    returns their state\n",
    "    '''\n",
    "    heads, _, goal = game_engine.get_heads_tails_and_goal()\n",
    "    return np.array(qtsnake.sense_goal(heads[id], goal))"
]
},
{
"cell_type": "markdown",
"id": "7d61e508-0661-4893-a720-f0a511c52809",
"metadata": {},
"source": [
    "And now the reinforcement function. Because I took away the requirement to sense danger, we only need two outputs from the reinforcement function. In almost every case, the snake is not allowed to choose an action that would collide with its own tail.\n",
    "\n",
    "The output of this function was chosen because it performed best. In reality, the reinforcement for non-goals will never be used. I prefer the simplicity of using the discount factor to force agents to the goal quickly. This is because, with heavier discounting, the snake prioritizes actions that result in more immediate rewards. An alternative approach which I tried is to punish the agent for each unnecessary step."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "0af0a115-83b9-498a-8228-dc79580131f1",
"metadata": {},
"outputs": [],
"source": [
    "def reinforcement(outcome):\n",
    "    '''\n",
    "    given an outcome of an action,\n",
    "    returns associated reward\n",
    "    '''\n",
    "    if outcome == multiplayer.CollisionType.GOAL:\n",
    "        return -3\n",
    "    return 0"
]
},
{
"cell_type": "markdown",
"id": "45e6040c-9aae-4f9e-8ef6-cf23b4043622",
"metadata": {},
"source": [
    "For this version of the epsilon-greedy function, I wanted an interface similar to the one in the ***one_revised_snake_q_table.ipynb*** notebook. The function operates in the same way: it collects the expected reward for each viable action in the current state into a list, then selects the action at the argmin of that list (the goal reinforcement is negative, so lower values are better). It also returns the expected reward for the chosen action, because it is needed later for learning with discounted rewards."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "a76fd63a-478a-43ad-91ce-df1dff03e565",
"metadata": {},
"outputs": [],
"source": [
    "def pick_greedy_action(q_net, id, epsilon):\n",
    "    '''\n",
    "    given a q network, the id of the player\n",
    "    taking action, and a randomization factor,\n",
    "    returns the most rewarding non-lethal action\n",
    "    or a non-lethal random action and expected reward\n",
    "    '''\n",
    "    viable_actions = game_engine.get_viable_actions(id)\n",
    "    state = query_state(id)\n",
    "\n",
    "    if viable_actions.size < 1:\n",
    "        best_action = 0\n",
    "    elif np.random.uniform() < epsilon:\n",
    "        best_action = np.random.choice(viable_actions)\n",
    "    else:\n",
    "        qs = [q_net.use(np.hstack(\n",
    "            (state, action)).reshape((1, -1))) for action in viable_actions]\n",
    "        best_action = viable_actions[np.argmin(qs)]\n",
    "\n",
    "    X = np.hstack((state, best_action))\n",
    "    q = q_net.use(X.reshape((1, -1)))\n",
    "\n",
    "    return X, q"
]
},
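  {
   "cell_type": "markdown",
   "id": "sketch-epsilon-greedy-md",
   "metadata": {},
   "source": [
    "To make the selection rule concrete, here is a minimal standalone sketch of epsilon-greedy selection over a restricted set of viable actions. It does not use the game engine or the network: `demo_q`, `demo_rng` and `epsilon_greedy_min` are hypothetical names introduced only for illustration, and the Q values are made up. As in the function above, lower values are treated as better."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-epsilon-greedy-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def epsilon_greedy_min(q_values, viable_actions, epsilon, rng):\n",
    "    '''\n",
    "    with probability epsilon returns a random viable action,\n",
    "    otherwise the viable action with the lowest q value\n",
    "    '''\n",
    "    viable_actions = np.asarray(viable_actions)\n",
    "    if rng.uniform() < epsilon:\n",
    "        return int(rng.choice(viable_actions))\n",
    "    # argmin is taken over the viable actions only\n",
    "    return int(viable_actions[np.argmin(q_values[viable_actions])])\n",
    "\n",
    "demo_rng = np.random.default_rng(0)\n",
    "# made-up q value per action; action 3 is lowest overall but not viable below\n",
    "demo_q = np.array([-0.5, -2.7, -1.0, -2.9])\n",
    "print(epsilon_greedy_min(demo_q, [0, 1, 2], 0.0, demo_rng))"
   ]
  },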
{
"cell_type": "markdown",
"id": "0c87558b-e6ce-4db2-a0cb-7bdb5dd70c75",
"metadata": {},
"source": [
"### Q-Learning with Temporal Difference, One Sample at a Time\n",
"\n",
    "Unlike the marble implementation, this training loop mirrors the one used for the q-table and does not use a make-samples function. This means I adjust the weights for a single sample at a time (batch size 1), setting the target of each intermediate step to the discounted value of the next step, 'bootstrapping' the learning process much like the temporal difference equation. Remember, the nature of this method is somewhat recursive, as it updates Q to agree with the best next value Q' (the minimum here, since the agent picks the argmin), which in turn is updated to agree with the best Q''..."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "06cd085e-77f4-4a22-9b1f-ec364b7737c5",
"metadata": {},
"outputs": [],
"source": [
    "def update_q(q, old_X, new_X, new_q, outcome, n_epochs, discount=0.9, lr=0.2):\n",
    "    '''\n",
    "    given a q network, the previous state/action pair,\n",
    "    the new state/action pair, the expected next reward,\n",
    "    the outcome of the last action, the number of epochs,\n",
    "    a discount factor (gamma), and the learning rate\n",
    "    updates q with discounted rewards.\n",
    "    '''\n",
    "    reward = reinforcement(outcome)\n",
    "    if outcome == multiplayer.CollisionType.GOAL:\n",
    "        # the action in new_X reached the goal, so train on that pair directly;\n",
    "        # the sum broadcasts to shape (1, 1) with value 2 * reward\n",
    "        q.train(np.array([new_X]),\n",
    "                np.array([reward]) + np.array([[reward]]),\n",
    "                n_epochs, lr, method='sgd', verbose=False)\n",
    "    else:\n",
    "        # otherwise bootstrap: pull Q(old_X) toward the discounted Q(new_X)\n",
    "        q.train(np.array([old_X]),\n",
    "                discount * np.array([new_q]), n_epochs,\n",
    "                lr, method='sgd', verbose=False)"
]
},
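  {
   "cell_type": "markdown",
   "id": "sketch-td-chain-md",
   "metadata": {},
   "source": [
    "The recursive nature of bootstrapping is easier to see on a toy problem. The sketch below is a tabular stand-in, not the network: it assumes a hypothetical one-dimensional corridor where entry `i` holds the value of being `i` steps away from the goal. The names `propagate_chain`, `DEMO_GOAL_REWARD` and `DEMO_DISCOUNT` are illustrative only, and the goal target here is simply the reward itself, which is a simplification of the function above. Each sweep pushes the goal value one step further down the chain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-td-chain-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "DEMO_GOAL_REWARD = -3.0  # negative, because lower values are better\n",
    "DEMO_DISCOUNT = 0.9\n",
    "\n",
    "def propagate_chain(n_states, n_sweeps):\n",
    "    '''\n",
    "    repeatedly sets each value to the discounted value\n",
    "    of its neighbor one step closer to the goal\n",
    "    '''\n",
    "    q_chain = np.zeros(n_states)\n",
    "    for _ in range(n_sweeps):\n",
    "        # sweep from the far end, so every update sees the previous sweep's value\n",
    "        for i in range(n_states - 1, 0, -1):\n",
    "            q_chain[i] = DEMO_DISCOUNT * q_chain[i - 1]\n",
    "        # the step that reaches the goal is assigned the reward directly\n",
    "        q_chain[0] = DEMO_GOAL_REWARD\n",
    "    return q_chain\n",
    "\n",
    "for demo_sweeps in range(1, 5):\n",
    "    print(demo_sweeps, propagate_chain(4, demo_sweeps))"
   ]
  },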
{
"cell_type": "markdown",
"id": "93e8aa26-d334-49d3-8640-ede35ba6f1ae",
"metadata": {},
"source": [
"#### Training\n",
"\n",
    "In this case, I already know our game world is limited to 32 inputs. In this minimal case, I don't necessarily care if the network is generalizable, so there is no real need for a test set, and no real downside to overfitting. My learning process will simply run the experiment for a set number of steps. This is not to say that my approach is not generalizable, only that there is no real way to know here.\n",
    "\n",
    "Through my exploration strategy, as well as a randomly initialized set of weights, the data passed into the neural network should thoroughly cover all possible inputs.\n",
"\n",
"To start, I initialize a few hyperparameters, discovered largely through trial-and-error, and create a new Q-network object:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f51c3238-c918-40a5-bf38-1456f4ed4ff5",
"metadata": {},
"outputs": [],
"source": [
"gamma = 0.9\n",
"n_epochs = 10\n",
"learning_rate = 0.015\n",
"\n",
"hidden_layers = [10]\n",
"q = nn.NeuralNetwork(2, hidden_layers, 1)\n",
"q.setup_standardization([5, 3.5], [4, np.sqrt(5.25)], [-.1], [0.2])"
]
},
{
"cell_type": "markdown",
"id": "ff9cf658-ec13-4810-9443-757b71663bbb",
"metadata": {},
"source": [
    "As a reminder, gamma is the discount factor, and the learning rate controls how quickly the weights are adjusted, much as it did in the temporal difference equation.\n",
    "\n",
    "In general, the number of epochs corresponds to the number of weight updates that occur per batch of samples. Often, large numbers result in poor generalizability, which, as mentioned, is not a priority due to the size of the Q-input pool.\n",
"\n",
"Similarly to before, I'll set up epsilon to decay exponentially over a 10,000 step training loop..."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "072ef9b7-86ec-4cbf-a315-dd6b4019fce6",
"metadata": {},
"outputs": [],
"source": [
"n_steps = 10000\n",
"epsilon = 1\n",
"final_epsilon = 0.05\n",
"epsilon_decay = np.exp(np.log(final_epsilon) / (n_steps))\n",
"epsilon_trace = np.zeros(n_steps)"
]
},
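  {
   "cell_type": "markdown",
   "id": "sketch-epsilon-decay-md",
   "metadata": {},
   "source": [
    "As a quick sanity check on the decay formula, the sketch below recomputes the per-step multiplier and confirms that multiplying by it once per step brings epsilon from 1 down to the final value after the last step. The names `decay_rate`, `demo_rate` and `demo_schedule` are hypothetical and kept separate from the training variables above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sketch-epsilon-decay-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def decay_rate(final_eps, total_steps):\n",
    "    '''\n",
    "    per-step multiplier such that\n",
    "    1.0 * rate ** total_steps == final_eps\n",
    "    '''\n",
    "    return np.exp(np.log(final_eps) / total_steps)\n",
    "\n",
    "demo_rate = decay_rate(0.05, 10000)\n",
    "# epsilon after each of the 10,000 steps, starting from 1\n",
    "demo_schedule = demo_rate ** np.arange(1, 10001)\n",
    "# first step, halfway point (about sqrt(0.05)), and last step\n",
    "print(demo_schedule[0], demo_schedule[4999], demo_schedule[-1])"
   ]
  },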
{
"cell_type": "markdown",
"id": "e2f54fd3-4899-4d83-bd6b-2b50a66a6b26",
"metadata": {},
"source": [
"And create a few classes and methods to plot the results:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "720a04aa-b53f-42d7-adf8-7c1a0958ff04",
"metadata": {},
"outputs": [],
"source": [
    "class Scoreboard():\n",
    "    ''' tracks game statistics '''\n",
    "    def __init__(self):\n",
    "        self.all_goals = 0\n",
    "        self._deaths = 0\n",
    "        self._goals = 0\n",
    "        self._max_goals = 0\n",
    "\n",
    "        self.goals = []\n",
    "        self.deaths = []\n",
    "        self.max_goals = []\n",
    "\n",
    "    def track_outcome(self, outcome):\n",
    "        if outcome == multiplayer.CollisionType.GOAL:\n",
    "            self._goals += 1\n",
    "            self.all_goals += 1\n",
    "            if self._goals > self._max_goals:\n",
    "                self._max_goals = self._goals\n",
    "        elif outcome == multiplayer.CollisionType.DEATH:\n",
    "            self._deaths += 1\n",
    "            self._goals = 0\n",
    "\n",
    "    def flush(self):\n",
    "        self.goals.append(self._goals)\n",
    "        self.deaths.append(self._deaths)\n",
    "        self.max_goals.append(self._max_goals)\n",
    "\n",
    "        self._reset()\n",
    "\n",
    "    def _reset(self):\n",
    "        self._deaths = 0\n",
    "        self._goals = 0\n",
    "        self._max_goals = 0"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "c86cea77-c3b9-44fa-becd-2d04d49b92cc",
"metadata": {},
"outputs": [],
"source": [
    "def plot_status(q, step, epsilon_trace, r_trace=None):\n",
    "    ''' plots training progress; reads the global scoreboard '''\n",
    "    plt.subplot(4, 3, 1)\n",
    "    plt.plot(epsilon_trace[:step + 1])\n",
    "    plt.ylabel(r'Random Action Probability ($\\epsilon$)')\n",
    "    plt.ylim(0, 1)\n",
    "\n",
    "    plt.subplot(4, 3, 2)\n",
    "    plt.plot(scoreboard.deaths)\n",
    "    plt.ylabel('Deaths')\n",
    "\n",
    "    plt.subplot(4, 3, 3)\n",
    "    plt.plot(scoreboard.goals)\n",
    "    plt.ylabel('Goals')\n",
    "\n",
    "    plt.subplot(4, 3, 4)\n",
    "    plt.plot(scoreboard.max_goals)\n",
    "    plt.ylabel('Max Score')\n",
    "\n",
    "    '''\n",
    "    plt.subplot(4, 3, 5)\n",
    "    plt.plot(r_trace[:step + 1], alpha=0.5)\n",
    "    binSize = 20\n",
    "    if step+1 > binSize:\n",
    "        # Calculate mean of every bin of binSize reinforcement values\n",
    "        smoothed = np.mean(r_trace[:int(step / binSize) * binSize].reshape((int(step / binSize), binSize)), axis=1)\n",
    "        plt.plot(np.arange(1, 1 + int(step / binSize)) * binSize, smoothed)\n",
    "    plt.ylabel('Mean reinforcement')\n",
    "    '''\n",
    "\n",
    "    plt.subplot(4, 3, 6)\n",
    "    q.draw(['$o$', '$a$'], ['q'])\n",
    "\n",
    "    plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"id": "79ad6521-1907-4fb3-8a33-a0e470e0a361",
"metadata": {},
"source": [
    "The logic behind this training loop is the same as in the q-table implementation, with added calls to the scoreboard and plotting functions, because I took the time to make each function interface the same. If I exported this code to a file, I might use higher-order functions to allow easy selection of either Q-function:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "00ca3585-8a11-4fd5-93d7-8e73bfc31e81",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x1000 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1000x1000 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
    "old_X, old_q = pick_greedy_action(q, p1, epsilon)\n",
    "game_engine.player_advance([old_X[1]])\n",
    "\n",
    "fig = plt.figure(figsize=(10, 10))\n",
    "scoreboard = Scoreboard()\n",
    "plot_spacing = 1000\n",
    "plotted_steps = 0\n",
    "\n",
    "# R = np.zeros((plot_spacing, 1))\n",
    "# r_trace = np.zeros(n_steps // plot_spacing)\n",
    "\n",
    "for step in range(n_steps):\n",
    "    new_X, new_q = pick_greedy_action(q, p1, epsilon)\n",
    "    outcomes = game_engine.player_advance([new_X[1]])\n",
    "    scoreboard.track_outcome(outcomes[p1])\n",
    "\n",
    "    # pass gamma explicitly so the discount set above is the one actually used\n",
    "    update_q(q, old_X, new_X, new_q, outcomes[p1], n_epochs,\n",
    "             discount=gamma, lr=learning_rate)\n",
    "\n",
    "    epsilon *= epsilon_decay\n",
    "    epsilon_trace[step] = epsilon\n",
    "    # R[step % plot_spacing, 0] = reinforcement(outcomes[p1])\n",
    "    old_X = new_X\n",
    "    old_q = new_q\n",
    "\n",
    "    if step >= plotted_steps:\n",
    "        # r_trace[plotted_steps // plot_spacing] = np.mean(R)\n",
    "        plotted_steps += plot_spacing\n",
    "        scoreboard.flush()\n",
    "        fig.clf()\n",
    "        plot_status(q, step, epsilon_trace)\n",
    "        scoreboard.all_goals = 0\n",
    "        clear_output(wait=True)\n",
    "        display(fig)"
]
},
{
"cell_type": "markdown",
"id": "4caa7801-017f-429c-ba14-7331fab1a68b",
"metadata": {},
"source": [
    "Like the q-table, the training process occasionally produces a bad agent. This is always due to the agent's exploration finding relatively few goals. To mitigate this, I specifically select a small environment to train on, because random actions are more likely to score goals there. A slower decay of epsilon would likely improve this further, though this is generally not an issue."
]
},
{
"cell_type": "markdown",
"id": "3bc3c79e-2162-4f3a-9e73-a1d789cc2bb0",
"metadata": {},
"source": [
"#### Viewing the Results: Multiplayer Snake\n",
"\n",
    "Now, I'll test the performance of the Q-network first by itself, in a simple loop that picks an action and advances the game. I'll turn the draw feature back on:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "269ac824-1568-49aa-a020-9a57ee59ae49",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Draw is now True.\n"
]
}
],
"source": [
"game_engine.toggle_draw()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "36a2d897-15a8-47a4-953b-a159af0ad881",
"metadata": {},
"outputs": [],
"source": [
    "epsilon = 0\n",
    "for step in range(500):\n",
    "    new_X, _ = pick_greedy_action(q, p1, epsilon)\n",
    "    game_engine.player_advance([new_X[1]])"
]
},
{
"cell_type": "markdown",
"id": "b208136a-90c0-40e9-ac73-e2536e903ae3",
"metadata": {},
"source": [
"If the agent gets stuck, you may want to retrain the agent by running the last few cells.\n",
"\n",
"Now for the fun part: Facing the agents off against each other..."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "b77b2db1-e928-4cd8-ae98-7f8ac9b1326f",
"metadata": {},
"outputs": [],
"source": [
"inferior_table = qtsnake.load_q('inferior_qt.npy')\n",
"superior_table = qtsnake.load_q('superior_qt.npy')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "1022bbdf-c68d-4e02-89e0-9d71470d9b8e",
"metadata": {},
"outputs": [],
"source": [
"epsilon = 0\n",
"n_steps = 1500"
]
},
{
"cell_type": "markdown",
"id": "4897a058-a527-4d3c-bd08-7f8eac8a6e58",
"metadata": {},
"source": [
"I also make the game really large, to allow the snakes their own space:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "d67ba96c-9b42-47d2-a88f-a94335bd6967",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Game starting with 3 players.\n"
]
}
],
"source": [
    "game_engine = multiplayer.Playfield(window_width=WINDOW_WIDTH,\n",
    "                                    window_height=WINDOW_HEIGHT,\n",
    "                                    units=10,\n",
    "                                    g_speed=100,\n",
    "                                    s_size=1)\n",
    "t1 = game_engine.add_player()\n",
    "t2 = game_engine.add_player()\n",
    "n1 = game_engine.add_player()\n",
    "game_engine.start_game()"
]
},
{
"cell_type": "markdown",
"id": "482f45f9-6964-49e5-90a5-3f189239fe8b",
"metadata": {},
"source": [
    "And initialize a new q-table object with the game engine. This object is quite nice, because it is not tied to any one q-table; it simply reads and writes whichever q-table it is given:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "c5be5beb-e92c-42ad-9076-c28394560122",
"metadata": {},
"outputs": [],
"source": [
"q_table = qtsnake.QSnake(game_engine)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "314d0836-5c99-4de3-91c8-e563fed61e6c",
"metadata": {},
"outputs": [],
"source": [
    "for step in range(n_steps):\n",
    "    # table 1 (YELLOW)\n",
    "    _, t1_action = q_table.pick_greedy_action(inferior_table, t1, epsilon)\n",
    "\n",
    "    # table 2 (RED)\n",
    "    _, t2_action = q_table.pick_greedy_action(superior_table, t2, epsilon)\n",
    "\n",
    "    # network 1 (PURPLE)\n",
    "    n1_state_action, _ = pick_greedy_action(q, n1, epsilon)\n",
    "    game_engine.player_advance([t1_action,\n",
    "                                t2_action,\n",
    "                                n1_state_action[1]])"
]
},
{
"cell_type": "markdown",
"id": "bfe78a0a-a65f-4c9c-9e08-2f99e0fb54b2",
"metadata": {},
"source": [
"YELLOW: Learned Q-Table\n",
"\n",
"RED: Set Q-Table\n",
"\n",
"PURPLE: Learned Q-Network\n",
"\n",
    "Often in my games, the q-network finishes second behind the superior, manually set q-table. On many occasions it is the other way around and the q-network finishes first, depending on how successfully each agent trained and on the luck of goal spawns during the trial. All in all, it seems the Q-network and Q-table are roughly equally matched in this game representation."
]
},
{
"cell_type": "markdown",
"id": "5d369045-bc0a-4567-99bb-bb04e75f294c",
"metadata": {},
"source": [
""
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}