 QNetwork/neuralnetwork_regression.py                                            |   8
 QNetwork/optimizers.py                                                          |   8
 README.org                                                                      |   3
 one_revised_snake_q_table.ipynb (renamed from revised_snake_q_table.ipynb)      | 132
 two_revised_snake_q_network.ipynb (renamed from revised_snake_q_network.ipynb)  |  12
 5 files changed, 93 insertions(+), 70 deletions(-)
diff --git a/QNetwork/neuralnetwork_regression.py b/QNetwork/neuralnetwork_regression.py
index b26be5e..3f141c5 100644
--- a/QNetwork/neuralnetwork_regression.py
+++ b/QNetwork/neuralnetwork_regression.py
@@ -6,9 +6,11 @@ import matplotlib.patches as pltpatch # for Arc
import matplotlib.collections as pltcoll
import math
-######################################################################
-## class NeuralNetwork()
-######################################################################
+###############################################
+# Thanks to Dr. Chuck Anderson for his neural #
+# network with optimizers implementation #
+# https://www.cs.colostate.edu/~anderson/wp/ #
+###############################################
class NeuralNetwork():
diff --git a/QNetwork/optimizers.py b/QNetwork/optimizers.py
index 7d28f92..c928b64 100644
--- a/QNetwork/optimizers.py
+++ b/QNetwork/optimizers.py
@@ -1,8 +1,10 @@
import numpy as np
-######################################################################
-## class Optimizers()
-######################################################################
+###############################################
+# Thanks to Dr. Chuck Anderson for his neural #
+# network with optimizers implementation #
+# https://www.cs.colostate.edu/~anderson/wp/ #
+###############################################
class Optimizers():
diff --git a/README.org b/README.org
new file mode 100644
index 0000000..64ebf54
--- /dev/null
+++ b/README.org
@@ -0,0 +1,3 @@
+* README
+
+(work in progress)
diff --git a/revised_snake_q_table.ipynb b/one_revised_snake_q_table.ipynb
index c9934bc..9a6ca5f 100644
--- a/revised_snake_q_table.ipynb
+++ b/one_revised_snake_q_table.ipynb
@@ -21,7 +21,8 @@
"from IPython.core.debugger import Pdb\n",
"\n",
"from GameEngine import multiplayer\n",
- "Point = namedtuple('Point', 'x, y')"
+ "Point = namedtuple('Point', 'x, y')\n",
+ "SA = namedtuple('SA', 'state, action')"
]
},
{
@@ -33,7 +34,7 @@
"\n",
"I have an improved game implementation which allows for multiplayer snake games, as well as simplified training. This notebook will go over both of these, including an implementation of q-table learning, as well as a match between a manually filled out q-table and a learned one.\n",
"\n",
- "Let's start by initializing the engine object:"
+ "I'll start by initializing the engine object:"
]
},
{
@@ -90,7 +91,7 @@
"\n",
"**toggle_draw**: Turns off/on the game UI for faster training.\n",
"\n",
- "Let's do a test of these functions:"
+ "I can do a test of these functions:"
]
},
{
@@ -142,7 +143,7 @@
{
"data": {
"text/plain": [
- "[<CollisionType.GOAL: 1>]"
+ "[<CollisionType.NONE: 2>]"
]
},
"execution_count": 6,
@@ -172,7 +173,7 @@
"\n",
"How many states do I need? Seeing how the new **get_viable_actions** method already prevents the snake from choosing life-ending moves, the snake is no longer tasked with learning or memorizing it.\n",
"\n",
- "The snake doesneed to be able to interpret progress towards the goal, so I will reinclude one state-sensing to sense the goal direction. I need only 8 states (with entries for each four actions) to represent our game now."
+ "The snake does need to be able to interpret progress towards the goal, so I will reinclude one state; the compass direction from the snake's head to the goal. This means I need only 8 states (with entries for each four actions) to represent my game now."
]
},
{
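A rough sketch of the table this implies follows; the 8-by-4 shape and zero initialization come from the notebook's own description, while the state ordering and the example indices are assumptions made purely for illustration.

import numpy as np

N_STATES, N_ACTIONS = 8, 4   # compass directions from head to goal x movement actions

# zero-initialized: untried actions start at the best possible value,
# which nudges the agent toward exploring them
q = np.zeros((N_STATES, N_ACTIONS))

# hypothetical lookup: if sense_goal reported state 3 and action 1 meant "down",
# the expected reward-to-go for that pair would be q[3, 1]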
@@ -262,9 +263,7 @@
{
"data": {
"text/plain": [
- "([Point(x=480, y=0)],\n",
- " [Point(x=480, y=0), Point(x=560, y=0)],\n",
- " Point(x=0, y=400))"
+ "([Point(x=320, y=160)], [Point(x=320, y=160)], Point(x=0, y=400))"
]
},
"execution_count": 9,
@@ -281,7 +280,7 @@
"id": "d3fd47ce-55fe-4d2f-9147-8848193f7ca1",
"metadata": {},
"source": [
- "Now to use sense_goal as an index into our q table:"
+ "Now to make a function to index our expected reward-to-go given a state using sense_goal:"
]
},
{
@@ -338,7 +337,14 @@
"id": "6a2ef7f7-f6f7-4610-8e98-1d389327f3e8",
"metadata": {},
"source": [
- "In our learning agent, these actions will obviously be associated with different expected rewards. But it is not enough to take the best reward, because the positions of the hazards have not been accounted for. I chose to implement a replacement argmin/max function to select actions from this table, which generates new actions in order from highest expected reward to lowest expected reward."
+ "In our learning agent, these actions will obviously be associated with different expected rewards. Essentially, we have a function that, given a state, tells us the expected utility of each action. Should we just choose the best one?\n",
+ "\n",
+ "There are two problems with a greedy approach...\n",
+ "\n",
+ "1. If the agent only sticks to the one policy it knows, then it will never truly learn the best solution. This is not the largest problem, due to how few states we have. Additionally, the snake is forced into a new state everytime it collects a goal. Still, a pure greedy approach results in slow, sub-optimal training.\n",
+ "2. My snake is ignorant of which actions will result in a collision. The snake must be able to understand that its Q function will commonly lead it astray in navigating its environment.\n",
+ "\n",
+ "In cases where the snake has explored enough to make its Q function useful and death is not a possibility, I still do want the snake to be greedy. I chose to implement a replacement argmin/max function to select actions from this table, which generates new actions in order from highest expected reward to lowest expected reward."
]
},
{
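A minimal sketch of such an ordered-action generator, assuming larger Q-values are better; the notebook's actual argmin_gen may differ in name and sign convention.

import numpy as np

def ordered_actions(rewards):
    """Yield action indices from highest to lowest expected reward."""
    for action in np.argsort(rewards)[::-1]:
        yield int(action)

# intended use: walk down the ranking until a viable action turns up
# for action in ordered_actions(q[state]):
#     if action in viable_actions:
#         break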
@@ -385,11 +391,13 @@
"id": "b9bed101-661c-4e61-b8ad-94bbc1900e03",
"metadata": {},
"source": [
- "How will we use this? If the action generated is not a viable action, we will take the next best action.\n",
+ "How will I use this? Remember that my game engine includes all the logic to calculate which actions are immediately dangerous. If the action generated is not a viable action, we will take the next best action, or the next, next best action, etc.\n",
"\n",
"What if no actions are viable? Then the agent has boxed itself in, and it doesn't matter what action we choose.\n",
"\n",
- "Previously, I reset the snake if it got stuck in a learning-loop. I will instead use epsilon, as it gives me a bit more control over training. Here is my greedy-action selector function, combining the work of all the previous code:"
+ "Previously, I reset the snake if it got stuck in a learning-loop. I will instead use epsilon, as it is the tried-and-true method of enforcing exploration.\n",
+ "\n",
+ "Here is my greedy-epsilon function, combining the work of all the previous code:"
]
},
{
@@ -404,11 +412,12 @@
" state, rewards = index_actions(q, id)\n",
"\n",
" if np.random.uniform() < epsilon:\n",
- " return (state, np.random.choice(viable_actions)) if viable_actions.size > 0 else (state, 0)\n",
+ " # SA -- a STATE-ACTION pair (X as in function input)\n",
+ " return SA(state, np.random.choice(viable_actions)) if viable_actions.size > 0 else SA(state, 0)\n",
" for action in argmin_gen(rewards):\n",
" if action in viable_actions:\n",
- " return (state, action)\n",
- " return (state, 0) # death"
+ " return SA(state, action)\n",
+ " return SA(state, 0) # death"
]
},
{
@@ -416,7 +425,7 @@
"id": "2169ac2b-df83-4f53-8a83-b35a2d1a521a",
"metadata": {},
"source": [
- "We'll set up epsilon to decay over our 500-step test..."
+ "I'll set up epsilon to decay over our 500-step test..."
]
},
{
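For reference, a multiplicative per-step decay like the one used below can be derived from a target final value; the starting epsilon and the target here are illustrative assumptions rather than the notebook's actual numbers.

n_steps = 500
epsilon = 1.0            # fully random at the start (assumed)
target_epsilon = 0.01    # mostly greedy by the end (assumed)

# choose the decay factor so that epsilon * decay**n_steps == target_epsilon
epsilon_decay = (target_epsilon / epsilon) ** (1 / n_steps)   # roughly 0.9908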
@@ -471,13 +480,13 @@
"outputs": [],
"source": [
"set_q = np.array([[-10., -2., 0., -2.],\n",
- " [-5., -5., 0., 0.],\n",
- " [-2., -10., 2., 0.],\n",
- " [0., -5., -5., 0.],\n",
- " [0., -2., -10., -2.],\n",
- " [0., 0., -5., -5.],\n",
- " [-2., 0., -2., -10.],\n",
- " [-5., 0., 0., -5.]])"
+ " [-5., -5., 0., 0.],\n",
+ " [-2., -10., 2., 0.],\n",
+ " [0., -5., -5., 0.],\n",
+ " [0., -2., -10., -2.],\n",
+ " [0., 0., -5., -5.],\n",
+ " [-2., 0., -2., -10.],\n",
+ " [-5., 0., 0., -5.]])"
]
},
{
@@ -489,8 +498,8 @@
"source": [
"epsilon = 0\n",
"for step in range(n_steps):\n",
- " _, p1_action = pick_greedy_action(set_q, p1, epsilon)\n",
- " game_engine.player_advance([p1_action])\n",
+ " X = pick_greedy_action(set_q, p1, epsilon)\n",
+ " game_engine.player_advance([X.action])\n",
" epsilon *= epsilon_decay"
]
},
@@ -509,7 +518,7 @@
"id": "0b9968af-0ec2-4b92-a19d-50912703dd4a",
"metadata": {},
"source": [
- "### Learning and Temporal Difference"
+ "### Q-Learning, with Temporal Difference"
]
},
{
@@ -517,9 +526,17 @@
"id": "ce537e44-ac8c-4f09-b89d-a330f13277da",
"metadata": {},
"source": [
- "I will be using the temporal difference equation as the key learning element of my reinforcement function. In theory, it allows me adjust the expected reward of a state to agree with its observed successor. In practice, it will allow out agent to take actions that previously led it closer to the goal.\n",
+ "A good agent prioritizes actions that leads to the highest expected reward. The Q-function assigns an expected utility to each state-action pair, usually the expected reward-to-go.\n",
+ "\n",
+ "A popular method of adjusting this state-value function is a version of the temporal difference equation, which adjusts the utility associated with each input to agree with the maximum utility of its successor:\n",
+ "\n",
+ "$Q(s,a) = Q(s,a) + \\alpha[(R(s,a,s') + \\gamma * max _a'Q(s',a') - Q(s,a))]$\n",
+ "\n",
+ "The nature of this method is somewhat recursive, as it updates Q to agree with max(Q'), which in turn is updated to agree with max(Q'')... which is why the algorithm is sometimes referred to as bootstrapping.\n",
+ "\n",
+ "The discount factor $\\gamma$ can be used to weight immediate reward higher than future reward, though will be kept as 1 in my solution, which means we consider all future actions equally. All I need to do is assign an enticing enough reinforcement to goal-attaining actions, and use the temporal difference equation to update all other state transitions.\n",
"\n",
- "In order to use this equation, I simply need to create a function that takes the current state/action/outcome, and the previous state/action, as this will be updated in cases where the agent did not reach the goal.\n",
+ "The complication is that we require two time states to peMy below function takes the I simply need to create a function that takes the current state/action/outcome, and the previous state/action, as this will be updated in cases where the agent did not reach the goal.\n",
"\n",
"When the agent does reach the goal, I will manually set that state and action to the best reward, 0. Remember that the q-table is initialized with zeros, meaning untravelled actions are pre-assigned good rewards. Both this and epsilon will encourage exploration.\n",
"\n",
@@ -533,12 +550,14 @@
"metadata": {},
"outputs": [],
"source": [
- "def update_q(q, old_state_action, new_state_action, outcome, lr=0.05):\n",
+ "reward = -1\n",
+ "\n",
+ "def update_q(q, old_X, new_X, outcome, lr=0.05):\n",
" if outcome == multiplayer.CollisionType.GOAL:\n",
- " q[new_state_action[0], new_state_action[1]] = 0\n",
+ " q[new_X.state, new_X.action] = 0\n",
" else:\n",
- " td_error = -1 + q[new_state_action[0], new_state_action[1]] - q[old_state_action[0], old_state_action[1]]\n",
- " q[old_state_action[0], old_state_action[1]] += lr * td_error"
+ " td_error = reward + q[new_X.state, new_X.action] - q[old_X.state, old_X.action]\n",
+ " q[old_X.state, old_X.action] += lr * td_error"
]
},
{
@@ -569,17 +588,17 @@
"metadata": {},
"outputs": [],
"source": [
- "p1_old_s_a = pick_greedy_action(q, p1, epsilon) # state, action\n",
- "game_engine.player_advance([p1_old_s_a[1]])\n",
+ "p1_old_X = pick_greedy_action(q, p1, epsilon) # state, action\n",
+ "game_engine.player_advance([p1_old_X.action])\n",
"\n",
"for step in range(n_steps):\n",
- " p1_new_s_a = pick_greedy_action(q, p1, epsilon) # state, action\n",
- " outcome = game_engine.player_advance([p1_new_s_a[1]])\n",
+ " p1_new_X = pick_greedy_action(q, p1, epsilon) # state, action\n",
+ " outcome = game_engine.player_advance([p1_new_X.action])\n",
"\n",
- " update_q(q, p1_old_s_a, p1_new_s_a, outcome)\n",
+ " update_q(q, p1_old_X, p1_new_X, outcome)\n",
"\n",
" epsilon *= epsilon_decay\n",
- " p1_old_s_a = p1_new_s_a"
+ " p1_old_X = p1_new_X"
]
},
{
@@ -587,7 +606,7 @@
"id": "c6cbf429-c790-4bb9-8775-ed067844ab4e",
"metadata": {},
"source": [
- "The results look promising. Here is everything it learned:"
+ "Most of the time, results look promising. Here is everything it learned:"
]
},
{
@@ -599,14 +618,14 @@
{
"data": {
"text/plain": [
- "array([[-1.85942515, -0.28461569, -0.1475 , -2.25187465],\n",
- " [-6.22032336, -2.97794369, -0.54294824, -0.73044876],\n",
- " [-1.36441677, -5.89002527, -0.37131058, -0.44722626],\n",
- " [-4.05086759, -5.41891136, -2.6667349 , -1.52778296],\n",
- " [-0.74803435, -1.03051018, -5.33579039, -1.17403081],\n",
- " [-1.14848949, -1.11816252, -5.95403265, -2.19930379],\n",
- " [-0.91983985, -0.43064397, -0.97973069, -6.54929252],\n",
- " [-5.1999968 , -0.75777597, -0.38182729, -2.87468477]])"
+ "array([[-2.35070513, -1.09149338, -0.16535442, -0.56686264],\n",
+ " [-9.89365379, -3.92409428, -5.34681609, -1.89000207],\n",
+ " [-0.11260587, -4.74939046, -9.9192565 , -0.43756285],\n",
+ " [-1.40606955, -3.09973297, -1.00213181, -0.59918486],\n",
+ " [ 0. , -0.48427898, -3.18961465, -0.15631095],\n",
+ " [-1.23097846, -1.83318991, -3.93160124, -1.29734622],\n",
+ " [-0.85699409, -0.19789849, -0.82954074, -5.23579225],\n",
+ " [-4.3355246 , -1.17885939, -0.53698344, -2.4048526 ]])"
]
},
"execution_count": 22,
@@ -682,15 +701,14 @@
"source": [
"for step in range(n_steps):\n",
" # p1\n",
- " _, p1_action = pick_greedy_action(set_q, p1, epsilon)\n",
+ " p1_X = pick_greedy_action(set_q, p1, epsilon)\n",
"\n",
" # p2\n",
- " p2_new_s_a = pick_greedy_action(q, p2, epsilon) # state, action\n",
+ " p2_X = pick_greedy_action(q, p2, epsilon) # state, action\n",
" \n",
- " game_engine.player_advance([p1_action, p2_new_s_a[1]])\n",
+ " game_engine.player_advance([p1_X.action, p2_X.action])\n",
"\n",
- " epsilon *= epsilon_decay\n",
- " p2_old_s_a = p2_new_s_a"
+ " epsilon *= epsilon_decay"
]
},
{
@@ -698,7 +716,7 @@
"id": "820d7e5f-c3e9-4dae-82ce-3c9188d8a8d8",
"metadata": {},
"source": [
- "The learned agent normally plays significantly worse than the artificially-learned one, which is okay given I hardly spent time optimizing the number of steps and learning rate. I plan to compare both of the agents again my neural-network approach, so the last this I will do is save the q_tables to a file."
+ "The learned agent usually plays almost as well as the artificially-learned one, which is okay given I hardly spent time optimizing the number of steps and learning rate. I plan to compare both of the agents again my neural-network approach, so the last I will do is save the q_tables to a file."
]
},
{
@@ -711,14 +729,6 @@
"np.save('superior_qt.npy', set_q)\n",
"np.save('inferior_qt.npy', q)"
]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "9eec33d8-9a65-426a-8ad5-8ffb2cbe2541",
- "metadata": {},
- "outputs": [],
- "source": []
}
],
"metadata": {
diff --git a/revised_snake_q_network.ipynb b/two_revised_snake_q_network.ipynb
index fbab70d..9b9028d 100644
--- a/revised_snake_q_network.ipynb
+++ b/two_revised_snake_q_network.ipynb
@@ -34,11 +34,17 @@
"id": "b3aab739-e016-4700-89c9-41f3c2f536cf",
"metadata": {},
"source": [
- "### New Game Implementation\n",
+ "### Representing Q-function using Neural Networks\n",
"\n",
- "I have an improved game implementation which allows for multiplayer snake games, as well as simplified training. This notebook will go over training of a simple q-network, which maps a total of 32 different combinations of states and actions onto rewards, much like the previous q-table implementation from ***revised_snake_q_table.ipynb***.\n",
+ "In the last notebook, I represented my Q-function in a simple lookup table. This notebook offers a difference approach by using a neural network. The function we want the neural network to learn is, of course, the snake's Q-function, which maps a state-action pair onto an expected reward.\n",
"\n",
- "Please read that notebook first if interested in a more complete description of the new game engine. As usual, we have some game-setup to do:"
+ "#### Benefits of a Q-network\n",
+ "\n",
+ "The distinction between a Q-table and Q-network is that a Q-network contains and updates a set of parameters (weights) which summarize previously seen data. A Q-table cannot do this, and thus is completely clueless in situations where it recieves an input it has either not seen, or has not been trained on. In theory, this allows a neural network to not only represent environments with many more states, but also the ability to make guesses about in 'gaps' in its learning.\n",
+ "\n",
+ "This notebook will go over training of a simple q-network, which maps a total of 32 different combinations of states and actions onto rewards, much like the previous q-table implementation from ***one_revised_snake_q_table.ipynb***.\n",
+ "\n",
+ "First, I will set up the game environment:"
]
},
{
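As a sketch of one way those 32 state-action combinations could be presented to a network, here is an assumed one-hot encoding; the actual NeuralNetwork class in QNetwork/neuralnetwork_regression.py has its own interface, and nothing below is taken from it.

import numpy as np

N_STATES, N_ACTIONS = 8, 4   # compass directions x movement actions

def encode(state, action):
    """One-hot encode a (state, action) pair as a length-12 input vector."""
    x = np.zeros(N_STATES + N_ACTIONS)
    x[state] = 1.0
    x[N_STATES + action] = 1.0
    return x

# the network is then trained so that encode(state, action) maps to the expected
# reward, playing the same role the (8, 4) lookup table did in the last notebook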