Diffstat (limited to 'one_revised_snake_q_table.ipynb')
-rw-r--r-- | one_revised_snake_q_table.ipynb | 8
1 file changed, 4 insertions, 4 deletions
diff --git a/one_revised_snake_q_table.ipynb b/one_revised_snake_q_table.ipynb
index 9a6ca5f..e2f7bb7 100644
--- a/one_revised_snake_q_table.ipynb
+++ b/one_revised_snake_q_table.ipynb
@@ -342,7 +342,7 @@
     "There are two problems with a greedy approach...\n",
     "\n",
     "1. If the agent only sticks to the one policy it knows, then it will never truly learn the best solution. This is not the largest problem, due to how few states we have. Additionally, the snake is forced into a new state everytime it collects a goal. Still, a pure greedy approach results in slow, sub-optimal training.\n",
-    "2. My snake is ignorant of which actions will result in a collision. The snake must be able to understand that its Q function will commonly lead it astray in navigating its environment.\n",
+    "2. My snake is ignorant of which actions will result in a collision. The snake must understand that its Q function will commonly lead it astray in navigating its environment.\n",
     "\n",
     "In cases where the snake has explored enough to make its Q function useful and death is not a possibility, I still do want the snake to be greedy. I chose to implement a replacement argmin/max function to select actions from this table, which generates new actions in order from highest expected reward to lowest expected reward."
    ]
@@ -508,7 +508,7 @@
    "id": "5c36ab97-2ca0-4468-8d4c-ebd1e4deec23",
    "metadata": {},
    "source": [
-    "And the snake already plays optimally, no learning required.\n",
+    "And the snake already plays optimally, no learning required. This implementation might be more similar to a passive learning agent, in the sense that I already told the snake what policy I want it to follow.\n",
     "\n",
     "Now that we have these methods, I will create functions to allow the snake to learn by its own, and then pair it off against the q-table I just built."
    ]
@@ -536,7 +536,7 @@
     "\n",
     "The discount factor $\gamma$ can be used to weight immediate reward higher than future reward, though will be kept as 1 in my solution, which means we consider all future actions equally. All I need to do is assign an enticing enough reinforcement to goal-attaining actions, and use the temporal difference equation to update all other state transitions.\n",
-    "The complication is that we require two time states to peMy below function takes the I simply need to create a function that takes the current state/action/outcome, and the previous state/action, as this will be updated in cases where the agent did not reach the goal.\n",
+    "In order to implement this equation, I simply need a function that takes the q-table to be updated, the old state-action pair, the new state-action pair, and the outcome as returned by the game engine so we can assign a reward.\n",
     "\n",
     "When the agent does reach the goal, I will manually set that state and action to the best reward, 0. Remember that the q-table is initialized with zeros, meaning untravelled actions are pre-assigned good rewards. Both this and epsilon will encourage exploration.\n",
@@ -565,7 +565,7 @@
    "id": "01b21e01-174e-4fdd-ad70-dcc1e6483fb2",
    "metadata": {},
    "source": [
-    "Now all that is needed is the training loop. I have high expectations for this agent, so I will only allow it 1500 moves to train itself! Here is where the outputs of pick_greedy_action come in handy:"
+    "Now all that is needed is the training loop. I have high expectations for this agent, so I will only allow it 1500 moves to train itself! Here is where the outputs of pick_greedy_action come in handy, because they can be used as a direct index into Q:"
    ]
   },
   {
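The first hunk's "replacement argmin/max" idea (generate candidate actions in order from highest to lowest expected reward) maps onto a short generator. The sketch below is illustrative only and is not taken from the repository: the signature, the Q-table layout (one row of action values per state), and the would_collide helper are assumptions.

```python
import numpy as np

def pick_greedy_action(q_row):
    """Yield action indices from highest to lowest expected reward.

    q_row is assumed to be the Q-table row for the current state,
    i.e. one expected reward per possible action.
    """
    # np.argsort sorts ascending, so reverse it to try the best action first.
    for action in np.argsort(q_row)[::-1]:
        yield int(action)

# Hypothetical usage: fall back to the next-best action whenever the
# greedy choice would cause a collision.
# for action in pick_greedy_action(q_table[state]):
#     if not would_collide(state, action):  # would_collide is assumed
#         break
```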
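The update described in the third hunk also translates fairly directly into code. This is a sketch under stated assumptions: the function name, the step penalty, the learning rate, and the "goal" outcome string are placeholders, and the target uses the new state-action pair it is handed (a SARSA-style update), since that is what the described signature provides. Only gamma = 1 and the rule that goal-attaining actions are pinned to the best reward, 0, come from the notebook text.

```python
STEP_REWARD = -1   # assumed penalty for a move that does not reach the goal
ALPHA = 0.5        # assumed learning rate
GAMMA = 1.0        # from the notebook: future reward is not discounted

def update_q_table(q, old_state, old_action, new_state, new_action, outcome):
    """Update q in place for the observed transition
    (old_state, old_action) -> (new_state, new_action)."""
    if outcome == "goal":  # the game engine's outcome encoding is assumed
        # Goal-attaining actions are set directly to the best reward, 0.
        q[old_state][old_action] = 0
        return
    # Temporal-difference target: immediate reward plus the (undiscounted)
    # value of the state-action pair actually taken next.
    target = STEP_REWARD + GAMMA * q[new_state][new_action]
    q[old_state][old_action] += ALPHA * (target - q[old_state][old_action])
```

Because the table starts at all zeros, untravelled actions keep the best possible value, which, together with epsilon, pushes the agent toward exploration; the outputs of pick_greedy_action can then be used directly as indices into Q, as the final hunk notes.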