author     bd-912 <bdunahu@gmail.com>    2023-12-13 19:50:51 -0700
committer  bd-912 <bdunahu@gmail.com>    2023-12-13 19:50:51 -0700
commit     d2c5b3ce4bccefeef40d7cadbbe14b350960475d (patch)
tree       0b593a43dd4fa7957007d4a1e482ab260f4be8c0 /one_revised_snake_q_table.ipynb
parent     b1137269b269eed1207005828b7939efc9f557c2 (diff)
Minor rewrites of various descriptions
Diffstat (limited to 'one_revised_snake_q_table.ipynb')
-rw-r--r--  one_revised_snake_q_table.ipynb  36
1 file changed, 17 insertions(+), 19 deletions(-)
diff --git a/one_revised_snake_q_table.ipynb b/one_revised_snake_q_table.ipynb
index e827ae4..6d1c114 100644
--- a/one_revised_snake_q_table.ipynb
+++ b/one_revised_snake_q_table.ipynb
@@ -143,7 +143,7 @@
{
"data": {
"text/plain": [
- "[<CollisionType.GOAL: 1>]"
+ "[<CollisionType.NONE: 2>]"
]
},
"execution_count": 6,
@@ -169,7 +169,7 @@
"metadata": {},
"source": [
"### State-sensing methods, creating and reading a q-table\n",
- "Now, we can start redesigning some functions used to allow the snake to play intelligently. We'll use a multi-dimensional numpy array to store the rewards corresponding to each state and action. This is called a q-function, or a q-table in this case.\n",
+ "Now, we can start redesigning some functions used to allow the snake to play intelligently. We'll use a multi-dimensional numpy array to store the expected rewards corresponding to each state and action. This is called a q-function, or a q-table in this case, and represents one of the most fundamental methods of reinforcement learning. More on this later...\n",
"\n",
"How many states do I need? Seeing how the new **get_viable_actions** method already prevents the snake from choosing life-ending moves, the snake is no longer tasked with learning or memorizing it.\n",
"\n",
@@ -263,9 +263,7 @@
{
"data": {
"text/plain": [
- "([Point(x=160, y=80)],\n",
- " [Point(x=160, y=80), Point(x=160, y=0)],\n",
- " Point(x=0, y=0))"
+ "([Point(x=400, y=160)], [Point(x=400, y=160)], Point(x=0, y=80))"
]
},
"execution_count": 9,
@@ -282,7 +280,7 @@
"id": "d3fd47ce-55fe-4d2f-9147-8848193f7ca1",
"metadata": {},
"source": [
- "Now to make a function to index our expected reward-to-go given a state using sense_goal:"
+ "Now to make a function to index our expected reward-to-go given a state using sense_goal. Because we only have one state-sensing function, this function really only serves as a neat interface to sense_goal:"
]
},
{
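A hypothetical sketch of the indexing function described in this cell, assuming sense_goal returns an integer row index into the q-table; get_state_rewards is an illustrative name, not the notebook's:

    def get_state_rewards(q_table, game):
        # sense_goal is the notebook's state-sensing method; it is assumed
        # here to map the game state to a row index of the q-table
        state = sense_goal(game)
        return q_table[state]   # expected reward-to-go for each action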
@@ -339,7 +337,7 @@
"id": "6a2ef7f7-f6f7-4610-8e98-1d389327f3e8",
"metadata": {},
"source": [
- "In our learning agent, these actions will obviously be associated with different expected rewards. Essentially, we have a function that, given a state, tells us the expected utility of each action. Should we just choose the best one?\n",
+ "In my learning agent, these actions will obviously be associated with different expected rewards. Essentially, I have a function that, given a state, tells me the expected utility of each action. Should I just choose the best one?\n",
"\n",
"There are two problems with a greedy approach...\n",
"\n",
@@ -510,9 +508,9 @@
"id": "5c36ab97-2ca0-4468-8d4c-ebd1e4deec23",
"metadata": {},
"source": [
- "And the snake already plays optimally, no learning required. This implementation might be more similar to a passive learning agent, in the sense I already told the snake what policy I want it to follow.\n",
+ "And the snake already plays optimally, no learning required! This implementation might be more similar to a passive learning agent, in the sense I already told the snake what policy I want it to follow.\n",
"\n",
- "Now that we have these methods, I will create functions to allow the snake to learn by its own, and then pair it off against the q-table I just built."
+ "Now that I have these methods, I will create functions to allow the snake to learn by its own, and then pair it off against the q-table I just built."
]
},
{
@@ -528,7 +526,7 @@
"id": "ce537e44-ac8c-4f09-b89d-a330f13277da",
"metadata": {},
"source": [
- "A good agent prioritizes actions that leads to the highest expected reward. The Q-function assigns an expected utility to each state-action pair, usually the expected reward-to-go.\n",
+ "A rational agent prioritizes actions that leads to the highest expected reward. The Q-function assigns an expected utility to each state-action pair, usually the expected reward-to-go.\n",
"\n",
"A popular method of adjusting this state-value function is a version of the temporal difference equation, which adjusts the utility associated with each input to agree with the maximum utility of its successor:\n",
"\n",
@@ -538,7 +536,7 @@
"\n",
"The discount factor $\\gamma$ can be used to weight immediate reward higher than future reward, though will be kept as 1 in my solution, which means we consider all future actions equally. All I need to do is assign an enticing enough reinforcement to goal-attaining actions, and use the temporal difference equation to update all other state transitions.\n",
"\n",
- "In order to implement this equation, I simply need a function that takes the q-table to be updated, the old state-action pair, and then ew state-action pair, and the outcome as returned by the game engine so we can assign a reward.\n",
+ "In order to implement this equation, I simply need a function that takes the q-table to be updated, the old state-action pair, and then ew state-action pair, and the outcome as returned by the game engine so I can assign a reward.\n",
"\n",
"When the agent does reach the goal, I will manually set that state and action to the best reward, 0. Remember that the q-table is initialized with zeros, meaning untravelled actions are pre-assigned good rewards. Both this and epsilon will encourage exploration.\n",
"\n",
@@ -620,14 +618,14 @@
{
"data": {
"text/plain": [
- "array([[-7.98101961, -2.63365542, -0.40046043, -1.66295111],\n",
- " [-6.79790104, -8.97312148, -2.76629639, -2.01841064],\n",
- " [-2.5668375 , -7.76913115, -5.25510457, -1.60454875],\n",
- " [-2.53858287, -4.87135085, -7.27897488, -3.55392954],\n",
- " [-1.11332388, -1.16738974, -8.00673287, -3.76512078],\n",
- " [-4.80299325, -1.82240999, -4.36261659, -7.78143806],\n",
- " [-3.74031239, -1.81917483, -2.55794318, -8.59533619],\n",
- " [-7.27706114, -5.22216365, -2.79252452, -3.34047701]])"
+ "array([[-7.67491582, -2.33111962, -0.60015858, -1.37540041],\n",
+ " [-5.27152969, -7.98969718, -3.87453687, -1.24435724],\n",
+ " [-6.31827426, -5.34343005, -5.15269717, -3.14416104],\n",
+ " [-4.03006198, -7.7299521 , -5.56781757, -2.25634664],\n",
+ " [-2.21238723, -3.2719937 , -7.19183517, -1.12966867],\n",
+ " [-4.51471731, -3.0205144 , -7.82635816, -5.67783926],\n",
+ " [-3.65731728, -1.23278448, -1.3241749 , -7.94001044],\n",
+ " [-7.61199203, -2.0832036 , -2.27008676, -6.21290761]])"
]
},
"execution_count": 22,