The Archive Base

Editorial Team
Asked: June 8, 2026 (2026-06-08T20:33:31+00:00)

The problem: I’ve trained an agent to perform a simple task in a grid


The problem:

I’ve trained an agent to perform a simple task in a grid world (go to the top of the grid while not hitting obstacles), but the following situation always seems to occur. It finds itself in an easy part of the state space (no obstacles), and so continually gets a strong positive reinforcement signal. Then, when it does find itself in a difficult part of the state space (wedged next to two obstacles), it simply chooses the same action as before, to no effect (it goes up and hits the obstacle). Eventually the Q value for this action matches the negative reward, but by this time the other actions have even lower Q values from being useless in the easy part of the state space, so the error signal drops to zero and the incorrect action is still always chosen.

How can I prevent this from happening? I’ve thought of a few solutions, but none seem viable:

  • Use a policy that is always exploration heavy. As the obstacles take ~5 actions to get around, a single random action every now and then seems ineffective.
  • Make the reward function such that bad actions are worse when they are repeated. This makes the reward function break the Markov property. Maybe this isn’t a bad thing, but I simply don’t have a clue.
  • Only reward the agent for completing the task. The task takes over a thousand actions to complete, so the training signal would be way too weak.

Some background on the task:

So I’ve made a little testbed for trying out RL algorithms — something like a more complex version of the grid-world described in the Sutton book. The world is a large binary grid (300 by 1000) populated by 1’s in the form of randomly sized rectangles on a backdrop of 0’s. A band of 1’s surrounds the edges of the world.
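A minimal sketch of how such a world could be generated, using only the Python standard library. The function name `make_world`, the number of rectangles, their maximum side length, and the choice of 1000 rows by 300 columns (rather than the other way round) are all assumptions made for illustration; the question does not specify them.

```python
import random


def make_world(height=1000, width=300, n_rects=200, max_side=20, seed=0):
    """Build a binary grid: 0 = free space, 1 = obstacle.

    All parameter defaults are illustrative guesses, not values from the question.
    """
    rng = random.Random(seed)
    grid = [[0] * width for _ in range(height)]

    # Scatter randomly sized rectangles of 1's over the backdrop of 0's.
    for _ in range(n_rects):
        h = rng.randint(1, max_side)
        w = rng.randint(1, max_side)
        top = rng.randrange(0, height - h)
        left = rng.randrange(0, width - w)
        for r in range(top, top + h):
            for c in range(left, left + w):
                grid[r][c] = 1

    # Surround the world with a band of 1's.
    for c in range(width):
        grid[0][c] = 1
        grid[height - 1][c] = 1
    for r in range(height):
        grid[r][0] = 1
        grid[r][width - 1] = 1
    return grid
```

A smaller world, for example `make_world(height=30, width=20, n_rects=5, max_side=5)`, is convenient for quick experiments.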

An agent occupies a single space in this world and observes only a fixed window around it (a 41 by 41 window with the agent in the center). The agent’s actions consist of moving by 1 space in any of the four cardinal directions. The agent can only move through spaces marked by a 0; 1’s are impassable.
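The observation window and the movement rule described above could be sketched as follows. The names `ACTIONS`, `observe`, and `step`, the action numbering, and the choice to treat cells outside the grid as obstacles are assumptions for illustration only.

```python
# Hypothetical action encoding: (row delta, column delta). Row 0 is the top.
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def observe(grid, row, col, radius=20):
    """Return the (2*radius+1) x (2*radius+1) window centred on the agent.

    Cells that fall outside the grid are reported as 1 (an assumption).
    """
    h, w = len(grid), len(grid[0])
    return [
        [grid[r][c] if 0 <= r < h and 0 <= c < w else 1
         for c in range(col - radius, col + radius + 1)]
        for r in range(row - radius, row + radius + 1)
    ]


def step(grid, row, col, action):
    """Try to move one cell. Returns (new_row, new_col, blocked)."""
    dr, dc = ACTIONS[action]
    nr, nc = row + dr, col + dc
    h, w = len(grid), len(grid[0])
    if not (0 <= nr < h and 0 <= nc < w) or grid[nr][nc] == 1:
        return row, col, True  # obstacle or edge: the agent stays put
    return nr, nc, False
```

With `radius=20` the window is 41 by 41, matching the size given in the question.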

The current task to be performed in this environment is to make it to the top of the grid world starting from a random position along the bottom. A reward of +1 is given for successfully moving upwards. A reward of -1 is given for any move that would hit an obstacle or the edge of the world. All other states receive a reward of 0.
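Written out as code, the reward scheme in the paragraph above might look like this. The function name `reward`, the `UP` constant, and the `blocked` flag (true when the attempted move hit an obstacle or the edge) are hypothetical names introduced for this sketch.

```python
UP = 0  # hypothetical index of the "move up" action


def reward(action, blocked):
    """Reward as described in the question: +1 for a successful move up,
    -1 for a move that would hit an obstacle or the edge, 0 otherwise."""
    if blocked:
        return -1
    if action == UP:
        return 1
    return 0
```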

The agent uses the basic SARSA algorithm with a neural net value function approximator (as discussed in the Sutton book). For policy decisions I’ve tried both ε-greedy and softmax.
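For reference, here is a minimal sketch of the two action-selection rules and the one-step SARSA update. To stay short it swaps in a lookup table for the neural net approximator used in the question, so it is not the setup described above; the function names, the `temperature` parameter, and the step-size and discount defaults are assumptions.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_policy(q_values, temperature, rng=random):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return rng.choices(range(len(q_values)), weights=weights)[0]


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    """Tabular SARSA: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    q maps each state to a list of action values. Returns the TD error.
    """
    td_error = r + gamma * q[s_next][a_next] - q[s][a]
    q[s][a] += alpha * td_error
    return td_error
```

Both selection rules only look at the current Q values, so once every alternative action has a lower value than the blocked one, neither rule will favor the alternatives without help from exploration.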

1 Answer

  1. Editorial Team
    Added an answer on June 8, 2026 at 8:33 pm (2026-06-08T20:33:32+00:00)

    The typical way of teaching such tasks is to give the agent a negative reward each step and then a big payout on completion. You can compensate for the long delay by using eligibility traces and by placing the agent close to the goal initially, and then close to the area it has explored.
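The three ideas in this answer (a per-step penalty with a large completion payout, eligibility traces, and starting near the goal before moving the start farther away) can be sketched together on a toy problem. Everything concrete here is an assumption for illustration: a 1-D corridor stands in for the grid, the table stands in for the neural net, and the reward values (-1 per step, +100 at the goal), the replacing-trace variant, and the linear start schedule are arbitrary choices.

```python
import random


def train_sarsa_lambda(n_cells=10, episodes=300, alpha=0.1, gamma=0.95,
                       lam=0.9, epsilon=0.1, seed=0):
    """Tabular SARSA(lambda) on a corridor; cell 0 is the goal ("top")."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_cells)]  # q[state][action]; 0 = up, 1 = down

    def choose(s):
        # Epsilon-greedy; ties go to "up".
        if rng.random() < epsilon:
            return rng.randrange(2)
        return 0 if q[s][0] >= q[s][1] else 1

    for ep in range(episodes):
        # Start next to the goal at first, then allow starts farther away.
        max_start = min(n_cells - 1, 1 + ep * (n_cells - 1) // episodes)
        s = rng.randint(1, max_start)
        a = choose(s)
        traces = [[0.0, 0.0] for _ in range(n_cells)]

        for _ in range(1000):  # step cap so an episode always ends
            s_next = s - 1 if a == 0 else min(n_cells - 1, s + 1)
            done = s_next == 0
            r = 100.0 if done else -1.0  # big payout on completion, else step cost
            a_next = choose(s_next)
            target = r if done else r + gamma * q[s_next][a_next]
            td_error = target - q[s][a]

            traces[s][a] = 1.0  # replacing trace for the pair just visited
            for i in range(n_cells):
                for j in range(2):
                    # Credit every recently visited pair, then decay its trace.
                    q[i][j] += alpha * td_error * traces[i][j]
                    traces[i][j] *= gamma * lam
            if done:
                break
            s, a = s_next, a_next
    return q
```

Calling `train_sarsa_lambda()` returns the learned table. The traces let the single terminal payout update every state-action pair visited earlier in the episode, which is how they compensate for the long delay between an action and the reward.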
