target/riscv: Ensure to handle all triggered a halt events #1171

Open: wants to merge 1 commit into base branch riscv

@lz-bro (Contributor) commented Nov 20, 2024

If every currently halted hart reports a halt group as its halt reason, then a new "triggered a halt" event has occurred (some hart triggered the halt) and must be handled.
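Here is a minimal sketch of that rule. It assumes a simplified model in which each hart is reduced to a hypothetical `struct hart_model`, which records whether the poll observed the hart as halted and what its halt reason was. The `HALT_*` names only mirror OpenOCD's `RISCV_HALT_*` values. The types and `halt_trigger_missed()` are illustrative and are not the code in this pull request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified halt reasons modeled on OpenOCD's RISCV_HALT_* names. */
enum halt_reason { HALT_EBREAK, HALT_TRIGGER, HALT_GROUP, HALT_UNKNOWN };

/* Hypothetical per-hart snapshot taken during one poll pass. */
struct hart_model {
	bool halted;             /* observed as halted during this poll */
	enum halt_reason reason; /* only meaningful when halted */
};

/*
 * Returns true when at least one hart was observed halted and every observed
 * halted hart reports HALT_GROUP. A halt group only propagates a halt that
 * some hart triggered itself, so in that case the triggering hart's own halt
 * event has not been examined.
 */
static bool halt_trigger_missed(const struct hart_model *harts, size_t n)
{
	size_t halted = 0;

	assert(harts != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!harts[i].halted)
			continue;
		halted++;
		if (harts[i].reason != HALT_GROUP)
			return false; /* this hart reports its own halt reason */
	}
	return halted > 0;
}
```

Suppose one hart was observed as running and the other as halted with `HALT_GROUP`. The function returns true, because the hart that actually triggered the halt has not been examined yet.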

@lz-bro force-pushed the handle-all-trigger-halt branch from 7652128 to 21d836a on November 20, 2024 12:34
```c
	else
		break;
	}
}
```
Collaborator commented:

This loop seems weird. What is its purpose? DCSR is writable, so the debugger could occasionally be tricked into a wrong conclusion.

Halt groups are an optional feature, and I'm quite confused that we don't check whether they are supported.

Could you please provide a test scenario that reproduces your issue? Is it possible to model it with Spike, or do you need specific hardware?

@lz-bro (Contributor, Author) commented Nov 20, 2024

I discovered this bug while testing semihosting on our hardware with SMP enabled. It fails when the following happens:
All currently halted harts report a halt group as their halt reason, while the other harts' recorded state is still running. In fact, one hart did halt by itself, and that caused the other harts to halt through the halt group. The poll then takes this branch:

```c
} else if (halted && running) {
	LOG_TARGET_DEBUG(target, "halt all; halted=%d",
		halted);
	riscv_halt(target);
} else {
```

If such a hart has actually halted but its recorded status is still running, it never reaches the riscv_semihosting() handling below:
```c
if (halt_reason == RISCV_HALT_EBREAK) {
	int retval;
	/* Detect if this EBREAK is a semihosting request. If so, handle it. */
	switch (riscv_semihosting(target, &retval)) {
	case SEMIHOSTING_NONE:
		break;
	case SEMIHOSTING_WAITING:
		/* This hart should remain halted. */
		*next_action = RPH_REMAIN_HALTED;
		break;
	case SEMIHOSTING_HANDLED:
		/* This hart should be resumed, along with any other
		 * harts that halted due to haltgroups. */
		*next_action = RPH_RESUME;
		return ERROR_OK;
	case SEMIHOSTING_ERROR:
		return retval;
	}
}
```

Contributor (Author) commented:

@aap-sc I think this is a bug. Could you provide some suggestions?

Collaborator commented:

@lz-bro I'm still trying to understand your reasoning and what exactly the issue is (the situation is still not quite obvious to me). It will take a couple of days; I'll ask additional questions if necessary.

@JanMatCodasip (Collaborator) commented:
@lz-bro I am afraid I have not understood which case this merge request addresses, not even after reading the commit description and the discussion so far.

Could you please provide a very clear description in the commit message? Doing so will help:

  • Code reviewers to understand your changes (and have them merged)
  • Other users of the OpenOCD source code, both current and future.

Thank you.

@zqb-all (Contributor) commented Nov 29, 2024

Let me try to explain this issue.
Suppose we have two cores, core0 and core1, that belong to one SMP group.
OpenOCD repeatedly calls riscv_openocd_poll to check their status.

When openocd calls riscv_openocd_poll on core0:

A. If no hardware state change occurs, the sequence is:
A.1. Determine that core0 belongs to an SMP group and start checking the status of all cores in the group.
A.2. riscv_poll_hart checks core0. core0 is still running (no change), so next_action=RPH_NONE and running++.
A.3. riscv_poll_hart checks core1. core1 is still running (no change), so next_action=RPH_NONE and running++.
A.4. Finally, should_remain_halted=0, should_resume=0, halted=0 and running=2, so nothing happens.
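The decision at the end of each poll pass can be modeled as a plain function of the four counters. This sketch assumes the branch order `should_remain_halted`, then `should_resume`, then `halted && running`, then a final `else`. The enum and `decide()` are hypothetical names. In the real code these branches call `riscv_halt()` or `riscv_resume()` rather than returning a label.

```c
#include <assert.h>

/* Hypothetical labels for the branches at the end of one SMP poll pass. */
enum poll_outcome {
	REMAIN_HALTED_BRANCH, /* some hart must stay halted (e.g. semihosting waiting) */
	RESUME_ALL,           /* some hart asked to resume (e.g. semihosting handled) */
	HALT_ALL,             /* mix of halted and running harts: halt the whole group */
	ELSE_BRANCH           /* final else; in case A nothing happens */
};

/* Assumed branch order; the real code acts instead of returning a label. */
static enum poll_outcome decide(unsigned should_remain_halted,
				unsigned should_resume,
				unsigned halted, unsigned running)
{
	if (should_remain_halted)
		return REMAIN_HALTED_BRANCH;
	if (should_resume)
		return RESUME_ALL;
	if (halted && running)
		return HALT_ALL;
	return ELSE_BRANCH;
}
```

Case A, `decide(0, 0, 0, 2)`, falls through to the final branch. The counters at the end of cases B and C, `(0, 0, 1, 1)`, select the halt-all branch, and those of case D, `(0, 1, 1, 0)`, select the resume-all branch.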

B. If core0 hits a software breakpoint on the hardware, one possible sequence is:
B.1. Determine that core0 belongs to an SMP group and start checking the status of all cores in the group.
B.2. riscv_poll_hart checks core0. core0 is still running (no change), so next_action=RPH_NONE and running++.
[Between B.2 and B.3, core0 hits the breakpoint and halts, and core1 also halts immediately due to the hardware halt group.]
B.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The halt reason is RISCV_HALT_GROUP, so next_action=RPH_NONE and halted++.
B.4. Finally, should_remain_halted=0, should_resume=0, halted=1 and running=1. The else if (halted && running) branch is taken, which calls riscv_halt to halt all cores in the SMP group. In the process, core0's target->state is also corrected to halted, so the next riscv_openocd_poll call considers core0's status unchanged.
There is no problem in this case.

Now assume again that core0 and core1 are both running, and consider case C.

C. If core0 hits a semihosting ebreak on the hardware, one possible sequence is:
C.1. Determine that core0 belongs to an SMP group and start checking the status of all cores in the group.
C.2. riscv_poll_hart checks core0. core0 is still running (no change), so next_action=RPH_NONE and running++.
[Between C.2 and C.3, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware halt group.]
C.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The halt reason is RISCV_HALT_GROUP, so next_action=RPH_NONE and halted++.
C.4. Finally, should_remain_halted=0, should_resume=0, halted=1 and running=1. The else if (halted && running) branch is taken, which calls riscv_halt to halt all cores in the SMP group. In the process, core0's target->state is also corrected to halted, so the next riscv_openocd_poll call considers core0's status unchanged.
Here is the problem: because core0's status no longer changes, the poll never enters the branch that handles a state change. It therefore never realizes that core0 actually halted on a semihosting ebreak, and never handles the request. core0 and core1 remain halted.

Now assume again that core0 and core1 are both running, and consider case D.

D. If core0 hits a semihosting ebreak on the hardware, but earlier than in case C, one possible sequence is:
D.1. Determine that core0 belongs to an SMP group and start checking the status of all cores in the group.
[Between D.1 and D.2, core0 hits the semihosting ebreak and halts, and core1 also halts immediately due to the hardware halt group.]
D.2. riscv_poll_hart checks core0 and finds that core0 has changed from running to halted. Because the halt reason is RISCV_HALT_EBREAK, semihosting is checked and processed. One possible result is next_action=RPH_RESUME and should_resume++.
D.3. riscv_poll_hart checks core1 and finds that core1's status has changed from running to halted. The halt reason is RISCV_HALT_GROUP, so next_action=RPH_NONE and halted++.
D.4. Finally, should_remain_halted=0, should_resume=1, halted=1 and running=0. The else if (should_resume) branch is taken, which calls riscv_resume, and eventually core0 and core1 both return to the running state.
There is no problem in this case.
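The difference between cases C and D can be reproduced with a toy two-hart model of a single poll pass. Everything here is hypothetical: `struct hart`, `hw_event()` and `poll_once()` are not OpenOCD code. Following the description above, the model assumes that the halt reason (and therefore semihosting) is examined only when a hart's recorded state changes from running to halted. It also assumes that halting all harts simply overwrites the recorded state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-hart model of one riscv_openocd_poll() pass. */
enum hart_state { RUNNING, HALTED };
enum halt_reason { REASON_NONE, REASON_SEMIHOST_EBREAK, REASON_GROUP };

struct hart {
	enum hart_state hw;       /* state the hardware reports */
	enum hart_state recorded; /* state OpenOCD has recorded */
	enum halt_reason reason;
	bool semihosting_handled;
};

/* core0 hits a semihosting ebreak; the hardware halt group halts core1 too. */
static void hw_event(struct hart h[2])
{
	h[0].hw = HALTED;
	h[0].reason = REASON_SEMIHOST_EBREAK;
	h[1].hw = HALTED;
	h[1].reason = REASON_GROUP;
}

/*
 * One poll pass. The hardware event fires just before hart `event_before`
 * is checked (0 = case D, 1 = case C, 2 = no event during this pass).
 * Returns true if the pass resumes the group.
 */
static bool poll_once(struct hart h[2], int event_before)
{
	unsigned should_resume = 0, halted = 0, running = 0;

	assert(event_before >= 0);
	for (int i = 0; i < 2; i++) {
		if (i == event_before)
			hw_event(h);
		if (h[i].recorded == RUNNING && h[i].hw == HALTED) {
			/* State change: only now is the halt reason examined. */
			h[i].recorded = HALTED;
			if (h[i].reason == REASON_SEMIHOST_EBREAK) {
				h[i].semihosting_handled = true;
				should_resume++; /* next_action = RPH_RESUME */
				continue;
			}
		}
		if (h[i].recorded == HALTED)
			halted++;
		else
			running++;
	}

	if (should_resume) {
		/* Resume all harts in the group. */
		for (int i = 0; i < 2; i++) {
			h[i].hw = h[i].recorded = RUNNING;
			h[i].reason = REASON_NONE;
		}
		return true;
	}
	if (halted && running) {
		/* Halt all: recorded state is corrected, halt reasons are not examined. */
		for (int i = 0; i < 2; i++)
			h[i].hw = h[i].recorded = HALTED;
	}
	return false;
}
```

Take case C first. Two passes with `event_before = 1` and then `2` leave both harts halted, and `semihosting_handled` is still false for core0. Case D is a single pass with `event_before = 0`: it handles the request and resumes the group.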

Thank you.

@zqb-all (Contributor) commented Nov 30, 2024

Things are a bit complicated.
The software poll function takes time to run, and a hart may hit an ebreak (breakpoint or semihosting) at any time. The other harts in the same SMP group may then be halted immediately (by a hardware halt group), or they may keep running for a while (for example, when a hart is controlled by another DM and no hardware supports a synchronous halt).
We expect the poll function to eventually bring the SMP group to a consistent halted or running state. But before the software finishes processing, these temporarily running harts may also hit an ebreak, which makes the situation even more complicated.

@zqb-all (Contributor) commented Dec 12, 2024

@JanMatCodasip @aap-sc Does my description help you understand what the issue is?

@JanMatCodasip (Collaborator) commented:

@zqb-all Thank you for describing the situation in more detail. It will take me some time to get back to it.

Labels: None yet
Projects: None yet
4 participants