[BUG] Intermittent Issues with OAK-D W PoE Cameras Streaming Left Camera Data #1042
Comments
@jakaskerl would you mind checking if we can reproduce the issue? I think this will likely come down to a HW issue.
Sorry, I forgot to mention that I have two oaks connected to my laptop (oak0 and oak1). The fact that it happens with both cameras makes me wonder if it's a hardware issue. Please let me know if there's any further information I can share to help you better understand the issue.
Can anyone please provide me with an update?
Hi @guilhermedemouraa
This indicates a power issue, so perhaps recheck the power source (injector/switch)? Change the source if you can. If it is pipeline related, the issue is most likely caused by the feature tracker. I'm not sure which configuration you are using, but we have had issues with it before. The docs page states that the supported resolutions are 480p and 720p, whereas you are using 800p. Thanks,
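As a hedged illustration of the resolution change suggested above, here is a minimal sketch of a left mono camera feeding a feature tracker at 720p instead of 800p. It assumes the depthai v2 C++ API; the node and stream names are hypothetical, not taken from this thread.

```cpp
#include "depthai/depthai.hpp"

int main() {
    dai::Pipeline pipeline;

    // Left mono camera that feeds the feature tracker.
    auto monoLeft = pipeline.create<dai::node::MonoCamera>();
    monoLeft->setBoardSocket(dai::CameraBoardSocket::CAM_B);
    // 720p rather than 800p, per the supported resolutions mentioned above.
    monoLeft->setResolution(dai::MonoCameraProperties::SensorResolution::THE_720_P);

    auto featureTracker = pipeline.create<dai::node::FeatureTracker>();
    monoLeft->out.link(featureTracker->inputImage);

    auto xout = pipeline.create<dai::node::XLinkOut>();
    xout->setStreamName("features");
    featureTracker->outputFeatures.link(xout->input);

    dai::Device device(pipeline);
    return 0;
}
```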
Thanks for getting back to me, @jakaskerl. For more context, here's what I did:
I believe that somewhere in this process, the device crashed. In fact, I cannot even ping it. Here are some logs from my service:
The message "Device crashed, but no crash dump could be extracted" is at least weird. To be clear, I'm not powering it down. The power source is never touched. However my code that creates the pipeline and actively subscribes to the camera stream goes out of scope. So, to "reopen" the camera, I need to start the pipeline all over again... I would appreciate your fast communication on this issue. We have more than 400 oak cameras at farm-ng and can't afford to have them not working properly. |
I am also happy to set up an offline meeting to explain in greater detail any questions you may have. |
Hi @guilhermedemouraa
Perhaps the destructor is not being called properly in the service.
Thanks,
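Not an official recommendation from this thread, but one way to address the destructor concern is to tear the device down deterministically instead of relying on destruction order when the streaming object goes out of scope. A minimal sketch, assuming the depthai v2 C++ API and a hypothetical service that owns the device through a `std::shared_ptr`:

```cpp
#include <memory>
#include "depthai/depthai.hpp"

// Hypothetical teardown path in the service: close the device explicitly
// rather than relying on the shared_ptr destructor running at an unknown time.
void stopCameraStream(std::shared_ptr<dai::Device>& device) {
    if (device && !device->isClosed()) {
        device->close();   // shuts down the XLink connection and watchdog pinging
    }
    device.reset();        // drop the last reference so resources are released now
}
```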
Update

I upgraded both the

However, from time to time I get the following error message:

Monitor thread (device: 18443010E147C31200 [10.95.76.10]) - ping was missed, closing the device connection

Despite this error, I can still ping the device from my system, and after restarting my service, I can re-establish the connection without needing to power cycle the device.

After some digging, I found that the error message I'm getting comes from depthai's watchdog timeout. Here's a relevant code snippet I found:

```cpp
// Example code snippet from DeviceBase::init2
if(watchdogTimeout > std::chrono::milliseconds(0)) {
// Watchdog thread setup
watchdogThread = std::thread([this, watchdogTimeout]() {
try {
XLinkStream stream(connection, device::XLINK_CHANNEL_WATCHDOG, 128);
std::vector<uint8_t> watchdogKeepalive = {0, 0, 0, 0};
while(watchdogRunning) {
stream.write(watchdogKeepalive);
{
std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
lastWatchdogPingTime = std::chrono::steady_clock::now();
}
// Ping with a period half of that of the watchdog timeout
std::this_thread::sleep_for(watchdogTimeout / 2);
}
} catch(const std::exception& ex) {
// ignore
pimpl->logger.debug("Watchdog thread exception caught: {}", ex.what());
}
// Watchdog ended. Useful for checking disconnects
watchdogRunning = false;
});
// Monitor thread setup
monitorThread = std::thread([this, watchdogTimeout]() {
while(watchdogRunning) {
                // Check once per watchdog-timeout period (this sleep is the full timeout, not half)
std::this_thread::sleep_for(watchdogTimeout);
// Check if wd was pinged in the specified watchdogTimeout time.
decltype(lastWatchdogPingTime) prevPingTime;
{
std::unique_lock<std::mutex> lock(lastWatchdogPingTimeMtx);
prevPingTime = lastWatchdogPingTime;
}
                // Recheck that the watchdog is still running, and close the connection if more than twice the watchdog timeout has passed since the last ping
if(watchdogRunning && std::chrono::steady_clock::now() - prevPingTime > watchdogTimeout * 2) {
pimpl->logger.warn("Monitor thread (device: {} [{}]) - ping was missed, closing the device connection", deviceInfo.mxid, deviceInfo.name);
// ping was missed, reset the device
watchdogRunning = false;
// close the underlying connection
connection->close();
}
}
});
    }
```

Assistance Requested

I am looking for guidance on how to:
Thanks again, @jakaskerl
On a different note, I wanted to clarify that I can't upgrade my depthai SDK version to
Could you provide any information on when the new
Thank you for your assistance!
WRT the watchdog, you may try increasing it from 4 s to 4.5 s or disabling it with the following env variables:
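The specific variables were not captured above. As an assumption based on the depthai documentation (not confirmed in this thread), they are likely `DEPTHAI_WATCHDOG` (timeout in milliseconds, 0 to disable) and `DEPTHAI_WATCHDOG_INITIAL_DELAY`. A sketch of setting them from the host process, before any `dai::Device` is created, on a POSIX system:

```cpp
#include <cstdlib>

// Assumed variable names, taken from the depthai docs rather than this thread:
//   DEPTHAI_WATCHDOG               - watchdog timeout in milliseconds (0 disables it)
//   DEPTHAI_WATCHDOG_INITIAL_DELAY - extra delay before the first watchdog check
void configureWatchdog() {
    setenv("DEPTHAI_WATCHDOG", "4500", 1);   // raise from 4 s to 4.5 s
    // setenv("DEPTHAI_WATCHDOG", "0", 1);   // or disable the watchdog entirely
}
```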
This can be done manually by catching any exceptions and restarting the pipeline/program flow to boot the device again.
If it's not crash-induced, then this likely comes from high network congestion. See if it is possible to remove other traffic, make sure the Ethernet link is 1 Gbit through the whole network, etc.
We'd gladly take a look at this - do you have any repro steps that make it fail?
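A minimal sketch of that restart flow, assuming the depthai v2 C++ API; the stream name and `buildPipeline` are placeholders, not the author's actual code:

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>
#include "depthai/depthai.hpp"

// Keeps re-booting the device whenever the connection drops. buildPipeline is
// whatever function the application already uses to construct its dai::Pipeline.
void runWithReconnect(std::atomic<bool>& running, const std::function<dai::Pipeline()>& buildPipeline) {
    while (running) {
        try {
            dai::Device device(buildPipeline());
            auto queue = device.getOutputQueue("left", 8, false);   // stream name is a placeholder
            while (running && !device.isClosed()) {
                auto frame = queue->get<dai::ImgFrame>();           // blocks; throws once the device drops
                // ... hand the frame off to the application (e.g. the gRPC stream) here ...
                (void)frame;
            }
        } catch (const std::exception& ex) {
            std::cerr << "Device lost (" << ex.what() << "), restarting pipeline..." << std::endl;
            std::this_thread::sleep_for(std::chrono::seconds(2));   // back off before reconnecting
        }
    }
}
```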
Hello DepthAI team,

Thank you for the initial suggestions regarding the watchdog and network settings. I wanted to follow up with a bit more detail on how our system is structured and request further guidance on handling dropped-device scenarios.

In our application, we register a callback for new frames using DepthAI's API. This callback is triggered whenever a new packet/frame is received. However, this design leaves me uncertain about how to catch exceptions or handle a dropped-device event. Specifically, I'm looking for recommendations on how to implement reconnection logic in a situation where the device is dropped but no new frames are being received (i.e., the callback is not triggered).

Since I rely heavily on the callback to process data, I am unsure where or how to catch any exceptions that might arise due to a device being dropped or encountering other issues. I found this resource online, but I don't think this could work for the callback approach.

Could you provide any examples or guidance on how to detect and handle these cases within the callback system or elsewhere in the data flow? Additionally, if there are best practices for managing reconnections, I would greatly appreciate your insights.

Thank you for your continued support.
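Not an answer from the maintainers in this thread, but one possible pattern for the callback approach is to timestamp every callback invocation and run a small supervisor loop that recreates the device when frames stop arriving or `isClosed()` reports the connection was dropped. A sketch assuming the depthai v2 C++ queue-callback API; the stream name, thresholds, and `buildPipeline` are hypothetical:

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <memory>
#include <thread>
#include "depthai/depthai.hpp"

// Timestamp (steady_clock nanoseconds) of the last frame delivered by the callback.
std::atomic<long long> lastFrameNs{0};

static long long nowNs() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch()).count();
}

// True when no frame has arrived for longer than the given threshold.
bool callbackIsStale(std::chrono::seconds threshold) {
    return nowNs() - lastFrameNs.load() >
           std::chrono::duration_cast<std::chrono::nanoseconds>(threshold).count();
}

// Registers the frame callback on the (placeholder) "left" queue and stamps each arrival.
void attachCallback(dai::Device& device) {
    auto queue = device.getOutputQueue("left", 8, false);
    queue->addCallback([](std::shared_ptr<dai::ADatatype>) {
        lastFrameNs = nowNs();
        // ... forward the frame to the application (e.g. the gRPC stream) here ...
    });
}

// Supervisor loop: recreate the device when the callback goes quiet or the
// connection is reported closed. buildPipeline is the application's own setup.
void superviseDevice(std::atomic<bool>& running, const std::function<dai::Pipeline()>& buildPipeline) {
    std::shared_ptr<dai::Device> device;
    while (running) {
        bool needsRestart = !device || device->isClosed() || callbackIsStale(std::chrono::seconds(5));
        if (needsRestart) {
            device.reset();   // tear down the stale connection before reconnecting
            try {
                device = std::make_shared<dai::Device>(buildPipeline());
                attachCallback(*device);
                lastFrameNs = nowNs();   // give the fresh connection time before the staleness check fires
            } catch (const std::exception&) {
                // Device not reachable yet; retry on the next iteration.
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(2));
    }
}
```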
Hi @themarpe and @jakaskerl,
Just following up on my previous message. Any guidance on how to implement robust reconnection logic using the callback approach?
Describe the bug
I'm experiencing intermittent issues with streaming data from the left camera of OAK-D W PoE cameras. The left camera data is not always present, and the device sometimes fails to be recognized upon service restart. This is the log I get from my service:
Failed to resolve stream: Failed to find device after booting, error message: X_LINK_DEVICE_NOT_FOUND
Some other times, I get this log:
[1844301051BEC41200] [10.95.76.11] [1718734907.447] [host] [warning] There was a fatal error. Crash dump saved to /tmp/depthai_LR3TVl/1844301051BEC41200-depthai_crash_dump.json
I'm attaching the crash reports here, but they all seem like code bugs.
Setup Details:
Streaming Configuration:
Issues Encountered:
Intermittent Left Camera Data:
Occasionally, after rebooting the PC, the left camera stream works without any issues.
Minimal Reproducible Example
It's hard to add everything here; I have a gRPC server that streams the OAK topics and a Python gRPC client that subscribes to them. Here's how I created my dai pipeline:
I've tried both with and without `cam->setIsp3aFps(5);`.
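For reference, here is a minimal sketch of what a pipeline along these lines might look like. It is a hypothetical reconstruction, not the author's actual code, assuming the depthai v2 C++ API, a left mono camera at 800p driving a FeatureTracker, and the `setIsp3aFps(5)` call applied to an RGB camera node:

```cpp
#include "depthai/depthai.hpp"

// Hypothetical reconstruction of such a pipeline; names and layout are assumptions.
dai::Pipeline makePipeline() {
    dai::Pipeline pipeline;

    // RGB camera; setIsp3aFps(5) is the call the author toggled on and off.
    auto cam = pipeline.create<dai::node::ColorCamera>();
    cam->setIsp3aFps(5);
    auto xoutRgb = pipeline.create<dai::node::XLinkOut>();
    xoutRgb->setStreamName("rgb");
    cam->video.link(xoutRgb->input);

    // Left mono camera at 800p feeding the feature tracker.
    auto left = pipeline.create<dai::node::MonoCamera>();
    left->setBoardSocket(dai::CameraBoardSocket::CAM_B);
    left->setResolution(dai::MonoCameraProperties::SensorResolution::THE_800_P);

    auto tracker = pipeline.create<dai::node::FeatureTracker>();
    left->out.link(tracker->inputImage);

    auto xoutLeft = pipeline.create<dai::node::XLinkOut>();
    xoutLeft->setStreamName("left");
    left->out.link(xoutLeft->input);

    auto xoutFeatures = pipeline.create<dai::node::XLinkOut>();
    xoutFeatures->setStreamName("features");
    tracker->outputFeatures.link(xoutFeatures->input);

    return pipeline;
}
```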
Expected behavior
I expect the left camera data to be consistently available and the device to be reliably recognized upon service restart.
Attach system log
depthai_DX4bJu_1844301051BEC41200-depthai_crash_dump.json
depthai_tYFDEE_18443010E147C31200-depthai_crash_dump.json
depthai_Zgzmrx_1844301051BEC41200-depthai_crash_dump.json
Additional context
Described above...