Describe the bug
We have placed cameras in the field to record data, only to return a few days later and find them in a state where they do not respond to SDK calls and can only be recovered through a full power cycle of the camera. This happens consistently on certain computers. The cameras stay in this stuck state even if we restart the computer they are plugged into. We cannot power cycle these cameras in software because we have the power cable directly soldered to an always-on port on the motherboard. Moreover, we can replicate the problem with power plugged into the wall wart.
We have been told that we might be able to create a command line tool to reset the camera through the RGB camera. We will explore this. However, as it stands, we expect that even with this fix, we would have to live with the system being offline for a minute or two every day or so. This is part of a hospital alarm system, so we would rather not have that down time.
To Reproduce
We can reproduce this behavior on SOME machines. We are trying to track down the differences between those machines on which the problem reproduces and those on which it doesn't, but it is slow going. In particular, we have two NUCs that are the same model and have the same updated USB drivers, but one NUC consistently reproduces the problem and the other doesn't. We believe we've narrowed the problem down to SOME difference in BIOS or other computer firmware, but we haven't found the smoking gun. Even if we manage to find a collection of BIOS settings and firmware versions for which we can't reproduce the problem, we would prefer, for peace of mind, that the issue were fixed in the Kinect firmware.
Notes: this repros 90% of the time. This may or may not be the minimal set of repro steps. The fact that it is intermittent makes it easy to get superstitious.
Let me know what info you need and I'll try to provide it to you.
Please try upgrading your firmware to 1.6.102075014. It has a couple minor USB connection updates that may help out here.
Grab the 1.2.0-alpha.7 NuGet package and crack it open to get the firmware: https://www.nuget.org/packages/Microsoft.Azure.Kinect.Sensor/1.2.0-alpha.7. The SDK is alpha quality, but the firmware is ready to go.
We have installed the new firmware on 3 cameras. We are running them and monitoring for problems. I'll update you when we have more information.
We have set up 6 machines with hardware identical to what we intend to release. They have updated BIOS, the latest Azure Kinect Sensor SDK, and the Azure Kinects have the latest firmware. We started them last night, and this morning 2 of them were in this state.
We have some evidence that this happens in the middle of our system running, without being preceded by a freeze or a crash. We feel stuck at this point. We have developed a reset procedure, so this isn't catastrophic. On the other hand, as is, we have to live with each of our cameras being down for a minute or two every day. We can ship you a box that gets into this state. We can also provide you remote access to the box. We're also open to any other ideas that might provide insight into the nature of the problem and a solution.
On the NUC with the repro, can you configure it to just run the k4aViewer.exe from our MSI? And let it run indefinitely?
When the issue repros (when the viewer reports an error), please run:
pnputil.exe /export-pnpstate <filename>.pnp
NOTE: this will share a bunch of PC configuration info if you post the file publicly.
Also, why in Step #6 do you unplug the NUC? If you reset the Kinect then the device should be recovered without a reboot. This PR has a version of AzureKinectFirmwareTool that will reboot the device in this state.
We'll try to repro by letting k4aViewer.exe run and we'll set logging as you suggest. Thanks!
Also why in Step #6 do you unplug the NUC? If you reset the Kinect then the device should be recovered without a reboot.
The set of steps in the original post is meant to get the camera into a stuck state in a few minutes so that we don't have to wait a day for a repro. Some of the steps are probably unnecessary, but this is what worked for us. To emphasize, the purpose of this sequence of steps is not to clear the stuck state, but instead it is an (admittedly unnatural) way to get the system INTO a stuck state so that we/you can log, run diagnostics, etc.
I agree that we can get out of the state by resetting with the firmware tool, and that is what our system now does. This does not require a reboot. However, this stuck state is still a problem for us because it means our cameras briefly go offline daily.
If I understand the problem correctly, there are 2 issues here and we need to separate them.
1) The Azure Kinect stops working while streaming.
2) When certain NUCs are rebooted, the Azure Kinect starts up in a state in which it cannot be communicated with, and Device Manager reports that the USB devices are in an error state.
For the first part of the issue, if we can repro and diagnose why it stops working then the reboot won't be needed.
The 2nd part, where rebooting the NUC with the Azure Kinect attached leaves the device non-responsive, is more challenging. As we understand the problem, the issue follows the NUC, not the Kinect. If this turns out to be a NUC problem, then we may not be able to fix this. Adding the reset command to your device startup script is a workaround, but it will guarantee that the device is functional after every reboot. To investigate and diagnose this issue we are going to need a NUC capable of reproducing the problem. So far none of our NUCs repro the issue.
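As an illustration of the startup-script workaround, here is a minimal Python sketch that issues the reset a few times until it succeeds. It assumes that `AzureKinectFirmwareTool.exe -r [serial]` is the reset command and that the tool exits with code 0 on success; both should be checked against the tool's own usage output. The function names `reset_command` and `reset_at_startup` are hypothetical.

```python
import subprocess
import time


def reset_command(tool_path="AzureKinectFirmwareTool.exe", serial=None):
    """Build the reset command line.

    Assumption: "-r" is the reset switch and an optional serial number
    selects a specific device. Verify against the tool's usage text.
    """
    cmd = [tool_path, "-r"]
    if serial:
        cmd.append(serial)
    return cmd


def reset_at_startup(cmd, attempts=3, delay_s=5.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run the reset command until it reports success or attempts run out.

    Assumption: a zero exit code means the reset succeeded.
    `run` and `sleep` are injectable so the logic can be tested
    without a camera attached.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            sleep(delay_s)  # give the USB device time to re-enumerate
    return False
```

A startup script could call `reset_at_startup(reset_command())` before launching the capture application and log a warning if it returns `False`.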
I got logging working. It was a misunderstanding on my part about how to apply environment variables.
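For anyone else hitting the same confusion, here is a minimal Python sketch of launching the viewer with SDK file logging turned on. It assumes the SDK reads the `K4A_ENABLE_LOG_TO_A_FILE` and `K4A_LOG_LEVEL` environment variables at process start, that the log path must end in `.log`, and that the level is one of the letters `c`, `e`, `w`, `i`, `t`. The helper names and the viewer path are hypothetical.

```python
import os
import subprocess

# Assumed level letters: critical, error, warning, info, trace.
VALID_LEVELS = ("c", "e", "w", "i", "t")


def k4a_logging_env(log_path, level="w", base_env=None):
    """Return a copy of the environment with K4A file logging enabled."""
    if level not in VALID_LEVELS:
        raise ValueError(f"unknown log level: {level!r}")
    if not log_path.endswith(".log"):
        raise ValueError("log path is assumed to need a .log extension")
    env = dict(os.environ if base_env is None else base_env)
    env["K4A_ENABLE_LOG_TO_A_FILE"] = log_path
    env["K4A_LOG_LEVEL"] = level
    return env


def launch_viewer(viewer_path, log_path, level="i"):
    """Start the viewer as a child process that inherits the logging variables."""
    return subprocess.Popen([viewer_path], env=k4a_logging_env(log_path, level))
```

The variables have to be set in the environment of the process that loads the SDK, which is why the sketch passes them to the child process instead of setting them after launch.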
Update: we ran 5 machines with k4aViewer over the weekend. 4 of the 5 were still running this morning. The 5th rebooted unexpectedly early this morning. Our logs indicate that the camera was NOT in a stuck state when this happened. The k4a.log file did not indicate a problem. In summary, we don't think we've replicated the problem yet by simply letting k4aViewer run.
In other news, we had another box running our system (not k4aViewer) get into a stuck state over the weekend. This was surprising because we were running our hacked firmware tool reset on it. It turns out that this time the machine was unable to communicate with the RGB camera, which we haven't seen before. We intend to fix this by trying a reset through both cameras. I will check your pull request to see if it does this or not.
We still have not been able to replicate the problem simply by running k4aViewer. We can provide you access to a box where the problem replicates while running our system, in case there is any way for you to run more verbose diagnostics or otherwise determine why the camera is getting into the state in the first place.
In other news, I looked at your pull request. It actually doesn't reset the camera in the case where the depth camera is accessible but the RGB camera is not (there is a block at the bottom of firmware_create that deletes the return object if K4A_FAILED(result) is true, and K4A_FAILED(result) is true if the depth camera is accessible but RGB is not).
I rewrote firmware_create to keep track of separate results for the RGB camera and the depth camera. In the case where you're resetting, it deletes the return object only if both fail. If you're not resetting, it deletes the return object if either fails. Let me know if you want the code.
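The rule itself is small enough to sketch. The following is not the actual firmware_create code (which is C); it is a Python illustration of the decision described above, with a hypothetical function name and boolean inputs standing in for the two per-camera results.

```python
def should_discard_handle(depth_ok, rgb_ok, resetting):
    """Decide whether the firmware handle should be deleted after creation.

    depth_ok / rgb_ok: whether each camera's connection succeeded.
    resetting: True when the handle will only be used to reset the device.
    """
    if resetting:
        # A reset can go through whichever camera is still reachable,
        # so give up only when neither one responded.
        return not depth_ok and not rgb_ok
    # Any other firmware operation needs both cameras.
    return not (depth_ok and rgb_ok)
```

With this rule, the case from the pull request (depth camera accessible, RGB camera not) keeps the handle when resetting and discards it otherwise.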
In other news, I looked at your pull request. It actually doesn't reset the camera in the case where the depth camera is accessible but the RGB camera is not (there is a block at the bottom of firmware_create that deletes the return object if K4A_FAILED(result) is true, and K4A_FAILED(result) is true if the depth camera is accessible but RGB is not).
I recently ran into this too (#665); the fix will be in 1.2.0.
@jbrownkramer, we recently released version 1.2.0 of the SDK, which contains the fix for #665, so the Firmware tool can now be used to reset the camera. That would be a workaround for your issue, and you would have to add this step to your reboot script. Please let us know if it is working.
In recent testing, we've only needed to issue reset commands to the camera after machine restarts. So our immediate concerns have been addressed.