Dxvk: Monster Hunter World randomly freezes

Created on 17 Dec 2018  ·  79 Comments  ·  Source: doitsujin/dxvk

Monster Hunter World (with Proton) randomly freezes.
This usually happens anywhere between 10 minutes and 4 hours into a session, so the interval is long and unpredictable.

DXVK_HUD (with memory) was enabled; at the time of the freeze, about 3.9 GB (presumably VRAM) of 6 GB was in use.

The most noticeable clue is the dmesg output:

NVRM: Xid (PCI:0000:09:00): 31, Ch 0000004b, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_4 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

Xid 31, the address 0x0_00000000, intr 10000000, and ACCESS_TYPE_READ are the same every time.
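To compare these fields across many freezes, the Xid line can be split into named parts. This is a minimal sketch: the regular expression is inferred only from the line quoted above, the `high_low` reading of the address (a 32-bit low half) is an assumption, and `parse_xid` is a hypothetical helper name.

```python
import re

# Pattern inferred from the dmesg line quoted in this report; other driver
# versions may format Xid messages differently (assumption).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), "
    r"Ch (?P<channel>[0-9a-fA-F]+), intr (?P<intr>[0-9a-fA-F]+)\. "
    r"MMU Fault: ENGINE (?P<engine>\S+) (?P<client>\S+) "
    r"faulted @ (?P<addr>0x[0-9a-fA-F]+_[0-9a-fA-F]+)\. "
    r"Fault is of type (?P<fault>\S+) (?P<access>\S+)"
)


def parse_xid(line):
    """Return the fields of an NVRM Xid MMU-fault line, or None if no match."""
    match = XID_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["xid"] = int(fields["xid"])
    # Assumption: the address is printed as <high>_<low> hex halves,
    # with the low half being 32 bits wide (e.g. 0x1_5d1ee000).
    high, low = fields["addr"][2:].split("_")
    fields["addr_value"] = (int(high, 16) << 32) | int(low, 16)
    return fields
```

Feeding every Xid line from several sessions through `parse_xid` makes it easy to check whether `fault`, `access` and `addr_value` really stay constant.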

To me it looks like a simple nullptr access, since the address is always 0x0, but I don't know how to investigate the problem further. I cannot run the game for hours with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation or apitrace, as that makes it practically unplayable.

PROTON_USE_WINED3D=1 just results in a black screen.
Toggling "Allow Flipping" (in nvidia-settings) on or off does not change anything.

Please let me know how to make this report more useful; I am out of ideas.

Software information

  • Monster Hunter World
  • vsync: off
  • 30fps lock (getting weird input lag otherwise sometimes)
  • Steam / Proton 3.16-beta5

System information

  • GPU: Nvidia GeForce GTX 1060 6GB
  • Driver: nvidia-drivers-415.23
  • Wine version: Proton 3.16-beta5 (???)
  • DXVK version: Proton 3.16-beta5 (dxvk 0.93)
  • Kernel: 4.19.10
  • RAM: 16 GB
  • CPU: Ryzen 2700X

Log files

(with DXVK_LOG_LEVEL=debug and DXVK_HUD=devinfo,fps,memory)

EDIT:
The game overall runs pretty well, just the random freezes are a pretty frustrating problem.

EDIT2:
The screen freezes but the game background music is still running.

nvidia

All 79 comments

I'm aware of this, but I cannot debug this. I don't even know what the dmesg message means exactly and what could possibly cause it, but it's definitely not a null pointer read in dxvk.

Since apitrace isn't going to help here, I would still ask you to run the game with VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation set. Right now I have nothing to work with at all.

Hm, OK, I will try to get something out of
VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation
but please tell me where I should expect to see its output.
In d3d11.log? In dxgi.log? In the Proton log? On the console/terminal?
I ask so that my attempts will not be in vain.

EDIT:
For the record, I assumed a null pointer read on the GPU, not in dxvk; I only meant that dxvk somehow passes a null pointer to the GPU. But I admit my knowledge of Vulkan is very limited.

It will write to stdout, so capturing the console output should work. It will not appear in the DXVK log files, and I don't know whether the Proton log captures it.
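One way to capture that console output is to launch the program from a small wrapper that sets the layer variable and redirects both output streams to a file. This is only a sketch: `run_with_validation` is a hypothetical helper, and how the game is actually started under Steam/Proton is not covered here.

```python
import os
import subprocess


def run_with_validation(command, log_path,
                        layer="VK_LAYER_LUNARG_standard_validation"):
    """Run `command` with the layer enabled; write stdout and stderr to log_path."""
    # Copy the current environment and add the variable named in this thread.
    env = dict(os.environ, VK_INSTANCE_LAYERS=layer)
    with open(log_path, "wb") as log:
        proc = subprocess.run(command, env=env,
                              stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode
```

The `command` argument is whatever list of arguments starts the application; merging stderr into the same file keeps validation messages and any other console output in one place and in order.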

> only that dxvk passes a null pointer to the gpu somehow

GPU pointers are hidden behind abstractions, so no, at least not directly.

Just a brief test already shows some errors like:

VUID-VkRenderPassCreateInfo-pDependencies-00837(ERROR / SPEC): msgNum: 0 - Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
    Objects: 1
       [0] 0x0, type: 0, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-VkRenderPassCreateInfo-pDependencies-00837 ] Object: VK_NULL_HANDLE (Type = 0) | Dependency 1 specifies a source stage mask that contains stages not in the GRAPHICS pipeline as used by the source subpass 0. The Vulkan spec states: For any element of pDependencies, if the srcSubpass is not VK_SUBPASS_EXTERNAL, all stage flags included in the srcStageMask member of that dependency must be a pipeline stage supported by the pipeline identified by the pipelineBindPoint member of the source subpass. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkRenderPassCreateInfo-pDependencies-00837)
VUID-VkSubpassDependency-srcAccessMask-00868(ERROR / SPEC): msgNum: 0 - vkCreateRenderPass(): pDependencies[3].srcAccessMask (0xa000540) is not supported by srcStageMask (0x8000). The Vulkan spec states: Any access flag included in srcAccessMask must be supported by one of the pipeline stages in srcStageMask, as specified in the table of supported access types (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkSubpassDependency-srcAccessMask-00868)
    Objects: 1
       [0] 0x0, type: 0, name: (null)

Note, however, that the game had not frozen yet, so these errors might be unrelated to the freeze itself.
MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation.log

Tomorrow I will try to keep it running until the game freezes.

Those are harmless and occur in every game because of some transform feedback issue.
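The two masks in the srcAccessMask message can be decoded by hand to see which bits are involved. The sketch below uses flag values as recalled from the Vulkan headers (an assumption; check them against your own `vulkan_core.h`), and `decode` is a hypothetical helper.

```python
# Bit values recalled from vulkan_core.h (assumption: verify before relying on them).
PIPELINE_STAGE_BITS = {
    0x00000001: "TOP_OF_PIPE",
    0x00000080: "FRAGMENT_SHADER",
    0x00000400: "COLOR_ATTACHMENT_OUTPUT",
    0x00000800: "COMPUTE_SHADER",
    0x00008000: "ALL_GRAPHICS",
    0x00010000: "ALL_COMMANDS",
    0x01000000: "TRANSFORM_FEEDBACK_EXT",
}

ACCESS_BITS = {
    0x00000020: "SHADER_READ",
    0x00000040: "SHADER_WRITE",
    0x00000100: "COLOR_ATTACHMENT_WRITE",
    0x00000400: "DEPTH_STENCIL_ATTACHMENT_WRITE",
    0x02000000: "TRANSFORM_FEEDBACK_WRITE_EXT",
    0x04000000: "TRANSFORM_FEEDBACK_COUNTER_READ_EXT",
    0x08000000: "TRANSFORM_FEEDBACK_COUNTER_WRITE_EXT",
}


def decode(mask, table):
    """Return the names of all known bits set in mask, lowest bit first."""
    names = [name for bit, name in sorted(table.items()) if mask & bit]
    unknown = mask & ~sum(table)  # bits not listed in this (partial) table
    if unknown:
        names.append(hex(unknown))
    return names


print(decode(0x8000, PIPELINE_STAGE_BITS))
print(decode(0xa000540, ACCESS_BITS))
```

Under these assumed values, the access mask contains two transform feedback write bits while the stage mask is the single ALL_GRAPHICS bit, which would be consistent with the explanation that the message stems from transform feedback.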

Log with game frozen:
MonsterHunterWorld_VK_LAYER_LUNARG_standard_validation_long.log

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
    Objects: 1
       [0] 0xce03, type: 22, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkResetDescriptorPool-descriptorPool-00313 ] Object: 0xce03 (Type = 22) | It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)

I hope this helps. :)

The rest

VUID-vkDestroyFramebuffer-framebuffer-00892(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)
    Objects: 1
       [0] 0x4bf461f, type: 24, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkDestroyFramebuffer-framebuffer-00892 ] Object: 0x4bf461f (Type = 24) | Cannot call vkDestroyFramebuffer on Framebuffer 0x4bf461f that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to framebuffer must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFramebuffer-framebuffer-00892)

...

VUID-vkDestroyBufferView-bufferView-00936(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)
    Objects: 1
       [0] 0x4bf419a, type: 13, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkDestroyBufferView-bufferView-00936 ] Object: 0x4bf419a (Type = 13) | Cannot call vkDestroyBufferView on BufferView 0x4bf419a that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to bufferView must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyBufferView-bufferView-00936)

I guess these only appear because I killed the process (kill -9 pid) after it had been stuck for over a minute.

Also new, but does not seem critical

UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch(ERROR / SPEC): msgNum: 0 - Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage
    Objects: 1
       [0] 0xd2e, type: 15, name: (null)
Validation(ERROR): msg_code: 0:  [ UNASSIGNED-CoreValidation-Shader-InterfaceTypeMismatch ] Object: 0xd2e (Type = 15) | Decoration mismatch on location 30.0: is per-patch in tessellation control shader stage but per-vertex in tessellation evaluation shader stage

EDIT:
new at the end of dxgi.log:

err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST
err:   DxvkSubmissionQueue: Failed to sync fence: VK_ERROR_DEVICE_LOST

new at the end of d3d11.log:

err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST
err:   DxvkDevice: Command buffer submission failed: VK_ERROR_DEVICE_LOST

When do the VK_ERROR_DEVICE_LOST issues start happening? I would suspect that the following is actually caused by those errors, and not causing them:

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)

I honestly can't tell; there are no timestamps in these logs. It happens too quickly, so I guess around the same time.

Is there any way to add timestamps, or to prove it? Or do you want me to test something else?
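One generic way to get timestamps is to pipe the console output through a filter that prefixes every line with the time at which it was read. This is a sketch, not a DXVK feature; `stamp_lines` and `main` are hypothetical names.

```python
import sys
import time


def stamp_lines(lines, clock=time.time):
    """Yield each line prefixed with the epoch time at which it was read."""
    for line in lines:
        yield f"[{clock():.6f}] {line.rstrip(chr(10))}"


def main():
    # Read from stdin and flush after every line so that nothing is lost
    # if the process has to be killed after a freeze.
    for stamped in stamp_lines(sys.stdin):
        print(stamped, flush=True)
```

Saved as a script that calls `main()`, this can sit at the end of a pipe behind the game's console output. The timestamps reflect when a line reached the filter, not when the driver produced it, so they are only as precise as the output buffering allows; comparing against the kernel log also needs wall-clock times there (for example `dmesg -T`).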

Further observation:

  • DXVK_STATE_CACHE=0 seems to make it worse, tried it 3 times and it froze within the first 15min.
    (But take it with a grain of salt, might be unrelated)
  • DXVK_HUD stats at the moment it froze:
    • GeForce GTX 1060 6GB
    • Driver: 415.23.0
    • Vulkan: 1.1.84
    • FPS: 30.0
    • min: 9.7 max: 57.0
    • Queue submissions: 5
    • Draw calls: 1310
    • Dispatch calls: 149
    • Render passes: 127
    • Graphics pipelines: 536
    • Compute pipelines: 140
    • Memory allocated: 2948 MB
    • Memory used: 2779 MB
  • I also tried VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump but could not get it working.
    EDIT: I managed to get it working, but it produces a 1+ GB file just getting to the menu and runs at about 1 fps, so it is really unplayable this time.
    EDIT2: VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace + vktrace -o mhw.vktrace produces a huge output file as well, with slightly better performance, but I cannot vkreplay this file.
  • After the freeze, nvidia-smi reported 0% GPU usage for me (I have seen other reports where it is stuck at 100%).

But I found another bug, which might taint some of my reports so far.

The game crashes when switching to the F1 virtual terminal.
To reproduce:

  • Open the Game
  • Wait for it to load until you are in the Menu
  • Press "Ctrl + Alt + F1"
    -> crash in
    =>0 0x00007f0d31a42e09 in libnvidia-glcore.so.415.25 (+0x11a0e09) (0x00007f0d2c84b4a0)
    steam-582010_dump_v5.log

This might invalidate my "after the freeze nvidia-smi reports 0% GPU usage" comment.

But it also could be caused by the new:

  • DXVK version: Proton 3.16-beta6 (dxvk 0.94)

or

  • Driver: nvidia-drivers-415.25

@doitsujin Happy to help; let me know how one can contribute. I have a GTX 1080 (nvidia 415.25) and of course I am a _victim_ of the same bug.
It could also be a driver bug, quite frankly... but I'm not sure.

Is there a way to trace the API and then try to save the trace file for you? Although the file may be gigantic! :)

I also hit this bug; here is my log (with Proton 3.16-6 beta) until the next Proton release arrives.

For this one I played for a few hours, then went to take a short nap and left the game running (logged in) to check whether the bug would occur even when doing nothing... and it did.
I can't say whether that log will help you, but there it is. If there is anything I can do to give you a better report, I will.

I will try to replicate the bug on the next Proton update, whenever it arrives.


my system spec:

inxi -b
System:    Host: linux Kernel: 4.20.6-1-default x86_64 bits: 64 Desktop: KDE Plasma 5.14.5 
           Distro: openSUSE Tumbleweed 20190209 
Machine:   Type: Desktop Mobo: ASUSTeK model: Z170 PRO GAMING v: Rev X.0x serial: <root required> UEFI: American Megatrends 
           v: 3805 date: 05/16/2018 
CPU:       Quad Core: Intel Core i5-6600K type: MCP speed: 4374 MHz min/max: 800/4400 MHz 
Graphics:  Device-1: NVIDIA GM204 [GeForce GTX 970] driver: nvidia v: 410.93 
           Display: x11 server: X.Org 1.20.3 driver: nvidia resolution: 1920x1080~60Hz, 1920x1080~60Hz 
           OpenGL: renderer: GeForce GTX 970/PCIe/SSE2 v: 4.6.0 NVIDIA 410.93 
Network:   Device-1: Intel Ethernet I219-V driver: e1000e 
Info:      Processes: 443 Uptime: 02:47:03  up 6 days  3:15,  3 users,  load average: 0.70, 1.07, 1.39 Memory: 15.60 GiB 
           used: 6.94 GiB (44.5%) Shell: bash inxi: 3.0.30

Just came back to the game, I'll see what I can do to produce some logs, can confirm that the same error occurs on the latest stable proton. Nvidia's website recommends cuda-memcheck or cuda-gdb, but I haven't had any luck getting cuda-gdb to attach properly (granted, that's my first attempt at gdb, so I might be missing something there).

I'll see what I can come up with for VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump and VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_vktrace tomorrow too

This is a note on the Xid message. As documented here, Xid 31 is a GPU memory page fault, usually caused by an invalid memory access. In this case, it looks like a null pointer read.

For debugging with cuda-gdb, it depends on the number of gpus... Please see this.

It is probably possible to dump the core as instructed here

BTW, I am not an expert in GPU/DXVK... I am still learning from nvidia's manual. Feel free to correct me if I am wrong.

Managed to get cuda-gdb attached to it, fingers crossed!

To get cuda-gdb attached, you'll need to do the following:

# Start MHW and get into the game proper; the early menu screens will crash if you attach too early
ps aux | grep MonsterHunterWorld.exe # Note the PID of the actual executable
cuda-gdb
# The rest of these inside the cuda-gdb shell
handle SIGUSR1 nostop noprint
handle SIGQUIT nostop noprint
set cuda api_failures stop
attach <mhw pid from above>
continue

At this point the game will run, and hopefully will give us a nice backtrace when the null pointer deref occurs

@Xaenalt Could you please test disabling the nvapi hack and see if things improve? Put "dxgi.nvapiHack = False" in dxvk.conf, and add DXVK_CONFIG_FILE=/path/to/dxvk.conf to the launcher command line.

Will do!

No change with dxgi.nvapiHack = False, crash still occurs

Regular log from the crash while I attempt to get the SDK set up
steam-582010.log

Is this a driver bug? Has a bug report been submitted to Nvidia?

We'll know soon I hope. I got the api_trace working, and a 2TB drive to hold the log. Worst case I'm gonna leave it overnight and hope the error is in there

AHA, gotcha! Log uploading now! :D
(it'll be pretty big, I'm gzipping it to try to reduce size, but expect a long gunzip)
We lose the GPU on frame 13485 with the VK_ERROR_DEVICE_LOST (-4) error. I have a hotkey to kill -9 the game, which kicks in 2 frames later. I gave it a few minutes of wall clock time, since frames were on average taking maybe half a second
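For a dump of this size it helps to stream the file instead of unpacking it first. The sketch below finds the first line containing the error string in a plain or gzip-compressed log; it makes no assumption about the api_dump line format, and `first_match` is a hypothetical helper.

```python
import gzip


def first_match(path, needle="VK_ERROR_DEVICE_LOST"):
    """Return (line_number, line) for the first line containing needle, else None."""
    # Read .gz files through gzip so the log never has to be unpacked on disk.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if needle in line:
                return number, line.rstrip("\n")
    return None
```

The returned line number gives a starting point for looking at the calls immediately before the device was lost.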

33G to 1G, wow, that compressed really well o.O

It's too big to upload to github directly, but I threw it into my gdrive, let me know if you have any issues pulling it: https://drive.google.com/file/d/1SHxowR6NZlSlUC4sm40o6m5WPk89OoXg/view?usp=sharing

I may be wrong, I'm no expert, but it seems like the semaphore at 0x16d4fc10 might be getting overwritten? It looks like some buffer copies target that area too. That might be expected behavior, idk

Hopefully @doitsujin will be able to take a peep at it? :)

Fingers crossed that the error is in plain sight in there :)

Any way we can help track down what's causing the lost device?

Does this still happen with the fixed shader and the latest driver? There's an admittedly small chance that the bad shader caused these hangs in the first place.

There doesn't really seem to be anything wrong apart from it, and these hangs do seem to be specific to Nvidia.

Testing now with that patch you just provided in the other bug (https://github.com/doitsujin/dxvk/issues/930 if anyone else wants to try as well). Should I grab another API trace if it encounters the hang?

@doitsujin I'm sorry to say, it just happened with the version you sent me

It seems to hang less frequently though, in the past 2 tests, it seems like it took a lot longer for the hang to happen. I'm going to keep testing, might just be my imagination

Still get a few early on, I'll try to get the api trace from one. In the meantime, here's a shader dump
https://drive.google.com/file/d/19MIcBdoZp6V8PPWstbbQvnJoXwSw3hR6/view?usp=sharing

Just doing some additional testing: with d3d11.zeroWorkgroupMemory = True the hang no longer locks up the entire system, but it still occurs, and the fps takes a big hit.

Just on a hunch, also trying with dxvk.numCompilerThreads = 1 just in case, will report if it crashes with that

I tried it already. It will crash.

Last time I tried using the validation layers I didn't get anything interesting, this time when it froze I received

VUID-vkDestroyCommandPool-commandPool-00041(ERROR / SPEC): msgNum: 0 - Attempt to destroy command pool with command buffer (0x7d41d750) which is in use. The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state. (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyCommandPool-commandPool-00041)
Objects: 1
[0] 0x7d41d750, type: 6, name: NULL
VUID-vkDestroyFence-fence-01120(ERROR / SPEC): msgNum: 0 - Fence 0x23b804 is in use. The Vulkan spec states: All queue submission commands that refer to fence must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroyFence-fence-01120)
Objects: 1
[0] 0x23b804, type: 7, name: NULL

Followed by a flood of

VUID-vkResetDescriptorPool-descriptorPool-00313(ERROR / SPEC): msgNum: 0 - It is invalid to call vkResetDescriptorPool() with descriptor sets in use by a command buffer. The Vulkan spec states: All uses of descriptorPool (via any allocated descriptor sets) must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkResetDescriptorPool-descriptorPool-00313)
Objects: 1
[0] 0xd3, type: 22, name: NULL

Unsure if this is important, but thought I'd report it anyways.

Yep, it does indeed still crash with the same issue. I wonder if there's any extra debugging info we can add to track it down further. I put the API dump in an earlier comment, but it looks like it didn't contain any clues.

@rsw0x that happens because of the error, but doesn't cause it. DXVK still doesn't handle DEVICE_LOST errors properly.

Are device lost errors something recoverable? Or moreover, are they something normal that if handled would allow the game to continue?

Device lost might just mean that the driver crashed.
Most games don't handle this well. It is mostly benchmarks that handle it, because their users are likely to be running unstable overclocks or unstable drivers.

I had this problem previously. Just updated to the latest nvidia-drivers ebuild on Gentoo:
x11-drivers/nvidia-drivers-418.43::gentoo was built with the following: USE="X acpi driver gtk3 kms multilib tools -compat -static-libs -uvm -wayland" ABI_X86="32 (64) (-x32)"

Running this kernel:
Linux localhost 4.20.13-gentoo #2 SMP Thu Feb 28 20:13:14 EST 2019 x86_64 Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz GenuineIntel GNU/Linux

Latest Steam proton beta (3.16-7).

nvidia-smi output:

```
Thu Feb 28 22:09:06 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43 Driver Version: 418.43 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 On | N/A |
| 37% 59C P0 130W / 180W | 3164MiB / 8119MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 970 Off | 00000000:02:00.0 Off | N/A |
| 0% 27C P8 12W / 201W | 1MiB / 4043MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3273 G /usr/libexec/Xorg 291MiB |
| 0 3516 G /usr/bin/gnome-shell 171MiB |
| 0 4169 G ...uest-channel-token=17976434344080092270 51MiB |
| 0 19152 G ...in/.local/share/Steam/ubuntu12_32/steam 33MiB |
| 0 19161 G ./steamwebhelper 3MiB |
| 0 20860 C+G ...ter Hunter World\MonsterHunterWorld.exe 2495MiB |
| 0 22146 G ...quest-channel-token=1348227650674135017 113MiB |
+-----------------------------------------------------------------------------+
```

I've had MHW running for an hour and a half now without any freezes. I'm going to let it run overnight and see if it freezes up, just looking out over the ocean.

If it DOES freeze up eventually, I have two nvidia cards I can test with on identical hardware otherwise. This is running on a GTX 1070, but I also have a GTX 970 I was using previously.

The GTX 970 usually froze up within 45 minutes to an hour of starting.

Looks like it froze just after the 4 hour mark overnight. Same Xid dmesg log as buscher above.

Is there anything I can do to help diagnose further?

latest vulkan driver changelog:

Fixed a bug which could cause the compiler to crash in some Vulkan games

https://developer.nvidia.com/vulkan-driver

Any changes?

I'm not sure whether I misunderstand Nvidia's version numbering or whether I already have that update. My nvidia-smi shows I'm running 418.43, whereas the latest version on that page is 418.42.02. I'll switch to that version and test again overnight; given the ~4 hour run time until it freezes, it pretty much has to be an overnight test for me.

Same result with 418.42.02. Started at 1551910780.835805, froze at 1551931937.174271, just under 6 hours of runtime.
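The runtime follows directly from the two epoch timestamps; a short worked check (the function name `format_runtime` is just for illustration):

```python
def format_runtime(start_epoch, end_epoch):
    """Format the difference between two epoch timestamps as h/m/s."""
    total = int(end_epoch - start_epoch)  # whole seconds elapsed
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h {minutes}m {seconds}s"


# Start and freeze times quoted in the comment above.
print(format_runtime(1551910780.835805, 1551931937.174271))  # 5h 52m 36s
```

The difference is about 21156 seconds, which matches the "just under 6 hours" in the comment.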

I have this problem with Battlefield 1. The system freezes randomly, sometimes after many hours, and I have to kill X to get rid of the frozen image. It happens not only in game, but during loading screens and menus too.
I'm running KDE Neon 18.04 with nvidia-driver-418-418.43, Wine 4.3 and DXVK 1.0.
My dmesg messages are:

[ 2980.508578] NVRM: GPU at PCI:0000:01:00: GPU-a919130d-9a04-dbf1-19c0-c827155af29b
[ 2980.508582] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000054, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x1_5d1ee000. Fault is of type FAULT_PTE ACCESS_TYPE_READ

Let me know if there's more info I can provide.

Something about the Kulve Taroth fight makes it crash much more frequently @_@

Since I updated drivers to 418 and have a new videocard (2080 Ti), I never experienced this infamous crash once (used to have a 1080 GTX and the crash was happening every ~1.5 hours on average).

That's interesting, I'm on the 1080ti, with driver 418.43, I wonder if there's some ray tracing function that is being used that avoids this issue

Can you give us an inxi -b to tell us a bit more about the system?

I don't have any real proof one way or another, but this feels like a memory/handle issue. Is memory fragmentation a thing for GPU memory? Could there be 2 GB of VRAM free but no contiguous chunk large enough for a given texture/shader/etc., causing that allocation to fail and return null?

This would explain why more VRAM seems to let the game last longer before freezing, even though VRAM usage doesn't actually seem to leak.

Hmmm, if that were the case, I'd sort of expect to see a GPU out of memory error despite having free VRAM, or for that to be revealed in the trace I posted

I wish I knew more to help pin down the specifics of the error

Well at least some things do require contiguous memory. See for example https://vulkan.lunarg.com/doc/view/1.0.30.0/linux/vkspec.chunked/ch10s02.html

Note

vkMapMemory will fail if the implementation is unable to allocate an appropriately sized contiguous virtual address range, e.g. due to virtual address space fragmentation or platform limits. In such cases, vkMapMemory must return VK_ERROR_MEMORY_MAP_FAILED. The application can improve the likelihood of success by reducing the size of the mapped range and/or removing unneeded mappings using vkUnmapMemory.

Yeah, if that were the case, I'd expect to see that error rather than a null pointer dereference. We'd get VK_ERROR_MEMORY_MAP_FAILED even though GPU memory usage is not at 100%.

@Xaenalt My understanding is that the game does not support ray tracing (as that is a DX12 feature), and neither do GTX GPUs; you need RTX for that.

Ah, I figured that would be the only explanation if the 2080rtx wasn't crashing while the 1080gtx did

Again, I think I generally have 10.9 GiB of free RAM (the difference from the total of 11.264 is used by the OS/GNOME).
I'm not sure if it's RAM related, but that could be a good hint. Still, with the new drivers I had fewer crashes on my 1080 too (please note it was not a Ti).

Since this MAY be somewhat memory related from your comments above, is https://github.com/doitsujin/dxvk/issues/267#issuecomment-475299958 possibly related? Even tho you are not using PRIME...

I've got 418.43 currently...

Using
nvidia-drivers-418.56
I got no freeze for at least 6h of gameplay.

It may just be a coincidence, but perhaps this changelog entry (for 418.56) is related, even though I am not using PRIME:

  • Fixed a bug which could sometimes cause Vulkan applications to lock up the GPU when freeing large chunks of memory on systems with PRIME enabled.

FTR, I got several crashes and freezes with 418.43 or older, with my GTX 1060 6GB OC. I will report back once I get a freeze.

Still freezing on 418.56 unfortunately.

Played with some friends for ~4 hours last night, then saved to title screen and loaded back in (to force a save). Let it idle overnight.

22:10 was the start time of the MonsterHunterWorld.exe process, froze at 6:18 (8h 8m of runtime).

-- edited to fix weird AM/PM

Still a problem for me as well. Nvidia 418.56, Proton 4.2, Linux Mint 19.1, Nvidia 970. It crashes randomly, anywhere between 1 and 4 hours in, it seems.

Thanks for all your work :green_heart: - let me know how I can help debug this.

I can confidently report that since I moved to a RTX 2080 Ti, the issue has completely gone.
I wonder if it's an Nvidia driver issue which appears with less performant cards (it did happen _frequently_ with my GTX 1080).

How long did you test for? I was able to get 6-8 hours on average before having the issue on my 1070, and my 970 had the issues after 45-50 minutes on average.

Based on increased specs from the 1070 to the 2080 ti, I'd expect to see probably 14-20 hours of runtime before you ran into the issue.

I'm curious to see if the new HD texture pack DLC they released impacts the runtime before encountering the issue, as well, but haven't had a chance to test yet.

I've tested for many sessions, but never managed to have more than 4 hours uninterrupted.
I'm also using HD texture pack.

Just had a freeze after about 2h 15m when using the HD texture DLC, as opposed to 6h 30m normally. There's an in-game VRAM usage estimator now that shows 4.4 GB of VRAM with my normal settings and 6.2 GB with HD textures.

Are people here still experiencing some freeze with current MHW build these days?

Yes. I played a few days ago and had some freezes. I felt like they now happen less frequently, but it's probably just random.
I was so annoyed by this that I wrote a script which detects whether the game has frozen and kills it. I had to, because when the game freezes, everything freezes with it; I can't open a terminal or a tty.
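The script itself was not posted, so the following is not the commenter's code. It is a minimal sketch of one possible watchdog, assuming the freeze always comes with a new `NVRM: Xid` line in the kernel log (as in the reports above) and that the game's PID is known; all function names are hypothetical.

```python
import os
import signal
import subprocess
import time


def count_xid_lines(dmesg_text):
    """Count kernel log lines that report an NVIDIA Xid event."""
    return sum(1 for line in dmesg_text.splitlines() if "NVRM: Xid" in line)


def read_dmesg():
    # Reading the kernel log may need elevated privileges on some systems.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def watch_and_kill(pid, read_log=read_dmesg, interval=5.0,
                   kill=os.kill, sleep=time.sleep):
    """Block until a new Xid line shows up, then SIGKILL pid (like kill -9)."""
    baseline = count_xid_lines(read_log())
    while True:
        sleep(interval)
        if count_xid_lines(read_log()) > baseline:
            kill(pid, signal.SIGKILL)
            return
```

Counting lines is crude (the kernel ring buffer can wrap), and whether a user-space watchdog still gets scheduled during a full desktop freeze depends on the system, so treat this as a starting point only.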

I haven't been experiencing the freeze since I updated from 1080 GTX to 2080 Ti RTX.

Since this problem seems to be related to the Nvidia GPU driver, has anyone reached out to Nvidia?

Sadly, this error is still happening. Interestingly enough, it used to freeze my whole X session and I had to Ctrl-Alt-F2 to get into another shell and kill X. For a few days now it has no longer been freezing X but only the game itself, and I can alt-tab out into another terminal to kill only the MHW process. Sadly, I'm not sure which version update led to that change. Right now I'm using Proton 4.2.3 from Steam with 418.74 as the Nvidia driver on Arch Linux.

I also found this thread by googling the error message: https://forums.geforce.com/default/topic/1096146/linux-monster-hunter-world-drivers-crash/

but there's not much to it.

I play the game a lot, so if I can help debug this please tell me. Would it help to run the game with strace to see which system call is failing? Is there even any way to work around this issue without Nvidia's help?

Reached out to Nvidia to see if they can help debug it

I tried disabling Z-Prepass in the game's graphics settings, which gives me really stable gameplay (but performance is not as good as before).

Update: Ah, this is interesting: after I configured the driver and dxvk according to this, the game can run without freezing even with Z-Prepass enabled. However, this only works with Proton 3.16-4.

@Misairu-G this is only for Optimus right?


@huberb I don't have another system set up, so I can't answer your question. But you're right, I'm using bumblebee and PrimusVK.

The freezes only appear for me when I switch to a driver version higher than 396 or to Proton 4.x. FYI, I'm running Ubuntu 16.04 with the 4.15.0 kernel.

Sadly, it's the same problem with Proton 3.16: random freezes with FAULT_PDE ACCESS_TYPE_READ in dmesg. I'm really hoping Nvidia will eventually do something about this.

I have a GTX 1070

Perhaps this is related to #1169. There is a fix in development by nvidia that might fix that one. When it is available, it might be a good idea to test it here too.

Tested the latest vulkan beta drivers (435.19.03) and the issue persists (freeze after ~3h 50m), so this wasn't fixed by the fix for #1169 unfortunately.

I was running nvidia-smi dmon -s pucm -o DT -i 0 during the testing and this was towards the end of the output, fifth line being where the freeze occurred:

 20190909   16:00:15      0   131    63     -    99    47     0     0  4006  1936  3191     9
 20190909   16:00:16      0   132    63     -    99    47     0     0  4006  1936  3191     9
 20190909   16:00:17      0   131    63     -    99    47     0     0  4006  1936  3191     9
 20190909   16:00:18      0   133    63     -    99    47     0     0  4006  1936  3191     9
 20190909   16:00:19      0    49    62     -     0     0     0     0  4006  1936  3191     9
 20190909   16:00:20      0    49    61     -     0     0     0     0  4006  1936  3191     9
 20190909   16:00:21      0    49    61     -     0     0     0     0  4006  1936  3191     9
 20190909   16:00:22      0    33    60     -     0     1     0     0  3802   999  3191     9
#Date       Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk    fb  bar1
#YYYYMMDD   HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz    MB    MB
 20190909   16:00:23      0    33    59     -     0     1     0     0  3802   999  3191     9
 20190909   16:00:24      0    24    59     -     0     2     0     0   810   797  3191     9
 20190909   16:00:25      0    17    58     -     0     3     0     0   810   797  3191     9
 20190909   16:00:26      0    17    58     -     0     3     0     0   810   797  3191     9
 20190909   16:00:27      0    15    57     -     1     5     0     0   405   227  3191     9
 20190909   16:00:28      0    15    57     -     1     6     0     0   405   227  3191     9

At the start, the line listed sm % as 99 and mem % as 46. Normal usage shows sm % of 0 or 1, and mem % of 1.
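If someone wants to find the freeze point in a long dmon log without scrolling, the following is a rough sketch. It assumes the column order shown in the header above (`-s pucm -o DT`, so sm % is the seventh field) and treats a drop from high sm % to near zero between two consecutive samples as the stall. The function names and the `busy`/`idle` thresholds are arbitrary choices for illustration, not anything nvidia-smi defines.

```python
def parse_dmon(lines):
    """Yield (timestamp, sm_percent) from `nvidia-smi dmon -o DT` output."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and the repeated '#' header lines.
        if not fields or fields[0].startswith("#"):
            continue
        # Assumed column order, taken from the header in the log:
        # date time gpu pwr gtemp mtemp sm mem enc dec mclk pclk fb bar1
        if len(fields) < 7 or not fields[6].isdigit():
            continue
        yield f"{fields[0]} {fields[1]}", int(fields[6])


def find_stall(lines, busy=90, idle=5):
    """Return the timestamp of the first sample where sm % collapses.

    A collapse is a sample with sm <= idle directly after one with
    sm >= busy. Returns None if no such transition is found.
    """
    previous_sm = None
    for timestamp, sm in parse_dmon(lines):
        if previous_sm is not None and previous_sm >= busy and sm <= idle:
            return timestamp
        previous_sm = sm
    return None
```

With the dmon output saved to a file, `find_stall(open("dmon.log"))` would report the first such sample; `dmon.log` is just a placeholder file name. A legitimate drop in load (for example quitting to the desktop) would look the same, so the result still needs to be checked against dmesg.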

Game window froze, but terminal on other monitor continued to output every second up until I tried to alt+tab out, then everything froze. Used an active SSH session to killall -9 MonsterHunterWorld.exe to kill MHW and everything unfroze, with no need to kill Xorg.

dmesg shows the same Xid error:

[2247625.011437] NVRM: GPU at PCI:0000:01:00: GPU-1ec9083a-db9c-1b2f-fd59-87c82dd1c09a
[2247625.011440] NVRM: GPU Board Serial Number: 
[2247625.011442] NVRM: Xid (PCI:0000:01:00): 31, pid=11036, Ch 0000003b, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_5 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
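To compare these Xid lines across reports (for instance to check that the fault address and access type really are always the same), the fields can be pulled out with a regular expression. This is only a sketch: the pattern is derived from the two Xid 31 lines posted in this thread, not from any documented format, so other driver versions may print something different. `parse_xid` is a hypothetical helper name.

```python
import re

# Pattern inferred from the Xid 31 lines in this thread; the "pid=" field
# is optional because the original report does not include it.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>\d+),)?"
    r" Ch (?P<channel>[0-9a-fA-F]+),"
    r" intr (?P<intr>[0-9a-fA-F]+)\."
    r" MMU Fault: ENGINE (?P<engine>\S+) (?P<client>\S+)"
    r" faulted @ (?P<addr>0x[0-9a-fA-F_]+)\."
    r" Fault is of type (?P<fault>\S+) (?P<access>\S+)"
)


def parse_xid(line):
    """Return the fields of an Xid MMU fault line as a dict, or None."""
    match = XID_RE.search(line)
    return match.groupdict() if match else None
```

Running `parse_xid` over every matching line of saved dmesg output and collecting the `addr`, `intr`, and `access` values into sets gives a quick check of which fields vary between freezes.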

If an API Trace or something else can help with diagnosing the cause of this issue, I can generate one provided someone can provide me documentation on how to do so. I have no issue letting the game run overnight to get a valid dump if needed.

Happy new year ;-)

What should we do with this bug? MHW runs generally fine for me. There are a few issues left but none of them are (as far as I can tell) related to DXVK.

Close it?

Agreed - let's close it.
