Describe the bug
Performance regression
ethminer 0.15 - 28MH/s
ethminer 0.16 - 5MH/s
To Reproduce
Compare 0.15 and 0.16 on Fiji hardware.
Expected behaviour
Performance should be on par or better when updating ethminer.
Desktop (please complete the following information):
Additional context
Number of platforms 1
Platform Name AMD Accelerated Parallel Processing
Platform Vendor Advanced Micro Devices, Inc.
Platform Version OpenCL 2.1 AMD-APP (2679.0)
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_icd cl_amd_event_callback
Platform Host timer resolution 1ns
Platform Extensions function suffix AMD
Platform Name AMD Accelerated Parallel Processing
Number of devices 1
Device Name gfx803
Device Vendor Advanced Micro Devices, Inc.
Device Vendor ID 0x1002
Device Version OpenCL 1.2
Driver Version 2679.0 (HSA1.1,LC)
Device OpenCL C Version OpenCL C 2.0
Device Type GPU
Device Board Name (AMD) Fiji [Radeon R9 FURY / NANO Series]
Device Topology (AMD) PCI-E, 09:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 56
SIMD per compute unit (AMD) 4
SIMD width (AMD) 16
SIMD instruction width (AMD) 1
Max clock frequency 1050MHz
Graphics IP (AMD) 8.3
Device Partition (core)
Max number of sub-devices 56
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x1024
Max work group size 256
Preferred work group size (AMD) 256
Max work group size (AMD) 1024
Preferred work group size multiple 64
Wavefront width (AMD) 64
Preferred / native vector sizes
char 4 / 4
short 2 / 2
int 1 / 1
long 1 / 1
half 1 / 1 (cl_khr_fp16)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (cl_khr_fp16)
Denormals No
Infinity and NANs No
Round to nearest No
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 4294967296 (4GiB)
Global free memory (AMD) 4192256 (3.998GiB)
Global memory channels (AMD) 16
Global memory banks per channel (AMD) 4
Global memory bank width (AMD) 256 bytes
Error Correction support No
Max memory allocation 3650722201 (3.4GiB)
Unified memory for Host and Device No
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 16384 (16KiB)
Global Memory cache line size 64 bytes
Image support Yes
Max number of samplers per kernel 29440
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x16384 pixels
Max 3D image size 2048x2048x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Local memory syze per CU (AMD) 65536 (64KiB)
Local memory banks (AMD) 32
Max number of constant args 8
Max constant buffer size 3650722201 (3.4GiB)
Preferred constant buffer size (AMD) 16384 (16KiB)
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Number of P2P devices (AMD) 0
P2P devices (AMD)
Profiling timer resolution 1ns
Profiling timer offset since Epoch (AMD) 0ns (Thu Jan 1 01:00:00 1970)
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Thread trace supported (AMD) No
Number of async queues (AMD) 8
Max real-time compute queues (AMD) 8
Max real-time compute units (AMD) 56
printf() buffer size 4194304 (4MiB)
Built-in kernels
Device Extensions cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [AMD]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)
Platform Name AMD Accelerated Parallel Processing
Device Name gfx803
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name AMD Accelerated Parallel Processing
Device Name gfx803
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name AMD Accelerated Parallel Processing
Device Name gfx803
Is it possible that, with ROCm, the Fiji processors now have an issue with the atomics? Did you check the kernel logs to see if any errors were present? Was ROCm 1.9.x already installed when you ran 0.15, or did you upgrade to it at the same time as 0.16?
No related errors in kern.log
ROCm 1.9 was already installed and has been working well with ethminer 0.15
Same here.
Same here, ROCm 1.9.1, Vega 64, hashrate < 4 MH/s.
Have you tried the --cl-only option?
Tried it just now, but doesn't make a difference. Still terrible performance on 0.16.1 but 0.15 works fine.
cl 18:50:58 cl-0 Platform: AMD Accelerated Parallel Processing
cl 18:50:58 cl-0 Device: gfx803 / OpenCL 1.2
i 18:50:58 cl-0 Adjusting CL work multiplier for 56 CUs.Adjusted work multiplier: 101 945
cl 18:51:00 cl-0 OpenCL kernel
cl 18:51:00 cl-0 Creating light cache buffer, size: 44,250 MB
cl 18:51:00 cl-0 Creating DAG buffer, size: 2,766 GB, free: 1,191 GB
cl 18:51:00 cl-0 Loading kernels
cl 18:51:00 cl-0 Writing light cache buffer
cl 18:51:00 cl-0 Creating buffer for header.
cl 18:51:00 cl-0 Creating mining buffer
m 18:51:03 ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
m 18:51:08 ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
i 18:51:11 cl-0 2,766 GB of DAG data generated in 10 513 ms.
m 18:51:13 ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
m 18:51:18 ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
m 18:51:23 ethminer Speed 0,00 Mh/s gpu0 0,00 [A0] Time: 00:00
m 18:51:28 ethminer Speed 1,97 Mh/s gpu0 1,97 [A0] Time: 00:00
m 18:51:33 ethminer Speed 1,97 Mh/s gpu0 1,97 [A0] Time: 00:00
m 18:51:38 ethminer Speed 1,97 Mh/s gpu0 1,97 [A0] Time: 00:00
m 18:51:43 ethminer Speed 1,97 Mh/s gpu0 1,97 [A0] Time: 00:00
m 18:51:48 ethminer Speed 4,09 Mh/s gpu0 4,09 [A0] Time: 00:00
i 18:51:51 stratum Job: #9bbe974e… eu1.ethermine.org [172.65.207.106:5555]
i 18:51:51 stratum Job: #6a0af543… eu1.ethermine.org [172.65.207.106:5555]
m 18:51:53 ethminer Speed 4,09 Mh/s gpu0 4,09 [A0] Time: 00:01
i 18:51:55 stratum Job: #efa6f38c… eu1.ethermine.org [172.65.207.106:5555]
m 18:51:58 ethminer Speed 4,08 Mh/s gpu0 4,08 [A0] Time: 00:01
m 18:52:03 ethminer Speed 4,08 Mh/s gpu0 4,08 [A0] Time: 00:01
m 18:52:08 ethminer Speed 4,08 Mh/s gpu0 4,08 [A0] Time: 00:01
m 18:52:13 ethminer Speed 4,08 Mh/s gpu0 4,08 [A0] Time: 00:01
m 18:52:18 ethminer Speed 4,09 Mh/s gpu0 4,09 [A0] Time: 00:01
^C m 18:52:23 ethminer Speed 4,09 Mh/s gpu0 4,09 [A0] Time: 00:01
i 18:52:23 ethminer Shutting down...
i 18:52:23 ethminer Shutting down miners...
i 18:52:23 main Disconnected from eu1.ethermine.org [172.65.207.106:5555]
i 18:52:28 ethminer Terminated!
Any ideas @jean-m-cyr @ddobreff ?
I have no idea what fiji is???
I have to plug my old fury and find out.
I did a git bisect hoping it might be helpful to you developers. I'm not 100% confident in the results because there were two revisions which would not build which I just assumed to be bad, but maybe this could be of a little help anyway.
85e433401b08e51111c367f44b241cbd61ae8489 is the first bad commit
commit 85e433401b08e51111c367f44b241cbd61ae8489
Author: AndreaLanfranchi <[email protected]>
Date: Sat Jun 16 13:06:56 2018 +0200
Amend MSVC warning for unreferenced variable
:040000 040000 72a293a9ce1e8f511eeee6f868beea8ae2c0b0ce 8dba80085970044eba9f01a9199587f1401162f4 M libapicore
The API has nothing to do with hashing speed.
You need to investigate the changes in libethash-cl.
@Brisse89 please retest with latest 0.17 and report.
0.16 is quite old now.
Performance is still crippled in 0.17.0-rc.0
git bisect is a good approach to this (I wanted to suggest it), but the https://github.com/ethereum-mining/ethminer/commit/85e433401b08e51111c367f44b241cbd61ae8489 commit is definitely not the problem. Can you point to other candidates?
I did another bisection with a more narrow target based on my previous findings, and this time there were no build errors so I didn't have to assume anything.
8ad03b0a301062b0e3d163b3b387c48c89df0f52 is the first bad commit
commit 8ad03b0a301062b0e3d163b3b387c48c89df0f52
Author: Jean Cyr <[email protected]>
Date: Fri Jul 20 18:42:27 2018 -0400
Binary kernels revisited
- add gooburs addaptation of zawawa binary kernel source
- add pre-compiled binary kernels
- Load binary kernels from INSTALLDIR/kernels
- Copy binary kernels to INSTALLDIR/kernels
- delete redundant cl_finish call, blocking read syncs the loop.
- minor DAG gen optimization
x
:100644 100644 d35b26859d927c3499c82538addf0fb8ccadd3cf cdbc063bce2fc0aa0cf2df8030043f150e6c51a6 M CMakeLists.txt
:040000 040000 50e21d916d552af521d49cd82b3dc86cf482622e 9b7ffa209e59c412a13878e3e83ab341ef40b245 M ethminer
:000000 040000 0000000000000000000000000000000000000000 0d6b4721525ed3fb4ea1ebb6a542e9f06f84c6de A kernels
:040000 040000 fc6b6a8620aa5a723d3afc8487729772760e81fc 4a11f6b6ba702a01f0a787a90db838e8651f31aa M libethash-cl
With this version I see the following errors when running ethminer
X 12:49:36 cl-0 OpenCL init failed: clSetKernelArg: CL_INVALID_ARG_SIZE (-51)
X 12:49:36 cl-0 OpenCL Error: clEnqueueWriteBuffer: CL_INVALID_MEM_OBJECT (-38)
I'm not sure, but these are some candidates between branch points v0.15.0 and v0.16.0 (not verified or tested at all)
branch point - 07feecad0, 8d9674b68, fffc1bb1 - Bump version: 0.16.0rc1 → 0.16.0 (Mon Sep 17 12:54:30 2018 +0200)
commit b6284f58576a1234367c5f75a4689edbfa01309b
~~~
Author: Jean Cyr jean.m.cyr@gmail.com
Date: Tue Jul 31 12:43:31 2018 -0400
Improve opencl hash rate and reduce job switch time to 1 ms.
All credit for this improvement goes to @sukharev. This is
simply a complete implementation including binary kernels.
Runs with very high global work multiplier to improve hash rate.
Reduces job switch time to ~1 ms. Spend more time searching for
solutions rather than stales.
Extend hash smoothing interval for more representative hash rate
~~~
commit 4b63c874750e42b79b976bb319c09c74477fc869
~~~
Author: Jean Cyr jean.m.cyr@gmail.com
Date: Tue Jul 24 12:52:40 2018 -0400
Further minor CL optimizations
Remove constant iteration parameter
~~~
commit 6377efefa506578cab1f7cfaf46ed5f3890ef176
~~~
Author: Jean Cyr jean.m.cyr@gmail.com
Date: Mon Jul 23 15:20:14 2018 -0400
Make changes suggested in review.
~~~
commit 8918770138793b50c8e93adb3c085e3080fce37c
~~~
Author: Jean Cyr jean.m.cyr@gmail.com
Date: Sun Aug 26 13:58:25 2018 -0400
enable hash rate averaging for AMD
Exponential averaging is governed by a constant called alpha.
Setting alpha to 1.0 results in no averaging.
~~~
commit d94896ec5d103d2e52a21499ae32eddec3c3aad6
~~~
Author: Jean Cyr jean.m.cyr@gmail.com
Date: Sat Aug 25 23:47:44 2018 -0400
Smooth AMD hash rate.
...
Reference:
https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
~~~
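For readers unfamiliar with the exponential averaging mentioned in the last two commit messages, here is a minimal Python sketch of the general technique. It is not ethminer's code: the function name `smooth` and the choice to seed the average with the first sample are assumptions made for illustration. It only demonstrates the property stated in the commit message, that alpha = 1.0 results in no averaging.

```python
def smooth(samples, alpha):
    """Exponentially average a sequence of hash-rate samples.

    Hypothetical helper for illustration only, not taken from ethminer.
    alpha is the weight given to the newest sample.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    current = None
    for sample in samples:
        if current is None:
            # Assumption: seed the average with the first sample.
            current = float(sample)
        else:
            # Standard exponential moving average update.
            current = alpha * sample + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    raw = [28.0, 30.0, 26.0, 29.0]
    print(smooth(raw, 1.0))   # alpha = 1.0: output equals the raw samples
    print(smooth(raw, 0.25))  # smaller alpha: reported rate changes more slowly
```

A small alpha also means the displayed rate takes longer to settle after startup, which is worth keeping in mind when comparing short runs.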
Sorry, I'm a bit new to 'git bisect', but I just learned I could use the 'skip' command whenever I encounter an unrelated error. With this newfound knowledge, I've been able to pinpoint the culprit to one of the following commits.
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
ae9585e7fdb1fe9c477f5ac8e87022a05194d610
8cc147b02b9939c03416806353f11ad645ca40be
a720541b930073a58a397185adc8c55f23536e56
37db3f8f936956180335277e8bbed44d1bcf902a
f8516eb25087b1c83410217e06c26883fdd138d3
232b216de810b572981a740330f01ed8c5cc8375
2ce3f1a1b1131277f8d6ba8864048f4b7144f8c7
b5385037fecb26d79b5fd43f0c5973f72eb52969
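Why a bisection with skipped commits ends in a list rather than a single hash can be shown with a small Python sketch. It assumes a linear history ordered oldest to newest and a regression that, once introduced, stays present; the function name and the single-letter commit labels are hypothetical. Everything after the last known-good commit, up to and including the first known-bad one, remains a candidate when the commits in between could not be tested.

```python
def first_bad_candidates(commits, verdicts):
    """Return the commits that could be the first bad one.

    commits:  list of commit ids, oldest first (assumed linear history).
    verdicts: dict mapping commit id -> "good", "bad" or "skip".
    Hypothetical helper for illustration; this is not how git stores state.
    """
    good = [i for i, c in enumerate(commits) if verdicts.get(c) == "good"]
    if not good:
        raise ValueError("need at least one good commit")
    last_good = max(good)
    bad = [
        i for i, c in enumerate(commits)
        if verdicts.get(c) == "bad" and i > last_good
    ]
    if not bad:
        raise ValueError("need a bad commit after the last good one")
    first_bad = min(bad)
    # Untestable commits between the two bounds cannot be ruled out.
    return commits[last_good + 1 : first_bad + 1]


if __name__ == "__main__":
    history = ["a", "b", "c", "d", "e"]
    results = {"a": "good", "b": "skip", "c": "skip", "d": "bad", "e": "bad"}
    print(first_bad_candidates(history, results))
```

Narrowing such a list further requires making at least one of the skipped commits buildable, for example by cherry-picking the later build fix onto it before testing.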
This seems to be related to the introduction of binary kernels for AMD. Since there is no binary kernel for fiji, it would run the opencl kernel. I've no idea why the zawawa opencl kernel would be so much slower on fiji??? Might have something to do with the early abort logic.
@Brisse89 I have a simple fix: just remove "volatile" from the kernel file ethminer/libethash-cl/kernels/cl/ethash.cl. For some reason the updated kernel introduced in v0.16 declares a bunch of variables as volatile even though they are private. The output is also volatile, but it's updated using atomic_inc, which should ensure coherency among the threads. From my testing, this only happens on the ROCm stack.
Volatiles removed by pr #1737
Tried this with mixed results on 480 depending on BIOS version.
Given that I can't explain these results, I've decided not to proceed.
@x3ccd4828 @jean-m-cyr Thanks, this does restore and maybe even improve performance on the Fiji. I'm now seeing ~31MH/s @ 200W, which is more than I've ever seen before. Ethminer 0.15 on ROCm yielded ~28MH/s @ 190W, or ~29MH/s @ 190W using AMD's legacy (aka Orca) OpenCL driver. Sorry to hear about regressions for other GPUs. Obviously I understand why it can't be merged.
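To put the three Fiji figures above on a common footing, the following Python sketch computes hash rate per watt. The numbers are the approximate values quoted in the comment; the dictionary labels and the `efficiency` helper are only illustrative names.

```python
# Approximate (MH/s, W) pairs as quoted in the comment above.
measurements = {
    "0.15 on ROCm": (28.0, 190.0),
    "0.15 on legacy (Orca) OpenCL": (29.0, 190.0),
    "volatile removed, ROCm": (31.0, 200.0),
}


def efficiency(mh_per_s, watts):
    """Hash rate delivered per watt of power draw."""
    return mh_per_s / watts


if __name__ == "__main__":
    for label, (rate, power) in measurements.items():
        print(f"{label}: {efficiency(rate, power):.3f} MH/s per W")
```

Since the inputs are rounded readings, small differences between the results should not be over-interpreted.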
@jean-m-cyr what driver are you using for the 480? I have tested a Rx580 (mining bios) with amdgpu-pro 18.40 as well as rocm 1.9.
It's been a while since I installed amd-gpu-pro. Don't remember which version... Any way I can tell?
On Ubuntu you can just check the installed package version: sudo apt list "amdgpu*"
jcyr@miner1:~/ethminer/build$ sudo apt list "amdgpu*"
Listing... Done
amdgpu-pro/unknown 17.40-492261 i386
amdgpu-pro-core/unknown,now 17.40-492261 all [installed,automatic]
amdgpu-pro-dkms/unknown,now 17.40-492261 all [installed]
amdgpu-pro-lib32/unknown,now 17.40-492261 amd64 [installed]
With the current 480 mining BIOS running opencl (--cl-nobin) I get about 28MH/s; with the same opencl (with volatiles removed) I get 25MH/s. Strange!!! I can't think of why that would be.
Binary kernels run at 29.5MH/s on same config.
I would recommend updating the driver to either 18.40 or the ROCm stack. I think 17.40 was the old beta blockchain driver. I haven't tested the old 17.40 driver.
It won't compile on 18.10+. Also, ROCm requires PCIe atomics (PCIe 3.0) for Polaris.
@ddobreff Do you mean ROCm is not compiling? No need to build from source. AMD's pre compiled release for Xenial works fine for me on Debian Sid and most likely does on Ubuntu 18.10 as well. All you need is kernel > 4.17 which has the necessary kernel components up-streamed and Ubuntu 18.10 ships with 4.18 so that should be no problem. Install rocm-opencl instead of rocm-dkms since the latter is not needed on Linux > 4.17.
ROCm opencl for non Vega requires PCIe atomics on PCIe 3.0 compliant slot.
OpenCL legacy and PAL compilers are broken after 18.10+ versions, they produce invalid asm so it will not compile properly leading to non working kernel.
It should be possible to conditionally compile without volatile for older GPUs, right?
volatile should not be required for ANY gpu! Pls. don't take my odd results as a reason not to make this change. It would not affect 480/580 GPUs who would naturally be using binary kernels anyway...
cl 16:53:45 cl-0 Using PciId : 01:00.0 Ellesmere OpenCL 1.2 AMD-APP (2482.3) Memory : 3.99 GB
stock ethash.cl
m 16:54:45 ethminer Speed 28.93 Mh/s gpu0 28.93 Time: 00:01
ethash.cl with volatile removed
m 16:58:01 ethminer Speed 25.02 Mh/s gpu0 25.02 Time: 00:01
???
Weird, but overall the situation is still better because this regression is much less severe than the one affecting the Fiji and Vega, and like you said, Polaris would be using the binary kernels anyway so in reality they would not encounter the regression.
@ddobreff I believe you mentioned it makes no diff on Vega?
Yes, there was absolutely no difference in Vega on standard amdgpu-pro opencl(18.10 compiler).
EDIT: volatile removing should be ok.
There was a comment above suggesting that Vega on ROCm was affected. No confirmation on whether removing volatile fixed it in that case though.
@uentity Would be great if you could test the Vega and report your hashrate.
I have a similar issue. When running the benchmark I get something like this
ethminer/ethminer -M 2 -G
ethminer 0.18.0-alpha.1-18+commit.8294506b.dirty
Build: linux/release/gnu
m 11:22:26 ethminer Benchmarking on platform: CL Preparing DAG for block #2
cl 11:22:26 cl-0 Using PciId : 45:00.0 gfx900 OpenCL 1.2 Memory : 15.98 GB
i 11:22:26 cl-0 Adjusting CL work multiplier for 64 CUs.Adjusted work multiplier: 116509
m 11:22:26 ethminer Warming up...
cl 11:22:26 cl-0 Generating DAG + Light : 1.02 GB
cl 11:22:26 cl-0 OpenCL kernel
cl 11:22:26 cl-0 Loading binary kernel /home/user_name/mine/ethminer/build/ethminer/kernels/ethash_gfx900_lws192.bin
X 11:22:26 cl-0 Failed to load binary kernel: /home/user_name/mine/ethminer/build/ethminer/kernels/ethash_gfx900_lws192.bin
X 11:22:26 cl-0 Falling back to OpenCL kernel...
cl 11:22:26 cl-0 Creating light cache buffer, size: 16.00 MB
cl 11:22:26 cl-0 Creating DAG buffer, size: 1024.00 MB, free: 14.97 GB
cl 11:22:27 cl-0 Loading kernels
cl 11:22:27 cl-0 Writing light cache buffer
cl 11:22:27 cl-0 Creating buffer for header.
cl 11:22:27 cl-0 Creating mining buffer
cl 11:22:28 cl-0 1024.00 MB of DAG data generated in 2033 ms.
m 11:22:41 ethminer Trial 1...
m 11:22:44 ethminer Hashes per second 3611658
m 11:22:44 ethminer Trial 2...
m 11:22:47 ethminer Hashes per second 3598562
m 11:22:47 ethminer Trial 3...
m 11:22:50 ethminer Hashes per second 3598562
m 11:22:50 ethminer Trial 4...
m 11:22:53 ethminer Hashes per second 3614558
m 11:22:53 ethminer Trial 5...
m 11:22:56 ethminer Hashes per second 3618185
m 11:22:56 ethminer min/mean/max: 3598562/3608305/3618185 H/s
m 11:22:56 ethminer inner mean: 3608259 H/s
In my case I have a Vega FE that is listed as gfx900, which doesn't match the available binary kernels here https://github.com/ethereum-mining/ethminer/tree/master/libethash-cl/kernels/bin I tried simply renaming them to gfx900, but that causes an ELF loading error instead, so I guess that won't work.
Using 0.15 seems to work as expected (getting ~35 MH/s), as others have reported.
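The summary lines of the benchmark output above can be reproduced from the five trial values. The Python sketch below assumes that "inner mean" is the mean after dropping the single lowest and single highest trial, and that results are truncated to whole hashes per second. Both assumptions are inferred from the printed numbers, not from ethminer's source.

```python
def summarize(trials):
    """Return (min, mean, max, inner_mean) for a list of H/s readings.

    Assumption: inner mean = mean with the lowest and highest trial
    removed; means are truncated to integers to match the log output.
    """
    if len(trials) < 3:
        raise ValueError("need at least three trials for an inner mean")
    ordered = sorted(trials)
    mean = sum(ordered) // len(ordered)
    inner = ordered[1:-1]  # drop one minimum and one maximum
    inner_mean = sum(inner) // len(inner)
    return ordered[0], mean, ordered[-1], inner_mean


if __name__ == "__main__":
    # Trial values copied from the benchmark log above.
    trials = [3611658, 3598562, 3598562, 3614558, 3618185]
    lo, mean, hi, inner_mean = summarize(trials)
    print(f"min/mean/max: {lo}/{mean}/{hi} H/s")
    print(f"inner mean: {inner_mean} H/s")
```

All four values land around 3.6 MH/s, so the spread between trials is negligible compared with the gap to the ~35 MH/s reported for 0.15.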
Can we finish this?
Can it look like
#if __vega__
volatile
#endif
I would suggest just removing volatile, the performance degradation on polaris should not really matter, since it uses binary anyways.
@Brisse89 please re run your tests using additional CLI argument --cl-local-work 128 and report
@AndreaLanfranchi Seems to be working fine. 31MH/s.
I'm backporting the fix to 0.16 and 0.17