Hi,
I have some E5-4627 v2 CPUs with a 3.3 GHz base clock, but oddly enough they don't run as fast as the E5-2690 v1 at 2.9 GHz. I am trying to figure out if there is any way to improve the performance.
The processor cache specs look similar but there is a difference:
Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz 20MB Cache
64-byte Prefetching
Data TLB0: 2-MB or 4-MB pages, 4-way set associative, 32 entries
Data TLB: 4-KB Pages, 4-way set associative, 64 entries
Instruction TLB: 4-KB pages, 4-way set associative, 64 entries
L2 TLB: 1-MB, 4-way set associative, 64-byte line size
Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
L3: Associativity: 20-way set associative
Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz 16MB Cache
64-byte Prefetching
Data TLB: 1-GB pages, 4-way set associative, 4 entries
Data TLB: 4-KB Pages, 4-way set associative, 64 entries
Instruction TLB: 4-KB Pages, 4-way set associative, 128 entries
L2 TLB: 1-MB, 4-way set associative, 64-byte line size
Shared 2nd-level TLB: 4 KB pages, 4-way set associative, 512 entries
L3: Associativity: 16-way set associative
The E5-4627 v2 also includes F16C instructions. Are they used by XMRig?
Can someone offer any tuning suggestions?
Thanks!
E5-2690 has 20 MB L3 cache and 8 cores with SMT support. So it is able to start 10 RandomX threads.
E5-4627 v2 has only 16 MB L3 cache and 8 cores without SMT support. That's why only 8 threads can be started.
On paper the E5-4627 v2 has the higher base frequency, 3300 MHz instead of 2900 MHz. But the E5-2690 has several turbo states, so it can boost to 3300 MHz as well, even with all cores loaded, as long as the TDP, current, and thermal limits are not exceeded.
In terms of architecture, Ivy Bridge (E5-4627 v2) and Sandy Bridge (E5-2690) are nearly equal, so the newer CPU has no IPC advantage. Only the manufacturing process is smaller (22 nm vs. 32 nm).
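The thread counts above can be sketched as simple arithmetic. This is only an illustration, assuming roughly 2 MB of L3 per RandomX thread (the figure used in this thread); max_threads is a made-up helper name, not an XMRig command.

```bash
# Hypothetical helper: usable RandomX threads given L3 size (MB) and
# the number of logical CPUs. Assumes about 2 MB of L3 per thread.
max_threads() {
  local l3_mb=$1 logical_cpus=$2
  local by_cache=$(( l3_mb / 2 ))
  # The limit is whichever runs out first: L3 slices or logical CPUs.
  if [ "$by_cache" -lt "$logical_cpus" ]; then
    echo "$by_cache"
  else
    echo "$logical_cpus"
  fi
}

max_threads 20 16   # E5-2690: 20 MB L3, 8 cores with SMT
max_threads 16 8    # E5-4627 v2: 16 MB L3, 8 cores, no SMT
```

With SMT disabled on the E5-2690 the second argument drops to 8, so the core count rather than the cache becomes the limit.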
Hi Lonnegan and thank you for replying to my obscure post.
I understand that XMRig tries to lock 2 MB per thread per core, so I don't completely understand the bottleneck here, as 2 MB x 8 = 16 MB. In my mind that condition is satisfied. There is also SMT support on the 2690, which is disabled per the developers' recommendation. Both chips have hyperthreading support disabled, as well as all the cache options in BIOS. When you talk about SMT, are you speaking about hyperthreading? Since that is intended to be disabled even the chips cannot be equal despite base -> max clock frequency.
When I look at the chip designs on ARK, the 4627 v2 seems to be the better chip for single-threaded operation, and when I bought them I assumed they could beat my 2690s, since I am asking them to do single-threaded operation (HT/SMT disabled). I still don't completely understand why the 2690 v1 is faster than the 4627 v2, other than its 20 MB L3 cache size?
Thanks for trying to help.
Jeff
I'm interested to hear what's going on here.
On both CPUs you should run eight mining threads because of the 256 KB per-core L2 cache.
Do you use the same memory and all memory channels on both systems?
Yes the systems are identical. I turn off hyperthreading in bios and turn off all caching options.
Can you post a screenshot or the output of the start of XMRIG for both systems?
Yes sure, here is the 2690 system start up log:
sudo ./start.sh
4627 V2 Startup log:
You probably already checked if all cores are mining so I guess another thing to check is the speed at which the cores are running?
watch -n 2 'grep "^cpu MHz" /proc/cpuinfo'
All cores of the E5-4627 v2 should run at 3.5 GHz.
All cores of the E5-2690 should run at 3.3 GHz.
If everything runs at the advertised speed there are two other things I can think of.
One is the bus speed: the 2690 has a bus speed of 8.0 GT/s while the 4627 has a bus speed of 7.2 GT/s. Maybe @SChernykh knows if this might be a bottleneck.
And the other thing is maybe the E5-4627 V2s you have are not production chips? The spec number on the chips should be SR1AD.
Hi Bill Gates the third.
Yes, these are production chips.
That clock speed command doesn't work in Ubuntu. However, knowing what you are asking I found a different command for this:
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz
Stepping: 4
CPU MHz: 3500.255
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 6600.48
Virtualization: VT-x
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 32 MiB
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
3500256
3500254
3500256
3500256
3500256
3500255
3500255
3500256
3500255
3500257
3500256
3500255
3500257
3500254
3500256
3500256
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Stepping: 7
CPU MHz: 3300.241
CPU max MHz: 3800.0000
CPU min MHz: 1200.0000
BogoMIPS: 5800.42
Virtualization: VT-x
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 40 MiB
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
3300241
3300240
3300241
3300241
3300241
3300240
3300242
3300241
3300241
3300240
3300240
3300241
3300241
3300241
3300241
3300242
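The scaling_cur_freq values above are in kHz. As a rough sketch, a small filter can condense such a list into min, max and average in MHz; summarize_khz is a hypothetical helper name, and the sample values are copied from the first listing above.

```bash
# Hypothetical helper: reads kHz values (one per line) on stdin and
# prints the minimum, maximum and mean, truncated to whole MHz.
summarize_khz() {
  awk 'NR == 1 { min = $1; max = $1 }
       { if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1 }
       END { if (NR) printf "min=%d max=%d avg=%d MHz\n", min / 1000, max / 1000, sum / NR / 1000 }'
}

# Three sample readings taken from the E5-4627 v2 listing.
printf '%s\n' 3500256 3500254 3500257 | summarize_khz
```

On a live system the same filter could be fed directly: cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq | summarize_khz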
Don't know how deep you wanna dive into this but next step I would suggest is to configure XMRIG to mine with one core only and compare the speeds. If the E5-2690 is still faster than the E5-4627v2 I hope one of the developers can explain why this is.
I’d like to test further, as the performance doesn’t seem right to me. I can give access to one of these systems if anyone wants to help me debug or tune. I am running xmrig from the command line; how do I tell it to mine using just 1 core? Perhaps it is a good idea to check single-core performance?
@agentpatience You can check the command line arguments here: Command Line Arguments. It's --threads=N, where N is the number of threads the miner uses for mining. Set it to 1 and it will automatically set the affinity as well, so it will mine on 1 core only.
Single Thread E5 4627 V2 - 452.3 H/s
Single Thread E5 2690 - 602 H/s
Can you share CPU clock speeds while the miner is running on those cores?
One more thing you can do to double-check is to set --cpu-affinity manually. By default, when you run the miner on 1 thread, it should use affinity 0x1. If you have HT enabled, set it to something like 0x4 to run on the second core; if HT is disabled, use 0x2 to run on the second core, and see if the hashrate is still the same as on core 1.
You can calculate affinity HERE.
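For reference, the mask is just one bit per logical CPU, so it can also be computed locally. A minimal sketch, assuming zero-based logical CPU indices; affinity_mask is a hypothetical helper, not part of XMRig.

```bash
# Hypothetical helper: prints the hex affinity mask with one bit set for
# each zero-based logical CPU index passed as an argument.
affinity_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '0x%X\n' "$mask"
}

affinity_mask 0               # first logical CPU
affinity_mask 1               # second logical CPU
affinity_mask $(seq 0 15)     # sixteen logical CPUs
```

The result is what gets passed to --cpu-affinity. Which logical CPU index belongs to which physical core depends on how the OS enumerates them, so that mapping is left to the reader.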
Hi Aria,
E5-2690 --threads=1
sudo ./start.sh
Ok, now when I set affinity to 0x1 I gained 100 H/s on the 2690 running core:
E5-2690 --threads=1 --cpu-affinity 0x1
sudo ./start.sh
[sudo] password for jeff:
jeff@miner9:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
3796278
3470581
1200193
3464265
1199947
3575880
1200167
1200195
3471951
1199867
3588141
1199793
3461349
1199953
3553871
1200037
0x1 and 0x2 show the same gains. I have HT turned off in BIOS.
E5 4627 v2 --threads=1 --cpu-affinity=0x1:
gcc/9.3.0
Clocks:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
3584517
3504796
1200229
3500298
1200354
3503598
1200291
1200527
3499941
1200393
3497419
1200107
3498735
1200516
3501512
1200089
OK, run the miner on whatever number of threads you want, set the affinity according to that thread count, and see if you gain any more performance. As I saw, single-thread mining is now better on the E5 4627 v2 compared to the E5 2690 0. But your 2690 0 boosts higher, to 3.8 GHz as it should, while the 4627 v2 hovers around 3.5-3.6 GHz boost. So you should get more hashrate with higher clocks. I'm not sure, but I think RAM frequency has some effect on the hashrate as well. Even the number of RAM channels can possibly have some effect. OS configuration plays a huge part in your hashrate as well. So if you can, do some clean installs and test with the exact same settings (in VM or in BIOS). And try the pre-compiled binaries and compare them to your own compiled binaries (I don't think this will affect the hashrate that much, but it is worth checking).
It's a long shot but I guess you'll have to test with two threads, three threads and so on to see when the E5 4627 v2 becomes slower.
The turbo bins (or whatever they are called) for these chips are:
E5-4627 v2: 2/2/2/2/2/2/2/3 so this one should run at 3.3 + 0.2 = 3.5 GHz running two or more threads and 3.6 GHz running one thread.
E5-2690: 4/4/4/5/5/7/7/9 so this one should run at 2.9 + 0.4 = 3.3 GHz running six or more threads, 3.4 GHz running four or five threads, 3.6 GHz running two or three threads and 3.8 GHz running one thread.
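The bin arithmetic above can be written down as a short sketch. It assumes one bin equals 100 MHz and that the list runs from all cores active (first entry) down to one core active (last entry), which is how the figures above were derived; turbo_mhz is a hypothetical helper name.

```bash
# Hypothetical helper. Arguments: base clock in MHz, the turbo bins as
# written above, and the number of active cores.
# Assumes one bin = 100 MHz.
turbo_mhz() {
  local base=$1 bins=$2 active=$3
  local total bin
  total=$(echo "$bins" | tr '/' '\n' | wc -l)
  # With N bins, field 1 applies to N active cores and field N to 1 core.
  bin=$(echo "$bins" | cut -d/ -f"$(( total - active + 1 ))")
  echo $(( base + bin * 100 ))
}

turbo_mhz 3300 2/2/2/2/2/2/2/3 1   # E5-4627 v2, one active core
turbo_mhz 2900 4/4/4/5/5/7/7/9 1   # E5-2690, one active core
turbo_mhz 2900 4/4/4/5/5/7/7/9 8   # E5-2690, all eight cores active
```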
So even with a lower processor frequency, the E5-4627 V2 is faster when mining with one thread. So something somewhere somehow changes when running more threads :)
4627 v2
10 Threads = 5789
12 Threads = 6504
14 Threads = 6693
16 Threads = 6618 --threads=16 --cpu-affinity=0x000000000000FFFF ?????
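To make the scaling easier to see, the totals above can be divided by the thread count. A small sketch, with the four rows copied from the list above and per_thread as a hypothetical helper name:

```bash
# Hypothetical helper: reads "threads total_hashrate" pairs on stdin and
# prints the hashrate per thread for each row.
per_thread() {
  awk '{ printf "%d threads: %.1f per thread\n", $1, $2 / $1 }'
}

# Rows copied from the 4627 v2 results above.
printf '%s\n' '10 5789' '12 6504' '14 6693' '16 6618' | per_thread
```

A per-thread figure that keeps falling as threads are added would suggest the threads are competing for a shared resource rather than scaling independently.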
So hashrate becomes less when going from 14 to 16 threads? And 15 threads? Better than 14 or not?
I have no clue what's going on here.
Yea, it doesn't scale the same way after 12 threads. I am not sure at this point either; it may have something to do with "Smart Caching", which these Xeons do not carry.
Both processors do have Intel Smart Cache, but the older 2690 has 20 MB and the newer 4627 v2 has 16 MB. In both processors it is used for the level 3 (last-level) cache.
Now, because RandomX uses 2 MB of level 3 cache for each thread, my last guess is that the 2690 will have a better cache hit rate when all cores are mining.
I found this document explaining it; the pictures in it are with L1 private cache and L2 shared cache but I assume it works the same for L2 private and L3 shared cache.
Software Techniques for Shared-Cache Multi-Core Systems
The article also explains some software techniques for using this Smart Cache effectively, but my C and assembly are not nearly good enough to figure out whether this is used in xmrig, or whether it is even beneficial.
I really hope one of the developers can shed some light on this.
I stand corrected about Smart Cache. It was not apparent to me. I will play around with memory configurations this week.