Ethminer: Maximum number of CUDA Devices is 16?

Created on 20 Dec 2017 · 10 Comments · Source: ethereum-mining/ethminer

I have a computer with 18 CUDA cards, but it seems that there is a limit to the maximum number of devices, hard coded at 16. I thought that I had identified where in the code this was set, and I attempted to fix it. I am, unfortunately, rather unfamiliar with this code, and my fix was not sufficient to fix the issue.

All 10 comments

You can try altering the line here:
https://github.com/ethereum-mining/ethminer/blob/master/libethash-cuda/CUDAMiner.cpp#L90

To something like:

int CUDAMiner::s_devices[18] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 };

Not sure if that's enough. It's possible that this is a driver restriction. Does the system show all 18 cards?

smurfy, that line is one that I had found and modified, but it is not sufficient. Somewhere else in the code, that I can't find, the limit exists as well. It is not a driver issue as all 18 cards show up in the system. Right now I can work around the issue by running two separate instances of ethminer, each with fewer than 16 devices and each device assigned to a specific instance. It is not ideal, but it works for now.

In any case, I think that hard coding the limit to 16 is not the best practice. The code could be written to dynamically support the number of devices detected. Below are listed the lines that I have found that may have something to do with this limit:

libethash-cuda/CUDAMiner.cpp:
90: int CUDAMiner::s_devices[16] = { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1 };
184: startN = w.startNonce | ((uint64_t)index << (64 - 4 - w.exSizeBits)); // this can support up to 16 devices

libethash-cuda/CUDAMiner.h:
16: static int s_devices[16];

I don't know how line 184 comes into play in the 16 device limit, but changing lines 90 and 16 is not enough to increase the limit.
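The suggestion above, sizing the device table from the number of detected devices instead of hard coding 16, could look roughly like the following. This is only a sketch: `makeDeviceTable` is a hypothetical helper, not ethminer code, and the count is passed in as a plain argument (in a real build it would presumably come from something like the CUDA runtime call `cudaGetDeviceCount`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical replacement for the fixed `static int s_devices[16]` array:
// one slot per detected device, each initialised to -1 as in the original
// initializer list.
std::vector<int> makeDeviceTable(int detectedDevices)
{
    // Treat a negative count (e.g. a failed query) as "no devices".
    if (detectedDevices < 0)
        detectedDevices = 0;
    return std::vector<int>(static_cast<std::size_t>(detectedDevices), -1);
}
```

For example, `makeDeviceTable(18)` returns 18 slots, all set to -1. As noted above, resizing the table alone does not lift the limit if the nonce allocation still reserves only 4 bits for the device index.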

Line 184 is what you need to look at, I believe.

This is bit-field manipulation. I believe it works as follows...

The "up to 16 devices" limit is a consequence of the term "- 4" in the expression setting the bits of startN, a uint64_t (an unsigned integer exactly 64 bits wide).

Consider the 16th miner (index 15).

(uint64_t)index will be:
0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 1111
In the case where w.exSizeBits = 0, ((uint64_t)index << (64 - 4 - w.exSizeBits)) will be:
0b 1111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
i.e. just shifting everything 60 bits left.

Consider the 17th miner (index 16).

(uint64_t)index will be:
0b 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0000
In the case where w.exSizeBits = 0, ((uint64_t)index << (64 - 4 - w.exSizeBits)) will be:
0b 1 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
which overflows the uint64_t: the set bit is shifted out, the result is 0, and the 17th miner ends up with the same start nonce as the 1st.

Changing the "- 4" to "- 5" would allow handling up to 32 miners (and to "- 6", up to 64 miners, etc.).
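The shift arithmetic described above can be checked at compile time with a small sketch. The function below is a hypothetical stand-alone version of the expression from line 184, with the hard-coded 4 turned into a `deviceBits` parameter; the names and signature are assumptions for illustration, not ethminer's API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring
//   startN = w.startNonce | ((uint64_t)index << (64 - 4 - w.exSizeBits));
// with the literal 4 replaced by deviceBits.
// Assumes deviceBits + exSizeBits is between 1 and 64, so the shift
// amount stays below 64.
constexpr std::uint64_t startNonce(std::uint64_t poolStartNonce, unsigned index,
                                   unsigned deviceBits, unsigned exSizeBits)
{
    return poolStartNonce |
           (static_cast<std::uint64_t>(index) << (64 - deviceBits - exSizeBits));
}

// 4 device bits: index 15 fills the top four bits of the nonce...
static_assert(startNonce(0, 15, 4, 0) == 0xF000000000000000ULL,
              "16th miner uses the top 4 bits");
// ...but index 16 is shifted out entirely and collides with index 0.
static_assert(startNonce(0, 16, 4, 0) == startNonce(0, 0, 4, 0),
              "17th miner collides with the 1st");

// 5 device bits: index 16 now fits (2^4 << 59 == 2^63).
static_assert(startNonce(0, 16, 5, 0) == 0x8000000000000000ULL,
              "17th miner gets its own prefix");
static_assert(startNonce(0, 17, 5, 0) != startNonce(0, 1, 5, 0),
              "18th miner no longer collides with the 2nd");
```

Each extra device bit is taken from the per-device nonce range, so doubling the number of supported devices halves the range each device can search.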

I have a computer with 18 CUDA cards

Could you share a picture of this monster with us?

@Abattia Note that there are not many pools that use the STRATUM_PROTOCOL_ETHEREUMSTRATUM version of Stratum. Only those that do will ever use the shift expression shown above. In all other cases, such as at Ethermine or Nanopool, exSizeBits will be < 0 and a different startNonce allocator is used that is not limited to 16 devices.

I'm referring to the current HEAD code.

@jean-m-cyr

In all other cases, such as at Ethermine or Nanopool, exSizeBits will be < 0 and a different startNonce allocator is used that is not limited to 16 devices.

Looks like it is limited to 2^24 (i.e. over 16 million) miners?

uint64_t get_start_nonce()
{
    // Each GPU is given a non-overlapping 2^40 range to search
    return farm.get_nonce_scrambler() + ((uint64_t) index << 40);
}

Should be enough ...
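To see that the allocator quoted above really hands out non-overlapping ranges, here is a minimal sketch that redoes the same arithmetic outside the class. The scrambler and index are passed in as plain arguments, and the helper names are made up for this example; only the `scrambler + (index << 40)` expression comes from the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Width of each per-GPU range, as in the comment in get_start_nonce().
const unsigned kRangeBits = 40;

// Same arithmetic as get_start_nonce(), with the scrambler and device index
// supplied by the caller (assumption, to keep the sketch self-contained).
std::uint64_t startNonceFor(std::uint64_t scrambler, std::uint64_t index)
{
    return scrambler + (index << kRangeBits);
}

// Two 2^40-wide ranges are disjoint if their start points are at least
// 2^40 apart in both directions, measured modulo 2^64.
bool rangesDisjoint(std::uint64_t scrambler, std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t width = std::uint64_t(1) << kRangeBits;
    const std::uint64_t sa = startNonceFor(scrambler, a);
    const std::uint64_t sb = startNonceFor(scrambler, b);
    return (sb - sa) >= width && (sa - sb) >= width;
}

// Checks every pair of devices in [0, deviceCount).
bool allDisjoint(std::uint64_t scrambler, unsigned deviceCount)
{
    for (unsigned a = 0; a < deviceCount; ++a)
        for (unsigned b = a + 1; b < deviceCount; ++b)
            if (!rangesDisjoint(scrambler, a, b))
                return false;
    return true;
}
```

Calling `allDisjoint(someScrambler, 18)` returns true for any scrambler value, because unsigned addition wraps modulo 2^64 and shifts all start points by the same amount.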

@Abattia Yes, it should be enough. It doesn't matter how far apart the ranges are as long as they don't overlap. I chose an arbitrarily large shift, so we don't need to worry about overlaps anymore. It will be a while before we see a GPU that can do 2^40 hashes per job.

In fact, some kernel implementations will only roll the lower 32 bits of the nonce and are limited to 2^32 nonces per job. More than that and they start retesting the same nonces!
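The 2^32 limit mentioned above can be illustrated with a short sketch. It assumes a kernel that keeps the upper 32 bits of the start nonce fixed and only increments the lower 32 bits; `rolledNonce` is a hypothetical function written for this example and is not taken from any ethminer kernel.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a kernel that rolls only the lower 32 bits:
// the upper half of the start nonce is kept, and the step counter is
// added to the lower half with 32-bit wraparound.
constexpr std::uint64_t rolledNonce(std::uint64_t startNonce, std::uint64_t step)
{
    return (startNonce & 0xFFFFFFFF00000000ULL) |
           static_cast<std::uint32_t>(static_cast<std::uint32_t>(startNonce) +
                                      static_cast<std::uint32_t>(step));
}

// After exactly 2^32 steps the kernel is back at its first nonce...
static_assert(rolledNonce(0x0000ABCD00000000ULL, 1ULL << 32) ==
                  rolledNonce(0x0000ABCD00000000ULL, 0),
              "step 2^32 retests the first nonce");
// ...and every later step repeats one that was already tested.
static_assert(rolledNonce(0x0000ABCD00000000ULL, (1ULL << 32) + 5) ==
                  rolledNonce(0x0000ABCD00000000ULL, 5),
              "step 2^32 + 5 retests step 5");
```

Under this assumption a 2^40-wide range per GPU is far more than such a kernel can ever cover in one job, which is why the exact size of the shift does not matter as long as the ranges stay disjoint.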

@Abattia There is also no reason not to change the other start nonce allocation algorithm to accommodate more GPUs. I'm working on a generic patch to expand to 32 GPU support.

@jean-m-cyr Your solution was very nice and elegant. I've run it on my 18-card machine and it is running wonderfully. Thank you all for looking into this issue.

@IndigoDEZ You're welcome
