Picongpu: Warning: ABI change for 32-byte alignment

Created on 20 Aug 2016 · 26Comments · Source: ComputationalRadiationPhysics/picongpu

When compiling picongpu with CUDA 7.0 there is a warning _src/picongpu/src/picongpu/include/particles/ionization/ionization.hpp:63:1: note: The ABI for passing parameters with 32-byte alignment has changed in_
_GCC 4.6_

related to #36

Possible way of fixing this: passing non-native types by const reference instead of by copy.

update:
The warning only occurs when an ionization model is used which requires a random number generator.
It started occuring when I replaced the old version with the one written by @Flamefire. Maybe sth is not correctly aligned in the struct that contains the implementation of the ionization model. Also the order of typedefs and member definitions plays a role in that.

core warning

Source

n01r

All 26 comments

Passing by const-ref is not possible here as this is a kernel. If this happens when a RNG is used (try it by replacing the RNG in your model by a dummy functor with no members) then that could be the cause. The RNG state is not aligned via PMACC_ALIGN as it is also used in buffer and alignment there is not required and would introduce quite some space overhead to the already big buffer. So a possible fix (if that is the problem) would be to align the state in the random functor or the functor itself.

Flamefire on 22 Aug 2016

Yes, even though it points to the kernel call I think the problem is not there but deeper. I'll try what you suggested.

n01r on 23 Aug 2016

So far I found out that before I made use of the new RNG the warning does not occur. Directly after the implementation it does. Replacing the RNG with a dummy class in ADK_Impl.hpp however does not change the outcome: still a warning.

n01r on 23 Aug 2016

ccing @psychocoderHPC

ax3l on 23 Aug 2016

A comparison in the PTX codes before using the faster RNG and after (but all functional parts replaced by a dummy struct RandomGen) shows the following:

before

ptxas info    : Function properties for _ZN8picongpu9particles10ionization21kernelIonizeParticles...
    7280 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads
ptxas info    : Used 63 registers, 20352 bytes smem, 480 bytes cmem[0], 32 bytes cmem[16]

after

ptxas info    : Function properties for _ZN8picongpu9particles10ionization21kernelIonizeParticles...
    272 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 20352 bytes smem, 256 bytes cmem[0], 36 bytes cmem[16]

I cannot see any register usage exploding (in fact it was even more before since my dummy RNG struct is very small)

n01r on 23 Aug 2016

_Fun fact:_ the ~~warning~~ note also gets thrown just because of my dummy struct RandomGen that I defined in-place as a private member of ADK_Impl.

struct RandomGen
{
    template<typename T_CreateRandom>
    RandomGen(T_CreateRandom CreateRandom)
    {

    }

    template <typename T_LocalOffset>
    DINLINE void init(T_LocalOffset localoffset)
    {

    }

    DINLINE float_X operator()()
    {
        return 0.1;
    }
};

n01r on 23 Aug 2016

That is because your dummy objct is smaller than 32byte.

Am 23. August 2016 17:01:57 MESZ, schrieb Marco Garten [email protected]:

_Fun fact:_ the ~~warning~~ note also gets thrown just because of
my dummy struct RandomGen that I defined in-place as a private member
of ADK_Impl.
struct RandomGen
{
   template<typename T_CreateRandom>
   RandomGen(T_CreateRandom CreateRandom)
   {

   }

   template <typename T_LocalOffset>
   DINLINE void init(T_LocalOffset localoffset)
   {

   }

   DINLINE float_X operator()()
   {
       return 0.1;
   }
};
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/ComputationalRadiationPhysics/picongpu/issues/1553#issuecomment-241760954

Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail gesendet.

psychocoderHPC on 25 Aug 2016

👍1

Just found an interesting issue while testing. Executing the following code yields in semi-random results:

#include <curand_kernel.h>

#define __optimal_align__(byte)   \
        __align__(                \
        ((byte)==1?1:             \
        ((byte)<=2?2:             \
        ((byte)<=4?4:             \
        ((byte)<=8?8:             \
        ((byte)<=16?16:           \
        ((byte)<=32?32:           \
        ((byte)<=64?64:128        \
        ))))))))

#define PMACC_ALIGN(var,...) __optimal_align__(sizeof(__VA_ARGS__)) __VA_ARGS__ var

struct Foo{
PMACC_ALIGN(rng, curandStateXORWOW_t);
};

    template<class T_RNG, class T_Mapper>
    __global__ void
    testRNG(T_RNG rng, T_Mapper mapper)
    {
        if(!threadIdx.x)
            printf("value %i\n", mapper);
    }

int main(){
    int value=1;
    Foo foo;
    testRNG<<<1,1>>>(foo, value);
    cudaDeviceSynchronize();
}

This is basically a proof-of concept of a bug caused by alignment. I get the same ABI-change warning when compiling this with nvcc (7.0, 7.5) and g++4-8. In my more complex case I don't get that warning although the behaviour is the same (semi-random results, big struct somewhere)

This does not happen, if the maximum alignment is set to 32 bytes. It also does not happen, if there is another 32-byte aligned param before value. If that param is more or less aligned than 32 bytes then the bug does happen again.

Flamefire on 26 Aug 2016

👍1

@n01r I miss some information how to reproduce the warning. Do I need to enable a special ionization solver? ...

psychocoderHPC on 29 Aug 2016

Sorry, yes, currently the warning occurs when you use ADKLinPol, ADKCircPol and Keldysh. Those are the models that are making use of the new random number generator implementation.

n01r on 29 Aug 2016

I found the following link with a patch which goes to gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44948

As I understand we need to change our alignment strategies and align maximal to valid 32byte stack alignment.

psychocoderHPC on 1 Sep 2016

This is also what my tests show: >32Byte alignment causes trouble on at least some configurations.
Are there any resources WHY we need this alignment at all? Haven't seen it in other CUDA codes before and it seems to cause more trouble than it solves.

Flamefire on 1 Sep 2016

With alignment you can speedup the memory transference.
It also avoid that large types are loaded byte by byte if you create a copy of a functor inside of a kernel.

On older architectures it is also possible that the code do wrong thinks if classes where not aligned. I am not if it is still an issue.

psychocoderHPC on 1 Sep 2016

Just to address the underlying question: _it is not a problem, that our binaries will be incompatible with pre 4.6 GCC created binaries._

a) we do not support <4.6 host compiler gcc any more; who cares
b) (we do not ship binaries right now that would be seeing that problem)

So the only question is:

how to suppress the warning
are their actual limits we should be aware of on the host or device code?

the last point might be important, since we saw actual bugs for >32bit alignment on Alex' laptop with sm_20 and gcc (why?)

ax3l on 5 Sep 2016

if we can't find a documented result on question 2: just limit the maximum alignment to 32byte and suppress the warning that we are not gcc 4.6 ABI compatible ("surprise", we are not even pre-4.6 compiling at all with nvcc).

ax3l on 5 Sep 2016

Suppressing alone is not enough. As #1563 shows, this ABI change also affects interoperability of GCC with NVCC. So we have potentially different ABIs in one (fat) binary created by NVCC and GCC.
According to the tests it seems ok to limit to 32 Byte (not bit ;-) ) alignment, but yes the lacking documentation makes this only a guess. Safe would be 16 Bytes which avoids the warning and any problems due to incompatible ABIs

Flamefire on 5 Sep 2016

As #1563 shows, this ABI change also affects interoperability of GCC with NVCC.

but only when used _above_ 32bit alignment.

32 Byte (not bit ;-) ) alignment

I know, corrected me but you got the GitHub mail too quickly ^^

ax3l on 5 Sep 2016

I am just reluctant to fix a warning we do not fulfil (going below 32byte) when we only see problems above 32byte (which is actually an orthogonal problem).

simply speaking: there is no problem for <=32 byte that we know of (thanks to missing documentation).

ax3l on 5 Sep 2016

Exactly what I said in the 2nd half of that post. So question is: We know of a (potentially) undocumented ABI incompatibility due to a change in the alignment (ABI) for at least >=64Byte. Can we assume that the ABI change for 32byte is safe from the current tests? Given that the former (breaking) ABI change was done in GCC 4.2 (or 4.3?) and the latter in 4.6 I'd expect that also the 32byte alignment will cause problems.
Only NVIDIA could answer this for sure.

Flamefire on 5 Sep 2016

Let us take the known working version 32byte and shuffle #1563 in the nvidia issue tracker! :)

ax3l on 5 Sep 2016

Given that we have a release I'd rather play it save. I don't expect a big difference in performance for 32byte alignment if any at all. So unless we know it is safe, we should assume it is not.
At least test it a bit more: Use initialized, wrapped byte-arrays of different sizes and alignment and test all combinations for number of parameters and alignments when passing it to a kernel. At least situations with 3 parameters with all possible alignment distributions so a differently aligned param before or after does not cause any breakage.

Flamefire on 5 Sep 2016

👍1

My question is, we don't even _know_ that 16byte is save.

The only thing we know: running with 32, 16 and below seems to work.
Yes, testing seems to be the only way...

ax3l on 5 Sep 2016

The difference: We know that something for 32byte has recently changed, for 16byte we don't know such thing.
To get more confidence for the "up to 32byte seems to work"-theory I'd suggest the described test, as mine was just a sample. Especially as CUDA 7.5 "seems to work" for up to 64byte on the cluster, but it breaks on my laptop GPU and 7.0 breaks on both.

Flamefire on 5 Sep 2016

sounds reasonable.

Could also be, that in gcc 4.6-4.9 something changed again and in CUDA 7.0 to 7.5 nvidia switched what they compiled their rt library code with... gist

ax3l on 5 Sep 2016

Can we close this @psychocoderHPC ?

n01r on 12 Jan 2017

closed with #1566

psychocoderHPC on 12 Jan 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Field-Ionization: Damp E-Field

ax3l · 4Comments

Transforming the FoilLCT example into a 3D model

cbontoiu · 3Comments

Bash Completion for Most Important Scripts

ax3l · 4Comments

travis-ci suggestions only valid for GNU sed

PrometheusPi · 3Comments

'unspecified launch failure' during simulation causes 'std::runtime_error'

steindev · 4Comments