Thank you for your question!
Does EPOCH support [...] Xeon Phi?
We are PIConGPU ;-) , but yes, we already support KNL in dev and, more conveniently, in our next stable release (in "native"/non-offloading mode).
@theZiz is currently fine-tuning a profile for the Taurus (TU Dresden) cluster in #2210, and we will write up some example setups once we have good tuning and installs figured out.
Running on CPU or GPU will be very easy for users to control: it can be steered with a simple switch during pic-build (pic-configure), with no changes to your source code or build scripts. We will provide instructions/templates on how to configure the KNL hardware and how many MPI ranks to run per card for optimal performance.
That sounds great! Did you benchmark PIConGPU's performance on the various architectures (GPGPU, KNL, CPU)? Do you expect GPGPU to be the fastest due to the way the code is written?
Thanks for asking!
Yes, we did!
These are the papers investigating our underlying library alpaka on various architectures in order to prove zero-overhead abstraction with C++ metaprogramming (aka performance portability):
DOI:10.1109/IPDPSW.2016.50 (http://arxiv.org/abs/1602.08477), paper in AsHES2016
DOI:10.5281/zenodo.49768 (thesis: diploma)
Alexander Matthes, René Widera, Erik Zenker, Benjamin Worpitz, Axel Huebl and Michael Bussmann
"Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library"
paper in ISC17 (P3MA), DOI:10.1007/978-3-319-67630-2_36, preprint: https://arxiv.org/abs/1706.10086
and on PIConGPU porting with Alpaka:
DOI:10.1007/978-3-319-46079-6_21 (https://arxiv.org/abs/1606.02862), paper in ISC16 (IWOPH), see cupla
E. Zenker, R. Widera, G. Juckeland et al., Porting the Plasma Simulation PIConGPU to Heterogeneous Architectures with Alpaka (GTC16 talk)
https://mygtc.gputechconf.com/events/32/schedules/2792
Video: http://on-demand.gputechconf.com/gtc/2016/video/S6298.html
As you can see in the papers above, we have already investigated Alpaka+GPU/Power/CPU/KNL/... and PIConGPU+Alpaka+GPU/Power/CPU/... The benchmarks for the latest Xeon Phi (KNL) are currently being tuned for the next release in #2210.
Do you expect GPGPU to be the fastest due to way the code is written?
As we outline in more detail in the papers, the so-called floating-point efficiency (the performance you achieve relative to what your hardware implements) is similar across most platforms we benchmarked. That is great! Still, GPUs tend to be a bit more efficient, though we have to see how this plays out as we tune further (we have plenty of options with alpaka). The reasons for this are manifold, e.g. memory hierarchies (plus their bandwidths and latencies) and the relatively low arithmetic intensity of the basic PIC algorithm. Power consumption per Flop is also an interesting aspect, and it is generally better on RISC-like architectures.
For more details, the blog posts by Karl Rupp + [2] try to organize the current horse race in HPC and are an interesting read.
Consequently, manycore hardware such as GPUs currently delivers the most PIConGPU Flops per invested dollar (and also the fastest time-to-solution).