This was posted on Discord by @Darth_Sidious a few days ago. I'm replicating it here for future reference.
Hi guys. I have some suggestions that I wrote in an archived channel. Please read these messages: https://discord.com/channels/435943710472011776/744337346585166016/753992057650937856 and https://discord.com/channels/435943710472011776/744337346585166016/753995112773582888
I've made a simple test to prove my first idea: https://pastebin.com/eKNuvT5c On my computer it gives about a 1.4x speedup even without removing zero-vectors. Feel free to run this test and to ask questions. Is there anyone who wants to help me with the frequencies map?
Those links don't do anything for me.
Discord's uselessness confirmed: http://talkchess.com/forum3/viewtopic.php?t=74353
OK, so he wants to sort the 641*64 vectors by frequency to improve caching. I doubt that will make any difference, since the L1/L2/L3 caches work in units of 64-byte cache lines anyway and each of those vectors is much longer than a cache line. In other words, unused vectors should not end up in the cache in the first place.
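To put rough numbers on the cache-line argument, here is a minimal sketch. It assumes 256 weights of 16 bits per vector (the HalfKP 256x2-32-32 layout in use at the time); those two constants and all names below are illustrative assumptions, not taken from the thread.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;        // line size quoted above
constexpr std::size_t kNumVectors     = 641 * 64;  // vector count quoted above
constexpr std::size_t kWeightsPerVec  = 256;       // ASSUMPTION: weights per vector
using Weight = std::int16_t;                       // ASSUMPTION: 16-bit weights

// Size of one feature's weight vector in bytes.
constexpr std::size_t bytes_per_vector() {
    return kWeightsPerVec * sizeof(Weight);
}

// Number of cache lines one vector spans (rounded up).
constexpr std::size_t cache_lines_per_vector() {
    return (bytes_per_vector() + kCacheLineBytes - 1) / kCacheLineBytes;
}

// Size of the whole weight table in bytes.
constexpr std::size_t total_table_bytes() {
    return kNumVectors * bytes_per_vector();
}

static_assert(bytes_per_vector() == 512, "512 bytes per vector");
static_assert(cache_lines_per_vector() == 8, "8 cache lines per vector");
```

Under these assumptions each vector covers 8 full cache lines, so two vectors share a line only if the table itself is not aligned to a 64-byte boundary. Reordering whole vectors would then change which lines are touched, but not how many.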
A better idea seems to be to prefetch the relevant cache lines. This is on my TODO list already.
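A rough sketch of what such prefetching could look like when rebuilding an accumulator from scratch. This is not the engine's actual code: the helper names, the flat weight layout, the 256-weight vector length, and the omission of biases are all assumptions made for illustration. `__builtin_prefetch` is the GCC/Clang builtin; on other compilers the sketch falls back to a no-op.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kWeightsPerVec  = 256;  // ASSUMPTION: weights per feature

// Issue a prefetch hint for the cache line containing p.
inline void prefetch_line(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;  // no-op fallback on other compilers
#endif
}

// Hypothetical helper: bounds-checked pointer to one feature's weights.
inline const std::int16_t* vector_ptr(const std::vector<std::int16_t>& weights,
                                      std::size_t feature) {
    assert((feature + 1) * kWeightsPerVec <= weights.size());
    return weights.data() + feature * kWeightsPerVec;
}

// Hypothetical helper: hint every cache line of one weight vector.
inline void prefetch_vector(const std::int16_t* w) {
    const char* p = reinterpret_cast<const char*>(w);
    const std::size_t bytes = kWeightsPerVec * sizeof(std::int16_t);
    for (std::size_t off = 0; off < bytes; off += kCacheLineBytes)
        prefetch_line(p + off);
}

// Sum the weight vectors of all active features into acc (biases omitted).
// While feature i is being added, the vector for feature i+1 is prefetched.
inline void refresh_accumulator(const std::vector<std::int16_t>& weights,
                                const std::vector<std::size_t>& active,
                                std::int16_t* acc) {
    for (std::size_t j = 0; j < kWeightsPerVec; ++j)
        acc[j] = 0;
    for (std::size_t i = 0; i < active.size(); ++i) {
        if (i + 1 < active.size())
            prefetch_vector(vector_ptr(weights, active[i + 1]));
        const std::int16_t* w = vector_ptr(weights, active[i]);
        for (std::size_t j = 0; j < kWeightsPerVec; ++j)
            acc[j] = static_cast<std::int16_t>(acc[j] + w[j]);
    }
}
```

Prefetches are only hints, so the result is identical with or without them. Whether they help depends on whether the weights are already cached, which would have to be measured.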
Yes, very likely the right prefetches will improve things.
@syzygy1 did you look into the prefetch approach you mentioned before?
Not yet. I'll probably look into it soon. (If someone else beats me, then that's fine.)
@vondele
I've made a number of attempts to use prefetch for refreshing the accumulator, but I could not get a speedup. I suspect that feature weights are almost always already in one of the caches.
OK, sounds reasonable given the current small network size. I'll close the issue.