Data.table: print uint64 using inttypes.h `PRIu64` instead of `llu`

Created on 20 Feb 2019  路  8Comments  路  Source: Rdatatable/data.table

%llu format works fine when using from R C API functions like Rprintf, but it fails on Windows when using in plain C like sprintf. We could use PRIu64 instead, defined in inttypes.h.
According to http://r.789695.n4.nabble.com/package-config-file-for-windows-td921218.html inttypes.h is available on CRAN's Windows machines.
According to https://github.com/pocoproject/poco/pull/1426 inttypes.h is available on Solaris.

We need this to be able to construct a message from plain C code, where message body contains length of a vector, thus can be int64. This is used for deferred error/warnings/verbose messages from code that runs in parallel regions.

High

All 8 comments

See https://github.com/Rdatatable/data.table/blob/master/src/fread.c#L30
There llu is defined to be unsigned long long int. Then use "%llu" in format string as usual, but cast the size_t or uint64_t arguments using (llu) prefix.

@mattdowle Is it going to work on windows 32bit?
I tried sprintf("%llu", (long long unsigned) nx); for uint_fast64_t nx and it was causing problems for win32.

@jangorecki The line in fread.c and in my comment above uses unsigned long long int. Why have you used long long unsigned?

@mattdowle unsigned long long int doesn't seems to work on windows (both 32 and 64)

froll.c:41:3: warning: unknown conversion type character 'l' in format [-Wformat=]
   if (verbose) sprintf(ans->message[0], "%s%s: running for input length %llu, window %d, hasna %d, narm %d\n", ans->message[0], __func__, (unsigned long long int)nx, k, hasna, (int) narm);
   ^
froll.c:41:3: warning: format '%d' expects argument of type 'int', but argument 5 has type 'uint_fast64_t' [-Wformat=]
froll.c:41:3: warning: too many arguments for format [-Wformat-extra-args]

I see what you mean and given those warnings about int types, it is reasonable to think it's about int types. But note the last warning (which is at the same point 41:3) : warning: too many arguments for format [-Wformat-extra-args]. If there's an extra argument (or something like that) then that could cause other compiler warnings to be misleading.

The first thing I noticed was that the target pointer (ans->message[0]) appeared again as the first argument after the format specifier, to achieve writing to the end of the string. That's going to cause trouble because sprintf() will need to read and write to the same part of the string at the same time. Maybe that should be possible and should work, but it isn't clean, it's not efficient, and probably not expected by compiler optimizer writers. So lets just remove that for starters. I added an end() function which just calls strchr() and then wrapped the target with end() each time.

Also we should never use sprintf. Always snprintf and pass the amount of space available. To avoid overwrites. In fread.c it uses snprintf not sprintf. I changed all sprintf to snprintf.

I made these changes to the PR and it's passing now: 17d2c522a823294628ceb5b7a4207196b08bf670 and 605d0b03a2dff6136e982c3497fed19966077779. This shows that unsigned long long int does work on Windows including 32bit and those warnings were misleading.

I left the 500 that I added hard coded just as a first step to make sure it passed CI first. Could you clear that up from here please. Maybe end() could subtract the length that strchr() returned from 4096 allocated and store the remaining space in a global variable. Then pass that global as the 2nd argument to snprintf. However, I'm not sure how much of this is in parallel, so a single global may not work. Perhaps store a usedLength in ans-> too then and wrap the snprintf() calls into an appendMessage helper function or something. If you define the 4096 somewhere in just one place (a .h) then use that macro definition in the two places to ensure there can't be mismatch in future if the 4096 is ever changed in one place but not the other.

AFAIU snprintf the 500 value is fine there, assuming none of the message has more than 500 chars. Even if such message would happen, the behaviour is defined and result in message being truncated. Writing those messages goes in parallel so, yes, global var wouldn't work. AFAIU the code that is now in this branch is enough, we don't need to store amount of characters used in extra variable. Second argument of snprintf, the 500, is not the number of chars that will be written to that buffer, but a maximum number, thus assuming that none of message is >500 and all messages in total are <=4096 (or precisely 4096-n_messages because each snprintf consume extra byte for \0), then we are safe as it is now.

Yes the 500 applies to each message and it's the maximum written (if the message is longer it is safely truncated to 500). There's an implicit upper bound of 8 messages per buffer then (8*500 < 4096). In practice, it will safely hold many more than 8 messages since the messages are a lot shorter than 500. Ok since you're happy with it (sounds like you know the maximum number of messages is 8 or less), I'll merge it so we can move on. Maybe limiting the number of messages is the easiest way to make it rock solid. That could be an int[] variable added to the ans structure alongside the message[][].

Hold on. ~It won't be enough as user may provide arbitrary number of columns or windows to calculate.~ Need to investigate.

Was this page helpful?
0 / 5 - 0 ratings