This was a tricky one and kept a bug hidden in astroquery for a while (it made parsing a mock table possible while the one from the server was failing):
>>> from astropy.io import ascii
>>> ascii.read('1 | 2 | 3 \n 4 | 5 | 6 \n')
<Table length=2>
 col1  col2  col3
int64 int64 int64
----- ----- -----
    1     2     3
    4     5     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n')
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
Changing the hyphen to a dash makes it work as expected, returning a string column.
>>> ascii.read('1 | 2 | 3 \n 4 | – | 6 \n')
<Table length=2>
 col1 col2  col3
int64 str1 int64
----- ---- -----
    1    2     3
    4    –     6
What...? How does one type "hyphen" and "dash" differently on a keyboard?
Sorry, en dash. Or whatever the name of the longer one is. As you see above, they are different. Normally I don't spot the difference... 🤷‍♀️
(They are not trivially distinguishable when debugging; I lost a few hours trying to figure out why the test was not failing on master while it was clearly failing against the actual remote service.)
Actually, I don't know the formal names to use for them. But I am just wondering how one even types them out, since I only see one of them on the keyboard.
Oh, I can get it with option+'-'
(But first I was copy-pasting from a Google result on dashes; as I said, it was a good source of super annoyance today...)
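A quick way to tell the two characters apart when debugging is to ask Python for their Unicode names, for example:
>>> import unicodedata
>>> unicodedata.name('-')
'HYPHEN-MINUS'
>>> unicodedata.name('–')
'EN DASH'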
A very subtle bug indeed! My eyes can barely tell the difference between them. Good investigative work.
Interesting variations:
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader={'use_fast_converter': True})
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -0.0     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader={'use_fast_converter': False})
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader=None)
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
'use_fast_converter': False (using C strtod) might be tricky to fix...
Yikes. The workaround is fast_reader=False. Might be safer for production code in places where the input data are not too huge. The fast reader does seem to be a continuing fountain of bugs, sadly. Note that fast_reader=None is the default, and that chooses a fast reader if possible.
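For the example above, the pure Python reader cannot convert the lone '-' to int or float, so it should fall back to a string column, much like the en dash case (a sketch of the expected output, not verified here):
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader=False)
<Table length=2>
 col1 col2  col3
int64 str1 int64
----- ---- -----
    1    2     3
    4    -     6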
@bsipocz @pllim @dhomeier @taldcroft I was looking for the bug and ended up at cparser, then followed the src/tokenizer.c file, and found that it isn't necessarily a bug. When you run ascii.read('1| 2 | 3 \n 4 | - | 6 \n'), once it reaches the minus sign the float converter in the cparser class dispatches to strtod() in C (tokenizer.c), which raises a conversion error that sets the value of '-' to -1.0, because of the following code snippets
It is also written to produce a similar effect for the '+' sign.
Actually, it denotes a minus symbol in the code. If you want to make any changes to the code (to alter the conversion of '-' to float) and treat this as a bug, just let me know; I am happy to work on it as requested.
Thanks for digging at the code, @Harshil-C ! Hmm... the reason for such logic is undocumented, and it was added in #3525
@pllim my pleasure. Yeah, it was hard to dig out without the docs.
The crucial point is that no CONVERSION_ERROR is raised in these cases, although neither a NaN nor an inf is identified. So I think at
https://github.com/astropy/astropy/blob/1cb6c81e874a785e2d7a3075a75a372188b0d844/astropy/io/ascii/src/tokenizer.c#L719
else
{
    self->code = CONVERSION_ERROR;
}
is missing. But further, use_fast_converter=True never reaches the conversion error at all, as xstrtod quietly skips through all the trailing spaces. It should behave analogously to strtod; from the manpage:
If endptr is not NULL, a pointer to the character after the last character used in the conversion is stored in the location referenced by endptr.
If no conversion is performed, zero is returned and the value of nptr is stored in the location referenced by endptr.
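As an aside, that strtod behavior can be checked directly from Python with ctypes; a minimal sketch, assuming a POSIX system where ctypes.util.find_library can locate libc:
>>> import ctypes, ctypes.util
>>> libc = ctypes.CDLL(ctypes.util.find_library('c'))
>>> libc.strtod.restype = ctypes.c_double
>>> buf = ctypes.create_string_buffer(b'- ')
>>> end = ctypes.c_char_p()
>>> libc.strtod(buf, ctypes.byref(end))  # lone sign: no conversion performed, zero returned
0.0
>>> ctypes.cast(end, ctypes.c_void_p).value == ctypes.addressof(buf)  # endptr == nptr
True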
I'd suggest adding this check after the sign has been parsed:
// Handle optional sign
negative = 0;
switch (*p)
{
    case '-': negative = 1; // Fall through to increment position
    case '+': p++;
}

// No data after the sign - make no conversion and jump back to str
// (consistent with strtod)
if (*p == '\0' || isspace(*p))
{
    if (endptr) *endptr = (char *) str;
    return 0e0;
}
@Harshil-C you are most welcome to take care of the fixes - attaching a diff for the changes I have tested so far. Note that I had to change it a bit to accommodate floats without leading digits like -.25.
lone-signs.patch.gz
@taldcroft
> Might be safer for production code in places where the input data are not too huge.
What counts as data that is not too huge? This was used in astroquery to parse the returned tables. It can be either a one-row table or a much larger one, but surely nothing on the GB level. If not using the fast reader for a couple thousand lines is not an issue, I'm happy to switch all cases to stay in pure Python land.
> It can be either a one-row table or a much larger one, but surely nothing on the GB level. If not using the fast reader for a couple thousand lines is not an issue, I'm happy to switch all cases to stay in pure Python land.
How slow does it have to get before it becomes an issue? ;-)
Looking at my old fork of astropy-notebooks, the speed tests there indicate that a couple thousand lines could still stay in the sub-second range, if that helps. The data are quite ancient (I haven't checked whether the notebooks still run with 4.x), but the parsers have not changed much since then, so if anything, everything has probably just become a bit faster.
> How slow does it have to get before it becomes an issue? ;-)
Tens of seconds, or minutes. But for such large data the query will be download-limited anyway, I think. So I suppose this means I have a new homework assignment :)
> Tens of seconds, or minutes. But for such large data the query will be download-limited anyway, I think. So I suppose this means I have a new homework assignment :)
Note that the examples there are using 10-20 columns, so still only a few MB for 10^5 rows; the download should not take more than a couple of seconds...
But if columns with string formats etc. are common, the fast reader will likely not make much difference anyway.
["Fast-C" there refers to strtod, "fast converter" to the xstrtod version.]
@dhomeier do you want me to write a PR for the fix, adding the code you attached? (And I didn't get the 'attaching a diff' part :)
@Harshil-C if you wish to take care of the PR, feel free to use the code (the lone-signs.patch.gz file is linked within the comment above). I wasn't sure which changes you had already written that you might want to incorporate.
But if you prefer I can also submit the PR with my local changes myself.
I think you own this PR, brother; go ahead. (I didn't make many changes, as I am new to C; I only touched Cython and C for this issue.)
Thanks, I think you deserve at least as much credit for tracking this down; but I am probably a bit more familiar with the C part as I contributed some of that code myself! :)
I am not confident enough to write C yet; I hope to be in the future. (Tracking it down was pretty fun :) and I learned many things.)
Whoever ends up taking this up as a PR, there is an option to give the other person credit via GitHub's Co-Author feature. Thanks!
https://github.blog/2018-01-29-commit-together-with-co-authors/
> Whoever ends up taking this up as a PR, there is an option to give the other person credit via GitHub's Co-Author feature. Thanks!
@Harshil-C can you find your GitHub noreply email address? According to the docs on commit email addresses it should be something with a 7-digit ID, as in [email protected], so I guess you can find that ID in your profile settings.
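(For reference, the trailer goes on its own line after a blank line at the end of the commit message; the subject line here is just illustrative, and the address is the redacted one from above:
Fix lone-sign parsing in the fast reader

Co-authored-by: Harshil-C <[email protected]>
GitHub then credits both accounts on the commit.)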
@bsipocz - very unscientific profiling on my new MBP with SSD, for a space-delimited ASCII table with 25 columns by 10000 rows, which is about 1 MB in size:
In [11]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=False)
CPU times: user 151 ms, sys: 6.07 ms, total: 157 ms
Wall time: 157 ms
In [12]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=True)
CPU times: user 22.9 ms, sys: 1.28 ms, total: 24.2 ms
Wall time: 29.7 ms
One can roughly scale from there, but I agree you are most likely to be limited by server speed in getting the text file into memory/disk before trying to parse it into a table. So it may be a consideration to go with the pure Python parser.
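Scaling those wall times linearly with row count, a rough back-of-the-envelope estimate that ignores fixed overhead:
>>> rows, slow, fast = 10000, 0.157, 0.0297   # seconds, from the timings above
>>> for n in (10**5, 10**6):
...     print(n, round(slow * n / rows, 1), round(fast * n / rows, 1))
...
100000 1.6 0.3
1000000 15.7 3.0
So the pure Python parser should stay comfortably in seconds territory up to a few hundred thousand rows.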
Having said that, I cannot thank @dhomeier and @Harshil-C enough for trying to maintain the fast C parser!! I don't understand that code for the most part, so having people ready to step in and investigate bugs is hugely important.
@dhomeier sorry for the late reply; I unknowingly turned notifications off. Is this the one...
[email protected]
Thanks @Harshil-C, I've updated the PR description, but I don't think I can modify the commit message, since it is already merged.
We do not modify commits once they are merged into master, either. We apologize for any inconvenience caused!