This was a tricky one and kept a bug hidden in astroquery for a while (it made parsing a mock table possible while the one from the server was failing):
>>> from astropy.io import ascii
>>> ascii.read('1 | 2 | 3 \n 4 | 5 | 6 \n')
<Table length=2>
 col1  col2  col3
int64 int64 int64
----- ----- -----
    1     2     3
    4     5     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n')
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
Changing the hyphen to a dash makes it work as expected, returning a string column.
>>> ascii.read('1 | 2 | 3 \n 4 | – | 6 \n')
<Table length=2>
 col1 col2  col3
int64 str1 int64
----- ---- -----
    1    2     3
    4    –     6
What...? How does one type "hyphen" and "dash" differently on a keyboard?
Sorry, en dash. Or whatever the name of the longer one is. As you see above, they are different. Normally I don't spot the difference... 🤷‍♀️
(They are not trivially distinguishable when debugging; I lost a few hours trying to figure out why the test was not failing on master while it was clearly failing against the actual remote service.)
Actually, I don't know the formal names to use for them. But I am just wondering how one even types them out, since I only see one of them on the keyboard.
Oh, I can get it with option+'-'
(But first I was copy-pasting from a Google result on dashes; as I said, it was a good source of super annoyance today...)
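A quick way to tell the two characters apart when debugging is to ask Python for their Unicode names, for example:
>>> import unicodedata
>>> unicodedata.name('-')
'HYPHEN-MINUS'
>>> unicodedata.name('–')
'EN DASH'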
A very subtle bug indeed! My eyes can barely tell the difference between them. Good investigative work.
Interesting variations:
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader={'use_fast_converter': True})
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -0.0     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader={'use_fast_converter': False})
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader=None)
<Table length=2>
 col1    col2  col3
int64 float64 int64
----- ------- -----
    1     2.0     3
    4    -1.0     6
'use_fast_converter': False (using C strtod) might be tricky to fix...
Yikes. The workaround is fast_reader=False. Might be safer for production code in places where the input data are not too huge. The fast reader does seem to be a continuing fountain of bugs, sadly. Note that fast_reader=None is the default, and that chooses a fast reader if possible.
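For the example above, the pure Python reader cannot convert the lone '-' to int or float, so it should fall back to a string column, much like the en dash case (a sketch of the expected output, not verified here):
>>> ascii.read('1 | 2 | 3 \n 4 | - | 6 \n', fast_reader=False)
<Table length=2>
 col1 col2  col3
int64 str1 int64
----- ---- -----
    1    2     3
    4    -     6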
@bsipocz @pllim @dhomeier @taldcroft I was looking for the bug and ended up at cparser, then followed the src/tokenizer.c file, and found that it isn't necessarily a bug. When you run ascii.read('1| 2 | 3 \n 4 | - | 6 \n'), once it reaches the minus sign the float converter in the cparser class dispatches to strtod() in C (tokenizer.c), which raises a conversion error that sets the value of '-' to -1.0, because of the following code snippets
It is also written to produce a similar effect for the '+' sign.
Actually, it denotes a minus symbol in the code. If you want to make any changes to the code (to alter the conversion of '-' to float) and treat this as a bug, just let me know; I am happy to work on it as requested.
Thanks for digging at the code, @Harshil-C ! Hmm... the reason for such logic is undocumented, and it was added in #3525
@pllim my pleasure. Yeah, it was hard to dig out without the docs.
The crucial point is that no CONVERSION_ERROR is raised in these cases, although neither a NaN nor an inf is identified. So I think at
https://github.com/astropy/astropy/blob/1cb6c81e874a785e2d7a3075a75a372188b0d844/astropy/io/ascii/src/tokenizer.c#L719
else
{
    self->code = CONVERSION_ERROR;
}
is missing. But further, use_fast_converter=True never reaches the conversion error at all, as xstrtod quietly skips through all the trailing spaces. It should behave analogously to strtod; from the manpage:
If endptr is not NULL, a pointer to the character after the last character used in the conversion is stored in the location referenced by endptr.
If no conversion is performed, zero is returned and the value of nptr is stored in the location referenced by endptr.
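As an aside, that strtod behavior can be checked directly from Python with ctypes; a minimal sketch, assuming a POSIX system where ctypes.util.find_library can locate libc:
>>> import ctypes, ctypes.util
>>> libc = ctypes.CDLL(ctypes.util.find_library('c'))
>>> libc.strtod.restype = ctypes.c_double
>>> buf = ctypes.create_string_buffer(b'- ')
>>> end = ctypes.c_char_p()
>>> libc.strtod(buf, ctypes.byref(end))  # lone sign: no conversion performed, zero returned
0.0
>>> ctypes.cast(end, ctypes.c_void_p).value == ctypes.addressof(buf)  # endptr == nptr
True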
I'd suggest adding this check after the sign has been parsed:
// Handle optional sign
negative = 0;
switch (*p)
{
    case '-': negative = 1; // Fall through to increment position
    case '+': p++;
}

// No data after the sign - make no conversion and jump back to str
// (consistent with strtod)
if (*p == '\0' || isspace(*p))
{
    if (endptr) *endptr = (char *) str;
    return 0e0;
}
@Harshil-C you are most welcome to take care of the fixes - attaching a diff for the changes I have tested so far. Note that I had to change it a bit to accommodate floats without leading digits like -.25.
lone-signs.patch.gz
@taldcroft
> Might be safer for production code in places where the input data are not too huge.
What counts as data that is not too huge? This was used in astroquery to parse the returned tables. It can be either a one-row table or a much larger one, but surely nothing on the GB level. If not using the fast reader for a couple thousand lines is not an issue, I'm happy to switch all cases to stay in pure Python land.
> It can be either a one-row table or a much larger one, but surely nothing on the GB level. If not using the fast reader for a couple thousand lines is not an issue, I'm happy to switch all cases to stay in pure Python land.
How slow does it have to get before it becomes an issue? ;-)
Looking at my old fork of astropy-notebooks, the speed tests there indicate that a couple thousand lines could still stay in the sub-second range, if that helps. The data are quite ancient (I haven't checked whether the notebooks still run with 4.x), but the parsers have not changed much since then, so if anything, everything has probably just become a bit faster.
> How slow does it have to get before it becomes an issue? ;-)
Tens of seconds, or minutes. But for such large data the query will be download-limited anyway, I think. So I suppose this means I have a new homework assignment :)
> Tens of seconds, or minutes. But for such large data the query will be download-limited anyway, I think. So I suppose this means I have a new homework assignment :)
Note that the examples there are using 10-20 columns, so still only a few MB for 10^5 rows; the download should not take more than a couple of seconds...
But if columns with string formats etc. are common, the fast reader will likely not make much difference anyway.
["Fast-C" there refers to strtod, "fast converter" to the xstrtod version.]
@dhomeier do you want me to write a PR for the fix, adding the code you attached? (And I didn't get the 'attaching a diff' part :)
@Harshil-C if you wish to take care of the PR, feel free to use the code (the lone-signs.patch.gz file is linked within the comment above). I wasn't sure which changes you had already written that you might want to incorporate.
But if you prefer I can also submit the PR with my local changes myself.
I think you own this PR, brother; go ahead. (I didn't make many changes, as I am new to C; I only touched Cython and C for this issue.)
Thanks, I think you deserve at least as much credit for tracking this down; but I am probably a bit more familiar with the C part as I contributed some of that code myself! :)
I am not confident enough to write C yet; I hope to be in the future. (Tracking it down was pretty fun :) and I learned many things.)
Whoever ends up taking this up as a PR, there is an option to give the other person credit via GitHub's Co-Author feature. Thanks!
https://github.blog/2018-01-29-commit-together-with-co-authors/
> Whoever ends up taking this up as a PR, there is an option to give the other person credit via GitHub's Co-Author feature. Thanks!
@Harshil-C can you find your GitHub noreply email address? According to the docs on commit email addresses it should be something with a 7-digit ID, as in [email protected], so I guess you can find that ID in your profile settings.
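(For reference, the trailer goes on its own line after a blank line at the end of the commit message; the subject line here is just illustrative, and the address is the redacted one from above:
Fix lone-sign parsing in the fast reader

Co-authored-by: Harshil-C <[email protected]>
GitHub then credits both accounts on the commit.)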
@bsipocz - very unscientific profiling on my new MBP with SSD, for a space-delimited ASCII table with 25 columns by 10000 rows, which is about 1 MB in size:
In [11]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=False)
CPU times: user 151 ms, sys: 6.07 ms, total: 157 ms
Wall time: 157 ms
In [12]: time out = Table.read('junk.dat', format='ascii.basic', guess=False, fast_reader=True)
CPU times: user 22.9 ms, sys: 1.28 ms, total: 24.2 ms
Wall time: 29.7 ms
One can roughly scale from there, but I agree you are most likely to be limited by server speed in getting the text file into memory/disk before trying to parse it into a table. So it may be a consideration to go with the pure Python parser.
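Scaling those wall times linearly with row count, a rough back-of-the-envelope estimate that ignores fixed overhead:
>>> rows, slow, fast = 10000, 0.157, 0.0297   # seconds, from the timings above
>>> for n in (10**5, 10**6):
...     print(n, round(slow * n / rows, 1), round(fast * n / rows, 1))
...
100000 1.6 0.3
1000000 15.7 3.0
So the pure Python parser should stay comfortably in seconds territory up to a few hundred thousand rows.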
Having said that, I cannot thank @dhomeier and @Harshil-C enough for trying to maintain the fast C parser!! I don't understand that code for the most part, so having people ready to step in and investigate bugs is hugely important.
@dhomeier sorry for the late reply; I unknowingly turned notifications off. Is this the one...
[email protected]
Thanks @Harshil-C, I've updated the PR description, but I don't think I can modify the commit message, since it is already merged.
We do not modify commits once they are merged into master, either. We apologize for any inconvenience caused!