Hello everybody,
I am using Pandas 0.18 to open a sas7bdat dataset
I simply use:
df=pd.read_sas('P:/myfile.sas7bdat')
and I get the following error
buf[0:text_block_size].rstrip(b"\x00 ").decode())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)
If I use
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
I get
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte
Other sas7bdat files in my folder are handled just fine by Pandas.
When I open the file in SAS I see that the column names are very long and span several lines, but otherwise the files look just fine.
There are not so many possible options in read_sas... what should I do? Is this a bug in read_sas?
Many thanks!
Are you able to share that file, or a similar file with non-sensitive data that raises the same error?
Well, this is the problem: I can't. But I can do my best to run tests on my side, or do stuff in SAS, or whatever you need to sort out the problem.
You said the column names were long and span several lines. Can you make a dummy file with long names (just random strings like AAAAA.... might work) and a bit of fake data? (I don't have a copy of SAS).
Actually, this might be a dupe of https://github.com/pydata/pandas/issues/12659 Can you try reading the file linked there and see if the same error is raised?
when I try to read that file, I get TypeError: read() takes at most 1 argument (2 given)
Just making sure, did you click on the raw link to download?
Yes, I downloaded the file test17.sas7bdat. Is this not the expected error?
I checked the details of the file in SAS. Apparently it is encoded in Latin-1 Western.
So I tried read_sas('myfile.sas7bdat', encoding='latin-1') but I get the same error:
'ascii' codec can't decode byte, etc.
Sorry, I was mistaken about the error message. Looks like this is a different issue.
Something strange is that even if I specify an encoding, I still get an error related to the ascii codec. Could that be the cause of the error?
More precisely, the encoding of my SAS file is Latin-1 Western ISO. It was created on Linux (but I use pandas on Windows).
Can you drop into the debugger after it raises the error? Use %debug if you're in IPython. Then you can see what's going on.
The docstring says encoding is just for decoding string columns (the actual values), so perhaps it isn't being applied to decoding the column names.
aha! ok lemme try the debugger
> c:\users\me\appdata\local\continuum\anaconda2\lib\site-packages\pandas\io\sas\sas7bdat.py(529)_process_columntext_subheader()
527 buf = self._read_bytes(offset, text_block_size)
528 self.column_names_strings.append(
--> 529 buf[0:text_block_size].rstrip(b"\x00 ").decode())
530
531 if len(self.column_names_strings) == 1:
this is what I get. then the debugger seems to wait for instructions
Ahh, that looks promising though. Does buf[0:text_block_size].rstrip(b"\x00 ").decode('latin1') work there?
Although, that might not go well with the bit stripping there...
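For anyone following along, here is a minimal sketch of what that debugger expression does with the different codecs. The buffer contents are made up (the real bytes from the file are not shown in this thread); only the leading 0xd8 byte is taken from the error message. The helper name try_decode is hypothetical.

```python
# Made-up stand-in for the bytes read from the column-text subheader;
# the real buffer contents are not shown in this thread.
buf = b"\xd8KONOMI" + b"\x00 \x00\x00"

# Strip trailing NUL and space bytes while the data is still raw bytes.
stripped = buf.rstrip(b"\x00 ")


def try_decode(raw, encoding):
    """Return the decoded text, or the codec's failure reason."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: " + exc.reason


ascii_result = try_decode(stripped, "ascii")
utf8_result = try_decode(stripped, "utf-8")
latin1_result = try_decode(stripped, "latin-1")

print(repr(ascii_result))
print(repr(utf8_result))
print(repr(latin1_result))
```

In Latin-1 the bytes 0x00 and 0x20 are NUL and space, so stripping them before decoding should be harmless for that encoding. A multi-byte encoding such as UTF-16 could behave differently, since 0x00 can be part of a character there.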
What do you mean? What should I do?
Sorry, I have never used the debugger.
OK Tom, I found a fix.
Just check the encoding of your sas file (right click, properties, details) and set the encoding.
import sys
reload(sys)
sys.setdefaultencoding("latin-1")
The question I have is thus: why does specifying the encoding in the read_sas function do nothing?
I believe the encoding parameter is just used to decode text data in the actual DataFrame itself, and not the metadata like column headers. Does that sound correct @kshedden ?
According to the docs below, depending on the setting of the VALIDVARNAME
option, variable names may be either restricted to ASCII, or may be
arbitrary bytes to be decoded somehow:
I'm not sure if this VALIDVARNAME (which I have never heard of before) is
in the file somewhere, or is an option that you specify within the
session. In any case, it appears that the column names may need to be
decoded.
Also relevant:
http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm
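A rough sketch of the idea that column names may need the same decoding as the string data. This is not the pandas implementation: decode_column_name is a hypothetical helper, and the sample bytes are invented for illustration. It only shows an encoding argument being passed through to the name instead of calling decode() with no argument.

```python
def decode_column_name(raw, encoding=None):
    """Hypothetical helper (not part of pandas).

    Strips trailing NUL/space padding and decodes the name only when an
    encoding is given; otherwise the raw bytes are returned unchanged.
    """
    name = raw.rstrip(b"\x00 ")
    if encoding is None:
        return name
    return name.decode(encoding)


# Invented example bytes; 0xd8 is the byte from the reported error.
raw_name = b"\xd8KONOMI\x00\x00"

as_bytes = decode_column_name(raw_name)
as_text = decode_column_name(raw_name, encoding="latin-1")

print(repr(as_bytes))
print(repr(as_text))
```

Whether names should stay as bytes or fall back to some default when no encoding is given is a design choice this sketch does not settle.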
Yes, that makes sense, although I don't have any control over the creation of these SAS files.
I'm working on a PR https://github.com/pydata/pandas/pull/12656 and will
try to work this into it.
I haven't had much time lately but will try to get to this next week.
Kerby
@randomgambit, can you try this branch against your SAS file:
https://github.com/kshedden/pandas/tree/sas7bdat_perf
I hope it fixes your problem.