Pandas: read_fwf: skip_blank_lines does nothing

Created on 13 Sep 2018  路  6Comments  路  Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

from io import StringIO
import pandas as pd

f = StringIO('''A B

C D''')

df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
print(df)

Problem description

Output:

     0    1
0    A    B
1  NaN  NaN
2    C    D

The (second) blank line is not skipped, but instead there is a row with two NaN values. It seems that skip_blank_lines has no effect on read_fwf. On the other hand, read_csv(f, sep=' ', header=None), produces the expected output below.

Expected Output

   0  1
0  A  B
1  C  D

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.2
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug IO Fixed Width

Most helpful comment

@dcdenu4 or anyone can submit a PR
pandas is all volunteer and we have 3000+ open issues

All 6 comments

Thanks for the report - investigation and PRs are certainly welcome

The bug appears to be in pandas/io/parsers.py in the way _remove_empty_lines() is used with fixed-width files (or probably files with whitespace delimiters for that matter). It sees that the line is ['', ''] which in a CSV would mean the string "," and therefore non-empty, but for a fixed-width file can happen when the line is empty. Arguably, if the line really were spaces out to the last field, empty strings might be the right thing to extract. (Edit: Nevermind, then the fields would be a series of spaces, not empty.) I'm not sure if the right fix is to change _remove_empty_lines() or its usages.

I believe this bug also extends to read_excel and read_csv in files that have 'empty' trailing lines, and is more generic than just the read_fwf function.

Code sample

import pandas as pd
from io import StringIO

csv_f = StringIO('''A,B,C,D 
FOO,1,2,3 
FOO,4,5,6 
,,, 
FOO,7,8,9 
,10,11,12 
,,, 
,,, 
,,, 
,,, 
,,, 
''' 
) 

df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)

Output

      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN

With read_excel if the workbook has 'blank' lines that contain any sort of formula that result in a null string or blank cell (but not empty cell), similar behaviour is exhibited when skip_blank_lines=True (which is default true anyway according to the docs https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling)

Also possibly related to: #10164

Any updates on this? I'm still seeing this behavior in pandas 1.0.3

from io import StringIO
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0

@dcdenu4 or anyone can submit a PR
pandas is all volunteer and we have 3000+ open issues

Was this page helpful?
0 / 5 - 0 ratings

Related issues

marcelnem picture marcelnem  路  3Comments

amelio-vazquez-reina picture amelio-vazquez-reina  路  3Comments

ericdf picture ericdf  路  3Comments

andreas-thomik picture andreas-thomik  路  3Comments

Ashutosh-Srivastav picture Ashutosh-Srivastav  路  3Comments