from io import StringIO
import pandas as pd
f = StringIO('''A B

C D''')
df = pd.read_fwf(f, colspecs=[(0, 1), (2, 3)], header=None, skip_blank_lines=True)
print(df)
Output:
     0    1
0    A    B
1  NaN  NaN
2    C    D
The (second) blank line is not skipped; instead there is a row with two NaN values. It seems that skip_blank_lines has no effect on read_fwf. On the other hand, read_csv(f, sep=' ', header=None) produces the expected output below.
   0  1
0  A  B
1  C  D
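For reference, a minimal sketch of how that read_csv comparison can be reproduced (the buffer has to be rebuilt, because the read_fwf call above consumes f):
from io import StringIO
import pandas as pd

# Same three-line input as above, with a blank second line.
f = StringIO('A B\n\nC D')
# read_csv honours skip_blank_lines (True by default) and drops the blank row.
print(pd.read_csv(f, sep=' ', header=None))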
pd.show_versions()
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.1
pytest: 3.6.2
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Thanks for the report - investigation and PRs are certainly welcome
The bug appears to be in pandas/io/parsers.py, in the way _remove_empty_lines() is used with fixed-width files (and probably with whitespace-delimited files in general). The parsed line comes through as ['', ''], which in a CSV would correspond to the string "," and is therefore rightly kept, but for a fixed-width file that is exactly what an empty line produces. Arguably, if the line really were padded with spaces out to the last field, empty strings might be the right thing to extract. (Edit: never mind, in that case the fields would be runs of spaces, not empty strings.) I'm not sure whether the right fix is to change _remove_empty_lines() or its call sites.
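For illustration, here is a paraphrase of what the filter does (a sketch, not the exact pandas source): a line survives if it has more than one field, or if its single field is a non-string or a non-blank string, so a blank fixed-width line, which tokenizes to one empty string per colspec, always survives.
def _remove_empty_lines_sketch(lines):
    # Paraphrase of the filter in pandas/io/parsers.py: keep a line if it has
    # more than one field, or its only field is a non-string or non-blank string.
    kept = []
    for line in lines:
        if len(line) > 1 or (len(line) == 1 and
                             (not isinstance(line[0], str) or line[0].strip())):
            kept.append(line)
    return kept

# A blank FWF line with two colspecs tokenizes to ['', ''], so len(line) > 1
# holds and the "empty" row is kept.
print(_remove_empty_lines_sketch([['A', 'B'], ['', ''], ['C', 'D']]))
# -> [['A', 'B'], ['', ''], ['C', 'D']]

# One possible adjustment (just an idea, not a tested patch): also drop lines
# whose fields are all empty or whitespace-only strings.
def _remove_empty_lines_candidate(lines):
    return [line for line in lines
            if any(not isinstance(f, str) or f.strip() for f in line)]

print(_remove_empty_lines_candidate([['A', 'B'], ['', ''], ['C', 'D']]))
# -> [['A', 'B'], ['C', 'D']]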
I believe this bug also extends to read_excel and read_csv for files that have 'empty' trailing lines, so it is more general than just read_fwf.
import pandas as pd
from io import StringIO
csv_f = StringIO('''A,B,C,D
FOO,1,2,3
FOO,4,5,6
,,,
FOO,7,8,9
,10,11,12
,,,
,,,
,,,
,,,
,,,
'''
)
df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)
      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN
With read_excel, if the workbook has 'blank' rows whose cells contain formulas that evaluate to a null string or a blank value (so the cells are not truly empty), similar behaviour shows up even though skip_blank_lines=True is the default according to the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling
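Until the parser handles this, the workaround I use (just a post-read clean-up, not an official fix) is to drop rows in which every column is NaN; a trimmed version of the CSV above:
import pandas as pd
from io import StringIO

csv_f = StringIO('A,B,C,D\nFOO,1,2,3\n,,,\nFOO,4,5,6\n,,,\n,,,\n')

# Read as usual, then drop all-NaN rows and renumber the index.
df = (pd.read_csv(csv_f, header=None)
        .dropna(how='all')
        .reset_index(drop=True))
print(df)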
Also possibly related to: #10164
Any updates on this? I'm still seeing this behavior in pandas 1.0.3.
import pandas
from io import StringIO
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0
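A small demonstration of the distinction involved (my own example, not from the report above): a line that is literally empty is skipped by skip_blank_lines, while a line of bare delimiters such as ',,,' is tokenized into empty fields and comes back as an all-NaN row.
import pandas
from io import StringIO

# Literally empty second line: skip_blank_lines (default True) drops it.
blank = StringIO('lucode,desc\n1,corn\n\n3,peas\n')
print(pandas.read_csv(blank))

# Bare delimiters: the line is parsed as a row of empty fields, so it survives as NaNs.
commas = StringIO('lucode,desc\n1,corn\n,\n3,peas\n')
print(pandas.read_csv(commas))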
@dcdenu4 or anyone can submit a PR
pandas is all volunteer and we have 3000+ open issues