from io import StringIO
import pandas as pd
f = StringIO('''A B

C D''')
df = pd.read_fwf(f, colspecs=[(0, 1), (2, 3)], header=None, skip_blank_lines=True)
print(df)
Output:
     0    1
0    A    B
1  NaN  NaN
2    C    D
The (second) blank line is not skipped; instead there is a row with two NaN values. It seems that skip_blank_lines has no effect on read_fwf. On the other hand, read_csv(f, sep=' ', header=None) produces the expected output below.
   0  1
0  A  B
1  C  D
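For reference, a minimal sketch of how that read_csv comparison can be reproduced (the buffer has to be rebuilt, because the read_fwf call above consumes f):
from io import StringIO
import pandas as pd

# Same three-line input as above, with a blank second line.
f = StringIO('A B\n\nC D')
# read_csv honours skip_blank_lines (True by default) and drops the blank row.
print(pd.read_csv(f, sep=' ', header=None))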
pd.show_versions()
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.1
pytest: 3.6.2
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Thanks for the report - investigation and PRs are certainly welcome
The bug appears to be in pandas/io/parsers.py, in the way _remove_empty_lines() is used with fixed-width files (and probably with whitespace-delimited files in general). The parsed line comes through as ['', ''], which in a CSV would correspond to the string "," and is therefore rightly kept, but for a fixed-width file that is exactly what an empty line produces. Arguably, if the line really were padded with spaces out to the last field, empty strings might be the right thing to extract. (Edit: never mind, in that case the fields would be runs of spaces, not empty strings.) I'm not sure whether the right fix is to change _remove_empty_lines() or its call sites.
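For illustration, here is a paraphrase of what the filter does (a sketch, not the exact pandas source): a line survives if it has more than one field, or if its single field is a non-string or a non-blank string, so a blank fixed-width line, which tokenizes to one empty string per colspec, always survives.
def _remove_empty_lines_sketch(lines):
    # Paraphrase of the filter in pandas/io/parsers.py: keep a line if it has
    # more than one field, or its only field is a non-string or non-blank string.
    kept = []
    for line in lines:
        if len(line) > 1 or (len(line) == 1 and
                             (not isinstance(line[0], str) or line[0].strip())):
            kept.append(line)
    return kept

# A blank FWF line with two colspecs tokenizes to ['', ''], so len(line) > 1
# holds and the "empty" row is kept.
print(_remove_empty_lines_sketch([['A', 'B'], ['', ''], ['C', 'D']]))
# -> [['A', 'B'], ['', ''], ['C', 'D']]

# One possible adjustment (just an idea, not a tested patch): also drop lines
# whose fields are all empty or whitespace-only strings.
def _remove_empty_lines_candidate(lines):
    return [line for line in lines
            if any(not isinstance(f, str) or f.strip() for f in line)]

print(_remove_empty_lines_candidate([['A', 'B'], ['', ''], ['C', 'D']]))
# -> [['A', 'B'], ['C', 'D']]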
I believe this bug also extends to read_excel and read_csv for files that have 'empty' trailing lines, so it is more general than just read_fwf.
import pandas as pd
from io import StringIO
csv_f = StringIO('''A,B,C,D
FOO,1,2,3
FOO,4,5,6
,,,
FOO,7,8,9
,10,11,12
,,,
,,,
,,,
,,,
,,,
'''
)
df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)
      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN
With read_excel, if the workbook has 'blank' rows whose cells contain formulas that evaluate to a null string or a blank value (so the cells are not truly empty), similar behaviour shows up even though skip_blank_lines=True is the default according to the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling
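Until the parser handles this, the workaround I use (just a post-read clean-up, not an official fix) is to drop rows in which every column is NaN; a trimmed version of the CSV above:
import pandas as pd
from io import StringIO

csv_f = StringIO('A,B,C,D\nFOO,1,2,3\n,,,\nFOO,4,5,6\n,,,\n,,,\n')

# Read as usual, then drop all-NaN rows and renumber the index.
df = (pd.read_csv(csv_f, header=None)
        .dropna(how='all')
        .reset_index(drop=True))
print(df)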
Also possibly related to: #10164
Any updates on this? I'm still seeing this behavior in pandas 1.0.3.
import pandas
from io import StringIO
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0
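A small demonstration of the distinction involved (my own example, not from the report above): a line that is literally empty is skipped by skip_blank_lines, while a line of bare delimiters such as ',,,' is tokenized into empty fields and comes back as an all-NaN row.
import pandas
from io import StringIO

# Literally empty second line: skip_blank_lines (default True) drops it.
blank = StringIO('lucode,desc\n1,corn\n\n3,peas\n')
print(pandas.read_csv(blank))

# Bare delimiters: the line is parsed as a row of empty fields, so it survives as NaNs.
commas = StringIO('lucode,desc\n1,corn\n,\n3,peas\n')
print(pandas.read_csv(commas))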
@dcdenu4 or anyone can submit a PR
pandas is all volunteer and we have 3000+ open issues