Dbt: Strip BOM from Excel-generated UTF-8 CSV files

Created on 6 Dec 2018  路  2Comments  路  Source: fishtown-analytics/dbt

Issue

UTF-8 CSV files generated by Excel have a byte order mark (BOM) which is included in the column title when using dbt seed.

Issue description

Save Excel file as UTF-8 CSV on Microsoft Excel for Mac. Run dbt seed on the resulting file.

Results

The first column name includes the unprintable BOM character which makes queries very difficult to write. I would expect the BOM to be stripped from the column name.

I was able to work around this by changing the file encoding in Emacs and re-saving the file.

System information

The output of dbt --version:

installed version: 0.12.1
   latest version: 0.12.1

Up to date!

The operating system you're running on: MacOS Mojave 10.14.1

The python version you're using (probably the output of python --version): 3.7.1

Steps to reproduce

  1. Using Excel, save file as UTF-8 CSV.
  2. Run dbt seed
  3. select <column> from <table>, column cannot be found because of the BOM.

Most helpful comment

I (fortunately) don't have an excel-generated csv lying around to test this on, but had written down some notes about this before.

I think the following enables you to read excel-generated csv files as expected, for what it's worth:

>>> import codecs
>>> import csv
>>> with codecs.open('excel_output.csv', encoding='utf-8-sig') as f:
...     reader = csv.reader(f)
...     rows = [row for row in reader]

All 2 comments

I (fortunately) don't have an excel-generated csv lying around to test this on, but had written down some notes about this before.

I think the following enables you to read excel-generated csv files as expected, for what it's worth:

>>> import codecs
>>> import csv
>>> with codecs.open('excel_output.csv', encoding='utf-8-sig') as f:
...     reader = csv.reader(f)
...     rows = [row for row in reader]

I (fortunately) don't have an excel-generated csv lying around to test this on, but had written down some notes about this before.

I _think_ the following enables you to read excel-generated csv files as expected, for what it's worth:

>>> import codecs
>>> import csv
>>> with codecs.open('excel_output.csv', encoding='utf-8-sig') as f:
...     reader = csv.reader(f)
...     rows = [row for row in reader]

@ryantuck - Thanks a bunch! I was stuck on a similar issue for hours and your comment solved it!

Was this page helpful?
0 / 5 - 0 ratings