In the 2.6RC2 installed on Linux from https://github.com/OpenRefine/OpenRefine/releases/download/2.6-rc.2 I get blank lines displayed when importing XML from a URL.
eg importing XML from http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?date=2000:2001
Empty lines:
If the parsing of this can't be fixed, an ignore empty rows setting would be useful?
Confirmed it's a bug. The extra empty lines seem to be generated by one wb:data node.
Do you have a similar issue with other XML files?
@jackyq2015 I haven't tried any others recently.. will let you know if I find other examples.
This reminds me that I saw this too on all XML files when I first began using OR. I decided to pre-process my XML because I needed to move ahead on the project, so I never reported the bug. This would be a nice one to fix. If additional files are needed to reproduce the bug I could dig some up.
@joewiz May I know how you pre-processed the XML in order to make it work?
@jackyq2015 Sorry, what I meant was that I converted the XML to TSV outside of OR before importing into it. I used XQuery - using a script like this one: https://gist.github.com/joewiz/48ce061423aa7d3ada28.
With OpenRefine RC2 (Windows), I have a lot of problems parsing XML files as simple as this:
https://www.dropbox.com/s/acj3x3b8wl3bkso/4408000.xml?dl=0
I just get a white screen.
Google Refine 2.5 does not have this issue.
The issue is being caused by whitespace between tags and a code path that the "trim whitespace" flag doesn't affect. With that fixed, turning off "preserve empty strings" (on by default) and turning on "trim whitespace" (off by default), the XML import generates a much more compact table.
We changed some of the import setting defaults in 2.6 to disable transformations which aren't reversible, but XML whitespace is kind of a special case, so I'll take a closer look to see if there's a better way to fix this.
@ettorerizza Your file imports for me, but most of the populated data columns are off to the far right of the screen. In other words, it behaves the same as Tony's example.
I revisited this and have a better solution, but there is an issue that I'm unsure how to deal with. Without a DTD or XML Schema, the parsing mode that XML parsers use is the "mixed" mode, where an element can have text, nested elements, or both. If there's a DTD/Schema and it says that an element is not a mixed element, then any text can be discarded by the parser. Without a schema, there's no way to distinguish pretty-print whitespace from element text.
We can drop all whitespace-only strings that occur in mixed elements, but I've got a nagging concern that this may cause other issues. Am I being overly paranoid or do we need yet another toggle to control this?
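For illustration, here is a minimal standalone StAX sketch (not OpenRefine importer code) of the ambiguity being described: the indentation of pretty-printed XML arrives as ordinary character events, and without a DTD/Schema the only available signal is whether the text happens to be whitespace-only.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WhitespaceAmbiguityDemo {
    public static void main(String[] args) throws Exception {
        // Pretty-printed XML: the newlines/indentation are text nodes too.
        String xml = "<data>\n  <record>\n    <name>France</name>\n  </record>\n</data>";
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                if (reader.isWhiteSpace()) {
                    // Candidate for dropping, but only a DTD/Schema could prove
                    // that this whitespace is insignificant in a mixed element.
                    System.out.println("whitespace-only text node");
                } else {
                    System.out.println("text: " + reader.getText());
                }
            }
        }
    }
}
```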
I just ran into the empty row problem while importing xml. I would very much welcome a toggle if that helps.
Unchecking preserve empty strings and checking Trim leading & trailing whitespace from strings seems to do the job nicely for the example XML URL file. But perhaps a bit of auto-detection (spotting those leading & trailing whitespaces in the imported file and automatically checking the Trim leading & trailing whitespace from strings option) could be an improvement to close out this #1095?
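A possible shape for that auto-detection, purely as a hypothetical sketch (the method name and wiring below are made up, not existing OpenRefine code): scan a sample of the parsed values and pre-tick the trim option if any of them carry stray leading or trailing whitespace.

```java
import java.util.Arrays;
import java.util.List;

public class TrimAutoDetect {

    // Returns true if any sampled value has leading/trailing whitespace,
    // i.e. the preview could suggest enabling "Trim leading & trailing
    // whitespace from strings" by default for this file.
    static boolean shouldSuggestTrim(List<String> sampleValues) {
        for (String v : sampleValues) {
            if (v != null && !v.isEmpty() && !v.equals(v.trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldSuggestTrim(Arrays.asList("France", "  2000  "))); // true
    }
}
```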
I understand Tom's @tfmorris concerns... but that concern is valid for any XML that doesn't enforce STRICTNESS through a DTD/Schema. It's something that anyone working with XML should be familiar with.
Having this new trim toggle solves the core problem, which was that we didn't give the user a control over the behavior. Now the user has the power, and it's not an automatic default in the XmlParser config.
I think what is missing now is to improve the test main/tests/server/src/com/google/refine/importers/XmlImportUtilitiesTests.java and supply degenerate XML as well as the existing well-formed XML in the test. And I should add, Jeni wrote the book on that anyways :-)
And this is a very good primer on the general issue and should be required reading for anyone wanting to work on improving the tests: http://usingxml.com/Basics/XmlSpace
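To make the "degenerate XML" idea concrete, here is a hedged sketch of the kind of pretty-printed fixture such a test could add, plus a check that it really contains the whitespace-only text nodes the importer has to cope with. This is illustrative TestNG code, not part of the existing XmlImportUtilitiesTests; a real test would feed the fixture through the importer instead of plain DOM.

```java
import static org.testng.Assert.assertEquals;

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.testng.annotations.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrettyPrintedXmlFixtureTest {

    // Same shape as a well-formed sample, but pretty-printed: indentation and
    // newlines appear between every pair of tags, as in real-world XML.
    private static final String PRETTY_PRINTED_XML =
        "<library>\n" +
        "    <book>\n" +
        "        <author>Author One</author>\n" +
        "        <genre>Genre One</genre>\n" +
        "    </book>\n" +
        "</library>\n";

    @Test
    public void prettyPrintingCreatesWhitespaceOnlyTextNodes() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PRETTY_PRINTED_XML)));
        NodeList children = doc.getElementsByTagName("book").item(0).getChildNodes();

        int whitespaceOnlyTextNodes = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().isEmpty()) {
                whitespaceOnlyTextNodes++;
            }
        }
        // Three whitespace-only text nodes surround <book>'s two real children;
        // these are exactly what must not become blank rows/cells on import.
        assertEquals(whitespaceOnlyTextNodes, 3);
    }
}
```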
It sounds like #2409 implements what I described in the first paragraph of https://github.com/OpenRefine/OpenRefine/issues/1095#issuecomment-162032765. One thing its tests don't include, though, is an example of pretty printed XML which I think is important.
Making the default behavior match http://usingxml.com/Basics/XmlSpace#MicrosoftXMLParserBehavior might be a reasonable approach. I think that's basically what we'd get with preserveEmptyStrings=False, trimWhitespace=True.
Whatever the approach, a more comprehensive set of test cases would make sure that the handling is reasonable for all important cases.
Hi everyone, please tell me: after importing an XML file from a URL, which file type do I need to parse the data as? Because parsing as line-based text shows the data as simple lines, and parsing as spreadsheets only shows "updating preview" and never shows any result.
If you are importing an XML file from a URL, then you should probably select "XML" as file type.
@wetneb But in the first comment of this issue, the screenshot shows the data in the form of a table, and that is before creating the project. Selecting "XML" gives me the XML file, not a table like the one above, so I can't check the bug. :)
That's because you need to select one element in the XML tree before it gets converted into a table (see the yellow area circled by red dots in the screenshot above).
Hi all, I've tried debugging this issue and was able to reproduce it when loading both from a file and from a URL. I'm able to fix it. If no one is working on this issue, I would like to take it up.
Thank you for offering to help on this! Looking forward to your PR.
@maheshjindal please review the thread of conversation above. My impression from reading it is that the problem is mostly fixed by #2409, but that we need a) more/better test cases and b) perhaps better defaults for importer option settings.
@tfmorris I found one issue with #2409 and also with the existing changes in #3285. Since the same method is used when importing JSON data (by both XmlParser and JsonParser), a JSON field with a valid fieldName but an empty value will, by default, be ignored in the preview, which will create an issue. The empty check on the value should only be performed when the parser is an instance of XmlImporter.XmlParser.
Moreover, just for confirmation: as per the above discussion, along with these changes, the new default parser configuration settings (in the case of XML) will be preserveEmptyStrings=False, trimWhitespace=True. Please correct me if I'm going wrong anywhere.
The empty check on the value should only be performed when the parser is an instance of XmlImporter.XmlParser.
No. The default is, and has always been, that we preserve empty string values during import. If a particular importer has trouble creating or preserving empty strings, then we need to change and improve that importer. Users are allowed to preserve and keep empty string values if they want, whether the source is JSON, XML, CSV, etc. Those empty values automatically display as blank cells in the preview, and are thus stored as blank (empty string values), not null, in the cells of our grid after project creation. When a user unchecks "preserve empty strings", they are asking not to store blanks, but instead to leave the cell with no value assigned, which is then just a null cell. Prior to OpenRefine 3.0, null and blank cells were treated as equal. This changed after 3.0: we now give users much more choice in null and blank cell handling, with new importer and facet options for treating nulls and blanks differently, as well as new display options (the All menu over the first column) to show or hide null values in cells.
Moreover, just for confirmation: as per the above discussion, along with these changes, the new default parser configuration settings (in the case of XML) will be preserveEmptyStrings=False, trimWhitespace=True. Please correct me if I'm going wrong anywhere.
No. The default settings are correct. We need to improve the importer so that it can deal with those default settings nicely, as per my previous paragraph.
In general, we provide smart defaults, but always try to expose more options so that users can work with their messy data within valid formats. The problem exposed in this issue is around non-compliant (not valid) C14N XML documents sometimes found in the wild. So the first thing is that we always check for valid formatting (JSON, XML, etc.), and in the case of XML there are the strictness concerns, where we actually want to provide MORE options, not fewer (Jackson allows lots of options we don't currently expose in the UI and would like to), and not change the existing smart defaults for preserving empty strings, trimming whitespace, etc. However, a lot of the existing smart defaults won't work when you toggle some of those other options for the Jackson XML parser and it is asked to be a "validating" parser. Jackson will try to normalize attribute values when "validating" (per C14N rules). In OpenRefine, we let users decide how things are normalized; that's our philosophy: giving users more power and controls to make the best decisions for themselves. We don't try to force users and tell them "you must supply valid XML to import that strictly conforms to all C14N rules"; instead we want to allow users to ignore some of those rules, which is what Jackson's various options allow and what we don't fully expose yet but want to.
More broadly, the OpenRefine philosophy is to let users decide how to import and to give them more options, with smart defaults that can be changed or toggled. What @tfmorris is saying about
Making the default behavior match http://usingxml.com/Basics/XmlSpace#MicrosoftXMLParserBehavior might be a reasonable approach. I think that's basically what we'd get with preserveEmptyStrings=False, trimWhitespace=True.
is basically about those smart defaults... and that our XML importer might need changes to its existing defaults... but my opinion is that the preserve empty strings and trim whitespace defaults are better as-is, because MOST of the XML encountered in the wild will be C14N compliant and not have an issue. For those XML documents in the wild that are not compliant, we want users to have as many options as necessary to parse and import the data, so that they can extract and use it in our grid to analyze, convert, transform, export, etc.
I hope this helps you @maheshjindal and others on this issue to understand what the community generally is asking for, especially concerning XML importer options. (And the community does understand that quite a bit of the current XML importer handling might have to be refactored to deal properly with non-compliant C14N XML documents and to give users more options in our UI.)
(And a bit of a history lesson... prior to 2007-08?, there were several XML libraries that dealt with canonicalization very differently. The global community stepped up to make XML representation consistent across tools and libraries with the Exclusive XML Canonicalization spec. So the XML documents in the wild that were generated with older tools/libraries are what typically cause the issues for OpenRefine and other tools during import... and we want users to have more options and power to deal with those slightly invalid XML documents.)
@thadguidry Got your point. Thank you very much for this elucidation.
@maheshjindal Sure! (and sorry it wasn't only 2 sentences, because no one would likely have gotten the point...it indeed took a few paragraphs at minimum...sorry for the length!)
@thadguidry With respect to this issue, would it be nicer to have a new importer option named "skipWhitespaces" when importing XML data? For enabling/disabling this option from the server side, we can follow either of the approaches below.
- The usual approach: similar to other importer options, we can pass it as an additional parameter to the method "XmlImportUtilities.importTreeData". Inside this method we can check whether the option is enabled and, if so, skip the whitespace.
- Alternatively, we can pass this option when creating a new instance of the XmlParser class. Then, when the "(new XmlParser(...)).next()" method is invoked and "skipWhitespaces" is enabled, we can return "Token.Ignorable" for the whitespace. With this approach, if we need to set any other configuration at the parser level in the future, it will be convenient to do so.
According to the project guidelines, could you please suggest which approach should be followed? Please feel free to add any suggestions/improvements, or let me know if we can do this in a better way.
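To make the second approach concrete, here is a rough sketch of what it could look like. This is purely illustrative: the class and enum below are simplified stand-ins for XmlImporter.XmlParser and the tree importer's Token type, and the constructor/option wiring is hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SkipWhitespaceSketch {

    // Stand-in for the tree importer's token type.
    enum Token { StartEntity, EndEntity, Value, Ignorable }

    static class XmlParserSketch {
        private final XMLStreamReader reader;
        private final boolean skipWhitespaces; // the proposed new importer option

        XmlParserSketch(XMLStreamReader reader, boolean skipWhitespaces) {
            this.reader = reader;
            this.skipWhitespaces = skipWhitespaces;
        }

        Token next() throws Exception {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    return Token.StartEntity;
                case XMLStreamConstants.END_ELEMENT:
                    return Token.EndEntity;
                case XMLStreamConstants.CHARACTERS:
                    // If the option is on, report whitespace-only text as ignorable
                    // so it never becomes a cell value (and never a blank row).
                    if (skipWhitespaces && reader.isWhiteSpace()) {
                        return Token.Ignorable;
                    }
                    return Token.Value;
                default:
                    return Token.Ignorable;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader("<a>\n  <b>text</b>\n</a>"));
        XmlParserSketch parser = new XmlParserSketch(reader, true);
        while (reader.hasNext()) {
            System.out.println(parser.next()); // whitespace shows up as Ignorable
        }
    }
}
```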
@maheshjindal Those decisions will be best made by @wetneb and @tfmorris who will reply to you.
@wetneb and @tfmorris Could you please let us know your thoughts on this?
Sorry for the delay, I'll get back to you by tomorrow.
@maheshjindal I support your idea, it would make sense to have an option in tree-based importers (XML, MARC) to skip whitespace. It's not clear to me whether you would want that for JSON too (perhaps to skip null instead)? It could potentially also apply to the RDF importers (skip blank nodes).
@wetneb, I was referring only to the case of the XML importer. We can check the parser type, and if XMLParser is being used (i.e. the tree parser instance is an XMLParser), we can proceed with skipping whitespace.
I think this is less flexible: it would be more convenient to expose an option that other importers can use as well.
@wetneb, yes, I totally agree with you. It would be much more convenient and useful for an end user.
Even if you only want to change the behaviour of the XML importer, I think we should do it in a way that makes it easier to extend the feature to other importers.
Having to check at runtime if the importer is an XML importer using instanceof is a symptom of an architectural problem, so I would oppose fixing it this way.
@wetneb, yes, exactly. Earlier I was only taking the XmlImporter into consideration, which is why I proposed the two approaches above. Since this feature seems like a good add-on for all other importers too, I will look into this so that it will be easy to extend and integrate for other importers as well.
I'm lost in all this verbiage. In March https://github.com/OpenRefine/OpenRefine/issues/1095#issuecomment-599709098, it sounded like everything was good except we needed better test coverage, and when I try the OP's WorldBank example with "trim white space" = true and "keep empty strings" = false, I get what looks to me like a perfect import.
What are we missing? Why do we need a third option here?
p.s. As an aside, the "C14N" acronym is apparently a reference to Canonical XML.
@tfmorris, with reference to the above discussion, the suggestion was to introduce a new option "skipWhitespaces" to ignore cells whose value, as parsed by the XmlParser, contains only whitespace characters. Yes, its behaviour will be exactly the same as setting 'trim white space = true and keep empty strings = false' in the importer. If it's fine for end users to use these existing import options, then we can omit these changes and only work on improving the test cases.
Extremely sorry for the delay. I was completely occupied with some personal issues.
I have made the required changes and raised a PR for them. Please let me know if any further changes are required. The link to the PR is:
https://github.com/OpenRefine/OpenRefine/pull/3357