In the 2.6RC2 installed on Linux from https://github.com/OpenRefine/OpenRefine/releases/download/2.6-rc.2 I get blank lines displayed when importing XML from a URL.
eg importing XML from http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?date=2000:2001
Empty lines:
If the parsing of this can't be fixed, an ignore empty rows setting would be useful?
Confirmed it's a bug. The extra empty lines seem to be generated by one wb:data node.
Do you have a similar issue with other XML files?
@jackyq2015 I haven't tried any others recently.. will let you know if I find other examples.
This reminds me that I saw this too on all XML files when I first began using OR. I decided to pre-process my XML because I needed to move ahead on the project, so I never reported the bug. This would be a nice one to fix. If additional files are needed to reproduce the bug I could dig some up.
@joewiz May I know how you pre-processed the XML in order to make it work?
@jackyq2015 Sorry, what I meant was that I converted the XML to TSV outside of OR before importing into it. I used XQuery - using a script like this one: https://gist.github.com/joewiz/48ce061423aa7d3ada28.
With OpenRefine RC2 (Windows), I have a lot of problems parsing XML files as simple as this:
https://www.dropbox.com/s/acj3x3b8wl3bkso/4408000.xml?dl=0
I just get a white screen.
Google Refine 2.5 does not have this issue.
The issue is being caused by whitespace between tags and a code path that the "trim whitespace" flag doesn't affect. With that fixed, turning off "preserve empty strings" (on by default) and turning on "trim whitespace" (off by default), the XML import generates a much more compact table.
We changed some of the import setting defaults in 2.6 to disable transformations which aren't reversible, but XML whitespace is kind of a special case, so I'll take a closer look to see if there's a better way to fix this.
@ettorerizza Your file imports for me, but most of the populated data columns are off to the far right of the screen. In other words, it behaves the same as Tony's example.
I revisited this and have a better solution, but there is an issue that I'm unsure how to deal with. Without a DTD or XML Schema, the parsing mode that XML parsers use is the "mixed" mode, where an element can have text, nested elements, or both. If there's a DTD/Schema and it says that an element is not a mixed element, then any text can be discarded by the parser. Without a schema, there's no way to distinguish pretty-print whitespace from element text.
We can drop all whitespace-only strings that occur in mixed elements, but I've got a nagging concern that this may cause other issues. Am I being overly paranoid or do we need yet another toggle to control this?
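For illustration, here is a minimal standalone StAX sketch (not OpenRefine importer code) of the ambiguity being described: the indentation of pretty-printed XML arrives as ordinary character events, and without a DTD/Schema the only available signal is whether the text happens to be whitespace-only.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WhitespaceAmbiguityDemo {
    public static void main(String[] args) throws Exception {
        // Pretty-printed XML: the newlines/indentation are text nodes too.
        String xml = "<data>\n  <record>\n    <name>France</name>\n  </record>\n</data>";
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                if (reader.isWhiteSpace()) {
                    // Candidate for dropping, but only a DTD/Schema could prove
                    // that this whitespace is insignificant in a mixed element.
                    System.out.println("whitespace-only text node");
                } else {
                    System.out.println("text: " + reader.getText());
                }
            }
        }
    }
}
```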
I just ran into the empty row problem while importing xml. I would very much welcome a toggle if that helps.
Unchecking preserve empty strings and checking Trim leading & trailing whitespace from strings seems to do the job nicely for the example XML URL file. But perhaps a bit of auto-detection (spotting those leading & trailing whitespaces in the imported file and automatically checking the Trim leading & trailing whitespace from strings option) could be an improvement to close out this #1095?
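A possible shape for that auto-detection, purely as a hypothetical sketch (the method name and wiring below are made up, not existing OpenRefine code): scan a sample of the parsed values and pre-tick the trim option if any of them carry stray leading or trailing whitespace.

```java
import java.util.Arrays;
import java.util.List;

public class TrimAutoDetect {

    // Returns true if any sampled value has leading/trailing whitespace,
    // i.e. the preview could suggest enabling "Trim leading & trailing
    // whitespace from strings" by default for this file.
    static boolean shouldSuggestTrim(List<String> sampleValues) {
        for (String v : sampleValues) {
            if (v != null && !v.isEmpty() && !v.equals(v.trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldSuggestTrim(Arrays.asList("France", "  2000  "))); // true
    }
}
```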
I understand Tom's @tfmorris concerns... but that concern is valid for any XML that doesn't enforce STRICTNESS through a DTD/Schema. It's something that anyone working with XML should be familiar with.
Having this new trim toggle solves the core problem, which was that we didn't give the user a control over the behavior. Now the user has the power, and it's not an automatic default in the XmlParser config.
I think what is missing now is to improve the test main/tests/server/src/com/google/refine/importers/XmlImportUtilitiesTests.java and supply degenerate XML as well as the existing well-formed XML in the test. And I should add, Jeni wrote the book on that anyways :-)
And this is a very good primer on the general issue and should be required reading for anyone wanting to work on improving the tests: http://usingxml.com/Basics/XmlSpace
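To make the "degenerate XML" idea concrete, here is a hedged sketch of the kind of pretty-printed fixture such a test could add, plus a check that it really contains the whitespace-only text nodes the importer has to cope with. This is illustrative TestNG code, not part of the existing XmlImportUtilitiesTests; a real test would feed the fixture through the importer instead of plain DOM.

```java
import static org.testng.Assert.assertEquals;

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.testng.annotations.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrettyPrintedXmlFixtureTest {

    // Same shape as a well-formed sample, but pretty-printed: indentation and
    // newlines appear between every pair of tags, as in real-world XML.
    private static final String PRETTY_PRINTED_XML =
        "<library>\n" +
        "    <book>\n" +
        "        <author>Author One</author>\n" +
        "        <genre>Genre One</genre>\n" +
        "    </book>\n" +
        "</library>\n";

    @Test
    public void prettyPrintingCreatesWhitespaceOnlyTextNodes() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PRETTY_PRINTED_XML)));
        NodeList children = doc.getElementsByTagName("book").item(0).getChildNodes();

        int whitespaceOnlyTextNodes = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getTextContent().trim().isEmpty()) {
                whitespaceOnlyTextNodes++;
            }
        }
        // Three whitespace-only text nodes surround <book>'s two real children;
        // these are exactly what must not become blank rows/cells on import.
        assertEquals(whitespaceOnlyTextNodes, 3);
    }
}
```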
It sounds like #2409 implements what I described in the first paragraph of https://github.com/OpenRefine/OpenRefine/issues/1095#issuecomment-162032765. One thing its tests don't include, though, is an example of pretty printed XML which I think is important.
Making the default behavior match http://usingxml.com/Basics/XmlSpace#MicrosoftXMLParserBehavior might be a reasonable approach. I think that's basically what we'd get with preserveEmptyStrings=False, trimWhitespace=True.
Whatever the approach, a more comprehensive set of test cases would make sure that the handling is reasonable for all important cases.
Hi everyone, please tell me: after importing an XML file from a URL, which file type do I need to parse the data as? Because parsing as line-based text shows the data as simple lines, and parsing as spreadsheets only shows "updating preview" and never shows any result.
If you are importing an XML file from a URL, then you should probably select "XML" as file type.
@wetneb But in the first comment of this issue, the screenshot shows the data in the form of a table, and that is before creating the project. Selecting "XML" gives me the XML file, not a table like the one above, so I can't check the bug. :)
That's because you need to select one element in the XML tree before it gets converted into a table (see the yellow area circled by red dots in the screenshot above).
Hi all, I've tried debugging this issue and was able to reproduce it when loading both from a file and from a URL. I'm able to fix it. If no one is working on this issue, I would like to take it up.
Thank you for offering to help on this! Looking forward to your PR.
@maheshjindal please review the thread of conversation above. My impression from reading it is that the problem is mostly fixed by #2409, but that we need a) more/better test cases and b) perhaps better defaults for importer option settings.
@tfmorris I found one issue with #2409 and also with the existing changes in #3285. Since the same method is used when importing JSON data (by both XmlParser and JsonParser), a JSON field with a valid fieldName but an empty value will, by default, be ignored in the preview, which will create an issue. The empty check on the value should only be performed when the parser is an instance of XmlImporter.XmlParser.
Moreover, just for confirmation: as per the above discussion, along with these changes, the new default parser configuration settings (in the case of XML) will be preserveEmptyStrings=False, trimWhitespace=True. Please correct me if I'm going wrong anywhere.
The empty check on the value should only be performed when the parser is an instance of XmlImporter.XmlParser.
No. The default is, and has always been, that we preserve empty string values during import. If a particular importer has trouble creating or preserving empty strings, then we need to change and improve that importer. Users are allowed to preserve and keep empty string values if they want, whether the source is JSON, XML, CSV, etc. Those empty values automatically display as blank cells in the preview, and are thus stored as blank (empty string values), not null, in the cells of our grid after project creation. When a user unchecks "preserve empty strings", they are asking not to store blanks, but instead to leave the cell with no value assigned, which is then just a null cell. Prior to OpenRefine 3.0, null and blank cells were treated as equal. This changed after 3.0: we now give users much more choice in null and blank cell handling, with new importer and facet options for treating nulls and blanks differently, as well as new display options (the All menu over the first column) to show or hide null values in cells.
Moreover, just for confirmation: as per the above discussion, along with these changes, the new default parser configuration settings (in the case of XML) will be preserveEmptyStrings=False, trimWhitespace=True. Please correct me if I'm going wrong anywhere.
No. The default settings are correct. We need to improve the importer so that it can deal with those default settings nicely, as per my previous paragraph.
In general, we provide smart defaults, but always try to expose more options so that users can work with their messy data within valid formats. The problem exposed in this issue is around non-compliant (not valid) C14N XML documents sometimes found in the wild. So the first thing is that we always check for valid formatting (JSON, XML, etc.), and in the case of XML there are the strictness concerns, where we actually want to provide MORE options, not fewer (Jackson allows lots of options we don't currently expose in the UI and would like to), and not change the existing smart defaults for preserving empty strings, trimming whitespace, etc. However, a lot of the existing smart defaults won't work when you toggle some of those other options for the Jackson XML parser and it is asked to be a "validating" parser. Jackson will try to normalize attribute values when "validating" (per C14N rules). In OpenRefine, we let users decide how things are normalized; that's our philosophy: giving users more power and controls to make the best decisions for themselves. We don't try to force users and tell them "you must supply valid XML to import that strictly conforms to all C14N rules"; instead we want to allow users to ignore some of those rules, which is what Jackson's various options allow and what we don't fully expose yet but want to.
More broadly, the OpenRefine philosophy is to let users decide how to import and to give them more options, with smart defaults that can be changed or toggled. What @tfmorris is saying about
Making the default behavior match http://usingxml.com/Basics/XmlSpace#MicrosoftXMLParserBehavior might be a reasonable approach. I think that's basically what we'd get with preserveEmptyStrings=False, trimWhitespace=True.
is basically about those smart defaults... and that our XML importer might need changes to its existing defaults... but my opinion is that the preserve empty strings and trim whitespace defaults are better as-is, because MOST of the XML encountered in the wild will be C14N compliant and not have an issue. For those XML documents in the wild that are not compliant, we want users to have as many options as necessary to parse and import the data, so that they can extract and use it in our grid to analyze, convert, transform, export, etc.
I hope this helps you @maheshjindal and others on this issue to understand what the community generally is asking for, especially concerning XML importer options. (And the community does understand that quite a bit of the current XML importer handling might have to be refactored to deal properly with non-compliant C14N XML documents and to give users more options in our UI.)
(And a bit of a history lesson... prior to 2007-08?, there were several XML libraries that dealt with canonicalization very differently. The global community stepped up to make XML representation consistent across tools and libraries with the Exclusive XML Canonicalization spec. So the XML documents in the wild that were generated with older tools/libraries are what typically cause the issues for OpenRefine and other tools during import... and we want users to have more options and power to deal with those slightly invalid XML documents.)
@thadguidry Got your point. Thank you very much for this elucidation.
@maheshjindal Sure! (and sorry it wasn't only 2 sentences, because no one would likely have gotten the point...it indeed took a few paragraphs at minimum...sorry for the length!)
@thadguidry With respect to this issue, would it be nicer to have a new importer option named "skipWhitespaces" when importing XML data? For enabling/disabling this option from the server side, we can follow either of the approaches below.
- The usual approach: similar to other importer options, we can pass it as an additional parameter to the method "XmlImportUtilities.importTreeData". Inside this method we can check whether the option is enabled and, if so, skip the whitespace.
- Alternatively, we can pass this option when creating a new instance of the XmlParser class. Then, when the "(new XmlParser(...)).next()" method is invoked and "skipWhitespaces" is enabled, we can return "Token.Ignorable" for the whitespace. With this approach, if we need to set any other configuration at the parser level in the future, it will be convenient to do so.
According to the project guidelines, could you please suggest which approach should be followed? Please feel free to add any suggestions/improvements, or let me know if we can do this in a better way.
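To make the second approach concrete, here is a rough sketch of what it could look like. This is purely illustrative: the class and enum below are simplified stand-ins for XmlImporter.XmlParser and the tree importer's Token type, and the constructor/option wiring is hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SkipWhitespaceSketch {

    // Stand-in for the tree importer's token type.
    enum Token { StartEntity, EndEntity, Value, Ignorable }

    static class XmlParserSketch {
        private final XMLStreamReader reader;
        private final boolean skipWhitespaces; // the proposed new importer option

        XmlParserSketch(XMLStreamReader reader, boolean skipWhitespaces) {
            this.reader = reader;
            this.skipWhitespaces = skipWhitespaces;
        }

        Token next() throws Exception {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    return Token.StartEntity;
                case XMLStreamConstants.END_ELEMENT:
                    return Token.EndEntity;
                case XMLStreamConstants.CHARACTERS:
                    // If the option is on, report whitespace-only text as ignorable
                    // so it never becomes a cell value (and never a blank row).
                    if (skipWhitespaces && reader.isWhiteSpace()) {
                        return Token.Ignorable;
                    }
                    return Token.Value;
                default:
                    return Token.Ignorable;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader("<a>\n  <b>text</b>\n</a>"));
        XmlParserSketch parser = new XmlParserSketch(reader, true);
        while (reader.hasNext()) {
            System.out.println(parser.next()); // whitespace shows up as Ignorable
        }
    }
}
```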
@maheshjindal Those decisions will be best made by @wetneb and @tfmorris who will reply to you.
@wetneb and @tfmorris Could you please let us know your thoughts on this?
Sorry for the delay, I'll get back to you by tomorrow.
@maheshjindal I support your idea, it would make sense to have an option in tree-based importers (XML, MARC) to skip whitespace. It's not clear to me whether you would want that for JSON too (perhaps to skip null instead)? It could potentially also apply to the RDF importers (skip blank nodes).
@wetneb, I was referring only to the case of the XML importer. We can check the parser type, and if XMLParser is being used (i.e. the tree parser instance is an XMLParser), we can proceed with skipping whitespace.
I think this is less flexible: it would be more convenient to expose an option that other importers can use as well.
@wetneb, yes, I totally agree with you. It would be much more convenient and useful for an end user.
Even if you only want to change the behaviour of the XML importer, I think we should do it in a way that makes it easier to extend the feature to other importers.
Having to check at runtime if the importer is an XML importer using instanceof is a symptom of an architectural problem, so I would oppose fixing it this way.
@wetneb, yes, exactly. Earlier I was only taking the XmlImporter into consideration, which is why I proposed the two approaches above. Since this feature seems like a good add-on for all other importers too, I will look into this so that it will be easy to extend and integrate for other importers as well.
I'm lost in all this verbiage. In March https://github.com/OpenRefine/OpenRefine/issues/1095#issuecomment-599709098, it sounded like everything was good except we needed better test coverage, and when I try the OP's WorldBank example with "trim white space" = true and "keep empty strings" = false, I get what looks to me like a perfect import.
What are we missing? Why do we need a third option here?
p.s. As an aside, the "C14N" acronym is apparently a reference to Canonical XML.
@tfmorris, with reference to the above discussion, the suggestion was to introduce a new option "skipWhitespaces" to ignore cells whose value, as parsed by the XmlParser, contains only whitespace characters. Yes, its behaviour will be exactly the same as setting 'trim white space = true and keep empty strings = false' in the importer. If it's fine for end users to use these existing import options, then we can omit these changes and only work on improving the test cases.
Extremely sorry for the delay. I was completely occupied with some personal issues.
I have made the required changes and raised a PR for them. Please let me know if any further changes are required. The link to the PR is:
https://github.com/OpenRefine/OpenRefine/pull/3357