I noticed that this library uses regular expressions to parse XML. It seems to get the job done for the overwhelming majority of Excel files, but every so often a new issue arises. For example, an Excel tab name containing a ">" resulted in the following PR:
https://github.com/SheetJS/js-xlsx/pull/769/files
Seems like there are an infinite number of cases of valid XML that are not correctly parsed with regular expressions (https://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex). This library seems very well done overall and so I don't mean to question the decision making here. I was just curious if there are any plans to introduce an XML parser into this library or if that option has been formally ruled out. My apologies if this is a repost, but I couldn't find any other issues on this repo raising this question.
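To make the failure mode concrete, here is a contrived illustration (the tag and both patterns below are invented for this example, not taken from the library): ">" is perfectly legal inside an XML attribute value, so a naive [^>]* pattern truncates the match.

```js
var xml = '<sheet name="A>B" sheetId="1"/>';

// Naive pattern: assumes ">" never appears before the end of the tag.
var naive = xml.match(/<sheet[^>]*>/);
// naive[0] === '<sheet name="A>'  (the match stops inside the attribute value)

// Quote-aware pattern: consumes quoted attribute values as whole units.
var aware = xml.match(/<sheet\b(?:[^>"']|"[^"]*"|'[^']*')*\/?>/);
// aware[0] === '<sheet name="A>B" sheetId="1"/>'  (the full tag)
```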
There are 3 related questions, pertaining to the past, present, and future, and this is a good chance to collect some thoughts (excuse the braindump):
Why were regular expressions used in the first place?
The JS ecosystem was/is fragmented and wildly different from pretty much any other ecosystem. V8 (which powers node, chrome, and other JS environments) might be the most popular engine, but other browsers and engines have nontrivial usage. IE6-8 are still in use in 2018, despite the fact that Windows XP was EOL'd in 2014. In fact, XP usage and therefore IE6-8 usage was significant enough that Microsoft released a patch for Windows XP SP3 to address the vulnerability targeted by the WannaCry 2017 attack!
Back in late 2012 (IE10 was hot off the presses, IE6 was still very relevant, and node was in the 0.8/0.10 version range), there were three choices for parsing XML:
1) the "XML parser" solution: use DOMParser for Chrome, the MSXML or XMLDOM functionality in IE, and a module like sax in node.
2) farm out to a Flash SWF in the browser and a similar solution in node.
3) use a series of regular expressions to slice and dice the strings.
Farming out to a Flash or ActiveX or Silverlight component is not a great solution because you introduce a new set of potential problems. Performance-wise, at the time it was significantly faster in IE6-8 and Chrome to process with regex than to properly parse the XML. Coupled with the fact that the same exact code worked in browser environments as well as nodejs, it was very easy to pick regular expressions.
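For contrast, here is a rough sketch of what option 1 entails (hypothetical glue code, assuming DOMParser in modern browsers, MSXML via ActiveX in old IE, and a streaming module like sax in node); every branch exposes a different API:

```js
// Hypothetical dispatch for option 1 -- one parser per environment.
function parse_xml(str) {
  if (typeof DOMParser !== "undefined") {
    // Modern browsers: returns a Document
    return new DOMParser().parseFromString(str, "application/xml");
  }
  if (typeof ActiveXObject !== "undefined") {
    // IE6-8: MSXML, a similar but not identical Document API
    var doc = new ActiveXObject("Microsoft.XMLDOM");
    doc.async = false;
    doc.loadXML(str);
    return doc;
  }
  // node would branch yet again to require("sax"), whose streaming,
  // event-driven API does not resemble either branch above.
  throw new Error("no XML parser in this environment");
}
```

Option 3, by contrast, is a single regex-driven code path that runs unchanged in every one of those environments.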
Why are regular expressions used now?
Even today, Photoshop and other tools use the ExtendScript engine, which has even stranger quirks than IE6. Other tools like NetSuite (SuiteScript) run ES3+ engines. Turns out that the RegExp solution still works in those nonstandard environments! The demos show other neat integrations. So from a compatibility standpoint, it is the clear winner in March 2018. And as the README says:
Emphasis on parsing and writing robustness, cross-format feature compatibility with a unified JS representation, and ES3/ES5 browser compatibility back to IE6.
The biggest advantage of extreme browser compatibility, which wasn't obvious until we released this library, is the universal nature of JS. Since the same code pathway is being run in node and in the browser and in other platforms, improvements in one area will benefit others. We receive reports from people testing in locked-down environments that prevent installing third-party software (but they can run the in-browser demos, which do all of the work on the client side) and those fixes also benefit nodejs deployments. We still receive feedback from people testing with IE8. This would not necessarily be possible if we had a different XML implementation for node / chrome / IE.
What about the future?
Backwards compatibility is a funny problem. From the developer perspective, everyone wants to use the latest and greatest language proposals like generators and async/await and pipeline operators (personally rooting for the elvis operator ?.); from the user perspective, everyone wants updates to work on their current devices. It's easy to throw away backwards compatibility, but we strive not to break compatibility in environments we can reasonably support. We are continually frustrated when other participants in the ecosystem decide to "move fast and break things", like when npm broke backwards compatibility for the sake of nagware, and it would be hypocritical for us to do the same.
This mentality of preserving backwards compatibility informs other decisions. After facing issues with tooling like uglify not correcting for various browser quirks and seeing that the authors behind the tooling don't necessarily have the same belief in backwards compatibility, it turns out to be easier to just write ES3 and shimmable JS code than to fight the transpilers. It would be very easy to switch to TypeScript or use a transpiler to implement the elvis operator, but it is not worth breaking support just for our predilections.
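As a small illustration of that tradeoff (the workbook object below is hypothetical): the hand-written ES3 guard is barely longer than the proposal syntax, and feature-tested shims close the rest of the language gap.

```js
var workbook = { Props: { Title: "Report" } }; // hypothetical object shape

// Proposal syntax, which would require a transpiler for old engines:
//   var title = workbook?.Props?.Title;
// Plain ES3 guard, which runs as-is from IE6-era engines to modern node:
var title = workbook && workbook.Props && workbook.Props.Title;

// Shimmable code follows the same rule: feature-test, then fill the gap.
if (!Array.prototype.indexOf) Array.prototype.indexOf = function(v) {
  for (var i = 0; i < this.length; ++i) if (this[i] === v) return i;
  return -1;
};
```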
At some point, when it is abundantly clear that IE, Photoshop ExtendScript, and other alternative engines have no relevance in the JS ecosystem, we can re-evaluate the decision.
Postscript
Sometimes seemingly bizarre or antiquated patterns are superior to the "recommended" or "modern" solutions. Our ADLER32 checksum implementation, which used some nonobvious tricks to improve efficiency, ended up becoming part of the React Framework.
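For the curious, here is a minimal sketch of the Adler-32 algorithm itself (not the library's exact code). One such nonobvious trick is deferring the expensive modulo: the running sums fit in 32 bits for blocks of up to 5552 bytes (the bound zlib uses), so the reduction can happen once per block instead of once per byte.

```js
// Minimal Adler-32 sketch with the deferred-modulo trick.
// 5552 is the largest block size for which the unreduced sums
// still fit in 32 bits.
function adler32(bytes) {
  var a = 1, b = 0, i = 0, n;
  while (i < bytes.length) {
    n = Math.min(bytes.length - i, 5552);
    while (n-- > 0) { a += bytes[i++]; b += a; }
    a %= 65521; b %= 65521; // one modulo per block, not per byte
  }
  return ((b << 16) | a) >>> 0; // force an unsigned 32-bit result
}
```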