While the following is not something that is needed directly within V core, it is quite essential to V adoption.
With the work on ORM well underway and moving towards full web production and consumption, it could be important to consider a Nokogiri equivalent for V. For frameworks such as Rails we see that this is a critical component of the system, and of course it's critical for many other areas.
Nokogiri is obviously a very robust library that provides HTML, XML, SAX, and Reader parsers. Being able to use XPath and CSS3 selectors to search documents is pretty critical nowadays.
Nokogiri is also somewhat "slow" when compared to other options, so there's a big question of return on investment. Would it be worthwhile to implement a V Nokogiri equivalent (I don't know how difficult converting Nokogiri directly to V would be), OR to convert something in C++ that's relatively "small" (around 13k LOC) like pugixml, which is remarkably fast in benchmarks (faster than RapidXML and in some cases faster than AsmXml)?
On the pugixml front, I suppose there are notable drawbacks, such as missing XML namespace support and other gaps in W3C XPath conformance. You'd also have to preprocess HTML into XHTML to verify/accommodate compliance.
Another question or thought: Would it be better to start with a more robust solution like Nokogiri that's more "enjoyable" for development and then also provide a more performant option like pugixml? Or not?
Just some thoughts and questions to consider here. I DO believe that working in this space is very important since ROBUST (and comprehensive) XML/HTML consumption is going to be critical to V pretty darn quickly.
It seems Nokogiri (and the majority of other parsers) is not a "lazy" parser, but instead "parses everything to memory right at the beginning, disregarding how much of the input data will actually be needed".
I'd rather go for a much simpler, faster and optimization-friendly approach - namely "make a lazy view with a write layer" - i.e. parsing would happen on demand (by fragments, rather than having the first access parse everything), and if there are no writes, it stays like that. See https://github.com/zserge/jsmn and https://github.com/google/gumbo-parser for this approach. Note that this approach allows extremely easy caching (in the spirit of the answers in https://stackoverflow.com/questions/815110/is-there-a-decorator-to-simply-cache-function-return-values ) of already retrieved values to avoid re-parsing the same fragment over and over.
If a change to a certain chunk of data (imagine a string field in JSON) in the middle of the input JSON occurs (for document-oriented DBs and data designs this is the most common operation), then there is a direct approach with 3 options:
1. If the new chunk has the same length, overwrite it in place (no realloc() needed).
2. If the new chunk is shorter, overwrite it and shift the tail with memove() (memove() is unbelievably fast).
3. If any inserts (either in the middle of the buffer or at the end) shall occur, then realloc() is needed.
The second approach to changing data would be COW (copy-on-write) inspired - e.g. making a lightweight overlay over the original static data (in other words, a cache; for advanced overlaying/snapshotting of parallel tree structures see e.g. ctrie). The overlay gets merged with the static data only at the serialization phase, before the data is sent away. For small data the overlay is actually about as fast as option (3) of the direct approach, but faster once the data reaches hundreds of megabytes.