V: Considering a Nokogiri equivalent for V

Created on 25 Aug 2019 · 1 comment · Source: vlang/v

While the following is not something that is needed directly within V core, it is quite essential to V adoption.

With the work on ORM well underway and V moving toward full web production and consumption, it could be important to consider a Nokogiri equivalent for V. For frameworks such as Rails we see that this is a critical component of the system, and of course it's critical in many other areas.

Nokogiri is obviously a very robust system, providing HTML, XML, SAX, and Reader parsers. Being able to search documents with XPath and CSS3 selectors is pretty much essential nowadays.

Nokogiri is also somewhat "slow" compared to other options, but I suppose there's a big question of return on investment. Would it be worthwhile to implement a V equivalent of Nokogiri (I don't know how difficult essentially porting Nokogiri directly to V would be)? Or would it be better to port something from C++ that's relatively "small" (around 13k LOC) like pugixml, which is remarkably fast (benchmarks): faster than RapidXML and in some cases faster than AsmXml.

On the pugixml front, there are notable drawbacks, such as missing XML namespace support and other gaps in W3C XPath conformance. You'd also have to preprocess HTML into XHTML to verify/accommodate compliance.

Another question or thought: would it be better to start with a more robust solution like Nokogiri that's more "enjoyable" for development, and then also provide a more performant option like pugixml? Or not?

Just some thoughts and questions to consider here. I DO believe that working in this space is very important since ROBUST (and comprehensive) XML/HTML consumption is going to be critical to V pretty darn quickly.

Module Request

Most helpful comment

It seems Nokogiri (and the majority of other parsers) is not a "lazy" parser, but rather "parses everything into memory right at the beginning, regardless of how much of the input data will actually be needed".

I'd rather go for a much simpler, faster, and optimization-friendly approach, namely a "lazy view with a write layer": parsing would happen on demand (by fragments, i.e. the first trigger does not parse everything), and if there are no writes, it stays that way. See https://github.com/zserge/jsmn and https://github.com/google/gumbo-parser for this approach. Note that this approach allows extremely easy caching (in the spirit of the answers in https://stackoverflow.com/questions/815110/is-there-a-decorator-to-simply-cache-function-return-values ) of already-extracted values, avoiding re-parsing of the same fragment over and over.
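The lazy-view-with-cache idea can be sketched in a few lines of C. This is a hypothetical, self-contained illustration (the type and function names are invented, not from jsmn); jsmn-style tokens similarly record only start/end offsets into the input, so the expensive materialization of a value happens at most once, on first access:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *input;   /* the unparsed source buffer */
    size_t start, end;   /* byte range of this fragment, jsmn-style */
    int cached;          /* has the value been materialized yet? */
    long value;          /* cache of the parsed value */
} lazy_int;

/* Parse the fragment on first access only; later calls hit the cache. */
static long lazy_int_get(lazy_int *t)
{
    if (!t->cached) {
        char tmp[32] = {0};
        size_t n = t->end - t->start;
        if (n >= sizeof tmp) n = sizeof tmp - 1;
        memcpy(tmp, t->input + t->start, n);
        t->value = strtol(tmp, NULL, 10);   /* the expensive step, done once */
        t->cached = 1;
    }
    return t->value;
}
```

Until `lazy_int_get` is called, the fragment is just a pair of offsets; documents whose values are never read cost nothing beyond tokenization.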

If a certain chunk of data (imagine a string field in JSON) in the middle of the input must be changed (for document-oriented DBs and data designs this is the most common operation), the direct approach has three options:

  1. the number of new bytes equals the chunk size, and they are simply replaced in place (no realloc() needed)
  2. the number of new bytes is less than the chunk size

    1. either the format allows padding (or adding a meaningless comment), in which case this reduces to case (1); note that such padding/comments should be removed when serializing the data before sending it away

    2. or there is no padding support, and a realloc() with a subsequent move of the whole rest of the data is needed (a block layer, e.g. a doubly linked list, could alleviate this, but that only starts to make sense for gigabytes of data in memory, as memmove() is unbelievably fast)

  3. the number of new bytes is greater than the chunk size: see (2.ii) above

(If any inserts occur, whether in the middle of the buffer or at the end, a realloc() is needed.)
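The three cases above can be sketched as one C function. This is a minimal illustration with invented names, not an existing API; case (2.i), padding, is format-specific and omitted, so smaller chunks fall through to the memmove path:

```c
#include <stdlib.h>
#include <string.h>

/* Replaces old_len bytes at offset off inside *buf (total size *len)
 * with the new_len bytes at src. Returns 0 on success, -1 on OOM. */
static int replace_chunk(char **buf, size_t *len,
                         size_t off, size_t old_len,
                         const char *src, size_t new_len)
{
    if (new_len == old_len) {                 /* case 1: in place, no realloc */
        memcpy(*buf + off, src, new_len);
        return 0;
    }
    if (new_len > old_len) {                  /* case 3: grow first... */
        char *p = realloc(*buf, *len + (new_len - old_len));
        if (!p) return -1;
        *buf = p;
    }
    /* cases 2.ii and 3: shift the tail, then copy the new bytes in */
    memmove(*buf + off + new_len, *buf + off + old_len, *len - off - old_len);
    memcpy(*buf + off, src, new_len);
    *len += new_len - old_len;
    if (new_len < old_len) {                  /* case 2.ii: shrink afterwards */
        char *p = realloc(*buf, *len);
        if (p) *buf = p;                      /* a shrinking realloc may not move */
    }
    return 0;
}
```

Note that only cases 2.ii and 3 ever touch the tail of the buffer; the common same-size case is a single memcpy.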

The second approach to changing data would be COW (copy-on-write) inspired, e.g. making a lightweight overlay over the original static data (in other words, a cache). For advanced overlaying/snapshotting see e.g. ctrie for parallel tree structures. The overlay would be merged with the static data only at the serialization phase, before sending the data away. For small data the overlay is actually about as fast as option (3) of the direct approach, but faster for data starting at hundreds of megabytes.
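A rough C sketch of the overlay idea (all names invented for illustration): the base buffer is never touched, edits are recorded as patches, and base plus overlay are merged only when serializing. For brevity, patches are same-length and non-overlapping here, so apply order is irrelevant:

```c
#include <stdlib.h>
#include <string.h>

typedef struct patch {
    size_t off, len;      /* byte range of the base this patch replaces */
    char *data;           /* replacement bytes (same length, for brevity) */
    struct patch *next;
} patch;

typedef struct {
    const char *base;     /* original static data, never modified */
    size_t base_len;
    patch *overlay;       /* newest patch first */
} cow_doc;

/* Record an edit in the overlay; the base is left alone (copy-on-write). */
static int cow_write(cow_doc *d, size_t off, const char *src, size_t len)
{
    patch *p = malloc(sizeof *p);
    if (!p) return -1;
    p->data = malloc(len);
    if (!p->data) { free(p); return -1; }
    memcpy(p->data, src, len);
    p->off = off; p->len = len;
    p->next = d->overlay;
    d->overlay = p;
    return 0;
}

/* Merge base + overlay into a fresh buffer: the serialization step. */
static char *cow_serialize(const cow_doc *d)
{
    char *out = malloc(d->base_len + 1);
    if (!out) return NULL;
    memcpy(out, d->base, d->base_len);
    for (const patch *p = d->overlay; p; p = p->next)
        memcpy(out + p->off, p->data, p->len);  /* assumes non-overlapping */
    out[d->base_len] = '\0';
    return out;
}
```

Reads would consult the overlay first and fall back to the base, which is what makes the original data shareable and snapshot-friendly.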


