Gpuweb: [wgsl] Encoding for shader text needs to be specified

Created on 27 Feb 2020 · 6 Comments · Source: gpuweb/gpuweb

To my understanding the spec currently doesn't state the encoding for the shader text, which should be stated explicitly for compat.

From talking to @dj2, the tokens of the language and the regex for variable names are ASCII. The only thing that can currently be non-ASCII is the entry point names, which are UTF-8.

wgsl

All 6 comments

Can comments contain UTF-8 also? (and maybe identifiers but that's more controversial)

UTF-8? Text on the web arrives in whatever encoding the charset parameter of the Content-Type header (or the equivalent meta charset) declares. If the strings come from JavaScript, they use the JavaScript engine's internal string representation, which is likely UCS-2 / UTF-16, but this is an implementation detail.

UTF-8 is the best encoding for a textual language for the web. HTML has to deal with legacy encodings, but a new format doesn't need to.

The functions which accept shader source are IDL functions, which would accept DOMString. Encoding is handled by the underlying loading infrastructure of the webpage. It doesn't need to be defined ~(indeed: shouldn't be defined)~ by the shading language spec.
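To illustrate the point about DOMString, here is a minimal sketch showing that a JavaScript string is counted in UTF-16 code units, while the same text takes a different number of bytes once encoded as UTF-8. The helper name `describeString` is made up for this example and is not part of any WebGPU or WGSL API.

```typescript
// Hypothetical helper (not part of WebGPU): compare the DOMString view of a
// piece of shader text with its UTF-8 byte length.
function describeString(s: string): { codeUnits: number; utf8Bytes: number } {
  return {
    // String.prototype.length counts UTF-16 code units.
    codeUnits: s.length,
    // TextEncoder always produces UTF-8.
    utf8Bytes: new TextEncoder().encode(s).length,
  };
}

// ASCII-only names have the same count in both views.
const ascii = describeString("main");

// U+00E9 is one UTF-16 code unit but two UTF-8 bytes.
const accented = describeString("\u00e9");
```

By the time a string reaches an IDL function, the bytes it was decoded from are no longer visible, so the encoding question only arises wherever the bytes get turned into a string.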

Any text format transmitted over the internet needs to have its charset defined, or have a known mechanism for getting the charset. Otherwise, out-of-band knowledge is required to decode it to a DOMString.

Let's assume there's never a dedicated API to load a WGSL shader, and they are always loaded through something generic like Fetch. Then there needs to be a way to transform the bytes that come off the wire into a DOMString. Fetch can give you various packagings. Some are raw bytes. If you get one of those, you need out-of-band knowledge of the encoding. That's not good, because it stops a webpage from having generic code to load shaders, if those shaders might have come from someone using a different encoding. There's also a text packaging, which assumes UTF-8 encoding. Thus, UTF-8 will be the easiest option for authors.
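A rough sketch of the raw-bytes case described above, assuming the bytes have already been obtained (for example from `Response.arrayBuffer()`). The function name `decodeShaderBytes` and its default are assumptions made for illustration. The text packaging mentioned above corresponds to `Response.text()`, which always decodes as UTF-8 and needs no such helper.

```typescript
// Hypothetical helper: turn raw shader bytes into a string. The caller has to
// supply the encoding, which is exactly the out-of-band knowledge at issue.
function decodeShaderBytes(bytes: Uint8Array, encoding: string = "utf-8"): string {
  // fatal: true makes malformed input throw a TypeError instead of silently
  // substituting U+FFFD replacement characters.
  return new TextDecoder(encoding, { fatal: true }).decode(bytes);
}

// A shader whose comment contains a non-ASCII character, encoded as UTF-8.
const shaderSource = "// caf\u00e9\nfn main() {}";
const utf8Bytes = new TextEncoder().encode(shaderSource);
const decoded = decodeShaderBytes(utf8Bytes);

// The bytes of "café" as a Latin-1 editor would save them: a lone 0xE9 is not
// valid UTF-8, so generic UTF-8 loading code cannot decode this file.
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);
```

If the format is defined to be UTF-8, generic loading code can use the default and treat a decode failure as a malformed file rather than guessing at other encodings.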

For completeness' sake, there are other options. The encoding could be given as a charset parameter on the MIME type, but then the server needs to know the encoding somehow, which is inconvenient. Or the file format could have an in-band charset declaration (like HTML and CSS do), but then there probably needs to be a dedicated loading API, or an API that takes an ArrayBuffer of bytes, or a way to turn an array of bytes into a DOMString. Those are all worse than just defining it to be UTF-8.

When we submit a MIME type definition for WGSL to IANA, we will probably need to explain charset considerations there. That is yet another reason the encoding must be defined. It's not good to transmit text over the Internet without defining the charset encoding.

Of course, none of this stops anyone from locally editing a WGSL file in whatever encoding they, or their OS, or their text editor prefer. But when it's served on the web, the charset must be defined somehow.

(quoting myself)

> It doesn't need to be defined

This is true. However, after discussing this offline, I realized that it's still valuable for the spec to list an encoding, _even if_ it isn't used anywhere inside WebGPU's implementation. If we consider the possibility that WGSL is a huge success, and has loads of tools, simply stating an official encoding in the spec is generally a good idea to reduce problems in long toolchains of compilers/optimizers/compressors etc.

And if we had to pick a particular encoding, yeah, UTF-8 is the clear winner.
