I have been searching for months and have asked on Discord. What I am looking to do is pretty simple.
I have compiled a simple Wasm binary that exports a function, stream. Ideally it takes a JavaScript Uint8Array; otherwise the bytes are looped over and passed individually to my Zig function. How can I convert the Utf8 bytes to chars? Is there a Zig std lib function to build a string from UTF-8 u8 values?
I am writing a parser and all I want for Christmas is an answer to this.
Edit: for clarification:
In JavaScript, if I use TextEncoder to encode a string into UTF-8, how do I get the result into Zig as a decoded string?
https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder
How can I convert the Utf8 bytes to chars?
utf8 bytes already are chars? Your question is not clear.
@daurnimator I have updated the question for clarification. Thank you for such a fast reply.
Why do you need to decode the bytes?
If that's really what you need, have a look at std.unicode.
Hello @andrewrk (big fan of Zig! 100% the easiest language to get working with WASM). I cannot pass strings into WASM modules at this time, as only numeric types (integers and floats) can be passed into exported functions.
https://github.com/WebAssembly/interface-types/blob/master/proposals/interface-types/Explainer.md
When this proposal lands, it may be possible to pass higher-level types into a module, similar to how bindgen for Rust and embind for Emscripten C++ work, except that this layer will wrap a WASM module instead of being packed as JS glue code.
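Until then, the usual workaround is to copy the encoded bytes into linear memory and pass a pointer and a length, both plain integers. Here is a minimal sketch of the JavaScript side. It assumes a standalone `WebAssembly.Memory` standing in for the module's exported memory, and `writeUtf8` and `readBack` are hypothetical helpers (`readBack` only simulates what an exported function receiving `(ptr, len)` would see):

```javascript
// Encode a string to UTF-8 and place it in WebAssembly linear memory.
// Assumption: this memory is standalone; a real module would export its own.
const memory = new WebAssembly.Memory({ initial: 1 }); // one 64 KiB page

// Hypothetical helper: copy `text` into memory at `ptr`, return the byte length.
function writeUtf8(text, ptr) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  new Uint8Array(memory.buffer, ptr, bytes.length).set(bytes);
  return bytes.length;
}

// Hypothetical stand-in for an exported function that takes (ptr, len).
function readBack(ptr, len) {
  const view = new Uint8Array(memory.buffer, ptr, len);
  return new TextDecoder().decode(view);
}

const ptr = 0; // assumption: offset 0 is free to use in this sketch
const len = writeUtf8("héllo", ptr);
console.log(len, readBack(ptr, len));
```

In a real setup the Zig export would take the pointer and the length and treat them as a byte slice. Where the bytes may safely be written (a fixed buffer or an exported allocator) depends on the module.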
As for why do I need to decode the bytes at all, interesting question.
Simply, I am not confident enough to write a Lexer Parser that deals with just bytes although I sort of see how you could map bytes to tokens. I would have thought it would be simpler for a novice like me to just deal with chars.
Anyway, thank you for the hint. (I found that the docs are still lacking from a beginner's standpoint.)
Simply, I am not confident enough to write a Lexer Parser that deals with just bytes although I sort of see how you could map bytes to tokens. I would have thought it would be simpler for a novice like me to just deal with chars.
chars are bytes... do you mean unicode codepoints?
And generally no: working at the codepoint level is not easier. You often want to "read until delimiter". See also std.mem.tokenize.
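To make the distinction concrete: a codepoint can occupy one to four bytes in UTF-8, while an ASCII delimiter is always a single byte that never appears inside a multi-byte sequence. A small sketch in JavaScript of "read until delimiter" done directly on the bytes (the sample input and the `readUntil` helper are made up for illustration):

```javascript
const bytes = new TextEncoder().encode("naïve=café;");

// Hypothetical helper: return the bytes from `start` up to (not including)
// the first occurrence of `delimiter`, or to the end if it is absent.
function readUntil(buf, start, delimiter) {
  const end = buf.indexOf(delimiter, start);
  return buf.subarray(start, end === -1 ? buf.length : end);
}

const key = readUntil(bytes, 0, "=".charCodeAt(0));
const value = readUntil(bytes, key.length + 1, ";".charCodeAt(0));

// Decoding is only needed here to display the result.
const decoder = new TextDecoder();
const keyText = decoder.decode(key);
const valueText = decoder.decode(value);
const codePoints = [...keyText].length; // string iteration yields codepoints

console.log(key.length, codePoints, keyText, valueText);
```

The key spans more bytes than codepoints because "ï" takes two bytes, yet the search for "=" never needs to know that.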
Hi @adam-cyclones, welcome to the community.
Simply, I am not confident enough to write a Lexer Parser that deals with just bytes although I sort of see how you could map bytes to tokens. I would have thought it would be simpler for a novice like me to just deal with chars.
It may sound counter-intuitive, but my advice to a novice would be to have your zig code accept UTF-8 encoded data, and never decode it. UTF-8 is a brilliantly designed data format; you can do most operations you need to on it without decoding. For example you can look for the byte ' ' (space) to separate tokens, and that will work, and be correct with respect to Unicode.
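As a rough illustration of that point, here is a sketch in JavaScript that splits UTF-8 input on the space byte (0x20) without decoding it first. `splitOnSpace` and the sample input are hypothetical; the same loop carries over to a Zig byte slice:

```javascript
// Split UTF-8 bytes on the space byte (0x20) without decoding first.
// Works because bytes below 0x80 never occur inside a multi-byte sequence.
function splitOnSpace(buf) {
  const tokens = [];
  let start = 0;
  for (let i = 0; i <= buf.length; i++) {
    if (i === buf.length || buf[i] === 0x20) {
      if (i > start) tokens.push(buf.subarray(start, i)); // skip empty runs
      start = i + 1;
    }
  }
  return tokens;
}

const input = new TextEncoder().encode("let π = 日本語  ;");
const tokens = splitOnSpace(input);

// Decode only to display the tokens; the splitting itself stayed on bytes.
const tokenTexts = tokens.map((t) => new TextDecoder().decode(t));
console.log(tokenTexts);
```

Each token is still valid UTF-8, so it can be compared byte-for-byte against keywords or passed along unchanged.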
I'm going to close this issue since there's nothing to do here to solve it, but please feel free to continue the discussion in one of the community gathering places.