Scryer-prolog: A predicate relating a list of characters to its UTF-8 encoding would be useful

Created on 7 May 2020 · 13 Comments · Source: mthom/scryer-prolog

As soon as library(sockets) becomes available, we will want to write web servers with Scryer Prolog.

The HTTP protocol requires that the content-length of replies be determined and sent as part of the header. The content-length is the number of bytes ("octets" in RFC terminology) that encode the body.

Hence, a very useful addition to library(charsio) would be a predicate like chars_nbytes/2 (a better name would also be highly appreciated), relating a list of characters to the number of bytes that are needed to transmit the characters in UTF-8 encoding.

Or maybe a still better way to solve this would be a predicate char_bytes/2, yielding the list of bytes (integers in the range 0..255) that encode a given character in UTF-8! With this building block, we could easily implement chars_nbytes/2 ourselves, and in addition, it may also be useful in other contexts. And also, char_bytes/2 could be bidirectional, and construct a character from a given UTF-8 encoding specified as a list of bytes.

Any opinions on this? If so, please post them here!

All 13 comments

Having a predicate mapping UTF-8 chars to bytes and vice-versa sounds interesting! I saw that SWI-Prolog provides utf8_codes. There is already an implementation in Rust of char/string to bytes, and bytes to string. Do you think it would be better to do the implementation in Rust (with a built-in predicate) or in Prolog?

Regarding the name, I'm still getting used to predicates being bidirectional, and not needing to have two separate predicates char_to_bytes and bytes_to_char :) The name char_bytes mirrors atom_codes and others, I'm less enthusiastic about char_nbytes :D Do we need to provide char_nbytes if we provide char_bytes? Could the name be num_bytes instead?

Yes, I completely agree: char_nbytes/2 is not a good name. Also num_bytes/2 is not a good name, because it gives no indication about what the first argument is. A good predicate name gives such an indication.

Of the options so far, I think char_bytes/2 would be the most useful and versatile building block. If we have that, we can use it to count the number of bytes in Prolog, using only this predicate. For the specific use case I mentioned, it would also be completely OK to support only one direction at least at first.

The only advantage that a predicate like chars_num_bytes/2 would have is speed, because it only counts the elements and need not create any additional lists, and therefore also puts less stress on the (future) garbage collector.
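A minimal sketch of such a counting predicate, using the tentative name chars_num_bytes/2 from this discussion. The helper names are hypothetical and not part of any library. It walks the list once and never builds the byte list. It assumes every character is a valid Unicode scalar value and that char_code/2 yields the correct code point even for large values.

```prolog
:- use_module(library(lists)).

% char_nbytes_(+Char, -N): N is the number of bytes in the UTF-8
% encoding of Char. Hypothetical helper, for illustration only.
char_nbytes_(Char, N) :-
    char_code(Char, C),
    (   C =< 0x7F   -> N = 1     % ASCII range
    ;   C =< 0x7FF  -> N = 2
    ;   C =< 0xFFFF -> N = 3
    ;   N = 4                    % up to 0x10FFFF
    ).

% Accumulate the byte count of one character.
add_char_nbytes(Char, N0, N) :-
    char_nbytes_(Char, N1),
    N is N0 + N1.

% chars_num_bytes(+Chars, -N): total UTF-8 length of a list of characters.
chars_num_bytes(Chars, N) :-
    foldl(add_char_nbytes, Chars, 0, N).
```

This version only works when the list of characters is fully instantiated, which is enough for computing a content-length.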

Regarding utf8_codes//1: A major attraction of Scryer Prolog is very elegant and efficient support for characters. Characters are much preferable to codes, because they are much more readable, both in source code, and also in answers. So, I think it is best to focus on extensive support for reasoning about characters, which are one-char atoms.

Generally, I am in favour of doing as much as possible within Prolog, because this nicely stresses and tests the core engine, and encourages improvements if it turns out to be too slow for specific use cases. The key question is: What are the most essential building blocks that the Rust engine must provide, so that everything else can be built on top, in Prolog?

For this specific case of a tentative char_bytes/2 predicate, I expect that at least a bit of help from the Rust engine could come in very handy, and that would be completely acceptable.

char_bytes/2 means that the second argument is a list of bytes. You probably mean something along atom_length/2. Why not char_bytelength/2?

@matt2xu : Directional predicates with _to_ inside are often the only way to go, provided insufficient instantiation is correctly alerted.

Yes, we want to see the list of bytes, so that we see the actual UTF-8 encoding of the character!

char_utf8bytes/2

@matt2xu : Directional predicates with _to_ inside are often the only way to go, provided insufficient instantiation is correctly alerted.

:+1:

char_utf8bytes/2

I like this name! So this one produces a list of UTF-8 bytes that encode a given character. I'd like to try to describe this in pure Prolog, see how it goes.

Why not char_bytelength/2?

Given the previous name, maybe char_utf8length/2?

Please consider the following example:

?- char_utf8bytes('\x2124\', Bs).
Bs = [226,132,164].

This is the expected result. It is clear that if we have such a building block, the length of the list can be trivially determined using length/2 from library(lists). It is therefore unnecessary to also provide a predicate that only determines the length of the encoding, since the length alone does not tell us what the actual bytes are, so the only advantage would be speed.
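As a sketch of this idea, the byte count of a whole list of characters can be obtained with chars_utf8bytes/2 from library(charsio), which this thread later reports as available, together with length/2. The name chars_content_length/2 is hypothetical, and the argument order of chars_utf8bytes/2 (characters first, bytes second) is an assumption based on its name.

```prolog
:- use_module(library(charsio)).
:- use_module(library(lists)).

% chars_content_length(+Chars, -N): N is the number of bytes needed
% to transmit Chars in UTF-8. Hypothetical name, for illustration only.
chars_content_length(Chars, N) :-
    chars_utf8bytes(Chars, Bytes),   % assumed order: chars, then bytes
    length(Bytes, N).
```

For the character from the example above, the query ?- chars_content_length(['\x2124\'], N). should then yield N = 3.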

Hi, so I got char -> UTF-8 encoding working! I used char_code but I suspect it has a bug if the character has a value that is too big.

When I do cat x.txt after running write_f I get (as expected) the following characters: £ℤ❤💖 (followed by a newline) for a total of 13 bytes.

Could you please take a look at my (Prolog) code? I would be grateful for improvements, suggestions, etc., in particular on whether to use an if/then/else instead of the different branches of chars_to_utf8. See #493

Also: 1) since you can write characters to streams directly in UTF-8, is this predicate still useful? It was fun to write regardless :) Just computing the number of UTF-8 bytes of a char would be trivial. 2) Same question for decoding.

I think it is interesting to have these predicates, if at some point we want to support different encodings (UTF-16?) we would do it in Prolog instead of adding Rust libraries.

Thank you so much for working on this! It is absolutely useful to have these predicates! Prolog characters do not only come from files, but also from user input or via sockets, and we need to reason about their low-level representation, for example to encrypt them with future predicates in library(crypto), which will internally only see bytes.

For the implementation, between/3 from library(between) may be useful to distinguish the cases more clearly. Also, DCGs from library(dcgs) may come in handy to describe a sequence of bytes for a character: DCGs let us focus more on the essence, and need fewer arguments and variables that may cause confusion if they are explicit.
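To illustrate the suggestion, here is a minimal sketch of a DCG that uses between/3 to select the encoding length. It is not the code from #493. The names code_utf8//1 and char_utf8bytes_sketch/2 are hypothetical, the sketch only encodes (the character must be given), and it does not reject surrogate code points.

```prolog
:- use_module(library(dcgs)).
:- use_module(library(between)).

% code_utf8(+C)// describes the UTF-8 bytes of code point C.
code_utf8(C) --> { between(0, 0x7F, C) }, [C].
code_utf8(C) --> { between(0x80, 0x7FF, C),
                   B1 is 0xC0 \/ (C >> 6),
                   B2 is 0x80 \/ (C /\ 0x3F) },
                 [B1,B2].
code_utf8(C) --> { between(0x800, 0xFFFF, C),
                   B1 is 0xE0 \/ (C >> 12),
                   B2 is 0x80 \/ ((C >> 6) /\ 0x3F),
                   B3 is 0x80 \/ (C /\ 0x3F) },
                 [B1,B2,B3].
code_utf8(C) --> { between(0x10000, 0x10FFFF, C),
                   B1 is 0xF0 \/ (C >> 18),
                   B2 is 0x80 \/ ((C >> 12) /\ 0x3F),
                   B3 is 0x80 \/ ((C >> 6) /\ 0x3F),
                   B4 is 0x80 \/ (C /\ 0x3F) },
                 [B1,B2,B3,B4].

% char_utf8bytes_sketch(+Char, -Bytes): encode a single character.
char_utf8bytes_sketch(Char, Bytes) :-
    char_code(Char, C),
    phrase(code_utf8(C), Bytes).
```

With this sketch, ?- char_utf8bytes_sketch('\x2124\', Bs). yields Bs = [226,132,164], matching the example earlier in the thread. Each clause states its range explicitly, so the cases are mutually exclusive without relying on clause order.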

Keeping the implementation entirely in Prolog is a very good design decision!

For this purpose we now have chars_utf8bytes/2, excellently implemented by @matt2xu.

Thank you a lot!

support different encodings (UTF-16?)

This would be a veritable nightmare! Internal representations must be simple. We now have two of them for the very same thing. What is important now is to make them fast and memory efficient. Currently the system uses far too much space: about a factor of 3 for regular cells. Then strings are allocated in a special place, not the heap. The way Rust provides these abstractions makes me confident that this transition will happen with ease. Then we need a gleaner of cells (GC)...

Who uses UTF-16 anyway?

These considerations should not hinder us from providing chars_utf16bytes/2 in case this encoding is useful for someone, just as we now already have chars_utf8bytes/2.
