Crystal: [RFC] Indexable String wrappers for char, bytes and lines

Created on 7 Apr 2019  路  20Comments  路  Source: crystal-lang/crystal

Background

After some thought about this issue to move Array index methods to Indexable. What about String?

There are lots of methods that are mirrored from Indexable like each_char, each_char_with_index, #[], #[]?... Same for method related to bytes each_bytes and lines each_line.

Another problem is having "str".each_char and "str".chars.each. Several Indexable methods like map can only be used by creating an intermediate array.

Including Indexable(Char) into String isn't a proper solution , because a string can be, by nature, represented as bytes, chars, codepoints.

Proposal

On the high level API, the changes will be:

  • "str".each_char => "str".chars.each
  • "str".chars => "str".chars.to_a
  • "str".each_char_with_index => "str".chars.each_with_index
  • "str".count => "str".chars.count
  • "str".blank? => "str".chars.all? &.whitespace?
  • "str".codepoints => "str".chars.map &.ord
  • "str".each_codepoint=> "str".chars.each { |c| c.ord }
  • "str".lines => "str".lines.to_a
  • "str".each_line => "str".lines.each
  • "str".bytes => "str".bytes.to_a
  • "str".each_byte => "str".bytes.each

That's a preview, with possibly even more method changes.
The codepoints changes are less user-friendly, but being quite low-level and rarely used, on my opinion it's fine.

Implementation

Better to concentrate first on the proposal's API, but if you wonder what type String#chars can be and how this can be implemented by using structs: String::Chars, String::Bytes and String::Lines. CharIterator and LineIterator can be moved to String::Chars::Iterator and String::Lines::Iterator.

This proposal could superset https://github.com/crystal-lang/crystal/pull/7275

Most helpful comment

I think we should first try to answer the question of what's hard or inconvenient to do with the existing String API. From my side the answer is: nothing. I was just thinking of other ways to access the string contents, but it might not be worth it.

All 20 comments

The idea doesn't sound too bad, but what actual problem does it solve?

@straight-shoota remove duplicates with Indexable, improve the API (no more each_char vs chars.each), avoid creating intermediate arrays to use Enumerable methods like map, reject or select.

Just note that in my comment I proposed having views as another way to do things a bit more efficiently. I wouldn't remove any of the existing methods. They are convenient.

And note that you can do string.each_char.map...to_a

@asterite ok, this uses an intermediate Iterator instead of an intermediate Array :smile:
With your proposal and not removing any method we will got String#char_view.map, String#char_view.each.map, String#chars.map and String#each_char.map. It will brings even more confusion for beginners when deciding which one to choose.

So the benefit over the already existing iterators would be that it implements Indexable? Maybe we can just include Indexable in CharIterator etc. and combine the APIs?

@straight-shoota that's what I said:

CharIterator and LineIterator can be moved to String::Chars::Iterator and String::Lines::Iterator.

This will yield to "str".chars.each : String::Chars::Iterator

When including Indexable, a each returning an Iterator must be implemented for Iterable.
~Therefore including Indexable into an Iterator isn't a good idea.~
What's the point of Iterable over using each directly?

One goal I forgot is to simplify this documentation section.

After some research, Rust has this iterators: Struct std::str::Chars Struct std::str::Bytes and Struct std::str::Lines accessed by str methods chars, bytes and lines.

I think it's OK to include iterator into String::Chars, but not indexable into the existing CharsIterator.
IMHO the String class is a bit too full featured with various helpers related to chars, bytes, lines and codepoints. This raises questions (that I had too) because it resembles of an Indexable of Chars without being one.
The main goal is to better organize the String API.

Looking at the proposed API changes, I think I disagree with all of them. I fail to see any improvement or benefit to the developer.

I think we should first try to answer the question of what's hard or inconvenient to do with the existing String API. From my side the answer is: nothing. I was just thinking of other ways to access the string contents, but it might not be worth it.

The main problem I see at the end is there are multiple non optimal ways to do the same thing: https://github.com/search?l=Crystal&q=chars.{each,map}&type=Code

@asterite like you, I have a tendency to use String#chars instead of String#each_char - and we aren't alone looking at the GitHub search :sweat_smile:

require "benchmark"

STR = "string"

Benchmark.ips do |x|
  x.report("String#chars.map") do
    STR.chars.map { |c| }
  end

  x.report("String#each_char.map : CharIterator") do
    STR.each_char.map { |c| }
  end
end
                   String#chars.map  12.78M ( 78.24ns) (卤 7.01%)  112 B/op   1.84脳 slower
String#each_char.map : CharIterator  23.56M ( 42.45ns) (卤 7.38%)   80 B/op        fastest

After noticing that most of the time, users often use String#chars for iterator operations, I suggest:

  1. Change String#chars : Array(String) to String#chars : CharIterator (and suggest using String#chars.to_a to have an Array)
  2. Add deprecation notice for String#each_char : CharIterator
  3. Nothing else :)

WDYT?

I disagree: it's not clear what chars returns, whereas each_{item} is a common idiom in crystal where the block version yields each item, and the non-block version returns an {Item}iterator.

I'd rather remove chars than change it to each_char and remove each_char..

I think detecting things like .chars.each etc should be easy to do for a semantic linter and I would prefer a tool like that to educate people about this rather than designing the stdlib into fool-proof but ugly.

The benchmark is flawed, you must call to_a after map. It will be slower than chars, but not by too much (I think)

@asterite The benchmark also uses a very short string which is clearly in favour of #chars.

Ok, else does anyone has a idea how to achieve this level of performance with the actual stdlib then?
https://gist.github.com/j8r/c4b7d24257914b55cfad3acaf04cc7d8

         String#chars.map   9.16M (109.16ns) (卤 9.44%)  112 B/op   2.31脳 slower
String#each_char.to_a.map   3.97M ( 251.8ns) (卤15.06%)  209 B/op   5.32脳 slower
        String#chars2.map  21.14M ( 47.31ns) (卤 6.35%)   48 B/op        fastest
         Big String#chars.map  19.02k ( 52.58碌s) (卤30.32%)  24064 B/op  10.34脳 slower
Big String#each_char.to_a.map  16.69k ( 59.92碌s) (卤21.55%)  50032 B/op  11.78脳 slower
        Big String#chars2.map 196.65k (  5.09碌s) (卤 7.66%)    128 B/op        fastest

IMHO the API is nice.

EDIT: update according @bew suggestion of using a big String.

Why are you benchmarking each_char.to_a.map vs chars2.map, it doesn't make sense to me.. The equivalent to chars2.map is each_char.map.

And using a bigger string like "string" * 1000, I get very different results:
https://gist.github.com/bew/3e44bf01b27db15da0f1ff14be38b450

Edit: actually forget me, each_char.map doesn't do anything.... my bad

Again: each_char.map is a no-op without later invoking to_a

chars is faster than char2.map because chars knows the resulting size of the Array. I think Rust has an optional size_hint getter for this, but that's a separate discussion.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

cjgajard picture cjgajard  路  3Comments

nabeelomer picture nabeelomer  路  3Comments

Papierkorb picture Papierkorb  路  3Comments

grosser picture grosser  路  3Comments

asterite picture asterite  路  3Comments