Crystal: Make each_char yield the index already

Created on 8 Nov 2019  路  16Comments  路  Source: crystal-lang/crystal

Why don't we change each_char to yield the index already, instead of having an extra method each_char_with_index for it? The thing is that either way, whether you use the index or not, the index has to be calculated. It has to keep track of what character it's at.
So we could yield the char and the index to each_char and then both of this will work:

"hello".each_char do |char|
  puts char
end
"hello".each_char do |char, index|
  puts char, index
end



md5-0c529ee3a619341491dbaec4e06c6388



1 11.03M ( 90.70ns) (卤 8.44%) 0.0B/op 1.24脳 slower
2 13.63M ( 73.38ns) (卤10.67%) 0.0B/op fastest
`` each_char_with_index` actually counts the index twice. That isn't efficient at all. They should have the same performance. So when we do this it should also be easy to fix this important performance regression.

We could also go even further and rename each_char to each. Like, when we do "hello".each do we obviously want to go through the chars, why do we have to explicitly mention that with the _char prefix? Similar to how you do [1, 2, 3].each do and not [1, 2, 3].each_element do.

I would be willing to implement this if any of this is accepted.

Most helpful comment

Also the idea of each is to get each object, not necessarily with an index. Maybe an enumerator fetches data from a socket or a file, for each each_line just reads until eof, no need to count anything. So forcing the index on every method just complicates the implmentors.

All 16 comments

Your benchmark is completely meaningless because it compares a generic implementation with a highly optimized special case.

each_char_with_index actually counts the index twice.

Where is the second counter? I don't think Char::Reader keeps track of the char index. It only counts bytes. So apart from ASCII-only strings where both are identical, #each_char_with_index needs to count the characters explicitly.

However, it seems optimization removes the index counter entirely when its not used. Both .each_char_with_index do |char| and .each_char do |char| see to be equivalent.

require "benchmark"

FOO = "enemenemuhundrausbistdu!脽"

chars = 0
Benchmark.ips do |bm|
  bm.report "each_char_with_index" do
    FOO.each_char_with_index do |char|
      chars |= char.ord
    end
  end
  bm.report "each_char" do
    FOO.each_char do |char|
      chars |= char.ord
    end
  end
end
   5.21M (191.98ns) (卤25.87%)  0.0B/op   1.01脳 slower
   5.28M (189.55ns) (卤29.09%)  0.0B/op        fastest

So maybe we could actually consider merging both methods into a single one.

why do we have to explicitly mention that with the _char prefix?

In contrast to an array or other enumerable, with a string you can iterate different things. Using each_char and each_byte avoids any ambiguity.

Your benchmark is completely meaningless because it compares a generic implementation with a highly optimized special case.

Okay, I didn't handle the non-ASCII case there but now look when I make it a bit more sophisticated and handle all then the old one is still slower:

Old  11.88M ( 84.14ns) (卤 1.40%)  0.0B/op   1.26脳 slower
New  15.01M ( 66.63ns) (卤 2.09%)  0.0B/op        fastest

I don't think Char::Reader keeps track of the char index. It only counts bytes.

It does, that's what Char::Reader#pos is for.

Well, it seems the benchmark results are just buggy again. Sometimes the new one is "slower".
So they are indeed equivalent in performance.
But the main point of this issue is anyway to remove each_char_with_index. So let's focus on that.

In contrast to an array or other enumerable, with a string you can iterate different things. Using each_char and each_byte avoids any ambiguity.

Hmm yeah, we should probably keep it each_char.

each_char_with_index makes it clear that the second block argument is the index.

Is there anything we are trying to solve here? What's the problem with using each_char or each_char_with_index?

@r00ster91 Char::Reader#pos is the byte index, not character count.

@asterite Assuming that each_char_with_index is equally performant as each_char when the index argument is not used, it would provide an option to simplify the API by using only each_char which yields both char and index (where the index is optional). But yeah, the expressiveness of each_char_with_index has its benefits as well.

@straight-shoota
https://github.com/crystal-lang/crystal/blob/0e2e1d067af09e7b1e4573a7258c433e3cf8fa17/src/char/reader.cr#L182
https://github.com/crystal-lang/crystal/blob/0e2e1d067af09e7b1e4573a7258c433e3cf8fa17/src/char/reader.cr#L98
Whenever it's incremented, it's incremented by a specific char width, not 1, so I'm pretty sure that's the character index, not the byte index.

@asterite I think each_char do |char, index| is just as clear.

  1. It's shorter.
  2. It's simpler.
  3. It's easier to add the index when you need it because you just need to add , index| and that's it. Currently you also have to add _with_index.
  4. It's compatible with every index counter style. Some prefer to write it i, not index. each_char_with_index do |char, index| fits well. each_char_with_index do |char, i| doesn't fit so well. Some might prefer the method to be named each_char_with_i. With each_char that problem doesn't exist.
  5. One method less. I think that's always a good thing.

No, that's exactly the byte index. It adds a number of n (1 <= n <= 6) per character. If it was a character index, it would add exactly 1 per character. So, after consuming 3 unicode characters, pos could be anything between 3 and 18. There is no way to tell from that pos value how many characters have been read.

I've bought up that *_with_index is redundant before, but it seems everyone wants to keep the status quo.

Instead of adding *_with_index counterparts for all methods method with optional second argument index sounds like a good idea to me.

For more than 2 arguments I would say it's not worth it, but when index is the second optional argument it's easy to remember.

There is also Enumable#map_with_index.
It can lead the way to have an optional index yielded for other existing enumerable methods, like find, select, reject, and so on, for free, and without changing the API.

OTOH there are things like each_with_index(offset) {|val, idx|} with optional offset argument to start iteration from. It also yields 3 arguments to the block in case of named tuple tuple.each_with_index do |key, value, index|.

So to go "get rid of _with_index methods" path I think all the methods with _with_index suffix must be reviewed and only then it'll be clear if dropping the suffix would be better.

So for example for Hash each would yield two objects: the key-value pair and the index. Then doing each { |k, v| doesn't work anymore.

Also the idea of each is to get each object, not necessarily with an index. Maybe an enumerator fetches data from a socket or a file, for each each_line just reads until eof, no need to count anything. So forcing the index on every method just complicates the implmentors.

@asterite why it won't work anymore for Hash? Enumerable#each isn't used, it's a custom implementation, which uses Hash#each_entry_with_index.

I agree it can be cumbersome to put counters everywhere to yield an index, but at the same time having with_index methods are meh, too.
_with_index are arbitrarily only in few place because one guy needs it at some point, but we can need a counter on lots of other methods.

For what it worth, I have found a generic way by using a macro (didn't find another way):

macro with_index(call, &block)
  {{block.args[-1]}} = 0
  {{call}} do |{{block.args[0..-2].join(',').id}}|
    {{block.body}}
    {{block.args[-1]}} += 1
  end
end

with_index ['a', 'b', 'c'].each do |el, i| 
  puts "#{el}: #{i}"
end

with_index({"a" => 'A', "b" => 'B'}.each) do |k, v, i| 
  puts "#{k}, #{v}: #{i}"
end

Result:

a: 0
b: 1
c: 2
a, A: 0
b, B: 1

Another problem for dropping _with_index is that omitting the index won't be free for iterator methods. It might not be a huge issue, because iterators are already quite costly in comparison to inlined blocks, but still.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

will picture will  路  3Comments

lbguilherme picture lbguilherme  路  3Comments

asterite picture asterite  路  3Comments

Papierkorb picture Papierkorb  路  3Comments

oprypin picture oprypin  路  3Comments