re.split delivers unexpected results when the pattern is a zero-width assertion such as ^, $ or \b. It seems to calculate the split positions as if the zero-width match had a width.
This applies only to these zero-width assertions; zero-width lookaround constructs work as expected.
import re
echo "expecting:"
echo @["", "foo"]
echo "getting:"
echo "foo".split(re"^")
echo "expecting:"
echo @["foo", ""]
echo "getting:"
echo "foo".split(re"$")
echo "expecting:"
echo @["", "foo", " ", "bar", ""]
echo "getting:"
echo "foo bar".split(re"\b")
echo "expecting: "
echo @["foo", "ar"]
echo "getting:"
echo "foobar".split(re"(?<=o)b") # This works
echo "expecting: "
echo @["fo", "bar"]
echo "getting:"
echo "foobar".split(re"o(?=b)") # This works
expecting:
@["", "foo"]
getting:
@["f", "oo"]
expecting:
@["foo", ""]
getting:
@["foo"]
expecting:
@["", "foo", " ", "bar", ""]
getting:
@["f", "oo ", "b", "ar"]
expecting:
@["foo", "ar"]
getting:
@["foo", "ar"]
expecting:
@["fo", "bar"]
getting:
@["fo", "bar"]
expecting:
@["", "foo"]
getting:
@["", "foo"]
expecting:
@["foo", ""]
getting:
@["foo", ""]
expecting:
@["f", "oo ", "b", "ar"]
getting:
@["f", "oo ", "b", "ar"]
expecting:
@["foo", "ar"]
getting:
@["foo", "ar"]
expecting:
@["fo", "bar"]
getting:
@["fo", "bar"]
The split code needs to be changed so that zero-width matches are not treated as having a width when the split positions are calculated.
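To make the intended position calculation concrete, here is a minimal Python sketch of a split that slices at the match boundaries themselves. It is not the re.nim implementation; the helper name `split_keep_empty` is made up for illustration, and it assumes Python's `re.finditer` as the source of match positions.

```python
import re

def split_keep_empty(pattern, text):
    """Split text at every match of pattern, including zero-width matches.

    Hypothetical helper for illustration only, not part of any library.
    """
    pieces = []
    last = 0  # end of the previous match, i.e. where the next piece starts
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()  # for a zero-width match, end == start: nothing is consumed
    pieces.append(text[last:])
    return pieces

print(split_keep_empty(r"^", "foo"))            # ['', 'foo']
print(split_keep_empty(r"$", "foo"))            # ['foo', '']
print(split_keep_empty(r"\b", "foo bar"))       # ['', 'foo', ' ', 'bar', '']
print(split_keep_empty(r"(?<=o)b", "foobar"))   # ['foo', 'ar']
print(split_keep_empty(r"o(?=b)", "foobar"))    # ['fo', 'bar']
```

The point of the sketch is that the slice boundary is the match end, so a zero-width match never removes a character from either neighboring piece.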
$ nim -v
Nim Compiler Version 1.2.0 [Linux: amd64]
Compiled at 2020-04-04
Copyright (c) 2006-2020 by Andreas Rumpf
active boot switches: -d:release -d:nativeStackTrace
Related (or possibly duplicate) of https://github.com/nim-lang/Nim/issues/14284, https://github.com/nim-lang/Nim/issues/9437.
Just use nre then, we won't "fix" re.nim.
> Just use nre then, we won't "fix" re.nim.
No thank you I'm sticking with re, it's nicer.
@capocasa sometimes it takes a bit more convincing and data points, can you please show output for other programming languages and libraries?
here's a start: D, python, re, nre, nim-regex
/+
D20200528T005513
+/
import std.stdio;
import std.regex;
void main(string[]args){
writeln(splitter("foo", regex("^")));
writeln(splitter("foo", regex("$")));
writeln(splitter("foo bar", regex("\\b")));
writeln(splitter("foobar", regex("(?<=o)b")));
writeln(splitter("foobar", regex("o(?=b)")));
}
rdmd $timn_D/tests/nim/all/t10843.d
["", "foo"]
["foo", ""]
["", "foo", " ", "bar", ""]
["foo", "ar"]
["fo", "bar"]
# D20200528T005513
import re
print(re.split("^", "foo"))
print(re.split("$", "foo"))
print(re.split("\\b", "foo bar"))
print(re.split("(?<=o)b", "foobar"))
print(re.split("o(?=b)", "foobar"))
python3 $timn_D/tests/nim/all/t10842.py
['', 'foo']
['foo', '']
['', 'foo', ' ', 'bar', '']
['foo', 'ar']
['fo', 'bar']
import regex
echo split("foo", re"^")
echo split("foo", re"$")
echo split("foo bar", re"\b")
echo split("foobar", re"(?<=o)b")
echo split("foobar", re"o(?=b)")
/cc @nitely => see unhandled exception:
@["foo"]
@["foo"]
@["foo", " ", "bar"]
Error: unhandled exception: Invalid group. Unknown group type (?<=o)b
when true:
  import nre
  echo split("foo", re"^")
  echo split("foo", re"$")
  echo split("foo bar", re"\b")
  echo split("foobar", re"(?<=o)b")
  echo split("foobar", re"o(?=b)")
@["foo"]
@["foo"]
@["foo", " ", "bar"]
@["foo", "ar"]
@["fo", "bar"]
@["f", "oo"]
@["foo"]
@["f", "oo ", "b", "ar"]
@["foo", "ar"]
@["fo", "bar"]
As shown above, although behavior differs somewhat across languages and libraries, re stands out (its results are not just surprising but non-standard), and its behavior should be considered a bug, at least for split("foo", re"^") and split("foo bar", re"\b").
Well PRs are welcome but I won't do it. Been there, done that, regex is a most expensive time sink. Without much benefits, in the end regexes are most unsuitable for robust parsers.
It would be more productive to deprecate both re and nre and point people to nim-regex.
> Well PRs are welcome but I won't do it.
well I'm pretty sure no one ever implied here that you should do it; a bug is a bug, anyone including reporter can fix it; but ppl (esp reporter) is unlikely to fix it if it's classified as "wontfix"
> Without much benefits, in the end regexes are most unsuitable for robust parsers.
yes, unsuitable for parsers but there are many practical use cases you can't discount; consistency matters when possible (cf. the discrepancies above)
> It would be more productive to deprecate both re and nre and point people to nim-regex.
nim-regex has the better API (+ pure nim = works with other backends + vm), but there's still a performance gap, so until gap is bridged, re,nre stay relevant (see https://github.com/nitely/nim-regex/pull/58#issuecomment-619522766)
> Well PRs are welcome but I won't do it. Been there, done that, regex is a most expensive time sink. Without much benefits, in the end regexes are most unsuitable for robust parsers.
> deprecate both re and nre and point people to nim-regex
I do not believe that line of thinking is applicable to the 1.x branch.
Andreas, you are stretched very, very thin, so I totally get it might be borderline offensive to be bothered with a dumb little maybe-a-bug in a module that's mainly there to accommodate people's mindless habits.
On the other hand, Nim is post 1.0. That branch is no longer about good, that branch is about mature. So both re and nre are going to have to be dragged into complete inoffensiveness one ragged detail at a time through uninspiring leg work, or people are going to reach for awk. If that doesn't sound like you, I think it's because you aren't. I would like to go out on a limb here and wonder if you should really be on github triaging bugs for 1.x. Might it be possible to make something kind of radical happen: to turn over responsibility for maintenance of 1.x to someone who excels at polish, so you can focus on what you want to focus on?
And until then, can you re-open this please, so someone can fix it?
> @capocasa sometimes it takes a bit more convincing and data points, can you please show output for other programming languages and libraries?
Great idea, see below. Looks like regex split behavior is less standardized than I thought: PHP, D and Python behave as I anticipated, while JavaScript, Ruby and (look at that) Perl omit the empty strings. None behave like re, so I'd say that's a bug by expectation.
Since it probably doesn't really matter and we just have to pick one, it would probably make sense to emulate python/re/nim-regex behavior here. Python users are future Nim users, aren't they? 😉
PHP
<?php
var_export(preg_split('/^/', 'foo'));
var_export(preg_split('/$/', 'foo'));
var_export(preg_split('/\\b/', 'foo bar'));
var_export(preg_split('/(?<=o)b/', 'foobar'));
var_export(preg_split('/o(?=b)/', 'foobar'));
array (
0 => '',
1 => 'foo',
)array (
0 => 'foo',
1 => '',
)array (
0 => '',
1 => 'foo',
2 => ' ',
3 => 'bar',
4 => '',
)array (
0 => 'foo',
1 => 'ar',
)array (
0 => 'fo',
1 => 'bar',
)
Javascript / Node
console.log("foo".split(/^/));
console.log("foo".split(/$/));
console.log("foo bar".split(/\b/));
console.log("foobar".split(/(?<=o)b/));
console.log("foobar".split(/o(?=b)/));
[ 'foo' ]
[ 'foo' ]
[ 'foo', ' ', 'bar' ]
[ 'foo', 'ar' ]
[ 'fo', 'bar' ]
Ruby
p "foo".split(/^/)
p "foo".split(/$/)
p "foo bar".split(/\b/)
p "foobar".split(/(?<=o)b/)
p "foobar".split(/o(?=b)/)
["foo"]
["foo"]
["foo", " ", "bar"]
["foo", "ar"]
["fo", "bar"]
Perl
use Data::Dump 'dump';
dump split(/^/, 'foo');
dump split(/$/, 'foo');
dump split(/\b/, 'foo bar');
dump split(/(?<=o)b/, 'foobar');
dump split(/o(?=b)/, 'foobar');
"foo"
"foo"
("foo", " ", "bar")
("foo", "ar")
("fo", "bar")
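The two families of results in this comparison can be sketched side by side in Python. This is only an approximation of the observed outputs, not the actual rule any of these languages implements: `keep_empty` mirrors the PHP/D/Python results, and `drop_edge_empty` (a hypothetical helper) reproduces the Ruby and Perl results for these five inputs by trimming empty strings at both ends. It assumes Python 3.7 or later, where `re.split` splits on empty matches.

```python
import re

def keep_empty(pattern, text):
    # PHP / D / Python style: empty pieces at the edges are kept.
    return re.split(pattern, text)

def drop_edge_empty(pattern, text):
    # Approximation of the Ruby / Perl results listed above:
    # trim empty strings from both ends, keep everything in between.
    parts = re.split(pattern, text)
    start, end = 0, len(parts)
    while start < end and parts[start] == "":
        start += 1
    while end > start and parts[end - 1] == "":
        end -= 1
    return parts[start:end]

for pattern, text in [(r"^", "foo"), (r"$", "foo"), (r"\b", "foo bar")]:
    print(keep_empty(pattern, text), drop_edge_empty(pattern, text))
# ['', 'foo'] ['foo']
# ['foo', ''] ['foo']
# ['', 'foo', ' ', 'bar', ''] ['foo', ' ', 'bar']
```

Neither variant yields @["f", "oo"], which is the result re produces for the first case.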
Well, he already said "PRs are welcome", and the data you've added confirms that no other language replicates re/nre's behavior in the most contentious cases, so I'm reopening.
@capocasa thanks for the data, it helps; now would you like to contribute a PR to fix at least the clearly wrong cases, and leave out the other cases for further debate/work? too many bugs, not enough PRs :-)
Related (or possibly duplicate) of #14284, #9437.
@timotheecour Cool!
OK I'll see what I can do, long list of various intended contributions and pesky life-related activities.
I had a look but I ran into a bit of a snag: are there any more unit tests for re besides those in tests/stdlib/tregex.nim? If there aren't and I fix this, I might be introducing any number of regressions.
Edit: Couldn't find any so far. Without them, this is more than a fix. The best way forward I found is to copy/paste the nre tests, drop or adapt the ones that don't match, figure out why the remaining ones fail, and then fix this.
@capocasa
- tests/stdlib/tre.nim (since https://github.com/nim-lang/Nim/pull/14483, fresh from this morning; the tests were moved from the isMainModule block inside re.nim)
- tests/stdlib/tregex.nim (which is tiny); it uses the wrong naming convention, so it should be merged into tests/stdlib/tre.nim
- there's also tests/stdlib/nre/*; IMO 1 file per function is overkill and not standard, so it should probably be converted to tests/stdlib/tnre.nim or at least refactored into fewer tests (at some point in the future ...) => at least its structure (as multiple files) shouldn't be duplicated for re
(controversial, I feel some ppl are not gonna like this, so maybe ignore) if possible, for at least a common subset of tests for re and nre, it would be nice to factor code instead of duplicate the tests, for various reasons. Many ways to do it, here's one:
# tests/stdlib/tre.nim # re specific
# tests/stdlib/tnre.nim # nre specific
# tests/stdlib/mre_or_nre.nim # common to nre + re
import unittest
template testAll(xre) = # call with re and nre
  import xre
  check split("foobar", re"(?<=o)b") == @["foo", "ar"]
  check ...
when defined(testRe): testAll(re)
elif defined(testNre): testAll(nre)
else: static: doAssert false
# then call with -d:testRe and -d:testNRe (multiple options for that, one of them is from trunner; another is via the usual testament spec)
(would be much easier to do that with import at block scope though, as it would avoid the separate compilation)
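The same idea, one shared table of cases run against more than one implementation, can be sketched in Python. Everything here is illustrative: `CASES`, `failures`, `split_a` and `split_b` are made-up names, and the two split functions are stand-ins for re and nre rather than bindings to them.

```python
import re

# (pattern, text, expected) triples shared by every implementation under test.
CASES = [
    (r"(?<=o)b", "foobar", ["foo", "ar"]),
    (r"o(?=b)", "foobar", ["fo", "bar"]),
]

def failures(split_fn):
    """Return (pattern, text, expected, actual) for every case that fails."""
    result = []
    for pattern, text, expected in CASES:
        actual = split_fn(pattern, text)
        if actual != expected:
            result.append((pattern, text, expected, actual))
    return result

# Stand-ins for two implementations that should agree on the common subset.
def split_a(pattern, text):
    return re.split(pattern, text)

def split_b(pattern, text):
    return re.compile(pattern).split(text)

for name, fn in [("a", split_a), ("b", split_b)]:
    print(name, failures(fn))
```

Keeping the cases as plain data means module-specific tests can still live in their own files while the common subset stays in one place.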
@timotheecour Cool, thanks for the move and your input!
> factor code
I think it would be good to duplicate the tests but structure them to be reasonably easy to keep in sync manually. That keeps some additional moving parts out of the debugging process.
> structure them to be reasonably easy to keep in sync manually
if so, a single file instead of 1 file per function is preferable (much easier to manage); and as mentioned, that's what most other modules use
I'm messing with this from time to time. It's hard.
do you have a PR draft ?
Nope! Very limited time but will return from time to time. Still understanding code, figuring out how to get a different code path for "foo".split(re"") and "foo".split(re"^"). If someone else were to look into this please post so we don't duplicate effort. Thanks!
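For anyone picking this up, the difference between the two patterns is visible in the match positions alone. The Python snippet below only illustrates that observation and says nothing about how re.nim is structured internally; `match_spans` is a hypothetical helper.

```python
import re

def match_spans(pattern, text):
    # Collect (start, end) for every match; zero-width matches have start == end.
    return [m.span() for m in re.finditer(pattern, text)]

print(match_spans(r"", "foo"))   # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(match_spans(r"^", "foo"))  # [(0, 0)]
```

Both patterns produce only zero-width matches, but the empty pattern matches at every position while ^ matches once, so a split driven by match positions can tell them apart without a special case for "empty match".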
@capocasa FWIW, you can probably copy/adapt the nim-regex code
> @capocasa FWIW, you can probably copy/adapt the nim-regex code
Not a bad idea! I'll have a look. Without having profiled it I suspect fixing the re code may still be the best option. It looks fast.