Nim: re.split unexpected results with zero-width characters

Created on 27 May 2020 · 20 Comments · Source: nim-lang/Nim

re.split delivers unexpected results for patterns that match a zero-width position, such as ^, $ and \b. It seems to calculate the split as if these zero-width matches had a width.

This applies only to zero-width anchors and boundaries; zero-width lookaround constructs work as expected.

Example

import re
echo "expecting:"
echo @["", "foo"]
echo "getting:"
echo "foo".split(re"^")
echo "expecting:"
echo @["foo", ""]
echo "getting:"
echo "foo".split(re"$")
echo "expecting:"
echo @["", "foo", " ", "bar", ""]
echo "getting:"
echo "foo bar".split(re"\b")
echo "expecting: "
echo @["foo", "ar"]
echo "getting:"
echo "foobar".split(re"(?<=o)b")  # This works
echo "expecting: "
echo @["fo", "bar"]
echo "getting:"
echo "foobar".split(re"o(?=b)")  # This works

Current Output

expecting:
@["", "foo"]
getting:
@["f", "oo"]
expecting:
@["foo", ""]
getting:
@["foo"]
expecting:
@["", "foo", " ", "bar", ""]
getting:
@["f", "oo ", "b", "ar"]
expecting: 
@["foo", "ar"]
getting:
@["foo", "ar"]
expecting: 
@["fo", "bar"]
getting:
@["fo", "bar"]

Expected Output

expecting:
@["", "foo"]
getting:
@["", "foo"]
expecting:
@["foo", ""]
getting:
@["foo", ""]
expecting:
@["", "foo", " ", "bar", ""]
getting:
@["", "foo", " ", "bar", ""]
expecting: 
@["foo", "ar"]
getting:
@["foo", "ar"]
expecting: 
@["fo", "bar"]
getting:
@["fo", "bar"]

Possible Solution

The split code needs to be changed so that zero-width characters are not treated as having a width in their position calculation.
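
A minimal sketch of that position calculation, written in Python only because it is easy to run; it is not the code in re.nim, and `split_at_matches` is a hypothetical name. The idea is to cut the text at each match's start and resume at its end, so a zero-width match consumes nothing:

```python
import re

def split_at_matches(text, pattern):
    """Split text at every match of pattern, treating zero-width matches as width 0."""
    pieces = []
    last = 0  # index where the next piece begins
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])  # everything before this match
        last = m.end()                       # equals m.start() for a zero-width match
    pieces.append(text[last:])               # whatever remains after the final match
    return pieces

print(split_at_matches("foo", r"^"))
print(split_at_matches("foo", r"$"))
print(split_at_matches("foo bar", r"\b"))
```

For the five cases in the example this yields the lists shown under "expecting" in the code.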

$ nim -v
Nim Compiler Version 1.2.0 [Linux: amd64]
Compiled at 2020-04-04
Copyright (c) 2006-2020 by Andreas Rumpf

active boot switches: -d:release -d:nativeStackTrace

capocasa

All 20 comments

kaushalmodi on 27 May 2020

Just use nre then, we won't "fix" re.nim.

Araq on 27 May 2020

Just use nre then, we won't "fix" re.nim.

No thank you I'm sticking with re, it's nicer.

capocasa on 27 May 2020

@capocasa sometimes it takes a bit more convincing and data points, can you please show output for other programming languages and libraries?

here's a start: D, python, re, nre, nim-regex

D

/+
D20200528T005513
+/
import std.stdio;
import std.regex;
void main(string[]args){
  writeln(splitter("foo", regex("^")));
  writeln(splitter("foo", regex("$")));
  writeln(splitter("foo bar", regex("\\b")));
  writeln(splitter("foobar", regex("(?<=o)b")));
  writeln(splitter("foobar", regex("o(?=b)")));
}

 rdmd $timn_D/tests/nim/all/t10843.d
["", "foo"]
["foo", ""]
["", "foo", " ", "bar", ""]
["foo", "ar"]
["fo", "bar"]

python3

# D20200528T005513
import re
print(re.split("^", "foo"))
print(re.split("$", "foo"))
print(re.split("\\b", "foo bar"))
print(re.split("(?<=o)b", "foobar"))
print(re.split("o(?=b)", "foobar"))

python3 $timn_D/tests/nim/all/t10842.py
['', 'foo']
['foo', '']
['', 'foo', ' ', 'bar', '']
['foo', 'ar']
['fo', 'bar']

nim-regex

import regex
echo split("foo", re"^")
echo split("foo", re"$")
echo split("foo bar", re"\b")
echo split("foobar", re"(?<=o)b")
echo split("foobar", re"o(?=b)")

/cc @nitely => see unhandled exception:

@["foo"]
@["foo"]
@["foo", " ", "bar"]
Error: unhandled exception: Invalid group. Unknown group type (?<=o)b

nre

when true:
  import nre
  echo split("foo", re"^")
  echo split("foo", re"$")
  echo split("foo bar", re"\b")
  echo split("foobar", re"(?<=o)b")
  echo split("foobar", re"o(?=b)")

@["foo"]
@["foo"]
@["foo", " ", "bar"]
@["foo", "ar"]
@["fo", "bar"]

re

@["f", "oo"]
@["foo"]
@["f", "oo ", "b", "ar"]
@["foo", "ar"]
@["fo", "bar"]

As shown above, although behavior differs somewhat across languages and libraries, re stands out (not just surprising but non-standard), and its behavior should be considered a bug, at least for split("foo", re"^") and split("foo bar", re"\b").

timotheecour on 28 May 2020

Well PRs are welcome but I won't do it. Been there, done that, regex is a most expensive time sink. Without much benefits, in the end regexes are most unsuitable for robust parsers.

Araq on 28 May 2020

It would be more productive to deprecate both re and nre and point people to nim-regex.

Araq on 28 May 2020

Well PRs are welcome but I won't do it.

well, I'm pretty sure no one here ever implied that you should do it; a bug is a bug, and anyone including the reporter can fix it; but ppl (esp. the reporter) are unlikely to fix it if it's classified as "wontfix"

Without much benefits, in the end regexes are most unsuitable for robust parsers.

yes, unsuitable for parsers, but there are many practical use cases you can't discount; consistency matters when possible (cf. the discrepancies above)

It would be more productive to deprecate both re and nre and point people to nim-regex.

nim-regex has the better API (+ pure nim = works with other backends + vm), but there's still a performance gap, so until gap is bridged, re,nre stay relevant (see https://github.com/nitely/nim-regex/pull/58#issuecomment-619522766)

timotheecour on 28 May 2020

Well PRs are welcome but I won't do it. Been there, done that, regex is a most expensive time sink. Without much benefits, in the end regexes are most unsuitable for robust parsers.

deprecate both re and nre and point people to nim-regex

I do not believe that line of thinking is applicable to the 1.x branch.

Andreas, you are stretched very, very thin, so I totally get it might be borderline offensive to be bothered with a dumb little maybe-a-bug in a module that's mainly there to accommodate people's mindless habits.

On the other hand, Nim is post 1.0. That branch is no longer about good, that branch is about mature. So both re and nre are going to have to be dragged into complete inoffensiveness one ragged detail at a time through uninspiring leg work, or people are going to reach for awk. If that doesn't sound like you, I think it's because you aren't. I would like to go out on a limb here and wonder if you should really be on github triaging bugs for 1.x. Might it be possible to make something kind of radical happen: to turn over responsibility for maintenance of 1.x to someone who excels at polish, so you can focus on what you want to focus on?

And until then, can you re-open this please, so someone can fix it?

@capocasa sometimes it takes a bit more convincing and data points, can you please show output for other programming languages and libraries?

Great idea, see below. Looks like regex split behavior is less standardized than I thought: PHP, D and Python behave as I anticipated, while JavaScript, Ruby and (look at that) Perl omit the empty strings. None behave like re, so I'd say that's a bug by expectation.

Since it probably doesn't really matter and we just have to pick one, it would probably make sense to emulate python/re/nim-regex behavior here. Python users are future Nim users, aren't they? 😉
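
The two families of results can be reproduced with one small policy switch. The sketch below is in Python and rests on an assumption: that the "omit" group simply ignores zero-width matches at the very start and very end of the input. The real implementations may differ in other cases, and `split_with_policy` and `keep_edge_empties` are hypothetical names.

```python
import re

def split_with_policy(text, pattern, keep_edge_empties=True):
    """Split at every match; optionally ignore zero-width matches at the string edges."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        zero_width = m.start() == m.end()
        at_edge = m.start() in (0, len(text))
        if zero_width and at_edge and not keep_edge_empties:
            continue  # assumed rule for the "omit" group: no split at the edges
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

# Keeping edge empties gives the PHP / D / Python style of result.
print(split_with_policy("foo bar", r"\b", keep_edge_empties=True))
# Ignoring them gives the Ruby / Perl / nre style of result.
print(split_with_policy("foo bar", r"\b", keep_edge_empties=False))
```

Neither policy reproduces re's @["f", "oo"] for ^, which supports treating that result as a bug rather than a third convention.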

PHP

<?php
var_export(preg_split('/^/', 'foo'));
var_export(preg_split('/$/', 'foo'));
var_export(preg_split('/\\b/', 'foo bar'));
var_export(preg_split('/(?<=o)b/', 'foobar'));
var_export(preg_split('/o(?=b)/', 'foobar'));

array (
  0 => '',
  1 => 'foo',
)array (
  0 => 'foo',
  1 => '',
)array (
  0 => '',
  1 => 'foo',
  2 => ' ',
  3 => 'bar',
  4 => '',
)array (
  0 => 'foo',
  1 => 'ar',
)array (
  0 => 'fo',
  1 => 'bar',
)

Javascript / Node

console.log("foo".split(/^/));
console.log("foo".split(/$/));
console.log("foo bar".split(/\b/));
console.log("foobar".split(/(?<=o)/));
console.log("foobar".split(/o(?=b)/));

[ 'foo' ]
[ 'foo' ]
[ 'foo', ' ', 'bar' ]
[ 'fo', 'o', 'bar' ]
[ 'fo', 'bar' ]

Ruby

p "foo".split(/^/)
p "foo".split(/$/)
p "foo bar".split(/\b/)
p "foobar".split(/(?<=o)/)
p "foobar".split(/o(?=b)/)

["foo"]
["foo"]
["foo", " ", "bar"]
["fo", "o", "bar"]
["fo", "bar"]

Perl

use Data::Dump 'dump';
dump split(/^/, 'foo');
dump split(/$/, 'foo');
dump split(/\b/, 'foo bar');
dump split(/(?<=o)b/, 'foobar');
dump split(/o(?=b)/, 'foobar');

"foo"
"foo"
("foo", " ", "bar")
("foo", "ar")
("fo", "bar")

capocasa on 28 May 2020

well he already said Well PRs are welcome, and the data you've added confirms that no other language replicates re/nre's behavior in the most contentious cases, so I'm reopening.

@capocasa thanks for the data, it helps; now would you like to contribute a PR to fix at least the clearly wrong cases, and leave out the other cases for further debate/work? too many bugs, not enough PRs :-)

Related (or possibly duplicate) of #14284, #9437.

https://github.com/nim-lang/Nim/issues/9437 is indeed related (and would make sense to consider it in the PR)
#14284 was rather a different thing, I just sent out https://github.com/nim-lang/Nim/pull/14483 which should close this

timotheecour on 28 May 2020

@timotheecour Cool!

OK I'll see what I can do, long list of various intended contributions and pesky life-related activities.

capocasa on 28 May 2020

I had a look but I ran into a bit of a snag: are there any more unit tests for re except those in tests/stdlib/tregex.nim? If there aren't and I fix this, I might be introducing any number of regressions.

Edit: Couldn't find any so far. Without them, this is more than a fix. The best way forward I found is to copy/paste the nre tests, drop or adapt the ones that don't match, figure out why the remaining ones fail, and then fix this.

capocasa on 28 May 2020

@capocasa

tests are in tests/stdlib/tre.nim (since https://github.com/nim-lang/Nim/pull/14483, fresh from this morning; the tests were moved from the isMainModule block inside re.nim)
I now realize there's also a tests/stdlib/tregex.nim (which is tiny); it uses the wrong naming convention so it should be merged into tests/stdlib/tre.nim
there's also tests/stdlib/nre/*; IMO 1 file per function is overkill and not standard, so it should probably be converted to tests/stdlib/tnre.nim or at least refactored into fewer tests (at some point in the future ...); => at least its structure (as multiple files) shouldn't be duplicated for re
(controversial, I feel some ppl are not gonna like this, so maybe ignore) if possible, for at least a common subset of tests for re and nre, it would be nice to factor the code instead of duplicating the tests, for various reasons. Many ways to do it, here's one:

# tests/stdlib/tre.nim # re specific
# tests/stdlib/tnre.nim # nre specific
# tests/stdlib/mre_or_nre.nim # common to nre + re
import unittest
template testAll(xre) = # call with re and nre
  import xre
  check split("foobar", re"(?<=o)b") == @["foo", "ar"]
  # ... further checks shared by re and nre

when defined(testRe): testAll(re)
elif defined(testNre): testAll(nre)
else: static: doAssert false
# then call with -d:testRe and -d:testNRe (multiple options for that, one of them is from trunner; another is via the usual testament spec)

(would be much easier to do that with import at block scope though, as it would avoid the separate compilation)
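
The same idea, one shared case table exercised against more than one implementation, can be sketched in Python, where it is easy to run. This only illustrates the structure and is not a proposal for the Nim test suite. All names are hypothetical, the expected values are the Python outputs quoted earlier in this thread, and Python 3.7 or later is assumed (older versions handle empty matches in re.split differently).

```python
import re

# Shared cases: (pattern, text, expected pieces).
CASES = [
    (r"^", "foo", ["", "foo"]),
    (r"$", "foo", ["foo", ""]),
    (r"\b", "foo bar", ["", "foo", " ", "bar", ""]),
    (r"(?<=o)b", "foobar", ["foo", "ar"]),
    (r"o(?=b)", "foobar", ["fo", "bar"]),
]

def builtin_split(pattern, text):
    """First implementation under test: the standard library."""
    return re.split(pattern, text)

def manual_split(pattern, text):
    """Second implementation under test: a hand-written splitter."""
    pieces, last = [], 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])
        last = m.end()
    pieces.append(text[last:])
    return pieces

def failures(split_fn):
    """Return (pattern, text, got, expected) for every case that split_fn gets wrong."""
    return [(p, t, split_fn(p, t), want)
            for p, t, want in CASES if split_fn(p, t) != want]

for fn in (builtin_split, manual_split):
    print(fn.__name__, "failures:", failures(fn))
```

Keeping the cases in one table means a behavior change shows up as a diff in a single place, whichever module it affects.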

early draft PR appreciated (eg for early feedback)

timotheecour on 29 May 2020

@timotheecour Cool, thanks for the move and your input!

capocasa on 1 Jun 2020

factor code

I think it would be good to duplicate the tests but structure them to be reasonably easy to keep in sync manually. That keeps some additional moving parts out of the debugging process.

capocasa on 1 Jun 2020

structure them to be reasonably easy to keep in sync manually

if so, a single file instead of 1 file per function is preferable (much easier to manage); and as mentioned, that's what most other modules use

timotheecour on 1 Jun 2020

👍1

I'm messing with this from time to time. It's hard.

capocasa on 4 Jun 2020

do you have a PR draft ?

timotheecour on 4 Jun 2020

Nope! Very limited time but will return from time to time. Still understanding code, figuring out how to get a different code path for "foo".split(re"") and "foo".split(re"^"). If someone else were to look into this please post so we don't duplicate effort. Thanks!
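
One way to see how those two patterns differ is to list where each one matches. A small Python sketch (the helper name is hypothetical, and whether the PCRE-based re module reports the same positions is an assumption that would need checking):

```python
import re

def match_spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

print(match_spans("", "foo"))   # [(0, 0), (1, 1), (2, 2), (3, 3)]
print(match_spans("^", "foo"))  # [(0, 0)]
```

Both patterns produce only zero-width matches. The empty pattern matches at every position while ^ matches only at position 0, so a split driven by the reported match positions would not need to special-case either pattern.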

capocasa on 11 Jun 2020

@capocasa FWIW, you can probably copy/adapt the nim-regex code