Pydantic: change regex to use re.search or re.fullmatch

Created on 14 Jun 2020 · 4 Comments · Source: samuelcolvin/pydantic

Bug

Output of python -c "import pydantic.utils; print(pydantic.utils.version_info())":

             pydantic version: 1.5.1
            pydantic compiled: False
                 install path: /home/khan/.local/lib/python3.6/site-packages/pydantic
               python version: 3.6.9 (default, Nov  7 2019, 10:44:02)  [GCC 8.3.0]
                     platform: Linux-4.15.0-88-generic-x86_64-with-Ubuntu-18.04-bionic
     optional deps. installed: ['typing-extensions']

Related: #1396

import jsonschema
import pydantic

class Foo(pydantic.BaseModel):
    bar: str = pydantic.Field(..., regex='baz')

try:
    Foo(bar='bar baz quux')
except pydantic.ValidationError as e:
    print(e)
    # ValidationError: 1 validation error for Foo
    # bar
    #   string does not match regex "baz" (type=value_error.str.regex; pattern=baz)
else:
    print('Valid')

try:
    jsonschema.validate({'bar': 'bar baz quux'}, Foo.schema())
except jsonschema.ValidationError as e:
    print(e)
else:
    print('Valid')
    # Valid

In JSON Schema, all versions so far, the regular expression in the pattern keyword is treated as unanchored at both ends, i.e. re.search behavior rather than re.match or re.fullmatch.

Pydantic uses re.match to validate strings against the Field regex argument, as explained in #1396.

When constructing a JSON Schema from a model, Pydantic generates a pattern keyword from a Field regex argument, without any regex postprocessing. Thus, the resulting JSON Schema validates differently from the original Pydantic model.
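The mismatch can be reproduced with the standard library alone. This is a minimal sketch, assuming the model side behaves like re.match and the schema side like re.search, as described above. The helper names model_accepts and schema_accepts are hypothetical stand-ins and are not part of Pydantic or jsonschema.

```python
import re

PATTERN = 'baz'

def model_accepts(value: str) -> bool:
    # Stand-in for Pydantic 1.x field validation: implicit anchor at the start.
    return re.match(PATTERN, value) is not None

def schema_accepts(value: str) -> bool:
    # Stand-in for a JSON Schema validator: no implicit anchoring.
    return re.search(PATTERN, value) is not None

for value in ('baz quux', 'bar baz quux'):
    # Columns: value, model verdict, schema verdict.
    print(repr(value), model_accepts(value), schema_accepts(value))
```

The two stand-ins agree on 'baz quux' but disagree on 'bar baz quux', which is the same divergence as in the report above.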

I strongly feel Pydantic should follow suit and use re.search to validate fields with a regex argument, and explicitly call this out in the documentation. (Anchors in example usage are not sufficient.)

Alternatively, if you are concerned with backward compatibility, I propose the following long-term solution:

  1. Add a new Field (and constr) argument, perhaps named pattern after the JSON Schema keyword, mutually exclusive with the existing regex argument. This is an API extension and thus backward compatible.
  2. Implement validation for pattern using re.search.
  3. Change the behavior of BaseModel.schema to copy the pattern Field argument to the pattern schema keyword as is if present. This way, new users of the pattern argument get correct schema generation.
  4. One or more of:

    • Fix the behavior of BaseModel.schema so that a regex argument with value REGEX produces a schema pattern keyword with value ^(REGEX). (The grouping is necessary because the original regex may contain alternatives.) This way, users of the regex argument get schemas that behave consistently with the original Pydantic model.

    • Deprecate regex, suggesting pattern with explicit anchoring.

    • Emit a warning if a model containing fields with the regex argument is used to generate a JSON Schema. This way, users of the regex argument who are likely to be adversely affected by the inconsistency get a heads-up.
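To illustrate why the grouping in ^(REGEX) matters, here is a small sketch using only the re module. The helper anchored_pattern is hypothetical; it only shows the rewriting proposed above and is not an existing Pydantic function.

```python
import re

def anchored_pattern(regex: str) -> str:
    # Hypothetical helper: group first so '^' applies to every alternative.
    return '^(' + regex + ')'

regex = 'foo|bar'
naive = '^' + regex                 # '^foo|bar': only 'foo' is anchored
grouped = anchored_pattern(regex)   # '^(foo|bar)'

value = 'x bar'
print(re.match(regex, value) is not None)     # model-style check (re.match)
print(re.search(naive, value) is not None)    # schema-style check, naive anchoring
print(re.search(grouped, value) is not None)  # schema-style check, grouped anchoring
```

With the naive form, the unanchored alternative bar still matches in the middle of the string, so the schema would accept a value the model rejects. One caveat: a capturing group shifts the numbering of any backreferences in the original regex, so a real implementation would have to account for that.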

Labels: Change, Feedback Wanted, documentation

All 4 comments

This is not a bug, but a suggested change.

Happy to change this in v2. Personally I think re.fullmatch behaviour would be preferable to re.search, but I'm open to input from others.

However, it looks like JSON Schema is closer to re.search. Is that correct or are there any nuances?

As described in #1478, pydantic tries to follow 2019-09.

There are multiple aspects here.

  1. There exist many ways to test a string against a regular expression. In Python, they are re.search (no implicit anchoring), re.match (implicit anchor to the beginning of input), and re.fullmatch (implicit anchors to the beginning and end of input). This distinction is useful, and we, users of regular expressions, must be aware of it.

  2. JSON Schema unambiguously calls for unanchored regular expression testing; see Core § 6.4 paragraph 3 for an authoritative Draft 2019-09 reference. And yes, this is exactly re.search; the jsonschema library directly uses re.search in its implementation of pattern validation.

    Arguably, JSON Schema should have chosen implicit anchoring on both ends, but, unfortunately, the verbiage about anchoring first appeared only in Draft 04. This suggests that early implementations of Drafts 01–03 arbitrarily picked an unanchored test (because that’s what Perl’s =~ /foo/ does), and Draft 04 had to codify the existing practice.

  3. The choice of anchoring semantics in Pydantic is your prerogative as the library author and maintainer. No bug here but two mutually opposite possibilities for change.

  4. That choice should be explicitly and accurately documented. Currently, the documentation says:

    regex: for string values, this adds a Regular Expression validation generated from the passed string and an annotation of pattern to the JSON Schema

    Because the exact method is not mentioned, and JSON Schema pattern keyword is mentioned, the reader is led to believe that Pydantic treats regex the same way JSON Schema treats pattern, that is, unanchored. This is a documentation bug. It could be fixed by adding a note to the effect of “Pydantic uses re.match to validate strings”.

  5. My expectation, and likely your intent, was that a generated schema accepts exactly the same strings as the model it was generated from. Because the model and the schema use the same regular expression but different implicit anchoring semantics, that equivalence does not hold. I consider it a bug that the actual behavior does not match expectations. It could be fixed:

    • by changing Pydantic validation to use the same anchoring semantics, i.e. re.search

      • con: breaking change
      • pro: users who deal with both Pydantic and JSON Schema have an easier time remembering that regexen are not implicitly anchored
    • or by exporting a different, explicitly anchored regex into the schema

      • pro: mostly backward-compatible
      • con: users have to remember each tool’s anchoring semantics
      • con: the generated schema becomes less readable, e.g. a regex='^foo$' becomes "pattern": "^(^foo$)"
      • This could be further solved by clever regex rewriting, but it’s a nontrivial exercise.
    • or by adjusting expectations, such as documenting Pydantic’s use of re.match, calling out the difference in behavior with JSON Schema validators and suggesting that users who need schema interoperability always make sure their regexen are either explicitly anchored with ^ or explicitly unanchored with .*.

      • pro: fully backward-compatible
      • con(?): users have to take care composing regular expressions
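The last option above (always anchoring explicitly with ^ or unanchoring explicitly with .*) can be checked with a short sketch. The helper agrees is hypothetical and simply compares the accept/reject verdicts of re.match and re.search. The sample values are single-line strings; because . does not match a newline by default, the .* form can still diverge on multi-line input.

```python
import re

def agrees(pattern: str, value: str) -> bool:
    # True when re.match and re.search give the same accept/reject verdict.
    return (re.match(pattern, value) is None) == (re.search(pattern, value) is None)

values = ['baz', 'baz quux', 'bar baz quux']

for pattern in ('baz', '^baz', '.*baz'):
    # Prints the pattern and whether both functions agree on every sample value.
    print(pattern, all(agrees(pattern, v) for v in values))
```

Only the bare pattern baz produces a disagreement on these samples; the two explicit forms make the implicit anchoring of re.match irrelevant.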

I agree with all of this. Happy to accept a PR to correct the documentation.

In v2 we have the chance to make breaking changes and get things right. I personally think using re.match() was definitely a mistake: we should use either unanchored regexes or regexes anchored at both beginning and end. re.match() is the most confusing choice.

We could either switch to re.search or re.fullmatch, I'm flexible about which. Like you said: JSON Schema is most like re.search, but re.fullmatch is probably the least confusing in most scenarios.
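For reference, the three candidates can be compared side by side on the pattern from the original report. This is an illustrative sketch using only the standard re module; the helper verdicts is hypothetical.

```python
import re

def verdicts(pattern: str, value: str) -> dict:
    # Map each function name to whether it accepts the value.
    return {
        func.__name__: func(pattern, value) is not None
        for func in (re.search, re.match, re.fullmatch)
    }

for value in ('baz', 'baz quux', 'bar baz', 'bar baz quux'):
    print(repr(value), verdicts('baz', value))
```

re.search accepts all four values, re.match accepts only those that start with baz, and re.fullmatch accepts only the exact string baz.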

I will accept whichever seems most popular here. @yurikhan you might be the only person other than me who cares enough to think about this; in that case I'll accept your preference.

Doc patch will follow. Not immediately but possibly on the weekend.

+1 to re.match being confusing — of the three, it is simplest to implement and hardest to use correctly.

As for behavior in v2, my vote is for re.search, because it achieves model/schema equivalence with minimum hassle.
