Pandas: Add support for "regex" library

Created on 24 Aug 2018 · 13 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

```python
import re
import pandas as pd
import regex

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "1", "2"]})
pattern = r"\d"

df.b.str.match(pattern)
df.b.str.match(re.compile(pattern))
df.b.str.match(regex.compile(pattern))     # raises TypeError
```

```
TypeError                                 Traceback (most recent call last)
in ()
      9 df.b.str.match(pattern)
     10 df.b.str.match(re.compile(pattern))
---> 11 df.b.str.match(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in match(self, pat, case, flags, na, as_indexer)
   2421     def match(self, pat, case=True, flags=0, na=np.nan, as_indexer=None):
   2422         result = str_match(self._data, pat, case=case, flags=flags, na=na,
-> 2423                            as_indexer=as_indexer)
   2424         return self._wrap_result(result)
   2425 

~/.virtualenvs/edgar/lib/python3.6/site-packages/pandas/core/strings.py in str_match(arr, pat, case, flags, na, as_indexer)
    736         flags |= re.IGNORECASE
    737 
--> 738     regex = re.compile(pat, flags=flags)
    739 
    740     if (as_indexer is False) and (regex.groups > 0):

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern
```

A simpler way to demonstrate the problem is:
```python
re.compile(regex.compile(pattern))
TypeError                                 Traceback (most recent call last)
<ipython-input-64-38578ab20aeb> in <module>()
----> 1 re.compile(regex.compile(pattern))

~/.virtualenvs/edgar/lib/python3.6/re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232     "Compile a regular expression pattern, returning a pattern object."
--> 233     return _compile(pattern, flags)
    234 
    235 def purge():

~/.virtualenvs/edgar/lib/python3.6/re.py in _compile(pattern, flags)
    298         return pattern
    299     if not sre_compile.isstring(pattern):
--> 300         raise TypeError("first argument must be string or compiled pattern")
    301     p = sre_compile.compile(pattern, flags)
    302     if not (flags & DEBUG):

TypeError: first argument must be string or compiled pattern
```

Problem description

The regex library does not seem to be supported by pandas. I'm not sure whether you want to add support for it, but I had a quick look and it seems relatively straightforward to do (plus it would make maintenance easier for projects that have already opted for regex).

How to fix

I think the required steps are:

  1. pandas.core.dtypes.inference.is_re should return True for regex compiled patterns too (assuming that regex is installed, of course).
  2. Make sure to call is_re before re.compile() (as is already done e.g. here):
if not is_re(pat):
    pat = re.compile(pat, flags)
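
The two steps above could look roughly like the following. This is only a sketch, not the actual pandas code: `_PATTERN_TYPES` and `compile_if_needed` are hypothetical names, and `is_re` here is a standalone stand-in for `pandas.core.dtypes.inference.is_re`. It assumes regex is an optional dependency that may not be installed.

```python
import re

try:
    import regex  # optional third-party library; may be absent
except ImportError:
    regex = None

# Compiled-pattern types to recognise (hypothetical name).
# Always include the stdlib type; add regex's type only if it is installed.
_PATTERN_TYPES = (type(re.compile("")),)
if regex is not None:
    _PATTERN_TYPES += (type(regex.compile("")),)


def is_re(obj):
    """Return True if obj is a compiled pattern from re (or regex, if available)."""
    return isinstance(obj, _PATTERN_TYPES)


def compile_if_needed(pat, flags=0):
    """Compile pat with re unless it is already a compiled pattern (hypothetical helper)."""
    if not is_re(pat):
        pat = re.compile(pat, flags)
    return pat
```

With a guard like this, an already compiled pattern is passed through untouched instead of being handed to `re.compile()`, which is what raises the TypeError above.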

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.17.5-1-ARCH
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.7.1
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.5
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.6.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Enhancement, Strings

All 13 comments

Are patterns compiled by regex instances of typing.re.Pattern?

No they are not.

import re
import typing
import regex

re_pat = re.compile(r"\d")
regex_pat = regex.compile(r"\d")

re_pat.__class__.mro()                # [_sre.SRE_Pattern, object]
isinstance(re_pat, typing.Pattern)    # True

regex_pat.__class__.mro()                # [_regex.Pattern, object]
isinstance(regex_pat, typing.Pattern)    # False

Hi @pmav99, any luck with this? Or did you happen to create a workaround for yourself?

@madimov, I think I used vanilla re for pandas, and regex for everything else. Not nice, but there was no feedback and I needed to move on.

We could probably optionally import regex and append its type to the re types we handle.

@TomAugspurger that would be great! Might you have the time to give it a go?

No, I won't have time.

@TomAugspurger

> Are patterns compiled by regex instances of typing.re.Pattern?

If the answer to this is "no", then that's an upstream bug IMO.

That said, here is my attempt at a fix: https://github.com/pandas-dev/pandas/compare/master...gwerbin:patch-2

I just made the edits here on GitHub, so I haven't actually run any tests yet.

> Might you have the time to give it a go?

Tom didn't have time, but PRs are welcome.

@jbrockmendel did you take a look at my proposed patch? It will probably need a major rebase obviously. Just want to make sure what I did is an acceptable approach before I put more time into it.

That looks roughly correct. You'll need to update some of the CI envs in ci/deps to include regex and skip the test if it isn't present.

@gwerbin thanks for pinging on this. Yah, that looks a lot less invasive than I expected, seems reasonable.
