Scikit-learn: CountVectorizer and TfidfVectorizer docs do not mention token_pattern gets ignored when passing a custom tokenizer

Created on 29 Nov 2019 · 3 Comments · Source: scikit-learn/scikit-learn

Description

The documentation for CountVectorizer and TfidfVectorizer is not clear about the interaction between token_pattern and a custom tokenizer. Currently, when a tokenizer is passed, token_pattern is ignored. But the docstring entry for the tokenizer parameter only says: "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me, it was not immediately clear that this meant token_pattern was not used at all.
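A minimal sketch of the behavior described above, assuming a scikit-learn install where build_tokenizer() returns the custom callable whenever one is supplied. The sample text and the three-character pattern are illustrative choices, not taken from the issue.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "a bb ccc"

# With no custom tokenizer, the default token_pattern is used; it keeps
# only tokens made of two or more word characters.
default_tokens = CountVectorizer().build_tokenizer()(text)

# With a custom tokenizer, build_tokenizer() hands back that callable,
# so the (illustrative) three-character pattern below is never consulted.
custom = CountVectorizer(tokenizer=str.split, token_pattern=r"(?u)\b\w\w\w+\b")
custom_tokens = custom.build_tokenizer()(text)

print(default_tokens)  # ['bb', 'ccc']
print(custom_tokens)   # ['a', 'bb', 'ccc']
```

If token_pattern were applied on top of the tokenizer, the second result would contain only 'ccc'; instead every whitespace-separated token survives.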

Here's a user who got thrown by this: Stack Overflow

Some things I can think of:

raise a warning if the user passes a (non-standard) token pattern and a custom tokenizer
update the docstring to be explicit about the interaction
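The first suggestion could look roughly like the sketch below. The helper name warn_if_token_pattern_unused and its message are hypothetical and not part of scikit-learn; the default pattern string is the one documented for these vectorizers.

```python
import warnings

# Documented default token_pattern of CountVectorizer and TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def warn_if_token_pattern_unused(tokenizer, token_pattern=DEFAULT_TOKEN_PATTERN):
    """Hypothetical check, not a scikit-learn function.

    Warns when a custom tokenizer is combined with a non-default
    token_pattern, and returns True in that case.
    """
    unused = tokenizer is not None and token_pattern != DEFAULT_TOKEN_PATTERN
    if unused:
        warnings.warn(
            "token_pattern is ignored because a custom tokenizer was passed",
            UserWarning,
        )
    return unused


# A custom tokenizer plus a non-default pattern triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_token_pattern_unused(str.split, r"\w+")

print(flagged, len(caught))
```

Comparing against the default string keeps the check quiet for users who pass only a tokenizer, which matches the "(non-standard)" qualifier in the suggestion.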

Source

stephantul

All 3 comments

Warnings should be present in 0.23rc3. Try it for us?

jnothman on 30 Nov 2019

Sure. The warning (UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None') indeed does show up, my bad for not checking it first. If you want, I can create a PR with some doc edits that state what is going on, but perhaps the warning is enough.

stephantul on 30 Nov 2019
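One way to check for the warning mentioned above is to record warnings while fitting. This is only a sketch: whether a warning is emitted, and its exact text, depends on the installed scikit-learn version, so the snippet prints whatever it captures rather than assuming a result. The one-document corpus is an illustrative assumption.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

# Custom tokenizer plus a non-default token_pattern, as in the issue.
vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=r"\w+")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure nothing is filtered out
    vectorizer.fit(["a bb ccc"])

# Collect any UserWarning messages raised during fit.
messages = [str(w.message) for w in caught if issubclass(w.category, UserWarning)]
for message in messages:
    print(message)

# The vocabulary comes from the custom tokenizer alone.
print(sorted(vectorizer.vocabulary_))
```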

The warning is new. Let's see how it goes

jnothman on 1 Dec 2019

👍3
