Scikit-learn: CountVectorizer and TfidfVectorizer docs do not mention token_pattern gets ignored when passing a custom tokenizer

Created on 29 Nov 2019 · 3 Comments · Source: scikit-learn/scikit-learn

Description

The documentation for CountVectorizer and TfidfVectorizer is not clear about the interaction between token_pattern and a custom tokenizer. Currently, when a tokenizer is passed, token_pattern is ignored. But the docstring entry for the tokenizer parameter only says "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me, it was not immediately clear that this meant token_pattern was not used at all.
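A minimal sketch of the behaviour described above, assuming a reasonably recent scikit-learn install. The sample sentence, the three-character pattern, and the use of `str.split` as the tokenizer are illustrative choices, not taken from the original report.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = "a bb ccc"
# Illustrative pattern: keep only words of three or more characters.
pattern = r"(?u)\b\w{3,}\b"

# No custom tokenizer: token_pattern decides what counts as a token.
by_pattern = CountVectorizer(token_pattern=pattern)
pattern_tokens = by_pattern.build_analyzer()(doc)

# Custom tokenizer: the same token_pattern is no longer consulted,
# so the short words come back.
by_tokenizer = CountVectorizer(token_pattern=pattern, tokenizer=str.split)
tokenizer_tokens = by_tokenizer.build_analyzer()(doc)

print(pattern_tokens)
print(tokenizer_tokens)
```

The first analyzer should drop "a" and "bb", while the second keeps them, because the tokenizer replaces the pattern-based step entirely.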

Here's a user who got thrown by this: Stack Overflow

Some things I can think of:

  • raise a warning if the user passes a (non-standard) token pattern and a custom tokenizer
  • update the docstring to be explicit about the interaction
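The first suggestion could look roughly like the sketch below. `warn_if_pattern_ignored` is a hypothetical helper written for illustration only; it is not scikit-learn code, and the warning text is made up. The default pattern is the documented default value of token_pattern.

```python
import warnings

# Documented default of token_pattern in CountVectorizer / TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def warn_if_pattern_ignored(token_pattern, tokenizer):
    """Hypothetical check: warn when a non-default token_pattern is
    combined with a custom tokenizer. Returns True if it warned."""
    if tokenizer is not None and token_pattern != DEFAULT_TOKEN_PATTERN:
        warnings.warn(
            "token_pattern is ignored because a custom tokenizer was passed",
            UserWarning,
        )
        return True
    return False


# Custom pattern plus custom tokenizer: this is the case worth flagging.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_pattern_ignored(r"(?u)\b\w{3,}\b", str.split)

# Default pattern plus custom tokenizer: stays silent in this sketch.
silent = warn_if_pattern_ignored(DEFAULT_TOKEN_PATTERN, str.split)
```

Restricting the check to non-default patterns avoids warning everyone who passes a tokenizer and never touched token_pattern.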

All 3 comments

The warning should be present in 0.23rc3. Can you try it for us?

Sure. The warning (UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None') indeed does show up, my bad for not checking it first. If you want, I can create a PR with some doc edits that state what is going on, but perhaps the warning is enough.
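For anyone who wants to check this on their own install, here is a minimal sketch that records the warning, assuming scikit-learn 0.23 or later. The sample document and the `str.split` tokenizer are illustrative.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    token_pattern=r"(?u)\b\w{3,}\b",  # illustrative custom pattern
    tokenizer=str.split,              # custom tokenizer takes over
)

# Record warnings instead of printing them so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    vectorizer.fit(["a bb ccc"])

messages = [
    str(w.message) for w in caught if issubclass(w.category, UserWarning)
]
for message in messages:
    print(message)

# The vocabulary reflects the tokenizer, not the pattern.
print(sorted(vectorizer.vocabulary_))
```

On versions that include the change, one of the recorded messages should mention token_pattern, and the fitted vocabulary should still contain the short tokens that the pattern alone would have dropped.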

The warning is new. Let's see how it goes
