The documentation for CountVectorizer and TfidfVectorizer is not clear about the interaction between token_pattern and a custom tokenizer. Currently, when a tokenizer is passed, token_pattern is ignored. But the docstring entry for the tokenizer parameter only says "Override the string tokenization step while preserving the preprocessing and n-grams generation steps." To me, it was not immediately clear that this meant token_pattern is not used at all.
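To make the behavior concrete, here is a minimal sketch of what I mean. The sample text and the choice of str.split as the custom tokenizer are only illustrative assumptions. It uses build_tokenizer() to compare the default tokenization with what happens once a tokenizer is passed.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "a bb ccc"  # illustrative input, not from the original report

# With no custom tokenizer, the default token_pattern is applied.
# It only matches tokens of two or more word characters, so "a" is dropped.
default_tokens = CountVectorizer().build_tokenizer()(text)

# With a custom tokenizer, the callable is used as-is and token_pattern
# plays no role, even when it is passed explicitly.
custom = CountVectorizer(tokenizer=str.split, token_pattern=r"(?u)\b\w\w+\b")
custom_tokens = custom.build_tokenizer()(text)

print(default_tokens)
print(custom_tokens)
```

The single-character token survives in the second case, which is exactly the part the docstring does not spell out.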
Here's a user who got thrown by this: Stack Overflow
Some things I can think of:
A warning for this should be present in 0.23rc3. Could you try it for us?
Sure. The warning (UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None') does indeed show up; my bad for not checking it first. If you want, I can create a PR with some doc edits that state what is going on, but perhaps the warning is enough.
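For anyone who wants to check this on their own install, here is a rough sketch of how the warning can be observed, assuming scikit-learn 0.23 or later. The helper fit_and_collect_warnings and the sample documents are hypothetical names made up for this example. Passing token_pattern=None to silence the warning reflects my reading of the current implementation, so treat it as an assumption rather than documented behavior.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

docs = ["a bb ccc", "bb ccc dddd"]  # illustrative documents


def fit_and_collect_warnings(vectorizer):
    """Fit the vectorizer and return any warning messages mentioning token_pattern."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        vectorizer.fit(docs)
    return [str(w.message) for w in caught if "token_pattern" in str(w.message)]


# tokenizer is set and token_pattern keeps its default: the warning is expected.
with_pattern = fit_and_collect_warnings(CountVectorizer(tokenizer=str.split))

# Assumption: setting token_pattern=None signals that the pattern is unused
# on purpose, so no warning should be raised.
without_pattern = fit_and_collect_warnings(
    CountVectorizer(tokenizer=str.split, token_pattern=None)
)

print(with_pattern)
print(without_pattern)
```

Note that the warning appears at fit time, not when the vectorizer is constructed.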
The warning is new. Let's see how it goes