Benchmarks on real data have steered me towards this token filter as other forms of stemmer are generally too aggressive for ecommerce (e.g. loafers==loaf).
Good plural stemming is what is really needed, because most user searches are plural while product descriptions are singular (e.g. a search for "dresses" should match the product "red dress").
Good examples of plural stemming by this filter include:
Search string|Good stemmed form
-----|--------
cases|case
shades|shade
bottles|bottle
However, these terms fail to match because of bad stemming:
Search string | Bad stemmed form
---------|-----------
dresses|dresse
watches|watche
brushes|brushe
boxes|boxe
Example reproduction:
DELETE test
PUT test
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"filter_english_minimal"
]
}
},
"filter": {
"filter_english_minimal": {
"type": "stemmer",
"name": "minimal_english"
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
}
}
POST test/_doc/1
{
"name":"red dress"
}
# Does not match (search stems to "dresse")
GET test/_search
{
"query":{
"match":{
"name":"dresses"
}
}
}
It would be good to fix these poor examples of stemming, but we would obviously need to worry about backwards compatibility.
Pinging @elastic/es-search
Is there a general rule? We can't just remove the e before the s all the time, e.g. forces, votes, xylophones, etc.
If this is a hard problem, maybe an alternative would be to recommend using a synonym filter for those terms that get frequently misstemmed. A good list to start with could be added to our docs.
If we want to change the behavior of this stemmer, I'd rather make a new one since this one is a direct implementation of a stemmer that is documented in a paper.
> Is there a general rule? We can't just remove the e before the s all the time.

I can do some digging, but for a start I would expect *ss, *tch, *x and *sh to be patterns where the es part of a plural can always be removed.
For reference - crossword solvers:
sses examples
tches examples
shes examples
xes examples
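
The suffix patterns suggested above can be sketched in a few lines of Python. This is only an illustration of the proposed rule, not the Lucene code: the names `ES_SUFFIXES` and `strip_es` are made up for the example, and the suffix list is just the four patterns named in this comment.

```python
# Suffixes for which the whole "es" is assumed to be the plural marker.
ES_SUFFIXES = ("sses", "tches", "shes", "xes")

def strip_es(word: str) -> str:
    """Remove a trailing "es" when the word ends in one of ES_SUFFIXES."""
    if word.endswith(ES_SUFFIXES):
        return word[:-2]
    return word

for plural in ("dresses", "watches", "brushes", "boxes", "forces"):
    print(plural, "->", strip_es(plural))
```

Words such as forces or votes do not end in one of the listed suffixes, so this rule leaves them untouched for the ordinary trailing-s rule to handle.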
> If we want to change the behavior of this stemmer, I'd rather make a new one

Probably for a different issue, but it would be good to consider how we name token filters and deal with BWC.
We end up with a proliferation of analyzer/tokenizer names: e.g. "english_light", "english_minimal" and now perhaps "english_better_minimal" as we evolve.
Perhaps it would be useful to let the names convey the intention (e.g. stemming plurals only) but also include a version number component that allows us to evolve the details of the implementation, e.g. english_minimal_2019.
> this one is a direct implementation of a stemmer that is documented in a paper.

According to the javadocs it's called "S stemmer".
I took the Signal Media million news dataset and used this script to benchmark my proposals. I measured the recall gain that could be had from removing the extra "e" for a number of suffixes.
In the tables below, note that the figures for "Proposed stem count" are all false negatives we have today when using the S-stemmer, so there is a lot of uplift for what I hope is a little loss.
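
A rough Python sketch of how such a comparison could be produced is shown here. It is not the script referenced above: `term_counts`, `recall_table` and the toy numbers (copied from two rows of the sses table) are assumptions for illustration, and the real benchmark may count documents rather than term occurrences.

```python
from collections import Counter

def recall_table(term_counts, suffix, drop):
    """List plurals ending in `suffix` with counts for two candidate stems.

    `drop` is how many trailing characters the proposed rule removes
    (2 for "es"). The S-stemmer column is the plural minus its final "s",
    which is what it yields for the suffixes benchmarked here.
    """
    rows = []
    for term, count in term_counts.most_common():
        if not term.endswith(suffix):
            continue
        proposed = term[:-drop]
        s_stem = term[:-1]
        # Counter returns 0 for terms that never occur in the corpus.
        rows.append((term, count,
                     proposed, term_counts[proposed],
                     s_stem, term_counts[s_stem]))
    return rows

# Toy counts taken from two rows of the sses table.
counts = Counter({"dresses": 2502, "dress": 9588,
                  "classes": 11946, "class": 48412, "classe": 23})
for row in recall_table(counts, "sses", drop=2):
    print(" | ".join(str(cell) for cell in row))
```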
For sses plurals this was a very positive uplift in recall for the most popular terms. Most of the rare non-zero matches for the S stemmer forms are questionable: presses is rarely likely to appear in English as the plural of presse. The valid exceptions that retain the e are crevasse and posse.
Popular wins for typical ecommerce site searches would include products such as dresses and mattresses.
Plural|count|Proposed stem|Proposed stem count|S stemmer|S stemmer count
---|---|---|---|---|---|
businesses | 38698 | business | 143764 |
processes | 14592 | process | 69044 |
losses | 13164 | loss | 42859 |
classes | 11946 | class | 48412 | classe | 23
passes | 10998 | pass | 29525 | passe | 42
addresses | 7491 | address | 48607 |
witnesses | 5929 | witness | 8866 |
weaknesses | 3406 | weakness | 5481 |
discusses | 3301 | discuss | 25244 |
successes | 3071 | success | 47715 |
glasses | 3036 | glass | 12610 |
bosses | 2609 | boss | 13511 | bosse | 25
masses | 2594 | mass | 20427 | masse | 499
dresses | 2502 | dress | 9588 |
illnesses | 2316 | illness | 6664 |
misses | 2070 | miss | 25704 |
crosses | 1692 | cross | 27400 | crosse | 221
encompasses | 1630 | encompass | 1067 |
sunglasses | 1551 | sunglass | 76 |
stresses | 1511 | stress | 11000 |
possesses | 1384 | possess | 3022 |
progresses | 1298 | progress | 23792 |
expresses | 1296 | express | 15843 |
actresses | 1165 | actress | 10873 |
kisses | 916 | kiss | 3228 |
assesses | 825 | assess | 6807 | assesse | 4
presses | 820 | press | 82998 | presse | 1068
mattresses | 556 | mattress | 931 |
dismisses | 536 | dismiss | 2207 |
harnesses | 484 | harness | 2438 |
excesses | 438 | excess | 6456 |
grasses | 409 | grass | 4848 | grasse | 23
surpasses | 408 | surpass | 1397 |
confesses | 366 | confess | 951 |
guesses | 350 | guess | 11651 |
messes | 335 | mess | 5290 | messe | 145
impresses | 325 | impress | 2589 |
princesses | 323 | princess | 3624 | princesse | 9
eyewitnesses | 321 | eyewitness | 971 |
tosses | 317 | toss | 3223 |
asses | 233 | ass | 2104 | asse | 19
accesses | 224 | access | 66139 |
molasses | 218 | molass | 0 |
tresses | 217 | tress | 59 |
eyeglasses | 209 | eyeglass | 86 |
carcasses | 189 | carcass | 270 |
blesses | 184 | bless | 2349 |
suppresses | 156 | suppress | 971 |
recesses | 152 | recess | 884 |
ulysses | 143 | ulyss | 0 | ulysse | 13
waitresses | 141 | waitress | 483 |
mistresses | 137 | mistress | 539 |
goddesses | 137 | goddess | 699 |
gasses | 132 | gass | 73 |
busses | 118 | buss | 84 | busse | 20
congresses | 116 | congress | 20118 |
thicknesses | 112 | thickness | 1096 |
bypasses | 105 | bypass | 1559 |
compresses | 104 | compress | 189 |
masterclasses | 100 | masterclass | 329 |
posses | 91 | poss | 36 | posse | 252
professes | 89 | profess | 287 |
glosses | 87 | gloss | 1024 |
likenesses | 86 | likeness | 621 |
hostesses | 80 | hostess | 383 |
pisses | 70 | piss | 402 |
overpasses | 69 | overpass | 352 |
grosses | 64 | gross | 9838 | grosse | 104
trusses | 62 | truss | 429 |
disses | 59 | diss | 226 | disse | 10
lionesses | 58 | lioness | 74 |
agribusinesses | 58 | agribusiness | 467 |
basses | 56 | bass | 3483 | basse | 17
subclasses | 49 | subclass | 69 |
obsesses | 44 | obsess | 171 |
represses | 42 | repress | 89 |
buttresses | 41 | buttress | 137 |
trespasses | 40 | trespass | 376 |
crevasses | 40 | crevass | 0 | crevasse | 39
plusses | 40 | pluss | 11 |
caresses | 36 | caress | 115 |
depresses | 33 | depress | 180 |
amasses | 33 | amass | 273 |
headdresses | 32 | headdress | 83 |
embarrasses | 32 | embarrass | 396 |
stewardesses | 31 | stewardess | 44 |
sundresses | 29 | sundress | 42 |
Another positive uplift, this time for shes plurals. Disagreements with the S stemmer, like ashe or rushe, tended to be people's names, which again would not account for the majority of the plural forms.
Popular wins for typical ecommerce site searches would include brushes and dishes.
Plural|count|Proposed stem|Proposed stem count|S stemmer|S stemmer count
---|---|---|---|---|---|
wishes | 5284 | wish | 20877 |
dishes | 4470 | dish | 4548 |
clashes | 3388 | clash | 8264 |
finishes | 3039 | finish | 21923 |
publishes | 2376 | publish | 5568 |
pushes | 2188 | push | 21428 |
ashes | 2179 | ash | 2203 | ashe | 1418
crashes | 2143 | crash | 12997 |
establishes | 1163 | establish | 11758 |
flashes | 1150 | flash | 7612 |
bushes | 886 | bush | 8213 |
rushes | 864 | rush | 8995 | rushe | 7
brushes | 738 | brush | 3491 |
parishes | 733 | parish | 3146 |
lashes | 697 | lash | 516 |
washes | 565 | wash | 5493 |
distinguishes | 364 | distinguish | 1718 |
flourishes | 347 | flourish | 1611 |
unleashes | 344 | unleash | 1235 |
diminishes | 321 | diminish | 1061 |
fishes | 286 | fish | 9491 |
crushes | 282 | crush | 2946 |
accomplishes | 256 | accomplish | 4052 |
blemishes | 254 | blemish | 328 |
refreshes | 224 | refresh | 2162 |
skirmishes | 220 | skirmish | 256 |
relishes | 215 | relish | 862 |
smashes | 214 | smash | 2139 |
splashes | 204 | splash | 2338 |
dashes | 197 | dash | 2802 | dashe | 7
rashes | 172 | rash | 1290 |
eyelashes | 171 | eyelash | 107 |
punishes | 168 | punish | 1522 |
toothbrushes | 161 | toothbrush | 383 |
radishes | 154 | radish | 173 |
polishes | 152 | polish | 2726 |
slashes | 151 | slash | 1729 |
marshes | 149 | marsh | 1876 |
gushes | 142 | gush | 203 |
nourishes | 140 | nourish | 419 |
blushes | 131 | blush | 714 |
vanishes | 124 | vanish | 412 |
cherishes | 111 | cherish | 1035 |
leashes | 110 | leash | 584 |
meshes | 87 | mesh | 1432 |
languishes | 82 | languish | 220 |
furnishes | 81 | furnish | 512 |
flushes | 80 | flush | 1162 |
bashes | 79 | bash | 1653 | bashe | 7
replenishes | 79 | replenish | 462 |
squashes | 68 | squash | 1143 |
sashes | 66 | sash | 219 |
garnishes | 66 | garnish | 700 |
ambushes | 64 | ambush | 610 |
trashes | 60 | trash | 3324 |
quashes | 59 | quash | 225 |
hashes | 56 | hash | 783 |
demolishes | 56 | demolish | 482 |
mouthwashes | 49 | mouthwash | 188 |
backsplashes | 48 | backsplash | 135 |
tarnishes | 43 | tarnish | 311 |
paintbrushes | 41 | paintbrush | 166 |
admonishes | 39 | admonish | 53 |
cashes | 39 | cash | 30046 |
varnishes | 35 | varnish | 175 |
fetishes | 34 | fetish | 256 |
stashes | 34 | stash | 869 |
perishes | 32 | perish | 392 |
banishes | 31 | banish | 261 |
whitewashes | 29 | whitewash | 297 |
refurbishes | 28 | refurbish | 209 |
fleshes | 28 | flesh | 2287 |
gashes | 28 | gash | 253 |
abolishes | 25 | abolish | 800 |
mashes | 23 | mash | 933 |
brandishes | 21 | brandish | 56 |
embellishes | 17 | embellish | 152 |
thrashes | 14 | thrash | 397 |
rehashes | 14 | rehash | 98 |
thrushes | 13 | thrush | 75 |
dervishes | 13 | dervish | 30 |
extinguishes | 13 | extinguish | 541 |
lavishes | 10 | lavish | 1283 |
noshes | 9 | nosh | 75 |
vanquishes | 9 | vanquish | 78 |
sloshes | 7 | slosh | 18 |
reestablishes | 7 | reestablish | 108 |
swashes | 7 | swash | 15 |
refinishes | 7 | refinish | 72 |
knishes | 5 | knish | 9 |
stoushes | 5 | stoush | 63 |
Another good uplift in recall for popular terms, this time for tches plurals.
Disagreement with the S-stemmer is limited to WATCHe, which looks to be a brand name and certainly should not be seen as the stem of the popular term watches.
These words are mostly nouns and would benefit most ecommerce-type searches.
Plural|count|Proposed stem|Proposed stem count|S stemmer|S stemmer count
---|---|---|---|---|---|
matches | 16535 | match | 39680 |
catches | 4186 | catch | 17512 |
watches | 2966 | watch | 56591 | watche | 5
pitches | 2426 | pitch | 11668 |
stretches | 1964 | stretch | 8756 |
switches | 1769 | switch | 9901 |
patches | 1523 | patch | 4377 |
sketches | 889 | sketch | 2123 |
batches | 888 | batch | 3122 |
stitches | 694 | stitch | 688 |
scratches | 648 | scratch | 3301 |
witches | 540 | witch | 1492 |
smartwatches | 518 | smartwatch | 966 |
glitches | 464 | glitch | 564 |
clutches | 448 | clutch | 1770 |
crutches | 402 | crutch | 148 |
ditches | 289 | ditch | 2272 |
dispatches | 285 | dispatch | 2112 |
notches | 277 | notch | 2482 |
bitches | 177 | bitch | 1002 |
hatches | 172 | hatch | 1152 |
swatches | 170 | swatch | 181 |
mismatches | 132 | mismatch | 481 |
latches | 124 | latch | 519 |
snatches | 107 | snatch | 831 |
hitches | 101 | hitch | 715 |
fetches | 77 | fetch | 926 |
wristwatches | 39 | wristwatch | 109 |
britches | 35 | britch | 1 |
blotches | 33 | blotch | 10 |
twitches | 33 | twitch | 518 |
rematches | 26 | rematch | 846 |
itches | 24 | itch | 340 |
splotches | 20 | splotch | 8 |
crotches | 17 | crotch | 182 |
scotches | 15 | scotch | 638 |
cwtches | 15 | cwtch | 5 |
etches | 14 | etch | 122 |
masterbatches | 14 | masterbatch | 18 |
wretches | 13 | wretch | 44 |
snitches | 12 | snitch | 74 |
despatches | 12 | despatch | 125 |
botches | 9 | botch | 86 |
iwatches | 7 | iwatch | 42 |
Reasonable uplift over the S stemmer for xes plurals.
Exceptions like taxe and boxe are, again, names or acronyms that wouldn't be the usual interpretation of the plural form from which they stem.
Ecommerce sites will benefit from matches on the various boxes they sell (from gearboxes to jewellery boxes). One noticeable false stem is axes: the proposal reduces it to ax, although axe is the more common singular in this dataset (963 vs 371).
Plural|count|Proposed stem|Proposed stem count|S stemmer|S stemmer count
---|---|---|---|---|---|
taxes | 10477 | tax | 24595 | taxe | 4
boxes | 5351 | box | 26638 | boxe | 8
indexes | 1920 | index | 19179 |
fixes | 1418 | fix | 10755 | fixe | 68
mixes | 970 | mix | 18027 |
complexes | 617 | complex | 24480 | complexe | 4
foxes | 591 | fox | 14241 |
sixes | 398 | six | 87451 |
sexes | 378 | sex | 16266 | sexe | 6
axes | 281 | ax | 371 | axe | 963
remixes | 252 | remix | 1153 |
exes | 238 | ex | 17191 | exe | 112
relaxes | 195 | relax | 3978 |
reflexes | 191 | reflex | 343 |
mailboxes | 182 | mailbox | 837 |
hoaxes | 135 | hoax | 1187 |
inboxes | 114 | inbox | 3963 |
annexes | 111 | annex | 514 | annexe | 46
waxes | 109 | wax | 1122 |
multiplexes | 105 | multiplex | 185 |
gearboxes | 86 | gearbox | 646 |
flexes | 72 | flex | 1475 |
faxes | 72 | fax | 3908 |
lunchboxes | 62 | lunchbox | 234 |
duplexes | 56 | duplex | 367 |
paradoxes | 54 | paradox | 566 |
tuxes | 49 | tux | 192 |
climaxes | 45 | climax | 1000 |
sandboxes | 45 | sandbox | 330 |
influxes | 34 | influx | 4764 |
maxes | 32 | max | 8292 |
prefixes | 30 | prefix | 114 |
coaxes | 29 | coax | 287 |
toolboxes | 28 | toolbox | 364 |
nixes | 25 | nix | 222 |
premixes | 25 | premix | 25 |
vortexes | 24 | vortex | 381 |
fluxes | 23 | flux | 650 |
suplexes | 22 | suplex | 64 |
shoeboxes | 22 | shoebox | 106 |
equinoxes | 22 | equinox | 479 |
vexes | 22 | vex | 57 |
hotfixes | 21 | hotfix | 38 |
connexes | 20 | connex | 16 |
xerxes | 18 | xerx | 6 |
alexes | 18 | alex | 12931 |
suffixes | 17 | suffix | 67 |
checkboxes | 16 | checkbox | 277 |
bugfixes | 15 | bugfix | 7 |
crucifixes | 15 | crucifix | 168 |
jukeboxes | 14 | jukebox | 141 |
letterboxes | 13 | letterbox | 83 |
saxes | 13 | sax | 219 | saxe | 45
subindexes | 13 | subindex | 34 |
hexes | 13 | hex | 171 |
suezmaxes | 12 | suezmax | 69 |
unboxes | 12 | unbox | 31 |
perplexes | 12 | perplex | 21 |
affixes | 11 | affix | 108 |
detoxes | 11 | detox | 431 |
pickaxes | 9 | pickax | 5 | pickaxe | 8
rolexes | 9 | rolex | 285 |
apexes | 8 | apex | 1214 |
xboxes | 8 | xbox | 2671 |
praxes | 8 | prax | 1 |
aframaxes | 8 | aframax | 16 |
cineplexes | 7 | cineplex | 68 |
appendixes | 6 | appendix | 878 |
flummoxes | 6 | flummox | 9 |
panamaxes | 6 | panamax | 104 |
boomboxes | 6 | boombox | 40 |
transfixes | 6 | transfix | 15 |
jinxes | 5 | jinx | 244 |
textboxes | 5 | textbox | 17 |
muxes | 4 | mux | 20 |
Another bizarre choice in the S-stemmer is to avoid any stemming of ees (i.e. the trailing s is not removed).
Tests on the Signal Media 1m news dataset show that this overlooks a lot of valid words.
The only ees words that are not plurals here are raees (a Bollywood movie) and drees and dees (names), none of which, when indexed in stemmed form, would clash with other common English words. I recommend making this change to the original S-stemmer algorithm too.
Plural | count | Proposed stem | Proposed stem count
-- | -- | -- | --
employees | 32064 | employee | 12965 |
refugees | 17323 | refugee | 12865 |
sees | 12675 | see | 183619 |
fees | 11210 | fee | 11549 |
degrees | 8705 | degree | 21259 |
trees | 7550 | tree | 9160 |
attendees | 6994 | attendee | 493 |
guarantees | 4877 | guarantee | 12974 |
agrees | 3806 | agree | 18583 |
oversees | 2491 | oversee | 2879 |
committees | 2451 | committee | 27464 |
yankees | 2445 | yankee | 963 |
knees | 2390 | knee | 9674 |
nominees | 2050 | nominee | 2900 |
trustees | 1888 | trustee | 1491 |
bees | 1215 | bee | 2001 |
retirees | 1067 | retiree | 472 |
referees | 1006 | referee | 3752 |
brees | 814 | bree | 116 |
franchisees | 778 | franchisee | 476 |
disagrees | 776 | disagree | 2973 |
honorees | 666 | honoree | 335 |
rupees | 643 | rupee | 855 |
detainees | 637 | detainee | 262 |
devotees | 634 | devotee | 159 |
frees | 588 | free | 116658 |
tees | 530 | tee | 1876 |
trainees | 487 | trainee | 633 |
licensees | 468 | licensee | 567 |
entrees | 434 | entree | 350 |
rees | 418 | ree | 88 |
coffees | 413 | coffee | 11426 |
lees | 393 | lee | 13865 |
appointees | 327 | appointee | 176 |
toffees | 292 | toffee | 175 |
evacuees | 246 | evacuee | 37 |
foresees | 241 | foresee | 769 |
flees | 226 | flee | 3335 |
decrees | 218 | decree | 945 |
inductees | 212 | inductee | 469 |
awardees | 203 | awardee | 113 |
threes | 186 | three | 194122 |
returnees | 178 | returnee | 72 |
chimpanzees | 173 | chimpanzee | 94 |
grantees | 169 | grantee | 100 |
interviewees | 162 | interviewee | 91 |
enrollees | 156 | enrollee | 34 |
invitees | 139 | invitee | 48 |
escapees | 122 | escapee | 70 |
pharisees | 113 | pharisee | 41 |
honeybees | 112 | honeybee | 85 |
absentees | 112 | absentee | 288 |
burpees | 108 | burpee | 64 |
amputees | 84 | amputee | 201 |
divorcees | 77 | divorcee | 123 |
gees | 74 | gee | 670 |
lessees | 72 | lessee | 86 |
emcees | 66 | emcee | 321 |
pedigrees | 65 | pedigree | 971 |
humvees | 63 | humvee | 62 |
soirees | 60 | soiree | 161 |
maccabees | 53 | maccabee | 15 |
sarees | 50 | saree | 58 |
manatees | 50 | manatee | 133 |
elysees | 50 | elysee | 86 |
dees | 49 | dee | 1223 |
marquees | 47 | marquee | 1350 |
loanees | 44 | loanee | 183 |
signees | 43 | signee | 83 |
pees | 41 | pee | 534 |
mentees | 41 | mentee | 58 |
purees | 40 | puree | 438 |
monkees | 39 | monkee | 4 |
kees | 38 | kee | 153 |
jaycees | 35 | jaycee | 24 |
bumblebees | 33 | bumblebee | 76 |
fugees | 33 | fugee | 1 |
transferees | 19 | transferee | 27 |
drees | 12 | dree | 12 |
Another proposal:
The *ies -> *y rule should only apply to words longer than 4 characters.
In tests on the news dataset this proposal loses nothing but gains matches on pies -> pie, lies -> lie and ties -> tie.
The only two-letter *y word of consequence in English is by, and it has no plural.
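
A minimal Python sketch of this length check follows. The function name `stem_ies` is invented for the example, and the sketch deliberately ignores any other exceptions the real rule may carry, so it only illustrates the length threshold.

```python
def stem_ies(word: str) -> str:
    """Apply *ies -> *y only to words longer than 4 characters."""
    if not word.endswith("ies"):
        return word
    if len(word) > 4:
        return word[:-3] + "y"   # e.g. a five-or-more letter plural
    return word[:-1]             # short words just lose the trailing s

for plural in ("pies", "lies", "ties", "ponies"):
    print(plural, "->", stem_ies(plural))
```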
oes plurals are not treated at all by the "S" stemmer.
I believe a positive uplift can be had by removing the es part of the suffix.
In the table below I contrast the popularity of the full oes term, an aggressive o stemmed form and a less aggressive oe stem in the million news articles dataset.
The aggressive stem back to the o word looks to be the most useful stemming rule to employ and a small set of exceptions that should retain the e (canoe, shoe and oboe) could be maintained as a list in the code.
The use of an exception list could make this a more contentious rule, but it should be noted that the alternative is to stick with the current policy of not stemming oes suffixes at all, which produces a lot of false negatives.
Plural|count|es stemmed|count|s stemmed|count
--|--|--|--|--|--|
shoes | 8173 | sho | 255 | shoe | 2902 |
heroes | 5076 | hero | 8280 | | |
tomatoes | 2507 | tomato | 2133 | | |
potatoes | 2271 | potato | 2530 | potatoe | 7 |
echoes | 1192 | echo | 2485 | | |
superheroes | 647 | superhero | 1554 | | |
mosquitoes | 541 | mosquito | 744 | | |
undergoes | 495 | undergo | 3492 | | |
volcanoes | 353 | volcano | 668 | | |
tornadoes | 336 | tornado | 688 | | |
buffaloes | 240 | buffalo | 4746 | buffaloe | 4 |
cargoes | 238 | cargo | 3595 | | |
throes | 237 | thro | 27 | | |
zeroes | 179 | zero | 13047 | | |
vetoes | 157 | veto | 1903 | | |
canoes | 147 | cano | 221 | canoe | 531 |
mangoes | 144 | mango | 902 | | |
dominoes | 113 | domino | 524 | | |
faroes | 103 | faro | 195 | faroe | 342 |
negroes | 85 | negro | 263 | | |
horseshoes | 85 | | | horseshoe | 385 |
torpedoes | 83 | torpedo | 202 | | |
frescoes | 73 | fresco | 330 | | |
embargoes | 70 | embargo | 1007 | | |
kroes | 49 | kro | 27 | | |
backhoes | 48 | | | backhoe | 73 |
mementoes | 41 | memento | 275 | | |
tiptoes | 33 | | | tiptoe | 53 |
floes | 27 | flo | 277 | floe | 11 |
outdoes | 27 | outdo | 222 | | |
simoes | 26 | simo | 22 | | |
marloes | 24 | marlo | 104 | | |
dingoes | 22 | dingo | 31 | | |
forgoes | 21 | forgo | 441 | | |
briscoes | 18 | brisco | 9 | briscoe | 227 |
cohoes | 16 | coho | 50 | | |
commandoes | 14 | commando | 293 | | |
snowshoes | 14 | | | snowshoe | 20 |
undoes | 14 | undo | 665 | | |
avocadoes | 14 | avocado | 981 | | |
mottoes | 13 | motto | 1345 | | |
antiheroes | 13 | antihero | 54 | | |
siloes | 13 | silo | 253 | | |
foregoes | 13 | forego | 271 | | |
flamingoes | 12 | flamingo | 233 | | |
overdoes | 12 | overdo | 183 | | |
sloes | 11 | slo | 145 | sloe | 25 |
ghettoes | 10 | ghetto | 300 | | |
gittoes | 10 | gitto | 4 | | |
innuendoes | 10 | innuendo | 250 | | |
manifestoes | 9 | manifesto | 1041 | | |
haloes | 9 | halo | 889 | | |
coxconservesheroes | 8 | coxconserveshero | 8 | | |
aloes | 8 | alo | 41 | aloe | 283 |
grottoes | 7 | grotto | 150 | | |
ciscoes | 7 | cisco | 2340 | | |
acoes | 7 | aco | 113 | | |
desperadoes | 6 | desperado | 43 | | |
sheroes | 6 | shero | 53 | | |
peccadilloes | 6 | peccadillo | 7 | | |
erdoes | 6 | erdo | 13 | | |
weirdoes | 6 | weirdo | 171 | | |
tahoes | 5 | taho | 24 | tahoe | 433 |
supervolcanoes | 5 | | | | |
ringoes | 5 | ringo | 252 | | |
oboes | 5 | obo | 32 | oboe | 43 |
domingoes | 5 | domingo | 391 | | |
porticoes | 3 | portico | 122 | | |
sermoheroes | 3 | | | | |
vidoes | 3 | | | | |
faeroes | 3 | | | faeroe | 6 |
fiascoes | 3 | fiasco | 573 | | |
hammertoes | 3 | | | hammertoe | 6 |
ricardoes | 3 | ricardo | 1041 | | |
groes | 3 | gro | 108 | | |
croes | 3 | cro | 274 | | |
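
The oes rule with its small exception list, as described before the table, could look roughly like this in Python. `KEEP_E` and `stem_oes` are hypothetical names; matching the exceptions as word endings rather than whole words is an assumption made here so that compounds from the table such as horseshoes and snowshoes also keep their e.

```python
# Singulars that genuinely end in "oe" (list taken from the proposal).
KEEP_E = ("canoe", "shoe", "oboe")

def stem_oes(word: str) -> str:
    """Strip "es" from *oes plurals unless the singular keeps its e."""
    if not word.endswith("oes"):
        return word
    s_stripped = word[:-1]            # shoes -> shoe
    if s_stripped.endswith(KEEP_E):   # also covers horseshoe, snowshoe
        return s_stripped
    return word[:-2]                  # tomatoes -> tomato

for plural in ("tomatoes", "heroes", "shoes", "canoes", "horseshoes"):
    print(plural, "->", stem_oes(plural))
```

Terms such as tiptoes or backhoes are not covered by the three listed exceptions and would still be over-stemmed by this sketch.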
It's possible that the EnglishMinimalStemmer's implementation of the original algorithm has a bug.
This is the original S-stemmer description:

The notes accompanying the table state:
"the first applicable rule encountered is the only one used"
For the ees and oes suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently bees != bee and tomatoes != tomato. The oes and ees suffixes are left intact.
"The first applicable rule" for ees could be interpreted as rule 2 or 3 in the table depending on if you take applicable to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer assumed the latter and I think it should be the former. We should fall through into rule 3 for ees and oes (remove any trailing S). That's certainly the conclusion I came to independently testing on real data.
I notice this implementation of the s-stemmer makes the same mistake. (Perhaps our Java version was a port of this JavaScript, or vice versa?)
@jpountz I've been working on a new TokenFilter but what does this ees/oes discovery mean for the existing EnglishMinimalStemmer code if it falls short of its goal in faithfully implementing the original paper?
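
The two readings of "the first applicable rule" can be made concrete with a small Python sketch. The rule table in it is an assumed reconstruction: this thread only confirms the ies, es and s rules and the ees/oes exceptions, while the remaining exception suffixes are recalled rather than quoted. `RULES`, `s_stem` and `fall_through` are names invented for the example.

```python
# (suffix, exception suffixes, replacement); exception lists are assumptions.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def s_stem(word: str, fall_through: bool) -> str:
    """Apply the first applicable rule under one of two interpretations.

    fall_through=False: a rule whose suffix matches is "applicable" even
    when an exception blocks it, so processing stops with no change.
    fall_through=True: a blocked rule is not applicable, so the next
    rule gets a chance.
    """
    for suffix, exceptions, replacement in RULES:
        if not word.endswith(suffix):
            continue
        if word.endswith(exceptions):
            if fall_through:
                continue
            return word
        return word[: len(word) - len(suffix)] + replacement
    return word

for w in ("bees", "employees", "dresses"):
    print(w, s_stem(w, fall_through=False), s_stem(w, fall_through=True))
```

Under the first reading ees words are left intact, which matches the behaviour described for EnglishMinimalStemmer. Under the second they lose the trailing s. Neither reading changes dresses -> dresse, which comes from the es -> e rule itself.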
Ches rules:
It looks like the es can be dropped, except for a small number of words adopted into English, such as cliche, quiche and avalanche, which keep the e.
Plural | count | Proposed stem | Proposed stem count|Retain E|Retain E count
-- | -- | -- | --|--|--
matches | 16535 | match | 39680 | | | | |
launches | 8251 | launch | 32948 | | | | |
coaches | 8189 | coach | 35967 | | | | |
approaches | 6736 | approach | 36169 | | | | |
inches | 5675 | inch | 9864 | inche | 7 | inche | 7 |
reaches | 4684 | reach | 45691 | reache | 5 | reache | 5 |
catches | 4186 | catch | 17512 | | | | |
branches | 3606 | branch | 8197 | | | | |
touches | 3598 | touch | 22720 | touche | 208 | touche | 208 |
teaches | 3384 | teach | 8592 | | | | |
churches | 3166 | church | 21364 | | | | |
watches | 2966 | watch | 56591 | watche | 5 | watche | 5 |
speeches | 2832 | speech | 18023 | | | | |
searches | 2829 | search | 31303 | | | | |
breaches | 2739 | breach | 5148 | | | | |
beaches | 2641 | beach | 19419 | | | | |
pitches | 2426 | pitch | 11668 | | | | |
sandwiches | 2203 | sandwich | 2630 | | | | |
stretches | 1964 | stretch | 8756 | | | | |
switches | 1769 | switch | 9901 | | | | |
headaches | 1565 | | | headache | 1990 | headache | 1990 |
patches | 1523 | patch | 4377 | | | | |
punches | 1227 | punch | 4798 | | | | |
lunches | 1169 | lunch | 10838 | | | | |
riches | 890 | rich | 21176 | riche | 59 | riche | 59 |
sketches | 889 | sketch | 2123 | | | | |
batches | 888 | batch | 3122 | | | | |
benches | 810 | bench | 9126 | | | | |
stitches | 694 | stitch | 688 | | | | |
scratches | 648 | scratch | 3301 | | | | |
trenches | 618 | trench | 547 | | | | |
peaches | 618 | peach | 1247 | | | | |
marches | 608 | march | 30547 | marche | 48 | marche | 48 |
witches | 540 | witch | 1492 | | | | |
smartwatches | 518 | smartwatch | 966 | | | | |
attaches | 496 | attach | 1896 | attache | 85 | attache | 85 |
arches | 473 | arch | 1973 | arche | 15 | arche | 15 |
glitches | 464 | glitch | 564 | | | | |
clutches | 448 | clutch | 1770 | | | | |
pouches | 417 | pouch | 628 | | | | |
researches | 414 | research | 87327 | | | | |
crutches | 402 | crutch | 148 | | | | |
niches | 296 | nich | 6 | niche | 5715 | niche | 5715 |
ditches | 289 | ditch | 2272 | | | | |
dispatches | 285 | dispatch | 2112 | | | | |
notches | 277 | notch | 2482 | | | | |
preaches | 271 | preach | 891 | | | | |
couches | 260 | couch | 2131 | couche | 45 | couche | 45 |
tranches | 253 | | | tranche | 469 | tranche | 469 |
torches | 222 | torch | 827 | | | | |
bunches | 217 | bunch | 7482 | bunche | 25 | bunche | 25 |
enriches | 211 | enrich | 1720 | | | | |
backbenches | 208 | backbench | 485 | | | | |
ranches | 181 | ranch | 2495 | | | | |
bitches | 177 | bitch | 1002 | | | | |
hatches | 172 | hatch | 1152 | | | | |
swatches | 170 | swatch | 181 | | | | |
cliches | 166 | clich | 20 | cliche | 332 | cliche | 332 |
cockroaches | 164 | cockroach | 79 | | | | |
crunches | 152 | crunch | 2093 | | | | |
porches | 143 | porch | 1408 | porche | 17 | porche | 17 |
pooches | 138 | pooch | 328 | | | | |
caches | 133 | | | cache | 983 | cache | 983 |
mismatches | 132 | mismatch | 481 | | | | |
starches | 129 | starch | 400 | | | | |
latches | 124 | latch | 519 | | | | |
clinches | 122 | clinch | 1744 | | | | |
porsches | 113 | | | porsche | 2109 | porsche | 2109 |
snatches | 107 | snatch | 831 | | | | |
avalanches | 105 | | | avalanche | 722 | avalanche | 722 |
hitches | 101 | hitch | 715 | | | | |
perches | 97 | perch | 623 | | | | |
roaches | 94 | roach | 372 | roache | 20 | roache | 20 |
wrenches | 93 | wrench | 351 | | | | |
finches | 89 | finch | 860 | | | | |
pinches | 80 | pinch | 2441 | pinche | 6 | pinche | 6 |
fetches | 77 | fetch | 926 | | | | |
leeches | 68 | leech | 109 | | | | |
brunches | 66 | brunch | 924 | | | | |
lurches | 63 | lurch | 239 | | | | |
mustaches | 44 | | | mustache | 314 | mustache | 314 |
relaunches | 44 | relaunch | 362 | | | | |
apaches | 44 | | | apache | 889 | apache | 889 |
breeches | 42 | breech | 108 | | | | |
brooches | 41 | brooch | 134 | | | | |
slouches | 40 | slouch | 171 | | | | |
wristwatches | 39 | wristwatch | 109 | | | | |
winches | 39 | winch | 143 | | | | |
heartaches | 38 | | | heartache | 516 | heartache | 516 |
psyches | 37 | psych | 305 | psyche | 581 | psyche | 581 |
moustaches | 37 | | | moustache | 220 | moustache | 220 |
haunches | 34 | haunch | 7 | | | | |
hunches | 34 | hunch | 246 | | | | |
blotches | 33 | blotch | 10 | | | | |
beseeches | 33 | beseech | 56 | | | | |
twitches | 33 | twitch | 518 | | | | |
smooches | 31 | smooch | 82 | | | | |
quiches | 31 | | | quiche | 100 | quiche | 100 |
deutsches | 29 | deutsch | 643 | deutsche | 4088 | deutsche | 4088 |
encroaches | 28 | encroach | 136 | | | | |
entrenches | 27 | entrench | 112 | | | | |
rematches | 26 | rematch | 846 | | | | |
goldfinches | 26 | goldfinch | 51 | | | | |
flinches | 26 | flinch | 180 | | | | |
roches | 25 | roch | 60 | roche | 1329 | roche | 1329 |
outreaches | 24 | outreach | 3569 | | | | |
beeches | 23 | beech | 473 | | | | |
naches | 23 | nach | 57 | | | | |
bleaches | 21 | bleach | 411 | | | | |
detaches | 20 | detach | 162 | | | | |
poaches | 19 | poach | 151 | | | | |
birches | 19 | birch | 601 | | | | |
impeaches | 17 | impeach | 145 | | | | |
crouches | 16 | crouch | 257 | | | | |
belches | 16 | belch | 37 | | | | |
cwtches | 15 | cwtch | 5 | | | | |
masterbatches | 14 | masterbatch | 18 | | | | |
geocaches | 13 | | | geocache | 16 | geocache | 16 |
cinches | 12 | cinch | 140 | | | | |
stiches | 12 | stich | 11 | | | | |
despatches | 12 | despatch | 125 | | | | |
botches | 9 | botch | 86 | | | | |
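
One way to draw up candidates for the "retain e" list from counts like those in the table is to compare the two singular forms directly. This is a suggestion, not the method used to build the table: `retain_e_candidates` is a hypothetical helper, and the toy counts are copied from four of the rows above.

```python
def retain_e_candidates(plurals, counts):
    """Return *che singulars that are more frequent than the *ch form."""
    keep = []
    for plural in plurals:
        if not plural.endswith("ches"):
            continue
        with_e = plural[:-1]      # niches -> niche
        without_e = plural[:-2]   # niches -> nich
        if counts.get(with_e, 0) > counts.get(without_e, 0):
            keep.append(with_e)
    return keep

# Toy counts copied from the table.
counts = {"match": 39680, "niche": 5715, "nich": 6,
          "cliche": 332, "clich": 20, "touch": 22720, "touche": 208}
print(retain_e_candidates(["matches", "niches", "cliches", "touches"], counts))
```

A frequency comparison only proposes candidates. Names such as roche or deutsche would also pass this test, so the resulting list would still need a manual review.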
Having heard back from the author of the paper on which Lucene's EnglishMinimalStemFilter is based, I've concluded that the S-Stemmer algorithm presented there has muddled logic and that the implementation of it in Lucene is also buggy.
Below is a final round-up of the differences between the stemmer proposed in https://github.com/elastic/elasticsearch/pull/43248 and the Lucene filter, based on trials with the Signal Media news dataset. The differences illustrate that the Lucene version either fails to offer any stem (e.g. employees keeps the s) or offers a nonsensical stem (dresses becomes dresse, which means a search for dresses wouldn't match dress).
Plural | count | proposed new stem | count | Lucene stem (blank if not stemmed) | count
-- | -- | -- | -- | -- | --
employees | 32063 | employee | 12965 | | |
refugees | 17323 | refugee | 12865 | | |
sees | 12675 | see | 183619 | | |
fees | 11210 | fee | 11549 | | |
degrees | 8705 | degree | 21259 | | |
ties | 8596 | tie | 10507 | ty | 1234 |
lies | 8232 | lie | 6590 | ly | 157 |
shoes | 8173 | shoe | 2902 | | |
trees | 7550 | tree | 9160 | | |
attendees | 6994 | attendee | 493 | | |
heroes | 5076 | hero | 8280 | | |
guarantees | 4877 | guarantee | 12974 | | |
dies | 3967 | die | 13662 | dy | 133 |
agrees | 3806 | agree | 18583 | | |
tomatoes | 2507 | tomato | 2133 | | |
oversees | 2491 | oversee | 2879 | | |
committees | 2451 | committee | 27464 | | |
yankees | 2445 | yankee | 963 | | |
knees | 2390 | knee | 9674 | | |
woes | 2365 | wo | 283 | | |
potatoes | 2271 | potato | 2530 | | |
nominees | 2050 | nominee | 2900 | | |
trustees | 1888 | trustee | 1491 | | |
toes | 1693 | to | 917321 | | |
foes | 1246 | fo | 358 | | |
bees | 1215 | bee | 2001 | | |
echoes | 1192 | echo | 2485 | | |
retirees | 1067 | retiree | 472 | | |
referees | 1006 | referee | 3752 | | |
pies | 927 | pie | 3091 | py | 43 |
brees | 814 | bree | 116 | | |
franchisees | 778 | franchisee | 476 | | |
disagrees | 776 | disagree | 2973 | | |
honorees | 666 | honoree | 335 | | |
superheroes | 647 | superhero | 1554 | | |
rupees | 643 | rupee | 855 | | |
detainees | 637 | detainee | 262 | | |
devotees | 634 | devotee | 159 | | |
frees | 588 | free | 116658 | | |
mosquitoes | 541 | mosquito | 744 | | |
tees | 530 | tee | 1876 | | |
undergoes | 495 | undergo | 3492 | | |
trainees | 487 | trainee | 633 | | |
licensees | 468 | licensee | 567 | | |
entrees | 434 | entree | 350 | | |
rees | 418 | ree | 88 | | |
coffees | 413 | coffee | 11426 | | |
aes | 398 | ae | 450 | | |
lees | 393 | lee | 13865 | | |
volcanoes | 353 | volcanoe | 0 | | |
tornadoes | 336 | tornado | 688 | | |
paes | 328 | pae | 28 | | |
appointees | 327 | appointee | 176 | | |
toffees | 292 | toffee | 175 | | |
evacuees | 246 | evacuee | 37 | | |
foresees | 241 | foresee | 769 | | |
buffaloes | 240 | buffalo | 4746 | | |
businesses | 38694 | business | 143764 | businesse | 0 |
matches | 16535 | match | 39680 | matche | 1 |
processes | 14592 | process | 69044 | processe | 2 |
losses | 13164 | loss | 42859 | losse | 3 |
classes | 11946 | class | 48412 | classe | 23 |
passes | 10998 | pass | 29525 | passe | 42 |
taxes | 10477 | tax | 24595 | taxe | 4 |
launches | 8251 | launch | 32948 | launche | 3 |
coaches | 8189 | coach | 35967 | coache | 2 |
addresses | 7491 | address | 48607 | addresse | 0 |
approaches | 6736 | approach | 36165 | approache | 2 |
witnesses | 5929 | witness | 8866 | witnesse | 1 |
inches | 5675 | inch | 9864 | inche | 7 |
boxes | 5351 | box | 26638 | boxe | 8 |
wishes | 5284 | wish | 20877 | wishe | 3 |
reaches | 4684 | reach | 45691 | reache | 5 |
dishes | 4470 | dish | 4548 | dishe | 0 |
catches | 4186 | catch | 17512 | catche | 0 |
branches | 3606 | branch | 8197 | branche | 3 |
touches | 3598 | touch | 22720 | touche | 208 |
weaknesses | 3406 | weakness | 5481 | weaknesse | 0 |
clashes | 3388 | clash | 8264 | clashe | 0 |
teaches | 3384 | teach | 8592 | teache | 1 |
discusses | 3301 | discuss | 25244 | discusse | 0 |
churches | 3166 | church | 21364 | churche | 1 |
successes | 3071 | success | 47715 | successe | 1 |
finishes | 3039 | finish | 21923 | finishe | 1 |
glasses | 3036 | glass | 12610 | glasse | 0 |
watches | 2966 | watch | 56591 | watche | 5 |
speeches | 2832 | speech | 18023 | speeche | 0 |
searches | 2829 | search | 31303 | searche | 0 |
breaches | 2739 | breach | 5148 | breache | 0 |
beaches | 2641 | beach | 19419 | beache | 0 |
bosses | 2609 | boss | 13511 | bosse | 25 |
masses | 2594 | mass | 20427 | masse | 499 |
dresses | 2502 | dress | 9588 | dresse | 0 |
pitches | 2426 | pitch | 11668 | pitche | 0 |
publishes | 2376 | publish | 5568 | publishe | 2 |
illnesses | 2316 | illness | 6664 | illnesse | 0 |
sandwiches | 2203 | sandwich | 2630 | sandwiche | 1 |
pushes | 2188 | push | 21428 | pushe | 1 |
ashes | 2179 | ash | 2203 | ashe | 1418 |
crashes | 2143 | crash | 12997 | crashe | 3 |
misses | 2070 | miss | 25704 | misse | 2 |
stretches | 1964 | stretch | 8756 | stretche | 1 |
indexes | 1920 | index | 19179 | indexe | 2 |
switches | 1769 | switch | 9901 | switche | 0 |
crosses | 1692 | cross | 27398 | crosse | 221 |
encompasses | 1630 | encompass | 1067 | encompasse | 0 |
sunglasses | 1551 | sunglass | 76 | sunglasse | 0 |
patches | 1523 | patch | 4377 | patche | 0 |
stresses | 1511 | stress | 11000 | stresse | 0 |
fixes | 1418 | fix | 10755 | fixe | 68 |
possesses | 1384 | possess | 3022 | possesse | 1 |
progresses | 1298 | progress | 23792 | progresse | 0 |
expresses | 1296 | express | 15843 | expresse | 2 |
punches | 1227 | punch | 4798 | punche | 0 |
lunches | 1169 | lunch | 10838 | lunche | 2 |
actresses | 1165 | actress | 10873 | actresse | 0 |
establishes | 1163 | establish | 11754 | establishe | 1 |
flashes | 1150 | flash | 7612 | flashe | 1 |
mixes | 970 | mix | 18027 | mixe | 3 |
kisses | 916 | kiss | 3228 | kisse | 0 |
riches | 890 | rich | 21176 | riche | 59 |
sketches | 889 | sketch | 2123 | sketche | 0 |
batches | 888 | batch | 3122 | batche | 0 |
bushes | 886 | bush | 8213 | bushe | 2 |
rushes | 864 | rush | 8995 | rushe | 7 |
assesses | 825 | assess | 6807 | assesse | 4 |
presses | 820 | press | 82998 | presse | 1068 |
benches | 810 | bench | 9126 | benche | 0 |
brushes | 738 | brush | 3491 | brushe | 0 |
parishes | 733 | parish | 3146 | parishe | 0 |
lashes | 697 | lash | 516 | lashe | 0 |
stitches | 694 | stitch | 688 | stitche | 0 |
scratches | 648 | scratch | 3301 | scratche | 0 |
trenches | 618 | trench | 547 | trenche | 0 |
peaches | 618 | peach | 1247 | peache | 0 |
complexes | 617 | complex | 24476 | complexe | 4 |
marches | 608 | march | 30547 | marche | 48 |
foxes | 591 | fox | 14241 | foxe | 3 |
washes | 565 | wash | 5493 | washe | 0 |
mattresses | 556 | mattress | 931 | mattresse | 0 |
witches | 540 | witch | 1492 | witche | 0 |
dismisses | 536 | dismiss | 2207 | dismisse | 1 |
harnesses | 484 | harness | 2438 | harnesse | 0 |
glitches | 464 | glitch | 564 | glitche | 1 |
clutches | 448 | clutch | 1770 | clutche | 0 |
excesses | 438 | excess | 6456 | excesse | 0 |
researches | 414 | research | 87327 | researche | 0 |
sexes | 378 | sex | 16266 | sexe | 6 |
messes | 335 | mess | 5290 | messe | 145 |
impresses | 325 | impress | 2589 | impresse | 2 |
diminishes | 321 | diminish | 1061 | diminishe | 0 |
niches | 296 | nich | 6 | niche | 5715 |
notches | 277 | notch | 2482 | notche | 0 |
@markharwood - thanks for the effort put into analyzing this! As a temporary workaround, I gathered your corrected misstems in a synonyms file, here
@softwaredoug I need to push this. Regarding your synonyms: note that this stemming causes a small amount of collateral damage that you will probably want to fix in your synonyms file: toes -> to, woes -> wo and foes -> fo.
thanks, fixed!
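For anyone who wants to experiment with the patterns in the table, here is a minimal Python sketch of a suffix rule plus an exception list. It is an illustration only, not the `minimal_english` algorithm and not a proposed patch: the function name `strip_plural`, the suffix list and the contents of `EXCEPTIONS` are all assumptions made up for this example.

```python
# Hypothetical sketch: NOT the minimal_english stemmer, just the idea
# discussed in this thread (suffix rules plus a hand-kept exception list).

# Endings after which a plural "es" is dropped entirely (assumption based
# on the table rows: businesses, wishes, matches, boxes, ...).
ES_SUFFIXES = ("sses", "shes", "ches", "xes")

# Illustrative exceptions only; a real list would need to be curated.
EXCEPTIONS = {
    "niches": "niche",  # "niche" is far more frequent than "nich" in the table
    "toes": "toe",
    "woes": "woe",
    "foes": "foe",
}


def strip_plural(word: str) -> str:
    """Return a singular-looking form of a lowercase English plural."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word.endswith(ES_SUFFIXES):
        return word[:-2]  # dresses -> dress, boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # cases -> case, forces -> force
    return word  # dress, glass, ... are left alone


if __name__ == "__main__":
    for plural in ("dresses", "watches", "boxes", "brushes", "niches", "cases"):
        print(plural, "->", strip_plural(plural))
```

With these rules, dresses, watches, boxes and brushes come out as dress, watch, box and brush, while cases still stems to case. Words ending in "oes" (echoes, superheroes) are deliberately not handled here, which is the kind of gap a synonyms file would have to cover.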