YouCompleteMe: Better suggestions ranking

Created on 5 Nov 2015 · 31 Comments · Source: ycm-core/YouCompleteMe

Hi people,

YouCompleteMe is awesome, but I'm facing some strange ranking issues. Given these strings in a document:

InternalCountingCreator
internal_counting_id
internal_counting
internal_counting_items
increment_existing_counting

If you type intecounting, the first result is internal_counting_id and the second is internal_counting_items. The internal_counting suggestion, which is the one I want, comes in fifth position, below a far less obvious suggestion like increment_existing_counting, which comes in third position.

Another case:

validate_associated
validate
validates

If you type valid, the first suggestion is validate_associated instead of validate. But think about it: if I wanted validate_associated I could type valiass to get it, whereas to get validate in this case there's no other way than navigating to the second position.

Why don't you use some sort of Levenshtein distance algorithm for ranking? I believe the results would be much more accurate, eliminating the need to navigate the suggestion list to get the right suggestion.

samuelsimoes

Most helpful comment

I also think just putting the shortest on top is a good idea, because even if it's not the correct word, I can just select it and add some new letters, and then it's done. In another case, like in the previous example,

InternalCounting
InternalCountingCreator
internal_counting_id
internal_counting
internal_counting_items
increment_existing_counting

we can just put the shortest two at the front,
internal_counting
InternalCounting
Even if what we wanted was "InternalCountingCreator", we would only need to hit 'tab' twice, add another character 'C', and then complete it again.

shawn-peng on 2 Dec 2016

👍2

All 31 comments

YCM matches based on all the words, so you should type _
InternalCounting
InternalCountingCreator
internal_counting_id
internal_counting
internal_counting_items
increment_existing_counting
In this case, if you want internal_counting,
you can type i_c.

wsdjeg on 10 Nov 2015

I think YCM's matcher is already very good. If the chars you type are the first char of every word, I think the first result is the one you want!
I will show you some cases.

wsdjeg on 10 Nov 2015

InternalCounting
InternalCountingCreator
internal_counting_id
internal_counting
internal_counting_items
increment_existing_counting

In the same case, every word is split by _.
If you want increment_existing_counting, you can type these three chars, iec, which are the beginnings of its three words.

wsdjeg on 10 Nov 2015

In this case, if you want validates:
validate_associated
validate
validates

you can type vds: the beginning, middle, and end of the word you want.

wsdjeg on 10 Nov 2015

If you type vdd, validate_associated will be first.

wsdjeg on 10 Nov 2015

i_c and vds aren't intuitive because I need to think about what "fuzzy token" I need to type (in the second example I need to think about the middle of every word :scream:) to get my suggestion in first position, and with that, fuzzy search loses all meaning.

Levenshtein distance is great in this case because it works with your examples and doesn't require any thinking about the token you are typing; you just need to complement it with "ahead details" to refine your search and ranking.

samuelsimoes on 10 Nov 2015

Any suggestion about how to match?
YCM does not know what you want; it can only show results based on what you type.
As you know, some result strings are very long, so YCM should not favor the shorter one.
For example, in this case vde should put the longest in first position instead of the shorter one, because both of them match v\w*d\w*e:
validate_associate
validate
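
To make the pattern above concrete, here is a minimal Python sketch that turns a typed query into the same kind of regular expression. The helper name `subsequence_pattern` is made up for illustration, and this is only an assumption about how such a filter could look, not how YCM itself filters candidates.

```python
import re


def subsequence_pattern(query: str) -> re.Pattern:
    # Escape each typed character, then allow any run of word
    # characters between consecutive characters of the query.
    return re.compile(r"\w*".join(re.escape(ch) for ch in query))


candidates = ["validate_associate", "validate"]
pattern = subsequence_pattern("vde")

# Keep every candidate in which the query characters appear in order.
matches = [word for word in candidates if pattern.search(word)]

print(pattern.pattern)  # v\w*d\w*e
print(matches)          # ['validate_associate', 'validate']
```

Both candidates pass the filter, so a pattern like this can only decide which words are shown; the order still has to come from a separate ranking rule.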

wsdjeg on 10 Nov 2015

Here is the result for the above case.
In Vim, /v\w*d\w*e will match both validate_associate and validate,
and I think YCM is a greedy matcher, so the longest one will be first.

wsdjeg on 10 Nov 2015

Sorry, my mistake: YCM will put the shorter string in first position.

wsdjeg on 10 Nov 2015

I'll bring a proof of concept of what I'd like to see in YCM soon. But in general Sublime Text and Atom, for example, do something like what I described.

_(in this case, Sublime brought "validates" to first position and I don't know why. Btw validates looks better suited than validate_associated for first position)_

samuelsimoes on 10 Nov 2015

YCM putting the shortest string on top could resolve most cases, I believe. :+1:

samuelsimoes on 10 Nov 2015

So I think we have similar views. Is there any issue about this?

wsdjeg on 10 Nov 2015

I don't think Levenshtein distance is well suited for this kind of thing. Take the following words:

this_is_an_example
triage

When typing tiae, you will get triage in first place because:

LD(tiae, this_is_an_example) = 14
LD(tiae, triage) = 2

where LD is the Levenshtein distance.
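
The two distances quoted above can be reproduced with a short Python sketch of the textbook dynamic-programming edit distance. The function name `levenshtein` is chosen here for illustration; it is not taken from YCM or ycmd.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


for word in ("this_is_an_example", "triage"):
    print(word, levenshtein("tiae", word))
# this_is_an_example 14
# triage 2
```

Since tiae is a subsequence of both words, each distance is simply the number of characters that must be inserted, which is why the longer identifier is penalized so heavily.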
Word boundaries are a far more pertinent criterion for a ranking system. Let's see your examples:

InternalCountingCreator
internal_counting_id
internal_counting
internal_counting_items
increment_existing_counting

If you want internal_counting, you should type ic (or i_c to remove the capitalized suggestion).

validate_associated
validate
validates

You should type ve for validate, vs for validates, and va for validate_associated.
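
As a rough illustration of the word-boundary idea, the Python sketch below extracts the first letter of each word from snake_case and CamelCase identifiers. The helper `word_boundary_chars` and its exact boundary rules are assumptions made for this example; ycmd's real rules may differ.

```python
def word_boundary_chars(identifier: str) -> str:
    """Lowercased first letter of each word in a snake_case or CamelCase name."""
    chars = []
    previous = "_"  # treat the start of the identifier as a boundary
    for ch in identifier:
        starts_word = ch.isalnum() and (
            not previous.isalnum()                     # start, or right after "_"
            or (ch.isupper() and previous.islower())   # CamelCase hump
        )
        if starts_word:
            chars.append(ch.lower())
        previous = ch
    return "".join(chars)


words = [
    "InternalCountingCreator",
    "internal_counting",
    "increment_existing_counting",
    "validate_associated",
    "validate",
]
for word in words:
    print(word, "->", word_boundary_chars(word))
# InternalCountingCreator -> icc
# internal_counting -> ic
# increment_existing_counting -> iec
# validate_associated -> va
# validate -> v
```

Under these assumed rules, ic lines up exactly with the boundary chars of internal_counting, and va does the same for validate_associated.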

The goal of a completion system is to let you type the fewest characters to get your completion, and a ranking system based on Levenshtein distance goes against this idea.

micbou on 10 Nov 2015

@micbou, I understand your point; we use completion systems in different ways.

I'm always more inclined not to care about word boundaries, because otherwise I need to think about these limits while I'm typing. When I use something like Sublime-style completion, I don't need to care about these boundaries; I just need to type any fragments in the right order.

Btw, I'm leaving my suggestion here, but if it goes too much against the YCM philosophy you can close this issue.

@wsdjeg I think I misunderstood you; I thought you'd submit a patch to make YCM put the shortest on top.

samuelsimoes on 10 Nov 2015

One question. In the example below, should the query be bring the billable term to first position, since it has b at the beginning and e at the end? (I'm using Ruby as the buffer syntax.)

[Screenshot, 2015-11-10, 12:22:51 PM]

samuelsimoes on 10 Nov 2015

By looking at the code, YCM seems to only consider the first (and not the last) character of word boundaries for suggestions ranking. @Valloric should confirm this.

micbou on 10 Nov 2015

IIRC ycmd weights suggestions higher when the query is a prefix of the candidate, after checking word boundary chars. In this case 'be' is a prefix of 'between' so it beats 'billable'
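
The ordering described here can be modeled with a toy Python sort key. Everything in this sketch (`boundary_hits`, `rank`, and the exact tie-break order) is a simplified assumption for snake_case names, not ycmd's actual code.

```python
def boundary_hits(query: str, candidate: str) -> int:
    """Count query characters matched, in order, against the first letter of each snake_case word."""
    firsts = [part[0] for part in candidate.lower().split("_") if part]
    hits = 0
    for ch in firsts:
        if hits < len(query) and ch == query[hits].lower():
            hits += 1
    return hits


def rank(query: str, candidates: list[str]) -> list[str]:
    # Assumed order: more word-boundary hits first, then candidates that
    # start with the query, then shorter candidates.
    return sorted(
        candidates,
        key=lambda c: (
            -boundary_hits(query, c),
            not c.lower().startswith(query.lower()),
            len(c),
        ),
    )


print(rank("be", ["billable", "between"]))  # ['between', 'billable']
```

Both words score one boundary hit for b, so the prefix check is what puts between ahead of billable in this model.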

puremourning on 14 Nov 2015

Understood. The previous example didn't show the problem properly.

With this in the file:

billable_item
billables
billable

and wanting to get billable in first position.

bae was getting billable in first position. When the word because entered the file, bae started bringing because to first position. So I needed to find another matching token; this time bbl does the job, but for how long?

These are tiny examples of what I'm getting many times every day. :/

samuelsimoes on 14 Nov 2015

So, I'm toying with the idea of always considering the last char of the identifier a word-boundary char. I think this works quite well in these situations, where you're sort of after the "shorter" word with equivalent subsequence match.
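
A small Python sketch of that idea follows, with the last character optionally treated as a word boundary. The helper `boundary_chars` and its flag are hypothetical names for illustration only; this is not the code from the branch mentioned in this thread.

```python
def boundary_chars(candidate: str, last_char_is_boundary: bool) -> str:
    # Indices of the first letter of each snake_case word.
    indices = {
        i for i, ch in enumerate(candidate)
        if ch != "_" and (i == 0 or candidate[i - 1] == "_")
    }
    if last_char_is_boundary and candidate:
        # Using a set means the last index is never counted twice,
        # even when the final word is a single character.
        indices.add(len(candidate) - 1)
    return "".join(candidate[i].lower() for i in sorted(indices))


for word in ("between", "billable"):
    print(word, boundary_chars(word, False), boundary_chars(word, True))
# between b bn
# billable b be
```

With the flag enabled, the query be matches every boundary char of billable but only the first one of between, which is the kind of preference for the shorter word described above.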

Quick demo:

[Image: ycm_last-char-wb]

I have pushed a simple change to a ycmd branch which you can try out. Let me know feedback, etc.

Commit: https://github.com/puremourning/ycmd-1/commit/11e5308ba7c4800bf649f0e58a3d52f849143b04

https://github.com/puremourning/ycmd-1/tree/last-char-wb

Interested in thoughts from @Valloric, @micbou, @vheon, @oblitum on this before investing too much time.

puremourning on 21 Nov 2015

[Image: ycm_last-char-wb1]

For the other case at the top

puremourning on 21 Nov 2015

BTW I know that the patch above is garbage (it can potentially add the last char multiple times, needs tests, breaks tests, etc.), but it's just a prototype to see if it feels right.

puremourning on 21 Nov 2015

Word boundaries considering the first and last letter could improve things, but we'll continue to have the problem described in https://github.com/Valloric/YouCompleteMe/issues/1757#issuecomment-156714617 :confused:

samuelsimoes on 22 Nov 2015

Personally, I find word boundary chars way more intuitive than the proposed alternatives, particularly for the use cases YCM is designed for. And the change would be breaking for everyone's mental model of how YCM offers suggestions, so I would be surprised if we changed more than just tweaking the current approach.

puremourning on 23 Nov 2015

Maybe a new branch should be tried by everyone.

wsdjeg on 23 Nov 2015

I think YCM works well now, so there's no need to change it.

wsdjeg on 23 Nov 2015

@puremourning I don't think that trying to consider the last word char a word boundary char is a good idea with the current ranking system. What we have right now, while not IMO perfect, is an understandable and well-functioning mechanism with a simple guide for the user: the best way to rank something to the top is to write the first char of each word.

The real improvement would be rewriting the matching algorithm to be factor-multiplicative instead of tree-based as it is now. Basically something similar to the math behind linear regression. A rewrite of the matching algorithm so that this can be done and we can easily extend it with new factors has been on my mind for more than a year, but I haven't been able to get around to it.

WRT Levenshtein distance, it plain doesn't work when you have _many_ completion candidates. The net it casts is too wide and you get shitty suggestions. I know because it was the first matching algo used in YCM, even before it was released. It was terrible and I quickly replaced it.

Valloric on 20 Dec 2015

> while not IMO perfect, is an understandable and well-functioning mechanism with a simple guide for the user: the best way to rank something to the top is to write the first char of each word.

@Valloric got it. The main pain IMHO is the ranking not bringing the simplest completion to first position, which makes some completions super tricky (like I said in https://github.com/Valloric/YouCompleteMe/issues/1757#issuecomment-156714617).

About Levenshtein distance, in fact, it was a naive suggestion, sorry.

samuelsimoes on 28 Feb 2016

Copying the report I gave in the Gitter channel: I agree the ranking system acts in an unexpected way quite often. In the following case for example, trying to complete to word "udev", it shows up on "ud", vanishes on "ude", and comes back on "udev", never on top: