Elasticsearch: Allow Ids query to score documents by ids order

Created on 7 Aug 2013 · 15 Comments · Source: elastic/elasticsearch

_This is more a feature suggestion than an issue._

The Ids query is great and makes it possible to use ES as a searchable datastore.

I have some use cases where I want to restrict a user's search to a limited set of documents, so I combine an Ids query with some match queries.

But sometimes I have no query at all: I just want my N documents, without sorting on a field. In that case, all my documents get a score of 1.

If I run this query multiple times, I get a different order almost every time, because the scores are all equal.

{
    "query": {
            "ids": {
                "values": [
                   "1221","5","6","7","8","9","10"
                ]
            }
    }
}

This is not great for users (the results feel random), and I may also want to use the order of the id values as the document order. So I created a custom script query for this case:

{
  "query": {
    "custom_score": {
      "query": {
        "ids": {
          "type": "pony",
          "values": [
            "1337",
            "1664",
            "8888",
            "1111"
          ]
        }
      },
      "script": "
        count = ids.size();
        id    = org.elasticsearch.index.mapper.Uid.idFromUid(doc['_uid'].value);
        for (i = 0; i < count; i++) {
          if (id == ids[i]) { return count - i; }
        }",
      "params": {
        "ids": [
          "1337",
          "1664",
          "8888",
          "1111"
        ]
      },
      "lang": "mvel"
    }
  }
}

As you can see, I inject the ids into the script as a param and give each document a custom score based on the position of its ID in the list.

This fixes the consistency and ordering issues, but it is slow when dealing with lots of IDs (I started noticing at around 3k ids).
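
To show where the time goes, here is a minimal Python sketch of the same scoring rule, run outside Elasticsearch. The name `position_score` is made up for illustration, and returning 0 for an ID that is not in the list is an assumption (the MVEL script above returns nothing in that case). Like the script, it scans the ID list once per matching document, so the work grows roughly with the square of the list size.

```python
def position_score(doc_id, ids):
    """Mirror of the MVEL loop: earlier IDs in the list get higher scores."""
    count = len(ids)
    for i in range(count):
        if doc_id == ids[i]:
            return count - i
    return 0  # assumption: the original script defines no score for a miss


ids = ["1337", "1664", "8888", "1111"]
# Hits come back in arbitrary order; sorting by score restores the requested order.
hits = ["8888", "1111", "1337", "1664"]
ordered = sorted(hits, key=lambda d: position_score(d, ids), reverse=True)
print(ordered)
```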

What I was thinking about is some kind of option we can add to IdsQuery to score docs based on the Id position.

{
    "query": {
            "ids": {
                "values": [
                   "1221","5","6","7","8","9","10"
                ],
                "score_by_order": true
            }
    }
}

With this option set to true, the IdsQuery could give each document a score based on its position, removing the random effect and the need for a custom script to sort by ids.

What do you think?!
Thanks!


All 15 comments

If the problem is getting a consistent ordering of your documents and you only have the ids filter (or query), I'd suggest switching to the multi_get API. In that case you get the documents back in the same order as the ids in your request. Also, get and multi_get are a better fit when using Elasticsearch as a datastore because they are real-time, while search is only near real-time, which means that a refresh needs to happen before newly indexed documents become searchable (a refresh happens automatically every second by default though).

Otherwise, if you do need a query and want to use the search API, can't you just sort your documents by _id? The issue you may encounter there is that the _id field is not indexed by default, but you can change its mapping or use the _uid field instead, which contains type+id and it is indexed by default, thus it can be used for sorting out of the box.

Let me know if this helps and maybe next time (if you haven't done it yet) can you send a question to the mailing list just to double check that you tried all the options you have?

Thanks for the reply :)

  • The multi get API can't run facets or other powerful ES query features. Having the feature in the Ids query would open up a lot of possibilities.
  • Ordering by _id/_uid is not a solution: ids can be non-sequential (like hashes from a URL shortener...), and I want my documents in the order I request them, which can be arbitrary.

PS: Here is the related discussion in the ES ML: https://groups.google.com/d/topic/elasticsearch/QQ8RXyMD4fM/discussion

Thanks for your quick feedback, I see what you mean!

I think a custom script is the way to go here, as it's really your own logic and not something very common. I'd suggest having a look at script sorting though. Right now you need to influence the way the score is computed because you are sorting by score, but if you can express your sorting logic as a script, you can sort on it directly.

That's exactly what I do (see the second example in my issue):
I use the list of IDs to compute the score of each document. But scoring this way is painfully slow on large datasets, which is why I opened this issue: to ask the community whether I'm the only one who needs this as a feature (a new option in the Ids query) or not :grimacing:

Got it. What I suggested is different from custom_score, although it still executes a script per document. Have a look at script sorting.

I'd love to have a score_by_order as well

Unfortunately, for some reason the function idFromUid was recently removed, so this solution no longer works with the latest Elasticsearch version.
Maybe @martijnvg could comment on why this method was removed and whether there are alternatives?
https://github.com/elasticsearch/elasticsearch/commit/0e780b7e99ac1af46d9f0f4a8b04517ef2a0cec2#diff-376fdeb0c8f420de09933212c022341cL97

Maybe someone else knows how to get this feature to work again?
Thank you.

Actually, I just tried using doc['id'].value instead of org.elasticsearch.index.mapper.Uid.idFromUid(doc['_uid'].value); and everything seems to work fine too. I don't know why idFromUid was used in this solution in the first place.

"script": "return -ids.indexOf(Integer.parseInt(doc['id'].value));"

My script just looks like this: ids.indexOf(doc['id'].value)

Closing as won't fix.

I would still consider supporting an alternative for this, @clintongormley

It's a pity that after 3 years ES still hasn't provided a reliable alternative for situations where we need to keep the order of the requested documents and also apply search filters.

Scripting is not an option for serious, large-scale applications, and anyway I guess we cannot use expressions for this, which are in theory more performant.

It would be enough to be able to sort by position in an array of values, provided in the ES request, like:

{
 "sort": [{
    "_position": {
       "field_to_compare_values_with": [1201, 982, 34134]
    }
  }]
}

Similar to the way _geo_distance is used. Benefits compared to the first proposal:

  • It would be compatible with any kind of search that supports sort.
  • As you can see, we don't need to mess with scores.
  • It would work with any field, not only ids.
  • It is useful when you want to keep a fixed/constant sort order and still need ES to, for example, calculate and return the distance in geospatial searches.

If you reconsider it, I could open this in a new feature-request ticket.
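
Note that `_position` is only a proposal and does not exist in Elasticsearch. To make the intended semantics concrete, here is a small Python sketch that applies the same ordering to hits that have already been fetched. The helper name `sort_by_position`, the sample documents, and the choice to put unlisted values last are assumptions for illustration, not part of the proposal.

```python
def sort_by_position(hits, field, values):
    """Order hits by where hit[field] appears in values; unlisted values go last."""
    rank = {value: i for i, value in enumerate(values)}
    return sorted(hits, key=lambda hit: rank.get(hit[field], len(values)))


# Made-up sample hits: the extra field stands in for a computed geo distance.
hits = [
    {"user_id": 34134, "distance_km": 2.5},
    {"user_id": 1201, "distance_km": 7.1},
    {"user_id": 77, "distance_km": 0.4},
    {"user_id": 982, "distance_km": 3.3},
]
result = sort_by_position(hits, "user_id", [1201, 982, 34134])
for hit in result:
    print(hit)
```

The other fields on each hit are untouched, which is the point of the fourth bullet: the order is fixed by the list while computed values stay available.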

I suggest you use a script like this:
switch(doc['id'].value){case "1337":return 0;case "1664":return 1;case "8888":return 2;case "1111":return 3;}
Avoiding the list lookup will speed up your script execution.
You can also use a hashmap.
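
For the hashmap variant, the map can be prepared once on the client before the request is sent. A rough Python sketch, assuming the ranks are then handed to the script as a parameter; `build_rank_map` is a hypothetical helper name, and keeping the first position of a repeated ID is an arbitrary choice.

```python
def build_rank_map(ids):
    """Map each requested ID to its position, so ranking needs one lookup per document."""
    ranks = {}
    for position, doc_id in enumerate(ids):
        # Keep the first position if an ID is listed twice.
        ranks.setdefault(doc_id, position)
    return ranks


ids = ["1337", "1664", "8888", "1111"]
ranks = build_rank_map(ids)
# Same rank as a list scan, without walking the list for every document.
print(ranks["8888"], ids.index("8888"))
```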

I see the feature request is closed. Is there support for this in 6.4+? We need to sort an index of 20k+ documents (with string ids, GUIDs) and don't think a script with custom ranking is the correct way to do this...

The original request, to support this through a query option in the JSON, is still useful as an alternative to maintaining scripting logic. Can you please consider this feature request?

Sorting by script will take more time.
I suggest you use a script like this:
"sort": {
  "_script": {
    "type": "number",
    "script": {
      "source": "params.get(doc._id.value);",
      "params": {
        "5c357c0eb565e654fcc3507c": 0,
        "5c3539f1b565e632d1c690ee": 1
      },
      "lang": "painless"
    },
    "order": "asc"
  }
}
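
Writing the params map by hand gets tedious for long lists, so it can be generated. Below is a sketch in Python, assuming the request body is built as a plain dict and sent with whatever client you use. `build_ordered_search` is a made-up name; the `ids` query and the `_script` sort are the ones shown in this thread, and setting `size` to the number of ids is an assumption so that every requested document fits in one page.

```python
import json


def build_ordered_search(ids):
    """Build a search body that asks for the given IDs in the requested order."""
    return {
        "size": len(ids),  # assumption: return all requested documents at once
        "query": {"ids": {"values": list(ids)}},
        "sort": {
            "_script": {
                "type": "number",
                "script": {
                    "lang": "painless",
                    "source": "params.get(doc._id.value);",
                    # ID -> position in the requested list
                    "params": {doc_id: rank for rank, doc_id in enumerate(ids)},
                },
                "order": "asc",
            }
        },
    }


body = build_ordered_search(["5c357c0eb565e654fcc3507c", "5c3539f1b565e632d1c690ee"])
print(json.dumps(body, indent=2))
```

Additional filters or match queries could be combined with the `ids` query in the usual way; the sort clause stays the same.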
