Arangodb: AQL optimizer doesn't choose the fastest query when joining and sorting a big table

Created on 10 Jul 2017 · 3Comments · Source: arangodb/arangodb

my environment running ArangoDB

I'm using the latest ArangoDB of the respective release series:

[ ] 2.8
[ ] 3.0
[x] 3.1
[x] 3.2 Beta4
[ ] self-compiled devel branch

On this operating system:

[ ] DCOS on
- [ ] AWS
- [ ] Azure
- [ ] own infrastructure
[x] Linux
- [x] Debian .deb
- [ ] Ubuntu .deb
- [ ] SUSE .rpm
- [ ] RedHat .rpm
- [ ] Fedora .rpm
- [ ] Gentoo
- [ ] docker - official docker library
- [ ] other:
[ ] Windows, version:
[ ] MacOS, version:

this is an AQL-related issue:

[ ] I'm using graph features

I'm issuing AQL via:

[x] web interface with this browser: _Chrome_ running on this OS: _Debian 3.16.43-2 (2017-04-30) x86_64 GNU/Linux_
[ ] arangosh
[ ] this Driver:

I've run db._explain("<my aql query>") and it didn't shed more light on this.
The AQL query in question is:

FOR d IN vDevice
  FOR r IN vRoute
    FILTER d._id == r.vDeviceId
    SORT r.mask DESC
    LIMIT 50
    RETURN {id:r._id, dev:d.hostname, network:r.prefix}

This query takes ~5.5s. vDevice has 60 docs and vRoute has ~1.5M docs. We have a hash index at vRoute table called vDeviceId. This is the execution plan of the query above:

 Id   NodeType                    Est.   Comment
  1   SingletonNode                  1   * ROOT
  3   EnumerateCollectionNode   211604     - FOR r IN vRoute   /* full collection scan */
  6   CalculationNode           211604       - LET #4 = r.`mask`   /* attribute expression */   /* collections used: r : vRoute */
 11   IndexNode                 211604       - FOR d IN vDevice   /* primary index scan */
  7   SortNode                  211604         - SORT #4 DESC
  8   LimitNode                     50         - LIMIT 0, 50
  9   CalculationNode               50         - LET #6 = { "id" : r.`_id`, "dev" : d.`hostname`, "network" : r.`prefix` }   /* simple expression */   /* collections used: r : vRoute, d : vDevice */
 10   ReturnNode                    50         - RETURN #6

Indexes used:
 By   Type      Collection   Unique   Sparse   Selectivity   Fields       Ranges
 11   primary   vDevice      true     false       100.00 %   [ `_key` ]   (d.`_id` == r.`vDeviceId`)

The following query takes ~1.5s. We have to add a useless FILTER (2nd line) to force arango to use our index and make the query to perform.

 FOR d IN vDevice
   FILTER LIKE (d.hostname,'%%')
   FOR r IN vRoute
     FILTER d._id == r.vDeviceId
     SORT r.mask DESC
     LIMIT 50
     RETURN {id:r._id, dev:d.hostname, network:r.prefix}

The execution plan follows:

 Id   NodeType                    Est.   Comment
  1   SingletonNode                  1   * ROOT
  2   EnumerateCollectionNode      417     - FOR d IN vDevice   /* full collection scan */
  3   CalculationNode              417       - LET #2 = LIKE(d.`hostname`, "%%")   /* simple expression */   /* collections used: d : vDevice */
  4   FilterNode                   417       - FILTER #2
 13   IndexNode                 393648       - FOR r IN vRoute   /* hash index scan */
  8   CalculationNode           393648         - LET #6 = r.`mask`   /* attribute expression */   /* collections used: r : vRoute */
  9   SortNode                  393648         - SORT #6 DESC
 10   LimitNode                     50         - LIMIT 0, 50
 11   CalculationNode               50         - LET #8 = { "id" : r.`_id`, "dev" : d.`hostname`, "network" : r.`prefix`}   /* simple expression */   /* collections used: r : vRoute, d : vDevice */
 12   ReturnNode                    50         - RETURN #8

Indexes used:
 By   Type   Collection   Unique   Sparse   Selectivity   Fields            Ranges
 13   hash   vRoute       false    false         0.11 %   [ `vDeviceId` ]   (d.`_id` == r.`vDeviceId`)

The solution seems hacky though. What could be the reason for such behavior?

The issue can be reproduced using this dataset: _We can provide dump via email._

Please provide a way to create the dataset to run the above query on; either by a gist with an arangodump, or `db.collection.save({my: "values"}) statements. If it can be reproduced with one of the ArangoDB example datasets, it's a plus.

1 Feature 2 Fixed 3 Optimizer performance

Source

gurisko

👍1

Most helpful comment

Hey team! I was wondering if you are thinking of use index <name> feature where we could specify which index to use directly in AQL.

gurisko on 12 Oct 2017

👍3

All 3 comments

Hey team! I was wondering if you are thinking of use index <name> feature where we could specify which index to use directly in AQL.

gurisko on 12 Oct 2017

👍3

Hi,
sorry for comming back late on this.

Let me explain the tactics of the optimizer here.
You can think of the AQL becoming a snail inside of arangodb, chewing the data at its mouth passing it down all the way to the RETURN-statement. It strives to Bite as less data with its mouth as possible, and move the parts of the query so the data that is being passed along to the back of the snake is reduced as early as possible.
So, usually a full collection scan is always a bad idea. For that reason it will always try to have indexed access on the outermost FOR statement. After that it will try to have the the join-condition being indexed.

Your workaround introduces a Calculation node, which the optimizer can't tell whether its ok to wrap the query around or not, so it mustn't swap the loops.

A more smarter way to tell the Optimizer how to handle this situation (if the vDevice collection doesn't have many documents) may look like this:

LET devices=(FOR device IN vDevice RETURN { device._id, device.hostname} )
FOR d IN devices
....

Then the optimizer should also pick your Index of choice.

dothebart on 22 Nov 2017

Hi @gurisko,

both named indexes and the option "indexHint" to override the internal optimizer decision regarding which index to use are supported in Release Candidate 7 of ArangoDB 3.5.

For further details, please have a look at:

You can find the latest Release Candidate of 3.5 in the download section.