Server: Files search takes a long time with unified search

Created on 2 Nov 2020 · 8 comments · Source: nextcloud/server

Depending on the number of files, the file search component of unified search takes a long time to return results. It's the slowest of all providers – Talk, Mail and everything else load before it. On our own instance, it takes roughly 40 seconds.

Also, there is no x in the input field during loading, so it's impossible to cancel the search – you have to wait for it to finish before starting a different one. To fix this, we could replace the spinner with an x on hover/focus.

GIF for illustration: the search takes 40 seconds.

(This is closely related to direct filtering of the current folder, which also doesn't work: https://github.com/nextcloud/server/issues/23432)

cc @nextcloud/search

Labels: 1. to develop · bug · search

All 8 comments

> Also, there is no x in the input field during loading so it's impossible to cancel the search – you have to wait for it to finish to start a different one. To fix this, we could replace the spinner with an x on hover/focus.

Searching cancels other searches.
You can always change your query and it will trigger the search again.

It's mainly slow because the files search implementation doesn't paginate. So you'd always build the full results list on the back-end before it's chopped into the individual pages. This doesn't scale, as you observed. Our instance just has a lot of files and shares, hence the slowness.

@ChristophWurst non-pagination is NOT the problem:

Pasting 152151 into the search field yields a single result after ~4.7 s (runtime taken from Firefox devtools).

Executing a query directly on the PostgreSQL 12.0 server (2.6e6 rows in oc_filecache):

  • SELECT * FROM oc_filecache WHERE name LIKE '%152151%': 0.5 s
  • SELECT * FROM oc_filecache WHERE name ILIKE '%152151%': 1.1 s

This makes the Nextcloud file search at least 4 times slower than a full table scan run directly in the database.

If the search term is typed rather than pasted, things get much worse, since each keystroke triggers a new query on the server, consuming more resources (when typing fast, the final search request takes 16 s).

@skjnldsv A new search does not cancel the running queries in the database backend.

> @skjnldsv A new search does not cancel the running queries in the database backend.

Right, it cancels the frontend request of course.

I enabled log_min_duration_statement=800 in the PostgreSQL Server, and found that each search will execute the same query four times, which explains the x4 observed above.
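For reference, that threshold is a single setting in postgresql.conf (the value is in milliseconds):

```ini
# Log every statement that runs longer than 800 ms
log_min_duration_statement = 800
```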

> @ChristophWurst non-pagination is NOT the problem:

If there were pagination, the query would include more WHERE clauses to scope the range. It would therefore help.
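As a rough illustration of how such a scoping clause could look (a sketch using SQLite and made-up rows, not the actual Nextcloud query builder), keyset pagination resumes after the last fileid of the previous page instead of materializing the full result list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oc_filecache (fileid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO oc_filecache (fileid, name) VALUES (?, ?)",
    [(i, f"report-152151-{i}.pdf") for i in range(1, 101)],
)

def search_page(term, last_seen_fileid=0, page_size=25):
    """Keyset pagination: the extra WHERE clause scopes the scan to
    rows after the previous page's last fileid."""
    return conn.execute(
        "SELECT fileid, name FROM oc_filecache"
        " WHERE name LIKE ? AND fileid > ?"
        " ORDER BY fileid LIMIT ?",
        (f"%{term}%", last_seen_fileid, page_size),
    ).fetchall()

page1 = search_page("152151")
page2 = search_page("152151", last_seen_fileid=page1[-1][0])
print(len(page1), page1[-1][0])  # 25 25
print(len(page2), page2[-1][0])  # 25 50
```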

But yeah, ILIKE is expensive.

Not really if the result set is small, and not at all when using window functions, since the time-consuming part is the sequential scan, not the result-set transfer.

For PostgreSQL, speed could be improved by not using name ILIKE :searchExpression (1.1s), but lower(name) ILIKE lower(:searchExpression) (0.8s) or even better lowercasedname LIKE lower(:searchExpression) (0.5s).
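A minimal illustration of the lower()/LIKE rewrite (a sketch using SQLite, where PRAGMA case_sensitive_like makes plain LIKE case-sensitive, standing in for PostgreSQL's case-sensitive LIKE default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = ON")  # plain LIKE is now case-sensitive
conn.execute("CREATE TABLE oc_filecache (name TEXT)")
conn.executemany("INSERT INTO oc_filecache VALUES (?)",
                 [("Report-152151.pdf",), ("notes.txt",)])

term = "%report%"

# Case-sensitive LIKE misses the mixed-case file name ...
strict = conn.execute(
    "SELECT name FROM oc_filecache WHERE name LIKE ?", (term,)).fetchall()
print(strict)  # []

# ... while lowering both sides restores case-insensitive matching
# without needing ILIKE.
loose = conn.execute(
    "SELECT name FROM oc_filecache WHERE lower(name) LIKE lower(?)",
    (term,)).fetchall()
print(loose)  # [('Report-152151.pdf',)]
```

The precomputed-column variant would go one step further and store lower(name) at write time, so the per-row lower() call disappears from the scan entirely.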

But there are fruits hanging much lower:

  • eliminating those 3 excess queries – something is going awfully wrong here.
  • delaying the search by some 300-500 ms after the last keystroke, to prevent multiple zombie queries from eating database performance.

More advanced:

  • split the file search into two requests: first search the current directory only (my example has 4800 files in the folder, <20 ms), then the full scan. Quite often, only the current directory is of interest.
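The delay-after-the-last-keystroke suggestion is a classic trailing-edge debounce. A minimal sketch (a hypothetical helper for illustration – the actual Nextcloud frontend would do this in JavaScript):

```python
import threading
import time

def debounce(fn, delay):
    """Trailing-edge debounce: call fn only once no new call has
    arrived for `delay` seconds, cancelling earlier pending calls."""
    timer = None
    def wrapper(*args):
        nonlocal timer
        if timer is not None:
            timer.cancel()
        timer = threading.Timer(delay, fn, args)
        timer.start()
    return wrapper

queries = []  # stands in for requests reaching the database
search = debounce(queries.append, 0.3)

# Simulate fast typing: every keystroke re-arms the timer,
# so only the final term is actually sent.
for term in ("1", "15", "152", "1521", "152151"):
    search(term)

time.sleep(0.5)
print(queries)  # ['152151']
```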

I found the reason why the database was hammered with the same query four times for a single search: the queries correspond to four shares from the same user. These result in four MountPoints with different roots on the same physical storage (same numericId). Unfortunately, the filtering code is buried very deep in CacheWrapper, and each wrapper retrieves the full data from its Cache instance before filtering.
Hiding (parts of) the filtering from the View in separate Caches may be good programming practice in general, but it is a real pain for big datasets. I can't see a way to speed this up cleanly.

