Core: Add search support on the REPORT method

Created on 6 Jul 2016  ·  39 Comments  ·  Source: owncloud/core

Currently the REPORT method on the files DAV (from https://github.com/owncloud/core/pull/22112) provides a way to filter by tags.

It should be extended to also allow specifying search attributes.
Eventually it should be possible for the search app to use this method instead of a private endpoint.

Backend changes

  • TODO: improve REPORT to work with oc:search and oc:depth: 2-4 md
  • TODO: improve REPORT to work without any search pattern (for PROPFIND pagination): 0.5-1md
  • [x] new REPORT type for searching files, plugged into the existing PHP search API (which will deliver full text search when search_lucene or search_elastic is enabled): https://github.com/owncloud/core/pull/31873
  • [x] Add capability

Frontend changes

Follow up

files enhancement

All 39 comments

Additionally we should add pagination on the REPORT method that can be used regardless of the search.

Ref: https://github.com/owncloud/core/issues/13915

Taken from a draft spec which was planned for 9.1:

Requirements

Any client using ownCloud APIs wants to search in files based on meta information as well as content. Clients are our mobile clients as well as the web ui – even the desktop client can gain some benefits from it (hint: pagination).

Since everything which is related to files is handled via WebDAV – search will be just the same.
This fits quite well into the overall architecture: we do tagging and comments via webdav as well, and other dav based protocols also support search mechanisms (see https://tools.ietf.org/html/rfc6352#section-8.6 and https://tools.ietf.org/html/rfc4791#section-7.8)

There is even a search specification on its own: https://tools.ietf.org/html/rfc5323
Requirements for the search API:

  • pagination
  • search in content (full text search)
  • search in meta data like name, mtime, mime type, size
  • ? integrate tags and comments ?
  • ? exif information ?
  • limit search to a sub folder
  • ? limit to share related information like shared with and shared by ?

Feature description

We should follow the calendar and addressbook query approach.
First of all because these concepts have proven to work quite well in the area of caldav and carddav.

The WebDAV search spec introduces a completely new verb, which will almost certainly result in deployment issues. Furthermore, not a single webdav client supports the search spec.
Last but not least, the spec is pretty rich and implementing it can be rather time consuming.

Use-cases

Gallery wants to search in a sub folder Pictures for all jpeg and a special camera model as set in the exif information
An Android music player wants to search for all mp3s with an id3 tag set to a specific artist
The desktop client wants to list all files in a subfolder but paginated with 1000 files per page/request.

Next Steps

  1. Have a look at calendar and addressbook query specs and define the query xml for files
  2. Decide on the search scope

setting this to 9.2 - we need this for paginated file list in files.

@georgehrke you already came up with a proposal for the report object - please add it here - THX

@georgehrke feel free to submit a wip-pr to see what the implementation could look like - THX

Make it a separate report called "search-files" ?

Because if we mix it together with the existing "filter-files" report, it will become messy, especially if we want to combine "search by X but also filter by favorites", etc

In addition let's try the json way .... see https://youtu.be/sqgvjidj7iQ

@PVince81 are we ready with this?

Search query:

<?xml version="1.0" encoding="utf-8" ?>
<oc:filter-files xmlns:a="DAV:" xmlns:oc="http://owncloud.org/ns" >
        <a:prop>
                <oc:id/>
                <oc:fileid/>
                <oc:permissions/>
                <a:getlastmodified/>
                <a:getetag/>
                <a:getcontenttype/>
                <a:resourcetype/>
                <oc:downloadURL/>
                <oc:ddC/>
                <oc:size/>
                <oc:owner-id/>
                <oc:owner-display-name/>
                <oc:checksum />
                <oc:tags />
                <a:quota-used-bytes/>
                <a:quota-available-bytes/>
        </a:prop>
        <oc:search>
                <oc:pattern>pattern-goes-here</oc:pattern>
                <oc:offset>0</oc:offset>
                <oc:limit>5</oc:limit>
        </oc:search>
</oc:filter-files>

Properties are PROPFIND-properties.

Response is PROPFIND response with a list of result nodes.
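
As a rough illustration of the request body above, here is a minimal Python sketch that builds the same XML with the standard library. The function name `build_search_body` and the default property list are assumptions made for this example; the element names and namespaces are the ones from the proposal.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
OC_NS = "http://owncloud.org/ns"


def build_search_body(pattern, offset=0, limit=5, props=None):
    """Build the proposed REPORT body (hypothetical helper, not ownCloud code)."""
    # Use the same prefixes as the example request in this thread.
    ET.register_namespace("a", DAV_NS)
    ET.register_namespace("oc", OC_NS)
    if props is None:
        # Small default selection; any PROPFIND property could be listed here.
        props = [(OC_NS, "fileid"), (DAV_NS, "getetag"), (OC_NS, "size")]

    root = ET.Element(f"{{{OC_NS}}}filter-files")
    prop = ET.SubElement(root, f"{{{DAV_NS}}}prop")
    for ns, name in props:
        ET.SubElement(prop, f"{{{ns}}}{name}")

    search = ET.SubElement(root, f"{{{OC_NS}}}search")
    ET.SubElement(search, f"{{{OC_NS}}}pattern").text = pattern
    ET.SubElement(search, f"{{{OC_NS}}}offset").text = str(offset)
    ET.SubElement(search, f"{{{OC_NS}}}limit").text = str(limit)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_search_body("pattern-goes-here"))
```

The resulting string would be sent as the body of a REPORT request to the files endpoint; the HTTP call itself is left out on purpose.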

We could use the Depth header or have oc:depth to specify:

  • depth: 0: only search or return results from current folder
  • depth: inf: search / results from all folders.

To do a pagination PROPFIND without search: use REPORT with Depth: 0/1 + oc:offset, oc:limit
To do a search in all folders: use REPORT with Depth: Inf + oc:offset, oc:limit

  • move files app search field to use REPORT
  • instead of using PROPFIND for file list in web UI, use REPORT + pagination

=> moved to top post

  • TODO: improve REPORT to work with oc:search and oc:depth: 2-4 md
  • TODO: improve REPORT to work without any search pattern (for PROPFIND pagination): 0.5-1md

Beware that it will not be possible to combine oc:search with oc:filter-rules.

=> moved to top post

  • [x] decide whether to use oc:depth or the header. I do like to see everything related to the query inside of the body.

decide whether to use oc:depth or the header. I do like to see everything related to the query inside of the body.

make it part of the xml body

Talked to @michaelstingl. This would be a good item to schedule for Q2/2018/10.0.6.

moving to planned again as we focus mostly on sharing and finishing existing stuff in 10.0.9

  • add capability entry whether search is available -> moved to top post

updated top post with task list https://github.com/owncloud/core/issues/25373

in a first phase I think we should focus on the backend part only as it will be useful for all clients

Since we have some room in the sprint, assigned to @jvillafanez to have a look to make some progress

Some interesting additions to be considered:

google-like search patterns

Basic search pattern can be searchme or search me. We can include something like name:file or mimetype:file to indicate if we want to search by name or by mimetype. We can also use lessthan:size:20MB to search for files with size less than 20 MB.

The general pattern would be [[operator:]attribute:]value. Note just using the value would perform a regular search as the search provider wants.

Specific attributes and operators will depend on what each provider supports.

Using a non-supported operator or attribute will either return an error or consider the whole string as a raw search pattern (this might need discussion, but I'd keep the code changes small).

Note that the whole search string will be sent to the search provider. It's up to the search provider to decide how it will be interpreted.
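
To make the `[[operator:]attribute:]value` idea concrete, here is a minimal Python sketch of how a provider might split the string. The function name, the supported attribute list and the `greaterthan` operator are assumptions for illustration; only `name`, `mimetype`, `size` and `lessthan` appear in the examples above. Unsupported prefixes fall back to a raw pattern, which is one of the two options mentioned above, not a decision.

```python
def parse_search_pattern(raw,
                         attributes=("name", "mimetype", "size"),
                         operators=("lessthan", "greaterthan")):
    """Split '[[operator:]attribute:]value' into its parts (hypothetical helper)."""
    parts = raw.split(":", 2)
    # operator:attribute:value, e.g. lessthan:size:20MB
    if len(parts) == 3 and parts[0] in operators and parts[1] in attributes:
        return {"operator": parts[0], "attribute": parts[1], "value": parts[2]}
    # attribute:value, e.g. name:file (the value may itself contain colons)
    if len(parts) >= 2 and parts[0] in attributes:
        return {"operator": None, "attribute": parts[0],
                "value": raw.split(":", 1)[1]}
    # Anything else is treated as a raw search pattern.
    return {"operator": None, "attribute": None, "value": raw}


if __name__ == "__main__":
    for example in ("search me", "name:file", "lessthan:size:20MB"):
        print(example, "->", parse_search_pattern(example))
```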

<?xml version="1.0" encoding="utf-8" ?>
<oc:filter-files xmlns:a="DAV:" xmlns:oc="http://owncloud.org/ns" >
        <a:prop>
......
        </a:prop>
        <oc:search>
                <oc:pattern>pattern-goes-here</oc:pattern>
                <oc:offset>0</oc:offset>
                <oc:limit>5</oc:limit>
        </oc:search>
</oc:filter-files>

Expected changes for this

  • Just include a \OCP\Search\SearchPatternHandler to make things easier. https://stackoverflow.com/a/15191418 can serve as a base. Each provider is expected to use this class to handle the search pattern.
  • Each search provider will need some changes depending on what it wants to support.

For short term goals, this can be ignored for now. This is just to make sure that the search pattern will be handled by each provider, showing how things can be extended.

Using specific search providers

By default, if no search provider is specified, all the providers will be queried and the responses will be merged. This implies that there might be duplicated results. (This is the current approach of the search service.) To be decided what to do with the duplicates: it might be interesting to consider that the more often a result is duplicated, the more relevant it is, meaning that it is probably the result you wanted rather than another one.

There are some advantages if we include the previous point about the search pattern:

  • Basic search pattern will likely go through all the providers returning information from all of them. This means that there shouldn't be any change needed in the providers for this case.
  • "Complex" patterns might not return results in some providers for several reasons, such as the operator / attribute not being supported or the pattern not being supported at all. Worst case, a provider might return unwanted results, but I'd expect it to return no results on most of those searches.

The main disadvantage is that we need to query all the providers, including those the query was never meant for, which will waste a considerable amount of time.

In order to overcome this, we'll need to include an optional "provider" tag in the request body

        <oc:search>
                <oc:provider>myawesomeSearchProvider</oc:provider>
                <oc:pattern>pattern-goes-here</oc:pattern>
                <oc:offset>0</oc:offset>
                <oc:limit>5</oc:limit>
        </oc:search>

That would tell the search API to use only that specific search provider. It should solve the problems above as well as properly use complex search patterns that might make sense within the scope of that search provider.

Expected changes for this

  • search provider name discovery: we'll need to provide a list with all the available search provider names in the server. First candidate to show this list is within the capabilities.
  • search provider naming: we can use the class name by default. To be decided if we want to allow changing the name of the provider. Note that right now there is no search provider with a name
    public final function getName() { return \get_class($this); }
  • additional changes in the search system to allow searching in a specific provider

Additional considerations

Pagination

There are several problems we'll need to decide how to solve:

  • Same result appearing in different pages. Let's say you have "A" "C" "E" "G" and "I" results and you want them in pages of 2 elements. First result is "A" and "C". At this point, something changes and "B" appears as part of the result (a new file has been added, for example). Now you request the second page and what you get is "C" (duplicated) and "E". There is also the additional problem of "B" which might be ignored as it wasn't present in the first page at that time.
  • Search providers might not scope the results to a particular folder. I think none of the search providers that we have right now can do this. We'll need to decide who is going to be responsible for scoping the results: the search service or each of the providers. Note that switching the responsibility later might not be possible. Limiting the results to a particular folder can be done by the search service, but it will likely have a noticeable performance penalty because it will need to filter all the results of all the providers.

  • Unknown number of pages per provider. Let's say you want page 20 with 5 elements per page. We'll need to check how many pages the first provider has, then the second, and so on until we know which provider holds "our" page 20. We'll likely need to get all the results of the first provider and cut them into pages, and the same for the rest if needed.
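
The first problem in the list (the same result showing up on two pages) can be reproduced with a few lines of Python. This is only a sketch of plain offset/limit slicing over a sorted result list; the `page` helper is made up for the example.

```python
def page(results, offset, limit):
    """Return one offset/limit page of an already sorted result list."""
    return results[offset:offset + limit]


# Result set at the time of the first request.
before = ["A", "C", "E", "G", "I"]
first = page(before, 0, 2)

# "B" is added before the client asks for the second page.
after = sorted(before + ["B"])
second = page(after, 2, 2)

if __name__ == "__main__":
    print("page 1:", first)
    print("page 2:", second)
    print("B seen by client:", "B" in first + second)
```

With these inputs the client gets "C" on both pages and never sees "B", which matches the scenario described above.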

Depth property

This might depend on the support of the property by the provider. Some of them might search in all the FS tree (elastic search?) while others might just search within the current folder due to performance reasons.

An easy solution could be to include this as part of the search pattern, such as depth:2, trying to target a specific search provider that supports this option.

The other option could be to fetch all the results and let the search service filter them. The problem is that this will be inefficient.

Scoping results to a particular folder

Similar to what happens with the depth property. You can check the point about scoping results in the pagination.
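
For reference, this is roughly what "let the search service filter the results" would mean for both depth and folder scoping. It is a minimal Python sketch working on plain path strings; `scope_results` and its parameters are hypothetical, and real provider results would carry more than a path.

```python
from pathlib import PurePosixPath


def scope_results(paths, folder, max_depth=None):
    """Keep only paths below `folder`, optionally at most `max_depth` levels down."""
    base = PurePosixPath(folder)
    scoped = []
    for raw in paths:
        try:
            relative = PurePosixPath(raw).relative_to(base)
        except ValueError:
            continue  # not inside the requested folder
        depth = len(relative.parts)
        if depth == 0:
            continue  # the folder itself is not a result
        if max_depth is not None and depth > max_depth:
            continue
        scoped.append(raw)
    return scoped


if __name__ == "__main__":
    found = ["/Pictures/a.jpg", "/Pictures/2018/b.jpg", "/Documents/c.txt"]
    print(scope_results(found, "/Pictures", max_depth=1))
```

Every result of every provider has to pass through this loop, which is where the performance concern comes from.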

Regarding the expression stuff: it is likely that search_lucene or search_elastic already have a language you can use. All that matters here is that we pass the string as is to the search providers, which then decide how to parse and use the expression.

This implies that there might be duplicated results. (This is the current approach of the search service)

In general I'd expect people to only have a single search app enabled. I'm not sure whether we really support having multiple search providers enabled at the same time. @butonic

Is there logic for the existing search field that disables filename-based search whenever another search provider is enabled ? In any case I think we should just replicate whatever the search field does as the requirement is to provide similar functionality in clients.

In general I think we should only support having a single search provider as we cannot guarantee coherence in expression syntax. Also I don't think we should design any generic expression language to be translated for multiple search providers. Let's limit to a single search provider for now.

duplicated results

The current REPORT method used for searching favorites can have results with the same file names but different href URLs. In case of duplicates we might need to deduplicate based on the href field. I'm not sure if we should concern ourselves with such duplicates at this stage.
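
If we ever decide to deduplicate, doing it on the href would look roughly like the Python sketch below. The dict shape with an `href` key is an assumption for the example, not the actual result format.

```python
def dedupe_by_href(results):
    """Drop results whose href was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for result in results:
        href = result["href"]
        if href in seen:
            continue
        seen.add(href)
        unique.append(result)
    return unique


if __name__ == "__main__":
    merged = [
        {"href": "/remote.php/dav/files/admin/a.txt", "provider": "one"},
        {"href": "/remote.php/dav/files/admin/sub/a.txt", "provider": "one"},
        {"href": "/remote.php/dav/files/admin/a.txt", "provider": "two"},
    ]
    for item in dedupe_by_href(merged):
        print(item["href"], item["provider"])
```

Two files with the same name but different hrefs are both kept; only an identical href is treated as a duplicate.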

Pagination

Yes, pagination trickiness was always a concern so far due to possible concurrent changes. For now I suggest ignoring this issue and assuming the clients will only retrieve the first page, a big one, as we cannot guarantee consistency of results (there are likely technical ways to do so involving caching, but we don't want to go there right now).

Does the web UI search have pagination ?

@michaelstingl I think we talked about pagination problems before and I think we agreed that clients would only retrieve the first page of results, a big one ?

Depth

At this point, if we go the search provider route, I'm not sure whether Depth even makes sense any more. Maybe we remove it and only allow search on the root element ?

Scoping results to a particular folder

As far as I remember the web UI has two different search routines:
1) a simple filter applied on the current folder view
2) a call to the search API which will show "results in other folders" in a second block of results, from which the current results are likely excluded

@michaelstingl do we require the search to be scopable by folder or is it enough for it to be a global file search ?

@PVince81 Go for the simplest approach for v1, and we'll test-drive it in the new iOS app (https://github.com/owncloud/ios-app/issues/53) and provide feedback…

Regarding the expression stuff: it is likely that search_lucene or search_elastic already have a language you can use. All that matters here is that we pass the string as is to the search providers, which then decide how to parse and use the expression.

Yes. The interpretation of the search string will depend on each particular search provider. Our search service will just send the string to the provider without any manipulation on our part.
This implies no changes on our plans.

In general I'd expect people to only have a single search app enabled. I'm not sure whether we really support having multiple search providers enabled at the same time. @butonic

Having multiple search apps is possible. The results will be merged. If we only want to support a single search provider, we have to limit it somehow in the search service.

Note that this limitation isn't planned, so we'll need to check possibilities.

Is there logic for the existing search field that disables filename-based search whenever another search provider is enabled ?

No. The search_elastic app removes the core's provider and inserts its own, while the search_lucene just inserts the provider.

The current REPORT method used for searching favorites can have results with the same file names but different href URLs. In case of duplicates we might need to deduplicate based on the href field. I'm not sure if we should concern ourselves with such duplicates at this stage.

I don't think we should worry about duplicates as long as the clients are aware. If there are duplicates within a provider, blame the provider. We might want to separate the results somehow based on the provider: my provider returns these files, your provider these ones, etc, this would clarify what results come from what provider.

Does the web UI search have pagination ?

Yes, but it isn't properly implemented: it fetches all the results and then returns the slice we want. I don't think the provider will scale for large datasets, and fetching all the data (because the provider doesn't support pagination) won't help.

The question here is whether we want to support pagination or not taking into account that there might be several search providers, at least until we decide to have only one provider.

Instead of pagination we can have a limit, kind of "give me the first X results". This can be easier to handle on our side. The drawback is that we won't handle pages, so the <oc:offset>0</oc:offset> doesn't make sense in this scenario.

As far as I remember the web UI has two different search routines:

The backend call is the same. The UI presents the results differently.

No. The search_elastic app removes the core's provider and inserts its own, while the search_lucene just inserts the provider.

Is there already an API that aggregates the results ? I'd assume that the web UI search field calls said API. We could just use the same and not worry about single or multiple providers.

Regarding pagination:

Yes, let's only add a limit and no offset. If no limit given, set a default one (or fail with 400, whatever we already do now). Don't let the server return infinite results.
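
A minimal Python sketch of the "limit only, with a default, never unbounded" rule could look like this. The default of 30, the cap of 100 and the `BadRequest` exception are all made-up values for illustration; whether an invalid limit should fail with 400 or fall back to the default is still open, as noted above.

```python
DEFAULT_LIMIT = 30   # hypothetical default when the client sends no limit
MAX_LIMIT = 100      # hypothetical hard cap so results are never unbounded


class BadRequest(Exception):
    """Stands in for an HTTP 400 response in this sketch."""


def resolve_limit(raw):
    """Turn the text of the limit element into a safe integer limit."""
    if raw is None or raw.strip() == "":
        return DEFAULT_LIMIT
    try:
        limit = int(raw)
    except ValueError:
        raise BadRequest(f"limit is not a number: {raw!r}")
    if limit < 1:
        raise BadRequest("limit must be at least 1")
    return min(limit, MAX_LIMIT)


if __name__ == "__main__":
    for value in (None, "5", "100000"):
        print(repr(value), "->", resolve_limit(value))
```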

Did you check what format the results are returned in ? Are these easy to pack into PROPFIND-like response format ? Is additional lookup required ?

Is there already an API that aggregates the results ? I'd assume that the web UI search field calls said API. We could just use the same and not worry about single or multiple providers.

Not really. The search service appends all the results from all the providers. It's part of the service.

Did you check what format the results are returned in ? Are these easy to pack into PROPFIND-like response format ? Is additional lookup required ?

Feels BAD. https://github.com/owncloud/core/blob/master/lib/public/Search/Result.php is what we should be handling at webdav level, and https://github.com/owncloud/core/blob/master/lib/private/Search/Result/File.php is what we'll likely be handling, with the risk of being private API.

Looks like the intention was to use json_encode and run away with it. I don't know if we want to add a layer to be able to change what's below at some point in the future...

There is already a report for tags, so I'll create another one for the search. It doesn't seem a good idea to mix things.

Should we add Result->getObject() on the public API that returns the wrapped object for search results ? Since we're only processing File results we could directly get the FileInfo then.

PR ready in https://github.com/owncloud/core/pull/31873

I think most of the "public" details are commented in the PR.

Regarding performance, without changing the search service, and likely replacing it with a new one, I don't think we can optimize more than that. Maybe by fiddling with the webDAV layer, but I'd prefer to touch as little as possible and let webDAV handle its own stuff.

There are several cases that haven't been tested yet, mostly with large data (thousands of files) and some error handling.

@jvillafanez master PR merged, please backport.

Also please add a capability so the clients know this feature is available.

In general it should already be possible to do an OPTIONS call on the files endpoint to find out that the new report type is available. However considering that most capabilities are listed in our capabilities endpoint, better add it there as well. I wonder if we should also expose a value telling what search API type is available (ex: regular search, full text search, etc) so the clients can display a text accordingly.

Capabilities added in https://github.com/owncloud/core/pull/31943

For the record, it's possible to check what reports are available through webDAV:

  1. Create a "supported.xml" file (for example), with the following content (it might contain more things, but this is the minimum to get what we want):
    <?xml version="1.0" encoding="UTF-8"?>
    <A:propfind xmlns:A="DAV:">
      <A:prop>
        <A:supported-report-set/>
      </A:prop>
    </A:propfind>
  2. Send a propfind request to the target node:
    curl -H "Depth: 0" -X PROPFIND --data "@supported.xml" -u admin:Password 'http://10.0.2.8:7080/remote.php/dav/files/admin' | xmllint --format -

The expected response would be something like:

<?xml version="1.0"?>
<d:multistatus xmlns:d="DAV:" xmlns:s="http://sabredav.org/ns" xmlns:oc="http://owncloud.org/ns">
  <d:response>
    <d:href>/remote.php/dav/files/admin/</d:href>
    <d:propstat>
      <d:prop>
        <d:supported-report-set>
          <d:supported-report>
            <d:report>
              <oc:filter-files/>
            </d:report>
          </d:supported-report>
          <d:supported-report>
            <d:report>
              <oc:search-files/>
            </d:report>
          </d:supported-report>
          <d:supported-report>
            <d:report>
              <oc:filter-comments/>
            </d:report>
          </d:supported-report>
        </d:supported-report-set>
      </d:prop>
      <d:status>HTTP/1.1 200 OK</d:status>
    </d:propstat>
  </d:response>
</d:multistatus>

Note that the <oc:search-files/> is only expected to appear in the root folder of the user (either "/remote.php/webdav/" or "/remote.php/dav/files/user/") and not in other places. (https://github.com/owncloud/core/pull/31943 fixes this to make it consistent with the current implementation)
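
For client devs, a minimal Python sketch of reading that response and checking for the search report might look like the following. The function name is made up for this example; the element names and namespaces are the ones in the response above.

```python
import xml.etree.ElementTree as ET

NS = {"d": "DAV:"}
SEARCH_REPORT = "{http://owncloud.org/ns}search-files"


def supported_reports(multistatus_xml):
    """Return the qualified names of all reports listed in supported-report-set."""
    root = ET.fromstring(multistatus_xml)
    names = []
    for report in root.iterfind(".//d:supported-report/d:report", NS):
        # Each d:report wraps exactly one element naming the report type.
        for child in report:
            names.append(child.tag)
    return names


def supports_search(multistatus_xml):
    """True if the node advertises the oc:search-files report."""
    return SEARCH_REPORT in supported_reports(multistatus_xml)
```

Calling `supports_search()` on the PROPFIND response of the user's root folder should return True once the feature is there, and False for subfolders.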

I'll backport both PRs at the same time.

@jvillafanez you can also use the OPTIONS method and/or check response headers, likely easier

It won't be enough. OPTIONS might check only if the REPORT method is available in the target endpoint, but it doesn't say anything about what reports are available

curl -v -X OPTIONS -u admin:Password 'http://10.0.2.8:7080/remote.php/dav/files/admin/foo'
*   Trying 10.0.2.8...
* TCP_NODELAY set
* Connected to 10.0.2.8 (10.0.2.8) port 7080 (#0)
* Server auth using Basic with user 'admin'
> OPTIONS /remote.php/dav/files/admin/foo HTTP/1.1
> Host: 10.0.2.8:7080
> Authorization: Basic YWRtaW46UGFzc3dvcmQ=
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 29 Jun 2018 06:45:38 GMT
< Server: Apache
< Strict-Transport-Security: max-age=15768000; preload
< Set-Cookie: oc8geiifsytk=o600f4hla6dp6ov56tdioqgrfp; path=/; HttpOnly
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate
< Pragma: no-cache
< Set-Cookie: oc_sessionPassphrase=DMKgcg6HAkeZs5ngDj3tdz0Tj5DmsXuYrCkib9IWBfItxyVNq322WKTwCgLppGZ5mNo5KMUxpjFt%2FsY%2F0rjb24SIxHs9VeNJv9rBi9kQQILeVWkV%2BtsJS%2FZlWYE7wECG; path=/; HttpOnly
< Content-Security-Policy: default-src 'none';
< Set-Cookie: oc8geiifsytk=nbotj9i0irkbktajmld062ths2; path=/; HttpOnly
< Set-Cookie: cookie_test=test; expires=Fri, 29-Jun-2018 07:45:39 GMT; Max-Age=3600
< Allow: OPTIONS, GET, HEAD, DELETE, PROPFIND, PUT, PROPPATCH, COPY, MOVE, REPORT
< DAV: 1, 3, extended-mkcol
< MS-Author-Via: DAV
< Accept-Ranges: bytes
< Content-Length: 0
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Robots-Tag: none
< X-Frame-Options: SAMEORIGIN
< X-Download-Options: noopen
< X-Permitted-Cross-Domain-Policies: none
< Content-Type: text/html; charset=UTF-8
< 
* Connection #0 to host 10.0.2.8 left intact


curl -v -H "Depth: 0" -X PROPFIND --data "@supported.xml" -u admin:Password 'http://10.0.2.8:7080/remote.php/dav/files/admin/foo' | xmllint --format -
*   Trying 10.0.2.8...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 10.0.2.8 (10.0.2.8) port 7080 (#0)
* Server auth using Basic with user 'admin'
> PROPFIND /remote.php/dav/files/admin/foo HTTP/1.1
> Host: 10.0.2.8:7080
> Authorization: Basic YWRtaW46UGFzc3dvcmQ=
> User-Agent: curl/7.58.0
> Accept: */*
> Depth: 0
> Content-Length: 125
> Content-Type: application/x-www-form-urlencoded
> 
} [125 bytes data]
* upload completely sent off: 125 out of 125 bytes
< HTTP/1.1 207 Multi-Status
< Date: Fri, 29 Jun 2018 06:49:10 GMT
< Server: Apache
< Strict-Transport-Security: max-age=15768000; preload
< Set-Cookie: oc8geiifsytk=2ja5ntq35becat6cigf6e388dq; path=/; HttpOnly
< Expires: Thu, 19 Nov 1981 08:52:00 GMT
< Cache-Control: no-store, no-cache, must-revalidate
< Pragma: no-cache
< Set-Cookie: oc_sessionPassphrase=86vjXSjlSFsqVwl7GdppfNhAR91rpQ4dKb%2FHJkfPzrxNLBNmd%2FdjqMZwIwVF93cRkf2gOmQeXSBdH1cq74LYiuuRsqW6mU6KpuxEPRANZ5TEGCaHKIMjmt5f9B9XW1%2F3; path=/; HttpOnly
< Content-Security-Policy: default-src 'none';
< Set-Cookie: oc8geiifsytk=9r9cgopcefjtcsjvagoq3nqjur; path=/; HttpOnly
< Set-Cookie: cookie_test=test; expires=Fri, 29-Jun-2018 07:49:10 GMT; Max-Age=3600
< Vary: Brief,Prefer
< DAV: 1, 3, extended-mkcol
< X-Content-Type-Options: nosniff
< X-XSS-Protection: 1; mode=block
< X-Robots-Tag: none
< X-Frame-Options: SAMEORIGIN
< X-Download-Options: noopen
< X-Permitted-Cross-Domain-Policies: none
< Content-Length: 499
< Content-Type: application/xml; charset=utf-8
< 
{ [499 bytes data]
100   624  100   499  100   125   2952    739 --:--:-- --:--:-- --:--:--  3692
* Connection #0 to host 10.0.2.8 left intact
<?xml version="1.0"?>
<d:multistatus xmlns:d="DAV:" xmlns:s="http://sabredav.org/ns" xmlns:oc="http://owncloud.org/ns">
  <d:response>
    <d:href>/remote.php/dav/files/admin/foo/</d:href>
    <d:propstat>
      <d:prop>
        <d:supported-report-set>
          <d:supported-report>
            <d:report>
              <oc:filter-files/>
            </d:report>
          </d:supported-report>
          <d:supported-report>
            <d:report>
              <oc:filter-comments/>
            </d:report>
          </d:supported-report>
        </d:supported-report-set>
      </d:prop>
      <d:status>HTTP/1.1 200 OK</d:status>
    </d:propstat>
  </d:response>
</d:multistatus>

@jvillafanez thanks for confirming. I might have mixed up with the "DAV" header when remembering.

@jvillafanez please raise documentation ticket and copy your usage examples there. This will be useful for client devs to start programming against this.

Considering that Phoenix is providing the future files app, let's not invest time in the old files app regarding integration with this API.

I've raised https://github.com/owncloud/phoenix/issues/172 for Phoenix integration.

The remaining advanced search topics will be addressed in this separate ticket: https://github.com/owncloud/core/issues/31993
