Plan for new search system:
Search driver (eg. Algolia) is called upon to update its index each time a post/discussion is changed
(Each post is a separate object in the index which contains not only post content but also its discussion's title/tags/other metadata)
q = foo "bar baz" (author:toby OR author:franz) is:following
q into an abstracted query object
. eg. https://github.com/pimcore/search-query-parser or https://github.com/netgen/query-translatorTerm in the query object, allow gambits to match and convert term into a Where object to be processed by search driverWhere to restrict results to only tags that the user can seeTerms/Wheres, returns list of discussion IDsHowever, this is complex and will take a while to implement, so it would be wise to keep it slated for 0.2 (as per current roadmap). In the meantime, we are doing #1339
After continually jumping back and forth between performance issues and little bugs in our search implementation, it has become clear that we need to make progress on our driver-based search infrastructure (#509). This would then allow for swapping out a default, naive implementation (hand-maintained index, similar to FluxBB; MySQL fulltext search; etc.) for more advanced drivers based on e.g. Elasticsearch or Algolia.
I was asked to describe my thoughts on the initial architecture in a new ticket. But we already have this one, so I'll add my thoughts here.
Multiple people brought up using Laravel Scout, as it incorporates a driver infrastructure for searches on Eloquent models. This might be possible to use, but I am somewhat skeptical for several reasons:
What might be possible is using Scout's engines (i.e. drivers) so that we have less work in building out the adapters to the various backends. We would still have to take care of the integration into Eloquent / our search infrastructure manually, though. Definitely worth a shot when we get to the implementation phase.
These are the filters / gambits supported by our current search layer...
Searching users:
Searching discussions:
Initial thoughts:
We really have to come up with a better conceptual separation between what is search and what is filtering.
Once this list is complete, we should be able to derive an interface for driver implementations:
As we don't want to require something like Elasticsearch by default (this would complicate the setup far too much), we have to provide a simple implementation by default.
As an alternative to the current MySQL-based fulltext search, we could implement a search based on a hand-made index, similar to what FluxBB and other forum software do / have done. Now that the application layer has to notify the index when it should be changed, this is possible. The main benefit would be compatibility with other database systems, should we ever get there.
Quick primer on how FluxBB's search works:
search_words table and give it an ID. (Only do this for new words, obviously.)search_matches (word_id, post_id) table. This can also contain metadata about where the match occurred (subject, opening post, body...) and the number of matches.search_words, then get the matching post IDs from the search_matches table.We want to keep these things in mind when designing the interface, even if we don't implement them right away...
This is very likely incomplete right now. I will try my best to keep this post updated, as we discuss things below.
Currently, we have a DiscussionSearcher and a UserSearcher. I think we should abstract this system down to an AbstractSearcher, which would support extending ANY model, flarum-defined or not, with a searcher. When applied to Posts, this should also remove necessity for the (frankly redundant) ConfigurePostsQuery event.
Within that searcher, filtering is currently supported by gambits. I think this is a good system, and the new AbstractSearcher system should support extending ANY searcher with a gambit (if it exists).
Now, the searching itself it currently handled by a 'FullTextGambit'. This, in my opinion, is the part that we could refactor out into a system that supports various search drivers. A few possible ideas:
In my opinion, the refactor I described above should be included by stable. Implementing an extender for custom search drivers could be delayed by 0.2.
Franz and I had a very productive discussion today, here are my takeaways:
One point of uncertainty is where to draw the line between searching and filtering, and how these 2 concepts should fit in/work together. I am in favor of a tightly coupled approach, that is, having filtering and searching together under one system. Here's why:
askvortsov and filtering for open PRs by askvortsov with a search term of Policy Refactor would not be processed radically differently, but that the latter would build upon the former.Our goal in designing the new system should be to eventually support search drivers. This should be extendable, so that extension developers can add new drivers. The issue emerges of how to generalize our implementation, so that filters (currently presented by gambits) will work across various extenders.
SimpleFlarumSearch search driver. It will have a SimpleFlarumSearch extender, which will allow people to add gambits, and to set fulltextgambits (or filterers and searchers, respectively if we want to change the names to be clearer).SimpleFlarumSearch driver, and corresponding FilterSpecs will be registered in the core of search.public class Search implements ExtenderInterface {
// Add search driver
public function searchDriver(SearchDriverInterface $driver);
// Add filter spec
public function filterSpec(AbstractModel $model, FilterSpecInterface $filterSpec);
}
public function SimpleFlarumSearch implements ExtenderInterface {
// Right now these are 'Gambits'
public function filter(AbstractModel $model, FilterSpecInterface $filterSpec, SimpleFlarumFilterInterface $filter);
// Right now these are 'FullTextGambits'
public function search(AbstractModel $model, SimpleFlarumSearchInterface $fullTextSearcher);
}
public interface SearchDriverInterface {
// This returns a searcher for a given model, which actually performs the searching/filtering
public function searcher(AbstractModel $model): SearcherInterface;
// This will return an array of specs for the filters that this driver implements.
// This should not be a static list, but rather should be generated as a result of other devs
// adding filter implementations through the driver's extender.
public function implementedFilters(): array[FilterSpec];
// This will be called during resolution, and will provide a list of all existing FilterSpecs to the driver.
public function addAvailableFilter(AbstractModel $model, array[FilterSpec] $filterSpecs);
}
public interface SearcherInterface {
public function search(SearchCriteria $criteria, $limit = null, $offset = 0, array $load = []): SearchResults;
}
md5-03a40be493a9bb378ab94a4a4bd89ead
public interface FilterSpecInterface {
// Regex representing the filter's pattern
public function pattern(): string;
}
This sounds like a great plan @askvortsov1 ; I'm extremely curious about the technical implications once we move gambits to the SimpleSearch. Well done!
After a bit more thought, I realized I'm missing one method in the SearchDriverInterface:
public function supportedModels(): array[AbstractModel];
This should work similarly to implementedFilters, but I think it's important to have the 2 separately, because we might have filters but not a searcher, or vice versa. However, there might be a bit of overlap, which might be unideal.
One other thing to consider: should filter specs go beyond a regex, and provide, say, a description? or non-regex filter types (equals, contains, between, gt, gte, etc)? In essence, I think that regex might be better as an implementation detail than as the actual interface, but I might be wrong
I feel for the API instead of:
filter[q]=testquery tag:testtag author:testauthor
this would be a lot clearer:
filter[q]=testquery&filter[tag]=testtag&filter[author]=testauthor
That would make it behave like the page parameter page[offset]=20&page[near]=20&page[limit]=20
But since it's not an easy change, the better alternative could be to rename filter[q] to either q or filter, which would also clarify things:
filter=testquery tag:testtag author:testauthor
(And maybe even simplify the page parameters to just offset, near, limit to keep it consistent.)
I don't know that we'd want to rename filter[q] to q, as we want to ensure compliance with JSON-API. That being said, I like splitting up AND queries across multiple request params. We could also look into a broader overhaul, something like https://json-api-dotnet.github.io/JsonApiDotNetCore/usage/filtering.html.
That being said, I feel like we should separate discussion about the format of REST API requests, and the underlying search architecture (which is more focused on establishing a driver-based system)
I've spent a bit more time playing around with the system (using elasticsearch as an example, but only as that), and here's a reiteration of the main issues (shocklingly, pretty much exactly what Franz identified above):
So, possible, non-mutually-exclusive approaches to fighting these issues (some of these are quite rough) :
a. Run 2 search systems (Flarum's basic one and elasticsearch/algolia/whatever) at the same time. If a search includes an unsupportable gambit like subscription status or unread, default to the native Flarum driver
b. Implement search drivers by tying gambits to a driver name. So an elasticsearch driver would work by overriding (some of) the default filtration gambits
c. Figure out how to improve Flarum's default search so that it takes discussion titles into account (and works better with CJK characters)
d. Have ElasticSearch return 10,000 or 100,000 search results, and keep all offset / size logic in the flarum portion of the application. This wouldn't be too different from our current method of searching, but could have performance implications (how fast can we filter down by ID if the list of IDs is very large?) Can we maintain order for relevance searches as returned by ElasticSearch?
e. Ensure every gambit is aware of ALL other gambits being applied. Also, allow filtration gambits to pass some kind of metadata to the fulltext gambit so that instead of directly executing the filter, they could prepare some elasticsearch query fragment to be run in the fulltext gambit. The more gambits a driver can replace, the more effectively it'll perform, and the vast majority of gambits are replaceable.
I'm planning to take a better look through the Laravel Scout system and see how they accomplish this (and whether they actually do).
Somewhat concrete proposal:
If we consider typical user behavior, people are unlikely to look beyond the _nth_ page of search results, where n is typically in single digits. Results become less relevant, and people have typically found something that works by then. (cue "The best place to hide a body is the 2nd page of google results" jokes).
Criteria that filter out significant portions of the search space (tags, dates, authors/participants, etc) can all be handled by Elasticsearch As long as we run queries that contain subscription and unread filters through the SQL search, searches we send to ElasticSearch should mostly contain results the user can view. We could further improve that by telling elasticsearch not to include anything from tags/categories the user can't view discussions in, which shouldn't be too performance-intensive, and would further cut down on inaccessible (permission-wise) results returned by ElasticSearch.
With these optimizations in place, if we request (offset + size) * c) results from ElasticSearch (where 2 < c < 10) and proceed as per (d) above, user experience shouldn't be affected in the VAST majority of cases. Furthermore, we might even be able to stick that returned data in redis or something, since if we don't use all of it, it could be recycled when the user next clicks "read more".
It goes without saying that this is an extension-space issue, although we might want to make it bundled (at least at the start) because of how critical an issue it is.
Re-reading this discussion, I feel like we do need to separate filters and full text search into different API endpoints and implementations.
The filter part, like proposed in #2104 , makes a lot of sense to me for retrieving posts with filters, discussions with a particular flag (bookmarked, followed, ...) or users based on filters (enabled, group) for an admin UI. The current approach works well and apart from creating proper extenders and re-usable classes, I don't think we need to change it too much. Changing to query-string based over string-based would be nice.
The full text search really stands on its own. Very few filters matter. You just want relevant results, and fast. Currently there are only two situations where this endpoint would be used: the quick search dropdown that includes discussions and users, and the full discussion search results page. We don't really need to make the search work with anything else by default. Here's there's also potentially the use case of directly hitting a search server like Algolia without going through Flarum backend at all.
My preference/proposal would be as follows:
filters[] query string instead of a string querySo the API could look something like this:
PHP
class FilteredQuery implements ExtenderInterface {
public function __construct(string $filteredQueryClass);
// Equivalent to FlarumSearch::addGambit in the PR
public function addFilter(string $filterClass);
// Equivalent to FlarumSearch::addSearchMutator in the PR
public function addQueryMutator(callable $callback);
}
// This extender's only purpose would be to make the drivers and resources known to the admin panel to create a configuration screen. Everything else will be on the client-side
class Search implements ExtenderInterface {
// Registers a new driver for full text search
public function driver(string $driverClass);
// Registers a new full text search resource that this extension provides (for example, pages or albums)
public function resource(string $name);
}
// This is the extender's for the base search driver provided by Flarum
class SimpleTextSearch implements ExtenderInterface {
// Add a gambit for the default search driver. Those could work very similarly to what we have now, but would be separate from the filtered query filters
public function addGambit(string $gambitClass);
// Allows altering the SQL query of the default search driver, just like we can with filtered queries
public function addQueryMutator(callable $callback);
}
// The server-side search driver class would only serve to build the admin configuration panel. Flarum needs to get a name/description, and know which resources are compatible with this driver. The list of resources isn't hard-coded as an extension could use this search driver's own extenders to add functionality
class AbstractSearchDriver {
abstract function name();
abstract function compatibleResources(): array;
}
JS:
// The extension would be responsible to use this extender to register one class for each resource they are compatible with
class SearchExtender {
addDriver(resource: string, class implements AbstractSearchDriver);
}
// The search driver would work similarly to the current SearchSource objects, but it would also be used for paginated results in the discussions search results
class AbstractSearchDriver {
// This method lets Flarum know which gambits it can offer to this driver
// Flarum would take the list of suggested gambits for the context and compare it against this before suggesting it
abstract supportedGambits(resource: string);
// Performs the search asynchronously. We could wrap the result in another object that gives pagination hints
// The data returned does not need to come from Flarum's API, but must contain models with some minimal attributes/relationships that we could spec out
async abstract search(resource: string, query: string, page: number): Model[];
}
// We could add a new method to the global search state that can be extended by extensions to provide gambit suggestions
// This would replace the current GlobalSearchState.prototype.params used by extensions like Tags
app.search.suggestedGambits(): string[];
In the UI, I suggest making it so gambits get offered under the search field when focused.
So for example you click the search bar, and the following appears:
[ :mag: Search Forum ]
Suggested filters: [Tag General] [User Admin]
Clicking a gambit would add it visually to the search, either in text form or ideally in a fashion similar to the tag picker:
[ :mag: (Tag General) Search Forum ]
Suggested filters: [User Admin]
2 completely separated systems works even better! I like pretty much everything about this proposal. We'll probably have some duplicate code between the filters and SimpleTextSearch gambits, especially towards the start. I think we should just keep it duplicate without any forced abstractions to unite the two: as you mentioned, these are more complex systems and unnecessary complexity doesn't help.
The biggest challenge here will probably need to be implementation, as the PRs would be bulky. Possible split of work:
$driver = SearchDriverManager::getDriver(RESOURCE_NAME)So, that begs the question: what do we need when? PR 1A and 1B I think needs to be included for stable so we can deprecate the old events (preferably in beta 15, instead of #2104, which I think is feasible).
PR 2 would also be really nice to have sooner than later to enable foreign language search. I would also like to have this for stable, but maybe as a beta 16 thing.
PRs 3, 4, and 5 can be done when developers are available, but are not THAT urgent. PR 3 is the most important IMO.
I love the idea of separating fulltext vs filters. This would also allow retrieving private discussions using the fulltext search and then running filtering afterwards. We only need to tackle some basic issues like retrieving more entries than we need etc.
This would also mean we could look at laravel scout again, potentially.. Well established and easy to replace.
I personally don't think that we need to allow search drivers per resource. I think that would over-complicate core as most (if not all) implementation will use one driver.
Changing search, adding an advanced search page and introducing better ux for gambits/filters but I think it doesn't make sense to add that to the scope before stable. For the issues with search that UTF-8 needs fixed, we can possibly go for a simpler, temporary solution?
I personally don't think that we need to allow search drivers per resource. I think that would over-complicate core as most (if not all) implementation will use one driver.
I think it would be simpler to register drivers to resources, not even from the perspective of ALLOWING search per resources as for registering and accessing those drivers in the first place. Since each resource would have different gambits, trying to shoehorn search for ALL resources under one block of logic would be messy IMO.
Most of what I bring up in Phase 1A about custom pages isn't for an "advanced search UI". It's just that we need the index page to hit the filtering API endpoint, and the search page to hit the search API endpoint. This might be possible to integrate into the index page, but I need to play around with it first.
I can't see a solution where there isn't one driver per resource and where extensions can define their own search endpoints.
Unless we standardize a way for extensions to provide the data that must be indexed, but this kind of goes against the flexibility that could be offered by having radically different drivers.
My use case for drivers per resources is as follows:
If we force the forum to use basic for everything since it's the only driver that supports all search types, we will create lots of incompatibilities or just discourage extensions from implementing search in their own data.
We could force selecting the same driver for both user and discussion however since the two will always be used together and can't be disabled.
This would also allow retrieving private discussions using the fulltext search and then running filtering afterwards. We only need to tackle some basic issues like retrieving more entries than we need etc.
This actually wouldn't happen. Retrieving more entries than we need is the exact issue I brought up above. I'm envisioning these as 2 parallel, but maximally detached systems,
I made another attempt at #2453, and came across some concerns:
After some more thinking, I don't see why search and filter couldn't share one endpoint. My main reasoning behind this is that regardless of whether a query is searched or filtered, the frontend expects a consistent format of results. The distinction between which method is used is essentially:
Either method should return an Eloquent collection, which can then be marked as paginated (the actual application of sort, offset, and size happens inside the filtering / searching logic), have related objects / includes added to it, and so on. IMO, https://github.com/flarum/core/blob/master/src/Search/SearchResults.php#L14-L14, which is currently returned by DiscussionSearcher and UserSearcher, is pretty much what we'd want both the SearchDriverInterface and the Filterer to return.
With the filtering / search split, everything that is currently a gambit for the SimpleFlarumSearchDriver (in core and extensions) could/should be supported as a filter for the Filterer. However, the formats in which they are provided are different:
key:value substrings of the filter[q] paramThis means we'd have 2 copies of code for all filters / SimpleFlarumSearch gambits. A few ideas:
FilterInterface and GambitInterface?SearchUtil::filterFromGambit, but I don't like that as it would be magicky and could constrain functionality / formatFilterer for its gambits (kind of the previous option, but better), but that's still hackyAnother concern is preserving backwards compatibility. With the approach I'm taking in #2453, it's completely preserved for the Search logic (since that is completely unchanged), but it seems like it'll be a breaking change for the filter endpoints, as we can't really apply the gambits from the ConfigureXGambit events, since they won't comply with FilterInterface. However, we could set off Searching events in the Filterer basec on the resource being filtered.
Most helpful comment
Re-reading this discussion, I feel like we do need to separate filters and full text search into different API endpoints and implementations.
The filter part, like proposed in #2104 , makes a lot of sense to me for retrieving posts with filters, discussions with a particular flag (bookmarked, followed, ...) or users based on filters (enabled, group) for an admin UI. The current approach works well and apart from creating proper extenders and re-usable classes, I don't think we need to change it too much. Changing to query-string based over string-based would be nice.
The full text search really stands on its own. Very few filters matter. You just want relevant results, and fast. Currently there are only two situations where this endpoint would be used: the quick search dropdown that includes discussions and users, and the full discussion search results page. We don't really need to make the search work with anything else by default. Here's there's also potentially the use case of directly hitting a search server like Algolia without going through Flarum backend at all.
My preference/proposal would be as follows:
filters[]query string instead of a string querySo the API could look something like this:
PHP
JS:
In the UI, I suggest making it so gambits get offered under the search field when focused.
So for example you click the search bar, and the following appears:
[ :mag: Search Forum ]
Suggested filters: [Tag General] [User Admin]
Clicking a gambit would add it visually to the search, either in text form or ideally in a fashion similar to the tag picker:
[ :mag: (Tag General) Search Forum ]
Suggested filters: [User Admin]