Elasticsearch: Avoid calling transport action in InferenceProcessor?

Created on 25 Feb 2020 · 5 Comments · Source: elastic/elasticsearch

In order to perform inference on ingest documents, InferenceProcessor calls out to InternalInferModelAction. My understanding is that the inference is always local to the node -- currently we'll never call out to another node to perform the action.

A potential refactor is to instead pass ModelLoadingService directly to InferenceProcessor.Factory when it's created, and use this service to load then invoke the model. This makes it clear that the inference is performed on the node itself, and also avoids the overhead of calling a transport action.

It looks like InternalInferModelAction performs some license validation, which would likely need to be moved to make sure the inference processor still does this validation.
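Sketched below is roughly what the proposed refactor could look like. Every type here is a simplified stand-in for the real Elasticsearch class of the same (or similar) name, and the method signatures are illustrative, not the actual API:

```java
import java.util.Map;
import java.util.function.Consumer;

// Stand-in for an inference model loaded into memory.
interface Model {
    Map<String, Object> infer(Map<String, Object> fields);
}

// Stand-in for the real ModelLoadingService.
interface ModelLoadingService {
    void getModel(String modelId, Consumer<Model> modelListener);
}

class InferenceProcessorSketch {
    private final ModelLoadingService modelLoadingService;
    private final String modelId;

    // The service would be injected via InferenceProcessor.Factory (which in
    // turn receives it from the plugin), so no transport action is involved.
    InferenceProcessorSketch(ModelLoadingService modelLoadingService, String modelId) {
        this.modelLoadingService = modelLoadingService;
        this.modelId = modelId;
    }

    void execute(Map<String, Object> document, Consumer<Map<String, Object>> onResult) {
        // License validation, currently done inside the transport action, would
        // have to move here (or into the factory) to keep the same checks.
        modelLoadingService.getModel(modelId, model -> onResult.accept(model.infer(document)));
    }
}
```

The key point the sketch makes explicit: the processor invokes the model directly on the local node, with no request/response round trip through the transport layer.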

:ml >refactoring


All 5 comments

Pinging @elastic/ml-core (:ml)

I am not sure we want to do this. It is restrictive: it requires that all models used at ingest be loaded on the ingest node. If a model is sufficiently complex (e.g. BERT), a native process must be loaded and running, and those types of models also need hardware acceleration. This means all ingest nodes would need the appropriate hardware to use such a model. Having a client call allows us to route to a specific subset of nodes if necessary.

As for the technical implementation, the loading service would:

  • still need to make client calls (to load the model)
  • need the built NamedXContentRegistry added to the Processor.Parameters object, which is fairly easy to do

//CC @martijnvg @droberts195

@davidkyle Do you have any thoughts? ^

> I am not sure we want to do this. It is restrictive: it requires that all models used at ingest be loaded on the ingest node. If a model is sufficiently complex (e.g. BERT), a native process must be loaded and running. Those types of models also need hardware acceleration.

My thought behind this suggestion is that it'd be good to take the cleanest possible approach that satisfies the current requirements (performing inference local to a node). It could certainly be changed when the design of 'native inference' is nailed down, if we end up wanting to call out to other nodes with dedicated hardware.

As for the implementation, I think we can just save off the ModelLoadingService that's created in MachineLearning#createComponents and pass it through to the processor factory. We do this in a few places in other plugins. I experimented with this approach when thinking about how to invoke a model in a fetch sub phase: https://github.com/jtibshirani/elasticsearch/pull/7
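The wiring described here is a common Elasticsearch plugin pattern: stash the component created in createComponents and hand the processor factory a lazy supplier, since plugin initialization ordering means the factory may be constructed before the component exists (the real plugin code would typically use Lucene's SetOnce; an AtomicReference stands in here). A simplified, self-contained sketch with illustrative names:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Stand-in for the real ModelLoadingService component.
class LoadedModelService {}

class MachineLearningPluginSketch {
    // The processor factory is given a lazy supplier rather than the service
    // itself, because the service is only set once createComponents() runs.
    private final AtomicReference<LoadedModelService> modelLoadingService = new AtomicReference<>();

    public LoadedModelService createComponents() {
        LoadedModelService service = new LoadedModelService();
        modelLoadingService.set(service);
        return service;
    }

    public Supplier<LoadedModelService> processorFactoryDependency() {
        return modelLoadingService::get;
    }
}
```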

Thanks @droberts195 for linking me to those background documents/discussions! We've been brainstorming how to perform inference at search time and have similar design questions; we'll keep in touch with you both as this progresses.

I prefer calling the internal action. The main benefits I see are:

  • Calling the action automatically creates a task that is tracked by the task manager
  • We are likely to want to do the work on a different thread anyway, and the action does the forking for us
  • Yes, it does add overhead, but my experience writing ML functions in ES is that it is always better to call out: it helps scalability and liveness. I also think the code is more readable; you call out to the action and handle the response some time later
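For contrast with the direct-service sketch, the client-call pattern favored above looks roughly like this. The request, response, listener, and client types are simplified stand-ins, not the real InternalInferModelAction API:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Simplified stand-ins for the real request and response types.
record InferModelRequest(String modelId, List<Map<String, Object>> docs) {}
record InferModelResponse(List<Map<String, Object>> inferenceResults) {}

interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

interface Client {
    // In Elasticsearch, client.execute(...) registers a task with the task
    // manager and forks the work onto an appropriate thread pool; the caller
    // only supplies a listener and handles the response asynchronously.
    void execute(InferModelRequest request, ActionListener<InferModelResponse> listener);
}

class TransportStyleProcessor {
    private final Client client;
    private final String modelId;

    TransportStyleProcessor(Client client, String modelId) {
        this.client = client;
        this.modelId = modelId;
    }

    void execute(Map<String, Object> document,
                 Consumer<Map<String, Object>> onResult,
                 Consumer<Exception> onError) {
        // Call out to the action; some time later, handle the response.
        client.execute(new InferModelRequest(modelId, List.of(document)), new ActionListener<>() {
            @Override
            public void onResponse(InferModelResponse response) {
                onResult.accept(response.inferenceResults().get(0));
            }

            @Override
            public void onFailure(Exception e) {
                onError.accept(e);
            }
        });
    }
}
```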