Describe the feature
There are two cases where we need to improve the categorization UX:
For both scenarios, we want to display a warning to the user. The button in this warning callout lets the user update the ML job setup.

Title: [dataset.name] does not provide enough training data
Message: Longer periods of time will improve the categorization results for [dataset.name]. Update the configuration to improve your results. Learn more
Button: Update configuration -> Links to setup screen

Title: Multiple datasets do not provide enough training data
Message: We have too little training data for the following datasets: [dataset.name], [dataset.name]. Longer periods of time will improve the categorization results. Learn more
Button: Update configuration -> Links to setup screen

Title: [dataset.name] does not provide data for meaningful categorization
Message: Because of the structure of the log messages in [dataset.name], they cannot be categorized in a meaningful way. Update your job configuration to improve the results. Learn more.
Button: Update configuration -> Links to setup screen

Title: Multiple datasets do not provide data for meaningful categorization
Message: Because of the structure of the log messages in [dataset.name], they cannot be categorized in a meaningful way. Update your job configuration to improve the results. Learn more.
Button: Update configuration -> Links to setup screen

Title: Multiple datasets don't provide data for meaningful categorization or provide too little training data
Message: Because of the structure of the log messages in [dataset.name], [dataset.name] and [dataset.name], they cannot be categorized in a meaningful way, or there is too little training data. Learn more.
Button: Update configuration -> Links to setup screen
-> The "Learn more" links should point to a docs page. @mukeshelastic, would you please provide the link?

The changes in the setup screen affect the index selection. With the new version it should be possible to select/deselect an index as a whole, but also the individual datasets within it.
The default state will not change. When a user first enters the setup all indices and their datasets are selected.
The warning message should be the same as described above for the categorization view. If there is too little training data for a dataset, the warning message appears.
Additionally, the alert icon shows which of the indices/datasets has problems (see screenshot above). Hovering the icon shows a tooltip explaining the warning.
Too little training data
One or more datasets in this index do not provide enough training data.
Data not suitable
One or more datasets in this index cannot be categorized in a meaningful way.
Too little training data
The dataset does not provide enough training data.
Data not suitable
The data in this dataset cannot be categorized in a meaningful way.
Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)
Thank you for providing so many details, this looks great. A few thoughts come to mind:
Data quality criteria: It would probably be good to write down the specific criteria we want to use to determine whether there is "too little training data" or whether a dataset is "not suitable for categorization". In both cases I assume it would be evaluations of the count or cardinality?
- `count_docs(dataset) < min_training_count` → "too little training data"?
- `count_categories(index, dataset) > max_category_count` → "not suitable for categorization"?

@sophiec20, can you provide any guidance on such quality criteria? IIRC you were considering emitting such warnings while running the ML jobs? Can we access these?
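The two pseudocode checks above could look like the following sketch. The threshold values, the `DatasetStats` shape, and the warning names are entirely hypothetical placeholders, not an agreed-upon design:

```typescript
// Hypothetical data-quality evaluation for a single dataset.
interface DatasetStats {
  name: string;
  docCount: number;      // count_docs(dataset)
  categoryCount: number; // count_categories(index, dataset)
}

const MIN_TRAINING_COUNT = 1000; // placeholder threshold
const MAX_CATEGORY_RATIO = 0.5;  // placeholder: categories vs. documents

type QualityWarning = 'too_little_training_data' | 'not_suitable_for_categorization';

function evaluateDataset(stats: DatasetStats): QualityWarning[] {
  const warnings: QualityWarning[] = [];
  if (stats.docCount < MIN_TRAINING_COUNT) {
    warnings.push('too_little_training_data');
  }
  if (stats.categoryCount > stats.docCount * MAX_CATEGORY_RATIO) {
    warnings.push('not_suitable_for_categorization');
  }
  return warnings;
}
```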
Well-known datasets: And what about the well-known filebeat module datasets, which we already know to be unsuitable? Do we want to hard-code a warning list for those?
Combination of warnings: From the UI perspective I wonder if the combined case "Too little training data and not suitable" should be displayed as two separate warnings? Otherwise the user might not be able to tell which is which and the combinatorial complexity in the implementation grows - especially if we possibly add more warnings in the future.
We discussed this and decided to have a single callout box but adjust the text accordingly. There is also a special case where one dataset has both problems. I think if it does not provide useful data for categorization we should not even mention too little training data - it won't be useful no matter how much data we have.
@mukeshelastic will help with the wording (so the text in the issue description is likely to change).
@katrin-freihofner @weltenwort when we detect a lack of sufficient training data, we lack confidence in the displayed anomaly score. I wonder whether the appropriate user feedback is to:
1. Show N/A or something similar in the anomaly score column for each category of the dataset where we detect this case.
2. Show a warning message at the top, exactly as Katrin suggested, but tweak the message to communicate the lack of confidence in the anomaly score and hence it not being displayed in the anomaly score column for the detected datasets.
Specifies which field will be categorized. Using text data types is recommended. Categorization works best on machine written log messages, typically logging written by a developer for the purpose of system troubleshooting.
In ML, we have the helper text (above) which is aimed at helping users understand what categorization is designed for. It would be good to align on this if possible - the final sentence anyway.
// too little training data
Anomaly detection learns from training data. The probability of anomalies has already been adjusted according to the amount of training data seen, so I advise against the Logs UI picking an arbitrary value that defines whether enough training data has been seen. It depends on the data.
It is not the case that we lack confidence in displaying the anomaly score because the model has already built this in.
The proposal above links to the Update Configuration page. I would have thought that the answer is usually to wait a bit longer, rather than to update the configuration. Perhaps I am missing something here, but it seems to me that this check can be avoided.
// not suitable for categorization
In 7.7 we now have the following categorization stats. A categorization_status can be warn or ok. If you query this for a running job, then this is our indicator of whether the data is suitable for categorization. https://github.com/elastic/elasticsearch/pull/51879
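Reading that status from the job stats endpoint (GET _ml/anomaly_detectors/&lt;job_id&gt;/_stats) could look roughly like the sketch below. The response shape with categorization_status under model_size_stats is assumed from the linked PR, so verify it against the actual API for your version:

```typescript
// Minimal sketch: pick out jobs whose categorization_status is `warn`
// from an ML job stats response.
interface JobStatsResponse {
  jobs: Array<{
    job_id: string;
    model_size_stats?: { categorization_status?: 'ok' | 'warn' };
  }>;
}

function jobsWithCategorizationWarning(stats: JobStatsResponse): string[] {
  return stats.jobs
    .filter((job) => job.model_size_stats?.categorization_status === 'warn')
    .map((job) => job.job_id);
}
```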
Unfortunately, because categorization is not yet done on a per-partition basis, this status is also not yet partition aware; it gives a view of the overall job. It will be set to warn by a single dataset that is not suited to categorisation, even if there are many other datasets that are all categorizing nicely.
In 7.6 we had a basic log category check, which would raise an ML job message if 1000 or more categories existed for a job before 100 buckets of results had been created. Because the Logs UI job is partitioned and has model_plot enabled, I believe it is possible to query this per partition, and that this will be a good indicator of a dataset with a message field that does not categorize well.
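A per-partition category count could be approximated with an aggregation over the ML results index, along the lines of the sketch below. The index pattern and the field names (result_type, partition_field_value, category_id) follow the general ML results schema, but treat them as assumptions to verify, since per-partition categorization was not yet available at this point:

```typescript
// Sketch: build a search body that counts distinct categories per partition
// (i.e. per dataset) for a given ML job. Intended for the .ml-anomalies-*
// results indices; field names are assumptions to verify.
function buildPerPartitionCategoryCountQuery(jobId: string) {
  return {
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { job_id: jobId } },
          { term: { result_type: 'category_definition' } },
        ],
      },
    },
    aggs: {
      by_partition: {
        terms: { field: 'partition_field_value', size: 1000 },
        aggs: {
          category_count: { cardinality: { field: 'category_id' } },
        },
      },
    },
  };
}
```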
// well-known datasets
In the end, we did not add this into ML. We did not feel that the business logic ought to be written into the back-end APIs. However I would still think that it has value in the Logs UI application which already has logic in-built to handle different dataset types. This hard-coded list could be extended over time and based on telemetry. It would help with the web access log data which I suspect might be used with categorization but is actually structured data.
// combination of warnings
I do not believe that the "too little data" message should be a warning.
One thing I forgot about: we do have a job validation check in the ML UI which pertains to too little data. If there are fewer than 25 buckets or 2 hours of data (whichever is greater), then we warn prior to job creation that there is too little data for the model to be initialized, and therefore no anomalies will be written until sufficient data has been seen. Meaning, there is no historical data to analyse and you'll have to wait for real-time data to accumulate until you start to see anomalies. I assumed the comments above were about the early lifetime of the job, which comes after the model initialisation, but that may not have been the case.
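The "25 buckets or 2 hours, whichever is greater" rule described above can be written down as a small helper. This is only an illustration of the stated rule, not the actual ML UI validation code:

```typescript
// Warn when the available data spans less than 25 buckets or 2 hours,
// whichever is greater (both values in milliseconds).
const MS_PER_HOUR = 60 * 60 * 1000;

function hasTooLittleData(dataSpanMs: number, bucketSpanMs: number): boolean {
  const minimumSpanMs = Math.max(25 * bucketSpanMs, 2 * MS_PER_HOUR);
  return dataSpanMs < minimumSpanMs;
}
```

For a 15-minute bucket span the 25-bucket rule dominates (375 minutes); for a 1-minute bucket span the 2-hour floor dominates.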
Thank you for the detailed response, @sophiec20!
With the awesome new model stats it sounds like we could do something like the following:
If the categorization_status stat is warn, perform per-partition queries to determine which partitions likely cause the high rare-category count or a high category count with respect to the overall count. Does that make sense?
I think the idea behind warning about "too little data" would be to indicate that some datasets might never have enough documents for training due to their rare occurrence. But maybe that's not useful enough to justify confusing the user with that detail?
@weltenwort the steps 1-4 above sound good. In addition, due to categorization_status: warn not yet being partition aware, I'd suggest a 4b which would be to provide a useful message if the status was set to warn but all partitions looked good according to the basic count check.
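The suggested step 4b could be sketched as follows: if the overall job status is warn but no individual partition fails the basic count check, fall back to a generic message instead of blaming a specific dataset. The threshold and the message strings are hypothetical:

```typescript
// Sketch of step 4b: reconcile the job-wide warn status with
// per-partition count checks. The 0.5 category-to-document ratio
// is a placeholder threshold.
interface PartitionCheck {
  dataset: string;
  categoryCount: number;
  docCount: number;
}

function describeWarning(
  overallStatus: 'ok' | 'warn',
  partitions: PartitionCheck[]
): string | null {
  if (overallStatus !== 'warn') return null;
  const suspects = partitions.filter((p) => p.categoryCount > p.docCount * 0.5);
  if (suspects.length > 0) {
    return `Datasets not suited for categorization: ${suspects
      .map((p) => p.dataset)
      .join(', ')}`;
  }
  // warn overall, but no partition stands out: show a generic message
  return 'The job reported a categorization warning, but no single dataset could be identified as the cause.';
}
```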
There are other reasons that can indicate that the messages are not suited for categorization:
Categorization is detecting a distribution of categories
that suggests the input data is inappropriate for categorization.
Problems could be that there is only one category, more than 90% of
categories are rare, the number of categories is greater than 50% of
the number of categorized documents, there are no frequently
matched categories, or more than 50% of categories are dead.
These cannot be assessed using Elasticsearch queries, as they are metrics captured as we model. The ML UI categorization wizard does do some pre-flight data validations using _analyze. These seem to me to be too big a lift to include in the Logs UI onboarding workflow, but I wanted to share for visibility. https://github.com/elastic/kibana/pull/60502
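For illustration, the warn conditions quoted above can be expressed as a single predicate. This only mirrors the documented criteria; the real evaluation happens inside the ML model and, as noted, cannot be reproduced with Elasticsearch queries. The exact definitions of "rare", "frequent", and "dead" categories are assumptions here:

```typescript
// The documented warn conditions as a predicate (illustration only).
interface CategoryStats {
  totalCategories: number;
  rareCategories: number;     // categories matched only rarely (assumed definition)
  frequentCategories: number; // frequently matched categories (assumed definition)
  deadCategories: number;     // categories shadowed by earlier ones (assumed definition)
  categorizedDocs: number;
}

function isInappropriateForCategorization(s: CategoryStats): boolean {
  return (
    s.totalCategories === 1 ||                       // only one category
    s.rareCategories > 0.9 * s.totalCategories ||    // >90% of categories are rare
    s.totalCategories > 0.5 * s.categorizedDocs ||   // categories > 50% of documents
    s.frequentCategories === 0 ||                    // no frequently matched categories
    s.deadCategories > 0.5 * s.totalCategories       // >50% of categories are dead
  );
}
```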
For 7.7, so our end-users can get the most benefit from categorizing data that is categorize-able, I think a pragmatic approach would be to show the helper text ("Categorization works best on machine written log messages, typically logging written by a developer for the purpose of system troubleshooting."), surface the warn status, and allow users to use their judgement to exclude datasets.
For beyond 7.7, there are options for a smoother experience from the ML side, such as making categorization_status partition aware, adding some self-correcting logic in the job to exclude partitions that are not suited, or providing a data validation endpoint. There are also options from the Logs UI side, such as allowing multiple categorization jobs, which would potentially give better results when looking at datasets with very different data rates, especially if the logs do not belong to related systems, or incorporating something similar to the current ML categorization wizard checks. From the ML side, we will have these discussions soon.
We split this issue into many sub-issues.
Done: https://github.com/elastic/kibana/issues/60385
To be completed: https://github.com/elastic/kibana/issues/60392, https://github.com/elastic/kibana/issues/60390
So I am closing this issue.