Kibana: [Fleet] Proposal: Store installed packages in cluster

Created on 20 Oct 2020 · 18 comments · Source: elastic/kibana

The package manager was built on the assumption that the registry and its packages are always available. The current implementation uses a local in-memory cache for package contents; whenever a package is missing from this cache it is re-fetched from the registry. Over the last two releases, quite a few issues have shown up where it became a problem that packages are only available from the registry:

  • Packages are removed from the registry: In general, this should not happen in production, but it is common on the snapshot registry, where packages are not always stable. Removing a package puts Fleet into a state where it can't pull the assets again.
  • Registry not available: Luckily, this has not really happened yet, but for users running a local registry or running Fleet on premises it becomes more likely. At some point our registry will also have downtime. If that happens, Fleet must stay operational, meaning it should still be possible to create new policies, though not necessarily to install new packages.
  • Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.
  • Package installation by direct upload: We are working on supporting the upload of packages directly to Kibana. Currently these packages are only cached in memory and not stored anywhere, which means they are lost on restart.
  • Memory usage: As all packages are currently stored in memory, this adds to the memory usage of Kibana.
  • Changing registry: A user tests the staging registry and then switches to the production registry. An installed package might no longer be available because it only existed on staging. The same would happen if we supported multiple registries in the future and one of them disappeared.

To solve all the above problems, I'm proposing to not only cache packages in memory, but also store them in a dedicated ES index. This also unifies how packages are handled, whether they are uploaded as a zip, come from the registry, or are added through any other mechanism. Below is an image to visualise this:

image

One important detail here: packages that are only browsed from the registry, without being installed, should not be downloaded, see https://github.com/elastic/kibana/issues/76261. Uploaded packages are always installed.

How exactly packages and assets are stored in Elasticsearch should be a follow-up discussion if we decide to move forward with this.
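
To make the intended flow concrete, here is a rough sketch, not the detailed design. All names (PackageArchive, PackageStore, getPackage, etc.) are hypothetical and not existing Fleet types or APIs; it only illustrates the fallback order cache → storage index → registry:

// All names here are hypothetical, not existing Fleet types or APIs.
interface PackageArchive {
  name: string;
  version: string;
  files: Map<string, Buffer>;
}

interface PackageStore {
  get(name: string, version: string): Promise<PackageArchive | undefined>;
  put(archive: PackageArchive): Promise<void>;
}

interface Registry {
  fetch(name: string, version: string): Promise<PackageArchive>;
}

const memoryCache = new Map<string, PackageArchive>();

// Called when a package is installed or its assets are needed later.
// Browsing the registry does not go through this path, so nothing is
// downloaded or persisted just for listing packages.
async function getPackage(
  name: string,
  version: string,
  storage: PackageStore,
  registry: Registry
): Promise<PackageArchive> {
  const key = `${name}-${version}`;
  const cached = memoryCache.get(key);
  if (cached) return cached;

  // Installed and uploaded packages live in the dedicated ES index, so this
  // succeeds even when the registry is unreachable or the package was removed.
  const stored = await storage.get(name, version);
  if (stored) {
    memoryCache.set(key, stored);
    return stored;
  }

  // Only a package that was never installed still requires the registry.
  const fetched = await registry.fetch(name, version);
  await storage.put(fetched); // write-through so a Kibana restart can recover it
  memoryCache.set(key, fetched);
  return fetched;
}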

Decision: Agreed to move forward. Follow up discussion ticket: https://github.com/elastic/kibana/issues/83426

Links

Ingest Management v7.11.0

Most helpful comment

When a package is uninstalled from the system, I'd propose it will be removed from the storage index as well.

That way the storage index doesn't silently turn into a secondary installation source that we need to check during package listings and installations.

All 18 comments

Pinging @elastic/ingest-management (Team:Ingest Management)

I would clarify the use of the word 'local' in the initial description:

  • "The current implementation uses a local cache and pulls down the package again if any files are missing or a package does not exist." -- The current implementation uses a local in-memory cache [...]. If Kibana runs clustered, every Kibana instance has its own in-memory cache.

  • "store locally" -- I assume what is meant is to store in Elasticsearch in a dedicated index? As Kibana can be clustered, local file storage should not be used.

@skh Spot on. Can you directly update the issue?

In addition, I see two ways to store packages in ES:

  • as the zip file, in a binary field, which may be large (see also https://github.com/elastic/dev/issues/1544)
  • every file from the unpacked zip file in a separate document, containing

    • package name

    • package version

    • package source (upload or registry)

    • file type (e.g. screenshot, icon, field definitions, ingest pipeline, ...)

    • file path (because the folder structure in the package carries meaning, e.g. which fields.yml belongs to which data stream)

    • file content as binary field

The zip file would always need to be downloaded and unpacked as a whole.

Single files could be queried by file type or path so that we can access single assets more quickly, but there may be many of them in some packages.
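
For illustration, the per-file option could be mapped roughly like this. This is a sketch only: the index name, field names, and the assumption of the @elastic/elasticsearch v7 client API are all placeholders, not a decided schema.

// Illustrative mapping for the "one document per file" option.
import { Client } from '@elastic/elasticsearch';

async function createPackageAssetIndex(es: Client) {
  await es.indices.create({
    index: '.fleet-package-assets', // hypothetical index name
    body: {
      mappings: {
        properties: {
          package_name:    { type: 'keyword' },
          package_version: { type: 'keyword' },
          install_source:  { type: 'keyword' }, // 'registry' or 'upload'
          asset_type:      { type: 'keyword' }, // e.g. 'icon', 'fields', 'ingest-pipeline'
          asset_path:      { type: 'keyword' }, // folder structure carries meaning
          media_type:      { type: 'keyword' },
          data_base64:     { type: 'binary' },  // file content, base64 encoded
        },
      },
    },
  });
}

Fetching a single asset then becomes a simple term query on package_name, package_version, and asset_path, which is what makes this option attractive compared to one large zip blob.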

I like the idea of storing each file as a document instead of the zip file. It not only allows us to query as you mentioned, but also lets us exclude certain files from storage and add metadata to each one, like a hash, date modified / date installed, etc.

You basically have a VFS over Elasticsearch. I like the idea @skh. One thing to consider for a future improvement/release is signing of the package / files. I am not sure of the level of risk here, but it could be possible for a user to update a file via a document.

👍 to the problem description and proposal. Two things that come to mind are:

Number of assets

I'm curious about the difference between making 10, 100, etc. requests to ES vs. serving them from memory. The best case scenario (few assets & a fast connection to ES) might not be noticeably affected, but the more assets or the greater the latency to ES, the slower things will feel. One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values there on their way into ES. That way we keep the durability of ES but still avoid the latency issues.
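
As a sketch of that idea, using a hand-rolled, size-capped cache rather than committing to any particular library: assets are put into the cache on their way into ES, and reads only hit ES on a miss.

// Minimal LRU sketch (no particular library implied): a size-capped map in
// front of ES, filled as assets are written to or read from the index.
class AssetLru {
  private entries = new Map<string, Buffer>();
  constructor(private readonly maxEntries = 200) {}

  get(key: string): Buffer | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert so this key becomes the most recently used entry.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: Buffer): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map keeps insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }
}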

Dealing with binary assets (images)

We'll have to base64 encode any binary asset. That adds about 30% to their file size. There's also a CPU cost of decoding them. Again, storing the decoded Buffer in the memory cache would help.
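
For reference, base64 encodes every 3 bytes as 4 ASCII characters, which is where the roughly one-third size increase comes from:

// Base64 turns 3 bytes into 4 characters, so encoded size is ~1.33x the original.
const original = Buffer.alloc(300 * 1024);       // stand-in for a 300 KB screenshot
const encoded = original.toString('base64');     // ~400 KB of text stored in ES
console.log(encoded.length / original.length);   // ≈ 1.33
const decoded = Buffer.from(encoded, 'base64');  // decode cost is paid on every uncached read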

Here's a quick check of the image file sizes. Remember these will be about 30% larger after base64 encoding:

image file sizes

du -h -d1 */*/img/*
396K    aws/0.2.5/img/filebeat-aws-cloudtrail.png
1.1M    aws/0.2.5/img/filebeat-aws-elb-overview.png
188K    aws/0.2.5/img/filebeat-aws-s3access-overview.png
2.5M    aws/0.2.5/img/filebeat-aws-vpcflow-overview.png
8.0K    aws/0.2.5/img/logo_aws.svg
384K    aws/0.2.5/img/metricbeat-aws-billing-overview.png
260K    aws/0.2.5/img/metricbeat-aws-dynamodb-overview.png
1.1M    aws/0.2.5/img/metricbeat-aws-ebs-overview.png
600K    aws/0.2.5/img/metricbeat-aws-ec2-overview.png
600K    aws/0.2.5/img/metricbeat-aws-elb-overview.png
496K    aws/0.2.5/img/metricbeat-aws-lambda-overview.png
792K    aws/0.2.5/img/metricbeat-aws-overview.png
640K    aws/0.2.5/img/metricbeat-aws-rds-overview.png
332K    aws/0.2.5/img/metricbeat-aws-s3-overview.png
720K    aws/0.2.5/img/metricbeat-aws-sns-overview.png
348K    aws/0.2.5/img/metricbeat-aws-sqs-overview.png
560K    aws/0.2.5/img/metricbeat-aws-usage-overview.png
396K    aws/0.2.7/img/filebeat-aws-cloudtrail.png
1.1M    aws/0.2.7/img/filebeat-aws-elb-overview.png
188K    aws/0.2.7/img/filebeat-aws-s3access-overview.png
2.5M    aws/0.2.7/img/filebeat-aws-vpcflow-overview.png
8.0K    aws/0.2.7/img/logo_aws.svg
384K    aws/0.2.7/img/metricbeat-aws-billing-overview.png
260K    aws/0.2.7/img/metricbeat-aws-dynamodb-overview.png
1.1M    aws/0.2.7/img/metricbeat-aws-ebs-overview.png
600K    aws/0.2.7/img/metricbeat-aws-ec2-overview.png
600K    aws/0.2.7/img/metricbeat-aws-elb-overview.png
496K    aws/0.2.7/img/metricbeat-aws-lambda-overview.png
792K    aws/0.2.7/img/metricbeat-aws-overview.png
640K    aws/0.2.7/img/metricbeat-aws-rds-overview.png
332K    aws/0.2.7/img/metricbeat-aws-s3-overview.png
720K    aws/0.2.7/img/metricbeat-aws-sns-overview.png
348K    aws/0.2.7/img/metricbeat-aws-sqs-overview.png
560K    aws/0.2.7/img/metricbeat-aws-usage-overview.png
8.0K    checkpoint/0.1.0/img/checkpoint-logo.svg
4.0K    cisco/0.3.0/img/cisco.svg
796K    cisco/0.3.0/img/kibana-cisco-asa.png
 12K    crowdstrike/0.1.2/img/logo-integrations-crowdstrike.svg
392K    crowdstrike/0.1.2/img/siem-alerts-cs.jpg
512K    crowdstrike/0.1.2/img/siem-events-cs.jpg
4.0K    endpoint/0.14.0/img/security-logo-color-64px.svg
4.0K    endpoint/0.15.0/img/security-logo-color-64px.svg
4.0K    fortinet/0.1.0/img/fortinet-logo.svg
4.0K    microsoft/0.1.0/img/logo.svg
424K    o365/0.1.0/img/filebeat-o365-audit.png
296K    o365/0.1.0/img/filebeat-o365-azure-permissions.png
 16K    o365/0.1.0/img/logo-integrations-microsoft-365.svg
436K    okta/0.1.0/img/filebeat-okta-dashboard.png
4.0K    okta/0.1.0/img/okta-logo.svg
476K    panw/0.1.0/img/filebeat-panw-threat.png
1.5M    panw/0.1.0/img/filebeat-panw-traffic.png
 12K    panw/0.1.0/img/logo-integrations-paloalto-networks.svg

package-storage image sizes in KB

To clarify, I'm saying we would still put assets in ES, but use a cache to store ready-to-serve values to avoid hitting ES and doing any unnecessary work. We could add TTL or any other logic to decide when to use or invalidate cache entries.

One option is to keep the memory cache (changing to an LRU or something less naive than now) and add values there on their way into ES. That way we keep the durability of ES but still avoid the latency issues.

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

Would it be an option to keep the in-memory cache, but purge some files, like ES and Kibana assets from it regularly, while keeping others, like images, for longer?

Definitely. That's what I was getting at with

We could add TTL or any other logic to decide when to use or invalidate cache entries.

We'll have to define the rules and then see if there's an existing package that does what we want out of the box or if we need to wrap one with some code to manage it.

Seems like we want support for both TTL (different by asset class) _and_ a max memory size for the cache.

https://github.com/isaacs/node-lru-cache is an existing dependency and my go-to, but it doesn't support per-entry TTL. I think we'd have to create multiple caches to get different expiration policies.
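
A sketch of the multiple-caches workaround, assuming the lru-cache API of that era (a cache-wide max plus maxAge in milliseconds); the asset classes and numbers are made up:

import LRU from 'lru-cache';

// One cache per asset class so each class gets its own expiration policy.
// Option names assume the pre-v7 lru-cache API (max / maxAge in milliseconds).
const imageCache = new LRU<string, Buffer>({ max: 100, maxAge: 60 * 60 * 1000 }); // keep images around longer
const assetCache = new LRU<string, Buffer>({ max: 500, maxAge: 5 * 60 * 1000 });  // purge ES/Kibana assets sooner

function cacheFor(assetType: string): LRU<string, Buffer> {
  return assetType === 'icon' || assetType === 'screenshot' ? imageCache : assetCache;
}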

I did some searching and both https://github.com/node-cache/node-cache & https://github.com/thi-ng/umbrella/tree/develop/packages/cache seem like they'd work for this case

Before we add a cache, we should first test whether we really need it. A cache will speed things up but also make things more complicated.

Quite a few of the large assets are images and are only used in the browser. I assume the browser cache will also help us here, so each user only loads them once?

I agree we should profile. The additional work/complexity is low so we can add it later.

The browser cache will also need some work (setting headers) but we can look at that when profiling.
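
For the browser-cache part, a rough sketch of what that header work could look like in a Kibana route handler. The route path, content type, and the getAssetFromStorageIndex helper are all made up for illustration; router is assumed to be a Kibana core IRouter.

// Illustrative only: serve a package image with a cache-control header so the
// browser downloads it once per user instead of on every page view.
router.get(
  { path: '/api/fleet/epm/packages/{pkgkey}/{assetPath*}', validate: false },
  async (context, request, response) => {
    const image = await getAssetFromStorageIndex(request.params); // hypothetical helper returning a Buffer
    return response.ok({
      body: image,
      headers: {
        'content-type': 'image/png',
        'cache-control': 'public, max-age=86400',
      },
    });
  }
);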

When a package is uninstalled from the system, I'd propose it will be removed from the storage index as well.

That way the storage index doesn't silently turn into a secondary installation source that we need to check during package listings and installations.

Upgrade rollback: When an upgrade of a package fails, it is rolled back. If the older version of the package no longer exists, Fleet ends up stuck between two package versions.

I'm not sure at what point during the package installation process we want to update the storage index, but if possible, it seems easiest to add a package to the storage index only after installation has successfully completed. Then during a rollback we could use the storage index if the previous version is not available in the registry. If we update the storage index as we are installing, this probably won't be possible.
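
A sketch of that ordering, with all function names hypothetical: the storage index is only written after a successful install, and rollback prefers the registry but can fall back to the index.

// Hypothetical flow: persist to the storage index only once installation of
// all assets has completed, so a half-installed version never becomes a
// rollback source.
async function upgradePackage(name: string, newVersion: string, previousVersion: string) {
  const archive = await fetchFromRegistry(name, newVersion);
  try {
    await installAssets(archive);       // templates, ingest pipelines, Kibana saved objects, ...
    await packageStorage.put(archive);  // only now does it appear in the storage index
  } catch (error) {
    // Roll back: try the registry first, then the storage index written by the
    // earlier successful installation of the previous version.
    const previous =
      (await fetchFromRegistryIfAvailable(name, previousVersion)) ??
      (await packageStorage.get(name, previousVersion));
    if (previous) {
      await installAssets(previous);
    }
    throw error;
  }
}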

I agree with @ruflin here: adding a cache seems great, but it adds a level of complexity. +1 to @jfsiii's suggestion to add it later.

I just want to highlight that a) we already use a cache and b) the proposal specifically mentions:

To solve all the above problems, I'm proposing to not only cache the packages in memory, but also store them in a dedicated ES index.

I don't want to pull us into the weeds re: caching. We can discuss it in the implementation ticket(s). Just highlighting that it's not an alteration to the proposal.

Closing since we agreed on the proposal and are discussing further in https://github.com/elastic/kibana/issues/83426

@jfsiii Can you share what the final proposal is that was agreed on? What I put here is more of a high-level proposal, and I hoped the questions around storage etc. (which are also mentioned in https://github.com/elastic/kibana/issues/83426) would be answered in a detailed proposal.
