Libelektra: Observation/Question about Split

Created on 20 Sep 2020 · 16Comments · Source: ElektraInitiative/libelektra

NOTE: I am still not entirely sure, what the exact purpose of the struct _Split (hereafter Split) is. My understanding is, that it represents the tree of backends and is used to determine 1) which plugin to use for loading/storing a backend 2) which key belongs to which backend. Everything below is based on this assumption.

While working on #3447 and also thinking about #2969, I had the idea that now might be a good moment to rethink the implementation of Split. After a while, I came to the conclusion that the whole struct could be replaced by a single KeySet with binary-valued Keys. If this indeed possible, it would likely simplify many parts of the codebase and may even result in performance gains.

The idea is as follows:

Once #2969 is merged, a backend basically consists of a parent key /some/parent (should also have a namespace) and an instance of the backend plugin, i.e. a Plugin * someBackend.
Let splitKs be the KeySet used to replace Split.
For every backend we insert the parent key into the splitKs and set its value to the Plugin *. Now we have a KeySet with lots of mappings /some/parent -> Plugin * someBackend.
We also insert the key / (or after #3447 maybe default:/) with the a pointer to the default backend into splitKs.

The two main operations of for splits would then work as follows:

Finding the Plugin * and data (i.e. the KeySet) for backend given its parent key and the KeySet of the full KDB:
- The Plugin * can be found by simply doing a ksLookup followed by a ksValue on splitKs.
- The data KeySet is extracted from the full KDB KeySet via a simple ksCut (or possibly better a new non-destructive alternative as described in #2992).
Finding the parent key of the corresponding backend, given an arbitrary key K:
- AFAIK, there is currently no function for this. However, the operation is essentially identical to inserting the arbitrary key K into splitKs. But instead of actually inserting the key, we just find the correct position. Then the key prior to this position is the corresponding parent key.

The sync operation (AFAIK it just checks the size of the various KeySets in Split) would be a simplified ksCut that doesn't actually cut anything, but instead just returns the size of the "cut".

If there are other important operations the Split is used for, please mention them below. So we can investigate further. This solution would certainly get rid of the problems Split caused in #2969 and also simplify how cache deals with the Split (since it would be just another KeySet).

@markus2330 @mpranj @vLesk

enhancement question

Source

kodebach

❤1

Most helpful comment

But the whole point of the global KeySet is that there is only one instance, globally.

I know, and it makes sense for global plugins. But if non-global plugins (e.g. the toml and type) on a certain mountpoint want to communicate, the global KeySet is not very useful. I'm not saying there is actually a concrete use-case right now, but we should think about adding something before 1.0 or at least make sure we can do it in a non-breaking way post-1.0

kodebach on 27 Sep 2020

👍2

All 16 comments

Thank you very much, I highly appreciate your ideas about simplifying Elektra! :sparkling_heart:

1) which plugin to use for loading/storing a backend 2) which key belongs to which backend.

Exactly. It splits up the keys in a way that the individual backends can be called. It is basically a "pre-computation" of the parentKeys and individual KeySets for each Backend.

Details can be found in https://www.libelektra.org/ftp/elektra/thesis.pdf

The data KeySet is extracted from the full KDB KeySet via a simple ksCut

This is actually how the Split is created. The data structure is basically there to avoid the need of ksCut several times and that information can be passed along during the different function invocations. (Basically only for structured programming so that there is not one huge function doing everything.)

If there are other important operations the Split is used for

The parents might transfer data from plugin to plugin and in particular the value is used for the file name. But in your idea the value would already be the Plugin pointer?
The size/syncbits are needed for the optimizations that kdbGet()/kdbGet() is a NOP if no file was changed and to not write config which is not changed. How would this be in your data structure?

markus2330 on 20 Sep 2020

The parents might transfer data from plugin to plugin and in particular the value is used for the file name.

The simplest solution would be to use a different Key instance for the parent key containing the Plugin * and the one that exists now. (Stored in separate KeySets) Another solution would be to store the file name along this other data in the metadata of the parent key. (Seems appropriate to me that the file name is metadata)

The size/syncbits are needed for the optimizations that kdbGet()/kdbGet() is a NOP if no file was changed and to not write config which is not changed. How would this be in your data structure?

IMO both of these optimizations should be internal to the backend plugin (once #2969 is done). If that is not an option, this could again be stored as metadata of the parent keys.

We could also wrap the Plugin * pointing to an instance of the backend plugin in a new struct Backend (different from the existing one) that contains all the information associated with the backend.

This is actually how the Split is created.

There is not a single reference to ksCut within split.c. As far as I can tell, the KeySet containing the whole KDB is iterated over and for each key first the Trie is used to find the Backend then there is linear search within the Split for the index of the Backend and finally the key is added to the right KeySet with ksAppendKey and the syncbit is updated. This function (splitDivide) is also called every single time we call kdbSet.

The data structure is basically there to avoid the need of ksCut several times

AFAIK ksCut is pretty fast and it should only be needed once per backend per kdbGet and kdbSet. This would mean that we allocate an additional pointer to every single key in the KDB plus some overhead for the KeySets, just to avoid a handful of ksCut.

There are also some functions (e.g. splitPrepare) that use ksDeepDup, so I am think Split might actually hurt performance more than it helps. On the kdbGet side Split might be more useful, but since we have a cache now, I think we can sacrifice some performance there, if it helps somewhere else.

The least we could do, would be to create a struct

struct _FooBar {
    const char * filename;
    Backend * backend; // Plugin * backend; with #2969
    splitflag_t syncbits;
    KeySet * keyset;
}

and store those in a KeySet under the relevant parent key. This would have the same information as the current Split, but since it is stored as a KeySet it can be queried as such and we don't have to deal with Tries, indexes and linear searches.

For the calls to the kdbGet and kdbSet methods of plugins, a Key could be created with the file name as its value and then passed to the various plugins of a backend. If we do actually need to keep this Key between kdbGet and kdbSet then we could also store it in struct _FooBar.

kodebach on 20 Sep 2020

Another solution would be to store the file name along this other data in the metadata of the parent key. (Seems appropriate to me that the file name is metadata)

Probably not such a bad idea anyway, as the convention to have the file name in the values is not so obvious and not so good documented (in particular the information is missing in doc/METADATA.ini).

IMO both of these optimizations should be internal to the backend plugin (once #2969 is done).

The more we can delegate to the Backend plugin, the better. Some parts, however, need to be implemented outside (the outside, however, actually also could be some "distributor" plugin), in particular the global semantics of kdbSet (not committing anything if some Backend failed).

We could also wrap the Plugin * pointing to an instance of the backend plugin in a new struct Backend (different from the existing one) that contains all the information associated with the backend.

I do not understand the proposal.

There is not a single reference to ksCut within split.c.

You are right, it is similar but not the same. It is "cut out" but not with the help of the relation between the keys but with the help of the Trie.

ksCut is pretty fast

Trie lookup is also pretty fast (linear to the number of characters, not with the keys in the keyset!), so without benchmark it is hard to say. And maybe this is not a bottleneck anyway. In general we should concentrate on changes of the API (like passing "filename" via meta-data and the responsibility of the backend plugin) as all the implementation details can be changed easily also in post-1.0.

There are also some functions (e.g. splitPrepare) that use ksDeepDup

This is essential for the semantics of plugins. In kdbSet() plugins change the KeySet (back in a way suitable for serialization, e.g., they rename back the keys) but these changes should not be reflected to the KeySet of the user. So the ksDeepDup is essential, even if it might be expensive. (I do not think it CPU-wise. Of course it is memory-wise, though.)

I was recently considering that we make keys more intelligent to specially mark them in kdbSet. My idea was about that their sync state to not get destroyed (to avoid bugs like #404). But if we make them even more intelligent (with possibilities to change them without actually changing them for the user) we might be able to avoid ksDeepDup altogether.

I think we can sacrifice some performance there, if it helps somewhere else.

@mpranj do you already have some insight how the cache influenced the overall importance of individual optimizations in Elektra?

and store those in a KeySet under the relevant parent key

Syncbits could also be part of meta-data?

linear searches

Which linear search do you mean? The Split is usually only iterated, not searched.

If we do actually need to keep this Key between kdbGet and kdbSet then we could also store it in struct _FooBar.

Iirc nothing except the sizes+error state of kdbSet is remembered between invocations (@mpranj: please correct me if I am wrong. Is there something else in the global KeySet?) In #2955 I proposed that we ignore the error state. Then only the sizes would remain. With a more intelligent KeySet, actually even this could be avoided. But I did not find a way how to do this yet. (Without losing some of the properties we want from KeySet like efficient iteration.) Furthermore, it is probably too short before 1.0 to do such changes.

markus2330 on 21 Sep 2020

I do not understand the proposal.

The proposal was essentially what I described later with the struct _FooBar.

ksCut is pretty fast

Trie lookup is also pretty fast

Yes, Trie lookups are also fast, but they are comparable only to a very small part of ksCut, namely the binary search for the Key. The important factor here is that splitDivide iterates over every single key in the KDB, does one trie lookup plus a ksAppendKey (and a bit of other stuff) for each of them. The version with ksCut on the other hand would do two binary searches plus a ksAppend for each backend. Even if we ignore "trie lookup vs binary searches in ksCut" it is pretty clear to me that a few ksAppend are preferable to a lot of ksAppendKey because we avoid resizing the array multiple times.

So the ksDeepDup is essential, even if it might be expensive.

Thanks, for clarifying that. I don't think "making keys more intelligent" is a good idea, I think they are complicated enough already. The better solution would probably be to have an input and an output KeySet in plugins. Then we only need to deep-copy the keys that actually need to be copied. But since that involves a big change to the plugin API it is far out of scope for 1.0.

Syncbits could also be part of meta-data?

I had that idea as well, just forgot to mention it. Also the solution with the struct _FooBar is a bit closer to the current solution and would probably require less changes.

Which linear search do you mean? The Split is usually only iterated, not searched.

There is splitSearchBackend which returns the index of the given Backend within the Split. It is essentially an indexOf operation and therefore a linear search. splitSearchBackend, is used in splitAppoint, splitDivide and splitUpdateFileName (these in turn are used at various points in kdbGet and kdbSet). In splitAppoint and splitDivide we call splitSearchBackend once for every single key in the KDB and in splitUpdateFileName we call it just a once.

Iirc nothing except the sizes+error state of kdbSet is remembered between invocations

AFAICT none of the data inside the struct _Split is kept between kdbGet and kdbSet, repeated kdbGet or repeated kdbSet operations. The sizes are stored within the struct _Backend which is initially copied from the Split in struct _KDB, but the sizes are then recalculated an overwritten (which would mean we should be able to ignore them in #2969 and remove the failing tests).

In fact AFAICT only the various Plugin * and the Key * mountpoint (this is the parentKey passed to plugin AFAIK) are retained within the KDB handle (and the global KeySet of course). My question now was, whether it is our intention that plugins can pass information to one another and from kdbGet to kdbSet via the parentKey they receive, or whether that should be done only via the global KeySet (between plugins) and the plugin's data handle (between kdbGet and kdbSet)?

Furthermore, it is probably too short before 1.0 to do such changes.

Personally, I think now is exactly the right time do such big changes. With #3447 and #2969 coming up, we already have to big changes that will require a lot of testing before release. Another big change wouldn't have much effect, as long as the time required for implementing the change was short.

Also I think my proposed changes would make Elektras core much more approachable to beginners. Removing struct _Split and struct _Trie (in addition to the removal of struct _Backend in #2969) means the core uses KeySet as its only data structure. (struct _Key, struct _Plugin, struct _KDB et al. are not really data structures, but rather simple records in my mind)

kodebach on 21 Sep 2020

While working on #3447 and also thinking about #2969, I had the idea that now might be a good moment to rethink the implementation of Split.

First of all, I fully agree that this is the best time to make such changes.

The sync operation (AFAIK it just checks the size of the various KeySets in Split) would be a simplified ksCut that doesn't actually cut anything, but instead just returns the size of the "cut".

The size of the KeySet is definitely not enough to detect all possible changes, so something like syncbits is definitely needed. (e.g. pop one key and insert a different key results in same KeySet size).

simplify how cache deals with the Split (since it would be just another KeySet).

I would postpone this until the changes to Split are final. It's not fun to rewrite the cache multiple times.

@mpranj do you already have some insight how the cache influenced the overall importance of individual optimizations in Elektra?

I don't have any hard facts about this (not part of my thesis). In a very read heavy scenario the cache will mask the runtime of any other code called from kdbGet(). For mixed scenarios this is not clear and we would need benchmarks. Optimizations like ksLookup are still relevant though.

Re-thinking the Split could potentially speed up kdbSet(), but even here a benchmark would make sense. I don't really know where most of the time in kdbSet() is spent as my previous work focused on kdbGet() only.

Iirc nothing except the sizes+error state of kdbSet is remembered between invocations (@mpranj: please correct me if I am wrong. Is there something else in the global KeySet?)

That is how I remember it too. The content of the global KeySet is only important during an invocation of kdbGet() regarding the cache. In the API docs however, we promised more than that:

The global keyset is tied to a KDB handle, initialized on kdbOpen() and deleted on kdbClose().

Furthermore, it is probably too short before 1.0 to do such changes.

I think the bigger problem is that we do not have the capacity to rewrite everything. We should be careful to decide only on changes that we can get done with the currently active developers. That said, I am all for simplifying (especially the core) so it might be well worth the invested time.

The next meeting should be next week. Maybe we can dedicate a portion of time (before/after the main part) to discuss this.

mpranj on 22 Sep 2020

The better solution would probably be to have an input and an output KeySet in plugins.

Interesting idea. This alone, however, would be not enough, the plugin developers still would need to be very careful about duplicating keys and removing sync bits. Furthermore, "NOP" plugins would suddenly be expensive as they would need to append all keys to the output.

Overall, it seems to me Elektra's guarantees are quite tied to the Elektra's core guarantees. Everything else is, unfortunately, in every plugin always done a bit differently, at least as long as we do not get a much bigger man power and need to accept PRs from newcomers with limited endurance to do fixes in their plugins.

The sizes are stored within the struct _Backend which is initially copied from the Split in struct _KDB, but the sizes are then recalculated an overwritten (which would mean we should be able to ignore them in #2969 and remove the failing tests).

They are recalculated after it is ensured that we did not hit one of the optimizations. (And the KeySet is actually different.) So we cannot simply ignore them, the optimizations would not work then.

My question now was, whether it is our intention that plugins can pass information to one another and from kdbGet to kdbSet via the parentKey they receive, or whether that should be done only via the global KeySet (between plugins) and the plugin's data handle (between kdbGet and kdbSet)?

In earlier times the parentKey was the only way. Now we could actually reconsider if we can drop the idea with the parentKey as transport between plugins (and make the parentKey only as transport mechanism from applications to plugins, or even make the parentKey a pure parentKey). To clarify this, would actually be quite important for 1.0, otherwise implementation details suddenly would turn to an interface we need to support.

Personally, I think now is exactly the right time do such big changes.

If we have the man power, of course it would be a good time.

Also I think my proposed changes would make Elektras core much more approachable to beginners.

This convinces me much more than all the possible performance gains. With mmap the speed of the kdbGet/kdbSet operations are probably nearly irrelevant (or real-world benchmark first need to show in which cases they are not irrelevant). KeySet is obviously something every Elektra user need to learn. So to reuse it, is of course desirable.

I have to note, however, that KeySet can also be very unsuitable in some situations. E.g. we already tried several times in storage plugins to use use the KeySet as the only data structures, and this was often a bad idea. E.g. if you need hierarchical structure or need order for serialization, it is much better to use a data structure which represents exactly this what you need. In the case of Split, however, you might be right, your latest proposal sounds to be superior then the current one.

Btw. entry barriers is also a good topic for a master thesis :1st_place_medal:

I would postpone this until the changes to Split are final.

I agree here. I would even say that changes to Split should go together with the Backend restructuring, as it does not make much sense to fix now all the Split test cases, and then to destroy everything again with restructuring Split.

The next meeting should be next week. Maybe we can dedicate a portion of time (before/after the main part) to discuss this.

I agree. I would even say, this discussion will be the main part :+1:

markus2330 on 22 Sep 2020

Now we could actually reconsider if we can drop the idea with the parentKey as transport between plugins

Between plugins during a single invocation of kdbGet or kdbSet should be fine. It would be more work to reset the key for every plugin than to just keep it around.

However, the problem is that AFAIK currently the parentKey is persistent from kdbOpen until kdbClose, i.e. for multiple invocations of kdbGet and kdbSet, since it is stored in the KDB handle. Whether that is purely an implementation detail or is actually part of some API guarantee, I do not know.

My suggestion would be that we say the parentKey is owned by the core and passed to a plugin. At that point it contains certain pieces of data (configfile name, etc). Plugins should treat it as entirely read-only (we may even lock it). Plugins should also not expect the lifetime of parentKey to be any longer than the current plugin invocation.

Instead plugins should only rely on elektraPluginGetData/elektraPluginSetData and the global keyset.

KeySet can also be very unsuitable in some situations

Of course, a KeySet is not a general purpose data structure. But in a way the tree of backends is almost the same as the whole KDB tree, just that the value of each key is not a string (mostly), but another tree. Since all the optimizations in KeySet are agnostic to the actual values of the Keys, it should be perfect to replace Split.

If we have the man power, of course it would be a good time.

That's why we should discuss changes right now (even if that takes time as well), because how much manpower we need heavily depends on what changes we want.

kodebach on 22 Sep 2020

since it is stored in the KDB handle

Can you please point me to the source line of which parentKey you mean?

Whether that is purely an implementation detail or is actually part of some API guarantee, I do not know.

This interface is totally under-documented. src/plugins/doc/doc.h describes what you might get from parentKey but I do not think that parentKey interactions between plugins are anywhere documented or tested.

This, together with the global plugins, is unfortunately still a big TODO.

Instead plugins should only rely on elektraPluginGetData/elektraPluginSetData and the global keyset.

Actually only the global keyset. The data is a private thing and not passed between plugins. The docu should be more clear about this:

diff --git a/src/libs/plugin/plugin.c b/src/libs/plugin/plugin.c
index dc198c86c..590bd5358 100644
--- a/src/libs/plugin/plugin.c
+++ b/src/libs/plugin/plugin.c
@@ -118,7 +118,9 @@ KeySet * elektraPluginGetConfig (Plugin * handle)
 }

 /**
- * @brief Store a pointer to any plugin related data.
+ * @brief Store a pointer to plugin specific data.
+ *
+ * This data is private to one instance of a plugin.
  *
  * @see elektraPluginGetData
  * @param plugin a pointer to the plugin
@@ -131,10 +133,12 @@ void elektraPluginSetData (Plugin * plugin, void * data)
 }

 /**
- * @brief Get a pointer to any plugin related data stored before.
+ * @brief Get a pointer to the plugin specific data stored before.
  *
  * If elektraPluginSetData() was not called earlier, NULL will be returned.
  *
+ * This data is private to one instance of a plugin.
+ *
  * @see elektraPluginSetData
  * @param plugin a pointer to the plugin
  * @return a pointer to the data

And we also need unit tests about this. @vLesk is there any progress with the backend specific unit tests?

it should be perfect to replace Split.

Ok, sounds good then. I will not mourn over Split or Trie.

depends on what changes we want.

I see it as an implementation detail. If it helps to get #2963 implemented more correct, maintainable or faster (in terms of development speed), we should go this way.

markus2330 on 22 Sep 2020

Can you please point me to the source line of which parentKey you mean?

It is called mountpoint and is part of the struct _Backend.
https://github.com/ElektraInitiative/libelektra/blob/4b0e9f8bdd6d890a1a3abaf04c543b9c7d33e984/src/include/kdbprivate.h#L361-L366

I have not check the whole codebase, but I am fairly certain this is the key passed to plugins. Although I might have missed a keyDup somewhere...

The data is a private thing and not passed between plugins.

Yes of course, but the data is the right place if a single plugin (instance) wants to retain information from kdbGet to kdbSet. Data specific to a plugin instance should definitely not be in the global keyset.

Of course putting mountpoint specific information into the global keyset to pass information from e.g. one postgetstorage plugin to another is a bit awkward (you will have to prefix all the keys with the mountpoint and even then there might be clashes), but its the best solution we currently have (a part from undocumented parentKey "API").

kodebach on 22 Sep 2020

❤1 👍1

but I am fairly certain this is the key passed to plugins.

This would be a bug, the key is supposed to only hold the information of where the mountpoint is and should be immutable after creation of the Backend.

Yes of course, but the data is the right place if a single plugin (instance) wants to retain information from kdbGet to kdbSet. Data specific to a plugin instance should definitely not be in the global keyset.

Exactly.

and even then there might be clashes

Which clashes are you thinking of? We should document the best practices of the global keyset before we get a mess there.

markus2330 on 26 Sep 2020

Which clashes are you thinking of?

I think he means clashes (conflicts) in the key name. Currently there are no rules how to name keys within the global keyset. Therefore unintended key name conflicts are easily possible.

To communicate between cache and resolver I used the following key name prefix, but theoretically anyone can overwrite it.
https://github.com/ElektraInitiative/libelektra/blob/46723f3f738c3eaf4a76d4648bfe919e2ac3235a/src/include/kdbprivate.h#L69-L70

mpranj on 26 Sep 2020

👍1

I think he means clashes (conflicts) in the key name. Currently there are no rules how to name keys within the global keyset. Therefore unintended key name conflicts are easily possible.

Then I probably overlooked a keyDup or something

I think he means clashes (conflicts) in the key name. Currently there are no rules how to name keys within the global keyset. Therefore unintended key name conflicts are easily possible.

Yes, specifically there could be a conflict between two plugins, so you need long prefixes.

A partial solution would be to provide a mountpoint keyset that is unique for each mountpoint. The other part can only be solved by convention like METADATA.ini.

kodebach on 26 Sep 2020

Then I probably overlooked a keyDup or something

It is possible that we have a bug there. The parentKeys are unfortunately not tested at all. I hope we get some tests with #2963.

A partial solution would be to provide a mountpoint keyset that is unique for each mountpoint. The other part can only be solved by convention like METADATA.ini.

Why would this solution be beneficial compared to prefixing? Why is that solution be partial?

I think guidelines for the global keyset should be enough.

markus2330 on 27 Sep 2020

Why would this solution be beneficial compared to prefixing?

Long keynames result in large amounts of memory usage, and if the keyset is cached (not sure if that's the case), large mmap cache files. Separate keysets are also less error prone.

Why is that solution be partial?

With just a single global keyset, non-global plugins need to use the mountpoint as a prefix to avoid collisions.

Separate keysets for each mountpoint only solves the issue where e.g. type is mounted at two different mountpoints.

It does not solve the issue where to different plugins try to use the same key name. Of course it also doesn't affect global plugins at all.

I think guidelines for the global keyset should be enough.

Guidelines will be needed anyway. We could just create a document like METADATA.ini where plugins can "reserve" key names. Any non-global plugin would also have to prefix the reserved name with the mountpoint, if we don't add separate keysets.

kodebach on 27 Sep 2020

With just a single global keyset, non-global plugins need to use the mountpoint as a prefix to avoid collisions.

But the whole point of the global KeySet is that there is only one instance, globally. (well one instance per KDB handle)
The point of the global KeySet was that the cache plugin needed it exactly as is, to communicate with the resolver plugins. (Getting file timestamps etc. to check if the cache needs updating)

mpranj on 27 Sep 2020

But the whole point of the global KeySet is that there is only one instance, globally.

kodebach on 27 Sep 2020

👍2

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Plugin Functions: Return Value on Success

sanssecours · 4Comments

Jenkins: retry problematic stages

mpranj · 4Comments

mmap: changing environment

markus2330 · 4Comments

Jenkins: build timeouts

mpranj · 3Comments

Toml/tablearray problem

markus2330 · 3Comments