Jabref: Group hierarchy lost when moving to BibEntry

Created on 12 Jun 2016 · 37 Comments · Source: JabRef/jabref

When moving from JabRef 2.10 to 3.4, the philosophy governing groups changed. Rather than a key being linked to a group in "jabref-meta: groupstree", the group is attached to the individual entry, as in "groups = {culture},".

The big disadvantage is that groups are no longer in a hierarchy but are independent. As a result, a key associated with, say, "culture" is associated with that term everywhere it appears in the database. So when I select "culture", hundreds of keys are selected that use the word "culture" in a different context or meaning. If I understand the situation correctly, there is now no way to avoid the hundreds of irrelevant links.

The issue now is how to avoid repeating the situation in the future. I find that if BibEntries share a group string, they are viewed as belonging to the same group. Since the strings are case sensitive, I can distinguish Culture, cUlture, cuLture, etc., and then go through all the keys selected by "culture" and re-associate them with one of the new names. Since this is hundreds of hours of work, I wonder if there is a better method.

groups bug

All 37 comments

So the issue is when the group name is used multiple times in a large group hierarchy at different places? Like you have GroupA -> culture, GroupB -> GroupC -> culture and in this case the two groups will be merged falsely? Did I get your question right?

JabRef should not be doing something like this. We need a step to make the group names unique before they are converted. What do you think, @tobiasdiez? Or is this already done? I am a little uncertain whether I have understood the question correctly.

Hi. This is exactly that.

I usually use one group per project/paper, each one having subgroups with generic names, such as "to_review", "introduction", "materials", "pathophysiology" etc. With JabRef3.4, a given "to_review" subgroup contains the articles from all the "to_review" subgroups.

Since the move from JabRef 2.10 to 3.4 resulted in a massive loss of information, I reverted to the former and recovered the jabref-meta: groupstree for each database. While I lack any expertise, it seems that the hierarchical structure of group names could be readily preserved. For example, here is a hierarchical group tree:

1
-A
--a
2
-B
--a
3
-C
--b

So a bibliographic entry that is associated only with the first "a", and with "C" and "b", would have the field:

groups = {1,A,a; 3,C; 3,C,b},

If this could be done, I recommend it.
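To make the proposal concrete, here is a minimal sketch of how such a field could be read back into group paths. It assumes the format suggested above (";" between groups, "," between the levels of one path); the function name `parse_group_paths` is made up for illustration and is not part of JabRef.

```python
from __future__ import annotations


def parse_group_paths(field_value: str) -> list[tuple[str, ...]]:
    """Split a groups field such as '1,A,a; 3,C' into one tuple per group path."""
    paths = []
    for chunk in field_value.split(";"):
        parts = tuple(part.strip() for part in chunk.split(",") if part.strip())
        if parts:
            paths.append(parts)
    return paths


entry_paths = parse_group_paths("1,A,a; 3,C; 3,C,b")
print(entry_paths)                       # the three paths from the example
print(("2", "B", "a") in entry_paths)    # the second "a" is a different group
```

Because the whole path is compared, the "a" under 1 > A and the "a" under 2 > B stay separate.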

Refs #628: "Feature: Hierarchical Keywords"

Okay, this is confirmed.

A solution using something like 1 > A > a sounds reasonable.

A quickfix is not possible. This requires a major rework as a lot of issues are caused by this.

  • the name and the keyword differ in static group
  • an explicit group always needs to know its context to compute the correct key (or the other classes must ensure that the correct keyword, which encodes the group hierarchy, is always set)

@tobiasdiez we should discuss this when you are back

I could implement that the whole hierarchy is always written in the groups field. But what about cases where the same group name is used on the same level, i.e. 1 > A > a and 1 > A > a with different entries? Should this case just be forbidden upon creation of the group?

@tobiasdiez I think so: duplicate names should not be allowed for sibling groups, just like file names in a classic file system.
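The file-system rule can be sketched as follows. This is only an illustration under the assumption that a group tree is a plain parent/child structure; `GroupNode` and its methods are hypothetical names, not JabRef classes.

```python
class GroupNode:
    """A node in a group tree that refuses duplicate names among siblings."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def add_child(self, name):
        # Same rule as a directory: a name may appear only once per parent.
        if name in self.children:
            raise ValueError(f"group {name!r} already exists under {self.path()!r}")
        child = GroupNode(name, parent=self)
        self.children[name] = child
        return child

    def path(self, sep=" > "):
        # Walk up to the (unnamed) root and join the names on the way.
        parts = []
        node = self
        while node.parent is not None:
            parts.append(node.name)
            node = node.parent
        return sep.join(reversed(parts))


root = GroupNode("root")
first_a = root.add_child("1").add_child("A").add_child("a")
second_a = root.add_child("2").add_child("B").add_child("a")
print(first_a.path())
print(second_a.path())
```

The same name is fine in different branches; only a second "a" directly under 1 > A would be rejected.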

@tobiasdiez I don't get your example. Does that mean that it is not possible to have the same group a associated with two different entries (with a nested in A, which is nested in 1)? I would allow two entries belonging to the same group... Maybe I got something wrong.

I can confirm this bug for the build "2016-07-13--master--304d280".

@HainesB reported a similar issue at https://sourceforge.net/p/jabref/mailman/message/35303493/. Maybe he is willing to test the fix as soon as we have had time to implement it.

Come to think of it, maybe it is not quite that bad... In bug #1508 there was an example when repeating a group was undesirable:

  • Asthma

    • Treatment

  • Diabetes

    • Treatment

But repeating groups enables one to create such structure:

  • Asthma

    • Asthma treatment

    • Asthma diagnosis

  • Diabetes

    • Diabetes treatment

    • Diabetes diagnosis

  • Treatment

    • Asthma treatment

    • Diabetes treatment

And such structure might be desirable.

Perhaps a better (and easier) solution would be to make GUI aware of such possibilities ("Add existing group as subgroup", a dialog informing that the newly entered group name is not unique and asking what should be done)?

I think the above example is overly complex and has several pitfalls (e.g. deleting content from a group without being aware that it is duplicated, and thereby triggering unwanted side effects). Actually, the use case illustrated above can simply be imitated by using three distinct groups: asthma, diabetes and treatment. If you have these groups, you can always select multiple groups and compute an intersection between them.

Therefore, my proposal here is to keep it simple and use default behaviors, that people do expect (i.e. filesystem-like, without special cases like symlinks/hardlinks/etc). Thus, save groups simply like filesystem paths.
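The intersection approach mentioned above can be sketched with plain sets. The citation keys and the helper `select_intersection` are invented for illustration; this is not how JabRef stores groups internally.

```python
# Three flat, distinct groups, each holding citation keys (made-up data).
groups = {
    "Asthma": {"smith2010", "lee2014"},
    "Diabetes": {"jones2012", "kim2015"},
    "Treatment": {"lee2014", "kim2015"},
}


def select_intersection(groups, *names):
    """Return the keys that belong to every one of the named groups."""
    selected = [groups[name] for name in names]
    return set.intersection(*selected) if selected else set()


print(select_intersection(groups, "Asthma", "Treatment"))
print(select_intersection(groups, "Diabetes", "Treatment"))
```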

But currently the groups don't behave like filesystem paths. There's no filesystem analog for "have this file appear in any directory called 'treatment', wherever that folder is on the disk".

The sample of @mpatas is fairly typical for me. I collect way, way too many references when I write a paper and I tend to separate them by theme so I know where to find them easily when I'm tangling with that theme. (edit: I know, keywords, but I like the visual representation of the tree)

(and the hardlinks behavior is already in JabRef and any other reference manager I can think of. A reference can legitimately and easily be in multiple reference groups, and no group has any special status as the "actual" location of the reference. Symlinks would indeed be confusing that way.)

When I download a file, an incremental number is automatically appended to its name to prevent a name clash, as in Alpha, Alpha-1, Alpha-2.

Couldn't the group name creation process in JabRef be similarly equipped, so that when a new group is created with the name of an existing group, an incremental number, perhaps not visible, is appended?

Haines Brown
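A rough sketch of that download-style naming scheme, assuming the set of existing group names is known when a group is created. The function `unique_name` is hypothetical.

```python
def unique_name(name, existing):
    """Return name, or name-1, name-2, ... if the plain name is already taken."""
    if name not in existing:
        return name
    n = 1
    while f"{name}-{n}" in existing:
        n += 1
    return f"{name}-{n}"


existing = set()
created = []
for _ in range(3):
    new_name = unique_name("Alpha", existing)
    existing.add(new_name)
    created.append(new_name)
print(created)
```

The suffix only has to be part of the stored identifier; whether it is also shown in the tree is a separate display decision.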

@tillschaefer

Actually, the use case illustrated above can simply be imitated by using three distinct groups: asthma, diabetes and treatment. If you have these groups, you can always select multiple groups and compute an intersection between them.

That is true. Yet there are use cases of a somewhat different kind where this solution seems to be rather awkward. For example:

  • Biology

    • Biochemistry

  • Chemistry

    • Biochemistry

In general, grouping of references by topic will result in something like that for any multidisciplinary topic. Of course, one can make "Biochemistry" a group that is not a subgroup of either group, but that would seem to lose some of the benefits of hierarchical structure (for example, the ability to collapse the groups one is not interested in at the moment).

If done carefully, such structure with repeated groups should avoid any unintended side effects. Although, of course, done carelessly it could even result in cycles...

Hereby, I officially surrender in front of this bug.
The following steps are needed in order to fix the issue:

  • Static groups should match against the path in tree instead of its name, i.e. Biology > Biochemistry instead of only Biochemistry.
  • Migrate old groups to this new format.
  • Update a static group as soon as its position in the tree is changed.

While each step in itself is not that difficult to accomplish, I do not have the time to implement it right now. Sorry for the inconvenience caused!
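For illustration, the first two steps listed above might look roughly like this. The sketch assumes a heavily simplified old format (a mapping from group path to citation keys, not the real jabref-meta: groupstree syntax) and uses " > " as the path separator; `migrate` and `members` are hypothetical names.

```python
# Simplified stand-in for an old group tree: path -> citation keys (made-up data).
old_tree = {
    ("Biology", "Biochemistry"): ["smith2010"],
    ("Chemistry", "Biochemistry"): ["jones2012", "smith2010"],
}


def migrate(old_tree, sep=" > "):
    """Build a per-entry groups list that stores the full path of each group."""
    fields = {}
    for path, keys in old_tree.items():
        for key in keys:
            fields.setdefault(key, []).append(sep.join(path))
    return fields


def members(fields, path, sep=" > "):
    """Match a static group against its full path instead of only its name."""
    wanted = sep.join(path)
    return sorted(key for key, groups in fields.items() if wanted in groups)


fields = migrate(old_tree)
print(fields["smith2010"])
print(members(fields, ("Biology", "Biochemistry")))
print(members(fields, ("Chemistry", "Biochemistry")))
```

The third step, updating a group when it is moved in the tree, is not covered here.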

However, there are two workarounds:

  • Instead of creating a static group Biochemistry use a keyword group that matches Biology > Biochemistry in the groups field. You can have multiple keyword groups with the same name but different search pattern, so this should work as desired.
  • Use hierarchical keywords (e.g. keywords = { Biology > Biochemistry, Chemistry > Biochemistry }) and use the new automatic group feature of JabRef 4.0 (released soon) to generate a working groups tree:

      • Biology

        • Biochemistry

      • Chemistry

        • Biochemistry

What about creating an additional uniqueID and a mapping from this ID to the displayed group name?

If a group name is already unique - use the name as ID.
If a group name is duplicated - create unique IDs name1, name2 or something else.

In the groups field the uniqueID should be stored. (This is a downside as there is a divergence between uniqueID and displayed group name - but I think in most cases groups won't be edited manually so this is not a real issue...)

The ID -> display name map should not be stored in some existing metadata but should be added to a new one. Thus, it should be possible to use the current implementation of the "search logic" by using the unique ID as search term. Also an automatic migration to the new format should be possible, and the groupstree is generally usable in older versions as well...

Only in the UI some bigger changes should be needed to show the display names...

As I have not looked at the groups code for ages... am I missing something? Or would this generally be feasible?
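A minimal sketch of the ID idea, assuming the display names are available as a flat list in tree order. `assign_ids` is a made-up helper, and the numbering scheme (name1, name2, ...) is just the one floated above.

```python
from collections import Counter


def assign_ids(group_names):
    """Return (ids, id_to_name): unique names keep their name as ID,
    duplicated names get a numeric suffix."""
    counts = Counter(group_names)
    seen = Counter()
    ids = []
    id_to_name = {}
    for name in group_names:
        if counts[name] == 1:
            gid = name
        else:
            seen[name] += 1
            gid = f"{name}{seen[name]}"
            # Skip suffixes that would clash with another display name or ID.
            while gid in counts or gid in id_to_name:
                seen[name] += 1
                gid = f"{name}{seen[name]}"
        ids.append(gid)
        id_to_name[gid] = name
    return ids, id_to_name


ids, id_to_name = assign_ids(["Asthma", "Treatment", "Diabetes", "Treatment"])
print(ids)
print(id_to_name["Treatment2"])
```

The groups field would then hold the ID, while the UI looks up the display name through the map.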

I haven't followed the whole history of the groups format, but it looks to me like most of these comments boil down to "let's just go back to groups format 3". If the point of groups format after-3 (4?) was that you could edit the group assignment in the entry itself without having to look up stuff elsewhere, none of the above seem to accomplish that.

Can't we just copy the hierarchical tags idea for groups as suggested above? So something like

  • Asthma

    • Treatment

  • Diabetes

    • Treatment

would be represented as

groups = { Asthma/Treatment, Diabetes/Treatment }

(or whatever other separator is deemed more appropriate) and have JabRef figure out on its own how that translates visually.

@retorquere To use this as an identifier for the groups would be possible, too. But this would cause either a lot of overhead if groups are moved in the tree, or would lead to inconsistencies between the actual position in the tree vs. the assumed position in the tree:

Let's assume that the structure:

  • Asthma

    • Treatment

  • Diabetes

Is changed to:

  • Asthma
  • Diabetes

    • Treatment

The entries assigned to this group would have the "key" groups = { Asthma/Treatment } although the "Treatment" group is now a subgroup of "Diabetes". And changing the keys of all affected entries would massively affect the performance in huge databases...
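To show where that cost comes from, here is a sketch of what moving a group would involve if the path were the identifier. It assumes each entry holds a list of "/"-separated paths; `move_group` is hypothetical and says nothing about how fast or slow JabRef itself would be.

```python
def move_group(entries, old_path, new_path, sep="/"):
    """Rewrite path-based group keys after a move; return how many entries changed."""
    touched = 0
    for key, groups in entries.items():
        updated = [
            new_path + g[len(old_path):]
            if g == old_path or g.startswith(old_path + sep)
            else g
            for g in groups
        ]
        if updated != groups:
            entries[key] = updated
            touched += 1
    return touched


entries = {
    "lee2014": ["Asthma/Treatment"],
    "kim2015": ["Asthma/Treatment/Drugs"],
    "jones2012": ["Diabetes"],
}
changed = move_group(entries, "Asthma/Treatment", "Diabetes/Treatment")
print(changed)
print(entries)
```

Every entry in the moved group and in all of its subgroups has to be visited and rewritten, which is the overhead described above.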

@matthiasgeiger then perhaps the desiderata "easy to edit by hand with no more context than the reference" (groups format post-3) and "easy to keep consistent and well-performing" (groups format 3) are at odds with each other... personally I still prefer format 3; the various downsides discussed above undo the benefits of post-3 from my pov.

I'm not sure whether I remember this correctly, but I think the main reason to change the group format was a more technical one, as the "static" groups are now implemented exactly the same way as all other groups: Now it is possible to simply perform a search in the background - as it is the case for keyword or freetext based groups. The "easy to edit by hand and directly showing the groups in the entry" was a nice side effect.
But perhaps @tobiasdiez could provide some more insights in this...

But if that was the aim, surely the current implementation doesn't accomplish this. The "static" groups as we had them in groups format 3 are not implemented in the new format (as per this issue). The new groups format implements a kind of static grouping, and it seems plausible it now uses the same search-based infrastructure as the other grouping methods did, but in the format-3 version even search-based groups could live nested under other groups -- why isn't that still possible?

As one stuck with format 3, let me throw in another consideration. When you have databases having 15-20,000 entries as I do, a manual conversion of group names is prohibitive.

Haines Brown

But would you do that manual conversion by hand? By script? Or using JabRef? Because only in the first case it would seem to be prohibitive. For the other two it would just be an implementation detail? I'd probably reach for option 1 only when my file is corrupt.

Just out of curiosity: When talking about "groups format 3" and "groups format 4", do you mean JabRef 3.x vs. JabRef 4.x? I thought the change in group structure happened between JabRef 2.11 and JabRef 3.x?

Or are you talking about new changes in JabRef 4.x?
Or is the "groups format" not related to the JabRef version?

I'm not sure how the groups format relates to JabRef versions; there used to be a

@comment{jabref-meta: groupsversion:3;}

in the Bib(La)TeX files created by JabRef, now there is

@comment{jabref-meta: databaseType:biblatex;}

or

@comment{jabref-meta: databaseType:bibtex;}

Since this is the first new format I've seen since groupsversion:3 was deprecated, I've been calling it format 4, but only because it's 3+1, no other reason. I don't know what internal name the JabRef devs use for either of these formats.

Thanks for the clarification!

Mapping this to JabRef versions, it's "up to JabRef 3.3" and "since JabRef 3.4".

The main advantages of the new format are outlined in https://github.com/JabRef/jabref/issues/629. The id-based strategy suggested by @matthiasgeiger should work but removes some of the advantages (at least for groups with duplicate names). The issue is not that a solution is not technically feasible but that I don't see an easy solution, i.e. a lot of time is needed to program it.

Maybe I should stress the point that the workaround above is very easy, has the same behavior as explicit groups and results in no performance loss or other problems (that I can see). The idea is the following. Assume that you already have a tree with the groups Asthma and Diabetes and want to add two explicit groups Treatment as subgroups. Instead of creating Treatment as an explicit group (since this would lead exactly to this bug), we add it as a keyword-based group working on the groups field and searching for Asthma > Treatment and Diabetes > Treatment.
So you would end up with the following tree:

  • Asthma (explicit group, or whatever you want)

    • Treatment (keyword group: field = groups, search text = Asthma > Treatment)

  • Diabetes (explicit group, or whatever you want)

    • Treatment (keyword group: field = groups, search text = Diabetes > Treatment)

If you now add an entry to the first Treatment group Asthma > Treatment will be added to the groups field, while for the second group you get Diabetes > Treatment. Hence only the correct entries are matched by each Treatment group. In the end, you implement the solution proposed above by hand.
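The effect of this workaround can be sketched as follows. The entries are invented, the comma separator in the groups field is an assumption, and `keyword_group_members` is only a stand-in for whatever matching JabRef's keyword groups actually perform.

```python
# Made-up entries whose groups field stores the full "Parent > Child" text.
entries = {
    "lee2014": {"groups": "Asthma > Treatment"},
    "kim2015": {"groups": "Diabetes > Treatment"},
    "park2016": {"groups": "Asthma > Treatment, Diabetes > Treatment"},
}


def keyword_group_members(entries, search_text, separator=","):
    """Return the keys whose groups field contains search_text as one item."""
    members = []
    for key, fields in entries.items():
        assigned = [g.strip() for g in fields.get("groups", "").split(separator)]
        if search_text in assigned:
            members.append(key)
    return sorted(members)


print(keyword_group_members(entries, "Asthma > Treatment"))
print(keyword_group_members(entries, "Diabetes > Treatment"))
```

Both groups are displayed as "Treatment", but each one matches only its own search text.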

Note that if you still use an older JabRef (< 3.4), you can simply change the explicit groups with duplicate names according to this scheme (using JabRef, right-click on group -> edit -> change to keyword group with field = groups and search text whatever you want - all your entries are automatically assigned to this new group). Now you can upgrade to later JabRef versions without the trouble of groups being accidentally merged.

I wonder, are we still doing something regarding this issue? Any actions planned?

Because as far as I can see, the groups format is pretty much stable as it is now. So I guess this issue can be closed.

Shouldn't this be kept open? Because the actual issue has not been fixed, there are only workarounds for it - as far as I understand. I, for example, opted to replace subgroup names that appear multiple times in my database with a new name that is based on both the parent group and the subgroup. While this certainly works, it is just a workaround. I think something that could help alleviate the problem a bit would be JabRef notifying the user that a specific group name has already been used (similar to the duplicate Bibtex key notification). This would not solve the issue, but I think it would be a very helpful workaround.

From @tobiasdiez's replies, however, I understand that solving the issue itself would require an awful lot of work, which is (currently?) not feasible, especially since workarounds exist. My suggestion would be to keep the issue open, but maybe add a tag saying that this is a long-term problem that may or may not be solved in the distant future (if such a tag exists).

Yes, please keep this open. The problem can silently destroy your grouping. There is no warning issued to the creator of such a group. It just merges groups with the same name, and this is not expected behaviour.

@AEgit @tillschaefer Thanks for the explanation!

Sure we can keep this open. And now I recall what we wanted to do here:

  • Implement a warning message if users create groups with duplicate names.

I'll be optimistic and add this to the 4.0 milestone as well. Let's see if we can make that depending on how well the next beta goes.
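The check behind such a warning could be as simple as the sketch below. It assumes the group tree is available as nested dictionaries; `duplicate_group_names` is a made-up helper and does not reflect how the warning was implemented in JabRef.

```python
from collections import Counter


def iter_names(tree):
    """Yield every group name in a nested {name: subtree} dictionary."""
    for name, children in tree.items():
        yield name
        yield from iter_names(children)


def duplicate_group_names(tree):
    """Return the names that occur more than once anywhere in the tree."""
    counts = Counter(iter_names(tree))
    return sorted(name for name, n in counts.items() if n > 1)


tree = {
    "Asthma": {"Treatment": {}},
    "Diabetes": {"Treatment": {}, "Diagnosis": {}},
}
print(duplicate_group_names(tree))
```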

@lenhard: Cheers, thanks a lot! Yea, I think the warning message would be very helpful.

A warning message is now shown.
Sadly a real fix of this issue requires a lot of time that currently none of the developers can invest. Thus, I put this issue on-hold.

Cheers, thank you for the work. I think the warning message is a good workaround.
