OpenRefine: Replace org.json library

Created on 15 Jun 2018 · 14 comments · Source: OpenRefine/OpenRefine

Because of the extra line in the JSON.org license ("The Software shall be used for Good, not Evil."), it is considered a non-free license by the GNU/FSF, Debian, Fedora, Red Hat and Google, and is causing issues with joining the SFC.

Alternative freely-licensed implementations exist that avoid this issue. See for example: https://wiki.debian.org/qa.debian.org/jsonevil

@thadguidry is also looking at the Jackson library.

Labels: maintainability, Critical

All 14 comments

@magdmartin No, I ALREADY looked at Jackson and it's usable, but quite a bit of work for @jackyq2015 or whomever, like... A LOT of work actually :-)

And @wetneb has mentioned in https://github.com/OpenRefine/OpenRefine/blob/master/extensions/wikidata/src/org/openrefine/wikidata/utils/JacksonJsonizable.java#L41:

Migrating out of this library is a lot of work that is going to cause major incompatibility with most extensions. It's something that we want to do eventually, but it's most likely going to happen gradually. It could be possible to migrate gradually, by isolating the parts that use Jackson / JSON: for instance, the Wikidata extension uses Jackson everywhere except where it interfaces with the core.

I am adding this ticket to the front-end / back-end separation so we can take it into account when we define the technical roadmap.

I guess we need to discuss exactly how we are going to approach this? Can it be broken down into smaller tasks that less experienced developers (like me!) can tackle?

@ostephens I would probably begin looking at this through an interface lens, so that TDD (Test Driven Development) is approached from the very beginning. We actually have an interface:

com.google.refine.Jsonizable

whose consumers implement it, and they would be the first test points:

com.google.refine.model.AbstractOperation
com.google.refine.model.Cell
com.google.refine.clustering.Clusterer
com.google.refine.model.Column
com.google.refine.model.ColumnGroup
com.google.refine.model.ColumnModel
com.google.refine.commands.Command
com.google.refine.commands.browsing.ComputeClustersCommand
com.google.refine.commands.browsing.ComputeFacetsCommand
com.google.refine.grel.Control
com.google.refine.browsing.DecoratedValue
com.google.refine.browsing.Engine
com.google.refine.expr.EvalError
com.google.refine.browsing.facets.Facet
com.google.refine.grel.Function
com.google.refine.commands.history.GetHistoryCommand
com.google.refine.commands.history.GetProcessesCommand
com.google.refine.commands.project.GetProjectMetadataCommand
com.google.refine.history.History
com.google.refine.history.HistoryEntry
com.google.refine.commands.HttpUtilities
com.google.refine.model.medadata.IMetadata
com.google.refine.importing.ImportingJob
com.google.refine.operations.cell.MassEditOperation
com.google.refine.browsing.facets.NominalFacetChoice
com.google.refine.model.OverlayModel
com.google.refine.util.Pool
com.google.refine.preference.PreferenceStore
com.google.refine.process.Process
com.google.refine.process.ProcessManager
com.google.refine.model.Recon
com.google.refine.model.ReconCandidate
com.google.refine.model.recon.ReconConfig
com.google.refine.model.ReconStats
com.google.refine.model.ReconType
com.google.refine.model.RecordModel
com.google.refine.model.Row
com.google.refine.preference.TopList

  1. We'd want to start with org.json's serialization/deserialization feature set, compare and map it to Jackson's (https://github.com/FasterXML/jackson-databind), and build a utility class to replace org.json.JSONObject.
  2. With a utility class that mimics the existing JSONObject and its object manipulation with POJOs, and that affords the same options org.json did, begin conducting test cases across "the work set": https://github.com/OpenRefine/OpenRefine/search?l=Java&q=%22JSONObject%22
  3. ??? - @jackyq2015, what next? Probably Writer and Exception? Maybe Exception should even be first, since we want to do TDD.
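To illustrate the kind of mapping such a utility class would encode, here is a minimal sketch of common org.json calls next to their Jackson tree-model equivalents. The `JsonCompat` class name is invented for illustration; only jackson-databind is assumed on the classpath:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Hypothetical sketch: Jackson equivalents of common org.json.JSONObject calls.
public class JsonCompat {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        // org.json: new JSONObject().put("name", "OpenRefine").put("stars", 5)
        ObjectNode node = MAPPER.createObjectNode()
                .put("name", "OpenRefine")
                .put("stars", 5);

        // org.json: obj.getString("name")  ->  Jackson: node.get("name").asText()
        System.out.println(node.get("name").asText());          // OpenRefine

        // org.json: obj.toString()  ->  Jackson: MAPPER.writeValueAsString(node)
        System.out.println(MAPPER.writeValueAsString(node));    // {"name":"OpenRefine","stars":5}
    }
}
```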

I think starting from the function package might be a good starting point. It is isolated and easier to test, so you can get familiar with the migration path. Then you can move on to more critical parts and modules which need more effort to test.

I have started to work on this. Here is my approach:

  • Write tests to pin down the current behaviour of the serialization / deserialization functions for each Jsonizable instance. The tests are written with the current implementation, so they rely on org.json, but are designed with the migration in mind: the test cases will be easy to migrate to Jackson. When the classes can be both serialized and deserialized, the test generally consists in deserializing and re-serializing a given JSON payload, and checking that the result is equal to the original (as a JSON object). Sample JSON representations can be collected by hijacking the write method to dump every representation it writes to a file, and running OR on diverse projects.
  • Rewrite the parts of the code where JSON objects are explicitly used for storing objects, by introducing classes for the data they represent (example: engine configuration in operations). This is cleaner and will ease the migration to Jackson by reducing the footprint of org.json in the code base. List of places where this is needed:

    • Engine -> create EngineConfig (also used in operations)

    • ImportingJob -> create ImportingJobStatus

    • Probably others, to be discovered later

  • Annotate all classes for serialization with Jackson, without breaking the existing implementation. Thanks to the tests we can easily check that the JSON format is preserved.
  • Annotate all relevant classes for deserialization and replace all calls to the static reconstruct, loadStreaming and other deserialization methods to the Jackson equivalents. This includes the serialization tests. Deserialization should be made as flexible as possible (use @JsonIgnoreProperties(ignoreUnknown = true) almost everywhere) to match current behaviour.
  • Remove all the old org.json-based (de)serialization code, remove the jar from the repository.
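The round-trip pin-down test described in the first bullet can be sketched with Jackson's tree comparison, which ignores key order. The `Candidate` class and its fields here are hypothetical stand-ins for a real Jsonizable implementation:

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical ReconCandidate-like class, annotated for Jackson round-tripping.
@JsonIgnoreProperties(ignoreUnknown = true)
class Candidate {
    @JsonProperty("id")    public String id;
    @JsonProperty("name")  public String name;
    @JsonProperty("score") public double score;
}

public class RoundTripTest {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        String payload = "{\"id\":\"Q42\",\"name\":\"Douglas Adams\",\"score\":0.98}";

        // Deserialize, re-serialize, and compare as JSON trees (key order ignored).
        Candidate c = mapper.readValue(payload, Candidate.class);
        JsonNode before = mapper.readTree(payload);
        JsonNode after  = mapper.readTree(mapper.writeValueAsString(c));

        System.out.println(before.equals(after));  // true when the format is preserved
    }
}
```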

The target end state is:

  • The Jsonizable interface is deleted.
  • All classes that implemented Jsonizable are annotated with Jackson configuration for serialization, and similarly for deserialization. Most classes should not need any custom code for serialization or deserialization; the annotations should be sufficient. The serialization format is kept identical in all cases.
  • The registration mechanism for commands, operations, functions, overlay models and others is kept as is (this will require implementing various TypeIdResolvers to interface cleanly with Jackson). However, this registration mechanism will go away when we migrate out of Butterfly, which will require some changes to these resolvers.
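A registry-backed TypeIdResolver of the kind mentioned above could look roughly like this. Everything here (the `OperationResolver`, `Operation` and `MassEdit` names, and the `core/mass-edit` id) is invented for illustration, mimicking OperationRegistry-style registration of operations by name:

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeInfo;
import com.fasterxml.jackson.databind.DatabindContext;
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.annotation.JsonTypeIdResolver;
import com.fasterxml.jackson.databind.jsontype.impl.TypeIdResolverBase;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolve type ids ("core/mass-edit") through a registry map.
class OperationResolver extends TypeIdResolverBase {
    static final Map<String, Class<?>> REGISTRY = new HashMap<>();

    @Override
    public String idFromValue(Object value) {
        return REGISTRY.entrySet().stream()
                .filter(e -> e.getValue().equals(value.getClass()))
                .map(Map.Entry::getKey)
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("unregistered " + value.getClass()));
    }

    @Override
    public String idFromValueAndType(Object value, Class<?> suggestedType) {
        return idFromValue(value);
    }

    @Override
    public JavaType typeFromId(DatabindContext context, String id) {
        return context.constructType(REGISTRY.get(id));
    }

    @Override
    public JsonTypeInfo.Id getMechanism() {
        return JsonTypeInfo.Id.CUSTOM;
    }
}

@JsonTypeInfo(use = JsonTypeInfo.Id.CUSTOM, include = JsonTypeInfo.As.PROPERTY, property = "op")
@JsonTypeIdResolver(OperationResolver.class)
abstract class Operation { }

class MassEdit extends Operation {
    @JsonProperty("columnName")
    public String columnName = "species";
}

public class RegistryDemo {
    public static void main(String[] args) throws Exception {
        OperationResolver.REGISTRY.put("core/mass-edit", MassEdit.class);
        ObjectMapper mapper = new ObjectMapper();

        String json = mapper.writeValueAsString(new MassEdit());
        System.out.println(json);

        // Deserializing through the abstract base dispatches via the resolver.
        Operation op = mapper.readValue(json, Operation.class);
        System.out.println(op.getClass().getSimpleName());   // MassEdit
    }
}
```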

For an extension/fork maintainer, here is what will happen:

  • Compilation breaks because the Jsonizable interface does not exist anymore.
  • Every class that interacts with OR needs to be rewritten to use Jackson for (de)serialization. This mostly applies to classes that are registered (such as operations or overlay models).

It would be useful to provide a migration guide to the extension maintainers, with example transformations of classes to explain the migration process.
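Such a migration guide entry might show a before/after pair like this one. The `MyOperation` class and its fields are hypothetical; the "before" shape reflects the org.json `JSONWriter` style used by `Jsonizable.write`:

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

// Before (org.json, hand-written serialization):
//
//   public void write(JSONWriter writer, Properties options) throws JSONException {
//       writer.object();
//       writer.key("op"); writer.value("my-extension/my-operation");
//       writer.key("description"); writer.value(getDescription());
//       writer.endObject();
//   }
//
// After (Jackson, declarative annotations):
class MyOperation {
    @JsonProperty("op")
    public String op = "my-extension/my-operation";

    @JsonProperty("description")
    public String description = "Do something to the project";
}

public class MigrationExample {
    public static void main(String[] args) throws Exception {
        // Prints a JSON object containing the "op" and "description" properties.
        System.out.println(new ObjectMapper().writeValueAsString(new MyOperation()));
    }
}
```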

@wetneb You came to similar conclusions to those I reached when I first looked at Jackson: that most of our classes should not need any custom code for de/serialization, since annotations should be sufficient, and that we can ensure the serialization format is kept identical.

Good Job. I like the approach.

And agree about a migration guide for developers and extension authors.

Quick update on this: pull request #1755 contains all serialization changes, and deserialization is ongoing work in the same PR.

In many cases it is possible to use Jackson annotations to make the (de)serialization implicit with Jackson, which is cleaner, but in others the entanglement with business logic is just too deep to go down this route - so for things like importers I am just translating from org.json classes to Jackson classes (JSONObject -> ObjectNode and others). The end result is not as clean but it is faster and less error-prone. If we want to do more refactoring in this area it will still be possible later (and not harder than before).
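The JSONObject -> ObjectNode translation mentioned here might look like this in importer-style option handling (the option names are invented for illustration):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hypothetical importer options, translated from org.json to Jackson's tree model.
public class ImporterOptions {
    public static void main(String[] args) throws Exception {
        String options = "{\"separator\":\",\",\"limit\":100,\"trim\":{\"cells\":true}}";
        JsonNode node = new ObjectMapper().readTree(options);

        // org.json: options.getString("separator")  ->  path(...).asText(default)
        System.out.println(node.path("separator").asText("\t"));              // ,
        // org.json: options.optInt("limit", -1)  ->  path() never throws, unlike get()
        System.out.println(node.path("limit").asInt(-1));                     // 100
        // Nested access without intermediate null checks:
        System.out.println(node.path("trim").path("cells").asBoolean(false)); // true
    }
}
```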

Let's continue to make small incremental changes, as you are doing, and not get bogged down, and as you say we can do more refactoring later. After all, with all the work you are putting in graciously, we want some Christmas presents this year, not next ;)

Fixed by #1755.

@wetneb hey, on this paragraph in the migration guide, can you add a link to one of our tests as an example, so that extension authors are shown a good way to frame this kind of testing? I'm worried that some might not fully know:

Before undertaking the migration, we recommend that you write some tests which serialize and deserialize your objects. This will help you make sure that the JSON format is preserved during the migration.

@thadguidry makes sense, done.
