Pandoc: Split off core package

Created on 26 Mar 2020 · 30 Comments · Source: jgm/pandoc

I'm wondering whether it would make sense to split off parts of pandoc into a separate pandoc-core package. This would make it easier to move other parts into separate packages as well.

My motivation here is the Lua system. It is growing quite large, but, with the exception of the pandoc.read function, is built only on a small part of pandoc the library. The pandoc core (i.e., T.P.Class etc) as well as the Lua system are relatively stable, so the overhead of having additional packages to maintain seems acceptable.

In a similar vein: while writing jira-wiki-markup, I would have liked to have a pandoc-parsing library. Depending on such a library would make it easy to ensure that a library uses the same parser as pandoc. It could include some of the fixes and convenience functions available in Text.Pandoc.Parsing.

All 30 comments

I'm open to exploring this, but what exactly would you conceive as being the core modules?

I'd consider everything required to define PandocMonad as "core", so the modules

  • BCP47
  • Class
  • Data
  • Error
  • Logging
  • MediaBag
  • MIME
  • Process
  • Translations
  • UTF8

plus the function uriPathToPath from T.P.Shared.

Additionally maybe Emoji, UUID, and XML, although those would increase the dependency footprint.

I'm still not sure I understand the motivation. This would allow creation of a package pandoc-lua with the Lua system. This package would depend on pandoc-core. But what would depend on this package, besides pandoc itself? If the answer is nothing, then I'm not sure it's worth the hassle of splitting.

A main motivation for me is compile time: e.g., switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules. Finding a way to reduce compile times would remove a huge bottleneck from my workflow. Splitting off smaller modules seems like a good option to achieve this, and should also reduce the frequency with which I'd have to switch branches.

switching branches to fix a bug while working on Lua frequently causes recompilation of all pandoc modules

One way to deal with this kind of thing is to clone the branch in a separate directory.

Along these lines, git worktree is quite useful for this too, because it shares the object store with the main repo and so is fast and easy on the file system.

Also (and this is a bit of advanced foo) it is possible to stage and commit patches against branches that are not currently checked out at all. If you see little fixups that need committing somewhere other than the branch you are on, there are ways to make the changes but commit them to a different branch. A poor man's way to do this is to just have fun with stashes, but there are also git tools to actually patch branches without checking them out.

Not directly related to the above, but also very useful for keeping things like rebases from causing rebuilds: git revise is great for editing earlier commits without touching the file system, and hence without triggering rebuilds.

There is also ghcid, which I found very convenient and easy to use in other projects, but so far I haven't been able to really use it with pandoc. This is mostly due to the size of the library and tests. E.g., I haven't yet figured out how to restrict the number of tests to run.

Thanks for the hints @alerque. I grew lazy and usually just use Emacs with magit for most git tasks, but I'll check out the things you mentioned.

See https://www.reddit.com/r/haskell/comments/fz3s2y/hakyll_status/
which notes that pandoc takes a lot of memory to compile. It's possible that splitting pandoc would help with this. On the other hand, this would make things less convenient for developers in many cases.

I guess part of the idea here would be to split off the lua system into a separate package, depending on pandoc-core (or whatever it is called)?

I'm warming to this proposal. I'm wondering whether pandoc-core is the right name, though. One might expect that to include things like Shared and Parsing -- things you need to write a reader or writer. Maybe everything except the readers and writers themselves, App, PDF, and SelfContained?

Maybe

pandoc

(Prelude)
Text.Pandoc
Text.Pandoc.App
Text.Pandoc.App.CommandLineOptions
Text.Pandoc.App.FormatHeuristics
Text.Pandoc.App.Opt
Text.Pandoc.App.OutputSettings
Text.Pandoc.Highlighting
Text.Pandoc.PDF
Text.Pandoc.RoffChar
Text.Pandoc.Readers
Text.Pandoc.Readers.HTML
Text.Pandoc.Readers.LaTeX
Text.Pandoc.Readers.LaTeX.Types
Text.Pandoc.Readers.Markdown
Text.Pandoc.Readers.CommonMark
Text.Pandoc.Readers.Creole
Text.Pandoc.Readers.MediaWiki
Text.Pandoc.Readers.Vimwiki
Text.Pandoc.Readers.RST
Text.Pandoc.Readers.Org
Text.Pandoc.Readers.DocBook
Text.Pandoc.Readers.JATS
Text.Pandoc.Readers.Jira
Text.Pandoc.Readers.OPML
Text.Pandoc.Readers.Textile
Text.Pandoc.Readers.Native
Text.Pandoc.Readers.Haddock
Text.Pandoc.Readers.TWiki
Text.Pandoc.Readers.TikiWiki
Text.Pandoc.Readers.Txt2Tags
Text.Pandoc.Readers.Docx
Text.Pandoc.Readers.Odt
Text.Pandoc.Readers.EPUB
Text.Pandoc.Readers.Muse
Text.Pandoc.Readers.Man
Text.Pandoc.Readers.FB2
Text.Pandoc.Readers.DokuWiki
Text.Pandoc.Readers.Ipynb
Text.Pandoc.Readers.CSV
Text.Pandoc.Readers.Docx.Lists
Text.Pandoc.Readers.Docx.Combine
Text.Pandoc.Readers.Docx.Parse
Text.Pandoc.Readers.Docx.Parse.Styles
Text.Pandoc.Readers.Docx.Util
Text.Pandoc.Readers.Docx.Fields
Text.Pandoc.Readers.LaTeX.Parsing
Text.Pandoc.Readers.LaTeX.Lang
Text.Pandoc.Readers.Odt.Base
Text.Pandoc.Readers.Odt.Namespaces
Text.Pandoc.Readers.Odt.StyleReader
Text.Pandoc.Readers.Odt.ContentReader
Text.Pandoc.Readers.Odt.Generic.Fallible
Text.Pandoc.Readers.Odt.Generic.SetMap
Text.Pandoc.Readers.Odt.Generic.Utils
Text.Pandoc.Readers.Odt.Generic.Namespaces
Text.Pandoc.Readers.Odt.Generic.XMLConverter
Text.Pandoc.Readers.Odt.Arrows.State
Text.Pandoc.Readers.Odt.Arrows.Utils
Text.Pandoc.Readers.Org.BlockStarts
Text.Pandoc.Readers.Org.Blocks
Text.Pandoc.Readers.Org.DocumentTree
Text.Pandoc.Readers.Org.ExportSettings
Text.Pandoc.Readers.Org.Inlines
Text.Pandoc.Readers.Org.Meta
Text.Pandoc.Readers.Org.ParserState
Text.Pandoc.Readers.Org.Parsing
Text.Pandoc.Readers.Org.Shared
Text.Pandoc.Readers.Metadata
Text.Pandoc.Readers.Roff
Text.Pandoc.Writers.Docx.StyleMap
Text.Pandoc.Writers.Roff
Text.Pandoc.Writers.Powerpoint.Presentation
Text.Pandoc.Writers.Powerpoint.Output
Text.Pandoc.Writers
Text.Pandoc.Writers.Native
Text.Pandoc.Writers.Docbook
Text.Pandoc.Writers.JATS
Text.Pandoc.Writers.OPML
Text.Pandoc.Writers.HTML
Text.Pandoc.Writers.Ipynb
Text.Pandoc.Writers.ICML
Text.Pandoc.Writers.Jira
Text.Pandoc.Writers.LaTeX
Text.Pandoc.Writers.ConTeXt
Text.Pandoc.Writers.OpenDocument
Text.Pandoc.Writers.Texinfo
Text.Pandoc.Writers.Man
Text.Pandoc.Writers.Ms
Text.Pandoc.Writers.Markdown
Text.Pandoc.Writers.CommonMark
Text.Pandoc.Writers.Haddock
Text.Pandoc.Writers.RST
Text.Pandoc.Writers.Org
Text.Pandoc.Writers.AsciiDoc
Text.Pandoc.Writers.Custom
Text.Pandoc.Writers.Textile
Text.Pandoc.Writers.MediaWiki
Text.Pandoc.Writers.DokuWiki
Text.Pandoc.Writers.XWiki
Text.Pandoc.Writers.ZimWiki
Text.Pandoc.Writers.RTF
Text.Pandoc.Writers.ODT
Text.Pandoc.Writers.Docx
Text.Pandoc.Writers.Powerpoint
Text.Pandoc.Writers.EPUB
Text.Pandoc.Writers.FB2
Text.Pandoc.Writers.TEI
Text.Pandoc.Writers.Muse
Text.Pandoc.Writers.OOXML

pandoc-core

(Prelude)
Text.Pandoc.Options
Text.Pandoc.Extensions
Text.Pandoc.Shared
Text.Pandoc.MediaBag
Text.Pandoc.Error
Text.Pandoc.Filter
Text.Pandoc.UTF8
Text.Pandoc.Templates
Text.Pandoc.XML
Text.Pandoc.SelfContained
Text.Pandoc.Logging
Text.Pandoc.Process
Text.Pandoc.MIME
Text.Pandoc.Parsing
Text.Pandoc.Asciify
Text.Pandoc.Emoji
Text.Pandoc.ImageSize
Text.Pandoc.BCP47
Text.Pandoc.Class
Text.Pandoc.Class.CommonState
Text.Pandoc.Class.PandocMonad
Text.Pandoc.Class.PandocIO
Text.Pandoc.Class.PandocPure
Text.Pandoc.Filter.JSON
Text.Pandoc.Filter.Lua
Text.Pandoc.Filter.Path
Text.Pandoc.CSS
Text.Pandoc.CSV
Text.Pandoc.UUID
Text.Pandoc.Translations
Text.Pandoc.Slides
Text.Pandoc.Image
Text.Pandoc.Writers.Math
Text.Pandoc.Writers.Shared

pandoc-lua

(Prelude)
Text.Pandoc.Lua
Text.Pandoc.Lua.Filter
Text.Pandoc.Lua.Global
Text.Pandoc.Lua.Init
Text.Pandoc.Lua.Marshaling
Text.Pandoc.Lua.Marshaling.AST
Text.Pandoc.Lua.Marshaling.AnyValue
Text.Pandoc.Lua.Marshaling.CommonState
Text.Pandoc.Lua.Marshaling.Context
Text.Pandoc.Lua.Marshaling.List
Text.Pandoc.Lua.Marshaling.MediaBag
Text.Pandoc.Lua.Marshaling.ReaderOptions
Text.Pandoc.Lua.Marshaling.Version
Text.Pandoc.Lua.Module.MediaBag
Text.Pandoc.Lua.Module.Pandoc
Text.Pandoc.Lua.Module.System
Text.Pandoc.Lua.Module.Types
Text.Pandoc.Lua.Module.Utils
Text.Pandoc.Lua.Packages
Text.Pandoc.Lua.Util
Text.Pandoc.Lua.Walk

I'd like to get the table changes merged first, though, before messing with this.

One complication: PandocMonad depends on Text.Pandoc.Data (dataFiles) when embed_data_files is turned on. That means that Data, and all the data files, would have to go in core. This seems conceptually wrong to me. The templates, for example, naturally go with pandoc, not pandoc-core. And some of the data files are things like the pandoc manual itself. I don't see a very clean solution to this.

Actually there is a clean solution. We could store a field for dataFiles in the CommonState of PandocMonad. (A bit tricky though, because this means that anyone using pandoc as a library will have to remember to set this field in CommonState before running readers/writers....)

This sounds really nice. We could keep the new packages in the same repo as the main app in the beginning, which should minimize friction (and preserve the git history).

Remaining problems: there probably needs to be a mechanism to decouple Text.Pandoc.Filter.Lua from T.P.Lua, or that module cannot be in pandoc-core. Also, the Lua module must be changed such that functions like getReaders can be injected, or we'd run into a dependency loop.
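To make the injection idea concrete, here is a minimal sketch of how a Lua layer could receive its reader lookup from the outside instead of importing the reader modules itself. It is a toy model under stated assumptions, not pandoc's actual code: `Reader`, `LuaEnv`, `envGetReader`, `luaRead`, and `defaultLuaEnv` are hypothetical names invented for illustration, and real pandoc readers run in a PandocMonad rather than being plain functions.

```haskell
import Data.Char (toUpper)

-- | Hypothetical stand-in for a reader: turns input text into a result.
newtype Reader = Reader { runReader :: String -> String }

-- | Everything the Lua layer needs from the full pandoc package.
-- The Lua package would only know this record, not the reader modules.
newtype LuaEnv = LuaEnv
  { envGetReader :: String -> Maybe Reader
  }

-- | Would live in a pandoc-lua package: a pandoc.read-like function
-- that finds its reader through the injected environment.
luaRead :: LuaEnv -> String -> String -> Either String String
luaRead env format input =
  case envGetReader env format of
    Nothing     -> Left ("unknown reader: " ++ format)
    Just reader -> Right (runReader reader input)

-- | Would live in the top-level pandoc package, which can see all
-- readers and hands the lookup function down to the Lua layer.
defaultLuaEnv :: LuaEnv
defaultLuaEnv = LuaEnv { envGetReader = \name -> lookup name readers }
  where
    -- Toy readers standing in for the real format readers.
    readers =
      [ ("upper", Reader (map toUpper))
      , ("plain", Reader id)
      ]

main :: IO ()
main = do
  print (luaRead defaultLuaEnv "upper" "abc")
  print (luaRead defaultLuaEnv "docx" "abc")
```

With this shape the dependency arrow points only one way: the package defining `luaRead` never imports the readers, so the loop described above would not arise.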

Can you look into those remaining issues to see if you can find a solution? I don't want to mess with these changes if it's not going to work in the end. Multiple packages in the same repo is the way to go, I think, now that the tooling supports this well -- we might even think about bringing in pandoc-types eventually.

Btw, it wouldn't be disastrous if Text.Pandoc.Filter had to go in pandoc rather than pandoc-core, because of the lua dependency. I'm more worried about potential circular dependencies in the lua stuff. E.g. I notice that Lua.Module.Utils imports T.P.Filter.JSON. I guess we could have T.P.Filter.JSON in core and the rest of the filter stuff in pandoc, though.

Yes, I'll look into it.

I guess it should be OK to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua code via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.

I fooled around a bit with the idea mentioned above for data files. I made T.P.Data an exported module, exporting initializeDataFiles, which initializes stDataFiles in common state with the baked in data. Problem is, you need to remember to run this every time you run a PandocMonad, and that's fragile. Maybe we'll need to provide wrappers for runIOEither and runIOorExplode in the pandoc package, which ensure that this initialization step is always done?
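The wrapper idea can be sketched as follows. This is a toy model under stated assumptions: only the names `initializeDataFiles` and `stDataFiles` come from the discussion, while `DataFiles`, `Action`, `emptyState`, `readDataFile`, `runCore`, `runWithData`, and `bakedInDataFiles` are hypothetical, and the `CommonState` here is a one-field stand-in rather than pandoc's real record.

```haskell
-- | Baked-in data files, keyed by path (toy representation).
type DataFiles = [(FilePath, String)]

-- | Simplified stand-in for pandoc's CommonState.
newtype CommonState = CommonState { stDataFiles :: DataFiles }

-- | A computation that reads the common state
-- (stand-in for an action in a PandocMonad).
type Action a = CommonState -> Either String a

-- | Core package: the default state knows about no data files.
emptyState :: CommonState
emptyState = CommonState { stDataFiles = [] }

-- | Core package: look up a data file in whatever the state carries.
readDataFile :: FilePath -> Action String
readDataFile path st =
  maybe (Left ("data file not found: " ++ path)) Right
        (lookup path (stDataFiles st))

-- | Core package: plain runner that does not initialize anything.
runCore :: Action a -> Either String a
runCore action = action emptyState

-- | pandoc package: the data that would be baked in at compile time.
bakedInDataFiles :: DataFiles
bakedInDataFiles = [("templates/default.html", "<html>$body$</html>")]

-- | pandoc package: fill the state with the baked-in data.
initializeDataFiles :: CommonState -> CommonState
initializeDataFiles st = st { stDataFiles = bakedInDataFiles }

-- | pandoc package: wrapper that always initializes before running,
-- so callers cannot forget the initialization step.
runWithData :: Action a -> Either String a
runWithData action = action (initializeDataFiles emptyState)

main :: IO ()
main = do
  print (runCore (readDataFile "templates/default.html"))
  print (runWithData (readDataFile "templates/default.html"))
```

The fragility shows up in `runCore`: it type-checks and runs, but every data file lookup fails. Exporting only a wrapper like `runWithData` from the top-level package is one way to keep library users from hitting that case.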

I guess it should be OK to use Template Haskell to remove pandoc.lua and pandoc.List.lua from the data files? Including the Lua code via quasiquotes seems like a clean and easy solution, and if I remember correctly, we already depend on TH and no longer support building without it.

Correct.

I just pushed an initialize-data-files branch which contains my idea for decoupling data files from pandoc-core. It's a bit awkward because you must not forget to call initializeDataFiles when you run a PandocMonad instance. But it seems to work.

I haven't been following this closely, so sorry if I'm missing something, but a few thoughts:

  • so the main motivation is reducing compile-time? either when working on the lua subsystem, or when compiling normal pandoc? (using package-level build cache, or...?)
  • a potential downside would be that making refactorings that require changes to several packages becomes a lot harder, like we're already seeing with pandoc-types? Or is the idea to keep it in one git repository? a monorepo?
  • naming: I think pandoc should be the unambiguous name for the whole combined codebase... (well, we already have pandoc and pandoc-types, but whatever). But maybe pandoc-app, or pandoc-readers-writers instead?
  • so the main motivation is reducing compile-time? either when working on the lua subsystem, or when compiling normal pandoc? (using package-level build cache, or...?)

At least for me, that's the primary motivation. I also like the idea of having additional clear delimitations in the code base; it would serve as a motivation to untangle the dependency graph (esp. with regard to T.P.Lua).

  • a potential downside would be that making refactorings that require changes to several packages becomes a lot harder, like we're already seeing with pandoc-types? Or is the idea to keep it in one git repository? a monorepo?

My understanding is that we are indeed aiming for a monorepo.

  • naming: I think pandoc should be the unambiguous name for the whole combined codebase... (well, we already have pandoc and pandoc-types, but whatever). But maybe pandoc-app, or pandoc-readers-writers instead?

This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.

Reducing the compile time for developers, and reducing the total memory required to compile pandoc (which is getting ever larger and has made it hard to build pandoc on some systems) are both motivations.

It's true that this would make development a bit more complicated, and that would have to be weighed heavily. I'm developing commonmark-hs this way, as four packages in one repository (also skylighting, skylighting-core), so I have some experience with it. It's not too bad, but you have to think about things like version numbers (if you follow the versioning policy for each package, then they will get out of sync, and this might lead to confusion; in skylighting we simply force them to be in sync but this isn't ideal).

I'm still not sure about the idea, in the end. I don't like the approach in my initialize-data-files branch and now I'm leaning towards thinking that maybe Text.Pandoc.Data and all the data files should, after all, go in core. This is ugly, because if you modify a writer and a template, for example, you'd have to modify two packages. But it's also ugly to require a special initializeDataFiles command every time you do runIO or runPure.

Yeah, if it's done as a monorepo, I think that could work... feels like as a developer, when you build master, you would just want to use the code that's currently on master as well for the other packages. But then you lose the ability to use the cache?

About how to split it, definitely the case of using pandoc the library vs the application should be part of that decision I think...

This is tangential, but if we were to split "pandoc-executable" from "pandoc-the-library", then we could add a cabal.project.freeze file to fix the dependencies of the executable to specific versions – that would come in handy when building docker images.

Nothing stops you from generating your own freeze files and using them for the docker images.

This proposal has more implications than I assumed. The potential benefits would likely not match the investments, so I'm closing this. It is probably a good idea to untangle Lua dependencies regardless.

Nothing stops you from generating your own freeze files and using them for the docker images.

That is true, I'll do that. Other projects are doing the same, Alpine for example. There might still be value in making it more likely for all binaries out there to use the same dependencies.

I wouldn't mind keeping this open; it still seems possibly worth doing, I just can't decide.
(Unless you have new insights not mentioned above.)

The only insight not mentioned is that I'll need to refactor HsLua as a prerequisite to untangling and improving T.P.Lua. Refactoring will probably take quite a while, and I didn't want to leave a stale issue hanging around. I'll happily bring the topic up again once I'm confident that it could be completed in a predictable time-frame.

Might as well leave this open, though, since it contains some useful notes on what would be required.
