Newsboat: New format for urls file

Created on 17 Dec 2017 · 15 Comments · Source: newsboat/newsboat

As first discussed in https://github.com/akrennmair/newsbeuter/issues/158, moving urls file from plain text to a more sophisticated format like YAML will let us have per-feed settings (as requested in https://github.com/akrennmair/newsbeuter/issues/248, https://github.com/akrennmair/newsbeuter/issues/326, https://github.com/akrennmair/newsbeuter/issues/465, https://github.com/newsboat/newsboat/issues/77, and probably touched on in https://github.com/akrennmair/newsbeuter/issues/176). #77 is fairly recent, and there is also a related discussion on our mailing list, so I'm creating this tracking issue here now.

The first order of business is to decide on the format. ~~Requirements~~ Wishes are:

  • format is well-known, or at least exists already (frees us of burden of designing it; potentially some users will already be familiar with it; potentially some text editors will already have support for it);
  • there are parsers we can use;
  • as little syntax as possible (e.g. JSON has square brackets, curly brackets and commas—easy for programmers, hard for users).

All 15 comments

As discussed on IRC, I took a look around the world of configuration languages. There's... not much to choose from: INI and YAML are the only languages I found that fit our criteria, so I'm just going to break them down so we can decide.

I'm going to comment with the mindset that they will be used only for the urls file, meaning we need to have a feed identifier (url) and some associated data whose data format varies (strings, lists etc).

INI/TOML

Pros:

  • INI is very widely used and common in both Windows and Linux configuration land, so users should already be mostly familiar with it.
  • Dead simple syntax
  • Tons of different parser implementations to choose from

Cons:

  • (for ini) No proper specification, TOML is an extension with much better specification
  • No support for lists/dictionaries if we ever need them (But are supported by TOML)
  • No support for nested values (TOML does support them but the syntax is quite cumbersome)

Sample urls file (at least how I imagine it looking)

# feed without any tags
[https://github.com/newsboat/newsboat/releases.atom]
# feed with tags and name
[https://newsboat.org/news.atom]
name="newsboat.org"
tags="newsboat,news"
#- or, with TOML's lists
tags=["software", "updates"]
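To make the INI variant concrete, here is a minimal sketch that reads such a file with Python's standard `configparser`. It is only an illustration, not a proposal for Newsboat's implementation: the `load_feeds` helper and the `URLS_INI` sample are made up for this example, and the values are left unquoted because `configparser` would keep quotation marks as part of the value.

```python
import configparser

# Hypothetical sample in plain INI: each section name is a feed URL.
URLS_INI = """\
# feed without any tags
[https://github.com/newsboat/newsboat/releases.atom]
# feed with tags and name
[https://newsboat.org/news.atom]
name = newsboat.org
tags = newsboat,news
"""


def load_feeds(text):
    """Return feeds in file order as dicts with url, name and tags."""
    # Interpolation is disabled so that '%' in values is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    feeds = []
    for url in parser.sections():  # sections() keeps the order of the file
        section = parser[url]
        raw_tags = section.get("tags", "")
        tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
        feeds.append({"url": url, "name": section.get("name"), "tags": tags})
    return feeds


for feed in load_feeds(URLS_INI):
    print(feed)
```

Note that lists have to be emulated by splitting a string on commas, which is the "no support for lists" drawback listed above.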

YAML

Pros:

  • Widely used and well known
  • Simple syntax
  • Supports lists, dictionaries and pretty much any data type we'd want out of the box
  • Supports nested lists/dicts
  • A lot of advanced features that can come in really handy, e.g. anchors/references that could allow you to easily re-use the same tags/configurations between many feeds.

Cons:

  • Can get fussy with indentation

Sample urls file

# feed without any tags
- url: https://github.com/newsboat/newsboat/releases.atom
# feed with tags and name
- url: https://newsboat.org/news.atom
  name: "newsboat.org"
  tags: "newsboat,news"
#- or
  tags:
      - newsboat
      - news

Rejected languages:

  • UCI: Semicolons and brackets break the 'minimal syntax' rule, plus if a feed is a section I don't know how you'd define a feed without any extra settings
  • CSON: Promising, but is CoffeeScript-specific, no C++ parser available.
  • JSON: As mentioned in the original post, too much syntax and it doesn't support comments

In my opinion YAML is the best bet for customization and future expandability. INI/TOML is still a good choice and a safe bet, but it's not nearly as powerful.

As much as I normally prefer toml, I believe yaml is better for this.

Thank you for the analysis! I'll go read the YAML spec to see what pitfalls it has for the non-technical user. The only such pitfall I can see right now is "no tabs for indentation" rule—we'll need to make sure that our parser reacts to this with a friendly message.
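As a rough illustration of the "friendly message" idea, a pre-check could scan the file for tabs in the indentation before handing it to a YAML parser. This is only a sketch: the function names and the message wording are invented here, and a real implementation would more likely hook into the parser's own error reporting.

```python
def find_tab_indentation(text):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    offenders = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is everything before the first non-blank character.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            offenders.append(lineno)
    return offenders


def friendly_tab_errors(text, filename="urls"):
    """Turn offending line numbers into messages a non-technical user can act on."""
    return [
        f"{filename}:{lineno}: this line is indented with a tab; "
        "please use spaces instead"
        for lineno in find_tab_indentation(text)
    ]


# Hypothetical file content: the second line is indented with a tab.
SAMPLE = "- url: https://example.com/feed.atom\n\tname: Example\n  tags: [a]\n"
for message in friendly_tab_errors(SAMPLE):
    print(message)
```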

Just for the record. It probably won't affect us very much as the feed config doesn't really benefit from them, but multi-line strings in YAML are a bit… let's say: scary (you may also want to read through those comments). Nevertheless, YAML still seems like the best choice suiting our needs.

Until now I've never heard of CSON and it indeed looks even better than YAML.

Hey, why not use XML? Newsboat already has a parser for that! :-> No, please don't.

I finally read the Preview chapter of the spec.

After reading about structures in YAML, I changed my focus from "what pitfalls await the user" to "how many potentially useless things YAML contains" (useless for Newsboat's urls file). I think the answer is "quite a lot": the aforementioned structures, complex mapping keys, myriad ways to escape/multiline strings, explicit typing. The only remaining features are structures, scalars, and tags.

TOML has everything but the tags, but we can emulate them in code by adding "settings groups" or something:

[group.frequent]
reload-time = 1
reload-retries = 1

[group.broken_ssl]
ssl-verifypeer = false

['https://example.com/feed.atom']
groups = [ 'frequent', 'broken_ssl' ]
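A possible way to resolve such "settings groups" in code is sketched below. It works on the nested dictionary that a TOML parser would typically return for the sample above (written out by hand here, so no parser is needed). The `resolve_groups` name and the rule that a feed's own keys override group values are assumptions made for illustration.

```python
# What a TOML parser would typically return for the sample above.
PARSED = {
    "group": {
        "frequent": {"reload-time": 1, "reload-retries": 1},
        "broken_ssl": {"ssl-verifypeer": False},
    },
    "https://example.com/feed.atom": {"groups": ["frequent", "broken_ssl"]},
}


def resolve_groups(config):
    """Return (url, settings) pairs with every referenced group merged in."""
    groups = config.get("group", {})
    feeds = []
    for url, table in config.items():
        if url == "group":
            continue  # the group definitions themselves are not feeds
        settings = {}
        for name in table.get("groups", []):
            if name not in groups:
                raise KeyError(f"{url}: unknown settings group {name!r}")
            settings.update(groups[name])  # later groups win over earlier ones
        # Assumption: settings written directly under the feed override groups.
        settings.update({k: v for k, v in table.items() if k != "groups"})
        feeds.append((url, settings))
    return feeds


print(resolve_groups(PARSED))
```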

After that, I read the TOML spec with the same focus. It's better; potentially useless features are: special handling for dates and time, arrays of tables. There are only four types of strings, which is acceptable I guess.

One inconvenience is that tables use keys for names, and bare keys can only contain ASCII alphanumerics, underscores, and dashes. This means @tsipinakis's example would look slightly different:

# feed without any tags
['''https://github.com/newsboat/newsboat/releases.atom''']
# feed with tags and name
['''https://newsboat.org/news.atom''']
name="newsboat.org"
tags="newsboat,news"
#- or, with TOML's lists
tags=["software", "updates"]

One set of quotes might be enough (i.e. 'https://newsboat.org/news.atom'), but not always; since URLs might contain single quotes (they don't have to be percent-encoded IIRC), it's better to always use triple quotes. Unfortunately for us, this is part of the format, so the parser would hide this detail from us, and we wouldn't be able to enforce it. In other words, that's a pitfall for the user.

I also didn't see any requirement on the order of the tables, i.e. it's not guaranteed that a parser will return URLs in the same order as they are in the file. This is a problem because we depend on that order (by default the feedlist shows feeds in the same order as they're in the file).

I invite everyone to think hard about things that we might add to urls file and which can't be (conveniently) expressed in TOML.

@gregf, @der-lyse, I'd be glad to see expanded versions of your comments to see what makes you prefer YAML in this case.

To clarify: "useless" features worry me because user might inadvertently trigger them and either get a confusing error message (most likely), or, worse yet, get a valid file that doesn't do what they think it does (highly unlikely, and I struggle to come up with an example such that it would be both valid YAML and acceptable from Newsboat's standpoint). Less is more.

To start off with, the advantages and disadvantages in the YAML vs. TOML debate are very minor in my opinion. But here are my thoughts (later points are just brainstorming, probably not helpful at all, but who knows):

  1. TOML indeed does not guarantee in the spec that the tables are ordered. It depends on the actual parser and some of them actually do preserve the order of definition. So this problem could be considered taken care of, I reckon.
  2. I personally dislike being forced to quote strings in configs; it just feels inelegant to me. Of course, it's nothing major, just a very subjective taste. YAML and good old INI, on the other hand, would let me save the quotation marks in most situations.
  3. Most of my feeds won't have any additional settings, so declaring tons of empty tables seems a bit odd. In contrast to that, YAML arrays read a tiny bit more naturally. Indeed, with the url key it's even longer than TOML, so I'm obviously biased as I'm more used to YAML than to TOML (in fact only used INI so far).
  4. Speaking of that (watch out, now it's getting sketchy), we could just allow an abbreviation for that by allowing strings instead of objects:

    # feed without any tags, just a string and no object
    - https://github.com/newsboat/newsboat/releases.atom
    
    # feed with tags and name
    - url: https://newsboat.org/news.atom
      name: newsboat.org
      tags: [software, updates] # I definitely prefer using arrays here
    

    But of course this makes editing the file harder if one decides to finally add some settings. :-(

  5. Another idea: having the URLs as keys, but this makes things even worse, because as far as I know the order of keys in YAML is not guaranteed either and for empty settings an explicit object is required, making it visually unpleasing.
  6. Maybe go with a custom DSL? Yes, this defeats the requirement of being already established, so no syntax highlighting for editors out of the box.
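The abbreviation from point 4 is cheap to support once the file is parsed. The sketch below assumes a YAML parser has already turned the file into Python lists, dicts and strings (the `PARSED` value is written by hand to mirror the example in point 4), and the `normalize_feed` helper is hypothetical. It also accepts tags given as a comma-separated string, as in the earlier YAML sample.

```python
# Hand-written stand-in for what a YAML parser would return for point 4's example.
PARSED = [
    "https://github.com/newsboat/newsboat/releases.atom",
    {
        "url": "https://newsboat.org/news.atom",
        "name": "newsboat.org",
        "tags": ["software", "updates"],
    },
]


def normalize_feed(entry):
    """Accept either a bare URL string or a mapping and return a uniform dict."""
    if isinstance(entry, str):
        return {"url": entry, "name": None, "tags": []}
    if isinstance(entry, dict) and "url" in entry:
        tags = entry.get("tags", [])
        if isinstance(tags, str):
            # Also tolerate the "newsboat,news" string form of tags.
            tags = [t.strip() for t in tags.split(",") if t.strip()]
        return {"url": entry["url"], "name": entry.get("name"), "tags": list(tags)}
    raise ValueError(f"unsupported feed entry: {entry!r}")


for feed in map(normalize_feed, PARSED):
    print(feed)
```

The rest of the program then only ever sees the uniform dictionary form, whichever spelling the user chose.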

Yeah, a lot of quoting is indeed bad. Let's not do that. I also agree that most feeds will have no custom settings.

I think if we adopt TOML or YAML, we'll have to provide an interface to edit them :)

Thinking of a custom format, the first thing that comes to mind is a mix of current format plus our config format:

https://example.com/atom.xml
https://newsboat.org/news.atom "awesome software" "buggy software"
- max-items 40
https://github.com/newsboat/newsboat/releases.atom
    - use-proxy no
   - download-full-page yes

In this new format, a line either:

  • starts with an HTTP scheme, and declares a feed; or
  • starts with optional whitespace followed by a mandatory dash ('-'); the rest of the string is interpreted as a line in Newsboat's config would be, but it only affects the last declared feed. It's illegal to have such a line if no feed was previously declared. Whitespace is any sequence of space and/or tab characters.
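The two rules above are simple enough to sketch as a parser. The following is a rough illustration, not Newsboat code: `parse_urls` and `UrlsSyntaxError` are invented names, tags and option arguments are split with `shlex` as an approximation of Newsboat's quoting rules, and only `http://` and `https://` lines are treated as feeds.

```python
import shlex

SAMPLE = '''\
https://example.com/atom.xml
https://newsboat.org/news.atom "awesome software" "buggy software"
- max-items 40
https://github.com/newsboat/newsboat/releases.atom
    - use-proxy no
   - download-full-page yes
'''


class UrlsSyntaxError(ValueError):
    """Raised for lines that fit neither of the two rules."""


def parse_urls(text):
    feeds = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip(" \t")
        if not line:
            continue  # assumption: blank lines are ignored
        if raw.startswith(("http://", "https://")):
            # Rule 1: the line declares a feed, optionally followed by tags.
            url, *tags = shlex.split(raw)
            feeds.append({"url": url, "tags": tags, "settings": {}})
        elif line.startswith("-"):
            # Rule 2: a setting that applies to the last declared feed.
            if not feeds:
                raise UrlsSyntaxError(f"line {lineno}: setting before any feed")
            parts = shlex.split(line[1:])
            if not parts:
                raise UrlsSyntaxError(f"line {lineno}: empty setting")
            key, *args = parts
            feeds[-1]["settings"][key] = args
        else:
            raise UrlsSyntaxError(f"line {lineno}: not a feed or a setting")
    return feeds


for feed in parse_urls(SAMPLE):
    print(feed["url"], feed["tags"], feed["settings"])
```

Because a plain list of URLs never hits the second rule, an existing urls file parses unchanged, which is the backwards compatibility argued for below.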

Pros:

  • Newsboat stays backwards compatible with existing urls files;
  • upgrade to new format is seamless for the user—simply start adding options under your URLs.

Cons:

  • doesn't support "do not repeat yourself" principle—it would be impossible to factor out common sets of settings into some "blocks" that can be "linked to" under each URL.

This is just a proposal. It feels like a half-measure, and I'm not sure if it's enough or if I'm just delaying the inevitable.

I like that idea a lot, this is absolutely great! It marries both the _urls_ and _config_ in a mostly seamless manner (just the dashes are truly new), I'm really amazed. :-)

Speaking of dashes, maybe just say that the per-feed config must be indented by at least one whitespace character, so no leading dashes at all. Indentation is a common thing to do when organizing things hierarchically; even non-technical people do this, I reckon. To go even further, we should also be able to recognize settings without any indentation or dash prefixes, because they look quite different from regular URLs, query and exec feeds. However, some dedicated marking (indentation and/or dash) of per-feed settings certainly doesn't hurt. I'm not arguing against the dashes, I'm totally fine with them; it's just another idea to (maybe) further simplify the new urls file format.

Regarding reusable configuration blocks we could add them, too. E.g. if the line starts with @ (or whatever else) it's recognized as a named block definition which can be used anywhere else:

https://example.com/atom.xml
https://newsboat.org/news.atom "awesome software" "buggy software"
  - @news
  - max-items 40

https://github.com/newsboat/newsboat/releases.atom
  - @news

@news
  - reset-unread-on-update yes 
  - reload-time 60

On a side note: it seems that special care has to be taken with some of the settings, like reset-unread-on-update which takes a list of URLs at the moment but in the _urls_ file should probably take a boolean instead.

And as the example above illustrates, I'd like it if one could use the reusable block even before it was defined. Defining it a second time would be an error.
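Forward references and the duplicate-definition error both fall out naturally from a two-pass approach: collect everything first, expand references afterwards. The sketch below is only one way to do it. The function name is made up, nested references inside a block are not handled, any line that is neither a setting nor a block header is treated as a feed, and block settings are applied at the position of the reference, so a later line under the same feed overrides them.

```python
import shlex

SAMPLE = '''\
https://example.com/atom.xml
https://newsboat.org/news.atom "awesome software" "buggy software"
  - @news
  - max-items 40

https://github.com/newsboat/newsboat/releases.atom
  - @news

@news
  - reset-unread-on-update yes
  - reload-time 60
'''


def parse_with_blocks(text):
    feeds, blocks = [], {}
    current = None  # the list that receives "- ..." lines
    # Pass 1: collect feeds and block definitions without resolving anything.
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip(" \t")
        if not line:
            continue
        if line.startswith("-"):
            parts = shlex.split(line[1:])
            if current is None or not parts:
                raise ValueError(f"line {lineno}: misplaced or empty setting")
            current.append(parts)
        elif line.startswith("@"):
            name = line[1:]
            if name in blocks:
                raise ValueError(f"line {lineno}: block @{name} defined twice")
            current = blocks[name] = []
        else:
            url, *tags = shlex.split(line)
            current = []
            feeds.append({"url": url, "tags": tags, "items": current})
    # Pass 2: expand "@name" references now that every block is known.
    for feed in feeds:
        settings = {}
        for item in feed.pop("items"):
            if item[0].startswith("@"):
                name = item[0][1:]
                if name not in blocks:
                    raise ValueError(f"{feed['url']}: unknown block @{name}")
                item_list = blocks[name]
            else:
                item_list = [item]
            for key, *args in item_list:
                settings[key] = args
        feed["settings"] = settings
    return feeds


for feed in parse_with_blocks(SAMPLE):
    print(feed["url"], feed["settings"])
```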

We may also consider introducing a tags setting (which is valid in the _urls_ file only), so that even the tags could be reused quite easily. But I don't use them, so I'm not sure if that's actually something useful or even worth the hassle with per-feed only.

If a custom format is the way forward, and it'll require indentation, I suggest limiting it to spaces or tabs. I hate to bring up that old argument but allowing both can create drama.

@der-lyse, good points, I'm still thinking :)

@sungo, what kind of drama do you mean? I understand how this leads to holy wars between programmers, but the urls file is not source code: it's used by a single person, and there is no collaboration on a config. As a result, every user can pick their own style, and no one will get mad because no one else uses that particular file.

@Minoru agreed, mostly. The issue has arisen for me mostly because of bad editor defaults that don't visually distinguish between hard tabs and spaces. I have ended up with personal drama because a file ended up with mixed spaces and tabs. Sure, that's my fault for accepting bad defaults but it caused an issue, regardless.

On the flip side of my own argument, YAML's fussiness around indentation, as @tsipinakis noted, is one of its worst qualities.

If the intention is to only ever allow a single level of indentation, mandating spaces over tabs over allowing mixed doesn't matter all that much. If there might be a possibility for multiple levels of indention, it might be something to consider.

> If the intention is to only ever allow a single level of indentation, mandating spaces over tabs over allowing mixed doesn't matter all that much. If there might be a possibility for multiple levels of indention, it might be something to consider.

Yeah. When I wrote the proposal, I had a single level in mind. My thinking was that more levels means more complex structures, and if we need that, we'd definitely be better off reusing some existing format. OTOH limiting ourselves to one level might prevent us from implementing something in the future.

I guess the next step is to find out where we could use more than one level of indentation. I'd go through the list of options we already have and see if any of them would benefit from being turned into arrays or dictionaries (which can be handily represented as a list with some indentation).

To clarify, I think you can preserve the order of a TOML dictionary with serde by deserializing into a Vec<(String, Settings)>. If you want to write your own parser, I would recommend looking at nom (https://github.com/Geal/nom). I can provide a Rust parser for the format you suggested if there's interest.

Would be cool if this new format would support per-feed proxy settings, that's the only thing I wish newsboat had.
