Cylc-flow: unicode

Created on 7 Mar 2019  ·  13 Comments  ·  Source: cylc/cylc-flow

Follow on from #2966 to cover unicode support in Cylc.

What unicode should Cylc support and where?

  • [x] Suite registration #3274 (see also #2281)
  • [x] Task names (and graph strings) [done]
  • [x] Task outputs #3428
  • [x] Job scripts (native*)
  • [x] Event handlers (native*)
  • [x] XTriggers #3732 (see also #2987)
  • [x] Suite metadata (native*)

*native: Python 3 defaults to the OS "default encoding" for file opens, which is typically UTF-8; 3.7+ also has a UTF-8 mode flag (-X utf8) which overrides this. Consequently these came for free with the Python 3 upgrade.
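As a rough illustration of the footnote, the sketch below prints the encoding that open() falls back to when none is given and whether UTF-8 mode is active, then round-trips a non-ASCII string through a file with an explicit encoding. The helper name roundtrip, the file name and the sample text are assumptions for illustration, not Cylc code.

```python
import locale
import os
import sys
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "job-script")
        with open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        with open(path, encoding=encoding) as handle:
            return handle.read()


# Encoding used by open() when no encoding argument is given.
print(locale.getpreferredencoding(False))
# 1 if UTF-8 mode is on (python -X utf8 or PYTHONUTF8=1), otherwise 0.
print(sys.flags.utf8_mode)

sample = "echo 'ФШѸ'"
# True when the non-ASCII text survives the write/read cycle unchanged.
print(roundtrip(sample) == sample)
```

Passing encoding explicitly, as roundtrip does, removes the dependence on the OS default altogether.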

All 13 comments

suite names

@oliver-sanders - suite names would be a good one to nail down right away. They are currently completely unrestricted, as you've noted.

Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102
He's currently using / but hierarchical suite names are going to break that.

The minimal set of Unicode characters is probably alphanumeric plus dash, underscore, forward-slash:

'^[\w\-_/]+$'

Then @dwsutherland could use % (say) as a delimiter in the code.
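A minimal sketch of why restricting the character set helps: if names can only match the pattern quoted above, a character outside that set can never appear inside a name, so it is safe as a separator. The function names and the owner/suite layout of the ID are assumptions for illustration, not the code in the linked PR.

```python
import re

# Pattern quoted in the comment above. For str patterns in Python 3, \w
# covers Unicode letters and digits as well as "_".
SUITE_NAME = re.compile(r'^[\w\-_/]+$')
DELIMITER = '%'  # assumption: the "(say)" delimiter suggested above


def make_external_id(owner, suite):
    """Join owner and suite name into one ID (hypothetical helper)."""
    for part in (owner, suite):
        if not SUITE_NAME.match(part):
            raise ValueError(f'invalid name: {part!r}')
    return f'{owner}{DELIMITER}{suite}'


def split_external_id(external_id):
    """Recover (owner, suite) from an ID built by make_external_id."""
    owner, suite = external_id.split(DELIMITER)
    return owner, suite


external_id = make_external_id('alice', 'ocean/nightly-run')
print(external_id)
print(split_external_id(external_id))
```

Because "/" is allowed inside names here, hierarchical suite names pass through unchanged while the "%" split stays unambiguous.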

Is there any compelling reason to allow more than this in suite names?

The only other characters that we may want to support are ., + and @.

OK, I've quickly run some characters from different unicode tables against Python re.

The regex:

  • ^[\w_\.+@-]+$

Changes we might want to make:

  • Include =, %
  • Prohibit - and . from being the first character
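One possible way to express the second change is a negative lookahead in front of the character class. This is only a sketch under the assumption that the rest of the regex stays as proposed; the names NAME and is_valid are illustrative, and "=" and "%" are left out (adding them would just mean extending the class).

```python
import re

# Assumed variant of the regex above: the lookahead (?![-.]) rejects any
# name whose first character is "-" or ".".
NAME = re.compile(r'^(?![-.])[\w.+@-]+$')


def is_valid(name):
    """Return True if name matches the assumed pattern."""
    return NAME.match(name) is not None


for candidate in ('foo-bar.1', '_foo', '-foo', '.foo'):
    print(candidate, is_valid(candidate))
```

Rejecting a leading "-" avoids names that look like command-line options, and rejecting a leading "." avoids hidden files.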

Conclusions:

  • Text and number characters seem to be well matched.
  • Punctuation characters are not matched.
  • Emoji are not word characters.

I think we are good to go with this Regex?

Matching Chars

basic latin

abc
ABC
0123

explicitly supported special chars

_.+@-

latin supplement

àðØ

latin extended

ĐĵŌ

IPA extensions

ɐɶʍ

greek and coptic

ΘχϢ

ancient greek numbers

𐅅𐅉𐅌

cyrillic

ФШѸ

armenian

ԱՔփ

Non-Matching Chars

emoji


😅
🐼

special chars


!

/
\
~
:
|
>
<
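The check described above can be repeated with a few lines of Python. This sketch uses a subset of the samples listed in this comment; the names NAME, MATCHING, NON_MATCHING and classify are illustrative assumptions.

```python
import re

# The regex proposed above, unchanged.
NAME = re.compile(r'^[\w_\.+@-]+$')

# Samples taken from the lists above.
MATCHING = [
    'abc', 'ABC', '0123', '_.+@-',
    'àðØ', 'ĐĵŌ', 'ɐɶʍ', 'ΘχϢ', 'ФШѸ', 'ԱՔփ',
]
NON_MATCHING = ['😅', '🐼', '!', '/', '\\', '~', ':', '|', '>', '<']


def classify(samples):
    """Map each sample to True if the regex accepts it, else False."""
    return {sample: bool(NAME.match(sample)) for sample in samples}


# Both lines print True when the regex behaves as reported above.
print(all(classify(MATCHING).values()))
print(not any(classify(NON_MATCHING).values()))
```

For str patterns Python's \w is Unicode-aware by default; compiling with re.ASCII would restrict it to [a-zA-Z0-9_] and reject the non-Latin samples.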

Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102

How about the colon (:)? It is valid in Unix filenames, but obviously not a sensible choice.

The colon : is used in PATH-like variables, so no good for anything else.

Exactly!

See also #3117 where I investigated & documented task & (to some extent) suite names that work normally, which is now in the docs here & here.

I suggest we have the same limitations on suite names as on task names, as documented above, since they are largely in agreement with the comments above.

Note there is also a maximum length that suite names can safely be, given the OS restriction on file name length. It might sound a bit over the top to point this out, but I have seen some exceedingly long suite names across the MO. Can we validate on length too for extra care?

Can we validate on length too for extra care?

We can check the number of characters easily enough (we can even do it in a regex). Unfortunately, most file-system file-name limits are in bytes.

https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits

We could just slap on a 255 character limit anyway.
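To make the character/byte distinction concrete, here is a small sketch that measures a name both ways. It assumes names are stored as UTF-8 and that the file system allows 255 bytes per file name (as ext4 does, for example); the names MAX_NAME_BYTES, name_lengths and fits are illustrative.

```python
MAX_NAME_BYTES = 255  # assumption: a common per-file-name limit in bytes


def name_lengths(name, encoding='utf-8'):
    """Return (number of characters, number of encoded bytes)."""
    return len(name), len(name.encode(encoding))


def fits(name, limit=MAX_NAME_BYTES):
    """True if the encoded name is within the byte limit."""
    return name_lengths(name)[1] <= limit


print(name_lengths('abc'))   # 3 characters, 3 bytes
print(name_lengths('ФШѸ'))   # 3 characters, 6 bytes
print(fits('Ф' * 200))       # 200 characters but 400 bytes
```

A 255 character cap therefore guarantees a valid file name only for ASCII names; checking the encoded length is the stricter test.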

Could we use | as the external ID delimiter?

Note - The other reason why I was originally vying for / or . was to be consistent with CLI task/job args. But as the external ID has to contain suite and owner/user info, this issue has come about.

Could we use | as the external ID delimiter?

I suppose we could (but it looks annoyingly like pipe or OR to me). I think we can go with the possibly-temporary solution and make a final decision once allowed suite-name chars are defined (within a day or so, I expect).

All items addressed, closing.
