Julia: string interpolation, juxtaposition & multiline strings

Created on 14 Feb 2013 · 47 Comments · Source: JuliaLang/julia

Here's what I think we should do:

  • [x] Keep string interpolation. It's just too convenient and taking away popular features for language purity is obnoxious.
  • [x] Implement interpolation in the parser. Currently, it's implemented as a macro. This leads to problems like #455. Instead, we should implement string interpolation for normal strings in the parser, which will allow handling this just fine since it will know when a string is or isn't finished.
  • [x] Non-standard string literals are on their own. This means that non-standard string literals cannot actually fully emulate the behavior of standard string literals. They can use interp_parse to get close, but they won't be able to handle string literals in interpolated expressions [#455]. That's fine. You can either have full-blown parser-supported string interpolation _or_ you can have custom string literals, but not both at the same time. This is an acceptable compromise.
  • [ ] Allow string literal juxtaposition for concatenation. Merge this branch and make string literal juxtaposition a syntax for string concatenation. If you want to concatenate two variables as strings, you can do foo "" bar, which parses as string(foo,bar) rather than string(foo,"",bar). If you need to reference the string concatenation operation as a function, you use string. This is not really any different than having special syntax for e.g. indexing.
  • [x] Deprecate * for string concatenation. This was a cute experiment, but it's bad. The biggest problem is the Char*Char issue, which we should not deprecate, but rather just remove (and perhaps make Char a proper integer type again?).
  • [x] Parse multiline string literals as macro calls. When an unprefixed multiline string literal (i.e. triple-quoted string) is encountered, handle interpolation in the parser and emit it as a macro call (maybe @mstr?) for further processing. When a prefixed multiline string literal is encountered, no interpolation is done, but the string is passed off to an appropriate macro.

Regarding the last point, consider the following:

# note the indentation
  """
  Four score and $n years ago...
  """

Should be equivalent to:

@mstr("\n  Four score and ", n, " years ago...\n  ")

The @mstr macro can handle indentation stripping a la #70 by looking at the trailing whitespace of the last string literal macro argument and stripping that from the indentation of all the string literal arguments.
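The stripping rule described above can be sketched as an ordinary function. This is only an illustration, not the actual @mstr implementation: `strip_indent` is a hypothetical name, it assumes the indentation is whatever follows the final newline of the last string piece, and unlike a full implementation it leaves the leading newline in place.

```julia
# Hypothetical helper sketching the stripping rule; not the real @mstr code.
function strip_indent(parts...)
    strs = [p for p in parts if p isa AbstractString]
    isempty(strs) && return string(parts...)
    last_str = strs[end]
    # Indentation = whatever follows the final newline of the last string piece.
    idx = findlast(isequal('\n'), last_str)
    indent = idx === nothing ? "" : last_str[idx+1:end]
    # Ignore it unless it is pure whitespace.
    all(c -> c == ' ' || c == '\t', indent) || (indent = "")
    # Remove that indentation after every newline, in the string pieces only.
    cleaned = map(p -> p isa AbstractString ? replace(p, "\n" * indent => "\n") : p, parts)
    return string(cleaned...)
end

n = 87
stripped = strip_indent("\n  Four score and ", n, " years ago...\n  ")
```

Interpolated values such as `n` are passed through untouched, so a newline inside an interpolated string would not be re-indented.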

Prefixed triple-quoted strings should emit macro calls just like prefixed single-quoted strings do, although I'm not sure whether they should call a different macro or the same one. Having to define different macros to support both normal and triple strings is annoying, but on the other hand, you might want to handle triple-quoted literals specially.
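For readers on a current Julia 1.x release (an assumption; the builds discussed in this thread behaved differently), the split between parser-level interpolation and macro-handled prefixed literals can be inspected with `Meta.parse`. The variable names below are just for illustration.

```julia
# Standard literal: the parser itself splits the string at each `$`.
std = Meta.parse("\"Four score and \$n years ago...\"")
# std.head is :string; std.args is Any["Four score and ", :n, " years ago..."]

# Prefixed literal: no interpolation, the raw text goes to a macro named @<prefix>_str.
pre = Meta.parse("r\"score \$n\"")
# pre.head is :macrocall; pre.args[1] is Symbol("@r_str"); pre.args[end] is the raw text

# Prefixed triple-quoted literal, to see which macro name the parser picks.
pre3 = Meta.parse("r\"\"\"score \$n\"\"\"")
println(pre3.args[1])
```

As far as I know, recent releases print the same macro name for the triple-quoted form as for the single-quoted one, which would mean the "same macro" option is what eventually won out; treat that as something to verify on your own version.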

breaking decision

All 47 comments

There was a point, I think, where string concatenation created RopeString types. Now string(a,b) creates a bytestring (ASCII or UTF8 as the case may be). Is this the intended behaviour? Is RopeString meant to be used only explicitly?

@JeffBezanson hates rope strings ;-). Currently, yes, I believe RopeStrings are only created explicitly. The string stuff is going to need an overhaul at some point for greater performance, so I wouldn't depend on the precise representation of strings in general. Fortunately, you should be able to write code to generic strings, no?

That's fine, just wanted to clarify the intended behavior. The only time I'd care is when building up large strings dynamically in many parts. In such situations, it is fine to create a rope explicitly.

This is a good proposal. All I'd say is we should also get rid of interp_parse, since otherwise we have two different forms of string interpolation that work slightly differently, which seems awful.

RopeStrings are the wrong default, since if you use them to concatenate small strings performance is _terrible_. In the worst case where each RopeString node has 1 character, it uses 30x more memory.

What? Who is this? "This was a cute experiment, but it's bad."? Like the @RealStefanKarpinski would ever say something like that, please.

"You'll pry * for string cat out of my cold, dead, algebraically correct hands", that's what @RealStefanKarpinski would say.

I think he's going to have to create that github account now :)

...I was liking * (and its extension ^ for repetition). And a "" b is even weirder than that, and doesn't give us an obvious string repetition operator. How is it better?

That's a fair point, @pao. But repetition isn't very common. There are a number of bad problems with * for strcat, which I'll get into in a bit if @JeffBezanson doesn't beat me to it...

Yeah, I've seen some of the problems with extension to Char in other issues, so I know it's not all roses. But we already catch people out on our unusual choice of concatenation operator. At least we can say "hey look at this theory!" and what we're using _looks_ like an operator. This would make "" the de facto concatenation operator, which has certain usability issues in that it's even less obvious what it should mean.

So one of the major issues is 'a'*'b'. You can make this do string concatenation, but it's at odds with the very useful interpretation of characters as integers via their code point, as in 'b'-'a' or '0'+3. It's weird to have both + and * defined for characters but not with the usual relationship that + and * have. But the trouble goes deeper and lack of associativity seems to be at the heart of it. Now string concatenation is associative and numerical multiplication is associative (for numbers or matrices or whatever), so what's the issue? The problem arises when you mix the two possible meanings of * as in 2*3*"x" – what should that do? Currently, it's actually a no method error, which is really the best you can do, but it seems like it could either mean "23x" or "6x". And can you guess what 2*'x' does? Does it produce "2x" or 240? Yeah, so this is really kind of a disaster currently. The fact that the best behavior for 2*3*"x" is a no method error is a pretty strong hint that we're trying to cram two very different meanings into a single operator.
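The two competing readings can be spelled out with explicit calls, which sidesteps the operator question entirely. This is a small sketch assuming a current Julia 1.x release; the variable names are made up for the example.

```julia
# The two readings of 2*3*"x", written out with explicit calls.
concat_reading   = string(2, 3, "x")   # "23x"
multiply_reading = string(2 * 3, "x")  # "6x"

# The same split for 2*'x': concatenation versus code point arithmetic.
char_concat   = string(2, 'x')         # "2x"
char_multiply = 2 * Int('x')           # 240, since 'x' is code point 120

# Assumption: on current releases the bare mixed expression is still a MethodError.
mixed_fails = try
    2 * 3 * "x"
    false
catch err
    err isa MethodError
end
```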

The a "" b is a bit odd, but it actually just falls out naturally just from allowing juxtaposition with a string literal to imply concatenation, which C also does, so it's not really without precedent. The fact that a "" b can be parsed as string(a,b) instead of string(a,"",b) is really just an implementation detail since they both produce the same string in the end. Better suggestions for a string concatenation operator would be welcomed, but so far I haven't been able to come up with any. Please no one suggest + because that's got even more problems than * – think about adding chars.

We might want to go the rest of the way and make Char not a subtype of Integer. Interestingly, this patch seems to cause no problems:

--- a/base/char.jl
+++ b/base/char.jl
@@ -44,7 +44,7 @@ promote_rule(::Type{Char}, ::Type{Uint128}) = Uint128
 ## character operations & comparisons ##

 -(x::Char, y::Char) = int(x)-int(y)
-+(x::Char   , y::Char   ) = char(int(x)+int(y)) # TODO: delete me
++(x::Char   , y::Char   ) = error("no method")

But if we stopped using * for stringcat we could go the other way and make it a fully functional integer.

@JeffBezanson: I'm fine with deleting interp_parse although I must confess I will be a little sad to see that bit of my handiwork go. It has served its purpose, which was to sneak string interpolation into the language despite your deep aversion to it :-)

Insisting that Char is not an integer just seems really wrong to me. It's annoying and causes all sorts of problems. The basic problem is that string concatenation and multiplication are very different things. You might want to multiply the values of two characters or you might want to concatenate two characters into a string.

Insisting that Char is not an integer just seems really wrong to me.

Why? 'b' - 'a' or '0' + 3 or 'a' < ',' etc. just seem like gibberish to me.

True. My point is more that Char should either fully be an integer or fully not be. Right now it is a subtype of Integer but doesn't act that way, which is just buggy.

Having Char be a subtype of Ordinal (along with Ptr) is perfectly defensible; 'a'*'b' would just be an error like it is for pointers, which is fine.

If Char is not an Integer any more, maybe + would be an option ? And what about using nothing more than simply juxtaposition of strings ?

Wiki: http://en.wikipedia.org/wiki/Concatenation

In many programming languages, string concatenation is a binary infix operator. 
The "+" operator is often overloaded to denote concatenation for string arguments: 
"Hello, " + "World"; has the value "Hello, World".

Concatenation of sets of strings
...
In this definition, the string vw is the ordinary concatenation of strings v and w as defined in the introductory section.
In this context, sets of strings are often referred to as formal languages.
There is typically no explicit concatenation operator, simply juxtaposition (as with multiplication).

...but it actually just falls out naturally just from allowing juxtaposition with a string literal to imply concatenation, which C also does, so it's not really without precedent.

I've never seen anyone use "" to concatenate string variables in C. It may fall out, but I'd be curious to know if it is in fact intuitive. My intuition is it isn't, but I'd be happy to be shown wrong here.

Why? 'b' - 'a' or '0' + 3 or 'a' < ',' etc. just seem like gibberish to me.

These aren't gibberish at all. They get used all the time in C-style string parsing and printing code. See here or here for example. Comparison of characters as integers is implicit in all string ordering.
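A typical instance of the C-style parsing idiom mentioned above, sketched in Julia. `parse_digits` is a hypothetical helper written for this example, and the sketch assumes a current Julia 1.x release, where Char comparison, Char minus Char, and Char plus Int are all defined.

```julia
# Hypothetical helper: parse an unsigned decimal string with character arithmetic.
function parse_digits(s::AbstractString)
    total = 0
    for c in s
        # The range check relies on the code point ordering of '0' through '9'.
        '0' <= c <= '9' || throw(ArgumentError("not a digit: $c"))
        # Char - Char yields the integer distance between the two code points.
        total = 10 * total + (c - '0')
    end
    return total
end

value = parse_digits("2013")
next_letter = 'a' + 1   # Char + Int stays a Char: 'b'
```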

I'm just saying that Char's integral behavior is a side-effect of the implementation. They're not intrinsically integers. By "gibberish", I mean that the result of 'a' < ';' is encoding dependent -- there's no right answer.

That's not true at all – Chars don't have an encoding, they are code points, which are completely encoding-independent – that's the whole point. Unicode code points _are_ intrinsically integers.

+1 to string interpolation in the parser.
+1 to chars as ordinals, too! You could also view chars as something like an affine space; I guess that amounts to pretty much the same thing. I think that all the meaningful integer operations involving chars, and definitely all the ones in your example above, Stefan, go under this heading. I also don't see a real problem with continuing to use * and ^ as we do now (and I do like them!), since they are not meaningful for ordinals.

Also, it seems nice and consistent that when juxtaposition is meaningful, it coincides with multiplication.

I dislike the a "" b concatenation approach and would be equally happy with * or +. What I don't understand is the expectation that you should be able to concatenate chars in the same way as strings. 'a'*'b' should either be integer multiplication or undefined. Require an explicit string('a','b') if you need to concatenate chars or turn them into strings separately first. Likewise 2*'a' should either be integer multiplication or undefined. 2*3*"x" should be "6x" in accordance with operators.jl line 44.

Unicode code points are intrinsically integers.

Unicode code points serve to enumerate characters. The ordering of Unicode characters is a convention, and in the least significant bits doesn't necessarily mean anything.

However, it is also true that the values of code points have some meaning, particularly in the most significant bits (separating code planes). So Char is conceptually overloaded.

We may want access to both interpretations, but which should be the default? And should it have anything at all to do with how we deal with String? (I am truly unsure of how best to answer either question.)

Regarding *, I complained vociferously about it in a very early discussion... but now I kind of like it (_sheepish grin_). It's just one of the things in Julia that took a little getting used to.

I also don't think it's necessary to handle every combination of * of different types. 2 * 3 * 'a' is nonsensical, unless maybe you take it to mean "aaaaaa", as Python does (but we already have ^ for that, if * keeps its current meaning). If you want to treat 'a' as an integer, cast it. It should be rare enough that inconveniencing the use of Char as an integer isn't a big deal.

(Of course, an alternative is just to break down and use + for concatenation (_ducks_) and * for repeat, and arbitrarily ignore, disallow, or work around all of the problems.)

( I like the python way of + for concatenation and * for repeat, I found it very intuitive. )

For now I'm retracting the deprecation of * for string concatenation. It's largely incidental to this issue, most of which can be done without deprecating that usage of *. We can work the concatenation thing out later.

Ok, this should all be finished, except for string juxtaposition.

I had to create separate @*_mstr macros for prefixed triple-quoted strings, since whitespace stripping is handled by these macros. Do we want this? One alternative would be to move the stripping code into the parser.

In order to remove interp_parse, I"...", b"...", and B"..." interpolation is handled by the parser. I don't mind creating an exception for I, but b and B might be pushing it. Should we interpolate b strings? I only see one use of this (in extras/image.jl), and it doesn't really require interpolation:

    ss = sort(b"$s")

I'm not even sure what B"..." is for. Its only use seems to be making invalid utf-8 strings. And multiline b"""...""" literals are a bit odd, since that feature is clearly for text and not binary data. It's fine if it works, but no need to bend over backwards to support it.

I once more attempted the exercise of making Char not an Integer. What happens is you need to duplicate a lot of the scalar definitions in number.jl. Then there is a lot of code that does things like 7 <= c <= 13, and c & 0x3F. It might make sense to have some kind of Scalar type with Number and Ordinal below it, but there are just too many cases where a Char is treated like an integer. So now I feel the most convenient thing is to make it a proper integer, and not use * for concatenating Chars.

I will also add that I think *, string juxtaposition, and interpolation is too many syntaxes.

We should probably have a bikeshedding session about the non-standard string literal prefixes in Base. There are too many of them now and they're too hard to remember and the behavior of L"..." is a bit questionable since there are certain things you simply can't express (although it's also fairly handy).

I agree that making Char not an Integer feels really contrived and awkward. I would be cool with discarding * for string concatenation in general at this point and embracing juxtaposition, interpolation and explicit use of the string function for concatenation. That's plenty of ways to skin this cat.

Just make sure we don't lose a replacement for string^n. I do actually use that.

Although I still haven't figured out why Char creates a problem for string concatenation. And juxtaposition and * are equivalent elsewhere in Julia.

Yes, repeating strings is definitely necessary. I think some kind of rep function that repeats vectors or strings or whatever iterable in general makes sense.
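For reference, current Julia 1.x releases (an assumption relative to the builds in this thread) provide a generic `repeat` along these lines, and `^` on strings is still available as a shorthand:

```julia
dashes  = repeat("-", 10)     # "----------"
via_pow = "ab"^3              # "ababab"
vec_rep = repeat([1, 2], 2)   # [1, 2, 1, 2]
```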

@pao: the issue is whether 'a'*'b' is "ab" or 9506. If Char is an Integer, then the latter is the correct answer but if * is the string concatenation operator then it should also work for Chars, making "ab" the correct answer.
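Both candidate answers are easy to get explicitly, whichever one `*` ends up meaning. A minimal sketch, assuming a current Julia 1.x release; the last line reflects how those releases behave, which is not necessarily what the builds in this thread did.

```julia
as_integers = Int('a') * Int('b')   # 97 * 98 = 9506
as_string   = string('a', 'b')      # "ab"

# Assumption: on current releases, * applied to two Chars concatenates.
current = 'a' * 'b'
```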

Shall we get rid of B"..." strings?

Yes I'd say so.

I still don't understand why the string concatenation operator must also work for Chars. Why is it so bad to require that you first make strings out of your Chars if you want to concatenate them?

Not a problem any more. Char now behaves like an integer, * and all.

if * is the string concatenation operator then it should also work for Chars

That's the assertion I'm challenging (as @GunnarFarneback notes).

Well now * is only string cat, and does not concatenate Chars, so that's where we ended up.
My personal preference would be to use only string() and string juxtaposition, and eliminate * and interpolation. But I'm outvoted on that, and having 3 syntaxes for it is crazy, so here we are.

String juxtaposition with "" just looks weird.

... of course, I said the same thing when I saw * used for concatenation.

I do agree with Jeff that 3 syntaxes is silly. I propose that, having heard everyone's views, Stefan and/or Jeff just make an executive decision and let everyone just get used to things.

(Please just make a good choice. ;-) )

Now that we have call overloading, it occurred to me that we can do this:

julia> Base.call(s::String, args...) = join(args, s)
call (generic function with 855 methods)

julia> a, b, c = "foo", "bar", "baz"
("foo","bar","baz")

julia> ", "(a, b, c)
"foo, bar, baz"

I'm not saying we _should_ necessarily do this, but we could. The reason I was thinking this is that if we used juxtaposition for string concatenation, then you would write a "" b to concatenate a and b. But that leaves one wondering how to pass the concatenation operation as an object, say to a higher order function. But the empty string could serve that purpose:

julia> words = [a, b, c]
3-element Array{ASCIIString,1}:
 "foo"
 "bar"
 "baz"

julia> reduce("", words)
"foobarbaz"

Slightly weird but it does have a certain internal consistency.
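`Base.call` no longer exists on current Julia releases; callable objects are defined with the `(x::T)(args...)` method syntax instead. The sketch below reproduces the idea with a hypothetical `Joiner` wrapper rather than adding methods to String itself, which keeps the experiment from affecting other code.

```julia
# Hypothetical callable wrapper; avoids adding methods to String itself.
struct Joiner
    sep::String
end
(j::Joiner)(args...) = join(args, j.sep)

words = ["foo", "bar", "baz"]
listed = Joiner(", ")(words...)     # "foo, bar, baz"
glued  = reduce(Joiner(""), words)  # "foobarbaz"
plain  = join(words)                # "foobarbaz", no wrapper needed
```

In practice `join(words)` or `reduce(*, words)` covers the higher-order use case, so the wrapper is mainly a way to see the "separator as function" idea in action.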

I was surprised that string literal juxtaposition doesn't work. I was trying to copy-paste some C++ code that had

std::string points =
        "248258.441322 7417253.63825 44.2832223546\n"
        "248258.909841 7417253.42727 44.066906061\n"
        "248258.985642 7417253.11483 44.5358143357\n"
...
        "248267.816489 7417238.83666 44.6165076596\n";

and thought that might work in Julia (with some bracketing).

Should we support this? It's the same as multiplication juxtaposition, no? Unlike numbers, it "looks" more like the true result than 6 = 2 3 (which is also unsupported).

@andyferris I also think it's weird that you can't concatenate string literals at the parser level. If you do

points = "248258.441322 7417253.63825 44.2832223546\n" *
        "248258.909841 7417253.42727 44.066906061\n" *
        "248258.985642 7417253.11483 44.5358143357\n" * #...
        "248267.816489 7417238.83666 44.6165076596\n"

you are actually creating machine instructions for each *(::String, ::String). Obviously in this case you can use """ since you want newlines, but it's typical to use juxtaposition to break up long literals (without newlines) in C, C++, and Python.
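To make the comparison concrete, here is a shortened version of the data written both ways, assuming a current Julia 1.x release. The triple-quoted form is a single literal whose common indentation and first newline are removed by the parser, so the two values should be identical.

```julia
# Runtime concatenation: each * is an ordinary function call.
points_mul = "248258.441322 7417253.63825 44.2832223546\n" *
             "248258.909841 7417253.42727 44.066906061\n"

# Single literal: the parser removes the common indentation and the first newline.
points_triple = """
    248258.441322 7417253.63825 44.2832223546
    248258.909841 7417253.42727 44.066906061
    """

same = points_mul == points_triple
```

This only helps when the pieces are separated by newlines, as noted above; it does not replace juxtaposition for breaking up one long line.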
