The frequency and vehemence of discussions around this subject beg for a change. * and ^ were introduced for strings back when the language wasn't as strict on operator punning and overall meaning.
As a first step, I propose we deprecate these two methods for string operations.
As a next discussion, we can talk about the possibility of using a different operator(s) for concatenation/repetition. Just using repeat, with no operator, has been suggested, as well as the following for string concatenation:
++, as a general sequence concatenation operator
.., similar to Lua
Things to consider:
vcat/hcat?
+1 for no infix operators at all. This subject attracts too much noise, and O(n) for a * b * c * d ... concatenation isn't good.
If there is discussion about alternatives, then +100 for moving it to the julia-infix-operator-debates mailing list.
+1 for no infix operators at all. This subject attracts too much noise, and O(n) for a * b * c * d ... concatenation isn't good.
+1 to that
If there is discussion about alternatives, then +100 for moving it to the julia-infix-operator-debates mailing list.
:laughing:
LOL. +1 on the julia-infix-operator-debates.
(I'll personally feel sad to see this use of * and ^ go...)
Stefan recently gave a nice, succinct explanation for why he wants to see them go, I'm going to quote it here:
My problem with * for string concatenation is not that people find it unexpected but that it's an inappropriate use of the * generic function, which is agreed upon to mean numerical multiplication. The argument that strings form a monoid is kind of thin since lots of things form a monoid and we're generally not using * for them. At the time I introduced * for strings, we were a lot less strict about operator punning – recall | and & for shell commands – we've gotten much stricter over time, which is a good thing. This is one of the last puns left in the standard library. The reason ++ would be better is not because it would be easier to learn (depends on where you're coming from), but because the ++ operator in Julia would unequivocally mean sequence concatenation.
Note that the operator punning is not an entirely academic concern. The Char corner case shows where the punning can cause problems: people might reasonably expect 'x' * 'y' to produce either "xy" or 241. Because of this, we just make both of these operations no method errors, but it would be perfectly reasonable to allow 'x' ++ 'y' to produce "xy". There's a lot less of a case for having 'x' * 'y' produce 241 or 'ñ', but the sequence concatenation operation does actually make sense.
References to prior discussions:
https://groups.google.com/forum/#!msg/julia-dev/4K6S7tWnuEs/RF6x-f59IaoJ
https://groups.google.com/d/msg/julia-users/nQg_d_n0t1Q/9PSt5aya5TsJ
https://groups.google.com/d/msg/julia-users/JnTy-XcfLF8/JeeHREk2TvwJ
I for one agree that ++ as a general sequence concat operator is clear and explicit, and agree that the Char example brought up by Stefan is a good example of where this simplifies things by disambiguating the user's intent.
I didn't see this before unleashing a rant on the unsuspecting latest derailed mailing list thread, where I suggested julia-stringconcat in lieu of the (better) julia-infix-operator-debates. +INTMAX. Kill the infix operators.
I _really_ think you should avoid using anything that already has a meaning (other than string concatenation) in other major languages in the world.
I spent too much time seeing bugs because of developers going back and forth between multiple languages that overused simple operators, which meant different things in different languages or
in different contexts.
1) You need something that does not have another meaning for vectors, because people who
do string processing for a living expect to be able to use strings as vectors of characters
(and vice-versa). That would rule out +, *, ^, &, and ||.
2) You need something that is not confusable to most programmers (not just the numerical
computing world). That rules out , <> (SQL and other languages).
I think ++ would be a _little_ confusable, but as it is a unary operator in C/C++/Java/etc.,
and this would be a binary operator, I think that it would be fine.
3) You need a simple infix operator, at least for concatenate, otherwise you'll get pasted with
tons of virtual tomatoes by all of us who are doing string processing.
I'd vote for ++, it is used for concatenation in a reasonably popular language, i.e. Haskell,
it does evoke the idea of adding two strings together, i.e. concatenating them, and it does
not have any other meaning for vectors/arrays, and could be used as a general
vector/array concatenation operator, which is also good (per point 1 above)
I don't think ++ as a general sequence concatenation operator is particularly clear. Does "abc"++[1, 2, 3] return:
"abc[1,2,3]"
"abc\x01\x02\x03"
['a', 'b', 'c', 1, 2, 3]
["abc", [1, 2, 3]]
If we're going to have a string concatenation operator, I'd rather it just be a string concatenation operator and nothing else. (Has anyone complained about the lack of an infix operator for other sequence concatenation operations?)
I'm also fine with not having a string concatenation operator, but the presence of such an operator in most other languages makes me wonder if I'd miss it if I were doing more string-heavy projects like web stuff. I'm fine with not having an infix operator if we decide we don't need it because interpolation tends to be more useful than concatenation, but if it's because numerical workflows don't do too much concatenation, I'd think twice.
Whether there should be a replacement is a decision that can be deferred. For once, can we keep a string concatenation-related issue narrowly defined?
If we're going to introduce a replacement, I think it makes the most sense to deprecate * and introduce the replacement at the same time, so that people can actually use the replacement when they update their code.
@StefanKarpinski you'd also get the nice behavior of "mystring" ++ '\u2000', which very annoyingly doesn't work now with "mystring" * '\u2000'.
@simonster, it makes sense to me, as somebody who spends most of their time with string processing...
a = UInt8[1,2,3]
"abc" ++ a
[97, 98, 99, 1, 2, 3]
(if you combine a Vector with a string, (which is immutable), you'd much rather get back another mutable vector, you can always convert it to an immutable string with UTF8String later)
Then this issue will devolve into every other discussion about this ever. The community has already established that it is unable to handle the topic. It's the ultimate bikeshed and there are a lot of colors to choose from.
If I sound irritated by this, it's because I am. Here's my experience. "Hey, you can't glue strings together with +?" "Yeah, that's because we use *." "Oh, okay then." At which point I moved on with my life.
So no, I don't think we should discuss alternative infix operators in this issue, because we'll never make progress if we do.
To voice my opinion on the matter, I have used languages whose string concatenation operator was ., +, (space), _ and ++. When I started julia and learned that * was the concat operator, my first thought was _cool, that makes sense_, because I never really liked +. The one argument in favor of not using * I like is the one given by @StefanKarpinski about the ambiguity between Char as an integer and Char as a 1 character string. As such, it seems ++ as a concat operator is reasonable, though in that case we should give it clear semantics. The three options for generic ++ (what it should do if the type is equal seems clear) that seem reasonable to me are:
++(x,y) = ++(string(x),string(y))
++(x,y) = #MethodError
++(x,y) = ++(promote(x,y)...)
Where promote promotes to an appropriate container type. The last option would imply
x = Uint8[1,2,3]
"abc"++x == Uint8['a','b','c',1,2,3]
@keno, I think that's not correct, because 'a' is Char, a 32-bit type.
So, the answer would need to be either: UInt8[97, 98, 99, 1, 2, 3], or Char['a','b','c','\x01','\x02','\x03']
I vote for ++
Actually, if you have an ASCIIString, it could promote to just UInt8[], but a UTF8String (as well as UTF16String and UTF32String) would need to promote to Char[].
(and that sort of promotion would be very useful for my string processing...)
This issue could be titled "Taking string concatenation seriously".
the ambiguity between Char as an integer and Char as a 1 character string.
I'll just note that:
julia-0.4> Char <: Integer
false
julia-0.4> 'a' * 'b'
ERROR: MethodError: `*` has no method matching *(::Char, ::Char)
Closest candidates are:
*(::Any, ::Any, ::Any)
*(::Any, ::Any, ::Any, ::Any...)
so no, Char is not an integer, and hasn't been for a while in the 0.4 series, and therefore there's no ambiguity whatsoever. String * Char could perfectly well return the concatenated string, etc. That argument is just obsolete.
Please let's not subject ourselves to 200+ comments before we feel like it's been taken seriously enough.
Can someone just make a PR? I think everyone is in favor of deprecating *, ^ (if only to remove the mailing list bug). The ++ operator seems to be getting decent traction, but it's obviously tricky and not obvious to make it general. There are tricky semantics (similar to push! vs. append!), poor algorithmic complexity, and there's not a clear need for other iterables. So let's just make it work well for strings (and maybe chars) and call it a day.
@ScottPJones Sure, I was writing it that way for illustrative purposes, since Chars can convert to Uint8s if they are in range. Agreed on the UTF8String promotion problem.
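For concreteness, here is a minimal sketch of what "strings and maybe chars, nothing else" could look like. It assumes current Julia (1.x) rather than 0.4, and it uses a hypothetical function name, concat, as a stand-in for the proposed ++; neither the name nor the StringLike alias exists in Base.

```julia
# Hypothetical stand-in for the proposed `++`; `concat` and `StringLike`
# are illustrative names, not part of Base.
const StringLike = Union{AbstractString,AbstractChar}

# Strings and chars only: any other argument type raises a MethodError,
# which sidesteps the "abc" ++ [1, 2, 3] question entirely.
concat(a::StringLike, b::StringLike) = string(a, b)

s1 = concat("my", "string")    # "mystring"
s2 = concat("mystring", '!')   # "mystring!"
s3 = concat('x', 'y')          # "xy"
```

Restricting the signature this way leaves any generalization to other containers as a separate, later decision.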
@jiahao: This issue could be titled "Taking string concatenation seriously".
LOL.
Anyone in for a batch order ?
I think I'd want one, but can I get it with ++ instead of *?
Okay, sorry. Continuing the injokes is fun, but let's stay focused. Let's try to come up with a bare minimum set of features that a PR could reasonably implement:
Deprecate * and ^ for strings
Introduce ++ for strings on strings
Anything that generalizes to other containers I think we can hash out inside the PR.
I want one with ++! :grinning:
@staticfloat :100: :+1:
If we want to have a real "taking strings seriously" discussion, for example, like performance issues related to trying to make strings be 0 terminated, where can we do that? (think about the very common substring or slice operation on a string... with Julia you have to create a new string every time)
if we're incurring string breakage anyways, it seems like as good a time as any to eliminate $ too.
my next-best-favorite alternative to not causing breakage is probably the operator-free version (https://github.com/JuliaLang/julia/tree/jb/strjuxtapose)
+1 to deprecating * and ^ for strings.
I sense a lot of obscurity around the ++ operator. Right now it's nice, for example, that "$a$b" and string(a,b) do exactly the same thing. It would be easy to confuse this with a++b. How often do you need to concatenate a string with an array? That's a strange operation, since it's not clear what the array elements refer to --- could be code points, or raw data.
I'm reluctant to even engage in this discussion, but I feel compelled to mention one possibility that has come up in the past (there was even a PR implementing it at one point): using juxtaposition for string concatenation. You would write the following:
"foo" "bar" # "foobar"
"foo" bar # "foo$bar"
foo "bar" # "$(foo)bar"
foo "" bar # "$foo$bar"
Before, this had the drawback that there was no operator form of it, e.g. that you could pass to reduce, but that's not true anymore since you can use call overloading to make ""(args...) do string concatenation. Thus, you could write reduce("", objs) and get a concatenation of the stringifications of a collection of objects. This could be generalized by this:
julia> call{S<:String}(str::S, args...) = join(args, str)
call (generic function with 934 methods)
julia> reduce("", [1,"foo",1.23])
"1foo1.23"
julia> reduce(",", [1,"foo",1.23])
"1,foo,1.23"
If you're about to comment on what @StefanKarpinski just wrote, please read #2301 first.
@stefankarpinski Ugh!!! Had no end of errors in code from Multivalue/Pick applications, because they used juxtaposition... hard to tell just what the code was really doing.
Also, what happens with macro arguments... whitespace is significant in Julia, so
@foo "Scott" "Paul" "Jones"
to a macro expecting 3 arguments just starts breaking, right?
@JeffBezanson If I have to use a Vector{UInt8} or Vector{Char} for mutable strings, to do my string processing, then I really would like to be able to concatenate an immutable string to one of them... just like people complain about not being able to concatenate strings and Chars now, those are both operations that are frequently done.
But what does concatenating a string with a Vector{UInt8} do? What if the vector contains UTF-8?
@JeffBezanson Concatenating a Vector{UInt8} and a UTF8String should probably be an error. Concatenation with an ASCIIString would be fine (returning a Vector{UInt8}).
Concatenation of a Vector{Char} with a UTF8String should return a Vector{Char} (i.e. do the UTF8->UTF32 conversion first)... for performance, I'd check the UTF8String for how many logical characters first, create the output buffer big enough for both, then copy the Vector{Char} in, and convert the UTF8String right into the buffer...
Actually, it probably would be better to punt on any concatenations with Vectors, except maybe Vector{Char}, and have a mutable string package, and add methods for ++ there... A lot cleaner, IMO.
Yes, I agree, it gets a bit complicated otherwise.
I think it would be a terrible decision not to have any infix operators at all for string concatenation. It should be a clue that nearly every modern general-purpose language has opted to define some infix operator for this operation. And the fact that other languages make _many different_ choices for the operator indicates that there is no ironclad convention that we stray from to our peril.
I agree with @pao that the bikeshed over this is counterproductive, and I find it hard to understand why people care so much about the spelling of this. * is easy to get used to, is not _that_ weird, and Char*Char does not come up often enough to be worth worrying about.
The sequence a * b is an alias for string(a, b) except in the special case where a and b are numerical, oh yeah, or numerical arrays, then it means multiply.
It would be better to give string catenation its own operator so that de-sugaring is always true. And if it's not used in any other language then it is fair to all by making everybody equally unhappy :)
That would also make it easier to make a op b op c op d mean string(a,b,c,d) with the obvious performance implications. So only string() then needs performance optimisations (since at the moment it's a very general function).
++ is good. What it does for non-strings can be worked out later.
@stevengj 1) Why do you assume that Char ++ Char does not come up often enough to worry about?
This is something that bugs me about the discussions here... I see a lot of “this just isn’t important”... but that is just an opinion, and you have people with experience in string processing telling you that it _is_ important. 2) * is rather confusable for lots of people, as I’d say that for most people doing string processing, they’d first think of repetition, never concatenation. I’ve seen many people have brought that up. 3) Maybe the amount of negative comments about * as concatenation operator, going back years from what I’ve seen, should have been a clue that it wasn’t the best decision, and it should have been reconsidered back in version 0.1 or 0.2, not when people want to get 0.4 released...
@simonster Regarding "abc"++[1, 2, 3]. This is a nice example that the "operator with dot" notation inherited from MATLAB bites us from time to time. To compare it, the concatenation operator in J/APL is , and it comes with a "family of dot operators" distinguished by the slices the operator should work on.
'abc' , '123'
abc123
'abc' ,"0 '123'
a1
b2
c3
or even
'abc' ,"1 0 '123'
abc1
abc2
abc3
This doesn't address the question of type promotion you raised.
_Edit: Argh, I wasted the chance to say nothing_
@ScottPJones, plenty of other languages seem to have string-concatenation infix operators but not char-concatenation operators. I don't see a clamor of complaints. You can still concatenate chars by doing string(char1, char2) (or use length-1 strings as in Python), so there is no missing functionality. If you look at existing code in any widespread language, the number of uses of string concatenation vastly outnumbers the number of instances of concatenation of two chars.
Claims that char concatenation is anywhere near as important or useful as string concatenation are simply not plausible.
There will always be negative comments about spelling choices. (People coming from Python will always complain that we need end rather than using indentation.) Tastes differ, and a few people with strong feelings can make a lot of noise. If we choose ++, I guarantee you that newcomers will still complain: "Why didn't you use +? + is so much more discoverable and intuitive because I am used to it from language _X_."
It's not so much that I particularly like *; I simply don't care that much. My feeling is that continual code churn over pointless spelling changes is more detrimental to Julia than any benefit we will get from substituting one character for another.
Aside from all of that, ++ will be extremely painful from an upgrade standpoint. Because ++ does not currently parse as an infix operator, there will be no clean way to maintain backward compatibility with Compat; it will be a flag day upgrade, requiring every package using string concatenation to fork into 0.3 and 0.4 versions (or use string(a,b), giving up on infix concatenation entirely).
The fact is that continual code churn over pointless spelling changes is more detrimental to Julia that any benefit we will get from substituting one character for another.
Yes, it should only ever be changed once, from what it is now to the final state (or no change if that's the decision). Deprecating now and adding an operator later when everyone has changed their code to string(a,b) or "$a$b" is just being mean to the users.
and O(n) for a * b * c * d ... concatenation isn't good.
Can you do better than O(n) for string concatenation?
@stevengj The languages that I have used (that are also heavily used in string processing) tend not to have a separate character type (M[UMPS], Pick, JavaScript, Lua, Python, and many more...), so it never even comes up as an issue, and ones that do (Java, C++) handle concatenation of strings with characters and characters with characters just fine.
I don't know what existing code you've been looking at, but in a large part of the code I dealt with (both internal code and code at 1000s of customers), concatenating characters with characters, and with strings, was done heavily (and specially optimized because of how frequently it was done) [and just 1 of those customers is responsible for around 54% of the medical records within the US].
This isn't a "pointless spelling change", quite a lot of people have brought up a number of good objections to using * for string concatenation... if it didn't aggravate people seriously, you wouldn't see this conversation continuously come up.
If you aren't using it that much, and don't particularly care that much, and the people who want to do string processing in Julia see this as a serious issue, why do you object so much?
I've never advocated changing * to something else based on taste, or because some other language does it a different way, my arguments have been about: confusability (people are more inclined to think of repetition... (* means make multiples of something!), lack of consistency with vectors (which is an issue for people who do a lot of string processing, we tend to think of strings as being vectors of characters), and issues with using Char consistently (Char * Char, which has been removed, but what about other numeric operators with Char? Sometimes Char acts like an UInt32, sometimes not...)
Pick, for example, uses single character "system delimiters", denoted by things like @RM, @VM, @FM, @SVM... (record, value, file, subvalue marks), so you'd see "Scott":@SVM:"Jones":@VM:1:@SVM:"Memorial":@SVM:"Drive", to build up a record, that in JSON would probably look like [["Scott","Jones"],["1","Memorial","Drive"]].
In Mumps, that would be done like this:
"Scott"_$c(1)_"Jones"_$c(0)_1_$c(1)_"Memorial"_$c(1)_"Drive"
(but in practice, you'd use a macro for the $c(0) and $c(1), such as $$$subdlm and $$$valdlm... these are the same as Char(0) and Char(1) in Julia)
And yes, there are also a lot of places where it is concatenating multiple characters... in M[umps], there is the syntax $char(codepoint,...), so you can have $c(1,a,1,b,2,c)... [where a,b,c are evaluated as integers, i.e. code points].
@simonbyrne I don't even understand why somebody said a * b * c * d should be O(n)... That's crazy!
(but maybe it is that way in Julia, another indication that string handling simply has not been taken seriously)
String-concatenation data from some of the most popular programming languages:
+, no character concat (or character type)
., no character type
+, no character type
+ for both strings and chars (but only for string+char, not char+char)
// (chars and length-1 strings are treated almost interchangeably)
+, no character concat
++, no character concat (but prepend via :)
+, + for strings and chars (16-bit "unicode" chars)
+, no character type
+, character concat (apparently exists in some Pascal versions) is poorly documented
+ and &, character concat apparently exists but is poorly documented
Conclusion: a string-concatenation infix operator is useful, and if we change to anything, it should be +: this is the dominant convention by far, though by no means universal. At least backwards/forwards compatibility will be easy to implement (two lines in Compat.jl). But it is a mere spelling change that adds zero functionality and saves zero characters of code, at the price of a lot of code churn, hence I question its utility.
Also, the data indicate that overloading the meaning of concatenation with an arithmetic operation is not a serious problem: people seem to easily get used to it, and continue to adopt it in new languages.
And the data indicate that syntax for character concatenation is not an important problem: many languages commonly used for string processing do not even have a character type, and many languages that do have a character type either do not have character+string or char+char concatenation, or if they do they don't bother to document it well. If you are doing a lot of character concatenation in inner loops you probably need more specialized code anyway, and if it is not in performance-critical code you can always use string or length-1 strings.
(In any case, since Char is no longer an Integer subtype in 0.4, if we want to we can always add a character concatenation operator at some future time. Or you can add this yourself in your own program if it is an important operation in some specialized use-case. The whole point of Julia's design is that "built-in" functions typically have no particular performance advantage over user code.)
@ScottPJones, if it allocates a new contiguous string, a * b * c * d is necessarily Ω(n), where n is the total number of chars, because it needs to touch all of the data. You can only do better if it produces a "rope string" or similar data structure that doesn't actually copy the data, but that has its own drawbacks (subsequent operations may be slower). Defaulting to allocating a new contiguous string is hardly "crazy", and in fact it seems to be what _every_ other language defaults to as well.
I don't think there's any issue at all with a * b * c, at least not any more than there is with string(a,b,c), since it's translated to that:
julia-0.4> @which "a" * "b" * "c"
*(s::AbstractString...) at string.jl:77
That said, I agree 100% with @stevengj so I'll leave it at that.
That said, I agree 100% with @stevengj so I'll leave it at that.
:+1:
In case it was unclear, while others may object to * on the grounds that it's unusual or unexpected for programmers coming from other languages, I am NOT concerned about that – this is not the reason I don't like *. I am _only_ concerned that it is an abuse of the meaning of the * generic function. In other words, is string concatenation really a form of multiplication? Accordingly, I'm completely against using + for string concatenation, no matter how many languages may use it: this is a clear and flagrant abuse of the meaning of the + function – in no sense is string concatenation a form of addition.
@stevengj That's your conclusion... I would strongly disagree that it should be +, if anything, for many of the same reasons that * is a problem in Julia, in that it makes a lot of problems when dealing with Chars and being able to act the same on strings and Vectors.
Your data is incorrect though, + for Java works just fine with a character, and VB does also (go try them if you don't believe me! [and it _is_ documented])
In Haskell, just do string++[ch] or [ch]++string.
Pascal also works just fine (I can send you an example)
So, for Python, Perl, Ruby, C++, Java, C#, JavaScript, VB, Fortran, Haskell, Pascal all have NO problem concatenating characters...
(also, you see that languages that are really focused more on string processing usually _don't_ have a separate character type, and just use length-1 strings, like the languages I worked on, where everything was a string...)
You are left with Go & Objective C that make it a little bit harder...
I think you are totally misinterpreting the "data"... it is not that character concatenation is not an important problem, it is simply that for the vast majority of languages, it's handled, works as expected, and so you don't get all the complaints that you are getting about Julia.
Python, Perl and Ruby don't have a character type.
@simonbyrne Sorry, I misread that... I was thinking about O(n^2), which is what I've been seeing in Julia lately... (because strings are immutable, and if you have a loop building up a string, it allocates a ton of memory, and spends a lot of time doing GC...)
so you don't get all the complaints that you are getting about Julia.
This bikeshed comes up on the mailing lists every few months and sparks a mild discussion, usually just a few mild messages. I've worked with a lot of people doing day-to-day Julia programming, much of it with strings, and this literally never comes up.
@StefanKarpinski That was part of my point, that languages that focus on string processing don't have a special character type, a character is simply a string of length 1... (or Haskell, where strings aren't really special, they are simply [char]).
because strings are immutable, and if you have a loop building up a string, it allocates a ton of memory, and spends a lot of time doing GC
You don't want to be doing this. You should print to an IOBuffer object instead and then take the string at the end. This is similar to the StringBuilder pattern in Java.
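To make the buffer pattern concrete, here is a small sketch contrasting the two approaches. It assumes current Julia (1.x), where the buffer contents are extracted with String(take!(io)); the 0.4-era spelling was different. The function names build_by_concat and build_with_buffer are made up for illustration.

```julia
# Repeated concatenation: every `*` allocates a new string and copies
# everything accumulated so far, so total copying can grow quadratically.
function build_by_concat(parts)
    s = ""
    for p in parts
        s = s * p
    end
    return s
end

# Buffer-based building: each piece is written once, and the string is
# created a single time at the end.
function build_with_buffer(parts)
    io = IOBuffer()
    for p in parts
        print(io, p)
    end
    return String(take!(io))
end

parts = ["ab", "cd", "ef"]
same = build_by_concat(parts) == build_with_buffer(parts)   # true
```

Both functions return the same string; the difference is only in how much intermediate allocation and copying happens along the way.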
Stefan might end up like all the C++ leaders: they claim loudly that using unsigned integers for std::vector subscripting was a major mistake they made, but most C++ programmers still believe that it's what makes C++ so nice. Fame is coming to Julia ;-)
I vote for ++ by the way.
@ScottPJones, it is poorly documented then, if you can't find documentation in 10 minutes of Googling on "concatenate character language X". (In general, if you search for "concatenate strings" vs. "concatenate characters" it is immediately obvious which one people care about more.)
@stevengj Umm... for people who care about it, they generally just treat everything as strings, so you won't see "concatenate characters" come up on a search. Doesn't mean that it isn't important, or heavily done... (by that argument, multiply is by definition commutative, as that's what a Google search will tell you!)
It's a bit of a hack but to allow Compat to handle ++ it could look for this kind of AST pattern:
julia> :('x'++'y') |> dump
Expr
head: Symbol call
args: Array(Any,(3,))
1: Symbol +
2: Char x
3: Expr
head: Symbol call
args: Array(Any,(2,))
1: Symbol +
2: Char y
typ: Any
typ: Any
The .. operator is also available. No one seems to like juxtaposition / "" for concatenation.
@StefanKarpinski, if you aren't worried about the question of familiarity, I question the philosophical purism of the "+ and * are only for arithmetic" viewpoint. This is a language question, hence a question of _convention_, not _correctness_, and the _vast majority_ of computer languages use an arithmetic symbol for string concatenation with no apparent distress. Human beings are used to this.
The character discussion is a tangent. If we had a string concatenation operator, I absolutely agree it should handle characters too.
You don't want to be doing this. You should print to an IOBuffer object instead and then take the string at the end. This is similar to the StringBuilder pattern in Java.
@StefanKarpinski That's just what you have to do in Java or Julia for performance, because of the immutable strings... it doesn't mean that it is easy to use, or that people would understand at first just why Julia is so slow compared to Python doing something like building up a string...
(I don't like Java for string processing either, for that reason)
@StefanKarpinski, Compat cannot look for that AST pattern, because then it will screw up x + +y expressions where x and y are numbers. Granted, that doesn't come up very often, but I'd hate to see @compat perform a transformation that potentially produces incorrect code.
Aren't python strings also immutable?
@ScottPJones I have been teaching math for years. Google says stupid things.
Plus + : In maths, the convention is that + is always a commutative operator
Times * : In maths, the convention is that * can be either commutative (as with numbers) or non-commutative (as with matrices). That's why we have commutative and non-commutative rings.
Now, if we go back to this kind of argument, you should stop using + to add floating points because + is not associative with floating points. And + is always associative in mathematics. This is just to show you that this "non-commutative" argument to prevent using + to concatenate strings does not hold. I tend to prefer ++ but I think that algebra arguments should not enter this game.
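The two algebraic facts referred to here are easy to check directly. The snippet below is only an illustration, assuming standard IEEE 754 Float64 arithmetic and the current * method for strings:

```julia
a, b, c = 0.1, 0.2, 0.3

# Floating-point + is not associative: the two groupings round differently.
float_assoc = (a + b) + c == a + (b + c)                        # false

# String concatenation is associative but not commutative.
str_assoc = ("ab" * "cd") * "ef" == "ab" * ("cd" * "ef")        # true
str_comm  = "ab" * "cd" == "cd" * "ab"                          # false
```

So neither + on floats nor * on strings satisfies every law the corresponding mathematical symbol usually carries.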
No one seems to like juxtaposition / "" for concatenation.
White space is already overloaded. It would conflict with macro/hcat contexts. Which brings me to a point I raised on the ML:
Julia already has two concatenating operators, namely h- and vcat. Why not use hcat for string concatenation, MATLAB-style? Does it make any sense to build a matrix of strings?
Whatever it turns out to be (.., ++, [ ... ]), I'm in favour of an explicit, distinct concatenation operator for strings and chars.
We could add ++ as an operator to 0.4, and then make the deprecation occur once we're on 0.5-dev. Is there a real hurry here?
Fortran: // (chars and length-1 strings are treated interchangeably)
seems like a good "rational" option for Julia[1].
I would expect that building a string with */string is roughly O(n*m) in the number of strings being joined (n) and the total number of characters being joined (m). Some sort of string builder object (cStringIO in python, StringBuilder/StringBuffer in Java, IOBuffer in Julia) is essential for good performance when building anything large.
[1] for context:
> typeof(1//2)
Rational
The purist argument here reminds me of the .+ debacle for array+scalar. When philosophical purism collides with linguistic convention and practicality, purism loses.
Ah, good discussion of O(n^2) string building and such re: python: http://stackoverflow.com/questions/4435169/good-way-to-append-to-a-string
I like the solution of building a list or array and then calling `join` if necessary.
I know people build strings all the time, but it seems like something to avoid. If it's for I/O (usually the case) you will obviously get even better performance doing the I/O directly, rather than building a string first and then sending it out. Even things like cPython's optimized `append` are only _amortized_ O(n), and you're likely to give up a factor of 2 or so.
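A minimal Python sketch of the two approaches mentioned above: collecting pieces in a list and joining once, versus writing each piece straight to an output stream. Here `io.StringIO` is only a stand-in for a real file or socket, and the `pieces` data is made up for illustration.

```python
import io

pieces = [f"line {i}\n" for i in range(5)]  # hypothetical data

# Approach 1: collect the pieces, then join once at the end.
joined = "".join(pieces)

# Approach 2: write each piece directly to the output stream,
# without building the full string ourselves first.
out = io.StringIO()  # stand-in for a file or socket
for piece in pieces:
    out.write(piece)

# Both approaches produce the same text.
print(joined == out.getvalue())
```

Either way, each character is handled a bounded number of times, which is the property repeated concatenation lacks.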
@mbauman That seems reasonable to me, at least.
+1 to @mbauman 's proposal. Everyone seems fine with `++` and having a generic sequence concat operator is quite appealing.
Assuming `++` is being introduced, I think @stevengj 's points on Compat pretty conclusively indicate `*` can't be deprecated in 0.4.
@vtjnash // has the same problem of already having a meaning in Julia as a binary operator... something that I think it would be nice to avoid...
Since we're designing by mailing list complaints, keep in mind we haven't yet heard from all those who expect `++` to be an increment operator.
I don't think I ever said that * for concat had to be deprecated immediately, just that it should be, sometime after .. or ++ or whatever is introduced
@JeffBezanson That's why I preferred the Lua `..`, but it seems most people would rather see `++`.
I also think that it's not a big deal, because one is unary, and the other binary.
(just like DataFrames overloaded ~)
@JeffBezanson When I ran a simple string building test on Python and Julia, right after I first downloaded Julia last month, I saw that it was >800x slower than Python, and made the mistaken assumption that it meant that strings were mutable.. In Mumps, mutable vs. immutable never would come up, because there are no references to things, you just have values stored in associative arrays, in memory or on disk, nothing else. I did the same sorts of optimizations that it seems CPython has done (with reference counts on large strings, copy on write under the hood).
@ScottPJones, making `S=""; for s in list; S*=s; end` an O(n) operation for immutable strings rather than O(n^2) is quite different from the discussion here; can I make a plea for focus?
@stevengj It was @timholy and @johnmyleswhite who brought up O(n).. and I unfortunately misread that as O(n^2), which is what I'd seen for building up strings compared to other languages such as Python... my mistake!
I may be misreading things, but it seems that issue was started not because many people have problems with string concatenation in Julia as it is now, but because many are fed up with the recurring discussions, especially the last one which went on for a while. However, in this case I am not sure that this issue is a problem with Julia _per se_ in the technical sense.
I have re-read a few of the previous discussions and none of them share the tone of the most recent one. Most of them are very friendly: people can't figure out string concat, ask on the list, learn about `*`, maybe ask about its history/justification, and then generally move on.
If it's the discussions, but not the choice of operator itself that is a concern, maybe it could be handled without changing the language, at least for now. The FAQ could explain the situation, and suggest that new users are kindly asked to refrain from opening the issue up for a while.
IMO this would be the least costly solution, in terms of broken code and programmer hours.
This thread raises my blood pressure, which is no easy task, but I feel since it is a `design-by-mailing-list` type discussion, I just want to throw my (day-to-day Julia-using) weight behind pretty much everything @stevengj says. I think changing the operator to `++` is OK, but will be just triggering a new round of moaning that it isn't just `+` (and that it looks like increment), as @JeffBezanson points out. I don't understand the issue with `string * char` given that `char`s aren't integers anymore - is there any problem there?
If the operator is to change, @mbauman has the right idea of the timeline. I think it'd need to be:
`*` now. No one who is supporting 0.3 and 0.4 can use it. Adding a deprecation warning for `*(string,string)` would be inappropriate as anyone using 0.4 with a package trying to support 0.3 would be hammered with them.
`++`.
`*` from 0.6.
Is that really worth it?
Adding a sequence concatenation operator, which we simply don't have, is a legitimate idea. Obviously `*` won't be used for appending lists or arrays. However perhaps having an operator invites O(n^2) `x ++= y` loops too much. Technically the syntax and performance are orthogonal, but in practice syntax can have this kind of effect.
@IainNZ I don't think it is really true that Char's aren't integers anymore, it is just that certain operations have been removed (*, /, and ^), but I think that just creates other inconsistency issues... i.e. why do + and - work, but not *...)
@ScottPJones
julia> Char <: Integer
false
EDIT: ...and I'm with @IainNZ on the blood pressure thing.
@JeffBezanson ++= y wouldn't invite O(n^2) loops any more than *= y does currently, and might give the compiler more of a chance to optimize that maybe?
@pao, correct, I'd noticed that * stopped working between 0.3 and 0.4, and that +/- still did... so those must have been added specifically for the Char type?
I'm more worried about the problem spreading to other types, assuming ++ supports arrays and lists.
We now treat `Char` as an ordinal type.
@JeffBezanson I'd say you can't stop people from doing stupid things... try to give them good tools to make their life easier, and try to warn them yes, but don't not change because some people might misuse it...
I think it would allay my fears almost entirely if we don't have a `++=` operator. Many people assume, and I really can't blame them, that an operator like `+=` is mutating.
Having the operator might give you the chance of optimizing things like CPython did more easily...
It's not really practical to optimize `S=""; for s in list; S*=s; end` like CPython, regardless of the spelling. Julia doesn't have reference counts, so there's no way for the `*(S,s)` function to know that it has the only reference to `S` and hence that `S` is safe to mutate. The compiler could figure this out in some cases, of course, but the main point of Julia's design is that it doesn't privilege built-in types over user-defined types. Moreover, magical compiler optimizations make it much more difficult to reason about code performance and to _predictably design_ code for good performance (c.f. CPython vs. PyPy). A `++=` operator would change nothing about this, and as @JeffBezanson says it would probably encourage people to over-use array concatenation too. See also the discussion in #7052.
In any case, I don't see it as a big problem that the optimal code structure in Julia is slightly different from the optimal code structure in CPython. It would be a big problem if it were much _harder_ to get good performance in Julia than in CPython, but `join(list)` is not particularly hard/convoluted, nor is writing to an `IOBuffer()`.
Note also that Julia is not particularly unusual in repeated concatenation being O(n^2); see e.g. Ruby or Go. The fact that concatenation (as opposed to e.g. `append!`) is never mutating is an easy to understand, predictable, common behavior.
The whole `+=` overloading in python is very troubling by the way, it opens up subtle aliasing bugs like this:
>>> x = [1,2,3]; y = x
>>> x += [4]
>>> print(x,y)
[1, 2, 3, 4] [1, 2, 3, 4]
>>> x = x + [5]
>>> print(x,y)
[1, 2, 3, 4, 5] [1, 2, 3, 4]
I think that this discussion is actually coming to a meaningful end (!!).
The next step sounds like it'd be a good up-for-grabs project that most any intrepid coder could do: create a well tested, rock-solid sequence concatenation function. She wouldn't even need to spell it `++`, but it's also not hard to add it to the parser's operator table. So, if you feel strongly here, have at it. (Usual caveats apply, just because you code it doesn't mean it'll get merged, etc).
It's a pity we couldn't have "_" as a string concatenation operator. Since that's sort of what it does anyway, when used in variable names...
If it were a string concatenation operator, you couldn't use it in variable names...
@cormullion Program in M[umps] or CacheObjectScript then! :grinning: Seriously, having a separate operator for concatenation, that wasn't used for anything else, was great, although sometimes I wished there had been another character available, because people coming from most everywhere else see this_is_foo_bar and think it might be a name... (but M dates back to the mid 60's, before C variable/macro name conventions)
Just a question... and please don't anybody get their hackles up! Could combinations such as `$+` `$/` `$*` be used as an operator? Would those have the problem that was brought up for `++` (i.e. `a + +b`)? (besides it being _slightly_ confusable at first glance for C/C++/C#/Java programmers...)
I kind of like $ as a (part of) a string catenation operator, it has a nice similarity to the use of $ as a string interpolation operator and $ looks like S for string.
If I can figure out (maybe with some help :-) ) how to allow the Scheme parser to allow $+ $* $/, etc. as tokens, then I will make a fork for people to try this out, see how they like it...
It's actually very easy. I think all you need is to add it to the list of operators at the proper precedence level: src/parser.scm:16. You don't even need to rebuild all of Julia to test it out: just `make -C src` and it'll take 5 seconds. I think `$$` is also open.
if `*` is hard to remember, I don't see how `$+` is any easier. also, i think it is better just to teach that IOBuffer is the string-builder abstraction in Julia and be done with it, rather than having a mutable string (with `append!` and `insert!` methods) and the `IOBuffer` object (with `print` and `write` methods) and telling the user to try to pick the right one.
There is some conceptual advantage to using a separate non-arithmetic operator for sequence concatenation. `IOBuffer` is the right tool for strings, but it doesn't really extend to arbitrary container types.
I think `$$` would be a lot easier to remember than eg `$+`.
Just some thoughts:
`++`, why not `**` instead? [1]
`$` combinations! IMHO that's too cryptic. `**` is based on the actual `*` and the proposed `++`.
I think `"a" ** "b"` to get `"ab"` (or concatenate other sequences) doesn't seem bad.
`**` for exponentiation instead of `^` (there is even a syntax warning):
julia> **
ERROR: syntax: use "^" instead of "**"
It could just be documented in differences from other languages in the FAQ.
@vtjnash The problem with *, is that it is confusing to most everybody out there, who would more naturally associate multiply with making multiples of something, which would be repeat for a string.
$ is often associated with strings anyway, and + for concatenation (+ meaning that you are _adding_ something to the string, i.e. concatenation)... hence I thought that $+ would be a good choice...
1) It is not confusable with any math operators (in at least the major languages that I am familiar with), unlike `++`, `**`, `+`, `*`, `<>`, `||`, `&`, `~` that are used in other languages.
2) It doesn't overload something else and so could be used as a generic concatenation operator, i.e. vcat for vectors...
3) It doesn't have the parsing problems (at least, that I am aware of so far), that `++` has, which I think @StefanKarpinski brought up.
4) It is reminiscent of both "string" $ and addition +.
For people doing string manipulation, it's a royal pain to have to try to code everything using IOBuffers... the code doesn't look at all intuitive, doesn't have the string operations which you'd want, may have more overhead than when you simply want to do some operations like:
`mystr[5:10] = "fish"  # i.e. replace the substring between characters 5 and 10 with "fish"`
@toivoh At least for people coming from one of the languages used for string/database manipulations, `$$` would confuse people, it indicates an explicit function call... and it doesn't have anything that indicates concatenation as opposed to repetition or dividing or cutting up a string by a character or substring... For example, `$*` could mean repetition, `"foo" $* 3` would get "foofoofoo" and `"foo,bar,baz" $/ ","` (or even `','`) would return the tuple `("foo","bar","baz")`, which I think would be quite nice ;-)
@Ismael-VC ** is _very_ frequently used for exponentiation... My whole reason for bringing up some of this mess is to avoid confusion as much as possible, both for people coming from outside of Julia, and also for Julians, where overloading the * operator with something conceptually totally different can cause headaches.
If you see a ** b in the code, are you really going to think first of string concatenation?
First off, you are likely to think it is exponentiation, if you've come from many other languages...
Secondly, there is nothing to even indicate that you are doing a string operation.
`+` on strings could give an error, `ERROR: syntax: use "$+" instead of "+"`
`*` on strings could at some point be deprecated (no rush, if we have new consistent syntax that people move to), and it could say: `ERROR: syntax: use "$+" instead of "*"`, same for "^"... `ERROR: syntax: use "$*" instead of "^"`.
@mbauman Great help!!! Just a couple simple additions to already present tables, I did the make -C src, yes, super fast (I wish I had known about that before ;-) ) and then I did `$+ = string` (which I realize may be too simplistic), but hey, it worked! Oh, all praise to the beauty and power of Julia! ;-)
(and kudos to whoever wrote the parser... I used to get paid as a UTA to help students with their Scheme code in 6.001... even after not having used Scheme in 30 years, that code looked pretty understandable)
The one thing I'm trying to figure out now, is how to also make $+ act like vcat on vectors...
Help please?
since `$` is the interpolation symbol, `$+` already means to interpolate `+` and `$$` is already taken as the operator for interpolating twice (e.g. from the outer scope).
and adding a method for this operation doesn't change the fact that the operation is `O(n*m)` and thus undesirable to teach
I am only concerned that it is an abuse of the meaning of the * generic function.
I don't see this as an abuse of the meaning of `*`. On the other hand, it would be interesting (if toy) if a caesar shift could be written as `(((uppercase(string) - 'A') + shift) % 26) + 'A'`. Since + is element-wise, this is reasonable (while `.*` would have been the element-wise string multiplication operator). I think the general question this creates is just whether the `string * non-string` case should promote the non-string to a string, or if it is only valid to use when combining strings.
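For comparison, the Caesar shift expression above can be sketched in Python, which has no character arithmetic, by going through `ord` and `chr` explicitly. The function name `caesar_shift` is made up for this illustration, and the sketch assumes the input contains only ASCII letters.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Uppercase `text` and shift each letter by `shift` places, wrapping at Z.

    Assumes `text` contains only ASCII letters (A-Z or a-z).
    """
    base = ord("A")
    return "".join(
        chr((ord(ch) - base + shift) % 26 + base) for ch in text.upper()
    )


print(caesar_shift("julia", 3))  # MXOLD
```

The arithmetic is the same as in the one-liner; only the conversions between characters and integers are spelled out.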
@vtjnash That's not correct. `$` ONLY means the interpolation symbol within the context of a string literal.... same thing with `$$`. Outside of a string literal, `$` usually means bitwise XOR in Julia, and `$$`, `$+`, `$*`, `$/` don't mean anything at all. Those aren't parsed at all currently. Interestingly, somebody already added support for parsing `..`, but it is not used for anything at the moment.
@ScottPJones It's a bit more complex than that: http://docs.julialang.org/en/latest/manual/metaprogramming/#interpolation
`..` needed to be parsed as an operator since `...` is a valid operator and, as a quirk of the parser, all prefix strings of an operator also need to be in the list of operators
@nalimilan Darn! Does that totally preclude being able to use $+ as a general concatenation operator? Everything seems to work just fine when I changed the parser to accept it, and ran all of the unit tests...
Thanks for the pointer... I'd just been looking at string interpolation.
@vtjnash OK, that's strange... I'd seen talk about using .. for various things, and thought that it was already supported in the parser for that reason (for packages testing ideas out that used it)
@ScottPJones, it might not have been clear, but @vtjnash was explaining why `..` is parsable but not used. Defining and using it is fine:
julia> ..(a, b) = a:b
.. (generic function with 1 method)
julia> 1..4
1:4
I had actually imagined that `..` might be used for something, maybe intervals or concatenation. It has been proposed for "cons" as well (#3160).
If it were up to me, I'd much prefer only having `string()`. It avoids this debate, it unambiguously converts things to strings (as opposed to treating them as sequences), and it helps discourage O(n^2) string building. `string(string(string(a,b),c),d)` looks as inefficient as it is. With an operator, it becomes tempting to write
a ++= b
a ++= c
a ++= d
With function syntax, the obvious way to write that is `string(a,b,c,d)`. Of course this does not "solve" O(n^2) string building, but I think it helps.
@JeffBezanson: :+1:
Also, in general, if people insist on infix for their favorite function, then Julia just may not have enough operators for everything (restricting to ASCII). A lot of the discussion above just reflects that: it is hard to find an infix op for string concatenation because there aren't many left to choose from. Given `string`, `$`-interpolation, `@sprintf` and libraries (eg Formatting), is an infix op for string concat that necessary?
On 3 May 2015 at 16:41, Tamas K. Papp wrote:
@JeffBezanson: :+1:
Also, in general, if people insist on infix for their favorite function, then Julia just may not have enough operators for everything (restricting to ASCII). A lot of the discussion above just reflects that: it is hard to find an infix op for string concatenation because there aren't many left to choose from. Given string, $-interpolation, @sprintf and libraries (eg Formatting https://github.com/lindahua/Formatting.jl), is an infix op for string concat that necessary?
Somebody has to say it: since the addition of emoji, why not :smile_cat: for catenation :)
@JeffBezanson This sort of comment is what really aggravates me:
If it were up to me, I'd much prefer only having string()
How would you feel if I said that Julia should not have .* ./ etc?
Would you really be happy always having to do something like matrix_multiply(a,b)?
Also, your argument about efficiency is rather patronizing... you are telling people that you know better than they do, about what makes sense for their programming.
I also don't see that it really helps... I find that people will program poorly no matter what tools you give them to do better. I see lots of O(n^2) [or O(n log n) only because the allocation goes up by powers of 2, but that's not always good either] code where strings are built up with push!... (and I've tried to fix those, with some rather good results for performance).
If ..= or ++= or $+= is a general vector concatenation operator, then a ++= b is also essentially the same as doing push!(a,b).
Are you then going to say that push!(a,b) should be removed, because it can lead to O(n^2) code, which I've already seen real examples of in the code in Base?
The operator also makes perfect sense for adding to a mutable string (or IOBuffer)... why do you keep seeing it in terms of performance of just one type?
@tpapp I'm sorry, but that argument seems a bit absurd to me... given that 1) Julia's parser already defines hundreds of obscure Unicode characters that you can use as infix operators... go look at src/julia-parser.scm. 2) string concatenation is a fundamental operation in most all languages... using string or string interpolation or @sprintf is ugly compared to having a simple infix operator, may not perform as well [I'd have to see what code @sprintf generates]. This is NOT a case of people wanting an infix operator for their favorite function, this is a case of a whole class of programmers wanting a fundamental operation easy to understand (which overloading * is not), and easy to use...
@elextr Given the hundreds of infix operators that Julia already allows in the parser, maybe somebody should just add _every_ possible Unicode character as an operator! ;-)
@kmsquire Yes, I understood that, and elsewhere already had commented on being able to do just that with .. (the issue was that ..= wasn't handled in the parser).
I believe the unicode operators are chosen from a specific subset based on category codes. You are free to pick a unicode operator such as `⊞` and do whatever you want with it in your own code (and it's quite easy to type that in the Julia REPL, `\boxplus` then hit tab - many editors have plugins for this same functionality). In most cases the unicode operators may have application-specific meanings so we haven't assigned standard definitions in Julia for the majority of them.
Like so many other syntax arguments, it seems this boils down to there not being any unclaimed ascii operators with which we could avoid the ambiguity here. There are lots of useful operations that keep coming up repeatedly, where it would be nice if they could have non-unicode operator/delimiter/bracket syntax for them, but it's rare to find something usable.
It might be a worthwhile exercise to put together a syntax table of one- and multi-character ascii operators to enumerate what ambiguities exist for any unclaimed combinations.
Julia's parser already defines hundreds of obscure Unicode characters that you can use as infix operators
For the second time, @ScottPJones , those aren't just random characters we picked. They are obviously operators. Unicode category codes say they are, and well, they just are. If we've screwed that up so badly, how would you do it? Use those characters as normal identifiers? Ban them entirely?
You keep saying repeated `push!` is O(n^2) but it isn't. Isn't. Isn't. Not a matter of opinion.
I'm using the fact that syntax suggests how things should be done. Most people, myself included, don't take the trouble to learn the performance characteristics of everything in detail. We learn idioms, and go by what syntax is provided and what looks good. Update operators like += suggest incrementally modifying something. In julia they are non-mutating. Doing this for strings is not efficient. I don't see this as patronizing, I see it as trying to avoid leaving performance traps. It's not patronizing for car designers to make the brake pedal big. "What, you think I can't find the brake pedal?"
It's funny that you mention `matrix_multiply(a,b)`, because so far we have been happy to share that very operator with string concatenation!
I believe the unicode operators are chosen from a specific subset based on category codes. You are free to pick a unicode operator such as ⊞ and do whatever you want with it in your own code (and it's quite easy to type that in the Julia REPL, \boxplus then hit tab - many editors have plugins for this same functionality).
@tkelman In my editor, most all of those characters in julia-parser.scm come out as blanks :-( It made it a bit difficult to edit!
There are lots of useful operations that keep coming up repeatedly, where it would be nice if they could have non-unicode operator/delimiter/bracket syntax for them, but it's rare to find something usable.
That is what bothered me about being told that `~` was unavailable, because it had already been claimed by DataFrames.jl.
It might be a worthwhile exercise to put together a syntax table of one- and multi-character ascii operators to enumerate what ambiguities exist for any unclaimed combinations.
Yes, I think that's a very good idea.
@JeffBezanson I didn't say they were random, just that they were obscure... about half of them didn't even display in my editor... Somebody else, you?, accused my idea of using `$+`, `$*`, and `$/` as being "random splats", after I'd already explained my reasoning (similar to having `.+`, `.*`, `./`)
You yourself talked about O(n^2):
Adding a sequence concatenation operator, which we simply don't have, is a legitimate idea. Obviously * won't be used for appending lists or arrays. However perhaps having an operator invites O(n^2) x ++= y loops too much. Technically the syntax and performance are orthogonal, but in practice syntax can have this kind of effect.
Also, I am wanting to use the `..=` or `$+=` on mutable strings (or vectors, or IOBuffers), not on immutable strings.
I think having:
myStringBuffer ..= ".txt"
Is _very_ clear, readable, and useful!
Please go back and read what I actually said about `push!`... I said that, if it weren't for the way the memory allocator worked, it would be O(n^2), but because of the trick (which can waste memory) of increasing the size by powers of 2, it is only O(n log n).
You've been happy to share matrix_multiply with string concatenation, but for me, it is confusable, and prevents treating strings and vectors in a consistent fashion in places where it would make sense.
I have seen a lot of comments on Google and GitHub about how `*` for concatenation surprised people at first...
myStringBuffer ..= ".txt"
Is very clear, readable, and useful!
I agree, and I hate to do this, but unfortunately you're going to be disappointed: `a += b` is immediately lowered to `a = a+b`, so you cannot define `+=` separately from `+` (same for all other operators of this form). To support this, we would have to make an exception for `..=`.
Also, a very subtle point: the performance of `push!` does not depend on how the memory allocator works. It's part of the implementation of the data structure to explicitly request more space, and keep track of how much slack is available. This performance is portable to any kind of allocation or GC scheme, unlike approaches based on reference counting.
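Python happens to offer a way to see what this lowering implies. A class that defines only `__add__` (and no `__iadd__`) makes Python evaluate `a += b` as `a = a + b`, which mirrors the Julia behaviour described above: the name is rebound to a new object and nothing is mutated. The `Buf` class below is a made-up illustration, not anything from Julia or from this thread.

```python
class Buf:
    """Immutable-style wrapper: defines + but deliberately no in-place +=."""

    def __init__(self, text: str = "") -> None:
        self.text = text

    def __add__(self, other: str) -> "Buf":
        # Always builds a new object; the old one is left untouched.
        return Buf(self.text + other)


a = Buf("file")
alias = a          # second reference to the same object
a += ".txt"        # no __iadd__, so this runs a = a + ".txt"

print(a.text)      # file.txt
print(alias.text)  # file
print(a is alias)  # False
```

In Python it is the optional `__iadd__` hook that lets `+=` mutate in place; per the comment above, Julia offers no separate hook of that kind for `+=`.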
@JeffBezanson
I agree, and I hate to do this, but unfortunately you're going to be disappointed: a += b is immediately lowered to a = a+b, so you cannot define += separately from + (same for all other operators of this form). To support this, we would have to make an exception for ..=.
Ok, now I think you may be understanding where I'm coming from... What would be involved in making an exception for `..=`? I noticed that `~` is handled in a special way, as a macro, that still allows modules to control its meaning... could something like that be done? I dislike the idea of some specific exception... it would be better if a more general technique could be used.
Thanks for that information.
About `push!`, I was going by what I'd been told... it is good to know that that is only controlled by the implementation of the data structure, and can be changed (I don't want my strings _always_ doubling in size, when the size is many megabytes, just because I added one extra character (like the `\0` termination ;-) )
Minor nitpick: Building up a sequence with `push!`, which doubles the size of the buffer at each reallocation, is actually `O(n)` amortized time in the size of the result; each reallocation from size `k` to `2k` is paid for with `k/2` reallocation-free pushes that came before it.
@toivoh Somebody else had told me it worked out to O(n log n)... don't you normally have to copy the data multiple times? I.e. you fill up 16 bytes, then you allocate 32 and copy 16, fill up 32, allocate 64 and copy 32, and so forth, plus the time to copy the characters into the buffer in the first place...
The sum of all powers of 2 up to and including N is 2N-1, which is O(N). Of course `push!` can be slow in practice due to extra bookkeeping, and does waste memory.
What would be involved in making an exception for `..=`?
Please see https://github.com/JuliaLang/julia/issues/249 and https://github.com/JuliaLang/julia/issues/3217 and https://github.com/JuliaLang/julia/issues/7052, the last of which has already been linked to here. I don't think changing the way any `op=` operators work is especially likely here (including not-yet-assigned ones). Given there are already packages for doing in-place operations on arrays, I think your best bet is to pursue writing a package for in-place operations on strings in a way that does not require special cases in base Julia.
@ScottPJones: Yes, please sum it up! Let's disregard the copying of each byte as it enters the buffer the first time (we know that part is `O(n)`). Let's say we start with a buffer with 16 bytes of data, and that we have already had to do 16 byte copies. We reallocate and copy, so 32 copies in total. When the buffer is at 32 bytes, we reallocate and copy again, going up to 64 copies in total, the next time 128, and so on. So yes, you have to do multiple copies, but in this case that only applies to such a small part of the data that the total cost is still `O(n)`.
This hinges on the fact that you grow the buffer by at least some given factor each time, if you grow by a fixed amount you get `O(n^2)`. So if you don't want to allocate 2 megabytes for a one megabyte string, I think your best bet is to preallocate how much you need (or slightly more, if you can only overestimate it).
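The difference between growing by a factor and growing by a fixed amount can be checked with a small simulation. This is a Python sketch that only counts the copies caused by reallocation; the function name, the starting capacity of 1, and the fixed increment of 16 are assumptions chosen for illustration, not a description of how Julia's arrays are implemented.

```python
def copies_for_appends(n: int, grow) -> int:
    """Count element copies caused by reallocation while appending n items.

    `grow(capacity)` returns the new capacity when the buffer is full.
    The initial write of each item is not counted, only re-copies.
    """
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size          # move existing items to the new buffer
            capacity = grow(capacity)
        size += 1
    return copies


n = 1 << 14  # 16384 appends
doubling = copies_for_appends(n, lambda c: 2 * c)   # geometric growth
fixed = copies_for_appends(n, lambda c: c + 16)     # fixed-size growth

print(doubling, doubling < 2 * n)   # stays below 2n: linear overall
print(fixed, fixed > 100 * doubling)
```

With doubling, the re-copies sum to less than the number of appends; with a fixed increment, they grow quadratically in `n`.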
@toivoh OK... I had seen the comments here about *= being O(n^2) for strings, had seen the bad performance of the conversion routines using push!, somebody else here told me that the way the buffer was grown meant it was O(n log n) and not O(n^2), and since you guys are mostly numbers guys, I took that at face value, and didn't add things up myself... I just saw that those first 16 characters get copied 16 times if you increased the buffer by powers of 2 up to 1M, the next 16 characters get copied 15 times, and so forth... so when they said it was O(n log n), it made sense... sorry about that confusion, and thanks for enlightening me!
`push!` and `*=` are not related operations. `push!` is `O(n)` amortized when implemented as a % growth operation, string concatenation is not `O(n)` because there are no free operations possible, every new string combination requires copying both the old and new data to a new string. The first string gets copied N times, the next string gets copied N-1 times, ... the last string gets copied 1 time – which is `O(N^2)`.
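The counting argument above can be written out as a small model. This Python sketch counts character copies rather than measuring time, and it assumes every concatenation allocates a fresh string (so it deliberately ignores in-place tricks such as the CPython one mentioned earlier in the thread). The function name is hypothetical.

```python
def chars_copied_by_repeated_concat(pieces) -> int:
    """Model S = ""; for s in pieces: S = S + s, where every + allocates a
    fresh string and copies both operands into it."""
    total_len, copied = 0, 0
    for piece in pieces:
        total_len += len(piece)
        copied += total_len      # old contents plus the new piece
    return copied


pieces = ["x"] * 1000            # N = 1000 one-character strings
copied = chars_copied_by_repeated_concat(pieces)
joined_once = sum(len(p) for p in pieces)   # a single join copies each char once

print(copied)        # 500500, i.e. N(N+1)/2
print(joined_once)   # 1000
```

The first piece is copied in all N steps and the last in only one, which gives the N(N+1)/2 total.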
In my editor, most all of those characters in julia-parser.scm come out as blanks :-( It made it a bit difficult to edit!
i think it's time you found a new editor, or at least download a better font https://github.com/JuliaLang/julia/blob/master/README.windows.md#unicode-font-support
this concern had come up originally as the new operators were proposed to be added, with the conclusion that any editor that didn't support unicode in 2015 was probably too irrelevant to matter. there are now julia plugins for most popular editors that support `\pi<tab>` --> `π` completion.
OT: i believe it should be `push!(mutablestring, character)` or `append!(mutablestring, string)`. since `push!` implies element-wise changes, whereas `append!` implies combining two similar objects. the non-mutating form of `append!` I believe is most nearly `vcat`, so `vcat(string, string) => string` would presumably be a meaningful operation.
@vtjnash I was talking about having a string append operator that also worked on vectors... not the current `*=` operator, and I know all about the O(n^2) issue... I just made the mistake, not knowing just how the Julia memory allocation was happening for an array with push!, that it was `O(n^2)` (given the performance I saw, it was an easy assumption)
About the editor, it does support Unicode, and about half of the list of strange Unicode characters did show up... It's really more a font issue... but not an uncommon one! Until that happened, I hadn't realized the font was missing any characters... It's got the Japanese, Chinese, Korean ones, and most of those math ones....
OT: i believe it should be push!(mutablestring, character) or append!(mutablestring, string). since push! implies element-wise changes, whereas append! implies combining two similar objects. the non-mutating form of append! I believe is most nearly vcat, so vcat(string, string) => string would presumably be a meaningful operation.
I think most "string" people, like me, would consider that a character is really just a 1-element string,
or a string is just an array of characters... think of C, C++, Go, Haskell, ... all the languages that don't even have a separate character type (you just use strings of length 1)...
Making that distinction doesn't make sense for my programming, and it isn't even consistent now in Julia...
julia> string("foo",'\n')
"foo\n"
julia> "foo" * '\n'
ERROR: MethodError: `*` has no method matching *(::ASCIIString, ::Char)
Closest candidates are:
*(::Any, ::Any, ::Any)
*(::Any, ::Any, ::Any, ::Any...)
*(::AbstractString...)
julia> "foo" * "\n"
"foo\n"
This is why I think a new operator is needed, that would work consistently in all these cases... and could even work on IOBuffers...
MyBuff ..= myline
.... other code ...
MyBuff ..= '\n'
MyBuff ..= "Scott was here!"
I'd also say, that maybe this operator, which you really only want people to use on mutable types, could simply give an error if you try to use it on an immutable type...
MyString::UTF8String = "My string"
... code ...
MyString ..= "\n My bad!"
-> would get a compile time error... (or would it be a run-time error?)
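For comparison, the buffer-appending pattern can be approximated without any new operator. This is only a sketch assuming Julia 1.x, using an IOBuffer and print, which accept both strings and characters; the variable names are illustrative:

```julia
buf = IOBuffer()
print(buf, "first line")       # append a string
print(buf, '\n')               # append a single character
print(buf, "Scott was here!")  # append another string
result = String(take!(buf))    # take! empties the buffer and returns its bytes

println(result)
```

string("foo", '\n') handles the mixed String/Char case the same way, which is the consistency being asked for from an operator.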
Speaking of .., it should be noted that #5715 is still happening in v0.4:
# Version 0.4.0-dev+4629 (2015-05-04 00:27 UTC)
julia> ..(a, b) = a * b
.. (generic function with 1 method)
julia> "a".."b".."c"
ERROR: syntax: extra token ".." after end of expression
TL;DR (I tried to, scanned and read most.. and all in the julia forum I started, and just now discovering this thread).
"a * b * c, at least not any more than there is with string(a,b,c)"
I'm not sure what this meant: that a*b*c gets converted into string(a,b,c), i.e. one application of the string function (or three)?
I didn't understand for or against operators + or *. Unless + is used I do not have a preference for any other. Just know || is used in SQL or just a function concat(a, b) or cat(a, b). I could go with no operator.
What I'm most concerned about is if Julia is slow or not. And by default. Could whatever concatenation not just give you a ropestring and be faster? And if you try to substitute in the middle of a string then you get a non-ropestring?
@PallHaraldsson See https://github.com/JuliaLang/julia/issues/2301#issuecomment-13573303 for why we don't use RopeStrings by default. But further discussion of string performance should take place in a separate issue.
It would seem to me that strings in Julia are syntactically similar to character arrays (e.g. a string can be indexed/subsetted in the same way that an array can). So I think the semantic of append! may be more appropriate than push! when it comes to concatenating strings.
As for introducing a new overloaded operator for the concatenation, I tend to think + is a better choice simply due to the wider adoption. The problem, of course, is that + has already been overloaded for arrays and its semantic has nothing to do with concatenation. The selection of operator to override, however, is mostly a matter of personal taste -- what is more important, I believe, is consistency in the semantics of the operator; whatever operator we come up with for this task should be able to be used to concatenate all one-dimensional (or multi-dimensional) arrays with minimal unexpected behaviours.
Maybe we should simply deprecate the ^ and * operators and leave the concatenation to explicitly named functions to avoid confusion.
-100 for removing * for string concatenation. It is very handy in practice and makes code very readable. The argument that it can be inefficient is not valid in my opinion. In that case we should also remove the addition of three vectors a+b+c, which is also inefficient due to the temporaries created.
@tknopp I disagree that * makes code very readable, as it is a surprise to everybody new coming to the language, and makes people think more of repetition, rather than concatenation, as well as being confusable as to whether the things it is being applied to are strings or vectors or scalar numbers...
I think the proposal that seems to have the most support is replacing * with ++, a la Haskell.
This issue should be driven by practical experience rather than from an academic point of view. In the last years there were never a lot of discussions about * being the wrong thing. Why is there suddenly a problem?
In principle I would be fine with + but I don't see that there is enough justification that + would be better.
The use of * is by the way quite common in theoretical computer science (formal grammars). Here strings under * with the empty string "" as identity form a monoid.
I'm starting to think we should just ask that a final decision be made by the powers that be and then stop worrying about this topic.
@tknopp As has been said already many times in this thread, + would not be a good choice for some of the same reasons that * was not a good choice. There also have been a number of discussions about * being confusing, etc. ++ would work because it is only a unary operator in other languages, except for Haskell, where it is the concatenation operator.
@johnmyleswhite: yes I agree. But I thought that they have done this several years ago. At some point one has to stick with a choice.
Well, perhaps the powers that be will decide to close this issue.
O(n) for a * b * c * d ... concatenation isn't good.
In that case we should also remove the addition of three vectors a+b+c which is also inefficient due to the temporaries created.
Is it possible repeated infix operators could be lowered to +(a, b, c) / *("a", "b", "c") ? When this makes sense / is defined...
_(IMO the ship has sailed on * for string concat, any argument for breaking everyone's code needs to be... strong.)_
Is it possible repeated infix operators could be lowered to +(a, b, c) / *("a", "b", "c") ? When this makes sense / is defined...
Is it not?
julia> Meta.show_sexpr(:("a" * "b" * "c"))
(:call, :*, "a", "b", "c")
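The same check can be written against the expression object directly. A small sketch assuming Julia 1.x, where chained * (and +) parse into a single n-ary call while - stays a nested binary call:

```julia
ex = :("a" * "b" * "c")
println(ex.head)    # the expression is one call...
println(ex.args)    # ...with the operator followed by all three operands

# Subtraction is not chained: it parses as (a - b) - c, so the outer call
# has only the operator plus two arguments.
sub = :(a - b - c)
println(length(sub.args))

# Calling the operator as an ordinary function with three arguments also works.
direct = *("a", "b", "c")
println(direct)
```

So a method of * taking all the strings at once can do a single allocation; whether it does is up to that method, not the parser.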
@hayd I don't think the ship has sailed, not for making something like ++ a general concatenation operator (not just for strings), or even for possibly deprecating * as the string concatenation operator, if people get behind using ++ instead.
Whether the ship has sailed on new concatenation operators or not, * and + are only binary operators?
I might be wrong:
julia> a=10
julia> dump(:(a*a*a))
Expr
head: Symbol call
args: Array(Any,(4,))
1: Symbol *
2: Symbol a
3: Symbol a
4: Symbol a
typ: Any
julia> println(a, a, a)
101010
julia> println(a*a*a) # not same as if a happened to be the string "10"
1000
Just as with println and many functions, including b=string(a, a, a), that concatenate, I'm not sure we need more operators to do it? Can't that be the recommended way?
As shown above, * or whatever might be evaluated once as a single string concatenation operation, not sure. That is what is already done with string, and in my idea for new mutable strings it would be useful to not do the operation three times (would save space).
[There is a precedent for , for concatenation, you need look no further than Informix 4GL :) ]
@ScottPJones: "++ a general concatenation operator (not just for strings)", what do you mean? If 1++1 should be "11" and not 2, that is a bad idea?
@PallHaraldsson As I said, I wanted a general concatenation operator for strings and vectors (i.e. like vcat). I never said that I wanted that operator to be used for scalars (I think it would not be a good idea, it just makes things more potentially confusable).
There are a number of consistency problems right now, such as:
julia> b"a" * b"b"
ERROR: MethodError: `*` has no method matching *(::Array{UInt8,1}, ::Array{UInt8,1})
Closest candidates are:
*(::Any, ::Any, ::Any)
*(::Any, ::Any, ::Any, ::Any...)
*{T<:Union(Float64,Complex{Float32},Complex{Float64},Float32),S}(::Union(SubArray{T<:Union(Float64,Complex{Float32},Complex{Float64},Float32),2,A<:DenseArray{T,N},I<:Tuple{Vararg{Union(Range{Int64},Int64,Colon)}},LD},DenseArray{T<:Union(Float64,Complex{Float32},Complex{Float64},Float32),2}), ::Union(SubArray{S,1,A<:DenseArray{T,N},I<:Tuple{Vararg{Union(Range{Int64},Int64,Colon)}},LD},DenseArray{S,1}))
...
julia> vcat(b"a", b"b")
2-element Array{UInt8,1}:
0x61
0x62
A general concatenation operator, as I've been proposing, whether ++ or .. or anything else, would eliminate that sort of inconsistent behavior.
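As a sketch of what such an operator could look like in user code, assuming Julia 1.x, where ++ parses as an infix operator but has no methods in Base. The two methods here are purely hypothetical and only cover strings and vectors; they take varargs so that chained uses work however the parser groups them:

```julia
# Hypothetical generic concatenation operator; NOT part of Base.
++(a::AbstractString, rest::AbstractString...) = string(a, rest...)
++(a::AbstractVector, rest::AbstractVector...) = vcat(a, rest...)

s = "foo" ++ "bar" ++ "!"     # strings concatenate
v = [1, 2] ++ [3]             # vectors concatenate
bytes = b"a" ++ b"b"          # byte arrays behave like any other vector

println(s)
println(v)
println(bytes)
```

Scalars are deliberately left without a method, so 1 ++ 1 is a MethodError rather than "11" or 2.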
@scottjones: "I never said that I wanted that operator to be used for scalars" - is it a good (or bad) idea to exclude the numbers - that are already handled:
julia> print("Páll", 1.0, 1)
Páll1.01
julia> string("Páll", 1.0, 1)
"Páll1.01"
@PallHaraldsson The problem there is that both print and string do a lot more than just concatenation, they "stringify" their arguments... I'm not sure that that should be happening _implicitly_ with a general concatenation operator. It doesn't happen with * currently when used as a string concatenation operator either.
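A quick sketch of that difference, assuming Julia 1.x: string converts each argument to text before joining, while * has no method for a String and an Int, so nothing is stringified implicitly. The variable names are illustrative:

```julia
stringified = string("Páll", 1.0, 1)   # each argument is converted, then joined

# `*` refuses to mix a string with a number instead of converting it.
star_rejects_number = try
    "Páll" * 1
    false
catch err
    err isa MethodError
end

println(stringified)
println(star_rejects_number)
```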
BTW, you need to learn how to quote things here with Markdown... people here kindly showed me how to use triple-back quotes followed by julia around Julia code snippets, and put a blank line after quoting somebody with > and your comment.
i.e. something like:
@scottjones:
I never said that I wanted that operator to be used for scalars
is it a good (or bad) idea to exclude the numbers - that are already handled:
julia> print("Páll", 1.0, 1)
Páll1.01
julia> string("Páll", 1.0, 1)
"Páll1.01"
@ScottPJones, unfortunately quoting doesn't seem to work when responding by email, even if you edit after the fact (testing this here).
print("Hi GitHub! Is this quoted")
@PallHaraldsson, do be careful how you write someone's name, though. You actually pinged a different Scott Jones in your message, who probably was confused to get a notification from you about julia.
In either case, using the GitHub interface, rather than replying by email, does help with both of these things.
Ugh... I didn't know that @PallHaraldsson was responding via e-mail, nor that e-mail had that problem (I use the CodeHub app when I'm not at my laptop... it has its own problems, but not that).
Yep, that's a very different Scott Jones... not even the Scott A. Jones who was an MIT grad student when I was an undergrad, who also lived in Arlington afterwards!
LOL, yes, this was confusing! I've met "myself" a few times over the years, so this is not the first time that this kind of confusion has happened. :)
Cheers,
-Scott
For what it's worth I actually really like * as string concatenation. For one, it matches the notation used in _Computability, Complexity, and Languages_ by Davis et al. It also gives you juxtaposition concatenation for free (not that I've ever seen that used, it's just neat). I find myself using * all the time, and I've seen it used in a lot of other places, so I think the scale of the code churn for this deprecation would be massive, with (at least IMHO) little benefit.
I think we should just keep * for strings, but possibly add ++ later as a generic concatenation operator (which would support strings as well as other things).
We may add ++ as a generic sequence concatenation operator in the future, but it seems like getting rid of * and ^ for strings isn't going to happen. I'll say that I'm no longer particularly concerned about "punning" on *, nor do I even actually think this is punning anymore – in abstract algebra, multiplication (represented as * or juxtaposition) is often used as a non-commutative group operation on things that aren't numbers. The main issues here were from the fact that previously Char <: Number but the * operation for Char was incompatible with * for Number. Now that Char is not a subtype of Number, that's no longer a problem.
I would keep * for string concatenation for the original reason.
This is what Wikipedia says about regular expressions as algebraic operations:
Given regular expressions R and S, the following operations over them are defined to produce regular expressions:
(concatenation) RS denotes the set of strings that can be obtained by concatenating a string in R and a string in S. For example, {"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}.
(alternation) R | S denotes the set union of sets described by R and S. For example, if R describes {"ab", "c"} and S describes {"ab", "d", "ef"}, expression R | S describes {"ab", "c", "d", "ef"}.
(Kleene star) R* denotes the smallest superset of set described by R that contains ε and is closed under string concatenation. This is the set of all strings that can be made by concatenating any finite number (including zero) of strings from set described by R. For example, {"0","1"}* is the set of all finite binary strings (including the empty string), and {"ab", "c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "abcab", … }.
In linear algebra, there is a unary operator, the adjoint operator, often denoted by *. In Julia as well as Matlab, the adjoint operator is given by a single quote ('), since * is generally used for multiplication. So I would propose string operators in Julia to be (*, +, ') for concatenation, alternation, and Kleene star respectively.
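The concatenation and alternation examples from the quoted definition can be reproduced on plain sets of strings. This is just a sketch assuming Julia 1.x; `set_concat` is a made-up helper name, and it works on finite sets of strings rather than on regular expressions themselves:

```julia
# Concatenation of two string sets: every r in R followed by every s in S.
set_concat(R, S) = Set(r * s for r in R for s in S)

R = Set(["ab", "c"])
S = Set(["d", "ef"])
concatenated = set_concat(R, S)

# Alternation is plain set union.
alternated = union(Set(["ab", "c"]), Set(["ab", "d", "ef"]))

println(concatenated)
println(alternated)
```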
As @stevengj pointed out, the argument between + and its competitors is about convention, not correctness. And the data @stevengj provided above has clearly shown that + as a string concatenation operator is the most widely accepted convention in the programming world (C++/C#/Java/Python/Javascript and many others). And all the other choices are apparently much less common, whether some people like them more or not.
Then the main reason I could think of for keeping * is that deprecating it would break existing code like "abc" * "efg". Could anyone explain what else + would break if used as a string concatenation operator in Julia, to help me understand the background better? (I understand string concatenation is not a commutative operation.)
Nothing would break and if you really want it you can define + to do this. It is, however, just bad math. In algebra, + is always a commutative operation – and string concatenation is not commutative. You can see some of the confusion this causes since + does not lead to a natural way to repeat strings. Do you write "hi" * 5 or 5 * "hi"? Neither one really makes much sense. Compare this with * for concatenation where it's obvious that you should write "hi"^5. In any case, while we may introduce ++ for concatenation (including strings), we are not going to use + for string concatenation no matter how many languages may have chosen this syntax.
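The algebra here is easy to check at the REPL. A minimal sketch assuming Julia 1.x; the variable names are illustrative:

```julia
s = "hi"

cubed = s^3                    # repetition as a "power" of concatenation
by_hand = s * s * s            # the same thing spelled out
by_name = repeat(s, 3)         # the named-function equivalent

zeroth = s^0                   # zero repetitions give the identity element
commutes = ("ab" * "cd") == ("cd" * "ab")

println(cubed)
println(commutes)
```

Concatenation is associative with "" as identity but not commutative, which is the behaviour expected of a multiplication-like operator rather than an addition-like one.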
I propose using . for string concat to match PHP. With overloadable getfield, it'd even be trivially implementable:
getfield(x::String, y::String) = string(x, y)
"a"."b"."c"
We could use .. for repetition: " "..5
@StefanKarpinski
Thanks for the explanation! I have a better sense of the background now.
(1) If * is used for string concatenation, then ^ would be a logical and natural operation of string repetition.
(2) If + is used for string concatenation, then * would be the logical result of string repetition.
For option (1), I agree ^ is an intuitive operator for string repetition. For (2), even though * is the logical result, it might still not be intuitive enough (e.g. "hi" * 3 == "hihihi").
Do you write "hi" * 5 or 5 * "hi"?
No, string repetition (whatever operator it is) is not a frequent operation for me. But string concatenation is. If this is the general case, it seems replacing the repetition operator with a named function (e.g. repeat, already supported in Julia) makes sense.
And I realized creating _new_ operators is really a disputable/dangerous thing, while supporting more APIs is usually welcome:)
Edit: Found another thread that introduced / and \ for strings. And this helps me understand better why * was chosen for string concatenation.