Mastodon: elasticsearch opt out

Created on 9 Mar 2018 · 70 Comments · Source: tootsuite/mastodon

As #6718 was rejected summarily, let's discuss an alternative approach which allows marginalized users to opt out of having their data indexed in elasticsearch.

I am not 100% against the elasticsearch feature, I just strongly believe that people who do not wish to participate should be able to elect not to.

I propose adding an attribute to the activity which says whether or not it may be indexed.

Would that be alright?
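
To make the proposal concrete, here is a minimal sketch of how a receiving server might honour such an attribute before indexing. The property name `indexable` and the helper `may_index` are hypothetical (nothing in ActivityPub or Mastodon defines them), and treating a missing value as "allowed" is only one possible default.

```python
from typing import Any, Mapping

# Hypothetical property name; not defined by ActivityPub or Mastodon.
INDEXABLE_KEY = "indexable"


def may_index(activity: Mapping[str, Any], default: bool = True) -> bool:
    """Return True if the activity's object may be sent to the search index."""
    obj = activity.get("object", {})
    if not isinstance(obj, Mapping):
        # The object is only referenced (e.g. by URL), so there is no flag to read.
        return default
    value = obj.get(INDEXABLE_KEY, default)
    # Only an explicit boolean counts; anything else falls back to the default.
    return value if isinstance(value, bool) else default


create = {
    "type": "Create",
    "object": {"type": "Note", "content": "hello", INDEXABLE_KEY: False},
}
print(may_index(create))  # False
```

Passing `default=False` would turn the same check into an opt-in scheme, so the signal itself does not force either choice.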

All 70 comments

I think opt-out is the wrong approach if the goal is protecting marginalized individuals. Being searchable should be opt-in with informed consent, not opt-out.

I disagree. If this is implemented, it should be opt-out, not opt-in: most people have set their toots to public, and public toots can easily be found directly or with third-party tools regardless of this option.

I don't care if the signal is opt-out or opt-in. ActivityPub instances where I want this feature will always send the signal that says "don't do this, please."

While you are sort of correct that most toots are public at the moment, they still weren't _searchable_. Some people probably wouldn't have used "public" if they knew it would become searchable.

You can argue all day long that such a thing is silly, and it may be, but it's still true.

Mastodon's search allows you to find:

  • Your own toots
  • Toots you were mentioned in
  • Toots you favourited or reblogged

I do not believe this is dangerous, otherwise I wouldn't have implemented this (highly requested) feature. It basically only allows you to search things you've already seen anyway. Whether the data is in ElasticSearch or not, is absolutely beside the point, the data is already in PostgreSQL.

@Gargron

Again, it's not about what you can find, it's about what a tech bro can find when SSHed in as [email protected]. Most tech bros don't have the time or patience to learn enough SQL to really dig on the statuses database with any efficiency, but they will happily spend the 5 minutes it requires to learn how to query elasticsearch.

Also consider the misconfigured instance where elasticsearch is exposed to the public Internet. That alone is enough to make me not want my toots searchable. All it takes is the right social graph intersection and really bad people will be all over this.

and let's just be clear: there's lots of exposed instances because people just won't learn.

A "tech bro" can add ElasticSearch in an hour, ignore any opt-outs you come up with, and you will never know. This threat model is no different to the general problem of confirming whether a running program is built from a particular known source code. You either trust an instance or you don't accept followers from it.

IMO the exposed ES problem is a serious one, but one that would be better addressed by sane defaults/some sort of redundant check that it's not exposed instead of some filtering?
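
A redundant check of that kind could be as small as the sketch below, which only answers whether a configured bind address is a loopback address. The function name `is_loopback_only` is made up for illustration, and the check looks at the configured string only; it says nothing about what the search process is actually listening on or about firewalls in front of it.

```python
import ipaddress


def is_loopback_only(bind_host: str) -> bool:
    """Return True if bind_host is a loopback address such as 127.0.0.1 or ::1."""
    if bind_host == "localhost":
        return True
    try:
        return ipaddress.ip_address(bind_host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname); treat it as potentially exposed.
        return False


for host in ("127.0.0.1", "::1", "0.0.0.0", "192.0.2.10"):
    print(host, is_loopback_only(host))
```

A setup script could refuse to enable search, or at least print a warning, whenever this returns False.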

@Gargron anybody can go download a root exploit off of the Internet, so we shouldn't bother to secure our systems, right?

If you believe that our documentation for installing ES is insufficient in this regard, please submit corrections. Otherwise, that is out of scope of this repository. Any non-public component being accidentally exposed to the public is problematic, there is nothing unique about ElasticSearch in this regard. Not securing the PostgreSQL database would be even more catastrophic.

What I am saying here is that even though the policy can be bypassed, it is still worthwhile because 99% of the time instances will abide by it.

Or put differently, user and privilege separation in UNIX are still worthwhile constructs even though they can be bypassed through exploits.

I feel like the presence of a big data analysis package can encourage using it for big data analysis, though.

And, realistically? I feel like most of the Mastodon/ActivityPub privacy model is security-by-obscurity. As one Pleroma dev has pointed out, it's quite easy for a malicious instance admin who actually tries, to get private statuses and DMs. That doesn't mean that private statuses and DMs are a bad thing to have - they're useful tools to make it harder to get at the data.

One question is, while the search only allows you to find toots that you made, favorited, boosted, or were mentioned in... what does it index? Every toot that the instance sees? Or every toot that a user of the instance made, favorited, boosted, or was mentioned in?

It's very hard to make PostgreSQL open, because databases have to be owned by a role and roles require a password. I guess someone _could_ make the mastodon db "0.0.0.0/0 trust" but you would have to _know_ that to make it insecure. ES, by default, has no security. These are massively different threat models.

Redis by default has no security and i'm not seeing anyone suggest we should stop using that....

Redis is not a big data analytics package. Elasticsearch is.

I don't know what data is stored in Redis, so I don't know what threats could be had by that, but almost all distros ship Redis in Unix socket listening mode (/tmp/redis.sock or such) - not TCP. And that's how I have configured all the instances that I've set up.

Redis stores a user's personal timeline for caching. That could contain DMs/followers only posts.

ElasticSearch is not a big data analytics package. It is a search engine database based on Lucene. Many companies use PostgreSQL for analytics, that does not make it an analytics package, it is a database.

If ElasticSearch binds on 127.0.0.1 and not 0.0.0.0 by default, then it's the same as redis default configuration on many systems.

@Gargron

99% of real world Elasticsearch usage isn't even for search, it's for fast and flexible aggregations.

This is the whole point behind the Elasticsearch, Logstash and Kibana stack: it gives you derived metrics from the data that you ingest. The company which makes Elasticsearch primarily markets it for this purpose, even.

By just having it in Mastodon willy-nilly you are moving closer toward normalizing the concept of instance admins doing creepy things with the content they collect.

A way for users to say no to this is important.

My concerns are not about the search, they were never about the search functionality. My concerns have to do with importing data into software that is explicitly designed to create things like Twitter's "trending posts" feature.

I fear it sets a precedent that is ultimately extremely harmful to the fediverse, and to the original vision you had when you created Mastodon.

Anyway, I don't think discussing the merits of ES are really on topic here. There are pros and cons, even in the context of the fediverse.

What I would like to get back to discussing is a signal to not be included in the feature.

If we only put statuses into elasticsearch that have a non-null searchable_by property, like @bhtooefr suggests, would that resolve your concern? That is, local toots and toots that were faved and reblogged by local users.
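
As a rough sketch of that rule, assuming a simplified data model that is not Mastodon's actual schema: `searchable_by` is computed as the set of local accounts that authored, were mentioned in, favourited, or reblogged a status (the categories listed earlier in the thread), and a status is only indexed when that set is non-empty. The `Status` fields and `should_index` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Status:
    # Hypothetical, simplified model for illustration only.
    author_id: int
    local_author: bool
    mentioned_local_ids: Set[int] = field(default_factory=set)
    favourited_by_local_ids: Set[int] = field(default_factory=set)
    reblogged_by_local_ids: Set[int] = field(default_factory=set)


def searchable_by(status: Status) -> Set[int]:
    """Local account ids that would be allowed to find this status via search."""
    ids: Set[int] = set()
    if status.local_author:
        ids.add(status.author_id)
    ids |= status.mentioned_local_ids
    ids |= status.favourited_by_local_ids
    ids |= status.reblogged_by_local_ids
    return ids


def should_index(status: Status) -> bool:
    """Index only statuses that at least one local account may search for."""
    return bool(searchable_by(status))


unseen_remote = Status(author_id=1, local_author=False)
faved_remote = Status(author_id=1, local_author=False, favourited_by_local_ids={7})
print(should_index(unseen_remote))  # False
print(searchable_by(faved_remote))  # {7}
```

Under this rule a remote status that no local user has interacted with never reaches the index, although it may still sit in the instance's main database.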

In the four existing privacy settings,

  • Public

  • Unlisted

  • Followers

  • Direct

it seems that "unlisted" could easily be made to imply "do not index" in addition to its existing semantics.

That is not what unlisted was designed for.

@nightpool

No, I want to ensure that an unmodified Mastodon instance never indexes a message that an end user says she does not want indexed. This should probably be a secondary setting to visibility level.

@kaniini You are spreading FUD and wasting our time. Like 80% of this thread is correcting you.

@Gargron

In what way is advocating for data security "spreading FUD"? All of the things I have described, and more, are possible with the new Elasticsearch integration feature.

Can somebody other than @Gargron who actually cares about marginalized users and data security reopen this bug? Thank you.

I feel like this being so easy now really is a solid point. I can certainly imagine myself from a few years ago throwing a few queries at ES since it's "right there", and wouldn't require much time to actually think the implications of that through.

All of the things I have described, and more, are possible without the new Elasticsearch integration feature.

That's why you're spreading FUD. This sounds more and more like personal vendetta against a database engine on other people's dime.

@Gargron

I am fully aware that people can write complex SQL queries to accomplish the same thing as what Elasticsearch has out of the box.

I am not on a personal vendetta: I operate a 200+ node elasticsearch cluster for other purposes.

I am concerned explicitly because I know what it is capable of doing and how simple it is to do it. If anything, that should be taken as a ringing endorsement of the software for accomplishing the mission it is meant to accomplish.

Can we get back to finding a way to resolve this issue now please?

Worth noting that using the unlisted (or even private or DM) flags to prevent a toot from being indexed would break the desired search functionality - you can favorite any of those toots, boost unlisted toots, and you could easily want to search your own toots that are in those categories.

Let me make an analogy, consider we are going on a long distance trip.

SQL (for analytical use) is like driving a typical car. Yes, it can get you to your destination.
Elasticsearch is like flying in a jet. It will get you to your destination much faster and much easier.

The car option may be unattractive to many because it will take a lot of time and effort to get to the destination, while the jet provides the convenience of ease and accessibility.

As @rakiru said:

I feel like this being so easy now really is a solid point. I can certainly imagine myself from a few years ago throwing a few queries at ES since it's "right there", and wouldn't require much time to actually think the implications of that through.

Do you think @rakiru will take the car option? Or will they only be interested in taking the jet option?

@bhtooefr

Correct, that is why I believe it must be a separate property.

@Gargron From what I read, ES is the reason this thread exists because, while it doesn't add any security/privacy issue, it does make it easier for malicious people.

So maybe before saying anybody is spreading FUD it would have been nice to address the actual issue.

Because yeah, @kaniini has a point: if you care about your users (not saying you don't here), you may want to try not to expose unwilling people, or at least not make this easy.

I mean, currently, and from what I understood/read (correct me if I'm wrong) there is either a public toot, showing in the fediverse, or a "private" one, showing only to your followers, or the person you're DMing to.
This is actually a privacy issue for "marginalized people" (quoting @kaniini here)

If it needs another issue to address it, let's do it, but isn't there any way you could at least acknowledge this? (because closing an issue when you ignored the whole point of it ...)

Maybe a fix would be to enable more granularity in a toot's privacy options, such as "local" (the toot only exists on your local instance and is only shown to people on this instance) or "whitelisted" (the toot only appears on whitelisted instances).

@Gargron

To be clear, I would have commented on this elasticsearch feature in the PR had I been aware of it.

I wasn't paying attention, and for that, I am sorry. I understand that revising a major feature that was a notable part of your release announcement is not something you want to do.

But privacy is important, and privacy is why people have chosen Mastodon over other implementations.

You are absolutely right that a remote server can simply ignore any hint not to do something, but the fediverse is largely built on trust, and we have a solution for rogue servers: we just cut them out of the federation.

What I want to solve is the grey area between "not rogue" and "clearly rogue." These are the mastodon instances that are run by people who do not grasp the responsibility of running a public server but do so anyway.

In other words, the same crowd who would start an IRC server just so they could /kill their friends. Mastodon has had a lot of these servers come and go over the past year as people have gotten bored with it.

If these people have to learn SQL or learn Elasticsearch to get the data, learning Elasticsearch is going to be the more attractive option to most of them, because they can just query it with curl.

I hope this helps to more accurately describe the actual threat model I am aiming for with this.

So just so I have this correct, the actual, object level outcomes here are:

  • There is an additional opt-in property, "indexable", that most users will never see or understand. This property will do nothing user-visible except prevent the users who have already seen and interacted with that user's statuses from searching for them later, and prevent that user from searching for their own posts.

  • If this property is opt-in, then most users will never even see the option for it and will be quote unquote exposed to the possible rogue mastodon administrators without even knowing it. If the property is opt-out, or the usage is widespread enough, then these same mastodon administrators who don't "grasp the responsibility of running a public server" will see that their results are broken and just copy-paste a one-line solution from the internet, to comment out the line that handles the opt-in property.

Your stated goals (protecting users) don't line up with the actual outcomes of your proposal (a very small percentage of users protected on the margin), so I don't know how to move forward with this issue.


@Ph4ntomas there are no users who are getting "exposed". To search for a status you have to already have seen and interacted with that status somehow. No posts are now visible or accessible to anybody they wouldn't have been accessible to before this code was implemented.

I mean, currently, and from what I understood/read (correct me if I'm wrong) there is either a public toot, showing in the fediverse, or a "private" one, showing only to your followers, or the person you're DMing to.
This is actually a privacy issue for "marginalized people" (quoting @kaniini here)

This is what I mean when I say he is spreading FUD. Toot privacy and which Mastodon servers toots are distributed to is completely orthogonal to ElasticSearch, which is just a way to perform the same thing PostgreSQL could do with less developer time and higher quality.

I am also seeing a lot of people who have never posted on this GitHub repository before here.

@Gargron

By the way, I have heard of cases where this is already happening on postgresql. So I am definitely convinced that adding ES into the mix is only going to bring more snooping, and more creative kinds of snooping.

And yes, this breaks down to "you should trust your admin."

But what choice do you have other than to trust the admin if you don't know they are doing wrong by you?

So I implore you to come up with a way to exclude statuses from the search feature because of the mechanisms it uses.

@nightpool

You are right that eventually there will be a solution posted that makes modifying the code easy. But it buys some time, and that is worth it.

@Gargron

I am also seeing a lot of people who have never posted on this GitHub repository before here.

Yes.

They use Mastodon instances and therefore will be affected by the new search feature. Some comments have even been in support of your point of view, so I don't see what this has to do with anything.

Can we get back to discussing a way to exclude messages from the search feature that the original author did not want searchable?

@nightpool

What I propose is that the indexable object property is tied to an account-level setting that works in the same way as the "Opt out of search engines" setting which generates a robots.txt entry concerning your profile.

In fact, arguably, it could be the same setting. Perhaps update it to "Opt out of searches of your content".
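
Roughly, the sending side could look like the sketch below: one account-level preference decides the value stamped on every outgoing post. Both the setting name `opt_out_of_search` and the `indexable` property are hypothetical, and the dictionary is a simplified stand-in for a real ActivityPub Note, not the exact payload Mastodon produces.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Account:
    username: str
    # Hypothetical account-level preference, e.g. "Opt out of searches of your content".
    opt_out_of_search: bool = False


def build_note(account: Account, content: str) -> Dict[str, Any]:
    """Build a simplified outgoing Note carrying the author's indexing preference."""
    return {
        "type": "Note",
        "attributedTo": account.username,
        "content": content,
        # Hypothetical property: False asks receiving servers not to index the post.
        "indexable": not account.opt_out_of_search,
    }


alice = Account(username="alice", opt_out_of_search=True)
print(build_note(alice, "hello")["indexable"])  # False
```

Because the value is derived from a single setting, users would not have to decide per toot, and remote servers would receive the preference with each post.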

Yes I also assumed it was going to be the same setting. I don't see how that addresses my issues with it.

@kaniini Yeah, I was just writing a comment to that effect, that Opt-out of search engine indexing seems like a good UI toggle for this.

So you'd just need to federate that.

And yes, an evil admin could easily disrespect that. But an evil admin could easily disrespect followers-only or direct toots, as well, yet that functionality is still present. (TBH, a way to make those categories of toots end-to-end encrypted would be nice, but that's /well/ outside of the scope of this discussion.)

@nightpool

What the property would solve is that the toots would not be sent to Elasticsearch to begin with.

What issue do you have with the property? Anybody who selects the "I do not want my toots indexed" option would surely know the side effects of doing so. If not we can describe them in the preferences UI.

@nightpool then, how does ES index toots that users have seen? Does every client that "sees" a toot silently send a request to index it? Or doesn't that mean all toots will be indexed, and toots you're supposed to be able to see will be shown to you when you search for them? There is a huge difference in what someone controlling the ES index could do with it in each case.

@Gargron yeah, in my case: I learned that ES would be added to Mastodon, and ... well, I can't find anything in the docs saying that the search bar will query ES on other instances, so I assumed that each instance will maintain its own index.
This was what I considered a privacy issue, because now instances I don't know/trust may have a copy of toots I made on a trusted instance.

Then I saw @kaniini's toot, speaking of an open issue regarding this matter. Is that a problem? Or maybe FUD was fed directly to my brain.

To be clear, under the model I propose, the toots would never make it to ES to begin with.

While it is possible that a malicious admin could comment out the code which filters them, the admin would have to reimport the toots and this would take time, which would likely be demotivating to the admin.

Instead the malicious admin would go find something else to do which is more interesting than waiting for toots to be indexed.

In fact, arguably, I would like to see a section in the onboarding modal that happens on initial account signup and version upgrade which briefs users on the actual privacy scenarios they will face on the instance.

But none of that solves the issue of toots authored by remote users, which is interesting because other than denying federation with a remote server, I have no options for what that server's admin may do with my toots in their new ES index, which adds additional risks that weren't there when it was just SQL (because now there's two authoritative places they can look to get the data; I don't consider redis authoritative because it is not anything resembling a complete dataset).

this has devolved into speculating about the possible behaviors of remote admins. any of us could construct a just-so story about how one way of obfuscating the data from people who already are in possession of it is more secure than another way of obfuscating it. Maybe it's not putting it into elasticsearch. Maybe we should rot13 all of our data before we put it into elasticsearch, since that would prevent naive admins from just querying the database directly and they would get fed up because all of their results look like gibberish. Maybe they have a really fast computer, so the re-indexing speed is negligible. None of this is a sound, evidence supported way of discussing the actual issue at play: how to determine whether or not to trust a remote admin.

By the way, I have heard of cases where this is already happening on postgresql

What suggestions do you have to address these cases? Any suggestion that doesn't address things that are already happening as well as things that could potentially happen is pointless.

My suggestion is to harden where realistically possible. Such as not sending data to Elasticsearch when the author of that data doesn't want it indexed.

I really don't understand why it is so difficult to just respect common decency and not index the data if requested not to in the stock install.

Here's the thing. I want to solve this problem. I really, really want to solve this problem. But your first suggestion, in the replies to Kit's post, was that we should have just used postgres' tsearch instead. How would this have had anything to do with your supposed worry about harassment or "indexing" by remote admins? If we built a full-text search solution around postgres instead of elasticsearch we would have been in exactly the same boat today, with admins being able to make a vastly simpler SearchService.call(whatever) ruby query across all users.

So yeah, the moving goalposts and repeated bad-faith arguments (and occasional outright lies) do make it feel like you're using this problem as a pretense to stir up shit. And as someone who's very dedicated to getting this kind of stuff right, seeing people use my problems as cover for their own grudges really pisses me off.

None of this is a sound, evidence supported way of discussing the actual issue at play: how should I determine whether or not to trust a remote admin.

Well that's easy.
Don't trust any remote admin with more data than what they should have.
And that also means not giving them a tool that could make a freaking copy of all the data from every connected instance. (Again, I haven't seen any docs on what is indexed by ES and where, so I'm assuming the worst-case scenario. Maybe @Gargron or others could answer, instead of saying everybody is spreading FUD without even trying to invalidate the actual issue, which is IMO the best way to spread actual FUD.)

Or let the end user decide whether or not he wants specific toots to be sent to every federated instance.

Look, you guys aren't understanding it.

The problem isn't the search functionality, it's all of the other types of "searching" that ES makes very convenient to do.

So what I mean, for example, is that it is trivial to write an ES query which maps certain keywords to certain users with a likelihood of having a specific attribute (is this user transgender for example).

What that means is that an admin who is actually not very nice, could then be like "hey, let's go find all of the trans people and harass them."

Yes, again, I will agree that this can be done in PostgreSQL. Nobody is debating that. But ES offers GROUP BY aggregations on steroids and makes them very easy to access. That's what scares the hell out of me.

Just because there is already a problem with admins snooping in the database and airing people's laundry doesn't mean we need to make it at least an order of magnitude worse.

And remember, remote users have no way to opt out of any of this and may not even be aware of who the admin of a given instance is.

A possible solution to the "trusting admins" issue could be to have per-account whitelists of instances the user wants to interact with. Then she could look around and make her own decisions on who she trusts with her interactions. I think a separate issue should be opened for that functionality though.

@Ph4ntomas nobody is saying that mastodon makes a copy of the data from every connected instance. @kaniini is implying it without outright saying it, because they know better. That's what gargron means by "spreading FUD". The inclusion of full text search gives an admin zero additional data from what they already had on their server. @kaniini's point is that it gives them the possibility of doing more analysis of the data they already had, if they look into the elasticsearch docs and learn the query language. But this is very different than giving them more data.

Mastodon makes a copy of the data from every connected instance.

There I said it.

Doesn't mean it needs to make a copy into ES by default if the user says they don't want it to.

@kaniini there is no way for mastodon to access private posts if no one on that instance follows the user. that's mastodon's security model. It's very different from making a "copy of the data from every connected instance".

DM posts will only go to the users mentioned. They won't go to every connected instance. Again, this is fundamental to mastodon's security model.

Public + unlisted posts will only go to your followers by default, but they can be "fetched" by any other instance if someone requests them. Again, if no one follows you then no one will get your data, regardless of whether your instances are connected.

That's the entire ActivityPub security model, it is not exclusive to Mastodon.

My point is that a large node, such as mastodon.social, intersects nearly everybody's social graph. If you are mastodon.social, you are definitely making a copy of almost every connected instance's data.

But my point is really that it doesn't matter, because what actually matters is if the admins running the nodes connected directly to your social graph are trustable or not.

The per-account whitelist feature I proposed could be a nice solution for those who want to have control of that.

how is a per-account whitelist different from having a locked account, which is already a whitelist of followers?

It's complementary. If I am an individual user and I trust awoo.space, that doesn't mean I necessarily want every user on awoo.space following me. One is an attestation about the users on the instance, and the other is an attestation about the instance itself.

@nightpool

should I make a separate bug about the per-account server whitelist feature? I think it is an interesting solution to the end-user trust problem.

It's not complementary, it's a strict subset. With a locked account, you can approve followers based on a) the user, b) the instance that user is on. With an instance whitelist you're only making a judgement based on the instance.

There are forks that support instance whitelisting on an instance level. I don't have a strong opinion about it myself, but @Gargron is very against it being included in core. I think that instance-level instance whitelists are the correct solution if you want instance whitelists, because it feels like a very difficult ask to suggest that users evaluate the professionalism and security of all possible instances you may come in contact with.

@nightpool Fun fact, I don't care what @kaniini said or implied. That does not change what I thought before knowing other users were concerned by the addition of ES. Now, could we just talk about the actual issue, not what you think made me think that way ?

I know for a fact that when you index anything, you are not only accessing data, you are storing it. If that makes others FUD-spreaders in your eyes, then so be it.

Right now, I do trust the admins of the instances I'm on with storing my toots, and I'm not sending sensitive toots because other instances actually see these. That does not mean I want every instance to store this kind of data.

You know why ?

  • Because I don't want to track down every instance to ask them to delete my data the day I want to opt out.
  • Because if an instance is compromised in any way, I want to be able to know it. And that is way easier if it's not the whole fediverse.
  • Because if the admin of an instance I'm on decides to blacklist another instance, I don't want the index of said instance to have my toots in it, because there would be no way for me to know who has them.

I do know a malicious user could still find a way to get to my data, or just store all the data they can by running an instance. But that does not mean I'm okay with making it easy, with a cute little frontend for every random dude who owns an instance to find and analyse data, if I don't have control over who may or may not receive and store the toots I send.

@Ph4ntomas Mastodon does not ever index into elasticsearch anything that it does not already have indexed in the local database for that instance. That's all I can really say in response to your concerns, because as far as I can tell they have nearly nothing to do with full text search or elasticsearch in general, just which instances have access to your data.

the indexing into elasticsearch (modulo concerns about analysis) is just the same as the storage that already exists in postgres.

Mastodon does not ever index into elasticsearch anything that it does not already have indexed in the local database for that instance.

@nightpool So, if I get you correctly, you can only search your instance's local database?

because as far as I can tell they have nearly nothing to do with full text search or elasticsearch in general

Well kinda, although I know no data is truly safe, I don't think making it easy to access/analyze it is a good idea, and that's what full text search does.

Now if the only data you can look into with the full text search is from the instance you're on, or at least the index is stored only on your instance, then yeah, I still have an issue with the fact that IMO there is not enough granularity in privacy settings, but that becomes off-topic for this issue.

@Ph4ntomas 👍 yeah, discussion of granularity in privacy settings is a different issue. I was just trying to clear up confusion around what statuses it's even possible to search.

@nightpool

in my opinion, it is complementary because locked accounts can be used alongside the proposed whitelist feature

but yes, i agree that most likely a per-account whitelist feature is not likely to be accepted.

so i don't know what solutions @Gargron is willing to accept for the trust part.

i will continue to ask that users can opt-out of ES indexing.

alternatively, if an instance has ES already set up and doesn't mind me setting up Kibana, I could demonstrate what I mean by "easy analytics." you don't need to know the ES query language, because Kibana has a GUI to build the queries.

Isn't this kind of like asking for an option to not allow someone to search their emails for a keyword, if you sent an email to them? Except maybe in the case of managed email, of not having your admin be able to search your emails? But if you don't trust a recipient, then maybe don't send anything to them?

Well the issue is obviously not with the recipient, but with the instance it is in.

So I guess yeah, that's kinda like asking gmail or whoever not to index the mails you send. But at least with mails, you can PGP encrypt these.

A server can be a recipient, not necessarily a person. So as an example, it's more like "I'm not going to email anyone on Gmail or Yahoo because I don't want my sent emails to be indexed"... but anyone can index messages on a custom hosted server. I don't see a solution here except to defer trust to external admins, which is already the current threat model (talks of "easiness" aside...)

The "tech bro" logic seems like just security by obscurity. And if you don't trust a pod admin, then you are free to switch to another or deploy one of your own; that is the whole point of federation. Sorry if I am missing something.
