Twint: question about how "place" attribute works

Created on 23 Oct 2020 · 11 Comments · Source: twintproject/twint

Issue Template

Please use this template!

Initial Check

If the issue is a request please specify that it is a request in the title (Example: [REQUEST] more features). If this is a question regarding 'twint' please specify that it's a question in the title (Example: [QUESTION] What is x?). Please only submit issues related to 'twint'. Thanks.

Make sure you've checked the following:

  • [] Python version is 3.6;
  • [] Updated Twint with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint;
  • [] I have searched the issues and there are no duplicates of this issue/question/request.

Command Ran

Please provide the _exact_ command ran including the username/search/code so I may reproduce the issue.
(more of a general question, but the code I ran was)
import twint

c = twint.Config()
c.Geo = '37.17162, -122.22275, 50 km'
c.Since = '2020-08-16'
c.Until = '2020-08-26'

c.Limit = 100

c.Hide_output = True
c.Store_object = True
c.Pandas = True
c.Lang = 'en'
twint.run.Search(c)

Description of Issue

Please use as much detail as possible.

I guess I'm a bit confused about the "place" and "geo" attributes. Does the 'geo' attribute return the geotagged location of the tweets themselves or of the users? I thought it was the former, but what is throwing me off is that the above code returns about 60,000 tweets, about 3,000 of which have the "place" attribute with a specific lat/long for the tweet. Does that mean the remaining tweets were not geotagged, and that their falling within the specified geo radius is inferred from the user profile? I only want geotagged tweets to be returned, and I'm not sure if there's currently a way to do this. Further, does the "place" attribute only get filled in the resulting dataframe/output if there is a specified geo radius? If anyone has any insight into how these functionalities work, do let me know! Many thanks!

Environment Details

Using Windows, Linux? What OS version? Running this in Anaconda? Jupyter Notebook? Terminal?

All 11 comments

"Geo" appears to be completely broken in its current state. It's not you; something is currently malfunctioning with the scraper.

From what I understand, Geo is not broken, since it actually scrapes the data within the specified radius.
I ran a test with a command similar to this:

twint -g="45.4642,9.1900,50km" --since 2019-12-01 --until 2020-01-01 -o 12-19-Milan-50km.csv --csv

The result I get is a lot of tweets. The ones with the place (lat, lon) attribute are tweets from other services (e.g. Instagram) sharing a post with its location.
On the other hand, the geolocated tweets from Twitter itself, like this one from my account, are actually scraped, since they fall within the search radius, but their place is just empty.
It would be great to have the Twitter ID of the place or the coordinates.

PS. I'm not sure whether this is worth opening another issue for; I'm still not confident with GitHub etiquette.
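As a stopgap for the observation above, rows can be post-filtered on whether "place" is populated. This is only a minimal sketch: the sample data is hypothetical, the "id" column name and the shape of the "place" value are assumptions, and `rows_with_place` is an illustrative helper, not part of Twint. Since natively geotagged tweets reportedly come back with an empty place, this filter would keep only the rows that do carry coordinates, not every geotagged tweet.

```python
import csv
import io

# Hypothetical sample standing in for a Twint CSV export. The "place"
# column name comes from the discussion; "id" and the value format are
# assumptions made for illustration.
SAMPLE_CSV = """id,place
1,
2,"{'type': 'Point', 'coordinates': [45.4642, 9.19]}"
3,
"""


def rows_with_place(csv_file):
    """Yield only the rows whose "place" column is non-empty."""
    for row in csv.DictReader(csv_file):
        if (row.get("place") or "").strip():
            yield row


tagged = list(rows_with_place(io.StringIO(SAMPLE_CSV)))
print([row["id"] for row in tagged])
```

The same function can be pointed at a real export by passing an open file handle instead of the `io.StringIO` sample.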

I'm not sure of the best course of action in this case.

On one hand, given the timing, this post seems related to an earlier bug. A few weeks ago the scraper would return either zero results or completely unfiltered geo results, depending on whether your query contained a space.

On the other hand, the question was framed so generally that it matches the theme of your problem.

@G14LL0
This needs to be tested.
I'll look into this and update you with my findings.

@himanshudabas Thank you. If you need other info from my side just hit me up.

@coreyryanhanson Yeah, I saw the discussions about the space in the query, but it seems to me that's a different issue. But I'm no developer, so I'm just guessing here. Thanks.

@G14LL0

I looked into this and the _json parser_ needs a little bit of tweaking.

The Place attribute should contain the name of the place that Twitter shows when a tweet is _expanded_. For example, in this tweet (which comes up at the top of your search query), the Place should be Busto Arsizio, Lombardia. For every geotagged tweet its _Place_ should _always_ be available. (I am guessing here, on the assumption that a geo search only returns geotagged tweets; so far I haven't found a tweet that doesn't contain the _Place field_.)

Now when it comes to the geo attribute of _twint_ (which gets saved to the csv), things get a little complicated.
Currently it saves the query attribute (45.4642,9.1900,50km in your case) to the geo field in the csv. I don't know if this was intended originally or was decided along the way. This is a redundant attribute: if someone is making a query with a particular geo location, they can simply name the csv with that geocode, or handle it in some other way.
Twitter itself provides approximate geo coordinates for each geotagged tweet, so it might be better to store those in the geo field instead. I don't know how accurate they are for different types of queries, but for your query (which has a radius of 50 km) my test gave an approximate location within a radius of 6 km, which is much better than saving the query geo coordinates (50 km in your case).

Anyways, I think we'll have to discuss this more before making changes to the geo field, as there might be some privacy concerns too.

As far as the Place attribute is concerned, I'll put up a patch for it soon.
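To make the difference between the stored query geocode and a tweet's own coordinates concrete, here is a minimal sketch that parses a geocode string such as `45.4642,9.1900,50km` and checks whether a point lies inside that radius using the haversine formula. The function names are hypothetical and not part of Twint, the Busto Arsizio and Rome coordinates in the usage lines are approximate and only illustrative, and the parser assumes a `km` or `mi` suffix.

```python
import math


def parse_geocode(geocode):
    """Split 'lat,lon,radius' into (lat, lon, radius_km).

    Tolerates spaces, e.g. '37.17162, -122.22275, 50 km'.
    """
    lat, lon, radius = (part.strip() for part in geocode.split(","))
    radius = radius.lower().replace(" ", "")
    if radius.endswith("km"):
        radius_km = float(radius[:-2])
    elif radius.endswith("mi"):
        radius_km = float(radius[:-2]) * 1.609344
    else:
        raise ValueError("radius must end in 'km' or 'mi': %r" % radius)
    return float(lat), float(lon), radius_km


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def within_query(geocode, lat, lon):
    """True if (lat, lon) lies inside the circle described by geocode."""
    qlat, qlon, radius_km = parse_geocode(geocode)
    return haversine_km(qlat, qlon, lat, lon) <= radius_km


# Approximate, illustrative coordinates (assumptions, not scraped data).
print(within_query("45.4642,9.1900,50km", 45.61, 8.85))   # near Busto Arsizio
print(within_query("45.4642,9.1900,50km", 41.90, 12.50))  # near Rome
```

A check like this could be run over per-tweet coordinates, if they were stored, to confirm that results really fall inside the requested radius.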

Thank you so much for checking on this, @himanshudabas. Just to clarify, once it's functioning correctly, the "place" attribute should include a general location for any geotagged tweet?

@lilygrier
If by general you mean what is shown when you open a geotagged tweet, then yes.
As in the case above, Busto Arsizio, Lombardia, Italy would be shown in the Place attribute.

@himanshudabas Thank you very much for the complete answer and the time you put in it. I completely agree on the redundancy of the geo field, that was something that was bothering me when I first looked at the data.

Regarding the accuracy of geotagged tweets, I am pretty sure that it depends on the way you geotag yourself. Let me clarify: Twitter has _Places_, each with its own _ID_ and corresponding coordinates. If I geotag myself in the city of Milan, Twitter uses the coordinates of the centroid of the Milan Place; if instead I geotag myself at a particular venue, it uses the coordinates of the selected address.
The tweet you posted was most probably made with the former kind of geotag; this is an example of the latter.

I'm writing this because in my field of work (transport planning / GIS analysis) it would be great to have the true coordinates (when available) to understand how social activity is shaping the different parts of the city.

@G14LL0
Thanks for clarifying how geotagging actually works on Twitter. I don't use Twitter that much myself :P so I have little knowledge of how things like _geotags_ actually work.

On the other hand, if this is how geo coordinates are generated on Twitter, does that mean Instagram does something similar? Tweets with Insta links always have a _variable_ point which contains a single pair of geo coordinates (a single point, whereas for geotagged tweets it is an object with 4 coordinates representing the corners of a square). So, unlike what I previously assumed, that might work in a similar way. I don't even have an Insta account, so I don't know how that works either.

@himanshudabas I think that everything works as a point. Even if Twitter has data about the actual shape of the object (hence the 4 coordinates), it would surely rely on the centroid of the polygon, which is a point. I say surely because I'm thinking of all the other web services that rely on this kind of technology, and it's just more efficient to use the centroid.

Regarding Instagram, I think it's quite a coincidence that we can scrape this data. On Instagram you can geotag posts in a similar way to Twitter but, as far as I know, Insta blocked scraping of posts some years ago. So the reason we are scraping these is that, when a tweet links to an Insta post, it simply carries the geotag metadata along with it.
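The bounding-box versus point distinction discussed above can be illustrated with a small sketch that reduces four corner coordinates to a single centroid. The corner values are hypothetical, `bbox_centroid` is an illustrative helper rather than anything in Twint or Twitter's API, and taking the midpoint of the extremes assumes a small box that does not cross the antimeridian.

```python
def bbox_centroid(corners):
    """Return the (lat, lon) midpoint of a box given as (lat, lon) corners."""
    lats = [lat for lat, _ in corners]
    lons = [lon for _, lon in corners]
    return ((min(lats) + max(lats)) / 2, (min(lons) + max(lons)) / 2)


# Hypothetical corners of a small box, listed as (lat, lon) pairs.
box = [(45.58, 8.80), (45.58, 8.90), (45.64, 8.90), (45.64, 8.80)]
center_lat, center_lon = bbox_centroid(box)
print(center_lat, center_lon)
```

For a venue-level geotag the four corners would collapse to nearly the same point, so the centroid and the tagged address would effectively coincide.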
