Twint: Question: Adding Language to Tweet Metadata

Created on 6 May 2020  ·  7 Comments  ·  Source: twintproject/twint

I noticed that even though you can filter on language, Twint doesn't return a field with the language code in it with the tweets.

I was looking at adding this, since it isn't hard to pull out of the HTML that Twint scrapes, but I haven't quite figured out how to add that field to the tweet object that Twint creates. I have added it to the tweet object, but there is clearly other processing somewhere in Twint that isn't picking up my added attribute. Does someone who is better at untangling OOP know what else I need to change? Here's what I've done:

Locate the language code in the HTML and add it to the tweet object in tweet.py:
```
t.tweet = getText(tw)

# New: read the language code from the lang attribute of the tweet-text <p> element
t.lang = tw.find('p', 'tweet-text')['lang']

t.hashtags = [hashtag.text for hashtag in tw.find_all("a", "twitter-hashtag")]
```
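The direct lookup above raises an exception if a tweet has no matching `<p>` element, or if that element has no `lang` attribute. A more defensive variant might look like the sketch below. The helper name `extract_lang`, the `"und"` fallback value, and the two stand-in classes are assumptions for illustration only, not Twint code. The stand-ins exist so the sketch runs without dependencies; real BeautifulSoup objects also return `None` from `find()` when nothing matches and support `.get()` with a default.

```python
def extract_lang(tw, default="und"):
    """Return the tweet's language code, or `default` if it cannot be found."""
    p = tw.find("p", "tweet-text")   # None when no element matches
    if p is None:
        return default
    return p.get("lang", default)    # avoids KeyError when the attribute is absent


# Stand-ins for parsed HTML objects so the sketch runs on its own.
class FakeTag(dict):
    """Mimics a parsed element: attributes are read through .get()."""


class FakeTweet:
    """Mimics the parsed tweet container: find() returns an element or None."""

    def __init__(self, tag=None):
        self._tag = tag

    def find(self, name, css_class):
        return self._tag


print(extract_lang(FakeTweet(FakeTag(lang="en"))))  # en
print(extract_lang(FakeTweet(FakeTag())))           # und
print(extract_lang(FakeTweet()))                    # und
```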

Is there somewhere else in the Twint code that I need to change so that the output includes this attribute? It looks like a line such as the following needs to be added to format.py, but I've done that too and the field still doesn't show up in the saved text.

output = output.replace("{lang}", t.lang)
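For context, the idea behind that line is that a custom output format is a template string in which each `{placeholder}` is swapped for the matching tweet attribute, so a new field needs its own `replace` call. A minimal sketch of the idea follows; the `render` function and the sample tweet are hypothetical and are not Twint's actual format.py.

```python
from types import SimpleNamespace


def render(template, t):
    """Fill each known placeholder with the matching tweet attribute."""
    output = template
    output = output.replace("{tweet}", t.tweet)
    output = output.replace("{lang}", t.lang)  # the newly added placeholder
    return output


t = SimpleNamespace(tweet="hello world", lang="en")
print(render("[{lang}] {tweet}", t))  # [en] hello world
```

A change like this only matters when a format string is actually in use, which may be why it had no visible effect on the saved file.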

Anybody have any advice or ideas of what I'm missing?

New Feature

All 7 comments

I was just about to post this as a PR. I've been using this for a week or so now on my fork and it seems to be working pretty well. If someone wants to clean it up and submit a proper PR, feel free.

t.lang = tw.findAll("div", "js-tweet-text-container")[0].findAll("p")[0]['lang']

EDIT: See my post below. I figured out what needed to be added to ensure the new field saves to disk.

Thanks. I added this line, and the language field now shows up in the tweet object in memory, but it isn't being written to the file stored on disk when Store_json = True is set. @pielco11, is there an additional step I'm missing to get a tweet object attribute into the JSON output?

Edit: My original line above, `t.lang = tw.find('p', 'tweet-text')['lang']`, also adds the language field successfully in memory but not in the stored version.

I figured out where to add the language attribute so that it reaches the saved JSON object. In write_meta.py, add the line `"language": t.lang,` to the TweetData() function.
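The likely reason the attribute appeared in memory but not on disk is that the JSON record is assembled key by key rather than dumped from the object wholesale, so any attribute not listed is silently left out. A small sketch of that behavior, using hypothetical `tweet_data_old` and `tweet_data_new` functions rather than Twint's real code:

```python
import json
from types import SimpleNamespace


def tweet_data_old(t):
    """Explicit serializer that predates the new attribute."""
    return {"id": t.id, "tweet": t.tweet}


def tweet_data_new(t):
    """Same serializer with the language key added."""
    return {"id": t.id, "tweet": t.tweet, "language": t.lang}


t = SimpleNamespace(id=1, tweet="hello world", lang="en")

print(json.dumps(tweet_data_old(t)))  # t.lang exists but is not written
print(json.dumps(tweet_data_new(t)))  # record now carries the language key
```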

If it's okay with @pielco11, I can figure out how to do my first PR :)

Any update here? Would love this!

I have a pull request submitted - #749.

The pull request has now been merged. This functionality should be available.
