Twint: Working with twint in a streaming fashion

Created on 29 Mar 2020 · 7 Comments · Source: twintproject/twint

I want to use twint as a kind of watcher on a user account. The first time I run it, it should give me all of the user's tweets; after that it should keep "listening" to the user and download each new tweet as it is posted.
Is this behavior achievable with twint, or will I need to find some workaround to make it work?

question

All 7 comments

You'd just have to run twint continuously in a while loop with a timeout.
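
A minimal sketch of such a loop, not a tested twint integration. The names poll_forever, fetch and handle are hypothetical: fetch stands in for one twint run that returns the scraped tweets, and max_rounds exists only so the loop can be bounded when trying it out.

```python
import time
from typing import Callable, Iterable, Optional


def poll_forever(
    fetch: Callable[[], Iterable[dict]],
    handle: Callable[[dict], None],
    interval_seconds: float = 60.0,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() repeatedly and pass every returned item to handle().

    fetch is a hypothetical stand-in for one twint run.
    max_rounds=None means "run until interrupted".
    Returns the number of completed rounds.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for item in fetch():
            handle(item)
        rounds += 1
        # Pause between two runs, but not after the final one.
        if max_rounds is None or rounds < max_rounds:
            sleep(interval_seconds)
    return rounds
```

Passing the sleep function as a parameter is only a convenience: it lets the loop be exercised without actually waiting.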

How would I avoid re-downloading tweets I have already downloaded this way?
Is there a tweet ID I could configure so that twint knows I want only the tweets posted after that tweet?

I'd use since and until to select the right window; for each run, the since param needs to be the date of the newest tweet scraped in the previous session (twint returns the newest tweets first, so that is the first tweet scraped).
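
One possible reading of this windowing idea, assuming the goal is to fetch only tweets newer than the previous run. Tweets are modelled here as plain dicts with hypothetical "id" and "datetime" keys, and next_since and drop_seen are illustrative helpers, not twint functions. The returned string would be assigned to the config's since setting; the exact date format twint accepts is an assumption to check against the wiki.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Set

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format, verify against the twint wiki


def next_since(previous_tweets: Iterable[dict]) -> Optional[str]:
    """Return the timestamp of the newest tweet from the last run, or None."""
    stamps = [datetime.strptime(t["datetime"], DATE_FORMAT) for t in previous_tweets]
    return max(stamps).strftime(DATE_FORMAT) if stamps else None


def drop_seen(tweets: Iterable[dict], seen_ids: Set[int]) -> List[dict]:
    """Keep only tweets whose id is not in seen_ids, and record the new ids.

    A date window can overlap at its boundary, so the same tweet may come
    back twice; filtering on the id guards against that.
    """
    fresh = []
    for tweet in tweets:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            fresh.append(tweet)
    return fresh
```

The date window keeps each run small, while the ID filter is what actually prevents duplicates.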

A good way to do this is to pass a list to twint to use as the tweet list and run twint in a thread. This way you can still execute code with logic to check the results twint has returned and break intelligently. It requires a little scaffolding.

General pattern looks something like this:

FULL DISCLAIMER! I hand-typed this in GitHub and have not run this code, so it could have syntax and/or other issues!

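import threading
from concurrent.futures import ThreadPoolExecutor

import twint

# Placeholder from the original post: tweets gathered in earlier sessions.
master_tweet_list = some_existing_list_from_previous_collection


def my_twint_threaded_code(config):
    tweets = []
    config.Store_object = True
    config.Store_object_tweets_list = tweets  # twint appends scraped tweets here
    processed_tweets_count = 0
    stop_running = threading.Event()

    # twint.run.Search blocks until the search ends, so run it in its own
    # thread and inspect the shared list while it fills up.
    worker = threading.Thread(target=twint.run.Search, args=(config,), daemon=True)
    worker.start()

    while not stop_running.is_set():
        finished = not worker.is_alive()
        # Process batches of more than 20, or whatever is left once twint is done.
        if len(tweets) - processed_tweets_count > 20 or finished:
            new_tweets = tweets[processed_tweets_count:]
            processed_tweets_count += len(new_tweets)
            print(f"Twint has collected {len(new_tweets)} new tweets")

            for tweet in new_tweets:
                # Placeholder from the original post: replace with your own check.
                wait_condition = YOUR_LOGIC_HERE_TO_STOP_BASED_ON_NEW_TWEETS
                if wait_condition:
                    stop_running.set()  # ends this loop only; the worker runs on

            # some logic here to add new tweets to the master tweet list
        if finished:
            break
        stop_running.wait(1)  # wait for 1 second to check twint again


config = twint.Config()  # set Username and any other options here

with ThreadPoolExecutor(max_workers=1) as executor:
    exit_event = threading.Event()
    while not exit_event.is_set():
        try:
            executor.submit(my_twint_threaded_code, config).result()
            exit_event.wait(3600)  # wait for 1 hour
        except KeyboardInterrupt:
            break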

@llunn Your solution seems much more elegant than what I currently have.
Right now my solution is based on this blog. I want to store the tweets in Kafka, as in the blog post.
Would you recommend using your solution for this purpose as well?

By the way, in the wiki, Store_object_tweets_list isn't mentioned in the config list given at the beginning of the page. It is mentioned only later, in the "Storing data in lists" section at the bottom of the page. I think it should be listed with the rest of the configuration.

The documentation in the wiki could certainly use some updating. I bet that @pielco11 would welcome any contributions that address that.

Personally, I would not use the method of overriding functions in the twint code, for one simple reason: the function is internal to the workings of twint, and there is no guarantee that future versions will keep it as currently implemented, or that other parts of twint won't come to rely on it. By changing it you may inadvertently break other things.

With that said, the override function simply writes the __dict__ attribute of the user. You do not need to replace twint.write.Json with this function. You should be able to simply use the Store_object = True configuration setting, then transform the user instances located in Store_object_users_list to JSON.
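
A rough sketch of that transformation, assuming the stored user objects carry only flat attributes (strings, numbers and so on). FakeUser is a hypothetical stand-in for a twint user instance, and users_to_json_lines is an illustrative helper rather than part of twint.

```python
import json


class FakeUser:
    """Hypothetical stand-in for a twint user object with flat attributes."""

    def __init__(self, id, username, followers):
        self.id = id
        self.username = username
        self.followers = followers


def users_to_json_lines(users):
    """Serialize each object's attribute dict to one JSON string."""
    # vars(u) returns u.__dict__, the same data the override function writes.
    return [json.dumps(vars(u), sort_keys=True) for u in users]
```

With real data, the list passed in would be whatever was assigned to Store_object_users_list before the run.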

One other caution! It is also not guaranteed that someuser.__dict__ will always work. If, for example, the twint code adds a property to twint.user.user that is itself an object (say, another twint.user.user instance), then __dict__ alone will no longer work as a technique for transforming the user to a JSON-like structure.
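
To illustrate the caveat, here is a small sketch using a hypothetical Profile class in place of twint.user.user. It shows one way to stay robust, by handing json.dumps a default hook that expands nested objects; this is a general Python technique, not something twint provides.

```python
import json


class Profile:
    """Hypothetical object whose attribute may hold another Profile."""

    def __init__(self, name, pinned_by=None):
        self.name = name
        self.pinned_by = pinned_by


def to_jsonable(obj):
    """Fallback for json.dumps: expand objects it cannot encode on its own."""
    if hasattr(obj, "__dict__"):
        return vars(obj)
    return str(obj)  # last resort for anything without attributes


def dump(obj):
    # json.dumps calls the default hook for every value it cannot encode,
    # including objects nested inside already expanded dicts.
    return json.dumps(obj, default=to_jsonable, sort_keys=True)
```

Calling json.dumps(vars(profile)) directly raises TypeError as soon as pinned_by holds another Profile, which is the failure mode described above.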

I think that a lot of the code should be rewritten, but right now Twint is not at the top of my priorities.
