twint is returning duplicate records.
For example, when I scrape all tweets for "nbcnews" with the default config parameters, it returns a total of 142997 records, of which only 41 are unique tweets.
twint was installed using pip3 install --user --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2
Below is the code used to get tweets from nbcnews:
import twint
c = twint.Config()
c.Search = "from:@nbcnews"
c.Store_object = True
twint.run.Search(c)
tlist = c.search_tweet_list
The problem exists even with c.Profile_full = True
Can someone help me with solving the duplicate tweet problem?
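To make the "142997 records, only 41 unique" figure reproducible, a count of unique tweets can be computed by tweet id. The following is only a minimal sketch: the helper name `dedupe_by_id` is hypothetical, and it assumes each tweet object exposes an `id` attribute. Stand-in objects are used so it runs without twint or network access.

```python
from types import SimpleNamespace


def dedupe_by_id(tweets):
    """Return tweets with duplicate ids removed, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for tweet in tweets:
        # Assumption: every tweet object has an `id` attribute.
        if tweet.id not in seen:
            seen.add(tweet.id)
            unique.append(tweet)
    return unique


# Stand-in objects instead of real scraped tweets.
sample = [
    SimpleNamespace(id=1, tweet="first"),
    SimpleNamespace(id=2, tweet="second"),
    SimpleNamespace(id=1, tweet="first"),
]
unique = dedupe_by_id(sample)
print(f"{len(sample)} records, {len(unique)} unique")
```

With real results, the same function could be applied to the list returned by the search instead of `sample`.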
c.Limit = 1
And don't forget
c.Since = startdate
c.Until = enddate
This returns 20 unique tweets by default. Is there any parameter to increase the number of unique tweets?
As mentioned above, c.Limit = 1 would only return 20 unique tweets. But we could use "Since" and "Until" differently.
One "not so efficient" way of getting more tweets is to iterate over the dates and fetch the tweets for one day at a time.
The following returned only 33 unique tweets for 9 days:
c = twint.Config()
c.Search = "from:@nbcnews"
c.Store_object = True
c.Limit = 1000
c.Since = "2015-12-01"
c.Until = "2015-12-10"
twint.run.Search(c)
tlist = c.search_tweet_list
Whereas the code below returns 266 unique tweets for the same date range:
import twint
from datetime import datetime
from datetime import timedelta

c = twint.Config()
c.Search = "from:@nbcnews"
c.Store_object = True
c.Limit = 50

def get_tweet(config):
    # Run one search and return the tweets stored on the config object.
    twint.run.Search(config)
    tlist = config.search_tweet_list
    return tlist

start_date = "2015-12-01"
stop_date = "2015-12-10"
start = datetime.strptime(start_date, "%Y-%m-%d")
stop = datetime.strptime(stop_date, "%Y-%m-%d")
output_tweets = []
# Query one day at a time and accumulate the results.
while start < stop:
    next_day = start + timedelta(days=1)
    c.Since = str(start)
    c.Until = str(next_day)
    start = next_day
    output_tweets += get_tweet(c)
Edit: For the above script to work, twint needs to be installed as suggested in #917, using pip3 install --user --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2. I have been using this technique intensively and it seems to fetch all the tweets.
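The day-by-day iteration can be separated from the scraping itself, which makes the date windows easy to inspect before any request is sent. This is only a sketch: `daily_windows` is a hypothetical helper name, not part of twint, and it formats the dates with `str()` exactly as the script above does.

```python
from datetime import datetime, timedelta


def daily_windows(start_date, stop_date, fmt="%Y-%m-%d"):
    """Yield (since, until) string pairs, one per day, from start_date up to stop_date."""
    start = datetime.strptime(start_date, fmt)
    stop = datetime.strptime(stop_date, fmt)
    while start < stop:
        next_day = start + timedelta(days=1)
        yield str(start), str(next_day)
        start = next_day


windows = list(daily_windows("2015-12-01", "2015-12-10"))
print(len(windows))
print(windows[0])
```

Each pair could then be assigned to c.Since and c.Until before calling twint.run.Search(c), as in the loop above.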
Thanks for your post. I fixed it from this perspective
@HegdeChaitra this has been solved in #944.
Please close this issue.
I am using a terminal to execute twint. I have the same problem, and I tried to use 'since' and 'until', but it does not work for me.
Also, if I use the method you have given above, I get an error like this. Please help me!
AttributeError: partially initialized module 'logging' has no attribute 'getLogger' (most likely due to a circular import)
Can you give details on how you have installed it and what exactly is the code you are running?
I installed twint using git
git clone --depth=1 https://github.com/twintproject/twint.git
cd twint
pip3 install . -r requirements.txt
To be safe, after installing it, I installed it again using pip
pip3 install twint
After that, I opened a Python file in \twint\twint and ran the code you have given.
`import twint
from datetime import datetime
from datetime import timedelta
c = twint.Config()
c.Search = "from:@nbcnews"
c.Store_object = True
c.Limit = 50
def get_tweet(config):
    twint.run.Search(config)
    tlist = config.search_tweet_list
    return tlist
start_date = "2015-12-01"
stop_date = "2015-12-10"
start = datetime.strptime(start_date, "%Y-%m-%d")
stop = datetime.strptime(stop_date, "%Y-%m-%d")
output_tweets = []
while start < stop:
    next_day = start + timedelta(days=1)
    c.Since = str(start)
    c.Until = str(next_day)
    start = next_day
    output_tweets += get_tweet(c)`
As soon as I ran the code, the error occurred.
AttributeError: partially initialized module 'logging' has no attribute 'getLogger' (most likely due to a circular import)
You mentioned that you opened a Python file in \twint\twint and ran the code from there.
It seems to me that this might be the reason you are getting the circular import error.
Try to create the .py file outside the twint package directory.
Once you have installed _twint_ you can access it from anywhere on your system so there's no need to stay in the twint package directory.
Also, first try to uninstall twint, then do a fresh install using
pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
After that simply run the following line in your terminal and see if that works
twint -u realDonaldTrump
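One common cause of a "partially initialized module 'logging'" error is a file or package in the script's directory that has the same name as a standard-library module, so Python imports the local one first. Whether that is what happened here is an assumption. The sketch below, with the hypothetical helper `find_shadowing_modules`, lists such names in a directory so they can be checked.

```python
import sys
from pathlib import Path

# sys.stdlib_module_names exists on Python 3.10+; the small fallback set is
# only an illustrative assumption for older interpreters.
_FALLBACK = frozenset({"logging", "datetime", "json", "re", "time", "random"})
STDLIB_NAMES = getattr(sys, "stdlib_module_names", _FALLBACK)


def find_shadowing_modules(directory):
    """Return sorted names of modules/packages in `directory` that share a stdlib name."""
    names = set()
    for entry in Path(directory).iterdir():
        if entry.suffix == ".py":
            name = entry.stem
        elif entry.is_dir() and (entry / "__init__.py").exists():
            name = entry.name
        else:
            continue
        if name in STDLIB_NAMES:
            names.add(name)
    return sorted(names)


if __name__ == "__main__":
    # Run this from the directory that contains your script.
    print(find_shadowing_modules(Path.cwd()))
```

An empty list means nothing in that directory shadows a standard-library module by name. Any name it does print is worth a look before reinstalling.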
@HegdeChaitra @yangyangdotcom
One more thing,
There is no c.search_tweet_list in the master branch of twint.
The code you are using is outdated: it was written for the proposed Pull Request #917, which wasn't actually merged into master because it was returning incomplete tweets at that time.
I followed your instructions, but it does not work for me. It does scrape some of Donald Trump's tweets successfully, but the following error occurred halfway through the process and twint stopped scraping.
CRITICAL:root:twint.run:Twint:Feed:noData'tweet'
sleeping for 15 secs
I also tried to execute it from an IDE; no tweets were scraped and the following error occurred.
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 1.0 secs
Does this happen only to me?
As mentioned in the first comment on this issue, the above code/method works when you install twint using pip3 install --user --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2