Twint does not cache results, queries or anything else. Every single piece of data is provided by Twitter. Running the same search multiple times can be useful; for example, you could monitor specific hashtags and see which users delete more tweets than others. On the other hand, you will most probably get duplicates if you don't filter the data with since and until (or other parameters).

If the hashtag is really popular, I'd split the searches into months, if not weeks.
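A minimal sketch of the splitting idea, assuming the upper bound of each window is exclusive (as the one-day Since/Until query later in this thread suggests). The helper name date_windows is made up for illustration; each pair would be assigned to c.Since and c.Until for one run.

```python
from datetime import date, timedelta


def date_windows(start, end, days=7):
    """Yield (since, until) ISO date strings covering [start, end) in chunks of `days`."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # never run past the requested end date
        yield cursor.isoformat(), upper.isoformat()
        cursor = upper


# Weekly windows for part of December 2019 (illustrative dates).
windows = list(date_windows(date(2019, 12, 1), date(2019, 12, 20)))
```

Consecutive windows share a boundary date, so under the exclusive-upper-bound assumption no day is scraped twice.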

pielco11 on 12 Dec 2019

Thanks. So, is it normal though that if I run the exact same search for the same query 12 hours apart that I get very different numbers of tweets back? When I ran it yesterday I got ~6800 results and when I ran it again this morning I only got ~4700. (When I ran it a third time it returned 0 tweets??)

It's not really that big a deal, I'm just hoping to understand the expected behaviour to explain to analysts when I explain the data to them. If I'm only getting a subset, or a random subset of the data, that's fine, but I need to understand that before they ask me about it.

Thanks

jomorrcode on 12 Dec 2019

If you run the same script over time, with the same date-time ranges and other parameters, the dataset should be "constant", since it does not depend on when you run the script.

So if we see a variation, and the only things that can vary are the time interval or the dataset itself, I guess that (in this case) it is the dataset that's changing. Most probably this is due to deleted tweets, accounts that go from open to closed, or accounts that get deleted/suspended.

Let's say that tweets with id = 1,2,3 are sent in the time interval (A,B). If we run the script at C with C > B, and the tweets don't get deleted/shadowed _et similia_, we'll always get tweets with id = 1,2,3. The interval is closed and nothing can get in or out by itself. The only way a tweet can leave that interval is by being deleted/shadowed and so on.
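To check this reasoning against two actual runs, one option is to compare the sets of tweet ids they returned. This is only a sketch; compare_runs is a hypothetical helper, and the ids would come from the id column of each run's output.

```python
def compare_runs(ids_a, ids_b):
    """Report which tweet ids appear in only one of two runs of the same query."""
    a, b = set(ids_a), set(ids_b)
    return {
        "only_in_a": sorted(a - b),  # returned by the first run only
        "only_in_b": sorted(b - a),  # returned by the second run only
        "common": len(a & b),
    }


# Using the ids from the example above: tweet 2 is missing from the second run.
diff = compare_runs([1, 2, 3], [1, 3])
```

If the ids missing from the later run are a handful scattered over the interval, deletions are plausible; if a whole tail of the interval is missing, the run probably stopped early.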

pielco11 on 12 Dec 2019

Thank you. I guess my strange results were a glitch. I mean, it's unlikely that 2000 tweets got deleted for that hashtag overnight. I'll keep experimenting.

Much appreciated.

jomorrcode on 12 Dec 2019

I can guarantee you that what Twint returns is what Twitter gives. This can be proven: just run with --debug or config.Debug = True, and in the twint-request_urls.log file you'll see every request made, which you can replicate with any software of your choice.

You can run the script with your handle as target, if you always get the same tweets everything is fine otherwise it needs to be investigated.

Best of luck!

pielco11 on 12 Dec 2019

Thanks, I'm sure it's something I'm doing. I will keep at it.

jomorrcode on 12 Dec 2019

Sorry to come back to this, but just wanted you to be aware of the results of my testing.
I ran this exact code multiple times in a row, and each time it returned very different results. I don't know if there is something wrong with my code (please tell me if so!) or if there is something else going on.

(running with python 3.6 and fully updated twint)

import twint

c = twint.Config()
c.Hide_output=True
c.Pandas_clean=True
c.Pandas=True
c.Search="#nfl"
c.Since='2019-12-01'
c.Until='2019-12-02'

twint.run.Search(c)

I just ran that exact code 4 times in a row. It returned this many tweets:
Run #1: 1,909 tweets
Run #2: 280 tweets
Run #3: 13,207 tweets
Run #4: 3,015 tweets

I'm really not sure what to do with that at this point? Am I doing something wrong? On the three runs with the lower values, it also returned this

CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!

Any advice would be welcome.

Edit: For what it's worth, it's definitely something to do with the feed being disrupted. That error is being thrown in the Feed method of Twint, and it seems to happen most of the time. I re-ran that script in a loop a bunch of times, and 13,207 tweets seems to be the actual correct number, but it doesn't come back with that very often.

jomorrcode on 13 Dec 2019

I've tried your query and I can confirm that the results are not consistent. That's really strange and needs to be investigated.

I'll keep you updated

pielco11 on 13 Dec 2019

Thanks!

jomorrcode on 13 Dec 2019

So it seems that using HTTPS or not does not always have an effect. For now, my findings are here (I've run the script 3 times):

==> nfl1.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

==> nfl2.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

==> nfl3.csv <==
1201274724778565632,1201274724778565632,1575241187000,2019-12-01,23:59:47,CET,1180200593962369024,yotesglendale,GlendaleCardinals,,Brandin Cooks leaves Budda Baker in the dust💨 #NFL #NFLSunday #LARams #LAvsAZ #redsea pic.twitter.com/QowdS89R0S,[],[],[],0,0,0,"['#nfl', '#nflsunday', '#larams', '#lavsaz', '#redsea']",[],https://twitter.com/YotesGlendale/status/1201274724778565632,False,,1,,,,,,,"[{'user_id': '1180200593962369024', 'username': 'YotesGlendale'}]",,,,

So as we can see, Twint always starts at the same point, which is good.

Now we have to see where it stops:

==> nfl1.csv <==
1201248296645398528,1201248296645398528,1575234886000,2019-12-01,22:14:46,CET,178163508,randi_heatlifer,Randi Hilsercop,,The jets are so bad that the bengals just beat them 😂😂 #NYJvsCIN #NFL  pic.twitter.com/5OEHqX1I45,[],[],[],0,0,0,"['#nyjvscin', '#nfl']",[],https://twitter.com/Randi_heatlifer/status/1201248296645398528,False,,1,,,,,,,"[{'user_id': '178163508', 'username': 'Randi_heatlifer'}]",,,,

==> nfl2.csv <==
1200912390356783104,1200912390356783104,1575154800000,2019-12-01,00:00:00,CET,4059670933,blowoutbuzz,BlowoutBuzz,,YourDozen: Get your NFL Week 13 picks in to win 2017-19 sets >>  http://bit.ly/2Ok7pdQ  #collect @PaniniAmerica #TheHobby #NFL #predictions #picks pic.twitter.com/ZQFTduKvrh,['paniniamerica'],['http://bit.ly/2Ok7pdQ'],['https://pbs.twimg.com/media/EKkj3X2WsAALfcr.jpg'],0,0,0,"['#collect', '#thehobby', '#nfl', '#predictions', '#picks']",[],https://twitter.com/BlowoutBuzz/status/1200912390356783104,False,,0,,,,,,,"[{'user_id': '4059670933', 'username': 'BlowoutBuzz'}, {'user_id': '44128979', 'username': 'PaniniAmerica'}]",,,,

==> nfl3.csv <==
1201245479746646022,1201245479746646022,1575234215000,2019-12-01,22:03:35,CET,2691199254,seahawksreddit,/r/Seahawks,, https://ift.tt/2DyJ51V  Ravens Win! #Seahawks #NFL #GoHawks,[],['https://ift.tt/2DyJ51V'],[],0,1,6,"['#seahawks', '#nfl', '#gohawks']",[],https://twitter.com/SeahawksReddit/status/1201245479746646022,False,,0,,,,,,,"[{'user_id': '2691199254', 'username': 'SeahawksReddit'}]",,,,

If we take a look at the twint-last-request.log file (when Twint exits with an error):

      <div id="main_content">
            <div class="system">
      <div class="blue">
        <table class="content">
          <tr>
            <td>
              <div class="title">Sorry, that page doesn't exist</div>
              <div class="subtitle"><a href="/">Back to home</a></div>
            </td>
          </tr>
        </table>
      </div

If we take a look at the last scraped tweet in nfl2.csv, we see that its time is 00:00:00 (relative to my TZ), which is good: it tells us that we reached the end of "my day".

A note about the time zone. If we run the same query from two different local time zones, we'll most probably get different results, since my start (end) of the day is different from yours. That said, our aim is not for one person to get another's results; our aim is to get the same results each time we ask for them, comparing each person's results individually. (FYI I got 25282 tweets)
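The time-zone point can be checked with the millisecond timestamp from the last row of nfl2.csv (1575154800000). A small sketch, assuming CET is a fixed UTC+1 offset (which holds in December); tweet_time is just an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone

CET = timezone(timedelta(hours=1), "CET")  # fixed offset; valid outside summer time


def tweet_time(epoch_ms, tz):
    """Convert a millisecond epoch timestamp into an aware datetime in `tz`."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=tz)


local = tweet_time(1575154800000, CET)         # 2019-12-01 00:00:00 in CET
utc = tweet_time(1575154800000, timezone.utc)  # 2019-11-30 23:00:00 in UTC
```

The same tweet falls on 1 December in CET but on 30 November in UTC, which is why two people in different time zones get different "days" for the same Since/Until values.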

Reasons why the issue might be related to HTTP(s) switch:

- since the switch we've had error messages
- in the past we've tested HTTPS and got the error messages

Reasons why the issue might not be related to that switch:

- the error messages are printed at first, then those vanish
- for valid requests of every run (HTTP or not), the content (tweets) is always returned

What happens when Twint gets those error messages:

1) Twint changes the UserAgent
https://github.com/twintproject/twint/blob/3a4f778233257dd902f6557a38998bfcc3a046bc/twint/run.py#L89-L94
2) Twint re-runs the same request (since self.feed and self.init don't change, the request's params are still the same)

Sometimes, luckily, error messages are not printed, even when running the same query. In that case, the only thing that differs from the other runs is the UserAgent.

So maybe Twitter behaves differently based on the UserAgent specified.

Updates soon.

pielco11 on 13 Dec 2019

It seems that using Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36 as UserAgent (as suggested by @o7n in #587), allows us to get almost all the expected results

So I suggest you edit lines 158 and 159 in url.py to specify that user agent, or another one suggested. Then please run it and let me know if you get consistent results @jomorrcode

pielco11 on 13 Dec 2019

Hmmm. My installed url.py only has 146 lines (installed twint 2.1.9 with pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint as per the instructions).

I looked through it and couldn't find any reference to user agent. I found it in several places in run.py, but I didn't want to just start randomly changing the code without knowing what I was doing.

jomorrcode on 13 Dec 2019

Sorry, I meant in get.py, my bad
https://github.com/twintproject/twint/blob/3a4f778233257dd902f6557a38998bfcc3a046bc/twint/get.py#L155-L161

Here is what you'll end up with:
[image]

pielco11 on 13 Dec 2019

Sorry, finally had a chance to try this. Yes, I made that change with that user agent and ran the code 4 times, and got back 13,011 tweets each time.

jomorrcode on 14 Dec 2019

Just to update, running with that user agent usually seems to bring back a fairly complete list of tweets, but it does randomly fail, and then a smaller subset is returned. I am playing around with a workaround that looks at the last tweet returned to see if its timestamp is close to 00:00:00, and if not, redoes the query. I'm not sure if there's a more effective way to detect that the scrape finished early.

jomorrcode on 14 Dec 2019

It would help a lot if there was some way to know if there were any errors during the search or profile request. Now the only indication of something going badly is a message on stderr, if any.
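One possible way to detect errors from a script, sketched under the assumption that Twint emits these messages through Python's standard logging module on the root logger (the "CRITICAL:root:" prefix in the output above matches the default logging format, but this is not confirmed here). ErrorCounter is a hypothetical helper, not part of Twint.

```python
import logging


class ErrorCounter(logging.Handler):
    """Collect every CRITICAL record that reaches the logger it is attached to."""

    def __init__(self):
        super().__init__(level=logging.CRITICAL)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


counter = ErrorCounter()
logging.getLogger().addHandler(counter)  # attach to the root logger

# After a run (e.g. twint.run.Search(c)), a non-empty counter.messages would
# indicate that at least one request failed during the scrape.
```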

o7n on 14 Dec 2019

I thought I'd just add that after a lot more experimenting, results continue to be inconsistent regardless of the User Agent, though quite randomly so. Sometimes I can run the same code 3 times and get the exact same number of tweets and other times it will return a much smaller number, or even zero tweets.

jomorrcode on 19 Dec 2019

And does it happen regardless of the query?

pielco11 on 19 Dec 2019

Huh, oddly it does seem to vary with the query. Some queries I tried seem to always return the same number, others vary a lot. Really strange that. That football hashtag (#nfl) is always very variable, but something like #france or #germany seems to be consistent.

jomorrcode on 19 Dec 2019

That makes the debugging even harder, as of now I'd exclude a flaw in Twint. So I guess that somehow the issue is related to Twitter

It'd be interesting to check whether it returns fewer tweets even when it reaches the end of the day. If instead it stops earlier for unknown reasons, you could just resume from that point.

To try this out just run something like twint -s "#nfl" --debug --resume "test_1.session" --since "2019-12-18" --until "2019-12-19" --csv -o "test_1.csv"

import twint

c = twint.Config()
c.Search = "#nfl"
c.Debug = True
c.Resume = "test_1.session"
c.Since = "2019-12-18"
c.Until = "2019-12-19"
c.Store_csv = True
c.Output = "test_1.csv"

twint.run.Search(c)

If it does not stop at 00:00:00, you can just re-run the script/command as is and it will resume from where it stopped. You might want to apply some more complex logging, but keeping track of session files would be enough. Please consider that the session file is overwritten at every run, so you may want to do something like python3 script.py && cat test_1.session >> history.sessions or twint .... >> history.session. That way we can compare the session ids with the debug files (twint-request_urls.log is not overwritten at every run).
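If you run Twint from a Python script rather than the shell, the `cat test_1.session >> history.sessions` step can be done in the script itself. A minimal sketch; archive_session is a hypothetical helper, and the file names are the ones used in the example above.

```python
from pathlib import Path


def archive_session(session_path, history_path):
    """Append the current session file's content to a history file.

    Returns False if the session file does not exist (e.g. nothing was saved).
    """
    session = Path(session_path)
    if not session.exists():
        return False
    content = session.read_text(encoding="utf-8").rstrip("\n")
    with open(history_path, "a", encoding="utf-8") as history:
        history.write(content + "\n")  # one line per run
    return True


# Call this after each run, before the next run overwrites the session file:
# archive_session("test_1.session", "history.sessions")
```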

pielco11 on 19 Dec 2019

Oh, that is perfect. I will try that. I was trying to do something like that myself by checking for whether the last tweet collected was close to 00:00:00, but when it wasn't I had to rerun the whole script instead of just restarting from where it stopped.

So if I understand correctly, as long as the script hasn't terminated, re-running twint.run.Search(c) will restart from the last tweet collected (assuming it hadn't reached 00:00:00), so a simple loop with a check for the latest time collected should do the trick.

Thanks, you've been very helpful with this.

jomorrcode on 19 Dec 2019

by checking for whether the last tweet collected was close to 00:00:00

That is a bit too subjective for me. When is a tweet "close" to 00:00:00 ? And for search queries with a lower volume there is no visible difference between "no tweet exists" and "no tweet was received".

o7n on 19 Dec 2019

That is a bit too subjective for me. When is a tweet "close" to 00:00:00 ? And for search queries with a lower volume there is no visible difference between "no tweet exists" and "no tweet was received".

This is very true. For what it's worth, because I was looking at fairly active search terms, I was considering it an incomplete search if no tweets were returned within 30 mins of 00:00:00, but as you say, that's not actually an effective way to handle this since it could easily happen that by chance there were no such tweets. I was including a cut out so that if the search ran 4 times without reaching 00:00:00 it would exit gracefully and print an alert. Again, a hack to get around this that doesn't actually work that well.
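For reference, the hack described above (30-minute window, give up after 4 attempts) could look roughly like this. It is only a sketch of that heuristic, with the limitation already discussed: a quiet query can fail the check even when the scrape was complete. run_once is a hypothetical callable that performs one scrape and returns the "HH:MM:SS" time of the last tweet collected, or None if nothing came back.

```python
from datetime import datetime, timedelta


def reached_start_of_day(last_tweet_time, window_minutes=30):
    """True if an "HH:MM:SS" time lies within `window_minutes` after 00:00:00."""
    t = datetime.strptime(last_tweet_time, "%H:%M:%S")
    elapsed = timedelta(hours=t.hour, minutes=t.minute, seconds=t.second)
    return elapsed <= timedelta(minutes=window_minutes)


def scrape_with_retries(run_once, max_attempts=4):
    """Re-run the scrape until the heuristic passes; return the attempt number or None."""
    for attempt in range(1, max_attempts + 1):
        last = run_once()
        if last is not None and reached_start_of_day(last):
            return attempt
    return None  # caller should print an alert: the scrape never looked complete
```

With the --resume option mentioned earlier, each retry could continue from where the previous one stopped instead of starting over.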

I don't have a better idea at the moment though.

jomorrcode on 19 Dec 2019

I do see that the file "twint-last-request.log" contains a field "has_more_items":true. Is that something that can be accessed and used as a check before the script terminates to say, if has_more_items is true, re-run the search and resume from the last tweet collected? I'm not sure where that information resides before it is written to the log file.

jomorrcode on 19 Dec 2019

@o7n you are absolutely right, but with the sample cited above we suspect that the last tweet is 1200912390356783104,1200912390356783104,1575154800000,2019-12-01,00:00:00,CET. So with this test query (since I'm unable to replicate the issue with smaller ones) we can try to better understand what happens

As you said, and as we all agree, that's not a general rule. Instead it's a case-specific one

pielco11 on 19 Dec 2019

@jomorrcode so you mean that if at request N we get has_more_items:true and at request N+1 Twint breaks, Twint retries the query since we know that there are more tweets?

That sounds good, indeed when Twint breaks we lose the information about the previous request. Let me check what we can do

pielco11 on 19 Dec 2019

I created a new branch for this

https://github.com/twintproject/twint/tree/workaround-604

I'll push fixes there, so please remember to pull from that branch!

pielco11 on 19 Dec 2019

Added a new log file twint-requests-deep.csv, that contains rows formatted as follows f"had_more_items:{self._has_more_items};has_more_items:{self.has_more_items};init:{self.init};len_feed:{len(self.feed)}"

This will help us tracking down the requests.

If a large number of tweets is missing from a bigger set, we expect to see a difference in twint-requests-deep.csv
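A rough sketch of how rows in that format could be read back, assuming the booleans are written as Python's True/False and that the init value contains no semicolons. The sample rows and their init values are made up for illustration, and stopped_early encodes one possible reading of a "suspicious" ending, not Twint's own logic.

```python
def parse_deep_row(row):
    """Turn one 'key:value;key:value' row into a dict with typed fields."""
    fields = {}
    for part in row.strip().split(";"):
        key, _, value = part.partition(":")  # split on the first colon only
        fields[key] = value
    for flag in ("had_more_items", "has_more_items"):
        fields[flag] = fields.get(flag) == "True"
    fields["len_feed"] = int(fields.get("len_feed", "0"))
    return fields


def stopped_early(rows):
    """True if the last request returned nothing although the previous one promised more."""
    if not rows:
        return False
    last = parse_deep_row(rows[-1])
    return last["had_more_items"] and last["len_feed"] == 0


# Made-up sample rows in the format described above.
sample = [
    "had_more_items:True;has_more_items:True;init:cursor-a;len_feed:20",
    "had_more_items:True;has_more_items:False;init:cursor-a;len_feed:0",
]
```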

pielco11 on 19 Dec 2019

I may not have a chance to do it before the holidays, but I will do some structured tests when I get back and try and keep some records of results etc.

jomorrcode on 19 Dec 2019

Hi, I did a couple of tests using both the Master branch and then the workaround branch, with a fresh install, I ran your query for "#nfl" for 2019-12-01 to 2019-12-02. I made sure to uninstall the Master branch before installing the workaround branch.

Results for 4 runs with Master Branch:
Test 1: time= 1 min, total tweets = 680, last tweet = 1201347059531468800 @ 22:47:13
Multiple errors:
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Test 2: time= 10 sec, total tweets = 0
No errors returned at all

Test 3: time= 11 mins, total tweets = 13,074, last tweet = 1201003007342723073 @ 00:00:05
A few errors:
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

Test 4: time= 1 min, total tweets = 100, last tweet = 1201361765176659969 @ 23:45:39
Multiple errors:
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Results for 4 runs with Workaround branch
Test 5: time = 14 mins, total tweets = 13,074, last tweet = 1201003007342723073 @ 00:00:05
Dozens of errors
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1

Test 6-8: basically identical to #5 (between 13,070 and 13,080 tweets, all ending with the same last tweet)

If it's useful, I saved the session files, request logs, and deep_test files for each of those, as well as the actual tweet CSVs. Hope that helps.

jomorrcode on 20 Dec 2019

twint -s meloni --since 2019-01-01 --until 2019-12-25 --stats --count -es localhost:9200
(I launched this after modifying the user agent as described here: https://github.com/twintproject/twint/issues/604#issuecomment-565633980)
[Screenshot, 2019-12-26 21-54-31]
Here is the result:
twint -s meloni --since 2019-01-01 --until 2019-12-25 --stats --count -es localhost:9200
[+] Indexing to Elasticsearch @ localhost:9200
........................................................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
........................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
....................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
............................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
............................................................................................................................................................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
............................................................................................................................................................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
................................................................................................................................................................................................................................................CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!
[+] Finished: Successfully collected 2718 Tweets.

twint -s meloni --since 2019-01-01 --until 2019-12-25 --stats --count -o meloni.json

It works well, and I launched it before modifying the user agent file, but it is not sending anything to the Elasticsearch DB. During the search, the shell output shows no dots, only the tweet text.

Could Elasticsearch be the problem?

AldebaranNapoli on 26 Dec 2019

Downgraded to the version 2.1.6 and it's working !

Aassifh on 7 Jan 2020

Thanks @Aassifh for notifying us, I'll do some tests!

pielco11 on 7 Jan 2020

How did you downgrade?

AldebaranNapoli on 7 Jan 2020

pip install twint==2.1.6

Aassifh on 8 Jan 2020

I installed 2 and it finally works ... so the problem is the latest version with Elasticsearch ...

AldebaranNapoli on 8 Jan 2020

I was facing this issue when I travelled from Boulder to India. I was constantly shown the warning/error 'CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)' and the output was inconsistent. I tried many different things, but I kept facing this error in India. When I was back in Boulder, I ran the same piece of code again, and guess what? It ran without any errors!

So I feel that at least in my case, it was some regional setting of the network w.r.t accessing Twitter data which was causing this error, I can't say in detail though.

UpasanaDutta98 on 13 Jan 2020

Maybe it has something to do with the timezone! Check it!

Aassifh on 13 Jan 2020

It sounds like you are battling a machine learning process that's trying to detect scraping.

edsu on 8 Feb 2020

I'm starting to think the same thing, @edsu.

nonameable on 15 Feb 2020

I would not exclude that, I think that's totally possible

pielco11 on 15 Feb 2020

This feels more like simple rate limiting or capacity management mechanisms.

o7n on 15 Feb 2020

Might be that the detector uses a specific algorithm, that may run in a ML process or not, and the effect is rate limiting

pielco11 on 15 Feb 2020

And how would it be more effective than just rate limiting on its own? This is all just speculation, there is not a single piece of evidence for a ML process.

o7n on 15 Feb 2020

@o7n actually nobody said for sure that behind this issue there's a ML process. We are just sharing our thoughts

pielco11 on 15 Feb 2020

I’m just sharing my thoughts as well.

o7n on 15 Feb 2020

I know it's a fresh issue, but are there any more ideas for a solution? I've tried downgrading the Twint version and it didn't work. Once I have scraped between 8 and 10 MB of raw tweets to a CSV file, it displays the discussed error. PS: awesome tool!

GeekOnAcid on 25 Feb 2020

I used to see this message while collecting tweets:
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
But the messages were showing up intermittently and the process continued with no issues. Now, the process stops after receiving the first message and prints a sequence of the above messages. Then, for ~1 minute the messages keep showing up, and after that Twint can successfully collect tweets as before, until those messages appear again.
I wonder if we can pause the process for a few seconds and then start it over. Is that possible? I haven't looked into the code yet.

pbabvey on 28 Feb 2020

As a temporary solution, I tried this, and the collector no longer stopped. I changed the part of the Twint code in run.py just before it gets a random UserAgent.

                if consecutive_errors_count < self.config.Retries_count:
                    #################################
                    delay = random.randint(60, 120)
                    print('sleeping for {} secs'.format(delay))
                    time.sleep(delay)
                    #################################
                    self.user_agent = await get.RandomUserAgent()

                    continue
                logme.critical(__name__+':Twint:Feed:Tweets_known_error:' + str(e))
                print(str(e) + " [x] run.Feed")
                print("[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!")
                break

It seems the collector is able to start over from the place it left off after 2 min. I know this is totally inefficient. So, please share your better solution to counteract the Twitter limitations.

pbabvey on 28 Feb 2020

So, please share your better solution to counteract the Twitter limitations.

I've also run into this error for larger scrapes. If it's, in fact, an issue with rate-limiting from twitter, what would it take to implement a config option that slows down requests by a set # of seconds (e.g., new tweet is scraped every 3 seconds)?

zoltanpm on 10 Mar 2020

How about a binary exponential backoff? Every time it fails it doubles the sleep time.

            if consecutive_errors_count < self.config.Retries_count:
                #################################
                delay = 2 ** consecutive_errors_count
                print('sleeping for {} secs'.format(delay))
                time.sleep(delay)
                #################################
                self.user_agent = await get.RandomUserAgent()
                continue
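A self-contained sketch of the same backoff idea, outside of Twint's run.py. The cap and the retry helper are additions for illustration, not part of the suggestion above; the helper uses asyncio.sleep instead of time.sleep on the assumption that the surrounding code is a coroutine (the snippet above uses await), so the event loop is not blocked while waiting. The error type caught is ValueError because JSON decode failures ("Expecting value: line 1 column 1") are a subclass of it.

```python
import asyncio


def backoff_delay(consecutive_errors, base=2, cap=300):
    """Delay in seconds: base ** errors, so it doubles per failure when base=2, up to `cap`."""
    return min(cap, base ** consecutive_errors)


async def retry_with_backoff(fetch, retries=5, base=2, cap=300, sleep=asyncio.sleep):
    """Await fetch(); on ValueError wait with exponential backoff and try again."""
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except ValueError:
            if attempt == retries:
                raise  # out of retries: let the caller see the error
            await sleep(backoff_delay(attempt + 1, base, cap))


delays = [backoff_delay(n) for n in range(1, 6)]
```

The sleep parameter is injectable only so the helper can be exercised without actually waiting.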

o7n on 10 Mar 2020

By the way, can we make those errors/warning messages print to stderr instead of stdout? It would make handling errors so much easier.

o7n on 10 Mar 2020

How about a binary exponential backoff? Every time it fails it doubles the sleep time.

Sounds good!

pbabvey on 10 Mar 2020

How about a binary exponential backoff? Every time it fails it doubles the sleep time.

            if consecutive_errors_count < self.config.Retries_count:
                #################################
                delay = 2 ** consecutive_errors_count
                print('sleeping for {} secs'.format(delay))
                time.sleep(delay)
                #################################
                self.user_agent = await get.RandomUserAgent()
                continue

I like it. But for those of us regularly scraping large numbers, it'd be nice to have a pre-set delay that doesn't require an error to trigger. I could implement that manually in my script (iterate backwards by day or week with a pause between every item in the range), but it'd be more elegant to have a delay built-in via config option.

zoltanpm on 11 Mar 2020

👍1

Yeah, I agree, but I think those are two different features.

o7n on 11 Mar 2020

I made a pull request here https://github.com/twintproject/twint/pull/686 with features for configurable delays and exponential backoff with a configurable base. It also includes a bug fix for the case where, after the first error, the user agent would become random again, making things worse, and it prints errors to stderr.

o7n on 11 Mar 2020

684

pielco11 on 12 Mar 2020

I am scraping a lot of data on each daily run, over a large area, and #686 did not solve my issue. It would not restart once I hit the limit, so I added the following alongside it:

            elif self.config.TwitterSearch:
                self.feed, self.init = feed.Json(response)
            ########################################
            if self.config.Max_delay:
                delay = random.randint(self.config.Min_delay, self.config.Max_delay)
                sys.stderr.write('sleeping for {} secs\n'.format(delay))
                time.sleep(delay)
            #########################################

Depending on the type of scraping the user needs a different type of delay timer. This has worked for me with the Min_delay = 0 and Max_delay = 3. We can probably make this sample from a normal distribution shifted towards 0 if you need it to be faster.
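
To make the "normal distribution shifted towards 0" idea concrete, here is one possible sketch. The function names are made up, and the choice of sigma (one third of the range) is purely an assumption for illustration; nothing here is part of twint.

```python
import random


def uniform_delay(min_delay, max_delay, rng=random):
    """Whole-second delay, uniform over the range (same idea as random.randint above)."""
    return rng.randint(min_delay, max_delay)


def skewed_delay(min_delay, max_delay, rng=random):
    """Delay biased towards min_delay: |N(0, sigma)| added to min_delay, clipped at max_delay."""
    # Assumption: sigma is a third of the range, so few samples need clipping.
    sigma = (max_delay - min_delay) / 3.0
    offset = abs(rng.gauss(0.0, sigma))
    return min(max_delay, min_delay + offset)
```

With Min_delay = 0 and Max_delay = 3, the skewed version sleeps for less than a second most of the time but still occasionally waits close to the maximum.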

Aaronzinhoo on 14 Mar 2020

👍1

I found another workaround for this problem. It seems that Twitter blocks many requests from the API. When we use a Tor proxy, the IP address changes every 10 minutes.

markowanga on 14 Mar 2020

I'm also facing this issue. I think it may be caused by some rate limit on Twitter's servers. However, I tried to set up a daily interval search in my code and ran it every 1/2/3/4 minutes. In practice, this issue does not occur at 4-minute intervals. So I think this rate limit may apply over a specific time range. For example, the issue occurs if there are too many requests within three minutes.

yipkaitung on 1 Apr 2020

Yes, that is how rate limiting works. It kicks in when the number of requests over a sliding time interval (i.e. the last x minutes) is higher than a certain limit.
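
As an illustration of a sliding-window limit, here is a small sketch. It is a generic model, not Twitter's actual implementation: the class name is made up, and the limit and window size are unknown values you would have to guess.

```python
from collections import deque


class SlidingWindowCounter:
    """Allow at most `limit` requests within any `window_seconds` interval."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now):
        # Forget requests that have slid out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.limit:
            return False  # too many requests in the last window_seconds
        self._timestamps.append(now)
        return True
```

This is why spacing the same number of requests further apart can make the error disappear: the old requests fall out of the window before the new ones arrive.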

o7n on 1 Apr 2020

Hi there

I have been suffering these noDataExpecting issues.
Previously I was using CLI twint and it was great - then it started failing so I started looking at using python to see if there were any workarounds - but I hit the errors again and then found this thread

I am a developer but new to python (and driving twint from python) and Git (!!)
That said, how can I download your latest changes for this bug to see if they fix my issue?

Apologies if this is the wrong place to post this - (I did say I'm new to Git right?)
If appropriate feel free to delete my post and reply to me as best

TinkerPhil on 5 Apr 2020

Just throwing this in if it helps. I've identified that twitter blocks your IP after you've scraped roughly 15000 tweets, which resets after a period of 7 minutes. A simple workaround I did was generate a list of days I'm interested in scraping over and run the twint search function for each day. After each iteration, time.sleep(600) <- just to be on the safer side, you can go with 420 if you're feeling daring.

It will definitely take a longer time for the script to run but as long as you're happy leaving the PC running it works!

limkhaixi on 11 Apr 2020

👍4

In my experience, this error is due to Twitter limiting the number of queries per time window. In my case, trying again after a while works well. Can the developers add an incremental delay before restarting the job? Apart from this, I have two naive questions: 1) where are the logs located? and 2) how can I throw an exception for this bug?

patel-zeel on 11 Apr 2020

@markowanga can you post the code? I'm also scraping a lot each day and running into this problem.

awendell989 on 13 Apr 2020

Hey all, I face the same issue.

I parametrized 120 jobs for 120 DataFrames with different user information to speed up my scraping process. For friend and follower lists there are no problems. But for timeline data I have these issues.

I would love to throw an exception for this error.

I think @patel-zeel is right and queries are limited. If I run just one job, it seems that it should be fine for twitter. But strange why it is just a problem for timelines.

However. I love your project! Thanks!

give-me-data on 15 Apr 2020

Just throwing this in if it helps. I've identified that twitter blocks your IP after you've scraped roughly 15000 tweets, which resets after a period of 7 minutes. A simple workaround I did was generate a list of days I'm interested in scraping over and run the twint search function for each day. After each iteration, time.sleep(600) <- just to be on the safer side, you can go with 420 if you're feeling daring.

It will definitely take a longer time for the script to run but as long as you're happy leaving the PC running it works!

You're right. But does someone know why this is only the case for Timelines?

give-me-data on 15 Apr 2020

For those looking to throw an exception to catch this, the place to do that appears to be in twint's run.py file. Approximately line 102, right after it prints the "[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!" message.

Here's what I have plugged in there:

logme.critical(__name__ + ':Twint:Feed:Tweets_known_error:' + str(e))
print(str(e) + " [x] run.Feed")
print("[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!")

# my exception
raise Exception("Hit Rate Limit")

break # with the addition of the exception, this break is probably unnecessary

I catch that exception, sleep for 420 seconds, and then run twint.run.Search(c) again to pick up from the resume file.

Edit: If I was less lazy, I'd build a custom error to make the error catching better, but just catching a naked error works.
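
For reference, a custom error could look roughly like the sketch below. `TwintRateLimitError` and `run_with_retry` are hypothetical names, and this assumes run.py has been patched by hand to raise that class instead of a bare Exception; `search` stands in for something like `lambda: twint.run.Search(c)`.

```python
import time


class TwintRateLimitError(Exception):
    """Hypothetical error raised when the scrape appears to be rate limited."""


def run_with_retry(search, wait_seconds=420, max_attempts=3, sleep=time.sleep):
    """Run search(); on a rate-limit error, wait and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return search()
        except TwintRateLimitError:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller decide
            sleep(wait_seconds)
```

Catching only the dedicated class means that unrelated failures (a typo, a full disk) are not silently retried for seven minutes at a time.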

jomorrcode on 15 Apr 2020

❤2 👍2

Just in case it's of interest to anyone, here is the code I am using to scrape Covid data from January, looking for early signs. With the added exception above being caught, I have it iterating day by day through the date range and it seems to work fairly flawlessly. I don't pretend to be a software engineer, so take it for what it is.

import pandas as pd
import twint
from datetime import datetime, timedelta
from time import sleep
import os

query = 'coronavirus OR virus OR 2019-nCoV OR wuhan OR #WHO OR flu OR pneumonia'



start_str = "2020-01-01"
end_str = "2020-02-01"
start_date = pd.to_datetime(start_str, format='%Y-%m-%d')
end_date = pd.to_datetime(end_str, format='%Y-%m-%d')
data_folder = "D:/Data/covid_tweets/"
filename = f"{data_folder}covid_tweets_{start_str}_{end_str}.csv"
resume_file = f"{data_folder}resume.txt"

c = twint.Config()
c.Hide_output = False
c.Store_csv = True
c.Output = filename
c.Resume = resume_file
c.Search = query
c.Lang = 'en'

while start_date < end_date:

    check = 0
    c.Since = start_date.strftime('%Y-%m-%d')
    c.Until = (start_date + timedelta(days=1)).strftime('%Y-%m-%d')

    while check < 1:
        try:
            print("Running Search: Check ", start_date)
            twint.run.Search(c)
            check += 1

        except Exception as e:
            # pause when twitter blocks further scraping
            print(e, "Sleeping for 7 mins")
            print("Check: ", check)
            sleep(420)

    # before iterating to the next day, remove the resume file (if one was written)
    if os.path.exists(resume_file):
        os.remove(resume_file)

    # increment the start date by one day
    start_date = start_date + timedelta(days=1)

jomorrcode on 15 Apr 2020

👍3 🚀1

Thank you for sharing this. As for the delay, 2~4 mins works for me.
But I wonder how long does it take roughly to collect tweet for one single day for you?

pbabvey on 15 Apr 2020

I find it totally depends upon the volume of tweets for your search terms. This morning I was looking at some covid19 keywords for 1 day in late March and it was taking over an hour to get one hour worth of tweets just because the volume was extreme. The volume of tweets that had been sent during that hour was over 1000 per minute!

Obviously a less active search topic will be faster. The search I posted above has so far taken about 5 hours to get the first 25 days of January.

Edit: For scale, those 25 days have returned about 300,000 tweets.

jomorrcode on 15 Apr 2020

Hey all, I face the same issue.

I parametrized 120 jobs for 120 DataFrames with different user information to speed up my scraping process. For friend and follower lists there are no problems. But for timeline data I have these issues.

Could you clarify: did you run 120 search jobs in parallel using Twint? Did you try to search for tweets using keywords as well? And if so, how does it work? I didn't try more than 5 scrapes in parallel.

pbabvey on 15 Apr 2020

I would guess that running many processes in parallel from the same IP address would greatly increase your chances of being banned, or at the very least would slam into the rate limit very quickly. Just a guess though.

jomorrcode on 15 Apr 2020

The search I posted above has so far taken about 5 hours to get the first 25 days of January.

I wonder why it works much slower for me. Even with some narrowing conditions like Min_likes or Min_replies, it takes ~100 times longer.

pbabvey on 15 Apr 2020

Hey all, I face the same issue.
I parametrized 120 jobs for 120 DataFrames with different user information to speed up my scraping process. For friend and follower lists there are no problems. But for timeline data I have these issues.

Could you clarify: did you run 120 search jobs in parallel using Twint? Did you try to search for tweets using keywords as well? And if so, how does it work? I didn't try more than 5 scrapes in parallel.

I scrape Tweets, Friends and Followers for more than 30k users via Twint. I parametrized via bash.

for l in $(seq 0 $batchlenght $END); do
python3 friends_data_pipeline.py &
python3 follower_data_pipeline.py &
python3 timeline_data_pipeline.py &
done

Code snippet:

`def get_timeline(self):

    if self.disable_twint_logger:
        os.environ["TWINT_DEBUG"] = "debug"

    if self.batch:
        print("\n" + "Start Fetching Timeline Data Batchwise")
        self.batch = int(self.batch)
        print("batch_data/batch_{}.json".format(self.batch))
        new_df = pd.read_json("batch_data/batch_{}.json".format(self.batch), orient="records", lines=True,
                              convert_dates=False)

        if self.reversed_it:
            print("Reverse DataFrame")
            new_df = reversed(new_df["user_id"])
        else:
            new_df = new_df["user_id"]

        while True:
            logging.info("\n" + "Timeline search" + "\n")
            for account_id in tqdm(new_df):
                timeline_path = os.path.join(self.TIMELINE_DIR, str("{}_timeline").format(account_id) + '.json')
                if os.path.exists(timeline_path):
                    print("exist - pass")
                    continue
                else:
                    time.sleep(10)
                    timeline = twint.Config()
                    timeline.Limit = int(self.limit)
                    timeline.Hide_output = True
                    timeline.User_id = str(account_id)
                    timeline.Output = timeline_path
                    timeline.Store_json = True
                    try:
                        twint.run.Search(timeline)
                    except Exception:
                        print("\n" + str(account_id))
                        print("Having that error because that account is suspended or protected" + "\n")
                        continue`

I only have problems with the timelines. Friends & Followers worked fine.

give-me-data on 15 Apr 2020

I would guess that running many processes in parallel from the same IP address would greatly increase your chances of being banned, or at the very least would slam into the rate limit very quickly. Just a guess though.

Yes, I think that's an issue. At least for the timelines...

Unfortunately, I'm not familiar with proxy / VPN implementations in such scraping frameworks.

I know that twint has a proxy functionality and my gut feeling says to me that this could be a good workaround. What about using a proxy pool? Does someone have experience?
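
The rotation part of a proxy pool is simple enough to sketch. `ProxyPool` is a made-up name, the addresses in the usage note are placeholders from documentation ranges, and handing the chosen proxy to twint's proxy settings is deliberately left out because I have not tested it.

```python
from itertools import cycle


class ProxyPool:
    """Hand out proxies from a fixed list in round-robin order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool is empty")
        self._proxies = cycle(proxies)

    def next_proxy(self):
        """Return the next (host, port, type) tuple, wrapping around at the end."""
        return next(self._proxies)
```

Usage would be along the lines of `pool = ProxyPool([("203.0.113.10", 1080, "socks5"), ("198.51.100.20", 1080, "socks5")])` and calling `pool.next_proxy()` before each job. Whether free proxies are reliable enough for this is a separate question.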

give-me-data on 16 Apr 2020

Sorry for the spam.

How about scraping https://www.socks-proxy.net/ to get updated IP addresses?

give-me-data on 16 Apr 2020

Judging by the rate of tweets collected, using a large time break like 420 or 300 seconds is much slower (more than 2x slower) than just using a random sleep timer of 1-10 seconds between queries. With a random sleep interval you can also run multiple twint programs at once. I run two in parallel, but it could work with more; I have yet to test its limits.

Aaronzinhoo on 17 Apr 2020

👍1

Judging by the rate of tweets collected, using a large time break like 420 or 300 seconds is much slower (more than 2x slower) than just using a random sleep timer of 1-10 seconds between queries. With a random sleep interval you can also run multiple twint programs at once. I run two in parallel, but it could work with more; I have yet to test its limits.

I run a test for 1 and 4 jobs and hit the rate limits twice. But not so fast.
However, I raise an exception (run.py) and use a random sleep timer between 1-10 seconds. Wish me luck. 19 k users left :)

give-me-data on 17 Apr 2020

👍1

@give-me-data Hope this works well because if so then this would definitely be the fastest method up to this point! good luck!

Aaronzinhoo on 17 Apr 2020

❤1

For all who are interested in why you hit the rate limits for timeline search but not for friends and follower search: I asked the developer and he said:

so friends and followers use the "no-js version" of Twitter, while the timeline uses another endpoint.

give-me-data on 17 Apr 2020

👍1

Judging by the rate of tweets collected, using a large time break like 420 or 300 seconds is much slower (more than 2x slower) than just using a random sleep timer of 1-10 seconds between queries. With a random sleep interval you can also run multiple twint programs at once. I run two in parallel, but it could work with more; I have yet to test its limits.

I run a test for 1 and 4 jobs and hit the rate limits twice. But not so fast.
However, I raise an exception (run.py) and use a random sleep timer between 1-10 seconds. Wish me luck. 19 k users left :)

Could you please elaborate a bit more on how you "raise an exception (run.py) and use a random sleep timer between 1-10 seconds."? I am stuck and am having a hard time getting this to work.

UpasanaDutta98 on 19 Apr 2020

run twint.run.Search(c) again to pick up from the resume file.

Could you please share the piece of code when you run twint.run.Search(c) again to pick up from the resume file? Thanks!

UpasanaDutta98 on 19 Apr 2020

My two above posts with code on April 15 show where I put the exception in run.py, and then the code where I am actually scraping (with a while loop to re-run twint.run.Search(c) when the exception is raised).

Note that while I used a long delay of 420 seconds, lots of other people are saying you can use a much shorter one. I haven't had a chance to try that yet, but it would make things a lot faster.

jomorrcode on 19 Apr 2020

Thanks for sharing your code! Do you think this will be provided as an option (also for the command line) in the next twint release? I have always used it from the CL and I currently don't have the time to set this up like you did.

wuqui on 20 Apr 2020

Just an update, I've tried my code with varying lengths of sleep, and anything under 420 seconds seems unreliable. I like the idea of the exponential backoff that other people have talked about so I'm going to look at that.

Adding a random 1-10 seconds between each query seems logically like it should take a lot longer though, no? If my code hits the rate limit and sleeps every 15,000 tweets for 420 seconds, that's still a lot shorter than the 15,000-150,000 seconds the random delay is adding to the scrape of those same 15,000 tweets.

Or have I misunderstood the random idea?

jomorrcode on 23 Apr 2020

@jomorrcode actually you are right. I was under the impression that each query returns 200 tweets, but after looking at it again, it is only 20. Since a 1-10 second random delay averages about 5 seconds, within 420 seconds we will get approximately

20(tweets per query) * (420/5) = ~1680, much less than scraping with the 420 second timer.
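
The arithmetic can be checked with a few lines of Python. The figures (20 tweets per query, a block after roughly 15,000 tweets, 420 seconds) are the ones reported in this thread, not measured constants, and the time each request itself takes is ignored.

```python
TWEETS_PER_QUERY = 20       # per the comment above
BLOCK_AFTER_TWEETS = 15000  # rough figure reported earlier in this thread
BLOCK_SECONDS = 420         # observed wait before scraping works again


def tweets_in_window(window_seconds, mean_delay):
    """Tweets collected in a window when every query is followed by mean_delay seconds."""
    return TWEETS_PER_QUERY * window_seconds / mean_delay


def delay_seconds_for(tweets, mean_delay):
    """Total sleep time added while collecting the given number of tweets."""
    queries = tweets / TWEETS_PER_QUERY
    return queries * mean_delay


print(tweets_in_window(BLOCK_SECONDS, 5))          # 1680.0
print(delay_seconds_for(BLOCK_AFTER_TWEETS, 5))    # 3750.0
print(delay_seconds_for(BLOCK_AFTER_TWEETS, 5.5))  # 4125.0
```

So on these numbers the per-query delay adds about 3,750 seconds of sleeping per 15,000 tweets (4,125 if the mean of an integer 1-10 delay is taken as 5.5), against a single 420 second pause.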

I believe adding an exponential timer is unnecessary since it would just force us to wait until at least 420 seconds have passed regardless.

Aaronzinhoo on 23 Apr 2020

Ah, I missed the part where the delay got added after 20 tweets. I was thinking one, but as you say, even so it would be slower. It might feel faster too because the scrape never actually pauses ;).

jomorrcode on 23 Apr 2020

👍1

Yea plus my init assumption was wrong haha, but for now, the 420 wait limit should be implemented as the standard. Suggest a PR maybe?

Aaronzinhoo on 23 Apr 2020

That would be a ridiculous idea. Using a hard coded timer won't work if there are multiple processes on the same IP address. Also, that 420 seconds value has been found using trial and error and Twitter can change it at any time.

There is already a PR which I made which has an automatic, dynamic, exponential backoff (https://github.com/twintproject/twint/pull/686) but which is apparently being ignored.

o7n on 23 Apr 2020

👍5

A hardcoded timer is definitely not the answer; it should just be an adjustable parameter with a default value of 420. If Twitter changes it, then simply adjusting the default is fine.

The exponential backoff is definitely a more sustainable solution that does not require any changes but varies in speed. I do not see the argument for why it is better for multiple processes though as the limit is hit at a certain point and regardless all processes should stop for some extended period of time. can you explain @o7n?

Aaronzinhoo on 23 Apr 2020

I tried to code a way to use a different proxy for each request but failed because of connection or proxy issues. Don't know...

give-me-data on 23 Apr 2020

That would be a ridiculous idea.

Jeez man, no one thinks a hard coded timer is the ideal situation, but no need to be rude about it. I saw your PR and thought it was the best idea. Seems to be awaiting the resolution of conflict though?

jomorrcode on 23 Apr 2020

👍1

I just re-read the code changes for @o7n 's PR. It's very good. I like the solution very much.

Edit: Not sure why it hasn't been merged, the only conflict looks like a trivial difference in the config file.

jomorrcode on 23 Apr 2020

Hi all

honestly I can't check/try/evaluate PR. I'm not ignoring your requests, comments and stuff. Simply I do not have the time for this. I'm really sorry.

If someone wants write rights, just say and I'll add him/her as collaborator

Hope you will understand my position.

Thank you,
Francesco

pielco11 on 28 Apr 2020

👍2

@pielco11 If it is possible, can I be added? I am going to be using this project for a while for work and will make improvements where and when I can.

Aaronzinhoo on 28 Apr 2020

Thanks a lot! So if I understand this correctly, thanks to the changes committed above it's now possible to add a parameter to the command line execution to avoid getting the error, right? And it seems there is a (semi-)automatic option there, which would be great.

Could you give me an example of how to use this effectively via the command line? What would I have to add to a normal command like twint -s 'QUERY' --since 2019-01-01 --until 2019-12-31 -o 'QUERY.csv' --csv --hashtags --count -l en? Help would be much appreciated :)

wuqui on 28 Apr 2020

@wuqui the error is unavoidable when you are scraping a large amount. The only thing we can do is ensure the smallest wait times once Twitter blocks our IP for a certain amount of time. All this commit does is provide an easy way to find the optimal time to wait.

In case Twitter changes the time your IP is blocked for, the dual timer will find a good value to estimate it automatically. I suggest using --min-wait-time 50 and setting --backoff-exponent 3.0. These values seem to work best under any conditions, even if Twitter changes the block time.

Aaronzinhoo on 29 Apr 2020

👍2

Hi, I am facing a related but more serious problem. When I run the script with the command "twint -s pineapple", I hit the "noDataExpecting value" error after scraping 1-3 days of tweets, across multiple trials. I tried other keywords and even another IP, but the same problem persists: I can only scrape a few days of tweets before encountering the "noDataExpecting value" error.
I am using the latest twint, python 3.6.5 and ubuntu 19.04, just wondering if you guys encounter this problem as well.

This is the latest output:
1259483985215807494 2020-05-10 22:02:36 HKT Perhaps the soft right
1259483982204461056 2020-05-10 22:02:36 HKT Pineapples can go. For a very SPECIFIC reason. 😈😈😈😈 https://twitter.com/Mawpy/status/1259237431804661760 …
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

KwanWaiChung on 11 May 2020

https://twitter.com/ZusorOW/status/1258885451055800320

This is going to be a huge issue for Twint, especially if Twitter will remove the endpoint (almost sure)

pielco11 on 16 May 2020

https://twitter.com/ZusorOW/status/1258885451055800320

This is going to be a huge issue for Twint, especially if Twitter will remove the endpoint (almost sure)

@pielco11 can you elaborate a bit? I am assuming that twint relies on this endpoint?

mr-devs on 18 May 2020

@mr-devs correct

Twint is "exploiting" the legacy version of Twitter

pielco11 on 18 May 2020

Here's something I just learned... Twitter is smart. They aren't just doing the rate limiting on individual IPs. I spun up 4 VMs on a cloud provider with 4 different IPs (but all from the same group) and Twitter simultaneously rate limited the 4 scrapers running on IPs x.x.x.a, x.x.x.b, x.x.x.c, and x.x.x.d.

Clever folks.
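
If you want to check whether a set of proxies or VMs sits in the same block before relying on them, the standard library can group them. This is only a sketch: `group_by_network` is a made-up helper, and the /24 prefix is an assumption, since nobody outside Twitter knows how addresses are actually grouped.

```python
import ipaddress
from collections import defaultdict


def group_by_network(addresses, prefix_length=24):
    """Group IPv4 addresses by the network they fall into (default: /24)."""
    groups = defaultdict(list)
    for address in addresses:
        # strict=False lets us pass a host address and get its enclosing network
        network = ipaddress.ip_network("{}/{}".format(address, prefix_length), strict=False)
        groups[str(network)].append(address)
    return dict(groups)
```

Any group with more than one address is a set of machines that differ only in the last number, like the x.x.x.a to x.x.x.d case above.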

jomorrcode on 19 May 2020

For Twitter and Instagram this is true. For rotating proxies, try not to have the end numbers be close to each other! Good observation for people trying proxies on these sites.

Aaronzinhoo on 19 May 2020

That is effective, but only as long as not too many people are doing it.

o7n on 19 May 2020

https://twitter.com/ZusorOW/status/1258885451055800320

This is going to be a huge issue for Twint, especially if Twitter will remove the endpoint (almost sure)

Does anyone know a path forward for resolving this? Which parts (if not all) of Twint will be impacted by this change?

SpencerFlow on 29 May 2020

👍3

I just tried and Twint still works

As of now the legacy version has not been removed, yet

pielco11 on 1 Jun 2020

👍2

I was running something overnight and woke up to find it still running, so far so good.

jomorrcode on 1 Jun 2020

This is weird, everything broke for me two days ago? No matter what I try and scrape (tweets, followers, lookup data) I get "TypeError: 'NoneType' object is not subscriptable"

mr-devs on 1 Jun 2020

As of now it doesn't work as expected, but it's still possible to run searches with the mobile version

https://mobile.twitter.com/search?q=from%3Anoneprivacy&s=typd&x=0&y=0

pielco11 on 2 Jun 2020

Hi all, I am on:

Mac OS 10.15.5
Python 3.7.7
twint 2.1.20

I am looping over the script below for a range of days and saving to CSV after each search (I can post my full code if requested).

c = twint.Config()
c.Lang = 'en'
c.Search = '(mask OR masks OR facemask OR facemasks)'
c.Since = '2020-01-01'
c.Until = '2020-01-02'
c.Pandas = True
c.Hide_output = True
twint.run.Search(c)

I keep getting this repeatedly:

CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 1.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 8.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 27.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 64.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 125.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 216.0 secs

Am I correct that it keeps getting hung up on the same tweet and increasing the sleep time, until finally giving up after the 216 sec sleep? Due to the amount of tweets I am hoping to scrape for this project (Jan 1 - present), I would prefer to skip over the tweet if it doesn't return after the 8 or 27 sec sleep. Is there a twint method to do this?

I read through much of this thread as well as the others linked and did not see a direct answer. Thanks in advance!

EricB10 on 14 Jun 2020

It's not being caused by any tweet in particular. Twitter is blocking your IP for a fixed period of time (about 7 mins) after scraping a large batch of tweets. This progressive backoff is the best work around we've found so far. You'll just have to suck it up unless you can rotate your IP each time you get blocked.
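
For what it's worth, the sleep times in the log above (1, 8, 27, 64, 125, 216) match the retry count raised to the power 3.0, which is the exponent suggested earlier in the thread. The sketch below only reproduces those numbers; it is not copied from twint, and treating the minimum wait time as a simple floor is my assumption.

```python
def backoff_delay(consecutive_errors, exponent=3.0, min_wait=0.0):
    """Sleep time for the n-th consecutive error: n ** exponent, but at least min_wait."""
    return max(min_wait, round(consecutive_errors ** exponent, 1))


delays = [backoff_delay(n) for n in range(1, 7)]
print(delays)       # [1.0, 8.0, 27.0, 64.0, 125.0, 216.0]
print(sum(delays))  # 441.0
```

Six retries add up to 441 seconds, which is just past the roughly 7 minute (420 second) block, so the last retry lands about when the block should have lifted.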

jomorrcode on 14 Jun 2020

👍6

Gotcha, thanks for the FAST reply! Loving twint by the way... I will keep monitoring this issue.

Edit: has anyone experimented with scraping over TOR? Obviously it would generally be slower... but avoiding the 7 minute timeout could be worth it?

Edit 2: I tried and ended up getting an ssl connection error :(

EricB10 on 14 Jun 2020

CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
CRITICAL:root:twint.run:Twint:Feed:Tweets_known_error:Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0) [x] run.Feed
[!] if get this error but you know for sure that more tweets exist, please open an issue and we will investigate it!

Receiving this error continuously and program exits after that, I have read the entire thread but wasn't able to find the solution. Could anyone guide me?

My code:

`
for city in PROVINCE_CITIES[province]:
    print(city)
    c = twint.Config()

    c.Search = keywords
    c.Since = since
    c.Store_csv = True
    c.Output = "./" + outfile

    c.Near = city
    c.Hide_output = True

    twint.run.Search(c)

`

cc: @pielco11 @jomorrcode

MrAsimZahid on 26 Aug 2020

👍2

looks like it started again -_-

webcoderz on 18 Sep 2020

looks like it started again -_--

vsdfnvuitsb on 18 Sep 2020

looks like it started again -_--

it really started again. any fix on this please?

lokixxvii on 19 Sep 2020

I can confirm that CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0) error is happening. Tested from both MacOS and Ubuntu through CLI and jupyter notebook.

CoastalBoltman on 19 Sep 2020

👍5

It happens to me too.
I got the CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 error, tested on Windows and in a Jupyter notebook.

hazelnutlemon on 20 Sep 2020

👍4

same issue on docker/ubuntu

twitter_parser    | sleeping for 512.0 secs
twitter_parser    | CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)

krogla on 22 Sep 2020

917

Tyler-D-M on 23 Sep 2020

Going to echo what others have said that this probably isn't anything to do with twint.
Not sure if anyone else who uses the Twitter Dev Api received the string of mails recently that myself and a few people I know have?
The initial one of concern spoke about Dev Apis possibly being abused after being exposed through the browsers cache etc etc.

Seems to me that coupled with the seemingly "new" policies Twitter have been enforcing recently like forcing number verification, banning random accounts and all that I am of the opinion that they are simply starting to limit the type of full access that we've all become used to & just a general crackdown of sorts

I've been having luck with a solution @yunusemrecatalcam posted in issue #888 https://github.com/twintproject/twint/issues/888#issuecomment-693977671 So big thank you to him and the Dev of the project mentioned. Works nicely with a rotating pool, speeds are not that bad at all

And of course Big thank you to @pielco11 and everyone involved for your awesome work on Twint. It really is appreciated!

0m3rta13 on 26 Sep 2020

I have also faced the same issue. Twint is not working anymore and gives the error described in the question. How can I get rid of this? Is there any alternative?

hshafiq132 on 30 Sep 2020

looks like it started again -_-

JiaoZiLang on 9 Oct 2020

Update for all interested people: Looks like this issue was already resolved some time ago. If you clone the current version of twint, the noDataExpecting error no longer occurs. (via https://github.com/twintproject/twint/issues/983#issuecomment-721922571)

🎉

LinqLover on 5 Nov 2020

👍1

When will these changes be released on pip?

LinqLover on 5 Nov 2020

I need to collect tweets, so I used the same code, but I have a problem running it. Please help me; I am a student and have very little time to complete the project.
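import twint
import nest_asyncio

# Allow nested event loops, which is needed when running inside Jupyter
nest_asyncio.apply()

c = twint.Config()
c.Hide_output = True      # do not print each tweet to the console
c.Pandas_clean = True
c.Pandas = True           # keep the results in a pandas DataFrame
c.Search = "#covid"
# Restrict the search to a fixed date range
c.Since = '2020-12-01'
c.Until = '2020-12-10'
twint.run.Search(c)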
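
Earlier in the thread it was suggested to split searches on popular hashtags into weeks or months. A minimal sketch of how the date range in the code above could be cut into smaller windows, using only the standard library. date_windows is a hypothetical helper, not part of Twint, and it is an assumption that each pair would simply be assigned to c.Since and c.Until before calling twint.run.Search(c):

```python
from datetime import date, timedelta


def date_windows(since, until, days=7):
    """Yield (since, until) ISO date strings covering the range in chunks of `days`."""
    start = date.fromisoformat(since)
    end = date.fromisoformat(until)
    step = timedelta(days=days)
    while start < end:
        stop = min(start + step, end)  # the last window may be shorter
        yield start.isoformat(), stop.isoformat()
        start = stop


for window_since, window_until in date_windows("2020-12-01", "2020-12-10", days=3):
    # Assumption: these values would be assigned to c.Since / c.Until, one search per window
    print(window_since, window_until)
```

Adjacent windows share a boundary date, so depending on how the end date is treated, the combined results may need de-duplicating by tweet id.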

aliabdmahdi on 11 Dec 2020

@aliabdmahdi
Instead of installing twint from PyPI, use the command below:

pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

himanshudabas on 11 Dec 2020

I uninstalled Twint, then ran this command (pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint). This error appeared, and I don't know the reason.

aliabdmahdi on 11 Dec 2020

😕1

It's because this method first clones twint from GitHub, which requires you to have git installed on your system.
What you can do instead is download the twint repo from GitHub and extract it, then open the Anaconda terminal in the extracted folder.
Then run the following command:

pip install .

himanshudabas on 11 Dec 2020

It's because this method first clones twint from GitHub, which requires you to have git installed on your system.
What you can do instead is download the twint repo from GitHub and extract it, then open the Anaconda terminal in the extracted folder.
Then run the following command:
pip install .

How do I open the Anaconda terminal in the extracted folder?

aliabdmahdi on 11 Dec 2020

First start the conda terminal, like you previously did. Then change the directory using the cd command.

Google how to change directory in cmd

himanshudabas on 11 Dec 2020

I put the extracted folder on my desktop and then went through the installation steps, and there was an error again. What is the problem now?

aliabdmahdi on 11 Dec 2020

👀1 😕1

I put the extracted folder on my desktop and then went through the installation steps, and there was an error again. What is the problem now?

As per the error: if you read it, it asks you a question.

Do you have git installed?
It's asking because you don't have it installed. Or it's not in your PATH. But I'm gonna guess it's not installed.

Have you installed git?

If not....

https://git-scm.com/downloads
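
A quick way to tell "not installed" apart from "not on PATH" from within Python is to ask the standard library where it would find the git executable. This is a minimal sketch; find_executable is a hypothetical wrapper name, and the check only reflects the PATH of the environment the script runs in:

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


git_path = find_executable("git")
if git_path is None:
    print("git was not found on PATH; install it or add it to PATH")
else:
    print(f"git found at {git_path}")
```

Run it from the same terminal (for example the Anaconda prompt) that you use for pip, since a different terminal may have a different PATH.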

0m3rta13 on 11 Dec 2020

After installing git and performing the steps, this error appeared.

aliabdmahdi on 11 Dec 2020

After installing git and performing the steps, this error appeared.

Don't take this the wrong way but I think you should really consider getting to know the things behind the tools you're trying to use before using them.
For example, in this case, python.

That error indicates that you're probably not using a correct or compatible version of python. Or you've only just done the initial install of conda and not updated it prior to using it.
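
The error itself is not shown in the thread, so the following is only a sketch of how to check which interpreter is actually in use. The 3.6 minimum is an assumption used for illustration (check the Twint README for the real requirement), and is_supported is a hypothetical helper:

```python
import sys

# Assumption: 3.6 is used as the minimum purely for illustration;
# the actual requirement should be taken from the Twint README.
ASSUMED_MINIMUM = (3, 6)


def is_supported(version_info, minimum=ASSUMED_MINIMUM):
    """Compare the (major, minor) part of a version against the assumed minimum."""
    return tuple(version_info[:2]) >= minimum


# Show which interpreter is running; conda environments can each have their own
print("Python", sys.version.split()[0], "at", sys.executable)
print("meets assumed minimum:", is_supported(sys.version_info))
```

Running this inside the conda terminal shows whether pip and the notebook are using the interpreter you expect.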

A quick Google search of the error you're getting would help you more than waiting for replies here. And you'll probably learn a bit along the way, which will only be good.

Again, I'm not trying to be mean. But you're going to find yourself running into many issues like this when you start out, and people aren't going to want to help you if you don't appear to really try to fix your problem yourself first. Like by just googling your error.
Also, the problem you're having is in no way related to Twint or its devs, and this space is really for supporting and helping people who experience problems with their tools. It's good practice to respect that.

That error points to countless posts on stackexchange for example. And they always offer up great, detailed information that'll really help you.
I used stackexchange for years before I even posted my first question because almost anything I've ever wanted to ask has already been asked, many times. Lol.

Just some advice. Hope it lands and comes across as intended.

0m3rta13 on 11 Dec 2020

👍1
