Please help. I already tried reinstalling and upgrading twint, but the error persists.
pip3 install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint;twint -s covid
The command above returns the following error:
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/twint", line 11, in <module>
load_entry_point('twint', 'console_scripts', 'twint')()
File "/home/ubuntu/src/twint/twint/cli.py", line 311, in run_as_command
main()
File "/home/ubuntu/src/twint/twint/cli.py", line 303, in main
run.Search(c)
File "/home/ubuntu/src/twint/twint/run.py", line 427, in Search
run(config, callback)
File "/home/ubuntu/src/twint/twint/run.py", line 319, in run
get_event_loop().run_until_complete(Twint(config).main(callback))
File "/home/ubuntu/src/twint/twint/run.py", line 35, in __init__
self.token.refresh()
File "/home/ubuntu/src/twint/twint/token.py", line 68, in refresh
raise RefreshTokenException('Could not find the Guest token in HTML')
twint.token.RefreshTokenException: Could not find the Guest token in HTML
Ubuntu 20.04 LTS x86_64
I also had this "Guest token" issue when I ran twint on my Ubuntu server.
It's weird that twint runs okay on my MacBook but not on my Ubuntu server.
Hi @hkim2636, did you find any solution for this?
@hkim2636 yes, you are right. Currently _twint_ won't work from an AWS IP. Check this
This is because _twitter_ doesn't provide a _guest token_ when a request is made from an AWS IP address.
@dcbacarro I'm currently exploring solutions for this issue and will hopefully put up a patch soon. In the meantime, you can set up a proxy on your server. That will fix it.
@dcbacarro No, I just decided to run twint on my Mac. But I will also try to figure it out.
Thanks @himanshudabas. I'll try your suggestion. Hope this will be fixed soon.
I'll close this for now.
@himanshudabas thanks, I wasn't sure why it was failing once on AWS. Do you have any links on how to set up the proxy server? I was running twint inside a container on EC2 before the API break.
Yeah hitting the same problem using AWS Lambda... works fine running locally
@robert-moore try installing from this branch where I implemented a workaround for this exact issue. It might solve this issue for you for the time being.
@himanshudabas tried to run your branch on AWS but still doesn't work. Getting the below error
| 2020-10-24T15:02:48.219-04:00 | [TOR SESSION] Creating new TOR Session. Please give it a couple of seconds...
| 2020-10-24T15:02:48.224-04:00 | Traceback (most recent call last):
| 2020-10-24T15:02:48.224-04:00 | File "/root/.local/lib/python3.7/site-packages/twint/run.py", line 62, in Feed
| 2020-10-24T15:02:48.224-04:00 | response = await get.RequestUrl(self.config, self.init)
| 2020-10-24T15:02:48.224-04:00 | File "/root/.local/lib/python3.7/site-packages/twint/get.py", line 135, in RequestUrl
| 2020-10-24T15:02:48.224-04:00 | response = await Request(_url, params=params, connector=_connector,headers=_headers)
| 2020-10-24T15:02:48.224-04:00 | File "/root/.local/lib/python3.7/site-packages/twint/get.py", line 161, in Request
| 2020-10-24T15:02:48.224-04:00 | return await Response(session, _url, params)
| 2020-10-24T15:02:48.224-04:00 | File "/root/.local/lib/python3.7/site-packages/twint/get.py", line 170, in Response
| 2020-10-24T15:02:48.224-04:00 | raise TokenExpiryException(loads(resp)['errors'][0]['message'])
| 2020-10-24T15:02:48.224-04:00 | twint.token.TokenExpiryException: Rate limit exceeded
It looks to me like twitter changed something again.
I know that they fully deprecated the older API endpoints on Oct 12; perhaps that broke the workaround. I'll take a look.
@himanshudabas have you had a chance to look at it?
I've been using your branch for a few days running it on AWS, and I'm getting mixed results. It seems to work around 50% of the time; the failures produce the error below.
Any crawl below 20k tweets seems fine. It's crawling above that amount that seems to trigger the issue.
`
[TOR SESSION] Creating new TOR Session. Please give it a couple of seconds...
Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/torpy/http/adapter.py", line 157, in _new_conn
self._tor_stream = self._circuit.create_stream((self.host, self.port))
File "/root/.local/lib/python3.7/site-packages/torpy/circuit.py", line 318, in wrapped
return fn(self, *args, **kwargs)
File "/root/.local/lib/python3.7/site-packages/torpy/circuit.py", line 596, in create_stream
tor_stream.connect(address)
File "/root/.local/lib/python3.7/site-packages/torpy/stream.py", line 271, in connect
self._connect(address)
File "/root/.local/lib/python3.7/site-packages/torpy/stream.py", line 298, in _connect
self._wait_connected(address, self._conn_timeout)
File "/root/.local/lib/python3.7/site-packages/torpy/stream.py", line 277, in _wait_connected
raise TimeoutError('Could not connect to %r' % (address,))
TimeoutError: Could not connect to ('twitter.com', 443)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 309, in connect
conn = self._new_conn()
File "/root/.local/lib/python3.7/site-packages/torpy/http/adapter.py", line 163, in _new_conn
self, 'Connection to %s timed out. (connect timeout=%s)' % (self.host, self.timeout)
urllib3.exceptions.ConnectTimeoutError: (
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 439, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: MyHTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tweets_crawler.py", line 52, in
twint.run.Search(c)
File "/root/.local/lib/python3.7/site-packages/twint/run.py", line 419, in Search
run(config, callback)
File "/root/.local/lib/python3.7/site-packages/twint/run.py", line 315, in run
get_event_loop().run_until_complete(Twint(config).main(callback))
File "/root/.local/lib/python3.7/site-packages/twint/run.py", line 36, in __init__
self.token.refresh()
File "/root/.local/lib/python3.7/site-packages/twint/token.py", line 92, in refresh
if not self._request():
File "/root/.local/lib/python3.7/site-packages/twint/token.py", line 56, in _request
res = f.send(twitter_request).text
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: MyHTTPSConnectionPool(host='twitter.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(
`
Is there now a limit on the Guest Token as well?
I noticed the below error as well
twint.token.TokenExpiryException: Rate limit exceeded
No.
The workaround uses Tor to get the guest token for you from a Tor IP. You can then use this token to fetch results from twitter without needing Tor for those requests. But when you make too many requests in a very short time, let's say 400+ requests within 15 minutes, twitter will ban your AWS IP for 15 minutes, even if you have a guest token.
The only way to avoid this is to use either Tor or proxies for fetching data.
That said, I don't know why you are being blocked after fetching 20K tweets.
I could easily fetch around 6,000,000 tweets in 16-18 hours without being blocked once.
Perhaps you are making requests using multithreading.
In that case twitter will definitely block you for a certain period whenever you make more requests than they allow from a single IP address.
Try adding some delay to your requests so you don't cross the rate limit.
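One way to add that delay is a small sliding-window throttle that you call before each request (or each crawl launch) your own script issues. This is only a sketch: `SlidingWindowThrottle` and its parameters are hypothetical names, not part of twint, and the real limit twitter enforces is not known here, so any numbers you pass in are guesses.

```python
import time
from collections import deque


class SlidingWindowThrottle:
    """Allow at most max_calls within any window of `period` seconds.

    Hypothetical helper, not part of twint.
    """

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock    # injectable so the class can be tested
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget calls that have already left the window.
            while self.calls and now - self.calls[0] >= self.period:
                self.calls.popleft()
            if len(self.calls) < self.max_calls:
                break
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
        self.calls.append(now)
```

For example, `throttle = SlidingWindowThrottle(max_calls=300, period=15 * 60)` followed by `throttle.wait()` before each request keeps a single process under an assumed budget of 300 calls per 15 minutes. It does not coordinate several processes sharing one IP, so running multiple instances still needs a lower per-instance budget.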
Thanks @himanshudabas. Yes, you're right, I'm running multiple instances at the same time. I will try to limit concurrency and see if it works again.
In the past Twint used to back off on its own when it was blocked and try again after a few seconds. Is this no longer working after the API changes?
@karabi
Yeah, after I implemented the fix for the v1.1 deprecation in twint, a lot of things were still broken, as the library needed an overhaul.
I am trying to fix things up little by little but it might take some time to get a proper working version out.
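While the built-in back-off is not working, a crawler script can wrap its own call in a retry loop with exponential back-off. The sketch below is an assumption about how one might do that; `retry_with_backoff` and all of its parameters are hypothetical names and are not part of twint.

```python
import time


def retry_with_backoff(func, retry_on=(Exception,), max_attempts=5,
                       base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the pause after each failure.

    Hypothetical helper, not part of twint. Exceptions not listed in
    retry_on propagate immediately.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Pause 1x, 2x, 4x, ... base_delay, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

In a script like the ones in this thread, `func` could be something like `lambda: twint.run.Search(c)`. Which exceptions to list in `retry_on` depends on what you actually see in your tracebacks, and a retried search starts over unless the config is set up to resume.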
I removed the concurrency but I'm still facing the same issue.
Locally I'm able to scrape okay (result 47k tweets for my crawl). On AWS anything beyond ~20k fails.
What's weird though is that if I start the script again right after it fails, it works. So I don't think it's Twitter blocking my AWS IP.
[TOR SESSION] Creating new TOR Session. Please give it a couple of seconds...
Traceback (most recent call last):
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 64, in Feed
response = await get.RequestUrl(self.config, self.init)
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/get.py", line 135, in RequestUrl
response = await Request(_url, params=params, connector=_connector, headers=_headers)
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/get.py", line 161, in Request
return await Response(session, _url, params)
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/get.py", line 170, in Response
raise TokenExpiryException(loads(resp)['errors'][0]['message'])
twint.token.TokenExpiryException: Rate limit exceeded
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/tweets_crawler.py", line 65, in
twint.run.Search(c)
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 422, in Search
run(config, callback)
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 334, in run
get_event_loop().run_until_complete(twint_class.main(callback))
File "/usr/local/lib/python3.7/asyncio/base_events.py", line 587, in run_until_complete
return future.result()
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 239, in main
await task
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 266, in run
await self.tweets()
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 221, in tweets
await self.Feed()
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/run.py", line 67, in Feed
self.token.refresh()
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/token.py", line 96, in refresh
if not self._request():
File "/home/ec2-user/.local/lib/python3.7/site-packages/twint/token.py", line 58, in _request
with self._session as f:
File "/usr/local/lib/python3.7/contextlib.py", line 110, in __enter__
del self.args, self.kwds, self.func
AttributeError: args
I was able to temporarily work around the issue by using the Resume functionality of Twint. I'm basically crawling with Limit 10,000 then starting another crawl from where the last one finished. I was able to get around 250k tweets with one execution.
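The chunked approach described above can be sketched as a plain loop. This is not the script that was actually used: `crawl_in_chunks` and `run_chunk` are hypothetical names, and `run_chunk` stands in for whatever performs one limited crawl (for example a twint search with a limit of 10,000 and resume enabled) and returns how many tweets it collected.

```python
def crawl_in_chunks(run_chunk, chunk_limit=10_000, max_chunks=25):
    """Run bounded crawls back to back until one comes up short.

    run_chunk(chunk_limit) is a hypothetical callable that performs one
    crawl, resuming from where the previous one stopped, and returns the
    number of tweets it collected.
    """
    total = 0
    for _ in range(max_chunks):
        collected = run_chunk(chunk_limit)
        total += collected
        if collected < chunk_limit:
            break  # a short chunk is taken to mean the query is exhausted
    return total
```

Treating a short chunk as the end of the results is an assumption; if a chunk can come up short for other reasons (a block, a Tor timeout), the callable should raise instead of returning a low count so the loop does not stop early.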
I'm also having this issue on Google App Engine; I assume the IP is banned. I am not even fetching that many tweets, just from one username at regular intervals.
@Greatdane
Check this comment https://github.com/twintproject/twint/issues/957#issuecomment-716030286
It might solve your issue for the time being.
But remember that it is a very basic implementation of the workaround, so other things might be broken.
@himanshudabas I've been using your branch (twint-fixes) and it's been working until yesterday but now it stopped with the below error. Do you get the same?
ERROR:root:[ignored]
Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/torpy/cell_socket.py", line 63, in connect
self._socket.connect((self._router.ip, self._router.or_port))
File "/usr/local/lib/python3.7/ssl.py", line 1172, in connect
self._real_connect(addr, False)
File "/usr/local/lib/python3.7/ssl.py", line 1159, in _real_connect
super().connect(addr)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/torpy/utils.py", line 78, in newfn
return func(*args, **kwargs)
File "/root/.local/lib/python3.7/site-packages/torpy/consesus.py", line 175, in renew
if not self.verify(new_doc):
File "/root/.local/lib/python3.7/site-packages/torpy/consesus.py", line 200, in verify
pubkey = self._get_pubkey(sign['identity'], sign['signing_key_digest'])
File "/root/.local/lib/python3.7/site-packages/torpy/consesus.py", line 206, in _get_pubkey
key_certificate = self._authorities.download_fp_sk(identity, signing_key_digest)
File "/root/.local/lib/python3.7/site-packages/torpy/consesus.py", line 121, in download_fp_sk
with TorGuard(authority) as guard:
File "/root/.local/lib/python3.7/site-packages/torpy/guard.py", line 65, in __init__
self.__tor_socket.connect()
File "/root/.local/lib/python3.7/site-packages/torpy/cell_socket.py", line 69, in connect
raise TorSocketConnectError(e)
torpy.cell_socket.TorSocketConnectError: [Errno 111] Connection refused
WARNING:torpy.utils:Retry with another authority...
[TOR SESSION] Creating new TOR Session. Please give it a couple of seconds...
[+] Finished: Successfully collected 0 Tweets.
Looks to me like an issue related to _torpy_.
What script did you run exactly?
Also, are you able to replicate this issue?
Sometimes torpy is unable to create a Tor session; it then retries, and after a few retries it gives up.
So that might be the reason for it here.
If it's possible, share the exact script that results in this issue (replace the sensitive info, if there's any).