Affected version(s)
4.9
Description
On my website the new crawler stops between 95 and 98%.
The following error message can be found in the log files.
[2020-02-19 09:02:23] request.CRITICAL: Uncaught PHP Exception InvalidArgumentException: "Unable to parse URI: http://" at /www/htdocs/vendor/nyholm/psr7/src/Uri.php line 51 {"exception":"[object] (InvalidArgumentException(code: 0): Unable to parse URI: http:// at /www/htdocs//vendor/nyholm/psr7/src/Uri.php:51)"} []
How to reproduce
I can show you my installation.
Can you paste the complete stack trace?
Uncaught PHP Exception InvalidArgumentException: "Unable to parse URI: http://" at /www/htdocs/vendor/nyholm/psr7/src/Uri.php line 51
[▼
"exception" => InvalidArgumentException {#2097 ▼
#message: "Unable to parse URI: http://"
#code: 0
#file: "/www/htdocs/vendor/nyholm/psr7/src/Uri.php"
#line: 51
trace: {▼
/www/htdocs/vendor/nyholm/psr7/src/Uri.php:51 {▶}
/www/htdocs/vendor/terminal42/escargot/src/Subscriber/HtmlCrawlerSubscriber.php:55 {▼
Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber->onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void …
› $link = new Link($node, (string) $crawlUri->getUri()->withPath('')->withQuery('')->withFragment(''));
› $uri = new Uri($link->getUri());
›
arguments: {▶}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:449 {▼
Terminal42\Escargot\Escargot->processResponseChunk(ResponseInterface $response, ChunkInterface $chunk): void …
› if (SubscriberInterface::DECISION_NEGATIVE !== $needsContentDecision) {
› $subscriber->onLastChunk($crawlUri, $response, $chunk);
› }
arguments: {▼
$crawlUri: Terminal42\Escargot\CrawlUri {#829 …}
$response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
$chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:407 {▼
Terminal42\Escargot\Escargot->processResponses(array $responses): void …
› foreach ($this->getClient()->stream($responses) as $response => $chunk) {
› $this->processResponseChunk($response, $chunk);
› }
arguments: {▼
$response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
$chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:315 {▼
Terminal42\Escargot\Escargot->crawl(): void …
›
› $this->processResponses($responses);
› }
arguments: {▼
$responses: [ …5]
}
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Crawl.php:171 {▼
Contao\Crawl->run() …
› // Start crawling
› $escargot->crawl();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/modules/ModuleMaintenance.php:49 {▼
Contao\ModuleMaintenance->compile() …
›
› $buffer = $this->$callback->run();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/BackendModule.php:92 {▼
Contao\BackendModule->generate() …
› $this->Template = new BackendTemplate($this->strTemplate);
› $this->compile();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Backend.php:434 {▼
Contao\Backend->getBackendModule($module, PickerInterface $picker = null) …
›
› $this->Template->main .= $objCallback->generate();
› }
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/controllers/BackendMain.php:155 {▼
Contao\BackendMain->run() …
›
› $this->Template->main .= $this->getBackendModule(Input::get('do'), $picker);
› $this->Template->title = $this->Template->headline;
arguments: {▼
$module: "maintenance"
$picker: null
}
}
/www/htdocs/vendor/contao/core-bundle/src/Controller/BackendController.php:48 {▼
Contao\CoreBundle\Controller\BackendController->mainAction(): Response …
›
› return $controller->run();
› }
}
/www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:146 {▼
Symfony\Component\HttpKernel\HttpKernel->handleRaw(Request $request, int $type = self::MASTER_REQUEST): Response …
› // call controller
› $response = $controller(...$arguments);
›
}
/www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:68 {▼
Symfony\Component\HttpKernel\HttpKernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
› try {
› return $this->handleRaw($request, $type);
› } catch (\Exception $e) {
arguments: {▼
$request: Symfony\Component\HttpFoundation\Request {#6 …}
$type: 1
}
}
/www/htdocs/vendor/symfony/http-kernel/Kernel.php:201 {▼
Symfony\Component\HttpKernel\Kernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
› try {
› return $this->getHttpKernel()->handle($request, $type, $catch);
› } finally {
arguments: {▼
$request: Symfony\Component\HttpFoundation\Request {#6 …}
$type: 1
$catch: true
}
}
/www/htdocs/web/index.php:31 {▼
require …
›
› $response = $kernel->handle($request);
› $response->send();
arguments: {▶}
}
/www/htdocs/web/app.php:4 {▼
› // Backwards compatibility
› require __DIR__.'/index.php';
›
arguments: {▼
"/www/htdocs/web/index.php"
}
}
}
}
]
{▼
/www/htdocs/vendor/nyholm/psr7/src/Uri.php:51 {▼
Nyholm\Psr7\Uri->__construct(string $uri = '') …
› if (false === $parts = \parse_url($uri)) {
› throw new \InvalidArgumentException("Unable to parse URI: $uri");
› }
}
/www/htdocs/vendor/terminal42/escargot/src/Subscriber/HtmlCrawlerSubscriber.php:55 {▼
Terminal42\Escargot\Subscriber\HtmlCrawlerSubscriber->onLastChunk(CrawlUri $crawlUri, ResponseInterface $response, ChunkInterface $chunk): void …
› $link = new Link($node, (string) $crawlUri->getUri()->withPath('')->withQuery('')->withFragment(''));
› $uri = new Uri($link->getUri());
›
arguments: {▼
$uri: "http://"
}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:449 {▼
Terminal42\Escargot\Escargot->processResponseChunk(ResponseInterface $response, ChunkInterface $chunk): void …
› if (SubscriberInterface::DECISION_NEGATIVE !== $needsContentDecision) {
› $subscriber->onLastChunk($crawlUri, $response, $chunk);
› }
arguments: {▼
$crawlUri: Terminal42\Escargot\CrawlUri {#829 …}
$response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
$chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:407 {▼
Terminal42\Escargot\Escargot->processResponses(array $responses): void …
› foreach ($this->getClient()->stream($responses) as $response => $chunk) {
› $this->processResponseChunk($response, $chunk);
› }
arguments: {▼
$response: Symfony\Component\HttpClient\Response\CurlResponse {#734 …}
$chunk: Symfony\Component\HttpClient\Chunk\LastChunk {#744 …}
}
}
/www/htdocs/vendor/terminal42/escargot/src/Escargot.php:315 {▼
Terminal42\Escargot\Escargot->crawl(): void …
›
› $this->processResponses($responses);
› }
arguments: {▼
$responses: [ …5]
}
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Crawl.php:171 {▼
Contao\Crawl->run() …
› // Start crawling
› $escargot->crawl();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/modules/ModuleMaintenance.php:49 {▼
Contao\ModuleMaintenance->compile() …
›
› $buffer = $this->$callback->run();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/BackendModule.php:92 {▼
Contao\BackendModule->generate() …
› $this->Template = new BackendTemplate($this->strTemplate);
› $this->compile();
›
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/classes/Backend.php:434 {▼
Contao\Backend->getBackendModule($module, PickerInterface $picker = null) …
›
› $this->Template->main .= $objCallback->generate();
› }
}
/www/htdocs/vendor/contao/core-bundle/src/Resources/contao/controllers/BackendMain.php:155 {▼
Contao\BackendMain->run() …
›
› $this->Template->main .= $this->getBackendModule(Input::get('do'), $picker);
› $this->Template->title = $this->Template->headline;
arguments: {▼
$module: "maintenance"
$picker: null
}
}
/www/htdocs/vendor/contao/core-bundle/src/Controller/BackendController.php:48 {▼
Contao\CoreBundle\Controller\BackendController->mainAction(): Response …
›
› return $controller->run();
› }
}
/www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:146 {▼
Symfony\Component\HttpKernel\HttpKernel->handleRaw(Request $request, int $type = self::MASTER_REQUEST): Response …
› // call controller
› $response = $controller(...$arguments);
›
}
/www/htdocs/vendor/symfony/http-kernel/HttpKernel.php:68 {▼
Symfony\Component\HttpKernel\HttpKernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
› try {
› return $this->handleRaw($request, $type);
› } catch (\Exception $e) {
arguments: {▼
$request: Symfony\Component\HttpFoundation\Request {#6 …}
$type: 1
}
}
/www/htdocs/vendor/symfony/http-kernel/Kernel.php:201 {▼
Symfony\Component\HttpKernel\Kernel->handle(Request $request, $type = HttpKernelInterface::MASTER_REQUEST, $catch = true) …
› try {
› return $this->getHttpKernel()->handle($request, $type, $catch);
› } finally {
arguments: {▼
$request: Symfony\Component\HttpFoundation\Request {#6 …}
$type: 1
$catch: true
}
}
/www/htdocs/web/index.php:31 {▼
require …
›
› $response = $kernel->handle($request);
› $response->send();
arguments: {▼
$request: Symfony\Component\HttpFoundation\Request {#6 …}
}
}
/www/htdocs/web/app.php:4 {▶}
You have an empty `<a href="http://"></a>` link somewhere on your page, which is invalid. Can you try updating all dependencies so that terminal42/escargot is updated to 0.5.2? It should be fixed with https://github.com/terminal42/escargot/commit/3b16bad749fb87d26035cf25ddb1cb3ef97ffa00. The debug log should then also tell you where that link was found and that it couldn't be added to the queue :)
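For anyone who wants to check their own markup for such links: the exception is thrown because PHP's `parse_url()` returns `false` for `http://`, as the `Uri.php` frame in the trace shows. Below is a minimal sketch of a pre-check. The function name `isUsableAbsoluteUri` is made up for illustration; it is not part of Escargot and is not the fix from the commit above.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Escargot): returns true only if
 * parse_url() can parse the string and it contains a non-empty host.
 * Relative links such as "/foo" are deliberately reported as unusable.
 */
function isUsableAbsoluteUri(string $href): bool
{
    $parts = parse_url($href);

    // parse_url() returns false for seriously malformed URLs such as "http://".
    if (false === $parts) {
        return false;
    }

    return isset($parts['host']) && '' !== $parts['host'];
}

var_dump(isUsableAbsoluteUri('http://'));
var_dump(isUsableAbsoluteUri('http://www.web-freelancer-gesucht.de'));
```

The first call should report `false` and the second `true`, so a check like this could be used to find the offending link in exported page content.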
After the update, a new error appears.
[2020-02-20 17:19:55] request.CRITICAL: Uncaught PHP Exception Doctrine\DBAL\Exception\DriverException: "An exception occurred while executing 'SELECT uri, level, processed, found_on, tags FROM tl_crawl_queue WHERE (job_id = ?) AND (processed = ?) ORDER BY id ASC LIMIT 1 OFFSET 508' with params ["c423a14e-233d-48a6-b291-155429e27422", 0]: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away" at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php line 106 {"exception":"[object] (Doctrine\\DBAL\\Exception\\DriverException(code: 0): An exception occurred while executing 'SELECT uri, level, processed, found_on, tags FROM tl_crawl_queue WHERE (job_id = ?) AND (processed = ?) ORDER BY id ASC LIMIT 1 OFFSET 508' with params [\"c423a14e-233d-48a6-b291-155429e27422\", 0]:\n\nSQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php:106, Doctrine\\DBAL\\Driver\\PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:123, PDOException(code: HY000): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away at /www/htdocs/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOStatement.php:121)"} []
General error: 2006 MySQL server has gone away
...
Where would the MySQL server be going? The page is running, and the crawler keeps spinning at 96%. 🎃
I can't help you any further here without a copy of the whole setup, sorry. The MySQL server shuts down; you will have to debug on your own why that happens. Maybe there is an endless loop, maybe not.
On the command line, the crawler runs without problems.
On the command line, the crawler crawls _web-freelancer-gesucht.de_, but I never link to this page.

Now I have found the link. 💡
The link is attached to the author's name in a comment.
https://brkwsky.de/blog-leser/diese-drei-tools-erleichtern-unseren-arbeitsalltag
<p class="info">Kommentar von <a href="http://www.web-freelancer-gesucht.de" target="_blank" rel="nofollow noreferrer noopener">Mark</a
_nofollow noreferrer noopener?!_
nofollow should be ignored. Check the debug logs.
I have a similar problem. Only links that contain rel="nofollow" are correctly tagged with "rel-nofollow" in the table "tl_crawl_queue". With rel="nofollow noopener" it doesn't work.
Indeed. Fixed in https://github.com/terminal42/escargot/commit/f01decbcd9789ae8e62d2ed788bf5268c4951d4f and released as 0.5.3. Update your dependencies so you get the latest terminal42/escargot version and this issue should be gone.
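To illustrate the kind of bug described above: the `rel` attribute is a whitespace-separated list of tokens, so comparing the whole attribute value against `nofollow` misses values like `nofollow noopener`. The following is a minimal sketch of a token-based check. `relHasToken` is a hypothetical name, and this is not necessarily how the linked commit implements it.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: checks whether a rel attribute value contains a
 * given token, treating the value as a whitespace-separated list and
 * comparing case-insensitively.
 */
function relHasToken(string $rel, string $token): bool
{
    // Split on any run of whitespace and drop empty entries.
    $tokens = preg_split('/\s+/', strtolower(trim($rel)), -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() can return false on error; treat that as "no tokens".
    return in_array(strtolower($token), $tokens ?: [], true);
}

var_dump(relHasToken('nofollow noreferrer noopener', 'nofollow'));
```

With a check like this, both `rel="nofollow"` and `rel="nofollow noreferrer noopener"` are recognized as nofollow links.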
Great... thanks for the fast fix!
Sorry. 0.5.3 does not fix my problem. :(

I fixed the problem @LIVID-Media mentioned. I cannot fix your problem until I have proper instructions on how to reproduce the issue.
Can I share my screen with you, @Toflar?
I couldn't spot any issues. The crawler finishes correctly and crawls through all the data.
The progress bar advances normally. The title (in your screenshot http://www.web-freelancer-gesucht.de/) is just not updated for every request. It is only updated for requests that were fully completed (which is not the case for broken link checks).
So it might be that the title does not update for quite a while.
I think we should remove that completely as that seems to confuse people and it provides no added value anyway?
Ok, that's a good idea.
PR is here: https://github.com/contao/contao/pull/1396
I've also found an issue and released Escargot 0.5.5, which should immensely reduce the number of URLs that are checked and speed things up quite a lot 😄
Please provide feedback.
Tadaaaa, now it works and the crawler was lightning fast. ❤️
@Toflar Could it be that the newest version treats console commands as GET requests that have no URI, because we are on the console? I have been getting a similar error on all cron job commands since I updated from Contao 4.9.1 to 4.9.2 with all dependencies.
One example command would be:
vendor/terminal42/notification_center/bin/queue -s 2 -n 10
Error:
Fatal error: Uncaught InvalidArgumentException: Unable to parse URI: http://:/ in /home/linderim/public_html/contao49/vendor/nyholm/psr7/src/Uri.php:51
Stack trace:
As a hotfix, I added this:
// Skip requests without a usable URI (e.g. console commands)
if ('http://:/' === $request->getUri()) {
    return;
}
to /vendor/contao/core-bundle/src/EventListener/SearchIndexListener.php, just before the call to createFromRequestResponse() on line 67.
Is this a local problem on my site, or is it caused by a logic error in the SearchIndexListener?
_I could not find an empty href attribute in the database._
Thanks in advance for your help
That doesn't look like an issue with the crawler, no. But it looks like the BC layer of the initialize.php always expects a web request, which is not the case for the notification center (or probably many other scripts). /cc @aschempp
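As a rough illustration of the distinction between web and console context mentioned here, a script can tell the two apart via the `PHP_SAPI` constant. The sketch below uses a hypothetical helper name, `isCommandLine`, and is only an assumption about how such a guard could look; it is not Contao's actual BC layer or an official fix.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: reports whether the given SAPI name denotes a
 * command-line context. Defaults to the SAPI of the running process.
 */
function isCommandLine(string $sapi = \PHP_SAPI): bool
{
    return in_array($sapi, ['cli', 'phpdbg'], true);
}

// Example: code that only makes sense for web requests could bail out early.
if (isCommandLine()) {
    echo "Running on the console, there is no request URI to index.\n";
}
```

A guard of this kind avoids comparing against a specific malformed string such as `http://:/`, which depends on how the request object happens to be built.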
Is there an update or fix for this? Every time I update Contao, I get this error every minute (via cron) until I insert the hotfix by @rorych.
`PHP Fatal error: Uncaught InvalidArgumentException: Unable to parse URI: http://:/ in /var/www/vhosts/mydomain/httpdocs/vendor/nyholm/psr7/src/Uri.php:53
Stack trace:
`