Reusing a parser with http::async_read() results in parser.body() accumulating the bodies of all the responses. Calling parser.clear() before calling http::async_read() again does not fix this. The only solution appears to be to create a new parser for every http::async_read() call.
Unlike the http::read documentation on msg, the http::async_read documentation makes no mention that the parser should not have any previous content.
The http_crawl.cpp example also happily reuses the http::response<http::string_body> res_ member of the worker class for every request. For http_crawl.cpp, this might not be problematic because it ignores the response body, but it creates the wrong expectations in the absence of documentation warning against such use.
144
Modify http_crawl.cpp to print res_.body() in worker::on_read(). E.g.:
...
auto const code = res_.result_int();
std::cout << "============================" << std::endl;
std::cout << res_.body() << std::endl;
std::cout << "----------------------------" << std::endl;
report_.aggregate(
...
Modify urls_large_data.cpp to fetch only google.com two times:
...
urls_large_data()
{
static std::vector <char const*> const urls ({
"google.com",
"google.com"
});
return urls;
}
...
bjam
Run with 1 thread: http-crawl 1
============================
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
----------------------------
Progress: 0 of 2
============================
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
----------------------------
Elapsed time: 5 seconds
Crawl report
Failure counts
Timer : 0
Resolve : 0
Connect : 0
Write : 0
Read : 0
Success : 2
Status codes
301: 2 (Moved Permanently)
gcc (Ubuntu 7.2.0-8ubuntu3.2) 7.2.0
Hi @vinniefalco,
Is this a documentation/example or an implementation bug?
Regards,
@smipi1
Is this a documentation/example or an implementation bug?
Documentation. The parser was never meant to be reusable. Note that the message class has no "clear" member function.
Is it worthwhile reworking the examples as well to make this clear? At a minimum, add a comment explaining why this is generally ill-advised but okay for the crawler use case.
Actually that might explain why the crawler malfunctions towards the end... good find :)
This issue has been open for a while with no activity. Has it been resolved?
Hi @vinniefalco,
Should I take a stab at improving the documentation, or would you like this to be accommodated with a fix for the crawler too?
The place for the documentation is in basic_parser, since the parser can be used even without calling async_read and still exhibit the problem. If you want to try your hand at a fix for either of these problems (or both) I certainly won't mind!