beast 🚀 - Receive/parse the message body one chunk at a time

Incremental processing for the body is already possible, but with a different interface than what you are asking for. If you want to process the body incrementally then you simply need to create your own Body type, and then customize the Reader:

Requirements for Body
http://vinniefalco.github.io/beast/beast/ref/Body.html

Requirements for Reader
http://vinniefalco.github.io/beast/beast/ref/Reader.html

Example user-defined body:

// incrementally reads body data
struct MyBody
{
  using value_type = std::string; // can be any type you want

  struct reader
  {
    template<bool isRequest, class Headers>
    reader(message<isRequest, MyBody, Headers>& m);

    void
    write(void const* data, std::size_t size, error_code& ec)
    {
      // this will be called with each piece of the body, after chunk decoding
    }
  };
};

Another useful feature would be to be able to specify async_parse to complete after reading at most X amount of bytes in the body

Why do you want to do this? What is the use-case?

vinniefalco on 25 Oct 2016

The Reader customization is a nice feature, but like any other callback mechanism it does not provide a way to suspend or throttle the receive until another async operation completes. The only option is to store the whole message body and then start a different async operation.

Imagine a reverse proxy that receives an HTTP message from a client, parses the headers, creates a new HTTP request to a backend server based on the path, and then passes all the data received from the client to the server.

The workflow would be like this:

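{
  parse_headers(sock, hp, msg);
  // check the path
  connect(backend, beSock);
  request beReq;
  error_code ec;

  // parse_atmost is the proposed call: it parses at most 1024 bytes of the
  // body and sets ec to error_more_data while the message is incomplete
  do
  {
      parse_atmost(sock, bp, msg, 1024, ec);
      // forward whatever was parsed, including the final piece
      write_chunk(beSock, beReq, msg.data());
  }
  while (ec == error_more_data);
}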

I think an error code can be used to indicate that the parsing was stopped before reaching the end of the message.

To make things symmetrical, a similar mechanism would be needed on the write side as well, to send the data chunk by chunk.

georgi-d on 25 Oct 2016

I agree, Reader does not provide the ability to throttle. However, I think the responsibility for throttling lies not with the HTTP implementation but rather with the choice of the AsyncReadStream parameter used to parse:
http://vinniefalco.github.io/beast/beast/ref/http__async_parse.html

If you want to throttle, then you should write a class that meets the requirements of AsyncReadStream (http://www.boost.org/doc/libs/1_62_0/doc/html/boost_asio/reference/AsyncReadStream.html), which wraps an existing stream and allows calls to async_read_some to be suspended and resumed.

For the write side you can use a custom body, and suspend by returning boost::indeterminate from the Writer instance:
http://vinniefalco.github.io/beast/beast/ref/Writer.html

vinniefalco on 25 Oct 2016

I do not really agree that throttling should be the responsibility of the read stream. The whole design of ASIO is to have throttling be done by not issuing new async_* operations. If the reverse proxy was implemented over TCP then we would just have one buffer of fixed length: async_read into it, then async_write out of it. Throttling is automatic.
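The fixed-buffer relay described above can be sketched without any networking. This is only an illustration under stated assumptions: memory_source and relay are hypothetical names, the "sockets" are in-memory strings, and the loop is synchronous where a real proxy would chain async_read and async_write handlers.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical in-memory stand-in for the client socket (no ASIO involved).
class memory_source
{
    std::string data_;
    std::size_t pos_ = 0;

public:
    explicit memory_source(std::string data) : data_(std::move(data)) {}

    // Like read_some: copies at most size bytes, returns 0 at end of data.
    std::size_t read_some(char* out, std::size_t size)
    {
        std::size_t const n = std::min(size, data_.size() - pos_);
        std::memcpy(out, data_.data() + pos_, n);
        pos_ += n;
        return n;
    }
};

// Relays everything from the source to the sink through one fixed buffer.
// Returns the number of read/write rounds that were needed.
std::size_t relay(memory_source& from, std::string& to)
{
    std::array<char, 8> buffer; // the only per-connection storage
    std::size_t rounds = 0;
    for (;;)
    {
        std::size_t const n = from.read_some(buffer.data(), buffer.size());
        if (n == 0)
            break;
        // The next read is not issued until this "write" has finished,
        // which is what throttles the sender.
        to.append(buffer.data(), n);
        ++rounds;
    }
    return rounds;
}
```

Memory use stays at the buffer size no matter how long the body is, because a new read only starts once the previous piece has been written out.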

Could you add an example of how a simple reverse proxy would be implemented? The backend server can easily be fixed during construction and not depend on the request.

I feel the implementation would be much simpler by having a custom parser that can be called multiple times than implementing a custom AsyncReadStream.

Thanks

georgi-d on 25 Oct 2016

I'll consider pausing the parser during the message body.

What about the Writer solution, does that work for you?

vinniefalco on 25 Oct 2016

@georgi-d As a temporary solution, you could first read the headers using the headers_parser:
https://github.com/vinniefalco/Beast/blob/master/include/beast/http/headers_parser_v1.hpp#L39

And then, construct a new parser for the body using with_body:
https://github.com/vinniefalco/Beast/blob/master/include/beast/http/parser_v1.hpp#L286

The resulting parser will be ready to read the message body, and you can just feed in the data. In other words read from the socket yourself and pass the buffers in to parser_v1::write(). When the parser is done, parser_v1::complete() will return true.
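As a rough model of that caller-driven loop, the sketch below feeds data to a parser in pieces until complete() returns true. It does not use parser_v1: body_parser and feed are hypothetical stand-ins, the parser only understands a Content-Length body, and the "socket" is a string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the body parser obtained after the headers
// have been read. Only a Content-Length body is modeled.
class body_parser
{
    std::size_t remaining_;
    std::string body_;

public:
    explicit body_parser(std::size_t content_length)
        : remaining_(content_length)
    {
    }

    // Consumes up to size bytes and returns how many were used.
    std::size_t write(char const* data, std::size_t size)
    {
        std::size_t const n = std::min(size, remaining_);
        body_.append(data, n);
        remaining_ -= n;
        return n;
    }

    bool complete() const { return remaining_ == 0; }
    std::string const& body() const { return body_; }
};

// Reads the wire data in pieces of at most `piece` bytes and passes each
// one to the parser. Returns the number of write() calls made.
std::size_t feed(body_parser& p, std::string const& wire, std::size_t piece)
{
    assert(piece > 0);
    std::size_t calls = 0;
    std::size_t pos = 0;
    while (!p.complete() && pos < wire.size())
    {
        std::size_t const n = std::min(piece, wire.size() - pos);
        pos += p.write(wire.data() + pos, n);
        ++calls;
        // A proxy could forward the piece here before reading again.
    }
    return calls;
}
```

Because the caller owns the loop, it decides when the next read happens, which is the throttling point asked for earlier in the thread.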

The implementation of beast::http::parse is pretty straightforward, there's no magic going on:
https://github.com/vinniefalco/Beast/blob/master/include/beast/http/impl/parse.ipp#L220

vinniefalco on 25 Oct 2016

I like the idea of reading the data myself and passing it to the parser with write. I will give it a try.

As for the Writer solution, I will try it as well and see how it works for me. I would really like a solution similar to parser_v1::write for writing the message body, so I can control it from the outside rather than being called back for the data.

Thanks

georgi-d on 25 Oct 2016

For the write you can specify the type beast::http::empty_body for Body in the message, call beast::http::write to send the headers, and then handle writing of the body yourself, directly on the socket. You won't be able to call prepare (since it will set a zero Content-Length) but that's not a big deal, you can just fill those fields in yourself (you'll also have to set Connection and Transfer-Encoding).

You can still do chunked encoding, Beast provides a routine for you:
https://github.com/vinniefalco/Beast/blob/master/include/beast/http/detail/chunk_encode.hpp#L88

I can make chunk_encode part of the public API if necessary.
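For reference, the HTTP/1.1 chunked framing itself is small. The sketch below is not Beast's chunk_encode; encode_chunk and final_chunk are hypothetical helpers that build the bytes for one chunk (hex size, CRLF, data, CRLF) and for the terminating zero-size chunk without trailing headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: frames one piece of body data as an HTTP/1.1 chunk.
std::string encode_chunk(std::string const& data)
{
    assert(!data.empty()); // a zero-size chunk would terminate the body
    char size_line[32];
    // The chunk size is written in hexadecimal, followed by CRLF.
    std::snprintf(size_line, sizeof(size_line), "%zx\r\n", data.size());
    return std::string(size_line) + data + "\r\n";
}

// Hypothetical helper: the last chunk, with no trailing headers.
std::string final_chunk()
{
    return "0\r\n\r\n";
}
```

After writing the headers with Transfer-Encoding: chunked set, each piece of the body would be sent as encode_chunk(piece), followed by final_chunk() once the body ends.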

This would be much easier than defining your own Body and associated writer.

vinniefalco on 26 Oct 2016

This sounds good, I will give it a try. Making chunk_encode public would be useful I believe.

Thanks

georgi-d on 26 Oct 2016

I started working on reading myself from the socket and passing the data to the parser and ended up copying 90% of async_parse + parse_op. The only extra feature I actually need is the ability to break out of the async_read loop and call the handler on a condition different than parse.complete() returning true.

I was thinking about adding a version of async_parse similar to boost::asio::async_read which receives a CompletionCondition functor. The CompletionCondition should be called after each parser.write() with the number of bytes consumed by the parser and an error_code (optionally a reference to the parser, but it could also be a bound parameter). The default CompletionCondition could be implemented as returning the result of parser.complete().

On error or parser.complete() returning true the handler should also be called.

async_parse() should be callable multiple times for a single HTTP message until the complete message body is consumed and parsed. From what I gather, the state in parse_op would not need to change to support multiple invocations. On each invocation the data.state variable would start from 0.
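The proposal can be modeled synchronously as follows. Everything here is a hypothetical sketch, not the draft change or Beast's API: counting_parser only counts body bytes, parse_some plays the role of one async_parse invocation, and at_most is an example condition that stops after a given number of bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical minimal parser: consumes bytes until the body length is met.
class counting_parser
{
    std::size_t remaining_;

public:
    explicit counting_parser(std::size_t content_length)
        : remaining_(content_length)
    {
    }

    std::size_t write(std::size_t size)
    {
        std::size_t const n = std::min(size, remaining_);
        remaining_ -= n;
        return n;
    }

    bool complete() const { return remaining_ == 0; }
};

// Called after each write() with the bytes consumed; true means "stop now".
using completion_condition = std::function<bool(std::size_t)>;

// One invocation of the read loop. Stops when the parser completes, the
// condition returns true, or the input runs out. Returns bytes consumed.
std::size_t parse_some(counting_parser& p, std::size_t& available,
                       std::size_t piece, completion_condition const& done)
{
    assert(piece > 0);
    std::size_t total = 0;
    while (!p.complete() && available > 0)
    {
        std::size_t const n = p.write(std::min(piece, available));
        available -= n;
        total += n;
        if (done(n))
            break;
    }
    return total;
}

// Example condition: stop once at least `limit` bytes have been consumed.
completion_condition at_most(std::size_t limit)
{
    std::size_t seen = 0;
    return [seen, limit](std::size_t n) mutable {
        seen += n;
        return seen >= limit;
    };
}
```

Each call to parse_some starts with fresh loop state while the parser keeps its own progress, so the function can be invoked repeatedly for one message until complete() is true.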

Thanks

georgi-d on 10 Nov 2016

Why don't you just create your own Parser? Then you can return whatever value you want from complete(). And you can call async_parse over and over again.

vinniefalco on 10 Nov 2016

I thought about it, but it feels more like a hack to me. The parser's job is to parse the message, not to determine when the read operation should be completed and the handler invoked. The fact that it is currently used as the sole condition for completion of async_parse is a special case in my opinion. It is similar to using boost::asio::async_read() with the boost::asio::transfer_all() condition. If the parser is used to indicate that the async_read should complete mid-message, then it should have some other "complete()" to indicate whether the message has really ended or not. It should also have a special "reset_read()" method to enable consecutive async_parse() calls.

I will try implementing both solutions to see which one I like better for the use case I have.

georgi-d on 10 Nov 2016

@georgi-d It is my intention for parse and async_parse to be something of a "generic read algorithm" that isn't necessarily constrained to only parsing HTTP messages (although that's what the library predominantly uses it for). The parse_op composed operation takes care of some of the heavy lifting for you, so that you can focus on implementing the business logic.

I have been thinking a little bit about renaming async_parse to async_read, rename Parser to Reader, and move the free function and composed operation to <beast/core/read.hpp> since it is now quite general purpose. I don't know that I will actually do this in the near future but that is my thinking.

Given this direction, I don't think you should view using the Parser concept in that fashion as a hack. Beast is a low level library, and exposes building blocks for you to use. There's no "right" or "wrong" way to put together these building blocks. I want the blocks to be flexible, so people can build things with them that I did not anticipate or design for. I think that's the best measure of "success" for this library.

vinniefalco on 10 Nov 2016

If you wouldn't mind giving me more details about your use-case, perhaps I can improve Beast's interfaces to support it. Or suggest better ways of accomplishing the same result.

vinniefalco on 10 Nov 2016

I have two use cases:

1) A high performance HTTP reverse proxy which accepts HTTP requests, parses the headers, and based on the path of the request creates a new HTTP request to a backend server, forwarding the body of the client request to the backend server and the response from the server back to the client asynchronously. The proxy should have a fixed memory footprint per connection, so it should be able to use a single fixed-size buffer per connection and not read the whole message body before forwarding it.

2) A high performance HTTP service which for each HTTP request calls a handler (after parsing the headers), passing it Request and Response objects.

The Request should have accessors for the request headers, verb and path and should also have async_read() for reading the request body. The async_read() call should return system::eof when the end of the body is reached.
The Response object should have setters for the result code and headers and it should also provide async_write() for writing the response body. The Response should have a finish() method for indicating the end of the response. When finish() is called the underlying stream should be returned to the server for waiting and handling a consecutive request from the peer.

Request::async_read() and Response::async_write should be callable multiple times for processing a client request.

I plan on using 2) for implementing 1) and as a transport for a generic REST/RPC framework. Based on an IDL definition, the REST/RPC framework generates client-side stubs and a server-side implementation exposed over REST and/or JsonRPC.

georgi-d on 11 Nov 2016

This is great, thanks. It sounds like Beast has good interfaces for handling the write side, but we might need to do a little more work on reading. Some questions:

How will you handle trailing headers?
How will you handle Expect: 100-continue?
How do you feel about the current division between message and message_header (which will be renamed to just header), and the routines to write them?

vinniefalco on 11 Nov 2016

How are trailing headers handled?

I have not considered trailing headers up to now but I think it would be fine if they appear in the headers container of the message after async_read() returns eof(). I would be fine if they are not supported at all as well.

How is Expect: 100-continue handled?

On the server side: I think it would be better if handling is explicit. When the handler is called, the code should explicitly check for is_continue_required(request), do any additional checks, and do an async_write_response_continue() before doing any async_read() calls.

On the client side: I think it should be handled explicitly as well. If the client adds an Expect: 100-continue header then it should do an explicit async_read_continue() before doing any writes to the request body. Then the client should do another async_parse/async_read which should fill in any additional headers. Then the response body could be read.

How do you feel about the current division between message and message_header (which will be renamed to just header), and the routines to write them?
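The explicit server-side check could look roughly like this. It is only a sketch: is_continue_required is the name proposed above, but its signature here, the header_map type with lowercased keys, and the continue_response helper are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Assumed header container: field names are stored lowercased.
using header_map = std::map<std::string, std::string>;

// Case-insensitive comparison of two ASCII strings.
bool iequals(std::string const& a, std::string const& b)
{
    return a.size() == b.size() &&
        std::equal(a.begin(), a.end(), b.begin(),
            [](unsigned char x, unsigned char y)
            { return std::tolower(x) == std::tolower(y); });
}

// True if the request asked for an interim 100 response before the body.
bool is_continue_required(header_map const& headers)
{
    auto const it = headers.find("expect");
    return it != headers.end() && iequals(it->second, "100-continue");
}

// The interim response a server sends before reading the request body.
std::string continue_response()
{
    return "HTTP/1.1 100 Continue\r\n\r\n";
}
```

A handler would call is_continue_required on the parsed headers, run its own checks, write continue_response() to the client, and only then start reading the body.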

I have not looked too deep into them but I like the separation in general and the presence of the headers_parser which allows to parse the headers, do some handling and then move to parsing the body. On the write side I have not looked much yet. I am still working on the read side. I would like to be able to write the headers separately and then write the body.

georgi-d on 11 Nov 2016

I have created a draft change which adds CompletionCondition to async_parse():

https://github.com/georgi-d/Beast/commit/c14400711f63b6b1010cd447ace0fce11fb0ada5
For now I have only tested that it compiles. Will write some unit tests next.

georgi-d on 11 Nov 2016

I'm not sure that a completion condition is the best way to solve this problem. I need more time to think about it. I believe we can do better.

vinniefalco on 11 Nov 2016

@georgi-d I've been thinking a lot about this, and I am starting to think that the "stateless" model of the interface is not the right approach. What I mean is that free functions don't encapsulate enough state for us to do useful things. Now I am thinking about a new approach, similar to what is done with websocket:

namespace beast {
namespace http {

template<bool isServer, class NextLayer>
class stream_v1
{
    streambuf rb_;
    NextLayer next_layer_;
    basic_parser_v1 parser_;

public:
    template<class... Args>
    stream_v1(Args&&... args)
        : next_layer_(std::forward<Args>(args)...)
    {
    }

    // Read the header asynchronously (a server reads request headers)
    template<class Fields, class ReadHandler>
    void
    async_read(header<isServer, Fields>& h, ReadHandler&& handler);

    // Read some of the body asynchronously
    template<class MutableBufferSequence, class ReadHandler>
    std::size_t
    async_read_some(MutableBufferSequence const& buffer, ReadHandler&& handler);

    // Write the header asynchronously
    template<class Fields, class WriteHandler>
    void
    async_write(header<! isServer, Fields> const& h, WriteHandler&& handler);

    // Write some of the body asynchronously
    template<class ConstBufferSequence, class WriteHandler>
    std::size_t
    async_write(ConstBufferSequence const& buffer, WriteHandler&& handler);
};

} // http
} // beast

// typical use (server side):
boost::asio::io_service ios;
beast::http::stream_v1<true, boost::asio::ip::tcp::socket> stream{ios};

This allows us to make the parsing state a member of the class, which persists across member function calls. We would do away with the free functions like http::async_read and replace them with member functions. The parser would always exist, and always have the correct state. This allows us to handle the body differently if we want. It will take me more time to sketch out how this might look in practice. It will require significant modifications to the parser.

We can probably keep the free functions for simple use-cases though.
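To illustrate why keeping the parsing state in a class helps, here is a small synchronous model. It is not the proposed stream_v1: message_reader is a hypothetical class, the "next layer" is an in-memory string, and only a Content-Length body is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical model of a stateful HTTP stream. The parse position and the
// remaining body length persist across member function calls.
class message_reader
{
    std::string wire_;           // stands in for next layer plus read buffer
    std::size_t pos_ = 0;        // how far the wire data has been consumed
    std::size_t body_left_ = 0;  // body bytes not yet handed to the caller
    bool header_done_ = false;

public:
    explicit message_reader(std::string wire) : wire_(std::move(wire)) {}

    // Reads the header block and remembers the declared body length.
    std::string read_header()
    {
        assert(!header_done_);
        std::size_t const end = wire_.find("\r\n\r\n");
        assert(end != std::string::npos);
        std::string header = wire_.substr(0, end);
        std::string const key = "Content-Length: ";
        std::size_t const at = header.find(key);
        if (at != std::string::npos)
            body_left_ = std::stoul(header.substr(at + key.size()));
        pos_ = end + 4;
        header_done_ = true;
        return header;
    }

    // Returns at most max bytes of the body; empty once the body has ended.
    std::string read_some(std::size_t max)
    {
        assert(header_done_);
        std::size_t const n =
            std::min({max, body_left_, wire_.size() - pos_});
        std::string piece = wire_.substr(pos_, n);
        pos_ += n;
        body_left_ -= n;
        return piece;
    }

    bool done() const { return header_done_ && body_left_ == 0; }
};
```

The caller can interleave other work between read_some calls, and the object always knows where in the message it is, which a stateless free function cannot track on its own.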