Beast: [Feature Request] URI parser

Created on 21 Sep 2017 · 24 Comments · Source: boostorg/beast

Hi @vinniefalco et al.

Are there plans to implement a URI parser similar to cpp-netlib?

Feature

All 24 comments

Yes.

@vinniefalco fantastic! 👍 Can we rename this issue as a feature request?

Sure, although it would probably appear as a piece of example code at first. When it becomes a public interface, it will likely be part of a different library (which will also be submitted to Boost).

You can see some of the partial results here
https://github.com/vinniefalco/beast/commits/uri

Wow, that looks like quite a bit of effort. Any estimate on when that will be merged into master?

It is far from ready. A few months?

What's wrong with a "simple" regex?
E.g. something like this:

#include <cassert>
#include <regex>
#include <string>
#include <boost/algorithm/string.hpp> // for boost::algorithm::to_lower_copy

struct ParsedURI {
  std::string protocol;
  std::string domain;  // only domain must be present
  std::string port;
  std::string resource;
  std::string query;   // everything after '?', possibly nothing
};

// Returns a default-constructed (all-empty) ParsedURI if the input does not match.
ParsedURI parseURI(const std::string& url) {
  ParsedURI result;
  auto value_or = [](const std::string& value, std::string&& deflt) -> std::string {
    return (value.empty() ? deflt : value);
  };
  // Note: only "http", "https", "ws", and "wss" protocols are supported
  static const std::regex PARSE_URL{ R"(((https?|wss?)://)?([^/ :]+)(:(\d+))?(/([^ ?]+)?)?/?\??([^/ ]+\=[^/ ]+)?)",
                                     std::regex_constants::ECMAScript | std::regex_constants::icase };
  std::smatch match;
  if (std::regex_match(url, match, PARSE_URL) && match.size() == 9) {
    result.protocol = value_or(boost::algorithm::to_lower_copy(std::string(match[2])), "http");
    result.domain   = match[3];
    const bool is_secure_protocol = (result.protocol == "https" || result.protocol == "wss");
    result.port     = value_or(match[5], (is_secure_protocol) ? "443" : "80");
    result.resource = value_or(match[6], "/");
    result.query = match[8];
    assert(!result.domain.empty());
  }
  return result;
}

That's perfectly fine for a lot of cases, but it's not something suitable for standardization... there are a lot of memory allocations there.
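To illustrate the allocation concern, here is a minimal sketch of a parser that returns views into the caller's buffer instead of copying each component into a std::string. It is not the code from the uri branch; the names UriView and parse_uri_view are hypothetical, and the grammar is deliberately simplified (no userinfo, IPv6 literals, fragments, or percent-decoding). It assumes C++17.

```cpp
#include <string_view>

// Hypothetical view-based result: every field points into the caller's
// buffer, so the parse itself performs no heap allocation.
struct UriView {
  std::string_view scheme;  // empty when the input has no "scheme://"
  std::string_view host;
  std::string_view port;    // empty when absent
  std::string_view path;    // empty when absent
  std::string_view query;   // text after '?', empty when absent
};

// Splits "[scheme://]host[:port][/path][?query]".
// Returns false when the host component is empty.
inline bool parse_uri_view(std::string_view in, UriView& out) {
  constexpr auto npos = std::string_view::npos;
  out = UriView{};

  // Accept "://" as a scheme separator only if it contains the first '/'
  // or '?' of the input, so a "://" inside a query string is not mistaken
  // for a scheme.
  if (auto p = in.find("://"); p != npos && in.find_first_of("/?") == p + 1) {
    out.scheme = in.substr(0, p);
    in.remove_prefix(p + 3);
  }

  // The authority (host[:port]) ends at the first '/' or '?'.
  auto end = in.find_first_of("/?");
  std::string_view authority = in.substr(0, end);
  in = (end == npos) ? std::string_view{} : in.substr(end);

  if (auto c = authority.find(':'); c != npos) {
    out.port = authority.substr(c + 1);
    authority = authority.substr(0, c);
  }
  out.host = authority;

  // Whatever remains is the path, optionally followed by "?query".
  if (auto q = in.find('?'); q != npos) {
    out.query = in.substr(q + 1);
    in = in.substr(0, q);
  }
  out.path = in;
  return !out.host.empty();
}
```

The trade-off is lifetime: the views are only valid while the original string is alive, and defaults such as port 80 or 443 have to be applied by the caller rather than stored in the result.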

@vinniefalco Any updates on when you feel like it might get merged into master?

Nothing in sight, sorry!

@vinniefalco this is awesome work, and I'd like to help. Are there any low-hanging fruit style issues, or areas/tasks good for a new contributor?

Well great question. There is this lovely pull request: https://github.com/boostorg/beast/pull/1227

Unfortunately it needs a lot of work. I will do that work if the author is not able to, but it will take time. He implemented a URL parser which is not ideal, but I think it is suitable for the beast/experimental directory. We need to extract his URL parser and polish it up to the level of quality needed to exist in beast/experimental. This means it needs the right file structure, Javadocs, a correct interface, and documentation entries. Most importantly, it needs tests. If you want to take a stab at it, go for it. Otherwise I will start it soon.

If you take this on, I can guide you on the necessary cleanups.

Will start reviewing the code in that PR, and pay close attention to the parser. After a first read, will come up with some test cases, and try to get something working.

Thanks for the direction, and offer to guide. Will check back in as progress is made.

The first thing to do is to extract the files and organize them in the same style as Beast (detail namespace for private interfaces, .ipp file in the impl/ directory for function definitions, files in the detail/ directory if they are entirely private). I'm not quite sure if the parser has its own set of error_code values, but if it does, they need to use the same style as Beast. If it doesn't have error codes, that's okay for a first draft.
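As a rough sketch of what "its own set of error_code" with a detail namespace could look like, the following shows the general enum-plus-category pattern. Everything here is an assumption for illustration: the uri namespace, the enumerators, and the messages are hypothetical, and std::error_code is used so the example is self-contained, whereas Beast itself builds on boost::system::error_code.

```cpp
#include <string>
#include <system_error>
#include <type_traits>

namespace uri {

// Hypothetical error conditions a URI parser might report.
enum class error {
  invalid_scheme = 1,
  missing_host,
  invalid_port
};

namespace detail {

// Private: the category that maps enumerator values to messages.
class error_category_impl final : public std::error_category {
public:
  const char* name() const noexcept override { return "uri"; }

  std::string message(int ev) const override {
    switch (static_cast<error>(ev)) {
      case error::invalid_scheme: return "invalid scheme";
      case error::missing_host:   return "missing host";
      case error::invalid_port:   return "invalid port";
    }
    return "unknown uri error";
  }
};

inline const std::error_category& get_error_category() {
  static const error_category_impl cat;
  return cat;
}

} // namespace detail

// Found by argument-dependent lookup when an error converts to error_code.
inline std::error_code make_error_code(error e) noexcept {
  return {static_cast<int>(e), detail::get_error_category()};
}

} // namespace uri

namespace std {
// Opt in to implicit conversion from uri::error to std::error_code.
template <>
struct is_error_code_enum<uri::error> : true_type {};
} // namespace std
```

With this in place, a parser can write `ec = uri::error::missing_host;` and callers can compare with `ec == uri::error::missing_host`. In a header-only layout like the one described above, the category implementation would typically sit in the impl/ .ipp file, with only the enum and the trait in the public header.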

Thanks for the heads up, I didn't see your message before getting started. Will move the files into the proper structure. I wrote some basic tests based on your URI branch, and will move those to the proper place as well.

edit: Made the changes, and they're ready to PR in my develop branch. Should I PR now, or wait until refactoring the draft (improving parser state machine, using isalpha/isdigit from your URI branch, implementing error codes, etc.)?

Travis builds codecov reports so we can ensure that every line is tested:
https://codecov.io/gh/boostorg/beast/list/master/

A boost::beast-based project I'm working on is using a custom URI parser with known bugs, and I wouldn't mind giving this parser a try rather than first using another library, then moving to this once it's stable. Keeping the dependency list to only boost would be great. Any progress in sight, or shouldn't I hold my breath?

I won't be able to get to this until the end of the year so I wouldn't wait if I were you.

There is a new project from cppnetlib which seems quite complete. Might it be worth including in Beast? https://github.com/cpp-netlib/url

I will look at it!

Any news on this topic other than COVID?

I'm looking forward to having this feature implemented.

Yes, it is being worked on: https://github.com/CPPAlliance/url

Thanks for the info!
