RFC 3986 URL Validation

Posted on: Thu, 2009-01-08 09:17 | By: matt | In:

Have you ever submitted a URL to a site, only to have the site tell you it was invalid when you knew it wasn't? That happened to me recently on one of my Drupal sites. Drupal told me a Flickr URL containing an @ symbol was invalid. When I looked deeper into the issue I found the URL really was valid. When I looked to see what other software was doing I found many cases where there was no validation, or where the validation that was present failed for many types of valid URLs. So, let's take a look at how to do good URL validation.

The Specifications

The specification for URIs, of which URLs are a subset, is RFC 3986. This is not only a great place to start, but its Appendix A provides a detailed guide to the makeup of a URI.

filter_var in PHP

In PHP 5 there is a function called filter_var which allows you to test a piece of data against a filter. One of the filters is FILTER_VALIDATE_URL and this seems like the obvious choice. Its usage would be something like:

  $result = filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_SCHEME_REQUIRED);

Simple, right? The issue is with how this works. FILTER_VALIDATE_URL attempts to pass the URL through parse_url and see if it can break the URL down. In the documentation of parse_url it says, "This function is not meant to validate the given URL".

Additionally, if you test this you'll find URLs like "http://", "http://...", and many other invalid URLs will pass because they have a structure parse_url can handle.
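
If you want to see what your own PHP build does with these strings, a small loop like the one below is enough. It is only a sketch: the Flickr-style URL is a made-up example, the FILTER_FLAG_SCHEME_REQUIRED flag is left out because, as far as I know, later PHP releases deprecated and then removed it, and no expected output is listed because the filter's behaviour has changed between PHP versions.

```php
<?php
// Candidate strings from the discussion above. The first one is a
// hypothetical Flickr-style URL containing an @ symbol.
$candidates = array(
  'http://www.flickr.com/photos/12345@N00/',
  'http://',
  'http://...',
);

foreach ($candidates as $candidate) {
  // filter_var() returns the filtered string on success and FALSE on failure.
  $result = filter_var($candidate, FILTER_VALIDATE_URL);
  printf("%-45s %s\n", $candidate, $result === FALSE ? 'rejected' : 'accepted');
}
```

Run it on the PHP version you deploy to before trusting the filter for anything important.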

Using A Regular Expression

I searched for a regular expression that validated for RFC 3986 and couldn't find one. So, I wrote the following (for PHP).

/^
# Start at the beginning of the text
# The scheme
([a-z][a-z0-9\+\-\.]*):\/\/
# Userinfo (optional)
(?:                                                     
  (?:(?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*
  (?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@
)?
# The domain
(?:
  # Domain name or IPv4
  (?:[a-z0-9\-\.]|%[0-9a-f]{2})+
  # or IPv6
  |(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\])
)
# Server port number (optional)
(?::[0-9]+)?
# The path (optional)
(?:[\/\?]
  (?:[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})  
*)?
$/xi

For Drupal we are narrowing the scheme down to just http, https, and ftp, and we don't need the scheme captured as a back reference, so the regex looks like:

/^
# Start at the beginning of the text
# Look for ftp, http, or https
(?:ftp|https?):\/\/
# Userinfo (optional)
(?:                                                     
  (?:(?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*
  (?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@
)?
# The domain
(?:
  # Domain name or IPv4
  (?:[a-z0-9\-\.]|%[0-9a-f]{2})+
  # or IPv6
  |(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\])
)
# Server port number (optional)
(?::[0-9]+)?
# The path (optional)
(?:[\/\?]
  (?:[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})  
*)?
$/xi
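
Here is a rough sketch of how the Drupal flavoured expression could be wrapped in a function. The function name example_valid_url is made up for illustration, and the test URLs are my own examples. The pattern is my transcription into a nowdoc string: the opening delimiter comes before the first comment (PHP treats the first non-whitespace character as the delimiter), userinfo accepts only unreserved and sub-delim characters plus %HH, the path starts with / or ?, and the comments avoid the / character because an unescaped delimiter would end the pattern.

```php
<?php
// Hypothetical helper name; returns TRUE when $url matches the pattern.
function example_valid_url($url) {
  // A nowdoc keeps the quote, dollar sign, and backslashes literal.
  $pattern = <<<'REGEX'
/^
# Look for ftp, http, or https
(?:ftp|https?):\/\/
# Userinfo (optional)
(?:
  (?:(?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*
  (?:[\w\.\-~!$&'\(\)*\+,;=]|%[0-9a-f]{2})+@
)?
# The domain
(?:
  # Domain name or IPv4
  (?:[a-z0-9\-\.]|%[0-9a-f]{2})+
  # or IPv6
  |(?:\[(?:[0-9a-f]{0,4}:)*(?:[0-9a-f]{0,4})\])
)
# Server port number (optional)
(?::[0-9]+)?
# The path (optional)
(?:[\/\?]
  (?:[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]|%[0-9a-f]{2})*
)?
$/xi
REGEX;
  return preg_match($pattern, $url) === 1;
}

// A few made-up URLs to try the helper with.
$samples = array(
  'http://www.flickr.com/photos/12345@N00/',
  'ftp://user:pass@example.com:21/file.txt',
  'http://[::1]/',
  'http://example.com/ind%ex.html',
  'mailto:someone@example.com',
);
foreach ($samples as $sample) {
  printf("%-45s %s\n", $sample, example_valid_url($sample) ? 'valid' : 'invalid');
}
```

The first three samples should come out valid and the last two invalid: the fourth has a % that is not followed by two hex digits, and the fifth uses a scheme outside the allowed list.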

Drupal's testing framework has made it easy to pass valid and invalid URLs through this expression to test if it's working. Details on the issue are at http://drupal.org/node/124492.

Think you can break it or do better? I would love it if you could. Please feel free to try and let me know. If it works for you, please share it with your friends, enemies, and anyone else who could use it.

Update: The regular expressions were updated to allow % HEXDIGIT HEXDIGIT rather than a bare %, and to allow IPv6 addresses for the domain.

Comments

#1 The regexp does not seem to

The regexp does not seem to handle a literal IPv6 in the domain part, does it?

#2 Another good catch

You are right about IPv6 domains. I'll update it after I have a tested fix.

#3 http://somedomain.com/ind%ex.

http://somedomain.com/ind%ex.html

is not a valid URL, but the regexp would mark it as valid.

#4 Good catch

The % is in there to catch hex digits, known as pct-encoded in the spec. The proper formatting for those is "%" HEXDIG HEXDIG. Any suggestions on a modification to accommodate this?

pct-encoded is allowed in the query, path, host, and userinfo.
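
One way to check the "%" HEXDIG HEXDIG rule on its own, as a rough sketch (the function name is made up for illustration), is to look for any % that is not followed by two hex digits:

```php
<?php
// Hypothetical helper: TRUE when every % in $text starts a %HH triplet.
function example_pct_encoding_ok($text) {
  // Negative lookahead: find a % that is NOT followed by two hex digits.
  return preg_match('/%(?![0-9a-f]{2})/i', $text) === 0;
}

var_dump(example_pct_encoding_ok('/ind%ex.html'));   // bool(false)
var_dump(example_pct_encoding_ok('/ind%20ex.html')); // bool(true)
```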

#5 Good catch

What about something like this:

(?:(?:[\w\.\-\+%!$&'\(\)*\+,;=]|%[0-9a-f]{2})+:)*

instead of

(?:[\w\.\-\+%!$&'\(\)*\+,;=]+:)*

I'd be interested in how colons should be encoded (if at all) in the password part. If they need escaping or something, you should include that and drop the asterisk (there should be only one password at most, i.e. * -> ?).

#6 Specs

Thanks for the detail on the %HH. I'm planning on reworking this tonight when I'm at my test setup.

According to RFC 3986, the URI spec, you can have something like scheme://one:two:three@host.com/path. But, this is for URIs and not URLs. URLs are a subset of URIs.

According to RFC 1738, which deals with URLs, in 3.1. Common Internet Scheme Syntax it says the common scheme is //<user>:<password>@<host>:<port>/<url-path>.

But this spec is old, a "mostly obsolete definition of URL schemes and generic URI syntax". For details on that see RFC 3305 (and a number of other specs).

While I don't know of any schemes that use scheme://one:two:three@host.com/path, as far as I understand the one:two:three part is valid. Or is there some document or reference I'm missing?

#7 What about International Domain Names (RFC 3490, etc)?

RFC 3986 includes the handling of international domain names (IDN), but your regular expressions don't seem to cover this. See http://tools.ietf.org/html/rfc3490 and others.

#8 International Domain Names

For anyone who wants to get up to speed quickly on international domain names see http://en.wikipedia.org/wiki/Internationalized_domain_name.

You might also want to take a look at Internationalized Resource Identifiers (IRIs). Some details are at http://en.wikipedia.org/wiki/Internationalized_Resource_Identifier.

I think we will eventually be using IRIs instead of URIs. But, RFC 3986 (URI Spec) is a standard and RFC 3987 (IRI spec) is a Proposed Standard.

Additionally, IRIs can be converted to URIs and vice versa.

I would prefer to stay with approved Standards at this point. Approved standards are always open to change. Unless someone can sell me on a reason to go with a proposal rather than the standard.

On top of that, the W3C Internationalization group recommends encoding international characters in %HH form. See http://www.w3.org/International/techniques/developing-specs#defining for more details.

Can someone sell me on coding up something that supports IRIs? It would go a long way if someone knows a drop in replacement for [a-z] and \w that deals with international characters.

#9 http://☃.net/ or

#10 It won't be simple

Validating URLs with a regexp is a problem that seems to pop up every now and then on tech related blogs. And every time the conclusion seems to be the same: URLs seem simple at first but can be massively complicated. And there will always be a corner case you forgot to handle.

Probably all the regular expressions for this purpose that I have seen were long and ugly (= difficult to read or understand). When it comes to coding, long and difficult means hidden bugs and laborious code maintenance. I once read a hint about validating URLs with a regexp: "don't do it". Please think about some other solution if possible.

And here is a link for those who are interested in seeing how other people have iterated over and over to find a suitable regex for URLs: http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

#11 Have you tried the regular

Have you tried the regular expression in Appendix B of RFC 3986:

^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

#12 Detect not Validate

This regular expression will detect URI formatting and break a URI down into usable parts. It won't validate a URI. For example, it doesn't check which characters are allowed in each part.
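
To illustrate the difference, here is a small sketch that runs the Appendix B expression through preg_match. The function name and the sample input are made up, and ~ is used as the delimiter purely so the slashes inside the expression need no escaping.

```php
<?php
// Hypothetical helper: splits a URI reference into its five components
// using the Appendix B expression from RFC 3986.
function example_split_uri($uri) {
  $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
  if (!preg_match($pattern, $uri, $m)) {
    return FALSE;
  }
  // Group numbers follow the RFC: 2 scheme, 4 authority, 5 path,
  // 7 query, 9 fragment. Trailing groups that took no part in the
  // match are absent from $m, hence the isset() checks.
  return array(
    'scheme'    => isset($m[2]) ? $m[2] : '',
    'authority' => isset($m[4]) ? $m[4] : '',
    'path'      => isset($m[5]) ? $m[5] : '',
    'query'     => isset($m[7]) ? $m[7] : '',
    'fragment'  => isset($m[9]) ? $m[9] : '',
  );
}

// An obviously invalid URL is still split without complaint.
print_r(example_split_uri('http://....../ind%ex.html?q=1#top'));
```

The authority comes back as "......" and the path as "/ind%ex.html", which shows the expression splitting the string without judging it.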

#13 No need to escape metachars in char class

Your regexes are needlessly escaping metachars within character classes. With the PCRE regex engine (used by the PHP preg_* functions), the only metachars that need to be escaped within a character class are the backslash '\', the right square bracket ']' and the dash '-'. And the '-' dash doesn't need to be escaped when it's the first char in the class. Thus your char class can be simplified...

from this:
[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]

to this:
[-\w#!:.?+=&@!$'~*,;/()[\]]

There are also some real problems with your regex overall. For one thing, it does not take into account the more restrictive naming required of the host portion when using the DNS domain name service. Careful reading of section 3.2.2 (and its references to RFC1034 and RFC1123) describes how a proper domain name is less than 255 chars overall and consists of dot separated components which each are no more than 63 chars and begin and end in alphanumerics but may contain embedded dashes.

Your regex says the following (invalid) URLs are valid:
  http://....../path/?query#fragment
  http://------/path/?query#fragment

Remember, there is a difference between a valid Generic URI (as spelled out in RFC3986) and a valid HTTP URI scheme.
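
The host name rules described above could be checked separately from the big regex. The following is only a sketch under those stated rules (fewer than 255 characters overall, dot separated labels of at most 63 characters that begin and end with an alphanumeric). The function name is made up, and it does not try to tell IPv4 addresses or IPv6 literals apart from names.

```php
<?php
// Hypothetical helper: checks a host against the DNS naming rules above.
function example_valid_hostname($host) {
  // Overall length limit.
  if ($host === '' || strlen($host) >= 255) {
    return FALSE;
  }
  foreach (explode('.', $host) as $label) {
    // 1 to 63 chars, alphanumeric at both ends, dashes only in between.
    if (!preg_match('/\A[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\z/i', $label)) {
      return FALSE;
    }
  }
  return TRUE;
}

var_dump(example_valid_hostname('example.com')); // bool(true)
var_dump(example_valid_hostname('......'));      // bool(false)
var_dump(example_valid_hostname('------'));      // bool(false)
```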

#14 Slight correction

He does need to escape the forward slash, as it's the pattern delimiter in PHP's implementation of PCRE. Otherwise there are going to be some pretty unexpected results!

Interesting point about http://...../. However, the above is useful enough for me as it is. Thanks Matt!
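
For anyone who wants to check the escaping and delimiter points themselves, here is a quick sketch. It compares the two character classes from the previous comment over the ASCII range, using % as the delimiter (an arbitrary choice, picked because neither class contains it), and then shows what happens to the simplified class when / is the delimiter.

```php
<?php
// Both classes are copied from the comment above. Nowdoc strings avoid
// any PHP-level escaping of the quote and the backslashes.
$escaped = <<<'CLS'
[\w#!:\.\?\+=&@!$'~*,;\/\(\)\[\]\-]
CLS;
$simplified = <<<'CLS'
[-\w#!:.?+=&@!$'~*,;/()[\]]
CLS;

// Collect every ASCII code point on which the two classes disagree.
$differences = array();
for ($code = 0; $code < 128; $code++) {
  $char = chr($code);
  $a = preg_match('%\A' . $escaped . '\z%', $char);
  $b = preg_match('%\A' . $simplified . '\z%', $char);
  if ($a !== $b) {
    $differences[] = $code;
  }
}
var_dump($differences); // an empty array: the classes match the same characters

// With / as the delimiter the unescaped slash ends the pattern early.
// preg_match() emits a warning (silenced here with @) and returns FALSE.
$broken = @preg_match('/\A' . $simplified . '\z/', 'a');
var_dump($broken); // bool(false)
```

So both comments hold: the extra escapes are not needed by PCRE itself, but the forward slash has to stay escaped for as long as / is the pattern delimiter.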