The Great Confusion About URIs

I've been using the terms URI and URL interchangeably for as long as I can remember. After all, they mean the same thing, right? Wrong! Those terms have been used so loosely, and for so long, that we've lost track of what they mean. "Really? Isn't that just... very easy to understand?" you might say. Well, think again, because this topic even has ramifications in our beloved programming languages, as I'll explain shortly.

The problem

It turns out that many people are in the same boat, according to RFC 3305¹:

The body of documents (RFCs, etc) covering URI architecture, syntax, registration, etc., spans both the classical and contemporary periods. People who are well-versed in URI matters tend to use "URL" and "URI" in ways that seem to be interchangeable. Among these experts, this isn't a problem, but among the Internet community at large, it is a problem. People are not convinced that URI and URL mean the same thing, in documents where they (apparently) do. When one RFC talks about URI schemes (e.g. "URI Syntax" (RFC 2396) [12]), another talks about URL schemes (e.g. "Registration Procedures for URL Schemes" (RFC 2717) [1]), and yet another talks of URN schemes ("Architectural Principles of URN Resolution" (RFC 2276) [13]), it is natural to wonder how they differ, and how they relate to one another. While RFC 2396, section 1.2, attempts to address the distinction between URIs, URLs and URNs, it has not been successful in clearing up the confusion.

So while experts tend to use both terms interchangeably, they mean different things, both for historical and technical reasons. Add to that the fact that:

  • some documents previously published by the IETF, the organization responsible for establishing Internet standards, use those terms in ways that are now considered technically incorrect (e.g. URL schemes instead of URI schemes).
  • the IETF tries to educate people about the differences between those terms while simultaneously advocating the use of URI over URL².
  • URL is still way more popular than URI as a term.

And you end up with... confusion³.

Let's clear that up

I personally think that there's nothing fundamentally wrong with using the terms URI and URL (or even URN), as long as your intent is clear. Let's put definitions on those terms.

Uniform Resource Identifier (URI)

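A URI is a string of characters that identifies a resource. Plain and simple. Its syntax is <scheme>:[//<authority>]<path>[?<query>][#<fragment>], where square brackets mark optional parts: only <scheme> and <path> are mandatory, and <path> may be empty. The IETF now advocates its use over the now informal 'URL'. For the record, these are perfectly valid URIs:

  • http://avp.wikia.com/wiki/Alan_%22Dutch%22_Schaefer
  • ftp://ftp.free.fr/mirrors/ftp.centos.org/7.2.1511/os/x86_64/GPL
  • mailto:billg@microsoft.com⁴
  • news:comp.lang.python⁵
  • tel:+1-212-555-2368
  • urn:isbn:9780062301239
  • urn:schemas-upnp-org:service:MyService⁶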
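
To make the generic syntax a bit more concrete, here's a minimal sketch that splits a URI with the regular expression given in Appendix B of RFC 3986. The split_uri name and the dictionary it returns are my own illustrative choices, not part of any library, and the pattern only splits a string into components; it doesn't validate them.

```python
import re

# Regular expression from RFC 3986, Appendix B, for breaking a URI
# reference down into its five generic components.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Return the generic components of uri; absent ones are None."""
    # Every group in the pattern is optional, so it matches any string.
    match = URI_PATTERN.match(uri)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),  # only present after a leading '//'
        "path": match.group(5),       # always a string, possibly empty
        "query": match.group(7),
        "fragment": match.group(9),
    }


for uri in (
    "ftp://ftp.free.fr/mirrors/ftp.centos.org/7.2.1511/os/x86_64/GPL",
    "urn:isbn:9780062301239",
):
    print(split_uri(uri))
```

Note how the second URI has no authority at all: everything after the first ':' ends up in the path.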

Uniform Resource Locator (URL)

A URL is a string of characters that identifies a resource located on a computer network. Its syntax depends on its scheme. For example, for a protocol-based URL, the syntax would be <scheme>://<userinfo>@<host>:<port><path>?<query>#<fragment>, where only <scheme> and <host> are mandatory. But isn't that syntax familiar? Well, it is, because a URL is technically a URI, but the converse is not necessarily true. Even though 'URL' is still commonly used today, the IETF now views it as an informal concept only⁷. It's interesting to note that neither ':' nor '//' is part of <scheme>, even conceptually. This is contrary to what I initially thought. So '://' is only a separator between the scheme and the rest of the URI⁸.
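
As a rough illustration of that protocol-based syntax, Python's urllib.parse.urlsplit exposes each piece as an attribute. The address below is made up (example.com, user alice, port 8443) and describe_url is a hypothetical helper name, so treat this as a sketch rather than a reference.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Map the components of a protocol-based URL to their names."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # contains neither ':' nor '//'
        "userinfo": parts.username,  # user name only; None when absent
        "host": parts.hostname,
        "port": parts.port,          # an int, or None when omitted
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Made-up address, used only to show where each component ends up
example = "https://alice@example.com:8443/docs/index.html?lang=en#intro"
for name, value in describe_url(example).items():
    print("{0: <9}: {1}".format(name, value))
```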

Among the preceding URI examples, these are URLs:

  • http://avp.wikia.com/wiki/Alan_%22Dutch%22_Schaefer
  • ftp://ftp.free.fr/mirrors/ftp.centos.org/7.2.1511/os/x86_64/GPL
  • mailto:billg@microsoft.com
  • news:comp.lang.python
  • tel:+1-212-555-2368

But wait! There's also another kind of URI, which is like the forgotten child of URIs.

Uniform Resource Name (URN)

A URN is a string of characters that uniquely identifies a resource. Its syntax is urn:<namespace identifier>:<namespace specific string>, where <namespace identifier> is a way to group related and unique identifiers together, and <namespace specific string> can be pretty much anything allowed by the namespace identifier. Just like a URL, a URN is technically a URI, but the converse is not necessarily true. What's interesting about URNs is that, unlike URLs, they only name or identify a resource, without specifying its exact location. It's up to a system using URNs to locate the specified resources.

Among the preceding URI examples, these are URNs:

  • urn:isbn:9780062301239
  • urn:schemas-upnp-org:service:MyService
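
Here's a small sketch of how the urn:<namespace identifier>:<namespace specific string> layout could be taken apart. The split_urn function is a hypothetical helper of mine; it only looks at the two ':' separators and doesn't check the character rules that the URN specifications impose.

```python
def split_urn(urn):
    """Split a URN into (namespace identifier, namespace specific string)."""
    scheme, first_sep, rest = urn.partition(":")
    # The 'urn' scheme name is case-insensitive, like any URI scheme.
    if not first_sep or scheme.lower() != "urn":
        raise ValueError("not a URN: {0!r}".format(urn))
    nid, second_sep, nss = rest.partition(":")
    if not (nid and second_sep and nss):
        raise ValueError("incomplete URN: {0!r}".format(urn))
    # Any further ':' belongs to the namespace specific string.
    return nid, nss


for urn in ("urn:isbn:9780062301239", "urn:schemas-upnp-org:service:MyService"):
    print(split_urn(urn))
```

What the namespace specific string means (an ISBN, a service type) is entirely up to the namespace, which is why the function leaves it untouched.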

The programmer's dilemma¹¹

Most modern programming languages come with facilities¹² used to parse URIs. But are they named according to the terminology defined by the IETF? And even more importantly, do their capabilities reflect that terminology? Let's see. Since I'm a Python programmer above all else, I started out with a very basic code sample in that language and I ported it to various other languages. This allowed me to answer my questions.

from urllib.parse import urlparse


def parse_uri(the_uri):
    """Print the components that urlparse extracts from the_uri."""
    try:
        uri = urlparse(the_uri)
        print("{0: <9}: {1}".format("URI", the_uri))
        print("{0: <9}: {1}".format("Scheme", uri.scheme))
        # netloc is the whole authority: optional userinfo, host and port
        print("{0: <9}: {1}".format("Host", uri.netloc))
        print("{0: <9}: {1}".format("Path", uri.path))
        print("{0: <9}: {1}".format("Query", uri.query))
        print("{0: <9}: {1}".format("Fragment", uri.fragment))
        print("---")
    except ValueError as e:
        # urlparse raises ValueError on malformed input (e.g. an unbalanced '[')
        print(e)


# Protocol-based URIs first, then schemes without an authority part
parse_uri("http://avp.wikia.com/wiki/Alan_%22Dutch%22_Schaefer")
parse_uri("ftp://ftp.free.fr/mirrors/ftp.centos.org/7.2.1511/os/x86_64/GPL")
parse_uri("mailto:billg@microsoft.com")
parse_uri("news:comp.lang.python")
parse_uri("tel:+1-212-555-2368")
parse_uri("urn:isbn:9780062301239")
parse_uri("urn:schemas-upnp-org:service:MyService")

And here are the results:

                                         Can parse URI with scheme?
Facility                 First appeared  http  ftp  mailto  news  tel   urn   Respects IETF's terminology?
urllib.parse (Python 3)  1994            Yes   Yes  Yes     Yes   Yes   Yes   No
java.net.URL (Java)      1995            Yes   Yes  Yes     No⁹   No⁹   No⁹   No
System.Uri (.NET)        2002            Yes   Yes  Yes     Yes   Yes   Yes   Yes
URL (JavaScript)         2010            Yes   Yes  Yes     Yes   Yes   Yes   No
URI (Ruby)               2002            Yes   Yes  No¹⁰    No¹⁰  No¹⁰  No¹⁰  Yes
net/url (Go)             2009            Yes   Yes  No¹⁰    No¹⁰  No¹⁰  No¹⁰  No

These results reveal that:

  • Most of those facilities still rely on the former terminology defined by the IETF. It's no surprise: most of them appeared before or during the transition from 'URL' to 'URI', and they most likely kept their name for historical and backward compatibility reasons. However, I had a few surprises. Independently of their respective capabilities, .NET's System.Uri and Ruby's URI both respect the new terminology, whereas JavaScript's URL and Go's net/url don't, even though they were introduced just a few years ago.
  • There's no relationship between how facilities are named and what they can do. For example, Ruby's URI can really just parse protocol-based URIs (e.g. http://), while its name suggests that it's a general-purpose URI parsing facility. Conversely, Python's urllib.parse and JavaScript's URL can both parse generic URIs, while their names suggest that they can only parse protocol-based URIs.
  • General-purpose URI parsing facilities (e.g. .NET's System.Uri) obviously come with limitations. For URIs that are not protocol-based (e.g. mailto:...), <scheme> and <path> can successfully be extracted, but <path> can't be parsed any further because its format is highly dependent on the scheme itself, and there's a ton of possible schemes.
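
The last point above can be sketched in Python. A generic parser hands back the mailto path as one opaque string, and it takes scheme-specific code to go any further. The parse_mailto helper below is hypothetical and deliberately naive: it assumes a single address and ignores the multiple recipients and header fields that mailto URIs may also carry.

```python
from urllib.parse import urlsplit


def parse_mailto(uri):
    """Split the path of a single-address mailto URI into (local part, domain)."""
    parts = urlsplit(uri)
    if parts.scheme != "mailto":
        raise ValueError("not a mailto URI: {0!r}".format(uri))
    # The generic parser stops here: parts.path is just "user@domain".
    # Interpreting it requires knowledge of the mailto scheme itself.
    local_part, _, domain = parts.path.rpartition("@")
    return local_part, domain


generic = urlsplit("mailto:billg@microsoft.com")
print(generic.scheme, generic.path)  # what any generic parser can offer
print(parse_mailto("mailto:billg@microsoft.com"))  # scheme-specific step
```

Every other scheme (tel, news, urn and so on) would need its own equivalent of parse_mailto, which is why general-purpose facilities stop at <path>.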

Clearly, we can't blame old code for falling victim to changing standards, as backward compatibility is often a priority. However, when creating new code and naming things, we should be aware of current standards and terminology and stick to them.

Final thoughts

As we've seen in this post, the confusion about URI/URL/URN mainly comes from a misunderstanding of web standards. What you should remember is that:

  • URI is a superset of URL and URN.
  • There's nothing wrong with using any one of those terms, as long as your intent is clear. Whenever possible, you should use URI, which is more general.
  • URI parsing facilities are often named in a way that doesn't represent their real capabilities. Ultimately, it's our responsibility, as programmers, to find out what those capabilities are.
  • When naming things, we should stick to current standards. Concepts that look simple on the surface are sometimes more complex than we think, so it's worth learning more about them.

The RFCs may be a bit intimidating at first, but I found that reading them carefully really helped me in clearing up my own confusion about URIs. And interestingly, I now appreciate what the IETF has done on this matter so far.

  1. Copyright (C) The Internet Society (2002). All Rights Reserved.

  2. See RFC 3986: "Future specifications and related documentation should use the general term 'URI' rather than the more restrictive terms 'URL' and 'URN'".

  3. I sympathize deeply with the IETF because they've done a tremendous job and they don't have an easy task at hand. Establishing standards must be very hard!

  4. Apparently, this would be the real email address of Bill Gates. Good luck on getting a response from him!

  5. Google keeps an archive of this Usenet newsgroup at https://groups.google.com/forum/#!forum/comp.lang.python.

  6. This form is typically used for XML namespaces.

  7. See section 2.2 of RFC 3305: "[...] the term 'URL' does not refer to a formal partition of URI space; rather, URL is a useful but informal concept."

  8. As a side note, Tim Berners-Lee, the inventor of the World Wide Web, said that adding '//' as a separator for HTTP URIs was probably a mistake. Imagine using addresses like http:google.com!

  9. Throws an exception.

  10. The scheme is parsed successfully, but anything else is ignored.

  11. Okay, I admit that I was a bit inspired by this excellent book.

  12. By facility, I mean either package, class, interface or module.