Seventh Heaven

The Anatomy of a URL

Internet-Scale Namespaces, Part I

Rohit Khare * 4K Associates * September 1, 1999

In 1977 Charles and Ray Eames produced a classic short film that opened with a shot of a couple lounging on a picnic blanket in a lakeside park in Chicago. Over the next eight minutes they stepped out one order of magnitude at a time to depict the park, the city, the planet, all the way up to galactic clusters -- only to zoom all the way back in to the man's hand, then into its cells, proteins, and even subatomic particles.

Powers of Ten makes two complementary points. First, we have different scientific vocabularies for phenomena at various scales, ranging up from meteorology to geology to astronomy to cosmology, and ranging down from biology to chemistry to quantum mechanics. Second, the exact same physical laws apply at every scale: scale-invariant laws still underlie it all.

At UC Irvine's recent Workshop on Internet Scale Technologies (TWIST99) on Namespaces, I stole the same rhetorical framework to zoom in on a single Web page, decomposing one Internet-scale name into a host of more specialized names; and to zoom out to the larger social 'web of trust' such transactions are embedded in. Like the Eameses' film, our whirlwind tour will attempt to illustrate unique features of each namespace's domain of discourse, while simultaneously identifying cross-cutting, uniquely Internet-scale management issues.

Anatomy of a URL

Let's say I want to buy an airline ticket today. I've already made a reservation, so I fire up my Web browser and type the Uniform Resource Identifier http://www.united.com/Itinerary/NQSS5A into its 'Address' field. To me, that's just an opaque address which is directly resolved into a web page. That's why this format is also known as a Uniform Resource Locator: it can be read as the procedure for finding that web page as much as the name of its destination. Compare the difference between directions to a building and its postal address.

On the other hand, it's less of a locator than the schemes it succeeded. Directing folks to Internet information resources before URIs required writing out several lines of instructions, as documented by the MIME message/external-body access types in RFC 1521. When Tim Berners-Lee set out to define a format suitable even for "cocktail napkins," he was able to reduce the recipe to one line, as scheme://user:password@host:port/path. Ease of transcription came at the expense of character repertoire, though. While URLs look great in English, ad-hoc internationalization efforts must squeeze into the few legal alphanumeric US-ASCII characters permitted by the URI specification (originally RFC 1630, in June 1994, and standardized in August 1998 as RFC 2396).

It's certainly been a successful namespace. The big search engine projects claim there are over a billion web pages out there, each with its place in a site's hierarchy (URLs and URL paths are read left-to-right). We know from experience that some URIs are ephemeral to the point of expiring in seconds, while some have been around for a decade already. And as user interface goes, well, you'd never see anonymous FTP instructions on a toothpick wrapper…

Scheme

To the Web browser, though, the URI in the 'Address' field isn't opaque at all. It has to resolve each of the namespaces embedded within that URI. The URI Scheme is the very first part, since it selects the namespace and syntax (and typically, access protocol) for the rest of the URI.

Typical entries in this namespace are telnet: and ftp:, but there are also less protocol-specific schemes such as mailto: and even data: (which is followed by an arbitrary block of base-64 encoded bytes!). Security flags have also been crammed into the scheme, using a single, unreliable 's' to indicate a separate port for https (443) or ftps (990), for example.
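To make the browser's first step concrete, here is a minimal sketch of pulling the embedded names out of a URI with Python's standard urllib.parse module. The DEFAULT_PORTS table and the dissect helper are illustrative assumptions: they list only the ports mentioned in this article, not a complete registry.

```python
from urllib.parse import urlsplit

# Ports cited in this article only; an illustrative, incomplete table.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftps": 990}

def dissect(uri):
    """Split a URI into the separate names a browser must resolve."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,      # selects syntax and access protocol
        "host": parts.hostname,      # handed to the DNS resolver
        # An explicit port wins; otherwise fall back on the scheme's default.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,          # left for the Web server to resolve
    }

if __name__ == "__main__":
    print(dissect("http://www.united.com/Itinerary/NQSS5A"))
```

The scheme comes first because everything to its right is parsed according to rules the scheme itself selects.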

At the same time, if URIs are expected to be stable for decades there can't be too many schemes. The requirement to publish an IETF RFC to become an IANA-registered scheme enforces scarcity. It's also reflected in the limited ASCII token space, as well as the preclusion of internationalized variants.

Table 1. Comparison of several layers of namespaces used to resolve a typical Uniform Resource Identifier.

| Name | Resolved by | Structure | Format by | Controlled by | Internationalization | Number | Lifetime (sec) |
|---|---|---|---|---|---|---|---|
| URI | Web Browser | Left-to-Right | RFC 2396 | Webmaster | US-ASCII (UTF-8) | 10^10+ | 10^1 - 10^8 |
| URI Scheme | Web Browser | Atomic | RFC 1738 | IANA registered | None | 10^1+ | 10^8 - 10^0 |
| Domain Name | DNS Resolver | Right-to-Left | RFC 883 | ICANN-delegated | None (A-Z, 0-9, -) | 10^8+ | 10^7 - 10^8 |
| IP Address | TCP/IP Stack | Left-to-Right (Net.subnet, Net.Host) | RFC 791 | IANA-delegated | N/A | 10^10+ (2^31) | 10^1 - 10^7 |
| MAC Address | LAN Adapter | Mfr:device | IEEE 802.3 | IEEE Registration Authority | N/A | 10^14 (2^48) | 10^8 - 10^10 |
| Phone # | Modem + PPP link | Left-to-Right | ITU E.164 | N. Amer. Number Plan | Country Codes | 10^10 | 10^5 - 10^9 |
| URL Path | Web Server | Left-to-Right | RFC 2396 | Webmaster | US-ASCII (UTF-8) | 10^10+ | 10^1 - 10^8 |
| Filename | Operating System | Left-to-Right | OS-specific | Web Author | Ad-hoc | 1 - 10^6+ | 10^1 - 10^8 |
| Passenger Name Rec. | Reservation database | Atomic (picture str.) | Database-specific | Airline | None | 10^8+ | 10^1 - 10^7 |

Domain Name

Moving left-to-right in the http: URI syntax, we encounter the string "www.united.com". The Domain Name System interprets it in turn as a hierarchical right-to-left identifier within a global, singly-rooted tree. That is to say, the new Internet Corporation for Assigned Names and Numbers ultimately stands behind the (literally, anonymous) "." domain, delegating authority over .com, .int, .mil, and the national country codes to "top-level domain" registrars. Physically, though, there are 13 "root servers" (the most whose addresses can fit in a single UDP packet), which mirror one another's copies of the root of the DNS database.

Sitting at the edge of the network, the browser isn't concerned with any of these details. It passes that address to the local caching DNS resolver, which in turn hooks into the global pyramid scheme to find .com (today, from IANA), then united.com (today, from Network Solutions), and then www.united.com (from United).
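The right-to-left walk can be sketched in a few lines. This is only an illustration of the order in which zones are consulted, not a resolver; delegation_chain is a hypothetical helper, and no network queries are made.

```python
def delegation_chain(fqdn):
    """List the zones visited, from the anonymous root down to the full name."""
    labels = fqdn.rstrip(".").split(".")
    chain = ["."]  # the anonymous root domain
    # Read the labels right-to-left, growing the name one label at a time.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain

if __name__ == "__main__":
    for zone in delegation_chain("www.united.com"):
        print(zone)
```

Each step in the chain corresponds to a different authority: the root, the .com registry, and finally United itself.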

Of course, figuring out whether it's United Air Lines, United Moving Lines, or United Parcel Service is the original sin inherent in DNS. The 1982-3 vision of replacing manual site-local "hosts.txt" lookup files with a unified distributed database has succeeded only too well, entrenching it as the only possible human-friendly IP address directory scheme. The trademark, regulatory, and commercial morass of the mid-90's will ultimately set the stage for parallel resolution services such as RealNames or, arguably, Yahoo! categories.

DNS namespaces have a few more interesting constraints. There is little room for internationalization, since only the letters A-Z, digits, and '-' are permitted. The protocol limits fully-qualified domain names (FQDNs) to 255 characters and each component to 63 (while .com entries are further restricted to only 20 characters). While names were expected to remain valid for years or decades, dialup and mobile access have motivated proposals for Dynamic DNS as well. Since the DNS namespace also contains its inverse -- by looking up an IP address under the in-addr.arpa "domain" -- dynamism is even more difficult to accommodate.
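A rough sketch of these constraints, assuming only the rules named above (the A-Z, 0-9, '-' repertoire, 63-character components, 255-character names); real registries apply further rules that this check ignores. The helper names are hypothetical. The inverse mapping uses the standard ipaddress module, and 192.0.2.1 is a documentation address chosen for illustration.

```python
import ipaddress
import re

# One label: 1 to 63 characters drawn from letters, digits, and '-'.
LABEL = re.compile(r"[A-Za-z0-9-]{1,63}")

def is_valid_hostname(name):
    """Check only the character and length limits described in the text."""
    name = name.rstrip(".")
    if not name or len(name) > 255:
        return False
    return all(LABEL.fullmatch(label) for label in name.split("."))

def reverse_name(ipv4):
    """Build the in-addr.arpa name under which the inverse mapping lives."""
    return ipaddress.ip_address(ipv4).reverse_pointer

if __name__ == "__main__":
    print(is_valid_hostname("www.united.com"))
    print(reverse_name("192.0.2.1"))
```

Note that the inverse name reverses the octets, so that the right-to-left DNS hierarchy lines up with the left-to-right hierarchy of IP addresses.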

IP Address

Once the browser reduces a domain name to a 32-bit IPv4 address, it is combined with a 16-bit port number (usually determined by the URI scheme, such as HTTP on port 80) and passed down one further layer. The TCP/IP transport stack has to resolve this "name" to a next-hop (or destination) machine.

Unbeknownst to the application layer, the underlying routing algorithms rely on the internal division of the IP address into network and local numbers. Routers can summarize the world's network topology because IANA delegates large blocks of numbers to regional IP registries, who in turn allocate variable-size classes of network numbers to service providers (per policy articulated in RFC 2050). Other blocks are reserved for multicasting (224.0.0.0 and up), broadcasting (set the local part to all 1's), and loopback testing (127.0.0.1, also reserved as the domain name "localhost").
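The split into network and local numbers, and the reserved blocks, can be illustrated with Python's standard ipaddress module. The describe helper and the example address and prefix length are assumptions chosen for illustration.

```python
import ipaddress

def describe(addr, prefix_len):
    """Split an IPv4 address into network and local parts and flag reserved use."""
    ip = ipaddress.ip_address(addr)
    # strict=False lets us pass a host address rather than a network address.
    net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    return {
        "as_integer": int(ip),                       # the 32-bit "name"
        "network": str(net.network_address),         # what routers summarize on
        "local": int(ip) & int(net.hostmask),        # the host within that network
        "broadcast": str(net.broadcast_address),     # local part set to all 1's
        "multicast": ip.is_multicast,
        "loopback": ip.is_loopback,
    }

if __name__ == "__main__":
    print(describe("192.0.2.130", 24))
```

Routers only need the network part to forward a packet; the local part matters once the packet reaches the destination network.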

Network Address Translators (NATs) can complicate the picture. Renumbering a whole organization when it outgrows its network class can be a tedious and expensive proposition, so it may 'hide' its entire network behind one or a few gateway machines. But IP addresses are assumed to be end-to-end unique (one IP interface number per host), and a NAT can't catch every place in every packet where an application protocol might cite an IP address. In practice, applications can 'leak' private IP addresses onto the public Internet, risking routing havoc and other errors.
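As a rough illustration of the leak problem, the sketch below scans an application payload for dotted-quad strings that fall in private ranges. The function name and the sample payload are hypothetical, and a regular-expression scan like this is exactly the kind of heuristic that misses addresses encoded in other forms.

```python
import ipaddress
import re

# Anything that looks like four dot-separated decimal numbers.
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def leaked_private_addresses(payload):
    """Return dotted-quad strings in the payload that are private addresses."""
    leaks = []
    for candidate in DOTTED_QUAD.findall(payload):
        try:
            ip = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # looked like an address but is not one (e.g. 999.1.1.1)
        if ip.is_private:
            leaks.append(candidate)
    return leaks

if __name__ == "__main__":
    print(leaked_private_addresses("connect back to 10.0.0.5 or 8.8.8.8"))
```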

MAC Address

At the next layer down, the network interface has to translate an IP address to a Media Access Control (MAC) identifier, such as a 48-bit Ethernet station identifier. The MAC namespace is permanently world-unique, since manufacturers must buy 4K blocks from the IEEE Registration Authority for $500 each, plus a $1,250 initiation fee.
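The "Mfr:device" structure can be sketched as follows, assuming the common convention of a 24-bit manufacturer prefix followed by a 24-bit device number (smaller registered blocks would use a longer prefix). The split_mac helper and the sample address are illustrative only.

```python
def split_mac(mac):
    """Split a 48-bit MAC address into (manufacturer, device) halves."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")

    def fmt(chunk):
        return ":".join(f"{o:02X}" for o in chunk)

    # Assumption: first three octets name the manufacturer, last three the device.
    return fmt(octets[:3]), fmt(octets[3:])

if __name__ == "__main__":
    print(split_mac("00:a0:c9:14:c8:29"))
```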

On any particular network medium, though, this miniature namespace is resolved in the packet driver using the Address Resolution Protocol (ARP) and its complement, Reverse ARP. Rather than having every station flood the network by periodically announcing its own IP:MAC binding, Ethernet reserves a query frame type for this purpose.

Phone Number

Or perhaps this airline ticket's actually being purchased over a dialup connection. Once more, the link layer's address becomes the next lower one's resolvable name. In this case, the Point-to-Point Protocol (PPP) driver needs a phone number to set up a connection over the PSTN. As initially designed by the Bell System in 1947, telephone numbers are another world-unique hierarchical left-to-right namespace, split into country, city, exchange, and subscriber codes.

Past seven digits, their inherent human interface isn't all that great, relying on yellow and white pages at any appreciable scale. Even the written form varies, from ITU standard +1-(626)-806.7574 to the domain name form 4.7.5.7.6.0.8.6.2.6.1.tpc.int. Furthermore, voice is only one of many modern applications, so phone numbers are additionally disambiguated by use (fax, mobile, data, etc).
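The two written forms above are mechanically related: the domain name form is simply the digits reversed, one per label. A minimal sketch, with tpc_int_name as a hypothetical helper name:

```python
def tpc_int_name(phone):
    """Rewrite a written phone number as its reversed-digit tpc.int domain name."""
    # Keep only the digits, dropping '+', '-', '.', parentheses, and spaces.
    digits = [ch for ch in phone if ch.isdigit()]
    # Reverse them so the most significant part (country code) sits rightmost,
    # matching the right-to-left hierarchy of domain names.
    return ".".join(reversed(digits)) + ".tpc.int"

if __name__ == "__main__":
    print(tpc_int_name("+1-(626)-806.7574"))
```

This is the same trick in-addr.arpa plays with IP addresses: a left-to-right hierarchy is flipped to fit DNS's right-to-left delegation.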

The North American Numbering Plan Administration (www.nanpa.net) is the authority responsible for assigning new area codes and allocating the 792 valid exchanges to local carriers. Issuing every new entrant a minimum block of 10,000 lines only aggravates the problem, similar to the IP registrars' fears before the advent of Classless Inter-Domain Routing (CIDR) allowed for network prefixes between 16 and 24 bits wide.
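One accounting that arrives at 792, offered here as an assumption rather than as NANPA's stated rule: take every three-digit code whose first digit is 2 through 9, then remove the eight service codes of the form N11 (211 through 911). Each exchange then spans 10,000 four-digit subscriber numbers, which matches the minimum block size.

```python
def valid_exchanges():
    """Enumerate three-digit exchange codes under the assumed rules."""
    codes = []
    for n in range(200, 1000):        # first digit 2-9, so 800 candidates
        code = str(n)
        if code.endswith("11"):       # assumed exclusion: N11 service codes
            continue
        codes.append(code)
    return codes

LINES_PER_EXCHANGE = 10 ** 4          # subscriber numbers 0000 through 9999

if __name__ == "__main__":
    exchanges = valid_exchanges()
    print(len(exchanges), len(exchanges) * LINES_PER_EXCHANGE)
```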

NANPA also reserves application-specific numberspaces, such as the 555 exchange for information services or 800-855 for toll-free teletype access.

[Figure: tree diagram of the anatomy of a URI]

Pathname

Our trip down the network stack has only moved us halfway across the original URI. Now we have a connection to an HTTP server, which can resolve the remaining pathname into a renderable Web page. Only the server knows what aliases and file-type extensions and pathname rewriting rules apply to the URI path to yield a filename. The server's operating system, in turn, can resolve that filename to an inode. The filesystem uses inodes to insulate the bits on disk from user-level renaming, moving, linking, and so on. Inodes, in turn, are resolved to physical track and sector addresses by the disk driver's directory (or "map").
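The insulation that inodes provide is easy to observe on a POSIX-style filesystem: renaming a file changes its name but not its inode number. The sketch below uses a temporary directory and made-up filenames; behavior on other filesystems may differ.

```python
import os
import tempfile

def inode_survives_rename():
    """Create a file, rename it, and report whether the inode number is unchanged."""
    with tempfile.TemporaryDirectory() as workdir:
        old = os.path.join(workdir, "itinerary.html")
        new = os.path.join(workdir, "renamed.html")
        with open(old, "w") as handle:
            handle.write("<html></html>")
        before = os.stat(old).st_ino   # inode the filename currently resolves to
        os.rename(old, new)            # change the name within the same directory
        after = os.stat(new).st_ino
        return before == after

if __name__ == "__main__":
    print(inode_survives_rename())
```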

Of course, a Web server does more than blindly transfer files; it can also execute processes or interpret scripts. In our airline case, the first component of the URI path specifies a reservation database at the server, but the second is a database key, which names a single record within that database. In the lingo of the airlines' Global Distribution Systems (GDS), a Passenger Name Record (PNR) is created for every reservation enquiry and ticket.
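A toy version of that two-step resolution might look like the following. Everything here is hypothetical: the in-memory DATABASES dictionary stands in for a real reservation system, and the record contents are invented for illustration.

```python
# Hypothetical stand-in for the server's back-end databases.
DATABASES = {
    "Itinerary": {
        "NQSS5A": {"passenger": "EXAMPLE/TRAVELER", "status": "reserved"},
    },
}

def resolve_path(path, databases=DATABASES):
    """Resolve /<database>/<key> to a single record."""
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) != 2:
        raise ValueError(f"expected /<database>/<key>, got {path!r}")
    database_name, record_key = segments
    # First component names the database; second is a key within it.
    return databases[database_name][record_key]

if __name__ == "__main__":
    print(resolve_path("/Itinerary/NQSS5A"))
```

The point of the sketch is that the path is not one name but two, each resolved in a different namespace controlled by a different party.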

A PNR is a short, opaque alphanumeric string that's particular to that GDS -- but a PNR alone isn't enough to tell if it comes from the Apollo or Sabre namespace, for example (much less which airline it's on!). Furthermore, a PNR name is needed up until the day of flight, to plan revenue yield, catering, and so on; but for archival post-flight access, a more permanent International Air Transport Association (IATA) 16-digit ticket number names the journey.

Names in HTTP Messages

Once the Web server begins transmitting an HTML page, we can zoom in one last time, down to the actual "bytes on the wire," as we say. Now the HTTP response message format itself looms large. First comes "HTTP/1.1 200 OK", sporting an IANA-registered version number and reply code. Later, we see "Content-Type: text/html", plucked from the Internet Media Type namespace. Also known as MIME types, these entries must be documented with IANA to some degree, or designated private with "x-" or "vnd." prefixes. "Content-Language: en-us" also comes from an IANA list, but it's ultimately dependent on ISO country and language tokens. Similarly with character sets, referenced by IANA-registered atoms like "iso-8859-1" and encodings like "UTF-8", though the actual character sets and encoding algorithms are defined by other bodies.
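To see these names side by side, here is a minimal sketch that picks apart a response message using Python's standard email.parser module (HTTP borrows its header syntax from Internet mail). The SAMPLE message is an assumption built from the headers quoted above; the charset parameter and the body are added purely for illustration.

```python
from email.parser import Parser

# Sample response; the charset parameter and body are illustrative assumptions.
SAMPLE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=iso-8859-1\r\n"
    "Content-Language: en-us\r\n"
    "\r\n"
    "<HTML><HEAD></HEAD></HTML>"
)

def parse_response(raw):
    """Separate the status line, the named header values, and the body."""
    head, _, body = raw.partition("\r\n\r\n")       # blank line ends the headers
    status_line, _, header_block = head.partition("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = Parser().parsestr(header_block, headersonly=True)
    return {
        "version": version,                          # protocol version token
        "status": int(code),                         # reply code
        "reason": reason,
        "media_type": headers.get_content_type(),    # Internet Media Type
        "charset": headers.get_content_charset(),    # character set name
        "language": headers["Content-Language"],     # language tag
        "body": body,
    }

if __name__ == "__main__":
    print(parse_response(SAMPLE))
```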

After the blank line, we're in the midst of the actual hypertext markup, with tags like <HEAD> and <B> flying by from the various HTML tag namespaces defined by W3C over the years. Within a <META> tag, we come across a Platform for Internet Content Selection (PICS) label rating this page. Suppose J.D. Power and Associates defined a ratings schema for airlines: price, meals, timeliness, each on a 1-5 scale. That itself is yet another namespace, of course, but let's dig even further into the PICS label's fine print.

We find that at the end of the label is J.D. Power's digital signature verifying that this rating is indeed accurate. Within that data structure we can make out URIs describing which particular cryptographic algorithms were selected, from a space defined by W3C's DSig 1.0 specification.

Ultimately we arrive at Ground Truth: the actual mathematical facts, including the public key.

Zooming Out

But what do we really have here? Is a prime number an address, just one entry in the infinite possibilities for public keys? Or is it a name, a legally binding entry resolvable to J.D. Power and Associates? All of a sudden, we'll have to zoom all the way out from "bytes on the wire" to "people in organizations and society." Is identity certified through hierarchy or peer-to-peer introductions? Is United's XML tag for <FARE> comparable to Delta's, which happens to include all taxes, too? What about that URI linking to Hertz's rental deal -- will it still be around tomorrow? And what about the names of airports, the syntax of GPS coordinates, or passport numbers?

In the first half of our tour, zooming in on the anatomy of a URL has certainly exposed a plethora of namespaces. In the next issue, we'll try to distill which of these are uniquely Internet-scale challenges: namespaces that scale across time, into the future; across space, into uncountable digital nooks and crannies; and across organizational boundaries, negotiating the meaning of each name in lieu of "universal" enforced standards.

----

For more detail on the namespaces introduced here (and a sneak preview), companion slides for this article are available at http://www.ics.uci.edu/IRUS/twist/twist99/presentations/khare/ -- as well as a dozen other speakers' presentations and detailed minutes from the workshop.