joshua schachter's blog

lessons learned: apache etags

When del.icio.us first started its ever-upward climb in traffic, we pretty quickly went from one to two webservers behind a load-balancing reverse proxy. We saw a huge improvement in the performance of dynamically generated web pages (that is, most of the site). However, there were still weird latencies that were hard to explain: approximately every other reload would cause the browser to refetch the CSS, JavaScript, images and so on.

One of the things that HTTP does to reduce the amount of data transferred is to negotiate whether the document needs to be transferred at all. One such mechanism is the "conditional GET", in which the server sends an identifier for the document in the response headers, for example:

$ HEAD http://memepool.com/
...
ETag: "cc038c-86b7-89ce1880"
...

The ETag field here refers to some state identifier for the returned document. Later, a browser can specify that same identifier via If-None-Match and the server can decide to say that the document has not changed instead of sending the document again.
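To make the exchange concrete, here is a minimal sketch of the decision the server makes. The function name respond and its return shape are made up for illustration; this is not Apache's code, and it ignores details such as weak validators (the W/ prefix).

```python
def respond(current_etag, if_none_match=None):
    """Return (status, send_body) for a GET.

    current_etag  -- the identifier the server currently has for the document
    if_none_match -- the raw If-None-Match header from the browser, if any
    """
    if if_none_match is not None:
        # The header may carry several comma-separated identifiers, or "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if "*" in candidates or current_etag in candidates:
            # 304 Not Modified: the browser keeps using its cached copy.
            return 304, False
    # No identifier, or no match: send the whole document again.
    return 200, True


etag = '"cc038c-86b7-89ce1880"'
print(respond(etag))                      # first visit, nothing cached
print(respond(etag, etag))                # reload, identifier still matches
print(respond(etag, '"something-else"'))  # reload, identifier differs
```

The third call is the interesting one: the document itself may be byte-for-byte identical, but if the identifier differs, the whole thing gets sent again.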

Now, I'd seen ETag headers before, but I'd always assumed they were a function of the contents of the documents; perhaps a hash or checksum or similar. It turns out, though, that Apache actually constructs them from the inode, file size, and last-modified time (easy to get from the directory entry, I suppose). And naturally, while the other two items are set when the files are checked out of the revision control system, the inode is entirely dependent on the local filesystem and the blocks that happened to be available when the file was created. Since multiple web servers were serving the same files on different requests, the inodes did not match, so whenever a request landed on a different server than the one that had issued the cached ETag, the document would have to be fetched from scratch.

In the end, a quick FileETag MTime Size in the Apache configuration file made it all work properly.
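The mismatch, and why dropping the inode fixes it, can be sketched in a few lines. This is only an illustration: the FileInfo tuple, the make_etag helper, and all the numbers are invented, and the hex-joined format is an approximation of what Apache emits rather than its exact algorithm.

```python
from collections import namedtuple

# Hypothetical stand-in for the three stat() fields that go into the ETag.
FileInfo = namedtuple("FileInfo", "inode size mtime")


def make_etag(info, use_inode=True):
    """Build an ETag-like string from file metadata, hex fields joined by '-'."""
    parts = [info.size, info.mtime]
    if use_inode:
        parts.insert(0, info.inode)
    return '"' + "-".join(format(p, "x") for p in parts) + '"'


# The same file on two webservers: identical size and mtime,
# but each local filesystem handed out its own inode number.
web1 = FileInfo(inode=0xCC038C, size=0x86B7, mtime=1100000000)
web2 = FileInfo(inode=0x1A2B3C, size=0x86B7, mtime=1100000000)

# With the inode included, the two servers disagree, so a browser that
# cached the file from web1 gets a full refetch when it hits web2.
print(make_etag(web1), make_etag(web2))

# Without the inode (roughly what FileETag MTime Size asks for), they agree.
print(make_etag(web1, use_inode=False), make_etag(web2, use_inode=False))
```

With two servers behind a load balancer, roughly half of all revalidations land on the "other" machine, which lines up with the every-other-reload refetching described above.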