autoincrement considered harmful
MySQL's auto_increment, and similar features in other databases, are a powerfully useful function but ultimately lead to problems.
The first problem is that you will be tempted to use the internal identifiers in external URLs. I realize that RESTian canon indicates that every single object have its own identifier, and many new and whizzy frameworks generate simple create/lookup/update/delete user interfaces automatically.
URLs that include an identifier will let you down for three reasons.
The first is that given the URL for some object, you can figure out the URLs for objects that were created around it. This exposes the number of objects in your database to possible competitors or other people you might not want having this information (as famously demonstrated by the Allies guessing German tank production levels by looking at the serial numbers.)
Secondly, at some point some jerk will get the idea to write a shell script with a for-loop and try to fetch every single object from your system; this is definitely no fun.
Finally, in the case of users, it allows people to derive some sort of social hierarchy. Witness the frequent hijacking and/or hacking of high-prestige low-digit ICQ ids.
The second problem, in the case of MySQL, setting a column as auto_increment requires that there be a primary key placed on this column. It's not well explained in the documentation, but under InnoDB, the primary key is similar to a unique key, except that the rows in the database are stored in the sort order of the primary key -- this is why there may only be one such key. (Other database systems refer to this as a "clustered index"). This means if you are using it merely as a join identifier, but freqently do large queries based on some other column, the rows have to be fetched from all across the disk since they are not all together. As an example, in early implementations of del.icio.us, fetching all of the bookmarks for a given URL could cause tens of thousands of disk seeks even if there was an index on that column. As a datastore grows, the location of things on disk in relation to each other becomes an important consideration for scaling.
Comments
Interesting recommendations - too bad you don't give any examples of what you consider the Right Way(TM). What would make you happier? Passing around UUIDs?
- Jemaleddin
Jemaleddin - it depends on the application. For delicious, the natural identifier is some function of the user and the URL, for example.
Consider using the MD5 of some unique property to the object, perhaps.
- joshua
I have to admit that I'm guilty of using serial numbers externally. Your arguments for not doing that in a large public system are well-taken.
For those of us not using MySQL, I don't see any reason to avoid them internally. In PostgreSQL, sequences (and the "serial[48]" types which implicitly create them) do not imply "primary key", and "primary key" does not affect how the table is stored. As in Codd's relational algebra (and thus any database but MySQL), it means "unique not null" and is a convention to indicate that you've chosen it over other candidate keys as the primary way of joining to that relation. (There still can be only one.) I don't think PostgreSQL has any real way to duplicate the MySQL behavior [*], which is what Oracle calls index-organized tables. ("create table ( ... ) organization index")
In fact, in non-MySQL databases there may be good reason to prefer serials rather than UUIDs for references. I haven't been in a situation lately to worry about/measure this sort of scaling issue, but it seems like 4 or 8 bytes as opposed to 16 could have a measurable impact on the ability to cache an index in memory.
What do you tend to cluster your tables by?
[*] - There's the "cluster" command, but it doesn't actually store the row data inside the index, and subsequent inserts don't follow it. I've seen log tables partitioned by month through a trick with inheritance; I think that's as close as you can get.
- Scott Lamb