Beyond REST
A publish/subscribe architecture is natural to other problem domains such as instant messaging and financial data systems (Tibco, Reuters, and so on).
Brad Fitzpatrick implemented something similar as a never-ending Atom feed a few years ago for LiveJournal (sans XMPP, which wasn't as conceptually prevalent then).
One important point in the presentation is that, for example, a single application would poll Flickr approximately three million times in a day to fetch only several thousand updates. At Delicious we saw a similar level of polling activity, made somewhat worse by speculative querying (hitting the URL information pages to see if there was any data for arbitrary URLs, which was generally unlikely.)
One solution that occurred to me at the time was to build a simple callback system over HTTP. This would fall comfortably between full polling and full persistent publish/subscribe. The clever acronym even writes itself: PIMP Is Mostly Push, although maybe PRSS (Push RSS) would be slightly more polite.
Simply described, instead of polling frequently, a client would send a normal HTTP request with the resource to be subscribed to and an endpoint to deliver updates to:
http://your.app/subscribe?resource=/some/user&callback=http://my.app/endpoint
Presumably the endpoint would then receive RSS item fragments when and only when that resource updated. For security, the exchange should include some kind of token, borrowing from the appropriate protocols. The subscription would lapse after, say, 24 hours, or that could be passed in as a parameter.
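A minimal sketch of how a subscriber might build that request. Only `resource` and `callback` come from the URL above; the `token` and `ttl` parameter names, and the `build_subscribe_url` helper itself, are assumptions made up for illustration:

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit


def build_subscribe_url(base, resource, callback, ttl_hours=24, token=None):
    """Return (url, token). `token` and `ttl` are hypothetical parameter names."""
    token = token or secrets.token_hex(16)  # random token to pair updates with this request
    query = urlencode({
        "resource": resource,
        "callback": callback,
        "token": token,
        "ttl": ttl_hours * 3600,  # requested lifetime in seconds
    })
    return f"{base}?{query}", token


url, token = build_subscribe_url(
    "http://your.app/subscribe", "/some/user", "http://my.app/endpoint")
print(url)
# Round-trip check: the callback survives URL encoding intact.
print(parse_qs(urlsplit(url).query)["callback"])
```

The publisher would store the token alongside the subscription and echo it back with each update, so the endpoint can discard anything it did not ask for.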
In some ways this is slightly more elegant than the XMPP solution as neither side has to maintain a dedicated long-running process. A simple server-side implementation would just fetch items from a work queue and send out HTTP messages. A simple implementation on the client side would be a plain old web page that could accept and process a POST request. There are a number of people on inexpensive service providers who have at best web scripting hosting and not much else. The case where Delicious/Twitter/Flickr pushes my own items (and not much else) up to my blog is an important one. Additionally, there would not need to be any persistent TCP connections, which is probably more efficient in server resources (but less efficient in network resources; for billions of messages the TCP overhead becomes significant).
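The server side could be sketched roughly like this, assuming an in-memory queue of `(callback_url, body)` pairs. The function names and the content type are illustrative choices, not part of any existing service:

```python
import queue
import urllib.request


def post_update(callback_url, body, timeout=5):
    """Send one update as an HTTP POST and return the response status code."""
    req = urllib.request.Request(
        callback_url, data=body, method="POST",
        headers={"Content-Type": "application/rss+xml"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def drain(work_queue, send=post_update):
    """Deliver every queued (callback_url, body) pair; return how many were sent."""
    sent = 0
    while True:
        try:
            callback_url, body = work_queue.get_nowait()
        except queue.Empty:
            return sent
        send(callback_url, body)
        sent += 1
```

Passing `send` as a parameter keeps the loop testable without a network; a real deployment would also need the failure handling discussed in the comments below.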
Of course, callbacks are totally infeasible for a variety of other uses, especially for mobile or desktop applications (which are likely to be firewalled).
Comments
Issues that you need to address:
1) What do you do when the end point is down (pushee)?
- Presumably you would need to have durable subscriptions of some kind and the Pushee would have to worry about not double processing messages when you redeliver.
2) Is there really a savings over XMPP? Maybe in the tiny pushee case.
- Presumably you would have a server that is receiving messages from various sources. It is cheaper for that service to make one socket connection to a message bus and then sources would connect to the bus to send messages when they have one available.
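The double-processing concern in point 1 might be handled along these lines. This is only a sketch, and it assumes every pushed item carries a stable identifier (for example an RSS guid); the `Pushee` class is hypothetical:

```python
class Pushee:
    """Remembers item IDs so a redelivered message is processed only once."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def receive(self, item_id, payload):
        """Process the payload unless this item_id was already handled."""
        if item_id in self.seen:
            return False  # duplicate caused by a redelivery; ignore it
        self.seen.add(item_id)
        self.processed.append(payload)
        return True
```

A long-lived endpoint would want to persist `seen` and expire old IDs; that bookkeeping is left out here.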
I like your idea in principle, but I think I would rather the callback be an XMPP endpoint. The small fries could use Google's Jabber server which even has server side storage of the messages for when you are offline, the larger ones can implement their own XMPP farms. Everyone wins.
- Sam
I think that adding another component on the client side is unnecessary, as it adds complexity for almost no payoff, at least on the small scale. Similarly, my solution is MUCH easier to implement, which you should never discount. Finally, you now have to have some system which POLLS some XMPP server in the case that you can't run your own. This just moves the problem.
- Joshua
Sam: One of the main problems is that the "small fries" can't even necessarily use Google's Jabber server, because there still needs to be a long-running custom daemon running somewhere as a Jabber client. Many "small fries" offering the network effects on the web have $5/mo commodity PHP hosting to run Wordpress and the like.
This is why Atom/RSS feed polling has won over XMPP push thus far: If it's between hosting a mostly-static file served over HTTP versus hosting XMPP servers and clients - the simple solution wins. The PIMP pattern described here is only just a bit more complex than feed polling, which gives it a lot of potential.
- l.m.orchard
After talking more with Joshua about it, I actually don't mind this as an additional gateway from XMPP to clients. Is there really no server-side XMPP client that works like this already? Where you register an HTTP callback rather than wait on the socket?
- Sam
This is basically the pattern that services like GitHub follow for dealing with post-commit hooks on remotely hosted VCS services. When a change happens to a repository, a message gets POST-ed to a URL of your choice, and you can do with it what you will.
It is ridiculously simple to implement, and you can build all sorts of clever services with it - not the least of which is a simple XMPP bridge.
Long live PIMP.
- Adam Jacob
Jeff Lindsay has been working on this concept for ages. He calls it "webhooks": http://webhooks.pbwiki.com/. It's already in production at GitHub and a handful of other sites. We've considered adopting it at Twitter.
- Alex Payne
Reminds me of http://www.rssping.com/
- Rowan Nairn
I've been raving about this model for a while. I call it web hooks.
- Jeff Lindsay
It's great to be able to confirm that the usage patterns at Delicious are similar to those at Flickr.
There are a couple of things. The ping-back system is something we had thought about putting a slide in for. Clearly, being able to do ping backs / web hooks seems like a good pattern, or maybe an anti-pattern?
It turns out that when you create a large-scale web hooks system, you end up with aggregator / crawler problems. Definitely doable. With XMPP we get that functionality with federation and a nicer interface. We can also potentially add some delegated auth over it.
- rabble
A few years ago I wrote a protocol for HTTP notifications:
http://gonze.com/http-notifications.html
The gist of it is that the only thing in the callback is a notice that there is something to be pulled. The advantage of this approach is that it's the smallest modification to the existing way of doing things.
- Lucas Gonze
What you are describing is WebHooks (briefly mentioned in Rabble and Kellan's presentation, I believe). You can find more about webhooks here: http://webhooks.pbwiki.com/
There is nothing that stops you or some third party from subscribing to the source XMPP feed and translating that to HTTP-based callbacks. In fact, I do believe that those services will show up. It is indeed simpler in some cases.
But for the source site, XMPP is much simpler, because they only have to send one message to an XMPP PubSub node.
Best regards
- Pedro Melo
Your idea sounds like a delegated AtomPub (or more generally HTTP) publication mechanism where the del.icio.us server would post to a user's AtomPub repository (on her behalf with OAuth), doesn't it?
It seems also similar to an HTTP-XMPP Pubsub gateway but without XMPP notifications (see also draft-saintandre-atompub-notify).
@l.m.orchard
I think feed polling has won thus far because there's no XMPP Pubsub client, though (except Gajim SVN and Synapse-IM). But now that there is an Atom over XMPP Pubsub WordPress plugin, it becomes relevant to have a Pubsub Atom aggregator, and hopefully it will happen soon. Actually, current Jabber clients could already be used if the server supported the "pubsub#body_xslt" option, as demonstrated with Twitter messages.
- kael
Joshua - something very much like what you describe actually exists. Take a look at Gnip:
http://www.gnipcentral.com/
- Dan Weir
This idea was actually considered by some folks and was called LLUP (Limited Lifetime Ubiquitous Protocol, pull backwards). The goal was to make it essentially protocol agnostic so it would work over HTTP and XMPP.
Another idea that came up that was very similar was the idea of web triggers. This is much closer to your callback concept with the main difference being that there is not a direct link between the action that occurred and the service requesting a callback. Instead, if you wanted to receive a notification, a separate service handles those requests and the application that is being updated simply sends a message to the trigger service. This is merely a slight difference, but it seems helpful to consider as a trigger server would be very busy very quickly.
In any case, I think that a push technology is becoming necessary and that it must be RESTful in how it functions. Comet is, for example, an interesting idea, but it does not seem to have the same concepts as REST, which makes me wonder how well it fits within the web at large. Personally, I'm very excited to see these kinds of conversations come up since both push and pull have their place.
- Eric Larson
There are a few key advantages that XMPP has over HTTP in this case.
1. PIMP has already been done. It was called Trackback, and broke because of spam. In order for PIMP to work, you'd need to verify the identity of the sender, which is going to make the solution more complicated[1].
2. The failure case of the HTTP solution is bad. Essentially, you have many HTTP requests happening frequently, and each one is taking, say, 5 ms on average (that might be optimistic). If you have 10 processes capable of sending messages, you can send 2000 messages per second. If, for some reason, you run into a situation where the remote host accepts the TCP connection but honeypots you and takes 1s to accept each message, you're all of a sudden down to 10 messages per second. Resolving that problem is hard, at least compared to pushing out messages via HTTP post.
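The arithmetic in point 2 can be checked with a few lines. The figures are the ones quoted above; the function name is only for illustration:

```python
def messages_per_second(processes, ms_per_message):
    """Aggregate send rate when each process handles one message at a time."""
    return processes * 1000 / ms_per_message


healthy = messages_per_second(10, 5)      # 5 ms per request
stalled = messages_per_second(10, 1000)   # remote host takes 1 s per request
print(healthy, stalled, healthy / stalled)
```

A single slow receiver class cuts throughput by a factor of 200 here, which is why the sender needs timeouts and some isolation between receivers.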
One approach that I particularly like is a proxy; if you don't / can't speak XMPP, then use a 3rd party that does, and you can either poll (which is exactly the same badness as now, except for the provider) or be pushed to via HTTP (the aggregator can deal with the hard HTTP problems).
To speak to the server resources, the reality is that the XMPP servers already do smart allocation and de-allocation of TCP sockets. If I post one message per day that goes to romeda.org, then the server-to-server XMPP connection will be short-lived. On the other hand, if I post 100 million messages per day to GTalk, then I'll have at least one persistent XMPP connection, and a significantly reduced overhead.
[1] I mocked up what that might look like here: Messaging-over-HTTP
- Blaine Cook
I actually love this idea. There's obviously a few complications with it not working in a bunch of situations (like say with endpoints behind a firewall), but it does look like an elegant solution. Plus, these days anyone can run a thin client/server to accept the callbacks. Any quick 10-liner dynamic language script can serve as a perfectly capable endpoint.
I love this, and I'll certainly give this some thought.
PS: How's it going, Joshua! Long time no speak.
- Fred Oliveira
I think this thread serves as proof of my point. When I (and presumably most others) commented on this thread, Sam's comment was the most recently visible. The latency (associated with approval of those POST requests) was so high that the possibility of real conversation in the context of your blog post was missed.
With a *simple* WebHooks system, this isn't possible because ad-hoc communication will descend into spam without pulling in a fairly complicated discovery-based OAuth-like approach. XMPP solves this transparently, by giving you verified identity with no extra work.
Now, give all these low-cost PHP-only users (whether or not they care about all this is another question) an evented web server, a way to schedule background threads, and an authentication module for their server that does sender-verification, and then we can talk. Until then, it's just a pipe dream [at scale].
- Blaine Cook
This is mostly what we did for the "chat" portion of Picasa's Hello 5 years ago. Pairs of users present a big streaming document of XML, and groups present a big merge-able document of XML. Multi-master cases are lots more fun than pubsub. :)
The popular thing in the chat-over-HTTP world is to do hanging GETs (mentioning this because it doesn't seem to be getting discussed). This does require some server resources to hold a TCP socket open, but that's not a huge limitation these days. Many firewalls (and IE7) end up timing out a GET after 30-60 seconds or so, so this amounts to 1-minute polling in real life, since HTTP doesn't send extra packets to do keep-alive.
- Michael Herf
I'm not sure why you can't just exchange a secret key on the initial request. You don't need any auth delegation or even identity, just need to know that the responses are properly paired with your initial request. Thus, only a shared secret is required.
Blaine, please get past the strawman arguments. It feels like you're making up problems to solve.
- trackback is not the same thing, since it's just individual messages, not request/responses
- a poorly implemented outgoing message queue will suck, sure. build in a timeout and requeue messages that fail. drop them off the queue after too many failures.
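A rough sketch of that queue policy, assuming a caller-supplied `send(msg)` that returns True on success and enforces its own timeout. The names and the limit of three failures are illustrative:

```python
from collections import deque


def deliver_all(messages, send, max_failures=3):
    """Try each message; requeue on failure, drop after max_failures attempts."""
    pending = deque((msg, 0) for msg in messages)
    delivered, dropped = [], []
    while pending:
        msg, failures = pending.popleft()
        try:
            ok = send(msg)
        except Exception:
            ok = False  # timeouts and connection errors count as failures
        if ok:
            delivered.append(msg)
        elif failures + 1 >= max_failures:
            dropped.append(msg)  # too many failures; give up on this one
        else:
            pending.append((msg, failures + 1))  # back of the queue, retry later
    return delivered, dropped
```

Failed messages go to the back of the queue, so one slow receiver delays its own retries rather than everything behind it. A production version would add a delay between retries.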
I think that XMPP is an ideal solution for the firehose between large sites. For everything below that, mine is much better, down to and including the low-end user subscribing to a single stream.
Adding XMPP to the mix radically increases the complexity of any given solution and ends up being simply impossible at the bottom of the range, which is the vast majority of implementations.
- Joshua Schachter
Blaine: a shared secret is enough to circumvent the spam issue. It sounds like you are either inventing problems or not reading carefully.
Comments do not instantly post here because Joshua filters against noise using an advanced neural network which presumably isn't always at the keyboard. It appears not to have worked well this time, but if anything that argues for an increase in moderation time.
"Evented" is not a word.
- Maciej Ceglowski
This is starting to sound a lot like work we were doing at the evil empire in 1998-1999 as part of the very early work on what eventually became the IETF IMPP[1] working group. Building on HTTP callbacks and subscription registration makes tons of sense when the volume doesn't merit a persistent connection.
I think the most crystallized versions of the protocol discussions ended up at:
http://tools.ietf.org/html/draft-cohen-gena-p-base-01
http://tools.ietf.org/html/draft-cohen-gena-client-00
These were based on some earlier work the team had done:
http://tools.ietf.org/html/draft-dusseault-rvp-schema-00
http://tools.ietf.org/html/draft-dusseault-rvp-addr-00
Don't think for a moment that I'd actually propose bolting pubsub into HTTP at the protocol level in this day and age, but there was a whole bunch of thought and discussion about the ramifications of various models and what sorts of issues users and implementers would face. It might be worth poking around the early archives of the IMPP working group to see if there's anything actually useful there.
-Jesse
[1] Instant Messaging and Presence Protocol - the forerunner to XMPP and SIMPLE. A number of us lobbied for calling it the 'Presence and Instant Messaging Protocol' working group, but that got shot down by someone with no sense of humor. We'd even worked out a full network architecture around servers (they could 'hook' connections) and pseudonymous clients ("When talking to a server, Alice and Bob can both safely call themselves, say John"). I won't even get into the discussions of server clustering, service payment processing or connection throttling.
- Jesse Vincent
Maciej:
You're right, "evented" isn't a word, it's a neologism that's commonly used in the Ruby community. If you're not familiar with the term, I'd be happy to explain over coffee.
Joshua:
A shared secret would work, though without reasonably careful semantics a plain-text secret becomes very susceptible to MiTM attacks. Even in the simplest case, it does make things slightly more complicated, though, which means that the solution you're presenting is hiding some of the complexity.
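One example of "reasonably careful semantics", offered as an assumption rather than anything proposed in this thread: instead of sending the secret with every callback, the sender could sign each message body with an HMAC keyed by the secret. The `sign` and `verify` helpers are hypothetical:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the body, keyed by the shared secret."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time check that the signature matches this body and secret."""
    return hmac.compare_digest(sign(secret, body), signature)
```

This keeps the secret off the wire after the initial exchange and detects tampering with the body, but it does nothing for the initial request itself, which would still need something like HTTPS.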
The reason I bring up identity (and complicate the token exchange) is that I'd like to be able to have the semantic where I can tell my grandmother: "Add me (blaine@flickr.com) to your Facebook photos and you'll see my photos." Without some sense of identity, I can't verify that my grandmother (rcook@facebook.com) is the one asking for permission to create the subscription. If all you're trying to do is PushRSS, then identity doesn't matter. For me, the fact that private data over RSS has been a non-starter for a decade is a major failure of RSS, and a problem I'd like to see fixed.
If we're arguing that this system needs to be usable by 10-line PHP scripts, then a poorly implemented outgoing message queue is par for the course. At a moderate scale (10000 users, 50 contacts each, 1/5 off-site, 2 posts per day), you're looking at 2.3 remote HTTP requests per second, which isn't nothing.
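For reference, that estimate works out as follows, using only the figures above and assuming posts are spread evenly over the day:

```python
users = 10_000
contacts_per_user = 50
offsite_fraction = 1 / 5
posts_per_user_per_day = 2

# Each post fans out to every off-site contact of its author.
requests_per_day = (users * posts_per_user_per_day
                    * contacts_per_user * offsite_fraction)
requests_per_second = requests_per_day / 86_400  # seconds in a day
print(round(requests_per_day), round(requests_per_second, 1))
```

Real traffic is bursty, so peak rates would be higher than this daily average.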
To say that XMPP is radically more difficult to implement is itself a straw man. It's different, sure. There's a lot to be said for using existing technologies, and I think the XMPP community has a long way to go to present documentation and tools to make this stuff straightforward. But please take a look at my Jabber::Simple library and others. Developing CGI applications was relatively hard for a number of years, and then it got easy because it became important. Plenty of people have built applications that integrate email functionality, and that's not far off from building an XMPP service, so don't discount it out of hand.
That all said, in my first post I linked to a post that I'd made over a month and a half ago that describes exactly your idea. I think HTTP Push has real value here, I just don't think that it will be the primary mechanism, and it's somewhat misleading to suggest that it's trivial to implement.
- Blaine Cook
Ok, so you replace the pushed data with the simple update that the RSS feed for the resource itself has changed, without pushing the update itself. Then we don't have to care about a shared key or authenticity.
I don't think it's trivial to implement; nothing is at scale. Just, you know, easier than using XMPP itself. And radically more accessible to small developers in the way that anything XMPP is not.
Note again that these are different technologies covering different parts of the spectrum.
My suggestion that XMPP is complicated is an opinion, and I think a relatively well-founded one. That's different from a strawman argument, which is when someone equates an argument with a weaker one, and attacks that instead.
- Joshua Schachter
Now we are getting to the heart of what we really want, I think. Something very simple that solves this continual polling problem in a sufficiently simple way that services can implement it and simple web applications can use it. My suggestion is to think of this instead as another form of caching. All we really want is a header that tells the server that we are interested when a particular resource has been updated and how to tell us. The server can then either understand that header and acknowledge in the response that it will notify me, or ignore it. Here is my strawman:
Request:
...
X-Cache-Callback: http://www.javarants.com/notify/joshua.schachter.org/atom.xml;SECRET
Response:
...
X-Cache-Callback: OK
Then if that resource is updated the service is expected to either HEAD the callback as a notification or POST the new contents of the resource, server's choice. You could later add semantics for merely updating the resource vs replacing it wholesale. I would also think about adding the ability for the server to specify a timeout after which you are free to poll again if you haven't heard anything, on the assumption that sometimes the service may lose the state associated with your subscription. Obviously there is still the possibility of MITM attacks, but I'm not that concerned about it as they could be detected through sampling the original resource. You could always use https with certs if you wanted to be sure.
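A sketch of both halves of this strawman. The `X-Cache-Callback` header and the `URL;SECRET` value format are from the example above; the function names are made up, and how the service carries the secret back in its notification is left open here:

```python
import urllib.request


def subscribe_request(resource_url, callback_url, secret):
    """Build (but do not send) a GET that also asks to be notified of changes."""
    req = urllib.request.Request(resource_url)
    req.add_header("X-Cache-Callback", f"{callback_url};{secret}")
    return req


def subscription_accepted(response_headers):
    """True if the server acknowledged the callback; otherwise keep polling."""
    return response_headers.get("X-Cache-Callback", "").strip().upper() == "OK"


def build_notification(callback_url, new_body=None):
    """Service side: HEAD as a bare notice, or POST carrying the new contents."""
    if new_body is None:
        return urllib.request.Request(callback_url, method="HEAD")
    return urllib.request.Request(callback_url, data=new_body, method="POST")
```

A server that does not understand the header simply never returns it, so the client falls back to ordinary polling without any special casing.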
- Sam Pullara
Great article. I stumbled onto webhooks by actually needing to create a webhook service to solve a problem I was working on. I tried to research the issue but I had no idea what this model/pattern was called until I found this article!
http://neude.net/2008/07/distributed-observer-pattern/
- Vyrotek.com
Joshua, you might remember that this model of posting updates over XMPP is precisely what Pubsub.com implemented many years ago. There are many aspects of XEP-0060 and "Atom over XMPP" that came from the experience we gained distributing blog posts and other kinds of data over XMPP. I think we were a bit "ahead of our time" back then, but it looks like people are finally realizing the problem with polling and realizing that push models can result in massive reductions in both the latency of message propagation as well as resource consumption.
The HTTP hanging get stuff also has quite a long history. KnowNow.com was created by Adam Rifkin and Rohit Khare back at the beginning of the decade to exploit that model. They built the mod-pubsub extension for Apache and a company around it all. Unfortunately, they too have gone bust -- too early and before their time as well.
Let's hope that the recent excitement in the Pubsub/Push model is more successful this time around.
- Bob Wyman
Another real-world example...
blo.gs used this sort of system to distribute pings, up until its acquisition. It used XML-RPC over HTTP, and subscriptions lasted for 25 hours.
For some time, until Jim implemented an XML stream for the pings, I had the topicexchange.com server receive such HTTP pings (re-subscribing every 24 hours) and re-publish them as a plaintext stream (it would send you "UPDATE http://url/...\r\n" for each update) that you could listen to by connecting to topicexchange.com:9123.
At some point I noticed that the inbound HTTP traffic to my server for all the pings was using about a megabit or so of bandwidth. But until then it was quite manageable.
To avoid the issues Blaine raises above, on the sending end you'd want to make sure you weren't using up too much queue space or spending too much time trying to deliver updates to slow receivers, but that doesn't sound like a big deal.
- Phillip Pearson
I barely remember why I was looking at this but Microsoft implemented some extensions to HTTP in some of their HTTP servers - at least in the one in Exchange. They added SUBSCRIBE and UNSUBSCRIBE methods that create and destroy subscriptions to state changes on resources. These state changes can be polled for using the well named POLL method, or at SUBSCRIBE time a Call-Back header can be passed containing a URL whose NOTIFY method will be called.
This is all kind of sane. It extends the traditional HTTP/WebDAV model to include the kind of resource-change notification that has been available in the Windows filesystem APIs for a long time and that we're now starting to take for granted on Linux and MacOS these days. The crazy part is that callback URLs must be httpu:// urls. What is httpu, you ask? HTTP over UDP. *facepalm*
- Ian McKellar
Oh my god, you've reinvented email :-) The challenge I see is dialups; suppose a dialup client is polling your service every 15 minutes for three hours or so and you update the page twice. With normal feeds, you would have 12 queries that can easily be cached and retrieved using an HTTP HEAD request. Now you have, in the best case, one full query and one subscribe query plus two updates, none of which can be cached. Even worse: if my client reconnects, I won't receive any updates, so I need an additional heartbeat. This heartbeat alone will annihilate the performance gains.
- Benjamin Schweizer
Benjamin: I'm not talking about end-clients. I'm talking about various services pushing updates rather than polling. As I said, XMPP probably serves that situation better.
Similarly, the hanging get queries are also not the right solution. We're talking about reducing polling of systems that only update occasionally.
- Joshua Schachter
if you're reinventing email, you should reinvent uucp with it too (rather than smtp); and then you can reinvent usenet.
i was trying to figure out if usenet-style flooding strategies made sense rather than polling or pushing, but it's too late at night to do anything other than type those words in and see if it makes sense. certainly it fits some sort of model of batched updates, but it may introduce too many other complicating factors to be relevant.
- Edward Vielmetti
Joshua I really like this idea; it keeps the endpoints nice and simple REST (leaving an XMPP fire hose as an option too for the heavy hitters).
Sam - I really like your strawman. I've long wanted a 'SUBSCRIBE' verb in HTTP for doing this kinda thing; but I think your cache-header approach is cleaner - as folks can either keep polling and/or subscribe for the update notification. I love the simplicity of the HEAD or POST to differentiate a notification of change to a notification-with-the-data. There should be a X-Cache-Timeout so that the server can know when to timeout the subscription. Maybe rather than returning OK the server returns the amount of time before the client has to re-issue the subscription to keep it alive? So the server can decide the maximum subscription time. Then most PIMP clients can do relatively long polls every few days or something, to keep the subscription alive and to still get updates if the PIMP mechanism is borked.
I'm with you in the thinking of this as another form of caching. In implementing PIMP some folks might be able to create update notifications internally in their system when resources change to push out change messages into some kinda queue for posting to the callback URL. This would involve significant work for many sites though.
However it'll surely be pretty trivial to just install a PIMP-enabled caching web proxy inside your data centre in front of your servers - that does the usual cache thing, but also detects these extra cache headers and does a background poll of resources to detect changes both to update the cluster of front web caches (so non-PIMP pollers get more real time data) but also to drive the pushing of updates out to PIMP subscribers.
i.e. I can see this as a pretty easy upgrade to most web sites - folks just update front end web proxies to enable PIMP and hey presto you now support PIMP consumers. Am sure the web proxies could include an XMPP firehose too pretty easily for heavy hitters.
It also should be pretty easy to hack the web proxies to do this, I'd have thought? Even the problem that's been noted earlier in this thread, that pushing updates to a URL endpoint might be slow, unresponsive or unavailable, is one the web proxies already have to deal with in case a *local* server is borked.
Great blog post and comments :)
- James Strachan
I like Sam's strawman as well.
There are some things to like about proper verbs like the SUBSCRIBE/NOTIFY approach, but ultimately I think treating it as a caching concern is cleaner. Our motivation isn't semantic here, it's performance. Adding verbs for performance concerns that could credibly be solved with cache control headers seems contrary to the ideas behind REST and HTTP.
I like the PIMP proxy idea as well. Such a setup could be an interesting service integration layer. Services just publish feeds and poll feeds: very simple model to program. The proxy layer caches the dependency information, and assuming you have some coherence in which proxy server services hit for a particular resource (ie url hashing layer 7 load balancers) it'll scale out the dependency information between entities.
- Jason Watkins
I'd like to talk with you about this sometime in person! You should come to a SuperHappyDevHouse. :)
- Jeff Lindsay
As far as web hooks style push notifications go -- GitHub has a nice example for commits notification, in which they do a simple HTTP POST call with a payload of JSON data.
http://github.com/guides/post-receive-hooks
http://webhooks.pbwiki.com/
- Brendan O'Connor