Monday, February 3, 2014

A proposal for Actions

... in which we are combining the discovery of entities with the affordance of a broad set of actions.

I'm really excited to see things like web/android intents/activities/appurl popping up! It is certainly a cool new paradigm that could enable a lot of different interactions between decoupled applications.

I've been working on a related idea that is still in formation, but well baked enough to be worth sharing.

This is just my personal recollection of the history and of the challenges we have faced while designing this protocol specification (sort of the backstage of the protocol design). It is a collaborative process between Google, Microsoft, Yahoo and Yandex, and you can take part by participating here.

(edit: added more related efforts as I learned about them, corrected a few obvious mistakes)

The problem

The basic problem we were facing was very much like the one that web intents set out to solve: de-coupling service providers from service requestors by providing an intent brokering platform.

We wanted to enable products like this and this.

As we looked into specific use cases, a few things became clear:

- We needed to deal with a wide variety of platforms (Web, POP/SMTP, APIs, Android, iOS, Windows, Feeds, etc)
- We needed a common way to invoke these abilities.
- Declaring a service's abilities via a registry of (verb, data type) wasn't going to be sufficient. You had to be more specific.

The first wasn't that huge of a problem, but needed to be dealt with. The second is tough, but tractable. The third, however, is quite a challenge and we call it "The Inventory Problem".

The Inventory Problem

The inventory problem refers to the fact that it is not sufficient for a service to describe its ability to "act" (verb) on "types" (nouns). You actually need to go one level of granularity further down and enumerate the individual instances your service "acts" on.

Take the existing intent model as an example:


That certainly works well for verbs like "share" that apply to any image/*, but does it work for verbs like "watch"?

For example, is it sufficient to say that "Netflix can stream movies"? Not really. There are very specific instances of movies that Netflix can play, aka their inventory (e.g. the latest movies still in theatres *cannot* be watched on Netflix).

So, one way or another, services need to declare more specifically what resources they can act on.
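To make the difference concrete, here is a minimal Python sketch that contrasts a coarse (verb, type) registry with a per-resource declaration. The data structures, function names and resource identifiers are all made up for illustration; nothing here is part of the proposal itself.

```python
# Coarse registry: a service only declares (verb, type) pairs.
coarse_registry = {("watch", "Movie"): ["netflix"]}

# Per-resource declaration: each individual resource lists the verbs it supports.
# The resource identifiers below are hypothetical.
inventory = {
    "netflix": {"movie:pursuit-of-happyness": {"watch"}},
}


def coarse_can(service, verb, type_):
    """Answer from the (verb, type) registry: true for *any* instance of the type."""
    return service in coarse_registry.get((verb, type_), [])


def precise_can(service, verb, resource_id):
    """Answer from the inventory: true only for resources the service declared."""
    return verb in inventory.get(service, {}).get(resource_id, set())


# The coarse registry claims every movie is watchable ...
print(coarse_can("netflix", "watch", "Movie"))
# ... while the inventory can tell a catalog title from one that is not offered.
print(precise_can("netflix", "watch", "movie:pursuit-of-happyness"))
print(precise_can("netflix", "watch", "movie:still-in-theatres"))
```

The coarse lookup cannot say "no" for a movie that is missing from the catalog, which is exactly the gap described above.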

This problem comes up in a variety of different use cases.

Use Cases

We've explored a number of use cases that we wanted to support. Here are a few key ones:
  • Restaurants that allow reservations and orders (e.g. food delivery or for pickup)
  • Movies that can be watched, songs that can be listened to
  • Hotels that can book rooms
  • Taxis that can be reserved
  • Airlines that can find flights
  • Flights that can be reserved or checked-in
  • Cars that can be rented
  • Local Businesses providing appointments
  • Organizations that allow you to search for Stores
  • Things that can be reviewed
  • Package deliveries that can be tracked
  • Events that can be RSVPed
  • Products/Movies that can be reviewed
  • Expense approvals that can be confirmed
  • Offers that can be saved
Here is a presentation I made that goes over modelling them.

All of these have in some shape or form the "Inventory Problem".

For example, OpenTable/Urbanspoon/GrubHub can't reserve *any* arbitrary restaurant; they represent specific ones. Netflix/Amazon/iTunes can't stream *any* arbitrary movie; there is a specific set of movies available. Taxis have their own coverage/service area. Other airlines can't check you in to UA flights. UPS can't track USPS packages, etc.

That basic premise led us to take a different approach: to annotate individual resources with the operations that are available, rather than annotate services with their general abilities.

Verbs ... they are kind of weird

We first asked ourselves: how do we model verbs? That rat-holed us into a really long discussion around things like:
  • Do verbs have arguments?
  • How do we deal with synonyms, antonyms and reciprocals?
  • Do verbs follow a hierarchy like nouns?
I went over these in more detail here.

With a hierarchy of verbs, we started to look into how they would connect with resources.
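As a rough illustration of what a verb hierarchy buys you, here is a small Python sketch. The parent links below are assumptions chosen for the example (only WatchAction appears in this post), and the helper names are hypothetical.

```python
# Hypothetical child -> parent links between verbs.
VERB_PARENT = {
    "WatchAction": "ConsumeAction",
    "ListenAction": "ConsumeAction",
    "ConsumeAction": "Action",
    "ReviewAction": "Action",
}


def ancestors(verb):
    """Return the verb followed by its ancestors, up to the root."""
    chain = [verb]
    while verb in VERB_PARENT:
        verb = VERB_PARENT[verb]
        chain.append(verb)
    return chain


def satisfies(offered, requested):
    """True if the offered verb is the requested verb or a specialization of it."""
    return requested in ancestors(offered)


print(ancestors("WatchAction"))
print(satisfies("WatchAction", "ConsumeAction"))
print(satisfies("ReviewAction", "ConsumeAction"))
```

With something like this, a request for a broad verb can be matched against resources that only declare a more specific one, the same way a query for a broad noun type matches its subtypes.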

Resources and actions

Thanks to the good work of the semantic web folks, finding and exposing resources is quite simple.

Take a movie on netflix, for example, this is what it looks like:

Roughly, with markup added to that resource, this is represented as a graph:

<script type="application/ld+json">
{
  "@context": "",
  "@type": "Movie",
  "@id": "",
  "name": "The Pursuit of Happyness"
}
</script>

Now, there are plenty of actions that you can take on a movie: you can do things like watching, buying, renting and reviewing it.

Netflix allows you to watch movies, so let's add nodes to this graph to express that:

<script type="application/ld+json">
{
  "@context": "",
  "@type": "Movie",
  "@id": "",
  "name": "The Pursuit of Happyness",
  "operation": {
    "@type": "WatchAction"
  }
}
</script>

Via the "operation" property, you can attach an operation that can be performed on this resource. In this case, the fact that you can watch it (with well defined semantics).

Taking a step further, if you wanted to say that your application can handle this resource on the web as well as on mobile, you'd have something like this:

<script type="application/ld+json">
{
  "@context": "",
  "@type": "Movie",
  "@id": "",
  "name": "The Pursuit of Happyness",
  "url": "android-app://com/netflix/movies/70044605",
  "operation": {
    "@type": "WatchAction"
  }
}
</script>

That gives a movie streamer the language to express:
  • The individual movies in their catalog/inventory (resources)
  • What can be done with each individual movie (actions)
  • How to invoke the action (handlers)
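To show how a consumer might read those three pieces back out of the markup, here is a minimal Python sketch. The field names mirror the snippet above; the describe() helper is hypothetical, and the "@context" and "@id" values are left empty exactly as they appear in this post.

```python
import json

# Same shape as the markup above; empty values are kept as-is.
movie = {
    "@context": "",
    "@type": "Movie",
    "@id": "",
    "name": "The Pursuit of Happyness",
    "url": "android-app://com/netflix/movies/70044605",
    "operation": {"@type": "WatchAction"},
}


def describe(resource):
    """Pull out the resource, the action attached to it and the handler URL."""
    return {
        "resource": resource["name"],
        "action": resource["operation"]["@type"],
        "handler": resource["url"],
    }


# Round-trip through JSON, as a crawler reading the script tag would.
markup = json.dumps(movie, indent=2)
summary = describe(json.loads(markup))
print(summary)
```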

Brokers, Requestors and Providers

Netflix exposes these resources as well as these operations via a variety of transport mechanisms (e.g. markup on webpages, feeds, POP/SMTP messages, etc). We call these entities the providers. 

Crawlers/browsers/registries discover these resources by following links and index these abilities, building a global registry. We call these entities the brokers.

When a specific application needs to solve a specific problem (e.g. watching movie X), it queries the brokers. We call these entities the requestors.
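The three roles can be sketched in a few lines of Python. This is a toy in-memory model under loud assumptions: the Broker class, its index() and query() methods, and keying the registry by resource name are all invented for illustration, and a real broker would crawl markup and key on resource identifiers instead.

```python
from collections import defaultdict


class Broker:
    """Toy registry that maps (action type, resource name) to handler URLs."""

    def __init__(self):
        self._index = defaultdict(list)

    def index(self, resource):
        """Called for each annotated resource a provider exposes."""
        key = (resource["operation"]["@type"], resource["name"])
        self._index[key].append(resource["url"])

    def query(self, action, name):
        """Called by a requestor: which handlers can perform this action?"""
        return list(self._index.get((action, name), []))


# Provider side: the annotated movie from the markup above.
broker = Broker()
broker.index({
    "@type": "Movie",
    "name": "The Pursuit of Happyness",
    "url": "android-app://com/netflix/movies/70044605",
    "operation": {"@type": "WatchAction"},
})

# Requestor side: ask the broker how to watch a given movie.
handlers = broker.query("WatchAction", "The Pursuit of Happyness")
missing = broker.query("WatchAction", "Some Movie Not In Any Catalog")
print(handlers, missing)
```

Because the registry is built from individual resources, a query for a movie nobody has annotated simply comes back empty.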


Think of the actions as the things that you can do with a resource. So, on top of things like GET, POST, PUT and DELETE, you'd now have things like Watch, Listen, Buy, Order and Review to describe what they do.

The same mental model of REST applies though: you have a resource, and you apply operations on that resource.

For instance:
As a parallel to REST collections, you'd have similar operations like:

Next Steps

There are plenty of challenges ahead of us. Here are a few things I am actively working on:
  • More and more implementation
  • Adding more action handlers and understanding how invoking these operations on multiple platforms should work
  • Standardizing/documenting more of the interactions and use cases we expect to see exposed on the web
  • A communication protocol between Requestors and Brokers, so these can be further de-coupled. Currently, the spec only covers the protocol between Brokers and Providers.

Related Efforts

Here are some efforts that are related, but not quite the same. I'd love to hear about other related efforts and learn from their experience, so feel free to drop me a line if I'm forgetting something.