*Why* Schema?

Posted by david, Fri Feb 22 23:07:00 UTC 2008

A friend and former colleague asked me (about three weeks ago; I’ve been busy!) about the bit of fuss I had been making about a schema for JSON. Specifically, he asked what schemas are useful for. I admit, it’s an interesting question, and perhaps it even shows the difference between my background, which has been 90% compiled languages like Java/C#, and his experience of probably around 80% scripting/late-binding languages like Perl and Ruby.

So what is a schema? Well, the dictionary entry from the New Oxford is a good start:

schema |ˈskēmə|
noun ( pl. -mata |-mətə| or -mas ) technical
a representation of a plan or theory in the form of an outline or model
...

So, a schema represents the desired form of the result; in this case, the format and structure of JSON data.

But why is this useful?

A schema allows you to know what to expect

The first major usage I’ll state is probably the most important. A standard schema document has an advantage over any other publication of a document’s structure simply because it’s a standard document. Even ignoring that it defines a consistent way to describe what a document should look like, without any schema you are limited to giving the other person nothing but examples of things that should work [1].

Examples have their own purpose. They are great for unit tests, for example, or for showing an author what a real document’s structure should look like. In effect, examples leave the reader to build their own schema in their head.

A schema allows you to use tools to help you

My reason for working on schema right now is that my JSON4Java project is stalled. For 1.0, I wanted support for JSON bindings: being able to feed in a plain Java object (a POJO, as it were) and get out JSON text. That turned out to be pretty simple, especially in comparison to feeding in a JSON text document and getting Java out. For that direction, I have to know more than what a class defines for me in order to handle all the corner cases, and the best way of doing that is to create tools that work based on a schema.

The second big reason for schemas is automatic validation. Based on a schema definition, a piece of software can accept or reject a document it is given. This is a great way to remove all the ugly document validation logic you had to add to your code by hand. Or rather, it lets a tool handle all the corner cases you forgot. You can write much more robust code, more easily, with this sort of tool.
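To make the validation idea concrete, here is a minimal sketch in Python. It is not JSON4Java code, and the schema format (a dictionary mapping field names to required types) is invented purely for illustration, as are the names POST_SCHEMA and validate. Rejecting unknown fields is likewise just one possible policy.

```python
import json

# Hypothetical, deliberately tiny schema format: field name -> required type.
# A "post" is an object with a url, a date, and a title, all JSON strings.
POST_SCHEMA = {"url": str, "date": str, "title": str}

def validate(document, schema):
    """Return a list of problems; an empty list means the document is accepted."""
    if not isinstance(document, dict):
        return ["expected a JSON object"]
    problems = []
    # Every field the schema names must be present with the right type.
    for field, expected in schema.items():
        if field not in document:
            problems.append(f"missing field: {field}")
        elif not isinstance(document[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    # Assumption for this sketch: fields the schema does not name are rejected.
    for field in document:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems

good = json.loads('{"url": "http://example.com/why-schema", "date": "2008-02-22", "title": "Why Schema?"}')
bad = json.loads('{"url": 42, "title": "Why Schema?"}')

print(validate(good, POST_SCHEMA))  # []
print(validate(bad, POST_SCHEMA))   # ['wrong type for url: expected str', 'missing field: date']
```

The point is that the checks live in one generic function driven by the schema, rather than being scattered through the code that consumes the document.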

JSON, for example, doesn’t support a native “date” format, while a schema could define one. A tool could thus know to produce date objects in your language of choice while reading in a JSON document. There is simply no good way to tell the difference between “2008-02-21” as a date and as a string value without knowledge of the document form.
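A small sketch of that date case, again in Python and again with a made-up schema format (field name mapped to a declared type name). The names POST_SCHEMA and load_with_schema are hypothetical, and treating dates as ISO 8601 "YYYY-MM-DD" strings is an assumption of the sketch, not something JSON defines.

```python
import json
from datetime import date

# Hypothetical schema: field name -> declared type name. "date" is not a JSON
# type; it is a type this illustrative schema layers on top of JSON strings.
POST_SCHEMA = {"title": "string", "published": "date"}

def load_with_schema(text, schema):
    """Parse JSON text, converting fields the schema declares as dates."""
    raw = json.loads(text)
    result = {}
    for field, value in raw.items():
        if schema.get(field) == "date":
            # Assumes the ISO "YYYY-MM-DD" form; raises ValueError otherwise.
            result[field] = date.fromisoformat(value)
        else:
            result[field] = value
    return result

# Both fields carry the same characters; only the schema tells them apart.
text = '{"title": "2008-02-21", "published": "2008-02-21"}'
post = load_with_schema(text, POST_SCHEMA)

print(repr(post["title"]))      # '2008-02-21'
print(repr(post["published"]))  # datetime.date(2008, 2, 21)
```

Without the schema, the parser has no basis for treating the two identical strings differently.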

Schemas help to establish Meta-schemas [2]

I believe, and have seen evidence, that publishing, sharing, and reusing document forms will allow an ad-hoc set of standard best practices to emerge.

This isn’t to say such a thing can’t happen without schema. However, it’s much more likely to happen once schema is in the picture. The date format above is much more likely to become the common format for representing dates in JSON once there is an easy way to choose it. Otherwise, you will have people who insist on using two-digit years, removing the dashes for terseness, or even counting ‘the number of days since Jan 1st, 1970’ and using an integer value. Or representing times in the document, but throwing the time-of-day portion away.
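To see how quickly those home-grown formats diverge, this Python sketch writes one calendar date in each of the styles just listed. The function name encodings and the dictionary keys are made up for the illustration.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def encodings(d):
    """Return the same calendar date in four schema-less encodings."""
    return {
        "iso": d.isoformat(),                       # dashes, four-digit year
        "two_digit_year": d.strftime("%y-%m-%d"),   # two-digit year
        "no_dashes": d.strftime("%Y%m%d"),          # dashes removed for terseness
        "days_since_epoch": (d - EPOCH).days,       # integer day count from 1970-01-01
    }

for name, value in encodings(date(2008, 2, 21)).items():
    print(f"{name}: {value!r}")
# iso: '2008-02-21'
# two_digit_year: '08-02-21'
# no_dashes: '20080221'
# days_since_epoch: 13930
```

A reader handed "08-02-21" or 13930 with no schema has to guess what it means, and every consumer has to write its own parser for it.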

Reinvention takes time too. It takes time both from the person who is doing the inventing and from the others who are attempting to work with it. Schema can allow for reuse, and for tools which can evolve to know how to handle data for you.

[1] And when I say schema, I am including ad-hoc sentences like “and the post tag is an object which contains a url, a date, and a title”. These may not be formal definitions and thus may be easier to write. However, their informality may leave significant gaps for anyone trying to implement them.

[2] Ok, so I kinda invented that word, or at least a new use for that word. I am not just doing so to have a search that will bring up my blog as the first result.


Comments

  1. Jeremie Miller 02.27.08 / 23PM

    I dislike the word schema as it’s so closely tied to xml and the abomination that schemas there have become, but I really love how you outlined the core principle behind the concept, it definitely has value.

    I suppose the issue becomes when the schema form itself is more complicated than what it’s describing… I find the informal example you gave as one of the simplest and the kind I most often enjoy using :)

    Data and protocol formats need to be recognized as linguistic elements for developers to communicate first, before they work for software to communicate.

  2. David Waite 02.28.08 / 00AM

    XML Schema had several strikes against it:

    1. People have never really figured out if XML is supposed to be describing data, objects, or structured text. XML Schema had to make all three groups equally dissatisfied. For instance, XML Schema lets you both restrict a type and extend it with lax rules and new data. That makes life miserable both for the people trying to map XML to a database and for the people trying to map XML to an object-oriented language.

    2. XML has namespaces. Namespaces have always felt like a last minute bandage that left a really ugly scar. And a lot of the reason for that, from my perspective, was that namespaces were really written with only representation from the ‘structured text’ group – who thought all the ambiguities would be solved by the people describing the document and writing specific tools to parse the document type.

    3. XML cannot really type data itself. Data structure concepts like maps and lists and sets are all very useful, but there is no accepted best practice for representing them in XML. So, to use XML schema for data you had to not just figure out your needs, but define your own new way to describe and format them.

    4. XML Schema was created by tool vendors. I really don’t know how much real-world input they had from users until after it was created. XML Schema 1.1 will solve a lot of the really painful problems – but in the end it would have been better to start small and gain user (xml schema writer) feedback.

    In short, I think I’d have a really hard time making something as horrible as XML Schema by accident; I would have to decide to do it on purpose. And I certainly intend to start simple and small, and wait for real problems to occur before fixing them.

    At least for the moment, I’m calling my work “JSON Blueprints” to distinguish from other schema projects for JSON. It is slowly getting fleshed out on the json4java googlecode page.