Data vs APIs

Volume 19, Issue 14; 30 May 2016

If you can’t have the data, an API is nice. A better API would be better, and sometimes the data would be nice(r).

Why not use the API instead? It has everything.

Robin Berjon

What happened was, for another posting I’m in the middle of writing, I wanted to know “how many W3C specifications have I edited?” There’s really no way to answer that question precisely, but as an approximation, an answer to this question would suffice: “on how many W3C specifications am I credited as an editor?”

The W3C publishes (in RDF) the data that drives their technical reports page. Take your favorite triple store (mine is MarkLogic, hence the XQuery) and run this query:

xquery version "1.0-ml";

import module namespace sem = "http://marklogic.com/semantics"
    at "/MarkLogic/semantics.xqy";

declare default function namespace "http://www.w3.org/2005/xpath-functions";

declare option xdmp:mapping "false";

let $rdfxml  := xdmp:document-get("http://www.w3.org/2002/01/tr-automation/tr.rdf")
let $triples := sem:rdf-parse($rdfxml/*)
let $_       := cts:uris() ! xdmp:document-delete(.) (: Danger, Will Robinson: this empties the database! :)
return
  sem:rdf-insert($triples)
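
If you’d rather not stand up a database at all, the load step also fits in a few lines of Python with rdflib. (That’s a stand-in I’m assuming for illustration; it’s not what I actually ran.)

from rdflib import Graph

# Pull the published RDF/XML straight off the wire into an
# in-memory graph.
g = Graph()
g.parse("http://www.w3.org/2002/01/tr-automation/tr.rdf", format="xml")
print(len(g), "triples")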

Followed, either way, by this one:

PREFIX rec54: <http://www.w3.org/2001/02pd/rec54#>
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>

SELECT ?doc ?type
WHERE
{
  ?doc a ?type .
  ?doc rec54:editor ?ed .
  ?ed contact:fullName "Norman Walsh"
}
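
Continuing the rdflib stand-in, running it looks like this. (I’ve made the SELECT a DISTINCT on ?doc so that a document with more than one type isn’t counted twice.)

editor_query = """
PREFIX rec54: <http://www.w3.org/2001/02pd/rec54#>
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>

SELECT DISTINCT ?doc
WHERE
{
  ?doc a ?type .
  ?doc rec54:editor ?ed .
  ?ed contact:fullName "Norman Walsh"
}
"""

# Prints the number of unique specs (31, as of this posting).
print(len(list(g.query(editor_query))))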

And the answer is 31. Except that’s not really the answer. That’s just the number of unique specifications, the current set as of today. Some number of Working Drafts preceded most of those, and each of those was possibly preceded by some number of never-officially-published editorial drafts; there’s no precise answer to my question, and I’m only interested in an order of magnitude anyway. It should also be noted that editors’ names often persist even after the principal editing task has been passed on to someone else: I don’t claim to have edited every version of every specification on which I’m credited. But I bet most of them.

At one time, maybe a decade ago, either tr.rdf contained the whole history of the technical reports page or there was another RDF version available that did. I asked around; that’s not available anymore. “Use the API, instead.”

So I did. And at this point, I began to construct a rant in my head, a screed possibly. Much wailing and gnashing of teeth about the fact that my straightforward 19 lines of query would have to be replaced by more than 100 lines of Python, to be tested and debugged and, with its thousands upon thousands of HTTP requests, run tediously. Run more than once, in fact, and written carefully (more testing, more debugging) to work around API rate limiting. A script that will not even, as it happens, answer my question; it will only collect the narrow slice of data needed to answer my question. I’ll have to write even more code to get the answer.

This is not that rant.

It’s not a rant for a few reasons.

  1. Primarily because what the W3C has done is not unreasonable. The tr.rdf file is about 1.1M. I estimate that the entire data set would be at least four times that size: there are about 1,200 specifications and about 5,000 distinct versions, so roughly four version records for every spec in tr.rdf. Fourish megabytes isn’t a very big download with a modern, first-world internet connection, but it’s big enough. You don’t want your browser[But there’s more to the web than brows…oh, nevermind —ed] doing it every time someone wants to know the latest version of a spec.

  2. It’s a nicely designed API, and for some kinds of access, an API is nice.

  3. Much as it would have been easier for me to get the results I wanted from the raw data, it’s only fair to observe that if it were 1.1T of data instead of 1.1M, a link to the file would be substantially less useful. I guess what I really want is for the W3C to store the data in MarkLogic and publish a SPARQL endpoint that I could use (see the sketch after this list). But that’s a whole different kettle of fish.

  4. Finally, this isn’t a rant because if it was, I fear it would appear to be directed at the W3C. The fact that this data exists at all, let alone is published in any form at all, is a testament to the W3C’s reliability, professionalism, and serious concern about the web. Most organizations wouldn’t have had the foresight to collect, preserve, and curate this data. Of those few that had, most wouldn’t have bothered to publish it in any useful form at all, for free, on the web.
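
About that imaginary endpoint: just to be clear about what I’m wishing for, asking my question over the standard SPARQL Protocol would look something like this. The endpoint URL is invented; no such service exists.

import requests

# Invented URL: the W3C publishes no SPARQL endpoint.
ENDPOINT = "https://www.w3.org/sparql"

query = """
PREFIX rec54: <http://www.w3.org/2001/02pd/rec54#>
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>

SELECT (COUNT(DISTINCT ?doc) AS ?count)
WHERE
{
  ?doc rec54:editor ?ed .
  ?ed contact:fullName "Norman Walsh"
}
"""

# Per the SPARQL Protocol: the query goes in a 'query' parameter and
# the results come back as JSON if you ask for them that way.
response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"})
print(response.json()["results"]["bindings"][0]["count"]["value"])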

So I’m disappointed that I couldn’t just download the RDF. And I’m annoyed that I had to code my way through an API to get the data. But I’m grateful that it was possible to get it at all.

My initial plan was brute force: get all the specs, get all the versions, get all the editors, count the number of specs where I’m credited as an editor. Unfortunately, the data backing the API seems to be incomplete: many versions have no editors.

Backup plan: get all the specs, get all the versions, figure out what specs I’ve edited, count all the versions of the specs I’ve edited.

This query, against the tr.rdf data, answers the question, “what specs have I edited”:

PREFIX rec54: <http://www.w3.org/2001/02pd/rec54#>
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
PREFIX doc: <http://www.w3.org/2000/10/swap/pim/doc#>

SELECT ?doc ?type
WHERE
{
  ?version a ?type .
  ?version rec54:editor ?ed .
  ?version doc:versionOf ?doc .
  ?ed contact:fullName "Norman Walsh"
}

There are a few places where the short names have been reused, but I can get the list of short names from the results of that query: each ?doc URI ends in its short name, so, continuing the rdflib stand-in, something like the sketch below pulls them out.
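
versions_query = """
PREFIX rec54: <http://www.w3.org/2001/02pd/rec54#>
PREFIX contact: <http://www.w3.org/2000/10/swap/pim/contact#>
PREFIX doc: <http://www.w3.org/2000/10/swap/pim/doc#>

SELECT DISTINCT ?doc
WHERE
{
  ?version a ?type .
  ?version rec54:editor ?ed .
  ?version doc:versionOf ?doc .
  ?ed contact:fullName "Norman Walsh"
}
"""

# The short name is the last path segment of each ?doc URI,
# e.g. http://www.w3.org/TR/xproc/ becomes "xproc".
shortnames = sorted({str(row.doc).rstrip("/").split("/")[-1]
                     for row in g.query(versions_query)})
print(shortnames)

With the short names in hand (and hand-checked against the reused ones), this Python script will bang on the API until it gets an answer: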

"""
Get stuff from the W3C API
"""

import json
import requests


class Specs:
    """Specs"""
    def __init__(self):
        # The request headers (including the API key) live in a local
        # file so they don't end up in the script.
        with open("/home/ndw/.w3capi.json") as f:
            self.headers = json.loads(f.read())

        # Cache everything retrieved so far; a crash (or a rate limit)
        # shouldn't force us to start over from scratch.
        self.datafile = "/tmp/specs.json"
        try:
            with open(self.datafile) as f:
                self.data = json.loads(f.read())
        except FileNotFoundError:
            self.data = {}

    def save(self):
        with open(self.datafile, "w") as f:
            print(json.dumps(self.data, indent=2), file=f)

    def get(self, uri, page):
        # Ask for big pages to keep the number of requests (and the
        # rate limiting) down.
        params = {
            'items': 1000,
            'page': page
        }

        response = requests.get(uri, headers=self.headers, params=params)

        if response.status_code != 200:
            raise Exception("Error code: {}".format(response.status_code))

        return json.loads(response.text)

    def get_specs(self):
        uri = "https://api.w3.org/specifications"
        page = 1
        done = False

        specs = []
        while not done:
            data = self.get(uri, page)

            for link in data['_links']['specifications']:
                specs.append(link['href'])

            # The last page of results has no 'next' link.
            done = 'next' not in data['_links']
            page = page + 1

        for key in specs:
            self.data[key] = {}

    def get_versions(self, spec):
        uri = "{}/versions".format(spec)
        # One page should be plenty; no spec has anywhere near
        # 1,000 versions.
        page = 1

        data = self.get(uri, page)

        self.data[spec]['versions'] = []
        for version in data['_links']['version-history']:
            self.data[spec]['versions'].append(version['href'])

    def count_versions(self, spec):
        if 'versions' in self.data[spec]:
            return len(self.data[spec]['versions'])
        else:
            return 1  # there must be at least one!


def main():
    """Main"""
    specs = Specs()

    if "https://api.w3.org/specifications/xml" not in specs.data:
        print("Getting specifications")
        specs.get_specs()
        specs.save()

    # Fetch versions for any spec we haven't seen yet. On any error
    # (rate limiting, for example), save what we have and bail out;
    # rerunning the script picks up where it left off.
    for spec in specs.data:
        if "versions" not in specs.data[spec]:
            try:
                print("V: {}".format(spec))
                specs.get_versions(spec)
            except Exception:
                specs.save()
                raise

    specs.save()

    shortnames = ['WD-XSLReq', 'html-xml-tf-report', 'leiri',
                  'namespaceState', 'proc-model-req', 'webarch',
                  'xinclude-11-requirements', 'xinclude-11', 'xlink10-ext',
                  'xlink11', 'xml-id', 'xml-link-style', 'xml-proc-profiles',
                  'xpath-datamodel-30', 'xpath-datamodel-31',
                  'xpath-datamodel', 'xpath-functions', 'xproc-template',
                  'xproc-v2-req', 'xproc', 'xproc20-steps', 'xproc20',
                  'xptr-element', 'xptr-framework', 'xptr',
                  'xslt-xquery-serialization']

    count = 0
    for shortname in shortnames:
        spec = "https://api.w3.org/specifications/" + shortname
        count = count + specs.count_versions(spec)

    print(count)


if __name__ == '__main__':
    main()

The answer is 101. Approximately.

Comments

Heh!

APIs! Gotta love 'em. At some point, I expect that the API "infrastructure" on the web is going to collapse under its own weight. For now, it keeps programmers employed.

(Also, seven plus two is nine, not 9)

—Posted by Kurt Cagle on 31 May 2016 @ 02:07 UTC #

Heh, indeed! You're the second person to point out that my spambot filter requires digits where a word might be expected. Luckily, you're not a spambot so you were able to figure it out. :-)

—Posted by Norman Walsh on 10 Jun 2016 @ 09:31 UTC #