The End of Screen-scraping?

Volume 6, Issue 43; 22 Jun 2003

Will web services provide useful information on demand?

I've been reading Google Hacks. Among other interesting topics, the authors discuss the tradeoffs of screen scraping versus the Google API.

I have a few scripts that depend on screen scraping: I collect local movie times and weather information for my Palm by scraping fandango.com and weather.com. And I have a tool that scrapes a commercial dictionary so that I can run dict word and get quick definitions.

mercury:~$ dict scraping
Main Entry: 1scrape
Pronunciation: 'skrAp
Function: verb
Inflected Form(s): scraped; scrap.ing
Etymology: Middle English, from Old
Norse skrapa; akin to Old English
scrapian to scrape, Latin scrobis
ditch, Russian skresti to scrape
Date: 14th century
transitive senses
1 a : to remove from a surface by
usually repeated strokes of an edged
instrument b : to make (a surface)
smooth or clean with strokes of an
edged instrument or an abrasive
...

I expect the sites' owners would object in principle to my scraping, but I have a pretty clear conscience. I don't follow links or anything; I just discard a bunch of HTML presentation goop.
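
For a sense of what "discarding presentation goop" amounts to, here is a minimal sketch in Python using the standard library's html.parser. It is not my actual tool: the TextExtractor class, the strip_markup function, and the SAMPLE markup are all made up for illustration, and a real page would need site-specific handling on top of this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text of a page, ignoring tags, scripts, and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style elements.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def strip_markup(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page fragment, not any real site's markup.
SAMPLE = """
<html><head><style>b { color: red }</style></head>
<body><table><tr><td><b>Main Entry:</b> scrape</td></tr>
<tr><td><i>Function:</i> verb</td></tr></table></body></html>
"""

if __name__ == "__main__":
    print(strip_markup(SAMPLE))
```

Everything the sketch throws away is layout; the handful of words that survive are the only part I ever wanted.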

And here's the thing: if these sites provided an API that gave me decent access to the structured information that I'm currently forced to scrape, not only would I stop scraping, I'd happily pay for it.

Let's be reasonable: I'm not going to pay a fortune, but I'd definitely subscribe for the convenience of accessible data.

There's lots of data that I'd like to be able to get to: The Oxford English Dictionary, The Encyclopedia Britannica, The Merriam-Webster Thesaurus, among others.

Note

Selling me the data on a CD-ROM is OK, but not if you make me use some obnoxious GUI to access it. And if the GUI only runs on some proprietary operating system that I don't routinely have access to, you might as well not bother trying to sell me the CD-ROM.

Hopefully, Google, Amazon, and the other experimental API services will be a catalyst for widespread API deployment. Ideally, I'd like nice RESTful, bookmarkable URIs for simple queries, but I'm willing to live with whatever I'm offered. I've got Perl and XSLT, after all.
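
To make "bookmarkable URIs for simple queries" concrete, here is a small Python sketch of the kind of interface I have in mind. No such service exists as far as I know: the host dictionary.example.org, the /define path, and the word and format parameters are all assumptions invented for the example. The point is only that the whole query lives in the URI, so it can be bookmarked, mailed around, or fetched from a script with a plain GET.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; example.org is reserved for documentation.
BASE = "http://dictionary.example.org/define"


def definition_uri(word, fmt="xml"):
    """Build a self-contained query URI for a (hypothetical) dictionary API."""
    # urlencode escapes the values, so words with spaces or
    # punctuation still produce a valid URI.
    return BASE + "?" + urlencode({"word": word, "format": fmt})


def query_of(uri):
    """Recover the query parameters from a URI built by definition_uri."""
    return parse_qs(urlsplit(uri).query)


if __name__ == "__main__":
    uri = definition_uri("scraping")
    print(uri)
    print(query_of(uri))
```

Because nothing about the request is hidden in a POST body or a session, the URI itself is the query, and that is what makes it bookmarkable.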