Bret Victor's recent post The Web Of Alexandria laments the high rate of links going dead on the internet. It compares the loss of collective knowledge through dead links to the burning of the Library of Alexandria. This got me thinking: how would we fix this? My mind naturally went to The Internet Archive and its Wayback Machine.
The Wayback Machine has already been archiving historical versions of web pages for almost 20 years. So perhaps what is needed is simply a little bit of glue to make it convenient to add a URL to the archive and allow authors to link to a "permalink" as opposed to the live link. That way, even if the live page dies, the article will still link to the archived page, which has the original content. Ideally, this will happen transparently to the user.
Idea: Permalink As A Service
The name that came naturally for this service is "Permalink". Even though this term is already in use, it's too apt for me to think of a better one, so until there's a better name, I will just call this the "permalink service". How this service works is simple:
1. An author wants to link to an external web page from her article as a resource/reference. Instead of linking to the URL itself, she creates a permalink.
2. To do so, she goes to the permalink service's front page, enters the URL she wants to create a permalink for, and clicks "Create Permalink".
3. Once the permalink is created, the author gets the URL of the permalink to use in her article.
That's it! In step 2, the permalink service would ask the Wayback Machine to archive a snapshot of the submitted URL. When someone clicks on the permalink, the service checks the availability of the resource: if the resource is available, it redirects to the live page; otherwise, it redirects to the Internet Archive's archived version (see the sketch after the list below). Of course, there are circumstances in which this wouldn't work. For example:
- The content of the requested URL could be hidden behind a login.
- The site may have a robots.txt that prohibits crawlers like search engines or the Internet Archive from retrieving content.
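For concreteness, here is a minimal sketch of what that redirect logic might look like, written against Koa 1.x (generator-based middleware) with koa-route and request-promise to match the prototype's stack. The lookupUrl helper, which maps a permalink ID back to the original URL, is hypothetical:

var koa = require('koa');
var route = require('koa-route');
var request = require('request-promise');

var app = koa();

app.use(route.get('/:id', function*(id) {
  // lookupUrl is a hypothetical helper that maps a permalink ID
  // back to the original URL it was created for
  var liveUrl = lookupUrl(id);
  try {
    // A HEAD request checks whether the live page still responds;
    // request-promise rejects on network errors and non-2xx statuses
    yield request({uri: liveUrl, method: 'HEAD'});
    this.redirect(liveUrl);
  } catch (err) {
    // The live page is gone: fall back to the Wayback Machine's
    // most recent snapshot of the URL
    this.redirect(`https://web.archive.org/web/${liveUrl}`);
  }
}));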
The hope is that this scheme would work often enough to be worth maintaining. But perhaps this is something we'd have to see to believe.
The Prototype
I made a prototype that implements this service, and the code is on GitHub under the MIT license. The implementation is extremely simple: it runs on io.js and the Koa framework, and the code itself is 122 LOC at the time of this writing. The most interesting bit of code is probably the part that makes a request to archive.org to request that a URL be archived:
var request = require('request-promise');
var cheerio = require('cheerio');

module.exports = function(url) {
  // Hitting the /save/ endpoint asks the Wayback Machine to take a snapshot
  return request(`https://web.archive.org/save/${url}`)
    .then(function() {
      return true;
    })
    .error(function(err) {
      // archive.org replies with an HTML error page; scrape it
      // for a human-readable error message
      var $ = cheerio.load(err.message);
      err = new Error();
      err.message = $('#error').text().replace(/\s+/g, ' ').trim();
      err.learnMore = $('.wm-nav-link-div').html();
      throw err;
    });
};
To request that a page be archived by the Wayback Machine, all that is required is a GET request to https://web.archive.org/save/${url}, where url is the URL you want to archive. In the error handling, I do a little bit of HTML scraping on the error page to get a meaningful error message in case the page could not be archived.
I have an instance of this server running at http://permalink.tobyho.com. Please feel free to poke at it.
Ideas and Questions
As you may have gathered from the fact that the app has no styling of any kind, I didn't put too much effort into the implementation. This was a one-day hack job, and intentionally so. However, I can already think of additional potentially useful features. One would be a bookmarklet or browser extension to make it a snap to create a permalink (see the sketch below). Another is to actually store all the permalinks created through the service so that they can be exported in bulk.
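As a rough illustration, the bookmarklet could be as simple as the following. Note that the /create endpoint and its url parameter are assumptions; the prototype doesn't expose such an API yet:

javascript:(function() {
  // Open the permalink service with the current page's URL filled in.
  // The /create endpoint and url parameter are hypothetical.
  window.open('http://permalink.tobyho.com/create?url=' +
    encodeURIComponent(location.href));
})();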
One issue that I haven't figured out from a usability perspective is whether the permalink should always redirect to the live site, the latest archived version, or the archived version from the time the author created the link. It's possible that different authors want different things here, and that this should be made configurable. But even then, deciding on a sensible default is something that would benefit from more thoughtfulness.
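For what it's worth, the Wayback Machine's URL scheme already supports both archived options, so assuming the service records the snapshot timestamp at creation time, each behavior maps onto a simple URL pattern:

// Latest snapshot: the Wayback Machine redirects to its most recent capture
var latest = `https://web.archive.org/web/${url}`;

// Pinned snapshot: a 14-digit YYYYMMDDhhmmss timestamp selects the capture
// closest to that moment (the timestamp here is a made-up example)
var pinned = `https://web.archive.org/web/20150601000000/${url}`;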
Another idea explored in Bret's post is that of data replication, an idea that seems very underutilized on the internet today. Rather than asking the Internet Archive to take on the entire burden of archiving content and become a single point of failure, perhaps a better architecture would allow any author to easily stand up a server for archiving all the links he or she personally wants to share. And, as a second step, have some sort of confederation that replicates data across multiple nodes on the network. Some prior art here is Wallabag, a self-hosted read-later bookmarking server, and the Smallest Federated Wiki. I, for one, am all for more self-hosting in this age of SaaS.
Feedback Please
I have no idea if anyone will actually find this useful or interesting, but I would appreciate your thoughts - I am talking to you, dear reader. Thanks for reading.