I frequently have to access a variety of documents to do my job. Unfortunately a lot of these documents are stuck on the Portal which, whilst it does a good job of being a document management system, does a terrible job of getting those documents out to people. Let me explain how getting a document normally works.
- Visit a long Portal URI, which is usually long enough that either you’re drawing a pension by the time you’ve finished typing it or it breaks across two lines on your screen so it needs copying and pasting.
- Log in, even for supposedly public documents.
- Discover I’m not at a document, but at a ‘sub-portal’ for a department with a big list of things I can read.
- Find what I want.
- Click the item, to be taken to another page where I can click another button to download it.
- Download the document.
- Open Pages, since my Mac doesn’t have Word installed.
- Tweak the document formatting so it looks right.
- Read and enjoy.
Now, I really wish I was exaggerating there, but I’m not. What I’d like to happen is:
- Visit short, sensible URI.
- Read online version with a nice layout, the ability to use my own browser’s accessibility settings etc.
- If I want to download it, click to get a PDF.
Let’s see how we can do that.
First of all, a few limitations. This isn’t a document repository I’m building here. You won’t be able to upload your .doc or your .xls file, because that would break the notion that it can easily shift content between formats. Nor is it an attempt to replace Portal (which runs on SharePoint) as a collaboration tool. Instead, it’s an attempt to drag all the static content like policy documents and how-to guides out of the depths of a horrifically hierarchical system and into one which treats documents as equals, and can output them as whatever people want.
So we’ll start by limiting what can be done in a document to a small subset of HTML. Basic wiki-style text formatting like bold, italic and underline is probably a good start, as are headers of differing levels. Links are probably good, as are lists, tables and images (with restricted positioning options), but that’s about it. The upshot of this is that using this system it’s impossible to come up with something which won’t render nicely in different formats. Or, to put it another way, we’re forcing users to distance the content from the presentation on the assumption (which in most cases I suspect will be true) that we know more about semantics, typography and accessibility than they do. It also means no more Comic Sans on documents, and that things like headings will actually be structured as such rather than big, bold and underlined body text.
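To make the "small subset of HTML" idea a bit more concrete, here is a rough sketch of a whitelist filter using Python's standard `html.parser`. The tag list, the `ALLOWED` table and the `clean` function are all my own illustrative guesses at what such a subset might look like, not a design for the real system. Anything outside the list loses its tags but keeps its text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes permitted on that tag.
ALLOWED = {
    "p": set(), "b": set(), "i": set(), "u": set(),
    "h1": set(), "h2": set(), "h3": set(),
    "ul": set(), "ol": set(), "li": set(),
    "table": set(), "tr": set(), "th": set(), "td": set(),
    "a": {"href"},
    "img": {"src", "alt"},
}
VOID = {"img"}  # tags that never get a closing tag


class SubsetFilter(HTMLParser):
    """Re-emits only whitelisted tags and attributes; text is always kept."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself, e.g. <font> or <span>
        kept = []
        for name, value in attrs:
            if name in ALLOWED[tag]:
                safe = escape(value or "", quote=True)
                kept.append(f' {name}="{safe}"')
        self.out.append("<" + tag + "".join(kept) + ">")

    def handle_endtag(self, tag):
        if tag in ALLOWED and tag not in VOID:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(markup):
    """Return markup reduced to the whitelisted subset."""
    parser = SubsetFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(clean('<font face="Comic Sans MS">Hi</font>'))
print(clean('<h2 style="color:red">Title</h2>'))
```

This is only about keeping presentation out of the content. It is not a security-grade sanitiser: it doesn't vet what a link points at, and it keeps the text inside dropped tags.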
Next up, each ‘page’ of content needs permissions. Fortunately we’re already building a simply enormous permissions handling system into Nucleus, as a bridge between the People and Groups systems. You’ll be able to tailor access to a document on a group basis, a person basis, or a collection thereof. Access permissions are the usual none, read, and read/write. We now have a document, stored centrally in a controlled format, which you can assign people appropriate permissions to. This is getting good.
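As a sketch of how the none, read and read/write levels might resolve for one person, here is a tiny example. The `Access` enum, the `effective_access` function and the rule that the most permissive grant wins are all assumptions of mine for illustration. Nucleus may well resolve conflicts differently.

```python
from enum import IntEnum


class Access(IntEnum):
    NONE = 0
    READ = 1
    READ_WRITE = 2


def effective_access(person, groups, person_grants, group_grants):
    """Highest level granted to the person directly or through any group.

    Assumption: the most permissive grant wins, and no grant means NONE.
    """
    levels = [person_grants.get(person, Access.NONE)]
    levels += [group_grants.get(group, Access.NONE) for group in groups]
    return max(levels)


# Hypothetical grants on a single page.
person_grants = {"alice": Access.READ_WRITE}
group_grants = {"staff": Access.READ}

print(effective_access("alice", ["staff"], person_grants, group_grants).name)
print(effective_access("bob", ["staff"], person_grants, group_grants).name)
print(effective_access("carol", [], person_grants, group_grants).name)
```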
Version control is easy: each time a page is modified, just spawn a new version. We have more than enough data storage space to keep hundreds of iterations of documents, and because they’re stored (effectively) as plain text we’re saving the hundreds of kB per document which would otherwise be taken up by document metadata. We’re up to a standardised, secured, flexible document system with versioning.
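The "spawn a new version on every edit" approach is about as simple as versioning gets. A minimal in-memory sketch, with a hypothetical `Page` class standing in for whatever storage the real system would use:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Append-only history: saving never overwrites an earlier version."""

    versions: list = field(default_factory=list)

    def save(self, text):
        """Store a new version and return its 1-based version number."""
        self.versions.append(text)
        return len(self.versions)

    def get(self, version=None):
        """Return the given version, or the latest when none is given."""
        if version is None:
            return self.versions[-1]
        if not 1 <= version <= len(self.versions):
            raise KeyError(version)
        return self.versions[version - 1]


page = Page()
page.save("<h1>Travel policy</h1>")
page.save("<h1>Travel policy</h1><p>Now with trains.</p>")
print(page.get(1))
print(page.get())
```

Storing full copies rather than diffs is a deliberate simplification here, on the grounds that plain text is small.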
Finally, let’s consider getting to the documents. Each document can be assigned a stupidly short identifier when it’s first created, and then a further stupidly short identifier for each version. Getting to a document then becomes a matter of visiting http://pages.lincoln.ac.uk/a7b4 (or whatever the document ID is), and a version is just http://pages.lincoln.ac.uk/a7b4/3 (or whatever the version number is). Whilst we’re at it, we may as well add a format option so we could (for example) request http://pages.lincoln.ac.uk/a7b4/pdf to download a nice PDF version, or http://pages.lincoln.ac.uk/a7b4/xml to get the whole thing nicely rendered out in XML for use elsewhere.
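Since version numbers are all digits and formats are words, the same URI slot can carry either without ambiguity. Here is a rough sketch of pulling the pieces out of a path. The `parse_path` function, the `FORMATS` set, the default of `html`, and the idea that a version and a format can be combined (as in `/a7b4/3/pdf`) are my own assumptions rather than anything decided.

```python
FORMATS = {"html", "pdf", "xml"}  # hypothetical set of output formats


def parse_path(path):
    """Split '/a7b4/3/pdf' into (document ID, version or None, format)."""
    parts = [p for p in path.strip("/").split("/") if p]
    if not parts:
        raise ValueError("missing document ID")
    doc_id, version, fmt = parts[0], None, "html"
    for part in parts[1:]:
        if part.isdecimal() and version is None:
            version = int(part)  # an all-digit segment is a version number
        elif part in FORMATS:
            fmt = part
        else:
            raise ValueError(f"unrecognised path segment: {part!r}")
    return doc_id, version, fmt


print(parse_path("/a7b4"))
print(parse_path("/a7b4/3"))
print(parse_path("/a7b4/pdf"))
```

A version of `None` would mean "latest".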
Worth it?
http://www.comp.eonworks.com/scripts/convert_pdf_to_text-20040418.html
I’ve used this little utility before via exec() to pull in text from imported PDFs. I used it purely to automagically tag PDFs in a search system, but it could just as easily have reformatted the text for HTML viewing. Obviously it’s not a lot of use for importing much more than plain text PDFs, and it has its hit and miss moments even with that, but it might make importing lots of documents that bit quicker.