I’ve had a couple of people ask how my lunchtime project today actually works behind the scenes, so here’s the lowdown in easily-digestible speak. I should point out that I am relying heavily on two frameworks which we’ve already built at Lincoln. These are Nucleus – our heavy-lifting data platform – and the Common Web Design – our web design and application framework. These two gave me a massive head-start by already doing all of the hard work, such as extracting data from our directory and making the whole thing look great. Now, on with the technology.
First of all is the job of getting the contents of our staff directory and mangling them into a form which Sphinx – a blisteringly fast search engine – can index and work with. This is a matter of making an HTTPS call to Nucleus with Staff Directory’s anonymous access token. The result comes back as plain JSON, containing staff names, email addresses, phone numbers, departments and a few other bits of information such as internal ID numbers and the last time that Nucleus updated their details.
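A minimal sketch of this fetch step could look something like the following. Everything specific in it is an assumption made for illustration: the endpoint URL is a placeholder, the token is assumed to travel in an Authorization header, and the response is assumed to be a JSON object with a "results" list. None of those details are confirmed by Nucleus itself.

```python
import json
import urllib.request

# Hypothetical endpoint: the real Nucleus URL is not given in this post.
NUCLEUS_URL = "https://nucleus.example.invalid/staff"


def build_request(token, url=NUCLEUS_URL):
    """Build the HTTPS request without sending it."""
    # Assumption: the access token is sent as a bearer token header.
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


def parse_staff(body):
    """Turn the JSON response body into a list of staff dictionaries."""
    # Assumption: the staff records live under a top-level "results" key.
    return json.loads(body)["results"]


def fetch_staff(token):
    """Send the request and return the parsed staff records."""
    with urllib.request.urlopen(build_request(token)) as response:
        return parse_staff(response.read().decode("utf-8"))
```

Splitting the request building and the parsing away from the network call keeps both halves easy to check without touching the live service.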
The JSON is then parsed and rendered back out in Sphinx’s xmlpipe2 format, which contains both the fields we want to search on and some extra details which we want to include in the results, such as phone number. On its own this doesn’t do anything useful at all; it’s just a slightly fancy XML representation of our staff directory. At this point we drag Sphinx in to do the heavy lifting of indexing it all.
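To give a feel for what the xmlpipe2 output looks like, here is a rough sketch of a renderer. The record keys ("id", "name", "department", "email", "phone") are assumptions about the shape of the parsed JSON, and storing the phone number as a string attribute assumes a Sphinx version that supports string attributes in xmlpipe2.

```python
from xml.sax.saxutils import escape

# Assumed layout: full-text fields are searched, attributes are only stored.
SEARCH_FIELDS = ("name", "department", "email")
STORED_ATTRS = ("phone",)


def to_xmlpipe2(staff):
    """Render a list of staff dictionaries as an xmlpipe2 document stream."""
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for field in SEARCH_FIELDS:
        lines.append('<sphinx:field name="%s"/>' % field)
    for attr in STORED_ATTRS:
        lines.append('<sphinx:attr name="%s" type="string"/>' % attr)
    lines.append("</sphinx:schema>")

    for person in staff:
        # Sphinx needs a unique positive integer ID for every document.
        lines.append('<sphinx:document id="%d">' % int(person["id"]))
        for key in SEARCH_FIELDS + STORED_ATTRS:
            # escape() protects against &, < and > in names or departments.
            value = escape(str(person.get(key, "")))
            lines.append("<%s>%s</%s>" % (key, value, key))
        lines.append("</sphinx:document>")

    lines.append("</sphinx:docset>")
    return "\n".join(lines)
```

Calling to_xmlpipe2() on the parsed records and printing the result gives Sphinx a stream it can read straight from a pipe.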
I configured a Sphinx data source to get the XML representation of the directory using wget. In some cases it’s more efficient to have the XML renderer output directly to disk and then read the file from the file system instead of wasting time on the TCP/IP stack, but in this case we can make one very quick query and be done with it (which saves scheduling two jobs instead of one). I then configured an index to use this data source, setting the minimum prefix length to 1 (allowing queries like “nick j” to complete), ran the indexer with the --rotate and --all flags, and did a happy dance as it indexed the whole lot in about a second.
At this point I could search the whole lot from the command line and get relevant results. I played around with metaphones for a bit to allow ‘soundalike’ searches, but the matches proved to be extraordinarily vague. I may, however, consider falling back to the metaphone mode when a normal search returns absolutely no results, in a kind of “did you mean…” sense.
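The “did you mean…” idea boils down to a simple fallback, which could be sketched like this. The two search functions passed in are hypothetical stand-ins for a normal Sphinx query and a query against a metaphone-enabled index; nothing here talks to Sphinx directly.

```python
def search_with_fallback(query, exact_search, soundalike_search):
    """Try the normal search first; only use soundalike matching if it finds nothing."""
    results = exact_search(query)
    if results:
        return {"results": results, "did_you_mean": False}
    # Nothing matched exactly, so offer the vaguer soundalike matches instead.
    return {"results": soundalike_search(query), "did_you_mean": True}
```

The did_you_mean flag lets the view label soundalike results as suggestions rather than passing them off as proper matches.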
Searching from the command line is cool, but it’s not massively awesome when people need to get to the service from all over the world. It’s not feasible to make people telnet in and we can’t hand out SSH accounts to everybody, so the next step was to make PHP talk to Sphinx. Fortunately for me the Sphinx API reference implementation is in PHP.
Since I’m using the CodeIgniter framework for PHP development I opted to just use the PHP API implementation as a helper rather than shoehorn it into a library. That said, since we’ll likely be using Sphinx a lot in Jerome it may be worth building a proper CodeIgniter library and open-sourcing it.
The Sphinx API lets me do all kinds of fine-tuning to the actual query. Things I actually did included setting more sensible field weighting so that searching for “jo” would return somebody called “jo” over “journalism”. The current order goes name, department, email address. I also set the matching mode to one of Sphinx’s extended modes, meaning that (should the feeling take people) they can use complex search operators such as NOT. For example, “n jackson -media” will only return me and not the other N Jackson who works over in Media. Pipe the search query into Sphinx, grab the array of results, and pass it off to a view in CodeIgniter to be rendered out to the end user.
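The real thing uses the PHP client, but the same calls exist in the Python client that ships with Sphinx (sphinxapi), so the query setup can be sketched as below. The weight values, the index name "staff" and the choice of SPH_MATCH_EXTENDED2 are all assumptions; only the ordering of name, department, email comes from the description above.

```python
# Numeric value of SPH_MATCH_EXTENDED2 in the stock sphinxapi client.
SPH_MATCH_EXTENDED2 = 6

# Illustrative weights: only the ordering (name > department > email) is given.
FIELD_WEIGHTS = {"name": 10, "department": 5, "email": 1}


def search_staff(client, query, index="staff"):
    """Run an extended-syntax query and return the list of matches."""
    # client is expected to behave like sphinxapi.SphinxClient.
    client.SetMatchMode(SPH_MATCH_EXTENDED2)
    client.SetFieldWeights(FIELD_WEIGHTS)
    result = client.Query(query, index)
    if not result:
        # The Python client returns None when the query fails.
        return []
    return result.get("matches", [])
```

With a real client you would create a sphinxapi.SphinxClient(), point it at searchd with SetServer(), and pass it in along with a query such as “n jackson -media”.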
Simples.
OMG AWESOME EXTRAZ: You can now hit up the search using a JSON API at:
http://phone.labs.lncn.eu/search/api?q=query
obviously replacing “query” with your own properly URL-encoded query string. But you knew that already, right?
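If you’d rather not encode the query by hand, a small helper along these lines will build the URL for you. The function name is made up for this example, and no assumptions are made here about what the JSON response contains.

```python
from urllib.parse import quote

API_URL = "http://phone.labs.lncn.eu/search/api"


def build_search_url(query):
    """Return the API URL with the query safely percent-encoded."""
    # safe="" makes sure characters like & and / are encoded too.
    return API_URL + "?q=" + quote(query, safe="")
```

For example, build_search_url("n jackson -media") encodes the spaces as %20 and leaves the minus sign alone, so the NOT operator survives the trip.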
“I played around with metaphones for a bit to allow ‘soundalike’ searches, but this proved to be extraordinarily vague”
metaphone has been improved quite a bit since i first developed it in 1990. i wonder which implementation you were using. drop me a line and i’ll send you a free copy of metaphone 3