For the last several weeks, a couple of us on the Singly team have been taking a deep dive into the idea of search within a Locker. As most people already know, and as we were quick to discover, search seems simple but can become large, difficult, and unwieldy if not done with care and attention. We wanted to share a few of the thoughts we had over these last few weeks, and let you know about some cool things we discovered.
Discovery 1: Lucene is a pretty amazing information retrieval library
Lucene is an open-source Apache project that handles information storage and retrieval, built specifically for search. It is actually just a Java library, however, meaning we can’t easily slide it into the existing Locker codebase.
A little detail about Lucene, and why it’s so powerful.
- Lucene implementations have scaled up to the terabyte range of storage. This bodes really well, as each user’s Locker may grow quite large over time, and we don’t want to outgrow our architecture if we can avoid it.
- Lucene has clearly defined concepts of text processing, both when indexing (placing information into the system) and when querying (pulling information out of the system).
- Lucene also has some subprojects that may turn out to be very helpful for Locker users. One in particular is Tika, which can extract the textual contents from binary files such as PDFs and Word documents. This is super helpful if, for instance, you wanted to not only search across your entire Gmail account, but also wanted to search within any attachments others have sent to you.
- The Lucene library has been battle-hardened over several years of active development and lots of implementations in production. Some very important thoughts have emerged around scaling Lucene, based on actual users pushing it to its limits. This gives us a lot of useful information, and shows us what not to do when adding Lucene to Lockers.
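To make the indexing-versus-querying point concrete, here is a minimal sketch of the idea in plain Python. It is not Lucene's API: `analyze` and `TinyIndex` are hypothetical names invented for this illustration, and the sample records are made up. The only thing it tries to show is that the same text processing runs on the way in and on the way out, which is why a lowercase query can match mixed-case data.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """A minimal inverted index mapping each term to the ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index-time analysis: every term of the text points back to doc_id.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query-time analysis: the query goes through the same analyzer,
        # then we keep only the ids that contain every query term.
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Made-up sample records, purely for illustration.
index = TinyIndex()
index.add("tweet-1", "Eric shipped the Locker search app")
index.add("mail-7", "Lunch with ERIC on Friday?")
print(sorted(index.search("eric")))  # ['mail-7', 'tweet-1']
```

A real Lucene analyzer chain does far more than this (stop words, stemming, and so on), but the symmetry between the two sides is the part worth keeping in mind.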
In short, we were pretty sold on Lucene as the search platform of choice. However, we weren’t ready quite yet.
Discovery 2: Lucene has lots of implementations, and each of these has tradeoffs
If only we could download and install a simple library and have full-search Lockers that easily! Unfortunately, there are several implementations of Lucene, and we wanted to choose the best one for the unique requirements of Lockers.
First, we tried elasticsearch, a very robust implementation of Lucene with added capabilities that allow it to scale to MASSIVE sizes. It also has a very robust method for indexing and querying the information store from practically any client. We prototyped this one first, knowing that it was large and probably overkill, because its maturity meant we could begin to get search results in a day or two.
Sure enough, a couple of days later, we were searching across the Locker, as you can see here:
This required every Locker user to run a full instance of elasticsearch alongside their Locker, though, which is a heavy requirement. Still, we loved how easy it was to get going with useful searching, so we committed that code as the Locker app called “Search” and forged ahead.
Discovery 3: Lucene isn’t only a Java library
How to get Lucene embedded within the Locker? That was our next task. We wanted something lightweight, fast, and embedded so we didn’t have to add yet another external requirement.
That’s when we learned that a group had ported the Lucene library to C++ in a project called CLucene. This was fantastic news for us, as we figured we could just wrap CLucene in a native Node.js module that would be extremely fast and small but still retain the power of Lucene.
Discovery 4: CLucene looks great, but it’s based on a much older Lucene version
Daaaaamn. We were so close, and it turns out that CLucene is missing some of the features we really wanted to use from current Lucene implementations, specifically geopoint support and updating existing documents. Now we needed to look closely at CLucene to see how much we would have to implement ourselves.
But first, like most of our projects at Singly, we applied the mantra “Make it work. Make it right. Make it fast.” So before getting it as full-featured as we desired, we first just wanted to make it work inside the Locker natively. This turned out to be a good learning experience for us, as we now know how to write native Node.js modules, and that opens up the massive number of C and C++ libraries out there to us.
Enough technobabble, let’s see what this sucka can do!
Okay! Here are some queries you can run on your Locker:
- “eric” - Find everything that has the term “eric” in it, case-insensitive.
- “eric~” - Do a fuzzy search for “eric”, meaning be a bit forgiving about misspellings. This will return data that contains terms like “erick” and “erik”, as well as “eric”.
- “+lockerproject singly” - Do a search for data that definitely has the term “lockerproject” in it, but may or may not have the term “singly” in it.
- “+lockerproject singly^5” - Do the same search as above, but for those records that do have the term “singly” in them, boost their scores much higher, so they show up closer to the top of the search results.
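If it helps to see what those operators mean mechanically, here is a rough sketch in plain Python. It is a toy, not how Lucene or CLucene parses or scores queries: the function names are hypothetical, the score is just the sum of the boosts of the matching terms, and the fuzzy match assumes "at most one edit away", which is a simplification chosen for this example. The sample documents are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute ca with cb
        prev = cur
    return prev[-1]


def parse(query):
    """Split a query into (term, required, fuzzy, boost) clauses."""
    clauses = []
    for token in query.lower().split():
        required = token.startswith("+")
        token = token.lstrip("+")
        boost = 1.0
        if "^" in token:
            token, boost_text = token.split("^", 1)
            boost = float(boost_text)
        fuzzy = token.endswith("~")
        token = token.rstrip("~")
        clauses.append((token, required, fuzzy, boost))
    return clauses


def score(doc_text, query, max_edits=1):
    """Toy score: sum of boosts of matching terms, or None if the doc is not a hit."""
    words = doc_text.lower().split()
    total = 0.0
    for term, required, fuzzy, boost in parse(query):
        if fuzzy:
            hit = any(edit_distance(term, w) <= max_edits for w in words)
        else:
            hit = term in words
        if hit:
            total += boost
        elif required:
            return None  # a "+" term is missing, so the doc cannot match
    return total if total > 0 else None


def search(docs, query):
    """Return the ids of matching docs, highest toy score first."""
    hits = [(score(text, query), doc_id) for doc_id, text in docs.items()]
    hits = [(s, d) for s, d in hits if s is not None]
    return [d for s, d in sorted(hits, key=lambda pair: -pair[0])]


# Made-up sample data, purely for illustration.
docs = {
    "a": "lockerproject update from singly",
    "b": "lockerproject roadmap",
    "c": "singly news",
}
print(search(docs, "+lockerproject singly^5"))  # ['a', 'b']
```

Record "c" drops out because the required term is missing, and "a" outranks "b" because the boosted optional term matched.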
What’s next?
Well, this is where we need to add functionality to CLucene that it did not already have. Specifically, we need to be able to add all the different fields that are internal to Locker records. For instance, for a record representing one of your contacts, you should be able to search for first name separately from last name. Right now our Node.js CLucene module does not support that.
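As a rough picture of what per-field search means, the sketch below keys a tiny inverted index on (field, term) pairs instead of on bare terms. This is plain Python rather than our Node.js module or the CLucene API, and the class name, the field names `first_name` and `last_name`, and the contacts themselves are all hypothetical. The `field:term` query form mirrors the way Lucene's query syntax addresses a field.

```python
import re
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index keyed on (field, term) rather than on the term alone."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, record_id, fields):
        # Index every field separately so the same word in two different
        # fields ends up in two different posting lists.
        for field, value in fields.items():
            for term in re.findall(r"[a-z0-9]+", value.lower()):
                self.postings[(field, term)].add(record_id)

    def search(self, query):
        """Look up a single 'field:term' clause, for example 'last_name:lee'."""
        field, _, term = query.partition(":")
        return set(self.postings.get((field, term.lower()), set()))


# Made-up contacts, purely for illustration.
contacts = FieldedIndex()
contacts.add("contact-1", {"first_name": "Jordan", "last_name": "Lee"})
contacts.add("contact-2", {"first_name": "Lee", "last_name": "Park"})

print(contacts.search("first_name:lee"))  # {'contact-2'}
print(contacts.search("last_name:lee"))   # {'contact-1'}
```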
We also want to add back in the ability to update existing records, so your data is always searchable and retrievable, regardless of how many times you resync your connectors. Being able to search by distance from a geographic location, such as a Foursquare checkin, is also very high on our list.
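Neither of these exists in our module yet, so the following is only a sketch of the two ideas in plain Python, with hypothetical names and made-up checkins. Updates are modeled as "delete the old copy, then add the new one" keyed on a stable record id, which is the same shape as an update in Lucene proper. The distance filter uses the haversine great-circle formula and a brute-force scan, an assumption made for brevity rather than a proposal for how a real geo index should work.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


class RecordStore:
    """Toy store where re-adding a record id replaces the previous copy."""

    def __init__(self):
        self.records = {}

    def upsert(self, record_id, record):
        # Delete-then-add: resyncing a connector never creates duplicates.
        self.records.pop(record_id, None)
        self.records[record_id] = record

    def within(self, lat, lon, radius_km):
        # Brute-force scan: keep the records whose point lies inside the radius.
        return sorted(
            record_id
            for record_id, record in self.records.items()
            if haversine_km(lat, lon, record["lat"], record["lon"]) <= radius_km
        )


# Made-up checkins, purely for illustration.
store = RecordStore()
store.upsert("checkin-1", {"venue": "cafe", "lat": 37.8044, "lon": -122.2712})
store.upsert("checkin-2", {"venue": "diner", "lat": 34.0522, "lon": -118.2437})
# A resync delivers checkin-1 again with a changed field.
store.upsert("checkin-1", {"venue": "coffee shop", "lat": 37.8044, "lon": -122.2712})

print(len(store.records))                        # 2
print(store.within(37.7749, -122.4194, 20.0))    # ['checkin-1']
```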
We’re working hard to bring useful features to Lockers across the web, and search is going to be a core part of that. We are always happy to hear your thoughts or answer any questions you may have. We are regularly available in the Freenode IRC room #lockerproject, as well as via e-mail and GitHub. We welcome your feedback!
Way to go guys! Thanks for writing these updates as you go along. I’m itching to dive in but waiting for the barrier-to-entry threshold to be low enough that I believe it becomes an effective use of my (part) time. So these blog posts are very helpful to gauge whether or not the time is right for me. And after this post, it almost feels right