Updates from ethomjenn

  • ethomjenn 10:35 pm on August 12, 2011

    A Deeper Dive on Locker Search 

    For the last several weeks, a couple of us on the Singly team have been taking a deep dive into the idea of search within a Locker.  As most people already know, and as we were quick to discover, search seems simple but can become large, difficult, and unwieldy if not done with care and attention.  We wanted to share some of the thoughts we’ve had over these last few weeks, and let you know about some cool things we discovered.

    Discovery 1:  Lucene is a pretty amazing information retrieval library

    Lucene is an open-source Apache project for storing, indexing, and searching text.  It is just a library, however (and a Java one, at that), meaning we can’t easily slide it into the existing Locker codebase.

    A little detail about Lucene, and why it’s so powerful.

    • Lucene implementations have scaled up to the terabyte range of storage.  This bodes well, as each user’s Locker may grow quite large over time, and we don’t want to outgrow our architecture.
    • Lucene has clearly-defined concepts of text processing, both when indexing (placing information into the system) and when querying (pulling information out of the system).
    • Lucene also has some subprojects that may turn out to be very helpful for Locker users. One in particular is Tika, which can extract the textual contents from binary files such as PDFs and Word documents.  This is super helpful if, for instance, you wanted to not only search across your entire Gmail account, but also wanted to search within any attachments others have sent to you.
    • The Lucene library has been battle-hardened over several years of active development and lots of implementations in production.  Some very important thoughts have emerged around scaling Lucene, based on actual users pushing it to its limits.  This gives us a lot of useful information, and shows us what not to do when adding Lucene to Lockers.
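
    To make the indexing/querying split a little more concrete, here is a minimal sketch in Python.  It is not Lucene and not Locker code: the `analyze` function, the `TinyIndex` class, and the sample records are all made up for illustration.  The point it tries to show is that the same text processing runs on both sides, once when a document goes in and once when a query comes out.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical in-memory inverted index (illustration only, not Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids

    def add(self, doc_id, text):
        # Indexing side: analyze the text, then record which doc has each term.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Querying side: run the SAME analyzer, then intersect the postings.
        terms = analyze(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Made-up sample records.
index = TinyIndex()
index.add("tweet-1", "Hacking on the Locker Project with Eric")
index.add("mail-7", "ERIC sent you the OSCON slides")
index.add("link-3", "Lucene in Action")
```

    Because both sides share one analyzer, a query for “ERIC” and a query for “eric” land on the same terms.  A real analyzer chain would do far more (stemming, stop words, and so on).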

    In short, we were pretty sold on Lucene as the search platform of choice.  However, we weren’t ready quite yet.

    Discovery 2:  Lucene has lots of implementations, and each of these has tradeoffs

    If only we could download and install a simple library and have full-search Lockers that easily!  No, unfortunately, there are several implementations of Lucene, and we want to choose the best based on the unique requirements of Lockers.

    First, we tried Elasticsearch, a very robust Lucene-based search server, with capabilities that allow it to scale to MASSIVE sizes.  It also has a very robust method for indexing and querying the information store from practically any client.  We prototyped this one first, knowing that it was large and probably overkill, because its maturity meant we could begin to get search results in a day or two.

    Sure enough, a couple days later, we were searching across the Locker, as you can see here:

    [Screenshot: search results from the Locker]

    However, this required every Locker user to run a full instance of Elasticsearch alongside their Locker, which is a heavy requirement.  Still, we loved how easy it was to get going with useful searching, so we committed that code as the Locker app called “Search”, and forged ahead.

    Discovery 3:  Lucene isn’t only a Java library

    How to get Lucene embedded within the Locker?  That was our next task.  We wanted something lightweight, fast, and embedded so we didn’t have to add yet another external requirement.

    That’s when we learned that a group had ported the Lucene library to C++, in a project called CLucene.  This was fantastic news for us, as we figured we could just wrap CLucene in a native Node.js module that would be extremely fast and small, but still retain the power of Lucene.

    Discovery 4:  CLucene looks great, but it’s based on a much older Lucene version

    Daaaaamn.  We were so close, and it turns out that CLucene is missing some of the features we really wanted to use from current Lucene implementations, specifically geopoint support and updating existing documents.  Now we needed to look closely at CLucene to see how much we would need to implement.

    But first, like most of our projects at Singly, we applied the mantra “Make it work.  Make it right.  Make it fast.”  So before getting it as full-featured as we desired, we first just wanted to make it work inside the Locker natively.  This turned out to be a good learning experience for us, as we now know how to write native Node.js modules, and that opens up the massive number of C and C++ libraries out there to us.

    Enough technobabble, let’s see what this sucka can do!

    Okay!  Here are some queries you can run on your Locker:

    • “eric” - Find everything that has the term “eric” in it, case-insensitive.
    • “eric~” - Do a fuzzy search for “eric”, meaning be a bit forgiving about misspellings.  This will return data that contains terms like “erick” and “erik”, as well as “eric”.
    • “+lockerproject singly” - Do a search for data that definitely has the term “lockerproject” in it, but may or may not have the term “singly” in it.
    • “+lockerproject singly^5” - Do the same search as above, but for those records that do have the term “singly” in them, boost their scores much higher, so they show up closer to the top of the search results.
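
    If you want a feel for how those four operators behave, here is a rough sketch in Python.  It is a toy, not how Lucene or our module works internally: the `parse` and `score` functions are invented for this example, fuzzy matching is simplified to “within one edit”, and the score is just the sum of the boosts of the clauses that matched (real Lucene scoring is considerably more involved).

```python
import re


def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def parse(query):
    """Turn a query such as '+lockerproject singly^5' into clause dicts."""
    clauses = []
    pattern = r"(\+?)([a-z0-9]+)(~?)(?:\^(\d+(?:\.\d+)?))?"
    for token in query.lower().split():
        m = re.fullmatch(pattern, token)
        if not m:
            continue  # this toy parser skips anything it does not understand
        required, term, fuzzy, boost = m.groups()
        clauses.append({
            "term": term,
            "required": bool(required),
            "fuzzy": bool(fuzzy),
            "boost": float(boost or 1),
        })
    return clauses


def score(text, query):
    """Return a toy score for one record, or None if it does not match."""
    terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    total = 0.0
    for clause in parse(query):
        if clause["fuzzy"]:
            # Simplifying assumption: "fuzzy" means at most one edit away.
            hit = any(edit_distance(clause["term"], t) <= 1 for t in terms)
        else:
            hit = clause["term"] in terms
        if hit:
            total += clause["boost"]
        elif clause["required"]:
            return None  # a '+' clause that is missing rules the record out
    return total if total > 0 else None
```

    With this sketch, a record mentioning both “lockerproject” and “singly” outscores one that only mentions “lockerproject”, and a record with only “singly” is dropped because the required term is missing.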

    What’s next?

    Well, this is where we need to add functionality to CLucene that it did not already have.  Specifically, we need to be able to add all the different fields that are internal to Locker records.  For instance, for a record representing one of your contacts, you should be able to search for first name separately from last name.  Right now our Node.js CLucene module does not support that.
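
    As a sketch of what per-field search could look like, the Python below keys the index on a (field, term) pair instead of on the term alone, and accepts queries in the `field:term` form that Lucene’s query syntax uses.  The `FieldedIndex` class, the field names `first_name` and `last_name`, and the sample contacts are assumptions made for this example; they are not the actual Locker record schema or the API of our module.

```python
from collections import defaultdict


class FieldedIndex:
    """Hypothetical index that remembers which field each term came from."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of record ids

    def add(self, record_id, fields):
        # Index each field separately so "eric" in first_name and "eric" in
        # last_name end up under different keys.
        for field, value in fields.items():
            for term in value.lower().split():
                self.postings[(field, term)].add(record_id)

    def search(self, query):
        # Accepts a single "field:term" clause, e.g. "first_name:eric".
        field, _, term = query.partition(":")
        return set(self.postings.get((field, term.lower()), set()))


# Made-up contact records.
contacts = FieldedIndex()
contacts.add("c1", {"first_name": "Eric", "last_name": "Jennings"})
contacts.add("c2", {"first_name": "Jennifer", "last_name": "Eric"})
```

    Searching `first_name:eric` and `last_name:eric` now return different contacts, which is exactly the distinction a single flat text field cannot make.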

    We also want to add back the ability to update existing records, so your data is always searchable and retrievable, regardless of how many times you resync your connectors.  Being able to search by distance from a geo location, such as a Foursquare checkin, is very high on our list as well.
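
    Here is a small Python sketch of both ideas, under loud assumptions: `GeoStore` is a made-up class, records are keyed by an ID so that a resync overwrites rather than duplicates, and distance is computed with the standard haversine formula over every record.  A real index would not scan all records like this; the sketch only shows the behavior we are after.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


class GeoStore:
    """Hypothetical record store: update-by-id plus a radius search."""

    def __init__(self):
        self.records = {}  # record id -> (lat, lon, text)

    def upsert(self, record_id, lat, lon, text):
        # Re-adding an existing id replaces the old record, so repeated
        # connector syncs never leave stale duplicates behind.
        self.records[record_id] = (lat, lon, text)

    def within_km(self, lat, lon, radius_km):
        # Brute-force scan, nearest first (fine for a sketch, not for scale).
        hits = [(haversine_km(lat, lon, rlat, rlon), rid)
                for rid, (rlat, rlon, _) in self.records.items()]
        return [rid for dist, rid in sorted(hits) if dist <= radius_km]
```

    Usage would look like `store.upsert("checkin-1", 45.5283, -122.6630, "OSCON")` followed by `store.within_km(45.52, -122.68, 10)`; the coordinates here are illustrative.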

    We’re working hard to bring useful features to Lockers across the web, and search is going to be a core part of that.  We are always happy to hear your thoughts or answer any questions you may have.  We are regularly available in the Freenode IRC room #lockerproject, as well as via e-mail and GitHub.  We welcome your feedback!

     
  • ethomjenn 5:31 pm on August 3, 2011
    Tags: conference, OSCON

    Thoughts Around the Locker Project and OSCON 

    Last week some of the Singly crew headed to Portland, Oregon for O’Reilly’s OSCON.  Our primary goal was to get Lockers in the hands of as many people as possible, by handing out USB thumbdrives with instances of the Locker codebase on them.  Also, Jeremie’s talk on the Locker Project was on our agenda.  However, there were a couple of other things that got us excited about heading forward.

    The Concept of Xen micro instances for running Lockers:

    We spoke with Jeremy Fitzhardinge from XenSource/Citrix about running Lockers in micro instances of Xen.  After a lot of discussion, it sounds at least feasible in theory to use very small Xen instances.  We also discussed the possibility of running Xen inside of Xen (to run Xen instances on a platform such as AWS), but it appears that this doesn’t perform well, understandably.

    Plug Computers:

    I attended a session on the current state of small “plug” computers, defined as computers small enough to simply plug into a wall.  These usually range from very small thumb-drive sizes up to a large wall wart or slightly larger.  The DreamPlug was one being demoed, and could be very interesting as a platform for a Locker appliance.  The good news is that, besides its various networking capabilities, the DreamPlug also has an eSATA port, meaning it’s simple to plug in a large, high-performance hard drive for lots of storage.  Storage tends to be one of the weakest aspects of plug computers, and being able to add storage as necessary makes this very interesting.  Its price, when released, is said to be around $99 USD.

    Here are the DreamPlug specs:

    • CPU – Marvell Kirkwood 88F6281 @ 1.2GHz speed
    • Linux 2.6.3x Kernel
    • 512MB 16bit DDR2-800 MHz
    • 2MB SPI NOR Flash for uboot
    • 2 GB on board micro-SD for kernel and root file system
    • 2 x Gigabit Ethernet 10/100/1000 Mbps
    • 2 x USB 2.0 ports (Host)
    • 1 x eSATA 2.0 port -3Gbps SATAII
    • 1 x SD socket for user expansion/application
    • WiFi 802.11 b/g
    • Bluetooth 2.1 + EDR
    • Audio Interfaces
    • 5V3A DC power supply

    Another plug computer that came across the radar is the Raspberry Pi.  This is a tiny computer, yet it is ARM-based and can run Linux.  It does not have a lot of storage support other than SD cards, but could be extremely flexible in being able to run one or more connectors and stream data back to a central Locker instance.  I can see this being used with the Arduino stack to get cheap, realtime sensor data from the environment and into Lockers.  Its price is $25.  (Cheaper than the Arduino Uno dev board!)

    Here are the Raspberry Pi specs:

    • 700MHz ARM11
    • 128MB of SDRAM
    • OpenGL ES 2.0
    • 1080p30 H.264 high-profile decode
    • Composite and HDMI video output
    • USB 2.0
    • SD/MMC/SDIO memory card slot
    • General-purpose I/O
    • Open software (Ubuntu, Iceweasel, KOffice, Python)

    Handling Locker Storage Safely and Quickly:

    How to safely scale and store Locker data has been an ongoing discussion within the Locker Project.  Several options are on the table, and a lot depends on the requirements at hand.  For instance, for Singly-hosted Lockers, we will need something that we can host on third-party infrastructure we cannot fully trust, such as AWS.  For this, something like Tahoe-LAFS looks like a great contender.

    Other requirements that span both Singly-hosted Lockers and self-hosted Lockers are access authorization, proof of who changed which data, and versioning of previously changed data.

    So it was interesting to meet up with Brad Fitzpatrick, from the Memcached/Danga/LiveJournal world, and chat about his new project, Camlistore.  He gave us a very quick rundown of its various features.  One of note is that any change made to the filestore is signed with a key, which allows easy confirmation of who changed what and when.  As it was described to me, it sounds like this same functionality also allows only approved users to view sets and/or subsets of the data.  Lastly, its versioning support could help us with our versioning issues as well.

    One use case I found quite compelling was Brad discussing the possibility for a Camlistore (and by extension, a Locker) user to provide a subset of data that other people can write to.  For instance, you could provide a set of photos that you took while attending a party and give write access to those particular photos to a group of friends who were also at that party.  Those other users could then add data (such as comments or tags) to your dataset, all using the Camlistore functionality.  This peer-enrichment idea is super exciting to me.

    Locker-wide Search:

    I met up with Tyler Gillies, who wrote the node-lucene module for Node.js.  This module wraps the CLucene library and exposes it to a Node instance.  CLucene, for those who aren’t familiar with it, is a C++ port of the Java Lucene library, itself a very mature and powerful information retrieval library.  Internally, Singly has a Locker-wide search application already running, but it’s a prototype and requires Elasticsearch to run.  Elasticsearch is amazingly powerful, but it is also very resource-intensive.  We need something smaller, leaner, and faster, and CLucene fits the bill well.

    On our last night there, Singly put on a Locker Birds of a Feather session where a bunch of us hacked on Locker things together.  I was able to work with Tyler to get up to speed on the state of node-lucene, and to begin contributing back to it with some more features we’ll need for full Locker search.  I’m super excited to get advanced search capabilities in the Locker soon, such as the following:

    - Find all of my contacts from the contacts collection that have an e-mail address containing the term “singly.com”

    - Find all of my geographic data from any connector or collection for the user “Eric Jennings”

    - Find that link I visited recently that talked about the Google V8 garbage collection method (I can’t remember whether I read it on my phone or on one of my machines)

    If we do search right, any of these types of search queries will be available to any application, collection, or connector.

    That’s it for the OSCON trip.  Several of us were able to catch up with people we hadn’t seen in years, or had never met other than in IRC.  For those we met, it was great meeting you!  And for those who weren’t able to go, we hope to meet up soon!

     
    • jsgf 9:06 pm on August 4, 2011

      “We also discussed the possibility of running Xen inside of Xen (to run Xen instances on a platform such as AWS), but it appears that this doesn’t perform well, understandably.”

      The problem is more that I don’t think AWS supports the right virtualization modes to even make it possible.

      But if you want to host on EC2 instances, then you can just use an instance outright, without needing to worry about nesting. Like any of the hosting options, it has its ups and downs, of course.

    • Duane 9:03 pm on March 27, 2012

      The Tonido Plug 2 also looks like a great platform. It has basic file search on top of a shared filesystem built in, plus the ability to add “apps” that work on top of the data.

      http://www.tonidoplug.com/tonido_plug.html
