Using Thinking Sphinx

I recently wanted to add full-text search to an application. I've used Lucene, Solr, and a few others in past lives, but this time I wanted something just as functional but a little more lightweight. After looking around I settled on Sphinx, and so far it's worked great. By itself, Sphinx is not hard to use, but since I'm in Rails, I figured someone must have a gem or plugin for it. Sure enough, I found Thinking Sphinx, which makes it really simple. Let's get things installed.

To install Sphinx on Linux (see the Sphinx documentation for other platforms):

  1. Download Sphinx 0.9.8
  2. tar xzvf sphinx-0.9.8.tar.gz
  3. cd sphinx-0.9.8
  4. ./configure
  5. make
  6. sudo make install

To install Thinking Sphinx:

First, install the gem. There is a plugin available, but I prefer the gem.

sudo gem install freelancing-god-thinking-sphinx \
  --source http://gems.github.com

Add to your config/environment.rb:

config.gem(
  'freelancing-god-thinking-sphinx',
  :lib         => 'thinking_sphinx',
  :version     => '1.1.12'
)

Finally, to make all the rake tasks available to your app, add the following to your Rakefile:

require 'thinking_sphinx/tasks'

Now we need to use it, but first a brief introduction to some Sphinx terms is necessary. Sphinx builds an index from fields and attributes. Fields are the actual content of your search index, and they are always strings. If you want to find content by keyword, it must be a field. Attributes are part of the index too, but they are used only for sorting, grouping, and filtering. Attributes are ignored for keyword searches, but they are very powerful when you want to limit a search. Unlike fields, attributes support multiple types: integers, floats, datetimes (stored as Unix timestamps, and thus integers anyway), booleans, and strings. Take note that string attributes are converted to ordinal integers, which is useful for sorting but not much else.
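To make the field/attribute split concrete, here is a toy in-memory model in plain Ruby. It is not Sphinx or Thinking Sphinx code, and ToyIndex, its methods, and the sample companies are hypothetical names I made up for illustration. It only shows the idea: keywords match against fields, while attributes are used for filtering and sorting.

```ruby
# Toy in-memory model of the field/attribute split. This is NOT Sphinx or
# Thinking Sphinx code; ToyIndex and its methods are hypothetical names.
class ToyIndex
  def initialize
    @docs = {}
  end

  # fields:     hash of strings that are tokenized and keyword-searchable.
  # attributes: hash of typed values used only for filtering and sorting.
  def add(id, fields, attributes)
    words = fields.values.flat_map { |text| text.downcase.scan(/\w+/) }
    @docs[id] = { :words => words.uniq, :attributes => attributes }
  end

  # Keywords are matched against fields only, :with filters on attributes,
  # and :order sorts the matching ids by an attribute.
  def search(query, options = {})
    terms = query.downcase.scan(/\w+/)
    with  = options[:with] || {}
    hits  = @docs.select do |_id, doc|
      terms.all? { |term| doc[:words].include?(term) } &&
        with.all? { |name, value| doc[:attributes][name] == value }
    end
    ids = hits.keys
    ids = ids.sort_by { |id| @docs[id][:attributes][options[:order]] } if options[:order]
    ids
  end
end

index = ToyIndex.new
index.add(1, { :name => "Acme Computers",   :city => "Austin" },
             { :fortune_rank => 300, :employee_bucket => 2 })
index.add(2, { :name => "Globex Computers", :city => "Boston" },
             { :fortune_rank => 120, :employee_bucket => 2 })
index.add(3, { :name => "Initech",          :city => "Austin" },
             { :fortune_rank => 450, :employee_bucket => 1 })

index.search("computers")                                   # => [1, 2]
index.search("computers", :order => :fortune_rank)          # => [2, 1]
index.search("austin", :with => { :employee_bucket => 2 })  # => [1]
index.search("300")                                         # => [] (attributes are not keyword-searchable)
```

The last call is the important one: 300 is an attribute value, so a keyword search never sees it.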

Thinking Sphinx adds the ability to index any of your models. To set up an index, you simply add a define_index block. For example:

class Company < ActiveRecord::Base
  define_index do
    indexes [:name, sym], :as => :name, :sortable => true
    indexes description
    indexes city
    indexes state
    indexes country
    indexes area_code
    indexes url
    indexes [industry1, industry2, industry3], :as => :industry
    indexes [subindustry1, subindustry2, subindustry3], :as => :subindustry

    has fortune_rank, created_at, updated_at, vendor_updated_at, employee_bucket, revenue_bucket
    has "reviewed_at IS NULL", :as => :unreviewed, :type => :boolean

    set_property :delta => WorklingDelta
  end
end

Most of this should be pretty self-explanatory. To index content (fields), you use the "indexes" method. As you can see, you can build compound fields by passing an array. Note that :name and :id must be written as symbols or Thinking Sphinx will get confused. You can also use some SQL in your indexes statement.

To add attributes, you use the "has" method. Thinking Sphinx is pretty good about determining the type of an attribute, but sometimes you need to tell it explicitly using :type.

I will explain the set_property :delta => WorklingDelta later.

To build your index, simply run:

rake thinking_sphinx:index

After processing each model, you will see a message like the one below. Ignore it. Everything is working fine. Really.

distributed index 'company' can not be directly indexed; skipping.

If you have made structural changes to your index (anything other than adding new data to the database tables), you'll need to stop Sphinx, re-index, and then restart Sphinx. All of that can be done with a single rake call:

rake thinking_sphinx:rebuild

Once you have your index set up, searching is really easy.

Company.search "International Business Machines"

This will perform a keyword search across all the indexed fields for Company. If you want to limit your search to a specific field, use :conditions.

Company.search :conditions => { :description => "computers" }

To filter on your attributes, use :with.

Company.search :conditions => { :description => "computers" },
               :with       => { :employee_bucket => 2 }

The :with option also accepts arrays and ranges. See the Thinking Sphinx documentation for more information.
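In Thinking Sphinx itself, an array or range is simply passed as the value in the :with hash, in the same way as the 2 in the example above. To show what I understand the semantics to be, here is a small plain-Ruby sketch. The attribute_matches? and passes_filters? helpers are hypothetical and are not part of Thinking Sphinx. The sketch assumes an array means "any of these values" and a range means "anywhere between these bounds, inclusive".

```ruby
# Hypothetical helpers that mimic how a :with filter value could be
# interpreted. This is an illustration, not Thinking Sphinx internals.
def attribute_matches?(filter, value)
  case filter
  when Range then filter.cover?(value)    # value lies within the bounds
  when Array then filter.include?(value)  # value is one of the listed values
  else            filter == value         # exact match on a single value
  end
end

# A record passes only if every filtered attribute matches.
def passes_filters?(attributes, with)
  with.all? { |name, filter| attribute_matches?(filter, attributes[name]) }
end

company = { :employee_bucket => 2, :fortune_rank => 120 }

passes_filters?(company, { :employee_bucket => 2 })        # => true
passes_filters?(company, { :employee_bucket => [1, 2] })   # => true
passes_filters?(company, { :fortune_rank => 1..100 })      # => false
passes_filters?(company, { :employee_bucket => [1, 2],
                           :fortune_rank    => 100..500 }) # => true
```

I used cover? for ranges so the same helper also works for time ranges, which cannot be iterated.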

Back to the set_property above. One issue with Sphinx compared to Solr or Lucene is that the Sphinx index is fixed. If you update your model, the change will not be reflected in the index until you rebuild the entire index. To get around this, Sphinx supports delta indexes. A delta index allows you to make a change and have it show up in searches without rebuilding the entire index. That said, rebuilding an index is not a big deal with Sphinx. For example, I can rebuild the Company index defined here (1.6 million records) in under 2 minutes.
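If the core-plus-delta idea is new to you, this toy model in plain Ruby may help. It is only a sketch of the concept, not how Sphinx stores anything, and ToyDeltaIndex and its method names are hypothetical. Changed records go into a small delta index, the stale copy in the core index is flagged as deleted, and a search consults both.

```ruby
# Toy model of a core index plus a delta index. NOT Sphinx code; the class
# and method names are hypothetical.
class ToyDeltaIndex
  def initialize(records)
    @records = records.dup   # id => text, stands in for the database table
    rebuild
  end

  # Full rebuild: snapshot every record into the core index and empty the
  # delta index and the deleted flags.
  def rebuild
    @core    = @records.dup
    @delta   = {}
    @deleted = {}
  end

  # Called after a record is saved: index it in the small delta only, and
  # flag any old core copy as deleted so it stops matching.
  def record_saved(id, text)
    @records[id] = text
    @delta[id]   = text
    @deleted[id] = true if @core.key?(id)
  end

  # A search consults both indexes and skips flagged core entries.
  def search(term)
    core_hits  = @core.select  { |id, text| !@deleted[id] && text.include?(term) }.keys
    delta_hits = @delta.select { |_id, text| text.include?(term) }.keys
    (core_hits + delta_hits).sort
  end
end

index = ToyDeltaIndex.new({ 1 => "acme computers", 2 => "globex shipping" })

index.record_saved(2, "globex computers")   # update an existing record
index.search("computers")                   # => [1, 2]
index.search("shipping")                    # => [] (stale core copy is flagged)

index.record_saved(3, "initech computers")  # brand new record
index.search("computers")                   # => [1, 2, 3]

index.rebuild                               # fold everything back into core
index.search("computers")                   # => [1, 2, 3]
```

The full rebuild is still needed from time to time, since the delta keeps growing until everything is folded back into the core index.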

What does set_property :delta => WorklingDelta do? It adds an after_save callback to your model that uses WorklingDelta to perform the delta index step. Given that Workling is in the name, you're probably guessing that I hooked this up to Workling so that delta indexing happens asynchronously.

Add lib/workling_delta.rb:

class WorklingDelta < ThinkingSphinx::Deltas::DefaultDelta
  def index(model, instance = nil)
    return true unless ThinkingSphinx.updates_enabled? && ThinkingSphinx.deltas_enabled?
    return true if instance && !toggled(instance)

    doc_id = instance ? instance.sphinx_document_id : nil
    WorklingDeltaWorker.asynch_index(
      :delta_index_name => delta_index_name(model),
      :core_index_name  => core_index_name(model),
      :document_id      => doc_id
    )

    return true
  end
end

Add app/workers/workling_delta_worker.rb:

class WorklingDeltaWorker < Workling::Base
  def index(options = {})
    logger.info("WorklingDeltaWorker#index: #{options.inspect}")
    ThinkingSphinx::Deltas::DeltaJob.new(options[:delta_index_name]).perform
    if options[:document_id]
      ThinkingSphinx::Deltas::FlagAsDeletedJob.new(options[:core_index_name], options[:document_id]).perform
    end

    return true
  end
end

Now, whenever a Company object is created, updated, or destroyed, the WorklingDeltaWorker will be called to update the delta index.

If you need to perform powerful searches over hundreds of thousands (or even millions) of records, give Sphinx and Thinking Sphinx a try. There are some minor feature omissions, but I think the trade-offs more than make up for them for most applications. BTW, scale is not one of the omissions. The largest Sphinx installation, boardreader.com, uses Sphinx to index over 2 billion records. Craigslist.org is probably the busiest, with 50 million queries per day.