I’m using the Thinking Sphinx gem, and my queries are taking about 45 seconds to complete (13 million records; the folder containing the indexes is 1.1 GB). I’m assuming I have something configured incorrectly (I’m a first-time Sphinx user). Let me know if you see anything that looks amiss. Here is my index definition:
define_index do
  indexes :name
  indexes :summary
  indexes :tag_list
  indexes categories.name, :as => :category_name
  has "RADIANS(lat)", :as => :latitude, :type => :float
  has "RADIANS(lng)", :as => :longitude, :type => :float
  set_property :field_weights => {
    :name => 8,
    :summary => 6,
    :category_name => 5,
    :tag_list => 3
  }
  set_property :delta => ThinkingSphinx::Deltas::ResqueDelta
  set_property :ignore_chars => %w(' -)
end
Here is an example query:
Location.search('Restaurant',
  :geo => [0.5837843098436726, -1.9560609568879357],
  :latitude_attr => "latitude",
  :longitude_attr => "longitude",
  :with => {"@geodist" => 0.0..4000.0},
  :include => :categories,
  :page => 1,
  :per_page => 100)
My log shows:
Sphinx Query (43066.3ms) restaurant
Sphinx Found 467 results
I’ll keep digging through the docs and trying stuff!
UPDATE: my development.sphinx.conf
indexer
{
}
searchd
{
listen = 127.0.0.1:9312
log = /project_path/log/searchd.log
query_log = /project_path/log/searchd.query.log
pid_file = /project_path/log/searchd.development.pid
}
source location_core_0
{
type = pgsql
sql_host = localhost
sql_user = user
sql_pass = pass
sql_db = db_name
sql_query_pre = UPDATE "business_entities" SET "delta" = FALSE WHERE "delta" = TRUE
sql_query_pre = SET TIME ZONE 'UTC'
sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = FALSE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = FALSE
sql_attr_uint = sphinx_internal_id
sql_attr_uint = sphinx_deleted
sql_attr_uint = class_crc
sql_attr_float = latitude
sql_attr_float = longitude
sql_attr_string = sphinx_internal_class
sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}
index location_core
{
source = location_core_0
path = /project_path/db/sphinx/development/location_core
morphology = stem_en
charset_type = utf-8
ignore_chars = ', -
enable_star = 1
}
source location_delta_0 : location_core_0
{
type = pgsql
sql_host = localhost
sql_user = user
sql_pass = pass
sql_db = db_name
sql_query_pre =
sql_query_pre = SET TIME ZONE 'UTC'
sql_query = SELECT "business_entities"."id" * 1::INT8 + 0 AS "id" , "business_entities"."name" AS "name", "business_entities"."summary" AS "summary", "business_entities"."tag_list" AS "tag_list", "business_entities"."id" AS "sphinx_internal_id", 0 AS "sphinx_deleted", CASE COALESCE("business_entities"."type", '') WHEN 'Location' THEN 2817059741 WHEN 'Group' THEN 2885774273 WHEN 'BraintreeBusiness' THEN 28779289 WHEN 'InvoicedBusiness' THEN 1440117572 ELSE 2817059741 END AS "class_crc", COALESCE("business_entities"."type", '') AS "sphinx_internal_class", RADIANS(lat) AS "latitude", RADIANS(lng) AS "longitude" FROM "business_entities" WHERE ("business_entities"."type" = 'Location') AND ("business_entities"."id" >= $start AND "business_entities"."id" <= $end AND "business_entities"."delta" = TRUE AND "business_entities"."type" = 'Location') GROUP BY "business_entities"."id", "business_entities"."name", "business_entities"."summary", "business_entities"."tag_list", "business_entities"."id", "business_entities"."type"
sql_query_range = SELECT COALESCE(MIN("id"), 1::bigint), COALESCE(MAX("id"), 1::bigint) FROM "business_entities" WHERE "business_entities"."delta" = TRUE
sql_attr_uint = sphinx_internal_id
sql_attr_uint = sphinx_deleted
sql_attr_uint = class_crc
sql_attr_float = latitude
sql_attr_float = longitude
sql_attr_string = sphinx_internal_class
sql_query_info = SELECT * FROM "business_entities" WHERE "id" = (($id - 0) / 1)
}
index location_delta : location_core
{
source = location_delta_0
path = /project_path/db/sphinx/development/location_delta
}
index location
{
type = distributed
local = location_delta
local = location_core
}
I found my problem. The records happen to be in an STI table, but I only want to index those of type Location (Location doesn’t have any descendants). Of the 13 million records in this table, 99.99984% (seriously) are of type Location. The SELECT DISTINCT type FROM business_entities query was taking far too long (even with an index). The tricky part was noticing this, because the log reported the Sphinx query as lasting 84 seconds when it was really the predatory SQL queries that were the problem.
So I monkey-patched Thinking Sphinx in an initializer to return the only type I care about:
https://gist.github.com/1603565
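The gist itself isn’t reproduced here. As a rough sketch of the general monkey-patching pattern only, the Ruby below reopens a class and replaces the slow type lookup with a hard-coded answer. Everything in it is an assumption for illustration: HypotheticalSphinx::Source, sti_types and types_to_crcs are made-up stand-ins, not Thinking Sphinx’s real classes or methods, and Zlib.crc32 is just a convenient checksum for the demo.

```ruby
require "zlib"

# Hypothetical stand-in for a library class that discovers STI types.
# This is NOT Thinking Sphinx's real code; all names are illustrative.
module HypotheticalSphinx
  class Source
    # Pretend this issues: SELECT DISTINCT type FROM business_entities
    # (the query that was slow on 13 million rows).
    def sti_types
      raise "slow query: SELECT DISTINCT type FROM business_entities"
    end

    # Pairs each type name with a CRC32 checksum of that name.
    def types_to_crcs
      sti_types.map { |type| [type, Zlib.crc32(type)] }
    end
  end
end

# The patch, as it might sit in a Rails initializer: reopen the class and
# return the one type we care about instead of asking the database.
module HypotheticalSphinx
  class Source
    def sti_types
      ["Location"]
    end
  end
end

pairs = HypotheticalSphinx::Source.new.types_to_crcs
puts pairs.inspect
```

The trade-off with a patch like this is that it silently ignores any other type in the table, so it only makes sense when, as here, a single type is being indexed and it has no descendants.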