I have a Ruby on Rails 3 Heroku application that needs to perform text search on a few models. Each model has a large dataset, and that dataset is expected to grow considerably.
I want to be able to do fast text search on columns like title and description. Simple queries, such as "give me all Articles having “postgresql” (case insensitive) in their title or body". I need multilingual capability too.
Currently, my DB is not being used in production, and I’m using the Ronin plan, which gives a dedicated db using PostgreSQL.
In order to do that, I decided to go with a plugin called texticle. That plugin allows full-text search using PostgreSQL's built-in capability. However, it did not work smoothly, so I decided to build full-text indexes.
I ran the following query on a table with 15 million entries. 20 hours later, it is still running.
create index on articles using gin(to_tsvector('english', title));
My questions:
1- Is it normal for this index to take so long to build?
2- Is there any way to find out the status of that index build? It doesn’t show up yet in my index usage table.
3- What about my approach? Am I looking at this the wrong way? Would you have other recommendations? I would like to keep my budget low for now, but be able to easily migrate to an effective, scalable, production-quality solution when the need arises.
Thanks
No.
This is on my postgres 9.0 server which runs on single-core AMD Athlon 64 3700+:
As you can see, building a GIN index on 15 million rows took 340 seconds (BTW, the table size was 977 MB and the index size was 319 MB).
Turning text documents into tsvector and building a GIN (or GIST) index is CPU-intensive.
I don’t know the exact specs of Heroku Ronin in terms of CPU power. Can you tell us what it compares to?
Performance of index building is also very sensitive to the maintenance_work_mem setting. The memory needed (and the size of the index) depends on the input data and might be anywhere from 20% to 150% of the input data size.
Unfortunately, no. PostgreSQL does not have this kind of “introspection”.
You could create the same index on a 10% sample and multiply to estimate.
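A rough sketch of that sampling idea, which also raises maintenance_work_mem for the current session. The table name articles_sample, the index name, and the 512MB value are assumptions for illustration only; pick a memory value that fits your server. GIN build time does not necessarily scale linearly with row count, so treat the multiplied result as a ballpark figure.

```sql
-- Give this session more memory for index builds.
-- The value is an assumption; size it to the RAM actually available.
SET maintenance_work_mem = '512MB';

-- Copy roughly 10% of the rows into a scratch table (hypothetical name).
CREATE TABLE articles_sample AS
SELECT title
FROM articles
WHERE random() < 0.1;

-- Build the same kind of index on the sample and note how long it takes
-- (for example, with \timing switched on in psql).
CREATE INDEX articles_sample_title_fts
    ON articles_sample
    USING gin (to_tsvector('english', title));

-- Remove the scratch table (and its index) afterwards.
DROP TABLE articles_sample;
```

Multiplying the measured time by about 10 gives an estimate for the full table.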
Nothing bad. It is OK: since PostgreSQL has built-in FTS, it is at least a good place to begin.
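For reference, here is a minimal sketch of a query that should be able to use the expression index from the question. It assumes the table has an id column (the Rails default). The expression in the WHERE clause has to match the indexed expression, including the 'english' configuration, for the planner to consider the index.

```sql
-- Find articles whose title contains the word "postgresql".
-- Matching is case-insensitive because to_tsvector normalizes the text.
SELECT id, title
FROM articles
WHERE to_tsvector('english', title) @@ plainto_tsquery('english', 'postgresql');
```

plainto_tsquery is used here because it accepts plain user input; to_tsquery would work as well if you need operators such as & or |. For other languages, a different text search configuration name would go in place of 'english', with a matching index.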
But if you need a faster solution (both indexing time and search speed), the only way is to go outside the database. External solutions like Sphinx or Lucene are faster (10x in my experience).