During my studies of computer science I came accross some functional languages like Prolog but now I have only been doing imperative stuff like C#, Ruby JavaScript and Java for the past 10 years. Currently I am creating a full text search engine for an online shop and I have come quite far the “imperative way” already. But having stumbled accross some functional languages like Haskell of Clojure it became clear that the functional paradigm fits so much better and that the imperative way is just not the right tool for this job.
So we have a full text index of some 10 million records. Each record basically contains a word occurrence, along with the id and text position from the record of which it originates.
When the user enters a search string it is parsed into an expression tree. For example the search string “transformer 100 W” results in something like
AND('transformer%', OR(NEAR('100', 'W'), NEAR('100', 'watts'), '100W', '0.1kW'))
There is some extra “intelligence” going on here, but that is of no concern for this question.
The expression tree is then evaluated recursively and results in a couple of sql queries that could return up to 100,000 rows in the form of .NET-DataTables. These are then read into sets or dictionaries and, depending on the predicates, intersections and unions are applied in order to find all results that match the entire search expression. For the NEAR evaluation the position indexes of the found occurrences are compared as well. But this is all done imperatively, with a lot of for-loops.
Additionally there is a ranking function that adds up scores of the found word occurrences. Words that are only found as prefixes or with fuzzy matching (done by the database server) get lower scores than precise matches.
For each resulting item I also need to get a list of all word occurrences that were matched, in order to highlight these words in the result pages.
So roughly the evaluation algorithm is a function like
expression tree, full text index ->
resulting items that match the expressin tree,
each with a ranking sum
and a list of all found word occurrences for this item
I am just giving a rough overview here, but I hope you get enough of a picture.
Now the “real world” constraints:
- The whole application (up to now) is written in C#, so an easy integration with .NET is paramount.
- Loads of data is read into .NET-DataTables and will then need to be evaluated and transformed. The results should be contained in .NET types (Dictionaries, Sets, Arrays, whatever…).
- Performance is of great importance. At present my algorithm often takes two seconds for a search (not counting the sql), which is kind of ok, but should be improved. Our server has 16 processors, so parallel processing would be welcome. Since we get about one search request per second and the current implementation is single threaded, processor time is still available.
- The language (and the compiler) should be mature.
Since I need to stay with .NET, I was looking into Clojure-CLR, F# and Scala for .NET.
I like the concepts of Clojure a lot, but right now I can not assess, whether it would be up to the job. Reading about F# gave me mixed feelings, since it seems to want to be able to do just about everything, whereas I would tend to a more “pure” mathematical approach for the given task. But maybe that is possible with F# as well and I am not yet aware of it. I haven’t delved into Scala much yet, but it seems to be well established.
Any insights would be welcome!
I’m curious why you’re not considering LINQ as an option. It seems to satisfy all your criteria. Note I have no experience with Scala, so I can’t comment on that.
Here, LINQ > F# > Clojure-CLR. If everything is already in C#, LINQ is going to be the easiest to integrate. Visual Studio support for things like intellisense and function definition navigation seems much better in a C#-only program. Calling Clojure from C# can be horrendous—in theory it should work OK, but in practice, be prepared to spend weeks figuring out why things aren’t working the way you’d expect. It’s really designed to be the top level thing; you call C# from Clojure, going the opposite direction really just isn’t high on the Clojure-CLR developers’ priority list; there’s basic support, but you get what you get.
LINQ ~= F# > Clojure. I’ve read elsewhere that LINQ’s performance can be shown to be slightly better than F# for most idiomatically-written algorithms, but they’re close enough that it shouldn’t matter. PLINQ makes parallelism easy. Clojure-CLR has mega-slow startup time, and the runtime overhead slows things down too.
LINQ >= F# > Clojure. Not to say F# is immature at all, but Visual Studio support lags behind a bit, and there’s much more production code in the world (and a lot more stack overflow answers) based on LINQ than F#.
None of the languages are pure pure like Haskell, but in terms of how difficult it makes it to write non-pure code, I’d rank it as: LINQ > Clojure > F# > Scala. LINQ can only be made impure by calling impure methods. Clojure has refs and atoms, F# anything can be designated mutable, and Scala (per my understanding) really is just Java with functional features bolted on.
The functional feature that F# and Scala both have going for them though is language support for pattern matching. Where in C# you’d need either some kind of inheritance hierarchy or chains of b?x:y operators to do things functionally (or if/else if you’re fine with non-functional approach), pattern matching makes conditional operation on different variations of raw data types much more succinct. This maybe could be useful in your calculation of exact vs prefix vs fuzzy match rankings, however a b?x:y chain
var alg = x.match == exact ? alg1 : x.match == prefix ? alg2 : alg3in C# would be perfectly legible in this simple case—it’s when the matching gets much more complex that language integrated pattern matching becomes more valuable.Interestingly, I think the one aspect of your toolkit where F# would prove more useful than LINQ isn’t the querying, which the name of LINQ itself should indicate it can handle, but the parsing of your search string into an expression tree. This is one area where functional languages and pattern matching really excels, and add in tools like FsLex and FsYacc can give you a big head start.
All that said, I think the decision comes down to where you’re wanting to go. If you just want to clean up your search algorithms and be done with it, I’d advise the LINQ approach. But if you’re wanting to piece-by-piece get into a more functional-oriented style for the whole program (and your company is willing to pay for the time you’re committing to it), then do maybe look at the F# option. Either way I’d do the LINQ option first, as that will likely be more straightforward to you, and help guide your F# to be more functionally idiomatic once you start down that path.
Simplistically, here’s what you’d want, just fill in your functions for your Near and Equal fetchers, and your GetRank and GetStrings functions, and use the below
like this:
This gives you both simple parallelizability, and laziness so that the
GetStringsInfunction is only executed on the records you take (in this case the top 30). (Note theANDselector can be simplified using any of theIntersectAllexamples here).