I have a project where I need to build and store large trees of data in Ruby. I am considering different approaches for serialization, deserialization and querying of trees, and I am wondering what would be the best way to go. My major constraints are read time, query efficiency and and cross-version/cross-platform compatibility. The most frequent operation is to retrieve sets of nodes based on a combination of id/value and/or feature(s).Trees can be up to 15-20 levels deep. Moving subtrees is an uncommon procedure, but should be possible without too much black magic. Rails integration is not a primary concern. The options I thought about, along with some issues I’m concerned about, are the following:
- Marshal the trees, and when needed load them into memory and query them in Ruby (inefficiency as tree grows, cross-version compatibility?)
- Same as above, but use YAML (more cross-version compatible, but less efficient?)
- Same as above, but use a custom XML parser (need to recreate objects from scratch each time the tree is loaded?)
- Serialize the trees to XML, store them in an XML database (e.g. Sedna) and use XPath to query the trees (no experience with this approach, not sure about efficiency?)
- Use adjacency lists to query trees stored in an schema-less database (inefficiency when counting descendants?)
- Use materialized paths (potential of overfilling the max string length for deep trees?)
- Use nested sets (complex SQL queries?)
- Use the array of ancestors approach? Seems interesting in terms of querying efficiency according to the MongoDB page, but I haven’t been able to find any serious discussion of this algorithm.
Based on your experience, which approach would better fit with the constraints I have described? If I go for an XML database, are there ones that would be more suited for this project? Are there other approaches I have overlooked that would be more efficient? Thanks for your time.
Trees work really well with graph databases, such as neo4j: http://neo4j.org/learn/
Neo4j is a graph database, storing data in the nodes and relationships of a graph. The most generic of data structures, a graph elegantly represents any kind of data, preserving the natural structure of the domain.
Ruby has a good interface for the trees:
https://github.com/andreasronge/neo4j
Pacer is a JRuby library that enables very expressive graph traversals. Pacer allows you to create, modify and traverse graphs using very fast and memory efficient stream processing. That also means that almost all processing is done in pure Java, so when it comes the usual Ruby expressiveness vs. speed problem, you can have your cake and eat it too, it’s very fast!
https://github.com/pangloss/pacer
Neography is like the neo4j.rb gem and suggested by Ron in the comments (thanks Ron!)
https://github.com/maxdemarzi/neography