I am storing data for ranking users in XML documents – one row per user – containing a 36 char key, score, rank, and username as attributes.
<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank=15></Rank>..
</Ranks>
There are several such documents which are parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within a HttpServlet which is backing a widget. Every time the widget is loaded it calls the servlet with a get request which then requires one of the documents to be queried. The queries on the documents require the following operations:
- Look up – finding a particular ID
- Iterate through each Rank element and get the id attribute
In my test environment the number of users is <100 and everything works well. However we are soon supposed to be delivering to a system with 200K+ users. I have serious concerns about the scalability of my approach – i.e. OutOfMemoryException!
I’m stuck for ideas for an implementation which balances performance and memory usage. While DOM is good for find operations it may choke because of the large size. I don’t know much about StAX, but from what I have read it seems that it might solve the memory issue but could really slow down the queries as I will have to effectively iterate through the document to find the element of interest (Is that correct?).
Questions:
- Is it possible to use StAX for multiple find (like getElementById) operations on large documents quick enough to serve an HttpRequest?
- What is the maximum file size that a DOM Parser can handle?
- Is it possible to estimate how much memory per user would be used for an XML document with the above structure?
Thanks
Edit: I am not allowed to use databases.
Edit: Would it be better/neater to use a custom formatted file instead and use Regular expressions to search the file for the required entry?
It sounds like you’re using the xml document as a database. I think you’ll be much happier using a proper database for this, and importing/exporting to xml as needed. Several databases work well, so you might as well use one that’s well supported, like mysql or postgresql, although even sqlite will work better than xml.
In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You’re then on your own to manage memory (recording the data you see depending on the state you’re in), so you’re correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.