I’m in the process of setting up a system which will have to repeatedly parse large amounts of text (as a String or StringBuffer; which might be better?) acquired from a data source. The text will be displayed and may consist of several thousand words. Each time the text is parsed, each word may have to be checked against a list of 550 stop words, so that those words can be filtered from display.
So I wonder about performance, as this could be going on in multiple servlet sessions at any one time. Is it better to check each word against a MySQL database table (MyISAM or InnoDB) using an index? Or simply to store the 550 words in a Java array or ArrayList within the servlet context, so they can possibly be read more quickly?
So I wonder about the trade off between database IO against storing 550 strings in memory.
Any advice?
Thanks
Mr Morgan.
Assuming that the “data source” is not your database, you can get better performance by doing the stopword search in memory rather than asking the database to do it. It stands to reason:
It is also likely that you can implement a better algorithm for detecting the stopwords than a general-purpose database engine could. And the memory needed for a data structure that represents the 500 or so stopwords should be trivial compared with the space used by the rest of your application, the servlet container and all of the libraries that you use.
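
As a rough illustration of the in-memory approach, here is a minimal sketch that keeps the stopwords in a HashSet, so each lookup is a hash probe rather than a database query or a linear scan of an array. The class name StopWordFilter and its methods are hypothetical, not something from the question, and the sketch assumes that words are separated by whitespace and that matching should ignore case; punctuation handling is left out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: holds the stopwords in memory for fast lookups.
public final class StopWordFilter {
    // Populated once in the constructor and never modified afterwards.
    private final Set<String> stopWords;

    public StopWordFilter(Iterable<String> words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            // Assumption: stopword matching is case-insensitive.
            set.add(w.toLowerCase(Locale.ROOT));
        }
        this.stopWords = set;
    }

    // Average constant-time membership check.
    public boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    // Returns the words of the text, in order, with stopwords removed.
    // Assumption: words are separated by whitespace only.
    public List<String> filter(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty() && !isStopWord(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Because the set is only read after construction, a single instance could be built once (for example when the web application starts) and shared by all servlet requests, instead of loading the 550 words per session.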