In my project, I need to download an HTML page (about 50K-100K characters when read into a String, yes, quite fat) and extract some content from it using regular expressions. The extracted content is then inserted into the database.
The performance is quite bad, and I want to know why.
The code works roughly like this (multithreaded):
- use HttpComponents to download the HTML file into a String (String html)
- use regular expressions to extract the content and insert it (the database is MySQL)
Pattern p = Pattern.compile("<h.*</a></h.>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(html);
boolean result = m.find();
while (result) {
    // insert into database stuff
    // update database stuff
    result = m.find(); // advance to the next match, otherwise the loop never ends
}
The string is very long, but if I split it into pieces, some matches may be missed. This is quite disturbing.
I added some print statements and found that after each insert into the database there is a delay before the update runs, but I can't work out why, since the database connection isn't closed in between.
Try to avoid regex and use a proper HTML parser such as Jsoup; there are many to choose from. A parser may well be more efficient than a regex, or at least I would hope so.
If you do use regex, don't compile the pattern each time. You can keep the Pattern in a private static field. This isn't a huge gain in performance, just good practice.
Use connection pooling for the database. If possible, do batch inserts.
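A rough sketch of the last two points, using only the JDK. The class name HeadingExtractor, the method names, and the table "headings" with a "content" column are made up for illustration; the regex is the one from the question, unchanged. It collects all matches first and then sends them to the database as one JDBC batch in a single transaction, instead of doing an insert per match inside the find() loop.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingExtractor {
    // Compiled once and shared: a Pattern is immutable and safe to use from many threads.
    private static final Pattern HEADING =
            Pattern.compile("<h.*</a></h.>", Pattern.CASE_INSENSITIVE);

    // Collects every match so that no database work happens inside the find() loop.
    static List<String> extractHeadings(String html) {
        List<String> headings = new ArrayList<>();
        // A Matcher is NOT thread-safe, so each call creates its own.
        Matcher m = HEADING.matcher(html);
        while (m.find()) {
            headings.add(m.group());
        }
        return headings;
    }

    // Sends all rows as one batch in one transaction.
    // Assumption: a table "headings" with a text column "content" exists.
    static void insertBatch(Connection conn, List<String> headings) throws SQLException {
        String sql = "INSERT INTO headings (content) VALUES (?)";
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String h : headings) {
                ps.setString(1, h);
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } catch (SQLException e) {
            conn.rollback(); // undo the partial batch before rethrowing
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

Each worker thread would call HeadingExtractor.extractHeadings(html) and then HeadingExtractor.insertBatch(conn, headings), where conn is borrowed from whatever connection pool you use and returned to it afterwards.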