I’m building a program that will live on an AWS EC2 instance and (probably) be invoked periodically via a cron job. The program will ‘crawl’/‘poll’ specific websites that we’ve partnered with, index/aggregate their content, and update our database. I think Java is a perfect fit for this application. Some members of our engineering team are concerned about the performance cost of Java’s garbage collection and are suggesting C++ instead.
Are these valid concerns? The application will be invoked possibly once every 30 minutes via a cron job, and as long as it finishes its task within that time frame, I would assume the performance is acceptable. I’m not sure garbage collection would be a performance issue, since I would assume the server will have plenty of memory, and the actual act of tracking how many objects point to an area of memory and then declaring that memory free when the count reaches 0 doesn’t seem too detrimental to me.
No, those concerns are most likely unfounded.
GC can be a concern when dealing with large heaps and fragmented memory (which requires a stop-the-world collection), or with medium-lived objects that are promoted to the old generation and then quickly become unreachable (which causes excessive GC, but can be fixed by adjusting the ratio of new to old space).
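To see how the heap is actually split between generations on a given JVM, the standard `java.lang.management` API can list the heap memory pools. The following is a minimal sketch, not part of the original answer; the class name `HeapPools` and the method `describeHeapPools` are made up for illustration, and the pool names printed depend on which collector the JVM is using.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
public class HeapPools {

    /** Returns one line per heap memory pool: name, used and committed size in MiB. */
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // getUsage() returns null if the pool is no longer valid
            }
            lines.add(String.format("%-25s used=%d MiB committed=%d MiB",
                    pool.getName(), usage.getUsed() >> 20, usage.getCommitted() >> 20));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeHeapPools().forEach(System.out::println);
    }
}
```

On HotSpot, the new:old ratio mentioned above is typically adjusted with the `-XX:NewRatio` flag; running this sketch before and after changing it is one way to check what the JVM actually did.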
A web crawler is very unlikely to fit either of these two profiles: you probably don’t need a massive old generation, and your objects should be relatively short-lived (the in-memory page representation while you parse out data), which the young generation collector deals with efficiently.
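If the team wants numbers rather than opinions, a rough way to check this is to run a crawler-like allocation pattern and read the collectors’ own counters. The sketch below is an assumption-laden stand-in, not a real crawler: `GcProbe`, `parseFakePage`, and the 64 KiB page size are invented for illustration, and only the `GarbageCollectorMXBean` calls come from the standard library.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class; a real crawler would fetch and parse actual pages.
public class GcProbe {

    /** Sums collection count and accumulated collection time (ms) over all collectors. */
    static long[] gcTotals() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] {count, millis};
    }

    /** Stand-in for parsing one page: builds roughly 64 KiB of markup that becomes garbage on return. */
    static int parseFakePage(int pageNumber) {
        StringBuilder html = new StringBuilder(64 * 1024);
        while (html.length() < 64 * 1024) {
            html.append("<a href=\"/item/").append(pageNumber).append("\">item</a>\n");
        }
        byte[] body = html.toString().getBytes(StandardCharsets.UTF_8);
        int links = 0;
        for (byte b : body) {
            if (b == '\n') {
                links++; // each generated link sits on its own line
            }
        }
        return links;
    }

    public static void main(String[] args) {
        int pages = 5_000; // assumed workload size, chosen only to keep the run short
        long[] before = gcTotals();
        long start = System.nanoTime();
        long links = 0;
        for (int page = 0; page < pages; page++) {
            links += parseFakePage(page);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long[] after = gcTotals();
        System.out.printf("pages=%d links=%d elapsed=%d ms collections=%d gcTime=%d ms%n",
                pages, links, elapsedMs, after[0] - before[0], after[1] - before[1]);
    }
}
```

Comparing the reported GC time with the total elapsed time gives a first estimate of how much of the run is spent collecting. The same counters can be logged from the real crawler at the end of each cron run.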
We have an in-house crawler (Java) that happily handles 2 million pages per day, including some additional post-processing per page, on commodity hardware (2 GB RAM). The main constraint is bandwidth; GC is a non-issue.
As others have mentioned, GC is rarely an issue for throughput-sensitive applications (such as a crawler), but it can (if one is not careful) be an issue for latency-sensitive apps (such as a trading platform).