I’m having a system that collects real-time Apache log data from about 90-100 Web Servers. I had also defined some url patterns.
Now I want to build another system that updates the time of occurrence of each pattern based on those logs.
I had thought about using MySQL to store statistic data, update them by statement:
“Update table set count=count+1 where ….“,
but i’m afraid that MySQL will be slow for data from such amount of servers. Moreover, I’m looking for some database/storage solutions that more scalable and simple. (As a RDBMS, MySQL supports too much things that I don’t need in this situation) . Do you have any idea ?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It’s easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it’s easier to scale out.