I am writing a PHP function that will need to loop over an array of pointers and for each item, pull in that data (be it from a MySQL database or flat file). Would anyone have any ideas of optimizing this as there could potentially be thousands and thousands of iterations?
My first idea was to have a static array of cached data that I work on and any modifications will just change that cached array then at the end I can flush it to disk. However in a loop of over 1000 items, this would be useless if I only keep around 30 in the array. Each item isn’t too big but 1000+ of them in memory is way too much, hence the need for disk storage.
The data is just gzipped serialized objects. Currently I am using a database to store the data but I am thinking maybe flat files would be quicker (I don’t care about concurrency issues and I don’t need to parse it, just unzip and unserialize). I already have a custom iterator that will pull in 5 items at a time (to cut down on DB connections) and store them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.
Basically I just need a way to iterate over these many items quickly.
Well, you haven’t given a whole lot to go on. You don’t describe your data, and you don’t describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and…
So anything anybody says here is going to be a complete shot in the dark.
…so along those lines, here’s a shot in the dark.
If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access the object, make a note of the time (this might not mean clock time so much as it may mean the order in which you access them). Keep each item in a list (it may not be implemented in a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to put a new one into memory, you replace the one that was used the longest time ago and then you move that item to the front of the list. You may need to keep another index of the items so that you know where exactly they are in the list when you need them. What you do then is look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.
This is called the LRU algroithm. It’s a page replacement scheme for virtual memory. What it does is it delays your bottleneck (the disk I/O) until it’s probably impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.
< edit >
Based off of your comment, you are working on a neural network. In the case of your initial fedding of the data (before the correction stage), or when you are actively using it to classify, I don’t see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.
In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory… because you’ve already visited them!
If your network is large, you aren’t going to get away with no disk I/O. The trick is to find a way to minimize it.
< /edit >