I need some advice on choosing a sorting algorithm to implement for this problem.
In phase one, the program will fetch clientIDs and respective hashes (will be using a struct, probably) from a database. There can be 0 or many thousands of records.
In phase two, the program will complete this set with records read from an XML file. I’ve already built the stream parser. The XML file has all the client info sequentially before the invoice data.
When phase two is done, the program will read the invoice data. Each invoice has one clientID, and this has to be checked against the set of clients. The number of invoices can be in the millions.
Here is what I initially thought: since I don’t know how many client records there will be, I must allocate memory dynamically, using a linked list. At the end of phase two I can create an array of the data ordered by clientID, so that the subsequent lookups, one for each invoice, are quick, maybe using a binary search.
I’d like to know what you would advise for this situation. Which sorting algorithm should I use? (I’ll be coding in C.)
Arguably, the best algorithm is one that satisfies the following criteria:
Given that thousands of records is basically none, I’d suggest using qsort for the sort, and bsearch for the searches; both of these are in the C standard library.
Issues to note:
qsort can’t be used on a linked-list. I’d strongly suggest storing your data in a dynamically-grown array; the amortized cost of creation is the same, and you’ll have other benefits (e.g. less memory overhead, better locality of reference).
If, after careful profiling, you find that bsearch is not sufficiently fast, then you may want to move over to a hashtable-based lookup, as this is O(1), not O(log N). However, don’t attempt to write your own; use an existing library for this. (See other answers here for suggestions.)
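To make the suggestion concrete, here is a minimal sketch of a dynamically-grown array combined with qsort and bsearch. The record layout is an assumption, since the question only mentions a clientID and a hash: the names Client, ClientVec, clientvec_push, clientvec_sort and clientvec_find, the 32-bit ID, the 41-byte hash buffer, and the initial capacity of 1024 are all hypothetical. It also assumes client IDs are unique (or that any matching record will do), because bsearch does not specify which of several equal elements it returns.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record layout; the real fields and sizes are not given. */
typedef struct {
    uint32_t client_id;
    char     hash[41];   /* assumed: 40 hex characters plus the NUL */
} Client;

/* Growable array of clients. Initialise with: ClientVec v = {0}; */
typedef struct {
    Client *items;
    size_t  len;   /* records stored */
    size_t  cap;   /* records allocated */
} ClientVec;

/* Append one record, doubling the capacity when the array is full, so
 * the amortized cost per append is O(1).
 * Returns 0 on success, -1 if the allocation fails. */
static int clientvec_push(ClientVec *v, uint32_t id, const char *hash)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 1024;
        Client *p = realloc(v->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;          /* the old block is still valid */
        v->items = p;
        v->cap = new_cap;
    }
    Client *c = &v->items[v->len++];
    c->client_id = id;
    /* snprintf truncates if needed and always NUL-terminates. */
    snprintf(c->hash, sizeof c->hash, "%s", hash);
    return 0;
}

/* Comparison callback shared by qsort and bsearch: orders by client_id.
 * Comparing instead of subtracting avoids unsigned wrap-around. */
static int cmp_client(const void *a, const void *b)
{
    const Client *x = a;
    const Client *y = b;
    return (x->client_id > y->client_id) - (x->client_id < y->client_id);
}

/* Call once, after phase two has appended every client. */
static void clientvec_sort(ClientVec *v)
{
    if (v->len > 1)
        qsort(v->items, v->len, sizeof *v->items, cmp_client);
}

/* Look up one client in the sorted array in O(log N).
 * Returns NULL when the ID is not present. */
static const Client *clientvec_find(const ClientVec *v, uint32_t id)
{
    Client key = { .client_id = id };
    if (v->len == 0)
        return NULL;
    return bsearch(&key, v->items, v->len, sizeof *v->items, cmp_client);
}
```

In this sketch, phases one and two would both call clientvec_push for each record, clientvec_sort would run once when the client section of the XML ends, and each invoice would then cost one clientvec_find call. The array must not be appended to between the sort and the lookups, since bsearch relies on the sorted order; release it with free(v.items) when done.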