I am connecting to a sockets API that is very inflexible. It will return rows such as:
NAME, CITY, STATE, JOB, MONTH
The result will contain duplicates because the API does not do any aggregation. I need to count the duplicate rows (which would be very easy in SQL but not, as far as I know, in Java).
Example source data:
NAME, CITY, STATE, JOB, MONTH
John Doe, Denver, CO, INSTALLATION, 090301
John Doe, Denver, CO, INSTALLATION, 090301
John Doe, Denver, CO, INSTALLATION, 090301
Jane Doe, Phoenix, AZ, SUPPORT, 090301
Intended output:
NAME, CITY, STATE, JOB, MONTH, COUNT
John Doe, Denver, CO, INSTALLATION, 090301, 3
Jane Doe, Phoenix, AZ, SUPPORT, 090301, 1
I can easily do this for approximately 100,000 return rows, but I am dealing with about 60 million in a month. Any ideas?
Edit: Unfortunately, the rows are not returned sorted… nor is there an option through the API to sort them. I get this giant mess of rows that needs to be aggregated. Right now I use an ArrayList and call indexOf() with each new row to check whether it already exists, but that gets slower as the number of rows grows.
Edit: For clarification, this would only need to be run once a month, at the end of the month. Thank you for all of the responses.
You could use a HashSet to store the previous row with the same contents (assuming your Row objects have proper .hashCode() and .equals() methods implemented).
Something like this perhaps:
Then in use (assuming further that you have an incrementCount() method to the Row class):
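A minimal sketch of the idea, not a definitive implementation. The Row class, its field names, getCount() and the RowCounter helper are all hypothetical; only incrementCount() is named above. One deliberate swap: a HashSet cannot hand back the element it already holds, so this sketch uses a HashMap that maps each row to its first-seen instance, which gives the same constant-time lookup while letting you increment the stored row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical row type; the fields mirror the columns in the question.
final class Row {
    final String name, city, state, job, month;
    private int count = 0; // how many times this row has been seen

    Row(String name, String city, String state, String job, String month) {
        this.name = name;
        this.city = city;
        this.state = state;
        this.job = job;
        this.month = month;
    }

    void incrementCount() { count++; }
    int getCount() { return count; }

    // Equality covers the five data columns only; the mutable count is
    // left out so that a row's hash never changes while it is in the map.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        Row r = (Row) o;
        return Objects.equals(name, r.name)
                && Objects.equals(city, r.city)
                && Objects.equals(state, r.state)
                && Objects.equals(job, r.job)
                && Objects.equals(month, r.month);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, city, state, job, month);
    }
}

// Hypothetical helper that does the aggregation.
final class RowCounter {
    // Returns one Row per distinct input row, in first-seen order,
    // with its count set to the number of occurrences.
    static List<Row> aggregate(Iterable<Row> incoming) {
        Map<Row, Row> seen = new HashMap<>();   // row -> first instance of that row
        List<Row> ordered = new ArrayList<>();  // preserves arrival order
        for (Row row : incoming) {
            Row existing = seen.get(row);       // hash lookup instead of indexOf()
            if (existing == null) {
                seen.put(row, row);
                ordered.add(row);
                existing = row;
            }
            existing.incrementCount();
        }
        return ordered;
    }
}
```

Each lookup is a hash probe rather than a linear scan, so the cost per row stays roughly constant as the number of distinct rows grows, unlike ArrayList.indexOf().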
If you don’t care about the order in which the rows came in, you can get rid of the List and just use the Set.
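When order does not matter, one possible shape is a single map from row to count. This is a sketch under assumptions: it represents each row as a List<String> of its five columns (relying on List's content-based equals() and hashCode()) instead of a Row class, and the UnorderedCounter name is made up for illustration. It uses Map.merge from Java 8, so it is a variation on the Set approach above, not the same code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts identical rows without tracking arrival order.
final class UnorderedCounter {
    // Each row is a list of column values; the lists must not be
    // modified after being passed in, since they are used as map keys.
    static Map<List<String>, Integer> count(Iterable<List<String>> rows) {
        Map<List<String>, Integer> counts = new HashMap<>();
        for (List<String> row : rows) {
            // Inserts 1 for a new row, otherwise adds 1 to the stored count.
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting map's entries can then be written out as the NAME, CITY, STATE, JOB, MONTH, COUNT lines shown in the question.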