I have a problem and I'm not sure which algorithm to apply.
I'm thinking of applying clustering in case two, but I have no idea about case one:
I have 0.5 million credit card activity documents. Each document is well-defined and contains one transaction per line: the date, the amount, the retailer name, and a short 5-20 word description of the retailer.
Sample:
2004-11-47,$500,Amazon,An online retailer providing goods and services including books, hardware, music, etc.
Questions:
1. How would you classify each entry given no predefined categories?
2. How would you do this if you were given predefined categories such as “restaurant”, “entertainment”, etc.?
1) How would you classify each entry given no predefined categories?
You wouldn’t, since classification needs categories to assign. Instead, you’d use some dimensionality reduction algorithm on the data’s features to plot them in 2-D, make a guess at the number of “natural” clusters, then run a clustering algorithm with that number.
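To make the clustering step concrete, here's a rough standard-library sketch: TF-IDF vectors built from the retailer descriptions, fed to a small spherical k-means. It skips the 2-D plotting step and simply assumes you've already guessed k. The rows, the retailer names, the helper names (`parse_line`, `tfidf_vectors`, `kmeans`) and the farthest-first initialisation are all made up for illustration. In practice you'd more likely reach for a library such as scikit-learn.

```python
import math
import re
from collections import Counter

def parse_line(line):
    # The description itself may contain commas, so split at most three times.
    date, amount, retailer, description = line.split(",", 3)
    return {"date": date, "amount": amount, "retailer": retailer,
            "description": description}

def normalise(vec):
    # Scale a sparse vector (dict) to unit length.
    norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
    return {w: x / norm for w, x in vec.items()}

def tfidf_vectors(texts):
    # One unit-length sparse TF-IDF vector per text, using a smoothed IDF.
    docs = [Counter(re.findall(r"[a-z]+", t.lower())) for t in texts]
    doc_freq = Counter(word for doc in docs for word in doc)
    n = len(docs)
    return [normalise({w: tf * (math.log((1 + n) / (1 + doc_freq[w])) + 1)
                       for w, tf in doc.items()})
            for doc in docs]

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * b.get(w, 0.0) for w, x in a.items())

def kmeans(vectors, k, max_iter=20):
    # Assumption: deterministic farthest-first initialisation. Start from the
    # first row, then keep adding the row least similar to the chosen centroids.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[j] = normalise(dict(total))
    return labels

# Invented toy rows in the same "date,amount,retailer,description" layout.
rows = [
    "2004-11-02,$42,Book Barn,Online retailer selling books and music",
    "2004-11-03,$18,Pizza Place,Restaurant serving pizza and pasta",
    "2004-11-05,$60,Page Turner,Online retailer selling books and magazines",
    "2004-11-06,$25,Pasta House,Restaurant serving pasta and salads",
]
records = [parse_line(row) for row in rows]
labels = kmeans(tfidf_vectors([r["description"] for r in records]), k=2)
for record, label in zip(records, labels):
    print(label, record["retailer"])
```

The cluster ids are arbitrary. What they mean is something you still have to work out by reading a few members of each cluster.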
2) How would you do this if you were given predefined categories such as “restaurant”, “entertainment”, etc.?
You’d manually label a bunch of them, then train a classifier on that and see how well it works with the usual machinery of accuracy/F1, cross-validation, etc. Or you’d check whether a clustering algorithm picks up these categories well, but then you still need some labeled data to do the checking.
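As a sketch of the label-then-train route, here's a tiny multinomial Naive Bayes over the description words, scored with k-fold cross-validation. Naive Bayes is just one possible classifier, picked here because it fits in a few lines. The labeled examples and the helper names (`train_nb`, `predict_nb`, `cross_val_accuracy`) are invented for illustration, and it reports accuracy only, not F1.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_nb(texts, labels):
    # Count documents per class and word occurrences per class.
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                # Laplace (add-one) smoothed word likelihood.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_val_accuracy(texts, labels, folds):
    # Every example is held out exactly once; fold f tests rows f, f+folds, ...
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(texts), folds))
        train_texts = [t for i, t in enumerate(texts) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_nb(train_texts, train_labels)
        correct += sum(predict_nb(model, texts[i]) == labels[i]
                       for i in test_idx)
    return correct / len(texts)

# Invented, hand-labeled retailer descriptions.
texts = [
    "Restaurant serving pizza and pasta",
    "Cinema showing films and live music",
    "Family restaurant serving burgers and fries",
    "Theatre showing plays and live concerts",
    "Cafe serving coffee and pasta",
    "Arcade with games and films",
]
labels = ["restaurant", "entertainment"] * 3

accuracy = cross_val_accuracy(texts, labels, folds=3)
model = train_nb(texts, labels)
print("cross-validated accuracy:", accuracy)
print(predict_nb(model, "Diner serving pasta and fries"))
```

With real data you'd want far more labeled rows per category than this, and per-class precision/recall or F1 next to accuracy, since the categories are unlikely to be balanced.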