I am getting started with Hadoop and am working on building a MapReduce chain for “customers who bought x also bought y”, where y is the product that is purchased most frequently with x. I am looking for advice on making this task more efficient, by which I mean reducing the amount of data shuffled from the mapper nodes to the reducer nodes. My goal is a little different from other “customers who bought x” scenarios, because I only want to store the single most commonly purchased product for a given product, not a list of co-purchased products ranked by frequency.
I am following this blog post to guide my approach.
If, as I understand it, one of the big performance limiters in Hadoop is shuffling data from the mapper nodes to the reducer nodes, then I want to keep the amount of shuffled data to a minimum in every phase of the MapReduce chain.
Let’s say my initial data set is a SQL table purchases_products, a join table between a purchase and the products that were bought in that purchase. I’ll feed the result of the following query into my MapReduce operation:
select x.product_id, y.product_id from purchases_products x inner join purchases_products y on x.purchase_id = y.purchase_id and x.product_id != y.product_id
My MapReduce strategy is to map (product_id_x, product_id_y) to (product_id_x_product_id_y, 1) and then sum the values in my reduce step. At the end I can split the keys and store the pairs back to a SQL table.
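To make the data flow concrete, here is a minimal sketch of that map and reduce step, written as a plain Python simulation rather than against the Hadoop API. The function names (map_pair, reduce_counts, run_job) and the sample pairs are hypothetical, and the in-memory grouping only stands in for the shuffle.

```python
from collections import defaultdict


def map_pair(product_x, product_y):
    # Emit the composite key "x_y" with a count of 1 for every input row.
    yield f"{product_x}_{product_y}", 1


def reduce_counts(key, values):
    # Sum all the 1s that were shuffled to this key.
    yield key, sum(values)


def run_job(pairs):
    # Stand-in for the shuffle: group every emitted value by its key.
    shuffled = defaultdict(list)
    for product_x, product_y in pairs:
        for key, value in map_pair(product_x, product_y):
            shuffled[key].append(value)

    counts = {}
    for key, values in shuffled.items():
        for out_key, total in reduce_counts(key, values):
            counts[out_key] = total
    return counts


# Hypothetical rows, shaped like the output of the SQL self-join.
pairs = [(1, 2), (1, 2), (1, 3), (2, 1), (2, 1), (3, 1)]
counts = run_job(pairs)

# Split the composite keys back into (x, y, count) rows for storage.
rows = [tuple(map(int, key.split("_"))) + (n,) for key, n in counts.items()]
```

Note that every input row becomes one shuffled record here, which is exactly the volume I am worried about.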
My problem with this operation is that it shuffles a potentially huge number of rows, even though the result set I want to produce has only count(products) rows. Ideally, I’d like a combiner step to reduce the number of rows shuffled to the reducers during this phase, but I don’t see a way to do this reliably.
Is this simply a limitation of the task at hand, or are there Hadoop tricks for organizing the workflow that will help me shrink the data shuffle during the second step? Is my worry about shuffle size appropriate in this case, or not?
Thanks!
Depending on how big your product set is (which determines the number of possible product pairs), you could look into map-side ‘local’ aggregation.
Maintain a map of product pairs to frequency counts in your mapper: rather than writing each product pair with the value 1 to the context, accumulate the counts in the map. When the map reaches a predefined size, flush it to the output context. You could even use an LRU map to keep the most recently observed pairs in memory, and write out the ‘expired’ entries as they are forced out. The reducer still sums whatever partial counts it receives for each pair.
For an example adapted for the Word Count example, see http://www.wikidoop.com/wiki/Hadoop/MapReduce/Mapper#Map_Aggregation
Of course, if you have a huge product set, or random product pairings, this isn’t going to save you that much. You also need to understand how big your map can get before you exhaust the available JVM memory.
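As a rough illustration of the LRU variant, here is a sketch in plain Python rather than the Hadoop Mapper API. The class name AggregatingMapper, the emit callback and the max_entries parameter are all hypothetical; in a real mapper the emit would be a write to the context and the final flush would happen in the cleanup hook.

```python
from collections import OrderedDict


class AggregatingMapper:
    """Counts product pairs locally and emits partial sums, not 1s."""

    def __init__(self, emit, max_entries=10000):
        self.emit = emit                # callback standing in for context.write
        self.max_entries = max_entries  # cap on pairs held in memory
        self.counts = OrderedDict()     # ordered oldest to most recently used

    def map(self, product_x, product_y):
        key = (product_x, product_y)
        if key in self.counts:
            self.counts[key] += 1
            self.counts.move_to_end(key)  # mark as most recently used
        else:
            self.counts[key] = 1
            if len(self.counts) > self.max_entries:
                # Evict the least recently used pair and emit its partial count.
                old_key, old_count = self.counts.popitem(last=False)
                self.emit(old_key, old_count)

    def cleanup(self):
        # Flush whatever is still buffered once the input is exhausted.
        for key, count in self.counts.items():
            self.emit(key, count)
        self.counts.clear()


emitted = []
mapper = AggregatingMapper(lambda k, v: emitted.append((k, v)), max_entries=2)
for x, y in [(1, 2), (1, 2), (1, 3), (2, 1), (1, 2), (3, 1)]:
    mapper.map(x, y)
mapper.cleanup()
```

With this tiny cap the six input rows become five emitted records, and the pair (1, 2) shows up twice with partial counts, which is why the reducer has to keep summing. How much this saves depends entirely on how often the same pair recurs within one mapper's input while it is still in the map.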
You can also look into reducing the amount of data stored in your output Key / Value objects: