The Archive Base

Editorial Team
Asked: June 14, 2026 (2026-06-14T08:11:50+00:00)

How can we make our MapReduce Queries Faster?

We have built an application using a five-node Riak cluster.
Our data model is composed of three buckets: matches, leagues, and teams.

Each object in the matches bucket contains links to leagues and teams:

Model

var match = {
        id: matchId,
        leagueId: meta.leagueId,
        homeTeamId: meta.homeTeamId,
        awayTeamId: meta.awayTeamId,
        startTime: m.match.startTime,
        firstHalfStartTime: m.match.firstHalfStartTime,
        secondHalfStartTime: m.match.secondHalfStartTime,
        score: {
            goals: {
                a: 1*safeGet(m.match, 'score.goals.a'),
                b: 1*safeGet(m.match, 'score.goals.b')
            },
            corners: {
                a: 1*safeGet(m.match, 'score.corners.a'),
                b: 1*safeGet(m.match, 'score.corners.b')
            }
        }
    };

var options = {
        index: {
            leagueId: match.leagueId,
            teamId: [match.homeTeamId, match.awayTeamId],
            startTime: match.startTime || match.firstHalfStartTime || match.secondHalfStartTime
        },
        links: [
            { bucket: 'leagues', key: match.leagueId, tag: 'league' },
            { bucket: 'teams', key: match.homeTeamId, tag: 'home' },
            { bucket: 'teams', key: match.awayTeamId, tag: 'away' }
        ]
    };
    match.model = 'match';
    modelCache.save('matches', match.id, match, options, callback);

Queries

We need a query that returns results from several buckets. One way is to query each bucket separately; the other is to use links to combine the results in a single query.

Both versions of the query we tried take over a second, no matter how small the buckets are.
The first version uses two map phases, modeled after this post (Practical Map-Reduce: Forwarding and Collecting).

#!/bin/bash
curl -X POST \
-H "content-type: application/json" \
-d @- \
http://localhost:8091/mapred \
<<EOF
{
    "inputs":{
        "bucket":"matches",
        "index":"startTime_bin",
        "start":"2012-10-22T23:00:00",
        "end":"2012-10-24T23:35:00"
    },
    "query": [
        {"map":{"language": "javascript", "source":"
                function(value, keydata, arg){
                    var match = Riak.mapValuesJson(value)[0];
                    var links = value.values[0].metadata.Links;
                    var result = links.map(function(l) {
                        return [l[0], l[1], match];
                    });
                    return result;
                }
            "}
        },
        {"map":{"language": "javascript", "source": "
                function(value, keydata, arg) {
                    var doc = Riak.mapValuesJson(value)[0];
                    return [doc, keydata];
                }
            "}
        },
        {"reduce":{
            "language": "javascript",
                "source":"
                    function(values) {
                        var merged = {};
                        values.forEach(function(v) {
                            if(!merged[v.id]) {
                                merged[v.id] = v;
                            }
                        });
                        var results = [];
                        for(var key in merged) {
                            results.push(merged[key]);
                        }
                        return results;
                    }
                "
            }
        }
    ]
}
EOF

In the second version we do four separate Map-Reduce queries to get the objects from the three buckets:

async.series([
        //First get all matches
        function(callback) {
            db.mapreduce
                .add(inputs)
                .map(function (val, key, arg) {
                    var data = Riak.mapValuesJson(val)[0];
                    if(arg.leagueId && arg.leagueId != data.leagueId) {
                        return [];
                    }
                    var d = new Date();
                    var date = data.startTime || data.firstHalfStartTime || data.secondHalfStartTime;
                    d.setFullYear(date.substring(0, 4));
                    d.setMonth(date.substring(5, 7) - 1);
                    d.setDate(date.substring(8, 10));
                    d.setHours(date.substring(11, 13));
                    d.setMinutes(date.substring(14, 16));
                    d.setSeconds(date.substring(17, 19));
                    d.setMilliseconds(0);
                    var startTimestamp = d.getTime();
                    var short = {
                        id: data.id,
                        l: data.leagueId,
                        h: data.homeTeamId,
                        a: data.awayTeamId,
                        t: startTimestamp,
                        s: data.score,
                        c: startTimestamp
                    };
                    return [short];
                }, {leagueId: query.leagueId, page: query.page}).reduce(function (val, key) {
                    return val;
                }).run(function (err, matches) {
                    matches.forEach(function(match) {
                        result.match[match.id] = match; //Should maybe filter this
                        leagueIds.push(match.l);
                        teamIds.push(match.h);
                        teamIds.push(match.a);
                    });
                    callback();
                });
        },
        //Then get all leagues, teams and lines in parallel
        function(callback) {
            async.parallel([
                //Leagues
                function(callback) {
                    db.getMany('leagues', leagueIds, function(err, leagues) {
                        if (err) { callback(err); return; }
                        leagues.forEach(function(league) {
                            visibleLeagueIds[league.id] = true;
                            result.league[league.id] = {
                                r: league.regionId,
                                n: league.name,
                                s: league.name
                            };
                        });
                        callback();
                    });
                },
                //Teams
                function(callback) {
                    db.getMany('teams', teamIds, function(err, teams) {
                        if (err) { callback(err); return; }
                        teams.forEach(function(team) {
                            result.team[team.id] = {
                                n: team.name,
                                h: team.name,
                                s: team.stats
                            };
                        });
                        callback();
                    });
                }
            ], callback);
        }
    ], function(err) {
        if (err) { callback(err); return; }
        _.each(regionModel.getAll(), function(region) {
           result.region[region.id] = {
               id: region.id,
               c: 'https://d1goqbu19rcwi8.cloudfront.net/icons/silk-flags/' + region.icon + '.png',
               n: region.name
           };
        });
        var response = {
            success: true,
            result: {
                modelRecords: result,
                paging: {
                    page: query.page,
                    pageSize: 50,
                    total: Object.keys(result.match).length // result.match is keyed by match id, so count its keys
                },
                time: moment().diff(a)/1000.00,
                visibleLeagueIds: visibleLeagueIds
            }
        };
        callback(null, JSON.stringify(response, null, '\t'));
    });

How do we make these queries faster?

Additional info:

We are using riak-js and Node.js to run our queries.

1 Answer

  1. Editorial Team
    Added an answer on June 14, 2026 at 8:11 am (2026-06-14T08:11:51+00:00)

    One way to make it at least a bit faster would be to deploy the JavaScript mapreduce functions to the server instead of passing their source along with every job (see the description of the js_source_dir parameter in the Riak configuration documentation). This is usually recommended for JavaScript functions that you run repeatedly.
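
As a rough sketch of what that could look like: the first map function from the question is moved into a file in the directory that js_source_dir points to, and the job then refers to it by name instead of shipping the source. The file name, the MatchFns namespace, and the buildJob helper are hypothetical; the "name" field is the way Riak's job format refers to a predefined JavaScript function such as Riak.mapValuesJson, and it is assumed here that a custom namespaced function can be referenced the same way.

```javascript
// Hypothetical file, e.g. <js_source_dir>/match_fns.js, copied to every node.
// The namespace object and function name are illustrative, not from the question.
var MatchFns = {
    // Same logic as the first map phase in the question:
    // emit [bucket, key, matchData] for every link on the match object.
    mapLinks: function (value, keyData, arg) {
        var match = Riak.mapValuesJson(value)[0];
        var links = value.values[0].metadata.Links;
        return links.map(function (l) {
            return [l[0], l[1], match];
        });
    }
};

// Client side (hypothetical helper): the phase names the deployed function
// instead of embedding its source in the job.
function buildJob(start, end) {
    return {
        inputs: { bucket: 'matches', index: 'startTime_bin', start: start, end: end },
        query: [
            { map: { language: 'javascript', name: 'MatchFns.mapLinks' } }
        ]
    };
}
```

The remaining phases of the original job would be appended to the query array in the same way.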

    As there is some overhead associated with running JavaScript mapreduce functions compared to native ones implemented in Erlang, using non-JavaScript functions where possible may also help.

    The two map phase functions in your first query appear to be designed to work around the limitation that a normal linking phase (which I believe is more efficient) does not pass on the record being processed (the matches record). The first function includes all the links and passes on the match data as additional data in JSON form, while the second passes on the data of the match as well as the linked record in JSON form.

    I have written a simple Erlang function that includes all links as well as the ID of the record passed in. This could be used together with the native Erlang function riak_kv_mapreduce:map_object_value to replace the two map phase functions in your first example, removing some of the JavaScript usage. As in the existing solution, I would expect you to receive a number of duplicates as several matches may link to the same league/team.

    -module(riak_mapreduce_example).
    
    -export([map_link/3]).
    
    %% @spec map_link(riak_object:riak_object(), term(), term()) ->
    %%                   [{{Bucket :: binary(), Key :: binary()}, Props :: term()}]
    %% @doc map phase function for adding linked records to result set
    map_link({error, notfound}, _, _) ->
        [];
    map_link(RiakObject, Props, _) ->
        Bucket = riak_object:bucket(RiakObject),
        Key = riak_object:key(RiakObject),
        Meta = riak_object:get_metadata(RiakObject),
        Current = [{{Bucket, Key}, Props}],
        Links = case dict:find(<<"Links">>, Meta) of
            {ok, List} ->
                [{{B, K}, Props} || {{B, K}, _Tag} <- List];
            error ->
                []
        end,
        lists:append([Current, Links]).
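
Assuming the module above has been compiled and is on the code path of every node, a job using it might be put together roughly as follows. The buildErlangJob helper is hypothetical; the module and function fields follow Riak's HTTP MapReduce job format, and setting "keep" to false on the first phase is an assumption that only the output of the final phase is wanted.

```javascript
// Hypothetical helper: builds a job that runs the Erlang map_link function
// from above, followed by Riak's built-in riak_kv_mapreduce:map_object_value.
function buildErlangJob(start, end) {
    return {
        inputs: { bucket: 'matches', index: 'startTime_bin', start: start, end: end },
        query: [
            // Phase 1: emit the match itself plus every linked league/team.
            { map: { language: 'erlang', module: 'riak_mapreduce_example',
                     function: 'map_link', keep: false } },
            // Phase 2: fetch the stored value of each object emitted by phase 1.
            { map: { language: 'erlang', module: 'riak_kv_mapreduce',
                     function: 'map_object_value', keep: true } }
        ]
    };
}

// The body that would be POSTed to /mapred, as in the curl example in the question.
var body = JSON.stringify(buildErlangJob('2012-10-22T23:00:00', '2012-10-24T23:35:00'));
```

Since map_object_value returns the stored values as they are, the client should probably expect to parse each returned value as JSON itself.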
    

    The results of these can either be sent back to the client for aggregation or passed into a reduce phase function as in the example you provided.
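
If the aggregation is done on the client, removing the duplicates mentioned above could look something like this. It is only a sketch: it assumes each returned record has already been parsed into an object with an id field, mirroring the merge step of the reduce function in the question, and the name dedupeById is made up for illustration.

```javascript
// Keep the first record seen for each id; records without an id are skipped.
function dedupeById(records) {
    var seen = Object.create(null); // plain map with no inherited keys
    var results = [];
    records.forEach(function (r) {
        if (r && r.id !== undefined && !seen[r.id]) {
            seen[r.id] = true;
            results.push(r);
        }
    });
    return results;
}
```

If ids are only unique within a bucket, the lookup key would need to include the bucket name as well.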

    The example function would need to be compiled and installed on all nodes, and may require a restart.

    Another way to improve performance (which may well not be an option for you) would be to alter the data model so that performance-critical queries do not need mapreduce at all.
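
One possible direction, purely as an illustration: copy the few league and team fields that the response needs onto the match when it is saved, so that reading matches requires no follow-up lookups. The function name and the embedded field names below are hypothetical, and the trade-off is that a renamed league or team then has to be rewritten on every match that embeds it.

```javascript
// Hypothetical helper: returns a copy of the match with the league and team
// fields needed for display embedded directly in it.
function denormalizeMatch(match, league, homeTeam, awayTeam) {
    var copy = JSON.parse(JSON.stringify(match)); // simple deep copy of plain data
    copy.league = { id: league.id, name: league.name, regionId: league.regionId };
    copy.homeTeam = { id: homeTeam.id, name: homeTeam.name };
    copy.awayTeam = { id: awayTeam.id, name: awayTeam.name };
    return copy;
}
```

The result would be stored in place of the plain match object, with the secondary indexes from the question left unchanged.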
