I was asked by the interviewer to design a system to store gigabytes of

Question


Asked: May 25, 2026 (2026-05-25T14:16:00+00:00)

An interviewer asked me to design a system that stores gigabytes of data and also supports certain kinds of queries.

Description:

A massive number of records is generated in an IDC. Each record is composed of a url, the IP that visits the url, and the time when the visit occurs. A record can probably be expressed as a struct like this, but I’m not sure which data type I should pick for each field:

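#include <stdint.h>
#include <time.h>

struct Record {
    char    *url;         // the visited url
    uint32_t ip;          // an IPv4 address as a 32-bit int? (IPv6 would need 128 bits)
    time_t   visit_time;  // time_t, or simply a number?
};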
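To get a feel for the scale, here is a minimal sizing sketch for the 100 billion records named in the requirements. The 4-byte IPv4 address, the 8-byte timestamp, and the average url length are assumptions for illustration, not part of the original question, and `estimate_bytes` is a hypothetical helper.

```c
#include <stdint.h>

/* Assumed on-disk field widths (illustrative only, not from the question). */
enum { IP_BYTES = 4, TIME_BYTES = 8 };

/* Raw payload size for `records` records, ignoring indexes and overhead.
   avg_url_bytes is an assumed average url length. */
uint64_t estimate_bytes(uint64_t records, uint64_t avg_url_bytes)
{
    return records * (IP_BYTES + TIME_BYTES + avg_url_bytes);
}
```

With 100 billion records and an assumed 50-byte average url, this gives 10^11 × 62 = 6.2 × 10^12 bytes, roughly 6.2 TB before any index is built. That is well past "gigabytes" and past what a single machine can hold in memory.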

Requirements:

Design a system to store 100 billion records. The system must support at least 2 kinds of query:

First, given a time period (t1, t2) and an IP, query how many urls this IP has visited in the given period.

Second, given a time period (t1, t2) and a url, query how many times this url has been visited in the given period.
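To pin down what the two queries mean, here is a brute-force reference sketch over an in-memory array. It is only a statement of the expected results, not a design that scales. It assumes the period is inclusive at both ends and that the first query counts distinct urls; the `Visit` struct and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical in-memory record; the field types are assumptions. */
struct Visit {
    const char *url;
    uint32_t    ip;
    time_t      visit_time;
};

/* Non-zero when the visit falls inside the inclusive period [t1, t2]. */
static int in_period(const struct Visit *v, time_t t1, time_t t2)
{
    return v->visit_time >= t1 && v->visit_time <= t2;
}

/* Query 1: number of distinct urls visited by `ip` within [t1, t2]. */
size_t count_urls_for_ip(const struct Visit *v, size_t n,
                         uint32_t ip, time_t t1, time_t t2)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i].ip != ip || !in_period(&v[i], t1, t2))
            continue;
        int seen = 0;
        /* Quadratic duplicate check: fine for a reference, not for scale. */
        for (size_t j = 0; j < i && !seen; j++)
            seen = v[j].ip == ip && in_period(&v[j], t1, t2) &&
                   strcmp(v[j].url, v[i].url) == 0;
        if (!seen)
            distinct++;
    }
    return distinct;
}

/* Query 2: number of visits to `url` within [t1, t2]. */
size_t count_visits_for_url(const struct Visit *v, size_t n,
                            const char *url, time_t t1, time_t t2)
{
    size_t visits = 0;
    for (size_t i = 0; i < n; i++)
        if (in_period(&v[i], t1, t2) && strcmp(v[i].url, url) == 0)
            visits++;
    return visits;
}
```

Any indexed or distributed design can be checked against these two functions on a small sample of records.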

I was stumped, and here is my stupid solution:

Analysis:

Because every query is restricted to a given period of time:

1. Create a set, put all visit times into it, and keep it ordered by time value, from oldest to latest.

2. Create a hash table that uses hash(visit_time) as the key; call it the time-hash-table. Each node in a bucket has 2 pointers, each pointing to another hash table.

3. These two other hash tables are an ip-hash-table and a url-hash-table.

ip-hash-table uses hash(ip) as the key, and all the ips in the same ip-hash-table have the same visit time;

url-hash-table uses hash(url) as the key, and all the urls in the same url-hash-table have the same visit time.

Here is a drawing:

time_hastbl
  []
  []
  []-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
  []                     |          |
  []               ip_hastbl       url_hastbl
                      []               []
                      :                :
                      []               []
                      []               []

So, when running a query on (t1, t2):

1. Find the closest match in the time set, say (t1′, t2′); all valid visit times then fall into the part of the set from t1′ to t2′.
2. For each visit time t in the time set[t1′:t2′], compute hash(t), find t’s ip_hastbl or url_hastbl, and count how many times the given ip or url appears.
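Finding the closest match (t1′, t2′) could be done with two binary searches, assuming the ordered time set is held as a sorted array. The array layout and the helper names below are assumptions for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Index of the first element >= t in the sorted array times[0..n). */
size_t lower_bound(const time_t *times, size_t n, time_t t)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (times[mid] < t)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Index of the first element > t in the sorted array times[0..n). */
size_t upper_bound(const time_t *times, size_t n, time_t t)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (times[mid] <= t)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Half-open index range [*first, *last) of visit times inside [t1, t2]. */
void time_range(const time_t *times, size_t n, time_t t1, time_t t2,
                size_t *first, size_t *last)
{
    *first = lower_bound(times, n, t1);
    *last  = upper_bound(times, n, t2);
}
```

Each search costs O(log n). The entries between `first` and `last` are the visit times whose ip or url hash tables the second step would then probe, so the query cost still grows with the number of distinct visit times in the period.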

Questions:

1. My solution is stupid; I hope you can suggest a better one.

2. Any advice on how to store the massive number of records on disk? I thought of a B-tree, but how would I use it, and is a B-tree even applicable in this system?

1 Answer

Editorial Team · Answer 1 · 2026-05-25T14:16:01+00:00

I believe the interviewer was expecting a solution based on distributed computing, especially since “100 billion records” are involved. With my limited knowledge of distributed computing, I would suggest looking into distributed hash tables and map-reduce (for parallel query processing).
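As a rough single-process illustration of those two ideas, the sketch below picks a shard by hashing the url (the distributed-hash-table part), lets each shard count its own matching visits (the "map" step), and sums the partial counts (the "reduce" step). The shard layout, the choice of FNV-1a as the hash, the `Visit` struct, and all function names are assumptions for illustration; a real system would run each shard on a separate machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical record layout; the field types are assumptions. */
struct Visit {
    const char *url;
    uint32_t    ip;
    time_t      visit_time;
};

/* 32-bit FNV-1a string hash, used here only to pick a shard. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Shard that owns a url when records are partitioned by hash(url). */
size_t shard_for_url(const char *url, size_t num_shards)
{
    return fnv1a(url) % num_shards;
}

/* "Map" step: one shard counts its own visits to `url` within [t1, t2]. */
size_t shard_count(const struct Visit *v, size_t n,
                   const char *url, time_t t1, time_t t2)
{
    size_t visits = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i].visit_time >= t1 && v[i].visit_time <= t2 &&
            strcmp(v[i].url, url) == 0)
            visits++;
    return visits;
}

/* "Reduce" step: sum the partial counts returned by the shards. */
size_t reduce_counts(const size_t *partial, size_t num_shards)
{
    size_t total = 0;
    for (size_t i = 0; i < num_shards; i++)
        total += partial[i];
    return total;
}
```

If records are partitioned by `shard_for_url`, the url query only needs to ask one shard. If they are partitioned some other way (for example by time), every shard runs `shard_count` in parallel and `reduce_counts` combines the results.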
