An interviewer asked me to design a system that stores gigabytes of data and also supports certain kinds of queries.
Description:
A massive number of records is generated in an IDC. Each record consists of a URL, the IP that visited the URL, and the time of the visit. A record could probably be expressed as a struct like this, though I'm not sure which data types I should pick for the fields:
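#include <stdint.h>
#include <time.h>

struct Record {
    char    *url;         /* URL string; char *? */
    uint32_t ip;          /* assuming IPv4 packed into 32 bits; int? */
    time_t   visit_time;  /* time_t, or simply a number? */
};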
Requirements:
Design a system to store 100 billion records. The system must also support at least 2 kinds of queries:
First, given a time period (t1, t2) and an IP, query how many URLs this IP has visited in the given period.
Second, given a time period (t1, t2) and a URL, query how many times this URL has been visited in the given period.
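To pin down what the two queries mean, here is a brute-force reference sketch. It is not a design for 100 billion records, only a statement of the expected answers. The names `Visit`, `distinct_urls_for_ip`, and `visits_for_url` are made up for illustration, and it assumes IPv4 addresses, an inclusive period t1 <= time <= t2, and that the first query counts distinct URLs rather than visits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical in-memory record; field types are assumptions (IPv4 only).
struct Visit {
    std::string   url;
    std::uint32_t ip;
    std::time_t   when;
};

// Query 1: number of distinct URLs that `ip` visited with t1 <= when <= t2.
std::size_t distinct_urls_for_ip(const std::vector<Visit>& log, std::uint32_t ip,
                                 std::time_t t1, std::time_t t2) {
    assert(t1 <= t2);
    std::unordered_set<std::string> seen;
    for (const Visit& v : log)
        if (v.ip == ip && v.when >= t1 && v.when <= t2)
            seen.insert(v.url);
    return seen.size();
}

// Query 2: number of visits to `url` with t1 <= when <= t2.
std::size_t visits_for_url(const std::vector<Visit>& log, const std::string& url,
                           std::time_t t1, std::time_t t2) {
    assert(t1 <= t2);
    std::size_t n = 0;
    for (const Visit& v : log)
        if (v.url == url && v.when >= t1 && v.when <= t2)
            ++n;
    return n;
}
```

Both functions scan every record, so they are only useful as a correctness baseline to compare an indexed design against.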
I was stumped, and here is my stupid solution:
Analysis:
Because every query is performed over a given period of time:
1. Create a set, put all visit times into the set, and keep the set ordered by time from oldest to latest.
2. Create a hash table that uses hash(visit_time) as the key; call it the time-hash-table. Each node in a bucket has 2 pointers, each pointing to another hash table.
3. The other 2 hash tables are an ip-hash-table and a url-hash-table.
The ip-hash-table uses hash(ip) as the key, and all the IPs in the same ip-hash-table have the same visit time;
the url-hash-table uses hash(url) as the key, and all the URLs in the same url-hash-table have the same visit time.
A drawing follows:
time_hastbl
[]
[]
[]-->[visit_time_i]-->[visit_time_j]...[visit_time_p]-->NIL
[]       |       |
[]  ip_hastbl  url_hastbl
        []         []
        :          :
        []         []
        []         []
So, when running a query over (t1, t2):
- Find the closest match in the time set, say (t1′, t2′); all the valid visit times then fall into the part of the set from t1′ to t2′.
- For each visit time t in the time set[t1′:t2′], compute hash(t) and find t's ip_hastbl or url_hastbl, then look up the given IP or URL there and add up how many times it appears.
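The scheme above can be sketched in memory roughly as follows. This is only an illustration: `TimeNode`, `TimeIndex`, `add`, `count_url`, and `count_ip` are hypothetical names, `std::set` plays the ordered time set, `std::unordered_map` plays each hash table, IPs are assumed to be IPv4, and the period is treated as inclusive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <set>
#include <string>
#include <unordered_map>

// One node of the time hash table: the two per-timestamp tables.
struct TimeNode {
    std::unordered_map<std::uint32_t, std::size_t> ip_hastbl;   // ip  -> visits at this time
    std::unordered_map<std::string, std::size_t>   url_hastbl;  // url -> visits at this time
};

struct TimeIndex {
    std::set<std::time_t> times;                            // ordered set of visit times
    std::unordered_map<std::time_t, TimeNode> time_hastbl;  // visit_time -> node

    void add(const std::string& url, std::uint32_t ip, std::time_t when) {
        times.insert(when);
        TimeNode& node = time_hastbl[when];
        ++node.ip_hastbl[ip];
        ++node.url_hastbl[url];
    }

    // Visits to `url` with t1 <= visit_time <= t2.
    std::size_t count_url(const std::string& url, std::time_t t1, std::time_t t2) const {
        assert(t1 <= t2);
        std::size_t total = 0;
        // lower_bound/upper_bound locate the closest matching range [t1', t2'].
        auto last = times.upper_bound(t2);
        for (auto it = times.lower_bound(t1); it != last; ++it) {
            const TimeNode& node = time_hastbl.at(*it);
            auto hit = node.url_hastbl.find(url);
            if (hit != node.url_hastbl.end()) total += hit->second;
        }
        return total;
    }

    // Visits made by `ip` with t1 <= visit_time <= t2.
    std::size_t count_ip(std::uint32_t ip, std::time_t t1, std::time_t t2) const {
        assert(t1 <= t2);
        std::size_t total = 0;
        auto last = times.upper_bound(t2);
        for (auto it = times.lower_bound(t1); it != last; ++it) {
            const TimeNode& node = time_hastbl.at(*it);
            auto hit = node.ip_hastbl.find(ip);
            if (hit != node.ip_hastbl.end()) total += hit->second;
        }
        return total;
    }
};
```

Two things show up once it is written down. The query cost grows with the number of distinct timestamps in the period, not with the number of matching records. And because ip_hastbl and url_hastbl are kept separately, this layout can count visits by an IP but cannot say how many distinct URLs that IP visited.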
Questions:
1. My solution is stupid; I hope you can suggest another one.
2. With respect to storing the massive records on disk, any advice? I thought of a B-tree, but how would I use it, and is a B-tree even applicable in this system?
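On the B-tree question, one common way such an index is used is with a composite key such as (url, visit_time), so that all visits to one URL sit next to each other in time order and a period query becomes a single range scan. The sketch below is not a B-tree: a sorted `std::vector` stands in for the key order of the B-tree's leaves, and `Key`, `SortedIndex`, and `url_id` (an integer id standing in for the URL) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <utility>
#include <vector>

// Composite key (url_id, visit_time). url_id is a hypothetical integer id
// standing in for the URL, e.g. a dictionary id.
using Key = std::pair<std::uint64_t, std::time_t>;

// Sorted vector standing in for the key order of a B-tree's leaves.
struct SortedIndex {
    std::vector<Key> keys;

    void build(std::vector<Key> unsorted) {
        keys = std::move(unsorted);
        std::sort(keys.begin(), keys.end());  // order by url_id, then by time
    }

    // Number of keys with the given url_id and t1 <= visit_time <= t2.
    std::size_t count(std::uint64_t url_id, std::time_t t1, std::time_t t2) const {
        assert(t1 <= t2);
        auto lo = std::lower_bound(keys.begin(), keys.end(), Key{url_id, t1});
        auto hi = std::upper_bound(keys.begin(), keys.end(), Key{url_id, t2});
        return static_cast<std::size_t>(hi - lo);
    }
};
```

The count comes from two binary searches, with no per-record work in between. A second index keyed on (ip, visit_time) would serve the IP query the same way, although counting distinct URLs would still mean reading the entries in the range.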
I believe the interviewer was expecting a solution based on distributed computing, especially since "100 billion records" are involved. With the limited knowledge of distributed computing I have, I would suggest you look into distributed hash tables and map-reduce (for parallel query processing).
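A toy, single-process sketch of those two ideas, assuming records are partitioned by hash(url): each inner vector plays one node's local storage, `count_url_routed` asks only the shard that owns the URL (the hash-table style lookup), and `count_url_scatter` has every shard count locally and then sums the partial counts (the map-reduce shape, which is what you would need if the data were partitioned some other way, for example by time). `UrlVisit`, `Cluster`, and all method names are hypothetical, and real shards would run in parallel on separate machines.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <functional>
#include <string>
#include <vector>

struct UrlVisit {
    std::string url;
    std::time_t when;
};

struct Cluster {
    std::vector<std::vector<UrlVisit>> shards;  // one inner vector per "node"

    explicit Cluster(std::size_t n) : shards(n) { assert(n > 0); }

    // Partition by hash(url), so every visit to a URL lands on one shard.
    std::size_t shard_of(const std::string& url) const {
        return std::hash<std::string>{}(url) % shards.size();
    }

    void add(const std::string& url, std::time_t when) {
        shards[shard_of(url)].push_back({url, when});
    }

    // "Map" step: one shard counts its own matching records.
    static std::size_t local_count(const std::vector<UrlVisit>& shard,
                                   const std::string& url,
                                   std::time_t t1, std::time_t t2) {
        std::size_t n = 0;
        for (const UrlVisit& v : shard)
            if (v.url == url && v.when >= t1 && v.when <= t2) ++n;
        return n;
    }

    // Routed lookup: only the shard that owns the URL is asked.
    std::size_t count_url_routed(const std::string& url,
                                 std::time_t t1, std::time_t t2) const {
        return local_count(shards[shard_of(url)], url, t1, t2);
    }

    // Scatter-gather: every shard counts, then the "reduce" step sums.
    std::size_t count_url_scatter(const std::string& url,
                                  std::time_t t1, std::time_t t2) const {
        std::size_t total = 0;
        for (const auto& shard : shards) total += local_count(shard, url, t1, t2);
        return total;
    }
};
```

With this partitioning both functions return the same count; the routed form just touches one shard instead of all of them.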