For one of my classes we need to calculate the session length for a user visiting a website. We were given a web log. The web log is in this format:
IPAddress date httpMethod httpStatus size referrer browserInfo
- The
httpMethodlooks like this:GET /include/main_page.css HTTP/1.1 - The referrer is always the main page:
http://www.cs.myCollage.comor-
I am using a timeout value of 20 minutes.
QUESTIONS:
I am not sure how to tell when a session is over other than when it times out. Is the only way to end a session with a timeout? Is there a way to detect when a user leaves the site (using only the information in the logs)?
This is my current strategy (assume that we have these logs):
IPAddress Time httpMethod ...
IP1 2:15 GET something
IP1 2:17 GET something else
IP1 2:30 GET something else
IP1 4:30 GET something else
IP1 4:32 GET something else
This means that the user has had two sessions. I think that the first session would be either 15 minutes or 35 minutes. Should I include the timeout in the session time?
The second session would be between 2 minutes and 22 minutes.
Timeout value is used to separate different sessions coming from same IP (which is not necessarily the same person). In your example you have two different sessions because period from 2:30 to 4:30 is larger than timeout value.
As for determining session length this is probably straightforward class homework solution, and probably what teacher had in mind: just subtract start time from end time. In your case 15 minutes for first session, and 2 minutes for second.
If this would be a real world project then maybe last page in each session should be given some value too. For this you can use temporal locality approach:
The duration of the last GET could be estimated by average durations of all pages that precede it. In you example (2:15,2:17,2:30) first two pages lasted for 15 minutes, so estimation is that visitor is kinda slow and/or thorough and that third page lasted for 7.5 minutes, and session total is 22.5 minutes. From (4:30,4:32) we deduce that last page lasted for 2 minutes, and session total is 4 minutes. In special case where we have only one page visit you must have some arbitrary value for duration, like 1 minute.
Another approach is to put a value to every page. Some page take more time to read than others. This means you must read the whole log and determine the average visit time for each page when they are in mid session, and use this time for case when page is last in session. This is more complicated, and probably not an answer to your homework question.
Best real world solution would probably be a mix of these two approaches.