Editorial Team
Asked: May 16, 2026 (2026-05-16T23:07:43+00:00)

Web server log analyzers (e.g. Urchin) often display a number of “sessions”. A session is defined as a series of page visits / clicks made by an individual within a limited, continuous time segment. The attempt is made to identify these segments using IP addresses, and often supplementary info like user agent and OS, along with a session timeout threshold such as 15 or 30 minutes.

For certain web sites and applications, a user can be logged in and/or tracked with a cookie, which means the server can know precisely when a session begins. I’m not talking about that, but about inferring sessions heuristically (“session reconstruction”) when the web server does not track them.

I could write some code e.g. in Python to try to reconstruct sessions based on the criteria mentioned above, but I’d rather not reinvent the wheel. I’m looking at log files of a size around 400K lines, so I’d have to be careful to use a scalable algorithm.

My goal here is to extract a list of unique IP addresses from a log file, and for each IP address, to have the number of sessions inferred from that log. Absolute precision and accuracy are not necessary… pretty-good estimates are ok.

Based on this description:

a new request is put in an existing
session if two conditions are valid:

  • the IP address and the user-agent are the same of the requests already
    inserted in the session,
  • the request is done less than fifteen minutes after the last
    request inserted.

it would be simple in theory to write a Python program to build up a dictionary (keyed by IP) of dictionaries (keyed by user-agent) whose value is a pair: (number of sessions, latest request of latest session).

But I would rather try to use an existing implementation if one’s available, since I might otherwise risk spending a lot of time tuning performance.
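For reference, the dictionary-of-dictionaries idea described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `count_sessions`, the tuple layout of `requests`, and the sample data are all made up for the example, and the requests are assumed to arrive in chronological order.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_sessions(requests, timeout=timedelta(minutes=15)):
    """Return {ip: inferred session count}.

    `requests` is an iterable of (ip, user_agent, timestamp) tuples in
    chronological order (a hypothetical layout chosen for this sketch).
    """
    # state[ip][user_agent] = [session_count, time_of_last_request]
    state = defaultdict(dict)
    for ip, user_agent, when in requests:
        entry = state[ip].get(user_agent)
        if entry is None:
            # First request seen for this (IP, user agent) pair.
            state[ip][user_agent] = [1, when]
        else:
            # A gap of at least `timeout` starts a new session.
            if when - entry[1] >= timeout:
                entry[0] += 1
            entry[1] = when
    # Sum the per-user-agent session counts for each IP.
    return {ip: sum(count for count, _ in agents.values())
            for ip, agents in state.items()}


# Made-up sample data, not taken from the real log.
sample = [
    ("10.0.0.1", "Firefox", datetime(2010, 9, 21, 10, 0, 0)),
    ("10.0.0.1", "Firefox", datetime(2010, 9, 21, 10, 5, 0)),   # same session
    ("10.0.0.1", "Firefox", datetime(2010, 9, 21, 10, 30, 0)),  # 25 min gap
    ("10.0.0.1", "Opera", datetime(2010, 9, 21, 10, 31, 0)),    # other agent
    ("10.0.0.2", "Firefox", datetime(2010, 9, 21, 10, 32, 0)),
]
print(count_sessions(sample))  # {'10.0.0.1': 3, '10.0.0.2': 1}
```

Memory use grows with the number of distinct (IP, user agent) pairs rather than with the number of log lines, and each line is handled with a couple of dictionary lookups, so a 400K-line file should not need a cleverer algorithm.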

FYI lest someone ask for sample input, here is a line of our log file (sanitized):

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status 
2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.mysite.org/blarg.htm 200 0 0
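
Since the log carries a `#Fields:` directive, one way to avoid hard-coding column positions is to read the field names from that line. The sketch below assumes that fields are separated by whitespace and that spaces inside values are encoded (as with the `+` signs in the user agent above); `parse_w3c_log` is a hypothetical helper name, not part of any library.

```python
def parse_w3c_log(lines):
    """Yield one dict per record, keyed by the names in the '#Fields:' line."""
    field_names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only '#Fields:' matters for this sketch.
            if line.startswith("#Fields:"):
                field_names = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(field_names):
            continue  # malformed record, or no '#Fields:' line seen yet
        yield dict(zip(field_names, values))


sample_log = [
    "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
    "cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus "
    "sc-win32-status",
    "2010-09-21 23:59:59 215.51.1.119 GET /graphics/foo.gif - 80 - "
    "128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;"
    "+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) "
    "http://www.mysite.org/blarg.htm 200 0 0",
]
records = list(parse_w3c_log(sample_log))
print(records[0]["c-ip"], records[0]["date"], records[0]["time"])
```

Looking fields up by name (`c-ip`, `cs(User-Agent)`) keeps the code working if the server is later configured to log a different set of columns.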
1 Answer

  1. Editorial Team
    Added an answer on May 16, 2026 at 11:07 pm

    OK, in the absence of any other answer, here’s my Python implementation. I’m not a Python expert. Suggestions for improvement are welcome.

    #!/usr/bin/env python3
    
    """Reconstruct sessions: Take a space-delimited web server access log
    including IP addresses, timestamps, and User Agent,
    and output a list of the IPs, and the number of inferred sessions for each."""
    
    ## Input looks like:
    # Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status
    # 2010-09-21 23:59:59 172.21.1.119 GET /graphics/foo.gif - 80 - 128.123.114.141 Mozilla/5.0+(Windows;+U;+Windows+NT+5.1;+en-US;+rv:1.9.2)+Gecko/20100115+Firefox/3.6+(.NET+CLR+3.5.30729) http://www.site.org//baz.htm 200 0 0
    
    import datetime
    
    infileName = "ex100922.log"
    outfileName = "visitor-ips.csv"
    
    # ipDict[ip] is { userAgent1: [numSessions, lastTimeStamp], ... }
    ipDict = {}
    
    def inputRecords():
        recordsRead = 0
        progressThreshold = 100
        sessionTimeout = datetime.timedelta(minutes=30)
    
        with open(infileName, "r") as infile:
            for line in infile:
                # Skip blank lines and '#' directive lines such as "#Fields: ...".
                if not line.strip() or line.startswith('#'):
                    continue
                recordsRead += 1
    
                fields = line.split()
                # Report progress at 100, 200, 400, ... records.
                if recordsRead >= progressThreshold:
                    print("Read %d records" % recordsRead)
                    progressThreshold *= 2
    
                # http://www.dblab.ntua.gr/persdl2007/papers/72.pdf
                #   "a new request is put in an existing session if two conditions are valid:
                #    * the IP address and the user-agent are the same of the requests already
                #      inserted in the session,
                #    * the request is done less than fifteen minutes after the last request inserted."
                # Note: this script uses a 30-minute timeout (sessionTimeout) instead of fifteen.
    
                theDate, theTime = fields[0], fields[1]
                newRequestTime = datetime.datetime.strptime(theDate + " " + theTime, "%Y-%m-%d %H:%M:%S")
    
                # Columns 8 and 9 are c-ip and cs(User-Agent) in the #Fields line above.
                ipAddr, userAgent = fields[8], fields[9]
    
                if ipAddr not in ipDict:
                    ipDict[ipAddr] = {userAgent: [1, newRequestTime]}
                elif userAgent not in ipDict[ipAddr]:
                    ipDict[ipAddr][userAgent] = [1, newRequestTime]
                else:
                    ipdipaua = ipDict[ipAddr][userAgent]
                    # A gap of at least sessionTimeout starts a new session.
                    if newRequestTime - ipdipaua[1] >= sessionTimeout:
                        ipdipaua[0] += 1
                    ipdipaua[1] = newRequestTime
        return recordsRead
    
    def outputSessions():
        recordsWritten = len(ipDict)
    
        with open(outfileName, "w") as outfile:
            outfile.write("#Fields: IPAddr Sessions\n")
            for ip, val in ipDict.items():
                # Total sessions for this IP, summed over all of its user agents.
                totalSessions = sum(v2[0] for v2 in val.values())
                outfile.write("%s\t%d\n" % (ip, totalSessions))
    
        return recordsWritten
    
    recordsRead = inputRecords()
    
    recordsWritten = outputSessions()
    
    print("Finished session reconstruction: read %d records, wrote %d\n" % (recordsRead, recordsWritten))
    

    Update: This took 39 seconds to input and process 342K records and write 21K records. That’s good enough speed for my purposes. Apparently 3/4 of that time was spent in strptime()!
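
Because the timestamps in this log have a fixed layout, the `strptime()` cost mentioned above could probably be reduced by slicing the date and time strings directly. The following is only a sketch: `parse_timestamp` is a hypothetical helper, it assumes every record uses exactly the `YYYY-MM-DD` and `HH:MM:SS` layout shown in the sample input, and the actual speed-up has not been measured here.

```python
import datetime


def parse_timestamp(the_date, the_time):
    """Parse 'YYYY-MM-DD' and 'HH:MM:SS' strings by slicing fixed positions.

    The separator characters are not checked; non-numeric digits or
    out-of-range values raise ValueError.
    """
    return datetime.datetime(
        int(the_date[0:4]), int(the_date[5:7]), int(the_date[8:10]),
        int(the_time[0:2]), int(the_time[3:5]), int(the_time[6:8]),
    )


fast = parse_timestamp("2010-09-21", "23:59:59")
slow = datetime.datetime.strptime("2010-09-21 23:59:59", "%Y-%m-%d %H:%M:%S")
print(fast == slow)  # True
```

In the script above, this would replace the `strptime()` call that builds `newRequestTime`; the rest of the session logic would stay the same.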
