Here is the situation:
I am making a small program to parse server log files.
I tested it with a log file containing several thousand requests (between 10,000 and 20,000 – I don't know exactly).
What I have to do is load the log text files into memory so that I can query them.
This is what takes the most resources.
The methods that take the most CPU time are these (worst culprits first):
- String.Split – splits the line values into an array of values
- String.Contains – checks whether the user agent contains a specific agent string (to determine the browser ID)
- String.ToLower – various purposes
- StreamReader.ReadLine – reads the log file line by line
- String.StartsWith – determines whether a line is a column-definition line or a line with values
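To make the hot path concrete, here is a minimal sketch of that kind of per-line work. The original code is .NET; this illustration uses Java, and the class, method names, and sample log line are all hypothetical – only the pattern (read line, skip `#` headers, split, lower-case, substring match) comes from the list above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LogParseSketch {
    // Split one space-separated log line into fields (analogous to String.Split).
    public static String[] parseLine(String line) {
        return line.split(" ");
    }

    // Crude browser detection via substring search on the user agent
    // (analogous to String.ToLower + String.Contains).
    public static String browserId(String userAgent) {
        String ua = userAgent.toLowerCase();
        if (ua.contains("firefox")) return "Firefox";
        if (ua.contains("chrome"))  return "Chrome";
        return "Other";
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical two-line log; real user agents contain spaces,
        // which a real parser would have to handle.
        String log = "#Fields: date time cs(User-Agent)\n"
                   + "2009-08-01 12:00:00 Mozilla/5.0(Firefox)\n";
        BufferedReader reader = new BufferedReader(new StringReader(log));
        String line;
        while ((line = reader.readLine()) != null) {   // analogous to StreamReader.ReadLine
            if (line.startsWith("#")) continue;        // skip column-definition lines
            String[] fields = parseLine(line);
            System.out.println(browserId(fields[2]));  // prints "Firefox"
        }
    }
}
```

Each of those calls is cheap on its own; the cost described above comes from running all of them tens of thousands of times per file.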
There were some others that I was able to replace. For example, the dictionary getter was taking a lot of resources too, which I had not expected, since it's a dictionary and should have its keys indexed. I replaced it with a multidimensional array and saved some CPU time.
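For illustration, the difference being described is between hashing a string key on every log entry and resolving the index once, then using plain integer indexing. This sketch is Java rather than .NET, and all the names and sample fields are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldLookup {
    // Look a field up through the map: hashes the key string on every call.
    public static String fieldByName(String[] fields,
                                     Map<String, Integer> indexByName,
                                     String name) {
        return fields[indexByName.get(name)];
    }

    public static void main(String[] args) {
        Map<String, Integer> indexByName = new HashMap<>();
        indexByName.put("date", 0);
        indexByName.put("cs(User-Agent)", 2);

        String[] fields = {"2009-08-01", "12:00:00", "Mozilla/5.0"};

        // Slow path: hash "cs(User-Agent)" once per log entry.
        String uaViaMap = fieldByName(fields, indexByName, "cs(User-Agent)");

        // Fast path: resolve the index once per file, then index directly.
        int uaIndex = indexByName.get("cs(User-Agent)"); // done once
        String uaViaArray = fields[uaIndex];             // done per entry

        System.out.println(uaViaMap.equals(uaViaArray)); // prints "true"
    }
}
```

A dictionary lookup is O(1), but the constant factor (hashing the string, probing) is much larger than a bare array index, which is why hoisting the lookup out of the per-entry loop helps.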
Now, I am running on a fast dual core, and the total time it takes to load the file I mentioned is about one second.
This is really bad.
Imagine a site that gets tens of thousands of visits a day: it's going to take minutes to load the log file.
So what are my alternatives – if any? I suspect this is just a .NET limitation and I can't do much about it.
EDIT:
If some of you gurus want to look at the code and find the problem, here are my code files:
- http://freehosting1.net/temp/data.txt
- http://freehosting1.net/temp/logentry.txt
- http://freehosting1.net/temp/lists.txt
The function that takes the most resources is by far LogEntry.New.
The function that loads all the data is called Data.Load.
Total number of LogEntry objects created: 50,000. Time taken: 0.9–1.0 seconds.
CPU: AMD Phenom II X2 545, 3 GHz.
Not multithreaded.
Without seeing your code, it’s hard to know whether you’ve got any mistakes there which are costing you performance. Without seeing some sample data, we can’t reasonably try experiments to see how we’d fare ourselves.
What was your dictionary key before? Moving to a multi-dimensional array sounds like an odd move – but we’d need more information to know what you were doing with the data before.
Note that unless you’re explicitly parallelizing the work, having a dual core machine won’t make any difference. If you’re really CPU bound then you could parallelize – although you’d need to do so carefully; you would quite probably want to read a “chunk” of text (several lines) and ask one thread to parse it rather than handing off one line at a time. The resulting code would probably be significantly more complex though.
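The chunking idea above can be sketched roughly as follows. This is Java rather than .NET, the chunk size and all names are made up for illustration, and the "parsing" is reduced to counting non-comment lines; the point is only the shape: the reader thread hands whole chunks to a pool so the hand-off overhead is amortised over many lines.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedParse {
    static final int CHUNK_SIZE = 1000; // arbitrary; worth tuning

    // Stand-in for real per-line parsing: count non-comment lines in a chunk.
    public static int parseChunk(List<String> chunk) {
        int entries = 0;
        for (String line : chunk) {
            if (!line.startsWith("#")) entries++;
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        // Build a fake 5,000-entry log in memory for the sketch.
        StringBuilder sb = new StringBuilder("#Fields: date time cs(User-Agent)\n");
        for (int i = 0; i < 5000; i++) sb.append("field1 field2 field3\n");
        BufferedReader reader = new BufferedReader(new StringReader(sb.toString()));

        ExecutorService pool = Executors.newFixedThreadPool(2); // dual core
        List<Future<Integer>> results = new ArrayList<>();

        // Read a chunk of lines, then hand the whole chunk to a worker thread.
        List<String> chunk = new ArrayList<>(CHUNK_SIZE);
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.add(line);
            if (chunk.size() == CHUNK_SIZE) {
                final List<String> c = chunk;
                results.add(pool.submit(() -> parseChunk(c)));
                chunk = new ArrayList<>(CHUNK_SIZE);
            }
        }
        if (!chunk.isEmpty()) {
            final List<String> c = chunk;
            results.add(pool.submit(() -> parseChunk(c)));
        }

        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        pool.shutdown();
        System.out.println(total); // prints 5000
    }
}
```

Whether this wins in practice depends on how expensive per-chunk parsing is relative to the coordination cost, which is exactly why it needs doing carefully.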
I don’t know whether one second for 10,000 lines is reasonable or not, to be honest – if you could post some sample data and what you need to do with it, we could give more useful feedback.
EDIT: Okay, I’ve had a quick look at the code. A few thoughts…
Most importantly, this probably isn’t something you should do “on demand”. Instead, parse periodically as a background process (e.g. when logs roll over) and put the interesting information in a database – then query that database when you need to.
However, to optimise the parsing process:
- Don't check whether the StreamReader is at the end – just call ReadLine until the result is Nothing.
- line.StartsWith("#") might be replaceable with something cheaper, such as checking the first character directly – I'd have to test.
- Create a LineFormat class which can cope with any field names, but specifically remembers the index of fields that you know you're going to want. This also avoids copying the complete list of fields for each log entry, which is pretty wasteful.

There are probably other things, but I'm afraid I don't have the time to go into them now 🙁
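A rough sketch of that LineFormat idea, in Java for illustration (the real code would be VB.NET, and the field name and sample lines here are hypothetical): parse the `#Fields:` header once, remember the index of each field you care about, and extract only those fields per entry instead of keeping the whole array.

```java
import java.util.Arrays;

public class LineFormat {
    private final int uaIndex; // index of the user-agent column, resolved once per file

    public LineFormat(String headerLine) {
        // e.g. "#Fields: date time cs(User-Agent)"
        String[] names = headerLine.substring("#Fields: ".length()).split(" ");
        uaIndex = Arrays.asList(names).indexOf("cs(User-Agent)");
    }

    // Extract only the field we want, rather than storing every field per entry.
    public String userAgent(String line) {
        return line.split(" ")[uaIndex];
    }

    public static void main(String[] args) {
        LineFormat fmt = new LineFormat("#Fields: date time cs(User-Agent)");
        System.out.println(fmt.userAgent("2009-08-01 12:00:00 Mozilla/5.0"));
        // prints "Mozilla/5.0"
    }
}
```

The per-entry saving is twofold: no repeated name-to-index lookup, and no copy of the full field list kept alive for every one of the 50,000 LogEntry objects.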