I have this raw text:
________________________________________________________________________________________________________________________________
Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
3 37 Bruce Cook Bruce Cook Ford Escort 3759 10 9:56.4388 4 0:58.3359
4 18 Troy Marinelli Troy Marinelli Nissan Silvia 3396 10 9:56.7758 2 0:58.4443
5 75 Anthony Gilbertson Anthony Gilbertson BMW M3 3200 10 10:02.5842 3 0:58.9336
6 26 Trent Purcell Trent Purcell Mazda RX7 2354 10 10:07.6285 4 0:59.0546
7 12 Scott Hunter Scott Hunter Toyota Corolla 2000 10 10:11.3722 5 0:59.8921
8 91 Graeme Wilkinson Graeme Wilkinson Ford Escort 2000 10 10:13.4114 5 1:00.2175
9 7 Justin Wade Justin Wade BMW M3 4000 10 10:18.2020 9 1:00.8969
10 55 Greg Craig Grag Craig Toyota Corolla 1840 10 10:18.9956 7 1:00.7905
11 46 Kyle Orgam-Moore Kyle Organ-Moore Holden VS Commodore 6000 10 10:30.0179 3 1:01.6741
12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728
13 177 Mark Hyde Mark Hyde Ford Escort 1993 10 10:49.5920 2 1:03.8069
14 34 Peter Draheim Peter Draheim Mazda RX3 2600 10 10:50.8159 10 1:03.4396
15 5 Scott Douglas Scott Douglas Datsun 1200 1998 9 9:48.7808 3 1:01.5371
16 72 Paul Redman Paul Redman Ford Focus 2lt 9 10:11.3707 2 1:05.8729
17 8 Matthew Speakman Matthew Speakman Toyota Celica 1600 9 10:16.3159 3 1:05.9117
18 74 Lucas Easton Lucas Easton Toyota Celica 1600 9 10:16.8050 6 1:06.0748
19 77 Dean Fuller Dean Fuller Mitsubishi Sigma 2600 9 10:25.2877 3 1:07.3991
20 16 Brett Batterby Brett Batterby Toyota Corolla 1600 9 10:29.9127 4 1:07.8420
21 95 Ross Hurford Ross Hurford Toyota Corolla 1600 8 9:57.5297 2 1:12.2672
DNF 13 Charles Wright Charles Wright BMW 325i 2700 9 9:47.9888 7 1:03.2808
DNF 20 Shane Satchwell Shane Satchwell Datsun 1200 Coupe 1998 1 1:05.9100 1 1:05.9100
Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012 Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended
I need to parse it into an object with the obvious Position, Car, Driver, etc. fields. The issue is that I have no idea what strategy to use. If I split it on whitespace, I would end up with a list like this:
["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
Can you see the issue? I cannot just interpret this list by position, because a person may have one name or three words in a name, and a car may be described by any number of words. That makes it impossible to reference the fields by index alone.
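To make the problem concrete, here is a minimal Ruby sketch using two rows from the results above. It shows that the token count differs per row, and that only the fields at the two ends can be taken by position. The `peel` helper and the field names are hypothetical, and reading the last five tokens as capacity, laps, race time, fastest-lap number and fastest-lap time is an assumption based on the header line.

```ruby
# Two rows from the results above, with runs of whitespace collapsed.
rows = [
  "1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*",
  "12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728",
]

# The same logical columns produce a different number of tokens per row.
token_counts = rows.map { |row| row.split.length } # => [13, 15]

# Hypothetical helper: take the fixed-count fields from both ends and keep
# everything in between as an unresolved "middle".
def peel(row)
  tokens = row.split
  pos, car = tokens.shift(2) # first two tokens
  cap, laps, race_time, fastest_lap_no, fastest_lap_time = tokens.pop(5) # last five
  {
    pos: pos, car: car,
    middle: tokens, # team + driver + vehicle, still ambiguous
    cap: cap, laps: laps, race_time: race_time,
    fastest_lap_no: fastest_lap_no, fastest_lap_time: fastest_lap_time
  }
end

peeled = rows.map { |row| peel(row) }
peeled.each { |r| puts "#{r[:pos]}: #{r[:middle].length} ambiguous tokens" }
```

The ends are easy, but nothing in the `middle` tokens says where the team ends and the driver or vehicle begins, which is why whitespace splitting alone is not enough.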
What about using the offsets defined by the column names? I can’t quite see how that could be used though.
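One way the header offsets could be used, sketched in Ruby under a strong assumption: every value is left-aligned under its column name and never runs into the next column. The sample below is a hypothetical, simplified excerpt built with `format` so that the alignment is exact; the real report's column positions are not known here, and `column_starts` and `slice_by_starts` are illustrative names.

```ruby
# Hypothetical left-aligned excerpt standing in for the original report.
ROW_FORMAT = "%-4s%-5s%-17s%-19s%s"
sample = [
  %w[Pos Car Driver Vehicle Laps],
  ["1", "6", "Jason Clements", "BMW M3", "10"],
  ["12", "39", "Trent Spencer", "BMW Mini Cooper S", "10"],
  ["DNF", "20", "Shane Satchwell", "Datsun 1200 Coupe", "1"],
].map { |cells| format(ROW_FORMAT, *cells) }

# Start index of every column name in the header line.
def column_starts(header)
  starts = []
  header.scan(/\S+/) { starts << Regexp.last_match.begin(0) }
  starts
end

# Cut a line at the header offsets; the last column runs to the end of the line.
def slice_by_starts(line, starts)
  starts.each_with_index.map do |start, i|
    stop = starts[i + 1] || line.length
    line[start...stop].to_s.strip
  end
end

header, *rows = sample
starts = column_starts(header)
names = header.split.map { |name| name.downcase.to_sym }
records = rows.map { |row| names.zip(slice_by_starts(row, starts)).to_h }

records.each { |record| puts record.inspect }
```

Multi-word values such as "BMW Mini Cooper S" stay intact because the cut points come from the header, not from the data. Right-aligned columns, or values wider than their header slot, would need explicit widths instead.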
Edit: So the current algorithm I am using works like this:
- Split the text on new line giving a collection of lines.
- Find the common whitespace characters FURTHEST RIGHT on each line, i.e. the positions (indexes) on each line where every other line also contains whitespace.
- Split the lines based on those common characters.
- Trim the lines
Several issues exist:
If the names all happen to be made of words of the same length, like so:
Jason Adams
Bobby Sacka
Jerry Louis
Then it will interpret each name as two separate items: (["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).
Whereas if they all differed like so:
Dominic Bou
Bob Adams
Jerry Seinfeld
Then it would correctly split after the last ‘d’ in Seinfeld (and thus we’d get a collection of three names: ["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).
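Both cases can be reproduced with a minimal Ruby sketch of the steps listed in the edit. This is one reading of that algorithm, not the exact code in use: it treats every column index where all lines have a space as a separator, and `split_on_common_whitespace` is a hypothetical name.

```ruby
# Split lines at the column indexes where every line contains a space.
def split_on_common_whitespace(lines)
  width = lines.map(&:length).max
  padded = lines.map { |line| line.ljust(width) } # pad so all lines are equally long

  # Column indexes at which every line has a space.
  common = (0...width).select { |i| padded.all? { |line| line[i] == " " } }

  # Each maximal run of remaining column indexes is treated as one field.
  field_cols = (0...width).to_a - common
  ranges = field_cols.slice_when { |a, b| b != a + 1 }.map { |run| run.first..run.last }

  # Cut every line at those ranges and trim the pieces.
  padded.map { |line| ranges.map { |range| line[range].strip } }
end

same_lengths = split_on_common_whitespace(["Jason Adams", "Bobby Sacka", "Jerry Louis"])
# => [["Jason", "Adams"], ["Bobby", "Sacka"], ["Jerry", "Louis"]]

mixed_lengths = split_on_common_whitespace(["Dominic Bou", "Bob Adams", "Jerry Seinfeld"])
# => [["Dominic Bou"], ["Bob Adams"], ["Jerry Seinfeld"]]
```

The outcome depends entirely on whether the spaces inside a column happen to line up across all rows, so the split is decided by the data, not by the layout.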
It’s also quite brittle. I am looking for a nicer solution.
You can use the fixed_width gem. Your given file can be parsed with the following code:
The trap method identifies the lines in each section. I used regex:
- The head regex looks for lines that don’t contain a digit.
- The body regex looks for lines starting with a digit or “DNF”.
Each section must include the line immediately after the last.
The column definitions simply identify the number of columns to grab. The library strips whitespace for you. If you wanted to produce a fixed-width file, you can add alignment parameters, but it doesn’t appear you will need that.
The result is a hash that starts like this: