I’m trying to transform a large dataset into the format required for analysis with the flowstrates package.
What I currently have is a large file (600k trips) with origin and destination points.
Format is sort of like this:
tripID  Month  start_pt  end_pt
1       June   1         3
2       June   1         3
3       July   1         5
4       July   1         7
5       July   1         7
What I need to be able to generate is a file that has trip counts by unit time (let’s say months) in a format like this:
start_pt  end_pt  June  July  August  ...  December
1         3       2     0     5       ...  9
1         5       0     1     4       ...  4
1         7       0     2     0       ...  0
It’s easy enough to pre-segment the data by month and then generate counts for each origin-destination pair, but putting it all back together causes all sorts of problems, since the pre-segmented chunks have very different sizes. So it seems that I’d need to do this for the entire dataset at once.
Are there any packages for doing this type of processing? Would it be easier to do this in something like SQL or SQLite?
Thanks in advance for any help.
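You can use the reshape2 package to do this fairly easily. If your data is dat, cast it to wide format with dcast, keeping start_pt and end_pt as the row identifiers, spreading Month across the columns, and aggregating tripID with length.
This gives a single entry for each start_pt/end_pt/Month combination, the value of which is how many cases had that combination (the length of tripID for that set).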
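The code for this step did not survive in the post, so here is a minimal sketch of the reshape2 approach. It assumes the intended call is dcast with length as the aggregate, and that the columns are named as in the question. The toy data frame and the name wide are illustrative only.

```r
library(reshape2)

# Toy data mirroring the layout in the question (illustrative values only)
dat <- data.frame(
  tripID   = 1:5,
  Month    = c("June", "June", "July", "July", "July"),
  start_pt = c(1, 1, 1, 1, 1),
  end_pt   = c(3, 3, 5, 7, 7)
)

# Make Month a factor in calendar order so the columns are not sorted
# alphabetically (month.name is a built-in vector of English month names)
dat$Month <- factor(dat$Month, levels = month.name)

# One row per start_pt/end_pt pair, one column per month that occurs in the
# data; each cell is the number of tripIDs in that combination, and pairs
# with no trips in a month get 0
wide <- dcast(dat, start_pt + end_pt ~ Month,
              value.var = "tripID", fun.aggregate = length)

print(wide)
```

On the five sample rows this should reproduce the June and July columns of the desired table: 2 and 0 for the pair 1/3, 0 and 1 for 1/5, 0 and 2 for 1/7. Because the whole dataset is cast in one call, there is no need to pre-segment by month and merge the pieces afterwards.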