I have a collection of documents and I’m trying to pull the dates out of them. They are mostly plain text and HTML, but the date formats they use vary greatly (though they are all English dates). How can I find and parse dates like these in a long string of text?
updated 2011-03-21T00:43:14
Sunday, March 20, 2011
Wednesday, March 16, 2011 | 11:25 AM
March 20, 2011 @ 12:21 pm
May 5, 2011
Published March 19, 2011
Some text here (March 19, 2011)
10/28/2011 21:16
<a href="#>Author Name</a> on Mar 17th 2011 ...
Location, ABBR., Jan. 8, 2008
01/07/2008 (6:00 pm)
By Author Name and Company 03/19/2011 09:59
Posted by Author Name on March 16, 2011 at 03:20 PM EDT
Have a look at the strtotime function.
Edit: Here is a more complete example showing how to parse a bunch of the dates provided.
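A minimal sketch of that kind of loop, not a definitive implementation: the helper name `parseCandidates` is hypothetical, and the UTC timezone is an assumption made only so the output does not depend on server settings. Each string is passed to `strtotime`, which returns `false` when it cannot make sense of the input.

```php
<?php
// Assumption: pin the timezone so results do not depend on server settings.
date_default_timezone_set('UTC');

/**
 * Hypothetical helper: run each candidate string through strtotime() and
 * map it to a formatted date, or to null when strtotime() rejects it.
 */
function parseCandidates(array $candidates): array
{
    $results = [];
    foreach ($candidates as $candidate) {
        $timestamp = strtotime($candidate);
        $results[$candidate] = ($timestamp === false)
            ? null
            : date('Y-m-d H:i', $timestamp);
    }
    return $results;
}

// A few of the strings from the question, already isolated from their text.
$candidates = [
    '2011-03-21T00:43:14',
    'Sunday, March 20, 2011',
    'May 5, 2011',
    '10/28/2011 21:16',
    'Mar 17th 2011',
    'March 20, 2011 @ 12:21 pm',
];

foreach (parseCandidates($candidates) as $input => $parsed) {
    echo $input, ' => ', $parsed ?? 'could not parse', PHP_EOL;
}
```

Checking for `false` with `===` matters here, because a valid timestamp of `0` would otherwise be mistaken for a failure.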
Of course, some dates will not parse as-is because they are not among the date formats that strtotime supports. For those, you’ll need some additional filtering or parsing, either to extract the date or to reshape it into a string suitable for strtotime.
Edit: Since there’s an interest in further processing of the input string, here is an example of how you can parse the text without using a regex to get the dates out. Notice that some of the dates just can’t be extracted this way; for those you will need either more string processing or a regex.
As a side note, I would investigate using a regex if the provided strings are only a sample of the many kinds of lines that contain dates. However, if the provided strings are the only formats the dates will appear in, string processing should be enough.
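One way to do this with plain string functions is sketched below. It assumes the lines look like the samples in the question: the date follows the last " on " when there is one, and the only lead-in words are "updated" and "Published". The helper name `extractCandidate` and that prefix list are illustrative assumptions, not a general solution.

```php
<?php
// Assumption: pin the timezone so results do not depend on server settings.
date_default_timezone_set('UTC');

/**
 * Hypothetical helper: cut a line down to the part most likely to be a
 * date, using only string functions (no regex).
 */
function extractCandidate(string $line): string
{
    $line = trim($line);

    // "Posted by Author Name on March 16, 2011": keep what follows " on ".
    $pos = strrpos($line, ' on ');
    if ($pos !== false) {
        $line = substr($line, $pos + strlen(' on '));
    }

    // Drop lead-in words seen in the sample lines (assumed list).
    foreach (['updated ', 'Published '] as $prefix) {
        if (strncmp($line, $prefix, strlen($prefix)) === 0) {
            $line = substr($line, strlen($prefix));
        }
    }

    return $line;
}

$lines = [
    'updated 2011-03-21T00:43:14',
    'Published March 19, 2011',
    'Some text here (March 19, 2011)',
];

foreach ($lines as $line) {
    $candidate = extractCandidate($line);
    $timestamp = strtotime($candidate);
    echo $line, ' => ',
        ($timestamp === false) ? 'could not parse' : date('Y-m-d H:i', $timestamp),
        PHP_EOL;
}
```

Lines such as the parenthesised one pass through unchanged, which is why they still need further processing or a regex.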
Final Edit:
Finally, here is an example of how to do some text processing so that more of the dates parse.
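A sketch of one such clean-up step, under the assumption that the separators getting in the way are `@`, `|`, and parentheses, as in the sample lines. The helper name `normalizeCandidate` is hypothetical; it replaces those characters with spaces and collapses the leftover whitespace before the string goes to `strtotime`.

```php
<?php
// Assumption: pin the timezone so results do not depend on server settings.
date_default_timezone_set('UTC');

/**
 * Hypothetical helper: remove separators that get in the way of
 * strtotime() and collapse the leftover runs of spaces.
 */
function normalizeCandidate(string $text): string
{
    // Swap the offending separators for plain spaces (assumed list).
    $text = str_replace(['@', '|', '(', ')'], ' ', $text);

    // Split on spaces and drop the empty pieces left by doubled spaces.
    $words = array_filter(explode(' ', $text), 'strlen');

    return implode(' ', $words);
}

$samples = [
    'March 20, 2011 @ 12:21 pm',
    'Wednesday, March 16, 2011 | 11:25 AM',
    '01/07/2008 (6:00 pm)',
];

foreach ($samples as $sample) {
    $clean = normalizeCandidate($sample);
    $timestamp = strtotime($clean);
    echo $sample, ' => ',
        ($timestamp === false) ? 'could not parse' : date('Y-m-d H:i', $timestamp),
        PHP_EOL;
}
```

This step only cleans a fragment that has already been isolated; it can be chained after whatever extraction is used to cut the date out of the surrounding text.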