I want to parse elements of RFC822 (SMTP) "Received" lines, which are defined formally in the spec, e.g.:
atom = 1*<any CHAR except specials, SPACE and CTLs>
[...]
received = "Received" ":"        ; one per relay
           ["from" domain]       ; sending host
           ["by" domain]         ; receiving host
           ["via" atom]          ; physical path
           *("with" atom)        ; link/mail protocol
           ["id" msg-id]         ; receiver msg id
           ["for" addr-spec]     ; initial form
           ";" date-time         ; time received
[...]
msg-id = "<" addr-spec ">" ; Unique message id
[...]
addr-spec = local-part "@" domain ; global address
and so on for domain, date-time, etc.
Here’s a real example:
Received: from ll-194.132.162.89.kv.sovam.net.ua (ll-194.132.162.89.kv.sovam.net.ua [83.170.243.194] (may be forged)) by raq2073.uk2.net (8.10.2/8.10.2) with ESMTP id lASHDDE10765 for <johnsmithsvt@matts.co.uk>; Wed, 28 Nov 2007 17:13:13 GMT
Would regex be a good strategy to capture the parts of a received line?
I realize that many SMTP servers don’t format received lines properly (in real life).
Otherwise, does anyone know of a library in Java that does this well?
Edit: Here’s a fiddle showing a regex and tests that I’ve been working on for a while; it seems to work.
Received:\s+(?:from\s+(.+?))?(?:\(qmail (.+?)\))?(?:\s+by\s+(.+?))?(?:\s+via\s+(.+?))?(?:\s+with\s+(.+?))?(?:\;?\s+id\s+(.+?))?(?:\s+for\s+(.+?))?(?:;\s*(?!.*\;.*)(.+))?$
The choice really depends on exactly what you want to achieve.
For capturing specific parts of a Received line (e.g. ‘give me the from part’), regexes are a great fit.
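As a rough illustration of that "give me one part" use case, here is a minimal Java sketch. The class name ReceivedFields and its extract method are made up for this example and are not from any library. It assumes each keyword (from, by, via, with, id, for) is followed by whitespace and a single token, and that the date is everything after the last semicolon. It does not understand comments, so a keyword that appears inside parentheses could fool it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
final class ReceivedFields {
    // Clause keywords from the RFC 822 "received" rule.
    private static final String[] KEYWORDS = {"from", "by", "via", "with", "id", "for"};

    // Returns the first token after each keyword that is present, plus the date.
    static Map<String, String> extract(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String kw : KEYWORDS) {
            // keyword, whitespace, optional '<', then one token without '>', space or ';'
            Pattern p = Pattern.compile("\\b" + kw + "\\s+<?([^>\\s;]+)>?");
            Matcher m = p.matcher(line);
            if (m.find()) {
                fields.put(kw, m.group(1));
            }
        }
        // Assumption: the date-time is whatever follows the last ';'.
        int semi = line.lastIndexOf(';');
        if (semi >= 0) {
            fields.put("date", line.substring(semi + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "Received: from ll-194.132.162.89.kv.sovam.net.ua "
                + "(ll-194.132.162.89.kv.sovam.net.ua [83.170.243.194] (may be forged)) "
                + "by raq2073.uk2.net (8.10.2/8.10.2) with ESMTP id lASHDDE10765 "
                + "for <johnsmithsvt@matts.co.uk>; Wed, 28 Nov 2007 17:13:13 GMT";
        extract(line).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

On the example line from the question this yields the sending host, the receiving host, ESMTP, the id, the recipient address without angle brackets, and the date string. There is no via entry because that clause is absent.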
If you need a full-fledged parser for this grammar, then regexes alone will not suffice. The addr-spec in particular has so many special cases that a regex cannot hope to handle every one correctly. Regexes are not parsers.
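One concrete case where a plain regex runs out of steam: RFC 822 comments are written in parentheses and may nest, as in the "(... (may be forged))" part of the example line, and java.util.regex has no recursion to match balanced parentheses. A small hand-written scanner handles it with a depth counter. The sketch below is only an illustration (CommentStripper is a made-up name), and it assumes simplified handling of quoted strings and backslash escapes rather than the full grammar.

```java
// Hypothetical helper for illustration only; not part of any library.
final class CommentStripper {
    // Removes RFC 822 style comments, including nested ones, from a header value.
    static String stripComments(String s) {
        StringBuilder out = new StringBuilder();
        int depth = 0;            // current comment nesting level
        boolean inQuotes = false; // inside a quoted-string (outside any comment)
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                // Backslash escapes the next character; keep both only outside comments.
                if (depth == 0) {
                    out.append(c).append(s.charAt(i + 1));
                }
                i++;
            } else if (depth == 0 && c == '"') {
                inQuotes = !inQuotes;
                out.append(c);
            } else if (!inQuotes && c == '(') {
                depth++;          // comment opens, possibly nested
            } else if (!inQuotes && c == ')' && depth > 0) {
                depth--;          // comment closes
            } else if (depth == 0) {
                out.append(c);    // ordinary character outside comments
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "from a.example (a.example [192.0.2.1] (may be forged)) "
                + "by b.example (8.10.2/8.10.2) with ESMTP";
        // Collapse the whitespace left behind where comments were removed.
        System.out.println(stripComments(value).replaceAll("\\s+", " "));
    }
}
```

Stripping comments first also makes targeted regexes safer, since keywords such as "for" or "by" inside a comment can no longer be mistaken for clause keywords.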
Last time I needed an actual parser, I wrote my own using JavaCC. I would only recommend going down that road if you know a thing or two about grammars and parsing.