I'm trying to parse a PDF to XML in C#, and I want to extract headings such as "I. INTRODUCTION" and "II. PAGE LAYOUT", which are numbered with Roman numerals in my file. I would like to write a regex to match strings like these. I tried a couple of things, but they don't work. Can anyone help?
This should do what you need (the dot is escaped so that it matches a literal period rather than any character):
[IVXLCDM]+\. [A-Z ]+
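To show how that pattern might be used from C#, here is a minimal sketch that collects every match from a block of extracted text. The class name `HeadingFinder` and the method `FindHeadings` are made up for illustration, and the sketch assumes the PDF text has already been extracted into a string with one heading per line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeadingFinder
{
    // Roman numeral letters, a literal dot, one space, then capitals and spaces.
    private static readonly Regex HeadingPattern =
        new Regex(@"[IVXLCDM]+\. [A-Z ]+");

    // Returns every substring of the text that matches the heading pattern.
    public static List<string> FindHeadings(string text)
    {
        var headings = new List<string>();
        foreach (Match match in HeadingPattern.Matches(text))
        {
            // The character class allows spaces, so trim any trailing ones.
            headings.Add(match.Value.TrimEnd());
        }
        return headings;
    }
}
```

For the input "I. INTRODUCTION\nSome body text.\nII. PAGE LAYOUT\nMore text." this returns the two headings. Because the pattern is not anchored, it can also pick up fragments inside body text, for example "C. IS" in "VITAMIN C. IS".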
On the other hand, if you want to make sure that the string contains only a Roman numeral and a heading name, you might want to use this:
^[IVXLCDM]+\. [A-Z ]+$
The ^ and $ are called anchors. The ^ requires the match to start at the very beginning of the string, while the $ requires it to end at the very end of the string.
The complete list of Roman numerals can be obtained from Wikipedia.
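As a rough illustration of the anchored form, the sketch below tests one line at a time, so a line counts as a heading only if the whole line fits the pattern. `HeadingValidator` and `IsHeading` are hypothetical names, and trimming the line first is an assumption about how the extracted text looks.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HeadingValidator
{
    // ^ and $ force the pattern to cover the entire input string.
    private static readonly Regex WholeLineHeading =
        new Regex(@"^[IVXLCDM]+\. [A-Z ]+$");

    // True only when the trimmed line is a numeral, a dot, a space,
    // and an all-capitals title, with nothing before or after.
    public static bool IsHeading(string line)
    {
        return WholeLineHeading.IsMatch(line.Trim());
    }
}
```

With this, `IsHeading("II. PAGE LAYOUT")` is true, while `IsHeading("See section II. PAGE LAYOUT")` and `IsHeading("II. Page Layout")` are false. Note that the character class only checks which letters appear, so a malformed numeral such as "IIII" or "VX" is still accepted.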