I used the regex given in perlfaq6 to match and remove JavaScript comments, but it results in a segmentation fault when the string is too long. The regex is:
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
Can it be improved to avoid the segmentation fault?
[EDIT]
Long input:
<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />
repeated about 1000 times.
I suspect the trouble is partly that your ‘C code’ isn’t very much like C code. In C, you can’t have the sequence \" outside a pair of quotes, single or double, for example.
I adapted the regex to make it readable and wrapped it in a trivial script that slurps its input and applies the regex to it.
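A readable layout of the regex might look like the sketch below. It is a reconstruction under assumptions, not necessarily the script used for the timings: the `/x` layout, the helper name `strip_comments`, and the switch to non-capturing groups (so the kept text is `$1` instead of `$3`) are all mine. It runs on a small built-in sample rather than slurped input so that it is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: the regex from the question, laid out with /x.
# Helper groups are non-capturing, so the kept text is group 1, not group 3.
sub strip_comments {
    my ($text) = @_;
    $text =~ s{
        /\*                      # opening of a block comment
        [^*]* \*+                # non-stars, then one or more stars
        (?: [^/*] [^*]* \*+ )*   # more comment text, again ending in stars
        /                        # closing slash
      |
        // (?: [^\\] | [^\n][\n]? )*? \n    # line comment, allowing continued lines
      |
        (                                   # text to keep
            " (?: \\. | [^"\\] )* "         # double-quoted string
          | ' (?: \\. | [^'\\] )* '         # single-quoted string
          | . [^/"'\\]*                     # anything else
        )
    }{ defined $1 ? $1 : "" }gsex;
    return $text;
}

# Small demonstration input. To process a real file instead, slurp it
# with something like: do { local $/; <> }
my $sample = <<'END';
int x = 1; /* set x */
char *s = "/* not a comment */"; // trailing
END

print strip_comments($sample);
```

Note that the line-comment alternative consumes the terminating newline, so a line that ends in a `//` comment is joined to whatever follows it.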
I took your line of non-C code, and created files data.1, data.2, data.4, data.8, …, data.1024 with the appropriate number of lines in each. I then ran a timing loop.
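The file generation and timing can be sketched in Perl as well. This is an assumed equivalent, not the loop actually run: the temporary directory, the helper names `make_data_files` and `time_strip`, and the small default maximum of 16 lines are illustrative choices. Pass a larger maximum (for example 1024) on the command line to reproduce the slow cases.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Time::HiRes qw(time);

# The sample line from the question, stored verbatim (backslashes included).
my $line = '<ent r=\"6\" t=\"259\" w=\"252\" /><ent r=\"6\" t=\"257\" w=\"219\" />' . "\n";

# Write data.1, data.2, data.4, ... up to the given maximum line count.
sub make_data_files {
    my ($dir, $max) = @_;
    my @files;
    for (my $n = 1; $n <= $max; $n *= 2) {
        my $file = "$dir/data.$n";
        open my $out, '>', $file or die "Cannot write $file: $!";
        print {$out} $line x $n;
        close $out or die "Cannot close $file: $!";
        push @files, [$n, $file];
    }
    return @files;
}

# Slurp one file, apply the regex from the question, return elapsed seconds.
sub time_strip {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $text = do { local $/; <$in> };
    close $in;
    my $start = time();
    $text =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
    return time() - $start;
}

# Hypothetical default of 16 lines keeps the run short.
my $max = shift(@ARGV) // 16;
my $dir = tempdir(CLEANUP => 1);

for my $entry (make_data_files($dir, $max)) {
    my ($n, $file) = @$entry;
    printf "%5d lines: %.3f s\n", $n, time_strip($file);
}
```

This measures only the substitution itself, so the figures will be somewhat lower than wall-clock times that include interpreter start-up.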
I’ve munged the output to give just the real time for the different file sizes:
I did not get a core dump (Perl 5.16.0 on Mac OS X 10.7.4; 8 GiB main memory). It does begin to take a significant amount of time. While the script was running, the process was not growing; during the 1024-line run, it was using about 13 MiB of ‘real’ memory and 23 MiB of ‘virtual’ memory.
I tried Perl 5.10.0 (the oldest version I have compiled on my machine), and it used slightly less ‘real’ memory, essentially the same ‘virtual’ memory, and was noticeably slower (33.3s for 512 lines; 1m 53.9s for 1024 lines).
Just for comparison purposes, I collected some C code that I had lying around in the test directory to create a file of about 88 KiB, with 3100 lines, of which about 200 were comment lines. This compares with the size of the data.1024 file, which was about 77 KiB. Processing that took between 10 and 20 milliseconds.
Summary
The non-C source you have makes a very nasty test case. Perl shouldn’t crash on it.
Which version of Perl are you using, and on which platform? How much memory does your machine have? That said, the total quantity of memory is unlikely to be the issue (24 MiB is not an issue on most machines that run Perl). If you have a very old version of Perl, the results might be different.
I also note that the regex does not handle some pathological C comments that a C compiler must handle, such as:
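A comment delimiter split by a backslash-newline is one case of this kind (an assumed illustration; other pathological forms exist). A C compiler splices such lines together before it recognizes comments, whereas the regex looks at the raw text. The sketch below shows the regex leaving such a comment in place, and a hypothetical workaround that splices lines first; the helper name `strip_comments` and the variable names are mine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The regex from the question, unchanged, wrapped in a hypothetical helper.
sub strip_comments {
    my ($text) = @_;
    $text =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
    return $text;
}

# A comment whose opening delimiter is split by backslash-newline.
# A C compiler joins such lines before it looks for comments.
my $tricky = "/\\\n* still a comment */ int y;\n";

# Applied directly, the regex never sees the two-character opener.
my $direct = strip_comments($tricky);
print $direct eq $tricky ? "comment survived\n" : "comment removed\n";

# Assumed workaround: splice continued lines first, as the compiler would.
(my $spliced = $tricky) =~ s/\\\n//g;
print strip_comments($spliced);
```

The first line printed is `comment survived`; after splicing, only ` int y;` remains. Splicing rewrites every continued line in the file, including those inside macros and string literals, so it changes the layout of the output and is only appropriate if that is acceptable.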
Yes, you’d be right to reject any code submitted for review that contained such comments.