On a messaging system, line breaks are inserted automatically after a certain number of characters when a message is posted (a silly way to do things, but unfortunately that can’t be changed). This means that breaks get inserted into URLs, so long ones are split up, e.g.:
http://www.stackoverflow.com/some-more-<br/>stuff
When messages are retrieved, a function converts links into tags, which for this URL results in:
<a href='http://www.stackoverflow.com/some-more-'>http://www.stackoverflow.com/some-more-</a>stuff
I need to remove the <br/> before it’s turned into a link.
I have tried splitting the message into words on spaces, then iterating through each word, checking whether it contains ‘http://’ or ‘www.’ and, if it does, replacing <br/> with an empty string.
However, this only works on URLs entered in a paragraph, for example:
The URL is http://www.stackoverflow.com
It doesn’t work for URLs entered with line breaks around them, for example:
Here’s the URL:
http://www.stackoverflow.com
And here’s some more text
..is chopped into:
Here’s the URL:http://www.stackoverflow.comAnd here’s some more text
..because all the line breaks have been removed in this ‘word’ (as I’m splitting on spaces, all of that is seen as one word).
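For reference, the behaviour described above can be reproduced with a minimal C# sketch. The original code is not shown, so the class and method names here are hypothetical, and the splitting logic is an assumption based on the description:

```csharp
using System;

public static class NaiveBreakRemover
{
    // Hypothetical reconstruction of the approach described above:
    // split on spaces, then strip <br/> from any "word" that looks like a URL.
    public static string RemoveBreaks(string message)
    {
        string[] words = message.Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            if (words[i].Contains("http://") || words[i].Contains("www."))
            {
                // Every break in the "word" goes, including the ones
                // that merely sit next to the URL.
                words[i] = words[i].Replace("<br/>", "");
            }
        }
        return string.Join(" ", words);
    }
}
```

Because there is no space on either side of the breaks, "URL:<br/>http://www.stackoverflow.com<br/>And" counts as a single word, and both breaks are removed from it.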
I thought I could split on line breaks, but then this won’t work for URLs entered in a paragraph as in the first example, and it will also split in the middle of any URLs that contain a break.
Clearly I need to somehow just find URLs and replace line breaks inside them, but I’m having real trouble with this, as I just can’t seem to do it!
If I’ve left out any details feel free to ask and I’ll get back at once. Thanks 🙂
PS – This is being coded in C#.
Please delete the other answer.
I wasn’t able to understand your problem. Now I think I do.
You can use this regex to find all the URLs, whether they are broken across several lines or not:
This will return capture groups called “url” which contain your URL, with or without line breaks inside it. You get this with the (.|\r\n)*, which finds URLs broken across several lines by \r\n (CR, LF). Check whether this is the line-ending encoding of your messages. If not, you can change the group to (.|\n), or whatever fits your case.
Once you’ve found your URLs, you can remove the \r\n inside them.
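As a rough illustration of the find-then-clean step, here is a minimal C# sketch. The pattern is an assumption made for this example and is not the regex from this answer; the class and method names are hypothetical. It assumes a URL starts with http://, https:// or www. and runs until the first whitespace that is not a \r\n pair:

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlBreakRemover
{
    // Assumed pattern (not the regex from this answer): the whole URL,
    // including any CR LF pairs inside it, is captured in the "url" group.
    private static readonly Regex UrlPattern = new Regex(
        @"(?<url>(?:https?://|www\.)(?:\S|\r\n)*)");

    public static string RemoveBreaksInUrls(string message)
    {
        // Replace each match with the same URL minus its line breaks.
        return UrlPattern.Replace(
            message,
            m => m.Groups["url"].Value.Replace("\r\n", ""));
    }
}
```

Note that this sketch cannot tell an automatically inserted break from one the user typed straight after a URL, so text that follows such a break with no space in between would still be joined to the URL.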
You can improve it using this regex:
The deleteMe group captures all the offending line breaks inside the URLs, so you can safely remove them all.
Important: you have to run the regex with the multiline option. If not, it will not work.
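One way to use such a group in .NET is to read every capture of deleteMe and delete those spans from the text. The sketch below assumes its own pattern, not the regex from this answer, and the class and method names are hypothetical:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DeleteMeExample
{
    // Assumed pattern (not the regex from this answer): each CR LF that is
    // followed by more non-whitespace text inside a URL lands in "deleteMe".
    private static readonly Regex UrlPattern = new Regex(
        @"(?<url>(?:https?://|www\.)\S*(?:(?<deleteMe>\r\n)\S+)*)");

    public static string RemoveCapturedBreaks(string message)
    {
        var result = new StringBuilder(message);
        MatchCollection matches = UrlPattern.Matches(message);

        // Walk backwards so earlier indexes stay valid after each removal.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            CaptureCollection breaks = matches[i].Groups["deleteMe"].Captures;
            for (int j = breaks.Count - 1; j >= 0; j--)
            {
                result.Remove(breaks[j].Index, breaks[j].Length);
            }
        }
        return result.ToString();
    }
}
```

RegexOptions.Multiline only changes how ^ and $ match. The pattern in this sketch uses neither anchor, so the option makes no difference to it, but a pattern that relies on those anchors would need it.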
Sample text:
Matches:
The deleteMe group matches the bold \r\n.