I’ve been looking for this one all day now, and this is the closest useful reference I have found.
My problem: huge files come from a closed system (they can’t be altered at the source) and need to be imported. These files are |-separated and have a CRLF at the end of each line (until the last one). Now they found it funny to include a new type that can contain text with CR and CRLF in the text (instead of <br>).
So what I need to do before I can process this file in our system is to replace all CRLF and CR occurrences that are not preceded by a | with <br>, so that every line starts with a code like 000| … 600|
Closest I’ve got in Notepad++:
Find: (?<![\|])[\r\n]+$
Replace: <br>
The problem is that it does not produce a <br> for every CRLF: it misses a CRLF that follows a CR… Other attempts that also select the |CRLF miss the CR altogether.
Any thoughts greatly appreciated. Do keep in mind that the file can be over 500MB (complicating things a bit)
Extract of the file:
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :CR
CRLF
------------------------------CR
CRLF
CRLF
Gezien U het evaluatierooster niet heeft ingevuld, blijft CR
CRLF
CRLF
|||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET CR
CRLF
(DE) GEBOUW(EN) CR
CRLF
CR
CRLF
indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF
…
Required result: (rough copy paste job ;))
000|709076|153943|11||1|CRLF
300|709076|153943|11|4|20000729||Majo509|CRLF
500|709076|153943|11|6|3-3BNME|20000729|||21.13|4||20120509|CRLF
600|709076|153943|11||SBV|7103||||20120509|CRLF
600|709076|153943|11||SBV|7105||||20120509|CRLF
600|709076|153943|11||SBV|7607||||20120509|CRLF
600|709076|153943|11||MC||EVALUATIEROOSTER NIET INGEVULD :<BR><BR>---------------------<BR><BR><BR>Gezien U het evaluatierooster niet heeft ingevuld, blijft <BR><BR>||20120509|CRLF
600|709076|153943|11||SBV|7517||||20120509|CRLF
000|709209|154072|9||1|Dne|LA1349|3100||L|20120509|CRLF
300|709209|154072|9|3|20HEM-AT20120509|CRLF
500|709209|154072|9|6|3-3BNME|20000908|||15.4|3||20120509|CRLF
600|709209|154072|9||SBV|7103||||20120509|CRLF
600|709209|154072|9||MC||AFSCHAFFING VAN DE EVOOR HET <BR><BR>(DE) GEBOUW(EN) <BR><BR><BR><BR>indien U huurder of gebruiker bent.|||20120509|CRLF
600|709209|154072|9||MC||DIEFSTAL CRLF
Wow, this one fazed me for a little while…
It’s tricky to do it in one pass.
The N++ constraint probably makes it tougher than it needs to be, but short of writing some code to do what you want, it’s a good way to go I guess.
While I’m not sure it’s optimal, I had success with this combo.
Find:
Replace:
You need the $1 in the replace or you lose a character from your replaced lines – probably not what you want!
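To illustrate the point about $1, here is a small Python sketch. The pattern in it is hypothetical (it is not necessarily the one used above): it captures one non-pipe character in front of the break, so that character has to be put back in the replacement. Python writes the backreference as \1 where N++ uses $1.

```python
import re

# Hypothetical pattern for illustration only: one non-pipe character
# (captured as group 1) followed by CR or CRLF.
pattern = re.compile(r"([^|])\r\n?")

line = "blijft\r\nGezien"

# Without the backreference the captured character is swallowed.
dropped = pattern.sub("<br>", line)

# With the backreference the captured character is restored.
kept = pattern.sub(r"\1<br>", line)

print(dropped)
print(kept)
```

A lookbehind such as (?<!\|) avoids the issue, because it checks the preceding character without consuming it.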
Ideally, you should look into some Perl (I’m no Perl advocate, other scripting languages that handle regex are available…) or something similar to do this.
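As a rough idea of what such a script could look like, here is a minimal Python sketch. It assumes that every CR or CRLF directly after a | is a real line ending and that every other CR or CRLF belongs to a text field. The function names are made up for this example. It reads the file in binary mode, where iteration splits on LF only, so a 500MB file is never held in memory at once.

```python
import re

# A CR, optionally followed by LF, that does not directly follow a pipe.
# Assumption: such a break is part of a text field, not a record ending.
EMBEDDED_BREAK = re.compile(rb"(?<!\|)\r\n?")


def replace_embedded_breaks(chunk: bytes) -> bytes:
    """Replace each embedded CR or CRLF in chunk with one <br>."""
    return EMBEDDED_BREAK.sub(b"<br>", chunk)


def convert_stream(src, dst) -> None:
    """Copy src to dst, converting embedded breaks piece by piece.

    Both arguments are binary file objects. Iterating a binary file
    yields pieces that end at LF, so a piece can only start at the
    beginning of the file or right after an LF, never right after a
    pipe. That keeps the lookbehind correct at piece boundaries.
    """
    for piece in src:
        dst.write(replace_embedded_breaks(piece))


# Possible usage (the file names are placeholders):
# with open("export.txt", "rb") as src, open("fixed.txt", "wb") as dst:
#     convert_stream(src, dst)
```

Each CR and each CRLF becomes its own <br>, so a CR followed by a CRLF gives two of them rather than one.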
Edit:
Just a thought. This makes the assumption that there won’t be sections of your file that contain |CRLF or |CR or |CRCR that are not ‘real’ line endings.
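One way to check that assumption after the conversion is to scan the result for lines that do not start with a record code. The sketch below is only an illustration: it assumes a record code is exactly three digits followed by | (as in 000|, 300|, 500|, 600| from the extract), and the function name and the limit parameter are made up.

```python
import re

# Assumption: every real record starts with three digits and a pipe.
RECORD_START = re.compile(rb"\d{3}\|")


def find_suspect_lines(stream, limit=20):
    """Return (line_number, first_40_bytes) for lines without a record code.

    stream is a binary file object for the converted file. The scan
    stops after `limit` hits so a badly broken file does not flood
    the result.
    """
    suspects = []
    for number, line in enumerate(stream, start=1):
        if not RECORD_START.match(line):
            suspects.append((number, line[:40]))
            if len(suspects) >= limit:
                break
    return suspects
```

Any hit points at a |CR or |CRLF inside a text field that was left alone and needs a closer look.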