I’m creating test samples of text of varying length, where each sample is separated by a line break. Currently I have 3 MB+ files of text with no line breaks, only spaces. I was hoping for help with the proper regular expression to insert line breaks without cutting words in half.
I’m very new to regular expressions, but I assumed that for lines of, say, 300 characters, it would be somewhere in the ballpark of:
/.{300,}\s+/&\n/g
(Apologies, I know this doesn’t work!)
Note: I know there are similar posts about this subject, but I’m relatively sure there’s nothing out there that specifically addresses this scenario.
Update: Solved! Worked with this command: perl -lpe's/\b(.{80,300})\b/\1\n/g' file
Are you sure there are no newlines already in the data? (If there are, the dot character (.) will not match them.) If there are no newlines, something as simple as this might work:
The lower bound of 80 is just an arbitrary choice that will rarely affect the outcome if there are no newlines present. You can lower the 300 if you want shorter lines.
Edit: changed \b to \s, which may be a better choice to avoid unexpected line breaks around non-word characters, as pointed out by @tchrist. Also, the OP did not say he needed Perl backreferences, so tchrist changed \1 to $1, which makes more sense for Perl.