I’m extracting the content between XML tags using the following:
perl -lne 'BEGIN{undef $/} while (/<tagname>(.*?)<\/tagname>/sg){print $1}' input.txt > output.txt
Unfortunately, I’m running out of memory. I know I can split the file, process each piece, and concatenate the results, but I wondered whether there is another way, either a modification to the above or something using the likes of awk or sed?
The input.txt file size varies between 17GB and 70GB.
EDIT:
The input file can be any XML file. A point to note is that it contains no newlines, e.g.:
<body><a></a><b></b><c></c></body><foo></foo><bar><z></z></bar>
This one-liner reads the entire file into memory as one gigantic “line”. Of course you’ll have memory problems when you stuff 17GB or more into it! Read and process the file line by line, or use Perl’s read function to get chunks of a suitable size instead.

In this case, search for <tagname>, note its position in the line, and search for the closing tag starting from there. If you don’t find it, append the current line/chunk to a buffer and repeat until you find it on some later line further into the file. When found, print out the buffer and empty it. Repeat until the end of the file.

Note that if you use arbitrarily sized chunks, you’ll have to account for the possibility of a tag being split by a chunk boundary: cut the incomplete tag from the end of the chunk and put it in a “to process” buffer, so that it is joined with the start of the next chunk.