I typically work with large XML files, and generally do word counts via grep

Editorial Team

Asked: June 8, 2026 (2026-06-08T12:00:02+00:00)

I typically work with large XML files, and generally do word counts via grep to confirm certain statistics.

For example, I want to make sure I have at least five instances of widget in a single XML file via:

cat test.xml | grep -ic widget

Additionally, I like to be able to log the lines that widget appears on, i.e.:

cat test.xml | grep -i widget > ~/log.txt

However, the key information I really need is the block of XML code that widget appears in. An example file may look like:

<test> blah blah
  blah blah blah
  widget
  blah blah blah
</test>

<formula>
  blah
  <details> 
    widget
  </details>
</formula>

I am trying to get the following output from the sample text above:

<test>widget</test>

<formula>widget</formula>

Effectively, I’m trying to get a single line with the highest level of markup tags that apply to a block of XML text/code that is surrounding the arbitrary string, widget.

Does anyone have any suggestions for implementing this via a command-line one liner?

Thank you.

1 Answer

Editorial Team · Answer 1 · 2026-06-08T12:00:04+00:00

An inelegant way, using both sed and awk:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}' file.txt | awk 'NR%2==1 { sub(/^[ \t]+/, ""); search = $0 } NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }'

Results:

<test>widget</test>
<formula>widget</formula>

Explanation:

## The sed pipe:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}'
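## This matches a line containing widget (the bracket expressions make the
## match case-insensitive), then runs to the next closing tag that starts at
## the beginning of a line, i.e. the highest-level closing tag.
## The empty regex // reuses the most recently used pattern, so only the
## widget line and that closing tag are printed: two lines for each match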

## Now the awk pipe:

NR%2==1 { sub(/^[ \t]+/, ""); search = $0 }
## This takes each odd-numbered line (the widget line), removes leading
## whitespace, and saves the result in 'search'

NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }
## This takes each even-numbered line (the closing tag) and stores it in 'end'.
## We then remove the slash from $0 to get the opening tag and print the
## opening tag, the saved widget line, and the closing tag on one line

HTH
