The Archive Base


Editorial Team
Asked: June 2, 2026


I have been told and watched others be told very often: do not use regular expressions to parse (or “parse”) a document written in a language like HTML, XML etc. The reasons named vary and are not really of importance here.

When asked what to do instead, usually you will be referred to a library for parsing such a document – a PHP extension, a JS framework etc. Most of the time they seem to rely on the document object model.

My question is not how to do this in a program or script. In a real situation I would not attempt to invent the wheel another time but just use one of the available frameworks.

What I want to know is – how do these frameworks do it? Or how would I do it without a framework (hypothetically)? I am not talking about any language in specific, I am interested in the theory behind extracting information from a document.

1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 6:22 pm

    Parsing XML requires a tool that’s capable of recognizing something called a “context-free language.” Regular expressions recognize regular languages, which are a subset of context-free languages.

    Recognizing Regular Languages

    Regular languages are recognized by deterministic finite automata (DFAs). A DFA is a set of states with transition edges between states, and an input buffer (the string you’re parsing). The DFA begins in its start state. The DFA reads off the character at the beginning of the input buffer, which tells it which transition to take. This moves the DFA to the next state, where it repeats the process. If the DFA ever encounters an input character it doesn’t have a transition for, it stops and the input is rejected. If the DFA is in a designated accepting (end) state when the input runs out, the input has been recognized.
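The process described above can be sketched in a few lines of Python. This is only an illustration: the example automaton (it accepts strings over `a` and `b` that end in `ab`) and the names `START`, `ACCEPTING`, `TRANSITIONS`, and `dfa_accepts` are made up for this sketch.

```python
from typing import Dict, Set, Tuple

# Hypothetical example DFA: accepts strings over {a, b} that end in "ab".
START = "q0"
ACCEPTING: Set[str] = {"q2"}
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("q0", "a"): "q1",
    ("q0", "b"): "q0",
    ("q1", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "a"): "q1",
    ("q2", "b"): "q0",
}


def dfa_accepts(text: str) -> bool:
    """Run the DFA over text; accept only if it ends in an accepting state."""
    state = START
    for ch in text:
        key = (state, ch)
        if key not in TRANSITIONS:
            return False  # no transition for this character: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

Note that the only thing carried from one character to the next is `state`. There is no other memory.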

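    The most important thing to remember is that DFAs can’t remember what states they’ve been to, only where they are right now and where to go next. This makes it impossible for a DFA to recognize certain types of languages, like matched XML tags for example: tags can be nested arbitrarily deep, but a DFA has only a fixed, finite number of states with which to keep track of the nesting.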

    Regular expression implementations (like PCRE) have some extensions for convenience (‘+’, ‘?’, and character classes, for example), and others that change the power of regular expressions (like back-references; lookahead on its own adds convenience but not power). Regular expressions with such extensions are more powerful than DFAs, but it would still be hard or impossible to build an XML parser with just these extended regular expressions.
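As a small illustration of what a back-reference adds and where it stops helping, here is a sketch using Python's `re` module. The pattern `FLAT_ELEMENT` and the helper `is_flat_element` are hypothetical names, and the tag syntax is deliberately simplified (letters only, no attributes).

```python
import re

# Hypothetical pattern: an opening tag, text without angle brackets, a closing tag.
# \1 is a back-reference: the closing name must equal the captured opening name.
FLAT_ELEMENT = re.compile(r"<([a-zA-Z]+)>[^<>]*</\1>")


def is_flat_element(text: str) -> bool:
    """True if text is exactly one non-nested element with matching tag names."""
    return FLAT_ELEMENT.fullmatch(text) is not None
```

With this pattern, `<b>hi</b>` is accepted and `<b>hi</i>` is rejected, but so is `<a><b>x</b></a>`: nothing in the pattern describes an element inside another element, and adding a fixed number of extra levels by hand would still not cover arbitrary depth.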

    Recognizing Context-Free Languages

    Context-free languages are recognized by pushdown automata. These work just like DFAs, but with the addition of a stack. Pushdown automata select a transition using the first character of input and the value on the top of the stack. In each step, the machine consumes one input character and can push a value on the stack, pop one off, or do nothing with the stack.

    Pushdown automata can use the stack to remember where they’ve been, which makes them suitable for parsing languages like XML (or most programming languages, with a few special exceptions).
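The role of the stack can be shown with a short sketch. It is not a formal pushdown automaton, only a stack-based check in the same spirit, and it assumes a simplified tag syntax (`<name>` or `</name>`, letters only, no attributes). The names `TAG` and `tags_are_nested` are made up for this example.

```python
import re
from typing import List

# Assumed simplified tag syntax: <name> or </name>, letters only, no attributes.
TAG = re.compile(r"<(/?)([a-zA-Z]+)>")


def tags_are_nested(text: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack: List[str] = []
    for match in TAG.finditer(text):
        is_closing, name = match.group(1) == "/", match.group(2)
        if not is_closing:
            stack.append(name)   # remember the open tag
        elif not stack or stack.pop() != name:
            return False         # closing tag does not match the latest open tag
    return not stack             # leftover open tags mean something was never closed
```

The regular expression only finds individual tags. The stack does the part a DFA cannot do: it remembers which tags are still open, however deep the nesting goes.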

    Parsing XML

    Parsers aren’t built by designing a pushdown automaton, the same way you don’t recognize regular languages by designing a DFA. Context-free grammars are a nicer way to describe a context-free language. They’re typically written down in Backus-Naur Form (BNF). Here’s a simple BNF grammar for a subset of XML:

    Tags ::= Tag Tags | <nothing>
    
    Tag ::= "<" /[a-zA-Z]+/ Attributes ">" Tags "</" /[a-zA-Z]+/ ">"
    
    Attributes ::= Attribute Attributes | <nothing>
    
    Attribute ::= /[a-zA-Z]+/ "=" "\"" /[a-zA-Z0-9 ]+/ "\""
    

    This grammar is made up of non-terminals (“Tags”, “Tag”, “Attributes”, and “Attribute”). Anywhere a non-terminal shows up on the right side of a rule it can be replaced by any of the possible definitions (separated by |). Quoted text and the regular expressions between slashes are terminals, which must match the input directly, and <nothing> stands for the empty string.

    The Tag non-terminal recognizes the start and end tags, with a Tags non-terminal between them. Whenever the parser recognizes a start tag, it expects to find the closing tag on the other side. Tags will recognize one tag, followed by Tags again. This recursive definition allows the parser to recognize an unbounded number of tags.

    Parser generators are tools that turn context-free grammars into pushdown automata to recognize the input language. This takes a lot of the complexity out of building a parser, although there are plenty of challenges in accurately specifying a grammar.

    Other Methods for Parsing

    You can also write a parser without building the state machine by hand or writing a context-free grammar. Typically this is done either with a recursive-descent parser or with a hand-crafted parser that uses regular expressions plus some special knowledge about the language being parsed. Recursive-descent parsers look a lot like context-free grammars, but they cannot handle left-recursive rules directly, and versions that rely on backtracking can become very slow. There are also parsing expression grammars (PEGs), which work like a hybrid of regular expressions and BNF grammars. There are great articles on all of these techniques on Wikipedia, and many tools available for building parsers of all sorts.
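To make the recursive-descent idea concrete, here is a rough sketch for the small XML grammar given earlier, with one method per non-terminal. Several things are assumptions of this sketch rather than part of the grammar: whitespace between tokens is skipped, the closing tag name is checked against the opening name, and the names `Parser`, `parse`, and `Node` are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

Node = Tuple[str, Dict[str, str], list]  # (tag name, attributes, children)

NAME = re.compile(r"[a-zA-Z]+")
VALUE = re.compile(r"[a-zA-Z0-9 ]+")


class Parser:
    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def skip_ws(self) -> None:
        # Assumption: whitespace between tokens is ignored (not in the grammar).
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def expect(self, literal: str) -> None:
        self.skip_ws()
        if not self.text.startswith(literal, self.pos):
            raise SyntaxError(f"expected {literal!r} at position {self.pos}")
        self.pos += len(literal)

    def match(self, pattern: "re.Pattern[str]") -> str:
        m = pattern.match(self.text, self.pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {self.pos}")
        self.pos = m.end()
        return m.group()

    def tags(self) -> List[Node]:
        # Tags ::= Tag Tags | <nothing>
        nodes: List[Node] = []
        while True:
            self.skip_ws()
            rest = self.text[self.pos:self.pos + 2]
            if not rest.startswith("<") or rest == "</":
                return nodes  # the <nothing> alternative
            nodes.append(self.tag())

    def tag(self) -> Node:
        # Tag ::= "<" name Attributes ">" Tags "</" name ">"
        self.expect("<")
        name = self.match(NAME)
        attrs = self.attributes()
        self.expect(">")
        children = self.tags()
        self.expect("</")
        closing = self.match(NAME)
        if closing != name:  # extra check, beyond what the grammar itself states
            raise SyntaxError(f"<{name}> closed by </{closing}>")
        self.expect(">")
        return (name, attrs, children)

    def attributes(self) -> Dict[str, str]:
        # Attributes ::= Attribute Attributes | <nothing>
        attrs: Dict[str, str] = {}
        while True:
            self.skip_ws()
            if NAME.match(self.text, self.pos) is None:
                return attrs  # the <nothing> alternative
            key = self.match(NAME)
            self.expect("=")
            self.expect('"')
            attrs[key] = self.match(VALUE)
            self.expect('"')


def parse(text: str) -> List[Node]:
    parser = Parser(text)
    nodes = parser.tags()
    parser.skip_ws()
    if parser.pos != len(text):
        raise SyntaxError(f"unexpected input at position {parser.pos}")
    return nodes
```

For example, `parse('<a id="x1"><b></b></a>')` yields one `a` node with the attribute `id` and a single child `b`. The call stack of the nested `tag()` calls plays the role of the pushdown automaton's stack.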


© 2021 The Archive Base. All Rights Reserved