A Brief Introduction to the HTML Parsing Process

Recently, a friend and I set out to build a browser-based parser/visualizer that we hoped would shed some light on how HTML code is processed into the DOM tree. When I first started coding, I understood conceptually what the DOM (Document Object Model) was, but it was difficult to fully grasp how different tags/nodes are appended to the tree. For example, if I add another script tag to my body, how does that look inside the DOM tree? And how is my HTML document converted into a format that my runtime environment can actually run?

Hopefully, this article sheds some light on the HTML parsing process. Here’s a link if you want to check out what my friend and I have worked on so far.

To start, let’s go through the HTML parsing process.

HTML Parsing Process

While there are many stages listed above, the two main stages to keep in mind are the tokenizer (tokenization) stage and the tree construction stage. Let’s walk through the tokenization stage first.

When the browser reads your HTML document (e.g. <body> <script> </script> </body>), the individual characters (such as ‘<’, ‘/’ or ‘>’) are turned into tokens by a state machine. A state machine reads a series of inputs and switches to a different state depending on each input. In the HTML parsing process, this is useful because most states consume a single character and then either switch to a new state to reconsume the current input character, switch to a new state to consume the next character, or stay in the same state to consume the next character. There’s more nuance to this process, especially when it comes to how certain states depend on the insertion mode or the stack of open elements, but for our purposes, knowing how individual characters are tokenized is enough. A few states that we can keep in mind are listed below.
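To make the state machine idea concrete, here is a minimal sketch in Python. It is not the real tokenizer: it only knows four states, ignores attributes, comments and the special handling of script content, and the names State and tokenize are made up for this example.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize(html):
    """Yield (kind, value) tokens for a tiny subset of HTML (no attributes)."""
    state = State.DATA
    name = ""            # tag name collected so far
    is_end_tag = False
    i = 0
    while i < len(html):
        ch = html[i]
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                yield ("character", ch)
            i += 1
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
                i += 1
            elif ch.isalpha():
                name, is_end_tag = "", False
                state = State.TAG_NAME      # reconsume: i is not advanced
            else:
                # Simplified: the real tokenizer has more branches here.
                yield ("character", "<")
                state = State.DATA          # reconsume ch in the data state
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                name, is_end_tag = "", True
                state = State.TAG_NAME      # reconsume: i is not advanced
            else:
                # Simplified: drop the character and go back to data.
                state = State.DATA
                i += 1
        elif state is State.TAG_NAME:
            if ch == ">":
                yield ("end_tag" if is_end_tag else "start_tag", name)
                state = State.DATA
            else:
                name += ch.lower()
            i += 1


tokens = list(tokenize("<body> <script> </script> </body>"))
```

Notice that the tag open state never advances past a letter: it switches to the tag name state and lets that state look at the same character again, which is the "reconsume" behaviour described above.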

When our HTML document is being parsed, the state machine switches to different states depending on the character that’s being consumed. For example, when we start a new tag with ‘<’, the state machine switches to the tag open state. That tag open state expects certain inputs to follow and responds accordingly. If we were to write a script tag as ‘<?script>’, we would run into a parse error.

Here’s a link to a table with common parse errors and how they are interpreted. In our case, a question mark code point in place of a tag name is treated as the start of a comment, so ‘<?script>’ ends up as a comment rather than a script element.
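Here is a rough sketch of that one branch of the tag open state, again in Python. The function name tokenize_tag_open is hypothetical and only two cases are covered (a question mark and a plain start tag name); the error code string is the one the HTML specification uses for this situation.

```python
def tokenize_tag_open(html):
    """Tokenize input that starts with '<' and return (token, parse_errors).

    Only two branches are sketched: '?' after '<', and a plain tag name.
    """
    if not html.startswith("<"):
        raise ValueError("expected input starting with '<'")
    errors = []
    rest = html[1:]
    if rest.startswith("?"):
        errors.append("unexpected-question-mark-instead-of-tag-name")
        # Bogus comment: everything up to the next '>' becomes comment data.
        data, _, _ = rest.partition(">")
        return ("comment", data), errors
    name, _, _ = rest.partition(">")
    return ("start_tag", name.lower()), errors


bad = tokenize_tag_open("<?script>")
good = tokenize_tag_open("<script>")
```

With this sketch, the malformed tag comes back as a comment token whose data is "?script" along with one recorded parse error, while the well-formed tag comes back as a start tag with no errors.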

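The tokenizer doesn’t finish the whole document before the next stage begins. As each token is emitted, it is handed straight to the tree construction stage. Tree construction can also feed back into the input: a script that runs during this stage can insert additional characters into the stream (for example, with document.write()).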

In the tree construction stage, the sequence of tokens from the tokenization stage is received as input and, in response, our DOM tree is dynamically modified. When processing each token, the user agent (e.g. a web browser) follows a set of steps, known as the tree construction dispatcher, to decide how the token should be handled. These steps include checking whether the adjusted current node (the current element node or the “context” element) is a MathML text integration point, an HTML integration point or an element in the HTML namespace, and whether the current token is a start tag or a character token. After these different scenarios are accounted for, we can finally move on to how nodes are created and inserted into the DOM tree.
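The core of node insertion can be sketched with a stack of open elements. The Python below is a simplified illustration, not the dispatcher itself: it skips the namespace and integration point checks and all insertion modes, and the names Node and build_tree, as well as the (kind, value) token shape, are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def build_tree(tokens):
    """Turn (kind, value) tokens into a tree using a stack of open elements."""
    document = Node("#document")
    open_elements = [document]      # the last item is the current node
    for kind, value in tokens:
        current = open_elements[-1]
        if kind == "start_tag":
            element = Node(value)
            current.children.append(element)
            open_elements.append(element)
        elif kind == "end_tag":
            # Pop up to and including the nearest matching open element.
            for index in range(len(open_elements) - 1, 0, -1):
                if open_elements[index].name == value:
                    del open_elements[index:]
                    break
        elif kind == "character":
            last = current.children[-1] if current.children else None
            if last is not None and last.name == "#text":
                last.text += value  # merge adjacent characters into one text node
            else:
                current.children.append(Node("#text", text=value))
    return document


tree = build_tree([
    ("start_tag", "body"),
    ("start_tag", "p"),
    ("character", "h"),
    ("character", "i"),
    ("end_tag", "p"),
    ("end_tag", "body"),
])
```

Each start tag becomes a child of whatever element is currently on top of the stack, which is why the order of tokens alone is enough to recover the nesting.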

Let’s add some sample code into our DOM Tree visualizer and see what happens!

If we were to insert “<script> </script> <body> <p> </p> </body>” into our visualizer, we can see that it renders the script tag as a child of the head node and the p tag as a child of the body node, even though we never wrote the html or head tags ourselves.
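Why does the script land under head? A heavily simplified sketch of the idea in Python: while the parser is still "in head", certain tags are placed in the head, and the first tag that does not belong there moves the parser on to the body. The HEAD_TAGS set is only a subset chosen for illustration, place_start_tags is a hypothetical helper, and nesting is ignored entirely.

```python
# Subset of the start tags that are handled while the parser is "in head".
HEAD_TAGS = {"base", "link", "meta", "title", "style", "script"}


def place_start_tags(tag_names):
    """Return which implied parent (head or body) each start tag lands under."""
    tree = {"head": [], "body": []}     # html, head and body are implied
    mode = "in head"
    for name in tag_names:
        if name in ("html", "head"):
            continue                    # already implied in this sketch
        if mode == "in head" and name in HEAD_TAGS:
            tree["head"].append(name)
            continue
        mode = "in body"                # anything else ends the head
        if name != "body":
            tree["body"].append(name)
    return tree


early_script = place_start_tags(["script", "body", "p"])
late_script = place_start_tags(["body", "p", "script"])
```

In this sketch, a script that appears before any body content is placed in the head, while a script written after the body has started stays in the body.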

Let’s input a simple HTML page and see how that renders under the tree.

<HTML>
<HEAD>
<TITLE>Your Title Here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>
<HR>
<a href="http://somegreatsite.com">Link Name</a>
is a link to another nifty site
<H1>This is a Header</H1>
<H2>This is a Medium Header</H2>
Send me mail at <a href="mailto:support@yourcompany.com">
support@yourcompany.com</a>.
<P> This is a new paragraph!
<P> <B>This is a new paragraph!</B>
<BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>

Here, our DOM tree visualizer renders every tag hierarchically under its corresponding parent node.
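If you want to get a feel for this hierarchy without a visualizer, here is a small Python sketch that prints an indented outline of the tags. It uses html.parser from the standard library, which only reports tags as it sees them and does not build a tree, so the nesting is tracked by hand. The VOID_TAGS and CLOSES_P sets are small subsets picked to cover a page like the one above, and OutlineBuilder and outline are names invented for this example, not part of our visualizer.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "hr", "br"}                 # never have an end tag
CLOSES_P = {"p", "hr", "h1", "h2", "center"}    # start tags that end an open <p>


class OutlineBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []     # names of the elements that are still open
        self.lines = []         # one indented line per element

    def handle_starttag(self, tag, attrs):
        if tag in CLOSES_P and "p" in self.open_tags:
            self._close("p")
        self.lines.append("  " * len(self.open_tags) + tag)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self._close(tag)

    def _close(self, tag):
        # Pop everything above and including the nearest open `tag`.
        while self.open_tags.pop() != tag:
            pass


def outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.lines


lines = outline("<html><body><p>one<p><b>two</b><br><hr></body></html>")
print("\n".join(lines))
```

The two paragraphs end up as siblings even though neither has a closing tag, because the second P start tag (and later the HR) closes the paragraph that is still open.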

I hope that was informative. Stay tuned for next time, when I’ll go over how we set up the sample visualizer shown above. Thank you!

Hi, I’m a SWE based in NYC!