Simple HTML tokenization & validation in JavaScript

I've been down the tokenization rabbit hole for a little while now, if it wasn't obvious from the previous articles on bracket pair matching and math expression tokenization. This time, I wanted to try something a little more complex, but still simple enough to be done in a single article. So, I decided to try my hand at tokenizing an HTML string and validating that its tags are balanced correctly.

💬 Note

Huge disclaimer right here that this is a learning exercise and, quite possibly, an exercise in futility. What I hope is that you'll come out of this article with a basic understanding of how to tokenize more complex inputs and how to validate them. This is not a full-fledged HTML parser, nor is it meant to be.

HTML Tokenization

If you've read the previous article on math expression tokenization, you'll know that tokenization is the process of breaking down a string into smaller, more manageable pieces. In that article, we didn't really delve into multi-character tokens, except for numbers, which made the process a little simpler. In this article, we'll be dealing with multi-character tokens, specifically HTML tags and text nodes.

Token types

The first problem we'll have to solve is distinguishing between an HTML tag and a text node. An HTML tag is a string that starts with < and ends with >, while a text node is everything else. In order to tackle this, we'll have to create a flushBuffer function, much like we did in the math expression tokenizer. Only this time around, we'll make it a simple conditional that delegates responsibility to different functions based on the detected token type.

const flushBuffer = () => {
  if (!buffer.length) return;

  const value = buffer.trim();
  if (value.startsWith('<') || value.endsWith('>'))
    processTagToken(value);
  else
    processTextToken(value);
};

This function is responsible for loosely detecting the token type based on the buffer's contents and delegating the processing to the appropriate function. If the buffer starts with < or ends with >, we'll assume it's an HTML tag and pass it to processTagToken. Otherwise, we'll assume it's a text node and pass it to processTextToken. We'll implement these functions next.

Processing text tokens

Text tokens aren't very interesting. They're just text nodes that don't contain any HTML tags. We'll simply add them to the tokens array as an object with a type of text and a value containing the text itself.

const processTextToken = str => {
  tokens.push({ type: 'text', value: str });
  buffer = '';
};

As you can see, the processTextToken function handles emptying the buffer after the token is processed. We could have done this in the flushBuffer function, but there's a slight chance we may want the raw buffer value at some point in the future, so it's better to keep the logic separate.

Processing tag tokens

Tag tokens are where most of the complexity of this process comes from. As soon as we enter processTagToken, we must check that the buffer's contents are actually a valid tag. Luckily, this is doable with a simple regular expression.

Then, we'll have to figure out the tag name, check if it's an opening, closing or even self-closing tag and, finally, extract the attribute string. Most of this is done using regular expressions, which I know are a pain, but they're fairly simple in this case.

const SELF_CLOSING_TAGS = new Set([
  'br', 'img', 'input', 'meta', 'hr', 'link'
]);

const processTagToken = str => {
  if (!str.match(/^<[^<>]+>$/))
    throw new Error(`${str} is not a valid HTML tag`);

  const tagName = str.match(/^<\/?([^<>/ ]+)/)[1];
  const isClosingTag = str.startsWith('</');
  const isSelfClosingTag =
    str.endsWith('/>') || SELF_CLOSING_TAGS.has(tagName);
  const tagAttributeString = str
    .replace(new RegExp(`^</?${tagName}`), '')
    .replace(/\/?>/, '')
    .trim() || null;

  tokens.push({
    type: 'tag',
    tagName,
    opening: !isClosingTag || isSelfClosingTag,
    closing: isClosingTag || isSelfClosingTag,
    tagAttributeString
  });

  buffer = '';
};

💬 Note

I'm not going to dive deeper into attribute handling, as this is almost a tokenizer of its own, but we'll simply store them as a string for now. Additionally, my self-closing tags list is very incomplete, but it'll do for now.

While this might look a little intimidating at first, there's not a whole lot going on in this function. We're simply performing the steps I outlined above, then, if all goes well, we're adding the token to the tokens array and emptying the buffer.
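
As mentioned in the note, attribute handling really deserves a tokenizer of its own. Still, if you're curious, here's a rough sketch of how one might turn the tagAttributeString into key-value pairs. The parseAttributes helper is purely illustrative and only covers double-quoted values and bare boolean attributes:

// Illustrative only: a minimal attribute parser that handles
// double-quoted values (key="value") and bare boolean attributes.
// A real parser would also need single quotes, unquoted values,
// escaped characters and more.
const parseAttributes = attributeString => {
  if (!attributeString) return {};

  return [...attributeString.matchAll(/([^\s=]+)(?:="([^"]*)")?/g)].reduce(
    (acc, [, key, value]) => {
      // Bare attributes (e.g. `hidden`) have no captured value
      acc[key] = value ?? true;
      return acc;
    },
    {}
  );
};

// parseAttributes('class="container" data-id="42" hidden');
// => { class: 'container', 'data-id': '42', hidden: true }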

Tokenizing the HTML string

We're now ready to tokenize an HTML string. The structure of the tokenizer is very similar to the previous article, so I won't bother you with all the minute details. The only point I would like to focus on is the actual character processing loop.

In this loop, we're simply adding characters to the buffer until we encounter either a < or a >. When we do, we flush the buffer and continue. The one thing that's of note is that we handle these two characters differently. When we encounter a <, we flush the buffer, then add the character, whereas when we encounter a >, we add the character, then flush the buffer. This allows us to parse both tags and text nodes correctly.

Let's take a look at the full tokenization code:

const SELF_CLOSING_TAGS = new Set([
  'br', 'img', 'input', 'meta', 'hr', 'link'
]);

const tokenizeHtml = str => {
  const tokens = [];
  let buffer = '';

  const processTagToken = str => {
    if (!str.match(/^<[^<>]+>$/))
      throw new Error(`${str} is not a valid HTML tag`);

    const tagName = str.match(/^<\/?([^<>/ ]+)/)[1];
    const isClosingTag = str.startsWith('</');
    const isSelfClosingTag =
      str.endsWith('/>') || SELF_CLOSING_TAGS.has(tagName);
    const tagAttributeString = str
      .replace(new RegExp(`^</?${tagName}`), '')
      .replace(/\/?>/, '')
      .trim() || null;

    tokens.push({
      type: 'tag',
      tagName,
      opening: !isClosingTag || isSelfClosingTag,
      closing: isClosingTag || isSelfClosingTag,
      tagAttributeString
    });

    buffer = '';
  };

  const processTextToken = str => {
    tokens.push({ type: 'text', value: str });
    buffer = '';
  };

  // Flush the buffer and process its contents as a token
  const flushBuffer = () => {
    if (!buffer.length) return;

    const value = buffer.trim();
    if (value.startsWith('<') || value.endsWith('>'))
      processTagToken(value);
    else
      processTextToken(value);
  };

  // Tokenize the input string
  [...str].forEach(char => {
    // If we encounter the start of a tag, flush the buffer
    if (char === '<') flushBuffer();

    // Add the character to the buffer
    buffer += char;

    // If we encounter the end of a tag, flush the buffer
    if (char === '>') flushBuffer();
  });

  // Flush any remaining buffer
  flushBuffer();

  return tokens;
};

Let's see it in action, shall we?

const tokens = tokenizeHtml(
  '<div class="container"><p>Hello, <strong>world</strong>!<br/></p></div>'
);
// [
//   {
//     type: 'tag', tagName: 'div',
//     opening: true, closing: false,
//     tagAttributeString: 'class="container"'
//   },
//   {
//     type: 'tag', tagName: 'p',
//     opening: true, closing: false,
//     tagAttributeString: null
//   },
//   { type: 'text', value: 'Hello,' },
//   {
//     type: 'tag', tagName: 'strong',
//     opening: true, closing: false,
//     tagAttributeString: null
//   },
//   { type: 'text', value: 'world' },
//   {
//     type: 'tag', tagName: 'strong',
//     opening: false, closing: true,
//     tagAttributeString: null
//   },
//   { type: 'text', value: '!' },
//   {
//     type: 'tag',
//     tagName: 'br',
//     opening: true, closing: true,
//     tagAttributeString: null
//   },
//   {
//     type: 'tag', tagName: 'p',
//     opening: false, closing: true,
//     tagAttributeString: null
//   },
//   {
//     type: 'tag', tagName: 'div',
//     opening: false, closing: true,
//     tagAttributeString: null
//   }
// ]
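
And, since processTagToken validates the tag's syntax before emitting a token, malformed markup throws as soon as the offending tag is flushed:

tokenizeHtml('<div>Hello</div');
// Error: </div is not a valid HTML tag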

Matching tag pairs

In the previous article on bracket pair matching, we used a simple stack-based approach to match bracket pairs. This scenario is no different, only this time we're working with tokens instead of a raw string and its characters.

The idea is rather simple. We skip non-tag tokens entirely and keep track of encountered tags. When we encounter an opening tag, we push its tagName and index to the stack. When we encounter a closing tag, we pop the last tag from the stack and check if the two match. If the stack is empty or the tags don't match, we throw an error. Otherwise, we record the pair and continue. Finally, if we reach the end of the tokens and the stack is not empty, we also throw an error.

const findMatchingTags = tokens => {
  const { pairs, stack } = tokens.reduce(
    ({ pairs, stack }, token, i) => {
      if (token.type === 'tag') {
        if (token.opening && !token.closing) {
          stack.push({ index: i, tagName: token.tagName });
        } else if (token.closing && !token.opening) {
          if (!stack.length)
            throw new Error(`Unmatched closing tag: ${token.tagName}`);
          const { index: openingIndex, tagName } = stack.pop();
          if (tagName !== token.tagName)
            throw new Error(`Mismatched tags: ${tagName} and ${token.tagName}`);
          pairs.set(openingIndex, i);
          pairs.set(i, openingIndex);
        }
      }
      return { pairs, stack };
    },
    { pairs: new Map(), stack: [] }
  );

  if (stack.length)
    throw new Error('Unmatched HTML tags');

  return pairs;
};

This isn't all that different from the last implementation, only we're working with tokens instead of characters. Let's see it in action:

// Given the tokens from the previous example
const pairs = findMatchingTags(tokens);
// Map {
//   0 => 9, 9 => 0,
//   1 => 8, 8 => 1,
//   3 => 5, 5 => 3
// }

A simple HTML validator

Putting the previous two pieces together, we can create a simple HTML validator that tokenizes the input string and matches the tag pairs, throwing an error if any tags are mismatched or unmatched, or if the input contains invalid HTML tags.

const validateHtml = str => {
  const tokens = tokenizeHtml(str);
  const matchingTags = findMatchingTags(tokens);
  return { tokens, matchingTags };
};
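
Let's see it in action one last time:

validateHtml('<div><p>Hello</p></div>');
// { tokens: [...], matchingTags: Map { 1 => 3, 3 => 1, 0 => 4, 4 => 0 } }

validateHtml('<div><p>Hello</div>');
// Error: Mismatched tags: p and div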

Conclusion

And that's basically our simple HTML validator done! Of course, there are a ton more things one can and should take care of when parsing HTML, but this is a good starting point for understanding the basics.

I hope this article has driven home the mindset of tokenization and how it can be used in real-world scenarios to break down complex strings into more manageable pieces. See you in the next one!
