Sunday, November 3

Case Study: Optimizing CommonMark Markdown Parser with Blackfire.io

As you may know, I am the author and maintainer of the PHP League‘s CommonMark Markdown parser. This project has three primary goals:

  1. fully support the entire CommonMark spec
  2. match the behavior of the JS reference implementation
  3. be well-written and super-extensible so that others can add their own functionality.

This last goal is perhaps the most challenging, especially from a performance perspective. Other popular Markdown parsers are built using single classes with massive regex functions. As you can see from this benchmark, it makes them lightning fast:

Library Avg. Parse Time File/Class Count
Parsedown 1.6.0 2 ms 1
PHP Markdown 1.5.0 4 ms 4
PHP Markdown Extra 1.5.0 7 ms 6
CommonMark 0.12.0 46 ms 117

Unfortunately, because of the tightly-coupled design and overall architecture, it’s difficult (if not impossible) to extend these parsers with custom logic.

For the League’s CommonMark parser, we chose to prioritize extensibility over performance. This led to a decoupled object-oriented design which users can easily customize. This has enabled others to build their own integrations, extensions, and other custom projects.

The library’s performance is still decent — the end user probably can’t differentiate between 42ms and 2ms (you should be caching your rendered Markdown anyway). Nevertheless, we still wanted to optimize our parser as much as possible without compromising our primary goals. This blog post explains how we used Blackfire to do just that.

Profiling with Blackfire

Blackfire is a fantastic tool from the folks at SensioLabs. You simply attach it to any web or CLI request and get this awesome, easy-to-digest performance trace of your application’s request. In this post, we’ll be examining how Blackfire was used to identify and optimize two performance issues found in version 0.6.1 of the league/commonmark library.

Let’s start by profiling the time it takes league/commonmark to parse the contents of the CommonMark spec document:

Initial benchmark of league/commonark 0.6.1

Later on we’ll compare this benchmark to our changes in order to measure the performance improvements.

Quick side-note: Blackfire adds overhead while profiling things, so the execution times will always be much higher than usual. Focus on the relative percentage changes instead of the absolute “wall clock” times.

Optimization 1

Looking at our initial benchmark, you can easily see that inline parsing with InlineParserEngine::parse() accounts for a whopping 43.75% of the execution time. Clicking this method reveals more information about why this happens:

Detailed view of InlineParseEngine::parse()

Here we see that InlineParserEngine::parse() is calling Cursor::getCharacter() 79,194 times — once for every single character in the Markdown text. Here’s a partial (slightly-modified) excerpt of this method from 0.6.1:

public function parse(ContextInterface $context, Cursor $cursor)
{
    // Iterate through every single character in the current line
    while (($character = $cursor->getCharacter()) !== null) {
        // Check to see whether this character is a special Markdown character
        // If so, let it try to parse this part of the string
        foreach ($matchingParsers as $parser) {
            if ($res = $parser->parse($context, $inlineParserContext)) {
                continue 2;
            }
        }

        // If no parser could handle this character, then it must be a plain text character
        // Add this character to the current line of text
        $lastInline->append($character);
    }
}

Blackfire tells us that parse() is spending over 17% of its time checking every. single. character. one. at. a. time. But most of these 79,194 characters are plain text which don’t need special handling! Let’s optimize this.

Instead of adding a single character at the end of our loop, let’s use a regex to capture as many non-special characters as we can:

public function parse(ContextInterface $context, Cursor $cursor)
{
    // Iterate through every single character in the current line
    while (($character = $cursor->getCharacter()) !== null) {
        // Check to see whether this character is a special Markdown character
        // If so, let it try to parse this part of the string
        foreach ($matchingParsers as $parser) {
            if ($res = $parser->parse($context, $inlineParserContext)) {
                continue 2;
            }
        }

        // If no parser could handle this character, then it must be a plain text character
        // NEW: Attempt to match multiple non-special characters at once.
        //      We use a dynamically-created regex which matches text from
        //      the current position until it hits a special character.
        $text = $cursor->match($this->environment->getInlineParserCharacterRegex());

        // Add the matching text to the current line of text
        $lastInline->append($character);
    }
}

Once this change was made, I re-profiled the library using Blackfire:

Post-optimization profile

Okay, things are looking a little better. But let’s actually compare the two benchmarks using Blackfire’s comparison tool to get a clearer picture of what changed:

Before-and-after comparison

This single change resulted in 48,118 fewer calls to that Cursor::getCharacter() method and an 11% overall performance boost! This is certainly helpful, but we can optimize inline parsing even further.

The post Case Study: Optimizing CommonMark Markdown Parser with Blackfire.io appeared first on SitePoint.


Source: Sitepoint

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x