`md4qt` is a header-only C++ library for parsing Markdown.

`md4qt` supports the CommonMark 0.31.2 spec and some GitHub extensions, such as tables, footnotes, task lists, strikethroughs, LaTeX math injections, and GitHub's autolinks.

`md4qt` can be built with Qt6 or with ICU.

This library parses Markdown into a tree structure.
- Example
- Benchmark
- Playground
- Q/A
  - Why another AST Markdown parser?
  - What should I know about links in the document?
  - What is the second argument of `MD::Parser::parse()`?
  - What is an `MD::Anchor`?
  - Does the library throw exceptions?
  - Why are `MD::Parser` and `MD::Document` templates?
  - So, how can I use `md4qt` with `Qt6` and `ICU`?
  - `ICU` is slower than `Qt6`? Really?
  - Why is parsing wrong on Windows with `std::ifstream`?
  - How can I convert `MD::Document` into `HTML`?
  - How can I obtain positions of blocks/elements in a `Markdown` file?
  - How can I easily traverse through the `MD::Document`?
  - Why don't you have an implementation for pure `STL` with `std::string`?
  - Is it possible to write a custom text plugin for this parser?
  - Is it possible to find a `Markdown` item by its position?
  - How can I walk through the document and find all items of a given type?
  - How can I add and process a custom (user-defined) item in `MD::Document`?

    // Enable Qt6 support before including the parser.
    #define MD4QT_QT_SUPPORT
    #include <md4qt/parser.hpp>

    // Needed for qDebug().
    #include <QDebug>

    int main()
    {
        MD::Parser< MD::QStringTrait > p;

        // Parse the file into a document tree.
        auto doc = p.parse( QStringLiteral( "your_markdown.md" ) );

        for( auto it = doc->items().cbegin(), last = doc->items().cend(); it != last; ++it )
        {
            switch( (*it)->type() )
            {
                case MD::ItemType::Anchor :
                {
                    // An anchor marks the start of a file in the document.
                    auto a = static_cast< MD::Anchor< MD::QStringTrait >* > ( it->get() );
                    qDebug() << a->label();
                }
                    break;

                default :
                    break;
            }
        }

        return 0;
    }

An approximate benchmark against `cmark-gfm` says that the Qt6 version of `md4qt` is ~14 times slower. But you get a complete C++ tree structure of the Markdown document with all major extensions, plus sugar and a cherry on the cake.

| Markdown library | Result |
|---|---|
| cmark-gfm | 0.23 ms |
| markdown-it (Rust) | 2.55 ms |
| md4qt with Qt6 | 3.6 ms |
| md4qt with Qt6 without GitHub autolinks extension | 3.1 ms |

This measurement was done with the test file from `markdown-it (Rust)`.

The `markdown-it (Rust)` measurement was done with `markdown_it::plugins::extra`.

You can see `md4qt` in action in Markdown Tools. There you can find a Markdown editor/viewer/converter to PDF.

KleverNotes from KDE uses `md4qt` too.

- When I wrote this library I knew about the `md4c` parser, but not about `cmark-gfm`. `md4c` was not suitable for my purposes, whereas `cmark-gfm` could do everything I needed. But God made it so: I wrote `md4qt` and only later learned about `cmark-gfm`. OK, the code is written and tested. Let it be.

  What I can still say is that this library is C++, and for some people it can be easier to use C++ code than C with freeing memory by hand. Qt makes things easier by handling text encoding... So let it be, guys.

  And one more cherry on the cake: `md4qt` can parse Markdown recursively, which is described below.
- In some cases a link's URL in Markdown refers to something defined in the document itself. So, when you get an `MD::Link` from the document, check whether the labeled links of the document contain a key equal to the URL in the link, and if so, use the URL from the labeled links. Look:

      MD::Link< MD::QStringTrait > * item = ...;

      QString url = item->url();

      const auto it = doc->labeledLinks().find( url );

      if( it != doc->labeledLinks().cend() )
          url = it->second->url();
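
The same lookup can be tried out without `md4qt` itself. This is only a minimal sketch: `Link`, `LabeledLinks` and `resolveUrl()` are hypothetical stand-ins invented for illustration (the real types are `MD::Link< Trait >` and whatever `doc->labeledLinks()` returns), and only the lookup logic mirrors the text above.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for MD::Link: only the URL matters here.
struct Link {
    std::string url;
};

// Hypothetical stand-in for the container returned by doc->labeledLinks().
using LabeledLinks = std::map< std::string, std::shared_ptr< Link > >;

// Returns the URL stored in the labeled links if the given URL is a known
// label, otherwise returns the given URL unchanged.
inline std::string resolveUrl( const LabeledLinks & labeled, const std::string & url )
{
    const auto it = labeled.find( url );

    if( it != labeled.cend() )
    {
        assert( it->second ); // A labeled link entry is expected to be non-null.

        return it->second->url;
    }

    return url;
}
```

A URL that is not a label passes through untouched, so the helper can be applied to every link without checking first.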
- The second argument of `MD::Parser::parse()` is a flag that tells the parser whether to process Markdown files recursively. If parsing is recursive and the targeted Markdown file contains links to other Markdown files, those files will be parsed too and will be present in the resulting document.
- As `md4qt` supports recursive Markdown parsing, the resulting document can represent more than one Markdown file. Each file in the document starts with an `MD::Anchor`; it just shows that during traversal of the document you have reached a new file.
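
To illustrate how anchors delimit files, here is a small sketch that groups a flat list of items by file. The `ItemType`, `Item` and `splitByFile()` names are hypothetical, simplified stand-ins and not the `md4qt` types; the only idea taken from the text is that each anchor starts a new file.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified item model (not the md4qt classes).
enum class ItemType { Anchor, Paragraph };

struct Item {
    ItemType type;
    // File name for an Anchor, content for anything else.
    std::string text;
};

using FileItems = std::pair< std::string, std::vector< std::string > >;

// Groups a flat list of items by file: every Anchor starts a new group,
// and the items that follow belong to that group until the next Anchor.
inline std::vector< FileItems > splitByFile( const std::vector< Item > & items )
{
    std::vector< FileItems > files;

    for( const auto & item : items )
    {
        if( item.type == ItemType::Anchor )
            files.emplace_back( item.text, std::vector< std::string >{} );
        else if( !files.empty() )
            files.back().second.push_back( item.text );
    }

    return files;
}
```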
- No. This library doesn't use exceptions. Any text is valid Markdown, so I don't need to inform the user about errors. Qt itself doesn't use exceptions either. So you can catch only standard C++ exceptions, like `std::bad_alloc`, for example. Possibly with `MD::UnicodeStringTrait` you will catch more standard exceptions, and possibly I missed something somewhere, but I tried to handle all possible exceptions.
- Since version `2.0.0`, `md4qt` can be built not only with `Qt6`, but with `STL` too. The code of the parser is the same in both cases. I just added two ready-made traits to support different C++ worlds. With `STL` I use the `ICU` library for Unicode handling and the `uriparser` library to parse and check URLs. These dependencies can be installed with the Conan package manager.
- To build with `ICU` support you need to define `MD4QT_ICU_STL_SUPPORT` before including `md4qt/parser.hpp`. In this case you will get access to `MD::UnicodeStringTrait`, which can be passed to `MD::Parser` as a template parameter. Your dependencies will be the `C++ STL`, `ICU` and `uriparser`.

  To build with `Qt6` support you need to define `MD4QT_QT_SUPPORT`. In this case you will get access to `MD::QStringTrait` to work with Qt's classes and functions. Your dependency in this case will be `Qt6`.

  You can define both to be able to use `md4qt` with `Qt6` and `ICU`.
- Don't believe anybody, just build the built-in `md_benchmark` and have a look. The dry numbers say that `Qt6` `QString` is ~2 times faster than `icu::UnicodeString` in such tasks. Markdown parsing implies checking every symbol, so it is tied to accessing every character in the string with `operator [] (...)` or the member `at(...)`. I do it very often in the parser's code, and the profiler says that most of the run time is spent on such operations. `QString` is just more optimized for accessing a separate character than `icu::UnicodeString`...
- Such a problem can occur on Windows with MSVC if you open the file in text mode, so for `MD::Parser` always open `std::ifstream` with the `std::ios::binary` flag. And yes, I expect to receive UTF-8 encoded content...
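
Opening a stream in binary mode uses only the standard library, so it can be shown in isolation. The helper name `readAllBinary()` is made up for this sketch, and how the stream is then handed to `MD::Parser` is deliberately left out.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file as raw bytes. std::ios::binary tells the runtime not
// to translate line endings, which is what text mode does with MSVC.
inline std::string readAllBinary( const std::string & fileName )
{
    std::ifstream stream( fileName, std::ios::in | std::ios::binary );

    if( !stream.good() )
        return {};

    return std::string( std::istreambuf_iterator< char >( stream ),
        std::istreambuf_iterator< char >() );
}
```

With the binary flag, a `\r\n` pair in the file arrives as two bytes on every platform, so column positions computed from the content stay consistent.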
- The `MD::toHtml()` function was implemented in version `2.0.5`. You can do the following:

      #define MD4QT_QT_SUPPORT
      #include <md4qt/traits.hpp>
      #include <md4qt/parser.hpp>
      #include <md4qt/html.hpp>

      int main()
      {
          MD::Parser< MD::QStringTrait > p;

          auto doc = p.parse( QStringLiteral( "your_markdown.md" ) );

          const auto html = MD::toHtml( doc );

          return 0;
      }
- Done in version `2.0.5`. Remember that all positions in `md4qt` start at 0, so the first symbol on the first line has coordinates `(0,0)`. One more important thing is that all position ranges in `md4qt` are inclusive, which means that the last column of any element points to the last symbol of that element.
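
Zero-based, inclusive ranges are easy to get off by one, so here is a tiny sketch of the arithmetic. `Range`, `length()` and `textAt()` are hypothetical helpers for a single line of text and are not part of `md4qt`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a position range on a single line.
struct Range {
    long long int startColumn; // Zero-based.
    long long int endColumn; // Zero-based and inclusive.
};

// With inclusive bounds the length is end - start + 1.
inline long long int length( const Range & r )
{
    return r.endColumn - r.startColumn + 1;
}

// Extracts the text covered by the range from the given line.
inline std::string textAt( const std::string & line, const Range & r )
{
    assert( r.startColumn >= 0 && r.endColumn >= r.startColumn );
    assert( r.endColumn < static_cast< long long int >( line.size() ) );

    return line.substr( static_cast< std::size_t >( r.startColumn ),
        static_cast< std::size_t >( length( r ) ) );
}
```

For the line `# Heading`, the word `Heading` occupies columns 2 to 8, which is 7 symbols.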
- Since version `2.6.0`, the `md4qt/visitor.hpp` header provides the `MD::Visitor` interface, with which you can easily walk through the document. All you need is to implement/override the virtual methods that handle this or that element in the document, like:

      virtual void onHeading(
          //! Heading.
          Heading< Trait > * h ) = 0;
- Because of performance. I did a pure `STL` implementation where the string class was `std::string` with a small third-party library to handle `UTF8`, and the benchmark said that the performance was like with `Qt6` `QString`, so I decided not to support a third trait. Maybe because I'm so lazy?
- Since version `3.0.0`, `MD::Parser` has a method for adding custom text plugins.

      //! Add text plugin.
      void addTextPlugin(
          //! ID of a plugin. Use TextPlugin::UserDefinedPluginID value for start ID.
          int id,
          //! Function of a plugin, that will be invoked to process raw text.
          TextPluginFunc< Trait > plugin,
          //! Should this plugin be used in parsing of internals of links?
          bool processInLinks,
          //! User data that will be passed to plugin function.
          const typename Trait::StringList & userData );
- The `ID` of a plugin is a regular `int` that should (but is not required to) start from the `UserDefinedPluginID` value.

      enum TextPlugin : int {
          UnknownPluginID = 0,
          GitHubAutoLinkPluginID = 1,
          UserDefinedPluginID = 255
      }; // enum TextPlugin

  Note that plugins are invoked in the order of their `ID`, from smallest to largest, so a developer can control the order of text plugins.
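
The ordering rule can be sketched with a sorted container. The enum values are copied from the text, while `PluginFunc`, the `std::map` registry and `runPlugins()` are assumptions made for this illustration and say nothing about how `MD::Parser` actually stores its plugins.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// IDs as listed above.
enum TextPlugin : int {
    UnknownPluginID = 0,
    GitHubAutoLinkPluginID = 1,
    UserDefinedPluginID = 255
};

// Hypothetical plugin type: it just records that it was invoked.
using PluginFunc = std::function< void ( std::vector< std::string > & ) >;

// std::map keeps its keys sorted, so iterating it visits the plugins
// from the smallest ID to the largest one.
inline void runPlugins( const std::map< int, PluginFunc > & plugins,
    std::vector< std::string > & log )
{
    for( const auto & p : plugins )
    {
        assert( p.second ); // Every registered plugin must be callable.

        p.second( log );
    }
}
```

Registering user plugins with `UserDefinedPluginID`, `UserDefinedPluginID + 1` and so on makes them run after the built-in autolink plugin, in the order of their IDs, no matter in which order they were added.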
- A text plugin is a regular function with the signature

      template< class Trait >
      using TextPluginFunc = std::function< void ( std::shared_ptr< Paragraph< Trait > >,
          TextParsingOpts< Trait > &, const typename Trait::StringList & ) >;

  You will get an already parsed `Paragraph` with all items in it, and you are able to process the remaining raw text data and check it for what you need.

  `TextParsingOpts` is an auxiliary structure with some data. You are interested in `bool collectRefLinks;`: when this flag is `true`, the parser is in the state of collecting reference links, and at this stage a plugin may do nothing.

  The last argument of the plugin function is the user data that was passed to the `MD::Parser::addTextPlugin()` method.

  The most important thing in the `TextParsingOpts` structure is `std::vector< TextData > rawTextData;`. This vector contains the not yet processed raw text data from `Markdown`. The size of `rawTextData` is the same as the count of `Text` items in `Paragraph`, and their sizes should remain equal. So, if you replace one of the text items with something else, for example a link, the corresponding text item should be removed from both `Paragraph` and `rawTextData`. Or, if you replace just a part of a text item, it should be modified in both `Paragraph` and `rawTextData`. Be careful: it's UB if you make a mistake here, and possibly you will crash.

  One more thing: don't forget to set the positions of elements in `Document` to new values if you change something, and don't forget about such things as `openStyles()` and `closeStyles()` of `ItemWithOpts` items. The document should remain correct after your manipulations, so that any syntax highlighter, for example, won't make a mistake.

  Note that `TextData` is

      struct TextData {
          typename Trait::String str;
          long long int pos = -1;
          long long int line = -1;
          bool spaceBefore = false;
          bool spaceAfter = false;
      };

  And `pos` and `line` here are relative to the `MdBlock< Trait > & fr;` member of `TextParsingOpts`, but the document requires absolute positions in the `Markdown` text. So when you set positions on new items, use, for example, the following code:

      setEndColumn( po.fr.data.at( s.line ).first.virginPos( s.pos ) );

  where `s` is an object of `TextData` type.
- The `processInLinks` flag should be set to `false` if you don't want your plugin to process link captions. For example, links can't contain other links, so if you are implementing a plugin for a new kind of link, this flag should be set to `false` for your plugin.
- `userData` is a list of strings that will be passed to the plugin function. This is auxiliary data that can be handy for a plugin implementation.
- `md4qt` already has one text plugin, for handling GitHub's autolinks. The plugin function is quite simple, look:

      template< class Trait >
      inline void
      githubAutolinkPlugin( std::shared_ptr< Paragraph< Trait > > p,
          TextParsingOpts< Trait > & po,
          // User data is not used by this plugin, the parameter is here
          // to match the TextPluginFunc signature.
          const typename Trait::StringList & )
      {
          // Do nothing while the parser is only collecting reference links.
          if( !po.collectRefLinks )
          {
              long long int i = 0;

              while( i >= 0 && i < (long long int) po.rawTextData.size() )
              {
                  i = processGitHubAutolinkExtension( p, po, i );

                  ++i;
              }
          }
      }

  But `processGitHubAutolinkExtension()` is not so trivial :) Have a look at its implementation for a good example; it's placed in `parser.hpp`.

  Good luck with plugining. :)
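- Let me show by example how the raw text data correlates with the paragraph. Just two diagrams and you won't have any more questions. Look.

  Suppose we want to replace every occurrence of `@X` with some kind of a link. Before modifications we had:

  [diagram: paragraph and raw text data before the plugin runs]

  And after your plugin has done its work we should have:

  [diagram: paragraph and raw text data after the plugin runs]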
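
To make the correlation between the paragraph and the raw text data concrete, here is a minimal sketch of replacing `@X` with a link while keeping both containers in step. Everything in it is a simplified, hypothetical stand-in: `Item`, `Paragraph`, the cut-down `TextData`, `inSync()` and `replaceFirstMention()` are not `md4qt` code, and the sketch assumes that a text item holds exactly its raw text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, cut-down stand-ins for the md4qt types.
struct TextData {
    std::string str;
    long long int pos;
    long long int line;
};

struct Item {
    enum Kind { Text, Link } kind;
    std::string text;
};

struct Paragraph {
    std::vector< Item > items;
};

// The invariant from the text: one raw text entry per Text item.
inline bool inSync( const Paragraph & p, const std::vector< TextData > & raw )
{
    const auto texts = std::count_if( p.items.cbegin(), p.items.cend(),
        []( const Item & i ) { return i.kind == Item::Text; } );

    return static_cast< std::size_t >( texts ) == raw.size();
}

// Replaces the first "@X" word with a Link item, splitting the Text item and
// the matching raw text entry in lockstep. Returns false if nothing was found.
inline bool replaceFirstMention( Paragraph & p, std::vector< TextData > & raw )
{
    assert( inSync( p, raw ) );

    std::size_t rawIdx = 0;

    for( std::size_t k = 0; k < p.items.size(); ++k )
    {
        if( p.items[ k ].kind != Item::Text )
            continue;

        // Copy, because the entry is erased below.
        const TextData src = raw[ rawIdx ];
        const auto at = src.str.find( '@' );

        if( at == std::string::npos )
        {
            ++rawIdx;
            continue;
        }

        auto end = src.str.find( ' ', at );

        if( end == std::string::npos )
            end = src.str.size();

        std::vector< Item > newItems;
        std::vector< TextData > newRaw;

        // Text before the mention keeps the original position.
        if( at > 0 )
        {
            newItems.push_back( { Item::Text, src.str.substr( 0, at ) } );
            newRaw.push_back( { src.str.substr( 0, at ), src.pos, src.line } );
        }

        // The link itself has no raw text entry.
        newItems.push_back( { Item::Link, src.str.substr( at, end - at ) } );

        // Text after the mention starts later on the same line.
        if( end < src.str.size() )
        {
            newItems.push_back( { Item::Text, src.str.substr( end ) } );
            newRaw.push_back( { src.str.substr( end ),
                src.pos + static_cast< long long int >( end ), src.line } );
        }

        const auto itemPos = static_cast< std::ptrdiff_t >( k );
        const auto rawPos = static_cast< std::ptrdiff_t >( rawIdx );

        p.items.erase( p.items.begin() + itemPos );
        p.items.insert( p.items.begin() + itemPos, newItems.cbegin(), newItems.cend() );
        raw.erase( raw.begin() + rawPos );
        raw.insert( raw.begin() + rawPos, newRaw.cbegin(), newRaw.cend() );

        return true;
    }

    return false;
}
```

For a paragraph with the single text `hi @X bye`, the result is three items (text, link, text) and two raw text entries, so the count of text items and the size of the raw text data stay equal.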
- Since version `3.0.0` there is a function to get a substring from a text fragment by the given virgin positions.

      template< class Trait >
      inline typename Trait::String
      virginSubstr( const MdBlock< Trait > & fr, const WithPosition & virginPos );

  And a function to get a local position from a virgin one.

      template< class Trait >
      inline std::pair< long long int, long long int >
      localPosFromVirgin( const MdBlock< Trait > & fr, long long int virginColumn, long long int virginLine );
- Since version `3.0.0` there is a new structure, `MD::PosCache`. You can pass an `MD::Document` to its `initialize()` method and then use the `findFirstInCache()` method to find the first item at a given position, together with all its nested first children.
- Since version `3.0.0` there is the `forEach()` algorithm.

      //! Calls function for each item in the document with the given type.
      template< class Trait >
      inline void
      forEach(
          //! Vector of item's types to be processed.
          const typename Trait::template Vector< ItemType > & types,
          //! Document.
          std::shared_ptr< Document< Trait > > doc,
          //! Functor object.
          ItemFunctor< Trait > func,
          //! Maximum nesting level.
          //! 0 means infinity, 1 - only top level items...
          unsigned int maxNestingLevel = 0 );
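
The meaning of `maxNestingLevel` can be sketched on a toy tree. `Item`, `ItemFunctor`, `make()` and `forEachSketch()` below are hypothetical stand-ins written for this illustration; they only mimic the documented rule that 0 means no limit and 1 means top-level items only.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified item tree (not the md4qt classes).
enum class ItemType { Paragraph, List, ListItem, Link };

struct Item {
    ItemType type;
    std::vector< std::shared_ptr< Item > > children;
};

using ItemFunctor = std::function< void ( Item * ) >;

// Small helper to build a tree.
inline std::shared_ptr< Item > make( ItemType type,
    std::vector< std::shared_ptr< Item > > children = {} )
{
    auto item = std::make_shared< Item >();
    item->type = type;
    item->children = std::move( children );

    return item;
}

// Calls func for every item whose type is listed in types.
// maxNestingLevel: 0 means no limit, 1 means top-level items only, and so on.
inline void forEachSketch( const std::vector< ItemType > & types,
    const std::vector< std::shared_ptr< Item > > & items,
    const ItemFunctor & func,
    unsigned int maxNestingLevel = 0,
    unsigned int level = 1 )
{
    if( maxNestingLevel != 0 && level > maxNestingLevel )
        return;

    for( const auto & item : items )
    {
        assert( item );

        if( std::find( types.cbegin(), types.cend(), item->type ) != types.cend() )
            func( item.get() );

        forEachSketch( types, item->children, func, maxNestingLevel, level + 1 );
    }
}
```

In a tree where one link sits directly in a top-level paragraph and another one sits inside a list item, a limit of 2 reaches only the first link, and a limit of 0 reaches both.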
- Since version `3.0.0` the `MD::ItemType` enum has a `UserDefined` enumerator. So you can inherit from any `MD::Item` class and return from the `type()` method a value greater than or equal to `MD::ItemType::UserDefined`. To handle user-defined item types, the `MD::Visitor` class now has the method `void onUserDefined( Item< Trait > * item )`. So you can handle your custom items and do what you need.
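
The idea can be sketched with a toy item hierarchy. All classes below (`Item`, `Text`, `Mention`, `Visitor`, `MentionCollector`) and the enum values are hypothetical and heavily simplified; in particular, the dispatch inside `visit()` is an assumption about how such a check could look, not the `MD::Visitor` implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical, cut-down item types: only what the sketch needs.
enum class ItemType : int { Text = 1, UserDefined = 255 };

struct Item {
    virtual ~Item() = default;
    virtual ItemType type() const = 0;
};

struct Text : Item {
    ItemType type() const override { return ItemType::Text; }
};

// A custom item: its type is a value not less than ItemType::UserDefined.
struct Mention : Item {
    std::string user;

    ItemType type() const override
    {
        return static_cast< ItemType >( static_cast< int >( ItemType::UserDefined ) + 1 );
    }
};

struct Visitor {
    virtual ~Visitor() = default;

    // Routes every item with a user-defined type to onUserDefined().
    void visit( Item * item )
    {
        assert( item );

        if( static_cast< int >( item->type() ) >= static_cast< int >( ItemType::UserDefined ) )
            onUserDefined( item );
        else
            onText( item );
    }

    virtual void onText( Item * ) {}
    virtual void onUserDefined( Item * ) {}
};

// Handles the custom item and ignores everything else.
struct MentionCollector : Visitor {
    std::string found;

    void onUserDefined( Item * item ) override
    {
        if( auto m = dynamic_cast< Mention * >( item ) )
            found = m->user;
    }
};
```

Starting custom types at `UserDefined` and counting upwards leaves room for several custom item kinds that can be told apart inside the single `onUserDefined()` handler.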