GOALS

This is a simple HTML parser in C++.

Assumes C++20, mostly because that's what I currently use. Header-only, no external dependencies.

The main purpose is to let me get data from certain Steam webpages, so those pages are also my main test cases.

I still made this a separate project so that I can use it elsewhere as well. I'm therefore somewhat interested in fixing problems with other pages... if it's not too much work.

STANDARDS COMPLIANCE

None. It works with some pages, and fails for others.

KNOWN SHORTCOMINGS

no CDATA support... yet
doesn't handle "tag soup" and other invalid markup; this might be complicated to do
the list of named character references is woefully incomplete
no detailed information on errors
assumes UTF-8 input, and creates UTF-8 output

STUFF THAT IT DOESN'T WANT TO DO

no query functions -- just walk through the tree yourself, or use the callbacks
probably not a benchmark winner

USAGE

If you have a UTF-8 encoded webpage, parse it like this:

#include <string>

#include "HTMLParser/Parser.hpp"

void example()
{
    std::string html="…";   // the UTF-8 encoded page
    // ...
    HTMLParser::Parser parser(html);
    HTMLParser::Tree::Document document=parser.parse();
}

Refer to the HTMLParser/Tree.hpp file for the document data; it should be rather straightforward.
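
Since there are no query functions, finding elements means walking the tree yourself. The following is only a minimal sketch of such a walk. It does not use the real types from HTMLParser/Tree.hpp: DemoElement, makeElement, makeDemoTree and collectByName are hypothetical stand-ins, so the member names will need adapting to whatever Tree.hpp actually declares.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an element node; NOT the type from HTMLParser/Tree.hpp.
struct DemoElement
{
    std::string name;
    std::vector<std::unique_ptr<DemoElement>> children;
};

// Depth-first walk that collects every element with the given name, in document order.
void collectByName(const DemoElement& element, const std::string& name,
                   std::vector<const DemoElement*>& result)
{
    if (element.name==name)
    {
        result.push_back(&element);
    }
    for (const auto& child : element.children)
    {
        collectByName(*child, name, result);
    }
}

// Helper for the demo tree below.
std::unique_ptr<DemoElement> makeElement(std::string name)
{
    auto element=std::make_unique<DemoElement>();
    element->name=std::move(name);
    return element;
}

// Builds the tree for <html><body><a></a><p><a></a></p></body></html>.
DemoElement makeDemoTree()
{
    DemoElement html;
    html.name="html";

    auto body=makeElement("body");
    auto paragraph=makeElement("p");
    paragraph->children.push_back(makeElement("a"));
    body->children.push_back(makeElement("a"));
    body->children.push_back(std::move(paragraph));
    html.children.push_back(std::move(body));
    return html;
}
```

Calling collectByName(tree, "a", links) on the demo tree fills links with pointers to the two a elements. The pointers stay valid only as long as the tree itself is alive.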

You can also subclass the Parser class to learn about elements when they are created:

class MyParser : public HTMLParser::Parser
{
   // inherit the constructors of the base class
   using Parser::Parser;

public:
   virtual ~MyParser() =default;

   // called right after the start tag has been read
   void startElement(const HTMLParser::Tree::Element& element) override
   {
   }

   // called after the end tag has been read
   void endElement(const HTMLParser::Tree::Element& element) override
   {
   }
};

This lets you build lookup structures as elements are created (of course, you can just walk through the tree afterwards instead).

startElement is called right after reading the start tag. The element will have a parent, but its children will not be available yet.

endElement is called after reading the end tag. At this point, the list of children is complete as well.

Note: for both calls, even the html element at the top of the document has a parent pointer, which points to an element with no name. This is a parsing artifact; it will not appear in the final document tree.
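
The callback order described above can be sketched as follows. This does not run the real parser: SketchElement and SketchParser are hypothetical stand-ins that merely replay the calls for a fixed three-element document, and the parent and children members are assumptions about what an element might expose. LoggingParser records what is visible inside each callback.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in element; NOT the type from HTMLParser/Tree.hpp.
struct SketchElement
{
    std::string name;
    SketchElement* parent=nullptr;
    std::vector<std::unique_ptr<SketchElement>> children;
};

// Hypothetical stand-in parser that only replays the callbacks for
// <html><body><p></p></body></html>; it does no actual parsing.
class SketchParser
{
public:
    virtual ~SketchParser() =default;

    virtual void startElement(const SketchElement&) { }
    virtual void endElement(const SketchElement&) { }

    void parse()
    {
        SketchElement artifact;    // unnamed placeholder parent of html
        SketchElement* html=open(artifact, "html");
        SketchElement* body=open(*html, "body");
        SketchElement* paragraph=open(*body, "p");
        endElement(*paragraph);
        endElement(*body);
        endElement(*html);
    }

private:
    // Creates a child, reports it while it has no children yet, then attaches it.
    SketchElement* open(SketchElement& parent, std::string name)
    {
        auto child=std::make_unique<SketchElement>();
        child->name=std::move(name);
        child->parent=&parent;
        SketchElement* raw=child.get();
        startElement(*raw);
        parent.children.push_back(std::move(child));
        return raw;
    }
};

// Records the element name, parent name and child count seen in each callback.
class LoggingParser : public SketchParser
{
public:
    std::vector<std::string> log;

    void startElement(const SketchElement& element) override
    {
        log.push_back("start "+element.name+" parent='"+element.parent->name+
                      "' children="+std::to_string(element.children.size()));
    }

    void endElement(const SketchElement& element) override
    {
        log.push_back("end "+element.name+
                      " children="+std::to_string(element.children.size()));
    }
};
```

After calling parse() on a LoggingParser, the log holds six entries: three start entries from the outside in, then three end entries from the inside out. The start entry for html shows an empty parent name, which corresponds to the unnamed artifact element mentioned in the note.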
