URL parser improvements #1452
Comments
Originally posted by @Darthagnon in #1300 (comment):
I think the concept isn't a bad idea.
Off the top of my head
Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js. The basic technique is that in each function (e.g. look for cover image, look for synopsis) you perform the operation both ways and then take the first one that works.
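The "perform the operation both ways, take the first one that works" idea could be sketched roughly like this. Note this is only an illustration, not code from NoblemtlParser.js: `firstThatWorks`, `findCoverLayoutA`, `findCoverLayoutB` and the plain-object `page` are all hypothetical stand-ins (a real parser would query the DOM).

```javascript
// Hypothetical helper: run each strategy in order and return the first
// result that is neither null nor undefined.
function firstThatWorks(strategies, input) {
    for (const strategy of strategies) {
        try {
            const result = strategy(input);
            if (result !== null && result !== undefined) {
                return result;
            }
        } catch (err) {
            // A strategy that throws counts as "did not work"; try the next one.
        }
    }
    return null;
}

// Illustrative stand-ins for two site layouts (assumed, not real selectors).
const findCoverLayoutA = (page) => page.coverA ?? null;
const findCoverLayoutB = (page) => page.coverB ?? null;

function findCoverImage(page) {
    return firstThatWorks([findCoverLayoutA, findCoverLayoutB], page);
}
```

Each lookup (cover image, synopsis, and so on) would get its own list of strategies, ordered so the most specific layout is tried first.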
This seems something of an edge case.
That's not a bad idea. I think the way it would work would go something like:
I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
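To make the {URL, Title} idea concrete, here is a rough sketch of turning hyperlink text into such items while allowing a blank title. `parseHyperlinks` is a hypothetical name, and this is not how WebToEpub itself reads the list; the regular expression is only there to keep the example dependency-free (inside the extension a DOM parser would be the more robust choice).

```javascript
// Hypothetical helper: extract {url, title} pairs from a string of <a> tags.
// An empty link body simply yields a blank title.
function parseHyperlinks(html) {
    const anchor = /<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
    const items = [];
    for (const match of html.matchAll(anchor)) {
        items.push({ url: match[1], title: match[2].trim() });
    }
    return items;
}

// Example input: the second link has no title.
const sample =
    '<a href="https://example.com/ch1">Chapter 1</a>\n' +
    '<a href="https://example.com/ch2"></a>';
const chapters = parseHyperlinks(sample);
```

A blank `title` could then be filled in later, for example from the chapter page once it has been fetched.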
My current workflow is indeed 2-stage, because I haven't managed to write proper parsers for the websites I use, and WebToEpub does not (yet?) extract titles from chapters. So I must first grab the titles and URLs (note: as far as I know, no chapters are downloaded, just the page titles) and then supply those to WebToEpub. I will try and put together a parser for the multiple sites I need in 1 epub, based off NoblemtlParser.js as suggested.
This is my current workflow. The auto-parser doesn't take much time, but for my purposes it does nothing useful; it is just part of the ritual to appease the machine spirit before actually getting to work and downloading an EPUB. Because the auto-parser doesn't work with most websites by default, it would save me a few clicks and some fiddling if it were disabled by default and enabled only by user choice (or on detecting a supported URL).
That sounds amazing, and exactly how I wish it worked - most chapters have some sort of |
I have started the implementation of the multi-domain Wizards MtG story scraper here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js (resolves #1300). I hate myself for using AI-generated scripts, but I know only very basic JS. Initial testing looks promising; it correctly scrapes chapters from Archive.org and the live website (though titles are duplicated and author names are excluded).
Just giving it a quick once-over. These lines should not be needed:

    parserFactory.register("web.archive.org", () => new MagicWizardsParser()); // For archived versions
    parserFactory.registerRule(
        (url, dom) => MagicWizardsParser.isMagicWizardsTheme(dom) * 0.7,
        () => new MagicWizardsParser()
    );

WebToEpub knows about web.archive.org, and will search the rest of the URL for the original site hostname and apply the parser for that.

On lines 19, 30, 57, 67 and 113,

    if (window.location.hostname.includes("web.archive.org"))

should be (I think; note, not tested)

    if (dom.baseURI.includes("web.archive.org"))

This is also not needed:

    // Detect if the site matches the expected structure for magic.wizards.com or the archived version
    static isMagicWizardsTheme(dom) {
        // Check if the page is archived
        if (window.location.hostname.includes("web.archive.org")) {
            // Archived page structure typically wraps the original content in #content
            return dom.querySelector("#content article") != null || dom.querySelector("#content .article-content") != null;
        }
        // Regular magic.wizards.com structure
        return dom.querySelector("article") != null || dom.querySelector(".article-content") != null;
    }
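For anyone wondering what "search the rest of the URL for the original site hostname" might look like, here is a minimal sketch. It is an assumption about the general approach, not WebToEpub's actual code: `originalHostname` is a hypothetical name, and it relies on the usual Wayback Machine snapshot layout of `https://web.archive.org/web/<timestamp>/<original URL>`.

```javascript
// Hypothetical helper: return the hostname a parser should be matched against.
// For Wayback Machine snapshots, that is the hostname of the archived page.
function originalHostname(url) {
    const parsed = new URL(url);
    if (parsed.hostname !== "web.archive.org") {
        return parsed.hostname;
    }
    // Assumed snapshot layout: /web/<timestamp>/<original URL>
    const match = parsed.pathname.match(/^\/web\/[^/]+\/(https?:\/\/.+)$/);
    if (match === null) {
        return parsed.hostname;
    }
    try {
        return new URL(match[1]).hostname;
    } catch (err) {
        // The embedded URL could not be parsed; fall back to the archive host.
        return parsed.hostname;
    }
}
```

With something like this in place, a parser registered only for magic.wizards.com would also be picked for archived copies, which is why the extra `register` call for web.archive.org should be unnecessary.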
Swapping |
If changing window.location breaks things, something is wrong.
Further updates; this is latest v0.4. It's not perfect, but works more or less:
Known issues:
Update: fixed chapter title parsing. It might be ready for prime-time.
@Darthagnon Test versions for Firefox and Chrome have been uploaded to https://github.com/dteviot/WebToEpub/releases/tag/developer-build. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes.
Updated version (1.0.1.0) has been submitted to Firefox and Chrome stores.
I currently use an AI-generated janky Python script to convert a list of URLs into an HTML-formatted list for use with WebToEpub: https://github.com/Darthagnon/web2epub-tidy-script
It works to solve the workflow problems I have with this extension, as explained in a previous issue (quoted below).
Would it be possible to adapt the URL parser to automatically do what I currently use my external script to do?
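As a rough idea of what the built-in equivalent might do, the sketch below turns a pasted list of URLs (one per line) into hyperlinks with blank titles. This is not the Python script's logic and not existing WebToEpub code; `urlsToHyperlinks` and `escapeHtml` are hypothetical names, and leaving the title blank is an assumption based on the discussion in this thread.

```javascript
// Escape characters that are unsafe inside an HTML attribute value.
function escapeHtml(text) {
    return text
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

// Hypothetical helper: one URL per input line, one <a> element per output line.
// Blank lines are skipped and surrounding whitespace is trimmed.
function urlsToHyperlinks(text) {
    return text
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter((line) => line !== "")
        .map((url) => `<a href="${escapeHtml(url)}"></a>`)
        .join("\n");
}
```

For example, `urlsToHyperlinks("https://example.com/a\nhttps://example.com/b")` yields two `<a>` lines that could be pasted straight into the chapter list.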