URL parser improvements #1452

Closed · Darthagnon opened this issue Aug 31, 2024 · 13 comments

Comments

@Darthagnon (Contributor)

I currently use a janky, AI-generated Python script to convert a list of URLs into an HTML-formatted list for use with WebToEpub: https://github.com/Darthagnon/web2epub-tidy-script

It works around the workflow problems I have with this extension, as explained in a previous issue (quoted below).

Would it be possible to adapt the URL parser to automatically do what I currently use my external script to do?

@Darthagnon (Contributor, Author) commented Aug 31, 2024

Originally posted by @Darthagnon in #1300 (comment):

Apologies, my explanation was rather confusing.

"Edit chapter URLs" is the view I use to collect chapter lists to convert to EPUB, because

  • the Wizards website is broken/useless/missing chapters, so there is no auto-parser that could work (EDIT: without too much work). An auto-parser would need to process https://magic.wizards.com/en/news/archive (2024), https://web.archive.org/web/2023mmddetc/https://magic.wizards.com/en/articles/archive (an unreliable infinite scroller), https://web.archive.org/web/2020mmddetc/http://www.wizards.com/Magic/Magazine/Archive.aspx (paginated, mostly 404s), and https://web.archive.org/web/2014mmddetc/http://www.wizards.com/Magic/Magazine/Article.aspx
  • a lot of chapters are not story-related, so they are less useful for an EPUB.

Questions

  1. Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?
  2. Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.
  3. Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a>. Could it be changed to just take a list of URLs? E.g. instead of
<a href="https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22">18 Cowardice of the Hero</a>
<a href="https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05">19 Emonberry Red</a>
<a href="https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14">20 Kiora's Followers</a>

we could have

https://magic.wizards.com/en/news/magic-story/cowardice-hero-2014-01-22
https://magic.wizards.com/en/news/magic-story/emonberry-red-2014-02-05
https://magic.wizards.com/en/news/magic-story/kioras-followers-2014-02-14

... and the titles would be read, according to the parser's filter template, into editable fields in the chapter list:

[screenshot: chrome_240508_57, mock-up of the proposed editable chapter list]
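For illustration, a rough sketch of what such a plain-URL mode might do (the helper name and the "[placeholder]" convention are my own assumptions, not existing WebToEpub behaviour):

function chaptersFromUrlList(text) {
    // Split pasted text into lines, keep only the ones that look like URLs,
    // and emit the {sourceUrl, title} shape the chapter list works with.
    return text.split("\n")
        .map(line => line.trim())
        .filter(line => line.startsWith("http"))
        .map(url => ({ sourceUrl: url, title: "[placeholder]" }));
}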

Many thanks for any advice or help!

@Darthagnon (Contributor, Author)

Concept screenshot of improved workflow for WebToEpub:
[screenshot: improvements to web2epub]

@gamebeaker (Collaborator)

I think the concept isn't a bad idea.
Problem: your current solution downloads every chapter twice: one download is the Python script extracting the titles, the second is WebToEpub fetching the content.
If this were implemented in WebToEpub, I think there would need to be a placeholder title, as the real title is only known after the chapter is downloaded.

@dteviot (Owner) commented Aug 31, 2024

@Darthagnon

Off the top of my head

Parser: Is it possible for a single parser to target multiple domains (e.g. archive.org, multiple versions of wizards.com) with separate filters on a per-chapter basis? Or are there any example site templates that are similar to this use case?

Yes. See https://github.com/dteviot/WebToEpub/blob/ExperimentalTabMode/plugin/js/parsers/NoblemtlParser.js. The basic technique is, for each function (e.g. find the cover image, find the synopsis), to perform the operation both ways and take the first result that works.
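As a minimal sketch of that pattern (the function name and selectors here are made up for illustration, not taken from NoblemtlParser.js):

function findCoverImage(dom) {
    // Try the layout used by the first domain...
    let cover = dom.querySelector(".hero-image img");
    // ...and fall back to the second domain's layout if that found nothing.
    return cover ?? dom.querySelector("article img");
}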

Extension: Would it be possible to (optionally) open "Edit chapter URLs" by default, rather than auto-parsing all chapters? For the default parser and manual listing on weird random websites, auto-parsing doesn't seem to work very well, gathering lots of irrelevant navbar links. It would help my workflow if I could paste story links directly, rather than fiddling with the UI to eliminate irrelevant chapters.

This seems something of an edge case.
Does using the auto-parser actually take much time?
I would have thought you're just opening "Edit chapter URLs" and deleting everything in it.
i.e. Select all, delete.

Extension: "Edit chapter URLs" requires specifying the title to use per chapter, as well as HTML tags, e.g. <a href="">Title here</a>. Could it be changed to just take a list of URLs?

That's not a bad idea. I think the way it would work would go something like:

  1. You can leave the title out of the hyperlink.
  2. If there's no title, WebToEpub adds the title that it finds in the chapter.

I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.
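A sketch of how that could look (hypothetical code assuming a "[placeholder]" convention; the eventual implementation may differ):

function linkToChapterEntry(anchor) {
    let title = anchor.textContent.trim();
    return {
        sourceUrl: anchor.href,
        // Blank title: use a placeholder until the chapter is downloaded
        // and its real title can be read from the page.
        title: title !== "" ? title : "[placeholder]"
    };
}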

@Darthagnon (Contributor, Author) commented Sep 13, 2024

Problem: your current solution downloads every chapter twice: one download is the Python script extracting the titles, the second is WebToEpub fetching the content.

My current workflow is indeed 2-stage, because I haven't managed to write proper parsers for the websites I use, and WebToEpub does not (yet?) extract titles from chapters. So I must first grab the titles and URLs (note: as far as I know, no chapters are downloaded, just the page titles) and then supply those to WebToEpub.

I will try to put together a parser for the multiple sites I need in one EPUB, based on NoblemtlParser.js as suggested.


This seems something of an edge case.
Does using the auto-parser actually take much time?
I would have thought you're just opening "Edit chapter URLs" and deleting everything in it.
i.e. Select all, delete.

This is my current workflow. The auto-parser doesn't take much time; it just serves no purpose for my use case, being part of the ritual to appease the machine spirit before actually getting to work and downloading an EPUB.

It's just that the auto-parser, by default, doesn't work with most of the websites I give it, so it would save me a few clicks and some fiddling if it were disabled by default and only enabled by user choice (or on detecting a supported URL).

You can leave the title out of the hyperlink.
If there's no title, WebToEpub adds the title that it finds in the chapter.
I should clarify: formatting as hyperlinks is still required, because WebToEpub expects a list of {URL, Title} items, and the hyperlink format provides this. We'd just be adding the ability to leave the title blank.

That sounds amazing, and exactly how I wish it worked: most chapters have some sort of <h1> title that can be picked up. I'm glad the suggestion has provided some interesting ideas!
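As a rough sketch of that title pickup (a hypothetical helper, not part of WebToEpub):

function titleFromChapterDom(dom) {
    // Prefer the chapter's <h1>, falling back to the document <title>.
    let heading = dom.querySelector("h1");
    return heading ? heading.textContent.trim() : dom.title;
}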

@Darthagnon (Contributor, Author)

I have started the implementation of the multi-domain Wizards MtG story scraper here: https://github.com/Darthagnon/web2epub-tidy-script/blob/master/MagicWizardsParser.js (resolves #1300)

I hate myself for using AI-generated scripts, but I know only very basic JS. Initial testing looks promising; it correctly scrapes chapters from Archive.org and the live website (though titles are duplicated and author names are excluded).

@dteviot (Owner) commented Sep 13, 2024

@Darthagnon

Just giving it a quick once-over.

These lines should not be needed:

parserFactory.register("web.archive.org", () => new MagicWizardsParser()); // For archived versions
parserFactory.registerRule(
    (url, dom) => MagicWizardsParser.isMagicWizardsTheme(dom) * 0.7,
    () => new MagicWizardsParser()
);

WebToEpub knows about web.archive.org; it will search the rest of the URL for the original site's hostname and apply the parser registered for that.
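Illustratively, the mapping looks something like this (a sketch only; WebToEpub's actual handling may differ):

function originalHostname(url) {
    // "https://web.archive.org/web/20230127170159/https://magic.wizards.com/..."
    // -> "magic.wizards.com"
    let match = url.match(/web\.archive\.org\/web\/[^\/]+\/(.+)$/);
    return match ? new URL(match[1]).hostname : new URL(url).hostname;
}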

Lines 19, 30, 57, 67, and 113,

if (window.location.hostname.includes("web.archive.org")) 

should be (I think; note that this is untested)

if (dom.baseURI.includes("web.archive.org")) 

This is also not needed:

    // Detect if the site matches the expected structure for magic.wizards.com or the archived version
    static isMagicWizardsTheme(dom) {
        // Check if the page is archived
        if (window.location.hostname.includes("web.archive.org")) {
            // Archived page structure typically wraps the original content in #content
            return dom.querySelector("#content article") != null || dom.querySelector("#content .article-content") != null;
        }
        // Regular magic.wizards.com structure
        return dom.querySelector("article") != null || dom.querySelector(".article-content") != null;
    }

@Darthagnon (Contributor, Author) commented Sep 13, 2024

Swapping if (window.location.hostname.includes("web.archive.org")) to if (dom.baseURI.includes("web.archive.org")) breaks it for Archive.org pages (e.g. https://web.archive.org/web/20230127170159/https://magic.wizards.com/en/news/magic-story), though I have applied your other changes.

@dteviot (Owner) commented Sep 13, 2024

@Darthagnon

If changing from window.location breaks things, something is wrong elsewhere.
window.location is the URL of the page the browser is showing, which is NOT the same as the chapter/page that WebToEpub is currently processing.
The dom parameter passed into these calls is the page being processed, so you want to switch based on its URL.
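In other words, something along these lines (an untested sketch of the suggested switch):

function findContentFor(dom) {
    // Decide based on the document being processed, not the browser tab.
    let isArchived = dom.baseURI.includes("web.archive.org");
    return isArchived
        ? dom.querySelector("#content article")
        : dom.querySelector(".entry-content, article, .article-content");
}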

@Darthagnon (Contributor, Author) commented Sep 14, 2024

Further updates: this is the latest version, v0.4. It's not perfect, but it works more or less:

"use strict";

// Register the parser for magic.wizards.com and archive versions
parserFactory.register("magic.wizards.com", () => new MagicWizardsParser());

class MagicWizardsParser extends Parser {
    constructor() {
        super();
    }

    // Extract the list of chapter URLs
    async getChapterUrls(dom) {
        let chapterLinks = [];
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived versions, select the correct container within #content
            chapterLinks = [...dom.querySelectorAll("#content article a, #content .article-content a")];
        } else {
            // For live pages
            chapterLinks = [...dom.querySelectorAll("article a, .article-content a")];
        }
        
        // Filter out author links using their URL pattern
        chapterLinks = chapterLinks.filter(link => !this.isAuthorLink(link));
        
        return chapterLinks.map(this.linkToChapter);
    }

    // Helper function to detect whether a link is an author link
    isAuthorLink(link) {
        // Author pages use the /archive?author= URL pattern
        const authorPattern = /\/archive\?author=/;
        return authorPattern.test(link.href);
    }

    // Format chapter links into a standardized structure
    linkToChapter(link) {
        let title = link.textContent.trim();
        return {
            sourceUrl: link.href,
            title: title
        };
    }

    // Extract the content of the chapter
    findContent(dom) {
        if (window.location.hostname.includes("web.archive.org")) {
            // For archived pages, the content is often inside #content
            return dom.querySelector("#content article");
        } else {
            // For live pages
            return dom.querySelector(".entry-content, article, .article-content");
        }
    }

}

Known issues:

  • Seems to work for story index pages such as:
  • Excludes author names
  • Sometimes excludes chapter titles
  • Does not crawl paginated indexes beyond the 1st page (see the sketch after this list)
  • Does not yet work for https://mtglore.com/
  • Ignores chapter index thumbnails and chapter summary blurbs
  • Uses 1st chapter index thumbnail as cover art, rather than page hero image
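For the pagination gap, a generic sketch using plain browser APIs (the selectors are hypothetical, and this is independent of WebToEpub's internals):

async function collectPaginatedLinks(startUrl, linkSelector, nextSelector) {
    let links = [];
    let url = startUrl;
    while (url != null) {
        // Fetch each index page and parse it into a DOM.
        let html = await (await fetch(url)).text();
        let dom = new DOMParser().parseFromString(html, "text/html");
        links.push(...dom.querySelectorAll(linkSelector));
        // Follow the "next page" link, resolved against the current page.
        let next = dom.querySelector(nextSelector);
        url = next ? new URL(next.getAttribute("href"), url).href : null;
    }
    return links;
}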

@Darthagnon (Contributor, Author)

Update: fixed chapter title parsing. It might be ready for prime time.

@gamebeaker (Collaborator)

@Darthagnon Test versions for Firefox and Chrome have been uploaded to https://github.com/dteviot/WebToEpub/releases/tag/developer-build. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes.

@dteviot (Owner) commented Nov 9, 2024

@Darthagnon

Updated version (1.0.1.0) has been submitted to Firefox and Chrome stores.
Firefox version is available now.
Chrome might be available in a few hours (typical) to 21 days.

dteviot closed this as completed Nov 9, 2024