Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add parser for mtgstory.com #1500

Merged
merged 8 commits into from
Sep 25, 2024

Conversation

Darthagnon
Copy link
Contributor

WIP. Add parser for mtgstory.com (redirects to https://magic.wizards.com/en/story). Seems to work on most versions of the website (e.g. current live version, archive.org version from 2-3 years ago, untested on 10 years ago archive.org version). Still missing fallback support for mtglore.com.

Based on MagicWizardsParser.js v0.6 from https://github.com/Darthagnon/web2epub-tidy-script

@Darthagnon
Copy link
Contributor Author

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented that line.

@gamebeaker
Copy link
Collaborator

gamebeaker commented Sep 21, 2024

@Darthagnon can you fix the eslint errors? (just push to your branch i think it should update in the merge request)
https://github.com/dteviot/WebToEpub/actions/runs/10974479483/job/30473842554?pr=1500
image

@dteviot
Copy link
Owner

dteviot commented Sep 21, 2024

@Darthagnon

Replacing

findCoverImageUrl(dom) {
    // Try to find an image inside the '.swiper-slide' or inside an 'article'
    let imgElement = dom.querySelector(".swiper-slide img, article img");

    // If an image is found, return its 'src' attribute
    if (imgElement) {
        return imgElement.getAttribute("src");
    // Check if the URL starts with '//' (protocol-relative URL)
        if (imgSrc && imgSrc.startsWith("//")) {
            // Add 'https:' to the start of the URL
            imgSrc = "https:" + imgSrc;
        }
    }
    // Fallback if no image was found
    return null;
}

with

    findCoverImageUrl(dom) {
        return util.getFirstImgSrc(dom, ".swiper-slide img, article img");
    }

will fix your problems.

@dteviot
Copy link
Owner

dteviot commented Sep 21, 2024

@Darthagnon

For some reason, gitignore was set to ignore additions to the parsers folder, so I have commented that line.

You can just remove that line.

@dteviot
Copy link
Owner

dteviot commented Sep 21, 2024

@Darthagnon

Commented out line 5

//parserFactory.register("mtglore.com", () => new MagicWizardsParser());

should be removed.

This

        if (authorPattern.test(href)) {
            return true;
        } else {
            return false;
        }

should be

        return authorPattern.test(href));

I'm not convinced that

if (window.location.hostname.includes("web.archive.org"))

does what you think it does.
I'm lazy. Please provide link links to the two cases it should distinguish between, and I'll check for myself.

@gamebeaker
Copy link
Collaborator

@Darthagnon Maybe you can change .gitignore to ignore the new files if someone does npm install (plugin/jszip/dist/jszip.min.js and package-lock.json)

@Darthagnon
Copy link
Contributor Author

Darthagnon commented Sep 22, 2024

This

       if (authorPattern.test(href)) {
           return true;
       } else {
           return false;
       }

should be

        return authorPattern.test(href));

This change breaks the parser, results in it being unable to pick up any chapters.

plugin/jszip/dist/jszip.min.js was already there, but commented out; restored.

I'm not too sure what to do about the spacing/lint errors in packed.js... I have pack.js but packed.js does not exist. I haven't touched either file and don't know what tool to use to automatically fix them (maybe JSLint or NPPExec with eslint?)

Some test pages:

I believe the archive.org logic may be needed to account for slight variations in the article selectors over time, but I will keep testing.

@Darthagnon
Copy link
Contributor Author

Hmmm... definitely WIP, I need to do some more work on it.

@gamebeaker
Copy link
Collaborator

@Darthagnon The spacing error message comes from npm run lint this command packages all js files into one file eslint/packed.js and evaluates it/ searches for warnings/ errors. The line from the error message is the line in packed.js as the normal Experimentaltab version has no errors the errors must be in a new file you created changed etc.
You have fixed these errors as the github actions which runs this command had no problems.
image

@gamebeaker
Copy link
Collaborator

An easy test is to change the indentation in main.js
image
now run npm run lint
image
in eslint/packed.js you can see the problem in line 23919
image
revert main.js and run npm run lint there are now no errors and packed.js also changed to reflect the changes.
image

Improves compatibility with 2016 version of site
@gamebeaker
Copy link
Collaborator

image
î guess the problem is here
image
line 18++

Also add TODO to JS. 2024 site and pre-2018 site work and are priority, as they cover all modern stories and older lost chapters. (Ancient MTG articles from pre-2014 not accounted for yet)
@Darthagnon
Copy link
Contributor Author

Ongoing improvements mean the script now deals quite well with both the 2023-2024 version and 2014-2018 version of the website (v0.72, chapter titles now generalised and correctly selected).

@dteviot
Copy link
Owner

dteviot commented Sep 22, 2024

@Darthagnon

return authorPattern.test(href));

D'oh! Copy/paste mistake on my part. Should only be one closing bracket. i.e.

return authorPattern.test(href);

@Darthagnon
Copy link
Contributor Author

@Gamebreaker No idea where packed.js is from, are you sure that isn't your dev build? I only have pack.js
Everything64_240922_148

Everything search for pack [space] .js, which would show up packed.js if it existed. And I haven't touched that file, I have only added MagicWizardsParser.js and edited popup.html, nothing more.

@dteviot
Copy link
Owner

dteviot commented Sep 22, 2024

@Darthagnon

packed.js is created when the build runs and creates the WebToEpub extension. As you're not running the build, you won't see this file on your machine.

I think the lines with the indentation problem are these:

titleElement = link.closest("article")?.querySelector(selector) ||
link.closest(".article-item")?.querySelector(selector) ||
link.closest(".details")?.querySelector(selector);

Should be

            titleElement = link.closest("article")?.querySelector(selector) || 
               link.closest(".article-item")?.querySelector(selector) || 
               link.closest(".details")?.querySelector(selector);

The line following a line ending with a || should be indented 4 more spaces.

@dteviot
Copy link
Owner

dteviot commented Sep 22, 2024

@Darthagnon
Give me 10 minutes, I'll run the build using your file and confirm.

@dteviot
Copy link
Owner

dteviot commented Sep 22, 2024

@Darthagnon

I'm wrong, @gamebeaker is correct. In my defense, it was hard to see the highlighted rows in his screenshot.
The problem is lines 29 to 32 here

async getChapterUrls(dom) {
let chapterLinks = [];
chapterLinks = [...dom.querySelectorAll("article a, .article-content a, window.location.hostname, #content article a, #content .article-content a, .articles-listing .article-item a, .articles-bloc .article .details a")];
// Filter out author links using their URL pattern
chapterLinks = chapterLinks.filter(link => !this.isAuthorLink(link));
return chapterLinks.map(this.linkToChapter);
}

@gamebeaker
Copy link
Collaborator

gamebeaker commented Sep 22, 2024

@Darthagnon here is how you can run lint the first time, you need npm

2024-09-22.22-37-23.mp4

@gamebeaker gamebeaker merged commit 43b4af7 into dteviot:ExperimentalTabMode Sep 25, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants