stylesheet issue - 404 Not Found #281

Open
BonjourGit opened this issue Aug 18, 2021 · 2 comments

BonjourGit commented Aug 18, 2021

For some downloads, a downloaded OEBPS/*.xhtml page viewed in a browser is not styled correctly: the rendering is quite different from the online version, which is more readable.

Looking in OEBPS/Styles/, I found that some downloaded stylesheet files (OEBPS/Styles/Style*.css) have the following content:

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.19.7</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

Could this be an error in the download process, or something else?
Is there anything that can be done to fix it?

Thanks!

lorenzodifuccia added the help wanted, need more info, and partial download labels on Aug 24, 2021

BonjourGit commented Oct 15, 2021

This issue can be easily reproduced by downloading a book, e.g. 9781098122553.

# ./safaribooks.py 9781098122553

 ██████╗     ██████╗ ██╗  ██╗   ██╗██████╗
██╔═══██╗    ██╔══██╗██║  ╚██╗ ██╔╝╚════██╗
██║   ██║    ██████╔╝██║   ╚████╔╝   ▄███╔╝
██║   ██║    ██╔══██╗██║    ╚██╔╝    ▀▀══╝
╚██████╔╝    ██║  ██║███████╗██║     ██╗
 ╚═════╝     ╚═╝  ╚═╝╚══════╝╚═╝     ╚═╝                                        

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[-] Successfully authenticated.
[!] Working on book 9781098122553 [ 1 of 1 ] ...
[*] Retrieving book info...
[-] Title: C++ Crash Course
[-] Authors: Josh Lospinoso
[-] Identifier: 9781098122553
[-] ISBN: 9781593278885
[-] Publishers: No Starch Press
[-] Rights:
[-] Description: Upgrade your Code with C++C++ is one of the most widely used languages for real-world software. In the hands of a knowledgeable programmer, C++ can produce small, efficient, and readable code that any programmer would be proud of.Designed for intermediate to advanced programmers, C++ Crash Course cuts through the weeds to get you straight to the core of C++17, the most modern revision of the ISO standard. Part 1 covers the core of the C++ language, where you’ll learn about everything from types ...
[-] Release Date: 2019-09-24
[-] URL: https://learning.oreilly.com/library/view/c-crash-course/9781098122553/
[*] Output directory:
    /.../safaribooks/Books/C__ Crash Course (9781098122553)
[*] Retrieving book chapters...
[-] Downloading book contents... (38 chapters)
    [####################################################################] 100%
[-] Downloading book CSSs... (2 files)
    [####################################################################] 100%
[-] Downloading book images... (67 files, 45 uniq)
    [####################################################################] 100%
[-] Creating EPUB file...
[*] Done: /.../safaribooks/Books/C__ Crash Course (9781098122553)/9781098122553.epub

    If you like it, please * this project on GitHub to make it known:
        https://github.com/lorenzodifuccia/safaribooks
    e don't forget to renew your Safari Books Online subscription:
        https://learning.oreilly.com

[!] Bye!!
[*] Done: ['9781098122553']

    If you like it, please * this project on GitHub to make it known:
        https://github.com/lorenzodifuccia/safaribooks
    e don't forget to renew your Safari Books Online subscription:
        https://learning.oreilly.com

[!] Bye!!

In the download from Oct 15, 2021, the C__ Crash Course (9781098122553)/OEBPS/Styles output directory contains 2 .css files, but the second one (Style01.css) is invalid: its content is a 404 Not Found page:

# ls -l
total 20
-rw-r--r--. 1 root root 14342 Oct 15 01:26 Style00.css
-rw-r--r--. 1 root root   555 Oct 15 01:26 Style01.css
[root@24234f2a3a16 Styles]# cat Style01.css
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx/1.19.7</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

For comparison, in a previous download of the same book on Nov 4, 2020, the C__ Crash Course (9781098122553)/OEBPS/Styles output directory contains 2 .css files, both of them valid, including Style01.css. A diff of Style00.css shows it is identical to the one in the recent download.

$ ls -l
total 232
-rw-r--r--. 1 root root  14342 Nov  4  2020 Style00.css
-rw-r--r--. 1 root root 220357 Nov  4  2020 Style01.css

The issue appears to affect other book downloads as well.
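To check which existing downloads are affected, something like the following could be run over the Books directory. It is only a sketch: it assumes the `<book>/OEBPS/Styles/*.css` layout shown above, and the function names `looks_like_html_error` and `find_broken_stylesheets` are made up for illustration and are not part of safaribooks.

```python
from pathlib import Path
from typing import List

# Prefixes typical of an HTML page; a real stylesheet should not start with these.
HTML_MARKERS = (b"<html", b"<!doctype html", b"<head")


def looks_like_html_error(data: bytes) -> bool:
    """Return True if the bytes look like an HTML page rather than CSS."""
    head = data[:512].lstrip().lower()
    return head.startswith(HTML_MARKERS)


def find_broken_stylesheets(books_dir: str) -> List[Path]:
    """List .css files under <book>/OEBPS/Styles that actually contain HTML."""
    broken = []
    for css in sorted(Path(books_dir).glob("*/OEBPS/Styles/*.css")):
        if looks_like_html_error(css.read_bytes()):
            broken.append(css)
    return broken


if __name__ == "__main__":
    # "Books" is the output directory name seen in the logs above.
    for path in find_broken_stylesheets("Books"):
        print(path)
```

This only inspects the first bytes of each file, so it would miss an error page that starts with something else.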

Is any other information needed?

Thanks.


0Ky commented Oct 26, 2021

I can confirm that the API lists only 1 CSS file, but this program attempts to download a second one. That request hits a 404 Not Found HTML page, which is then stored as a .css file.

Below are the API endpoints that I assume are used to fetch the CSS file.

Chapters:
https://learning.oreilly.com/api/v2/epub-chapters/?epub_identifier=urn:orm:book:9781098122553

Files:
https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/files/?limit=20&offset=80

{
    "ourn": "urn:orm:book:9781098122553:asset:styles%2f9781593278892.css",
    "url": "https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/files/styles/9781593278892.css",
    "full_path": "styles/9781593278892.css",
    "filename": "9781593278892.css",
    "filename_ext": ".css",
    "media_type": "text/css",
    "has_mathml": false,
    "kind": "stylesheet",
    "created_time": "2020-10-27T13:21:31.244025Z",
    "last_modified_time": "2021-02-11T00:14:31.128271Z",
    "virtual_pages": null,
    "file_size": 14342,
    "epub_archive": "https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/"
}
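A rough way to spot the extra stylesheet is to compare the URLs the crawler finds against the entries the files API marks as `"kind": "stylesheet"`. The sketch below assumes the API entries have already been fetched and parsed into dictionaries with the `url` and `kind` fields shown above. `unlisted_css_urls` is a hypothetical helper, not a safaribooks function, and the crawled URLs are the two reported in the log further down.

```python
from typing import Dict, Iterable, List


def unlisted_css_urls(api_files: Iterable[Dict], crawled_urls: Iterable[str]) -> List[str]:
    """Return crawled CSS URLs that the files API does not list as stylesheets."""
    listed = {f["url"] for f in api_files if f.get("kind") == "stylesheet"}
    return [url for url in crawled_urls if url not in listed]


# Trimmed copy of the API entry above (only the fields used here).
api_files = [
    {
        "url": "https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/files/styles/9781593278892.css",
        "kind": "stylesheet",
    }
]

# The two CSS URLs the crawler reports in the log.
crawled = [
    "https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/files/styles/9781593278892.css",
    "https://learning.oreilly.com/static/CACHE/css/output.68851547a55f.css",
]

for url in unlisted_css_urls(api_files, crawled):
    print("Not listed by the files API:", url)
```

With this input, only the static/CACHE URL is flagged. The files endpoint is paginated (`limit`, `offset`), so a real check would need to collect every page first.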

Log File

Running the program with the --preserve-log option outputs the following information:

$ cat info_9781098122553.log 
[26/Oct/2021 09:48:48] ** Welcome to SafariBooks! **
[26/Oct/2021 09:48:49] Successfully authenticated.
[26/Oct/2021 09:48:49] Retrieving book info...
[26/Oct/2021 09:48:50] Title: C++ Crash Course
[26/Oct/2021 09:48:50] Authors: Josh Lospinoso
[26/Oct/2021 09:48:50] Identifier: 9781098122553
[26/Oct/2021 09:48:50] ISBN: 9781593278885
[26/Oct/2021 09:48:50] Publishers: No Starch Press
[26/Oct/2021 09:48:50] Rights: 
[26/Oct/2021 09:48:50] Description: Upgrade your Code with C++C++ is one of the most widely used languages for real-world software. In the hands of a knowledgeable programmer, C++ can produce small, efficient, and readable code that any programmer would be proud of.Designed for intermediate to advanced programmers, C++ Crash Course cuts through the weeds to get you straight to the core of C++17, the most modern revision of the ISO standard. Part 1 covers the core of the C++ language, where you’ll learn about everything from types ...
[26/Oct/2021 09:48:50] Release Date: 2019-09-24
[26/Oct/2021 09:48:50] URL: https://learning.oreilly.com/library/view/c-crash-course/9781098122553/
[26/Oct/2021 09:48:50] Retrieving book chapters...
[26/Oct/2021 09:48:52] Output directory:
    /home/user/safaribooks/Books/C__ Crash Course (9781098122553)
[26/Oct/2021 09:48:52] Downloading book contents... (38 chapters)
[26/Oct/2021 09:48:53] Crawler: found a new CSS at https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781098122553/files/styles/9781593278892.css
[26/Oct/2021 09:48:53] Crawler: found a new CSS at https://learning.oreilly.com/static/CACHE/css/output.68851547a55f.css
[26/Oct/2021 09:48:53] Created: cover.xhtml
[26/Oct/2021 09:48:54] Created: title.xhtml
[26/Oct/2021 09:48:55] Created: copy.xhtml
[26/Oct/2021 09:48:57] Created: fm01.xhtml
[26/Oct/2021 09:48:58] Created: author.xhtml
[26/Oct/2021 09:48:59] Created: tech.xhtml
[26/Oct/2021 09:49:00] Created: toc01.xhtml
[26/Oct/2021 09:49:01] Created: toc.xhtml
[26/Oct/2021 09:49:03] Created: foreword.xhtml
[26/Oct/2021 09:49:03] Created: ack.xhtml
[26/Oct/2021 09:49:04] Created: intro.xhtml
[26/Oct/2021 09:49:05] Created: anoverture.xhtml
[26/Oct/2021 09:49:06] Created: part01.xhtml
[26/Oct/2021 09:49:07] Created: ch01.xhtml
[26/Oct/2021 09:49:08] Created: ch02.xhtml
[26/Oct/2021 09:49:09] Created: ch03.xhtml
[26/Oct/2021 09:49:10] Created: ch04.xhtml
[26/Oct/2021 09:49:11] Created: ch05.xhtml
[26/Oct/2021 09:49:11] Created: ch06.xhtml
[26/Oct/2021 09:49:13] Created: ch07.xhtml
[26/Oct/2021 09:49:13] Created: ch08.xhtml
[26/Oct/2021 09:49:14] Created: ch09.xhtml
[26/Oct/2021 09:49:15] Created: part02.xhtml
[26/Oct/2021 09:49:16] Created: ch10.xhtml
[26/Oct/2021 09:49:17] Created: ch11.xhtml
[26/Oct/2021 09:49:18] Created: ch12.xhtml
[26/Oct/2021 09:49:19] Created: ch13.xhtml
[26/Oct/2021 09:49:20] Created: ch14.xhtml
[26/Oct/2021 09:49:21] Created: ch15.xhtml
[26/Oct/2021 09:49:22] Created: ch16.xhtml
[26/Oct/2021 09:49:23] Created: ch17.xhtml
[26/Oct/2021 09:49:25] Created: ch18.xhtml
[26/Oct/2021 09:49:26] Created: ch19.xhtml
[26/Oct/2021 09:49:27] Created: ch20.xhtml
[26/Oct/2021 09:49:27] Created: ch21.xhtml
[26/Oct/2021 09:49:28] Created: index.xhtml
[26/Oct/2021 09:49:29] Created: resource.xhtml
[26/Oct/2021 09:49:30] Created: bm01.xhtml
[26/Oct/2021 09:49:30] Downloading book CSSs... (2 files)
[26/Oct/2021 09:49:31] Downloading book images... (67 files)
[26/Oct/2021 09:50:40] Creating EPUB file...
[26/Oct/2021 09:50:41] Done: /home/user/safaribooks/Books/C__ Crash Course (9781098122553)/9781098122553.epub

The Problem

The crawler fetches the following CSS file, which doesn't exist: https://learning.oreilly.com/static/CACHE/css/output.68851547a55f.css

Possible Solution

Don't parse a response that has an HTTP status code of 404.
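A minimal sketch of that check, assuming the download code holds a response object exposing `status_code`, `headers`, and `content` (the attribute names `requests.Response` uses). `css_body_or_none` and `FakeResponse` are hypothetical names used only for illustration, and I have not checked how safaribooks.py actually structures its CSS download.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response so the sketch runs without network access."""
    status_code: int
    content: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def css_body_or_none(response) -> Optional[bytes]:
    """Return the stylesheet bytes, or None if the response should not be saved."""
    # Anything other than 200 (e.g. the 404 page above) is not a stylesheet.
    if response.status_code != 200:
        return None
    # If the server declares a content type, require it to be CSS.
    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/css" not in content_type.lower():
        return None
    return response.content


ok = FakeResponse(200, b"p { margin: 0; }", {"Content-Type": "text/css"})
missing = FakeResponse(404, b"<html>...</html>", {"Content-Type": "text/html"})

for resp in (ok, missing):
    body = css_body_or_none(resp)
    print("save" if body is not None else "skip", resp.status_code)
```

Skipping the file leaves the .xhtml pages pointing at a stylesheet that is not there, so the caller would probably also want to drop the corresponding link or at least log a warning.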

My Environment:

I'm running the latest commit (e016ad3) as of this post.

  • OS: Ubuntu 21.10 x86_64
  • Kernel: 5.13.0-20-generic
  • Shell: bash 5.1.8
  • Node: v12.22.5
  • npm: v8.1.1
  • Python3: v3.9.7

lorenzodifuccia added the wontfix label and removed the help wanted, need more info, and partial download labels on Oct 30, 2024