From 3ab3a602ac93ac5f1b169f692e16c5b2e8361f10 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 11 Dec 2017 18:50:45 +0100 Subject: [PATCH 001/100] Initial commit --- README.md | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000..e28fc24 --- /dev/null +++ b/README.md @@ -0,0 +1,2 @@ +# safaribooks +Download and read in EPUB your favorites books from Safari Books Online. From e89b88b068ea39aa42ffb13250ad43a47e56c52a Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 11 Dec 2017 18:51:21 +0100 Subject: [PATCH 002/100] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e28fc24..c7ad817 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,2 @@ -# safaribooks +# SafariBooks Download and read in EPUB your favorites books from Safari Books Online. From a970ad8ead5926a59771ce56ec6f1b0880fb8e84 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 11 Dec 2017 19:22:12 +0100 Subject: [PATCH 003/100] Update README.md --- README.md | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/README.md b/README.md index c7ad817..710cb60 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,35 @@ # SafariBooks +Download and read in EPUB your favorites books from [Safari Books Online](https://www.safaribooksonline.com). + +## Usage: +```bash +~$ python3 safaribooks.py --cred "account_mail@mail.com:password01" XXXXXXXXXXXXX +``` +The book ID (the X-es) are the digits that you can find in the URL. +Ex: `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/ch01.html` + +The first time you use the program, you have to specify your SafariBooksOnline account credentials. +Next times you want to download a book, before session expires, you can omit the credential because the program save your session cookies in a file called `cookies.json`. 
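The README above explains that the book ID is the digits segment of the library URL. Programmatically, it can be sliced out of such a URL; a small illustrative helper (hypothetical, not part of `safaribooks.py`):

```python
from urllib.parse import urlsplit

def safari_book_id(url):
    """Return the digits segment from a Safari Books Online library URL.

    Expects URLs shaped like:
    https://www.safaribooksonline.com/library/view/<book-name>/<ID>/...
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    # parts -> ["library", "view", "book-name", "XXXXXXXXXXXXX", ...]
    if len(parts) >= 4 and parts[0] == "library" and parts[3].isdigit():
        return parts[3]
    raise ValueError("not a Safari Books Online book URL: %s" % url)

print(safari_book_id(
    "https://www.safaribooksonline.com/library/view/book-name/9781491958698/ch01.html"
))  # → 9781491958698
```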
+Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. +If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. + +#### List of program option: +```text +usage: safaribooks.py [--cred ] [--no-cookies] [--preserve-log] [--help] + Download and read in EPUB your favorites books from Safari Books Online. + +positional arguments: + Book digits ID that you want to download. You can find it in the URL (X-es): + `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/cover.html` + +optional arguments: + --cred Credentials used to perform the login on SafariBooksOnline. + Es. ` --cred "account_mail@mail.com:password01" `. + --no-cookies Prevent your session data to be saved into `cookies.json` file. + --preserve-log Leave the `info.log` file even if there isn't any error. + --help Show this help message. +``` + +## Example: +\# TODO From 9179364a864b01cc1c72810625275be4e2e71c6a Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 12:18:03 +0100 Subject: [PATCH 004/100] Update README.md --- README.md | 117 +++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 98 insertions(+), 19 deletions(-) diff --git a/README.md b/README.md index 710cb60..8eedf14 100644 --- a/README.md +++ b/README.md @@ -1,35 +1,114 @@ # SafariBooks -Download and read in EPUB your favorites books from [Safari Books Online](https://www.safaribooksonline.com). +Download and generate an EPUB of your favorite books from [Safari Books Online](https://www.safaribooksonline.com) library. +Use this program only for personal and/or educational purpose. +## Requirements & setup: +```shell +$ git clone https://github.com/lorenzodifuccia/safaribooks.git +Cloning into 'safaribooks'... 
+ +$ cd safaribooks/ +$ pip3 install -r requirements.txt +``` + +The program depends of only two Python 3 modules: +```python3 +lxml>=4.1.1 +requests>=2.18.4 +``` + ## Usage: -```bash -~$ python3 safaribooks.py --cred "account_mail@mail.com:password01" XXXXXXXXXXXXX +It's really simple to use, just choose a book from the library and replace in the following command: + * X-es with its ID, + * `email:password` with your own. + +```shell +$ python3 safaribooks.py --cred "account_mail@mail.com:password01" XXXXXXXXXXXXX ``` -The book ID (the X-es) are the digits that you can find in the URL. -Ex: `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/ch01.html` + +The ID are the digits that you can find in the URL of the book description page: +`https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/` +Like: `https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/` + +The first time you'll use the program, you'll have to specify your Safari Books Online account credentials. +For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json`. -The first time you use the program, you have to specify your SafariBooksOnline account credentials. -Next times you want to download a book, before session expires, you can omit the credential because the program save your session cookies in a file called `cookies.json`. Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. -If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. +If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. 
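The cookie caching described above amounts to serialising the session cookies to JSON and restoring them on the next run, skipping the login. A minimal sketch of that round trip with `requests` (illustrative helpers, not the program's own code; the file name `cookies.json` matches the README):

```python
import json
import requests

def save_cookies(session, path="cookies.json"):
    # Persist the session cookies as a plain name -> value mapping.
    with open(path, "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)

def load_cookies(session, path="cookies.json"):
    # Restore a previously saved session without logging in again.
    with open(path) as f:
        session.cookies.update(json.load(f))
```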
-#### List of program option: -```text -usage: safaribooks.py [--cred ] [--no-cookies] [--preserve-log] [--help] +### Program options: +```shell +$ python3 safaribooks.py --help +usage: safaribooks.py [--cred ] [--no-cookies] [--no-kindle] + [--preserve-log] [--help] + -Download and read in EPUB your favorites books from Safari Books Online. +Download and generate an EPUB of your favorite books from Safari Books Online. positional arguments: - Book digits ID that you want to download. You can find it in the URL (X-es): - `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/cover.html` + Book digits ID that you want to download. + You can find it in the URL (X-es): + `https://www.safaribooksonline.com/library/view/book- + name/XXXXXXXXXXXXX/` optional arguments: - --cred Credentials used to perform the login on SafariBooksOnline. + --cred Credentials used to perform the auth login on Safari + Books Online. Es. ` --cred "account_mail@mail.com:password01" `. - --no-cookies Prevent your session data to be saved into `cookies.json` file. - --preserve-log Leave the `info.log` file even if there isn't any error. + --no-cookies Prevent your session data to be saved into + `cookies.json` file. + --no-kindle Remove some CSS rules that block overflow on `table` + and `pre` elements. Use this option if you're not going + to export the EPUB to E-Readers like Amazon Kindle. + --preserve-log Leave the `info.log` file even if there isn't any + error. --help Show this help message. 
``` -## Example: -\# TODO + * ## Example: [Test-Driven Development with Python, 2nd Edition](https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/) + ```shell + $ python3 safaribooks.py --cred "XXXX@gmail.com:XXXXX" 9781491958698 + + ____ ___ _ + / __/__ _/ _/__ _____(_) + _\ \/ _ `/ _/ _ `/ __/ / + /___/\_,_/_/ \_,_/_/ /_/ + / _ )___ ___ / /__ ___ + / _ / _ \/ _ \/ '_/(_-< + /____/\___/\___/_/\_\/___/ + + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + [-] Logging into Safari Books Online... + [-] Title: Test-Driven Development with Python, 2nd Edition + [-] Authors: Harry J.W. Percival + [-] Identifier: 9781491958698 + [-] ISBN: 9781491958704 + [-] Publishers: O'Reilly Media, Inc. + [-] Rights: Copyright © O'Reilly Media, Inc. + [-] Description: By taking you through the development of a real web application from beginning to end, the second edition of this hands-on guide demonstrates the practical advantages of test-driven development (TDD) with Python. You’ll learn how to write and run tests before building each part of your app, and then develop the minimum amount of code required to pass those tests. The result? Clean code that works.In the process, you’ll learn the basics of Django, Selenium, Git, jQuery, and Mock, along with curre... + [-] URL: https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/ + [*] Found 73 chapters! + [*] Output directory: + /XXXX/XXXX/Test-Driven Development with Python, 2nd Edition + [-] Downloading book contents... + [#########################################----------------------------] 60% + ... + [-] Creating EPUB file... + [*] Done: Test-Driven Development with Python, 2nd Edition.epub + + If you like it, please * this project on GitHub to make it known: + https://github.com/lorenzodifuccia/safaribooks + e don't forget to renew your Safari Books Online subscription: + https://www.safaribooksonline.com/signup/ + + [!] Bye!! 
+ ``` + The result will be (opening the EPUB file with [Calibre](https://calibre-ebook.com/)): + + ![Book Appearance](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example01_TDD.png "Book opened with Calibre") + + * ## Example: `--no-kindle` option + ```bash + $ python3 safaribooks.py --no-kindle 9781491958698 + ``` + ![NoKindle Option](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example02_NoKindle.png "Version comparison") From 662e0020ee9515ff4dfc3cc27d6e4ed279c7c980 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 13:23:57 +0100 Subject: [PATCH 005/100] Update README.md --- README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8eedf14..c571ee2 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,11 @@ For the next times you'll download a book, before session expires, you can omit Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. +The program default options are thought for ensure best compatibilities for who want to export the `EPUB` to E-Readers like Amazon Kindle. +If you want to do it, I suggest you to convert the `EPUB` to `AZW3` file with [Calibre](https://calibre-ebook.com/). +You can also convert the book to `MOBI` and if you'll convert it with Calibre be sure to select the `Ignore margins`: +![Calibre IgnoreMargins](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_calibre_IgnoreMargins.png "Select Ignore margins") + ### Program options: ```shell $ python3 safaribooks.py --help @@ -103,7 +108,7 @@ optional arguments: [!] Bye!! 
``` - The result will be (opening the EPUB file with [Calibre](https://calibre-ebook.com/)): + The result will be (opening the `EPUB` file with Calibre): ![Book Appearance](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example01_TDD.png "Book opened with Calibre") From 614311653e9dad9a7ea0b499da9416442ff6f7fb Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 13:26:37 +0100 Subject: [PATCH 006/100] Update README.md --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index c571ee2..7682824 100644 --- a/README.md +++ b/README.md @@ -38,9 +38,10 @@ If you don't want to cache the cookies, just use the `--no-cookies` option and p The program default options are thought for ensure best compatibilities for who want to export the `EPUB` to E-Readers like Amazon Kindle. If you want to do it, I suggest you to convert the `EPUB` to `AZW3` file with [Calibre](https://calibre-ebook.com/). 
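Calibre also ships a command-line converter, `ebook-convert`, so the EPUB-to-AZW3 step can be scripted. A sketch that only builds the invocation (the file name is a placeholder; actually running it requires Calibre on the `PATH`):

```python
import os
import shutil
import subprocess

def build_convert_cmd(epub_path):
    """Build the Calibre `ebook-convert` invocation for an EPUB -> AZW3 conversion."""
    return ["ebook-convert", epub_path, epub_path.rsplit(".", 1)[0] + ".azw3"]

# `book.epub` is a placeholder name, not produced by this snippet.
cmd = build_convert_cmd("book.epub")
if shutil.which("ebook-convert") and os.path.exists("book.epub"):
    subprocess.run(cmd, check=True)  # only runs when Calibre and the file exist
else:
    print("Would run:", " ".join(cmd))
```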
-You can also convert the book to `MOBI` and if you'll convert it with Calibre be sure to select the `Ignore margins`: +You can also convert the book to `MOBI` and if you'll convert it with Calibre be sure to select `Ignore margins`: + ![Calibre IgnoreMargins](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_calibre_IgnoreMargins.png "Select Ignore margins") - + ### Program options: ```shell $ python3 safaribooks.py --help From aa56ce4d4d4d3fe7b2192da6c1878728c90b5278 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 13:27:27 +0100 Subject: [PATCH 007/100] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7682824..7526050 100644 --- a/README.md +++ b/README.md @@ -38,7 +38,7 @@ If you don't want to cache the cookies, just use the `--no-cookies` option and p The program default options are thought for ensure best compatibilities for who want to export the `EPUB` to E-Readers like Amazon Kindle. If you want to do it, I suggest you to convert the `EPUB` to `AZW3` file with [Calibre](https://calibre-ebook.com/). 
-You can also convert the book to `MOBI` and if you'll convert it with Calibre be sure to select `Ignore margins`: +You can also convert the book to `MOBI` and if you'll do it with Calibre be sure to select `Ignore margins`: ![Calibre IgnoreMargins](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_calibre_IgnoreMargins.png "Select Ignore margins") From 2e82d2edaa2432c9525b60fbfea1a2d9692f0712 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 15:14:52 +0100 Subject: [PATCH 008/100] Update README.md --- README.md | 38 +++++++++++++++++++++++++++----------- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 7526050..b8da7f7 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,13 @@ # SafariBooks -Download and generate an EPUB of your favorite books from [Safari Books Online](https://www.safaribooksonline.com) library. -Use this program only for personal and/or educational purpose. +Download and generate an *EPUB* of your favorite books from [*Safari Books Online*](https://www.safaribooksonline.com) library. +Use this program only for *personal* and/or *educational* purpose. -## Requirements & setup: +## Overview: + * [Requirements & Setup]() + * [Usage]() + * [Examples]() + +## Requirements & Setup: ```shell $ git clone https://github.com/lorenzodifuccia/safaribooks.git Cloning into 'safaribooks'... @@ -11,7 +16,7 @@ $ cd safaribooks/ $ pip3 install -r requirements.txt ``` -The program depends of only two Python 3 modules: +The program depends of only two **Python 3** modules: ```python3 lxml>=4.1.1 requests>=2.18.4 @@ -20,13 +25,13 @@ requests>=2.18.4 ## Usage: It's really simple to use, just choose a book from the library and replace in the following command: * X-es with its ID, - * `email:password` with your own. + * `email:password` with your own. 
```shell $ python3 safaribooks.py --cred "account_mail@mail.com:password01" XXXXXXXXXXXXX ``` -The ID are the digits that you can find in the URL of the book description page: +The ID are the digits that you find in the URL of the book description page: `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/` Like: `https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/` @@ -37,12 +42,14 @@ Pay attention if you use a shared PC, because everyone that has access to your f If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. The program default options are thought for ensure best compatibilities for who want to export the `EPUB` to E-Readers like Amazon Kindle. -If you want to do it, I suggest you to convert the `EPUB` to `AZW3` file with [Calibre](https://calibre-ebook.com/). -You can also convert the book to `MOBI` and if you'll do it with Calibre be sure to select `Ignore margins`: +If you want to do it, I suggest you to convert the `EPUB` to `AZW3` with [Calibre](https://calibre-ebook.com/). +You can also convert the book to `MOBI` and if you'll do it with Calibre be sure to select `Ignore margins` in the conversion options: ![Calibre IgnoreMargins](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_calibre_IgnoreMargins.png "Select Ignore margins") + +In the other hand, if you're not going to export the `EPUB`, you can use the `--no-kindle` option to remove the CSS that blocks overflow on `table` and `pre` elements, see below in the examples. -### Program options: +#### Program options: ```shell $ python3 safaribooks.py --help usage: safaribooks.py [--cred ] [--no-cookies] [--no-kindle] @@ -71,7 +78,8 @@ optional arguments: --help Show this help message. 
``` - * ## Example: [Test-Driven Development with Python, 2nd Edition](https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/) +## Examples: + * ## Download [Test-Driven Development with Python, 2nd Edition](https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/): ```shell $ python3 safaribooks.py --cred "XXXX@gmail.com:XXXXX" 9781491958698 @@ -113,8 +121,16 @@ optional arguments: ![Book Appearance](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example01_TDD.png "Book opened with Calibre") - * ## Example: `--no-kindle` option + * ## Use or not the `--no-kindle` option: ```bash $ python3 safaribooks.py --no-kindle 9781491958698 ``` + ![NoKindle Option](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example02_NoKindle.png "Version comparison") + +--- + +## Thanks!! +For any kind of problem, please don't hesitate to open an issue here on *GitHub*. + +*Lorenzo Di Fuccia* From f9f6686ffadffd45dfe318a686177569b47da2d4 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 15:16:12 +0100 Subject: [PATCH 009/100] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index b8da7f7..194a2d2 100644 --- a/README.md +++ b/README.md @@ -3,9 +3,9 @@ Download and generate an *EPUB* of your favorite books from [*Safari Books Onlin Use this program only for *personal* and/or *educational* purpose. 
## Overview: - * [Requirements & Setup]() - * [Usage]() - * [Examples]() + * [Requirements & Setup](#requirements--setup) + * [Usage](#usage) + * [Examples](#examples) ## Requirements & Setup: ```shell From d27d8227a4b4686961611d792712666e30ada021 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 15:19:36 +0100 Subject: [PATCH 010/100] Update README.md --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 194a2d2..8128ecf 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,8 @@ Use this program only for *personal* and/or *educational* purpose. ## Overview: * [Requirements & Setup](#requirements--setup) * [Usage](#usage) - * [Examples](#examples) + * [Example: Download *Test-Driven Development with Python, 2nd Edition*](#download-test-driven-development-with-python-2nd-edition) + * [Example: Use or not the `--no-kindle` option](#use-or-not-the---no-kindle-option) ## Requirements & Setup: ```shell From 5145fa293e0e17f729d9593409982a8068f7e037 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 16:12:08 +0100 Subject: [PATCH 011/100] Update README.md --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 8128ecf..a2bb2aa 100644 --- a/README.md +++ b/README.md @@ -126,8 +126,9 @@ optional arguments: ```bash $ python3 safaribooks.py --no-kindle 9781491958698 ``` + On the left book created with `--no-kindle` option, on the right without (default): - ![NoKindle Option](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example02_NoKindle.png "Version comparison") + ![NoKindle Option](https://github.com/lorenzodifuccia/cloudflare/raw/master/Images/safaribooks/safaribooks_example02_NoKindle.png "Version compare") --- From ac297007a1788990eead8c393892eaa5a6826907 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 16:15:05 +0100 Subject: [PATCH 012/100] First release --- 
requirements.txt | 2 + safaribooks.py | 831 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 833 insertions(+) create mode 100644 requirements.txt create mode 100644 safaribooks.py diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..6964772 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,2 @@ +lxml>=4.1.1 +requests>=2.18.4 diff --git a/safaribooks.py b/safaribooks.py new file mode 100644 index 0000000..11b35e2 --- /dev/null +++ b/safaribooks.py @@ -0,0 +1,831 @@ +import os +import sys +import json +import shutil +import logging +import argparse +import requests +from lxml import html +from html import escape +from random import random +from urllib.parse import urljoin, urlsplit +from multiprocessing import Process, Queue, Value + + +PATH = os.path.dirname(os.path.realpath(__file__)) +COOKIES_FILE = os.path.join(PATH, "cookies.json") + + +class Display: + BASE_FORMAT = logging.Formatter( + fmt="[%(asctime)s] %(message)s", + datefmt="%d/%b/%Y %H:%M:%S" + ) + + SH_DEFAULT = "\033[0m" + SH_YELLOW = "\033[33m" + SH_BG_RED = "\033[41m" + SH_BG_YELLOW = "\033[43m" + + def __init__(self): + self.columns, _ = shutil.get_terminal_size() + + self.logger = logging.getLogger("SafariBooks") + self.logger.setLevel(logging.INFO) + logs_handler = logging.FileHandler(filename=os.path.join(PATH, "info.log")) + logs_handler.setFormatter(self.BASE_FORMAT) + logs_handler.setLevel(logging.INFO) + self.logger.addHandler(logs_handler) + + self.logger.info("** Welcome to SafariBooks! 
**") + + self.book_ad_info = False + self.css_ad_info = Value("i", 0) + self.images_ad_info = Value("i", 0) + self.in_error = False + + self.state_status = Value("i", 0) + sys.excepthook = self.unhandled_exception + + def log(self, message): + self.logger.info(str(message)) + + def out(self, put): + sys.stdout.write("\r" + " " * self.columns + "\r" + put + "\n") + + def info(self, message, state=False): + self.log(message) + output = (self.SH_YELLOW + "[*]" + self.SH_DEFAULT if not state else + self.SH_BG_YELLOW + "[-]" + self.SH_DEFAULT) + " %s" % message + self.out(output) + + def error(self, error): + if not self.in_error: + self.in_error = True + + self.log(error) + output = self.SH_BG_RED + "[#]" + self.SH_DEFAULT + " %s" % error + self.out(output) + + def exit(self, error): + self.error(str(error)) + output = (self.SH_YELLOW + "[+]" + self.SH_DEFAULT + + " Please delete all the `/OEBPS/*.xhtml`" + " files and restart the program.") + self.out(output) + + output = self.SH_BG_RED + "[!]" + self.SH_DEFAULT + " Aborting..." 
+ self.out(output) + sys.exit(128) + + def unhandled_exception(self, _, o, __): + self.exit("Unhandled Exception: %s (type: %s)" % (o, o.__class__.__name__)) + + def intro(self): + output = self.SH_YELLOW + """ + ____ ___ _ + / __/__ _/ _/__ _____(_) + _\ \/ _ `/ _/ _ `/ __/ / + /___/\_,_/_/ \_,_/_/ /_/ + / _ )___ ___ / /__ ___ + / _ / _ \/ _ \/ '_/(_-< + /____/\___/\___/_/\_\/___/ +""" + self.SH_DEFAULT + output += "\n" + "~" * (self.columns // 2) + self.out(output) + + def parse_description(self, desc): + try: + return html.fromstring(desc).text_content() + + except (html.etree.ParseError, html.etree.ParserError) as e: + self.log("Error parsing the description: %s" % e) + return "n/d" + + def book_info(self, info): + description = self.parse_description(info["description"]).replace("\n", " ") + for t in [ + ("Title", info["title"]), ("Authors", ", ".join(aut["name"] for aut in info["authors"])), + ("Identifier", info["identifier"]), ("ISBN", info["isbn"]), + ("Publishers", ", ".join(pub["name"] for pub in info["publishers"])), + ("Rights", info["rights"]), + ("Description", description[:500] + "..." 
if len(description) >= 500 else description),
+            ("URL", info["web_url"])
+        ]:
+            self.info("{0}: {1}".format(t[0], t[1]), True)
+
+    def state(self, origin, done):
+        progress = int(done * 100 / origin)
+        bar = int(progress * (self.columns - 11) / 100)
+        if self.state_status.value < progress:
+            self.state_status.value = progress
+            sys.stdout.write(
+                "\r    " + self.SH_BG_YELLOW + "[" + ("#" * bar).ljust(self.columns - 11, "-") + "]" +
+                self.SH_DEFAULT + ("%4s" % progress) + "%" + ("\n" if progress == 100 else "")
+            )
+
+    def done(self, epub_file):
+        self.info("Done: %s\n\n"
+                  "    If you like it, please * this project on GitHub to make it known:\n"
+                  "    https://github.com/lorenzodifuccia/safaribooks\n"
+                  "    and don't forget to renew your Safari Books Online subscription:\n"
+                  "    https://www.safaribooksonline.com/signup/\n\n" % epub_file +
+                  self.SH_BG_RED + "[!]" + self.SH_DEFAULT + " Bye!!")
+
+    @staticmethod
+    def api_error(response):
+        message = "API: "
+        if "detail" in response and "Not found" in response["detail"]:
+            message += "book not present in Safari Books Online.\n" \
+                       "    The book identifier is the digits that you can find in the URL:\n" \
+                       "    `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`"
+
+        else:
+            os.remove(COOKIES_FILE)
+            message += ("Out-of-Session%s.\n" % (" (%s)" % response["detail"] if "detail" in response else "")) + \
+                       Display.SH_YELLOW + "[+]" + Display.SH_DEFAULT + \
+                       " Use the `--cred` option in order to perform the auth login to Safari Books Online."
+ + return message + + +class SafariBooks: + + HEADERS = { + "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", + "accept-encoding": "gzip, deflate, br", + "accept-language": "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7", + "cache-control": "no-cache", + "cookie": "", + "pragma": "no-cache", + "referer": "https://www.safaribooksonline.com/home/", + "upgrade-insecure-requests": "1", + "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " + "Chrome/62.0.3202.94 Safari/537.36" + } + + BASE_URL = "https://www.safaribooksonline.com" + LOGIN_URL = BASE_URL + "/accounts/login/" + API_TEMPLATE = BASE_URL + "/api/v1/book/{0}/" + + BASE_01_HTML = "\n" \ + "\n" \ + "\n" \ + "{0}\n" \ + "" \ + "\n" \ + "{1}\n" + + CONTAINER_XML = "" \ + "" \ + "" \ + "" \ + "" \ + "" + + # Format: ID, Title, Authors, Description, Subjects, Publisher, Rights, CoverId, MANIFEST, SPINE, CoverUrl + CONTENT_OPF = "\n" \ + "\n" \ + "\n"\ + "{1}\n" \ + "{2}\n" \ + "{3}\n" \ + "{4}" \ + "{5}\n" \ + "{6}\n" \ + "en-US\n" \ + "{0}\n" \ + "\n" \ + "\n" \ + "\n" \ + "\n" \ + "{8}\n" \ + "\n" \ + "\n{9}\n" \ + "\n" \ + "" + + # Format: ID, Depth, Title, Author, NAVMAP + TOC_NCX = "" \ + "" \ + "" \ + "" \ + "" \ + "" \ + "" \ + "" \ + "" \ + "{2}" \ + "{3}" \ + "{4}" \ + "" + + def __init__(self, args): + self.args = args + self.display = Display() + self.display.intro() + + self.cookies = {} + + if not args.cred: + if not os.path.isfile(COOKIES_FILE): + self.display.exit("Login: unable to find cookies file.\n" + " Please use the --cred option to perform the login.") + + self.cookies = json.load(open(COOKIES_FILE)) + + else: + self.display.info("Logging into Safari Books Online...", state=True) + self.do_login(*[c.replace("'", "").replace('"', "") for c in args.cred]) + if not args.no_cookies: + json.dump(self.cookies, open(COOKIES_FILE, "w")) + + self.book_id = args.bookid + self.api_url = self.API_TEMPLATE.format(self.book_id) + 
self.book_info = self.get_book_info() + self.display.book_info(self.book_info) + self.book_chapters = self.get_book_chapters() + self.display.info("Found %s chapters!" % len(self.book_chapters)) + self.chapters_queue = self.book_chapters[:] + + self.book_title = self.book_info["title"] + self.base_url = self.book_info["web_url"] + + self.BOOK_PATH = os.path.join(PATH, self.book_title) + self.create_dirs() + self.display.info("Output directory:\n %s" % self.BOOK_PATH) + + self.chapter_title = "" + self.filename = "" + self.css = [] + self.images = [] + + self.display.info("Downloading book contents...", state=True) + self.BASE_HTML = self.BASE_01_HTML + (self.KINDLE_HTML if not args.no_kindle else "") + self.BASE_02_HTML + self.get() + + self.css_path = "" + self.images_path = "" + self.cover = "" + + self.display.info("Downloading book CSSs...", state=True) + self.collect_css() + self.display.info("Downloading book images...", state=True) + self.collect_images() + + self.display.info("Creating EPUB file...", state=True) + self.create_epub() + + if not args.no_cookies: + json.dump(self.cookies, open(COOKIES_FILE, "w")) + + self.display.done(self.book_title + ".epub") + + if not self.display.in_error and not args.log: + os.remove(os.path.join(PATH, "info.log")) + + sys.exit(0) + + def return_cookies(self): + return " ".join(["{0}={1};".format(k, v) for k, v in self.cookies.items()]) + + def return_headers(self, url): + if "safaribooksonline" in urlsplit(url).netloc: + self.HEADERS["cookie"] = self.return_cookies() + + else: + self.HEADERS["cookie"] = "" + + return self.HEADERS + + def update_cookies(self, jar): + for cookie in jar: + self.cookies.update({ + cookie.name: cookie.value + }) + + def requests_provider(self, url, post=False, data=None, update_cookies=True, **kwargs): + try: + response = getattr(requests, "post" if post else "get")( + url, + headers=self.return_headers(url), + data=data, + **kwargs + ) + + except (requests.ConnectionError, 
requests.ConnectTimeout, requests.RequestException) as request_exception: + self.display.error(str(request_exception)) + return 0 + + if update_cookies: + self.update_cookies(response.cookies) + + return response + + def do_login(self, email, password): + response = self.requests_provider(self.BASE_URL) + if response == 0: + self.display.exit("Login: unable to reach Safari Books Online. Try again...") + + csrf = [] + try: + csrf = html.fromstring(response.text).xpath("//input[@name='csrfmiddlewaretoken'][@value]") + + except (html.etree.ParseError, html.etree.ParserError) as parsing_error: + self.display.error(parsing_error) + self.display.exit( + "Login: error trying to parse the home of Safari Books Online." + ) + + if not len(csrf): + self.display.exit("Login: no CSRF Token found in the page." + " Unable to continue the login." + " Try again...") + + csrf = csrf[0].attrib["value"] + response = self.requests_provider( + self.LOGIN_URL, + post=True, + data=( + ("csrfmiddlewaretoken", ""), ("csrfmiddlewaretoken", csrf), + ("email", email), ("password1", password), + ("is_login_form", "true"), ("leaveblank", ""), + ("dontchange", "http://") + ), + allow_redirects=False + ) + + if response == 0: + self.display.exit("Login: unable to perform auth to Safari Books Online.\n Try again...") + + if response.status_code != 302: + try: + error_page = html.fromstring(response.text) + errors_message = error_page.xpath("//ul[@class='errorlist']//li/text()") + recaptcha = error_page.xpath("//div[@class='g-recaptcha']") + messages = ([" `%s`" % error for error in errors_message + if "password" in error or "email" in error] if len(errors_message) else []) +\ + ([" `ReCaptcha required (wait or do logout from the website).`"] if len(recaptcha) else[]) + self.display.exit("Login: unable to perform auth login to Safari Books Online.\n" + + self.display.SH_YELLOW + "[*]" + self.display.SH_DEFAULT + " Details:\n" + "%s" % "\n".join(messages if len(messages) else [" Unexpected error!"])) 
+ except (html.etree.ParseError, html.etree.ParserError) as parsing_error: + self.display.error(parsing_error) + self.display.exit( + "Login: your login went wrong and it encountered in an error" + " trying to parse the login details of Safari Books Online. Try again..." + ) + + def get_book_info(self): + response = self.requests_provider(self.api_url) + if response == 0: + self.display.exit("API: unable to retrieve book info.") + + response = response.json() + if not isinstance(response, dict) or len(response.keys()) == 1: + self.display.exit(self.display.api_error(response)) + + if "last_chapter_read" in response: + del response["last_chapter_read"] + + return response + + def get_book_chapters(self, page=0): + response = self.requests_provider(urljoin(self.api_url, "chapter/" + ("" if not page else "?page=%s" % page))) + if response == 0: + self.display.exit("API: unable to retrieve book chapters.") + + response = response.json() + + if not isinstance(response, dict) or len(response.keys()) == 1: + self.display.exit(self.display.api_error(response)) + + if "results" not in response or not len(response["results"]): + self.display.exit("API: unable to retrieve book chapters.") + + result = [] + result.extend([c for c in response["results"] if "cover." 
in c["filename"]]) + for c in result: + del response["results"][response["results"].index(c)] + + result += response["results"] + return result + (self.get_book_chapters(page + 1) if response["next"] else []) + + def get_html(self, url): + response = self.requests_provider(url) + if response == 0: + self.display.exit( + "Crawler: error trying to retrieve this page: %s (%s)\n From: %s" % + (self.filename, self.chapter_title, url) + ) + + root = None + try: + root = html.fromstring(response.text, base_url=self.BASE_URL) + + except (html.etree.ParseError, html.etree.ParserError) as parsing_error: + self.display.error(parsing_error) + self.display.exit( + "Crawler: error trying to parse this page: %s (%s)\n From: %s" % + (self.filename, self.chapter_title, url) + ) + + return root + + def link_replace(self, link): + if link[0] == "/" and ("cover" in link or "images" in link or "graphics" in link + or link[-3:] in ["jpg", "jpeg", "png"]): + self.images.append(link) + self.display.log("Crawler: found a new image at %s" % link) + image = link.split("/")[-1] + return "Images/" + image + + elif link[0] not in ["/", "h"]: + return link.replace(".html", ".xhtml") + + return link + + def parse_html(self, root): + if random() > 0.5: + if len(root.xpath("//div[@class='controls']/a/text()")): + self.display.exit(self.display.api_error(" ")) + + book_content = root.xpath("//div[@id='sbo-rt-content']") + if not len(book_content): + self.display.exit( + "Parser: book content's corrupted or not present: %s (%s)" % + (self.filename, self.chapter_title) + ) + + page_css = "" + stylesheet_links = root.xpath("//link[@rel='stylesheet']") + if len(stylesheet_links): + stylesheet_count = 0 + for s in stylesheet_links: + css_url = urljoin("https:", s.attrib["href"]) if s.attrib["href"][:2] == "//" \ + else urljoin(self.base_url, s.attrib["href"]) + + if css_url not in self.css: + self.css.append(css_url) + self.display.log("Crawler: found a new CSS at %s" % css_url) + + stylesheet_count += 1 
+                page_css += "<link href=\"Styles/Style{0:0>2}.css\" " \
+                            "rel=\"stylesheet\" type=\"text/css\" />\n".format(stylesheet_count)
+
+        stylesheets = root.xpath("//style")
+        if len(stylesheets):
+            for css in stylesheets:
+                if "data-template" in css.attrib and len(css.attrib["data-template"]):
+                    css.text = css.attrib["data-template"]
+                    del css.attrib["data-template"]
+
+                try:
+                    page_css += html.tostring(css, method="xml", encoding='unicode') + "\n"
+
+                except (html.etree.ParseError, html.etree.ParserError) as parsing_error:
+                    self.display.error(parsing_error)
+                    self.display.exit(
+                        "Parser: error trying to parse one CSS found in this page: %s (%s)" %
+                        (self.filename, self.chapter_title)
+                    )
+
+        book_content = book_content[0]
+        book_content.rewrite_links(self.link_replace)
+
+        xhtml = None
+        try:
+            xhtml = html.tostring(book_content, method="xml", encoding='unicode')
+
+        except (html.etree.ParseError, html.etree.ParserError) as parsing_error:
+            self.display.error(parsing_error)
+            self.display.exit(
+                "Parser: error trying to parse HTML of this page: %s (%s)" %
+                (self.filename, self.chapter_title)
+            )
+
+        return page_css, xhtml
+
+    def create_dirs(self):
+        if os.path.isdir(self.BOOK_PATH):
+            self.display.log("Book directory already exists: %s" % self.book_title)
+
+        else:
+            os.makedirs(self.BOOK_PATH)
+
+        oebps = os.path.join(self.BOOK_PATH, "OEBPS")
+        if not os.path.isdir(oebps):
+            self.display.book_ad_info = True
+            os.makedirs(oebps)
+
+    def save_page_html(self, contents):
+        self.filename = self.filename.replace(".html", ".xhtml")
+        open(os.path.join(self.BOOK_PATH, "OEBPS", self.filename), "w")\
+            .write(self.BASE_HTML.format(contents[0], contents[1]))
+        self.display.log("Created: %s" % self.filename)
+
+    def get(self):
+        if not len(self.chapters_queue):
+            return
+
+        next_chapter = self.chapters_queue.pop(0)
+        self.chapter_title = next_chapter["title"]
+        self.filename = next_chapter["filename"]
+
+        if os.path.isfile(os.path.join(self.BOOK_PATH, "OEBPS", self.filename.replace(".html",
".xhtml"))): + if not self.display.book_ad_info and \ + next_chapter not in self.book_chapters[:self.book_chapters.index(next_chapter)]: + self.display.info("File `%s` already exists.\n" + " If you want to download again all the book%s,\n" + " please delete the `/OEBPS/*.xhtml` files and restart the program." % + (self.filename.replace(".html", ".xhtml"), + " (especially because you selected the `--no-kindle` option)" if self.args.no_kindle + else "")) + self.display.book_ad_info = 1 + + else: + self.save_page_html(self.parse_html(self.get_html(urljoin(self.base_url, self.filename)))) + + self.display.state(len(self.book_chapters), len(self.book_chapters) - len(self.chapters_queue)) + self.get() + + def _thread_download_css(self, url, done_queue): + css_file = os.path.join(self.css_path, "Style{0:0>2}.css".format(self.css.index(url))) + if os.path.isfile(css_file): + if not self.display.css_ad_info.value and url not in self.css[:self.css.index(url)]: + self.display.info("File `%s` already exists.\n" + " If you want to download again all the CSSs,\n" + " please delete the `/OEBPS/*.xhtml` and `/OEBPS/Styles/*`" + " files and restart the program." 
% + css_file) + self.display.css_ad_info.value = 1 + + else: + response = self.requests_provider(url, update_cookies=False) + if response == 0: + self.display.error("Error trying to retrieve this CSS: %s\n From: %s" % (css_file, url)) + + with open(css_file, 'wb') as s: + for chunk in response.iter_content(1024): + s.write(chunk) + + done_queue.put(1) + self.display.state(len(self.css), done_queue.qsize()) + + def _thread_download_images(self, url, done_queue): + image_name = url.split("/")[-1] + image_path = os.path.join(self.images_path, image_name) + if os.path.isfile(image_path): + if not self.display.images_ad_info.value and url not in self.images[:self.images.index(url)]: + self.display.info("File `%s` already exists.\n" + " If you want to download again all the images,\n" + " please delete the `/OEBPS/*.xhtml` and `/OEBPS/Images/*`" + " files and restart the program." % + image_name) + self.display.images_ad_info.value = 1 + + else: + response = self.requests_provider(urljoin(self.BASE_URL, url), + update_cookies=False, + stream=True) + if response == 0: + self.display.error("Error trying to retrieve this image: %s\n From: %s" % (image_name, url)) + + with open(image_path, 'wb') as img: + for chunk in response.iter_content(1024): + img.write(chunk) + + done_queue.put(1) + self.display.state(len(self.images), done_queue.qsize()) + + def _start_multiprocessing(self, operation, full_queue, done_queue=None): + if not done_queue: + done_queue = Queue(0) + + if len(full_queue) > 5: + for i in range(0, len(full_queue), 5): + self._start_multiprocessing(operation, full_queue[i:i+5], done_queue) + + else: + process_queue = [Process(target=operation, args=(arg, done_queue)) for arg in full_queue] + for proc in process_queue: + proc.start() + + for proc in process_queue: + proc.join() + + def collect_css(self): + self.css_path = os.path.join(self.BOOK_PATH, "OEBPS", "Styles") + if os.path.isdir(self.css_path): + self.display.log("CSSs directory already exists: %s" % 
self.css_path)
+
+        else:
+            os.makedirs(self.css_path)
+            self.display.css_ad_info.value = 1
+
+        self.display.state_status.value = -1
+        self._start_multiprocessing(self._thread_download_css, self.css)
+
+    def collect_images(self):
+        self.images_path = os.path.join(self.BOOK_PATH, "OEBPS", "Images")
+        if os.path.isdir(self.images_path):
+            self.display.log("Images directory already exists: %s" % self.images_path)
+
+        else:
+            os.makedirs(self.images_path)
+            self.display.images_ad_info.value = 1
+
+        if self.display.book_ad_info == 1:
+            self.display.info("Some of the book contents were already downloaded.\n"
+                              " If you want to be sure that all the images will be downloaded,\n"
+                              " please delete the `/OEBPS/*.xhtml` files and restart the program.")
+
+        self.display.state_status.value = -1
+        self._start_multiprocessing(self._thread_download_images, self.images)
+
+    def create_content_opf(self):
+        self.cover = self.images[0] if len(self.images) else ""
+        self.css = next(os.walk(self.css_path))[2]
+        self.images = next(os.walk(self.images_path))[2]
+
+        manifest = []
+        spine = []
+        for c in self.book_chapters:
+            c["filename"] = c["filename"].replace(".html", ".xhtml")
+            item_id = escape("".join(c["filename"].split(".")[:-1]))
+            manifest.append("<item id=\"{0}\" href=\"{1}\" media-type=\"application/xhtml+xml\" />".format(
+                item_id, c["filename"]
+            ))
+            spine.append("<itemref idref=\"{0}\"/>".format(item_id))
+
+        alt_cover_id = False
+        for i in self.images:
+            dot_split = i.split(".")
+            head = "img_" + escape("".join(dot_split[:-1]))
+            extension = dot_split[-1]
+            manifest.append("<item id=\"{0}\" href=\"Images/{1}\" media-type=\"image/{2}\" />".format(
+                head, i, "jpeg" if "jp" in extension else extension
+            ))
+
+            if not alt_cover_id:
+                alt_cover_id = head
+
+        for i in range(1, len(self.css) + 1):
+            manifest.append("<item id=\"style_{0:0>2}\" href=\"Styles/Style{0:0>2}.css\" "
+                            "media-type=\"text/css\" />".format(i))
+
+        authors = "\n".join("<dc:creator opf:file-as=\"{0}\" opf:role=\"aut\">{0}</dc:creator>".format(
+            escape(aut["name"])
+        ) for aut in self.book_info["authors"])
+
+        subjects = "\n".join("<dc:subject>{0}</dc:subject>".format(escape(sub["name"]))
+                             for sub in self.book_info["subjects"])
+
+        return self.CONTENT_OPF.format(
+            (self.book_info["isbn"] if len(self.book_info["isbn"]) else self.book_id),
+            escape(self.book_title),
+            authors,
+            escape(self.book_info["description"]),
+            subjects,
+            ", ".join(escape(pub["name"]) for pub in self.book_info["publishers"]),
+            escape(self.book_info["rights"]),
+            self.cover if self.cover else alt_cover_id,
+            "\n".join(manifest),
+            "\n".join(spine),
+            self.book_chapters[0]["filename"].replace(".html", ".xhtml")
+        )
+
+    @staticmethod
+    def parse_toc(l, c=0, mx=0):
+        r = ""
+        for cc in l:
+            c += 1
+            if int(cc["depth"]) > mx:
+                mx = int(cc["depth"])
+
+            r += "<navPoint id=\"{0}\" playOrder=\"{1}\">" \
+                 "<navLabel><text>{2}</text></navLabel>" \
+                 "<content src=\"{3}\"/>".format(
+                    cc["fragment"] if len(cc["fragment"]) else cc["id"], c,
+                    escape(cc["label"]), cc["href"].replace(".html", ".xhtml")
+                 )
+
+            if cc["children"]:
+                sr, c, mx = SafariBooks.parse_toc(cc["children"], c, mx)
+                r += sr
+
+            r += "</navPoint>\n"
+
+        return r, c, mx
+
+    def create_toc(self):
+        response = self.requests_provider(urljoin(self.api_url, "toc/"))
+        if response == 0:
+            self.display.exit("API: unable to retrieve book chapters. "
+                              "Don't delete any files, just run again this program"
+                              " in order to complete the `.epub` creation!")
+
+        response = response.json()
+
+        if not isinstance(response, list) and len(response.keys()) == 1:
+            self.display.exit(
+                self.display.api_error(response) +
+                " Don't delete any files, just run again this program"
+                " in order to complete the `.epub` creation!"
+ ) + + navmap, _, max_depth = self.parse_toc(response) + return self.TOC_NCX.format( + (self.book_info["isbn"] if len(self.book_info["isbn"]) else self.book_id), + max_depth, + self.book_title, + ", ".join(aut["name"] for aut in self.book_info["authors"]), + navmap + ) + + def create_epub(self): + open(os.path.join(self.BOOK_PATH, "mimetype"), "w").write("application/epub+zip") + meta_info = os.path.join(self.BOOK_PATH, "META-INF") + if os.path.isdir(meta_info): + self.display.log("META-INF directory already exists: %s" % meta_info) + + else: + os.makedirs(meta_info) + + open(os.path.join(meta_info, "container.xml"), "w").write(self.CONTAINER_XML) + open(os.path.join(self.BOOK_PATH, "OEBPS", "content.opf"), "w").write(self.create_content_opf()) + open(os.path.join(self.BOOK_PATH, "OEBPS", "toc.ncx"), "w").write(self.create_toc()) + + zip_file = os.path.join(self.BOOK_PATH, self.book_title) + if os.path.isfile(zip_file + ".epub"): + os.remove(zip_file + ".epub") + if os.path.isfile(zip_file + ".zip"): + os.remove(zip_file + ".zip") + shutil.make_archive(zip_file, 'zip', self.BOOK_PATH) + os.rename(zip_file + ".zip", zip_file + ".epub") + + +# MAIN +arguments = argparse.ArgumentParser(prog="safaribooks.py", + description="Download and generate an EPUB of your favorite books" + " from Safari Books Online.", + add_help=False, + allow_abbrev=False) + +arguments.add_argument( + "--cred", metavar="", default=False, + help="Credentials used to perform the auth login on Safari Books Online." + " Es. ` --cred \"account_mail@mail.com:password01\" `." +) +arguments.add_argument( + "--no-cookies", dest="no_cookies", action='store_true', + help="Prevent your session data to be saved into `cookies.json` file." +) +arguments.add_argument( + "--no-kindle", dest="no_kindle", action='store_true', + help="Remove some CSS rules that block overflow on `table` and `pre` elements." + " Use this option if you're not going to export the EPUB to E-Readers like Amazon Kindle." 
+) +arguments.add_argument( + "--preserve-log", dest="log", action='store_true', help="Leave the `info.log` file even if there isn't any error." +) +arguments.add_argument("--help", action="help", default=argparse.SUPPRESS, help='Show this help message.') + +arguments.add_argument( + "bookid", metavar='', + help="Book digits ID that you want to download. You can find it in the URL (X-es):" + " `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`" +) + +args_parsed = arguments.parse_args() + +if args_parsed.cred: + cred = args_parsed.cred.split(":") + if len(cred) != 2 or "@" not in cred[0]: + arguments.error("invalid credential: %s" % args_parsed.cred) + + args_parsed.cred = cred + +else: + if args_parsed.no_cookies: + arguments.error("invalid option: `--no-cookies` is valid only if you use the `--cred` option") + +if not args_parsed.bookid.isdigit(): + arguments.error("invalid book id: %s" % args_parsed.bookid) + +SafariBooks(args_parsed) From ac20e8bfa29dcff10ba37be71a7373f2fb33de51 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 12 Dec 2017 16:41:52 +0100 Subject: [PATCH 013/100] Update safaribooks.py --- safaribooks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 11b35e2..fc50f18 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -556,7 +556,7 @@ def get(self): (self.filename.replace(".html", ".xhtml"), " (especially because you selected the `--no-kindle` option)" if self.args.no_kindle else "")) - self.display.book_ad_info = 1 + self.display.book_ad_info = 2 else: self.save_page_html(self.parse_html(self.get_html(urljoin(self.base_url, self.filename)))) @@ -650,7 +650,7 @@ def collect_images(self): os.makedirs(self.images_path) self.display.images_ad_info.value = 1 - if self.display.book_ad_info == 1: + if self.display.book_ad_info == 2: self.display.info("Some of the book contents were already downloaded.\n" " If you want to be sure that all the images will be 
downloaded,\n" " please delete the `/OEBPS/*.xhtml` files and restart the program.") From 4cc96f32120a18af6ff50d1a2cc055fc04ea1b76 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 13 Dec 2017 11:31:06 +0100 Subject: [PATCH 014/100] CSS improvement: add a new rule for .bq elements --- safaribooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index fc50f18..6a7a577 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -178,7 +178,7 @@ class SafariBooks: "{0}\n" \ "" + book_content = book_content[0] book_content.rewrite_links(self.link_replace) + # TODO: add all not covered tag for `link_replace` function + svg_image_tags = root.xpath("//image") + if len(svg_image_tags): + for img in svg_image_tags: + image_attr_href = [x for x in img.attrib.keys() if "href" in x] + if len(image_attr_href): + svg_url = img.attrib.get(image_attr_href[0]) + svg_root = img.getparent().getparent() + new_img = svg_root.makeelement("img") + new_img.attrib.update({"src": self.link_replace(svg_url)}) + svg_root.remove(img.getparent()) + svg_root.append(new_img) + xhtml = None try: xhtml = html.tostring(book_content, method="xml", encoding='unicode') @@ -577,6 +591,8 @@ def get(self): if not len(self.chapters_queue): return + is_cover = len_books == len(self.chapters_queue) + next_chapter = self.chapters_queue.pop(0) self.chapter_title = next_chapter["title"] self.filename = next_chapter["filename"] @@ -597,7 +613,7 @@ def get(self): self.display.book_ad_info = 2 else: - self.save_page_html(self.parse_html(self.get_html(urljoin(self.base_url, self.filename)))) + self.save_page_html(self.parse_html(self.get_html(urljoin(self.base_url, self.filename)), is_cover)) self.display.state(len_books, len_books - len(self.chapters_queue)) From e90c0a9faa9a14f7f5ff6bd2285f29b89c301139 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 2 Feb 2018 12:14:36 +0100 Subject: [PATCH 023/100] Update: bug fix --- safaribooks.py | 6 ++++++ 1 
file changed, 6 insertions(+) diff --git a/safaribooks.py b/safaribooks.py index 523bf53..2d98035 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -282,6 +282,9 @@ def __init__(self, args): self.book_chapters = self.get_book_chapters() self.chapters_queue = self.book_chapters[:] + if len(self.book_chapters) > sys.getrecursionlimit(): + sys.setrecursionlimit(len(self.book_chapters)) + self.book_title = self.book_info["title"] self.base_url = self.book_info["web_url"] @@ -439,6 +442,9 @@ def get_book_chapters(self, page=0): if "results" not in response or not len(response["results"]): self.display.exit("API: unable to retrieve book chapters.") + if response["count"] > sys.getrecursionlimit(): + sys.setrecursionlimit(response["count"]) + result = [] result.extend([c for c in response["results"] if "cover." in c["filename"]]) for c in result: From 5caf9768d8f43f438d2e66dcc03864c4a6234c84 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 2 Feb 2018 12:22:15 +0100 Subject: [PATCH 024/100] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4fbcca2..8301c16 100644 --- a/README.md +++ b/README.md @@ -105,7 +105,7 @@ In the other hand, if you're not going to export the `EPUB`, you can use the `-- [-] URL: https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/ [*] Retrieving book chapters... [*] Output directory: - /XXXX/XXXX/Test-Driven Development with Python, 2nd Edition + /XXXX/XXXX/Books/Test-Driven Development with Python, 2nd Edition [-] Downloading book contents... (73 chapters) [#########################################----------------------------] 60% ... 
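The recursion-limit workaround in the patch above exists because `get_book_chapters` calls itself once per API page, so very long books can exhaust Python's default call-stack depth. A recursion-free sketch of the same pagination loop, using a hypothetical `fetch_page(page)` callable (standing in for the program's `requests_provider` + `.json()` round trip) that returns the API's `{"results": [...], "next": ...}` dict shape:

```python
def collect_chapters(fetch_page):
    """Iteratively drain a paginated chapter API.

    `fetch_page` is a hypothetical callable: given a 1-based page
    number it returns a dict with "results" (a list of chapters)
    and "next" (truthy while more pages remain).
    """
    chapters = []
    page = 1
    while True:
        data = fetch_page(page)
        chapters.extend(data["results"])
        if not data.get("next"):
            return chapters
        page += 1
```

With this shape the chapter count never touches `sys.setrecursionlimit`, no matter how many pages the API returns.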
From 0c9aab6b33739588cf1ed89be61be776223fe237 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 20 Feb 2018 19:43:19 +0100 Subject: [PATCH 025/100] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 8301c16..7da2323 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ # SafariBooks -Download and generate an *EPUB* of your favorite books from [*Safari Books Online*](https://www.safaribooksonline.com) library. +Download and generate *EPUB* of your favorite books from [*Safari Books Online*](https://www.safaribooksonline.com) library. Use this program only for *personal* and/or *educational* purpose. ## Overview: @@ -43,7 +43,7 @@ usage: safaribooks.py [--cred ] [--no-cookies] [--no-kindle] [--preserve-log] [--help] -Download and generate an EPUB of your favorite books from Safari Books Online. +Download and generate EPUB of your favorite books from Safari Books Online. positional arguments: Book digits ID that you want to download. From cdb96d6fcec769b0bbe3bc13e1e1abf67961dffb Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 22 Feb 2018 17:01:24 +0100 Subject: [PATCH 026/100] Update README.md Fixes #2 --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7da2323..430d81f 100644 --- a/README.md +++ b/README.md @@ -66,7 +66,7 @@ optional arguments: ``` The first time you'll use the program, you'll have to specify your Safari Books Online account credentials. -For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json`. +For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (see file format [`here`](/../../issues/2)). 
Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. From a9f1eb71a0dee6bedf6999140c9bb8174a232dcf Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 22 Feb 2018 17:03:20 +0100 Subject: [PATCH 027/100] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 430d81f..0700f36 100644 --- a/README.md +++ b/README.md @@ -66,7 +66,7 @@ optional arguments: ``` The first time you'll use the program, you'll have to specify your Safari Books Online account credentials. -For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (see file format [`here`](/../../issues/2)). +For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (see file format [`here`](/../../issues/2#issuecomment-367726544)). Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. 
From f38eeb67486017e11cc25181a13c49407a4171ce Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 23 Feb 2018 18:06:57 +0100 Subject: [PATCH 028/100] Fixed #4 --- safaribooks.py | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 2d98035..6613b00 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -1,3 +1,4 @@ +#!/usr/bin/env python3 # coding: utf-8 import os import sys @@ -10,8 +11,8 @@ from lxml import html from html import escape from random import random -from urllib.parse import urljoin, urlsplit, urlparse from multiprocessing import Process, Queue, Value +from urllib.parse import urljoin, urlsplit, urlparse PATH = os.path.dirname(os.path.realpath(__file__)) @@ -288,7 +289,8 @@ def __init__(self, args): self.book_title = self.book_info["title"] self.base_url = self.book_info["web_url"] - self.BOOK_PATH = os.path.join(PATH, "Books", self.book_title) + self.clean_book_title = self.clean_dirname(self.book_title) + self.BOOK_PATH = os.path.join(PATH, "Books", self.clean_book_title) self.create_dirs() self.display.info("Output directory:\n %s" % self.BOOK_PATH) @@ -318,7 +320,7 @@ def __init__(self, args): if not args.no_cookies: json.dump(self.cookies, open(COOKIES_FILE, "w")) - self.display.done(self.book_title + ".epub") + self.display.done(self.clean_book_title + ".epub") if not self.display.in_error and not args.log: os.remove(self.display.log_file) @@ -572,6 +574,18 @@ def parse_html(self, root, is_cover=False): return page_css, xhtml + @staticmethod + def clean_dirname(dirname): + if ":" in dirname: + if dirname.index(":") > 45: + dirname = dirname.split(":")[0] + + for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', ':', '|']: + if ch in dirname: + dirname = dirname.replace(ch, "_") + + return dirname + def create_dirs(self): if os.path.isdir(self.BOOK_PATH): self.display.log("Book directory already exists: %s" % self.book_title) @@ -842,13 +856,14 @@ def 
create_epub(self): open(os.path.join(self.BOOK_PATH, "OEBPS", "content.opf"), "w").write(self.create_content_opf()) open(os.path.join(self.BOOK_PATH, "OEBPS", "toc.ncx"), "w").write(self.create_toc()) - zip_file = os.path.join(self.BOOK_PATH, self.book_title) + zip_file = os.path.join(PATH, "Books", self.book_id) if os.path.isfile(zip_file + ".epub"): os.remove(zip_file + ".epub") if os.path.isfile(zip_file + ".zip"): os.remove(zip_file + ".zip") + shutil.make_archive(zip_file, 'zip', self.BOOK_PATH) - os.rename(zip_file + ".zip", zip_file + ".epub") + os.rename(zip_file + ".zip", os.path.join(self.BOOK_PATH, self.clean_book_title) + ".epub") # MAIN From 3458e4b98a8dc903cff389ffea415de4f9901941 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 23 Feb 2018 18:11:26 +0100 Subject: [PATCH 029/100] Update safaribooks.py --- safaribooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index 6613b00..cffd72a 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -580,7 +580,7 @@ def clean_dirname(dirname): if dirname.index(":") > 45: dirname = dirname.split(":")[0] - for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', ':', '|']: + for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', '|']: if ch in dirname: dirname = dirname.replace(ch, "_") From 67b3e11d26f97b773056bab6df2aa18ae85de7c7 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Sun, 25 Feb 2018 19:10:16 +0100 Subject: [PATCH 030/100] Added verbosity at debug log Fixed #3 --- safaribooks.py | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index cffd72a..6535bbd 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -47,6 +47,7 @@ def __init__(self, log_file): self.book_ad_info = False self.css_ad_info = Value("i", 0) self.images_ad_info = Value("i", 0) + self.last_request = (None,) self.in_error = False self.state_status = Value("i", 0) @@ -85,6 +86,10 @@ def 
exit(self, error): def unhandled_exception(self, _, o, tb): self.log("".join(traceback.format_tb(tb))) + if any(self.last_request): + self.log("Last request done:\n\tURL: {0}\n\tDATA: {1}\n\tOTHERS: {2}\n\n\t{3}\n{4}\n\n{5}\n" + .format(*self.last_request)) + self.exit("Unhandled Exception: %s (type: %s)" % (o, o.__class__.__name__)) def intro(self): @@ -354,6 +359,12 @@ def requests_provider(self, url, post=False, data=None, update_cookies=True, **k **kwargs ) + self.display.last_request = ( + url, data, kwargs, response.status_code, "\n".join( + ["\t{}: {}".format(*h) for h in response.headers.items()] + ), response.text + ) + except (requests.ConnectionError, requests.ConnectTimeout, requests.RequestException) as request_exception: self.display.error(str(request_exception)) return 0 @@ -577,7 +588,7 @@ def parse_html(self, root, is_cover=False): @staticmethod def clean_dirname(dirname): if ":" in dirname: - if dirname.index(":") > 45: + if dirname.index(":") > 30: dirname = dirname.split(":")[0] for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', '|']: @@ -601,7 +612,7 @@ def create_dirs(self): def save_page_html(self, contents): self.filename = self.filename.replace(".html", ".xhtml") open(os.path.join(self.BOOK_PATH, "OEBPS", self.filename), "wb")\ - .write(self.BASE_HTML.format(contents[0], contents[1]).encode("utf-8", "replace")) + .write(self.BASE_HTML.format(contents[0], contents[1]).encode("utf-8", 'xmlcharrefreplace')) self.display.log("Created: %s" % self.filename) def get(self): @@ -852,9 +863,15 @@ def create_epub(self): else: os.makedirs(meta_info) - open(os.path.join(meta_info, "container.xml"), "w").write(self.CONTAINER_XML) - open(os.path.join(self.BOOK_PATH, "OEBPS", "content.opf"), "w").write(self.create_content_opf()) - open(os.path.join(self.BOOK_PATH, "OEBPS", "toc.ncx"), "w").write(self.create_toc()) + open(os.path.join(meta_info, "container.xml"), "wb").write( + self.CONTAINER_XML.encode("utf-8", "xmlcharrefreplace") + ) + 
open(os.path.join(self.BOOK_PATH, "OEBPS", "content.opf"), "wb").write( + self.create_content_opf().encode("utf-8", "xmlcharrefreplace") + ) + open(os.path.join(self.BOOK_PATH, "OEBPS", "toc.ncx"), "wb").write( + self.create_toc().encode("utf-8", "xmlcharrefreplace") + ) zip_file = os.path.join(PATH, "Books", self.book_id) if os.path.isfile(zip_file + ".epub"): From cfad39c38229c384580f74d53d46b51ac03ce647 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 26 Feb 2018 20:13:10 +0100 Subject: [PATCH 031/100] Updated and fixed #3 --- safaribooks.py | 79 +++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 62 insertions(+), 17 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 6535bbd..97ca52b 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -82,16 +82,19 @@ def exit(self, error): output = self.SH_BG_RED + "[!]" + self.SH_DEFAULT + " Aborting..." self.out(output) + + self.save_last_request() sys.exit(128) def unhandled_exception(self, _, o, tb): self.log("".join(traceback.format_tb(tb))) + self.exit("Unhandled Exception: %s (type: %s)" % (o, o.__class__.__name__)) + + def save_last_request(self): if any(self.last_request): self.log("Last request done:\n\tURL: {0}\n\tDATA: {1}\n\tOTHERS: {2}\n\n\t{3}\n{4}\n\n{5}\n" .format(*self.last_request)) - self.exit("Unhandled Exception: %s (type: %s)" % (o, o.__class__.__name__)) - def intro(self): output = self.SH_YELLOW + """ ____ ___ _ @@ -286,6 +289,16 @@ def __init__(self, args): self.display.info("Retrieving book chapters...") self.book_chapters = self.get_book_chapters() + + self.no_cover = False + if "cover" not in self.book_chapters[0]["filename"] or "cover" not in self.book_chapters[0]["title"]: + self.book_chapters = [{ + "filename": "cover", + "title": "Cover", + "web_url": self.book_info["cover"] + }] + self.book_chapters + self.no_cover = True + self.chapters_queue = self.book_chapters[:] if len(self.book_chapters) > sys.getrecursionlimit(): @@ -306,6 +319,7 @@ def 
__init__(self, args): self.display.info("Downloading book contents... (%s chapters)" % len(self.book_chapters), state=True) self.BASE_HTML = self.BASE_01_HTML + (self.KINDLE_HTML if not args.no_kindle else "") + self.BASE_02_HTML + self.get() self.css_path = "" @@ -442,8 +456,8 @@ def get_book_info(self): return response - def get_book_chapters(self, page=0): - response = self.requests_provider(urljoin(self.api_url, "chapter/" + ("" if not page else "?page=%s" % page))) + def get_book_chapters(self, page=1): + response = self.requests_provider(urljoin(self.api_url, "chapter/?page=%s" % page)) if response == 0: self.display.exit("API: unable to retrieve book chapters.") @@ -463,7 +477,7 @@ def get_book_chapters(self, page=0): for c in result: del response["results"][response["results"].index(c)] - result += response["results"] + result += response["results"] return result + (self.get_book_chapters(page + 1) if response["next"] else []) def get_html(self, url): @@ -532,9 +546,9 @@ def parse_html(self, root, is_cover=False): self.css.append(css_url) self.display.log("Crawler: found a new CSS at %s" % css_url) - stylesheet_count += 1 page_css += "2}.css\" " \ "rel=\"stylesheet\" type=\"text/css\" />\n".format(stylesheet_count) + stylesheet_count += 1 stylesheets = root.xpath("//style") if len(stylesheets): @@ -553,12 +567,6 @@ def parse_html(self, root, is_cover=False): (self.filename, self.chapter_title) ) - if is_cover: - page_css += "" - - book_content = book_content[0] - book_content.rewrite_links(self.link_replace) - # TODO: add all not covered tag for `link_replace` function svg_image_tags = root.xpath("//image") if len(svg_image_tags): @@ -568,12 +576,31 @@ def parse_html(self, root, is_cover=False): svg_url = img.attrib.get(image_attr_href[0]) svg_root = img.getparent().getparent() new_img = svg_root.makeelement("img") - new_img.attrib.update({"src": self.link_replace(svg_url)}) + new_img.attrib.update({"src": svg_url}) svg_root.remove(img.getparent()) 
svg_root.append(new_img) + book_content = book_content[0] + book_content.rewrite_links(self.link_replace) + xhtml = None try: + if is_cover: + page_css = "" + + cover_html = html.fromstring("
") + cover_div = cover_html.xpath("//div")[0] + + if len(book_content.xpath("//img")): + cover_img = cover_div.makeelement("img") + cover_img.attrib.update({"src": book_content.xpath("//img")[0].attrib["src"]}) + cover_div.append(cover_img) + book_content = cover_html + xhtml = html.tostring(book_content, method="xml", encoding='unicode') except (html.etree.ParseError, html.etree.ParserError) as parsing_error: @@ -644,7 +671,25 @@ def get(self): self.display.book_ad_info = 2 else: - self.save_page_html(self.parse_html(self.get_html(urljoin(self.base_url, self.filename)), is_cover)) + if is_cover and self.no_cover: + response = self.requests_provider(next_chapter["web_url"], update_cookies=False, stream=True) + if response != 0: + with open(os.path.join(self.BOOK_PATH, "OEBPS", "Images", + self.filename + "." + response.headers["Content-Type"].split("/")[-1]), 'wb') as s: + for chunk in response.iter_content(1024): + s.write(chunk) + + cover_html = self.parse_html(html.fromstring( + "
".format( + self.filename + "." + response.headers["Content-Type"].split("/")[-1] + ) + ), is_cover) + self.filename += ".xhtml" + self.book_chapters[0]["filename"] += ".xhtml" + self.save_page_html(cover_html) + continue + + self.save_page_html(self.parse_html(self.get_html(next_chapter["web_url"]), is_cover)) self.display.state(len_books, len_books - len(self.chapters_queue)) @@ -769,7 +814,7 @@ def create_content_opf(self): spine.append("".format(item_id)) alt_cover_id = False - for i in self.images: + for i in set(self.images): dot_split = i.split(".") head = "img_" + escape("".join(dot_split[:-1])) extension = dot_split[-1] @@ -780,7 +825,7 @@ def create_content_opf(self): if not alt_cover_id: alt_cover_id = head - for i in range(1, len(self.css) + 1): + for i in range(len(self.css)): manifest.append("2}\" href=\"Styles/Style{0:0>2}.css\" " "media-type=\"text/css\" />".format(i)) @@ -818,7 +863,7 @@ def parse_toc(l, c=0, mx=0): "{2}" \ "".format( cc["fragment"] if len(cc["fragment"]) else cc["id"], c, - escape(cc["label"]), cc["href"].replace(".html", ".xhtml") + escape(cc["label"]), cc["href"].replace(".html", ".xhtml").split("/")[-1] ) if cc["children"]: From 452148d40b3bb1e21bf8ad81bf41a6e850c60e1d Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 27 Feb 2018 15:13:12 +0100 Subject: [PATCH 032/100] General improvement Fixed #3 (Again) --- safaribooks.py | 158 +++++++++++++++++++++++++++---------------------- 1 file changed, 87 insertions(+), 71 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 97ca52b..b646174 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -290,15 +290,6 @@ def __init__(self, args): self.display.info("Retrieving book chapters...") self.book_chapters = self.get_book_chapters() - self.no_cover = False - if "cover" not in self.book_chapters[0]["filename"] or "cover" not in self.book_chapters[0]["title"]: - self.book_chapters = [{ - "filename": "cover", - "title": "Cover", - "web_url": self.book_info["cover"] - }] + 
self.book_chapters - self.no_cover = True - self.chapters_queue = self.book_chapters[:] if len(self.book_chapters) > sys.getrecursionlimit(): @@ -309,6 +300,8 @@ def __init__(self, args): self.clean_book_title = self.clean_dirname(self.book_title) self.BOOK_PATH = os.path.join(PATH, "Books", self.clean_book_title) + self.css_path = "" + self.images_path = "" self.create_dirs() self.display.info("Output directory:\n %s" % self.BOOK_PATH) @@ -320,11 +313,21 @@ def __init__(self, args): self.display.info("Downloading book contents... (%s chapters)" % len(self.book_chapters), state=True) self.BASE_HTML = self.BASE_01_HTML + (self.KINDLE_HTML if not args.no_kindle else "") + self.BASE_02_HTML + self.cover = False self.get() + if not self.cover: + self.cover = self.get_default_cover() + cover_html = self.parse_html( + html.fromstring("
".format(self.cover)), True + ) - self.css_path = "" - self.images_path = "" - self.cover = "" + self.book_chapters = [{ + "filename": "default_cover.xhtml", + "title": "Cover" + }] + self.book_chapters + + self.filename = self.book_chapters[0]["filename"] + self.save_page_html(cover_html) self.css_done_queue = Queue(0) if "win" not in sys.platform else WinQueue() self.display.info("Downloading book CSSs... (%s files)" % len(self.css), state=True) @@ -473,13 +476,26 @@ def get_book_chapters(self, page=1): sys.setrecursionlimit(response["count"]) result = [] - result.extend([c for c in response["results"] if "cover." in c["filename"]]) + result.extend([c for c in response["results"] if "cover" in c["filename"] or "cover" in c["title"]]) for c in result: del response["results"][response["results"].index(c)] result += response["results"] return result + (self.get_book_chapters(page + 1) if response["next"] else []) + def get_default_cover(self): + response = self.requests_provider(self.book_info["cover"], update_cookies=False, stream=True) + if response == 0: + self.display.error("Error trying to retrieve the cover: %s" % self.book_info["cover"]) + return False + + file_ext = response.headers["Content-Type"].split("/")[-1] + with open(os.path.join(self.images_path, "default_cover." + file_ext), 'wb') as i: + for chunk in response.iter_content(1024): + i.write(chunk) + + return "default_cover." 
+ file_ext + def get_html(self, url): response = self.requests_provider(url) if response == 0: @@ -508,9 +524,9 @@ def url_is_absolute(url): def link_replace(self, link): if link: if not self.url_is_absolute(link): - link = urljoin(self.base_url, link) if "cover" in link or "images" in link or "graphics" in link or \ link[-3:] in ["jpg", "peg", "png", "gif"]: + link = urljoin(self.base_url, link) if link not in self.images: self.images.append(link) self.display.log("Crawler: found a new image at %s" % link) @@ -520,9 +536,31 @@ def link_replace(self, link): return link.replace(".html", ".xhtml") + else: + if self.book_id in link: + return self.link_replace(link.split(self.book_id)[-1]) + return link - def parse_html(self, root, is_cover=False): + @staticmethod + def get_cover(html_root): + images = html_root.xpath("//img[contains(@id, 'cover') or " + "contains(@name, 'cover') or contains(@src, 'cover')]") + if len(images): + return images[0] + + divs = html_root.xpath("//div[contains(@id, 'cover') or " + "contains(@name, 'cover') or contains(@src, 'cover')]//img") + if len(divs): + return divs[0] + + a = html_root.xpath("//a[contains(@id, 'cover') or contains(@name, 'cover') or contains(@src, 'cover')]//img") + if len(a): + return a[0] + + return None + + def parse_html(self, root, first_page=False): if random() > 0.5: if len(root.xpath("//div[@class='controls']/a/text()")): self.display.exit(self.display.api_error(" ")) @@ -585,22 +623,23 @@ def parse_html(self, root, is_cover=False): xhtml = None try: - if is_cover: - page_css = "" - - cover_html = html.fromstring("
") - cover_div = cover_html.xpath("//div")[0] - - if len(book_content.xpath("//img")): + if first_page: + is_cover = self.get_cover(book_content) + if is_cover is not None: + page_css = "" + cover_html = html.fromstring("
") + cover_div = cover_html.xpath("//div")[0] cover_img = cover_div.makeelement("img") - cover_img.attrib.update({"src": book_content.xpath("//img")[0].attrib["src"]}) + cover_img.attrib.update({"src": is_cover.attrib["src"]}) cover_div.append(cover_img) book_content = cover_html + self.cover = is_cover.attrib["src"] + xhtml = html.tostring(book_content, method="xml", encoding='unicode') except (html.etree.ParseError, html.etree.ParserError) as parsing_error: @@ -620,7 +659,7 @@ def clean_dirname(dirname): for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', '|']: if ch in dirname: - dirname = dirname.replace(ch, "_") + dirname = dirname.replace(ch, "") return dirname @@ -636,6 +675,22 @@ def create_dirs(self): self.display.book_ad_info = True os.makedirs(oebps) + self.css_path = os.path.join(oebps, "Styles") + if os.path.isdir(self.css_path): + self.display.log("CSSs directory already exists: %s" % self.css_path) + + else: + os.makedirs(self.css_path) + self.display.css_ad_info.value = 1 + + self.images_path = os.path.join(oebps, "Images") + if os.path.isdir(self.images_path): + self.display.log("Images directory already exists: %s" % self.images_path) + + else: + os.makedirs(self.images_path) + self.display.images_ad_info.value = 1 + def save_page_html(self, contents): self.filename = self.filename.replace(".html", ".xhtml") open(os.path.join(self.BOOK_PATH, "OEBPS", self.filename), "wb")\ @@ -645,11 +700,11 @@ def save_page_html(self, contents): def get(self): len_books = len(self.book_chapters) - for _ in self.book_chapters: + for _ in range(len_books): if not len(self.chapters_queue): return - is_cover = len_books == len(self.chapters_queue) + first_page = len_books == len(self.chapters_queue) next_chapter = self.chapters_queue.pop(0) self.chapter_title = next_chapter["title"] @@ -671,25 +726,7 @@ def get(self): self.display.book_ad_info = 2 else: - if is_cover and self.no_cover: - response = self.requests_provider(next_chapter["web_url"], 
update_cookies=False, stream=True) - if response != 0: - with open(os.path.join(self.BOOK_PATH, "OEBPS", "Images", - self.filename + "." + response.headers["Content-Type"].split("/")[-1]), 'wb') as s: - for chunk in response.iter_content(1024): - s.write(chunk) - - cover_html = self.parse_html(html.fromstring( - "
".format( - self.filename + "." + response.headers["Content-Type"].split("/")[-1] - ) - ), is_cover) - self.filename += ".xhtml" - self.book_chapters[0]["filename"] += ".xhtml" - self.save_page_html(cover_html) - continue - - self.save_page_html(self.parse_html(self.get_html(next_chapter["web_url"]), is_cover)) + self.save_page_html(self.parse_html(self.get_html(next_chapter["web_url"]), first_page)) self.display.state(len_books, len_books - len(self.chapters_queue)) @@ -756,14 +793,6 @@ def _start_multiprocessing(self, operation, full_queue): proc.join() def collect_css(self): - self.css_path = os.path.join(self.BOOK_PATH, "OEBPS", "Styles") - if os.path.isdir(self.css_path): - self.display.log("CSSs directory already exists: %s" % self.css_path) - - else: - os.makedirs(self.css_path) - self.display.css_ad_info.value = 1 - self.display.state_status.value = -1 if "win" in sys.platform: @@ -775,14 +804,6 @@ def collect_css(self): self._start_multiprocessing(self._thread_download_css, self.css) def collect_images(self): - self.images_path = os.path.join(self.BOOK_PATH, "OEBPS", "Images") - if os.path.isdir(self.images_path): - self.display.log("Images directory already exists: %s" % self.images_path) - - else: - os.makedirs(self.images_path) - self.display.images_ad_info.value = 1 - if self.display.book_ad_info == 2: self.display.info("Some of the book contents were already downloaded.\n" " If you want to be sure that all the images will be downloaded,\n" @@ -799,7 +820,6 @@ def collect_images(self): self._start_multiprocessing(self._thread_download_images, self.images) def create_content_opf(self): - self.cover = self.images[0] if len(self.images) else "" self.css = next(os.walk(self.css_path))[2] self.images = next(os.walk(self.images_path))[2] @@ -813,7 +833,6 @@ def create_content_opf(self): )) spine.append("".format(item_id)) - alt_cover_id = False for i in set(self.images): dot_split = i.split(".") head = "img_" + escape("".join(dot_split[:-1])) @@ -822,9 
+841,6 @@ def create_content_opf(self): head, i, "jpeg" if "jp" in extension else extension )) - if not alt_cover_id: - alt_cover_id = head - for i in range(len(self.css)): manifest.append("2}\" href=\"Styles/Style{0:0>2}.css\" " "media-type=\"text/css\" />".format(i)) @@ -845,7 +861,7 @@ def create_content_opf(self): ", ".join(escape(pub["name"]) for pub in self.book_info["publishers"]), escape(self.book_info["rights"]), self.book_info["issued"], - self.cover if self.cover else alt_cover_id, + self.cover, "\n".join(manifest), "\n".join(spine), self.book_chapters[0]["filename"].replace(".html", ".xhtml") From 7526702f28c74c032986ba907b844f2904af40b0 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 1 Mar 2018 11:47:53 +0100 Subject: [PATCH 033/100] Fixed #7 --- safaribooks.py | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index b646174..d1d83d1 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -299,7 +299,12 @@ def __init__(self, args): self.base_url = self.book_info["web_url"] self.clean_book_title = self.clean_dirname(self.book_title) - self.BOOK_PATH = os.path.join(PATH, "Books", self.clean_book_title) + + books_dir = os.path.join(PATH, "Books") + if not os.path.isdir(books_dir): + os.mkdir(books_dir) + + self.BOOK_PATH = os.path.join(books_dir, self.clean_book_title) self.css_path = "" self.images_path = "" self.create_dirs() From 62caeae89e184ec2505aa1bb51a8140bf000cfb4 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 1 Mar 2018 12:06:17 +0100 Subject: [PATCH 034/100] Added support for pipenv --- Pipfile | 17 +++++++++ Pipfile.lock | 102 +++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 119 insertions(+) create mode 100644 Pipfile create mode 100644 Pipfile.lock diff --git a/Pipfile b/Pipfile new file mode 100644 index 0000000..c4608ad --- /dev/null +++ b/Pipfile @@ -0,0 +1,17 @@ +[[source]] + +url = "https://pypi.python.org/simple" +verify_ssl = true +name = 
"pypi" + +[packages] + +lxml = "*" +requests = "*" + +[dev-packages] + + +[requires] + +python_version = "3.6" diff --git a/Pipfile.lock b/Pipfile.lock new file mode 100644 index 0000000..9d6b8e2 --- /dev/null +++ b/Pipfile.lock @@ -0,0 +1,102 @@ +{ + "_meta": { + "hash": { + "sha256": "bdff55965f0e3fa7b2a9b0e9df281f33860d2ae44891a2916d0ae6ce1d8c4a19" + }, + "host-environment-markers": { + "implementation_name": "cpython", + "implementation_version": "3.6.3", + "os_name": "posix", + "platform_machine": "x86_64", + "platform_python_implementation": "CPython", + "platform_release": "4.13.0-36-generic", + "platform_system": "Linux", + "platform_version": "#40-Ubuntu SMP Fri Feb 16 20:07:48 UTC 2018", + "python_full_version": "3.6.3", + "python_version": "3.6", + "sys_platform": "linux" + }, + "pipfile-spec": 6, + "requires": { + "python_version": "3.6" + }, + "sources": [ + { + "name": "pypi", + "url": "https://pypi.python.org/simple", + "verify_ssl": true + } + ] + }, + "default": { + "certifi": { + "hashes": [ + "sha256:14131608ad2fd56836d33a71ee60fa1c82bc9d2c8d98b7bdbc631fe1b3cd1296", + "sha256:edbc3f203427eef571f79a7692bb160a2b0f7ccaa31953e99bd17e307cf63f7d" + ], + "version": "==2018.1.18" + }, + "chardet": { + "hashes": [ + "sha256:fc323ffcaeaed0e0a02bf4d117757b98aed530d9ed4531e3e15460124c106691", + "sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae" + ], + "version": "==3.0.4" + }, + "idna": { + "hashes": [ + "sha256:8c7309c718f94b3a625cb648ace320157ad16ff131ae0af362c9f21b80ef6ec4", + "sha256:2c6a5de3089009e3da7c5dde64a141dbc8551d5b7f6cf4ed7c2568d0cc520a8f" + ], + "version": "==2.6" + }, + "lxml": { + "hashes": [ + "sha256:41f59cbdab232f11680d5d4dec9f2e6782fd24d78e37ee833447702e34e675f4", + "sha256:e7e41d383f19bab9d57f5f3b18d158655bcd682e7e723f441b9e183e1e35a6b5", + "sha256:155521c337acecf8202091cff85bb9f709f238130ebadf04280fb1db11f5ad8b", + "sha256:d2c985d2460b81c6ca5feb8b86f1bc594ad59405d0bdf68626b85852b701553c", + 
"sha256:950e63387514aa1b881eba5ac6cb2ec51a118b3dafe99dd80ca19d8fb0142f30", + "sha256:470d7ce41e8047208ba1a376560bad17f1468df1f3097bc83902b26cfafdbb0c", + "sha256:e608839a5ee2180164424ccf279c8e2d9bbe8816d002c58fd97d6b621ba4aa94", + "sha256:87a66bcadac270fc010cb029022a93fc722bf1204a8b03e782d4c790f0edf7ca", + "sha256:2dedfeeecc2d5a939cf622602f5a1ce443ca82407f386880f739f1a9f08053ad", + "sha256:ba05732e4bcf59e948f61588851dcf620fd60d5bbd9d704203e5f59bbaa60219", + "sha256:2190266059fec3c5a55f9d6c30532c64c6d414d3228909c0af573fe4907e78d1", + "sha256:dd291debfaa535d9cb6cee8d7aca2328775e037d02d13f1634e57f49bc302cc4", + "sha256:29a36e354c39b2e24bc4ee103de53417ebb80f976a6ab9e8d093d559e2ac03e1", + "sha256:e37427d5a27eefbcfc48847e0b37f348113fac7280bc857421db39ffc6372570", + "sha256:b106d4d2383382399ad82108fd187e92f40b1c90f55c2d36bbcb1c44bcf940fc", + "sha256:0ee07da52d240f1dc3c83eef5cd5f1b7f018226c1121f2a54d446645779a6d17", + "sha256:3b33549fb8f91b38a7500078242b03cca513f3412a2cdae722e89bf83f95971d", + "sha256:4c12e90886d9c53ab434c8d0cebea122321cce19614c3c6b6d1a7700d7cc6212", + "sha256:79322000279cda10b53c374d53ca632ead3bc51c6aebf8e62c8fa93a4d08b750", + "sha256:6cba398eb37e0631e60e0e080c101cfe91769b2c8267105b64b4625e2581ea21", + "sha256:49a655956f8de69e1258bc0fcfc43eb3bd1e038655784d77d1869b4b81444e37", + "sha256:af8a5373241d09b8fc53e0490e1719ce5dc90a21b19db89b6596c1adcdd52270", + "sha256:e6b6698415c7e8d227a47a3b1038e1b37c2b438a1b48c2db7ad9e74ddbcd1149", + "sha256:155c916cf2645b4a8f2bd5d09065e92d1b67b8d464bdc001e0b524af84bedf6f", + "sha256:fa7320679ced5e25b20203d157280680fc84eb783b6cc650cb0c98e1858b7dd3", + "sha256:4187c4b0cefc3353181db048c51f42c489d9ac51e40b86c4851dc0671372971d", + "sha256:d5d29663e979e83b3fc361e97200f959cddb3a14797391d15273d84a5a8ae44b", + "sha256:940caef1ec7c78e0c34b0f6b94fe42d0f2022915ffc78643d28538a5cfd0f40e" + ], + "version": "==4.1.1" + }, + "requests": { + "hashes": [ + "sha256:6a1b267aa90cac58ac3a765d067950e7dbbf75b1da07e895d1f594193a40a38b", + 
"sha256:9c443e7324ba5b85070c4a818ade28bfabedf16ea10206da1132edaa6dda237e" + ], + "version": "==2.18.4" + }, + "urllib3": { + "hashes": [ + "sha256:06330f386d6e4b195fbfc736b297f58c5a892e4440e54d294d7004e3a9bbea1b", + "sha256:cc44da8e1145637334317feebd728bd869a35285b93cbb4cca2577da7e62db4f" + ], + "version": "==1.22" + } + }, + "develop": {} +} From af3238968193cbb53442ba2b2b78f6a5285b5cf7 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 1 Mar 2018 12:12:32 +0100 Subject: [PATCH 035/100] Update README.md --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 0700f36..81e4f88 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,10 @@ Cloning into 'safaribooks'... $ cd safaribooks/ $ pip3 install -r requirements.txt + +OR + +$ pipenv install && pipenv shell ``` The program depends of only two **Python 3** modules: From d9689dbac3747a04cde8c5e0c32b37fccfe0f0d5 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 2 Mar 2018 09:43:19 +0100 Subject: [PATCH 036/100] Fixed book name duplicates --- safaribooks.py | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index d1d83d1..4c6a2f7 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -298,7 +298,9 @@ def __init__(self, args): self.book_title = self.book_info["title"] self.base_url = self.book_info["web_url"] - self.clean_book_title = self.clean_dirname(self.book_title) + self.clean_book_title = self.escape_dirname(self.book_title) + " ({0})".format( + self.escape_dirname(", ".join(a["name"] for a in self.book_info["authors"]), clean_space=True) + ) books_dir = os.path.join(PATH, "Books") if not os.path.isdir(books_dir): @@ -657,7 +659,7 @@ def parse_html(self, root, first_page=False): return page_css, xhtml @staticmethod - def clean_dirname(dirname): + def escape_dirname(dirname, clean_space=False): if ":" in dirname: if dirname.index(":") > 30: dirname = dirname.split(":")[0] @@ -666,11 +668,11 @@ def 
clean_dirname(dirname): if ch in dirname: dirname = dirname.replace(ch, "") - return dirname + return dirname if not clean_space else dirname.replace(" ", "") def create_dirs(self): if os.path.isdir(self.BOOK_PATH): - self.display.log("Book directory already exists: %s" % self.book_title) + self.display.log("Book directory already exists: %s" % self.BOOK_PATH) else: os.makedirs(self.BOOK_PATH) From ccd5c1398a7c92f643aac6f77b1fe809831b9933 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 2 Mar 2018 09:53:38 +0100 Subject: [PATCH 037/100] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 81e4f88..0045ed6 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # SafariBooks Download and generate *EPUB* of your favorite books from [*Safari Books Online*](https://www.safaribooksonline.com) library. -Use this program only for *personal* and/or *educational* purpose. +I'm not responsible for the use of this program; it is intended only for *personal* and *educational* purposes.
## Overview: * [Requirements & Setup](#requirements--setup) From be33390f82b8a99922729e79bf324b143c3e6df9 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 2 Mar 2018 14:38:56 +0100 Subject: [PATCH 038/100] Bug fixed --- safaribooks.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 4c6a2f7..2163223 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -660,9 +660,8 @@ def parse_html(self, root, first_page=False): @staticmethod def escape_dirname(dirname, clean_space=False): - if ":" in dirname: - if dirname.index(":") > 30: - dirname = dirname.split(":")[0] + if ":" in dirname and dirname.index(":") > 15: + dirname = dirname.split(":")[0] for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', '|']: if ch in dirname: From 27baf7314e2af488f8b9e62784bf072e47d3b1fe Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Sat, 3 Mar 2018 12:54:36 +0100 Subject: [PATCH 039/100] Fixed bug on `escape_dirname` func for Windows users. 
--- safaribooks.py | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 2163223..2a8790d 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -660,12 +660,16 @@ def parse_html(self, root, first_page=False): @staticmethod def escape_dirname(dirname, clean_space=False): - if ":" in dirname and dirname.index(":") > 15: - dirname = dirname.split(":")[0] + if ":" in dirname: + if dirname.index(":") > 15: + dirname = dirname.split(":")[0] - for ch in ['\\', '/', '<', '>', '`', '\'', '"', '*', '?', '|']: + elif "win" in sys.platform: + dirname = dirname.replace(":", ",") + + for ch in ['~', '#', '%', '&', '*', '{', '}', '\\', '<', '>', '?', '/', '`', '\'', '"', '|', '+']: if ch in dirname: - dirname = dirname.replace(ch, "") + dirname = dirname.replace(ch, "_") return dirname if not clean_space else dirname.replace(" ", "") From d96679c44de70550595ec7f7e41883bf7a58ce48 Mon Sep 17 00:00:00 2001 From: Max Romanovsky Date: Thu, 8 Mar 2018 19:54:28 +0100 Subject: [PATCH 040/100] fixes for missing isbn --- safaribooks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 2a8790d..1412df1 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -863,7 +863,7 @@ def create_content_opf(self): for sub in self.book_info["subjects"]) return self.CONTENT_OPF.format( - (self.book_info["isbn"] if len(self.book_info["isbn"]) else self.book_id), + (self.book_info["isbn"] if (isinstance(self.book_info["isbn"], str) and len(self.book_info["isbn"])) else self.book_id), escape(self.book_title), authors, escape(self.book_info["description"]), @@ -918,7 +918,7 @@ def create_toc(self): navmap, _, max_depth = self.parse_toc(response) return self.TOC_NCX.format( - (self.book_info["isbn"] if len(self.book_info["isbn"]) else self.book_id), + (self.book_info["isbn"] if (isinstance(self.book_info["isbn"], str) and len(self.book_info["isbn"])) else self.book_id), max_depth, 
self.book_title, ", ".join(aut["name"] for aut in self.book_info["authors"]), From 18cc21485d29fe124b7d8b241daf33f0cb7108b4 Mon Sep 17 00:00:00 2001 From: Max Romanovsky Date: Fri, 9 Mar 2018 13:00:48 +0100 Subject: [PATCH 041/100] Update safaribooks.py --- safaribooks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 1412df1..7b0e653 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -863,7 +863,7 @@ def create_content_opf(self): for sub in self.book_info["subjects"]) return self.CONTENT_OPF.format( - (self.book_info["isbn"] if (isinstance(self.book_info["isbn"], str) and len(self.book_info["isbn"])) else self.book_id), + (self.book_info["isbn"] if self.book_info["isbn"] else self.book_id), escape(self.book_title), authors, escape(self.book_info["description"]), @@ -918,7 +918,7 @@ def create_toc(self): navmap, _, max_depth = self.parse_toc(response) return self.TOC_NCX.format( - (self.book_info["isbn"] if (isinstance(self.book_info["isbn"], str) and len(self.book_info["isbn"])) else self.book_id), + (self.book_info["isbn"] if self.book_info["isbn"] else self.book_id), max_depth, self.book_title, ", ".join(aut["name"] for aut in self.book_info["authors"]), From 5e1eca240a89ff5aeafff27db7aa8b604bce2e26 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 12 Apr 2018 20:31:17 +0200 Subject: [PATCH 042/100] Fixed #10 Fixed #11 General improvement --- safaribooks.py | 28 ++++++++++++++++------------ 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 2a8790d..2bde385 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -53,6 +53,10 @@ def __init__(self, log_file): self.state_status = Value("i", 0) sys.excepthook = self.unhandled_exception + def unregister(self): + self.logger.handlers[0].close() + sys.excepthook = sys.__excepthook__ + def log(self, message): self.logger.info(str(message)) @@ -97,10 +101,10 @@ def save_last_request(self): def 
intro(self): output = self.SH_YELLOW + """ - ____ ___ _ + ____ ___ _ / __/__ _/ _/__ _____(_) - _\ \/ _ `/ _/ _ `/ __/ / - /___/\_,_/_/ \_,_/_/ /_/ + _\ \/ _ `/ _/ _ `/ __/ / + /___/\_,_/_/ \_,_/_/ /_/ / _ )___ ___ / /__ ___ / _ / _ \/ _ \/ '_/(_-< /____/\___/\___/_/\_\/___/ @@ -298,8 +302,8 @@ def __init__(self, args): self.book_title = self.book_info["title"] self.base_url = self.book_info["web_url"] - self.clean_book_title = self.escape_dirname(self.book_title) + " ({0})".format( - self.escape_dirname(", ".join(a["name"] for a in self.book_info["authors"]), clean_space=True) + self.clean_book_title = "".join(self.escape_dirname(self.book_title).split(",")[:2]) + " ({0})".format( + self.escape_dirname(", ".join(a["name"] for a in self.book_info["authors"][:2]), clean_space=True) ) books_dir = os.path.join(PATH, "Books") @@ -349,7 +353,8 @@ def __init__(self, args): if not args.no_cookies: json.dump(self.cookies, open(COOKIES_FILE, "w")) - self.display.done(self.clean_book_title + ".epub") + self.display.done(os.path.join(self.BOOK_PATH, self.book_id + ".epub")) + self.display.unregister() if not self.display.in_error and not args.log: os.remove(self.display.log_file) @@ -551,17 +556,18 @@ def link_replace(self, link): @staticmethod def get_cover(html_root): - images = html_root.xpath("//img[contains(@id, 'cover') or " + images = html_root.xpath("//img[contains(@id, 'cover') or contains(@class, 'cover') or" "contains(@name, 'cover') or contains(@src, 'cover')]") if len(images): return images[0] - divs = html_root.xpath("//div[contains(@id, 'cover') or " + divs = html_root.xpath("//div[contains(@id, 'cover') or contains(@class, 'cover') or" "contains(@name, 'cover') or contains(@src, 'cover')]//img") if len(divs): return divs[0] - a = html_root.xpath("//a[contains(@id, 'cover') or contains(@name, 'cover') or contains(@src, 'cover')]//img") + a = html_root.xpath("//a[contains(@id, 'cover') or contains(@class, 'cover') or" + "contains(@name, 'cover') or 
contains(@src, 'cover')]//img") if len(a): return a[0] @@ -945,13 +951,11 @@ def create_epub(self): ) zip_file = os.path.join(PATH, "Books", self.book_id) - if os.path.isfile(zip_file + ".epub"): - os.remove(zip_file + ".epub") if os.path.isfile(zip_file + ".zip"): os.remove(zip_file + ".zip") shutil.make_archive(zip_file, 'zip', self.BOOK_PATH) - os.rename(zip_file + ".zip", os.path.join(self.BOOK_PATH, self.clean_book_title) + ".epub") + os.rename(zip_file + ".zip", os.path.join(self.BOOK_PATH, self.book_id) + ".epub") # MAIN From a4ee173dc766efb7bbb07e7974c9962ca2af068d Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 25 Apr 2018 12:54:21 +0200 Subject: [PATCH 043/100] Fixed #12 --- safaribooks.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index ac07e7a..6748c40 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -302,9 +302,8 @@ def __init__(self, args): self.book_title = self.book_info["title"] self.base_url = self.book_info["web_url"] - self.clean_book_title = "".join(self.escape_dirname(self.book_title).split(",")[:2]) + " ({0})".format( - self.escape_dirname(", ".join(a["name"] for a in self.book_info["authors"][:2]), clean_space=True) - ) + self.clean_book_title = "".join(self.escape_dirname(self.book_title).split(",")[:2]) \ + + " ({0})".format(self.book_id) books_dir = os.path.join(PATH, "Books") if not os.path.isdir(books_dir): From a1c9bb94b6f668f522137857fef21704c9f5de43 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 16 May 2018 23:07:45 +0200 Subject: [PATCH 044/100] Cover image improvement --- safaribooks.py | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 6748c40..04e8a3e 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -8,7 +8,7 @@ import argparse import requests import traceback -from lxml import html +from lxml import html, etree from html import escape from random import random from 
multiprocessing import Process, Queue, Value @@ -169,7 +169,7 @@ def api_error(response): return message -class WinQueue(list): # TODO: error while use Process in Windows: can't pickle _thread.RLock objects +class WinQueue(list): # TODO: error while use `process` in Windows: can't pickle _thread.RLock objects def put(self, el): self.append(el) @@ -555,18 +555,22 @@ def link_replace(self, link): @staticmethod def get_cover(html_root): - images = html_root.xpath("//img[contains(@id, 'cover') or contains(@class, 'cover') or" - "contains(@name, 'cover') or contains(@src, 'cover')]") + lowercase_ns = etree.FunctionNamespace(None) + lowercase_ns["lower-case"] = lambda _, n: n[0].lower() if n and len(n) else "" + + images = html_root.xpath("//img[contains(lower-case(@id), 'cover') or contains(lower-case(@class), 'cover') or" + "contains(lower-case(@name), 'cover') or contains(lower-case(@src), 'cover') or" + "contains(lower-case(@alt), 'cover')]") if len(images): return images[0] - divs = html_root.xpath("//div[contains(@id, 'cover') or contains(@class, 'cover') or" - "contains(@name, 'cover') or contains(@src, 'cover')]//img") + divs = html_root.xpath("//div[contains(lower-case(@id), 'cover') or contains(lower-case(@class), 'cover') or" + "contains(lower-case(@name), 'cover') or contains(lower-case(@src), 'cover')]//img") if len(divs): return divs[0] - a = html_root.xpath("//a[contains(@id, 'cover') or contains(@class, 'cover') or" - "contains(@name, 'cover') or contains(@src, 'cover')]//img") + a = html_root.xpath("//a[contains(lower-case(@id), 'cover') or contains(lower-case(@class), 'cover') or" + "contains(lower-case(@name), 'cover') or contains(lower-case(@src), 'cover')]//img") if len(a): return a[0] From f69ba7eedd0b87f475b041ed3c474680bab4a324 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 21 May 2018 21:34:45 +0200 Subject: [PATCH 045/100] Update README.md --- README.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git 
a/README.md b/README.md index 0045ed6..09ab01b 100644 --- a/README.md +++ b/README.md @@ -70,10 +70,12 @@ optional arguments: ``` The first time you'll use the program, you'll have to specify your Safari Books Online account credentials. -For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (see file format [`here`](/../../issues/2#issuecomment-367726544)). +For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (for **SSO** look file format [`here`](/../../issues/2#issuecomment-367726544)). Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. -If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. +If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. + +You can configure proxies by setting on your system the environment variables `HTTP_PROXY` and `HTTPS_PROXY`. The program default options are thought for ensure best compatibilities for who want to export the `EPUB` to E-Readers like Amazon Kindle. If you want to do it, I suggest you to convert the `EPUB` to `AZW3` with [Calibre](https://calibre-ebook.com/). 
You can also convert the book to `MOBI` and if you'll do it with Calibre be sure to select `Ignore margins` in the conversion options: From aaa8176a7661774114a2f6a3e6e585bca3b5df62 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 21 May 2018 21:42:08 +0200 Subject: [PATCH 046/100] Added credential parser --- safaribooks.py | 107 ++++++++++++++++++++++++++++--------------------- 1 file changed, 61 insertions(+), 46 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 04e8a3e..f155278 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -280,7 +280,7 @@ def __init__(self, args): else: self.display.info("Logging into Safari Books Online...", state=True) - self.do_login(*[c.replace("'", "").replace('"', "") for c in args.cred]) + self.do_login(*args.cred) if not args.no_cookies: json.dump(self.cookies, open(COOKIES_FILE, "w")) @@ -402,6 +402,20 @@ def requests_provider(self, url, post=False, data=None, update_cookies=True, **k return response + @staticmethod + def parse_cred(cred): + if ":" not in cred: + return False + + sep = cred.index(":") + new_cred = ["", ""] + new_cred[0] = cred[:sep].strip("'").strip('"') + if "@" not in new_cred[0]: + return False + + new_cred[1] = cred[sep + 1:] + return new_cred + def do_login(self, email, password): response = self.requests_provider(self.BASE_URL) if response == 0: @@ -962,48 +976,49 @@ def create_epub(self): # MAIN -arguments = argparse.ArgumentParser(prog="safaribooks.py", - description="Download and generate an EPUB of your favorite books" - " from Safari Books Online.", - add_help=False, - allow_abbrev=False) - -arguments.add_argument( - "--cred", metavar="", default=False, - help="Credentials used to perform the auth login on Safari Books Online." - " Es. ` --cred \"account_mail@mail.com:password01\" `." -) -arguments.add_argument( - "--no-cookies", dest="no_cookies", action='store_true', - help="Prevent your session data to be saved into `cookies.json` file." 
-) -arguments.add_argument( - "--no-kindle", dest="no_kindle", action='store_true', - help="Remove some CSS rules that block overflow on `table` and `pre` elements." - " Use this option if you're not going to export the EPUB to E-Readers like Amazon Kindle." -) -arguments.add_argument( - "--preserve-log", dest="log", action='store_true', help="Leave the `info_XXXXXXXXXXXXX.log`" - " file even if there isn't any error." -) -arguments.add_argument("--help", action="help", default=argparse.SUPPRESS, help='Show this help message.') -arguments.add_argument( - "bookid", metavar='', - help="Book digits ID that you want to download. You can find it in the URL (X-es):" - " `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`" -) - -args_parsed = arguments.parse_args() - -if args_parsed.cred: - cred = args_parsed.cred.split(":") - if len(cred) != 2 or "@" not in cred[0]: - arguments.error("invalid credential: %s" % args_parsed.cred) - - args_parsed.cred = cred - -else: - if args_parsed.no_cookies: - arguments.error("invalid option: `--no-cookies` is valid only if you use the `--cred` option") - -SafariBooks(args_parsed) +if __name__ == "__main__": + arguments = argparse.ArgumentParser(prog="safaribooks.py", + description="Download and generate an EPUB of your favorite books" + " from Safari Books Online.", + add_help=False, + allow_abbrev=False) + + arguments.add_argument( + "--cred", metavar="", default=False, + help="Credentials used to perform the auth login on Safari Books Online." + " Es. ` --cred \"account_mail@mail.com:password01\" `." + ) + arguments.add_argument( + "--no-cookies", dest="no_cookies", action='store_true', + help="Prevent your session data to be saved into `cookies.json` file." + ) + arguments.add_argument( + "--no-kindle", dest="no_kindle", action='store_true', + help="Remove some CSS rules that block overflow on `table` and `pre` elements." 
+ " Use this option if you're not going to export the EPUB to E-Readers like Amazon Kindle." + ) + arguments.add_argument( + "--preserve-log", dest="log", action='store_true', help="Leave the `info_XXXXXXXXXXXXX.log`" + " file even if there isn't any error." + ) + arguments.add_argument("--help", action="help", default=argparse.SUPPRESS, help='Show this help message.') + arguments.add_argument( + "bookid", metavar='', + help="Book digits ID that you want to download. You can find it in the URL (X-es):" + " `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`" + ) + + args_parsed = arguments.parse_args() + + if args_parsed.cred: + cred = SafariBooks.parse_cred(args_parsed.cred) + if not cred: + arguments.error("invalid credential: %s" % args_parsed.cred) + + args_parsed.cred = cred + + else: + if args_parsed.no_cookies: + arguments.error("invalid option: `--no-cookies` is valid only if you use the `--cred` option") + + SafariBooks(args_parsed) From b76e768c389796de1842fce525ede34a1b089f0a Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 21 May 2018 22:36:05 +0200 Subject: [PATCH 047/100] Update README.md --- README.md | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 09ab01b..d9cbe86 100644 --- a/README.md +++ b/README.md @@ -69,11 +69,11 @@ optional arguments: --help Show this help message. ``` -The first time you'll use the program, you'll have to specify your Safari Books Online account credentials. -For the next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (for **SSO** look file format [`here`](/../../issues/2#issuecomment-367726544)). +The first time you use the program, you'll have to specify your Safari Books Online account credentials (look [`here`](/../../issues/15) for special character). 
+The next times you'll download a book, before session expires, you can omit the credential, because the program save your session cookies in a file called `cookies.json` (for **SSO** look the file format [`here`](/../../issues/2#issuecomment-367726544)). Pay attention if you use a shared PC, because everyone that has access to your files can steal your session. -If you don't want to cache the cookies, just use the `--no-cookies` option and provide all the time your `--cred`. +If you don't want to cache the cookies, just use the `--no-cookies` option and provide all time your `--cred`. You can configure proxies by setting on your system the environment variables `HTTP_PROXY` and `HTTPS_PROXY`. @@ -106,7 +106,13 @@ In the other hand, if you're not going to export the `EPUB`, you can use the `-- [-] ISBN: 9781491958704 [-] Publishers: O'Reilly Media, Inc. [-] Rights: Copyright © O'Reilly Media, Inc. - [-] Description: By taking you through the development of a real web application from beginning to end, the second edition of this hands-on guide demonstrates the practical advantages of test-driven development (TDD) with Python. You’ll learn how to write and run tests before building each part of your app, and then develop the minimum amount of code required to pass those tests. The result? Clean code that works.In the process, you’ll learn the basics of Django, Selenium, Git, jQuery, and Mock, along with curre... + [-] Description: By taking you through the development of a real web application + from beginning to end, the second edition of this hands-on guide demonstrates the + practical advantages of test-driven development (TDD) with Python. You’ll learn + how to write and run tests before building each part of your app, and then develop + the minimum amount of code required to pass those tests. The result? Clean code + that works.In the process, you’ll learn the basics of Django, Selenium, Git, + jQuery, and Mock, along with curre... 
[-] Release Date: 2017-08-18 [-] URL: https://www.safaribooksonline.com/library/view/test-driven-development-with/9781491958698/ [*] Retrieving book chapters... From 082c7a929bc21625a144a3bb7e48fc821564b991 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 11 Jul 2018 15:36:18 +0200 Subject: [PATCH 048/100] Fixed #21 --- safaribooks.py | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index f155278..3e6ce31 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -181,7 +181,7 @@ class SafariBooks: HEADERS = { "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", - "accept-encoding": "gzip, deflate, br", + "accept-encoding": "gzip, deflate", "accept-language": "it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7", "cache-control": "no-cache", "cookie": "", @@ -780,8 +780,7 @@ def _thread_download_css(self, url): self.display.error("Error trying to retrieve this CSS: %s\n From: %s" % (css_file, url)) with open(css_file, 'wb') as s: - for chunk in response.iter_content(1024): - s.write(chunk) + s.write(response.content) self.css_done_queue.put(1) self.display.state(len(self.css), self.css_done_queue.qsize()) From b276f67da171cd59e5de1495b5ee3ea4b4cf30b5 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 3 Aug 2018 09:57:53 +0200 Subject: [PATCH 049/100] Bug fixes --- safaribooks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 3e6ce31..7edec1f 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -88,7 +88,7 @@ def exit(self, error): self.out(output) self.save_last_request() - sys.exit(128) + sys.exit(1) def unhandled_exception(self, _, o, tb): self.log("".join(traceback.format_tb(tb))) @@ -523,7 +523,7 @@ def get_default_cover(self): def get_html(self, url): response = self.requests_provider(url) - if response == 0: + if response == 0 or response.status_code != 200: self.display.exit( "Crawler: error trying to 
retrieve this page: %s (%s)\n From: %s" % (self.filename, self.chapter_title, url) From 34d2cad294dd8b520bf20b8c386dd0970b7ebe1e Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 19 Oct 2018 10:01:13 +0200 Subject: [PATCH 050/100] Fixed #37 --- safaribooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index 7edec1f..0067c8d 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -58,7 +58,7 @@ def unregister(self): sys.excepthook = sys.__excepthook__ def log(self, message): - self.logger.info(str(message)) + self.logger.info(str(message).encode("utf-8", "replace")) def out(self, put): sys.stdout.write("\r" + " " * self.columns + "\r" + put + "\n") From cee0a0c4dfea608a315d04bae9bccc87f1f9e53d Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 29 Oct 2018 20:36:47 +0100 Subject: [PATCH 051/100] Fix vulnerable 'requests' library --- requirements.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 6964772..2666c38 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,2 +1,3 @@ lxml>=4.1.1 -requests>=2.18.4 +requests>=2.20.0 + From 5873ec80b2feedffb5f0e11c93779cf1102c4004 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Mon, 29 Oct 2018 20:41:38 +0100 Subject: [PATCH 052/100] Fix vulnerable library --- Pipfile.lock | 92 +++++++++++++++++++++++++++------------------------- 1 file changed, 47 insertions(+), 45 deletions(-) diff --git a/Pipfile.lock b/Pipfile.lock index 9d6b8e2..ee72369 100644 --- a/Pipfile.lock +++ b/Pipfile.lock @@ -5,14 +5,14 @@ }, "host-environment-markers": { "implementation_name": "cpython", - "implementation_version": "3.6.3", + "implementation_version": "3.6.5", "os_name": "posix", "platform_machine": "x86_64", "platform_python_implementation": "CPython", - "platform_release": "4.13.0-36-generic", + "platform_release": "4.15.0-36-generic", "platform_system": "Linux", - "platform_version": "#40-Ubuntu SMP 
Fri Feb 16 20:07:48 UTC 2018", - "python_full_version": "3.6.3", + "platform_version": "#39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018", + "python_full_version": "3.6.5", "python_version": "3.6", "sys_platform": "linux" }, @@ -31,10 +31,10 @@ "default": { "certifi": { "hashes": [ - "sha256:14131608ad2fd56836d33a71ee60fa1c82bc9d2c8d98b7bdbc631fe1b3cd1296", - "sha256:edbc3f203427eef571f79a7692bb160a2b0f7ccaa31953e99bd17e307cf63f7d" + "sha256:339dc09518b07e2fa7eda5450740925974815557727d6bd35d319c1524a04a4c", + "sha256:6d58c986d22b038c8c0df30d639f23a3e6d172a05c3583e766f4c0b785c0986a" ], - "version": "==2018.1.18" + "version": "==2018.10.15" }, "chardet": { "hashes": [ @@ -45,57 +45,59 @@ }, "idna": { "hashes": [ - "sha256:8c7309c718f94b3a625cb648ace320157ad16ff131ae0af362c9f21b80ef6ec4", - "sha256:2c6a5de3089009e3da7c5dde64a141dbc8551d5b7f6cf4ed7c2568d0cc520a8f" + "sha256:156a6814fb5ac1fc6850fb002e0852d56c0c8d2531923a51032d1b70760e186e", + "sha256:684a38a6f903c1d71d6d5fac066b58d7768af4de2b832e426ec79c30daa94a16" ], - "version": "==2.6" + "version": "==2.7" }, "lxml": { "hashes": [ - "sha256:41f59cbdab232f11680d5d4dec9f2e6782fd24d78e37ee833447702e34e675f4", - "sha256:e7e41d383f19bab9d57f5f3b18d158655bcd682e7e723f441b9e183e1e35a6b5", - "sha256:155521c337acecf8202091cff85bb9f709f238130ebadf04280fb1db11f5ad8b", - "sha256:d2c985d2460b81c6ca5feb8b86f1bc594ad59405d0bdf68626b85852b701553c", - "sha256:950e63387514aa1b881eba5ac6cb2ec51a118b3dafe99dd80ca19d8fb0142f30", - "sha256:470d7ce41e8047208ba1a376560bad17f1468df1f3097bc83902b26cfafdbb0c", - "sha256:e608839a5ee2180164424ccf279c8e2d9bbe8816d002c58fd97d6b621ba4aa94", - "sha256:87a66bcadac270fc010cb029022a93fc722bf1204a8b03e782d4c790f0edf7ca", - "sha256:2dedfeeecc2d5a939cf622602f5a1ce443ca82407f386880f739f1a9f08053ad", - "sha256:ba05732e4bcf59e948f61588851dcf620fd60d5bbd9d704203e5f59bbaa60219", - "sha256:2190266059fec3c5a55f9d6c30532c64c6d414d3228909c0af573fe4907e78d1", - 
"sha256:dd291debfaa535d9cb6cee8d7aca2328775e037d02d13f1634e57f49bc302cc4", - "sha256:29a36e354c39b2e24bc4ee103de53417ebb80f976a6ab9e8d093d559e2ac03e1", - "sha256:e37427d5a27eefbcfc48847e0b37f348113fac7280bc857421db39ffc6372570", - "sha256:b106d4d2383382399ad82108fd187e92f40b1c90f55c2d36bbcb1c44bcf940fc", - "sha256:0ee07da52d240f1dc3c83eef5cd5f1b7f018226c1121f2a54d446645779a6d17", - "sha256:3b33549fb8f91b38a7500078242b03cca513f3412a2cdae722e89bf83f95971d", - "sha256:4c12e90886d9c53ab434c8d0cebea122321cce19614c3c6b6d1a7700d7cc6212", - "sha256:79322000279cda10b53c374d53ca632ead3bc51c6aebf8e62c8fa93a4d08b750", - "sha256:6cba398eb37e0631e60e0e080c101cfe91769b2c8267105b64b4625e2581ea21", - "sha256:49a655956f8de69e1258bc0fcfc43eb3bd1e038655784d77d1869b4b81444e37", - "sha256:af8a5373241d09b8fc53e0490e1719ce5dc90a21b19db89b6596c1adcdd52270", - "sha256:e6b6698415c7e8d227a47a3b1038e1b37c2b438a1b48c2db7ad9e74ddbcd1149", - "sha256:155c916cf2645b4a8f2bd5d09065e92d1b67b8d464bdc001e0b524af84bedf6f", - "sha256:fa7320679ced5e25b20203d157280680fc84eb783b6cc650cb0c98e1858b7dd3", - "sha256:4187c4b0cefc3353181db048c51f42c489d9ac51e40b86c4851dc0671372971d", - "sha256:d5d29663e979e83b3fc361e97200f959cddb3a14797391d15273d84a5a8ae44b", - "sha256:940caef1ec7c78e0c34b0f6b94fe42d0f2022915ffc78643d28538a5cfd0f40e" + "sha256:fa39ea60d527fbdd94215b5e5552f1c6a912624521093f1384a491a8ad89ad8b", + "sha256:ae07fa0c115733fce1e9da96a3ac3fa24801742ca17e917e0c79d63a01eeb843", + "sha256:caf0e50b546bb60dfa99bb18dfa6748458a83131ecdceaf5c071d74907e7e78a", + "sha256:abf181934ac3ef193832fb973fd7f6149b5c531903c2ec0f1220941d73eee601", + "sha256:62939a8bb6758d1bf923aa1c13f0bcfa9bf5b2fc0f5fa917a6e25db5fe0cfa4e", + "sha256:4815892904c336bbaf73dafd54f45f69f4021c22b5bad7332176bbf4fb830568", + "sha256:81992565b74332c7c1aff6a913a3e906771aa81c9d0c68c68113cffcae45bc53", + "sha256:02bc220d61f46e9b9d5a53c361ef95e9f5e1d27171cd461dddb17677ae2289a5", + "sha256:bccb267678b870d9782c3b44d0cefe3ba0e329f9af8c946d32bf3778e7a4f271", 
+ "sha256:2f31145c7ff665b330919bfa44aacd3a0211a76ca7e7b441039d2a0b0451e415", + "sha256:aab09fbe8abfa3b9ce62aaf45aca2d28726b1b9ee44871dbe644050a2fff4940", + "sha256:b9c78242219f674ab645ec571c9a95d70f381319a23911941cd2358a8e0521cf", + "sha256:a623965c086a6e91bb703d4da62dabe59fe88888e82c4117d544e11fd74835d6", + "sha256:9d862e3cf4fc1f2837dedce9c42269c8c76d027e49820a548ac89fdcee1e361f", + "sha256:5be031b0f15ad63910d8e5038b489d95a79929513b3634ad4babf77100602588", + "sha256:75830c06a62fe7b8fe3bbb5f269f0b308f19f3949ac81cfd40062f47c1455faf", + "sha256:a7783ab7f6a508b0510490cef9f857b763d796ba7476d9703f89722928d1e113", + "sha256:e16e07a0ec3a75b5ee61f2b1003c35696738f937dc8148fbda9fe2147ccb6e61", + "sha256:438a1b0203545521f6616132bfe0f4bca86f8a401364008b30e2b26ec408ce85", + "sha256:8c892fb0ee52c594d9a7751c7d7356056a9682674b92cc1c4dc968ff0f30c52f", + "sha256:c4df4d27f4c93b2cef74579f00b1d3a31a929c7d8023f870c4b476f03a274db4", + "sha256:22f253b542a342755f6cfc047fe4d3a296515cf9b542bc6e261af45a80b8caf6", + "sha256:e175a006725c7faadbe69e791877d09936c0ef2cf49d01b60a6c1efcb0e8be6f", + "sha256:edd9c13a97f6550f9da2236126bb51c092b3b1ce6187f2bd966533ad794bbb5e", + "sha256:dbbd5cf7690a40a9f0a9325ab480d0fccf46d16b378eefc08e195d84299bfae1", + "sha256:db0d213987bcd4e6d41710fb4532b22315b0d8fb439ff901782234456556aed1", + "sha256:60842230678674cdac4a1cf0f707ef12d75b9a4fc4a565add4f710b5fcf185d5", + "sha256:5c93ae37c3c588e829b037fdfbd64a6e40c901d3f93f7beed6d724c44829a3ad", + "sha256:d3266bd3ac59ac4edcd5fa75165dee80b94a3e5c91049df5f7c057ccf097551c", + "sha256:36720698c29e7a9626a0dc802ef8885f8f0239bfd1689628ecd459a061f2807f" ], - "version": "==4.1.1" + "version": "==4.2.5" }, "requests": { "hashes": [ - "sha256:6a1b267aa90cac58ac3a765d067950e7dbbf75b1da07e895d1f594193a40a38b", - "sha256:9c443e7324ba5b85070c4a818ade28bfabedf16ea10206da1132edaa6dda237e" + "sha256:a84b8c9ab6239b578f22d1c21d51b696dcfe004032bb80ea832398d6909d7279", + "sha256:99dcfdaaeb17caf6e526f32b6a7b780461512ab3f1d992187801694cba42770c" 
], - "version": "==2.18.4" + "version": "==2.20.0" }, "urllib3": { "hashes": [ - "sha256:06330f386d6e4b195fbfc736b297f58c5a892e4440e54d294d7004e3a9bbea1b", - "sha256:cc44da8e1145637334317feebd728bd869a35285b93cbb4cca2577da7e62db4f" + "sha256:8819bba37a02d143296a4d032373c4dd4aca11f6d4c9973335ca75f9c8475f59", + "sha256:41c3db2fc01e5b907288010dec72f9d0a74e37d6994e6eb56849f59fea2265ae" ], - "version": "==1.22" + "version": "==1.24" } }, "develop": {} From a90362fd287f88d7fe316d0aa7206516fa4649c2 Mon Sep 17 00:00:00 2001 From: Krzysztof Barczynski Date: Mon, 3 Dec 2018 20:40:42 -0600 Subject: [PATCH 053/100] Fixed login url --- safaribooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index 0067c8d..46720bc 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -417,7 +417,7 @@ def parse_cred(cred): return new_cred def do_login(self, email, password): - response = self.requests_provider(self.BASE_URL) + response = self.requests_provider("https://www.safaribooksonline.com/accounts/login") if response == 0: self.display.exit("Login: unable to reach Safari Books Online. Try again...") From 65b9e4b140003eaf24f72301200f918e2a6a5b0c Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Tue, 4 Dec 2018 11:36:50 +0100 Subject: [PATCH 054/100] Update safaribooks.py Reuse global variable --- safaribooks.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/safaribooks.py b/safaribooks.py index 46720bc..d6b6df9 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -417,7 +417,7 @@ def parse_cred(cred): return new_cred def do_login(self, email, password): - response = self.requests_provider("https://www.safaribooksonline.com/accounts/login") + response = self.requests_provider(self.LOGIN_URL) if response == 0: self.display.exit("Login: unable to reach Safari Books Online. 
Try again...") From 6d8f222279eb9151e73f627fb47ce273d1c1f596 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 3 Jan 2019 14:30:14 +0100 Subject: [PATCH 055/100] Fixed #51 Fixed #56 Fixed #58 Fixed #59 Fixed #61 --- safaribooks.py | 63 ++++++++++++++++++++++++++------------------------ 1 file changed, 33 insertions(+), 30 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index d6b6df9..1f96b94 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -18,6 +18,10 @@ PATH = os.path.dirname(os.path.realpath(__file__)) COOKIES_FILE = os.path.join(PATH, "cookies.json") +SAFARI_BASE_HOST = "learning.oreilly.com" +SAFARI_BASE_URL = "https://" + SAFARI_BASE_HOST + + class Display: BASE_FORMAT = logging.Formatter( @@ -145,11 +149,11 @@ def state(self, origin, done): ) def done(self, epub_file): - self.info("Done: %s\n\n" + self.info("Done: %s\n\n" % epub_file + " If you like it, please * this project on GitHub to make it known:\n" " https://github.com/lorenzodifuccia/safaribooks\n" " e don't forget to renew your Safari Books Online subscription:\n" - " https://www.safaribooksonline.com/signup/\n\n" % epub_file + + " " + SAFARI_BASE_URL + "\n\n" + self.SH_BG_RED + "[!]" + self.SH_DEFAULT + " Bye!!") @staticmethod @@ -158,7 +162,7 @@ def api_error(response): if "detail" in response and "Not found" in response["detail"]: message += "book's not present in Safari Books Online.\n" \ " The book identifier is the digits that you can find in the URL:\n" \ - " `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`" + " `" + SAFARI_BASE_URL + "/library/view/book-name/XXXXXXXXXXXXX/`" else: os.remove(COOKIES_FILE) @@ -179,6 +183,9 @@ def qsize(self): class SafariBooks: + LOGIN_URL = SAFARI_BASE_URL + "/accounts/login/" + API_TEMPLATE = SAFARI_BASE_URL + "/api/v1/book/{0}/" + HEADERS = { "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", "accept-encoding": "gzip, deflate", @@ -186,16 +193,13 @@ class 
SafariBooks: "cache-control": "no-cache", "cookie": "", "pragma": "no-cache", - "referer": "https://www.safaribooksonline.com/home/", + "origin": SAFARI_BASE_URL, + "referer": LOGIN_URL, "upgrade-insecure-requests": "1", - "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " - "Chrome/62.0.3202.94 Safari/537.36" + "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) " + "Chrome/60.0.3112.113 Safari/537.36" } - BASE_URL = "https://www.safaribooksonline.com" - LOGIN_URL = BASE_URL + "/accounts/login/" - API_TEMPLATE = BASE_URL + "/api/v1/book/{0}/" - BASE_01_HTML = "\n" \ "" # Format: ID, Depth, Title, Author, NAVMAP - TOC_NCX = "" \ + TOC_NCX = "\n" \ "" \ - "" \ - "" \ - "" \ - "" \ - "" \ - "" \ - "" \ - "{2}" \ - "{3}" \ - "{4}" \ + " \"http://www.daisy.org/z3986/2005/ncx-2005-1.dtd\">\n" \ + "\n" \ + "\n" \ + "\n" \ + "\n" \ + "\n" \ + "\n" \ + "\n" \ + "{2}\n" \ + "{3}\n" \ + "{4}\n" \ "" def __init__(self, args): @@ -364,7 +368,7 @@ def return_cookies(self): return " ".join(["{0}={1};".format(k, v) for k, v in self.cookies.items()]) def return_headers(self, url): - if "safaribooksonline" in urlsplit(url).netloc: + if SAFARI_BASE_HOST in urlsplit(url).netloc: self.HEADERS["cookie"] = self.return_cookies() else: @@ -441,10 +445,9 @@ def do_login(self, email, password): self.LOGIN_URL, post=True, data=( - ("csrfmiddlewaretoken", ""), ("csrfmiddlewaretoken", csrf), + ("csrfmiddlewaretoken", csrf), ("email", email), ("password1", password), - ("is_login_form", "true"), ("leaveblank", ""), - ("dontchange", "http://") + ("login", "Sign In"), ("next", "") ), allow_redirects=False ) @@ -531,7 +534,7 @@ def get_html(self, url): root = None try: - root = html.fromstring(response.text, base_url=self.BASE_URL) + root = html.fromstring(response.text, base_url=SAFARI_BASE_URL) except (html.etree.ParseError, html.etree.ParserError) as parsing_error: self.display.error(parsing_error) @@ -591,7 +594,7 
@@ def get_cover(html_root): return None def parse_html(self, root, first_page=False): - if random() > 0.5: + if random() > 0.8: if len(root.xpath("//div[@class='controls']/a/text()")): self.display.exit(self.display.api_error(" ")) @@ -798,7 +801,7 @@ def _thread_download_images(self, url): self.display.images_ad_info.value = 1 else: - response = self.requests_provider(urljoin(self.BASE_URL, url), + response = self.requests_provider(urljoin(SAFARI_BASE_URL, url), update_cookies=False, stream=True) if response == 0: @@ -1004,7 +1007,7 @@ def create_epub(self): arguments.add_argument( "bookid", metavar='', help="Book digits ID that you want to download. You can find it in the URL (X-es):" - " `https://www.safaribooksonline.com/library/view/book-name/XXXXXXXXXXXXX/`" + " `" + SAFARI_BASE_URL + "/library/view/book-name/XXXXXXXXXXXXX/`" ) args_parsed = arguments.parse_args() From acee92ddbe636c03e3750beb1cc118f2a56eea0c Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Thu, 3 Jan 2019 14:48:09 +0100 Subject: [PATCH 056/100] PEP8 and working for #57 --- safaribooks.py | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index 1f96b94..c066ffd 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -22,7 +22,6 @@ SAFARI_BASE_URL = "https://" + SAFARI_BASE_HOST - class Display: BASE_FORMAT = logging.Formatter( fmt="[%(asctime)s] %(message)s", @@ -229,7 +228,7 @@ class SafariBooks: "" # Format: ID, Title, Authors, Description, Subjects, Publisher, Rights, Date, CoverId, MANIFEST, SPINE, CoverUrl - CONTENT_OPF = "\n" \ + CONTENT_OPF = "\n" \ "\n" \ "\n"\ @@ -253,7 +252,7 @@ class SafariBooks: "" # Format: ID, Depth, Title, Author, NAVMAP - TOC_NCX = "\n" \ + TOC_NCX = "\n" \ "\n" \ "\n" \ @@ -1013,11 +1012,11 @@ def create_epub(self): args_parsed = arguments.parse_args() if args_parsed.cred: - cred = SafariBooks.parse_cred(args_parsed.cred) - if not cred: + parsed_cred = SafariBooks.parse_cred(args_parsed.cred) + if 
not parsed_cred: arguments.error("invalid credential: %s" % args_parsed.cred) - args_parsed.cred = cred + args_parsed.cred = parsed_cred else: if args_parsed.no_cookies: From 5de40909e0543e1bf43eae3cb88ea53fe6946a43 Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Fri, 4 Jan 2019 14:46:30 +0100 Subject: [PATCH 057/100] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index d9cbe86..dc6f458 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,7 @@ # SafariBooks Download and generate *EPUB* of your favorite books from [*Safari Books Online*](https://www.safaribooksonline.com) library. I'm not responsible for the use of this program, this is only for *personal* and *educational* purpose. +Before any usage please read the *O'Reilly*'s [Terms of Service](https://learning.oreilly.com/terms/). ## Overview: * [Requirements & Setup](#requirements--setup) From a238dee1be79d2d0df6aa6414b77afe2a06bd1cb Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 13 Feb 2019 23:38:42 +0100 Subject: [PATCH 058/100] Fix #16 --- safaribooks.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/safaribooks.py b/safaribooks.py index c066ffd..dae5cd5 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -61,10 +61,10 @@ def unregister(self): sys.excepthook = sys.__excepthook__ def log(self, message): - self.logger.info(str(message).encode("utf-8", "replace")) + self.logger.info(message.encode("utf-8", "replace")) def out(self, put): - sys.stdout.write("\r" + " " * self.columns + "\r" + put + "\n") + sys.stdout.write("\r" + " " * self.columns + "\r" + put.encode("utf-8", "replace") + "\n") def info(self, message, state=False): self.log(message) From 9938473f06b2dd3c729baf803df31bc233d9bece Mon Sep 17 00:00:00 2001 From: Lorenzo Di Fuccia Date: Wed, 13 Feb 2019 23:47:43 +0100 Subject: [PATCH 059/100] Fixes #63 and #64 --- safaribooks.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git 
a/safaribooks.py b/safaribooks.py index dae5cd5..2d64982 100644 --- a/safaribooks.py +++ b/safaribooks.py @@ -208,10 +208,11 @@ class SafariBooks: "\n" \ "{0}\n" \ "