Periodically archive reddit saved posts and comments to easily readable HTML files.

sriramcu/complete_reddit_backup

Combine BDFR Runs Sequentially

Do you like to keep saving reddit posts in case you want to go back to them in the future? Do you have a hoarding tendency when it comes to this? Do you worry that one day the author of one of your more important saved posts might delete their account and posts, or edit them in protest of Reddit policies? If so, you may have used the BDFR tool to archive these saved posts offline, only to find that there is no easy way to render the posts the way they appear on the reddit site itself: the original tool outputs JSON and XML. You want neatly rendered HTML pages, and the ability to keep adding to them periodically by merging newly saved reddit posts with the BDFR output that you already have offline. This program is meant to satisfy that requirement.

The generated index.html file groups posts by subreddit for easy reference.

Acknowledgments

Setup

  1. Run pip install bdfr. For more information on BDFR setup, see https://github.com/Serene-Arc/bulk-downloader-for-reddit#installation
  2. Clone this repo and run pip install -r requirements.txt
  3. Rename example_config.cfg to my_config.cfg and fill in client_id, client_secret and user_token based on the app you created at https://www.reddit.com/prefs/apps. Refer to the BDFR README for more info on this.
  4. Assuming you have never run the bdfr or bdfrtohtml tools before, run the program for the first time without any arguments: python reddit_backup.py
  5. The above command creates a folder called "html_pages" in the current directory.
  6. Move the contents of this folder to another location on your system, since they will be overwritten later. Note down this new location; you will pass it to the program as an argument on later runs.
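Before the first run, it can help to confirm that the three values from step 3 are actually filled in. The following is a minimal sketch, not part of this repo: it assumes my_config.cfg uses INI syntax with the keys in the [DEFAULT] section, and the helper name missing_config_keys is hypothetical.

```python
import configparser

# Keys that step 3 of the setup asks you to fill in.
REQUIRED_KEYS = ("client_id", "client_secret", "user_token")


def missing_config_keys(cfg_text):
    """Return the required keys that are absent or blank in the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Assumption: the keys live in the [DEFAULT] section of the file.
    values = parser.defaults()
    return [key for key in REQUIRED_KEYS if not values.get(key, "").strip()]


# Example with an incomplete config: client_secret is blank, user_token is absent.
sample = "[DEFAULT]\nclient_id = abc\nclient_secret =\n"
print(missing_config_keys(sample))  # ['client_secret', 'user_token']
```

To check a real file, pass its contents, for example open("my_config.cfg").read().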

Usage

python reddit_backup.py -d <input_dir> -v <verbose>

-d or --input_dir: Specify the input directory (default: empty string).

Leave this empty for the first-time execution described in the setup. Future runs should use the location noted down in the last step of the setup.

-v or --verbose: Set the verbosity level (default: 1)
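For illustration, the two options above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the code in reddit_backup.py: the function name build_parser, the help strings, and the choice of an integer type for the verbosity level are all hypothetical. Only the flag names and defaults come from this README.

```python
import argparse


def build_parser():
    """Build a parser with the two documented options and their defaults."""
    parser = argparse.ArgumentParser(
        description="Archive reddit saved posts and comments to HTML files.")
    parser.add_argument(
        "-d", "--input_dir", default="",
        help="folder holding the HTML pages from earlier runs; "
             "leave empty on the first run")
    # Assumption: the verbosity level is an integer.
    parser.add_argument(
        "-v", "--verbose", type=int, default=1,
        help="verbosity level")
    return parser


# Parsing an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["-d", "/path/to/saved_html_pages", "-v", "2"])
print(args.input_dir, args.verbose)
```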

Working of the Program

  1. First, the program runs the bdfr tool and stores the result in the bdfr folder. If a bdfr folder already exists, the user gets two choices: delete the existing folder and re-run the tool, or proceed with the existing stored output.
  2. Then, the bdfrtohtml tool is run to generate HTML from the above bdfr folder.
  3. Before running further steps, the program backs up the input directory to the program_backups folder with a timestamp, deleting backups from more than 5 runs ago.
  4. The HTML pages generated in step 2 are then transferred to the input directory, modifying it in place. The existing index HTML file is not overwritten. Instead, it is combined with the new index file by parsing the HTML and transferring the post references.
  5. This combined index HTML is reordered to maintain or create grouping by subreddit. Duplicate entries are removed.
  6. styles.css is preserved between runs, and the newly generated styles.css is always discarded. This ensures that you can customize your styles once and leave them unchanged for as long as you want.
  7. idList.txt and the media folder are deleted or overwritten each time. Do not use this program if you are interested in preserving their contents.
  8. All intermediate files are either stored in program_backups (with old ones deleted automatically, as mentioned), moved to the input folder according to the logic above, or stored in the bdfr folder. No stray files need to be deleted by the user.
  9. After the core part of the program is done, the old bdfr html folder is compared with the new output generated by this program. The comparison is logged to the console in verbose mode and also to a timestamped file in the program_backups/comparison_logs directory. If the comparison reveals undesirable changes made by the program, go to the program_backups folder for the older copy.
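Steps 3 and 5 above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code this program uses: the function names, the backup_ folder prefix, the timestamp format, and the representation of an index entry as a (subreddit, href) pair are all hypothetical. The actual program works on the index HTML itself by parsing it.

```python
import shutil
from datetime import datetime
from pathlib import Path

MAX_BACKUPS = 5  # the README keeps backups from the last 5 runs


def backup_and_prune(input_dir, backups_root, now=None, max_backups=MAX_BACKUPS):
    """Copy input_dir to a timestamped folder, then delete the oldest backups."""
    backups_root = Path(backups_root)
    backups_root.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: the timestamp makes names sort by age.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = backups_root / f"backup_{stamp}"
    shutil.copytree(input_dir, target)
    # Only consider backup folders, so siblings such as comparison_logs survive.
    backups = sorted(p for p in backups_root.iterdir()
                     if p.is_dir() and p.name.startswith("backup_"))
    for old in backups[:max(0, len(backups) - max_backups)]:
        shutil.rmtree(old)
    return target


def merge_index_entries(existing, new):
    """Combine post entries, drop duplicates, and group them by subreddit.

    Each entry is assumed to be a (subreddit, href) pair already extracted
    from an index file; entries with an href seen earlier are skipped.
    """
    grouped = {}
    seen = set()
    for subreddit, href in list(existing) + list(new):
        if href in seen:
            continue
        seen.add(href)
        grouped.setdefault(subreddit, []).append(href)
    return grouped
```

Because existing entries are visited first, a post that appears in both index files keeps its original position within its subreddit group.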

Note

  1. Installing bdfr with pip gives you the latest version of the bdfr tool published on PyPI at install time. However, the bdfrtohtml tool is merely "vendored" in the bdfr-html directory of this repo. It is not a submodule and will not be updated, even if the original repo changes in the future (unchanged since 2021, as of September 2024).

  2. If you encounter PRAW errors such as DuplicateReplaceException or AssertionError (usually in _insert_comment()) while this program runs the BDFR tool internally, make the following changes to comment_forest.py in praw.models, in the directory where your Python libraries are installed:

[Image: img.png, showing the required changes to comment_forest.py]
