Tim Rühsen edited this page Jul 20, 2016 · 30 revisions

Wget2 Introduction

Development of Wget2 has started, and everybody is invited to contribute, test, and discuss.
The codebase is hosted in the 'wget2' branch of Wget's git repository and on GitHub; the two are synced regularly.

Wget2 on Savannah (check out branch 'wget2' after cloning)

Wget2 on GitHub

The idea is to have a fresh and maintainable codebase with features like multithreaded downloads, HTTP2, OCSP, HSTS, Metalink, IDNA2008, Public Suffix List, Multi-Proxies, Sitemaps, Atom/RSS Feeds, compression (gzip, deflate, lzma, bzip2), support for local filenames, etc.
Some of these features have been built into Wget in the meantime, but others are really hard to implement in the old codebase.

Most of the functionality is exposed via a library API (libwget) so that external programs can make use of it. For example, have a look at examples/print_css_urls.c: just a few lines of C to parse and print all URLs from a CSS file.

Wget2 will remain a separate executable, independent of Wget,
so you can install and test Wget2 without endangering your existing setup and scripts.

What is missing

  • FTP(S) support
  • WARC support
  • Several Wget options are missing.
  • API documentation incomplete

New options

--force-css         Treat input file as CSS. (default: off)
--force-sitemap     Treat input file as Sitemap. (default: off)
--force-atom        Treat input file as Atom Feed. (default: off)
--force-rss         Treat input file as RSS Feed. (default: off)
--force-metalink    Treat input file as Metalink. (default: off)
--metalink          Parse and follow metalink files and don't save them (default: on)
--max-threads       Max. concurrent download threads. (default: 5)
--gnutls-options    Custom GnuTLS priority string. Interferes with --secure-protocol. (default: none)
--ocsp-stapling     Use OCSP stapling to verify the server's certificate. (default: on)
--ocsp              Use OCSP server access to verify server's certificate. (default: on)
--ocsp-file         Set file for OCSP caching. (default: .wget_ocsp)
--http2             Use HTTP/2 protocol if possible. (default: on)
--input-encoding    Character encoding of the file contents read with --input-file. (default: local encoding)
--cookie-suffixes   Load public suffixes from file. They prevent 'supercookie' vulnerabilities.
--chunk-size        Download large files in multithreaded chunks. (default: 0 (=off))
                    Example: wget --chunk-size=1M
--check-hostname    Check the server's certificate's hostname. (default: on)
--dns-caching       Caching of domain name lookups. (default: on)
--http-proxy        Set HTTP proxy/proxies, overriding environment variables.
--https-proxy       Set HTTPS proxy/proxies, overriding environment variables.
--tcp-fastopen      Enable TCP Fast Open (TFO). (default: on)
--robots            Respect robots.txt standard for recursive downloads. (default: on)
--random-file       File to be used as source of random data.
--fsync-policy      Use fsync() to wait until data has been written to the physical layer. (default: off)
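
To make the idea behind --chunk-size more concrete, here is a minimal Python sketch of how a file of known length could be split into chunk-sized byte ranges, one per HTTP Range request. This is not Wget2's implementation. The function names are hypothetical, and the binary size suffixes (K = 1024) are an assumption.

```python
from typing import List, Tuple

# Assumption: size suffixes are binary multiples (K = 1024).
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '1M' or '4096' to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


def chunk_ranges(total_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split total_size bytes into inclusive (first, last) byte ranges."""
    if chunk_size <= 0:
        # 0 mirrors the documented default: chunking is off, one range only.
        return [(0, total_size - 1)] if total_size > 0 else []
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]


def range_header(first: int, last: int) -> str:
    """Format the value of an HTTP Range header for one chunk."""
    return f"bytes={first}-{last}"
```

With a chunk size of 1M, a file of 2,500,000 bytes yields three ranges, which separate download threads could then fetch independently.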

Different behavior of Wget2

  • new 'include' statement for config files, e.g. to load /etc/wget/conf.d/*.conf
  • --input-file - (reading URLs from stdin) starts downloading as soon as the first URL arrives, so that slow URL generators can feed Wget2
  • check HTTP 'ETag' to avoid parsing duplicates
  • use HTTP 'Accept-Encoding': gzip, deflate, lzma, bzip2
  • CLI string options can be set to NULL by prepending --no-, e.g. --no-user-agent
  • boolean CLI options can all be set to true or false
  • $WGETRC is not read so far
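
The stdin behavior listed above can be exercised from a script. The following Python sketch streams URLs to a child process line by line, flushing after each one so that downloading can begin before the generator has finished. It assumes the executable is installed as "wget2", which this page does not state. The function and constant names are hypothetical.

```python
import subprocess
from typing import Iterable, Sequence

# Assumption: the Wget2 executable is installed under the name "wget2".
WGET2_STDIN_COMMAND = ("wget2", "--input-file=-")


def feed_urls(urls: Iterable[str],
              command: Sequence[str] = WGET2_STDIN_COMMAND) -> int:
    """Stream URLs to the child's stdin one per line; return its exit code."""
    proc = subprocess.Popen(list(command), stdin=subprocess.PIPE, text=True)
    try:
        for url in urls:
            proc.stdin.write(url + "\n")
            proc.stdin.flush()  # hand each URL over as soon as it is produced
    finally:
        proc.stdin.close()  # EOF signals that no more URLs will follow
    return proc.wait()
```

Any iterable works as the URL source, including a generator that produces URLs slowly, for example while crawling an index or querying a database.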

Differing CLI options Wget/Wget2

Option Wget Wget2 Comment
--accept-regex ✓
--ask-password ✓
--auth-no-challenge ✓
--background ✓
--body-data ✓
--body-file ✓
--check-hostname ✓
--chunk-size ✓
--config ✓ ✓ Same as --config-file, for compatibility with Wget1.x
--config-file ✓
--convert-file-only ✓
--cookie-suffixes ✓
--dns-caching ✓
--exclude-directories ✓
--egd-file ✓ ✓ A no-op for compatibility (GnuTLS can be compiled/configured to use EGD)
--follow-ftp ✓
--metalink ✓
--force-atom ✓
--force-css ✓
--force-metalink ✓
--force-rss ✓
--force-sitemap ✓
--ftp-password ✓
--ftps-clear-data-connection ✓
--ftps-fallback-to-ftp ✓
--ftps-implicit ✓
--ftps-resume-ssl ✓
--ftp-user ✓
--glob ✓
--header ✓
--gnutls-options ✓
--http2 ✓
--http-proxy ✓
--https-proxy ✓
--if-modified-since ✓ Wget2 uses If-Modified-Since when timestamping is turned on
--ignore-length ✓
--include-directories ✓
--input-encoding ✓
--input-metalink ✓ (✓) Wget2 uses a combination of --input-file and --force-metalink
--limit-rate ✓ For Wget2 use a bandwidth limiter like trickle
--metalink-over-http ✓ Wget2 does this automatically
--method ✓
--max-threads ✓
--netrc-file ✓ Mainly for test code usage to test .netrc files
--ocsp ✓
--ocsp-file ✓
--ocsp-stapling ✓
--passive-ftp ✓
--preferred-location ✓ Wget2 respects priorities and order of locations
--preserve-permissions ✓
--proxy-password ✓
--proxy-user ✓
--random-file ✓
--regex-type ✓
--rejected-log ✓
--reject-regex ✓
--relative ✓
--remove-listing ✓
--report-speed ✓
--retr-symlinks ✓
--retry-connrefused ✓
--robots ✓ Wget1.x has a robots command but no option, -e robots=1 does the job
--show-progress ✓
--start-pos ✓
--tcp-fastopen ✓
--unlink ✓
--warc-cdx ✓
--warc-compression ✓
--warc-dedup ✓
--warc-digests ✓
--warc-file ✓
--warc-header ✓
--warc-keep-log ✓
--warc-max-size ✓
--warc-tempdir ✓
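
Some of the comments in the table describe how a Wget 1.x command line would need to change for Wget2. The Python sketch below applies three of them: --input-metalink, --limit-rate, and --metalink-over-http. It is an illustration only, not a tool shipped with Wget2. The function name is hypothetical, and it assumes options are written in the --option=value form.

```python
from typing import List, Tuple


def translate_wget1_args(args: List[str]) -> Tuple[List[str], List[str]]:
    """Return (translated arguments, notes) for a Wget 1.x argument list.

    Assumption: options carrying a value use the --option=value spelling.
    """
    translated: List[str] = []
    notes: List[str] = []
    for arg in args:
        name, sep, value = arg.partition("=")
        if name == "--input-metalink" and sep:
            # Table comment: combination of --input-file and --force-metalink.
            translated += ["--input-file=" + value, "--force-metalink"]
        elif name == "--limit-rate":
            # Table comment: use a bandwidth limiter like trickle instead.
            notes.append("--limit-rate dropped: use an external bandwidth "
                         "limiter such as trickle")
        elif name == "--metalink-over-http":
            # Table comment: Wget2 does this automatically.
            notes.append("--metalink-over-http dropped: Wget2 does this "
                         "automatically")
        else:
            translated.append(arg)
    return translated, notes
```

Options that the table marks as missing in Wget2 without suggesting a replacement are passed through unchanged here; a real migration script would need to flag those as well.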