I'll post any code I write for the Python scraper that expands Watson's corpus.
I'm using the following libraries:
- http://scrapy.org/
- http://code.google.com/p/pygoogle/ (only if I can get some universal rules working for scraping meaningful text across domains)
- https://python-docx.readthedocs.org/en/latest/ (for outputting Watson-friendly .docx files with headers)
- http://www.unixuser.org/~euske/python/pdfminer/ (for extracting text from PDFs)
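
To make the "universal rules" idea a bit more concrete, here is a rough sketch of one such rule: keep only `<p>` text that sits outside navigation-style containers and is long enough to look like prose. This is only an illustration, not the actual scraper. It uses the standard library's `html.parser` instead of Scrapy so that it runs on its own, and the names `SKIP_TAGS`, `MIN_WORDS`, `ParagraphExtractor` and `extract_paragraphs` are all made up for this example, as is the 8-word threshold.

```python
from html.parser import HTMLParser

# Assumption: text inside these containers is never article content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption: paragraphs shorter than this are menus, captions, etc.
MIN_WORDS = 8


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements that sit outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # how many SKIP_TAGS we are nested inside
        self.in_paragraph = False
        self.buffer = []           # text fragments of the current paragraph
        self.paragraphs = []       # paragraphs that passed the length rule

    def _flush(self):
        # Normalise whitespace, then apply the minimum-length rule.
        text = " ".join("".join(self.buffer).split())
        if len(text.split()) >= MIN_WORDS:
            self.paragraphs.append(text)
        self.buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            if self.in_paragraph:  # previous <p> was never closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self._flush()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.buffer.append(data)


def extract_paragraphs(html):
    """Return the list of prose-like paragraphs found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    if parser.in_paragraph:        # document ended inside an open <p>
        parser._flush()
    return parser.paragraphs
```

Calling `extract_paragraphs(page_source)` on a downloaded page returns a list of strings, one per kept paragraph. Whether a rule this simple holds up across real domains is exactly the open question mentioned in the pygoogle bullet above.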