#!/usr/bin/env python3
#-----------------------------------------------------------------------------
# This example reads in a URL and parses the contents.
# Be aware that URL handling is one of those things that
# changed drastically between Python 2 and 3; this version
# targets Python 3 (urllib.request rather than the old urllib.urlopen).
#
# HTMLParser (html.parser in Python 3) is the built-in option.
# It works, but it's tedious.
# BeautifulSoup is a good option if you need an environment
# where only pure Python is allowed (lxml is implemented partially in C),
# such as Google App Engine, but let's use lxml. It's fast and easy
# (insert joke here).
# You can install lxml by running the following command
# in the terminal on almost any UNIX/UNIX-like machine:
#
#   pip install lxml
#
# If you have issues with that or you are on Windows, see this for guidance:
# http://lxml.de/index.html#download
#-----------------------------------------------------------------------------
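# For reference, a minimal sketch of the same link extraction done with
# BeautifulSoup (assumes `pip install beautifulsoup4`; left commented out
# so this script only depends on lxml):
#
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(readurl, "html.parser")
#   urls = [a["href"] for a in soup.find_all("a", href=True)]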
#-----------------------------------------------------------------------------
# Import any needed libraries below
#-----------------------------------------------------------------------------
import lxml.html
import urllib.request
#-----------------------------------------------------------------------------
# Begin the main program.
#-----------------------------------------------------------------------------
# The URL we want to parse.
URL_TO_PARSE = "http://www.reddit.com"
# Read in the URL you defined.
# Note that this handles HTTPS (SSL) URLs as well.
geturl = urllib.request.urlopen(URL_TO_PARSE)
# Save the contents of the URL in its entirety for analysis.
readurl = geturl.read()
# Close geturl. We don't need it again.
geturl.close()
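# (In Python 3 a with-statement does the same open/read/close dance
#  automatically:
#      with urllib.request.urlopen(URL_TO_PARSE) as response:
#          readurl = response.read()
#  `response` here is just an illustrative name.)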
# We could do a lot more interesting stuff with this,
# but let's stay generic and just get all of the URLs
# from the contents.
# Parse the contents with lxml.
parse_readurl = lxml.html.fromstring(readurl)
# Select the URL in href for all a tags using XPath.
# Returns a list of URLs.
readurl_urls = parse_readurl.xpath("//a/@href")
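# (The same XPath pattern works for other attributes; for example,
#  parse_readurl.xpath("//img/@src") would return all image sources.)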
# Since readurl_urls is a list, let's iterate
# over it and print the URLs.
for url in readurl_urls:
    print(url)
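# Many of the extracted hrefs are relative (e.g. paths starting with "/").
# A minimal sketch of resolving them into absolute URLs with urljoin from
# the standard library; `absolute_urls` is just an illustrative name.
# (lxml also offers parse_readurl.make_links_absolute() for this job.)
from urllib.parse import urljoin
absolute_urls = [urljoin(URL_TO_PARSE, url) for url in readurl_urls]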