-
Notifications
You must be signed in to change notification settings - Fork 222
/
item_03_Difference_bytes_str_unicode.py
145 lines (105 loc) · 5.67 KB
/
item_03_Difference_bytes_str_unicode.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# Item 3: Know the difference between bytes, str, and unicode
import os
# In Python 3, there are two types that represent sequences of characters:
# bytes and str. Instances of bytes contain raw 8-bit values. Instances of
# str contain Unicode characters.
# In Python 2, there two types that represent sequences of characters: str and
# unicode. In contrast to Python 3, instances of str contain raw 8-bit values.
# Instances of unicode contain Unicode characters.
# There are many ways to represent Unicode characters as binary data (raw
# 8-bits values). The most common encoding in UTF-8. Importantly, str
# instances in Python 3 and unicode instances in Python 2 do not have an
# associated binary encoding. To convert Unicode characters to binary data,
# you must use the encode method. To convert binary data to Unicode
# characters, you must use the decode method.
# When you're writing Python programs, it's important to do encoding and
# decoding of Unicode at the furthest boundary of your interfaces. The core of
# your program should use Unicode character types (str in Python 3, unicode in
# Python 2) and should not assume any thing about character encodings. This
# approach allows you to be very accepting of alternative text encodings
# (such as Latin-1, Shift JIS, and Big5) while being strict about your output
# text encoding (idealy, UTF-8).
# The split between character types leads to two common situations in Python
# code:
# 1. You want to operate on raw 8-bit values that are UTF-8-encoded characters
# (or some other encoding).
# 2. You want to operate on Unicode characters that have no specific encoding.
# You'll often need two helper functions to convert between these two cases
# and and to ensure that the type of input values matches your code's
# expectations.
# In Python 3, you'll need one method that takes a str or bytes and always
# returns a str.
def to_str(bytes_or_str):
if isinstance(bytes_or_str, bytes):
value = bytes_or_str.decode('utf-8')
else:
value = bytes_or_str
return value # Instance of str
# You'll need another method that takes a str and bytes and always returns a
# bytes.
def to_bytes(bytes_or_str):
if isinstance(bytes_or_str, str):
value = bytes_or_str.encode('utf-8')
else:
value = bytes_or_str
return value # Instance of str
# In Python 2, you'll need one method that takes a str or unicode and always
# returns a unicode.
def to_unicode(unicode_or_str):
if isinstance(unicode_or_str, str):
value = unicode_or_str.decode('utf-8')
else:
value = unicode_or_str
return value # Instance of unicode
# You'll need another method that takes str or unicode and always returns a str.
def to_str(unicode_or_str):
if isinstance(unicode_or_str):
value = unicode_or_str.encode('utf-8')
else:
value = unicode_or_str
return value # Instance of str
# There are two big gotchas when dealing with raw 8-bit values and Unicode
# characters in Python.
# The first issue is that in Python 2, unicode and str instances seem to be
# the same type when a str only contains 7-bit ASCII characters.
# 1. You can combine such a str and unicode together using the + operator.
# 2. You can compare such str and unicode instances using equality and
# inequality operators.
# All of this behavior means that you can often pass a str or unicode instance
# to a function expecting one or the other and things will just work (as long
# as you're only dealing with 7-bit ASCII). In Python 3, bytes and str
# instances are never equivalent-not even the empty string-so you must be more
# deliberate about the types of character sequences that you're passing around.
# The second issue is that in Python 3, operations involving file handles
# (returned by the open built-in function) default to UTF-8 encoding. In
# Python 2, file operations default to binary encoding. This causes surprising
# failures, especially for programmers accustomed to Python 2.
# For example, say you want to write some random binary data to a file. In
# Python 2, this works. In Python 3, this breaks.
with open('random.bin', 'w') as f:
f.write("random")
# f.write(os.urandom(10))
# TypeError: write() argument must be str, not bytes
# The cause of this exception is the new encoding argument for open that was
# added in Python 3. This parameter defaults to 'utf-8'. That makes read and
# write operations on file handles expect str instances containing Unicode
# characters instead of bytes instances containing binary data.
# To make this work properly, you must indicate that the data is being
# opened in write binary mode ('wb') instead of write character mode ('w').
# Here, I use open in a way that works correctly in Python 2 and Python 3:
with open('random.bin', 'wb') as f:
f.write(os.urandom(10))
# This problem also exists for reading data from files. The solution is the
# same: Indicate binary mode by using 'rb' instead of 'r' when opening a file.
# Things to Remember
# 1. In Python 3, bytes contains sequences of 8-bit values, str contains
# sequences of Unicode characters. bytes and str instances can't be
# used together with operators (like > or +).
# 2. In Python 2, str contains sequences of 8-bit values, unicode contains
# sequences of Unicode characters. str and unicode can be used together
# with operators if the str only contains 7-bit ASCII characters.
# 3. Use helper functions to ensure that the inputs you operate on are the
# type of character sequence you expect (8-bit values, UTF-8 encoded
# characters, Unicode characters, etc.)
# 4. If you want to read or write binary data to/from a file, always open the
# file using a binary mode (like 'rb' or 'wb').