Encoding issues for websites in non-English languages such as Chinese, Japanese, etc. #64
Comments
I got around this by the following method.
@mima3 this is one of the ways to do it. The other being changing the |
Is there a method that works for version 7.0 or later?
Is it possible to simply use the encoding that the webpage itself declares? There are two major ways of obtaining the encoding to decode with.
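The two examples from this comment are not preserved here. As a rough sketch, and purely as an assumption about what was meant, two common places where a page declares its encoding are the HTTP `Content-Type` header and a `<meta>` tag in the HTML. The helper names below are hypothetical and are not part of pywebcopy.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased, or None if absent


def charset_from_meta(body, scan_bytes=2048):
    """Return the charset declared in a <meta> tag near the top of the page, or None."""
    match = _META_RE.search(body[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None
```

Either value is only what the site claims, so it may still need checking against the actual bytes.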
@BrandonKMLee your first example works on top of the second one, so they are not two separate things. Also, the encoding reported by a website is wrong most of the time, so finding the best encoding is always a matter of trial and error on the user's side.
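One way to picture the trial-and-error approach is a strict decode over a list of candidate encodings, keeping the first one that does not raise. This is only a sketch: the function name and the candidate list are assumptions, and a decode that succeeds is not proof that the encoding is the right one, so the order of candidates matters.

```python
def decode_with_fallbacks(data, candidates=("utf-8", "gb18030", "shift_jis", "euc-kr")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess (or unknown codec name): move on to the next candidate.
            continue
    # Nothing decoded cleanly: fall back to UTF-8 with replacement characters.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

For example, `decode_with_fallbacks("中文".encode("gb18030"))` fails the UTF-8 attempt and falls through to `gb18030`.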
@rajatomar788 in that case one would need to run the text through Python chardet or cChardet to "smell" the encoding; even though it is not a guarantee, it is a good default to have.
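A minimal sketch of that idea, assuming the third-party `chardet` package is installed (`chardet.detect` returns a dict with an `encoding` key). The wrapper name and the `default` parameter are hypothetical, and the result is a statistical guess rather than a guarantee.

```python
def sniff_encoding(data, default="utf-8"):
    """Guess the encoding of raw bytes with chardet, falling back to a default."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        # chardet is not available, so no sniffing is possible.
        return default
    guess = chardet.detect(data)
    # 'encoding' can be None when the detector has no usable guess.
    return guess.get("encoding") or default
```

The returned name could then be assigned as the encoding before the page is decoded.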
I solved this by modifying schedulers.py:
class Scheduler(SchedulerBase):
    def _handle_resource(self, resource):
        try:
            self.logger.debug('Scheduler trying to get resource at: [%s]' % resource.url)
            resource.get(resource.context.url)
            # NOTE :meth:`get` can change the :attr:`filepath` of the resource
            resource.encoding = 'utf-8'  # line added here
            self.index.add_resource(resource)
        except ConnectionError:
            self.logger.error(
                "Scheduler ConnectionError Failed to retrieve resource from [%s]"
                % resource.url)
            # self.index.add_entry(resource.url, resource.filepath)
        except Exception as e:
            self.logger.exception(e)
            # self.index.add_entry(resource.url, resource.filepath)
        else:
            self.logger.debug('Scheduler running handler for: [%s]' % resource.url)
            resource.retrieve()
            self.index.add_resource(resource)
The text of the downloaded website is encoded as Unicode numeric character references, and in this form the real content is not displayed in the browser.
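For illustration, numeric character references such as `&#20013;` can be turned back into the characters they stand for as a post-processing step. The sketch below is an assumption about one possible workaround, not something pywebcopy does; the function name is hypothetical, and references for markup-significant characters are left alone so the HTML structure is not changed.

```python
import re

# Decimal (&#20013;) or hexadecimal (&#x4e2d;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

# Code points for < > & " ' are kept as references to avoid breaking the markup.
_KEEP = {0x3C, 0x3E, 0x26, 0x22, 0x27}


def decode_numeric_references(text):
    """Replace numeric character references with the characters they encode."""
    def replace(match):
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        if code in _KEEP:
            return match.group(0)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(replace, text)
```

The output should then be saved in an encoding that can represent the characters, for example UTF-8.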