Encoding issues for websites in non-English languages such as Chinese, Japanese, etc. #64

gaowanliang · 2021-01-30T10:58:48Z

The downloaded website is encoded as Unicode numeric character references, and with this encoding the browser does not display the real content.
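For readers unfamiliar with the term, here is a small standard-library sketch of what numeric character references look like. It only illustrates the encoding and is not pywebcopy code; the sample string is an assumption chosen for the example.

```python
import html

# Sample text (an assumption for illustration).
text = "中文"

# Encoding to ASCII with xmlcharrefreplace turns every non-ASCII
# character into a decimal numeric character reference.
ncr = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ncr)  # &#20013;&#25991;

# html.unescape converts the references back into real characters.
restored = html.unescape(ncr)
print(restored == text)  # True
```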

mima3 · 2022-04-11T08:08:02Z

I got around this with the following approach:

Create a new class that inherits from WebPage.
In that class, define a new save_html that passes the page's encoding when writing the tree:

# (omitted)
            root.getroottree().write(file_name, method="html", encoding=self.encoding)
# (omitted)
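The effect of the encoding argument can be sketched with the standard library's xml.etree.ElementTree. It is used here only as a stand-in for lxml so the snippet runs without extra packages; the write call quoted above is lxml's, and the sample markup and the helper name `serialize` are assumptions. With an ASCII target encoding the serializer falls back to character references, while a UTF-8 target keeps the real characters.

```python
import io
import xml.etree.ElementTree as ET

# Sample document (an assumption for illustration).
root = ET.fromstring("<html><body><p>中文</p></body></html>")
tree = ET.ElementTree(root)

def serialize(encoding):
    """Hypothetical helper: write the tree as HTML bytes in the given encoding."""
    buf = io.BytesIO()
    tree.write(buf, method="html", encoding=encoding)
    return buf.getvalue()

ascii_bytes = serialize("us-ascii")  # non-ASCII text becomes &#...; references
utf8_bytes = serialize("utf-8")      # non-ASCII text is kept as UTF-8 bytes

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
```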

rajatomar788 · 2022-04-11T12:12:17Z

@mima3 this is one way to do it. The other is to change the .encoding attribute of the WebPage object.

muzicstation · 2023-02-01T07:10:29Z

Is there a valid method for ver 7.0 or later versions?

BradKML · 2023-04-02T16:57:41Z

Is it possible to just check the encoding of the webpage based on what the page itself claims? There are two major ways of getting the declared encoding to decode with:

https://stackoverflow.com/a/19156107 resource.headers.get_content_charset()
https://stackoverflow.com/a/38807852 page.info().getparam('charset')
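As a rough sketch of the first approach: get_content_charset() is a method of email.message.Message, which Python 3's HTTP response headers build on, so it can be tried offline on a hand-made header. The helper name `declared_charset` and the sample header values are assumptions for illustration.

```python
from email.message import Message

def declared_charset(content_type, default=None):
    """Hypothetical helper: return the charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns
    # `failobj` when no charset parameter is present.
    return msg.get_content_charset(failobj=default)

print(declared_charset("text/html; charset=Shift_JIS"))
print(declared_charset("text/html"))
print(declared_charset("text/html", default="utf-8"))
```

This only reports what the server declares; it says nothing about whether the declaration is correct.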

rajatomar788 · 2023-04-03T02:50:56Z

@BrandonKMLee your first example works on top of the second one, so they are not two separate things. Also, the encoding reported by a website is wrong most of the time, so finding the best encoding is always a matter of trial and error on the user side.
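The trial-and-error idea could look roughly like the following standard-library sketch, which tries a list of candidate encodings in order and keeps the first one that decodes without error. The function name `decode_with_fallback` and the candidate list are assumptions, and this is not how pywebcopy itself chooses an encoding.

```python
def decode_with_fallback(data, candidates=("utf-8", "gb18030", "shift_jis")):
    """Hypothetical helper: return (encoding, text) for the first clean decode."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: fall back to UTF-8 with replacement characters.
    return "utf-8", data.decode("utf-8", errors="replace")

utf8_result = decode_with_fallback("中文".encode("utf-8"))
gbk_result = decode_with_fallback("中文".encode("gbk"))
print(utf8_result)
print(gbk_result)
```

A clean decode is not proof of the right encoding: permissive multi-byte encodings accept many byte sequences that were written in another encoding, so the order of the candidates matters and the result is only a heuristic.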

BradKML · 2023-04-03T03:58:09Z

@rajatomar788 in that case one would need to run the text through Python chardet or cChardet to "smell" it. Even if that is not a guarantee, it is a good default to have.

PeterBon · 2023-09-06T02:33:09Z

Is there a valid method for ver 7.0 or later versions?

I solved this by modifying schedulers.py:

class Scheduler(SchedulerBase):
    def _handle_resource(self, resource):
        try:
            self.logger.debug('Scheduler trying to get resource at: [%s]' % resource.url)
            resource.get(resource.context.url)
            # NOTE :meth:`get` can change the :attr:`filepath` of the resource
            resource.encoding = 'utf-8'  # add this line here
            self.index.add_resource(resource)
        except ConnectionError:
            self.logger.error(
                "Scheduler ConnectionError Failed to retrieve resource from [%s]"
                % resource.url)
            # self.index.add_entry(resource.url, resource.filepath)
        except Exception as e:
            self.logger.exception(e)
            # self.index.add_entry(resource.url, resource.filepath)
        else:
            self.logger.debug('Scheduler running handler for: [%s]' % resource.url)
            resource.retrieve()
        self.index.add_resource(resource)

claell mentioned this issue Oct 17, 2023

UTF-8 encoding issues #120

Open
