provide public access to _content etc #3329

Open
3 tasks done
sebbASF opened this issue Jun 14, 2024 · 8 comments

Comments

@sebbASF

sebbASF commented Jun 14, 2024

  • I have searched the issues (including closed ones) and believe that this is not a duplicate.
  • I have searched the documentation and believe that my question is not covered.
  • I am willing to lend a hand to help implement this feature.

Feature Request

Plugins regularly need to access _content.
However, pylint (rightly) flags this as protected-access.

AFAICT it is expected that plugins may need to access _content (and _summary), so the protected status is just a nuisance when using pylint.

Renaming would cause lots of issues, but it would be possible to provide a public accessor for use by plugins.
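
A minimal sketch of what such an accessor could look like, using a stand-in class rather than Pelican's real Content. The property name raw_content is hypothetical; nothing in this thread says Pelican provides it.

```python
class Content:
    """Stand-in for pelican.contents.Content (illustration only)."""

    def __init__(self, content):
        # Existing protected attribute, left untouched so current code keeps working.
        self._content = content

    @property
    def raw_content(self):
        """Hypothetical public, read-only view of the protected _content."""
        return self._content


page = Content("<p>Hello</p>")
# The plugin reads a public name, so pylint has no protected-access to report.
print(page.raw_content)
```

Because the protected attribute is not renamed, existing plugins that still read _content would be unaffected.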

@egberts
Contributor

egberts commented Jun 29, 2024

Would any of these Pelican signals help you deal with access to each document's content, specifically content signals?

Something like:

  • signals.content_object_init.connect(my_plugin_content_processor)

About 80% of the plugins use this signal.

A content handler would look like this:

from pelican import signals
from pelican.contents import Content, Article, Page

def my_content_object_init(content_class):
    # Description:
    #   First signal handler to provide the actual content of any article/page/static
    #   file.
    #
    # arg1 : content_class:Content
    #
    # article of Article(Content) class provides the following variable member items:
    #   allowed_statuses:tuple, author:Author, authors:list, category:Category,
    #   content:str, date:SafeDatetime, date_format:str, default_status:str,
    #   default_template:str, filename:str, get_content:partial, get_summary:partial,
    #   in_default_lang:bool, lang:str, locale_date:str, mandatory_properties:tuple,
    #   metadata:dict, private:str, reader:str, relative_dir:str,
    #   relative_source_path:str, save_as:str, settings:dict, slug:str,
    #   source_path:str, status:str, summary:str, tags:list, template:str,
    #   timezone:Zoneinfo, title:str, translations:list, url:str, url_format:dict
    #
    # Callstack
    #     signals.content_object_init.send()
    #     Content.__init__()
    #     Article.__init__()
    #     Readers.read_file()
    #     ArticlesGenerator.generate_context()
    #     Pelican.run()
    #
    # 4th article-related signal
    # 3rd signal in ArticlesGenerator.generate_context()
    # Still inside read_file()
    # First signal appearance having a content provided by Markdown.read_file()
    #
    # Hooked using signals.content_object_init.connect(my_content_object_init)
    #
    print('my_content_object_init called')
    print('my_content_object_init: content: {0!s}'.format(content_class.content))

    # Only Article and Page objects are of interest here
    if not isinstance(content_class, (Article, Page)):
        return
    # Do your article/page processing here

You can hook up the handler above by connecting it to the content_object_init signal:

# This is how a Pelican plugin works.
# register() is the well-established function name that Pelican looks for
# so that the plugin gets recognized, initialized, and has its handlers
# connected to the Pelican app.
import logging

from pelican import signals

logger = logging.getLogger(__name__)

def register():
    logger.info(
        'MY plugin registered for Pelican, using new 4.0 plugin variant')
    signals.content_object_init.connect(my_content_object_init)
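
For readers unfamiliar with the connect/send pattern, here is a self-contained sketch of what happens when a handler is connected and the signal later fires. The Signal class below is a tiny stand-in written for this illustration, not Pelican's or blinker's actual implementation, and the string passed to send() merely takes the place of a real content object.

```python
class Signal:
    """Tiny stand-in for a blinker-style signal (illustration only)."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        # Remember the handler so it can be called when the signal fires.
        self._receivers.append(receiver)
        return receiver

    def send(self, sender):
        # Call every connected handler with the sender as its only argument.
        return [(receiver, receiver(sender)) for receiver in self._receivers]


content_object_init = Signal()
seen = []


def my_content_object_init(content_object):
    # A real plugin would inspect or modify the content object here.
    seen.append(content_object)


def register():
    content_object_init.connect(my_content_object_init)


register()
content_object_init.send("fake content object")
print(seen)  # ['fake content object']
```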

@sebbASF
Author

sebbASF commented Jun 29, 2024

I don't see how that helps.

This request is about getting public access to the protected field _content which is part of the Content object, not about getting access to the Content object.

@egberts
Contributor

egberts commented Jun 30, 2024

Having reviewed all the signals (as of v4.9.1), I am not fully convinced ... yet ... that content needs to be made available outside of the signals.content_object_init signal as an 'unprotected' attribute. Of course, I am not the designer, but the current Pelican design resonates with me.

While Python (or the JetBrains PyCharm IDE) may be able to access the protected ._content attribute, ideally a plugin should only use the public .content attribute, which is exactly what your plugin's content processor function receives when it is hooked to the signals.content_object_init signal.

Is there a particular signal stage that you need content access within? I have listed all the signals used in Pelican v4.9.1 in chronological order:

    # All signals are listed here as of Pelican v4.9.1
    signals.initialized.connect()
    signals.get_generators.connect()
    signals.readers_init()  # Article class
    signals.generator_init()  # ArticlesGenerator class
    signals.article_generator_init.connect()
    signals.readers_init() 
    signals.readers_init()  # Page class
    signals.generator_init()  # PagesGenerator
    signals.page_generator_init()
    signals.readers_init()
    signals.generator_init()
    signals.readers_init()  # Static class
    signals.generator_init()  # StaticGenerator
    signals.static_generator_init()
    signals.article_generator_preread.connect()
    signals.article_generator_context.connect()
    signals.content_object_init.connect()
    signals.article_generator_pretaxonomy.connect()
    signals.article_generator_finalized.connect()
    signals.page_generator_preread.connect()
    signals.page_generator_context.connect()
    signals.content_object_init.connect()
    signals.page_generator_finalized.connect()
    signals.static_generator_preread.connect()
    signals.static_generator_context.connect()
    signals.content_object_init.connect()
    signals.static_generator_finalized.connect()
    signals.all_generators_finalized.connect()
    signals.get_writers()
    signals.feed_generated()
    signals.feed_written()
    signals.article_generator_write_article.connect()
    signals.content_written()
    signals.article_writer_finalized.connect()
    signals.page_generator_write_page.connect()
    signals.content_written()
    signals.page_writer_finalized()
    signals.content_written()
    signals.pelican_finalized()

@egberts
Contributor

egberts commented Jun 30, 2024

Here are some plugins that reference _content:

https://github.com/getpelican/pelican-plugins/blob/c61bd12914fd52af1808c53151a07225e7c3341c/glossary/glossary.py#L36

Got it. I think I may have a fix, but no time to test it.

Right off the bat, I can tell you that this particular plugin should be easily fixable by replacing the article_generator_finalized signal with signals.content_object_init.connect(parse_content):

def register():
    signals.initialized.connect(get_excludes)
    signals.content_object_init.connect(parse_content)
    signals.page_generator_context.connect(set_definitions)

This upgrades the protected content._content to the normal content.content:

from pelican.contents import Article, Page

def parse_content(content):
    # vvvvv NEW CODE vvvvv
    # Only process Article or Page subclass contents
    if not isinstance(content, (Article, Page)):
        return
    # ^^^^^ NEW CODE ^^^^^
    # resume normal code
    soup = bs4.BeautifulSoup(content._content, 'html.parser')
    ...

Note the check for Article or Page; modify it as needed.

Also, remove the parse_articles function and its loop over the articles entirely, since the signal now operates on a single document at a time.

@sebbASF
Author

sebbASF commented Jul 1, 2024

Your suggested change does not solve the issue. It does not change the line where _content is referenced:
soup = bs4.BeautifulSoup(content._content, 'html.parser')

@egberts
Contributor

egberts commented Jul 1, 2024

Oops, my bad. I meant: please replace ALL instances of ._content with .content, if you haven't already. Did that work as well?

_content is a protected variable; in short, it is an internal attribute that code outside the class is discouraged from accessing.

@sebbASF
Author

sebbASF commented Jul 1, 2024

We are going round in circles.

._content and .content don't always return the same value, otherwise plugins would not need to use _content.
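
A simplified sketch of how the two can diverge. It assumes the public content property post-processes the raw value, here by expanding a "{filename}" link placeholder into a site URL. The class is a stand-in written for this illustration, and the substitution is far cruder than whatever Pelican actually does.

```python
class Content:
    """Stand-in for pelican.contents.Content (illustration only)."""

    def __init__(self, content, siteurl=""):
        self._content = content   # raw value, as produced by the reader
        self._siteurl = siteurl

    @property
    def content(self):
        # Assumed post-processing step: expand the "{filename}" placeholder.
        return self._content.replace("{filename}", self._siteurl)


page = Content('<a href="{filename}/about.md">About</a>',
               siteurl="https://example.com")
print(page._content)  # <a href="{filename}/about.md">About</a>
print(page.content)   # <a href="https://example.com/about.md">About</a>
```

A plugin that needs to see or rewrite the text before such processing has to read the raw value, which is why swapping ._content for .content is not always equivalent.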
