nltk didn't solve this kind of parsing problem #22

me-suzy · 2024-11-16T16:06:47Z

I solved this problem without the nltk library, because nltk didn't solve it.

https://stackoverflow.com/questions/79160811/python-compare-html-tags-in-ro-folder-with-their-corresponding-tags-in-en-folde

This is how I handled the problem. For this I made identifiers...

The full solution is here:

https://gist.github.com/me-suzy/1a25babeaea6d9ae2d375cdee77b987d

Please update nltk with new logic so that it can solve this kind of parsing problem in the future.

I tried this code, but it didn't work!

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

# Make sure the NLTK resources are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource for word_tokenize
nltk.download('stopwords')

def extract_article_tags(file_path):
    """Return all <p> tags between <!-- ARTICOL START --> and <!-- ARTICOL FINAL -->."""
    with open(file_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file.read(), 'html.parser')

    # Locate the comments that delimit the article section
    article_start = soup.find(string=lambda text: text and "ARTICOL START" in text)
    article_end = soup.find(string=lambda text: text and "ARTICOL FINAL" in text)

    if article_start and article_end:
        # Collect the nodes between the two comments (assumes both comments
        # share the same parent). find_next_siblings() with no arguments yields
        # only Tag objects, so the end comment would never be seen and the loop
        # would never stop; iterate over .next_siblings instead.
        article_content = []
        for sibling in article_start.next_siblings:
            if sibling is article_end:
                break
            article_content.append(sibling)

        # Build a new BeautifulSoup object for this section only
        article_soup = BeautifulSoup(''.join(str(node) for node in article_content), 'html.parser')
        return article_soup.find_all('p')  # Return every <p> tag in the article
    else:
        print("Could not find the comment-delimited section in the file.")
        return []

def process_tag_text(tag):
    """Turn the text of a tag into a list of tokens for comparison."""
    text = tag.get_text(strip=True)  # Extract the plain text
    tokens = word_tokenize(text.lower())  # Tokenize the text
    stop_words = set(stopwords.words('english') + stopwords.words('romanian'))  # Stop words to drop
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return filtered_tokens

def compare_tags(ro_file, en_file):
    """Compare the <p> tags of two HTML files and report the differences."""
    # Extract the tags from each file
    ro_tags = extract_article_tags(ro_file)
    en_tags = extract_article_tags(en_file)

    # Tokenize the text of each tag
    ro_texts = [{'original': tag, 'tokens': process_tag_text(tag)} for tag in ro_tags]
    en_texts = [{'original': tag, 'tokens': process_tag_text(tag)} for tag in en_tags]

    # Compare tags by their token lists. The RO and EN tokens are in different
    # languages, so exact equality rarely holds for translated paragraphs.
    unique_ro_tags = []
    for ro in ro_texts:
        if not any(ro['tokens'] == en['tokens'] for en in en_texts):
            unique_ro_tags.append(ro)

    # Print the tags found only in RO
    print("\nTags unique to RO (no counterpart in EN):")
    for tag in unique_ro_tags:
        print(f"- {tag['original'].get_text(strip=True)}")

# File paths
ro_file = 'd:/3/ro/a-domni-cu-adevarat-nu-este-un-lucru-la-indemana-primului-venit.html'
en_file = 'd:/3/en/to-truly-rule-is-not-a-thing-within-the-reach-of-the-first-comer.html'

# Compare the tags and print the differences
compare_tags(ro_file, en_file)
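
For anyone trying to reproduce the extraction step without the two HTML files, here is a minimal sketch that uses only Python's standard `html.parser` instead of BeautifulSoup. It is not the approach from the gist; the class name `ArticleParagraphs`, the helper `article_paragraphs`, and the `sample` markup are made up for illustration. It assumes the markers appear as HTML comments containing "ARTICOL START" and "ARTICOL FINAL", as in the script above.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of each <p> found between two marker comments."""

    def __init__(self, start="ARTICOL START", end="ARTICOL FINAL"):
        super().__init__()
        self.start, self.end = start, end
        self.in_article = False   # True between the two marker comments
        self.in_p = False         # True while inside a <p> within the article
        self.paragraphs = []      # finished paragraph texts
        self._buf = []            # text pieces of the current paragraph

    def handle_comment(self, data):
        # Comments arrive as their own events, so the markers are seen
        # regardless of which element they are nested in.
        if self.start in data:
            self.in_article = True
        elif self.end in data:
            self.in_article = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.in_article:
            self.in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)


def article_paragraphs(html):
    """Return the paragraph texts between the marker comments (hypothetical helper)."""
    parser = ArticleParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample markup, only for demonstration
sample = """
<p>menu</p>
<!-- ARTICOL START -->
<p>First <em>paragraph</em>.</p>
<div><p>Second paragraph.</p></div>
<!-- ARTICOL FINAL -->
<p>footer</p>
"""
print(article_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```

Because the parser reacts to comment events rather than walking siblings, it does not depend on the two markers sharing a parent element. It returns plain text only, so any comparison of tag attributes would still need a real DOM.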

ekaf · 2024-12-06T11:15:40Z

Is this an issue? Just claiming that some code doesn't work may not be enough to raise any kind of interest.
