Domain filtering with a user-defined list, non regex based #49

Open
wants to merge 2 commits into
base: master

ZJaume (Member) commented Jan 11, 2024

To perform domain filtering of explicit content with the UT1 blocklist, I first tried the url filter option, but the resulting regex was enormous, and so was the time spent matching each url. So I've implemented a domain filter that simply stores the domain list in an unordered set. To filter, it extracts the domain from each document's url and checks whether it is in the set.

Doubts I have about the implementation:

  • Log messages for documents discarded by domain are currently at trace level instead of info. The number of documents discarded in each warc file was significantly high, and I didn't want to fill up the log files with too many messages.
  • It is debatable whether the url filter is still useful or could be replaced entirely by the domain filter. At least the way we are using it, it mostly filters out domains anyway.

@ZJaume ZJaume changed the title Domain filtering with a user-defined list based, non regex based Domain filtering with a user-defined list, non regex based Jan 11, 2024
This significantly reduces the reading time. Before this it took a
couple of seconds; now it takes less than 1 second.

ZJaume (Member, Author) commented Jan 11, 2024

To run it, use the new cli option --domain-filters adult_domains.gz with the UT1 blocklist.

Reading the list was adding a couple of seconds to each warc2text call. Now, with compressed file reading, it should take 1 s or less.
