webis-de/mastodon-search

🕸️ mastodon-search

A Corpus for Simulating Search on Mastodon.

Installation

  1. Install Python 3.11 or higher.

  2. Create and activate a virtual environment:

    python3.11 -m venv venv/
    source venv/bin/activate

  3. Install dependencies:

    pip install -e .

Usage

Use this repository to crawl, analyze, and search Mastodon posts.

Hint: You can always list all available commands of our crawler by running:

mastodon-search -h

Crawling

Crawling a single instance

The central command used to crawl an instance is stream-to-es. It opens a connection to the specified Mastodon instance, receives new posts, and stores them in an Elasticsearch index:

mastodon-search stream-to-es --host https://es.example.com --username es_username --password es_password mastodon.example.com

Behind the scenes, this command fetches posts using Mastodon's streaming API. Because the streaming API is unavailable on many instances, the crawler gracefully falls back to regular HTTP GET requests against the public timeline API.

Obtaining and analyzing instance data

An initial list of nodes can be obtained from https://nodes.fediverse.party/:

wget https://nodes.fediverse.party/nodes.json

Now, enrich the list of instances with global and weekly activity stats. Be aware that the command below can take a few hours to complete:

mastodon-search obtain-instance-data nodes.json mastodon_instance_data/

Sampling instances for crawling

With the activity stats obtained, we can draw a representative sample from all instances:

mastodon-search choose-instances mastodon_instance_data/ out.csv

TODO: Don't we do this in the notebooks?
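To make the idea of a representative sample more concrete, here is a minimal sketch of proportional stratified sampling. It is an assumption about one possible approach, not a description of what `choose-instances` actually does. The field name `active_users`, the size thresholds, and the function names are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable


def size_bucket(instance: dict) -> str:
    """Assign an instance to a coarse size class (hypothetical thresholds)."""
    users = instance["active_users"]  # hypothetical field name
    if users < 100:
        return "small"
    if users < 10_000:
        return "medium"
    return "large"


def stratified_sample(
    instances: list[dict],
    bucket_of: Callable[[dict], str],
    fraction: float,
    seed: int = 0,
) -> list[dict]:
    """Draw the same fraction from every bucket, at least one per bucket."""
    rng = random.Random(seed)  # fixed seed for reproducible samples
    buckets: dict[str, list[dict]] = defaultdict(list)
    for instance in instances:
        buckets[bucket_of(instance)].append(instance)

    sample: list[dict] = []
    for key in sorted(buckets):
        members = buckets[key]
        count = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, count))
    return sample
```

Sampling per bucket keeps rare but large instances from being crowded out by the many small ones, which a plain uniform sample would tend to do.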

Analyzing

We provide Jupyter notebooks for easily analyzing the instances and crawled posts.

To open a notebook, run, for example:

jupyter notebook notebooks/mastodon-instance-data-vis.ipynb

Correlation of instance statistics

The correlation between all available instance statistics can be calculated by running:

mastodon-search calculate-correlation mastodon_instance_data/

Docker image

Our code can also run in a container. First, build the image with this command:

docker build -t mastodon_search .

To run commands using the Docker image just created, replace the mastodon-search command from the previous sections with docker run mastodon_search. If you want to save statuses to an Elasticsearch instance running on your local machine, the command should look like the following snippet. (You can leave out --network host if Elasticsearch is not running on your local machine.)

docker run --network host mastodon_search stream-to-es --host http://localhost --username es_username --password es_password mastodon.example.com

Deployment

Crawling can be parallelized on a Kubernetes cluster. To do so, install Helm and configure kubectl for your cluster.

You are then ready to deploy the Helm chart on the cluster and start the crawling:

helm install --dry-run --set esUsername="<REDACTED>" --set esPassword="<REDACTED>" --set-file instances="./data/instances.txt" mastodon-crawler ./helm

If the above command worked and the Kubernetes resources to be deployed look good to you, just remove the --dry-run flag to actually deploy the crawlers.

To stop the crawling, just uninstall the Helm chart:

helm uninstall mastodon-crawler

To restart the crawling, first uninstall and then reinstall the Helm chart.

Development

First, install Python 3.11 or higher and then clone this repository. From inside the repository directory, create a virtual environment and activate it:

python3.11 -m venv venv/
source venv/bin/activate

Then, install the test dependencies:

pip install -e .[tests]

After implementing a new feature, please check the code format, inspect common lint errors, and run all unit tests with the following commands:

ruff .                         # Code format and LINT
# mypy .                         # Static typing
bandit -c pyproject.toml -r .  # Security
pytest .                       # Unit tests

Contribute

If you have found a bug in this crawler or feel some feature is missing, please create an issue. We also gratefully accept pull requests!

If you are unsure about anything, post an issue or contact us:

We are happy to help!

Further resources

License

This repository is released under the MIT license.
