⛔ [DEPRECATED] Active development has moved to https://github.com/UTMediaCAT/mediacat-domain-crawler
MediaCAT is an open-source, web-based application with a curated search engine. It crawls designated news websites and Twitter accounts for citations of, or hyperlinks to, a list of source sites. MediaCAT then archives all referring stories and source stories in preparation for an advanced analysis of relations across the digital news-scape.
Voyage currently has 2 components:
- Web Server: lets you edit and display all the stored data, as well as the scopes you provide to Explorer, through your favorite browser.
- Explorer: searches the web using the scopes given through the Web Server and explores for you. It automatically stores all relevant information found along the way, so that you can review everything it collects through the Web Server.
Before installation, verify that you meet the following requirements:
The required Python version should already be installed on Debian Jessie (and up), as well as Ubuntu 14.04 LTS (and up).
You can check your currently installed version with python --version
Note: The project is currently supported up to Python 3.5.2 on Ubuntu 16.04. Work is in progress to make the project compatible with Python 3.6.9 on Ubuntu 18.04.
If your Python version differs from Python 3.5, we highly recommend using a virtual environment tool (such as pyenv) to help manage multiple Python versions.
Typically, to use Python 2 run python, and to use Python 3 run python3. Inside a Python virtual environment, python uses whatever version the environment is set to.
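If you want to double-check programmatically which interpreter version you will actually be running under, a minimal convenience check (not part of the project) is:

# Prints the active interpreter version and warns if it is not 3.5.x.
import sys

print(sys.version)
if sys.version_info[:2] != (3, 5):
    print('Warning: this project is currently tested against Python 3.5.x')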
You can check your current wget version with wget --version
- Clone the repo
- Go to the main folder
- Run the install script:
sudo -i
sudo ./InstallScript.sh
In order to use Postgres, we'll need to log into the postgres system account. You can do that by typing:
sudo -i -u postgres
You will be asked for your normal user password and then will be given a shell prompt for the postgres user.
You can get a Postgres prompt immediately by typing:
psql
By default, when you create a PostgreSQL cluster, password authentication for the database superuser (“postgres”) is disabled. In order to give Django access to this user, you will need to set a password for it.
In the Postgres prompt:
postgres=# \password
Enter new password: password
Enter it again: password
In the Postgres prompt:
postgres=# create database mediacat;
postgres=# create database crawler;
You may exit the Postgres prompt now.
Please configure the database settings in Frontend/Frontend/settings.py. For example:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mediacat',
        'USER': 'postgres',
        'PASSWORD': 'password',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
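Before starting Django, you can optionally confirm that these credentials work with a quick psycopg2 check. This is only a convenience sketch; the values below mirror the example settings, so adjust them to your own setup:

# Optional sanity check that the database credentials in settings.py work.
import psycopg2

conn = psycopg2.connect(dbname='mediacat', user='postgres',
                        password='password', host='localhost', port=5432)
cur = conn.cursor()
cur.execute('SELECT version()')
print(cur.fetchone()[0])
conn.close()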
You can edit the config.yaml file for personal settings.
For production instances, be sure to use a new randomized SECRET_KEY in Frontend/Frontend/settings.py. A new SECRET_KEY can be generated with the following Python script:
import random

print(''.join(random.SystemRandom().choice('abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_=+)') for _ in range(50)))
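Alternatively, if the installed Django version is 1.10 or newer, its built-in helper can generate one. This is a suggestion rather than something the project requires:

# Requires Django 1.10+; prints a key suitable for SECRET_KEY in settings.py.
from django.core.management.utils import get_random_secret_key

print(get_random_secret_key())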
Please configure Twitter credentials in config.yaml before using the Twitter crawler. You can get Twitter credentials from https://apps.twitter.com/.
Please make sure to apply migrations under the Frontend folder:
python manage.py makemigrations
python manage.py migrate
And create an admin user:
python manage.py createsuperuser
- To start:
python server.py run
(note: if using port 80, then sudo is needed to run/stop the server)
By default, this Django app is set to listen on all public IPs (port 80).
You can now access the server through http://IP:PORT/admin
The default is http://127.0.0.1/admin
Here you can view your action history and use quick navigation links to the database.
Here, you can view and edit the following scope items used by Explorer:
- Referring Sites: the sites that Explorer will look into. They are automatically validated when added.
- Referring Twitter Accounts: the Twitter accounts that Explorer will look into. They are automatically validated when added.
- Source Sites: the sites that Explorer looks for in articles/tweets to see whether they are used as sources.
- Source Twitter Accounts: the Twitter accounts that Explorer looks for in articles/tweets to see whether they are used as sources.
- Keywords: the words that Explorer looks for in articles/tweets.
Here, you can view the data collected by Explorer. Furthermore, you can download each archived entry as a Web Archive. For the demo, it is filled with pre-explored entries.
Here, you can download all the data stored in the database in JSON format.
Here, you can view the statistics among the collected entries.
For example, you can view how many articles were collected per day as an Annotation Chart.
Here, you can view the relations among the 4 scopes, based on the exploration.
Here, you can manage the users and groups used for logging in. Furthermore, users can be given different permissions.
Once your scope is ready, you may use the following explorers under the src folder to crawl news and Tweets:
- Article Explorer explores the Referring Sites for articles
- Twitter Crawler explores the posts of the Referring Twitter Accounts
The article explorer explores each site under a given domain. After this crawler has finished crawling the entire domain, the shallow crawler activates. From that point on, the article explorer only goes N levels down from the domain's homepage. A visual prompt indicating shallow crawling is visible in the Scope/Referring Sites tab. The level value defaults to 3, but can be changed in the config.yaml file.
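The sketch below is only an illustration of what "going N levels down" means, assuming a plain breadth-first traversal with a depth cutoff; it is not the project's actual article explorer implementation.

# Illustrative depth-limited (shallow) crawl; the real article explorer also
# handles scope validation, article extraction, and archiving.
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def shallow_crawl(homepage, max_depth=3):
    seen = {homepage}
    queue = deque([(homepage, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode('utf-8', 'ignore')
        except Exception:
            continue
        yield url
        if depth >= max_depth:
            continue  # stop expanding links below the configured level
        for href in re.findall(r'href="([^"]+)"', html):
            link = urljoin(url, href)
            if link.startswith(homepage) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))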
article_explorer_run.sh, found under the src/ folder, contains a few lines of code that will automatically restart the crawler after a certain period of time.
To run the crawler, you must first run warc_queue.py so that the WARC files are created as the crawler runs. Note: we create a screen session so that the warc queue can operate in the background.
screen -S warc
python src/warc_queue.py
(Press Ctrl+A followed by Ctrl+D to detach and get back to the original screen.) After this, you must run the actual crawler:
screen -S article
python src/article_explorer.py
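For a rough idea of what warc_queue.py produces, the sketch below writes a single HTTP response as a WARC record using the warcio library; warcio is an assumption chosen for illustration, not necessarily the writer the project uses.

# Minimal WARC-writing sketch (illustrative only).
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    payload = BytesIO(b'<html><body>archived page</body></html>')
    http_headers = StatusAndHeaders('200 OK',
                                    [('Content-Type', 'text/html')],
                                    protocol='HTTP/1.1')
    record = writer.create_warc_record('http://example.com/', 'response',
                                       payload=payload,
                                       http_headers=http_headers)
    writer.write_record(record)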
The Twitter crawler has three modes of crawling: timeline, streaming, and history. The timeline and streaming modes are based on twarc, and the history mode is based on GetOldTweets-python.
The timeline mode crawls the timelines of the Referring Twitter Accounts, retrieving up to 3200 of each user's most recent Tweets (Twitter's API constraint). You can set the frequency of timeline re-crawling in config.yaml (the default is to re-crawl timelines every 30 days).
python twitter_crawler.py timeline
The streaming mode crawls Tweets from the Referring Twitter Accounts in real time.
python twitter_crawler.py streaming
The history mode collects all Tweets posted by the Referring Twitter Accounts.
python twitter_crawler.py history
Running the Twitter crawler with no parameter runs all three modes together by default.
python twitter_crawler.py
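As a rough sketch of what the twarc-based modes do under the hood (the config.yaml key names and the example account below are hypothetical; use whatever names your configuration actually defines):

# Illustrative twarc 1.x usage; twitter_crawler.py wraps this kind of logic.
import yaml
from twarc import Twarc

with open('config.yaml') as f:
    cfg = yaml.safe_load(f)

# Hypothetical key names -- adjust to match your config.yaml.
t = Twarc(cfg['consumer_key'], cfg['consumer_secret'],
          cfg['access_token'], cfg['access_token_secret'])

# timeline mode: up to ~3200 most recent Tweets per account (API limit)
for tweet in t.timeline(screen_name='example_account'):
    print(tweet['id_str'], tweet.get('full_text', tweet.get('text')))

# streaming mode: Tweets from followed accounts in real time
# for tweet in t.filter(follow='12345'):  # numeric user ID(s)
#     print(tweet['id_str'])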
Unit test files are located under src/unit_tests.