Cruncher is an application that crawls over TechCrunch informations about posts, stores data about authors, articles and provides a web API to consume it.
Why? This is a challenge for Cheesecake Labs. Because I love this company and I need this job :)
This is my first real project with Django and Django REST Framework, including applying DevOps during the process. I tried to follow the recommendations and best practices available on the Internet and the documentations. Here are some considerations:
- I need improve the front-end.
- Digital Ocean is a big jerk sometimes, but work as well.
- Improve production deployment. It's my first time with DO.
- Write more tests.
- Collect more data with the crawler.
The production API endpoint can be consumed at cruncher.oclubecast.com/api/v1/
.
Here are the GETs:
/api/v1/authors/
: get all the authors paginated./api/v1/authors/<id>
: get details on one author./api/v1/articles/
: get all the articles paginated./api/v1/articles/<id>
: get details on one article.
Here's a list of softwares, libraries and servicesused in this project:
- Digital Ocean
- Docker
- docker-compose
- NGINX
- Django
- Django REST Framework
- Gunicorn
- coverage
- PostgreSQL
- psycopg2
- Scrapy
As Freddie Mercury used to say... "It's so eaaaasy" - Understanders will understand. But you need a Docker and Docker-Compose installed.
Repository clone:
git clone https://github.com/perylemke/cruncher
cd cruncher
Run docker-compose to build and migrate Django models:
docker-compose build
docker-compose up -d
docker-compose run web python manage.py makemigrations
docker-compose run web python manage.py migrate
Run the crawler (here with 100 requests limit):
docker-compose run scrap scrapy crawl cruncher --set CLOSESPIDER_PAGECOUNT=100
You're ready to go! Consume the API at http://localhost:8000/api/v1/
.
Now Cruncher is not more in production. Clone this repository and DO IT! :)