Merge Sitemap and Scaleway from polomarcus/barometre (#83)
* docker: scrapper app

* fix(docker): scrapper app name logs

* docker: force container name

* wip: docker and config

* wip: docker

* chores: update poetry lock

* fix: docker start

* fix: docker streamlit

* fix(streamlit): docker start

* chores: poetry lock

* chores: python 3.10 to 3.11

* remove: scrap_sitemap

* wip: add SQLAlchemy

* wip(refacto): SQLAlchemy + test

* fix: ci

* fix: ci install poetry

* test(wip): save using pandas.to_sql

* ci: --dev-dependency

* chores: pytest update

* clean: notebooks should be in another repo

* lint

* ci: test job

* ci: need docker to run test

* ci: directly use docker

* docker: wait for PG to be ready

* ci: docker exit code from container

* ci(test): use github action services to run PG

* ci(test): use poetry first

* ci(test): use github action services to run PG

* ci: searching for postgres host on CI

* ci: searching for postgres host on CI

* ci: log connection

* ci: yet

* ci: fix test

* ci: lint yaml

* wip(refacto): scrapping sitemap

* refacto: use ENV to dev to test locally

* test: find section in urls

* test: query_one_sitemap_and_transform

* fix(ci): env

* ci: add nginx

* ci: nginx background

* refacto: remove unused code

* refacto: sitemap

* fix: test local ci

* fix: test depending on env

* fix: test depending on env

* auto review

* doc: autoreview

* ci: deactivate all jobs

* review: add error log

* feat: add url to save inside pg

* chores: remove some deps

* cd: scaleway docker

* poetry lock

* fix: wrong folder

* Docker Scaleway (#6)

* wip

* bump: docker v2 to v3

* refacto: docker

* refacto: postgres 15

* fix(test): using hash for PK

* db: change type of PK

* chores: removing drop tables to clean test DB

* fix: PK using consistent hash

* fix db

* fix(db): create schemas

* fix(db): create schemas

* feat: add healthcheck for Scaleway deployment (#7)

A healthcheck was added so that Scaleway can tell whether a container is
up:
* localhost:5000/

The app version was added; the CI bumps it on every commit to the main
branch
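
A minimal sketch of such a healthcheck endpoint, using only the Python standard library. The names `HealthCheckHandler` and `start_healthcheck` are hypothetical and the project's actual implementation may differ; the only detail taken from this commit is that the root path on port 5000 should answer when the container is up.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthCheckHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200 so the platform can see the container is up."""

    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep healthcheck probes out of the application logs.
        pass


def start_healthcheck(port: int = 5000, host: str = "0.0.0.0") -> HTTPServer:
    """Start the healthcheck server in a daemon thread and return it."""
    server = HTTPServer((host, port), HealthCheckHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_healthcheck()` before the main job starts keeps the probe responsive while the job runs; the daemon thread stops when the process exits.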

* fix: cd

* [no ci]: bumping version

* Feat/add medias (#8)

Configure the media to fetch and add tests for some of them

* [no ci]: bumping version

* refacto: change PK due to publication date changing over time (#9)

Some media update their publication date instead of the last modification
date, which broke the PK

To do after deployment:
* delete the duplicates caused by the new PK

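```
psql
>
-- Keep one row per news_title. Without the ctid comparison every row
-- would match itself and the whole table would be deleted.
DELETE FROM
    sitemap_table a
        USING sitemap_table b
WHERE a.news_title = b.news_title
  AND a.ctid < b.ctid;
```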
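
For illustration, a primary key that stays stable over time can be derived by hashing only fields that do not change. This is a sketch, not the project's code: the function name `stable_id` and the choice of `media` and `news_title` as inputs are assumptions. A cryptographic digest is used because Python's built-in `hash()` is salted per process for strings, so it does not give the same value across runs.

```python
import hashlib


def stable_id(media: str, news_title: str) -> str:
    """Build a deterministic key from fields assumed not to change over time.

    The publication date is deliberately left out, since some media
    rewrite it after publication.
    """
    key = f"{media}|{news_title}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()
```

The same article then maps to the same key on every run, so re-ingesting it updates or skips the row instead of creating a duplicate.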

* [no ci]: bumping version

* fix(streamlit): use sqlalchemy (#10)

Use env variable to connect to PG

* [no ci]: bumping version

* medias: add francebleu, nouvelobs, mediapart

fix: nouvelobs

* [no ci]: bumping version

* Feat: parse description meta tag for every news (#11)

To be as generic as possible, we parse this tag from every news article:
 ```
<meta name="description" content="coucou">
```
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta
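
A minimal sketch of this extraction with the standard library's `html.parser`. The names `DescriptionParser` and `extract_description` are hypothetical, and the project may rely on a different HTML library.

```python
from html.parser import HTMLParser
from typing import Optional


class DescriptionParser(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html: str) -> Optional[str]:
    """Return the description meta tag content, or None when it is absent."""
    parser = DescriptionParser()
    parser.feed(html)
    return parser.description
```

Returning `None` for a missing tag lets the caller log the page and move on rather than fail the whole run.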

Something to keep in mind during the first deployments: execution time, and the concurrency done with asyncio
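
One way to keep asyncio concurrency under control is to cap the number of simultaneous requests with a semaphore. This is a sketch under stated assumptions: `fetch_all`, its `fetch` parameter and the default limit of 10 are hypothetical and not taken from the project.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> Dict[str, str]:
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str):
        # The semaphore blocks here once the limit is reached.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

Passing the fetch coroutine as a parameter keeps the sketch independent of any HTTP client and makes it easy to test with a fake.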

:warning: websites rendered with JavaScript cannot be parsed yet (this needs https://www.zenrows.com/blog/scraping-javascript-rendered-web-pages#requirements)

* [no ci]: bumping version

* feat: only save not known sitemap (PG already saved id) and sitemap lastmod/publication date

To avoid wasteful scraping:
* we first compare the parsed sitemap.xml with the sitemap entries already
known in PG, and only continue parsing the entries that differ.
* when reading sitemap.xml, we only keep news that is at most 7 days old
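
The two filters above can be sketched as follows. This assumes each sitemap entry is a dict with an `id` and a timezone-aware `lastmod`, and that "7 day-old" means at most 7 days old; the name `filter_new_entries` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Set


def filter_new_entries(
    entries: List[Dict],
    known_ids: Set[str],
    now: Optional[datetime] = None,
    max_age_days: int = 7,
) -> List[Dict]:
    """Keep entries that are recent enough and not already stored."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        entry
        for entry in entries
        if entry["id"] not in known_ids and entry["lastmod"] >= cutoff
    ]
```

Only the entries returned here would need their pages fetched and parsed.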

* [no ci]: bumping version

* fix: corsematin

* [no ci]: bumping version

* refacto: healthcheck port renamed to PORT

* fix(description): log 20minutes missing hat

* [no ci]: bumping version

* log: log level + colors

* [no ci]: bumping version

* refacto: docker compose / CD push steps (#13)

* [no ci]: bumping version

* add medias: charentelibre, courrierpicard, ... (#14)

* [no ci]: bumping version

---------

Co-authored-by: barometre-github-actions <[email protected]>
Co-authored-by: Rambier Estelle <[email protected]>
3 people authored Nov 24, 2023
1 parent 96b2d08 commit 7e0ba98
Showing 113 changed files with 8,595 additions and 1,173,918 deletions.
8 changes: 8 additions & 0 deletions .dockerignore
@@ -0,0 +1,8 @@
pgdata
.git
.venv
venv
.vscode
notebooks
LICENSE
.idea
26 changes: 0 additions & 26 deletions .github/workflows/check_integration.yml

This file was deleted.

35 changes: 0 additions & 35 deletions .github/workflows/db_backup_on_scaleway.yml

This file was deleted.

56 changes: 56 additions & 0 deletions .github/workflows/deploy-main.yml
@@ -0,0 +1,56 @@
name: Build & Deploy to Scaleway

on:
  push:
    # Sequence of patterns matched against refs/heads
    branches:
      - main


  # to be able to force deploy
  workflow_dispatch:


env:
  PYTHON_VERSION: '3.11'
  POETRY_VERSION: '1.6.1'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ env.PYTHON_VERSION }}
      - uses: actions/checkout@v4
      - name: Login to Scaleway Container Registry
        uses: docker/login-action@v3
        with:
          username: nologin
          password: ${{ secrets.SCALEWAY_API_KEY }}
          registry: ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}
      - name: Build ingest_to_db image
        run: docker build -f Dockerfile_ingest . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
      - name: Push ingest_to_db Image
        run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
      - name: Build streamlit image
        run: docker build -f Dockerfile_streamlit . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/streamlit
      - name: Push streamlit Image
        run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/streamlit
      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          version: ${{ env.POETRY_VERSION }}
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true
      - name: Poetry install & bump version
        run: |
          poetry install --only dev
          poetry version patch
          git config user.name barometre-github-actions
          git config user.email [email protected]
          git add pyproject.toml
          git commit -m "[no ci]: bumping version"
          git push origin main
20 changes: 20 additions & 0 deletions .github/workflows/docker-compose.yml
@@ -0,0 +1,20 @@
name: Docker Compose CI

on:
  workflow_dispatch: # https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: init and load data
        run: docker compose up -d
      - name: sleep
        run: sleep 60
      - name: log sitemap
        run: docker logs sitemap
      - name: log db ingestion
        run: docker logs ingest_to_db
      - name: log streamlit
        run: docker logs streamlit
42 changes: 0 additions & 42 deletions .github/workflows/homepage_lemonde.yml

This file was deleted.

35 changes: 0 additions & 35 deletions .github/workflows/main.yml

This file was deleted.

42 changes: 0 additions & 42 deletions .github/workflows/scrap_sitemap.yml

This file was deleted.

35 changes: 0 additions & 35 deletions .github/workflows/scrap_sitemap_and_ingest_db.yml

This file was deleted.

42 changes: 0 additions & 42 deletions .github/workflows/scrap_tv_program.yml

This file was deleted.

42 changes: 0 additions & 42 deletions .github/workflows/scrap_youtube.yml

This file was deleted.
