Merge Sitemap and Scaleway from polomarcus/barometre (#83)
* docker: scrapper app
* fix(docker): scrapper app name logs
* docker: force container name
* wip: docker and config
* wip: docker
* chores: update poetry lock
* fix: docker start
* fix: docker streamlit
* fix(streamlit): docker start
* chores: poetry lock
* chores: python 3.10 to 3.11
* remove: scrap_sitemap
* wip: add SQLAlchemy
* wip(refacto): SQLAlchemy + test
* fix: ci
* fix: ci install poetry
* test(wip): save using pandas.to_sql
* ci: --dev-dependency
* chores: pytest update
* clean: notebooks should be in another repo
* lint
* ci: test job
* ci: need docker to run test
* ci: directly use docker
* docker: wait for PG to be ready
* ci: docker exit code from container
* ci(test): use github action services to run PG
* ci(test): use poetry first
* ci(test): use github action services to run PG
* ci: searching for postgres host on CI
* ci: searching for postgres host on CI
* ci: log connection
* ci: yet
* ci: fix test
* ci: lint yaml
* wip(refacto): scraping sitemap
* refacto: use ENV to dev to test locally
* test: find section in urls
* test: query_one_sitemap_and_transform
* fix(ci): env
* ci: add nginx
* ci: nginx background
* refacto: remove unused code
* refacto: sitemap
* fix: test local ci
* fix: test depending on env
* fix: test depending on env
* auto review
* doc: autoreview
* ci: deactivate all jobs
* review: add error log
* feat: add url to save inside pg
* chores: remove some deps
* cd: scaleway docker
* poetry lock
* fix: wrong folder
* Docker Scaleway (#6)
* wip
* bump: docker v2 to v3
* refacto: docker
* refacto: postgres 15
* fix(test): using hash for PK
* db: change type of PK
* chores: removing drop tables to clean test DB
* fix: PK using consistent hash
* fix db
* fix(db): create schemas
* fix(db): create schemas
* feat: add healthcheck for Scaleway deployment (#7)
  A healthcheck was added.
  Scaleway can tell whether a container is up by requesting localhost:5000/.
  The app version was added and is managed by the CI on every commit to the main branch.
* fix: cd
* [no ci]: bumping version
* Feat/add medias (#8)
  Configure the media to fetch and add tests for some of them.
* [no ci]: bumping version
* refacto: change PK due to publication date changing over time (#9)
  Some media changed their publication date instead of their last-modification date, which broke the PK.
  To do after deployment: delete the duplicates caused by the new PK.
  ```
  psql
  > DELETE FROM sitemap_table a
      USING sitemap_table b
      WHERE a.news_title = b.news_title
        AND a.ctid < b.ctid;  -- keeps one row per title; without a tie-breaker every row matches itself
  ```
* [no ci]: bumping version
* fix(streamlit): use SQLAlchemy (#10)
  Use an environment variable to connect to PG.
* [no ci]: bumping version
* medias: add francebleu, nouvelobs, mediapart
  fix: nouvelobs
* [no ci]: bumping version
* Feat: parse description meta tag for every news (#11)
  To be as generic as possible, we parse this tag from every news item:
  ```
  <meta name="description" content="coucou">
  ```
  https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta
  Something to keep in mind during the first deployments: execution time, and concurrency handled with asyncio.
  :warning: websites rendered with JS cannot be parsed yet (need https://www.zenrows.com/blog/scraping-javascript-rendered-web-pages#requirements)
* [no ci]: bumping version
* feat: only save not known sitemap (PG already saved id) and sitemap lastmod/publication date
  To avoid wasteful scraping:
  * we first look up the sitemap entries already known in PG, then decide whether to keep parsing based on the difference with the parsed sitemap.xml
  * when reading sitemap.xml, we keep only news that is at most 7 days old
* [no ci]: bumping version
* fix: corsematin
* [no ci]: bumping version
* refacto: healthcheck port renamed to PORT
* fix(description): log 20minutes missing hat
* [no ci]: bumping version
* log: log level + colors
* [no ci]: bumping version
* refacto: docker compose / CD push steps (#13)
* [no ci]: bumping version
* add medias: charentelibre, courrierpicard, ... (#14)
* [no ci]: bumping version

---------

Co-authored-by: barometre-github-actions <[email protected]>
Co-authored-by: Rambier Estelle <[email protected]>
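
The healthcheck entry (#7) describes an endpoint on localhost:5000/ that lets Scaleway tell whether the container is up, and a later entry says the port is read from `PORT`. A minimal sketch of such an endpoint using only the standard library could look like the following. The handler name, the `APP_VERSION` variable, and the JSON body are assumptions made for illustration; the commit does not say which framework or response format the project uses.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical: the commit says the CI manages the app version,
# but not where the application reads it from.
APP_VERSION = os.environ.get("APP_VERSION", "0.0.0")


class HealthcheckHandler(BaseHTTPRequestHandler):
    """Answer GET / with 200 so a platform probe can see the container is up."""

    def do_GET(self):
        if self.path == "/":
            body = json.dumps({"status": "ok", "version": APP_VERSION}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep probe requests out of the application logs.
        pass


def make_server(port=None):
    """Build the server; the port defaults to the PORT variable, then 5000."""
    if port is None:
        port = int(os.environ.get("PORT", "5000"))
    return HTTPServer(("0.0.0.0", port), HealthcheckHandler)


# In a container entrypoint this would block and serve probes:
# make_server().serve_forever()
```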
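
The entries "fix(test): using hash for PK", "fix: PK using consistent hash", and #9 point at a primary key derived from a hash that must stay the same between runs. Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is fixed, so it is not suitable for that; a `hashlib` digest is. The sketch below assumes the key is built from a media name and a news title, which is only a guess based on the deduplication query; the function name and the choice of SHA-256 are illustrative.

```python
import hashlib


def stable_id(*parts: str) -> str:
    """Return a hex digest that is identical across processes and machines."""
    # The unit separator keeps ("ab", "c") from colliding with ("a", "bc").
    joined = "\x1f".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical inputs: the real key columns are not stated in the commit.
example_key = stable_id("some-media", "Some news title")
```

Leaving the publication date out of the inputs is what keeps the key stable when a site rewrites that date, which is the problem #9 describes.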
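
Entry #11 parses the `<meta name="description">` tag of every news page. As a rough illustration, the extraction can be done with the standard library's `html.parser`; the project may well use a different HTML library, and the class and function names here are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaDescriptionParser(HTMLParser):
    """Record the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def extract_description(html: str) -> Optional[str]:
    """Return the meta description of a page, or None when it is absent."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    return parser.description
```

As the commit warns, this only sees the HTML returned by the server, so pages whose description is injected by JavaScript would yield `None`.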
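
The "only save not known sitemap" entry describes two filters applied before any page is fetched: skip entries whose id is already stored in PG, and skip entries older than 7 days. A sketch of that filtering step is shown below, assuming each parsed sitemap entry is a dict with an `id` and a timezone-aware `lastmod`; those field names and the function name are assumptions, not the repository's schema.

```python
from datetime import datetime, timedelta, timezone


def entries_to_scrape(entries, known_ids, now=None, max_age_days=7):
    """Keep entries that are not stored yet and are at most max_age_days old."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        entry
        for entry in entries
        # known_ids would come from a query on the ids already saved in PG.
        if entry["id"] not in known_ids and entry["lastmod"] >= cutoff
    ]
```

Passing `known_ids` as a `set` keeps each membership check constant-time, which matters when the table already holds many entries.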