WDG Umbrella Project

This repo contains the source code for a scraper that collects posts matching a specified format from the site. The scraper picks up these posts and converts each one into an entity in the SQLite DB under the apps/wdgscraper project. The SQLite files are saved in the ./db/ folder.

Getting started

You will need Elixir installed for this project to work. Assuming you have done that already, here is the step-by-step guide to running the repository from scratch.

git clone https://github.com/ktunprasert/wdg-umbrella/
cd wdg-umbrella
# fetch the dependencies (they are compiled on first use)
mix deps.get
# perform migration to ensure we're on the latest
mix ecto.migrate
# launch the scraper and scrape the archive for /wdg/ threads
mix scrape.archive
# this instead scrapes the catalog.json
mix scrape.catalog
# alternatively you may use thread-based scraping
# the command below will scrape threads = [123, 456, 789]
mix scrape.thread 123 456 789

To generate the static pages, move into the apps/serum_static directory:

cd apps/serum_static
# runs the local live-reload server - defaults at localhost:8080
mix serum.server
# builds the final payload for static server
MIX_ENV=prod mix serum.build

The final output is located at apps/serum_static/site/

Pipelines

The pipeline is set up with a crontab schedule that runs every 4 hours. On each run the scraper (currently scrape.catalog) picks up all the /wdg/ threads and tries to parse every candidate post for insertion into the database.

If no posts are found, or a post in that thread already exists in the database, it is ignored. The workflow history can be seen here: https://github.com/ktunprasert/wdg-umbrella/actions/workflows/scrape.yml

The static site generation workflow watches for changes in the db folder (the scraper has committed something, or posts were removed manually) and for template changes in the apps/serum_static/ folder, since both affect how the final site looks. This job builds the production static payload as plain HTML/CSS/JS and sends it to the wdg-one organisation's GitHub.io page.

Post format

By default, the post format should match the following. A field without its tag prefix will not be picked up by the scraper.

:: my-project-title ::
dev:: anon
tools:: node, react, etc
link:: https://my.website.com
repo:: https://github.com/user/repo
progress:: Lorem ipsum dolor sit amet, consetetur sadipscing elitr

Here are the matching regexes for the enthusiasts out there:

@title ~r/::\s?(.+)\s?::/U
@dev ~r/dev::\s?([^<\n]+)<?/
@tools ~r/tools::\s?([^<\n]+)<?/
@link ~r/link::\s?([^<\n]+)<?/
@repo ~r/repo::\s?([^<\n]+)<?/
@progress ~r/progress::\s?([^<]+)(?:<\/pre>)?/

Note that you must provide a project title, or the post won't count as a valid scrapable post.
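To see how the regexes above behave on the sample post, here is a minimal sketch. The module name PostFormatSketch and the function parse/1 are hypothetical and are not taken from the repository; the actual scraper may structure this differently. The sketch assumes captured values are trimmed, because the ungreedy title regex leaves a leading space in its capture, and it treats a missing title as an invalid post, as described above.

```elixir
defmodule PostFormatSketch do
  # Hypothetical helper, not part of the repository.
  # The README shows these regexes as module attributes; here they are
  # built inside a function so the sketch runs as a standalone script.
  defp patterns do
    [
      title: ~r/::\s?(.+)\s?::/U,
      dev: ~r/dev::\s?([^<\n]+)<?/,
      tools: ~r/tools::\s?([^<\n]+)<?/,
      link: ~r/link::\s?([^<\n]+)<?/,
      repo: ~r/repo::\s?([^<\n]+)<?/,
      progress: ~r/progress::\s?([^<]+)(?:<\/pre>)?/
    ]
  end

  # Returns {:ok, fields} when a title is present, otherwise :error.
  # Fields that do not match are set to nil.
  def parse(text) do
    fields =
      Map.new(patterns(), fn {name, regex} ->
        case Regex.run(regex, text, capture: :all_but_first) do
          [value] -> {name, String.trim(value)}
          _ -> {name, nil}
        end
      end)

    if fields.title, do: {:ok, fields}, else: :error
  end
end

post = """
:: my-project-title ::
dev:: anon
tools:: node, react, etc
link:: https://my.website.com
repo:: https://github.com/user/repo
progress:: Lorem ipsum dolor sit amet, consetetur sadipscing elitr
"""

{:ok, fields} = PostFormatSketch.parse(post)
IO.inspect(fields.title) # prints "my-project-title"
IO.inspect(fields.tools) # prints "node, react, etc"

# A post with no title line is rejected.
:error = PostFormatSketch.parse("dev:: anon")
```

Save it as a .exs file and run it with the elixir command to experiment with your own post before submitting it.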
