- 📖 Of Overview and Noble Purpose
- 📘 Of Purpose, Intention, and Worthy Usage
- 🖼️ BEHOLD! The Diagrammatic Depiction of the ETL Pipeline
- ✨ Features of Noble Craft
- 🏗️ The Grand Architecture
- 🚀 For the Journeyman Getting Started
- ⚙️ Preparations for Thy Quest
- 📥 Commencement of Deployment
- 🛠️ Usage of This Mechanism
- 🔧 Customize to Thy Liking
- 📁 Project Structure
- ⚠️ Known Issues and Their Vanquishment
- 🎯 A Roadmap of Future Glories
- 🤝 The Spirit of Fellowship
- 📄 License
- 📜 For the Unversed in Antiquity's Tongue
Greetings, kind scholars and brave data wranglers! Lend thy ears and open thine eyes, for I shall regale thee with the tale of a most wondrous endeavor: the Books ETL Pipeline Project. In this hallowed pursuit, we do weave together the intricate threads of data extraction, transformation, and loading to uncover knowledge most profound.
This grand mechanism, devised by tireless toil and wisdom, doth unite the realms of Python, Docker, PostgreSQL, and Airflow. By its might, one may harvest bookly treasures from the vast libraries of OpenLibrary and Google Books, cleanse and refine them, and store them in databanks for enlightenment and analysis.
Lo, this project is not merely a tool but a masterwork that doth exemplify the art and science of data engineering. Scholars, practitioners, and seekers of wisdom alike may find value herein, as it is both a tome of learning and a marvel of modern craft.
Thus, embark, good reader, upon this journey of discovery, and let the annals of data yield their secrets unto thee!
Hark! This noble endeavor is fashioned to fetch and hold knowledge, tracking the comings and goings of books upon the digital shelves. But lo! Its utility extendeth far beyond the boundaries of this humble purpose. Prithee know, fair user, that thou mayest adapt its workings to suit thine own curiosities. By a simple tweak of query, thou mayest turn this engine toward thine own pursuitsβbe it tracking wares, scrolls, or other matters of great import. Wield this tool as thy will decrees, and may it serve thee well in thy noble quests!
Hear ye, hear ye! Gather thy gaze upon this most wondrous depiction of the grand ETL pipeline!
Within its bounds, thou shalt witness the harmonious interplay of myriad parts, each a vital cog in this celestial mechanism. From Security Sanctuaries to ensure the sacred safety of thine operations, to the Testing Grounds whereupon thy code is proven and hardened, this diagram illustrates the majestic flow of data, transformed from its humble JSON origins into a regal table of fieldsβfit for analysis and insight.
- 🔐 Security: Lo, the bastions of access control and protection, ensuring no ill-begotten hand may meddle with the data's purity.
- 🧪 Testing: Prithee, regard this as the proving grounds where robustness is forged, where bugs are vanquished, and the pipeline stands resilient.
- 🐳 Docker Enclosure: Witness the orchestration of containers, wherein each component dwelleth in isolation yet communicateth with precision, making the entire pipeline agile and portable.
- 📤 Data Extraction: Here lieth the cradle of our endeavor, whence data is lifted from its JSON confines and set forth upon its transformative journey.
- 🛠️ Data Transformation: The alchemy of the pipeline! Fields are cleansed, shaped, and readied for their destined purpose. Here, titles, authors, years, and sources are refined into their final glorious forms.
- 📊 Final Table: The culmination of all labors! Behold the tabular majesty, wherein the fruits of thy efforts (titles, authors, publication years, and more) stand ready to enlighten thy endeavors.
- 📩 Airflow Sorcery: Marvel at the enchanted scheduler, tirelessly orchestrating the pipeline's every step with grace and precision.

Here, in this tableau of wisdom, the ETL process cometh alive. Gaze upon its intricacies, for herein lieth not just a method but a marvel, where chaos is tamed and knowledge is borne.
- 🛠️ Extraction of Many Founts: Gathers knowledge from the OpenLibrary and Google Books APIs, like a wise scholar pulling treasures from ancient tomes.
- 🧹 Purification of Data: Cleanseth and enriches the raw information, ensuring it is fair and fit for study.
- 🐘 Integration with the Repository of Postgres: Deposits the bounty into a steadfast database for safekeeping and recall.
- 🛡️ Defenses and Logging of Errors: Implements vigilant sentinels to guard against mishaps and record the chronicles of the pipeline.
- ⏰ Automation of Timely Tasks: Employeth the magic of Airflow to schedule thy tasks, ensuring they commence with precision.
- 📩 Slack Heraldry: Dispatches messengers to announce the state of thine efforts in real-time.
- 📦 Encasement in Docker's Vessel: Encircles the pipeline in the aegis of Docker for deployment and scaling to lands far and wide.
- 🔍 Extraction:
  - Summoneth data from OpenLibrary and Google Books 📡.
  - Handles peculiarities of pagination and rate limits, like a skilled juggler with flaming torches.
- 🔄 Transformation:
  - Cleanseth and standardizes the records 🧼.
  - Resolves missing fields and maketh the data ready for usage.
- 📥 Loading:
  - Deposits the enriched bounty into Postgres' eternal vaults 💾.
  - Employeth conflict resolution to smite duplicate entries.
- 🔁 Orchestration:
  - Commands the dance of tasks through an Airflow DAG ♻️.
  - Schedules and retries with the wisdom of experience 🔄.
- 🐳 Containerization:
  - Packages all components within Docker's mighty vessel 📦.
  - Uses Docker Compose to steer the ships 🎛️.
- 🔔 Monitoring:
  - Announceth pipeline statuses via Slack 📲.
  - Airflow's interface reveals all activity 📊.
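The orchestration described above might be wired together, in broad strokes, as an Airflow DAG akin to the following. Mark well: this is but a sketch, not the repository's actual dags/book_etl_dag.py; the callable names transform_books_data and load_books_data, the schedule, and the retry settings are assumptions of mine.

```python
# Hypothetical sketch of a DAG like dags/book_etl_dag.py; names are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Stand-ins for the project's own extract/transform/load callables.
from extract import extract_books_data
from transform import transform_books_data
from load import load_books_data

default_args = {
    "owner": "airflow",
    "retries": 2,                        # retries, as the README's orchestration promises
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="book_etl_dag",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # assumed cadence, not confirmed by the source
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_books_data)
    transform = PythonOperator(task_id="transform", python_callable=transform_books_data)
    load = PythonOperator(task_id="load", python_callable=load_books_data)

    # The classic E -> T -> L ordering
    extract >> transform >> load
```

Thus each task marcheth in its appointed order, and Airflow handleth the retries should any falter.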
- 🐳 Docker & Docker Compose
- 🐍 Python 3.8 or above
- 📜 requirements.txt should provide thee with required Python incantations
- 🔔 Slack Token, shouldst thou seek notifications
- 🛠️ Basic wit in SQL and Python
- Cloneth the repository:

  ```shell
  git clone https://github.com/VBlackie/books_etl.git
  cd books_etl
  ```

- Declare Thy Secrets: Create a .env file with:

  ```shell
  POSTGRES_USER=airflow
  POSTGRES_PASSWORD=airflow
  POSTGRES_DB=books_db
  AIRFLOW__WEBSERVER__SECRET_KEY=<your_secret_key>
  AIRFLOW_ADMIN_USERNAME=admin
  AIRFLOW_ADMIN_PASSWORD=admin
  SLACK_CHANNEL=<your-slack-channel>
  SLACK_API_TOKEN=<your-slack-api-token>
  GOOGLE_BOOKS_API_KEY=<your-google-books-api-key>
  ```

- Raise Thy Docker Containers:

  ```shell
  docker-compose up --build
  ```

- Enter the Interface of Airflow:
  - Navigate to http://localhost:8080.
  - Credentials: Username: admin, Password: admin
- Query Thy Database and unearth its treasures. Shouldst thou prefer the sanctity of pgAdmin4, connect unto the database with the following credentials:
  - Host: localhost
  - Port: 5433
  - Username: airflow
  - Password: airflow
  - Database Name: books_db

  Alternatively, shouldst thou be inclined to use the command line:

  ```shell
  psql -h localhost -p 5433 -U airflow -d books_db
  ```
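For them that prefer Python to raw psql, the vault may be queried with a small script like the one below. Take heed: this is a hedged sketch, not code from the repository; the table name books and the psycopg2 driver are both assumptions of mine, while the credentials mirror those listed above.

```python
def recent_books_query(limit=10):
    # Build a parameterized query; the table name 'books' is an assumption.
    sql = ("SELECT title, author, published_date, source "
           "FROM books ORDER BY published_date DESC LIMIT %s")
    return sql, (limit,)

def fetch_recent_books(limit=10):
    # psycopg2 is assumed installed; credentials mirror the docker-compose setup above.
    import psycopg2
    sql, params = recent_books_query(limit)
    with psycopg2.connect(host="localhost", port=5433, user="airflow",
                          password="airflow", dbname="books_db") as conn:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
```

Call fetch_recent_books() once the containers are risen, and the freshest tomes shall be delivered unto thee as tuples.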
⚙️ Modify the Query to Suit Thy Quest

Dost thou seek knowledge beyond data engineering? Fear not, for the script is designed to be molded to thy whims! Within the sacred function extract_books_data in extract.py, thou shalt find the query:

```python
def extract_books_data():
    url = 'https://openlibrary.org/search.json?q=data+engineering'  # Focused query on data engineering
```

Replace 'data+engineering' with the essence of thy pursuit. Forsooth, be it "philosophy", "alchemy", or any subject dear to thee, the knowledge shall be fetched accordingly.
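By way of illustration, the query might be parameterized thusly. Know that this is a sketch of mine own devising, not the repository's actual extract.py: the subject argument and the build_search_url helper are hypothetical additions.

```python
def build_search_url(subject):
    # URL-encode the subject so 'data engineering' becomes 'data+engineering'.
    from urllib.parse import quote_plus
    return f"https://openlibrary.org/search.json?q={quote_plus(subject)}"

def extract_books_data(subject="data engineering"):
    # requests is assumed available, per requirements.txt.
    import requests
    response = requests.get(build_search_url(subject), timeout=30)
    response.raise_for_status()
    # OpenLibrary's search endpoint returns matching records under 'docs'.
    return response.json().get("docs", [])
```

Thus mayest thou summon extract_books_data("alchemy") or any subject dear to thee without touching the URL by hand.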
Shouldst thou wish to extend the reach of this mechanism, thou mayst craft a new script for extracting data. To ensure thy creation aligns with the grand pipeline, thou must honor the sacred format of the transform.py script. The records must be transformed thusly:
```python
transformed_data.append({
    'title': book['title'],
    'author': book['author'],
    'published_date': book['published_date'],
    'isbn': book['isbn'],
    'source': book['source']
})
```
This ensures the data from thy new source melds seamlessly with the rest of the enriched tome of knowledge.
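A new extractor might shape its bounty with a small helper such as this. Be advised: the helper's name and the 'Unknown' fallbacks are mine own assumptions for illustration, not code found in the repository.

```python
def to_pipeline_record(raw, source):
    # Coerce a raw item from any new source into the shape transform.py expects.
    return {
        'title': raw.get('title', 'Unknown'),
        'author': raw.get('author', 'Unknown'),
        'published_date': raw.get('published_date'),
        'isbn': raw.get('isbn'),
        'source': source,
    }
```

Thus every fount, however exotic, speaketh the same tongue ere its treasures reach the loader.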
Ere thou dost deploy thy customizations, ensure thy work withstandeth the trials of unit tests. Use the tests provided within the tests/ realm to confirm compatibility. The command to summon the trials is:

```shell
pytest tests/
```

Run this incantation within thy project's sanctuary to verify thy changes pass all scrutiny.
By following these steps, thou canst tailor this repository to serve thy most peculiar pursuits. Modify, expand, and testβthis pipeline shall bend to thy will whilst retaining its elegance and might.
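A trial for a new source might resemble this humble sketch. The file name, the dummy record, and its values are all illustrative inventions of mine, not part of the repository's tests/ realm.

```python
# tests/test_my_source.py -- a hypothetical example, not part of the repository.

# The fields that transform.py's record format demands, per the snippet above.
REQUIRED_FIELDS = {'title', 'author', 'published_date', 'isbn', 'source'}

def test_record_has_required_fields():
    # A dummy record standing in for one produced by thy new extractor.
    record = {
        'title': 'A Tome of Alchemy',
        'author': 'Anonymous',
        'published_date': '1602',
        'isbn': '0000000000',
        'source': 'my_new_source',
    }
    assert REQUIRED_FIELDS <= set(record)
```

Place such a trial beside its brethren in tests/, and pytest shall discover it of its own accord.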
```
📁 Books_ETL_Pipeline/
├── 📁 dags/
│   ├── book_etl_dag.py          # DAG of Airflow
│   ├── extract.py               # Gatherer of Data from OpenLibrary
│   ├── extract_from_google.py   # Gatherer of Data from GoogleBooks
│   ├── transform.py             # Purifier of Records
│   ├── load.py                  # Depositor of Information
│   └── slack_notifications.py   # Herald of Notifications
├── 📁 logs/                     # Chronicles of Airflow
├── 📁 plugins/                  # Custom Enhancements
├── 📁 tests/                    # Realm of Testing and Validation
│   ├── test_extract.py          # Examiner of Gatherer Logic
│   ├── test_transform.py        # Scrutinizer of Data Purification
│   ├── test_load.py             # Overseer of Data Deposition
│   └── test_etl_pipeline.py     # Examiner of Integrity
├── 🐳 docker-compose.yml        # Configuration of the Fleet
├── 📜 requirements.txt          # The Scroll of Dependencies
└── 🔐 .env                      # Hidden Secrets
```
- Scheduler Heartbeat Falters 🛠️:
  - Ensure Airflow volumes are intact.
  - Use docker system prune -f to cleanse thy setup.
- SQL Insert Woes 🐘:
  - Ensure the table schema matches the load.py script.
- Logs Vanish into the Ether 🧐:
  - Verify the volume mappings in docker-compose.yml.
- The Goblins of Slumber Delay Thy Database 💤:
  - At first run, the database machinery doth refuse to awaken promptly, for the goblins within linger in slumber.
  - Prithee, restart thy services twice, and lo, the machinery shall spring to life!
- The Discordant Slack Bug 🛡️:
  - At times, the herald of Slack refuseth to deliver messages due to network gremlins or ill-configured secrets.
  - In such cases, the channel variable (SLACK_CHANNEL) was hardcoded to bypass these quirks.
  - Prithee, shouldst thou encounter this discord, ensure:
    - Thy network doth allow Slack communication.
    - Thy Slack token and channel ID are both correct.
  - Alas, this issue remains unvanquished, requiring patience or thy workaround!
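Shouldst thou wish to test thy herald by hand, a minimal sketch follows. It assumes the slack_sdk package and the env vars from thy .env; the repository's own slack_notifications.py may well differ, and format_status is a helper of mine own invention.

```python
import os

def format_status(task, state):
    # Compose the herald's proclamation for a finished task.
    return f"ETL herald: task '{task}' finished in state '{state}'"

def notify_slack(message):
    # slack_sdk is an assumption; the repository may employ another client.
    from slack_sdk import WebClient
    client = WebClient(token=os.environ["SLACK_API_TOKEN"])
    # chat_postMessage raises SlackApiError on a bad token or channel,
    # which is a swift way to unmask ill-configured secrets.
    return client.chat_postMessage(channel=os.environ["SLACK_CHANNEL"], text=message)
```

Invoke notify_slack(format_status("load", "success")) from a Python shell inside the container to prove thy token and channel ere the pipeline runs.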
- 📚 Extend support to Goodreads or others.
- 📊 Bind the pipeline with Metabase for noble visualization.
- 📈 Enhance metadata reporting.
- 🔄 Embrace CI/CD for automated testing.

Contributions are welcome! Sharpen thy code and submit thy Pull Requests. Together, let us make this project legendary 🚀.
This project is bestowed under the MIT License. It is free to use, modify, and cherish.
A Glossary of Ye Olde Terms

Fear not, gentle reader, should the flowery language of this proclamation confound thee! Below is a humble guide to the more curious words thou mayst encounter within this hallowed text:

- Alas! – A cry of sorrow or regret, used to express lamentation. Example: "Alas! The goblins of slumber delay thy database!"
- Anon – Soon, shortly, in a little while. Example: "Deploy thy pipeline anon and uncover treasures untold!"
- Behold! – Look upon this with awe and wonder! Example: "Behold! The Diagrammatic Depiction of the ETL Pipeline!"
- Doth – An archaic form of 'does,' used for emphasis. Example: "Lo, this project doth exemplify the art of data engineering."
- Hark! – Pay heed! Listen well, for what follows is of utmost importance. Example: "Hark! This noble endeavor is fashioned to fetch and hold knowledge!"
- Hear ye! Hear ye! – An announcement or proclamation, commanding attention. Example: "Hear ye, hear ye! Gather thy gaze upon this most wondrous depiction!"
- Lo! – Behold! A word to draw attention to something noteworthy. Example: "Lo, this pipeline is not merely a tool but a masterwork!"
- Methinks – I believe, I consider, or it seems to me. Example: "Methinks this endeavor shall serve thee well in thy noble quest!"
- Prithee – I entreat thee, or I ask of thee. Example: "Prithee know, fair user, that thou mayest adapt its workings."
- Thou/Thy/Thee/Thine – You/Your/To You/Yours (respectively). Example: "Command thy pipeline and monitor thy logs with diligence."
- Verily – Truly, indeed, without a doubt. Example: "Verily, this mechanism is a marvel of data engineering!"