This project was inspired by not wanting to pay to use someone else's historical weather API. After running out of a free trial for the weather.com API, I got to thinking: the many paid weather APIs out there can't all be funding weather stations or satellites and collecting their own data, right? I assume most of them take data from a public source such as NOAA and sell it processed into an easily accessible API, so I attempted to do the same thing on my own.
To begin with, I used a shell script to download a year's worth of data from the NOAA RAP weather model, which offers weather data at a resolution of 1 hour and every 12 km over the coverage area - around 250 GB of raw data in total. This weather data is stored in the GRIB2 file format, which I parsed using the Python modules xarray (with the cfgrib engine) and pandas. After loading the parameters I was interested in into a pandas DataFrame, I was able to insert each data point as a row in a MariaDB SQL table. This table is hosted on Amazon Web Services' RDS service - adding a year of data came to nearly 300 million individual rows and took nearly two days. Overall, this project taught me a lot about database administration and its challenges.
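For illustration, here is a minimal sketch of the kind of xarray/cfgrib load that gets one GRIB2 file into a pandas DataFrame (the file name and filter keys are assumptions for the sake of the example; the real RAP files expose many more fields):

```python
import xarray as xr

# Open one RAP GRIB2 file with the cfgrib engine; filtering by level type
# keeps cfgrib from mixing incompatible vertical coordinates in one dataset.
ds = xr.open_dataset(
    "rap_130_20220701_0000_000.grb2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "heightAboveGround", "level": 2}},
)

# Flatten the 2-D grid into one row per grid point for loading into SQL.
df = ds.to_dataframe().reset_index()
print(df.head())
```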
Lastly, I knew I wanted this database to be automatically updated, not just store a year's worth of data from the time it was initially created. I envisioned a first-in, first-out cache, where every day the most recent data is added automatically and the oldest day of data is removed (ideally all data would stay and the database would only grow, but storage limits dictate that it can only hold one year at a time). To accomplish this, I utilized two serverless AWS Lambda functions - one is a downloader that uses Selenium to scrape the latest data from NOAA's web form and save it to an S3 storage bucket. The other is a parser, which uses a modified version of my local parser to process the file in the S3 bucket, post it to the database, and then empty the bucket after parsing. These functions are scheduled to execute daily using AWS CloudWatch Events.
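The rotation step itself is straightforward; here is a minimal sketch of the kind of cleanup the parser function could run after each daily insert (connection details, the retention window, and the exact statement are assumptions):

```python
import mysql.connector

# Connect to the RDS instance (credentials here are placeholders).
conn = mysql.connector.connect(
    host="my-rds-endpoint", user="admin", password="PASSWORD", database="weather_db"
)
cursor = conn.cursor()

# First-in, first-out: after the newest day is inserted, drop anything
# older than 365 days so total storage stays roughly constant.
cursor.execute("DELETE FROM weather WHERE time_start < NOW() - INTERVAL 365 DAY")
conn.commit()

cursor.close()
conn.close()
```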
Here is the first image I produced after processing the data (using matplotlib): a map of temperature over the coverage area.
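A map like that takes very little code to draw; this sketch assumes a DataFrame df holding a single forecast hour, with latitude, longitude, and t columns named as in the insert query later in the post:

```python
import matplotlib.pyplot as plt

# Scatter every grid point, colored by temperature, for a quick look at
# the coverage area without any real map projection work.
plt.figure(figsize=(10, 6))
sc = plt.scatter(df["longitude"], df["latitude"], c=df["t"], s=1, cmap="coolwarm")
plt.colorbar(sc, label="Temperature (K)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("RAP temperature over the coverage area")
plt.savefig("temperature_map.png", dpi=150)
```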
I used the MyISAM engine over InnoDB because, even though it is older and less performant on some queries, it uses less disk space.
Data types for each column in the table (designed to reduce data storage where possible)
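To give an idea of what that looks like, here is a rough sketch of a table definition along these lines (the column names come from the insert query below; the exact types and the open cursor are assumptions based on my notes about shrinking storage):

```python
# Compact table definition: MyISAM engine and 4-byte floats where possible.
create_table = """
CREATE TABLE weather (
    time_start DATETIME NOT NULL,   -- forecast valid time
    latitude   FLOAT NOT NULL,      -- grid point latitude
    longitude  FLOAT NOT NULL,      -- grid point longitude
    t          FLOAT,               -- temperature
    gust       FLOAT,               -- wind gust
    sde        FLOAT,               -- snow depth
    prate      FLOAT,               -- precipitation rate
    ltng       FLOAT                -- lightning
) ENGINE=MyISAM
"""
cursor.execute(create_table)
```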
First step for scraping the latest data from the NOAA data portal:
```python
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Open the NOAA HAS order form for the RAP130 dataset
driver.get('https://www.ncei.noaa.gov/has/HAS.FileAppRouter?datasetname=RAP130&subqueryby=STATION&applname=&outdest=FILE')

# Fill in the email field and submit the order
input_element = driver.find_element(By.NAME, "emailadd")
input_element.send_keys("[email protected]", Keys.RETURN)
submit_button = driver.find_element(By.CSS_SELECTOR, ".HASButton:first-child")
submit_button.click()

# The confirmation page contains an order ID (HAS######) that determines
# the URL where the requested files will be staged for download
source = driver.page_source
regex = re.compile(r'HAS\d+')
match = regex.search(source)
end_url = 'http://www1.ncdc.noaa.gov/pub/has/model/' + match[0]
```
Inserting data into the SQL table:
```python
query = """INSERT INTO weather (time_start, latitude, longitude, t, gust, sde, prate, ltng)
           VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"""

records = []
for row in rows[8000:130000]:  # this range contains the bounding box coords
    time_start, time_stop, latitude, longitude, t, gust, sde, prate, crain, ltng = row
    # Keep only grid points inside a rough bounding box around the continental U.S.
    if 22.262387 <= latitude <= 50.648574 and -127.640625 <= longitude <= -64.359375:
        records.append((time_start, latitude, longitude, t, gust, sde, prate, ltng))

cursor.executemany(query, records)
```
Building the ecCodes C library for use in my parsing Docker container and installing Python dependencies:
```dockerfile
# ecCodes source archive to build from
ENV ECCODES_URL=https://confluence.ecmwf.int/download/attachments/45757960/eccodes-2.10.0-Source.tar.gz?api=v2 \
    ECCODES_VERSION=eccodes-2.10.0-Source

# Download and unpack the ecCodes source
RUN cd /tmp && wget --output-document=${ECCODES_VERSION}.tar.gz ${ECCODES_URL} && \
    tar -zxvf ${ECCODES_VERSION}.tar.gz

# Build ecCodes with CMake (Fortran disabled, pointed at the system Python) and install it
RUN cd /tmp/${ECCODES_VERSION} && mkdir build && cd build && \
    cmake -DENABLE_FORTRAN=false \
          -DPYTHON_LIBRARY_DIR=/usr/lib64/python3.6 \
          -DPYTHON_INCLUDE_DIR=/usr/include/python3.6m \
          -DPYTHON_EXECUTABLE=/usr/bin/python3 .. && \
    make -j2 && make install

# Python dependencies for the parser
RUN pip install pandas xarray boto3 mysql.connector cfgrib eccodes
```
I initially tried to download each file manually in the browser, but that took far too much time. Next I tried a few Chrome download manager extensions, but those were poorly designed and confusing to use. In the end, I realized I could make my own 'download manager', so I learned a bit of shell scripting to automate the task efficiently.
I knew from the start that my goal of loading a year of data into the database would make space a challenge, especially on the AWS free tier. One step I took after realizing I might exceed that limit was to switch from my initial PostgreSQL database to MariaDB. While Postgres offers more features and is a more modern database, MariaDB was a real help on size: the same data took about 60% of the storage it had needed in Postgres.
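For reference, this is the kind of query I can run to check how much storage the table actually occupies in MariaDB (the schema and table names are assumptions):

```python
# Ask information_schema how much space the weather table takes up.
cursor.execute(
    """SELECT ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 2)
       FROM information_schema.TABLES
       WHERE TABLE_SCHEMA = 'weather_db' AND TABLE_NAME = 'weather'"""
)
print("Table size (GB):", cursor.fetchone()[0])
```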
With a database this size, I knew I would need some indices to improve query runtime. I chose to add the indexes after the table was created, but one issue I ran into was that the CREATE INDEX process copies the entire table before executing, which meant I would go over my allotted 20 GB on the AWS free tier. Here is a plot of the remaining storage on my database instance while I was trying to add indices (y-axis is remaining space in MB).

After multiple CREATE INDEX attempts failed due to running out of space, I realized I would have to cut my database size to keep the combined size of the table, the table copy, and the indices under 20 GB. To cut the size, I dropped the column 'crain', a boolean saying whether or not it was raining. This freed just 0.3 GB, so I made a bigger change: altering all weather attribute columns from REAL (8 bytes) to FLOAT (4 bytes), which dropped a whole 2 GB! I then added a composite index on time, latitude, and longitude, which cut a sample query's runtime from ~13s to ~300ms.
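Roughly, the statements involved looked like this (reconstructed from the description above as a sketch, not the literal commands I ran):

```python
# Drop the boolean rain flag to free a little space.
cursor.execute("ALTER TABLE weather DROP COLUMN crain")

# Shrink the weather attribute columns from 8-byte to 4-byte floats.
for col in ("t", "gust", "sde", "prate", "ltng"):
    cursor.execute(f"ALTER TABLE weather MODIFY {col} FLOAT")

# Composite index on time + location, which is what typical queries filter on.
cursor.execute(
    "CREATE INDEX idx_time_lat_lon ON weather (time_start, latitude, longitude)"
)
```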
AWS Lambda functions rely on "layers" for any dependencies not included in the Python runtime, and the total size of these layers is limited to just 250 MB. Because I was using some fairly heavy dependencies (the Chrome driver, cfgrib, pandas), I ran over this limit. The workaround I found is that if you package your functions as Docker container images, you get 10 GB of space for each image, which was more than enough for me. It was tough to get the Dockerfile right for the parser function, since it required some C libraries built from source, but I eventually got working containers for both of my functions to run serverless in the cloud.
My latest idea for this project is another Lambda function that will run daily, executing SQL queries for various weather records and extremes (e.g. hottest average day, highest snow depth) and saving the results to a secondary MySQL table. On the frontend, users will be able to select a record they want to view, and the site will direct them to the right time and place to see that data without running a heavy SQL query every time someone asks for it. Here are a few examples of records I found interesting (as of 2/1/2023); a sketch of one such query follows the list.
- Hottest average day in the U.S. in the last year - July 20, 2022 (83°F average)
- Highest individual temperature in the U.S. in the last year - July 17, 2022 (122°F in Death Valley, CA). This was not the real record, which was 127°F on September 1, 2022, also in Death Valley (I suspect that by only capturing every third hour for the database, I missed that point)
- Lowest individual temperature - February 13, 2022 (-45°F at Lake Nipigon, Ontario, Canada)
- Highest individual snow depth - April 19, 2022 (25 ft at the Lillooet Icefield, British Columbia, Canada)
- Windiest day - September 28, 2022 (Hurricane Ian approaching Florida)
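For illustration, a record like the hottest average day could come from a query along these lines (a sketch; the real queries and the secondary records table are still in progress, and depending on how t is stored the result may still need converting from Kelvin to Fahrenheit for display):

```python
# Find the calendar day with the highest average temperature across all grid points.
cursor.execute(
    """SELECT DATE(time_start) AS day, AVG(t) AS avg_temp
       FROM weather
       GROUP BY DATE(time_start)
       ORDER BY avg_temp DESC
       LIMIT 1"""
)
print(cursor.fetchone())
```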