
PROJECT KATARA

Project Katara was built for the NASA Space Apps Challenge 2023 hackathon.

Katara is an ecosystem with Geoprocessing, 3D Map Visualization and a Large Language Model (Llama 2).

At first glance, it's just a chat with a globe, but the ecosystem goes far beyond what you can see.

The entire ecosystem is divided into five main layers: Geoprocessing, Frontend, Backend, AI (Large Language Model - Llama 2), Storage (LLAMA 7B, Water Datasets).

Click on the link for the complete flowchart of the Katara ecosystem.

Artificial Intelligence (AI) - LLAMA

Our LLaMa is built on a pre-converted model called Llama-2-7b-Chat-GGUF, which is Llama 2 converted to a file format called GGUF (GPT-Generated Unified Format).

We used the 7-billion-parameter chat model, the fine-tuned 7B variant that is optimized for dialogue use cases. The upstream repository distributes it in the Hugging Face Transformers format.

Our model has five main parts:

EMBEDDINGS: We used hkunlp/instructor-large.

Embeddings are representations of values or objects like text, images, and audio that are designed to be consumed by machine learning models and semantic search algorithms. They translate objects like these into a mathematical form according to the factors or traits each one may or may not have, and the categories they belong to.

Essentially, embeddings enable machine learning models to find similar objects. Given a photo or a document, a machine learning model that uses embeddings could find a similar photo or document. Since embeddings make it possible for computers to understand the relationships between words and other objects, they are foundational for artificial intelligence (AI).
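To make "finding similar objects" concrete, here is a minimal sketch of comparing embeddings with cosine similarity. The three-dimensional vectors and the `TOY_EMBEDDINGS` table are made up for illustration. A real model such as hkunlp/instructor-large produces much longer vectors, and this is not the code Katara runs.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this example.
TOY_EMBEDDINGS = {
    "river": [0.9, 0.1, 0.0],
    "lake": [0.8, 0.3, 0.1],
    "satellite": [0.0, 0.2, 0.9],
}

def most_similar(word):
    """Return the other word whose embedding is closest to `word`."""
    query = TOY_EMBEDDINGS[word]
    others = [w for w in TOY_EMBEDDINGS if w != word]
    return max(others, key=lambda w: cosine_similarity(query, TOY_EMBEDDINGS[w]))
```

With these toy vectors, `most_similar("river")` picks "lake" because the two vectors point in nearly the same direction, while "satellite" points elsewhere.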

DB - The database object responsible for holding the training data in memory so that the model can consume it later.

RETRIEVER - The retrieval step of Retrieval-Augmented Generation (RAG), an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into the LLMs' generative process.

LLM - Large Language Models are a core component of LangChain. LangChain does not serve its own LLMs, but rather provides a standard interface for interacting with many different LLMs.

RetrievalQA - Retrieval Question-Answering (QA) ties the parts above together: it retrieves the context relevant to a question and has the LLM extract the answer from that context.
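The five parts can be sketched as one small pipeline. This is a toy illustration, not Katara's implementation: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a placeholder that only echoes the retrieved context, and every name here (`InMemoryDB`, `retrieval_qa`, `DOCS`) is hypothetical rather than part of LangChain.

```python
import math
import re
from collections import Counter

def embed(text):
    """EMBEDDINGS (toy): a bag-of-words count vector, a stand-in for a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryDB:
    """DB: keeps (embedding, text) pairs in memory."""

    def __init__(self, documents):
        self.rows = [(embed(doc), doc) for doc in documents]

    def retrieve(self, question, k=1):
        """RETRIEVER: return the k stored texts most similar to the question."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: similarity(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

def fake_llm(prompt):
    """LLM placeholder: returns the context line instead of generating text."""
    return prompt.split("Context: ", 1)[1].split("\n", 1)[0]

def retrieval_qa(db, question, llm=fake_llm):
    """RetrievalQA: retrieve context, build a prompt, ask the LLM."""
    context = " ".join(db.retrieve(question))
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Hypothetical documents for illustration only.
DOCS = [
    "HydroLAKES contains lakes with a surface area of at least 10 ha.",
    "GeoServer publishes maps through the Web Map Service protocol.",
]
db = InMemoryDB(DOCS)
answer = retrieval_qa(db, "Which lakes does HydroLAKES contain?")
```

In a real deployment the placeholder would be replaced by a call to the Llama 2 model, and the prompt would carry the retrieved passages so the answer is grounded in them.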

Geoprocessing

Our geoprocessing layer was built on GeoServer, an open source map server, which the rest of the ecosystem consumes through the Web Map Service (WMS) protocol.

Our GeoServer instance has been loaded with maps from various sources, each used in accordance with its license.
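As an illustration of what consuming GeoServer over WMS looks like, the sketch below builds a standard WMS 1.1.1 GetMap request URL. The host `geoserver.example.org` and the layer name `katara:hydrorivers` are assumptions made up for this example. The actual endpoint and layer names of the Katara server are not given here.

```python
from urllib.parse import urlencode

def build_getmap_url(base_url, layer, bbox, width=512, height=256):
    """Build a WMS 1.1.1 GetMap URL for one layer.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty string selects the layer's default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        "TRANSPARENT": "true",
    }
    return f"{base_url}?{urlencode(params)}"

# Hypothetical host and layer name; substitute your own GeoServer values.
url = build_getmap_url(
    "https://geoserver.example.org/geoserver/wms",
    "katara:hydrorivers",
    (-180, -90, 180, 90),
)
```

Fetching the resulting URL would return a PNG image of the requested layer for the whole globe, which a map client can overlay on its base map.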

Maps:

Global River Classification (GloRiC): Provides river types and sub-classifications for all river stretches contained in the HydroRIVERS database.

HydroLAKES (Lake Polygons): Aims to provide the shoreline polygons of all global lakes with a surface area of at least 10 ha.

HydroRIVERS: Represents a vectorized line network of all global rivers that have a catchment area of at least 10 km² or an average river flow of at least 0.1 m³/sec, or both. HydroRIVERS has been extracted from the gridded HydroSHEDS core layers at 15 arc-second resolution.

HydroLAKES (Lake pour points): Provides the pour point (outlet location) of each lake in HydroLAKES, which covers all global lakes with a surface area of at least 10 ha.

General Bathymetric Chart of the Oceans (GEBCO): Aims to provide the most authoritative publicly available bathymetry data sets for the world’s oceans.

Earth Observatory (Water Vapor): The Earth Observatory is part of the EOS Project Science Office at NASA Goddard Space Flight Center.

Global Imagery Browse Services (GIBS): GIBS provides quick access to over 1,000 satellite imagery products, covering every part of the world. Most imagery is updated daily and is available within a few hours after satellite observation, and some products span almost 30 years. The satellite imagery can be rendered in your own web client or GIS application.

Socioeconomic Data and Applications Center (SEDAC): A data center in NASA's Earth Observing System Data and Information System (EOSDIS), hosted by CIESIN at Columbia University.

Terrestris: This service presents the data of the OpenStreetMap project in a clear and simple way.

Environmental Performance Index (SEDAC): The 2022 Environmental Performance Index (EPI) provides a data-driven summary of the state of sustainability around the world.

Backend

Our backend was created with two programming languages, Python and JavaScript. It has two parts:

RESTful API (Python): We used the FastAPI framework to build and document the HTTP interface to the LLaMa layer.

The documentation can be accessed from our Katara LLaMA repository, which is hosted on Hugging Face.

WebSocket API (JavaScript): We built our WebSocket server using Node.js. It is responsible for managing all the classrooms shared by teachers and students.
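The classroom bookkeeping behind such a server can be sketched as a registry of rooms and their members. The sketch is in Python only to keep the examples on this page in one language. The real server is written in Node.js, and the class name, method names, and message shape below are hypothetical, not taken from the Katara code.

```python
class ClassroomRegistry:
    """Tracks which connections belong to which classroom."""

    def __init__(self):
        self.rooms = {}  # room id -> {member name: send callback}

    def join(self, room_id, name, send):
        """Add a member; `send` is called with each message for that member."""
        self.rooms.setdefault(room_id, {})[name] = send

    def leave(self, room_id, name):
        """Remove a member and drop the room once it is empty."""
        members = self.rooms.get(room_id, {})
        members.pop(name, None)
        if not members:
            self.rooms.pop(room_id, None)

    def broadcast(self, room_id, sender, message):
        """Deliver a message to every member of the room; return the recipient count."""
        members = self.rooms.get(room_id, {})
        for send in members.values():
            send({"from": sender, "text": message})
        return len(members)

# Lists stand in for real WebSocket connections in this example.
teacher_inbox, student_inbox = [], []
registry = ClassroomRegistry()
registry.join("room-1", "teacher", teacher_inbox.append)
registry.join("room-1", "student", student_inbox.append)
delivered = registry.broadcast("room-1", "student", "Why are oceans salty?")
```

Broadcasting to the whole room, sender included, is one way to let everyone follow questions and answers in real time. A real server would pass each socket's send function instead of `list.append`.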

Frontend

Our Frontend was built to provide the best experience for students and teachers on desktop or mobile.

It is divided into three main parts:

Globe: Responsible for rendering all the maps for the user.

Classroom: Responsible for integrating teachers and students in a single shared room, so everyone can follow answers and questions in real time.

Student: Responsible for interacting with our LLaMa service, providing inputs (questions) and consuming outputs (answers).

Storage

Our AI was trained with data provided by NASA. The training process consisted of four steps: Data Screening, Data Capture, Data Processing, and finally processing by our LLaMa version 2 model.

The data was taken from the following sources:

Internal Data

LLaMa 2: Data from Meta's (formerly Facebook's) own model

Llama-2-7b-Chat-GGUF: Model that uses the GPT-Generated Unified Format

External Data

Earth Observatory

Environmental Performance Index (EPI)

Wikipedia

Climate Kids - NASA

Climate - NASA

Center for Science Education

Earthdata - NASA

HydroSHEDS

Food and Agriculture Organization of the United Nations - FAO
