Feasibility Study: AI‐powered build time adaptor generation
- Project Overview:
OpenFn is a suite of data integration, interoperability, and business process automation (i.e., workflow) tools used by governments, NGOs, and social enterprises in the health and humanitarian sectors. It enables users to connect any system and comes with adaptors (i.e. connectors) for over 70 apps. Developing a new adaptor requires technical skills, and the diversity of systems to integrate with makes it a lengthy process.
- Objective:
The primary objective of this project is to streamline the development process by automating adaptor (code) generation tasks, enhancing developer productivity while maintaining code quality and adherence to best programming practices within OpenFn. A platform leveraging LLMs is to be built that understands the adaptor requirements in a job and returns a testable JavaScript adaptor. The initial proposal is represented in the lower two levels of the diagram below.
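To make the intended flow concrete, the sketch below (in Python, the planned implementation language) shows one way the platform could accept a job's requirements and return generated adaptor code. All names and the LLM client interface are hypothetical illustrations, not existing OpenFn components.

```python
# Minimal sketch of the proposed flow; AdaptorRequest, generate_adaptor and the
# llm client interface are hypothetical, not part of the existing OpenFn codebase.
from dataclasses import dataclass

@dataclass
class AdaptorRequest:
    business_requirement: str   # templated workflow description
    api_docs_url: str           # link to the target system's API documentation
    function_signatures: str    # technical definitions agreed before development

def generate_adaptor(request: AdaptorRequest, llm) -> str:
    """Build a prompt from the request and return generated adaptor code as text."""
    prompt = (
        "Write an OpenFn JavaScript adaptor operation.\n"
        f"Requirement: {request.business_requirement}\n"
        f"API docs: {request.api_docs_url}\n"
        f"Signatures:\n{request.function_signatures}\n"
    )
    # `llm` stands for any client exposing a generate(prompt) -> str method
    # (assumed interface); the returned string is then linted and tested.
    return llm.generate(prompt)
```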
- Scope:
The initial scope of the project involves conducting an in-depth feasibility study and delivering a working prototype within the 15.10.2023 - 15.12.2023 timeframe. The development encompasses the integration of cutting-edge LLMs, including GPT and StarCoder, for natural language understanding and code synthesis specific to JavaScript development environments. Given this scope, the use of pre-trained models and prompt engineering will be preferred over fine-tuning.
- Data Requirements:
The project requires OpenFn datasets (adaptor and CLI repositories), language usage patterns, potential model inputs, and documentation related to JavaScript frameworks, libraries, and syntax.
- Data Sources:
Model inputs: Links to API documentation (source and target systems), e.g. Mailchimp.
Business requirements, e.g. "Given a list of {Campaign Members} from {Salesforce}, {add or update} {Member} in {Mailchimp}". The most frequent workflow/requirement types are to be gathered from Aissatou; a sketch of how such a templated requirement could be structured as a model input follows the technical definitions example below.
Technical specifications: Given the example above, this refers to the properties of the members in Salesforce and Mailchimp needed for the task. This can potentially be extracted from the APIs; however, from what I have learned this week, it usually involves all stakeholders and can require a manual process. An example in a spreadsheet can be found here.
Technical definitions: This refers to the function definitions and adaptor description agreed before development. An example from GitHub:
```js
// Gets a drive reference and writes to state.drives.default
// @param specifier: specify the drive to get. default 'me'.
// If an id/owner pair, will return the default drive for the specified owner
// @param name: optional name for the drive. Will be written to state.drives[name].
getDrive(specifier, name, callback)
// Examples:
getDrive({}, 'joe'); // get my drive and name it joe
getDrive('me', 'joe'); // get my drive and name it joe
getDrive({ id: 'driveId' }, 'joe'); // Get a drive by Id
getDrive({ id: '<siteId>', owner: 'sites'}, 'joe'); // get default drive for a site
getDrive({ id: '<userId>', owner: 'users' }); // get default drive for a user
// return an array of folder contents
// We should safely URI encode the path
// Path must start with /, else will be treated as an id
// writes to state.data
// uses default drive
getFolder(pathOrId, { options }, (callback = s => s));
// return a single file's content (pass { metadata: true } to get metadata only)
// We should safely URI encode the path
// Path must start with /, else will be treated as an id
// writes to state.data
// uses default drive
getFile(pathOrId, { options }, callback);
// writes one or more files to the path provided
writeFiles(pathOrId, { options }, data, callback)
// loose spec for getFile/writeFiles options
const options = {
driveName: 'string', // drive of the file to use (optional)
metadata: false, // return metadata, not content
$filter: 'string' // filter on the query. Needs to be URI encoded
};
```
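To illustrate how a templated business requirement like the one listed above could be standardised as a structured model input, a minimal sketch follows; the field names and schema are assumptions to be agreed with OpenFn during the prototype.

```python
# Hypothetical structured form of the templated requirement; the exact schema
# is an assumption, not an existing OpenFn format.
requirement = {
    "source_system": "Salesforce",
    "source_entity": "Campaign Members",
    "operation": "add or update",
    "target_entity": "Member",
    "target_system": "Mailchimp",
}

TEMPLATE = (
    "Given a list of {source_entity} from {source_system}, "
    "{operation} {target_entity} in {target_system}."
)

print(TEMPLATE.format(**requirement))
# -> Given a list of Campaign Members from Salesforce, add or update Member in Mailchimp.
```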
- Data Preprocessing: The above data is to be preprocessed and standardised for model evaluation.
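A minimal sketch of this standardisation step, assuming the collected inputs (API docs, requirements, technical definitions) are stored as text/markdown files and normalised into a single JSON-lines file; the paths, file layout, and record keys are assumptions.

```python
# Illustrative preprocessing sketch: normalise collected documents into one
# JSON-lines file that the evaluation harness can iterate over.
import json
from pathlib import Path

def build_records(raw_dir: str, out_file: str) -> int:
    records = []
    for path in sorted(Path(raw_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        records.append({
            "source": path.name,
            "kind": path.stem.split("_")[0],  # e.g. "requirement", "spec", "definition"
            "text": " ".join(text.split()),   # collapse whitespace for consistent prompts
        })
    with open(out_file, "w", encoding="utf-8") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")
    return len(records)
```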
- Development Tools: Integration of development tools and libraries, including TensorFlow, PyTorch, and Hugging Face, will facilitate development, AI model evaluation, and deployment.
- Language Models:
GPT-3, DistilBERT, StarCoder, CodeT5, and other Transformer-based models will be employed to enable natural language processing, code generation, and contextual understanding of JavaScript syntax and semantics. This list is to be expanded for the initial evaluation and narrowed down by the end of the project.
- Pretrained Models:
Exploration of pretrained LLMs, especially those fine-tuned on JavaScript-specific datasets, to facilitate more accurate and context-aware code generation.
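As an illustration, a pretrained code model could be exercised through the Hugging Face `transformers` pipeline as sketched below; the checkpoint name is a placeholder to be replaced by whichever models are shortlisted during the evaluation.

```python
# Illustrative use of a pretrained code model via Hugging Face transformers.
# The checkpoint is an assumed placeholder; shortlisted models (StarCoder,
# CodeT5 variants, etc.) would be swapped in during the evaluation.
from transformers import pipeline

MODEL_NAME = "Salesforce/codet5p-220m"  # placeholder checkpoint

generator = pipeline("text2text-generation", model=MODEL_NAME)

prompt = "// OpenFn operation: upsert a Mailchimp member from Salesforce data\n"
result = generator(prompt, max_new_tokens=128)
print(result[0]["generated_text"])
```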
- Model Customization:
Pre-trained models are not capable of fully comprehending proprietary standards out of the box (e.g. using OpenFn language-common). The prototype will focus on achieving the best performance with pre-trained models.
In the longer term, it is recommended to evaluate the potential benefits of customising the final models by fine-tuning them on proprietary OpenFn JavaScript datasets to enhance their ability to understand and generate domain-specific code.
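A sketch of the prompt-engineering alternative used by the prototype: prepend a few OpenFn-style examples (e.g. language-common usage) to each request so that a general pretrained model imitates the house conventions. The embedded snippet is illustrative only and not taken verbatim from the adaptor repositories.

```python
# Sketch of prompt engineering in place of fine-tuning: provide OpenFn-style
# context in the prompt itself. The JavaScript example below is illustrative.
LANGUAGE_COMMON_EXAMPLE = """
// Example OpenFn operation style (illustrative, not from the repositories):
get('/patients', { query: { updatedSince: dataValue('lastRun') } }, state => {
  console.log(`Fetched ${state.data.length} patients`);
  return state;
});
""".strip()

def build_prompt(requirement: str) -> str:
    return (
        "You generate OpenFn JavaScript adaptor operations.\n"
        "Follow the style of this example:\n"
        f"{LANGUAGE_COMMON_EXAMPLE}\n\n"
        f"Requirement: {requirement}\n"
        "Operation:\n"
    )
```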
- Technical Constraints:
Integration complexities and performance limitations associated with large-scale AI models may pose technical challenges during the implementation phase. To mitigate this, the project is to be developed in Python, use plug-and-play libraries such as Hugging Face to minimise deployment effort, and conduct iterative evaluations with OpenFn.
- Data Security and Privacy:
Safeguarding sensitive data and ensuring compliance with data protection regulations are critical concerns throughout the project lifecycle. To address this, no customer data will be used and all data sources should be transparent.
- Model Performance Reliability:
Ensuring the reliability and accuracy of code generation outputs while minimizing bias and ensuring interpretability are crucial aspects requiring careful consideration. See the evaluation section.
- Resource Allocation:
An ML engineer working full-time on the project.
Access to a GPU with 32 GB RAM (for models that are to be hosted locally for testing).
- Performance Metrics:
Performance factors include code correctness, syntactic coherence, semantic relevance, and adherence to JavaScript coding standards and best practices. Precision, Recall, and F1-score will serve as key performance metrics to assess the accuracy and reliability of the AI-generated JavaScript code.
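One simple way these metrics could be computed for the prototype is token-level overlap between generated and reference code, as sketched below; this is an assumption about the evaluation harness, and execution-based checks and human review may complement it.

```python
# One possible precision/recall/F1 computation: token-level overlap between the
# generated snippet and a reference implementation. A deliberate simplification.
from collections import Counter

def token_prf(generated: str, reference: str) -> tuple[float, float, float]:
    gen, ref = Counter(generated.split()), Counter(reference.split())
    overlap = sum((gen & ref).values())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```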
- Testing Methodology:
Rigorous unit testing and integration testing protocols will be established to assess the system's consistency and adaptability to varying business scenarios. The various inputs above will serve to measure adaptability. The system is to be tested on code generation from business requirements as well as from technical specifications.
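As one example of an automated check the test harness could include, the sketch below writes generated code to a temporary file and asks Node.js to verify that it parses (`node --check`); functional unit tests against real adaptor behaviour would build on top of this. The helper name is illustrative.

```python
# Sketch of an automated syntax check for generated adaptor code: write it to a
# temporary .mjs file and use Node's --check flag to verify that it parses.
import subprocess
import tempfile

def parses_as_javascript(code: str) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".mjs", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["node", "--check", path], capture_output=True)
    return result.returncode == 0
```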
- User Acceptance Testing (UAT):
User acceptance testing will involve soliciting feedback from software developers at OpenFn to evaluate the functionality and overall effectiveness of the AI-powered build-time adaptor generation.
- Continuous Improvement:
Continuous monitoring, feedback integration, and periodic model fine-tuning will be integral components of the project's roadmap to ensure ongoing enhancements and adaptability to dynamic JavaScript development trends. In the context of the prototype, sufficient documentation is to be provided for monitoring the developed system.
TBD
- Milestones: Define clear project milestones and deliverables to monitor progress and facilitate effective project management.
Week 1: Analysis of the project, gathering requirements, CLI & adaptor walkthroughs, scaffold repo setup (see the GitHub README).
Week 2: Initial inference microservice, job-based code generation service development, AI session in Berlin.
- Gantt Chart: Create a detailed Gantt chart outlining the project timeline, tasks, dependencies, and critical deadlines to establish a clear roadmap for project execution.
This is not relevant for the project, but I believe enough information should be provided for OpenFn to later conduct cost-benefit analysis. Any requirements not in this document should be expressed by the stakeholders.
- Communication Channels: Communication is mainly carried out on Slack. The following channels are also used:
  - Zenhub (Backlog, Milestones and Tasks)
  - Github (Repositories + Wiki documentation)
- Status Reports and Updates: Establishing a regular reporting schedule for status updates, project milestones, and key achievements to keep stakeholders informed and engaged in the project's development and success. The project will follow one-week sprints, with a weekly call scheduled for Mondays to align the goals for the week, a quick sync on Wednesdays to address any immediate concerns, and an update on Fridays to review the progress made and the completion of tasks within the sprint. This approach ensures consistent communication and transparency throughout the project's lifecycle.
- Ethical Framework: TBD