
Port over PlanSearch from independent repo #1

Merged · 29 commits into main · Oct 18, 2024

Conversation

evanzwang (Contributor)

No description provided.

evanzwang and others added 29 commits July 10, 2024 01:34
* Add query util for search

* Rename query.py to queriers.py and small refactors

* Build out framework for basic prompting

* Fix small prompting bug with starter code; allow eval with codrm pipeline

* Add chain-of-thought

* Refactor into SearchModel for more inheritance

* Add backtranslating task

* Add timeout exception catching

* Add AnthropicQuerier and refactor querier.py

* Add option to toggle few shot and sys prompt

* Add experiment directory and refactor

* Refactor eval code into `scale_lcb_eval`

* Add logs to gitignore

* Add functionality for sample and private tests in `base_classes`

* Initial scaffolding for parsel

* Refactor `queriers.py` to take all functionality in LLMQuerier

* Refactor adding args

* Add caching to `queriers.py`

* Add parsel parameters

* Implement Parsel

* Delete unnecessary lines

* Add more unused code

* Change SYSTEM_PROMPT in generation in backtranslate task

* Add cache-file argument

* Add utility `create_dataset_with_sols` to create dataset for backtranslation

* Refactor utils for parsing into `parsing_utils.py`

Delete and refactor more unused utils

More deletions

Refactor functions

Further function refactoring

Refactor (mostly fn prompts)

Refactor `fn.py`

Refactor `Test` into `base_classes.py`

Refactor fn queries into `parsel_queries.py`

Small bugfixes

* Refactor `exec_utils.py` to outside `parsel`

* Add option to requery in `queriers.py` even with caching

* Add stdio test to `exec_utils.py`
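
For context, a minimal sketch of what a stdio test in `exec_utils.py` might look like: run the candidate program as a subprocess, pipe in the test's stdin, and compare stdout. All names here are hypothetical, not the repo's actual API.

```python
import subprocess
import sys

def run_stdio_test(code: str, stdin: str, expected_stdout: str, timeout: float = 10.0) -> bool:
    """Run candidate `code` with `stdin` piped in; pass iff stdout matches."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # per the earlier timeout-catching commit, treat a timeout as a failure
    return result.stdout.strip() == expected_stdout.strip()
```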

* Add simple filtering model
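
A "simple filtering model" here presumably keeps only the candidates that pass the public tests; a hedged sketch, reusing the hypothetical `run_stdio_test` above:

```python
def filter_candidates(candidates: list[str], public_tests: list[tuple[str, str]]) -> list[str]:
    """Keep only candidate programs that pass every public (stdin, expected_stdout) test."""
    return [
        code
        for code in candidates
        if all(run_stdio_test(code, i, o) for i, o in public_tests)
    ]
```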

* Replace returned Nones with an empty string

* Add logging to simple-filter

* Fix small bug in Test class

* Catch strange httpcore error

* Add simple idea model

* Allow for len of public tests to be 0 in exec_utils

* Refactor `self.queriers` and add `simple_idea_model` to eval

* Fix bug introduced by incorrect copy-pasting

* Add bulk querying for fn impls

* Account for indent in `filter_for_fn`

* Change hyperparameters

* Add integration tests in `scripts`

* Catch APIConnectionError in queriers.py

* Introduce `idea_filter_model`

* Catch internal server error

* Minor refactor

* Refactor and fix bugs in simple filtering
Made new class to take in `SearchModel`s and run filtering on top

* Add `JSONDecodeError` and `UnicodeDecodeError` to `queriers.py`

* Add separate temperatures for idea and code

* Add utils to merge cache files

* Revert gitignore

* Add .gitignore files
* Add observation model

* Increase backoff and add jitter
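
"Increase backoff and add jitter" typically amounts to a retry loop like the following; a sketch, not the repo's exact code (names hypothetical):

```python
import random
import time

def query_with_retries(query_fn, max_retries: int = 8, base_delay: float = 1.0, cap: float = 60.0):
    """Retry `query_fn` with exponential backoff plus random jitter."""
    for attempt in range(max_retries):
        try:
            return query_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep a random amount, capped by the exponential schedule
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```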

* Fix typo

* Add zero-shot to `simple_idea` and fix small bug in `filter_models`

* Adjust parameters for timeout

* Add CF submit util

* Add plotting utils

* Add partial prompts

* Fix simple filter with simple idea

* Refactor model selection dictionaries in `queriers.py`

* Add querier_utils.py that was forgotten in previous commit

* Add price tracking and exec warmup
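
Price tracking for LLM queries usually reduces to accumulating token usage times a per-million-token rate; a hypothetical sketch (the rates are placeholders, not real prices):

```python
from dataclasses import dataclass, field

@dataclass
class PriceTracker:
    """Accumulate cost from token usage; rates are illustrative, in $ per 1M tokens."""
    input_rate: float   # e.g. 2.50 (assumed placeholder, not a real price)
    output_rate: float  # e.g. 10.00 (assumed placeholder, not a real price)
    total_cost: float = field(default=0.0)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.total_cost += (
            input_tokens * self.input_rate + output_tokens * self.output_rate
        ) / 1_000_000
```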

* Add num_words to backtranslate as arg

* Refactor small names of querier

* Change warmup to 200

* Add Path makedir to cache.json in case dir doesn't exist

* Add vLLM support

* Add base deepseek lite model

* Fix small bugs in querier

* Fix small bug from calling torch.cuda.is_bf16_supported()

* Add new DeepSeek models and add base model functionality

* Fix small bug where final price was not output

* Generalize codeforces parsing

* Change parameters for vLLM

* Add pseudocode model

* Add GPT-4o-mini and fix bug in vLLM inference

* Add small section to backtranslate prompts

* Fix small bug where None was not supported for tests

* Add Python script to create taco dataset with nl solutions

* Fix bugs in `create_taco_backtranslate.py`

* Make requery default to True

* Fix bugs in create_taco_backtranslate

* Add notebooks

* Add functionality for custom local models

* Rename observation to simple observation

* Add no-intuition prompt to idea

* Add internal querier

* Fix bug in completion (no chat) code for basic

* Fix base prompting
* Add llama 3.1 8b and 70b as supported by scale llm engine

* Add scale LLM engine requery time

* Create synthetic TACO solutions from gpt-4o

* Catch more llm engine errors

* Fix exception catching

* Add combo observation first iteration
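
Per the PlanSearch approach, a "combo observation" model plausibly generates first-order observations about a problem and then prompts over small subsets (combinations) of them to derive ideas. A rough sketch of the combination step only, under that assumption:

```python
from itertools import combinations

def combo_subsets(observations: list[str], max_size: int = 2) -> list[tuple[str, ...]]:
    """Enumerate all subsets of observations up to `max_size` to seed idea prompts."""
    subsets: list[tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        subsets.extend(combinations(observations, size))
    return subsets
```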

* Add more error catching

* Log in combo observation and fix small bug

* Fix small logging in `combo_observation_model.py`

* Add second iteration of combo observation

* Add automatic batching if too large query
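
Automatic batching of oversized query lists is presumably plain chunking; a sketch with hypothetical names:

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split `items` into consecutive chunks of at most `batch_size`."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

def query_all(query_fn, prompts: list[str], batch_size: int = 256) -> list[str]:
    """Issue queries in batches so no single request grows too large."""
    results: list[str] = []
    for batch in batched(prompts, batch_size):
        results.extend(query_fn(batch))
    return results
```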

* Add pbar to querying

* Fix small bug where iteration 2 is not called

* Fix bug where problems is not updated to be expanded

* Tweak exponential backoff

* Partially change `combo_ratio_finder.ipynb` to start gathering performance numbers
* Add num_workers to code exec

* Add num_shots to simple prompting

* Add num words and sys prompt option to idea search

* Idea prompt fixes

* Add space to prompt

* Change querying params

* Add llama-3.1

* Add file to create datasets from generations

* Add `format` arg for chat vs completion manual change
python research/hugh/open_weights_code/eval_all_checkpoints.py --s3_checkpoint_location=s3://scale-ml/hugh/rlxf/dpo/meta_llama3.1-8b_731-5e6_1/checkpoints/ --s3_output_location=openweightscode/eval_results/dpo_731-5e6_1 --num-gpus=4 --max_checkpoints=20

Co-authored-by: Hugh Zhang <[email protected]>
* Add `fail_codes` to Problem

* Unify datasets and merge generate/eval pipelines

* Add exec args

* Auto-detect optional features

* Unify create_test_bank.py

* Rename parse_dataset_utils to dataset_utils

* Script to clean tests and add public tests

* Increase slightly the exponential backoff

* Add make dataset utilities

* Add pyproject.toml for better imports

* Add __init__.py

* Commit dataset and misc scripts/notebooks

* Add `load_json.ipynb` for completeness

* Add code for taco_cleaner

* Add format_taco_data.py

* Remove unneeded content

* Remove older create generation dataset scripts

* Add code to create idea solve graphs

* Fix small bugs and add `fn_args_join`

* Add timeout to queriers and add better heuristics in `add_public_tests`

* Remove line from debugging

* Improve add_public_test heuristic filters

* Refactor into CompletionCache class
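
A `CompletionCache` class along these lines would key completions by a hash of the prompt plus sampling parameters and persist them to the cache file; a hedged sketch of one possible shape (hypothetical API, assumes JSON-serializable params):

```python
import hashlib
import json
from pathlib import Path

class CompletionCache:
    """Persist LLM completions keyed by a hash of (prompt, params)."""

    def __init__(self, cache_file: str):
        self.path = Path(cache_file)
        self.path.parent.mkdir(parents=True, exist_ok=True)  # cf. the earlier makedir commit
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, params: dict):
        return self.data.get(self._key(prompt, params))

    def put(self, prompt: str, params: dict, completion: str) -> None:
        self.data[self._key(prompt, params)] = completion
        self.path.write_text(json.dumps(self.data))
```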

* Fix minor keywarg typo

* Small changes to `add_public_tests.py`

* Change model to be customizable

* [UNFINISHED] Partially port `queriers.py` to a thread pool

* Update `search` imports

* Edit `add_public_tests.py` for better prompts

* Change plot over time

* Add `filter_lcb_data.py`

* Add `filter_zero_public_tests.py`

* Add thread pool to querier
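
Together with the earlier "Add pbar to querying" commit, the querier's thread pool is presumably something like the following (hypothetical names; `tqdm` assumed for the progress bar):

```python
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm

def query_in_parallel(query_fn, prompts: list[str], num_workers: int = 16) -> list[str]:
    """Fan prompts out over a thread pool, preserving order, with a progress bar."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(tqdm(pool.map(query_fn, prompts), total=len(prompts)))
```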

* Add new OOP-based queriers

* Add notebook to parse D to F data

* Add `num-gpu` to eval.py args

* Small

* Small bugfixes in `query_clients.py` and add `model_configs`

* Add `query_clients.py` modularity to `queriers.py` and upstream

* Remove unneeded large dicts

* Remove unneeded parts of `querier_utils.py`

* Add custom vLLM config demo

* Fix minor merge bug

* Remove default arguments at lower levels
dpo_basic_config.json contains a basic successful DPO checkpoint

Infra to launch RLXF sweeps (in alpha). Also infra to launch basic eval
scripts in eval_all_checkpoints.py

Co-authored-by: Hugh Zhang <[email protected]>
* Add format and merge taco notebook

* Update code_exec_reqs to requery
* Add SGLang querying functionality

* Add example model_config

* Improve SGLang and write `simultaneous_eval.sh` script

* Fix combo obs to be analyzable

* Small assert to make sure testbank URL is right

* Switch `make_dpo_dataset.py` model config to fix

* Add greater sglang

* Add parsing of starter_code to fix testbank bug

* Adjust minor details to `dataset_utils.py` and `query_clients.py`

* Add reward model functionality to `make_dpo_dataset.py`

* Update configs

* Add utility to choose gpus in simultaneous eval

* Add DataParallel to `reward_model_utils.py`

* Add PARAMS to each `query_client` and minor refactor

* Add `model_config_utils.py` to add overwrite args

* Add better model args for all existing search methods

* Fix minor bug in `make_dpo_dataset`

* Add query parameters to query logs

* Relax float constraint on query client price

* Fix Anthropic `query_client.py` and adjust rate limits

* Reduce amount of printing in `query_clients`

* Add `story` prompt method inside `one_prompt_models.py`

Rename `basic_prompting.py` to `one_prompt_models.py`

* Delete unneeded files

* Adjust imports in `make_dpo_dataset.py`

* Add `stringify` and `unstringify` to prepare for other tests

* Fix small `simple_filter_models.py` bug

* Add new one_prompt methods, add exec_string to Problem

* Add notebooks to parse MBPP and HumanEval

* Add modifications in `exec_utils.py` for exec_string

* Small change, can delete lol

* Add `batch_apply_on_nested_list` to `python_utils.py`
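
A utility named `batch_apply_on_nested_list` plausibly flattens a nested list, applies one batched call, and restores the original shape; a sketch under that assumption:

```python
def batch_apply_on_nested_list(fn, nested: list[list]):
    """Apply a batched `fn` to all inner elements at once, then restore nesting."""
    flat = [item for inner in nested for item in inner]
    outputs = fn(flat)  # one batched call instead of one per inner list
    result, i = [], 0
    for inner in nested:
        result.append(outputs[i : i + len(inner)])
        i += len(inner)
    return result
```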

* Refactor `combo_observation_model.py`

* Add notebooks to parse *_plus datasets

* Add `completions-from-model` arg

To not necessarily run completion-limit × as many problems

* Small changes to `queriers.py` to accommodate `tuple` convos
* Change `F_mbpp_plus` and `create_test_bank.py` to support exec_string

* Add `parse_orig_lcb.ipynb` for creating new LCB

* Update `parse_orig_lcb.ipynb` to have _C dataset

* Add `map_nary_fn_on_nested_list` to `python_utils`. Add
`check_similar.py`

* Save test results with underscore between name and test

* Add `deepseek-coder` functionality

* Add LLMEngine 405b

* Change `check_similar.py` to "general" instead of "specific" idea

* Change querying to: idea for code -> Yes/No

* Add script to run most exps

* Fix `simple_filter_models.py` for non-list Test public tests

* Fix Anthropic querying

* Fix bug with testhash and human_eval (add fn_name to exec)

* Add `search.` imports

* Add utility to re-eval a previous results.json.gz file

* Fix small, auxiliary bug on the test hash fn_name problem

* Edit `metrics.py` to support public test filtering

* Fix another auxiliary bug within scripts/re_eval_codes.py

* Add graphing notebook for final results for completeness

* Better infra for graphing

* Fix small human_eval_plus public test issue

* Change `generate_solutions` to return `list[list[str]]`

* Update `check_similar.py` to take in `args.cache-file`

* Update `parse_mbpp_plus` to include 3 additional, better tests

* Catch Anthropic BadRequest

* Commit notebook changes (?? unsure what changed)

* Add `base_classes` change for `generate_solutions` super method change

* Add SGLang completions functionality

* Remove print("HI") in `query_clients.py`

* Add "This NL solution is wrong" prompt in `combo_observation`

* Add new changes to graph pass@k notebooks
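
The pass@k graphs presumably use the standard unbiased estimator from Chen et al. (2021): with n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). In code:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k). Requires k <= n."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a failing draw of size k
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```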

* Update `scripts/run_exps_...`

* Add `baby-deepseek` models with SGLang

* Fix small assertion bug in `query_clients.py`

* (combo obs) Add another step to merge fixes with original nl solution

* Change small type hint (you can ignore this commit)

* Update model_configs and scripts

* Add Fireworks and Together

* Update `check_similar.py` to requery if badrequest

* Comment out line that requires k to be less than min n_comp

* Undo `combo_observation_model.py` enhanced 'fix' prompting

* Update `run_exps` scripts

* Update `graph_notebooks`

* Add `caches` folder to gitignore
* Augment testing using more processes

* Adjust Firework parameters

* Change `exec-public` arg to `exec-type`

* Change `re_eval_codes.py` to reflect test execution changes

* Add checkpointing to `re_eval_codes` for cache

* Add right vs wrong in check similar

* Catch rare case where OpenAI logprobs is null

* Add graph plotters for paper push

* Make `re_eval_codes.py` auto-checkpoint

* Mildly change prompts

* Add fireworks llama 70b

* Subtly change the way n_completions works.
(Should have been with the other prompt commit)

* Update scripts

* Add o1

* Update graphing notebooks

* Update gitignore
* Refactor combo obs with observation node, add num layers

* Add arg to fix the idea or not

* Refactor prompt generation fns to `prompts` folder

* Add --without-pseudocode flag to combo obs

* Add without-idea arg

* Add unincluded prompts

* Split the num-completion arg into 2

* Add graphing utilities for ablations

hughbzhang (Contributor) left a comment:

This is Evan's full PlanSearch code.

@marcos-f7z merged commit 16df032 into main on Oct 18, 2024 (1 check passed).
@evanzwang deleted the master branch on October 18, 2024 at 22:05.