
Port over PlanSearch from independent repo #1

Merged · 29 commits into main · Oct 18, 2024

Conversation

evanzwang (Contributor)

No description provided.

evanzwang and others added 29 commits July 10, 2024 01:34
* Add query util for search

* Rename query.py to queriers.py and small refactors

* Build out framework for basic prompting

* Fix small prompting bug with starter code; allow eval with codrm pipeline

* Add chain-of-thought

* Refactor into SearchModel for more inheritance

* Add backtranslating task

* Add timeout exception catching

* Add AnthropicQuerier and refactor querier.py

* Add option to toggle few shot and sys prompt

* Add experiment directory and refactor

* Refactor eval code into `scale_lcb_eval`

* Add logs to gitignore

* Add functionality for sample and private tests in `base_classes`

* Initial scaffolding for parsel

* Refactor `queriers.py` to take all functionality in LLMQuerier

* Refactor adding args

* Add caching to `queriers.py`

* Add parsel parameters

* Implement Parsel

* Delete unnecessary lines

* Add more unused code

* Change SYSTEM_PROMPT in generation in backtranslate task

* Add cache-file argument

* Add utility `create_dataset_with_sols` to create dataset for backtranslation

* Refactor utils for parsing into `parsing_utils.py`

Delete and refactor more unused utils

More deletions

Refactor functions

Further function refactoring

Refactor (mostly fn prompts)

Refactor `fn.py`

Refactor `Test` into `base_classes.py`

Refactor fn queries into `parsel_queries.py`

Small bugfixes

* Refactor `exec_utils.py` to outside `parsel`

* Add option to requery in `queriers.py` even with caching

* Add stdio test to `exec_utils.py`
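
For context, a minimal sketch of what a stdio test in `exec_utils.py` might look like: run the candidate program as a subprocess, pipe in the test's stdin, and compare stdout. All names here are hypothetical, not the repo's actual API.

```python
import subprocess
import sys

def run_stdio_test(code: str, stdin: str, expected_stdout: str, timeout: float = 10.0) -> bool:
    """Run candidate `code` with `stdin` piped in; pass iff stdout matches."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # per the earlier timeout-catching commit, treat a timeout as a failure
    return result.stdout.strip() == expected_stdout.strip()
```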

* Add simple filtering model
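
A "simple filtering model" here presumably keeps only the candidates that pass the public tests; a hedged sketch, reusing the hypothetical `run_stdio_test` above:

```python
def filter_candidates(candidates: list[str], public_tests: list[tuple[str, str]]) -> list[str]:
    """Keep only candidate programs that pass every public (stdin, expected_stdout) test."""
    return [
        code
        for code in candidates
        if all(run_stdio_test(code, i, o) for i, o in public_tests)
    ]
```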

* Replace returned Nones with an empty string

* Add logging to simple-filter

* Fix small bug in Test class

* Catch strange httpcore error

* Add simple idea model

* Allow for len of public tests to be 0 in exec_utils

* Refactor `self.queriers` and add `simple_idea_model` to eval

* Fix bug introduced by incorrect copy-pasting

* Add bulk querying for fn impls

* Account for indent in `filter_for_fn`

* Change hyperparameters

* Add integration tests in `scripts`

* Catch APIConnectionError in queriers.py

* Introduce `idea_filter_model`

* Catch internal server error

* Minor refactor

* Refactor and fix bugs in simple filtering
Made new class to take in `SearchModel`s and run filtering on top

* Add `JSONDecodeError` and `UnicodeDecodeError` to `queriers.py`

* Add separate temperatures for idea and code

* Add utils to merge cache files

* Revert gitignore

* Add .gitignore files
* Add observation model

* Increase backoff and add jitter
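
"Increase backoff and add jitter" typically amounts to a retry loop like the following; a sketch, not the repo's exact code (names hypothetical):

```python
import random
import time

def query_with_retries(query_fn, max_retries: int = 8, base_delay: float = 1.0, cap: float = 60.0):
    """Retry `query_fn` with exponential backoff plus random jitter."""
    for attempt in range(max_retries):
        try:
            return query_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # full jitter: sleep a random amount, capped by the exponential schedule
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```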

* Fix typo

* Add zero-shot to `simple_idea` and fix small bug in `filter_models`

* Adjust parameters for timeout

* Add CF submit util

* Add plotting utils

* Add partial prompts

* Fix simple filter with simple idea

* Refactor model selection dictionaries in `queriers.py`

* Add querier_utils.py that was forgotten in previous commit

* Add price tracking and exec warmup
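
Price tracking for LLM queries usually reduces to accumulating token usage times a per-million-token rate; a hypothetical sketch (the rates are placeholders, not real prices):

```python
from dataclasses import dataclass, field

@dataclass
class PriceTracker:
    """Accumulate cost from token usage; rates are illustrative, in $ per 1M tokens."""
    input_rate: float   # e.g. 2.50 (assumed placeholder, not a real price)
    output_rate: float  # e.g. 10.00 (assumed placeholder, not a real price)
    total_cost: float = field(default=0.0)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.total_cost += (
            input_tokens * self.input_rate + output_tokens * self.output_rate
        ) / 1_000_000
```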

* Add num_words to backtranslate as arg

* Refactor small names of querier

* Change warmup to 200

* Add Path makedir to cache.json in case dir doesn't exist

* Add vLLM support

* Add base deepseek lite model

* Fix small bugs in querier

* Fix small bug from calling torch.cuda.is_bf16_supported()

* Add new DeepSeek models and add base model functionality

* Fix small bug where final price was not output

* Generalize codeforces parsing

* Change parameters for vLLM

* Add pseudocode model

* Add GPT-4o-mini and fix bug in vLLM inference

* Add small section to backtranslate prompts

* Fix small bug where None was not supported for tests

* Add Python script to create taco dataset with nl solutions

* Fix bugs in `create_taco_backtranslate.py`

* Make requery default to True

* Fix bugs in create_taco_backtranslate

* Add notebooks

* Add functionality for custom local models

* Rename observation to simple observation

* Add no-intuition prompt to idea

* Add internal querier

* Fix bug in completion (no chat) code for basic

* Fix base prompting
* Add llama 3.1 8b and 70b as supported by scale llm engine

* Add scale LLM engine requery time

* Create synthetic TACO solutions from gpt-4o

* Catch more llm engine errors

* Fix exception catching

* Add combo observation first iteration
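
Per the PlanSearch approach, a "combo observation" model plausibly generates first-order observations about a problem and then prompts over small subsets (combinations) of them to derive ideas. A rough sketch of the combination step only, under that assumption:

```python
from itertools import combinations

def combo_subsets(observations: list[str], max_size: int = 2) -> list[tuple[str, ...]]:
    """Enumerate all subsets of observations up to `max_size` to seed idea prompts."""
    subsets: list[tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        subsets.extend(combinations(observations, size))
    return subsets
```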

* Add more error catching

* Log in combo observation and fix small bug

* Fix small logging in `combo_observation_model.py`

* Add second iteration of combo observation

* Add automatic batching if too large query
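
Automatic batching of oversized query lists is presumably plain chunking; a sketch with hypothetical names:

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split `items` into consecutive chunks of at most `batch_size`."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

def query_all(query_fn, prompts: list[str], batch_size: int = 256) -> list[str]:
    """Issue queries in batches so no single request grows too large."""
    results: list[str] = []
    for batch in batched(prompts, batch_size):
        results.extend(query_fn(batch))
    return results
```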

* Add pbar to querying

* Fix small bug where iteration 2 is not called

* Fix bug where problems is not updated to be expanded

* Tweak exponential backoff

* Partially change `combo_ratio_finder.ipynb` to start gathering performance numbers
* Add num_workers to code exec

* Add num_shots to simple prompting

* Add num words and sys prompt option to idea search

* Idea prompt fixes

* Add space to prompt

* Change querying params

* Add llama-3.1

* Add file to create datasets from generations

* Add `format` arg for chat vs completion manual change
python research/hugh/open_weights_code/eval_all_checkpoints.py --s3_checkpoint_location=s3://scale-ml/hugh/rlxf/dpo/meta_llama3.1-8b_731-5e6_1/checkpoints/ --s3_output_location=openweightscode/eval_results/dpo_731-5e6_1 --num-gpus=4 --max_checkpoints=20

Co-authored-by: Hugh Zhang <[email protected]>
* Add `fail_codes` to Problem

* Unify datasets and merge generate/eval pipelines

* Add exec args

* Auto-detect optional features

* Unify create_test_bank.py

* Rename parse_dataset_utils to dataset_utils

* Script to clean tests and add public tests

* Increase slightly the exponential backoff

* Add make dataset utilities

* Add pyproject.toml for better imports

* Add __init__.py

* Commit dataset and misc scripts/notebooks

* Add `load_json.ipynb` for completeness

* Add code for taco_cleaner

* Add format_taco_data.py

* Remove unneeded content

* Remove older create generation dataset scripts

* Add code to create idea solve graphs

* Fix small bugs and add `fn_args_join`

* Add timeout to queriers and add better heuristics in `add_public_tests`

* Remove line from debugging

* Improve add_public_test heuristic filters

* Refactor into CompletionCache class
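
A `CompletionCache` class along these lines would key completions by a hash of the prompt plus sampling parameters and persist them to the cache file; a hedged sketch of one possible shape (hypothetical API, assumes JSON-serializable params):

```python
import hashlib
import json
from pathlib import Path

class CompletionCache:
    """Persist LLM completions keyed by a hash of (prompt, params)."""

    def __init__(self, cache_file: str):
        self.path = Path(cache_file)
        self.path.parent.mkdir(parents=True, exist_ok=True)  # cf. the earlier makedir commit
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, params: dict):
        return self.data.get(self._key(prompt, params))

    def put(self, prompt: str, params: dict, completion: str) -> None:
        self.data[self._key(prompt, params)] = completion
        self.path.write_text(json.dumps(self.data))
```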

* Fix minor keywarg typo

* Small changes to `add_public_tests.py`

* Change model to be customizable

* [UNFINISHED] Partially port `queriers.py` to a thread pool

* Update `search` imports

* Edit `add_public_tests.py` for better prompts

* Change plot over time

* Add `filter_lcb_data.py`

* Add `filter_zero_public_tests.py`

* Add thread pool to querier
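
Together with the earlier "Add pbar to querying" commit, the querier's thread pool is presumably something like the following (hypothetical names; `tqdm` assumed for the progress bar):

```python
from concurrent.futures import ThreadPoolExecutor

from tqdm import tqdm

def query_in_parallel(query_fn, prompts: list[str], num_workers: int = 16) -> list[str]:
    """Fan prompts out over a thread pool, preserving order, with a progress bar."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(tqdm(pool.map(query_fn, prompts), total=len(prompts)))
```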

* Add new OOP-based queriers

* Add notebook to parse D to F data

* Add `num-gpu` to eval.py args

* Small

* Small bugfixes in `query_clients.py` and add `model_configs`

* Add `query_clients.py` modularity to `queriers.py` and upstream

* Remove unneeded large dicts

* Remove unneeded parts of `querier_utils.py`

* Add custom vLLM config demo

* Fix minor merge bug

* Remove default arguments at lower levels
dpo_basic_config.json contains a basic successful DPO checkpoint

Infra to launch RLXF sweeps (in alpha). Also infra to launch basic eval
scripts in eval_all_checkpoints.py

Co-authored-by: Hugh Zhang <[email protected]>
* Add format and merge taco notebook

* Update code_exec_reqs to requery
* Add SGLang querying functionality

* Add example model_config

* Improve SGLang and write `simultaneous_eval.sh` script

* Fix combo obs to be analyzable

* Small assert to make sure testbank URL is right

* Switch `make_dpo_dataset.py` model config to fix

* Add greater sglang

* Add parsing of starter_code to fix testbank bug

* Adjust minor details to `dataset_utils.py` and `query_clients.py`

* Add reward model functionality to `make_dpo_dataset.py`

* Update configs

* Add utility to choose gpus in simultaneous eval

* Add DataParallel to `reward_model_utils.py`

* Add PARAMS to each `query_client` and minor refactor

* Add `model_config_utils.py` to add overwrite args

* Add better model args for all existing search methods

* Fix minor bug in `make_dpo_dataset`

* Add query parameters to query logs

* Relax float constraint on query client price

* Fix Anthropic `query_client.py` and adjust rate limits

* Reduce amount of printing in `query_clients`

* Add `story` prompt method inside `one_prompt_models.py`

Rename `basic_prompting.py` to `one_prompt_models.py`

* Delete unneeded files

* Adjust imports in `make_dpo_dataset.py`

* Add `stringify` and `unstringify` to prepare for other tests

* Fix small `simple_filter_models.py` bug

* Add new one_prompt methods, add exec_string to Problem

* Add notebooks to parse MBPP and HumanEval

* Add modifications in `exec_utils.py` for exec_string

* Small change, can delete lol

* Add `batch_apply_on_nested_list` to `python_utils.py`
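
A utility named `batch_apply_on_nested_list` plausibly flattens a nested list, applies one batched call, and restores the original shape; a sketch under that assumption:

```python
def batch_apply_on_nested_list(fn, nested: list[list]):
    """Apply a batched `fn` to all inner elements at once, then restore nesting."""
    flat = [item for inner in nested for item in inner]
    outputs = fn(flat)  # one batched call instead of one per inner list
    result, i = [], 0
    for inner in nested:
        result.append(outputs[i : i + len(inner)])
        i += len(inner)
    return result
```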

* Refactor `combo_observation_model.py`

* Add notebooks to parse *_plus datasets

* Add `completions-from-model` arg

To not necessarily run completion-limit × as many problems

* Small changes to `queriers.py` to accommodate `tuple` convos
* Change `F_mbpp_plus` and `create_test_bank.py` to support exec_string

* Add `parse_orig_lcb.ipynb` for creating new LCB

* Update `parse_orig_lcb.ipynb` to have _C dataset

* Add `map_nary_fn_on_nested_list` to `python_utils`. Add
`check_similar.py`

* Save test results with underscore between name and test

* Add `deepseek-coder` functionality

* Add LLMEngine 405b

* Change `check_similar.py` to "general" instead of "specific" idea

* Change querying to: idea for code -> Yes/No

* Add script to run most exps

* Fix `simple_filter_models.py` for non-list Test public tests

* Fix Anthropic querying

* Fix bug with testhash and human_eval (add fn_name to exec)

* Add `search.` imports

* Add utility to re-eval a previous results.json.gz file

* Fix small, auxiliary bug on the test hash fn_name problem

* Edit `metrics.py` to support public test filtering

* Fix another auxiliary bug within scripts/re_eval_codes.py

* Add graphing notebook for final results for completeness

* Better infra for graphing

* Fix small human_eval_plus public test issue

* Change `generate_solutions` to return `list[list[str]]`

* Update `check_similar.py` to take in `args.cache-file`

* Update `parse_mbpp_plus` to include 3 additional, better tests

* Catch Anthropic BadRequest

* Commit notebook changes (?? unsure what changed)

* Add `base_classes` change for `generate_solutions` super method change

* Add SGLang completions functionality

* Remove print("HI") in `query_clients.py`

* Add "This NL solution is wrong" prompt in `combo_observation`

* Add new changes to graph pass@k notebooks
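
The pass@k graphs presumably use the standard unbiased estimator from Chen et al. (2021): with n samples of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). In code:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k). Requires k <= n."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a failing draw of size k
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```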

* Update `scripts/run_exps_...`

* Add `baby-deepseek` models with SGLang

* Fix small assertion bug in `query_clients.py`

* (combo obs) Add another step to merge fixes with original nl solution

* Change small type hint (you can ignore this commit)

* Update model_configs and scripts

* Add Fireworks and Together

* Update `check_similar.py` to requery if badrequest

* Comment out line that requires k to be less than min n_comp

* Undo `combo_observation_model.py` enhanced 'fix' prompting

* Update `run_exps` scripts

* Update `graph_notebooks`

* Add `caches` folder to gitignore
* Augment testing using more processes

* Adjust Firework parameters

* Change `exec-public` arg to `exec-type`

* Change `re_eval_codes.py` to reflect test execution changes

* Add checkpointing to `re_eval_codes` for cache

* Add right vs wrong in check similar

* Catch rare case where OpenAI logprobs is null

* Add graph plotters for paper push

* Make `re_eval_codes.py` auto-checkpoint

* Mildly change prompts

* Add fireworks llama 70b

* Subtly change the way n_completions works.
(Should have been with the other prompt commit)

* Update scripts

* Add o1

* Update graphing notebooks

* Update gitignore
* Refactor combo obs with observation node, add num layers

* Add arg to fix the idea or not

* Refactor prompt generation fns to `prompts` folder

* Add --without-pseudocode flag to combo obs

* Add without-idea arg

* Add unincluded prompts

* Split the num-completion arg into 2

* Add graphing utilities for ablations

hughbzhang (Contributor) left a comment:

This is Evan's full PlanSearch code.

@marcos-f7z merged commit 16df032 into main on Oct 18, 2024 (1 check passed).
@evanzwang deleted the master branch on October 18, 2024 at 22:05.