diff --git a/Llama-Guard/MODEL_CARD.md b/Llama-Guard/MODEL_CARD.md index a18124021..944a593e1 100644 --- a/Llama-Guard/MODEL_CARD.md +++ b/Llama-Guard/MODEL_CARD.md @@ -1,18 +1,31 @@ # Model Details -Llama-Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based input-output safeguard model. It can be used for classifying content in both LLM inputs (prompt classification) and in LLM responses (response classification). +Llama Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based +input-output safeguard model. It can be used for classifying content in both LLM +inputs (prompt classification) and in LLM responses (response classification). -It acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe/unsafe, and if unsafe based on a policy, it also lists the violating subcategories. Here is an example: +It acts as an LLM: it generates text in its output that indicates whether a +given prompt or response is safe/unsafe, and if unsafe based on a policy, it +also lists the violating subcategories. Here is an example: -![](Llama-Guard_example.png) +![](Llama Guard_example.png) -In order to produce classifier scores, we look at the probability for the first token, and turn that into an “unsafe” class probability. Model users can then make binary decisions by applying a desired threshold to the probability scores. +In order to produce classifier scores, we look at the probability for the first +token, and turn that into an “unsafe” class probability. Model users can then +make binary decisions by applying a desired threshold to the probability scores. # Training and Evaluation ## Training Data -We use a mix of prompts that come from the Anthropic [dataset](https://github.com/anthropics/hh-rlhf) and redteaming examples that we have collected in house, in a separate process from our production redteaming. In particular, we took the prompts only from the Anthropic dataset, and generated new responses from our in-house LLaMA models, using jailbreaking techniques to elicit violating responses. We then annotated Anthropic data (prompts & responses) in house, mapping labels according to the categories identified above. Overall we have ~13K training examples. +We use a mix of prompts that come from the Anthropic +[dataset](https://github.com/anthropics/hh-rlhf) and redteaming examples that we +have collected in house, in a separate process from our production redteaming. +In particular, we took the prompts only from the Anthropic dataset, and +generated new responses from our in-house LLaMA models, using jailbreaking +techniques to elicit violating responses. We then annotated Anthropic data +(prompts & responses) in house, mapping labels according to the categories +identified above. Overall we have ~13K training examples. ## Taxonomy of harms and Risk Guidelines @@ -27,14 +40,16 @@ the following components: Together with this model, we release an open taxonomy inspired by existing open taxonomies such as those employed by Google, Microsoft and OpenAI in the hope -that it can be useful to the community. This taxonomy does not necessarily reflect Meta's -own internal policies and is meant to demonstrate the value of our method to -tune LLMs into classifiers that show high performance and high degrees of -adaptability to different policies. +that it can be useful to the community. 
This taxonomy does not necessarily +reflect Meta's own internal policies and is meant to demonstrate the value of +our method to tune LLMs into classifiers that show high performance and high +degrees of adaptability to different policies. -### The Llama-Guard Safety Taxonomy & Risk Guidelines +### The Llama Guard Safety Taxonomy & Risk Guidelines -Below, we provide both the harm types themselves under this taxonomy and also examples of the specific kinds of content that would be considered harmful under each category: +Below, we provide both the harm types themselves under this taxonomy and also +examples of the specific kinds of content that would be considered harmful under +each category: - **Violence & Hate** encompasses statements that encourage or could help people plan or engage in violence. Similarly, statements that advocate @@ -85,6 +100,6 @@ in our paper: [LINK TO PAPER]. | | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) | | --------------- | --------------------- | ---------- | --------- | ----------------------- | -| Llama-Guard | **0.945** | 0.847 | **0.626** | **0.953** | +| Llama Guard | **0.945** | 0.847 | **0.626** | **0.953** | | OpenAI API | 0.764 | **0.856** | 0.588 | 0.769 | | Perspective API | 0.728 | 0.787 | 0.532 | 0.699 | diff --git a/Llama-Guard/README.md b/Llama-Guard/README.md index fb0ad6208..c196f99f6 100644 --- a/Llama-Guard/README.md +++ b/Llama-Guard/README.md @@ -1,42 +1,42 @@ -# GuardLlama +# Llama Guard -GuardLlama is a new experimental model that provides input and output guardrails -for LLM deployments. +Llama Guard is a new experimental model that provides input and output +guardrails for LLM deployments. # Download -In order to download the model weights and tokenizer, please visit the Meta -website and accept our License. +In order to download the model weights and tokenizer, please visit the +[Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) +and accept our License. Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download. Pre-requisites: Make sure you have wget and md5sum installed. Then to run the -script: ./download.sh. +script: `./download.sh`. Keep in mind that the links expire after 24 hours and a certain amount of -downloads. If you start seeing errors such as 403: Forbidden, you can always +downloads. If you start seeing errors such as `403: Forbidden`, you can always re-request a link. -# Access on HuggingFace - -[TODO CHANGE LINK] We are also providing downloads on Hugging Face. You must -first request a download from the Meta website using the same email address as -your Hugging Face account. After doing so, you can request access to any of the -models on Hugging Face and within 1-2 days your account will be granted access -to all versions. - # Quick Start -TODO to be written. +Since Llama Guard is a fine-tuned Llama-7B model (see our +[model card](MODEL_CARD.md) for more information), the same quick start steps +outlined in our +[README file](https://github.com/facebookresearch/llama/blob/main/README.md) for +Llama2 apply here. + +In addition to that, we added examples using Llama Guard in the +[Llama 2 recipes repository](https://github.com/facebookresearch/llama-recipes). 
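+
+Shown below is a minimal sketch of prompt and response classification, assuming
+the weights are loaded through Hugging Face `transformers` and that the
+tokenizer ships a chat template matching the prompt format described in the
+[model card](MODEL_CARD.md); the model id and generation settings are
+illustrative, not prescriptive.
+
+```python
+# Hypothetical quick-start sketch: assumes the Llama Guard weights are usable
+# through Hugging Face transformers and that the tokenizer provides a chat
+# template that produces the Llama Guard prompt format.
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "meta-llama/LlamaGuard-7b"  # illustrative id; use the one you were granted
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+
+def moderate(chat):
+    # Llama Guard replies with "safe", or "unsafe" followed by the violated
+    # category codes, for the last turn of the conversation.
+    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
+    output = model.generate(
+        input_ids=input_ids, max_new_tokens=64, pad_token_id=tokenizer.eos_token_id
+    )
+    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
+
+print(moderate([{"role": "user", "content": "I forgot how to tie a tie. Can you help?"}]))
+```
+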
# Issues -Please report any software “bug”, or other problems with the models through one -of the following means: +Please report any software bug, or other problems with the models through one of +the following means: -- Reporting issues with the GuardLlama model: +- Reporting issues with the Llama Guard model: [github.com/facebookresearch/purplellama](github.com/facebookresearch/purplellama) - Reporting issues with Llama in general: [github.com/facebookresearch/llama](github.com/facebookresearch/llama) @@ -57,4 +57,4 @@ as our accompanying [Acceptable Use Policy](USE_POLICY). # References -Research Paper: [TODO ADD LINK] +[Research Paper](https://ai.facebook.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) diff --git a/Llama-Guard/download.sh b/Llama-Guard/download.sh index 044624fd1..9b7332986 100644 --- a/Llama-Guard/download.sh +++ b/Llama-Guard/download.sh @@ -1,70 +1,19 @@ #!/usr/bin/env bash - # Copyright (c) Meta Platforms, Inc. and affiliates. -# This software may be used and distributed according to the terms of the Llama 2 Community License Agreement. +# +# This source code is licensed under the MIT license found in the +# LICENSE file in the root directory of this source tree. set -e - read -p "Enter the URL from email: " PRESIGNED_URL echo "" -read -p "Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all: " MODEL_SIZE TARGET_FOLDER="." # where all files should end up mkdir -p ${TARGET_FOLDER} - -if [[ $MODEL_SIZE == "" ]]; then - MODEL_SIZE="7B,13B,70B,7B-chat,13B-chat,70B-chat" -fi - echo "Downloading LICENSE and Acceptable Usage Policy" wget --continue ${PRESIGNED_URL/'*'/"LICENSE"} -O ${TARGET_FOLDER}"/LICENSE" wget --continue ${PRESIGNED_URL/'*'/"USE_POLICY.md"} -O ${TARGET_FOLDER}"/USE_POLICY.md" - echo "Downloading tokenizer" wget --continue ${PRESIGNED_URL/'*'/"tokenizer.model"} -O ${TARGET_FOLDER}"/tokenizer.model" -wget --continue ${PRESIGNED_URL/'*'/"tokenizer_checklist.chk"} -O ${TARGET_FOLDER}"/tokenizer_checklist.chk" -CPU_ARCH=$(uname -m) - if [ "$CPU_ARCH" = "arm64" ]; then - (cd ${TARGET_FOLDER} && md5 tokenizer_checklist.chk) - else - (cd ${TARGET_FOLDER} && md5sum -c tokenizer_checklist.chk) - fi - -for m in ${MODEL_SIZE//,/ } -do - if [[ $m == "7B" ]]; then - SHARD=0 - MODEL_PATH="llama-2-7b" - elif [[ $m == "7B-chat" ]]; then - SHARD=0 - MODEL_PATH="llama-2-7b-chat" - elif [[ $m == "13B" ]]; then - SHARD=1 - MODEL_PATH="llama-2-13b" - elif [[ $m == "13B-chat" ]]; then - SHARD=1 - MODEL_PATH="llama-2-13b-chat" - elif [[ $m == "70B" ]]; then - SHARD=7 - MODEL_PATH="llama-2-70b" - elif [[ $m == "70B-chat" ]]; then - SHARD=7 - MODEL_PATH="llama-2-70b-chat" - fi - - echo "Downloading ${MODEL_PATH}" - mkdir -p ${TARGET_FOLDER}"/${MODEL_PATH}" - - for s in $(seq -f "0%g" 0 ${SHARD}) - do - wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/consolidated.${s}.pth"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/consolidated.${s}.pth" - done - - wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/params.json"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/params.json" - wget --continue ${PRESIGNED_URL/'*'/"${MODEL_PATH}/checklist.chk"} -O ${TARGET_FOLDER}"/${MODEL_PATH}/checklist.chk" - echo "Checking checksums" - if [ "$CPU_ARCH" = "arm64" ]; then - (cd ${TARGET_FOLDER}"/${MODEL_PATH}" && md5 checklist.chk) - else - (cd ${TARGET_FOLDER}"/${MODEL_PATH}" && md5sum -c checklist.chk) - fi -done +mkdir -p ${TARGET_FOLDER}"/llama-guard" +wget --continue 
${PRESIGNED_URL/'*'/"consolidated.00.pth"} -O ${TARGET_FOLDER}"/llama-guard/consolidated.00.pth" +wget --continue ${PRESIGNED_URL/'*'/"params.json"} -O ${TARGET_FOLDER}"/llama-guard/params.json" diff --git a/README.md b/README.md index c0ce6f95c..4fe723e22 100644 --- a/README.md +++ b/README.md @@ -3,29 +3,104 @@
- 🤗 Hugging Face   | Blog   | Website   | CyberSec Eval Paper   | Llama Guard Paper  
+ 🤗 Models on Hugging Face  | Blog  | Website  | CyberSec Eval Paper   | Llama Guard Paper 
---------------------------------------------------------------------------------
+---
+
# Purple Llama
-Purple Llama is a an umbrella project that over time will bring together tools and evals to help the community build responsibly with open generative AI models. The initial release will include tools and evals for Cyber Security and Input/Output safeguards but we plan to contribute more in the near future.
+
+Purple Llama is an umbrella project that, over time, will bring together tools
+and evals to help the community build responsibly with open generative AI
+models. The initial release will include tools and evals for Cyber Security and
+Input/Output safeguards, but we plan to contribute more in the near future.
## Why purple?
-Borrowing a [concept](https://www.youtube.com/watch?v=ab_Fdp6FVDI) from the cybersecurity world, we believe that to truly mitigate the challenges which generative AI presents, we need to take both attack (red team) and defensive (blue team) postures. Purple teaming, composed of both red and blue team responsibilities, is a collaborative approach to evaluating and mitigating potential risks and the same ethos applies to generative AI and hence our investment in Purple Llama will be comprehensive.
+
+Borrowing a [concept](https://www.youtube.com/watch?v=ab_Fdp6FVDI) from the
+cybersecurity world, we believe that to truly mitigate the challenges that
+generative AI presents, we need to take both attack (red team) and defensive
+(blue team) postures. Purple teaming, composed of both red and blue team
+responsibilities, is a collaborative approach to evaluating and mitigating
+potential risks. The same ethos applies to generative AI, which is why our
+investment in Purple Llama will be comprehensive.
## License
-Components within the Purple Llama project will be licensed permissively enabling both research and commercial usage. We believe this is a major step towards enabling community collaboration and standardizing the development and usage of trust and safety tools for generative AI development. More concretely evals and benchmarks are licensed under the MIT license while any models use the Llama 2 Community license. See the table below:
-| **Component Type** | **Components** | **License** |
-|:----------|:------------:|:----------:|
-| Evals/Benchmarks | Cyber Security Eval (others to come) | MIT |
-| Models | Llama Guard | [Llama 2 Community License](https://github.com/facebookresearch/PurpleLlama/blob/main/LICENSE) |
+Components within the Purple Llama project will be licensed permissively,
+enabling both research and commercial usage. We believe this is a major step
+towards enabling community collaboration and standardizing the development and
+usage of trust and safety tools for generative AI development. More concretely,
+evals and benchmarks are licensed under the MIT license, while models are
+licensed under the Llama 2 Community License. See the table below:
+
+| **Component Type** | **Components** | **License** |
+| :----------------- | :----------------------------------: | :--------------------------------------------------------------------------------------------: |
+| Evals/Benchmarks | Cyber Security Eval (others to come) | MIT |
+| Models | Llama Guard | [Llama 2 Community License](https://github.com/facebookresearch/PurpleLlama/blob/main/LICENSE) |
+
+## Evals & Benchmarks
+
+### Cybersecurity
+
+We are sharing what we believe is the first industry-wide set of cybersecurity
+safety evaluations for LLMs. These benchmarks are based on industry guidance and
+standards (e.g., CWE and MITRE ATT&CK) and built in collaboration with our
+security subject matter experts. With this initial release, we aim to provide
+tools that will help address some risks outlined in the
+[White House commitments on developing responsible AI](https://www.whitehouse.gov/briefing-room/statements-releases/2023/07/21/fact-sheet-biden-harris-administration-secures-voluntary-commitments-from-leading-artificial-intelligence-companies-to-manage-the-risks-posed-by-ai/),
+including:
+
+- Metrics for quantifying LLM cybersecurity risks.
+- Tools to evaluate the frequency of insecure code suggestions.
+- Tools to evaluate LLMs to make it harder to generate malicious code or aid in
+  carrying out cyberattacks.
+
+We believe these tools will reduce the frequency of LLMs suggesting insecure
+AI-generated code and reduce their helpfulness to cyber adversaries. Our initial
+results show that there are meaningful cybersecurity risks for LLMs, both when
+recommending insecure code and when complying with malicious requests. See our
+[Cybersec Eval paper](https://ai.meta.com/research/publications/purple-llama-cyberseceval-a-benchmark-for-evaluating-the-cybersecurity-risks-of-large-language-models/)
+for more details.
+
+## Input/Output Safeguards
+
+As we outlined in Llama 2’s
+[Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/), we
+recommend that all inputs and outputs to the LLM be checked and filtered in
+accordance with content guidelines appropriate to the application.
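+
+The sketch below illustrates this pattern: screen the user prompt before it
+reaches the LLM, and screen the candidate response before it reaches the user.
+The `generate` and `is_unsafe` callables are hypothetical placeholders for an
+application's own LLM call and safeguard check (for example, a Llama Guard
+verdict parsed into a boolean).
+
+```python
+from typing import Callable, Dict, List
+
+Message = Dict[str, str]  # {"role": ..., "content": ...}
+
+def guarded_reply(
+    user_prompt: str,
+    generate: Callable[[str], str],              # hypothetical application LLM call
+    is_unsafe: Callable[[List[Message]], bool],  # hypothetical safeguard check
+    refusal: str = "Sorry, I can't help with that.",
+) -> str:
+    conversation: List[Message] = [{"role": "user", "content": user_prompt}]
+
+    # Prompt classification: filter violating inputs before calling the LLM.
+    if is_unsafe(conversation):
+        return refusal
+
+    # Generate a candidate response with the application model.
+    response = generate(user_prompt)
+    conversation.append({"role": "assistant", "content": response})
+
+    # Response classification: filter violating outputs before returning them.
+    if is_unsafe(conversation):
+        return refusal
+
+    return response
+```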
+
+### Llama Guard
+
+To support this, and to empower the community, we are releasing
+[Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/),
+an openly available model that performs competitively on common open benchmarks
+and provides developers with a pretrained model to help defend against
+generating potentially risky outputs.
+
+As part of our ongoing commitment to open and transparent science, we are
+releasing our methodology and an extended discussion of model performance in our
+[Llama Guard paper](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/).
+This model has been trained on a mix of publicly available datasets to enable
+detection of common types of potentially risky or violating content that may be
+relevant to a number of developer use cases. Ultimately, our vision is to enable
+developers to customize this model to support relevant use cases and to make it
+easier to adopt best practices and improve the open ecosystem.
## Getting Started
-To get started and learn how to use Purple Llama components with Llama models, see the getting started guide [here](https://ai.meta.com/llama/get-started/). The guide provides information and resources to help you set up Llama including how to access the model, hosting, how-to and integration guides. Additionally, you will find supplemental materials to further assist you while responsibly building with Llama. The guide will be updated as more Purple Llama components get released.
+
+To get started and learn how to use Purple Llama components with Llama models,
+see the getting started guide [here](https://ai.meta.com/llama/get-started/).
+The guide provides information and resources to help you set up Llama,
+including how to access the model, hosting options, and how-to and integration
+guides. Additionally, you will find supplemental materials to further assist you
+while responsibly building with Llama. The guide will be updated as more Purple
+Llama components are released.
## FAQ
-For a running list of frequently asked questions, for not only Purple Llama components but also generally for Llama models, see the FAQ [here](https://ai.meta.com/llama/faq/).
+
+For a running list of frequently asked questions, covering not only Purple
+Llama components but also Llama models in general, see the FAQ
+[here](https://ai.meta.com/llama/faq/).
## Join the Purple Llama community
+
See the [CONTRIBUTING](CONTRIBUTING.md) file for how to help out.