@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 940 154

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 521 26

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.2k 263

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    1k 41

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 555 52

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 549 125

Repositories

Showing 10 of 163 repositories
  • scin Public

    The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

    Jupyter Notebook 82 9 2 0 Updated Nov 23, 2024
  • MISeD Public

    MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

    9 3 0 0 Updated Nov 20, 2024
  • uicrit Public

    UICrit is a dataset containing human-generated natural language design critiques, corresponding bounding boxes for each critique, and design quality ratings for 1,000 mobile UIs from RICO. This dataset was collected for our UIST '24 paper: https://arxiv.org/abs/2407.08850.

    6 1 0 0 Updated Nov 19, 2024
  • hiertext Public

    The HierText dataset contains ~12k images from the Open Images dataset v6 with a large number of text entities. We provide word-, line-, and paragraph-level annotations.

    Jupyter Notebook 267 CC-BY-SA-4.0 24 2 1 Updated Nov 9, 2024
  • WordGraph Public

    The WordGraph dataset contains multilingual lexicon entries linked to Wikipedia entities, focusing on human-denoting names and demonym adjectives. Each lexicon entry contains inflected word forms and morphological information for all locales.

    0 CC0-1.0 0 0 0 Updated Nov 7, 2024
  • Education-Dialogue-Dataset Public

    A dataset of conversations generated by prompting Gemini Ultra. The conversations are between a teacher and a student, where the teacher is prompted with a specific topic to teach the student, and the student is prompted with their learning preferences. https://arxiv.org/abs/2405.14655

    3 0 0 0 Updated Oct 29, 2024
  • sanpo_dataset Public
    Python 40 Apache-2.0 2 3 2 Updated Oct 28, 2024
  • GeniL Public

    The GeniL dataset is an effort to detect various types of generalization in language. This multilingual dataset covers sentences in EN, FR, ES, PT, AR, HI, BN, MS, and ID and is annotated by native speakers of each language. Each sentence is collected from public language corpora and contains at least one identity group name and an attribute.

    2 CC-BY-4.0 1 0 0 Updated Oct 19, 2024
  • tap-typing-with-touch-sensing-images Public

    The Tap Typing with Touch Sensing Images (TSI) dataset contains data on user taps on a mobile touchscreen keyboard, including elliptical features and capacitive sensing images of the taps. The dataset aligns each tap with the key the user intended to type during data collection, so it can be used for keyboard decoder training and/or evaluation.

    1 CC-BY-4.0 1 1 0 Updated Oct 15, 2024
  • mittens Public

    Datasets for measuring misgendering in translation

    5 0 0 0 Updated Oct 4, 2024
