Lightning IR

Your one-stop shop for fine-tuning and running neural ranking models.

Lightning IR is a library for fine-tuning and running neural ranking models. It is built on top of PyTorch Lightning and provides a simple, flexible interface for working with these models.

Want to:

fine-tune your own cross- or bi-encoder models?
index and search through a collection of documents with ColBERT or SPLADE?
re-rank documents with state-of-the-art models?

Lightning IR has you covered!

Installation

We're currently setting up the package on PyPI. In the meantime, you can install the package from source.

git clone https://github.com/webis-de/lightning-ir.git
cd lightning-ir
pip install .

Getting Started

The easiest way to use Lightning IR is via the CLI. It builds on the PyTorch Lightning CLI and adds options that provide a unified interface for fine-tuning and running neural ranking models.

The behavior of the CLI can be customized using YAML configuration files. See the configs directory for several example configuration files. For example, the following command re-ranks the official TREC DL 19/20 re-ranking set with an already fine-tuned cross-encoder model. It automatically downloads the model and data, runs the re-ranking, writes the results to a TREC-style run file, and reports the nDCG@10 score.

lightning-ir re_rank \
  --config ./configs/trainer/inference.yaml \
  --config ./configs/callbacks/rank.yaml \
  --config ./configs/data/re-rank-trec-dl.yaml \
  --config ./configs/models/monoelectra.yaml

For more details, see the Usage section.

Model Zoo

Cross-encoders

Model Name	TREC DL 19 nDCG@10	TREC DL 20 nDCG@10
monoelectra-base	0.75	0.77
monoelectra-large	0.75	0.79
monoT5 (Coming soon)	--	--

Bi-encoders

Model Name	TREC DL 19/20 nDCG@10
BERT Bi-encoder (Coming soon)	--
ColBERT (Coming soon)	--
SPLADE (Coming soon)	--
XTR (Coming soon)	--

Usage

Command Line Interface

The CLI offers four subcommands:

$ lightning-ir -h
Lightning Trainer command line tool

subcommands:
  For more details of each subcommand, add it as an argument followed by --help.

  Available subcommands:
    fit                 Runs the full optimization routine.
    index               Index a collection of documents.
    search              Search for relevant documents.
    re_rank             Re-rank a set of retrieved documents.

Configuration files need to be provided to specify the model, data, and fine-tuning/inference parameters. See the configs directory for examples. Four types of configuration exist:

trainer: Specifies the fine-tuning/inference parameters and callbacks.
model: Specifies the model to use and its parameters.
data: Specifies the dataset(s) to use and their parameters.
optimizer: Specifies the optimizer parameters (only needed for fine-tuning).
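
The four configuration types above map to top-level keys in the example configuration files. As a rough sketch, a parsed configuration could be checked before launching a run as shown below. The helper `missing_sections` and the `REQUIRED_SECTIONS` table are hypothetical and not part of Lightning IR. They also assume that every subcommand expects trainer, model, and data sections, which matches the examples in this README but may be stricter than the CLI itself.

```python
# Hypothetical helper, not part of Lightning IR: reports which top-level
# sections a parsed configuration lacks for a given subcommand.
# Assumption: every subcommand uses trainer, model, and data; only `fit`
# additionally needs an optimizer section.
REQUIRED_SECTIONS = {
    "fit": {"trainer", "model", "data", "optimizer"},
    "index": {"trainer", "model", "data"},
    "search": {"trainer", "model", "data"},
    "re_rank": {"trainer", "model", "data"},
}


def missing_sections(subcommand: str, config: dict) -> list[str]:
    """Return the sorted section names that `config` lacks for `subcommand`."""
    required = REQUIRED_SECTIONS[subcommand]
    return sorted(required - set(config))
```

For instance, a configuration with only trainer, model, and data sections would be reported as missing the optimizer section for `fit`, but as complete for the inference subcommands.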

Example

The following example demonstrates how to fine-tune a BERT-based single-vector bi-encoder model using the official MS MARCO triples. The fine-tuned model is then used to index the MS MARCO passage collection and search for relevant passages. Finally, we show how to re-rank the retrieved passages.

Fine-tuning

To fine-tune a bi-encoder model on the MS MARCO triples dataset, use the following configuration file and command:

bi-encoder-fit.yaml

trainer:
  callbacks:
  - class_path: ModelCheckpoint
  max_epochs: 1
  max_steps: 100000
data:
  class_path: LightningIRDataModule
  init_args:
    train_batch_size: 32
    train_dataset:
      class_path: TupleDataset
      init_args:
        tuples_dataset: msmarco-passage/train/triples-small
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: bert-base-uncased
    config:
      class_path: BiEncoderConfig
    loss_functions:
    - class_path: RankNet
optimizer:
  class_path: AdamW
  init_args:
    lr: 1e-5

lightning-ir fit --config bi-encoder-fit.yaml

The fine-tuned model is saved in the directory lightning_logs/version_X/huggingface_checkpoint/.
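
The indexing step assumes the checkpoint lives in models/bi-encoder. The following is a minimal sketch of one way to get it there, assuming the directory layout described above (numbered version_X directories that each contain a huggingface_checkpoint folder). The function names are hypothetical, and the sketch copies the checkpoint rather than moving it.

```python
import re
import shutil
from pathlib import Path


def latest_checkpoint(log_dir: Path) -> Path:
    """Return the huggingface_checkpoint directory of the highest-numbered version_X run."""
    versions = []
    for path in log_dir.glob("version_*"):
        match = re.fullmatch(r"version_(\d+)", path.name)
        checkpoint = path / "huggingface_checkpoint"
        # Only consider runs that actually contain an exported checkpoint.
        if match and checkpoint.is_dir():
            versions.append((int(match.group(1)), checkpoint))
    if not versions:
        raise FileNotFoundError(f"no huggingface_checkpoint found under {log_dir}")
    # Compare numerically so that version_10 ranks above version_2.
    return max(versions, key=lambda item: item[0])[1]


def export_checkpoint(log_dir: Path, target: Path) -> Path:
    """Copy the latest checkpoint to `target`, creating parent directories as needed."""
    source = latest_checkpoint(log_dir)
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Calling `export_checkpoint(Path("lightning_logs"), Path("models/bi-encoder"))` from the working directory used for fine-tuning would place the most recent checkpoint where the following steps expect it.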

Indexing

We now assume the model from the previous fine-tuning step was moved to the directory models/bi-encoder. To index the MS MARCO passage collection with faiss using the fine-tuned model, use the following configuration file and command:

bi-encoder-index.yaml

trainer:
  callbacks:
  - class_path: IndexCallback
    init_args:
      index_config:
        class_path: FaissFlatIndexConfig
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 256
    inference_datasets:
    - class_path: DocDataset
      init_args:
        doc_dataset: msmarco-passage

lightning-ir index --config bi-encoder-index.yaml

The index is saved in the directory models/bi-encoder/indexes/msmarco-passage.
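
To illustrate what a flat index does conceptually: it stores one vector per document and compares the query vector against every stored vector at search time, without any approximation. The sketch below shows this in plain Python. It is not how Lightning IR or faiss is implemented, and it assumes inner-product similarity, which may differ from the similarity function the model is configured with. The function names are illustrative only.

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_search(
    query: list[float], doc_vectors: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best (doc_id, score) pairs."""
    scored = ((dot(query, vec), doc_id) for doc_id, vec in doc_vectors.items())
    # nlargest keeps only the k highest-scoring documents, best first.
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]
```

The cost of this exhaustive comparison grows linearly with the number of documents, which is the trade-off a flat index makes in exchange for exact results.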

Searching

To search for relevant documents in the MS MARCO passage collection using the bi-encoder and index, use the following configuration file and command:

bi-encoder-search.yaml

trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: BiEncoderModule
  init_args:
    model_name_or_path: models/bi-encoder
    index_dir: models/bi-encoder/indexes/msmarco-passage
    search_config:
      class_path: FaissFlatSearchConfig
      init_args:
        k: 100
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2019/judged
    - class_path: QueryDataset
      init_args:
        query_dataset: msmarco-passage/trec-dl-2020/judged

lightning-ir search --config bi-encoder-search.yaml

The run files are saved as models/bi-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
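
To inspect a run file outside of Lightning IR, a small parser is often enough. The sketch below assumes the standard six-column TREC run format (query ID, the literal Q0, document ID, rank, score, run tag), separated by whitespace. The exact columns Lightning IR writes are not spelled out here, so treat that as an assumption. `RunEntry` and `parse_run` are hypothetical names.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunEntry(NamedTuple):
    doc_id: str
    rank: int
    score: float


def parse_run(lines: Iterable[str]) -> dict[str, list[RunEntry]]:
    """Group run-file lines by query ID and order each group by rank."""
    run = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Assumed column order: query_id Q0 doc_id rank score tag
        query_id, _, doc_id, rank, score, _ = line.split()
        run[query_id].append(RunEntry(doc_id, int(rank), float(score)))
    for entries in run.values():
        entries.sort(key=lambda entry: entry.rank)
    return dict(run)
```

A file can be passed directly, for example `parse_run(open(path))`, since iterating over a file object yields its lines.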

Re-ranking

Assuming we've also fine-tuned a cross-encoder that is saved in the directory models/cross-encoder, we can re-rank the retrieved documents using the following configuration file and command:

cross-encoder-re-rank.yaml

trainer:
  callbacks:
  - class_path: RankCallback
model:
  class_path: CrossEncoderModule
  init_args:
    model_name_or_path: models/cross-encoder
    evaluation_metrics:
    - nDCG@10
data:
  class_path: LightningIRDataModule
  init_args:
    num_workers: 1
    inference_batch_size: 4
    inference_datasets:
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2019.run
        depth: 100
        sample_size: 100
        sampling_strategy: top
    - class_path: RunDataset
      init_args:
        run_path_or_id: models/bi-encoder/runs/msmarco-passage-trec-dl-2020.run
        depth: 100
        sample_size: 100
        sampling_strategy: top

lightning-ir re_rank --config cross-encoder-re-rank.yaml

The run files are saved as models/cross-encoder/runs/msmarco-passage-trec-dl-20XX.run. Additionally, the nDCG@10 scores are printed to the console.
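
For reference, nDCG@10 can be computed by hand for a single query as sketched below. This uses one common formulation (linear gain and a log2 rank discount). It is meant as an illustration of the metric, not as the evaluation code Lightning IR runs, and small differences to the reported scores are possible if the library uses another variant. Documents without a relevance judgment are assumed to be non-relevant.

```python
import math


def dcg(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(gain / math.log2(position + 2) for position, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: list[str], qrels: dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query, given its ranking and relevance judgments."""
    # Unjudged documents count as non-relevant (gain 0).
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking orders all judged documents by decreasing relevance.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg(ideal_gains, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging this value over all queries of a test set gives a single score comparable in spirit to the numbers in the Model Zoo tables.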
