
wip: colpali design draft #427

Open · joein wants to merge 8 commits into main
Conversation

@joein (Member) commented Dec 18, 2024

This is a draft of the second iteration of work on ColPali (#394).

@joein changed the title from "wip: design draft" to "wip: colpali design draft" on Dec 18, 2024
@I8dNLo (Contributor) commented Dec 23, 2024

To check the values for the tests, I use the code examples from here.

@joein (Member, Author) left a comment:

LateInteractionMultimodalEmbedding,
)

__all__ = ["LateInteractionMultimodalEmbedding"]
@joein (Member, Author):

It should also be exportable from fastembed:

from fastembed import LateInteractionMultimodalEmbedding
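
For the re-export, a minimal sketch (the submodule path below is an assumption; whichever module actually defines the class should be used):

# fastembed/__init__.py -- sketch of the requested re-export
from fastembed.late_interaction_multimodal import LateInteractionMultimodalEmbedding

__all__ = [
    # ...existing exports...
    "LateInteractionMultimodalEmbedding",
]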

from PIL import Image

# vectors are abridged and rounded for brevity
CANONICAL_COLUMN_VALUES = {
@joein (Member, Author):

Maybe we should call it CANONICAL_IMAGE_VALUES? Wdyt?

@I8dNLo (Contributor):

I was just following our style, but you are right. It should be CANONICAL_IMAGE_VALUES.

Comment on lines +116 to +117
embeddings_3 = list(model.embed_text(docs, batch_size=10, parallel=0))
embeddings_3 = np.stack(embeddings_3, axis=0)
@joein (Member, Author):

I think we'll never run such a test (we just won't rent a monster capable of handling it)

@I8dNLo (Contributor):

I'll remove it.

**kwargs,
) -> Iterable[np.ndarray]:
"""
Encode a list of documents into list of embeddings.
@joein (Member, Author):

encode a list of images

If None, don't use data-parallel processing, use default onnxruntime threading instead.

Returns:
List of embeddings, one per document
@joein (Member, Author):

one per image

Comment on lines +109 to +121
Encode a list of documents into list of embeddings.
We use mean pooling with attention so that the model can handle variable-length inputs.

Args:
images: Iterator of image paths or single image path to embed
batch_size: Batch size for encoding -- higher values will use more memory, but be faster
parallel:
If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
If 0, use all available cores.
If None, don't use data-parallel processing, use default onnxruntime threading instead.

Returns:
List of embeddings, one per document
@joein (Member, Author):

  • images, not documents
  • the mean pooling sentence is redundant
  • one per image, not per document (see the sketch below)
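
Folding those points in, the docstring could look like this (a sketch; the method name, signature, and defaults are assumptions based on the snippets above):

def embed_image(
    self,
    images,
    batch_size: int = 16,
    parallel: Optional[int] = None,
    **kwargs,
) -> Iterable[np.ndarray]:
    """
    Encode a list of images into a list of embeddings.

    Args:
        images: Iterator of image paths or a single image path to embed
        batch_size: Batch size for encoding -- higher values will use more memory, but be faster
        parallel:
            If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
            If 0, use all available cores.
            If None, don't use data-parallel processing, use default onnxruntime threading instead.

    Returns:
        List of embeddings, one per image
    """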

{
"model": "akshayballal/colpali-v1.2-merged",
"dim": 128,
"description": "Text embeddings, Unimodal (text), Aligned to image latent space, ColBERT-compatible, 512 tokens max, 2024.",
@joein (Member, Author):

The description is kinda slippery.

Can we actually call these text / unimodal embeddings?

Is it aligned to the image latent space, or vice versa?

What do you mean here by ColBERT-compatible?
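
Not to preempt the answers, but one possible rewording (a sketch only; the phrasing is a suggestion and should follow from the answers above):

{
    "model": "akshayballal/colpali-v1.2-merged",
    "dim": 128,
    "description": "Late-interaction multimodal model: text queries and document images are "
    "embedded as multi-vectors in a shared space and scored ColBERT-style via MaxSim, 2024.",
},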

Comment on lines +102 to +104
self.mask_token_id = None
self.pad_token_id = None
self.skip_list = set()
@joein (Member, Author):

Why do we need these if we don't use them?

query += "\n"

texts_query.append(query)
encoded = self.tokenizer.encode_batch(texts_query)
@joein (Member, Author):

Shouldn't the query max length be 50?
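
If 50 is indeed the intended cap, a minimal sketch (assuming the tokenizers library API; QUERY_MAX_LENGTH is a hypothetical constant):

QUERY_MAX_LENGTH = 50  # assumption: the maximum query length this comment refers to

# truncate query encodings to the expected maximum length before batching
self.tokenizer.enable_truncation(max_length=QUERY_MAX_LENGTH)
encoded = self.tokenizer.encode_batch(texts_query)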

PAD_TOKEN = "<pad>"
QUERY_MARKER_TOKEN_ID = [2, 9413]
IMAGE_PLACEHOLDER_SIZE = (3, 448, 448)
EMPTY_TEXT_PLACEHOLDER = np.array([257152] * 1024 + [2, 50721, 573, 2416, 235265, 108])
@joein (Member, Author):

These are actually the token ids of the following string: '<image>' * 1024 + '<bos>Describe the image.\n'. Could we make it nicer? It's not really readable at the moment.

EVEN_ATTENTION_MASK is also not really readable; instead of having this even_attention_mask, maybe we could assign 1030 to a named constant, which seems a bit more reasonable.
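
One way to make both more readable, along the lines suggested above (a sketch; the constant names are suggestions):

IMAGE_TOKEN_ID = 257152  # token id of '<image>'
NUM_IMAGE_TOKENS = 1024
TEXT_PROMPT_TOKEN_IDS = [2, 50721, 573, 2416, 235265, 108]  # '<bos>Describe the image.\n'

# 1024 + 6 = 1030 -- a named constant instead of the magic number behind EVEN_ATTENTION_MASK
PLACEHOLDER_LENGTH = NUM_IMAGE_TOKENS + len(TEXT_PROMPT_TOKEN_IDS)

EMPTY_TEXT_PLACEHOLDER = np.array(
    [IMAGE_TOKEN_ID] * NUM_IMAGE_TOKENS + TEXT_PROMPT_TOKEN_IDS
)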
