wip: colpali design draft #427
base: main
Conversation
To check the values for the tests, I used code examples from here.
```python
    LateInteractionMultimodalEmbedding,
)

__all__ = ["LateInteractionMultimodalEmbedding"]
```
It should also be exportable from fastembed:

```python
from fastembed import LateInteractionMultimodalEmbedding
```
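A minimal sketch of what the top-level re-export could look like; the submodule path and the neighboring export shown below are assumptions for illustration, not taken from the PR:

```python
# fastembed/__init__.py -- hedged sketch; the submodule path is an assumption
from fastembed.late_interaction_multimodal import LateInteractionMultimodalEmbedding

__all__ = [
    "TextEmbedding",                       # existing export, shown for context
    "LateInteractionMultimodalEmbedding",  # new export proposed in this review
]
```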
```python
from PIL import Image

# vectors are abridged and rounded for brevity
CANONICAL_COLUMN_VALUES = {
```
Maybe we should call it `CANONICAL_IMAGE_VALUES`? Wdyt?
I was just following our style, but you are right. It should be `CANONICAL_IMAGE_VALUES`.
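For illustration, a hedged sketch of the renamed fixture; the vector values below are made-up placeholders, not real model outputs:

```python
import numpy as np

# Hedged sketch of the renamed fixture; values are placeholders, not model outputs.
CANONICAL_IMAGE_VALUES = {
    "akshayballal/colpali-v1.2-merged": np.array(
        [
            [0.015, 0.051, 0.059],  # abridged and rounded, as in the PR
            [-0.120, 0.020, 0.130],
        ]
    ),
}
```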
```python
embeddings_3 = list(model.embed_text(docs, batch_size=10, parallel=0))
embeddings_3 = np.stack(embeddings_3, axis=0)
```
I think we'll never run such a test (we just won't rent a monster capable of handling it)
I'll remove it.
```python
        **kwargs,
    ) -> Iterable[np.ndarray]:
        """
        Encode a list of documents into list of embeddings.
```
Should be "encode a list of images".
```python
                If None, don't use data-parallel processing, use default onnxruntime threading instead.

        Returns:
            List of embeddings, one per document
```
Should be "one per image".
```python
        Encode a list of documents into list of embeddings.
        We use mean pooling with attention so that the model can handle variable-length inputs.

        Args:
            images: Iterator of image paths or single image path to embed
            batch_size: Batch size for encoding -- higher values will use more memory, but be faster
            parallel:
                If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
                If 0, use all available cores.
                If None, don't use data-parallel processing, use default onnxruntime threading instead.

        Returns:
            List of embeddings, one per document
```
- it should say images, not documents
- the mean pooling sentence is redundant here
- "one per image", not "one per document" (see the corrected sketch below)
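A sketch of the docstring with these three fixes applied; everything else is kept verbatim from the diff above:

```python
        """
        Encode a list of images into list of embeddings.

        Args:
            images: Iterator of image paths or single image path to embed
            batch_size: Batch size for encoding -- higher values will use more memory, but be faster
            parallel:
                If > 1, data-parallel encoding will be used, recommended for offline encoding of large datasets.
                If 0, use all available cores.
                If None, don't use data-parallel processing, use default onnxruntime threading instead.

        Returns:
            List of embeddings, one per image
        """
```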
```python
    {
        "model": "akshayballal/colpali-v1.2-merged",
        "dim": 128,
        "description": "Text embeddings, Unimodal (text), Aligned to image latent space, ColBERT-compatible, 512 tokens max, 2024.",
```
The description is kinda slippery:
- can we actually call these text / unimodal embeddings?
- is it aligned to the image latent space, or vice versa?
- what do you mean here by ColBERT-compatible?
```python
        self.mask_token_id = None
        self.pad_token_id = None
        self.skip_list = set()
```
Why do we need these if we don't use them?
```python
            query += "\n"

            texts_query.append(query)
        encoded = self.tokenizer.encode_batch(texts_query)
```
Shouldn't the query max length be 50?
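If so, a hedged sketch of enforcing it with the HuggingFace tokenizers API; the limit of 50 comes from the comment above, and loading the tokenizer from that model repo is an assumption:

```python
from tokenizers import Tokenizer

# Hedged sketch: truncate queries to a max length with the tokenizers API.
# The limit of 50 comes from the review comment; the repo id is an assumption.
tokenizer = Tokenizer.from_pretrained("akshayballal/colpali-v1.2-merged")
tokenizer.enable_truncation(max_length=50)
encoded = tokenizer.encode_batch(["What is shown in the image?"])
```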
```python
PAD_TOKEN = "<pad>"
QUERY_MARKER_TOKEN_ID = [2, 9413]
IMAGE_PLACEHOLDER_SIZE = (3, 448, 448)
EMPTY_TEXT_PLACEHOLDER = np.array([257152] * 1024 + [2, 50721, 573, 2416, 235265, 108])
```
These are actually the token ids of the string `'<image>' * 1024 + '<bos>Describe the image.\n'`. Could we make it nicer? It's not really readable at the moment. `EVEN_ATTENTION_MASK` is also not really readable; maybe instead of having this `even_attention_mask` we could assign 1030 to a constant, which seems a bit more reasonable. See the sketch below.
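A hedged sketch of one way to make these constants self-documenting; the token ids come from the diff above, and the constant names are invented:

```python
import numpy as np

# Hedged sketch; names are invented, token ids are taken from the diff above.
IMAGE_TOKEN_ID = 257152                                     # id of '<image>'
NUM_IMAGE_TOKENS = 1024
TEXT_PROMPT_TOKEN_IDS = [2, 50721, 573, 2416, 235265, 108]  # '<bos>Describe the image.\n'

EMPTY_TEXT_PLACEHOLDER = np.array(
    [IMAGE_TOKEN_ID] * NUM_IMAGE_TOKENS + TEXT_PROMPT_TOKEN_IDS
)
# 1024 image tokens + 6 text tokens = 1030, the length even_attention_mask encodes
PLACEHOLDER_SEQ_LEN = NUM_IMAGE_TOKENS + len(TEXT_PROMPT_TOKEN_IDS)  # == 1030
```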
It's a draft of the second iteration of work on ColPali (#394).