In this tutorial, we'll use CLIP to perform object detection on images retrieved through vector search. The user enters the name of an object as the query, and the process then runs in two steps:
- A vector search retrieves the images that best match the query.
- The most similar image is then used to detect the queried object.
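As a rough illustration of the retrieval step, the sketch below ranks images by cosine similarity between a query embedding and a matrix of image embeddings. It assumes the embeddings already exist as NumPy arrays; the function name `top_k_similar` and the toy 3-dimensional vectors are hypothetical, and a real setup would typically use CLIP embeddings stored in a vector database rather than a brute-force search.

```python
import numpy as np

def top_k_similar(query_vec, image_vecs, k=1):
    """Return the indices and cosine similarities of the k closest images.

    Hypothetical helper: brute-force search standing in for a vector database.
    """
    # Normalise so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = m @ q
    # Sort in descending order of similarity and keep the first k.
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy stand-ins for CLIP embeddings (real ones have hundreds of dimensions).
image_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_vec = np.array([0.9, 0.1, 0.0])

indices, sims = top_k_similar(query_vec, image_vecs, k=2)
print(indices, sims)
```

The first returned index identifies the most similar image, which is the one passed on to the detection step.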
The dataset used in this example comes from Hugging Face. It contains images and their captions, and the dataset is embedded using CLIP.
Note: You can substitute a different dataset to suit your own use case.
The object detection part follows steps similar to YOLO. The steps to detect an object are:
- Split the image into patches.
- Slide a window of size 4 with stride 1 over the patches and pass each window to CLIP.
- Once CLIP has processed and analysed all the patches, calculate the bounding box coordinates Xmin, Ymin, Xmax, and Ymax.
- Plot the bounding box (Bbox) on the image.
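The steps above can be sketched as follows. This is a minimal illustration under several assumptions, not the tutorial's exact implementation: "window size of 4" is read as a window of 4×4 patches, `score_fn` is a hypothetical stand-in for the CLIP similarity between a window and the text query, window scores are averaged per patch, and the box is drawn around patches whose normalised score reaches a hypothetical `threshold`. The names `detect_box` and `patch` are likewise illustrative.

```python
import numpy as np

def detect_box(image, score_fn, patch=32, window=4, stride=1, threshold=0.5):
    """Return (xmin, ymin, xmax, ymax) in pixels, or None if nothing stands out.

    score_fn is a placeholder for CLIP: it takes an image crop and returns
    a number that is higher when the crop matches the query.
    """
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    scores = np.zeros((rows, cols))
    counts = np.zeros((rows, cols))

    # Slide a window of `window` x `window` patches over the patch grid.
    for r in range(0, rows - window + 1, stride):
        for c in range(0, cols - window + 1, stride):
            crop = image[r * patch:(r + window) * patch,
                         c * patch:(c + window) * patch]
            s = score_fn(crop)
            # Credit the window's score to every patch it covers.
            scores[r:r + window, c:c + window] += s
            counts[r:r + window, c:c + window] += 1

    # Average score per patch, then rescale to the range 0..1.
    mean = scores / np.maximum(counts, 1)
    lo, hi = mean.min(), mean.max()
    if hi == lo:
        return None
    mean = (mean - lo) / (hi - lo)

    # Bounding box around all patches at or above the threshold.
    ys, xs = np.nonzero(mean >= threshold)
    xmin, ymin = int(xs.min()) * patch, int(ys.min()) * patch
    xmax, ymax = (int(xs.max()) + 1) * patch, (int(ys.max()) + 1) * patch
    return xmin, ymin, xmax, ymax

# Toy example: a bright square on a dark 256x256 image, with mean
# brightness standing in for the CLIP score.
image = np.zeros((256, 256, 3))
image[96:160, 96:160] = 1.0
box = detect_box(image, score_fn=lambda crop: crop.mean())
print(box)
```

Because each patch's score is averaged over overlapping windows, the resulting box tends to be looser than the object itself; raising `threshold` tightens it. The returned coordinates can then be drawn on the image, for example with a rectangle patch in Matplotlib.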