AniDB vector search API #551
I had query-limit issues with the YouTube agent, running out of hits under my API key, so you could run into the same problem with too many users: the agent is used by many people downloading continuously, and some refresh 1000+ series in Plex. In that case, is it worth the risk of switching the search?
I haven't done any load testing, but I might add result caching, which may be able to support that many. I'm hoping the massive disclaimer that users should really host the API themselves is enough, though. Worst case, I push a new change where I take the API offline and require users to update it themselves. In that case I already have code to return an error response, which will fall back to the existing matching code.
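The fallback behaviour described above could look roughly like the sketch below. It is not the code from this PR: `search_with_fallback`, `vector_search`, `legacy_search`, and the dictionary cache are hypothetical names chosen for illustration, and whether caching would live in the API or in the agent is not stated in this thread.

```python
def search_with_fallback(title, vector_search, legacy_search, cache):
    """Try the vector search first; fall back to legacy matching on failure.

    `vector_search` and `legacy_search` are hypothetical callables that take
    a title and return a list of matches. `cache` is a plain dict.
    """
    if title in cache:
        return cache[title]

    try:
        result = vector_search(title)
    except Exception:
        # Treat an unreachable or failing API like an error response.
        result = None

    if not result:
        # Error or empty response: use the existing matching code instead.
        # Fallback results are not cached, so the API is retried next time.
        return legacy_search(title)

    cache[title] = result
    return result
```

With this shape, a repeated lookup for the same title is served from the cache, and an offline API degrades to the existing matching rather than failing the scan.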
You mean the existing code, right? It's added as a separate path, so if the option is enabled, the existing matching code will no longer run.

Or are you asking how it's more accurate? I think the real draw here is not needing to adhere to any naming conventions. I can name my series folders anything I want (within reason) and still get matches. I mostly made this for myself, as I name my anime folders

Please try the API for yourself to get a feel of what it's capable of compared to the existing matching! For example https://anidb.khell.net/api/anidb/id?name=Raeliana%20noble
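For anyone scripting a quick test, a query URL of the same shape as the example above can be built as in this minimal sketch. The endpoint and the `name` parameter are taken from that example URL; `build_search_url` is a hypothetical helper, and the response format is not described in this thread, so it is not parsed here.

```python
from urllib.parse import quote

# Endpoint copied from the example URL in the comment above.
API_BASE = "https://anidb.khell.net/api/anidb/id"


def build_search_url(name, base=API_BASE):
    """Return the search URL for a folder or series name (hypothetical helper)."""
    # safe="" percent-encodes every reserved character, including "/" and ":".
    return "{}?name={}".format(base, quote(name, safe=""))


if __name__ == "__main__":
    print(build_search_url("Raeliana noble"))
```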
When using a metadata source, one should use the title from the metadata source...
I absolutely agree with you about using the title from the metadata source. However, users (including myself) have different preferences and naming conventions, and what this offers is the flexibility to handle that. More importantly, it's relevant to users who follow metadata names exactly too: it uses an ML model that understands the semantic similarities between titles, and inversely the dissimilarities, so it's also more capable of handling minor misspellings and alternate names.

Here's a more realistic example for My search: Existing search

I'm not suggesting you use a name that's totally different from the metadata - that was just an example to demonstrate the power of this - but the example shows that with the current matching you can name a series exactly as it appears on AniDB and still not necessarily get a correct match. Combined with the other problem in this example (anime series often use characters not supported on NTFS filesystems by default), it's often a major pain point for users: even when they try to follow naming conventions, it won't always work or be possible. I just happen to also support totally out-of-the-box naming conventions :)

With respect to your concerns about local hosting, I totally understand where you're coming from. However, think of self-hosting the API as just a strongly suggested option for power users who want to ensure uptime and performance. Regular users can still use the API hosted by me, and I plan to keep it operational as much as possible within my capacity. In fact, everything being open source means that anyone could also choose to host it and share the load, and we could perhaps document available instances of this API somewhere (or even change the config to a dropdown to select APIs).
One last note: if users turn this on, it'll effectively use AniDB for the entire search, as it generates [anidb-xxxx] ids for all entries. As I prefer AniDB to TVDB, I think this is actually preferable.
Also, with a bit more work, it should be possible to use this to do individual episode scanning, too, e.g.:
…error edge cases.
Thanks a lot for writing and maintaining this agent all these years.
I've added some new code to replace the default series name matching in the AniDB agent to direct the search to what I'm calling the "AniDB Vector Search API". I've described what it does in the README, but in short, it uses a pre-trained machine learning model to generate embeddings of AniDB's anime series titles, and then performs a semantic similarity search on these embeddings to retrieve the top five matches given a search query (in this case, the title the Plex scanner will extract). This allows users to ignore the standard series naming conventions (within reason).
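As a rough illustration of the retrieval step described above, the sketch below ranks titles by cosine similarity and returns the top five. It is not the API's implementation: cosine similarity is an assumed metric, the three-dimensional vectors in `TOY_EMBEDDINGS` are hand-made placeholders rather than model output, and `top_matches` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, title_vecs, k=5):
    """Return up to k titles ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), title)
        for title, vec in title_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]


# Hand-made placeholder vectors; real embeddings would come from the model.
TOY_EMBEDDINGS = {
    "title A": [1.0, 0.0, 0.0],
    "title B": [0.9, 0.1, 0.0],
    "title C": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    print(top_matches([1.0, 0.0, 0.0], TOY_EMBEDDINGS))
```

In a real deployment the query vector would be produced by embedding the title extracted by the Plex scanner with the same model used for the AniDB titles, and a linear scan like this would typically be replaced by a vector index.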
You can find the dataset and embedding on Huggingface here: https://huggingface.co/datasets/khellific/anidb-series-embeddings
You can find how it is used and the API source code here: https://github.com/khell/anidb-semantic-search-api
I intend to keep these embeddings regularly updated.
Please let me know what you think about merging this, and if there's anything you'd like me to change. Also, while I'm hosting a version of the API for free now, I can't guarantee I'll keep it running or that it will be able to keep up with load if too many users query it, as I'm only running it on some spare capacity with a single worker. But as the API is open source, I don't expect that to be a problem.