AniDB vector search API #551

Status: Open (wants to merge 5 commits into base: master)

khell commented Jun 18, 2023

Thanks a lot for writing and maintaining this agent all these years.

I've added some new code to replace the default series name matching in the AniDB agent to direct the search to what I'm calling the "AniDB Vector Search API". I've described what it does in the README, but in short, it uses a pre-trained machine learning model to generate embeddings of AniDB's anime series titles, and then performs a semantic similarity search on these embeddings to retrieve the top five matches given a search query (in this case, the title the Plex scanner will extract). This allows users to ignore the standard series naming conventions (within reason).
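As a rough illustration of the retrieval step described above (not the actual API code), here is a minimal Python sketch. The `embed` function is a hypothetical stand-in that hashes character trigrams into a fixed-size vector; the real API uses a pre-trained model instead. The `top_matches` helper and the short `TITLES` list are likewise made up for the example.

```python
import math
import zlib

DIM = 256  # size of the toy embedding vector (arbitrary choice for this sketch)

# Small sample list for illustration only; the real dataset covers AniDB's titles.
TITLES = [
    "Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen",
    "Kaguya-sama wa Kokurasetai? Tensai-tachi no Ren`ai Zunousen",
    "Tengoku Daimakyou",
    "Cowboy Bebop",
    "Steins;Gate",
    "Fullmetal Alchemist",
]


def embed(text):
    """Toy stand-in for a pre-trained model: hash character trigrams into a vector."""
    vec = [0.0] * DIM
    padded = "  " + text.lower() + "  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        vec[zlib.crc32(trigram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product == cosine similarity


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_matches(query, titles, k=5):
    """Return the k titles most similar to the query, best first, with scores."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t) for t in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(title, score) for score, title in scored[:k]]


if __name__ == "__main__":
    for title, score in top_matches("Tengoku Daimakyo", TITLES):
        print("%.3f  %s" % (score, title))
```

A real deployment would compute the title embeddings once and store them, rather than re-embedding every title on each query as this sketch does.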

You can find the dataset and embedding on Huggingface here: https://huggingface.co/datasets/khellific/anidb-series-embeddings
You can find how it is used and the API source code here: https://github.com/khell/anidb-semantic-search-api

I intend to keep these embeddings regularly updated.

Please let me know what you think about merging this, and if there's anything you'd like me to change. Also, while I'm hosting a version of the API for free now, I can't guarantee I'll keep it running, or that it will keep up with load if too many users query it, as I'm only running it on some spare capacity with a single worker. As the API is open source, though, I don't expect that to be a problem.

ZeroQI (Owner) commented Jun 18, 2023

I had a query-limit issue with the YouTube agent, running out of hits under my API key, so you could run into too many users: the agent is used by many people downloading continuously, some of them refreshing 1000+ series in Plex, in which case is it worth the risk of switching the search?
How does it improve accuracy towards my code based on the AniDB title xml ?

khell (Author) commented Jun 18, 2023

in which case is it worth the risk of switching the search

I haven't done any load testing, but I might add result caching, which may let it support that many requests. I'm hoping the massive disclaimer that users should really host the API themselves is enough, though:
[image: screenshot of the disclaimer]

Worst case, there would be a new change where I take the API offline and require users to update it themselves. In that case I already have code to return an error response, which will fall back to the existing matching code.
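To illustrate the fallback behaviour described above, here is a minimal Python 3 sketch; it is not the code from this PR. The names `match_series`, `fetch_json` and `legacy_match` are hypothetical, and the assumption that a failed lookup either raises or returns JSON without an `id` field is mine, since the error response format is not shown in this thread.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Hosted endpoint mentioned in this thread; self-hosters would point this elsewhere.
API_URL = "https://anidb.khell.net/api/anidb/id?name="


def fetch_json(url, timeout=5):
    """Fetch and decode a JSON document; raises on network or HTTP errors."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def match_series(title, legacy_match, fetch=fetch_json):
    """Try the vector search API first; fall back to legacy_match on any failure."""
    try:
        result = fetch(API_URL + quote(title))
        # Assumption: a successful response is a JSON object carrying an "id" field.
        if isinstance(result, dict) and "id" in result:
            return result["id"]
    except (OSError, ValueError):
        # URLError, HTTPError and timeouts are OSError subclasses;
        # malformed JSON raises a ValueError subclass.
        pass
    return legacy_match(title)
```

Passing the fetcher in as a parameter keeps the fallback path easy to exercise without touching the network, for example `match_series("Raeliana noble", legacy_match=lambda t: None)` in production versus a stub fetcher in tests.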

How does it improve accuracy towards my code based on the AniDB title xml ?

You mean the existing code, right? It's added as a separate path, so if the option is enabled, the existing matching code will no longer run. Or are you asking how it's more accurate? I think the real draw here is not needing to adhere to any naming conventions. I can name my series folders anything I want (within reason) and still get matches. I mostly made this for myself, as I name my anime folders ENGLISH NAME [JAPANESE NAME], and in the past I've spent an hour manually going through and fixing all the matches when I set up my Plex library. I don't want to append [anidb-xxxx] to all my folder names and really wanted to stick to my own naming convention! With this change, users can do exactly that, and even small typos will still be correctly matched the majority of the time.

Please try the API for yourself to get a feel of what it's capable of compared to the existing matching! For example https://anidb.khell.net/api/anidb/id?name=Raeliana%20noble

ZeroQI (Owner) commented Jun 18, 2023

When using a metadata source, one should use the title from the metadata source...
I use the anidb xml to search the title if the series only has one season, so I am interested to see if your solution improves the search, but if it needs local hosting of the API I cannot condone it as simpler...

khell (Author) commented Jun 19, 2023

I absolutely agree with you about using the title from the metadata source. However, users (including myself) have different preferences and naming conventions, and what this offers is the flexibility to handle that. More importantly, it's relevant to users who follow metadata names exactly too: it uses an ML model that captures the semantic similarities between titles (and, inversely, the dissimilarities), so it's also more capable of handling minor misspellings and alternate names. Here's a more realistic example for Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren'ai Zunousen, which is actually a heavily overloaded name, as there are three different series on AniDB with very similar names:

My search:
Correct match (99.5% similarity - the only difference is the backtick character changed to a ' due to a filesystem limitation):
https://anidb.khell.net/api/anidb/id?name=Kaguya-sama%20wa%20Kokurasetai:%20Tensai-tachi%20no%20Ren%27ai%20Zunousen
{"id":"anidb-14111","language":"x-jat","name":"Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen","score":0.9956528544425964}

Existing search:
Wrong match (and only 76% score):
Kaguya-sama wa Kokurasetai? Tensai-tachi no Ren`ai Zunousen (2021) [main(x-jat)] [anidb-15807]

I'm not suggesting you use a name that's totally different from the metadata - that was just an example to demonstrate the power of this - but the example shows that with the current matching you can name a series exactly as it appears on AniDB and still not necessarily get a correct match. Combined with the other problem in this example (anime series often use characters not supported on the NTFS filesystem by default), it's often a major pain point for users that even when they try to follow naming conventions, it won't always work or be possible. I just happen to also support totally out-of-the-box naming conventions :)

With respect to your concerns about local hosting, I totally understand where you're coming from. However, think of the API hosting as just a strongly suggested option for power users who want to ensure uptime and performance. For regular users, they can still use the API hosted by me, and I plan to keep it operational as much as possible within my capacity. In fact, everything being open source means that anyone could also choose to host it and share the load, and we could perhaps document available versions of this API somewhere (or even change the config to a dropdown to select APIs).

I use the anidb xml to search the title if the series only has one season

One last note, if users turn this on, it'll effectively use AniDB for the entire search, as it generates [anidb-xxxx] ids for all entries. As I prefer AniDB to TVDB, I think this is actually preferable.
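As a small illustration of that last point, the JSON response from the Kaguya-sama example above can be turned into the [anidb-xxxx] form along these lines. This is only a sketch: `to_anidb_tag` is a hypothetical helper, and how the agent actually consumes the `id` field is not shown in this thread.

```python
import json
import re

# Response copied from the Kaguya-sama example earlier in this thread.
SAMPLE = (
    '{"id":"anidb-14111","language":"x-jat",'
    '"name":"Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen",'
    '"score":0.9956528544425964}'
)


def to_anidb_tag(response_text):
    """Return '[anidb-NNNN]' for a response with a well-formed id, else None."""
    data = json.loads(response_text)
    match = re.fullmatch(r"anidb-(\d+)", str(data.get("id", "")))
    if match is None:
        return None  # no usable id, so the caller can fall back to other matching
    return "[anidb-%s]" % match.group(1)


if __name__ == "__main__":
    print(to_anidb_tag(SAMPLE))
```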

khell (Author) commented Jun 19, 2023

Also, with a bit more work, it should be possible to use this for individual episode scanning too, e.g.:
https://anidb.khell.net/api/anidb/id?name=[DKB]%20Tengoku%20Daimakyou%20-%20S01E12%20[1080p]
