AniDB vector search API #551

khell · 2023-06-18T17:02:17Z

Thanks a lot for writing and maintaining this agent all these years.

I've added some new code to replace the default series name matching in the AniDB agent to direct the search to what I'm calling the "AniDB Vector Search API". I've described what it does in the README, but in short, it uses a pre-trained machine learning model to generate embeddings of AniDB's anime series titles, and then performs a semantic similarity search on these embeddings to retrieve the top five matches given a search query (in this case, the title the Plex scanner will extract). This allows users to ignore the standard series naming conventions (within reason).
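To make the idea concrete, here is a minimal sketch of a top-five similarity search. It is not the API's actual code: the `embed` function below is a toy character-trigram counter standing in for the pre-trained model (so it only captures spelling overlap, not meaning), and the function names and the short title list are assumptions for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): character-trigram counts, NOT the real model."""
    padded = "  " + text.lower() + "  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_matches(query, titles, k=5):
    """Return up to k (score, title) pairs, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(title)), title) for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Illustrative title list only; the real index covers AniDB's series titles.
titles = [
    "Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen",
    "Kaguya-sama wa Kokurasetai? Tensai-tachi no Ren`ai Zunousen",
    "Tengoku Daimakyou",
]

for score, title in top_matches("Tengoku Daimakyo", titles):
    print(round(score, 3), title)
```

In the real service the embeddings would be computed once and stored, so only the query needs to be embedded at search time.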

You can find the dataset and embeddings on Hugging Face here: https://huggingface.co/datasets/khellific/anidb-series-embeddings
You can find how it is used and the API source code here: https://github.com/khell/anidb-semantic-search-api

I intend to keep these embeddings regularly updated.

Please let me know what you think about merging this, and if there's anything you'd like me to change. Also, while I'm hosting a version of the API for free now, I can't guarantee I'll keep it running or that it will keep up with load if too many users query it, as I'm only running it on some spare capacity with a single worker. As the API is open source, though, I don't expect that to be a problem.

ZeroQI · 2023-06-18T17:44:10Z

I had a query quota issue with the YouTube agent, running out of hits under my API key, so you could run into too many users: the agent is used by many users downloading continuously, some of them refreshing 1000+ series in Plex. In which case is it worth the risk of switching the search?
How does it improve accuracy towards my code based on the AniDB title xml ?

khell · 2023-06-18T18:39:24Z

> in which case is it worth the risk of switching the search

I haven't done any load testing, but I might add result caching, which may be enough to support that many. I'm hoping the massive disclaimer that users should really host the API themselves is enough, though.

Worst case, I make a new change where I take the API offline and require users to update it themselves. In that case I already have code to return an error response, which will fall back to the existing matching code.
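Roughly, the fallback behaves like the sketch below. This is an illustration rather than the code in this PR: `existing_match` is a hypothetical placeholder for the current matching, and treating "a response without an `id`" as the error case is an assumption, since the shape of the error response isn't shown here.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_URL = "https://anidb.khell.net/api/anidb/id"  # endpoint used in this thread

def fetch_api_match(title, timeout=5):
    """Query the vector search API; raises on network or HTTP errors."""
    url = API_URL + "?name=" + urllib.parse.quote(title)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

def existing_match(title):
    """Hypothetical placeholder for the agent's existing title matching."""
    return {"id": None, "name": title, "source": "existing"}

def find_series(title, fetch=fetch_api_match, fallback=existing_match):
    """Try the API first; on any failure, fall back to the existing matching."""
    try:
        result = fetch(title)
        # Assumption: a usable response is a JSON object with a non-empty "id".
        if isinstance(result, dict) and result.get("id"):
            return result
    except (urllib.error.URLError, ValueError, OSError):
        # Covers an offline API, HTTP error statuses, timeouts and invalid JSON.
        pass
    return fallback(title)
```

Passing `fetch` and `fallback` as parameters is only there so the behaviour can be exercised without a network connection.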

> How does it improve accuracy towards my code based on the AniDB title xml ?

You mean the existing code, right? It's added as a separate path, so if the option is enabled, the existing matching code will no longer run. Or are you asking how it's more accurate? I think the real draw here is not needing to adhere to any naming conventions. I can name my series folders anything I want (within reason) and still get matches. I mostly made this for myself, as I name my anime folders ENGLISH NAME [JAPANESE NAME], and in the past I've spent an hour manually going through and fixing all the matches when I set up my Plex library. I don't want to append [anidb-xxxx] to all my folder names and really wanted to stick to my own naming convention! With this change, users can do exactly that, and even small typos will still be correctly matched the majority of the time.

Please try the API for yourself to get a feel of what it's capable of compared to the existing matching! For example https://anidb.khell.net/api/anidb/id?name=Raeliana%20noble

ZeroQI · 2023-06-18T19:54:27Z

When using a metadata source, one should use the title from the metadata source...
I use the anidb xml to search the title if the series only has one season, so I am interested to see if your solution improves the search, but if it needs local hosting of the API I cannot condone it as simpler...

khell · 2023-06-19T02:14:43Z

I absolutely agree with you about using the title from the metadata source. However, users (including myself) have different preferences and naming conventions, and what this offers is the flexibility to handle that. More importantly, it's also relevant to users who follow metadata names exactly: it uses an ML model that understands the semantic similarities between titles and, conversely, the dissimilarities, so it's more capable of handling minor misspellings and alternate names. Here's a more realistic example for Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren'ai Zunousen, which is actually a super overloaded name, as there are three different series on AniDB with very similar names:

My search:
Correct match (99.5% similarity; the only difference is the backtick character changed to a ' due to a filesystem limitation):
https://anidb.khell.net/api/anidb/id?name=Kaguya-sama%20wa%20Kokurasetai:%20Tensai-tachi%20no%20Ren%27ai%20Zunousen
{"id":"anidb-14111","language":"x-jat","name":"Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen","score":0.9956528544425964}

Existing search:
Wrong match (and only 76% score):
Kaguya-sama wa Kokurasetai? Tensai-tachi no Ren`ai Zunousen (2021) [main(x-jat)] [anidb-15807]
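For anyone who wants to reproduce the "My search" request above, here is a small sketch of building the query URL and reading the response. The helper names are made up for illustration, and the response fields are taken only from the sample JSON shown above; other fields or error shapes are not covered.

```python
import json
import re
import urllib.parse

API_URL = "https://anidb.khell.net/api/anidb/id"  # endpoint used in this thread

def build_query_url(title):
    """Percent-encode the title (keeping ':' readable) and append it as ?name=."""
    return API_URL + "?name=" + urllib.parse.quote(title, safe=":")

def parse_match(body):
    """Return (numeric AniDB id, matched name, score) from a response body."""
    match = json.loads(body)
    found = re.fullmatch(r"anidb-(\d+)", match["id"])
    if found is None:
        raise ValueError("unexpected id format: %r" % match["id"])
    return int(found.group(1)), match["name"], match["score"]

# Sample response copied from the comment above.
sample = (
    '{"id":"anidb-14111","language":"x-jat",'
    '"name":"Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren`ai Zunousen",'
    '"score":0.9956528544425964}'
)

print(build_query_url("Kaguya-sama wa Kokurasetai: Tensai-tachi no Ren'ai Zunousen"))
print(parse_match(sample))
```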

I'm not suggesting you use a name that's totally different from the metadata (that was just an example to demonstrate the power of this), but the example shows that with the current matching you can name a series exactly as it appears on AniDB and still not get a correct match. Combined with the other problem in this example (anime series often use characters that the NTFS filesystem doesn't support by default), it's a major pain point for users: even when they try to follow naming conventions, it won't always work or be possible. I just happen to also support totally out-of-the-box naming conventions :)

With respect to your concerns about local hosting, I totally understand where you're coming from. However, think of self-hosting the API as just a strongly suggested option for power users who want to ensure uptime and performance. Regular users can still use the API hosted by me, and I plan to keep it operational as much as possible within my capacity. In fact, everything being open source means that anyone could also choose to host it and share the load, and we could perhaps document available instances of this API somewhere (or even change the config to a dropdown to select APIs).

> I use the anidb xml to search the title if the series only has one season

One last note: if users turn this on, it'll effectively use AniDB for the entire search, as it generates [anidb-xxxx] ids for all entries. As I prefer AniDB to TVDB, I think this is actually preferable.

khell · 2023-06-19T02:22:02Z

Also, with a bit more work, it should be possible to use this to do individual episode scanning too, e.g.:
https://anidb.khell.net/api/anidb/id?name=[DKB]%20Tengoku%20Daimakyou%20-%20S01E12%20[1080p]

khell added 3 commits June 18, 2023 19:10

Add ML vector search.

8097844

Fix matching (ignore server scoring).

fec0152

Update README and config descriptions.

52b21b7

Support multiple matches and use the correct name in metadata.

bb48877

Fix score ranking collisions from precision loss + handle additional error edge cases.

a2b1aa1
