Face detection #342

Open · wants to merge 35 commits into main

Conversation

@whyboris (Owner) commented Jan 24, 2020

Face detection and extraction works 😎

App builds without crashing -- 125MB file 👌 not too big.

The currently built app doesn't seem to extract photos because the weights folder needs to be moved elsewhere within the installed directory (e.g. Program Files/Video Hub App 2/...)

I'll update the branch later today once I fix that up.

Will close #341

🚀 Will release version 2.2.0 once this feature is done 🚀

update: this feature got delayed to release version 3.0.0 🤷 ... I might get back to it later this year 😅

@whyboris whyboris requested a review from cal2195 January 24, 2020 23:53
@whyboris whyboris mentioned this pull request Jan 24, 2020
@whyboris whyboris added the ⛑️ WIP Work in progress label Jan 26, 2020
@cal2195 (Collaborator) commented Jan 27, 2020

Just tried this out - really really cool!! 🚀

What's the plan for extracting faces from all videos, storing faces that belong to the same person, and grouping the matching videos together?

@cal2195 (Collaborator) commented Jan 27, 2020

I'd assume something along the lines of:

  • An array with each actor (video list & faces)
  • For each new face, check all known actors and compute similarity
  • If within a threshold, add to that actor's array (video id & found faces)

That way we'd have a pre-computed list of actors, along with associated videos. Could then let the user add more metadata.
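
A minimal sketch of that matching loop (all names here are illustrative -- Actor, distance, and the threshold are assumptions, not code from this repo):

interface Actor {
  faces: number[][];   // face descriptors seen so far for this actor
  videoIds: string[];  // videos this actor appears in
}

// Plain Euclidean distance between two descriptors
function distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

function assignFace(actors: Actor[], face: number[], videoId: string, threshold: number): void {
  for (const actor of actors) {
    // close enough to a known face of this actor -> same person
    if (actor.faces.some(known => distance(known, face) < threshold)) {
      actor.faces.push(face);
      if (!actor.videoIds.includes(videoId)) {
        actor.videoIds.push(videoId);
      }
      return;
    }
  }
  // no match anywhere -> new actor
  actors.push({ faces: [face], videoIds: [videoId] });
}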

@cal2195 (Collaborator) commented Jan 27, 2020

It looks like a lot of the face recognition work is done for us:
https://github.com/justadudewhohacks/face-api.js/#face-recognition-by-matching-descriptors
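
A rough sketch of what that section of the face-api.js README describes (assumes the detection, landmark, and recognition models are already loaded; input and the reference descriptor are placeholders):

import * as faceapi from 'face-api.js';

async function matchFaces(input: HTMLImageElement, referenceDescriptor: Float32Array) {
  // detect all faces and compute a 128-value descriptor for each
  const results = await faceapi
    .detectAllFaces(input)
    .withFaceLandmarks()
    .withFaceDescriptors();

  // 0.6 is the library's default distance threshold
  const labeled = new faceapi.LabeledFaceDescriptors('person 1', [referenceDescriptor]);
  const matcher = new faceapi.FaceMatcher([labeled], 0.6);

  for (const result of results) {
    console.log(matcher.findBestMatch(result.descriptor).toString());
  }
}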

@whyboris (Owner, Author) commented Jan 27, 2020

Current intention for future features (not all will be in this PR):

  1. Extract every face from every filmstrip and store it in a new folder (with the video's hash as the file name).
  2. Add a 'faces' view that will simply show a filmstrip with all the faces found.
  3. Create another 'auto-generated tag' view that shows only the two-word tags (hopefully mostly names), but next to each it will have a single photo.
  4. Each of the photos can be replaced by dragging and dropping an image file.
  5. This new view will also (like the auto-generated tags) allow adding more names to the list.
  6. The view will behave like the auto-generated tag view -- clicking will open all the videos that contain that 'name'.

These are the initial features -- they should be pretty easy to implement. Once it's all set and released, I will consider others.

Before that's done -- I'm open to more suggestions / recommendations for changes, etc.

Part of the reason I'm aiming for these 6 things is that they are easy from the code point of view (re-using the same functionality a lot). We'll see how it goes.

@cal2195 (Collaborator) commented Jan 27, 2020

I'm confused about two things:

  1. What's the benefit of having a filmstrip of faces for each video? (Maybe I'm just missing something.)
    Especially if there are multiple actors in a video, I'd assume we'd want to treat each face individually, like hash-1.jpg, hash-2.jpg...

  2. How will these faces get matched to names? Are you planning any recognition initially?

Just from my point of view, I think we should aim (not necessarily right now) for extracting all faces, grouping faces within a threshold (regardless of the video they came from), and allowing users to name each face; selecting a face would show all videos it appears in. What are your thoughts on this? 😄

Just want to be on the same page! 📖

@cal2195 (Collaborator) commented Jan 27, 2020

Re: filmstrips of faces - I actually think it would be very cool to show all the faces of one actor, but I think these should be stored as individual files for the reasons above (multiple actors per video) - combining them in-app would be my go-to idea.

It would also be more interesting, I think, to shuffle the face order, i.e. not have all the faces from video1, then video2, etc.

Clicking a face would still bring up the relevant video though. 😄

@whyboris (Owner, Author) commented Jan 27, 2020

At the moment I'm not doing person identification, only face extraction.

I'm unsure why many of the features exist in this app :trollface: -- people request things, other things I suspect are desired, and others are just easy enough to add that I might as well.

Once I have person identification it may be useful to do something with that, but for now, just extracting faces is enough.

With identification, as you suggest, we could then auto-tag each video with the face found. But I'm unclear how resource-heavy all these operations are. Face detection is fast and straightforward, so I'm doing that as a first step.

@cal2195 (Collaborator) commented Jan 28, 2020

I'd be quite interested in implementing the face identification as above after this, but I'm just afraid of needing to change the underlying structure of the data (storing .jpgs as filmstrips vs. individual faces, etc.). 🤔

If you'd be okay with me changing up some of these things after this merges, I'm happy to work on this! 😄

@whyboris (Owner, Author)

I think you're very right that it makes sense to think through the future functionality so I don't have to break backwards compatibility when adding another feature.

This is a major-enough feature that I could bump VHA to version 3 (good for communicating publicly that there are big new features 🤷‍♂) ... but this is beside the point -- I don't want to break compatibility with an earlier release if I can help it (and a bit of planning will help).

I think before I continue with this PR, I'll try out the face recognition to see how well it works (how much time it takes, etc.) ... I know Picasa was able to do it a decade ago (the feature was released in 2008 😱) so maybe it will work fast enough with this library on today's computers 🤞

All ideas and thoughts welcome 👍

@cal2195 (Collaborator) commented Jan 28, 2020

I'm all for keeping the data as flexible as possible, and as far as I can tell, the best storage would be:

faces/<video-hash>-<face-index>.jpg

This would allow for the current feature (just concatenate all indexes for a given file hash), and allow indexing any face by video hash and id in the future for any features.

For example (pseudocode):

class Person {
   name: string;
   faceMatches: Face[];
}

class Face {
   originalVideoHash: string;
   index: number;
}

This possible class would allow listing videos with an actor, moving faces to the correct actor groupings, etc.

Just keeping the future in mind! 🚀 😄
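
A tiny sketch of how that layout could be consumed (the helper names are hypothetical, not from the actual codebase):

// Build the storage path for one face
function facePath(videoHash: string, faceIndex: number): string {
  return `faces/${videoHash}-${faceIndex}.jpg`;
}

// Recover every face belonging to one video via its hash prefix
function facesForVideo(allFacePaths: string[], videoHash: string): string[] {
  return allFacePaths.filter(p => p.startsWith(`faces/${videoHash}-`));
}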

@whyboris (Owner, Author)

btw -- if you're still occasionally doing things with your image-viewer, face detection could be a pretty easy addition (copy my code 😉) https://github.com/cal2195/image-viewer 👍

@whyboris (Owner, Author) commented Feb 1, 2020

I played around with this and I think I've decided that a face filmstrip is still a good idea.

I'm not sure who would use it, but I suspect there may be people who would want to see a filmstrip of just faces.

The way the app is set up, it can consume a filmstrip very easily, so very little code will need to be changed.

I am uneasy about extracting each face as its own file, because rather than 1 strip per video, we may have as many as 100 image files (imagine a long video with many screens extracted and many faces per screen).

The upshot is very little code will need to change:

  1. Create 2 (two!) new folders: faces and facestrip (think film-strip / photo-booth strip)
  • The faces folder will have the first face screenshot (for faster loading when browsing)
  • The facestrip folder will have all faces in a row (for easy scrolling through them)
  2. A new toggleable mode, "face mode", which will turn every view into a face view!
  • Face mode will make the thumbnail and detail views half-width.
  • Text and clip views will be unaffected.

I'm pretty sure this will work out well -- I'll try to get a rough draft by end of Sunday 🤞

@whyboris (Owner, Author) commented Feb 2, 2020

Thinking forward to potential(!) face recognition functionality:

⚠️ first off -- facial recognition will be tricky, both in coding and in getting a good user interface. It might not happen, might happen in stages, or might happen only partially ⚠️

I imagine the primary goal is to be able to ask the question "find all videos that have this person" (even if you have never manually entered the person's name anywhere).

The library I'm using does facial recognition by generating a vector with 128 values:
https://github.com/justadudewhohacks/face-api.js/#face-recognition-model

So (as far as I can tell) the app would have to maintain a database of vectors to compare against (maybe a support file inside the vha- folder). It would also have to cluster very similar faces into a "same person" group. We may want to somehow let users assign names to these found faces. It would be excellent if, in cases where filenames contain a person's name, it were easy to confirm that the face matches (no need to manually type in the name).

The process might be something like this:

  1. Generate vectors for every found face (store an array of vectors for each video)
  2. Cluster faces per video (now each video has a small set of face vectors, each representing a unique person who may appear more than once in the video).
    • possibly give every face representation vector a weight (how many vectors were averaged together to form it) -- so that when we average in another instance, it updates the average proportionally; see the sketch after this list
  3. Each video's "facestrip" will have a corresponding list matching face vector to person
    • e.g. [1, 1, 1, 2, 1, 2, 3] for a facestrip with 7 faces found (3 unique people).
    • edit: this might not be needed, see comment after next
  4. Cluster faces across all videos, generating objects that contain the average vector for a person, and the list of all the videos they are found in.
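
A sketch of the weighted average from step 2 (names are illustrative; it's just an incremental mean, so each new face shifts the stored vector in proportion to how many faces it already represents):

function mergeFaceIntoPerson(
  personVector: number[],
  personWeight: number,
  newFace: number[]
): { vector: number[]; weight: number } {
  // incremental mean: old vector counts personWeight times, the new face once
  const vector = personVector.map((v, i) => (v * personWeight + newFace[i]) / (personWeight + 1));
  return { vector, weight: personWeight + 1 };
}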

At this point, we can drop in any face (even a photo from the computer) and the app will find the face(s) that most likely match it (we manually find the closest match(es) to the picked photo).

All this is tricky, and we'll have to think through all the cross-linking, so we can update things in the future (add new people, merge people and have all the references update automatically, without bugs).

Lots to think through!

The important part (the decision at this point) is whether having a facestrip will make things harder than having individual photos; I think it will not.

@whyboris (Owner, Author) commented Feb 2, 2020

It would be a great bonus if the resulting set of identified people (map from vector to name) can be stored separately / exported. This way, there will not be a need to re-label the same individuals in another hub!

The user interface for tagging people, merging faces that are of the same person, and telling the app that some faces don't belong is going to be very challenging 😓

@whyboris (Owner, Author) commented Feb 2, 2020

I am hoping the face recognition process can avoid clustering of any kind. The comment before the previous one might not describe the process we'll end up using.

This routine works only if it's sensible to (experimentally find and) use some sort of a rough cutoff for what face is "close-enough" that the app considers the photo to be of the same person as found before.

Perhaps, we will simply go through each video, one-at-a-time, following some version of this routine:

  1. Within each facestrip, go through each face and compare it to all face vectors found in this facestrip; either
    • create a new "person" with their own vector representation (add to this temporary list of face vectors), or
    • average the newly found face with the closest vector
  2. After every facestrip has been processed this way, compare the resulting persons with the global list of persons. Either:
    • add this person to the global list (if no face there is close enough)
      • create a screenshot with the person's id as the file name in the persons folder
    • or remember which person from the global list was in this video
  3. Update the clip with the new information (e.g. "persons": [5, 27, 40], indicating which people from the global list are in this video)

At this point we'll have a new persons folder with a growing number of images, all named after persons in the global list (e.g. 5.jpg).

Eventually the user will be able to add names to each 'person' and merge people (perhaps by giving them the same name). When merging, we will just go through the ImageElement list and update the "persons" array with the new number (e.g. 40 -> 5).
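
A sketch of that merge step (field names here are assumptions based on this discussion, not the actual ImageElement type):

interface ImageElementLike {
  persons: number[]; // ids into the global person list
}

// Rewrite every reference to person `from` (e.g. 40) to person `into` (e.g. 5)
function mergePersons(elements: ImageElementLike[], from: number, into: number): void {
  for (const el of elements) {
    el.persons = Array.from(new Set(el.persons.map(id => (id === from ? into : id))));
  }
}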

The global person list will perhaps have a shape like this:

interface Person {
  vector: number[]; // 128-number representation of the face
  weight: number; // number of faces merged into this person, so new merges adjust the vector proportionally
  videos: string[]; // array of hashes pointing to videos
  id: number; // the unique id given upon creation; corresponds to the .jpg file name
}

This way it would be easy to find all videos with a particular person -- just show all the videos in the 'videos' array.
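
For example, finding those videos is just a lookup (a sketch against the interface above):

function videosForPerson(globalList: Person[], personId: number): string[] {
  const person = globalList.find(p => p.id === personId);
  return person ? person.videos : [];
}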

@cal2195 (Collaborator) commented Feb 2, 2020

I feel like this is getting a little overcomplicated - we should keep it as simple as possible. 😄

I strongly suggest we do one of these rather than storing a filmstrip .jpg file (note this is about file storage, not the face-filmstrip feature - I rather like that feature):

  1. Store each face found as its own separate .jpg file.
  2. Don't store face images at all; just store the face coordinates and the video hash to grab the face from instead.

I have two reasons for this:

  1. This allows the greatest flexibility for the future - we're really going to hurt ourselves when adding more face-related features (e.g. recognition) if we tie faces to the videos they're from. (Imagine loading 20-screenshot-sized files to grab just one face, 100+ times.)
  2. The resulting file sizes will be smaller - with a filmstrip, you upscale faces so they're all a uniform size, which is wasted space. Storing small faces (which they usually are) at their small size, just with the correct ratio, will save space considerably.

And lastly, along with either of the options above, I suggest we store all facial recognition data to disk, with the corresponding face (database, files or otherwise).

I know it might seem wasteful, but I am 100% for thinking ahead, and this way any new features will already have the facial data computed from the first extraction. There's no point filtering the data now if we don't know exactly what we'll want in the future. And 128 numbers isn't exactly a lot of data these days... 🤣

Imagine we implement facial recognition using algorithm X, but then realise that algorithm Y is vastly superior in results and runtime. If we have stored all the data we extracted the first time, it's just a simple reprocessing of this data (which would take seconds), rather than needing to increment a major version number and break backwards compatibility.

I know it's more difficult to do, but implementing your feature by stitching together X face .jpgs will be much better in the long run than deciding on a restrictive format now and either hacking solutions later or risking losing compatibility.

Hope that makes sense! 😄 👍

@cal2195 (Collaborator) commented Feb 2, 2020

Actually, now that I think about it some more, why don't we just store the face coordinates for each filmstrip, along with the recognition data? 🤔

For now, you could just have a class which stores:

class Face {
    videoHash: string;
    faceOrds: { x: number; y: number; w: number; h: number };
    faceData: number[]; // the 128-number face descriptor
}

Then it should be easy to implement your feature, and we don't have to store any new images! 💡

Then for each filmstrip in "face mode" - when you scroll across the frame, just use CSS magic to zoom into each face, like we do along the X axis to view each screenshot?

And to find the faces for each filmstrip, just filter the above class by video hash and order the resulting faces by X value, followed by Y value! 😄
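
A sketch of that lookup, using the Face class above (sort order: left-to-right, then top-to-bottom):

function facesForFilmstrip(allFaces: Face[], videoHash: string): Face[] {
  return allFaces
    .filter(f => f.videoHash === videoHash)
    .sort((a, b) => a.faceOrds.x - b.faceOrds.x || a.faceOrds.y - b.faceOrds.y);
}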

@whyboris (Owner, Author) commented Feb 2, 2020

Thank you for your thoughts 🤝

Thank you for clarifying which features we are discussing. The facestrip idea is easy to release because I already have most of the code done. What you were correctly pointing out is that we don't want to get into a dead end and have to undo things for the face recognition feature.

What I described above (in the last comment) is an attempt to spell out the details of how the code would run. I'm currently unsure how face recognition would work (it seems the library we use only gives us vectors per face, and we need to compare the Euclidean distance to check whether a face is similar to anything in our database). So we'll need some 'database'.

Face coordinates (face vectors) need to be extracted only once, and once we know which person is in the video, we can just store the person's id in the ImageElement -- otherwise we'll have a ton of duplicated information, and the .vha2 file will balloon with a lot of data. I'm pretty sure we don't want vectors inside the .vha file 🤔

I feel like once a person exists (we have face coordinates for them), then in an appropriate view (perhaps the Details View?) each video will have a headshot of all the people found in it.

Different videos will have the same headshot shown (so each person will have a single .jpg assigned to them). So we don't need to have a .jpg for every face found -- only one per person.

I'm unsure (will run experiments today) how quickly we can compare a face against, say, 1,000 'persons' ... this may change how I think about the problem.

I'll write more later (having lunch now), I know I've not responded to everything you said yet 👍

@whyboris (Owner, Author) commented Feb 2, 2020

Important note: it looks like if we just do face detection, many faces get detected, but if we run the same script and also ask the face-api.js library to generate the 128-number face vector, it detects fewer faces (probably because they are not aligned well enough or something).

This may mean we'll want to do two passes -- one just for detecting faces for a good filmstrip, and a second pass just for facial recognition.

I now have the code that will also extract the largest image for the single-face preview for each video file: whyboris/extract-faces-node#2

I'm still experimenting to see how well simple vector comparison works (speed etc -- may need to change my approach to the whole feature depending on how things go).

I'll keep everyone posted here 🙆‍♂

@StevenReitsma

I have some experience with face recognition so I'm sharing my thoughts hoping they will be of use.

> This routine works only if it's sensible to (experimentally find and) use some sort of a rough cutoff for what face is "close-enough" that the app considers the photo to be of the same person as found before.

In my experiments this is possible. I normally use a threshold of 0.6 for Euclidean distance (so distance < 0.6 means the face matches). This is also the default in the well-known Python face_recognition package (https://face-recognition.readthedocs.io/en/latest/_modules/face_recognition/api.html#compare_faces), which I think uses the exact same model for generating the embeddings.
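
In code, that check is as simple as (a sketch; I believe face-api.js also exports a euclideanDistance helper that could be used instead):

function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// same person if the embeddings are within the 0.6 cutoff
function isSameFace(a: number[], b: number[], threshold = 0.6): boolean {
  return euclideanDistance(a, b) < threshold;
}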

> Perhaps, we will simply go through each video, one-at-a-time, following some version of this routine:
>
>   1. Within each facestrip, go through each face and compare to all face-vectors found in this filmstrip; either
>     • create a new "person" with their own vector representation (add to this temporary list of face-vectors), or
>     • average the new found face with the closest vector

I think averaging the embeddings might be a bad idea. Since a face can have many different instances due to lighting, position, rotation, wearing glasses vs. not, etc., it's possible for the same face to have widely different embeddings. I would therefore recommend saving all the embeddings separately so it's easier to find matches. A downside is the additional processing required when matching, but calculating Euclidean distance is not that costly.

In my face recognition applications I save every embedding, and when a new image is processed I compare the extracted embeddings with every embedding in the database (no performance issues for ~2000 embeddings, but YMMV). For videos the number of embeddings is of course a lot bigger, so maybe in the end you'll have to cluster the embeddings somehow for performance reasons. Still, I would then refrain from clustering down to a single embedding per person, and instead save multiple.
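
Roughly, in code (illustrative names, with a Euclidean distance helper like the one sketched earlier):

interface StoredPerson {
  name: string;
  embeddings: number[][]; // every embedding ever seen for this person
}

// A query face matches a person if ANY of their stored embeddings is close enough
function findMatch(db: StoredPerson[], query: number[], threshold = 0.6): StoredPerson | undefined {
  return db.find(p => p.embeddings.some(e => euclideanDistance(e, query) < threshold));
}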

Looking forward to seeing more progress on this! Would be amazing to filter videos by person!

@whyboris (Owner, Author)

Thank you Steven!

I came across the 0.6 threshold in an article by the author of face-api.js - it's good to know it's a good default number to use.

I wish averaging embeddings were a good idea -- in my short experiment, I made pairwise comparisons of 5 photos of the same person from video frames I extracted (10 total comparisons) and only 3 were below the 0.6 threshold. But when I averaged two of the embeddings that were below the 0.6 threshold and compared the average to all the other images -- they were all below 0.6 😅

When I compared the average of one person's face to the average of another person's face, they were different.

I'll experiment with this some more when I have time to see how much mileage I'll get out of it, even against your warning 😅

In the case of 'averaging' photos, doing that with different people gives you an 'average person', but I'm hoping that averaging vectors does something different 🤞 especially since I'm aiming to average across the same person 🤷‍♂

All feedback/comments/advice is very welcome 🙇

My current experiment continues here: whyboris/extract-faces-node#2

@whyboris (Owner, Author) commented Jun 22, 2020

I've not forgotten about this feature -- I just want to finish releasing version 3.0.0 first -- see here #456 🎉

@whyboris whyboris changed the base branch from master to main July 26, 2020 20:04
@whyboris whyboris mentioned this pull request Nov 4, 2020
@whyboris (Owner, Author)

May as well note it here: the vectorizations of faces will be stored in a separate JSON -- not inside what is currently the .vha2 file (containing all the metadata about videos). This will be most convenient to release as VHA v4.0.0, updating the file to .vha4 and making it a .zip format (but with a renamed extension) which will contain two files: the usual metadata (same as .vha2 but with references to faces as an extra field) and the new JSON with all the vectors 👍
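
A sketch of writing such a container (assuming a zip library such as the adm-zip npm package; the file names inside the archive are placeholders):

import AdmZip from 'adm-zip';

function writeVha4(outPath: string, metadata: object, faceVectors: object): void {
  const zip = new AdmZip();
  zip.addFile('metadata.json', Buffer.from(JSON.stringify(metadata))); // same shape as .vha2, plus face references
  zip.addFile('faces.json', Buffer.from(JSON.stringify(faceVectors))); // the new vector JSON
  zip.writeZip(outPath); // e.g. 'hub.vha4' -- a renamed .zip
}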

@whyboris (Owner, Author)

🤞 Hoping to resume this feature in 2024 🤞
