Introduction
Facial comparison, a vital cog in the vast machinery of facial recognition technology, operates on principles that meld both art and science. At the core of this process are deep learning algorithms and neural networks, trained meticulously to recognize and compare minute facial features with astounding accuracy.
Each face is a unique amalgamation of numerous features and characteristics, such as the distance between the eyes, the width of the nose, the depth of the eye sockets, and the shape of the cheekbones, among others. This collection of facial features is referred to as a faceprint, a unique identifier akin to a fingerprint.
A Client Request
A client contacted us and asked if we could find the name and location of someone who had skipped bail, so that a warrant could be issued.
They needed visual confirmation that the person they thought was “subject X” actually was “subject X”.
The client provided us with 3 different pictures. Each picture contained the face of the “person of interest”, but came with no name and no location information, just what was in the picture itself.
Luckily FaceMRI can do all the complicated searching and find the right person.
But here is what happens under the hood.
Note: This is not the actual “Person of Interest”; these are photographs by andrea-piacquadio at pexels.com. But we will use these pictures as an example of the process.
We used our FaceMRI depth/breadth-first search to try to find this person on the Internet, and I want to explain what makes that so hard.
Extracting Faces
Face recognition is the process of extracting faces from pictures, comparing those faces to a “person of interest”, and seeing if they are the same person.
If I were to break this down into simple steps, it might be:
Find all faces in the picture
We start by making a Face profile of the “Person of Interest”
Find all people in the picture by running a face recognition filter through it.
Pass the image into a neural network that has been trained to recognize faces. This runs a filter over the image looking for shapes that look like faces, and it might find the same face multiple times.
Pick the best faces
Most face extraction neural networks will find the same face more than once, and it gets worse when there are multiple faces in the image.
We use an algorithm that keeps the largest face rectangle and discards the ones overlapping it.
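The largest-rectangle-wins step described above can be sketched as a simple form of non-maximum suppression. This is an illustrative implementation with made-up box coordinates, not FaceMRI’s actual algorithm:

```python
# Boxes are (x, y, w, h) face rectangles from a detector.

def overlap_area(a, b):
    """Area of the intersection of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return dx * dy if dx > 0 and dy > 0 else 0

def pick_best_faces(boxes):
    """Keep the largest rectangle of each overlapping group, discard the rest."""
    kept = []
    # Sort largest-first so the biggest rectangle of a group is kept.
    for box in sorted(boxes, key=lambda b: b[2] * b[3], reverse=True):
        if all(overlap_area(box, k) == 0 for k in kept):
            kept.append(box)
    return kept

# Two overlapping detections of the same face, plus one separate face.
detections = [(10, 10, 100, 100), (15, 12, 90, 95), (300, 40, 80, 80)]
best = pick_best_faces(detections)  # the duplicate detection is discarded
```

Real systems usually keep boxes whose overlap is below some IoU threshold rather than requiring zero overlap, but the principle is the same.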
What Face Recognition Method should I use?
There are many methods for finding faces in images and their use will depend on the type of data you are processing and how you are processing the data. Don’t be fooled, the latest and greatest face recognition neural network may not be the best for your application.
There is Haar cascade detection (haarcascade_frontalface_default); everyone uses this model in the beginning. It’s fast but full of false positives, especially on low-resolution images.
There is the res10_300x300_ssd_iter_140000 face detection model, which is good but has issues finding all the faces in group pictures, and the input picture needs a lot of preprocessing before passing it into the model. It also has issues with images over 3000 pixels in resolution.
There are many more face detection models based on CNN and DNN networks and we looked at them all.
Team FaceMRI spent many months testing the different face detection models for speed, accuracy, false positives, false negatives, and memory footprint. We handpicked the very best model for use in all scenarios.
Align the Face
Most faces in the wild are not in a perfect head-on orientation; they may be profile views or skewed frontal faces, and it’s important, for now anyway, that all faces have the same orientation where possible.
We need all our faces to be in the same orientation, if possible, for the face recognition phase to work. You can see in this image the face is at a steep angle.
Now we run the face image through an eye detection neural network.
- Run the face through the eye detection neural network
- Find the 2 eye coordinates
- Use simple trig to calculate the angle needed for rotation
- Rotate the image
- Re-crop the image so that we don’t get missing corners
That is a lot of work for each face.
Luckily we only have to do steps 1 to 3 if the rotation required is less than ±10 degrees.
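The trig in step 3 is just the angle of the line through the two eye centres. Here is a minimal sketch, with made-up eye coordinates and using the ±10 degree threshold from the text:

```python
import math

def roll_angle(left_eye, right_eye):
    """Angle (in degrees) of the line through the eyes relative to horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))

def needs_rotation(left_eye, right_eye, threshold=10.0):
    """We can skip steps 4-5 (rotate + re-crop) when the face is nearly level."""
    return abs(roll_angle(left_eye, right_eye)) > threshold

angle = roll_angle((30, 60), (90, 95))       # a steeply tilted face: ~30 degrees
level = needs_rotation((30, 60), (90, 62))   # an almost-level face: no rotation needed
```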
Learn the Face Embedding
Next we need a unified way to learn what each face looks like, so we can compare 2 faces and see if they are the same person.
For this we use face embeddings: a face embedding network is a neural network trained on over 1 million different faces of all ages, genders, and races.
We run our 3 faces from the “Person of Interest” through the face embedding network and get a 128-dimensional vector for each face.
Inside that vector is the network’s representation of the face: the nose, eyes, ears, hair, skin, and other aspects. That vector embedding is not very easy to visualize, however.
Now that we have the face embedding, we can run it through another network called the VanillaV embedding network, which basically gives us the gender of the face for free. We can save this trick for later!
Let’s visualize the 128-dimensional vector. The square root of 128 is about 11.3, so let’s round up and create a 12 x 12 pixel canvas.
Take each dimension and normalize it from [-2, +2] into [0, 255], clamping, so we can visualize it as pixels.
We won’t be using this vector visualization but it’s helpful later on when we look at vector latent spaces.
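The visualization above can be sketched in a few lines. The [-2, +2] range and the canvas size follow the text; the example embedding values are made up:

```python
def to_pixel(value, lo=-2.0, hi=2.0):
    """Clamp a value into [lo, hi] and scale it to a 0-255 grey level."""
    clamped = max(lo, min(hi, value))
    return int(round((clamped - lo) / (hi - lo) * 255))

def embedding_to_canvas(embedding, side=12):
    """Lay the 128 values out row by row; unused trailing cells are padded with 0."""
    pixels = [to_pixel(v) for v in embedding]
    pixels += [0] * (side * side - len(pixels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# A toy 128-dimensional "embedding": extremes first, zeros after.
canvas = embedding_to_canvas([-2.0, 0.0, 2.0] + [0.0] * 125)
```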
Now we have our “Person of Interest” which includes 3 faces, Vector embedding and Vector visualization.
But that represents what each face vector looks like in isolation, and not in any relationship to the other faces.
Visualizing the Vector Faces of a Person
Vector spaces are how FaceMRI can get the gender of a face for free, without any extra work.
Below is a small introduction into vector latent spaces for those interested.
We can visualize these 3 faces of the “Person of Interest” (POI) in a 2D latent vector space.
We can also take the 128-dimensional vector and normalize each dimension from [-2, +2] to [0, 2000].
So our vector [-2, 0.001, …, -0.8883] becomes something like [0, 1000, …, 556], where each normalized value is the Y position of our pixel and the dimension index is the X position.
So a vector starting [0, 15, 200] would plot as pixels p(0,0), p(1,15), p(2,200), and we can plot these in a bitmap image.
Then we flip the picture 90 degrees so it fits nicely into our blog page.
Now we can visualize vectors in the same vector space and plot our 3 faces, using a different color for each face.
Hidden inside that space are a face’s gender, race, eye color, and age; you just need some face math to extract them.
And when we get a closer look, it’s really quite beautiful. You just need the correct telescope to see the face’s age, gender, and race. We built our own special telescope, called “VanillaV”, just for this.
Leveraging the Vector Embedding and Latent Space
We will come up with some constraints for this POI in the vector latent space.
You may have noticed these 128-dimensional vectors [-1, …, -0.4] behave just like familiar 3-dimensional vectors [0, 6, 7]. A 3D vector maps onto the real-world coordinate geometry we use every day via Euclidean space, and so does a 128-dimensional one.
That means we can use measures like Euclidean distance, cosine similarity, and TSS correlation.
Let’s give each face an ID (749, 1349, 799) and the person the ID 1552.
| Pair | Euclidean | Cosine | TSS |
| --- | --- | --- | --- |
| 749 vs 1349 | 9.79 | 0.17 | 1.41 |
| 749 vs 799 | 10.45 | 0.119 | 1.92 |
| 1349 vs 799 | 9.81 | 0.36 | 1.14 |
In the wild a person can have wildly different faces: with a mask, glasses, makeup, a smile, and so on.
So it is expected that we don’t see perfect agreement between the distances:
Euclidean (9.79, 10.45, 9.81), Cosine (0.17, 0.119, 0.36), TSS (1.41, 1.92, 1.14)
And we can use those values later to double and triple check when we think we have found a candidate face.
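Of the measures above, Euclidean and cosine distance are standard and easy to sketch (TSS correlation is FaceMRI’s own metric, so it is not shown). The toy 4-dimensional “embeddings” below stand in for the real 128-dimensional ones:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, larger when they diverge."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

face_a = [0.4, -0.8, 0.1, 0.3]
face_b = [0.5, -0.7, 0.0, 0.2]
d_euc = euclidean(face_a, face_b)        # small: these faces are close
d_cos = cosine_distance(face_a, face_b)  # near 0: almost the same direction
```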
Finding our “Person of Interest”
Now that we have our “Person of Interest” (POI), we need to see how we would compare it to another face. Let’s take a look at a real-world picture of a group selfie.
Here is a quick overview of the steps required again:
Step 1: Run the image through the face extraction neural network
Step 2: For each face (12 faces)
- Extract the face from the picture
- Get the eyes of the face
- Rotate the face if required, re-crop
- Run the face through the embedding model
- Run the face through the VanillaV model to get the gender
And yes, some faces are partial faces, but we still need to do the same amount of work.
But we won’t need to align a face that only has 1 visible eye, lucky us!
So we need to do all that work for each face in the picture, EVERY SINGLE TIME.
Comparing the POI face to the candidate faces
Now we can compare the POI face to each of the 12 candidate faces.
We will need to compare the 3 faces in our POI to the 12 candidate faces: 36 comparisons.
But we also know which candidate faces are male (6) and female (6).
So we only need to run the 3 POI faces against the 6 matching candidates, cutting the workload by 50%. We can then run each comparison in 6 threads.
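The gender pre-filter can be sketched as below. The face names, genders, and the candidate list are invented for illustration; the point is that 3 POI faces against 6 same-gender candidates means 18 comparisons instead of 36:

```python
def comparisons(poi_faces, candidates, poi_gender):
    """Return (poi_face, candidate) pairs, skipping candidates of the other gender."""
    matching = [face for face, gender in candidates if gender == poi_gender]
    return [(p, c) for p in poi_faces for c in matching]

poi = ["poi_1", "poi_2", "poi_3"]
# 12 candidate faces, alternating male/female: 6 of each.
group = [(f"face_{i}", "F" if i % 2 else "M") for i in range(12)]
pairs = comparisons(poi, group, "F")  # 3 x 6 = 18 pairs, half the naive 36
```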
We won’t go over the math for the Face Vector Comparisons in this blog as it is a bit involved, but it can run quickly.
More Than Just Images = Videos
There is more than just images in the world, you know; there are videos too.
Can we see if our POI is in a YouTube video, a TikTok video, or a video file on your hard drive or a USB key? Looking for people in images is not enough anymore; videos are the biggest search space for face recognition.
Searching for people inside videos is a massive topic; we will only go into a little detail here.
In theory a video is just a collection of images played back at a fixed speed. Some videos have 24 frames per second, others 26, 30, or even 60.
Let’s take a 90-minute video at 26 frames per second: (60 sec × 90 min) = 5,400 seconds, and (26 frames × 5,400) = 140,400 frames, or pictures.
A worst-case scenario is a video with 12 people in every frame for 90 minutes.
That would be (140,400 × 12) = 1,684,800 faces in the video.
In the lowest-case scenario there is on average 1 face per frame = 140,400 faces.
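The frame arithmetic above as a quick worked calculation (90 minutes at 26 fps, with the worst case being 12 faces in every frame):

```python
def total_frames(minutes, fps):
    """Number of frames in a video of the given length and frame rate."""
    return minutes * 60 * fps

def total_faces(minutes, fps, faces_per_frame):
    """Number of face crops to process, assuming a fixed face count per frame."""
    return total_frames(minutes, fps) * faces_per_frame

frames = total_frames(90, 26)         # 5,400 seconds x 26 fps = 140,400 frames
worst_case = total_faces(90, 26, 12)  # 12 faces in every frame
best_case = total_faces(90, 26, 1)    # on average 1 face per frame
```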
That is just 1 feature length film !
3.7 million videos are uploaded to YouTube EVERY SINGLE DAY.
34 million short videos are uploaded to TikTok EVERY SINGLE DAY.
90 million pictures and videos are uploaded to Instagram EVERY SINGLE DAY.
Cutting down the search space
We need a way to search through a video fast, looking for one or more POIs. Here is an example video from Chicago airport, again from Pexels.com.
The video has 231 frames; I am showing the first 14 frames below for reference.
And usually we can skip N frames, so we have fewer images to process.
Skipping frames not only reduces the search space but means the video decoder can skip decoding some frames, moving to the next frame faster.
Tracking and Stacking
Next we will perform an algorithm called Track and Stack. Videos designed for human consumption normally have some sort of continuity: the subject is in the video for a few consecutive frames, then is gone.
We can take advantage of this by analyzing batches of frames, normally 32 or more at a time.
We look at the first 4 frames of this image sequence and find the faces in each frame.
At this stage we only find the faces; we don’t run the full set of algorithms.
For each face:
- Extract the face from the picture (we can skip all the other steps for now)

The steps we skip for now, for reference:
- Get the eyes of the face
- Rotate the face if required, re-crop
- Run the face through the embedding model
- Run the face through the VanillaV model to get the gender
Now we have 10 faces from those 4 frames, but we don’t know how to sort those faces into the X different people.
We mark each face’s x, y position with a radius-r tracker, and if a face in the next frame is inside that radius, we stack those faces into a person.
There are obvious drawbacks to this approach: if the scene changes and a different face lands in the same tracker position, we now have a stack of faces in a person that are NOT the same person. We will label each Person Stack P1, P2, and P3.
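A minimal sketch of the radius tracker, assuming toy face centres and a made-up radius (real trackers also handle velocity, occlusion, and stack expiry):

```python
import math

def track_and_stack(frames_of_faces, radius=40.0):
    """frames_of_faces: list of frames, each a list of (x, y) face centres.

    A face joins an existing Person Stack when its centre falls inside a
    tracker's radius; otherwise it starts a new stack.
    """
    stacks = []  # each stack: {"pos": last (x, y), "faces": [(x, y), ...]}
    for faces in frames_of_faces:
        for face in faces:
            for stack in stacks:
                px, py = stack["pos"]
                if math.hypot(face[0] - px, face[1] - py) <= radius:
                    stack["faces"].append(face)
                    stack["pos"] = face  # the tracker follows the face
                    break
            else:
                stacks.append({"pos": face, "faces": [face]})
    return stacks

frames = [[(100, 100), (400, 120)],   # frame 1: two faces
          [(108, 103), (395, 125)],   # frame 2: both moved slightly
          [(700, 90)]]                # frame 3: a new person appears
people = track_and_stack(frames)      # 3 Person Stacks
```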
A long video might have 600 Person Stacks, but here are 3 for this example.
(Left: Tracking and Stacking working. Right: Tracking and Stacking not working.)
The face quality is really poor, but welcome to real-life face video processing; this is not your OpenCV face recognition tutorial where everything is perfect!
We will need to compare all the faces in each Person Stack to make sure they belong to that person. We can’t use the face embeddings, since that would mean running the eyes-and-angle correction for each face in the stack, which is too much processing.
Instead we compute a simple hash for the faces in the Person Stack.
Hashing is just taking an RGB image and creating a short string of letters, like “AJDHD”.
We create a hash for each Face in the Person Stack.
It is easy to see which hash doesn’t match in our Person Stack P1; in this case the face with hash DGDSG moves into a new Person Stack. RGB image hashing in reality is quite complicated, but we won’t go into the details here.
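To make the idea concrete, here is a toy average-hash: threshold each cell of a tiny greyscale grid against the mean and encode the bits as letters. The 2 x 2 grids and the A-P letter encoding are assumptions for illustration; a production perceptual hash is far more involved:

```python
def average_hash(grid):
    """grid: list of rows of grey values (0-255). Returns a short letter string."""
    flat = [v for row in grid for v in row]
    mean = sum(flat) / len(flat)
    bits = [1 if v >= mean else 0 for v in flat]
    # Pack every 4 bits into one letter A-P so the hash reads like "AJDHD".
    letters = []
    for i in range(0, len(bits), 4):
        nibble = bits[i:i + 4]
        value = sum(b << (3 - j) for j, b in enumerate(nibble))
        letters.append(chr(ord("A") + value))
    return "".join(letters)

face_1 = [[10, 200], [220, 30]]   # 2 x 2 toy "face crops"
face_2 = [[10, 210], [215, 35]]   # nearly identical lighting pattern
face_3 = [[240, 20], [15, 230]]   # a very different pattern
h1, h2, h3 = (average_hash(f) for f in (face_1, face_2, face_3))
```

Similar crops hash to the same string, so a face whose hash disagrees with the rest of its stack stands out immediately.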
Frame Hashing
After looking at X frames we can take advantage of frame hashing: we keep a record of the previous frame’s hash, and if the current frame’s hash is 95% similar to the last one, we can skip the current frame!
Image frame hashing can be complex and does have a time penalty; you need to make sure the cost of creating the hashes is less than the cost of processing the frame in the first place.
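The skip decision itself is cheap once the hashes exist. A sketch, assuming frame hashes are bit strings and using the 95% threshold from the text:

```python
def similarity(hash_a, hash_b):
    """Fraction of matching bits between two equal-length bit-string hashes."""
    same = sum(1 for a, b in zip(hash_a, hash_b) if a == b)
    return same / len(hash_a)

def should_skip(prev_hash, curr_hash, threshold=0.95):
    """Skip the current frame when it is at least 95% similar to the last one."""
    return prev_hash is not None and similarity(prev_hash, curr_hash) >= threshold

prev = "10101100101001101010"       # hash of the previous frame (20 bits)
curr_near = "10101100101001101011"  # 1 bit changed: 95% similar, skip it
curr_far = "01101100101001101011"   # 3 bits changed: 85% similar, process it
```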
With some intelligent hashing and other metrics we only look at 25% of the frames in a video. For long videos with 90,000 frames we can also perform “jump-ahead” and “look-back” algorithms, which are complex but ensure we look at the minimum number of frames while getting the maximum number of faces.
After looking at all the frames, we now have a collection of Person Stacks, possibly 600 of them, each with a maximum of 10 faces.
If a stack grows beyond 10 faces, we start a new Person Stack.
For each Person Stack we do the following:
- Pick the face with the largest bounding box
- Run that face through the face embedding network (note: we skipped the face alignment for now; we can get back to that)
- Pick a face at random from the stack and pass it through the face embedding network
- Compare the gender and embedding of those 2 faces
- Use some metrics to discard the lowest-quality faces in the Person Stack
This reduces the 600 stacks of 10 faces (6,000 faces) to just 1,800.
Next we sort those 600 stacks into 2 gender buckets, on average 300 male and 300 female.
We run each bucket on its own threads to speed things up.
Then we compare 300 × 300 faces, and even fewer with more optimizations.
In the end we might be left with 20 Person Stacks, but now with up to 20 faces per stack.
We run these 20 Person Stacks through a set of more expensive and precise algorithms, perhaps merging them down to 15 people. Now we run the face alignment on each of the faces in those Person Stacks, which takes less time since we don’t have that many people left.
Now we compare those 15 people to the POI, the “Person of Interest”, using our image embedding algorithms.
Note we are now comparing people to people and NOT faces to faces, since a person might have 3, 10, or 20 faces.
Comparing a person with 3 faces to a person with 15 faces needs yet more algorithms.
BOOM, we either have a match or we don’t!!!
A ton of work to find a “Person of Interest” in just 1 video, don’t you think?
More than Images and Videos
Did you know there is more than images and videos in the world?
There are CCTV cameras too, which are not like videos at all: they don’t assume continuity between frames, and they need a separate set of algorithms.
A CCTV camera typically takes a snapshot image every second, or even every 3 seconds.
A lot of movement can happen between frames at 1 frame per second. I won’t go over the algorithms for this type of searching, but FaceMRI does that too. Aren’t you lucky!
Oh, and RTSP and other streaming protocols, where frames arrive at the server unpredictably!
Webcams, IoT cams, and tracking cameras; oh, and GIF files too. People put faces in GIF files and even PDF files.
FaceMRI searches through those too.
Back to Searching
Now we have the technology to find and compare faces in most digital media (files, images, folders, USB keys, videos, websites, CCTV, RTSP cameras…).
So we can find our “Person of Interest”, right?
Not so fast: where do we look? You can google “white woman looking at a camera” all day long.
Google and Bing results for “white woman looking at a camera”
Finding faces by image search alone is not so easy.
How Much Ice-Berg?
We can split our search into 2 types: Image Search and Video Search.
The biggest hosting platforms for videos are YouTube, Instagram, Facebook, and Vimeo for ad-hoc videos, plus a host of others.
We won’t focus on Netflix, Hulu, or other streaming platforms, since we could just do a face lookup on IMDB, find the actress, and then cross-reference the films she is in.
But how many video hosting platforms are there and can we search them all ?
And how much do we search ? Should we search EVERY SINGLE VIDEO ON YOUTUBE !
- YouTube
- TikTok

We can use the ice-berg search principle: search the top of all the video platform ice-bergs, use clues to see if we are getting closer, and then deep-dive into that particular ice-berg.
We can use some secret sauce drilling to drill the tip of each ice-berg and then use heuristics to know where to drill next.
FaceMRI has special drilling services for each platform to find the right face faster.
But wait there’s more
We can increase search performance by breaking face recognition into 3 modes: Real-time, Online, and Offline.
Online
Online means searching the web for images and videos on websites, YouTube, and so on.
Normally we need the entire video or image streamed to us before we start searching.
Note we are not talking about RTSP and other streaming protocols here as different Person stack algorithms are used.
Offline
Offline means searching a USB stick, hard drive, or network drive, which carries the added burden of “chain of custody”.
This is where FaceMRI goes into forensics mode for police and law enforcement.
We can’t modify the source media in any way: if we find a video on a USB key, we can’t make changes to the USB key, because it could be used in a chain of custody. In that case we need to copy the images and videos over, do the processing, and also log everything and create chain-of-custody reports.
Chain of custody requires large amounts of metadata to be stored between searches.
Realtime
In this scenario the data is coming to us in real time, and we need to find the POI in real time and inform the user immediately.
In this scenario the constraints are:
- Find the POI as fast as possible
- Tell the user the POI has been found without waiting for triple confirmation
- The data is usually being streamed
You can connect to an RTSP stream, webcam, or CCTV camera and start receiving images.
Even though it looks like a video file, the stream can essentially be random.
Here is a sequence of 100 frames; frames can be dropped, lost, or simply not recorded.
So we can’t use the Track and Stack algorithms here; each face found needs to be aligned and run through the embedding networks individually.
After each face is found, we spin it off into a search thread and compare that face immediately to the POI.
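The spin-off-a-thread-per-face flow can be sketched as below. The embeddings, face IDs, and the match threshold are toy values, and the real comparison would use the full metric suite rather than a single Euclidean check:

```python
import threading

def euclidean(a, b):
    """Straight-line distance between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def search_face(face_id, embedding, poi_embedding, hits, threshold=1.0):
    """Compare one detected face to the POI and record a hit immediately."""
    if euclidean(embedding, poi_embedding) < threshold:
        hits.append(face_id)  # inform the user without waiting for the stream to end

poi = [0.1, 0.9, -0.3]
stream = {"cam1_f07": [0.12, 0.88, -0.28],   # close to the POI embedding
          "cam1_f09": [0.9, -0.5, 0.7]}      # someone else entirely
hits = []
threads = [threading.Thread(target=search_face, args=(fid, emb, poi, hits))
           for fid, emb in stream.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()
found = sorted(hits)  # the frames where the POI was spotted
```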
Luckily FaceMRI does all this for you, and you can download and use FaceMRI for Windows and Mac at www.facemri.com
More than just faces
At FaceMRI we like to say
“It’s not about who you find, But about what you discover.”
After you search a video, find all the faces, and create all the embeddings, you are left with a trail of metadata about each face, person, and frame.
We can take advantage of this hard work and create network graphs.
Take the example of tracking down a criminal POI: you find their face across the frames, along with all the other people in the video.
You can then ask questions like “who else was in the same frames as our POI?”
After finding the POI, we can see all the people the POI shared frames with, and you can in turn query each of those people for their connections.
With all the faces and metadata saved for each video, you can re-query your original videos for new POIs for free, and run complex queries like “show me all the frames in video X that have 2 or more people”.
Your team can even export face graphs to the cloud and do live face search collaboration.