Agender Project
This is a small demo project to try out and test the OpenCV library and to implement on-the-fly face detection and age and gender estimation using pretrained models.
This article can also be found on Medium.
What do you do when you want to try something new in deep learning? Of course, you search for articles and open-source projects first!
Existing Open-Source Projects for Gender and Age Estimation
Disclaimer: There are many more projects that are not listed here. But I believe I have covered the most popular ones, those that appear on the first pages of search results.
How did I search
I have googled for:
gender age estimation
gender age opencv
gender age keras
gender age tensorflow
gender age caffemodel
gender age pytorch
I looked only at the first one or two pages of results. Then I excluded:
 articles with restricted access,
 projects without source code,
 projects with source code written in a language other than Python,
 projects that perform only age or only gender estimation, not both,
 project duplicates or copies.
After that I dug into the source code to find details of the input image format, output format, model architecture, weight size, license, pretrained model availability, etc.
List of existing projects
Here is what I've found for the topic:

Age and Gender Classification using MobileNets by Kinar Ravishankar.
 Source code: https://github.com/KinarR/agegenderestimatorkeras
 License: MIT
 Framework: Keras/TensorFlow
 Input: RGB images of any size, author used: 224x224x3
 Output:
 gender: two binary classes: Male and Female, choose maximum.
 age: 21 classes, use softmax, choose the maximum and multiply its index by 4.76, which gives roughly the [0..100] years interval.
 Model weights size: we can estimate it, as this model is based on MobileNet_v1_224, followed by one Dense(1024->1024) layer plus two output Dense(1024->1) layers. So there are approximately (4.24 MP + 1.05 MP) = 5.29 MP (MP = million parameters), which is about 21 Mb for float32.
 Pretrained model available: NO
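The age decoding described for this model can be sketched as follows. This is only an illustration, not code from that project: the helper name decode_age_21 and the sample probabilities are made up, and the factor 4.76 is the one quoted above.

```python
def decode_age_21(probabilities, years_per_class=4.76):
    """Map a 21-class softmax output to an approximate age in years.

    Hypothetical helper: takes the index of the most probable class
    and multiplies it by the bin width quoted in the text.
    """
    if len(probabilities) != 21:
        raise ValueError("expected 21 class probabilities")
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index * years_per_class


# Made-up softmax output that peaks at class 5.
example = [0.01] * 21
example[5] = 0.80
print(round(decode_age_21(example), 1))
```

Note that with 21 classes the largest index is 20, so the largest age this decoding can return is 20 * 4.76 = 95.2 years.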

How to build an age and gender multitask predictor with deep learning in TensorFlow by Cole Murray
 Source code: https://github.com/ColeMurray/agegenderestimationtutorial
 License: unspecified
 Framework: TensorFlow
 Input: RGB images 224x224x3
 Output:
 gender: two binary classes: Male and Female, choose maximum.
 age: vector of 101 class probabilities for ages [0..100], choose maximum or use weighted sum
 Model weights size: we can estimate it from the model architecture: Conv(5x5, 3->32) -> MaxPool(2->1) -> Conv(5x5, 32->64) -> MaxPool(2->1) -> Conv(5x5, 64->128) -> MaxPool(2->1) -> Dense(28*28*128 -> 1024) -> Dense(1024 -> 101), Dense(1024 -> 2). 2400 + 51200 + 204800 + 102760448 + 103424 + 2048 = 103.1 MP, which is approximately 393 Mb.
 Pretrained model available: NO
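The size estimate above can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the estimate itself: biases are ignored, every weight is a 4-byte float32, and the helper names are made up for the example.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a square convolution, biases ignored."""
    return kernel * kernel * in_channels * out_channels


def dense_weights(in_units, out_units):
    """Number of weights in a fully connected layer, biases ignored."""
    return in_units * out_units


layer_weights = [
    conv_weights(5, 3, 32),
    conv_weights(5, 32, 64),
    conv_weights(5, 64, 128),
    dense_weights(28 * 28 * 128, 1024),
    dense_weights(1024, 101),  # age head
    dense_weights(1024, 2),    # gender head
]
total_weights = sum(layer_weights)
size_mib = total_weights * 4 / (1024 * 1024)  # 4 bytes per float32 weight

print(total_weights)       # 103124320
print(round(size_mib, 1))  # 393.4
```

Almost all of the weights sit in the first dense layer, which is why this fairly shallow network ends up so large.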

Predicting apparent Age and Gender from face picture : Keras + Tensorflow by Youness Mansar
 Source code: https://github.com/CVxTz/face_age_gender
 License: MIT
 Framework: Keras/TensorFlow
 Input: RGB images 224x224x3
 Output:
 gender: one number in range [0..1], where 0 = Female, 1 = Male.
 age: 8 classes [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
 Model weights size: we can estimate it from the model architecture: ResNet50 -> Dense(100) -> Dense(1). Approximately 100 Mb.
 Pretrained model available: NO

SSRNet: A Compact Soft Stagewise Regression Network for Age Estimation by Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang.
 Source code: https://github.com/shamangary/SSRNet
 Third party source code: https://github.com/shamangary/SSRNet
 License: Apache License 2.0
 Framework: Keras/TensorFlow
 Input: RGB images 64x64x3
 Output:
 gender: one number in range [0..1], where 0 = Female, 1 = Male.
 age: one number
 Model weights size:
 gender: 0.32 Mb,
 age: 0.32 Mb,
 Pretrained model available: YES
 Last models update: Apr 2018

Mxnet version implementation of SSRNet for age and gender estimation by @wayen820
 Source code: https://github.com/wayen820/gender_age_estimation_mxnet
 License: unspecified
 Framework: MXNET
 Input: RGB image 112x112x3
 Output:
 gender: one number in range [0..1], where 0 = Female, 1 = Male.
 age: one number
 Model weights size:
 gender: 3.94 Mb
 age: 1.95 Mb
 Pretrained model available: YES
 Last models update: Oct 2018

Age and Gender Classification Using Convolutional Neural Networks by Gil Levi and Tal Hassner.
 Source code: https://github.com/GilLevi/AgeGenderDeepLearning
 License: as is
 Framework: Caffe. But the models can be loaded with OpenCV.
 Input: 256x256x3
 Output:
 gender: two binary classes: Male and Female, choose maximum.
 age: 8 classes: [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
 Model weights size:
 gender: 43.5 Mb,
 age: 43.5 Mb.
 Pretrained model available: YES, separate models for gender and age.
 Last models update: Sep 2017

Age and Gender Deep Learning with TensorFlow by Rude Carnie (? Daniel Pressel)
 Source code: https://github.com/dpressel/rudecarnie
 License: unspecified
 Framework: TensorFlow
 Input: RGB images 256x256x3
 Output:
 gender: two binary classes: Male and Female, choose maximum.
 age: 8 classes: [0..2], [4..6], [8..12], [15..20], [25..32], [38..43], [48..53], [60..100], use softmax, choose maximum.
 Model weights size:
 gender: inception_v3 based model, 166 Mb,
 age: inception_v3 based model, 166 Mb.
 Pretrained model available: YES, separate networks for gender and age.
 Last models update: Apr/Feb 2017

Easy Real time gender age prediction from webcam video with Keras by Chengwei Zhang
 Source code: https://github.com/Tony607/Keras_age_gender
 License: unspecified
 Framework: Keras/TensorFlow
 Input: RGB images 64x64x3. Possibly, any size can be chosen.
 Output:
 gender: one number [0..1], where 1 means Female,
 age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum
 Model weights size: 186 Mb.
 Pretrained model available: YES
 Last model update: Jan 2018
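Several of the models above output a vector of 101 class probabilities and leave the choice between "choose maximum" and "weighted sum" to the user. The difference can be sketched as follows. The helper name and the sample distribution are made up, and the weighted sum assumes that the probabilities add up to 1.

```python
def age_from_distribution(probabilities):
    """Return (most likely age, expected age) for a 101-class output."""
    if len(probabilities) != 101:
        raise ValueError("expected 101 class probabilities for ages 0..100")
    argmax_age = max(range(101), key=lambda age: probabilities[age])
    expected_age = sum(age * p for age, p in enumerate(probabilities))
    return argmax_age, expected_age


# Made-up distribution with most of the mass at 30 and a tail at 40 and 50.
example = [0.0] * 101
example[30] = 0.5
example[40] = 0.3
example[50] = 0.2

argmax_age, expected_age = age_from_distribution(example)
print(argmax_age, round(expected_age, 1))
```

Here the maximum picks 30, while the weighted sum gives about 37, because it also takes the less probable older ages into account.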

Age and Gender Estimation by Yusuke Uchida
 Source code: https://github.com/yu4u/agegenderestimation
 License: MIT
 Framework: Keras/TensorFlow
 Input: RGB image of any size. Author used 32x32x3
 Output:
 gender: one number [0..1], where 1 means Female,
 age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum
 Model weights size: 187 Mb.
 Pretrained model available: YES
 Last models update: Feb 2018

Age and gender estimation based on Convolutional Neural Network and TensorFlow by Boyuan Jiang
 Source code: https://github.com/BoyuanJiang/AgeGenderEstimateTF
 License: MIT
 Framework: TensorFlow
 Input: RGB image 160x160x3
 Output:
 gender: one number, 0 = Female, 1 = Male.
 age: one number
 Model weights size: 246.5 Mb.
 Pretrained model available: YES
 Last models update: Nov 2017

Apparent Age and Gender Prediction in Keras by Sefik Ilkin Serengil
 Source code: https://github.com/serengil/tensorflow101/blob/master/python/agegenderpredictionrealtime.py and also https://github.com/serengil/tensorflow101/blob/master/python/gender_prediction.ipynb
 License: unspecified
 Framework: Keras/TensorFlow
 Input: RGB image 224x224x3
 Output:
 gender: one number, 0 = Woman, 1 = Man
 age: vector of 101 classes probabilities for ages [0..100], choose maximum or use weighted sum
 Model weights size:
 gender: 553 Mb
 age: 514 Mb
 Pretrained model available: YES, separate models for gender and age
 Last models update: Nov 2017

Multi output neural network in Keras (Age, gender and race classification) by Sanjaya Subedi
 Source code: https://github.com/jangedoo/agegenderraceprediction
 License: unspecified
 Framework: Keras/TensorFlow
 Input: RGB image 198x198x3
 Output:
 gender: one number, 0 = Male, 1 = Female
 age: one number
 race: vector of 5 classes: ['White', 'Black', 'Asian', 'Indian', 'Others']
 Model weights size: unknown
 Pretrained model available: NO
Summary table
| No | Name | Article | Source | License | Framework | Input | Output | Size (gender, age) | Pretrained |
|---|---|---|---|---|---|---|---|---|---|
| 1 | MobileNets by Kinar Ravishankar | link | link | MIT | Keras/TensorFlow | 224x224x3 | gender: 2 classes, age: 21 classes | ~21Mb | NO |
| 2 | ConvNet by Cole Murray | link | link | unspecified | TensorFlow | 224x224x3 | gender: 2 classes, age: 101 classes | ~393Mb | NO |
| 3 | ResNet50 by Youness Mansar | link | link | MIT | Keras/TensorFlow | 224x224x3 | gender: 1 number, age: 8 classes | ~100Mb | NO |
| 4 | SSRNet (original) | link | link | Apache License 2.0 | Keras/TensorFlow | 64x64x3 | gender: 1 number, age: 1 number | 0.32Mb, 0.32Mb | YES |
| 5 | SSRNet on MXNET | None | link | unspecified | MXNET | 112x112x3 | gender: 1 number, age: 1 number | 3.94Mb, 1.95Mb | YES |
| 6 | ConvNet by Gil Levi and Tal Hassner | link | link | as is | Caffe | 256x256x3 | gender: 2 classes, age: 8 classes | 43.5Mb, 43.5Mb | YES |
| 7 | Inception_v3 by Rude Carnie | None | link | unspecified | TensorFlow | 256x256x3 | gender: 2 classes, age: 8 classes | 166Mb, 166Mb | YES |
| 8 | ConvNet by Chengwei Zhang | link | link | unspecified | Keras/TensorFlow | 64x64x3 | gender: 1 number, age: 101 classes | 186Mb | YES |
| 9 | ConvNet by Yusuke Uchida | None | link | MIT | Keras/TensorFlow | 32x32x3 | gender: 1 number, age: 101 classes | 187Mb | YES |
| 10 | ConvNet by Boyuan Jiang | None | link | MIT | TensorFlow | 160x160x3 | gender: 1 number, age: 1 number | 246.5Mb | YES |
| 11 | ConvNet by Sefik Ilkin Serengil | link | link | unspecified | Keras/TensorFlow | 224x224x3 | gender: 1 number, age: 101 classes | 553Mb, 514Mb | YES |
| 12 | ConvNet by Sanjaya Subedi | link | link | unspecified | Keras/TensorFlow | 198x198x3 | gender: 1 number, age: 1 number, race: 5 classes | unknown | NO |
Note: I did not include model's accuracy provided by authors in the description because it has no meaning when different models are tested on different test datasets!
Choosing model
I decided to choose the two most lightweight networks, which are able to process video on the fly using only an average CPU.
My choice is:

No 4, SSRNet, which has separate models for gender and age, each only 0.32 Mb in size! They are very fast in comparison with the other models.

No 6, models by Gil Levi and Tal Hassner. These are also two separate models for gender and age; they are widely used by developers and are about 43 Mb each.
Of course I would like to have one neural net for both gender and age estimation. Maybe I will spend some time and train a model myself. In this case I would definitely use the staged training technique proposed by the SSRNet authors.
This Project Architecture
This simple program randomly chooses a video file from the videos directory.
Then it reads the video frame by frame in a loop until the end or until the user presses the ESC key.
For each frame:
 Get a smaller resized frame, as it is faster to process small images and this barely affects quality.
 Find faces on the small frame.
 Use the face coordinates from the small frame to extract face patches from the original (big) frame.
 Convert and adjust the face patches to the format that the model expects. Construct a blob with all faces.
 Pass the blob of faces through the model(s) to get predicted genders and ages for all faces.
 Draw a rectangle around each face and a label with the estimated gender and age.
Below you may find some more details.
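Before going into the details, the order of the per-frame steps and the data passed between them can be summarized in a small sketch. It is not the project's actual main loop: process_frame is a made-up name, every step is passed in as a callable, and skipping the models when no face is found is an assumption.

```python
def process_frame(frame, resize, find_faces, collect_faces, predict, draw):
    """Run the per-frame steps in order.

    Every step is passed in as a callable, so this function only fixes
    the order of the steps and what flows between them.
    """
    small = resize(frame)                # step 1: smaller copy of the frame
    boxes = find_faces(small)            # step 2: boxes in small-frame coordinates
    if not boxes:                        # assumption: skip the models when no face is found
        return frame
    faces = collect_faces(frame, boxes)  # step 3: patches cut from the original frame
    labels = predict(faces)              # steps 4-5: preprocess and run the model(s)
    return draw(frame, boxes, labels)    # step 6: rectangles and labels


# Stand-in steps that only pass simple values around.
result = process_frame(
    "frame",
    resize=lambda frame: "small " + frame,
    find_faces=lambda small: [[10, 10, 50, 50]],
    collect_faces=lambda frame, boxes: ["face"] * len(boxes),
    predict=lambda faces: ["Male,30"] * len(faces),
    draw=lambda frame, boxes, labels: (frame, boxes, labels),
)
print(result)
```

In the real program the stand-ins would be replaced by cv.resize, findFaces, collectFaces, predictAgeGender and the drawing code.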
Initialization
The face detector is initialized based on the face_detector_kind argument:
# Initialize face detector
if (face_detector_kind == 'haar'):
    #face_cascade = cv.CascadeClassifier('face_haar/lbpcascade_frontalface_improved.xml')
    face_cascade = cv.CascadeClassifier('face_haar/haarcascade_frontalface_alt.xml')
else:
    face_net = cv.dnn.readNetFromTensorflow('face_net/opencv_face_detector_uint8.pb', 'face_net/opencv_face_detector.pbtxt')
The model to estimate age and gender is initialized based on the age_gender_kind argument:
# Load age and gender models
if (age_gender_kind == 'ssrnet'):
    # Setup global parameters
    face_size = 64
    face_padding_ratio = 0.10
    # Default parameters for SSRNet
    stage_num = [3, 3, 3]
    lambda_local = 1
    lambda_d = 1
    # Initialize gender net
    gender_net = SSR_net_general(face_size, stage_num, lambda_local, lambda_d)()
    gender_net.load_weights('age_gender_ssrnet/ssrnet_gender_3_3_3_64_1.0_1.0.h5')
    # Initialize age net
    age_net = SSR_net(face_size, stage_num, lambda_local, lambda_d)()
    age_net.load_weights('age_gender_ssrnet/ssrnet_age_3_3_3_64_1.0_1.0.h5')
else:
    # Setup global parameters
    face_size = 227
    face_padding_ratio = 0.0
    # Initialize gender detector
    gender_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_gender.prototxt', 'age_gender_net/gender_net.caffemodel')
    # Initialize age detector
    age_net = cv.dnn.readNetFromCaffe('age_gender_net/deploy_age.prototxt', 'age_gender_net/age_net.caffemodel')
    # Class labels for the outputs of gender_net and age_net
    Genders = ['Male', 'Female']
    Ages = ['(0-2)', '(4-6)', '(8-12)', '(15-20)', '(25-32)', '(38-43)', '(48-53)', '(60-100)']
Reading video
Currently the video stream is read from a random file in the videos directory.
import math
import os
import time
import cv2 as cv
import numpy as np
# Initialize numpy random generator
np.random.seed(int(time.time()))
# Collect all .mp4 files from the videos directory
videos = []
for file_name in os.listdir('videos'):
    file_name = 'videos/' + file_name
    if os.path.isfile(file_name) and file_name.endswith('.mp4'):
        videos.append(file_name)
# Pick one of them at random
source_path = videos[np.random.randint(len(videos))]
# Create a video capture object to read videos
cap = cv.VideoCapture(source_path)
Detecting faces
Generally, there are two common ways to detect faces:
 using a HAAR cascade,
 using a trained CNN model.
Of course, the CNN model is more accurate, but it requires more computational resources and runs slower.
In this project I decided to implement both ways and choose one via the face_detector_kind argument.
Detecting faces with either HAAR or ConvNet is very easy:
def findFaces(img, confidence_threshold=0.7):
    # Get original width and height
    height = img.shape[0]
    width = img.shape[1]
    face_boxes = []
    if (face_detector_kind == 'haar'):
        # Get grayscale image
        gray = cv.cvtColor(img, cv.COLOR_BGR2GRAY)
        # Detect faces
        detections = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in detections:
            # Enlarge the box by face_padding_ratio and clip it to the image
            padding_h = int(math.floor(0.5 + h * face_padding_ratio))
            padding_w = int(math.floor(0.5 + w * face_padding_ratio))
            x1, y1 = max(0, x - padding_w), max(0, y - padding_h)
            x2, y2 = min(x + w + padding_w, width - 1), min(y + h + padding_h, height - 1)
            face_boxes.append([x1, y1, x2, y2])
    else:
        # Convert input image to 3x300x300, as NN model expects only 300x300 RGB images
        blob = cv.dnn.blobFromImage(img, 1.0, (300, 300), mean=(104, 117, 123), swapRB=True, crop=False)
        # Pass blob through model and get detected faces
        face_net.setInput(blob)
        detections = face_net.forward()
        for i in range(detections.shape[2]):
            confidence = detections[0, 0, i, 2]
            if (confidence < confidence_threshold):
                continue
            # Box corners are returned as fractions of the image size
            x1 = int(detections[0, 0, i, 3] * width)
            y1 = int(detections[0, 0, i, 4] * height)
            x2 = int(detections[0, 0, i, 5] * width)
            y2 = int(detections[0, 0, i, 6] * height)
            # Enlarge the box by face_padding_ratio and clip it to the image
            padding_h = int(math.floor(0.5 + (y2 - y1) * face_padding_ratio))
            padding_w = int(math.floor(0.5 + (x2 - x1) * face_padding_ratio))
            x1, y1 = max(0, x1 - padding_w), max(0, y1 - padding_h)
            x2, y2 = min(x2 + padding_w, width - 1), min(y2 + padding_h, height - 1)
            face_boxes.append([x1, y1, x2, y2])
    return face_boxes
Please note the global variable face_padding_ratio, which determines how much to enlarge the face box detected by either algorithm. Its value depends on the face detection algorithm and on the age/gender estimation algorithm. Ideally, you should choose its value so that the faces you get are very similar to those the model was trained on.
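To see what the padding does to a single box, the same arithmetic can be pulled out into a standalone function. This is a sketch for experimenting: pad_box is a made-up name, and the boxes and image sizes below are arbitrary examples.

```python
import math


def pad_box(x1, y1, x2, y2, padding_ratio, width, height):
    """Enlarge a box by padding_ratio on each side and clip it to the image."""
    padding_w = int(math.floor(0.5 + (x2 - x1) * padding_ratio))
    padding_h = int(math.floor(0.5 + (y2 - y1) * padding_ratio))
    return (
        max(0, x1 - padding_w),
        max(0, y1 - padding_h),
        min(x2 + padding_w, width - 1),
        min(y2 + padding_h, height - 1),
    )


# A box well inside a 640x480 image grows by 10% of its size on every side.
print(pad_box(100, 50, 200, 170, 0.10, 640, 480))  # (90, 38, 210, 182)
# A box close to the border of a 110x110 image is clipped to the image.
print(pad_box(5, 5, 105, 105, 0.10, 110, 110))     # (0, 0, 109, 109)
```

With a ratio of 0.0, as used for the Caffe models, the box is returned unchanged apart from the clipping.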
Extracting faces patches
This is done in two steps:
 Convert the face box coordinates from the small frame to the big original frame: box_orig.
 Get the part of the original frame specified by these coordinates: face_bgr.
We could, of course, extract faces from the small frame. The reason to extract patches from the big frame is that we want to keep as much quality as possible. But we should keep in mind that this may also require slightly more calculations than in the first case.
def collectFaces(frame, face_boxes):
    # Uses globals: width, height (resized frame) and width_orig, height_orig (original frame)
    faces = []
    # Process faces
    for i, box in enumerate(face_boxes):
        # Convert box coordinates from resized frame_bgr back to original frame
        box_orig = [
            int(round(box[0] * width_orig / width)),
            int(round(box[1] * height_orig / height)),
            int(round(box[2] * width_orig / width)),
            int(round(box[3] * height_orig / height)),
        ]
        # Extract face box from original frame w.r.t. image boundary
        face_bgr = frame[
            max(0, box_orig[1]):min(box_orig[3] + 1, height_orig - 1),
            max(0, box_orig[0]):min(box_orig[2] + 1, width_orig - 1),
            :
        ]
        faces.append(face_bgr)
    return faces
Now the faces list contains face patches, all of different sizes.
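The coordinate conversion from the first step can be tried in isolation. The sketch below repeats the same rounding as collectFaces but takes the frame sizes as arguments instead of globals. The name scale_box and the frame sizes are made up for the example.

```python
def scale_box(box, small_size, orig_size):
    """Map [x1, y1, x2, y2] from the resized frame to the original frame."""
    width, height = small_size
    width_orig, height_orig = orig_size
    return [
        int(round(box[0] * width_orig / width)),
        int(round(box[1] * height_orig / height)),
        int(round(box[2] * width_orig / width)),
        int(round(box[3] * height_orig / height)),
    ]


# A face found on a 320x180 frame, mapped back to the 1280x720 original.
print(scale_box([40, 30, 120, 130], (320, 180), (1280, 720)))  # [160, 120, 480, 520]
```

The patch cut from the original frame is 4 times larger in each dimension here, which is the extra detail the models get to work with.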
Estimating age and gender
In most cases neural networks are designed to work in batch mode, i.e. they can process many input samples at once. This is especially useful at training time, as batch-mode training usually helps models converge faster than stochastic training (one sample at a time).
But before we can feed all faces into a model we must convert them to the format that the model expects. At least we should make all faces the same size and normalize their values.
SSRNet expects input to be a tensor of size N x 64 x 64 x 3, where N is the number of faces, 64x64 is the height and width, and 3 is the number of color channels. Individual values in the tensor are min-max normalized to the [0..255] range. Please note the function call cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX), which does the required normalization.
The ConvNet by Gil Levi and Tal Hassner expects input to be a tensor of size N x 3 x 227 x 227, where N is the number of faces, 3 is the number of color channels and 227x227 is the height and width. Individual channels in the tensor should have zero mean but should not be scaled. Please note the parameters scalefactor=1.0 and mean=(78.4263377603, 87.7689143744, 114.895847746) in the cv.dnn.blobFromImages call, which do exactly this.
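The two kinds of value preprocessing can be illustrated without OpenCV. This is a plain-Python sketch of the general techniques (min-max stretching with alpha=0 and beta=255, and per-channel mean subtraction with no scaling), not a reimplementation of cv.normalize or cv.dnn.blobFromImages. The helper names and sample values are made up.

```python
def min_max_normalize(values, low=0.0, high=255.0):
    """Stretch values linearly so the smallest becomes low and the largest high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant input has no range to stretch.
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def subtract_mean(pixel, mean):
    """Subtract a per-channel mean from one pixel without rescaling."""
    return tuple(channel - m for channel, m in zip(pixel, mean))


# A low-contrast patch is stretched to the full range.
print(min_max_normalize([50, 100, 150]))  # [0.0, 127.5, 255.0]
# Mean subtraction shifts each channel but keeps its scale.
print(subtract_mean((100, 100, 100), (78.0, 88.0, 115.0)))  # (22.0, 12.0, -15.0)
```

The first operation changes the contrast of every face individually, while the second applies the same fixed shift to all faces.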
As said, different models require different image preprocessing, so it is done as follows:
def predictAgeGender(faces):
    if (age_gender_kind == 'ssrnet'):
        # Convert faces to N,64,64,3 blob
        blob = np.empty((len(faces), face_size, face_size, 3))
        for i, face_bgr in enumerate(faces):
            blob[i, :, :, :] = cv.resize(face_bgr, (64, 64))
            blob[i, :, :, :] = cv.normalize(blob[i, :, :, :], None, alpha=0, beta=255, norm_type=cv.NORM_MINMAX)
        # Predict gender and age
        genders = gender_net.predict(blob)
        ages = age_net.predict(blob)
        # Construct labels
        labels = ['{},{}'.format('Male' if (gender >= 0.5) else 'Female', int(age)) for (gender, age) in zip(genders, ages)]
    else:
        # Convert faces to N,3,227,227 blob
        blob = cv.dnn.blobFromImages(faces, scalefactor=1.0, size=(227, 227),
                                     mean=(78.4263377603, 87.7689143744, 114.895847746), swapRB=False)
        # Predict gender
        gender_net.setInput(blob)
        genders = gender_net.forward()
        # Predict age
        age_net.setInput(blob)
        ages = age_net.forward()
        # Construct labels
        labels = ['{},{}'.format(Genders[gender.argmax()], Ages[age.argmax()]) for (gender, age) in zip(genders, ages)]
    return labels
That's it.
Results
While implementing this project I analyzed different articles and models that estimate human gender and age from an image.
I have discovered that there are a lot of good models with high accuracy that are nevertheless too big and slow to compute.
On the other hand, there are some small models with lower accuracy that can be used for real-time video processing.
I have successfully used two such models for real-time estimation of age and gender using only an average CPU:
 SSRNet by Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang.
 ConvNet by Gil Levi and Tal Hassner.
The result is great. It was fun to do!
Gender is estimated firmly, while the age estimate fluctuates around the true value. All is done in real time!
Future thoughts
Nowadays cameras are getting cheaper and are placed literally everywhere. But we can never have enough people to watch all those cameras.
I believe there is a demand for small and accurate models that can estimate and describe the content of a video stream in real time, models that can run on a Raspberry Pi or other small platforms.
But today research is mostly concentrated on accuracy rather than on the applicability of models. Researchers get more benefit if their model takes first place for accuracy in a Kaggle competition, but no benefit if their model is the most efficient one, i.e. has decent results with significantly fewer computations. My thoughts are the same as in this article by Michał Marcinkiewicz: The Real World is not a Kaggle Competition.
Of course, one may argue that analyzing the content of a video is still a complex task, and complex tasks require tons of calculations anyway.
But I see at least several ways to achieve high efficiency:

Soft stagewise regression as proposed by the authors of SSRNet. I encourage you to read their article. It is actually a novel approach in NN training. I believe that if we reformulate their basic idea, it can be extended to all other areas of deep learning: not only to regression but also to classification, feature extraction, etc.

Layer reuse as proposed by Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll in their article CONVOLUTIONAL NEURAL NETWORKS WITH LAYER REUSE. Why use many layers, each with its own parameters, if we can repeat the same filters multiple times?

Hidden units reuse. I did not find any article or even a mention of this simple idea. Please tell me if you know of any. The idea is described below.
Hidden units reuse
A typical content analysis pipeline consists of several modules running in sequence or in parallel.
For instance, in this simple project we have:
 1. Input frame -> ConvNet to detect faces -> faces
 2. faces -> ConvNet to estimate gender -> genders
 3. faces -> ConvNet to estimate age -> ages
Where 2 and 3 may run in parallel.
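One way to run the two independent estimators side by side is a small thread pool. This is only a sketch: predict_in_parallel and the stand-in predictors are made up, and whether threads give a real speed-up depends on whether the underlying framework releases Python's global interpreter lock during inference, which is an assumption to verify for each backend.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_in_parallel(faces, predict_gender, predict_age):
    """Run two independent predictors on the same batch of faces."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        gender_future = pool.submit(predict_gender, faces)
        age_future = pool.submit(predict_age, faces)
        return gender_future.result(), age_future.result()


# Stand-in predictors; real ones would wrap gender_net and age_net.
def fake_gender(faces):
    return ["Male" for _ in faces]


def fake_age(faces):
    return [30 for _ in faces]


genders, ages = predict_in_parallel(["face_a", "face_b"], fake_gender, fake_age)
print(genders, ages)
```

Both predictors receive the same list of faces and neither modifies it, which is what makes running them concurrently safe.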
In more sophisticated projects we could also find:
 Input frame -> ConvNet to recognize common objects -> COCO names
 Input frame -> ConvNet for semantic segmentation -> segmented image mask
Note that each ConvNet typically consists of many sequential layers. But I guess that the first convolution layers of different networks are very similar.
I believe that if you take two different networks trained for different tasks, you will find similar filter weights in the first layers of both networks, as they act like basic filters for edge detection.
It means that in complex projects similar filters process the same image several times: first you apply these filters when you find faces in the image, then you apply these (or similar) filters again when you detect the gender of a person, and then once more when you estimate the person's age.
We can save processing time if we get rid of unnecessary calculations and reuse hidden units, i.e. the results of the first layers' filters applied to the input image.
Of course, it's a little bit challenging as it requires us to:
 choose the pretrained first layers carefully,
 freeze their parameters when training the rest of the model layers,
 extract hidden unit values, which could be hard in some frameworks.
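The idea can be illustrated with a toy sketch in which the shared first layers run once and several task-specific heads consume their output. Everything here is hypothetical: the class and function names are made up, and the "layers" are plain Python functions on a list of numbers rather than real networks.

```python
class SharedFirstLayers:
    """Wraps the common first layers and counts how often they run."""

    def __init__(self, first_layers):
        self.first_layers = first_layers
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        return self.first_layers(image)


def analyze(image, shared, heads):
    """Compute the shared hidden units once and feed them to every head."""
    hidden = shared(image)
    return {name: head(hidden) for name, head in heads.items()}


# Stand-ins for the frozen first layers and for two task-specific heads.
shared = SharedFirstLayers(lambda image: [2 * value for value in image])
heads = {
    "gender": lambda hidden: "Male" if sum(hidden) >= 10 else "Female",
    "age": lambda hidden: max(hidden),
}
results = analyze([1, 2, 3], shared, heads)
print(results, shared.calls)
```

However many heads are added, the counter stays at one call per image, which is the saving the idea aims for.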
That is it. Thank you for reading!
TODO:
 Search for projects for face and gender estimation.
 Dig inside each project and describe all projects found.
 Choose one or two projects and insert their code into my demo project.
 Implement video reading with OpenCV, face detection and age and gender estimation.
 Add argument to detect faces either with HAAR or with CNN.
 Add argument to choose model for age and gender estimation.
 Debug and refactor code.
 Write this short article.