# Human Pose Estimation with Stacked Hourglass Network and TensorFlow

For the full source code, please go to https://github.com/ethanyanjiali/deep-vision/tree/master/Hourglass/tensorflow. I'd really appreciate your ⭐STAR⭐ to support my efforts.

Humans are good at making different poses, and good at understanding them too. This makes body language an essential part of our daily communication, work, and entertainment. Unfortunately, poses have so much variance that recognizing a pose from a picture is not an easy task for a computer…until we have deep learning!

With a deep neural network, a computer can learn a generalized pattern of human poses and predict joint locations accordingly. The Stacked Hourglass Network is exactly such a network, and I'm going to show you how to use it to do simple human pose estimation. Although first introduced in 2016, it's still one of the most important networks in the pose estimation field and is widely used in many applications. Whether you want to build software to track a basketball player's actions, or make a body language classifier based on a person's pose, this should be a handy hands-on tutorial for you.

# Network Architecture

## Overview

To put it simply, the Stacked Hourglass Network (HG) is a stack of hourglass modules. It got this name because the shape of each hourglass module closely resembles an hourglass, as we can see from the picture below:

The idea behind stacking multiple HG (Hourglass) modules, instead of forming one giant encoder-decoder network, is that each HG module produces a full heat-map for joint prediction. Thus, each later HG module can learn from the joint predictions of the previous one.

Why would a heat-map help human pose estimation? This is a pretty common technique nowadays. Unlike facial keypoints, human pose data has lots of variance, which makes it hard for the network to converge if we simply regress the joint coordinates. Researchers came up with the idea of using a heat-map to represent a joint location in the image. This preserves the spatial information; we then just need to find the peak of the heat-map and use that as the joint location (plus some minor adjustment, since the heat-map is coarse). For a 256×256 input image, our heat-map will be 64×64.
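To make the down-scaling concrete, here is a tiny sketch (the helper name is my own, not from the repo) of how a joint coordinate in the 256×256 input maps onto the 64×64 heat-map grid:

```python
# Hypothetical helper: map a joint (x, y) from input-image space to
# heat-map space. With a 256x256 input and a 64x64 heat-map, every
# heat-map cell covers a 4x4 pixel area, which is why peak-finding
# alone is coarse and needs a small adjustment afterwards.
def to_heatmap_coords(x, y, img_size=256, heatmap_size=64):
    stride = img_size // heatmap_size  # 256 // 64 = 4
    return x // stride, y // stride

print(to_heatmap_coords(100, 202))  # (25, 50)
```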

In addition, we also calculate a loss for each intermediate prediction, which lets us supervise not only the final output but every HG module effectively. This was a brilliant design at the time, because pose estimation relies on the relationships among the different areas of the human body. For example, without seeing the location of the body, it's hard to tell whether an arm is the left or the right arm. By using a full prediction as the next module's input, we force the network to pay attention to other joints while predicting a new joint location.

## Hourglass Module

So what does the HG (Hourglass) module itself look like? Let's take a look at another diagram from the original paper:

In the diagram, each box is a residual block plus some additional operations like pooling. If you are not familiar with residual blocks and the bottleneck structure, I'd recommend reading a ResNet article first. In general, an HG module is an encoder-decoder architecture: we downsample the features first, then upsample them to recover the information and form a heat-map. Each encoder layer has a connection to its decoder counterpart, and we can stack as many layers as we want. In the implementation, we usually use recursion and let the HG module repeat itself.

I understand that it may still seem too "convoluted" here, so it might be easier just to read the code. Here's a snippet from my Stacked Hourglass implementation in the GitHub deep-vision repo:

```python
from tensorflow.keras.layers import MaxPool2D, UpSampling2D

def HourglassModule(inputs, order, filters, num_residual):
    """
    One Hourglass Module. Usually we stack multiple of them together.
    https://github.com/princeton-vl/pose-hg-train/blob/master/src/models/hg.lua#L3

    inputs: input feature tensor
    order: the remaining order for HG modules to call itself recursively
    num_residual: number of residual layers for this HG module
    """
    # Upper branch: keeps the resolution, only applies bottleneck blocks
    up1 = BottleneckBlock(inputs, filters, downsample=False)

    for i in range(num_residual):
        up1 = BottleneckBlock(up1, filters, downsample=False)

    # Lower branch: downsample first, then apply bottleneck blocks
    low1 = MaxPool2D(pool_size=2, strides=2)(inputs)
    for i in range(num_residual):
        low1 = BottleneckBlock(low1, filters, downsample=False)

    low2 = low1
    if order > 1:
        # recurse into a smaller hourglass until order reaches 1
        low2 = HourglassModule(low1, order - 1, filters, num_residual)
    else:
        for i in range(num_residual):
            low2 = BottleneckBlock(low2, filters, downsample=False)

    low3 = low2
    for i in range(num_residual):
        low3 = BottleneckBlock(low3, filters, downsample=False)

    # upsample back and merge with the upper branch
    up2 = UpSampling2D(size=2)(low3)

    return up2 + up1
```


This module looks like an onion, so let's start from the outermost layer. up1 goes through a couple of bottleneck blocks and is added to up2 at the end. This represents the two big boxes on the left and top of the diagram, as well as the right-most plus sign. The whole flow stays up in the air, so we call it the up channel. In the lower branch, there's also a low channel: low1 goes through pooling and some bottleneck blocks, then feeds into another, smaller Hourglass module! On the diagram, that's the second layer of the big onion, which is also why we use recursion here. We keep nesting HG modules until the innermost order, where a chain of bottleneck blocks replaces the HG module. This final layer is the three tiny boxes in the middle of the diagram.

If you are familiar with image classification networks, it's clear that the authors borrow the idea of skip connections very heavily. This repeating pattern connects the corresponding layers in the encoder and decoder, instead of having just one flow of features. This not only helps gradients pass through but also lets the network consider features from different scales when decoding.

## Intermediate Supervision

Now we have an hourglass module, and we know that the whole network consists of multiple modules like this. But how exactly do we stack them together? Here comes the final piece of the network: intermediate supervision.

As you can see from the diagram above, when an HG module produces an output, we split it into two paths. The top path includes some more convolutions to further process the features, then goes to the next HG module. The interesting thing happens on the bottom path: here we use the output of that convolution layer as an intermediate heat-map result (the blue box), and calculate a loss between this intermediate heat-map and the ground-truth heat-map. In other words, if we have 4 HG modules, we need to calculate four losses in total: 3 for the intermediate results and 1 for the final result.
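Conceptually, the intermediate losses are simply summed over all module outputs. Here is a minimal sketch with NumPy stand-ins (the function and variable names are my own, not the repo's API):

```python
import numpy as np

# A sketch of intermediate supervision: the stacked network returns one
# heat-map tensor per HG module, and we sum the same MSE loss over every
# module's output, not just the last one.
def intermediate_loss(module_outputs, labels):
    loss = 0.0
    for output in module_outputs:
        loss += np.mean(np.square(labels - output))
    return loss

labels = np.zeros((64, 64, 16))                           # ground-truth heat-maps
outputs = [np.full((64, 64, 16), 0.5) for _ in range(4)]  # 4 HG module outputs
print(intermediate_loss(outputs, labels))  # 1.0 (0.25 per module, times 4)
```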

# Prepare the Data

## MPII Dataset

Once we've finished the code for the Stacked Hourglass network, it's time to think about what data we'd like to train it on. If you have your own dataset, great. But here I'd like to mention an open dataset for beginners who want something to train on first: the MPII Dataset (from the Max Planck Institute for Informatics). You can find the download link here.

Although this dataset is mostly used for single-person pose estimation, it does provide joint annotations for multiple people in the same image. For each person, it gives the coordinates of 16 joints, such as the left ankle or the right shoulder.

However, the original dataset annotations are in Matlab format, which is really hard to use nowadays. An alternative is the preprocessed JSON-format annotations provided by Microsoft here. The Google Drive link is here. After you download this JSON annotation file, you will see a list with elements like this:

```json
{
    "joints_vis": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "joints": [
        [804, 711], [816, 510], [908, 438], [1040, 454],
        [906, 528], [883, 707], [974, 446], [985, 253],
        [982.7591, 235.9694], [962.2409, 80.0306], [869, 214],
        [798, 340], [902, 253], [1067, 253], [1167, 353], [1142, 478]
    ],
    "image": "005808361.jpg",
    "scale": 4.718488,
    "center": [966, 340]
}
```


“joints_vis” indicates the visibility of each joint. In recent datasets, we usually need to differentiate occluded joints from visible ones, but in MPII we only care whether the joint is within the view of the image: 1 means in view, 0 means out of view. “joints” is a list of joint coordinates in this order: 0 - r ankle, 1 - r knee, 2 - r hip, 3 - l hip, 4 - l knee, 5 - l ankle, 6 - pelvis, 7 - thorax, 8 - upper neck, 9 - head top, 10 - r wrist, 11 - r elbow, 12 - r shoulder, 13 - l shoulder, 14 - l elbow, 15 - l wrist.
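For convenience, the joint order above can be written as a plain Python list (the string labels are my own, not field names from the dataset files):

```python
# MPII joint order: index 0 is the right ankle, index 15 the left wrist.
MPII_JOINTS = [
    "r_ankle", "r_knee", "r_hip", "l_hip", "l_knee", "l_ankle",
    "pelvis", "thorax", "upper_neck", "head_top",
    "r_wrist", "r_elbow", "r_shoulder", "l_shoulder", "l_elbow", "l_wrist",
]
print(MPII_JOINTS[9])  # head_top
```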

## Cropping

The less clear parts are “scale” and “center”. Sometimes there is more than one person in the image, so we need to crop out the one we are interested in. Unlike the MSCOCO dataset, MPII doesn't give us a bounding box for each person. Instead, it gives a center coordinate and a rough scale. Neither value is accurate, but they still indicate the general location of a person in the image. Note that you'll need to multiply “scale” by 200px to get the approximate height of the person. But what about the width? Unfortunately, the dataset doesn't specify it. And the body may be aligned somewhat horizontally, which makes the width much larger than the height. One example I've seen is a curling player crawling on the ground: if you only use the height to crop, you could end up leaving his arms out. After some experiments, here's my proposal for cropping the image:

```python
# avoid invisible keypoints whose values are <= 0
# find the left-most, top-most, bottom-most, and right-most keypoints

xmin = keypoint_xmin - tf.cast(body_height * margin, dtype=tf.int32)
xmax = keypoint_xmax + tf.cast(body_height * margin, dtype=tf.int32)
ymin = keypoint_ymin - tf.cast(body_height * margin, dtype=tf.int32)
ymax = keypoint_ymax + tf.cast(body_height * margin, dtype=tf.int32)

# make sure the crop is valid and stays inside the image
effective_xmin = xmin if xmin > 0 else 0
effective_ymin = ymin if ymin > 0 else 0
effective_xmax = xmax if xmax < img_width else img_width
effective_ymax = ymax if ymax < img_height else img_height
effective_height = effective_ymax - effective_ymin
effective_width = effective_xmax - effective_xmin
```


In short, we filter out invisible joints first, then find the left-most, top-most, bottom-most, and right-most of the 16 joints. These four coordinates give us a region that at least includes all available joint annotations. Then we pad this region by a proportion of the person's height, which is calculated from the “scale” field. Lastly, we make sure the crop does not go beyond the image borders.
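Here is a self-contained sketch of that cropping logic in NumPy (the function name and margin value are my own choices; the repo's TensorFlow version handles more cases):

```python
import numpy as np

# Joints with non-positive coordinates are treated as invisible and
# excluded before taking the extremes; the crop is then padded by a
# fraction of the body height (scale * 200px) and clamped to the image.
def crop_region(joints, scale, img_width, img_height, margin=0.2):
    joints = np.asarray(joints, dtype=np.float32)
    visible = joints[(joints[:, 0] > 0) & (joints[:, 1] > 0)]
    body_height = scale * 200.0          # MPII scale is in units of 200px
    pad = int(body_height * margin)
    xmin = int(visible[:, 0].min()) - pad
    xmax = int(visible[:, 0].max()) + pad
    ymin = int(visible[:, 1].min()) - pad
    ymax = int(visible[:, 1].max()) + pad
    # clamp the crop to the image borders
    return (max(xmin, 0), max(ymin, 0),
            min(xmax, img_width), min(ymax, img_height))

print(crop_region([[804, 711], [1167, 353], [962, 80]],
                  4.718488, 1920, 1080))  # (616, 0, 1355, 899)
```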

## Gaussian

Another important thing to know about the ground-truth data is the Gaussian. When we create the ground-truth heat-map, we don't just assign 1 to the joint coordinate and 0 to all other pixels. That would make the ground-truth too sparse to learn from. If the model's prediction is only a few pixels off, we should still somewhat encourage that behavior.

How do we model this encouragement in our loss function? If you took a probability class before, you might remember the Gaussian distribution:

The center has the highest value, with gradually decreasing values around it. This is exactly what we need. We draw such a Gaussian patch onto our all-zero ground-truth canvas, as in the first figure below. When you combine all 16 joints into one heat-map, it looks like the second figure below.

As you can see from the code, we first calculate the size of the patch; when sigma is 1, the size is 7 and the center is (3, 3). Then we generate a meshgrid to represent the coordinates of each cell in the patch, and finally substitute them into the Gaussian formula.

```python
scale = 1
size = 6 * sigma + 1
x, y = tf.meshgrid(tf.range(0, size, 1), tf.range(0, size, 1), indexing='xy')

# the center of the gaussian patch should be 1
center_x = size // 2
center_y = size // 2

# generate this 7x7 gaussian patch
gaussian_patch = tf.cast(
    tf.math.exp(
        -(tf.square(x - center_x) + tf.math.square(y - center_y)) /
        (tf.math.square(sigma) * 2)) * scale,
    dtype=tf.float32)
```


Note that the final code to generate a Gaussian is more complicated than this because it needs to handle some border cases. For full code, please take a look at my repo here: https://github.com/ethanyanjiali/deep-vision/blob/master/Hourglass/tensorflow/preprocess.py#L91
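As a quick sanity check of the numbers mentioned above (sigma = 1 giving a 7×7 patch that peaks at its center (3, 3)), here is the same formula in plain NumPy:

```python
import numpy as np

# Rebuild the Gaussian patch and verify: size = 6 * sigma + 1 = 7,
# and the peak value of exactly 1.0 sits at the center cell (3, 3).
sigma = 1
size = 6 * sigma + 1
x, y = np.meshgrid(np.arange(size), np.arange(size), indexing='xy')
center = size // 2
patch = np.exp(-((x - center) ** 2 + (y - center) ** 2) / (2 * sigma ** 2))

print(patch.shape)            # (7, 7)
print(patch[center, center])  # 1.0
```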

# Loss Function

So far, we have discussed the network architecture and the data to use. With those, we can make a forward pass on some training data to get a feeling for the output. But modern deep learning is about back-propagation and gradient descent, which require us to calculate the loss between ground-truth and prediction. So let's get to it.

Fortunately, the loss function for the Stacked Hourglass is pretty simple: just the mean squared error between the two heat-maps, which can be done in one line of code (the vanilla version below). In practice, however, I found it still hard for the model to converge; it learned to cheat by predicting all zeros to reach a local optimum. My solution (the improved version) is to assign a bigger weight to the foreground pixels (the Gaussians we drew), making it hard for the network to simply ignore these non-zero values. I chose 82 because there are roughly 82 times as many background pixels as foreground pixels for a 7×7 patch in a 64×64 heat-map.

```python
# vanilla version
loss += tf.math.reduce_mean(tf.math.square(labels - output))

# improved version: weight foreground (Gaussian) pixels 82x
weights = tf.cast(labels > 0, dtype=tf.float32) * 81 + 1
loss += tf.math.reduce_mean(tf.math.square(labels - output) * weights)
```
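The 82 figure can be verified with quick arithmetic: a 64×64 heat-map has 4096 pixels, and only the 7×7 Gaussian patch (49 pixels) is foreground:

```python
# Background-to-foreground pixel ratio for one joint's heat-map.
foreground = 7 * 7                 # the Gaussian patch
background = 64 * 64 - foreground  # everything else
print(round(background / foreground, 1))  # 82.6
```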


# Predictions

So far, we've discussed the network, the data, and the optimization goal (the loss). This should be sufficient for you to start your own training. But even once we've finished training and have a model, we're not quite done. One shortcoming of using a heat-map, compared with regressing coordinates directly, is granularity. For example, with a 256×256 input, we get a 64×64 heat-map to represent the key-point locations. The four-times down-scaling doesn't seem too bad. However, we usually first resize a bigger image, such as 720×480, into this 256×256 input. In that scenario, a 64×64 heat-map is quite coarse. To alleviate this problem, researchers came up with an interesting idea: instead of just using the pixel with the maximum value, we also consider its neighboring pixel with the largest value. Since that neighbor also has a high value, the actual key-point location is likely shifted a bit toward it. Sounds familiar, right? It's a bit like gradient descent, which also points toward the optimum.
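This quarter-offset decoding can be sketched as follows (NumPy, with my own function name; the actual post-processing code may differ in details):

```python
import numpy as np

# Take the argmax of a heat-map, then shift the estimate a quarter pixel
# toward whichever horizontal/vertical neighbor has the higher value.
def decode_heatmap(heatmap):
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    px, py = float(x), float(y)
    h, w = heatmap.shape
    if 0 < x < w - 1:
        px += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return px, py

hm = np.zeros((64, 64))
hm[30, 40] = 1.0
hm[30, 41] = 0.6   # right neighbor is stronger, so x shifts right
print(decode_heatmap(hm))  # (40.25, 30.0)
```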

Above is a prediction example from our network. The top image shows all the joint locations; the bottom one shows the skeleton, drawn by linking those joints together. Although the result looks pretty decent, I have to admit this is an easy example. In reality, there are lots of twisted poses and occluded joints, which bring significant challenges to our network. For example, when only one foot is visible in the image, the network can confuse itself by assigning both the left and the right foot to the same location. How do we address this? Since this is more a topic about improvements, I'll leave it for you to think about first, and I'll write another article to discuss it in the future.

# Conclusion

Congratulations, you've reached the end of this tutorial. If you followed everything we discussed above, you should now have a solid understanding of the theory and the major challenges. To start coding it in TensorFlow, I suggest you clone/fork my repo: https://github.com/ethanyanjiali/deep-vision/tree/master/Hourglass/tensorflow, follow the instructions to prepare the dataset, and give it a run. If you run into any problems, please open a GitHub issue so that I can take a look. And again, if you like my article or my repo, please ⭐star⭐ the repo; that will be the biggest support for me.