Smart Virtual Fitness Trainer

Building a fitness training application using Python and Machine Learning

This article is a result of the capstone project for a Diploma in Artificial Intelligence and Machine Learning. Please find the complete code and project report at this GitHub location.

Business Case

The Covid-19 pandemic hit many business sectors hard. With gyms closed under Covid-19 norms and protocols, the fitness industry suffered a massive setback. As the saying goes, “necessity is the mother of invention”, and the concept of “going virtual” reached the fitness industry just as it reached teaching, social meetups, and other areas. However, it comes with its challenges. It requires some level of technical expertise from the user’s side, and both trainers and trainees have to get accustomed to connecting and training over the internet. From the trainer’s side, it is also hard to check and correct everyone over video, because manually tracking all the trainees at the same time is difficult. To us, computer vision and AI now look inevitable in the fitness industry. Smartphones are getting cheaper and more powerful, and breakthroughs in IoT are paving the way for cheap sensors and smart wearables, which adds weight to this hypothesis. So, we try to build a solution that can assist any gym instructor. It cannot replace the instructor, but it can make their work more efficient and organized. The idea is to digitize the routine part of the work, such as observing poses, keeping counts, and making slight corrections, and thus enable the instructor to engage with more people. The user can practice in front of the system and monitor themselves at their own pace.

We cover the following exercises in the application:
1. Jumping Jacks
2. Crunches
3. Lunges
4. Plank
5. Squat

We will try to build the following features in our application:
● Provide real-time, easy-to-understand audio-visual feedback for pose correction
● Count repetitions and keep time for the user
● Carry out basic analytics and provide insights for the user

System Design

Application Modules

We break the whole system into various modules, each having specific functionalities.

For User Input, we use a laptop webcam. It will take the video feed and show it live on the UI. Each frame captured is processed further by the next module. We developed an app using Streamlit to work as the UI. The user faces the application and carries on with their exercise. The whole body should fit properly in the video frame.

The Frame Processor module takes each of the extracted frames and processes them sequentially. The first task is to detect the human body in a given image frame and mark the various key points (body points) by pose estimation. We explored two state-of-the-art, openly available pose detection packages for this: MoveNet (17 key points) and MediaPipe (33 key points). After exploring both libraries, we decided to use MediaPipe, which has a good balance of speed and accuracy. This package is fairly easy to install and use. It also lets us set the tracking and detection confidence thresholds, and we can enrich our feature set with the visibility score it gives. After pose estimation, we use these key points to featurize the given image frame. Lastly, the frame processor recommends any correction to the user given the detected pose.

The Pose Classifier module uses the pre-trained classification models and returns the detected exercise type from the given image frame. The first level of classification tries to detect the correct exercise type. This detected exercise, along with the features, is then used to drill down to the specific terminal pose of that exercise type. To avoid too much jitter in the classification, we also smooth the predicted class labels with a sliding window of 3 frames.

Pose Classifier Module

The Speech Engine module is responsible for audio output to the user. Any specific message (like exercise feedback or an error) can be passed to this module, which converts the text to a sound signal and plays it for the user.

The Main Engine is the actual driving module of the application. It extracts frames in RGB format from the stream at an interval of 200ms (to avoid too much congestion) and sends the frames to downstream modules. It gets the predicted labels for the corresponding frames and the recommended corrections for the user (if any). It displays the reference GIF for the detected exercise type on the UI and maintains the set counts for each of the exercises done by the user.
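The 200ms sampling described above can be sketched in a few lines. This is only an illustration under our own assumptions: the function name `sample_frames` and the simulated stream of `(timestamp_ms, frame)` pairs are hypothetical and not taken from the project code, which reads frames from the webcam.

```python
def sample_frames(timestamped_frames, interval_ms=200):
    """Yield (timestamp_ms, frame) pairs spaced at least `interval_ms` apart."""
    last_kept_ms = None
    for timestamp_ms, frame in timestamped_frames:
        # Keep the first frame, then only frames far enough from the last kept one.
        if last_kept_ms is None or timestamp_ms - last_kept_ms >= interval_ms:
            last_kept_ms = timestamp_ms
            yield timestamp_ms, frame


# Simulated webcam stream: one frame every 33 ms (roughly 30 FPS).
stream = [(i * 33, f"frame-{i}") for i in range(30)]
kept = [frame for _, frame in sample_frames(stream)]
print(kept)  # ['frame-0', 'frame-7', 'frame-14', 'frame-21', 'frame-28']
```

All frames in between are dropped, which keeps the downstream modules from being flooded at the cost of losing some information.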

We also have some other utility modules like Worker, which is a template class for multiprocessing. This enables us to run all the modules in parallel.
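A template class of this kind might look roughly like the sketch below. It is not the project's actual Worker: the `handle` method, the `STOP` sentinel, and the `UpperCaser` subclass are names we made up for illustration. The only library pieces used are `multiprocessing.Process` and its `run` method.

```python
import multiprocessing as mp

STOP = None  # hypothetical sentinel: putting it on the input queue ends the loop


class Worker(mp.Process):
    """Template: read items from in_queue, process them, write results to out_queue."""

    def __init__(self, in_queue, out_queue):
        super().__init__(daemon=True)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def handle(self, item):
        """Subclasses implement the module-specific work here."""
        raise NotImplementedError

    def run(self):
        # Executed in the child process once start() is called.
        while True:
            item = self.in_queue.get()
            if item is STOP:
                break
            self.out_queue.put(self.handle(item))


class UpperCaser(Worker):
    """Toy module standing in for a real one such as the speech engine."""

    def handle(self, item):
        return item.upper()
```

In an application, each module would be created with a pair of `multiprocessing.Queue` objects and launched with `start()` under an `if __name__ == "__main__":` guard, so that every module runs in its own process.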

Data Collection

Collecting good-quality training data is an essential prerequisite for any ML problem. We needed data to train our exercise type classifier and exercise pose classifier. We divided each exercise we cover into 2 terminal poses: start and end. The idea is that any exercise begins with the start pose and transitions to the end pose. We looked for open-source images for the exercise types we are training for, but no properly labeled training data was available. So we searched YouTube and internet articles for images of experts doing these exercises and collected screenshots. Some of the samples are shown below:

Crunches_Start — Crunches_End
Jumping-Jacks_Start — Jumping-Jacks_End
Lunges_Start — Lunges_End
Planks_Start — Planks_End
Squats_Start — Squats_End


MediaPipe Output

First, we discard those of the 33 body points in the MediaPipe output that are too small compared to the body size and might add more noise than information. We consider only these specific body points:

Body points taken from MediaPipe

We also calculate some more body points using the points we already got above. They are:

  • neck = (left shoulder + right shoulder) x 0.5
  • torso length = (distance neck-left hip + distance neck-right hip) x 0.5
  • mid hips = (left hip + right hip) x 0.5
  • core = (neck + mid hips) x 0.5
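The four derived quantities above translate directly into code. The sketch below assumes the body points arrive as a dictionary of `(x, y)` tuples with keys such as `"left_shoulder"`; that layout and the function names are our own illustration, not the project's data structure.

```python
import math


def midpoint(a, b):
    """Point halfway between two (x, y) points."""
    return ((a[0] + b[0]) * 0.5, (a[1] + b[1]) * 0.5)


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def derived_points(p):
    """Compute neck, mid hips, core, and torso length from shoulder and hip points."""
    neck = midpoint(p["left_shoulder"], p["right_shoulder"])
    mid_hips = midpoint(p["left_hip"], p["right_hip"])
    torso_length = (distance(neck, p["left_hip"]) + distance(neck, p["right_hip"])) * 0.5
    core = midpoint(neck, mid_hips)
    return {"neck": neck, "mid_hips": mid_hips, "core": core, "torso_length": torso_length}


points = {
    "left_shoulder": (0.3, 0.2), "right_shoulder": (0.7, 0.2),
    "left_hip": (0.2, 0.6), "right_hip": (0.8, 0.6),
}
print(derived_points(points))
```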

We experimented with various featurization techniques using all the body points and the visibility scores given by the MediaPipe library, such as calculating distances between body points (joints), calculating the angles between the important joint connections, and using the visibility of different body points. We found that the best results come from using a combination of angles and distances. So these are the features we finalized:

Distance features between body points (Euclidean distance in a 2-D plane)

1-joint distance
● Core — Nose
● Core — Left Elbow
● Core — Right Elbow
● Core — Left Wrist
● Core — Right Wrist
● Core — Left Knee
● Core — Right Knee
● Core — Left Ankle
● Core — Right Ankle

2-joint distance
● Left Shoulder — Left Wrist
● Right Shoulder — Right Wrist
● Left Hip — Left Elbow
● Right Hip — Right Elbow
● Left Shoulder — Left Knee
● Right Shoulder — Right Knee
● Left Hip — Left Ankle
● Right Hip — Right Ankle
● Left Knee — Left Foot Index
● Right Knee — Right Foot Index

Cross-joint distance
● Left Wrist — Right Wrist
● Left Elbow — Right Elbow
● Left Shoulder — Right Shoulder
● Left Hip — Right Hip
● Left Knee — Right Knee

We normalize the distances by torso length. Also, there were some other features that we removed due to high collinearity as found in the experiments.
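Dividing by torso length makes the distance features independent of how large the person appears in the frame. A minimal sketch, assuming the points are stored in a dictionary of `(x, y)` tuples and using only a small, hypothetical subset of the pairs listed above:

```python
import math

# A few of the pairs from the lists above; the key names are our own convention.
PAIRS = [("core", "nose"), ("left_shoulder", "left_wrist"), ("left_wrist", "right_wrist")]


def distance_features(points, torso_length, pairs=PAIRS):
    """Euclidean distance for each pair of body points, divided by torso length."""
    features = []
    for a, b in pairs:
        (ax, ay), (bx, by) = points[a], points[b]
        features.append(math.hypot(ax - bx, ay - by) / torso_length)
    return features


pose = {"core": (0, 0), "nose": (0, 4), "left_shoulder": (3, 4),
        "left_wrist": (3, 0), "right_wrist": (-3, 0)}
# The same pose seen twice as large yields the same normalized features.
pose_2x = {name: (2 * x, 2 * y) for name, (x, y) in pose.items()}
print(distance_features(pose, torso_length=2.5))
print(distance_features(pose_2x, torso_length=5.0))
```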

Angle features between 3 body points (cosine angles in a 2-D plane)

We calculate the cosine angle formed by 2 body points with a 3rd point. The angles are calculated in the range of 0–360°.
● Angle left elbow — neck — right elbow
● Angle left knee — mid hips — right knee
● Angle nose — neck — mid hips
● Angle nose — core — ground
Lastly, we also normalize the angles by 360.
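The article does not spell out the exact angle formula, so the following is only one possible way to get an angle in the 0 to 360 range: an inverse cosine alone covers 0 to 180, so this sketch uses `math.atan2` instead, which is our assumption. The function names are hypothetical.

```python
import math


def angle_deg(a, vertex, c):
    """Angle in degrees, in [0, 360), swept from ray vertex->a to ray vertex->c.

    Measured counter-clockwise in standard x/y axes; in image coordinates,
    where y grows downwards, the same formula sweeps clockwise on screen.
    """
    start = math.atan2(a[1] - vertex[1], a[0] - vertex[0])
    end = math.atan2(c[1] - vertex[1], c[0] - vertex[0])
    return math.degrees(end - start) % 360.0


def angle_feature(a, vertex, c):
    """Angle normalized by 360, as described above."""
    return angle_deg(a, vertex, c) / 360.0


# Example: left elbow, neck, right elbow laid out on a straight line.
print(angle_deg((-1, 0), (0, 0), (1, 0)))
print(angle_feature((1, 0), (0, 0), (0, 1)))
```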

Visibility features of left and right body profiles

We also use the visibility score given by the MediaPipe framework for each body point to find the visibility score for the left and right body profiles and use them as features.

Left Profile
● (left shoulder.visibility + left hip.visibility) x 0.5
● (left hip.visibility + left knee.visibility) x 0.5

Right Profile
● (right shoulder.visibility + right hip.visibility) x 0.5
● (right hip.visibility + right knee.visibility) x 0.5

The whole feature set is produced by concatenating all these constituent features horizontally.
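As a rough sketch of the last two steps, the profile visibility scores and the horizontal concatenation could look like this. The dictionary keys and function names are assumptions made for the example; in the application the visibility values come from MediaPipe.

```python
def profile_visibility(visibility, side):
    """Two visibility features for one body profile ('left' or 'right')."""
    upper = (visibility[f"{side}_shoulder"] + visibility[f"{side}_hip"]) * 0.5
    lower = (visibility[f"{side}_hip"] + visibility[f"{side}_knee"]) * 0.5
    return [upper, lower]


def build_feature_vector(distances, angles, visibility):
    """Concatenate distance, angle, and visibility features into one flat list."""
    return (list(distances) + list(angles)
            + profile_visibility(visibility, "left")
            + profile_visibility(visibility, "right"))


visibility = {"left_shoulder": 1.0, "left_hip": 0.5, "left_knee": 0.0,
              "right_shoulder": 0.5, "right_hip": 0.5, "right_knee": 1.0}
vector = build_feature_vector([1.2, 0.8], [0.25], visibility)
print(vector)  # [1.2, 0.8, 0.25, 0.75, 0.25, 0.5, 0.75]
```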

Training Data

Initially, we started with 10 pictures of each exercise type (5 each for the start and end pose), taken from similar angles with slight differences. So, we had 50 pictures for 5 types of exercises. Since the number of data points is extremely small, almost all the classifiers suffered from overfitting. Also, the training data only had snapshots from fitness videos by YouTube experts, which means all of them showed nearly perfect poses. In our case the end users are non-experts, so the classifier has to be robust enough to recognize exercise poses even when they are done imperfectly.

For this, we use synthetic data generation by adding noise. We create both positive and negative samples by adding a small or a large amount of noise to some selected features only. To create a positive sample, we select about half of the features and add a little random noise to them. If too many features are changed, it might alter the entire pose. The step-wise algorithm to multiply the training data is as below:

multiply data by noise(data, target_n, to_append):
1. size_mu := 50% of the total number of features in the given data
2. create target_n - 1 sample sizes by drawing samples from a normal distribution having:
   mean := size_mu, standard deviation := 1.5
3. for every size in the sample sizes from step 2:
   a. size = max(1, size)
   b. generate random noise for all the features from a normal distribution having:
      mean := 0, standard deviation := 0.025
   c. randomly choose ‘size’ features and add the noise to them
   d. if any of the angle features becomes negative, we simply complement it; if any of the distance features becomes negative, we cap it to alpha = 0.00001
4. if to_append is not null, we append its value to the end of the data list and return it
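The steps above could be written in Python roughly as follows. Several details are our own reading rather than the project's code: `angle_idx` (which feature positions hold angles) is a hypothetical parameter, "complement" is taken to mean wrapping a negative normalized angle by adding 1, `to_append` is treated as a label added to every row, and the sample size is additionally capped at the number of features.

```python
import random

ALPHA = 0.00001  # floor for distance features that would become negative


def multiply_data_by_noise(row, target_n, angle_idx, to_append=None, rng=None):
    """Return the original row plus target_n - 1 slightly noisy copies of it."""
    rng = rng or random.Random()
    n_features = len(row)
    size_mu = 0.5 * n_features  # step 1: perturb about half of the features
    samples = [list(row)]
    for _ in range(target_n - 1):  # step 2: one sample size per new row
        size = max(1, round(rng.gauss(size_mu, 1.5)))  # step 3a
        size = min(size, n_features)  # extra cap, not part of the listed steps
        new_row = list(row)
        # Steps 3b/3c: noise is drawn only for the chosen features, which has
        # the same effect as drawing it for all and applying it to a subset.
        for i in rng.sample(range(n_features), size):
            new_row[i] += rng.gauss(0.0, 0.025)
            if new_row[i] < 0:  # step 3d
                new_row[i] = new_row[i] + 1.0 if i in angle_idx else ALPHA
        samples.append(new_row)
    if to_append is not None:  # step 4, read here as "attach the label"
        samples = [s + [to_append] for s in samples]
    return samples


rows = multiply_data_by_noise([0.5] * 10, target_n=20, angle_idx={8, 9},
                              to_append="squat_start", rng=random.Random(0))
print(len(rows), rows[1])
```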

If we only have training data containing exercise classes, then in the real environment, the classifier will always categorize all frames into some exercise class. For example: even if the user is simply standing, the classifier might predict “squat_start” with some probability. So we must have data samples that are NOT one of the defined classes. For generating negative samples for a particular feature pose, we choose a small number of features but increase the amount of noise for those features. This is based on our theoretical understanding that even if a particular feature in the pose differs greatly, that should be considered a different pose from the original.

negative sampling by noise(data, target_n, to_append):
1. size_mu := 5% of the total number of features in the given data
2. create target_n sample sizes by drawing samples from a normal distribution having:
   mean := size_mu, standard deviation := 1.5
3. for every size in the sample sizes from step 2:
   a. size = max(1, size)
   b. generate random noise for all the features from a normal distribution having:
      mean := 0.5, standard deviation := 0.05
   c. randomly choose ‘size’ features and add the noise to them
   d. if any of the angle features becomes negative, we simply complement it; if any of the distance features becomes negative, we cap it to alpha = 0.00001
4. if to_append is not null, we append its value to the end of the data list and return it
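A Python sketch of the negative sampling steps is given below, with the same caveats as any reconstruction from pseudocode: `angle_idx`, the `"no_pose"` label, the wrap-by-adding-1 reading of "complement", and the cap on the sample size are our assumptions, not details confirmed by the project code.

```python
import random

ALPHA = 0.00001  # floor for distance features that would become negative


def negative_sampling_by_noise(row, target_n, angle_idx, to_append=None, rng=None):
    """Return target_n rows in which a few features are shifted by large noise."""
    rng = rng or random.Random()
    n_features = len(row)
    size_mu = 0.05 * n_features  # step 1: touch only a small share of features
    samples = []
    for _ in range(target_n):  # step 2: one sample size per generated row
        size = max(1, round(rng.gauss(size_mu, 1.5)))  # step 3a
        size = min(size, n_features)  # extra cap, not part of the listed steps
        new_row = list(row)
        for i in rng.sample(range(n_features), size):  # steps 3b and 3c
            new_row[i] += rng.gauss(0.5, 0.05)
            if new_row[i] < 0:  # step 3d
                new_row[i] = new_row[i] + 1.0 if i in angle_idx else ALPHA
        if to_append is not None:  # step 4, read here as "attach the label"
            new_row.append(to_append)
        samples.append(new_row)
    return samples


negatives = negative_sampling_by_noise([0.3] * 20, target_n=50, angle_idx={18, 19},
                                       to_append="no_pose", rng=random.Random(42))
print(len(negatives), negatives[0])
```

Compared with the positive case, fewer features change but each change is much larger, so every generated row differs clearly from the original pose in at least one feature.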

After applying both of the above algorithms, the 50 data points grew to 3500 data points.

Feature-Value Plot for 1 sample of Jumping Jack
Feature-Value Plot for 1 sample of Crunches

Model Training & Tuning

The application requires a classifier with extremely low latency. Also, we don’t have a large amount of well-labeled training data, so deep learning methods (which can do automated feature extraction) do not seem feasible. So we continue our experiments on the collated data using classical ML techniques. We explore various classification algorithms: Logistic Regression, k-Nearest Neighbour, Support Vector Machine, SGD Classifier, Multinomial Naive Bayes, and XGBoost.

Classifier Level 1
Our first classifier is the one that predicts the type of exercise, given a set of features from the image. We run all the classification algorithms and get the below results:

Classifiers Level 1 Scores

Going by the F1 score and the Test Score, we can see that XGBoost is the best classifier for exercise type. Although it takes the longest prediction time of all of them, the absolute time taken is around 100ms, which is acceptable given the good accuracy it provides. It is intuitive that a decision-tree-based algorithm is likely to perform well on this kind of hand-curated feature set.

Classification Report for Level 1 Classifier

Classifier Level 2

Once the first classifier predicts the class of exercise, the prediction itself is added as a feature and sent to the next level of classification, for specifying the pose of that exercise type. We again repeated our experiments on all kinds of classifiers and got the below results.
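The chaining of the two levels can be illustrated with a small sketch. It assumes models that expose a scikit-learn style `predict` method and encodes the level 1 prediction as an integer index before appending it; both the encoding and the stand-in classes below are our assumptions, used only so that the example runs without trained classifiers.

```python
EXERCISES = ["jumping_jacks", "crunches", "lunges", "plank", "squat"]


def predict_two_level(features, level1_model, level2_model):
    """Predict the exercise type, then the pose, feeding level 1's output to level 2."""
    exercise = level1_model.predict([features])[0]
    # The level 1 prediction becomes one extra feature for the level 2 model.
    extended = list(features) + [EXERCISES.index(exercise)]
    pose = level2_model.predict([extended])[0]
    return exercise, pose


class StubLevel1:
    """Stand-in for the trained exercise-type classifier."""

    def predict(self, rows):
        return ["squat" for _ in rows]


class StubLevel2:
    """Stand-in for the trained pose classifier; it reads the appended feature."""

    def predict(self, rows):
        return [f"{EXERCISES[row[-1]]}_start" for row in rows]


print(predict_two_level([0.4, 0.7, 0.1], StubLevel1(), StubLevel2()))
# ('squat', 'squat_start')
```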

Classifiers Level 2 Scores

It is interesting to note that for the 2nd level of classification, almost all the classifiers perform well, barring Multinomial Naive Bayes. The 1st level classification result added as a feature proves to be extremely helpful. Based on the F1 Score and Test Score, kNN and SVM are the best classifiers, but we choose kNN as it also has better Train and CV scores.

Classification Report for Level 2 Classifier

While using the classifiers in the application, we faced an issue. Since we were working with a laptop cam with low FPS, sometimes the extracted frame was not good and MediaPipe could not extract features from it. This also happens if the user moves too fast and the picture becomes blurred. This caused jitter, as some of the frames had no exercise type or pose class associated with them. To avoid this issue, we use a moving window approach. Our assumption is that a pose is not likely to differ drastically from the poses just before it within a matter of milliseconds, so we can assign the frame the pose that holds the majority within the window. We have chosen a window size of 3.
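A majority vote over the last 3 labels can be sketched with the standard library. The class name `LabelSmoother` is hypothetical, and two choices are our assumptions: frames without a detected pose are passed in as `None` and ignored in the vote, and on a tie the label that entered the window first wins.

```python
from collections import Counter, deque


class LabelSmoother:
    """Majority vote over the most recent `window_size` predicted labels."""

    def __init__(self, window_size=3):
        self.window = deque(maxlen=window_size)

    def update(self, label):
        """Add the newest label (None if no pose was detected) and return the vote."""
        self.window.append(label)
        counts = Counter(item for item in self.window if item is not None)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


smoother = LabelSmoother(window_size=3)
raw = ["squat_start", "squat_start", None, "squat_end", "squat_end"]
smoothed = [smoother.update(label) for label in raw]
print(smoothed)
# ['squat_start', 'squat_start', 'squat_start', 'squat_start', 'squat_end']
```

The blurred third frame inherits the label of its neighbours, and the switch to the end pose is reported one frame later than in the raw predictions.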

Pose Correction

Once we get the exercise type and the specific pose of that exercise, the next thing is to check if the user is doing anything wrong specific to that pose. To solve this problem, we use the body point coordinates extracted using the MediaPipe framework and our calculated features. We then apply rules specific to the detected pose, based on a study we conducted of the key things to take care of while the body is in that pose during that exercise.

For the start pose of jumping jacks, we check that the gap between the feet is comparable to the shoulder width and that the angle made by the arms is between 120° and 180°. For the end pose of this exercise, we ensure that the arms are well above the head and the feet are between 1.5 and 3 times shoulder-width apart.
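Rules like these reduce to a handful of comparisons. The sketch below is illustrative only: the function names, the feedback messages, the `tolerance` used to interpret "comparable to the shoulder width", and the use of wrist and nose heights for "arms above the head" are all our assumptions.

```python
def check_jumping_jack_start(feet_gap, shoulder_width, arm_angle_deg, tolerance=0.5):
    """Return correction messages for the start pose (empty list: nothing to fix)."""
    messages = []
    # "Comparable" is read here as a feet-to-shoulder ratio within 1 +/- tolerance.
    if abs(feet_gap / shoulder_width - 1.0) > tolerance:
        messages.append("Keep your feet about shoulder-width apart.")
    if not 120 <= arm_angle_deg <= 180:
        messages.append("Adjust the angle of your arms.")
    return messages


def check_jumping_jack_end(feet_gap, shoulder_width, wrist_ys, nose_y):
    """Return correction messages for the end pose (empty list: nothing to fix)."""
    messages = []
    # Image coordinates grow downwards, so "above the head" means a smaller y.
    if not all(y < nose_y for y in wrist_ys):
        messages.append("Raise your arms above your head.")
    if not 1.5 <= feet_gap / shoulder_width <= 3.0:
        messages.append("Place your feet 1.5 to 3 shoulder-widths apart.")
    return messages


print(check_jumping_jack_start(feet_gap=0.2, shoulder_width=0.2, arm_angle_deg=150))
print(check_jumping_jack_end(feet_gap=0.2, shoulder_width=0.2,
                             wrist_ys=[0.3, 0.1], nose_y=0.2))
```

Any message returned by such a check could then be handed to the speech engine as feedback for the user.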

When we are doing crunches, we ensure that the user starts with a relaxed position lying on the ground and ends with their head slightly raised from the neck and core.

While doing lunges, we check if the user has their core straight all the time and when they are ending the pose, their legs should be almost at right angles. Also, their knees should not cross the toe-line.

For planks, it is important for the person doing the exercise to keep a straight body throughout, right from head to heels.

For squats, it is important to ensure that the head is straight and the knees are not crossing the toe-line while lowering the body.

Once we figure out any major deviation from the set rules we inform the user to correct the posture.

Application Deployment

We codify all the algorithms and develop an application in a form that can be easily deployed on any system for the end user. After exploring various packages, we finally decided to use Streamlit. The best thing is that it helps to easily convert our core Python app into a working web app without too much work required for the front end.

We also develop our application modules to function as separate processes (multiprocessing) that work together by communicating with each other through Python Queues.

Screenshot 1
Screenshot 2

Conclusion and Future Work

We finally built an application that can work as a virtual fitness trainer for any user. We collected and collated data from scratch and applied our augmentation techniques using random Gaussian noise over the data points. We also applied our k-Nearest Boosted Pose model and achieved fairly high accuracy on the test data. We employed a rule-based approach to pose correction by leveraging the classification and feature extraction modules. Finally, we deployed a working application using the Streamlit package for the end user.

We were also able to identify some areas of improvement in our work. Firstly, MediaPipe latency is the main bottleneck for our application. Since our system is CPU-only, MediaPipe takes a little less than 200ms to process a single image frame. So we have to take frames at specific intervals and drop the others, but this leads to information loss. If we process the frames too fast, then after some time the MediaPipe queue gets overloaded and the real-time effect of the application is lost: the user carries out some exercise, which is counted by the engine only after some lag. Also, our classifiers are trained on highly tailor-made data using augmentation. It will help if we get better-curated actual data. Another approach we could take with increased data is to make the pose correction based on some heuristics or ML-based methods.

You can find the complete code for this project here on GitHub.


