
How to train and use a custom YOLOv7 model


Introduction

Object detection is one of the "Holy Grails" of deep learning. Combining image classification and object localization, it involves identifying where a discrete object sits in an image and correctly classifying it. Bounding boxes are then drawn onto a copy of the image so the user can see the model's predicted classifications directly.

YOLO has remained one of the premiere object detection networks since its creation for three primary reasons: its accuracy, its relatively low cost, and its ease of use. This combination has made YOLO one of the most famous deep learning models even outside the data science community. Having undergone multiple development iterations, YOLOv7 is the latest version of the popular algorithm and improves significantly on its predecessors.

In this blog tutorial, we will start by examining the theory behind YOLO's operation and architecture and comparing YOLOv7 to its previous versions. We will then jump into a coding demo detailing all the steps you need to develop a custom YOLO model for your object detection task. We will use NBA game footage as our demo dataset and attempt to create a model that can distinguish and label the ball handler separately from the rest of the players on the court.

Prerequisites

In order to follow along with this article, you will need experience with Python code and a basic understanding of Deep Learning. We will assume that all readers have access to sufficiently powerful machines so they can run the code provided.

If you do not have access to a GPU, we suggest using DigitalOcean GPU Droplets.

To get started with Python programming, we recommend following this beginner’s guide to set up your system and get ready to run your first tutorials.

What is YOLO?

Generalization results on the Picasso and People-Art datasets, from the original YOLO paper (Source)

The original YOLO model was introduced in the 2015 paper "You Only Look Once: Unified, Real-Time Object Detection." At the time, R-CNN models were the standard approach to object detection, and their time-consuming, multi-step training process made them cumbersome to use in practice. YOLO was created to do away with as much of that hassle as possible by offering single-stage object detection, which cut training and inference times and massively reduced the cost of running object detection.

Since then, various groups have iterated on YOLO to make improvements. Some examples of these newer versions include the powerful YOLOv5 and YOLOR. Each of these iterations attempted to improve upon past incarnations, and with its release YOLOv7 is now the highest-performing model of the family.

How does YOLO work?

YOLO performs object detection in a single stage by first dividing the image into N grid cells of equal size SxS. Each of these cells is used to detect and localize any objects it may contain. For each cell, bounding box coordinates, B, for the potential object(s) are predicted along with an object label and a probability score for the predicted object's presence.

As you may have guessed, this leads to a significant overlap of predicted objects from the grids’ cumulative predictions. To handle this redundancy and reduce the predicted objects to those of interest, YOLO uses Non-Maximal Suppression to suppress all the bounding boxes with comparatively lower probability scores.

Image divided into grids; before Non-Maximal Suppression; after Non-Maximal Suppression (final output)

To achieve this, YOLO first compares the probability scores associated with each prediction and selects the box with the largest score. It then removes any bounding boxes whose Intersection over Union (IoU) with the chosen high-probability bounding box exceeds a threshold. This step is repeated until only the desired final bounding boxes remain.
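To make the procedure concrete, here is a minimal sketch of that greedy loop in plain Python with NumPy. The box format (corner coordinates), the threshold value, and the function names are our own choices for illustration; they are not taken from the YOLOv7 codebase.

    import numpy as np

    def iou(box, boxes):
        # IoU of one [x1, y1, x2, y2] box against each row of `boxes`
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = (box[2] - box[0]) * (box[3] - box[1])
        areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area + areas - inter)

    def non_max_suppression(boxes, scores, iou_thresh=0.65):
        # repeatedly keep the highest-scoring box, then drop boxes that overlap it too much
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            best, rest = order[0], order[1:]
            keep.append(best)
            order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
        return keep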

What changes were made in YOLOv7

Comparison of object detection models' efficacy

Several new changes were made for YOLOv7. This section will attempt to break down these changes and show how these improvements led to the massive boost in performance compared to predecessor models.

Extended efficient layer aggregation networks

Model re-parameterization, the practice of merging multiple computational modules into one at inference time to speed up inference, is one of the techniques YOLOv7 uses, and we will return to it below. The architectural change covered in this section is the "Extended efficient layer aggregation network," or E-ELAN, which redesigns the computational blocks of the backbone to improve the network's ability to learn.

Source

E-ELAN implements expand, shuffle, and merge cardinality techniques to continuously improve the network's ability to learn without destroying the original gradient path. The method uses group convolution to expand the channel count and cardinality of the computational blocks, applying the same group parameter and channel multiplier to every computational block in the layer. The resulting feature map is then shuffled into g groups, as set by the group parameter, and recombined so that the number of channels in each group of feature maps matches the number of channels in the original architecture. Finally, the groups are added together to merge cardinality. Because only the architecture inside the computational block changes, the transition layer is left unaffected and the gradient path stays fixed. (Source)
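The real E-ELAN blocks are defined in YOLOv7's model configuration files rather than as a standalone module, but the expand-shuffle-merge idea can be sketched roughly in PyTorch. The channel counts, the group count g, and the module below are illustrative assumptions, not the actual YOLOv7 implementation:

    import torch
    import torch.nn as nn

    class ExpandShuffleMerge(nn.Module):
        # conceptual sketch only: expand channels with a group convolution,
        # shuffle the resulting feature-map groups, then merge them back together
        def __init__(self, channels, g=2):
            super().__init__()
            self.g = g
            # group convolution expands the channel count by a factor of g
            self.expand = nn.Conv2d(channels, channels * g, 3, padding=1, groups=g)

        def forward(self, x):
            b, c, h, w = x.shape
            out = self.expand(x)                                  # (b, c*g, h, w)
            # shuffle: interleave the g groups of c channels
            out = out.view(b, self.g, c, h, w).transpose(1, 2).reshape(b, -1, h, w)
            # merge cardinality: sum the g groups so the output has c channels again
            return out.view(b, c, self.g, h, w).sum(dim=2)

    x = torch.randn(1, 64, 32, 32)
    print(ExpandShuffleMerge(64)(x).shape)   # torch.Size([1, 64, 32, 32])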

Model scaling for concatenation-based models

Source

It is common for YOLO and other object detection models to release a family of models scaled up and down in size for different use cases. For scaling, object detection models need to know the depth of the network, the width of the network, and the resolution the network is trained on. In YOLOv7, the model scales the network depth and width simultaneously while concatenating layers, and ablation studies show that this keeps the model architecture optimal across different sizes. Normally, scaling up depth alone changes the ratio between the input and output channels of a transition layer, which may reduce the model's hardware utilization. The compound scaling technique used in YOLOv7 mitigates this and other negative effects on performance when scaling.

Trainable bag of freebies

Source

The YOLOv7 authors used gradient flow propagation paths to analyze how re-parameterized convolution should be combined with different networks. The above diagram shows where the convolutional blocks should be placed, with the check marks indicating the combinations that worked.

Coarse for the auxiliary heads and fine for the lead loss head

Source

Deep supervision is a technique that adds an extra auxiliary head in the middle layers of the network and uses the shallow network weights with an assistant loss as a guide. The technique is useful for squeezing out improvements even in cases where the model weights would otherwise simply converge. In the YOLOv7 architecture, the head responsible for the final output is called the lead head, and the head used to assist training is called the auxiliary head. YOLOv7 uses the lead head's predictions as guidance to generate coarse-to-fine hierarchical labels, which are used for auxiliary head and lead head learning, respectively.

Altogether, these improvements have led to the significant increases in capability and decrease in cost we saw in the above diagram when compared to its predecessors.

Setting up your custom datasets

Now that we understand why and how YOLOv7 has improved over past techniques, we can try it out! For this demo, we will download videos of NBA highlights and create a YOLO model that can accurately detect which players on the court are actively holding the ball. The challenge here is to get the model to accurately and reliably detect and discern the ball handler from the other players on the court. To do this, we can go to YouTube and download some NBA highlight reels. We can then use VLC’s snapshot filter to break down the videos into sequences of images.
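If you would rather script this step than click through VLC, a short OpenCV loop can sample frames from a downloaded clip. The file name, output folder, and sampling rate below are arbitrary choices for illustration:

    import os
    import cv2

    os.makedirs('frames', exist_ok=True)
    cap = cv2.VideoCapture('nba_highlights.mp4')   # any downloaded highlight clip
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # keep roughly one frame per second of ~30 fps footage
        if frame_idx % 30 == 0:
            cv2.imwrite(f'frames/frame_{saved:05d}.jpg', frame)
            saved += 1
        frame_idx += 1
    cap.release()
    print(f'saved {saved} frames')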

To continue, you must choose an appropriate labeling tool to label the newly made custom dataset. YOLO and related models require that the data used for training has each of the desired classifications accurately labeled, usually by hand. We chose RoboFlow for this task. The tool is free to use online, is quick, can perform augmentations and transformations on uploaded data to diversify the dataset, and can even freely triple the amount of training data based on the input augmentations. The paid version comes with even more useful features.

Create a RoboFlow account, start a new project, and then upload the relevant data to the project space. We will use two possible classifications for this task: ‘ball-handler’ and ‘player.’ To label the data with RoboFlow once it is uploaded, all you need to do is click the “Annotate” button on the left-hand menu, click on the dataset, and then drag your bounding boxes over the desired objects, in this case, basketball players with and without the ball.

This data is composed entirely of in-game footage; all commercial breaks and heavily 3D CGI-filled frames were excluded from the final dataset. Each player on the court was labeled 'player,' which accounts for most of the bounding box classifications in the dataset. Nearly every frame, but not all, also includes a 'ball-handler': the player currently in possession of the basketball. To avoid confusion, the ball handler is never double-labeled as a player in any frame. To account for the different angles used in game footage, we included all camera angles and maintained the same labeling strategy for each. We originally tried separate 'ball-handler-floor' and 'player-floor' tags for footage shot from floor level, but this only confused the model. Generally speaking, it is suggested that you have 2000 images for each classification type. However, labeling that many images with many objects by hand is extremely time-consuming, so we use a smaller sample for this demo. It still works reasonably well, but if you wish to improve this model's capability, the most important step would be to expose it to more training data and a more robust validation set.

We used 1668 (556 x 3) training photos for our training set, 81 images for the test set, and 273 for the validation set. In addition to the test set, we will create our own qualitative test of the model's viability by running it on a new highlight reel. You can generate your dataset using the Generate button in RoboFlow and then output it to your Notebook through the curl terminal command in the YOLOv7 - PyTorch format. Below is the code snippet you can use to access the data used for this demo:

    curl -L "https://app.roboflow.com/ds/4E12DR2cRc?key=LxK5FENSbU" > roboflow.zip; unzip roboflow.zip; rm roboflow.zip
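Once unzipped, the export contains train, valid, and test folders, each with images and labels subfolders. Every image has a matching .txt file in YOLO format: one line per object, with the class index followed by the normalized center x, center y, width, and height of the bounding box. The lines below are made-up examples, assuming class 0 is 'ball-handler' and class 1 is 'player':

    0 0.513 0.462 0.087 0.291
    1 0.238 0.455 0.074 0.268
    1 0.791 0.410 0.069 0.254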

Code demo

The file ‘data/coco.yaml’ is configured to work with our data.
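For reference, a YOLOv7 data file for this two-class project would look roughly like the sketch below. The exact paths depend on where you unzip the dataset and on the folder moves made in the next cell, so treat this as an assumption to check against your own 'data/coco.yaml' rather than a drop-in replacement:

    train: ./v-test/train
    val: ./v-test/valid
    test: ./v-test/test

    # number of classes
    nc: 2

    # class names
    names: ['ball-handler', 'player']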

First, we will load in the required data and the model baseline that we will fine-tune:

    !curl -L "https://app.roboflow.com/ds/4E12DR2cRc?key=LxK5FENSbU" > roboflow.zip; unzip roboflow.zip; rm roboflow.zip
    !wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7_training.pt
    ! mkdir v-test
    ! mv train/ v-test/
    ! mv valid/ v-test/
    ! mv test/ v-test/

Next, a few required packages need to be installed, so running this cell will prepare your environment for training. We downgrade Torch and Torchvision because YOLOv7 cannot work with the most recent versions.

    !pip install -r requirements.txt
    !pip install setuptools==59.5.0
    !pip install torchvision==0.11.3+cu111 -f https://download.pytorch.org/whl/cu111/torch_stable.html

Helpers

    import os

    # Remove the extra RoboFlow junk from the file names.
    # RoboFlow appends '_jpg.rf.<hash>' to every file, so the image and label
    # names no longer match; we keep only the part before the first underscore.
    # Training files come in triplicate (one per augmentation), so we append
    # 'a', 'b', or 'c' to keep those names unique.

    suffix = {1: 'a', 2: 'b', 3: 'c'}

    def clean_names(folder, ext, triplicate=False):
        count = 0
        for name in sorted(os.listdir(folder)):
            if name[0] == '.':      # skip hidden files such as .DS_Store
                continue
            base = name.split('_')[0]
            if triplicate:
                count = count + 1 if count < 3 else 1
                base += suffix[count]
            os.rename(os.path.join(folder, name), os.path.join(folder, base + ext))

    # the training set is in triplicate; valid and test are not
    clean_names('v-test/train/labels', '.txt', triplicate=True)
    clean_names('v-test/train/images', '.jpg', triplicate=True)
    clean_names('v-test/valid/labels', '.txt')
    clean_names('v-test/valid/images', '.jpg')
    clean_names('v-test/test/labels', '.txt')
    clean_names('v-test/test/images', '.jpg')

The next section of the notebook aids in setup. Because RoboFlow outputs data with an additional string of characters and IDs appended to the end of each filename, we first remove all of the extra text. Left in place, these suffixes would prevent training from running, because the names of the .jpg images and their corresponding .txt label files would no longer match. The training files are also in triplicate, which is why the training rename logic appends an extra letter to each name.

Train

Now that our data is set up, we are ready to train the model on our custom dataset. We used a 2 x A6000 machine to train our model for 50 epochs. The code for this part is simple:

    # Train on single GPU
    !python train.py --workers 8 --device 0 --batch-size 8 --data data/coco.yaml --img 1280 720 --cfg cfg/training/yolov7.yaml --weights yolov7_training.pt --name yolov7-ballhandler --hyp data/hyp.scratch.custom.yaml --epochs 50
    
    # Train on 2 GPUs
    !python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 16 --device 0,1 --sync-bn --batch-size 8 --data data/coco.yaml --img 1280 720 --cfg cfg/training/yolov7.yaml --weights yolov7_training.pt --name yolov7-ballhandler --hyp data/hyp.scratch.custom.yaml --epochs 50

We have provided two methods for running training on a single-GPU or multi-GPU system. Executing this cell will begin training on the desired hardware. You can modify the parameters here, and you can additionally modify the hyperparameters for YOLOv7 at 'data/hyp.scratch.custom.yaml' (a few representative entries are shown after the list below). Let's go over some of the more important parameters.

  • workers (int): the number of subprocesses to parallelize during training.
  • img (int): the resolution of our images. For this project, the images were resized to 1280 x 720.
  • batch_size (int): the number of samples processed before the model update is created.
  • nproc_per_node (int): the number of processes to launch per machine during distributed training, typically one per GPU (2 in our multi-GPU command above).
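For reference, the hyperparameter file holds entries like the following. The keys are ones YOLOv7 ships with, but the values shown here are only illustrative, so check your local copy of 'data/hyp.scratch.custom.yaml' before editing:

    lr0: 0.01             # initial learning rate
    lrf: 0.1              # final learning rate fraction (final lr = lr0 * lrf)
    momentum: 0.937       # SGD momentum
    weight_decay: 0.0005  # optimizer weight decay
    mosaic: 1.0           # probability of mosaic augmentation
    mixup: 0.0            # probability of mixup augmentation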

During training, the model will output the memory reserved for training, the number of images examined, the total number of predicted labels, and the precision, recall, and mAP@.5 at the end of each epoch. You can use this information to identify when the model is ready to finish training and to understand its efficacy on the validation set.

At the end of training, the best and last model weights, along with some additional checkpoints, will be saved to the corresponding directory "runs/train/yolov7-ballhandler[n]", where n is the number of times training has been run. Relevant data about the training process is also saved there. You can change the name of the save directory in the command with the --name flag.

Detect

Once model training has completed, we can now use the model to perform object detection in real time. It works on both image and video data, and outputs the predictions frame by frame with the bounding box(es) drawn in. We will use detect.py to qualitatively assess the model's efficacy at its task. For this purpose, we downloaded unrelated NBA game footage from YouTube and uploaded it to the Notebook as a novel test set. You can also plug in a URL for an HTTP, RTSP, or RTMP video stream directly as the source string, though YOLOv7 may prompt for a few additional installs before it can proceed.

Once we have entered our parameters, we can call the detect.py script to detect any of the desired objects in our new test video.

    !python detect.py --weights runs/train/yolov7-ballhandler/weights/best.pt --conf 0.25 --img-size 1280 --source video.mp4 --name test
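The detect.py script also accepts other sources: a webcam index or a stream URL can be substituted for the file path. The URL below is only a placeholder to show the syntax:

    # webcam 0 instead of a local file
    !python detect.py --weights runs/train/yolov7-ballhandler/weights/best.pt --conf 0.25 --img-size 1280 --source 0 --name test-webcam

    # placeholder RTSP stream URL
    !python detect.py --weights runs/train/yolov7-ballhandler/weights/best.pt --conf 0.25 --img-size 1280 --source "rtsp://example.com/live/stream" --name test-stream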

After training for fifty epochs, using the same methods described above, you can expect your model to perform approximately like the one shown in the videos below:

Due to the diversity of training image angles, this model can account for all kinds of shots, including floor level and a more distant ground level from the opposite baseline. For the vast majority of shots, the model is able to correctly identify the ball handler and simultaneously label each additional player on the court.

The model is not perfect, however. When part of a player's body is occluded because they are turned away, the model sometimes struggles to assign the ball-handler label correctly. This often occurs when a player's back is to the camera, likely because this happens frequently when guards set up plays or drive to the basket.

Other times, the model identifies multiple players on the court as being in possession, such as during the fast break shown above. It’s also notable that dunking and blocking in the close camera view can confuse the model. Finally, if most of the players occupy a small area of the court, it can obscure the ball handler from the model and cause confusion.

Overall, the model appears to be generally succeeding at detecting each player and ball handler from the perspective of our qualitative view but suffers from some difficulties in the rarer angles used during certain plays, when the half court is extremely crowded with players, and while doing more athletic plays that aren’t accounted for in the training data, like unique dunks. From this, we can surmise that the problem is not the quality of our data nor the amount of training time but the volume of training data. Ensuring a robust model would likely require around three times the number of images in the current training set.

Let’s now use YOLOv7’s built-in test program to assess our data on the test set.

Test

The test.py script is the simplest and quickest way to assess the quality of your model on your test set. It evaluates the predictions made on the test set and returns the results in a legible format. Used in tandem with our qualitative analysis, it gives us a fuller picture of how the model is performing.

RoboFlow suggests a 70-20-10 train-test-validation split of a dataset for YOLO, in addition to 2000 images per classification. Since our test set is small, several classes are likely underrepresented, so take these results with a grain of salt and use a more robust test set than we chose for your own projects. Here, we use test.yaml instead of coco.yaml.

    !python test.py --data data/test.yaml --img 1280 --batch 16 --conf 0.001 --iou 0.65 --device 0 --weights runs/train/yolov7-ballhandler/weights/best.pt --name yolov7_ballhandler_testing

You will then get an output in the log and several figures and data points assessing the model’s efficacy on the test set saved to the prescribed location. In the logs, you can see the total number of images in the folder, the number of labels for each category in those images, and then the precision, recall, and mAP@.5 for both the cumulative predictions and each classification type.

Test set results (precision, recall, and mAP@.5 for all classes and per class)

As we can see, the data reflects a healthy model that achieves at least ~0.79 mAP@.5 when predicting each of the real labels in the test set.

The ball-handler class's comparatively lower recall, precision, and mAP@.5 makes complete sense given our distinct class imbalance, the extreme similarity between the two classes, and the amount of data used for training. The quantitative results corroborate our qualitative findings: the model is capable but needs more data to reach its full utility.

Closing thoughts

YOLOv7 isn’t just accurate—it’s also super easy to use, especially when paired with a powerful labeling tool like Roboflow. We picked this challenge because it’s genuinely tricky, even for humans, to tell which basketball players have the ball and which don’t—so it’s a great test for machine learning models. The results so far are really promising. This kind of tech could be used in everything from player tracking and stat keeping to gambling insights and training tools.

If you want to try it yourself, follow the workflow we shared using your own custom dataset. And if you need a place to train your models, DigitalOcean’s GPU Droplets make it simple and affordable to get started. Roboflow also has tons of public and community datasets to explore, so take a look before you begin labeling. Thanks for reading, and happy building!
