Deep Learning Pallet Detection

Paper Summary: “A Comparison of Deep Learning Models for Pallet Detection in Industrial Warehouses”

Dickson Wu
5 min read · Nov 10, 2021

Paper by: Michela Zaccaria, Riccardo Monica, Jacopo Aleotti

Abstract:

This paper compares three object detection models (Faster R-CNN, SSD, and YOLOv4) on the task of recognizing pallets (fronts and pockets) in warehouses. The first two performed better than YOLOv4.

Introduction:

This paper discusses the problem of pallet detection, which matters for robots working in warehouses that have to pick up and move pallets around.

The pipeline works like this: We use a CNN in order to find the front and pockets of a pallet. Then we use a decision block to choose the final pallet proposals (we only accept a detection if we see the fronts + the pockets to achieve a high level of confidence and safety).
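
To make the flow concrete, here's a minimal sketch of that pipeline in Python (my own illustration, not the authors' code; `run_detector` and `decision_block` are hypothetical placeholders):

```python
# Sketch of the two-stage detection pipeline described above (not the authors' code).
# `run_detector` and `decision_block` are hypothetical placeholders.

def detect_pallets(image, run_detector, decision_block):
    # Step 1: the CNN proposes bounding boxes for two classes:
    # pallet fronts and pallet pockets.
    detections = run_detector(image)           # list of (class_name, box, score)
    fronts = [d for d in detections if d[0] == "front"]
    pockets = [d for d in detections if d[0] == "pocket"]

    # Step 2: a rule-based decision block keeps only fronts that are
    # supported by their pockets (high confidence / safety requirement).
    return decision_block(fronts, pockets)
```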

A downside of using CNNs is that they need a ton of labelled data. Usually, CNNs are trained and evaluated on benchmark datasets, which don't always contain the kind of data industry needs. Plus, industrial systems have to adapt to different conditions. So the authors collected all that data for us!

The dataset covers lots of different conditions: multiple pallets, pallets on the ground or on racks, arbitrary orientations, and pallets partly covered with transparent plastic wrap.

Related Work:

There's been little work on automatic pallet detection. One group used SSD, but this paper tests out multiple models. Plus, this paper tries to find the pockets as well!

There has been work using conventional computer vision (stereo, frontal views). But those methods aren't robust to varying lighting conditions (since they rely on colour), and they can only deal with rectangular features.

Proposed Method:

First, a CNN detects the 2 classes (fronts and pockets). This tells us where to put the forklift forks for pallet retrieval. The detections are then passed into a decision block.

Convolutional Neural Network Architectures:

They compared 3 different architectures: Faster R-CNN, Single Shot Multi-box Detector (SSD) and You Only Look Once v4 (YOLOv4).

Faster R-CNN is a two-stage detector: it first searches the whole image for candidate object regions, then classifies each one. Its predecessor, R-CNN, used a selective search algorithm to generate region proposals and ran a CNN on every single proposal.

Fast R-CNN came along and computed the feature map once for the whole image, pooling each region proposal's features directly from it, so there was no need to push every bounding box through the CNN separately. Then Faster R-CNN replaced the selective search algorithm with a learned Region Proposal Network. Each iteration made the detector faster and more accurate!

We can use different feature extraction backbones; this paper uses ResNet50. It was trained for 10,000 iterations with a batch size of 2, using a learning-rate warm-up, and the images were resized to multiple scales while keeping the original aspect ratio. It took 12 hours to train.
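
As a rough sketch of what such a setup could look like, here's a torchvision-based version (an assumption on my part; the paper doesn't say which framework was used, and the base learning rate and warm-up length below are made-up values):

```python
# Minimal sketch of a Faster R-CNN (ResNet50 backbone) training setup in
# torchvision, loosely matching the hyperparameters mentioned above.
# The base learning rate and warm-up length are assumptions, not from the paper.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Learning-rate warm-up: ramp up linearly over the first 500 iterations.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.001,
                                           total_iters=500)

# Training loop skeleton: 10,000 iterations, batch size 2 (data loader omitted).
# for iteration, (images, targets) in zip(range(10_000), data_loader):
#     losses = model(images, targets)   # returns a dict of losses in train mode
#     loss = sum(losses.values())
#     optimizer.zero_grad(); loss.backward(); optimizer.step(); warmup.step()
```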

Unlike Faster R-CNN, SSD does everything in one shot rather than a two-step process. Each location on the feature map comes with a set of default bounding boxes of varying sizes and aspect ratios, spread across feature maps of different resolutions.

The architecture uses a backbone CNN to extract features, which feeds into additional convolutional layers that produce multi-scale feature maps; a final set of convolutions outputs the class confidences and box offsets.

This paper uses a HarDNet85 backbone, trained for 150 epochs on images at a fixed resolution, with a batch size of 8, a learning rate of 0.004, a decay factor of 0.0001, and a momentum of 0.9. It took 8 hours to train.
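
torchvision doesn't ship a HarDNet85 backbone, so as a stand-in sketch here's its SSD300/VGG16 model with the optimizer settings quoted above (again, just my illustration, not the authors' setup):

```python
# Sketch of an SSD training setup. torchvision has no HarDNet85 backbone,
# so this uses its SSD300/VGG16 model as a stand-in for the paper's backbone.
import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT")

# Hyperparameters quoted above: lr 0.004, decay 1e-4, momentum 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.9, weight_decay=0.0001)

# Train for 150 epochs at a fixed input resolution with batch size 8
# (data loading and the epoch loop are omitted here).
```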

There's also YOLO, which has gone through 4 versions. YOLO divides the image into a grid, and each cell predicts a set of bounding boxes. It could detect things quickly but suffered from high localization error. After three iterations we get YOLOv4 (obtained by optimizing the architecture, the data augmentations, and the training techniques). They trained it for 24,000 iterations with a batch size of 16, which took 23 hours.
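
To make the grid idea concrete, here's a toy illustration (mine, not from the paper) of how a YOLO-style detector assigns an object to the grid cell containing its bounding-box center:

```python
# Toy illustration of YOLO's grid assignment (not the authors' code):
# the image is split into an S x S grid and each object is assigned to the
# cell containing its bounding-box center.

def yolo_cell_for_box(box, image_w, image_h, S=13):
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    col = min(int(cx / image_w * S), S - 1)
    row = min(int(cy / image_h * S), S - 1)
    return row, col

# Example: a pallet front centered at (320, 240) in a 640x480 image
# falls into the middle cell of a 13x13 grid.
print(yolo_cell_for_box((280, 200, 360, 280), 640, 480))  # (6, 6)
```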

Decision Block:

This is just a simple piece of code with some heuristic rules. If the detected pallet front has an area greater than a threshold X and contains 2 pockets, it's valid! Otherwise it's discarded. But if the area is smaller than X, we count it as a pallet regardless of how many pockets it has. A pocket counts as being in a pallet if 80% of the pocket lies inside the front.
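
Written out as code, the heuristic might look something like this (my reading of the rule; the area threshold X is left as a parameter, and the exact containment test is an assumption):

```python
# Sketch of the decision block heuristic as I read it (not the authors' code).
# `area_threshold` stands in for the unspecified threshold X.

def box_area(box):
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)

def pocket_inside_front(pocket, front, min_overlap=0.8):
    # A pocket counts as belonging to a front if >= 80% of its area
    # lies inside the front's bounding box.
    ix_min, iy_min = max(pocket[0], front[0]), max(pocket[1], front[1])
    ix_max, iy_max = min(pocket[2], front[2]), min(pocket[3], front[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    return inter >= min_overlap * box_area(pocket)

def is_valid_pallet(front, pockets, area_threshold):
    if box_area(front) <= area_threshold:
        # Small (distant) fronts are accepted regardless of pocket count.
        return True
    n_pockets = sum(pocket_inside_front(p, front) for p in pockets)
    return n_pockets >= 2
```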

Dataset:

The images come from a mineral water company: 1,344 images taken from various distances, configurations, and lighting conditions. Some images share the same camera pose but have different lighting conditions (so their annotations were cheaper to produce). These sub-groups were then merged into groups that share the same distance to the pallet and the same height.

Experiments:

CNNs' Detection Results:

We apply a ton of augmentations to the images to make sure the models generalize. We start off with models trained on MS COCO, and then fine-tune them on this dataset. Here are the results for the 3 networks:
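
With torchvision, the fine-tuning step could look roughly like this; swapping the COCO head for a 3-class one (background, pallet front, pocket) is my assumption of how the adaptation works, not the paper's exact recipe:

```python
# Sketch of fine-tuning a COCO-pretrained Faster R-CNN on the pallet classes.
# The 3-class head (background, pallet front, pocket) is an assumption here.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the COCO classification head with one for our 2 classes + background.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=3)

# Training then proceeds on the (heavily augmented) pallet dataset.
```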

The models did quite poorly on small objects, and they did better on pallet fronts than on pockets. We also care about AP75, the average precision where a detection only counts as correct if it overlaps the ground truth with an IoU of at least 0.75, i.e. when precise localization is required (important for high-confidence settings).
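
Concretely, AP75 only credits a detection whose IoU with a ground-truth box is at least 0.75. A small helper (my own, for illustration) shows what that overlap measure is:

```python
# Intersection-over-Union between two boxes (x_min, y_min, x_max, y_max).
# Under AP75, a detection only counts as correct if IoU >= 0.75.

def iou(a, b):
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, would not count under AP75
```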

Decision Block:

Here are the results:

Conclusion:

This paper assesses 3 different models on the task of automatic pallet detection: Faster R-CNN and SSD perform better than YOLOv4. Future work is extracting the pallet pose with respect to the sensor.

If you want to find out more: Read the paper here!

Thanks for reading! I’m Dickson, an 18-year-old Crypto enthusiast who’s excited to use it to impact billions of people 🌎

If you want to follow along on my journey, you can join my monthly newsletter, check out my website, and connect on LinkedIn or Twitter 😃
