
Anchor Generation and Palm Detection

Overview

The first key function of the Hand Detector model performs anchor generation and palm detection using the SSD (Single Shot Detector) approach. It implements the initial detection phase, which identifies candidate hand regions in the camera frame by scoring a fixed set of anchor boxes and decoding the box predictions attached to them.

Purpose and Functionality

This function handles the core detection logic:

  • Anchor Generation: Creates the grid of anchor boxes across the detection layers from the configured strides, scales, and aspect ratios
  • Model Inference: Processes camera frames through the detection network
  • Raw Output Processing: Handles the model's raw detection outputs
  • Initial Filtering: Applies basic confidence thresholds

Implementation Details

SSD Anchor Options Configuration

# MEDIAPIPE EXACT CONFIGURATION
options = SSDAnchorOptions(
    num_layers=4,
    min_scale=0.1484375,
    max_scale=0.75,
    input_size_height=192,
    input_size_width=192,
    anchor_offset_x=0.5,
    anchor_offset_y=0.5,
    strides=[8, 16, 16, 16],
    aspect_ratios=[1.0],
    reduce_boxes_in_lowest_layer=False,
    interpolated_scale_aspect_ratio=1.0,
    fixed_anchor_size=True
)
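The SSDAnchorOptions type is used above but not defined in this section. A minimal sketch of what such a container could look like is shown here as a plain dataclass; the class body and its validation rules are assumptions for illustration, not the project's actual definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SSDAnchorOptions:
    """Hypothetical container for the anchor settings listed above."""
    num_layers: int
    min_scale: float
    max_scale: float
    input_size_height: int
    input_size_width: int
    anchor_offset_x: float
    anchor_offset_y: float
    strides: List[int]
    aspect_ratios: List[float]
    reduce_boxes_in_lowest_layer: bool
    interpolated_scale_aspect_ratio: float
    fixed_anchor_size: bool

    def __post_init__(self):
        # Assumed sanity checks: one stride per layer and an ordered scale range.
        if len(self.strides) != self.num_layers:
            raise ValueError("strides must contain one entry per layer")
        if not 0.0 < self.min_scale <= self.max_scale:
            raise ValueError("expected 0 < min_scale <= max_scale")


options = SSDAnchorOptions(
    num_layers=4,
    min_scale=0.1484375,
    max_scale=0.75,
    input_size_height=192,
    input_size_width=192,
    anchor_offset_x=0.5,
    anchor_offset_y=0.5,
    strides=[8, 16, 16, 16],
    aspect_ratios=[1.0],
    reduce_boxes_in_lowest_layer=False,
    interpolated_scale_aspect_ratio=1.0,
    fixed_anchor_size=True,
)
```

Validating in __post_init__ catches a mismatched strides list when the options are built rather than partway through anchor generation.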

Anchor Generation Algorithm

from math import ceil, sqrt

def calculate_scale(min_scale, max_scale, stride_index, num_strides):
    # Linearly interpolate a layer's scale between min_scale and max_scale.
    if num_strides == 1:
        return (min_scale + max_scale) / 2
    else:
        return min_scale + (max_scale - min_scale) * stride_index / (num_strides - 1)

def generate_anchors(options):
    anchors = []
    layer_id = 0
    n_strides = len(options.strides)
    while layer_id < n_strides:
        anchor_height = []
        anchor_width = []
        aspect_ratios = []
        scales = []
        # Consecutive layers that share a stride are merged into one group,
        # so their anchors are emitted together for each grid cell.
        last_same_stride_layer = layer_id
        while last_same_stride_layer < n_strides and \
                options.strides[last_same_stride_layer] == options.strides[layer_id]:
            scale = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer, n_strides)
            if last_same_stride_layer == 0 and options.reduce_boxes_in_lowest_layer:
                aspect_ratios += [1.0, 2.0, 0.5]
                scales += [0.1, scale, scale]
            else:
                aspect_ratios += options.aspect_ratios
                scales += [scale] * len(options.aspect_ratios)
                if options.interpolated_scale_aspect_ratio > 0:
                    # Extra anchor at the geometric mean of this scale and the next.
                    if last_same_stride_layer == n_strides - 1:
                        scale_next = 1.0
                    else:
                        scale_next = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer + 1, n_strides)
                    scales.append(sqrt(scale * scale_next))
                    aspect_ratios.append(options.interpolated_scale_aspect_ratio)
            last_same_stride_layer += 1

        for i, r in enumerate(aspect_ratios):
            ratio_sqrts = sqrt(r)
            anchor_height.append(scales[i] / ratio_sqrts)
            anchor_width.append(scales[i] * ratio_sqrts)

        stride = options.strides[layer_id]
        feature_map_height = ceil(options.input_size_height / stride)
        feature_map_width = ceil(options.input_size_width / stride)

        # Emit anchors row by row; centers are normalized to [0, 1].
        for y in range(feature_map_height):
            for x in range(feature_map_width):
                for anchor_id in range(len(anchor_height)):
                    x_center = (x + options.anchor_offset_x) / feature_map_width
                    y_center = (y + options.anchor_offset_y) / feature_map_height
                    if options.fixed_anchor_size:
                        new_anchor = [x_center, y_center, 1.0, 1.0]
                    else:
                        new_anchor = [x_center, y_center, anchor_width[anchor_id], anchor_height[anchor_id]]
                    anchors.append(new_anchor)
        # Continue with the first layer that has a different stride.
        layer_id = last_same_stride_layer
    return anchors
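As a worked example of the scale interpolation, the sketch below evaluates calculate_scale for the four layers of the configuration above, together with the interpolated scale added per layer. The constant names and the loop are illustrative only.

```python
from math import sqrt


def calculate_scale(min_scale, max_scale, stride_index, num_strides):
    # Same linear interpolation as in the algorithm above.
    if num_strides == 1:
        return (min_scale + max_scale) / 2
    return min_scale + (max_scale - min_scale) * stride_index / (num_strides - 1)


MIN_SCALE, MAX_SCALE, NUM_LAYERS = 0.1484375, 0.75, 4

# One base scale per layer, spaced evenly between MIN_SCALE and MAX_SCALE.
layer_scales = [
    calculate_scale(MIN_SCALE, MAX_SCALE, i, NUM_LAYERS) for i in range(NUM_LAYERS)
]

# The interpolated anchor uses the geometric mean of a layer's scale and the
# next layer's scale; the last layer pairs with 1.0.
interpolated_scales = [
    sqrt(scale * (layer_scales[i + 1] if i + 1 < NUM_LAYERS else 1.0))
    for i, scale in enumerate(layer_scales)
]

for i, (base, interp) in enumerate(zip(layer_scales, interpolated_scales)):
    print(f"layer {i}: base scale {base:.4f}, interpolated scale {interp:.4f}")
```

The base scales come out at roughly 0.148, 0.349, 0.549, and 0.75. Because fixed_anchor_size is True in this configuration, these scales are computed but every emitted anchor still has width and height 1.0.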

Anchor Generation System

SSD Anchor Configuration

The anchor generation follows MediaPipe's configuration with specific parameters:

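  • Layer Configurations: Four detection layers with strides 8, 16, 16, and 16 for multi-scale detection
  • Scale Ranges: min_scale 0.1484375 to max_scale 0.75, intended to cover hand sizes from close-up to distant; with fixed_anchor_size enabled, every anchor is emitted with width and height 1.0
  • Aspect Ratios: A single aspect ratio of 1.0 (square anchors), plus one interpolated-scale anchor per layer
  • Anchor Density: 2016 total anchor boxes across all layers (1152 from the stride-8 layer and 864 from the three stride-16 layers)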
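The 2016 total can be checked from the configuration alone. The helper below, count_anchors, is a hypothetical name introduced for this check; it assumes reduce_boxes_in_lowest_layer is False and counts grid cells times anchors per cell for each group of layers that share a stride.

```python
from itertools import groupby
from math import ceil


def count_anchors(input_height, input_width, strides, aspect_ratios,
                  interpolated_scale_aspect_ratio):
    """Count anchors without generating them (reduce_boxes_in_lowest_layer=False)."""
    # Each layer contributes one anchor per aspect ratio, plus one
    # interpolated-scale anchor when that option is enabled.
    per_layer = len(aspect_ratios) + (1 if interpolated_scale_aspect_ratio > 0 else 0)
    total = 0
    # groupby merges consecutive equal strides, mirroring the generation loop.
    for stride, group in groupby(strides):
        layers_in_group = len(list(group))
        cells = ceil(input_height / stride) * ceil(input_width / stride)
        total += cells * layers_in_group * per_layer
    return total


total = count_anchors(192, 192, [8, 16, 16, 16], [1.0], 1.0)
print(total)
```

For this configuration the stride-8 layer gives 24 x 24 cells with 2 anchors each (1152), and the three stride-16 layers give 12 x 12 cells with 6 anchors each (864), which sums to 2016.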

Anchor Generation Process

def generate_anchors():
    """
    Generate SSD anchors for hand detection

    Returns:
        List of anchor boxes with coordinates and properties
    """
    anchors = []

    # Define anchor configuration
    anchor_options = {
        'num_layers': 4,
        'min_scale': 0.1484375,
        'max_scale': 0.75,
        'input_size_height': 192,
        'input_size_width': 192,
        'anchor_offset_x': 0.5,
        'anchor_offset_y': 0.5,
        'strides': [8, 16, 16, 16],
        'aspect_ratios': [1.0],
        'reduce_boxes_in_lowest_layer': False,
        'interpolated_scale_aspect_ratio': 1.0,
        'fixed_anchor_size': True
    }

    # Generate anchors for each layer.
    # Note: generate_layer_anchors is not defined in this section. The anchor
    # order must match the model's output order; the algorithm above groups
    # layers that share a stride, which a plain per-layer loop does not.
    for layer_id in range(anchor_options['num_layers']):
        layer_anchors = generate_layer_anchors(layer_id, anchor_options)
        anchors.extend(layer_anchors)

    return anchors

Bounding Box Decoding

Raw Output Processing

The model outputs raw detection data that needs to be decoded:

# sigmoid, DETECTION_THRESHOLD, decode_single_bbox, and extract_palm_keypoints
# are assumed to be defined elsewhere; HandRegion is described under Output Format.
def decode_bboxes(raw_boxes, raw_scores, anchors):
    """
    Decode bounding boxes from model outputs using anchors

    Args:
        raw_boxes: Raw bounding box predictions [1, 2016, 18]
            (4 box values followed by 14 keypoint values)
        raw_scores: Raw confidence scores [1, 2016, 1]
        anchors: Pre-generated anchor boxes

    Returns:
        List of HandRegion objects with decoded information
    """
    detections = []

    # Apply sigmoid activation to scores
    scores = sigmoid(raw_scores)

    # Process each anchor box
    for i, anchor in enumerate(anchors):
        confidence = scores[0, i, 0]

        # Filter by confidence threshold
        if confidence > DETECTION_THRESHOLD:
            # Decode bounding box coordinates
            bbox = decode_single_bbox(raw_boxes[0, i], anchor)

            # Extract palm keypoints
            keypoints = extract_palm_keypoints(raw_boxes[0, i, 4:])

            # Create HandRegion object
            hand_region = HandRegion(
                bbox=bbox,
                confidence=confidence,
                keypoints=keypoints
            )

            detections.append(hand_region)

    return detections
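decode_single_bbox is called above but not shown. The sketch below is one plausible version, modeled on how SSD-style decoders commonly work. The value layout (x center, y center, width, height), the divisor of 192, and the top-left output convention are all assumptions that should be checked against the actual model and helper.

```python
def decode_single_bbox(raw_box, anchor, scale=192.0):
    """Decode one raw prediction into [x, y, w, h] in normalized coordinates.

    Assumptions (not confirmed by this document):
      - raw_box[0:4] holds (x_center, y_center, width, height) offsets
        expressed in input pixels, hence the division by `scale`;
      - anchor is [x_center, y_center, width, height], normalized to [0, 1];
      - the returned x, y is the top-left corner of the box.
    """
    anchor_x, anchor_y, anchor_w, anchor_h = anchor

    # Offsets are scaled by the anchor size and added to the anchor center.
    center_x = raw_box[0] / scale * anchor_w + anchor_x
    center_y = raw_box[1] / scale * anchor_h + anchor_y
    width = raw_box[2] / scale * anchor_w
    height = raw_box[3] / scale * anchor_h

    # Convert center form to top-left form.
    return [center_x - width / 2, center_y - height / 2, width, height]


# A prediction centered on its anchor, covering half the frame in each axis.
bbox = decode_single_bbox([0.0, 0.0, 96.0, 96.0], [0.5, 0.5, 1.0, 1.0])
```

With fixed-size anchors (width and height 1.0), the anchor only contributes its center, so the raw values divided by the input size translate directly into normalized offsets and sizes.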

Palm Keypoint Extraction

The detection includes palm keypoints for rotation calculation:

  • Wrist Keypoint: Base reference point for hand orientation
  • Middle Finger MCP: Second reference point for rotation calculation
  • Coordinate System: Normalized coordinates relative to bounding box
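To illustrate how the two keypoints can yield a rotation, the sketch below computes the angle of the wrist-to-MCP vector. The function name palm_rotation and the convention that an upright hand maps to zero rotation are assumptions modeled on common practice; the actual rotation step belongs to the next stage and may differ.

```python
import math


def palm_rotation(wrist, middle_mcp):
    """Return the hand rotation in radians, in [-pi, pi).

    Both points are (x, y) pairs in image coordinates, where y grows
    downward. Assumed convention: a hand whose middle finger MCP sits
    directly above the wrist has rotation 0.
    """
    dx = middle_mcp[0] - wrist[0]
    dy = middle_mcp[1] - wrist[1]

    # Negate dy so the angle is measured in a y-up frame, then rotate the
    # reference so that "pointing up" corresponds to zero.
    angle = math.pi / 2 - math.atan2(-dy, dx)

    # Wrap the result into [-pi, pi).
    return angle - 2 * math.pi * math.floor((angle + math.pi) / (2 * math.pi))


upright = palm_rotation((0.5, 0.8), (0.5, 0.4))    # MCP directly above wrist
sideways = palm_rotation((0.2, 0.5), (0.6, 0.5))   # MCP to the right of wrist
```

Under this convention, the upright case gives a rotation of 0 and the sideways case gives a quarter turn (pi / 2).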

Key Processing Steps

1. Frame Preprocessing

  • Resize: Input frame resized to 192x192 pixels
  • Normalization: Pixel values normalized to model requirements
  • Batch Formation: Single frame formatted as batch input
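The preprocessing steps above can be sketched in dependency-free Python as follows. This is illustrative only: a real pipeline would normally use an image library for resizing, and the [0, 1] normalization range and nearest-neighbor sampling are assumptions, since this document does not state the model's exact requirements.

```python
def preprocess_frame(frame, size=192):
    """Resize, normalize, and batch a frame.

    frame: list of rows, each row a list of (r, g, b) tuples in 0..255.
    Returns a batch of one image: [size x size rows of float tuples].
    """
    src_height = len(frame)
    src_width = len(frame[0])

    resized = []
    for y in range(size):
        # Nearest-neighbor sampling: map each output row/column to a source one.
        src_y = min(src_height - 1, int(y * src_height / size))
        row = []
        for x in range(size):
            src_x = min(src_width - 1, int(x * src_width / size))
            r, g, b = frame[src_y][src_x]
            # Assumed normalization: scale 0..255 to 0.0..1.0.
            row.append((r / 255.0, g / 255.0, b / 255.0))
        resized.append(row)

    # Batch formation: wrap the single image in an outer list.
    return [resized]


# A tiny 4 x 6 frame filled with one color, just to exercise the function.
frame = [[(255, 0, 128)] * 6 for _ in range(4)]
batch = preprocess_frame(frame)
```

Note that stretching the frame to a square changes its aspect ratio; whether the original pipeline pads the frame instead is not stated here.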

2. Model Inference

  • Forward Pass: Frame processed through SSD network
  • Output Extraction: Raw detections and scores obtained
  • Memory Management: Efficient handling of model outputs

3. Score Processing

  • Sigmoid Activation: Applied to raw confidence scores
  • Threshold Filtering: Remove low-confidence detections
  • Score Ranking: Sort detections by confidence
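The three score-processing steps can be sketched as below. The function name filter_and_rank is hypothetical, and the default threshold of 0.5 mirrors the detection_threshold default listed under Configuration Parameters.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function: avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def filter_and_rank(raw_scores, threshold=0.5):
    """Return (anchor_index, confidence) pairs above threshold, best first."""
    # 1. Sigmoid activation turns raw logits into confidences in (0, 1).
    scored = [(index, sigmoid(score)) for index, score in enumerate(raw_scores)]
    # 2. Threshold filtering drops low-confidence anchors.
    kept = [(index, conf) for index, conf in scored if conf > threshold]
    # 3. Ranking sorts the survivors by descending confidence.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


ranked = filter_and_rank([-2.0, 0.0, 3.0, 1.0])
```

Keeping the anchor index alongside each confidence lets the later decoding step look up the matching raw box and anchor.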

4. Bounding Box Decoding

  • Anchor Mapping: Raw outputs mapped to anchor coordinates
  • Coordinate Transformation: Convert to image coordinate system
  • Size Calculation: Compute bounding box dimensions
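For the coordinate transformation step, a normalized box can be mapped back to pixel coordinates of the original frame as sketched here. The function name bbox_to_pixels is hypothetical, and the sketch assumes the frame was stretched to the model input without padding, so normalized coordinates scale directly by the frame size.

```python
def bbox_to_pixels(bbox, frame_width, frame_height):
    """Convert a normalized [x, y, w, h] box to integer pixel (x, y, w, h).

    Assumes x, y is the top-left corner and that no letterbox padding was
    applied during preprocessing.
    """
    x, y, w, h = bbox

    def clamp(value, upper):
        # Keep coordinates inside the frame.
        return max(0, min(upper, value))

    left = clamp(round(x * frame_width), frame_width)
    top = clamp(round(y * frame_height), frame_height)
    right = clamp(round((x + w) * frame_width), frame_width)
    bottom = clamp(round((y + h) * frame_height), frame_height)

    return (left, top, right - left, bottom - top)


pixel_box = bbox_to_pixels([0.25, 0.25, 0.5, 0.5], 640, 480)
```

Clamping the corners rather than the width and height keeps boxes that extend past the frame edge valid after conversion.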

Output Format

HandRegion Object Structure

class HandRegion:
    def __init__(self, bbox, confidence, keypoints):
        self.bbox = bbox              # [x, y, w, h] in normalized coordinates
        self.confidence = confidence  # Detection confidence score
        self.keypoints = keypoints    # Palm keypoints for rotation
        self.rotation = None          # Calculated in next stage
        self.rect_points = None       # Rotated rectangle points

Detection Metadata

  • Bounding Box: [x, y, width, height] in normalized coordinates
  • Confidence Score: Float value between 0.0 and 1.0
  • Palm Keypoints: Wrist and middle finger MCP coordinates
  • Processing Time: Inference latency for performance monitoring

Integration Points

Input Interface

  • Video Capture: Receives frames from camera or video file
  • Frame Buffer: Manages input frame queue for processing
  • Preprocessing: Handles frame preparation and formatting

Output Interface

  • Detection List: Provides list of detected hand regions
  • Quality Metrics: Includes confidence and processing statistics
  • Next Stage: Feeds into Non-Maximum Suppression function

Configuration Parameters

Detection Settings

  • detection_threshold: Minimum confidence for valid detection (default: 0.5)
  • input_size: Model input resolution (192x192)
  • max_detections: Maximum number of detections to process (default: 100)

Anchor Parameters

  • num_layers: Number of detection layers (default: 4)
  • min_scale: Minimum anchor scale (default: 0.1484375)
  • max_scale: Maximum anchor scale (default: 0.75)
  • aspect_ratios: Anchor aspect ratios (default: [1.0])

Next Steps

After anchor generation and palm detection, the pipeline proceeds to:

  • Function 2: Non-Maximum Suppression and rotation calculation
  • Quality Assessment: Validate detection reliability
  • Temporal Consistency: Track detections across frames