Anchor Generation and Palm Detection

Overview

The first stage of the Hand Detector model is anchor generation and palm detection, following the SSD (Single Shot Detector) approach. This stage identifies candidate hand regions in the camera frame by scoring a fixed set of pre-computed anchor boxes.

Purpose and Functionality

This function handles the core detection logic:

Anchor Generation: Creates a fixed grid of anchor boxes across the detection layers
Model Inference: Processes camera frames through the detection network
Raw Output Processing: Handles the model's raw detection outputs
Initial Filtering: Applies basic confidence thresholds

Implementation Details

SSD Anchor Options Configuration

# MEDIAPIPE EXACT CONFIGURATION
options = SSDAnchorOptions(
    num_layers=4,
    min_scale=0.1484375,
    max_scale=0.75,
    input_size_height=192,
    input_size_width=192,
    anchor_offset_x=0.5,
    anchor_offset_y=0.5,
    strides=[8, 16, 16, 16],
    aspect_ratios=[1.0],
    reduce_boxes_in_lowest_layer=False,
    interpolated_scale_aspect_ratio=1.0,
    fixed_anchor_size=True
)
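The configuration above relies on an SSDAnchorOptions type whose definition is not shown. A minimal sketch of what such a container could look like follows; the dataclass layout and the length check in __post_init__ are assumptions for illustration, and only the field names and values come from the configuration itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SSDAnchorOptions:
    """Hypothetical container for the anchor settings listed above."""
    num_layers: int = 4
    min_scale: float = 0.1484375
    max_scale: float = 0.75
    input_size_height: int = 192
    input_size_width: int = 192
    anchor_offset_x: float = 0.5
    anchor_offset_y: float = 0.5
    strides: List[int] = field(default_factory=lambda: [8, 16, 16, 16])
    aspect_ratios: List[float] = field(default_factory=lambda: [1.0])
    reduce_boxes_in_lowest_layer: bool = False
    interpolated_scale_aspect_ratio: float = 1.0
    fixed_anchor_size: bool = True

    def __post_init__(self):
        # Assumed sanity check: one stride is expected per detection layer.
        if len(self.strides) != self.num_layers:
            raise ValueError("len(strides) must equal num_layers")


options = SSDAnchorOptions()
print(options.strides)  # [8, 16, 16, 16]
```

Using default_factory for the list fields keeps each instance from sharing one mutable list.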

Anchor Generation Algorithm

from math import ceil, sqrt

def calculate_scale(min_scale, max_scale, stride_index, num_strides):
    if num_strides == 1:
        return (min_scale + max_scale) / 2
    else:
        return min_scale + (max_scale - min_scale) * stride_index / (num_strides - 1)

def generate_anchors(options):
    anchors = []
    layer_id = 0
    n_strides = len(options.strides)
    while layer_id < n_strides:
        anchor_height = []
        anchor_width = []
        aspect_ratios = []
        scales = []
        last_same_stride_layer = layer_id
        while last_same_stride_layer < n_strides and \
                options.strides[last_same_stride_layer] == options.strides[layer_id]:
            scale = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer, n_strides)
            if last_same_stride_layer == 0 and options.reduce_boxes_in_lowest_layer:
                aspect_ratios += [1.0, 2.0, 0.5]
                scales += [0.1, scale, scale]
            else:
                aspect_ratios += options.aspect_ratios
                scales += [scale] * len(options.aspect_ratios)
                if options.interpolated_scale_aspect_ratio > 0:
                    if last_same_stride_layer == n_strides -1:
                        scale_next = 1.0
                    else:
                        scale_next = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer+1, n_strides)
                    scales.append(sqrt(scale * scale_next))
                    aspect_ratios.append(options.interpolated_scale_aspect_ratio)
            last_same_stride_layer += 1

        for i, r in enumerate(aspect_ratios):
            ratio_sqrts = sqrt(r)
            anchor_height.append(scales[i] / ratio_sqrts)
            anchor_width.append(scales[i] * ratio_sqrts)

        stride = options.strides[layer_id]
        feature_map_height = ceil(options.input_size_height / stride)
        feature_map_width = ceil(options.input_size_width / stride)

        for y in range(feature_map_height):
            for x in range(feature_map_width):
                for anchor_id in range(len(anchor_height)):
                    x_center = (x + options.anchor_offset_x) / feature_map_width
                    y_center = (y + options.anchor_offset_y) / feature_map_height
                    if options.fixed_anchor_size:
                        new_anchor = [x_center, y_center, 1.0, 1.0]
                    else:
                        new_anchor = [x_center, y_center, anchor_width[anchor_id], anchor_height[anchor_id]]
                    anchors.append(new_anchor)
        layer_id = last_same_stride_layer
    return anchors
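With fixed_anchor_size=True every anchor gets width and height 1.0, so the scale and aspect-ratio lists only determine how many anchors sit on each grid cell. The sketch below reproduces that special case in a compact form. The function name fixed_size_anchors and its anchors_per_layer argument (one anchor for aspect ratio 1.0 plus one interpolated-scale anchor) are illustrative, and a square input is assumed.

```python
from itertools import groupby
from math import ceil


def fixed_size_anchors(strides, input_size=192, offset=0.5, anchors_per_layer=2):
    """Anchor list for the fixed_anchor_size=True case (illustrative helper)."""
    anchors = []
    # Consecutive layers with the same stride share one feature-map grid,
    # so their anchors are emitted together, cell by cell.
    for stride, group in groupby(strides):
        per_cell = anchors_per_layer * len(list(group))
        grid = ceil(input_size / stride)
        for y in range(grid):
            for x in range(grid):
                center = [(x + offset) / grid, (y + offset) / grid, 1.0, 1.0]
                anchors.extend(list(center) for _ in range(per_cell))
    return anchors


anchors = fixed_size_anchors([8, 16, 16, 16])
# The first stride-16 cell carries six identical anchors (3 layers x 2).
print(anchors[1152:1158].count(anchors[1152]))  # 6
```

The grouping matters because the anchor at index i is paired with the model's output at index i, so the list order has to follow the same cell-by-cell layout as the full algorithm.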

Anchor Generation System

SSD Anchor Configuration

The anchor generation follows MediaPipe's configuration with specific parameters:

Layer Configurations: Four detection layers with strides 8, 16, 16, and 16
Scale Ranges: Scales run from 0.1484375 to 0.75; with fixed_anchor_size=True every generated anchor has width and height 1.0
Aspect Ratios: A single aspect ratio of 1.0, plus one interpolated-scale anchor per layer
Anchor Density: 2016 total anchor boxes across all layers
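The total of 2016 can be checked by hand from the configuration. The short calculation below assumes two anchors per cell per layer (aspect ratio 1.0 plus the interpolated-scale anchor), which is what the options above produce.

```python
from math import ceil

input_size = 192
strides = [8, 16, 16, 16]
anchors_per_layer = 2  # aspect ratio 1.0 plus the interpolated-scale anchor

total = 0
for stride in strides:
    grid = ceil(input_size / stride)  # 24 for stride 8, 12 for stride 16
    layer_total = grid * grid * anchors_per_layer
    print(f"stride {stride:2d}: {grid}x{grid} cells -> {layer_total} anchors")
    total += layer_total

print(total)  # 2016
```

The stride-8 layer contributes 1152 anchors and each of the three stride-16 layers contributes 288.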

Anchor Generation Process

def generate_anchors():
    """
    Generate SSD anchors for hand detection
    
    Returns:
        List of anchor boxes with coordinates and properties
    """
    anchors = []
    
    # Define anchor configuration
    anchor_options = {
        'num_layers': 4,
        'min_scale': 0.1484375,
        'max_scale': 0.75,
        'input_size_height': 192,
        'input_size_width': 192,
        'anchor_offset_x': 0.5,
        'anchor_offset_y': 0.5,
        'strides': [8, 16, 16, 16],
        'aspect_ratios': [1.0],
        'reduce_boxes_in_lowest_layer': False,
        'interpolated_scale_aspect_ratio': 1.0,
        'fixed_anchor_size': True
    }
    
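    # Generate anchors for each layer.
    # generate_layer_anchors is a helper that is not shown here. Layers 1-3
    # share stride 16, so their anchors must be emitted together cell by cell
    # (as in the algorithm above) for the anchor order to match the model's
    # output order; a plain per-layer loop is a simplification.
    for layer_id in range(anchor_options['num_layers']):
        layer_anchors = generate_layer_anchors(layer_id, anchor_options)
        anchors.extend(layer_anchors)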
    
    return anchors

Bounding Box Decoding

Raw Output Processing

The model outputs raw detection data that needs to be decoded:

def decode_bboxes(raw_boxes, raw_scores, anchors):
    """
    Decode bounding boxes from model outputs using anchors
    
    Args:
        raw_boxes: Raw bounding box predictions [1, 2016, 18]
        raw_scores: Raw confidence scores [1, 2016, 1]
        anchors: Pre-generated anchor boxes
        
    Returns:
        List of HandRegion objects with decoded information
    """
    detections = []
    
    # Apply sigmoid activation to scores
    scores = sigmoid(raw_scores)
    
    # Process each anchor box
    for i, anchor in enumerate(anchors):
        confidence = scores[0, i, 0]
        
        # Filter by confidence threshold
        if confidence > DETECTION_THRESHOLD:
            # Decode bounding box coordinates
            bbox = decode_single_bbox(raw_boxes[0, i], anchor)
            
            # Extract palm keypoints
            keypoints = extract_palm_keypoints(raw_boxes[0, i, 4:])
            
            # Create HandRegion object
            hand_region = HandRegion(
                bbox=bbox,
                confidence=confidence,
                keypoints=keypoints
            )
            
            detections.append(hand_region)
    
    return detections
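The helpers decode_single_bbox and extract_palm_keypoints are called above but not defined. The following is a minimal sketch of how the decoding could work, under several assumptions that the source does not confirm: the first four raw values are (cx, cy, w, h) offsets relative to the anchor, all offsets are divided by the input size of 192, and bbox holds the top-left corner plus width and height. The name decode_keypoints is hypothetical and, unlike extract_palm_keypoints, it takes the anchor as a second argument.

```python
INPUT_SIZE = 192.0  # assumed divisor for the x, y, w and h offsets


def decode_single_bbox(raw, anchor, scale=INPUT_SIZE):
    """Decode one box to [x, y, w, h] (top-left corner, normalized).

    Assumes raw[0:4] holds (cx, cy, w, h) offsets relative to the anchor.
    """
    anchor_cx, anchor_cy, anchor_w, anchor_h = anchor
    cx = raw[0] / scale * anchor_w + anchor_cx
    cy = raw[1] / scale * anchor_h + anchor_cy
    w = raw[2] / scale * anchor_w
    h = raw[3] / scale * anchor_h
    return [cx - w / 2, cy - h / 2, w, h]


def decode_keypoints(raw_keypoints, anchor, scale=INPUT_SIZE):
    """Decode interleaved (x, y) keypoint offsets into normalized points."""
    anchor_cx, anchor_cy, anchor_w, anchor_h = anchor
    points = []
    for k in range(0, len(raw_keypoints) - 1, 2):
        x = raw_keypoints[k] / scale * anchor_w + anchor_cx
        y = raw_keypoints[k + 1] / scale * anchor_h + anchor_cy
        points.append((x, y))
    return points


anchor = [0.5, 0.5, 1.0, 1.0]
raw = [19.2, -19.2, 96.0, 48.0, 0.0, 0.0, 96.0, -96.0]
bbox = decode_single_bbox(raw, anchor)
keypoints = decode_keypoints(raw[4:], anchor)
print(bbox, keypoints)
```

Because the anchors here all have width and height 1.0, the anchor mainly contributes its center; the size terms are kept so the sketch also holds for anchors that carry real sizes.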

Palm Keypoint Extraction

The detection includes palm keypoints, stored in the 14 values that follow the four box coordinates. Two of them are used for rotation calculation:

Wrist Keypoint: Base reference point for hand orientation
Middle Finger MCP: Second reference point for rotation calculation
Coordinate System: Normalized coordinates relative to bounding box
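To illustrate why these two keypoints are enough to define an orientation, here is a hedged sketch of one common way to turn the wrist-to-MCP axis into a rotation angle. The actual calculation belongs to the next stage and may differ; the function names, the 90-degree target angle, and the assumption that both axes use the same units (a square frame or pixel coordinates) are illustrative.

```python
from math import atan2, floor, pi


def normalize_radians(angle):
    """Wrap an angle into the range [-pi, pi)."""
    return angle - 2 * pi * floor((angle + pi) / (2 * pi))


def palm_rotation(wrist, middle_mcp, target_angle=pi / 2):
    """Angle between the wrist -> middle-MCP axis and the upright direction."""
    dx = middle_mcp[0] - wrist[0]
    dy = middle_mcp[1] - wrist[1]
    # Image y grows downward, hence the sign flip on dy.
    return normalize_radians(target_angle - atan2(-dy, dx))


upright = palm_rotation((0.5, 0.8), (0.5, 0.4))   # fingers pointing up
sideways = palm_rotation((0.2, 0.5), (0.6, 0.5))  # fingers pointing right
print(upright, sideways)
```

A hand whose fingers point straight up in the image yields a rotation of zero, and one pointing to the right yields a quarter turn.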

Key Processing Steps

1. Frame Preprocessing

Resize: Input frame resized to 192x192 pixels
Normalization: Pixel values normalized to model requirements
Batch Formation: Single frame formatted as batch input

2. Model Inference

Forward Pass: Frame processed through SSD network
Output Extraction: Raw detections and scores obtained
Memory Management: Efficient handling of model outputs

3. Score Processing

Sigmoid Activation: Applied to raw confidence scores
Threshold Filtering: Remove low-confidence detections
Score Ranking: Sort detections by confidence
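The three score-processing steps can be sketched in plain Python as follows. The clipping bound SCORE_CLIP and the helper name rank_scores are assumptions added for illustration; clipping is included only so that exp() cannot overflow on extreme raw values.

```python
from math import exp

SCORE_CLIP = 100.0  # assumed clipping bound; keeps exp() from overflowing


def sigmoid(x):
    x = max(-SCORE_CLIP, min(SCORE_CLIP, x))
    return 1.0 / (1.0 + exp(-x))


def rank_scores(raw_scores, threshold=0.5):
    """Return (anchor_index, confidence) pairs above threshold, best first."""
    scored = [(i, sigmoid(s)) for i, s in enumerate(raw_scores)]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ranked = rank_scores([-1000.0, 0.0, 2.0, 1000.0])
print([i for i, _ in ranked])  # [3, 2]
```

A raw score of 0.0 maps to exactly 0.5 and is dropped here because the comparison is strict, matching the `confidence > DETECTION_THRESHOLD` test used in decode_bboxes.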

4. Bounding Box Decoding

Anchor Mapping: Raw outputs mapped to anchor coordinates
Coordinate Transformation: Convert to image coordinate system
Size Calculation: Compute bounding box dimensions
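For the coordinate transformation step, a normalized box can be mapped back to pixel coordinates roughly as shown below. This is a sketch that assumes the frame was resized directly to the model input with no padding or cropping, and that the box is given as [x, y, w, h] with a top-left origin; bbox_to_pixels is a hypothetical helper name.

```python
def bbox_to_pixels(bbox, frame_width, frame_height):
    """Convert a normalized [x, y, w, h] box to clamped pixel coordinates."""
    x, y, w, h = bbox
    # Clamp each edge to the frame so partially visible hands stay valid.
    left = max(0, min(frame_width, round(x * frame_width)))
    top = max(0, min(frame_height, round(y * frame_height)))
    right = max(0, min(frame_width, round((x + w) * frame_width)))
    bottom = max(0, min(frame_height, round((y + h) * frame_height)))
    return left, top, right - left, bottom - top


print(bbox_to_pixels([0.25, 0.5, 0.5, 0.25], 640, 480))  # (160, 240, 320, 120)
```

If the preprocessing step pads or crops the frame, the padding offsets would need to be removed before this mapping.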

Output Format

HandRegion Object Structure

class HandRegion:
    def __init__(self, bbox, confidence, keypoints):
        self.bbox = bbox              # [x, y, w, h] in normalized coordinates
        self.confidence = confidence  # Detection confidence score
        self.keypoints = keypoints    # Palm keypoints for rotation
        self.rotation = None          # Calculated in next stage
        self.rect_points = None       # Rotated rectangle points

Detection Metadata

Bounding Box: [x, y, width, height] in normalized coordinates
Confidence Score: Float value between 0.0 and 1.0
Palm Keypoints: Wrist and middle finger MCP coordinates
Processing Time: Inference latency for performance monitoring

Integration Points

Input Interface

Video Capture: Receives frames from camera or video file
Frame Buffer: Manages input frame queue for processing
Preprocessing: Handles frame preparation and formatting

Output Interface

Detection List: Provides list of detected hand regions
Quality Metrics: Includes confidence and processing statistics
Next Stage: Feeds into Non-Maximum Suppression function

Configuration Parameters

Detection Settings

detection_threshold: Minimum confidence for valid detection (default: 0.5)
input_size: Model input resolution (192x192)
max_detections: Maximum number of detections to process (default: 100)

Anchor Parameters

num_layers: Number of detection layers (default: 4)
min_scale: Minimum anchor scale (default: 0.1484375)
max_scale: Maximum anchor scale (default: 0.75)
aspect_ratios: Anchor aspect ratios (default: [1.0])

Next Steps

After anchor generation and palm detection, the pipeline proceeds to:

Function 2: Non-Maximum Suppression and rotation calculation
Quality Assessment: Validate detection reliability
Temporal Consistency: Track detections across frames
