Anchor Generation and Palm Detection
Overview
The first stage of the Hand Detector model performs anchor generation and palm detection using the SSD (Single Shot Detector) approach. It identifies candidate hand regions in each camera frame by scoring a fixed grid of pre-generated anchor boxes and decoding the box offsets predicted for each one.
Purpose and Functionality
This function handles the core detection logic:
- Anchor Generation: Creates the anchor grid from the SSD anchor options (layers, strides, scales, aspect ratios)
- Model Inference: Processes camera frames through the detection network
- Raw Output Processing: Handles the model's raw detection outputs
- Initial Filtering: Applies basic confidence thresholds
Implementation Details
SSD Anchor Options Configuration
# MEDIAPIPE EXACT CONFIGURATION
options = SSDAnchorOptions(
    num_layers=4,
    min_scale=0.1484375,
    max_scale=0.75,
    input_size_height=192,
    input_size_width=192,
    anchor_offset_x=0.5,
    anchor_offset_y=0.5,
    strides=[8, 16, 16, 16],
    aspect_ratios=[1.0],
    reduce_boxes_in_lowest_layer=False,
    interpolated_scale_aspect_ratio=1.0,
    fixed_anchor_size=True
)
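SSDAnchorOptions is used above without being defined in this section. One minimal way to make the configuration runnable, assuming the type is nothing more than a plain record of the twelve fields shown, is a namedtuple. The definition below is an illustrative assumption, not necessarily how the project declares it.

```python
from collections import namedtuple

# Assumed definition: a plain record with the fields used in this section.
SSDAnchorOptions = namedtuple("SSDAnchorOptions", [
    "num_layers", "min_scale", "max_scale",
    "input_size_height", "input_size_width",
    "anchor_offset_x", "anchor_offset_y",
    "strides", "aspect_ratios",
    "reduce_boxes_in_lowest_layer",
    "interpolated_scale_aspect_ratio",
    "fixed_anchor_size",
])

options = SSDAnchorOptions(
    num_layers=4,
    min_scale=0.1484375,
    max_scale=0.75,
    input_size_height=192,
    input_size_width=192,
    anchor_offset_x=0.5,
    anchor_offset_y=0.5,
    strides=[8, 16, 16, 16],
    aspect_ratios=[1.0],
    reduce_boxes_in_lowest_layer=False,
    interpolated_scale_aspect_ratio=1.0,
    fixed_anchor_size=True,
)

# One stride entry is expected per detection layer.
print(len(options.strides) == options.num_layers)
```

A namedtuple keeps the options immutable and lets the generation code read fields by attribute (options.strides, options.min_scale), which is how they are accessed in this section.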
Anchor Generation Algorithm
from math import ceil, sqrt

def calculate_scale(min_scale, max_scale, stride_index, num_strides):
    # Linearly interpolate the scale between min_scale and max_scale
    if num_strides == 1:
        return (min_scale + max_scale) / 2
    else:
        return min_scale + (max_scale - min_scale) * stride_index / (num_strides - 1)

def generate_anchors(options):
    anchors = []
    layer_id = 0
    n_strides = len(options.strides)
    while layer_id < n_strides:
        anchor_height = []
        anchor_width = []
        aspect_ratios = []
        scales = []
        # Consecutive layers with the same stride are merged into one grid
        last_same_stride_layer = layer_id
        while last_same_stride_layer < n_strides and \
                options.strides[last_same_stride_layer] == options.strides[layer_id]:
            scale = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer, n_strides)
            if last_same_stride_layer == 0 and options.reduce_boxes_in_lowest_layer:
                # The first layer can use a reduced, predefined set of anchors
                aspect_ratios += [1.0, 2.0, 0.5]
                scales += [0.1, scale, scale]
            else:
                aspect_ratios += options.aspect_ratios
                scales += [scale] * len(options.aspect_ratios)
                if options.interpolated_scale_aspect_ratio > 0:
                    # Extra anchor at the geometric mean of this scale and the next
                    if last_same_stride_layer == n_strides - 1:
                        scale_next = 1.0
                    else:
                        scale_next = calculate_scale(options.min_scale, options.max_scale, last_same_stride_layer + 1, n_strides)
                    scales.append(sqrt(scale * scale_next))
                    aspect_ratios.append(options.interpolated_scale_aspect_ratio)
            last_same_stride_layer += 1

        for i, r in enumerate(aspect_ratios):
            ratio_sqrts = sqrt(r)
            anchor_height.append(scales[i] / ratio_sqrts)
            anchor_width.append(scales[i] * ratio_sqrts)

        stride = options.strides[layer_id]
        feature_map_height = ceil(options.input_size_height / stride)
        feature_map_width = ceil(options.input_size_width / stride)

        # Emit every anchor of a cell before moving on to the next cell
        for y in range(feature_map_height):
            for x in range(feature_map_width):
                for anchor_id in range(len(anchor_height)):
                    x_center = (x + options.anchor_offset_x) / feature_map_width
                    y_center = (y + options.anchor_offset_y) / feature_map_height
                    if options.fixed_anchor_size:
                        new_anchor = [x_center, y_center, 1.0, 1.0]
                    else:
                        new_anchor = [x_center, y_center, anchor_width[anchor_id], anchor_height[anchor_id]]
                    anchors.append(new_anchor)

        layer_id = last_same_stride_layer
    return anchors
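As a worked example of the scale interpolation, the sketch below evaluates calculate_scale for the four layers of the configuration in this section, together with the geometric-mean scale used for the extra interpolated anchor. The constant names and the two result lists are illustrative names introduced here.

```python
from math import sqrt

def calculate_scale(min_scale, max_scale, stride_index, num_strides):
    # Same linear interpolation as in the algorithm above
    if num_strides == 1:
        return (min_scale + max_scale) / 2
    return min_scale + (max_scale - min_scale) * stride_index / (num_strides - 1)

MIN_SCALE, MAX_SCALE, NUM_STRIDES = 0.1484375, 0.75, 4

# One base scale per layer, evenly spaced from MIN_SCALE to MAX_SCALE
layer_scales = [
    calculate_scale(MIN_SCALE, MAX_SCALE, i, NUM_STRIDES)
    for i in range(NUM_STRIDES)
]

# The interpolated anchor uses sqrt(scale * next_scale); the last layer
# pairs with a "next" scale of 1.0
next_scales = layer_scales[1:] + [1.0]
interpolated_scales = [sqrt(s * n) for s, n in zip(layer_scales, next_scales)]

for i, (s, m) in enumerate(zip(layer_scales, interpolated_scales)):
    print(f"layer {i}: scale={s:.4f} interpolated={m:.4f}")
```

Note that with fixed_anchor_size=True the generated anchors all carry width and height 1.0, so these scales only influence the number of anchors per cell, not their stored size.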
Anchor Generation System
SSD Anchor Configuration
The anchor generation follows MediaPipe's configuration with specific parameters:
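- Layer Configurations: Four detection layers with strides 8, 16, 16 and 16 for multi-scale detection
- Scale Ranges: Scales interpolated from min_scale (0.1484375) to max_scale (0.75), intended to cover hand sizes from close-up to distant
- Aspect Ratios: A single aspect ratio of 1.0 (square anchors), plus one interpolated-scale anchor per layer
- Anchor Density: 2016 total anchor boxes across all layers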
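The total of 2016 anchors can be checked by hand from the configuration: each layer contributes two anchors per grid cell (aspect ratio 1.0 plus the interpolated-scale anchor), and consecutive layers that share a stride share one grid. The short sketch below redoes that arithmetic; the constant names are illustrative.

```python
from itertools import groupby
from math import ceil

INPUT_SIZE = 192
STRIDES = [8, 16, 16, 16]
ANCHORS_PER_LAYER = 2  # aspect ratio 1.0 + interpolated-scale anchor

total_anchors = 0
# groupby merges consecutive equal strides, like the generation loop does
for stride, group in groupby(STRIDES):
    layers = len(list(group))
    cells = ceil(INPUT_SIZE / stride) ** 2
    per_cell = layers * ANCHORS_PER_LAYER
    total_anchors += cells * per_cell
    print(f"stride {stride}: {cells} cells x {per_cell} anchors")

print(total_anchors)  # 24*24*2 + 12*12*6 = 1152 + 864 = 2016
```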
Anchor Generation Process
def generate_anchors():
    """
    Generate SSD anchors for hand detection
    Returns:
        List of anchor boxes with coordinates and properties
    """
    anchors = []
    # Define anchor configuration
    anchor_options = {
        'num_layers': 4,
        'min_scale': 0.1484375,
        'max_scale': 0.75,
        'input_size_height': 192,
        'input_size_width': 192,
        'anchor_offset_x': 0.5,
        'anchor_offset_y': 0.5,
        'strides': [8, 16, 16, 16],
        'aspect_ratios': [1.0],
        'reduce_boxes_in_lowest_layer': False,
        'interpolated_scale_aspect_ratio': 1.0,
        'fixed_anchor_size': True
    }
    # Generate anchors for each layer
    # (generate_layer_anchors is a per-layer helper not shown in this section;
    # the anchor order must match generate_anchors(options) above, which merges
    # same-stride layers, so that anchors line up with the model outputs)
    for layer_id in range(anchor_options['num_layers']):
        layer_anchors = generate_layer_anchors(layer_id, anchor_options)
        anchors.extend(layer_anchors)
    return anchors
Bounding Box Decoding
Raw Output Processing
The model outputs raw detection data that needs to be decoded:
def decode_bboxes(raw_boxes, raw_scores, anchors):
    """
    Decode bounding boxes from model outputs using anchors
    Args:
        raw_boxes: Raw bounding box predictions [1, 2016, 18]
        raw_scores: Raw confidence scores [1, 2016, 1]
        anchors: Pre-generated anchor boxes
    Returns:
        List of HandRegion objects with decoded information
    """
    detections = []
    # Apply sigmoid activation to scores
    scores = sigmoid(raw_scores)
    # Process each anchor box
    for i, anchor in enumerate(anchors):
        confidence = scores[0, i, 0]
        # Filter by confidence threshold
        if confidence > DETECTION_THRESHOLD:
            # Decode bounding box coordinates
            bbox = decode_single_bbox(raw_boxes[0, i], anchor)
            # Extract palm keypoints
            keypoints = extract_palm_keypoints(raw_boxes[0, i, 4:])
            # Create HandRegion object
            hand_region = HandRegion(
                bbox=bbox,
                confidence=confidence,
                keypoints=keypoints
            )
            detections.append(hand_region)
    return detections
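The helpers decode_single_bbox and extract_palm_keypoints are not shown in this section. The sketch below is one plausible vectorized version of both, under stated assumptions: the 18 values per anchor are laid out as [x_center, y_center, w, h] followed by 7 keypoint (x, y) pairs, offsets are expressed in input pixels (hence the scale of 192), and the output box is [x, y, w, h] with (x, y) as the top-left corner. The function name decode_boxes and its parameters are hypothetical.

```python
import numpy as np

def decode_boxes(raw_boxes, anchors, scale=192.0):
    """Decode (N, 18) raw predictions against (N, 4) anchors.

    anchors are [x_center, y_center, w, h] in normalized coordinates.
    Returns (bboxes, keypoints) with shapes (N, 4) and (N, 7, 2).
    """
    raw = np.asarray(raw_boxes, dtype=np.float32)
    anc = np.asarray(anchors, dtype=np.float32)

    # Offsets are scaled to [0, 1] and applied relative to the anchor
    centers = raw[:, 0:2] / scale * anc[:, 2:4] + anc[:, 0:2]
    sizes = raw[:, 2:4] / scale * anc[:, 2:4]

    # Assumed output layout: top-left corner plus width and height
    top_left = centers - sizes / 2
    bboxes = np.concatenate([top_left, sizes], axis=1)

    # Remaining 14 values: 7 keypoints decoded like the box centre
    keypoints = raw[:, 4:].reshape(-1, 7, 2) / scale
    keypoints = keypoints * anc[:, None, 2:4] + anc[:, None, 0:2]
    return bboxes, keypoints

# Usage: a single centred anchor with a predicted 96x96-pixel box
example_raw = np.zeros((1, 18), dtype=np.float32)
example_raw[0, 2:4] = 96.0
example_bboxes, example_kps = decode_boxes(example_raw, [[0.5, 0.5, 1.0, 1.0]])
print(example_bboxes[0])
```

Decoding all anchors at once and filtering afterwards avoids the per-anchor Python loop; whether that matters depends on how many anchors pass the confidence threshold.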
Palm Keypoint Extraction
The detection includes palm keypoints for rotation calculation:
- Wrist Keypoint: Base reference point for hand orientation
- Middle Finger MCP: Second reference point for rotation calculation
- Coordinate System: Normalized coordinates relative to bounding box
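To illustrate how two keypoints can define a hand orientation, the sketch below computes the angle that would bring the wrist-to-middle-finger-MCP vector upright. The actual rotation is computed in the next stage, so this is only an assumed formulation (a 90 degree target angle, image y axis pointing down), and the function names are hypothetical.

```python
from math import atan2, floor, pi

def normalize_radians(angle):
    # Wrap an angle into the range [-pi, pi)
    return angle - 2 * pi * floor((angle + pi) / (2 * pi))

def palm_rotation(wrist, middle_mcp):
    """Rotation (radians) that aligns wrist -> middle MCP with 'up'."""
    x0, y0 = wrist
    x1, y1 = middle_mcp
    # Image y grows downward, so the y difference is negated
    angle = atan2(-(y1 - y0), x1 - x0)
    return normalize_radians(pi / 2 - angle)

# An upright hand: middle finger MCP directly above the wrist
upright = palm_rotation((0.5, 0.8), (0.5, 0.4))
# A hand pointing to the right in the image
sideways = palm_rotation((0.2, 0.5), (0.6, 0.5))
print(upright, sideways)
```

With this convention an upright hand yields a rotation of zero and a hand pointing right yields a quarter turn.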
Key Processing Steps
1. Frame Preprocessing
- Resize: Input frame resized to 192x192 pixels
- Normalization: Pixel values normalized to model requirements
- Batch Formation: Single frame formatted as batch input
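The preprocessing steps above can be sketched with NumPy alone. This version makes several assumptions that the section does not confirm: nearest-neighbour resizing without letterboxing, scaling pixel values to [0, 1], and a batch layout of (1, height, width, channels). A real pipeline would likely use an image library for resizing; preprocess_frame is a hypothetical name.

```python
import numpy as np

def preprocess_frame(frame, size=192):
    """Turn an HxWx3 uint8 frame into a (1, size, size, 3) float32 batch."""
    height, width = frame.shape[:2]

    # Nearest-neighbour resize: pick one source row/column per target pixel
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = frame[rows[:, None], cols]

    # Assumed normalization: map 0..255 to 0.0..1.0
    normalized = resized.astype(np.float32) / 255.0

    # Add the leading batch dimension
    return normalized[np.newaxis, ...]

# Usage with a synthetic 640x480 white frame
demo_frame = np.full((480, 640, 3), 255, dtype=np.uint8)
demo_batch = preprocess_frame(demo_frame)
print(demo_batch.shape, demo_batch.dtype)
```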
2. Model Inference
- Forward Pass: Frame processed through SSD network
- Output Extraction: Raw detections and scores obtained
- Memory Management: Efficient handling of model outputs
3. Score Processing
- Sigmoid Activation: Applied to raw confidence scores
- Threshold Filtering: Remove low-confidence detections
- Score Ranking: Sort detections by confidence
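A compact sketch of the three score-processing steps, using the default threshold (0.5) and detection cap (100) listed under Configuration Parameters. The function name select_candidates is hypothetical, and clipping the raw scores before the sigmoid is an added guard against overflow rather than something this section specifies.

```python
import numpy as np

def select_candidates(raw_scores, threshold=0.5, max_detections=100):
    """Return (anchor_indices, scores) sorted by descending confidence."""
    logits = np.asarray(raw_scores, dtype=np.float64).reshape(-1)

    # Sigmoid activation; clipping keeps np.exp from overflowing
    scores = 1.0 / (1.0 + np.exp(-np.clip(logits, -100.0, 100.0)))

    # Threshold filtering: keep anchors strictly above the threshold
    keep = np.flatnonzero(scores > threshold)

    # Score ranking: highest confidence first, capped at max_detections
    order = keep[np.argsort(-scores[keep])][:max_detections]
    return order, scores[order]

# Usage: four anchors in the model's [1, N, 1] score layout
demo_scores = np.array([[[0.0], [2.0], [-3.0], [5.0]]])
demo_idx, demo_conf = select_candidates(demo_scores)
print(demo_idx, demo_conf)
```

The returned indices can be used to look up the matching rows in the raw box tensor and the anchor list, so only surviving candidates need to be decoded.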
4. Bounding Box Decoding
- Anchor Mapping: Raw outputs mapped to anchor coordinates
- Coordinate Transformation: Convert to image coordinate system
- Size Calculation: Compute bounding box dimensions
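For the coordinate transformation step, a minimal sketch of mapping a normalized [x, y, w, h] box back to pixel coordinates follows. It assumes the frame was stretched directly to the model input (no padding to undo) and that (x, y) is the top-left corner; to_pixel_bbox is a hypothetical helper name.

```python
def to_pixel_bbox(bbox, frame_width, frame_height):
    """Convert a normalized [x, y, w, h] box to integer pixel values."""
    x, y, w, h = bbox

    # Clamp both corners to the frame before scaling
    left = min(max(x, 0.0), 1.0) * frame_width
    top = min(max(y, 0.0), 1.0) * frame_height
    right = min(max(x + w, 0.0), 1.0) * frame_width
    bottom = min(max(y + h, 0.0), 1.0) * frame_height

    return (round(left), round(top),
            round(right - left), round(bottom - top))

# Usage: a centred box on a 640x480 frame
demo_pixel_box = to_pixel_bbox([0.25, 0.25, 0.5, 0.5], 640, 480)
print(demo_pixel_box)
```

Clamping matters because decoded boxes can extend slightly past the frame edges when a hand is partly out of view.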
Output Format
HandRegion Object Structure
class HandRegion:
    def __init__(self, bbox, confidence, keypoints):
        self.bbox = bbox              # [x, y, w, h] in normalized coordinates
        self.confidence = confidence  # Detection confidence score
        self.keypoints = keypoints    # Palm keypoints for rotation
        self.rotation = None          # Calculated in next stage
        self.rect_points = None       # Rotated rectangle points
Detection Metadata
- Bounding Box: [x, y, width, height] in normalized coordinates
- Confidence Score: Float value between 0.0 and 1.0
- Palm Keypoints: Wrist and middle finger MCP coordinates
- Processing Time: Inference latency for performance monitoring
Integration Points
Input Interface
- Video Capture: Receives frames from camera or video file
- Frame Buffer: Manages input frame queue for processing
- Preprocessing: Handles frame preparation and formatting
Output Interface
- Detection List: Provides list of detected hand regions
- Quality Metrics: Includes confidence and processing statistics
- Next Stage: Feeds into Non-Maximum Suppression function
Configuration Parameters
Detection Settings
- detection_threshold: Minimum confidence for valid detection (default: 0.5)
- input_size: Model input resolution (192x192)
- max_detections: Maximum number of detections to process (default: 100)
Anchor Parameters
- num_layers: Number of detection layers (default: 4)
- min_scale: Minimum anchor scale (default: 0.1484375)
- max_scale: Maximum anchor scale (default: 0.75)
- aspect_ratios: Anchor aspect ratios (default: [1.0])
Next Steps
After anchor generation and palm detection, the pipeline proceeds to:
- Function 2: Non-Maximum Suppression and rotation calculation
- Quality Assessment: Validate detection reliability
- Temporal Consistency: Track detections across frames