# Landmark Detection and Smoothing

## Overview
Function 2 of the Hand Landmark Detection model performs the core landmark prediction and applies temporal smoothing for frame-to-frame stability. It takes the warped 224x224 hand crops produced by Function 1 and predicts the 21 hand landmarks with sub-pixel accuracy.
## Purpose and Functionality
This function handles the critical landmark processing tasks:
- Model Inference: Predicts 21 hand landmarks within the cropped region
- Coordinate Normalization: Converts pixel coordinates to normalized space
- Temporal Smoothing: Applies Exponential Moving Average (EMA) for stability
- Quality Assessment: Validates landmark accuracy and confidence
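The EMA mentioned above is the standard recurrence `s_t = alpha * x_t + (1 - alpha) * s_{t-1}`: each output blends the new measurement with the previous smoothed value. A minimal sketch on a 1-D coordinate track (values and `alpha` chosen only for illustration):

```python
def ema(values, alpha=0.6):
    """Exponential moving average: blend each new sample with the
    previous smoothed value; higher alpha trusts the new sample more."""
    smoothed = []
    prev = None
    for x in values:
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

# A jittery coordinate track: smoothing damps the spike at index 2
# (0.70 becomes 0.6248) at the cost of a little lag.
track = [0.50, 0.52, 0.70, 0.53]
smoothed_track = ema(track)
```

The same blend is applied per-coordinate to all 21 landmarks in the main processing function below.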
## Landmark Normalization

### Coordinate Normalization Implementation

```python
def normalize_landmarks_correctly(landmarks_px, image_width=224, image_height=224, normalize_z_factor=1.0):
    """Follow MediaPipe's exact normalization from C++ source.

    landmarks_px is a list/array of [x_px, y_px, z_px] coordinates.
    """
    normalized_landmarks = []
    for landmark in landmarks_px:
        x, y, z = landmark[:3]
        norm_x = x / image_width
        norm_y = y / image_height
        norm_z = (z / image_width) / normalize_z_factor  # Z uses width and an optional normalization factor
        normalized_landmarks.append([norm_x, norm_y, norm_z])
    return normalized_landmarks
```
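For example, a landmark at pixel (112, 56) with z = 11.2 inside the 224x224 crop normalizes to x = 0.5, y = 0.25, z = 11.2 / 224 = 0.05 (the function is repeated compactly here so the snippet runs standalone):

```python
def normalize_landmarks_correctly(landmarks_px, image_width=224,
                                  image_height=224, normalize_z_factor=1.0):
    # Condensed repeat of the function above, for a self-contained demo.
    return [[x / image_width, y / image_height,
             (z / image_width) / normalize_z_factor]
            for x, y, z in (lm[:3] for lm in landmarks_px)]

# x halves, y quarters, and z is divided by the crop *width*.
normalized = normalize_landmarks_correctly([[112.0, 56.0, 11.2]])
```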
## EMA Smoothing Algorithm

### IoU Calculation for Tracking

```python
def calculate_iou(box1, box2):
    """
    Calculates Intersection over Union (IoU) between two bounding boxes.
    Box format: [x_min, y_min, width, height] (normalized coordinates)
    """
    x1_min, y1_min, w1, h1 = box1
    x1_max, y1_max = x1_min + w1, y1_min + h1
    x2_min, y2_min, w2, h2 = box2
    x2_max, y2_max = x2_min + w2, y2_min + h2
    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)
    inter_w = max(0, inter_x_max - inter_x_min)
    inter_h = max(0, inter_y_max - inter_y_min)
    intersection_area = inter_w * inter_h
    area1 = w1 * h1
    area2 = w2 * h2
    union_area = area1 + area2 - intersection_area
    if union_area == 0:
        return 0.0
    return intersection_area / union_area
```
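As a worked example, two half-width boxes offset by half their size overlap in a quarter-size square, giving IoU = 0.0625 / (0.25 + 0.25 - 0.0625) = 1/7 ≈ 0.143 (the function is repeated compactly here so the snippet runs standalone):

```python
def calculate_iou(box1, box2):
    # Condensed repeat of the function above, for a self-contained demo.
    x1, y1, w1, h1 = box1
    x2, y2, w2, h2 = box2
    iw = max(0.0, min(x1 + w1, x2 + w2) - max(x1, x2))
    ih = max(0.0, min(y1 + h1, y2 + h2) - max(y1, y2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    return inter / union if union else 0.0

# Overlapping boxes: IoU = 1/7, well below a typical matching threshold,
# so a jump this large between frames would not be treated as the same hand.
iou = calculate_iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.5, 0.5])
```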
## Main Processing Function

### Complete Landmark Processing with EMA

```python
import numpy as np

def lm_postprocess(current_region, inference, previous_frame_regions,
                   alpha=0.6, iou_threshold=0.4, crop_width=224, crop_height=224):
    """Process landmark model output, applying EMA smoothing."""
    # Hand presence score - Identity_1
    current_region.lm_score = float(np.squeeze(inference['Identity_1']))
    # Handedness (left/right) - Identity_2
    current_region.handedness = float(np.squeeze(inference['Identity_2']))
    # Use 'Identity' (output 0) for pixel coordinates from the 224x224 crop
    pixel_landmarks_raw = np.squeeze(inference['Identity'])  # Shape (63,)
    # Reshape to (21, 3) for [x, y, z]
    pixel_landmarks_reshaped = pixel_landmarks_raw.reshape(21, 3)
    # Apply correct normalization
    current_raw_normalized_landmarks = normalize_landmarks_correctly(
        pixel_landmarks_reshaped,
        image_width=crop_width,
        image_height=crop_height
    )

    # --- EMA Smoothing ---
    best_match_prev_region = None
    max_iou = 0.0
    if previous_frame_regions and hasattr(current_region, 'pd_box'):
        for prev_region in previous_frame_regions:
            if hasattr(prev_region, 'pd_box'):
                # decode_bboxes stores pd_box as [x_center - w*0.5, y_center - h*0.5, w, h],
                # i.e. [x_min, y_min, w, h], which is the format calculate_iou expects.
                iou = calculate_iou(current_region.pd_box, prev_region.pd_box)
                if iou > max_iou:
                    max_iou = iou
                    best_match_prev_region = prev_region

    final_landmarks_to_set = list(current_raw_normalized_landmarks)  # Default to raw
    if (best_match_prev_region and max_iou >= iou_threshold
            and hasattr(best_match_prev_region, 'landmarks')
            and best_match_prev_region.landmarks is not None
            and len(best_match_prev_region.landmarks) == len(current_raw_normalized_landmarks)):
        prev_smoothed_landmarks = best_match_prev_region.landmarks
        temp_smoothed_ema = []
        for i in range(len(current_raw_normalized_landmarks)):
            smooth_pt = [
                alpha * current_raw_normalized_landmarks[i][k] + (1 - alpha) * prev_smoothed_landmarks[i][k]
                for k in range(3)  # x, y, z
            ]
            temp_smoothed_ema.append(smooth_pt)
        final_landmarks_to_set = temp_smoothed_ema

    current_region.landmarks = final_landmarks_to_set
    return current_region
```
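The matching-then-blending policy at the heart of `lm_postprocess` can be sketched standalone with plain dicts as stand-in regions (condensed re-implementations for illustration, not the pipeline's actual classes):

```python
def iou(b1, b2):
    # [x_min, y_min, w, h] boxes, same convention as calculate_iou above.
    iw = max(0.0, min(b1[0] + b1[2], b2[0] + b2[2]) - max(b1[0], b2[0]))
    ih = max(0.0, min(b1[1] + b1[3], b2[1] + b2[3]) - max(b1[1], b2[1]))
    inter = iw * ih
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def smooth_against_previous(curr, prev_regions, alpha=0.6, iou_threshold=0.4):
    """curr and prev_regions entries are dicts with 'box' and 'landmarks' keys."""
    best = max(prev_regions, key=lambda p: iou(curr['box'], p['box']), default=None)
    if best is None or iou(curr['box'], best['box']) < iou_threshold:
        return curr['landmarks']  # no stable match: keep the raw landmarks
    # Per-coordinate EMA blend against the matched previous region.
    return [[alpha * c + (1 - alpha) * p for c, p in zip(cl, pl)]
            for cl, pl in zip(curr['landmarks'], best['landmarks'])]

# Nearly-identical boxes match with high IoU, so the landmark is blended:
# 0.6 * 0.5 + 0.4 * 0.4 = 0.46 for both x and y.
curr = {'box': [0.10, 0.1, 0.4, 0.4], 'landmarks': [[0.5, 0.5, 0.0]]}
prev = {'box': [0.12, 0.1, 0.4, 0.4], 'landmarks': [[0.4, 0.4, 0.0]]}
smoothed = smooth_against_previous(curr, [prev])
```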
## Palm Detection Decision Logic

### Tracking Quality Assessment

```python
def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    """Determine if palm detection is needed."""
    if not tracked_regions:
        return True  # No existing tracking, need to detect.
    for region in tracked_regions:
        if not hasattr(region, 'lm_score') or region.lm_score < lm_score_threshold:
            return True  # A region has low landmark score or no score, re-detect.
    return False  # All tracked regions have good scores.
```
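The decision rule is easy to exercise with lightweight stand-in regions (`SimpleNamespace` is only a convenience here; the real pipeline uses its own region objects, and the function is repeated compactly so the snippet runs standalone):

```python
from types import SimpleNamespace

def should_run_palm_detection(tracked_regions, lm_score_threshold=0.7):
    # Condensed repeat of the function above, for a self-contained demo.
    if not tracked_regions:
        return True
    return any(not hasattr(r, 'lm_score') or r.lm_score < lm_score_threshold
               for r in tracked_regions)

print(should_run_palm_detection([]))                               # True: nothing tracked yet
print(should_run_palm_detection([SimpleNamespace(lm_score=0.9)]))  # False: tracking is healthy
print(should_run_palm_detection([SimpleNamespace(lm_score=0.9),
                                 SimpleNamespace(lm_score=0.4)]))  # True: one hand lost confidence
```

Because a single low-confidence region triggers re-detection for the whole frame, the expensive palm detector runs only when landmark tracking has actually degraded.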
## Performance Optimizations
### Batch Landmark Processing
```python
def batch_landmark_processing(warped_images, regions, landmark_model):
    """
    Process multiple hand images in one batched inference call for efficiency.

    Args:
        warped_images: List of 224x224 hand images
        regions: List of region objects (one per image) to receive the results
        landmark_model: Loaded landmark detection model

    Returns:
        List of processed region objects with landmarks attached
    """
    if not warped_images:
        return []

    # Prepare batch input
    batch_input = np.stack([
        preprocess_image_for_inference(img) for img in warped_images
    ])

    # Run batch inference
    batch_outputs = landmark_model.infer(batch_input)

    # Post-process each result with the single-hand pipeline
    results = []
    for i, region in enumerate(regions):
        # Slice out this image's outputs and repackage them as a per-item
        # inference dict matching lm_postprocess's expected interface
        inference_i = {
            'Identity': batch_outputs['Identity'][i],
            'Identity_1': batch_outputs['Identity_1'][i],
            'Identity_2': batch_outputs['Identity_2'][i],
            'Identity_3': batch_outputs['Identity_3'][i],  # 3D world coordinates
        }
        processed = lm_postprocess(region, inference_i, previous_frame_regions=None)
        results.append(processed)
    return results
```
## Configuration Parameters

### Detection Settings
- `landmark_confidence_threshold`: Minimum confidence for landmarks (default: 0.5)
- `hand_presence_threshold`: Hand presence detection threshold (default: 0.7)
- `anatomical_validation`: Enable pose validation (default: true)

### Smoothing Parameters
- `ema_smoothing_factor`: Base smoothing strength (default: 0.7)
- `iou_matching_threshold`: Threshold for frame matching (default: 0.3)
- `motion_adaptation`: Enable motion-adaptive smoothing (default: true)
- `max_smoothing_frames`: Maximum frames for smoothing history (default: 5)

### Quality Control
- `enable_world_coordinates`: Process 3D world coordinates (default: true)
- `strict_anatomical_validation`: Enforce strict anatomical constraints (default: false)
- `interpolate_missing_landmarks`: Fill low-confidence landmarks (default: true)
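In code, these settings might be collected into a single defaults dict with an override hook (a sketch; the key names mirror the settings above, but the grouping and `load_config` helper are assumptions, not the pipeline's actual API):

```python
# Hypothetical defaults dict mirroring the documented settings.
DEFAULT_CONFIG = {
    # Detection settings
    "landmark_confidence_threshold": 0.5,
    "hand_presence_threshold": 0.7,
    "anatomical_validation": True,
    # Smoothing parameters
    "ema_smoothing_factor": 0.7,
    "iou_matching_threshold": 0.3,
    "motion_adaptation": True,
    "max_smoothing_frames": 5,
    # Quality control
    "enable_world_coordinates": True,
    "strict_anatomical_validation": False,
    "interpolate_missing_landmarks": True,
}

def load_config(overrides=None):
    """Merge user overrides over the documented defaults."""
    return {**DEFAULT_CONFIG, **(overrides or {})}
```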
## Output Format

### ProcessedLandmarks Structure

```python
class ProcessedLandmarks:
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks                        # Original space coordinates
        self.normalized_landmarks = normalized_landmarks  # [0,1] normalized
        self.presence_score = presence_score              # Hand presence confidence
        self.handedness_score = handedness_score          # Handedness confidence
        self.handedness = handedness_classification       # 'left' or 'right'
        self.world_landmarks = None                       # 3D world coordinates
        self.quality_score = None                         # Overall quality metric
        self.smoothing_applied = False                    # EMA smoothing flag
        self.processing_time = None                       # Processing latency
```
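For instance, a single high-confidence right-hand result might be wrapped like this (the scores and dummy landmarks are illustrative, and the class is repeated from above so the snippet runs standalone):

```python
class ProcessedLandmarks:
    # Repeated from above so this example runs standalone.
    def __init__(self, landmarks, normalized_landmarks, presence_score,
                 handedness_score, handedness_classification):
        self.landmarks = landmarks
        self.normalized_landmarks = normalized_landmarks
        self.presence_score = presence_score
        self.handedness_score = handedness_score
        self.handedness = handedness_classification
        self.world_landmarks = None
        self.quality_score = None
        self.smoothing_applied = False
        self.processing_time = None

# 21 dummy landmarks at the crop centre, tagged as a confident right hand.
result = ProcessedLandmarks(
    landmarks=[[112.0, 112.0, 0.0]] * 21,
    normalized_landmarks=[[0.5, 0.5, 0.0]] * 21,
    presence_score=0.95,
    handedness_score=0.88,
    handedness_classification='right',
)
```

The optional fields (`world_landmarks`, `quality_score`, `processing_time`) stay `None` until later stages fill them in.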
## Integration Points

### Input Interface
- Warped Images: Receives 224x224 hand images from Function 1
- Transform Matrices: Uses coordinate transformation data
- Previous Landmarks: Accesses previous frame data for smoothing
### Output Interface
- Landmark Coordinates: Provides 21-point hand structure
- Confidence Scores: Reports detection and handedness confidence
- Quality Metrics: Supplies processing quality information
- Next Stage: Feeds into Model 3 (Gesture Embedder)
## Next Steps
After landmark detection and smoothing, the pipeline proceeds to:
- Model 3 (Gesture Embedder): Generate compact feature representations from landmarks
- Coordinate Transformation: Convert landmarks to appropriate coordinate systems
- Temporal Tracking: Maintain landmark identity across video frames
- Quality Assessment: Monitor landmark stability and accuracy for downstream processing