Model 2 - Hand Landmark Detection

Overview

Once Model 1 has located a hand, the Hand Landmark Detection model recovers its detailed structure. It is the most computationally intensive model in the pipeline and predicts the 3D coordinates of 21 key hand landmarks from a cropped image of a single hand.

Model Specifications

Model File: hand_landmarks_detector.xml
Architecture: Custom CNN with attention mechanisms optimized for landmark detection
Purpose: Predict precise 3D coordinates of 21 key hand landmarks from cropped hand regions
Computational Intensity: Most resource-intensive model in the pipeline

Figure: Function diagram for the Model 2 processing pipeline.

Input/Output Specifications

Input Requirements

Format: [1, 224, 224, 3] - A cropped and warped image of a single hand
Resolution: 224x224 pixels (standardized input size)
Source: Generated using the rect_points from Model 1 (Hand Detector)
Preprocessing: Affine transformation and image warping from original frame

Output Format

The model produces four output tensors:

Identity: [1, 63]
- The 21 landmarks as pixel coordinates [x, y, z] relative to the 224x224 input crop
- 63 values total (21 landmarks × 3 coordinates each)
Identity_1: [1, 1]
- Confidence score indicating the presence of a hand in the crop
- Single float value representing hand presence probability
Identity_2: [1, 1]
- Confidence score for handedness classification (left vs. right hand)
- Single float value for left/right hand determination
Identity_3: [1, 63]
- World landmarks providing 3D coordinates in meters
- Real-world spatial coordinates for depth-aware applications
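
The four tensors above can be unpacked into a more convenient structure before further processing. The following is a minimal sketch, assuming the outputs arrive as NumPy arrays with the listed shapes; the names HandLandmarkResult and parse_outputs are hypothetical and do not come from the project code.

```python
from dataclasses import dataclass

import numpy as np

NUM_LANDMARKS = 21


@dataclass
class HandLandmarkResult:
    """Hypothetical container for the four model outputs."""
    landmarks_px: np.ndarray     # (21, 3) x, y, z relative to the 224x224 crop
    presence: float              # hand presence score
    handedness: float            # handedness score
    landmarks_world: np.ndarray  # (21, 3) world landmarks in meters


def parse_outputs(identity, identity_1, identity_2, identity_3):
    """Reshape the flat [1, 63] tensors to (21, 3) and unwrap the [1, 1] scores."""
    return HandLandmarkResult(
        landmarks_px=np.asarray(identity, dtype=np.float32).reshape(NUM_LANDMARKS, 3),
        presence=float(np.asarray(identity_1).reshape(-1)[0]),
        handedness=float(np.asarray(identity_2).reshape(-1)[0]),
        landmarks_world=np.asarray(identity_3, dtype=np.float32).reshape(NUM_LANDMARKS, 3),
    )
```

After this step, row i of landmarks_px holds the [x, y, z] values of landmark i, which matches the point numbering used in the rest of this page.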

21-Point Hand Landmark Model

Landmark Structure

The model detects landmarks following the MediaPipe hand model specification:

Wrist (1 point)

Point 0: Wrist center - base reference point

Thumb (4 points)

Point 1: CMC joint (base) - thumb base connection
Point 2: MCP joint - thumb metacarpophalangeal joint
Point 3: IP joint - thumb interphalangeal joint
Point 4: Thumb tip - thumb fingertip

Index Finger (4 points)

Point 5: MCP joint (base) - index finger base
Point 6: PIP joint - proximal interphalangeal joint
Point 7: DIP joint - distal interphalangeal joint
Point 8: Index tip - index fingertip

Middle Finger (4 points)

Point 9: MCP joint (base) - middle finger base
Point 10: PIP joint - proximal interphalangeal joint
Point 11: DIP joint - distal interphalangeal joint
Point 12: Middle tip - middle fingertip

Ring Finger (4 points)

Point 13: MCP joint (base) - ring finger base
Point 14: PIP joint - proximal interphalangeal joint
Point 15: DIP joint - distal interphalangeal joint
Point 16: Ring tip - ring fingertip

Pinky (4 points)

Point 17: MCP joint (base) - pinky base
Point 18: PIP joint - proximal interphalangeal joint
Point 19: DIP joint - distal interphalangeal joint
Point 20: Pinky tip - pinky fingertip
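
The numbering above is easier to use in code when it is given names. The sketch below encodes the 21 indices as an enumeration and groups them by finger; the identifiers HandLandmark, FINGERS, and finger_chain are illustrative choices, not names taken from the project.

```python
from enum import IntEnum


class HandLandmark(IntEnum):
    """Landmark indices in the order listed above."""
    WRIST = 0
    THUMB_CMC = 1
    THUMB_MCP = 2
    THUMB_IP = 3
    THUMB_TIP = 4
    INDEX_MCP = 5
    INDEX_PIP = 6
    INDEX_DIP = 7
    INDEX_TIP = 8
    MIDDLE_MCP = 9
    MIDDLE_PIP = 10
    MIDDLE_DIP = 11
    MIDDLE_TIP = 12
    RING_MCP = 13
    RING_PIP = 14
    RING_DIP = 15
    RING_TIP = 16
    PINKY_MCP = 17
    PINKY_PIP = 18
    PINKY_DIP = 19
    PINKY_TIP = 20


# Each finger as a base-to-tip sequence of landmark indices.
FINGERS = {
    "thumb": (1, 2, 3, 4),
    "index": (5, 6, 7, 8),
    "middle": (9, 10, 11, 12),
    "ring": (13, 14, 15, 16),
    "pinky": (17, 18, 19, 20),
}

# The last index of every finger is its fingertip.
FINGERTIPS = tuple(chain[-1] for chain in FINGERS.values())


def finger_chain(name):
    """Return the wrist followed by the joints of one finger, base to tip."""
    return (int(HandLandmark.WRIST),) + FINGERS[name]
```

Because IntEnum members behave like integers, HandLandmark.INDEX_TIP can be used directly to index a (21, 3) landmark array.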

Processing Pipeline

The Hand Landmark Detection model runs in two stages that turn detected hand regions into landmark coordinates:

Region Rectification and Cropping (Function 1): Image warping and hand region extraction
Landmark Detection and Smoothing (Function 2): 21-point detection with temporal stabilization

Technical Implementation

Key Functions & Logic

1. Image Warping

Function: warp_rect_img in hand_landmark.py
Process: Applies an affine transformation to the original full-resolution camera frame
Result: "Cuts out" the hand and straightens it, creating 224x224 input
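
To illustrate the warping step, the sketch below derives the 2x3 affine matrix that maps a rotated hand rectangle onto the 224x224 crop. It is not the actual warp_rect_img implementation. It assumes that rect_points holds the rectangle corners in the order top-left, top-right, bottom-right, bottom-left as seen in the upright hand; the real ordering in the project may differ. The helper names rect_to_crop_matrix and apply_affine are hypothetical.

```python
import numpy as np

CROP_SIZE = 224


def rect_to_crop_matrix(rect_points, size=CROP_SIZE):
    """Solve for the 2x3 affine matrix taking three rect corners to crop corners.

    Assumed corner order: top-left, top-right, bottom-right, (bottom-left).
    Three point pairs fully determine an affine transform.
    """
    src = np.asarray(rect_points, dtype=np.float64)[:3]
    dst = np.array([[0, 0], [size, 0], [size, size]], dtype=np.float64)
    # Each row [x, y, 1] @ M.T must equal the destination point.
    a = np.hstack([src, np.ones((3, 1))])
    return np.linalg.solve(a, dst).T


def apply_affine(matrix, points):
    """Map (N, 2) frame points into crop coordinates with a 2x3 matrix."""
    pts = np.asarray(points, dtype=np.float64)
    return pts @ matrix[:, :2].T + matrix[:, 2]
```

The resulting matrix has the 2x3 shape that OpenCV expects, so a call such as cv2.warpAffine(frame, matrix, (224, 224)) could produce the crop itself. The same matrix, inverted, can map predicted landmarks back into the original frame.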

2. Model Inference

Function: _process_landmarks_and_gestures in gesture_engine.py
Process: The warped hand image is fed into the hand_landmarks_detector model
Output: Raw landmark coordinates and confidence scores

3. Post-processing

Function: lm_postprocess in hand_landmark.py
Process: Normalizes pixel landmarks to [0, 1] range
Features: Extracts handedness and presence scores
Smoothing: Applies Exponential Moving Average (EMA) for temporal stability

Coordinate Systems

Pixel Coordinates (Identity Output)

X-axis: Left to right within 224x224 crop (0 to 224)
Y-axis: Top to bottom within 224x224 crop (0 to 224)
Z-axis: Depth relative to the wrist (negative values are closer to the camera)
Normalization: Converted to [0, 1] range during post-processing
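
The normalization step amounts to dividing by the crop size. The sketch below is an illustration rather than the actual lm_postprocess code, and it assumes that z is scaled by the same factor as x and y. Since z is measured relative to the wrist, it can stay negative after scaling, and only x and y are expected to land in [0, 1].

```python
import numpy as np

CROP_SIZE = 224.0


def normalize_landmarks(landmarks_px, crop_size=CROP_SIZE):
    """Scale crop-pixel landmarks so that x and y fall in [0, 1].

    Assumption: z uses the same scale as x and y, so it is divided
    by the crop size as well.
    """
    lm = np.asarray(landmarks_px, dtype=np.float32).reshape(-1, 3)
    return lm / np.float32(crop_size)
```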

World Coordinates (Identity_3 Output)

Units: Real-world coordinates in meters
Reference: 3D spatial positioning; in the MediaPipe convention, world landmarks take the approximate geometric center of the hand as their origin
Applications: Depth-aware gesture recognition and AR/VR integration
Precision: Millimeter-level accuracy for close-range detection
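
Because world landmarks are metric, distances between them can be compared against physical thresholds. As a small example, the sketch below measures the thumb-to-index pinch distance in meters from a (21, 3) world landmark array. The function name pinch_distance_m is hypothetical, and any threshold applied to the result would need tuning on real data.

```python
import numpy as np

THUMB_TIP = 4
INDEX_TIP = 8


def pinch_distance_m(world_landmarks):
    """Euclidean distance in meters between the thumb tip and the index tip."""
    lm = np.asarray(world_landmarks, dtype=np.float64).reshape(21, 3)
    return float(np.linalg.norm(lm[THUMB_TIP] - lm[INDEX_TIP]))
```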

Model Architecture Details

Network Design

Input Layer: 224x224x3 RGB image tensor
Feature Extraction: Multi-scale convolutional layers
Attention Mechanisms: Focus on hand regions and joint locations
Regression Heads: Separate heads for landmarks, presence, and handedness
Output Processing: Multi-task learning with shared feature extraction

Key Features

Sub-pixel Accuracy: Landmark coordinates are regressed as continuous values, so localization is not limited to whole pixels
Rotation Invariance: Handles various hand orientations through input warping
Scale Normalization: Consistent output regardless of input hand size
Multi-task Learning: Simultaneous landmark detection and hand classification

Temporal Stabilization

Exponential Moving Average (EMA) Smoothing

Purpose: Reduces jitter and maintains smooth landmark trajectories
Implementation: Applied when a matching hand region is found in the previous frame
IoU Matching: Uses Intersection over Union to match hands across frames
Adaptive Smoothing: Smoothing strength adapts based on motion speed
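
The matching and smoothing logic described above can be sketched as follows. This is an illustration under several assumptions, not the project's implementation: boxes are axis-aligned (x0, y0, x1, y1) tuples, the smoothing factor weights the new observation (smoothed = alpha * new + (1 - alpha) * previous), and the adaptive, speed-dependent part is left out. The names iou, ema_update, and smooth_hand are hypothetical.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def ema_update(prev, new, alpha=0.3):
    """One EMA step; alpha is the weight given to the new observation."""
    return alpha * new + (1.0 - alpha) * prev


def smooth_hand(prev_box, prev_lm, box, lm, alpha=0.3, iou_threshold=0.3):
    """Smooth landmarks only when the hand matches the previous frame's hand.

    Without a previous hand, or when the boxes overlap too little, the new
    landmarks are returned unchanged so that smoothing restarts.
    """
    lm = np.asarray(lm, dtype=np.float32)
    if prev_box is None or prev_lm is None:
        return lm
    if iou(prev_box, box) < iou_threshold:
        return lm
    return ema_update(np.asarray(prev_lm, dtype=np.float32), lm, alpha)
```

A smaller alpha gives steadier landmarks at the cost of more lag; an adaptive variant would raise alpha when the hand moves quickly.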

Frame-to-Frame Consistency

Tracking: Maintains landmark identity across video frames
Interpolation: Handles temporary occlusions and detection gaps
Quality Control: Validates landmark consistency and anatomical constraints

Integration with Pipeline

Input Interface

Hand Regions: Receives rect_points from Model 1 (Hand Detector)
Original Frame: Accesses full-resolution camera frame for warping
Configuration: Processing parameters and quality thresholds

Output Interface

Landmark Data: Provides 21-point hand structure to Model 3
Quality Metrics: Hand presence and confidence scores
Handedness: Left/right hand classification
World Coordinates: 3D spatial data for advanced applications

Configuration Parameters

Detection Settings

landmark_confidence: Minimum confidence threshold for landmarks (default: 0.5)
presence_threshold: Hand presence detection threshold (default: 0.7)
handedness_threshold: Left/right classification threshold (default: 0.5)

Smoothing Parameters

temporal_smoothing: EMA smoothing factor (default: 0.3)
iou_threshold: Frame matching threshold (default: 0.3)
max_smoothing_frames: Maximum frames for smoothing history (default: 5)

Processing Options

world_coordinates: Enable world coordinate output (default: true)
anatomical_validation: Enable pose validation (default: true)
batch_processing: Process multiple hands simultaneously (default: false)
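
One way to keep these parameters together is a small configuration object. The sketch below mirrors the documented names and defaults in a frozen dataclass with basic range checks. The class LandmarkConfig and the helper hand_is_present are hypothetical; how the project actually loads and stores its configuration is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandmarkConfig:
    """Hypothetical holder for the documented parameters and defaults."""
    # Detection settings
    landmark_confidence: float = 0.5
    presence_threshold: float = 0.7
    handedness_threshold: float = 0.5
    # Smoothing parameters
    temporal_smoothing: float = 0.3
    iou_threshold: float = 0.3
    max_smoothing_frames: int = 5
    # Processing options
    world_coordinates: bool = True
    anatomical_validation: bool = True
    batch_processing: bool = False

    def __post_init__(self):
        # Thresholds and the EMA factor are treated as fractions in [0, 1].
        for name in ("landmark_confidence", "presence_threshold",
                     "handedness_threshold", "temporal_smoothing",
                     "iou_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.max_smoothing_frames < 1:
            raise ValueError("max_smoothing_frames must be at least 1")


def hand_is_present(presence_score, config):
    """Accept a crop only if its presence score reaches the threshold."""
    return presence_score >= config.presence_threshold
```

With the defaults, a crop whose presence score is 0.6 would be rejected, while one at 0.8 would be passed on to gesture processing.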

Next Steps

With detailed hand landmarks established, the pipeline proceeds to:

Model 3 (Gesture Embedder): Generate compact feature representations
Gesture Classification: Identify specific hand poses and movements
Temporal Analysis: Track dynamic gesture patterns over time
Real-time Applications: Enable immediate gesture-based interactions
