AutoVision: Building a Multi-Modal UI Automation Framework
How I built a UI automation framework with 11 spy modes, 80+ scripting functions, and advanced techniques like memory scanning, ETW tracing, and computer vision for detecting any UI element.
Traditional UI automation tools hit walls. Legacy apps ignore accessibility APIs. Games render custom UIs. Some applications actively block automation. This led me to build AutoVision, a UI automation framework that combines 11 different detection techniques to find any element on screen.
The Problem: One Size Doesn't Fit All
Each UI automation approach has blind spots:
| Technique | Limitation |
|---|---|
| UI Automation (UIA) | Modern apps only, no games |
| MSAA | Legacy, incomplete on new apps |
| Win32 | Only sees controls that have their own window handle (HWND); windowless controls are invisible |
| Image matching | Breaks on resolution/theme changes |
What if you could combine them all?
The Solution: Multi-Modal Detection
AutoVision implements 11 spy modes, each optimized for different scenarios:
Core Spy Modes
| Mode | Technology | Best For |
|---|---|---|
| UIA | UI Automation API | Modern WPF/UWP apps |
| MSAA | Active Accessibility | Legacy Win32 apps |
| Win32 | Window messages | Native controls |
| JAB | Java Access Bridge | Java applications |
Advanced Spy Modes
| Mode | Technology | Best For |
|---|---|---|
| WM_HOOK | SetWindowsHookEx | Real-time message interception |
| MEM_SCAN | ReadProcessMemory | Bypassing automation blockers |
| HID_DEVICE | Raw Input API | Hardware-level input capture |
| RENDER_HOOK | DirectX/GDI hooks | Games, custom renderers |
| KERNEL_TRACE | ETW tracing | Kernel-level visibility |
| VISION | OpenCV + OCR | Visual element detection |
| FUSION | All modes combined | Intelligent auto-selection |
Architecture
Native Core (C++)
Performance-critical element detection runs in native C++:
// ActionResult, ElementProperties, and ActionType are project types declared elsewhere.
class AutomationElement {
public:
    ActionResult Initialize();
    std::shared_ptr<AutomationElement> FindElementByPoint(int x, int y);
    std::vector<std::shared_ptr<AutomationElement>> GetChildren();
    ElementProperties GetProperties() const;
    ActionResult PerformAction(ActionType action);
private:
    IUIAutomationElement* m_uiaElement;  // UIA backend (UIAutomation.h)
    IAccessible* m_msaaElement;          // MSAA backend (oleacc.h)
    HWND m_hwnd;                         // Win32 window handle
};
Managed Interop (C#)
Business logic and scripting live in C#:
// Public surface only; method bodies are omitted for brevity.
public class NativeAutomationElement : IDisposable {
    public NativeElementProperties GetProperties();
    public List<NativeAutomationElement> GetChildren();
    public void Click();
    public void SetText(string value);
    public string GetText();
    public void Dispose();  // required by IDisposable
}
Scripting Engine
80+ built-in functions for automation scripts:
// Math: Sum, Average, Power, Sqrt, Sin, Cos...
// String: Trim, Replace, Substring, RegexMatch...
// DateTime: Today, AddDays, DaysBetween, FormatDate...
// Collections: Count, First, Last, Reverse...
// Example script
Set totalPrice = Sum(data.prices)
Set formattedDate = FormatDate(Today(), "yyyy-MM-dd")
Set isValid = RegexMatch(email, "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+$")
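To make the idea of a built-in function library concrete, here is a minimal C++ sketch of a name-to-function registry that a script interpreter could dispatch through. Everything in it (`FunctionRegistry`, `MakeMathRegistry`, the numbers-only argument type) is an assumption for illustration; the article does not show how AutoVision's scripting engine is actually implemented.

```cpp
#include <functional>
#include <map>
#include <numeric>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical registry: every name and signature here is an assumption for
// illustration, not AutoVision's actual scripting API.
using Args = std::vector<double>;
using ScriptFunction = std::function<double(const Args&)>;

class FunctionRegistry {
public:
    void Register(const std::string& name, ScriptFunction fn) {
        m_functions[name] = std::move(fn);
    }

    // Looks up a function by name and invokes it; unknown names throw.
    double Call(const std::string& name, const Args& args) const {
        auto it = m_functions.find(name);
        if (it == m_functions.end()) {
            throw std::runtime_error("Unknown function: " + name);
        }
        return it->second(args);
    }

private:
    std::map<std::string, ScriptFunction> m_functions;
};

// Registers two of the math functions named in the list above.
inline FunctionRegistry MakeMathRegistry() {
    FunctionRegistry registry;
    registry.Register("Sum", [](const Args& a) {
        return std::accumulate(a.begin(), a.end(), 0.0);
    });
    registry.Register("Average", [](const Args& a) {
        if (a.empty()) throw std::invalid_argument("Average of empty list");
        return std::accumulate(a.begin(), a.end(), 0.0) /
               static_cast<double>(a.size());
    });
    return registry;
}
```

With a table like this, a statement such as `Set totalPrice = Sum(data.prices)` reduces to evaluating the arguments and calling `registry.Call("Sum", values)`. A real engine would also need string, date, and collection value types, which this sketch leaves out.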
Key Features
1. Intelligent Fallback
When the primary mode fails, AutoVision automatically tries alternatives. Each mode's result gets a confidence score:
float CalculateConfidence(ModeResult result) {
    float score = 0.0f;
    // Has valid properties
    if (!string.IsNullOrEmpty(result.Element.Name)) score += 0.3f;
    if (result.Element.BoundingRectangle.IsValid()) score += 0.2f;
    // Response time (faster = better)
    if (result.ResponseTime < 50) score += 0.2f;
    // Mode-specific bonuses
    if (result.Mode == SpyMode.UIA) score += 0.2f;
    // With these weights the score tops out at 0.9, so the clamp is only a safeguard
    return Math.Min(score, 1.0f);
}
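The scoring function covers only half of the fallback story. The other half is the loop that walks the modes. Below is a minimal C++ sketch of one way that loop could work, assuming detectors are tried in priority order and a result is accepted once it reaches a threshold. The `Detector` type, the `FindWithFallback` name, and the 0.5 threshold are all hypothetical; only the weights are taken from the function above.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// All types below are hypothetical stand-ins, not AutoVision's real classes.
struct ModeResult {
    std::string mode;         // e.g. "UIA", "MSAA", "VISION"
    std::string elementName;  // empty if the mode could not name the element
    bool hasValidBounds = false;
    int responseTimeMs = 0;
};

// A detector returns nothing when its mode cannot see the element at all.
struct Detector {
    std::string mode;
    std::function<std::optional<ModeResult>(int x, int y)> detect;
};

// Same weights as the scoring function shown above.
inline float CalculateConfidence(const ModeResult& r) {
    float score = 0.0f;
    if (!r.elementName.empty()) score += 0.3f;
    if (r.hasValidBounds) score += 0.2f;
    if (r.responseTimeMs < 50) score += 0.2f;
    if (r.mode == "UIA") score += 0.2f;
    return score;
}

// Tries detectors in priority order. Returns the first result that reaches
// `threshold`; otherwise falls back to the best-scoring result seen, if any.
inline std::optional<ModeResult> FindWithFallback(
        const std::vector<Detector>& detectors, int x, int y,
        float threshold = 0.5f) {
    std::optional<ModeResult> best;
    float bestScore = -1.0f;
    for (const auto& detector : detectors) {
        std::optional<ModeResult> result = detector.detect(x, y);
        if (!result) continue;  // this mode failed; try the next one
        float score = CalculateConfidence(*result);
        if (score >= threshold) return result;
        if (score > bestScore) {
            bestScore = score;
            best = result;
        }
    }
    return best;
}
```

Returning early on the first confident result keeps the common case fast, since slower modes such as vision only run when the cheaper ones come back empty or weak.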
2. Real-Time Element Highlighting
Color-coded highlights show which spy mode found each element:
| Color | Mode |
|---|---|
| Red | UIA |
| Green | MSAA |
| Blue | Win32 |
| Orange | WM_HOOK |
| Gold | VISION |
| Rainbow | FUSION |
3. Session Recording & Replay
Record automation sessions with intelligent element re-finding:
- Multiple locator strategies per element (AutomationId, XPath, visual)
- Handles UI changes between recordings
- Exports to C# or Python test code
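Re-finding with multiple locator strategies can be sketched as a simple ordered search: each recorded element carries several locators, and replay uses the first one that still matches the current UI. The C++ below is an illustration under that assumption. `UiElement`, `Locator`, and `Refind` are hypothetical names, and a visual locator would need image data that this sketch omits.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; the article does not show the
// recorder's real data model.
struct UiElement {
    std::string automationId;
    std::string path;  // XPath-like location in the element tree
};

struct Locator {
    std::string strategy;  // "AutomationId", "XPath", "Visual", ...
    std::function<bool(const UiElement&)> matches;
};

struct LocatedElement {
    UiElement element;
    std::string strategyUsed;
};

// Tries each recorded locator in order against the current UI snapshot and
// returns the first element that any of them matches.
inline std::optional<LocatedElement> Refind(
        const std::vector<Locator>& locators,
        const std::vector<UiElement>& currentUi) {
    for (const auto& locator : locators) {
        for (const auto& element : currentUi) {
            if (locator.matches(element)) {
                return LocatedElement{element, locator.strategy};
            }
        }
    }
    return std::nullopt;
}
```

If a button's AutomationId changes between releases but its position in the tree does not, the AutomationId locator misses and the path locator still finds it. Recording which strategy succeeded makes it easy to flag locators that have gone stale.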
4. Memory-Safe Native Code
RAII patterns prevent memory leaks in long-running sessions:
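// Smart pointers for automatic cleanup
std::shared_ptr<AutomationElement> GetParent() {
    return std::make_shared<AutomationElement>(m_parentHandle);
}
// RAII wrapper for COM interfaces (simplified; production code can use
// Microsoft::WRL::ComPtr or CComPtr instead)
template <typename T>
class ComPtr {
public:
    explicit ComPtr(T* ptr = nullptr) : m_ptr(ptr) {}
    ~ComPtr() { if (m_ptr) m_ptr->Release(); }
    ComPtr(const ComPtr&) = delete;             // copying would Release twice
    ComPtr& operator=(const ComPtr&) = delete;
private:
    T* m_ptr;
};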
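The same pattern extends to non-COM resources such as window hooks or process handles. Here is a minimal, portable C++ sketch of a move-only guard that takes its cleanup function as a parameter. `ScopedResource` is a hypothetical name, not a class from AutoVision, and the sketch assumes the handle type is cheap to copy.

```cpp
#include <utility>

// Generic move-only RAII guard. `Releaser` is any callable that frees the
// resource; on Windows it might wrap CloseHandle or IUnknown::Release.
template <typename Handle, typename Releaser>
class ScopedResource {
public:
    ScopedResource(Handle handle, Releaser releaser)
        : m_handle(handle), m_releaser(std::move(releaser)), m_owns(true) {}

    ~ScopedResource() { Reset(); }

    // Copying is disabled so two guards never release the same handle.
    ScopedResource(const ScopedResource&) = delete;
    ScopedResource& operator=(const ScopedResource&) = delete;

    // Moving transfers ownership and disarms the source guard.
    ScopedResource(ScopedResource&& other) noexcept
        : m_handle(other.m_handle),
          m_releaser(std::move(other.m_releaser)),
          m_owns(std::exchange(other.m_owns, false)) {}

    Handle Get() const { return m_handle; }

    // Releases the resource early; safe to call more than once.
    void Reset() {
        if (m_owns) {
            m_releaser(m_handle);
            m_owns = false;
        }
    }

private:
    Handle m_handle;
    Releaser m_releaser;
    bool m_owns;
};
```

The ownership flag is what makes moves safe: after a move, only the destination guard calls the releaser, so the resource is freed exactly once however many times the guard is passed around.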
Performance Metrics
| Metric | Value |
|---|---|
| Element detection | < 100ms |
| Script parsing | 24/24 tests passing |
| Fuzzing inputs tested | 1000+ |
| Crash rate | 0% |
| Standard library functions | 80+ |
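A latency figure like the element detection number depends on how it is measured. As a hedged illustration (not the harness used for the table above), this C++ sketch times an arbitrary operation with `std::chrono::steady_clock` and reports the median over several runs. `MedianLatencyMs` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `operation` `iterations` times and returns the median latency in
// milliseconds (the upper median when the count is even). The median is less
// sensitive to one-off stalls than the mean.
inline double MedianLatencyMs(const std::function<void()>& operation,
                              int iterations) {
    if (iterations <= 0) return 0.0;
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iterations));
    for (int i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        operation();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Passing a lambda that wraps a single element lookup gives a per-call figure that can be compared across spy modes or across builds.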
Real-World Applications
Test Automation
- UI testing across Windows UI technologies (WPF/UWP, Win32, Java)
- Legacy application validation
- Accessibility compliance checking
RPA Integration
- Element detection for robotic process automation
- Visual verification of automation steps
- Handling non-standard UI controls
Quality Assurance
- Visual regression testing
- Performance profiling
- Automated accessibility audits
What I Learned
1. Native Code Still Matters
Performance-critical paths benefit enormously from C++. Element detection went from 200ms to 20ms by moving to native code.
2. Fallback Chains Beat Single Solutions
No single API covers all scenarios. The fusion approach provides 99%+ coverage.
3. Scripting Enables Adoption
A powerful scripting engine lets users customize without rebuilding. The 80+ functions cover most business logic needs.
4. Memory Safety Requires Discipline
COM interop and native handles need careful lifecycle management. RAII patterns made this manageable.
Future Directions
- Machine learning for fusion mode confidence scoring
- Cross-platform support via libui or Qt
- Cloud integration for distributed test execution
- Visual debugging with element tree visualization
UI automation shouldn't require knowing which API works for each application. AutoVision makes finding elements automatic.