This pipeline will:
✅ Take screenshots continuously
✅ Keep a history of actions + a system goal
✅ Decide the next step based on current screen + history + goal
✅ Show a translucent overlay bubble on screen with step-by-step guidance
This will feel like a real “on-screen tutor” without any hackathon hacks.
🏗️ Single Straightforward Pipeline for SkillForge
🌟 Big Picture Workflow
📸 Screenshot → 🎯 Context Extraction → 🧠 Next Action Reasoning → 🖱 UI Overlay
Each stage uses specific GPU-optimized tools.
🚀 Full Pipeline Step-by-Step
✅ 1. Screenshot Capture (Current Screen State)
- What: Capture the user’s current screen (full screen or the focused app window)
- Tool: mss (fast screen capture for Python)
- Alternative: PyAutoGUI.screenshot() if a fallback is needed
- Frequency: every ~1 second (configurable; avoid overloading the GPU)
- Output: current_screen.png
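A minimal capture-loop sketch with mss; the 1-second interval and the `current_screen.png` filename follow the plan above.

```python
# Minimal screenshot loop sketch using mss.
import time
import mss
import mss.tools

def capture_screen(output_path: str = "current_screen.png") -> str:
    """Grab the primary monitor and save it as a PNG."""
    with mss.mss() as sct:
        monitor = sct.monitors[1]          # monitors[0] is the combined virtual screen
        shot = sct.grab(monitor)
        mss.tools.to_png(shot.rgb, shot.size, output=output_path)
    return output_path

if __name__ == "__main__":
    while True:
        capture_screen()
        time.sleep(1.0)                    # configurable refresh interval
```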
✅ 2. Context Extraction (What’s on Screen?)
- What: Detect UI elements (buttons, menus, text) and OCR their labels
- Object detection: YOLOv8m (medium model runs well on an RTX 3050 @ ~30 FPS)
- OCR: PaddleOCR (GPU-accelerated, better than EasyOCR)
- Output: JSON of detected elements
Example
```json
{
  "elements": [
    {"type": "button", "label": "File", "position": [x1, y1, x2, y2]},
    {"type": "textfield", "label": "Untitled Project"}
  ]
}
```
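A sketch of the context-extraction step under two assumptions: the YOLOv8m checkpoint is one fine-tuned on UI elements (the name `yolov8m-ui.pt` is a placeholder, not a published model — stock YOLOv8m is trained on COCO), and the PaddleOCR result format matches recent 2.x releases.

```python
# Context extraction sketch: YOLOv8 for element boxes + PaddleOCR for text.
import json
from ultralytics import YOLO
from paddleocr import PaddleOCR

detector = YOLO("yolov8m-ui.pt")                 # hypothetical UI-trained weights
ocr = PaddleOCR(use_angle_cls=True, lang="en")   # uses GPU if the GPU build is installed

def extract_context(image_path: str) -> dict:
    elements = []

    # 1) UI element detection (buttons, menus, text fields, ...)
    result = detector(image_path)[0]
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        elements.append({
            "type": result.names[int(box.cls[0])],
            "label": None,                        # optionally matched to OCR text below
            "position": [int(x1), int(y1), int(x2), int(y2)],
        })

    # 2) OCR pass for visible text; result layout varies slightly across versions
    for line in ocr.ocr(image_path, cls=True)[0] or []:
        bbox, (text, conf) = line
        xs = [p[0] for p in bbox]
        ys = [p[1] for p in bbox]
        elements.append({
            "type": "text",
            "label": text,
            "position": [int(min(xs)), int(min(ys)), int(max(xs)), int(max(ys))],
        })

    return {"elements": elements}

if __name__ == "__main__":
    print(json.dumps(extract_context("current_screen.png"), indent=2))
```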
✅ 3. State & History Tracker (What’s Been Done So Far?)
- What: Maintain a history of user actions (clicks, inputs, etc.)
- Tool: Lightweight JSON file or SQLite DB (stores each detected click/action)
- Data example:

```json
{
  "goal": "Design Instagram Post",
  "history": ["Opened Figma", "Selected Text Tool", "Typed headline text"]
}
```
This acts as short-term memory for reasoning.
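A minimal SQLite-backed tracker sketch; the table name and columns are illustrative, not a fixed schema from the plan.

```python
# Lightweight action-history store on SQLite.
import sqlite3

def init_db(path: str = "skillforge.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               goal TEXT NOT NULL,
               action TEXT NOT NULL,
               ts DATETIME DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def log_action(conn: sqlite3.Connection, goal: str, action: str) -> None:
    conn.execute("INSERT INTO history (goal, action) VALUES (?, ?)", (goal, action))
    conn.commit()

def get_history(conn: sqlite3.Connection, goal: str) -> list[str]:
    rows = conn.execute(
        "SELECT action FROM history WHERE goal = ? ORDER BY id", (goal,)
    ).fetchall()
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = init_db()
    log_action(conn, "Design Instagram Post", "Opened Figma")
    log_action(conn, "Design Instagram Post", "Selected Text Tool")
    print(get_history(conn, "Design Instagram Post"))
```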
✅ 4. Reasoning Engine (What Should Happen Next?)
- What: Decide the next action using goal + history + current screen
- Model: LLaMA 2-13B (or 7B); an RTX 3050 can run the 7B model 4-bit quantized
- Framework: transformers + accelerate (or llama.cpp for ultra-light GGUF models)
- Prompt template:

  System: You are a professional design tutor helping the user create an Instagram post.
  Goal: [goal here]
  History: [past steps here]
  Current UI Elements: [JSON of detected UI]
  Question: What should the user do next?

- Example output: "Click the 'Align Center' button in the toolbar to center the text"
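A sketch of the reasoning call using transformers with 4-bit quantization via bitsandbytes. It assumes access to the gated `meta-llama/Llama-2-7b-chat-hf` weights; any similar instruction-tuned chat model would slot in the same way.

```python
# Reasoning step sketch: prompt LLaMA 2-7B (4-bit) with goal + history + UI JSON.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"   # assumed available locally / via HF access

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",                        # keep the quantized weights on the GPU
)

PROMPT_TEMPLATE = """System: You are a professional design tutor helping the user create an Instagram post.
Goal: {goal}
History: {history}
Current UI Elements: {elements}
Question: What should the user do next? Answer in one short sentence."""

def next_step(goal: str, history: list[str], elements: dict) -> str:
    prompt = PROMPT_TEMPLATE.format(
        goal=goal, history="; ".join(history), elements=json.dumps(elements)
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Return only the newly generated text, not the echoed prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```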
✅ 5. UI Overlay (Show Instruction on Screen)
- What: Display a translucent bubble overlay on top of the detected UI element
- Tool: PyQt5 (native overlay window on any OS)
- Alternative: Electron.js (if the workflow is browser-only)
- Features:
  - Translucent bubbles (70% opacity)
  - Arrow pointing to the UI element
  - Tooltip text: “Click here to align text”
  - Hotkey toggle: Ctrl+Shift+S (show/hide)
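A minimal PyQt5 overlay sketch: a frameless, always-on-top, translucent bubble placed next to a target screen position. The arrow and the Ctrl+Shift+S hotkey from the feature list are omitted here for brevity.

```python
# Translucent guidance bubble sketch with PyQt5.
import sys
from PyQt5.QtCore import Qt
from PyQt5.QtGui import QPainter, QColor
from PyQt5.QtWidgets import QApplication, QLabel, QWidget

class GuidanceBubble(QWidget):
    def __init__(self, text: str, x: int, y: int):
        super().__init__()
        self.setWindowFlags(Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint | Qt.Tool)
        self.setAttribute(Qt.WA_TranslucentBackground)
        label = QLabel(text, self)
        label.setStyleSheet("color: white; padding: 12px; font-size: 14px;")
        label.adjustSize()
        self.resize(label.width(), label.height())
        self.move(x, y)                      # place the bubble near the target element

    def paintEvent(self, event):
        painter = QPainter(self)
        painter.setRenderHint(QPainter.Antialiasing)
        painter.setBrush(QColor(30, 30, 30, 180))   # ~70% opaque dark bubble
        painter.setPen(Qt.NoPen)
        painter.drawRoundedRect(self.rect(), 10, 10)

if __name__ == "__main__":
    app = QApplication(sys.argv)
    bubble = GuidanceBubble("Click here to align text", 600, 80)
    bubble.show()
    sys.exit(app.exec_())
```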
📦 Technology Stack Summary
| Layer | Technology/Model | Why |
| --- | --- | --- |
| 📸 Screenshot | mss | Fast, low-latency screen capture |
| 🖼 Object Detection | YOLOv8m (Ultralytics) | GPU-optimized UI element detection |
| 📄 OCR | PaddleOCR (GPU enabled) | Multi-language, fast, more accurate |
| 🧠 Reasoning | LLaMA 2-7B (4-bit quantized) | Local reasoning with history + goal |
| 🖱 Overlay | PyQt5 | Clean translucent bubbles & arrows |
| 📚 State Tracker | SQLite | Lightweight persistent memory |
🏃 End-to-End Pipeline Flow
1️⃣ Capture screenshot (mss) →
2️⃣ Detect UI elements (YOLOv8m + PaddleOCR) →
3️⃣ Append user actions to history (SQLite) →
4️⃣ Query LLaMA 2: “Given goal + history + current UI, what’s next?” →
5️⃣ Draw overlay bubble (PyQt5): “Click here to align text”
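A sketch of the full loop tying the pieces together. `capture_screen`, `extract_context`, `init_db`, `get_history`, `next_step`, and `GuidanceBubble` are the hypothetical helpers from the sketches above, so this only runs with those definitions in scope; the Qt event handling is also simplified.

```python
# End-to-end loop sketch: screenshot -> context -> history -> reasoning -> overlay.
import sys
import time
from PyQt5.QtWidgets import QApplication

GOAL = "Design Instagram Post"

def run_pipeline():
    app = QApplication(sys.argv)
    conn = init_db()
    bubble = None
    while True:
        shot = capture_screen()                           # 1. screenshot
        context = extract_context(shot)                   # 2. UI elements + OCR
        history = get_history(conn, GOAL)                 # 3. what has been done so far
        instruction = next_step(GOAL, history, context)   # 4. ask the LLM
        if bubble is not None:
            bubble.close()
        bubble = GuidanceBubble(instruction, 600, 80)     # 5. draw the overlay
        bubble.show()
        app.processEvents()
        time.sleep(1.0)                                   # ~1 step per second refresh
```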
⚡ Why This is Perfect for RTX 3050
- YOLOv8m + PaddleOCR run GPU-accelerated
- LLaMA 2-7B quantized to 4-bit fits in 6 GB of VRAM
- The overlay system is lightweight and doesn’t block the UI
- The entire pipeline can hit a ~1 step per second refresh
🎯 Bonus: System Instruction Example
You are helping the user learn Figma by creating a “Summer Sale Poster”.
Guide them step-by-step, explain why each step is important, and use plain English.
🏆 Final Deliverable Look (for Judges)
🎥 Screen shows Figma → User selects Text Tool → AI overlay appears:
💬 “Type your headline here. Use bold fonts for visibility.”
Next step → Align text → Overlay:
💬 “Click Align Center to balance your design.”