This pipeline will:


✅ Take screenshots continuously
✅ Keep a history of actions + a system goal
✅ Decide the next step based on current screen + history + goal
✅ Show a translucent overlay bubble on screen with step-by-step guidance
This will feel like a real “on-screen tutor” without any hackathon hacks.

🏗️ Single Straightforward Pipeline for SkillForge

🌟 Big Picture Workflow


📸 Screenshot → 🎯 Context Extraction → 🧠 Next Action Reasoning → 🖱 UI Overlay

Each stage uses specific GPU-optimized tools.

🚀 Full Pipeline Step-by-Step

✅ 1. Screenshot Capture (Current Screen State)

- **What?** Capture the user's current screen (full screen or focused app window)
- **Tool:** `mss` (fast screen capture for Python)
- **Alternative:** `PyAutoGUI.screenshot()` if a fallback is needed
- **Frequency:** Every ~1 second (configurable, avoid overloading the GPU)
- **Output:** `current_screen.png`
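
A minimal capture loop with `mss` could look like this (the interval constant and output filename just mirror the defaults above):

```python
# Grab the primary monitor once per second and save it for the pipeline.
import time
import mss
import mss.tools

CAPTURE_INTERVAL = 1.0  # seconds; keep configurable to avoid overloading the GPU

with mss.mss() as sct:
    monitor = sct.monitors[1]      # primary monitor (index 0 is the full virtual screen)
    while True:
        shot = sct.grab(monitor)   # raw BGRA pixels
        mss.tools.to_png(shot.rgb, shot.size, output="current_screen.png")
        time.sleep(CAPTURE_INTERVAL)
```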
✅ 2. Context Extraction (What’s on Screen?)

- **What?** Detect UI elements (buttons, menus, text) and OCR their labels
- **Object Detection:** YOLOv8m (the medium model runs well on an RTX 3050 at ~30 FPS)
- **OCR:** PaddleOCR (GPU-accelerated, more accurate than EasyOCR)
- **Output:** JSON of detected elements, for example:

```json
{
  "elements": [
    {"type": "button", "label": "File", "position": [x1, y1, x2, y2]},
    {"type": "textfield", "label": "Untitled Project"}
  ]
}
```
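
A sketch of this stage, assuming `yolov8m_ui.pt` is a hypothetical YOLOv8m checkpoint fine-tuned on UI elements (the stock COCO weights don't know classes like "button" or "menu"):

```python
from ultralytics import YOLO
from paddleocr import PaddleOCR

detector = YOLO("yolov8m_ui.pt")                # hypothetical UI-tuned weights
ocr = PaddleOCR(use_angle_cls=True, lang="en")  # GPU build picks up CUDA automatically

def extract_context(image_path: str) -> dict:
    elements = []
    result = detector(image_path)[0]  # one image in -> one Results object out
    for box, cls_id in zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist()):
        elements.append({
            "type": result.names[int(cls_id)],
            "position": [int(v) for v in box],  # [x1, y1, x2, y2]
        })
    # Add OCR'd text so the reasoning step sees labels, not just boxes.
    for line in (ocr.ocr(image_path, cls=True)[0] or []):
        box, (text, _conf) = line
        elements.append({"type": "text", "label": text, "position": box})
    return {"elements": elements}
```

A fuller version would associate each OCR'd string with the detected box it falls inside, producing labeled elements exactly like the JSON above; here the two outputs are simply concatenated to keep the sketch short.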

✅ 3. State & History Tracker (What’s Been Done So Far?)

- **What?** Maintain a history of user actions (clicks, inputs, etc.)
- **Tool:** Lightweight JSON file or SQLite DB (stores each detected click/action)
- **Data Example:**

```json
{
  "goal": "Design Instagram Post",
  "history": ["Opened Figma", "Selected Text Tool", "Typed headline text"]
}
```

This acts as short-term memory for reasoning.
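
A minimal SQLite version of the tracker (table name and helper functions are illustrative):

```python
import sqlite3

conn = sqlite3.connect("skillforge_state.db")
conn.execute("""CREATE TABLE IF NOT EXISTS history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    goal TEXT NOT NULL,
    action TEXT NOT NULL,
    ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")

def log_action(goal: str, action: str) -> None:
    conn.execute("INSERT INTO history (goal, action) VALUES (?, ?)", (goal, action))
    conn.commit()

def get_history(goal: str) -> list[str]:
    rows = conn.execute(
        "SELECT action FROM history WHERE goal = ? ORDER BY id", (goal,))
    return [r[0] for r in rows]

log_action("Design Instagram Post", "Opened Figma")
log_action("Design Instagram Post", "Selected Text Tool")
print(get_history("Design Instagram Post"))
```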

✅ 4. Reasoning Engine (What Should Happen Next?)

- **What?** Decide the next action using the goal + history + current screen
- **Model:** LLaMA 2-13B (or 7B); on an RTX 3050, run 7B 4-bit quantized (FP16 weights alone are ~14 GB and won’t fit in its VRAM)
- **Framework:** `transformers` + `accelerate` (or `llama.cpp` for ultra-light GGUF)
- **Prompt Template:**

```
System: You are a professional design tutor helping the user create an Instagram post.
Goal: [Goal here]
History: [Past steps here]
Current UI Elements: [JSON of detected UI]
Question: What should the user do next?
```

- **Example Output:** "Click the 'Align Center' button in the toolbar to center the text"
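
A minimal sketch of this call with `transformers`, assuming access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint and bitsandbytes for the 4-bit load:

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # gated; requires an accepted license on HF
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

def next_step(goal: str, history: list[str], ui_elements: dict) -> str:
    # Fill the prompt template from this section with live state.
    prompt = (
        "System: You are a professional design tutor helping the user "
        "create an Instagram post.\n"
        f"Goal: {goal}\n"
        f"History: {'; '.join(history)}\n"
        f"Current UI Elements: {json.dumps(ui_elements)}\n"
        "Question: What should the user do next?\nAnswer:"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]  # keep only the answer
    return tok.decode(new_tokens, skip_special_tokens=True).strip()
```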

✅ 5. UI Overlay (Show Instruction on Screen)

- **What?** Display a translucent bubble overlay on top of the detected UI element
- **Tool:** PyQt5 (native overlay for any OS window)
- **Alternative:** Electron.js (if browser-only workflow)
- **Features:**
  - Translucent bubbles (~70% opacity)
  - Arrow pointing to the UI element
  - Tooltip text: “Click here to align text”
  - Hotkey toggle: Ctrl+Shift+S (show/hide)
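
A minimal sketch of the bubble itself (the geometry values are placeholders; the arrow and hotkey toggle are left out for brevity):

```python
import sys
from PyQt5.QtCore import Qt
from PyQt5.QtWidgets import QApplication, QLabel

app = QApplication(sys.argv)

bubble = QLabel("Click here to align text")
# Frameless, always-on-top, no taskbar entry.
bubble.setWindowFlags(Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint | Qt.Tool)
bubble.setAttribute(Qt.WA_TranslucentBackground)
bubble.setStyleSheet(
    "background-color: rgba(30, 30, 30, 178);"  # alpha 178/255 ≈ 70% opacity
    "color: white; padding: 12px; border-radius: 10px; font-size: 14px;"
)
bubble.move(400, 300)  # position next to the detected UI element's box
bubble.show()

sys.exit(app.exec_())
```

Note that Qt shortcuts only fire while the overlay app has focus, so the Ctrl+Shift+S toggle would need a system-wide hotkey hook, e.g. via the `keyboard` library.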

📦 Technology Stack Summary

| Layer | Technology/Model | Why |
| --- | --- | --- |
| 📸 Screenshot | `mss` | Fast, low-latency screen capture |
| 🖼 Object Detection | YOLOv8m (Ultralytics) | GPU-optimized UI element detection |
| 📄 OCR | PaddleOCR (GPU-enabled) | Multi-language, fast, more accurate |
| 🧠 Reasoning | LLaMA 2-7B (4-bit quantized) | Local reasoning with history + goal |
| 🖱 Overlay | PyQt5 | Clean translucent bubbles & arrows |
| 📚 State Tracker | SQLite | Lightweight persistent memory |
🏃 End-to-End Pipeline Flow

1️⃣ Capture screenshot (`mss`) →
2️⃣ Detect UI elements (YOLOv8m + PaddleOCR) →
3️⃣ Append user actions to history (SQLite) →
4️⃣ Query LLaMA 2: *given the goal + history + current UI, what’s next?* →
5️⃣ Draw overlay bubble (PyQt5): “Click here to align text”
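
Tied together, the whole loop stays small. This sketch assumes the per-step snippets above are wrapped into the functions named here (`capture_screenshot` and `show_bubble` are hypothetical wrappers around the mss and PyQt5 code):

```python
# Skeleton main loop; each helper wraps one of the per-step sketches above.
import time

GOAL = "Design Instagram Post"

while True:
    capture_screenshot("current_screen.png")         # 1. mss
    context = extract_context("current_screen.png")  # 2. YOLOv8m + PaddleOCR
    history = get_history(GOAL)                      # 3. SQLite
    instruction = next_step(GOAL, history, context)  # 4. LLaMA 2
    show_bubble(instruction)                         # 5. PyQt5 overlay
    time.sleep(1.0)                                  # ~1 step per second
```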

⚡ Why This is Perfect for an RTX 3050

- YOLOv8m and PaddleOCR both run GPU-accelerated
- LLaMA 2-7B quantized to 4-bit fits in 6 GB of VRAM (see the back-of-envelope check below)
- The overlay system is lightweight and doesn’t block the UI
- The entire pipeline can hit a ~1 step per second refresh
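
The 4-bit figure is easy to sanity-check with rough arithmetic (weights only; the KV cache and CUDA overhead add on the order of another gigabyte at short contexts, still leaving headroom):

```python
# Back-of-envelope VRAM estimate for LLaMA 2-7B at 4-bit (weights only).
params = 7e9
bytes_per_param = 0.5                      # 4 bits = half a byte
weights_gb = params * bytes_per_param / 1024**3
print(f"~{weights_gb:.1f} GB of weights")  # ~3.3 GB, well under 6 GB
```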

🎯 Bonus: System Instruction Example

You are helping the user learn Figma by creating a “Summer Sale Poster”.
Guide them step-by-step, explain why each step is important, and use plain English.

🏆 Final Deliverable Look (for Judges)


🎥 Screen shows Figma → User selects Text Tool → AI overlay appears:
💬 “Type your headline here. Use bold fonts for visibility.”
Next step → Align text → Overlay:
💬 “Click Align Center to balance your design.”
