what is ai screen guidance?

ai screen guidance is the idea that an AI should be able to see what a user is looking at — their actual browser tab, their actual UI — and respond with spoken instructions in real time. not a help article. not a chatbot. a voice in their ear saying "click the gear icon in the top right, then billing."

the idea borrows from aviation. a copilot doesn't fly the plane. they watch the instruments, track conditions, and call out what the pilot might miss. ai screen guidance works the same way: watching, narrating, keeping the user on course without taking control away from them.

the problem

users get stuck in software. they hit a wall, can't figure out where something is, and have two options: keep fumbling around, or call support. most call support.

the support agent picks up, listens to a vague description of the problem, and says "can you share your screen?" then both parties spend ten minutes installing a screen-share tool, waiting for a link to load, and finally arriving at the same view the user has had open the entire time.

"where is the export button?" is a 30-second question that takes 40 minutes to answer because the infrastructure around answering it is broken.

this is the typical support ticket: not a bug, not a complaint, just a user who couldn't find a button. these tickets make up the majority of support volume at most saas companies, and they're expensive, repetitive, and entirely preventable.

how it works

ai screen guidance runs in three steps. once the screen is shared, the time from a captured frame to spoken guidance is under three seconds.

first, the user shares their screen via webrtc. no downloads, no scheduling, no support agent on the line. they click a button embedded in your product, choose which tab to share, and that's it. the whole thing lives in the browser.
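a minimal sketch of that first step, assuming the standard browser screen capture api (navigator.mediaDevices.getDisplayMedia). the function names and the 2 fps default are illustrative assumptions, not aside's actual widget code, and sending the captured track onward over a webrtc connection is left out.

```javascript
// Build capture constraints for a tab share.
// `displaySurface: "browser"` only hints that a browser tab is preferred;
// the user still picks what to share in the browser's own dialog.
function buildShareConstraints(framesPerSecond = 2) {
  return {
    video: { displaySurface: "browser", frameRate: { max: framesPerSecond } },
    audio: false,
  };
}

// Ask the browser for a screen-share stream. Browsers require this to be
// triggered by a user gesture such as a button click.
// `mediaDevices` is injectable so the function can be exercised outside a browser.
async function startTabShare(mediaDevices = globalThis.navigator?.mediaDevices) {
  if (!mediaDevices || typeof mediaDevices.getDisplayMedia !== "function") {
    throw new Error("screen capture is not supported in this environment");
  }
  return mediaDevices.getDisplayMedia(buildShareConstraints());
}
```

in a page, this would be wired to the share button's click handler, and the returned stream attached to a video element or a peer connection.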

second, a vision LLM receives the screen frames. not a screenshot. a live feed. the model sees what's on the screen right now, in context, the same way a human looking over the user's shoulder would. it understands which app is open, what state the user is in, and what they appear to be trying to do.
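a live feed still reaches the model as individual frames. one way to keep that manageable, sketched below, is to forward a frame only when enough time has passed and the screen has actually changed. the sampler, its 500 ms default, and the fingerprint argument are all assumptions for illustration; nothing here describes how aside selects frames.

```javascript
// Decide which captured frames are worth sending to the vision model.
// A frame is forwarded only if (a) at least `minIntervalMs` has elapsed since
// the last forwarded frame and (b) its fingerprint differs from the last one.
// The fingerprint can be any cheap summary of the frame, e.g. a hash of a
// downscaled copy; computing it is out of scope here.
function createFrameSampler({ minIntervalMs = 500 } = {}) {
  let lastSentAt = -Infinity;
  let lastFingerprint = null;

  return function shouldSend(nowMs, fingerprint) {
    const tooSoon = nowMs - lastSentAt < minIntervalMs;
    const unchanged = fingerprint === lastFingerprint;
    if (tooSoon || unchanged) return false;
    lastSentAt = nowMs;
    lastFingerprint = fingerprint;
    return true;
  };
}

const shouldSend = createFrameSampler({ minIntervalMs: 500 });
console.log(shouldSend(0, "settings-page"));   // true: first frame
console.log(shouldSend(200, "billing-page"));  // false: too soon
console.log(shouldSend(600, "settings-page")); // false: nothing changed
console.log(shouldSend(700, "billing-page"));  // true: changed, enough time passed
```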

third, the guidance is spoken via text-to-speech. not typed. not displayed as a tooltip. spoken — so the user doesn't have to shift their attention from the screen to read instructions. they hear "click the gear icon in the top right" and can follow along immediately.
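in the browser, the simplest stand-in for this step is the built-in web speech api. a production system would more likely stream audio from a dedicated tts service, so treat this as a sketch. speakGuidance is a hypothetical helper name.

```javascript
// Speak one instruction aloud, replacing anything still being spoken so the
// user never hears stale guidance. `synth` and `Utterance` default to the
// browser's Web Speech API and are injectable for use outside a browser.
function speakGuidance(
  text,
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  if (!synth || !Utterance) {
    throw new Error("speech synthesis is not available in this environment");
  }
  const utterance = new Utterance(text);
  utterance.rate = 1.0; // normal speaking rate
  synth.cancel();       // drop any queued or in-progress speech
  synth.speak(utterance);
  return utterance;
}
```

calling speakGuidance("click the gear icon in the top right") in a supporting browser reads the instruction aloud while the user keeps their eyes on the screen.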

the entire integration

<script src="https://cdn.aside.ai/widget.js" data-workspace="your-id"></script>

one script tag. no sdk, no npm install, no webhooks.
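for the curious: a script loaded this way can find its own configuration through the data-* attribute on its tag. the sketch below shows the general browser mechanism (document.currentScript and dataset). it is an assumption about how such a widget could bootstrap, not a description of what widget.js actually does, and readWorkspaceId is a hypothetical name.

```javascript
// Read the workspace id from the <script> tag that loaded this file.
// In a browser, pass document.currentScript (only set while a classic script
// is first executing). data-workspace="your-id" surfaces as dataset.workspace.
function readWorkspaceId(scriptElement) {
  const id = scriptElement?.dataset?.workspace?.trim();
  if (!id) {
    throw new Error("missing data-workspace attribute on the widget <script> tag");
  }
  return id;
}
```

inside the widget file, the call would look like readWorkspaceId(document.currentScript).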

who it's for

three teams reach for ai screen guidance first, and they reach for it for different reasons.

support teams use it to deflect tickets. when a user is guided through their question in real time, they never file a ticket. deflection rates of 30–40% are common in the first month. the cost math is simple: every deflected ticket is agent time you don't pay for.
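to make the cost math concrete, here is a back-of-the-envelope calculation. the ticket volume and cost per ticket are made-up inputs; only the 30–40% deflection range comes from the text above.

```javascript
// Estimated monthly savings from deflected tickets.
// deflectionRate is a fraction between 0 and 1.
function monthlySavings(ticketsPerMonth, deflectionRate, costPerTicket) {
  if (deflectionRate < 0 || deflectionRate > 1) {
    throw new RangeError("deflectionRate must be between 0 and 1");
  }
  const deflected = Math.round(ticketsPerMonth * deflectionRate);
  return deflected * costPerTicket;
}

// Hypothetical team: 2,000 "where is X?" tickets a month at $15 each.
const low = monthlySavings(2000, 0.3, 15);  // 600 tickets deflected
const high = monthlySavings(2000, 0.4, 15); // 800 tickets deflected
console.log(`estimated savings: $${low} to $${high} per month`);
// prints "estimated savings: $9000 to $12000 per month"
```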

product teams use it to improve onboarding. instead of tooltip carousels that break on every deploy, the AI watches where new users go and talks them through activation flows in plain english.

sales teams use it for self-serve demos. prospects can explore the product on their own, with the AI explaining what they're looking at. no sales rep required for a first look.

why now

the category didn't exist two years ago because the underlying technology wasn't ready. vision LLMs — models that can look at an image and understand what's on screen — crossed a quality threshold in 2025. they went from "interesting demo" to "reliable enough for production."

latency came down at the same time. earlier models took 4–8 seconds to analyze a screen frame. that's too slow — users need guidance before they give up and click away. current models respond in under a second. combined with webrtc and streaming TTS, the total time from screen analysis to spoken guidance is now under three seconds.
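one way to reason about that number is as a budget split across stages. the stage names and millisecond figures below are illustrative assumptions, not measurements; the only figures taken from the text are the sub-second model response and the three-second total.

```javascript
// Sum per-stage latencies and compare the total against a budget.
function checkLatencyBudget(stagesMs, budgetMs = 3000) {
  const totalMs = Object.values(stagesMs).reduce((sum, ms) => sum + ms, 0);
  return { totalMs, budgetMs, withinBudget: totalMs < budgetMs };
}

// Hypothetical breakdown of one guidance round trip.
const result = checkLatencyBudget({
  captureAndUpload: 300, // grab a frame and send it
  visionModel: 900,      // "under a second" per the text
  ttsFirstAudio: 400,    // time until streamed speech starts playing
});
console.log(result); // { totalMs: 1600, budgetMs: 3000, withinBudget: true }
```

with the older 4–8 second model times plugged into visionModel, the same check fails on that stage alone, which is the point of the paragraph above.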

the cost also dropped. running a vision model over a continuous screen feed would have cost a dollar per session in 2023. today it's a few cents. that makes real-time screen guidance economically viable at scale — you don't need to reserve it for VIP accounts.

real-time screen understanding is now fast enough, smart enough, and cheap enough for production use cases. the window to build on top of it before it's commoditized is open right now.
