Real-time audio and video understanding at the edge — validated in a surgical compliance challenge
⚡ The challenge: intelligence where the data is
Modern AI has made remarkable progress at processing audio and video. But nearly all of it assumes a reliable cloud connection, abundant remote compute, and the willingness to send sensitive data to external servers. On manufacturing floors, in operating theatres, in remote field operations, and in other real-world environments, these assumptions often fail.
⏱️ Latency matters when a procedural deviation is happening right now. 🔒 Privacy matters when audio and video capture clinical or proprietary information. 📡 Connectivity cannot be taken for granted when operations run in isolated or bandwidth-constrained environments.
Spazio IT’s response to this challenge is a pair of complementary systems — SI-Listener and SI-Watcher — designed from the ground up to run at the edge: locally, in real time, without routing sensitive data to the cloud.
🎙️ SI-Listener: voice to knowledge
SI-Listener is a real-time voice-to-knowledge engine. It captures spoken audio from microphones or audio streams, transcribes it continuously, and converts the resulting text into structured knowledge — events, observations, compliance markers — using on-device language models. No audio leaves the local environment. The source code is available on GitHub as Speech-to-Knowledge, a C++ system originally designed for healthcare applications.
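As an illustration of what "structured knowledge" derived from a transcript could look like, here is a minimal Python sketch. The `KnowledgeEvent` schema, its field names, and the example label are assumptions made for this illustration; they are not taken from the Speech-to-Knowledge repository, which is written in C++ and may represent its output differently.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class KnowledgeEvent:
    """One structured item derived from a transcript segment (hypothetical schema)."""
    start_s: float  # segment start, in seconds from session start
    end_s: float    # segment end, in seconds from session start
    kind: str       # "event", "observation", or "compliance_marker"
    text: str       # transcript text the item was derived from
    label: str      # short normalized label (assumed naming)


def to_json_line(event: KnowledgeEvent) -> str:
    """Serialize one event as a single JSON line, convenient for streaming."""
    return json.dumps(asdict(event), ensure_ascii=False)


example = KnowledgeEvent(
    start_s=12.4,
    end_s=15.1,
    kind="compliance_marker",
    text="Sponge count is correct, ten of ten.",
    label="sponge_count_confirmed",
)
line = to_json_line(example)
print(line)
```

Emitting one self-describing record per line keeps the output easy to consume downstream without any network dependency.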
📷 SI-Watcher: video to knowledge
SI-Watcher is a real-time video-to-knowledge engine. It ingests live or recorded video streams, applies multimodal generative AI to identify actions, objects, and procedural steps, and emits structured semantic data. The SI-Watcher source code is available on GitHub as Video-to-Knowledge, a C++ edge-first system built for low-latency local inference. Its mobile companion, VideoToKnowledge, is a .NET MAUI application that brings the same capability directly to a smartphone or tablet.
🔀 Fusing audio and video: the Data Aggregator and Analyzer
SI-Listener and SI-Watcher can operate independently, but their real power emerges when their outputs are combined. The Data Aggregator and Analyzer takes the transcription stream from SI-Listener and the semantic event stream from SI-Watcher and fuses them using an OpenAI-compatible LLM, producing a time-aligned structured CSV and a medical quality and compliance analysis report. The entire pipeline — from live sensor input to structured analytical output — runs on commodity hardware without requiring a cloud subscription or an internet connection.
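The time-alignment part of this fusion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the column names, and the sample rows are hypothetical, both streams are assumed to share one clock and to arrive sorted by time, and the LLM-based analysis that the actual Data Aggregator and Analyzer performs is not shown.

```python
import csv
import heapq
import io
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    t: float      # seconds from session start (assumed shared clock)
    source: str   # "audio" or "video"
    content: str  # transcript text or semantic event description


def merge_streams(audio: Iterable[Record], video: Iterable[Record]) -> Iterator[Record]:
    """Interleave two streams, each already sorted by time, into one timeline."""
    return heapq.merge(audio, video, key=lambda r: r.t)


def to_csv(records: Iterable[Record]) -> str:
    """Render the merged timeline as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["time_s", "source", "content"])
    for r in records:
        writer.writerow([f"{r.t:.1f}", r.source, r.content])
    return buf.getvalue()


# Hypothetical sample data standing in for the two engines' outputs.
audio = [
    Record(3.0, "audio", "Starting instrument count"),
    Record(9.5, "audio", "Count confirmed"),
]
video = [
    Record(4.2, "video", "Nurse handles instrument tray"),
    Record(12.0, "video", "Tray moved to side table"),
]

timeline_csv = to_csv(merge_streams(audio, video))
print(timeline_csv)
```

Because `heapq.merge` consumes its inputs lazily, the same approach works on live streams as well as on recorded sessions, provided each stream is emitted in time order.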
🏥 Validated in a high-stakes domain: the MedGemma Impact Challenge
To test this architecture under genuinely demanding conditions, Spazio IT entered the MedGemma Impact Challenge on Kaggle — an open competition focused on applying Google’s MedGemma medical vision-language model to real healthcare problems.
The submission applied the combined SI-Listener / SI-Watcher / Data Aggregator pipeline to the problem of surgical procedure monitoring: detecting in real time whether procedural steps are being followed correctly, using only local audio and video.
Surgical procedures follow strict protocols — instrument counts, hand hygiene steps, team communication checkpoints, sterile field management. Deviations from these protocols are a significant source of preventable adverse events. The surgical setting is also one of the most demanding for an edge AI system: 🔒 high privacy requirements (no patient audio or video should leave the room), ⏱️ latency sensitivity (flags must appear in near-real time), and the need to reason simultaneously across what is being said and what is being done.
In the challenge submission, the multimodal fusion approach produced richer and more accurate procedure tracking than either modality alone. 🎙️ Audio captures verbal confirmations and team communication; 📷 video captures physical actions that may not be verbalized. Together they provide complementary, time-aligned coverage of the full procedure. The full writeup, including the technical approach and results, is available on Kaggle.
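The idea of complementary coverage can be made concrete with a small Python sketch. It is not the method used in the submission: the step labels, the `coverage` and `missing_steps` helpers, and the sample detections are all hypothetical, and a simple set lookup stands in for the model-based reasoning of the real pipeline.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical protocol steps; a real protocol would define its own labels.
REQUIRED_STEPS = ["hand_hygiene", "instrument_count", "timeout_checkpoint"]


def coverage(required: Iterable[str],
             audio_labels: Set[str],
             video_labels: Set[str]) -> Dict[str, List[str]]:
    """Map each required step to the modalities in which it was detected."""
    report: Dict[str, List[str]] = {}
    for step in required:
        sources = []
        if step in audio_labels:
            sources.append("audio")
        if step in video_labels:
            sources.append("video")
        report[step] = sources
    return report


def missing_steps(report: Dict[str, List[str]]) -> List[str]:
    """Return the steps that no modality detected."""
    return [step for step, sources in report.items() if not sources]


# Hypothetical detections: hand hygiene is seen but never spoken aloud,
# while the timeout checkpoint is spoken but has no distinct visual cue.
audio_labels = {"instrument_count", "timeout_checkpoint"}
video_labels = {"hand_hygiene", "instrument_count"}

fused = coverage(REQUIRED_STEPS, audio_labels, video_labels)
audio_only = coverage(REQUIRED_STEPS, audio_labels, set())

print("fused missing:", missing_steps(fused))
print("audio-only missing:", missing_steps(audio_only))
```

In this toy data, audio alone would report hand hygiene as missing, whereas the combined view accounts for every required step.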
🏭 Beyond the operating theatre
The architecture is domain-agnostic. The same combination of real-time audio understanding, video semantic analysis, and multimodal fusion applies wherever procedural compliance, situational awareness, or event logging matter and cloud connectivity cannot be assumed:
🔩 Industrial quality control · 🏗️ Manufacturing process auditing · ✈️ Aerospace maintenance checks · 🧪 Laboratory protocol compliance · 🌍 Field operations logging · 🦺 Construction site safety
Spazio IT’s existing experience in aerospace software verification — including work on the IXV flight software and the Space Rider ISVV programme — informs a rigorous approach to reliability and correctness in AI systems. The same discipline that applies to flight software applies here: the system must behave predictably, its outputs must be interpretable, and its failure modes must be understood.
💻 Open source repositories
The core components are available on GitHub under mmartign:
- 🎙️ Speech-to-Knowledge — the SI-Listener engine: real-time speech-to-structured-data, written in C++, designed for healthcare applications.
- 📷 Video-to-Knowledge — the SI-Watcher engine: edge-first real-time video-to-knowledge using multimodal generative AI, written in C++.
- 🔀 Data-Aggregator-and-Analyzer — the fusion layer: combines audio transcriptions and video semantic data via an OpenAI-compatible LLM to produce structured CSV output and compliance reports.
- 🔩 S7-Generic-Client — generic C++ client for Siemens S7 PLCs via Snap7, used in Spazio IT Industry 4.0 solutions.
- 🔌 OPC-UA-Generic-Client — generic C client for reading and writing OPC-UA server variables.
- ✅ SAFacilitator — Java tool supporting the SAFe Toolset for software verification and validation activities.
📬 Get in touch
If you are working on a problem where real-time audio or video understanding matters — in healthcare, industry, aerospace, or another domain — Spazio IT is open to exploring how this architecture could be adapted to your context. Contact us.
Related pages: SI-Listener · SI-Watcher · Generative AI @ Spazio IT · Applying ISVV to AI Software · Industry 4.0 @ Spazio IT






