Converting MedGemma 1.5 4B to LiteRT-LM and Running It in a Browser at the Edge
In the rapidly evolving landscape of healthcare AI, the transition from massive, cloud-dependent models to specialized, on-device intelligence is not just a trend—it’s a clinical necessity. Medical data is inherently sensitive, and the requirements for privacy (HIPAA compliance), zero-latency reasoning, and offline accessibility in remote or high-security environments are paramount.
Here we dive into how to bring state-of-the-art medical multimodal intelligence directly to the edge. By converting MedGemma 1.5 4B to the specialized LiteRT (.litertlm) format, we unlock the ability to perform complex clinical analysis, including MRI interpretation and EHR question answering, entirely within a local web browser using WebGPU.
Traditional medical AI often relies on sending high-resolution scans and patient records to powerful GPU clusters in the cloud. While effective, this approach introduces significant bottlenecks: protected health information leaves the premises, round-trip latency slows time-critical reads, and nothing works when connectivity drops.
On-device intelligence solves these by performing inference where the data is born. With the release of Google’s Gemma 3 architecture and its medical sibling MedGemma 1.5, the “edge” is now powerful enough to handle 4-billion parameter multimodal models.
MedGemma 1.5 4B represents a significant architectural shift over its predecessors. While MedGemma 1.0 was a pioneer in clinical text understanding, the 1.5 iteration—built on the Gemma 3 foundation—is a true multimodal powerhouse.
| Capability | MedGemma 1 (4B) | MedGemma 1.5 (4B) |
|---|---|---|
| Base Architecture | Gemma 2 | Gemma 3 |
| Imaging Support | 2D Focus (X-rays) | 3D (CT/MRI) & WSIs |
| Temporal Reasoning | Single-scan | Longitudinal Tracking |
| EHR QA Accuracy | ~68% | ~90% |
To run MedGemma efficiently on the edge, we leverage the .litertlm format. This isn’t just another file extension; it is LiteRT’s (formerly TensorFlow Lite) specialized bundle for Generative AI.
Unlike standard single-graph .tflite files, a .litertlm bundle is designed for the iterative nature of LLMs. It contains separate, optimized graphs for Prefill (processing the prompt) and Decode (generating tokens one by one), while natively managing the KV-cache. The .litertlm format packages these disparate components into a single, self-describing artifact, ensuring the vision-language projection layers are always synchronized.

The transformation from raw PyTorch weights to a production-ready LiteRT bundle is an intricate process handled via the litert_torch export pipeline. It starts with the original MedGemma 1.5 4B IT weights from Google and re-architects them for high-performance edge execution.
Instead of a simple “file format save,” the conversion logic performs several critical architectural bridges and optimizations:
- Vision module restructuring: In the transformers library, the vision modules (tower and projector) are nested deeply within the model structure. Our workflow includes a structural patch that maps these nested components to the top level of the model class. This ensures the export engine can accurately “see” and trace the multimodal connection points during the graph-generation phase.
- Multimodal graph tracing: The export uses the image_text_to_text export task, which traces the mathematical flow of data through both the vision encoder and the language head. It effectively captures how an MRI image is transformed into tokens and how those tokens are processed by the LLM to generate a clinical description.
- Selective quantization: The language model is quantized with the dynamic_wi8_afp32 recipe (8-bit weights with 32-bit activations), balancing reasoning depth with memory footprint. The vision encoder instead uses the weight_only_wi8_afp32 recipe, ensuring that the high-dimensional features required for medical imaging are preserved while still reducing the storage overhead.

Deploying a 4-billion parameter multimodal model like MedGemma in a browser tab is a feat of modern web engineering. It transforms the browser from a simple document viewer into a secure, hardware-accelerated sandbox for private clinical intelligence. To verify the effectiveness of the converted model, we implement a simple GUI prototype written in JavaScript/TypeScript that runs in a local web browser, where model deployment is optimized through a multi-layered architectural approach.
The deployment begins with an environmental handshake. Before attempting to load any model, the application verifies the presence of the WebGPU API (navigator.gpu). This is the application’s “gatekeeper”—without native GPU access, the computational overhead of a 4B parameter model would be too high for a standard browser thread.
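This gatekeeper check can be sketched as below. In the browser you would pass the global `navigator`; here the capability object is injected so the logic is testable outside a browser, and the helper name `hasWebGpu` is illustrative rather than part of any library.

```typescript
// Minimal sketch of the WebGPU "gatekeeper" check.
interface GpuNavigator {
  // navigator.gpu is only defined in WebGPU-capable browsers.
  gpu?: { requestAdapter(): Promise<object | null> };
}

async function hasWebGpu(nav: GpuNavigator): Promise<boolean> {
  if (!nav.gpu) return false; // API absent: browser has no WebGPU support
  // Even with the API present, the adapter can be null
  // (e.g. the GPU is blocklisted), so verify it explicitly.
  const adapter = await nav.gpu.requestAdapter();
  return adapter !== null;
}
```

In the app itself this becomes `await hasWebGpu(navigator)`, gating the multi-gigabyte model download behind confirmed hardware support.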
Once hardware is confirmed, the application initializes the LiteRT GenAI runtime. Instead of shipping massive binary loaders with the app, we utilize the FilesetResolver to pull specialized WebAssembly (WASM) runtimes (like genai_wasm_internal.js) from a high-performance CDN. This ensures the edge engine is always running the latest version compatible with the MedGemma 1.5 bundle format.
For the GUI prototype to function correctly, the .litertlm model files must be hosted locally within the application’s structure, specifically in the public/models directory. This allows the LiteRT GenAI runtime to fetch the multi-gigabyte binary artifacts directly from the same origin, bypassing complex cross-origin resource sharing (CORS) issues while maintaining high-speed local data transfer.
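Resolving the bundle path against the app's own origin keeps every fetch same-origin. A tiny sketch (path and file names are illustrative, not the project's actual layout):

```typescript
// Build a same-origin URL for a model bundle hosted under /models,
// so the runtime's fetch never triggers a cross-origin request.
function modelUrl(origin: string, file: string): string {
  return new URL(`/models/${file}`, origin).toString();
}

// Example: modelUrl("https://app.local", "medgemma-1.5-4b.litertlm")
//   -> "https://app.local/models/medgemma-1.5-4b.litertlm"
```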
To maximize clinical accessibility, we implemented a robust Primary-to-Backup loading loop: the application first attempts to initialize from the primary model source and, on failure, automatically falls back to a backup location.
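A minimal sketch of such a loop, under the assumption that each candidate source is tried in order until one engine initializes (the function and parameter names are hypothetical):

```typescript
// Try each model location in order; return the first engine that
// initializes, so a failed primary source falls back to the backup.
async function loadWithFallback<T>(
  urls: string[],
  init: (url: string) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (const url of urls) {
    try {
      return await init(url); // first successful initialization wins
    } catch (err) {
      lastError = err; // remember the failure, try the next source
    }
  }
  throw new Error(`All model sources failed: ${String(lastError)}`);
}
```

Usage would look like `loadWithFallback([PRIMARY_URL, BACKUP_URL], createEngine)`, where `createEngine` wraps the actual runtime initialization.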
In the future, a robust MedGemma variant under 2B parameters would make such deployments even more practical.
Handling 3GB model files at the edge introduces “stale cache” risks. If a browser attempts to load a partially-downloaded or outdated model file, initialization will fail. Our deployment employs a Cache-Busting Strategy, appending a dynamic timestamp query parameter (?v=${Date.now()}) to the model asset path. This forces the browser to verify the file’s integrity and ensures that the clinical engine is always synchronized with the correct .litertlm artifact.
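The cache-busting step reduces to appending the timestamp parameter to the asset path; a sketch (helper name is illustrative):

```typescript
// Append a timestamp query parameter so the browser revalidates the
// bundle instead of serving a stale or partially cached copy.
// `now` is injectable for testing; it defaults to Date.now.
function withCacheBuster(assetPath: string, now: () => number = Date.now): string {
  const sep = assetPath.includes("?") ? "&" : "?"; // respect existing query strings
  return `${assetPath}${sep}v=${now()}`;
}

// withCacheBuster("/models/medgemma.litertlm")
//   -> "/models/medgemma.litertlm?v=<current timestamp>"
```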
The inference process is not just a text loop; it is a coordinated orchestration of vision and language:
- Image ingestion: Uploaded scans are decoded into HTMLImageElement objects before being passed to the multimodal session.
- Streaming output: The generateResponse method uses a callback that implements a Smart Accumulator. This logic detects whether the backend is sending cumulative strings or incremental tokens, ensuring that the response flows smoothly without flickering, repeating, or vanishing.

To enable the advanced memory features (like SharedArrayBuffer) required for GPU-accelerated inference, the deployment requires a “Secure Context.” This is enforced via mandatory HTTP security headers:
- Cross-Origin-Opener-Policy: same-origin
- Cross-Origin-Embedder-Policy: require-corp

This architecture allows expert-level medical AI to run with 100% privacy, utilizing the power already sitting on the clinician’s desk.
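Returning to the inference loop, the cumulative-versus-incremental detection in the Smart Accumulator can be sketched as follows. This is a simplified heuristic illustration, not the exact production callback: it assumes a cumulative backend always resends the full text so far, so any chunk that extends the accumulated text is treated as cumulative.

```typescript
// Simplified Smart Accumulator: some backends stream the whole response
// so far (cumulative), others stream only the new tokens (incremental).
class SmartAccumulator {
  private text = "";

  // Feed one streamed chunk; returns the full text to render.
  push(chunk: string): string {
    if (chunk.startsWith(this.text)) {
      this.text = chunk; // cumulative: chunk is the whole answer so far
    } else {
      this.text += chunk; // incremental: chunk is just the new tokens
    }
    return this.text; // rendering this avoids flicker and repetition
  }
}
```

The heuristic can misclassify an incremental chunk that happens to repeat the accumulated prefix, which is why a production version would also consult the backend's declared streaming mode when available.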
The integration of MedGemma 1.5 4B with LiteRT represents a significant milestone in medical AI accessibility, delivering expert-level multimodal intelligence directly to the browser. By enabling private, always-available assistants on-device, we are bridging the gap between cloud-scale performance and edge efficiency, ensuring clinicians can access critical decision support regardless of their connectivity.
To transition this from a research prototype to a global production tool for resource-constrained environments, our future roadmap focuses on three key optimizations:
- Int4 QAT Quantization: Implementing Quantization-Aware Training to drastically reduce memory requirements for entry-level mobile and laptop hardware.
- Speculative Decoding: Integrating draft models to achieve a 2–3× increase in inference speed, essential for high-volume clinical workflows.
- Localized Fine-Tuning: Adapting the model to regional medical terminology and local languages to ensure real-world clinical utility.
A primary example of the need for these advancements is found in Indonesia’s 3T regions (underdeveloped, frontier, and outermost). In these settings, healthcare workers require high-accuracy tools that operate fully offline, speak Bahasa Indonesia, and run on the affordable, low-spec hardware already available in community health centers (Puskesmas).
Resources:
- Converted model bundle (.litertlm): huggingface.co/ai4med-id/medgemma-1.5-4b-it-litertlm