On-Device Medical Intelligence - Converging MedGemma 1.5 4B and LiteRT

converting MedGemma 1.5 4B to LiteRT-LM and running it on edge browser

In the rapidly evolving landscape of healthcare AI, the transition from massive, cloud-dependent models to specialized, on-device intelligence is not just a trend—it’s a clinical necessity. Medical data is inherently sensitive, and the requirements for privacy (HIPAA compliance), zero-latency reasoning, and offline accessibility in remote or high-security environments are paramount.

Here we are diving into how to bring state-of-the-art medical multimodal intelligence directly to the edge. By converting MedGemma 1.5 4B to the specialized LiteRT (.litertlm) format, we unlock the ability to perform complex clinical analysis—including MRI interpretation and EHR questioning—entirely within a local web browser using WebGPU.

1. The Rise of On-Device Medical Intelligence

Traditional medical AI often relies on sending high-resolution scans and patient records to powerful GPU clusters in the cloud. While effective, this approach introduces significant bottlenecks:

Privacy Risks: Every byte of data leaving the hospital network is a potential point of failure.
Latency: In critical care, waiting for a round-trip to a data center can be the difference between a prompt diagnosis and a delayed one.
Connectivity: Many clinical environments (ORs, remote clinics, or mobile health units) suffer from inconsistent internet access.

On-device intelligence solves these by performing inference where the data is born. With the release of Google’s Gemma 3 architecture and its medical sibling MedGemma 1.5, the “edge” is now powerful enough to handle 4-billion parameter multimodal models.

2. MedGemma 1.5 4B: A Multimodal Leap

MedGemma 1.5 4B represents a significant architectural shift over its predecessors. While MedGemma 1.0 was a pioneer in clinical text understanding, the 1.5 iteration—built on the Gemma 3 foundation—is a true multimodal powerhouse.

Key Advancements:

From 2D to 3D: Previous models focused on 2D images like X-rays or dermoscopy. MedGemma 1.5 natively interprets 3D medical volumes from CT and MRI scans.
Longitudinal Reasoning: One of the model’s strongest features is its ability to track disease progression by comparing historical scans against current ones.
Clinical Accuracy: EHR Question Answering accuracy has jumped from ~68% in version 1 to a staggering ~90% in version 1.5.
Anatomy Localization: Precise identification of anatomical structures and abnormalities saw an improvement from ~3% IoU to ~38% IoU.

Capability	MedGemma 1 (4B)	MedGemma 1.5 (4B)
Base Architecture	Gemma 2	Gemma 3
Imaging Support	2D Focus (X-rays)	3D (CT/MRI) & WSIs
Temporal Reasoning	Single-scan	Longitudinal Tracking
EHR QA Accuracy	~68%	~90%

3. Deep Dive into the .litertlm Format

To run MedGemma efficiently on the edge, we leverage the .litertlm format. This isn’t just another file extension; it is LiteRT’s (formerly TensorFlow Lite) specialized bundle for Generative AI.

Why .litertlm for MedGemma?

Stateful Optimization: Unlike standard .tflite graphs, a .litertlm bundle is designed for the iterative nature of LLMs. It contains separate, optimized graphs for Prefill (processing the prompt) and Decode (generating tokens one-by-one), while natively managing the KV-cache.
Multimodal Synergy: MedGemma requires a vision encoder and a language head to work in tandem. .litertlm bundles these disparate components into a single, self-describing artifact, ensuring the vision-language projection layers are always synchronized.
Hardware Native: The format is built to leverage the LiteRT GenAI API, which provides highly optimized kernels for mobile GPUs and NPUs, significantly outperforming generic graph execution.

4. The Conversion Workflow

The transformation from raw PyTorch weights to a production-ready LiteRT bundle is an intricate process handled via the litert_torch export pipeline. It starts with the original MedGemma 1.5 4B IT weights from Google and re-architects them for high-performance edge execution.

The Logic Behind the Conversion

Instead of a simple “file format save,” the conversion logic performs several critical architectural bridges and optimizations:

Structural Alignment (Architecture Bridging): MedGemma 1.5 is built on the Gemma 3 architecture. In some versions of the Hugging Face transformers library, the vision modules (tower and projector) are nested deeply within the model structure. Our workflow includes a structural patch that maps these nested components to the top level of the model class. This ensures the export engine can accurately “see” and trace the multimodal connection points during the graph-generation phase.
Multimodal Graph Tracing: The conversion initiates a image_text_to_text export task. This process traces the mathematical flow of data through both the vision encoder and the language head. It effectively captures how an MRI image is transformed into tokens and how those tokens are processed by the LLM to generate a clinical description.
Prefill Bucketing: To optimize the “time-to-first-token” on edge devices, the workflow generates specialized graphs for different prefill lengths (e.g., 128, 256, 512 tokens). This allows the runtime to use the most efficient computation path based on the size of the user’s initial prompt or image metadata.
Specialized Dual-Quantization: To compress the 4B parameter model to a browser-friendly ~3GB, we apply distinct quantization strategies to different components:
- LLM Core: Uses a dynamic_wi8_afp32 recipe (8-bit weights with 32-bit activations), balancing reasoning depth with memory footprint.
- Vision Encoder: Uses a weight_only_wi8_afp32 recipe, ensuring that the high-dimensional features required for medical imaging are preserved while still reducing the storage overhead.
KV-Cache Architecture: The workflow configures a fixed-length Key-Value (KV) cache (typically 4096 tokens). This is embedded into the LiteRT graph, enabling the model to “remember” the context of a long medical conversation without re-processing the entire history for every new word generated.
Unified Bundling: The final step packages the optimized graphs, the token embedder, the tokenizer configuration, and essential model metadata into a single, self-describing .litertlm container. This eliminates the need for external configuration files and ensures the model is “plug-and-play” for the edge runtime.

5. Model Deployment on Edge Web Browsers

Deploying a 4-billion parameter multimodal model like MedGemma in a browser tab is a feat of modern web engineering. It transforms the browser from a simple document viewer into a secure, hardware-accelerated sandbox for private clinical intelligence. To verify the effectiveness of the converted model, we implement a simple GUI prototype written in JavaScript/TypeScript that runs on a local web browser, in which the model deployment is optimized through a multi-layered architectural approach.

Hardware-First Verification

The deployment begins with an environmental handshake. Before attempting to load any model, the application verifies the presence of the WebGPU API (navigator.gpu). This is the application’s “gatekeeper”—without native GPU access, the computational overhead of a 4B parameter model would be too high for a standard browser thread.

Dynamic Runtime Resolution (WASM)

Once hardware is confirmed, the application initializes the LiteRT GenAI runtime. Instead of shipping massive binary loaders with the app, we utilize the FilesetResolver to pull specialized WebAssembly (WASM) runtimes (like genai_wasm_internal.js) from a high-performance CDN. This ensures the edge engine is always running the latest version compatible with the MedGemma 1.5 bundle format.

Local Asset Setup

For the GUI prototype to function correctly, the .litertlm model files must be hosted locally within the application’s structure. Specifically, the model bundle should be located at the public/models directory. This allows the LiteRT GenAI runtime to fetch the multi-gigabyte binary artifacts directly from the same origin, bypassing complex cross-origin resource sharing (CORS) issues while maintaining high-speed local data transfer.

Two-Stage Fallback Strategy

To maximize clinical accessibility, we implemented a robust Primary-to-Backup loading loop:

Primary Attempt (MedGemma 1.5 4B): The system first tries to allocate resources for the high-fidelity medical model. This model provides the deep clinical reasoning required for complex MRI/CT analysis.
Automatic Fallback (Gemma 3 2B): If the primary load fails—common on devices with less than 16GB of RAM or limited VRAM—the logic catches the error and immediately attempts to initialize the lighter Gemma 3 engine. This ensures the clinician is never left without an active intelligence layer.

In the future, it would be more handful to have a robust MedGemma with less than 2B.

Integrity & Cache Management

Handling 3GB model files at the edge introduces “stale cache” risks. If a browser attempts to load a partially-downloaded or outdated model file, initialization will fail. Our deployment employs a Cache-Busting Strategy, appending a dynamic timestamp query parameter (?v=${Date.now()}) to the model asset path. This forces the browser to verify the file’s integrity and ensures that the clinical engine is always synchronized with the correct .litertlm artifact.

Multimodal Inference Pipeline

The inference process is not just a text loop; it is a coordinated orchestration of vision and language:

Image Ingestion: Medical scans are loaded into browser memory as HTMLImageElement objects.
Part-Based Payload: We construct a multimodal payload consisting of the image source followed by a structured text prompt.
Smart Streaming: To provide a fluid “typing” experience, the generateResponse method uses a callback that implements a Smart Accumulator. This logic detects whether the backend is sending cumulative strings or incremental tokens, ensuring that the response flows smoothly without flickering, repeating, or vanishing.

Secure Execution Environment

To enable the advanced memory features (like SharedArrayBuffer) required for GPU-accelerated inference, the deployment requires a “Secure Context.” This is enforced via mandatory HTTP security headers:

COOP (Cross-Origin Opener Policy): same-origin
COEP (Cross-Origin Embedder Policy): require-corp

This architecture allows expert-level medical AI to run with 100% privacy, utilizing the power already sitting on the clinician’s desk.

6. Conclusion and Future Work

The integration of MedGemma 1.5 4B with LiteRT represents a significant milestone in medical AI accessibility, delivering expert-level multimodal intelligence directly to the browser. By enabling private, always-available assistants on-device, we are bridging the gap between cloud-scale performance and edge efficiency, ensuring clinicians can access critical decision support regardless of their connectivity.

To transition this from a research prototype to a global production tool for resource-constrained environments, our future roadmap focuses on three key optimizations:

Int4 QAT Quantization: Implementing Quantization-Aware Training to drastically reduce memory requirements for entry-level mobile and laptop hardware.
Speculative Decoding: Integrating draft models to achieve a 2–3× increase in inference speed, essential for high-volume clinical workflows.
Localized Fine-Tuning: Adapting the model to regional medical terminology and local languages to ensure real-world clinical utility.

A primary example of the need for these advancements is found in Indonesia’s 3T regions (underdeveloped, frontier, and outermost). In these settings, healthcare workers require high-accuracy tools that operate fully offline, speak Bahasa Indonesia, and run on the affordable, low-spec hardware already available in community health centers (Puskesmas).

Resources:

Model (.litertlm): huggingface.co/ai4med-id/medgemma-1.5-4b-it-litertlm
Implementation & GUI Prototype: github.com/AI4MedResearch/edge-ai