Gemma 4 Is Running On Phones Now and I Don't Think People Realise How Mental That Is
3GB of Intelligence in Your Pocket 📱
So someone on Instagram posted a video of an LLM running on their phone. No internet. No API calls. No cloud. Just a model sitting on the device, answering questions, understanding images, processing audio. Offline. On a phone.
That's Google's new Gemma 4. Dropped April 2nd. Four model sizes, all open weight, Apache 2.0 licensed. The one that matters for this conversation is the E2B (Effective 2 Billion parameters), which is small enough to run on a phone and fast enough to actually be useful. Google built it from the same research behind Gemini 3, then squeezed it down for edge devices.
The comments on the post split exactly how you'd expect. Half the replies are "that's insane 🔥" and the other half are "battery go bye 😂". Both camps are right.
Why This Is Actually A Big Deal
I've been mucking about with local models for a while. Ollama on the Mac. The occasional experiment with llama.cpp on a beefy Linux box. But running a properly capable multimodal model on a phone, offline, with decent latency? That's new.
The E2B model handles text, images, video, and audio. All four modalities. On device. The E4B (slightly larger) does the same but slower. Google says the E2B runs 3x faster than the E4B, which tracks: fewer active parameters mean less compute per inference step.
These are the foundation for the next generation of Gemini Nano, so anything you build against Gemma 4 today will work on Gemini Nano 4 devices later this year. That's the play. Google wants developers building on this stack now so the apps are ready when the phones ship.
The Privacy Angle (and the Irony)
Here's where it gets interesting. A local LLM means nothing leaves your phone. Your prompts, your documents, your photos, your voice. All processed on device. No server roundtrip. No data collection endpoint. No "we use your inputs to improve our models" clause.
Which is a properly compelling privacy story. The irony, of course, is that it's a Google model. The company whose entire business model is built on knowing everything about you has shipped a model that explicitly keeps everything on your device. I reckon the cynical take is that they want the ecosystem lock-in even if it means giving up the data pipeline. The charitable take is that on-device AI is the future and they'd rather be the foundation than the also-ran.
Either way, if privacy matters to you (and it should), running Gemma 4 locally via something like PocketPal or Ollama mobile is the move.
What You Can Actually Do With It
The practical use cases for a 3GB on-device model are more interesting than the benchmarks.
Offline coding assistance on a tablet. Document analysis without uploading sensitive files to a cloud service. Real-time translation in places with no signal. Voice transcription that never leaves the device. Image understanding for accessibility when you've got no data connection.
For developers specifically, imagine baking this into a mobile app. An on-device AI assistant that works on a plane, in the underground, in a basement with no wifi. No API costs. No latency spikes. No "please check your internet connection." Just inference, running locally, at whatever speed the chip can manage.
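The "works on a plane" pattern above is easy to sketch. Here's a minimal, illustrative Python version of an offline-first assistant: `LocalModel` is a hypothetical stand-in for whatever on-device runtime you'd actually embed (llama.cpp bindings, MediaPipe's LLM inference, etc.), with an optional cloud fallback that only fires if local inference throws.

```python
# Illustrative offline-first inference pattern. LocalModel is a stub for an
# on-device runtime (hypothetical API, not a real Gemma 4 SDK call).
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LocalModel:
    """Stand-in for an embedded model; generate() never touches the network."""
    name: str

    def generate(self, prompt: str) -> str:
        # A real implementation would run local inference here.
        return f"[{self.name} on-device] {prompt[:40]}"


class Assistant:
    def __init__(self, local: LocalModel,
                 cloud: Optional[Callable[[str], str]] = None):
        self.local = local
        self.cloud = cloud  # optional fallback, e.g. a cloud API client

    def ask(self, prompt: str) -> str:
        try:
            # Works on a plane, in the underground, in a basement: no network.
            return self.local.generate(prompt)
        except Exception:
            if self.cloud is None:
                raise  # fully offline app: no fallback to fall back to
            return self.cloud(prompt)


assistant = Assistant(LocalModel("gemma-e2b"))
print(assistant.ask("Summarise this note"))
```

The point of the structure is that the cloud client is optional configuration, not a dependency: an app built this way ships with zero API costs by default.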
The Hardware Question
This matters more than people reckon. If you're on an iPhone, whether you've got an A16 or an A17 chip makes a real difference to which model size you can run comfortably. The E2B is the safe bet for most devices. The E4B will work, but you'll feel it in the heat on your palm and the battery percentage dropping.
On Android, the AICore Developer Preview gives you native integration. Google's clearly pushing this as part of the Android platform, not just a standalone model download. Which means optimised inference, better memory management, and presumably less battery murder than running it through a generic runtime.
| 📚 Geek Corner |
|---|
| Mixture of Experts on your phone: The larger Gemma 4 models (26B MoE and 31B Dense) are for servers and workstations. But the architecture choice matters even for the small ones. The E2B uses Google's "effective parameter" approach, which means the actual parameter count in the weights file is larger than 2B but the active parameters per inference step are equivalent to a 2B model. Think of it as a bigger brain that only activates the relevant bits for each query. This is why a "2B" model can punch above its weight class on benchmarks. It's also why the file size (~3GB) is larger than you'd expect from a naive "2 billion parameters at 2 bytes each = 4GB" calculation, the quantisation and architecture mean the relationship between parameter count, file size, and capability is non-linear. |
The Bigger Picture
We're watching the "AI needs a data centre" narrative fall apart in real time. Six months ago, running a multimodal model on a phone was a party trick for enthusiasts willing to tolerate ten-second response times. Now Google is shipping models explicitly designed for it, with Android platform integration, and they're telling developers to build apps against it.
The phones in our pockets are becoming inference engines. And the models are getting small enough to actually be useful there. That's not a benchmark improvement. That's an architecture shift.
Bottom line: Gemma 4 on device is the quiet announcement that matters more than most people reckon. A 3GB model with multimodal capabilities running offline on your phone, processing your data without it ever leaving the device. The battery life concerns are real. The privacy story is compelling (even from Google, yes). And the developer implications are massive if you're building anything that needs to work without an internet connection.