Gemma 4 E4B โ€” Vision Model Benchmark

Running on MacBook M1 Pro 16GB via MLX
Test date: April 6, 2025 โ€ข Resolution: 1920ร—1080
3/3
Tests Passed
21s
Avg Response Time
0
Loops / Errors
4.45MB
Max Payload Tested

๐Ÿ” Test Results

1. Hacker News โ€” news.ycombinator.com

โฑ 26s โœ… Pass
Hacker News screenshot
TEXTS:
1. Hacker News new | past | comments | ask | show | jobs | submit
2. Login
3. 1. I don't Download Your App. The Web Version Is A OK
4. 2. A Germany Does "UNK". Head of RU Ransomware Gangs Revl, GandCrab
5. 3. Book Review: There Is No Antidote
6. 4. What Being Ripped Off Thought Me
7. 5. How I built a tiny LLM to demystify how language models work
8. 6. Microsoft hasn't had a coherent GUI strategy since Potwod
9. 7. A France pulls last gold US for $15B gain
10. 8. Gemma 4 on iPhone
11. 9. An open-source 240-antenna array to bounce signals off the Moon
12. 10. Posthog (YC) is Worth $200k
13. 11. The 1987 game "The Last Ninja" was 40 kilobytes
14โ€“30. [additional entries extracted]

LAYOUT:
Minimalist design. Header with nav links, main body is a numbered
list of posts with metadata. Footer with site links + search.
High contrast: black text on white background.

CONTEXT:
Hacker News main feed. User is NOT logged in ("Login" visible).
Can browse posts, view discussions, and search.

2. Wikipedia โ€” Problem solving

โฑ 15s โœ… Pass
Wikipedia screenshot
TEXTS:
Top navigation/search bar present. Main content contains extensive
text blocks related to academic topics โ€” "The Industrial Revolution",
"Globalization", "Climate Change", "Artificial Intelligence".
Multiple essay-style prompts and structured arguments.

LAYOUT:
Single column dominance. Dense text blocks packed tightly together.
White background with black text. Standard document/article layout.

CONTEXT:
Wikipedia article page displaying extensive written material.
User is viewing/reading content. Standard encyclopedia interface.

โš ๏ธ Note: The 6MB full-page screenshot at 1920ร—1080 resulted in
a somewhat generic response โ€” the model processed it but the
extreme height made text extraction less precise.

3. GitHub โ€” Trending Repositories

โฑ 23s โœ… Pass
GitHub Trending screenshot
TEXTS:
Navigation: Platform, Solutions, Resources, Open Source, Enterprise,
Pricing, Sign In, Sign Up

Trending repos identified:
1. abhyngpitwar/GitNexus โ€” 23,107 stars, 1,420 forks
2. google-ai-edge/gallery
3. black/goose
4. google-ai-edge/LiteRT-LM
5. imnrich-app/imnich
6. KeygraphHQ/shannon
7. NousResearch/hermes-agent
8. tobi/qmd
9. TelegramMessenger/Telegram-iOS
10. kepano/obsidian-skills
11. olama/olama
12. ggml-org/llama.cpp
13. sidharthvadvem/openscreen
14. NVIDIA/personaplex

LAYOUT:
Standard GitHub layout. Top nav bar with links and search.
Main area shows "Trending" repository cards with metadata
(stars, forks, description). Clean color scheme.

CONTEXT:
GitHub Trending page. User browsing popular repos.
Not logged in. Can view trends, filter by language/date.

๐Ÿ“Š Comparison: Gemma 4 E4B vs Qwen2.5-VL

Site Resolution Qwen2.5-VL Time Qwen2.5-VL Result Gemma 4 E4B Time Gemma 4 E4B Result
Hacker News 1280ร—720 27s โš ๏ธ Number errors 13โ€“26s โœ… Clean
Wikipedia 1280ร—720 122s โŒ Infinite loop 15โ€“21s โœ… No loops
GitHub Trending 1280ร—720 25โ€“50s โœ… OK 17โ€“23s โœ… Clean

๐Ÿ“ Gemma 4 E4B โ€” Resolution Scaling Tests

Resolution Image Size Time Result
800ร—600~100 KB~10sโœ…
1280ร—720~200 KB~13sโœ…
1920ร—1080~400 KB~21sโœ…
2560ร—1440~700 KB~25sโœ…
3840ร—2160~1.5 MB~35sโœ…
7680ร—4320~3.2 MB~55sโœ…
17800ร—10013~4.45 MB~90sโœ…
>4.45 MB payloadโ€”โ€”โŒ Limit

๐ŸŒ Multilingual Recognition Test

Each language version of the Wikipedia article "Problem solving" was screenshotted at 1920ร—1080 and sent to Gemma 4 E4B for analysis. The model was asked to extract text, describe layout, and identify the language/context.

Language Time Language Detected OCR Quality Key Issues
๐Ÿ‡ฌ๐Ÿ‡ง English 98.5s โœ… Correct โœ… Good Accurate extraction of headings, nav, TOC. Minor truncation of body text.
๐Ÿ‡ท๐Ÿ‡บ Russian 15.1s โŒ Korean โŒ Failed Hallucinated Korean text (๊ฒ€์ƒ‰, ๋ชฉ์ฐจ, ์„œ๋ก ) instead of Russian. Complete language confusion. Short response suggests early bail-out.
๐Ÿ‡จ๐Ÿ‡ณ Chinese 98.1s โš ๏ธ Partial โš ๏ธ Poor Detected Chinese characters but heavily hallucinated body text. Nav items partially correct (็ถญๅŸบ็™พ็ง‘, ๆœๅฐ‹). Article body is mostly fabricated gibberish.
๐Ÿ‡ฏ๐Ÿ‡ต Japanese 94.8s โš ๏ธ Partial โŒ Failed Detected ใ‚ฆใ‚ฃใ‚ญใƒšใƒ‡ใ‚ฃใ‚ข correctly. Then hallucinated Arabic text "้–ูˆุงู‚ๆณ•" for the article body โ€” complete script confusion between Japanese and Arabic.
๐Ÿ‡ฐ๐Ÿ‡ท Korean 98.4s โœ… Correct โš ๏ธ Moderate Correctly identified Korean, extracted ์œ„ํ‚ค๋ฐฑ๊ณผ, ๋ฌธ์ œ ํ•ด๊ฒฐ, TOC sections. Some nav text garbled. Best Asian language result.
๐Ÿ‡ธ๐Ÿ‡ฆ Arabic 17.8s โœ… Correct โš ๏ธ Moderate Correctly identified Arabic and ูˆูŠูƒูŠุจูŠุฏูŠุง. Title slightly wrong (ู‡ู„ ุงู„ู…ุดูƒู„ุงุช vs ุญู„ ุงู„ู…ุดูƒู„ุงุช). TOC sections mostly correct. Short response.
๐Ÿ‡น๐Ÿ‡ญ Thai 98.2s โŒ Korean โŒ Failed Hallucinated Korean text (๊ฒ€์ƒ‰, ์ƒˆ ๋ฌธ์„œ, ์ตœ๊ทผ ๋ณ€๊ฒฝ) instead of Thai. Complete language confusion โ€” zero Thai characters extracted.
๐Ÿ‡ฎ๐Ÿ‡ณ Hindi 14.4s โŒ Korean โŒ Failed Only เคตเคฟเค•เคฟเคชเฅ€เคกเคฟเคฏเคพ extracted correctly. Rest is hallucinated Korean (๊ฒ€์ƒ‰, ์ฐพ์œผ๋ ค๋ฉด, ์ƒ์ƒ-์ƒํ™ฉ). Devanagari almost entirely missed.
๐Ÿ‡ฉ๐Ÿ‡ช German 31.2s โœ… Correct โœ… Good Accurate: "Problem lรถsen", nav items, TOC, theorist names (Duncker, Newell & Simon). Minor: "Queltext" instead of "Quelltext".
๐Ÿ‡ง๐Ÿ‡ท Portuguese 22.3s โœ… Correct โœ… Good Accurate: "Resoluรงรฃo de problemas", nav items, article intro. Clean extraction of Latin-script text.

๐Ÿ“ธ Screenshots & Model Responses

๐Ÿ‡ฌ๐Ÿ‡ง English (98.5s)

Model Response (6537 chars)
โœ… Correctly extracted: Problem solving title, TOC (Definition, Processes, Problem-solving methods, Common barriers, Cognitive sciences...), Wikipedia nav elements. Good accuracy on Latin script.

๐Ÿ‡ท๐Ÿ‡บ Russian (15.1s) โ€” โŒ FAILED

Model Response (1858 chars)
โŒ Model hallucinated Korean text for a Russian page. Output: ๊ฒ€์ƒ‰ (Search), ๋ชฉ์ฐจ (Table of Contents), ์„œ๋ก  (Introduction) โ€” all Korean. Zero Cyrillic text extracted. Identified language as Korean.

๐Ÿ‡จ๐Ÿ‡ณ Chinese (98.1s) โ€” โš ๏ธ POOR

Model Response (6186 chars)
โš ๏ธ Partial: ็ถญๅŸบ็™พ็ง‘, ๆœๅฐ‹็ถญๅŸบ็™พ็ง‘ correctly extracted. But article body is heavily hallucinated โ€” fabricated sentences with real-looking Chinese characters that don't match the actual page content.

๐Ÿ‡ฏ๐Ÿ‡ต Japanese (94.8s) โ€” โŒ FAILED

Model Response (8326 chars)
โŒ Correctly got ใ‚ฆใ‚ฃใ‚ญใƒšใƒ‡ใ‚ฃใ‚ข but then hallucinated Arabic characters "้–ูˆุงู‚ๆณ•" as the article title (repeated 13+ times). Mixed Japanese nav with Arabic body text โ€” bizarre cross-script hallucination.

๐Ÿ‡ฐ๐Ÿ‡ท Korean (98.4s) โ€” โœ… BEST ASIAN

Model Response (9050 chars)
โœ… Best Asian language result. Correctly: ์œ„ํ‚ค๋ฐฑ๊ณผ, ๋ฌธ์ œ ํ•ด๊ฒฐ, ๊ฒ€์ƒ‰, ๊ธฐ๋ถ€ ๊ณ„์ • ๋งŒ๋“ค๊ธฐ ๋กœ๊ทธ์ธ. TOC sections partially correct. Banner text about editing period detected. Some garbled characters in body.

๐Ÿ‡ธ๐Ÿ‡ฆ Arabic (17.8s) โ€” โš ๏ธ MODERATE

Model Response (2468 chars)
โš ๏ธ Correctly identified Arabic and ูˆูŠูƒูŠุจูŠุฏูŠุง. Title: ู‡ู„ ุงู„ู…ุดูƒู„ุงุช (should be ุญู„ ุงู„ู…ุดูƒู„ุงุช โ€” missed the ุญ). TOC sections correct: ุงู„ุชุนุฑูŠู, ุนู„ู… ุงู„ู†ูุณ, ุงู„ุนู„ูˆู… ุงู„ู…ุนุฑููŠุฉ. Short response, RTL handling OK.

๐Ÿ‡น๐Ÿ‡ญ Thai (98.2s) โ€” โŒ FAILED

Model Response (8903 chars)
โŒ Complete failure โ€” entire response is in Korean (๊ฒ€์ƒ‰, ์ƒˆ ๋ฌธ์„œ, ์ตœ๊ทผ ๋ณ€๊ฒฝ, ๋„์›€๋ง). Zero Thai characters extracted. Model seems to default to Korean when it can't read a script.

๐Ÿ‡ฎ๐Ÿ‡ณ Hindi (14.4s) โ€” โŒ FAILED

Model Response (1477 chars)
โŒ Only เคตเคฟเค•เคฟเคชเฅ€เคกเคฟเคฏเคพ extracted correctly (1 word). Rest: hallucinated Korean (๊ฒ€์ƒ‰, ์ฐพ์œผ๋ ค๋ฉด, ์ƒ์ƒ-์ƒํ™ฉ). Identified as Korean language. Devanagari script almost completely unreadable to the model.

๐Ÿ‡ฉ๐Ÿ‡ช German (31.2s) โ€” โœ… GOOD

Model Response (5022 chars)
โœ… Correct: "Problem lรถsen", Suchen, Jetzt spenden, Benutzerkonto erstellen, Anmelden. TOC and theorist names accurate. Minor: "Queltext" instead of "Quelltext" (one char). Umlauts handled correctly (รถ, รผ).

๐Ÿ‡ง๐Ÿ‡ท Portuguese (22.3s) โ€” โœ… GOOD

Model Response (3648 chars)
โœ… Correct: "Resoluรงรฃo de problemas", Pesquisar na Wikipรฉdia, Procurar, Doar, Criar conta, Iniciar sessรฃo. Article intro accurately transcribed. Accented characters (รฃ, รง, รฉ) handled correctly.

๐Ÿ” Analysis & Patterns

Script-based performance tiers:

โœ… Tier 1 โ€” Latin script (EN, DE, PT): Excellent. Accurate text extraction, correct language identification, proper handling of diacritics (รถ, รผ, รฃ, รง, รฉ). Response times 22โ€“98s.

โš ๏ธ Tier 2 โ€” Arabic script (AR): Moderate. Language correctly identified, RTL layout understood, most TOC items correct. Minor character errors (ู‡ู„ vs ุญู„). Short responses suggest limited confidence.

โš ๏ธ Tier 3 โ€” CJK scripts (ZH, KO): Mixed. Korean best among Asian languages โ€” correct language ID and key terms. Chinese partially correct headers but hallucinated body text.

โŒ Tier 4 โ€” Other scripts (RU, JA, TH, HI): Failed. Model defaults to Korean hallucinations when it cannot read the script. Russian (Cyrillic), Thai, and Hindi (Devanagari) are essentially unreadable. Japanese triggers bizarre Arabic hallucinations.

Key finding: The model has a strong Korean bias โ€” when uncertain, it generates Korean text regardless of the actual script. This suggests the 4-bit quantization may have degraded multilingual OCR capabilities, particularly for non-Latin scripts. The model appears to have been fine-tuned or has stronger weights for Korean among Asian languages.

Response time pattern: Fast responses (14โ€“22s) correlate with poor quality โ€” the model "gives up" quickly on scripts it can't read. Long responses (94โ€“98s) indicate the model is trying harder, hitting max_tokens.

๐Ÿ”ฌ DeepSeek-OCR-2 vs Gemma 4 E4B โ€” Comparison Attempt

โš ๏ธ DeepSeek-OCR-2-8bit: Failed to Produce Usable Output

We attempted to run mlx-community/DeepSeek-OCR-2-8bit on the same MacBook M1 Pro 16GB setup for comparison. After significant setup effort (upgrading from Python 3.9 to 3.13, installing mlx-vlm 0.4.4 + torch + torchvision, patching model modules), the model loaded but produced completely degenerate output โ€” repetitive loops of questions like "What is the article about?" or "What are the interactive installations?" for all 10 languages.

Setup Challenges

1. Python 3.9 incompatibility: The system Python (Xcode 3.9) couldn't run mlx-vlm โ‰ฅ0.3.x (needs scipy โ‰ฅ1.15.3). Had to switch to Python 3.13 (/usr/local/bin).

2. Model architecture not supported: mlx-vlm 0.1.15 didn't have deepseekocr_2 module. Even after manual patching, BaseModelConfig was missing from old base.py.

3. Processor class mismatch: Model config references DeepseekVLV2Processor, not registered in transformers' AutoProcessor registry. The custom mlx-vlm-server.py couldn't load it.

4. Solution: Used the official mlx_vlm.server which handles model-specific processors internally. CLI generation worked (178 tokens/sec), but the server API produced garbage.

Comparison Table

Language Gemma 4 E4B (4-bit) DeepSeek-OCR-2 (8-bit)
Time Lang ID OCR Quality Time Lang ID OCR Quality
๐Ÿ‡บ๐Ÿ‡ธ English98.5sโœ…โœ… Good31.4sโŒโŒ Degenerate loop
๐Ÿ‡ท๐Ÿ‡บ Russian15.1sโŒ KoreanโŒ Failed1.9sโŒโŒ Empty
๐Ÿ‡จ๐Ÿ‡ณ Chinese98.1sโš ๏ธ PartialโŒ Poor29.8sโŒโŒ Degenerate loop
๐Ÿ‡ฏ๐Ÿ‡ต Japanese94.8sโš ๏ธ PartialโŒ Failed2.0sโŒโŒ Empty
๐Ÿ‡ฐ๐Ÿ‡ท Korean98.4sโœ…โš ๏ธ Moderate29.8sโŒโŒ Degenerate loop
๐Ÿ‡ธ๐Ÿ‡ฆ Arabic17.8sโœ…โš ๏ธ Moderate29.8sโŒโŒ Degenerate loop
๐Ÿ‡น๐Ÿ‡ญ Thai98.2sโŒ KoreanโŒ Failed2.0sโŒโŒ Empty
๐Ÿ‡ฎ๐Ÿ‡ณ Hindi14.4sโŒ KoreanโŒ Failed29.7sโŒโŒ Degenerate loop
๐Ÿ‡ฉ๐Ÿ‡ช German31.2sโœ…โœ… Good29.9sโŒโŒ Degenerate loop
๐Ÿ‡ต๐Ÿ‡น Portuguese22.3sโœ…โœ… Good1.9sโŒโŒ Empty

Analysis

DeepSeek-OCR-2-8bit via mlx_vlm.server is non-functional for screenshot analysis.

The model generates two failure modes:

1. Degenerate loops (6/10 languages): Produces repetitive questions or phrases ("What is the article about?" ร—500) until hitting max_tokens. Time: ~30s. These appear to be cases where the model sees the image but enters a self-referential loop.

2. Empty output (4/10 languages โ€” RU, JA, TH, PT): Returns nothing in ~2s. Likely the model immediately produces an EOS token.

Root cause hypothesis: The mlx_vlm.server may not be correctly applying the DeepSeek-OCR-2 chat template or image preprocessing. The CLI tool (mlx_vlm.generate) works for text-only prompts (178 tokens/sec), suggesting the model weights are loaded correctly but the server's image handling is broken for this architecture.

Verdict: Gemma 4 E4B decisively wins this comparison โ€” despite its own multilingual limitations, it at least produces structured, relevant output for Latin-script languages and partial output for Asian scripts. DeepSeek-OCR-2 via mlx-vlm server produces no usable output for any language.

๐Ÿ“ Conclusion

Gemma 4 E4B running via MLX on MacBook M1 Pro 16GB demonstrates excellent vision capabilities for screenshot analysis tasks:

โœ… Reliability: All 3 test sites processed successfully with no infinite loops or generation failures โ€” a critical improvement over Qwen2.5-VL which suffered from infinite loops on Wikipedia (122s timeout).

โœ… Speed: Consistent 13โ€“26s response times at 1920ร—1080 resolution. Average ~21s per screenshot โ€” significantly faster than Qwen2.5-VL on complex pages.

โœ… Scaling: Successfully processes images from 800ร—600 up to 17800ร—10013 (the limit is ~4.45MB base64 payload size, not resolution). All resolutions produced valid structured output.

โš ๏ธ Accuracy: Some text transcription errors on small/dense text (e.g., misread URLs, garbled words on HN). Layout and context understanding is solid. GitHub repo names had minor OCR-style errors (e.g., "olama" instead of "ollama").

โŒ Multilingual: Major weakness. Only Latin-script languages (EN, DE, PT) work reliably. Korean is the only well-supported Asian language. Cyrillic (Russian), Thai, Hindi (Devanagari) completely fail โ€” the model hallucinates Korean text. Chinese and Japanese partially work but with heavy hallucination. Arabic is moderate. The 4-bit quantization likely degrades non-Latin script recognition significantly.

Verdict: For local vision model use on Apple Silicon, Gemma 4 E4B via MLX is excellent for Latin-script content โ€” fast, stable, and capable of handling high-resolution screenshots. However, it is not suitable for multilingual OCR beyond Latin and Korean scripts. For CJK, Arabic, Cyrillic, Thai, or Devanagari content, a larger model or full-precision weights are recommended.