TEXTS:
1. Hacker News new | past | comments | ask | show | jobs | submit
2. Login
3. 1. I don't Download Your App. The Web Version Is A OK
4. 2. A Germany Does "UNK". Head of RU Ransomware Gangs Revl, GandCrab
5. 3. Book Review: There Is No Antidote
6. 4. What Being Ripped Off Thought Me
7. 5. How I built a tiny LLM to demystify how language models work
8. 6. Microsoft hasn't had a coherent GUI strategy since Potwod
9. 7. A France pulls last gold US for $15B gain
10. 8. Gemma 4 on iPhone
11. 9. An open-source 240-antenna array to bounce signals off the Moon
12. 10. Posthog (YC) is Worth $200k
13. 11. The 1987 game "The Last Ninja" was 40 kilobytes
14–30. [additional entries extracted]
LAYOUT:
Minimalist design. Header with nav links, main body is a numbered
list of posts with metadata. Footer with site links + search.
High contrast: black text on white background.
CONTEXT:
Hacker News main feed. User is NOT logged in ("Login" visible).
Can browse posts, view discussions, and search.
TEXTS: Top navigation/search bar present. Main content contains extensive text blocks related to academic topics: "The Industrial Revolution", "Globalization", "Climate Change", "Artificial Intelligence". Multiple essay-style prompts and structured arguments. LAYOUT: Single column dominance. Dense text blocks packed tightly together. White background with black text. Standard document/article layout. CONTEXT: Wikipedia article page displaying extensive written material. User is viewing/reading content. Standard encyclopedia interface. ⚠️ Note: The 6 MB full-page screenshot at 1920×1080 resulted in a somewhat generic response; the model processed it, but the extreme height made text extraction less precise.
TEXTS: Navigation: Platform, Solutions, Resources, Open Source, Enterprise, Pricing, Sign In, Sign Up Trending repos identified: 1. abhyngpitwar/GitNexus โ 23,107 stars, 1,420 forks 2. google-ai-edge/gallery 3. black/goose 4. google-ai-edge/LiteRT-LM 5. imnrich-app/imnich 6. KeygraphHQ/shannon 7. NousResearch/hermes-agent 8. tobi/qmd 9. TelegramMessenger/Telegram-iOS 10. kepano/obsidian-skills 11. olama/olama 12. ggml-org/llama.cpp 13. sidharthvadvem/openscreen 14. NVIDIA/personaplex LAYOUT: Standard GitHub layout. Top nav bar with links and search. Main area shows "Trending" repository cards with metadata (stars, forks, description). Clean color scheme. CONTEXT: GitHub Trending page. User browsing popular repos. Not logged in. Can view trends, filter by language/date.
| Site | Resolution | Qwen2.5-VL Time | Qwen2.5-VL Result | Gemma 4 E4B Time | Gemma 4 E4B Result |
|---|---|---|---|---|---|
| Hacker News | 1280×720 | 27s | ⚠️ Number errors | 13–26s | ✅ Clean |
| Wikipedia | 1280×720 | 122s | ❌ Infinite loop | 15–21s | ✅ No loops |
| GitHub Trending | 1280×720 | 25–50s | ✅ OK | 17–23s | ✅ Clean |
| Resolution | Image Size | Time | Result |
|---|---|---|---|
| 800×600 | ~100 KB | ~10s | ✅ |
| 1280×720 | ~200 KB | ~13s | ✅ |
| 1920×1080 | ~400 KB | ~21s | ✅ |
| 2560×1440 | ~700 KB | ~25s | ✅ |
| 3840×2160 | ~1.5 MB | ~35s | ✅ |
| 7680×4320 | ~3.2 MB | ~55s | ✅ |
| 17800×10013 | ~4.45 MB | ~90s | ✅ |
| >4.45 MB payload | – | – | ❌ Limit |
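The cutoff above is a payload-size limit rather than a resolution limit: standard base64 expands every 3 input bytes into 4 output characters. A stdlib sketch for checking an image against the observed cap before sending; the ~4.45 MB figure is the empirical limit from this test, not a documented constant:

```python
import base64
import math

# Empirical cap from the table above (observed in this test,
# not a documented MLX or Gemma constant).
MAX_B64_BYTES = int(4.45 * 1024 * 1024)

def base64_size(raw_bytes: int) -> int:
    """Standard base64 output length: 4 chars per 3 input bytes, padded."""
    return 4 * math.ceil(raw_bytes / 3)

def fits_payload(image_bytes: bytes) -> bool:
    """True if the encoded image stays under the observed payload cap."""
    return base64_size(len(image_bytes)) <= MAX_B64_BYTES
```

Working backwards, a ~3.2 MB raw image encodes to ~4.3 MB and squeaks under the cap, while anything above roughly 3.3 MB of raw image data will not fit.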
Each language version of the Wikipedia article "Problem solving" was screenshotted at 1920×1080 and sent to Gemma 4 E4B for analysis. The model was asked to extract text, describe layout, and identify the language/context.
| Language | Time | Language Detected | OCR Quality | Key Issues |
|---|---|---|---|---|
| 🇬🇧 English | 98.5s | ✅ Correct | ✅ Good | Accurate extraction of headings, nav, TOC. Minor truncation of body text. |
| 🇷🇺 Russian | 15.1s | ❌ Korean | ❌ Failed | Hallucinated Korean text (검색, 목차, 서론) instead of Russian. Complete language confusion. Short response suggests an early bail-out. |
| 🇨🇳 Chinese | 98.1s | ⚠️ Partial | ⚠️ Poor | Detected Chinese characters but heavily hallucinated body text. Nav items partially correct (維基百科, 搜尋). Article body is mostly fabricated gibberish. |
| 🇯🇵 Japanese | 94.8s | ⚠️ Partial | ❌ Failed | Detected ウィキペディア correctly, then hallucinated Arabic text for the article body: complete script confusion between Japanese and Arabic. |
| 🇰🇷 Korean | 98.4s | ✅ Correct | ⚠️ Moderate | Correctly identified Korean; extracted 위키백과, 문제 해결, TOC sections. Some nav text garbled. Best Asian-language result. |
| 🇸🇦 Arabic | 17.8s | ✅ Correct | ⚠️ Moderate | Correctly identified Arabic and ويكيبيديا. Title slightly wrong (dropped the ح from حل المشكلات). TOC sections mostly correct. Short response. |
| 🇹🇭 Thai | 98.2s | ❌ Korean | ❌ Failed | Hallucinated Korean text (검색, 최근 변경) instead of Thai. Complete language confusion; zero Thai characters extracted. |
| 🇮🇳 Hindi | 14.4s | ❌ Korean | ❌ Failed | Only विकिपीडिया extracted correctly. The rest is hallucinated Korean (검색, 찾으려면). Devanagari almost entirely missed. |
| 🇩🇪 German | 31.2s | ✅ Correct | ✅ Good | Accurate: "Problem lösen", nav items, TOC, theorist names (Duncker, Newell & Simon). Minor: "Queltext" instead of "Quelltext". |
| 🇧🇷 Portuguese | 22.3s | ✅ Correct | ✅ Good | Accurate: "Resolução de problemas", nav items, article intro. Clean extraction of Latin-script text. |
✅ Correctly extracted: "Problem solving" title, TOC (Definition, Processes, Problem-solving methods, Common barriers, Cognitive sciences...), Wikipedia nav elements. Good accuracy on Latin script.
❌ Model hallucinated Korean text for a Russian page. Output: 검색 (Search), 목차 (Table of Contents), 서론 (Introduction); all Korean. Zero Cyrillic text extracted. Identified the language as Korean.
⚠️ Partial: 維基百科 and 搜尋維基百科 correctly extracted, but the article body is heavily hallucinated: fabricated sentences with real-looking Chinese characters that don't match the actual page content.
❌ Correctly got ウィキペディア but then hallucinated an Arabic-script string as the article title (repeated 13+ times). Mixed Japanese nav with Arabic body text: a bizarre cross-script hallucination.
✅ Best Asian-language result. Correct: 위키백과, 문제 해결, 검색, 기부, 계정 만들기, 로그인. TOC sections partially correct. Banner text about the editing period detected. Some garbled characters in the body.
⚠️ Correctly identified Arabic and ويكيبيديا. The title dropped the initial ح of حل المشكلات. TOC sections correct: التعريف, علم النفس, العلوم المعرفية. Short response; RTL handling OK.
❌ Complete failure: the entire response is in Korean (검색, 최근 변경, 도움말). Zero Thai characters extracted. The model seems to default to Korean when it can't read a script.
❌ Only विकिपीडिया extracted correctly (one word). The rest is hallucinated Korean (검색, 찾으려면). Identified as Korean. Devanagari script is almost completely unreadable to the model.
✅ Correct: "Problem lösen", Suchen, Jetzt spenden, Benutzerkonto erstellen, Anmelden. TOC and theorist names accurate. Minor: "Queltext" instead of "Quelltext" (one character). Umlauts handled correctly (ö, ü).
✅ Correct: "Resolução de problemas", Pesquisar na Wikipédia, Procurar, Doar, Criar conta, Iniciar sessão. Article intro accurately transcribed. Accented characters (ã, ç, é) handled correctly.
Script-based performance tiers:
✅ Tier 1, Latin script (EN, DE, PT): Excellent. Accurate text extraction, correct language identification, proper handling of diacritics (ö, ü, ã, ç, é). Response times 22–98s.
⚠️ Tier 2, Arabic script (AR): Moderate. Language correctly identified, RTL layout understood, most TOC items correct. Minor character errors (the dropped ح in حل). Short responses suggest limited confidence.
⚠️ Tier 3, CJK scripts (ZH, KO): Mixed. Korean is the best among the Asian languages, with correct language ID and key terms. Chinese gets headers partially correct but hallucinates body text.
❌ Tier 4, other scripts (RU, JA, TH, HI): Failed. The model defaults to Korean hallucinations when it cannot read the script. Russian (Cyrillic), Thai, and Hindi (Devanagari) are essentially unreadable. Japanese triggers bizarre Arabic hallucinations.
Key finding: the model has a strong Korean bias. When uncertain, it generates Korean text regardless of the actual script. This suggests the 4-bit quantization may have degraded multilingual OCR capability, particularly for non-Latin scripts. The model appears either to have been fine-tuned on Korean or to retain stronger Korean weights than for other Asian languages.
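The Korean-bias finding can be checked mechanically by measuring what fraction of the extracted characters falls in the expected Unicode script. A minimal sketch, using rough code-point ranges rather than the full Unicode Script property:

```python
import unicodedata

# Rough code-point ranges for the scripts in this test.
# An approximation: a rigorous check would use the Unicode Script property.
SCRIPT_RANGES = {
    "latin":      [(0x0041, 0x024F)],
    "cyrillic":   [(0x0400, 0x04FF)],
    "arabic":     [(0x0600, 0x06FF)],
    "devanagari": [(0x0900, 0x097F)],
    "thai":       [(0x0E00, 0x0E7F)],
    "hangul":     [(0xAC00, 0xD7A3), (0x1100, 0x11FF)],
    "cjk":        [(0x4E00, 0x9FFF)],
    "kana":       [(0x3040, 0x30FF)],
}

def script_fraction(text: str, script: str) -> float:
    """Fraction of letter characters in `text` that belong to `script`."""
    letters = [c for c in text if unicodedata.category(c).startswith("L")]
    if not letters:
        return 0.0
    ranges = SCRIPT_RANGES[script]
    hits = sum(1 for c in letters if any(lo <= ord(c) <= hi for lo, hi in ranges))
    return hits / len(letters)
```

On the Russian run above, `script_fraction(output, "cyrillic")` would be 0.0 while `script_fraction(output, "hangul")` would be near 1.0, turning "hallucinated Korean" into a number.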
Response-time pattern: fast responses (14–22s) correlate with poor quality; the model "gives up" quickly on scripts it can't read. Long responses (94–98s) indicate the model is trying harder and hitting max_tokens.
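Per-language timings like these can be collected with a small client against a local OpenAI-compatible /v1/chat/completions endpoint. A sketch in which the URL, model name, and prompt are assumptions, not the setup actually used above:

```python
import base64
import json
import time
import urllib.request

# Assumed local server URL and model name; adjust for your setup.
ENDPOINT = "http://localhost:8080/v1/chat/completions"
PROMPT = ("Extract the visible text, describe the layout, "
          "and identify the language and context of this screenshot.")

def build_request(image_bytes: bytes, model: str = "gemma-4-e4b") -> dict:
    """OpenAI-style chat payload with the screenshot inlined as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": 2000,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

def timed_analyze(image_bytes: bytes) -> tuple[float, str]:
    """Send one screenshot and return (elapsed_seconds, response_text)."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(image_bytes)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return time.monotonic() - start, body["choices"][0]["message"]["content"]
```

A benchmark loop would call `timed_analyze` once per language screenshot; `build_request` is the only part that doesn't need a running server.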
We attempted to run mlx-community/DeepSeek-OCR-2-8bit on the same MacBook Pro (M1 Pro, 16 GB) setup for comparison. After significant setup effort (upgrading from Python 3.9 to 3.13, installing mlx-vlm 0.4.4 plus torch and torchvision, patching model modules), the model loaded but produced completely degenerate output: repetitive loops of questions such as "What is the article about?" or "What are the interactive installations?" for all 10 languages.
1. Python 3.9 incompatibility: the system Python (Xcode's 3.9) couldn't run mlx-vlm ≥0.3.x, which needs scipy ≥1.15.3. Had to switch to Python 3.13 (/usr/local/bin).
2. Model architecture not supported: mlx-vlm 0.1.15 had no deepseekocr_2 module. Even after manual patching, BaseModelConfig was missing from the old base.py.
3. Processor class mismatch: the model config references DeepseekVLV2Processor, which is not registered in transformers' AutoProcessor registry. The custom mlx-vlm-server.py couldn't load it.
4. Solution: used the official mlx_vlm.server, which handles model-specific processors internally. CLI generation worked (178 tokens/sec), but the server API produced garbage.
| Language | Gemma 4 E4B (4-bit) Time | Gemma Lang ID | Gemma OCR Quality | DeepSeek-OCR-2 (8-bit) Time | DeepSeek Lang ID | DeepSeek OCR Quality |
|---|---|---|---|---|---|---|
| 🇺🇸 English | 98.5s | ✅ | ✅ Good | 31.4s | ❌ | ❌ Degenerate loop |
| 🇷🇺 Russian | 15.1s | ❌ Korean | ❌ Failed | 1.9s | ❌ | ❌ Empty |
| 🇨🇳 Chinese | 98.1s | ⚠️ Partial | ⚠️ Poor | 29.8s | ❌ | ❌ Degenerate loop |
| 🇯🇵 Japanese | 94.8s | ⚠️ Partial | ❌ Failed | 2.0s | ❌ | ❌ Empty |
| 🇰🇷 Korean | 98.4s | ✅ | ⚠️ Moderate | 29.8s | ❌ | ❌ Degenerate loop |
| 🇸🇦 Arabic | 17.8s | ✅ | ⚠️ Moderate | 29.8s | ❌ | ❌ Degenerate loop |
| 🇹🇭 Thai | 98.2s | ❌ Korean | ❌ Failed | 2.0s | ❌ | ❌ Empty |
| 🇮🇳 Hindi | 14.4s | ❌ Korean | ❌ Failed | 29.7s | ❌ | ❌ Degenerate loop |
| 🇩🇪 German | 31.2s | ✅ | ✅ Good | 29.9s | ❌ | ❌ Degenerate loop |
| 🇵🇹 Portuguese | 22.3s | ✅ | ✅ Good | 1.9s | ❌ | ❌ Empty |
DeepSeek-OCR-2-8bit via mlx_vlm.server is non-functional for screenshot analysis.
The model generates two failure modes:
1. Degenerate loops (6/10 languages): produces repetitive questions or phrases ("What is the article about?" ×500) until hitting max_tokens. Time: ~30s. These appear to be cases where the model sees the image but enters a self-referential loop.
2. Empty output (4/10 languages: RU, JA, TH, PT): returns nothing in ~2s. Likely the model immediately emits an EOS token.
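Both failure modes are easy to flag automatically. A heuristic sketch; the length and repetition thresholds are arbitrary assumptions:

```python
from collections import Counter

def classify_output(text: str, min_len: int = 20, loop_threshold: int = 10) -> str:
    """Classify a model response as 'empty', 'degenerate-loop', or 'ok'.

    A response counts as a degenerate loop when one sentence-like chunk
    repeats many times, e.g. "What is the article about?" x500.
    Both thresholds are arbitrary choices, not tuned values.
    """
    stripped = text.strip()
    if len(stripped) < min_len:
        return "empty"
    # Split on question marks and newlines, then count exact repeats.
    chunks = [c.strip() for c in stripped.replace("?", "?\n").splitlines() if c.strip()]
    most_common = Counter(chunks).most_common(1)
    if most_common and most_common[0][1] >= loop_threshold:
        return "degenerate-loop"
    return "ok"
```

`classify_output("")` returns "empty" and `classify_output("What is the article about? " * 500)` returns "degenerate-loop", matching the two modes observed here.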
Root cause hypothesis: The mlx_vlm.server may not be correctly applying the DeepSeek-OCR-2 chat template or image preprocessing. The CLI tool (mlx_vlm.generate) works for text-only prompts (178 tokens/sec), suggesting the model weights are loaded correctly but the server's image handling is broken for this architecture.
Verdict: Gemma 4 E4B decisively wins this comparison. Despite its own multilingual limitations, it at least produces structured, relevant output for Latin-script languages and partial output for Asian scripts; DeepSeek-OCR-2 via the mlx-vlm server produces no usable output for any language.
Gemma 4 E4B running via MLX on a MacBook Pro (M1 Pro, 16 GB) demonstrates excellent vision capabilities for screenshot-analysis tasks:
✅ Reliability: All 3 test sites processed successfully with no infinite loops or generation failures, a critical improvement over Qwen2.5-VL, which suffered an infinite loop on Wikipedia (122s timeout).
✅ Speed: Consistent 13–26s response times, averaging ~21s per screenshot at 1920×1080; significantly faster than Qwen2.5-VL on complex pages.
✅ Scaling: Successfully processes images from 800×600 up to 17800×10013; the limit is the ~4.45 MB base64 payload size, not the resolution. All resolutions produced valid structured output.
⚠️ Accuracy: Some text transcription errors on small/dense text (misread URLs, garbled words on HN). Layout and context understanding are solid. GitHub repo names had minor OCR-style errors (e.g., "olama" instead of "ollama").
❌ Multilingual: Major weakness. Only Latin-script languages (EN, DE, PT) work reliably, and Korean is the only well-supported Asian language. Cyrillic (Russian), Thai, and Hindi (Devanagari) fail completely: the model hallucinates Korean text. Chinese and Japanese partially work, but with heavy hallucination. Arabic is moderate. The 4-bit quantization likely degrades non-Latin script recognition significantly.
Verdict: For local vision-model use on Apple Silicon, Gemma 4 E4B via MLX is excellent for Latin-script content: fast, stable, and capable of handling high-resolution screenshots. It is not, however, suitable for multilingual OCR beyond Latin and Korean scripts. For CJK, Arabic, Cyrillic, Thai, or Devanagari content, a larger model or full-precision weights are recommended.