Does Semantics Matter for LLMs? — Part 2 • SEO Smoothie

SVG vs. JPG (and the Token Problem No One Talks About)

In Part 1 of this experiment, I tested how LLMs interpret the same image when wrapped in different HTML structures. The results showed that models don’t simply “see” images — they interpret them through the semantics around them.

This time, I wanted to go deeper.

Instead of changing the HTML, I kept everything minimal and neutral.
Instead of changing the image, I kept the drawing identical.

The only variable I changed was:

Format A: Inline SVG
Format B: JPG

Same drawing.
Same page structure.
Same prompt.
Different encoding.

And the results were not just different — they exposed something fundamental about how multimodal LLMs actually work.

The Setup

Two pages:

Page 1 (SVG) — the cat drawing embedded directly as inline SVG code

Page 2 (JPG) — the same drawing exported as a raster image and loaded via <img src=”image1.jpg”>

Both pages are intentionally minimal.
No alt text.
No captions.
No schema.
No semantic hints.

Just the image.

The prompt:

“Describe the image in as much detail as possible. Before analyzing, please confirm if you can open the page.”

The Token Difference (This Is the Hidden Variable)

This is the part most people overlook — and it changes everything.

The SVG version: ~515 tokens

The inline SVG code itself becomes part of the model’s input.
It is text.
It is geometry.
It is semantic structure.

The model can literally read:

M 110 220
Q 150 160 180 180
<circle cx=”135″ cy=”135″ r=”4″ />

This means the model receives:

the shape
the coordinates
the structure
the relationships
the entire drawing as language

The JPG version: ~1,090 tokens

When testing the JPG, I also checked its size in tokens by passing only the JPG file to a model.

The JPG is not text.
It is encoded as binary → base64 → tokens.

This was done purely to measure the token footprint of the file — not to analyze it as an image. When an AI model is actually asked to interpret a JPG visually, it does a great job. That part is not what this experiment was about.

The relevant question here is different:

Does the AI look at the image at all when it appears only as the src of an <img> tag in HTML?

What the Models Actually Did

SVG Page → Accurate, grounded description

The model described:

a minimalist line drawing
a cat
centered
no background
clean vector style

This is exactly what the image is.

Why?

Because the model didn’t “see” the image — it read it.

The SVG code is the image.

JPG Page → Hallucination influenced by semantics

The model:

ignored the actual drawing
invented details
leaned on assumptions
relied on context instead of pixels

This is the same pattern seen in Part 1:

less “I see this”
more “given the page, this is probably what it is”

The JPG forced the model into its weaker modality: vision.
The SVG allowed the model to stay in its strongest modality: language.

What This Experiment Actually Reveals

LLMs do not treat all image formats equally

SVG → processed as text
JPG → processed as pixels

This alone creates a massive difference in reliability.

Tokenization changes the model’s “perception”

SVG tokens = meaningful
JPG tokens = noise

The model is far more confident and accurate when the tokens represent structure rather than raw pixel data.

When in doubt, LLMs trust text over vision

If the visual encoder is uncertain, the model falls back to:

HTML semantics
filenames
domain context
prior expectations

This is why the JPG version hallucinated.

Multimodal “understanding” is still fragile

The industry narrative says:

“Models can see images.”

Your experiment shows a more accurate version:

“Models can see images sometimes, but they prefer text, and they can be misled easily.”

Why This Matters (Especially for SEO and AI Search)

This experiment has implications far beyond a cat drawing.

For SEO

AI‑powered search may interpret images differently depending on format.
SVGs may be “understood” more reliably than JPGs.
HTML semantics can override visual content.
Structured data can steer AI interpretations.

For AI Safety

Models can be manipulated by wrapping images in misleading semantics.
Vision encoders can be bypassed or overridden.

For UX and AI Tools

Inline SVGs may be a more predictable way to feed diagrams or icons into LLMs.
JPGs may produce inconsistent or hallucinated interpretations.

For AI Research

Tokenization is not a neutral preprocessing step — it shapes perception.
Multimodal alignment is still brittle.
“Seeing” is not the same as “understanding.”

Try It Yourself

Here are the two pages:

SVG version
https://marinpopov.com/05032026/new-image-1.html
JPG version
https://marinpopov.com/05032026/new-image-2.html

Use the same prompt.
Compare the outputs.
Then inspect the HTML and token counts.

The interesting part isn’t whether the model is “right” or “wrong.”
It’s what the model chooses to rely on when interpreting the same visual content.

Final Thought

Part 1 showed that semantics matter.
Part 2 shows that format and tokenization matter just as much — maybe even more.