Does Semantics Matter for LLMs? — Part 2

SVG vs. JPG (and the Token Problem No One Talks About)

In Part 1 of this experiment, I tested how LLMs interpret the same image when wrapped in different HTML structures. The results showed that models don’t simply “see” images — they interpret them through the semantics around them.

This time, I wanted to go deeper.

Instead of changing the HTML, I kept everything minimal and neutral.
Instead of changing the image, I kept the drawing identical.

[Image: LLM cat test image]

The only variable I changed was the image format:

  • Format A: Inline SVG
  • Format B: JPG

Same drawing.
Same page structure.
Same prompt.
Different encoding.

And the results were not just different — they exposed something fundamental about how multimodal LLMs actually work.

The Setup

Two pages:

  • Page 1 (SVG): the cat drawing embedded directly as inline SVG code
    [Screenshot: source code of the page with the SVG image]
  • Page 2 (JPG): the same drawing exported as a raster image and loaded via <img src="image1.jpg">
    [Screenshot: source code of the page with the JPG image]

Both pages are intentionally minimal.
No alt text.
No captions.
No schema.
No semantic hints.

Just the image.

The prompt:

“Describe the image in as much detail as possible. Before analyzing, please confirm if you can open the page.”

The Token Difference (This Is the Hidden Variable)

This is the part most people overlook — and it changes everything.

The SVG version: ~515 tokens

[Screenshot: SVG token count]

The inline SVG code itself becomes part of the model’s input.
It is text.
It is geometry.
It is semantic structure.

The model can literally read:

  • M 110 220
  • Q 150 160 180 180
  • <circle cx="135" cy="135" r="4" />

This means the model receives:

  • the shape
  • the coordinates
  • the structure
  • the relationships
  • the entire drawing as language
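To make "the drawing as language" concrete, here is a minimal sketch of how little work it takes to recover shapes and coordinates from SVG text. Only the three snippets quoted above come from the test page. The wrapper `<svg>` element, the way the snippets are combined into one path, and the helper names `local_name` and `describe` are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: only "M 110 220", "Q 150 160 180 180" and the
# <circle> come from the article; the rest is an assumed wrapper.
SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 300 300">
  <path d="M 110 220 Q 150 160 180 180" fill="none" stroke="black"/>
  <circle cx="135" cy="135" r="4"/>
</svg>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe(svg_text):
    """Return a flat list of (shape_or_command, numbers...) tuples."""
    shapes = []
    for el in ET.fromstring(svg_text).iter():
        name = local_name(el.tag)
        if name == "circle":
            shapes.append(
                ("circle", float(el.get("cx")), float(el.get("cy")), float(el.get("r")))
            )
        elif name == "path":
            # Simplified path parsing: assumes space-separated plain numbers,
            # which holds for the snippets quoted in the article.
            for letter, args in re.findall(r"([A-Za-z])([^A-Za-z]*)", el.get("d", "")):
                shapes.append((letter, *[float(n) for n in args.split()]))
    return shapes


shapes = describe(SVG)
for shape in shapes:
    print(shape)
```

Every value the parser prints is already sitting in the markup as plain text, which is the same text the model receives.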

The JPG version: ~1,090 tokens

[Screenshot: JPG token count]

To get this number, I passed only the JPG file to a model and checked its size in tokens. This was done purely to measure the token footprint of the file, not to analyze it as an image.

The JPG is not text.
It is encoded as binary → base64 → tokens.

When an AI model is given a JPG directly and asked to interpret it visually, it does a great job. That part is not what this experiment was about.
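The binary → base64 step can be sketched in a few lines. This is only an illustration of the encoding overhead, under the assumption that the file is measured as base64 text, as described above. The payload is arbitrary stand-in bytes, not the actual JPG, and `base64_length` is a hypothetical helper name.

```python
import base64
import math


def base64_length(num_bytes: int) -> int:
    """Characters of padded base64 text needed for num_bytes of binary data."""
    return 4 * math.ceil(num_bytes / 3)


# Stand-in payload: arbitrary bytes, NOT the article's JPG file.
payload = bytes(range(256)) * 4  # 1024 bytes
encoded = base64.b64encode(payload).decode("ascii")

print(len(payload), "bytes ->", len(encoded), "base64 characters")
print(encoded[:40])  # opaque text: no shapes or coordinates to read
```

Base64 turns every 3 bytes into 4 characters, so the text is about a third larger than the file. How many tokens that text becomes depends on the tokenizer, so the sketch does not try to estimate it.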

The relevant question here is different:

Does the AI look at the image at all when it appears only as the src of an <img> tag in HTML?
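One way to frame that question is to look at what is literally present in each page's HTML before any image file is fetched. The sketch below uses Python's standard `html.parser` for this. The two page strings are simplified stand-ins, not the real test pages, and `ImageTextCollector` and `collect` are hypothetical names.

```python
from html.parser import HTMLParser

# Simplified stand-ins for the two test pages (assumed markup).
SVG_PAGE = (
    '<html><body><svg viewBox="0 0 300 300">'
    '<path d="M 110 220 Q 150 160 180 180"/>'
    '<circle cx="135" cy="135" r="4"/>'
    "</svg></body></html>"
)
JPG_PAGE = '<html><body><img src="image1.jpg"></body></html>'


class ImageTextCollector(HTMLParser):
    """Records <img> tags and everything inside inline <svg> elements."""

    def __init__(self):
        super().__init__()
        self.svg_depth = 0
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "svg":
            self.svg_depth += 1
        if tag == "img" or self.svg_depth > 0:
            self.found.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if tag == "svg" and self.svg_depth > 0:
            self.svg_depth -= 1


def collect(html):
    parser = ImageTextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


print("SVG page:", collect(SVG_PAGE))
print("JPG page:", collect(JPG_PAGE))
```

For the SVG page the HTML carries the path data and the circle's coordinates. For the JPG page it carries a filename and nothing else, so a reader that never requests `image1.jpg` has no pixels to work from.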

What the Models Actually Did

SVG Page → Accurate, grounded description

The model described:

  • a minimalist line drawing
  • a cat
  • centered
  • no background
  • clean vector style

This is exactly what the image is.

Why?

Because the model didn’t “see” the image — it read it.

The SVG code is the image.

[Screenshot: the AI analysis of the SVG image]

JPG Page → Hallucination influenced by semantics

The model:

  • ignored the actual drawing
  • invented details
  • leaned on assumptions
  • relied on context instead of pixels

This is the same pattern seen in Part 1:

  • less “I see this”
  • more “given the page, this is probably what it is”

The JPG forced the model into its weaker modality: vision.
The SVG allowed the model to stay in its strongest modality: language.

[Screenshot: the AI analysis of the JPG image]

What This Experiment Actually Reveals

LLMs do not treat all image formats equally

  • SVG → processed as text
  • JPG → processed as pixels

In this test, that alone separated an accurate description from a hallucinated one.

Tokenization changes the model’s “perception”

  • SVG tokens = meaningful
  • JPG tokens = noise

The model is far more confident and accurate when the tokens represent structure rather than raw pixel data.

When in doubt, LLMs trust text over vision

If the visual encoder is uncertain, the model falls back to:

  • HTML semantics
  • filenames
  • domain context
  • prior expectations

This is why the model hallucinated on the JPG version.

Multimodal “understanding” is still fragile

The industry narrative says:

“Models can see images.”

This experiment suggests a more accurate version:

“Models can see images sometimes, but they prefer text, and they can be misled easily.”

Why This Matters (Especially for SEO and AI Search)

This experiment has implications far beyond a cat drawing.

For SEO

  • AI‑powered search may interpret images differently depending on format.
  • SVGs may be “understood” more reliably than JPGs.
  • HTML semantics can override visual content.
  • Structured data can steer AI interpretations.

For AI Safety

  • Models can be manipulated by wrapping images in misleading semantics.
  • Vision encoders can be bypassed or overridden.

For UX and AI Tools

  • Inline SVGs may be a more predictable way to feed diagrams or icons into LLMs.
  • JPGs may produce inconsistent or hallucinated interpretations.

For AI Research

  • Tokenization is not a neutral preprocessing step — it shapes perception.
  • Multimodal alignment is still brittle.
  • “Seeing” is not the same as “understanding.”

Try It Yourself

Here are the two pages:

  • SVG version
    https://marinpopov.com/05032026/new-image-1.html
  • JPG version
    https://marinpopov.com/05032026/new-image-2.html

Use the same prompt.
Compare the outputs.
Then inspect the HTML and token counts.

The interesting part isn’t whether the model is “right” or “wrong.”
It’s what the model chooses to rely on when interpreting the same visual content.

Final Thought

Part 1 showed that semantics matter.
Part 2 shows that format and tokenization matter just as much — maybe even more.