What is a Noindex Meta Tag?
Noindex is a meta tag that instructs search engines not to include a specific page in their search results.
The noindex tag can be placed in the <head> section of the page’s HTML, or the same directive can be sent by the web server in an X-Robots-Tag HTTP response header.
If a page has already been indexed by Google and a noindex tag is added later, the page is removed from Google search results after Google crawls the page again.
Example:
<meta name='robots' content='noindex, follow' />
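To make the two placements concrete, here is a minimal Python sketch of how a crawler might check a fetched page for noindex, in either the robots meta tag or the X-Robots-Tag response header. The names RobotsMetaParser and has_noindex are hypothetical, chosen for illustration. The sketch only looks for the literal noindex token and ignores crawler-specific variants, so treat it as an approximation rather than how any particular search engine works.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            content = attr_map.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html, headers=None):
    """Return True if the HTML or the response headers carry noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)

    # HTTP header names are case-insensitive, so compare in lowercase.
    header_value = ""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    return "noindex" in parser.directives or "noindex" in header_directives
```

Calling has_noindex with the example tag above returns True, and so does a page with no meta tag whose response headers include X-Robots-Tag: noindex.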

Noindex Pages and LLM Usage
What “noindex” actually does
- A noindex tag tells search engines not to include a page in their public search results.
- It does not hide the page from the internet.
- It does not block access.
- It does not prevent an AI system from reading it if the system is given the URL or the text directly.
Effect on LLM training
- LLM training datasets are collected long before the model is released.
- Whether a noindex page is included depends entirely on how the dataset was built.
- Some crawlers respect noindex; others don’t.
- Even if a page is included, the model does not store it as a retrievable document; it only learns statistical patterns from the text.
- Therefore, a noindex page cannot be reliably “cited” by a pretrained LLM unless the system explicitly provides that page during inference.
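Because inclusion depends on how the dataset was built, the difference between a crawler that respects noindex and one that does not can be pictured as a single filter step. The following Python sketch is only an illustration and does not describe any real training pipeline. CrawledPage, build_corpus, and the respect_noindex flag are hypothetical names, and the noindex field is assumed to have been recorded when the page was crawled.

```python
from dataclasses import dataclass


@dataclass
class CrawledPage:
    url: str
    text: str
    noindex: bool  # assumed to be recorded at crawl time


def build_corpus(pages, respect_noindex):
    """Return the texts that end up in the (hypothetical) training corpus."""
    corpus = []
    for page in pages:
        # Only a builder that chooses to respect noindex skips the page.
        if respect_noindex and page.noindex:
            continue
        corpus.append(page.text)
    return corpus
```

With respect_noindex set to False, every crawled page ends up in the corpus regardless of its tag, which is why the tag alone offers no guarantee of exclusion.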
Effect on retrieval‑augmented systems (RAG)
This is the only scenario where a noindex page becomes a guaranteed usable source.
If a system:
- downloads the page,
- indexes it in a private vector store,
- or feeds it directly into the LLM’s context window,
then:
- the LLM can read it,
- ground its answer in it,
- and cite it as a source even if the page is noindex.
In RAG systems of this kind, noindex is irrelevant because the content is not discovered through a search engine's index; it is supplied to the system directly.
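The flow described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the store is a plain list instead of a vector store, retrieval is a simple word-overlap score instead of embeddings, and no LLM is called. Page, score, retrieve, and build_prompt are hypothetical names. The point is that the noindex flag is never consulted at any step.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    noindex: bool  # kept for illustration; nothing below reads it


def score(query, text):
    """Count how many distinct query words appear in the text."""
    words = set(text.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in words)


def retrieve(query, store, k=1):
    """Return the k pages whose text overlaps most with the query."""
    ranked = sorted(store, key=lambda page: score(query, page.text), reverse=True)
    return ranked[:k]


def build_prompt(query, pages):
    """Place the retrieved pages in the context window as numbered sources."""
    sources = "\n".join(
        f"[{i}] {page.url}\n{page.text}" for i, page in enumerate(pages, 1)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )
```

A page stored with noindex=True is retrieved and placed in the prompt exactly like any other page, so the model can ground its answer in it and cite its URL.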
What noindex does not do
- It does not prevent an LLM from using the page if the developer/user feeds it in.
- It does not guarantee privacy.
- It does not guarantee exclusion from future datasets.
- It does not make the page “invisible” to AI.
The real‑world truth
- Noindex only affects search engine indexing.
- An LLM itself does not rely on search engine indexing to read a page.
- LLMs can use noindex pages as sources ONLY when the content is explicitly provided to them.
