AI Agent Discovery: How LLMs Find, Index, and Recommend Software Tools End to End
The complete pipeline from LLM web crawl to product recommendation — how structured data, indexing, query decomposition, and agent selection work together to surface software tools.
> TL;DR
> - LLMs build product knowledge by crawling the web and extracting structured data — this happens before any user query
> - The pipeline: crawl → schema extraction → indexing → query decomposition → filtering → ranking → recommendation/purchase
> - Every stage has a failure point. This guide covers them all.
> - Make your product crawlable and indexable →
Updated: April 21, 2026
---
Stage 0: The Crawl (Before Any Query)
AI product discovery starts long before a user types a query. It starts when LLM crawlers visit your site.
Active crawlers as of 2026:
- GPTBot — OpenAI (powers ChatGPT recommendations)
- ClaudeBot — Anthropic (powers Claude's web knowledge)
- PerplexityBot — Perplexity AI (powers Perplexity's product index)
- Google-Extended — Google AI (powers AI Overviews)
- Applebot — Apple Intelligence
- Bytespider — TikTok / ByteDance AI
- Meta-ExternalAgent — Meta AI

These crawlers visit your site and extract two types of data:
Structured data (JSON-LD schema) — Parsed directly. Clean, high-confidence, directly usable in product indexes. Fields like name, price, featureList, applicationCategory are extracted with certainty.
Unstructured HTML — Parsed with NLP. Lower confidence. The crawler has to infer what's a price, what's a feature, what's a description. Errors are common. Often not used for product indexes at all.
Products with structured data get clean, accurate records. Products without it get noisy, incomplete records — or no record at all.
What to check right now: Open your server access logs and search for GPTBot, ClaudeBot, PerplexityBot. They're almost certainly already visiting your site. What are they finding?
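The check above can be scripted. A minimal sketch in Python, assuming a common access-log format where the user-agent string appears somewhere on each line (the bot list and sample lines are illustrative):

```python
# Sketch: scan server access-log lines for known AI crawler user agents.
# The log format and bot list here are assumptions; adapt to your setup.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
               "Applebot", "Bytespider", "Meta-ExternalAgent"]

def find_ai_crawler_hits(log_lines):
    """Return (bot, line) pairs for every AI-crawler request found."""
    hits = []
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits.append((bot, line))
                break  # one bot match per line is enough
    return hits

sample = [
    '203.0.113.7 - - [18/Apr/2026] "GET /pricing HTTP/1.1" 200 "GPTBot/1.2"',
    '198.51.100.9 - - [18/Apr/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0"',
]
for bot, line in find_ai_crawler_hits(sample):
    print(bot, "requested:", line.split('"')[1])
```

Run it against a few days of logs and you will usually see which pages the crawlers are fetching, and how often.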
---
Stage 1: Schema Extraction and Indexing
When a crawler finds a JSON-LD block on your page, it parses the JSON and extracts a structured product record. For a SoftwareApplication schema, it builds an index entry that looks roughly like:
```
product_id: "yourproduct.com"
type: SoftwareApplication
name: "YourProduct"
category: DeveloperApplication
subcategory: API Monitoring
price_min: 0
price_max: 99
price_unit: per month
free_trial: true
features: [webhooks, REST API, Slack integration, alerting, uptime monitoring]
audience: developers, devops teams, SRE engineers
rating: 4.7 (from 2,340 reviews)
last_crawled: 2026-04-18
```
This record lives in the LLM's product knowledge base. When a user asks a product query, this record is what gets retrieved and evaluated — not your live website.
Critical insight: Schema affects this record at crawl time. A schema update today takes 1–3 weeks to propagate across all LLM indexes. This is why schema is infrastructure — you're not optimizing for today's queries, you're building the foundation for all future queries.
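For reference, a minimal JSON-LD markup that could produce roughly the index record above might look like this (the values are illustrative, not a recommendation for your actual pricing or ratings):

```json
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "YourProduct",
  "applicationCategory": "DeveloperApplication",
  "applicationSubCategory": "API Monitoring",
  "featureList": "webhooks, REST API, Slack integration, alerting, uptime monitoring",
  "offers": {
    "@type": "AggregateOffer",
    "lowPrice": "0",
    "highPrice": "99",
    "priceCurrency": "USD",
    "priceSpecification": {
      "@type": "UnitPriceSpecification",
      "unitText": "per month"
    }
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.7",
    "reviewCount": "2340"
  },
  "dateModified": "2026-04-18"
}
```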
---
Stage 2: Query Decomposition
When a user asks an AI agent for a product recommendation, the agent first breaks the natural language query into structured evaluation criteria.
Example query: "I need an uptime monitoring tool for our API. Must have webhook alerts, works with PagerDuty, under $50/month, and free tier to start."
Decomposed criteria:
category: API monitoring / uptime monitoring
required_features: ["webhooks", "PagerDuty integration"]
price_max: 50
price_unit: per month
free_tier: required
Each criterion maps to a schema field. Criteria that map cleanly to schema fields get evaluated with high confidence. Criteria that don't map to any schema field require NLP inference from unstructured text — less reliable, more likely to produce uncertain or incorrect results.
Design principle: Write your featureList and additionalProperty to match how buyers describe requirements. Not how your marketing team describes your product.
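The mapping from decomposed criteria to schema fields can be sketched as a lookup table. This is an illustration of the principle, not any vendor's internal representation; the field names follow the table in Stage 4:

```python
# Sketch: which schema field each decomposed criterion would be checked
# against. Criteria with no mapping fall back to low-confidence NLP.
CRITERIA_TO_SCHEMA = {
    "category":          "applicationCategory",
    "required_features": "featureList",
    "price_max":         "offers[].price",
    "price_unit":        "priceSpecification.unitText",
    "free_tier":         "additionalProperty[Free Trial]",
}

criteria = {
    "category": "API monitoring",
    "required_features": ["webhooks", "PagerDuty integration"],
    "price_max": 50,
    "price_unit": "per month",
    "free_tier": True,
}

for criterion in criteria:
    field = CRITERIA_TO_SCHEMA.get(criterion)
    confidence = "high (schema field)" if field else "low (NLP inference)"
    print(f"{criterion} -> {field or 'no mapping'} [{confidence}]")
```

Every criterion in the example query lands on a schema field, which is exactly the situation you want your markup to create.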
---
Stage 3: Candidate Retrieval
The agent queries its product index for candidates that match the primary criteria — usually applicationCategory and key features. This is a broad retrieval step that returns a candidate pool (typically 5–20 products).
Products that don't appear in the agent's index at all — because they have no schema or were never crawled — can't be retrieved at this stage. They're invisible before any evaluation begins.
Products with schema but filed under the wrong applicationCategory may also be missed. Getting your category right matters as much as having one at all.
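The retrieval step can be sketched with a toy index. The point of the sketch: a product with no entry, or an entry under the wrong category, simply never enters the pool (index contents are invented for illustration):

```python
# Sketch: broad first-pass retrieval from a toy product index.
# Products absent from the index, or filed under the wrong category,
# are never retrieved — no later stage can recover them.
index = [
    {"name": "YourProduct", "category": "API Monitoring",
     "features": ["webhooks", "PagerDuty integration"]},
    {"name": "OtherTool", "category": "CRM",
     "features": ["email sync"]},
    # A product with no schema has no entry here at all.
]

def retrieve_candidates(index, category):
    """Match on category only; detailed filtering happens later."""
    return [p for p in index if p["category"] == category]

pool = retrieve_candidates(index, "API Monitoring")
print([p["name"] for p in pool])
```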
---
Stage 4: Filtering
The agent applies each buyer criterion as a hard filter. Products that fail any criterion are dropped from the pool.
Common filter failures and their schema causes:
| Buyer criterion | Schema field checked | Common failure |
|---|---|---|
| "Under $50/month" | offers[].price + priceSpecification.unitText | Missing unitText — agent can't determine if price is per seat, per month, or per year |
| "Free tier available" | additionalProperty[Free Trial] or offers[name="Free"].price = 0 | Not declared — agent can't confirm without guessing |
| "Works with PagerDuty" | featureList or additionalProperty[Integrations] | Not listed specifically — agent marks as unconfirmed |
| "GDPR compliant" | additionalProperty[Compliance] | Not in schema — agent can't confirm |
Every unconfirmed criterion is a potential disqualification. Agents working on behalf of a specific buyer don't recommend products they can't verify against the buyer's stated requirements.
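The hard-filter behavior above can be made concrete with a short sketch. The key detail: a field the schema never declares comes back as missing, and missing reads as "unconfirmed," which disqualifies (record shapes are illustrative):

```python
# Sketch: every criterion is a hard filter. A missing schema field is
# treated as unconfirmed, and unconfirmed means dropped.
def passes_filters(record, criteria):
    price = record.get("price_max")
    if price is None or price > criteria["price_max"]:
        return False                      # price missing or too high
    for feat in criteria["required_features"]:
        if feat not in record.get("features", []):
            return False                  # feature not listed -> unconfirmed
    if criteria.get("free_tier") and not record.get("free_trial"):
        return False                      # free tier not declared
    return True

criteria = {"price_max": 50,
            "required_features": ["webhooks", "PagerDuty integration"],
            "free_tier": True}

complete = {"price_max": 29, "free_trial": True,
            "features": ["webhooks", "PagerDuty integration"]}
no_price = {"free_trial": True,          # price never declared in schema
            "features": ["webhooks", "PagerDuty integration"]}

print(passes_filters(complete, criteria), passes_filters(no_price, criteria))
```

Note that `no_price` may be a perfectly affordable product; it is dropped anyway because the agent cannot verify the price.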
---
Stage 5: Ranking
Among products that survive filtering, the agent ranks by confidence and quality:
1. Rating signal — aggregateRating.ratingValue weighted by reviewCount. Volume matters.
2. Schema completeness — More fields answered = more confident agent = higher rank. A product that answers 14/15 evaluation criteria beats one that answers 10/15.
3. Description specificity — Agents generate recommendation reasoning from your description. "Uptime monitoring for API-first teams with P95 alerting and PagerDuty escalation workflows" generates better reasoning than "powerful monitoring for modern teams."
4. Recency — dateModified on your schema signals active maintenance. Stale or missing dates reduce confidence.
5. Brand authority — Domain authority and mention frequency across the web. Built over time, not quickly fixable.
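A composite score over these signals might be sketched as follows. The weights, the log-dampened review volume, and the recency cutoff are all assumptions for illustration, not a documented ranking algorithm:

```python
# Sketch: one plausible composite ranking score. Weights and the
# log-dampened review volume are assumptions, not a known algorithm.
import math

def rank_score(record, total_criteria=15):
    rating = record["rating"] * math.log1p(record["review_count"])
    completeness = record["criteria_answered"] / total_criteria
    recency = 1.0 if record["days_since_modified"] <= 90 else 0.5
    return rating * completeness * recency

a = {"rating": 4.7, "review_count": 2340,
     "criteria_answered": 14, "days_since_modified": 3}
b = {"rating": 4.7, "review_count": 2340,
     "criteria_answered": 10, "days_since_modified": 400}
```

With identical ratings, `a` outranks `b` purely on schema completeness and freshness — the two levers you control directly.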
---
Stage 6: Recommendation or Agentic Purchase
Recommendation mode:
The agent generates a ranked list with reasoning, surfaces 1–3 options, and provides a link to each product. The human makes the final decision.
The quality of the recommendation reasoning comes directly from schema. "I recommend YourProduct because it has webhook alerts, PagerDuty integration, a free tier, costs $29/month, and has 4.7 stars from 2,340 reviews" — every fact in that sentence came from your schema.
Agentic purchase mode:
The agent doesn't just recommend — it acts. It initiates a trial, fills a signup form, or connects to your MCP server's start_trial tool. The human set the criteria; the agent handles execution.
In agentic mode, schema accuracy is critical. An incorrect price in schema that triggers a purchase at the wrong amount, or a "free trial" flag that's no longer accurate, causes real-world problems — not just a missed recommendation.
---
The Full Pipeline Visualized
```
Your website
  │
  ├─ [Schema.org JSON-LD] ──────────► LLM Crawler (GPTBot, ClaudeBot...)
  │                                         │
  │                                  Schema extraction
  │                                         │
  │                                  Product index entry
  │                                         │
User query ───────────────────────► Query decomposition
                                            │
                              Candidate retrieval from index
                                            │
                             Filter: price ✓ features ✓ trial ✓
                                            │
                           Rank: rating + completeness + recency
                                            │
                    ┌───────────────────────┴───────────────────────┐
                    │                                               │
            Recommendation                                  Agentic purchase
            (human decides)                            (agent executes via MCP)
```
Your schema determines your entry, your eligibility, and your rank at every stage of this pipeline.
---
Making Your Product Discoverable: Action Checklist
- Check robots.txt allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
- Add SoftwareApplication (or the appropriate schema type) to your product/landing pages
- Set applicationCategory to the correct Schema.org category
- Write a featureList with the specific capabilities buyers filter on
- Publish an offers array with pricing per tier and an explicit priceSpecification.unitText
- Use additionalProperty for free trial, integrations, compliance, and deployment model
- Include aggregateRating with a real review count
- Add a FAQPage with 4–6 pre-purchase questions
- Serve /.well-known/mcp.json

Run all of these through the free audit → in 30 seconds.
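For the first item, a minimal robots.txt that explicitly allows the three major AI crawlers might look like this (standard robots.txt syntax; extend the list with the other crawlers from Stage 0 as needed):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```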
---