
AI Agent Discovery: How LLMs Find, Index, and Recommend Software Tools End to End

The complete pipeline from LLM web crawl to product recommendation — how structured data, indexing, query decomposition, and agent selection work together to surface software tools.

By Web MCP Guide · April 21, 2026 · 7 min read



> TL;DR
> - LLMs build product knowledge by crawling the web and extracting structured data — this happens before any user query
> - The pipeline: crawl → schema extraction → indexing → query decomposition → filtering → ranking → recommendation/purchase
> - Every stage has a failure point. This guide covers them all.
> - Make your product crawlable and indexable →

Updated: April 21, 2026

---

Stage 0: The Crawl (Before Any Query)

AI product discovery starts long before a user types a query. It starts when LLM crawlers visit your site.

Active crawlers as of 2026:

  • GPTBot — OpenAI (powers ChatGPT recommendations)

  • ClaudeBot — Anthropic (powers Claude's web knowledge)

  • PerplexityBot — Perplexity AI (powers Perplexity's product index)

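  • Google-Extended — Google (a robots.txt control token rather than a separate crawler; it governs whether content fetched by Googlebot may be used for Google's Gemini models)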

  • Applebot — Apple Intelligence

  • Bytespider — TikTok / ByteDance AI

  • Meta-ExternalAgent — Meta AI

These crawlers visit your site and extract two types of data:

  • Structured data (JSON-LD schema): parsed directly. It is clean, high-confidence, and immediately usable in product indexes. Fields such as name, price, featureList, and applicationCategory can be read without guesswork.

  • Unstructured HTML: parsed with NLP, at lower confidence. The crawler has to infer what is a price, what is a feature, and what is a description, so errors are common. This data is often not used for product indexes at all.

Products with structured data get clean, accurate records. Products without it get noisy, incomplete records, or no record at all.
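
To make the contrast concrete, here is a minimal sketch of how structured data can be read straight out of a page. It uses only the Python standard library. The product name, price, and feature values in `SAMPLE_HTML` are made up for illustration, and the `JsonLdExtractor` class is a hypothetical helper, not a description of how any particular crawler is implemented.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD yields no record


def extract_jsonld(html):
    """Return every JSON-LD object found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Illustrative page: all values below are invented for this example.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "ExampleTool",
  "applicationCategory": "DeveloperApplication",
  "featureList": ["API access", "Team workspaces"],
  "offers": {"@type": "Offer", "price": "29.00", "priceCurrency": "USD"}
}
</script>
</head><body><p>Plans start at around $29.</p></body></html>
"""

records = extract_jsonld(SAMPLE_HTML)
product = records[0]
print(product["name"], product["applicationCategory"], product["offers"]["price"])
```

With the schema present, the name, category, and price come back as exact fields. A page that only contains the sentence "Plans start at around $29." returns an empty list from `extract_jsonld`, and anything further would have to be inferred from prose.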

What to check right now: Open your server access logs and search for GPTBot, ClaudeBot, and PerplexityBot. They are likely already visiting your site. Which pages are they requesting, and what do they find there?
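
One way to run that check is a short script over the access log. The sketch below assumes the common "combined" log format, where the request line and user agent both appear in each entry. The `crawler_paths` function and the sample log lines are hypothetical, and the user-agent strings are simplified for illustration. Because it matches on substrings of a user agent that anyone can spoof, treat the result as indicative rather than authoritative.

```python
import re
from collections import Counter, defaultdict

# Crawler tokens named in this article; extend the list as needed.
CRAWLER_TOKENS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

# Pulls the requested path out of a combined-format request line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+"')


def crawler_paths(log_lines, tokens=CRAWLER_TOKENS):
    """Map each crawler token to a Counter of the paths it requested."""
    paths = defaultdict(Counter)
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a recognizable request line
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                paths[token][match.group(1)] += 1
                break
    return paths


# Invented log entries with simplified user-agent strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [21/Apr/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [21/Apr/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [21/Apr/2026:10:06:00 +0000] "GET /docs HTTP/1.1" 200 9000 "-" "ClaudeBot/1.0"',
    '192.0.2.9 - - [21/Apr/2026:10:07:00 +0000] "GET / HTTP/1.1" 200 3000 "-" "Mozilla/5.0"',
]

result = crawler_paths(SAMPLE_LOG)
for token, counter in result.items():
    print(token, dict(counter))
```

To run it against a real log, pass an open file object instead of `SAMPLE_LOG`, for example `crawler_paths(open(path))`, where `path` is wherever your server writes its access log.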

---

Stage 1: Schema Extraction and Indexing

When a crawler finds