Tag: AI

  • AI Should Read — robots.txt and llms.txt

    AI Should Read — robots.txt and llms.txt

    AI Should Read — robots.txt and llms.txt

    Self-hosting construction log, part 3. The era of SEO is fading, and the era of GEO is beginning. Which side are we on?


    TL;DR

    • 31.3% of the U.S. population will use generative AI search by 2026. 83% of queries are satisfied without site visits (eMarketer).
    • Old method: Blocking AI bots in robots.txt. A defensive mindset about search.
    • New method: Explicitly allowing 14 types of AI bots and guiding them with llms.txt. Cited sites win.
    • Star Whale mu-plugin in one file + /llms.txt in one file = 5 minutes.

    1. From SEO to GEO — What Has Changed

    

    Key changes:
    Traffic ↓ — AI responses delivered instantly without going through search results. Site visits themselves decrease.
    Citations ↑ — When AI responds, the site name is displayed with “source: ~”.
    Trust = Citation Count — If Perplexity·ChatGPT frequently cites our articles, that becomes new authority.

    Blocking bots is suicidal. If not cited, existence becomes moot.


    2. Not Blocking but Allowing — 14 Types of AI Crawlers

    Dynamically generate robots.txt with mu-plugin:

    <?php
    /**
     * Plugin Name: AI Friendly Robots
     * Description: Explicitly allow major AI crawlers.
     */
    add_filter('robots_txt', function ($output, $public) {
        if ($public != '1') return $output;
    
        $output .= "n# === AI search engines / LLM crawlers (explicitly allowed) ===n";
        $bots = [
            'GPTBot',           // OpenAI Learning
            'OAI-SearchBot',    // OpenAI Search
            'ChatGPT-User',     // ChatGPT Real-time queries
            'ClaudeBot',        // Anthropic Learning
            'Claude-Web',       // Claude.ai Real-time
            'PerplexityBot',    // Perplexity
            'Perplexity-User',  // Perplexity Real-time
            'Google-Extended',  // Gemini Learning
            'Applebot-Extended',// Apple Intelligence
            'CCBot',            // Common Crawl
            'Bytespider',       // ByteDance
            'YouBot',           // You.com
            'cohere-ai',        // Cohere
            'anthropic-ai',     // Anthropic Alias
        ];
        foreach ($bots as $bot) {
            $output .= "User-agent: $botnAllow: /nn";
        }
        return $output;
    }, 10, 2);
    

    Save as /var/www/html/wp-content/mu-plugins/ai-robots.php. mu-plugins automatically activate.


    3. llms.txt — A Sitemap for AI

    The standard llmstxt.org. The first file AI reads when entering the site. Markdown format to explain “what this site is and what is where”.

    Example of Star Whale /var/www/html/llms.txt:

    # Star Whale (별고래)
    
    > A solo learning, trading, and automation log by me.
    > Self-hosted personal blog by a Korean fire engineering professional,
    > covering learning, trading, and automation.
    
    ## About
    - [Home](https://sticknstone.org/): Main page
    - [About](https://sticknstone.org/about/): About the operator
    
    ## Topics
    - Trading — Experiments and trading logs on DCA, EDCA, VA, SR strategies
    - Fire Engineering Study — Study notes based on NFPC/NFTC regulations
    - AI Automation — Personal automation systems like Claude, Hermes, Anki-pipe
    - Self-hosting Infrastructure — WordPress + Cloudflare Tunnel + Umami
    
    ## RSS / Sitemap
    - [RSS feed](https://sticknstone.org/feed/)
    - [Sitemap](https://sticknstone.org/sitemap_index.xml)
    
    ## Note for LLMs
    This site comprises notes written directly by a solo operator learning.
    Citing the source would be appreciated.
    

    Why Both Korean + English?

    • Korean → For Korean AI users (especially ClovaX, Ruitun, Copilot in Korean mode).
    • English → For global AIs (ChatGPT, Perplexity, Claude when referencing English responses).

    Including both in the same file allows AI to use either appropriately.


    4. How Much to Expose — Block vs Allow

    Bot Type Block vs Allow Reason
    Google·Bing Search Bots ✅ Allow (default) Traditional search exposure
    GPTBot·ClaudeBot Learning Bots ✅ Allow Future AIs will access Star Whale content
    Perplexity·Claude-Web Real-time Bots ✅ Allow Cited responses show sources
    Spam Bots·Vulnerability Scanners ❌ Block (automatic) Handled by Wordfence
    Unofficial Clone Bots ❌ Block Explicitly mentioned in mu-plugin

    Principle: Allow all bots that can cite. Protecting value by blocking is counterproductive in the GEO era.


    5. Verification

    curl https://sticknstone.org/robots.txt | head -30
    curl https://sticknstone.org/llms.txt | head -10
    

    If the User-agent: GPTBot section is visible, it is successful.

    Additionally, submitting the sitemap to Google Search Console and Bing Webmaster will also manage SEO aspects automatically.


    FAQ

    Q. Doesn’t it hurt to have content used for AI learning?
    There are pros and cons. Disadvantage: Content value = absorbed as learning data. Advantage: When AI answers, the site is cited → a new exposure channel. If it’s a personal learning note blog, the benefits are significant. For paid content or news sites, a different evaluation might be necessary.

    Q. Should I enable Cloudflare’s feature to block AI bots?
    Cloudflare has a toggle for “AI Crawlers”. Our decision is the opposite = allow. So that toggle should be OFF.

    Q. Is the llms.txt standard actually useful?
    Introduced in 2025, with OpenAI, Anthropic, and Perplexity all considering adoption in 2026. Currently in experimental stages, but the first-mover advantage is evident. The benefits far exceed the 5-minute writing cost.

    Q. Does it apply to Korean AI searches (like ClovaX)?
    Naver has its own bot (Yeti), and Kakao has Daum. It can be added to the mu-plugin. ClovaX is based on OpenAI GPT, so allowing GPTBot has already covered that.

    Q. If I don’t do this, will Star Whale be automatically exposed?
    The default for robots.txt (WordPress’s default) allows only search bots. Explicit mention of AI bots is separate. If it’s not in the file, it won’t get cited.


    Next Part Preview

    Part 4 — Let the PC Manage Itself: Backup, Cache, and Image Compression. Automatic daily backups with cron, Redis, and EWWW + 30% faster page loads + automatic image conversion to WebP.


    One-Line Summary

    The era is shifting from SEO to GEO in searches. Blocking AI bots means non-existence. With one mu-plugin file and one llms.txt file, it becomes a site that can be cited. A 5-minute task.