Markdown for the LLM

danielkhoo.io


I got inspired by a tweet from @rauchg about making docs and webpages available to AI agents in a token-friendly way. The idea is simple: if an agent is reading the content, all the extra HTML, CSS, and JS is unnecessary, so just serve markdown.

So I decided to see how hard it would be to get this blog to automatically serve raw markdown when requested. The requirement: if a page is requested with the Accept: text/markdown header or with a .md suffix, the server should return raw markdown.

It turned out to be surprisingly easy with Claude Code and the Claude Chrome Extension helping me along. This blog runs on Next.js, which has built-in support for rewrites and middleware. I don't recall the exact syntax, but Claude does. Claude knows all.

Step 1: Generate the raw markdown files

The first step is to extract the markdown content from the MDX files at build time. This blog uses MDX with React components, so we need to strip out the imports, exports, and JSX wrappers to get clean markdown. Look at all this beautiful regex that no human could write without recent practice.

// scripts/generate-markdown.js
import fs from 'fs';
import path from 'path';

const PAGES_DIR = path.join(process.cwd(), 'pages');
const OUTPUT_DIR = path.join(process.cwd(), 'public', 'markdown');

function extractMarkdown(content) {
  let markdown = content;

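  // Remove YAML frontmatter (no `m` flag, so `^` only matches the start of
  // the file and a pair of `---` rules further down is left alone)
  markdown = markdown.replace(/^---[\s\S]*?---\n*/, '');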

  // Remove import statements
  markdown = markdown.replace(/^import\s+.*?;\s*\n/gm, '');

  // Remove export const meta = {...}
  markdown = markdown.replace(/^export\s+const\s+meta\s*=\s*\{[\s\S]*?\};\s*\n/m, '');

  // Remove <Container ...> opening tag
  markdown = markdown.replace(/<Container[^>]*>\s*\n?/g, '');

  // Remove </Container> closing tag
  markdown = markdown.replace(/<\/Container>\s*\n?/g, '');

  // Convert <br /> to newlines
  markdown = markdown.replace(/<br\s*\/?>/g, '\n');

  // Clean up excessive newlines
  markdown = markdown.replace(/\n{3,}/g, '\n\n');

  return markdown.trim() + '\n';
}
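One caveat with the import-stripping regex above: it matches any line that starts with import and ends with a semicolon, including lines inside fenced code blocks (like the snippets in this very post). Below is a minimal sketch of one way around that, assuming fences are opened and closed with backticks or tildes as in CommonMark. stripImportsOutsideFences is a hypothetical helper, not part of the original script.

```javascript
// Hypothetical helper: drop MDX import lines, but leave fenced code untouched.
function stripImportsOutsideFences(source) {
  // A fence line: up to 3 spaces, then 3+ backticks or tildes, then the rest.
  const fenceLine = /^ {0,3}(`{3,}|~{3,})(.*)/;
  let open = null; // marker of the fence we are currently inside, or null
  const kept = [];

  for (const line of source.split('\n')) {
    const match = line.match(fenceLine);
    if (open === null) {
      if (match) {
        open = match[1]; // entering a fenced block
      } else if (/^import\s+.*;\s*$/.test(line)) {
        continue; // top-level MDX import: drop the line
      }
    } else if (
      match &&
      match[1][0] === open[0] &&
      match[1].length >= open.length &&
      match[2].trim() === ''
    ) {
      open = null; // closing fence: same character, at least as long, no info string
    }
    kept.push(line);
  }
  return kept.join('\n');
}
```

If something like this were used, it would stand in for the "Remove import statements" step, with the blank lines it leaves behind collapsed by the existing newline cleanup.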

The script is hooked into the dev and build scripts in package.json:

    "generate-md": "node scripts/generate-markdown.js",
    "dev": "npm run generate-md && next dev",
    "build": "npm run generate-md && next build",

Step 2: Rewrite requests to the raw files

For the .md suffix approach, a simple rewrite rule in next.config.js does the trick:

// next.config.js (this method goes inside the exported config object)
async rewrites() {
  return [
    {
      source: '/:slug.md',
      destination: '/markdown/:slug.md',
    },
  ]
}

For the Accept header approach, we need middleware to intercept requests and check the header:

// middleware.js
import { NextResponse } from 'next/server';

export function middleware(request) {
  const accept = request.headers.get('accept') || '';

  if (accept.includes('text/markdown')) {
    const pathname = request.nextUrl.pathname;
    return NextResponse.rewrite(new URL(`/markdown${pathname}.md`, request.url));
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!api|_next|markdown|favicon.ico|.*\\..*).*)'],
};
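Two edge cases in the middleware above are worth a look: accept.includes('text/markdown') also matches a header such as text/markdown;q=0, which explicitly refuses markdown, and a request for / is rewritten to /markdown/.md. Here is a rough sketch of the two decisions as plain functions, assuming the home page is generated as index.md. That file name is a guess, since the part of the script that writes files isn't shown, and both function names are hypothetical.

```javascript
// Returns true when the Accept header lists text/markdown with a non-zero q-value.
function acceptsMarkdown(acceptHeader) {
  return (acceptHeader || '').split(',').some((entry) => {
    const [type, ...params] = entry.split(';').map((part) => part.trim());
    if (type.toLowerCase() !== 'text/markdown') return false;
    const q = params.find((param) => param.toLowerCase().startsWith('q='));
    // No q parameter means the default weight of 1; q=0 means "not acceptable".
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}

// Maps a page path to its generated markdown file.
// `indexName` is an assumed name for the home page's markdown file.
function markdownPathFor(pathname, indexName = 'index') {
  const trimmed = pathname.replace(/^\/+|\/+$/g, ''); // drop leading and trailing slashes
  return `/markdown/${trimmed === '' ? indexName : trimmed}.md`;
}
```

Inside the middleware, the idea would be to call acceptsMarkdown(request.headers.get('accept')) in place of the includes check and pass markdownPathFor(request.nextUrl.pathname) to new URL(...).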

Now any page on this blog can be fetched as markdown, either by visiting https://danielkhoo.io/markdown-for-the-llm.md or by running curl -H 'Accept: text/markdown' https://danielkhoo.io/markdown-for-the-llm
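For an agent written in JavaScript, the same request as the curl command might look like the sketch below. It assumes a runtime with the standard Request and fetch globals (Node 18+ or a browser). markdownRequest is a made-up helper name, and the HTML fallback in the header is my own addition.

```javascript
// Build a request that asks for the markdown representation of a page.
function markdownRequest(pageUrl) {
  return new Request(pageUrl, {
    // Prefer markdown, but still accept HTML if no markdown version exists.
    headers: { Accept: 'text/markdown, text/html;q=0.8' },
  });
}

// Usage (makes a network call, so it is left commented out here):
// fetch(markdownRequest('https://danielkhoo.io/markdown-for-the-llm'))
//   .then((response) => response.text())
//   .then((markdown) => console.log(markdown));
```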

This all took maybe 10 minutes of me mostly staring at Claude Code grind away.