[SumatraPDF](https://www.sumatrapdfreader.org/free-pdf-reader) is a Windows GUI application, written in C++, for viewing PDF, ePub and comic book files.

Lately I do a lot of my SumatraPDF coding with AI agents: Claude Code, Grok Build, OpenAI Codex.

They're good at writing code. They're less good at knowing if the code works, especially for GUI apps.

## The problem: agents don't drive UI well

Say I ask an agent to fix a bug in PDF text search, or in the new feature that translates selected text via an LLM.

How does the agent verify the fix?

Surprisingly, they can drive a GUI app by injecting mouse clicks and keyboard input and taking screenshots. But it's slow and flaky. On my machine, injected mouse clicks would sometimes get dropped. Coordinates change when the layout changes. Screenshots need a vision model to interpret.

I wanted something an agent could drive deterministically: send a request, get a result back, assert on it. Like calling a function, except the function lives inside a running GUI app.

## The solution: a control channel over a named pipe

I added a command-line flag `-dbg-control <named-pipe>`. When SumatraPDF starts with that flag, it creates a bidirectional [named pipe](https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipes) with that name and starts a thread listening for commands.

A test script written in TypeScript (bun):

- picks a unique pipe name
- launches `SumatraPDF-dll.exe -dbg-control <name>`
- connects to the pipe
- sends binary request/response commands ("search this PDF for this string", "translate this text")
- asserts on the results
- tells the app to quit

Why a named pipe rather than, say, a TCP socket? A named pipe is local to the machine, doesn't need a port, doesn't trip the Windows firewall, and gets a tidy namespace from Windows (`\\.\pipe\...`). The Node/Bun client connects to it with the ordinary socket API, so on the script side it's as convenient as a socket anyway.

The key insight: instead of automating the GUI from the outside, I expose the *logic* from the inside. Text search, synctex, translation — these are functions in my code. The control channel lets an agent call them in the context of a fully initialized app, without going through the mouse and pixels.

## What the app does

The server is one background thread. It creates the pipe, waits for a client, processes requests until the client disconnects, then goes back to waiting for the next client:

```c++
static void SumatraControlThread(char* pipeName) {
    AutoFreeWStr pipeNameW(FullPipeName(pipeName));
    str::Free(pipeName);

    for (;;) {
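        // Byte-mode, blocking, duplex pipe: a single instance with 64 KB
        // output and input buffers, default timeout and default security.
        HANDLE pipe = CreateNamedPipeW(
            pipeNameW.Get(), PIPE_ACCESS_DUPLEX,
            PIPE_TYPE_BYTE | PIPE_READMODE_BYTE | PIPE_WAIT,
            1, 64 * 1024, 64 * 1024, 0, nullptr);
        if (pipe == INVALID_HANDLE_VALUE) {
            return;
        }
        // ConnectNamedPipe fails with ERROR_PIPE_CONNECTED when the client
        // connected before this call; that still counts as connected.
        BOOL connected = ConnectNamedPipe(pipe, nullptr)
            ? TRUE : (GetLastError() == ERROR_PIPE_CONNECTED);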
        if (connected) {
            ProcessControlConnection(pipe);
        }
        DisconnectNamedPipe(pipe);
        CloseHandle(pipe);
    }
}
```

There's one subtlety that matters a lot in a GUI app.

The control thread is *not* the UI thread. If I touch UI state or run document logic on a random thread, I get races and crashes. So the control thread doesn't execute the command itself. It parses the request, then posts the work to the UI thread and waits for it to finish:

```c++
static void ProcessControlConnection(HANDLE h) {
    for (;;) {
        ControlRequest* req = ReadControlRequest(h);
        if (!req) {
            return;
        }
        uitask::Post(MkFunc0<ControlRequest>(ExecuteControlRequest, req), "SumatraControl");
        WaitForSingleObject(req->done, INFINITE);
        bool ok = WriteControlResponse(h, req);
        DeleteControlRequest(req);
        if (!ok) {
            return;
        }
    }
}
```

`uitask::Post` schedules a function to run on the UI thread. The request carries an event handle; `ExecuteControlRequest` signals it with `SetEvent` when it's done. The pipe thread blocks on `WaitForSingleObject` until then, then writes the response. So the wire protocol is strictly synchronous request → response, but the actual work runs where it's safe to run.

`ExecuteControlRequest` is just a switch over the command id:

```c++
case ControlCmd::TestSearch: {
    const char* pdf = StringArg(req, 0);
    const char* needle = StringArg(req, 1);
    const char* password = StringArg(req, 2);
    if (!pdf || !needle) {
        AppendError(req, "TestSearch expects string pdf, string needle, optional string password");
        break;
    }
    AppendTestResult(req, 0, TestSearchResult(pdf, needle, password));
    break;
}
```

To add a new testable operation:
* add an enum value
* add a `case`
* call the function that does the real work

The function is the same code the GUI uses, so the test exercises the real code paths.

## The protocol

The wire protocol is not human-readable. It's optimized for simplicity and small code: a length-prefixed packet of typed arguments.

No JSON, no protobuf, no third party dependencies.

A **request** is:

```
u32 payloadSize
u16 command
u16 requestId
args...
end
```

A **response** is:

```
u32 payloadSize
u16 requestId
results...
end
```

The `requestId` lets the client match a response to its request and assert they line up. 

The length prefix means the reader knows exactly how many bytes to read from the pipe before parsing — no delimiter scanning, no ambiguity.
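To make the length-prefix idea concrete, here's a minimal sketch of how a client could cut the incoming byte stream into packets. It is not the code from the actual client. `FrameSplitter` is a hypothetical name, and the sketch assumes little-endian integers and that `payloadSize` counts the bytes after the 4-byte prefix, neither of which is spelled out above.

```typescript
// Hypothetical helper, not the real client code.
// Assumptions: little-endian integers; payloadSize counts only the bytes
// that follow the 4-byte length prefix.
class FrameSplitter {
  private buf: Uint8Array = new Uint8Array(0);

  // Feed one chunk read from the pipe. Returns the payload of every packet
  // that is now complete; leftover bytes are kept for the next call.
  push(chunk: Uint8Array): Uint8Array[] {
    const merged = new Uint8Array(this.buf.length + chunk.length);
    merged.set(this.buf, 0);
    merged.set(chunk, this.buf.length);
    this.buf = merged;

    const frames: Uint8Array[] = [];
    for (;;) {
      if (this.buf.length < 4) {
        break; // the length prefix itself is not complete yet
      }
      const view = new DataView(this.buf.buffer, this.buf.byteOffset, this.buf.byteLength);
      const size = view.getUint32(0, true);
      if (this.buf.length < 4 + size) {
        break; // payload is still arriving
      }
      frames.push(this.buf.slice(4, 4 + size));
      this.buf = this.buf.slice(4 + size);
    }
    return frames;
  }
}
```

A pipe read can return half a packet or two packets at once, so the splitter buffers until the prefix says a packet is complete. Parsing only ever sees whole payloads.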

Each argument (and each result) is a tagged value:

```
u16 type followed by, depending on type:
  0 = end      (no payload — marks end of the arg list)
  1 = i32      4 bytes
  2 = bytes    [u32 length][bytes]
  3 = string   [u32 length][utf8 bytes][0]   (zero-terminated, for C convenience)
  4 = list     [u16 count][element, element, ...]
```

That's the whole protocol. It's enough to express "translate(backend=1, src='English', dst='Polish', text='...')" and get back "(exitCode, translatedText)".

The string type carries an explicit length *and* a zero terminator. The length is what the parser actually uses; the trailing zero is a small courtesy so the C++ side can treat the bytes as a C string without copying.

Both sides are a few dozen lines. The C++ reader is a hand-rolled `PacketReader` with bounds checks; the TypeScript side is its mirror image:

```ts
function encodeArg(out: number[], arg: ControlArg): void {
  if (typeof arg === "number") {
    appendU16(out, ArgType.Int32);
    appendU32(out, arg | 0);
    return;
  }
  if (typeof arg === "string") {
    const bytes = new TextEncoder().encode(arg);
    appendU16(out, ArgType.String);
    appendU32(out, bytes.length);
    appendBytes(out, bytes);
    out.push(0);
    return;
  }
  // ... list and bytes
}
```
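The decoding direction can be sketched the same way. This is an illustration, not the real client code: `ValueReader` and `decodeValues` are hypothetical names, and the sketch assumes little-endian integers and that each list element is a full tagged value with its own `u16` type. It follows the tagged-value table above: the string length drives the parse, and the zero terminator is checked and skipped.

```typescript
// Hypothetical decoder for the tagged values described above.
// Assumptions: little-endian integers; list elements are full tagged values.
type ControlValue = number | string | Uint8Array | ControlValue[];

const TYPE_END = 0;
const TYPE_I32 = 1;
const TYPE_BYTES = 2;
const TYPE_STRING = 3;
const TYPE_LIST = 4;

class ValueReader {
  private pos = 0;
  private readonly data: Uint8Array;
  private readonly view: DataView;

  constructor(data: Uint8Array) {
    this.data = data;
    this.view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  }

  // Bounds check before every read, so a short packet throws instead of
  // reading past the end.
  private need(n: number): void {
    if (this.pos + n > this.data.length) throw new Error("truncated packet");
  }
  private u16(): number {
    this.need(2);
    const v = this.view.getUint16(this.pos, true);
    this.pos += 2;
    return v;
  }
  private u32(): number {
    this.need(4);
    const v = this.view.getUint32(this.pos, true);
    this.pos += 4;
    return v;
  }
  private take(n: number): Uint8Array {
    this.need(n);
    const bytes = this.data.subarray(this.pos, this.pos + n);
    this.pos += n;
    return bytes;
  }

  // Reads one tagged value; returns undefined for the end marker.
  readValue(): ControlValue | undefined {
    const type = this.u16();
    switch (type) {
      case TYPE_END:
        return undefined;
      case TYPE_I32: {
        this.need(4);
        const v = this.view.getInt32(this.pos, true);
        this.pos += 4;
        return v;
      }
      case TYPE_BYTES:
        return this.take(this.u32()).slice(); // copy out of the packet buffer
      case TYPE_STRING: {
        const len = this.u32();
        const text = new TextDecoder().decode(this.take(len));
        if (this.take(1)[0] !== 0) throw new Error("string is missing its zero terminator");
        return text;
      }
      case TYPE_LIST: {
        const count = this.u16();
        const items: ControlValue[] = [];
        for (let i = 0; i < count; i++) {
          const item = this.readValue();
          if (item === undefined) throw new Error("end marker inside a list");
          items.push(item);
        }
        return items;
      }
      default:
        throw new Error(`unknown type tag ${type}`);
    }
  }
}

// Decodes values until the end marker.
function decodeValues(data: Uint8Array): ControlValue[] {
  const reader = new ValueReader(data);
  const out: ControlValue[] = [];
  for (;;) {
    const value = reader.readValue();
    if (value === undefined) return out;
    out.push(value);
  }
}
```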

I write tests in TypeScript and run them with [Bun](https://bun.sh), which has fast startup and talks to Windows named pipes through the standard `node:net` socket API. The client (`cmd/control.ts`) wraps the protocol in a small `ControlClient` and a `withControlledSumatra` helper that handles the launch/connect/quit lifecycle:

```ts
export async function withControlledSumatra<T>(
  exe: string,
  fn: (client: ControlClient) => Promise<T>,
  extraArgs: string[] = [],
): Promise<T> {
  const pipeName = uniquePipeName();
  const proc = Bun.spawn([exe, "-dbg-control", pipeName, ...extraArgs], {
    stdout: "ignore",
    stderr: "ignore",
  });
  let client: ControlClient | undefined;
  try {
    client = await ControlClient.connect(pipeName);
    return await fn(client);
  } finally {
    if (client) {
      try {
        await client.quit();
      } catch {
        proc.kill();
      }
      client.close();
    } else {
      proc.kill();
    }
    await proc.exited;
  }
}
```

`connect` retries for a few seconds because there's a small race: the script launches the exe, but the app needs a moment to start the pipe server. Rather than guess at a sleep, the client polls until the pipe accepts a connection.
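The polling itself is a small loop. Here's a minimal sketch of the idea, not the actual `connect` implementation: `retryUntil` is a hypothetical helper, and the timeout and delay values are up to the caller.

```typescript
// Hypothetical helper: keep calling `attempt` until it succeeds or the
// deadline passes, sleeping `delayMs` between failures.
async function retryUntil<T>(
  attempt: () => Promise<T>,
  timeoutMs: number,
  delayMs: number,
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      return await attempt();
    } catch (err) {
      // Give up with the last error once another sleep would overshoot.
      if (Date.now() + delayMs > deadline) {
        throw err;
      }
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

For the pipe case, `attempt` would wrap `net.connect` from `node:net` in a promise that resolves on the socket's `connect` event and rejects on `error`. While the app is still starting up, the connection fails and the loop simply tries again.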

An actual test reads like a normal function call. Here's the gist of the selection-translation test, which sends English text and checks it comes back as plausible Polish:

```ts
const res = await runControlCommand(EXE, ControlCommand.TestSelectionTranslate, [
  backend.id,   // i32: which LLM CLI to use
  "English",    // string: source language
  "Polish",     // string: target language
  PHRASE,       // string: text to translate
]);
const exitCode = Number(res[0]);
const translation = String(res[1] ?? "").trim();
```

No screenshots, no mouse, no pixel coordinates. The agent runs `bun tests/ad-hoc-selection-translate.ts`, reads the assertions, and iterates on its own.

## Use it in your own project

There's nothing SumatraPDF-specific about this idea. Any GUI app can grow a control channel like this, and it pairs really well with AI agents. If you want to point an agent at your own app, here are instructions you can drop into your `AGENTS.md` / `CLAUDE.md` and adapt:

```markdown
## Driving the GUI app from tests

Prefer driving the app through its control channel over GUI automation
(injected clicks / screenshots) or adding one-off test-only flags.

- The app accepts `-dbg-control <named-pipe>`. When present it starts a
  control server on that pipe and listens for binary request/response
  commands. Combine with `-for-testing` so it starts a fresh instance and
  doesn't touch real settings.
- A test should: pick a unique pipe name, launch the app with
  `-for-testing -dbg-control <name>`, connect, send commands, assert on the
  results, then quit the app through the control client.
- Protocol: requests are `[u32 payloadSize][u16 command][u16 requestId][args]`;
  responses are `[u32 payloadSize][u16 requestId][results]`. Each arg/result
  is `[u16 type]` where 0=end, 1=i32 (+4 bytes), 2=bytes (+u32 len +data),
  3=utf8 string (+u32 len +bytes +zero terminator), 4=list (+u16 count
  +elements).
- To add a testable operation: add a command id, handle it in the server's
  command switch by calling the real function, and add a matching method on
  the client. Run the operation ON THE UI THREAD if it touches UI state.
- Don't add new test-only command-line flags when a control command will do.
```

The two rules that matter most, learned the hard way:

1. **Expose logic, not pixels.** The command should call the same function the GUI calls, in the real app, with the real document loaded. Then the test verifies the thing users actually hit — not a re-implementation of it.

2. **Run the work on the UI thread.** The pipe lives on a background thread, but UI and document state usually isn't thread-safe. Post the work over and wait for it. Keep the wire protocol synchronous so the test stays a simple sequence of calls.

The payoff is that the agent's feedback loop closes. It writes code, builds, runs a test that drives the real GUI app, reads a pass/fail, and fixes the next thing — all without me clicking a single button.
