Harden web search and docs defaults

2026-06-24 23:57:44 -07:00
parent 8fcd94d2c5
commit 8237f1331c
19 changed files with 691 additions and 35 deletions


@@ -14,6 +14,13 @@ shell code.
| `CONTEXT_KIT_DATA_DIR` | `$HOME/.local/share/context-kit` | Persistent docs indexes and model cache |
| `CONTEXT_KIT_COMPOSE_PROJECT` | `context-kit` | Docker Compose project and network prefix |
| `CONTEXT_KIT_SEARXNG_PORT` | `8099` | Localhost SearXNG port |
| `CONTEXT_KIT_WEB_SEARCH_MAX_BYTES` | `52428800` | Max bytes `context-web-search` accepts and downloads per fetch |
| `CONTEXT_KIT_WEB_SEARCH_PROVIDER` | `searxng` | Default `search_web` provider; fallback order depends on this provider |
| `CONTEXT_KIT_WEB_SEARCH_HTTP_TIMEOUT` | `15000` | HTTP timeout in milliseconds for search providers |
| `CONTEXT_KIT_WEB_SEARCH_MAX_RESULTS` | `10` | Default search result count when clients omit `limit` |
| `CONTEXT_KIT_WEB_SEARCH_CHROME_PATH` | `/usr/bin/chromium` | Chromium path inside the web-search image for Bing fallback |
| `CONTEXT_KIT_WEB_SEARCH_BROWSER_USER_AGENT` | bundled Chrome/Linux UA | User agent for the Chromium-backed Bing fallback |
| `CONTEXT_KIT_WEB_SEARCH_MCP_COMPAT_MODE` | unset | Set to `legacy` for MCP clients with weak tool-schema parsers |
| `CONTEXT_KIT_DOCS_PORT` | `8776` | Localhost port for the long-lived docs-mcp HTTP service |
| `CONTEXT_KIT_DOCS_HTTP_URL` | `http://127.0.0.1:${CONTEXT_KIT_DOCS_PORT}/mcp` | URL emitted into HTTP MCP install snippets |
| `CONTEXT_KIT_DOCS_ALLOW_ORIGIN` | unset | Optional exact browser CORS origin(s) for docs-mcp, separated by spaces |
@@ -22,6 +29,8 @@ shell code.
| `CONTEXT_KIT_DOCS_MAX_GET_BYTES` | `75000` | Max bytes returned by docs retrieval |
| `CONTEXT_KIT_DOCS_EMBED_MODEL` | `BAAI/bge-small-en-v1.5` | SentenceTransformers embedding model |
| `CONTEXT_KIT_DOCS_PREINDEX` | `0` | Set to `1` to re-embed every source on container start |
| `CONTEXT_KIT_DOCS_LOCAL_SOURCES_DIR` | `${CONTEXT_KIT_DATA_DIR}/local-sources` | Machine-local llms.txt tree mounted read-only into docs-mcp |
| `CONTEXT_KIT_DOCS_LOCAL_SOURCES_PORT` | `8769` | Loopback port inside docs-mcp for serving local source files |
## TTL Guidance
@@ -66,3 +75,8 @@ CONTEXT_KIT_DOCS_SOURCES="config/sources.default.txt config/sources.js.txt"
```
Each source file is plain text. Blank lines and `#` comments are ignored.
Entries may be absolute source-profile paths for private machine-local config.
For local llms.txt files, place content under
`CONTEXT_KIT_DOCS_LOCAL_SOURCES_DIR` and reference it as
`http://127.0.0.1:8769/path/inside/local-sources.txt`, where `8769` is the
default `CONTEXT_KIT_DOCS_LOCAL_SOURCES_PORT`. That loopback URL resolves only
inside the docs-mcp container; it is not exposed on the host.
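As a rough illustration of the path-to-URL mapping described above, the sketch below builds the loopback URL for a file placed under the local sources directory. The helper name `local_source_url` and the example path `myproject/llms.txt` are hypothetical and not part of Context Kit. The sketch assumes the URL path simply mirrors the path relative to `CONTEXT_KIT_DOCS_LOCAL_SOURCES_DIR`.

```sh
# Hypothetical helper (not shipped with Context Kit): map a path relative to
# the local sources directory to the loopback URL used inside docs-mcp.
local_source_url() {
  rel_path=${1#/}  # drop a single leading slash, if present
  port=${CONTEXT_KIT_DOCS_LOCAL_SOURCES_PORT:-8769}  # documented default port
  printf 'http://127.0.0.1:%s/%s\n' "$port" "$rel_path"
}

# Example path is an assumption; substitute a file that actually exists
# under CONTEXT_KIT_DOCS_LOCAL_SOURCES_DIR.
local_source_url "myproject/llms.txt"
```

With the port left at its default, this prints `http://127.0.0.1:8769/myproject/llms.txt`, which is the kind of entry that would go into a sources file.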


@@ -33,6 +33,37 @@ Build default images:
bin/context-kit build
```
## Fetch URL Says Max Download Bytes Is Too Big
If `fetch_url` fails before making a network request with an MCP validation error
like `Number must be less than or equal to 26214400`, rebuild the web-search MCP
image:
```sh
bin/context-kit build
```
Context Kit patches the upstream `mcp-web-search` schema so the accepted
`max_download_bytes` value matches `CONTEXT_KIT_WEB_SEARCH_MAX_BYTES`, which
defaults to `52428800`.
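For orientation, the two byte counts in this section are round binary sizes. The arithmetic below is only a sanity check. Reading `26214400` as the unpatched upstream schema limit is an inference from the error message, not something confirmed here.

```sh
# 26214400 appears in the validation error; 52428800 is the documented
# default of CONTEXT_KIT_WEB_SEARCH_MAX_BYTES.
limit_in_error=$((25 * 1024 * 1024))      # 25 MiB
context_kit_default=$((50 * 1024 * 1024)) # 50 MiB

echo "limit in error message: $limit_in_error bytes"
echo "Context Kit default:    $context_kit_default bytes"
```

A request with `max_download_bytes` between these two values is therefore the case an unpatched image would reject and a rebuilt one should accept.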
## Search Fallback and Chromium
`search_web` defaults to SearXNG. If SearXNG fails or returns no results, the
upstream fallback order is DuckDuckGo, then Bing. Bing uses Chromium through
Puppeteer, so `bin/context-kit doctor` checks that the configured Chromium path
exists inside the web-search image.
Context Kit carries a source-controlled Bing provider override in
`docker/web-search/overrides/bing.js` because the upstream 1.3.0 provider can
race result rendering and return no items even when Chromium sees Bing result
cards. The override waits for result cards and decodes current Bing redirect
URLs before handing results back to the upstream fallback registry.
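The fallback behavior described above can be sketched as a simple try-in-order loop. This is a conceptual illustration only, not the upstream `mcp-web-search` implementation. The three `search_*` functions are hypothetical stubs that simulate an empty SearXNG response, a DuckDuckGo failure, and a working Bing provider.

```sh
# Hypothetical stand-ins for the real providers.
search_searxng()    { :; }          # succeeds but returns no results
search_duckduckgo() { return 1; }   # fails outright
search_bing()       { echo "result from bing for: $1"; }

search_with_fallback() {
  query=$1
  for provider in searxng duckduckgo bing; do
    # A provider counts only if it exits 0 and prints at least one result.
    if results=$("search_$provider" "$query") && [ -n "$results" ]; then
      printf '%s\n' "$results"
      return 0
    fi
  done
  return 1  # every provider failed or came back empty
}

search_with_fallback "context kit"
```

Treating "no results" the same as a failure is what lets the loop move past an empty SearXNG response, as the text above describes.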
`fetch_url` is different: in upstream `mcp-web-search` 1.3.0, `engine=browser` is
accepted but reserved for future support. It does not currently invoke Chromium;
URL fetching uses the HTTP extractor path.
## Docs Indexing Is Slow
The first run downloads an embedding model and embeds every configured docs