What It Does
- fetches and scrapes content from web pages
- converts HTML to text for indexing
- for sitemap mode, discovers all URLs listed in a `sitemap.xml` (including sitemap indexes)
- runs ingestion on configurable schedules
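
As a rough illustration of the sitemap discovery and HTML-to-text steps listed above, here is a minimal standard-library sketch. It is not this project's implementation: the names `discover_urls`, `html_to_text`, and the injected `fetch` callable are hypothetical, and the only external fact it relies on is the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Callable, List

# Namespace used by the sitemaps.org protocol for <urlset> and <sitemapindex>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def discover_urls(sitemap_url: str, fetch: Callable[[str], str]) -> List[str]:
    """Return page URLs from a sitemap, following sitemap indexes recursively.

    `fetch` is a hypothetical callable that returns the body of a URL as text.
    """
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # A sitemap index lists other sitemaps, so expand each one in turn.
        urls: List[str] = []
        for child_sitemap in locs:
            urls.extend(discover_urls(child_sitemap, fetch))
        return urls
    return locs


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Convert an HTML document to whitespace-joined plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real fetcher would also need timeouts, error handling, and protection against sitemap indexes that reference each other.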
Environment Variables
Set these in `.env.rag`:
URLs mode:
- `WEB1_URLS`: comma-separated list of URLs to scrape
- `WEB1_SCHEDULES`: ingestion interval in seconds (default is `3600`)

Sitemap mode:
- `WEB2_SITEMAP_URL`: URL of the `sitemap.xml` to parse
- `WEB2_INCLUDE_PREFIX`: (optional) only ingest URLs with this path prefix (mutually exclusive with `WEB2_EXCLUDE_PREFIX`)
- `WEB2_EXCLUDE_PREFIX`: (optional) skip URLs with this path prefix (mutually exclusive with `WEB2_INCLUDE_PREFIX`)
- `WEB2_SCHEDULES`: ingestion interval in seconds (default is `3600`)
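
To make the variable semantics concrete, the following sketch shows one way such settings could be read and validated. It is an assumption-laden illustration, not the project's loader: `WebSourceSettings` and `load_source` are hypothetical names, and the rule that a source needs either URLs or a sitemap URL is inferred from the two modes described above.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional

DEFAULT_SCHEDULE_SECONDS = 3600  # documented default interval


@dataclass
class WebSourceSettings:
    """Hypothetical container for one web source's settings."""
    urls: List[str]
    sitemap_url: Optional[str]
    include_prefix: Optional[str]
    exclude_prefix: Optional[str]
    schedules: int


def load_source(env: Mapping[str, str], prefix: str) -> WebSourceSettings:
    """Read the `<prefix>_*` variables (for example prefix="WEB1") from `env`."""

    def get(name: str) -> Optional[str]:
        value = env.get(f"{prefix}_{name}", "").strip()
        return value or None

    # Comma-separated list; blank entries and surrounding spaces are dropped.
    urls = [u.strip() for u in (get("URLS") or "").split(",") if u.strip()]
    sitemap_url = get("SITEMAP_URL")
    include_prefix = get("INCLUDE_PREFIX")
    exclude_prefix = get("EXCLUDE_PREFIX")

    if include_prefix and exclude_prefix:
        raise ValueError(
            f"{prefix}_INCLUDE_PREFIX and {prefix}_EXCLUDE_PREFIX are mutually exclusive"
        )
    if not urls and not sitemap_url:
        # Assumption: every source runs in either URLs mode or sitemap mode.
        raise ValueError(f"{prefix} needs {prefix}_URLS or {prefix}_SITEMAP_URL")

    schedules = int(get("SCHEDULES") or DEFAULT_SCHEDULE_SECONDS)
    return WebSourceSettings(urls, sitemap_url, include_prefix, exclude_prefix, schedules)
```

In practice `env` would typically be `os.environ` after `.env.rag` has been loaded; accepting any mapping keeps the function easy to exercise with a plain dictionary.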
config.yaml Example
.env.rag Example
Configuration Reference
| Field | Required | Default | Description |
|---|---|---|---|
| `urls` | yes (URLs mode) | — | Comma-separated list of URLs to scrape |
| `sitemap_url` | yes (sitemap mode) | — | URL of the `sitemap.xml` to parse |
| `include_prefix` | no | — | Only ingest URLs with this path prefix (mutually exclusive with `exclude_prefix`) |
| `exclude_prefix` | no | — | Skip URLs with this path prefix (mutually exclusive with `include_prefix`) |
| `html_to_text` | no | `true` | Convert HTML to plain text before indexing |
| `schedules` | no | `3600` | Ingestion interval in seconds |
| `request_delay` | no | `0` | Seconds to wait between outbound requests. Increase to avoid rate limiting (e.g. `0.1`) |
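
The prefix filters and `request_delay` from the table above can be pictured with a short sketch. It assumes that "path prefix" means a prefix of the URL's path component (such as `/docs`), which the table does not state explicitly; `filter_by_prefix` and `fetch_all` are hypothetical helper names, and `fetch` stands in for whatever HTTP client is actually used.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Tuple
from urllib.parse import urlparse


def filter_by_prefix(
    urls: Iterable[str],
    include_prefix: Optional[str] = None,
    exclude_prefix: Optional[str] = None,
) -> List[str]:
    """Keep URLs whose path matches include_prefix, or drop those matching exclude_prefix."""
    if include_prefix and exclude_prefix:
        raise ValueError("include_prefix and exclude_prefix are mutually exclusive")
    kept = []
    for url in urls:
        path = urlparse(url).path  # assumption: prefixes apply to the path only
        if include_prefix and not path.startswith(include_prefix):
            continue
        if exclude_prefix and path.startswith(exclude_prefix):
            continue
        kept.append(url)
    return kept


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    request_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, body) pairs, pausing request_delay seconds between requests."""
    for index, url in enumerate(urls):
        if index and request_delay > 0:
            sleep(request_delay)  # no pause before the first request
        yield url, fetch(url)
```

With `request_delay=0.1`, a crawl of 1,000 pages spends roughly 100 seconds waiting in addition to the time taken by the requests themselves.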
Multiple Web Sources
Add more `sources` entries (`web3`, `web4`, etc.) with separate env vars per source.
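
One way to picture the per-source naming scheme is to group environment variables by their `WEBn_` prefix. The sketch below assumes variables follow the `WEB<number>_<FIELD>` pattern shown earlier; `group_by_source` is a hypothetical helper, not part of this project.

```python
import re
from typing import Dict, Mapping

# Matches names such as WEB1_URLS or WEB3_SITEMAP_URL (assumed naming pattern).
SOURCE_VAR = re.compile(r"^(WEB\d+)_([A-Z_]+)$")


def group_by_source(env: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group WEBn_* variables into {"webn": {"field": value}} dictionaries."""
    sources: Dict[str, Dict[str, str]] = {}
    for name, value in env.items():
        match = SOURCE_VAR.match(name)
        if match:
            source, field = match.groups()
            sources.setdefault(source.lower(), {})[field.lower()] = value
    return sources
```

Variables that do not match the pattern are ignored, so the function can be given the whole process environment.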