# Trusted Signatures RAG Kit

This directory contains machine-readable retrieval-augmented generation (RAG) assets generated directly from the source repository.

## Generation

- Generator: `node scripts/generate-rag-kit.js`
- Validation: `node scripts/validate-rag-kit.js`
- Built-site validation: `node scripts/validate-rag-public.js`
- Just recipes: `just rag-generate`, `just rag-validate`, `just rag`
- Canonical content source: `content/`
- Additional published source inputs used when no markdown source exists for the same resource:
  - `data/rag-source/glossary.json`
  - `data/rag-source/pricing_tiers.json`

Generated outputs are staged under `tmp/generated-rag/` and mounted into Hugo at build time so they are served without being committed.

The generator is repeatable and idempotent. It reads source markdown, front matter, code fences, FAQ shortcodes, and selected published static RAG inputs, then writes normalized outputs with provenance.
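
One way to spot-check the idempotency claim is to hash the outputs of two consecutive runs and compare the digests. The sketch below is illustrative only: `digestOutputs` and the in-memory snapshots are hypothetical and are not part of `scripts/generate-rag-kit.js`. A real check would read the files from `tmp/generated-rag/` after each run.

```javascript
const crypto = require("crypto");

// Hash a map of { relativePath: fileContents } into a single digest.
// Paths are sorted so the digest does not depend on key insertion order.
function digestOutputs(files) {
  const hash = crypto.createHash("sha256");
  for (const name of Object.keys(files).sort()) {
    hash.update(name);
    hash.update("\0"); // separator so name/content boundaries are unambiguous
    hash.update(files[name]);
    hash.update("\0");
  }
  return hash.digest("hex");
}

// Hypothetical snapshots of two consecutive generator runs (assumed data).
const firstRun = { "faq.jsonl": '{"id":"faq-1"}\n', "company.json": "{}\n" };
const secondRun = { "company.json": "{}\n", "faq.jsonl": '{"id":"faq-1"}\n' };

const isIdempotent = digestOutputs(firstRun) === digestOutputs(secondRun);
console.log(isIdempotent);
```

If the digests differ between runs with unchanged sources, some output depends on something other than the inputs (for example, a wall-clock timestamp).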

## Files

- `faq.jsonl`
- `facts.jsonl`
- `glossary.json`
- `pricing_tiers.json`
- `products.jsonl`
- `integrations.jsonl`
- `implementation_paths.jsonl`
- `docs_pages.jsonl`
- `trust_legal.jsonl`
- `trust_controls.jsonl`
- `subprocessors.jsonl`
- `support.jsonl`
- `announcements.jsonl`
- `company.json`
- `guardrails.json`
- `snippets/index.jsonl`
- `snippets/README.md`
- `snippets/generated/*`
- `README.md`

## Provenance model

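Every generated record includes the fields below. Where a schema in the next section lists "provenance fields", it refers to this set (`id` is also listed on its own in each schema):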

- `id`
- `source_url`
- `source_path`
- `source_section`
- `extracted_from`
- `last_modified_utc`
- `evidence_text`
- `confidence`
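
A consumer can verify the field list above before indexing a record. This is a minimal sketch, not the logic in `scripts/validate-rag-kit.js`: `missingProvenanceFields` is a hypothetical helper, and the sample record values are made up. It checks only that each key is present, since a field may legitimately hold `null`.

```javascript
// Field names copied from the provenance model above.
const PROVENANCE_FIELDS = [
  "id",
  "source_url",
  "source_path",
  "source_section",
  "extracted_from",
  "last_modified_utc",
  "evidence_text",
  "confidence",
];

// Return the provenance keys that are absent from a record.
// Presence is checked, not truthiness, so null values still count as present.
function missingProvenanceFields(record) {
  return PROVENANCE_FIELDS.filter(
    (field) => !Object.prototype.hasOwnProperty.call(record, field)
  );
}

// Hypothetical record with one field left out on purpose.
const sampleRecord = {
  id: "example-1",
  source_url: "https://example.com/docs/",
  source_path: "content/docs/_index.md",
  source_section: "Overview",
  extracted_from: "markdown",
  last_modified_utc: "2024-01-01T00:00:00Z",
  evidence_text: "Example evidence.",
};

const missing = missingProvenanceFields(sampleRecord);
console.log(missing);
```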

## Schemas

- `faq.jsonl`: `id`, `type`, `question`, `answer`, `category`, `product_scope`, `source_url`, `source_path`, `source_section`, `extracted_from`, `last_modified_utc`, `evidence_text`, `confidence`
- `facts.jsonl`: `id`, `type`, `subject`, `predicate`, `object`, `qualifiers`, `product_scope`, `tags`, `source_url`, `source_path`, `source_section`, `extracted_from`, `last_modified_utc`, `evidence_text`, `confidence`
- `glossary.json`: `terms[]` with `term`, `definition`, `aliases`, `source_url`, `source_path`, `source_section`, `evidence_text`
- `pricing_tiers.json`: `currency`, `plans`, `add_ons`, `buying_paths`, `usage_model_notes`, `source_pages`, `warnings`
- `products.jsonl`: `id`, `type`, `product_name`, `category`, `summary`, `intended_use`, `non_goals`, `capabilities`, `limitations`, `integrations`, `standards`, `audiences`, `pricing_refs`, `related_pages`, provenance fields
- `integrations.jsonl`: `id`, `type`, `integration_name`, `kind`, `summary`, `auth_model`, `input_requirements`, `output_behavior`, `commands_or_endpoints`, `environment_variables`, `error_handling`, `prerequisites`, `constraints`, `related_products`, provenance fields
- `implementation_paths.jsonl`: `id`, `record_type`, `title`, `audience`, `approach`, `approach_category`, tradeoff fields, `common_mistakes`, `recommended_next_step`, related links, provenance fields
- `docs_pages.jsonl`: `id`, `record_type`, `title`, `doc_path`, `canonical_url`, `implementation_path`, `cloud_platform`, `best_for`, `time_to_first_success`, `prerequisites`, `integration_model`, `document_upload_model`, `next_step`, `related_docs`, provenance fields
- `trust_legal.jsonl`: `id`, `type`, `domain`, `topic`, `statement`, `applies_to`, `caveats`, provenance fields
- `trust_controls.jsonl`: `id`, `control_area`, `control_name`, `statement`, `product_scope`, `deployment_scope`, `region_scope`, `caveat`, provenance fields
- `subprocessors.jsonl`: `id`, `vendor_name`, `purpose`, `data_categories`, `processing_location`, `product_scope`, provenance fields
- `support.jsonl`: `id`, `type`, `channel`, `purpose`, `availability`, `audience`, `contact`, `link_text`, provenance fields
- `announcements.jsonl`: `id`, `type`, `published_date`, `title`, `summary`, `products_affected`, `docs_affected`, provenance fields
- `company.json`: `name`, `summary`, `contacts`, `leadership`, `addresses`, `links`, `source_pages`, `warnings`
- `guardrails.json`: `guardrails[]` with `id`, `scope`, `statement`, `reason_type`, provenance fields
- `snippets/index.jsonl`: `id`, `type`, `title`, `language`, `file`, `context`, provenance fields
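
As an example of consuming one of these schemas, the sketch below builds a case-insensitive lookup over `glossary.json` using `term` and `aliases`. It assumes `aliases` is an array of strings (the schema above names the field but not its type), and `buildGlossaryIndex` and the sample data are hypothetical.

```javascript
// Map every term and alias (lower-cased, trimmed) to its glossary entry.
// Assumption: `aliases` is an array of strings; a missing value is treated as empty.
function buildGlossaryIndex(glossary) {
  const index = new Map();
  for (const entry of glossary.terms) {
    const names = [entry.term, ...(entry.aliases || [])];
    for (const name of names) {
      index.set(name.trim().toLowerCase(), entry);
    }
  }
  return index;
}

// Hypothetical glossary content shaped like the schema above.
const sampleGlossary = {
  terms: [
    {
      term: "Example Term",
      definition: "A placeholder definition.",
      aliases: ["ET"],
      source_url: "https://example.com/glossary/",
      source_path: "data/rag-source/glossary.json",
      source_section: "Example Term",
      evidence_text: "A placeholder definition.",
    },
  ],
};

const glossaryIndex = buildGlossaryIndex(sampleGlossary);
const hit = glossaryIndex.get("et");
console.log(hit ? hit.term : null);
```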

## Extraction rules

- JSONL files contain one normalized record per line.
- `facts.jsonl` uses a conservative `predicate: "states"` model to avoid unsupported semantic inference.
- Only explicit FAQs or question/answer shortcodes are emitted to `faq.jsonl`.
- Only direct, declarative statements are emitted to `facts.jsonl`, `trust_legal.jsonl`, and `guardrails.json`; marketing superlatives and imperative advice are omitted.
- `implementation_paths.jsonl` is sourced from the published implementation-decision guide and preserves nulls when the source does not publish an explicit value for a field.
- `docs_pages.jsonl` separates implementation-path retrieval from guide-page retrieval so `integrations.jsonl` can remain integration-centric.
- `trust_controls.jsonl` and `subprocessors.jsonl` normalize published diligence content from Trust, DPA, Privacy, and Subprocessors pages without adding unpublished claims.
- `pricing_tiers.json` only includes prices explicitly published in the repo sources.
- `glossary.json` is sourced from the published glossary file because that is the site’s explicit term-definition source.
- `snippets/` contains extracted docs examples plus a JSONL index with provenance. Response-only or error-only code blocks are omitted.
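
The first two rules above can be checked with a small reader. The sketch below parses JSONL text one record per line and flags any fact whose `predicate` is not `"states"`. The helper names and sample lines are hypothetical, and the check assumes every record in `facts.jsonl` is expected to carry that predicate.

```javascript
// Parse JSONL text: one JSON object per non-blank line.
// Errors report the 1-based line number in the original text.
function parseJsonl(text) {
  const records = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // tolerate a trailing newline
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      throw new Error(`Invalid JSON on line ${index + 1}: ${err.message}`);
    }
  });
  return records;
}

// Return the ids of fact records that do not use the "states" predicate.
function nonStatesFactIds(records) {
  return records
    .filter((record) => record.predicate !== "states")
    .map((record) => record.id);
}

// Hypothetical facts.jsonl content; the second line violates the rule on purpose.
const sampleFacts = [
  '{"id":"fact-1","subject":"Example","predicate":"states","object":"A"}',
  '{"id":"fact-2","subject":"Example","predicate":"implies","object":"B"}',
  "",
].join("\n");

const parsedFacts = parseJsonl(sampleFacts);
const offenders = nonStatesFactIds(parsedFacts);
console.log(offenders);
```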

## Legacy compatibility

The site still exposes legacy FAQ/facts JSONL routes through Hugo templates. Those templates now read generated data from `data/rag/` so the legacy and canonical outputs stay aligned.

## Known gaps

- Currency is not explicitly labeled in the pricing sources, so `pricing_tiers.json` leaves `currency` as `null`.
- Support availability/hours are not explicitly published.
- Some pages imply broader capabilities or guarantees, but the generator omits anything not explicitly stated.
- Power Automate is documented as a first-class workflow path, but the site splits that coverage across the Azure docs and a product page rather than a single docs overview page.
