
System Securitization Journey: Day 1 to Day 2

This document outlines the full user journey and technical implementation plan for securing the AI Hub and its Swarm Control system. It focuses on making the initial experience (Day 1) as seamless as possible, while providing a clear path to production-grade security (Day 2) without losing access or data.

🚨 Quick Start UX / Feasibility Blockers (Pre-Release Phase)

Before proceeding with the Day 1 to Day 2 transition, we must unblock the following critical UX flaws in the "Day 0" setup.

1. The .env Override Trap (Critical Failure)

  • The Problem: docker-compose.yml hardcodes environment variables like SUPER_ADMINS into the environment: block, which overrides the .env file generated by setup.sh.
  • The Fix: Update docker-compose.yml to use variable interpolation with fallbacks (e.g., - SUPER_ADMINS=${SUPER_ADMINS:-admin@local.host}) so the value from the .env file is used when present and the default applies only when the variable is unset or empty.
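
To make the precedence rule concrete, here is a minimal Python sketch that models how a Compose-style `${VAR:-default}` reference resolves. It is not Compose's own implementation; the function name `interpolate` is hypothetical, and `env` stands in for the values Compose reads from the .env file and the shell.

```python
import re

# Matches ${NAME}, ${NAME:-default} and ${NAME-default}.
_PATTERN = re.compile(
    r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*)(?:(?P<op>:?-)(?P<default>[^}]*))?\}"
)


def interpolate(value: str, env: dict) -> str:
    """Resolve Compose-style variable references in `value` against `env`."""

    def replace(match: re.Match) -> str:
        name, op, default = match.group("name", "op", "default")
        current = env.get(name)
        if op == ":-":
            # Fallback applies when the variable is unset OR empty.
            return current if current else default
        if op == "-":
            # Fallback applies only when the variable is unset.
            return current if current is not None else default
        # Plain ${NAME}: an unset variable becomes an empty string.
        return current or ""

    return _PATTERN.sub(replace, value)


line = "SUPER_ADMINS=${SUPER_ADMINS:-admin@local.host}"
print(interpolate(line, {"SUPER_ADMINS": "ops@example.com"}))  # SUPER_ADMINS=ops@example.com
print(interpolate(line, {}))                                   # SUPER_ADMINS=admin@local.host
```

With a hardcoded `SUPER_ADMINS=...` entry there is no reference to resolve, which is why the generated .env value never reaches the container.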

2. The "Empty Shell" Problem (No Agent Nodes)

  • The Problem: AI Hub needs at least one agent node to be useful. After docker-compose up the Hub starts with zero agent nodes connected, so the user must manually provision a node before they can do anything.
  • The Fix: Include a pre-configured sandbox-node directly in the docker-compose.yml that automatically performs the token handshake with the Hub on startup.

3. The "Brain Dead" State (Missing LLM Keys)

  • The Problem: If a user logs in without OIDC and opens the Chat interface, the chat request will crash because no LLM provider keys are configured.
  • The Fix:
    1. Update setup.sh to optionally accept a pre-filled configuration file (./setup.sh my_config.yaml). The script copies this file to ./config.yaml and mounts it into the backend via docker-compose so the provider keys are instantly available.
    2. Update the frontend UI so that, when no active providers are found (for example, because no config file was passed to setup.sh), it displays a friendly onboarding splash screen ("Welcome to Cortex! Please configure your first AI Provider...") that routes the user to the settings page.
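
The "no active providers" check behind the splash screen could look roughly like the sketch below. The config layout (`llm_providers` mapping provider names to entries with an `api_key`) is an assumption made for illustration; the real config.yaml schema may differ.

```python
from typing import Any, Mapping


def has_active_provider(config: Mapping[str, Any]) -> bool:
    """Return True if at least one LLM provider has a non-empty API key.

    Assumes a hypothetical layout: {"llm_providers": {"<name>": {"api_key": "..."}}}.
    """
    providers = config.get("llm_providers") or {}
    if not isinstance(providers, Mapping):
        return False
    return any(
        isinstance(entry, Mapping) and bool(str(entry.get("api_key") or "").strip())
        for entry in providers.values()
    )
```

The backend could expose this boolean to the frontend, which shows the onboarding screen whenever it is false.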

🌅 Day 0 / Day 1: Seamless Initialization (Interactive Setup Script)

1. One-Line Setup Script (setup.sh)

User Journey:

  • The administrator runs a one-line install command published in the repository (e.g., bash <(curl -s https://.../setup.sh) or ./setup.sh).
  • The script acts as an interactive wizard asking for the bare-minimum required values:
    • Admin Email (maps to SUPER_ADMINS).
  • The script automatically generates a cryptographically random SECRET_KEY and a random initial Admin Password.
  • It saves all this to a .env file and starts the application via docker-compose up -d.
  • The script prints a clean summary to the terminal:
      ✅ AI Hub is up and running!
      🌐 Access the Hub UI: http://localhost:8002 (or your server IP)
      👤 Admin Email: admin@example.com
      🔑 Initial Password: [Generated Password]
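
The generation and .env steps above can be sketched as follows. setup.sh itself is a shell script, so this Python version only illustrates the logic; the key name `INITIAL_ADMIN_PASSWORD` and both function names are hypothetical.

```python
import secrets
from pathlib import Path


def generate_credentials(admin_email: str) -> dict:
    """Build the values the wizard would write to .env."""
    return {
        "SUPER_ADMINS": admin_email,
        "SECRET_KEY": secrets.token_hex(32),  # 32 random bytes as 64 hex characters
        # Hypothetical variable name for the generated initial password.
        "INITIAL_ADMIN_PASSWORD": secrets.token_urlsafe(16),
    }


def write_env_file(path: Path, values: dict) -> None:
    """Write KEY=value lines and restrict the file to its owner."""
    body = "".join(f"{key}={value}\n" for key, value in values.items())
    path.write_text(body, encoding="utf-8")
    path.chmod(0o600)  # the file contains secrets
```

Using the `secrets` module rather than `random` matters here, because both values are security-sensitive.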

2. Web UI Access (Local Auth Fallback)

User Journey:

  • The administrator visits the URL provided by the setup script.
  • They log in using the generated local credentials. No OIDC configuration is required.
  • Once logged in, they can access the platform and can optionally change their password in their Profile settings.

Implementation TODOs:

  • [ ] DB Model: Update User model (app/db/models/user.py) to include a nullable password_hash column.
  • [ ] Config: Set OIDC settings as optional and disabled by default in Settings (app/config.py), adding an oidc_enabled: bool flag.
  • [ ] Backend Gen: In save_user or the initialization hook, when creating the initial super admin, generate a random password, hash it, save the hash, and log the plain-text password to stdout once.
  • [ ] API Routes: Create local login endpoints (POST /api/v1/users/login/local and PUT /api/v1/users/password).
  • [ ] Frontend: Redesign the Login page to show a local Username/Password form by default.
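
As a rough illustration of the hashing step behind password_hash, the sketch below uses scrypt from Python's standard library. The storage format and function names are assumptions; the project may well choose a dedicated library (such as bcrypt or argon2) instead.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# scrypt cost parameters; illustrative values, to be tuned for the deployment.
_N, _R, _P, _DKLEN = 2**14, 8, 1, 32


def hash_password(password: str) -> str:
    """Return a salted scrypt hash encoded as 'scrypt$<salt hex>$<digest hex>'."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=_N, r=_R, p=_P, dklen=_DKLEN
    )
    return f"scrypt${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: Optional[str]) -> bool:
    """Check a login attempt against the stored hash (None for OIDC-only users)."""
    if not stored:
        return False  # nullable column: no local password set
    try:
        scheme, salt_hex, digest_hex = stored.split("$")
        salt, expected = bytes.fromhex(salt_hex), bytes.fromhex(digest_hex)
    except ValueError:
        return False  # malformed value
    if scheme != "scrypt":
        return False
    candidate = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=_N, r=_R, p=_P, dklen=_DKLEN
    )
    return hmac.compare_digest(candidate, expected)  # constant-time comparison
```

Because the column is nullable, `verify_password` rejects accounts without a local password instead of raising.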

3. Swarm Control (gRPC Insecure Mode & Missing Hostname)

User Journey:

  • The user wants to test remote agent nodes (e.g., their laptop).
  • Because TLS isn't configured out of the box, the Hub's gRPC Orchestrator starts on an insecure binding (add_insecure_port).
  • Furthermore, because the Hub doesn't know its external DNS name/IP, the generated provisioning script defaults to pointing nodes at localhost:50051.
  • The Policy: We allow this behavior to enable rapid local testing. However, the Web UI clearly warns the user with two banners on the Swarm dashboard:
    1. ⚠️ "Swarm operating in Insecure Mode. Traffic is not encrypted."
    2. ⚠️ "Hub External Endpoint is not configured. Node provisioning scripts will default to 'localhost' and will require manual IP adjustment if deploying to remote machines."
  • The user can connect nodes locally, or manually substitute their LAN IP, and test the system immediately.

Implementation TODOs:

  • [ ] Config: Add GRPC_TLS_ENABLED: bool to settings, defaulting to False.
  • [ ] Backend: Ensure app/core/grpc/services/grpc_server.py cleanly handles the insecure port bound to GRPC_ENDPOINT when GRPC_TLS_ENABLED is false.
  • [ ] Nodes API / UI: Expose the TLS status via an API so the frontend dashboard can render a warning banner about insecure swarm communication.
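
One possible shape for the status payload that drives the two banners is sketched below. The function name, field names, and warning codes are hypothetical; only the banner texts come from this document.

```python
from typing import Optional

# Banner texts as specified for the Swarm dashboard.
BANNERS = {
    "insecure_mode": "Swarm operating in Insecure Mode. Traffic is not encrypted.",
    "missing_hostname": (
        "Hub External Endpoint is not configured. Node provisioning scripts will "
        "default to 'localhost' and will require manual IP adjustment if deploying "
        "to remote machines."
    ),
}


def swarm_status(tls_enabled: bool, external_endpoint: Optional[str]) -> dict:
    """Build the payload a status endpoint could return to the frontend."""
    codes = []
    if not tls_enabled:
        codes.append("insecure_mode")
    if not external_endpoint:
        codes.append("missing_hostname")
    return {
        "tls_enabled": tls_enabled,
        "external_endpoint": external_endpoint or None,
        "warnings": [{"code": code, "message": BANNERS[code]} for code in codes],
    }
```

On a default Day 1 install, `swarm_status(False, None)` yields both warnings; once TLS and the external endpoint are configured, the list is empty and the dashboard shows no banner.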

🛡️ Day 2: Production Securitization

1. Transitioning to Single Sign-On (OIDC)

User Journey:

  • Once the user is comfortable and ready to deploy to a team, the administrator navigates to the Authentication Settings page in the UI.
  • They input their OIDC provider details (Client ID, Secret, Auth URL, etc.) and toggle Enable SSO.
  • From then on, the login screen displays a prominent "Log in with SSO" button.
  • Critical Phase: When the admin logs in via SSO, the system extracts their email from the JWT payload. Because the email matches the existing local admin account, the system links the OIDC subject (stored as oidc_id) to that account instead of creating a new one. The admin keeps access and does not end up with a duplicate profile.

Implementation TODOs:

  • [ ] OIDC Settings API: Create PUT /api/v1/admin/config/oidc to update OIDC details and write them persistently to config.yaml.
  • [ ] Auth Linking: Update app/core/services/auth.py (handle_callback) to search for an existing user by email before creating a new account. If found, link the OIDC sub to the existing local user.
  • [ ] Frontend Login: The login screen queries /api/v1/auth/config to detect if OIDC is enabled and conditionally renders the OIDC redirect button alongside the local fallback.
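
The linking rule can be sketched as below. This is a simplified model, not the actual handle_callback: an in-memory dict stands in for the user table, `link_or_create_user` is a hypothetical name, and the sketch assumes the caller has already validated the token and trusts its email claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    email: str
    password_hash: Optional[str] = None
    oidc_id: Optional[str] = None


def link_or_create_user(users: dict, email: str, sub: str) -> User:
    """Link an OIDC subject to an existing user by email, or create a new user."""
    key = email.strip().lower()  # assumption: emails are compared case-insensitively
    user = users.get(key)
    if user is None:
        # No local account with this email: create an OIDC-only user.
        user = User(email=key, oidc_id=sub)
        users[key] = user
    elif user.oidc_id is None:
        # Existing local account (e.g. the Day 1 admin): link instead of duplicating.
        user.oidc_id = sub
    elif user.oidc_id != sub:
        # Already linked to a different identity: refuse rather than overwrite.
        raise PermissionError("email is already linked to a different OIDC subject")
    return user
```

The local password_hash is left untouched, so the local login fallback keeps working after the account is linked.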

2. Encrypting Swarm Control & Setting Hostnames (gRPC TLS)

User Journey:

  • The admin decides to connect agent nodes located in different cloud providers or over the public internet.
  • They navigate to the Swarm/Node Settings page.
  • They define the GRPC_EXTERNAL_ENDPOINT (e.g., ai.example.com:50051).
  • They upload or provide paths to a valid SSL/TLS Certificate (cert.pem) and Private Key (key.pem) and toggle Enable TLS for Swarm Control.
  • The Transition (Disconnect): The UI displays a loud warning prompt: "Changing the Endpoint or enabling TLS will drop all currently connected nodes. You will need to manually download the new configuration for each node and restart them."
  • The Hub restarts its gRPC server bound to a secure port (add_secure_port).
  • Existing nodes disconnect because the insecure channel is closed or network routes change.
  • The user navigates to their nodes in the UI, clicks "Download Config Bundle", and pushes the updated secure configuration to their remote machines, completing the secured Day 2 journey.
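
The switch between add_insecure_port and add_secure_port could be sketched as follows. The helper name `bind_grpc_port` is hypothetical (the real logic is planned for serve_grpc), and `server` is assumed to be a `grpc.Server` created elsewhere.

```python
from pathlib import Path


def bind_grpc_port(server, endpoint: str, tls_enabled: bool,
                   cert_path: str = "", key_path: str = "") -> int:
    """Bind `server` to `endpoint`, using TLS when enabled. Returns the bound port."""
    if not tls_enabled:
        # Day 1 default: plaintext transport, suitable for local testing only.
        return server.add_insecure_port(endpoint)

    # Imported here so the sketch can be exercised without grpcio installed;
    # a real module would likely import grpc at the top.
    import grpc

    private_key = Path(key_path).read_bytes()        # e.g. key.pem
    certificate_chain = Path(cert_path).read_bytes()  # e.g. cert.pem
    credentials = grpc.ssl_server_credentials([(private_key, certificate_chain)])
    return server.add_secure_port(endpoint, credentials)
```

Reading the certificate files before binding means a missing or unreadable file fails fast, before any port is opened.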

Implementation TODOs:

  • [ ] Config: Add fields for GRPC_EXTERNAL_ENDPOINT, GRPC_CERT_PATH, and GRPC_KEY_PATH.
  • [ ] Backend gRPC Server: Update serve_grpc in grpc_server.py. If GRPC_TLS_ENABLED is true, read the certs and use server.add_secure_port().
  • [ ] API / Settings UI: Create an admin endpoint (PUT /api/v1/admin/config/swarm) to define external endpoints and TLS certificates.
  • [ ] Node Provisioning: Update the script generator (_generate_node_config_yaml) to use GRPC_EXTERNAL_ENDPOINT (falling back to localhost) and instruct the Python client to use grpc.ssl_channel_credentials() when TLS is active.
  • [ ] UI Banners: Add the "Insecure Mode" and "Missing Hostname" warning banners to the Swarm Dashboard frontend.
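
A minimal sketch of the endpoint fallback in node provisioning is shown below. It is not the actual _generate_node_config_yaml; the function name and the YAML keys (`node_id`, `auth_token`, `hub_endpoint`, `tls`) are assumptions for illustration.

```python
import json
from typing import Optional

DEFAULT_ENDPOINT = "localhost:50051"  # Day 1 fallback when no external endpoint is set


def generate_node_config(node_id: str, token: str,
                         external_endpoint: Optional[str], tls_enabled: bool) -> str:
    """Render a small YAML document for a node (hypothetical key names)."""
    endpoint = external_endpoint or DEFAULT_ENDPOINT
    lines = [
        # json.dumps yields double-quoted strings, which are also valid YAML scalars.
        f"node_id: {json.dumps(node_id)}",
        f"auth_token: {json.dumps(token)}",
        f"hub_endpoint: {json.dumps(endpoint)}",
        f"tls: {'true' if tls_enabled else 'false'}",
    ]
    return "\n".join(lines) + "\n"
```

On the node side, the `tls` flag would then select between `grpc.secure_channel(endpoint, grpc.ssl_channel_credentials())` and `grpc.insecure_channel(endpoint)`.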