
Code Review Report: Feature 17 — Speech-to-Text Infrastructure

This report audits the Hub's audio transcription layer, specifically the GoogleSTTProvider in stt/gemini.py, against the 12-Factor App methodology, credential safety, and memory efficiency.


🏗️ 12-Factor App Compliance Audit

| Factor | Status | Observation |
|---|---|---|
| XI. Logs | 🔴 Major Issue | Plain-text API Key Leak: The GoogleSTTProvider (Line 89) logs self.api_url on every transcription request at the DEBUG level. Because the Google AI Studio api_key is a query parameter in this URL (Line 36), the full production API key reaches the Hub's log aggregators in plain text whenever DEBUG logging is enabled. |
| VI. Processes | 🟡 Warning | Payload Duplication: Each transcription request builds a base64-encoded copy of the entire audio blob in memory (audio_b64, Line 69). Base64 output is about one third larger than its input, so holding both copies more than doubles per-request memory. For long-form audio (e.g., meeting recordings), this can trigger OOM kills on memory-constrained containers. |
| II. Dependencies | 🟡 Warning | Client Inconsistency: The STT provider uses aiohttp (Line 2), while most other backend services (including the TTS provider) use httpx. This adds a redundant dependency and gives the Hub two different connection-pooling behaviors. |
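
To put a number on the payload duplication warning, here is a minimal sketch of the base64 size arithmetic, plus one possible mitigation that encodes in chunks. The helper names (`b64_encoded_size`, `iter_b64_chunks`) are hypothetical and not taken from stt/gemini.py, and whether the request body can actually be streamed depends on the HTTP client and on how the provider builds its JSON payload, which this review does not confirm.

```python
import base64
from typing import Iterator


def b64_encoded_size(raw_len: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


def iter_b64_chunks(raw: bytes, chunk_size: int = 3 * 64 * 1024) -> Iterator[bytes]:
    """Yield base64 pieces whose concatenation equals base64.b64encode(raw).

    chunk_size must be a multiple of 3 so that no padding appears
    before the final piece.
    """
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    view = memoryview(raw)  # slicing a memoryview does not copy the audio
    for start in range(0, len(view), chunk_size):
        yield base64.b64encode(view[start:start + chunk_size])


raw_len = 3_000_000                      # stand-in for a ~3 MB recording
encoded_len = b64_encoded_size(raw_len)  # 4_000_000
peak_ratio = (raw_len + encoded_len) / raw_len  # raw and encoded held together
```

With both buffers alive at once, the peak is roughly 2.33 times the size of the raw audio, before counting any further copy made when the JSON body is serialized.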

🔍 File-by-File Diagnostic

1. app/core/providers/stt/gemini.py

The inline transcription bridge for Google's Gemini multimodal models.

> [!CAUTION]
> Lack of Response Throttling/Retry
>
> Unlike the TTS provider, the STT provider does not implement a tenacity retry decorator (Line 94). If a transcription fails due to a transient network timeout or a 429 rate limit from Google, the user's voice message is lost immediately without an automatic retry.
>
> Fix: Implement a standard retry policy for 429/5xx errors, consistent with the GeminiTTSProvider.
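
The retry behavior asked for above can be sketched with the standard library alone. This is not the GeminiTTSProvider's policy, which this review does not reproduce, and it swaps in a hand-written loop where the real fix would presumably use a tenacity decorator. `UpstreamError`, `with_retries`, the status set, and the default delays are all assumptions made for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Assumed set of statuses worth retrying: rate limiting plus common 5xx codes.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


class UpstreamError(Exception):
    """Hypothetical error carrying the HTTP status returned by the STT API."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upstream returned HTTP {status}")
        self.status = status


async def with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(), retrying retryable statuses with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except UpstreamError as exc:
            # Give up on non-retryable statuses or once attempts are exhausted.
            if exc.status not in RETRYABLE_STATUSES or attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            await sleep(delay)
    raise AssertionError("unreachable")  # loop always returns or raises
```

A caller would wrap the single HTTP round trip, for example `await with_retries(lambda: send_once(audio))`, where `send_once` is a hypothetical coroutine that raises `UpstreamError` on a non-2xx response. Client errors such as 400 are raised immediately, since retrying them would only repeat the failure.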

Identified Problems:

  • Brittle MIME Detection: The _detect_mime sniffer (Line 41) only checks the first 3-4 bytes. This works for common formats, but it lacks the robustness of a dedicated media library and can misidentify containers whose signature sits later in the header (e.g., the ftyp box of MP4/M4A begins at byte offset 4).
  • Static System Prompt: The instruction "Return only the spoken words, nothing else" (Line 82) is hardcoded. A caller who wants richer output, such as speaker labels (diarization), has no way to request it.
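
To illustrate the MIME detection point, the sketch below checks a 12-byte header instead of the first 3-4 bytes. It is not the existing _detect_mime implementation; `sniff_audio_mime` is a hypothetical name, and the signature list is limited to well-known container magic numbers. It remains a heuristic, so a dedicated media library is still the more robust option.

```python
from typing import Optional


def sniff_audio_mime(data: bytes) -> Optional[str]:
    """Guess an audio MIME type from well-known header signatures, or return None."""
    head = data[:12]
    # RIFF is a generic container; only the WAVE form type at offset 8 means WAV.
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio/wav"
    if head[:4] == b"OggS":
        return "audio/ogg"
    if head[:4] == b"fLaC":
        return "audio/flac"
    # EBML header shared by WebM and Matroska; reported as WebM here by assumption.
    if head[:4] == b"\x1a\x45\xdf\xa3":
        return "audio/webm"
    # ISO base media files (MP4/M4A) carry "ftyp" at offset 4, after a size field.
    if head[4:8] == b"ftyp":
        return "audio/mp4"
    if head[:3] == b"ID3":
        return "audio/mpeg"
    # MPEG audio frame sync: 11 set bits. ADTS AAC streams also match this test.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

Two cases show why a 4-byte check is brittle: a RIFF file is only WAV if bytes 8-11 say so, and an M4A file has no recognizable signature in its first four bytes at all.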

🛠️ Summary Recommendations

  1. Redact Logger URL: Replace the self.api_url in the debug log with a masked string or a simple "Sending to Google" label to prevent credential leaks.
  2. Standardize Client: Migrate from aiohttp to httpx to unify the Hub's HTTP connection management and reduce the dependency footprint.
  3. Add Resilience: Wrap transcribe_audio in a tenacity retry loop to handle transient API failures without dropping user requests.
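
For recommendation 1, one way to keep the endpoint visible in debug logs while masking the credential is to rewrite the query string before logging. This is a sketch under assumptions: `redact_url` and `SENSITIVE_PARAMS` are hypothetical names, and the actual query parameter name used at Line 36 is not quoted in this review, so the set below lists a few likely candidates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; adjust to whatever the provider actually sends.
SENSITIVE_PARAMS = {"key", "api_key", "access_token"}


def redact_url(url: str) -> str:
    """Return url with the values of sensitive query parameters masked."""
    parts = urlsplit(url)
    query = [
        (name, "REDACTED" if name.lower() in SENSITIVE_PARAMS else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The debug call would then log `redact_url(self.api_url)` rather than the raw attribute. A stricter alternative is to move the key out of the URL entirely, if the API accepts it in a request header, so that no log statement can leak it by accident.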

This concludes Feature 17. I have persisted this report to /app/docs/reviews/feature_review_stt_providers.md. How should we address the API-key logging hazard?