Real-time collective intelligence through semantic clustering and consensus detection in Discord conversations
BaselHack 2025 – Collective Intelligence through Semantic Clustering
AI Consensus Platform transforms Discord conversations into structured collective intelligence. Built in 24 hours for BaselHack 2025, the system automatically discovers consensus by embedding messages in semantic space, clustering them using machine learning, and identifying agreement patterns through geometric analysis. No voting mechanisms or manual categorization required—just natural conversation that the AI organizes in real-time.
The core innovation treats each message as a point in 1536-dimensional semantic space. Every incoming message is embedded using OpenAI's text-embedding-3-small model. A content-based cache (SHA-256 hashing) eliminates redundant API calls—critical when discussions reference similar concepts repeatedly.
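A minimal sketch of such a cache (the `embed_cached` helper and the module-level dict are illustrative, not the project's actual API):

```python
import hashlib

from openai import OpenAI

client = OpenAI()
_embedding_cache: dict[str, list[float]] = {}

def embed_cached(text: str) -> list[float]:
    # SHA-256 over normalized content, not the Discord message ID,
    # so identical text from different messages reuses one embedding
    key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        _embedding_cache[key] = response.data[0].embedding
    return _embedding_cache[key]
```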
On startup, the bot performs incremental historical scraping, fetching only new messages since the last run. Each cached message undergoes relevance checking against the active question using cosine similarity:
$$
\text{sim}(\mathbf{m}, \mathbf{q}) = \frac{\mathbf{m} \cdot \mathbf{q}}{\lVert \mathbf{m} \rVert \, \lVert \mathbf{q} \rVert}
$$

where $\mathbf{m}$ and $\mathbf{q}$ are the embedding vectors for the message and question. Messages exceeding the 0.40 threshold are automatically added to the discussion pool, providing rich historical context.
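In code, the relevance gate reduces to a dot product; a sketch with our own helper names:

```python
import numpy as np

HISTORICAL_THRESHOLD = 0.40  # live messages use the looser 0.30 (see below)

def cosine_similarity(m: np.ndarray, q: np.ndarray) -> float:
    return float(np.dot(m, q) / (np.linalg.norm(m) * np.linalg.norm(q)))

def is_historically_relevant(message_vec: np.ndarray, question_vec: np.ndarray) -> bool:
    return cosine_similarity(message_vec, question_vec) >= HISTORICAL_THRESHOLD
```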
K-Means Initialization: Once a question has accumulated enough messages, K-Means (k=4) establishes the initial cluster structure. Embeddings are L2-normalized for consistent distance metrics:

$$
\hat{\mathbf{e}} = \frac{\mathbf{e}}{\lVert \mathbf{e} \rVert_2}
$$

On unit-length vectors, minimizing within-cluster variance maximizes intra-cluster similarity while increasing inter-cluster separation.
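A sketch of the initialization step using scikit-learn (parameter choices beyond k=4 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def initial_clusters(embeddings: list[list[float]]) -> np.ndarray:
    """Run K-Means (k=4) on L2-normalized embeddings; returns cluster labels."""
    X = normalize(np.asarray(embeddings))  # row-wise L2 normalization
    return KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
```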
Centroid Computation: Each cluster's "semantic center of gravity" is the arithmetic mean of all member embeddings:
$$
\mathbf{c}_k = \frac{1}{\lvert S_k \rvert} \sum_{\mathbf{e}_i \in S_k} \mathbf{e}_i
$$

where $S_k$ is the set of messages in cluster $k$. This centroid mathematically represents the cluster's consensus position.
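In NumPy this is a one-liner; per the design decision noted below, the mean is taken over the original rather than the normalized embeddings:

```python
import numpy as np

def cluster_centroid(member_embeddings: np.ndarray) -> np.ndarray:
    # Mean of the original (un-normalized) embeddings,
    # preserving semantic magnitude
    return member_embeddings.mean(axis=0)
```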
Label Generation: To create two-word labels, we identify the 5 messages nearest to the centroid using cosine similarity, then pass them to GPT-4o-mini with constrained prompts. This ensures labels emerge from the most representative messages, not outliers. Duplicate detection compares new labels against existing ones using both string matching and embedding similarity (0.90 threshold), with automatic retry logic employing progressively harder prompts.
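A sketch of the labeling step; the exact prompt wording is an assumption, and retry/duplicate handling is omitted:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def label_cluster(centroid: np.ndarray, embeddings: np.ndarray,
                  texts: list[str]) -> str:
    # Rank members by cosine similarity to the centroid, take the top 5
    sims = embeddings @ centroid / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )
    representative = [texts[i] for i in np.argsort(sims)[-5:]]
    prompt = ("Give a two-word label capturing the shared idea of these "
              "messages:\n" + "\n".join(representative))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```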
Dynamic Assignment: New messages join the nearest cluster if similarity exceeds 0.30; otherwise, they enter an unassigned buffer. Periodic re-clustering (every 1 second when message count changes) maintains quality without thrashing.
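A sketch of the assignment rule (helper names and the buffer structure are illustrative):

```python
import numpy as np

LIVE_THRESHOLD = 0.30

def assign_message(vec: np.ndarray, centroids: dict[int, np.ndarray],
                   unassigned: list[np.ndarray]) -> int | None:
    """Attach a new message to its nearest cluster, or buffer it."""
    best_id, best_sim = None, -1.0
    for cluster_id, c in centroids.items():
        sim = float(np.dot(vec, c) / (np.linalg.norm(vec) * np.linalg.norm(c)))
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    if best_sim >= LIVE_THRESHOLD:
        return best_id
    unassigned.append(vec)  # swept into a cluster on the next re-cluster pass
    return None
```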
Consensus emerges when clusters meet two criteria: high intra-cluster similarity and significant participation (≥60% of discussants). Intra-cluster similarity is computed as the average pairwise cosine similarity:
$$
\bar{s}_k = \frac{2}{\lvert S_k \rvert \left( \lvert S_k \rvert - 1 \right)} \sum_{i < j} \cos(\mathbf{e}_i, \mathbf{e}_j)
$$

Clusters with $\bar{s}_k$ above the similarity threshold are identified as consensus regions and highlighted in real-time on the dashboard.
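A sketch of the consensus check; the similarity threshold is left as a parameter since its exact value isn't stated here:

```python
from itertools import combinations

import numpy as np

def intra_cluster_similarity(embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity over all cluster members."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(normed, 2)]
    return sum(sims) / len(sims)

def is_consensus(embeddings: np.ndarray, authors: set[str],
                 all_authors: set[str], sim_threshold: float) -> bool:
    """Both criteria: tight cluster and >= 60% of discussants represented."""
    participation = len(authors) / len(all_authors)
    return (intra_cluster_similarity(embeddings) >= sim_threshold
            and participation >= 0.60)
```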
The backend maintains a single active question in-memory, stored in a QuestionState object. This design reflects the hackathon scope and optimizes for clustering—Agglomerative Clustering must iterate over all embeddings repeatedly, making in-memory access essential.
State changes trigger atomic JSON writes to data/ directory files. This provides crash recovery without database transaction complexity or ORM overhead. For discussions with 100+ messages, clustering completes in <500ms—database round-trips would add 2-3 seconds.
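One common pattern matching this description is write-to-temp-then-rename; a sketch (file naming is assumed):

```python
import json
import os
from pathlib import Path

DATA_DIR = Path("data")

def save_state(name: str, state: dict) -> None:
    """Persist state atomically: a crash mid-write never corrupts the file."""
    DATA_DIR.mkdir(exist_ok=True)
    tmp = DATA_DIR / f"{name}.json.tmp"
    tmp.write_text(json.dumps(state, indent=2))
    os.replace(tmp, DATA_DIR / f"{name}.json")  # rename is atomic on POSIX
```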
In-memory state over database — Single active question + compute-intensive clustering eliminated the need for a database. JSON files provide sufficient crash recovery for hackathon scope while keeping clustering sub-second.
L2-normalized embeddings only for distance calculations — Normalization ensures consistent K-Means distances, but centroid computation uses original embeddings to preserve semantic magnitude.
Dual relevance thresholds (0.40 historical, 0.30 live) — Historical messages require higher standards to avoid pollution; live messages get generous thresholds to avoid false negatives.
Content-based embedding cache — SHA-256 hashing of message content (not message IDs) achieved 85% cache hit rates by reusing embeddings when users rephrase similar ideas.
Centroid-based label selection — Initial approaches used random samples or most-liked messages. Centroid-nearest selection consistently produced more representative labels.
Embedding cache optimization — Initial message-ID-based caching failed to reuse embeddings for rephrased ideas. Switching to content hashing (SHA-256) dramatically improved hit rates but required careful whitespace handling.
Label uniqueness — Early clustering generated duplicates like "good idea" across clusters. Implementing embedding similarity checking (≥0.90 triggers retry) with progressively harder prompts solved this.
Clustering stability — Re-clustering on every new message caused clusters to "jump." Adding message-count-change gating and 1-second debounce stabilized visualization.
Historical relevance tuning — Initial 0.30 threshold pulled in tangential content. Separate thresholds for historical (0.40) vs. live (0.30) messages achieved better precision.
Backend: Python 3.12, FastAPI, OpenAI API (text-embedding-3-small, gpt-4o-mini), scikit-learn (K-Means, Agglomerative Clustering), NumPy, discord.py, WeasyPrint, Uvicorn
Frontend: Next.js 16, React 19, TypeScript, Tailwind CSS v4, Recharts, WebSocket
Infrastructure: Docker (Alpine), Sevalla hosting, Nixpacks, JSON persistence
Team: Oliver Baumgartner (backend & AI), Samel Baumgartner, Sven Messmer, Kimi Löffel
Built for BaselHack 2025 – Endress+Hauser Collective Intelligence Challenge