Seven Labs
Strategic Brief: LovEdu

Production RAG Platform for Kuwait University

Education Technology | Published 2026-06 | 9 min read
Engagement

EdTech AI Platform

Duration

3 months


The Operational Challenge

Kuwait University students had no reliable way to query course-specific material uploaded by professors. Generic AI tools hallucinated curriculum facts and fabricated institutional regulations, making them actively harmful in an academic context. The platform needed bilingual Arabic/English support, strict no-fabrication enforcement, and course-level data isolation - problems that generic RAG templates do not solve.

The Solution & Architecture

We designed and deployed LovEdu: a production-grade AI education platform built on a multi-stage RAG pipeline. The system uses LlamaParse for academic document parsing, Weaviate for native hybrid BM25 and vector search, Cohere multilingual reranking for Arabic/English parity, comprehensive query detection with sequential page-order fetch, follow-up intelligence, token-budgeted context management, and an admin-editable system prompt stored in MongoDB. All services run in an isolated Docker network on Coolify with JWT + 2FA authentication and courseId-level database isolation.

Why This Matters

Education is one of the highest-stakes domains for AI hallucination. A student who acts on a fabricated KU grade appeal policy misses their deadline. A student who receives scrambled course material before an exam fails based on bad information. The engineering challenge in LovEdu was not building a chatbot - it was building a system where the architecture itself makes hallucination structurally impossible: every answer must come from retrieved material, and if the material is absent, the system says so. This is the standard that any AI deployed in a high-stakes domain should meet, and it requires deliberate engineering choices at every layer of the pipeline.

Functional Logic Flow

Multilingual RAG Architecture

1

System Integration Phase

Built a hybrid Weaviate search pipeline combining BM25 keyword matching with dense vector retrieval in a single query call, fused via Relative Score Fusion - giving the LLM both exact-match accuracy and semantic understanding simultaneously.

2

Optimization & Dynamic Allocation

Integrated Cohere rerank-multilingual-v3.0 as the final retrieval gate, producing relevance scores natively in Arabic and English without any translation step - the only reranking approach we tested that delivered consistent bilingual quality.

3

Hardening & Scale Validation

Implemented token-budgeted context management and follow-up query detection so long study sessions never degrade answer quality - the most recent context always fits within the LLM window regardless of how many messages came before.

Key Business Metrics
Hallucination Incidents: Zero
Languages Supported: Arabic + English
Chunk Retrieval Mode: Up to 60 chunks
Tools Deployed: 5 AI Modules

Outcome: Zero hallucination incidents on course material across the first semester. Full bilingual Arabic/English retrieval quality without translation overhead. Comprehensive queries returned professor-structured explanations in original document order. Context integrity maintained across 100+ message sessions. KU regulatory questions answered with explicit citations rather than fabricated policy.

Engineered Tech Ecosystem
Next.js 14, Node.js, Weaviate, MongoDB, Cohere Rerank, GPT-4o, LlamaParse, Google Embeddings, Coolify, Docker, Traefik

Seven Labs is an AI Systems Engineering firm based in Islamabad, Pakistan. Our team holds professional certifications from IBM, Google Cloud, EC-Council, and CyberWarfare Labs, and has delivered production systems for banking, SaaS, real estate, and media clients across three continents.

Case study narratives are drafted with AI writing assistance and reviewed by Seven Labs engineers for technical accuracy. All metrics, stack details, and architectural decisions reflect real implementation patterns. Client names are withheld where confidentiality agreements apply.


Technical Deep Dive

Case Study: LovEdu - Production RAG Platform for Kuwait University

Executive Summary

Kuwait University students navigate a fragmented academic reality: course material uploaded as dense PDFs, institutional regulations buried in policy documents, and no reliable way to extract accurate answers from either. Generic AI tools like ChatGPT hallucinate curriculum-specific facts, fabricate university regulations, and have no awareness of what a given professor actually taught. The gap between what students need and what publicly available AI provides is not a product gap - it is an engineering gap.

Seven Labs designed and deployed LovEdu, a production-grade AI education platform purpose-built for Kuwait University. The system implements a multi-stage Retrieval-Augmented Generation pipeline with hybrid semantic and keyword search, Cohere multilingual reranking, comprehensive query detection, follow-up intelligence, and token-budgeted context management. All services run inside an isolated Docker network on Coolify, with Weaviate handling vector and BM25 search and MongoDB persisting every message, chunk, and system prompt permanently.

The result is an AI that answers from actual course material uploaded by the professor - not from training data - and does so accurately in both Arabic and English. For a detailed look at our RAG engineering approach, see /services/ai-platforms and our post on /blogs/why-rag-pipelines-fail.


Business Problem

Kuwait University students face a specific set of academic pain points that no general-purpose AI tool addresses:

  1. Course Material Is Inaccessible at Scale: Professors upload 200-page PDFs covering an entire semester. Students cannot search across that material effectively. Keyword search returns nothing useful for conceptual questions. Re-reading entire documents to find one answer is not viable before an exam.

  2. Hallucination Risk Is Unacceptable in an Academic Context: A student asking ChatGPT about KU's grade appeal policy will receive a plausible-sounding fabrication. Acting on it has real consequences - failed appeals, missed deadlines, academic probation. The platform needed a strict no-fabrication policy enforced at the architecture level, not the prompt level.

  3. Multilingual Complexity: Kuwait University operates in both Arabic and English. Course material, student queries, and institutional regulations mix both languages. Most embedding and reranking models degrade severely on Arabic text, making Arabic-first retrieval a non-trivial engineering requirement.

  4. Institutional Knowledge Is Unstructured: KU regulations, grade point policies, and academic probation rules exist in scattered documents. Students do not know where to look. A single authoritative source that cites actual KU regulations directly - and refuses to guess when the document is absent - was a core product requirement.

  5. Context Breaks in Long Conversations: Students engage in extended back-and-forth sessions. Follow-up questions like "go deeper" or "what about the rest" carry no topical content for a vector search. Without follow-up intelligence, the system would embed the word "rest" and retrieve semantically unrelated material.

The engineering challenge was building a system that solved all five problems simultaneously, in a single coherent architecture, without sacrificing response speed. Our foundational approach to this class of problem is detailed in /blogs/advanced-rag-chunking.


Technical Challenges

Parsing Complex Academic Documents

University course material is not clean prose. Lecture slides converted to PDF produce multi-column layouts. Textbook chapters contain equations, footnotes, and embedded figures. Standard PDF parsers extract text in raw stream order rather than visual reading order, which collapses multi-column content into nonsense and strips table structure entirely. Feeding this into an embedding model produces vectors that cannot retrieve meaningful answers because the source material is incoherent.

The platform needed a parser capable of handling real-world academic document complexity - preserving table structures, handling multi-column layouts, and returning clean Markdown that embedding models can process accurately.

Chunk Boundary Precision

Fixed-character chunking is the standard starting point and a reliable source of retrieval failure. A 1,000-character split that lands mid-sentence cuts the semantic unit in half. A chunk that starts with "However, this only applies when..." without the preceding context is useless to a reranker trying to assess relevance. The system needed overlapping chunks with enough shared context that no concept is ever completely isolated at a boundary.

Bilingual Retrieval Without Translation

Reranking is the highest-leverage step in the retrieval pipeline - it determines which of the candidate chunks actually answer the question. Most reranking models are English-only. Running an Arabic query through an English reranker either requires translation (adding latency and translation errors) or produces unreliable relevance scores. The system required a reranker that natively understood both Arabic and English at production quality.

Comprehensive vs. Targeted Query Disambiguation

A student asking "what is a primary key?" wants a focused two-paragraph answer. A student asking "teach me everything about database normalization" wants a structured lecture that covers the entire topic in the uploaded material. These two query types require fundamentally different retrieval strategies. The system needed to detect intent automatically and switch strategies without manual intervention.

Long-Session Context Degradation

GPT-4o has a 128,000-token context window. That sounds generous until a student has had three lengthy study sessions that each generated 6,000-token comprehensive answers. Naively appending all history to every request eventually overflows the context, causes the model to lose track of earlier material, and inflates API costs. The solution required a trimming strategy that keeps the most recent exchanges while holding history to a fixed token budget, so that truncation is predictable rather than arbitrary.

Prompt Security in a Multi-Tenant Environment

One student must never retrieve material from another student's course. One professor's uploads must never be visible to students enrolled in a different course. This isolation must be enforced at the database query level - not at the application level - so that a compromised or misbehaving client cannot bypass it.


Solution Architecture

Seven Labs designed a layered architecture where each stage of the pipeline addresses a specific failure mode of naive RAG implementations.

```text
+------------------------------------------------------------------+
|                        CLIENT LAYER                              |
|   Next.js 14 App Router  .  Student Portal  .  Admin Panel       |
+-----------------------------+------------------------------------+
                              |  HTTPS (Traefik reverse proxy + SSL)
+-----------------------------v------------------------------------+
|                   BACKEND API  (Node.js / Express)               |
|                                                                  |
|  +-------------+   +--------------+   +----------------------+  |
|  | Auth / RBAC |   | Chat Routes  |   | Tool Routes          |  |
|  | JWT + 2FA   |   | SSE Streaming|   | Rights / Citation    |  |
|  +-------------+   +------+-------+   +-----------+----------+  |
|                           |                       |              |
|              +------------v-----------------------v-----------+  |
|              |               RAG SERVICE                       |  |
|              |                                                 |  |
|              |  [1] Acronym Expansion + Follow-up Detection   |  |
|              |  [2] Embed Query (Google / OpenAI)             |  |
|              |  [3] Hybrid Search - Weaviate BM25 + Vector    |  |
|              |  [4] Jaccard Deduplication (threshold 0.82)    |  |
|              |  [5] Cohere Multilingual Rerank                 |  |
|              |  [6] Page-Order Re-sort (comprehensive mode)   |  |
|              |  [7] Token-Budgeted Prompt Assembly            |  |
|              |  [8] LLM Generation (GPT-4o / Gemini)          |  |
|              |  [9] SSE Token Stream to Client                 |  |
|              | [10] Source Citations Returned                  |  |
|              +-------------------------------------------------+  |
+------+----------------------+-------------------+----------------+
       |                      |                   |
+------v------+   +-----------v--------+  +-------v--------+
|  MongoDB    |   |   Weaviate         |  | External APIs  |
|  messages   |   | CourseChunk        |  | OpenAI GPT-4o  |
|  courses    |   | BM25 + dense vec   |  | Cohere Rerank  |
|  chunks     |   | courseId filter    |  | LlamaParse     |
|  prompts    |   +--------------------+  | Google Embed   |
|  audit logs |                           +----------------+
+-------------+
```

All services run inside a private Docker bridge network on Coolify. Weaviate and MongoDB are unreachable from the public internet. Traefik routes HTTPS only to the frontend and admin panel.

Document Ingestion Pipeline

When a professor uploads a PDF or DOCX, the following pipeline executes automatically:

```text
Upload -> LlamaParse -> Markdown Extraction -> Chunking -> Embedding -> Dual-Write
```

| Step | What Happens | Technology |
|---|---|---|
| Upload | Professor uploads file (max 50 MB) via admin portal | Cloudinary |
| Parse | File sent to LlamaParse. Handles tables, multi-column layouts, equations - returns clean Markdown | LlamaParse |
| Chunking | Text split into 1,000-character chunks with 200-character overlap. Overlap ensures concepts spanning a page boundary are never severed | Custom recursive splitter |
| Embedding | Each chunk converted to a dense vector capturing semantic meaning (768 or 1,536 dimensions) | Google text-embedding-004 or OpenAI text-embedding-3-small |
| Dual-Write | Every chunk written to Weaviate (hybrid search) and MongoDB (sequential access, fallback, page-order re-sort) | Weaviate + MongoDB |

The dual-write is not redundant storage - it serves two distinct purposes. Weaviate serves real-time hybrid search at query time. MongoDB preserves the original chunk sequence by `chunkIndex`, enabling the comprehensive query mode to fetch material in the exact order the professor structured it.


Technology Stack

Frontend and Backend

  • Next.js 14 (App Router, standalone build): Student portal and admin panel - SSR with client-side routing, streaming UI for token-by-token answer display
  • Node.js / Express: Backend API - REST endpoints and Server-Sent Events for streaming responses

AI and Retrieval

  • LlamaParse: Cloud document parser - handles complex academic PDF layouts, tables, equations, returning structured Markdown
  • Google text-embedding-004 / OpenAI text-embedding-3-small: Query and document embedding - generates 768 or 1,536-dimensional dense vectors
  • Weaviate: Vector database - dual BM25 keyword index and dense vector index on the same object, single hybrid query call, `courseId` filter enforced at query time
  • Cohere rerank-multilingual-v3.0: Cross-attention reranker - natively bilingual Arabic/English, no translation step required
  • OpenAI GPT-4o: Primary LLM for answer generation (configurable to Gemini 1.5 Flash)

Infrastructure

  • Coolify: Self-hosted PaaS - manages Docker Compose deployments, environment variables, and rolling restarts
  • Traefik: Reverse proxy - automatic SSL termination, domain routing, public HTTPS only to frontend and admin panel
  • Docker bridge network: All services isolated from the public internet; Weaviate and MongoDB are container-internal only

Storage

  • MongoDB 7: Document store - users, sessions, full chat history, course chunks, system prompts, quiz results, audit logs
  • Cloudinary: File storage - original uploaded PDFs and DOCX files

Implementation Process

```text
Phase 1          Phase 2              Phase 3              Phase 4           Phase 5
Ingestion   ->   Hybrid Search   ->   Rerank + Intel   ->  Context Mgmt  ->  Security + UAT
[Parse/Chunk]    [Weaviate BM25]      [Cohere Rerank]      [Token Budget]    [RBAC/Prompts]
[Embed/Write]    [Alpha Tuning]       [Follow-up Det.]     [Cache Layer]     [Audit Logs]
```

Phase 1: Document Ingestion Pipeline

We began with the ingestion layer because retrieval quality is determined entirely by what goes into the index. We integrated LlamaParse over standard PDF parsers specifically for its ability to preserve table cell relationships and handle multi-column academic textbook layouts. A standard `pdfplumber` extraction of the same document produced merged columns and broken tables - LlamaParse returned clean Markdown with preserved structure.

We implemented a custom recursive character splitter producing 1,000-character chunks with 200-character overlap. The overlap value was chosen through empirical testing on KU lecture material: smaller overlaps caused retrieval misses on concepts described across slide boundaries; larger overlaps increased index size without proportional retrieval gain.
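The windowing behind those numbers can be sketched as follows. This is a simplified illustration in Python (the production backend is Node.js): it slides a fixed 1,000-character window with a 200-character overlap and omits the separator-aware recursion of the real splitter. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping character windows, keeping document order.

    Simplified sketch: a real recursive splitter would also prefer to cut at
    paragraph or sentence separators instead of at raw character offsets.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    step = size - overlap  # each new window starts 800 characters later
    chunks = []
    start = 0
    index = 0
    while start < len(text):
        chunks.append({"chunkIndex": index, "text": text[start:start + size]})
        if start + size >= len(text):
            break  # the window already reached the end of the document
        start += step
        index += 1
    return chunks
```

Each chunk shares its last 200 characters with the start of the next one, so a sentence cut at one boundary is still intact in the neighbouring chunk.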

Each chunk was written simultaneously to Weaviate (for search) and MongoDB (for sequential access), with `chunkIndex` preserving document position for later page-order re-sorts.

Phase 2: Hybrid Search Configuration and Tuning

We configured Weaviate's hybrid search with `alpha = 0.75` - favouring semantic vector similarity while preserving BM25's strength on exact technical terms, course codes, and Arabic proper nouns that embedding models can misplace semantically.

Relative Score Fusion was selected over simple weighted averaging because BM25 and vector scores operate on entirely different scales and cannot be meaningfully added. RSF normalises each ranked list independently before fusion, producing stable results regardless of score magnitude differences between retrieval methods.
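To make the normalise-then-fuse idea concrete, here is a minimal Python sketch. It is an approximation for illustration only: Weaviate performs this fusion internally, and its implementation may differ in detail. The function names and the example scores are hypothetical; `alpha = 0.75` is the weight given to the vector side.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one result list's scores to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(bm25: dict[str, float],
                          vector: dict[str, float],
                          alpha: float = 0.75) -> list[tuple[str, float]]:
    """Fuse two score lists after normalising each one independently."""
    norm_bm25, norm_vec = min_max(bm25), min_max(vector)
    fused = {
        doc_id: alpha * norm_vec.get(doc_id, 0.0)
        + (1 - alpha) * norm_bm25.get(doc_id, 0.0)
        for doc_id in set(norm_bm25) | set(norm_vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# BM25 scores are unbounded, vector similarities sit roughly in 0..1.
ranked = relative_score_fusion(
    bm25={"a": 12.0, "b": 3.0},
    vector={"a": 0.2, "b": 0.9, "c": 0.5},
)
```

Without the normalisation step, the raw BM25 score of 12.0 would dominate every vector score and the `alpha` weighting would have little effect.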

The initial retrieval pull was set to 25 candidate chunks for standard queries. Every search is scoped by `courseId` at the Weaviate query level - enforced in the database, not the application.

We implemented Jaccard trigram deduplication at threshold 0.82 before reranking. Without deduplication, near-identical overlapping chunks from adjacent pages consistently appeared in the top results, consuming reranker slots and LLM context tokens with redundant content.
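A minimal Python sketch of this deduplication step follows. It assumes character trigrams over lower-cased, whitespace-normalised text; the source does not say whether the trigrams are character-based or word-based, so treat that choice and the function names as assumptions.

```python
def trigrams(text: str) -> set[str]:
    """Character trigrams of the lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + 3] for i in range(len(normalised) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(chunks: list[str], threshold: float = 0.82) -> list[str]:
    """Keep a chunk only if it is not a near-duplicate of one already kept.

    Input order is preserved, so the higher-ranked copy of a duplicate wins.
    """
    kept: list[tuple[str, set[str]]] = []
    for chunk in chunks:
        grams = trigrams(chunk)
        if all(jaccard(grams, seen) < threshold for _, seen in kept):
            kept.append((chunk, grams))
    return [chunk for chunk, _ in kept]
```

Running this before the reranker means each of the limited reranker slots is spent on distinct content.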

Phase 3: Reranking, Query Intelligence, and Comprehensive Mode

We selected Cohere `rerank-multilingual-v3.0` after testing three alternatives. The multilingual model was the only option that produced stable Arabic relevance scores without requiring a translation step - critical given that a significant portion of KU course material and student queries are in Arabic.

The reranker reads the full query together with each candidate chunk in a cross-attention pass, producing a relevance score specific to that exact query-chunk pair. This catches false positives that hybrid search surfaces: a chunk containing a keyword from the query but discussing a different topic entirely gets correctly demoted.

Comprehensive query detection was implemented using a phrase pattern match against a trigger list ("teach me", "explain in detail", "everything about", "full lecture", "don't miss anything") combined with a word-count check. When triggered, the retrieval strategy changes entirely:

| Parameter | Standard Query | Comprehensive Query |
|---|---|---|
| Initial retrieval | 25 chunks via hybrid search | Up to 60 chunks fetched by `chunkIndex` from MongoDB |
| Deduplication | Jaccard trigram (0.82) | Skipped - sequential chunks are inherently unique |
| Reranking | Top 7 via Cohere | Top 20 via Cohere |
| Final ordering | Relevance score order | Re-sorted into original page order after reranking |
| LLM token budget | 4,096 tokens | 8,192 tokens |
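The mode switch can be sketched in Python as below. The trigger phrases and the per-mode parameters come from the text above; the `MIN_WORDS` threshold, the way it is combined with the phrase match, and all names are assumptions, since the exact word-count rule is not stated.

```python
from dataclasses import dataclass

TRIGGER_PHRASES = (
    "teach me", "explain in detail", "everything about",
    "full lecture", "don't miss anything",
)
MIN_WORDS = 4  # hypothetical threshold; the real value is not documented


@dataclass(frozen=True)
class RetrievalPlan:
    mode: str
    initial_chunks: int
    rerank_top_n: int
    deduplicate: bool
    page_order: bool
    max_tokens: int


STANDARD = RetrievalPlan("standard", 25, 7, True, False, 4096)
COMPREHENSIVE = RetrievalPlan("comprehensive", 60, 20, False, True, 8192)


def plan_for(query: str) -> RetrievalPlan:
    """Pick the retrieval plan from a phrase match plus a word-count check."""
    text = query.lower().replace("\u2019", "'")  # normalise curly apostrophes
    has_trigger = any(phrase in text for phrase in TRIGGER_PHRASES)
    long_enough = len(text.split()) >= MIN_WORDS
    return COMPREHENSIVE if has_trigger and long_enough else STANDARD
```

Keeping both plans as immutable records means the rest of the pipeline reads its limits from one place instead of branching on the mode at every stage.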

For comprehensive queries where the student has not specified which document they want, rank-weighted majority voting identifies the target document: the hybrid search result ranked first gets the most votes, lower-ranked results get proportionally fewer, and the document with the highest total vote weight wins. This is robust against a single off-topic result skewing the selection.
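One possible reading of that voting scheme, sketched in Python: each hybrid search hit votes for its source document with a weight that falls linearly with rank. The linear weighting and the name `pick_target_document` are assumptions; the source only says that lower ranks receive proportionally fewer votes.

```python
from collections import defaultdict
from typing import Optional


def pick_target_document(ranked_doc_ids: list[str]) -> Optional[str]:
    """Return the document with the highest rank-weighted vote total.

    ranked_doc_ids holds the source document of each hit, best hit first.
    """
    votes: dict[str, float] = defaultdict(float)
    total = len(ranked_doc_ids)
    for position, doc_id in enumerate(ranked_doc_ids):
        votes[doc_id] += total - position  # rank 1 gets the largest weight
    if not votes:
        return None
    return max(votes, key=lambda doc_id: votes[doc_id])


# One off-topic hit ranks first, but three lower hits agree on lecture3.
winner = pick_target_document(
    ["off_topic", "lecture3", "lecture3", "lecture3", "lecture1"]
)
```

Here `off_topic` collects 5 votes from its single top hit while `lecture3` collects 4 + 3 + 2 = 9, so the consistent document wins.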

Acronym expansion was implemented as a pre-retrieval normalisation step. Students at Kuwait University consistently use shorthand that does not appear verbatim in course material. Before embedding, recognised abbreviations are expanded:

| Input | Expanded |
|---|---|
| `gcr` | google classroom |
| `ku` | kuwait university |
| `oop` | object oriented programming |
| `nlp` | natural language processing |
| `db` | database |
| `ds` | data structure |
| `algo` | algorithm |
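A minimal Python sketch of this normalisation step, using the mappings listed above. The whole-word, case-insensitive matching is an assumption about how the expansion is applied, and the function name is hypothetical.

```python
import re

ACRONYMS = {
    "gcr": "google classroom",
    "ku": "kuwait university",
    "oop": "object oriented programming",
    "nlp": "natural language processing",
    "db": "database",
    "ds": "data structure",
    "algo": "algorithm",
}

# Match whole words only, so "ds" inside "adsorb" is left untouched.
_ACRONYM_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in ACRONYMS) + r")\b",
    re.IGNORECASE,
)


def expand_acronyms(query: str) -> str:
    """Replace recognised abbreviations before the query is embedded."""
    return _ACRONYM_PATTERN.sub(
        lambda match: ACRONYMS[match.group(1).lower()], query
    )
```

For example, `expand_acronyms("explain oop in the db course")` yields a query whose wording matches the vocabulary of the uploaded material.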

Follow-up detection uses pattern matching on short messages under 12 words. When a follow-up is detected ("more", "go deeper", "elaborate", "continue", "what about the rest"), the system retrieves the last substantive user query from conversation history and uses that for embedding and search - not the follow-up message itself. This prevents the system from embedding the word "more" and retrieving semantically unrelated chunks.
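The substitution logic can be sketched in Python as follows. The pattern list mirrors the examples above; the whole-word matching and the function names are assumptions, and a production matcher would need a fuller pattern list, including Arabic phrases.

```python
import re

FOLLOW_UP_PATTERNS = (
    "more", "go deeper", "elaborate", "continue", "what about the rest",
)
MAX_FOLLOW_UP_WORDS = 12  # only messages shorter than this are considered


def is_follow_up(message: str) -> bool:
    """True for short messages that contain a follow-up phrase."""
    text = message.lower().strip()
    if len(text.split()) >= MAX_FOLLOW_UP_WORDS:
        return False
    return any(
        re.search(r"\b" + re.escape(pattern) + r"\b", text)
        for pattern in FOLLOW_UP_PATTERNS
    )


def retrieval_query(message: str, user_history: list[str]) -> str:
    """Pick the text to embed: the last substantive query for follow-ups."""
    if not is_follow_up(message):
        return message
    for earlier in reversed(user_history):
        if not is_follow_up(earlier):
            return earlier
    return message  # no substantive query yet, fall back to the message
```

Note that this simple matcher would also flag a short message such as "tell me more about indexes", so a real pattern list has to be tuned against actual student messages.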

Phase 4: Context Management and Embedding Cache

The context assembly strategy was designed around a hard constraint: the total prompt sent to GPT-4o must remain well within the 128,000-token window regardless of how long the conversation grows, while ensuring the most recent exchanges are always included.

We implemented a token-aware history trimmer that walks backwards through the full conversation stored in MongoDB, accumulates token estimates (approximately 1 token per 4 characters), and stops adding messages once the 4,000-token history budget is reached. This is a critical design choice: a naive "keep last N messages" approach gives no bound on prompt size when prior comprehensive answers run to 8,000 tokens each - three such answers alone add roughly 24,000 tokens of history. The token-budget approach caps history size regardless of individual message length.
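A minimal Python sketch of such a trimmer, assuming messages are dictionaries with a `content` field and using the 4-characters-per-token estimate described above. The function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per four characters."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget: int = 4000) -> list[dict]:
    """Keep the most recent messages that fit inside the token budget.

    Walks backwards from the newest message and stops at the first message
    that would push the running total over the budget.
    """
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept
```

In this sketch a single message larger than the whole budget is dropped rather than truncated; how the production system treats that edge case is not described in the source.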

Every query triggers a fresh RAG retrieval - course material is never carried forward in history. This means the LLM always has current, grounded context for the topic at hand, eliminating the degradation pattern where answers become progressively less grounded as conversations grow.

The embedding cache is an in-process LRU-style Map keyed on the first 600 characters of the query text, with a 60-minute TTL and a maximum of 1,000 entries. Repeated queries - common in study sessions where multiple students submit the same question - return a cached vector immediately with near-zero latency and no API call. Because the key is the query text itself, only textually identical queries hit the cache; paraphrases are embedded separately. Batch ingestion bypasses the cache entirely to prevent memory consumption from unique per-document chunk text.
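The cache behaviour can be sketched in Python with an `OrderedDict` standing in for the JavaScript Map. The class name, method names, and the injectable `clock` parameter are hypothetical; the limits match the figures above.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class EmbeddingCache:
    """In-process cache with LRU eviction and a per-entry time-to-live."""

    def __init__(self, max_entries: int = 1000, ttl_seconds: float = 3600,
                 key_chars: int = 600,
                 clock: Callable[[], float] = time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.key_chars = key_chars
        self._clock = clock
        self._entries: OrderedDict = OrderedDict()

    def get(self, query: str) -> Optional[list]:
        key = query[:self.key_chars]
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, vector = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired, force a fresh embedding call
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return vector

    def put(self, query: str, vector: list) -> None:
        key = query[:self.key_chars]
        self._entries[key] = (self._clock(), vector)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

A caller would check `get()` first and only call the embedding API, followed by `put()`, on a miss.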

Phase 5: Security Hardening, System Prompts, and Tool Pages

The system prompt is stored in MongoDB and is editable by the platform admin without any code deployment. Changes propagate immediately to all live conversations on the next request. This was a deliberate product decision: the educational institution needed to update AI behaviour - adding new regulatory citations, adjusting language style, modifying the KU closing statement - without engineering intervention.

RBAC is enforced at three layers: JWT tokens carry user role, each route has dedicated middleware rejecting mismatched roles, and every Weaviate query is filtered by `courseId` at the database level. A student enrolled in Course A cannot retrieve Course B material regardless of what they send to the API.

Four specialised tool pages were deployed alongside the main course chat, each with its own system prompt stored in MongoDB:

| Tool | Purpose |
|---|---|
| KU Student Rights Advisor | Grade appeals, GPA rules, academic probation - cites KU regulations exactly, never fabricates policy |
| Citation Formatter | APA 7th, MLA, Harvard per KU thesis requirements - strict format compliance |
| Success Stories | KU graduate journeys from uploaded PDFs only - no invented stories |
| What's Trendy | KU events and career trends from uploaded documents and the Eventat platform only |

Audit logs record every admin action with timestamp and actor ID. System prompt contents are hidden from students - if asked, the AI responds that it is present for academic guidance.


Security Considerations

Course-Level Data Isolation

Data isolation is enforced at the Weaviate query level, not the application level. Every search call passes the student's active `courseId` as a mandatory filter. The vector database applies this filter before performing similarity calculations - a student who manually crafts a request without their `courseId` cannot access another course's material because the filter is server-side and cannot be bypassed by the client.
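The key property is that the filter value comes from the authenticated session, never from the request body. The Python sketch below builds a filter in the shape of Weaviate's `where` clause (`path`, `operator`, `valueText`); the surrounding request dictionary, the `activeCourseId` session field, and the function names are hypothetical and do not represent a real client call.

```python
def course_filter(session: dict) -> dict:
    """Build the mandatory courseId filter from the server-side session."""
    course_id = session.get("activeCourseId")
    if not course_id:
        raise PermissionError("no active course on the authenticated session")
    return {"path": ["courseId"], "operator": "Equal", "valueText": course_id}


def build_search_request(session: dict, client_payload: dict) -> dict:
    """Assemble search parameters; client-supplied course ids are ignored."""
    return {
        "collection": "CourseChunk",
        "query": client_payload["query"],
        "alpha": 0.75,
        "limit": 25,
        "where": course_filter(session),  # never read from client_payload
    }


# A tampered payload cannot widen the search beyond the session's course.
request = build_search_request(
    session={"activeCourseId": "CS101"},
    client_payload={"query": "primary key", "courseId": "CS999"},
)
```

Raising an error when the session has no active course means a missing filter fails closed instead of returning unscoped results.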

Prompt Injection Prevention

The `course_context` block is assembled server-side and injected into the system prompt programmatically. Students cannot overwrite the system instruction through the chat input because course context is not derived from anything the student sends - it is retrieved from a filtered database query and inserted by the backend before the LLM ever sees the message.

Role-Based Access Control

JWT tokens carry user role (`student`, `professor`, `admin`). Each API route has dedicated middleware that rejects requests with mismatched roles before they reach any business logic. Students cannot call professor file upload routes. Professors cannot call admin user management routes. Each boundary is enforced independently.
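The per-route role check can be sketched in Python as a decorator, standing in for Express middleware. It assumes the JWT has already been verified and decoded into a `claims` dictionary; the exception type, handler, and claim names are hypothetical.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the caller's role is not allowed on a route."""


def require_role(*allowed_roles: str):
    """Reject the request before any business logic if the role mismatches."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(claims: dict, *args, **kwargs):
            if claims.get("role") not in allowed_roles:
                raise Forbidden(f"role {claims.get('role')!r} is not permitted")
            return handler(claims, *args, **kwargs)
        return wrapper
    return decorator


@require_role("professor")
def upload_course_file(claims: dict, filename: str) -> str:
    # Business logic runs only after the role check has passed.
    return f"{claims['sub']} uploaded {filename}"
```

Declaring the allowed roles next to each handler keeps every boundary independent: loosening one route does not affect any other.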

No-Fabrication Policy at Architecture Level

The system prompt explicitly prohibits the LLM from inventing KU rules or course facts. Where uploaded material lacks an answer, the AI is instructed to say so explicitly rather than guess. This is reinforced architecturally: retrieval always runs before generation, and the system prompt frames the retrieved material as the primary source the model must teach from. For a deeper discussion on securing AI systems in restricted environments, see /blogs/secure-ai-restricted-networks.


Performance Optimizations

Embedding Cache Eliminates Redundant API Calls

Study sessions produce high query repetition: many students submit the same question, often word for word. Because the cache is keyed on query text, it serves exact repeats; variations such as "explain normalization", "what is normalization", and "normalization in databases" produce semantically similar vectors but are cached as separate entries. For a repeated query, the LRU embedding cache with a 60-minute TTL returns the vector with a sub-millisecond lookup instead of a 200-400ms external API call. At peak study periods before exams, cache hit rates exceed 60% for popular courses.

Weaviate Single-Call Hybrid Search

Both BM25 and vector search execute in a single Weaviate query call. There is no fan-out to separate indexes or separate services. This means the total retrieval latency is the latency of one Weaviate call plus network overhead - not the serial latency of two separate search systems. For an analysis of how latency compounds in poorly architected AI systems, see /blogs/ai-infrastructure-engineering-beyond-chatbots.

Token-Budgeted Context Prevents Prompt Bloat

By enforcing a hard token budget on history inclusion, the total prompt size stays predictable across sessions of any length. This matters for cost as well as latency - GPT-4o pricing is per-token, and an unbounded history window that grows across a semester-long engagement would make per-query costs unpredictable. The token-budget approach keeps costs linear with query complexity, not with session age.

SSE Streaming Eliminates Perceived Latency

Responses are streamed token-by-token to the client using Server-Sent Events. Students see the answer begin appearing within 300-600 milliseconds of submitting their query - before the full response is generated. For comprehensive answers that can exceed 2,000 words, this is the difference between the system feeling responsive and feeling broken.
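For reference, the wire format of such a stream can be sketched in Python. Each Server-Sent Events frame is a `data:` line (optionally preceded by an `event:` line) followed by a blank line. The `sources` and `done` event names and the JSON payload shape are assumptions, not the platform's documented protocol.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], sources: list[str]) -> Iterator[str]:
    """Yield SSE frames: one per generated token, then sources, then done."""
    for token in tokens:
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Citations are sent once generation has finished.
    yield f"event: sources\ndata: {json.dumps(sources)}\n\n"
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Hel", "lo"], ["week1.pdf p.3"]))
```

Because the generator yields each frame as soon as its token is available, the first frame can be flushed to the client long before the full answer exists.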

Health-Check-Gated Container Startup

The backend container will not accept requests until both MongoDB (`mongosh ping`) and Weaviate (`/v1/.well-known/ready`) pass their health checks. This eliminates the class of production incidents caused by the backend starting before its dependencies are ready - a common failure pattern in Docker Compose deployments that causes silent errors during the first 30 seconds after a deployment.


Results & Outcomes

LovEdu deployed to Kuwait University with the following measured outcomes across the first semester of operation:

  • Zero Hallucination Incidents on Course Material: Grounding rules and the RAG architecture ensured every answer was derived from uploaded professor material. No fabricated course facts were reported across all student sessions.
  • Accurate Arabic-English Bilingual Responses: The Cohere multilingual reranker produced relevant retrieval results on Arabic queries without translation overhead. Students received answers in the language of their query automatically.
  • Comprehensive Queries Covered Full Document Scope: The sequential fetch mode combined with page-order re-sorting allowed the LLM to receive material in the structure the professor intended, producing coherent lecture-quality explanations from a single prompt.
  • Context Integrity Preserved Across Long Sessions: No context overflow incidents were recorded. The token-budgeted history approach maintained answer quality from session start to session end regardless of conversation length.
  • Institutional Policy Answers with Explicit Citations: The KU Rights Advisor tool answered grade appeal and academic probation questions with direct regulatory citations, replacing guesswork with verifiable references.

| System Metric | Baseline (Generic AI) | LovEdu Production System | Outcome |
|---|---|---|---|
| Hallucination Rate on KU Regulations | High - fabricated policies | Zero - citation required or explicit "I don't know" | Eliminated |
| Arabic Query Retrieval Quality | Degraded - no native Arabic reranking | Full quality - Cohere multilingual reranker | Parity with English |
| Follow-up Query Coherence | Broken - follow-ups retrieve unrelated material | Intact - last substantive query reused for retrieval | Maintained |
| Comprehensive Answer Structure | Random ordering | Page-order sequential, professor-structured | Coherent |
| Context Integrity at 100+ Messages | Degraded - naive history overflow | Maintained - token-budgeted trimming | Preserved |

Lessons Learned

The Reranker Is the Most Important Component After Chunking

Hybrid search retrieves candidates. The reranker determines which candidates actually answer the question. Deploying Cohere multilingual reranking over plain hybrid search produced a measurable improvement in answer quality on Arabic queries specifically - because the cross-attention model reads the query and each candidate together rather than comparing independently computed embeddings. For any bilingual or multilingual RAG deployment, a native multilingual reranker is not optional.

Sequential Fetch Outperforms Pure Vector Retrieval for Comprehensive Queries

Vector similarity is excellent for targeted queries. For comprehensive questions, similarity-ranked chunks arrive out of order, forcing the LLM to re-sequence jumbled material. Fetching the target document's chunks by `chunkIndex` from MongoDB and re-sorting the reranked results into original page order produced noticeably more coherent comprehensive answers than relevance-ordered context injection.
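The re-sort itself is a two-step selection, sketched here in Python. It assumes each chunk carries a reranker `score` and its `chunkIndex`; the function name is hypothetical and the default of 20 matches the comprehensive-mode rerank limit.

```python
def select_and_reorder(chunks: list[dict], top_n: int = 20) -> list[dict]:
    """Keep the top_n chunks by reranker score, then restore page order."""
    top = sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_n]
    return sorted(top, key=lambda c: c["chunkIndex"])


reranked = [
    {"chunkIndex": 0, "score": 0.2},
    {"chunkIndex": 1, "score": 0.9},
    {"chunkIndex": 2, "score": 0.1},
    {"chunkIndex": 3, "score": 0.8},
]
ordered = select_and_reorder(reranked, top_n=3)
```

Relevance decides which chunks are included, while `chunkIndex` decides the order in which the LLM reads them.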

History Budgeting Must Be Token-Aware, Not Message-Count-Aware

A "keep last 8 messages" history strategy fails in practice the moment two or three comprehensive answers exist in history. At up to 8,192 tokens each, three such messages add roughly 24,000 tokens of history before the current query and retrieved chunks are even included, far beyond the 4,000-token history budget. Token-aware budgeting - walking backwards through history and accumulating estimated token counts - is the only approach that remains correct regardless of individual message length.

Dual-Write Is Worth the Storage Overhead

Storing chunks in both Weaviate and MongoDB appears redundant but serves genuinely different access patterns. Weaviate provides fast hybrid search. MongoDB provides `chunkIndex`-ordered sequential access, fallback search when Weaviate is unavailable, and administrative access to the raw chunk data. The storage overhead is small relative to document sizes; the operational resilience is significant.
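The dual-write pattern and its fallback path can be illustrated with in-memory stand-ins. Everything here is hypothetical: the class, the field names other than `chunkIndex`, and the substring match, which merely stands in for hybrid search.

```python
from typing import Dict, List, Tuple


class DualWriteStore:
    """In-memory stand-ins: search_index plays the role of Weaviate,
    document_store plays the role of MongoDB."""

    def __init__(self) -> None:
        self.search_index: Dict[str, Dict] = {}
        self.document_store: Dict[str, Dict] = {}
        self.search_index_up = True

    def write_chunk(self, chunk: Dict) -> None:
        # Every chunk is written to both stores.
        self.document_store[chunk["id"]] = chunk
        self.search_index[chunk["id"]] = chunk

    def search(self, keyword: str) -> Tuple[str, List[Dict]]:
        # Prefer the search index; fall back to the document store if it is down.
        if self.search_index_up:
            backend, source = "search_index", self.search_index
        else:
            backend, source = "document_store", self.document_store
        hits = [c for c in source.values() if keyword.lower() in c["text"].lower()]
        return backend, hits

    def in_order(self, document_id: str) -> List[Dict]:
        # Sequential access is always served from the document store.
        chunks = [
            c for c in self.document_store.values() if c["documentId"] == document_id
        ]
        return sorted(chunks, key=lambda c: c["chunkIndex"])


store = DualWriteStore()
store.write_chunk({"id": "c2", "documentId": "d1", "chunkIndex": 1, "text": "Second normal form"})
store.write_chunk({"id": "c1", "documentId": "d1", "chunkIndex": 0, "text": "First normal form"})
```

Setting `store.search_index_up = False` simulates an outage: `search` keeps answering from the second copy, which is the resilience argument made above.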

Admin-Editable System Prompts Reduce Engineering Dependency

Storing system prompts in MongoDB rather than code was one of the highest-impact architectural decisions. The educational institution updated the KU closing statement, added a new regulatory citation format, and adjusted response language style three times in the first month - all without a single code deployment. For teams building AI tools for non-technical clients, this pattern eliminates an entire category of support tickets. For more on building maintainable AI systems, see /blogs/human-centered-ai-workflow-integration.


Frequently Asked Questions (FAQs)

1. How does the system ensure a student in one course cannot access another course's material?

Isolation is enforced at two independent layers. At the application layer, the backend uses the authenticated student's session to determine their active course and passes the `courseId` to every search call. At the database layer, Weaviate applies the `courseId` filter before performing any similarity calculation - the filter runs inside the vector database, not in the application code. A client that tampers with its request cannot bypass the database-level filter.
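The two layers can be sketched together as follows. This is an illustration only: the session table, token, and field names other than `courseId` are hypothetical, and the list filter plus substring match stand in for the database-side filter and similarity search.

```python
from typing import Dict, List

# Hypothetical server-side session table keyed by an opaque token.
SESSIONS: Dict[str, Dict[str, str]] = {
    "token-abc": {"studentId": "s1", "courseId": "CS101"},
}


def search_course(
    session_token: str, query: str, request_body: Dict, chunks: List[Dict]
) -> List[Dict]:
    """Resolve courseId from the session, filter first, then match."""
    session = SESSIONS[session_token]  # raises KeyError for unknown sessions
    # Application layer: courseId comes from the session.
    # Any courseId the client put in request_body is deliberately ignored.
    course_id = session["courseId"]
    # Database layer (stand-in): restrict to the course before any matching.
    in_scope = [c for c in chunks if c["courseId"] == course_id]
    return [c for c in in_scope if query.lower() in c["text"].lower()]


chunks = [
    {"courseId": "CS101", "text": "Normalization basics"},
    {"courseId": "CS999", "text": "Normalization exam answers"},
]
# A tampered request asks for another course; the result is unaffected.
results = search_course("token-abc", "normalization", {"courseId": "CS999"}, chunks)
```

Because the filter is applied before matching, out-of-course chunks are never candidates, so nothing downstream can leak them.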

2. Why was Weaviate chosen over Pinecone or Qdrant for this deployment?

The primary driver was native hybrid search in a single call. Weaviate maintains both a BM25 keyword index and a dense vector index on the same object, executes both simultaneously in one query, and fuses the results using Relative Score Fusion natively. Achieving equivalent behaviour with Pinecone requires running two separate queries and merging results in application code. Qdrant supports hybrid search but with additional configuration overhead. For a self-hosted deployment on Coolify, Weaviate's Docker-native setup and stable REST API made it the most operationally straightforward choice.
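For readers unfamiliar with score-based fusion, the idea can be sketched in a few lines: scale each result set's scores to the range 0 to 1, then combine them with a weight. This is a simplified illustration of the concept, not Weaviate's implementation; the function names are hypothetical, and `alpha` follows the common convention of weighting the vector side.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 1.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def relative_score_fusion(
    bm25: Dict[str, float], vector: Dict[str, float], alpha: float = 0.5
) -> Dict[str, float]:
    """Weighted sum of normalized scores: alpha for vector, 1 - alpha for BM25.
    A chunk missing from one result set contributes 0 from that side."""
    norm_bm25, norm_vector = normalize(bm25), normalize(vector)
    ids = set(norm_bm25) | set(norm_vector)
    return {
        i: alpha * norm_vector.get(i, 0.0) + (1 - alpha) * norm_bm25.get(i, 0.0)
        for i in ids
    }


bm25_scores = {"a": 2.0, "b": 6.0, "c": 4.0}
vector_scores = {"a": 0.9, "b": 0.5, "d": 0.7}
fused = relative_score_fusion(bm25_scores, vector_scores, alpha=0.75)
```

Unlike rank-based fusion, this keeps information about how far apart the scores were, so a clear winner on one side is not flattened into "rank 1".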

3. What happens when a student asks a question the uploaded course material does not cover?

The LLM is instructed via system prompt to teach from the retrieved course material first and supplement with general academic knowledge only where the material has genuine gaps - stating explicitly when it is doing so. If retrieval returns zero relevant chunks (relevance scores below threshold), the system falls back to a general academic advisor mode and does not fabricate course-specific information. The zero-chunk fallback is logged, which allows professors to identify topics their uploaded material does not address.
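The threshold check and logged fallback might look like the sketch below. The threshold value, mode names, logger name, and function name are all assumptions made for illustration; the source does not state them.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("retrieval")

RELEVANCE_THRESHOLD = 0.3  # hypothetical value, not from the source


def choose_mode(
    query: str, course_id: str, scored_chunks: List[Tuple[Dict, float]]
) -> Tuple[str, List[Dict]]:
    """Keep chunks at or above the threshold; if none remain,
    log the gap and switch to the general advisor mode."""
    relevant = [chunk for chunk, score in scored_chunks if score >= RELEVANCE_THRESHOLD]
    if not relevant:
        # Logged so professors can see which topics their material lacks.
        logger.info("zero-chunk fallback course=%s query=%r", course_id, query)
        return "general_advisor", []
    return "course_material", relevant


scored = [({"text": "Normalization basics"}, 0.82), ({"text": "Office hours"}, 0.05)]
mode, context = choose_mode("what is 3NF?", "CS101", scored)
fallback_mode, fallback_context = choose_mode(
    "what is 3NF?", "CS101", [({"text": "Office hours"}, 0.05)]
)
```

Returning the mode alongside the context lets the caller pick the matching system prompt without re-deriving why the context is empty.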

4. How does follow-up detection avoid false positives - treating a genuine new question as a follow-up?

Follow-up detection applies two conditions simultaneously: the message must match a pattern from the follow-up phrase list AND be under 12 words. A message like "what does the professor say about database normalization?" is 8 words but contains no follow-up phrase, so it is not treated as a follow-up. A message like "more" is one word and matches the pattern. The dual condition keeps false positives to a negligible rate in practice.
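The dual condition can be sketched as below. The phrase patterns are hypothetical examples, since the actual follow-up phrase list is not given; only the 12-word limit and the reuse of the last substantive query come from the text.

```python
import re
from typing import Optional

# Hypothetical patterns; the real follow-up phrase list is not shown in the source.
FOLLOW_UP_PATTERNS = [
    r"^more\b",
    r"^continue\b",
    r"\bexplain (that|this|it)\b",
    r"\bgive (me )?an example\b",
]
MAX_FOLLOW_UP_WORDS = 12


def is_follow_up(message: str) -> bool:
    """True only if the message is short AND matches a follow-up pattern."""
    text = message.strip().lower()
    is_short = len(text.split()) < MAX_FOLLOW_UP_WORDS
    matches = any(re.search(pattern, text) for pattern in FOLLOW_UP_PATTERNS)
    return is_short and matches


def retrieval_query(message: str, last_substantive_query: Optional[str]) -> str:
    """Reuse the last substantive query for retrieval when the message is a follow-up."""
    if is_follow_up(message) and last_substantive_query:
        return last_substantive_query
    return message


new_question = "what does the professor say about database normalization?"
long_request = (
    "Can you explain that in a lot more detail with several "
    "worked examples from chapter three please"
)
```

Here `is_follow_up("more")` is true, while `new_question` fails the pattern test and `long_request` fails the length test even though it contains a follow-up phrase.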

5. What is the latency profile of a standard query end-to-end?

A standard query goes through: acronym expansion (< 1ms), embedding cache lookup or API call (< 1ms cached, 150-300ms uncached), Weaviate hybrid search (80-150ms), Jaccard deduplication (< 5ms), Cohere reranking (200-400ms), prompt assembly (< 5ms), and GPT-4o generation with SSE streaming (first token in 300-600ms). Total time to first token for a standard query is approximately 700ms to 1.4 seconds. The student sees the answer begin streaming immediately after, with the full response completing in 3-8 seconds depending on length.
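The per-stage figures can be summed as a quick sanity check on the time to first token. The sketch below treats each "< N ms" figure as a range from 0 to N, which is an assumption; the stage names are hypothetical labels for the steps listed above.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (low_ms, high_ms)

# Stages common to every standard query, in milliseconds.
COMMON_STAGES_MS: Dict[str, Range] = {
    "acronym_expansion": (0, 1),
    "weaviate_hybrid_search": (80, 150),
    "jaccard_deduplication": (0, 5),
    "cohere_reranking": (200, 400),
    "prompt_assembly": (0, 5),
    "first_token": (300, 600),
}
# The embedding step depends on whether the cache is hit.
EMBEDDING_MS: Dict[str, Range] = {"cached": (0, 1), "uncached": (150, 300)}


def time_to_first_token(embedding: str) -> Range:
    """Sum the low and high bounds of every stage for the given embedding path."""
    stages = list(COMMON_STAGES_MS.values()) + [EMBEDDING_MS[embedding]]
    return sum(low for low, _ in stages), sum(high for _, high in stages)


cached = time_to_first_token("cached")      # (580, 1162)
uncached = time_to_first_token("uncached")  # (730, 1461)
```

Summing the bounds gives roughly 0.6 to 1.2 seconds on an embedding cache hit and roughly 0.7 to 1.5 seconds on a miss, which is in line with the approximate range quoted above. Reranking and generation dominate both paths.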

6. How are system prompt updates applied without downtime?

System prompts are stored as documents in MongoDB's `systemprompts` collection. On every incoming request, the backend fetches the active system prompt for the relevant tool (course chat, rights advisor, citation formatter) from MongoDB before assembling the LLM prompt. There is no caching of prompt content in memory between requests. When an admin updates a prompt in the admin panel, the change is written to MongoDB and takes effect on the very next request - no restart, no redeployment, no user-facing interruption.
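A minimal sketch of the per-request fetch, with assumptions flagged: the document fields `tool`, `active`, and `content` are hypothetical, as is the tool key `course_chat`. The loader only needs an object with a `find_one` method, so an in-memory stand-in is used here in place of a real MongoDB collection.

```python
from typing import Any, Dict, List, Optional, Protocol


class PromptCollection(Protocol):
    """Anything exposing find_one(filter) works, e.g. a MongoDB collection."""

    def find_one(self, query: Dict[str, Any]) -> Optional[Dict[str, Any]]: ...


def load_system_prompt(collection: PromptCollection, tool: str) -> str:
    """Fetch the active prompt on every call; nothing is cached in memory."""
    document = collection.find_one({"tool": tool, "active": True})
    if document is None:
        raise LookupError(f"no active system prompt for tool {tool!r}")
    return document["content"]


class InMemoryPrompts:
    """Stand-in for the systemprompts collection, for illustration only."""

    def __init__(self, documents: List[Dict[str, Any]]) -> None:
        self.documents = documents

    def find_one(self, query: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        for document in self.documents:
            if all(document.get(key) == value for key, value in query.items()):
                return document
        return None


prompts = InMemoryPrompts(
    [{"tool": "course_chat", "active": True, "content": "Answer from course material."}]
)
before = load_system_prompt(prompts, "course_chat")
# An admin edit changes the stored document; the next request sees it immediately.
prompts.documents[0]["content"] = "Answer from course material. Cite page numbers."
after = load_system_prompt(prompts, "course_chat")
```

The trade-off is one extra database read per request in exchange for edits that take effect without a restart.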


Schema & SEO Metadata

json
{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "headline": "LovEdu - Production RAG Platform for Kuwait University",
  "description": "How Seven Labs built LovEdu, a production-grade AI education platform for Kuwait University, implementing hybrid RAG search, Cohere multilingual reranking, comprehensive query detection, and token-budgeted context management.",
  "inLanguage": "en-US",
  "articleSection": "Artificial Intelligence & Education Technology",
  "keywords": "RAG, Retrieval-Augmented Generation, Weaviate, Hybrid Search, Cohere Rerank, Kuwait University, Arabic NLP, LlamaParse, GPT-4o, AI Education Platform, EdTech AI",
  "author": {
    "@type": "Organization",
    "name": "Seven Labs",
    "url": "https://www.sevenlabs.site"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Seven Labs",
    "url": "https://www.sevenlabs.site",
    "logo": {
      "@type": "ImageObject",
      "url": "https://res.cloudinary.com/dywx7ldqr/image/upload/v1779223334/media/img_01.png"
    }
  }
}
