Overview
EFMSA ingests emails and uploads (PDF/DOCX/TXT/MD), paraphrases them to remove noise, and indexes the content for instant search. A retrieval‑augmented chatbot answers staff questions with citations to the underlying docs. The system is PHP‑first (no S3/Lambda); we rely on cron jobs and PHP libraries for text extraction and processing.
- Sources: inbound email (IMAP/pipe), file uploads, wiki pages.
- Processing: PHP extractors → paraphrase → embeddings → MySQL FULLTEXT + optional vector store.
- Access: intranet search UI + internal Q&A chatbot with citations.
Tiny diagram (PHP workflow)
The simple version
Why indexing matters: without it, search would be slow and incomplete. With it, you can type a few words and immediately see the right page, email, or file.
- New info arrives: You upload a file or our system receives an email.
- Text is extracted: We pull text from the document or email. If it’s a PDF/DOCX/TXT we can read, we take the body text; titles and tags come along for the ride.
- We tidy and store: We clean odd characters, normalize whitespace, and record the content in our database.
- Search gets smarter: The catalog (index) is updated so keywords lead you straight to relevant pages and attachments.
- Repeats on a schedule: We re‑index every 12 hours so the catalog stays fresh. You can also run it manually anytime.
- Privacy first: Only signed‑in members can search. We store what you upload or receive internally—no public sharing.
Technical details & reasoning
Pipeline stages
- Discovery: find new emails and files not yet processed (
status = 'unprocessed'). - Extraction: parse structure (title, sender, date, tags) and extract text (PDF/DOCX/TXT/MD).
- Normalization: strip HTML, collapse whitespace, remove control chars, unify encodings.
- Enrichment: apply tags (if provided), infer type (Email Index vs Manual Upload), store metadata.
- Commit: write into
wiki_docs(or equivalent), mark entry processed, update timestamps. - Searchability: results appear instantly in server‑side search; optional FULLTEXT indices speed queries.
Performance choices
- Server‑side filtering → fast first paint; AJAX refines results without heavy client‑side parsing.
- Index by
hiddenandcreated_atto keep list queries quick. - Prefer FULLTEXT search (if available) for large datasets; fallback to
LIKEfor compatibility.
Safety choices
- Auth gate: only logged‑in members can access the index.
- CSRF tokens on manual actions (e.g., “index now”).
- Headers:
X-Frame-Options,X-Content-Type-Options: nosniff, and a saneReferrer-Policy.
PHP‑first architecture
- Email intake: IMAP polling (cron) or mail server pipe to a PHP endpoint; parse raw MIME (headers, body, attachments).
- File uploads: web form posts to a PHP controller; store temp file; queue for extraction.
- Extraction: PHP libraries convert DOCX/MD/PDF/TXT into UTF‑8 text; attachments handled individually.
- Paraphrase: send cleaned text to OpenAI for summary + normalized copy; redact PII.
- Indexing: upsert rows into
wiki_docswith FULLTEXT over(title, body); optional embeddings table for semantic search. - Search & chatbot: server‑side search endpoint returns ranked docs; chatbot assembles context and cites sources.
Extraction — DOCX / MD / PDF (PHP)
Libraries: phpoffice/phpword for DOCX, league/commonmark for MD, smalot/pdfparser for PDF. Fallback to strip_tags for HTML.
composer require phpoffice/phpword league/commonmark smalot/pdfparser openai-php/client
<?php
require __DIR__ . '/vendor/autoload.php';
use PhpOffice\PhpWord\IOFactory as WordIO;
use League\CommonMark\CommonMarkConverter;
use Smalot\PdfParser\Parser as PdfParser;
function extract_text(string $path): array {
$ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
if ($ext === 'docx') {
$phpWord = WordIO::load($path);
$text = '';
foreach ($phpWord->getSections() as $section) {
foreach ($section->getElements() as $el) {
if (method_exists($el, 'getElements')) {
foreach ($el->getElements() as $inner) {
if (method_exists($inner, 'getText')) $text .= $inner->getText() . "\n";
}
} elseif (method_exists($el, 'getText')) {
$text .= $el->getText() . "\n";
}
}
}
return ['mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'text' => trim($text)];
}
if ($ext === 'md') {
$md = file_get_contents($path);
$converter = new CommonMarkConverter();
$html = $converter->convert($md)->getContent();
return ['mime' => 'text/markdown', 'text' => trim(strip_tags($html))];
}
if ($ext === 'pdf') {
$parser = new PdfParser();
$pdf = $parser->parseFile($path);
return ['mime' => 'application/pdf', 'text' => trim($pdf->getText())];
}
// fallback: plain text or HTML
$raw = file_get_contents($path);
$txt = $ext === 'txt' ? $raw : strip_tags($raw);
return ['mime' => 'text/plain', 'text' => trim($txt)];
}
Paraphrase & embeddings (OpenAI, PHP)
Policy: keep facts; remove salutations/signatures; redact PII; neutral tone; produce 3–6 bullet summary + 1 paragraph narrative. Store both original and paraphrase.
<?php
use OpenAI as OpenAIClient;
$client = OpenAIClient::client(getenv('OPENAI_API_KEY'));
function paraphrase(string $text): string {
global $client;
$prompt = "Rewrite the content into (1) 3-6 bullets, (2) one neutral paragraph. Remove salutations/signatures. Redact names/emails/phones.";
$r = $client->chat()->create([
'model' => 'gpt-4o-mini',
'messages' => [
['role' => 'system', 'content' => 'You paraphrase EFMSA internal documents.'],
['role' => 'user', 'content' => $prompt . "\n\nCONTENT:\n" . $text]
],
'temperature' => 0.2
]);
return $r->choices[0]->message->content;
}
function embed(string $text): array {
global $client;
$e = $client->embeddings()->create([
'model' => 'text-embedding-3-large',
'input' => $text
]);
return $e->data[0]->embedding; // array of floats
}
Index & search (MySQL FULLTEXT)
We keep a wiki_docs table with FULLTEXT over (title, body). Optional: store embedding JSON for future semantic search.
CREATE TABLE IF NOT EXISTS wiki_docs (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(255) NOT NULL,
body LONGTEXT NOT NULL,
tags VARCHAR(255) DEFAULT NULL,
source ENUM('email','upload','wiki') NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
hidden TINYINT(1) NOT NULL DEFAULT 0,
embedding JSON NULL
) ENGINE=InnoDB;
CREATE FULLTEXT INDEX ft_title_body ON wiki_docs (title, body);
CREATE INDEX idx_created ON wiki_docs (created_at);
CREATE INDEX idx_hidden ON wiki_docs (hidden);
<?php
$pdo = new PDO('mysql:host=localhost;dbname=efmsa;charset=utf8mb4','user','pass', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
function upsert_doc(PDO $pdo, string $title, string $body, string $tags, string $source, ?array $embedding = null) {
$stmt = $pdo->prepare("INSERT INTO wiki_docs (title, body, tags, source, embedding) VALUES (?,?,?,?,?)");
$stmt->execute([$title, $body, $tags, $source, $embedding ? json_encode($embedding) : null]);
return $pdo->lastInsertId();
}
function search_docs(PDO $pdo, string $q): array {
// Prefer FULLTEXT if terms look suitable; fallback to LIKE
$stmt = $pdo->prepare("SELECT id, title, LEFT(body, 400) AS snippet, tags, created_at FROM wiki_docs
WHERE hidden = 0 AND MATCH(title, body) AGAINST(:q IN NATURAL LANGUAGE MODE)
ORDER BY MATCH(title, body) AGAINST(:q IN NATURAL LANGUAGE MODE) DESC LIMIT 20");
$stmt->execute([':q' => $q]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
if (!$rows) {
$stmt2 = $pdo->prepare("SELECT id, title, LEFT(body, 400) AS snippet, tags, created_at FROM wiki_docs
WHERE hidden = 0 AND (title LIKE :l OR body LIKE :l) ORDER BY created_at DESC LIMIT 20");
$stmt2->execute([':l' => '%'.$q.'%']);
return $stmt2->fetchAll(PDO::FETCH_ASSOC);
}
return $rows;
}
Chatbot — Retrieval‑Augmented Generation
For now we build answers from top FULLTEXT results. (Optional upgrade: vector DB for better semantic recall.) The bot cites titles and links to the source pages.
<?php
require __DIR__.'/vendor/autoload.php';
use OpenAI as OpenAIClient;
$pdo = new PDO('mysql:host=localhost;dbname=efmsa;charset=utf8mb4','user','pass', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
$oi = OpenAIClient::client(getenv('OPENAI_API_KEY'));
$q = $_POST['q'] ?? '';
$rows = search_docs($pdo, $q);
$ctx = "";
foreach ($rows as $r) {
$ctx .= "Title: {$r['title']}\nSnippet: {$r['snippet']}\nLink: /wiki.php?id={$r['id']}\n--\n";
if (strlen($ctx) > 6000) break;
}
$system = 'You answer questions for EFMSA using ONLY the provided context. Cite sources as [Title](/wiki.php?id=ID). If not supported, say you are unsure.';
$res = $oi->chat()->create([
'model' => 'gpt-4o-mini',
'messages' => [
['role' => 'system', 'content' => $system],
['role' => 'user', 'content' => "Question: $q\n\nContext:\n$ctx"]
],
'temperature' => 0.2
]);
echo nl2br(htmlentities($res->choices[0]->message->content));
Scheduling — cron jobs
- Every 12 hours: discover new emails/files, (re)index changed items.
- Hourly: quick sweep for urgent uploads.
- Manual: “Index now” button queues an immediate run with CSRF protection.
# m h dom mon dow command
0 */12 * * * /usr/bin/php /var/www/efmsa/scripts/reindex.php >> /var/log/efmsa-cron.log 2>&1
15 * * * * /usr/bin/php /var/www/efmsa/scripts/sweep_uploads.php >> /var/log/efmsa-cron.log 2>&1
Security
- Auth gate: only logged‑in members can search.
- CSRF tokens on manual actions.
- Headers:
X-Frame-Options,X-Content-Type-Options: nosniff,Referrer-Policy. - Privacy: originals stored internally; paraphrases redact PII.
Performance
- Server‑side filtering for fast initial load; AJAX for refinement.
- Indexes on
created_atandhiddento speed list queries. - Prefer FULLTEXT for scale; fallback to
LIKEon small instances.