EFMSA Intranet — Email Ingest & AI Search (PHP)

Overview

EFMSA ingests emails and uploads (PDF/DOCX/TXT/MD), paraphrases them to remove noise, and indexes the content for instant search. A retrieval‑augmented chatbot answers staff questions with citations to the underlying docs. The system is PHP‑first (no S3/Lambda); we rely on cron jobs and PHP libraries for text extraction and processing.

Sources: inbound email (IMAP/pipe), file uploads, wiki pages.
Processing: PHP extractors → paraphrase → embeddings → MySQL FULLTEXT + optional vector store.
Access: intranet search UI + internal Q&A chatbot with citations.

Tiny diagram (PHP workflow)

The simple version

Why indexing matters: without it, search would be slow and incomplete. With it, you can type a few words and immediately see the right page, email, or file.

New info arrives: You upload a file or our system receives an email.
Text is extracted: We pull text from the document or email. If it’s a PDF/DOCX/TXT we can read, we take the body text; titles and tags come along for the ride.
We tidy and store: We clean odd characters, normalize whitespace, and record the content in our database.
Search gets smarter: The catalog (index) is updated so keywords lead you straight to relevant pages and attachments.
Repeats on a schedule: We re‑index every 12 hours so the catalog stays fresh. You can also run it manually anytime.
Privacy first: Only signed‑in members can search. We store what you upload or receive internally—no public sharing.

Technical details & reasoning

Pipeline stages

Discovery: find new emails and files not yet processed (status = 'unprocessed').
Extraction: parse structure (title, sender, date, tags) and extract text (PDF/DOCX/TXT/MD).
Normalization: strip HTML, collapse whitespace, remove control chars, unify encodings.
Enrichment: apply tags (if provided), infer type (Email Index vs Manual Upload), store metadata.
Commit: write into wiki_docs (or equivalent), mark entry processed, update timestamps.
Searchability: results appear instantly in server‑side search; optional FULLTEXT indices speed queries.

Performance choices

Server‑side filtering → fast first paint; AJAX refines results without heavy client‑side parsing.
Index by hidden and created_at to keep list queries quick.
Prefer FULLTEXT search (if available) for large datasets; fallback to LIKE for compatibility.

Safety choices

Auth gate: only logged‑in members can access the index.
CSRF tokens on manual actions (e.g., “index now”).
Headers: X-Frame-Options, X-Content-Type-Options: nosniff, and a sane Referrer-Policy.

PHP‑first architecture

Email intake: IMAP polling (cron) or mail server pipe to a PHP endpoint; parse raw MIME (headers, body, attachments).
File uploads: web form posts to a PHP controller; store temp file; queue for extraction.
Extraction: PHP libraries convert DOCX/MD/PDF/TXT into UTF‑8 text; attachments handled individually.
Paraphrase: send cleaned text to OpenAI for summary + normalized copy; redact PII.
Indexing: upsert rows into wiki_docs with FULLTEXT over (title, body); optional embeddings table for semantic search.
Search & chatbot: server‑side search endpoint returns ranked docs; chatbot assembles context and cites sources.

Extraction — DOCX / MD / PDF (PHP)

Libraries: phpoffice/phpword for DOCX, league/commonmark for MD, smalot/pdfparser for PDF. Fallback to strip_tags for HTML.

Composer

composer require phpoffice/phpword league/commonmark smalot/pdfparser openai-php/client

PHP — extract_text.php

<?php
require __DIR__ . '/vendor/autoload.php';

use PhpOffice\PhpWord\IOFactory as WordIO;
use League\CommonMark\CommonMarkConverter;
use Smalot\PdfParser\Parser as PdfParser;

function extract_text(string $path): array {
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext === 'docx') {
        $phpWord = WordIO::load($path);
        $text = '';
        foreach ($phpWord->getSections() as $section) {
            foreach ($section->getElements() as $el) {
                if (method_exists($el, 'getElements')) {
                    foreach ($el->getElements() as $inner) {
                        if (method_exists($inner, 'getText')) $text .= $inner->getText() . "\n";
                    }
                } elseif (method_exists($el, 'getText')) {
                    $text .= $el->getText() . "\n";
                }
            }
        }
        return ['mime' => 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'text' => trim($text)];
    }
    if ($ext === 'md') {
        $md = file_get_contents($path);
        $converter = new CommonMarkConverter();
        $html = $converter->convert($md)->getContent();
        return ['mime' => 'text/markdown', 'text' => trim(strip_tags($html))];
    }
    if ($ext === 'pdf') {
        $parser = new PdfParser();
        $pdf = $parser->parseFile($path);
        return ['mime' => 'application/pdf', 'text' => trim($pdf->getText())];
    }
    // fallback: plain text or HTML
    $raw = file_get_contents($path);
    $txt = $ext === 'txt' ? $raw : strip_tags($raw);
    return ['mime' => 'text/plain', 'text' => trim($txt)];
}

Paraphrase & embeddings (OpenAI, PHP)

Policy: keep facts; remove salutations/signatures; redact PII; neutral tone; produce 3–6 bullet summary + 1 paragraph narrative. Store both original and paraphrase.

PHP — paraphrase_and_embed.php

<?php
use OpenAI as OpenAIClient;

$client = OpenAIClient::client(getenv('OPENAI_API_KEY'));

function paraphrase(string $text): string {
    global $client;
    $prompt = "Rewrite the content into (1) 3-6 bullets, (2) one neutral paragraph. Remove salutations/signatures. Redact names/emails/phones.";
    $r = $client->chat()->create([
        'model' => 'gpt-4o-mini',
        'messages' => [
            ['role' => 'system', 'content' => 'You paraphrase EFMSA internal documents.'],
            ['role' => 'user', 'content' => $prompt . "\n\nCONTENT:\n" . $text]
        ],
        'temperature' => 0.2
    ]);
    return $r->choices[0]->message->content;
}

function embed(string $text): array {
    global $client;
    $e = $client->embeddings()->create([
        'model' => 'text-embedding-3-large',
        'input' => $text
    ]);
    return $e->data[0]->embedding; // array of floats
}

Index & search (MySQL FULLTEXT)

We keep a wiki_docs table with FULLTEXT over (title, body). Optional: store embedding JSON for future semantic search.

SQL — schema

CREATE TABLE IF NOT EXISTS wiki_docs (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  title VARCHAR(255) NOT NULL,
  body LONGTEXT NOT NULL,
  tags VARCHAR(255) DEFAULT NULL,
  source ENUM('email','upload','wiki') NOT NULL,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  hidden TINYINT(1) NOT NULL DEFAULT 0,
  embedding JSON NULL
) ENGINE=InnoDB;

CREATE FULLTEXT INDEX ft_title_body ON wiki_docs (title, body);
CREATE INDEX idx_created ON wiki_docs (created_at);
CREATE INDEX idx_hidden ON wiki_docs (hidden);

PHP — insert & search

<?php
$pdo = new PDO('mysql:host=localhost;dbname=efmsa;charset=utf8mb4','user','pass', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

function upsert_doc(PDO $pdo, string $title, string $body, string $tags, string $source, ?array $embedding = null) {
    $stmt = $pdo->prepare("INSERT INTO wiki_docs (title, body, tags, source, embedding) VALUES (?,?,?,?,?)");
    $stmt->execute([$title, $body, $tags, $source, $embedding ? json_encode($embedding) : null]);
    return $pdo->lastInsertId();
}

function search_docs(PDO $pdo, string $q): array {
    // Prefer FULLTEXT if terms look suitable; fallback to LIKE
    $stmt = $pdo->prepare("SELECT id, title, LEFT(body, 400) AS snippet, tags, created_at FROM wiki_docs
                           WHERE hidden = 0 AND MATCH(title, body) AGAINST(:q IN NATURAL LANGUAGE MODE)
                           ORDER BY MATCH(title, body) AGAINST(:q IN NATURAL LANGUAGE MODE) DESC LIMIT 20");
    $stmt->execute([':q' => $q]);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    if (!$rows) {
        $stmt2 = $pdo->prepare("SELECT id, title, LEFT(body, 400) AS snippet, tags, created_at FROM wiki_docs
                                WHERE hidden = 0 AND (title LIKE :l OR body LIKE :l) ORDER BY created_at DESC LIMIT 20");
        $stmt2->execute([':l' => '%'.$q.'%']);
        return $stmt2->fetchAll(PDO::FETCH_ASSOC);
    }
    return $rows;
}

Chatbot — Retrieval‑Augmented Generation

For now we build answers from top FULLTEXT results. (Optional upgrade: vector DB for better semantic recall.) The bot cites titles and links to the source pages.

PHP — ask.php (simple RAG)

<?php
require __DIR__.'/vendor/autoload.php';
use OpenAI as OpenAIClient;

$pdo = new PDO('mysql:host=localhost;dbname=efmsa;charset=utf8mb4','user','pass', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
$oi  = OpenAIClient::client(getenv('OPENAI_API_KEY'));

$q = $_POST['q'] ?? '';
$rows = search_docs($pdo, $q);

$ctx = "";
foreach ($rows as $r) {
  $ctx .= "Title: {$r['title']}\nSnippet: {$r['snippet']}\nLink: /wiki.php?id={$r['id']}\n--\n";
  if (strlen($ctx) > 6000) break;
}

$system = 'You answer questions for EFMSA using ONLY the provided context. Cite sources as [Title](/wiki.php?id=ID). If not supported, say you are unsure.';
$res = $oi->chat()->create([
  'model' => 'gpt-4o-mini',
  'messages' => [
    ['role' => 'system', 'content' => $system],
    ['role' => 'user', 'content' => "Question: $q\n\nContext:\n$ctx"]
  ],
  'temperature' => 0.2
]);

echo nl2br(htmlentities($res->choices[0]->message->content));

Scheduling — cron jobs

Every 12 hours: discover new emails/files, (re)index changed items.
Hourly: quick sweep for urgent uploads.
Manual: “Index now” button queues an immediate run with CSRF protection.

Linux — crontab

# m h dom mon dow  command
0 */12 * * * /usr/bin/php /var/www/efmsa/scripts/reindex.php >> /var/log/efmsa-cron.log 2>&1
15 *  * * * /usr/bin/php /var/www/efmsa/scripts/sweep_uploads.php >> /var/log/efmsa-cron.log 2>&1

Security

Auth gate: only logged‑in members can search.
CSRF tokens on manual actions.
Headers: X-Frame-Options, X-Content-Type-Options: nosniff, Referrer-Policy.
Privacy: originals stored internally; paraphrases redact PII.

Performance

Server‑side filtering for fast initial load; AJAX for refinement.
Indexes on created_at and hidden to speed list queries.
Prefer FULLTEXT for scale; fallback to LIKE on small instances.