Improving MediaWiki Search Using Machine Learning

Why machine‑learning belongs in MediaWiki search

Picture this: you type “history of the SpaceX launchpad” and the wiki throws back a half‑baked list that barely scratches the surface. It happens more often than you’d think, especially on big wikis where the built‑in full‑text engine struggles with synonyms, typos, and context.

What the classic search does (and where it trips)

  • Full‑text token matching – great for exact words, terrible for “rocket” vs. “launch vehicle”.
  • Stemming only – “launches” becomes “launch”, but no amount of stemming maps “rocket” to “launch vehicle”.
  • Weak relevance ranking – a page that mentions “NASA” once can outrank a page that devotes a whole section to it.

Enter machine learning (ML)

At its core, an ML‑enhanced search layer learns to guess what you really meant. It can:

  1. Spot semantic neighbours (e.g., “rocket” ≈ “launch vehicle”).
  2. Prioritise pages with higher topic density – meaning more occurrences of the concept, not just a stray mention.
  3. Adjust on‑the‑fly when users click “Did you mean…?” or scroll past irrelevant hits.
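To make point 1 concrete, here’s a toy sketch of how embedding similarity surfaces semantic neighbours. The three‑dimensional vectors are invented for illustration – a real model such as all-MiniLM-L6-v2 produces 384‑dimensional ones – but the cosine‑similarity math is the same:

```python
from math import sqrt

def cos_sim(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy 3-d "embeddings" (invented for illustration only)
vectors = {
    "rocket":         [0.9, 0.1, 0.0],
    "launch vehicle": [0.8, 0.2, 0.1],
    "teapot":         [0.0, 0.1, 0.9],
}

query = vectors["rocket"]
for term, vec in vectors.items():
    print(f"{term}: {cos_sim(query, vec):.2f}")
```

“rocket” and “launch vehicle” score close to 1.0 while “teapot” scores near 0 – exactly the signal token matching can’t give you.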

How to stitch an ML model into MediaWiki

Below is a rough outline – don’t copy‑paste it without testing, but it gives you a road map.


// LocalSettings.php: route search queries through a Python micro‑service
$wgHooks['SearchResultProvide'][] = function ( $searchResultSet ) {
    $query = $searchResultSet->getSearchTerm();
    $mlResponse = file_get_contents(
        'http://localhost:8000/rank?query=' . urlencode( $query )
    );
    if ( $mlResponse === false ) {
        return true; // ML service unreachable – keep the default results
    }
    $rankedTitles = json_decode( $mlResponse, true );
    if ( !is_array( $rankedTitles ) ) {
        return true; // malformed response – keep the default ranking
    }
    // Replace original results with ML‑ranked titles
    $searchResultSet->clearResults();
    foreach ( $rankedTitles as $titleText ) {
        $title = Title::newFromText( $titleText );
        if ( $title !== null ) {
            $searchResultSet->addResult( $title );
        }
    }
    return true;
};

And the tiny Python service that does the heavy lifting (using sentence‑transformers embeddings):


import uvicorn
from fastapi import FastAPI, Query
from sentence_transformers import SentenceTransformer, util
import mwclient  # to fetch page texts

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

def fetch_wiki_texts():
    # Very naive: grab ~1k pages via mwclient, cache as (title, text) pairs
    ...

@app.get("/rank")
def rank(query: str = Query(...)):
    query_vec = model.encode(query, convert_to_tensor=True)
    pages = fetch_wiki_texts()  # list of (title, text) tuples
    scores = [
        util.cos_sim(query_vec, model.encode(text, convert_to_tensor=True)).item()
        for _, text in pages
    ]
    ranked = [title for _, (title, _) in sorted(zip(scores, pages), reverse=True)]
    return ranked[:20]  # FastAPI serialises the list to JSON itself

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

Practical tips & pitfalls

  • Cache embeddings. Computing BERT vectors on‑the‑fly kills latency.
  • Don’t forget to re‑index when you add new pages – otherwise the model can’t see fresh content.
  • Watch out for “pop‑culture drift”. A sudden meme can flood the index; a quick filter prevents nonsense results.
  • Logging user clicks (did‑you‑mean hits) helps the model adapt – treat it like a tiny reinforcement loop.
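The caching tip can be as simple as memoising vectors by page title and revision ID, so a page is only re‑encoded when it actually changes. In this sketch `encode` is a stand‑in for a real model.encode call, and the titles and revision IDs are invented:

```python
# Minimal embedding cache keyed by (title, revision id)
cache = {}

def encode(text):
    # Placeholder "embedding": crude character counts.
    # A real model returns a dense float vector instead.
    return [len(text), text.count(" ")]

def get_embedding(title, rev_id, text):
    key = (title, rev_id)
    if key not in cache:
        cache[key] = encode(text)  # computed only once per revision
    return cache[key]

# Same revision: the second call is a cache hit
v1 = get_embedding("Apollo 11", 42, "First crewed Moon landing")
v2 = get_embedding("Apollo 11", 42, "First crewed Moon landing")
# New revision: the page is re-encoded
v3 = get_embedding("Apollo 11", 43, "First crewed Moon landing, 1969")
```

Keying on the revision ID also solves the re‑indexing pitfall above: an edited page gets a new revision, so its stale vector is never reused.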

Where to look next

If you’re already comfortable with the basics, consider:

  1. Fine‑tuning a multilingual model so language‑specific wikis get better recall.
  2. Adding a “boost” field in page metadata for “official” docs – it’s a simple trick but the impact feels like night‑and‑day.
  3. Experimenting with neural rerankers that re‑score the top‑50 Lucene hits instead of rebuilding the whole index.
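The reranker idea in point 3 boils down to a few lines: take the titles Lucene already returned, re‑score them, and sort. The overlap scorer below is a deliberately dumb stand‑in (a real setup would plug in a cross‑encoder model), and the hit list is invented:

```python
def rerank(query, candidates, score_fn, top_k=5):
    # Re-score only the candidates the existing engine returned, then sort
    scored = sorted(candidates, key=lambda title: score_fn(query, title), reverse=True)
    return scored[:top_k]

# Stand-in scorer: token overlap (swap in a cross-encoder for real use)
def overlap_score(query, title):
    q, t = set(query.lower().split()), set(title.lower().split())
    return len(q & t)

hits = ["Launch pad", "Rocket engine", "History of the launch pad", "Teapot"]
print(rerank("history of the launch pad", hits, overlap_score, top_k=2))
# → ['History of the launch pad', 'Launch pad']
```

Because you never touch the index itself, this is the cheapest way to trial ML ranking in production: the reranker can be toggled off instantly if it misbehaves.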

Bottom line? You don’t need a PhD in AI to give MediaWiki a smarter search brain. A modest Python service, a few tweaks in LocalSettings.php, and some patience to tune the model can turn a clunky list into a genuinely helpful navigator. Give it a whirl – you might be surprised by how quickly your community starts finding the right pages without the usual “I’m looking for…?” friction.
