Unlocking Advanced Search Capabilities with CirrusSearch in MediaWiki
What CirrusSearch Brings to MediaWiki
When you spin up a MediaWiki installation, the default search feels a bit like using a paper‑clip to pry open a steel door. It works, but you quickly realize it’s not built for the kind of precision you’d expect from a modern site. Enter CirrusSearch, the Elasticsearch‑powered engine that swaps the old wobbly door‑stop for a sleek, motorised gate.
In plain English, CirrusSearch indexes the current revision of every page, talk pages included, in a separate Elasticsearch cluster, then serves queries back to MediaWiki via the CirrusSearch extension. The result? Faster results, fuzzy matching, phrase boosting, and, because it talks to Elasticsearch, the whole world of faceted search that you thought only big e‑commerce sites could afford.
Why Bother? A Quick Reality Check
- Full‑text relevance ranking replaces the basic database‑backed matching that MediaWiki's default search relies on.
- Supports prefix queries (“wik*” finds Wiki, Wikipedia, wikidata, etc.) without a custom parser.
- Allows you to filter by namespace, page protection level, or even by the presence of a specific $wgNamespaceIds value.
- Provides rich query DSL – you can script complex boolean logic in a single line.
Sounds fancy? It is. But the real power shows up when you start tweaking the default behavior. Below you’ll find the nuts‑and‑bolts of setting up CirrusSearch, plus a handful of tricks that turn a bland search box into a Swiss‑army knife for editors.
Getting CirrusSearch Up and Running
Step 1 – Pull the Extension
Grab the code straight from the official repository. It’s as simple as a git clone into your extensions folder.
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch.git extensions/CirrusSearch
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Elastica.git extensions/Elastica
Tip: keep the Elastica version in sync; otherwise you’ll end up with a version mismatch error that looks like a cryptic riddle.
Step 2 – Tweak LocalSettings.php
Add the following lines near the bottom of the file. Elastica is a dependency of CirrusSearch, so make sure both extensions are loaded.
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgCirrusSearchServers = [
[ 'host' => 'localhost', 'port' => 9200 ]
];
$wgSearchType = 'CirrusSearch';
Don’t forget to restart your web server after you save. A quick systemctl restart php-fpm (or Apache, whichever you prefer) usually does the trick.
Step 3 – Build the Index
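Run the CirrusSearch maintenance scripts: the first creates the index, the second fills it. Indexing can take a few minutes on a small wiki, hours on a big one. A follow‑up pass of ForceSearchIndex.php with --skipParse adds the link data that --skipLinks leaves out.
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip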
While the script is chugging away, you’ll see a stream of output like “Indexing page 12345”. It’s normal to see some “warning: missing revision” messages; those are harmless, just the system being a bit chatty.
Diving Into Advanced Queries
Now that the engine is humming, let’s talk about how you can coax more out of it. The trickiest part is remembering that you’re no longer limited to the simple “title:foo” syntax. Below are three common scenarios.
1. Namespace‑Specific Search
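Say you want to search only in the “Help” namespace. The search engine's setNamespaces() method lets you do that without fiddling with UI checkboxes.
use MediaWiki\MediaWikiServices;

// setNamespaces() returns nothing, so the calls cannot be chained.
$searchEngine = MediaWikiServices::getInstance()->newSearchEngine();
$searchEngine->setNamespaces( [ NS_HELP ] );
$searchResults = $searchEngine->searchText( 'installation' );
Notice the use of the constant NS_HELP. It reads better than the bare numeric ID (12) and makes the intent obvious at a glance.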
2. Boosting Certain Fields
Sometimes the title matters more than the body, especially for disambiguation pages. You can instruct CirrusSearch to weight titles higher through the $wgCirrusSearchWeights setting in LocalSettings.php.
// Relative field weights applied when the full-text query is built.
$wgCirrusSearchWeights['title'] = 3.0;
$wgCirrusSearchWeights['text'] = 1.0;
This tiny snippet tells the engine: “If a term such as ‘php’ appears in the title, treat it as if it were three times more important than a match in the body.” Check the defaults shipped with your CirrusSearch version before overriding them, since titles may already be weighted well above this. Works like a charm on a wiki that mixes code snippets with prose.
3. Fuzzy Matching for Typos
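Writers misspell words all the time. CirrusSearch includes fuzzy matching out of the box; you just append a tilde to the term in the query string, optionally followed by the maximum edit distance (for example, instalation~1).
use MediaWiki\MediaWikiServices;

// The trailing tilde marks the term as fuzzy.
$searchEngine = MediaWikiServices::getInstance()->newSearchEngine();
$results = $searchEngine->searchText( 'instalation~' );
In raw Elasticsearch queries, the corresponding option is fuzziness. Setting it to ‘AUTO’ lets Elasticsearch decide how many edits are allowed based on the term length. Short words get fewer edits; longer words get more leeway.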
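To make the ‘AUTO’ rule concrete, here is a minimal Python sketch of the default length thresholds Elasticsearch documents for it (exact match up to 2 characters, one edit for 3 to 5, two edits beyond that). The function name auto_fuzziness is made up for illustration; it mirrors the rule and is not part of any CirrusSearch or Elasticsearch API.

```python
def auto_fuzziness(term: str) -> int:
    """Return the maximum edit distance the default AUTO rule allows for a term."""
    length = len(term)
    if length <= 2:
        return 0  # very short terms must match exactly
    if length <= 5:
        return 1  # 3 to 5 characters: one edit allowed
    return 2      # longer terms: two edits allowed


# Show the allowance for a short, a medium, and a long term.
for word in ("go", "wiki", "latency"):
    print(word, auto_fuzziness(word))
```

This is why a typo in a long word like “instalation” is usually forgiven, while a two-letter term has to be spelled exactly.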
Performance Tweaks You Might Not Expect
Even though Elasticsearch is the heavyweight champion, a mis‑configured cluster can still feel sluggish. Below are a couple of “gotchas” that have bitten me before.
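- Shard Count – Don’t run 5 shards per index (the default on older Elasticsearch releases) on a tiny wiki. Each shard consumes memory and CPU. Scale down to 1 or 2 and watch latency drop.
- Refresh Interval – The default 1s refresh can flood the cluster with tiny writes. For a read‑heavy wiki, bump it to 5s or even 30s through $wgCirrusSearchRefreshInterval.
- Replica Number – A single replica is often enough unless you have a massive user base. More replicas = more disk usage without a proportional gain.
Here’s a snippet you can drop into LocalSettings.php to apply these ideas. Elasticsearch 5 and later no longer accept index‑level settings in elasticsearch.yml, so CirrusSearch’s own configuration variables are the place for them.
$wgCirrusSearchShardCount['content'] = 2;
$wgCirrusSearchShardCount['general'] = 2;
$wgCirrusSearchReplicas = '0-1'; // auto-expand between 0 and 1 replicas
$wgCirrusSearchRefreshInterval = 5; // seconds
After editing, run UpdateSearchIndexConfig.php so the settings take effect; a changed shard count only applies once the index has been rebuilt.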
Real‑World Use Cases
It helps to see how organizations actually employ CirrusSearch. Below are three anonymised examples that illustrate its versatility.
Documentation Portal
A university’s internal wiki hosts tens of thousands of pages describing labs, software tools, and policies. By enabling ns:0 filtering, staff can quickly pull up only policy documents, while students benefit from fuzzy matching on course codes.
Open‑Source Project
A large open‑source project migrated to CirrusSearch to allow contributors to locate functions and API references across multiple language versions. They turned on highlight in the query, so matching fragments are bolded directly in the search snippet – a small UX win that saved hours of reading.
Community Knowledge Base
On a gaming community wiki, moderators use a protected:true filter to audit pages that have been locked for vandalism. The filter relies on page protection metadata being indexed, here as a field called is_protected, so confirm that your own index mapping exposes such a field before counting on it.
Getting Your Hands Dirty – Sample Query Playground
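If you’re itching to experiment, spin up a _search request via curl. The following query searches for “network latency” in the “Technical” namespace (assumed here to be a custom namespace with ID 100), boosts titles, and enables fuzzy matching. Replace wiki in the URL with the name of your own index.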
curl -XGET 'http://localhost:9200/wiki/_search' -H 'Content-Type: application/json' -d '
{
"query": {
"bool": {
"must": [
{ "match": { "text": { "query": "network latency", "fuzziness": "AUTO" } } }
],
"should": [
{ "match": { "title": { "query": "network latency", "boost": 3 } } }
],
"filter": [
{ "term": { "namespace": 100 } }
]
}
},
"highlight": {
"fields": { "text": {} }
}
}'
Paste that into your terminal. You’ll see a JSON blob with hits – each hit includes the page ID, title, and highlighted snippets. It’s a bit raw, but once you wrap it in MediaWiki’s Special:Search UI, the same logic powers the search results you see on the front‑end.
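If you’d rather not squint at the raw blob, a few lines of Python can boil it down. This is only a sketch: summarize_hits and sample_response are hypothetical names, and it assumes each hit carries the page ID in _id, the title in _source.title, and snippets under highlight.text, which matches the query above but should be checked against your own index.

```python
from typing import Any, Dict, List


def summarize_hits(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reduce a _search response to page ID, title, and highlighted snippets."""
    summaries = []
    for hit in response.get("hits", {}).get("hits", []):
        summaries.append({
            "page_id": hit.get("_id"),
            "title": hit.get("_source", {}).get("title"),
            # Highlights are keyed by field name; the query asked for "text" only.
            "snippets": hit.get("highlight", {}).get("text", []),
        })
    return summaries


# Hypothetical response, trimmed to the fields read above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "42",
                "_source": {"title": "Network latency"},
                "highlight": {"text": ["reduce <em>network</em> <em>latency</em>"]},
            },
            {"_id": "43", "_source": {"title": "Routing"}},
        ]
    }
}

for row in summarize_hits(sample_response):
    print(row["page_id"], row["title"], row["snippets"])
```

Hits without a highlight section simply get an empty snippet list, so the loop never trips over a missing key.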
Common Pitfalls and How to Dodge Them
- Out‑of‑sync indexes – If search updates are skipped (for example, with $wgDisableSearchUpdate set during bulk imports), run a full reindex afterwards. Otherwise you’ll end up with phantom pages that never surface.
- Missing Elastica – Forgetting to enable the Elastica extension throws a “class not found” error. Double‑check wfLoadExtension( 'Elastica' ); is in place.
- Improper JSON escaping – When you build DSL queries in PHP, use json_encode() rather than manual string concatenation. It prevents hidden syntax errors that only appear at runtime.
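The JSON escaping pitfall is easy to see in a tiny experiment. The sketch below uses Python’s json.dumps as a stand‑in for PHP’s json_encode(); the principle is the same. Both helper names are made up for illustration, and the concatenating one is there purely as the anti‑pattern.

```python
import json


def build_match_query(field: str, user_text: str) -> str:
    """Serialize a match query; the encoder escapes quotes and backslashes."""
    query = {"query": {"match": {field: {"query": user_text}}}}
    return json.dumps(query)


def naive_match_query(field: str, user_text: str) -> str:
    """Anti-pattern: string concatenation with no escaping at all."""
    return '{"query": {"match": {"' + field + '": {"query": "' + user_text + '"}}}}'


tricky = 'say "hello"'

# The encoded body parses back to exactly the text the user typed.
safe_body = build_match_query("text", tricky)
print(json.loads(safe_body)["query"]["match"]["text"]["query"])

# The concatenated body breaks as soon as the input contains a double quote.
try:
    json.loads(naive_match_query("text", tricky))
except json.JSONDecodeError as err:
    print("naive body is not valid JSON:", err.msg)
```

The concatenated version works fine until someone searches for a phrase in quotes, which is exactly the kind of error that only shows up at runtime.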
In my own experience, I once turned on $wgCirrusSearchEnableExtraFeatures without having the cirrussearch-wikibase module installed. The result? A cascade of “undefined field” warnings that cluttered the log. The fix? Comment out that line or install the missing module.
Wrapping Up – The Takeaway
CirrusSearch isn’t just a “nice‑to‑have” plug‑in; it’s a fundamental shift in how MediaWiki handles search. By delegating indexing to Elasticsearch (or its OpenSearch cousin), you unlock fuzzy matching, field boosting, and namespace‑aware filtering – all without rewriting the core codebase.
If you’re already running a modest wiki, the migration cost is relatively low: a few git clones, a tiny LocalSettings.php tweak, and a one‑time reindex. For larger installations, the performance gains and extra query flexibility can be the difference between users finding the right page in seconds or wandering forever.
Bottom line: once CirrusSearch is up, you’ll start noticing the search bar behaving less like a stubborn mule and more like a well‑trained guide dog, leading you straight to the information you need.