Advanced Search with CirrusSearch Extension in MediaWiki
What is CirrusSearch?
If you’ve ever tried to hunt for a specific phrase on a wiki with a gazillion pages, you know the built‑in search can feel like looking for a needle in a haystack. CirrusSearch changes that story. It hooks MediaWiki up to Elasticsearch (soon OpenSearch) and turns the search engine into something that actually understands relevance, proximity, and fuzzy matching.
Why bother with “Advanced Search”?
There’s an AdvancedSearch extension that sits on top of Special:Search. It adds a form where you can pick namespaces, set a date range, or toggle case‑sensitivity. On its own it’s handy, but when you pair it with CirrusSearch you unlock a whole suite of hidden parameters that ordinary users never see. Think of it as the difference between a basic screwdriver and a multi‑bit power driver: both turn screws, but the latter does it faster, cleaner, and with fewer mistakes.
Getting CirrusSearch up and running
First things first: you need a working MediaWiki installation. Then:
# In a shell, from the MediaWiki root
composer require mediawiki/cirrussearch
// In LocalSettings.php
// CirrusSearch depends on the Elastica extension, so load that first
wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgSearchType = 'CirrusSearch';
// Minimal Elasticsearch config
$wgCirrusSearchServers = [ [ 'host' => 'localhost', 'port' => 9200 ] ];
That’s it. In practice you’ll want to tweak a few more settings – class‑name prefixes, index names, perhaps a connection timeout – but the snippet above gets the engine talking.
Installing the AdvancedSearch front‑end
Grab the extension, slap it into extensions/AdvancedSearch, and add a single line to LocalSettings.php:
wfLoadExtension( 'AdvancedSearch' );
Now Special:Search shows a collapsible “Advanced options” panel. That panel merely passes query arguments to the back‑end; CirrusSearch reads them, interprets them, and does the heavy lifting.
Key parameters you’ll see
- ns: one or more namespace IDs, comma‑separated.
- profile: a search profile such as default, strict, or autocomplete.
- prefix: restrict matches to terms that start with the given string.
- regex: a regular expression filter (dangerous, use with care).
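To see how these parameters end up in a request, here is a minimal Python sketch that assembles a search URL from them. The helper name build_search_url is hypothetical, the base URL is the example host used elsewhere in this article, and the parameter names are taken from the list above rather than checked against a particular MediaWiki version.

```python
from urllib.parse import urlencode


def build_search_url(base_url, term, namespaces=(), profile=None, prefix=None):
    """Build a search URL from the parameters described in the list above.

    The parameter names (ns, profile, prefix) follow the article's list;
    treat them as assumptions and check them against your own wiki.
    """
    params = {"search": term}
    if namespaces:
        # Namespace IDs are joined with commas; urlencode escapes them as %2C.
        params["ns"] = ",".join(str(n) for n in namespaces)
    if profile:
        params["profile"] = profile
    if prefix:
        params["prefix"] = prefix
    return f"{base_url}?{urlencode(params)}"


url = build_search_url(
    "https://wiki.example.com/w/index.php",
    "foo",
    namespaces=[0, 12],
    profile="strict",
)
print(url)
```

Building the query string with urlencode rather than by hand keeps the escaping of commas and spaces consistent.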
Beyond the UI – raw CirrusSearch query syntax
Whenever you submit a search, CirrusSearch translates the URL into an Elasticsearch query DSL. If you’re comfortable with JSON, you can craft your own queries and feed them through the cirrussearch-query API endpoint. For example, to find pages that contain the exact phrase “climate change” but not the word “denial”, you could POST the following:
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "text": "climate change" } }
      ],
      "must_not": [
        { "match": { "text": "denial" } }
      ]
    }
  },
  "highlight": {
    "fields": { "text": {} }
  }
}
It’s a mouthful, but the power is undeniable. You can filter by page creation date, boost certain domains, or even limit results to a specific language.
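If you would rather generate that body than type it by hand, a small Python sketch along these lines can help. The function name phrase_without_word is hypothetical, and the field name text is simply copied from the example above. The sketch only builds and prints the JSON; it does not send it anywhere.

```python
import json


def phrase_without_word(phrase, excluded, field="text"):
    """Return a bool query: must match the phrase, must not match the word."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {field: phrase}}],
                "must_not": [{"match": {field: excluded}}],
            }
        },
        # Ask for highlighted snippets from the same field.
        "highlight": {"fields": {field: {}}},
    }


body = phrase_without_word("climate change", "denial")
print(json.dumps(body, indent=2))
```

Keeping the query as a plain dictionary until the last moment makes it easy to add further must or must_not clauses before serializing.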
Practical examples you can copy‑paste
1. Find all pages in the “Help” namespace that were edited after 2022‑01‑01
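$params = [
    'search' => '',
    'ns' => 12, // Help namespace
    'cirrusSearchBoostTemplates' => false,
    'profile' => 'strict',
    'date' => '20220101..', // open‑ended range
];
// Wrap the parameters in a fake request, run it through the API, and read the result
$api = new \MediaWiki\Api\ApiMain( new \MediaWiki\Request\FauxRequest( $params ) );
$api->execute();
$result = $api->getResult()->getResultData();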
2. Use fuzzy matching for a misspelled name
Append ~2 after the term to allow two edits (insert, delete, substitute). The UI doesn’t expose this directly, but you can add it to the search field yourself:
Jon~2
Will pull up “John”, “Jonas”, “Joon” – whatever is within two Levenshtein steps.
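To get a feel for what “within two Levenshtein steps” means, here is a small Python sketch of the classic edit-distance calculation. It is a plain Levenshtein implementation for illustration only, not the search engine's own code, and the sample names are just the ones mentioned above plus one that falls outside the limit.

```python
def levenshtein(a, b):
    """Count the single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


names = ["John", "Jonas", "Joon", "Jonathan"]
matches = [n for n in names if levenshtein("Jon", n) <= 2]
print(matches)
```

A name like “Jonathan” needs five inserts, so it stays out of a ~2 search.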
3. Exclude all talk pages from results
Talk namespaces have odd IDs (1, 3, 5, and so on), so set the ns parameter to the even‑numbered namespaces only. In URL form it looks like:
https://wiki.example.com/w/index.php?search=foo&ns=0%2C2%2C4%2C6%2C8%2C10%2C12%2C14%2C100
Performance considerations
CirrusSearch is fast, but it’s only as good as the underlying Elasticsearch cluster. A few points to remember:
- Shard sizing: Too many shards for a small index cause unnecessary overhead. Start with a single shard per node and monitor.
- Refresh interval: The default 1‑second refresh can be aggressive for a heavily edited wiki. Raising it to 5s can reduce index write load.
- Memory: Elasticsearch likes RAM. Allocate about half the machine’s memory to the JVM heap and leave the rest for the filesystem cache, but don’t exceed 30 GB, which keeps the heap safely under the “compressed oops” threshold.
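The memory guideline boils down to a one-line calculation. The sketch below is only a rule-of-thumb helper: the function name suggested_heap_gb is hypothetical, and the 30 GB cap is the figure quoted in the list above.

```python
def suggested_heap_gb(total_ram_gb, cap_gb=30.0):
    """Rule of thumb: half the machine's RAM, capped at cap_gb."""
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb / 2, cap_gb)


for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {suggested_heap_gb(ram):g} GB heap")
```

On a 128 GB machine the cap wins, so the heap stays at 30 GB and the rest is left to the operating system.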
And a friendly reminder: after massive imports or a bulk edit spree, run php extensions/CirrusSearch/maintenance/ForceSearchIndex.php to catch up.
Common pitfalls and how to avoid them
1. “My searches are returning nothing!” – Often this means the index is out of sync. Check the cirrussearch-index-status page or run the rebuild script.
2. “Fuzzy search is too permissive.” – The ~ operator defaults to a fuzziness of 2, but you can tighten it by appending ~1 or using the fuzzy_max_expansions parameter in the JSON DSL.
3. “The advanced form shows namespaces I don’t want.” – Tweak $wgAdvancedSearchNamespaces in LocalSettings.php to restrict the list.
Future directions – OpenSearch migration
MediaWiki’s developers have announced a shift from Elasticsearch to OpenSearch. From a user’s perspective, nothing dramatic changes – the API stays the same, the UI stays the same. Under the hood you’ll get a more community‑driven backend, regular security patches, and better compatibility with AWS‑hosted services. Keep an eye on the extension page for migration guides.
Wrapping up
Advanced search in MediaWiki isn’t a luxury; it’s a necessity when you’ve got a knowledge base that rivals an encyclopedia. By pairing the AdvancedSearch UI with the raw power of CirrusSearch, you give editors and readers alike a tool that feels both familiar and surprisingly precise. Install the extensions, tweak a few settings, and you’ll notice the difference before the first search even finishes loading.
Remember: the real magic lies not in the fancy UI but in the underlying query DSL. If you’re comfortable with JSON, go ahead and experiment – the surface is slick, but the engine is a beast you can tame. And when it finally runs smoothly, you’ll wonder how you ever lived without it.