Implementing a Custom Search Index in MediaWiki with Elasticsearch

Why a Custom Search Index?

MediaWiki ships with a simple database‑backed search engine that works for small wikis, but it becomes a bottleneck as the amount of content grows. Extension:CirrusSearch replaces the native engine with Elasticsearch, giving you full‑text relevance ranking, fast autocomplete, and the ability to store arbitrary fields alongside each page document. By tapping into the CirrusSearchBuildDocumentParse hook you can augment the default document with your own metadata – the essence of a *custom search index*.

Prerequisites

  • MediaWiki ≥ 1.39 (a long‑term support release).
  • Elasticsearch 7.10.2 or a compatible OpenSearch release. The official Docker image (docker.elastic.co/elasticsearch/elasticsearch:7.10.2) is the simplest way to get a test cluster.
  • PHP extensions: curl and json (required by Extension:Elastica).
  • Access to the job queue (Redis or the default DB queue) – a reliable queue is essential for near‑real‑time indexing.
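
One way to confirm the PHP side of these prerequisites is to compare the output of php -m with the extensions listed above. The helper below is only a sketch: the function name missing_exts is made up for this example, and it checks nothing beyond the names you pass in.

```shell
# Print every required PHP extension that is absent from a `php -m`
# listing supplied on stdin. `missing_exts` is a hypothetical helper name.
missing_exts() {
    local modules ext
    modules="$(cat)"
    for ext in "$@"; do
        # -x matches whole lines, -i ignores case, -q suppresses output
        printf '%s\n' "$modules" | grep -qix "$ext" || printf '%s\n' "$ext"
    done
}

# Usage (prints nothing when both extensions are loaded):
#   php -m | missing_exts curl json
```

Reading the module list from stdin keeps the function independent of which PHP binary you test, which helps when the web server and the CLI use different builds.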

Step 1 – Install Elasticsearch

# Pull and start the official image as a single-node test cluster
docker run -d --name es \
    -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:7.10.2

# Verify it is up
curl -s http://localhost:9200 | jq .cluster_name

Make sure the container is reachable from the MediaWiki host; if they run on separate machines, configure the network and, optionally, TLS according to the CirrusSearch docs.
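
A freshly started container can take a while before it answers requests, so scripts that run the later steps may want to wait for it first. The sketch below polls the standard cluster health endpoint; the function name wait_for_es and the default retry count and delay are illustrative choices, not CirrusSearch settings.

```shell
# Poll the cluster health endpoint until the node answers or we give up.
# Arguments (all optional): base URL, number of attempts, delay in seconds.
wait_for_es() {
    local es_url="${1:-http://localhost:9200}"
    local attempts="${2:-30}"
    local delay="${3:-2}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # "yellow" is the best a single-node cluster with replicas can reach
        if curl -sf --max-time 10 \
            "${es_url}/_cluster/health?wait_for_status=yellow&timeout=5s" > /dev/null; then
            echo "Elasticsearch is reachable at ${es_url}"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "Elasticsearch did not respond at ${es_url}" >&2
    return 1
}

# Usage from the MediaWiki host:
#   wait_for_es http://localhost:9200 && echo "ready to configure CirrusSearch"
```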

Step 2 – Add the required extensions

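Place the extensions in extensions/ and install Composer dependencies. Check out the branch that matches your MediaWiki release (for example REL1_39) rather than the default branch, since extension and core versions are expected to match: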

cd $MW_ROOT/extensions
# Elastica
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/Elastica
cd Elastica && composer install --no-dev

# CirrusSearch
cd ..
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/CirrusSearch
cd CirrusSearch && composer install --no-dev

Enable them in LocalSettings.php:

wfLoadExtension( 'Elastica' );
wfLoadExtension( 'CirrusSearch' );
$wgDisableSearchUpdate = true; // temporarily stop auto‑updates while we bootstrap

Step 3 – Basic CirrusSearch configuration

Tell MediaWiki where the Elasticsearch nodes live and switch the search backend:

$wgCirrusSearchServers = [ '127.0.0.1' ]; // host names or IPs; port 9200 is the default
$wgSearchType = 'CirrusSearch';

Run the index‑creation script:

php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver

Then bootstrap the index (first pass parses page content, second pass creates link data):

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse

Finally enable live updates:

$wgDisableSearchUpdate = false; // in LocalSettings.php, or simply remove the line added in Step 2

Step 4 – Designing a custom document

CirrusSearch stores each page as an Elasticsearch _doc with a set of predefined fields (title, text, headings, categories, etc.). To add your own data you need two things:

  1. A hook that injects the extra fields while the document is being built.
  2. A mapping that tells Elasticsearch how to index those fields (type, analyzer, whether they are searchable, etc.).

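Both are covered in the CirrusSearch hooks documentation. Hook names and signatures have changed between CirrusSearch releases, so check the documentation that ships with the version you installed.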

4.1 Adding fields via CirrusSearchBuildDocumentParse

Create a small extension (e.g. MyCustomSearch) that registers the hook:

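class MyCustomSearchHooks {
    // Signature as listed in the CirrusSearch hooks documentation; verify it against your version.
    public static function onCirrusSearchBuildDocumentParse( \Elastica\Document $doc, Title $title, Content $content, ParserOutput $parserOutput, \CirrusSearch\Connection $connection ) {
        // Example: store the page's edit-protection level.
        // RestrictionStore replaces the deprecated Title::getRestrictions().
        $protect = \MediaWiki\MediaWikiServices::getInstance()
            ->getRestrictionStore()
            ->getRestrictions( $title, 'edit' );
        $doc->set( 'protect_level', $protect ? implode( '|', $protect ) : 'none' );

        // Example: add a numeric field for the number of [[File:...]] links in the wikitext.
        // Only text-based content models expose getText().
        $text = $content instanceof TextContent ? $content->getText() : '';
        $doc->set( 'image_count', (int)preg_match_all( '/\[\[File:/i', $text ) );
        return true;
    }
}
$wgHooks['CirrusSearchBuildDocumentParse'][] = 'MyCustomSearchHooks::onCirrusSearchBuildDocumentParse';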

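Place the file under extensions/MyCustomSearch/MyCustomSearch.php and load it from LocalSettings.php with require_once "$IP/extensions/MyCustomSearch/MyCustomSearch.php";. Loading it with wfLoadExtension( 'MyCustomSearch' ); only works once the extension also has an extension.json manifest that registers the class and the hook.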
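
Before re‑indexing, it can help to preview what the image_count pattern would match on a given page. The sketch below applies the same case‑insensitive [[File: pattern to raw wikitext on stdin; count_file_links is a hypothetical helper name, and like the hook it ignores the Image: alias and any localised namespace names.

```shell
# Count occurrences of "[[File:" (case-insensitive) in wikitext read from stdin.
count_file_links() {
    # grep -o prints each match on its own line; -i mirrors the regex /i flag.
    # tr strips the padding some wc implementations add.
    grep -oi '\[\[File:' | wc -l | tr -d ' '
}

# Two file links on the first line, none on the second.
printf '%s\n' '[[File:A.png]] text [[file:B.jpg|thumb]]' '[[Category:X]]' | count_file_links
# prints 2
```

If the count differs from what you expect for a real page, the pattern is the place to adjust before the value is written into thousands of documents.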

4.2 Defining the mapping

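Elasticsearch needs to know the type of each new field. CirrusSearch reads extra mapping entries from the $wgCirrusSearchExtraFieldsInSearchResults and $wgCirrusSearchExtraIndexSettings variables. Confirm both variable names and their expected structure against the settings documentation shipped with your CirrusSearch version, since configuration keys change between releases. A minimal mapping for the two fields above looks like this: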

$wgCirrusSearchExtraFieldsInSearchResults = [
    'protect_level' => [ 'type' => 'keyword' ],
    'image_count'   => [ 'type' => 'integer' ]
];

$wgCirrusSearchExtraIndexSettings = [
    'mappings' => [
        'properties' => [
            'protect_level' => [ 'type' => 'keyword' ],
            'image_count'   => [ 'type' => 'integer' ]
        ]
    ]
];

After adding the mapping, re‑run the index‑creation script with the --startOver flag so that Elasticsearch creates the new fields. Because --startOver drops the existing index, follow it with a full re‑index (Step 5).
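
To check that the new fields actually reached Elasticsearch, you can read the mapping back from the cluster. The sketch below assumes jq is installed (it is already used in Step 1) and that the index uses typeless Elasticsearch 7 mappings; mapped_type is a hypothetical helper name and INDEX_NAME is a placeholder for whatever index CirrusSearch created for your wiki.

```shell
# Read a `GET /<index>/_mapping` response on stdin and print the type of
# one top-level field, or "missing" if the field is not in the mapping.
mapped_type() {
    jq -r --arg f "$1" '.[].mappings.properties[$f].type // "missing"'
}

# Usage against a live cluster (replace INDEX_NAME with your index):
#   curl -s http://localhost:9200/INDEX_NAME/_mapping | mapped_type image_count
#   curl -s http://localhost:9200/INDEX_NAME/_mapping | mapped_type protect_level
```

A result of "missing" here usually points at the mapping configuration rather than at the hook, which narrows down where to look.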

Step 5 – Re‑indexing the wiki

Whenever you change the document schema you must rebuild the index. On a production wiki the least disruptive way is to build a fresh index next to the live one, populate it, and then switch the alias over:

# Create a new index name (e.g. "wiki_v2")
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver --indexIdentifier wiki_v2

# Populate it
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip --indexIdentifier wiki_v2
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse --indexIdentifier wiki_v2

# Atomically point the alias "wiki" to the new index
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --aliasSwitch wiki_v2

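The two ForceSearchIndex.php passes can be parallelised through the job queue; on large wikis you may want to split the work as described in the bootstrapping guide. The available flags differ between CirrusSearch versions, so confirm them with each script's --help output before running against production.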
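
When these commands are scripted, the important property is that the alias switch never runs if an earlier step failed. A minimal sketch of that guard follows; run_steps is a hypothetical wrapper name, and each step is passed as a single quoted command line.

```shell
# Run each command line in order and stop at the first failure, so a later
# step (such as the alias switch) never runs against a half-built index.
run_steps() {
    local step
    for step in "$@"; do
        echo "==> $step" >&2
        sh -c "$step" || { echo "failed: $step" >&2; return 1; }
    done
}

# Usage, from the MediaWiki root, with the commands from this step:
#   run_steps \
#     "php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver --indexIdentifier wiki_v2" \
#     "php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip --indexIdentifier wiki_v2" \
#     "php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse --indexIdentifier wiki_v2" \
#     "php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --aliasSwitch wiki_v2"
```

Progress messages go to stderr so that the output of the maintenance scripts themselves can still be captured or logged separately.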

Step 6 – Querying the custom fields

Once the index is live you can query the new fields through the normal MediaWiki API or Special:Search. The syntax mirrors other Cirrus filters:

# Find pages that contain more than 10 images (query: image_count:>10, with > percent-encoded as %3E)
https://wiki.example.org/w/api.php?action=query&list=search&srsearch=image_count:%3E10

# Find pages that are protected for "sysop" edits only
https://wiki.example.org/w/api.php?action=query&list=search&srsearch=protect_level:sysop

These queries can also be combined with ordinary full‑text terms, namespace filters, etc.
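
Characters such as > and spaces are not safe inside a URL, so queries built in scripts should be percent‑encoded first. The sketch below does this in plain shell for ASCII input; urlencode and search_url are hypothetical helper names, the host is the example one used above, and format=json is added only to get machine‑readable output. It encodes every reserved character, including the colon, which is stricter than necessary but always safe.

```shell
# Percent-encode a string for use in a URL query (ASCII input assumed).
urlencode() {
    local rest="$1" out="" c
    while [ -n "$rest" ]; do
        c="${rest%"${rest#?}"}"   # first character of what is left
        rest="${rest#?}"
        case "$c" in
            [A-Za-z0-9._~-]) out="$out$c" ;;                 # unreserved: keep
            *) out="$out$(printf '%%%02X' "'$c")" ;;         # everything else: %XX
        esac
    done
    printf '%s\n' "$out"
}

# Build a list=search API URL for a raw search string.
search_url() {
    printf '%s?action=query&list=search&format=json&srsearch=%s\n' \
        "https://wiki.example.org/w/api.php" "$(urlencode "$1")"
}

search_url 'image_count:>10'
# prints https://wiki.example.org/w/api.php?action=query&list=search&format=json&srsearch=image_count%3A%3E10
```

If curl is available, curl -G --data-urlencode "srsearch=image_count:>10" with the other parameters passed the same way performs the encoding for you.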

Step 7 – Advanced customisation

  • Analyzers & tokenizers: If you need custom text processing (e.g. n‑grams for autocomplete or language‑specific stemming) define a custom analyzer in $wgCirrusSearchExtraIndexSettings['settings']['analysis']. See the CirrusSearch configuration page for the exact array structure.
  • Per‑wiki overrides: In multi‑wiki deployments you can set $wgCirrusSearchExtraIndexSettings per wiki in LocalSettings.php or via the wmf-config repository (files ext‑CirrusSearch.php, CirrusSearch‑production.php, etc.).
  • Re‑index only changed pages: After the initial bulk load you can rely on the job queue to keep the index up‑to‑date. If you add a new custom field that can be derived from page text alone, you only need to re‑run the first ForceSearchIndex.php pass (--skipLinks --indexOnSkip).
  • Testing locally: The dockerised test environment ships with a ready‑made Elasticsearch node, making it easy to iterate on mapping changes without touching production.

Step 8 – Troubleshooting common pitfalls

Problem | Typical cause | Fix
Search returns 0 results for a newly added field | Mapping not applied; index was not recreated with --startOver | Delete the index, run UpdateSearchIndexConfig.php --startOver, then re‑index.
Elasticsearch reports "mapper_parsing_exception" or "illegal_argument_exception" on indexing | Field type mismatch (e.g. sending a string to an integer field) | Check the hook code that sets the field; cast to the correct type.
Queries become very slow after adding a large text field | Field was indexed as text and analysed with the default analyzer | Mark the field as keyword or doc_values only, or disable indexing with "index": false.
Job queue keeps failing with "unserialize() error" | Large job payload (common when re‑indexing many pages at once) | Configure the job queue to use Redis ($wgJobTypeConf['default'] = ['class' => 'JobQueueRedis']; along with its Redis server settings) and tune $wgJobRunRate, or run jobs from the command line with maintenance/runJobs.php.

Conclusion

By combining CirrusSearch, Elastica, and the CirrusSearchBuildDocumentParse hook you can turn MediaWiki into a fully‑featured search platform that knows about any piece of metadata you care about – from protection levels to custom taxonomy fields. The workflow is straightforward: install Elasticsearch, enable the extensions, declare extra fields and mappings, hook into the document builder, re‑index, and start querying. With the job queue keeping the index fresh, the custom search index stays in sync with the wiki without manual intervention.

Feel free to adapt the snippets above to your own use‑case, whether you are tracking document types, embedding security‑event logs, or exposing Semantic MediaWiki properties through search. The MediaWiki community maintains extensive documentation on hooks and Elasticsearch configuration, so you can always dive deeper when you need more sophisticated analyzers or cross‑wiki search.
