Leveraging MediaWiki's API for Automated Content Management

Why the MediaWiki API is a Game‑Changer for Content Ops

Picture this: you’ve just been handed a spreadsheet full of policy updates, product specs, and a handful of “quick‑fix” notes that need to live somewhere on your wiki. Manually copy‑pasting each line feels like watching paint dry, right? That’s the exact spot where MediaWiki’s API slips on its cape and starts doing the heavy lifting.

Sure, the API isn’t brand‑new—its roots go back to the mid‑2000s—yet the toolbox has quietly expanded. From action=query to action=edit, action=login to action=delete, there’s a menu of verbs that let you treat the wiki more like a data store than a stubborn notebook.

In this post I’ll wander through a few real‑world patterns, sprinkle in some code snippets (yes, I love a good curl line), and point out where the API can save you from the dreaded “copy‑paste‑itis.”

Getting Your Hands Dirty: Authentication 101

First things first—any script that talks to a MediaWiki site needs a token. Think of it as the wiki’s version of a security badge. You can fetch a login token with a quick GET:


curl -G "https://yourwiki.org/w/api.php" \
     --data-urlencode "action=query" \
     --data-urlencode "meta=tokens" \
     --data-urlencode "type=login" \
     --data-urlencode "format=json"

The response includes logintoken. The next step is to POST your credentials along with that token. Don’t forget to carry the cookies from that first response into the login request and every request after it—MediaWiki uses them to keep the session alive. I’ve tripped over this a couple of times, so double‑check your cookies.

Once logged in, you’ll need a CSRF token for any write operation. Grab it the same way but change type=csrf. That token is the key that says, “I promise I’m not a bot trying to vandalize the site.” (Okay, a bot can be legit, too.)
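
If you'd rather see the whole dance in one place, here is roughly what it looks like in Python with requests. This is a minimal sketch: the wiki URL, the BotPassword account name, and the password are all placeholders you'd swap for your own.

import requests

API = "https://yourwiki.org/w/api.php"   # placeholder wiki
session = requests.Session()             # the Session keeps MediaWiki's cookies for us

# 1. Fetch a login token
r = session.get(API, params={'action': 'query', 'meta': 'tokens',
                             'type': 'login', 'format': 'json'})
login_token = r.json()['query']['tokens']['logintoken']

# 2. Log in with a BotPassword account (placeholder credentials)
session.post(API, data={'action': 'login', 'lgname': 'MyUser@MyBot',
                        'lgpassword': 'BOT_PASSWORD', 'lgtoken': login_token,
                        'format': 'json'})

# 3. Grab a CSRF token for the write operations that follow
r = session.get(API, params={'action': 'query', 'meta': 'tokens',
                             'type': 'csrf', 'format': 'json'})
csrf_token = r.json()['query']['tokens']['csrftoken']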

Batch Editing: When One Page Isn’t Enough

Imagine you have a backlog of 500 wiki pages that need a new template added. Opening each page in the UI would take hours—maybe days. The API lets you loop over pages like you’d loop over rows in a spreadsheet.

Step 1: Pull a list of titles with list=allpages. Include a limit and a continuation token so you can paginate.


curl -G "https://yourwiki.org/w/api.php" \
     --data-urlencode "action=query" \
     --data-urlencode "list=allpages" \
     --data-urlencode "aplimit=500" \
     --data-urlencode "format=json"

Either way, you end up with a list of page titles. From there, you can fire off an action=edit call for each one. Here’s where the bot parameter shines—if your account has the bot right (a BotPasswords credential with the high‑volume grant covers this), you can flag the request as a bot edit, which keeps it out of the default recent‑changes view and usually comes with higher API limits.


curl -X POST "https://yourwiki.org/w/api.php" \
     --data-urlencode "action=edit" \
     --data-urlencode "title=Page:Example" \
     --data-urlencode $'text={{NewTemplate}}\n{{Old content}}' \
     --data-urlencode "token=YOUR_CSRF_TOKEN" \
     --data-urlencode "bot=1" \
     --data-urlencode "format=json"

Notice the newline after the template. That tiny line break often trips me up; the wiki expects a line break, not a space. (Plain -d would send a literal backslash‑n, which is why the snippet uses Bash’s $'...' quoting to turn \n into a real newline, and --data-urlencode keeps the braces and the token safe.)

Search‑and‑Replace on the Fly

Ever needed to rename a recurring phrase across hundreds of pages? The API doesn’t have a built‑in “find‑replace” function, but you can stitch together list=search + prop=revisions + action=edit to achieve it.

First, spider the wiki for the phrase:


curl -G "https://yourwiki.org/w/api.php" \
     --data-urlencode "action=query" \
     --data-urlencode "list=search" \
     --data-urlencode "srsearch=\"old phrase\"" \
     --data-urlencode "srlimit=500" \
     --data-urlencode "format=json"

For each hit, grab the wikitext, run a simple sed or Python regex replace, then push it back. A sneak peek at the edit loop in Python:


import requests, re

API = "https://yourwiki.org/w/api.php"
session = requests.Session()
# ... login and CSRF-token fetch omitted for brevity; they populate `csrf_token` ...

def edit_page(title, new_text):
    payload = {
        'action': 'edit',
        'title': title,
        'text': new_text,
        'token': csrf_token,
        'format': 'json'
    }
    r = session.post(API, data=payload)
    return r.json()

# Search results fetched earlier stored in `hits`
for hit in hits:
    page = session.get(API, params={'action': 'query',
                                    'prop': 'revisions',
                                    'rvprop': 'content',
                                    'titles': hit['title'],
                                    'format': 'json'}).json()
    # Dig the wikitext out of the nested response (legacy formatversion=1 shape)
    rev = next(iter(page['query']['pages'].values()))['revisions'][0]['*']
    updated = re.sub(r'old phrase', 'new phrase', rev)
    if rev != updated:
        edit_page(hit['title'], updated)

The above script is deliberately verbose; in the real world you’d add error handling, respect rate limits, maybe sprinkle in sleep calls. I learned the hard way that hammering the endpoint without pauses makes the API scream “Too many requests.”
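
If you want a concrete starting point, here is a minimal wrapper around the edit_page function above. The retry count and sleep times are arbitrary, and the error codes it checks ('ratelimited', 'maxlag') are the usual back-off signals the API sends.

import time

def edit_with_retry(title, new_text, retries=3, pause=5):
    """Edit a page, backing off whenever the API reports a rate-limit problem."""
    result = {}
    for attempt in range(retries):
        result = edit_page(title, new_text)
        code = result.get('error', {}).get('code')
        if code in ('ratelimited', 'maxlag'):
            time.sleep(pause * (attempt + 1))  # simple linear backoff, then try again
            continue
        break
    return result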

Exporting Content for Offline Audits

Sometimes you need a snapshot of a set of pages for compliance checks. The action=parse module can spit out HTML or wikitext, but for bulk grabs the generator=allpages combo shines.


curl -G "https://yourwiki.org/w/api.php" \
     --data-urlencode "action=query" \
     --data-urlencode "generator=allpages" \
     --data-urlencode "gaplimit=500" \
     --data-urlencode "prop=revisions" \
     --data-urlencode "rvprop=content" \
     --data-urlencode "format=json"

The JSON payload will nest each page’s wikitext under revisions. Pipe that into jq or a tiny Node script to write files to disk. One thing I keep forgetting: MediaWiki caps these results (500 pages per request for regular accounts, and fewer still when you’re pulling revision content along with them), so loop with continue tokens until you’ve exhausted the list.
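
In Python, that export loop might look something like the sketch below. The export/ directory and the .wiki file naming are just my own conventions, and I've dropped gaplimit to 50 since requesting revision content lowers the per-request ceiling.

import requests
from pathlib import Path

API = "https://yourwiki.org/w/api.php"   # placeholder wiki
session = requests.Session()
out_dir = Path("export")
out_dir.mkdir(exist_ok=True)

params = {'action': 'query', 'generator': 'allpages', 'gaplimit': '50',
          'prop': 'revisions', 'rvprop': 'content', 'format': 'json'}
while True:
    data = session.get(API, params=params).json()
    for page in data.get('query', {}).get('pages', {}).values():
        revs = page.get('revisions')
        if not revs:
            continue                      # content for this page arrives in a later batch
        safe_name = page['title'].replace('/', '_')   # keep slashes out of filenames
        (out_dir / f"{safe_name}.wiki").write_text(revs[0]['*'], encoding='utf-8')
    if 'continue' not in data:
        break
    params.update(data['continue'])       # gapcontinue keeps the walk going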

Hooking Into Webhooks: Real‑Time Updates

MediaWiki can push JSON payloads to an HTTP endpoint whenever a page is edited, deleted, or moved: the EventBus extension emits those core change events, while the EventLogging extension covers custom instrumentation events. Pair either with a tiny Flask endpoint and you’ve got a live feed of content changes that can trigger downstream jobs (e.g., re‑indexing a search engine, notifying a Slack channel).


from flask import Flask, request

app = Flask(__name__)

@app.route('/wiki/webhook', methods=['POST'])
def receive():
    data = request.get_json()
    # Example: only react to edits in your "Policy" namespace
    # (namespace IDs are site-specific; 12 is Help on a stock wiki, so adjust)
    if data.get('event') == 'edit' and data.get('namespace') == 12:
        print(f"Policy page edited: {data['title']}")
        # maybe enqueue a job here
    return '', 204

if __name__ == '__main__':
    app.run(port=8080)

Set the webhook URL in LocalSettings.php via $wgEventLoggingBaseUri. If you’re not comfortable editing core config, the WebHooks extension offers a UI‑friendly way to register callbacks.

Handling Media Files: The upload Module

Automation isn’t limited to text. The action=upload endpoint lets you push images, PDFs, or any file that your wiki permits.

Upload steps in a nutshell:

  1. GET a CSRF token (same as for edits).
  2. POST the file using multipart/form-data with fields filename, file, token, and optionally comment.
  3. Check the JSON response for upload.result == "Success".

curl -X POST "https://yourwiki.org/w/api.php" \
     -F "action=upload" \
     -F "filename=Diagram.png" \
     -F "file=@/path/to/Diagram.png" \
     -F "token=YOUR_CSRF_TOKEN" \
     -F "format=json"

One little snag: the wiki may reject files larger than the $wgMaxUploadSize limit. If you run into that, either bump the limit in LocalSettings.php or switch to the API’s chunked upload support and send the file in pieces.

Rate‑Limiting & Politeness Policies

Even the most robust APIs have etiquette rules. MediaWiki caps how often certain actions can be performed per user and per IP (see $wgRateLimits), and busy wikis will also ask clients to back off via maxlag errors and Retry-After headers. The defaults are fairly generous, but if you’re running a nightly bulk job you’ll want to respect those signals. I've seen scripts that ignore them and end up being blocked for an hour—hard lesson learned.

Best practice: after each batch, sleep for a few seconds, send maxlag=5 with your requests, and back off whenever the API answers with a ratelimited or maxlag error. This tiny habit makes your automation feel like a good neighbor rather than a noisy bulldozer.
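
Here is one sketch of what that politeness can look like, assuming the wiki honors maxlag (wikis with replicated databases do) and sends a Retry-After header when it wants you to back off:

import time
import requests

API = "https://yourwiki.org/w/api.php"   # placeholder wiki
session = requests.Session()

def polite_get(params):
    """GET with maxlag set; sleep and retry when the server asks us to back off."""
    params = dict(params, maxlag=5, format='json')
    while True:
        r = session.get(API, params=params)
        data = r.json()
        if data.get('error', {}).get('code') == 'maxlag':
            retry_after = r.headers.get('Retry-After', '5')
            time.sleep(int(retry_after) if retry_after.isdigit() else 5)
            continue
        return data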

Putting It All Together: A Mini‑Pipeline Example

Below is a sketch of a Bash‑Python hybrid that:

  • Logs in via BotPassword.
  • Fetches all pages in the “Guidelines” namespace.
  • Prepends a disclaimer banner if missing.
  • Uploads a related PDF if it doesn’t exist.

#!/usr/bin/env bash
# Step 1: get login token (save the session cookie—the token is tied to it)
TOKEN=$(curl -sG -c cookies.txt "https://wiki.example.org/w/api.php" \
    --data-urlencode "action=query" \
    --data-urlencode "meta=tokens" \
    --data-urlencode "type=login" \
    --data-urlencode "format=json" | jq -r .query.tokens.logintoken)

# Step 2: login (BotPassword)
curl -s -c cookies.txt -b cookies.txt \
    -X POST "https://wiki.example.org/w/api.php" \
    -d "action=login" \
    -d "lgname=BotUser" \
    -d "lgpassword=YOUR_BOT_PASSWORD" \
    -d "lgtoken=$TOKEN" \
    -d "format=json"

# Step 3: get CSRF token
CSRF=$(curl -sG "https://wiki.example.org/w/api.php" \
    -b cookies.txt \
    --data-urlencode "action=query" \
    --data-urlencode "meta=tokens" \
    --data-urlencode "type=csrf" \
    --data-urlencode "format=json" | jq -r .query.tokens.csrftoken)

# Hand off to Python for the heavy lifting (it reuses the session in cookies.txt)
python3 process_guidelines.py "$CSRF"

And the companion Python script (process_guidelines.py) does the page‑level work:


import sys, requests
from http.cookiejar import MozillaCookieJar

API = "https://wiki.example.org/w/api.php"
session = requests.Session()
session.headers.update({'User-Agent': 'GuidelineBot/1.0'})

# Reuse the login session curl saved to cookies.txt. (Heads-up: curl writes
# HttpOnly cookies with a "#HttpOnly_" prefix that MozillaCookieJar skips, so
# strip that prefix from the file if your session cookie goes missing.)
cookie_jar = MozillaCookieJar('cookies.txt')
cookie_jar.load(ignore_discard=True, ignore_expires=True)
session.cookies = cookie_jar

csrf_token = sys.argv[1]

def get_pages():
    params = {
        'action': 'query',
        'list': 'allpages',
        'apnamespace': '14',   # 14 is Category on a stock wiki—swap in your "Guidelines" namespace ID
        'aplimit': '500',
        'format': 'json'
    }
    r = session.get(API, params=params)
    return r.json()['query']['allpages']

def fetch_content(title):
    params = {
        'action': 'query',
        'prop': 'revisions',
        'rvprop': 'content',
        'titles': title,
        'format': 'json'
    }
    r = session.get(API, params=params)
    pages = r.json()['query']['pages']
    return next(iter(pages.values()))['revisions'][0]['*']

def save_page(title, text):
    data = {
        'action': 'edit',
        'title': title,
        'text': text,
        'token': csrf_token,
        'format': 'json',
        'bot': 1
    }
    r = session.post(API, data=data)
    return r.json()

banner = "{{Disclaimer|This page is for internal use only.}}\n"

for page in get_pages():
    wikitext = fetch_content(page['title'])
    if not wikitext.startswith(banner):
        new_text = banner + wikitext
        save_page(page['title'], new_text)

# Upload the PDF if it's missing (a titles=File:Guidelines.pdf existence check
# would be more precise; a quick search is good enough for this sketch)
pdf_check = session.get(API, params={
    'action': 'query',
    'list': 'search',
    'srsearch': 'file:Guidelines.pdf',
    'format': 'json'
}).json()

if not pdf_check['query']['search']:
    # Stream the PDF as multipart/form-data, closing the handle when done
    with open('Guidelines.pdf', 'rb') as fh:
        upload = session.post(API, data={
            'action': 'upload',
            'filename': 'Guidelines.pdf',
            'token': csrf_token,
            'format': 'json'
        }, files={'file': fh})
    print('Uploaded PDF:', upload.json())

Notice the occasional “if not …” line that feels a touch informal—yeah, I left it that way. It reads like a conversation with the code, and that’s exactly the vibe you want when you’re debugging at 2 a.m.

Wrapping Up

Automation isn’t a silver bullet; you still need to think about content quality, review cycles, and governance. Yet the MediaWiki API hands you the levers to knit together bots, CI pipelines, and analytics dashboards without ever opening the web UI.

If you’ve been manually updating pages, consider the cost of “click fatigue.” A modest script can shave minutes—or hours—off your week. And if you’ve already got scripts, maybe give them a quick audit: are you still using the old action=login flow? Could you switch to BotPasswords for better security? These little upgrades keep your automation fresh and less likely to break when the wiki gets a core upgrade.

So next time you stare at that endless list of “to‑do” items on your wiki, remember there’s an endpoint waiting to be called. One curl, one token, and a dash of Python, and you’ve turned a tedious chore into a repeatable, auditable process.
