Mastering MediaWiki API for Automated Content Management
Why automate with MediaWiki’s API?
Picture this: you’ve just finished a massive data dump from a legacy system, and you need to push thousands of rows into a wiki. Manually opening “Edit” for each page? That’s the kind of nightmare that keeps developers up at night, sipping cold coffee while the cursor blinks. The MediaWiki API, however, is like a backstage pass – it lets you skip the front‑row audience and get straight to the action.
Since the 1.35 LTS release, the action API has become more consistent, and the newer REST endpoints add a modern touch. So whether you’re a hobbyist bot‑author or a full‑scale content‑management team, mastering the API is the ticket to turning repetitive edits into a smooth, automated workflow.
Getting your hands dirty: the first request
All right, roll up your sleeves. The most basic thing you can do is a GET to the action=query module. That’ll fetch a page’s raw wikitext, its last revision ID, or even a list of pages that match a certain prefix.
import requests

URL = "https://www.example.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Sandbox",
    "rvprop": "content",
    "rvslots": "main",   # avoids the deprecation warning on modern MediaWiki (1.32+)
    "format": "json"
}
r = requests.get(URL, params=params)
print(r.json()["query"]["pages"])
Never underestimate the power of that tiny snippet – you just pulled the entire content of “Sandbox” into a Python dict. From there the sky’s the limit.
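Remember the remark about listing pages that match a certain prefix? That’s the allpages module – a minimal sketch, reusing the same URL:
# List up to 50 main-namespace pages whose titles start with "Sandbox"
prefix_params = {
    "action": "query",
    "list": "allpages",
    "apprefix": "Sandbox",
    "apnamespace": 0,
    "aplimit": 50,
    "format": "json"
}
resp = requests.get(URL, params=prefix_params)
for page in resp.json()["query"]["allpages"]:
    print(page["title"])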
Tokens, security, and the dreaded CSRF
Before you start blasting action=edit calls, you need a token. Think of the token as a “digital handshake” that proves you’re not a rogue script trying to hijack the wiki. The flow looks like this:
- Log in (or use a bot password).
- Request a csrftoken via action=query&meta=tokens.
- Include that token in every edit request.
Here’s a quick PHP example that logs in with a bot password and grabs the token:
$api = "https://www.example.org/w/api.php";
$login = [
"action" => "login",
"lgname" => "MyBot",
"lgpassword" => "BotPassword123",
"format" => "json"
];
$ch = curl_init($api);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($login));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$loginResponse = json_decode(curl_exec($ch), true);
$token = $loginResponse['login']['result'] === 'Success'
? $loginResponse['login']['token']
: null;
Okay, that snippet is still a bit rough – newer MediaWiki versions will answer with NeedToken until you also send an lgtoken (fetched via meta=tokens&type=login) along with the login POST. Yet it showcases the idea: get the token, store it, use it.
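If you’d rather stay in Python, the full flow with requests looks roughly like this – a sketch that reuses the URL constant from the first snippet and the same placeholder credentials:
def get_csrf_token(username, password):
    sess = requests.Session()
    # Step 1: fetch a login token (action=login insists on one these days)
    login_token = sess.get(URL, params={
        "action": "query", "meta": "tokens", "type": "login", "format": "json"
    }).json()["query"]["tokens"]["logintoken"]
    # Step 2: log in with a bot password created at Special:BotPasswords
    login = sess.post(URL, data={
        "action": "login",
        "lgname": username,
        "lgpassword": password,
        "lgtoken": login_token,
        "format": "json"
    }).json()
    if login["login"]["result"] != "Success":
        raise RuntimeError(f"Login failed: {login['login']}")
    # Step 3: the session now carries the login cookies, so ask for the CSRF token
    csrf = sess.get(URL, params={
        "action": "query", "meta": "tokens", "type": "csrf", "format": "json"
    }).json()["query"]["tokens"]["csrftoken"]
    return sess, csrf

session, token = get_csrf_token("MyBot", "BotPassword123")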
Batch actions: the power of generator and continue
Want to edit a whole category of pages? Use list=categorymembers (or a generator) to pull every page in the category and then loop over them with edit POSTs. Combine it with continue so you can walk past the 500‑results‑per‑request limit.
def batch_edit(category, new_text):
    session = requests.Session()
    # Step 1: get a CSRF token (assumes the session is already logged in – see above)
    token = session.get(
        URL,
        params={"action": "query", "meta": "tokens", "type": "csrf", "format": "json"}
    ).json()["query"]["tokens"]["csrftoken"]
    # Step 2: iterate over pages in the category
    cont = {}
    while True:
        resp = session.get(URL, params={
            "action": "query",
            "list": "categorymembers",
            "cmtitle": f"Category:{category}",
            "cmlimit": "max",
            **cont,
            "format": "json"
        }).json()
        for page in resp["query"]["categorymembers"]:
            edit_resp = session.post(URL, data={
                "action": "edit",
                "title": page["title"],
                "text": new_text,
                "token": token,
                "format": "json"
            })
            print(f"Edited {page['title']}: {edit_resp.json()}")
        if "continue" not in resp:
            break
        cont = resp["continue"]
That function will walk through all members of a category, replace the whole page with new_text, and keep going until the API says “that’s it”. The continue dance is essential – without it you’d only ever process the first batch of results.
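Calling it is then a one‑liner – here with a hypothetical category name:
# Hypothetical category – swap in one that actually exists on your wiki.
batch_edit("Pages to archive", "{{archived}}")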
Rate limits, polite bots, and the “User‑Agent” etiquette
MediaWiki installations often enforce a request‑per‑second ceiling. If you’re hammering a wiki at 1000 req/s you’ll hit a 429 Too Many Requests. The fix? Throttle your script, and set a recognisable User-Agent header. Something like:
headers = {
    "User-Agent": "MyWikiBot/2.0 (https://mydomain.org/bot-info; contact@mydomain.org)"
}
session.get(URL, params=params, headers=headers)
Most wikis respect Wikimedia’s rate‑limit policy. Adding a contact email isn’t just polite – it can save you from being blocked when something goes sideways.
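What does “throttle” mean in practice? A rough sketch that reuses the headers dict above – the one‑second pause and the five‑second fallback are arbitrary choices, not official limits:
import time

def polite_get(session, url, **kwargs):
    """GET with a crude throttle and a retry on 429 responses."""
    while True:
        resp = session.get(url, headers=headers, **kwargs)
        if resp.status_code != 429:
            time.sleep(1)   # pace ourselves at roughly one request per second
            return resp
        # Honour Retry-After if the server sent it, otherwise wait five seconds
        time.sleep(int(resp.headers.get("Retry-After", 5)))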
A quick Python script for “Create‑or‑Update”
One of the most common patterns is “if the page exists, edit it; otherwise, create it”. The API makes this painless because the same edit endpoint works for both; you just have to watch the basetimestamp and starttimestamp parameters. Here’s a compact script that does exactly that:
def upsert_page(title, content):
    sess = requests.Session()
    token = sess.get(
        URL,
        params={"action": "query", "meta": "tokens", "type": "csrf", "format": "json"}
    ).json()["query"]["tokens"]["csrftoken"]
    # Grab the latest revision's timestamp if the page exists
    rev_resp = sess.get(URL, params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": title,
        "format": "json"
    }).json()
    pages = rev_resp["query"]["pages"]
    page = pages[next(iter(pages))]
    cur_rev = page.get("revisions")      # absent when the page doesn't exist yet
    edit_data = {
        "action": "edit",
        "title": title,
        "text": content,
        "token": token,
        "format": "json"
    }
    if cur_rev:
        # Only when updating: the base revision's timestamp lets the API spot conflicts
        edit_data["basetimestamp"] = cur_rev[0]["timestamp"]
    resp = sess.post(URL, data=edit_data)
    result = resp.json()
    print(result)
    return result
Notice the tiny “if cur_rev” guard – it adds a basetimestamp only when you’re truly updating. That little nuance lets the API detect edit conflicts when multiple bots are working side‑by‑side.
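A quick usage check goes a long way: action=edit reports “Success” inside the edit key of its response, so you can tell at a glance whether to log the page or retry it.
result = upsert_page("Sandbox", "== Hello ==\nUpdated by MyWikiBot.")
if result.get("edit", {}).get("result") != "Success":
    print("Something went wrong:", result)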
Beyond the classic API: RESTBase and the new /v1 endpoints
MediaWiki 1.35 introduced a built‑in REST API (served from rest.php), and Wikimedia’s own wikis additionally expose the RESTBase service under /api/rest_v1/. Instead of the old action=query style, you can use endpoints like /v1/page/{title} for reads and a PUT to the same path for writes. The biggest win? No separate token query – if you’ve set up OAuth 2.0, you just include an Authorization: Bearer … header.
Example with curl to fetch a page’s HTML:
curl -H "Accept: application/json" \
"https://www.example.org/api/rest_v1/page/html/Help:Contents"
And to update a page through the core REST API (authentication is required – here sketched with an OAuth 2.0 bearer token and a placeholder revision ID in latest so the server can detect conflicts):
curl -X PUT \
     -H "Authorization: Bearer $OAUTH_TOKEN" \
     -H "Content-Type: application/json" \
     --data '{"source": "New wikitext here", "comment": "Automated update", "latest": {"id": 12345}}' \
     "https://www.example.org/w/rest.php/v1/page/Title"
Switching to the REST endpoints can simplify client code, especially if you already speak JSON APIs in other parts of your stack.
Real‑world tips from the field
- Don’t ignore edit conflicts. Even if you use basetimestamp, it’s wise to catch the editconflict error and retry with the latest revision (a rough retry sketch follows this list).
- Log every request. A simple CSV with timestamp, endpoint, response code, and any error message becomes invaluable when you need to audit bot activity.
- Cache tokens. Tokens are valid for a while (usually a few hours). Requesting a new token on every iteration just adds latency.
- Test on a sandbox. The official https://test.wikidata.org instance is perfect for trying out bulk edits before you point at production.
- Watch the maxlag parameter. Adding maxlag=5 tells the API to refuse the request whenever the database replicas are lagging more than five seconds behind the master. If that happens, the API returns a maxlag error – wait a few seconds and retry.
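Here’s a rough sketch of the first and last tips in code – an edit wrapper that sends maxlag and retries a few times (the retry count and the five‑second pause are arbitrary choices):
import time

def safe_edit(session, edit_data, retries=3):
    # edit_data: the usual action=edit parameters, token included
    for _ in range(retries):
        resp = session.post(URL, data={**edit_data, "maxlag": 5}).json()
        code = resp.get("error", {}).get("code")
        if code == "maxlag":
            time.sleep(5)      # replicas are lagging – back off, then retry
            continue
        if code == "editconflict":
            # Someone edited in between. Here we simply drop basetimestamp and
            # overwrite on the next attempt; a careful bot would re-read and merge.
            edit_data.pop("basetimestamp", None)
            continue
        return resp
    raise RuntimeError(f"Giving up after {retries} attempts: {resp}")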
Putting it all together – a mini‑pipeline
Imagine you have a CSV of product IDs and descriptions that need to land on a wiki. The pipeline would look like this:
- Read the CSV with pandas (or even the plain csv module).
- For each row, construct the page title (e.g., Product:12345).
- Use the upsert_page function above to create or update the page.
- Log success or error to a separate file.
- After the batch, send a summary email to the content team.
All of that can be wrapped in a while True loop that checks the CSV for new rows every hour – a tiny “cron‑style” daemon that keeps your wiki in sync with the source database without any human fingers touching the edit box.
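A minimal sketch of that pipeline, assuming a hypothetical products.csv with id and description columns and reusing the upsert_page function from earlier:
import csv
import time

def sync_products(csv_path="products.csv"):
    # One row per product; "id" and "description" are hypothetical column names.
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open("sync_log.csv", "a", encoding="utf-8") as log:
        for row in csv.DictReader(src):
            title = f"Product:{row['id']}"
            result = upsert_page(title, row["description"])
            status = result.get("edit", {}).get("result", "error")
            log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')},{title},{status}\n")

while True:             # the tiny "cron-style" daemon described above
    sync_products()
    time.sleep(3600)    # look for new rows once an hour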
Final thoughts
Automating MediaWiki content isn’t just about learning a set of HTTP parameters; it’s about embracing the mindset of “treat the wiki as a data store”. When you think of pages as records, revisions as versioned rows, and the API as a CRUD interface, the whole process becomes as familiar as working with any other RESTful service.
Sure, there are quirks – token gymnastics, continuation loops, occasional 503s when the cluster is under load. But those are just the growing pains of a platform that powers everything from Wikipedia to corporate knowledge bases. With a dash of patience, a sprinkle of logging, and a good dose of respectful bot behaviour, you’ll find that the MediaWiki API can turn a mountain of manual edits into a quiet, humming workflow.