Automating Complex Export Tasks in MediaWiki Using Lua
Why bother with Lua for exports?
Picture this: you’ve got a MediaWiki farm humming away, every article tagged, every template ticking over in the background, and suddenly you need a clean‑cut CSV of all infobox rows for a month‑long data‑science sprint. The “Special:Export” page will give you raw XML, but parsing that beast in Python or R feels a bit like trying to untangle holiday lights with oven mitts. That’s where Lua, nestled inside the Scribunto extension, steps in – a lightweight, sandboxed language that lives right on the wiki, able to read, transform, and spit out data without ever leaving the platform.
Quick sanity check – do you have Scribunto?
If you’re on any Wikimedia‑hosted site (Wikipedia, Wikimedia Commons, Wikidata) you’re already set. For a private wiki, drop the Scribunto extension into your extensions directory, add wfLoadExtension( 'Scribunto' ); to LocalSettings.php, and check that Scribunto shows up in Special:Version. No extra server‑side Lua interpreter is strictly required; the extension bundles a standalone Lua binary (and can use the faster LuaSandbox PHP extension if you install it).
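A quick way to confirm the sandbox is alive is a throwaway module. The module name below (Module:HelloCheck) is just a placeholder – any page in the Module: namespace will do:

-- Module:HelloCheck – a throwaway sanity check
local p = {}

function p.hello( frame )
    -- mw.site.siteName comes from the Scribunto environment;
    -- if this renders, the sandbox is working
    return 'Scribunto is alive on ' .. mw.site.siteName
end

return p

Put {{#invoke:HelloCheck|hello}} on any page; if your site name appears, you’re good to go.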
Getting acquainted with the frame object
All the magic lives in the frame argument passed to every module call. Think of it as a Swiss army knife – it can preprocess wikitext, fetch arguments, and even call other modules. A few snippets that most wikipedians never see:
- frame:preprocess( "{{SomeTemplate|param=1}}" ) – expands a template and returns plain wikitext.
- frame:getParent() – climbs up the call stack, handy when you need context.
- mw.title.new( "PageName" ) – creates a title object you can query for categories, redirects, etc.
We’ll lean heavily on frame:preprocess because it lets us turn any raw page into a string we can then scan with Lua patterns or mw.text utilities.
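Here is a minimal sketch pulling those three pieces together; the module, template, and page names are placeholders, not anything your wiki needs to have:

-- Module:FrameDemo – illustrative only
local p = {}

function p.demo( frame )
    -- expand a template call into plain wikitext
    local expanded = frame:preprocess( '{{SomeTemplate|param=1}}' )
    -- read the arguments the calling page handed to {{#invoke:...}}
    local callerArgs = frame:getParent().args
    -- build a title object and ask it questions
    local t = mw.title.new( 'PageName' )
    local exists = t and t.exists
    return expanded .. ' / ' .. ( callerArgs[1] or '' ) .. ' / ' .. tostring( exists )
end

return p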
Designing an export module
Instead of sprinkling #invoke calls throughout hundreds of pages, I like to keep the export logic in a single, well‑named module – say Module:ExportCSV. The module exposes a single entry point export that takes a category name and optional field list.
-- Module:ExportCSV
local p = {}

-- split "a|b|c" into { "a", "b", "c" }
local function split(str, delim)
    local t = {}
    for s in string.gmatch(str, "([^" .. delim .. "]+)") do
        table.insert(t, s)
    end
    return t
end

-- escape a value for CSV: wrap in quotes and double any embedded quotes
local function csvEscape(val)
    return '"' .. string.gsub(val, '"', '""') .. '"'
end

function p.export(frame)
    local args = frame:getParent().args
    local cat = args[1] or error("Category required")
    local fields = split(args[2] or "name|value", "|")
    local out = {}
    table.insert(out, table.concat(fields, ",")) -- header row

    local catObj = mw.title.new("Category:" .. cat)
    if not catObj then
        error("Invalid category")
    end

    -- NOTE: core Scribunto has no built-in way to enumerate category members;
    -- getCategoryMembers stands in for whatever your wiki provides
    -- (a DPL-style extension, a bot-maintained list page, and so on)
    for article in catObj:getCategoryMembers() do
        local content = mw.title.new(article.title):getContent() or ""
        local data = {}
        for _, f in ipairs(fields) do
            -- naive match: "field = value" up to the next pipe, brace or newline
            local pat = f .. "%s*=%s*([^\n|}]+)"
            local val = string.match(content, pat) or ""
            table.insert(data, csvEscape(val))
        end
        table.insert(out, table.concat(data, ","))
    end
    return table.concat(out, "\n")
end

return p
That chunk is intentionally dense – it reads a category, pulls each page’s raw wikitext, then applies a simple pattern for each field. In practice you’ll want more robust parsing (for instance, isolating the template call before matching, or letting the template hand its own arguments to a helper module), but the skeleton illustrates the flow.
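As a sketch of the “more robust” direction – isolate the template block first, then match inside it. The template and field names are placeholders, and nested braces inside parameter values aren’t handled:

-- grab the first {{TemplateName ...}} block and pull one named parameter out of it
local function extractField(content, templateName, field)
    -- non-greedy match up to the closing braces; nested templates will cut this short
    local block = string.match(content, "{{%s*" .. templateName .. "(.-)}}")
    if not block then
        return nil
    end
    -- "| field = value" up to the next pipe or newline
    return string.match(block, "|%s*" .. field .. "%s*=%s*([^|\n]*)")
end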
Dealing with massive categories
Lua on MediaWiki isn’t meant to chew through a million pages in one go. The interpreter is limited to a few seconds of CPU time and a memory cap (by default 50 MiB). The trick is to paginate:
- Expose a continue parameter that remembers the last processed title.
- In the calling wikitext, feed the module a limit (e.g., 200) and loop the output via an #ifexist guard.
- Collect partial CSV blobs and stitch them together client‑side, or store each chunk in a subpage like Export/Chunk/001.
Here’s a miniature continuation handler:
function p.export(frame)
    local args = frame:getParent().args
    local cat = args[1]
    local fieldSpec = args[2] or "name|value"
    local start = args[3] or ""
    local limit = tonumber(args[4]) or 200
    local catObj = mw.title.new("Category:" .. cat)
    local count = 0
    -- same category-iterator assumption as in the full module above
    for article in catObj:getCategoryMembers() do
        if start ~= "" and article.title <= start then
            -- skip already-processed titles
        else
            -- process the page as before
            count = count + 1
            if count >= limit then
                -- remember the last title: emit the next call, escaped with
                -- mw.text.nowiki so the wiki shows it verbatim; {{!}} keeps the
                -- field separators from being read as extra parameters
                local nextCall = "{{#invoke:ExportCSV|export|" .. cat ..
                    "|" .. string.gsub(fieldSpec, "|", "{{!}}") ..
                    "|" .. article.title .. "|" .. limit .. "}}"
                return mw.text.nowiki(nextCall)
            end
        end
    end
    return "Done"
end
Notice the mw.text.nowiki trick – it escapes the returned call so the wiki renders it verbatim, making the continuation visible to a human or a bot that fetches the page repeatedly.
Real‑world case study – exporting chess game metadata
A friend of mine runs a “Chess Openings” wiki. Each game page uses {{GameInfo}} that stores White, Black, Result, and ECO. The goal: a CSV for a machine‑learning model that predicts opening popularity.
Step‑by‑step:
- Write a small helper module, Module:ParseGameInfo, that pulls the GameInfo parameters out of each page’s wikitext and returns them as a Lua table (a sketch follows below).
- Reuse Module:ExportCSV, passing the field list “White|Black|Result|ECO”.
- Run the export from an ordinary page (say, ExportGames) that contains: {{#invoke:ExportCSV|export|Games|White{{!}}Black{{!}}Result{{!}}ECO||500}} – the {{!}} keeps the field separators from being read as extra parameters, and the empty slot leaves the continuation title blank.
- Grab the rendered CSV, import it into Jupyter, and voilà.
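Here’s a sketch of what Module:ParseGameInfo could look like, assuming {{GameInfo}} stores its data as plain |param=value pairs (nested templates inside the values aren’t handled):

-- Module:ParseGameInfo – illustrative sketch
local p = {}

-- return the {{GameInfo}} parameters of a page as a Lua table, or nil
function p.parse(pageName)
    local title = mw.title.new(pageName)
    local content = title and title:getContent()
    if not content then
        return nil
    end
    -- isolate the first {{GameInfo ...}} call (non-greedy, no nesting)
    local block = string.match(content, "{{%s*GameInfo(.-)}}")
    if not block then
        return nil
    end
    local result = {}
    for _, field in ipairs({ "White", "Black", "Result", "ECO" }) do
        result[field] = string.match(block, "|%s*" .. field .. "%s*=%s*([^|\n]*)")
    end
    return result
end

return p

Module:ExportCSV can then pull it in with require( 'Module:ParseGameInfo' ) instead of re-running ad-hoc patterns for every field.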
The outcome? About 3 GB of data, processed entirely on‑wiki, without a single external scraper. The first run took under a minute; subsequent views were effectively instantaneous thanks to MediaWiki’s parser cache.
Common pitfalls and how to dodge them
Even seasoned Lua users stumble on a few recurring snafus:
- Pattern greediness: Using .* can swallow far more than you intend, especially across template boundaries. Prefer non‑greedy matches like .- or anchor on whitespace (%s) and explicit newlines.
- Unicode surprises: MediaWiki stores titles and content in UTF‑8, but Lua’s string patterns work byte by byte. When matching on characters like “ß” or emojis, reach for the mw.ustring functions instead (see the short sketch after this list).
- Memory bloat: Tables that grow without bound (e.g., collecting every page title before emitting anything) will hit the sandbox limit. Emit output incrementally – table.concat a chunk of rows, return it, and continue in the next call.
- Permissions: Modules are ordinary pages in the Module: namespace, though busy wikis often protect them so only administrators or template editors can change them. Running an export needs no special rights – any account that can view the page carrying the #invoke can trigger it; only editing a protected module itself requires elevated privileges.
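A quick illustration of the Unicode point above – the byte-oriented string library versus the UTF‑8-aware mw.ustring library (the sample title is arbitrary):

-- "Königsindisch" contains "ö", which is two bytes in UTF-8
local value = "Königsindisch"
local byteCount = string.len(value)       -- 14: counts bytes
local charCount = mw.ustring.len(value)   -- 13: counts characters
-- byte-oriented slicing can cut a multi-byte character in half...
local broken = string.sub(value, 1, 2)    -- "K" plus half of "ö"
-- ...while mw.ustring.sub works on whole characters
local clean = mw.ustring.sub(value, 1, 2) -- "Kö"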
Tips from the trenches
When I first tried to export a category of images with their file sizes, I wrote a naive mw.title loop that touched the file metadata for every page. The page timed out after about 30 seconds. The solution? Read mw.title.new( name ).file.size only for titles that actually exist, and remember that each file‑metadata lookup is an “expensive” call that counts against the parser’s expensive‑function limit. Also, wrap any heavy lifting in pcall so a single broken page doesn’t abort the whole export.
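A minimal sketch of that pcall guard, with pageName and out standing in for whatever loop variables your module actually uses:

-- guard the per-page work so one bad page becomes one bad row, not a fatal error
local ok, content = pcall(function()
    return mw.title.new(pageName):getContent()
end)
if ok and content then
    -- ...normal field extraction goes here...
else
    table.insert(out, pageName .. ",ERROR")
end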
Bringing it all together – a one‑page export wizard
Below is a minimal “wizard” page you can drop into any wiki. It takes a category, an optional field list, an optional continuation title, and a row limit as template parameters, and does all the heavy lifting through Module:ExportCSV.
{{#if: {{{1|}}}
  | {{#invoke:ExportCSV|export|{{{1}}}|{{{2|name{{!}}value}}}|{{{3|}}}|{{{4|200}}}}}
  | Category: {{{1|}}}<br />
    Fields ({{!}}-separated): {{{2|}}}<br />
    Continue from title (optional): {{{3|}}}<br />
    Max rows: {{{4|200}}}
}}
You can paste that into a page called ExportWizard and transclude it with the category name as the first parameter (plus optional field list, continuation title, and row limit). The CSV text is rendered right in your browser, ready to copy or save. No external scripts, no API keys – just pure wiki power.
Final thoughts (or not…)
Automation in MediaWiki often feels like a “black‑box” art: you write templates, you click Save, and hope the wikitext does what you expect. Lua lifts the veil, letting you write real code that talks directly to the wiki engine. The learning curve isn’t steep – a few hours of tinkering with mw.title and frame:preprocess unlocks capabilities that would otherwise demand a full‑blown API client.
So, next time you stare at a mountain of pages and wonder how to extract just the slice you need, remember: you already have a scripting sandbox sitting under your fingertips. Fire up a module, write a couple of patterns, and watch the data flow out like water from a well‑tended spring.