Migrating the data of a MediaWiki ratings extension with a Python lp_solve script

For a long time the Softcatalà wiki used a MediaWiki AJAX rating extension, so that we could get a rough idea of visitors' feedback on the different applications we list in our software repository. Recently I migrated that wiki to a newer MediaWiki version, and beforehand I reviewed which extensions to keep: basically, for every installed extension I checked whether there was an update for the MediaWiki version we were moving to.

I noticed that the former extension was rather obsolete and unmaintained, and that other extensions such as VoteNY looked far better. At first sight, I thought it should not be too difficult to adapt the data structure from one to the other and then migrate all the data. But the truth is that it turned out to be more complex than I would have expected from a "simple" rating add-on.

The fact is that the former extension did not store every submitted vote, but only accumulated the sum and the number of votes for each associated page. In contrast, VoteNY does store every vote, which is what I think anyone would normally expect (we could argue about privacy, but that's another story).

At that point, I could simply have decided to drop all the previous data and start again from scratch, but I started to wonder whether there could be any other chance to keep that information.

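Taking into account that the only possible scores are 5, 4, 3, 2 and 1, and since we know for certain both the number of votes and their sum for each page, what I had could be translated into two equations, where a, b, c, d and e are the numbers of votes with scores 5, 4, 3, 2 and 1 respectively: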

5a + 4b + 3c + 2d + e = SUM_VOTES
a + b + c + d + e = VOTES
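
To get a feeling for how loosely these two equations pin down the votes, here is a minimal brute-force sketch in Python. The function name `vote_breakdowns` and the sample totals (3 votes adding up to 9) are made up for illustration; they are not taken from the real wiki data, and this is not the script I used for the migration.

```python
from itertools import product

def vote_breakdowns(votes, sum_votes):
    """Yield every (a, b, c, d, e) of non-negative integers that
    satisfies both equations for the given totals (hypothetical helper)."""
    for a, b, c, d in product(range(votes + 1), repeat=4):
        # e is fixed by the second equation once a..d are chosen
        e = votes - (a + b + c + d)
        if e >= 0 and 5 * a + 4 * b + 3 * c + 2 * d + e == sum_votes:
            yield (a, b, c, d, e)

# Illustrative totals only: a page with 3 votes whose scores add up to 9
solutions = list(vote_breakdowns(3, 9))
for breakdown in solutions:
    print(breakdown)
```

Even for a page with only three votes there are several breakdowns that fit, and the number grows quickly with the vote count, so the stored totals alone cannot tell us which one really happened.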

Two equations with five unknowns leave too many degrees of freedom, so there is no single solution, but I thought that Linear Programming could provide a satisfactory enough answer.

For this I used the lp_solve Python wrapper, and you can see a gist of the code I used below.
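
Independently of that script, the idea of turning the stored totals into individual vote rows can be sketched without any dependency. The snippet below does not use lp_solve: it swaps in a simple direct construction that spreads the sum as evenly as possible over the votes, which is only one of many valid choices. The function name `reconstruct_votes` and the sample totals are assumptions for illustration.

```python
def reconstruct_votes(votes, sum_votes):
    """Return a list of `votes` integers in 1..5 adding up to `sum_votes`.

    Hypothetical helper: an even-spread construction, not an LP solver.
    """
    if votes <= 0 or not votes <= sum_votes <= 5 * votes:
        raise ValueError("no set of 1-5 votes matches these totals")
    low, extra = divmod(sum_votes, votes)
    # `extra` votes get the higher score, the remaining ones the lower score
    return [low + 1] * extra + [low] * (votes - extra)

# Illustrative totals only: 4 votes whose scores add up to 17
rows = reconstruct_votes(4, 17)
print(rows)
```

Each element of the returned list would become one vote entry for the page in the new structure. A construction like this only ever produces two adjacent scores per page, so the result matches the stored totals but says nothing about the real distribution.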

Of course, the outcome is not the real set of votes, and it might be argued that the inserted entries are not realistic (for example, certain common vote values may be missing altogether).

Another improvement could have been to include the stored IPs in the new structure, but I skipped that for the sake of simplicity. So now users have a new chance to rate their favourite applications.

On the other hand, we should be aware that if we ever wanted to use the recorded data to know, let's say, how many users gave a score of 5 to one specific application, we should restrict the query to table entries recorded from now on.

I hope this post can help people like me to think twice about what to do with old, and sometimes poorly structured, data before deciding to drop it for good.