Unfortunately, it is the HTML reports that have the most systemic issues. We have already covered the structural problems; this page deals with the remaining ones. Also, since these reports supposedly mostly complement the basic boxscore JSON feed, we may have overlooked some issues wherever we simply accepted the values from the boxscore.
The errors that we were able to detect so far are:
Some early HTML reports, pre-2007, have somewhat inconsistent headers even when their markup is not broken.
Some HTML reports for games that finished in three (or four) periods have their status stuck at 'End of Period 3' (or 'End of Period 4') instead of 'Final'.
Some GS reports have their table layout botched.
Some GS reports are incomplete despite being in the 'Final' state: the affected cells contain the 'Data Pending' keyword instead of data.
Player naming conventions are not consistent. Sometimes a name appears as LASTNAME, FIRSTNAME and sometimes as FIRSTNAME LASTNAME.
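One way to cope with the two naming conventions is to normalize every name to a single form before matching players. The following is a minimal Python sketch, not the parser's actual code; the function name normalize_player_name and the choice of FIRSTNAME LASTNAME as the target form are our assumptions for illustration.

```python
def normalize_player_name(raw: str) -> str:
    """Return a name as 'FIRSTNAME LASTNAME', upper-cased and single-spaced.

    Hypothetical helper: handles both 'LASTNAME, FIRSTNAME' and
    'FIRSTNAME LASTNAME' inputs.
    """
    # Collapse runs of whitespace and unify the case first.
    cleaned = " ".join(raw.split()).upper()
    if "," in cleaned:
        # 'LASTNAME, FIRSTNAME': split on the first comma only and swap.
        last, first = (part.strip() for part in cleaned.split(",", 1))
        return f"{first} {last}".strip()
    # No comma: assume the name is already 'FIRSTNAME LASTNAME'.
    return cleaned
```

With this sketch, both "DOE, JOHN" and "John  Doe" come out as "JOHN DOE", so the two conventions compare equal.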
In some ES reports the stat cells are empty instead of containing the number 0.
Parsing the GS report is an adventure even when everything is in order. You can take a look at the source code of the parser.
Misconduct and bench penalties are handled inconsistently with respect to the assigned 'servedby' player. Sometimes the penalty minutes are added to his boxscore and sometimes they are not. Sometimes he is marked as the offender even though the penalty is a bench penalty.
In some PBP events the hash sign '#' is not followed by the player's number; this usually happens with bench penalties.
It would be extremely useful if the HTML reports included a reference to the player ID next to each player's name, via a tooltip, an onMouseOver event or something similar.
Pre-2007 PBP reports are formatted line sheets, not HTML tables. Sometimes the column offsets drift. It should be possible to convert all the pre-2007 reports into the new HTML format.
PSTR, PEND and GEND events are inconsistent and often mixed up. If they are out of order, we recommend not extracting them at all and repopulating them instead from the known default start and end points of each period.
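The repopulation idea can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the parser's actual code: events are assumed to be dictionaries with hypothetical keys 'type', 'period' and 'time' (seconds elapsed in the period), and the caller supplies the length of each period, since overtime lengths differ between game types. For simplicity the sketch rebuilds the markers unconditionally rather than only when they are out of order.

```python
MARKERS = {"PSTR", "PEND", "GEND"}

def rebuild_period_markers(events, period_lengths):
    """Drop reported PSTR/PEND/GEND events and regenerate them.

    events: list of dicts with 'type', 'period', 'time' (assumed layout).
    period_lengths: dict mapping period number -> length in seconds,
                    supplied by the caller (e.g. {1: 1200, 2: 1200, 3: 1200}).
    """
    # Keep only the real plays; the reported markers are not trusted.
    plays = [e for e in events if e["type"] not in MARKERS]
    unknown = {e["period"] for e in plays} - set(period_lengths)
    if unknown:
        raise ValueError(f"no length given for periods: {sorted(unknown)}")

    rebuilt = []
    for period in sorted(period_lengths):
        rebuilt.append({"type": "PSTR", "period": period, "time": 0})
        in_period = (e for e in plays if e["period"] == period)
        # sorted() is stable, so simultaneous events keep their original order.
        rebuilt.extend(sorted(in_period, key=lambda e: e["time"]))
        rebuilt.append(
            {"type": "PEND", "period": period, "time": period_lengths[period]}
        )
    # A single GEND closes the game after the last period.
    last = max(period_lengths)
    rebuilt.append({"type": "GEND", "period": last, "time": period_lengths[last]})
    return rebuilt
```

Passing the period lengths explicitly keeps the sketch free of guesses about overtime; a game decided in overtime would pass the time of the deciding goal as the length of the last period.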
A few PBP events have their periods out of order (e.g. 12).
On-ice information for shootouts is very inconsistent. Sometimes only the shooter and the goalie are listed, sometimes only the two goalies, and sometimes all three players.
Please note that errors that did not affect the data we ourselves look for may have flown under our radar. One example is an invalid on-ice listing in which two goalies are listed for the same team.
All this work would not have been possible without the magnificent HTML::TreeBuilder module.
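A check for the two-goalie case could look like the minimal Python sketch below. It is an illustration only, not part of the parser: the event layout (an 'on_ice' dictionary mapping a team to a list of player dictionaries with a 'position' key, where 'G' marks a goalie) and the function name are assumptions.

```python
def find_multi_goalie_listings(events):
    """Return (event index, team) pairs where a team lists more than one goalie.

    Assumed layout: event['on_ice'] maps team -> list of player dicts,
    each with a 'position' key ('G' for goalies).
    """
    problems = []
    for index, event in enumerate(events):
        # Events without on-ice data are skipped rather than flagged.
        for team, players in event.get("on_ice", {}).items():
            goalies = [p for p in players if p.get("position") == "G"]
            if len(goalies) > 1:
                problems.append((index, team))
    return problems
```

Running such a check over the already parsed events only reports suspicious listings; deciding which of the two goalies was actually on the ice still requires other data.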