Catastrophic Failure - OG Myth-Weavers

Catastrophic Failure

Throwing this up here to answer the most asked question: Multiple recovery approaches were attempted and failed; there will not be any further recovery. The only option, much to our regret, is to rebuild character sheets and implement safeguards to prevent any data loss events in the future. -Colin

On my part.

I've got a sheets backup from July 25 2016 restored. Any sheet data saved between that point and this restore is lost.

I'm going to vomit, then hide for a bit and then get a postmortem up.

Current Status
Initial Assessment as Posted to Facebook
I don't want to be alarmist but I'm really afraid this may be our first major data loss event.

I made a mistake while trying to rebuild/repair the data tables and that mistake may have cost the data. Additionally, our backup procedure hasn't been reviewed in about 6 months, since I thought I'd confirmed it working.

Regardless of the outcome, major failures on my part. I can't begin to apologize.

I'm working now to recover the data.


What we know now is that all sheet data after 25 July 2016 (the most recent backup, since the automated backup failed) has been lost. The sheets are still present; however, any edits made after that date are gone. For sheets created after that date, the sheet will still be present but blank.

A full post mortem will be coming tomorrow.

Planning is currently underway to prevent a catastrophic loss such as this from happening again, and details will be posted for the community to read and respond to in the next couple days.

-Colin


Stopgap Measure for Sheets
Right now for those of you who rebuilt your sheets, you can right click on the sheet, click print, and select PDF. That will save the sheet as a PDF, which right now is better than nothing. A better backup system will happen.


EDIT: 11:28PM

The full postmortem:

So, at every point in this story, there is gross negligence and incompetence, be forewarned.

The story leading up to this starts back in late 2015, with a stupid design decision I made regarding the data structure for storing the deltas with each sheet edit. The reason these were kept was to allow moving back and forth through the versions of a sheet, branching of sheets that shared a common base, as well as a way to rebuild a sheet from the deltas if anything ever happened to the main data table. I didn't properly account for the number of revisions each sheet was likely to have and this caused the data storage requirements to explode, culminating in downtime on Dec 20 2015, when I increased our disk size dramatically to account for the data and changed the storage engine to InnoDB (from MyISAM) for row-level locking, better crash recovery, and transaction support.
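To make the delta idea concrete, here is a minimal sketch in Python. The delta format (a dict of changed fields per edit, with `None` marking a removed field) and the names `Delta` and `rebuild_sheet` are assumptions for illustration only, not the site's actual schema.

```python
from typing import Dict, List, Optional

# Hypothetical delta format: each edit records only the fields it changed.
# A value of None means the field was removed in that edit.
Delta = Dict[str, Optional[str]]


def rebuild_sheet(deltas: List[Delta], version: Optional[int] = None) -> Dict[str, str]:
    """Replay the first `version` deltas (all of them if None) onto an empty sheet."""
    if version is None:
        version = len(deltas)
    if not 0 <= version <= len(deltas):
        raise ValueError(f"version must be between 0 and {len(deltas)}")
    sheet: Dict[str, str] = {}
    for delta in deltas[:version]:
        for field, value in delta.items():
            if value is None:
                sheet.pop(field, None)  # field removed by this edit
            else:
                sheet[field] = value    # field added or changed
    return sheet
```

In a scheme like this every save appends one more delta, so storage grows with the number of edits rather than the number of sheets, which is the kind of growth described above.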

Eventually, I went back to undo my mistake. I rewrote the data storage mechanism for sheets and converted all of the old data. Unfortunately the damage was done: the sheets data had grown to monstrous size and was very nearly filling the disk. After I finished carrying everything across, I backed up the database, and pruned data that I felt could reliably be deleted - sheets that had been deleted for over 6 months where the user had not been active in at least one month. The deletion went fine, and I freed up enough space to allow us to run for some time before more action would be needed.
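The pruning rule above can be written down as a small predicate. This is only a sketch: the `Sheet` fields and the day counts standing in for "6 months" and "one month" are assumptions, not the values the site used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Sheet:
    # Hypothetical fields; the real table layout is not given in the post.
    sheet_id: int
    deleted_at: Optional[datetime]  # None means the sheet has not been deleted
    owner_last_active: datetime


def is_prunable(sheet: Sheet, now: datetime,
                deleted_for: timedelta = timedelta(days=182),   # assumed "over 6 months"
                inactive_for: timedelta = timedelta(days=30)    # assumed "at least one month"
                ) -> bool:
    """True only if the sheet was deleted long ago AND its owner has been inactive."""
    if sheet.deleted_at is None:
        return False
    long_deleted = now - sheet.deleted_at > deleted_for
    owner_inactive = now - sheet.owner_last_active >= inactive_for
    return long_deleted and owner_inactive
```

Keeping the rule in one function makes it easy to list what would be removed and review that list before anything is actually deleted.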

We had been using an AWS API to manage our automated backups for many years without needing to touch it, and they started their process of deprecating it at the end of 2015, with a deadline of July 2016. There's no excuse for not getting it done sooner during the deprecation period. To be honest, I was unfamiliar with the code used to manage it, and opted to take the backup manually every night before going to bed until I could familiarize myself with the new API. This served until July 24, when I rewrote the automated backups. The system was simple: a script controlled by a cron job. I checked on it for a few days to make sure that it was both writing snapshots and logging correctly. Once I felt confident it was working, I went back to being complacent about it and didn't check it again.

It's still not clear to me what failed regarding the cron task. The logs for each day show the backup starting and ending, with no exceptions or errors, all the way up to today. I truthfully keep hoping that the logs aren't lying and that there is a backup that I've just been too worry-blind to see.
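A log line saying the job started and ended is weaker evidence than the snapshot itself. One way to guard against that, sketched here in Python, is to have the backup job confirm that a snapshot newer than its own start time actually exists and fail loudly otherwise. The callables `take_backup` and `list_snapshot_times` are hypothetical stand-ins for whatever the real script and storage API provide.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable


class BackupVerificationError(RuntimeError):
    """Raised when a backup run leaves no fresh snapshot behind."""


def run_verified_backup(
    take_backup: Callable[[], None],                       # hypothetical: triggers the backup
    list_snapshot_times: Callable[[], Iterable[datetime]], # hypothetical: lists snapshot timestamps
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> datetime:
    """Run the backup, then confirm a snapshot created after the start time exists."""
    started = now()
    take_backup()
    fresh = [t for t in list_snapshot_times() if t >= started]
    if not fresh:
        # The job "finished" but nothing new is in storage: treat this as a failure.
        raise BackupVerificationError(
            f"backup started at {started.isoformat()} produced no new snapshot"
        )
    return max(fresh)
```

Under cron, an uncaught exception gives a non-zero exit status and a traceback on stderr, which is easier to alert on than a quiet log file.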

So, we come to today.

The disk was nearly full again, and there was data that could be pruned, as I'd done safely before, to prevent failure. I don't know why I was in such a hurry. Thinking back, it makes no sense at all to do this when and how I did. I decided to take an early lunch from work and do the prune. I should have taken a backup then. I always had in the past. Even to the point of storing junk data for years just in case some derived data hadn't been generated correctly. Even if I *did* have a nightly, as I assumed I did, 8+hrs of data loss, plus the hours of downtime would have been too terrible to deal with in the middle of a Monday. It was a stupid time to even get into this.

The prune, of course, did not go without a hitch as it did before. As I brought the database back up I started getting errors that the InnoDB engine was not recognized. I should have taken a backup at this point, before delving deeper. I didn't. I tried updating the database, repairing/optimizing the tables, anything I could think of to rebuild the tables. Eventually, I got it to recognize InnoDB as a valid engine, but it wasn't recognizing that the tables existed. I researched ways to rebuild the InnoDB state file, and the most valid way to do so was to delete the central data file and rebuild it from the individual tables. I checked that the config was set for one file per table. There was nothing explicit about InnoDB so I checked the MariaDB defaults, and they were for one file per table. At this point I should have checked where those files were, and taken a backup of either the whole file I was about to delete or of the entire filesystem. In my panic to fix the situation more quickly, I neglected to do either and deleted the file. I went to rebuild the database information and it wasn't working. I then realized that there was not a file per table, and I had made a catastrophic assumption.
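The two checks skipped here (copy the file aside first, and confirm that per-table files really exist) are cheap to script. The sketch below is one way to do it, assuming the per-table files follow the usual `<table>.ibd` naming inside the database directory and that the database server is stopped while it runs. The function names and paths are hypothetical. Passing the check does not by itself make removing the shared file safe; it only catches the specific wrong assumption described above.

```python
import shutil
from pathlib import Path
from typing import List, Sequence


def tables_without_own_file(db_dir: Path, tables: Sequence[str]) -> List[str]:
    """Return the tables that have no `<table>.ibd` file in db_dir."""
    return [t for t in tables if not (db_dir / f"{t}.ibd").is_file()]


def preflight_before_removal(shared_file: Path, db_dir: Path,
                             tables: Sequence[str], backup_dir: Path) -> Path:
    """Copy shared_file aside, then refuse to proceed if any table lacks its own file.

    This function never deletes anything; it returns the path of the safety copy.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    safety_copy = backup_dir / shared_file.name
    shutil.copy2(shared_file, safety_copy)  # take the copy first, unconditionally

    missing = tables_without_own_file(db_dir, tables)
    if missing:
        raise RuntimeError(
            "no per-table file for: " + ", ".join(missing)
            + "; their data may exist only in " + str(shared_file)
        )
    return safety_copy
```

Because the copy is taken before the check, even a failed preflight leaves a recoverable copy of the shared file behind.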

I looked into our backups and was horrified that there wasn't one that was more recent. I was foolishly complacent, and expected a system that had always worked before to keep working, with no monitoring or additional attention. I scrambled for the logs and to check the cron file; everything appeared as though it had been running fine, but the snapshots were nowhere to be found.

I spent the next few hours attempting various forms of data recovery/forensics to retrieve the file, but to no avail. I called Amazon support hoping they maybe had a recovery method based in their own data redundancy, also to no avail.

Eventually I settled on restoring the old backup. It took some time to get the data moved from the old snapshot and eventually get it loaded back in, and the tables rebuilt.

...

In the end, just north of 200,000 character sheets were outright lost, with no idea how many older sheets got rolled back.

This event was entirely avoidable and inexcusable and, were anybody in a position to take the reins, I would've handed them off. But, for better or worse, those of you who stick around or come in the future are stuck with me.

Since a lot of people have said that someone would be fired for this level of negligence (honestly as someone in the field, I've seen worse pass with nary an eyeblink and less have someone's head roll), I rather feel like this post is me interviewing for my own job.

What's your biggest weakness?

Probably pride, but for what's relevant, I'm not really a systems admin. It's not where my skills or interests lie. I know enough to keep a system updated, manage the disks, set up automated tasks, implement monitoring, and create/maintain environments, even use containers and package disk images and RPMs. But the nuts and bolts reliability and stability that I see from my bearded, still using 20-year-old code/packages/tools because they're rock-solid, even if they don't have the latest features compatriots... I honestly don't share that experience, discipline, or wealth of knowledge. Today that is clear more than ever.

One thing I do well, or at least well enough to make it my day job, is software. I think the tools we've developed at Myth-Weavers over the past decade demonstrate that. I don't think I've done too poorly as a moderator, or as a manager for the rest of the staff. I'd like to take more credit than I probably rightfully can, but I think we've also built up quite the community, here, through many trials and tribulations.

So, those are my qualifications. This site is a tapestry of my successes and failures. Over the past ten years I've worked to make Myth-Weavers a solid service, with a wealth of features and a first-rate community. Today, in the span of an hour, I exhibited negligence on a scale that could potentially have destroyed the record of those tools and community. This is probably the biggest failure I've had. It's one that I, personally, wouldn't tolerate from a service provider, and I understand anyone who is absolutely furious with me.

I can't undo what I did. I assure you, no matter how angry you are, no one wishes this hadn't happened more than me. The only thing I can do is work to make things better for those who have the patience to allow me to attempt to rebuild the trust I dashed today.

I've taken an initial step by making a tiny widget that queries the date and time of the latest snapshot and displays it in the footer of the site. This is not in any way to put the onus of keeping an eye on it on the members, but is to provide transparency into the status of our backup system. It's accompanied on the back-end by a monitor that alerts me if the latest backup is too old.
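The core of such a monitor is a single comparison. The following Python sketch shows the idea; the 26-hour threshold, the function names, and the footer wording are assumptions rather than the site's actual code.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_BACKUP_AGE = timedelta(hours=26)  # assumed threshold: a nightly backup plus some slack


def is_backup_stale(latest_snapshot: Optional[datetime], now: datetime,
                    max_age: timedelta = MAX_BACKUP_AGE) -> bool:
    """True when there is no snapshot at all or the newest one is older than max_age."""
    if latest_snapshot is None:
        return True
    return now - latest_snapshot > max_age


def footer_line(latest_snapshot: Optional[datetime]) -> str:
    """Text for a public status line; a missing snapshot is shown, not hidden."""
    if latest_snapshot is None:
        return "Last Database Backup: none found"
    return "Last Database Backup " + latest_snapshot.strftime("%Y-%m-%d %H:%M:%S")
```

Treating "no snapshot found" as stale matters here, since the failure in this incident was missing snapshots rather than old ones.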

My next step is going to be to provide the link to download character sheet data, with a method for importing not far behind. As of Feb 21 2017 both JSON export and import have been implemented.
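For readers curious what a JSON export/import round trip can look like, here is a minimal Python sketch. The file layout (`format_version`, `sheet_id`, `fields`) is a hypothetical example and not the format Myth-Weavers uses.

```python
import json
from typing import Any, Dict, Tuple

FORMAT_VERSION = 1  # hypothetical version tag for the export file


def export_sheet(sheet_id: int, fields: Dict[str, Any]) -> str:
    """Serialize one sheet to a JSON string the user can download and keep."""
    return json.dumps(
        {"format_version": FORMAT_VERSION, "sheet_id": sheet_id, "fields": fields},
        indent=2,
        sort_keys=True,
    )


def import_sheet(text: str) -> Tuple[int, Dict[str, Any]]:
    """Parse an exported file, rejecting anything that is not a recognizable export."""
    data = json.loads(text)
    if not isinstance(data, dict) or data.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported or missing format_version")
    sheet_id, fields = data.get("sheet_id"), data.get("fields")
    if not isinstance(sheet_id, int) or not isinstance(fields, dict):
        raise ValueError("export file is missing sheet_id or fields")
    return sheet_id, fields
```

Validating on import keeps a damaged or unrelated file from silently overwriting a good sheet.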

Following that, I plan to update the sheets system to, on every save, store the latest version of that save within the browser's localStorage. Not only will this provide a backup that can be quickly saved back to the server, but it will also provide the path toward offline editing of sheets. While this isn't a perfect solution due to multiple devices and browser data clearing, mixed with the other methods there will be a variety of ways to export and import your sheet data.

The next part has more to do with discipline than code. I'll keep a schedule to check backups regularly, and to run drills of disaster recovery to ensure that the backups are actually something that we can recover from.

I'm also heavily considering a script to kick off a backup when I log into the database server, both for convenience and to remind me of the potential gravity of even the most mundane task.

Other than that, I'm open to other ideas of how to ensure data reliability and fault tolerance here at Myth-Weavers.

I offer my sincerest apologies to the community, to our wider user base, and to my staff for having to deal with the fallout of this today. Ensuring this, or any other event like it, never happens again is my pledge to you.

You sure? Cause I see sheets made only last month are still up.

EDIT: Wait n/m. I see it now.

Quote:
Originally Posted by Blue Tempest
You sure? Cause I see sheets made only last month are still up.

EDIT: Wait n/m. I see it now.
For other people: They're still showing up in lists, and the sheetid is still reserved. That means that if you are fortunate enough to have kept a particular sheet open on your computer, you can change one cell on that sheet and save it to get your sheet back instantly.

It also means that [sb] tags will be blank, but won't cause problems, and your characters are still connected to their games.

NOOOOOOOOOOOO-oh well. Crap happens.

Quote:
For other people: They're still showing up in lists, and the sheetid is still reserved. That means that if you are fortunate enough to have kept a particular sheet open on your computer, you can change one cell on that sheet and save it to get your sheet back instantly.
This is true.

I'm so sorry that happened, Rodrigo... It's always awful when you lose your data. Stay strong and let us all rebuild from what we can and have.

Offers Hugs

Really sorry you have to deal with this, and as I said on FB, this kind of thing happens. I recently lost data that can't be recovered, 2.2 gigs worth of art, and campaign files for about 7 different campaigns that just can't be redone. It sucks, but just need to take a breath and move on. Awful, but this kind of thing happens. It could have been much worse. Thanks for all of the work you and your staff do here!




 

Last Database Backup 2024-03-28 08:21:26pm local time