Recovering a Website from Thin Air
Or, Undeleting Mysql MyISAM
Note: this post is from 2012 but I never got around to cleaning up and posting it.
Around 7:45 AM in the morning of one Thursday in 2012, I got the following piece of good news in my email:
URGENT: Website.com deleted their entire blog
Website.com (name changed) is a medium sized niche news blog that averages somewhere in the range of 100k uniques a month and has around 25,000 posts. I had helped build some specific functionality for the site on top of a CMS I had built as a consultant in 2010. It’s hosted on its own Linode instance. What had happened was that a user had been given a few too many permissions, had meandered into an admin screen she shouldn’t have while trying to delete a post, had bravely clicked onwards, ignoring multiple warnings of doom and gloom, and had gone ahead and deleted the entire blog along with all 25,000 of its posts.
“Aw, Crap,” I thought, knowing the consultancy that had built it (my former client) had no one on staff that could pull the lost information out of backups and merge it in.
Logging on with SSH, I went to the config directory to find out where the machine was keeping its backups. I got a little scared when I didn’t see the requisite backup.yml file, but figured I’d just have to restore the Linode to the state of the last full-machine backup, losing the past day’s worth of forum activity and users comments. My ex-client’s client would be a little pissed, but as they weren’t without blame in this either it could all probably be smoothed over. Logging on to the Linode control panel I notice something that sent a palpable chill down my spine: a little link asking me if I wanted to enable backups on the instance.
There were no active backups.
The last site backup was from the middle of 2011 before it had been moved to it’s current server.
Nearly a year of the blog, 2,000 posts and thousands of man-hours of people’s lives, for all intents and purposes, were gone.
I relayed the news, explaining that there were no recent backups.
I was asked what could be done. “Well…” My first impulse was to state the obvious: “Nothing.”
The website and the data was gone. This wasn’t exactly true though. Search engines and the archive.org crawler cache the full content of pages all the time. We didn’t need any images, just the blog data. Provided we took down the website immediately, any cached copies would stick around long enough for us to grab them down. The only problem with this is that we aren’t talking about a 10 page brochure site. We were talking about thousands of posts in various categories and sections strewn throughout the site.
But, suddenly all didn't seem lost. My mood improved as I realized that there were additional fingerprints from the site’s existence strewn about the server: logs, data files, caches, etc. Anything that might relate back to the blog could be assembled, one piece at a time, back into a cohesive set of data for the blog.
Getting permission to shut down Apache to prevent any search engine’s caches from being updated, I scoured the server for anything that might be reconstituted back into blog data. At minimum, the rotating Rails logs for the previous 14 days were around and any posts could be recreated from the post-data in the log files, which was better than nothing.
As I was working my way through the server I figured I might as well take a look at the mysql data files and see what there was to see. I didn’t know anything about how MySQL works interally, so I didn’t have any idea what I would do with them in any case. What I saw stopped me in my tracks:
-rw-r——- 1 root root 13392 2012-03-22 12:14 blog_blogs.frm
-rw-r——- 1 root root 8714 2012-03-22 12:14 blog_categories.frm
-rw-r——- 1 root root 13312 2012-03-22 12:14 blog_post_revisions.frm
-rw-r——- 1 root root 232787772 2012-03-22 12:14 blog_post_revisions.MYD
-rw-r——- 1 root root 1608704 2012-03-22 12:14 blog_post_revisions.MYI
-rw-r——- 1 root root 8656 2012-03-22 12:14 blog_posts_categories.frm
-rw-r——- 1 root root 9008 2012-03-22 12:14 blog_posts.frm
-rw-r——- 1 root root 3385064 2012-03-22 12:14 blog_posts.MYD
-rw-r——- 1 root root 648192 2012-03-22 12:14 blog_posts.MYI
-rw-r——- 1 root root 8652 2012-03-22 12:14 blog_targets.frm
blog_post_revisions was the table where the damage had been done, and it had been cleaned out completely, with every record in it deleted off the face of the earth. If that was the case, why was it nearly two hundred and fifty megabytes big? I did a quick:
$ more blog_post_revisions.MYD
Between the occasional angry beeps of ASCII BELL characters and various non-ASCII characters mucking with my terminal, there was a lot of stuff that looked exactly like the contents of the deleted blog posts. 232 megabytes of stuff, to be exact. blog_post_revisions is the table where Webiva stores a copy of the content every-time the user presses ‘Save’ on a blog post. The most important data for the blog was still there, hanging out in the deleted-row-netherland.
I stopped mysql to prevent any additional data-lose and copied all the data down to my local machine. Opening the file up in a hex-editor confirmed that the majority of the data was there
Running a myisamchk on the file gave me good news:
$ myisamchk -e blog_post_revisions.MYI
Checking MyISAM file: blog_post_revisions.MYI
Data records: 0 Deleted blocks: 70457
- check file-size
- check record delete-chain
- check key delete-chain
- check index reference
- check data record references index: 1
- check data record references index: 2
- check records and index references
70457 deleted blocks sounded promising, if the checker knew about the blocks then there had to be a substantial amount of data here. Now the questions became: how does Mysql actually store it’s data in binary format (for a MyISAM table in this specific case) and can MyISAM rows be undeleted?
A quick consultation of the Google was inconclusive: some people claimed something was possible but that there was no automated way to undelete data. This turned out to be true: because of the way it’s stored, there was no automated way to undelete MyISAM data when the row format is Dynamic (as blog posts, which were rife with MEDIUMTEXT fields, most certainly were.) If the table had been innodb a company called Percona has released an open-source data recovery tool: https://launchpad.net/percona-data-recovery-tool-for-innodb If the table had been a fixed MyISAM table life would also have been easier, unfortunately the table was MyISAM and Dynamic and as such I wasn’t able to find any tool that might help in recovery.
I did, however, come across this jewel:
While this document may be outdated and has at least one inaccuracy that stymied me for a bit (specifically DATETIME’s aren’t stored as 2 32-bit integers, but rather 1 64-bit integer in my version of MySQL) it proved to be the keystone of the whole operation. It describes the data format of MySQL’s MyISAM files from beginning to end and explained exactly why a universal undelete was not possible.
A MyISAM table consists of three separate files on the file system: a .MYD (MySQL Data), a .MYI (MySQL Index), and a .frm (Format) file. The format file describes the columns in the table. For this case I only cared about the .MYD file as the structure of the table was still intact. The section in this page on the Layout of the Record Storage Frame explained exactly why pull the data out of the deleted rows was not going to be straightforward:
Most of the blog posts would be well under 64k, which means that they would fit in a “small” record, which means that their header length would be only 3 bytes. Deleted blocks, on the other hand, overwrite the beginning of each record block with a 20 byte header. This means that at least 17 bytes of data are missing from the start of each deleted record. This wouldn’t be a huge deal except for that fact that in those 17 bytes is the bitmap that defines which fields are NULL or empty and which have content. If you have a table that has a lot of fields that might or might not have data in them, you’re going to wish you knew which fields are represented in the data file and which weren’t.
Since the blog_post_revisions table is mostly a write-only table, there was a significant bonus that most of the records originally were type 1/01 which means that they were full small records. Tables that have a lot of deletions followed by writes would have been much more painful to handle as they would have been filled with records that would only be partial records continued in other parts of the file (see types 5/05, 7/07, 11/0B in the link above) that would have been unparsable because the details pointing to the next record would have been overwritten.
The table in question, blog_post_revisions looked like this:
1 id int(11)
2 blog_post_id int(11)
3 title varchar(255)
4 domain_file_id int(11)
5 status varchar(255)
6 keywords varchar(255)
7 preview mediumtext
8 preview_html mediumtext
9 body mediumtext
10 body_html mediumtext
11 author varchar(255)
12 created_at datetime
13 updated_at datetime
14 end_user_id int(11)
15 media_file_id int(11)
16 embedded_media text
17 preview_title varchar(255)
As you can see, the table has a number of fixed length fields?—?int(11) for example always takes up 4 bytes?—?along with variable length fields: varchar(255), mediumtext, and text. Variable length fields are prefixed with a length field whose size depends on the size of the field. varchar(255) fields have, as one would expect, 1 byte of length data. mediumtext has 3. text has 2. Except, as mentioned above, when the fields empty. Then they just aren’t there (information that would be contained in the bitmap, but unfortunately overwritten by the too-large deleted row header.) As you can imagine, this makes life more interesting when you’re trying to blunder your way through the file and extract useful data.
Looking through the data with my trusty hex editor I realized that this was going to be a challenge and was going to rely significantly on the actual data stored in the table: which fields were normally filled in and which weren’t. The first issue was to orient myself on an individual record. Achieving this turned out to be a fortunate coincidence. The only records that we cared about to restore were ones that had a status of ‘active’. This string, combined with it’s length prefix of 6 (resulting in the hex string of “06 61 63 74 69 76 65" was specific enough (having a bit less than a 2^56 or so chance of appearing randomly) to be an anchoring point for the recovery.
My plan of attack involved finding that specific hex string in the file, then going backwards 4 bytes to save the domain_file_id (the image used in the blog post) if present, then going backwards an unknown number of bytes to save the title of the post. Doing those two things were a bit tricky, as I didn’t know a priori if the post had an image and the first two characters of the title along with the byte that represented the length of the title itself were chopped of by the extended deleted header. The only solution was to look at the four bytes preceding the status and if any of them were non-alphanumeric ASCII characters then assume we had a file id (the title, after all, would be all valid ASCII alphanumeric characters). From there the path would continue backwards until the first non-alphanumeric character was reached. After which I chopped two characters off the front of the title for good measure to prevent any coincidental letters from getting into the title (as the title was already incomplete, it wasn’t a big deal if we lost a couple more characters).
Going forward from the status turned out to be a bit easier, as the start of the preview, preview_html, body and body_html of each post would always begin with a HTML-tag starting “<” (an artifact of the WYSIWYG editor Webiva uses and the glorious consistency of the bloggers writing the posts) This meant that unless any of the fields happened to be exactly 0x3C characters long the rest of the data was pretty manageable.
This technique was extended to another table that contained a search index containing the url of the post along with a slightly-less incomplete version of the title. Merging those two records then running the title through the Bing API, pulling down the cached version of the first page that matched the URL of the website and using the Ruby Nokogiri library pull out various pieces filled in some more of the data.
In the end there wasn’t a silver bullet. Recovering all the data required a combination of parsing the blog_post_revisions.MYD file, parsing the search index content_node_values.MYD file, parsing the production.log files for any missing url and finally hitting the Bing API for cached versions of pages when all that failed to pull the user data. I lost that Thursday, Friday and then the entire weekend that followed mixing and matching data where I could to piece together the content for the site. But in the end what I thought originally was going to be impossible turned out to be entirely doable with a little bit of metaphorical elbow grease.
This post will end the same way as the many other people who recount harrowing tales of partially-avoided data loss tend to end their posts: reminding you not to suffer the same fate as I and double check your backups right now. Even if it’s not a machine you’re employed to maintain, if you are going to be the one recovering the data, go ahead, log-on and double check your backups. You’ll thank me later.
For those of you found this post for whom it’s too late, know that with a little bit of luck you probably have more of the data for lost site than you think, you’ll just need to look at all your options for pulling the site out of the cloud and the MYSQL-binary-format badlands.
P.S. The Bing API actually Rocks. Microsoft has done what Google has specifically outlawed users from doing: hitting their API to generate custom search results for users on the server side. Who would have guessed 5 years ago that Google would now habitually play the role of restrictive, privacy violating bad guys, while Microsoft would come to the rescue as API saviors?