Hi-
This post isn’t for everybody. It’s about data backups, and an idea I got this morning. Not the most exciting topic ever…for most people. I have a special interest in it, but if you don’t? Might be better to skip this one.
Okay, here’s the setup:
- I’ve liked a company called SpiderOak for a while now. I’ve often recommended them to others.
- I liked them a lot better than something like Dropbox, even though their software is (necessarily) a bit slower and more cumbersome to use. Because they (say they) encrypt all user data with the user’s password, and (they say) they don’t ever know user passwords, and (they say) therefore they have no way to know what data is being stored with them. So they (say they) can’t provide your info to any third party either.
- One caveat I’ve always had with this is that you have to give your password to their web server to set up your account. Which means that each and every password in fact flows through their web server’s memory at least once. It’s a strange design choice IMO. Seems to me that using a separate passphrase for encryption would be a (much!) better idea.
- Another issue is that if you access your stuff via their web interface at any time after account setup…the password is again available to their server, for as long as your session lasts. They’re aware of this issue, and warn against it for those who care. But they don’t warn about the account-setup bit. Hmm.
- Basically you have the same issue if you access any of your stuff from a mobile device. And if you let the device store your password, your data is no more secure than the device itself. Which is true of any device that can access your data…but still, it’s worth thinking about.
- You have to trust the company a lot to use the software at all. It’s not open-source, so it’s unclear how well they’ve implemented their ideas. There’s no way to verify that the version of the software you download is the same as the version other people are getting. (I’ve suggested digital signatures on their download page and a BitTorrent “download” option to mitigate this, but they’ve preferred not to implement either…and since both are very easy to do, this concerns me somewhat.) In fact, as soon as you start the app the first time, it (reasonably!) asks for a username and password. Thing is, if I slipped someone a fake version it wouldn’t be hard to, for instance, fake that login window. Nothing good can happen from there. Plus, by default the app updates itself, and I know of no easy way to verify the company can’t update bits or all of it regardless of user preference, or even run arbitrary code on a per-user basis. For all I know they have utilities to grab the password (always assuming they don’t already have it) upon request from a gov’t agency…which also means employees may have access, and the system as a whole may be hackable. Unless they’re the first to develop one that isn’t. If it were open-source and easy to inspect I’d feel better…but it isn’t.
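A minimal sketch of the download-verification idea from that last point, using a published SHA-256 checksum rather than a full digital signature. That swap is an assumption made to keep the example short: a checksum only helps if it is published somewhere an attacker can't also change, and a real signature is stronger. The function names here are hypothetical, and nothing below is something SpiderOak provides.

```python
import hashlib
import hmac

def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash a file in small chunks so a large installer never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_checksum(path, published_hex):
    """Return True if the file's SHA-256 equals the value the vendor published."""
    actual = sha256_of_file(path)
    # Normalize the published value, then compare in constant time.
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

If everybody who downloads the installer gets the same checksum from the same public page, at least you know you all got the same bits.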
All that said? I still thought they were a better bet than other backup storage providers. I liked them so much that I’ve been voluntarily paying them for double the storage space I’ve actually used for over a year. So all the caveats above were no big deal for my use case. And, y’know, I like playing with this “security” stuff but mostly I needed backups that just worked. So I could…you know…do my thing and not worry about hard drive failures or accidentally deleting a month’s worth of effort.
Oops. They got me on that one.
I have a folder I backed up that has a few hundred subfolders. It’s about 18GB in total, with a few thousand files. I used to be able to, as part of setting up a new computer, “sync” that folder to the new machine. Everything would (slowly but surely) come to me, wherever I was. It was nice.
Unfortunately it no longer works for that folder. I’ve set up multiple new operating systems, both 32-bit and 64-bit, both various flavors of Linux and plain ol’ Windows…and that data will no longer sync. I don’t know why–could be data corruption (which is a very bad thing from a backup provider), or it could just be a software glitch with their current version.
All is not lost, for two reasons:
- I can choose to download that data instead of setting up a “sync,” and it sort of works. I get some of the files, anyway, but I may never wait around to find out whether I could get all of them. I have a fairly good internet connection at the moment, and it looks as if it will take roughly 10 days to get all 18GB. Downsides: (1) if I shut down the app, its “Download Manager” doesn’t remember what it was doing, so I’d have to leave a computer running without shutting it down for the entire 10 days. (2) I actually have much more data than that 18GB stored with SpiderOak. So…bad news, there. Um, and also (3) Seriously? Ten days?
- I’m kind of a geek. So I actually have everything backed up on an external drive anyway. Downside: what am I paying SpiderOak for? What about all the people to whom I’ve recommended their service? I need to publicly disrecommend them ASAP. Thus, this.
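To put that ten-day estimate in perspective, here's a quick back-of-the-envelope calculation of the average transfer rate it implies. It assumes "18GB" means 18 GiB and that the download runs nonstop for the full ten days; the function name is made up for the example.

```python
def average_rate_kib_per_s(total_gib, days):
    """Average transfer rate implied by moving total_gib in the given number of days."""
    total_bytes = total_gib * 1024 ** 3
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 1024

rate = average_rate_kib_per_s(18, 10)
print(f"{rate:.1f} KiB/s")  # prints "21.8 KiB/s"
```

That's under 200 kilobits per second, on a connection that can do far better.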
Unfortunately, it gets much worse than that. I sent their tech support an email about all this on 11/7. They responded the same day, telling me the issue had been “escalated” and I’d hear back within two days.
Since then I’ve sent a few more messages. I’ve heard nothing. For two weeks. Which, to me, means there’s no reason to do business with them anyway. Sheesh. What a pain. I’m disappointed, because I’d thought much better of them.
Okay, moving on: I have a fairly quick solution in mind for my own use, and here’s where some feedback might be useful. I used to run a service (called “Scarecrow”) that backed up small-business websites in the cloud. (It also checked website availability from various locations around the US–thus the image for this post–and sent user-defined alerts when files changed or a site didn’t respond, but that’s a side issue in this context.)
Scarecrow used Amazon S3 for storage, but all files were given UUID filenames in a single “bucket” and all files were also encrypted with various keys, so nobody at Amazon could read their contents or even determine which of my customers owned which file. I liked that feature a lot.
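A minimal sketch of that naming scheme, under stated assumptions: this is not Scarecrow's actual code, the `BackupIndex` class and its methods are hypothetical, and the encryption and S3 upload steps are left out entirely. The point is only that the storage provider sees random names, while the mapping back to customers and paths lives in a private index.

```python
import uuid

class BackupIndex:
    """Hypothetical private index: maps (customer, path) to an opaque storage key."""

    def __init__(self):
        self._entries = {}

    def assign_key(self, customer, path):
        """Return the random object name for this file, creating one on first use."""
        return self._entries.setdefault((customer, path), uuid.uuid4().hex)

    def lookup(self, object_key):
        """Reverse lookup, possible only for whoever holds the index."""
        for owner_and_path, key in self._entries.items():
            if key == object_key:
                return owner_and_path
        return None

index = BackupIndex()
key = index.assign_key("acme", "/var/www/index.html")
# 'key' is 32 hex characters that reveal nothing about the owner or the file.
```

The flip side: lose the index and the bucket is just a pile of anonymous blobs, so the index itself needs a backup too.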
Here’s what I can do fairly quickly, for my own use:
- Set up a file server
- Set up a Scarecrow virtual machine (only takes about 128MB of RAM, because all uploads/downloads/encryption/decryption are streamed in small chunks) to monitor that file server. Scarecrow can use FTP or SFTP…but plain old SSH is its first choice. So…easy.
- Sit around smirking while it backs up everything to S3
- Separately, encrypt and back up the Scarecrow database so I can still do restores if something happens to my Scarecrow instance.
- Look into LAN-based folder-sync software so I don’t have to write a client app and my wife and I don’t have to remember to transfer important stuff to the file server.
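The small memory footprint mentioned in the Scarecrow step comes from streaming: never hold a whole file in memory, just one chunk at a time. Here's a minimal sketch of the idea, with zlib compression standing in for the encryption step purely because it ships with Python. `stream_transform` and the chunk size are hypothetical, not Scarecrow's code.

```python
import io
import zlib

CHUNK_SIZE = 64 * 1024  # assumed chunk size; memory use tracks this, not the file size

def stream_transform(src, dst, chunk_size=CHUNK_SIZE):
    """Read src in small chunks, transform each one, and write it to dst."""
    compressor = zlib.compressobj()  # stand-in for a streaming cipher
    total_in = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        total_in += len(chunk)
        dst.write(compressor.compress(chunk))
    dst.write(compressor.flush())  # emit whatever the transform still has buffered
    return total_in

# Usage with in-memory streams; real code would pass a file and an upload stream.
source = io.BytesIO(b"backup me " * 50_000)
target = io.BytesIO()
bytes_read = stream_transform(source, target)
```

The same loop shape works in the other direction for restores, with the transform reversed.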
Now…this will give me everything I need, and will cost far less than SpiderOak did…never mind that I was paying them double; I appreciated the convenience of their solution. Until it stopped working, and all.
With Scarecrow, since it has a web interface already built, I can actually view all versions of all files if I want to. I can see a snapshot of all backed-up files at any given time, and if I want to download a version of a file or restore everything from, say, last Tuesday at 10:43:27PM I can do it whenever I want…and still keep all other versions of all files around in case I decide I need them. Plus, if something unforeseen happens to my S3 account or the Scarecrow db, I’ll still have a backup in the form of my file server.
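A minimal sketch of how a point-in-time snapshot like that can work when every version is kept. The version list, dates, and storage keys below are invented for illustration, `snapshot_at` is a hypothetical name, and this is not Scarecrow's actual schema. It also ignores deletions, which a real index would need to record.

```python
from datetime import datetime

# Hypothetical version index: (path, time backed up, storage key). Nothing is overwritten.
VERSIONS = [
    ("notes.txt", datetime(2020, 1, 6, 9, 0, 0), "k1"),
    ("notes.txt", datetime(2020, 1, 7, 22, 43, 27), "k2"),
    ("notes.txt", datetime(2020, 1, 9, 8, 0, 0), "k3"),
    ("photo.jpg", datetime(2020, 1, 8, 12, 0, 0), "k4"),
]

def snapshot_at(versions, moment):
    """Return {path: storage_key} using each file's newest version at or before moment."""
    latest = {}
    for path, saved_at, key in versions:
        if saved_at > moment:
            continue  # this version did not exist yet at the requested time
        if path not in latest or saved_at > latest[path][0]:
            latest[path] = (saved_at, key)
    return {path: key for path, (saved_at, key) in latest.items()}

tuesday_night = snapshot_at(VERSIONS, datetime(2020, 1, 7, 22, 43, 27))
```

Restoring "last Tuesday" is then just a matter of downloading the keys in that snapshot, and the later versions stay in storage untouched.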
What I’m wondering is: should I open-source Scarecrow, and maybe make the whole virtual machine into a downloadable appliance? Hell, it could be set up to run as a Windows service (for you Windows people) without much effort on my part. I’d want to do some work on it, ’cause some things (like my S3 account info) are stored in ways that were reasonable when I was the only one running it, and I never built any sort of application interface that could read or change that stuff. And I’d want the LAN folder-sync part to be simple to set up. Right now I know absolutely nothing about that end of things…but it looks like I’ll have to learn.
It’d be fun to do this. Of course it would also take time, which is a problem when I’m trying to get a lot of writing done. So, does anybody but me care about this stuff? I mean, some people obviously do, but they mostly don’t read my blog. And I’m not sure doing all this is worth the time it’d take away from my other projects. Especially if I then have to go out and find a way to let people know about it. What a pain.
It’s an idea. I get them sometimes. Often they pass without ill effect. We’ll see what happens this time.
The main takeaway from all this, for me: I’ve argued against people who said the cloud was a bad place for personal data. Um…SpiderOak used to be my primary example of a company that was doing it right. Could be I was wrong. I hate that.
Sigh. I still think that, with caution and awareness, that anti-cloud position is not totally correct. However, I simply don’t know of a backup solution out there (besides my proposal) that’s (1) reliable, and (2) verifiably prevents people who aren’t me from seeing my data. Even when I ran Scarecrow for money, in principle I had access to my customers’ files…I had to, so the app would be able to restore them “from the cloud” without customers’ having to run it all from a client application on their own computers. But if everybody ran their own version of Scarecrow, with their own Amazon accounts–funny how often Amazon crops up in discussion around here, isn’t it?–and with their own app-generated encryption keys…hmm!
So, noodle on that for a bit if you’re of a mind to do so. And have fun out there!
UPDATE:
They now say, via Twitter, that they’re really sorry. And have hired new help. And at least one employee is working as hard as he can. Well…tough. What are we, children on a playground? Apart from that first “escalated” note, they didn’t respond to me at all until I went public–even though I told them I would. But they responded within an hour after I did. I don’t like that at all. What about the customers who don’t post complaints in public? I already established what happens to them…
And this clearly means they’ve got at least one very badly designed system within their company. Doesn’t it? I mean, not letting stuff like this sit for weeks with no comment/response is a no-brainer. Why didn’t an automated system pop up and say or do something about it? If their systems and culture allow/encourage this sort of thing where I can see it, what do they do where I can’t? Sheesh. Anyway, I’m done with ’em. I think you should be done with ’em, too. So…how ’bout Scarecrow? {8’>
Wow. Two freaking weeks?
Been following your blog since you ran Cabin Fever. I have a Micro-ISV, and your posts on software and backups were always interesting. I love the publishing stuff you write about now, because I keep telling myself I’ll write a book someday, but my favorite posts are when you take on some company or other that’s raised your ire. Death to Amazon! I like the stories too. Especially the ones with Marvin, so far.
Just wanted to say hello, finally, and thanks for the ongoing education and entertainment!
Hi John!
I’ve wondered whether any of the people from my original blog were still with me. Good to know there’s at least one!
What kind of software do you write? Can you send me a link via the contact page, since I see you didn’t want to include one with your comment? It’s none of my business but that made me curious. {8’>
Anyway, I hope you’ll stick around for more. Writing is fun, but mostly because of readers. Thanks for letting me know you’re out there.