"Survey" results: how people back up their data
I wanted to gather some more requirements for Obnam, and so talked to a few friends. Despite the title, it wasn't an actual survey: the conversations were informal, and I had no prepared questions. Unfortunately, none of those friends see a need for a backup solution. I still think we can draw a few insights from this, so here are the results.
Of the 7 people I talked to, only one is concerned about accidental deletion of data. For the others, the only worry is hardware loss (e.g. bad disk sectors or theft). Thus, they all keep only a "latest" copy of their data, and don't see a need for Obnam generations.
The one person who is concerned about accidental deletion explained that even though they don't mitigate this risk, they're prepared for it: they have a filesystem recovery tool, and they're comfortable using it if they delete something important.
3 people rely on automatic synchronization to the cloud, for at least a part of their data.
2 people rely on spreading their data across multiple devices they own. One does it with Syncthing, the other by simply leaving data around.
1 person has a semi-automated system where photos are synchronized to the cloud, but some other stuff is manually copied to external HDDs from time to time. They explicitly stated that they prune data before copying it, because they're too lazy to configure exclusion rules.
1 person has a RAID array, and they're conscious that an array is not a backup.
2 people dragged me into a long discussion about building reliable systems (like backup solutions) out of unreliable parts (like disks, which are known to go bad), and the cost of doing so (buying a couple of HDDs vs. renting space in the cloud). Their argument is that backups don't provide enough benefit to justify the time, money and effort required.
I didn't ask all seven (my bad!), but of those I did ask, all said that they had previously lost data to hardware failure or accidental deletion.
All those people are programmers. All mentioned that their code is stored on public or private Git servers, and I got the impression that this covers a sizable portion of backup-worthy data for them.
Insights and somewhat actionable points
It's fair to say that the surveyed group is not the target audience for Obnam. However, I think they are still indicative of what alternatives Obnam competes against, even if not directly.
The cloud is perceived as "good enough". Perhaps our docs can tackle this head-on by listing some of the differences between the cloud and Obnam: upfront investment vs. renting, local disks being faster to recover from, the lack of monitoring when using automatic synchronization as opposed to a backup that you run manually, and the possibility of corrupt data getting synced to the cloud and nullifying the benefit of that backup copy. The goal isn't to paint Obnam as universally better, but to provide a list of points worth considering.
This also gives us a yardstick for "ease of use": if using Obnam is just as easy as renting some space on Google cloud and configuring their software, some people might try Obnam.
People assume they have to cherry-pick the data to back up. Even before I asked how much data they have, people told me how much data of each kind they have: photos, ebooks, work data. I might be reading too much into this, but I think it indicates that people under-appreciate how fast modern storage is, and how efficient de-duplication and compression can get. Another explanation is that cloud storage (which 3 out of 7 use) is more limited and/or costly than buying HDDs. Perhaps our docs could explicitly state that backing up the entire home directory is the correct approach. (Or is this just my opinion, which others don't necessarily share?) We could also mention some average gains from compression and deduplication, and give numbers for how fast the backup is. These would not be promises, just examples to adjust expectations.
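To make the "gains from compression and deduplication" point concrete, here is a rough sketch of how such numbers could be estimated for a set of files. It is only an illustration, not how Obnam does it: the function name `estimate_savings`, the fixed-size chunking, the 1 MiB default chunk size, and the choice of SHA-256 and zlib are all assumptions made for this example.

```python
import hashlib
import zlib


def estimate_savings(blobs, chunk_size=1 << 20):
    """Return (raw_bytes, stored_bytes) for an iterable of bytes objects.

    Hypothetical helper: splits each blob into fixed-size chunks,
    stores each distinct chunk once, and compresses it with zlib.
    """
    raw = 0
    stored = 0
    seen = set()  # digests of chunks we have already "stored"
    for data in blobs:
        for start in range(0, len(data), chunk_size):
            chunk = data[start:start + chunk_size]
            raw += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest in seen:
                continue  # duplicate chunk: costs nothing extra
            seen.add(digest)
            stored += len(zlib.compress(chunk))
    return raw, stored
```

Feeding it the contents of a home directory would give a raw-vs-stored figure that the docs could quote as an example, with the caveat that real results depend heavily on the data and on the chunking scheme actually used.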
On the flip side, we could provide tools to help cherry-pick data. The person who said they're too lazy to configure exclusion rules was probably thinking of going through their data looking for often-changing subdirectories that they don't want to back up. Can we automate this? Perhaps with some visualization tools?
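As a starting point for that discussion, here is a minimal sketch of what "finding often-changing subdirectories" could look like. It is not an existing Obnam feature; the function name `churn_by_subdir`, the 7-day window, and the use of file modification times as a proxy for churn are assumptions made for illustration.

```python
import os
import time
from collections import defaultdict


def churn_by_subdir(root, days=7, now=None):
    """Rank top-level subdirectories of `root` by recently modified bytes.

    Hypothetical helper: sums the sizes of files whose mtime falls
    within the last `days` days and returns (subdir, bytes) pairs,
    largest first. Files directly in `root` are counted under ".".
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_mtime >= cutoff:
                totals[top] += st.st_size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Running this over a home directory would put caches and build directories near the top of the list, which a user could then review as candidates for exclusion rules. A visualization could be built on the same numbers.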
Backups are perceived as too difficult. In #86, I describe a hypothetical newbie who has no idea how to do backups. In reality, it seems, people have preconceived notions, and so our docs would have to work against those. I think our docs should have some case studies of plausible setups that people can relate to.
Lack of clarity on usefulness of backups. I might have been elaborately trolled, but it appears that at least some people don't have a model of what risks a backup can (and can't) mitigate. Perhaps our docs should briefly explain that, so people don't have to learn about disaster recovery planning in order to understand the pros and cons. I think this can be folded into case studies, too.