Comments (41)

  • neilv
    That imprecise chunking (with MD5 sums, distributed across an FS tree, using hash IDs, then separately storing the assembly information) seems like it would be a headache to restore, or to use with other mail programs.
    An alternative idea is to properly parse it and store it in smaller mbox files, such as one file per month, with the idea that any month in the past usually will not change. (And if it changes because they are storing frequently changing attributes in a faux header, like an atime, then maybe strip that header.) Then your incremental `restic` backups work fine, and you can also use it easily with a variety of mail programs (MUAs, impromptu IMAP servers for migration, quick text editor, etc.).
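A minimal sketch of the one-mbox-per-month idea, using Python's standard `mailbox` module. The function names `month_key` and `split_by_month`, the `YYYY-MM.mbox` naming, and the `undated` bucket are assumptions made for illustration, not anything from the article or this comment. Messages are sorted by their raw `Date` and `Message-ID` strings so that the arbitrary ordering of a Takeout export does not change the output files; that ordering is lexical rather than chronological, which is enough for stability.

```python
import mailbox
from collections import defaultdict
from email.utils import parsedate_to_datetime
from pathlib import Path


def month_key(msg) -> str:
    """Return 'YYYY-MM' from the Date header, or 'undated' if unusable."""
    raw = msg.get("Date")
    if raw is None:
        return "undated"
    try:
        return parsedate_to_datetime(str(raw)).strftime("%Y-%m")
    except (TypeError, ValueError):
        return "undated"


def split_by_month(src: Path, out_dir: Path) -> dict[str, int]:
    """Write one mbox file per month and return the message count per month."""
    out_dir.mkdir(parents=True, exist_ok=True)
    box = mailbox.mbox(str(src), create=False)
    index = defaultdict(list)
    try:
        # First pass: only remember keys, so the whole export never sits in memory.
        for key in box.keys():
            msg = box[key]
            sort_key = (str(msg.get("Date", "")), str(msg.get("Message-ID", "")))
            index[month_key(msg)].append((sort_key, key))
        # Second pass: copy the raw bytes (including the "From " line) in a stable order.
        for month, entries in index.items():
            with open(out_dir / f"{month}.mbox", "wb") as out:
                for _, key in sorted(entries):
                    out.write(box.get_bytes(key, from_=True))
                    out.write(b"\n")  # blank line separating mbox messages
    finally:
        box.close()
    return {month: len(entries) for month, entries in index.items()}
```

Calling `split_by_month(Path("takeout.mbox"), Path("by-month"))` rewrites every month file on each run; past months only stay byte-identical if the export itself did not change those messages, which is where the faux-header caveat above comes in.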
  • csb6
    Have you looked into using a full MIME/mbox parser library, e.g. GMime [0] or MimeKit [1]? Both support parsing mbox files directly, and they should be able to handle the intricacies of parsing any messages/attachments you throw at them. Then you could write out the MIME representation of each message (including any attachments) into its own file and then check for new messages. That way you can be sure each “chunk” represents a single message in its entirety. Not sure if this is any better since your solution seems to work pretty well.
    [0] https://github.com/jstedfast/gmime
    [1] https://github.com/jstedfast/MimeKit
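The "one file per message, then check for new messages" approach could look roughly like the sketch below. It deliberately swaps in Python's standard `mailbox` module instead of GMime or MimeKit, and the function name `export_new_messages`, the `.eml` extension, and the SHA-256 file naming are all assumptions for illustration. Naming each file after a hash of its bytes makes "is this message new?" a simple existence check.

```python
import hashlib
import mailbox
from pathlib import Path


def export_new_messages(src: Path, out_dir: Path) -> list[str]:
    """Write each message to <sha256>.eml and return the digests newly added."""
    out_dir.mkdir(parents=True, exist_ok=True)
    added = []
    box = mailbox.mbox(str(src), create=False)
    try:
        for key in box.keys():
            # Raw message bytes without the mbox "From " separator line.
            raw = box.get_bytes(key)
            digest = hashlib.sha256(raw).hexdigest()
            target = out_dir / f"{digest}.eml"
            if not target.exists():
                target.write_bytes(raw)
                added.append(digest)
    finally:
        box.close()
    return added
```

One caveat of this design: the hash covers the headers too, so if the export rewrites any header between runs, the same mail shows up as a new file.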
  • marwis
    Isn't it easier to just back up via IMAP to a maildir versioned with git?
    Does Takeout include any metadata not accessible via IMAP? Does it even include labels?
  • jinnko
    I haven't used it for a while, but imapsync [0] still supports Gmail. With an approach like this you can regularly sync and get your messages in a standard format. Plus you don't have to wrangle those 50GB takeout dumps.
    [0] https://imapsync.lamiral.info/FAQ.d/FAQ.Gmail.txt
  • hasperdi
    I just set up Gmail backup the other day, using getmail + cron. The emails get stored as maildir (1 mail = 1 file). It's incremental-backup friendly.
  • pbhn
    Gmail takeouts come in an arbitrarily-ordered mbox file; I wanted something a bit more backup friendly so I created a small tool for that purpose and wrote about it.
  • jokoon
    I realized my inbox takes up a lot of space: even after a manual cleanup it was still 5GB, despite regularly removing automated mail and other things.
    I tried using Takeout to get a more accurate listing. I thought I could open it with Thunderbird, but that failed. I then tried to open it with some Python lib, which also failed.
  • jsrozner
    What about using the Gmail API and listening for recent changes? I suppose it wouldn't be in a mailbox format that could be easily exported to another provider, though?
  • jl6
    I have the same requirement and I solved it as follows:
    Apply a label to emails dated after the last backup, using an “after:YYYY-MM-DD” search. Takeout then offers the option to export only that label. I do an annual backup so the amount of manual effort here is acceptable.
  • yooogurt
    > if you want to back this file up regularly with something like restic, then you will quickly end up in a world of pain: since new mails are not even appended to the end of the file, each cycle of takeout-then-backup essentially produces a new giant file.
    As I'm sure the author is aware, Restic does hash-based chunking so that similar files can be backed up efficiently.
    How similar are two successive Takeout mboxes?
    If the order of messages within an mbox is stable, and new emails are inserted somewhere, the delta update might be tiny.
    Even if the order of the mbox's messages is ~random, Restic's delta updates will skip large attachments it has already stored.
    It would be great to see empirical figures here: how large is the incremental backup after a month's emails? How does that compare for each backup strategy?
    The pro of sticking with restic is simplicity, and also avoiding the risk of your tool managing to screw up the data.
    This risk isn't so bad if it's a mature tool that canonicalises mboxes (e.g. orders them by time), but seems risky for something handrolled.
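The claim that an insertion in the middle of a stable file yields a tiny delta can be illustrated with a toy content-defined chunker. This is not restic's actual algorithm or its parameters; it is a gear-style rolling hash with an assumed average chunk size of about 1 KiB, and `GEAR`, `MASK`, `chunk`, and `shared_fraction` are names invented for this sketch. Because the hash at each position depends on at most the previous 32 bytes, cut points after an insertion line up again almost immediately.

```python
import hashlib
import random

# Fixed pseudo-random table so the chunker is deterministic across runs.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
MASK = (1 << 10) - 1  # cut when the low 10 bits are zero: ~1 KiB average chunks


def chunk(data: bytes) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the recent bytes."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left drops any byte older than 32 positions out of the hash.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of the chunks of `new` whose hash already appears in `old`."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    new_chunks = chunk(new)
    reused = sum(1 for c in new_chunks if hashlib.sha256(c).digest() in known)
    return reused / len(new_chunks)
```

Feeding it a buffer and a copy with a few hundred bytes inserted in the middle shows that nearly all chunks are reused. A fully reshuffled mbox behaves differently, since every message boundary becomes a potential new chunk, which is why the empirical figures asked for above would be worth having.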
  • PunchyHamster
    You can just unpack it to file-per-email (the format used by most sane email clients). There are dozens of programs to do it; one, called mb2md, is included in Debian (and so in most other distros).
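For readers without mb2md at hand, the same unpacking can be sketched with Python's standard `mailbox` module. This is a stand-in for illustration, not how mb2md works, and the function name `mbox_to_maildir` is made up here.

```python
import mailbox
from pathlib import Path


def mbox_to_maildir(src: Path, dest: Path) -> int:
    """Copy every message from an mbox file into a Maildir; return the count."""
    box = mailbox.mbox(str(src), create=False)
    maildir = mailbox.Maildir(str(dest), create=True)
    count = 0
    try:
        for msg in box:
            # MaildirMessage carries over mbox status flags where they map.
            maildir.add(mailbox.MaildirMessage(msg))
            count += 1
    finally:
        maildir.close()
        box.close()
    return count
```

Note that `Maildir.add` generates a fresh unique file name for each message, so running this twice on the same export duplicates everything. For repeated Takeout exports, some deduplication step (for example by `Message-ID` or content hash) would still be needed.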
  • pabs3
    Why doesn't Google use zip/tar of a Maildir instead? Much better format than mbox. Converting the mbox to Maildir using standard tools would work too.
  • tehlike
    Wouldn't it be nice if Google just dumped the takeout into a sqlite file?
  • Intralexical
    `zpaq add archive.zpaq new.mbox -fragment 0 -method 3` is great for this. It splits the input into fragments averaging 1024 bytes in size [0], which catches up to ~90% of redundancy. The remaining ~10% is packed and compressed into 64MB (max) blocks that are added to the .zpaq.
    The resulting artifact is a single .zpaq file on disk. This file is only ever appended to, never overwritten, so it plays nice with Restic's own chunked deduplication. Plus it won't flood the filesystem with inodes, and it suffers less small-file overhead than TFA's solution.
    Granted, I suspect TFA splitting on the e-mail headers may be chunking more efficiently. Though, unless I skimmed the linked GitHub too fast, it looks like TFA's solution also doesn't use any solid compression to exploit redundancy across chunks. And I trust zpaq as a general-purpose tool more than a one-off just for a single use case. The code does look clean, though, nice work.
    [0] Average fragment size is 1024*2^N. If most of the data is attachments that don't change, you can probably use a higher `-fragment N` to have less overhead keeping track of hashes. `-method 3` is a good middle ground for backups. `-m5` gets crazy high compression ratios, but also crazy slow speed. Old versions of ingested files are shadowed by default; use `-all` when you want to list/extract them.
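To put the footnote's trade-off in numbers, here is a small arithmetic sketch that takes the comment's formula (average fragment size 1024*2^N) at face value. The 50 GiB archive size is an assumption borrowed from the dump size mentioned elsewhere in this thread, and `avg_fragment_size` and `fragment_count` are names made up for the example; nothing here calls zpaq itself.

```python
def avg_fragment_size(n: int) -> int:
    """Average fragment size in bytes for `-fragment n`, per the formula 1024 * 2**n."""
    return 1024 << n


def fragment_count(total_bytes: int, n: int) -> int:
    """Approximate number of fragments (and hashes to track), rounded up."""
    size = avg_fragment_size(n)
    return (total_bytes + size - 1) // size


ARCHIVE_BYTES = 50 * 2**30  # assumed 50 GiB export, for illustration only

for n in (0, 3, 6):
    print(f"-fragment {n}: ~{avg_fragment_size(n)} B per fragment, "
          f"~{fragment_count(ARCHIVE_BYTES, n):,} fragments")
```

Under these assumptions, going from `-fragment 0` to `-fragment 6` cuts the number of hashes to track from roughly 52 million to under a million, at the cost of coarser deduplication.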
  • Brajeshwar
    For emails, here is my current simple backup setup. Of course, I’m also looking to do this without having to open Thunderbird, or I might have an old laptop running it. So, work-in-progress.
    For the email accounts I want a backup, I set it to spew out POP3 without doing anything (don’t mark read or delete). I set up Thunderbird with that POP3. It has a backup copy of all the emails. I’ve had searchable emails since like 2004/2005, and I’ve occasionally replied to people and gotten back in touch with very old friends from the Internet.
    I saw an open-source tool sometime back (I think, here on Hacker News) that backs up your IMAP mails with a nicely done interface. That would be nice to have.
    Edit: Perhaps Bichon [1], mentioned somewhere in the other comment threads [2], was the one.
    1. https://github.com/rustmailer/bichon
    2. https://news.ycombinator.com/item?id=46429250
  • SanjayMehta
    Serious question: have you ever needed an email from even 5 years ago?
    I only save financial statements and contact information. Everything else gets deleted as soon as possible.