Reddit’s API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn’t touch Reddit’s servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
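As a rough illustration of the dump-to-static-HTML step (not redd-archiver's actual code), the sketch below reads newline-delimited JSON, which is what Pushshift dumps contain once decompressed, and writes one HTML page per subreddit plus an index. The function name `render_archive`, the output layout, and the choice of fields are assumptions made for this example. Decompression is left out; in practice the input would come from a streaming zstd reader such as the third-party `zstandard` package.

```python
import html
import json
from collections import defaultdict
from pathlib import Path


def render_archive(lines, out_dir):
    """Write one static HTML page per subreddit plus an index page.

    `lines` is any iterable of NDJSON strings (one post per line).
    Hypothetical helper for illustration, not part of redd-archiver.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Group posts by subreddit. This holds everything in memory, which is
    # fine for a sketch but not for multi-gigabyte dumps.
    posts = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        posts[post.get("subreddit", "unknown")].append(post)

    for sub, items in posts.items():
        # Highest-scored posts first; escape every user-supplied string.
        items.sort(key=lambda p: p.get("score", 0), reverse=True)
        rows = "\n".join(
            "<li>{} <small>by {} ({} points)</small></li>".format(
                html.escape(p.get("title", "")),
                html.escape(p.get("author", "[deleted]")),
                int(p.get("score", 0)),
            )
            for p in items
        )
        page = "<!doctype html>\n<h1>r/{}</h1>\n<ul>\n{}\n</ul>\n".format(
            html.escape(sub), rows
        )
        # Assumes subreddit names are safe to use as file names.
        (out / f"{sub}.html").write_text(page, encoding="utf-8")

    links = "\n".join(
        '<li><a href="{0}.html">r/{0}</a></li>'.format(html.escape(s))
        for s in sorted(posts)
    )
    (out / "index.html").write_text(
        f"<!doctype html>\n<ul>\n{links}\n</ul>\n", encoding="utf-8"
    )
    return sorted(posts)
```

The output is plain HTML with relative links, so the resulting folder can be opened straight from disk with no server and no JavaScript.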

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives
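For the home-server option, any static file server will do. As a minimal sketch using only Python's standard library (the helper name `serve_archive` and the default port are assumptions, and this is not how redd-archiver itself serves pages), the generated folder can be exposed like this:

```python
import functools
import http.server
import threading


def serve_archive(directory, host="127.0.0.1", port=8000):
    """Serve a folder of static HTML from a background thread.

    Returns the server object; call .shutdown() to stop it.
    Hypothetical helper for illustration only.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Passing `host="0.0.0.0"` would make the archive reachable from other machines on the LAN. Python's built-in server is meant for local use, so a VPS deployment would more sensibly sit behind a proper web server.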

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
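One common way to keep memory flat while loading a large dump is to stream it into the database in fixed-size batches. The sketch below shows that pattern under stated assumptions: `sqlite3` stands in for PostgreSQL so the example runs without a server, and the table layout and the names `batched` and `load_posts` are hypothetical rather than taken from redd-archiver.

```python
import itertools
import json
import sqlite3


def batched(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk


def load_posts(lines, conn, batch_size=1000):
    """Insert NDJSON posts in fixed-size batches; returns rows processed.

    Only one batch is held in memory at a time, so peak memory depends on
    `batch_size`, not on the size of the dump.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(id TEXT PRIMARY KEY, subreddit TEXT, title TEXT)"
    )
    total = 0
    for chunk in batched(lines, batch_size):
        rows = [
            (p["id"], p.get("subreddit"), p.get("title"))
            for p in map(json.loads, chunk)
        ]
        # Duplicate ids are skipped rather than raising an error.
        conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?)", rows)
        conn.commit()
        total += len(rows)
    return total
```

Splitting by topic then amounts to filtering `lines` on the `subreddit` field before loading, with one database per instance.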

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is “trust but verify” – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

  • Seefoo@lemmy.world · 3 days ago · 1 up

    Does this decompress the files preemptively and leave them? Or is it only decompressing as a post/subreddit is accessed? Basically, I am wondering what kind of storage footprint would be required to search through this.

  • breakingcups@lemmy.world · 10 days ago · 83 up, 2 down

    Just so you’re aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

    Not to detract from your project, which looks cool!

      • Melvin_Ferd@lemmy.world · 9 days ago · 15 up, 34 down

        You’re awesome. AI is fun and there’s nothing wrong with using it, especially how you did. Lemmy was hit hard with AI hate propaganda. China probably trying to stop its growth and development in other countries or some stupid shit like that. But you’re good. Fuck them

  • offspec@lemmy.world · 9 days ago · 41 up, 2 down

    It would be neat for someone to migrate this data set to a Lemmy instance

    • TeddE@lemmy.world · 9 days ago · 16 up

      It would be inviting a lawsuit for sure. I like the essence of the idea, but it’s probably more trouble than it’s worth for all but the most fanatic.

      • floquant@lemmy.dbzer0.com · 8 days ago · 7 up

        Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can “own it” just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

        Of course, it would be dumb for someone under US jurisdiction, but we’ll see how much an international DMCA claim is worth considering the current relations anyway.

        • TeddE@lemmy.world · 8 days ago · 3 up, 1 down

          They don’t own it; the individual posters own the content of their own posts. However, from the Reddit terms of service:

          When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.

          And with each of those rights granted, Reddit’s lawyers can defend those rights. So no, they don’t own it “just because they ran the servers” - they own specific rights to copy granted to them by each poster.

          (I don’t like this arrangement, but ignorance of the terms of service isn’t going to help someone who uploaded a full copy of works that Reddit has extensive rights to.) On this subject, I think there needs to be an extensive overhaul to narrow what terms you can extend to the general public. The problem is I straight up don’t trust anyone currently in power to make such a change to have our interests in mind.

  • SteveCC@lemmy.world · 10 days ago · 27 up

    Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

  • Tanis Nikana@lemmy.world · 10 days ago · 22 up, 1 down

    Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

    Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

  • 19-84@lemmy.dbzer0.com (OP) · 10 days ago · 21 up, 1 down

    PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

    • muusemuuse@sh.itjust.works · 9 days ago · 5 up

      You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

        • muusemuuse@sh.itjust.works · 9 days ago · 1 up

          What would they say? It’s information that’s freely available, no payment required, no accounts to simply read it, no copyrights. Where’s the legal issue in hosting a duplicate of the content?

          • El Barto@lemmy.world · 8 days ago · 1 up

            Oh I agree with you, friend. The problem is that they’ll say that they’re losing ad revenue. So they’ll try and sue, even if they’re in the wrong.

          • limelight79@lemmy.world · 9 days ago · 1 up

            It might fall under the same concept that recipes do - you can’t copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

            In any case, they have a lot more money to pay lawyers than you or I do, I’ll bet, so even if you are right, that doesn’t mean you’ll have the money to actually win.

    • 19-84@lemmy.dbzer0.com (OP) · 10 days ago · 9 up

      redarc uses React to serve the web app; redd-archiver uses a hybrid architecture that combines static page generation with Postgres search via Flask. It is more like a hybrid static site generator with web app capabilities through Docker and Flask. The static pages with sorted indexes can be viewed offline and served on hosts like GitHub Pages and Codeberg Pages.

        • 19-84@lemmy.dbzer0.com (OP) · 9 days ago · 2 up

          redd-archiver will take up more disk space because the database exists along with the static html

  • inspxtr@lemmy.world · 8 days ago · 3 up

    Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit to one over the other?