far.in.net


~Subscribing to arXiv

The academic feed for completionists

How do you find out about new papers? Do you curate an academic Twitter feed? Do you set up alerts on Google Scholar for specific keywords? Do you keep an eye on the #papers channel in each of the Slacks/Discords you frequent? Do you ask an LLM for literature recommendations?

I don’t systematically do any of those things. Instead, I read arXiv. More specifically, for around a year, I’ve started most days by reading all of the new titles posted to arXiv under a broad set of computer science and machine learning categories.

How crazy is that?

§ArXiv

In principle, subscribing to all arXiv preprints doesn’t seem crazy at all. ArXiv was literally founded around the idea of widely sharing new preprints by email. Today, one of arXiv’s core features remains daily email notifications about new papers posted in each category. (When you submit a paper, you get a preview of how your paper will appear in the listings, namely a distinctively retro fixed-width plain-text rendering of your title and abstract that betrays the age of this feature.)

In practice, subscribing to all arXiv preprints seems totally crazy. The mailing lists were designed in a time when arXiv posted a few preprints per day, and while this is still the case in some fields, machine learning has long since left those days behind. ArXiv posts hundreds of machine learning papers every day. Machine learning is an incredibly broad field, and while my interests are broad, they’re not that broad: around 97 out of every 100 of these papers are not at all relevant to my interests. This is an extremely noisy channel for paper recommendations.

The thing is, not everyone promotes their papers on twitter. And not everyone publishes their relevant ideas under the keywords I am expecting, or cites the papers I know. On the other hand, almost all academic papers that are relevant to my interests are posted as preprints to arXiv. Therefore, if I have the ambition to see every paper that is relevant to my research as soon as it is released, I can’t afford to skip arXiv.

§Logistics

I don’t actually monitor the titles via email subscription. There are no good email clients, so keeping track of my progress through the lists would be too difficult. Besides, this would be poor separation of concerns.

I needed a more streamlined and reliable tool to keep track of which titles I have seen and which I haven’t, and to help me efficiently work through these lists.

So as I started reading, I started building firehose. It’s a fittingly retro command-line Python app with the following features.

Originally, I kept track of which papers I downloaded out of the ones I saw, giving me a personalised data set representing my research interests. More recently, I added more fine-grained logging and going forward I’ll be able to measure how efficiently I scan titles and file them into my reference manager.
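As a rough illustration of this kind of bookkeeping, here is a minimal sketch. It is not firehose's actual implementation: the `SeenLog` class, its field names, and the JSON-lines file format are all assumptions made up for this example. The idea is simply to append one record per title shown, noting whether it was saved, so that the log doubles as a data set.

```python
import json
import time
from pathlib import Path


class SeenLog:
    """Append-only log of titles shown to the reader (hypothetical format)."""

    def __init__(self, path):
        self.path = Path(path)
        self.records = {}  # arXiv id -> most recent record for that id
        if self.path.exists():
            with self.path.open() as f:
                for line in f:
                    record = json.loads(line)
                    # Later lines for the same id override earlier ones.
                    self.records[record["id"]] = record

    def unseen(self, ids):
        """Return the ids from a listing that have not been shown yet."""
        return [i for i in ids if i not in self.records]

    def mark(self, arxiv_id, saved, seconds=None):
        """Record that a title was shown, whether it was saved, and time spent."""
        record = {
            "id": arxiv_id,
            "saved": saved,
            "seconds": seconds,
            "logged_at": time.time(),
        }
        self.records[arxiv_id] = record
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def hit_rate(self):
        """Fraction of seen titles that were saved."""
        if not self.records:
            return 0.0
        saved = sum(r["saved"] for r in self.records.values())
        return saved / len(self.records)
```

With a log like this, a day's session could call `unseen(...)` on the new listing, `mark(...)` each title as it is read, and later use `hit_rate()` or the per-title `seconds` field to measure how the scanning is going.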

I don’t have precise timing information for the first year of usage. My best guess is that I spent something like 5 seconds per title on average, including time to read interesting abstracts when I wasn’t sure from the title, and to download/file away useful papers into my reading list. This suggests that reading about 120k titles posted between mid-April 2025 and late February 2026 took me about 30 minutes per day.
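The back-of-the-envelope estimate above can be reproduced in a few lines. The title count and the 5 seconds per title come from the text; the exact start and end dates are assumptions standing in for "mid-April 2025" and "late February 2026".

```python
from datetime import date

# Figures from the text.
titles = 120_000
seconds_per_title = 5

# Assumed dates for "mid-April 2025" and "late February 2026".
start = date(2025, 4, 15)
end = date(2026, 2, 25)

days = (end - start).days
total_hours = titles * seconds_per_title / 3600
titles_per_day = titles / days
minutes_per_day = titles * seconds_per_title / 60 / days

print(f"{days} days, {total_hours:.0f} hours in total")
print(f"{titles_per_day:.0f} titles/day, {minutes_per_day:.0f} minutes/day")
# 316 days, 167 hours in total
# 380 titles/day, 32 minutes/day
```

Under these assumptions the average works out to roughly 380 titles and a little over half an hour per calendar day, in line with the "about 30 minutes" figure.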

I haven’t been doing this for a few months due to some major deadlines and travel commitments, but I’m resuming now, and I think I can commit to 30 minutes a day of title scanning going forward. That’s comparable to the time I imagine I would mindlessly sink into scrolling twitter feeds if I were a twitter user (and this way, I own my own engagement data).

§Review

After spending so many dozens of hours scanning so many thousands of titles on arXiv, what have I learned? I haven’t systematised the insights enough to put them all into words, but here is a rendition of the ‘vibe’ of reading almost a year of arXiv.

Firstly, I noticed that about 2.5 percent of the papers I saw seemed relevant enough to my research interests to file in my reading list. While there are vastly more irrelevant papers, this still does seem to me to be a large number of hits. I seem to have particularly broad interests compared to other researchers. It’s a bit hard to describe exactly what those interests are, but I am starting to get a clearer picture of them through this process, and I have a definitive data set to mine for insights in the future as well.

Even for the bulk of papers I am not interested in, I get a fairly visceral sense of the volume and depth of work in modern machine learning. I watch in real time as researchers rapidly flock after new topics (KANs, AI scientists, social media analyses of moltbook). This all seems important for my intellectual development in ways that (again) I’m not yet able to articulate. This benefit comes from the noise more than from the signal.

Moreover, there have been some exciting finds. Sometimes, days after having a conversation with a colleague about their research ideas, I’ll see a paper posted that is highly relevant to their topic, and immediately forward it to them. When such a paper comes from outside the socio-academic bubble I inhabit with my colleagues, they might not have found out about it otherwise. It’s rewarding to be the one to import such papers into my group. It’s also nice to see names of colleagues show up on the feed.

Finally, using this tool is a pretty interesting psychological experience. When using the tool, I feel really powerful and like nothing could surprise me. It’s also something easy to build a habit around, providing a nice rewarding feeling when I get to the end of the day’s new papers. I have found that even when I don’t have the time or energy to start deep work, I can often still manage to start up firehose and read the day’s papers—an instance of structured procrastination.

§Limitations

Subscribing to arXiv isn’t all you need. There are a few ways in which arXiv is flawed as a “completionist academic feed.”

Not all relevant academic work is posted to arXiv.

If I want to be truly completionist, I should expand firehose to include web journals, blogs, twitter profiles, historical conference proceedings and journals. I’m considering expanding the tool in this direction, into a unified feed of multiple sources of academic work, present and historical. In the meantime, I’m not too bothered by this. At a field level, senior researchers already have the historical perspective, and many people are already on twitter. On the other hand, I don’t know anyone else in my field who attempts to do what I’m doing with arXiv. (I also try to scan mindfully, and I sometimes do leave my cloister and talk with colleagues.)

I already mentioned the noise issue. I think overall right now it’s tolerable. However, it’s trending in the direction of more noise going forward, as the field expands naturally and, more recently, with notable increases in research automation. ArXiv is trying to filter out pure slop and hold back the tide, but it’s currently an uphill battle.

On this front, I think I can hold out a little longer, but not forever. Eventually, I might have to fight fire with fire. After amassing a sizeable personal data set representing my interests, it may be possible to train my own filters. With control over the model, I can prioritise recall over precision for the most part, but I might be willing to sacrifice a little recall to cut out the majority of the noise.
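To make the recall/precision trade-off concrete, here is a minimal sketch under stated assumptions. It supposes some hypothetical model has already assigned each past title a relevance score, and that the titles I filed are labelled 1 and the rest 0. The function names and the toy data are invented for illustration; this is not an existing filter.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold that keeps at least `target_recall` of the relevant
    titles (label 1), where a title is kept if its score >= threshold."""
    relevant = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not relevant:
        raise ValueError("need at least one relevant example")
    # Number of relevant titles that must survive the filter.
    keep = max(1, math.ceil(target_recall * len(relevant)))
    return relevant[keep - 1]


def evaluate(scores, labels, threshold):
    """Return (recall, precision, fraction of irrelevant titles removed)."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    n_relevant = sum(labels)
    n_noise = len(labels) - n_relevant
    hits = sum(kept)
    recall = hits / n_relevant
    precision = hits / len(kept)
    noise_removed = 1 - (len(kept) - hits) / n_noise if n_noise else 0.0
    return recall, precision, noise_removed


# Toy data: ten titles, four of which were filed (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for target in (1.0, 0.75):
    t = threshold_for_recall(scores, labels, target)
    recall, precision, noise_removed = evaluate(scores, labels, t)
    print(f"target={target}: threshold={t}, recall={recall:.2f}, "
          f"precision={precision:.2f}, noise removed={noise_removed:.2f}")
```

On this toy data, insisting on full recall removes only half of the irrelevant titles, while accepting 75 percent recall removes five out of six. Real numbers would depend entirely on how good the scores are.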

Finally, reading all the titles in the world isn’t going to produce great research. I have so far had great success in building a habit around the shallowest stage of the research pipeline. To actually make progress, I still need to invest in the deeper parts: spending long hours poring over equations and plots, and launching my own experiments.

But I wouldn’t overstate this concern. I’m not just reading titles, after all. When I find a relevant title, I file it into my reference manager. Over time, that reference manager has grown to include thousands of papers. How could I ever read so many papers? That’s the wrong question—this isn’t a reading list, it’s an antilibrary. When I tend carefully to its growth, finding precisely the right place for each new title, I’m forging new mental connections and abstractions across my corner of the vast computer science and machine learning literatures. Once I finish filing away new titles each morning, I’m free to spend the rest of my day putting this perspective to use.

§Conclusion

Would I recommend subscribing to arXiv? Not lightly. It’s certainly not for the faint of heart. If you prefer precision over recall, you might want to use another tool. But while reading hundreds of titles every day can be somewhat exhausting, out of all the methods I know, it also comes the closest to being exhaustive. And so far, I’m enjoying the results. So, I’m going to keep at it a little longer.