Archival ML

I love living in the current age of machine learning research. We've begun to realize many of the magical creative capabilities people once imagined for the technology (certainly for language, voice, coding, and images), but we're still not sure what an insightful, practical theory of machine learning will look like. (Kuhn's view is that a useful theory helps us select a minimal set of properties from all the possible measurable attributes of a system, allowing the scientific community to drastically reduce the complexity of the problem. For successful theories, the resulting abstractions precipitate immense fecundity, enabling scientists to ask all kinds of new questions and make rapid progress on them.) See these footnotes for some recent papers^[1]^[2]^[3] that are asking some very interesting questions.

Naturally, I've become obsessed with understanding how we got here, so I've been reading old ML papers and piecing together a history of how the ideas we take for granted came to be. Along the way, I've found several pivotal papers that most researchers in my generation have probably heard of but never read themselves. I think of these pieces as "Archival ML". They not only represent an important idea but also herald an impactful change in worldview for the field. Over the next few months, I'm hoping to write a bit about my favorite archival papers, and I hope the posts will get you interested in taking a look at them as well.

The first post will be about Boltzmann Machines, followed by a piece on Shannon's Theory of Communication (Information Theory). Eventually, I'm hoping to work through the lineage of discrete generative models from Markov to Discrete Diffusion. (Turns out the first algorithm that looks like discrete diffusion actually predates GPT and transformers altogether!)

Posts

The Elegance of Restricted Boltzmann Machines and the Effectiveness of an Intuitive but Improper Learning Algorithm (Coming Soon!)

[1] How should we bring the computational or algorithmic properties of the observer into our understanding of the information that can be learned from a dataset? See "From Entropy to Epiplexity".
[2] Should the unit of analysis in stabilizing neural network optimization be the "layer" or "module" instead of the whole network? See the work of Jeremy Bernstein and collaborators on Muon.
[3] How should we compare learning objectives when we know we won't minimize the loss? Is the traditional analysis, which argues that the minimizers are desirable, enough? (Haven't looked into this one in detail yet, but hmu if you have recs!)