
John Lam's Blog


Day 2 of fastai class.

The problem with p-values

Ways of determining whether a relationship would happen by chance?

Independent variables Dependent variables

One way of doing this is by simulation.

Another way to do this is by looking at the p-value, i.e., the probability of an observed or more extreme result assuming that the null hypothesis is true. See the Wikipedia article on p-value.
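To make the simulation idea concrete, here is a minimal sketch of a permutation test: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the observed one. The function name and the numbers are made up for illustration; this is not code from the class.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a one-sided p-value for mean(group_a) - mean(group_b) by shuffling labels."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are arbitrary, so reshuffle them
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Made-up measurements, purely for illustration
treated = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5]
control = [4.2, 4.8, 4.5, 5.0, 4.1, 4.4]
p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

Note that a small value here only says the difference is unlikely under shuffling; it says nothing about whether the difference is big enough to matter.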

Unfortunately, p-values are not useful. The bottom line is that a p-value doesn't say anything about the importance of a result.

See Frank Harrell's work.

p-values are a part of machine learning.

The outcome that we want from our models is whether they predict a practical result or outcome. In Jeremy's critique of the temperature-R relationship paper, he's

Another way to look at this set of problems is through the lens of outcomes that you want from your model. He has a 4 step process. In the example, he's looking at the objective of maximizing the 5 year profits of a hypothetical insurance company. Next he looks at the levers, i.e., what you can control, which in the case of insurance is the price of a policy for an individual. Next he looks at the data that can be collected, i.e., the revenues and claims and the impact those have on the profitability. He then ties the first three things (objective, levers, and data) together in a model which learns how the levers influence the objective. This is all discussed in this article that he wrote in 2012.
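A toy sketch of how objective, lever, and model fit together, under loud assumptions: purchase_probability is a hypothetical stand-in for a model learned from data, and the claim cost and price range are invented numbers. The point is only that the model feeds an optimization over the lever instead of being the end product.

```python
def purchase_probability(price):
    """Hypothetical stand-in for a learned model: chance a customer buys at this price."""
    return max(0.0, 1.0 - price / 2000.0)

def expected_profit(price, expected_claims=600.0):
    """Objective: expected profit per quote = P(buy) * (price - expected claims)."""
    return purchase_probability(price) * (price - expected_claims)

# Lever: the price we quote. Search the candidate prices for the best objective value.
candidate_prices = range(600, 2001, 50)
best_price = max(candidate_prices, key=expected_profit)
print(best_price, round(expected_profit(best_price), 2))
```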

"Using data to produce actionable outcomes", i.e., don't just make a model to predict crashes, instead have the model optimize the profit.

Deploying the bear detector (black, grizzly, teddy) reminds me of Streamlit. It might be a really interesting exercise to build the bear model from the class and deploy it as a local streamlit app (and perhaps think about what it would take to deploy it to Azure as well).

A great tribute by Steven Sinofsky to the Apple Bicycle for the Mind. It would be great to create a modern poster for this someday.


Today, I'm starting to work on the 4th edition of the fastai course, and I'll be note-taking on this blog. At the start Jeremy shows this lovely photo of the Mark 1 perceptron at Cornell circa 1961. It does a great job at showing the complexity of the connections in a neural network:

A few housekeeping things that are pretty cool and that intersect with my day job at Microsoft building tools for data scientists.

I'm using my ez tool to run the fastai notebooks locally on my Windows machine, which has an RTX2080 GPU. All I need to do is run a single command:

$ ez env go [email protected]:jflam/fastai -n .

You'll notice that I'm using my own fork of the fastai repo which contains only a single configuration file: ez.json that ez uses to build and run the image in VS Code. This is the entire contents of the ez.json file:

    {
        "requires_gpu": "true",
        "base_container_image": "fastdotai/fastai:latest"
    }

After running that command, this is what ez created on my machine:

It's a fully functional running Docker container with GPU support enabled, and VS Code is bound to it using the VS Code Remote Containers extension. There are a fair number of manual steps that would have been needed to get this running, and ez eliminates the need to do any of that - you get straight to the course in a running local environment.

I'm also using GitHub Codespaces to view the notebooks for the book version of the course. All you need to do is go to the fastbook GitHub repo and press the . key to open up Codespaces in the browser to view the contents of the notebook:

Of course, if you are able to, please support the authors by purchasing a copy of the book.

Jeremy introduces his pedagogy for the class at the start, based on the work of David Perkins. I love this image:

Begin by "building state-of-the-art world-class models" as your Hello, World.

Jeremy and Sylvain wrote a paper on the fastai library which is the layer of software on top of PyTorch that is used in the course.

While following along in the first video of the course, I realized that Jeremy is running some of the cells in the notebooks from the fastbook repo. So I forked and copied the ez.json file into that repo and was quickly able to reproduce his results using ez:

ez env go -g [email protected]/jflam/fastbook -n .

A quick note - the -n . parameter tells ez to run the repo locally. ez also supports running on Azure VMs using the same command. See the ez docs for more details on the setup on Azure (it's only 2 additional commands!)

This is a screenshot from my computer this morning after running the first model in the course. You can see that I'm using the Outline feature in VS Code notebooks to see an outline of the different sections from the first chapter of the book:


  • a classification model predicts one of a number of discrete possibilities, e.g., "dog" or "cat"
  • a regression model predicts a numeric (continuous?) quantity

valid_pct=0.2 in the code means that it holds back 20% of the data to validate the accuracy of the model. This is also the default in case you forget to set it
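As a rough illustration of what holding back 20% means, here is a minimal sketch of a random split. The random_split helper and the file names are hypothetical, and this is not how fastai implements it internally; it only shows the idea.

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle indices and hold back valid_pct of the items for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(items) * valid_pct)
    valid = [items[i] for i in indices[:n_valid]]
    train = [items[i] for i in indices[n_valid:]]
    return train, valid

# Hypothetical file names standing in for a real dataset
files = [f"img_{i}.jpg" for i in range(100)]
train, valid = random_split(files)
print(len(train), len(valid))  # 80 20
```

Fixing the seed keeps the validation set the same from run to run, so a change in accuracy reflects a change in the model and not a different split.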

a learner contains your data, your architecture and also a metric to optimize for:

learn = cnn_learner(dls, resnet34, metrics=error_rate)

An epoch is looking at every item in the dataset once

accuracy is another function which is defined as 1.0 - error_rate

error_rate and accuracy are not good loss functions. That's a bit counter-intuitive, as those are the values that humans care about, but the loss function is what gets used to tune the parameters in the model across epochs, and these metrics turn out to be poor at that job
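One way to see why accuracy makes a poor loss function: it only changes when a prediction crosses the decision threshold, so most small parameter changes leave it flat, while a loss like cross-entropy responds to every change in confidence. A minimal sketch with made-up predictions (the helper names are mine, not fastai's):

```python
import math

def accuracy(probs, labels, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    correct = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p if y else 1 - p) for p, y in zip(probs, labels)) / len(labels)

labels = [1, 0, 1, 0]
before = [0.60, 0.40, 0.55, 0.45]  # barely right
after = [0.90, 0.10, 0.85, 0.15]   # confidently right

# Accuracy cannot tell these apart; the loss can.
print(accuracy(before, labels), accuracy(after, labels))
print(cross_entropy(before, labels), cross_entropy(after, labels))
```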

fastai uses the validation set to determine the accuracy or error rate of a model

Remember that a model is parameters + architecture

Training, validation, test datasets - this is used for things like Kaggle

Performance on the test set is withheld while you build the model, so that you can avoid overfitting to the validation set as well

In a time series you can't really create a validation dataset by random sampling, instead you need to chop off the end, since that's really the goal - to predict the future, not make predictions at random points in the past. We need to have different sampling algorithms based on the nature of the data and the predictions desired.
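A minimal sketch of the "chop off the end" idea, assuming the rows are already sorted by time. The time_series_split helper and the sales data are made up for illustration.

```python
def time_series_split(rows, valid_pct=0.2):
    """Keep chronological order: train on the earliest rows, validate on the latest."""
    cutoff = len(rows) - int(len(rows) * valid_pct)
    return rows[:cutoff], rows[cutoff:]

# Made-up (day, value) pairs, already in chronological order
daily_sales = [(day, 100 + day) for day in range(1, 11)]
train, valid = time_series_split(daily_sales)
print([day for day, _ in valid])  # [9, 10]
```

Every validation row is later than every training row, which matches the real task of predicting the future from the past.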

Jeremy discusses loss functions vs. metrics. One way to think about overfitting is a case where your model keeps getting better at making predictions on the training dataset, but starts getting worse on the validation dataset. Jeremy cautions that this is different from changes in the loss function however, and he will get into the mathematics behind this later when he discusses loss functions in more detail.
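Here is a small sketch of that signal: scan per-epoch error rates and report the first epoch where the training error keeps falling while the validation error goes up. The function name and the numbers are invented, and a single noisy epoch can trip a check like this, so treat it as an illustration and not a rule.

```python
def first_overfitting_epoch(train_errors, valid_errors):
    """Return the first epoch where training error falls but validation error rises, else None."""
    for epoch in range(1, len(train_errors)):
        train_improved = train_errors[epoch] < train_errors[epoch - 1]
        valid_worsened = valid_errors[epoch] > valid_errors[epoch - 1]
        if train_improved and valid_worsened:
            return epoch
    return None

# Made-up error rates per epoch (epochs are 0-indexed)
train_errors = [0.30, 0.20, 0.12, 0.07, 0.04]
valid_errors = [0.32, 0.24, 0.19, 0.21, 0.25]
print(first_overfitting_epoch(train_errors, valid_errors))  # 3
```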

Definition: transfer learning is using a pretrained model for a task different from the one it was originally trained for. The fine_tune() method is called that because it is doing transfer learning. In the earlier examples with resnet34, it was performing transfer learning on the pretrained model to get the superior performance. It lets you use less data and less compute to accomplish your goals.

Zeiler and Fergus published a paper in 2013 Visualizing and Understanding Convolutional Networks. It showed how different layers in the network recognize initially simple patterns and then become more specialized in later layers. I think this is done by activations of different filters against an image, so that you can see the parts of the image that a filter gets activated against. This paper gives a good mental model for thinking about how filters can be generalized and how transfer learning can take advantage of filters in earlier layers.

Sound detection can work by turning sounds into pictures and using CNNs to classify them:

Here's a really cool example of detecting fraudulent activity by looking at traces of mouse movements and clicks and turning them into pictures (done by a fastai student at Splunk - blog that announced this result)

What happens when you fine tune an existing model - does it perform worse on detecting things that it used to do before the fine tuning dataset? In the literature this is called catastrophic forgetting or catastrophic interference. To mitigate this problem you need to continue to provide data for other categories that you want detected during the fine tuning (transfer learning) stage.

When looking for a pretrained model, you can search for model zoo or pretrained models.

He has a number of different categories: vision, text, tabular, recommendation systems.

Recommendation systems == collaborative filtering

Recommendation != Prediction


Tim O'Reilly, one of our elder statesmen of the web, has written a great analysis article on Web3. It is well worth reading the post in its entirety, as he does a really good job at constructing arguments without being confrontational in his reasoning. He does this by asking questions without presuming what the answers are. This is probably the most balanced account of Web3 that I've read so far.

I've been thinking a lot about the parallels between liquid chromatography and espresso making. I found this article that delves into both the chemistry and physics of espresso brewing. #

While looking around for a project for the holidays, I've started thinking about continuing to build my personal semantic search engine. The core idea is to make a tool that makes it easier to remember and recall things that are interesting to me. Part of that is searching my private data for things that are interesting. I've already made pretty good progress in August on this. The other part is building a browser extension that makes it easy to tag and take notes on things that I'm reading and add those things to the index that my search engine operates over. That feels like a good task for the holidays. #

I've also been interested in autonomous agents for helping to manage beer mode information. My gut tells me that these things likely won't help in the long run, but they are nevertheless interesting to me. There are two tools that I came across tonight:

  1. mailbrew which is a service for delivering news culled by agents that you configure into an email that shows up in your inbox. This is pretty interesting as a tool as it lets you aggregate different pieces of information into a personal newsletter. It's $5/month which is also pretty reasonable.
  2. huginn which is named after the ravens Huginn and Muninn who sat on Odin's shoulders and told him the news of the world. This is kind of like a DIY mailbrew where all the information sits on a server that you run yourself. It's a DAG of agents (all written in Ruby) that you can configure to do virtually anything. It also runs as a Docker container to save you the trouble of setup. This feels like a lot of work compared to mailbrew.


I found a way to split an MP3 into smaller files automatically using ffmpeg. This is also the first time that I've ever used ffmpeg, and it did a fantastic job on this task.

$ ffmpeg -i somefile.mp3 -f segment -segment_time 3 -c copy out%03d.mp3
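In that command, -f segment selects ffmpeg's segment muxer, -segment_time 3 asks for pieces roughly 3 seconds long, -c copy avoids re-encoding, and out%03d.mp3 numbers the outputs (out000.mp3, out001.mp3, ...). If you want to drive it from a script, a minimal sketch might look like this. The build_segment_command helper is my own name, and running it assumes ffmpeg is on your PATH.

```python
import subprocess

def build_segment_command(source, segment_seconds, pattern="out%03d.mp3"):
    """Build the ffmpeg argument list that splits source into fixed-length pieces without re-encoding."""
    return [
        "ffmpeg", "-i", source,
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        "-c", "copy",
        pattern,
    ]

cmd = build_segment_command("somefile.mp3", 3)
print(" ".join(cmd))

# To actually run it (requires ffmpeg to be installed):
# subprocess.run(cmd, check=True)
```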

source #

Perhaps the greatest productivity hack ever created is News Feed Eradicator. I use this for Twitter so that I still have the ability to read specific tweets, e.g., they were linked from somewhere else or I can look up a specific user. But the algorithmic feed is gone. It's lovely. #


There's long been an argument by crypto enthusiasts that we need crypto to fight against the dastardly fees charged by Western Union and the like in the 3rd world. In this post by Patrick McKenzie (aka patio11) More than you want to know about gift cards it seems like there's a strong argument to be made for using gift cards to work around the fees charged by Western Union?

In this regard it is not merely important that they look attractive in a birthday card but also that they’re available for cash everywhere, require no identification or ongoing banking relationship to purchase, do not charge a fee like e.g. Western Union, and can be conveyed over a text message or phone call. They're not worse cash, they're better Tide in the informal economy.


This morning on HN I found this course on Natural Language Processing for Semantic Search by a startup called pinecone. This is my current area of interest, which is why I created a simple wine semantic search engine a while ago to explore this area. Taking a look at a couple of chapters it definitely looks interesting and worth a longer look over the holidays. #

There's another post by someone who is trying to build a news site that is kind of like the original Yahoo aggregator, but with the twist of having sagas which let you follow a story as it progresses, e.g., salacious news like the Theranos trial which unfolds over a long period of time. It looks like it is curated by the poster though. I would love to combine the idea of sagas with some kind of AI filter that is trained on my interests to pull tweets and news articles into a personal feed for my own consumption. This way it is aligned with my interests vs. the interests of the aggregator. #

I listened to Professor Christensen on this podcast the year it came out (2004!) and it left an indelible impression on me. Sadly, it looks like IT Conversations no longer exists, and I found this archive of the page created by the awesome folks at I also copied it to this part 1 and part 2 so that I can find it again - just in case. I highly recommend listening to this; the stories that Christensen tells about his conversations with Andy Grove are wonderful and do a great job at driving home the concepts of his theory of disruption. RIP.



This is an interesting take on the metaverse that I haven't seen before:

The idea that (metaverse : digital) is like (singularity : AI) is certainly a possibility. As Shaan (with a healthy dose of unhelpful crypto speak) correctly says, we've been on a rapid trend towards a life in a virtual world that is more detached from our physical world, thanks to ever-improving technology.

Where I disagree with Shaan's tweet is how long this has been going on. It's been going on for much longer than 20 years; from the creation of the printing press, we have been on a path to ever increasing amounts of media/digital/online in our life. We have been spending more of our time in front of some other piece of technology and less time in the "real world". When you are reading a book, watching TV, playing a video game, or wearing a VR headset, you aren't in the "real world" - you can be doing those activities from anywhere - your physical environment doesn't matter. You're immersed in these experiences.

At what point does the value of the virtual world become greater to us than the value of our physical world? To some extent, the pandemic has started pushing our work to be more online work and it's not a huge leap to imagine that we are moving more towards a world where the experience of being in an online meeting in the metaverse is better than the experience of being in a real-life meeting.

Perhaps Ben Thompson is right - the metaverse-as-a-place will start in businesses who will buy this expensive technology for their employees, much like how the PC revolution started. It has the characteristics of disruption; it is worse on some dimension (e.g., the fidelity of the experience) that the mainstream cares about but better on some dimension that the early adopters care about (e.g., you don't need to live close to an office to go to work) - and it's on a steeper slope of improvement.

I know I'll be watching this area closely and learning. It's tempting to want to dismiss this because of the dystopian takes on this technology. But that's not an excuse to ignore it or try to block it. Technology is chaotic neutral and can't be un-invented. It's up to all of us to create a better experience for ourselves using it.



That was an F1 championship for the ages; certainly the best one that I can remember since I started watching a decade ago during the Vettel era. A bit of luck certainly contributed to Max's win, but can we talk about just how well the Red Bull team managed Max during the race? The call to pit for new hard tires under the virtual safety car and the call to pit for new soft tires under the safety car were clearly the calls that made the difference in a race where it was clear that Lewis had the faster car.

Mercedes really need to look in the mirror here as what they needed to do was cover off any pit stops that Max did. They will of course appeal to the FIA, but I can't imagine them changing the title in the courts.

Something else to talk about: Sergio Perez. In the era of selfish F1 teammates his sacrifice and ultimate retirement this weekend clearly enabled Max to get not only pole position but possibly the race itself. His epic battle with Lewis made up a huge gap that let Max back into contention in the race.

Regardless of what happens in the end, this was an epic end to an amazing season. Both Lewis and Max deserved to win this race. Lewis behaved like the GOAT that he is and was so gracious in congratulating Max, gutting as it must have been for him. It was also great to see, in the back of the paddock, this scene with Anthony Hamilton and Jos Verstappen congratulating Max on his win.

Also these scenes with the fathers consoling and congratulating their sons:

These are the human moments in sport, and I'm grateful to have been here to see this. #


A recent interview with Anders Hejlsberg!

Anders Hejlsberg

I found this on Hacker News

  • His brother is one of the interviewers
  • They started programming 40 years ago in Copenhagen
  • Started using high school computer in the 1970s
  • Started by implementing a programming language, Turbo Pascal
  • Cut his teeth by adding extensions to existing language to make it useful
  • C# was first language designed from scratch
  • "You can always add but you can never take away"
  • The perfect programming language is one with no users(!)
  • V1.0 is the only greenfield
  • Game of learning to say no vs. yes
  • Language features asks come from people with an instance of a problem - he tries to find the class of problems that it belongs to
  • HTML is an incomprehensible mess :)
  • TypeScript was his first foray into a project that was open source from the outset
  • Wanted Roslyn to be open source, but C# and .NET were not open source
  • Roslyn was not built as open source, was open sourced later
  • Open development - moved to GitHub in 2014 - putting everything on GitHub (e.g., design notes) is the next step - close to users
  • Anders participates in C# design committee, but Mads has been running it for the last decade(!)
  • Multiple inheritance - usefulness does not outweigh the downsides of additional complexity
  • If he could change one thing in C# he would have nullability and value/reference types as orthogonal issues, e.g., you cannot have non-nullable reference types
  • Tony Hoare called inventing null his billion dollar mistake
  • Functional programming languages do not enable circular reference data types (e.g., double-linked list or trees with back-pointers)
  • Opting into nullability in specific areas
  • Don Syme did most of the implementation of generics in .NET 2.0
  • He regrets dynamic in C# 4.0
  • He likes the work done in C# 5.0 for async/await
  • TypeScript - how it solved the nullability problem. He is very happy with that result.
  • Turbo Pascal 1.0 symbol table was a linked list(!)
  • He read Algorithms + Data Structures = Programs by Niklaus Wirth - learned about hash tables and then reimplemented the symbol table in Turbo Pascal 2.0 using them (local copy in case this disappears from the web)
  • He learned by trial and error without formal background


I like it when things that I read and don't agree with get me thinking about an idea. This post by David Perell spent some time rattling around in my subconscious:

I didn't agree with it because almost all software that I use today requires me to Google something to figure out an obscure feature - there are always obscure features especially in software that I don't use all the time. If I don't know how to do something already using the UI, I would immediately Google it and follow the directions on how to accomplish the task.

I was out riding my bike today, and listening to Neal Stephenson being interviewed by Lex Fridman. During the conversation they talked about Google and search. The idea that popped into my head then was that UI is great for idiomatic operations and search is great for everything else.

The example in my head was VS Code. The idiomatic operations are provided by the UI, e.g., vi keybindings for navigating a document, tabs for managing multiple documents, a file explorer for viewing the contents of your project, a debugging tab for viewing the state of variables in the debugger etc. IMHO the real innovation in VS Code is the command palette which lets you search for the command that you want - this avoids over-complicating the UI with endless toolbars. VS Code users quickly learn to use the fuzzy search in the command palette to find the command that they are looking for. They even have the ability to bind those commands to a custom keybinding to turn it into an idiomatic operation if they use that command often enough.

I think this strikes the right balance between a rich featureset and a simple, idiomatic UI. Unfortunately, it seems that in our attempt to simplify everything for the novice user, we have wound up with UIs that have way too many layers of UI (I'm looking at you, Microsoft Teams).

I wonder what a better experience for search would be on mobile though? A command palette is a lot more difficult to use on a phone. #


There is some good news coming out of South Africa today - it looks like the probability of severe outcomes from Omicron is lower than Delta:

However, this is not the way to handle the Delta and nascent Omicron wave in the USA:


OK. We have some pretty good news coming out of South Africa on Omicron - it looks like the probability of severe outcomes from Omicron is lower than Delta!

Unfortunately, it looks like we're trending the wrong way on Delta in the US:


Our talk yesterday was about using Transformer models to win a Kaggle competition at a training cost of less than $50 on Azure. Fortunately, Alon understands what Transformer models are, and he did a wonderful job of summarizing what a Transformer model is in about 5 minutes.

I really wanted to learn more about Transformer models; I've been treating them mostly as a black box and using pre-trained Transformer models to make cool things like my semantic wine review search engine. Coincidentally, this morning someone from work linked this tweet:

and he subsequently found a link to this fantastic explanation of Transformer models called Transformers From Scratch written by Brandon Rohrer (hi Brandon!). I'm still working through the piece, but one thing that I had not understood before was how matrix multiplication and one-hot encoded vectors are used to do branch-free selection of rows from a table. Let that sink in for a minute: how would you do that WITHOUT COMPARISONS and BRANCHING? 🤯

Apparently this is one of the key insights from Transformers. There's a whole lot of branching and comparing in this Python list comprehension:

[x for x in ['fizz','buzz','fizz','fizz'] if x == 'fizz']
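For contrast, here is a minimal sketch of the branch-free version: multiplying a one-hot row vector by a table picks out exactly one row using nothing but multiplies and adds. The matmul helper and the table values are made up for illustration; a real implementation would use a tensor library.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

table = [
    [0.1, 0.2, 0.3],  # row 0
    [0.4, 0.5, 0.6],  # row 1
    [0.7, 0.8, 0.9],  # row 2
]
one_hot = [[0, 1, 0]]  # "select row 1"

# No comparisons, no branches: the zeros wipe out rows 0 and 2
print(matmul(one_hot, table))  # [[0.4, 0.5, 0.6]]
```

Stacking several one-hot rows into a matrix selects several rows in a single multiplication.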

Continuing to take notes on Brandon's tutorial on the flight home. There is a concept of selective masking that is in the original Transformers paper. Here is an annotated Attention is All You Need paper.

In his explanation, there are a large number of unhelpful predictions where there is a 50:50 probability of some outcome in his highly simplified example. Masking is a concept to drive low probability events to zero to eliminate them from consideration, and is the central idea in Transformers.
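As I understand it, the mechanics of masking are just an element-wise multiplication: entries the mask marks with 0 are wiped out, entries marked with 1 pass through. A minimal sketch with made-up numbers follows; the mask here is written by hand, whereas in a real Transformer it is computed by the attention mechanism.

```python
def apply_mask(values, mask):
    """Element-wise multiply: entries with mask 0 are zeroed, entries with mask 1 pass through."""
    return [v * m for v, m in zip(values, mask)]

# Made-up weights; the 0.5 entries stand for the uninformative 50:50 cases
votes = [0.5, 0.5, 1.0, 0.0, 0.5]
mask = [0, 0, 1, 1, 0]  # keep only the informative positions

print(apply_mask(votes, mask))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```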

He summarizes the first part of his explanation through three ideas:

  1. Turning everything into matrix multiplication is a good idea. As I observed above, being able to select rows out of a table (or matrix) by doing nothing more than matrix multiplication is incredibly efficient.

  2. Each step during training must be differentiable, i.e., for each small adjustment to a parameter we must be able to calculate the corresponding change in the model error / loss function.

  3. Having smooth gradients is really important. He has a nice analogy between ML gradients and hills/mountains/valleys in the real world. He describes the art of data science as ensuring that the gradients are smooth and "well-conditioned", i.e., they shouldn't quickly drop to zero or rise to infinity.
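To make the second and third ideas a bit more tangible, here is a minimal sketch of walking downhill on a smooth toy loss. The loss function, the learning rate, and the finite-difference gradient are all my own choices for illustration; real frameworks compute gradients with automatic differentiation.

```python
def loss(w):
    """Toy smooth loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference estimate of how f changes for a small change in w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0
for _ in range(50):
    # Nudge the parameter in the direction that lowers the loss
    w -= 0.1 * numerical_gradient(loss, w)

print(round(w, 3))  # 3.0
```

Because this loss is a smooth bowl, the gradient always points toward the minimum and shrinks gradually as we approach it. A loss with cliffs or flat plateaus would make the same loop overshoot or stall.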