
John Lam's Blog

2021-12-22

I've been interested in using machine learning to reliably extract text from a web page for archival purposes (adding to my personal knowledgebase). So today, I'm collecting some links to prior art in this area for inspiration:

A Machine Learning Approach to Webpage Content Extraction. This paper uses support vector machines to classify text blocks based on a few specific features (sketched in code after the list):

  • number of words in this block and the quotient to its previous block
  • average sentence length in this block and the quotient to its previous block
  • text density in this block and the quotient to its previous block
  • link density in this block
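
Here's a rough sketch of how those features might be computed, assuming each block is represented as { text, linkText }. The 80-column text-density definition and all of the names here are my assumptions, not the paper's:

const words = (t) => t.split(/\s+/).filter(Boolean).length;
const sentences = (t) => t.split(/[.!?]+/).filter((s) => s.trim()).length;
const avgSentenceLen = (t) => words(t) / Math.max(1, sentences(t));
// text density here: words per 80-column wrapped line (an assumption)
const textDensity = (t) => words(t) / Math.max(1, Math.ceil(t.length / 80));

function blockFeatures(block, prev = { text: '', linkText: '' }) {
  // quotient of a feature value relative to the previous block
  const q = (f) => f(block.text) / Math.max(1e-6, f(prev.text));
  return {
    wordCount: words(block.text),
    wordCountQuotient: q(words),
    avgSentenceLen: avgSentenceLen(block.text),
    avgSentenceLenQuotient: q(avgSentenceLen),
    textDensity: textDensity(block.text),
    textDensityQuotient: q(textDensity),
    // fraction of the block's words that sit inside <a> tags
    linkDensity: words(block.linkText) / Math.max(1, words(block.text)),
  };
}

These feature vectors would then be fed to the SVM to label each block as content or boilerplate.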

Readability.js. This is a Node library that contains the readability code used for Firefox's Reader View. In a web browser, you can pass the document object from the browser DOM directly to the library. In a Node application, you'll need an external DOM library like jsdom. Either way, the code is simple:

// note: parse() modifies the DOM it's given; pass document.cloneNode(true) if you need the page afterwards
var article = new Readability(document).parse();

or in the case of a node app:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// jsdom builds a DOM from the HTML string; the url option lets
// Readability resolve relative references like the image src below
const doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", {
  url: "https://www.example.com/the-page-i-got-the-source-from"
});
const reader = new Readability(doc.window.document);
const article = reader.parse();

Ideally, I'd create an API that accepts a URI as a parameter and returns the parsed document to the caller. Invoking it from a Chrome extension would make it straightforward to "clip" a web page into a personal knowledgebase. A sketch of what that service might look like is below.
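
Here's a minimal sketch of that service, assuming Express for the HTTP layer and using jsdom's JSDOM.fromURL to fetch the page; the route, query parameter, and port are placeholders:

const express = require('express');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

const app = express();

// GET /parse?url=... returns the Readability-parsed article as JSON
app.get('/parse', async (req, res) => {
  const url = req.query.url;
  if (!url) return res.status(400).json({ error: 'missing url parameter' });
  try {
    const dom = await JSDOM.fromURL(url);  // fetch the page and build a DOM
    const article = new Readability(dom.window.document).parse();
    res.json(article);  // title, byline, content, textContent, excerpt, ...
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000, () => console.log('clipper listening on :3000'));

A Chrome extension could then call GET /parse with the active tab's URL and save the returned JSON into the knowledgebase.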

Large language models like GPT-3 show real promise in this area as well. In this article, the authors use GPT-3 to answer questions based on text extracted from a document. Starting at 2:47 in the video below is a great demo of this working.
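
As a rough sketch of how that pattern might look against the OpenAI completions API: the model name and prompt format below are my assumptions, it assumes Node 18+ for the global fetch, and a long article would need to be truncated or chunked to fit the model's context window:

// Question answering over extracted text with GPT-3 (sketch).
// articleText would come from Readability's textContent above.
async function answerQuestion(articleText, question) {
  const response = await fetch('https://api.openai.com/v1/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'text-davinci-003',  // placeholder; any completion model works
      prompt: `${articleText}\n\nQ: ${question}\nA:`,
      max_tokens: 100,
      temperature: 0,
    }),
  });
  const data = await response.json();
  return data.choices[0].text.trim();
}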