
John Lam's Blog

2021-12-22

I've been interested in using machine learning to reliably extract text from a web page for archival purposes (adding to my personal knowledgebase). So today, I'm collecting some links to prior art in this area for inspiration:

A Machine Learning Approach to Webpage Content Extraction. This paper uses support vector machines to classify text blocks based on a few specific features (sketched in code after the list):

  • number of words in this block and the quotient to its previous block
  • average sentence length in this block and the quotient to its previous block
  • text density in this block and the quotient to its previous block
  • link density in this block
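
Here's a rough sketch of how those features might be computed, assuming each block is represented as { text, linkText }. The 80-column text-density definition and all of the names here are my assumptions, not the paper's:

const words = (t) => t.split(/\s+/).filter(Boolean).length;
const sentences = (t) => t.split(/[.!?]+/).filter((s) => s.trim()).length;
const avgSentenceLen = (t) => words(t) / Math.max(1, sentences(t));
// text density here: words per 80-column wrapped line (an assumption)
const textDensity = (t) => words(t) / Math.max(1, Math.ceil(t.length / 80));

function blockFeatures(block, prev = { text: '', linkText: '' }) {
  // quotient of a feature value relative to the previous block
  const q = (f) => f(block.text) / Math.max(1e-6, f(prev.text));
  return {
    wordCount: words(block.text),
    wordCountQuotient: q(words),
    avgSentenceLen: avgSentenceLen(block.text),
    avgSentenceLenQuotient: q(avgSentenceLen),
    textDensity: textDensity(block.text),
    textDensityQuotient: q(textDensity),
    // fraction of the block's words that sit inside <a> tags
    linkDensity: words(block.linkText) / Math.max(1, words(block.text)),
  };
}

These feature vectors would then be fed to the SVM to label each block as content or boilerplate.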

Readability.js. This is a Node library that contains the readability code used for Firefox's Reader View. In a web browser, you can pass the document object from the browser DOM directly to the library. In a Node application, you'll need an external DOM library like jsdom. Either way, the code is simple:

// note: parse() modifies the DOM it's given; pass document.cloneNode(true) if you need the page afterwards
var article = new Readability(document).parse();

or in the case of a node app:

const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

// jsdom builds a DOM from the HTML string; the url option lets
// Readability resolve relative references like the image src below
const doc = new JSDOM("<body>Look at this cat: <img src='./cat.jpg'></body>", {
  url: "https://www.example.com/the-page-i-got-the-source-from"
});
const reader = new Readability(doc.window.document);
const article = reader.parse();

Ideally, I'd create an API that accepts a URI as a parameter and returns the parsed document to the caller. Invoking it from a Chrome extension would make it straightforward to "clip" a web page into a personal knowledgebase. A sketch of what that service might look like is below.
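
Here's a minimal sketch of that service, assuming Express for the HTTP layer and using jsdom's JSDOM.fromURL to fetch the page; the route, query parameter, and port are placeholders:

const express = require('express');
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');

const app = express();

// GET /parse?url=... returns the Readability-parsed article as JSON
app.get('/parse', async (req, res) => {
  const url = req.query.url;
  if (!url) return res.status(400).json({ error: 'missing url parameter' });
  try {
    const dom = await JSDOM.fromURL(url);  // fetch the page and build a DOM
    const article = new Readability(dom.window.document).parse();
    res.json(article);  // title, byline, content, textContent, excerpt, ...
  } catch (err) {
    res.status(500).json({ error: err.message });
  }
});

app.listen(3000, () => console.log('clipper listening on :3000'));

A Chrome extension could then call GET /parse with the active tab's URL and save the returned JSON into the knowledgebase.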

Large language models like GPT-3 show real promise in this area as well. In this article, the authors use GPT-3 to answer questions based on text extracted from a document. Starting at 2:47 in the video below is a great demo of this working.
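
As a rough sketch of how that pattern might look against the OpenAI completions API: the model name and prompt format below are my assumptions, it assumes Node 18+ for the global fetch, and a long article would need to be truncated or chunked to fit the model's context window:

// Question answering over extracted text with GPT-3 (sketch).
// articleText would come from Readability's textContent above.
async function answerQuestion(articleText, question) {
  const response = await fetch('https://api.openai.com/v1/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'text-davinci-003',  // placeholder; any completion model works
      prompt: `${articleText}\n\nQ: ${question}\nA:`,
      max_tokens: 100,
      temperature: 0,
    }),
  });
  const data = await response.json();
  return data.choices[0].text.trim();
}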