Alex Kessinger changelog.com/posts

Newspaper delivers Instapaper-style article extraction

Newspaper lets anyone do article extraction, similar to Pocket or Instapaper.

Newspaper is a Python 2 library for extracting & curating articles from the web. It wants to change the way people handle article extraction with a new, more precise layer of abstraction.

Besides “read later” services, a growing number of APIs, such as diffbot and embed.ly, provide article extraction as a service. Those services are great, but it’s nice that newspaper is open source and hackable.

For instance, when I first checked out newspaper, it only had plain text article extraction. Sometimes, though, I want the original markup of the article with some sanitization. It helps to have the paragraphs, links, and headers accurately represent the article. So, I forked the project and made some changes, and the maintainer, codelucas, was responsive and worked with me to get my changes merged in.
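
To give a feel for what "original markup with some sanitization" can mean, here is a minimal sketch that uses only the Python 3 standard library. It is not newspaper's implementation or API. The names `ArticleSanitizer` and `sanitize_article_html`, and the tag whitelist, are assumptions made up for illustration. The idea is to keep paragraphs, links, headers, and lists, and to drop everything else, including attributes like `style` and `onclick`.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags worth keeping in a cleaned-up article body.
ALLOWED_TAGS = {
    "p", "a", "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "blockquote", "em", "strong",
}
# Only links keep an attribute (href); all other attributes are dropped.
ALLOWED_ATTRS = {"a": {"href"}}
# Tags whose contents are thrown away entirely.
DROP_CONTENT = {"script", "style"}


class ArticleSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and the text between them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        allowed = ALLOWED_ATTRS.get(tag, ())
        kept = []
        for name, value in attrs:
            if name not in allowed or value is None:
                continue
            # Assumption: javascript: URLs are never wanted in article links.
            if value.strip().lower().startswith("javascript:"):
                continue
            kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Text is kept (and re-escaped) unless we are inside a dropped tag.
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_article_html(html):
    """Return html reduced to the whitelisted article markup."""
    parser = ArticleSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `sanitize_article_html(raw_html)` on a page fragment returns a string with only the whitelisted tags left. This is a readability aid, not a hardened security sanitizer, so don't rely on it for untrusted input without a proper library.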

If you want a place to start working on article extraction, Newspaper looks like a good bet.

