goose-extractor:文章主体内容提取库,不需要正则表达式,或者 CSS 选择器。

介绍

Goose 最初是用Java编写的文章提取器,最近(aug2011)被转换为scala项目。

这是一个完全重写的 python 库。 该软件的目的是采取任何新闻文章或文章类型的网页,不仅提取文章的主体是什么,而且所有元数据和最可能的图像。

Goose 将尝试提取以下信息:

文章的主要内容
文章的主要图片
任何Youtube / Vimeo电影嵌入文章
元描述
元标签
python 版本重写作者:

泽维尔·格兰吉尔

简单地说这个库将从爬取的网页文本中自动提取正文内容,而不用写正则表达式或者 CSS 选择器。

pip:https://pypi.python.org/pypi/goose-extractor/

更多机器学习资源:http://www.tensorflownews.com/

Intro

Goose was originally an article extractor written in Java that has most recently (aug2011) been converted to a scala project.

This is a complete rewrite in python. The aim of the software is to take any news article or article-type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
The python version was rewritten by:

Xavier Grangier

Related posts

Leave a Comment