A Summary of Python Toolkits for Main-Text Extraction

Short link to this post: http://memect.co/B1DWuNo

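Over the past year we have tried many main-text extraction tools, and we plan to sum them up in this series. There are 15 related resources in total, listed at http://memect.co/python-text-extraction . We will share our experience and lessons learned over the next dozen or so Weibo posts.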

First, we recommend Tomaž Kovačič's excellent 2011 survey. His website is down, but a PDF backup is available here: http://python.memect.com/?p=3449

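Kovačič's survey compares two classes of product: open-source algorithms such as Boilerpipe, Goose and Webstemmer, and commercial APIs such as Alchemy, Diffbot, Readability and Extractiv. He ran benchmark tests and concluded that the commercial APIs were no better in precision and recall than the open-source algorithms of the time, and that Boilerpipe performed very well. The detailed list is given in the full post.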
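For readers unfamiliar with how precision and recall apply to extracted text, here is a minimal sketch. It is not Kovačič's benchmark code: the whitespace tokenization, the function name `token_precision_recall` and the sample strings are all assumptions made for illustration. Chinese text would need a proper word segmenter instead of `split()`.

```python
from collections import Counter
from typing import Tuple


def token_precision_recall(extracted: str, gold: str) -> Tuple[float, float]:
    """Token-level precision and recall of extracted text against a gold standard."""
    extracted_tokens = Counter(extracted.split())
    gold_tokens = Counter(gold.split())
    # Multiset intersection: a token counts only as often as it occurs in both texts.
    overlap = sum((extracted_tokens & gold_tokens).values())
    n_extracted = sum(extracted_tokens.values())
    n_gold = sum(gold_tokens.values())
    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical example: the extractor kept some navigation text and lost one word.
gold = "Boilerpipe extracts the main article text"
extracted = "Home | About Boilerpipe extracts the main article"
precision, recall = token_precision_recall(extracted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Leftover boilerplate such as navigation links lowers precision, while dropped article text lowers recall, so an extractor has to be judged on both.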

read more

5 Ways of Calling Java from Python

Some of my notes on calling Java from Python, only lightly edited from the raw notes. Short, mostly installation script and hello world code, but should serve the purpose.

Short answer: JPype works pretty well, but Pyjnius is faster and simpler than JPype.

Summary

2013-05-21T22:38:11 (PDT) Pyjnius is faster and simpler than JPype

  • JCC, javabridge, JPype and Jnius are all JNI wrappers.

2012-06-14T10:33:00 (PDT) JPype works pretty well. I can call the Stanford parser and OpenNLP from Python.

2012-05-05T17:57:57 (PDT) Closing for now. At least I can use JPype. Reopen the Py4J task in the future if JPype is not enough.
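As a rough illustration of the hello-world style of call these notes refer to, here is a minimal JPype sketch. It is not taken from the original notes. It assumes JPype is installed and a JVM can be found on the machine; the helper name `java_upper` is hypothetical, and it returns None rather than failing when either is missing.

```python
def java_upper(text):
    """Upper-case text with java.lang.String, or return None if JPype or a JVM is unavailable."""
    try:
        import jpype
    except ImportError:
        return None  # JPype is not installed
    try:
        # A JVM can be started only once per process, so check first.
        if not jpype.isJVMStarted():
            jpype.startJVM(jpype.getDefaultJVMPath())
        JavaString = jpype.JClass("java.lang.String")
        # Convert the Java string back to a Python str before returning it.
        return str(JavaString(text).toUpperCase())
    except Exception:
        return None  # no JVM found, or the Java call failed


result = java_upper("hello")
print(result)
```

Calling a real library such as the Stanford parser would follow the same pattern, with the library's jar files added to the JVM classpath at startup.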

read more

Notes on Peter Thiel's CS183: Startup

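2013-2-25 09:20 Peter Thiel: Building a 0-to-1 startup costs little in both monetary and non-monetary terms; at the very least you learn a lot, so the effort is worth it. Building a 1-to-n startup may not cost much money, but its non-monetary cost is high. For example, if you set out to build a group-buying site for Madagascar and it fails, that is not so great. http://t.cn/zY0ULES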

2013-2-25 10:17 I just uploaded a document to #爱问共享资料#: the complete lecture notes for Peter Thiel's CS183: Startup - Stanford (the Stanford startup course). Feel free to download and share! "peter_thiel_startup.pdf" http://t.cn/zY0n8iJ

read more

A Summary of Common Natural-Language Parsers

Feature Summary Table

 
| Features | Satisfied by | Note |
|---|---|---|
| Web-scale parsing: for both training and parsing time, should be able to handle TB or higher text volume efficiently | Link, MiniPar, Malt, DeSR, MST, pfp, MBSP | Linear-time parsing is generally possible with dependency parsing; parallelism support is also important |
| Potentially support both statistical and knowledge-based parsing | Link, NLTK, Malt, DepParse, MBSP | |
| High accuracy | Stanford, Collins and Bikel, Berkeley, Charniak-Johnson, RASP, Malt, Link, DeSR, MST, pfp, Senna | |
| Active development | Stanford, Berkeley, Link, NLTK, Malt, DeSR, pfp, MBSP, OpenNLP, Senna | |
| Production-friendly license | Link, NLTK, RASP, Malt, DepParse, OpenNLP | Some others under the GPL can be used in production as a web service without open-sourcing the other parts |
| Good documentation | Stanford, Link, NLTK, Malt, DeSR, MBSP, OpenNLP | |
| Code reusability: easy-to-use API or easy-to-understand code | Stanford, Link, NLTK, MiniPar, DeSR, DepParse, pfp, MBSP, Senna | |

Continue reading

One Year into a Startup

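It has been almost a year since I quit my job to start a company. There has been progress all along, and it has always been slow. Here are a few honest thoughts.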

The first is that starting a company before marriage and children, and starting one after, are entirely different undertakings.

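Becoming an expert at anything takes roughly ten thousand hours, and starting a company is no exception. There are patterns to it that you simply cannot grasp by reading books or listening to others, and even experience from working at a big company does not transfer directly. If you do not put in the time, do not expect shortcuts. A young person can work 16 hours a day, 7 days a week; someone with a family does well to manage 8 hours a day, 5 days a week, and every minute you work is borrowed from your wife and children. So for a middle-aged founder, younger competitors are putting in at least twice the time. How do you compete with them? You have to work out how to do the things that are hard to do well even with twice the time: the problems they do not understand or do not take seriously.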

read more