NLP

Search Technology and Search Business

New Idea Engineering is a California-based enterprise search consulting company. Their homepage hosts an interesting collection of texts on both the business and the technology of search. This includes, for example:

  • Search Industry Overview: The search industry is an ecosystem with a number of different companies and related technologies that together provide complete solutions for intranet and customer-facing content search.
  • Mergers and Acquisition Map: Like many dynamic industries, enterprise search vendors and companies with related technologies have grown not only by sales but also by acquisition and merger.
  • Anatomy of a Search Engine

 

Character Recognition Recognition

Screenshot of Reverse OCR Tumblr

Via Twitter I stumbled upon the Reverse OCR bot yesterday. The bot itself states its mission like this:

I am a bot that grabs a random word and draws semi-random lines until the OCRad.js library recognizes it as the word.

And indeed, it is pretty interesting to follow those inexplicable lines which the OCRad algorithm identifies as words. You get a glimpse of why it is so difficult to write a good OCR system, although most of us can read text without any effort.
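The bot's generate-and-test loop can be sketched in a few lines of Python. This is only a rough illustration of the idea as the bot describes it, not Darius Kazemi's actual code: `reverse_ocr`, its parameters and the `recognize` callable are hypothetical names, `recognize` merely stands in for a real OCR engine such as OCRad.js, and the lines here are fully random rather than "semi-random".

```python
import random

def reverse_ocr(word, recognize, max_lines=10_000, size=(200, 60), rng=None):
    """Add random line segments until `recognize` reads them as `word`.

    `recognize` is a hypothetical stand-in for an OCR engine: it takes the
    list of segments drawn so far and returns the recognized string.
    Returns the list of segments, or None if the line budget runs out.
    """
    rng = rng or random.Random()
    width, height = size
    segments = []
    for _ in range(max_lines):
        # Pick two random points on the canvas and connect them.
        start = (rng.randrange(width), rng.randrange(height))
        end = (rng.randrange(width), rng.randrange(height))
        segments.append((start, end))
        # Stop as soon as the recognizer reports the target word.
        if recognize(segments) == word:
            return segments
    return None
```

In a real setup, `recognize` would render the segments to an image and pass that image to the OCR library; the loop itself stays the same.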

Darius Kazemi, the man behind Reverse OCR, is an internet artist. On his website "Tiny Subversions" he lists a lot of other weird stuff he created, e.g. a bot that shops for random stuff on Amazon, or a tool that creates presentations from the first image of a Google search for each topic in a given list.

Unitex Forum

Finally! Although the Unitex mailing list was of great help, I really appreciate the launch of the forum, as it will keep all the information at hand... Thanks, Cédrick and others!

A new Unitex-GramLab Forum has just been launched. Its aim is to replace the Unitex-users mailing list. This Forum will archive all the messages and will constitute over the time a large Knowledge base about Unitex and Gramlab.

Unitex-GramLab Forum

German is one of the ten weirdest languages in the world

A group of NLP people has examined the world's languages for how frequently certain phenomena and structures occur, and has now compiled a ranking of the weirdest languages. Very interesting and entertaining!

The World Atlas of Language Structures evaluates 2,676 different languages in terms of a bunch of different language features. These features include word order, types of sounds, ways of doing negation, and a lot of other things—192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us take a worldwide view. That is, we evaluate each language in terms of how unusual it is for each feature.
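One simple way to turn "how unusual it is for each feature" into a number is sketched below. It is a guess at the general idea, not necessarily the scoring the authors used: for every feature, take the share of languages that have the same value, average those shares per language, and subtract from 1. The function name `weirdness_scores` and the toy data are invented for illustration; they are not WALS values.

```python
from collections import Counter

def weirdness_scores(languages):
    """Score each language by how rare its feature values are.

    `languages` maps a language name to a dict {feature: value}; a missing
    feature means "not documented" and is skipped.
    Returns {language: score}; a higher score means a more unusual language.
    """
    # Count how often each value occurs, per feature.
    counts = {}
    for features in languages.values():
        for feature, value in features.items():
            counts.setdefault(feature, Counter())[value] += 1

    scores = {}
    for name, features in languages.items():
        if not features:
            continue
        # Share of languages (with that feature documented) using the same value.
        shares = [
            counts[f][v] / sum(counts[f].values())
            for f, v in features.items()
        ]
        scores[name] = 1 - sum(shares) / len(shares)
    return scores

# Invented toy data with placeholder language names, not actual WALS entries.
toy = {
    "A": {"word_order": "SOV", "negation": "particle"},
    "B": {"word_order": "SOV", "negation": "particle"},
    "C": {"word_order": "SOV", "negation": "affix"},
    "D": {"word_order": "VSO", "negation": "particle"},
}
scores = weirdness_scores(toy)
```

In the toy data, languages C and D each have one rare value and therefore score higher than A and B, which always go with the majority.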

(found via Nerdcore)

Google Research: Relation Corpus

One of the most difficult tasks in NLP is called relation extraction. It’s an example of information extraction, one of the goals of natural language understanding. A relation is a semantic connection between (at least) two entities.

And because it is such a hard task, Google is releasing a data set meant to help other researchers train information retrieval and relation extraction systems. It comprises 10,000 "place of birth" and more than 40,000 "attended or graduated from an institution" relations, which were extracted from Wikipedia and each judged correct by at least five human raters. The data comes as "predicate subject object" triples, together with plenty of additional data such as links and judgment details. More relations are supposed to follow. All the details are on the Google Research Blog:

50,000 Lessons on How to Read: a Relation Extraction Corpus
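For anyone who wants to play with such triples, here is a minimal Python sketch of how they could be filtered by rater agreement. The record layout is an assumption on my side: one JSON object per line with the keys `pred`, `sub`, `obj` and a `judgments` list whose entries carry a `judgment` of "yes" or "no". Please check the actual release for the real field names. The function name `accepted_triples` and the sample records are invented.

```python
import json
from collections import Counter

def accepted_triples(lines, min_yes=3):
    """Yield (predicate, subject, object) for records with enough "yes" votes.

    Assumed record layout (one JSON object per line): keys "pred", "sub",
    "obj", plus "judgments", a list of {"judgment": "yes" | "no"} entries.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Tally the rater votes for this record.
        votes = Counter(j.get("judgment") for j in record.get("judgments", []))
        if votes["yes"] >= min_yes:
            yield record["pred"], record["sub"], record["obj"]

# Invented sample records; the identifiers are placeholders, not corpus data.
sample = [
    json.dumps({"pred": "place_of_birth", "sub": "person_1", "obj": "city_1",
                "judgments": [{"judgment": "yes"}] * 5}),
    json.dumps({"pred": "place_of_birth", "sub": "person_2", "obj": "city_2",
                "judgments": [{"judgment": "yes"}] * 2 + [{"judgment": "no"}] * 3}),
]
triples = list(accepted_triples(sample))
```

With `min_yes=3`, only the first sample record passes, since the second one has just two "yes" votes out of five.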