KrdWrd: Architecture for Unified Processing of Web Content


Algorithmic processing of Web content mostly works on textual contents, neglecting visual information. Annotation tools largely share this deficit as well. We specify requirements for an architecture to overcome both problems and propose an implementation, the KrdWrd system. It uses the Gecko rendering engine for both annotation and feature extraction, providing unified data access in every processing step. Stable data storage and collaboration control scripts for group annotations of massive corpora are provided via a Web interface coupled with a HTTP proxy. A modular interface allows for linguistic and visual data feature extractor plugins. The implementation is suitable for many tasks in theWeb as corpus domain and beyond.

Proceedings of the Fifth Web as Corpus Workshop (WAC5)