Big data techniques for processing large amounts of text
Abstract
The processing techniques and algorithms used at the beginning of the 21st century have become obsolete for processing the massive amounts of data now available. Today, distributed systems are used, with processing carried out on more than one machine. The same is true of natural language processing: multi-machine environments have already become necessary for processing corpora or large text collections. In this work, we examine techniques for processing large numbers of text documents in distributed environments. To that end, we have built a system based on virtual machines, using the Storm distributed computation framework. We also present some experiments and the performance improvements obtained with various configurations.
Processing techniques and algorithms used at the beginning of the 21st century to process massive datasets have become obsolete. Nowadays, distributed systems are used to perform the processing on several computers simultaneously. In the Natural Language Processing field, clusters of several computers are already necessary to process large quantities of text. In this work we analyze an architecture to perform distributed processing of text. The architecture relies on virtual machines and is based on the Storm distributed processing framework. We describe some experiments and show the performance gain obtained in diverse settings.
Keywords
Big data, natural language processing, distributed systems