Named entity tagging in the health domain (Osasun-arloko entitate izendunen etiketatzea)
Abstract
This work has a twofold objective: on the one hand, it identifies named entities in the medical domain using transformer-based language models; on the other hand, it links the identified clinical entities to the diseases and symptoms in the Wikidata knowledge base. To recognize the entities, experiments were carried out on the MedMentions biomedical corpus with a general pre-trained BERT language model (BERT small) and two specialized BERT models (BiomedNLP-PubMedBERT and BioBERT). When evaluating whether a sequence of tokens constitutes a medical entity, an F1 score of 0.819 was obtained; when evaluating the specific class to which the entity belongs, the F1 score was 0.62. In addition, a first attempt to link the recognized entities to Wikidata using the Levenshtein distance achieved a coverage (recall) close to 50%.
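The linking step mentioned in the abstract, matching recognized entities to Wikidata labels by Levenshtein distance, can be illustrated with a minimal sketch. The abstract does not describe the normalization, the distance threshold, or how candidate labels were retrieved, so the lowercasing, the `max_distance` parameter, the `link_entity` helper, and the toy label table (with placeholder identifiers, not actual Wikidata IDs) are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def link_entity(mention, labels, max_distance=2):
    """Hypothetical helper: return the id of the closest label,
    or None when no label is within max_distance edits."""
    best_id, best_dist = None, max_distance + 1
    for item_id, label in labels.items():
        dist = levenshtein(mention.lower(), label.lower())
        if dist < best_dist:
            best_id, best_dist = item_id, dist
    return best_id


# Toy label table with placeholder ids (assumed, not real Wikidata identifiers).
labels = {"item-1": "asthma", "item-2": "headache", "item-3": "fever"}
mentions = ["Asthma", "headaches", "nausea"]

links = {m: link_entity(m, labels) for m in mentions}
# Coverage here means the fraction of mentions that received some link.
coverage = sum(v is not None for v in links.values()) / len(mentions)
```

In this toy run, two of the three mentions find a label within the threshold, so `coverage` is 2/3. A stricter threshold reduces spurious links at the cost of lower coverage, which is the trade-off any such first attempt has to balance.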
Keywords
Named Entity Recognition, language models, Wikidata, medicine