The Use of Custom Embeddings Generated from PubMed Corpora for Cancer Research
In natural language processing, one of the big open questions is how best to embed natural language in a vector space, that is, how to transform words into series of numbers. Ideally, these numbers should capture semantic meaning: in a multidimensional space, different dimensions should correspond to different types of meaning (e.g. the size of an entity, the sex of an animal), which a computer algorithm can then use to make inferences.
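As a minimal sketch of this idea, the toy example below assigns each word a two-dimensional vector whose axes are meant to stand for "size" and "sex". The words, the values, and the helper names (`toy_embedding`, `cosine_similarity`, `nearest`) are hypothetical and chosen by hand for illustration; real embeddings are learned from text and their dimensions are rarely this interpretable.

```python
import math

# Hypothetical hand-made embedding: (size, sex). Values are illustrative only.
toy_embedding = {
    "bull": (0.9, 1.0),
    "cow": (0.9, -1.0),
    "rooster": (0.1, 1.0),
    "hen": (0.1, -1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vector, embedding, exclude=()):
    """Return the word whose vector is most similar to `vector`."""
    candidates = [
        (word, cosine_similarity(vector, vec))
        for word, vec in embedding.items()
        if word not in exclude
    ]
    return max(candidates, key=lambda pair: pair[1])[0]


# Simple inference by vector arithmetic: bull - cow + hen.
# Subtracting "cow" from "bull" isolates the sex dimension,
# and adding "hen" supplies the size of a small animal.
query = tuple(
    b - c + h
    for b, c, h in zip(
        toy_embedding["bull"], toy_embedding["cow"], toy_embedding["hen"]
    )
)
answer = nearest(query, toy_embedding, exclude=("bull", "cow", "hen"))
```

Under these assumed values, the arithmetic lands on the vector for "rooster", which is the kind of inference an algorithm can draw once dimensions line up with types of meaning.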
Institutions and corporations with access to very large text collections claim that only large corpora produce performant embeddings. In this presentation, we investigate the minimal corpus size that is useful for extracting cancer-related statements. To this end, we developed a literature knowledge mining tool, “sina” (https://github.com/dicaso/sina), which extracts statements relevant to specific conditions and to the research question at hand by selecting a specific corpus of documents from which to build a custom word embedding.
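The general workflow of selecting a topic-specific corpus and deriving word vectors from it can be sketched as follows. This is not sina's implementation or API: the abstract does not state which embedding method the tool uses, so simple count-based co-occurrence vectors are swapped in here as the most basic stand-in. The documents, the keyword, and the function names (`select_corpus`, `cooccurrence_embedding`) are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus standing in for PubMed abstracts (illustrative only).
documents = [
    "neuroblastoma tumors show mycn amplification",
    "mycn amplification predicts poor survival in neuroblastoma",
    "the heart pumps blood through the body",
]


def select_corpus(docs, keyword):
    """Keep only the documents that mention the keyword as a whole token."""
    return [doc for doc in docs if keyword in doc.split()]


def cooccurrence_embedding(docs, window=2):
    """Map each word to a sparse vector of neighbour counts.

    Two words co-occur when they appear within `window` tokens of each other.
    """
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


# Restrict the corpus to the condition of interest, then embed only that subset.
custom_corpus = select_corpus(documents, "neuroblastoma")
embedding = cooccurrence_embedding(custom_corpus)
```

In this sketch, words from unrelated documents never enter the vocabulary, so the resulting vectors reflect only how terms are used in the selected literature. How small such a selected corpus can be while remaining useful is the question the presentation investigates.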