Development of information technology of term extraction from documents in natural language
Journal Title: Восточно-Европейский журнал передовых технологий - Year 2018, Vol 6, Issue 2
Abstract
Domain dictionaries are widely used at various stages of the design and operation of software products. Developing a dictionary, and term extraction in particular, is very labor-intensive and requires a highly qualified expert. Studies were conducted to identify the most important characteristics of multi-word terms (MWT): the probability that a document contains terms with different numbers of words, the arrangement of nouns in an MWT, and the possible number of nouns in an MWT. The context in which terms are used is analyzed, and the possible boundaries of terms in the text are identified. A procedure for preliminary document grouping is proposed, which avoids the “loss” of terms that occur in short documents. The dependence of term extraction errors on the size of the analyzed document is determined.
A mathematical model of term representation is proposed, based on defining the set of word chains grouped around a head word (a noun). Chains are filtered by the frequency of their occurrence in the text, based on a comparison of normalized representations of MWT.
Mechanisms are developed for filling the domain dictionary with new records and adjusting existing ones while the input document is analyzed. A solution is proposed for adjusting the frequency of occurrence of terms based on the identification of inter-phrase relations. All processes and models are combined into a single information technology for constructing the domain dictionary. The problem of term interpretation is not considered in this paper, since it requires a separate solution. A software product that substantially automates term extraction from text documents is developed.
Testing of the proposed solutions showed no “lost terms” and, as a result, a reduction of 1.5 hours in the time needed to extract terms from texts of 10,000 words, because the expert no longer has to analyze the original document. The research results can be used at various stages of the design and operation of software products.
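The idea of word chains grouped around a head noun and filtered by frequency can be illustrated with a minimal Python sketch. It is not the authors' model: the part-of-speech tags are supplied by hand, the head noun is assumed to be the last word of a chain, normalization is reduced to lowercasing, and the names `candidate_chains`, `frequent_chains`, `max_len`, and `min_freq` are hypothetical.

```python
from collections import Counter

NOUN, ADJ = "NOUN", "ADJ"


def candidate_chains(tagged, max_len=3):
    """Yield word chains that end at a noun (the assumed head word).

    `tagged` is a list of (word, part_of_speech) pairs. Each chain is
    extended leftwards over adjacent nouns and adjectives, up to
    `max_len` words. Lowercasing stands in for real normalization.
    """
    for i, (word, pos) in enumerate(tagged):
        if pos != NOUN:
            continue
        chain = [word.lower()]
        yield tuple(chain)
        j = i - 1
        while j >= 0 and len(chain) < max_len and tagged[j][1] in (NOUN, ADJ):
            chain.insert(0, tagged[j][0].lower())
            yield tuple(chain)
            j -= 1


def frequent_chains(tagged, min_freq=2, max_len=3):
    """Keep only the chains that occur at least `min_freq` times."""
    counts = Counter(candidate_chains(tagged, max_len))
    return {chain: n for chain, n in counts.items() if n >= min_freq}


# Hand-tagged toy input (an assumption; a real system would use a tagger).
sample = [
    ("The", "DET"), ("domain", NOUN), ("dictionary", NOUN),
    ("stores", "VERB"), ("terms", NOUN), ("and", "CONJ"),
    ("the", "DET"), ("domain", NOUN), ("dictionary", NOUN),
    ("grows", "VERB"),
]

result = frequent_chains(sample)
for chain, n in sorted(result.items()):
    print(" ".join(chain), n)
```

In this toy input the two-word chain "domain dictionary" survives the frequency filter, while the single occurrence of "terms" is dropped. A real pipeline would also need the document grouping and inter-phrase frequency adjustment that the abstract mentions, which this sketch leaves out.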
Authors and Affiliations
Oleksii Kungurtsev, Svetlana Zinovatnaya, Iana Potochniak, Maxim Kutasevych
Analysis of correlation dimensionality of the state of a gas medium at early ignition of materials
We have considered the application of the method of nonlinear dynamic systems in order to analyze and detect the structural patterns in the dynamics of increments in the state of a gas medium generated...
Establishing the effect of eggplant powders on the rheological characteristics of a semi-finished product made from liver pate masses
Results of the study of functional properties of partially prepared liver pate masses with partial replacement of beef liver with edible eggplant powder (3 %, 5 %, 7 %) were presented. Structural and mechanical charac...
Design of the laboratory bench for a hydrovolumetric-mechanical transmission of the tracked tractor
Double-flow hydrovolumetric mechanical transmissions are an advanced technical solution that aims to increase productivity, improve efficiency and convenience of control over wheeled and tracked tractors. Their arrange...
Modeling the process of oil displacement by a heat carrier considering the capillary effect
The manuscript is aimed at improving the mathematical model of oil production in a heterogeneous environment with the use of a thermal mode of displacement considering the action of capillary effect. We have construct...
Development of the universal model of mechatronic system with a hydraulic drive
The growing demands on the performance of mechatronic systems with a hydraulic drive of movable operating elements of self-propelled machines require application of new approaches to the process of their development and d...