Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-purpose corpus of stylistic edits from the Wikipedia revision history
Type
Master thesis
Not peer reviewed

Date
2016-09-01
Abstract
This thesis proposes an approach to automatic
evaluation of the stylistic quality of natural
texts through data-driven methods of Natural
Language Processing. The advantages of data-driven
methods and their dependence on the size of the
training data are discussed, as are the advantages
of using Wikipedia as a source for textual data
mining. The method in this project
crucially involves a program for quick automatic
extraction of sentences edited by users from the
Wikipedia Revision History. The resulting edits
have been compiled in a large-scale corpus of
examples of stylistic editing. The complete
modular structure of the extraction program is
described and its performance is analyzed.
Furthermore, the need to separate stylistic edits
from factual ones is discussed
and a number of Machine Learning classification
algorithms for this task are proposed and tested.
The program developed in this project was able to
process approximately 10% of the whole Russian
Wikipedia Revision History (200 gigabytes of
textual data) in one month, resulting in the
extraction of more than two million user
edits. The best algorithm for the classification
of edits into factual and stylistic ones achieved
86.2% cross-validation accuracy, which is
comparable to the state-of-the-art performance of
similar models described in published papers.
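
The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the program developed in the thesis: the function names `split_sentences` and `extract_edited_pairs` are hypothetical, the sentence splitter is deliberately naive, and only one-to-one sentence replacements between two revisions are kept. A real pipeline for Russian text would likely need a proper sentence tokenizer and a streaming reader for the revision dump.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_edited_pairs(old_text, new_text):
    """Return (old_sentence, new_sentence) pairs rewritten between two revisions."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    # Align the two revisions sentence by sentence.
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only one-to-one replacements; insertions, deletions and
        # multi-sentence rewrites are skipped in this sketch.
        if tag == "replace" and (i2 - i1) == (j2 - j1) == 1:
            pairs.append((old[i1], new[j1]))
    return pairs


if __name__ == "__main__":
    before = ("Moscow is the capital of Russia. It is a very big city. "
              "It lies on the Moskva River.")
    after = ("Moscow is the capital of Russia. It is a large city. "
             "It lies on the Moskva River.")
    for old_sentence, new_sentence in extract_edited_pairs(before, after):
        print(old_sentence, "->", new_sentence)
```

Each extracted pair would then be a candidate for the stylistic-versus-factual classification step that the abstract mentions.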
Publisher
The University of Bergen
Collections
- Linguistics 64
Copyright the author. All rights reserved