Copy detection in Chinese documents using the Ferret: a report on experiments

Bao, J., Lyon, C., Lane, P.C.R., Ji, W. and Malcolm, J. (2006) Copy detection in Chinese documents using the Ferret: a report on experiments. University of Hertfordshire.

The Ferret copy detector has been used for some years on English texts to find plagiarism in large collections of students' coursework. This article reports on extending its application to Chinese, which differs from English in many respects: the sequence of characters that makes up a Chinese text does not have word boundaries marked, there is a vast Chinese "alphabet", or number of different characters, and the characters are represented with multi-byte encoding. We discuss issues of representation, focus on the effectiveness of a sub-symbolic approach, and show how the Ferret can circumvent the classic problem of finding word boundaries with an automated system. Corpora of students' coursework from two Chinese universities have been collected, and we apply Ferret to investigate the detection of plagiarism. Our experiments show that Ferret can find both artificially constructed plagiarism and naturally occurring, previously undetected plagiarism. We also investigate the parameters of the system and report on typical optimum settings. The experiments reported in this article show that Ferret can work well on Chinese texts and achieve consistent performance. The investigation into the representation of written Chinese is likely to be of use in other language processing tasks.
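As a rough illustration of how a sub-symbolic comparison can sidestep word segmentation, the sketch below compares two texts by their overlapping character n-grams and scores them with a Jaccard coefficient. This is only a minimal sketch under stated assumptions: the abstract does not specify the unit, the n-gram length, or the similarity measure used in the experiments, so the choice of character trigrams, the Jaccard score, and the function names `char_ngrams` and `resemblance` are all hypothetical here.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams in text.

    Whitespace is dropped first. No word segmentation is attempted, so the
    same code applies to Chinese text, where word boundaries are not marked.
    Assumption: n=3 is an illustrative default, not a setting from the report.
    """
    chars = [c for c in text if not c.isspace()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def resemblance(text_a, text_b, n=3):
    """Jaccard coefficient of the two texts' n-gram sets (0.0 to 1.0).

    Assumption: Jaccard is used here as a plausible set-overlap score; the
    abstract does not name the measure.
    """
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts shorter than n characters
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    original = "我喜欢学习中文"
    suspect = "我喜欢学习英文"
    print(resemblance(original, suspect))
```

Because Python 3 strings are sequences of Unicode code points, multi-byte encodings only matter when the files are read (for example with an explicit `encoding="utf-8"` or `encoding="gb18030"` argument to `open`); after decoding, each Chinese character is a single unit in the n-gram window.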

Item Type	Other
Date Deposited	27 Jul 2024 00:08
Last Modified	27 Jul 2024 00:08


