Plagiarism Detection with Genetic-Based Parameter Tuning
Abstract
A crucial step in plagiarism detection is text alignment: finding similar text fragments between two given documents. We introduce an optimization methodology based on genetic algorithms that improves the performance of a plagiarism detection model by optimizing its input parameters. The genetic algorithm uses a nonbinary representation of individuals, elitist selection, uniform crossover, and a high mutation rate. The resulting parameter settings allow the plagiarism detection model to achieve better results than state-of-the-art approaches.
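The four design choices named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The parameter names, bounds, population size, and mutation rate are hypothetical, and toy_fitness is a stand-in for running the detection model and scoring its output. Elitism is sketched here as keeping the top individuals unchanged and drawing parents from that elite pool, which is an assumption about how selection is done.

```python
import random

# Hypothetical search space: parameter name -> (low, high) bounds.
# Names and ranges are illustrative only, not taken from the paper.
BOUNDS = {
    "similarity_threshold": (0.0, 1.0),
    "max_gap": (0.0, 30.0),
    "min_fragment_size": (0.0, 300.0),
}


def toy_fitness(ind):
    """Stand-in for the detector's evaluation score (higher is better)."""
    target = {"similarity_threshold": 0.33, "max_gap": 4.0, "min_fragment_size": 150.0}
    return -sum(
        ((ind[k] - target[k]) / (BOUNDS[k][1] - BOUNDS[k][0])) ** 2 for k in BOUNDS
    )


def random_individual(rng):
    # Nonbinary representation: each gene is a real-valued parameter.
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in BOUNDS.items()}


def uniform_crossover(a, b, rng):
    # Each gene is copied from either parent with equal probability.
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in BOUNDS}


def mutate(ind, rng, rate):
    # Each gene is resampled within its bounds with probability `rate`.
    return {
        k: (rng.uniform(*BOUNDS[k]) if rng.random() < rate else v)
        for k, v in ind.items()
    }


def evolve(fitness, pop_size=30, generations=40, elite=4, mutation_rate=0.3, seed=0):
    """Return the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elites = population[:elite]  # elitism: survive unchanged
        children = []
        while len(children) < pop_size - elite:
            p1, p2 = rng.sample(elites, 2)  # assumed: parents come from the elites
            child = uniform_crossover(p1, p2, rng)
            children.append(mutate(child, rng, mutation_rate))
        population = elites + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history
```

Calling evolve(toy_fitness) returns the best parameter dictionary and the best score per generation. Because the elites are carried over unchanged, that score never decreases, while the comparatively high per-gene mutation rate keeps new values entering the population. In a real setting, the fitness function would run the detector on a training corpus with the candidate parameters and return its evaluation score.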