Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?

Labbé, Cyril; Labbé, Dominique

doi:10.1007/s11192-012-0781-y

Published: 22 June 2012

Scientometrics, Volume 94, pages 379–396 (2013)

Abstract

Two kinds of bibliographic tools are used to retrieve scientific publications and make them available online. Tools of the first kind are free to access because they store information that is publicly available online. Tools of the second kind charge access fees because they are compiled from information provided by the major publishers of scientific literature. The former can easily be interfered with, but it is generally assumed that the latter guarantee the integrity of the data they sell. Unfortunately, duplicate and fake publications are appearing in scientific conferences and, as a result, in the bibliographic services. We demonstrate a software method of detecting these duplicate and fake publications. Both the free services (such as Google Scholar and DBLP) and the charged-for services (such as IEEE Xplore) accept and index these publications.

Notes

http://ip-science.thomsonreuters.com/news/2005-04/8272986/.
Bibliographic information and corpora are available upon request to the authors.
http://arxiv.org/help/endorsement.
http://pdos.csail.mit.edu/scigen/.
Blog post: http://pythonic.pocoo.org/2009/1/28/fun-with-scigen; SCIgen-Physics Sources: http://bitbucket.org/birkenfeld/scigen-physics/overview.
http://paperdetection.blogspot.com/.
http://montana.informatics.indiana.edu/cgi-bin/fsi/fsi.cgi.
http://sigma.imag.fr/labbe/main.php.
February and March 2012.

References

Ball, P. (2005). Computer conference welcomes gobbledegook paper. Nature, 434, 946.
Beel, J., & Gipp, B. (2010). Academic search engine spam and Google Scholar’s resilience against it. Journal of Electronic Publishing, 13(3). http://hdl.handle.net/2027/spo.3336451.0013.305.
Benzecri, J. P. (1980). L’analyse des données. Paris: Dunod.
Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
Dalkilic, M. M., Clark, W. T., Costello, J. C., & Radivojac, P. (2006). Using compression to identify classes of inauthentic texts. In Proceedings of the 2006 SIAM Conference on Data Mining.
Elmacioglu, E., & Lee, D. (2009). Oracle, where shall I submit my papers? Communications of the ACM (CACM), 52(2), 115–118.
Falagas, M. E., Pitsouni, E. I., Malietzis, G. A., & Pappas, G. (2008). Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses. The FASEB Journal, 22(2), 338–342.
Hockey, S., & Martin, J. (1988). OCP users’ manual. Oxford: Oxford University Computing Service.
Jacso, P. (2008). Testing the calculation of a realistic h-index in Google Scholar, Scopus, and Web of Science for F. W. Lancaster. Library Trends, 56(4).
Jacso, P. (2008). The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32(3), 437–452. doi:10.1108/14684520810889718.
Kato, J. (2005). ISI Web of Knowledge: proven track record of high quality and value. KnowledgeLink newsletter from Thomson Scientific.
Labbé, C. (2010). Ike Antkare, one of the great stars in the scientific firmament. International Society for Scientometrics and Informetrics Newsletter, 6(2), 48–52.
Labbé, C., & Labbé, D. (2001). Inter-textual distance and authorship attribution Corneille and Molière. Journal of Quantitative Linguistics, 8(3), 213–231.
Labbé, D. (2007). Experiments on authorship attribution by intertextual distance in English. Journal of Quantitative Linguistics, 14(1), 33–80.
Lavoie, A., & Krishnamoorthy, M. (2010). Algorithmic detection of computer generated text. ArXiv e-prints.
Lee, L. (1999). Measures of distributional similarity. In 37th Annual Meeting of the Association for Computational Linguistics, pp. 25–32.
Li, M., Chen, X., Li, X., Ma, B., & Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.
Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 569–576.
Parnas, D. L. (2007). Stop the numbers game. Communications of the ACM, 50(11), 19–21.
Roux, M. (1985). Algorithmes de classification. Paris: Masson.
Roux, M. (1994). Classification des données d’enquête. Paris: Dunod.
Savoy, J. (2006). Les résultats de Google sont-ils biaisés? Genève: Le Temps.
Sneath, P., & Sokal, R. (1973). Numerical Taxonomy. San Francisco: Freeman.
Xiong, J., & Huang, T. (2009). An effective method to identify machine automatically generated paper. In Pacific-Asia Conference on Knowledge Engineering and Software Engineering, 2009, KESE ’09, pp. 101–102.
Yang, K., & Meho, L. I. (2006). Citation analysis: a comparison of Google Scholar, Scopus, and Web of Science. American Society for Information Science and Technology, 43(1), 1–15.

Acknowledgments

The authors would like to thank Tom Merriam, Jacques Savoy and Edward Arnold for their careful readings of previous versions of this paper, and the anonymous reviewers and members of the LIG laboratory for their valuable comments.

Author information

Authors and Affiliations

Laboratoire d’Informatique de Grenoble, Université Joseph Fourier, Grenoble, France
Cyril Labbé
PACTE, Institut d’Etudes Politiques de Grenoble, Grenoble, France
Dominique Labbé

Corresponding author

Correspondence to Cyril Labbé.

Appendices

Appendix 1: Examples of SCIgen papers

Figure 5 is an example of a SCIgen-Physics paper. Formula generation has been improved compared to that used by SCIgen-Origin (cf. Fig. 6).

Appendix 2: Comparison between inter-textual distance and other similarity indices

Figures 7, 8 and 9 show the dendrograms obtained using the cosine, Jaccard and Euclidean metrics. They were computed using the R text mining package (Meyer et al. 2008) and should be compared with the dendrogram in Fig. 4. The dendrograms for the cosine and Euclidean metrics do not group the Ike Antkare corpus together.
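To make the three comparison metrics concrete, the following is a minimal Python sketch of cosine, Jaccard and Euclidean distances over simple word-count vectors. It is an illustration only: the tokenization (lowercasing and splitting on whitespace), the function names and the toy sentences are assumptions, and the R text mining package used for the figures may weight or normalize terms differently.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, split on whitespace, and count word occurrences (assumed tokenization)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for empty texts
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """1 minus shared vocabulary over combined vocabulary (counts are ignored)."""
    union = a.keys() | b.keys()
    if not union:
        return 0.0
    return 1.0 - len(a.keys() & b.keys()) / len(union)


def euclidean_distance(a, b):
    """Euclidean distance between raw term-count vectors."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


if __name__ == "__main__":
    # Hypothetical toy texts, not taken from any corpus in the paper.
    x = term_counts("we refine symbiotic epistemologies")
    y = term_counts("we measure cache latency")
    for name, fn in [("cosine", cosine_distance),
                     ("jaccard", jaccard_distance),
                     ("euclidean", euclidean_distance)]:
        print(name, round(fn(x, y), 3))
```

Note that the Euclidean distance on raw counts grows with text length, whereas the cosine distance does not, so the choice of metric can change which texts appear close to one another.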

Table 4 gives the results of classifying each text of the MLT corpus by assigning it to the class of its nearest neighbor. The arXiv data set was not tested because its size makes the use of the R text mining package problematic.
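The nearest-neighbor rule described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the cosine distance stands in for whichever metric is being evaluated, and the labels "genuine" and "generated", the function names and the toy texts are all hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 minus the cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbor_label(text, labeled_texts, distance=cosine_distance):
    """Return the label of the labeled text that is closest to `text`."""
    query = Counter(text.lower().split())
    best_label, best_dist = None, math.inf
    for other, label in labeled_texts:
        d = distance(query, Counter(other.lower().split()))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label


# Hypothetical labeled examples; a real run would use full papers.
LABELED = [
    ("the cache coherence protocol improves latency", "genuine"),
    ("we measure cache latency on real hardware", "genuine"),
    ("symbiotic epistemologies refine flip flop gates", "generated"),
    ("we refine symbiotic epistemologies and flip flop gates", "generated"),
]

if __name__ == "__main__":
    print(nearest_neighbor_label("flip flop gates and symbiotic models", LABELED))
    print(nearest_neighbor_label("cache latency on hardware", LABELED))
```

Passing a different function as `distance` is enough to compare metrics on the same labeled set, which is the kind of comparison Table 4 reports.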

Table 4 Classification of the MLT Corpus (122 papers) using Inter-textual distance, Cosine, Euclidean and Jaccard metrics

About this article

Labbé, C., Labbé, D. Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94, 379–396 (2013). https://doi.org/10.1007/s11192-012-0781-y

Received: 30 January 2012
Published: 22 June 2012
Issue Date: January 2013
DOI: https://doi.org/10.1007/s11192-012-0781-y
