A more serious study of the Public Git Archive (PGA)

Following up on the Octoverse clues, I uncovered this gem: Markovtsev, Vadim, and Waren Long. “Public git archive: a big code dataset for all.” In Proceedings of the 15th International Conference on Mining Software Repositories, pp. 34-37. ACM, 2018. You can look at the arXiv version here.

This study points to the following as the most popular programming languages:

  1. C
  2. JS
  3. C++
  4. Java
  5. PHP
  6. Go
  7. Python
  8. Obj-C
  9. C#
  10. Ruby

If you’re into data mining and analysis of REALLY large public datasets, this one offers lots to work with. According to the authors, the Public Git Archive occupies 3.0 TB on disk. Enjoy!

 

Octoverse

GitHub’s Octoverse provides some serious insight into what’s hot and what’s not with developers. For example, here are the top projects by contributors:

1 Microsoft/vscode 19K
2 facebook/react-native 10K
3 tensorflow/tensorflow 9.3K

 

Top Growing Languages:

1 Kotlin 2.6X
2 HCL 2.2X
3 TypeScript 1.9X

There is lots more data there. I wish they showed more information, not just the top 10s.