A more serious study of the Public Git Archive (PGA)

Following up on the Octoverse clues, I uncovered this gem: Markovtsev, Vadim, and Waren Long. “Public git archive: a big code dataset for all.” In Proceedings of the 15th International Conference on Mining Software Repositories, pp. 34-37. ACM, 2018. You can look at the arXiv version here.

This study points to the following as the most popular programming languages:

  1. C
  2. JS
  3. C++
  4. Java
  5. PHP
  6. Go
  7. Python
  8. Obj-C
  9. C#
  10. Ruby

If you’re into data mining and analysis of REALLY large public datasets, this one offers lots to work with. According to the authors, the Public Git Archive occupies 3.0 TB on disk. Enjoy!


Hagen’s Biological and clinical data integration in healthcare study is great!

Just finished looking at Matt Hagen’s 2014 PhD dissertation, “Biological and clinical data integration and its applications in healthcare.” This is a great piece of work. You can find it here.

While it’s around 5 years old, the insights and discussion are excellent. I like the detailed breakdown of how different ontologies and vocabularies align (and how things fall through the cracks). I also liked the discussion of using Neo4j to analyze relationships and to simplify searches and relationship mappings.

I particularly liked the discussion of using ontologies to “facilitate improved prioritization of intensive care admissions and accurate clustering of multimorbidity conditions”. THIS IS BIG, with enormous potential!

Also worth reading is the discussion of his BioSPIDA relational database translator and how it contrasts with the separate Entrez Gene, PubMed, CDD, RefSeq, MMDB, and BioSystems NCBI databases.

His Table 7.2 (Descriptions of patient clusters) is rather illuminating, as is his discussion and analysis of ICU electronic health records and the findings associated with morbidity outcomes.

For example, Cluster 1 contains the following Most Prevalent Conditions: Coronary arteriosclerosis, Hypercholesterolemia, Diabetes, Gastroesophageal reflux disease, Atrial fibrillation, Hyperlipidemia, Tobacco dependence. The Most Prevalent Procedures listed for the same cluster are: Catheterization of left heart, Cardiopulmonary bypass operation, Angiocardiography of left heart.
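
To get a feel for what grouping patients by shared conditions looks like, here is a minimal sketch. It is not the dissertation's method: the patient records are made-up toy data (only the condition names come from the table above), and the Jaccard similarity, the 0.5 threshold, and the `cluster` helper are my own assumptions for illustration.

```python
from itertools import combinations

# Hypothetical toy records (not from the dissertation): patient id -> conditions.
patients = {
    "p1": {"Coronary arteriosclerosis", "Hypercholesterolemia", "Diabetes"},
    "p2": {"Coronary arteriosclerosis", "Hypercholesterolemia", "Atrial fibrillation"},
    "p3": {"Gastroesophageal reflux disease", "Tobacco dependence"},
    "p4": {"Gastroesophageal reflux disease", "Tobacco dependence", "Hyperlipidemia"},
}


def jaccard(a, b):
    """Share of conditions two patients have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(records, threshold=0.5):
    """Single-link grouping: patients end up together when a chain of
    pairs with similarity >= threshold connects them (assumed threshold)."""
    parent = {p: p for p in records}

    def find(p):
        # Follow parent links to the group representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(records, 2):
        if jaccard(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in records:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


clusters = cluster(patients)
print(clusters)
```

With this toy data the cardiac-leaning patients and the reflux/tobacco patients fall into separate groups. Real EHR data would need proper coding of conditions and a far more careful choice of similarity measure and clustering algorithm.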


I am surprised this work is not cited more often! IMHO, it should definitely be used as a blueprint for additional investigations.