
C Code On GitHub Has the Most "Ugly Hacks"


  • C Code On GitHub Has the Most "Ugly Hacks"

    An analysis of GitHub data finds that, among all programming languages, C code contains the most references to down and dirty code fixes

    By Phil Johnson


    Picture of an old Mac power supply with black tape holding it together. Credit: Rick Bradley (CC BY 2.0)

    If you’re a developer, no matter which programming languages you code in, you’ve no doubt run into - or created - ugly hacks. Whether it’s due to an impending deadline, a lack of knowledge, or simple laziness, sometimes you just have to code something in a way that you know isn’t the best way to do it, with, of course, every intention to revisit and redo it properly later. But which programming language tends to lead programmers to create the most down, dirty, and ugly code fixes and implementations? Based on GitHub data, it turns out that C developers are creating the most ugly hacks or, at least, are those most willing to admit to it.

    To answer the question of which programming language produces the most ugly hacks, I first used the search feature on GitHub, looking for code files that contained the string “ugly hack.”

    In that case, C comes up first by a wide margin, with over 181,000 code files containing that string. The rest of the top ten languages were PHP (79k files), JavaScript (38k), C++ (22k), Python (19k), Text (11k), Makefile (11k), HTML (10k), Java (7k), and Perl (4k).

    To make things a little more formal, I then decided to control for the number of repositories per language. To do that, I first queried the GitHub Archive using Google BigQuery to find the number of non-forked repositories created per language between January 1, 2013 (inclusive) and May 1, 2015 (exclusive), to base the comparison on relatively recent code (the query I ran is listed at the end of this post).

    I then ran another search on GitHub, this time using the advanced search options to look for code files containing “ugly hack” in non-forked repositories created between 1/1/13 and 5/1/15, and calculated the average number of code files containing the string “ugly hack” per repository, by language.
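    The per-repository normalization described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual calculation: the function name `files_per_repo` and the sample numbers are assumptions made up for the example and are not the article's figures.

```python
def files_per_repo(file_counts, repo_counts):
    """Return (language, files-per-repository) pairs, highest rate first.

    file_counts: language -> number of code files containing the phrase
    repo_counts: language -> number of non-forked repositories in the window
    """
    rates = {}
    for lang, files in file_counts.items():
        repos = repo_counts.get(lang, 0)
        if repos > 0:  # skip languages with no repositories in the window
            rates[lang] = files / repos
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs for illustration only; these are NOT the article's data.
example_files = {"LangA": 300, "LangB": 100, "LangC": 50}
example_repos = {"LangA": 1000, "LangB": 1000, "LangC": 5000}

ranking = files_per_repo(example_files, example_repos)
for lang, rate in ranking:
    print(f"{lang}: {rate:.3f} files per repository")
```

    Dividing by the repository count is what lets a language with relatively few repositories be compared fairly against one with many.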

    Below is a chart of how that shook out.

    [Chart: average number of code files containing “ugly hack” per repository, by language. Credit: ITworld/Phil Johnson]

    Even when controlling for the number of repositories, C wins the ugly-hackathon by a landslide. C had almost three times as many mentions of ugly hacks per repository as the next language, PHP, and almost 50 times as many as Java, which ranked 12th on this list.

    This approach has a couple of potential flaws. First, a code file may contain the string “ugly hack” because somebody fixed or removed an ugly hack (e.g., "Fixed an ugly hack"), so we’re undoubtedly counting some files that mention an ugly hack but don’t actually contain one (anymore). Second, whether a code file has one or many ugly hacks, it only counts once under this measure. We could, then, be under-counting the actual number of ugly hacks that are out there.
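    The second flaw is easy to see in a small sketch that compares the file-level count (what a code search reports) with the total number of occurrences. The function name `count_hack_mentions`, the case-insensitive matching, and the sample file contents are assumptions for illustration; they are not taken from GitHub's search implementation.

```python
def count_hack_mentions(files, phrase="ugly hack"):
    """Return (files containing the phrase, total occurrences of the phrase).

    files: mapping of file name -> file contents.
    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    needle = phrase.lower()
    files_with_phrase = 0
    total_occurrences = 0
    for text in files.values():
        n = text.lower().count(needle)
        if n:
            files_with_phrase += 1  # a file counts once, however many hits it has
        total_occurrences += n
    return files_with_phrase, total_occurrences


# Hypothetical file contents for illustration only.
sample = {
    "a.c": "/* ugly hack: skip header */\n/* another ugly hack below */\n",
    "b.c": "/* Fixed an ugly hack */\n",
    "c.c": "int main(void) { return 0; }\n",
}

file_count, occurrence_count = count_hack_mentions(sample)
print(file_count, occurrence_count)
```

    Note that `b.c` also illustrates the first flaw: it is counted even though its comment says the hack was fixed.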

    No matter how you slice it, however, C seems to be generating more ugly hacks than any other programming language. Or, looking at it another way, C developers are the most honest about when they code an ugly hack. Either way, there’s no doubt that all of those ugly hacks will soon be fixed.

    Right?

    Notes:

    Google BigQuery query used to pull counts of non-forked GitHub repositories by programming language created between 1/1/2013 (inclusive) and 5/1/2015 (exclusive):

    Code:
    SELECT repository_language, count(repository_language) AS repos_by_lang
    FROM [githubarchive:github.timeline]
    WHERE repository_fork == "false"
    AND type == "CreateEvent"
    AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('2013-01-01 00:00:00')
    AND PARSE_UTC_USEC(repository_created_at) < PARSE_UTC_USEC('2015-05-01 00:00:00')
    GROUP BY repository_language
    ORDER BY repos_by_lang DESC
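    For readers without BigQuery access, the filtering and grouping logic of the query above can be mirrored on in-memory records. This is only a sketch: the field names follow the query, but the timestamp format, the function name `repos_by_language`, and the sample events are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime

START = datetime(2013, 1, 1)  # inclusive lower bound, as in the query
END = datetime(2015, 5, 1)    # exclusive upper bound, as in the query


def repos_by_language(events):
    """Count non-forked CreateEvent records per language, most common first."""
    counts = Counter()
    for event in events:
        language = event["repository_language"]
        if language is None:
            continue  # records with no language are not counted in this sketch
        # Timestamp format is an assumption of this sketch.
        created = datetime.strptime(event["repository_created_at"], "%Y-%m-%d %H:%M:%S")
        if (event["repository_fork"] == "false"
                and event["type"] == "CreateEvent"
                and START <= created < END):
            counts[language] += 1
    return counts.most_common()


def make_event(language, fork, event_type, created):
    """Build one hypothetical event record with the fields the query uses."""
    return {"repository_language": language, "repository_fork": fork,
            "type": event_type, "repository_created_at": created}


# Hypothetical events for illustration only.
events = [
    make_event("C", "false", "CreateEvent", "2013-01-01 00:00:00"),    # on the lower bound: kept
    make_event("C", "false", "CreateEvent", "2015-05-01 00:00:00"),    # on the upper bound: dropped
    make_event("PHP", "true", "CreateEvent", "2014-03-10 09:30:00"),   # fork: dropped
    make_event("PHP", "false", "CreateEvent", "2014-06-15 12:00:00"),  # kept
    make_event("C", "false", "PushEvent", "2014-07-01 08:00:00"),      # wrong event type: dropped
    make_event("C", "false", "CreateEvent", "2014-02-02 10:00:00"),    # kept
]

result = repos_by_language(events)
print(result)
```

    The half-open window (`START <= created < END`) matches the `>=` and `<` comparisons in the query, so a repository created exactly at midnight on 5/1/2015 is excluded.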
    The Hackmaster