Using survival analysis and git-pandas to assess code quality
Published: 2019-05-11

Survival analysis is a statistical technique for estimating how likely an event is to occur over a timeline.  It has its roots in the medical and actuarial professions, where it answers questions like: given this set of conditions, how likely is a person to survive X years?  We have already seen that git-pandas lets us tap into the huge amount of data held in git repositories.  In this post, we will try to derive a measure of relative code quality among the committers of a large open source project.

To do this, we will use survival analysis, notably Cam Davidson-Pilon's lifelines library, to examine the likelihood that a developer's code survives.  First, we must figure out what 'survive' means in this context.

With git data it is often much simpler to do analysis at the file or module level than at the level of single lines of code.  In this case, we will look only at files.  Any given file at any given point in time is considered to have an owner: the contributor responsible for the majority of the code in that file.
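To make the notion of an owner concrete, here is a minimal sketch. It assumes we already have one author name per line of a file (for example from blame output); the function name `majority_owner` and the sample data are hypothetical, and this is not how git-pandas computes ownership internally.

```python
from collections import Counter


def majority_owner(line_authors):
    """Return the author credited with the most lines, or None for an empty file."""
    if not line_authors:
        return None
    # Count lines per author and pick the author with the largest share.
    counts = Counter(line_authors)
    owner, _ = counts.most_common(1)[0]
    return owner


# Hypothetical per-line authorship for a single file.
blame = ['alice'] * 120 + ['bob'] * 30 + ['carol'] * 5
print(majority_owner(blame))
```

Strictly speaking this picks the author with the largest share of lines, which may be less than half of the file when many people have contributed.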

We expect regular small updates to the file, but occasionally a large edit, addition, deletion or refactor occurs.  We consider this event to be a 'death' in the context of survival analysis.  With git-pandas and lifelines we can therefore quickly generate a dataset, given a rule for deciding what counts as a refactor, and then use the Kaplan-Meier estimator to generate a survival plot for those contributors with enough data to do so.
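For readers who have not met the Kaplan-Meier estimator before, the sketch below shows roughly what it computes, in plain Python. It is an illustration only, not the lifelines implementation; the function name `kaplan_meier` and the sample durations are made up. At each time where at least one death is observed, the running survival probability is multiplied by the fraction of still-at-risk subjects that did not die; censored subjects (no refactor seen yet) simply leave the at-risk pool.

```python
def kaplan_meier(durations, observed):
    """Return a list of (time, survival probability) at each observed death time."""
    # Sort subjects by duration so we can walk through time in order.
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve = []
    i = 0
    while i < len(events):
        t = events[i][0]
        deaths = 0
        leaving = 0
        # Gather every subject whose duration equals t.
        while i < len(events) and events[i][0] == t:
            if events[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: scale by the fraction that survived time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the at-risk pool.
        at_risk -= leaving
    return curve


# Hypothetical data: days until refactor, and whether a refactor was actually seen.
durations = [5, 10, 10, 20, 30]
observed = [True, True, False, True, False]
for day, prob in kaplan_meier(durations, observed):
    print(day, round(prob, 3))
```

In practice lifelines handles this, along with confidence intervals and plotting, through `KaplanMeierFitter`.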

This isn’t perfect: not all files really have ‘owners’, and not all major changes are really ‘refactors’, but it’s an interesting start.  So let’s try an example.

Big Additions

In this case we use the popular scikit-learn repository.

Our rule for a refactor in this case is any single commit in which the file has a net line growth of over 100 lines.  The code to generate the dataset and plot is pretty simple:

```python
from gitpandas import Repository
import numpy as np
import lifelines
import matplotlib.pyplot as plt

threshold = 100

repo = Repository(working_dir='git://github.com/scikit-learn/scikit-learn.git', verbose=True)
fch = repo.file_change_history(limit=100000, extensions=['py', 'pyx', 'h', 'c', 'cpp'])

fch['file_owner'] = ''
fch['refactor'] = 0
# convert the commit timestamps (nanoseconds since the epoch) into whole days
fch['timestamp'] = fch.index.astype(np.int64) // (24 * 3600 * 10**9)
fch['observed'] = False
fch = fch.reset_index()

# add in the file owner and whether or not each item is a refactor
# (.at replaces DataFrame.set_value, which was removed in pandas 1.0)
for idx, row in fch.iterrows():
    fch.at[idx, 'file_owner'] = repo.file_owner(row.rev, row.filename, committer=True)
    # abs() means a large net change in either direction counts as a refactor
    if abs(row.insertions - row.deletions) > threshold:
        fch.at[idx, 'refactor'] = 1
    else:
        fch.at[idx, 'refactor'] = 0

# add in the time since column
fch['time_until_refactor'] = 0
for idx, row in fch.iterrows():
    ts = None
    # later refactors of the same file
    chunk = fch[(fch['timestamp'] > row.timestamp) & (fch['refactor'] == 1) & (fch['filename'] == row.filename)]
    if chunk.shape[0] > 0:
        # a refactor was seen: the 'death' is observed
        ts = chunk['timestamp'].min()
        fch.at[idx, 'observed'] = True
    else:
        # no refactor yet: censor at the last timestamp in the dataset
        ts = fch['timestamp'].max()
    fch.at[idx, 'time_until_refactor'] = ts - row.timestamp

# plot out some survival curves, one per owner with more than 20 rows
fig = plt.figure()
ax = plt.subplot(111)
for owner in set(fch['file_owner'].values):
    sample = fch[fch['file_owner'] == owner]
    if sample.shape[0] > 20:
        print('Evaluating %s' % (owner, ))
        kmf = lifelines.KaplanMeierFitter()
        kmf.fit(sample['time_until_refactor'].values, event_observed=sample['observed'], timeline=list(range(365)), label=owner)
        ax = kmf.survival_function_.plot(ax=ax)

plt.title('Survival function of file owners (thres=%s)' % (threshold, ))
plt.xlabel('Lifetime (days)')
plt.show()
```

This will yield the plot:

The result looks pretty good.  You can see there is some kind of decay over time, with most owners' files going without a refactor (by this measure) for many months, and a clear distinction between owners by the end of the plot (1 year).

Big Deletions

As a second example, let’s look at the exact same repo and the same problem, but change the rule so that a refactor is any commit with a net deletion of more than 100 lines.  To do this, we just edit the code above and set threshold equal to -100.  This yields the plot:

Interestingly, owners' files seem not to last nearly as long under the deletion rule, with all of them dropping below a 10% likelihood of survival (going without a refactor) after a few months.  This is a pretty stark contrast to the previous definition of a refactor, and it drives home the point that how we define a refactor largely determines the end result here.
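One possible reason for the sharp contrast lies in the rule itself: the listing compares `abs(insertions - deletions)` with `threshold`, and an absolute value is never smaller than -100, so with `threshold = -100` every commit is flagged as a refactor.  A sign-aware rule is one way to express "net deletion of more than 100 lines"; the sketch below is an assumption about the intended behaviour, not the author's code, and the name `is_refactor` is hypothetical.

```python
def is_refactor(insertions, deletions, threshold):
    """Flag a commit as a refactor using a sign-aware net-change rule.

    A positive threshold flags net growth above it; a negative threshold
    flags net shrinkage below it.
    """
    net = insertions - deletions
    if threshold >= 0:
        return net > threshold
    return net < threshold


# A small edit is not flagged under the sign-aware deletion rule ...
print(is_refactor(5, 3, -100))
# ... whereas the abs()-based comparison flags it.
print(abs(5 - 3) > -100)
```

Swapping a rule like this into the dataset loop would be a natural first experiment when trying out alternative metrics.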

So my first two stabs at it are interesting, but they probably aren’t great representations.  The code above produces all of this, so try your own metrics and let me know what works or doesn’t.  If we can find something that works well in the general case, I’ll add it to the core library.

Reposted from: http://jzhwd.baihongyu.com/