Identifying Productivity Drivers by Modeling Work Units Using Partial Data

Todd L. Graves* and Audris Mockus+
*National Institute of Statistical Sciences and +Bell Laboratories


We describe a new algorithm for estimating a model for an independent variable which is not directly observed, but which represents one set of marginal totals of a positive sparse two-way table whose other margin and zero pattern are known.

The application which inspired the development of this algorithm arises in software engineering. It is desired to know what factors affect the effort required for a developer to make a change to the software: for instance, to identify difficult areas of the code, measure changes in code difficulty through time, and evaluate the effectiveness of development tools. Unfortunately, measurements of effort for changes are not available in historical data. We model change effort using a developer's total monthly effort and information about which changes she worked on in each month. We produce measures of uncertainty in the estimated model coefficients using the jackknife, omitting one developer at a time. We illustrate a few specific applications of our tool, and present a simulation study which speaks well of the reliability of the results our algorithm produces. In short, this algorithm allows analysts to determine whether, and by how much, a promising software engineering tool or practice affects coding effort.

1. Introduction

A particularly important quantity related to software is the cost of making a change to it. Large software systems are maintained using change management systems, which record information about all the changes programmers made to the software. These systems record a number of measurements on each change, such as its size and type, which one expects to be related to the effort required to make the change. In this paper we discuss the methodology and algorithms we have developed to assess the influence of these factors on software change effort.

Since change management data do not include measurements of the effort that was required to make a change, it is at first difficult to imagine estimating models for change effort. Effort estimates are often obtained for very large changes like new releases, but it is important to study effort at the level of individual changes, because at coarser aggregation levels, qualities affecting the effort of individual changes are likely not to be estimable. Therefore, our methodology makes use of two types of known information: the total amount of effort that a developer expends in a month, and the set of months during which a developer worked on a given change.

This gives rise to an ill-posed problem in which developers' monthly efforts represent the known column marginal totals of several two-way tables, and we also know which entries in these sparse tables are strictly positive. Also, we postulate the form of a model for the unobservable row marginal totals (amounts of effort for each change) as a function of several observed covariates. Our algorithm then iterates between improving the coefficients in the model for effort using imputed values of change effort, and adjusting the imputed values so that they are consistent with the known measurements of aggregated effort. We have called this algorithm MAIMED, for Model-Assisted Iterative Margin Estimation via Disaggregation. Uncertainty estimates for the model coefficients can then be obtained using the jackknife (Efron, 1982), in which each replication omits one of the two-way tables (developers). We anticipate that this methodology will be useful in other sorts of problems where individuals produce units of output, cost information is available only in units of time, and where the variables affecting productivity operate on the level of the output units.

This methodology has enormous potential for serving as a foundation for a variety of historical studies of software change effort. One example from this research is our finding that the code under study ``decayed'' in that changes to the code grew harder to make at a rate of 20% per year. We were also able to measure the extent to which changes which fix faults in software are more difficult than comparably sized additions of new functionality: fault fixes can be as much as 80% more difficult. (Difficulty of fixing bugs relative to new feature addition depends strongly on the software project under study, and the results from our analyses are consistent with the opinions of developers on those projects.) Other findings include evidence that a development tool makes developers more productive, and confirmation that a collection of subsystems believed to be more difficult than others in fact required measurably more effort to make an average change.

We present recommendations on using the estimation method in practice, discuss alternative approaches, and describe a simulation study that confirms the accuracy, outlier sensitivity, and convergence properties of the method.

The next section defines the estimation algorithm. §3 contains more detailed discussion of the data available in the software engineering context, and describes three sample data analyses using this algorithm. §4 describes our simulation study, §5 explains why more familiar approaches such as the EM algorithm were prohibitively difficult, §6 describes the software used to perform the estimation, and §7 gives our conclusions. We include an appendix §8 which describes some theoretical insight into the convergence properties of the algorithm.

2. Methodology

In this section we discuss the procedure to estimate models for change effort. Consider the data corresponding to a single developer d. These data may be thought of as a non-negative two-way table,

\begin{displaymath}\{ Y_{ijd}\geq 0: 1 \leq i \leq M, 1 \leq j \leq N\}.\end{displaymath}

Here i ranges over the M work units (here, changes to the code), and j ranges over the time periods (here, months) over which effort data are available. Known are the zero pattern

\begin{displaymath}\{ (i,j) : Y_{ijd} = 0 \},\end{displaymath}

since change data record when work on a change begins and ends, and the column sums

\begin{displaymath}Y_{\cdot jd} = \sum_{i=1}^M Y_{ijd}.\end{displaymath}

The statistical model for the row sums $Y_{i\cdot d} = \sum_{j=1}^N Y_{ijd}$ can be expressed as

\begin{displaymath}g(E(Y_{i\cdot d})) = X_{id}^\prime \beta,
\end{displaymath} (1)

where $X_{id}$ is a vector of covariates predicting change effort, $\beta$ is a vector of unknown parameters whose estimation is the goal of the procedure, $^\prime$ denotes transpose, and g is a link function. (Since the quantities to be predicted are positive, generalized linear models which lead to multiplicative models are appropriate.)

2.1 Fitting algorithm

The estimation procedure is inspired by iterative proportional fitting (IPF; see Deming and Stephan, 1940), a technique for estimating cell values in a multi-way table of probabilities when marginal totals are known and a sample from the population is available. IPF alternates (in the two-way case) between rescaling the rows so that the cells sum to the correct row totals, and performing the same operation on the columns, until convergence, which is normally rapid. In our problem only one of the margins is known, we postulate a model for the quantities given by the other margin, and knowledge of the zero pattern takes the place of the sample. After generating initial guesses at the cell values in the table, our algorithm alternates between making the cell values consistent with the column sums, and fitting a model to the resultant row sums, then rescaling cell values to be consistent with the model's fitted values.
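For contrast with the one-margin problem treated here, classic two-way IPF can be sketched in a few lines. This is a generic illustration, not the authors' code; the function name is ours.

```python
import numpy as np

def ipf(table, row_tot, col_tot, n_iter=50):
    """Classic two-way IPF (Deming and Stephan, 1940): alternately rescale
    rows and columns of a starting table until both known margins match."""
    T = np.asarray(table, dtype=float).copy()
    for _ in range(n_iter):
        T *= (np.asarray(row_tot) / T.sum(axis=1))[:, None]   # match row totals
        T *= np.asarray(col_tot) / T.sum(axis=0)              # match column totals
    return T
```

MAIMED replaces the known row margin in this loop with fitted values from a regression model, which is the subject of the next subsection.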

To be more precise, we first generate initial values for the cells in the two-way tables. A natural approach is to divide each column total evenly across all the nonzero entries in that column:

 \begin{displaymath}Y_{ijd}(0) = \vert\{i: Y_{ijd}>0 \}\vert^{-1} Y_{\cdot jd},
\end{displaymath} (2)

where $Y_{ijd}(t)$ denotes the estimate of $Y_{ijd}$ after the t-th iteration of the algorithm (it will be necessary to allow the argument t to take on fractional values because the cell values are updated twice within each iteration). Alternative initialization methods are considered in §4.3.
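The initialization in (2) amounts to one line of array arithmetic. A sketch, with illustrative names, assuming the zero pattern is supplied as a 0/1 indicator matrix:

```python
import numpy as np

def init_table(nonzero, col_sums):
    """Initial values per equation (2): split each month's known total
    effort evenly across the changes open in that month.
    nonzero  : (M x N) 0/1 matrix, 1 where Y_ijd > 0
    col_sums : length-N vector of monthly effort totals"""
    Z = np.asarray(nonzero, dtype=float)
    return Z * (np.asarray(col_sums, dtype=float) / Z.sum(axis=0))
```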

The following steps are then repeated until convergence.

1. Generate row sums from the current estimate of the cell values,

\begin{displaymath}Y_{i \cdot d}(t) = \sum_{j=1}^N Y_{ijd}(t).\end{displaymath}

2. Fit the statistical model given by (1), using each of the tables (one for each developer $d=1,2,\ldots,D$). This model generates regression coefficients $\beta(t)$ and fitted values $\hat Y_{i \cdot d}(t)$.

3. Rescale the cell values so that the row sums are equal to the model's fitted values:

\begin{displaymath}Y_{ijd}(t+1/2) = Y_{ijd}(t) \{\sum_{\ell=1}^N Y_{i\ell d}(t)\}^{-1}
\hat Y_{i \cdot d}(t).\end{displaymath}

4. Finally, rescale the cell values so that they agree with the known column sums:

\begin{displaymath}Y_{ijd}(t+1) = Y_{ijd}(t+1/2) \{\sum_{k=1}^M Y_{kjd}(t+1/2)\}^{-1}
Y_{\cdot jd}.\end{displaymath}

Convergence is obtained when the improvement in the error measure in the model fitting is negligible. In practice, we have found that the error measure decreases at each iteration, and that convergence is faster with a Poisson generalized linear model with a log link (i.e. a multiplicative model) than for an additive model.

At convergence, the algorithm reports the coefficients estimated in the model fitting during the last iteration.
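The iteration can be sketched numerically as follows. This is an illustrative sketch, not the authors' implementation: the function and argument names are ours, and the Poisson GLM of the paper is replaced by ordinary least squares on the log row sums, a simpler multiplicative stand-in.

```python
import numpy as np

def maimed(tables, col_sums, X, n_iter=30):
    """Sketch of the MAIMED iteration.
    tables   : list of (M_d x N_d) nonnegative arrays, one per developer,
               initialized e.g. by splitting column totals evenly
    col_sums : list of length-N_d arrays of known monthly efforts
    X        : (sum_d M_d) x p covariate matrix, rows stacked by developer"""
    Y = [t.astype(float).copy() for t in tables]
    for _ in range(n_iter):
        # 1. row sums = current imputed change efforts, stacked over developers
        rows = np.concatenate([t.sum(axis=1) for t in Y])
        # 2. fit the multiplicative effort model to the imputed row sums
        beta, *_ = np.linalg.lstsq(X, np.log(rows), rcond=None)
        fitted = np.exp(X @ beta)
        # 3. rescale rows to match the model's fitted values (step t + 1/2)
        k = 0
        for t in Y:
            m = t.shape[0]
            t *= (fitted[k:k + m] / t.sum(axis=1))[:, None]
            k += m
        # 4. rescale columns to match the known monthly totals (step t + 1)
        for t, c in zip(Y, col_sums):
            t *= c / t.sum(axis=0)
    return beta, Y
```

Note that both rescaling steps multiply cells by positive factors, so the known zero pattern is preserved automatically.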

2.2 Jackknife

Since these regression coefficients are obtained using an iterative algorithm and a number of fitted regressions, it is necessary to measure their uncertainties using a nonparametric method which takes the estimation procedure into account. The jackknife (see, for instance, Efron, 1982) supplies a method of estimating uncertainties in estimated parameters by rerunning the estimation algorithm once for each data point in the sample, each time leaving one of the data points out. A collection of estimated parameters results, and the degree of difference between these parameters is related to the sampling variability in the overall estimate. For example, suppose that the statistic of interest is denoted by $\hat\theta$, which was computed from a dataset of size n. What is required is an estimate of the standard error of $\hat\theta$. For $i= 1, 2, \ldots, n$, compute $\hat\theta_{(i)}$ using the data set with the ith observation deleted. Set $\hat\theta_{(\cdot)} = n^{-1}\sum_{i=1}^n \hat\theta_{(i)}$. The jackknife estimate of standard error of $\hat\theta$ is

\begin{displaymath}\left\{ \frac{n-1}{n} \sum_{i=1}^n (\hat \theta _{(i)} -
\hat \theta_{(\cdot)})^2 \right\}^{1/2}. \end{displaymath}

An alternative to the jackknife is the bootstrap (see, for example, Efron and Tibshirani, 1993). Because the iterative algorithm is time-consuming, we preferred the jackknife, which requires only one additional run of the algorithm per developer. For the purposes of resampling, we have only one data point per developer: omitting some of a developer's changes would leave months in which the total effort of the changes is less than the observed monthly effort, and omitting some months would break up changes whose lifetimes span those months, making it impossible to define the total effort for such changes.
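The standard-error formula above is a one-liner in code. Here `jackknife_se` is an illustrative name, and each entry of the input is the coefficient from one leave-one-developer-out rerun of the algorithm:

```python
import numpy as np

def jackknife_se(leave_one_out_estimates):
    """Jackknife standard error (Efron, 1982) from the n leave-one-out
    estimates of a coefficient: sqrt{ (n-1)/n * sum (est_i - mean)^2 }."""
    est = np.asarray(leave_one_out_estimates, dtype=float)
    n = len(est)
    return np.sqrt((n - 1) / n * ((est - est.mean(axis=0)) ** 2).sum(axis=0))
```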

3. Change Data and their Analysis

Change history is a promising new tool for measuring and analyzing software evolution (Mockus et al, 1999). Change histories are automatically recorded by source code control and change management systems (see §3.1). Every change to every part of the source code is recorded by the change control system, providing a repository of software modifications which is large, detailed, and uniform over time. Obtaining data for software engineering research relevant to an industrial environment can be difficult and costly; change management data are collected automatically.

In our studies, the data represent changes to the source code of 5ESS (Martersteck and Spencer, 1985), which is a large, distributed, high availability software product. The product was developed over two decades, has tens of millions of lines of source code, and has been changed several hundred thousand times. The source code is mostly written in the C programming language, augmented by the Specification and Description Language (SDL). Tools used to process and analyze software change data are described in Mockus et al, 1999.

The change management (CM) data include information on each change made to the code, its size and content, submission time, and developer. Also available are financial support system (FSS) data, which record amounts of effort spent each month by each developer. Because developers tend to work on multiple changes during most months, and because 14 percent of changes start and end in different months, it is impossible to recover effort measurements for individual changes exactly from these FSS data. In fact, developers' monthly efforts rarely stray far from their natural value of one month, so in most applications we have used $Y_{\cdot jd} \equiv 1$ in place of the FSS data, which in any case are not always available.

3.1 Change data

The extended change management system (ECMS; see, for example, Midha, 1997), and the source code control system (SCCS; Rochkind, 1975) were used to manage the source code. The ECMS groups atomic changes (``deltas'') recorded by SCCS into logical changes referred to as Maintenance Requests (MRs). The open time of the MR is recorded in ECMS. Also available are records of the MR completion time; we use the time of the last delta in that MR. We used the textual abstract that the developer wrote when working on an MR to infer that change's purpose (Mockus and Votta, 1998). In addition to the three primary reasons for changes (repairing faults, adding new functionality, and improving the structure of the code; see, for example, Swanson, 1976), we defined a class for changes that implement code inspection suggestions since this class was easy to separate from others and had distinct size and interval properties (Mockus and Votta, 1998).

The SCCS database records each delta as a tuple including the actual source code that was changed, login of the developer, MR number, date, time, numbers of lines added, deleted, and unchanged.

3.2 Practical issues

There are a number of practical issues when modeling change effort. Some are common to all statistical problems, some are specific to software engineering, and some are specific to the MAIMED fitting procedure. Here we omit the traditional validity checks in regression analysis, such as collinearity among predictors, and dwell mostly on issues related to software engineering and the MAIMED procedure. Two particularly important issues are including essential predictors in the model and choosing subsets of the data.

3.2.1 Which variables are needed for the model in software applications?

In our studies of software data, we have found that several variables should always be included in the regression models and that their effects on effort are always present. One can include other variables to test whether they have important effects.

First among variables that we recommend always including in the model is a developer effect. Other studies (for example, Curtis, 1981) have found substantial variation in developer productivity. Even when we have not found significant differences, and even though we do not believe that estimated developer coefficients constitute a reliable method of rating developers, we have left developer coefficients in the model. The interpretation of estimated developer effects is problematic. Not only could differences appear because of differing developer abilities, but the seemingly less productive developer could be the expert on a particularly difficult area of the code, or that developer could have more extensive duties outside of writing code.

Naturally, the size of a change has a strong effect on the effort required to implement it. A large number of measures are available which measure something closely related to size, and the analyst should generally include exactly one of these in the model. We have typically chosen the number of atomic changes (deltas) that were part of the MR as the measure of the size of an MR. This measure seems to be slightly better for this purpose than other size measures like the total number of lines added and deleted in the MR. Another useful measure of size is the span of the change, i.e., the number of files it touches. A large span tends to increase the effort for a change even if other measures of size are constant.

We found that the purpose of the change (as estimated using the techniques of Mockus and Votta, 1998) also has a strong effect on the effort required to make a change. In most of our studies, changes which fix bugs are more difficult than comparably sized additions of new code. Difficulty of changes classified as ``cleanup'' varies across different parts of the code, while implementing suggestions from code inspections is easy.

3.2.2 Selecting subsets of developers

Another important problem when using the MAIMED methodology is choosing the set of developers whose changes to include in the analysis. The 5ESS™ code has been changed by a large number of developers, and we recommend restricting attention to a subset of these, with the subset chosen in order to yield the sharpest possible estimates of the parameters of interest.

Variability in project size and developer capability and experience are the largest sources of variability in software development (see, for example, Curtis, 1981). The effects of tools and process are often smaller by an order of magnitude. To obtain the sharpest results on the effect of a given tool in the presence of developer variability, it is important to have observations of the same developer changing files both using the tool and performing the work without the aid of the tool.

To reduce the inherently large inter-developer variability, it is important to select developers who make similar numbers of changes.

To avoid confounding tool effects (see §3.3.2) with developer effects, it is imperative to choose developers who make similar numbers of changes with and without the tool. When estimating code-base effects (see §3.3.3), it is important to select developers who make similar numbers of changes on each type of product.

Given the considerable size of the version history data available, both tasks are often easy: in the tool effect study we selected developers who made between 300 and 500 MRs in the six year period between 1990 and 1995 and had similar numbers (more than 40) of MRs done with and without the tool.

We used relatively large samples of MRs in all three examples below. When comparing subsystems, we used 6985 MRs over 72 months to estimate 18 developer coefficients and 6 additional parameters. When analyzing the tool benefits, we used 3438 MRs over 72 months for 9 developers and five additional parameters.

3.3 Results of Data Analyses

The specific model we fit in the modeling stage of the algorithm was a generalized linear model (McCullagh and Nelder, 1989) of effort. MR effort was modeled as having a Poisson distribution with mean given below.

 \begin{displaymath}E(\mbox{Effort}) = \alpha_{Developer} \times \beta_{Type} \times
\mbox{Size}^\gamma \times \theta_{\mbox{Cause}}.
\end{displaymath} (3)

The covariate ``Cause'' was different in each of three studies summarized below. The decay study (Graves and Mockus, 1998) investigated the effects of calendar time, the tool usage study considered cost savings from using a tool, and the subsystem study investigated effects of different source code bases.

3.3.1 Code Decay

Beyond developer, size, and type of change, another interesting measurement on a change which we have found to be a significant contributor to the change's difficulty is the date of the change (Graves and Mockus, 1998). We were interested to see if there was evidence that the code was getting harder to change, or decaying, as discussed in Belady and Lehman (1976), Parnas (1994), and Eick et al (1999). There was statistically significant evidence of a decay effect: we estimated that in our data, a change begun a year later than an otherwise similar change would require 20% more effort than the earlier change. The fitted model was:

 \begin{displaymath}E(\mbox{Effort}) = \alpha_{Developer} \times \beta_{Type} \times
\mbox{Size}^\gamma \times \theta^{\mbox{Time}}.
\end{displaymath} (4)

Here the estimated developer coefficients, the $\alpha$'s, varied by a factor of 2.7. The jackknife estimate of standard error (see §2.2) indicated no statistically significant evidence that the developers differed from each other. Although the literature sometimes quotes maximal developer productivity ratios ranging from 10 to 100 (Curtis, 1981), we were not surprised to see a much smaller ratio in this situation. For one thing, we selected developers in a way which helps ensure that they are relatively uniform, by requiring that they completed large numbers of changes. One should also expect a great deal of variation in these ratios across studies, because they are based on the extreme developers in each study and hence are subject to enormous variability. Also, we believe that there are a number of tasks beyond coding (requirements, inspections, architecture, project management) that take substantial time from the developers involved in them. Consequently, coding productivity does not necessarily reflect a developer's total productivity or overall value to an organization.

We defined $\beta_{NEW}$, the coefficient for changes that add new functionality, to be unity to make the problem identifiable, then estimated the remaining type coefficients to be 1.8 for fault fixes, 1.4 for restructuring and cleanup work, and 0.8 for inspection rework.

The value of $\gamma$, the power through which the number of files changed affects effort, was estimated to be 0.26, with a jackknife standard error of 0.06. Since $\gamma<1$ for these data, the difficulty of a change increases sub-linearly with the change's size. At first glance this may appear to conflict with intuition and with, for example, the COCOMO model for effort prediction (Boehm, 1981), under which one expects the effort for a change to be very nearly proportional to its size. This expectation is reasonable in the context of large changes, such as entire projects, which contain mixtures of small changes of various types. When considering changes as small as MRs, a sublinear result should not be surprising: large changes are often additions of new functionality and may be mostly straightforward, while small changes are often bug fixes, which may require large initial investments of time to find where to make the change.

We estimated $\theta$ to be 1.21, and its natural logarithm had a standard error of 0.07, so that a confidence interval for $\theta$, calculated by taking two standard errors in each direction of the estimated parameter, is $1.21\exp\{\pm 2 \times 0.07\} = (1.05, 1.39)$. This implies that a similar change became 5% to 39% harder to make in one year.
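The interval arithmetic above is simple to reproduce:

```python
import math

theta, se = 1.21, 0.07              # point estimate and SE of log(theta)
lo = theta * math.exp(-2 * se)      # lower two-standard-error limit
hi = theta * math.exp(2 * se)       # upper two-standard-error limit
# (lo, hi) is approximately (1.05, 1.39)
```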

3.3.2 Tool usage effects

This section tests our hypothesis that the use of a tool reduces the effort needed to make a subset of changes. The tool in this study was the Version Editor, which was designed to make it easier to edit software which is simultaneously developed in several different versions. It does this by hiding all lines of code which belong to versions the developer is not currently working on (for details see Atkins et al, 1999). There were three types of changes:

- a ``control'' set of changes which modified only files with no versioning information, where usage of the tool should have had no effect; roughly half of all changes fell into this category (denoted CONTROL);
- changes where the tool should have an effect (i.e., those modifying files with versioning information) and the tool was used (denoted USED);
- changes where the tool should have an effect and it was not used (denoted NOT).

We fit the model in Equation (5), estimated standard errors using the jackknife method, and obtained the results summarized in Table 1.

 \begin{displaymath}E(\mbox{Effort}) = \alpha_{Developer} \times
\beta_{Type} \times \mbox{Size}^\gamma \times \theta_{\mbox{Tool}}
\end{displaymath} (5)

The penalty for failing to use the tool is the ratio of $\theta_{NOT}$ to $\theta_{USED}$, which indicates an increase of about 36% in the effort required to complete an MR. (This coefficient was statistically significant at the 5% level.) Restated, developers who used the tool to perform non-control changes could perform 36% more changes than if they had not used the tool. At the same time, changes performed using the tool were of the same difficulty (requiring a statistically insignificant 7% increase in effort) as CONTROL changes ($\theta_{USED}$ versus $\theta_{CONTROL}$).

The type of a change was a significant predictor of the effort required to make it, as bug fixes were 26% more difficult than comparably sized additions of new functionality. Improving the structure of the code, the third primary reason for change, was of comparable difficulty to adding new code, as was a fourth class of changes, implementing code inspection suggestions.

The variable $\gamma$ in Equation (3) was estimated to be 0.19. That is, the size of a change did not have a particularly strong effect on the effort required to make it.

Table 1: Estimated coefficients and their precision. By definition, $\log \beta_{NEW} = \log\theta_{USED} = 0$.

          $\gamma$  $\log \beta_{BUG}$  $\log \beta_{CLEANUP}$  $\log \beta_{INSPECT}$  $\log \theta_{NOT}$  $\log \theta_{CONTROL}$
estimate  0.17      0.31                -0.34                   -0.32                   0.29                 -0.11
std err   0.14      0.11                0.53                    0.53                    0.19                 0.21

3.3.3 Subsystem effects

In the last example we considered the influence of code base on effort. The 5ESS switch (the product under discussion) is broken into a number of relatively independent subsystems (each several million lines of code) that belong to and are changed by separate organizations. The product is updated by adding new features or capabilities that can then be sold to customers. An average feature contains tens to hundreds of MRs.

A development manager identified six subsystems that tend to have higher cost per feature than another six similar sized subsystems. To identify whether the effort was higher even at the MR level, we chose 18 developers who made similar numbers of changes to high- and low-cost subsystems. We fit the model (6). The estimates are summarized in Table 2. Since $\exp(0.249) = 1.28$, the results show that similar MRs take about 28% more effort in the high cost subsystems.

 \begin{displaymath}E(\mbox{Effort}) = \alpha_{Developer} \times
\beta_{Type} \times \mbox{Size}^\gamma \times \mbox{Files}^{\gamma_1} \times \theta_{\mbox{Subsystem}}
\end{displaymath} (6)

In this case we used an additional predictor, the number of files touched by the MR, as a second measure of its size. The factor ``LOW'' indicates a low-cost-per-feature subsystem.
Table 2: Estimated coefficients and precision for the code base effects. By definition, $\log \beta_{NEW} = \log \theta_{LOW} = 0$.

          $\gamma$  $\gamma_1$  $\log \beta_{BUG}$  $\log \beta_{CLEANUP}$  $\log \beta_{INSPECT}$  $\log \theta_{HIGH}$
estimate  0.08      0.16        -0.107              0.233                   0.204                   0.249
std err   0.06      0.07        0.016               0.03                    0.08                    0.07

4. Simulation

To provide confirmation that the effort modeling methodology gives adequate results, we conducted a simulation experiment. We simulated data from a specified model and measured to what extent the model coefficients are recoverable using the MAIMED algorithm. In particular, we study the influence of the initialization method, and the effect of incorrectly specified monthly efforts (§4.3). We also studied what happens to estimated coefficients when we fit models without one of the essential variables, or with an extra variable which does not itself affect effort but which is correlated with one of the predictor variables (§4.4).

The simulation is in part a bootstrap analysis of the data. It begins with a model fit to a set of data. We used the model from the tool usage example (Equation 5) and the estimated coefficients from Table 1.

Each replication in the experiment consists of the following. We resample with replacement a set of developers, so that the simulated data consist of a random number of copies of each of the developers in the true data set. We then construct a collection of MRs for each synthetic developer by resampling MRs of the corresponding actual developer. More precisely, we do the following.

1. Resample with replacement a set of developers. The following steps are iterated over each developer in the resulting set.
2. Choose an MR at random from the developer's collection. This MR has associated with it a size, type, and other covariates (ignore its open and close times).
3. Derive the MR's open date by assuming it was opened at the time the previous MR for the same developer was closed (use time zero if this is the first MR for the developer).
4. Adjust the open date for this MR by generating a time shift using a parametric bootstrap (for details see §4.1).
5. Compute the expected value of the effort for this synthetic MR using the covariates of the resampled MR and the estimated model.
6. Generate the effort for this MR according to the exponential distribution with the computed mean. The close date of this MR is then the open date plus this effort value.
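The resampling of one developer's MRs can be sketched as follows. The function and argument names are illustrative, `mean_effort` stands in for the fitted model, and the overlap adjustment described in §4.1 is omitted for brevity.

```python
import random

def simulate_developer(mrs, mean_effort, rng):
    """Construct one synthetic developer by resampling real MRs.
    mrs         : the developer's list of real MRs (dicts of covariates)
    mean_effort : function mapping an MR's covariates to its expected effort"""
    out, clock = [], 0.0
    for _ in mrs:
        mr = rng.choice(mrs)                             # resample a real MR
        effort = rng.expovariate(1.0 / mean_effort(mr))  # exponential effort
        # open at the previous synthetic MR's close (time zero for the first)
        out.append({'open': clock, 'close': clock + effort, **mr})
        clock += effort
    return out
```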

This simulation procedure preserves the correlations between all the variables in the model, except for time. In the simulations, the covariates and effort for an MR are independent of the covariates and effort of the previous MRs. Since actual MRs do overlap, we also generated overlaps based on the actual MR overlap data (see §4.1).

We repeated this procedure to generate 100 synthetic data sets, and estimated parameters for the correct model with these data sets. We checked the sensitivity to initialization method and to incorrectly specified monthly efforts. Finally, we use our algorithm to fit models on these data when we misspecify the model by omitting an important variable or by adding an unnecessary variable.

(The reader may find it strange that we simulate exponential data, which have constant coefficient of variation, and model these data as if they were Poisson, which have constant variance to mean ratio. This is intentional; we believe from other studies that the true data have constant coefficient of variation, but the larger changes are not only more variable, but also more important to model correctly. We believe coefficients from a Gamma model are too dependent on the smaller observations.)

4.1 MR overlap distribution

Overlap between MRs is an important component of the software engineering problem. One third of actual MRs have the property that they are closed after a subsequent MR is opened. To produce a realistic simulation, we had to reproduce MR overlap. Number the MRs for a single developer in order of their open times $O_i$, so that for all i, $O_{i+1}\ge O_i$. Let $C_i$ be the closing time of MR i. The overlap is defined as follows:

\begin{displaymath}OVL_i \stackrel{\rm def}{=}\left\{
\begin{array}{lcl}
\frac{C_{i}-O_{i+1}}{C_{i}-O_{i}} & {\rm if} & O_{i+1} \le C_{i} \\
C_{i}-O_{i+1} & {\rm if} & O_{i+1} > C_{i}
\end{array}\right.
\end{displaymath} (7)

The actual overlap when $O_{i+1} \le C_{i}$ has to be defined relative to the opening time of the previous MR to preserve MR ordering. The skips ($O_{i+1} > C_{i}$) may be defined in absolute time. The empirical distribution of skips and overlaps is shown in Figure 1. The probability of overlap in the actual data was $P(O_{i+1} \le C_{i})=0.3$. The overlap closely followed a ${\rm Beta}(\alpha=0.85, \beta=0.25)$ distribution (with pdf $\left\{\Gamma(\alpha)\Gamma(\beta)\right\}^{-1} \Gamma(\alpha + \beta) x^{\alpha-1}(1-x)^{\beta-1}$), while negative skips ($OVL_i: O_{i+1} > C_{i}$) followed a ${\rm Weibull}(\gamma=0.5, \beta=2.9)$ distribution (with pdf $\gamma\beta^{-1} x^{\gamma-1}\exp(-\beta^{-1}x^\gamma)$).

Figure 1: Empirical distribution of MR overlap. The dotted line shows actual distribution and the solid line shows parametric distribution used to simulate the data.

The overlap distribution indicates that when overlap occurs, MRs are likely to be started at almost the same time. The skip distribution is fairly heavy tailed, which is probably caused by vacations or other developer activities not directly related to changing the code.
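The two fitted distributions can be combined into a sampler for the next MR's open time. The following Python sketch (a hypothetical helper, not the authors' simulation code) draws the relative overlap from the Beta distribution and the absolute skip from the Weibull; note that the text's Weibull parameterization $\gamma\beta^{-1}x^{\gamma-1}\exp(-\beta^{-1}x^\gamma)$ corresponds to shape $\gamma$ and scale $\beta^{1/\gamma}$ in the standard form:

```python
import random

def next_open_time(o_prev, c_prev, p_overlap=0.3,
                   beta_a=0.85, beta_b=0.25,
                   weib_shape=0.5, weib_scale=2.9 ** (1 / 0.5)):
    """Sample O_{i+1} given the previous MR's open and close times (O_i, C_i)."""
    if random.random() < p_overlap:
        # overlap case: the Beta variate is the relative overlap
        # (C_i - O_{i+1}) / (C_i - O_i), so O_{i+1} lands inside [O_i, C_i]
        # and MR ordering is preserved
        ovl = random.betavariate(beta_a, beta_b)
        return c_prev - ovl * (c_prev - o_prev)
    # skip case: O_{i+1} exceeds C_i by a Weibull-distributed gap in absolute
    # time; random.weibullvariate takes (scale, shape) in that order
    return c_prev + random.weibullvariate(weib_scale, weib_shape)

random.seed(2)
opens = [next_open_time(0.0, 1.0) for _ in range(10000)]
overlap_frac = sum(o <= 1.0 for o in opens) / len(opens)
```

With the previous MR open on [0, 1], roughly 30 percent of sampled open times should fall inside that interval, matching $P(O_{i+1}\le C_i)=0.3$.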

4.2 Estimates with the correct model

In all of the following, one hundred data sets were simulated from the model in Equation (5) with coefficients as in Table 1. Overlap of MRs was generated as described in §4.1. The $Y_{\cdot jd}$ are set to one (except in Table 4) to reflect the frequent situation in practice where the actual monthly effort is not available or unreliable. In all cases the algorithm is run for thirty iterations on each data set.

Table 3 shows the means of the estimated coefficients and their standard deviations.

Table 3: Estimated coefficients for the correct model with incorrect monthly efforts. All coefficients are rounded to two significant digits.
  $\gamma$ $\log \beta_{BUG}$ $\log \beta_{CLEANUP}$ $\log \beta_{INSPECT}$ $\log \theta_{NOT}$ $\log \theta_{CONTROL}$
true 0.17 0.31 -0.34 -0.32 0.29 -0.11
mean 0.087 0.17 -0.15 -0.22 0.16 -0.043
stdev 0.049 0.075 0.18 0.28 0.11 0.097

The estimates are biased because the monthly efforts are set to one, reflecting practical situations where the actual monthly efforts are not known. Not surprisingly, setting all the monthly efforts equal causes all coefficients to be estimated closer to zero. Using unit monthly efforts is thus a conservative procedure which will generally err on the side of pronouncing too many factors unrelated to effort. In particular, the HAND coefficient is underestimated, suggesting that if the exact monthly efforts were known, the estimated effect of tool usage would have been even larger.

First we check the sensitivity with respect to alternative initialization methods and with respect to variations in monthly effort.

4.3 Sensitivity to initialization method

The first initialization method in Equation (2) divides monthly effort evenly among MRs open that month. We also used an alternative initialization method which divides monthly effort proportionally to the time each MR is open during that month:

 \begin{displaymath}Y_{ijd}(0) = \frac{{\rm Interval}_{ijd}}{\sum_i{\rm Interval}_{ijd}} Y_{\cdot jd},
\end{displaymath} (8)

where ${\rm Interval}_{ijd}$ denotes the time MR i (done by developer d) was open during month j.
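This initialization is straightforward to state in code. The sketch below (Python, with hypothetical helper names, used here only to illustrate Equation (8)) splits one month's reported effort among the MRs open in that month in proportion to the time each was open:

```python
def month_interval(open_t, close_t, month):
    """Length of time an MR with the given open/close times (in fractional
    months) is open during the calendar month [month, month + 1)."""
    return max(0.0, min(close_t, month + 1.0) - max(open_t, month))

def init_weighted(mrs, month, monthly_effort=1.0):
    """Equation (8): divide the month's effort among MRs proportionally to
    the time each MR is open during that month."""
    intervals = [month_interval(o, c, month) for (o, c) in mrs]
    total = sum(intervals)
    return [monthly_effort * iv / total for iv in intervals]

# two MRs: one open for half of month 0, one open for three quarters of it
shares = init_weighted([(0.0, 0.5), (0.25, 1.0)], month=0)
# shares are proportional to 0.5 and 0.75
```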

The results are identical (up to two significant digits) to the results in Table 3, i.e., the initialization method does not affect the values of the estimated coefficients.

In the simulated data (more so than in practice) the monthly efforts vary across months. In most cases monthly effort data is not available, so we substitute unit effort. In other situations such information is available (see, e.g., Graves and Mockus, 1998). In the simulation we have the advantage of knowing the monthly MR efforts exactly and can use this information in the initialization stage:

 \begin{displaymath}Y_{ijd}(0) = {\rm Interval}_{ijd}.
\end{displaymath} (9)

The total monthly efforts would no longer be unity but rather $Y^{\rm exact}_{\cdot jd}=\sum_i{\rm Interval}_{ijd}$. In this case the initialization provides the actual simulated effort values for each MR. Table 4 shows the results.
Table 4: Estimated coefficients for correct model with correct monthly efforts and alternative initialization method. t-values (for testing the hypothesis that the mean estimated coefficients from simulation are equal to their true values) are obtained by subtracting the coefficients used in simulation. The final row lists the standard errors obtained by the jackknife for comparison to these standard deviations.
  $\gamma$ $\log \beta_{BUG}$ $\log \beta_{CLEANUP}$ $\log \beta_{INSPECT}$ $\log \theta_{NOT}$ $\log \theta_{CONTROL}$
true 0.17 0.31 -0.34 -0.32 0.29 -0.11
mean 0.18 0.34 -0.61 -0.59 0.31 -0.10
stdev 0.079 0.18 0.64 0.87 0.19 0.23
t-value 0.77 1.64 -4.21 -3.05 1.05 0.32
jk stderr 0.14 0.11 0.53 0.53 0.19 0.21

As expected, the results are much closer to the actual values. The only exceptions are coefficients for the rarely occurring cleanup and inspection MRs. Those two coefficients appear consistently below the values used in the simulation.

Typically such behavior would not cause a problem in practice (here, the cleanup and inspection coefficients were not significantly different from the new-code coefficient), because the uncertainty estimates will be large for coefficients estimated from a small number of such MRs. There were 15778 inspection and 30538 cleanup MRs (4.5% and 8.7% respectively) out of a total of 348204 MRs in the 100 generated samples. However, we recommend against using predictors that occur rarely in the data and that might be correlated with individual developers. Confounding helps explain the bias in this example: the developer with the least productive coefficient has the smallest percentage, and the most productive developer the highest percentage, of inspection and cleanup MRs.

4.4 Fitting the misspecified model

To test the behavior of the estimator under model misspecification we fitted the model omitting the size predictor and a set of models with correlated predictors.

Table 5 shows the estimates obtained by omitting the size predictor. The estimates are not significantly different from the ones obtained using the correct model in Table 3.

Table 5: Estimated coefficients for incorrect model with incorrect monthly efforts.
  $\gamma$ $\log \beta_{BUG}$ $\log \beta_{CLEANUP}$ $\log \beta_{INSPECT}$ $\log \theta_{NOT}$ $\log \theta_{CONTROL}$  
mean -- 0.15 -0.15 -0.20 0.16 -0.067  
stdev -- 0.075 0.18 0.27 0.11 0.095  

We also ran a sequence of tests adding a predictor correlated with the size covariate. Since the logarithm of size was used in the model, the collinear covariate was correlated with the logarithm of size. A medium correlation of 0.5 and high correlations of 0.8 and 0.95 were used. The results are in Table 6.

Table 6: Estimated coefficients for the incorrect model (predictor collinear with $\log(size+1)$) with incorrect monthly efforts. The row labels $E(\rho )$ refer to the mean estimated coefficient when estimating a model with an extra unimportant covariate which has correlation $\rho $ with $\log(1+\mbox{Size})$. Stdev($\rho $) is the standard deviation of these coefficients.
  $\gamma$ extra $\log \beta_{BUG}$ $\log \beta_{CLEANUP}$ $\log \beta_{INSPECT}$ $\log \theta_{NOT}$ $\log \theta_{CONTROL}$
E(0.5) 0.091 -0.0055 0.167 -0.15 -0.22 0.15 -0.043
Stdev(0.5) 0.056 0.041 0.075 0.18 0.28 0.11 0.097
E(0.8) 0.085 0.0021 0.17 -0.15 -0.22 0.16 -0.043
Stdev(0.8) 0.082 0.058 0.076 0.18 0.28 0.11 0.097
E(0.95) 0.091 -0.0035 0.17 -0.15 -0.22 0.15 -0.044
Stdev(0.95) 0.15 0.11 0.075 0.18 0.28 0.11 0.097

The results are as expected -- the collinear coefficient is not significantly different from zero, and the model coefficients are not significantly different from those in Table 3. The collinear predictor increases the uncertainty of the size coefficient but does not affect the other estimates.

5. Related work

It is possible to think of this problem as a missing data problem, tempting one to attack it with the EM algorithm or data augmentation (see Tanner, 1993, for either). These approaches have serious difficulties with this problem. Taking EM as an example, it is necessary to specify distributions of the monthly effort expended on changes in such a way that it is practical to compute conditional expectations given the row sums and the parameters that determine mean change effort. Our two rescaling steps have the feel of an expectation step, but it is unclear whether one could construct a set of distributions under which they would be one. Data augmentation has similar problems.

The problem of estimating network traffic from edge counts, considered in Vardi (1996), has some similarities with the present problem. There, numbers of traversals of each edge in a network are available, but the traffic for complete paths (sequences of connected edges) is of interest. Both problems deal with nonnegative quantities, and in each case linear combinations of quantities of interest are observed, since edge traffic is the sum over all the paths that traverse that edge.

The effort problem adds another layer of difficulty to the network problem, in which it is desired to recover a vector X given a vector Y of sums of the elements of X, satisfying the equation Y=AX, for some known matrix of zeroes and ones A. The difficulty is in the fact that there are more X's than Y's. In the effort estimation problem we seek a collection of sums X=A1 of unobservable quantities A, given another set of sums $Y=1^\prime A$ and the zero pattern of the matrix A. Auxiliary modeling information relates to the desired collection of sums, while in the network problem, it is the unsummed values which can potentially be modeled.
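The structural difference can be made concrete with a toy table (illustrative values only). The unobservable table A holds per-change effort by month; the network problem models the unsummed entries, while the effort problem models the row sums given the column sums and the zero pattern:

```python
# rows = changes (MRs), columns = months; entries are the unobservable efforts
A = [[2.0, 1.0, 0.0],
     [0.0, 3.0, 1.0]]

# per-change efforts X = A1: the row sums, which our model describes
X = [sum(row) for row in A]

# observed monthly totals Y = 1'A: the column sums
Y = [sum(A[i][j] for i in range(len(A))) for j in range(len(A[0]))]

# the zero pattern of A is also observed (which MRs were open in which months)
zeros = [[v == 0.0 for v in row] for row in A]
```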

6. Software

Although the algorithm is fairly simple to implement, we describe a set of functions in S-PLUS that should help practitioners use the methodology more readily. The code contains five S-PLUS functions:

maimed: the main wrapper function which prepares the data objects for initialization and fitting. The main parameters are:
data: the data frame containing a list of the MRs with required fields: ``name''-- developer name, ``open'' and ``close'' -- date specified in fractional months (the first MR must have value 0 in the ``open'' field), as well as additional covariates as needed;
formula: the vector of predictor names to be used when performing the fitting. The predictors are taken from the dataframe data. The first predictor should always be ``name'' since differences among developers represent the largest variation in the effort required to complete an MR;
effmat: optional reported effort matrix where rows correspond to distinct developer names and columns to months. If the matrix is not provided it is assumed to contain unit efforts in each cell;
calculateEffort: boolean variable indicating whether to calculate the effort matrix based on the intervals of the MRs as in Equation (9);
weighted: the boolean variable indicating whether to initialize the algorithm weighting MRs by their open time during a month as in Equation (8);
jk: the boolean variable indicating whether or not to perform jackknife estimation of errors;
maimed.validate: computes standard deviations and t-statistics from the jackknife estimates obtained with maimed;
maimed.initial: default initialization function which distributes effort equally among MRs open in a particular month;
maimed.initial2: alternative initialization function which distributes effort proportionally to the length of time MRs are open;
maimed.initial3: initialization function which calculates monthly effort by adding the lengths of time each MR is open in a month, and then runs the fitting algorithm.
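maimed.validate performs the usual delete-one jackknife computation. A minimal sketch of that computation (in Python for illustration; it assumes the leave-one-developer-out coefficient estimates have already been produced, as maimed does with jk=T) is:

```python
def jackknife_se(estimates_loo):
    """Delete-one jackknife standard error from n leave-one-out estimates
    of a single coefficient."""
    n = len(estimates_loo)
    mean_loo = sum(estimates_loo) / n
    # jackknife variance: (n-1)/n times the sum of squared deviations
    var = (n - 1) / n * sum((t - mean_loo) ** 2 for t in estimates_loo)
    return var ** 0.5

def t_statistic(estimate_full, estimates_loo):
    """t-statistic for a coefficient: full-data estimate over jackknife SE."""
    return estimate_full / jackknife_se(estimates_loo)

se = jackknife_se([1.0, 2.0, 3.0])
```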

The S-PLUS (or R) code for these functions can be downloaded from here.

Following is a detailed example analysis with some practical considerations. First, we create a data frame data using information retrieved from a change management database using, for example, the SoftChange system (Mockus et al, 1999). Each MR should occupy a row in this data frame with the following fields: name, containing the developer's name or login; bug, containing 1 if the change is a bug fix (otherwise 0); clean, containing 1 if the change is perfective; new, containing 1 if the change is adaptive; ndelta, containing the size of the change in number of deltas; open and close, representing the timestamps of the first and last delta (or the open and close time of the MR) in unix time format (seconds since 1970); and additional factors ToolA and ToolB indicating whether tool A and/or tool B were used to complete the MR.

First we select only MRs done by developers who completed at least 20 MRs both with and without the tool. This might be done in the following way:

usage.table <- table(data$name, data$ToolA);
name.subset <- dimnames(usage.table)[[1]][usage.table[,1] > 20 & usage.table[,2] > 20]; <- data[match(data$name, name.subset, nomatch=0) > 0,];

Now the dataframe contains only developers who worked both with and without the tool, so we will be able to make sharper comparisons. In the next step we transform the open and close fields from seconds into fractional months starting at zero:$open  <-$open/(3600*24*365.25/12);$close <-$close/(3600*24*365.25/12) - min($open);$open  <-$open - min($open);

We may also transform the size of the change by a logarithm:$logndelta <- log($ndelta+1);

Finally we may run the maimed algorithm to estimate the coefficients for several models. One of the models might be:

\begin{displaymath}
E(\mbox{effort}) = \char93 {\rm delta}^{\alpha_1} \times \beta_{bug} \times \beta_{clean} \times \beta_{new} \times \gamma_{ToolA} \times \gamma_{ToolB}.
\end{displaymath}

To fit the model we run:
model <- maimed (, formula=c("name","logndelta","bug","clean","ToolA","ToolB"))
Notice that since in this example the categorization had three types of changes (bug, clean, new), we included only the first two in the model. The remaining type (new) is used as the basis for comparison with the first two. If all the indicators were included, we would get an overdetermined model.

Once we find a subset of interesting models we estimate the significance of the coefficients via jackknife:

model <- maimed (, formula=c("name","logndelta","bug","clean","ToolA","ToolB"), jk=T)
The result contains N+1 sets of coefficient estimates, obtained by running the maimed algorithm on the full dataset and on the N datasets formed by removing each of the N developers in turn. The significance values may then be calculated with maimed.validate(model). This command may produce the following output:
               estimate      stdev  t-statistic      p-value 
  lognumdelta  0.69179651 0.12414385   5.5725394 2.736186e-05
          bug  0.63789480 0.18284625   3.4886950 2.621708e-03
        clean -0.32523246 0.57564448  -0.5649884 5.790559e-01
        ToolA -1.40700987 0.19134298  -7.3533393 7.966553e-07
        ToolB  0.03172900 0.17362807   0.1827412 8.570436e-01
   Developer1 -2.81395695 0.20994002 -13.4036235 8.338685e-11
   Developer2 -2.08035974 0.18109041 -11.4879622 1.014959e-09
   Developer3 -2.85884880 0.11088977 -25.7809968 1.110223e-15
   Developer4 -2.85802540 0.17252158 -16.5661905 2.419842e-12
   Developer5 -2.84639737 0.21325331 -13.3474946 8.934120e-11
   Developer6 -2.58167088 0.19006092 -13.5833860 6.696665e-11
   Developer7 -2.64266728 0.21459470 -12.3146901 3.322902e-10
We see that bug fixes are significantly harder than new code, and tool A significantly reduces change effort. Cleanup (perfective) changes are not significantly easier than new code changes, and tool B does not significantly increase change effort. Hence for practical purposes the model may be rewritten using the estimates above as
\begin{displaymath}
E(\mbox{effort}) = \char93 {\rm delta}^{0.69} \times e^{0.63 I(bug)} \times e^{-1.41 I(ToolA)}.
\end{displaymath}

Here I(characteristic) is equal to 1 if the change has that characteristic, and otherwise equals 0. This implies that, for example, changes with 18 deltas are 50 percent harder than changes with 10 deltas ($18^{0.69}/10^{0.69}\approx 1.5$), bug fixes are 88 percent harder than new code changes ($\exp(0.63)\approx 1.88$), and tool A reduces change effort more than four times ($\exp(-1.41)\approx 0.24$).
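The arithmetic behind these interpretations is easy to verify (Python used here purely as a calculator):

```python
import math

size_ratio = 18 ** 0.69 / 10 ** 0.69   # effect of 18 deltas vs 10 deltas
bug_ratio = math.exp(0.63)             # bug fix vs new code multiplier
toolA_ratio = math.exp(-1.41)          # multiplier when tool A is used
```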

The constant multiplier is included with all developer coefficients. The ratio of the highest and lowest developer coefficients is around two: $e^{-2.08}/e^{-2.86}=e^{0.78}\approx 2.18$.

7. Conclusions

This paper introduced a method for quantifying the extent to which properties of changes to software affect the effort necessary to implement them. The problem is difficult because measurements of the response variable, effort, are available only at an aggregated level. To solve this difficulty we propose an algorithm which imputes disaggregated values of effort, which can be improved by using a statistical model. The method may be applied to the problem of estimating a model for the row sums given only the column sums and zero pattern of an arbitrary sparse non-negative table.

We provided three examples of data analyses in which we used the new algorithm to address important questions in software engineering: monitoring degradation of the code as manifested by increased effort, quantifying the benefits of a development tool, and identifying subsystems of code which are especially difficult.

We explored the properties of the algorithm through simulation. The simulations demonstrated that given enough data, the algorithm can recover the parameters of a true model, and that the algorithm still performs well when an important variable is left out of the analysis or when an extra variable correlated with one of the true predictors is included in the model. Also, the algorithm is conservative in the frequent practical situation in which aggregated effort values must be assumed to be one unit of effort per month. Further theoretical justification for the algorithm appears in the appendix.

We also described how to use software which implements the algorithm (it is available for download) for the benefit of practitioners.

8. Appendix: Convergence conditions

In this appendix we provide some theoretical results regarding the convergence properties of the algorithm. We present two results characterizing the fixed points of the algorithm.

Each iteration of the method represents a nonlinear transformation of the entries in the table, of the row sums, and of the fitted coefficients. Without loss of generality assume that all rows (changes) and all columns (months) have at least one positive entry. In the notation that follows we omit the dependence on the developer d, and we often suppress the dependence on the iteration t as well. For example, recall that $Y_{ij}=Y_{ij}(t)\ge 0$, $i=1,\dots, M$, $j=1, \dots, N$, denotes the entry in the i-th row and j-th column at iteration t. Let $f_i>0$, $i=1, \dots, M$, be the model-predicted row sums (fitted values), based on the current row sums $R_i=\sum_{l=1}^N Y_{il}$. Denote the column sums by $C_j=\sum_{k=1}^M Y_{kj}$. Then in the next iteration the entries will be as follows:

\begin{displaymath}
Y_{ij}(t+1) = \frac{Y_{ij} \sum_{k=1}^M Y_{kj}}{\left(\sum_{l=1}^N Y_{il}\right) \sum_{k=1}^M \frac{\rho_{ki} Y_{kj}}{\sum_{l=1}^N Y_{kl}}}
\end{displaymath} (10)
\begin{displaymath}
\phantom{Y_{ij}(t+1)} = \frac{Y_{ij} C_j}{R_i \sum_{k=1}^M \frac{\rho_{ki} Y_{kj}}{R_k}}
\end{displaymath} (11)

where $\rho_{ik}=f_i/f_k$.
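A direct transcription of the update (10) makes its behavior concrete. The sketch below (Python, an illustrative implementation rather than the authors' S-PLUS code) applies one rescaling step; note that column sums are preserved and zero entries are never modified:

```python
def iterate(Y, f):
    """One step of Equation (10):
    Y[i][j] * C_j / (R_i * sum_k rho_ki * Y[k][j] / R_k), rho_ki = f[k]/f[i]."""
    M, N = len(Y), len(Y[0])
    R = [sum(row) for row in Y]                          # row sums R_i
    C = [sum(Y[i][j] for i in range(M)) for j in range(N)]  # column sums C_j
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            if Y[i][j] == 0.0:
                continue  # zero entries stay zero
            denom = sum(f[k] / f[i] * Y[k][j] / R[k] for k in range(M))
            out[i][j] = Y[i][j] * C[j] / (R[i] * denom)
    return out

# with equal fitted values f and equal row sums, the table is a fixed point
Y1 = iterate([[1.0, 2.0], [2.0, 1.0]], f=[1.0, 1.0])
```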

Definition 8.1   Rows i and k are called zero connected if there exists a column j such that $Y_{ij}>0$ and $Y_{kj}>0$. Rows i and k are called n connected if there exists a row $k^\prime$ such that $i,k^\prime$ and $k,k^\prime$ are at most n-1 connected (i.e., if $i,k^\prime$ are $n_1$ connected and $k,k^\prime$ are $n_2$ connected for some $n_1,n_2\le n-1$). Rows i and k are called connected if there exists a finite n such that i,k are n connected. A table is called connected if all pairs of its rows are connected.

Note that if i,k are not connected, then the entries in row i (except for columns j such that $Y_{ij}=0$ or $Y_{ij}=C_j$) can be changed independently of the entries in row k. The connection relation factorizes the table into connected subtables which can be analyzed independently, since they affect each other only through the model fitting procedure.
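Connectivity in the sense of Definition 8.1 is ordinary graph reachability over rows, where two rows are adjacent (zero connected) when some column is positive in both. A small Python sketch (a hypothetical helper using breadth-first search, for illustration):

```python
from collections import deque

def table_connected(Y):
    """True if every pair of rows is connected through chains of shared
    positive columns (Definition 8.1)."""
    M, N = len(Y), len(Y[0])
    seen, queue = {0}, deque([0])
    while queue:
        i = queue.popleft()
        for k in range(M):
            # rows i and k are zero connected if some column is positive in both
            if k not in seen and any(Y[i][j] > 0 and Y[k][j] > 0 for j in range(N)):
                seen.add(k)
                queue.append(k)
    return len(seen) == M

connected = table_connected([[1, 1, 0], [0, 1, 1], [0, 0, 1]])
split = table_connected([[1, 0], [0, 1]])
```

In the first table, rows 1 and 3 are not zero connected but are connected through row 2; the second table factorizes into two independent subtables.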

It is of interest to know when the algorithm stops.

Theorem 8.2   Let $(Y_{ij})$ be entries in the table, with $C_j$ and $\rho_{ik}$ defined as above. The $Y_{ij}$ represent a fixed point of the algorithm if and only if one of these three conditions is satisfied for each entry $Y_{ij}$:
$Y_{ij}=0$;
$Y_{ij}=C_j$;
$\frac{R_i}{R_k}=\rho_{ik}$ for all rows k connected to i.

Proof: It is easy to see that the algorithm does not modify entries where Yij=0 or Yij=Cj. Consider subtables defined by connected rows. Without loss of generality consider only one such subtable. We will first prove the ``if'' part. Since the ratios $\rho_{ki}=R_k/R_i$, Equation (10) simplifies to

\begin{displaymath}Y_{ij}{(t+1)}=\frac{Y_{ij} C_j } {\sum_{k=1}^M Y_{kj}}=Y_{ij},\end{displaymath}

i.e., the entries represent a fixed point of the algorithm. Now we prove the ``only if'' part. Denote $w_{ik}=\rho_{ki} R_i/R_k$. It is easy to check that the matrix $W=(w_{ik}/M)_{i=1\dots M,k=1\dots M}$ is idempotent and hence has all eigenvalues equal to 0 or 1. Since all entries in W are strictly positive, the eigenvectors corresponding to eigenvalue 1 form a one-dimensional space (see the first Frobenius theorem in, e.g., Karlin and Taylor, 1975, p. 543). Consequently, W has rank one. Since the diagonal entries $w_{ii}$ all equal 1 and $\sum_{k=1}^MY_{kj} = \sum_{k=1}^M w_{ik} Y_{kj}$ for all i and j (using Equation (10) and the fixed point condition $Y_{ij}(t+1)=Y_{ij}$), all entries in W must be equal to 1. $\Box$

8.1 Stability

Denote the transformation on row sums performed by the algorithm by $Q(\cdot)$. Given row values ${\bf R}$, let $\hat{\mu}$ be the MLE of the model and $l({\bf R}\,\vert\,\mu)$ the log-likelihood of the model; define the deviance to be $D({\bf R})=l({\bf R}\,\vert\,{\bf R})-l({\bf R}\,\vert\,\hat{\mu})$. With regard to stability of a fixed point it is true that:

Theorem 8.3   Let $Y_{ij}$ be entries in a table which correspond to a fixed point of the algorithm, with row values ${\bf R}$. Let Q have a bounded third partial derivative tensor. Also, let
the model be Gaussian with a single mean parameter, so that the deviance is the sum of squared differences between fitted and original values, and so that the fitting procedure is linear;
the table be connected.
Then the deviance in a neighborhood of the fixed point is larger than at the fixed point. That is, let $z_{ij}$ be deviations to add to the table of Y's (and denote the resulting deviations in the row values by $z_i$; ${\bf z}$ denotes the vector of $z_i$'s). There exists $\epsilon > 0$ such that for all $0 < \delta < \epsilon$ and for all ${\bf z}$ with $\Vert{\bf z}\Vert = 1$ and ${\bf 1}^\prime{\bf z}=0$, we have $D({\bf R}+\delta {\bf z})> D({\bf R})$.

Proof: Note that the constraints of the problem require ${\bf 1}^\prime{\bf z}=0$. Take $\delta$ small enough that the perturbed table with cells $Y_{ij}+\delta z_{ij}$, where $\sum_jz_{ij}=z_i$, is valid ($Y_{ij}+\delta z_{ij}>0$, $\sum_i z_{ij}=0$) and connected. Denote the fitted values by $P(\cdot)$. We denote the inner product of two vectors ${\bf a},{\bf b}$ by $({\bf a},{\bf b})$, and we write $({\bf a},{\bf a})$ as $\Vert{\bf a}\Vert^2$. We need to prove that

\begin{displaymath}
\Vert{\bf R}+\delta {\bf z} - P({\bf R}+\delta {\bf z})\Vert^2 - \Vert Q({\bf R}+\delta {\bf z}) - P(Q({\bf R}+\delta {\bf z}))\Vert^2 > 0.
\end{displaymath} (12)

Expand Q around ${\bf R}$ up to second order: $Q({\bf R}+\delta {\bf z})={\bf R}+Q_\delta+Q_\delta^2+O(\delta^3)$. Then the left side of Equation (12) may be rewritten as:

\begin{displaymath}
\left\Vert{\bf R}+\delta {\bf z}-P({\bf R}+\delta {\bf z})\right\Vert^2 - \left\Vert{\bf R}+Q_\delta+Q_\delta^2-P({\bf R}+Q_\delta+Q_\delta^2)\right\Vert^2 + O(\delta^3).
\end{displaymath}
Because the fitting step involves taking averages, Theorem 8.2 implies that ${\bf R}=P({\bf R})$, so that the terms of zero and first order in $\delta$ are exactly zero. Gathering second order terms we get:

\begin{displaymath}
\left\Vert\delta {\bf z}-P(\delta {\bf z})\right\Vert^2 - \left\Vert Q_\delta-P(Q_\delta)\right\Vert^2 = \Vert\delta {\bf z}\Vert^2 - \left\Vert Q_\delta-P(Q_\delta)\right\Vert^2.
\end{displaymath}

The transition is due to the fact that in the setting of the theorem P is an average, so $P(\delta {\bf z})=0$.

To proceed further, we need the first order term of the transformation Q. Denote by $\delta_{ij}=\delta z_{ij}$ the deviation from the fixed point value $Y_{ij}$ in row i and column j ($\sum_i\delta_{ij}=0$ to preserve monthly efforts, and $\sum_j\delta_{ij}=\delta_i = \delta z_i$).

\begin{eqnarray*}
Q_\delta^i &=& \delta_i+\sum_j Y_{ij}\left[ \frac{(P\delta)_i}{(PR)_i} - \frac{\delta_i}{R_i} + \frac{1}{C_j}\sum_k\frac{R_k}{(PR)_k}\left\{\frac{(PR)_k Y_{kj}\delta_k}{R_k^2}-\frac{(PR)_k \delta_{kj}+Y_{kj}(P\delta)_{k}}{R_k}\right\}\right] \\
&=& \delta_i+ \sum_j Y_{ij}\left[ \frac{(P\delta)_i}{(PR)_i} - \frac{\delta_i}{R_i} + \frac{1}{C_j}\sum_kY_{kj}\left\{ \frac{\delta_k}{R_k}-\frac{(P\delta)_k}{(PR)_k}\right\}\right] \\
&=& \delta_i+ \sum_j Y_{ij}\left[ - \frac{\delta_i}{R_i} + \frac{1}{C_j}\sum_kY_{kj}\frac{\delta_k}{R_k}\right] \\
&=& \sum_{j,k}\frac{Y_{ij}Y_{kj}\delta_k}{C_jR_k}
\end{eqnarray*}

In the first transition we used the assumption that the table is connected, so that $(PR)_k/(PR)_i=R_k/R_i$. In the second transition we used the fact that the projection is an average, so $(P\delta)_i=0$. This expression implies that $P(Q_\delta) = 0$. The second order term of (12) is therefore

\begin{displaymath}
\Vert\delta {\bf z}\Vert^2 - \sum_i\left(\sum_k w_k^i\,\delta z_k\right)^2 = \delta^2\left\{\Vert{\bf z}\Vert^2 - \Vert W{\bf z}\Vert^2\right\},
\end{displaymath}

where we have written $w_k^i=\sum_j \frac{Y_{ij}}{R_k}\frac{Y_{kj}}{C_j}=\sum_j \frac{Y_{ij}}{R_i}\frac{Y_{kj}}{C_j}$ because $R_i=R_k$, and $W=(w_k^i)_{i,k}$.

Let $\{{\bf v}_k: 1 \leq k \leq M\}$ be the (orthonormal) eigenvectors of W, with $\lambda_1,\lambda_2, \ldots, \lambda_M$ the corresponding eigenvalues, sorted in decreasing order of absolute value. Then

\begin{displaymath}
\Vert{\bf z}\Vert^2 - \Vert W{\bf z}\Vert^2 = \sum_{k=1}^M ({\bf z},{\bf v}_k)^2 (1 - \lambda_k^2).
\end{displaymath}
First, note that W is a stochastic matrix. All $w_k^i\geq 0$ because all $Y_{ij} \geq 0$. Also, $\sum_iw_k^i=\sum_j \sum_i \frac{Y_{ij}}{R_k}\frac{Y_{kj}}{C_j} = \sum_j \frac{C_j}{R_k}\frac{Y_{kj}}{C_j} = 1$. Since W is stochastic, the magnitude of an eigenvalue may not exceed 1. It is also true that W is aperiodic and irreducible. It is obviously aperiodic, since all elements of W are non-negative and the diagonal elements are positive. Suppose W were reducible, i.e., there existed a set of indexes U such that $\forall l \in U$ and $\forall i\in \neg U:\; w_i^l=0$, or $\sum_j \frac{Y_{ij}Y_{lj}}{C_jR_l}=0$. Since all $Y_{ij}, Y_{lj}\geq 0$, this would mean that rows i and l of the effort table are not connected, contradicting the assumption that the table is connected. Irreducibility implies that $\lambda_1 = 1$ and $\sup_{2 \leq k \leq M} \vert\lambda_k\vert< 1$. But it is easily seen that ${\bf v}_1$, the eigenvector corresponding to the unit eigenvalue, is a multiple of the unit vector, so that $({\bf z},{\bf v}_1) = 0$. Then for any ${\bf z}$, $({\bf z},{\bf v}_k)^2 \geq M^{-1}$ for some $2 \leq k \leq M$, so that

\begin{displaymath}
\sum_{k=1}^M ({\bf z},{\bf v}_k)^2 (1 - \lambda_k^2) \geq M^{-1} (1 - \lambda_2^2) > 0,
\end{displaymath}

and this bound is independent of ${\bf z}$.

Because Q has a bounded third derivative, it is possible to choose $\epsilon$ small enough that for any ${\bf z}$ the third order term in (12) is smaller in absolute value than $(2M)^{-1} (1 - \lambda_2^2)$, which shows that (12) is strictly positive for all $0 < \delta < \epsilon$ and any ${\bf z}$. $\Box$

Although the conditions of the theorem are restrictive, we believe they may be relaxed substantially. In particular, the result should be true for an arbitrary Gaussian or Poisson model and also for disconnected tables.


We thank George Schmidt, Janis Sharpless, Harvey Siy, Mark Ardis, David Weiss, Alan Karr, Iris Dowden, and interview subjects for their help and valuable suggestions. This research was supported in part by NSF grants SBR-9529926 and DMS-9208758 to the National Institute of Statistical Sciences.


D. Atkins, T. Ball, T. Graves, and A. Mockus, ``Using version control data to evaluate the effectiveness of software tools,'' in 1999 International Conference on Software Engineering (submitted), (Los Angeles, CA), May 1999.

L. A. Belady and M. M. Lehman, ``A model of large program development,'' IBM Systems Journal, pp. 225-252, 1976.

B. Boehm, Software Engineering Economics.
Prentice-Hall, 1981.

B. Curtis, ``Substantiating programmer variability,'' Proceedings of the IEEE, vol. 69, p. 846, July 1981.

W. E. Deming and F. F. Stephan, ``On a least-squares adjustment of a sampled frequency table when the expected marginal totals are known,'' Annals of Mathematical Statistics, vol. 11, pp. 427-444, 1940.

B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans.
Philadelphia, PA: Society for Industrial and Applied Mathematics, 1982.

B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap.
New York: Chapman and Hall, 1993.

S. G. Eick, T. L. Graves, A. F. Karr, J. S. Marron, and A. Mockus, ``Does code decay? Assessing the evidence from change management data,'' IEEE Trans. Soft. Engrg., 1998. To appear.

T. L. Graves and A. Mockus, ``Inferring change effort from configuration management data,'' in Metrics 98: Fifth International Symposium on Software Metrics, (Bethesda, Maryland), pp. 267-273, November 1998.

S. Karlin and H. M. Taylor, A First Course in Stochastic Processes, 2nd Edition.
New York: Academic Press, 1975.

K. Martersteck and A. Spencer, ``Introduction to the 5ESS(TM) switching system,'' AT&T Technical Journal, vol. 64, pp. 1305-1314, July-August 1985.

P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd ed.
New York: Chapman and Hall, 1989.

A. K. Midha, ``Software configuration management for the 21st century,'' Bell Labs Technical Journal, vol. 2, no. 1, Winter 1997.

A. Mockus, S. G. Eick, T. Graves, and A. F. Karr, ``On measurement and analysis of software changes,'' tech. rep., Bell Labs, Lucent Technologies, 1999.

A. Mockus and L. G. Votta, ``Identifying reasons for software changes using historic databases,''
Submitted to ACM Transactions on Software Engineering and Methodology.

D. L. Parnas, ``Software aging,'' in Proceedings 16th International Conference On Software Engineering, (Los Alamitos, California), pp. 279-287, IEEE Computer Society Press, 16 May 1994.

M. Rochkind, ``The source code control system,'' IEEE Trans. on Software Engineering, vol. 1, no. 4, pp. 364-370, 1975.

G. Seber, Linear Regression Analysis.
New York: Wiley, 1977.

E. B. Swanson, ``The dimensions of maintenance,'' in 2nd Conf. on Software Engineering, (San Francisco, California), pp. 492-497, 1976.

M. A. Tanner, Tools for Statistical Inference.
New York, Berlin, Heidelberg: Springer-Verlag, 1993.

Y. Vardi, ``Network tomography: Estimating source destination traffic intensities from link data,'' JASA, vol. 91, no. 433, pp. 365-377, 1996.


1. Strictly speaking, we do not assume a Poisson distribution, because effort values need not be integers. The only critical part of the Poisson assumption is that the variance of a random effort is proportional to its mean.