Experiments

Tool performance

We conducted some experiments to evaluate the performance of DDJ.

DDJ is described in this paper:

The implementation was tested with a collection of real applications (e.g., an interpreter, a parser, and a debugger). The table below summarizes the results of the experiments, which were performed on an Intel Core2 Quad at 2.67 GHz with 2 GB of RAM.

Benchmark Number of variables ET size ET time First node time Cache limit ET depth
Argparser 8.812 2 MB 22 s 407 ms 7/7 7
Cglib 216.931 200 MB 230 s 719 ms 11/14 18
Kxml2 194.879 85 MB 1318 s 1844 ms 6/6 9
Javassist 650.314 459 MB 556 s 844 ms 7/7 16
Jtestcase 1.859.043 893 MB 1913 s 1531 ms 17/26 57
HTMLcleaner 3.575.513 2909 MB 4828 s 609 ms 4/4 17
Average 1084248,67 758 MB 1477,83 s 992,33 ms 8,67/10,67 20,67

For each benchmark, the second and third columns give an idea of the size of the execution, the fourth and fifth columns are time measures, and the last two columns report memory bounds and the resulting ET depth. Concretely:

  • Column Benchmark contains the names of the benchmarks;
  • Column Number of variables shows the number of variables participating in the execution considered;
  • Column ET size shows the size in MB of the completed ET. This measure was taken from the size of the ET in the database (hence, it includes compaction);
  • Column ET time shows the time needed to compute the whole ET;
  • Column First node time shows the time needed to complete the first node of the ET;
  • Column Cache limit shows the presentation and logic bounds of these benchmarks; beyond these bounds, the computer ran out of memory;
  • Column ET depth shows the depth of the ET after it was constructed.

The last two columns of the table give an idea of how big the ET shown in the GUI can be before memory is exhausted. In general, five levels of depth are enough to see the question asked and the part of the computation closely related to it. In the experiments, only HTMLcleaner ran out of memory when showing five levels of the ET in the GUI.
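As an illustration of the Nodes and ET depth measures used throughout these tables, the following minimal sketch builds the execution tree of a recursive Fibonacci method and reports its size and depth. The ETNode class and its methods are hypothetical, introduced only for this example; they are not DDJ's actual data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (hypothetical class, not DDJ's implementation): an ET node
// records one method invocation; its children are the invocations it performed.
class ETNode {
    final String call;
    final List<ETNode> children = new ArrayList<>();
    ETNode(String call) { this.call = call; }

    int size() {                        // total number of nodes in the subtree
        int n = 1;
        for (ETNode c : children) n += c.size();
        return n;
    }
    int depth() {                       // longest root-to-leaf path
        int d = 0;
        for (ETNode c : children) d = Math.max(d, c.depth());
        return d + 1;
    }
}

public class ETDemo {
    // Build the ET of a recursive fib(n): each call becomes a node whose
    // children are the two recursive invocations it performs.
    static ETNode fibET(int n) {
        ETNode node = new ETNode("fib(" + n + ")");
        if (n > 1) {
            node.children.add(fibET(n - 1));
            node.children.add(fibET(n - 2));
        }
        return node;
    }

    public static void main(String[] args) {
        ETNode root = fibET(10);
        System.out.println("nodes=" + root.size() + " depth=" + root.depth());
    }
}
```

This is why benchmarks with deep recursion (e.g., FibRecursive) produce ETs whose number of nodes grows much faster than their depth.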

Evaluation of the execution trees balancing technique

The technique for balancing execution trees is described here:

We debugged 31 Java benchmarks. Some are programs implemented by university students; others are small to medium Java programs, including games and mathematical transformations; and others are real Java applications, such as a parser or a library.

The experiments were done with a collection of small to large programs, and they are divided into two sets. The first set applies the balancing technique by only collapsing/projecting chains with a maximum of five changes. The second set applies the balancing technique without any restriction (i.e., chains of any length can be projected or collapsed). The following tables summarize the results of the experiments:

Balancing chains with at most five changes

Benchmark Nodes Proj. / nodes / length Col. / nodes / length Proj. Time Col. Time Bal. Time Quest. Quest. Bal. %
NumReader 12 0 / 0 / 0,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,46 6,46 0,00
Orderings 72 2 / 5 / 2,50 14 / 45 / 3,21 0 ms 0 ms 0 ms 11,47 8,89 22,46
Factoricer 62 7 / 17 / 2,43 0 / 0 / 0,00 0 ms 0 ms 0 ms 13,89 7,90 43,09
Sedgewick 41 3 / 7 / 2,33 8 / 24 / 3,00 0 ms 0 ms 0 ms 18,79 7,52 59,95
Clasifier 30 4 / 10 / 2,50 7 / 20 / 2,86 0 ms 0 ms 0 ms 15,52 6,48 58,21
LegendGame 93 12 / 28 / 2,33 20 / 40 / 2,00 0 ms 0 ms 0 ms 16,00 9,70 39,36
Cues 19 3 / 8 / 2,67 1 / 2 / 2,00 0 ms 0 ms 0 ms 10,40 8,20 21,15
Romanic 123 20 / 40 / 2,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 25,06 16,63 33,66
FibRecursive 6.724 19 / 70 / 3,68 1.290 / 2.593 / 2,01 16 ms 328 ms 344 ms 38,29 21,47 43,92
Risk 70 7 / 19 / 2,71 8 / 43 / 5,38 0 ms 0 ms 0 ms 30,69 10,28 66,50
FactTrans 198 5 / 12 / 2,40 0 / 0 / 0,00 0 ms 0 ms 0 ms 18,96 14,25 24,88
RndQuicksort 88 3 / 9 / 3,00 3 / 0 / 3,00 0 ms 0 ms 0 ms 12,88 10,40 19,20
BinaryArrays 132 7 / 18 / 2,57 0 / 0 / 0,00 0 ms 0 ms 0 ms 15,56 10,58 32,03
FibFactAna 380 3 / 9 / 3,00 29 / 58 / 2,00 0 ms 0 ms 0 ms 30,13 29,15 3,27
NewtonPol 46 1 / 2 / 2,00 3 / 40 / 40,00 0 ms 0 ms 0 ms 23,09 4,77 79,35
RegresionTest 18 1 / 3 / 3,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,84 6,26 8,46
BoubleFibArrays 214 0 / 0 / 0,00 40 / 83 / 2,08 0 ms 0 ms 0 ms 12,42 12,01 3,33
ComplexNumbers 68 17 / 37 / 2,18 9 / 18 / 2,00 16 ms 0 ms 16 ms 20,62 10,20 50,53
StatsMeanFib 104 3 / 6 / 2,00 20 / 56 / 2,80 0 ms 0 ms 0 ms 12,33 11,00 10,81
Integral 25 0 / 0 / 0,00 2 / 22 / 11,00 0 ms 0 ms 0 ms 8,38 3,38 59,63
TestMath 51 1 / 2 / 2,00 2 / 5 / 2,50 0 ms 0 ms 0 ms 12,77 11,65 8,73
TestMath2 267 7 / 16 / 2,29 13 / 52 / 4,00 15 ms 0 ms 31 ms 66,47 58,33 12,24
Figures 116 8 / 16 / 2,00 3 / 6 / 2,00 0 ms 0 ms 0 ms 13,78 12,17 11,66
FactCalc 105 3 / 8 / 2,67 11 / 32 / 2,91 0 ms 0 ms 0 ms 19,81 12,64 36,19
SpaceLimits 127 38 / 76 / 2,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 40,85 29,16 28,61
Argparser 129 31 / 70 / 2,26 9 / 37 / 4,11 0 ms 16 ms 16 ms 20,78 12,71 38,85
Cglib 1.216 67 / 166 / 2,48 39 / 84 / 2,15 400 ms 200 ms 620 ms 80,41 65,01 19,15
Javassist 1.357 10 / 28 / 2,80 8 / 24 / 3,00 4.495 ms 16 ms 4.745 ms 79,52 77,50 2,54
Kxml2 1.172 260 / 695 / 2,67 21 / 42 / 2,00 345 ms 77 ms 452 ms 79,61 28,21 64,56
HTMLcleaner 6.047 394 / 1.001 / 2,54 90 / 223 / 2,48 7.920 ms 251 ms 8.266 ms 169,49 138,85 18,08
Jtestcase 4.151 299 / 776 / 2,60 27 / 54 / 2,00 1.234 ms 78 ms 1.328 ms 85,05 80,52 5,32
Average 23.257 1.235 / 3.154 / 2,49 1.675 / 3.612 / 4,60 17,32 ms 0,77 ms 2.818,25 ms 32,78 23,95 29,86

Balancing without restrictions

Benchmark Nodes Proj. / nodes / length Col. / nodes / length Proj. Time Col. Time Bal. Time Quest. Quest. Bal. %
NumReader 12 0 / 0 / 0,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,46 6,46 0,00
Orderings 85 2 / 4 / 2,00 12 / 55 / 4,58 0 ms 0 ms 0 ms 12,77 9,69 24,13
Factoricer 62 7 / 17 / 2,43 0 / 0 / 0,00 0 ms 0 ms 0 ms 13,89 7,90 43,09
Sedgewick 43 2 / 4 / 2,00 5 / 26 / 5,20 0 ms 0 ms 0 ms 19,20 6,73 64,96
Clasifier 31 4 / 8 / 2,00 3 / 18 / 6,00 0 ms 0 ms 0 ms 16,00 5,87 63,31
LegendGame 93 11 / 25 / 2,27 21 / 43 / 2,05 16 ms 0 ms 16 ms 15,53 9,41 39,38
Cues 19 0 / 0 / 0,00 4 / 17 / 4,25 0 ms 0 ms 0 ms 10,40 3,00 71,15
Romanic 123 16 / 41 / 2,56 0 / 0 / 0,00 0 ms 0 ms 0 ms 25,06 9,29 62,93
FibRecursive 6.097 20 / 68 / 3,40 1.168 / 2.347 / 2,01 48 ms 265 ms 313 ms 43,30 21,24 50,95
Risk 70 7 / 17 / 2,43 9 / 50 / 5,56 0 ms 0 ms 0 ms 31,08 5,79 81,38
FactTrans 198 5 / 12 / 2,40 0 / 0 / 0,00 0 ms 0 ms 0 ms 18,96 14,25 24,88
RndQuicksort 90 3 / 7 / 2,33 2 / 7 / 3,50 0 ms 0 ms 0 ms 13,69 10,60 22,55
BinaryArrays 137 8 / 18 / 2,25 0 / 0 / 0,00 0 ms 0 ms 0 ms 15,78 10,64 32,55
FibFactAna 380 3 / 9 / 3,00 29 / 58 / 2,00 0 ms 0 ms 0 ms 30,13 29,15 3,27
NewtonPol 46 1 / 2 / 2,00 1 / 40 / 40,00 0 ms 0 ms 0 ms 23,09 4,77 79,35
RegresionTest 18 1 / 3 / 3,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,84 6,26 8,46
BoubleFibArrays 214 1 / 2 / 2,00 40 / 83 / 2,08 0 ms 16 ms 16 ms 12,42 12,00 3,41
ComplexNumbers 68 17 / 37 / 2,18 9 / 18 / 2,00 15 ms 0 ms 15 ms 20,62 10,20 50,53
StatsMeanFib 104 3 / 6 / 2,00 20 / 56 / 2,80 0 ms 0 ms 0 ms 12,33 11,00 10,81
Integral 25 0 / 0 / 0,00 2 / 22 / 11,00 0 ms 0 ms 0 ms 8,38 3,38 59,63
TestMath 51 1 / 2 / 2,00 2 / 5 / 2,50 0 ms 0 ms 0 ms 12,77 11,65 8,73
TestMath2 267 1 / 2 / 2,00 2 / 59 / 29,50 0 ms 0 ms 0 ms 66,47 53,12 20,08
Figures 116 13 / 32 / 2,46 3 / 6 / 2,00 0 ms 0 ms 0 ms 13,78 11,85 14,02
FactCalc 73 2 / 8 / 4,00 13 / 44 / 3,38 0 ms 0 ms 0 ms 17,26 8,23 52,31
SpaceLimits 127 42 / 104 / 2,48 0 / 0 / 0,00 31 ms 0 ms 31 ms 40,85 10,90 73,32
Argparser 129 31 / 74 / 2,39 9 / 37 / 4,11 15 ms 16 ms 31 ms 20,78 9,68 53,42
Cglib 1.216 119 / 342 / 2,87 42 / 91 / 2,17 656 ms 204 ms 860 ms 80,41 31,66 60,62
Javassist 1.357 206 / 530 / 2,57 19 / 54 / 2,84 704 ms 31 ms 765 ms 79,52 24,51 69,18
Kxml2 1.172 262 / 699 / 2,67 21 / 42 / 2,00 218 ms 16 ms 250 ms 79,61 25,23 68,31
HTMLcleaner 6.047 1.023 / 2.781 / 2,72 119 / 311 / 2,61 20.983 ms 590 ms 21.762 ms 169,49 31,01 81,70
Jtestcase 4.151 340 / 885 / 2,60 37 / 74 / 2,00 906 ms 79 ms 985 ms 85,05 73,67 13,38
Average 22.621 2.151 / 5.739 / 2,46 1.592 / 3.563 / 6,09 1,39 ms 0,70 ms 5.982,77 ms 32,97 15,78 42,32

where

  • Column Benchmark contains the names of the benchmarks;
  • Column Nodes shows the size of the ET: the number of nodes (i.e., method invocations performed during the execution of the program) in the ET associated with each benchmark;
  • Column Proj. / nodes / length shows the number of projected nodes inserted into the ET, the number of nodes that were projected, and the average length of the projected chains;
  • Column Col. / nodes / length shows the number of collapsed nodes inserted into the ET, the number of nodes that were collapsed, and the average length of the collapsed chains;
  • Column Proj. Time shows the time needed by the debugger to project the chains;
  • Column Col. Time shows the time needed by the debugger to collapse the chains;
  • Column Bal. Time shows the time needed by the debugger to balance the whole ET;
  • Column Quest. shows the average number of questions asked by the debugger before finding the bug in the original ET;
  • Column Quest. Bal. shows the average number of questions asked by the debugger before finding the bug in the balanced ET;
  • Column % shows the improvement (the percentage reduction in questions) achieved with the balancing technique.
The last row shows the averages:
  • The Proj. Time and Col. Time averages show the average time of projecting and collapsing a chain across all benchmarks;
  • The Bal. Time average shows the average time of balancing a benchmark (weighting the benchmarks according to their number of nodes);
  • The Quest. average shows the average number of questions asked by the debugger in the original ETs of all benchmarks;
  • The Quest. Bal. average shows the average number of questions asked by the debugger in the balanced ETs of all benchmarks;
  • The % average shows the improvement achieved with the balancing technique across all benchmarks.
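The collapsing operation measured in the Col. columns can be sketched as follows. This is a simplified illustration under our own assumptions (the Node class and collapse method are hypothetical, not DDJ's implementation): every maximal chain of single-child nodes of at least a given length is replaced by one collapsed node, so the debugger can cover the whole chain with a single question, reducing the depth of the ET.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only (not DDJ's implementation).
class Node {
    final String label;
    List<Node> children = new ArrayList<>();
    Node(String label) { this.label = label; }
}

public class CollapseDemo {
    static int depth(Node n) {
        int d = 0;
        for (Node c : n.children) d = Math.max(d, depth(c));
        return d + 1;
    }

    // Collapse every maximal single-child chain of at least minLen nodes.
    static Node collapse(Node n, int minLen) {
        // walk the single-child chain starting at n
        List<Node> chain = new ArrayList<>();
        Node end = n;
        while (end.children.size() == 1) {
            chain.add(end);
            end = end.children.get(0);
        }
        chain.add(end); // last node of the chain (zero or several children)

        if (chain.size() >= minLen) {
            // one collapsed node stands for the whole chain
            StringBuilder label = new StringBuilder();
            for (Node c : chain) {
                if (label.length() > 0) label.append('+');
                label.append(c.label);
            }
            Node merged = new Node(label.toString());
            for (Node c : end.children) merged.children.add(collapse(c, minLen));
            return merged;
        }
        // chain shorter than minLen: keep it, but balance below its end
        Node head = null, tail = null;
        for (Node c : chain) {
            Node copy = new Node(c.label);
            if (head == null) head = copy; else tail.children.add(copy);
            tail = copy;
        }
        for (Node c : end.children) tail.children.add(collapse(c, minLen));
        return head;
    }

    public static void main(String[] args) {
        // a -> b -> c -> d, where d has two leaf children
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.children.add(b); b.children.add(c); c.children.add(d);
        d.children.add(new Node("e")); d.children.add(new Node("f"));
        System.out.println("before=" + depth(a) + " after=" + depth(collapse(a, 3)));
    }
}
```

The first set of experiments corresponds to bounding this operation (chains of at most five changes), while the second set applies it without restriction.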

Evaluation of strategies for algorithmic debugging

All the strategies implemented in DDJ are described here:

  • Josep Silva.
    A Survey on Algorithmic Debugging Strategies.
    Advances in Engineering Software, 42(11), November 2011.

The following table summarizes the results of the experiments. (This table is updated from time to time; we are currently implementing the subterm dependency tracking strategy and will evaluate it and share the results soon.)

Each row of the table represents the summary of a collection of experiments performed with the associated benchmark. In particular, for each benchmark, we produced its associated ET and assumed that the buggy node could be any node of the ET (i.e., any subcomputation in the execution of the program could be buggy). Therefore, we performed a different experiment for each possible case.

In particular, benchmark Factoricer was debugged 62 times with each strategy; each time, the buggy node was a different node, and the results shown are the average number of questions asked by each strategy. Similarly, benchmark Cglib was debugged 1216 times with each strategy, and so on.
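This experimental procedure can be sketched as follows. The harness below uses hypothetical names and a deliberately small ET (it is not DDJ's API): each node is assumed buggy in turn, a Top-Down session is simulated in which the oracle answers "wrong" exactly when the buggy node lies in the questioned subtree, and the number of questions is averaged over all runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only (not DDJ's API).
class TDNode {
    final String label;
    final List<TDNode> children = new ArrayList<>();
    TDNode(String label) { this.label = label; }
}

public class StrategyExperiment {
    // The oracle answers "wrong" for a node iff the buggy node is in its subtree.
    static boolean contains(TDNode n, TDNode target) {
        if (n == target) return true;
        for (TDNode c : n.children) if (contains(c, target)) return true;
        return false;
    }

    // Top-Down: from a wrong node, ask its children left to right; descend
    // into the first wrong one, or blame the current node if all are correct.
    static int topDownQuestions(TDNode root, TDNode buggy) {
        TDNode cur = root; // the root is wrong by assumption (a bug was observed)
        int questions = 0;
        while (true) {
            TDNode next = null;
            for (TDNode c : cur.children) {
                questions++;                    // one question per child asked
                if (contains(c, buggy)) { next = c; break; }
            }
            if (next == null) return questions; // all children correct: cur is buggy
            cur = next;
        }
    }

    static void collect(TDNode n, List<TDNode> out) {
        out.add(n);
        for (TDNode c : n.children) collect(c, out);
    }

    public static void main(String[] args) {
        // a 5-node ET: r has children a and b; a has children a1 and a2
        TDNode r = new TDNode("r"), a = new TDNode("a"), b = new TDNode("b");
        TDNode a1 = new TDNode("a1"), a2 = new TDNode("a2");
        r.children.add(a); r.children.add(b);
        a.children.add(a1); a.children.add(a2);

        List<TDNode> all = new ArrayList<>();
        collect(r, all);
        int total = 0;
        for (TDNode buggy : all) total += topDownQuestions(r, buggy); // one experiment per node
        System.out.println("average questions=" + (total / (double) all.size()));
    }
}
```

Dividing such an average by the number of nodes yields the percentages reported in the tables below.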

% of questions asked by each strategy w.r.t. the number of nodes of each benchmark

Benchmark Nodes D&QO D&QH D&QS HF MRF DR&Q HD-P HD-Y HD-N TD SS Average
NumReader 12 28,994 28,994 31,361 44,379 44,379 29,586 49,704 49,704 49,704 49,704 53,254 41,797
Orderings 81 9,935 9,935 10,291 12,820 12,820 14,099 14,352 13,950 14,352 14,500 50,595 16,150
Factoricer 62 20,030 20,030 20,055 22,046 22,046 20,030 22,046 22,046 22,046 22,046 50,768 23,926
Sedgewick 24 39,360 39,360 39,360 40,160 40,160 39,360 42,080 42,080 51,680 51,680 51,840 43,375
Clasifier 33 29,325 29,325 51,211 30,796 30,796 50,865 36,678 36,678 49,913 49,913 51,384 40,626
LegendGame 92 13,655 13,655 13,736 15,019 15,898 16,742 16,927 16,927 16,927 16,927 50,526 18,812
Cues 19 44,250 44,250 44,250 44,750 44,750 44,250 46,250 46,250 52,250 52,250 52,250 46,886
Romanic 123 13,463 13,463 13,463 13,989 13,989 13,703 21,384 27,354 20,213 20,213 50,397 20,148
FibRecursive 5.729 0,266 0,266 0,269 0,312 0,732 1,778 0,637 0,733 0,614 0,614 50,009 5,112
Risk 70 39,972 39,972 39,972 40,151 40,151 39,972 46,340 42,055 46,876 43,781 50,684 42,721
FactTrans 198 5,432 5,432 5,434 8,010 8,669 8,661 9,530 9,328 9,530 9,530 50,249 11,800
RndQuicksort 74 9,529 9,529 10,311 12,444 12,800 11,627 16,018 14,542 16,018 15,662 50,649 16,284
BinaryArrays 128 8,461 8,461 8,647 10,234 12,391 11,117 12,523 13,425 12,523 12,998 50,382 14,651
FibFactAna 380 2,379 2,379 2,400 6,780 7,449 5,295 5,721 6,476 5,700 7,909 50,131 9,329
NewtonPol 46 41,829 41,829 49,027 43,685 45,450 48,981 49,117 49,117 49,117 49,117 51,019 47,117
RegresionTest 18 25,208 25,208 25,208 28,255 29,086 25,208 36,011 36,011 36,011 36,011 52,355 32,234
BoubleFibArrays 214 3,712 3,717 3,885 4,971 5,778 17,542 7,422 7,801 5,778 5,778 50,230 10,602
ComplexNumbers 68 19,954 19,954 19,954 20,794 20,794 33,081 26,633 25,625 30,540 29,889 50,704 27,084
Integral 25 28,994 28,994 30,621 32,249 32,249 47,781 32,249 32,249 32,249 32,249 51,775 34,696
TestMath 51 11,612 11,612 11,908 15,311 15,533 12,870 21,117 22,411 22,929 24,556 50,925 20,071
TestMath2 267 4,582 4,582 4,572 9,832 10,225 12,579 11,237 11,929 14,424 24,802 50,185 14,450
Figures 116 7,101 7,101 7,400 7,999 8,028 8,058 11,140 10,943 11,944 11,776 50,420 12,901
FactCalc 120 7,424 7,424 7,513 9,023 9,173 11,290 12,287 12,287 18,066 18,066 50,406 14,815
SpaceLimits 127 26,947 26,947 30,609 27,673 27,673 35,101 30,695 30,682 38,574 31,915 50,385 32,473
Argparser 129 12,101 12,101 13,083 13,071 13,320 21,254 16,195 16,195 16,148 15,982 50,379 18,166
Cglib 1.216 1,932 1,932 2,332 2,515 2,649 2,132 6,373 5,756 7,320 6,607 50,041 8,145
Kxml2 1.172 2,856 2,856 3,006 3,059 3,476 3,555 8,578 6,971 7,767 6,787 50,043 8,996
Javassist 1.357 4,336 4,336 5,442 4,735 4,751 4,492 6,204 9,267 6,062 5,856 50,037 9,593
Average 426,821 16,559 16,559 18,047 18,752 19,115 21,108 21,980 22,100 23,760 23,826 50,786 22,963

% of questions asked by each strategy w.r.t. the number of nodes of each benchmark (after balancing the ETs)

Benchmark Nodes D&QO D&QH D&QS DR&Q HF MRF HD-P TD HD-Y HD-N SS Average
NumReader 12 28,994 28,994 31,361 29,586 44,379 44,379 49,704 49,704 49,704 49,704 53,254 41,797
Orderings 46 12,042 12,087 12,630 14,396 17,157 17,293 21,050 20,824 20,598 19,602 51,019 19,881
Factoricer 62 9,826 9,826 9,927 20,030 12,547 12,547 15,042 12,547 15,042 18,292 50,768 16,945
Sedgewick 12 30,769 30,769 33,136 30,769 34,911 34,911 43,787 43,195 43,787 43,787 53,254 38,462
Clasifier 23 19,792 20,313 22,396 21,875 22,917 23,264 32,118 31,944 32,118 34,549 51,910 28,472
LegendGame 71 8,873 8,873 8,951 16,725 11,150 11,227 14,680 13,368 14,680 16,937 50,675 16,013
Cues 18 31,579 32,410 32,410 32,410 33,241 34,626 39,058 42,105 39,058 44,321 52,355 37,598
Romanic 123 6,400 10,842 11,225 13,560 7,440 11,882 13,287 13,411 13,287 13,300 50,397 15,003
FibRecursive 4.619 0,269 0,269 0,281 1,201 0,327 0,407 3,915 0,456 3,915 0,476 50,011 5,593
Risk 33 16,782 16,782 18,080 19,377 18,685 18,685 24,308 31,142 24,308 32,785 51,384 24,756
FactTrans 198 3,891 3,894 3,927 6,222 6,578 6,581 7,369 7,159 7,242 7,505 50,249 10,056
RndQuicksort 72 8,726 8,726 8,726 11,409 12,029 12,235 13,624 13,511 12,929 14,543 50,666 15,193
BinaryArrays 128 5,516 5,523 5,715 7,127 7,746 7,938 7,896 8,587 8,155 8,707 50,382 11,208
FibFactAna 351 2,438 2,438 2,449 5,377 7,615 7,712 6,399 8,573 7,390 5,990 50,141 9,684
NewtonPol 7 39,063 39,063 43,750 39,063 43,750 43,750 45,313 45,313 45,313 45,313 54,688 44,034
RegresionTest 18 23,269 23,269 25,208 25,208 26,870 26,870 32,964 32,964 32,964 32,964 52,355 30,446
BoubleFibArrays 171 4,404 4,411 4,570 11,401 5,946 6,960 24,496 6,960 24,868 6,960 50,287 13,751
ComplexNumbers 60 10,024 10,024 10,320 11,314 11,395 11,395 15,775 15,748 15,802 19,188 50,793 16,525
Integral 5 44,444 44,444 47,222 44,444 50,000 50,000 50,000 50,000 50,000 50,000 55,556 48,737
TestMath 48 11,912 11,912 12,162 12,995 15,952 16,285 22,407 24,198 23,865 22,366 50,979 20,457
TestMath2 228 3,511 3,511 3,511 9,735 10,553 10,808 12,288 28,564 13,236 14,367 50,216 14,573
Figures 113 6,717 6,748 6,787 8,095 7,679 7,787 10,172 10,603 10,165 10,765 50,431 12,359
FactCalc 59 10,111 10,139 10,417 11,528 13,694 14,222 20,472 18,500 20,472 20,694 50,806 18,278
SpaceLimits 127 12,952 16,071 19,147 21,741 13,678 16,797 22,870 22,784 22,858 26,147 50,385 22,312
Argparser 101 6,824 6,824 7,180 7,343 8,218 8,218 11,832 11,890 11,832 12,438 50,481 13,007
Cglib 1.171 1,576 1,585 1,894 1,727 2,182 2,300 5,504 5,462 5,247 5,760 50,043 7,571
Kxml2 1.151 0,897 0,897 0,927 1,358 1,174 1,198 2,270 2,454 2,253 2,929 50,043 6,036
Javassist 1.341 4,270 4,273 5,362 4,427 4,675 4,687 6,090 5,802 9,226 6,048 50,037 9,536
Average 370,286 13,067 13,390 14,274 15,730 16,160 16,606 20,525 20,635 20,725 20,944 51,199 20,296

where

  • Column Benchmark contains the names of the benchmarks;
  • Column Nodes shows the size of the ET: the number of nodes (i.e., method invocations performed during the execution of the program) in the ET associated with each benchmark;
  • Column D&QH shows the number of questions asked (as an average) by Divide & Query by Hirunkitti;
  • Column D&QS shows the number of questions asked (as an average) by Divide & Query by Shapiro;
  • Column DR&Q shows the number of questions asked (as an average) by Divide by Rules & Query;
  • Column HF shows the number of questions asked (as an average) by Heaviest First;
  • Column MRF shows the number of questions asked (as an average) by More Rules First;
  • Column HD-P shows the number of questions asked (as an average) by Hat Delta with the heuristic that computes the proportion of correct and wrong nodes;
  • Column TD shows the number of questions asked (as an average) by Top-Down;
  • Column HD-Y shows the number of questions asked (as an average) by Hat Delta with the heuristic that counts correct nodes (answered YES);
  • Column HD-N shows the number of questions asked (as an average) by Hat Delta with the heuristic that counts wrong nodes (answered NO);
  • Column SS shows the number of questions asked (as an average) by Single Stepping.
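As a sketch of how a Divide & Query strategy chooses its next question (a simplified selection rule for illustration, not DDJ's exact implementation of any of the variants above), the code below selects the node whose subtree weight is closest to half the weight of the tree, so that each answer roughly halves the remaining suspect area. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only.
class DQNode {
    final String label;
    final List<DQNode> children = new ArrayList<>();
    DQNode(String label) { this.label = label; }
}

public class DQDemo {
    static int weight(DQNode n) {               // number of nodes in the subtree
        int w = 1;
        for (DQNode c : n.children) w += weight(c);
        return w;
    }
    static void collect(DQNode n, List<DQNode> out) {
        out.add(n);
        for (DQNode c : n.children) collect(c, out);
    }
    // Choose the non-root node whose weight is closest to weight(root)/2.
    static DQNode select(DQNode root) {
        double target = weight(root) / 2.0;
        List<DQNode> all = new ArrayList<>();
        collect(root, all);
        DQNode best = null;
        double bestDist = Double.MAX_VALUE;
        for (DQNode n : all) {
            if (n == root) continue;
            double dist = Math.abs(weight(n) - target);
            if (dist < bestDist) { bestDist = dist; best = n; }
        }
        return best;
    }
    public static void main(String[] args) {
        // a 7-node chain: r -> n1 -> ... -> n6
        DQNode r = new DQNode("r"), cur = r;
        for (int i = 1; i <= 6; i++) {
            DQNode n = new DQNode("n" + i);
            cur.children.add(n);
            cur = n;
        }
        DQNode s = select(r);
        System.out.println("selected=" + s.label + " weight=" + weight(s)
                + " target=" + (weight(r) / 2.0));
    }
}
```

This bisection behavior explains why the D&Q family consistently asks the fewest questions in the tables above, while Single Stepping (SS) stays close to 50% of the nodes.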