Experiments

Tool performance

We conducted some experiments to evaluate the performance of DDJ.

DDJ is described in this paper:

The implementation was tested with a collection of real applications (e.g., an interpreter, a parser, and a debugger). The table below summarizes the results of the experiments, which were performed on an Intel Core2 Quad at 2.67 GHz with 2 GB of RAM.

Benchmark Number of variables ET size ET time First node time Cache limit ET depth
Argparser 8.812 2 MB 22 s 407 ms 7/7 7
Cglib 216.931 200 MB 230 s 719 ms 11/14 18
Kxml2 194.879 85 MB 1318 s 1844 ms 6/6 9
Javassist 650.314 459 MB 556 s 844 ms 7/7 16
Jtestcase 1.859.043 893 MB 1913 s 1531 ms 17/26 57
HTMLcleaner 3.575.513 2909 MB 4828 s 609 ms 4/4 17
Average 1084248,67 758 MB 1477,83 s 992,33 ms 8,67/10,67 20,67

For each benchmark, the second and third columns give an idea of the size of the execution, the fourth and fifth columns are time measures, and the last two columns report memory bounds and the resulting ET depth. Concretely:

  • Column Benchmark contains the names of the benchmarks;
  • Column Number of variables shows the number of variables participating in the execution considered;
  • Column ET size shows the size in MB of the completed ET. This measure was taken from the size of the ET in the database (hence, it includes compaction);
  • Column ET time shows the time needed to compute the whole ET;
  • Column First node time shows the time needed to complete the first node of the ET;
  • Column Cache limit shows the presentation and logic bounds of these benchmarks; beyond these bounds, the computer ran out of memory;
  • Column ET depth shows the depth of the ET after it was constructed.

The last two columns of the table give an idea of how big the ET shown in the GUI can be before memory is exhausted. In general, five levels of depth are enough to see the question asked and the part of the computation closely related to it. In the experiments, only HTMLcleaner ran out of memory when showing five levels of the ET in the GUI.
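As an illustration of the Nodes and ET depth measures used throughout these tables, the following minimal sketch builds the execution tree of a recursive Fibonacci method and reports its size and depth. The ETNode class and its methods are hypothetical, introduced only for this example; they are not DDJ's actual data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch (hypothetical class, not DDJ's implementation): an ET node
// records one method invocation; its children are the invocations it performed.
class ETNode {
    final String call;
    final List<ETNode> children = new ArrayList<>();
    ETNode(String call) { this.call = call; }

    int size() {                        // total number of nodes in the subtree
        int n = 1;
        for (ETNode c : children) n += c.size();
        return n;
    }
    int depth() {                       // longest root-to-leaf path
        int d = 0;
        for (ETNode c : children) d = Math.max(d, c.depth());
        return d + 1;
    }
}

public class ETDemo {
    // Build the ET of a recursive fib(n): each call becomes a node whose
    // children are the two recursive invocations it performs.
    static ETNode fibET(int n) {
        ETNode node = new ETNode("fib(" + n + ")");
        if (n > 1) {
            node.children.add(fibET(n - 1));
            node.children.add(fibET(n - 2));
        }
        return node;
    }

    public static void main(String[] args) {
        ETNode root = fibET(10);
        System.out.println("nodes=" + root.size() + " depth=" + root.depth());
    }
}
```

This is why benchmarks with deep recursion (e.g., FibRecursive) produce ETs whose number of nodes grows much faster than their depth.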

Evaluation of the execution trees balancing technique

The technique for balancing execution trees is described here:

We debugged 31 Java benchmarks. Some are programs implemented by university students; others are small to medium Java programs, including games and mathematical transformations; and others are real Java applications, such as a parser or a library.

The experiments were done with a collection of small to large programs, and they are divided into two sets. The first set applies the balancing technique by only collapsing/projecting chains with a maximum of five changes. The second set applies the balancing technique without any restriction (i.e., chains of any length can be projected or collapsed). The following tables summarize the results of the experiments:

Balancing chains with at most five changes

Benchmark Nodes Proj. / nodes / length Col. / nodes / length Proj. Time Col. Time Bal. Time Quest. Quest. Bal. %
NumReader 12 0 / 0 / 0,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,46 6,46 0,00
Orderings 72 2 / 5 / 2,50 14 / 45 / 3,21 0 ms 0 ms 0 ms 11,47 8,89 22,46
Factoricer 62 7 / 17 / 2,43 0 / 0 / 0,00 0 ms 0 ms 0 ms 13,89 7,90 43,09
Sedgewick 41 3 / 7 / 2,33 8 / 24 / 3,00 0 ms 0 ms 0 ms 18,79 7,52 59,95
Clasifier 30 4 / 10 / 2,50 7 / 20 / 2,86 0 ms 0 ms 0 ms 15,52 6,48 58,21
LegendGame 93 12 / 28 / 2,33 20 / 40 / 2,00 0 ms 0 ms 0 ms 16,00 9,70 39,36
Cues 19 3 / 8 / 2,67 1 / 2 / 2,00 0 ms 0 ms 0 ms 10,40 8,20 21,15
Romanic 123 20 / 40 / 2,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 25,06 16,63 33,66
FibRecursive 6.724 19 / 70 / 3,68 1.290 / 2.593 / 2,01 16 ms 328 ms 344 ms 38,29 21,47 43,92
Risk 70 7 / 19 / 2,71 8 / 43 / 5,38 0 ms 0 ms 0 ms 30,69 10,28 66,50
FactTrans 198 5 / 12 / 2,40 0 / 0 / 0,00 0 ms 0 ms 0 ms 18,96 14,25 24,88
RndQuicksort 88 3 / 9 / 3,00 3 / 0 / 3,00 0 ms 0 ms 0 ms 12,88 10,40 19,20
BinaryArrays 132 7 / 18 / 2,57 0 / 0 / 0,00 0 ms 0 ms 0 ms 15,56 10,58 32,03
FibFactAna 380 3 / 9 / 3,00 29 / 58 / 2,00 0 ms 0 ms 0 ms 30,13 29,15 3,27
NewtonPol 46 1 / 2 / 2,00 3 / 40 / 40,00 0 ms 0 ms 0 ms 23,09 4,77 79,35
RegresionTest 18 1 / 3 / 3,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,84 6,26 8,46
BoubleFibArrays 214 0 / 0 / 0,00 40 / 83 / 2,08 0 ms 0 ms 0 ms 12,42 12,01 3,33
ComplexNumbers 68 17 / 37 / 2,18 9 / 18 / 2,00 16 ms 0 ms 16 ms 20,62 10,20 50,53
StatsMeanFib 104 3 / 6 / 2,00 20 / 56 / 2,80 0 ms 0 ms 0 ms 12,33 11,00 10,81
Integral 25 0 / 0 / 0,00 2 / 22 / 11,00 0 ms 0 ms 0 ms 8,38 3,38 59,63
TestMath 51 1 / 2 / 2,00 2 / 5 / 2,50 0 ms 0 ms 0 ms 12,77 11,65 8,73
TestMath2 267 7 / 16 / 2,29 13 / 52 / 4,00 15 ms 0 ms 31 ms 66,47 58,33 12,24
Figures 116 8 / 16 / 2,00 3 / 6 / 2,00 0 ms 0 ms 0 ms 13,78 12,17 11,66
FactCalc 105 3 / 8 / 2,67 11 / 32 / 2,91 0 ms 0 ms 0 ms 19,81 12,64 36,19
SpaceLimits 127 38 / 76 / 2,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 40,85 29,16 28,61
Argparser 129 31 / 70 / 2,26 9 / 37 / 4,11 0 ms 16 ms 16 ms 20,78 12,71 38,85
Cglib 1.216 67 / 166 / 2,48 39 / 84 / 2,15 400 ms 200 ms 620 ms 80,41 65,01 19,15
Javassist 1.357 10 / 28 / 2,80 8 / 24 / 3,00 4.495 ms 16 ms 4.745 ms 79,52 77,50 2,54
Kxml2 1.172 260 / 695 / 2,67 21 / 42 / 2,00 345 ms 77 ms 452 ms 79,61 28,21 64,56
HTMLcleaner 6.047 394 / 1.001 / 2,54 90 / 223 / 2,48 7.920 ms 251 ms 8.266 ms 169,49 138,85 18,08
Jtestcase 4.151 299 / 776 / 2,60 27 / 54 / 2,00 1.234 ms 78 ms 1.328 ms 85,05 80,52 5,32
Average 23.257 1.235 / 3.154 / 2,49 1.675 / 3.612 / 4,60 17,32 ms 0,77 ms 2.818,25 ms 32,78 23,95 29,86

Balancing without restrictions

Benchmark Nodes Proj. / nodes / length Col. / nodes / length Proj. Time Col. Time Bal. Time Quest. Quest. Bal. %
NumReader 12 0 / 0 / 0,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,46 6,46 0,00
Orderings 85 2 / 4 / 2,00 12 / 55 / 4,58 0 ms 0 ms 0 ms 12,77 9,69 24,13
Factoricer 62 7 / 17 / 2,43 0 / 0 / 0,00 0 ms 0 ms 0 ms 13,89 7,90 43,09
Sedgewick 43 2 / 4 / 2,00 5 / 26 / 5,20 0 ms 0 ms 0 ms 19,20 6,73 64,96
Clasifier 31 4 / 8 / 2,00 3 / 18 / 6,00 0 ms 0 ms 0 ms 16,00 5,87 63,31
LegendGame 93 11 / 25 / 2,27 21 / 43 / 2,05 16 ms 0 ms 16 ms 15,53 9,41 39,38
Cues 19 0 / 0 / 0,00 4 / 17 / 4,25 0 ms 0 ms 0 ms 10,40 3,00 71,15
Romanic 123 16 / 41 / 2,56 0 / 0 / 0,00 0 ms 0 ms 0 ms 25,06 9,29 62,93
FibRecursive 6.097 20 / 68 / 3,40 1.168 / 2.347 / 2,01 48 ms 265 ms 313 ms 43,30 21,24 50,95
Risk 70 7 / 17 / 2,43 9 / 50 / 5,56 0 ms 0 ms 0 ms 31,08 5,79 81,38
FactTrans 198 5 / 12 / 2,40 0 / 0 / 0,00 0 ms 0 ms 0 ms 18,96 14,25 24,88
RndQuicksort 90 3 / 7 / 2,33 2 / 7 / 3,50 0 ms 0 ms 0 ms 13,69 10,60 22,55
BinaryArrays 137 8 / 18 / 2,25 0 / 0 / 0,00 0 ms 0 ms 0 ms 15,78 10,64 32,55
FibFactAna 380 3 / 9 / 3,00 29 / 58 / 2,00 0 ms 0 ms 0 ms 30,13 29,15 3,27
NewtonPol 46 1 / 2 / 2,00 1 / 40 / 40,00 0 ms 0 ms 0 ms 23,09 4,77 79,35
RegresionTest 18 1 / 3 / 3,00 0 / 0 / 0,00 0 ms 0 ms 0 ms 6,84 6,26 8,46
BoubleFibArrays 214 1 / 2 / 2,00 40 / 83 / 2,08 0 ms 16 ms 16 ms 12,42 12,00 3,41
ComplexNumbers 68 17 / 37 / 2,18 9 / 18 / 2,00 15 ms 0 ms 15 ms 20,62 10,20 50,53
StatsMeanFib 104 3 / 6 / 2,00 20 / 56 / 2,80 0 ms 0 ms 0 ms 12,33 11,00 10,81
Integral 25 0 / 0 / 0,00 2 / 22 / 11,00 0 ms 0 ms 0 ms 8,38 3,38 59,63
TestMath 51 1 / 2 / 2,00 2 / 5 / 2,50 0 ms 0 ms 0 ms 12,77 11,65 8,73
TestMath2 267 1 / 2 / 2,00 2 / 59 / 29,50 0 ms 0 ms 0 ms 66,47 53,12 20,08
Figures 116 13 / 32 / 2,46 3 / 6 / 2,00 0 ms 0 ms 0 ms 13,78 11,85 14,02
FactCalc 73 2 / 8 / 4,00 13 / 44 / 3,38 0 ms 0 ms 0 ms 17,26 8,23 52,31
SpaceLimits 127 42 / 104 / 2,48 0 / 0 / 0,00 31 ms 0 ms 31 ms 40,85 10,90 73,32
Argparser 129 31 / 74 / 2,39 9 / 37 / 4,11 15 ms 16 ms 31 ms 20,78 9,68 53,42
Cglib 1.216 119 / 342 / 2,87 42 / 91 / 2,17 656 ms 204 ms 860 ms 80,41 31,66 60,62
Javassist 1.357 206 / 530 / 2,57 19 / 54 / 2,84 704 ms 31 ms 765 ms 79,52 24,51 69,18
Kxml2 1.172 262 / 699 / 2,67 21 / 42 / 2,00 218 ms 16 ms 250 ms 79,61 25,23 68,31
HTMLcleaner 6.047 1.023 / 2.781 / 2,72 119 / 311 / 2,61 20.983 ms 590 ms 21.762 ms 169,49 31,01 81,70
Jtestcase 4.151 340 / 885 / 2,60 37 / 74 / 2,00 906 ms 79 ms 985 ms 85,05 73,67 13,38
Average 22.621 2.151 / 5.739 / 2,46 1.592 / 3.563 / 6,09 1,39 ms 0,70 ms 5.982,77 ms 32,97 15,78 42,32

where

  • Column Benchmark contains the names of the benchmarks;
  • Column Nodes shows the size of the ET: the number of nodes (i.e., method invocations performed during the execution of the program) in the ET associated with each benchmark;
  • Column Proj. / nodes / length shows the number of projected nodes inserted into the ET, the number of nodes that were projected, and the average length of the projected chains;
  • Column Col. / nodes / length shows the number of collapsed nodes inserted into the ET, the number of nodes that were collapsed, and the average length of the collapsed chains;
  • Column Proj. Time shows the time needed by the debugger to project the chains;
  • Column Col. Time shows the time needed by the debugger to collapse the chains;
  • Column Bal. Time shows the time needed by the debugger to balance the whole ET;
  • Column Quest. shows the average number of questions asked by the debugger before finding the bug in the original ET;
  • Column Quest. Bal. shows the average number of questions asked by the debugger before finding the bug in the balanced ET;
  • Column % shows the improvement (the percentage reduction in questions) achieved with the balancing technique.
The last row shows the averages:
  • The Proj. Time and Col. Time averages show the average time of projecting and collapsing a chain across all benchmarks;
  • The Bal. Time average shows the average time of balancing a benchmark (weighting the benchmarks according to their number of nodes);
  • The Quest. average shows the average number of questions asked by the debugger in the original ETs of all benchmarks;
  • The Quest. Bal. average shows the average number of questions asked by the debugger in the balanced ETs of all benchmarks;
  • The % average shows the improvement achieved with the balancing technique across all benchmarks.
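The collapsing operation measured in the Col. columns can be sketched as follows. This is a simplified illustration under our own assumptions (the Node class and collapse method are hypothetical, not DDJ's implementation): every maximal chain of single-child nodes of at least a given length is replaced by one collapsed node, so the debugger can cover the whole chain with a single question, reducing the depth of the ET.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only (not DDJ's implementation).
class Node {
    final String label;
    List<Node> children = new ArrayList<>();
    Node(String label) { this.label = label; }
}

public class CollapseDemo {
    static int depth(Node n) {
        int d = 0;
        for (Node c : n.children) d = Math.max(d, depth(c));
        return d + 1;
    }

    // Collapse every maximal single-child chain of at least minLen nodes.
    static Node collapse(Node n, int minLen) {
        // walk the single-child chain starting at n
        List<Node> chain = new ArrayList<>();
        Node end = n;
        while (end.children.size() == 1) {
            chain.add(end);
            end = end.children.get(0);
        }
        chain.add(end); // last node of the chain (zero or several children)

        if (chain.size() >= minLen) {
            // one collapsed node stands for the whole chain
            StringBuilder label = new StringBuilder();
            for (Node c : chain) {
                if (label.length() > 0) label.append('+');
                label.append(c.label);
            }
            Node merged = new Node(label.toString());
            for (Node c : end.children) merged.children.add(collapse(c, minLen));
            return merged;
        }
        // chain shorter than minLen: keep it, but balance below its end
        Node head = null, tail = null;
        for (Node c : chain) {
            Node copy = new Node(c.label);
            if (head == null) head = copy; else tail.children.add(copy);
            tail = copy;
        }
        for (Node c : end.children) tail.children.add(collapse(c, minLen));
        return head;
    }

    public static void main(String[] args) {
        // a -> b -> c -> d, where d has two leaf children
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.children.add(b); b.children.add(c); c.children.add(d);
        d.children.add(new Node("e")); d.children.add(new Node("f"));
        System.out.println("before=" + depth(a) + " after=" + depth(collapse(a, 3)));
    }
}
```

The first set of experiments corresponds to bounding this operation (chains of at most five changes), while the second set applies it without restriction.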

Evaluation of strategies for algorithmic debugging

All the strategies implemented in DDJ are described here:

  • Josep Silva.
    A Survey on Algorithmic Debugging Strategies.
    Advances in Engineering Software, 42(11), November 2011.

The following table summarizes the results of the experiments. (This table is updated from time to time; we are currently implementing the subterm dependency tracking strategy and will evaluate it and share the results soon.)

Each row of the table represents the summary of a collection of experiments performed with the associated benchmark. In particular, for each benchmark, we produced its associated ET and assumed that the buggy node could be any node of the ET (i.e., any subcomputation in the execution of the program could be buggy). Therefore, we performed a different experiment for each possible case.

In particular, benchmark Factoricer was debugged 62 times with each strategy; each time, the buggy node was a different node, and the results shown are the average number of questions asked by each strategy. Similarly, benchmark Cglib was debugged 1216 times with each strategy, and so on.
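This experimental procedure can be sketched as follows. The harness below uses hypothetical names and a deliberately small ET (it is not DDJ's API): each node is assumed buggy in turn, a Top-Down session is simulated in which the oracle answers "wrong" exactly when the buggy node lies in the questioned subtree, and the number of questions is averaged over all runs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only (not DDJ's API).
class TDNode {
    final String label;
    final List<TDNode> children = new ArrayList<>();
    TDNode(String label) { this.label = label; }
}

public class StrategyExperiment {
    // The oracle answers "wrong" for a node iff the buggy node is in its subtree.
    static boolean contains(TDNode n, TDNode target) {
        if (n == target) return true;
        for (TDNode c : n.children) if (contains(c, target)) return true;
        return false;
    }

    // Top-Down: from a wrong node, ask its children left to right; descend
    // into the first wrong one, or blame the current node if all are correct.
    static int topDownQuestions(TDNode root, TDNode buggy) {
        TDNode cur = root; // the root is wrong by assumption (a bug was observed)
        int questions = 0;
        while (true) {
            TDNode next = null;
            for (TDNode c : cur.children) {
                questions++;                    // one question per child asked
                if (contains(c, buggy)) { next = c; break; }
            }
            if (next == null) return questions; // all children correct: cur is buggy
            cur = next;
        }
    }

    static void collect(TDNode n, List<TDNode> out) {
        out.add(n);
        for (TDNode c : n.children) collect(c, out);
    }

    public static void main(String[] args) {
        // a 5-node ET: r has children a and b; a has children a1 and a2
        TDNode r = new TDNode("r"), a = new TDNode("a"), b = new TDNode("b");
        TDNode a1 = new TDNode("a1"), a2 = new TDNode("a2");
        r.children.add(a); r.children.add(b);
        a.children.add(a1); a.children.add(a2);

        List<TDNode> all = new ArrayList<>();
        collect(r, all);
        int total = 0;
        for (TDNode buggy : all) total += topDownQuestions(r, buggy); // one experiment per node
        System.out.println("average questions=" + (total / (double) all.size()));
    }
}
```

Dividing such an average by the number of nodes yields the percentages reported in the tables below.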

% of questions asked by each strategy w.r.t. the number of nodes of each benchmark

Benchmark Nodes D&QO D&QH D&QS HF MRF DR&Q HD-P HD-Y HD-N TD SS Average
NumReader 12 28,994 28,994 31,361 44,379 44,379 29,586 49,704 49,704 49,704 49,704 53,254 41,797
Orderings 81 9,935 9,935 10,291 12,820 12,820 14,099 14,352 13,950 14,352 14,500 50,595 16,150
Factoricer 62 20,030 20,030 20,055 22,046 22,046 20,030 22,046 22,046 22,046 22,046 50,768 23,926
Sedgewick 24 39,360 39,360 39,360 40,160 40,160 39,360 42,080 42,080 51,680 51,680 51,840 43,375
Clasifier 33 29,325 29,325 51,211 30,796 30,796 50,865 36,678 36,678 49,913 49,913 51,384 40,626
LegendGame 92 13,655 13,655 13,736 15,019 15,898 16,742 16,927 16,927 16,927 16,927 50,526 18,812
Cues 19 44,250 44,250 44,250 44,750 44,750 44,250 46,250 46,250 52,250 52,250 52,250 46,886
Romanic 123 13,463 13,463 13,463 13,989 13,989 13,703 21,384 27,354 20,213 20,213 50,397 20,148
FibRecursive 5.729 0,266 0,266 0,269 0,312 0,732 1,778 0,637 0,733 0,614 0,614 50,009 5,112
Risk 70 39,972 39,972 39,972 40,151 40,151 39,972 46,340 42,055 46,876 43,781 50,684 42,721
FactTrans 198 5,432 5,432 5,434 8,010 8,669 8,661 9,530 9,328 9,530 9,530 50,249 11,800
RndQuicksort 74 9,529 9,529 10,311 12,444 12,800 11,627 16,018 14,542 16,018 15,662 50,649 16,284
BinaryArrays 128 8,461 8,461 8,647 10,234 12,391 11,117 12,523 13,425 12,523 12,998 50,382 14,651
FibFactAna 380 2,379 2,379 2,400 6,780 7,449 5,295 5,721 6,476 5,700 7,909 50,131 9,329
NewtonPol 46 41,829 41,829 49,027 43,685 45,450 48,981 49,117 49,117 49,117 49,117 51,019 47,117
RegresionTest 18 25,208 25,208 25,208 28,255 29,086 25,208 36,011 36,011 36,011 36,011 52,355 32,234
BoubleFibArrays 214 3,712 3,717 3,885 4,971 5,778 17,542 7,422 7,801 5,778 5,778 50,230 10,602
ComplexNumbers 68 19,954 19,954 19,954 20,794 20,794 33,081 26,633 25,625 30,540 29,889 50,704 27,084
Integral 25 28,994 28,994 30,621 32,249 32,249 47,781 32,249 32,249 32,249 32,249 51,775 34,696
TestMath 51 11,612 11,612 11,908 15,311 15,533 12,870 21,117 22,411 22,929 24,556 50,925 20,071
TestMath2 267 4,582 4,582 4,572 9,832 10,225 12,579 11,237 11,929 14,424 24,802 50,185 14,450
Figures 116 7,101 7,101 7,400 7,999 8,028 8,058 11,140 10,943 11,944 11,776 50,420 12,901
FactCalc 120 7,424 7,424 7,513 9,023 9,173 11,290 12,287 12,287 18,066 18,066 50,406 14,815
SpaceLimits 127 26,947 26,947 30,609 27,673 27,673 35,101 30,695 30,682 38,574 31,915 50,385 32,473
Argparser 129 12,101 12,101 13,083 13,071 13,320 21,254 16,195 16,195 16,148 15,982 50,379 18,166
Cglib 1.216 1,932 1,932 2,332 2,515 2,649 2,132 6,373 5,756 7,320 6,607 50,041 8,145
Kxml2 1.172 2,856 2,856 3,006 3,059 3,476 3,555 8,578 6,971 7,767 6,787 50,043 8,996
Javassist 1.357 4,336 4,336 5,442 4,735 4,751 4,492 6,204 9,267 6,062 5,856 50,037 9,593
Average 426,821 16,559 16,559 18,047 18,752 19,115 21,108 21,980 22,100 23,760 23,826 50,786 22,963

% of questions asked by each strategy w.r.t. the number of nodes of each benchmark (after balancing the ETs)

Benchmark Nodes D&QO D&QH D&QS DR&Q HF MRF HD-P TD HD-Y HD-N SS Average
NumReader 12 28,994 28,994 31,361 29,586 44,379 44,379 49,704 49,704 49,704 49,704 53,254 41,797
Orderings 46 12,042 12,087 12,630 14,396 17,157 17,293 21,050 20,824 20,598 19,602 51,019 19,881
Factoricer 62 9,826 9,826 9,927 20,030 12,547 12,547 15,042 12,547 15,042 18,292 50,768 16,945
Sedgewick 12 30,769 30,769 33,136 30,769 34,911 34,911 43,787 43,195 43,787 43,787 53,254 38,462
Clasifier 23 19,792 20,313 22,396 21,875 22,917 23,264 32,118 31,944 32,118 34,549 51,910 28,472
LegendGame 71 8,873 8,873 8,951 16,725 11,150 11,227 14,680 13,368 14,680 16,937 50,675 16,013
Cues 18 31,579 32,410 32,410 32,410 33,241 34,626 39,058 42,105 39,058 44,321 52,355 37,598
Romanic 123 6,400 10,842 11,225 13,560 7,440 11,882 13,287 13,411 13,287 13,300 50,397 15,003
FibRecursive 4.619 0,269 0,269 0,281 1,201 0,327 0,407 3,915 0,456 3,915 0,476 50,011 5,593
Risk 33 16,782 16,782 18,080 19,377 18,685 18,685 24,308 31,142 24,308 32,785 51,384 24,756
FactTrans 198 3,891 3,894 3,927 6,222 6,578 6,581 7,369 7,159 7,242 7,505 50,249 10,056
RndQuicksort 72 8,726 8,726 8,726 11,409 12,029 12,235 13,624 13,511 12,929 14,543 50,666 15,193
BinaryArrays 128 5,516 5,523 5,715 7,127 7,746 7,938 7,896 8,587 8,155 8,707 50,382 11,208
FibFactAna 351 2,438 2,438 2,449 5,377 7,615 7,712 6,399 8,573 7,390 5,990 50,141 9,684
NewtonPol 7 39,063 39,063 43,750 39,063 43,750 43,750 45,313 45,313 45,313 45,313 54,688 44,034
RegresionTest 18 23,269 23,269 25,208 25,208 26,870 26,870 32,964 32,964 32,964 32,964 52,355 30,446
BoubleFibArrays 171 4,404 4,411 4,570 11,401 5,946 6,960 24,496 6,960 24,868 6,960 50,287 13,751
ComplexNumbers 60 10,024 10,024 10,320 11,314 11,395 11,395 15,775 15,748 15,802 19,188 50,793 16,525
Integral 5 44,444 44,444 47,222 44,444 50,000 50,000 50,000 50,000 50,000 50,000 55,556 48,737
TestMath 48 11,912 11,912 12,162 12,995 15,952 16,285 22,407 24,198 23,865 22,366 50,979 20,457
TestMath2 228 3,511 3,511 3,511 9,735 10,553 10,808 12,288 28,564 13,236 14,367 50,216 14,573
Figures 113 6,717 6,748 6,787 8,095 7,679 7,787 10,172 10,603 10,165 10,765 50,431 12,359
FactCalc 59 10,111 10,139 10,417 11,528 13,694 14,222 20,472 18,500 20,472 20,694 50,806 18,278
SpaceLimits 127 12,952 16,071 19,147 21,741 13,678 16,797 22,870 22,784 22,858 26,147 50,385 22,312
Argparser 101 6,824 6,824 7,180 7,343 8,218 8,218 11,832 11,890 11,832 12,438 50,481 13,007
Cglib 1.171 1,576 1,585 1,894 1,727 2,182 2,300 5,504 5,462 5,247 5,760 50,043 7,571
Kxml2 1.151 0,897 0,897 0,927 1,358 1,174 1,198 2,270 2,454 2,253 2,929 50,043 6,036
Javassist 1.341 4,270 4,273 5,362 4,427 4,675 4,687 6,090 5,802 9,226 6,048 50,037 9,536
Average 370,286 13,067 13,390 14,274 15,730 16,160 16,606 20,525 20,635 20,725 20,944 51,199 20,296

where

  • Column Benchmark contains the names of the benchmarks;
  • Column Nodes shows the size of the ET: the number of nodes (i.e., method invocations performed during the execution of the program) in the ET associated with each benchmark;
  • Column D&QH shows the number of questions asked (as an average) by Divide & Query by Hirunkitti;
  • Column D&QS shows the number of questions asked (as an average) by Divide & Query by Shapiro;
  • Column DR&Q shows the number of questions asked (as an average) by Divide by Rules & Query;
  • Column HF shows the number of questions asked (as an average) by Heaviest First;
  • Column MRF shows the number of questions asked (as an average) by More Rules First;
  • Column HD-P shows the number of questions asked (as an average) by Hat Delta with the heuristic that computes the proportion of correct and wrong nodes;
  • Column TD shows the number of questions asked (as an average) by Top-Down;
  • Column HD-Y shows the number of questions asked (as an average) by Hat Delta with the heuristic that counts correct nodes (answered YES);
  • Column HD-N shows the number of questions asked (as an average) by Hat Delta with the heuristic that counts wrong nodes (answered NO);
  • Column SS shows the number of questions asked (as an average) by Single Stepping.
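As a sketch of how a Divide & Query strategy chooses its next question (a simplified selection rule for illustration, not DDJ's exact implementation of any of the variants above), the code below selects the node whose subtree weight is closest to half the weight of the tree, so that each answer roughly halves the remaining suspect area. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical structures for illustration only.
class DQNode {
    final String label;
    final List<DQNode> children = new ArrayList<>();
    DQNode(String label) { this.label = label; }
}

public class DQDemo {
    static int weight(DQNode n) {               // number of nodes in the subtree
        int w = 1;
        for (DQNode c : n.children) w += weight(c);
        return w;
    }
    static void collect(DQNode n, List<DQNode> out) {
        out.add(n);
        for (DQNode c : n.children) collect(c, out);
    }
    // Choose the non-root node whose weight is closest to weight(root)/2.
    static DQNode select(DQNode root) {
        double target = weight(root) / 2.0;
        List<DQNode> all = new ArrayList<>();
        collect(root, all);
        DQNode best = null;
        double bestDist = Double.MAX_VALUE;
        for (DQNode n : all) {
            if (n == root) continue;
            double dist = Math.abs(weight(n) - target);
            if (dist < bestDist) { bestDist = dist; best = n; }
        }
        return best;
    }
    public static void main(String[] args) {
        // a 7-node chain: r -> n1 -> ... -> n6
        DQNode r = new DQNode("r"), cur = r;
        for (int i = 1; i <= 6; i++) {
            DQNode n = new DQNode("n" + i);
            cur.children.add(n);
            cur = n;
        }
        DQNode s = select(r);
        System.out.println("selected=" + s.label + " weight=" + weight(s)
                + " target=" + (weight(r) / 2.0));
    }
}
```

This bisection behavior explains why the D&Q family consistently asks the fewest questions in the tables above, while Single Stepping (SS) stays close to 50% of the nodes.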