Contributed by Jonathan Pool on 2009/08/30.
UI Tasks 1 included the task of tabulating the characters in the expressions of some language variety. The cost in time and memory was found to vary substantially with the expression counts and the character-type counts of varieties. For one variety, English, the memory cost was high enough to exhaust memory and crash the server process on a server with 4 GB of RAM.
In “PanLex Performance Recommendations”, Quinn Weaver predicted that the time and memory costs of character tabulation could be substantially decreased if the tabulation implementation in lvviz2w.pl were changed to a method that iterates over the rows in a table of expressions.
This document reports the results of an experimental reimplementation of the character tabulation task, as recommended.
The original implementation includes this code:
my $tt = (
    join '', (
        sort { $b cmp $a }
        (split //, (decode_utf8 ((join '', (&QCs ("tt from ex where lv = $in{lv}"))), 1))))
);
# Identify a sorted concatenation of all instances of all character values in all expressions
# in the variety.
my ($last, $len, $nbr);
while ($len = length $tt) {
    # Until the character instance concatenation is exhausted:
    $last = (substr $tt, -1);
    # Identify its last character.
    $nbr = ($len - (index $tt, $last));
    # Identify the count of that type’s instances.
    $tt = (substr $tt, 0, (- $nbr));
    # Remove that type’s characters from the concatenation.
}
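To make the sort-and-strip approach easier to follow, here is a minimal sketch that applies the same steps to a small in-memory list instead of a database query. The function name tabulate_by_sorting and the sample strings are hypothetical and do not appear in lvviz2w.pl. The sketch assumes that the strings are already decoded, and it records each count in a hash, which the excerpt above does not show.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of lvviz2w.pl): tabulate characters with the
# sort-and-strip method, given already-decoded strings rather than query results.
sub tabulate_by_sorting {
    my @expressions = @_;
    # Build one string holding every character instance, sorted in descending order.
    my $tt = join('', sort { $b cmp $a } split(//, join('', @expressions)));
    my %count;
    while (my $len = length $tt) {
        # The last character is the smallest one still in the string.
        my $last = substr($tt, -1);
        # Its run starts at its first occurrence and extends to the end.
        my $nbr = $len - index($tt, $last);
        $count{$last} = $nbr;
        # Cut that run off the end of the string.
        $tt = substr($tt, 0, -$nbr);
    }
    return %count;
}

my %sorted_counts = tabulate_by_sorting('abba', 'cab');
print "$_: $sorted_counts{$_}\n" for sort keys %sorted_counts;
```

For the two sample strings this prints counts of 3 for “a”, 3 for “b”, and 1 for “c”. Note that every character instance in the input is held in one list and one string at the same time, which is consistent with the memory cost growing with the size of the variety.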
The revised implementation includes this code:
my (%tt, @tt);
my $sth = ($dbh->prepare ("select tt from ex where lv = $in{lv}"));
# Prepare a request for the expressions in the variety.
$sth->execute;
# Execute it.
while (@tt = ($sth->fetchrow_array)) {
    # For each expression:
    foreach $i (split '', (decode_utf8 ($tt[0]))) {
        # For each character value in it:
        $tt{$i}++;
        # Add 1 to the character’s count, initializing the count if necessary.
    }
}
my $nbr;
foreach $i (sort (keys %tt)) {
    # For each character type with any instances:
    $nbr = $tt{$i};
    # Identify its count.
}
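The row-by-row counting idea can be tried without a database. The following is a minimal sketch, assuming that the expressions are supplied as a list of already-decoded strings. The function name tabulate_by_hash and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of lvviz2w.pl): count character instances one
# expression at a time, keeping only one counter per character type.
sub tabulate_by_hash {
    my @expressions = @_;
    my %count;
    for my $expression (@expressions) {
        # Split the expression into characters and increment each one's counter.
        $count{$_}++ for split(//, $expression);
    }
    return %count;
}

my %hash_counts = tabulate_by_hash('abba', 'cab');
print "$_: $hash_counts{$_}\n" for sort keys %hash_counts;
```

For the two sample strings this prints counts of 3 for “a”, 3 for “b”, and 1 for “c”. Apart from the input list itself, the only state that grows is the hash, whose size depends on the number of distinct character types rather than on the number of character instances.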
The two implementations’ execution times, in seconds, are as follows (“mem” means that the run exhausted memory before finishing):
Variety   Expressions  Sec. (old)  Sec. (new)
epo-000        178183           5           5
deu-000        525818          18          13
jpn-000        477425          52          21
cmn-000        984822         272          48
eng-000       2083445         mem          48
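As a reading aid, the ratio of old to new execution time can be computed for each variety whose old run finished. The sketch below copies the timings from the table above. The speedup function is a hypothetical helper, and eng-000 is left out because the old implementation produced no time for it.

```perl
use strict;
use warnings;

# Timings in seconds, copied from the table: variety, old, new.
# eng-000 is omitted because the old implementation did not finish.
my @timings = (
    ['epo-000',   5,  5],
    ['deu-000',  18, 13],
    ['jpn-000',  52, 21],
    ['cmn-000', 272, 48],
);

# Hypothetical helper: how many times faster the new implementation ran.
sub speedup {
    my ($old, $new) = @_;
    return $old / $new;
}

for my $row (@timings) {
    my ($variety, $old, $new) = @$row;
    printf "%s %.1f\n", $variety, speedup($old, $new);
}
```

This gives ratios of about 1.0, 1.4, 2.5, and 5.7, so in these four measurements the advantage of the new implementation was larger for the varieties that were slower under the old one.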