UI Tasks 1b

Contributed by Jonathan Pool on 2009/08/30.


UI Tasks 1 included the task of tabulating the characters in the expressions of some language variety. The cost in time and memory was found to vary substantially with the expression counts and the character-type counts of varieties. In fact, tabulating one variety, English, required enough memory to exhaust the server’s 4 GB of RAM and crash the server process.

In “PanLex Performance Recommendations”, Quinn Weaver predicted that the time and memory costs of character tabulation could be substantially decreased if the tabulation implementation in lvviz2w.pl were changed to a method that iterates over the rows in a table of expressions.

This document reports the results of an experimental reimplementation of the character tabulation task, as recommended.

The original implementation includes this code:

	my $tt = (
		join '', (
			sort { $b cmp $a }
			(split //, (decode_utf8 ((join '', (&QCs ("tt from ex where lv = $in{lv}"))), 1))))
	);
	# Identify a sorted concatenation of all instances of all character values in all expressions
	# in the variety.

	my ($last, $len, $nbr);

	while ($len = length $tt) {
	# Until the character instance concatenation is exhausted:

		$last = (substr $tt, -1);
		# Identify its last character.

		$nbr = ($len - (index $tt, $last));
		# Identify the count of that type’s instances.

		$tt = (substr $tt, 0, (- $nbr));
		# Remove that type’s characters from the concatenation.

	}
	

The revised implementation includes this code:

	my (%tt, @tt);

	my $sth = ($dbh->prepare ("select tt from ex where lv = $in{lv}"));
	# Prepare a request for the expressions in the variety.

	$sth->execute;
	# Execute it.

	while (@tt = ($sth->fetchrow_array)) {
	# For each expression:

		foreach $i (split '', (decode_utf8 ($tt[0]))) {
		# For each character value in it:

			$tt{$i}++;
			# Add 1 to the character’s count, initializing the count if necessary.

		}

	}

	my $nbr;

	foreach $i (sort (keys %tt)) {
	# For each character type with any instances:

		$nbr = $tt{$i};
		# Identify its count.

	}
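
The two excerpts above are intended to produce the same tabulation. The following is a minimal sketch for checking that on a small in-memory sample, with no database involved. The sample expressions, the subroutine names tabulate_by_sorting and tabulate_by_iterating, and the %count hash are assumptions made for illustration and are not part of lvviz2w.pl. The expressions are taken to be already-decoded character strings, so decode_utf8 is omitted.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the rows returned by the query on ex.
my @expressions = ('tablo', 'bona', "\x{15D}tono");

# Old approach: build one sorted string of all characters, then peel off
# one run of identical characters per iteration.
sub tabulate_by_sorting {
	my @texts = @_;
	my $tt = join ('', (sort { $b cmp $a } (split (//, (join ('', @texts))))));
	my %count;
	while (my $len = length $tt) {
		my $last = substr ($tt, -1);
		# The string is sorted in descending order, so the run of the
		# last character extends from its first occurrence to the end.
		my $nbr = $len - index ($tt, $last);
		$count{$last} = $nbr;
		$tt = substr ($tt, 0, -$nbr);
	}
	return \%count;
}

# New approach: visit one expression at a time and increment a per-character
# counter, so only the counts are retained.
sub tabulate_by_iterating {
	my @texts = @_;
	my %count;
	foreach my $text (@texts) {
		foreach my $char (split (//, $text)) {
			$count{$char}++;
		}
	}
	return \%count;
}

my $old = tabulate_by_sorting (@expressions);
my $new = tabulate_by_iterating (@expressions);

# Print each character's code point with the count from each approach.
foreach my $char (sort (keys %$new)) {
	printf ("U+%04X  old=%d  new=%d\n", ord ($char), $old->{$char}, $new->{$char});
}
```

In this sketch the first subroutine holds a string containing every character instance at once, while the second holds only one expression and one counter per character type at a time.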
	

The two implementations’ execution times, in seconds, are as follows (“mem” indicates that the original implementation exhausted memory before completing):

          Variety   Expressions  Sec. (old)  Sec. (new)
           epo-000     178183         5           5
           deu-000     525818        18          13
           jpn-000     477425        52          21
           cmn-000     984822       272          48
           eng-000    2083445       mem          48
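
The relative improvement can be derived from the table by simple arithmetic. The following sketch computes, for each variety, the ratio of old to new execution time and the number of expressions processed per second by the new implementation. The subroutine name speedup and the layout of @rows are assumptions made for illustration. The figures are copied from the table, and the old time for eng-000 is left undefined because that run did not complete.

```perl
use strict;
use warnings;

# Rows copied from the table: variety, expressions, seconds (old), seconds (new).
my @rows = (
	['epo-000',  178183,     5,  5],
	['deu-000',  525818,    18, 13],
	['jpn-000',  477425,    52, 21],
	['cmn-000',  984822,   272, 48],
	['eng-000', 2083445, undef, 48],
);

# Ratio of old to new execution time.
sub speedup {
	my ($old, $new) = @_;
	return $old / $new;
}

foreach my $row (@rows) {
	my ($lv, $ex, $old, $new) = @$row;
	# No ratio can be given where the old implementation did not finish.
	my $ratio = defined $old ? sprintf ('%.1f', speedup ($old, $new)) : 'n/a';
	printf ("%s  speedup %s  %.0f expressions/sec (new)\n", $lv, $ratio, $ex / $new);
}
```

On these figures the ratio ranges from 1.0 for epo-000 to about 5.7 for cmn-000.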