Large, quick wins can be had by making a few global changes, like adding more RAM and upgrading PostgreSQL. This will fundamentally change the performance characteristics of the system. Hence these changes must be made before measuring the effect of further optimizations. We call this process baselining.
After baselining, we should measure performance again, using the macroscopic tests described in UI Tasks 1. I can do this work, or someone from Utilika can do it.
Next, there are a few big, obvious changes to make, such as table partitioning. These too should bring large wins. In particular, the query used in lines 18–22 of lvviz2w.pl is massively inefficient, and should be rewritten (actually, this step could take place before the baseline).
This paper details both baseline steps and post-baseline steps, and gives time estimates for each.
Basic hardware and configuration changes that will fundamentally improve the system's performance
Listed in order from greatest (anticipated) performance impact to least impact. The character tabulation step could be implemented before the baseline. The others should be implemented on an as-needed basis. Because of this, and because of the increased complexity of each of these steps, these estimates are more conditional and less certain than the estimates for baseline steps.
These changes represent the largest, easiest performance wins. They must be made before further optimizations are meaningful (except for refactoring character tabulation—that could precede the baseline).
Free performance gains can be had simply by upgrading to PostgreSQL 8.4. Versions 8.2, 8.3, and 8.4 all introduced significant performance enhancements.
We can back up plx remotely using pg_dumpall 8.4, so that we can safely restore the dump without worrying about 8.1-versus-8.4 compatbility. Restoring from pg_dump brings the added convenience of creating a pristine database with bloat-free, balanced indexes and no unvacuumed rows.
I estimate this will take only about one hour of billable time, though the remote backup, running unattended, will take many hours.
RAM is dramatically faster than disk, as this chart illustrates (if you don't see readable text, click the red bar and then scroll to the top).
Placing the whole database in shared memory is, therefore, in terms of bang for your buck, the best optimization you can make. No amount of software optimization can make up for the slowness of disk.
The plx database is about 9 GB on disk, while the server has only 4 GB of RAM. I recommend at least 16 and at most 32 GB in order to accomodate the database, make space for Apache/mod_perl processes, and leave some room to grow. If we follow the plan of buying a new production server and making the current server a development box, then getting 32 GB should be no problem. If, instead, you prefer to upgrade RAM in the present server, I'll need to ask more info to make sure this is possible with your particular motherboard. (Since this is a quad-Xeon system, I suspect it is indeed possible, but I'll need to check in order to know for sure.)
Once RAM is added, I must reconfigure PostgreSQL to take advantage of it. This means raising all memory-related settings. In fact, even for the current hardware, shared_buffers, max_connections and max_prepared_transactions are set quite low. This will artificially limit performance when there are large numbers of concurrent queries.
I recommend the following settings:
| Name | Value |
|---|---|
| shared_buffers | [1/4 of RAM] |
| work_mem | 512 MB |
| maintenance_work_mem | [1/16 of RAM] |
| wal_buffers | 1 MB |
| effective_cache_size | [2/3 of RAM] |
In addition, I noticed that checkpoint_segments is set unusually high. Was this an attempt to improve performance (perhaps when bulk-loading dictionaries)? Did it work?
In Apache, use mod_perl. Currently Panlex is using plain CGI, which means the Perl code is parsed, byte-compiled, and run for each new HTTP request. This is costly. With mod_perl, the program is byte-compiled once, then stays resident in memory and simply executes for each new request. Also with mod_perl, the Apache parent process's memory is used in a copy-on-write fashion; this means that there need not be n_child copies of read-only data structures.
Apache::Registry is a module that enables CGI-based Perl scripts to run under mod_perl. It carries two caveats:
my variables will not be seen by subroutines.
If you have a my variable in the main body of a script (that is, outside of a subroutine)
and you define any subroutines below it, then those subroutines will see the my variable (even if it is
not passed in as an argument). This is because Perl supports closures, and those subroutines automatically
become closures over all lexical variables preceding them.
The problem is, Apache::Registry (behind the scenes) moves your main script into a separate subroutine.
This is necessary to integrate with Apache. A side-effect, however, is that your other subroutines are no
longer in the same lexical scope as your my variables. If any subs were accidentally
using my variables that weren't explicitly passed in, then they will break.
From a spot-check of plxu.cgi's 126 subs, it appears that they all practice good encapsulation: they take variables as arguments and operate only on those variables. However, I didn't check exhaustively. Instead, I recommend simply testing the application and seeing if it produces the expected results. This should be faster than an a priori code review, and will certainly be more reliable. Probably Utilika will want to do this testing itself, to reduce consulting costs.
Global variables will retain their state across requests.
This means that any code that assumes they will be initialized to empty will stop working properly.
Fortunately, this is easy to fix. plxu.cgi contains very few global variables, and it is trivial to initialize them on each request so that they will be empty. Line 28 declares all the global variables:
use vars ('$ct', '$dbh', '$i', '@i', '$j', '%in', '%lcvc', '%lv', '%tt', '@eff', '@tm0', '@tm1');
Of these, several are already handled. We would need to initialize %in, @tm0, @tm1, %lcvc, %lv, and %tt:
%in = (); # to avoid seeing CGI parameters from the last request
%lcvc = %lv = %tt = (); # These are per-request caches, so clear them on each request
This will preserve the current semantics of the program. At some point, you might want to add logic to keep caches across requests, but then you have to decide when to expire old cache values; I'm not sure whether this would be desirable, or how much performance benefit it would bring. So I recommend simply clearing the caches for now.
One last note: @tm0 seems to contain only undef values, which are never otherwise assigned, and are used only in a subtraction clause (where they amount to subtracting 0 from each value of @tm1). This seems like a programming error. Here is the relevant code:
quinn@tao:~/Desktop/Utilika/panlex/perl$ grep -r -B 5 -A 6 tm0 . | grep -v '.svn' -- -- ./plxu.cgi-# Make every warning fatal. ./plxu.cgi- ./plxu.cgi-use strict; ./plxu.cgi-# Require strict checking of variable references, symbolic references, and barewords. ./plxu.cgi- ./plxu.cgi:use vars ('$ct', '$dbh', '$i', '@i', '$j', '%in', '%lcvc', '%lv', '%tt', '@eff', '@tm0', '@tm1'); ./plxu.cgi- ./plxu.cgi-use encoding 'utf8'; ./plxu.cgi-# Make Perl interpret the script and standard files as UTF-8 rather than bytes. ./plxu.cgi- ./plxu.cgi-use Encode 'decode_utf8', 'encode_utf8'; ./plxu.cgi-# Import a function to decode from UTF-8 to character values and thereby test form-data UTF-8 -- ./plxu.cgi-sub Eff { ./plxu.cgi- ./plxu.cgi- @tm1 = times; ./plxu.cgi- # Identify the elapsed times. ./plxu.cgi- ./plxu.cgi: $eff[0] += ($tm1[0] - $tm0[0]); ./plxu.cgi: $eff[1] += ($tm1[1] - $tm0[1]); ./plxu.cgi: $eff[2] += ($tm1[2] - $tm0[2]); ./plxu.cgi: $eff[3] += ($tm1[3] - $tm0[3]); ./plxu.cgi- # Add the last operation’s times to the efficiency facts. ./plxu.cgi- ./plxu.cgi- $eff[4]++; ./plxu.cgi- # Increment the operation count in the efficiency facts. ./plxu.cgi- ./plxu.cgi-}
One caveat with this approach is that global variables will stay
I recommend using Apache::Registry, but turning on warnings and examining the Apache error log to see if any uninitialized uses of variables.
Under mod_perl, you can automatically benefit from persistent database connections simply by adding 'use Apache::DBI' to plxu.cgi. This is a win because creating database connections is very expensive, especially in PostgreSQL; it is better to hold $dbh's in memory and reuse them across many queries.
Using Apache::DBI gets you one persistent plx $dbh per child process. Since mod_perl children are single-threaded, this is all you need.
In the past, some constraints were dropped to gain performance. This is not a viable strategy. Missing constraints defeat referential integrity, lead to inconsistent data, and undermine the purpose of the RDBMS. This is especially true for user-submitted data, which can't be re-created from another system of record.
If possible (that is, if a record of them exists), restore these constraints. If no such record exists, try to restore them by hand. I could do this by guesswork based on shared column names, but it would be better done by a linguist with knowledge of Panlex (perhaps Jonathan?)
In addition to restoring constraints, consider that dropping them in the first place may have caused data integrity problems in the server. Is it better to keep these problems (and assume that future definitions will gradually correct them), or is it better to re-create data from the original dictionaries?
The trade-off is that re-creating the database would mean losing existing dns and mns. Is this worthwhile? There are many dns and mns. Can most be re-created from source data (such as dictionaries)? Or are they largely user-contributed, hence hard or impossible to re-create?
Again, this is best addressed by a linguist familiar with the project. However, in the course of my other work, I did notice obviously missing foreign-key constraints on some critical columns:
| Table | Columns |
|---|---|
| dn | mn, ex |
| ex | lv |
| mn | ap |
In addition, I noticed that no tables use the serial data type for primary keys. In fact,
some tables use either stored procedures or server-side Perl code to handle IDs. I don't understand why.
This is of most concern for tables mn and dn, since they are large.
The Perl code contains a hard-coded database password (for user apache, database plx).
But the svn repository allows anonymous read access, so that password is now compromised.
I recommend moving the password to a pg_service.conf file (as described in the DBD::Pg POD), then testing, and then changing the password.
Once the baseline is performed and new performance measurements are made, there are a few more optimizations that should yield large gains. I list them here in rough order from greatest (projected) impact to least impact. We should undertake them one at a time, measuring performance after each step to see when acceptable results are achieved.
Lines 18–22 of lvviz2w.pl performs two very memory-inefficient operations:
The goal is to find the total frequency of each known character in all known expressions.
To achieve this same end, I suggest rewriting the code so that it fetches only one expression at a time, breaks that expression into characters, and adds their frequency to a running count (stored in a hash). This will avoid pulling all expressions into memory at once, which is what is causing the out-of-memory errors.
I recommend writing this code in back-end Perl, not plperlu. This will be faster to develop and debug, and may be fine for performance.
If further optimization is needed, we could consider cacheing the contents of this hash; however, note that doing so will necessarily take up memory and cause results to be slightly out of date. I don't recommend taking this step unless other optimizations are inadequate to fix performance.
This is one example of a place where proper use of DBI could help performance. The next section covers the general case.
The Perl code performs most SQL queries using one of three methods:
This causes two problems:
DBI provides solutions to both problems:
my $sth = $dbh->prepare tells PostgreSQL to cache the query plan for a SQL query.
$sth->execute() executes a prepared query.
The various $sth->fetchrow* methods fetch a single row from a database and place it in different
data structures. $sth->fetchrow_arrayref is the most efficient, but they are all much better than
$dbh->do().
For prepare to be most effective, it should be called with placeholders, question marks
that match those parts of the query which will differ across executions:
# INSERT example my $sth1 = $dbh->prepare('INSERT INTO foo (x, y, z) VALUES (?, ?, ?)'); $sth1->execute($x1, $y1, $z1); $sth1->execute($x2, $y2, $z2); ... # SELECT/fetch example my $sth2 = $dbh->prepare('SELECT w, v from bar WHERE u=?'); $sth2->execute($u1); while (my ($w, $v = $sth->fetchrow_array())) { ... }
This solution requires a fair amount of refactoring. For an example, see 21/aprcv3w.pl. This code performs eight large queries using &QCs and &QRs, stores the results in arrays, then uses uses loops to process the arrays. To use DBI, each query and each loop would have to be rewritten.
Therefore, I recommend an incremental approach: refactor only as needed, and then cover only the largest queries. These are as follows:
### Queries that select all rows (or one column of each row) from a large table 21/aprcv3w.pl, line 26: my @wctt = (&QCs ('tt from wcex order by ex', '')); 21/aprcv3w.pl, line 29: my @wcex = (&QCs ('ex from wcex order by ex', '')); 21/aprcv3w.pl, line 58: my @mn = (&QCs ("mn from mn$rid", '')); plxu.cgi, line 5345: return (&QCs ('tt from wcex order by tt')); plxu.cgi, line 5361: my @wctt = (&QCs ('tt from wcex order by tt')); ### A query that selects all the exs in an lv 21/exmod8w.pl, line 14: my @ex = (&QRs ("ex, tt from ex where lv = $in{lv}", 1)); ### Same problem as in lvviz2w's character tabulation 21/lvviz3w.pl, line 21: (split //, (decode_utf8 ((join '', (&QCs ("tt from ex where lv = $in{lv}"))), 1)))) lvviz2w.pl
The larger a table, the worse its performance. Partitioning is a way around this problem. Partitioning means breaking up a large table into several small tables, in a way that is transparent to the application code.
For instance, you could partition ex into many language variant–specific tables, each table covering one lv. Indeed, partitioning is most beneficial when data are logically clustered together, as with plx's lvs. These are the benefits (from the PostgreSQL documentation):
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. [This will help with adding lvs.] ALTER TABLE is far faster than a bulk operation. It also entirely avoids the VACUUM overhead caused by a bulk DELETE.
As a last benefit, partitioning
plx has three rather large tables: dn, ex, and mn.
plx=# select tablename, plx-# pg_relation_size(tablename) as size, plx-# pg_size_pretty(pg_relation_size(tablename)) plx-# from pg_tables plx-# where schemaname='public' plx-# order by size desc; tablename | size | pg_size_pretty -----------+------------+---------------- dn | 1430888448 | 1365 MB -- 27,422,964 rows ex | 861233152 | 821 MB -- 12,258,760 rows mn | 450732032 | 430 MB -- 10,178,756 rows [Further output truncated]
ex can be partitioned on lv.
dn could be partitioned on the lv of its ex's, but only by denormalizing dn to include the lv as a column.
mn cannot be effectively partitioned.
It might be worthwhile to combine the dn and mn tables into a single table. This is another empirical question. Denormalizing would reduce join overhead, but would make the table larger—though it could be partitioned if it were further denormalized to include lv, as discussed in partitioning.
As a side note, I suspect the use of the mnid table is unnecessary and that mn should simply use
regular serial data type for its IDs. (This holds true also if mn is merged into dn.)
If further performance gains are needed, we can begin the process of index optimization. This entails examining the largest and most heavily used tables and finding
This is an empirical, table-by-table process. Since it is time-consuming and doesn't always bring payoffs, it's worthwhile only for large indexes on heavily used tables. For this reason, I place it last in my list of recommendations.
The largest indexes in plx are on dn, ex, and mn (unsurprisingly, since those are the three largest tables):
plx=# select tablename, plx-# indexname, plx-# pg_relation_size(indexname) as size, plx-# pg_size_pretty(pg_relation_size(indexname)) plx-# from pg_indexes plx-# where schemaname='public' plx-# order by size desc; tablename | indexname | size | pg_size_pretty -----------+-------------+-----------+---------------- dn | dn_pkey | 615981056 | 587 MB dn | dn_mn_key | 615981056 | 587 MB dn | dn_ex | 615981056 | 587 MB dn | dn_mn | 615981056 | 587 MB ex | ex_lv_key | 580042752 | 553 MB ex | ex_pkey | 277839872 | 265 MB ex | ex_lv | 276029440 | 263 MB mn | mn_pkey | 228646912 | 218 MB mn | mn_ap | 228646912 | 218 MB [Further output truncated]
Panlex is rife with opportunities for performance gains, many of them easy to achieve. Once the system is baselined, we should evaluate how many optimizations are necessary.