Contributed by Jonathan Pool on 2009-06-21.
One of the major efficiency challenges for PanLex has been allowing users to upload files of data into the database fast enough to avoid frustrating waits.
The first implementation of the file-upload feature was entirely real-time. The user watched a progress bar while the file was transmitted to the server and then processed there. Experience with the project management group revealed that the waits for source files could be intolerable, reaching several hours.
A reimplementation decreased the wait times so they would typically not exceed about an hour, but the experience was still unsatisfyingly tedious.
Another reimplementation split the process of contributing a file into two parts. The first part is real-time and involves the transmission of the file to the server and the formal validation of the file by the server. This part generally lasts no more than about 10 minutes, and typically less than 1 minute. When it ends, the server serves a reply page to the client estimating when the user can expect to receive an email message announcing the completion of the processing. The user is then free to do other things. Every 15 minutes the server processes any waiting files and, upon the completion of each file, notifies the submitting user.
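The two-part process described above can be sketched roughly as follows. This is only an illustration, not the PanLex implementation: the class `UploadQueue`, the function `validate`, and the list `sent` (standing in for outgoing email) are hypothetical names, and the validation rule is a placeholder.

```python
from collections import deque
from dataclasses import dataclass

CYCLE_MINUTES = 15  # batch interval stated in the text


@dataclass
class Upload:
    user_email: str
    filename: str


def validate(text: str) -> bool:
    # Hypothetical stand-in for the formal validation of a file;
    # here a file is "valid" if it is not empty.
    return bool(text.strip())


class UploadQueue:
    def __init__(self) -> None:
        self.pending = deque()  # validated files waiting for the next cycle
        self.sent = []          # stand-in for notification emails

    def submit(self, user_email: str, filename: str, text: str) -> str:
        """Real-time part: validate the file and reply to the user at once."""
        if not validate(text):
            return "rejected"
        self.pending.append(Upload(user_email, filename))
        return (f"accepted; expect an email within about {CYCLE_MINUTES} "
                "minutes plus processing time")

    def run_cycle(self) -> int:
        """Deferred part: process every waiting file, notifying after each."""
        processed = 0
        while self.pending:
            upload = self.pending.popleft()
            # The actual loading of the file into the database would go here.
            self.sent.append((upload.user_email, upload.filename))
            processed += 1
        return processed
```

In this sketch a scheduler (for example a cron job, which is an assumption) would call `run_cycle` every 15 minutes, while `submit` runs during the user's request.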
PanLex now contains about 27,000,000 denotations from about 625 source files, or about 43,000 denotations per source file. The processing times required for files of various sizes are shown below. A file of the mean size requires about 2 minutes, so the typical wait for notification would be half of 15 minutes plus 2 minutes, i.e. about 10 minutes. This seems satisfactory. A heavy load on PanLex, if it became very popular, would probably involve about 2,500 uploads per year, or about 7 files per day, with processing times totaling about 15 minutes per day, i.e. about 1% of the available time. Thus, upload waits are satisfactory, and upload processing will consume only a small fraction of the server’s processing capacity under any foreseeable conditions.
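The arithmetic behind these estimates can be reproduced as follows. The figures are the ones given in the text. The half-cycle average wait assumes that files arrive at random moments within a 15-minute cycle, and the variable names are chosen here for illustration.

```python
DENOTATIONS = 27_000_000
SOURCE_FILES = 625
CYCLE_MINUTES = 15
MEAN_PROCESSING_MINUTES = 2
UPLOADS_PER_YEAR = 2_500  # the "heavy load" scenario

# Mean file size: 43,200 denotations, i.e. about 43,000.
mean_denotations = DENOTATIONS / SOURCE_FILES

# Typical wait: half a cycle on average, plus processing: 9.5 minutes.
typical_wait = CYCLE_MINUTES / 2 + MEAN_PROCESSING_MINUTES

# Heavy load: about 6.8 files and about 13.7 processing minutes per day.
uploads_per_day = UPLOADS_PER_YEAR / 365
daily_processing = uploads_per_day * MEAN_PROCESSING_MINUTES

# Share of the 1,440 minutes in a day: just under 1%.
share_of_day = daily_processing / (24 * 60)
```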
The above reasoning is simplistic, because it ignores the fact that operations on the database become slower as the database grows. One operation in an upload is to check each expression for its prior existence, so it can be decided whether to consider the new expression an instance of an existing one or to create a new expression for it. This operation becomes more expensive as PanLex grows. Thus, upload efficiency may become a concern again. For now, however, it is not a significant issue.
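The existence check can be pictured with a minimal sketch. It is not the PanLex code: an in-memory dictionary stands in for the database's expression table, expressions are identified by their text alone (an assumption made for brevity), and `resolve_expressions` is a hypothetical name.

```python
def resolve_expressions(existing: dict, incoming: list) -> list:
    """Return an ID for each incoming expression, reusing existing IDs.

    `existing` maps expression text to its ID and is updated in place
    whenever a new expression has to be created.
    """
    next_id = max(existing.values(), default=0) + 1
    ids = []
    for text in incoming:
        if text not in existing:      # the prior-existence check
            existing[text] = next_id  # create a new expression
            next_id += 1
        ids.append(existing[text])    # otherwise reuse the existing one
    return ids
```

For example, with `existing = {"dog": 1, "cat": 2}`, the call `resolve_expressions(existing, ["cat", "bird", "cat"])` reuses the ID of "cat" and creates one new expression for "bird". In a real database each such check is a lookup against a table that keeps growing, which is the cost the paragraph above refers to.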
