
Yearbook of Labour Statistics

Data Processing Workflow Analysis and Recommendations

Edgardo Greising, External Collaborator
December 2009

International Labour Organization Department of Statistics

Yearbook Production - Current process


Data collection at ILO currently uses two methods: most questionnaires are exchanged as Excel files, and some are still received on paper. The Excel questionnaires are customized for each country and then sent to the countries. The nine Excel books, a total of thirty-five spreadsheets, are sent automatically by e-mail. Some of the e-mails are rejected, and there is no established procedure to deal with this situation.

When answers are received (see Figure 1: Current process workflow), an administrative assistant manually handles reception and internal distribution by e-mail. The spreadsheets are uploaded to the SAS databases through a semi-automatic process that each Statistical Assistant (SA) must run on a per-sheet basis (35 sheets for each country). Once the Excel questionnaires have been received from the countries and forwarded to the responsible SA, the SA performs manual consistency controls on the Excel spreadsheets. The SA has to go through all the questionnaires, even those without errors. The automated loading procedure from Excel obliges the SA to go over each sheet of the Excel book in order to upload it. When answers are received on hardcopy, the SA uses the SAS command line interface to enter the information directly into the SAS database.

For countries that do not answer at all (unfortunately most of them), there is no established procedure for retrying. In the best cases, if there is enough time, the SA takes data from web sites to complete the questionnaires not received.

Once a day, a full consistency check of each table is run over the whole database. The results are made available to each SA, but erroneous data is not marked, so if nobody fixes the errors (editing and correcting is an optional procedure), they are published anyway. There is no validation for false positives (errors according to the consistency rules but not in reality) or chronic errors (historical errors that can no longer be fixed, or are not worth fixing), so these entries remain in the error report even when they are not actual errors. When the cause of an error is found, the data is edited through the SAS command line interface, as is done when entering questionnaires received on paper; the operation of this interface is not intuitive. The consistency check does not take into account that the data will also be used for the Country Profiles publication, which makes it difficult for the SA to arrange the tables produced for that publication when some of the data is missing.

The tables for the publications are produced from the database using SAS to XML and then to PDF, and these documents are sent directly to press. The Country Profiles publication, however, is not produced directly by processing the databases: after the tables are produced, the SA must review and temporarily modify them by altering the table definitions in SAS with the command line editor (columns, years and footnotes), because the data is not coherent among the subjects combined in some tables of the profile.

Figure 1: Current process workflow

The LABORSTA web site is updated weekly with the information as it stands in the main database, including tables with errors that were listed in the consistency check and have not yet been fixed.

Yearbook Production - Proposed process


The proposed process for the Yearbook production rests on four main ideas:
- Broadening the ways of interacting with the countries for data collection.
- Fully automating the procedures that computers can execute, so that the SA can devote their time to the activities computers cannot do.
- Systematizing the consistency and correction procedure, regardless of the way the data arrived.
- Knowing when and why (or why not) data from the countries is arriving, and thus how much information will be included in the Yearbook.

These simple ideas, once implemented, should lead ILO STAT to a better response rate from the countries, a shorter delay in the information received and an improvement in the overall quality of the publication. Figure 2: Proposed process workflow depicts the new procedures and databases to be implemented.

Two new ways of collecting data will be implemented: the e-questionnaire, which allows the countries to enter the data on line through the Internet, and SDMX, the Statistical Data and Metadata eXchange standard, which allows the countries to send the information in XML files automatically downloaded from their own databases. Excel questionnaires and hardcopy will remain valid options but, for simplicity, countries should be discouraged from using them. Answers received on paper should be entered by the SA using the e-questionnaire system.

The e-questionnaire system will be able to check immediate consistency rules, such as format, range and totals. When a form is saved, warnings can be issued for values that fall outside a tolerance threshold relative to last year's datum, and the user can then confirm the value. For Excel questionnaires and for SDMX, fully automated upload procedures will be developed. All data received, regardless of the way it arrived, will be loaded into the Data collection database, a common repository where the questionnaires will stay while they are not ready to be published, that is, until they are free of errors.

The SA are expected to know their counterparts in each assigned country well, so they will be responsible for doing their best to get the answers from them. If the deadline for receiving the data approaches, they must contact their counterparts and help them as much as possible to provide the information. The online e-questionnaire will be a very useful tool for this contact, since the SA and the counterpart can both see the same erroneous or incomplete information. Even information sent via Excel, once uploaded, can be reviewed, corrected and/or completed through the web e-questionnaire.
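The immediate checks described for the e-questionnaire (missing values, range, totals and the tolerance warning against last year's datum) could be sketched roughly as follows. This is only an illustration: the `Cell` class, the `check_row` function, the 20% tolerance and the non-negative range are assumptions made for the example, not part of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cell:
    """One questionnaire value (hypothetical structure for this sketch)."""
    label: str
    value: Optional[float]
    last_year: Optional[float] = None


def check_row(components: List[Cell], total: Cell,
              tolerance: float = 0.20,   # assumed threshold, not from the report
              min_value: float = 0.0) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for one row of a questionnaire."""
    errors: List[str] = []
    warnings: List[str] = []

    for cell in components + [total]:
        if cell.value is None:
            errors.append(f"{cell.label}: missing value")
        elif cell.value < min_value:
            errors.append(f"{cell.label}: {cell.value} is below {min_value}")
        elif cell.last_year:  # skipped when last year's datum is absent or zero
            change = abs(cell.value - cell.last_year) / abs(cell.last_year)
            if change > tolerance:
                # A warning, not an error: the user may confirm the value.
                warnings.append(
                    f"{cell.label}: changed {change:.0%} from last year")

    # Totals rule: the components must add up to the reported total.
    values = [c.value for c in components]
    if total.value is not None and None not in values:
        if abs(sum(values) - total.value) > 1e-6:
            errors.append(
                f"{total.label}: components sum to {sum(values)}, "
                f"reported {total.value}")
    return errors, warnings
```

Errors would block saving the form, while warnings would only ask for confirmation, which matches the distinction drawn above between consistency rules and tolerance thresholds.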

[Figure 2 diagram. Boxes and labels: Country counterparts; Stat. Assistant (e-mail or phone call when no answer is received); Selected data collection mode: paper or fax questionnaire, e-questionnaire (on line via Internet), SDMX file, Excel questionnaire (1 simplified book); Manual input by Stat. Assistant; Full automatic upload; Data Collection Database; Consistency check; Error report; Correct? (NO: Editing by Stat. Assistant; YES: Main Database); Weekly update; Web Database (replica); Automatic printout generation; Data Flow Control Dashboard; outputs: Yearbook publication, Yearbook CD, LABORSTA website (with dynamic charts and maps).]

Figure 2: Proposed process workflow

With this new approach of a single repository of received data, the consistency check becomes a single process that can be run over the whole database (recommended once per day) or by each SA for the answers of just one country. This gives the SA the ability to interact with their counterparts, complete or correct the information and run the consistency check, all in line. Questionnaires with no errors according to the defined consistency rules will go to the Main database. Those with errors will remain in the Data collection database, marked accordingly. A report will be sent to the SA listing the defects that prevent each questionnaire from passing to the Main database.

In the case of a so-called false positive (an error according to the consistency rules but not in fact, e.g. an outlier), the SA will be able to, and responsible for, recording a validation or allowance, that is, making the program bypass this particular check and allow the questionnaire to go to the Main database in spite of the strange or missing value. This step will imply adding a footnote to the table, which will also be under the control of the system.

From the Main database onwards, the dissemination process will not change much from the one in use today. Even so, improving the quality of data presentation on the LABORSTA website with dynamic tables, charts and thematic maps will probably be easy to achieve, since tools from the Oracle suite (in particular the BI Enterprise Edition) make it possible to add those features.
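The routing rule between the two databases, including the allowance for false positives, might be sketched as below. The `Questionnaire` class, the `route` function, the rule identifiers and the use of plain dictionaries as stand-ins for the Data collection and Main databases are all hypothetical, chosen only to illustrate the rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Questionnaire:
    """Hypothetical record of one country's answers and its check results."""
    country: str
    failed_checks: Set[str] = field(default_factory=set)      # ids of failed rules
    allowances: Dict[str, str] = field(default_factory=dict)  # rule id -> footnote
    footnotes: List[str] = field(default_factory=list)
    marked_error: bool = False


def route(q: Questionnaire,
          data_collection_db: Dict[str, Questionnaire],
          main_db: Dict[str, Questionnaire]) -> List[str]:
    """Move q to the Main database unless unallowed failures remain.

    Returns the list of blocking rule ids (the error report for the SA).
    """
    blocking = sorted(q.failed_checks - set(q.allowances))
    if blocking:
        # Stays in the Data collection database, marked as erroneous.
        q.marked_error = True
        return blocking

    # Every remaining failure was allowed by the SA: attach its footnote.
    q.marked_error = False
    q.footnotes = [q.allowances[rule] for rule in sorted(q.failed_checks)]
    data_collection_db.pop(q.country, None)
    main_db[q.country] = q
    return []
```

In this sketch an allowance never removes a failed check; it only stops it from blocking, so the footnote records that the value was accepted on purpose.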

Data Flow and Stages.

Figure 3: Data Flow Diagram

One major addition to the whole process is the workflow control information. All procedures touching the data, from the first communication asking each country for this year's data until the last table is published, will be recorded, and data will flow through the different stages according to a predefined workflow (see the symbol in Figure 2: Proposed process workflow). Figure 3 shows the Data Flow Diagram with the stages and the transitions between them. The workflow will comprise at least the stages SENT, COLLECT, COMPLETED and OK, and a questionnaire can pass through ERROR and EDITED many times before reaching OK. Only data marked OK and moved to the Main database will be included in the printed, optical or electronic dissemination.

The Data Flow Control Dashboard (see Figure 2: Proposed process workflow, on the left) will give real-time information about the status of each questionnaire. The count of questionnaires in each stage will be displayed first; the user can then drill down by continent or SA assignment, then by country, and even view the error listing associated with an ERROR questionnaire or the content of any questionnaire, regardless of its stage. Editing of a questionnaire will be restricted to authorized users, such as the assigned SA or the Supervisors.
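A minimal sketch of the stage workflow and the dashboard counts follows. The stage names come from the text above, but the transition table is an assumption (Figure 3 is not reproduced here), and the `TrackedQuestionnaire` class and `stage_counts` function are hypothetical names used only for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Optional


class Stage(Enum):
    SENT = "SENT"
    COLLECT = "COLLECT"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"
    EDITED = "EDITED"
    OK = "OK"


# Assumed transitions: ERROR and EDITED may repeat many times before OK.
ALLOWED = {
    Stage.SENT: {Stage.COLLECT},
    Stage.COLLECT: {Stage.COMPLETED},
    Stage.COMPLETED: {Stage.ERROR, Stage.OK},
    Stage.ERROR: {Stage.EDITED},
    Stage.EDITED: {Stage.ERROR, Stage.OK},
    Stage.OK: set(),
}


class TrackedQuestionnaire:
    """Questionnaire whose every stage change is recorded."""

    def __init__(self, country: str, continent: str) -> None:
        self.country = country
        self.continent = continent
        self.stage = Stage.SENT
        self.history = [Stage.SENT]  # log of all stages visited

    def advance(self, new_stage: Stage) -> None:
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"{self.country}: {self.stage.value} -> {new_stage.value} "
                "is not allowed")
        self.stage = new_stage
        self.history.append(new_stage)


def stage_counts(questionnaires: Iterable[TrackedQuestionnaire],
                 continent: Optional[str] = None) -> Counter:
    """Count questionnaires per stage, optionally drilled down by continent."""
    return Counter(q.stage for q in questionnaires
                   if continent is None or q.continent == continent)
```

The same counting function could be filtered by SA assignment or by country in the same way as it is filtered by continent here.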