
Proceedings of Q2008 European Conference on Quality in Official Statistics

Different quality tests on the automatic coding procedure for the Economic Activities descriptions
A. Ferrillo, S. Macchia, P. Vicari1

1 The ISTAT Economic Activities coding application

1.1 Automatic coding in ISTAT: the ACTR software system

Until a few years ago, responses to open questions were coded manually, a time-consuming job that guaranteed neither the standardisation of the process nor the correctness of the assigned codes. For this reason, in 1998 ISTAT decided to test an automated coding system. The software selected was ACTR (Automated Coding by Text Recognition, v. 3), a package developed by Statistics Canada. The choice fell on ACTR because it is a generalised system, independent of the language, already used successfully by other National Statistical Institutes (Tourigny and Moloney, 1995). ACTR's philosophy rests on methods originally developed at the US Census Bureau (Hellerman, 1982), but it uses matching algorithms developed at Statistics Canada (Wenzowski, 1988). The coding activity follows a quite sophisticated phase of text standardisation, called parsing, which provides 14 different functions such as character mapping, deletion of trivial words, definition of synonyms, suffix removal, etc. Parsing aims at removing grammatical or syntactical differences, so as to make two different descriptions with the same semantic content identical. The parsed response to be coded is then compared with the parsed descriptions of the dictionary, the so-called reference file. If this search returns a perfect match, called a direct match, a unique code is assigned; otherwise the software uses an algorithm to find the best partial (or fuzzy) matches, providing an indirect match.
According to a measure of similarity between the texts to be coded and the descriptions of the reference file, and depending on some user-defined threshold parameters defining the range of acceptance, the system produces one of the following results: a unique match, if a single code is assigned to a response phrase; multiple matches, if several possible codes are proposed; a failed match, if no matches are found. The first case requires no human intervention, while the others have to be evaluated by expert coders. As already mentioned, ACTR is generalised with respect to both the language and the classification used. This means that the construction of the coding environment must be carried out by the user, who has to adapt the system to the Italian language and to each classification and, finally, to test it.
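The parse-then-match flow described above can be sketched in a few lines. The snippet below is only an illustrative toy, not ACTR's actual algorithm (Statistics Canada's scoring and thresholds are far more sophisticated): it normalises texts with a couple of parsing-like functions, then classifies the outcome as unique, multiple or failed according to assumed similarity thresholds.

```python
import re
from difflib import SequenceMatcher

TRIVIAL = {"the", "of", "and", "a", "an"}  # toy 'trivial words' list

def parse(text):
    """Toy parsing step: lowercase, character mapping, trivial-word deletion."""
    words = re.sub(r"[^a-z ]", " ", text.lower()).split()
    return " ".join(w for w in words if w not in TRIVIAL)

def classify_match(response, reference, reject=0.60, tie=0.02):
    """Compare a parsed response with a parsed reference file and return
    ('unique' | 'multiple' | 'failed', candidate codes). The thresholds are
    illustrative stand-ins for ACTR's user-defined acceptance parameters."""
    parsed = parse(response)
    scores = {code: SequenceMatcher(None, parsed, parse(text)).ratio()
              for code, text in reference.items()}
    best = max(scores.values(), default=0.0)
    if best < reject:
        return "failed", []
    winners = sorted(c for c, s in scores.items() if best - s <= tie)
    return ("unique" if len(winners) == 1 else "multiple"), winners
```

A score of 1.0 here plays the role of ACTR's direct match; anything lower that still clears the acceptance threshold corresponds to an indirect (fuzzy) match.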

ISTAT, Italian Institute of Statistics, Via C. Balbo 16, 00184 Rome, Italy

The construction of the coding dictionary (reference file) is the heaviest activity, since its quality and size deeply affect the performance of automated coding. This activity aims at processing the texts of the official classification manual so as to reproduce the respondents' natural language as closely as possible. This is done in different steps, the most important of which is the integration of the manual's descriptions with empirical response patterns taken from previous surveys. No less important is the definition of general classification criteria for the assignment of codes, to make the system work as a human coder would (e.g., when two activities are declared, the prevalent one is automatically assigned).

1.2 Automated coding applications developed in ISTAT

Numerous coding applications for different classifications have been developed in ISTAT. The most important refer to the following variables: Country/Nationality, Causes of death, Economic activities, Education level, Municipalities, Occupation. They have been used in different surveys, and some of them for the 2001 Population and Industry Censuses, with very good results. The performance of automated coding is measured through two indicators: the recall rate (coding rate), i.e. the percentage of codes automatically assigned, and the precision rate, i.e. the percentage of automatically assigned codes that are correct. According to these two indicators, the performance of ISTAT coding applications was satisfactory and the results were coherent with those obtained by other National Statistical Offices (De Angelis et al., 2000). As far as the economic activities application is concerned, the automated coding results have always been higher in business surveys than in household or individual surveys (see Table 1). This is because the concept of Economic Activity is more familiar to respondents of the first type of survey than to those of the second.
Table 1 - Economic activities coding application results

Survey                                                 Texts to be coded   Recall rate   Precision rate
Intermediate Industry Census                           1,793               58.8          91.0
1991 Population Census / Quality survey                6,288               54.5          85.0
Labour Force Pilot survey                              -                   43.5          85.0
I 2001 Population Census Pilot survey                  -                   51.2          93.7
II 2001 Population Census Pilot survey                 -                   51.9          90.0
2001 Population Census (Institutional Households)      -                   53.6          92.3
2001 Industry Census                                   1,130,693           80.7          -
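The two indicators defined above are simple ratios; as a worked example, the sketch below reproduces the figures later reported for the Census quality-test sample (Tables 5 and 7 in section 3.1).

```python
def recall_rate(unique_matches, total_texts):
    """Recall (coding) rate: percentage of texts automatically coded
    with a unique code."""
    return 100.0 * unique_matches / total_texts

def precision_rate(correct_codes, assigned_codes):
    """Precision rate: percentage of automatically assigned codes
    that are correct."""
    return 100.0 * correct_codes / assigned_codes

# Worked example with the Census quality-test figures from section 3.1:
print(round(recall_rate(2504, 3191), 2))     # 78.47
print(round(precision_rate(2382, 2504), 2))  # 95.13
```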

1.3 The new economic activities classification release: ATECO 2007

1.3.1 The new ATECO 2007

The new economic activities classification ATECO 2007 is the national version of NACE Rev. 2, the European economic activities classification. NACE Rev. 2 is the result of a complex revision activity, carried out at international level, of the previous economic activities classification. The revision involved two aspects: a) a process of convergence among the main economic classifications: ISIC, NACE and NAICS (the classification adopted by the North American countries); b) the need for a classification reflecting the changes in the present economy. The new classification introduces new concepts at the highest level, and new details have been created to reflect different forms of production and emerging new activities. At the same time, efforts have been made to preserve the structure of the classification in all areas that do not explicitly require changes based on new concepts. The detail of the classification has substantially increased (from 514 to 615 classes and from 883 to 918 categories). For service-producing activities, this increase is visible at all levels, including the highest one, while for other activities the increase in detail affected mostly the lower levels of the classification. NACE Rev. 1.1 had 17 sections and 62 divisions; NACE Rev. 2 has 21 sections and 88 divisions. At the highest level of NACE, some sections can easily be compared to the previous version of the classification. However, the introduction of some new concepts at the section level, e.g. the ICT section (section J) or the grouping of activities linked to the environment (section E), makes an easy overall comparison between NACE Rev. 2 and its previous version impossible. To give an idea of the impact of these changes on official statistics, about 45 per cent of the four-digit codes and around 35 per cent of the five-digit codes split into two or more new codes in NACE Rev. 2.

1.3.2 ATECO and ACTR

To work well, ACTR must be continuously enriched with new descriptions and expressions used by respondents to business surveys; ACTR cannot wait for a new classification to change its dictionary. The first version used NACE Rev. 1 and a dictionary of 27,306 descriptions, enriched with the economic activity descriptions coming from the Intermediate Industry Census (1997); the second version (30,745 descriptions) incorporated the texts collected with the last Industry Census (2001) and the business survey aimed at identifying local units (2003). In 2003 an update was made to incorporate the changes in ATECO 2002 but, in that case, ATECO 2002 was not very different from ATECO 1991. The new economic activities classification adopts the same classification principles as the previous one, so the way enterprises are classified does not change. The general structure of the classification is also similar: Agriculture, Manufacturing, Services. However, ATECO 2007 is deeply different from the point of view of contents. As already explained, some sections completely change in their structure and in the included activities; moreover, some parts (e.g., publishing) move from one section (Manufacturing) to another (Information and Communication). The realization of the ACTR application with the new economic activities classification was therefore a long and complex process. This updating process involved different steps and problems:

1. only a part of the old classification at the five-digit level (around 65 per cent) translated directly into the new one; the remaining part had to be checked description by description;
2. since the classification was very different, some descriptions had to be completely re-examined; in some cases it was necessary to split old descriptions (e.g., "Repair and installation of pumps"), because one part now falls under one code (Repair, group 33.1) and the other under a different code (Installation, group 33.2);
3. completely new activities were introduced;
4. some old descriptions had to be deleted because obsolete (281 texts).

Due to the deep changes in some sectors of the classification, it was necessary to analyse not only the split codes but all the texts concerned by these activities, namely: part of Agriculture (section A), Repair and Maintenance (group 33.1), Installation (group 33.2), a large part of Information and Communication (section J), Manufacture of footwear (group 15.2), Manufacture of furniture (division 31), Repair of furniture (class 95.24), Construction (section F), Agents (group 46.1), Wholesale trade (division 46), Retail sale (division 47) and Professional services (sections M and N). Moreover, the old descriptions had to be checked because the new classification is more detailed than the previous one, especially in sections M and N; as a consequence, an old description could split into two or more different codes. After the complete revision, around 3,000 new texts had been introduced in the dictionary, 281 texts had been deleted and around 800 dictionary entries had been revised. The work on the new classification involved not only the dictionary but also the parsing.
The changes in the classification are so deep that it was necessary to introduce more than 200 revisions in the parsing.

Table 2 - Dimensions of the dictionaries of the economic activities application

Classification   Texts in the dictionary
ATECO 91         27,306
ATECO 2002       30,745
ATECO 2007       33,587

2. Aims of the new ATECO coding application

2.1 ACTR for surveys and Censuses

After the revision and the evaluation of recall and precision rates, ACTR is ready to be used in any field where the new classification is used; it can be applied in business surveys or in any survey that needs to assign a code to a description of economic activity. In particular, the next important application field will be the Industry Census. The tool was already used for the 2001 Census, so it is tuned on the descriptions given by this type of respondent. Since an on-line version has been realized, ACTR can also be used in business surveys whenever the description of an economic activity must be translated into a code immediately.

2.2 ACTR for administrative sources

In order to implement the new ATECO 2007 in the Business Register (economic activity is the main variable in the Business Register), it was necessary to exploit different methodologies. ACTR was one of them, chosen because the larger part of the enterprises in the Register have a description of activity collected by the Chambers of Commerce (CCIAA). These descriptions are very different from those collected in business surveys: the latter are synthetic and describe just the economic activities, while the former are very peculiar declarations. When enterprises register their activity with the Chambers of Commerce, they often describe not a single specific activity but a wide range of activities. Enterprises prefer to be very generic, so that they can change activity without particular problems and without informing the Chambers of Commerce of every change. Besides, their declarations mention other things that are completely useless for identifying the economic activity: legal form, laws, the year the enterprise was founded, etc. The descriptions collected by the Chambers of Commerce are particularly complicated to treat because they very often exceed the physical length manageable by ACTR (200 bytes). 5,471,375 descriptions were analysed: 41.7% were longer than 200 bytes. The other 58.3%, shorter than 200 bytes, produced a low recall rate (32.7%). For both these reasons, it was necessary to find a way to transform these descriptions into shorter, treatable texts. For the reasons mentioned above, the new application has been customised and integrated with another piece of software so as to reach 61% of coded descriptions (see par. 3.2). This treatment produced two results: 1. the CCIAA descriptions became shorter than the original ones; 2. redundancies and useless information were deleted.
To give an idea of the problem and of its solution: after the treatment, the description "the company was founded with the mission of producing shoes since 1995" becomes "producing shoes", which is the only information necessary to assign a code. All the other information is superfluous and prevents ACTR from working well. At the end of this process, the coding rate for texts shorter than 200 bytes rose to 61%; the coding rate for all texts, shorter and longer, reached 48%.
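A minimal sketch of this kind of pre-treatment, assuming a simple regular-expression approach (the software actually integrated with ACTR is not described in detail in the paper, so the patterns and function name here are hypothetical):

```python
import re

# Illustrative boilerplate patterns commonly padding CCIAA declarations.
BOILERPLATE = [
    r"\bthe company was founded\b.*?\bmission of\b",  # founding formulas
    r"\bsince \d{4}\b",                               # year references
    r"\b(s\.?r\.?l\.?|s\.?p\.?a\.?)\b",               # legal-form acronyms
]

def shorten(description):
    """Strip boilerplate from a CCIAA description, keeping only the
    text that actually describes the economic activity."""
    text = description.lower()
    for pattern in BOILERPLATE:
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the paper's own example, "the company was founded with the mission of producing shoes since 1995" reduces to "producing shoes".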

2.3 ACTR on the WEB

ISTAT planned to supply external users of the ISTAT WEB site with an on-line consultation function based on the technology developed for the ATECO classification. When the new ATECO was finalised, this project was implemented. The users of this technology can be numerous: not only the Chambers of Commerce and the Statistics offices, but also private citizens who have to start a new activity or, more simply, have to declare their code in order to pay taxes. It is important to remember that ATECO 2007 is used by all the administrative sources in Italy and is the first economic activities classification to be shared by ISTAT and all the administrative sources. The new tool has been available on the ISTAT website since the 26th of May 2008. The ACTR WEB application was very successful: in the first period it was consulted by around 10,000 users a week. In the same weeks, the questions sent to the e-mail address dedicated to the new ATECO diminished.

Table 3 - Number of queries in order to obtain an ATECO code

Date       N. queries
06/06/08   8,478
09/06/08   9,225
16/06/08   6,290
20/06/08   10,386
27/06/08   10,327
07/07/08   10,535
14/07/08   10,925
21/07/08   10,304

The availability of this tool saves a substantial amount of time because, before its implementation, the ATECO experts in ISTAT were very often called on to solve problems connected with the classifications. Moreover, the descriptions entered by users during on-line consultation are used, after a specific analysis and treatment, to expand the ACTR dictionary and to keep perfecting the tool. At the moment we have already introduced a further 600 new texts. This function could have other developments: having implemented this technology for the economic activities classification, the same technique could easily be adopted for other classifications.

3. Quality tests on the automatic coding procedure

As already mentioned in the previous paragraph, the automatic coding application will be used for two main purposes: to code automatically (in batch) economic activity descriptions provided by Italian enterprises and recorded in ASCII files, and to support ISTAT WEB site users in identifying the ATECO code corresponding to the activity performed by their enterprises. Concerning the first purpose, the descriptions are usually provided by respondents to ISTAT surveys, but may also come from texts recorded in other archives used by ISTAT for statistical purposes (e.g., Chambers of Commerce archives). These two sources are very different from a linguistic point of view: the first contains very short descriptions expressed according to specifications given in the survey questionnaire, while those of the second are often longer and full of non-pertinent details.
Given the complexity of this scenario, evaluating the quality of the results of the automatic coding application is extremely important, both in terms of the percentage of codes automatically assigned (recall rate) and of the percentage of correct codes automatically assigned (precision rate). So, in order to measure the quality of the procedure to be used to code such heterogeneous descriptions, different quality tests have been planned. They differ both in the methodologies they use and in the samples they treat. In particular, three tests will be described: for two of them, the correctness of the codes assigned by the automatic coding application is assessed through the analysis of expert coders, while in the third the assigned codes are compared with codes deriving from some special surveys. The first two tests use two samples characterised by linguistically different descriptions: one deriving from an ISTAT survey (short and synthetic descriptions) and the second from the Chambers of Commerce archive (long and redundant descriptions). As described

below, due to the linguistic difference, the two samples have been extracted according to different sample designs. Regarding the third test, the automatic coding application was run on data sets of special surveys regarding some particular economic sectors, and the assigned codes were compared with those deriving from the analysis of different correlated variables collected through some questions of the questionnaire.

3.1 Quality test on descriptions of the Industry Census

The aim of this test is to measure the performance of the automatic coding procedure on descriptions provided by respondents to statistical surveys. For this purpose, a sample of descriptions from the 2001 Industry Census was selected. The methodology adopted in drawing this sample optimises the analysis of results, so that very similar texts are examined only once (D'Orazio, Macchia 2002). As a first step, we quantified how many different texts existed in the original file and defined 13 frequency classes. To identify the different texts, a kind of raw standardisation was performed with only a few parsing functions, so as to delete articles, conjunctions, prepositions and suffixes from the descriptions. As can be seen in Table 4, the initial 1,130,662 texts were reduced to 228,738 different ways of describing the economic activity. It was then decided to use a stratified random sampling design to draw the sample. In practice, texts were first stratified according to their frequency of occurrence N_h; then, within each stratum, a simple random sample (without replacement) of texts was selected. The strata coincided with the previously defined classes of occurrences. In deciding the sample size, an equal precision rate of automated coding in each class (p = 0.75) was hypothesised, while the margin of error (d) was progressively reduced in the higher classes of occurrences of different texts; this guaranteed estimates with a higher precision for the heaviest different texts.
Table 4 - Optimal sample sizes in the strata

Classes of    Number of        Number of         Hypothesised    Margin of   Approximate      Sampling
occurrences   original texts   different texts   precision of    error (d)   optimal sample   fraction
                               (N_h)             autom. coding               size (n_h)       (f_h = n_h/N_h)
                                                 (p)
1             169,420          169,420           75.0%           0.040       234              0.14%
2             55,790           27,895            75.0%           0.040       233              0.84%
3-8           92,401           20,906            75.0%           0.035       305              1.63%
9-25          91,828           6,475             75.0%           0.035       305              5.19%
26-50         65,636           1,854             75.0%           0.030       414              23.20%
51-90         60,638           902               75.0%           0.030       414              47.35%
91-130        38,941           360               75.0%           0.030       360              100.00%
131-180       38,105           247               75.0%           0.030       247              100.00%
181-300       56,913           246               75.0%           0.020       246              100.00%
301-430       46,462           128               75.0%           0.020       128              100.00%
431-730       78,602           139               75.0%           0.010       139              100.00%
731-1,410     83,200           83                75.0%           0.010       83               100.00%
1,411+        252,726          83                75.0%           0.010       83               100.00%
Total         1,130,662        228,738           -               -           3,191            1.4%
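The stratum sample sizes in Table 4 follow the standard formula for estimating a proportion with a finite population correction. The sketch below uses the conventional z = 1.96 (95% confidence) as an assumption; the paper does not state its confidence level, so the exact n_h values in the table need not be reproduced by this choice of z.

```python
import math

def stratum_sample_size(N_h, p=0.75, d=0.040, z=1.96):
    """Sample size for estimating a proportion p within margin of error d
    in a stratum of N_h units, with finite population correction.
    The confidence level (z) is an assumption, not stated in the paper."""
    n0 = z**2 * p * (1 - p) / d**2          # infinite-population size
    return math.ceil(n0 / (1 + n0 / N_h))   # finite population correction
```

As in Table 4, a smaller margin of error or a smaller stratum pushes the sampling fraction towards 100%.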

This sample was automatically coded obtaining a recall rate of 78.47%.

This result is absolutely satisfactory, even when analysed in detail. As a matter of fact, while unique matches are distributed among all the classes, there are no failed matches in the classes over 180 occurrences. In addition, 71.25% of the unique matches have a score equal to 10, which means that they correspond to direct matches, and more than 53% of them belong to classes of occurrences greater than 91, which means that the dictionary enrichment was consistent with the way respondents usually express themselves (see Tables 5 and 6).

Table 5 - Automatic coding results on the Census sample

Match type   N       %        Direct matches (score = 10)
Unique       2,504   78.47    1,784 (71.25% of unique)
Multiple     642     20.12
Failed       45      1.41
Total        3,191   100.00

Table 6 - Unique and failed matches per classes of occurrences and score

Classes of     Unique                   Direct matches (score = 10)   Failed
occurrences    Number   % on total      Number   % on direct          Number   % on total
                        unique                   matches                       failed
1              116      4.63            29       1.63                 14       31.11
2              131      5.23            49       2.75                 3        6.67
3-8            203      8.11            100      5.61                 8        17.78
9-25           224      8.95            153      8.58                 4        8.89
26-50          331      13.22           231      12.95                6        13.33
51-90          352      14.06           261      14.63                6        13.33
91-130         304      12.14           249      13.96                3        6.67
131-180        218      8.71            183      10.26                1        2.22
181-300        219      8.75            180      10.09                -        -
301-430        116      4.63            101      5.66                 -        -
431-730        131      5.23            110      6.17                 -        -
731-1,410      80       3.19            70       3.92                 -        -
1,411+         79       3.15            68       3.81                 -        -
Total          2,504    100.00          1,784    100.00               45       100.00

From the qualitative point of view, the unique codes have been analysed by expert coders in order to assess their correctness, identifying wrong codes and codes assigned to descriptions impossible to code (responses that are too generic or that describe more than one economic activity). All these errors were a useful source of information to update the coding application. The results are shown in Tables 7 and 8. As can be seen, the precision rate is higher than 95% and, if analysed per score, 98.09% of direct matches are correct, which is surely a satisfactory result. In addition, it has been verified that the percentage of correct and incorrect codes is uniformly distributed among all the classes of occurrences.

Table 7 - Precision of the automatic coding procedure

                                                     N       %
Correct codes                                        2,382   95.13
Wrong codes                                          36      1.44
Codes assigned to descriptions impossible to code    86      3.43
Total                                                2,504   100.00

Table 8 - Precision of the automatic coding procedure per score

                                                     Indirect matches     Direct matches (score = 10)
                                                     N      %             N       %
Correct codes                                        632    87.8          1,750   98.09
Wrong codes                                          29     4.0           7       0.39
Codes assigned to descriptions impossible to code    59     8.2           27      1.51
Total                                                720    100.00        1,784   100.00
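The overall precision in Table 7 is consistent with the per-score breakdown in Table 8: pooling the correct direct and indirect matches reproduces the 95.13% rate. A quick check:

```python
# Correct codes from Table 8: 1,750 direct + 632 indirect = 2,382 (Table 7).
direct_correct, direct_total = 1750, 1784
indirect_correct, indirect_total = 632, 720

overall = 100.0 * (direct_correct + indirect_correct) / (direct_total + indirect_total)
print(round(overall, 2))  # 95.13, the precision rate reported in Table 7
```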

3.2 Quality test on descriptions of the Chambers of Commerce

This test is meant to measure the performance of the automatic coding procedure on descriptions not collected in statistical surveys but recorded in archives provided to ISTAT by administrative sources. For this purpose, a dataset provided by the Chambers of Commerce was selected. This archive contains the descriptions of economic activities given by entrepreneurs, who tend to list a great number of elements in order not to limit their possible activities. As a consequence, descriptions are quite often very long and contain different types of economic activities that may be close to or totally different from each other. Besides, there are no specifications or rules on how to describe the company's activity, which often turns out to be overly detailed and full of concepts not inherent in the economic activity: the company mission, its legal form, references to laws or other legal aspects of the activity itself, etc. ISTAT used this archive as one of the administrative sources to update its Business Register, which represents the universe of reference for the business surveys. For this reason, the Chambers of Commerce archive was coded using ACTR, obtaining a recall rate of 61%, corresponding to 84,117 coded descriptions. Another source used to reclassify the enterprises in the Business Register were the Sector Studies, which assign a five-digit code through a specific methodology not based on text analysis. The first step of this quality test was to extract from the ACTR-coded dataset the descriptions corresponding to codes at the maximum level of detail; the second step consisted in comparing them with those assigned through the Sector Studies. It was assumed that coinciding codes could be considered correct, as two different methodologies had come to the same conclusion. The results showed that 67% of the extracted descriptions had equal codes, which can be considered a good indicator of quality. The quality analysis then had to address the remaining descriptions, those with different codes, but, due to their huge quantity (17,746 descriptions), it was decided to consider only a sample of them.

The already mentioned characteristics of these texts meant that equal descriptions were really rare, so it was not considered suitable to adopt the same sampling strategy used for the first quality test. Instead, classes of descriptions with non-coinciding codes were defined according to the degree of agreement between the two codes, and a sample was then extracted proportionally within each class, with a margin of error of 0.014%.

Table 9 - Comparison between codes assigned through ACTR and through Sector Studies - Quality control sample

                                            Descriptions   %       Sample
ACTR codes coinciding on first 4 digits     2,306          13.0    520
ACTR codes coinciding on first 3 digits     3,042          17.1    686
ACTR codes coinciding on first 2 digits     4,185          23.6    943
ACTR codes coinciding on first digit        5,040          28.4    1,136
ACTR codes not coinciding                   3,173          17.9    715
Total                                       17,746         100.0   4,000
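The comparison classes in Table 9 can be obtained by counting the leading digits on which the ACTR code and the Sector Studies code agree. A small helper (the code format and dot placement are assumed here for illustration):

```python
def digits_in_common(code_a, code_b):
    """Number of leading digits on which two ATECO codes agree, ignoring
    the dots used in the printed form of the classification."""
    a = code_a.replace(".", "")
    b = code_b.replace(".", "")
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

For instance, two codes in the same group but in different classes agree on three digits and would fall in the "first 3 digits" row of Table 9.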

This sample was submitted to expert coders, who identified: correct codes according to ACTR (A); correct codes according to the Chambers of Commerce (C); codes wrong according to both methodologies (E); doubtful codes according to both methodologies (D). The results are shown in the following table. As can be seen, the precision rate is high in this sample too, ranging from 80% to 94% in all classes apart from the one coinciding only on the first digit. This is because that class is widely populated by very generic descriptions belonging to the Construction sector, which has been strongly revised in the new classification. The combination of these two factors meant that the codes assigned through the two methodologies were both correct only at the first-digit level.

Table 10 - Quality analysis sample - Chambers of Commerce

Quality control sample              Sample   A             C            E            D
                                             N     %       N     %      N     %      N    %
Coinciding codes, first 4 digits    520      489   94      23    4      8     2      0    0
Coinciding codes, first 3 digits    686      592   86      87    13     7     1      0    0
Coinciding codes, first 2 digits    943      920   98      8     1      1     0      13   1
Coinciding codes, first digit       1,136    351   31      119   10     666   59     0    0
Not coinciding codes                715      569   80      99    14     10    1      37   5
Total                               4,000

3.3 Quality test on descriptions of special surveys

When ATECO 2007 was almost finalised and all the instruments to reclassify the enterprises were planned, it became evident that some specific surveys were needed in those sectors where information was not available or where the included activities were completely new. In particular, it was decided to send a questionnaire to the enterprises in the fields of: Information and Communication (section J); Architectural and engineering activities, technical testing and analysis (division 71); Research and experimental development on natural sciences and engineering (group 72.1); Specialised design activities (group 74.1); Services to buildings and landscape activities (division 81); Other professional, scientific and technical activities n.e.c., Office administrative and support activities, and Business support service activities n.e.c. (74.9; 82.1; 82.9). The questionnaires were sent to around 45,000 enterprises: all the enterprises with more than 10 employees and a sample of the smallest ones (1-9 employees). In order to obtain a sufficient percentage of respondents, six different, very simple questionnaires were designed in which the new activities were described. The enterprises had to choose the activity generating the highest turnover. At the beginning of every questionnaire, a description of the economic activity no longer than 200 bytes was requested. The aim of the surveys was to test the quality of the ATECO 2007 code in particular sectors of economic activity. Another important purpose was to test the ACTR 2007 version with the new descriptions collected through the questionnaires; the last but not least purpose was to enrich the part of the ACTR dictionary linked to the new activities. Around 30% of the enterprises responded. In order to carry out a quality test on these activities, only the questionnaires for which an ATECO code could be attributed by analysing the answers to the survey were considered; these were around 52% of the respondents. For this sub-population the coding rate was 44.5%. This rate is not particularly high, but it can be considered good enough, since it was not easy to attribute codes to new activities. Moreover, it cannot be considered a failure, as the surveys concerned specific sectors for which it was already known that the dictionary had to be enriched.
As a matter of fact, the collected texts were included in the application so as to guarantee better results in terms of both recall and precision rates. For the ACTR 2007 version it was particularly important to collect the texts written by the enterprises regarding new activities, especially in the tertiary sector, which will become more and more important in developed countries. The analysis of the survey answers in search of new texts for ACTR 2007 has not yet been completed; at the moment, around 600 new texts have been found and used to enrich the coding dictionary.

Conclusions

All the analyses described in this paper demonstrate that the ACTR coding application performs well both quantitatively and qualitatively, satisfying the needs of different types of respondents. The new ACTR on WEB application also turned out to be a very useful tool, as confirmed by the number of queries, which is monitored weekly and is constantly high; this can therefore be considered a pilot experience that could be extended to other official classifications. In addition, this tool is a precious source of descriptions, which are systematically analysed by the ATECO experts, who select those to be included in the ACTR database so as to produce even better coding results.

References

De Angelis R., Macchia S. and Mazza L. (2000), Applicazioni sperimentali della codifica automatica: analisi di qualità e confronto con la codifica manuale, ISTAT Quaderni di ricerca - Rivista di Statistica Ufficiale, 1, pp. 29-54.

D'Orazio M. and Macchia S. (2002), A system to monitor the quality of automated coding of textual answers to open questions, Research in Official Statistics (ROS), n. 2/2002, pp. 7-21.

Eurostat (2007), NACE Rev. 2. Introductory Guidelines, division Statistical governance, quality and evaluation.

Eurostat (2006), Regulation (EC) No 1893/2006 of the European Parliament and of the Council of 20 December 2006, Official Journal of the European Union, L 393/1.

Hellerman E. (1982), Overview of the Hellerman I&O Coding System, US Bureau of the Census internal paper, Washington.

Tourigny J. Y. and Moloney J. (1995), The 1991 Canadian Census of Population experience with automated coding, in United Nations Statistical Commission, Statistical Data Editing, 2.

Wenzowski M. J. (1988), ACTR - A Generalised Automated Coding System, Survey Methodology, vol. 14, pp. 299-308.