Finalize collection (data entry)
This sub-process includes loading the collected data and metadata into a suitable electronic environment for further processing in phase 5 (Process). It may include automatic data take-on, for example using optical character recognition tools to extract data from paper questionnaires, or converting the formats of data files received from other organizations. In cases where there is a physical data collection instrument, such as a paper questionnaire, which is not needed for further processing, this sub-process manages the archiving of that material in conformance with the principles established in phase 8 (Archive).
Unless information is captured electronically at the time of the interview, the information on the questionnaire must be promptly and correctly keyed into an automated system. This data entry system can be built on various technologies. The processes can be classified as traditional or non-traditional, though with changing technologies this distinction is constantly evolving.
Traditional data entry is generally done on individual personal computers, and the data is processed in discrete groups based on clusters of households. Each cluster is assigned to one data entry clerk. In developing countries these technologies were initially introduced through DOS-based software packages called IMPS and ISSA. The current free software that has in large part replaced IMPS and ISSA is called CSPro and is a combination of the previous packages. DHS and MICS surveys have developed common principles for data entry on stand-alone PCs. These common principles could be classified as best practices in survey processing and include the following: physical organization of the data entry area (in either a centralized or decentralized data entry operation); development of a robust control system; system controlled applications with some in-line edits and checks; specified value sets in a data dictionary; pre-programmed and standard keys used to define missing values, inconsistent values or "other" values; double data entry or verification of data.
Practical suggestions
The source of this paragraph is A guide for data management of household surveys by Juan Muñoz, in Household Sample Surveys in Developing and Transition Countries, United Nations, 2005 (chapter XV).
Data entry in the field
Consideration may be given to integrating data entry into field operations. Under this strategy, data entry and consistency controls are applied on a household-by-household basis as part of field operations, so that errors and inconsistencies can be resolved, where necessary, by revisiting the households. The most important and direct benefit of integration is that it significantly improves the quality of the information, because errors and inconsistencies can be corrected while the interviewers are still in the field. It can also generate databases that are ready for tabulation and analysis in a more timely fashion.
Another, indirect advantage of integration is that it fosters the application of uniform criteria by all the interviewers and throughout the whole period of data collection, which is hard to achieve in practice with pre-integration methods. External requirements, such as the need to ensure a permanent power supply for the computers, need to be carefully considered by the survey planners and managers.
Organization for centralized systems
A well organized and comfortable data entry area is required for a centralized data entry system. The computers are usually arranged in groups, with each group of data entry clerks managed by a supervisor. The computers are generally networked, though allowances should be made for a network failure. Managing and organizing a data entry operation requires establishing a specified work schedule. Monitoring the productivity of the individual data entry clerks should be part of a data entry system as well. Like any other process, the data entry process requires good organizational and project management skills.
Control systems
The key to assuring a successful data entry operation is the development of a robust control system to: provide a shell or tracking system for data entry supervisors to manage their data entry clerks; determine where in the process a particular cluster is; manage data files; back up data files; issue regular status reports that inform management of the progress of the data entry operation. A good control system will always use a reference file that reconciles a particular cluster with the original sample design, and will not allow the entry of mistaken clusters or duplicate clusters or households.
System controlled applications
System controlled applications are data entry applications that will not allow individual data entry clerks to advance through the path of the interview at will.
This technique, used primarily in sample surveys, assures a high level of control: it requires input from the data entry clerk and restricts how the clerk advances through the fields. The path must be programmed via various edits and skips by a skilled computer programmer. A distinction can be made between two data entry techniques, called "heads down" keying and "heads up" keying. A "heads down" approach means that the data entry clerk does not take their eyes off the survey or census questionnaire. This technique, most often used in censuses, does not use many in-line edits. A "heads up" approach, more common for surveys, shifts the focus of the data entry clerk to the computer screen rather than the form. This type of keying usually communicates problems to the data entry clerk via screen messages and informs the data entry clerk and supervisor of inconsistencies in the data being keyed.
Value sets and special keys
Valid values should be defined in the data dictionary for certain responses. Entry of data is restricted to these values alone. Furthermore, special keys for missing data should be included in the value set and may use a standard identifying digit. For example, 9 or 99 or 999 could be used to identify missing values. Open-ended questions or the selection of the broad category of "other" can also be programmed to allow for entering the written response. Keying open-ended questions in this way will require manual coding of the responses at some future date.
Verification by double entry
Verification is the process of entering the same questionnaire twice and comparing the responses. Differences in keyed data of the same questionnaire need to be reconciled. A system of verification can virtually assure that the information presented in the questionnaire is faithfully keyed. Verification can be dependent or independent. Dependent verification uses one data file and reconciles any identified error with the original data file.
Independent verification is the process of keying two fully independent data files of the same questionnaire or cluster and comparing the two files. A report of inconsistencies is issued and the differences between the two data files must be fully reconciled. The DHS and MICS use independent double data entry.
Other issues
Other issues may present themselves during data entry. These include the manual correction of inconsistent data on a questionnaire using colored pens (as determined by a survey statistician); the regular backing up of data files; and the consolidation of individual clusters into larger groups, eventually to be output and used by analysts and statisticians in a format such as SPSS, SAS or Stata.
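As a minimal sketch of the comparison step in independent verification, assuming each keyed file has already been loaded as a mapping from (cluster, household) to field values, the inconsistency report could be produced along the following lines in Python. The record layout, the field names and the function compare_entries are hypothetical illustrations and are not taken from CSPro or from the DHS or MICS systems.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (cluster, household) -> {field name: keyed value}
Key = Tuple[int, int]
Keyed = Dict[Key, Dict[str, str]]


def compare_entries(first: Keyed, second: Keyed) -> List[Tuple[Key, str, str, str]]:
    """Return one (key, field, first value, second value) row per discrepancy."""
    report = []
    for key in sorted(set(first) | set(second)):
        rec1 = first.get(key)
        rec2 = second.get(key)
        if rec1 is None or rec2 is None:
            # The household was keyed in only one of the two files.
            report.append((
                key,
                "<record>",
                "present" if rec1 is not None else "missing",
                "present" if rec2 is not None else "missing",
            ))
            continue
        for field in sorted(set(rec1) | set(rec2)):
            v1 = rec1.get(field, "")
            v2 = rec2.get(field, "")
            if v1 != v2:
                report.append((key, field, v1, v2))
    return report


# Two independent keyings of the same (made-up) cluster 12.
first_pass = {
    (12, 1): {"age": "34", "sex": "1"},
    (12, 2): {"age": "27", "sex": "2"},
}
second_pass = {
    (12, 1): {"age": "43", "sex": "1"},  # digits transposed in "age"
    (12, 2): {"age": "27", "sex": "2"},
    (12, 3): {"age": "99", "sex": "1"},  # household absent from the first file
}

discrepancies = compare_entries(first_pass, second_pass)
for row in discrepancies:
    print(row)
```

Each row of such a report would be checked against the paper questionnaire and the file in error corrected, with the comparison repeated until no discrepancies remain.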
Technologies such as scanning are increasingly replacing traditional manual data entry systems and are themselves becoming a new standard. The term non-traditional is used here only as a way to distinguish them, since their applicability and appropriateness should be examined more carefully because of issues such as cost, training, maintenance, logistics and social impact. Non-traditional technologies include Computer Assisted Personal (or Telephone) Interviews, otherwise known as CAPI and CATI; scanning of questionnaires using Optical Character Recognition technologies; and palmtop computers or SMS-type cellular technology. Although specialized software is available for implementing these various technologies, editing and analysis of data still require the use of analytical software such as SPSS, Stata or SAS.
Scanning
A scanner reads and interprets a questionnaire based on the reflection of light from a specially designed questionnaire. Light receptors detect the darkened characters on the questionnaire and digitize the response in binary form. The quality of the scanner hardware depends on the quality of the imaging components, the scanning resolution and the bit depth. Scanning uses technologies such as OMR (Optical Mark Recognition), OCR (Optical Character Recognition) and ICR (Intelligent Character Recognition). OCR and ICR technologies can recognize and interpret written characters and numbers, whereas OMR simply interprets a series of filled-in marks on a page (usually one of a series of ovals that is filled in by pencil). Scanning technologies are highly dependent on the physical quality of the questionnaire and the manner in which the characters are either filled in or written. Issues such as bit depth, the speed of scanning a questionnaire, environmental ranges (ideal temperature) for operating the equipment, and reliable and steady electrical power matter more with this method than with traditional keying.
A survey with a well established track record of using scanning technology is the Core Welfare Indicators Questionnaire (CWIQ). This survey uses a combination of OMR and OCR technology and special software to process the documents.
Some country experiences
The use of scanning technology requires careful evaluation and consideration. The technology has been applied in various areas with frequently mixed and often less than desirable results. The documents below describe specific experiences that different countries have had in applying scanning technology.
Philippines: Latest Innovations in Methods and Tools for Census Data: Technological Lessons from the 2000 Round of the Philippine Census, by Carmelita N. Ericta, Deputy Administrator, and Elpidio C. Nogales Jr., Project Leader, Data Capture Center - Manila, of the National Statistics Office, Philippines.
Thailand: Implication of ICR for the 2010 Population and Housing Census: Thai Experience, by Sue Lo-Utai, Secretary General, National Statistical Office, Thailand.
Kenya: Data Processing, Storage and Dissemination: Kenya's Experience, by David S. O. Nalo, Director of Statistics.
Egypt: Workshop on 2006 Population and Housing Census in Egypt, Cairo, 18-20 April 2005 (United Nations, Economic and Social Commission for Western Asia).
CAPI & CATI
CAPI stands for Computer Assisted Personal Interview. CAPI data entry systems are designed for laptop computers and are used by the enumerators on location. There is no hard copy of the survey form (paperless interviews), as the interview is conducted via the screen of the laptop. The literal question text is provided to the enumerator and the responses are keyed directly into the laptop. CATI stands for Computer Assisted Telephone Interview. CATI systems are used when the enumerator conducts an interview by telephone and keys the responses into a computer.
PDAs (Personal Digital Assistants)
These interviews are similar to CAPI interviews but take advantage of the smaller Personal Digital Assistants to carry out the survey. A palmtop is less intrusive in the household than a conventional laptop and uses less power. Survey View is one of the various software packages available for running CAPI-type applications on PDAs.
SMS
SMS uses cellular phone technology to send responses keyed on the phone's touch pad to a central database.