CANSIM DATA RETRIEVAL PROGRAM (UNIX Version) Research Data Library Simon Fraser University GENERAL DESCRIPTION The purpose of this program is to provide a convenient interface between the user and the CANSIM University Base. CANSIM is an acronym for CANadian Socio-economic Information Management database. RUNNING THE PROGRAM To start up the Cansim retrieval program from within unix, enter the command: rdl cansim If you get the message "rdl: Command not found", then the directory /usr/local/bin is probably not part of your search path. For details on how to resolve this, contact the computing services duty consultant. As an interim measure, use the full command: /usr/local/bin/rdl cansim If you are still having problems contact the RDL. GENERAL NOTES Input to the CANSIM program may be typed in any mixture of upper or lowercase; there is no distinction in meaning. All command input (in response to the "Cansim>" prompt) must be followed by a carriage return. Long command lines can be continued onto additional lines by ending the line with a hyphen. You can use the BREAK key to halt output. The program uses -99 for missing data. OUTPUT PRODUCED The program can produce four kinds of output: Table output, data output, header output, and message output. ù Table output contains headings and is more appropriate for reading directly on the screen or from a printed page. This output is suitable for appending to a paper or report as is. ù Data output is just the raw data numbers and is more useful as input to other programs. ù Header output contains the heading information for a series as supplied by CANSIM. This output contains matrix and series information and may also be used for inclusion with papers or reports. ù Message output consists of informative messages produced by the CANSIM program. The destination of these four kinds of output is controlled through the TF=, DF=, HF=, and MF= options (see OPTIONS for further explanation). COMMANDS The commands are as follows (abbreviations in parentheses): HELP (H) GET (G) SET (S) STOP Lines beginning with a "*" or a "#", are ignored, and can be used to comment sessions or command input files. Lines beginning with "!" or a "$" are interpreted as operating system commands. Help Online help is available with the HELP command. If HELP is followed by a topic, then help on that topic is provided. HELP with no other input gives general help. HELP TOPICS gives a list of topics available and possible aliases. Most topics have additional pages. For each such page you will be prompted as to whether or not you want the next page. Enter a carriage return (hit the ENTER key) to get the next page. To skip the next page enter the letter N. This skips the section, but continues on with the current topic, and you may be prompted for other sections. To leave HELP altogether, use the letter E. Get The GET command is used to retrieve the data from the database. It is followed by one or more series identifiers and any options specific to that request. The series identifiers must be obtained from the CANSIM codebooks available from the Research Data Library, located on the fourth floor of the main library. If more than one series is requested on a line, they are supplied as a combined table, or sequentially depending on the setting of the column option. Other options available control the files used for output, the date range of the series requested, whether or not the date is included with the output (and what format is used), and frequency conversions. By default, the program writes a series heading, a table of data, and any messages to the screen. To GET a file of raw data, the data file must be set. An example of the GET command, with the DF option to put the data into to the file called DATA, is: GET B100 DF=DATA TF=OFF (See the section on OPTIONS for a more complete discussion of options.) Set The SET command is used to specify the default value of an option. The default value of an option is used by a command (like GET) whenever one has not been specified as part of that command. Thus the SET command can be used to specify an option which is to apply to subsequent commands. The default value of an option is in effect until it is changed by a subsequent SET command. Any number of option specifications can be included after SET. SET, with no other input, displays the current settings. The following example turns off the header file output and restricts output to the 1980's. SET HF=OFF F=1980 L=1989 These settings will apply to subsequent commands unless HF= or a date range is used as part of one of those commands. To reset options to their original values, the special designation DEFAULT may be used. To undo the effects of the above example, the following may be used: SET HF=DEFAULT F=DEFAULT L=DEFAULT (Since OFF is the default for F and L, it could also be used in place of DEFAULT.) !system_command or $system_command Any command preceded by a "!" or a "$" is interpreted as a command to the operating system. Note that this command is executed in a different environment than your normal login, and thus some commands valid there may not work, or may work differently. *comment or #comment Any line starting with a "*" or a "#" is ignored by the CANSIM program. Such lines can be used to annotate CANSIM input files. Stop This command terminates the program. OPTIONS To use an option, the option is followed by and equal sign and then the value desired. An option may be set as part of the SET or GET command. In the former case the option is in effect for all subsequent GET commands, unless it is overridden by the option occurring as part of a GET command. Adding an option setting as part of a GET command applies to that command only. For example, in the following, the data for B1 and B3 go into the file DATA.ONE, but the data for B2 goes into DATA.TWO: SET DF=DATA.ONE GET B1 GET B2 DF=DATA.TWO GET B3 (NOTE: The data for B1 and B3 occur one after the other in DATA.ONE, with no separator, which is inappropriate for many applications.) The options available are (abbreviations in parentheses): file options: DF, TF, HF, MF date range options: FD, LD, FY (F), LY (L) conversion options: PERIOD (P), MQ, MA, QM, QA, AM, AQ style options: COLUMNS (C), DATE, DATA_SEPARATOR (DS) Specifying Output Files (file options) The options DF, TF, HF, and MF are used to control the output destination for data output, table output, header output, and message output, respectively. Table output is a data format suitable for printing, or viewing on the screen. Data output, which is only raw data, is suitable for input into statistical packages such as SHAZAM. Header output is the information supplied by CANSIM on the series. This information is a repeat of that contained in the MATRIX section of the CANSIM University Base Directory. (For the most part it is best to suppress this output using SET HF=OFF.) Message output consists of various informative messages produced by the CANSIM program. Error messages are also produced, but they are not controlled by an option. When the program is started, the table, header, and message output files are set to the standard output, (which is the screen when the program is run using the Microlab menu). No data output destination is set by default. DF, TF, MF, and HF can be set to any valid file. For example: GET B100 TF=B100.TAB DF=B100.DAT HF=B100.HEA In addition to file names, the following designations may be used: SCREEN forces output to go to the computer screen OFF suppresses output ON, DEFAULT the destination at program startup Specifying Date Ranges (date range options) The date range options are used to restrict the data supplied to the given date range. (All CANSIM data is time-series data.) There are two formats available. FY and LY (for "first year" and "last year") accept years either as a four-digit year, or as the last two digits. F and L are abbreviations for FY and LY. For example: GET B100 FY=1980 LY=1989 (This retrieves data for the 1980's - ten years' worth.) FD and LD (for "first date" and "last date") accept fully specified dates as 6 digit integers as in 890405 for April 5, 1989, or fully specified months as in 8904 for April, 1989, or the last two digits of the year like FY and LY. Note that FD and LD will interpret 1985 as the eighty fifth month of 1919, and thus such use will not be accepted. For example: GET B100 FD=8906 LD=8912 (This retrieves data for the second half of 1980 - six months' worth.) Data Conversion (period options) CANSIM data may be annual, quarterly, monthly, and weekly. In order to use data of different periods together, it is normally necessary to convert some of them so that all are the same period. This conversion must be done with some CARE. To help remind you that some thought needs to be applied to this problem, the program will not convert data without being told to do so, AND being told what method to use. The program cannot convert to or from weekly data. All other possible conversions may be done. To specify a desired period, use P= followed immediately with M, Q, or A, for monthly, quarterly, and annually respectively. Thus to specify conversion to quarterly of all subsequent data that is retrieved, you can use: SET P=Q To specify a conversion method from one period to another use one of the following: MQ= (for monthly to quarterly) MA= (for monthly to annual) QM= (for quarterly to monthly) QA= (for quarterly to annual) AM= (for annual to monthly) AQ= (for annual to quarterly) followed immediately with the method to be used. For MQ, MA, and QA a conversion to lower frequency (larger period) is required. For QM, AQ, and AM a conversion to higher frequency (smaller period) is required. Converting Data from a Higher to Lower Frequency To convert data to lower frequency the following choices are available: AVE (conversion by averaging) SUM (conversion by summation) a number (selects that period) Conversion by averaging is the most common method, and is appropriate for data like population size where the frequency does not affect the data, other than specifying when it was sampled. Conversion by summation may be appropriate when the data is clearly cumulative, as in converting absolute monthly profits to absolute quarterly profits. If you specify conversion method = integer, then that period is used. For example, MA=4, specifies that the annual data is the data for the fourth month (April). Example: (Get the data for B1, which is monthly, as quarterly data, by averaging.) GET B1 P=Q MQ=AVE Converting Data from a Lower to Higher Frequency NOTE: Converting data from a lower to a higher frequency requires care. You are asking the program to APPROXIMATE, based on the data points available, what the value was in between these data points. Thus doing so is safe only if the data is known to vary smoothly. To convert data to higher frequency the following choices are available: RCS (approximation by "regular" cubic spline fit) NCS (approximation by "normalized" cubic spline fit) Both these methods use cubic spline approximation to fit the data. In the case of NCS, the result is normalized by dividing the result by the ratio of the two periods. The use of RCS as opposed to NCS is analogous to AVE compared to SUM. To fit population data, RCS is fine. To fit absolute monthly earnings from absolute quarterly earnings, the fit must be normalized, as the earnings for a month will be roughly a third of that for a quarter. Example: (Get the data for B107 which is quarterly as monthly data using RCS) GET B107 P=M QM=RCS Specifying Style of Output (format style options) By default, the program makes a table of all the series requested on a line, placing consecutive series across the page. For this to work, all series on the line must be of the same frequency, either as provided, or after conversion. The format option that corresponds to this default is C=S (or COLUMNS=SERIES). If more than one series with different available time periods are requested, then the largest time period needed to accommodate all the series is used, and the missing value indicator -99 is entered where any of the series do not fill the time range. A second format available produces output for each series in sequence. This option is specified as C= followed immediately by the number of columns to be used for output. Thus for quarterly data, C=4 will use one line for each year. C=1 will produce a single column of data. (An optional date is not counted as one of the columns.) Specifying the Date Format or Excluding the Date Altogether By default, the program supplies the date as one of the columns in any series output. To specify that you do not want the date to be included, use DATE=OFF. Example: GET D14 DATE=OFF The default date format is chosen to correspond with common usage. For annual data, this is the year as a four-digit integer. For monthly and quarterly data, it is two digits for the year followed by two digits for the month (last month in the quarter in the case of quarterly data). For weekly data, there is an additional two digits for the date. In the case of table file output, date fields are separate by a "/". Thus 88/07 may be found as a date in the table file output for monthly data, and 890927 might be found as data file output for weekly data. If you have SET date to OFF, or YEAR.PERIOD (see below), this format may be reset by using DATE=DEFAULT, DATE=ON, DATE=NORMAL, or DATE=YY/MM/DD. It is common in statistical economics to use the year followed by a ".", followed by a period number, to designate dates. The interpretation of this date format is dependent on the data frequency. For example, 1978.3 can designate the third month, the third quarter, or the third week of 1978, depending on whether the data is monthly, quarterly, or weekly. To specify this date format, you can use DATE=YEAR.PERIOD or DATE=Y.P. Specifying that Raw Data is to be Separated by Commas Normally, raw data (data file output) is separated by blanks. However, some applications need the data to be separated by commas. The CANSIM program will provide data in this format if DATA_SEPARATOR=COMMA is used. DS can be used in place of DATA_SEPARATOR, and an actual comma can be used in place of COMMA. Example: GET D14 DF=A:DATA DS=, DS=BLANK or DS=DEFAULT specifies that data is to be separated by blanks. ADVANCED TOPIC: PATTERNS A pattern may be used to specify a group of series with similar series identifiers. A pattern is a series identifier with one or more question marks (?'s) and/or asterisks (*'s). Each question mark matches exactly one character and each asterisk matches zero or more arbitrary characters. For example: D1? Matches D10 through D19 B100* Matches all series identifiers starting with a B100 Since a pattern can potentially match many series, one must be aware of the format style being used. The default format style is to put multiple series across the page, which is unwieldy for a large number of series, as well as having a limit to the number of series allowed. An additional consideration when using the default style is that all series must have the same frequency (or be converted using the appropriate conversion options. Thus the user may wish to use the C=1 (columns=1) format style to experiment with this feature, so that conversion is not needed. To list the series identifiers matched by a particular pattern the following is recommended: SET HF=OFF TF=OFF DF=OFF C=1 With these settings in place, a GET command will list the short header of each series matched, but no data will be retrieved. ADVANCED TOPICS: RDL RETRIEVAL PROGRAMS AND UNIX. Command line arguments as an aid to writing script commands. You may notice that the above command starts up cansim through the program rdl, which is a front end for all the RDL retrieval programs. An alternate method of starting up cansim is to simply enter rdl, and then the choice cansim. When cansim is started up with "rdl cansim", then the argument cansim is simply the initial input for the the program rdl. Arguments after the program name serve as initial input to the program. In addition, if a command ends in a period, then the program is terminated after that command. Thus, the following one line command retrieves data for series B10 into file data without further interaction. rdl cansim get b10 df=data tf=off hf=off mf=off. As an example of the using the initial input feature to tailor your own data retrieval commands, consider the following: alias CAN 'rdl cansim get \!:1 f=80 l=89 df=cansim_data/\!:1 hf=off tf=off mf=off.' With this alias in place, entering CAN b7 will retrieve the data for b7 (restricted to the 1980's) into the file cansim_data/b7. All other forms of output are ignored. Similary, get d13494 will retrieve data into cansim_data/d13494. (Note: The CAN alias above assumes the directory cansim_data exists in the working directory.) CITING DATA OBTAINED FROM THE CANSIM UNIVERSITY BASE Why should you cite your data sources? Increasingly, use is made of computer assisted methods in academic research, including primary or secondary statistical analysis of files of machine-readable data (mrdf). This has in turn lead to an increase in the number of publications based on statistical analysis of mrdf. Such files should, if used as source materials, or references, in a published work, be cited in the bibliography as would any printed monograph or periodical article under the same circumstances. Here are two examples of formats which should be used to cite CANSIM machine-readable data files as sources in bibliographies and footnotes in a paper or publication. a) Footnote Example Statistics Canada requires for its CANSIM University Base the following in a footnote: These data originate from the CANSIM University Base. "CANSIM" is an Official Trademark of Statistics Canada. Matrix 922. Series D1009, D49, B1. b) Bibliography Example In a bibliography the CANSIM University Base should be cited as follows: CANSIM University Base (machine-readable data file). 1946 - Present. SFU/RDL ed., Ottawa, Statistics Canada. If in doubt as to which of these formats should be used, please contact the Research Data Library. ADDITIONAL INFORMATION? If you need any other information or assistance, please contact us in the Research Data Library, Room 4020, WAC Bennett Library, telephone 291-4349. We are open 9:00 a.m. to 4:30 p.m., Monday through Friday. For general information on the Research Data Library: new acquisitions and updates, services and other data management developments, you may want to subscribe to the SFU-RDL E-mail group. Membership forms to join E-mail groups are available from Academic Computing Services. WE WANT YOUR SUGGESTIONS We want to hear any comments or suggestions that you may have. It is difficult for us to provide programs that help you do your work, unless you let us know what is useful to you. You can visit us during office hours Monday through Friday, on the fourth floor of the SFU library. (Our phone number is 291-4349.) Alternately, you can contact either Walter Piovesan or Kobus Barnard at the following electronic mail addresses: walter@sfu.ca kobus@sfu.ca