Data Formats

We have developed a data format for MultNet that suits most types of analysis. It supports very complex data structures while keeping them easy to work with, and it handles both very large networks and data sets with many variables describing nodes and links.

Two file types: one for node attributes and one for link attributes. The node file describes only nodes; the link file describes only links.
Three file formats: fixed, free, matrix.

Node files may be either fixed or free format.
Link files may be fixed, free, or matrix format.
If the delimiter in free format files is a comma, you have a comma-separated-values (CSV) file which can be produced by spreadsheet programs like Excel or Quattro.
Both fixed and free format files are rectangular, cases-by-variables collections of data: each case is a row and each variable is a column.
Each case in the node file is associated with an identification number. ID numbers may range from 1 to n, or they may take other values, such as social security numbers or apartment numbers. The first variable for a case should always be the ID number. The name of the ID number variable is "id" or "ID."
In the link file, each link is a case, and each case has two ID numbers. The first identifies the node "sending" or describing the link. The second identifies the node "receiving" the link. The names of the ID number variables are "id1" and "id2." The first two variables for a case should always be the ID numbers.

Fixed format

A fixed format file starts with the keyword "fixed" which appears by itself on the first line of the file.

Following the keyword is the header, which lists the name of each variable and the location it occupies.

The "location" of a variable is the range of character positions it occupies on a line of data, written in parentheses after the variable name, as in "(1-3)" or, for a single character, "(68)".

In fixed format files, c1 is the first character or digit that appears on a line and c2 is the second. If the first four characters on a line are "AB C" (the letter "A", the letter "B", a blank space, the letter "C"), the "A" is c1, the "B" is c2, the space character is c3, and the "C" is c4.

The beginning of a header in a Fixed format file might look like this:

    ID (1-3)
    Age (4-6)
    :

This says that the ID number variable occupies the first three characters of a line of data -- that the first three characters on a line are the ID number. Similarly, the variable Age will be the fourth, fifth, and sixth characters on the line.
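The character-position convention maps directly onto string slicing. The following is a minimal sketch in Python, assuming a location is written as "(first-last)" or "(n)" as in the header examples. The names parse_location and read_field are illustrative only and are not part of the program.

```python
def parse_location(spec):
    """Convert a location such as "(4-6)" or "(68)" to a 0-based slice.

    Header locations count characters from 1 and include both ends,
    so "(4-6)" becomes slice(3, 6).
    """
    inner = spec.strip().strip("()")
    first, _, last = inner.partition("-")
    start = int(first)
    stop = int(last) if last else start  # "(68)" is a single character
    return slice(start - 1, stop)


def read_field(line, spec):
    """Return the characters of `line` that a location refers to."""
    return line[parse_location(spec)]


line = "  7 42"                          # made-up line of data
print(repr(read_field(line, "(1-3)")))   # the ID field
print(repr(read_field(line, "(4-6)")))   # the Age field
```

Because the fields are fixed-width, leading blanks are part of the value until they are stripped or converted to a number.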

To supply labels for nodes (e.g. people's names), insert a line immediately after the ID variable that contains "IDLABEL" followed by the location of the labels:

    ID (1-3)
    IDLABEL (4-12)
    Age (14-16)
    :

If any other variable has value labels, they should appear on the lines immediately following the variable's name, surrounded by lines containing only opening and closing curly braces:

    ID (1-3)
    Age (4-6)
    {
    1 15-20
    2 21-25
    3 26-30
    }
    :

Value labels are optional.
Variables must be listed in the order they appear on a line of data.
Overlapping variables are not permitted. For example, "id (1-3) age (3-5)" is invalid because character 3 would belong to both variables.
The lower character number should be the first listed: "(4-6)" and not "(6-4)".
Variables not listed in the header will not be read by the program.
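The ordering and overlap rules above can be checked mechanically before any data are read. The sketch below is one possible check in Python, not the program's own validation. It assumes each header entry has the form "Name (first-last)" or "Name (n)", that the entries all describe a single line of data, and that value-label lines have already been set aside. The name check_header is hypothetical.

```python
import re

# Matches one header entry such as "Age (4-6)" or "Employment (68)".
ENTRY = re.compile(r"^(?P<name>.+?)\s*\((?P<first>\d+)(?:-(?P<last>\d+))?\)$")


def check_header(entries):
    """Validate fixed-format header entries given in header order.

    Returns a list of (name, first, last) tuples, or raises ValueError
    when an entry breaks one of the rules.
    """
    parsed = []
    previous_last = 0
    for entry in entries:
        match = ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"cannot parse header entry: {entry!r}")
        first = int(match["first"])
        last = int(match["last"]) if match["last"] else first
        if first > last:
            raise ValueError(f"{entry!r}: lower character number must come first")
        if first <= previous_last:
            raise ValueError(f"{entry!r}: overlaps or precedes the previous variable")
        parsed.append((match["name"], first, last))
        previous_last = last
    return parsed


print(check_header(["ID (1-3)", "Age (4-6)", "Employment (68)"]))
```

Tracking only the last character of the previous variable is enough here, because variables must be listed in the order they appear on the line.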

If the variables for a case require more than one line of data, the pseudo-variable "*" indicates that the next variable should be read from the next line of data. The "*" appears in the header at the beginning of the file, not in the actual data. For example:

    ID (1-3)
    Age (4-6)
    :
    Employment (68)
    Residence (70-74)
    *
    Rent/Own (5-6)
    Years (7-8)
    :

Here the last variable on the first line of data for a case is "Residence." The data for the same case continues on the next line of data, where the first variable to be read is "Rent/Own" which appears in characters 5-6. The "*" tells the program to go to the next line of data and to reset the character counter.

The "*" goes in the header at the beginning of the file, not in the part of the file that contains the data.

The end of the header is signified by the words "Begin Data" appearing by themselves on the line immediately after the last line of the header and before the first line of data.

    ID (1-3)
    Age (4-6)
    :
    EdLevel (58-59)
    Degree (61-63)
    *
    JobClass (5-6)
    Location (7-8)
    :
    Position (48-49)
    Division (50-51)
    Branch (52-53)
    Begin Data
    1 47 34 3 2 11 2 3 1 2 3 4 1 1 1 2 22 12 312 21 112 3 1 112
    12 32 11 1 2 23 3112 212 334421 122120
    2 39 23 3 3 12 2 4 2 3 2 1 1 2 3 4 21 23 123 12 112 4 1 131
    23 12 21 2 1 21 2123 122 231322 210311
    :

The example shows parts of the header, including the pseudo-variable "*", which tells the program to read the variables from "ID" through "Degree" from the first line of data for each case and the variables from "JobClass" through "Branch" from the second line of data for each case.

The data for the first two cases are shown after the "Begin Data" command. Note that the "*" goes in the header and not in the data itself.
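The effect of the "*" pseudo-variable can be illustrated with a short reader. This is a sketch under stated assumptions rather than the program's actual code: the header has already been parsed into (name, first, last) tuples with the string "*" between them, every case uses the same number of lines, and the function name read_cases and the sample data are made up for illustration.

```python
def read_cases(header, data_lines):
    """Read fixed-format cases whose variables span several data lines.

    `header` is a list of (name, first, last) tuples in header order, with
    the string "*" marking a move to the next line of data for the case.
    """
    layout = [[]]                      # one list of variables per data line
    for item in header:
        if item == "*":
            layout.append([])          # later variables come from the next line
        else:
            layout[-1].append(item)

    lines_per_case = len(layout)
    cases = []
    for start in range(0, len(data_lines), lines_per_case):
        chunk = data_lines[start:start + lines_per_case]
        case = {}
        for line, variables in zip(chunk, layout):
            for name, first, last in variables:
                # Character positions restart at 1 on every line of data.
                case[name] = line[first - 1:last].strip()
        cases.append(case)
    return cases


header = [("ID", 1, 3), ("Age", 4, 6), "*", ("Rent/Own", 5, 6), ("Years", 7, 8)]
data = [
    "  1 47",
    "     212",
    "  2 39",
    "     105",
]
for case in read_cases(header, data):
    print(case)
```

Grouping the variables by line first keeps the reading loop simple: each "*" just starts a new group, which matches the description of the character counter being reset.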

Free format

The Free format file starts with the keyword "free" which appears by itself on the first line of the file. Following the keyword is the header which lists the names of all of the variables. For example:

ID, Age, Gender, Marital Status, Religion, Languages, EdLevel, Degree, JobClass, Location, Position, Division, Branch

Each variable name is separated from the next by a comma. This allows variable names to contain spaces.

If the entire variable list doesn't fit on one line in the header, multiple lines can be used. Every line except the last one should end with a comma. A comma as the last non-blank character of a line in the variable list indicates that the variable list continues on the next line. For example:

    ID, Age, Gender,
    Marital Status, Religion, Languages,
    EdLevel, Degree, JobClass, Location,
    Position, Division, Branch
    Begin Data

The commas after the words "Gender," "Languages," and "Location" indicate that the variable list hasn't ended and that there are more variable names on the following lines.
The end of the header is signified by the words "begin data" appearing by themselves on the line immediately after the last line of the header and before the first line of data.
The values in the data are separated by spaces (or commas or tab characters). The advantage of using commas or tabs to delimit values is that they allow missing values to be left blank. In space-delimited files, a blank surrounded by spaces is indistinguishable from a single long space.
Missing values should be indicated with a "?" in the place where the value would go if it were not missing.

If the program is trying to read eight variables from a line of data on which there are only five values, the last three variables will receive missing data codes. The program will not read past the end of a line of data; it will not go to the next line of data to get more values. (If the program simply continued to read values until it had as many as the variable list requires, a missing value would cause it to read the value for the (k+1)th variable and incorrectly assign it to the kth variable. This mismatch would continue through the rest of the data in the file, unless an extra value somewhere brought things back into synchronization.)
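The header and line-by-line reading rules for free format can be sketched in Python as follows. This is an illustration under assumptions, not the program's implementation: keywords are matched without regard to case, the header is taken to end at "Begin Data" rather than by checking for trailing commas, value labels in curly braces are not handled, and None stands in for the program's missing data code. The names read_free and split_values are hypothetical.

```python
import re

MISSING = None  # stand-in for the program's missing data code (assumption)


def split_values(line):
    """Split one line of data on commas or tabs if present, else on spaces."""
    if "," in line or "\t" in line:
        return [v.strip() for v in re.split(r"[,\t]", line)]
    return line.split()


def read_free(lines):
    """Return (variable names, cases) from the lines of a free-format file."""
    rows = iter(lines)
    if next(rows).strip().lower() != "free":
        raise ValueError('first line must be the keyword "free"')

    names = []
    for line in rows:
        text = line.strip()
        if text.lower() == "begin data":
            break
        # A trailing comma leaves an empty piece, which is skipped here.
        names.extend(n.strip() for n in text.split(",") if n.strip())

    cases = []
    for line in rows:
        if not line.strip():
            continue
        values = [MISSING if v in ("", "?") else v for v in split_values(line)]
        # Never borrow values from the next line: pad short lines instead.
        values += [MISSING] * (len(names) - len(values))
        cases.append(dict(zip(names, values)))
    return names, cases


sample = ["free", "ID, Age, Gender,", "Marital Status", "Begin Data",
          "1 34 2 1", "2 ? 1", "3,,2,1"]
names, cases = read_free(sample)
for case in cases:
    print(case)
```

Padding each short line on its own is what keeps one missing value from shifting every later value in the file.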

Value labels appear after the name of the variable to which they belong, enclosed in curly braces. The labels may be written either with or without quotation marks:

ID, Age {1=15-20, 2=21-25, 3=26-30},Gender,Marital Status, Religion, Languages, ....

ID, Age {1='15-20', 2='21-25', 3='26-30'},Gender,Marital Status, Religion, Languages, ....
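A variable list with value labels can be taken apart with a regular expression. The sketch below is a simplified illustration in Python: it assumes labels are written as code=label pairs separated by commas, as in the two examples above, and that no label itself contains a comma or a curly brace. The name parse_variables is hypothetical.

```python
import re

# A variable name, optionally followed by {code=label, ...}, then a comma
# or the end of the list.
VARIABLE = re.compile(r"\s*([^,{}]+?)\s*(?:\{([^{}]*)\})?\s*(?:,|$)")


def parse_variables(header):
    """Split a free-format variable list into (name, labels) pairs."""
    result = []
    for match in VARIABLE.finditer(header):
        name, body = match.group(1), match.group(2)
        labels = {}
        if body:
            for pair in body.split(","):
                code, _, label = pair.partition("=")
                # Quotation marks around a label are optional, so drop them.
                labels[code.strip()] = label.strip().strip("'\"")
        result.append((name, labels))
    return result


header = "ID, Age {1='15-20', 2='21-25', 3='26-30'},Gender,Marital Status"
for name, labels in parse_variables(header):
    print(name, labels)
```

Excluding braces from the name pattern is what lets variable names contain spaces while the commas inside the braces are still treated as label separators.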

Matrix format

A matrix format file for link data starts with the word "Matrix" on the first line of the file.

The word "Matrix" should be followed by "xxx rows yyy cols", where "xxx" is the number of rows and "yyy" is the number of columns. (This will allow rectangular matrices to be read, as well as square ones.)

If the first value on each line of data is the row number, the word "cols" should be followed by the word "numbered."

If the matrix is square and symmetrical, it is necessary to supply only the half above or the half below the main diagonal. In the case of the former, the word "upper" should be the last word on the first line; in the case of the latter, the word "lower" should be the last word.

Here are some examples of valid first lines:

For a 25 by 20 matrix, no row numbers:

    Matrix 25 rows 20 cols
    0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0
    1 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1
    :

For a 25 by 20 matrix with row numbers:

    Matrix 25 rows 20 cols numbered
    1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0
    2 1 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1
    :

For a 20 by 20 symmetrical matrix, no row numbers, lower half:

    Matrix 20 rows 20 cols lower
    0
    1 0
    1 1 0
    :

For a 20 by 20 symmetrical matrix with row numbers, lower half:

    Matrix 20 rows 20 cols numbered lower
    1 0
    2 1 0
    3 1 1 0
    :

Each row of the matrix appears on a single line with the values separated from one another by spaces, commas, or tabs.
If spaces are used as delimiters, each missing value must be represented by a "?" (e.g. "...1 ? 0 1 ...")
If commas (or tabs) are used as delimiters, missing values may either be simply omitted (e.g. "..,1,,1,...") or indicated by means of a "?" (e.g. "...,1,?,0,1,...").
The line after the "Matrix ..." line is the first row of the matrix.
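The first-line keywords and the row conventions can be combined into a small reader. The Python sketch below is only an illustration and makes several assumptions that the text does not confirm: keywords are matched without regard to case, row numbers (when present) are simply skipped, half matrices include the main diagonal as in the "lower" examples, and an "upper" half starts each row at the diagonal. The name read_matrix is hypothetical, and None marks a missing value.

```python
import re


def read_matrix(lines):
    """Parse the lines of a Matrix format file into a list of rows."""
    words = lines[0].split()
    if len(words) < 5 or words[0].lower() != "matrix":
        raise ValueError('first line must look like "Matrix xxx rows yyy cols"')
    n_rows, n_cols = int(words[1]), int(words[3])
    options = [w.lower() for w in words[5:]]
    numbered = "numbered" in options
    half = next((w for w in options if w in ("upper", "lower")), None)

    matrix = [[None] * n_cols for _ in range(n_rows)]
    for i, line in enumerate(lines[1:1 + n_rows]):
        if "," in line or "\t" in line:
            values = [v.strip() for v in re.split(r"[,\t]", line)]
        else:
            values = line.split()
        if numbered:
            values = values[1:]          # drop the leading row number
        # A lower half starts at column 0; an upper half at the diagonal.
        offset = i if half == "upper" else 0
        for k, value in enumerate(values):
            j = offset + k
            cell = None if value in ("", "?") else float(value)
            matrix[i][j] = cell
            if half is not None:
                matrix[j][i] = cell      # mirror across the main diagonal
    return matrix


for row in read_matrix(["Matrix 3 rows 3 cols lower", "0", "1 0", "1 1 0"]):
    print(row)
```

Filling the mirrored cell while reading means a half matrix and a full symmetric matrix end up as the same in-memory structure.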

Multiple files

Once a file has been read in by the program, additional files may be read and appended to the one previously read.

For example, you may have data collected at two points in time. The node data may be the same for both time periods, but the link data may have changed. You might start your analysis by reading the node data and the first link data files. You may decide to read the second link data file and append it to the first one. When the program does this, it appends the data in the second link file to the data in the first one in such a way that each line of data in the second file is concatenated to the corresponding line of data in the first file, as shown here:

    first file           +   2nd file    =   result

    1 43112 012512           1 87831         1 43112 012512 87831
    2 33221 312324           2 76861         2 33221 312324 76861
    3 23132 453314           3 77652         3 23132 453314 77652
    :                        :               :

Another file may be read and concatenated to the result shown above.

If there are cases present in the second (or subsequent) file but not in the first file (or vice versa), missing values are supplied to fill in what would be gaps in the concatenated file:

    first file           +   2nd file    =   result

    1 43112 012512           1 87831         1 43112 012512 87831
    2 33221 312324           2 76861         2 33221 312324 76861
    3 34521 452654           ? ?????         3 34521 452654 ?????
    ? ????? ??????           4 77652         4 ????? ?????? 77652
    :                        :               :

This type of concatenation may be performed for node files or for link files.
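The concatenation described above behaves like a join on the ID number that keeps cases from both files. A minimal Python sketch follows, assuming each file has already been read into a dictionary that maps an ID to that case's list of values. The name append_file is hypothetical, and a single "?" is used as the missing value for each variable.

```python
def append_file(first, second, first_width, second_width):
    """Append the variables of `second` to those of `first`, case by case.

    Each file maps an ID to that case's list of values. A case that is
    absent from one file gets "?" for every variable of that file, so all
    rows of the result have the same length.
    """
    result = {}
    for case_id in sorted(set(first) | set(second)):
        left = first.get(case_id, ["?"] * first_width)
        right = second.get(case_id, ["?"] * second_width)
        result[case_id] = left + right
    return result


first = {1: ["43112", "012512"], 2: ["33221", "312324"], 3: ["34521", "452654"]}
second = {1: ["87831"], 2: ["76861"], 4: ["77652"]}
for case_id, values in append_file(first, second, 2, 1).items():
    print(case_id, *values)
```

For link files the same sketch applies if each key is the pair (id1, id2) instead of a single ID number.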

Reading a Matrix format file is equivalent to reading a link file in which each line of data contains three values -- the ID number of the row, the ID number of the column, and the value of the link variable on which the matrix is based. Fixed and free format headers for such a file are shown here:

Fixed format:

    ID1 (1-3)
    ID2 (4-6)
    LinkA (8)
    Begin Data
    :

Free format:

    ID1, ID2, LinkA
    Begin Data
    :
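The equivalence between a matrix and a three-value link file can be made concrete with a short conversion. The Python sketch below assumes that rows and columns are identified by their position, counted from 1, and that every cell (including zeros and missing values) becomes one link. Neither assumption is confirmed by the text, and the function names are hypothetical.

```python
def matrix_to_links(matrix):
    """Return one (id1, id2, value) triple per cell of the matrix.

    Rows and columns are numbered from 1, so the triple (2, 3, v) is the
    link sent by node 2 and received by node 3.
    """
    return [
        (i, j, value)
        for i, row in enumerate(matrix, start=1)
        for j, value in enumerate(row, start=1)
    ]


def as_free_format(matrix, name="LinkA"):
    """Write the matrix as the lines of an equivalent free-format link file."""
    lines = ["free", f"ID1, ID2, {name}", "Begin Data"]
    for id1, id2, value in matrix_to_links(matrix):
        lines.append(f"{id1} {id2} {'?' if value is None else value}")
    return lines


for line in as_free_format([[0, 1], [None, 0]]):
    print(line)
```

Writing missing cells as "?" keeps the output valid for space-delimited data, where a blank could not be told apart from the delimiter.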

When a second Matrix format file is appended to a previous one, the resulting data structure is equivalent to what would be produced by the following fixed or free format header:

Fixed format:

    ID1 (1-3)
    ID2 (4-6)
    LinkA (8)
    LinkB (10)
    Begin Data
    :

Free format:

    ID1, ID2, LinkA, LinkB
    Begin Data
    :

Please give us feedback on this plan. I suggest ....