We have developed a data format for MultNet that suits most types of analysis. It supports very complex data structures while keeping them easy to work with. It handles very large networks as well as data sets with many variables describing both nodes and links.


Data Formats

  1. Two file types: one for node attributes and one for link attributes. The node file describes only nodes; the link file describes only links.

  2. Three file formats: fixed, free, matrix.


Fixed format

A fixed format file starts with the keyword "fixed" which appears by itself on the first line of the file.

Following the keyword is the header which lists the name of each variable and the location it occupies.

"location"

In Fixed format files, c1 is the first character or digit that appears on a line. c2 is the second character on a line. If the first four characters on a line are "AB C" (the letter "A", the letter "B", a blank space, the letter "C"), the "A" is c1, the "B" is c2, the space character is c3, and the "C" is c4.

The beginning of a header in a Fixed format file might look like this:

ID (1-3)
Age (4-6)
:

This says that the ID number variable occupies the first three characters of a line of data -- that the first three characters on a line are the ID number. Similarly, the variable Age will be the fourth, fifth, and sixth characters on the line.
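The location notation maps directly onto string slicing. Here is a minimal sketch in Python, assuming locations are 1-based and inclusive as described above. The helper name `extract` and the sample data line are made up for illustration and are not part of the program.

```python
def extract(line, start, end=None):
    """Return the characters at 1-based, inclusive positions start..end."""
    if end is None:
        end = start  # single-character location such as (68)
    return line[start - 1:end]

# The "AB C" example: c1 = "A", c2 = "B", c3 = " ", c4 = "C"
chars = [extract("AB C", c) for c in range(1, 5)]

# A made-up data line read with the header ID (1-3), Age (4-6)
record = "017 42"
id_value = extract(record, 1, 3)
age_value = extract(record, 4, 6).strip()
```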

To supply labels for nodes (e.g. people's names), insert a line immediately after the ID variable that contains "IDLABEL" followed by the location of the labels:

ID (1-3)
IDLABEL (4-12)
Age (14-16)
:

If any other variable has value labels, they should appear on the lines immediately following the variable's name, surrounded by lines containing only opening and closing curly braces:

ID (1-3)
Age (4-6)
{
1 15-20
2 21-25
3 26-30
}
:
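A reader for such a label block could look roughly like the following Python sketch. It assumes that the first token on each label line is the value and the rest of the line is its label; the function name `read_value_labels` is hypothetical.

```python
def read_value_labels(header_lines, start):
    """Read a brace-delimited label block beginning at header_lines[start].

    Returns (labels, next_index). If no block starts there, labels is empty
    and next_index equals start.
    """
    labels = {}
    i = start
    if i >= len(header_lines) or header_lines[i].strip() != "{":
        return labels, i
    i += 1
    while i < len(header_lines) and header_lines[i].strip() != "}":
        parts = header_lines[i].split(None, 1)  # value, then the label text
        if len(parts) == 2:
            labels[parts[0]] = parts[1].strip()
        i += 1
    if i >= len(header_lines):
        raise ValueError("value label block is missing its closing brace")
    return labels, i + 1  # continue after the closing brace

header = ["ID (1-3)", "Age (4-6)", "{", "1 15-20", "2 21-25", "3 26-30", "}"]
labels, next_index = read_value_labels(header, 2)
```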

If the variables for a case require more than one line of data, the pseudo-variable "*" indicates that the next variable should be read from the next line of data. The "*" appears in the header at the beginning of the file, not in the actual data. For example:

ID (1-3)
Age (4-6)
:
Employment (68)
Residence (70-74)
*
Rent/Own (5-6)
Years (7-8)
:

Here the last variable on the first line of data for a case is "Residence." The data for the same case continues on the next line of data, where the first variable to be read is "Rent/Own" which appears in characters 5-6. The "*" tells the program to go to the next line of data and to reset the character counter.

The "*" goes in the header at the beginning of the file, not in the part of the file that contains the data.

The end of the header is signified by the words "Begin Data" appearing by themselves on the line immediately after the last line of the header and before the first line of data.

ID (1-3)
Age (4-6)
    :
EdLevel (58-59)
Degree (61-63)
*
JobClass (5-6)
Location (7-8)
    :
Position (48-49)
Division (50-51)
Branch (52-53)
Begin Data
1 47 34 3 2 11 2 3 1 2 3 4 1 1 1 2 22 12 312 21 112 3 1
  112 12 32 11 1 2 23 3112 212 334421 122120
2 39 23 3 3 12 2 4 2 3 2 1 1 2 3 4 21 23 123 12 112 4 1
  131 23 12 21 2 1 21 2123 122 231322 210311
    :

The example shows parts of the header, including the pseudo-variable "*", which indicates that the program should read the variables from "ID" through "Degree" from the first line of data for each case and the variables from "JobClass" through "Branch" from the second line of data for each case.

The data for the first two cases are shown after the "Begin Data" command. Note that the "*" goes in the header and not in the data itself.
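To make the role of "*" and "Begin Data" concrete, here is a hedged Python sketch of a Fixed format reader. It is not the program's actual implementation: the function names, the tuple layout, and the short data lines are all assumptions made for illustration, and value label blocks are simply skipped.

```python
import re

# Matches "Name (start-end)" or "Name (col)"; names may contain "/" or spaces.
VAR_RE = re.compile(r"^(.+?)\s*\((\d+)(?:-(\d+))?\)\s*$")

def parse_fixed_header(lines):
    """Return (variables, number_of_header_lines).

    Each variable is (name, line_offset, start, end). line_offset counts the
    "*" entries that precede the variable, i.e. which data line of a case it
    is read from (0 = first line).
    """
    variables = []
    line_offset = 0
    for index, raw in enumerate(lines):
        text = raw.strip()
        if text == "Begin Data":
            return variables, index + 1
        if text == "*":
            line_offset += 1  # next variable comes from the next data line
            continue
        match = VAR_RE.match(text)
        if match:
            start = int(match.group(2))
            end = int(match.group(3)) if match.group(3) else start
            variables.append((match.group(1), line_offset, start, end))
    raise ValueError('header has no "Begin Data" line')

def read_cases(data_lines, variables):
    """Group data lines into cases and slice out each variable."""
    per_case = max(v[1] for v in variables) + 1
    cases = []
    for first in range(0, len(data_lines) - per_case + 1, per_case):
        case = {}
        for name, offset, start, end in variables:
            case[name] = data_lines[first + offset][start - 1:end].strip()
        cases.append(case)
    return cases

# Made-up file: two data lines per case
lines = [
    "fixed",
    "ID (1-3)",
    "Age (4-6)",
    "*",
    "Rent/Own (5-6)",
    "Years (7-8)",
    "Begin Data",
    "  1 47",
    "    1 12",
    "  2 39",
    "    2  3",
]
variables, n_header = parse_fixed_header(lines)
cases = read_cases(lines[n_header:], variables)
```

Because the character counter restarts after each "*", the locations of "Rent/Own" and "Years" are interpreted relative to the start of the second data line of the case.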


Free format

The Free format file starts with the keyword "free" which appears by itself on the first line of the file. Following the keyword is the header which lists the names of all of the variables. For example:

ID, Age, Gender, Marital Status, Religion, Languages, EdLevel, Degree, JobClass, Location, Position, Division, Branch

Each variable name is separated from the next by a comma. This allows variable names to contain spaces.

If the entire variable list doesn't fit on one line in the header, multiple lines can be used. Every line except the last one should end with a comma. A comma as the last non-blank character of a line in the variable list indicates that the variable list continues on the next line. For example:

ID, Age, Gender,
Marital Status, Religion, Languages,
EdLevel, Degree, JobClass, Location,
Position, Division, Branch
Begin Data

If the program is trying to read eight variables from a line of data that contains only five values, the last three variables will receive missing data codes. The program will not read past the end of a line of data; it will not go to the next line of data to get more values. (If the program simply continued to read values until it had as many as the variable list required, a missing value would cause the program to read the value for the (k+1)th variable and incorrectly assign it to the kth variable. This mismatch would continue for the rest of the data in the file, unless an extra value somewhere brought things back into synchronization.)
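The rule of never reading past the end of a line can be sketched in Python as follows. Several details here are assumptions rather than part of the plan: values are taken to be separated by whitespace, `None` stands in for the missing data code, surplus values on a line are ignored, and the name `read_free_line` is hypothetical.

```python
def read_free_line(line, names, missing=None):
    """Assign the values on one data line to the variables in names.

    Only this line is consulted. If it holds fewer values than there are
    variables, the trailing variables receive the missing code.
    """
    values = line.split()
    values = values[:len(names)]  # assumption: ignore surplus values
    values += [missing] * (len(names) - len(values))
    return dict(zip(names, values))

names = ["V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"]
row = read_free_line("3 1 4 1 5", names)  # five values for eight variables
```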

Value labels appear immediately after the name of the variable to which they belong, enclosed in curly braces. The labels themselves may be written with or without quotation marks:

ID, Age {1=15-20, 2=21-25, 3=26-30},Gender,Marital Status, Religion, Languages, ....

ID, Age {1='15-20', 2='21-25', 3='26-30'},Gender,Marital Status, Religion, Languages, ....
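One possible way to parse a Free format variable list, including continuation lines and value labels, is sketched below in Python. The function names are hypothetical, and the sketch assumes that labels do not themselves contain commas or curly braces.

```python
def split_outside_braces(text):
    """Split text on commas that are not inside curly braces."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]

def parse_free_header(lines):
    """Return [(name, labels), ...] for the header lines before "Begin Data"."""
    text = " ".join(line.strip() for line in lines)  # join continuation lines
    variables = []
    for item in split_outside_braces(text):
        name, brace, rest = item.partition("{")
        labels = {}
        if brace:
            for pair in rest.rstrip("} ").split(","):
                if pair.strip():
                    value, _, label = pair.partition("=")
                    # quotation marks around a label are optional
                    labels[value.strip()] = label.strip().strip("'\"")
        variables.append((name.strip(), labels))
    return variables

header = [
    "ID, Age {1='15-20', 2=21-25, 3='26-30'},Gender,",
    "Marital Status, Religion",
]
variables = parse_free_header(header)
```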


Matrix format

A matrix format file for link data starts with the word "Matrix" on the first line of the file.

The word "Matrix" should be followed by "xxx rows yyy cols", where "xxx" is the number of rows and "yyy" is the number of columns. (This will allow rectangular matrices to be read, as well as square ones.)

If the first value on each line of data is the row number, the word "cols" should be followed by the word "numbered".

If the matrix is square and symmetrical, it is necessary to supply only the half above or the half below the main diagonal. In the case of the former, the word "upper" should be the last word on the first line; in the case of the latter, the word "lower" should be the last word.

Here are some examples of valid first lines:

For a 25 by 20 matrix, no row numbers:

Matrix 25 rows 20 cols
0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0
1 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1
   :

For a 25 by 20 matrix with row numbers:

Matrix 25 rows 20 cols numbered
1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0
2 1 0 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1
   :

For a 20 by 20 symmetrical matrix, no row numbers, lower half:

Matrix 20 rows 20 cols lower
0
1 0
1 1 0
  :

For a 20 by 20 symmetrical matrix with row numbers, lower half:

Matrix 20 rows 20 cols numbered lower
1 0
2 1 0
3 1 1 0
  :
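The first-line keywords above could be interpreted along the following lines. This Python sketch is only an illustration: the function names and the returned dictionary are assumptions, and the lower half is taken to include the main diagonal, as in the examples.

```python
def parse_matrix_line(first_line):
    """Parse e.g. "Matrix 20 rows 20 cols numbered lower" into a dict."""
    words = first_line.split()
    if (len(words) < 5 or words[0].lower() != "matrix"
            or words[2] != "rows" or words[4] != "cols"):
        raise ValueError(f"not a valid Matrix first line: {first_line!r}")
    options = words[5:]
    half = options[-1] if options and options[-1] in ("upper", "lower") else None
    return {
        "rows": int(words[1]),
        "cols": int(words[3]),
        "numbered": "numbered" in options,
        "half": half,
    }

def read_lower_half(data_lines, spec):
    """Expand lower-half rows into a full symmetric matrix (list of lists)."""
    n = spec["rows"]
    full = [[0] * n for _ in range(n)]
    for i, line in enumerate(data_lines[:n]):
        values = line.split()
        if spec["numbered"]:
            values = values[1:]  # drop the leading row number
        for j, value in enumerate(values[:i + 1]):
            full[i][j] = full[j][i] = int(value)  # mirror across the diagonal
    return full

spec = parse_matrix_line("Matrix 3 rows 3 cols numbered lower")
full = read_lower_half(["1 0", "2 1 0", "3 1 1 0"], spec)
plain = parse_matrix_line("Matrix 25 rows 20 cols")
```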


Multiple files

Once a file has been read in by the program, additional files may be read and appended to the one previously read.

For example, you may have data collected at two points in time. The node data may be the same for both time periods, but the link data may have changed. You might start your analysis by reading the node data and the first link data files. You may decide to read the second link data file and append it to the first one. When the program does this, it appends the data in the second link file to the data in the first one in such a way that each line of data in the second file is concatenated to the corresponding line of data in the first file, as shown here:

first file           2nd file      result
1 43112 012512       1 87831       1 43112 012512 87831
2 33221 312324   +   2 76861   =   2 33221 312324 76861
3 23132 453314       3 77652       3 23132 453314 77652
      :                  :               :

Another file may be read and concatenated to the result shown above.

If there are cases present in the second (or subsequent) file but not in the first file (or vice versa), missing values are supplied to fill in what would be gaps in the concatenated file:

first file           2nd file      result
1 43112 012512       1 87831       1 43112 012512 87831
2 33221 312324   +   2 76861   =   2 33221 312324 76861
3 34521 452654       ? ?????       3 34521 452654 ?????
? ????? ??????       4 77652       4 ????? ?????? 77652
      :                  :               :

This type of concatenation may be performed for node files or for link files.
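The case-by-case concatenation, including the filling of gaps, can be sketched in Python as below. The sketch makes several assumptions: cases are matched on their ID, each file is already held as a mapping from ID to a list of values, and "?" stands in for whatever missing data code the program uses. The name `concatenate` is hypothetical.

```python
def concatenate(first, second, first_width, second_width, missing="?"):
    """Join two files case by case.

    first and second map a case ID to that case's list of values;
    first_width and second_width are the number of variables in each file.
    A case absent from one file is padded with the missing code.
    """
    result = {}
    for case_id in sorted(set(first) | set(second)):
        left = first.get(case_id, [missing] * first_width)
        right = second.get(case_id, [missing] * second_width)
        result[case_id] = left + right
    return result

# Case 3 is absent from the second file, case 4 from the first.
first = {1: ["43112", "012512"], 2: ["33221", "312324"], 3: ["34521", "452654"]}
second = {1: ["87831"], 2: ["76861"], 4: ["77652"]}
merged = concatenate(first, second, 2, 1)
```

Since the result has the same shape as its inputs, it can be passed back in as `first` when a further file is appended.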

Reading a Matrix format file is equivalent to reading a link file in which each line of data contains three values -- the ID number of the row, the ID number of the column, and the value of the link variable on which the matrix is based. Fixed and free format headers for such a file are shown here:

ID1 (1-3)            ID1, ID2, LinkA
ID2 (4-6)            Begin Data
LinkA (8)              :
Begin Data             : 
  :

When a second Matrix format file is appended to a previous one, the resulting data structure is equivalent to what would be produced by the following fixed and free format headers:

ID1 (1-3)            ID1, ID2, LinkA, LinkB
ID2 (4-6)            Begin Data
LinkA (8)              :
LinkB (10)             :
Begin Data   
  :
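The equivalence between a matrix and a three-column link file can be illustrated with a short Python sketch. It assumes that every cell, including zeros, becomes one link record and that row and column IDs are simply 1-based positions; neither point is settled by the plan above. The function names are hypothetical.

```python
def matrix_to_links(matrix):
    """Yield (ID1, ID2, value) for every cell, using 1-based row/column IDs."""
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            yield (i, j, value)

def append_matrix(links, matrix, missing="?"):
    """Append a second matrix's values as an extra link variable."""
    lookup = {(i, j): v for i, j, v in matrix_to_links(matrix)}
    return [row + (lookup.get((row[0], row[1]), missing),) for row in links]

link_a = [[0, 1], [1, 0]]   # made-up 2 by 2 matrices
link_b = [[0, 3], [2, 0]]
rows = list(matrix_to_links(link_a))   # columns: ID1, ID2, LinkA
both = append_matrix(rows, link_b)     # columns: ID1, ID2, LinkA, LinkB
```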


Please give us feedback on this plan. I suggest ....