Getting data into MLwiN

Importing data from Stata, SPSS or Minitab

It is possible to open Stata (*.dta), SPSS (*.sav), SAS transport (*.xpt) or Minitab (*.mtw) file directly, either via the Graphical User Interface or via commands. For documentation of these commands, see section 6 of the MLwiN Manual supplement. In order to open via the GUI, simply

The worksheet should open in MLwiN, and you should see that missing values, descriptions of variables, and category labels for categorical variables have all been transferred across correctly. It is now possible to save the worksheet as an MLwiN worksheet in the usual way.

(Back to top)

Importing data in ASCII format

If your data comes in the form of an ASCII text file, it can be read using the ASCII text file input option from the File menu or via commands. For documentation on how to use the ASCII text file input option, see chapter 8 of the User's Guide, and for documentation on how to use the commands, see pp32-33 of the Command manual; however note the advice in section 4.0 since using these methods may not be the best option.

Data input: pasting data from the clipboard

Data can be pasted into MLwiN via two methods, the first of which to use ctrl + v to paste in data via the Paste View window. The alternative is to go to the Names window and use the Paste button. The data will be pasted into the columns that are currently selected (highlighted) in the Names window (overwriting any data already in the columns), and if fewer columns are selected than are being pasted, after pasting into the selected columns the next free columns in the worksheet will be used.

The Paste button is equivalent to the Paste View window: after selecting it you will still need to follow the parts of the instructions below which describe what to do after pressing Paste in the Paste View window in order to properly deal with missing values etc. There is also a command that performs the same task as the Paste button; for documentation on the command and on the Paste button see section 8.2.4. on p82 of the MLwiN manual supplement.

1.0 An example without missing data

The easiest way to get data into MLwiN is to paste data in from the clipboard. Here we assume the reader has an electronic version of this document. Below is a simple data set:

Download Table 1 data set (xls, 0.1 mb)

Data to be imported should be separated by spaces (or tabs) between the values in each column, and also keep the alignment as this facilitates error checking. MLwiN will read data with a more complex structure than this (through what is known as a formatted read, see Chapter 8 of the User's Guide) but it is much easier to prepare the data in this fashion. However, it is often convenient to copy data from other packages into the clipboard and then paste them into MLwiN. See also 4.0 Commonly Occurring Import Situations. If you have missing data it is a good idea to use the same missing value code for all variables, for example –999 or *.

The following screen will appear.

MLwiN paste window

 

 

The data displayed in the paste window is being held in a temporary buffer it does not get fully imported until the Paste is pressed. First we must

The data has been imported into MLwiN. To see the summaries of variables

The following screen will appear

MLwiN names window after pasting

Here we see summaries of our five variables listed. The length, number of missing values and range (min, max) is displayed for each variable. Notice the type variable which in our original data set had values {terraced, semi, detached} has been converted to codes ranging from 0 to 2. This is because the categories for this variable have been assigned the numbers 0, 1, 2. The category name information has not been lost:

The following screen will appear

Category window after pasting

‌The category names rather than numbers will often be displayed in MLwiN output. However, since MLwiN is basically a numerical analysis program, variables must have numerical values associated with them if they are to be included in any standard analysis. The Set Category Names window can be used to view, change or assign category names.

You can view the data by

The following screen will appear (after resizing the window)

Data window after pasting

‌‌As you can see for categorical variables category names rather than category numbers are displayed.

Well done! You have imported your first data set into MLwiN.

(Back to top)

2.0 Examples with missing data

Often you will have non-numeric symbols, unique numeric codes or even spaces coding for missing data. Slightly different procedures are required depending on how the missingness is coded.

2.1 Missing data represented by non-numeric character strings

First let's look at the case where missing data is coded as one or more non-numeric characters. In the example below we use a data set with * coding for missing values.

Download Table 2 data set missing data coded as * (xls, 0.1 mb)

Close MLwiN and go through the steps in section 1.0, up to the point of displaying the Names window, but this time apply them to table 2.

Names window after pasting with missing

‌Examining this screen you will see that for the numeric data the missing values are correctly input. However for the categorical represented in the input file by non-numeric strings, no missing data is registered, although we have two missing data points in this column. This is because the code “*” is interpreted as another house type category. As we see if we

The following screen will appear

Category window after pasting with missing

‌Close the Set category names window

We need to recode values of 3 in the variable type to be missing. To do this

The following screen will appear

Recode window after paste with missing

‌Fill in the values in range text boxes to read 3 to 3

The Recode Variables screen should now look like this :

Intermediate recode window after pasting with missing

This will record the requested action to be listed in the table on the right of the Recode variables window:

Completed recode window after pasting with missing

If you had selected more input and output columns in the drop down lists then values of 3 would have been replaced by missing in all the listed columns. Once an action in the Recode specification part of the Recode variables window has been added to the Action list the Recode specification is cleared and further recode actions, on different values and columns, can be specified and in turn added to the Action list. In this way a list of recode actions can be compiled in the Action list. These actions are not executed until the Execute button in the Action list is pressed.

In our case we only require the single recode action on the type variable so

If you look at the Names window:

Names window after recoding missing

‌You will see that the type variable now lists the correct number (2) of missing values.

(Back to top)

2.2 Missing data represented by spaces

Provided the data to be imported is tab delimited, that is variable values are separated by tabs, then spaces can be used as missing values. Most software, when it pastes data out to the clipboard employs a tab delimited format.

Copy the following data, with missing coded by spaces, into the clipboard and paste it into MLwiN.

Download table 3 (xls, 0.1mb)

Names window after pasting with spaces

‌You will see that with missing values coded as spaces the software converts missing values in both the numeric variables and the categorical variable type correctly. That is an entry of only spaces in text based category data is turned to missing.

For detection of missing values when importing data the following rules apply

  1. If a variable contains only numeric values all values are treated as non-missing numerical values.
  2. If a variable contains all numeric values except a unique non-numeric code, then the variable is treated as numeric and the unique code is treated as a missing value specifier.
  3. If a variable contains more than one non-numeric code then the variable is treated as categorical and a category set is built for that variable.
  4. If a variable (in a tab-delimited format) contains only spaces then space is taken to be a missing value specifier (for continuous or categorical data).

Point 3 would mean that the following two variables

apple dog
apple cat
orange cat
apple 43
grape cat

would be imported into the category sets

category numbercategory namecategory name
0 apple dog
1 orange cat
2 grape 43

Category number 2 for our second variable has category name '43'. This is because the variable has more than one non-numeric code and all its entries are therefore treated as category names. If you were to view the data via the View or Edit Data option from the Data manipulation menu you would see

Data window after pasting categories.

‌Item 4 in column 2 is category name “43” (with underlying category number 2.)

(Back to top)

2.3 Missing data represented by numeric codes

Copy the data in the file below with –999 coding for missing into the clipboard and paste it into MLwiN

Download table 4 (xls, 0.1 mb)

Now display the data in the five imported columns

Data window after pasting with missing codes.

‌Again we have to use the recode command to change values of –999 to missing.

Recode missing value codes

We have not included the type variable in the recode specification. Can you think why?

Before pressing Execute fill out a Recode Specification for the categorical variable type

Recode category with missing codes.

Type is a categorical variable and the import function set the value of –999 to be a category name for category number 3 for this variable. You could have seen this by selecting type in the Names window and viewing its category list by pressing the Categories button at the top of the Names window. If we look at our data summaries using the Names or View or edit data windows, both available via the Data manipulation menu, we should see that missing values have been correctly handled.

(Back to top)

3.0 Properly importing long identifiers

All numbers in MLwiN are held internally as double precision floating point numbers, which can only represent 15-17 decimal digits of precision. If your identifiers contain values that are longer than this then MLwiN may be not able to distinguish between them resulting in them being treated as the same unit in the analysis. When loading such values MLwiN will offer to recode them to consecutive numbers starting with zero, with the old values used as textual labels.

Paste the following two variables into MLwiN

Download table 5 (xls, 0.1 mb)

When you hit the Paste button, the following screen will appear

Pasting data that is to long.

‌The software has detected a precision problem and you should

Which will as the window states convert the identifying codes in column C1 to categorical data. Column 2 is a predictor variable (child weight) and does not need to be converted to a categorical variable.

If you look at the category codes for hospital id(column c1) you will see

Categories after converting variable that was too long.

Now you have the best of both worlds – the long hospital IDs are preserved in the category name, but a set of manageable numbers have been created for the category numbers.

Such long IDs are often present in data sets. If they are mixtures of letters and numbers then there is no problem as alpha-numeric codes will automatically be turned into categorical variables. However, long numeric id codes should be changed to categorical variables when the warning screen appears during the paste operation.

4.0 Commonly occurring import situations

MLwiN also provides free-format and fixed format import directly from ASCII files but these are a little clunky and can usually be avoided.

  1. If your data is in an ASCII file we suggest that you read it first a spreadsheet such as Microsoft Excel or Libreoffic Calc both of which have excellent ASCII file import facilities. Once the data is in the spreadsheet simply copy it to the clipboard and then paste it into MLwiN.
  2. If your data has more than 64,000 rows you will not be able to read it into Excel versions 2003 or earlier. In this case try and read it into a package you know well that supports copy to clipboard. Excel 2007 or later can read up to 1,048,576 rows.

Now try and import your data into MLwiN!

(Back to top)