Getting data into MLwiN
- Importing data from Stata, SPSS, SAS or Minitab
- Importing data in ASCII format
- Data Input: pasting data from the clipboard
Importing data from Stata, SPSS or Minitab
It is possible to open Stata (*.dta), SPSS (*.sav), SAS transport (*.xpt) or Minitab (*.mtw) file directly, either via the Graphical User Interface or via commands. For documentation of these commands, see section 6 of the MLwiN Manual supplement. In order to open via the GUI, simply
- Select Open worksheet from the File menu
- Browse to find your worksheet, select it and press Open
The worksheet should open in MLwiN, and you should see that missing values, descriptions of variables, and category labels for categorical variables have all been transferred across correctly. It is now possible to save the worksheet as an MLwiN worksheet in the usual way.
Importing data in ASCII format
If your data comes in the form of an ASCII text file, it can be read using the ASCII text file input option from the File menu or via commands. For documentation on how to use the ASCII text file input option, see chapter 8 of the User's Guide, and for documentation on how to use the commands, see pp32-33 of the Command manual; however note the advice in section 4.0 since using these methods may not be the best option.
Data input: pasting data from the clipboard
Data can be pasted into MLwiN via two methods, the first of which to use ctrl + v to paste in data via the Paste View window. The alternative is to go to the Names window and use the Paste button. The data will be pasted into the columns that are currently selected (highlighted) in the Names window (overwriting any data already in the columns), and if fewer columns are selected than are being pasted, after pasting into the selected columns the next free columns in the worksheet will be used.
The Paste button is equivalent to the Paste View window: after selecting it you will still need to follow the parts of the instructions below which describe what to do after pressing Paste in the Paste View window in order to properly deal with missing values etc. There is also a command that performs the same task as the Paste button; for documentation on the command and on the Paste button see section 8.2.4. on p82 of the MLwiN manual supplement.
- 1.0 An example without missing data
- 2.0 Examples with missing data
- 3.0 Properly importing long identifiers
- 4.0 Commonly occurring import situations
1.0 An example without missing data
The easiest way to get data into MLwiN is to paste data in from the clipboard. Here we assume the reader has an electronic version of this document. Below is a simple data set:
Download Table 1 data set (xls, 0.1 mb)
Data to be imported should be separated by spaces (or tabs) between the values in each column, and also keep the alignment as this facilitates error checking. MLwiN will read data with a more complex structure than this (through what is known as a formatted read, see Chapter 8 of the User's Guide) but it is much easier to prepare the data in this fashion. However, it is often convenient to copy data from other packages into the clipboard and then paste them into MLwiN. See also 4.0 Commonly Occurring Import Situations. If you have missing data it is a good idea to use the same missing value code for all variables, for example –999 or *.
- Select the above table and copy it into the clipboard by typing ctrl<c>
- Open MLwiN and type ctrl<v> to paste the clipboard into MLwiN
The following screen will appear.
The data displayed in the paste window is being held in a temporary buffer it does not get fully imported until the Paste is pressed. First we must
- Ensure that Use first row as names is highlighted; this sets the variable names to the entries in the first row.
- Now select the Paste button to transfer the data into MLwiN
The data has been imported into MLwiN. To see the summaries of variables
- Select Names from the Data Manipulation menu
The following screen will appear
Here we see summaries of our five variables listed. The length, number of missing values and range (min, max) is displayed for each variable. Notice the type variable which in our original data set had values {terraced, semi, detached} has been converted to codes ranging from 0 to 2. This is because the categories for this variable have been assigned the numbers 0, 1, 2. The category name information has not been lost:
- Select on type in the Names window and select the View button in the categories section
The following screen will appear
The category names rather than numbers will often be displayed in MLwiN output. However, since MLwiN is basically a numerical analysis program, variables must have numerical values associated with them if they are to be included in any standard analysis. The Set Category Names window can be used to view, change or assign category names.
- Close the Set Category names window
You can view the data by
- Select View or edit data from the Data manipulation menu
The following screen will appear (after resizing the window)
As you can see for categorical variables category names rather than category numbers are displayed.
Well done! You have imported your first data set into MLwiN.
2.0 Examples with missing data
Often you will have non-numeric symbols, unique numeric codes or even spaces coding for missing data. Slightly different procedures are required depending on how the missingness is coded.
2.1 Missing data represented by non-numeric character strings
First let's look at the case where missing data is coded as one or more non-numeric characters. In the example below we use a data set with * coding for missing values.
Download Table 2 data set missing data coded as * (xls, 0.1 mb)
Close MLwiN and go through the steps in section 1.0, up to the point of displaying the Names window, but this time apply them to table 2.
Examining this screen you will see that for the numeric data the missing values are correctly input. However for the categorical represented in the input file by non-numeric strings, no missing data is registered, although we have two missing data points in this column. This is because the code “*” is interpreted as another house type category. As we see if we
- Select on type in the Names window and click the Categories button
The following screen will appear
Close the Set category names window
We need to recode values of 3 in the variable type to be missing. To do this
- Select recode->by range from the Data manipulation menu.
The following screen will appear
Fill in the values in range text boxes to read 3 to 3
- In the to new value text box type m and the box will automatically be filled with the word missing
- In both the Input columns and Output columns drop down lists select the type variable
The Recode Variables screen should now look like this :
- Select the Add to action list button
This will record the requested action to be listed in the table on the right of the Recode variables window:
If you had selected more input and output columns in the drop down lists then values of 3 would have been replaced by missing in all the listed columns. Once an action in the Recode specification part of the Recode variables window has been added to the Action list the Recode specification is cleared and further recode actions, on different values and columns, can be specified and in turn added to the Action list. In this way a list of recode actions can be compiled in the Action list. These actions are not executed until the Execute button in the Action list is pressed.
In our case we only require the single recode action on the type variable so
- Press Execute and close the Recode variables window
If you look at the Names window:
You will see that the type variable now lists the correct number (2) of missing values.
2.2 Missing data represented by spaces
Provided the data to be imported is tab delimited, that is variable values are separated by tabs, then spaces can be used as missing values. Most software, when it pastes data out to the clipboard employs a tab delimited format.
Copy the following data, with missing coded by spaces, into the clipboard and paste it into MLwiN.
Download table 3 (xls, 0.1mb)
You will see that with missing values coded as spaces the software converts missing values in both the numeric variables and the categorical variable type correctly. That is an entry of only spaces in text based category data is turned to missing.
For detection of missing values when importing data the following rules apply
- If a variable contains only numeric values all values are treated as non-missing numerical values.
- If a variable contains all numeric values except a unique non-numeric code, then the variable is treated as numeric and the unique code is treated as a missing value specifier.
- If a variable contains more than one non-numeric code then the variable is treated as categorical and a category set is built for that variable.
- If a variable (in a tab-delimited format) contains only spaces then space is taken to be a missing value specifier (for continuous or categorical data).
Point 3 would mean that the following two variables
apple | dog |
apple | cat |
orange | cat |
apple | 43 |
grape | cat |
would be imported into the category sets
category number | category name | category name |
---|---|---|
0 | apple | dog |
1 | orange | cat |
2 | grape | 43 |
Category number 2 for our second variable has category name '43'. This is because the variable has more than one non-numeric code and all its entries are therefore treated as category names. If you were to view the data via the View or Edit Data option from the Data manipulation menu you would see
Item 4 in column 2 is category name “43” (with underlying category number 2.)
2.3 Missing data represented by numeric codes
Copy the data in the file below with –999 coding for missing into the clipboard and paste it into MLwiN
Download table 4 (xls, 0.1 mb)
Now display the data in the five imported columns
Again we have to use the recode command to change values of –999 to missing.
- Select Recode->by range from the Data Manipulation menu
- Fill out the specification as follows
We have not included the type variable in the recode specification. Can you think why?
- Select Add to action list
Before pressing Execute fill out a Recode Specification for the categorical variable type
- Add this specification to the Action list
- Select the Execute button
Type is a categorical variable and the import function set the value of –999 to be a category name for category number 3 for this variable. You could have seen this by selecting type in the Names window and viewing its category list by pressing the Categories button at the top of the Names window. If we look at our data summaries using the Names or View or edit data windows, both available via the Data manipulation menu, we should see that missing values have been correctly handled.
3.0 Properly importing long identifiers
All numbers in MLwiN are held internally as double precision floating point numbers, which can only represent 15-17 decimal digits of precision. If your identifiers contain values that are longer than this then MLwiN may be not able to distinguish between them resulting in them being treated as the same unit in the analysis. When loading such values MLwiN will offer to recode them to consecutive numbers starting with zero, with the old values used as textual labels.
- Close MLwiN down
- Re-open MLwiN
Paste the following two variables into MLwiN
Download table 5 (xls, 0.1 mb)
When you hit the Paste button, the following screen will appear
The software has detected a precision problem and you should
- Select the NO by C1 to turn it to YES
Which will as the window states convert the identifying codes in column C1 to categorical data. Column 2 is a predictor variable (child weight) and does not need to be converted to a categorical variable.
If you look at the category codes for hospital id(column c1) you will see
Now you have the best of both worlds – the long hospital IDs are preserved in the category name, but a set of manageable numbers have been created for the category numbers.
Such long IDs are often present in data sets. If they are mixtures of letters and numbers then there is no problem as alpha-numeric codes will automatically be turned into categorical variables. However, long numeric id codes should be changed to categorical variables when the warning screen appears during the paste operation.
4.0 Commonly occurring import situations
MLwiN also provides free-format and fixed format import directly from ASCII files but these are a little clunky and can usually be avoided.
- If your data is in an ASCII file we suggest that you read it first a spreadsheet such as Microsoft Excel or Libreoffic Calc both of which have excellent ASCII file import facilities. Once the data is in the spreadsheet simply copy it to the clipboard and then paste it into MLwiN.
- If your data has more than 64,000 rows you will not be able to read it into Excel versions 2003 or earlier. In this case try and read it into a package you know well that supports copy to clipboard. Excel 2007 or later can read up to 1,048,576 rows.
Now try and import your data into MLwiN!