MCHB/EPI Miami Training — December 5 - 6, 2005
Use of SaTScan — Transcript
RUSSELL KIRBY: Yeah, the STS 6.0 icon, we need to install that on each of the computers.
The next section is software that was developed by a biostatistician named Martin Kulldorff. He's from Sweden, but he was at the National Cancer Institute in the biometry section and has more recently taken a position at the University of Connecticut Health Center. I'm not sure exactly where; I think he's in family and community medicine or something like that. But anyway, he's in Connecticut, and it's a program that was developed to enable people to quickly do space-time cluster analysis on their data.
It's a program that, again, is available at no cost. We simply copied the download file and brought it here to put on all your computers. But when you get home, if you just put SaTScan into Google, it will take you directly to the site that you can download it from. And it's probably a good idea to do it that way, because then you'll be registered with Dr. Kulldorff, so that you hear about updates as they come out.
The version we'll be using is brand new. It was released in the middle of October. In fact, as we were planning the session, one week we had a call about the session, and the next week I said, you know, there's a new version of SaTScan, we'd better use that. So it's fairly new. I'm not actually able to tell you all the new features compared to the old features. What I've done here is, after you get the software loaded on your computer, you'll need to go into All Programs and find it. I've clicked on SaTScan and you can see the different things that come up. If you're going to use it a lot, you'll probably want to save the icon to the desktop so you can get to it. I also wanted to point out that it has a very detailed user's guide, and that's probably something that's worth having available to refer to as well.
So anyway, we're going to start looking at SaTScan. Before we do that, I just wanted to mention that the whole science behind identifying whether there's clustering of health events is actually a very old science. It dates back hundreds of years. And the statistical methods that have been developed to look at space and time clustering are also, some of them, very old methods. About 15 years ago, as a consultant to a state, I did a review of all the surveillance statistics, and as of 1990 there were something like 15 or 20 distinctly different statistical tests that had been devised to look at time-space clustering. The approach that Dr. Kulldorff has developed is really one that was not actually his idea. It was developed from some methods that had been developed by Knox and enhanced by Nathan Mantel, the same guy from the Mantel-Haenszel chi-square. But the most significant innovation that enabled Kulldorff to do his method was really work by Openshaw in the 1980s in England.
He developed an approach that he called the geographical analysis machine, and it would basically place circles of varying radii all over the map, estimate the prevalence of a condition within each of the circles, and compare that to what would occur by chance. Nobody could use Openshaw's program in the 1980s, because unless you were at Carnegie Mellon or the University of Illinois, somewhere that had a supercomputing center, you didn't have a computer that could actually run the analysis. And if you were doing it on a PC, with PCs in those days it would probably run for three or four weeks to do the analysis. That's not a very useful tool.
Well, nowadays, with the computational algorithm that Kulldorff has come up with, and with modern computers, these can be run much more efficiently. Although the example that we're going to give you is one that is based on data aggregated to census tracts, because when Ravi tried to run it using individual data, you said 27 hours?
RAVI SHARMA: 26 hours.
RUSSELL KIRBY: To do this particular thing with individual data on a PC it can still take a while. So you have to kind of think about what kind of efficiency.
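The circle-placing idea described above can be sketched in a few lines of Python. This is only an illustration, not Openshaw's program or the SaTScan algorithm: the function names, the toy coordinates and counts, and the choice to center a circle on every location are all assumptions made for the example. It also hints at why run time grows so quickly, since every circle has to be checked against every location.

```python
import math

def circle_counts(points, center, radius):
    """Sum cases and population for all points within `radius` of `center`.

    points: list of (x, y, cases, population) tuples.
    """
    cases = population = 0
    for x, y, c, n in points:
        if math.hypot(x - center[0], y - center[1]) <= radius:
            cases += c
            population += n
    return cases, population

def scan_circles(points, radii):
    """Place a circle of each radius on each location and record its rate."""
    windows = []
    for x, y, _, _ in points:
        for r in radii:
            c, n = circle_counts(points, (x, y), r)
            if n > 0:
                windows.append(((x, y), r, c, n, c / n))
    return windows

# Hypothetical toy data: (x, y, cases, population) for three areas.
tracts = [(0.0, 0.0, 5, 50), (1.0, 0.0, 2, 40), (5.0, 5.0, 1, 60)]
windows = scan_circles(tracts, radii=[0.5, 1.5])
```

Each entry in `windows` holds the circle's center, its radius, the cases and population it captured, and the resulting proportion. Deciding which of those proportions is unusual is the job of the statistical test.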
But anyway, we'll start out here by clicking on SaTScan. Okay. I'm going to try to move this thing around here so that it's going in the right direction. Okay. So when you click on SaTScan, what you wind up with -- you know, I forgot to mention something else here. If you go into your exercises folder, you will find a PowerPoint file that is called SaTScan. It doesn't have a title page because we just realized we didn't make a title. But in any event, if you go into that file, I'll just show you what it's got in it. It's step-by-step, starting with going into SaTScan. We've already done these things, I hope. Has everybody got SaTScan up on their computer? Anybody who doesn't have SaTScan up on their computer?
Okay. So then we're at the icon level. And this is where I am right now. So we'll just switch back to this.
So what we're going to do is show you the steps that you go through to create a new session and pull in the data. We'll talk about how the data need to be structured in order to run with this software. And then we'll run the application. It's basically a test to see whether there's spatial clustering of low birth weight in Allegheny County, Pennsylvania. Then we'll show you what you need to do to bring the results of your SaTScan analysis into ARC so that you can actually visualize the data.
So the first thing we'll do is we'll say create new session. Okay. And there's a couple of things about this. This software has the capacity to do a variety of different time-space cluster analyses. The analysis that we are going to do is actually one of the more basic ones, and we're just going to look at whether there's spatial clustering. The software actually has the capacity to do time clustering, space clustering, and space-time clustering, but of course in order to do time clustering you have to have dates tagged to all of your records. With spatial clustering you need some locational identifiers, and for space-time you need both.
So in order for this to run, we need to have several different files, all of which have to be formatted in a particular way. But we need a case file, which you know tells us how many cases whether incident or prevalent there are. We need a control file. Actually, yeah we need a control file that tells us how many noncases there are, basically, in the area. And then we need a coordinates file that tells the software where these cases and controls basically are located. So we need all those things. So let's start out by looking at the case file.
We start out — we're going to click on this little icon here. It says import case file and then we have to go and find our data. And all of these data should be in the — here it is in the GIS MCH and I think they're in the geography and data folder. There they are.
Now, we've actually already made the files that you need to have to run. And we'll take a look at them in a minute. But the file that we need for the case file is called case.dbf. You don't have to call it case.dbf, but we did just for our own clarity.
And so it says you know this is what happens when we open the file. It's not a very big file. It basically just has two variables. It has an ID variable and it has a case variable. So in order for this to work, we need to assign at least the ID and the case variable. What we need to do is we need to click here and it brings up the list of variables that we have in that file. And we're just going to select ID like so. And we're going to do the same thing for the number of cases.
So basically what we're looking at here is that the ID field is a reference number for each of the census tracts, and the case variable tells how many events or cases occurred within that particular census tract. So that's basically what we have. If we were doing time-space analysis, we would also need to give the information about date or time, but we're not going to do that, so we'll leave that blank. And if we were doing a more sophisticated analysis, say, for example, we wanted to control for race/ethnicity or maternal age or other factors, we could actually designate covariates that we wanted to be adjusted for in the model. But again, this is a relatively simple example.
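To make the file layout concrete, here is a minimal Python sketch of how individual birth records might be rolled up into the three tables described above: one row per area for cases, for controls, and for coordinates. The field names (`ID`, `CASES`, `CONTROLS`, `X`, `Y`), the tract codes, and the centroid values are hypothetical, and the sketch builds plain in-memory rows rather than the DBF files used in the session.

```python
from collections import Counter

def build_tables(births, centroids):
    """Aggregate individual records into case, control and coordinate rows.

    births: list of dicts with 'tract' and 'low_birth_weight' keys.
    centroids: dict mapping tract ID to an (x, y) centroid.
    """
    cases = Counter()
    controls = Counter()
    for birth in births:
        if birth["low_birth_weight"]:
            cases[birth["tract"]] += 1
        else:
            controls[birth["tract"]] += 1

    ids = sorted(centroids)
    case_rows = [{"ID": t, "CASES": cases[t]} for t in ids]
    control_rows = [{"ID": t, "CONTROLS": controls[t]} for t in ids]
    coord_rows = [{"ID": t, "X": centroids[t][0], "Y": centroids[t][1]} for t in ids]
    return case_rows, control_rows, coord_rows

# Hypothetical example records and centroids.
births = [
    {"tract": "T001", "low_birth_weight": True},
    {"tract": "T001", "low_birth_weight": False},
    {"tract": "T001", "low_birth_weight": False},
    {"tract": "T002", "low_birth_weight": False},
]
centroids = {"T001": (10.0, 20.0), "T002": (15.0, 22.0)}
case_rows, control_rows, coord_rows = build_tables(births, centroids)
```

The point of the shared `ID` column is that all three tables can be matched back to the same area, which is what the import screens are asking for when you assign the ID variable.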
Did you want to — ?
RAVI SHARMA: On the display SaTScan variable.
RUSSELL KIRBY: Right. On the display, these are the different kinds of analyses that you could do. The Poisson and Bernoulli models make different assumptions about the distribution of the data. And then it has space-time permutation, ordinal and exponential, and my understanding is that those are each more complex kinds of analyses you could do. The exponential, I think, is the brand new one that they've just introduced with this particular version. But we're actually going to choose Bernoulli for our analyses. Okay. So once we have that set, then we're going to click on next. And we have to tell it where we want to put the data. Again, this is entirely up to you where you want to put the data. I'm going to put the data in a place where it will make sense. I think I'll put it in the exercises folder.
Okay. And then I'm just going to click on execute.
Okay. And now I'm back at the same screen where I do the other files. And so we need to do basically the same thing with the control file. And handily we have a file called control.dbf that we can go to and we'll do the same, basically the same operations with that. So we're going to click on location ID and select ID and we're going to click on number of controls and select controls in the same way that we just did a few minutes ago.
UNKNOWN PERSON: In that screen (inaudible).
RUSSELL KIRBY: No. Go back. Cancel and click on this thing right here. Yeah.
That's probably -- okay. The question that's come up here is about the different file types that are needed. There are two different icons next to the control file. The first one is looking for a file that's formatted as what they call a .ctl file, and we've made the file as a DBF file. That's why you need to click on that second icon. Those file structures are a little bit different.
So once we've done this, we click on next. And again we have to tell it where we want to put the file. It doesn't seem to remember very much from one step to the next.
So I'm going to put it in the same place we did before. Okay. Now, if we were doing, again, if we were doing a time space analysis, we would need to put in something about the dates. But since we're not doing that, we're going to skip that. And then we need to tell it the coordinates file.
So we need to click here again.
RAVI SHARMA: Click on —
RUSSELL KIRBY: Can I do it from here? Do I have to do that first?
RAVI SHARMA: I think first, yeah.
RUSSELL KIRBY: Ravi is saying I need to select the kind of coordinates. We're actually using Cartesian coordinates rather than latitude and longitude on this particular analysis. And then we're going to select the coordinate DBF file. And we say open. Again, I'm not actually sure what the difference would be in what they look like between Cartesian and latitude and longitude, but those look like (inaudible) to me. But who knows. Basically the data that we have are census tract data, so these XY coordinates are basically a centroid for each of the census tracts.
RAVI SHARMA: Shall we select all the ones, the longitude —
RUSSELL KIRBY: Want to see what it looks like?
RAVI SHARMA: Yeah. Just cancel that.
RUSSELL KIRBY: Just so we can see what it looks like, we'll see what happens if we select lat/long. And then, do you have another file?
RAVI SHARMA: No, that's good.
RUSSELL KIRBY: Basically the same. Might not make a lot of difference. Would it make a difference to the analysis? Probably not. We'll see.
Then here again you have to specify the variable names.
RAVI SHARMA: On the top, the first one is asking for Y.
RUSSELL KIRBY: I did it backwards. That will mess things up a little bit. I wonder why that is.
RAVI SHARMA: That is interesting why it is.
RUSSELL KIRBY: So, anyway, we just have to select. And the interesting thing: has anybody ever used a software program called Cluster that was developed by ATSDR in the mid '90s and had a variety of time-space clustering algorithms? Well, if you have used that software, what you probably remember is how painfully anal retentive it was in terms of the required formats for everything that it had. This is quite a lot better, but it's still going to require you to name things in the way that it's expecting.
Okay. So now we've done that and we'll click on next. And I'm going to again put that in the GIS exercises folder. And then we'll execute it.
Now we have the three basic files that we need. If we had point data, we might need a grid file as well. But for this particular example we don't. And then we go to the -- do we need to go to the advanced data sets?
RAVI SHARMA: No.
RUSSELL KIRBY: I don't think we need.
RAVI SHARMA: Only if multiple data sets.
RUSSELL KIRBY: If you had your data in several different files you might need this. But we don't need it. So then we go to analysis. Okay. And we are going to specify exactly what analysis we're going to do. Our analysis is only looking at spatial clusters, so we select the purely spatial. The other options are purely temporal and then space-time. I kind of think the reason those are called "purely" has something to do with the way Swedish translates into English or something. But in any event, that is what they're called. And we're going to use the Bernoulli probability model. But it's worth reading in the manual and looking at some of the underlying distributions that your data might have, and thinking about which one might be the best choice for your particular analysis. And then again there are these different options in terms of what to scan for. You might be interested in areas that have high rates. You might be interested in areas that have low rates. Basically those would be one-tailed tests. Or you might be interested in a two-tailed test, where you're looking at both high and low rates. So it depends on what you want.
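For readers who want to see what the Bernoulli model is comparing, here is a minimal Python sketch of the log likelihood ratio commonly given for a Bernoulli spatial scan statistic when scanning for high rates only. It is an illustration under stated assumptions, not SaTScan's source code: the function names are hypothetical, and the example counts are made up. Here `c` and `n` are the cases and total records (cases plus controls) inside a candidate circle, and `C` and `N` are the totals for the whole study area.

```python
import math

def _xlogy(count, prob):
    """Return count * log(prob), treating 0 * log(0) as 0."""
    return 0.0 if count == 0 else count * math.log(prob)

def bernoulli_llr(c, n, C, N):
    """Log likelihood ratio for a high-rate cluster under a Bernoulli model."""
    if n == 0 or n == N:
        return 0.0
    inside = c / n                 # proportion of cases inside the circle
    outside = (C - c) / (N - n)    # proportion of cases outside the circle
    if inside <= outside:
        return 0.0                 # not a high-rate area, so no evidence
    log_alt = (_xlogy(c, inside) + _xlogy(n - c, 1 - inside)
               + _xlogy(C - c, outside)
               + _xlogy((N - n) - (C - c), 1 - outside))
    log_null = _xlogy(C, C / N) + _xlogy(N - C, 1 - C / N)
    return log_alt - log_null

# Hypothetical counts: 10 cases among 100 records overall.
moderate = bernoulli_llr(5, 10, 10, 100)   # half the circle's records are cases
strong = bernoulli_llr(8, 10, 10, 100)     # most of the circle's records are cases
```

Scanning for low rates would flip the `inside <= outside` check, and a two-sided scan would drop it. The circle with the largest value is the candidate for the most likely cluster.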
And then down here it says Monte Carlo replications, and I think it can do a lot of replications. Now obviously the bigger the number you use, the longer it's going to take to run, because it's going to do that number of replications. But it actually is a good idea to use a reasonably large number, because it can take a number of iterations before things really start to converge. Do I need to go to advanced here? Let's take a look and see what we have there. Okay. So now this is a more specific set of questions that it's asking us, and basically there are three different choices that we can give in terms of the size of cluster that we're willing to accept. You can play around with these different options. For the example that we're going to show you, we selected the first one and are just using 50 percent of the population at risk. Now, bear in mind that if it actually found a cluster that included 50 percent of the population, that wouldn't be a very informative thing. But in practice it rarely does. If you had a lot of data, say, for example, instead of Pittsburgh you were doing the New York metropolitan area and you had ten years' worth of records in the file, so that you might be dealing with millions of records, you could certainly drop that percentage down to 5 percent or 10 percent or 20 percent or what have you. But computationally it seems to work okay using the 50 percent.
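As a rough illustration of what the replications are used for, here is a minimal Python sketch of a Monte Carlo p-value based on ranks: the statistic from the real data is compared with the same statistic computed on data sets generated at random under the null hypothesis. The function name and the numbers are hypothetical, and this is a generic sketch of the technique rather than SaTScan's internal code.

```python
def monte_carlo_p_value(observed, simulated):
    """Rank-based p-value: the rank of the observed statistic among all runs.

    observed: test statistic from the real data.
    simulated: statistics from data sets generated under the null hypothesis.
    """
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)

# Hypothetical statistics from three random replications.
replicates = [1.0, 2.0, 3.0]
p_extreme = monte_carlo_p_value(5.0, replicates)   # observed beats every replicate
p_middle = monte_carlo_p_value(2.5, replicates)    # one replicate is larger
```

With this approach the number of replications sets how fine-grained the p-value can be. With 999 replications, for example, the smallest value it can report is 1/1000.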
So now we need to go to output. And we need to tell it where we want to put the data. And we actually already have — should I give it another name, do you think?
RAVI SHARMA: We can do that.
RUSSELL KIRBY: I think I want to put it in that same -- it won't let me go down, I guess. Okay. Well, we'll just call it cluster results. But again, you can call it whatever you want. And then this next part, these optional output files are worth creating just so that you have them, if you want to look more at what your results are. So we'll select the dBase versions of each of these. And then let's see what they have under advanced. I don't think we need to do that. But, okay, the software can identify what they call primary clusters and what they call secondary clusters. The primary clusters are the ones that typically have a stronger P value associated with them than the secondary clusters. And it gives you a number of different options in terms of the relationships between the clusters. If you choose no geographical overlap, then what that will mean is that any particular individual, in this case census tract, can only fall into one cluster. But given that we don't know a lot about the spatial relationships or processes underlying a lot of disease transmission, it might not be a good idea to make the assumption that any particular location can only fall into one cluster. It could be that there are other things going on. So you could choose some of these other options that would allow that to happen. But in this particular example we decided to just use the no geographical overlap.
Okay. And I think I've done everything, right? So now that we're done with that, they've got a cute little icon here, the clusters icon. We'll click on that and it's going to run our analysis. You can see it's doing the replications, and you can see that if we had done 10,000 it would take a while. But it's actually pretty fast. It can run in a minute or so. How many observations do you have, census tracts?
RAVI SHARMA: About 400 and some.
RUSSELL KIRBY: 416?
RAVI SHARMA: I ran it with 55,000 births. It took me 26 hours. Not a good idea to do that here.
RUSSELL KIRBY: Okay. So let's just look at the output that we have here. Did everybody get this to run? Okay. Let's make this bigger here. So there are 416 census tracts. It's about 53,000 total population, and the total cases is 4,339. We're basically mapping whether there's clustering in low birth weight. And so now we'll just page down and see what we have here. What it's doing here is listing all of the census tracts that fall into what it regards as the most likely cluster, and it provides some statistics about how many observations and how many cases there are in that particular cluster. Furthermore, it gives some idea of the relative risk within that particular cluster compared to other areas. So that's the primary cluster. And then it has a variety of secondary clusters. It actually found not very many secondary clusters. Looks like it found just, basically, one. Right?
RAVI SHARMA: Yes.
RUSSELL KIRBY: This just tells you something about the analysis that you did. And this is the kind of stuff — this is kind of you know the metadata about your analysis. It's probably worth saving this file so that you can have it to refer to, because it tells exactly how you set up your problem and so on.
Okay. So at this point we're basically done using SaTScan for this example. And the problem with SaTScan is that it is dedicated solely to doing the cluster analyses. It is not a data visualization program. All it does is create the result. You have to package your data in the format that this program wants in order to run the analysis. And then when you're finished, you have to export the results or basically import them into a GIS in order to see what you learned. And so that's what we're going to try to do next. And if you're following along with the PowerPoint file — I guess I have to close this.
RAVI SHARMA: Click okay on the PowerPoint. Go back to PowerPoint. That thing there, click okay.
RUSSELL KIRBY: So I'm just going to page through and we'll get to where we need to be. This file gives all the details of what we just did. And then what we're going to do now, and I might have Ravi help me with this, but we're going to basically create a map now that allows us to display the results.
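The step of getting the results into a GIS comes down to a table join on the area ID. Here is a minimal Python sketch of that join under stated assumptions: the field names (`ID`, `CLUSTER`), the tract codes, and the row contents are hypothetical and are not taken from the actual SaTScan output files or from the exercise data.

```python
def join_cluster_results(tracts, results, key="ID"):
    """Attach a cluster number to each tract row by matching on the ID field.

    Tracts that are not in any reported cluster get CLUSTER = None.
    """
    by_id = {row[key]: row for row in results}
    joined = []
    for tract in tracts:
        match = by_id.get(tract[key], {})
        merged = dict(tract)               # copy so the input rows are untouched
        merged["CLUSTER"] = match.get("CLUSTER")
        joined.append(merged)
    return joined

# Hypothetical tract attribute rows and cluster membership rows.
tract_rows = [{"ID": "T001", "NAME": "Tract 1"}, {"ID": "T002", "NAME": "Tract 2"}]
result_rows = [{"ID": "T001", "CLUSTER": 1}]
mapped = join_cluster_results(tract_rows, result_rows)
```

Once each area carries its cluster number, the GIS can shade the map by that field. This is the same idea as joining the results table to the tract layer inside the mapping software.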