MCHB Conference Webcasts
Using Geographic Information System (GIS) to Analyze MCH EPI Data

MCHB/EPI Miami Training — December 5 - 6, 2005

GIS Applications & GIS Tools — Transcript

 

RAVI SHARMA: Let's start our GIS.

RUSSELL KIRBY: So we're going to start up Arc. It's not a lot faster on this computer than it was on that one. So we're going to go into Arc now. What we're going to do is find the Allegheny County file, the census tract shape file, and load that. And then we're going to go out and find the SatScan data file and do a few manipulations of it and go from there. So we'll start out with a new empty map.

RAVI SHARMA: Click okay.

RUSSELL KIRBY: And we click on the plus arrow here.

RAVI SHARMA: Yep. Let's look for — we can use —

RUSSELL KIRBY: No, we need.

RAVI SHARMA: Get the Allegheny County.

RUSSELL KIRBY: The question is where are they?

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: It is in here?

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: I'm in the wrong place, I think, is the problem.

RAVI SHARMA: Yeah.

RUSSELL KIRBY: Connect to folder and then come into GIS and will that do it?

UNKNOWN SPEAKER: Yeah.

RUSSELL KIRBY: Or do I need to do it down here? Okay. It's this one.

RAVI SHARMA: We can use that. Actually —

RUSSELL KIRBY: Will this one work better or are they about the same.

RAVI SHARMA: We used the AAHCP (inaudible). Either one works. So long as they're not projected, we just need to make sure.

RUSSELL KIRBY: We'll see. All track.

RAVI SHARMA: All track, SHP.

RUSSELL KIRBY: Does that look right?

RAVI SHARMA: That looks projected, I think.

RUSSELL KIRBY: So do we need to go into here and look at properties?

RAVI SHARMA: Yeah. So right click. On Source.

RUSSELL KIRBY: Okay. I'm just looking to see —

RAVI SHARMA: That looks okay.

RUSSELL KIRBY: So this is okay. Basically this is going to work better if your geography is not projected. So I was just double-checking to make sure that that holds in this particular situation. And it's not, because it says geometry type polygons. Okay. So this is going to be our geography that we're going to use. And then what we need to do is we need to add the data, right?

RAVI SHARMA: Yes. We cluster column 15.

RUSSELL KIRBY: Cluster results 50 column (phonetic). So we're going to bring in this particular table now which has the results from the SatScan. What's the difference between these? The GIS and the —

RAVI SHARMA: The GIS doesn't give you actually — I don't know why they call it GIS. It doesn't give you the X and Y coordinates. Only the column one does.

RUSSELL KIRBY: Only the column one gives us the spatial data that we need to actually map. I took the cluster results 50 COL. You can name it whatever you named it.

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: Whatever you named it you want to use the COL. Why did we call it 50? 50 because we used up to 50% is the reason why we call it that. So now we come in here. Go to properties.

RAVI SHARMA: Right click. Open.

RUSSELL KIRBY: Okay. So right click and then open.

RAVI SHARMA: And you see your latitude and longitude.

RUSSELL KIRBY: So you can see for each of the records it has —

RAVI SHARMA: I'm worried. There are only ten. Oh, that's fine. (Inaudible) Yeah.

RUSSELL KIRBY: Right. And then over here it gives the lat and long of the point, and then it gives the radius that's going to be drawn for that particular cluster. And then the date fields are immaterial in our particular analysis. In fact, it's a little bit tricky if you want to use dates: if you want to do this kind of analysis with aggregated data, you have some challenges in terms of doing the time-space clustering, because it's difficult to attribute an aggregate date to data that are basically individual event-based. So for time-space clustering, you probably do want to do that analysis on individual records. Do I need to show anything else here?

RAVI SHARMA: No, that's good.

RUSSELL KIRBY: Let me go over to the end, see what other things it has. It actually does provide also the P value for the cluster and it provides the observed and expected numbers as well. And probably let's see what else it has over here. It gives the relative risk. And so on. Okay. So I can close this, right?

RAVI SHARMA: Yes.

RUSSELL KIRBY: Okay. So now the next thing we want to do is we want to actually.

RAVI SHARMA: Put the XY values.

RUSSELL KIRBY: We will not do that. We want to right click and display XY, right?

RAVI SHARMA: Yes.

RUSSELL KIRBY: We'll display XY. We can display them as latitude and longitude, I think.

RAVI SHARMA: Yep.

RUSSELL KIRBY: It's an unknown coordinate system. Okay.

RAVI SHARMA: That's good. So those of you who know the geography of Allegheny County, this used to be an area where there was (inaudible) a lot of industry along this river. So this is the (inaudible) river that flows north, and there's the Allegheny that comes, actually, those of you who saw the river hydrology will know how the Allegheny flows. And then there used to be a lot of industries around this area, and then they meet here at the Point to form the Ohio River, which then flows on until it meets the Mississippi at some point. So there is a cluster right here. This is the city of Pittsburgh here, this one up here. That's the city. This is the (inaudible), this is southwestern Pennsylvania, mostly suburban. This here, this is along the Ohio River, and we have a collection of industries still very active. And these are TRI sites, the Toxics Release Inventory sites, around this place here.

RUSSELL KIRBY: And we want to go —

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: Okay, well let's —

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: Some people have projected files?

RAVI SHARMA: Yeah.

UNKNOWN SPEAKER: Because when you did the Cartesian X and Y pop in. Longitudinal (inaudible) wonder if that's —

RAVI SHARMA: That's interesting.

RUSSELL KIRBY: Dianne, do you want to come up and show them how you would fix that?

UNKNOWN SPEAKER: It's a guess. It's a theory.

DIANNE ENRIGHT: There should be an alternate shape file if you have a projected all county, all tract. There's an all CYNT tract. Try it with that one.

RAVI SHARMA: It's not projected.

UNKNOWN SPEAKER: Dianne, could it possibly be the fact that they did the longitudinal and some of us did Cartesian?

DIANNE ENRIGHT: Yes. If you still see it and it's slightly different, that would be why. If you don't see it at all it's not projected.

RAVI SHARMA: Can you right click on — no, not on this one, but on your results. The one you brought in from the DBF file, right?

UNKNOWN SPEAKER: Open it?

RAVI SHARMA: Yeah. So you have 12. Let me check here. I'm just going to check how many we have here. We have 10 and they have 12. How did that happen?

UNKNOWN SPEAKER: I don't know.

RAVI SHARMA: That can be the cause of XY, I don't think that we're that much different. So this is the one, and we did the one — yesterday.

RUSSELL KIRBY: Yeah, go in there. Yeah, plus our 50 COL.

RAVI SHARMA: This one?

RUSSELL KIRBY: Yeah. Bring that one in. Maybe make it a different color.

RAVI SHARMA: So this one is —

RUSSELL KIRBY: And it has 14. 15.

RAVI SHARMA: This is interesting.

RUSSELL KIRBY: That is interesting. Okay. One of the things that this is showing us actually is that this analysis is basically a probability analysis. Okay. And it's basically doing a Monte Carlo simulation, roughly a thousand times. And you would think that if you do something like that a thousand times, you'd get the same answer because of the law of large numbers. But in fact you don't always get the same answer when you're doing a simulation. And I think that's why some of us are getting slightly different results. Now, if we did the simulation 10,000 times or 20,000 times, these results would probably converge to be very similar. But you can get slightly different results since it is basically a random process that it's running. So we've got ten.

RAVI SHARMA: Actually, if we do it again we may get nine.

RUSSELL KIRBY: Yeah. In fact the clusters could go away altogether if we keep running it enough times. So are we ready to go to the next step?

UNKNOWN SPEAKER: (Inaudible) run this a bunch of times.

RAVI SHARMA: They're not going to evaporate.

RUSSELL KIRBY: They're probably not all going to disappear. But the exact number you're going to find might be slightly different from run to run. We ran it 999 times and got ten and last night we ran it and we got 14. And back here they ran it and they got 12. But the fact that they're all — I would be willing to bet, however, if we look closely at these, the primary cluster is probably coming up the same on all those. And the difference is with some of the secondary clusters.
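The run-to-run variation described here can be illustrated with a small simulation. The sketch below is not SatScan's algorithm. It uses a made-up test statistic (the largest count among a set of hypothetical areas) only to show how a Monte Carlo p-value from 999 random replications is computed as a rank, and why the value can shift a little when the random draws change. All names and numbers are illustrative assumptions.

```python
import random

def monte_carlo_p_value(observed_stat, simulate_stat, n_reps=999, seed=None):
    """Rank-based Monte Carlo p-value: (1 + #simulated >= observed) / (n_reps + 1)."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_reps) if simulate_stat(rng) >= observed_stat)
    return (1 + exceed) / (n_reps + 1)

def max_area_count(rng, n_areas=20, n_events=200):
    """Hypothetical null model: scatter events uniformly over areas, return the largest count."""
    counts = [0] * n_areas
    for _ in range(n_events):
        counts[rng.randrange(n_areas)] += 1
    return max(counts)

observed = 18  # made-up observed statistic, for illustration only
# Same data, five different random seeds: the p-values need not agree exactly.
p_values = [monte_carlo_p_value(observed, max_area_count, seed=s) for s in range(5)]
print(p_values)
```

With 999 replications the smallest p-value this calculation can return is 1/1000, or .001. A very strong cluster will sit at or near that floor on every run, while borderline secondary clusters are the ones most likely to move between runs.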

RAVI SHARMA: That's true, when you look at it. This is the primary cluster whether you have 10 or 14. The primary cluster will always be picked up, right, Brian? In yours, that's the primary cluster. The other ones are secondary clusters with low probability values.

UNKNOWN SPEAKER: All the other ones are secondary?

RAVI SHARMA: Yeah.

RUSSELL KIRBY: Yeah.

RAVI SHARMA: So you will see when we plot these —

RUSSELL KIRBY: Okay. Let's go to that. No, I wanted to close it.

RAVI SHARMA: Click on the right —

RUSSELL KIRBY: I was trying but I'm left-handed and the mouse is in the wrong place for me.

RAVI SHARMA: So we go —

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: That would be the primary cluster, right. We'll do it on this one here.

RAVI SHARMA: We'll go to tools. We'll click on the toolbox.

RUSSELL KIRBY: Right.

RAVI SHARMA: Shall I come and —

RUSSELL KIRBY: Yeah.

RAVI SHARMA: Give you — . So let's see what we can do. Okay. So we're going to go to toolbox here. And we're going to go to — this is very similar to what we did yesterday in your, what we're going to do is simply do the buffers. So our input file is this one up here and we can leave our output file. We can just take the default and then we need to specify the fields so we can specify the radius and if you like you can dissolve.

If you want to dissolve, you can dissolve. Or you can keep, you know, and shall we dissolve? Okay. And that's what we want to dissolve. So this is what it should look like. Let's put this in. We'll be putting this —

RUSSELL KIRBY: Yeah, it's on C, GIS MCH exercises.

RAVI SHARMA: Okay. And you can call it buffer.

RUSSELL KIRBY: Okay. So is everything okay up here? We need to change the radius is fine. And this is okay. So we click okay. I think this may be the conflict we talked about. See the — I think it's a mismatch because this may have to do with —

UNKNOWN SPEAKER: You can use —

RAVI SHARMA: Yeah, let's use — I am going to eliminate which one is — so let's do the — where is the buffer. Up here. I'm going to eliminate this. So this is the one we did last time. Okay. So let's do this again here. We didn't map it. So let's put X and Y. This one here.

RUSSELL KIRBY: The bottom one. Take that one off.

RAVI SHARMA: Okay. That's one we did last night, I think. So all right. So now we can go to — we're going to do the decimal degrees.

RUSSELL KIRBY: Don't you want to pick the column?

RAVI SHARMA: For? I'm sorry?

RUSSELL KIRBY: (Inaudible) you want to use that, right?

RAVI SHARMA: Right. This one. That's degrees and the field we're going to pick is radius. And you didn't get radius. You can use the dissolve. Use this here. And then radius. Oops. List here.

RUSSELL KIRBY: There we go. So let's just see what we have here.

RAVI SHARMA: We can actually make the buffers transparent so you can see the underlying geography also by going to properties and display. It's much easier to see the underlying geography and you can actually go to cluster here and change the color to red. Or what's another good color? This is a good color. So the primary cluster really is primarily the city of Pittsburgh. And so it's not unexpected, because that's where you have primarily, you know, the concentration of minority population.
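The buffer step draws a circle of the cluster's radius around each cluster center. As a rough sketch of the idea, and not of what the ArcGIS buffer tool does internally, the function below approximates such a circle as a ring of vertices in plain Python. The center and radius shown are hypothetical values, and the function name is an assumption made for illustration.

```python
import math

def circle_buffer(x, y, radius, n_vertices=36):
    """Approximate a circular buffer as a closed ring of (x, y) vertices.

    The radius is in the same units as x and y, so the points and the
    radius must share one coordinate system (for example decimal degrees).
    """
    if radius <= 0:
        return []  # a zero-radius cluster yields a point only, no circle
    ring = [
        (x + radius * math.cos(2 * math.pi * i / n_vertices),
         y + radius * math.sin(2 * math.pi * i / n_vertices))
        for i in range(n_vertices)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring

# Hypothetical cluster center (longitude, latitude) and radius in decimal degrees
ring = circle_buffer(-80.0, 40.4, 0.2)
print(len(ring))
```

Because the radius is interpreted in the units of the coordinates, a radius produced in decimal degrees will not line up with a base layer stored in projected map units, which is one way a coordinate mismatch can show up on the map.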

RUSSELL KIRBY: So what we need to do is —

RAVI SHARMA: Low birth weight. No controls.

RUSSELL KIRBY: We need to go through what happens if you have the mismatch coordinates. Because like over here they have, I think they might actually be —

RAVI SHARMA: I think these are Cartesian. We need to use the Cartesian. Did you use — you need to go back and change those two.

RUSSELL KIRBY: So it's in the ALT track.

RAVI SHARMA: Yes. That's why. Because those are map coordinates.

RUSSELL KIRBY: So click on ALT track. It will be a right click. And then —

RAVI SHARMA: We need to specify the Cartesian coordinates. Because that's what — they are really not latitude and longitude.

RUSSELL KIRBY: You need to change the coordinates for that file.

UNKNOWN SPEAKER: Why does the point have no buffer?

RAVI SHARMA: They're probably not — let's look at the significance.

UNKNOWN SPEAKER: Nothing significant short of —

RAVI SHARMA: Yeah. Yeah. Because if you look there's no radius for them.

UNKNOWN SPEAKER: Why would it generate?

RAVI SHARMA: What's the P value for these?

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: Yeah, let's look at, for those with no.

UNKNOWN SPEAKER: Radius.

RAVI SHARMA: Radius is zero. That's what happens when there's no radius. And the P value is probably —

UNKNOWN SPEAKER: One.

RAVI SHARMA: Yeah, that's why.

UNKNOWN SPEAKER: Okay.

UNKNOWN SPEAKER: How come you have so many points with radii and some below?

UNKNOWN SPEAKER: You get more points than —

UNKNOWN SPEAKER: These are radii and those (inaudible) more interested in this.

RAVI SHARMA: We're interested in the one with significant value.

UNKNOWN SPEAKER: Run this within Pittsburgh?

RAVI SHARMA: No. I actually should do it. The problem I have is —

UNKNOWN SPEAKER: I'm going to run (inaudible).

UNKNOWN SPEAKER: Okay. Go back to your cluster results. Go into buffer.

UNKNOWN SPEAKER: What did we dissolve on?

RUSSELL KIRBY: We used dissolve list and —

UNKNOWN SPEAKER: Oh, list.

RUSSELL KIRBY: Yeah.

RAVI SHARMA: So I can actually geocode.

RUSSELL KIRBY: The two people in front both have the same problem.

RAVI SHARMA: Did you change the tracking code?

RUSSELL KIRBY: We tried to change them.

UNKNOWN SPEAKER: We can fix that.

UNKNOWN SPEAKER: I'm confused — (inaudible).

RUSSELL KIRBY: The buffering, the cluster, the SatScan (inaudible) from the files around this.

UNKNOWN SPEAKER: So where do you see that?

RUSSELL KIRBY: If you go to —

UNKNOWN SPEAKER: Here's our primary.

RUSSELL KIRBY: This is your primary.

UNKNOWN SPEAKER: Right. Where do you see the radius?

RUSSELL KIRBY: Right here. Right. (Inaudible).

UNKNOWN SPEAKER: It's saying, all of these census tracts had a cluster of, do you know what size?

RUSSELL KIRBY: Well, it's telling you (inaudible).

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: Something that looks sort of like this. Not everybody, but almost everybody does. And I wanted to ask you a few questions. I've talked with some of you individually as we've been looking at this. So let's say that we have a map that looks like that or something similar to it. And so we have identified these different areas as being potential clusters for low birth weight. So what are some of the issues around interpreting this information? If we look, for example, you know we've got this really big circle right here. And we've got a few other circles. This one, for example, is really only in one census tract. This one covers, you know, over 100 census tracts. What can we say about the pattern of low birth weight in relation to these potential clusters? Go ahead.

UNKNOWN SPEAKER: According to the table one of the clusters was —

RUSSELL KIRBY: According to the data only one of the clusters was significant. And actually, the way that we've displayed the data, we haven't highlighted which one that was. So you probably want to be able to differentiate the primary and the secondary clusters. But what other kinds of issues might there be? Behind you, Susan?

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: Right. Exactly. So what you have is low birth weight. You know, it's true that if you built a model at the individual level to estimate low birth weight as a dependent variable, putting everything that we know about low birth weight into the model at the individual level, the R squared that you would get from it is probably around 15%. All the things that we know about, when we actually model them, only explain about 15% of the variability in low birth weight. Incidentally, there are some clinical factors that we typically don't measure very well that can predict a much higher amount of the variation in low birth weight. But with the things that we have on the birth certificate, 15% is about what we're going to get. Well, we know that birth weight varies, at least low birth weight varies, by race and ethnicity. We typically have a 1.8- to 2.8-fold higher rate of low birth weight among African Americans, right? We actually know that there are some ethnic groups that have low birth weight rates that are lower than among white non-Hispanics, and so on.

So we haven't modeled that at all. It's possible that that one factor alone, I mean, we may have just made a map that shows where you're more likely to have African American women living in Pittsburgh rather than a map that shows areas that have higher rates of, or probably higher prevalence of, low birth weight. So we haven't really done anything about that. Other factors that we could potentially look at: plurality. You know, there could be spatial differences in the place of residence of women who are having multiple births. In particular, in some states, and some people here are from Massachusetts, in vitro fertilization must be offered by all health plans, but we have other states where it's not really available to people who are on Medicaid or have no way to pay for it. So you could very well have a spatial pattern in terms of the likelihood of having births that were from assisted reproductive technology. Anyway, there are a lot of different factors that could go into this particular distribution.

And one of the challenges that we have with the time-space clustering models that most people run is that they're only based on the spatial distribution of events, and they don't take into account very many other variables. Now, the SatScan program, you know, when we were looking at the input screens, does actually allow you to include some covariates. The problem is that in order to model those covariates you really need to be using individual-level data. And we didn't want to spend 26 hours waiting for it to run so you could see what the results are.

Maybe what we should have done is just saved the results and brought them here to show you. But we didn't do that. But another potential approach that is certainly worth considering would be to take your data, for whatever outcome you're looking at, and low birth weight might be that outcome, and use the individual-level data to estimate a multivariable model. So, for example, you know, put in your variables about demography, maternal age, education, whether she's married, and other kinds of variables that you might have available. Put them into a logistic regression model, which will then give you a model for the data for the entire region. And then estimate the probability of low birth weight for each birth in the data set, so that what you'll wind up with is that every birth will have a value somewhere between zero and one, but none of them likely will be exactly zero or exactly one. Most likely the low birth weight babies will have probabilities that are closer to one than to zero. And then for each of the census tracts, sum across those probabilities to come up with your estimate of low birth weight for that particular census tract. And of course the denominator will still be the number of births.
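The aggregation step just described, summing each birth's predicted probability within its census tract, can be sketched as follows. This assumes the logistic regression has already been fitted in a statistical package; the intercept, coefficients, covariate names, and records below are hypothetical and are not estimates from any real data.

```python
import math
from collections import defaultdict

def predicted_probability(intercept, coefs, covariates):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum of b_k * x_k)))."""
    linear = intercept + sum(coefs[name] * value for name, value in covariates.items())
    return 1.0 / (1.0 + math.exp(-linear))

def expected_by_tract(births, intercept, coefs):
    """Sum predicted probabilities within each tract and count births per tract."""
    expected = defaultdict(float)
    n_births = defaultdict(int)
    for birth in births:
        p = predicted_probability(intercept, coefs, birth["covariates"])
        expected[birth["tract"]] += p   # model-based estimate of low birth weight births
        n_births[birth["tract"]] += 1   # denominator is still the number of births
    return dict(expected), dict(n_births)

# Hypothetical coefficients and records, for illustration only
intercept = -2.5
coefs = {"teen_mother": 0.4, "unmarried": 0.3}
births = [
    {"tract": "A", "covariates": {"teen_mother": 1, "unmarried": 1}},
    {"tract": "A", "covariates": {"teen_mother": 0, "unmarried": 0}},
    {"tract": "B", "covariates": {"teen_mother": 0, "unmarried": 1}},
]
expected, n_births = expected_by_tract(births, intercept, coefs)
print(expected, n_births)
```

Each tract ends up with a model-based count and a birth count, which is the pair of numbers the approach above would carry forward into the cluster analysis.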

If you did a model like that and ran it through SatScan, it would basically be mathematically adjusting for the known covariates that we have data on. And that might be what's happening now. I don't know. That might be something that would be worth considering. Now, another thing that I wanted to mention: this particular methodology for identifying potential clusters is a relatively widely used model. And despite some of the problems we've been having with running the example here just now, it's actually a relatively easy model to run compared to some others. But there are other techniques available that you can use to try to identify spatial clusters. And one that I think is definitely worth mentioning, though we do not have an example for it, is a methodology that was developed by Gerry Rushton at the University of Iowa. He's developed a methodology that's called spatial filtering. His methodology also uses Monte Carlo simulation to basically estimate a probability map for the whole area, and it's worth exploring as well. The reason we did not bring it as an example to demo is that it is considerably more complex to set up and requires having some additional software. But his approach is available as freeware from the University of Iowa website. If that's something you're interested in, when we put together the post-seminar materials we can certainly put in some information about how to use that application. Dianne, have you ever used Rushton's spatial filtering?

DIANNE ENRIGHT: No. Looked into it.

RUSSELL KIRBY: Looked into it. Yeah. There are some spatial statisticians who are not quite sure that it really does what it's supposed to do, which is another reason not to necessarily focus on it. Okay. So Ravi, did we have anything else we wanted to do about SatScan?

RAVI SHARMA: No. I think (inaudible).

RUSSELL KIRBY: Well, why don't we take a couple of minutes and see if anybody has any questions.

UNKNOWN SPEAKER: Hi. I was curious about what actually got into, brought over to ArcView into the map, and I had two questions. One was: Some of the apparent cluster centroids had no radius, zero radius and the others had a real value for the radius. What distinguishes those two groups? Sometimes you just get the point that no circle —

RAVI SHARMA: If you right click on the buffer that you created, that will give you a little idea. Open the attribute table. What you see is the P value.

UNKNOWN SPEAKER: Yeah.

RAVI SHARMA: So you see it, yeah, some of them are one and the important one is really the .001, and there are, I don't know why, the secondary clusters. So they're these.

UNKNOWN SPEAKER: Go to those original, I'm sorry, this one, from SatScan, and we look here, we have 14. So we have points for all 14 but only five circles.

RAVI SHARMA: But also you have to look.

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: We also have to look at some of the, you know, normally it should be, the primary cluster should be displayed. Only the significant clusters should be displayed. We know that the big circle where it grew 50% of the population, that's the significance. The P value of .001. The rest are, if you look only one other cluster I think is, has a —

RUSSELL KIRBY: There was one other in our group that had that data.

RAVI SHARMA: No, that's not significant. So the question is why are these other ones being, why are there circles around the other ones? It actually — those are secondary clusters.

RUSSELL KIRBY: Those are secondary clusters.

RAVI SHARMA: And we didn't ask for high and low. We just asked for high clusters. This is really interesting. I'm not really sure as to why the secondary clusters with insignificant P values are being displayed.

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: That's a good question. Yes.

RUSSELL KIRBY: So we'll learn about that and see what we can find out. Anybody else have a question? Anybody else? Yeah.

RAVI SHARMA: I just want to check a little bit more about this to see.

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: So that's .3, .4, .5. So really this one up here.

RUSSELL KIRBY: That's primary.

RAVI SHARMA: That's primary. But you have another one here. These have not been displayed.

RUSSELL KIRBY: Those are not being displayed.

RAVI SHARMA: So this one up here, let me turn this off here. Selection. Selection. Let me just look at — so this one up here. This one here. So what does it say here? So you see the P value is — this is insignificant here. I'm just trying to —

RUSSELL KIRBY: (Inaudible).

RAVI SHARMA: Right. That's in the map, what do you call it, decimal degrees. Because that matches the fact that these are all map coordinates.

RUSSELL KIRBY: Okay.

RAVI SHARMA: But this observed is 282. And the relative risk is 1.25. So this is this one here.

RUSSELL KIRBY: I'm not sure that one should actually be showing —

RAVI SHARMA: It shouldn't be showing up.

RUSSELL KIRBY: The P value.

RAVI SHARMA: Yeah.

RUSSELL KIRBY: (Inaudible).

RAVI SHARMA: Yeah. And this is another one, .3. This is not significant. I have no idea why these are showing up. This is so weird. Maybe we really shouldn't be displaying these at all. This one up here, radius is .2.

UNKNOWN SPEAKER: Do you think it might have to actually — on a visual basis?

RAVI SHARMA: Probably.

RUSSELL KIRBY: Of the map?

RAVI SHARMA: Yeah. I mean it's shooting for the significant clusters.

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: I'm sorry, did you have a question? Are you okay? Okay.

RUSSELL KIRBY: Yeah. Susan is suggesting something that would be a potential way to proceed with your analysis. For example, we know that there is a difference between singleton and multiple births in terms of low birth weight. You could potentially do an analysis specifically of singletons. You could potentially do separate analyses for different race and ethnicity groups and look at those kinds of patterns. But I think what you probably need to do also, because the only thing I can think of as to why we end up with these clusters showing up on the map that have P values greater than .05 would be because we told the map to put them on the map. But maybe we shouldn't do that. And I think it would make more sense, actually, as you're bringing the data in, to write one of those Visual Basic scripts that would tell it, as you bring the file in, to display on the map only those clusters that have P values of less than .05 or .03 or whatever your threshold might be. And if we did that and I think Ravi is going to show you what —

RAVI SHARMA: I'm going to ask — I'm going to call Dianne.

RUSSELL KIRBY: Okay. Dianne.

RAVI SHARMA: So Dianne is going to show us. So we have a P value here.

UNKNOWN SPEAKER: There's the P value.

RAVI SHARMA: So we are going to run a, there's a — we want to calculate values. That's okay. So what we are going to do, we're going to select right P value, okay? It's a number. And we can get all the unique P values here. So we can say P values greater than, what's greater than here, can you see it?

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: Oh yeah. Oh yeah, that's right. That's what I was looking for. Select By Attributes. So what we're going to do is drop down to P value, right?

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: I would say less than —

RUSSELL KIRBY: We could say P value of less than .05. You want to say less than or equal to.

RAVI SHARMA: Yeah, probably say less than or equal to.

RUSSELL KIRBY: But it would probably be better to define a rule based not on the actual values in the table but on whatever threshold you want to use.

RAVI SHARMA: Okay. And now what we can do is actually go to, where are we up here?

RUSSELL KIRBY: Now you need to do the buffer around that.

RAVI SHARMA: Which one did the —

RUSSELL KIRBY: Now you have to go into tools now and draw the buffer.

RAVI SHARMA: Draw the buffer. So that's, I mean, that's one way to do that in ArcGIS using attributes. So this way there's no confusion.
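The selection step, keeping only clusters at or below a chosen P value before buffering, amounts to a simple filter with the threshold as a parameter. The sketch below works on hypothetical rows standing in for the cluster table; the field names are assumptions and may not match the column names in the actual results file.

```python
def significant_clusters(clusters, threshold=0.05, p_field="P_VALUE"):
    """Keep only clusters whose p-value is at or below the chosen threshold."""
    return [c for c in clusters if c[p_field] <= threshold]

# Hypothetical rows standing in for the cluster table; field names are assumed
clusters = [
    {"CLUSTER": 1, "RADIUS": 0.20, "P_VALUE": 0.001},
    {"CLUSTER": 2, "RADIUS": 0.05, "P_VALUE": 0.300},
    {"CLUSTER": 3, "RADIUS": 0.00, "P_VALUE": 1.000},
]
to_buffer = significant_clusters(clusters, threshold=0.05)
print([c["CLUSTER"] for c in to_buffer])
```

Passing the threshold in as a parameter, rather than picking from the P values that happen to be in the table, keeps the rule the same from one run to the next.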

RUSSELL KIRBY: Yeah. But I would definitely caution everybody that this is a useful tool, but it's really a data exploration tool. And I don't think it would be a wise thing for you to put together atlases of SatScan analyses and put them on the web for everybody to look at. It's better to use this when you're exploring the data yourself and trying to understand patterns and relationships in the data. We do, I mean, you do see people publishing papers where they're using SatScan for various things, but I think it would be wiser to use it more as an exploratory tool.

RAVI SHARMA: The really important thing is even if you find a cluster you still have to explain why there's a cluster. Right? So you find a cluster. We find a primary cluster around that area. The next question is, why is there a cluster, what is causing the cluster? You don't want to do it purely statistically, the data in, data out type. So you need public health researchers. We need to figure out, if there's a cluster, what is responsible for the cluster, so that epidemiologically, and from social science, it makes some sense that there is some underlying cause. So this is just the beginning of your research project, where you identify a significant cluster. The next issue is you need to identify the underlying patterns and causes that have given rise to that particular cluster. Otherwise the job is not done. So you're just beginning your job.

RUSSELL KIRBY: Right. You already had the ability, using the Poisson distribution, to identify whether there are patterns in your data where observed exceeds expected more often than chance would allow, but this gives you the ability to look at the spatial component of that. Now, you're going to have problems when you use this kind of methodology for outcomes that are a little less frequent. You know, low birth weight is a relatively common outcome. In fact, I think preterm birth is a little more common than low birth weight. But this is a relatively common outcome. But say, for example, you were doing this analysis to see if there's a pattern in terms of, I don't know, a specific type of congenital heart defect. Maybe you're looking at patterns of hypoplastic left heart syndrome, and maybe you have five years' worth of data instead of one, and you have the ability to look at patterns by year as well as in terms of space. Well, you're dealing with a much rarer outcome, and I would bet that if you ran this analysis 20 times on that kind of scenario you would probably get some different answers, just because of the fact that there's a random component to it. So it's really best to use it as an exploratory tool, and not use it as the only way that you try to look at these questions.
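The non-spatial check mentioned here, asking whether an observed count exceeds its expected count by more than chance would allow under a Poisson distribution, can be sketched with the standard library alone. The counts below are made up for illustration: both have observed 20% above expected, one for a common outcome and one for a rare outcome.

```python
import math

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected)."""
    below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below

# Hypothetical counts: the same 20% excess at two very different frequencies
common_p = poisson_upper_tail(120, 100)  # common outcome
rare_p = poisson_upper_tail(6, 5)        # rare outcome
print(common_p, rare_p)
```

The same relative excess is much weaker evidence when the counts are small, which is one reason results for rare outcomes tend to be less stable.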

Okay. I think they're just setting up our break, so we'll give them about another minute or two to do that. Then we're going to take a short break. Then we've got a few additional topics and issues we wanted to raise and then we'll conclude. We're going to try to get done by 4:00. But I think probably right around now is when we ought to take a break. Henry is reminding us once again it's very important to fill out your evaluation forms. In fact, if you already filled out one and you want to do another, you can do that.


RAVI SHARMA: Before we break, some of you may want to run SatScan. So let me give you a little hint on how to prepare the data to bring into the program, because we already created the DBF files for you. You have to create DBF files, because most of your data is going to be in a format other than the one that SatScan uses. As you know, you'll have a shape file, right? And what you need is the coordinates, either as XY Cartesian or in the latitude, longitude format. So you do need that information. Let's assume that you have birth certificate data: the addresses of women with low birth weights, and with normal birth weights.

So in our case, our cases are babies that are low birth weight, right? Controls are women with normal birth weights. So we define low birth weight as less than 2500 grams, right? Everything else is normal birth weight. So you're going to create two files. One file is the case file, and what it needs depends on the model. The Bernoulli model only really needs the ID of the case.

UNKNOWN SPEAKER: Number of cases.

RAVI SHARMA: And the number of cases. Now if it's point, if you're using point data, it's just one, right? There's only one single low birth weight at one point. So it's really one. You don't have to do anything. And how many cases if you have cases, sorry, controls, it will also be one because you're using point data. And the program will itself add and create the numerators and denominators. So for the point data you just need to have two control files. I'm sorry. You need the case file and the control file. The case file will have the variable, the ID. It gives the locational information. The second is simply the case. You enter the number of cases in this point data. It's going to be five. That's your case file. The control file will be your controls. Normally you will have more controls, right? You will have multiple controls.

RUSSELL KIRBY: Actually, though, it is possible to use this methodology even in the context of a case-control study, where you actually have a set of cases and a set of controls, and you can do it in that way as well.

RAVI SHARMA: So the control file will be very similar to the case file. You have the ID, which is the location of your control, and the number of controls. If it's point data, it's going to be 1. So the last file you need is your coordinate file. And the coordinate file will have the location ID, which will match the ID for all of your cases and controls, right? And your X and Y data. Just like here. You can extract that from your shape file. As you know, your shape file has an attribute file, which is a DBF file. You can actually process the data either in SPSS or in SAS or whatever, create these three files, and then simply write the data out as DBF files.
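
The three inputs described above (case file, control file, coordinate file) can be sketched in Python as follows. The record layout, the IDs, the coordinates, and the column names are all hypothetical, and the sketch writes plain delimited text rather than DBF, so a conversion or import step would still be needed; the exact column order SatScan expects should be checked against its user guide.

```python
import csv
import io

LBW_THRESHOLD_GRAMS = 2500  # low birth weight is less than 2500 grams

# Hypothetical geocoded birth records: (location ID, x, y, birth weight in grams).
births = [
    ("B001", 1200.0, 850.0, 2300),
    ("B002", 1310.5, 790.2, 3400),
    ("B003", 1150.0, 910.7, 3100),
    ("B004", 1275.3, 880.0, 1900),
]


def build_bernoulli_inputs(records):
    """Split point records into case, control, and coordinate rows."""
    cases, controls, coords = [], [], []
    for loc_id, x, y, grams in records:
        coords.append((loc_id, x, y))
        # With point data each location contributes a count of exactly one.
        if grams < LBW_THRESHOLD_GRAMS:
            cases.append((loc_id, 1))
        else:
            controls.append((loc_id, 1))
    return cases, controls, coords


def to_csv_text(header, rows):
    """Render rows as delimited text; converting to DBF is a separate step."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


cases, controls, coords = build_bernoulli_inputs(births)
print(to_csv_text(["ID", "CASES"], cases))
print(to_csv_text(["ID", "CONTROLS"], controls))
print(to_csv_text(["ID", "X", "Y"], coords))
```

The same location ID appears in the coordinate file and in either the case or the control file, which is what lets the program join counts to locations.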

Now, if you want to run the Poisson, some of you are interested in controlling for race. There are some controls for covariates here, but Bernoulli cannot do any controls. If you want to control for (inaudible) you want to use the Poisson model. That will control for, let's say, race and ethnicity. You can have a 0/1 variable and then run the Poisson model. As you know, when you go to the drop-downs, you make sure you specify Poisson, and it will have an additional input for the covariate.
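
For the Poisson model with a 0/1 covariate, the input counts have to be aggregated by area and covariate level. The following is a minimal sketch under stated assumptions: the tract IDs, the race coding, and the records are invented, and the row layout (area, count, covariate) omits any time fields, so the precise file format should be confirmed in the SatScan user guide.

```python
from collections import Counter

LBW_THRESHOLD_GRAMS = 2500

# Hypothetical birth records: (census tract ID, race coded 0/1, birth weight in grams).
births = [
    ("T01", 0, 2300),
    ("T01", 0, 3200),
    ("T01", 1, 2400),
    ("T02", 1, 3500),
    ("T02", 1, 2100),
    ("T02", 0, 3300),
]


def aggregate_for_poisson(records):
    """Count cases and total births per (tract, covariate) stratum."""
    cases = Counter()
    population = Counter()
    for tract, race, grams in records:
        population[(tract, race)] += 1  # every birth is in the denominator
        if grams < LBW_THRESHOLD_GRAMS:
            cases[(tract, race)] += 1  # low birth weight births are the numerator
    case_rows = [(t, cases[(t, r)], r) for (t, r) in sorted(cases)]
    pop_rows = [(t, population[(t, r)], r) for (t, r) in sorted(population)]
    return case_rows, pop_rows


case_rows, pop_rows = aggregate_for_poisson(births)
for row in case_rows:
    print("case", *row)
for row in pop_rows:
    print("population", *row)
```

Each stratum gets its own row, so the expected counts can be adjusted for the covariate rather than pooled across it.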

It's very simple. If you have the input files, it's easy to run them in SatScan. If you go to the National Cancer Institute website, you can actually get a version of SatScan that runs as a module under ArcGIS. So you don't even have to leave ArcGIS. So go to the National Cancer Institute GIS section and you'll have free tools available.

As you know SatScan was made possible through a grant from NCI to Martin.

RUSSELL KIRBY: Using the version that works within ArcGIS you don't lose a lot compared to this newest version, because the newest version has, they have ordinal and exponential methods which I doubt hardly anybody is going to actually use.

RAVI SHARMA: Let's see if I can find — do you remember the website?

RUSSELL KIRBY: I think it's just NCI.

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: What was that?

UNKNOWN SPEAKER: (Inaudible).

RUSSELL KIRBY: GIS.cancer.gov.

RAVI SHARMA: Www?

UNKNOWN SPEAKER: No W's.

RUSSELL KIRBY: Just GIS.cancer.gov.

RAVI SHARMA: GIS dot.

RUSSELL KIRBY: Dot cancer.gov.

RAVI SHARMA: Tools.

RUSSELL KIRBY: There you go. SatScan. Is it there?

RAVI SHARMA: Yeah. And there should be somewhere where it will give you some further information. We need the tools that they have created. How do you like the name Head-Bang? Very interesting. GeoDa is now an accepted GIS tool —

RUSSELL KIRBY: They might have taken that off with the new version.

RAVI SHARMA: I just downloaded it not too long ago.

RUSSELL KIRBY: Okay.

RAVI SHARMA: Spatial statistics image analysis. That's not the one. We will have to figure out where this is. Miscellaneous tools.

UNKNOWN SPEAKER: (Inaudible).

RAVI SHARMA: I'm sorry? By the way, this is the ColorBrewer we already talked about. Map engine. Do you see anything here?

RUSSELL KIRBY: No, I don't. I think they might have taken that off. But it's probably still available somewhere.

RAVI SHARMA: Okay. So much so for —

RUSSELL KIRBY: That's the website for Maptitude. We've been using ARC products as the primary GIS software, but if you're in an agency where it's very expensive to get ARC products, which is not uncommon, Maptitude is software you might want to look at as an alternative. It's a product you can buy. It's probably around $800 for a copy, but it has no annual license. You own the copy in perpetuity. And it's actually a very good package as a stand-alone GIS package.

RAVI SHARMA: How do you get to the Long Island site here? Long Island cancer?

RUSSELL KIRBY: It should be on that page right there, yeah. And probably on the geographic information system, they have a variety of tools that they posted here, the downloadable statistical extensions.

RAVI SHARMA: I think that's the one.

RUSSELL KIRBY: Yeah.

RAVI SHARMA: There we go. So this is the one you want to use, cluster analysis.

RUSSELL KIRBY: Yeah.

RAVI SHARMA: If you go to that. I knew it would take some time to get to it. So it will be www.healthgis-li.com. And it has extensions for ArcView now. It has a disease rate calculator, area interpolator, empirical Bayes and cluster analysis. So for those who already have ArcGIS, this would be good. I have used the cluster analysis. I downloaded it a few weeks ago, and it automatically loads into ArcGIS and is available to you as a tool. So use that. All right. Russ.

RUSSELL KIRBY: You know, I think we probably should take a short break right now and then we've got a few things we wanted to wrap up with. So, maybe about 15 minutes.