MCHB/EPI Miami Training — December 5 - 6, 2005
Exploratory Spatial Data Analysis of MCH Data Using GeoDa — Transcript
RUSSELL KIRBY: Okay. One other thing, if you want hard copy of the ARC exercises and you didn't get a copy yet, we have a whole stack of them right over here. There should be enough that everybody could have one. If you got one yesterday I think we made about ten copies yesterday. So just, in fact, why don't you do — why don't you do this? You just walk down here and anybody who wants copies of those — and then if you want a copy of the ArcGIS demo we've got some additional copies of that, too. So what we're going to do today, we're going to do three topics. And they are more or less what's in the agenda. Although the timing to be allocated might be a little different. What we'll start out with I think is we're going to go through the census download now that we have a computer that runs a lot faster so we can actually show you how it works. And we're also going to do it for a small state we decided we'd use the state of Delaware to do that. Again, in the interests of showing the whole process in a way that's functional so that you can see how it works. I don't know if it's a good idea for everybody — do you think it will work if they try to do it, too, or do you want to just demo?
RAVI SHARMA: Actually, the process I demonstrated last evening was actually real. Downloading, because I downloaded the New Jersey census tracts and linked them with the map. If I did it again it would be the same process but for a smaller state. The idea would be if you wanted to work along with me.
RUSSELL KIRBY: We'll do that. Did you have a question? The other two things that we're going to do today are we're going to work through how you use GeoDa. And we have you know some background on what GeoDa is. And you all have it on your desktops so we can run it. We'll do the same thing with SaTScan so you can see how that works as well. But I thought to start out the day, we would have a little bit of received wisdom. Because we had time actually to go to the mountain last night and we discovered that actually there was quite a bit of received wisdom we could draw down from the mountain. And found out that there actually are Ten Commandments of public health GIS.
And here we have a few of them. And I don't have these memorized so I need to look at the screen, too. But the tenth commandment of public GIS is thou shalt not expect health outcomes or disease states to respect administrative boundaries. They don't.
But on the other hand, if you don't collect any geography with your health data, then you have nothing that can be mapped. So we need to have geography.
The ninth commandment, thou shalt not unknowingly commit spatial errors. You know, that one actually is probably going to be difficult to live out in practice. But we might be able to.
Number eight, know thy purpose. And the corollary of number eight is thou shalt always be cognizant that the scientific method is not a built-in feature of any GIS software. It's not. And in fact we should always remember that GIS only allows us to visualize data. It doesn't actually give us a framework where we can necessarily test hypotheses. It's a tool that we can use.
Number seven, thou shalt know and understand thine data before bringing it into a GIS. So just the fact that we have data that can be put into the GIS doesn't mean that we shouldn't fully understand our data. Just as well as we would understand any other data that we want to analyze.
Number six, thou shalt remember that while thine map is an abstraction, it reflects the physical environment and is based on actual data representing actual events that occurred to real people. And we need to always remember that a map is a two-dimensional representation, basically in a Euclidean geometric framework, where we have removed things like housing projects that might be in a census tract that might be middle class, or large employers or mountains or rivers. We typically have removed them, but the real world is a very complex place and maps are always abstract representations of that reality.
Number five, thou shalt protect individual records containing XY coordinates exactly as thou would protect the records with individual identifiers, as both can reveal confidential information. When you geocode a record, you are basically putting some very specific identifiable tags on that record. And even when you mask the information, you know, by using a spatial masking technique, for example (we don't have time to go through how to do that), if you let people see the individual records, these are still close to identifiable.
Number four, thou shalt not clutter thine health data maps with unnecessary layers and map elements, nor shalt thou ignore information necessary to interpret the patterns on your map. That's self-explanatory. If anybody has seen any of the books by Edward Tufte, The Visual Display of Quantitative Information being one of his books, there are many classic examples of what happens when you put too much on a map and make it difficult to read. An addendum to number four: The real art of cartography is knowing what to leave out more than what to put in.
Then number three, know thine metadata. It's actually related to one of the earlier commandments but it's worth emphasizing again.
Number two, thou shalt not assume that the default settings of your GIS software will generate a useful and meaningful map. That's actually been a truism throughout the entire history of automated mapping. If you want to get a lousy map use the default settings. If you want to get a lousy chart use the default settings of Microsoft Chart and Excel and so on.
Finally, number one of The Ten Commandments of Public Health GIS, thou shalt show humility to others and be gracious even to those who thought it would take weeks to accomplish what thou has done in a few hours. We also have to be humble in our work. So those are the received wisdom of the Ten Commandments of Public Health GIS.
When we get through with the session and post everything, I'll make sure this is included in the materials so you can all access it.
What I'm going to do is we'll let Ravi come to the microphone and switch this cable so that the computer —
RAVI SHARMA: All right. If you recall one of the — we wanted to tie some loose ends from yesterday. As you know, with all this barrage of questions being directed, I think I slipped in one of the answers I gave. So we would like to correct that and so we actually have a small presentation. If you recall, you were interested specifically in how would you create new regions within state, depending on, this could be health service regions or metropolitan region, whatever. They're all made up of different counties of different sizes and shapes.
So what I did point out is you can create a numeric ordering, one, two, three, four, which is correct, but the function that you use in GIS to collapse all those different polygons into a new shape is called dissolve.
So Dianne is actually going to demonstrate to you very soon how you actually do it. So Dianne is going to take a few minutes and then we'll go on to GeoDa and the rest of the session for today.
UNKNOWN SPEAKER: Question. Yesterday at the end of the day you went quickly through that whole FTP thing. And I'm wondering if there will be notes, any notes that reflect the sequence of what you did.
RAVI SHARMA: Actually, there is a whole FTP — I need another cup of coffee. There is a slide presentation that was — but that's very difficult to read. So we're going to redo the slide presentation. It's not legible. The slide presentation is very fine print. So what we're going to do is we're going to send it to all those who are registered for this conference, we will e-mail it to you — this one up here. That's okay. But it's very difficult to read. I mean I would probably need to change my glasses to read that.
So we will send you an updated full length presentation, color. PowerPoint. So you can use it on your PCs.
Yeah, it's all there. But you can read — that's very difficult to read.
DIANNE ENRIGHT: Henry, will you dim the lights just a little bit. I think it — yeah, it's just a little easier to see the screen, I think. Just a little bit. Is that good? That's good.
Okay. So going back to yesterday's question on how you would create a health service area: What you would need to do is somehow in your attribute table either add a field directly to what you want to define those service areas to be. For example, we have Pennsylvania counties. So I added a field that's called Health Service or health SER. And I just added numbers. They're defined as text. But I added numbers to define which counties in which service area.
So you can either do that directly by adding the data to your table or by doing a join. Then you would go to your ArcToolbox within data management tools under generalization, there's a tool called Dissolve.
And what this does, basically it dissolves polygons. So you can create larger polygons from smaller polygons. A good rule of thumb, whenever you're in doubt about what you're doing, copy your shapefile and test it. You can't hurt it if you just make a copy and you save your original data. But most of the time in ArcMap these days it prompts you to create a new output, so it's not really changing the original data. This tool actually does ask you to name a new output. So I'll just take the default name, which is pacountydissolve.shp, and you select which field you want to dissolve on, and I want to dissolve on the health service areas as they were defined. All I have to do is click "OK." It will create a new shapefile and add it to my session.
But it didn't draw. Hmmm, and something didn't work.
Okay. It has to be numeric. So that's simply solved. We'll add a new field. This is also another good question we had yesterday. Everybody wanted to know could you redefine your field? You can't. But you can set it to be equal to another field. Now I want as number. I can't see what I'm doing.
So we'll do it again. My original PA County. On the new field — it's still not drawing. I wonder why. Hmmm. Great. That didn't work. Oh. Let me — it's still not working. I did it yesterday. Of course.
RAVI SHARMA: (Inaudible) can we see the attribute file.
DIANNE ENRIGHT: That's the way it always works.
UNKNOWN SPEAKER: (Inaudible).
DIANNE ENRIGHT: They should automatically appear, yes. If you can see, the attribute table is completely empty. So there's a problem with it. I'm not sure why. It may be that — maybe I'm not writing to the correct space.
UNKNOWN SPEAKER: (Inaudible).
DIANNE ENRIGHT: Could be. Let me put it to a different space. No, that's not it either. Well, I'll figure it out. We'll get back to you. Let's go on to something else.
RAVI SHARMA: I do know she did it last night, evening. It worked.
UNKNOWN SPEAKER: (Inaudible).
RAVI SHARMA: Okay.
UNKNOWN SPEAKER: Question, if for example I wanted to group counties in public health districts I'd have to do a recode I would guess of some kind.
RAVI SHARMA: A re —
UNKNOWN SPEAKER: I would need to be able to pick the counties out of the field (inaudible).
RAVI SHARMA: Yes.
UNKNOWN SPEAKER: And so forth. I would expect that (inaudible).
RAVI SHARMA: No, all —
UNKNOWN SPEAKER: Create a new field.
RAVI SHARMA: Yes, create a new field, and the new field would be similar to what Dianne did: you have a numeric field that says which county belongs to which district, right. Yes.
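The recode described in this exchange amounts to a lookup from county to district, after which the counties sharing a code are grouped together. A minimal sketch of that attribute-side grouping in plain Python follows; the county names, district codes, and birth counts are all hypothetical, and the merging of polygon geometry that the Dissolve tool performs is not shown.

```python
from collections import defaultdict

# Hypothetical lookup: county name -> numeric district code (illustrative only).
county_to_district = {
    "County A": 1,
    "County B": 1,
    "County C": 2,
    "County D": 2,
    "County E": 3,
}

# Hypothetical per-county birth counts to total up within each district.
county_births = {
    "County A": 120,
    "County B": 80,
    "County C": 300,
    "County D": 45,
    "County E": 510,
}


def group_by_district(lookup, values):
    """Return {district: (sorted member counties, summed value)}."""
    members = defaultdict(list)
    totals = defaultdict(int)
    for county, district in lookup.items():
        members[district].append(county)
        totals[district] += values[county]
    return {d: (sorted(members[d]), totals[d]) for d in members}


districts = group_by_district(county_to_district, county_births)
for code, (names, births) in sorted(districts.items()):
    print(code, names, births)
```

Keeping the district code numeric mirrors the point made in the demonstration that the dissolve field was expected to be a number rather than text.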
Sorry about that, Dianne. We'll try this again. So what we're going to do is we're going to — I don't know whether this is true or not. We're going to GeoDa. Is that a Latin dance? Is it salsa? So what you have on your — we have to go through a little — we're going to go through a little procedure here which is we need to install GeoDa. It's not installed in your PC. But it's in your MCH/GIS folder. So if you go to — can I get this to also display on my screen?
UNKNOWN SPEAKER: No, you can't.
RAVI SHARMA: So I'm going to click out of here, and we're going to go to — so you want to go to your GIS MCH folder applications. And you see a GeoDa set up. And click on this set up. It will begin to install GeoDa. Next, just follow the instructions on your screen. And just use the typical installation. Install. That's not good.
RUSSELL KIRBY: If you want, we can just go out to — I don't know why that could be a problem. I was able to —
RAVI SHARMA: It's installed, actually.
RUSSELL KIRBY: See if it works. Does everybody else get to this point?
RAVI SHARMA: Is everybody here? Just ignore — if you get some error message try to ignore it and see if it completes the installation process. And then on your desktop you should have the GeoDa icon, just like this one up here at the bottom, and you click on that to start.
RUSSELL KIRBY: Also, if you want to write down, if you're taking notes. If you put GeoDa into Google it will take you directly — this particular site where you can download this if you want to have it back home where you live and work and play. But technically the software is free to have but they do like people to register so that they can get, they'll send you information about updates and so on. So it's a good idea to actually draw down directly from the website. But they gave us permission to make it available to you today in this way.
UNKNOWN SPEAKER: The start button is all the way underneath the wall. Can we move it three or four inches.
RAVI SHARMA: Could we do that? Can you move that back a little. I don't think we want to try that.
On your folder, you should have a demo on exploratory spatial data analysis using GeoDa, an introduction. It should be in your MCH folders.
Now, this is, you know, a pretty — I use this one also in my class. It's been field tested. So what we're going to do is I'm just going to go very quickly through the demo. But then we'll actually go and work with GeoDa. Demo is one thing, but you really need to work with the actual software.
So I'm actually going to go a separate route. But I just wanted to show you, there is a demo. I'm actually going to take you, it's very hands-on. So those of you who need to go back later on when you get back home and you want to refresh your memory, you can go and look at these PowerPoints. So let's go back to — I'm going to exit out of here. We're going to go back to GeoDa. And we're going to open a file. So as you can see, the commands on GeoDa are very similar to what you would find in any windows program. You have file, view, tools. Tools and methods are additional then you have a help file. And you have a GeoDa project setting. And the great thing about GeoDa is that it uses shape files. So it's very easy to work with ArcView, with GeoDa and you can simply import your ArcGIS files. The one thing it does need is what's called a key. A key variable.
The key variable has to be numeric. It can't be text. So, for example, in ArcView, typically the FIPS code or whatever variable you use as your key variable is a text field. But in GeoDa a key variable has to be numeric. You're going to use the same file we used yesterday when we were playing around with ArcMap. It's the PAMCH file. So you will click and open and then navigate to the MCH folder where you have the data stored.
So in my case this is where I stored it. As you can see, it's asking for a key variable. Now you can either create a key variable or you have to pick a key variable and I think you can pick County as your key variable.
UNKNOWN SPEAKER: What's the (inaudible).
RAVI SHARMA: The key variable actually is important because it allows you, if for some reason you want to export this file to ArcGIS, to link it back. So there should be some relationship between your key variable and your FIPS code or whatever you're using as your key field.
So that's really the way in which it will link back to ARC. It will, you know, link. So it's important for the key variable to be some meaningful variable in that sense.
UNKNOWN SPEAKER: What did you select?
RAVI SHARMA: I selected County. So you can select County as your key variable. The other thing you need to remember is the key variable has to be numeric. If it's text, it will not work.
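Where the only available identifier is a text code, one option is to derive a numeric key from it before opening the file. The sketch below is a minimal illustration under that assumption; `numeric_key` is a hypothetical helper, not a GeoDa or ArcGIS function, and the sample codes are illustrative.

```python
def numeric_key(code_text):
    """Convert a text code such as '42101' to the integer 42101."""
    cleaned = code_text.strip()
    if not cleaned.isdecimal():
        raise ValueError(f"not a purely numeric code: {code_text!r}")
    return int(cleaned)


# Illustrative five-digit county codes stored as text, one with stray spaces.
codes = ["42001", "42003", " 42101 "]
keys = [numeric_key(c) for c in codes]

# A key is only useful for linking back if every value is unique.
if len(set(keys)) != len(keys):
    raise ValueError("duplicate keys after conversion")

print(keys)
```

One design caveat: converting to an integer drops leading zeros (the text `01001` becomes `1001`), so the original text field should be kept alongside the numeric key if the file will be linked back to other tables.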
So what I'm going to do is just explain to you all the different features and essentially where we want to end up today is at the point where we will compute, calculate spatial auto-correlation. And I'm assuming most of you are familiar with some of these terms. I'm going to have — there's some explanation of these terms in GeoDa. So the first thing, when we look at this is very similar to a shape file. And I'm going to maximize this.
So this is very similar to a shape file in ArcGIS. And if you click on this obviously has an attribute table, you know. And you can display that attribute table by clicking on this icon here that says table. And here's the table. So if you look at this table, this is the same table — this is the same table that we looked at yesterday when we were creating all different features in ArcMap. You have the area and you know a lot of different features. And I don't need to explain all these variables that we looked at yesterday.
So what we would like to do is when we talk about spatial exploratory spatial data analysis, what we're really saying is we would like to explore the spatial structure of our data. And exploration is actually a three-part field, it has three components. One is visualization. You visualize your data, just as you would any kind of data.
And then the second is exploring the structure of your data. And the third, based on visualization and exploration of the data structure, is fitting a model. So those three make up what we call ESDA, Exploratory Spatial Data Analysis: visualization, exploration of spatial data structure, and the third one would be model fitting. Now we won't be able to do the model fitting, but you could fit spatial models here based on the results of your visualization and on the data structure. If you see significant presence of spatial autocorrelation, then you cannot use your normal multiple linear regression, which totally ignores spatial multicollinearity. You need to fit models that would incorporate spatial autocorrelation into the model.
Okay. So what we have — so what I'm going to do now is you know since most of us in our daily routine, what we do is we calculate rates. We map rates. We test for presence of spatial auto correlation in the rates and then we would fit the model. So what I'm going to do is straightaway we're going to start figuring out how do we calculate rates. How do we map those rates and then we're going to talk about how we smooth, because these rates are — some of these rates you will see, as you saw yesterday when you were looking at low birth weight. Some of these counties have low birth weights that are in many cases less than 10. That creates a lot of variance, you know, in statistics, the variance is you know is going to vary simply because of the size. And that's not good.
So we want to incorporate — we want to make sure the variance is taken into account and we want to stabilize the variance. So we have different ways in GeoDa to stabilize the variance. So we're going to talk about that.
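One widely used way to stabilize rates from small populations is empirical Bayes shrinkage, which pulls each county's raw rate toward the overall rate, more strongly where the number of births is small. The sketch below uses a common method-of-moments formulation; it is an assumption-laden illustration with hypothetical counts, not a statement of how GeoDa computes its smoothed rates.

```python
def eb_smooth(events, bases):
    """Shrink raw rates toward the overall rate; small bases shrink the most."""
    n = len(events)
    total_base = sum(bases)
    m = sum(events) / total_base  # overall rate, used as the prior mean
    raw = [e / b for e, b in zip(events, bases)]
    # Base-weighted variance of the raw rates around the overall rate.
    s2 = sum(b * (r - m) ** 2 for r, b in zip(raw, bases)) / total_base
    # Method-of-moments estimate of the prior variance, truncated at zero.
    a = max(s2 - m / (total_base / n), 0.0)
    smoothed = []
    for r, b in zip(raw, bases):
        w = a / (a + m / b) if a > 0 else 0.0  # weight on the county's own rate
        smoothed.append(w * r + (1 - w) * m)
    return raw, smoothed, m


# Hypothetical low birth weight counts and births for four counties.
lbw = [1, 20, 30, 900]
births = [40, 100, 400, 10000]
raw_rates, eb_rates, overall_rate = eb_smooth(lbw, births)
for r, s, b in zip(raw_rates, eb_rates, births):
    print(f"births={b:>6}  raw={100 * r:6.2f}%  smoothed={100 * s:6.2f}%")
```

The county with 40 births ends up close to the overall rate, while the county with 10,000 births barely moves, which is the behavior wanted when the concern is unstable rates from small denominators.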
So first thing: Let's look at how we will calculate rates and how do we map them. Now, you can, of course, do this also in ArcGIS, but we're in GeoDa. We're going to learn how to do this in GeoDa. So we're going to go and — unfortunately I can't see anything on my screen here. So I have to look at — so what we need to do first is as in ArcGIS, we need to create a new variable. So we're going to calculate, just as we did in ArcMap, we want to calculate low birth weight. And what we want to do is add low birth weights for — let's do it just for the heck of it we're going to do, we'll recalculate low birth weights for just one year, look at it. And then we'll go back and do fancy stuff with it. So let's do just for one year, just to see what it looks like.
Okay. So what we're going to do here is I am going — the number of different ways — if you right click — if we right click anywhere on the table and you can bring up an add column. Can you see add column? I can't. Okay. So we can add a column here. Add column. So let's give it a name. And we're going to do — let's do calculation for low birth weight in 2000. So we'll say, give the name LBW2000. And, oops. LBW. Okay. And since this is a rate, I think let's put here R just to — add the column and you will see that there's a column that is added here to the end. Right? Is everybody with me?
Feel free, as I talk, feel free to experiment. You don't have to do exactly as I do. If you feel comfortable going in any other direction. Okay. So now that we have another column added, we can now go. Again we right click and do you see this in the drop-down field calculation, we're going to click on Field Calculation. This is the work horse for calculating all kinds of rates and also for doing what's called lag operations and all kinds of binary operations. You can multiply here and subtract, add, whatever. So this is a very important part of GeoDa, because it does all the calculations for you for rates and basically for rates.
Okay. So what we're going to do is we're going to put the result in the new field that we just created called LBWR, I think I missed a zero there. So we're going to put the results in here. And what kind of an operation are we doing here? It's a rate operation. You should always click on the rate operation first, because it otherwise initializes. So click on the rate operation first. Determine where you want to put your results and then choose a method here. If I click on the drop-down box, you have several different choices here. You have a raw rate, an excess risk. All epidemiologists would know excess risk. We can do empirical Bayes. We can calculate a spatial rate, a spatial empirical Bayes, and empirical Bayes rate standardization. We're just going to do the very simple raw rate.
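Of the methods in that list, excess risk is the easiest to show next to the raw rate: it is the ratio of observed events to the number expected if every county had the overall rate. A minimal sketch follows, with hypothetical counts and a hypothetical function name.

```python
def excess_risk(events, bases):
    """Observed events divided by events expected at the overall rate."""
    overall = sum(events) / sum(bases)
    return [e / (b * overall) for e, b in zip(events, bases)]


# Hypothetical counts: overall rate is 100 / 1000 = 0.10.
lbw = [5, 10, 85]
births = [100, 100, 800]
risks = excess_risk(lbw, births)
for b, r in zip(births, risks):
    print(f"births={b:>4}  excess risk={r:.4f}")
```

A value above 1 means more events than expected, a value below 1 means fewer. Like the raw rate, this ratio is still unstable when the denominator is small.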
We at the moment have no weights. We'll come back to that feature. And our event variable is — the event is simply the numerator. So we're going to pick a numerator here. So let's see if we can find the LBW for 2000 here.
Right here. And the base variable will be the births for 2000. So let's see if we can locate where the births are.
Now, did we decide this was the one?
UNKNOWN SPEAKER: Yes.
RAVI SHARMA: Okay. I think we have everything we need. We have our event variable, the numerator, and we have our base variable, which is the number of births in 2000, and then we can simply say apply. And the calculations are reflected in the last field. Now, as you can see, the calculations are simply ratios, right? We simply divided one figure by another figure.
Low birth weights are typically expressed in you know on the base 100. So we can actually go and convert that into a percent very easily. So I'm going to show you how you would go back and convert this field into a percent, so it's more easy to interpret. I mean the proportions, the ratios are not easy to interpret.
So go back, right click again. You can actually right click anywhere. It doesn't make any difference.
And we're going to go back to field calculations. So what we're going to do is you're going to take the LBWR, in this case it's 200, this is the low birth weight rate. The ratio we just calculated. And we are going to then multiply that by 100. That's a thousand. What did I do?
UNKNOWN SPEAKER: (Inaudible).
RAVI SHARMA: Oh gosh. That's why. It's very difficult to see from this angle. So since this is a thousand, what can I do to get back to 100? Divide by 10, right? So let's go see — well, you know, this is good because this means we can learn additional features of this here. So now we learn how to divide. So simply go back here.
All right. See how easy it is. One of the problems with this program is it's so easy to use that the chances of it being abused are quite high.
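The arithmetic in this stretch of the session is small enough to write out: the raw rate is events divided by base, multiplying by 100 gives a percent, and an accidental multiplication by 1,000 is undone by dividing by 10. The sketch below uses hypothetical counts.

```python
import math


def raw_rate(events, base):
    """Raw rate as a plain ratio: event count divided by the base count."""
    if base <= 0:
        raise ValueError("base must be positive")
    return events / base


# Hypothetical county: 18 low birth weight births out of 200 births.
ratio = raw_rate(18, 200)

percent = ratio * 100        # the intended conversion to a percent
per_thousand = ratio * 1000  # the slip: multiplied by 1,000 instead
corrected = per_thousand / 10  # dividing by 10 recovers the percent

print(f"ratio={ratio:.4f}  percent={percent:.2f}  corrected={corrected:.2f}")
print(math.isclose(percent, corrected))
```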
Now that we have calculated this, as you know, we have made — we haven't yet looked at — so what we need to do now is we need to explore what this percent looks like spatially. Because we know it's based on one year of data and we know we are going to get some really weird results when we look at this data. So let's go and look at this data and explore it. We're going to do a visual exploration first. So what we're going to do is we're going to go to explore. We're going to first look at the histogram. So go to explore. You see the explore. Are you all with me? Explore. Okay. We're going to look at the histogram.
We're going to look at the histogram. And click okay.
So we are now in what we call the, it's a combination visualization data exploration phase, and this is a histogram — you know this is very interesting, actually, because it looks like a normal, right? This looks like a pretty good normal distribution, which is amazing. Now, we can look at some of these very carefully. So if you click there, there is a link between, there are three links here. Actually, there are three links, because this will be considered a link. So if you click on any of the bars up here it shows up in your map.
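The link between a histogram bar and the map is possible because each bar is just the set of counties whose value falls in that interval. A minimal sketch of that bookkeeping follows, using equal-width bins and hypothetical rates; how GeoDa chooses its intervals is not assumed here.

```python
def histogram_bins(values, n_bins):
    """Assign each named unit to one of n_bins equal-width intervals."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        raise ValueError("all values are identical; cannot form intervals")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for name, v in values.items():
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(name)
    return bins


# Hypothetical low birth weight percentages by county.
rates = {"A": 2.5, "B": 6.0, "C": 7.1, "D": 7.8, "E": 8.4, "F": 10.9}
bars = histogram_bins(rates, 4)
for i, members in enumerate(bars):
    print(f"bar {i}: {len(members)} counties -> {members}")
```

Selecting a bar then amounts to selecting its member list, which is what gets highlighted on the map and in the table.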
So there are three counties that are interesting. We need to look at these very carefully. One is Philadelphia, and I forget what this County is, and also this County in southwestern Pennsylvania. Right? So what I'm going to do is, since I'm a little concerned about this, I'm going to click here and I'm going to click on, do you see promotion? Do you see promotion there? When you right click, click on promotion. All the three selected counties are promoted. So now we can actually look at this more carefully to see what's going on with these three counties. So I would like to look at, first, the low birth weight. Let's see, where are we?
So I would like you all to look at what the number of low birth weights are in these three counties.
Are we near, right here, right?
So what you see here is this one has a low birth, 40 low birth weights. There are 18 here, and then this is Philadelphia, obviously. Right? We're 2, 3, 7, 1. And do you see anything interesting there? And if you look at — these are birth weights, low birth weight proportions, percentages we calculated. 10, 11 and 10.
So these are three that are on the high end here. The only one that would worry me, which one would worry you?
UNKNOWN SPEAKER: Philadelphia.
RAVI SHARMA: Philadelphia. And so we need to — so let's look at some of these other ones.
UNKNOWN SPEAKER: Do you want to say why (inaudible).
RAVI SHARMA: Yeah, why?
So the reason why would it worry you, it has to look — you need to look at — the one — LBW — the one that would really worry me the most is actually Philadelphia is this one up here, 18. This one looks — this is 40. The question here is: Does it in epidemiology and public health does it reflect the true underlying risk of low birth weight. That's what we want to measure. And the question is, does this, do we have sufficient numbers here to be able to have a stable rate calculated based on just one year data.
Now, Philadelphia has, you know, a relatively large number of low birth weights, 2, 3, 8, 4, and the number of births is over 21,000. So I'm probably not worried as much about this, but I'm worried about where are you looking at — this one up here. Because we calculated the low birth weight for 2000. And then this is 18. Now, this doesn't inspire a lot of confidence. But 40 is not too bad. So this is really interesting, because, except for Philadelphia, those two others are rural counties in Pennsylvania. And so the reason I find it interesting is because typically you think of low birth weight as primarily a problem in urban, you know, ethnic communities. Not in rural communities. And so I'm kind of a little puzzled that we, that these two — I mean 18 is, you know, it's a reasonable number. But so we will — but those are the questions that it should be raising in your head, right, which is what Russ was talking about: know thy data. That's one of the dictums. This is the train of thought that should be starting in your mind now as you look at this data. Is this reflecting the true underlying risk or is it basically an (inaudible) reflect of the data.
I don't think that the Philadelphia numbers are. They're relatively large, but we can — so let's leave these questions perk, let them percolate as we go on and look at the counties, other counties.
So I'm going to simply — you can click anywhere and it will unselect. You have to be very careful about selection here. So let's now look at this one County here. This would be — would anybody want to guess what kind, what we're going to find here in this one County outlier all the way to the left? I'm going to click on this. Click on it. Actually, you can see this, this one had only one low birth weight. And the number of births were 40. So there's not much we can do with just that data, right?
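Why one low birth weight among 40 births says so little can be seen from the binomial standard error of a proportion, which grows as the number of births shrinks. The sketch below is a rough illustration: the small county uses the 1-in-40 figure mentioned here, while the large county's counts are assumed round figures of the magnitude discussed for Philadelphia.

```python
import math


def rate_with_se(events, births):
    """Return (rate, standard error), both in percent, under a binomial model."""
    p = events / births
    se = math.sqrt(p * (1 - p) / births)
    return 100 * p, 100 * se


small = rate_with_se(1, 40)         # the one-event county from the session
large = rate_with_se(2384, 21000)   # assumed counts for a large urban county

for label, (rate, se) in (("small", small), ("large", large)):
    print(f"{label}: rate={rate:.2f}%  standard error={se:.2f} points")
```

The small county's standard error is about as large as its rate, so one more or one fewer event would change the picture entirely, which is the case for pooling several years or smoothing.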
So that shows up in your — as an outlier. So that really is a true outlier. And it is a county that you will have to deal with when you do any kind — we have just one year data. So what we want to do next is to add data for two, three years. But I just wanted to get you started to think about the different features. So you can click on any one of this. And here's another one. So they're 32 — 12. And in this bar here —
RUSSELL KIRBY: Yeah, one of the things that this is useful for, actually is to learn whether there are potentially spatial patterns in the incompleteness of your data. Because if you were to see a pattern, and I don't know that this can occur in Pennsylvania, but in Wisconsin where I used to work, for example, occasionally there would be years when the great state of Michigan neglected to send the out-of-state birth certificates. And so if you had a tool like this, you would very quickly be able to see, you know, just based on the numbers of births and on the rates, sometimes there are differences in terms of where high risk babies are born in relation to low risk babies that might be reflective of reporting issues. So this is a very useful tool to again just visualize quickly in map form and be able to look at the patterns in the underlying data.
So definitely think about it in terms of that.
RAVI SHARMA: Okay. Very quickly — thank you Russ. Very quickly we're going to look at another tool. I'm sure you've seen these box-and-whisker plots, right? One of the EDA tools, exploratory data analysis. So if you go to explore. I just want to show it to you. And it's called a box plot. And we're going to do the box for the same variable. And you can see what it identifies here are these outliers. So look at this one up here. Right? Oops. So it immediately identifies first of all, so this one is — oh, up here. If you recall, this also appeared in our histogram. Let's see, where is — so you can actually click on these and it will — so here you can look at it. So this is, according to the box-and-whisker, the box plot, this looks like an outlier. So something that we need to — and we thought it might be based on the histogram we looked at. Let's look at — I'm looking at this one up here. And if you'll remember, this one also appeared in our histogram, right? These two appeared in our histograms. So those, too, look like outliers based on the box plot.
And this one up here on the lower end is the one that actually has no data. That's one with just one low birth weight. So that obviously is one which we will have to either ignore completely or we will have to spatially smooth it to make it meaningful. Otherwise, as we truly cannot use data for this County here.
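The outliers a box plot flags come from a simple rule: anything beyond a multiple of the interquartile range from the quartiles. The sketch below assumes the conventional 1.5 multiplier and Python's "inclusive" quartile method; the rates are hypothetical, and a given package's quartile convention may differ slightly.

```python
import statistics


def box_plot_outliers(values, hinge=1.5):
    """Return (low outliers, high outliers) using quartile fences."""
    q1, _, q3 = statistics.quantiles(values.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    low = sorted(name for name, v in values.items() if v < lower_fence)
    high = sorted(name for name, v in values.items() if v > upper_fence)
    return low, high


# Hypothetical low birth weight percentages by county.
rates = {
    "A": 2.5, "B": 7.0, "C": 7.5, "D": 8.0,
    "E": 8.2, "F": 8.5, "G": 9.0, "H": 13.5,
}
low_outliers, high_outliers = box_plot_outliers(rates)
print("low:", low_outliers, "high:", high_outliers)
```

A county flagged on the low side because it has a single event, as in the session, is an outlier in the statistical sense but mostly reflects a small denominator.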
Okay. So that's the box plot. Very useful device. As you can see, what GeoDa is able to do is connect your data with the box plot and with the map. All three at the same time. And you can visualize. You can see your data, literally visualize and map it, and you're able to actually learn a lot by just looking. So those of you who work in states with a large number of counties, this is a useful tool to be able to visualize all of these. Here it's approximately sixty-seven counties, but Minnesota has 81.
UNKNOWN SPEAKER: 87.
RAVI SHARMA: 87 counties. There's no way to look at them one by one. You can visually look at these altogether and visualize these patterns. So we're just starting. So this is just to get you going on this, some of the exploratory data analysis features.
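The outlier flagging that a box plot performs can be sketched in a few lines. This is only an illustration, assuming the conventional rule that a value is an outlier when it lies more than 1.5 interquartile ranges beyond the quartiles; the function name `box_plot_outliers`, the area labels, and the rates are all hypothetical and are not taken from the Pennsylvania file used in the session.

```python
import statistics


def box_plot_outliers(rates, hinge=1.5):
    """Return (lower_fence, upper_fence, outliers) for a dict {area: rate}.

    A value is flagged when it falls below Q1 - hinge*IQR or above
    Q3 + hinge*IQR. The hinge of 1.5 is the usual convention.
    """
    q1, _median, q3 = statistics.quantiles(rates.values(), n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - hinge * iqr
    upper = q3 + hinge * iqr
    outliers = {area: r for area, r in rates.items() if r < lower or r > upper}
    return lower, upper, outliers


# Hypothetical low birth weight rates (percent) for nine made-up areas.
rates = {"A": 6.5, "B": 7.0, "C": 7.2, "D": 7.4, "E": 7.5,
         "F": 7.8, "G": 8.0, "H": 12.0, "I": 0.5}
lower, upper, outliers = box_plot_outliers(rates)
print(sorted(outliers))  # areas flagged at either end of the distribution
```

An area flagged at the low end, like the county with a single low birth weight case, is often a small-numbers problem rather than a real pattern, which is why the flagged values still need to be checked against the underlying counts.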
So what we would like to do next -- the question always in geography, as Russ was talking about, if there's a pattern in your data, the question is: Is the pattern there because these counties are neighbors of each other? So here's Allegheny County, and here are the neighbors of Allegheny County. Now, if you find that there are high rates here, can you statistically determine whether these are spatially autocorrelated? That's the term.
When we say things are spatially autocorrelated, what we're saying is that the neighbors have either high rates or low rates of, in this case, low birth weight. So we can go and also map this data. As you know, it's very simple, similar to the plots, and we're going to use the standard deviation plot here. We can use the quantile, the percentile, the box, the standard deviation. We'll forget about this one for a minute. So let's go to map. Because the next thing we'd like to do is to see visually, and this is another feature, the spatial map.
So we're going to use -- let's do the quantile. And we're going to again use our variable that we just created. We're going to say okay. Number of classes, four, is okay for the time being. And here is now another view of your data.
So what you have, this is still active here. You have now a quantile plot. You can look at the quantile plot and you can see that these counties, you know, tend to be correlated. They're neighbors of each other. And they all seem to share the same characteristics. You can see these northern counties. The same characteristics. Some of these counties here in southwestern Pennsylvania. You know, this is okay as the plot goes.
Let's try on the map we're going to start, we're going to do the standard deviation plot. Let's look at the standard deviation plot. So you go to map. Standard deviation and LBWR 200. So here is the standard deviation plot.
If I can maximize this. So what you get, the mean is 709 and then you have these counties about the standard deviation, these are below the mean. And you can see. So the counties in blue. This one is above. And you know this is interesting. If you recall, in our scatter plot this County shows up. So does this County. And Philadelphia is not the one I'm worried about. But these two I am. And these show up as below the standard deviation. So we need to — so that's a picture that you can look at and ask yourself for all the different views you have so far of the map of the low birth weight rate for 2000. You see — repeatedly you see these counties showing up. And so that's worrisome in terms of there could be small numbers. So we can look at you know the confidence intervals for those two counties.
And you can see — you see all these counties here. This is the — these are all below the standard deviation between seven and 08. So you have a cluster of these counties here. This is Allegheny County here. This is Philadelphia to orient you to the map here.
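The idea behind a standard deviation map, grouping areas by how many standard deviations they sit above or below the mean, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `std_dev_bands` is hypothetical, the sample standard deviation is used, and the band boundaries are whole standard deviations. It is not a reproduction of how GeoDa computes its classes.

```python
import math
import statistics


def std_dev_bands(values, max_band=3):
    """Classify each value by its distance from the mean in standard deviations.

    Band k means k <= z < k + 1, where z = (value - mean) / sd.
    Bands are clipped so that -max_band collects everything further below
    the mean and max_band - 1 collects everything further above it.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    bands = []
    for v in values:
        z = (v - mean) / sd
        band = math.floor(z)
        bands.append(max(-max_band, min(max_band - 1, band)))
    return mean, sd, bands


# Toy values; band -1 is "below the mean, within one SD", band 0 is
# "above the mean, within one SD", and so on.
mean, sd, bands = std_dev_bands([1, 2, 3, 4, 5])
print(mean, round(sd, 3), bands)
```

Areas that keep landing in the outermost bands across the histogram, the box plot, and this map are the ones worth checking for small numbers.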
So this is a standard deviation map which you can calculate in GeoDa. If I can click out of that here. Let's see. I want to go back to the — I'm trying to get back to the map here. Let's see which one is — I don't want to open a new map. Did I lose my map?
UNKNOWN SPEAKER: It's down there at the bottom.
RAVI SHARMA: Where is it? Oh, I'm not sure why I have two copies of GeoDa open. Let's see. This is one of the problems — I'm not sure I want to open another map. Okay. So we can — I just duplicated another map. There's a feature here. You see this feature. Very useful. If you're playing around with several different sets of ideas, you can actually make a copy of your map by clicking on — do you see this? It's called duplicate the map, the main map, and you can essentially create another map.
Okay. So much for this. What we want to do is progressively add additional layers of complexity to our analysis. The next thing we would like to do is, since the first law of geography is everything is connected to everything else, how do we measure this connectivity? One way we measure this connectivity of neighbors to each other, the term we will use, is spatial neighbors. So, for example, you need to be able to express that in quantitative terms. So this is Allegheny County in western Pennsylvania. It has Butler County to the north as a spatial neighbor. This is Beaver. Hometown of Joe Namath, for those interested in football. And Joe Montana, too, for Allegheny County. I can name actually half a dozen.
And then this is Westmoreland, Washington and then Armstrong and all. These are all neighbors. The question in our mind has to be, first of all, how do we define these spatial neighbors, and secondly, are those neighbors one county removed, which is a first-order neighbor, or are they two counties away, a second-order neighbor?
For example, is Erie County a spatial neighbor of Allegheny County? That's a question that only you can answer based on your substantive field of interest.
If you work, for example, in environmental health and you're looking at environmental exposures and their relationship to, let's say, adverse pregnancy outcomes, you might find that, if you live in Allegheny, your environmental exposures are originating from way up here, from Cleveland. So your spatial neighbor is actually quite far away in terms of distance. So you can create a neighborhood structure based on what we call contiguity, closeness, whether a neighbor is adjacent to you, or you can create a neighborhood structure based on distance. The choice that you make in using either a distance-based or an adjacency-based measure is pretty much up to you.
Now, the question, of course, arises, and I'm sure you have it in your mind, is, well, you know, look at Allegheny County. Do you see this county here? Right? Do you see this county? This is Allegheny County here. You see this other county. It only meets Allegheny County at the edge, right? Is it a neighbor?
What do you think? I mean, do you see this, the Butler County shares a much larger, what do you call, area here. The border. While this is Armstrong County and you can see it only meets Allegheny County at the edge.
So another choice you need to make is what constitutes whether — what constitutes a neighbor in the sense of how much of the boundary has to be in common for it to be a neighbor.
Based on that, you have several -- those of you who play chess, based on chess we have several different rules. We can have a neighborhood structure based on queen. You know, in chess the queen can move in any direction. So the queen can move diagonally, up and down. Right? So that's a queen structure. So that neighborhood structure is called queen. So you can define a weight structure in GeoDa called queen.
Another — again using chess, is rook. R-o-o-k. Do you know what direction rook can move, right?
UNKNOWN SPEAKER: Adjacent to —
RAVI SHARMA: Up and down. Right? So a rook can go either this way or this way, right? But it can't go diagonally. So you can define a structure based on rook. So that is another neighborhood structure. Keep that in mind, you know the chess terms. Rook and Queen.
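The chess analogy is easiest to see on a regular grid, where rook neighbors share a side and queen neighbors share a side or only a corner point. The sketch below is an illustration on a made-up grid, with the hypothetical function name `grid_neighbors`. Real county polygons are irregular, so actual software has to test shared boundary segments and shared points rather than grid steps.

```python
def grid_neighbors(row, col, n_rows, n_cols, rule="queen"):
    """List the neighbors of one cell on an n_rows x n_cols grid.

    rook  : cells sharing a side (up, down, left, right)
    queen : rook neighbors plus cells touching only at a corner
    """
    rook_steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    diagonal_steps = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    if rule == "rook":
        steps = rook_steps
    elif rule == "queen":
        steps = rook_steps + diagonal_steps
    else:
        raise ValueError("rule must be 'rook' or 'queen'")
    return [
        (row + dr, col + dc)
        for dr, dc in steps
        if 0 <= row + dr < n_rows and 0 <= col + dc < n_cols
    ]


# Center cell of a 3 x 3 grid: the queen rule picks up the four corners too.
rook_center = grid_neighbors(1, 1, 3, 3, rule="rook")
queen_center = grid_neighbors(1, 1, 3, 3, rule="queen")
print(len(rook_center), len(queen_center))
```

Because queen neighbors always include the rook neighbors, switching from rook to queen can only add links, which is one reason to rerun an analysis under both rules and compare.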
Now, so that's one of the decisions we will have to make as we go to the next stage: how do we define our neighborhood structure? My recommendation is try changing your neighborhood structure and see if it affects your results. If it affects your results drastically, then you know you may have to make some choices of some kind. Another one you can use, as I pointed out, is distance. A distance, though, has to be from a point, so we can create what we call centroids, the geographic centers of these polygons, and then simply state that all polygons within 50 miles constitute a neighbor. So that's a distance-based -- I'm sorry?
UNKNOWN SPEAKER: Can you identify neighbors based on like barriers, like in Washington State we have the Cascade Range that goes through the state.
RAVI SHARMA: Yes.
UNKNOWN SPEAKER: And access to care patterns are very different depending on what type of (inaudible).
RAVI SHARMA: So you can create a cost-based. So that's like a cost. There's actually — I may not have time to go into it but there's actually four. Those kinds where you have physical barriers. You can actually develop some very sophisticated models based on what we do — the term we use is cost because distance imposes a cost. And of course time is another barrier. So we can use both of those to — you can actually do this in network. There's another model in ArcGIS called network, because you're actually measuring distances with barriers. So I would — you can, of course, use a feature here. You can define distances using as barriers. But I would prefer for that kind of work, because that, you know, is very specific application, to use a module called network.
What it is, the network module in ArcGIS is based on a road network. So it actually tells you the distances along the way people actually travel. You go by road, right? Because you don't want to measure and including distance as the crow flies. So you have — I would strongly recommend that you experiment with network. And network has very intuitive visit like structure for — you can put all kinds of stuff in the model. It will actually calculate for you those and it will calculate distances. So if you have addresses of general location of the clients and the facility to which they have to travel, you can calculate the distances along the travel paths.
UNKNOWN SPEAKER: What about working with border counties like Ohio?
RAVI SHARMA: Yeah. So you guys are asking really very sophisticated questions here. So we have what we call edge effects. Russ, help me out here. So we have what we call edge effects, which is how do you put boundaries around your problems? This is very common when you do environmental health, right? Because in western Pennsylvania I do a lot of environmental health work in relation to adverse pregnancy outcomes, too. So my question is, well, Allegheny County has done a lot, so this is a policy question, done a lot to control its pollution. The steel mills have gone. Pretty much. There's one T and L steel mill left. Most of the pollution now comes from out here. So if I simply use this -- this is what Russell is talking about. Environmental exposures, or whatever we're interested in, really do not obey, you know, administrative boundary laws. So we need to be extremely careful. So that's what we call the edge effects. The edge effects are very important. This is how you develop the spatial resolution, the spatial extent of your region. So make sure that when you're doing a particular project you take the edge effects into account.
You understand what I'm saying when I talk about edge effects, right? Edge effects here refers to where you're cutting off the boundaries.
RUSSELL KIRBY: This problem will actually be an issue when we do our SaTScan example. But basically, when all the data you have are for Pennsylvania, obviously it's difficult to factor in other information. But say, for example, you were trying to determine whether there's clustering of some health event. What typically you need to do is to make an assumption about the distribution of that particular health event in the areas surrounding the area that you're studying and fill in some kind of a value, which might be an assumption of the average value across the region you're studying or something. Because if you don't, then the areas that are just adjacent to your study area, where you have no data, are all going to be filled in as zeros, and whatever pattern there might be up to your boundary will be diminished by the fact that you're factoring in basically zero values in the rest of the region.
But this is definitely one of the major challenges that we have in using GIS to look at public health issues.
Now, we also have the problem before we used GIS, we just didn't pay any attention to it.
RAVI SHARMA: Exactly. Okay. So while you're here do you want to say anything about the modifiable area problem, which is very similar?
RUSSELL KIRBY: The modifiable areal unit problem -- there are those vexing problems in mathematics where they give a $100,000 prize if you can come up with a solution. The modifiable areal unit problem is actually an example of that. And the fact of the matter is that whether you have your data aggregated by administrative boundaries, in this case counties or census tracts or zip codes or whatever units you have, or even if you're using your data as a point distribution, the way in which you analyze the distribution of data, in terms of the structure in which you consider the data to be arrayed, can influence the finding that you have. And it's been shown, for example, that if you have point data and you array the data as a grid, depending on exactly where you start the grid, you can get a slightly different answer to your analysis than you would if you started in some other particular location. But it's especially true with administrative boundaries that, you know, they basically give you a structure for grouping your data which is completely artificial. And you will in fact potentially get a different answer to your analysis if you use a slightly different geography. And it's an intractable problem. When I say it's an intractable problem, it's especially intractable when you use administrative boundaries for your data. It's potentially a problem as well when you use point data but much less so than it would be otherwise.
There's a whole literature on this, and they usually use the acronym MAUP, for modifiable areal unit problem. It's an extensive literature, not so much in public health but in the cartography and spatial analysis literature.
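The grid example can be made concrete with a toy case: the same four points, the same cell size, and only the starting corner of the grid changed. This is a minimal sketch with made-up coordinates and a hypothetical function name, `count_by_cell`. It illustrates the sensitivity being described and is not a method for correcting it.

```python
import math
from collections import Counter


def count_by_cell(points, cell_size, origin=(0.0, 0.0)):
    """Count points per square grid cell, for a grid anchored at `origin`."""
    ox, oy = origin
    counts = Counter()
    for x, y in points:
        cell = (math.floor((x - ox) / cell_size),
                math.floor((y - oy) / cell_size))
        counts[cell] += 1
    return counts


# Four hypothetical case locations lying along a diagonal.
points = [(0.6, 0.6), (0.9, 0.9), (1.1, 1.1), (1.4, 1.4)]

grid_a = count_by_cell(points, cell_size=1.0, origin=(0.0, 0.0))
grid_b = count_by_cell(points, cell_size=1.0, origin=(0.5, 0.5))

# Same points, same cell size: one grid splits the cases across two cells,
# the shifted grid puts all of them in a single cell.
print(dict(grid_a))
print(dict(grid_b))
```

With the first grid the cases look evenly spread, and with the shifted grid they look like one concentrated cell, even though nothing about the cases themselves has changed.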
RAVI SHARMA: All right.
UNKNOWN SPEAKER: Is this exactly the same that SaTScan (inaudible) modifiable area is defined by (inaudible).
RUSSELL KIRBY: Not exactly.
RAVI SHARMA: But it's intractable.
RUSSELL KIRBY: It's basically an intractable problem that you're always going to face. As I say, it's less of a difficulty if your data are arrayed as a point distribution.
UNKNOWN SPEAKER: Does this problem minimize the small (inaudible) so if you could (inaudible) would that be possible as opposed to (inaudible).
RUSSELL KIRBY: Definitely the smaller the areal units the less of a problem it's going to be, because you're capturing your information across smaller numbers. But on the other hand, we could have, I don't know, yesterday we could have shown a series of slides to show how the type of areal units that you use influences what kind of patterns you see. I have a set of slides I use when I teach that has data, you know, calculating the infant mortality rate for the city of Des Moines, Iowa, by zip code areas, by census tracts and by census block groups. And what happens, the smaller the areal unit, the more precise the information, but the wider the variance in the measures that you have, because you're calculating them across smaller numbers of cases.
That is an obvious thing. And so if you use smaller units, the MAUP is less of a problem. But you still — it's something that never goes away and you introduce additional problems with smaller units.
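The trade-off between smaller units and wider variance can be put in numbers. The sketch below assumes a simple binomial model with a normal-approximation interval, which is a rough choice for small counts. The function name `rate_with_interval` and the event counts are hypothetical and are not the Des Moines data.

```python
import math


def rate_with_interval(events, population, z=1.96):
    """Return (rate, lower, upper) using a normal approximation.

    Assumes events are binomial out of `population`. The approximation is
    crude when counts are small, which is exactly the situation of concern.
    """
    rate = events / population
    se = math.sqrt(rate * (1 - rate) / population)
    return rate, max(0.0, rate - z * se), rate + z * se


# Same underlying rate (7 percent) measured in a large and a small area.
large_area = rate_with_interval(events=700, population=10_000)
small_area = rate_with_interval(events=7, population=100)

large_width = large_area[2] - large_area[1]
small_width = small_area[2] - small_area[1]
print(large_area, small_area)
```

Dividing the population by 100 makes the interval about ten times wider, so a map built on very small units can show large swings that reflect nothing more than small numbers of cases.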
RAVI SHARMA: Somebody else had raised their hand. Are we okay? Now that we have, I think, a really good understanding of spatial neighbors, we want to go next and create a spatial neighborhood structure for Pennsylvania so that we can then go and do some more interesting work. So we're now slowly moving from visualization to exploratory data analysis, because exploratory spatial data analysis will really require that we make assumptions with respect to the neighborhood structure. So GeoDa is really nice, because it creates for you these neighborhood structures, but we need to make assumptions. We have to make a decision with respect to what neighborhood structure, whether we're going to use the rook or the queen. So I'm going to ask you to tell me -- we'll have people raise their hands to figure out who is in favor of rook or queen.
And then we'll create a neighborhood structure. We can always use distance, right. So let's go on. And very easy to create a neighborhood structure. Let me just close out here.
So let's see if I can see here. Do you see tools? Tools on your menu here. And you will see weight. It will say either open or create. Since we don't have the weight file created, we can't open it. So we have to create it. So we're going to click on create. And so what we need to specify is our input, the shapefile for which we would like to derive what we call technically a weight matrix. So a weight matrix is simply a set of zeros and ones. So if we use a rook criterion, then if this is a neighbor, you know, if Allegheny County has a neighbor, it's one. If not, it's zero. So it's zero-one. It's all standardized, but that's the stuff I don't worry about. So you don't need to worry about it here either. We need to specify an output folder. We need to select an ID variable. This again is the same. So we can select the county as an ID variable.
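A weight matrix of zeros and ones, and the standardized version mentioned in passing, can be sketched like this. The four areas and their neighbor lists are made up, and the function names `binary_weights` and `row_standardize` are hypothetical. Row standardization, dividing each row by its number of neighbors, is assumed here as the usual meaning of "standardized" for contiguity weights.

```python
def binary_weights(neighbors):
    """Build a 0/1 matrix from a dict {area: [neighboring areas]}."""
    ids = sorted(neighbors)
    index = {area: i for i, area in enumerate(ids)}
    matrix = [[0] * len(ids) for _ in ids]
    for area, adjacent in neighbors.items():
        for other in adjacent:
            matrix[index[area]][index[other]] = 1
    return ids, matrix


def row_standardize(matrix):
    """Divide each row by its sum so that every area's weights add up to 1."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([w / total if total else 0.0 for w in row])
    return result


# Hypothetical contiguity among four made-up areas.
neighbors = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B"], "D": ["B"]}
ids, w = binary_weights(neighbors)
w_std = row_standardize(w)

for area, row in zip(ids, w):
    print(area, row)
```

After row standardization, multiplying a row by the vector of rates gives the average rate among that area's neighbors, which is what later spatial statistics compare each area against.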
Now do you see here the contiguity weight? How do we want to define it? You can see everything is blacked out here. The first is the rook. Then below that is queen. If you look out on your screen, and the order of contiguity, whether we're looking at first order, second order. The first/second order is whether it's one removed, two removed or three removed.
UNKNOWN SPEAKER: This distinction between rook and queen, you know, patterns, you know, it's easy on any method. If we're creating a grid it's sort of very simple. But in this case you've got instances where there's one region has two regions adjacent to it on the left. Now, were those both, do they both satisfy rook contiguity or only (inaudible) contiguity.
RAVI SHARMA: Which one, talking about this one up here? This one?
UNKNOWN SPEAKER: Yeah.
RAVI SHARMA: This one, in my feeling, what it will do is it will think of that as a rook. It will probably include that as a neighbor. And it will include this as a neighbor but not this. And not this.
UNKNOWN SPEAKER: Right. So how about the one above it, though?
RAVI SHARMA: This one?
UNKNOWN SPEAKER: Yeah, will that be counted.
RAVI SHARMA: Yes that will be counted.
UNKNOWN SPEAKER: Anything that's touching?
RAVI SHARMA: Yes.
UNKNOWN SPEAKER: What will the Queen pull in beside —
RAVI SHARMA: The Queen will pull in this one up here.
UNKNOWN SPEAKER: It will skip over —
RAVI SHARMA: No, everything. It will pull everything including — the Queen moves in all different directions. So it will pull everything in. It will pull this one, this one and this one.
UNKNOWN SPEAKER: So you're not talking just a small band you're talking about a wider band.
RAVI SHARMA: Wider band with the Queen. So the queen will get you a much larger, because it counts those where the neighbors meet on an edge. It will include those.
UNKNOWN SPEAKER: And that's all conditioned by the distance?
RAVI SHARMA: No. By — this is contiguity. By borders. Whether it touches at any point along the border.
UNKNOWN SPEAKER: It's going to scan the whole state.
RAVI SHARMA: The whole state. That's why we need GIS to do it, otherwise we can't do it. Yeah, this will scan the whole state. So the computer goes one by one. It will go all through the 67 counties and determine the neighborhood structure. So we can actually look at it once it creates it because it's a text file. Let me show you -- my slide presentation on GeoDa actually has a discussion on what a weight matrix looks like, if you go to my PowerPoint slides. Go a little further down, after I create the weight matrix, you will see it in the slide presentation, because I know I don't have too much time here to do it all. In the slide presentation I go through contiguity. I do the rook. The queen. And I also do the distance. Up here we don't have much time to do all three. We're going to do just one.
UNKNOWN SPEAKER: Now for the most part we've been focused here on (inaudible).
RAVI SHARMA: On counties. Yeah.
UNKNOWN SPEAKER: But then what if you actually did have the geo coded spots, are you doing something similar to a buffer.
RAVI SHARMA: For points, we actually have really very sophisticated tools. You don't need to worry about neighbors like this. That's what we will be doing this afternoon, because there we will be doing clusters. So the clusters will be defined simply on the basis of distance, whether they're within a mile, two miles. So if you have point data, addresses of cases, controls, that's point data, and point data has a lot more sophisticated techniques. We call it point pattern analysis, and you can do more with point data. But most of the time we don't have point data. Most of the time all we have is aggregated data at some level of geography.
All right? Okay.
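Defining neighbors by distance, whether between county centroids or between point locations, comes down to a threshold test on pairwise distances. The sketch below assumes planar coordinates already expressed in miles and straight-line distance. The function name `distance_neighbors`, the labels, and the coordinates are hypothetical. Latitude and longitude would need to be projected or handled with a great-circle formula first.

```python
import math


def distance_neighbors(locations, threshold):
    """Return {name: [names within `threshold` of it]} using straight-line distance.

    `locations` maps a name to planar (x, y) coordinates in the same units
    as `threshold`.
    """
    result = {name: [] for name in locations}
    names = list(locations)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(locations[a], locations[b]) <= threshold:
                result[a].append(b)
                result[b].append(a)
    return result


# Hypothetical centroids in miles on a flat plane.
centroids = {"A": (0.0, 0.0), "B": (30.0, 30.0), "C": (100.0, 0.0)}
within_50 = distance_neighbors(centroids, threshold=50.0)
print(within_50)
```

Unlike contiguity, a distance rule can leave an area with no neighbors at all if the threshold is too small, so it is worth checking for empty lists before using the result.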
So let's go ahead and pull in our input file. So our input file is the same file, the PAM — I call it the PAMCH. We're going to save that. We want to save it somewhere. So pick it anywhere to save it and give it a name. So let's call it PAMCH — normally what I do is if I am, let's do this before we go and do this. Let's determine — because we want to make our name explanatory so we know whether we're using a contiguity rook or queen matrix or queen weight matrix, right? So how many of you would, are in favor of let's say rook first. Raise your hands. All right. Queen. The Queens have it. So we will call — so we will call — we will save it as queen. So we know which we are — so we're going to save it as queen. Q-u-e-e-n. Save. So once you do that you can see an ID variable we're going to use, we'll continue to use, County -- what are we using? County, what are we using, if I can see it here.
UNKNOWN SPEAKER: I think it's shown numerically.
RAVI SHARMA: That's true. Well —
UNKNOWN SPEAKER: I think the whole idea was —
RAVI SHARMA: We can have it generate these bracket numbers. Yeah, we should have actually developed a key for it. But that's okay. But as I said you must have a numeric key for it. So let's use — we're going to use the Queen.
UNKNOWN SPEAKER: Do you use OID up there.
RAVI SHARMA: We can use — yes, we can use OID. That's numeric. Let's use that. What I want to do is — yeah, OID is okay. This one up here. So let's use OID as our ID variable.
We are going to use the queen contiguity. And we are not using distance. So that stays the same. And we're not using any threshold. We're not using any cutoffs and we're not using k-nearest neighbors. So that remains unchecked. The next thing we need to do is simply click on create. And you can see it will take only a few seconds, and the weight matrix file has the extension GAL. It's a GAL extension. And it's done.
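Because the GAL file is plain text, it can be opened in any editor or read with a few lines of code. The reader below is a sketch that assumes the commonly described layout: a header line giving the number of observations (here also the input file name and ID variable), followed for each area by a line with its ID and neighbor count and a line listing the neighbor IDs. The function name `parse_gal` and the three-area sample text are hypothetical, and the header fields may differ between GeoDa versions.

```python
def parse_gal(text):
    """Parse GAL-style text into {area_id: [neighbor_ids]} (IDs kept as strings).

    Assumed layout: header "0 n name idvar" (or just "n"), then repeated
    pairs of lines: "id k" followed by the k neighbor IDs.
    """
    lines = text.strip().splitlines()
    header = lines[0].split()
    n_expected = int(header[1]) if len(header) > 1 else int(header[0])

    # Read the body as one token stream so blank neighbor lines are harmless.
    tokens = " ".join(lines[1:]).split()
    neighbors = {}
    pos = 0
    while pos < len(tokens):
        area_id, count = tokens[pos], int(tokens[pos + 1])
        neighbors[area_id] = tokens[pos + 2: pos + 2 + count]
        pos += 2 + count

    if len(neighbors) != n_expected:
        raise ValueError("header count does not match number of areas read")
    return neighbors


# Made-up three-area example in the assumed layout.
sample = "0 3 PAMCH OID\n1 2\n2 3\n2 1\n1\n3 1\n1\n"
gal = parse_gal(sample)
print(gal)
```

Reading the file this way makes it easy to compare a queen file against a rook file for the same map and list exactly which links differ.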
And now we can go to weight and we can open it. And we can set this as a default.
Are you all with me? Shall I do that again?
Okay. I will do this again. So we'll go to tools, weights, open. And as you can see, it says select from. And we are going to select, we're going to navigate to where the queen weight matrix is, the one we just created. We're going to click on that. We are going to open it. And we're going to set this as a default. This way when you set it as a default, we do not have to worry. It automatically uses it when it does any calculation that needs a weight matrix. And then you click okay.
Now we are all set. We have created a weight matrix. The next thing we're going to do is since now we have — what time are we breaking for, am I breaking for? We have a break.