ABOUT THIS COURSE
This course provides YouTube-hosted videos and downloadable practice datasets to illustrate data analysis techniques useful to journalists, advocates and others involved in public affairs communication. The course covers basic and advanced Microsoft Excel techniques, basic querying and matching techniques in Microsoft Access, online mapping with Google Fusion Tables, and inferential statistics with Excel's free Data Analysis ToolPak. I will add updates and new materials as I develop them. These materials are free to use and distribute. I ask only that you cite me as their author and include a link or URL to this page.
-- Ken Blake, Ph.D., associate professor of journalism, Middle Tennessee State University
Analysis with Microsoft Access
Mapping data with Google Fusion Tables
Computing univariate inferential statistics with Microsoft Excel
Computing bivariate inferential statistics with Microsoft Excel
ANALYSIS WITH MICROSOFT EXCEL
Extremes can be newsworthy. The videos below demonstrate how the widely available Microsoft Excel spreadsheet program can be used to quickly discover which city employees received the largest - and smallest - raises during a two-year period. The names and all data are made up. But the wages paid to public employees generally are public record, so real data of this sort should be obtainable for just about any government agency.
The analysis shows that the salary of an employee named Earl Jakes nearly doubled, and the salary of an employee named Megan Moore increased almost as much. Both increases were far above the norm for the group. Meanwhile, the biggest proportion of the money spent on the salary increases went to an employee named David Donegan. And two employees, Clay Allsbrook and Haley Blackburn, got the smallest salary increases. There's certainly nothing wrong with getting a raise, and perhaps even nothing wrong with getting a bigger (or smaller) raise than anyone else. But an inquisitive journalist would look into why the raises were allocated the way they were rather than in some more equitable fashion.
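For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the two calculations behind those findings: each employee's percentage raise, and each employee's share of all the money spent on raises. The videos do this in Excel, not in code, and the names and salaries below are hypothetical placeholders rather than the course's practice data.

```python
# Hypothetical (old salary, new salary) pairs; not the course's practice data.
salaries = {
    "Employee A": (30_000, 57_000),
    "Employee B": (150_000, 180_000),
    "Employee C": (45_000, 46_000),
}

def raise_summary(salaries):
    """Return one row per employee, sorted by percentage raise, largest first."""
    total_raise = sum(new - old for old, new in salaries.values())
    rows = []
    for name, (old, new) in salaries.items():
        dollars = new - old
        rows.append({
            "name": name,
            "raise": dollars,
            "pct_raise": dollars / old * 100,               # raise relative to old salary
            "share_of_total": dollars / total_raise * 100,  # slice of all raise money
        })
    return sorted(rows, key=lambda row: row["pct_raise"], reverse=True)

for row in raise_summary(salaries):
    print(f'{row["name"]}: {row["pct_raise"]:.1f}% raise, '
          f'{row["share_of_total"]:.1f}% of all raise money')
```

Notice that the two measures can point to different people: in this made-up data, Employee A gets the biggest percentage raise, but Employee B collects the biggest share of the raise money.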
Getting data into Excel
Keying words and numbers into Excel can gobble up valuable time for a journalist on deadline. Plus, it invites errors. Plus, it can be mind-numbing. If you're going to use Excel to any appreciable degree, you'll want to know how to import into Excel data that someone else has already typed and formatted. These videos show you how to import data from each of a number of common data sources and formats. A note of caution: No dataset, no matter its source, should be assumed to be free from errors. Always check for oddities, missing information, keypunch errors, etc.
More advanced Excel techniques
Basic Excel skills will take you pretty far. But there are some more advanced skills that will come in handy, too, if a bit less often. This series of videos shows how to analyze crime data using rates, ranking, and filtering techniques. To practice what you see in the videos, you'll need to download this homicide.xlsx data file. Similar crime data are available from the Federal Bureau of Investigation's National Incident-Based Reporting System.
Note: The FBI discourages ranking communities based on crime rates. The bureau points out that simple rates and rankings overlook many important factors in crime patterns. I agree. A computer-assisted reporting analysis can begin and inform a journalistic investigation. But a CAR analysis should never constitute the whole of the investigation. There is no substitute for the interviewing, verifying and contextualizing that journalists have always done, using the methods they have always used. CAR analyses simply help journalists go about those activities with more efficiency, insight and authority.
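As a rough illustration of the rate, ranking, and filtering steps mentioned above, here is a short Python sketch. It assumes a rate per 100,000 residents, which is a common convention but not necessarily the one used in the videos, and the city names and figures are hypothetical rather than taken from homicide.xlsx.

```python
# Hypothetical figures for illustration; not taken from homicide.xlsx.
cities = [
    {"city": "City A", "homicides": 12, "population": 150_000},
    {"city": "City B", "homicides": 30, "population": 600_000},
    {"city": "City C", "homicides": 4, "population": 25_000},
]

def add_rates(rows, per=100_000):
    """Return copies of the rows with a rate per `per` residents, ranked highest first."""
    rated = [dict(row, rate=row["homicides"] / row["population"] * per) for row in rows]
    rated.sort(key=lambda row: row["rate"], reverse=True)
    for rank, row in enumerate(rated, start=1):
        row["rank"] = rank
    return rated

def filter_min_population(rows, minimum):
    """Keep only places with at least `minimum` residents."""
    return [row for row in rows if row["population"] >= minimum]

ranked = add_rates(cities)
large_only = add_rates(filter_min_population(cities, 100_000))

for row in ranked:
    print(row["rank"], row["city"], round(row["rate"], 1))
```

In this made-up data, the smallest city tops the ranking on the strength of just four cases, which is exactly the kind of distortion the FBI's caution is about. Filtering out very small places before ranking is one way to reduce it.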
Here are a few additional specialized techniques, each with its own practice dataset
ANALYSIS WITH MICROSOFT ACCESS
Excel is the best tool around for some things - things like quickly computing and replicating formulas and sorting data, for example. And it's readily available. But Microsoft Access is better for other tasks, such as aggregating and filtering data and matching up data in two or more separate files. The videos below will show you how to set up an Access database and use it to filter and aggregate data in separate files, such as campaign contribution records downloadable from the Federal Election Commission website.
Getting data into Access
As with Excel, you can import data into Access to avoid spending a bunch of time keying it in yourself. These videos show you how to import data from a variety of common data formats. A note of caution: No dataset, no matter its source, should be assumed to be free from errors. Always check for oddities, missing information, keypunch errors, etc.
Note: You may elect to bypass this section, go straight to "Linking tables and running queries" below, download the already-compiled database, and begin learning how to conduct analyses using Access. But at some point, you'll want to know how to import data, so I hope you'll come back to this section eventually.
Linking tables and running queries
Note that these videos illustrate how to join and search three data tables that were intended to go together. The same skills, though, can be used to join and search data tables that were not intended to go together. Journalists frequently use CAR techniques to compare lists that should be mutually exclusive but turn out not to be. For example, one analysis compared a list of individuals convicted of burglary with a list of individuals hired to provide in-home care for the elderly and found a number of names that appeared on both lists.
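The list-matching idea described above can be sketched outside Access, too. The Python below is a minimal illustration, assuming the only differences between the two lists are capitalization and stray spaces. The names are hypothetical placeholders, and a real comparison would need a stronger key, such as name plus date of birth.

```python
def normalize(name):
    """Lowercase a name and collapse extra spaces so trivial differences don't block a match."""
    return " ".join(name.lower().split())

def find_overlap(list_a, list_b):
    """Return the entries of list_a whose normalized form also appears in list_b."""
    keys_b = {normalize(name) for name in list_b}
    return sorted(name for name in list_a if normalize(name) in keys_b)

# Hypothetical lists that ought to be mutually exclusive.
list_one = ["Pat Example", "Lee  Sample", "Chris Placeholder"]
list_two = ["lee sample", "Dana Demo", "PAT EXAMPLE"]

matches = find_overlap(list_one, list_two)
print(matches)
```

A shared name is a lead, not a finding. Two different people can have the same name, so every match still needs to be verified the old-fashioned way before it goes into a story.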
MAPPING DATA WITH GOOGLE FUSION TABLES
Google Fusion Tables is a fabulously easy - and free - online tool for creating simple online, interactive maps of geocoded data. For example, Tennessee, like a lot of other states, both tracks and publicizes the addresses of people who have been convicted of sex-related offenses. The information is available online here, under the "Click here to search the Tennessee Sex Offender Registry" link.
An hour or so of cutting and pasting information from the TBI website produced this Excel dataset. Along the way, I had to use Excel to deal with some inconsistencies. I could have resolved the same problems using Microsoft Access. After all of the address information has been merged into a single column, the data can be turned into a Google Fusion Tables online map.
And there's more. It's possible to subtotal the data to identify addresses shared by two or more offenders, then highlight these multi-offender addresses by creating and using "buckets." Other options include setting up a filter and switching to a satellite view. It's also possible to "layer" two different types of map information, as in this example combining information about sex offender addresses and information about public school addresses. To practice what you see in the video, you'll need these Rutherford County school addresses.
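Two of the data-preparation steps described above, merging address parts into a single column and subtotaling to find addresses that appear more than once, can be sketched in Python as follows. This is only an illustration of the logic. The column names and addresses are hypothetical, and the videos carry out these steps in Excel and Google Fusion Tables.

```python
from collections import Counter

# Hypothetical records with the address split across several columns.
records = [
    {"street": "100 Main St", "city": "Anytown", "state": "TN", "zip": "00000"},
    {"street": "100 Main St", "city": "Anytown", "state": "TN", "zip": "00000"},
    {"street": "200 Oak Ave", "city": "Anytown", "state": "TN", "zip": "00000"},
]

def full_address(record):
    """Merge the address parts into the single string a geocoder expects."""
    return f'{record["street"]}, {record["city"]}, {record["state"]} {record["zip"]}'

# Count how many records share each merged address.
counts = Counter(full_address(record) for record in records)

# Keep only the addresses that appear two or more times.
shared = {address: n for address, n in counts.items() if n >= 2}
print(shared)
```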
In the video, I use the following style codes:
Additional available style codes include:
FYI, I've found that, if you mix large and small icons on a map, the large icons sometimes end up covering the small ones, which can be unhandy. Another caveat: Google Fusion Tables presently will not import Excel files larger than 1 megabyte. But files of any size can be uploaded if they are in .csv, or comma-separated value, format. Excel will save data in .csv format if you click "Save As / Other Formats," then, under "Save as type:" choose "CSV (Comma delimited) (*.csv)"
The sex offender example above uses address information to place points on the map. But Google Fusion Tables can use latitude and longitude codes as well. Try it out, if you like, using this very small 2009 subset of the Toxics Release Inventory data made publicly available each year by the U.S. Environmental Protection Agency. It contains latitude and longitude coordinates, as well as other information, for the 11 sources of carcinogenic substance releases recorded in Rutherford County during 2009. Carcinogenic substances, as you may know, are substances that tend to produce cancer. Uploading the data to a Google Fusion Tables map will show you where in Rutherford County these releases occurred. The largest quantity of carcinogenic material came from the (now closed) Genmar Tennessee LLC boat manufacturing and repair operation on Church Street in Murfreesboro. The company released 44,637 pounds of styrene, a synthetic chemical used in the manufacture of plastics, rubber and resins. Information in the larger dataset indicates that the material was released into the air.
The second-largest release occurred at the U.S. Army National Guard Volunteer Training Site in Smyrna, Tenn. The substance released: lead. Bullets, perhaps?
COMPUTING UNIVARIATE INFERENTIAL STATISTICS IN EXCEL
Well, that's a mouthful, isn't it? It's less so if you take it one bite at a time, though. "Uni" means "one," and "variate" pertains to "variable." So we're talking about statistics that involve only one variable (as opposed to two variables - which we'll get to a bit later). And "inferential" is just the adjective form of the verb "to infer." You know what that means, right? It means to derive by reasoning. If there's smoke pouring out of the house and fire trucks pulling up out front, it doesn't take a genius to infer that the house is on fire. Nor does it require actually going into the house and observing fire firsthand. So, "univariate inferential statistics" are statistics that let you reason your way to a conclusion about a single variable.
Your brain computes its own version of univariate inferential statistics all the time. When you tell a friend, “I’ll meet you there in 20 minutes,” you come up with the “20 minutes” estimate by sampling your memories of how long it has taken you to get “there” in the past and, informally, averaging them. If you suspect the chili may be too spicy (or not spicy enough), you gather a sample – probably using a spoon – and taste it on the entirely reasonable assumption that the percentage of spices in the sample is about the same as the percentage of spices in the entire batch.
You’ve probably done this sort of thing frequently enough to learn that it can be unreliable, though. Your subconscious hope of getting “there” quickly might distort your memory of how quickly you’ve gotten there in the past. Or you might skim your spoonful of chili from the top, missing some ingredients that have settled on the bottom. So, when being right really matters – and, in journalism, being right always really matters – it’s good to be a little more careful about how you get your samples, evaluate them, and draw inferences from them. In this section, you’ll see that Excel can help you do so pretty quickly and easily.
Estimating a single average
Imagine that, in the grip of a mid-February cold snap, residents of Metroville University’s two freshman dorms, Holderby East and Holderby West, are complaining that their dorm rooms feel like meat lockers. As a reporter for The Metropolitan, Metroville U’s campus newspaper, you've been assigned to check the complaints out. University maintenance officials tell you everyone's exaggerating. Sure, there are cyclical temperature variations here and there, they say, but by and large the heating system that the two dormitories share keeps the dorm rooms at a comfortable, energy-efficient 68 degrees. Are the students exaggerating, or are the dorms really colder than the university maintenance officials think?
First, you use Excel to make this list of the two dormitories’ 560 rooms. It was pretty easy to do, because each dorm has seven floors and 40 rooms per floor, with the rooms on each floor numbered 1 through 40. Then you use the procedure demonstrated in this video to randomly select 25 rooms.
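Here is a minimal Python sketch of the same idea: build the list of 560 rooms, then draw 25 of them at random. It uses Python's `random.sample` as a stand-in for the Excel procedure in the video, and the `seed` argument is an assumption added only so the draw can be repeated.

```python
import random

def build_room_list():
    """Two dorms x seven floors x 40 rooms per floor = 560 rooms."""
    rooms = []
    for dorm in ("Holderby East", "Holderby West"):
        for floor in range(1, 8):       # floors 1 through 7
            for room in range(1, 41):   # rooms 1 through 40
                rooms.append((dorm, floor, room))
    return rooms

def pick_random(items, k, seed=None):
    """Draw k distinct items, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(items, k)

rooms = build_room_list()
selected = pick_random(rooms, 25, seed=1)

print(len(rooms), "rooms in all;", len(selected), "selected")
```

Leave out the seed for a fresh draw each time. Whatever tool you use, the point is the same: every room must have an equal chance of ending up in the sample.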
Next, you recruit four friends and equip each of them, plus yourself, with one of five digital thermometers borrowed from a physical science professor whose class you took last semester. Having made arrangements ahead of time with the students living in those 25 randomly selected dorm rooms, you or one of your friends visits each room between 8 p.m. and 9 p.m. on the same night. During the visit, the thermometer is placed three feet off the floor in the approximate center of the room for exactly five minutes, and the lowest temperature registered by the thermometer during that time is recorded. Care is taken to avoid any localized heat sources (laptops, television sets, etc.) that might interfere with an accurate reading.
Meeting later in the library, you compile your data into this spreadsheet. First, you make sure that you have installed the Analysis ToolPak and VBA add-ins for Excel. See this video for guidance on how to do it. Incidentally, the latest Mac version of Microsoft Excel does not include the Data Analysis ToolPak option. Fuss at Microsoft, not me, OK? Anyway, you then follow the procedure illustrated in this video. The results show you that the average temperature in the 25 randomly selected rooms was 63.5 degrees. More importantly, though, the fact that those 25 rooms were chosen randomly lets you infer that the temperatures of every single room in the two dorms – had you taken the time to measure them all – would almost certainly have averaged to somewhere between 62 degrees (the “lower limit” of the confidence interval) and 65 degrees (the “upper limit” of the confidence interval). Whenever you use this procedure, “almost certainly” means 95 percent certain. Every time, there will be a 5 percent chance that the actual average – in this case, the average of every single room temperature – is outside of the range you say it’s inside.
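For the curious, the confidence interval described above can be sketched in a few lines of Python. This assumes the standard t-based interval for an average (the sample average, plus or minus a critical value times the standard error), which may differ in small details from what the video shows. The readings are hypothetical, and the critical values are standard two-sided 95 percent values from a t table.

```python
import math
import statistics

# Two-sided 95 percent critical values of Student's t, keyed by degrees of
# freedom (sample size minus one). Only the two needed here are listed:
# 4 for the five-room example below, 24 for a 25-room sample.
T_CRITICAL_95 = {4: 2.776, 24: 2.064}

def mean_confidence_interval(values):
    """Return (average, lower limit, upper limit) for a 95 percent interval."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    margin = T_CRITICAL_95[n - 1] * standard_error
    return mean, mean - margin, mean + margin

# Hypothetical readings from five rooms (not the course dataset).
readings = [60, 62, 64, 66, 68]
mean, lower, upper = mean_confidence_interval(readings)
print(f"average {mean:.1f}, 95% interval {lower:.1f} to {upper:.1f}")
```

Note how wide the interval is with only five rooms. Measuring more rooms shrinks the standard error, which is why a 25-room sample can pin the average down to within a degree or two.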
Journalists sometimes balk at the idea of reporting something that they know will be wrong 5 percent of the time. But compare this approach to the result of doing what journalists would normally do when asked to cover the dorm temperature story. Essentially, they’d report that dorm residents say the rooms are cold and that university officials say the rooms aren’t cold, and they’d invite their audiences to guess who is telling the truth. When you have to guess blindly between two choices, only one of which is right, you have a 50 percent chance of guessing wrong. Now, which is better: Being wrong 50 percent of the time, or being wrong 5 percent of the time?
I could spend a lot of your time explaining how the above computations work and why you should trust them. But if you’re that curious, you’d be better off reading an intro-level statistics book. The explanation you’re looking for is neither short nor simple. But it’s also not as hard to follow as you might think. I recommend the latest edition of “The Basic Practice of Statistics,” by David S. Moore.
All you need to know for now is how to translate the results of the Excel procedure into a news story. Here’s how the dorm temperature study’s results could be worked into the opening paragraphs of a news story:
Holderby East and West residents aren’t imagining the chilly temperatures in their dorm rooms, an investigation by The Metropolitan suggests.
The temperatures in 25 randomly selected rooms averaged 63.5 degrees yesterday and indicated that the average temperature for all rooms in the two dorms was no warmer than 65 degrees and possibly as cold as 62 degrees.
There was only about a 5 percent chance that the heating system serving the two residence halls was consistently delivering temperatures closer to the 68-degree norm promised by university maintenance officials.
Tullahoma, Tenn., freshman biology major Sarah Franklin, who lives in Holderby East, said she hopes the investigation’s results will get the university to finally crank up the heat …
Estimating a single proportion
Now let’s imagine a slightly different situation.
Suppose the same two freshman dorms, Holderby East and Holderby West, prohibit guests in rooms after 6 p.m. Earlier this semester, a group of students who live in the two dorms staged a protest of the policy. Dorms for seniors and graduate students, they pointed out, have less stringent rules about visitors. They also argued that all college students are adults and should be allowed to make their own decisions about such things. Campus administrators countered by saying that overnight guests would lead to interpersonal conflicts and security problems, given that each freshman dorm room is shared by two roommates. Each unit in the senior and graduate dorms consists of four individual sleeping rooms with a common kitchen, living room and bathroom for every four sleeping rooms.
After debate and more protests – including counterprotests by dorm residents who want no change to the policy – the Student Government Association stepped in and suggested a compromise under which a resident would be allowed to entertain a guest until 8 p.m. on any night and overnight on any Friday or Saturday night after obtaining a signed consent form from the resident’s roommate and floor advisor at least three days in advance. Campus administrators agreed to adopt the compromise, but only if a majority of current residents approve it in a referendum to be held in two weeks.
Do at least a majority of the residents approve of the new policy?
As a reporter for The Metropolitan student newspaper, you’d sure like to know the answer to that question. But talking to every student in the two dorms would take way too much time. Recall that each dorm has seven floors, with 40 rooms per floor. And each room has two occupants. That’s 1,120 students that you’d have to interview between now and the referendum. Most journalists would do the easy thing: talk to two or three residents who favor the new policy, two or three who oppose it, write a story summarizing and comparing their views, then let readers guess how it will all turn out.
But what you really need is a procedure that does what the one described above does, but for a percentage rather than an average. You need to be able to poll a small, but random, number of dorm residents and use the results of those interviews to estimate what proportion of all dorm residents support revising the visitation policy.
A little more about sampling
To do it, begin by selecting 200 dorm rooms at random, using the same procedure used for choosing 25 at random. Just take the top 200 instead of the top 25. But now you have an interesting problem that you didn’t have when sampling dorm rooms so you could measure their temperatures. Because “temperature” is a property of the entire room, all you had to do was choose 25 random rooms. But “opinion about revising the visitors policy” is not necessarily a property of the entire room. One of the two people living in the room could have one opinion, and the other person living in the room could have a different opinion. So, what you really need to do is sample residents, not rooms.
There are at least two ways to solve this problem. One way would be to get in touch with the occupants of each selected room and, for each room, use a random procedure to determine which of the room’s two occupants you’ll interview. One common approach is to ask them what their birthdays are, then interview the one whose birthday will be the next to arrive. Or you could flip a coin.
Even better: You know that there are two residents per room, right? So, take your list of all 560 dorm rooms, duplicate it (by highlighting all of them, then using “copy” and “paste”), then add a column labeled “Resident.” Put a “1” in the “Resident” column for each of the first 560 rows, and put a “2” in the “Resident” column for the remaining 560 rows. Then add a “Random” column with the random number formula, =rand(), in each cell of the column. The result would look like this spreadsheet. Now, when you sort by the “Random” column, every resident has an equal chance of ending up in your sample, regardless of whether his or her roommate was also included in the sample. If both residents of a given room ended up in the sample, simply interview both. If only “Resident 1” or “Resident 2” was selected, use the “next birthday” method or a coin flip to determine which one to interview.
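The duplicate-and-sort approach just described can be mirrored step by step in Python. This sketch assumes the same layout as the spreadsheet: every room listed twice, a "Resident" number of 1 or 2, and a random number per row standing in for =rand(). The `seed` argument is an addition for repeatability only.

```python
import random

def build_resident_frame(seed=None):
    """List every resident slot once, attach a random number, sort by it."""
    rng = random.Random(seed)
    rooms = [
        (dorm, floor, room)
        for dorm in ("Holderby East", "Holderby West")
        for floor in range(1, 8)
        for room in range(1, 41)
    ]
    # Resident 1 for all 560 rooms, then Resident 2 for all 560 rooms.
    frame = [(room, resident, rng.random()) for resident in (1, 2) for room in rooms]
    frame.sort(key=lambda row: row[2])  # same effect as sorting by the "Random" column
    return frame

frame = build_resident_frame(seed=1)
sample = frame[:200]  # take the top 200 rows after the random sort

print(len(frame), "resident slots;", len(sample), "sampled")
```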
Obviously, this would take a little time. But if you get some friends or fellow reporters to help, it can go pretty quickly. You also could make contact by slipping a note or survey form under the doors of selected rooms. You could invite people to respond via an online survey form set up through Surveymonkey.com or Google Forms. And even if you do it the hard way – by knocking (and reknocking, and reknocking) on all 200 doors yourself – it will be faster than trying to do it for all 560 doors. The key will be to keep trying until you’ve gotten responses from at least most of the people chosen by your random selection procedure. The fewer you settle for, the less likely it is that the sample you end up with has the representative power of the random sample you started out with.
Back to estimating a single proportion
So, let’s suppose you are able to get responses from 170 of the 200 randomly selected dorm residents you started out with. That’s a pretty decent response rate of 170/200 = .85, or 85 percent. You compile the results into this dataset. It would be nice if Excel’s Data Analysis ToolPak had a built-in procedure for analyzing this type of data. However, it doesn’t. So I’ve made you a little spreadsheet that you can use to compute a confidence interval for any proportion, provided you know a few key numbers that went into creating the proportion. Here’s a link to the spreadsheet, and here’s a video showing you how to use it. You might want to download this spreadsheet and keep it somewhere handy so you can have access to it after the course is over. Based on the results of the analysis, you could write a story beginning like this:
Holderby East and West residents support letting guests stay longer on weeknights and overnight on weekends, a poll by The Metropolitan indicates.
Fifty-five percent of 170 randomly selected Holderby East and West residents said they supported the move, while 36 percent opposed it, and 9 percent had no opinion.
The scientifically valid poll’s error margin indicates that at least 52 percent – and possibly as many as 68 percent – of all Holderby residents presently support the change from the current policy, which prohibits visitors after 6 p.m. on any night of the week.
There is only about a 5 percent chance that the proportion of residents favoring the change is higher or lower than the range estimated by the poll.
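For reference, here is a minimal Python sketch of one common way to compute a confidence interval for a proportion: the normal-approximation formula, with 1.96 as the 95 percent multiplier. The exact method built into the course spreadsheet isn't described on this page, so treat this as an assumption and expect its results to differ somewhat from the spreadsheet's. The counts in the example are hypothetical.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Return (proportion, lower limit, upper limit) using the normal approximation."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # A proportion can't fall below 0 or rise above 1, so clip the limits.
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical poll: 50 of 100 respondents favor a proposal.
p, lower, upper = proportion_confidence_interval(50, 100)
print(f"{p:.0%} in favor, 95% interval {lower:.1%} to {upper:.1%}")
```

One design note: this version ignores the fact that the sample is drawn from a fairly small population of 1,120 residents. Some procedures adjust for that, which narrows the interval a little.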
COMPUTING BIVARIATE INFERENTIAL STATISTICS IN EXCEL
As you’ve seen, “univariate” inferential statistics involve making inferences based on a single variable: a set of randomly gathered dorm room temperatures, for example, or a set of randomly gathered responses to the question, “Do you support or oppose revising the dorm’s visitor policy?” You’ve just finished learning about two of the most useful such techniques: one for averages, and one for percentages. There are others, but these two will take you pretty far. If you’ll go just one step further, you’ll find that you’ve made miles of progress. Because when you go from “univariate” to “bivariate” inferential statistics, you gain the ability to simultaneously consider two variables rather than only one. Below are two simple examples of why that’s a big deal.
Bias at the Bus Stop?
Imagine that the East Side of Metroville is mostly white, whereas the West Side is mostly minority. West Side community leaders claim that the West Side gets subtly shortchanged on a range of government services like police and fire protection, water and sewer services, road and bridge maintenance, garbage collection, and more. As a Metroville News reporter, you’re looking into those claims one day when the frustrated pastor of a West Side church tells you, “Even the buses run later over here. I can’t prove it, but I’m telling you, it’s true. Easily half the people in my congregation have no car and rely on the bus system to get to work, the store and wherever else they need to go. When the buses run late, they miss their transfer connections downtown. As a result, a trip that already takes an hour ends up taking two.” Metroville Transit officials concede that buses all over the city often run at least a few minutes late, especially at peak traffic times. But they dismiss as groundless the idea that buses run any later on the West Side than on the East Side.
You decide to check out the pastor’s claim empirically. After obtaining a list of all pick-up times for all bus stops in the city during peak traffic hours, you randomly select 30 scheduled stops during the upcoming week on the East Side and 30 during the upcoming week on the West Side. Equipped with your cell phone (to give you the correct time), you show up for each scheduled stop and record how many minutes late, if any, the bus arrives. You record your results in this spreadsheet.
Notice that you have two variables to deal with: The number of minutes each bus arrived late, and the location of the bus stop, categorized as “East Side” or “West Side.” Using what you learned last week, you could average the East Side times, average the West Side times, compute a confidence interval for each average, and write a story about average bus delays on the East and West sides. But what you really want to do is learn whether there is a significant difference between the average East Side delay and the average West Side delay. This video shows you how to do so. Based on the results, you could write a news story that began like this:
Metroville Transit buses typically run later on the mostly minority West Side than on the mostly white East Side during peak traffic hours, a Metroville News investigation has found.
Buses arrived an average of 5.8 minutes late for 30 randomly selected scheduled stops on the East Side last week compared to an average of 8.5 minutes late for 30 randomly selected scheduled stops on the West Side last week.
Given these results, there is only a 5 percent chance or less that buses run equally late at bus stops throughout the two communities during peak traffic hours.
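The comparison of two averages described above is a two-sample t-test. The page doesn't say which version the video uses, so the Python sketch below assumes the unequal-variances (Welch) version, which Excel's ToolPak also offers. It computes the t statistic and its degrees of freedom but leaves the final probability lookup to Excel or a t table. The delay figures are hypothetical, not the course spreadsheet.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for the difference b minus a."""
    na, nb = len(sample_a), len(sample_b)
    # Each sample's variance divided by its size: its contribution to the
    # squared standard error of the difference between the averages.
    va = statistics.variance(sample_a) / na
    vb = statistics.variance(sample_b) / nb
    t = (statistics.mean(sample_b) - statistics.mean(sample_a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical minutes-late figures for a handful of stops on each side.
east = [3, 5, 6, 4, 7]
west = [8, 9, 6, 10, 7]

t, df = welch_t(east, west)
print(f"t = {t:.2f} with about {df:.1f} degrees of freedom")
```

The farther the t statistic is from zero, the less plausible it is that the two sides of town have the same average delay.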
Justice for all?
Your story about bus delays gets lots of attention – and produces a flood of tips about other possible discrepancies between how the city serves its white and minority communities. One tip in particular catches your attention. Two friends, one who lives in a mostly white neighborhood and one who lives in a mostly minority neighborhood, suffered burglaries at their homes last year. The white neighborhood resident’s burglary case was solved, or “cleared,” within a month, and much of his stolen property was recovered and returned to him. The minority neighborhood resident’s case is still “open,” meaning unsolved, even though his house was burglarized three months earlier than his friend’s. His stolen property is, in all likelihood, long gone. Both contacted you and speculated that their experience might not be unique.
Given that crime reports are public record, you find it fairly easy to randomly select 15 Metroville burglary cases from last year in which the victims lived in mostly white neighborhoods and another 15 in which the victims lived in mostly minority neighborhoods. You then record the current status of each case as either “Cleared,” meaning solved, or “Open,” meaning unsolved. You compile the results into this spreadsheet.
Note that, here again, you’re dealing simultaneously with two variables: “Case status” (cleared or open) and “Location” (white neighborhood or minority neighborhood). The first thing you do is insert a pivot table to learn the number of open and cleared cases for each type of location. This video offers a refresher on pivot tables. Note that the video leaves the table values as raw counts, which is Excel’s default. The procedure you’re about to learn requires raw counts rather than percentages. Column percentages can come in handy for reporting your results, though. For a refresher on obtaining the column percentages from a pivot table, see this video from earlier in the course.
Next, this video shows you how to perform a chi square analysis aimed at determining whether the pattern in the pivot table suggests that a similar pattern exists among all burglary cases from last year. (Hint: The “ch” in “chi” is pronounced like a “k,” and the “i” is long, so “chi” sounds like the “Ky” in the name “Kyle.”) Based on the results, you could write a news story beginning like this:
Justice came faster to burglary victims in white neighborhoods than to burglary victims in minority neighborhoods last year, according to a Metroville News investigation.
Police solved 12 of 15 burglaries the investigation selected at random from among all burglaries that occurred in Metroville’s mostly white neighborhoods last year. By contrast, police solved only five of 15 burglaries the investigation chose at random from among all burglaries that occurred in the city’s mostly minority neighborhoods during the same time period.
Given these results, there is less than a 5 percent chance that all burglaries in Metroville’s white and minority neighborhoods got solved in equivalent proportions last year.
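To show what the chi square procedure is doing under the hood, here is a minimal Python sketch applied to the counts reported in the story above: 12 cleared and 3 open in white neighborhoods, 5 cleared and 10 open in minority neighborhoods. It assumes the ordinary Pearson chi square without a continuity correction. The cutoff of 3.841 is the standard table value for a 5 percent significance level with one degree of freedom, which is what a two-by-two table has.

```python
def chi_square(table):
    """Return (chi square statistic, expected counts) for a table of raw counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count for each cell if location and case status were unrelated.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return statistic, expected

# Rows: white neighborhoods, minority neighborhoods. Columns: cleared, open.
observed = [[12, 3], [5, 10]]
CRITICAL_05_DF1 = 3.841  # 5 percent cutoff for a two-by-two table

statistic, expected = chi_square(observed)
smallest_expected = min(min(row) for row in expected)

print(f"chi square = {statistic:.2f}, smallest expected count = {smallest_expected}")
print("significant at the 5 percent level:", statistic > CRITICAL_05_DF1)
```

The sketch also reports the smallest expected count, because the test becomes unreliable when any expected count falls below five.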
Something important to notice
You’ll get immediately smarter about using statistics if you can grasp a concept that statisticians call “levels of measurement.” It’s really not that complicated. There are two main levels of measurement, and the scenarios described above illustrate both of them.
The lower of the two measurement levels is called “categorical” measurement. Some statistics books call it “qualitative” measurement, but I prefer “categorical,” so that’s what I’ll use here. The “location” variable in the bus stop data is an example of a categorical variable. The variable involved two mutually exclusive categories, “East Side” and “West Side,” and measuring location involved simply sorting each bus stop into one of those two categories. The “vote” variable in the dorm survey scenario was also categorical. People could be in either the “Favor” category, the “Oppose” category, or the “Don’t know” category. When we need to summarize categorical data, we usually do so using percentages. Specifically, we describe the percentage of cases in each category.
By contrast, the higher of the two measurement levels is called “continuous” measurement. The number of minutes each bus arrived late is a good example of a continuous measure. “Minutes” can range anywhere from zero to … who knows how long, and the farther from zero you get, the more “minutes” you have. Furthermore, the amount of time between, say, one minute and two minutes is identical to the amount of time between two minutes and three minutes. These properties wouldn’t make sense if you tried to apply them to a categorical variable like the “East Side/West Side” distinction. If I tried to tell you that living on the West Side involved one more unit of “sidedness” than living on the East Side, I’d be talking nonsense. But I’d make perfect sense if I told you that a bus arriving two minutes late is one minute later than a bus arriving one minute late. These properties of continuous variables allow us to summarize them using more sophisticated statistics, including averages.
Now, here’s the payoff for absorbing all of that: You can figure out which statistical procedure to use by figuring out the measurement level of each variable involved. Notice how the dorm survey involved a single categorical variable: Vote. So you summarize the votes using percentages, and you use a specific statistical procedure for computing a percentage’s confidence interval. The dorm temperature problem involves a continuous variable – temperature – which you summarize using an average, and you use a specific statistical procedure to build a confidence interval for that average.
This approach gets even more helpful when it comes to bivariate procedures. If you need to look for an association between two categorical variables (like the type of neighborhood in which a burglary occurs and the status of the resulting investigation), you create a pivot table and calculate a chi square. If you need to look for an association between a categorical “cause” (like whether a bus stop is on the East Side or on the West Side) and a continuous result (like the number of minutes a bus arrives late), then you use a t-test.
If you’re paying attention, you’ll notice that there are other possible combinations we haven’t covered. You might need to look for an association between a continuous cause and a categorical result, for example. Or a continuous cause and a continuous result. There’s a statistical procedure (and sometimes more than one) for each possible combination. And there are still more procedures for the additional combinations that surface when you consider not only univariate and bivariate situations but also multivariate situations. Learning all of those procedures would require way more time than we have in this course. We’ve simply focused on four of the handiest procedures.
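The decision rule described above can be sketched as a small lookup table. This is only an illustration in Python. The names `PROCEDURES` and `choose_procedure` are hypothetical, and the table contains nothing beyond the four procedures this course covers.

```python
# Keys are (cause level, result level). A cause of None means there is
# no "cause" variable at all, that is, a univariate question.
PROCEDURES = {
    (None, "categorical"): "percentage with a confidence interval",
    (None, "continuous"): "average with a confidence interval",
    ("categorical", "categorical"): "pivot table and chi square",
    ("categorical", "continuous"): "t-test",
}


def choose_procedure(result_level, cause_level=None):
    """Return the procedure covered in this course, or a note if none applies."""
    levels = {"categorical", "continuous"}
    if result_level not in levels or cause_level not in levels | {None}:
        raise ValueError("levels must be 'categorical' or 'continuous'")
    return PROCEDURES.get((cause_level, result_level), "not covered in this course")


print(choose_procedure("categorical"))                 # the dorm vote
print(choose_procedure("continuous", "categorical"))   # side of town and minutes late
print(choose_procedure("continuous", "continuous"))    # a combination not covered here
```

The fallback message stands in for the many other procedures that exist for combinations outside this course.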
Some important cautions
A power saw is a more complex tool than a hammer. But the additional complexity makes it easier to misuse, possibly with disastrous results. The same is true of bivariate, compared to univariate, inferential statistics. So here are a few important tips from the “safety manual.”
1. Association is not the same as causation. In the second example, we learned that burglaries were more likely to be solved in white communities than in minority ones. But why is that the case? Do police put less energy and resources into solving burglaries that occur in minority neighborhoods? Are burglaries in minority neighborhoods more difficult for police to solve? Either or both of these explanations could be true. And there may be others as well. The analysis cannot say which explanation is most accurate. Proving that one thing causes another is very difficult to do, even using statistics. The analysis shows that a difference exists between white and minority neighborhoods. It will take more reporting to get a handle on why that difference exists, and what eliminating it will require. But don’t overlook the value of showing that the difference exists. Without such evidence, the search for ways to eliminate the difference can’t even begin.
2. The chi square test described in the second example will go statistically haywire if any of the expected values is less than five. If you encounter that situation, you’ll need more observations.
3. Both of the procedures described above require “independence” between the groups they compare. Two groups are independent if two things are both true. First, there can be no overlap between the groups. For example, suppose the bus stop study had included a bus stop that was listed among both the East Side and West Side stops. The two groups would not be independent. Second, there can be no cross-group connections that might cause similar results for members of different groups. For example, suppose the burglary clearance investigation’s sample had included two different houses – one in a white neighborhood, one in a minority neighborhood – that were owned by the same person or family. The ability, or lack thereof, of that person or family to press for results from the police might tend to cause similar outcomes in the two cases. There are other statistical procedures for comparing dependent groups. I’ve focused on independent groups procedures, because there’s only so much time available in this course, and I think the independent groups procedures will prove more useful to you. But any basic statistics book will show you how to perform the dependent samples procedures.
4. When writing about the results of inferential statistical procedures, avoid language that indicates you have proven something beyond all doubt. Given that you’re dealing with random sampling, there is always that 5 percent or so chance that the pattern just happened to show up. To get a sense of this, think about what would happen if you put 10 pennies in a cup, shook them up, dumped them out onto a table, and counted the number of coins that had landed as “heads,” meaning face up. Most likely, you will get something pretty close to five heads and five tails. But repeat this exercise enough times, and you’ll eventually get 10 heads and no tails. It’s rare, but still inevitable. Verbs like “suggest” and “indicate” are better than variations of “prove.” Such phrasing choices are always important, but especially when writing about findings of no significant difference. A nonsignificant difference between bus stop delays or burglary case clearances does not prove that no differences exist. It merely offers no evidence that a difference exists. I seriously doubt that the so-called Loch Ness Monster exists. But unless I can simultaneously strain every gallon of the Loch Ness (an impossible task), I can’t prove to you that it doesn’t. I can tell you only that systematic observation of the loch has produced no solid evidence of the monster’s existence.
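The expected-value rule in caution 2 can be checked before running a chi square test. The Python sketch below uses the usual formula for an expected count (row total times column total, divided by the grand total). The function names are hypothetical, and the counts are made up for illustration. They are not the burglary data from the second example.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def smallest_expected(observed):
    """Smallest expected count anywhere in the table."""
    return min(min(row) for row in expected_counts(observed))


# Hypothetical counts, for illustration only.
# Rows: two neighborhood types. Columns: solved, unsolved.
table = [[12, 8],
         [2, 6]]

expected = expected_counts(table)
too_small = smallest_expected(table) < 5
print(expected)
print("Need more observations" if too_small else "Expected counts look adequate")
```

Note that the rule applies to the expected counts, not the observed ones, so a table can fail the check even when every observed cell looks reasonably full.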
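The penny example in caution 4 can be worked out exactly. The Python sketch below assumes fair coins that land independently of one another, an assumption I am adding for the sake of the arithmetic. The function name `prob_heads` is hypothetical.

```python
from math import comb


def prob_heads(k, n=10):
    """Probability of exactly k heads when n fair coins are tossed."""
    return comb(n, k) / 2 ** n


p_all_heads = prob_heads(10)
p_near_even = sum(prob_heads(k) for k in (4, 5, 6))

print(f"All 10 heads: about 1 in {round(1 / p_all_heads)} tries")
print(f"Four, five or six heads: {p_near_even:.0%} of tries")
```

Under those assumptions, 10 heads turns up about once in every 1,024 tries: rare, but bound to happen if you repeat the exercise often enough.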
Some journalists may not like these caveats. Especially the last one. They may say they want to know, not just strongly suspect, that something is true. But in practice, journalists strongly suspect, rather than truly know, nearly all of what they report. Mostly, they report information relayed to them by sources whom they consider credible based on things like the source’s expertise, level of motivation for providing inaccurate or incomplete information, willingness to provide the information “on the record,” agreement with other, independent sources of the same information, and so forth. None of these methods for “knowing” the truth is foolproof. And not even firsthand observations are 100 percent reliable, considering humans’ capacity to misperceive, or selectively perceive, what they observe.