Category Archives: USA IQ Train Wrecks

Information Quality problems impact reporting in US Education

Via The Miami Herald comes a story that highlights a number of impacts of poor quality information in key processes.

In Oklahoma, schools are placed on an “improvement list” if they fail to meet standards for two consecutive years. Once on the list, a school must show progress in improving standards for two years before it can be taken off the list. This can have implications for funding and access to resources as well. Some Oklahoma school districts are, it is reported, concerned that they won’t make the grade against Federal requirements.

Problems with the quality of demographic data in electronic testing performed by Pearson have affected the publication of the reports against which schools are graded. These will now be available a full month late, released in September rather than August as expected. This will affect the ability of School Boards to respond effectively to their report cards.

Other problems reported on top of the missed deadlines include errors in printing the report cards to be sent to parents.

Oklahoma’s Superintendent of Schools Janet Barresi has described the impacts of poor quality data in this process as a “ripple effect” that is “imposing an unacceptable burden on school districts” and has called for Pearson’s contract to be reviewed. Pearson is engaging an independent third party to help verify the accuracy and validity of the scoring data (in which it remains confident).

Oklahoma is not the first State where data issues have been a problem.

  • In 2010, Pearson was penalised $14.7 million in Florida and had to increase staffing levels and make changes to its systems as a result of information quality problems that led to delays. The problems there related to the matching of student records.
  • Also in 2010, in Wyoming, Pearson had to pay penalties arising from problems with testing, ranging from missing data to administrative problems such as improperly calibrated protractors.

This video from the Data Quality Campaign, a US non-profit working to improve standards of data quality in the US education system, highlights the value of good quality and timely information in this important sector.

Google Maps inaccuracies

We spotted this on Gawker.com. From my experience using Google Maps, it rings true (I was recently sent 15 miles out of my way on a trip in rural Ireland).

It seems that Google Maps has plotted the location of a lakeside tourist attraction in New Jersey right at the end of the driveway of a private residence. So, over the 4th of July weekend, the owners of the property had to fend off increasingly irate visitors who were looking for the lake and wound up in their driveway instead.

So, the data is inaccurate and of poor quality. Have Google responded to the error and replotted the location of the tourist area at the lake? Not yet, according to the story on Gawker.

Green Card, Red Faces

The United States Government is being sued in a massive class-action suit, representing Green Card applicants from over 30 countries, which alleges that the United States unfairly denied 22,000 people a Green Card because of a computer blunder.

This story is reported in the Irish Times and the Wall Street Journal.

It is not within the remit of this blog to debate the merits of awarding working visas on the basis of a random lottery, but that is precisely what the Green Card system is: it offers places to 50,000 people each year based on a random selection of the applications submitted over a 30-day period. According to the WSJ:

In early May, the State Department notified 22,000 people they were chosen. But soon after, it informed them the electronic draw would have to be held again because a computer glitch caused 90% of the winners to be selected from the first two days of applications instead of the entire 30-day registration period.

Many of these 22,000 people are qualified workers who had jobs lined up contingent on their getting the Green Card. The WSJ cites the example of a French neuropsychology PhD holder (who earned her PhD in the US) whose job offer was contingent on her Green Card.

The root causes that contributed to this problem are:

  1. The random selection did not pull records from the entire 30-day period; the sampling was weighted towards the first two days of applications, with 90% of the “winners” drawn from those two days.
  2. There was no review of the sampling process and its outputs before the notifications were sent to applicants and published by the State Department, and there appears to have been a time lag between the error being identified and the decision being taken to scrap the May Visa Lottery draw.

The first error looks like a case of a poorly designed sampling strategy in the software. The regulations governing the lottery draw require a “fair and random sampling” of applicants. As 90% of the winners were drawn from applications submitted in the first two days, the implication is that the draw was either not fair enough or not random enough. At the risk of sounding a little clinical, however, fair and random do not always go hand in hand when it comes to statistical sampling.

If the sampling strategy was to pool all the applications into a single population (N) and then randomly pull 50,000 applicants (sample size n), then every applicant had a statistically equal chance of being selected, and the fact that the selected records clustered in a narrow date range would be just an interesting coincidence. Indeed, the date of application would be irrelevant to the extraction, as everyone would be in one single population. Of course, that depends to a degree on the design of the software that created the underlying data set (were identifiers assigned randomly or sequentially before the selection/sampling process began, etc.).

This is more or less how your local state or national lottery works… a set of balls is drawn at random, creating an identifier that is matched against the ticket you have bought (i.e. the numbers you have picked). You then have a certain statistical chance of a) having your identifier drawn and b) being the only person with that identifier in that draw (otherwise you have to share the winnings).
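As a rough illustration of that pooled approach, here is a minimal sketch in Python (the application records and helper names are invented for the example; this is not the State Department’s actual system):

    import random

    def pooled_draw(applications, n_winners, seed=None):
        # Every application, whichever day it arrived, has an equal chance.
        rng = random.Random(seed)
        return rng.sample(applications, n_winners)

    # Hypothetical data: (application_id, day_submitted) over a 30-day window.
    rng = random.Random(42)
    applications = [(i, rng.randint(1, 30)) for i in range(1_000_000)]
    winners = pooled_draw(applications, 50_000, seed=1)

Note that the day of submission plays no part in the selection here; any clustering of winners by date would be down to chance.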

If the sampling strategy was instead to pull a random sample of 1,666.67 records from each of the 30 days, that is a different approach: each person has the same chance as anyone else who applied on the same day, and each day contributes an equal number of selected applicants. Of course, it raises the question of what you do with the rounding difference carried through the 30 days (equating to 20 people) in order to remain fair and random (a mini-lottery, perhaps).
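A sketch of that per-day approach (again with invented application records) would stratify the draw by day and settle the rounding remainder with a small top-up draw:

    import random
    from collections import defaultdict

    def per_day_draw(applications, n_winners=50_000, n_days=30, seed=None):
        """Stratified draw: an equal share of winners from each day,
        plus a mini-lottery for the rounding remainder."""
        rng = random.Random(seed)
        by_day = defaultdict(list)
        for app_id, day in applications:
            by_day[day].append((app_id, day))

        share = n_winners // n_days              # 1,666 per day
        winners = []
        for day in range(1, n_days + 1):
            winners += rng.sample(by_day[day], share)

        # 50,000 - 30 * 1,666 leaves 20 places: settle them with a
        # mini-lottery over everyone not already drawn.
        drawn = set(winners)
        leftovers = [a for a in applications if a not in drawn]
        winners += rng.sample(leftovers, n_winners - share * n_days)
        return winners

    rng = random.Random(7)
    applications = [(i, rng.randint(1, 30)) for i in range(1_000_000)]
    winners = per_day_draw(applications, seed=7)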

Which raises the question: if the approach was the “random within a given day” sampling strategy, why was the software not tested before the draw to ensure it was working correctly?

As for the time lag between publication of the results and identification of the error, this suggests a broken or missing control process for validating that the sampling conforms to the expected statistical model. Again, in such a critical process it would not be unreasonable to have extensive checks, but the checking should be done BEFORE the results are published.
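A control of this kind does not need to be elaborate. Even a crude distribution check on the draw output, run before anything is published, would flag a draw in which 90% of winners came from two days. A minimal sketch (with made-up numbers matching the story):

    import random
    from collections import Counter

    def check_draw_spread(winners, n_days=30, tolerance=3.0):
        """Fail loudly if any single day holds far more than its expected
        share of winners; run this BEFORE notifications go out."""
        counts = Counter(day for _app_id, day in winners)
        expected = len(winners) / n_days
        skewed = {day: n for day, n in counts.items() if n > tolerance * expected}
        if skewed:
            raise ValueError(f"Draw looks skewed, do not publish: {skewed}")
        return counts

    # A deliberately skewed draw like the one described: 90% of the 22,000
    # "winners" crammed into the first two days of a 30-day window.
    rng = random.Random(0)
    bad_draw = [(i, rng.randint(1, 2)) for i in range(19_800)] + \
               [(i, rng.randint(3, 30)) for i in range(19_800, 22_000)]
    check_draw_spread(bad_draw)   # raises ValueError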

Given the basis of the Class Action suit, expect to see some statistical debate in the evidence being put forward on both sides.

Gas by-products give a pain in the gut

Courtesy of Lwanga Yonke comes this great story about how the choice of unit of measure for reporting, particularly in regulatory or Corporate Social Responsibility reports, can be very important.

The natural gas industry’s claim that it is making great strides in reducing the polluted wastewater it discharges to rivers is proving difficult to assess because of inconsistent reporting and a big data entry error in the system for tracking contaminated fluids.

The issue:

Back in February, the natural gas industry in the US released statistics which appeared to show that it had managed to recycle at least 65% of the toxic waste brine that is a by-product of natural gas production. Unfortunately, the data input was a little askew, thanks to one company that had reported data back to the State of Pennsylvania using the wrong unit of measure – confusing barrels with gallons.

For those of us who aren’t into the minutiae of natural gas extraction, the Wall Street Journal helpfully points out that there are 42 gallons in a barrel. So, by reporting 5.2 million barrels of wastewater recycled instead of the 5.2 million gallons that were actually recycled, the helpful data entry error overstated the recycling success by a factor of 42.

Which is, co-incidentally, the answer to Life the Universe and Everything.
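The arithmetic is simple enough to sanity-check at the point of data entry; here is a two-minute sketch using the figures from the story (the variable names are ours):

    GALLONS_PER_BARREL = 42

    reported_barrels = 5_200_000     # what was keyed in: "barrels"
    actual_gallons = 5_200_000       # what was really recycled: gallons

    reported_gallons = reported_barrels * GALLONS_PER_BARREL
    print(reported_gallons / actual_gallons)   # 42.0: recycling overstated 42-fold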

According to the Wall Street Journal, it may be impossible to accurately identify the rate of waste water recycling in the natural gas industry in the US.

Not counting Seneca’s bad numbers — and assuming that the rest of the state’s data is accurate — drillers reported that they generated about 5.4 million barrels of wastewater in the second half of 2010. Of that, DEP lists about 2.8 million barrels going to treatment plants that discharge into rivers and streams, about 460,000 barrels being sent to underground disposal wells, and about 2 million barrels being recycled or treated at plants with no river discharge.

That would suggest a recycling rate of around 38 percent, a number that stands in stark contrast to the 90 percent recycling rate claimed by some industry representatives. But Kathryn Klaber, president of the Marcellus Shale Coalition, an industry group, stood by the 90 percent figure this week after it was questioned by The Associated Press, The New York Times and other news organizations.

The WSJ article goes on to point out that there is a lack of clarity about what should actually be reported as recycled waste water, and issues with the tracking and reporting of waste water discharges from gas extraction.

At least one company, Range Resources of Fort Worth, Texas, said it hadn’t been reporting much of its recycled wastewater at all, because it believed the DEP’s tracking system only covered water that the company sent out for treatment or disposal, not fluids it reused on the spot.

Another company that had boasted of a near 100 percent recycling rate, Cabot Oil & Gas, also Houston-based, told The AP that the figure only included fluids that gush from a well once it is opened for production by a process known as hydraulic fracturing. Company spokesman George Stark said it didn’t include different types of wastewater unrelated to fracturing, like groundwater or rainwater contaminated during the drilling process by chemically tainted drilling muds.

So, a finger flub on data entry, combined with a lack of agreement on the meaning and usage of data in the industry and gaps in regulation and enforcement of standards, means that there is, as of now, no definitive right answer to the question “how much waste water is recycled from gas production in Pennsylvania?”.

What does your gut tell you?

 

Smart Grid, Dumb Data

In September 2010 a massive gas explosion ripped through the San Francisco suburb of San Bruno, not too far from San Francisco International Airport. The explosion was so powerful it was registered as a magnitude 1.1 earthquake.

Subsequent investigations have identified poor quality data as a contributory factor in the disaster. According to Fresnobee.com:

The cause of the deadly rupture has not yet been determined, but the PUC said it is moving ahead with the penalty phase after the National Transportation Safety Board recently determined that PG&E incorrectly described the pipe as seamless when in fact it was seamed and welded, making it weaker than a seamless pipe.

Read more: http://www.fresnobee.com/2011/02/25/2285689/pge-faces-big-fine-over-gas-pipeline.html#

According to the San Francisco Chronicle, the problems with PG&E’s data were nothing new, with issues stretching back almost 20 years.

Omissions or data-entry errors made when the system was developed – and left uncorrected – may explain why PG&E was unaware that the 1956-vintage pipeline that exploded in San Bruno on Sept. 9, killing eight people, had been built with a seam, according to records and interviews. Federal investigators have found that the explosion started at a poorly installed weld on the seam.


The importance of context

Data is often defined as “Facts about things” and Information is often defined as “Facts about things in a context”.

From Lwanga Yonke (IAIDQ Advisor and one of the visionaries behind the CIQP certification) comes this great example of how, without consistent application of context, data can give rise to poor quality and misleading information.

Sign showing population, feet above sea level and year founded, with the figures totalled

Image linked from "thepocket.com"

What we see in the sign opposite are three distinct contexts:

  1. A count of the population (562)
  2. The height of the town above sealevel (2150)
  3. The year the town was founded (1951)

And of course, when we see a column of figures our instinct (since our earliest school days) is to add them all up… to give us 4663.

Of course, that figure is meaningless as information, and is also poor quality data.
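One simple way to avoid this is to keep the context attached to the figure: carry the unit of measure with each value and refuse to aggregate across different units. A minimal sketch (the Measure type here is invented for illustration, using the figures from the sign):

    from collections import namedtuple

    Measure = namedtuple("Measure", ["value", "unit"])

    sign = [
        Measure(562, "people"),    # population
        Measure(2150, "feet"),     # elevation above sea level
        Measure(1951, "year"),     # year founded
    ]

    def total(measures):
        units = {m.unit for m in measures}
        if len(units) > 1:
            raise ValueError(f"Refusing to add values in different units: {units}")
        return sum(m.value for m in measures)

    total(sign)   # raises ValueError instead of quietly producing 4663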

I have personally experienced similar “challenges of context” when tracking back root cause analyses in Regulatory Compliance projects… the stakeholder pulling the incident reports together didn’t consider context, and as such was comparing apples with ostrich eggs (if he’d been comparing apples to oranges, at least they’d both have been fruit).

I’d love to hear your stories of contextual conundrums that have led to poor quality data and erroneous information.

Did you check on the cheques we sent to County Jail?

Courtesy of Keith Underdown comes yet another classic IQ Trainwreck, which he came across on CBS News.

It seems that up to 3,900 prisoners received cheques (or ‘checks’ to our North American readers) of US$250 each, despite the very low probability that they would be able to actually use them to stimulate the economy. Of the 3,900, 2,200 were, it seems, entitled to receive them, as they had not been incarcerated in any of the three months prior to the enactment of the Stimulus bill.

However, that still leaves 1,700 prisoners who received cheques they should not have. The root cause?

According to CBS News:

…government records didn’t accurately show they were in prison

A classic information quality problem… inaccurate master data being used in a process, resulting in an unexpected or undesired outcome.

While most prisons have intercepted and returned the cheques, there will now need to be a process to identify, for each prisoner, whether the Recovery payment was actually due. Again, a necessary manual check (no pun intended) at this stage, but one which will add to the cost and time involved in processing the Recovery cheques.

Of course, we’ve already written here about the problem with Stimulus cheques being sent to deceased people.

These cases highlight the fact that an Information Quality problem doesn’t have to hit your bottom line massively or affect a significant number of people to have an impact on your reputation.

US Government Health (S)Care.

Courtesy of Jim Harris at the excellent OCDQBlog.com comes this classic example of a real-life Information Quality Trainwreck concerning US healthcare. Keith Underdown also sent us the link to the story on USA Today’s site.

It seems that 1,800 US military veterans have recently been sent letters informing them that they have the degenerative neurological disease ALS (a condition similar to that which physicist Stephen Hawking has).

At least some of the letters, it turns out, were sent in error.

[From the LA Times]

As a result of the panic the letters caused, the agency plans to create a more rigorous screening process for its notification letters and is offering to reimburse veterans for medical expenses incurred as a result of the letters.

“That’s the least they can do,” said former Air Force reservist Gale Reid in Montgomery, Ala. She racked up more than $3,000 in bills for medical tests last week to get a second opinion. Her civilian doctor concluded she did not have ALS, also known as Lou Gehrig’s disease.

So, poor quality information entered a process, resulting in incorrect decisions, distressing communications, and additional costs to individuals and government agencies. Yes, this is ticking all the boxes to be an IQ Trainwreck.

The LA Times reports that the Department of Veterans Affairs estimates that 600 letters were sent to people who did not have ALS. That is a 33% error rate. The cause of the error? According to the USA Today story:

Jim Bunker, president of the National Gulf War Resource Center, said VA officials told him the letters dated Aug. 12 were the result of a computer coding error that mistakenly labeled the veterans with amyotrophic lateral sclerosis, or ALS.

Oh. A coding error on medical data. We have never seen that before on IQTrainwrecks.com in relation to private health insurer/HMO data. Gosh no.

Given the impact that a diagnosis of an illness which kills those affected within an average of 5 years can have, this simple coding error has been bumped up to a classic IQTrainwreck.

There are actually two information quality issues at play here, however, which illustrate one of the common problems in convincing people that there is an information quality problem in the first place. While the VA now estimates (and I put that in bold for a reason) that the error rate was 600 out of 1,800, the LA Times reporting tells us that:

… the VA has increased its estimate on the number of veterans who received the letters in error. Earlier this week, it refuted a Gulf War veterans group’s estimate of 1,200, saying the agency had been contacted by fewer than 10 veterans who had been wrongly notified.

So, the range of estimates for the error goes from 10 in 1,800 (under 1%) to 600 in 1,800 (33%) to 1,200 in 1,800 (66%). The interesting thing for me as an information quality practitioner is that the VA’s initial estimate was based on the number of people who had contacted the agency.

This is an important lesson… the number of reported errors (anecdotes) may be lower than the number of actual errors, and the only real way to know is to examine the quality of the data and look for evidence of errors and inconsistency so you can Act on Fact.
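In practice, “Act on Fact” can be as simple as auditing a random sample of the affected records against the source data and projecting the measured error rate, rather than counting complaints. A sketch (the 1,800 letters and the verification step are simulated here):

    import random

    def estimate_error_rate(record_ids, is_wrong, sample_size=200, seed=None):
        """Audit a random sample and project the error rate, instead of
        relying on the errors people happen to report."""
        rng = random.Random(seed)
        sample = rng.sample(record_ids, sample_size)
        errors = sum(1 for rec in sample if is_wrong(rec))
        return errors / sample_size

    # Simulated example: 1,800 letters, of which a third are actually wrong.
    letters = list(range(1_800))
    truly_wrong = set(random.Random(1).sample(letters, 600))
    rate = estimate_error_rate(letters, truly_wrong.__contains__, seed=2)
    print(f"Estimated error rate: {rate:.0%}")   # lands close to the true 33%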

The positive news… the VA is changing its procedures. The bad news about that… it looks like they are investing money in inspecting defects out of the process rather than making sure the correct fact is correctly coded in patient records.

No child left behind (except for those that are)

Steve Sarsfield shares with us this classic tale of IQ Trainwreck-ry from Atlanta, Georgia.

An analysis of student enrollment and transfer data carried out by the Atlanta Journal-Constitution reveals a shocking number of students who appear to be dropping out of school and off the radar in Georgia. This suggests that the dropout rate may be higher and the graduation rate lower than previously reported.

Last year, school staff marked more than 25,000 students as transferring to other Georgia public schools, but no school reported them as transferring in, the AJC’s analysis of enrollment data shows.

Analysis carried out by the responsible State agency tracked down some of the missing students, but poor quality information makes any further tracking problematic, if not impossible.

That search located 7,100 of the missing transfers in Georgia schools, state education spokesman Dana Tofig wrote in an e-mailed statement. The state does not know where an additional 19,500 went, but believes other coding errors occurred, he wrote. Some are dropouts but others are not, he said.
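The check the AJC ran is essentially a reconciliation: every student coded as transferring out of one school should turn up as a transfer in at another. A minimal sketch of that kind of matching (school names and student IDs invented):

    # Invented records: (student_id, school) pairs as reported by schools.
    transfers_out = {("S001", "Northside High"), ("S002", "Eastside High"),
                     ("S003", "Westside High")}
    transfers_in = {("S001", "Southside High")}

    out_ids = {student for student, school in transfers_out}
    in_ids = {student for student, school in transfers_in}

    unaccounted = out_ids - in_ids   # transferred out, never reported arriving anywhere
    print(f"{len(unaccounted)} students unaccounted for: {sorted(unaccounted)}")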

In a comment which should warm the hearts of Information Quality professionals everywhere, Cathy Henson, a Georgia State education law professor and former state board of education chairwoman, says:

“Garbage in, garbage out.  We’re never going to solve our problems unless we have good data to drive our decisions.”

She might be interested in reading more on just that topic in Tom Redman’s book “Data Driven”.

Dropout rates constitute a significant IQ Trainwreck because:

  • Children who should be helped to better education aren’t. (They get left behind)
  • Schools are measured against Federal Standards, including drop out rates, which can affect funding
  • Political and business leaders often rely on these statistics for decision making, publicity, and campaigning.
  • Companies consider the drop out rate when planning to locate in Georgia or elsewhere as it is an indicator of future skills pools in the area.

The article quotes Bob Wise on the implications of fudging the data, in a comment that sums up the impact of masking dropouts through miscoding (by accident or design):

“Entering rosy data won’t get you a bed of roses,” Wise said. “In a state like Georgia that is increasingly technologically oriented, it will get you a group of people that won’t be able to function meaningfully in the workforce.”

The article goes on to highlight yet more knock-on impacts from the crummy data and poor quality information that the study uncovered:

  • Federal standard formulae for calculating dropouts won’t give an accurate figure if students are mis-coded as “transfers” from one school to another.
  • A much-touted unique student identifier has been found to be less than unique, with students often being given a new identifier in their new school.
  • Inconsistencies exist in other data, for example students who were reported “removed for non-attendance” but had zero absent days recorded against them (a simple cross-check of the kind sketched below would catch these).
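Inconsistencies like that last one are exactly what simple cross-field validation rules catch. A sketch (the record layout and field names are invented) that flags students coded as removed for non-attendance but carrying zero absences:

    # Invented records; a real enrolment extract would have many more fields.
    students = [
        {"id": "S100", "exit_code": "removed_non_attendance", "absent_days": 0},
        {"id": "S101", "exit_code": "removed_non_attendance", "absent_days": 41},
        {"id": "S102", "exit_code": "transfer_out", "absent_days": 3},
    ]

    def inconsistent_non_attendance(records):
        # Flag records whose exit code contradicts their attendance figures.
        return [r for r in records
                if r["exit_code"] == "removed_non_attendance"
                and r["absent_days"] == 0]

    for record in inconsistent_non_attendance(students):
        print(f"{record['id']}: removed for non-attendance but zero absent days")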

Given the impact on students, the implications for school rankings and funding, the costs of correcting errors, and the scale and extent of problems uncovered, this counts as a classic IQTrainwreck.

The terror of the Terrorist Watch list

A source who wishes to remain anonymous sent us this link to a story on Wired.com about the state of the US Government’s Terrorist watch list.

The many and varied problems with the watch list have been covered on this blog before.

However, the reason this most recent story constitutes an IQTrainwreck is that, despite undertakings to improve quality, the exact opposite seems to have happened, given:

  • The growth in the number of entries on the list
  • The failures on the part of the FBI to properly maintain and update information in a timely manner.

According to the report, 15% of active terrorism suspects under investigation were not added to the watch list, and 72% of people cleared in closed investigations were not removed from it.

The report from the US Inspector General said that they “believe that the FBI’s failure to consistently nominate subjects of international and domestic terrorism investigations to the terrorist watchlist could pose a risk to national security.”

That quote sums up why this is an IQTrainwreck.
