The United States Government is facing a massive class-action suit, brought on behalf of Green Card applicants from over 30 countries, which alleges that the United States unfairly denied 22,000 people a Green Card because of a computer blunder.
This story is reported in the Irish Times and the Wall Street Journal.
It is not in the remit of this blog to debate the merits of awarding working visas on the basis of a random lottery, but that is precisely what the Green Card system is: it offers places to 50,000 people each year based on a random selection of applications submitted over a 30-day period. According to the WSJ:
> In early May, the State Department notified 22,000 people they were chosen. But soon after, it informed them the electronic draw would have to be held again because a computer glitch caused 90% of the winners to be selected from the first two days of applications instead of the entire 30-day registration period.
Many of these 22,000 people are qualified workers who had jobs lined up contingent on their getting the Green Card. The WSJ cites the example of a French neuropsychology PhD holder (who earned her PhD in the US) whose job offer depended on her getting the card.
The root causes that contributed to this problem appear to be:
- The random sampling process did not pull records evenly from the entire 30-day period: the selection was weighted towards the first two days, with 90% of the “winners” drawn from applications submitted in those two days.
- There was no review of the sampling process and its outputs before the notifications were sent to applicants and published by the State Department, and there appears to have been a time lag between the error being identified and the decision being taken to scrap the May Visa Lottery draw.
The first error looks like a possible case of a poorly designed sampling strategy in the software. The regulations governing the lottery draw require that there be a “fair and random sampling” of applicants. As 90% of the winners were drawn from applications submitted in the first two days, the implication is that the draw was not fair enough or not random enough. At the risk of sounding a little clinical, however, fair and random do not always go hand in hand when it comes to statistical sampling.
If the sampling strategy was to pool all the applications into a single population (N) and then randomly pull 50,000 applicants (a sample of size n), then every applicant had a statistically equal chance of being selected, and the fact that the selected records cluster in the same date range is an interesting coincidence: the date of application would be irrelevant to the extraction, because everyone would sit in one single population. Of course, that depends to a degree on the design of the software that created the underlying data set (were identifiers assigned randomly or sequentially before the selection process began, etc.).
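If that pooled design is what the regulations intend, it is straightforward to express. Here is a minimal sketch in Python, with made-up entry volumes and a uniform spread of submissions across the window (neither figure is published); the point is simply that the submission date never enters the selection logic.

```python
import random
from datetime import date, timedelta

# Hypothetical parameters -- the real entry volumes and data model are not public.
NUM_DAYS = 30
NUM_WINNERS = 50_000
NUM_APPLICATIONS = 1_500_000            # illustrative only
REGISTRATION_START = date(2010, 10, 5)  # illustrative date

# Build a single pooled population; each record carries its submission date,
# here spread uniformly across the 30-day window for the sake of the example.
applications = [
    (app_id, REGISTRATION_START + timedelta(days=random.randrange(NUM_DAYS)))
    for app_id in range(NUM_APPLICATIONS)
]

# Simple random sampling without replacement: every application has an equal
# chance of selection, and the submission date plays no role in the draw.
winners = random.sample(applications, NUM_WINNERS)

# With uniform submissions, roughly 2/30 (~6.7%) of winners should come from
# the first two days -- nowhere near the 90% reported in the actual draw.
cutoff = REGISTRATION_START + timedelta(days=2)
early = sum(1 for _, submitted in winners if submitted < cutoff)
print(f"Winners from the first two days: {early / NUM_WINNERS:.1%}")
```

Under this design, 90% of winners could only come from the first two days if something close to 90% of the applications had arrived in those two days.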
This is more or less how your local state or national lottery works: a defined set of balls is drawn at random, creating an identifier that is matched against the ticket you have bought (i.e. the numbers you have picked). You then have a certain statistical chance of a) having your identifier drawn and b) being the only person holding that identifier in that draw (otherwise you share the winnings).
If the sampling strategy was instead to pull a random sample of 1,666.67 records (50,000 ÷ 30) from each of the 30 days, that is a different approach. Each person has the same chance as anyone else who applied on the same day, and each day contributes an equal share of the selections. Of course, you cannot select two-thirds of a person: 1,666 winners per day accounts for only 49,980 places, which raises the question of what to do with the remaining 20 in a way that is still fair and random (a mini-lottery perhaps, as sketched below).
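For contrast, here is a minimal sketch of that per-day strategy, again with made-up daily volumes. The “mini-lottery” over the rounding remainder is one plausible way to keep the last 20 selections fair and random; it is not a documented feature of the real system.

```python
import random

NUM_DAYS = 30
NUM_WINNERS = 50_000
PER_DAY = NUM_WINNERS // NUM_DAYS   # 1,666 winners from each day
REMAINDER = NUM_WINNERS % NUM_DAYS  # the 20 left over by the rounding

# Hypothetical per-day entry pools; real daily volumes are not public.
applications_by_day = [
    [f"APP-{day:02d}-{i}" for i in range(random.randint(20_000, 80_000))]
    for day in range(NUM_DAYS)
]

winners = []
remaining = []  # non-winners, all still eligible for the mini-lottery

# Stratified draw: an equal-sized random sample from each day's applicants.
for day_pool in applications_by_day:
    daily_winners = random.sample(day_pool, PER_DAY)
    winners.extend(daily_winners)
    selected = set(daily_winners)
    remaining.extend(a for a in day_pool if a not in selected)

# One fair-and-random way to settle the rounding difference: a mini-lottery
# over everyone not yet selected, drawing the last 20 winners.
winners.extend(random.sample(remaining, REMAINDER))

assert len(winners) == NUM_WINNERS
```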
Which raises the question: if the approach was the “random within a given day” sampling strategy, why was the software not tested before the draw to ensure it was working correctly?
In relation to the time lag between publication of the results and identification of the error, this suggests a broken or missing control process for validating that the sampling output conforms to the expected statistical model. Again, for such a critical process it would not be unreasonable to have extensive checks, but the checking should be done BEFORE the results are published.
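Such a control need not be elaborate. The sketch below uses a chi-square goodness-of-fit test (via scipy) to compare the per-day spread of winners against what fair sampling from the pooled population would predict; the per-day counts and the alpha threshold are invented for illustration, not taken from any published State Department control.

```python
from scipy.stats import chisquare

def draw_looks_fair(winners_per_day, apps_per_day, alpha=0.001):
    """Pre-publication control: test whether the per-day spread of winners
    is consistent with fair random sampling from the pooled population.

    The inputs and the alpha threshold are assumptions for illustration.
    """
    total_winners = sum(winners_per_day)
    total_apps = sum(apps_per_day)
    # Expected winners per day are proportional to that day's share of entries.
    expected = [total_winners * apps / total_apps for apps in apps_per_day]
    result = chisquare(f_obs=winners_per_day, f_exp=expected)
    return result.pvalue >= alpha  # False -> withhold publication, investigate

# Uniform entries across 30 days, but 90% of ~50,000 winners from days 1-2:
apps_per_day = [500_000] * 30
skewed = [22_500, 22_500] + [178] * 28  # grossly front-loaded draw
print(draw_looks_fair(skewed, apps_per_day))  # False: fails the control

even = [1_666] * 30                     # an unremarkable, even spread
print(draw_looks_fair(even, apps_per_day))   # True: passes
```

Run before notifications go out, a check like this would have flagged the skewed draw immediately rather than after 22,000 people had been told they won.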
Given the basis of the class-action suit, expect to see some statistical debate in the evidence put forward on both sides.