The United States Government is being sued in a massive class-action suit representing Green Card applicants from over 30 countries which alleges that the United States unfairly denied 22,000 people a Green Card due to a computer blunder.
This story is reported in the Irish Times and the Wall Street Journal.
It is not in the remit of this blog to debate the merits of awarding working visas on the basis of a random lottery, but this is precisely what the Green Card system is, offering places to 50,000 people each year based on a random selection of applications submitted over a 30-day period. According to the WSJ:
In early May, the State Department notified 22,000 people they were chosen. But soon after, it informed them the electronic draw would have to be held again because a computer glitch caused 90% of the winners to be selected from the first two days of applications instead of the entire 30-day registration period.
Many of these 22,000 people are qualified workers who had jobs lined up contingent on their getting the Green Card. The WSJ cites the example of a French neuropsychology PhD holder (who earned her PhD in the US) who had a job offer contingent on her Green Card.
The root causes that contributed to this problem are:
- The random sampling process did not pull records evenly from the entire 30-day period. The sampling was weighted towards the first two days of applications, from which 90% of the “winners” were drawn.
- There was no review of the sampling process and outputs before the notifications were sent to the applicants and published by the State Department. It appears there was a time lag in the error being identified and the decision being taken to scrap the May Visa Lottery draw.
The first error looks like a possible case of a poorly designed sampling strategy in the software. The regulations governing the lottery draw require that there be a “fair and random sampling” of applicants. As 90% of the winners were drawn from the first two days, the implication is that the draw was not fair enough or was not random enough. At the risk of sounding a little clinical however, fair and random do not always go hand in hand when it comes to statistical sampling.
If the sampling strategy was to pool all the applications into a single population (N) and then randomly pull 50,000 applicants (sample size n), then all applicants had a statistically equal chance of being selected. The fact that the sampling pulled records from the same date range is an interesting correlation or coincidence. Indeed, the date of application would be irrelevant to the sampling extraction as everyone would be in one single population. Of course, that depends to a degree on the design of the software that created the underlying data set (were identifiers assigned randomly or sequentially before the selection/sampling process began, etc.).
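To make the single-population approach concrete, here is a minimal sketch in Python. It is not the State Department's software, which has not been published. The pool size, the (day, serial) record layout and the function name draw_pooled are all assumptions made for illustration.

```python
import random

def draw_pooled(applicants, n_winners, seed=None):
    """Simple random sample without replacement from one pooled population."""
    rng = random.Random(seed)
    return rng.sample(applicants, n_winners)

# Hypothetical pool: 30 days x 1,000 applicants, stored as (day, serial) pairs.
pool = [(day, serial) for day in range(1, 31) for serial in range(1000)]
winners = draw_pooled(pool, 3000, seed=42)

# Share of winners who applied on the first two days.
# With a fair pooled draw this should sit close to 2/30 (about 6.7%).
early = sum(1 for day, _ in winners if day <= 2)
early_share = early / len(winners)
print(f"winners from days 1-2: {early_share:.1%}")
```

The date of application plays no part in the draw itself, which is why a 90% concentration in two days would be so unlikely under this strategy. A real draw would presumably use a stronger source of randomness than a seeded generator; the seed is here only to make the example repeatable.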
This is more or less how your local State or National lottery works… there is a defined sample of balls pulled randomly which create an identifier which is associated with a ticket you have bought (i.e. the numbers you have picked). You then have a certain statistical chance of a) having your identifier pulled and b) being the only person with that identifier in that draw (or else you have to share the winnings).
If the sampling strategy was to pull a random sample of roughly 1,666.67 records (50,000 ÷ 30) from each of the 30 days, that is a different approach. Each person on each day of application has the same chance as anyone else who applied that day, with each day contributing the same number of selected applicants. Of course, since you cannot select a fraction of a person, it raises the question of what you do with the rounding difference you are carrying through the 30 days (selecting 1,666 per day leaves 20 places unfilled) in order to still be fair and random (a mini-lottery perhaps).
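The per-day approach, including one possible way of handling the 20 leftover places, can be sketched as follows. The mini-lottery over days is only one option, the function name draw_per_day and the applicant volumes are hypothetical, and the sketch assumes every day has at least as many applicants as its quota.

```python
import random

def draw_per_day(applicants_by_day, n_winners, seed=None):
    """Equal quota per day; leftover places go to days picked by a mini-lottery."""
    rng = random.Random(seed)
    days = sorted(applicants_by_day)
    # 50,000 places over 30 days gives a base quota of 1,666 with 20 left over.
    base, leftover = divmod(n_winners, len(days))
    bonus_days = set(rng.sample(days, leftover))  # these days get one extra place
    winners = {}
    for day in days:
        quota = base + (1 if day in bonus_days else 0)
        winners[day] = rng.sample(applicants_by_day[day], quota)
    return winners

# Hypothetical volumes: 2,000 applicants on each of the 30 days.
applicants_by_day = {
    day: [f"{day}-{i}" for i in range(2000)] for day in range(1, 31)
}
winners_by_day = draw_per_day(applicants_by_day, 50_000, seed=1)
quotas = [len(v) for v in winners_by_day.values()]
print(sum(quotas), quotas.count(1667), quotas.count(1666))
```

Note that under this strategy an applicant's chance depends on how many people applied on the same day, so it is "fair" between days rather than between individuals.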
Which raises the question: if the approach was the “random in a given day” sampling strategy why was the software not tested before the draw to ensure that it was working correctly?
In relation to the time lag between publication of the results and the identification of the error, this suggests a broken or missing control process in the validation of the sampling to ensure that it conforms to the expected statistical model. Again, in such a critical process it would not be unreasonable to have extensive checks but the checking should be done BEFORE the results are published.
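As an illustration of the kind of pre-publication control meant here, the sketch below compares winners per day against what would be expected if every application had the same chance, using a Pearson chi-square statistic. The application volumes are invented, the winner counts are only shaped like the reported 90% skew, and the cut-off value is an assumption taken from standard tables rather than anything the State Department is known to use.

```python
def chi_square_by_day(applications_per_day, winners_per_day):
    """Pearson chi-square statistic: observed winners per day versus the
    counts expected if every application had the same chance of selection."""
    total_apps = sum(applications_per_day.values())
    total_wins = sum(winners_per_day.values())
    stat = 0.0
    for day, apps in applications_per_day.items():
        expected = total_wins * apps / total_apps
        observed = winners_per_day.get(day, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

# Approximate 0.1% critical value for 29 degrees of freedom (30 days minus 1),
# from standard chi-square tables. The choice of cut-off is an assumption.
CRITICAL_VALUE = 58.30

# Hypothetical volumes: the same number of applications on each of 30 days.
applications = {day: 500_000 for day in range(1, 31)}

# Winners shaped like the reported draw: 19,800 of 22,000 (90%) on days 1-2.
skewed = {1: 9_900, 2: 9_900}
skewed.update({day: 79 if day <= 18 else 78 for day in range(3, 31)})

# Winners spread almost evenly: 733 or 734 per day, 22,000 in total.
even = {day: 734 if day <= 10 else 733 for day in range(1, 31)}

skewed_stat = chi_square_by_day(applications, skewed)
even_stat = chi_square_by_day(applications, even)
print("skewed draw fails check:", skewed_stat > CRITICAL_VALUE)
print("even draw fails check:", even_stat > CRITICAL_VALUE)
```

A check like this only applies if the intended design gives every application an equal chance; a different intended design would need a different expected distribution.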
Given the basis of the Class Action suit, expect to see some statistical debate in the evidence being put forward on both sides.
Thanks a lot, Daragh O’Brien, for one of the best articles on this topic. Thanks for your support and impartial opinion on this tragic injustice which has happened to 22,000 people. I think that if the DoS investigates previous lottery results, they will find that in previous years results have been skewed towards other dates. For example, in 1995 all selectees were from the first entry day, and in DV-2011 the majority of selectees were from the last days of application. This is because the DoS has used the same algorithm since the initiation of the lottery. These 22,000 people have just become scapegoats for the current and previous years of the lottery.
Marina
It’s interesting that you claim the results were skewed in previous years. Do you have links to evidence for that? If it is the case, then it would suggest that all applicants are put in a single population for selection but that there is a defect in how the candidate records are identified, with the identifier acting as a surrogate for a date stamp (i.e. a sequential number is applied as people register rather than a random number).
What would be interesting would be to see a document that explains the data flow for the lottery from application through to selection and the statistical sampling approach that they apply.
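To show why a sequential identifier could act as a surrogate for a date stamp, here is a purely hypothetical sketch. The "defect" (only the lowest 10% of identifiers can ever win) is invented for illustration and is not a claim about the actual software. The same defect is applied once to sequential identifiers and once to randomly assigned ones.

```python
import random

rng = random.Random(7)
DAYS, PER_DAY = 30, 1000

# Position in this list is registration order; the value is the day of registration.
registration_days = [day for day in range(1, DAYS + 1) for _ in range(PER_DAY)]
n = len(registration_days)

sequential_ids = list(range(n))   # identifier order mirrors registration order
random_ids = sequential_ids[:]
rng.shuffle(random_ids)           # identifier carries no information about the date

def flawed_pick(ids, k):
    """Hypothetical defect: only the lowest 10% of identifiers can ever win."""
    eligible = [pos for pos, ident in enumerate(ids) if ident < len(ids) // 10]
    return rng.sample(eligible, k)  # positions into registration_days

def early_share(positions):
    """Fraction of picked records that registered in the first three days."""
    return sum(1 for p in positions if registration_days[p] <= 3) / len(positions)

seq_share = early_share(flawed_pick(sequential_ids, 500))
rand_share = early_share(flawed_pick(random_ids, 500))
print(f"sequential IDs: {seq_share:.0%} of winners from days 1-3")
print(f"random IDs:     {rand_share:.0%} of winners from days 1-3")
```

With sequential identifiers the invented defect shows up as a date skew; with random identifiers the same defect is still unfair but is invisible when you look at dates alone.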
Fundamentally, there is a huge hole in State Department’s explanation. They are claiming that
a) some in-house programmer approached them saying that he wanted to improve an algorithm that worked perfectly fine in all previous years, but otherwise there was no particular reason to do it;
b) they did not notice the mistake until after they posted the results (ie 6 months after the draw)
c) most interestingly, according to AP, nobody was disciplined.
So, do they really want me to believe that it’s OK to make the staff sift through 100,000 selected candidates (and pay them salary for 6 months), based on incorrectly selected data? I mean, we are all humans, and we can make errors, but what is completely unacceptable is that there was no quality control performed, even after the change of algorithm.
This check would have taken them just a couple of minutes, using the most basic of statistical packages (heck, one can do it in Excel). All that is required is to build a simple histogram by the selection date, and any pattern would have been obvious (especially a pattern where 90% were selected during the first 2 days). Yet this was not done, and “nobody was disciplined”. Even putting aside the fate of 22,000 foreigners (American taxpayers don’t care about them), what about the monumental waste of resources? Why is nobody paying for this negligence?
I have a simple theory explaining this. The State Department knew all along about the algorithm (first, randomly select 2 days, then select the majority of applicants from those 2 days). They were fine with it all along. Perhaps the algorithm was also used in previous years, too (hence the rumors there were similar patterns in the past). What was different this year is that the results were announced on the first day, and quite a few naive winners shared this information on internet forums. In turn, some unscrupulous people who noticed this pattern started to bombard the Department with threats of suing them for doing a non-random lottery. The Department caved in, and the 22,000 were made the scapegoats.
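The histogram check described in this comment really is only a few lines in most tools. A rough Python equivalent is sketched below; the winner data is invented to match the reported 90% share, and histogram_by_day is a hypothetical helper name.

```python
from collections import Counter

def histogram_by_day(winner_days, width=50):
    """Return one text bar per day, scaled so the busiest day fills `width`."""
    counts = Counter(winner_days)
    peak = max(counts.values())
    lines = []
    for day in sorted(counts):
        bar = "#" * round(width * counts[day] / peak)
        lines.append(f"day {day:2d} | {bar} {counts[day]}")
    return lines

# Hypothetical data: 6,300 of 7,000 winners (90%) on days 1-2, 25 on each other day.
winner_days = [1] * 3150 + [2] * 3150 + [d for d in range(3, 31) for _ in range(25)]
lines = histogram_by_day(winner_days)
print("\n".join(lines))
```

Two full-width bars followed by 28 empty ones is the sort of pattern that would be hard to miss.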
@Ranger78 totally agree with you.
Thanks a million, Daragh. This article is the most complete and accurate article about the randomness of the selection; we will for sure use it as a reference on the randomness argument.
I completely agree with Ranger78. I do not have any evidence or links, but I read some immigrants’ forums and found it very strange that classmates who applied on the same date were all selected. A lot of people are sure that the only reason for all this mess is that, for the first time in the lottery’s history, the results were available online immediately. And naive winners hurried to share their joy with others. If notification was still via letter, this would have gone unnoticed as in previous years. I repeat that these are not only my thoughts but those of many people who are interested in this cause. I consider it absolutely unacceptable that people, even if they are foreigners, have to suffer due to somebody’s negligence. It was a good lesson and therapy against naivety and gullibility.
Thank you for this article!!
Yeah, they should have tested the software before its use, and also the DoS seems not to have the right definition of random…
let us see how it goes…
Great article, the problem is very well-explained.
There is more or less conclusive evidence that in previous DV lottery programs the date of entry was never a factor in selection, at least for Europe.
———————————————————————
From an internet forum – a message by one of the middlemen:
“We were working with some of dv2012 data posting it and checking winners. About 250000 (250k) were sent from 6/10/2010 till 29/10/2010 divided to equal portions. Our statistic is – 96% of all our winners were post on 6, and 4% for rest 7-29. Also I can confirm that all winners from 6/10/2010 that were submit with spouse / children have all their spouse and children also win. About 6/10/2010 – 62% of all submitted were selected.
In my opinion – they have to explain what’s going on at least.
I’m talking just about entries we were submitting ourselves. We have prepared 252643 entries and were submitting them 10000-12000 per day from 7 to 29. First day – on 6 we were submitting less entries (2250) as we were testing. (I can give per day numbers of submitting and winners if you want).
We have winners: 1301 win from 6, 0 win from 7 till 17, 57 win from 18 to 29.
Also I can say that we had more winners last year, even that we submitted 220k entries. And winner amount was near to equal among all days.
Also I don’t believe in system bug. But if it was a bug it should be that they picked a drunk man right from the street and gave him 50$ for a script that chose winners. So – then it can be bug”
———————————————————————
I refer to the statement:
“And winner amount was near to equal among all days”