Autosomal DNA Analysis: Determining Your Ethnic/Geographic Origins & Finding Present-Day Relatives You Didn’t Know You Had
© 2020 by James Michael Gossett
What is Autosomal DNA Analysis, and what use is it?
Well, it’s not what we’re using on our Gossett YDNA Project. Our project uses markers on the Y chromosome called short-tandem repeats (STRs) to assist in constructing lineage relationships among Gossetts. Since the Y chromosome is passed down paternal lines (grandfather to father to son, etc.) markers on it are particularly useful in surname projects because in our culture, surnames are also commonly passed down paternal lines. STRs are particularly good for resolving lineage differences over the most recent 400-500 years because they incur differences (mutations) at a useful rate – not too frequent (which would make it hard to discern who belongs to the same line because everyone would look different, Y-marker-wise) and not too seldom (which would make all lines look the same, and thus not useful for seeing differences among lines over the past few hundred years). STRs are seemingly “junk” DNA in regions between functional genes, and thus mutations in them don’t appear to affect function – which is a good thing, since consisting of short, repeated sequences makes STRs more prone to mutation than other, non-repeated sequences.
Autosomal DNA is the DNA on the 22 chromosome pairs other than the sex chromosome pair (the X and the Y). It is inherited from both parents in roughly equal proportions, and is used in analyses designed to provide information on ethnic/geographical origins, as well as to find long-lost parents, siblings, second cousins, and such. The tests look for single-nucleotide polymorphisms (SNPs), which is jargon that simply means places where one nucleotide has been substituted for another. Our DNA is made up of chains of nucleotides, with each nucleotide in the chain having one of four bases on it (A, T, C, or G). Thus, a DNA sequence is described by the order of these bases. For example, …AACTGTGAC… A SNP might show up as …AACAGTGAC. FamilyFinder (FF) and similar services, such as 23andMe, track the locations of SNPs and compare DNA sequences among people in their databases to see how much of the sequences is shared. One’s whole genome isn’t being sequenced, but rather a sample portion of the total.
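To make the idea of a SNP concrete, here is a minimal Python sketch that compares the two example sequences above, base by base, and reports where they differ. The function name `find_snps` is made up for illustration; the testing services read known SNP positions from a genotyping chip rather than comparing raw strings like this, so treat it only as a picture of the concept.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are counted from 0. Both sequences must be the same length,
    because this toy comparison does not handle insertions or deletions.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# The two fragments quoted in the text.
reference = "AACTGTGAC"
sample = "AACAGTGAC"

print(find_snps(reference, sample))  # [(3, 'T', 'A')]
```

The single difference is at the fourth base, where a T in the first fragment appears as an A in the second.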
Using Autosomal DNA to Determine Ethnic/Geographic Origins.
When it first started out, FamilyFinder (FF) used publicly available marker sets from researchers who had examined present-day people living in areas where the gene pools had not been much influenced by migrants INTO those areas. For example, the Orkney Islands were their stand-in for the British Isles because, unlike Britain itself, the Orkney Islands were not a place that experienced immigration. (Who wanted to move THERE?) So, the idea was that markers for “pure” British ancestry could be derived from present-day Orkney Islanders. That approach is a bit limiting, because there are places (e.g., mainland Europe) where you couldn’t really find gene pools uninfluenced by immigrants.
The present approach used by FF and other services (e.g., 23andMe) is a more complex, statistical one. In addition to the publicly available marker sets published by researchers, they also use, as calibration subjects, people whose ancestries are supposedly known quite well – mixed though they might be. With enough such individuals, they can use statistical methods to back out what markers can stand in for, say, Scottish or German or French or whatever. In other words, if you confidently know a bunch of calibration subjects are English/German, and another bunch are German/French, you can back out what must be the German markers by looking for markers shared by all. However, being a statistical method, the result really has some uncertainty associated with it – e.g., the calibration pool might have errors in their self-reported ancestries.
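The “markers shared by all” idea can be pictured with a toy Python sketch. Everything in it is invented for illustration: the marker labels (`m1` to `m7`), the four calibration subjects, and the function name `shared_markers`. The real services use proprietary statistical models, not a strict set intersection, so this shows only the intuition.

```python
from functools import reduce

# Hypothetical marker sets, one set per calibration subject.
english_german_subjects = [
    {"m1", "m2", "m3", "m4"},
    {"m2", "m3", "m4", "m5"},
]
german_french_subjects = [
    {"m3", "m4", "m6"},
    {"m3", "m4", "m7"},
]


def shared_markers(subjects: list[set[str]]) -> set[str]:
    """Return the markers carried by every subject in the list."""
    return reduce(set.intersection, subjects)


# German is the only ancestry common to both groups, so markers shared by
# every subject in both groups are candidate "German" markers.
candidate_german = shared_markers(english_german_subjects + german_french_subjects)
print(sorted(candidate_german))  # ['m3', 'm4']
```

One mislabeled calibration subject would be enough to knock a genuine marker out of a strict intersection like this, which is one reason the real methods are statistical and their results carry uncertainty.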
Where I’m going with this, is that your FF-Origins report has some confidence level associated with those numbers you’re reading about your Scandinavian ancestry, but I don’t see where FF is really telling you what the uncertainty is. I know that with 23andMe’s similar reporting, their default confidence level is 50%. 23andMe allows one to specify the confidence level in the final report; and if you change it, the reported result changes. I don’t see where FF allows that option.
Here is an illustration of what happens with 23andMe when you use a different confidence level:
James Michael Gossett (50% confidence level) — 23andMe
British & Irish 63.1%
French & German 16.5%
Eastern European 0.4%
Broadly Northwestern European 16.8%
Broadly European 1.8%
Broadly Southern European 0.9%
James Michael Gossett (90% confidence level) — 23andMe
British & Irish 5.6%
French & German 1.4%
Broadly Northwestern European 66.0%
Broadly European 25.6%
It is apparent that, if I demand more confidence in the report, the report becomes less specific; I’m still considered to be Broadly NW European, but they’re less sure about saying precisely where. Given such uncertainty, it frankly seems goofy to report results to the nearest 0.1%!! And last I looked, FTDNA is not reporting with that misleading precision.
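The shift from specific to broad categories can be tallied directly from the two 23andMe reports listed above. The following Python sketch is only arithmetic on those reported numbers; the function name and the rule of treating any label beginning with “Broadly” as a broad category are my own choices for illustration.

```python
# Percentages copied from the two 23andMe reports above.
report_50 = {
    "British & Irish": 63.1,
    "French & German": 16.5,
    "Eastern European": 0.4,
    "Broadly Northwestern European": 16.8,
    "Broadly European": 1.8,
    "Broadly Southern European": 0.9,
}
report_90 = {
    "British & Irish": 5.6,
    "French & German": 1.4,
    "Broadly Northwestern European": 66.0,
    "Broadly European": 25.6,
}


def specific_vs_broad(report: dict[str, float]) -> tuple[float, float]:
    """Return (specific total, broad total), each rounded to 0.1%."""
    broad = sum(pct for label, pct in report.items() if label.startswith("Broadly"))
    specific = sum(pct for label, pct in report.items() if not label.startswith("Broadly"))
    return round(specific, 1), round(broad, 1)


print(specific_vs_broad(report_50))  # (80.0, 19.5)
print(specific_vs_broad(report_90))  # (7.0, 91.6)
```

At the 50% confidence level about 80% of the assignment goes to specific regions; at 90% confidence only about 7% does, with the rest falling into the “Broadly” categories.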
With FF, my ancestral origin is reported as
50% British Isles,
13% Southeast Europe
9% West & Central Europe
<2% East Europe.
Note the significant difference in alleged Scandinavian ancestry between reports from 23andMe and FF. Keep in mind that it’s unclear how far back in time these ancestries are supposed to reference. Go back far enough, and we’re all African! If I’m 63.1% British & Irish, does that refer to British inhabitants before or after the Viking invasions or the Norman Conquest? I’d expect any later Brit to have Scandinavian and French in him/her.
Also keep in mind that, though our YDNA project is focusing on your Gossett lineage, your ancestral origin is just as much influenced by your mother’s mother’s mother as it is by your father’s father’s father.
Using Autosomal DNA to Find Relatives You Didn’t Know You Had.
I’ve increasingly been receiving inquiries from amateur, genealogical researchers who say something like this: “I know I’ve got Gossett ancestry because I have a DNA match to <some long-ago, deceased Gossett>.”
“What do you mean?” I ask. “Did you exhume him and have his DNA analyzed?”
No, what they meant is that FF or 23andMe reports a 4th- or 5th-cousin relationship to somebody who posted a tree on Ancestry.com (or elsewhere) with a Gossett-surnamed individual on it in the early 1800s.
There are several pitfalls with this. One, we have no idea how accurate the posted tree is; and two, such distant matches on FF and 23andMe have a significant probability of being false positives.
Autosomal DNA Analysis: Significance of Relationships.
You and a 3rd cousin share a GG-grandparent, and each of you received about 6.25% of your DNA from that ancestor – but certainly not the same 6.25%. FTDNA says there’s about a 90% chance of Family Finder (FF) finding a match between known 3rd cousins. (Thus, there is a 10% chance of not seeing a match between known 3rd cousins.)
When you move back to more distant relationships (4th cousins and beyond), things get more uncertain. There’s only a 50% chance of FF detecting a match between known 4th cousins; and only a 10% chance of detecting a match between 5th cousins.
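The 6.25% figure is simply one-half raised to the number of generations back to the shared ancestor (four generations for a GG-grandparent). The Python sketch below extends that arithmetic to 4th and 5th cousins and pairs it with the FF detection chances quoted above. The function name is hypothetical, and the halving rule gives only an average expectation; actual inheritance varies from person to person.

```python
def expected_fraction_from_ancestor(generations_back: int) -> float:
    """Average fraction of one's autosomal DNA inherited from one ancestor."""
    return 0.5 ** generations_back


# Detection chances for known cousins, as quoted from FTDNA in the text.
ff_detection_chance = {3: 0.90, 4: 0.50, 5: 0.10}

for cousin_degree, detect in ff_detection_chance.items():
    # Nth cousins share an ancestor N + 1 generations back
    # (e.g., 3rd cousins share a GG-grandparent, 4 generations back).
    generations_back = cousin_degree + 1
    fraction = expected_fraction_from_ancestor(generations_back)
    print(
        f"cousin degree {cousin_degree}: "
        f"{fraction:.4%} from the shared ancestor, "
        f"{detect:.0%} chance of a detected match, "
        f"{1 - detect:.0%} chance of a false negative"
    )
```

Each step further out halves the expected contribution of the shared ancestor, which goes some way toward explaining why detection falls off so quickly beyond 3rd cousins.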
But that’s all about false negatives. What about false positives (i.e., detecting a supposed relationship where there is none)? FF says that the largest shared DNA segment has to be ≥ 5.5 cM for a match to be noted – but this threshold is modified using a proprietary algorithm that takes into account other segments. In practice, I’ve heard that their effective threshold is more like 7.5 cM. These thresholds differ among the various autosomal DNA analysis services, who each use their own algorithms. And sites like GEDmatch.com are said to be even less discriminating in declaring matches. Their philosophy, apparently, is that it’s better to give you false positives (and have you chase down these leads with solid, paper-trail genealogical research), than to have you miss a possible connection.
But false positives are more common than generally appreciated, especially with distant relations (i.e., 4th cousins and beyond). For a good – but dense – discussion, see:
Waldron looked at data where children – and both parents – were in the databases. False positive matches are evident when a “match” shows up to the child, but not to either parent. Since all of the child’s DNA comes from his parents, any match to the child should match at least one of the parents. Looking at the top 1500 GEDmatch.com matches for children, somewhat more than 50% were false positives! However, for most people, their top 1500 matches include a lot of what are clearly going to be remote relations.
Another practical view of false positives – and some recommendations – from Liane Jensen:
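The parent-child check described above amounts to a set difference: any match on the child’s list that appears on neither parent’s list is flagged as a false positive. Here is a minimal Python sketch of that logic. The kit identifiers and the function name are invented for illustration; this is not Waldron’s actual procedure or data.

```python
def false_positive_matches(
    child_matches: set[str],
    mother_matches: set[str],
    father_matches: set[str],
) -> set[str]:
    """Return the child's matches that neither parent shares."""
    return child_matches - (mother_matches | father_matches)


# Hypothetical match lists (kit IDs are made up).
child = {"kit_A", "kit_B", "kit_C", "kit_D"}
mother = {"kit_A", "kit_X"}
father = {"kit_B", "kit_Y"}

suspect = false_positive_matches(child, mother, father)
rate = len(suspect) / len(child)

print(sorted(suspect))  # ['kit_C', 'kit_D']
print(f"{rate:.0%} of the child's matches are not shared with either parent")
```

In this made-up example half of the child’s matches fail the check, which is roughly the proportion reported for the top 1500 GEDmatch.com matches.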
She cites Blaine Bettinger, whose take-away is this: When looking at largest matching blocks,
- Above 15 cM, a match is 99.3% likely to be a real match, a match shared with either or both parents
- Below 10 cM, a match is 59% likely to be a match shared with either or both parents
- Below 7 cM, a match is 40% likely to be a match shared with either or both parents, so more likely to be a false positive, not really family
Jensen also cites an analysis by John Walden with these statistics:
- 11 cM or greater matching segment: > 99% identical by descent (IBD), < 1% identical by state (IBS)
- 10 cM matching segment: 99% IBD, 1% IBS
- 9 cM matching segment: 80% IBD, 20% IBS
- 8 cM matching segment: 50% IBD, 50% IBS
- 7 cM matching segment: 30% IBD, 70% IBS
- 6 cM matching segment: 20% IBD, 80% IBS
- 5 cM matching segment: 5% IBD, 95% IBS
- 4 cM matching segment: ~1% IBD, ~99% IBS
In general, both studies suggest setting a threshold somewhere around 10 to 15 cM as a lower bound for almost certainly good matches. Below that, buyer beware.
When you use filters of people’s surname lists, you’re going to find some matches, but some are possibly just false positives. Even an 80% chance of a match means you’re going to be wrong 20% of the time.
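For quick reference, the John Walden figures quoted above can be turned into a small lookup, sketched here in Python. The function name, the choice to round a segment down to the whole centimorgan, and the handling of segments outside the 4 to 11 cM range are my own assumptions. The “> 99%” and “~1%” entries are stored as 99 and 1 for simplicity.

```python
import math
from typing import Optional

# Percent chance that a matching segment of this size (in whole cM) is
# identical by descent (IBD), per the John Walden figures quoted above.
WALDEN_IBD_PERCENT = {11: 99, 10: 99, 9: 80, 8: 50, 7: 30, 6: 20, 5: 5, 4: 1}


def ibd_percent(segment_cm: float) -> Optional[int]:
    """Look up the approximate IBD percentage for a matching segment.

    Segments of 11 cM or more return 99 (the table says "> 99%").
    Segments under 4 cM return None because the table does not cover them.
    Fractional sizes are rounded down, which is the cautious choice.
    """
    whole_cm = math.floor(segment_cm)
    if whole_cm >= 11:
        return WALDEN_IBD_PERCENT[11]
    return WALDEN_IBD_PERCENT.get(whole_cm)


for size in (15, 9.7, 8.4, 5, 3):
    print(f"{size} cM -> {ibd_percent(size)}")
```

The remainder (100 minus the IBD percentage) is the identical-by-state share, i.e., the rough chance that the segment is a false positive.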
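The 80%/20% point compounds when you are looking at several matches at once. The Python sketch below works that out under a simplifying assumption that is mine, not the sources’: each match is treated as independent, with the same chance of being real. The function names and the choice of five matches are purely illustrative.

```python
def chance_all_real(prob_real: float, num_matches: int) -> float:
    """Chance that every one of the matches is real (assumes independence)."""
    return prob_real ** num_matches


def expected_false_positives(prob_real: float, num_matches: int) -> float:
    """Average number of false positives among the matches."""
    return (1 - prob_real) * num_matches


prob_real = 0.80   # the "80% chance of a match" from the text
num_matches = 5    # illustrative number of surname-filtered matches

print(f"Chance all {num_matches} are real: {chance_all_real(prob_real, num_matches):.1%}")
print(f"Expected false positives: {expected_false_positives(prob_real, num_matches):.1f}")
```

With five such matches, the chance that all of them are genuine is only about one in three, and on average one of the five is a false positive.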
Like most everything else, these more distant autosomal DNA “matches” are leads, not sure things. Look for corroborating primary evidence, where you can. And keep in mind that the tree that has been uploaded by your alleged match could be fanciful, too. Errors tend to propagate (like fake news on Facebook), and “quick-draw” genealogists appropriate them wholesale, splicing onto them, saying “My work is done!”
Another similar study: