The origin and accuracy of Elo rating: Food for thought
Every chess player knows the importance of ratings, but how accurate are they really? In this article, Yevgeny Levanzov explores the origin of the Elo rating system and the ideas behind it. He explains how ratings are calculated and why they sometimes fail to reflect true playing strength. The article also looks at rating inflation, deflation, underrated juniors, and the challenges faced by modern players. Through simple examples and practical insights, it questions whether ratings should be treated as the absolute truth. A thoughtful read for anyone interested in the deeper side of competitive chess!
Elo ratings under the microscope
Every competitive chess player follows their rating closely. Some do so obsessively, even to an unhealthy degree. This is such an inherent part of being a competitive chess player that there are jokes and even memes, portraying chess players as people who ask “What’s your rating?” as an introductory question when meeting someone new.

In this short essay, I would like to discuss the foundations of the Elo rating system, examine the complex question of whether players’ ratings truly reflect their strength and level of play, and how one can even attempt to quantify a player’s strength. I will also raise several issues related to rating inflation and deflation, as well as the system’s inaccuracy for certain groups of players. This is not intended to be a rigorous scientific paper, and not every claim will be proven formally. The main goal is to present interesting questions concerning chess ratings and to emphasize that they should not be viewed as an “absolute truth”.
Before discussing the Elo model in depth, let us consider the following simple question: if we had only ten chess players in a room, how would we determine who among them is the strongest, the second strongest, and so on?
In such a case, there would obviously be no need to develop mathematical models or estimate players’ levels through a numerical scale, because we could simply have them play a sufficiently large number of games against one another. For example, each player could play ten games against every other player and then analyze the results to determine who is strongest, second strongest, and so forth. Naturally, the more games that are played, the more statistically reliable the conclusions become. This approach works as long as the number of players is small and they are all located in the same place.

In the 1950s, the Hungarian-American physics professor and chess master, Arpad Elo, worked on developing a mathematical model that would assign every chess player a numerical rating intended to reflect their playing strength and update dynamically according to their results. A player’s rating is meant to represent their ability by predicting their expected results and answering the question of how a match between two players is likely to end, even if they have never played one another before. The system that existed prior to Elo’s work was the Harkness system, developed by the American chess organizer Kenneth Harkness.
Elo’s system is based on the assumption that every player has a “true strength” that is unknown and cannot be measured directly. This true strength may change over time, but usually not rapidly. Exceptions include improving players at the beginning of their chess careers, especially teenagers who study intensively and therefore tend to improve quickly up to a certain stage. If the outcome of a game were determined directly by the true strengths of the two opponents, then measuring true ratings would be easy. In practice, however, the Elo model assumes that a player arrives at a game with a number representing their “performance strength” or “playing strength” for that particular game, and the player with the higher performance strength is expected to win. Naturally, this performance strength depends on the player’s true strength, but it may be either lower or higher than it.
A player’s rating is continuously updated according to their results and the ratings or effective playing strengths of both themselves and their opponents. By updating ratings appropriately (as will be explained later), wins and losses are supposed to accumulate in such a way that the rating becomes an unbiased estimator of the player’s true strength.
One may think of the set of players as the vertices of a graph, where an edge exists between two players if they have played one another (recently). A rating model is expected to predict reasonably accurately even the results between players who are far apart in this graph, that is, players connected only through a long chain of opponents.
It is important to distinguish between a chess player’s intrinsic strength, which is based on theoretical knowledge (openings, endgames, pawn structures, patterns, and more), positional understanding, tactical ability, calculation skills, psychological stability, and dozens of other factors, and their playing form during a particular period, which mathematically can only be quantified through results, despite the inherent unfairness of doing so.
Let us now briefly discuss the mathematical aspects of the Elo model. To simplify his framework, Elo did not attempt to predict the probabilities of wins and draws separately, but instead measured only the points scored in a game (thus treating a draw as half a win). In his original work, Elo proposed that playing strength follows a normal distribution (and therefore so does the difference between the playing strengths of two opponents), whose mean equals the player’s true strength. The standard deviation was calibrated so that a rating difference of 200 points would correspond to an expected score ratio of 3:1. The United States Chess Federation (USCF) adopted the model in 1960 and chose 1500 as the average rating of a registered club player. Of course, this choice is arbitrary, and “shifting” the entire rating scale does not change the essence of the model.
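As a rough numerical check of the calibration described above, the following sketch evaluates the expected score under a normal model. It assumes that each player's performance has a standard deviation of 200 points, so that the difference between two independent performances has a standard deviation of 200·√2. This value and the function names are assumptions made for illustration and are not stated in the text above.

```python
import math

# Assumption for this sketch: each player's performance is normally
# distributed around their true strength with a standard deviation of 200.
SIGMA_PLAYER = 200.0
# The difference of two independent normal performances has sd = sigma * sqrt(2).
SIGMA_DIFF = SIGMA_PLAYER * math.sqrt(2.0)


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_score_normal(rating_diff: float) -> float:
    """Expected score of the stronger player under the normal model."""
    return normal_cdf(rating_diff / SIGMA_DIFF)


if __name__ == "__main__":
    e = expected_score_normal(200.0)
    print(f"expected score: {e:.3f}, score ratio: {e / (1.0 - e):.2f} : 1")
```

Under these assumptions, a 200-point advantage gives an expected score of roughly 0.76, that is, a ratio of a little over 3:1.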
Subsequent measurements and long-term observations showed that performance strength is not normally distributed, and corrections to the model were made. Statistical and mathematical research concluded that the difference in players’ performance strengths follows a logistic distribution (whose name derives not from “logarithm” but from the Greek word logistikos, meaning “reasoning” or “calculation”). A logistic function has the following form:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$$

Here $L$ is the maximum value, $k$ controls the steepness of the curve, and $x_0$ is its midpoint.
Its graph has the shape of the letter S (see figure), and it represents growth that begins exponentially and then gradually levels off toward a maximum value. The base of the exponent does not necessarily have to be e.

We now present the formula for calculating the expected score between two players with ratings $R_1$ and $R_2$. Such players are expected to divide the points between them in the ratio of $10^{R_1/400}$ to $10^{R_2/400}$, that is, in the ratio $10^{(R_1 - R_2)/400} : 1$.
In particular, every increase of 400 points in the rating difference multiplies the stronger player’s expected score ratio by a factor of 10. Thus, the ratio of expected points grows exponentially with the rating difference; equivalently, the rating difference is proportional to the base-10 logarithm of that ratio.
The number 400 serves as a kind of “calibration difference”, at which the stronger player is expected to score in a ratio of 10:1 (that is, about 91%). The reason for choosing the number 400 is partly historical, as it was already used in the previous system, and partly because it is convenient for certain calculations involving base-10 logarithms.
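Denoting by $E_1$ and $E_2$ the expected scores of the two players in a single game, we note that the following relationships hold:

$$E_1 + E_2 = 1, \qquad \frac{E_1}{E_2} = 10^{(R_1 - R_2)/400}$$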
Hence, a direct calculation shows that the probability that player 1 will beat player 2 (counting a draw as half a win) is given by:

$$E_1 = \frac{1}{1 + 10^{(R_2 - R_1)/400}}$$
In particular, this is the expected score that player 1 will achieve against player 2 (over many games).
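The expected-score calculation described above can be sketched in a few lines of Python. The function name `expected_score` is hypothetical, and the sketch assumes the standard base-10 logistic form with the constant 400.

```python
def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2.

    Base-10 logistic curve: a 400-point advantage corresponds to a 10:1
    ratio of expected points.
    """
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


if __name__ == "__main__":
    for diff in (0, 100, 200, 400):
        e = expected_score(1500 + diff, 1500)
        print(f"rating difference {diff:>3}: expected score {e:.3f}")
```

A difference of 400 points gives 10/11, or about 0.91, matching the 10:1 ratio mentioned above, and the two players' expected scores always add up to 1.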
This is a logistic function with base 10, where the difference of 400 acts as a kind of “threshold difference”, beyond which the function stops growing exponentially. For rating differences larger than 400, the stronger player’s victory becomes almost certain (statistically), and therefore the predictive power of the formula begins to lose effectiveness, since the probability of an upset becomes negligible.
This is reflected in FIDE’s “400-point rule”, according to which a rating difference of more than 400 points is treated as if it were exactly 400. The motivation for this rule is to protect top players from losing a large number of rating points in the case of a single upset against a much weaker opponent, thereby helping prevent inflation/deflation effects.
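The effect of capping the rating difference can be illustrated with a small sketch. It assumes that the logistic formula is simply applied to the capped difference; the function names and the sample ratings are hypothetical.

```python
def expected_score(r1: float, r2: float) -> float:
    """Uncapped expected score (base-10 logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


def capped_expected_score(r1: float, r2: float, cap: float = 400.0) -> float:
    """Expected score when the rating difference is limited to +/- cap."""
    diff = max(-cap, min(cap, r1 - r2))
    return 1.0 / (1.0 + 10 ** (-diff / 400))


if __name__ == "__main__":
    strong, weak = 2700, 2000  # a 700-point gap, illustrative values
    print(f"uncapped: {expected_score(strong, weak):.3f}")
    print(f"capped:   {capped_expected_score(strong, weak):.3f}")
```

With the cap, the stronger player's expected score never exceeds 10/11, so a loss to a much weaker opponent is measured against roughly 0.91 rather than a value close to 1.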
And now, the formula for updating the rating of player 1 after playing against player 2 and scoring S points is:

$$R_1' = R_1 + K\,(S - E_1)$$
where E1 is the expected score of player 1 against player 2, computed from the rating difference as shown above, and K is the rating coefficient. This coefficient is determined by the player’s strength, age, and additional factors, and it controls the maximum number of rating points a player can gain or lose from a single game (i.e., it regulates the rate of change).
For example, new players are assigned a higher K-factor in order to allow the system to “calibrate” their rating and quickly converge toward their true playing strength. The same is often true for young players, who typically improve rapidly up to a certain level. In contrast, experienced or top-level players are given a lower K-factor, since their ratings are more stable and their performance variance is smaller.
*In the Glicko rating system there is no K-factor as such; instead, each player has a rating deviation that grows during long periods of inactivity, so that the ratings of returning players change more per game, again in order to “recalibrate” their rating to their current level more quickly.
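The update rule described above can be written out as a minimal sketch. The function names are hypothetical, and K = 20 and the ratings 1500 and 1700 are illustrative values chosen for this example only.

```python
def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


def update_rating(rating: float, opponent: float, score: float, k: float) -> float:
    """New rating after one game: old rating plus K times (actual - expected)."""
    return rating + k * (score - expected_score(rating, opponent))


if __name__ == "__main__":
    k = 20  # illustrative K-factor
    for label, score in (("win", 1.0), ("draw", 0.5), ("loss", 0.0)):
        new = update_rating(1500, 1700, score, k)
        print(f"{label:>4}: 1500 -> {new:.1f}")
```

Since the expected score always lies between 0 and 1, no single game can change the rating by more than K points, and the lower-rated player gains even from a draw because their expected score is below one half.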
FIDE adopted the Elo rating model in 1970, and in 1971 the first rating list was published, at that time containing only strong players. The initial ratings were calculated manually by Arpad Elo, based on historical results.
Today, a new player does not receive an arbitrary initial rating; instead, it is determined based on performance in their first tournament (with at least 5 games).
Initially, rating lists were published twice a year (January and July), whereas today they are published monthly.
The highest rating ever achieved was 2882, reached by Magnus Carlsen twice (in May 2014 and August 2019), thereby breaking the record previously held by Garry Kasparov, which stood at 2851.
After having seen the formula and how it is computed, we now ask the key question at the heart of the rating system: how quickly does a player’s rating converge to their true level (relative to others)?
In particular, suppose we consider a player in the United States and another in China, who not only have never played each other, but are also far apart in the players’ graph. How accurate is the prediction of their game outcome based on their ratings?
As can be understood, the mathematical question of the convergence rate of the process, that is, how quickly players’ ratings converge to their true strength relative to others, is highly nontrivial. It is so difficult that the first serious paper addressing it was published only in 2024. The paper studies the convergence of the process in a theoretical framework and concludes by testing the results via simulation. The paper can be found here.
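To get an informal feeling for the convergence question, one can run a toy simulation. The sketch below is not the method of the paper mentioned above; it assumes a single underrated player with a fixed true strength, opponents whose ratings are accurate, games with no draws, and an illustrative K-factor of 20. All names and numbers are hypothetical.

```python
import random


def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


def simulate(true_strength=1800.0, start=1500.0, k=20.0, games=2000, seed=0):
    """Return the rating after each game of a player who starts underrated."""
    rng = random.Random(seed)
    rating = start
    history = []
    for _ in range(games):
        # Assumption: opponents are accurately rated and within 200 points
        # of the player's true strength.
        opponent = rng.uniform(true_strength - 200, true_strength + 200)
        # The result is drawn from the true strengths (wins and losses only).
        p_win = expected_score(true_strength, opponent)
        score = 1.0 if rng.random() < p_win else 0.0
        # The rating is updated using the current (possibly wrong) rating.
        rating += k * (score - expected_score(rating, opponent))
        history.append(rating)
    return history


if __name__ == "__main__":
    history = simulate()
    for n in (10, 50, 100, 500, 2000):
        print(f"after {n:>4} games: {history[n - 1]:.0f}")
    tail = history[-500:]
    print(f"average over the last 500 games: {sum(tail) / len(tail):.0f}")
```

In this toy setting the rating drifts toward the true strength and then keeps fluctuating around it, with a larger K giving faster convergence but noisier ratings. Real rating pools, where the opponents' ratings are themselves inaccurate, are considerably harder to analyze.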
Various problems in the Elo rating model
Although in many cases the rating system does reflect a player’s chess strength, there are also numerous situations in which it does not. Therefore, it is very important not to treat rating as the absolute truth, and not to become obsessive about it (a phenomenon that is especially common among teenagers). In particular, there are cases where a player possesses significant knowledge and skill, but this is not reflected in their results due to poor training methods. When the methods are corrected (for example, when a player begins working with a strong coach), their results can improve dramatically.
We will now review several reasons and scenarios in which a player’s rating may fail to represent their true strength, sometimes in a significant way.
Rating deflation in recent years
In the 1990s and 2000s, there was a kind of rating inflation. In particular, top-level ratings increased from the range of 2630–2730 to 2700–2800. This was partly due to a genuine increase in playing strength driven by technological progress (chess engines, databases), but also due to structural changes in the rating system that allowed lower-rated players to enter the rating pool (until 1990, the minimum rating was 2200), which gradually pushed ratings upward.
In the past 6–7 years, however, we have observed rating deflation, especially in the sub-2000 range. As a result, FIDE, in cooperation with mathematician Jeff Sonas, implemented a rating reform in 2024 for the 1000–2000 range in order to reduce this deflation. The essence of the reform was a one-time rating increase for players in this range while preserving the relative ordering of players.
Deflation is also visible at the top level. For example, the number of players in the “2700 club” has decreased from about 45 a decade ago to around 32 today.
There are several reasons for this deflation, but the main one is the massive influx of players into the rated pool during the COVID-19 pandemic, which triggered a global “chess boom”, along with the release of “The Queen’s Gambit” in October 2020. This began with a large-scale shift of players into online chess, but many of them later registered in their national federations and FIDE and started participating in over-the-board tournaments after the pandemic subsided.
Many of these players were initially assigned ratings below their true strength, mainly due to lack of experience in competitive play. Since they were beginners, many of whom continued to improve significantly, a situation arose in which the lower end of the rating “pyramid” contained a large number of underrated players. This led to their opponents scoring fewer points than expected based on rating differences, creating a deflationary effect that gradually spread through all levels (though it is less severe at the top).
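The mechanism described above can be made concrete with a small worked example. It assumes an established player rated 1800 facing a newcomer who is listed at 1400 but actually plays at a 1700 level, with a K-factor of 20; all of these numbers are hypothetical.

```python
def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


K = 20                     # illustrative K-factor of the established player
ESTABLISHED = 1800         # established player, assumed accurately rated
NEWCOMER_LISTED = 1400     # newcomer's published rating
NEWCOMER_TRUE = 1700       # newcomer's assumed true strength

# What the rating system expects the established player to score.
expected_by_rating = expected_score(ESTABLISHED, NEWCOMER_LISTED)
# What the established player actually scores on average.
actual_expected = expected_score(ESTABLISHED, NEWCOMER_TRUE)
# Average rating change per game for the established player.
average_change = K * (actual_expected - expected_by_rating)

if __name__ == "__main__":
    print(f"expected by rating: {expected_by_rating:.3f}")
    print(f"actual expectation: {actual_expected:.3f}")
    print(f"average change per game: {average_change:+.2f}")
```

Under these assumptions the established player loses about 5 rating points per game on average, even while playing exactly at their own level.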
An important additional point is that when there is a massive influx of new players who continue to improve rapidly, it takes time for their ratings to adjust to their true strength. This is especially relevant in the modern era, where training resources such as courses, software, and videos make it much easier to improve quickly up to a certain level.
The “closed pool” problem
The rating model operates within a given population of players. Therefore, if different populations are isolated from one another, their ratings may become desynchronized.
As is well known, most open chess tournaments are held in Europe and North America, although in recent years, especially with India and Uzbekistan becoming chess superpowers, stronger tournaments are also held in Asia. However, there are many countries that are geographically distant from the “center of chess activity”, and players there, especially juniors, rarely compete internationally. This creates a kind of internal closed pool in which rating, as a zero-sum system, circulates only within a relatively small group.
This leads to many underrated juniors in those countries, since when players of roughly similar strength repeatedly compete only among themselves, most of them cannot increase their rating in expectation, even if the entire group has improved. This contrasts with players in Europe, who can easily (geographically) play in many neighboring countries, resulting in a much broader opponent pool, and therefore less “closed-loop” rating behavior.
Examples of such geographically isolated countries include Kazakhstan and Australia. When juniors from such countries compete in World Junior Championships, it is often clear that they are underrated.
Another related issue is that many local junior tournaments are not FIDE rated, which delays rating progression for young players.
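The zero-sum behavior of a closed pool described in this section can be checked with a short simulation. The sketch assumes six players who all share the same K-factor, never play anyone outside the group, and have all improved to a true strength of 1800 while still being rated 1500; games are decided without draws. All names and numbers are hypothetical.

```python
import random


def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


def play_closed_pool(ratings, true_strengths, k=20.0, rounds=50, seed=1):
    """Play repeated round-robins inside an isolated pool; return new ratings."""
    rng = random.Random(seed)
    ratings = list(ratings)
    n = len(ratings)
    for _ in range(rounds):
        for i in range(n):
            for j in range(i + 1, n):
                # The result depends on true strengths, not on ratings.
                p_win = expected_score(true_strengths[i], true_strengths[j])
                score = 1.0 if rng.random() < p_win else 0.0
                # With equal K, what one player gains the other loses.
                delta = k * (score - expected_score(ratings[i], ratings[j]))
                ratings[i] += delta
                ratings[j] -= delta
    return ratings


if __name__ == "__main__":
    final = play_closed_pool([1500.0] * 6, [1800.0] * 6)
    print("final ratings:", [round(r) for r in final])
    print("pool average:", round(sum(final) / len(final), 6))
```

The individual ratings spread out, but the pool average stays at 1500 no matter how many games are played, so the group's collective improvement remains invisible until its members meet outside opponents.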
Difficulty of gaining rating in open tournaments for 2600–2700 grandmasters
Chess is a unique sport in at least two respects:
1. Children and juniors can compete together with adults;
2. There are open (Swiss) tournaments, in which professionals, semi-professionals, and amateurs compete together.
The second feature, as far as I know, does not exist in any other sport. In snooker, for example, which I personally enjoy and follow closely, there is a clear separation between professionals and non-professionals, with completely separate tournament circuits; in tennis, there are also clear tiers such as the ATP Tour, ATP Challenger, and ITF Futures.
On one hand, open tournaments are democratic and appealing, allowing amateurs to face top players. On the other hand, they create challenges for grandmasters in the 2600–2700 rating range who are trying to gain rating points.
As is well known, the 2700 rating threshold (or the “2700 club”) is an important milestone for a grandmaster aiming for the elite level. Upon reaching it, a player is typically invited to closed elite tournaments or super-tournaments. However, to reach this level, players in the 2600–2700 range must almost exclusively compete in open tournaments, where they are often top seeds and must score heavily to gain rating points.
This is extremely difficult to do consistently for several reasons. Roughly half of their opponents are significantly lower-rated, but still strong — typically in the 2350–2450 range. Consistently defeating such players is very difficult in the modern era, since today, unlike 20 years ago, players are extremely well-prepared in openings, use strong engines, and have access to vast amounts of information.
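A quick calculation shows how demanding this is. The sketch below assumes a 2650-rated grandmaster facing a 2400-rated opponent with a K-factor of 10; the specific ratings and the K value are illustrative assumptions.

```python
def expected_score(r1: float, r2: float) -> float:
    """Expected score of a player rated r1 against a player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


K = 10            # illustrative K-factor for a top player
GM = 2650         # illustrative grandmaster rating
OPPONENT = 2400   # illustrative opponent rating

e = expected_score(GM, OPPONENT)
win_change = K * (1.0 - e)    # rating change after a win
draw_change = K * (0.5 - e)   # rating change after a draw
loss_change = K * (0.0 - e)   # rating change after a loss

if __name__ == "__main__":
    print(f"expected score: {e:.3f}")
    print(f"win {win_change:+.2f}, draw {draw_change:+.2f}, loss {loss_change:+.2f}")
```

Under these assumptions the grandmaster must score about 81% just to keep their rating, and a single draw costs more points than a win earns.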
In a single game, especially with Black, one slightly inaccurate opening choice may be enough to prevent a win, even against a significantly lower-rated opponent. Over a match, the stronger player will have the upper hand (otherwise, the rating model would be fundamentally incorrect), but this advantage is much less pronounced in individual games.
Moreover, in open tournaments there is only one day to prepare for opponents, and weaker players often have few games in the databases. In addition, top players are under considerable pressure, since they must score heavily to improve their rating, not to mention financial pressure, as prize money is often their primary source of income in the 2600–2700 range. As a result, they sometimes take excessive risks that they would not normally take in other events.
In conclusion, it is very difficult for players in the 2600–2700 range to consistently gain rating in open tournaments and break into the elite. Some of them are undoubtedly already playing at a 2700+ level but fail to reach it in rating terms (a notable exception is Arjun Erigaisi). There are also examples showing that when a ~2650-rated grandmaster is invited to a closed super-tournament, they often perform quite well.
In my view, this situation should change, and there are several possible improvements. First and foremost, closed tournaments for 2600–2700 grandmasters with meaningful prizes should be organized. This requires initiative from organizers, sponsors, and others, but it is a narrative that FIDE should actively promote.
In such events, players would face opponents at their own level and also have financial security. One idea is that traditional super-tournaments such as Norway Chess and the Sinquefield Cup could run a parallel event for players rated 2550–2700, similar to the format of Wijk aan Zee (which previously had 3 levels), with the winner of the secondary event earning a place in the main event the following year.
In addition, super-tournament organizers should, in my opinion, invite more 2600–2700 grandmasters so that elite events are not a “closed clique” and include new and interesting players. While priority should still be given to exceptionally talented young players, strong and in-form experienced grandmasters should also receive opportunities. In this context, cooperation between major open tournaments and super-tournaments could be encouraged, where the winner of an open event is invited to a super-tournament the following year (as was the case in Dortmund).
In conclusion, this article attempted to review the history of the Elo rating model, its mathematical foundations, the issues that arise when applying it in the real world, and several possible solutions, as well as to raise additional points for reflection. My hope is that it sheds some light on the Elo rating system.
About the author

Yevgeny Levanzov is a mathematician and lecturer specializing in combinatorics and graph theory, who earned his PhD in 2024. Alongside his academic work, he is a passionate chess enthusiast, player, and international arbiter, with a deep love for chess history.