Friday, January 9, 2026

Data Update 1 for 2026: The Push and Pull of Data!

    In my musings on valuation, I have long described myself as more of a number cruncher than a storyteller, but that is because I love numbers for their own sake, not out of any fondness for abstract mathematics. It is that love of numbers that has led me, at the beginning of each year since the 1990s, to take publicly available data on individual companies, both from their financial statements and from the markets that they are listed and traded on, and try to make sense of that data for a variety of reasons: to gain perspective, to use in my corporate financial analysis and valuations, and to separate information from disinformation. As my access to data has improved, what started as a handful of datasets in my first data update in 1994 has expanded to cover a much wider array of statistics than I had initially envisioned, and my 2026 data updates are now ready. If you are interested in what they contain, please read on.

The Push and Pull of Data
    After a year during which we heard more talk about data and data centers than ever before in history, usually in the context of how AI will change our lives, it is worth considering the draw that data has always had, not just on businesses but on individuals, as well as the dangers that come with the proliferation of data and the trust we place in it.
    In a world where we feel adrift and uncertain, the appeal of data is clear. It gives us a sense of control, even if it is only in passing, and provides us with mechanisms for making decisions in the face of uncertainty. 
  1. Signal in the noise: Anyone who has to price/value a stock or assess a project at a firm has to make estimates in the face of contradictions, both in viewpoints and in numbers. The entire point of good data analysis is to find the signals in the noise, allowing for reasoned judgments, albeit with the recognition that you will make mistakes.
  2. Coping mechanism for uncertainty: Investors and businesses, when faced with uncertainty, often respond in unhealthy ways, with denial and paralysis as common responses. Here again, data can help in two ways, first by helping you picture the range of possible outcomes and second by bringing in tools (simulations, data visualizations) for incorporating uncertainty into your decision-making. 
  3. Prescription against tunnel vision: It is easy to get bogged down in details when faced with investment decisions, and to lose perspective. One of the advantages of looking at data differences over time and across firms is that it can help you step back and regain perspective, separating the stuff that matters a lot from that which matters little.
  4. Shield from disinformation: At the risk of getting backlash, I find that people make up stuff and present it as fact. While it is easy to blame social media, which has provided a megaphone for these fabulists, I read and hear statements in the media, ostensibly from experts, politicians and regulators, that cause me to do double takes since they are not just wrong, but easily provable as wrong, with the data.
    While data clearly has benefits, as a data-user, I do know that it comes with costs and consequences, and it behooves us all to be aware of them.
  1. False precision: It is undeniable that attaching a number to something that worries you, whether it be your health or your finances, can provide a sense of comfort, but there is a danger in treating estimates as facts. In one of my upcoming posts, for instance, I will look at the historical equity risk premium, measured by looking at what stocks have earned, on an annual basis, over treasury bonds for the last century. The estimate that I will provide is 7.03% (the average over the entire period), but that number comes with a standard error of 2.05%, resulting in a range from a little less than 3% (7.03% - 2 × 2.05%) to greater than 11% (see the short sketch after this list). This estimation error plays out over and over again in almost every number that we use in corporate finance and valuation, and while there is little that can be done about it, its presence should inform how we use the data.
  2. The Role of Bias: I have long argued that we are all biased, albeit in varying degrees and in different directions, and that bias will find its way into the choices we make. With data, this can play out consciously, where we use data estimates that feed into our biases and avoid estimates that work in the opposite direction, but, more dangerously, it can also play out subconsciously, in choices we do not even recognize we are making. While it is true that practitioners are more exposed to bias, because their rewards and compensation are often tied to the output of their research, the notion that academics are somehow objective because their work is peer-reviewed is laughable, since their incentive systems create their own biases.
  3. Lazy mean reversion: In a series of posts that I wrote about value investing, at least as practiced by many of its old-time practitioners, I argued that it was built around mean reversion, the assumption that the world (and markets) will revert back to historic norms. Thus, you buy low PBV stocks, assuming (and hoping) that those PBV ratios will revert to market averages, and argue that the market is overpriced because the PE ratio today is much higher than it has been historically. That strategy is attractive to those who use it, because mean reversion works much of the time, but it breaks down when markets go through structural shifts that cause permanent departures from the past.
  4. The data did it: As we put data on a pedestal, treating the numbers that emerge from it as the truth, there is also the danger that some analysts who use it view themselves as purely data engineers. While they make recommendations based upon the data, they also refuse to take ownership of their own prescriptions, arguing that it is the data that is responsible.
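    To make the false precision point concrete, here is a minimal sketch of the range calculation around the historical equity risk premium, using the 7.03% average and 2.05% standard error quoted above; the two-standard-error band and the code are purely illustrative, not a preview of how the upcoming post will present the numbers.

```python
# Illustrative only: the point estimate and standard error come from the text above.
erp_mean = 0.0703        # average annual return of stocks over treasury bonds
erp_std_error = 0.0205   # standard error of that average

# A rough two-standard-error band around the estimate
low = erp_mean - 2 * erp_std_error
high = erp_mean + 2 * erp_std_error

print(f"Equity risk premium estimate: {erp_mean:.2%}")
print(f"Plausible range: {low:.2%} to {high:.2%}")   # roughly 2.9% to 11.1%
```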
    As the data that we collect and have access to gets richer and deeper, and the tools that we have to analyze that data become more powerful, there are some who see a utopian world where this data access and analysis leads to better decisions and policies. Having watched this data revolution play out in investing and markets, I am not so sure, at least in the investing space. Many analysts now complain that they have too much data, not too little, and struggle with data overload. At the same time, a version of Gresham's law seems to be kicking in, where bad data (or misinformation) often drives out good data, leading to worse decisions and policy choices. My advice, gingerly offered, is that as you access data, it is caveat emptor, and that you should do the following with any data (including my own):
(a) Consider the biases and priors of the data provider.
(b) Avoid data that comes from black boxes, where providers refuse to detail how they arrived at their numbers.
(c) Crosscheck with alternate data providers for consistency.


Data Coverage
    As I mentioned at the start of this post, I started my data estimation for purely selfish reasons: I needed those estimates for my corporate financial analyses and valuations. While my sharing of the data may seem altruistic, the truth is that there is little that is proprietary or special about my data analysis, and almost anyone with the time and access to data can do the same.
    
Data Sources
    At the risk of stating the obvious, you cannot do data analysis without access to raw data. In 1993, when I did my first estimates, I subscribed to Value Line and bought their company-specific data, which covered about 2,000 US companies and included a subset of items from financial statements, on a compact disc. I used Value Line's industry categorizations to compute industry averages on a few dozen items and presented them in a few datasets, which I shared with my students. In 2025, my access to data has widened, especially because my NYU affiliation gives me access to S&P Capital IQ and a Bloomberg terminal, which I supplement with subscriptions (mostly free) to online data. It is worth noting that almost all the data from these providers is in the public domain, either in the form of company filings for disclosure or government macroeconomic data, and the primary benefit (and it is a big one) is easy access.
    As my data access has improved, I have added variables to my datasets, but the data items that I report reflect my corporate finance and valuation needs. The figure below provides a partial listing of some of these variables:


As you can see from browsing this list, much of the data that I report is at the micro level, and the only macro data that I report is on variables that I need in valuation, such as default spreads and equity risk premiums. In computing these variables, I have tried to stay consistent with my own thinking and teaching, and transparent about my usage. As an illustration of consistency, I have argued for three decades that lease commitments should be treated as debt and that R&D expenditures are capital, not operating, expenses, and my calculations have always reflected those views, even if they were at odds with the accounting rules. In 2019, the accounting rules caught up with my views on lease debt, and while the numbers that I report on debt ratios and invested capital are now closer to the accounting numbers, I continue to do my own computations of lease debt and report on divergences with accounting estimates. With R&D, I remain at odds with accountants, and I report the affected numbers (like margins and accounting returns) with and without my adjustments. On the transparency front, you can find the details of how I computed each variable at this link, and while it is entirely possible that you may not agree with my computations, they are out in the open.
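To illustrate the kind of adjustment involved, here is a minimal sketch of capitalizing R&D, assuming a hypothetical three-year amortizable life and straight-line amortization; the amortizable lives used in the actual datasets vary by industry, and the numbers below are made up.

```python
# Hypothetical numbers: R&D spending from 3 years ago up to the current year,
# with an assumed 3-year amortizable life and straight-line amortization.
rd_by_year = [120.0, 150.0, 180.0, 210.0]   # oldest -> current year
life = 3

current_rd = rd_by_year[-1]
past_rd = rd_by_year[:-1]                   # the prior `life` years, oldest -> most recent

# Unamortized fraction: R&D spent t years ago has (life - t) / life left on the books
research_asset = current_rd + sum(
    rd * (life - t) / life for rd, t in zip(past_rd, range(life, 0, -1))
)

# Each of the prior years contributes 1/life of its R&D to this year's amortization
amortization = sum(rd / life for rd in past_rd)

# Restatements: the research asset is added to invested capital, and operating
# income rises by (current R&D - amortization of past R&D)
print(f"Research asset: {research_asset:.1f}")                           # 380.0
print(f"Operating income adjustment: {current_rd - amortization:+.1f}")  # +60.0
```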
    There are a few final computational details that are worth emphasizing, and especially so if you plan to use this data in your analyses:
  1. With the micro data, I report on industry values rather than on individual companies, for two reasons. The first is that my raw data providers are understandably protective of their company-level data and have a dim view of my entry into that space. The second is that if you want company-level data for an individual company or even a subset, that data is, for the most part, already available in the financial filings of the company. Put simply, you don't need Capital IQ or Bloomberg to get to the annual reports of an individual company. 
  2. For global statistics, where companies in different countries are included within each industry and report their financials in different currencies, I download the data converted into US dollars. Thus, numbers that are in absolute value (like total market capitalization) are in US dollars, but most of the statistics that I report are ratios or fractions, where currency is not an issue, at least for measurement. The PE ratio that I report would be the same for any company in my sample, whether I compute it in US dollars or Chilean pesos, and the same can be said about accounting ratios (margins, accounting returns).
  3. While computing industry averages may seem like a trivial computational challenge, there are two problems you face in large datasets of diverse companies. The first is that there will be individual companies where the data is missing or not available, as is the case with PE ratios for companies with negative earnings. The second is that the companies within a group can vary in size, with very small and very large companies in the mix. Consequently, a simple average will be a flawed measure for an industry statistic, since it weighs the very small and the very large companies equally, and while a size-weighted average may seem like a fix, the companies with missing data will remain a problem. My solution, and you may not like it, is to compute aggregated values of each variable and use these aggregated values to compute the representative statistics. Thus, my estimate of the PE ratio for an industry grouping is obtained by dividing the total market capitalization of all companies in the grouping by the total net income of all companies (including money losers) in the grouping, as in the sketch that follows this list.
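    As a minimal sketch of that aggregation, using a hypothetical table of company-level market capitalizations and net income (the industry labels and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical company-level data; industries and numbers are for illustration only
df = pd.DataFrame({
    "industry":   ["Steel", "Steel", "Steel", "Software", "Software"],
    "market_cap": [1200.0, 300.0, 50.0, 9000.0, 400.0],
    "net_income": [100.0, -20.0, 5.0, 450.0, -10.0],   # money losers stay in the sample
})

# Aggregate first, then divide: total market cap / total net income
industry_pe = (
    df.groupby("industry")[["market_cap", "net_income"]].sum()
      .assign(aggregate_pe=lambda x: x["market_cap"] / x["net_income"])
)
print(industry_pe)
```

Computed this way, the ratio behaves like a value-weighted average and keeps money-losing companies in the denominator, which a simple mean of company-level PE ratios would have to exclude.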
    Since my data is now global, I also report on these variables not only across all companies globally in each industry group, but also for regional sub-groupings:



I will admit that this breakdown may look quirky, but it reflects the history of my data updates. The reason Japan gets its own grouping is that, when I started breaking the data out by region two decades ago, it was a much larger part of both the global economy and markets. The emerging markets grouping has become larger and more unwieldy over time, as some of the countries in this group have acquired developed market status, and as China and India have grown as economies and markets, I have started reporting statistics for them separately, in addition to including them in the emerging markets grouping. Europe, as a region, has become more dispersed in its risk characteristics, with parts of Southern Europe showing volatility more typical of emerging markets.
Data Universe
    In the first part of this post, I noted how bias can skew data analysis, and one of the biggest sources of bias is sampling, where you pick a subset of companies and draw conclusions that do not hold for companies in general. Thus, using only the companies in the S&P 500, or companies with market capitalizations that exceed a billion dollars, in computing industry averages will yield results that reflect what large companies are doing or how they are priced, and not the entire market. To reduce this sampling bias, I include all publicly traded companies that have a market price that exceeds zero in my sample, yielding a total sample size of 48,156 companies in my data universe. Note that there will be some sampling bias still left, insofar as unlisted and privately owned businesses are not included, but since disclosure requirements for these businesses are much spottier, it is unlikely that we will have datasets that include these ignored companies in the near future.
    In terms of geography, the companies in my sample span the globe, and I will add to my earlier note on regional breakdowns by looking at the number of firms listed and the market capitalizations of companies in each sub-region:

Current data link

As you can see, the United States, with 5,994 firms and a total market capitalization of $69.8 trillion, continues to have a dominant share of the global market. While US stocks had a good year, up almost 16.8% in the aggregate, the US share of the global market dipped slightly from 48.7% at the end of 2024 to 46.8% at the end of 2025. The best performing sub-region in 2025 was China, up almost 32.5% in US dollar terms, and the worst, again in US dollar terms, was India, up only 3.31%. Global equities added $26.3 trillion in market capitalization in 2025, up 21.46% for the year.
    While I report averages for 95 industry groupings, these groupings are part of broader sectors, and in the table below, you can see the breakdown of the overall sample by sector:
Current data link
Across all global companies, technology is now the largest share of the market, commanding almost 22% of overall market capitalization, followed by financial services with 17.51% and industrials with 12.76%. There is wide divergence across sectors in terms of market performance in 2025, with technology delivering the highest return (20.73%) and real estate and utilities the lowest. There is clearly much more that can be done on both the regional and sector fronts to enrich this analysis, but that will have to wait for the next posts.

Usage
    My data is open access and freely available, and it is not my place to tell you how to use it. That said, it behooves me to talk about both the users that this data is directed at, as well as the uses that it is best suited for. 
  1. For practitioners, not academic researchers: The data that I report is for practitioners in corporate finance, investing and valuation, rather than academic researchers. Thus, all of the data on the current data link is as of the start of January 2026, and can be used in assessments and analysis today. If you are a doctoral student or researcher, you will be better served going to the raw data or having access to a full data service, but if you lack that access, and want to download and use my industry averages over time, you can use the archived data that I have, with the caveat that not all data items have long histories and my raw data sources have changed over time.
  2. Starting point, not ending point: If you do decide to use any of my data, please recognize that it is the starting point for your analysis, not a magic bullet. Thus, if you are pricing a steel company in Thailand, you can start with the EV/EBITDA multiple that I report for emerging market steel companies, but you should adjust that multiple for the characteristics of the company being analyzed (see the sketch after this list).
  3. Take ownership: If you do use my data, whether it be on equity risk premiums or pricing ratios, please try to understand how I compute these numbers (from my classes or writing) and take ownership of the resulting analysis. 
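    As a sketch of what treating an industry multiple as a starting point might look like, with made-up numbers and a purely illustrative haircut for company-specific characteristics (neither the multiple nor the adjustment rule comes from the datasets):

```python
# Hypothetical inputs: an industry multiple as the starting point,
# then an illustrative company-specific adjustment (all numbers are made up).
industry_ev_ebitda = 7.5    # e.g., a reported emerging-market steel multiple
company_ebitda = 220.0      # the company's EBITDA, in millions of a chosen currency
company_discount = 0.15     # illustrative haircut for lower growth or higher risk

adjusted_multiple = industry_ev_ebitda * (1 - company_discount)
enterprise_value = adjusted_multiple * company_ebitda

print(f"Adjusted EV/EBITDA: {adjusted_multiple:.2f}")
print(f"Estimated enterprise value: {enterprise_value:,.0f} million")
```

The size and direction of that adjustment is a judgment call, which is exactly why the reported multiple should be a starting point and not the answer.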
If you use my data, and acknowledge me as a source, I thank you, but you do not need to explicitly ask me for permission. The data is in the public domain to be used, not for show, and I am glad that you were able to find a use for it.

The Damodaran Bot!
       In 2024, I talked about the Damodaran Bot, an AI entity that had read or watched everything that I have put online (classes, books, writing, spreadsheets), and about what I could do to stay ahead of its reach. I argued that AI bots will not only match, but be better than I am, at mechanical and rule-based tasks, and that my best pathway to creating a differential advantage was in finding aspects of my work that required multi-disciplinary (numbers plus narrative) and generalist thinking, with intuition and imagination playing a key role. As I looked at the process that I went through to put my datasets together, I realized that there was no aspect of it that a bot could not do better and faster than I can, and I plan to involve my bot more in my data update next year, with the end game of having it take over almost the entire process.
   I do think that there is a message here for businesses that are built around collecting and processing data, and charging high prices for that service. Unless they can find other differentiators, they are exposed to disruption, with AI doing much of what they do. More generally, to the extent that a great deal of quant investing has been built around smart numbers people working with large datasets to eke out excess returns, it will become more challenging, not less so, with AI in the mix.

YouTube Video


Links to data
