About the report

Why Framna built MATR
Framna is a digital product agency. We partner with organizations on the products that matter most to their customers and to their business. MATR is how we see the market, and how we make conversations about product quality concrete.

The app is not dead. It is under pressure from two directions

AI agents are absorbing the jobs inside the app. AI coding agents are compressing the time to build a replacement. What used to take years to ship now takes weeks, and a well-funded competitor with Claude and a weekend can match a feature surface that took you five years.

The apps that survive are the ones whose value is not in their features. The trust relationship. The licensed data. The regulatory contracts. The network effect. The operational complexity. The physical-world service. These are the parts no coding agent can reproduce and no assistant can bypass.

We work with product teams to find, name, and build the parts of their product that nothing else can replicate.

What MATR is

MATR is a measurement framework for mobile app quality, built on a survey of nearly 11,500 respondents across Denmark, the Netherlands, and Sweden. It turns user perception into a set of dimensions that can be compared across products, categories, and time. It lets a product leader ask three questions and get answers grounded in data:

Where does our product stand against its category today?

Which dimensions hold us back, and which ones move App Pulse?

What would it take to move?

What MATR is not

MATR is not an award. It is not a ranking optimized for press coverage. It is not a feature checklist. Apps are not graded as winners or losers.

Scope and coverage

Methodology

The MATR 2026 report analyzes 401 products across 625 app-country observations in Denmark, the Netherlands, and Sweden, drawn from 11,478 respondents via a nationally representative sample (ages 15–79) in each market. Fieldwork was completed and the dataset was finalized in early April 2026. Editorial passages round these to “nearly 400 products” and “nearly 11,500 respondents” for readability. Poland is excluded from all figures in this white paper.

Who answered

Every app score in the report is rated only by people who use that app, weighted by respondent count. The demographic composition behind each app score therefore reflects who actually opens that product, not a synthetic average. Findings that depend on a demographic cut name the relevant split inline: gender on trust (5.3), mindset on design (4.4) and on innovation (6.4), and age on AI install rates (7.1). The rest of the report carries patterns that hold across age, gender, income, and mindset.

Location

Approximately 3,800 respondents per market: 3,831 in Denmark, 3,867 in the Netherlands, 3,780 in Sweden. Sampling is nationally representative within each country and covers urban centers and the regions outside them. The raw data carries zipcode, municipality, and country region fields (DK region, SE county, NL NUTS2); a precise urban-versus-rural split per market is available on request but is not aggregated into the public dataset for this report.

Age

Ages span 15 to 79. Every generational cohort is represented (Gen Z, Millennials, Gen X, Boomers), and the full mobile-adoption curve is covered: Innovators, Early Adopters, Early Majority, Late Majority, Laggards. Country-specific income and education distributions are sampled in line with each national panel.

Gender

Approximately equal split between men and women within each market.

Statistical significance

An (app, country) pair is considered significant when the 95 percent confidence-interval width on both the App Pulse Score and the driver score is at or below 0.5 on the 1–5 scale (10 percent of range). Confidence intervals use t-distribution critical values $$ t\left(\alpha/2,\ \mathrm{df} = n_{\mathrm{eff}} - 1\right) $$ where n_eff is the Kish effective sample size: the raw respondent count divided by the design effect $$ \mathrm{DEFF} = 1 + \mathrm{CV}_w^2 $$ with CV_w the coefficient of variation of the respondent weights. This penalizes small samples without arbitrary n thresholds.
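As an illustration of the criterion above, the following is a minimal sketch, not the MATR pipeline itself. The function names are hypothetical, the weights are made-up numbers, and the t critical value is passed in by the caller (for example from a t table at df = n_eff − 1) so the sketch stays within the Python standard library.

```python
import math


def kish_design_effect(weights):
    """DEFF = 1 + CV_w^2, where CV_w is the coefficient of variation of the weights."""
    n = len(weights)
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n  # population variance
    return 1.0 + var_w / mean_w ** 2


def kish_n_eff(weights):
    """Kish effective sample size, (sum w)^2 / sum(w^2), which equals n / DEFF."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def ci_width(std_dev, n_eff, t_crit):
    """Full width of a two-sided interval: 2 * t * s / sqrt(n_eff)."""
    return 2.0 * t_crit * std_dev / math.sqrt(n_eff)


def is_significant(width, threshold=0.5):
    """Threshold of 0.5 on the 1-5 scale, as stated in the text."""
    return width <= threshold


# Hypothetical respondent weights for one (app, country) pair.
weights = [1.0, 1.0, 2.0, 0.5, 1.5]
n_eff = kish_n_eff(weights)
deff = kish_design_effect(weights)
```

Unequal weights push n_eff below the raw count, which widens the interval and makes the 0.5 threshold harder to clear.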

 

A second salvage step is applied. Pairs that fall below the threshold in every individual country can still enter the analysis if the same (app) clears the confidence-interval criterion when pooled across the markets where it appears. This adds roughly 50 app-country pairs (and 11 distinct apps) to the working sample. Pooled-salvaged pairs enter the respondent-weighted averages and the regression on the same footing as country-level significant pairs.

What the study sample means

The 401 distinct apps (625 app-country observations) reported in this white paper are not a random sample of every mobile app in the study region. An app is included only after it clears the significance threshold above: sufficient respondent coverage and stable enough scores to be interpretable. Most apps with very low usage, niche user bases, or unstable ratings do not appear.

 

The practical consequence is that the white paper analyzes the strong segment of the app market in Denmark, the Netherlands, and Sweden. Average App Pulse across the sample is roughly 3.8 on a 1 to 5 scale, and only a small tail of apps scores below 3.3. When a chapter refers to a “bottom quartile” or “lower tier”, it means the lower end of this strong pool, not the bottom of the broader app market. An app in the lower quartile of this study is a product that cleared the reliability bar and is simply not keeping pace with the leaders around it. It is not a weak product in an absolute sense.

 

The same logic applies to language like “vulnerable incumbents”, “good-enough trap”, and “AI-exposed categories” used later in the report. These are relative positions within the strong pool. Every product the study names is, by construction, already above the reliability floor.

Respondent-weighted averaging

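All display metrics (app-level pulse, driver scores, NPS) use respondent-weighted averaging instead of simple country averaging: $$ \bar{x} = \frac{\sum_c n_c \cdot \mathrm{score}_c}{\sum_c n_c} $$ where n_c is the number of respondents rating the app in country c and score_c is the app's score in that country.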

This ensures apps measured in high-respondent countries are not under-weighted, and approximates inverse-variance weighting (Cochran, 1954), the minimum-variance unbiased estimator for combining independent estimates.
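A short sketch of the weighted average, with hypothetical respondent counts and scores, shows how it differs from a simple mean of country scores. The function name and the input layout are assumptions made for illustration.

```python
def respondent_weighted_mean(scores_by_country):
    """scores_by_country maps a country code to (n_respondents, score)."""
    total_n = sum(n for n, _ in scores_by_country.values())
    if total_n == 0:
        raise ValueError("no respondents")
    return sum(n * s for n, s in scores_by_country.values()) / total_n


# Made-up example: one app measured in three markets.
example = {"DK": (120, 4.1), "NL": (40, 3.5), "SE": (80, 3.9)}

weighted = respondent_weighted_mean(example)                # weights by n_c
simple = sum(s for _, s in example.values()) / len(example)  # ignores n_c
```

In this made-up case the weighted value sits closer to the Danish score, because Denmark contributes half of the respondents.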

The four-driver model

The white paper measures product quality through four drivers, each composed of two to four survey statements:

Technical performance

Smooth and responsive. No crashes.

UX and design

Easy to navigate. Appropriate information density. Visual appeal. Modern design.

Trust

Acts in my interest. Handles data responsibly.

Feature richness

Personalized. Innovative. Has all needed features.

App Pulse Score combines two inner-loop questions on the 1–5 scale:

$$ \mathrm{App\ Pulse} = 0.75 \cdot \mathrm{Satisfaction} + 0.25 \cdot \mathrm{NPS}_{\mathrm{scaled}} $$

NPS is rescaled to the 1–5 scale. A third candidate question (“would miss the app if it were gone”) was tested in the model but received weight 0 after nested cross-validation. Eleven driver statements contribute to the four driver scores.
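The weighting can be sketched as follows. The report does not state how NPS is rescaled, so the linear mapping of a 0–10 recommendation answer onto 1–5 used here is an assumption, and both function names are hypothetical.

```python
def rescale_recommend_to_1_5(recommend_0_10):
    """ASSUMPTION: linear map of a 0-10 recommendation answer onto the 1-5 scale."""
    if not 0 <= recommend_0_10 <= 10:
        raise ValueError("expected a value between 0 and 10")
    return 1.0 + 4.0 * recommend_0_10 / 10.0


def app_pulse(satisfaction, nps_scaled):
    """App Pulse = 0.75 * Satisfaction + 0.25 * NPS_scaled, both on the 1-5 scale."""
    return 0.75 * satisfaction + 0.25 * nps_scaled


# Made-up inputs: satisfaction of 4.0 and a recommendation answer of 8.
pulse = app_pulse(4.0, rescale_recommend_to_1_5(8))
```

Because both inputs are on the 1–5 scale and the weights sum to 1, the result also stays on the 1–5 scale.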

 

Display precision. The report follows the MATR dashboard’s four-tier rounding convention. App Pulse for a single product rounds to the nearest 0.2 (3.4, 3.6, 3.8, 4.0, 4.2). Pulse aggregated across a category rounds to the nearest 0.05 (3.65, 4.10). Pulse aggregated across the market rounds to the nearest 0.01 (3.80, 3.84). Driver and single-statement scores at any aggregation level round to the nearest 0.1 (3.4, 3.5, 3.6). Underlying data is held at full precision. This convention is what lets the report avoid printing margins of error inline. The displayed precision at each tier sits inside the confidence interval the underlying data supports, which means a reader can take every printed number at face value without checking an error bar.
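The four tiers can be expressed as a single rounding helper. This is a sketch with hypothetical tier names. The treatment of exact ties (rounded half up here) is an assumption, since the report does not specify it.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rounding step per display tier, as described in the text.
STEPS = {
    "product_pulse": Decimal("0.2"),
    "category_pulse": Decimal("0.05"),
    "market_pulse": Decimal("0.01"),
    "driver": Decimal("0.1"),
}


def display_round(value, tier):
    """Round a full-precision score to the nearest step for the given tier."""
    step = STEPS[tier]
    # Decimal arithmetic avoids binary floating-point surprises near ties.
    units = (Decimal(str(value)) / step).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return float(units * step)


# Made-up full-precision scores.
product = display_round(3.73, "product_pulse")
category = display_round(3.67, "category_pulse")
market = display_round(3.8372, "market_pulse")
driver = display_round(3.46, "driver")
```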

 

A note on notifications. Two notification statements (“useful notifications” and “appropriate notification amount”) are measured in the survey but excluded from the driver model. Validation showed they do not meaningfully predict App Pulse in this dataset and removing them does not damage Cronbach’s α. For that reason they are not reported as a strategic dimension anywhere in this white paper.

Regression: two-way fixed effects with Ridge

The driver importance analysis behind every figure in this white paper operates on the 625 app-country observations from Denmark, the Netherlands, and Sweden. A two-way within-transformation iteratively subtracts country and category group means (alternating projection) until convergence, which is equivalent to including unpenalized country and category dummies. Coefficients reflect driver impact within country-category cells.
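The within-transformation can be sketched for a single variable as below. This is an illustration under stated assumptions rather than the production code. The function names, tolerance, and example data are hypothetical, and in a regression the outcome and each driver would be transformed in the same way.

```python
from collections import defaultdict


def _subtract_group_means(values, groups):
    """Subtract each observation's group mean from its value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def two_way_within(values, countries, categories, tol=1e-10, max_iter=1000):
    """Alternately remove country means and category means until values stop changing."""
    current = list(values)
    for _ in range(max_iter):
        step = _subtract_group_means(current, countries)
        step = _subtract_group_means(step, categories)
        change = max(abs(a - b) for a, b in zip(step, current))
        current = step
        if change < tol:
            break
    return current


# Made-up, unbalanced example: six app-country observations.
values = [3.8, 4.1, 3.5, 3.9, 4.2, 3.6]
countries = ["DK", "DK", "NL", "NL", "SE", "SE"]
categories = ["bank", "retail", "bank", "bank", "retail", "retail"]
demeaned = two_way_within(values, countries, categories)
```

In a balanced design one pass is enough. The iteration matters when, as here, categories are unevenly spread across countries.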

 

Ridge regression (L2 regularization) is applied to stabilize estimation once the halo control (Overall Quality, the mean of the four drivers) is added as a covariate. This is the specification that produced every driver-impact number in the white paper. Why fixed effects rather than random effects: with three countries and nine categories, level-2 variance components would be unreliably estimated (McNeish & Stapleton, 2016).

Relative importance (Johnson’s epsilon) with BCa bootstrap

Driver importance percentages are computed via Johnson’s relative weight analysis (Johnson, 2000), an eigen-based decomposition that handles correlated predictors. Confidence intervals use the BCa (bias-corrected and accelerated) bootstrap (Efron, 1987), the gold standard for relative-weight inference (Tonidandel, LeBreton & Johnson, 2009).

Bias correction (z₀)

proportion of bootstrap replicates below the full-sample estimate, transformed via Φ^−1.

Acceleration (â)

computed from leave-one-cluster-out jackknife (drops each app in turn).

Cluster bootstrap

resamples at the app level to account for within-app correlation across countries (Cameron, Gelbach & Miller, 2008).
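The three ingredients above can be combined in a compact sketch. It is an illustration only: the statistic here is a plain mean standing in for a relative weight, the data are made up, and the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist, mean

_NORMAL = NormalDist()


def bca_interval(clusters, statistic, n_boot=2000, alpha=0.05, seed=0):
    """BCa interval where `clusters` holds one list of observations per app."""
    rng = random.Random(seed)

    def flatten(cs):
        return [x for c in cs for x in c]

    k = len(clusters)
    theta_hat = statistic(flatten(clusters))

    # Cluster bootstrap: resample whole apps with replacement.
    reps = sorted(
        statistic(flatten([clusters[rng.randrange(k)] for _ in range(k)]))
        for _ in range(n_boot)
    )

    # Bias correction: share of replicates below the estimate, through Phi^-1.
    below = sum(r < theta_hat for r in reps) / n_boot
    below = min(max(below, 1 / n_boot), 1 - 1 / n_boot)  # keep inv_cdf finite
    z0 = _NORMAL.inv_cdf(below)

    # Acceleration: leave-one-cluster-out jackknife (drops each app in turn).
    jack = [statistic(flatten(clusters[:i] + clusters[i + 1:])) for i in range(k)]
    jbar = mean(jack)
    den = 6.0 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a_hat = sum((jbar - j) ** 3 for j in jack) / den if den > 0 else 0.0

    def adjusted(p):
        z = _NORMAL.inv_cdf(p)
        return _NORMAL.cdf(z0 + (z0 + z) / (1 - a_hat * (z0 + z)))

    def pick(p):
        return reps[min(n_boot - 1, max(0, int(p * n_boot)))]

    return pick(adjusted(alpha / 2)), theta_hat, pick(adjusted(1 - alpha / 2))


# Made-up scores: each inner list is one app observed in one to three countries.
clusters = [[3.8, 4.0], [3.5], [4.2, 4.1, 4.3], [3.9], [3.6, 3.7], [4.0], [3.4, 3.3], [4.4]]
low, estimate, high = bca_interval(clusters, mean, n_boot=1000, seed=1)
```

Resampling whole apps keeps an app's country-level observations together in every replicate, which is what accounts for within-app correlation.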

Category merges

The 16 original categories are merged into nine working categories. Merges were validated via per-category OLS regression on adjusted R², leave-one-out CV-RMSE, and coefficient sign stability. All merges maintain or improve regression predictability.

Sub-category analysis

Sub-categories are reported at the app level (each app can belong to multiple sub-categories). Sub-category statistics (median Pulse, % at ≥4.0, ceiling) are computed only for sub-categories with at least five significant (app, country) pairs.
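A minimal sketch of these sub-category statistics follows. It assumes that “ceiling” means the highest Pulse in the sub-category; the function name, input layout, and example values are hypothetical.

```python
from statistics import median

MIN_PAIRS = 5  # minimum number of significant (app, country) pairs, per the text


def subcategory_stats(pairs):
    """pairs: (sub_category, pulse) tuples, one per significant (app, country) pair.

    An app in several sub-categories contributes one tuple to each of them.
    """
    grouped = {}
    for sub, pulse in pairs:
        grouped.setdefault(sub, []).append(pulse)

    stats = {}
    for sub, vals in grouped.items():
        if len(vals) < MIN_PAIRS:
            continue  # too few pairs to report
        stats[sub] = {
            "n_pairs": len(vals),
            "median_pulse": median(vals),
            "pct_at_or_above_4": 100.0 * sum(v >= 4.0 for v in vals) / len(vals),
            "ceiling": max(vals),  # ASSUMPTION: ceiling = highest observed Pulse
        }
    return stats


# Made-up data: one sub-category with five pairs, one with only four.
pairs = [("payments", p) for p in (3.6, 3.8, 4.0, 4.2, 4.4)]
pairs += [("parking", p) for p in (3.4, 3.6, 3.8, 4.0)]
result = subcategory_stats(pairs)
```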

Replication

All numbers in this white paper can be reproduced from the aggregated JSON files in the Framna MATR dashboard. The build pipeline is a single Python script that reads those JSONs and emits per-chapter data files. Underlying respondent-level data is not published.