Chapter 1: Introduction
1.1 Research Background
In competitive sports, data analytics has become central to evaluating performance and strategy. Tennis is particularly well suited to quantitative analysis because it has a structured, point-by-point scoring system and a long, well-documented match history. The large volume of match data from ATP and WTA tournaments in recent years has enabled analysts to go beyond simple statistics, such as serve percentages or match wins, and to examine more complex questions, such as the effect of high-leverage points on match results.
Break points, tie-breaks, and set-deciding rallies are high-leverage points that can considerably affect whether a player wins a match. For coaches, players, and analysts seeking to improve training and strategy, it is important to know how players respond in such crucial situations. Despite their importance, most traditional studies emphasize aggregate performance measures that obscure the influence of these points. The purpose of this study is to measure the effect of key-point performance on match outcomes by applying statistical models to these high-stakes moments, giving a more precise account of what determines success in competitive tennis.
1.2 Research Questions
The following research questions will guide this study:
1. Does a player's performance in critical moments, such as break points or tie-breaks, significantly affect the likelihood of winning a match?
2. Which statistical modelling method is suitable for measuring the effect of key-point performance on match results?
1.3 Objectives and Significance of the Research
This study has two objectives:
- Predictive Objective: Build a logistic regression model, fitted by maximum likelihood estimation (MLE), that uses key-point performance metrics to predict match outcomes.
- Analytical Objective: Estimate the relative contribution of key performance indicators, including break points saved, tie-break win rate, and aces, to the probability of winning a match.
The significance of the research is both theoretical and practical. Theoretically, it demonstrates the application of MLE to modelling discrete outcomes in sports analytics, building on previous studies of tennis outcome prediction. Practically, the results indicate which areas of key-point performance matter most for match victory and can give players and coaches actionable information. The model can also guide data-driven approaches to training, match preparation, and competitive analysis.
1.4 Structure of the Thesis
- Chapter 2: Literature Review surveys past research on tennis performance metrics, the importance of key points, and the application of logistic regression in sports analytics.
- Chapter 3: Methodology describes the data sources, variable definitions, preprocessing, logistic regression model specification, and validation.
- Chapter 4: Results and Analysis reports the descriptive statistics, the fitted model coefficients, and the test-set performance.
- Chapter 5: Discussion places the findings in the context of previous research, draws practical implications, and outlines limitations and future research directions.
- Chapter 6: Conclusion summarizes the contributions of the study, highlighting its methodological and practical importance.
Chapter 2: Literature Review
2.1 Overview of Sports Prediction Models
Predicting sports outcomes is a long-standing topic in operations research and statistics. Classical models, such as the Bradley-Terry model, estimate the probability that one competitor defeats another from the results of previous contests. Although useful for ranking players and predicting match outcomes, such models do not account for point-level dynamics, and so they overlook the particular in-match events that can turn a match.
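For reference, the Bradley-Terry model is commonly written as follows, where \(\pi_i > 0\) denotes the latent strength of player \(i\) (the symbols are introduced here for illustration and are not used elsewhere in this thesis):

\[
P(i \text{ beats } j) = \frac{\pi_i}{\pi_i + \pi_j}
\]

Because the strengths depend only on who is playing, the formula has no term for how individual points within a match unfold.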
2.2 Key Performance Indicators in Tennis
First-serve percentage, aces, double faults, break points saved, and the winners-to-unforced-errors ratio are among the most commonly used metrics of tennis performance. Recent research suggests that high-leverage points, which have a disproportionate influence on match outcomes, are critical predictors of competitive advantage. For example, Magnus and Klaassen (1999) demonstrated that break-point conversion rates are stronger predictors of match outcomes than aggregate service statistics. Similarly, tie-break performance can determine set wins and, consequently, match outcomes in closely contested matches.
2.3 Application of Statistical Methods in Tennis Analysis
Logistic regression has become a popular technique for analysing match results because it is interpretable and probabilistic. It models the probability of winning as a function of the independent variables, which allows analysts to measure the impact of each performance measure. Logistic regression is fitted by maximum likelihood estimation (MLE), which selects the parameter values that maximize the probability of observing the actual outcomes. Alternative machine learning methods, such as random forests and gradient boosting, are strong predictors but usually lack interpretability. This work builds on previous research by focusing on multivariate logistic regression, which emphasizes the interpretation of each key-point measure.
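In the binary-outcome setting, the quantity that MLE maximizes can be written as the log-likelihood below. The notation is an assumption for illustration: \(y_i \in \{0, 1\}\) is the observed outcome for observation \(i\), \(x_i\) is its vector of predictors (with a leading 1 for the intercept), and \(\beta\) is the coefficient vector.

\[
\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr],
\qquad
p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}
\]

The MLE is the value of \(\beta\) at which \(\ell(\beta)\) is largest; it has no closed form and is found iteratively.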
2.4 Research Gap and Contribution
Although many studies examine tennis performance, only a few analyse point-level key performance measures in a multivariate framework. Many previous studies rely on single-variable analysis, season-level aggregate statistics, or purely rank-based predictors. This thesis addresses these gaps by:
1. Combining several key-point metrics to model match results.
2. Using logistic regression fitted by MLE to obtain interpretable and statistically rigorous results.
3. Offering practical applications for player performance analysis, coaching, and strategy optimization.
Chapter 3: Methodology
3.1 Data Source and Collection
The data for this study were obtained as ATP match CSV files covering the years 2018 to 2024. The files contain detailed match statistics such as aces, break points saved, points won on first and second serve, and match points. All CSV files were combined into one dataset of 18,877 matches and 49 variables.
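As an illustration, yearly CSV files of this kind could be combined with pandas as sketched below. The file-name pattern `atp_matches_*.csv` and the helper name `load_matches` are assumptions made for the example; the study does not state how the files were named or loaded.

```python
from pathlib import Path

import pandas as pd


def load_matches(data_dir: str, pattern: str = "atp_matches_*.csv") -> pd.DataFrame:
    """Read every CSV matching `pattern` in `data_dir` and stack them row-wise.

    Both the directory layout and the pattern are hypothetical.
    """
    files = sorted(Path(data_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {data_dir!r}")
    frames = [pd.read_csv(f) for f in files]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_matches("data")` would return one DataFrame whose row count is the sum of the row counts of the individual files.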
3.2 Feature Selection
The following features, which have been studied previously in the literature on tennis performance indicators, were selected for modelling:
• Winner statistics: w_ace, w_bpSaved, w_1stWon, w_2ndWon
• Loser statistics: l_ace, l_bpSaved, l_1stWon, l_2ndWon
The outcome (target variable) was coded 1 for the winner and 0 for the loser. An additional feature, tie_break_win, indicates whether the player won a tie-break in the match.
3.3 Data Preprocessing
• Winner and loser data were recoded and merged into a single dataset of 36,366 rows.
• Missing values were removed using a complete-case approach.
• Features selected for modeling were:
ace, bpSaved, 1stWon, 2ndWon, tie_break_win.
• The target variable: outcome (1 = winner, 0 = loser).
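The recoding step above can be sketched as follows: each match row is split into one winner row and one loser row, the `w_`/`l_` prefixes are dropped, and incomplete rows are removed. The helper name `to_player_rows` is hypothetical, and `tie_break_win` is left out because the text does not describe how it was derived.

```python
import pandas as pd

STATS = ["ace", "bpSaved", "1stWon", "2ndWon"]


def to_player_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Turn one row per match into one row per player, labelled by outcome."""
    winners = matches[[f"w_{s}" for s in STATS]].copy()
    winners.columns = STATS
    winners["outcome"] = 1  # winner rows

    losers = matches[[f"l_{s}" for s in STATS]].copy()
    losers.columns = STATS
    losers["outcome"] = 0  # loser rows

    players = pd.concat([winners, losers], ignore_index=True)
    # Complete-case approach: drop any row with a missing value.
    return players.dropna().reset_index(drop=True)
```

Before rows with missing values are dropped, the result has exactly twice as many rows as the match-level input.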
3.4 Train-Test Split
A 70%-30% stratified split was applied:
| Set | Rows |
| Training | 25,458 |
| Test | 10,908 |
Table 1: Train-Test Split
Stratification was used to ensure that both sets preserve the same proportion of winners and losers.
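A stratified 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The data below are synthetic stand-ins, and the `random_state` value is an arbitrary assumption; the study does not report the seed or the library it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data: 1,000 rows, 5 features, balanced binary outcome.
X = rng.normal(size=(1000, 5))
y = np.repeat([0, 1], 500)

# stratify=y keeps the share of winners (1) and losers (0) equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```

Without `stratify`, the class proportions in the two subsets would match only approximately.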
3.5 Statistical Modeling
A logistic regression model using Maximum Likelihood Estimation (MLE) was fitted:
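\[
\operatorname{logit}\bigl(P(\text{outcome} = 1)\bigr) = \beta_0 + \beta_1\,\text{ace} + \beta_2\,\text{bpSaved} + \beta_3\,\text{1stWon} + \beta_4\,\text{2ndWon} + \beta_5\,\text{tie\_break\_win}
\]
where \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_5\) are the coefficients estimated by MLE, and: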
- outcome = 1 indicates a match win
- ace = number of aces
- bpSaved = number of break points saved
- 1stWon = number of first-serve points won
- 2ndWon = number of second-serve points won
- tie_break_win = indicator variable showing whether the player won a tie-break
The model was estimated with the following specification:
- Family: binomial
- Link function: logit
Model fitting was performed on the training set.
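To make the estimation step concrete, the sketch below fits a binomial-logit model by maximum likelihood using Newton-Raphson iterations in NumPy. It is an illustration of the technique, not the code used in this study: the function name `fit_logistic_mle` and the simulated data are assumptions, and a statistics library would normally be used in practice.

```python
import numpy as np


def fit_logistic_mle(X: np.ndarray, y: np.ndarray,
                     n_iter: int = 25, tol: float = 1e-8) -> np.ndarray:
    """Return MLE coefficients [intercept, b1, ..., bk] via Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))          # predicted P(outcome = 1)
        grad = Xd.T @ (y - p)                         # gradient of the log-likelihood
        info = Xd.T @ (Xd * (p * (1.0 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic check: simulate outcomes from known coefficients, then refit.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 2))
true_beta = np.array([-0.5, 1.0, -2.0])
prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = rng.binomial(1, prob)

beta_hat = fit_logistic_mle(X, y)
print(np.round(beta_hat, 2))
```

With 5,000 simulated observations, the estimates should land close to the coefficients used to generate the data, which is a quick sanity check on the fitting routine.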
Chapter 4: Results and Analysis
4.1 Descriptive Statistics
| Feature | Mean | SD | Min | Max |
| ace | 6.32 | 5.45 | 0 | 67 |
| bpSaved | 4.12 | 3.25 | 0 | 27 |
| 1stWon | 35.99 | 14.22 | 0 | 171 |
| 2ndWon | 15.47 | 6.86 | 0 | 56 |
| tie_break_win | 0.33 | 0.47 | 0 | 1 |
Table 2: Descriptive Statistics
These statistics were calculated from the combined dataset (X) and provide an overview of players’ performance distributions.
4.2 Logistic Regression Model Performance
- Training Set Coefficients:
| Feature | Coefficient |
| 2ndWon | 0.0391 |
| 1stWon | 0.0329 |
| ace | 0.0203 |
| bpSaved | -0.2342 |
| tie_break_win | -0.5605 |
Interpretation: The positive coefficients indicate that the probability of winning increases with 2ndWon, 1stWon, and ace. The negative coefficients indicate that bpSaved and tie_break_win are negatively associated with winning, which could be a result of multicollinearity.
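Coefficients on the log-odds scale are often easier to read as odds ratios, obtained by exponentiating each coefficient. The short sketch below applies this standard conversion to the coefficients reported in this section; it adds no new results.

```python
import math

# Coefficients as reported in Section 4.2.
coefficients = {
    "2ndWon": 0.0391,
    "1stWon": 0.0329,
    "ace": 0.0203,
    "bpSaved": -0.2342,
    "tie_break_win": -0.5605,
}

# exp(beta) is the multiplicative change in the odds of winning for a
# one-unit increase in the feature, with the other features held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:>14}: odds ratio = {ratio:.3f}")
```

For example, the 2ndWon coefficient corresponds to roughly a 4% increase in the odds of winning per additional second-serve point won, while the bpSaved coefficient corresponds to roughly a 21% decrease in the odds per additional break point saved.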
- Test Set Evaluation:
| Metric | Value |
| Accuracy | 0.6608 |
| ROC-AUC | 0.7056 |
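Confusion Matrix: the test-set classification produced 3,692 true positives and 3,516 true negatives (see Section 5.3).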
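The three test-set measures used in this section (accuracy, ROC-AUC, and the confusion matrix) can be computed with scikit-learn as sketched below. The labels and probabilities are small hypothetical arrays for illustration, and the 0.5 classification threshold is an assumption; the study does not state the threshold it used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.55, 0.1, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # classify at an assumed 0.5 threshold

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # AUC is computed from probabilities, not labels
# For binary labels {0, 1}, ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"accuracy={accuracy:.4f}  auc={auc:.4f}  TP={tp} TN={tn} FP={fp} FN={fn}")
```

Accuracy equals (TP + TN) divided by the number of test rows. As a consistency check on the reported figures, the counts given in Section 5.3 (3,692 true positives and 3,516 true negatives) sum to 7,208 correct predictions out of 10,908 test rows, which matches the reported accuracy of 0.6608.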
4.3 Feature Importance Visualization
Figure: Bar plot of logistic regression coefficients
Figure: Scatter plot of predicted probability versus tie_break_win
These plots show how each feature relates to the predicted probability of winning.
Chapter 5: Discussion
5.1 Overview
The logistic regression analysis in Chapter 4 showed that several key performance indicators (KPIs) are informative predictors of ATP match outcomes. Using serve statistics and tie-break wins, the study offers a comparative view of how match-level indicators translate into the probability of winning. This chapter discusses the implications of the results, evaluates the effectiveness of the model, and considers limitations and future research opportunities.
5.2 Interpretation of Key Features
1. Second Serve Points Won (2ndWon)
Coefficient: 0.0391
Interpretation: Each additional second-serve point won increases the log-odds of winning a match by 0.0391, holding the other features constant.
Implication: Players with higher second-serve efficiency are more likely to win, highlighting the importance of a reliable second serve under pressure.
2. First Serve Points Won (1stWon)
Coefficient: 0.0329
Interpretation: Similar to second serve, success on the first serve contributes positively to match outcomes.
Implication: Consistency in first-serve points provides a strong advantage, reflecting both technical skill and mental focus.
3. Aces (ace)
Coefficient: 0.0203
Interpretation: Aces have a positive but smaller effect compared to serve points won.
Implication: While aces can contribute to quick points, their overall impact is less than consistent point-winning on serves.
4. Break Points Saved (bpSaved)
Coefficient: -0.2342
Interpretation: Surprisingly, the negative coefficient suggests that higher bpSaved counts are associated with a lower probability of winning.
Explanation: This may be due to the fact that players who face a lot of break points are typically weaker in that match; consequently, saving break points is reactive rather than a sign of dominance.
5. Tie-Break Wins (tie_break_win)
Coefficient: -0.5605
Interpretation: Winning a tie-break appears negatively associated with the overall match outcome in the logistic model.
Explanation: Tie-breaks occur in evenly matched sets, so a tie-break win often signals a close contest rather than overall match dominance.
5.3 Model Performance and Predictive Power
Accuracy: 0.6608
The model correctly predicts approximately 66% of match outcomes.
ROC-AUC: 0.7056
Indicates moderate discriminatory ability between winners and losers.
Confusion Matrix:
True Positives (Predicted Winner = 1 & Actual Winner = 1): 3,692
True Negatives (Predicted Loser = 0 & Actual Loser = 0): 3,516
Misclassifications occur primarily among borderline cases, reflecting close matches where features alone do not fully capture the outcome.
Implications:
The logistic regression captures overall trends well but has limitations for matches with subtle or context-dependent dynamics.
This level of predictive accuracy is sufficient for analytical insights but may not replace expert judgment in tactical decision-making.
5.4 Practical Applications
1. Player Performance Evaluation
Coaches can concentrate on increasing serve efficiency, particularly second-serve points won, which had the largest positive coefficient in the model.
2. Match Strategy
Analysing break-point and tie-break performance gives insight into situational resilience and mental toughness, helping players make better strategic decisions.
3. Sports Analytics Tools
Probabilistic match prediction and performance benchmarking models can be added to the dashboards of ATP analysts, tournament scouts, or sports betting systems.
5.5 Comparison to Prior Research
Previous studies have emphasized the importance of first and second serves in predicting tennis outcomes. The finding that tie-break wins and break points saved have weaker or negative coefficients aligns with prior evidence that contextual performance (e.g., handling pressure) is not always linearly predictive in logistic models.
The model confirms that consistency over multiple points, rather than isolated successes (like aces), drives match outcomes.
5.6 Limitations of the Study
1. Feature Limitations
Only serve statistics and tie-break outcomes were included. Variables such as player ranking, fatigue, match surface, head-to-head history, and environmental conditions were not considered.
2. Model Assumptions
Logistic regression assumes a linear relationship between the predictors and the log-odds, which may not capture complex interactions or non-linear effects.
3. Data Issues
Missing or incomplete records were dropped, potentially introducing bias. Because the dataset contains only ATP matches, the findings may not generalize to other tours or amateur levels.
Chapter 6: Conclusion
This study examined how key performance indicators influence tennis match outcomes using logistic regression fitted by maximum likelihood estimation. Using ATP match data, the model used aces, first- and second-serve points won, break points saved, and tie-break wins to predict the likelihood of winning. The results showed that first- and second-serve points won had the strongest positive effects on winning, while break points saved and tie-break wins were negatively associated with winning, suggesting that players under constant pressure are less likely to win. The model achieved 66% accuracy and a ROC-AUC of 0.71, indicating moderate predictive strength. Overall, the results show the importance of serve performance to match success: consistent serve reliability matters more than isolated events such as aces or tie-breaks. Future research could incorporate player ranking, court surface, and more advanced machine learning models to capture more complex performance patterns and improve prediction accuracy.


