1. Introduction

The FIFA World Cup is the most popular sporting event in the world. As shown in the image below [5], its viewership surpasses that of all other major sporting events. With this popularity comes demand for predictive analysis of the tournament and its matches. Many industries seek good predictions at different levels for purposes such as sports betting, media and broadcast analysis, tactical decision making, and driving online fan excitement.

Importance of Football

The FIFA World Cup takes place once every four years with 32 participating teams. First, 8 groups of 4 teams each are formed. Within each group, every team plays each of the others once. The winner of a match receives 3 points and the loser 0; in case of a draw, each team receives 1 point. The top 2 teams from each group then qualify for the knockout stages. From there on, the tournament switches to single-elimination mode.
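The group-stage scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name, the input format and the use of goal difference as a tie-break are assumptions made for the example and are not taken from the project code.

```python
from collections import defaultdict


def group_standings(results):
    """Rank the teams of one group.

    results: list of (team_a, goals_a, team_b, goals_b) tuples (hypothetical format).
    Returns team names ordered best first.
    """
    points = defaultdict(int)
    goal_diff = defaultdict(int)
    for team_a, goals_a, team_b, goals_b in results:
        if goals_a > goals_b:
            points[team_a] += 3          # win: 3 points, loser gets 0
            points[team_b] += 0
        elif goals_b > goals_a:
            points[team_b] += 3
            points[team_a] += 0
        else:
            points[team_a] += 1          # draw: 1 point each
            points[team_b] += 1
        goal_diff[team_a] += goals_a - goals_b
        goal_diff[team_b] += goals_b - goals_a
    # Assumption: ties on points are broken by goal difference only.
    return sorted(points, key=lambda t: (points[t], goal_diff[t]), reverse=True)


# Made-up results for four placeholder teams.
example_results = [
    ("A", 2, "B", 0), ("C", 1, "D", 1), ("A", 1, "C", 1),
    ("B", 0, "D", 3), ("A", 0, "D", 0), ("B", 1, "C", 2),
]
qualified = group_standings(example_results)[:2]  # top 2 go to the knockout stage
```

In this made-up group three teams finish on 5 points, so the assumed goal-difference tie-break decides which two qualify.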

In this work, we predict FIFA World Cup matches and propose a method to generate a grouping of teams for a balanced tournament using past match results and rankings, which together can be used to facilitate a tournament end-to-end. In summary, we use the data to generate and extract relevant features, and then use multiple supervised techniques to predict the winner of a match. We implement multiple algorithms and compare them thoroughly. Apart from real data, we also explore creating fictitious matches and using semi-supervised learning in an attempt to improve the models. Alongside match prediction, we use unsupervised clustering techniques to create groups that can facilitate a good tournament. Through these two processes, we create an end-to-end tool that can take in participating teams, build groups, predict results of matches and, ultimately, predict a complete tournament.

Link to Implementation Code - https://github.com/neelabhsinha/fifa-world-cup-prediction-ml

The problem of predicting game outcomes, especially in Football (also called Soccer in North America), is usually handled as a classification problem [4], [6]. Techniques ranging from logistic regression [4] to RNN/Deep Learning [6] have been employed for this task. Furthermore, the problem of predicting the outcome of a match extends to other team sports [1], [3], given appropriate features. A comprehensive survey of the techniques used across various sports is given in [2].

3. Method Overview

3.1. Problem Definition

A tournament \(\mathcal{T}(\boldsymbol{T},\boldsymbol{G},\boldsymbol{T_b})\) consists of a set of teams \(\boldsymbol{T}\) playing games \(\boldsymbol{G}\) over stages \(\boldsymbol{b} = 0,1,2,\dots\), with a set of teams \(\boldsymbol{T_b}\) qualifying to play in each stage. Each game ends with a winner or, in the group stage only, a tie. Our goals are:

Outcome prediction: \(\forall G(T_i,T_j) \in \boldsymbol{G}\), we predict the outcome \(\hat{G}(T_i,T_j)\) as accurately as possible.
Grouping of teams: Given the teams \(\boldsymbol{T}\), we form 8 groups \(\boldsymbol{g}\) such that \(|g_i| = 4\) and each team \(T_i\) is assigned to exactly one group. We select the groups so that stronger teams are not placed together in a single group. This makes the later elimination stages more competitive, instead of having stronger teams share a group and get eliminated early.

For match prediction, we use supervised classification models and evaluate them using standard metrics: accuracy, precision, recall, F1-score and ROC-AUC. Due to the nature of our problem, we have no reason to penalize false positives more than false negatives, or vice versa.
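For reference, the metrics named above can be computed from binary labels as in the following minimal sketch. It treats label 1 as the positive class and computes ROC-AUC as the fraction of (positive, negative) pairs that the scores rank correctly; the function names are illustrative, and any per-class averaging used for the reported tables is not reproduced here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random positive gets a higher score than a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a correct ranking.
    correct = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))


# Toy labels and scores, for illustration only.
toy_metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```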

To group the teams, we use unsupervised clustering techniques and evaluate them using internal measures only. We do not use external measures because our use case has no ground-truth grouping: even the groups FIFA assigns come from random draws across different pots, so multiple answers can be correct. The only evaluation we consider necessary is therefore how compact the resulting clusters are. How we measure this is described in detail later.

3.2. Overall Pipeline

Overall Pipeline

The figure above shows the overall pipeline of our Tournament Simulator. Given a set of input teams (in our case, the teams that qualified for the 2022 FIFA World Cup), we first use unsupervised clustering techniques to create ideal groups. We then iteratively simulate the matches under the tournament structure using our supervised match outcome predictors to determine the winner of the tournament.

4. Implementation Details

4.1. Dataset

To train our models, we use the datasets listed below:

The dataset features are described in the following figure - Dataset Summary

Of these, the All International Matches and FIFA World Rankings datasets are used to train and test our machine learning models, while the Soccer World Cup Data dataset is used to prepare and run tournament simulations.

4.1.1. Data Cleaning

We initially began with data from various types of matches, including individual matches in FIFA World Cups, qualifiers, friendlies and matches from other tournaments. However, this did not yield favorable prediction accuracy, probably because teams do not play at full strength, particularly in friendlies, which adds noise. Thus, based on our domain knowledge of the tournament, we reduced the dataset to include only individual matches in FIFA World Cups and qualifiers. This yielded better results across all our methods. We also cleaned the data by synchronizing country names between the match dataset and the FIFA ranking dataset, replacing names that were listed differently in the past than they are now (for example, the South Korean football team’s name is Korea Republic), so that names are consistent across datasets. Apart from that, there was no missing data in the dataset.

4.1.2. Feature Extraction

To predict the outcomes, we first extract features for a match fixture using domain knowledge and correlation analysis. For each of the two teams playing, we take its last \(n_{ind}\) individual matches in FIFA World Cups and qualifiers against any opponent. From these, we extract the number of wins, goals scored (mean, std), goals conceded (mean, std), and the mean rank difference between the team and the opponents it faced [7]. We also include the current rank of each team. Next, we take the last \(n_{h2h}\) matches the two teams played against each other in the same category and extract the difference in rank of the teams and the mean and std of goals scored by each team. We also add categorical variables indicating whether the match is played at a neutral venue and whether it is a World Cup match or a qualifier. To obtain the target labels, we compare the goals scored by both teams in the match: if home_team scores more, the label is 1, otherwise 0. The following table summarizes the chosen features.
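The per-team part of this feature extraction might look like the sketch below. The record schema (`goals_for`, `goals_against`, `rank`, `opponent_rank`), the function names and the use of the population standard deviation are assumptions for illustration; the actual column names and std convention in the project may differ.

```python
from statistics import mean, pstdev


def individual_features(matches, n_ind=15):
    """Summarize a team's last n_ind matches.

    matches: one team's past matches, oldest first, as dicts with the
    hypothetical keys goals_for, goals_against, rank, opponent_rank.
    Assumes at least one match is available.
    """
    recent = matches[-n_ind:]
    scored = [m["goals_for"] for m in recent]
    conceded = [m["goals_against"] for m in recent]
    return {
        "wins": sum(m["goals_for"] > m["goals_against"] for m in recent),
        "goals_scored_mean": mean(scored),
        "goals_scored_std": pstdev(scored),      # population std (assumption)
        "goals_conceded_mean": mean(conceded),
        "goals_conceded_std": pstdev(conceded),
        "rank_diff_mean": mean(m["rank"] - m["opponent_rank"] for m in recent),
    }


def outcome_label(home_goals, away_goals):
    """Label is 1 if the home team scores more goals, otherwise 0."""
    return 1 if home_goals > away_goals else 0


# Made-up history for one placeholder team.
example_history = [
    {"goals_for": 2, "goals_against": 0, "rank": 10, "opponent_rank": 20},
    {"goals_for": 1, "goals_against": 1, "rank": 10, "opponent_rank": 5},
    {"goals_for": 0, "goals_against": 3, "rank": 10, "opponent_rank": 15},
]
last_two = individual_features(example_history, n_ind=2)
```

The head-to-head features would follow the same pattern, restricted to matches between the two teams and using \(n_{h2h}\) as the window.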

New Features

We explicitly omit ties from our use case for the following two reasons -

Ties did not correlate well with any feature, or any combination of features, in the available dataset, both in our preliminary analysis and in our attempts to predict them using various ML techniques.
The business end of the tournament is a knockout. For matches beyond the group stage, a tie leads to a penalty shootout, whose outcome is difficult to predict with the data we have. Therefore, instead of settling for a random choice, we select the team with the higher probability of a win.

For the unsupervised grouping of teams, we use the columns marked “Individual” with a few changes: the rank of the team before the start of the World Cup, and the mean and std of goals scored and conceded. There are no head-to-head features in this scenario, as we need the performance of each team individually to group teams by their potential.

4.1.3. Exploratory Data Analysis

After extracting the features, we analyze our data as below.

Correlation Heatmap

The graph above shows the correlation of the features with each other and with the labels. The first three columns/rows are labels. The ‘h2h’ prefix means head to head, i.e. data from matches in which the two teams faced each other. As visible, the features are largely uncorrelated with each other and have some correlation with the ‘Home Team Win’ and ‘Away Team Win’ labels. ‘Draw’, in contrast, is not well correlated with any feature.

Distribution of Match Results

We particularly want to visualize the distribution of goals scored/conceded for both individual teams and head to head (h2h) on the outcome of the match under consideration, which are given below. ‘Win’ here means win for the home team.

Goals Conceded - Home vs. Away Mean Goals Scored - Home vs. Away

h2h Mean Goals - Home vs Away

As we can see, these features are linearly separable at the tail ends on each side and have an overlap region in between, where the outcome cannot be determined by these features alone. This is where the ranking distributions and other features help. In head to head, the behavior is similar.

4.2. Match Outcome Prediction - Supervised Classification

Using the features extracted as described in Section 4.1.2, we train binary classifiers with various algorithms to predict the probability that the team labeled as home wins the match. We start with Logistic Regression [8], Support Vector Machines [9], Decision Tree [10], Gaussian Naive Bayes [19] and K-Nearest Neighbours [18], which are simple, efficient and interpretable, and then move to ensemble classifiers: Random Forest [11], Gradient Boost [12] and Adaptive Boosting [20] with Logistic Regression as the base classifier. We also create an ‘Ensemble Classifier’ that combines the trained models, makes its prediction by majority voting, and returns the average of the constituent models’ predicted probabilities as the class probability; we compare its performance with the others. In addition, we experiment with forward feature selection [13] to select the best features from the initial feature set, and with Principal Component Analysis [14] to reduce the dimensionality of the features. We tune all these methods by defining a search space and using Randomized Search followed by Grid Search with k-fold cross validation.

Because World Cups happen only once every four years, within a span of about two months, the amount of data is a concern, so we generate artificial pairings of teams. To do this, we take a date \(D\) and the set of teams \(T_D\) playing a match on that day. Then, for each team \(T_D^i\) in this set, if the team has played against a set of teams \(T_R\) in the past, we generate a match between \(T_D^i\) and each member of the set \(T_R - T_D\). We then select a random subset of \(N_A\) of these matches and follow a semi-supervised learning [15] approach, training the classifier on the labeled real matches and these unlabeled artificial matches. We iteratively predict the outcome of the unlabeled points and add a point to the training set, with its predicted label, if the confidence of the prediction exceeds 0.75.
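The iterative pseudo-labeling step can be sketched as a generic self-training loop. The version below is a simplified illustration under stated assumptions: `fit` and `predict_proba` are placeholder callables standing in for any classifier, and the one-dimensional midpoint model exists only to make the example runnable. Only the 0.75 confidence threshold comes from the text.

```python
import math


def self_train(fit, predict_proba, X, y, unlabeled, threshold=0.75, max_rounds=10):
    """Add confidently predicted unlabeled points to the training set and refit."""
    X, y, pool = list(X), list(y), list(unlabeled)
    model = fit(X, y)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            p = predict_proba(model, x)            # P(label == 1)
            if max(p, 1 - p) > threshold:
                confident.append((x, 1 if p >= 0.5 else 0))
            else:
                remaining.append(x)
        if not confident:                          # nothing new to learn from
            break
        for x, pseudo_label in confident:
            X.append(x)
            y.append(pseudo_label)
        pool = remaining
        model = fit(X, y)
    return model, X, y


# Hypothetical stand-in classifier on a single numeric feature.
def fit_midpoint(X, y):
    mean_1 = sum(x for x, t in zip(X, y) if t == 1) / y.count(1)
    mean_0 = sum(x for x, t in zip(X, y) if t == 0) / y.count(0)
    return (mean_0 + mean_1) / 2


def proba_midpoint(midpoint, x):
    return 1 / (1 + math.exp(-(x - midpoint)))


_, X_final, y_final = self_train(
    fit_midpoint, proba_midpoint, [-2.0, 2.0], [0, 1], [3.0, -3.0, 0.1]
)
```

In this toy run the two points far from the decision boundary are pseudo-labeled and added, while the point near the boundary never exceeds the threshold and stays out of the training set.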

4.2.1. Model Training

To choose the feature windows, we tuned \(n_{ind}\) and \(n_{h2h}\) heuristically: we swept each value from 5 to 20 in steps of 5, trained the models, compared the accuracy, and took the best combination, \(n_{ind} = 15\) and \(n_{h2h} = 15\).

To train the model, we started with splitting our dataset into 80% Training Data and 20% Test Data.

Each learning algorithm we employ has a fixed set of hyperparameters (for example, the penalty and C for Logistic Regression, or the number of trees, tree depth and sampling rate for Random Forest). To tune them, we defined a search space and employed two types of search, Randomized Search and Grid Search. Since this is a multivariate optimization problem, randomly sampling the parameters helps us narrow down the search space. We began with a Randomized Search to get into the vicinity of good hyperparameters. Then, we conducted a Grid Search in the proximity of the best solution found by Randomized Search to fine-tune the hyperparameters. However, Grid Search did not yield significantly different results from Randomized Search. Owing to the computational cost of Grid Search, we chose to run only Randomized Search. Both searches used K-Fold Cross Validation with K = 5.
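The two-stage search can be illustrated with a small stdlib-only sketch for a single log-scaled hyperparameter such as C. The helper names, the half-decade refinement window and the toy scoring function are all assumptions; in practice `cv_score` would be the mean validation accuracy over the K folds.

```python
import math
import random


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def random_then_grid(cv_score, log_low, log_high, n_random=20, n_grid=11, seed=0):
    """Randomized search over 10**[log_low, log_high], then a local grid refinement."""
    rng = random.Random(seed)
    sampled = [10 ** rng.uniform(log_low, log_high) for _ in range(n_random)]
    best_random = max(sampled, key=cv_score)
    # Assumed refinement window: half a decade on either side of the random optimum.
    grid = [best_random * 10 ** (-0.5 + i / (n_grid - 1)) for i in range(n_grid)]
    best_refined = max(grid + [best_random], key=cv_score)
    return best_random, best_refined


def toy_cv_score(c):
    """Hypothetical stand-in for cross-validated accuracy, peaking at C = 0.01."""
    return -abs(math.log10(c) + 2.0)


folds = list(k_fold_indices(11, k=5))
best_random, best_refined = random_then_grid(toy_cv_score, -4, 1)
```

By construction the refined value can never score worse than the randomized optimum; whether it is meaningfully better depends on how flat the score is near that optimum.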

4.3. Grouping of Participating Teams - Unsupervised Clustering

The group assignment of the FIFA World Cup is done in a ceremony via random draws without replacement from 4 pots of teams [16]. We explore whether the tournament groups can instead be decided by clustering teams of similar strength and then assigning the members of each cluster to different groups, so that the overall tournament is balanced. The FIFA pots roughly emulate this by grouping teams in order of their rankings; however, the group assignment is still done randomly. Instead, we use additional features to determine the clusters. Our guiding assumption for this exercise is that similar teams should not be kept in the same group, so that most of the stronger teams can qualify and the tournament is more competitive in the later stages.

Given 32 teams and 8 groups, and assuming the group labels do not matter, the total number of ways to form 8 groups of 4 teams each is \(\frac{32!}{(4!)^{8}\,8!} \approx 6 \times 10^{19}\). The heuristics stated above let us explore a far smaller subset of this search space. The notion of team strength is extracted and evaluated via the unsupervised learning techniques detailed below.
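The count can be checked directly: dividing \(32!\) by \((4!)^8\) removes the ordering of teams within each group, and dividing by \(8!\) removes the ordering of the groups themselves. A short check, with an illustrative function name:

```python
from math import factorial


def unlabeled_groupings(n_teams=32, group_size=4):
    """Number of ways to split n_teams into unordered groups of equal size."""
    n_groups = n_teams // group_size
    within_group_orderings = factorial(group_size) ** n_groups
    group_orderings = factorial(n_groups)
    return factorial(n_teams) // (within_group_orderings * group_orderings)


total = unlabeled_groupings()          # 32 teams, 8 groups of 4
small = unlabeled_groupings(4, 2)      # sanity check: {ab|cd}, {ac|bd}, {ad|bc}
```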

Since clustering places similar teams together, and we need 8 groups of 4 teams each, we create four clusters of eight teams. The teams of each cluster are then distributed across the eight groups, one per group. How we do this differs for each algorithm and is described below.

4.3.1. Clustering Techniques Used

We employed Constrained K-Means, a variation of K-Means [21], and the Gaussian Mixture Model (GMM) [22] to organize the 32 participating teams into eight distinct groups for the group stage of the tournament. As stated earlier, we used the rank of the team before the start of the World Cup and the mean and std of goals scored and conceded as features.

Constrained K-Means is a variation of the K-Means clustering algorithm in which the cluster size can be controlled [17]. For our task, we set both the minimum and the maximum cluster size to 8, which automatically gives us 8 teams per cluster. We assign the 8 teams of one cluster to 8 different groups, where the strongest team in that cluster is the one closest to the cluster center. This is done for each cluster, yielding groups whose teams are of different skill levels, which achieves our objective.
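One way to turn four clusters of eight teams into eight groups of four is sketched below. The text does not spell out the exact dealing order, so the choice here (order each cluster by distance to its center, then deal the teams out in that order) is an assumption, and both function names are hypothetical.

```python
import math


def order_by_center_distance(teams, features, center):
    """Sort a cluster's teams by Euclidean distance to the cluster center."""
    return sorted(teams, key=lambda team: math.dist(features[team], center))


def pots_to_groups(pots):
    """Deal each pot (cluster) across the groups, one team per group.

    pots: list of equally sized lists of team names.
    Returns len(pots[0]) groups, each holding one team from every pot.
    """
    groups = [[] for _ in pots[0]]
    for pot in pots:
        for group, team in zip(groups, pot):
            group.append(team)
    return groups


# Tiny placeholder example: 2 pots of 2 teams give 2 groups of 2.
toy_features = {"a1": (0.0, 0.0), "a2": (3.0, 4.0)}
toy_pot = order_by_center_distance(["a2", "a1"], toy_features, (0.0, 1.0))
toy_groups = pots_to_groups([toy_pot, ["b1", "b2"]])
```

With the real data, `pots` would hold four lists of eight teams and the result would be eight groups of four.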

In the case of GMM, we get a responsibility value for each data point in each cluster, and we used these values to satisfy our size constraint. For the first cluster, we kept exactly eight teams by selecting the teams with the eight highest responsibilities. Suppose 10 teams had their maximum probability in cluster 1: we assigned only the top eight of these to cluster 1, ensuring the desired number of groups, and moved the remaining two to the other clusters. This has a cascading effect, so for the next cluster we again take only the top eight of the remaining teams, and so on. Because there are only 32 teams, this procedure terminates and gives us four pots of 8 teams each. We then assign the teams of each pot to separate groups to achieve our objective.
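The greedy step can be sketched as follows. This is one plausible reading of the procedure, not the project's exact code: clusters are processed in index order, each one takes its highest-responsibility teams among those not yet assigned, and ties are broken alphabetically so the result is deterministic. The function name and input format are assumptions.

```python
def greedy_pots(responsibilities, pot_size=8):
    """Build fixed-size pots from GMM responsibilities.

    responsibilities: dict mapping team name -> list of per-cluster responsibilities.
    Returns one list of team names per cluster.
    """
    n_clusters = len(next(iter(responsibilities.values())))
    unassigned = set(responsibilities)
    pots = []
    for k in range(n_clusters):
        # Highest responsibility for cluster k first; name breaks ties.
        ranked = sorted(unassigned, key=lambda t: (-responsibilities[t][k], t))
        pot = ranked[:pot_size]
        pots.append(pot)
        unassigned -= set(pot)
    return pots


# Toy example: 4 placeholder teams, 2 clusters, pots of 2.
toy_resp = {
    "A": [0.9, 0.1],
    "B": [0.8, 0.2],
    "C": [0.7, 0.3],   # prefers cluster 0 but is bumped to cluster 1
    "D": [0.1, 0.9],
}
toy_pots = greedy_pots(toy_resp, pot_size=2)
```

Team "C" shows the cascading effect described above: it is most likely in cluster 0 but ends up in cluster 1 because cluster 0 is already full.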

We selected these algorithms because of our constraint of four clusters with eight teams per cluster. Each of the two offers a way to handle this constraint, as described above, which other clustering algorithms do not.

5. Experiments

5.1. Performance of Match Outcome Predictors

5.1.1. Base Models

We analyze the performance of the various classification schemes on our dataset using standard metrics for evaluating a supervised learning algorithm. We primarily want to optimize the accuracy of the prediction, but since we have a balanced dataset and no reason to prefer avoiding false positives over false negatives, we treat both kinds of error equally. Our final results are shown below:

Technique	Accuracy	Precision	Recall	F-1 score	ROC-AUC
Logistic Regression	73.49%	73.59%	73.49%	73.49%	0.81
SVM	73.38%	73.51%	73.38%	73.37%	0.81
Decision Tree	70.29%	70.46%	70.29%	70.26%	0.77
kNN	71.41%	71.51%	71.41%	71.40%	0.80
Gaussian Naive Bayes	72.54%	72.54%	72.54%	72.53%	0.80
Random Forest	71.69%	71.81%	71.69%	71.68%	0.80
Gradient Boosting	71.52%	72.06%	71.52%	71.41%	0.80
Adaptive Boost (base model - logistic regression)	73.16%	73.30%	73.16%	73.14%	0.81

From the above table, we can see that Logistic Regression outperforms all other models, with Support Vector Machines a close second. During hyperparameter tuning of the SVM, a linear kernel was chosen, which explains the similar results of Logistic Regression and SVM. The value of C chosen was 0.007 for SVM and 0.01 for Logistic Regression. The slight difference can be explained by the choices made by the random search.

5.1.2. Confusion Matrix

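Confusion matrices (one panel per model): Logistic Regression, SVM, Decision Tree, kNN, Gaussian Naive Bayes, Random Forest, Gradient Boosting, AdaBoost (Logistic Regression base).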

5.1.3. Learning Curve

Learning curves (one panel per model): Logistic Regression, SVM, Decision Tree, kNN, Gaussian Naive Bayes, Random Forest, Gradient Boosting, AdaBoost (Logistic Regression base).

5.1.4. ROC/AUC Curve

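ROC curves (one panel per model): Logistic Regression, SVM, Decision Tree, kNN, Gaussian Naive Bayes, Random Forest, Gradient Boosting, AdaBoost (Logistic Regression base).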

5.2. Ensemble Classifier for Match Outcome Prediction

We also built an ensemble from classifiers we had already trained. This required no additional training, as we simply loaded the trained models. Specifically, we took the Support Vector Machine, Decision Tree, Logistic Regression, KNN and Gaussian Naive Bayes classifiers, as they differ widely in nature and should cover a wide range of scenarios.

For prediction, we employed majority voting: the final prediction is the class predicted by the majority of the individual models, and the probability is the average across all models. The results are given below.
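The voting rule is simple enough to state as code. The sketch below assumes each constituent model contributes one probability that the home team wins and that a model votes for class 1 when that probability is at least 0.5; the function name is illustrative.

```python
def ensemble_predict(home_win_probabilities):
    """Combine per-model P(home win) values by majority vote and averaging.

    Returns (predicted_label, averaged_probability).
    """
    n_models = len(home_win_probabilities)
    votes_for_home = sum(1 for p in home_win_probabilities if p >= 0.5)
    label = 1 if 2 * votes_for_home > n_models else 0
    average = sum(home_win_probabilities) / n_models
    return label, average


# Five constituent models, three of which favor the home team.
label_a, prob_a = ensemble_predict([0.9, 0.6, 0.55, 0.4, 0.3])
# The vote and the average can disagree: two weak "away" votes, one strong "home".
label_b, prob_b = ensemble_predict([0.45, 0.45, 0.95])
```

With five constituent models there is never a tied vote. The second call illustrates that the voted label and the averaged probability are computed independently and need not agree.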

Metric	Value
Accuracy	73.94%
Precision	74.02%
Recall	73.94%
F-1 score	73.94%
ROC-AUC	0.81

Confusion matrix: Ensemble Classifier.

From the above, we can see only a slight improvement in performance. This suggests that the correct and incorrect predictions largely coincide across models, and thus that the accuracies we obtain are likely close to the maximum achievable with the available data and features.

5.3. Impact of Forward Feature Selection on Match Outcome Prediction

Forward feature selection is the iterative addition of features to the model, one at a time. The process starts with an empty set of features and gradually incorporates the most relevant ones according to some criterion, in our case the increase in model accuracy when a feature is added. After forward feature selection, we found that the accuracy of each model dropped by approximately 3-5%, so we did not move forward with this technique. A possible explanation is that individual features contribute little to the accuracy on their own and are reinforced by the other features of the dataset, which leads to better accuracy without forward feature selection.
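A minimal sketch of the greedy procedure is given below. The `score` callable is a placeholder for "train the model on this feature subset and return its validation accuracy"; the stopping rule (stop when the best candidate no longer improves the score) and the toy feature weights are assumptions made for the example.

```python
def forward_selection(features, score, min_gain=0.0):
    """Greedily add the feature that most improves score(selected_features)."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        candidate_score, candidate = max((score(selected + [f]), f) for f in remaining)
        if selected and candidate_score - best_score <= min_gain:
            break                      # no remaining feature helps enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical additive contributions, used only to make the example runnable.
toy_weights = {"rank_diff": 0.10, "goals_mean": 0.05, "noise": -0.01}


def toy_score(subset):
    return 0.5 + sum(toy_weights[f] for f in subset)


chosen, chosen_score = forward_selection(list(toy_weights), toy_score)
```

Because each step judges a feature only in the context of those already chosen, features that are useful only in combination can be missed, which is consistent with the explanation offered above.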

5.4. Impact of Principal Component Analysis on Match Outcome Prediction

To analyze the impact of dimensionality reduction, we perform PCA on our features, keep the first five and the first fifteen components, and train Logistic Regression and Random Forest on the reduced data. The accuracies are given below -

Method	n=5	n=15	raw features
Logistic Regression	71.75%	72.70%	73.49%
Random Forest	72.76%	72.53%	71.69%

Interestingly, the trends for the two algorithms are opposite. With Logistic Regression, more features/components yield higher accuracy, whereas with Random Forest the reverse holds. This probably suggests that Logistic Regression, which was already working well, starts to suffer when we reduce the dimensions because it gets less information, whereas Random Forest, which was probably over-fitting originally despite hyperparameter tuning, learns a better representation at lower dimensionality. However, even with n=5 (its best case), Random Forest does not outperform Logistic Regression on the raw features.

5.5. Impact of Semi-supervised Learning on Match Outcome Prediction

5.5.1. Motivation and Procedure

Because the World Cup takes place only once every four years, the limited number of data points poses a challenge for traditional supervised learning. To address this, we explore semi-supervised learning, which uses the available labeled data together with information from unlabeled data, with the aim of making the predictive model more robust.

We implement semi-supervised learning [15] as described in section 4.2 on Logistic Regression and Random Forest and the results are below.

5.5.2. Semi-supervised vs Supervised Learning

5.5.2.1. Model Performance

We analyze the performance of the various classification schemes on our dataset as shown below:

Technique	Accuracy	Precision	Recall	F-1 score	ROC-AUC
Logistic Regression (Supervised)	73.49%	73.59%	73.49%	73.49%	0.81
Logistic Regression (Semi Supervised)	71.46%	71.92%	71.47%	71.37%	0.79
Random Forest (Supervised)	71.69%	71.81%	71.69%	71.68%	0.80
Random Forest (Semi Supervised)	71.65%	71.77%	71.55%	71.6%	0.77

As we can see, semi-supervised learning does not improve on the supervised results: accuracy is slightly lower for Logistic Regression and essentially unchanged for Random Forest. A possible reason is that these hypothetical matches are drawn from a similar data representation and do not add much variety to the dataset. Thus, increasing the number of data points without adding much new information probably affects the model slightly negatively. The confusion matrix and ROC curve for Logistic Regression, supervised vs. semi-supervised, are given below.

5.5.2.2. Confusion Matrix

Confusion matrices: Logistic Regression (supervised), Logistic Regression (semi-supervised).

5.5.2.3. ROC/AUC Curve

ROC curves: Logistic Regression (supervised), Logistic Regression (semi-supervised).

5.6. Tournament Simulation of 2022 World Cup using Match Outcome Predictors

5.6.1. Tournament Schedule

We are following the official FIFA World Cup match scheduling strategy. For this simulation, we have used the official FIFA World Cup 2022 Groups. The groups are as follows:

WC Groups

  Group A= ['Qatar', 'Ecuador', 'Senegal', 'Netherlands']
  Group B= ['England', 'Iran', 'USA', 'Wales']
  Group C= ['Argentina', 'Saudi Arabia', 'Mexico', 'Poland']
  Group D= ['France', 'Australia', 'Denmark', 'Tunisia']
  Group E= ['Spain', 'Costa Rica', 'Germany', 'Japan']
  Group F= ['Belgium', 'Canada', 'Morocco', 'Croatia']
  Group G= ['Brazil', 'Serbia', 'Switzerland', 'Cameroon']
  Group H= ['Portugal', 'Ghana', 'Uruguay', 'South Korea']

Total matches= 64

A. Group Stage- 8 groups of 4 teams each

 Each team plays 3 matches with the other teams in the group
 Total matches per group= 6 (4C2)
 Total matches= 48

B. Knockout Stages- Played after Group Stages

1. Round of 16- 8 Matches
 First-place Group A vs. Second-place Group B- W1
 First-place Group B vs. Second-place Group A- W2
 First-place Group C vs. Second-place Group D- W3
 First-place Group D vs. Second-place Group C- W4
 First-place Group E vs. Second-place Group F- W5
 First-place Group F vs. Second-place Group E- W6
 First-place Group G vs. Second-place Group H- W7
 First-place Group H vs. Second-place Group G- W8

2. Quarter Finals- 4 Matches

 W1 vs W2- QF_W1
 W3 vs W4- QF_W2
 W5 vs W6- QF_W3
 W7 vs W8- QF_W4

3. Semi Finals- 2 Matches

 QF_W1 vs QF_W2- SF_W1
 QF_W3 vs QF_W4- SF_W2

4. Play-offs/ Third Place- 1 Match (2C2)

 Semi Final Losers

5. Final- 1 Match

 SF_W1 vs SF_W2
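The knockout bracket listed above can be simulated with a short loop. In this sketch, `predict_first_wins(a, b)` is a hypothetical callable returning the probability that the first-listed team beats the second; treating the first-listed team as "home" is an assumption, and the third-place play-off is omitted for brevity.

```python
def play_knockout(group_winners, group_runners_up, predict_first_wins):
    """Return the champion given the top two teams of groups A..H (in order)."""
    def winner(team_a, team_b):
        return team_a if predict_first_wins(team_a, team_b) >= 0.5 else team_b

    # Round of 16: 1A v 2B, 1B v 2A, 1C v 2D, 1D v 2C, ...
    round_of_16 = []
    for i in range(0, len(group_winners), 2):
        round_of_16.append((group_winners[i], group_runners_up[i + 1]))
        round_of_16.append((group_winners[i + 1], group_runners_up[i]))
    teams = [winner(a, b) for a, b in round_of_16]          # W1..W8

    # Quarter finals, semi finals and final: adjacent winners meet.
    while len(teams) > 1:
        teams = [winner(teams[i], teams[i + 1]) for i in range(0, len(teams), 2)]
    return teams[0]


# Placeholder teams with made-up strengths; the stronger team always wins.
toy_winners = [f"w{i}" for i in range(8)]
toy_runners_up = [f"r{i}" for i in range(8)]
toy_strength = {**{f"w{i}": 10 + i for i in range(8)}, **{f"r{i}": i for i in range(8)}}


def toy_predictor(team_a, team_b):
    return 1.0 if toy_strength[team_a] > toy_strength[team_b] else 0.0


champion = play_knockout(toy_winners, toy_runners_up, toy_predictor)
```

In the real pipeline the predictor would be one of the trained classifiers applied to the features of the two teams.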

Simulation

We have analysed the results using 5 different models:

Simulated brackets (one per model): Decision Tree, Gradient Boost, Logistic Regression, Random Forest, Support Vector Machine.

From these simulations, we can see that the models behave similarly, as the close performance scores also indicate. In reality, Brazil did not win because it lost to Croatia in a penalty shootout on the day; statistically, however, Brazil was the stronger team based on the features we use and was therefore more likely to win. Argentina, the eventual winners, reach the final here as well. We therefore conclude that our tournament predictor gives plausible results, given the accuracy of the underlying match predictors.

5.7. Evaluation of Clustering Techniques to Group the Teams

As mentioned before, we used only internal measures to evaluate clustering. The objective was to cluster the teams so that we obtain groups in which stronger teams are not placed together, ensuring that the later part of the tournament is more competitive. To achieve this, we want the clusters to be as compact as possible so that their members can be moved to different groups. We therefore evaluate our clustering algorithms using the Silhouette Score, the Davies-Bouldin Index and the Beta-CV measure. The results are given below -

Algorithm	Silhouette Score	Davies-Bouldin Index	Beta-CV Measure
Constrained K-Means	0.4955	0.5128	0.2495
GMM	0.4523	0.5851	0.2463

The Constrained K-Means algorithm shows a higher Silhouette Score (0.4955) than GMM (0.4523), indicating better cluster cohesion and separation. Similarly, it has a lower Davies-Bouldin Index (0.5128) than GMM (0.5851), suggesting more compact and well-separated clusters. However, the Beta-CV Measure is marginally better for GMM (0.2463) than for Constrained K-Means (0.2495), implying slightly tighter clustering with GMM. Overall, Constrained K-Means seems to perform better in terms of cluster separation and cohesion, while GMM has a slight edge in cluster compactness. A possible reason is that Constrained K-Means is designed to handle size-constrained clustering for a problem like ours, whereas for GMM we use a naive greedy approach to enforce the cluster sizes.
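As a reminder of what the Silhouette Score measures, here is a small self-contained implementation. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and averages (b - a) / max(a, b) over all points. It assumes Euclidean distance and at least two clusters; the toy points are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        distances = {}                       # cluster label -> distances from p
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                distances.setdefault(other, []).append(math.dist(p, q))
        if label not in distances:           # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(distances[label]) / len(distances[label])
        b = min(sum(d) / len(d) for c, d in distances.items() if c != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])   # compact, well separated
bad = silhouette_score(toy_points, [0, 1, 0, 1])    # clusters straddle the gap
```

Values close to 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.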

The visualizations related to final clusters formed are given below for Constrained K-Means clustering:

Constrained K-Means cluster plots: overall cluster visual; Mean Goal Scored vs. Mean Goal Conceded; Rank vs. Mean Goal Conceded; Rank vs. Mean Goal Scored; Rank vs. Win Count; Mean Goal Scored vs. Win Count.

From the above images, we can see that the rank of the team separates the data the most, while the mean goals scored and mean goals conceded are also useful features for forming clusters.

The visualizations related to final clusters formed are given below for GMM:

GMM cluster plots: overall cluster visual; Mean Goal Scored vs. Mean Goal Conceded; Rank vs. Mean Goal Conceded; Rank vs. Mean Goal Scored; Rank vs. Win Count; Mean Goal Scored vs. Win Count.

5.8. End-to-end Tournament Simulation with Grouping

We used the unsupervised machine learning models, Constrained K-Means and GMM, to create the 8 groups that compete in the group stage of the World Cup.

For this, we generated 4 clusters of 8 teams each, then picked one team from each cluster and placed it in a group. The final results are displayed in the following sections.

5.8.1. Grouping of Teams using Constrained K-Means

The following 8 groups were generated using Constrained K-Means:

Group	Team 1	Team 2	Team 3	Team 4
A	Argentina	Switzerland	Tunisia	Wales
B	Netherlands	Senegal	Serbia	Cameroon
C	Croatia	Spain	Denmark	Saudi Arabia
D	France	Germany	Poland	Canada
E	Portugal	USA	Iran	Costa Rica
F	Belgium	Uruguay	South Korea	Qatar
G	England	Morocco	Japan	Ecuador
H	Brazil	Mexico	Australia	Ghana

Using these groups, the following are the predictions for the single-elimination phase using Logistic Regression and the Ensemble Classifier, two of our best performing models:

Prediction using Logistic Regression: Logistic Regression model using KMeans Grouping

Prediction using Ensemble Model: Ensemble model using KMeans Grouping

5.8.2. Grouping of Teams using GMM

The following 8 groups were generated using GMM:

Group	Team 1	Team 2	Team 3	Team 4
A	Qatar	Netherlands	England	Poland
B	Ghana	Switzerland	Argentina	Australia
C	Ecuador	Germany	Portugal	Tunisia
D	Costa Rica	Morocco	Spain	Serbia
E	Saudi Arabia	Denmark	USA	Iran
F	Cameroon	Uruguay	France	Wales
G	Canada	Senegal	Brazil	Japan
H	South Korea	Mexico	Belgium	Croatia

Using these groups, the following are the predictions:

Prediction using Logistic Regression: Logistic Regression model using Gaussian Mixture Model Grouping

Prediction using Ensemble Model: Ensemble model using Gaussian Mixture Model Grouping

6. Scope for Improvement and Ideas for further exploration

Based on our understanding we believe the following areas could be explored further to improve our analysis and to also generate new ideas:

Ties: Given our dataset, we found that ties/draws are very difficult to predict with any reasonable accuracy. Ties could be predicted with a different approach, such as predicting the number of goals scored by each team rather than just the win/loss probability; with additional datasets, this could give a more informed and accurate prediction of ties. Also, when a tie occurs in the knockout stage, additional data might make it possible to predict the penalty shootout.
Team features using player data: For this project, we extracted the relevant team features from past matches’ data to approximate the various defensive/offensive attributes a team might have. The same could be established by combining the attributes of the relevant player(s). For example, with everything else remaining the same, a goalkeeper with a poor record may lower the defensive attributes of the team (this is not always desirable but is sometimes forced by injuries or other problems). Collating player data and using it to build team features is another area one could explore.

Most of the ideas and techniques employed in our project could be applied, with some modifications, to other team-based sports and their tournaments as well. It would therefore be an interesting exercise to do a comparative analysis of the relative performance of different ML algorithms on different team sports.

7. Project Timeline and Responsibilities

7.1. Contributions from Mid-term to Final Submission

Team Member	Responsibility
Ananya Sharma	KNN, Gaussian Naive Bayes, Results Evaluation and Analysis, Report, Presentation
Apoorva Sinha	Feature Selection, Dimensionality Reduction, Results Evaluation and Analysis, Report, Presentation
Neelabh Sinha	Ensemble Classifier, AdaBoost, Results Evaluation and Analysis, Report, Presentation
Snigdha Verma	GMM, Tournament Prediction Pipeline, Results Evaluation and Analysis, Report, Presentation
Yu- Chen Lin	Constrained K-Means, Generating Visualizations, Results Evaluation and Analysis, Report, Presentation

7.2. Project Gantt Chart

The gantt chart covering complete timeline and responsibility distribution can be found here.

8. References

[1] D. Delen, D. Cogdell, and N. Kasap, “A comparative analysis of data mining methods in predicting NCAA bowl outcomes,” International Journal of Forecasting, vol. 28, no. 2, pp. 543–552, 2012.
[2] T. Horvat and J. Job, “The use of machine learning in sport outcome prediction: A review,” WIREs Data Mining and Knowledge Discovery, vol. 10, no. 5, p. e1380, 2020.
[3] T. Horvat, J. Job, R. Logozar, and I. Livada, “A data-driven machine learning algorithm for predicting the outcomes of NBA games,” Symmetry, vol. 15, no. 4, 2023.
[4] D. Prasetio and D. Harlili, “Predicting football match results with logistic regression,” in 2016 International Conference On Advanced Informatics: Concepts, Theory And Application (ICAICTA), 2016, pp. 1–5.
[5] [Online]. Available: https://www.statista.com/chart/28766/global-reach-and-tv-viewership-of-the-fifa-world-cup
[6] E. Tiwari, P. Sardar, and S. Jain, “Football Match Result Prediction Using Neural Networks and Deep Learning,” in 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2020, pp. 229–231.
[7] M. J. Dixon and S. G. Coles, “Modelling Association Football Scores and Inefficiencies in the Football Betting Market,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 46, pp. 265–280, 1997. DOI: 10.1111/1467-9876.00065
[8] D. R. Cox, “The regression analysis of binary sequences (with discussion),” Journal of the Royal Statistical Society: Series B (Methodological), vol. 20, no. 2, pp. 215–242, 1958.
[9] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 144–152.
[10] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[11] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[12] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[13] G. H. John and P. Langley, “Feature selection for high-dimensional genomic microarray data,” Computer Methods and Programs in Biomedicine, vol. 56, no. 1, pp. 37–48, 1997. DOI: 10.1016/S0169-2607(98)00047-7
[14] K. Pearson, “On Lines and Planes of Closest Fit to Systems of Points in Space,” Philosophical Magazine Series 6, vol. 2, no. 11, pp. 559–572, 1901. DOI: 10.1080/14786440109462720
[15] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in Proceedings of the 33rd annual meeting on Association for Computational Linguistics (ACL ‘95), Association for Computational Linguistics, Stroudsburg, PA, USA, 1995, pp. 189–196.
[16] [Online]. Available: https://www.fifa.com/tournaments/mens/worldcup/qatar2022/news/qatar-2022-final-draw-all-you-need-to-know
[17] K. P. Bennett, P. S. Bradley, and A. Demiriz, “Constrained K-Means Clustering,” Microsoft Research, Tech. Rep. MSR-TR-2000-65, 2000.
[18] E. Fix and J. L. Hodges, “Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties,” USAF School of Aviation Medicine, Randolph Field, Texas, Report, 1951.
[19] M. E. Maron, “Automatic Indexing: An Experimental Inquiry,” Journal of the ACM, vol. 8, no. 3, pp. 404–417, 1961. DOI: 10.1145/321075.321084
[20] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[21] J. MacQueen, “Some Methods for classification and Analysis of Multivariate Observations,” in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, University of California Press, 1967, pp. 281–297.
[22] K. Pearson, “Contributions to the Mathematical Theory of Evolution,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 185, pp. 71–110, 1894. DOI: 10.1098/rsta.1894.0003. This paper is often cited as the first to explicitly address the decomposition problem in characterising non-normal attributes.
