
Analysing Research Question 1 Using Quantitative Methods: Evidence from Statistical Modelling




Megan Grande
Jan 6, 2026

Student vs Expert Agreement on Explanation Quality

1. Introduction

1.1 Research Question

This report addresses Research Question 1: are there differences between students' and experts' opinions of the quality of explanations? The analysis recreates features of Evans et al. (2022): it compares judgements of mathematical explanations between two groups of raters (students and experts) using comparative judgement data.

1.2 Background

Comparative judgement (CJ) is a process in which a group of judges makes a sequence of pairwise judgements between items, from which a reliable quality scale can be constructed without the need for rubrics. Examining how far students and experts share a common perception of explanation quality is relevant to educational assessment and feedback practice. A minimal toy example of the CJ data format and model is sketched below.
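
To make the mechanics concrete, the following minimal sketch (hypothetical items A, B and C, not data from this study) records a few pairwise judgements in the same winner/loser format used later in the analysis and fits a Plackett-Luce model to recover a quality scale.

library(PlackettLuce)

# Hypothetical pairwise judgements: each row is one comparison,
# with the preferred item in `winner` and the other in `loser`.
toy <- data.frame(winner = c("A", "A", "B", "A"),
                  loser  = c("B", "C", "C", "C"))

# Convert winner/loser pairs into rankings and estimate item worths;
# items chosen more often end up higher on the estimated quality scale.
toy_rankings <- as.rankings(toy[, c("winner", "loser")], input = "orderings")
toy_model <- PlackettLuce(toy_rankings)
coef(toy_model, log = FALSE)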

1.3 Data Sources

Student judgements: workshop-groupA-judgements.csv and workshop-groupB-judgements.csv

Expert judgements: expert-judgements-Partial/decisions-clean.csv

Analysis focuses on explanations judged by both groups (n = 17 common explanations; see Table 1)

2. Methods

2.1 Data Preparation

The student judgement data are read from two CSV files, one per workshop group, and combined into a single dataset. A new column (judge_type) is added to distinguish student raters from expert raters; the expert judgements are imported and labelled in the same way. For consistency, the winner and loser variables, which record the outcome of each pairwise judgement, are converted to character type. This prevents mismatches caused by mixed data formats and allows the explanations judged by both students and experts to be identified by intersecting the unique explanation identifiers in the two datasets.

# Packages used throughout the analysis
library(dplyr)         # bind_rows(), mutate(), filter(), %>%
library(PlackettLuce)  # as.rankings(), PlackettLuce()
library(ggplot2)       # scatterplot of quality scores
library(knitr)         # kable()
library(kableExtra)    # kable_styling()

# Load and combine student judgements
groupA_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupA-judgements.csv")
groupB_judgements <- read.csv("C:/Users/f/Desktop/data/workshop-groupB-judgements.csv")
student_judgements <- bind_rows(groupA_judgements, groupB_judgements) %>%
  mutate(judge_type = "student")

# Load expert judgements
expert_judgements <- read.csv("C:/Users/f/Desktop/data/expert-judgements-PARTIAL/decisions-clean.csv") %>%
  mutate(judge_type = "expert")

# Convert to character and find common explanations
student_judgements <- student_judgements %>%
  mutate(winner = as.character(winner), loser = as.character(loser))
expert_judgements <- expert_judgements %>%
  mutate(winner = as.character(winner), loser = as.character(loser))

common_explanations <- intersect(
  unique(c(student_judgements$winner, student_judgements$loser)),
  unique(c(expert_judgements$winner, expert_judgements$loser))
)

2.2 Analytical Approach

In this section, the judgement data are modelled using the Plackett-Luce framework. The data are first filtered to retain only pairwise comparisons involving the common explanations identified above, so that the student and expert analyses are based on the same set of items and are directly comparable.

The filtered data are then converted into ranking objects with as.rankings(), which turns the winner-loser format into a ranking structure suitable for probabilistic modelling. Two separate Plackett-Luce models are fitted to these ranking objects, one for the student data and one for the expert data. In each model, the worth parameter estimated for an explanation measures its latent quality, or perceived merit, based on how often it is chosen over other explanations across the pairwise comparisons.

The coefficients of the fitted models are extracted as quality scores for each explanation. Compared across groups, these scores provide quantitative evidence of the extent to which students and experts agree in their judgements of explanation quality. The script combines the extracted coefficients into a single data frame holding the student quality score, the expert quality score, and their raw and absolute differences.

Finally, the relationship between the student and expert rankings is evaluated using the Spearman rank correlation coefficient, a non-parametric measure of the monotonic association between two sets of rankings. The associated significance test (cor.test) indicates whether the observed correlation is statistically significant, providing empirical grounds for concluding that the groups agree or disagree.

Data filtering: Retained only explanations judged by both students and experts

Model fitting: Separate Plackett-Luce models for student and expert judgements

Quality estimation: Derived worth estimates representing perceived quality

Comparison: Spearman correlation between student and expert quality rankings

Disagreement analysis: Identified explanations with largest rating differences

# Filter to common explanations and prepare for PlackettLuce
student_common <- student_judgements %>%
  filter(winner %in% common_explanations & loser %in% common_explanations) %>%
  select(judge, winner, loser)

expert_common <- expert_judgements %>%
  filter(winner %in% common_explanations & loser %in% common_explanations) %>%
  select(judge, winner, loser)

# Fit PlackettLuce models
student_rankings <- as.rankings(student_common[, c("winner", "loser")], 
                               input = "orderings", id = student_common$judge)
expert_rankings <- as.rankings(expert_common[, c("winner", "loser")], 
                              input = "orderings", id = expert_common$judge)

model_students <- PlackettLuce(student_rankings)
model_experts <- PlackettLuce(expert_rankings)

# Extract quality scores
student_qualities <- coef(model_students, log = FALSE)
expert_qualities <- coef(model_experts, log = FALSE)

# Create comparison dataframe
common_items <- intersect(names(student_qualities), names(expert_qualities))
comparison_df <- data.frame(
  explanation_id = common_items,
  student_quality = student_qualities[common_items],
  expert_quality = expert_qualities[common_items]
) %>% mutate(
  quality_difference = student_quality - expert_quality,
  abs_difference = abs(quality_difference)
)

# Calculate correlation
correlation <- cor(comparison_df$student_quality, comparison_df$expert_quality, 
                   method = "spearman", use = "complete.obs")
cor_test <- cor.test(comparison_df$student_quality, comparison_df$expert_quality, 
                     method = "spearman")

3. Results

3.1 Data Overview

This section reports descriptive statistics for the datasets under analysis: the number of common explanations considered, the total number of judgements made by each group, and the number of unique judges per sample. These statistics are summarized with kableExtra in a formatted table, which communicates the extent of the data and the structure of the samples.

Table 1: Dataset Summary

total_student_judgements <- nrow(student_common)
total_expert_judgements <- nrow(expert_common)
unique_student_judges <- length(unique(student_common$judge))
unique_expert_judges <- length(unique(expert_common$judge))

data_summary <- data.frame(
  Metric = c("Common explanations", "Student judgements", "Expert judgements", 
             "Student judges", "Expert judges"),
  Count = c(length(common_explanations), total_student_judgements, 
            total_expert_judgements, unique_student_judges, unique_expert_judges)
)

kable(data_summary, format = "html", booktabs = TRUE, caption = "Dataset Summary") %>%
  kable_styling(latex_options = "hold_position")

Dataset Summary

Metric                Count
Common explanations      17
Student judgements       83
Expert judgements        41
Student judges           16
Expert judges             5

3.2 Quality Rankings Comparison

This subsection compares the highest-rated explanations according to each group. The script identifies the ten explanations with the highest quality scores from the student and expert Plackett-Luce models and tabulates them by rank alongside their scores. This allows a direct comparison of what each group judged to be of the highest quality and gives an initial indication of similar or diverging evaluative patterns.

Table 2: Top 10 Explanations by Student and Expert Ratings

top_student <- head(sort(student_qualities, decreasing = TRUE), 10)
top_expert <- head(sort(expert_qualities, decreasing = TRUE), 10)

top_comparison <- data.frame(
  Rank = 1:10,
  Student_Top = names(top_student),
  Student_Score = round(top_student, 4),
  Expert_Top = names(top_expert),
  Expert_Score = round(top_expert, 4)
)

kable(top_comparison, format = "html", booktabs = TRUE, 
      caption = "Top 10 Explanations by Group") %>%
  kable_styling(latex_options = "hold_position")

Top 10 Explanations by Group

Rank  Student_Top  Student_Score  Expert_Top  Expert_Score
   1          402         0.2430         410        0.3030
   2          415         0.2353         416        0.0869
   3          401         0.2053         406        0.0811
   4          405         0.0587         418        0.0772
   5          399         0.0575         415        0.0745
   6          426         0.0489         419        0.0741
   7          410         0.0454         402        0.0674
   8          414         0.0436         401        0.0536
   9          423         0.0148         405        0.0535
  10          406         0.0148         426        0.0279

3.3 Agreement Analysis

This section displays the correspondence between student and expert quality assessments in a scatterplot created with ggplot2. Each point represents an explanation, positioned by its student-assigned quality on the x-axis and its expert-assigned quality on the y-axis. A regression line (with confidence interval) shows the general trend of agreement, and a dashed diagonal line (y = x) marks where the two sets of scores would be perfectly congruent.

The correlation between student and expert quality ratings was 0.123 (Spearman's ρ, p = 0.639).

ggplot(comparison_df, aes(x = student_quality, y = expert_quality)) +
  geom_point(size = 3, alpha = 0.7, color = "#2E86AB") +
  geom_smooth(method = "lm", se = TRUE, color = "#A23B72", fill = "#F18F01") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "#333333") +
  labs(x = "Student-assigned Quality Score", 
       y = "Expert-assigned Quality Score") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

3.4 Key Disagreements

The final analysis step identifies the explanations on which student and expert judgements differ most. The script ranks explanations by the absolute difference between their quality scores and isolates the ten cases with the greatest divergence. The results are tabulated with each explanation's student score, expert score, raw difference, and absolute difference.

This step is particularly useful for diagnostic interpretation because it pinpoints the explanations that account for a disproportionate share of the disagreement.

Table 3: Top 10 Explanations with Largest Student-Expert Disagreement

top_disagreements <- comparison_df %>%
  arrange(desc(abs_difference)) %>%
  head(10) %>%
  mutate(across(c(student_quality, expert_quality, quality_difference, abs_difference), 
                ~round(., 4)))

kable(top_disagreements, format = "html", booktabs = TRUE,
      caption = "Explanations with Largest Rating Differences") %>%
  kable_styling(latex_options = "hold_position")

Explanations with Largest Rating Differences

explanation_id  student_quality  expert_quality  quality_difference  abs_difference
           410           0.0454          0.3030             -0.2577          0.2577
           402           0.2430          0.0674              0.1756          0.1756
           415           0.2353          0.0745              0.1607          0.1607
           401           0.2053          0.0536              0.1517          0.1517
           416           0.0023          0.0869             -0.0846          0.0846
           418           0.0030          0.0772             -0.0743          0.0743
           419           0.0057          0.0741             -0.0684          0.0684
           406           0.0148          0.0811             -0.0663          0.0663
           399           0.0575          0.0032              0.0543          0.0543
           414           0.0436          0.0203              0.0233          0.0233

4. Discussion

4.1 Level of Agreement

The observed correlation of 0.123 (Spearman's ρ) suggests weak agreement between student and expert judgments of explanation quality. This finding is not statistically significant (p = 0.639), indicating that there may be fundamental differences in how students and experts evaluate explanations.

4.2 Patterns in Disagreements

The largest disagreements occurred for explanations where (see the sketch after the lists below):

Students rated explanations much higher than experts (positive differences)

Experts rated explanations much higher than students (negative differences)

These discrepancies may reflect that the two groups weight explanation features differently:

Students may prioritize clarity, simplicity, or familiarity

Experts may value mathematical rigor, completeness, or technical accuracy
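
As a quick diagnostic, the disagreements in Table 3 can be split by the sign of the quality difference. The following minimal sketch builds on the top_disagreements object defined above; the variable names introduced here are illustrative only.

# Split the largest disagreements by direction:
# positive difference = students rated the explanation higher,
# negative difference = experts rated it higher.
student_higher <- top_disagreements %>%
  filter(quality_difference > 0) %>%
  arrange(desc(quality_difference))

expert_higher <- top_disagreements %>%
  filter(quality_difference < 0) %>%
  arrange(quality_difference)

student_higher$explanation_id  # 402, 415, 401, 399, 414 in Table 3
expert_higher$explanation_id   # 410, 416, 418, 419, 406 in Table 3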

4.3 Comparison with Evans et al. (2022)

Our findings contrast with Evans et al.'s observation that students and experts generally agree on explanation quality. The weak correlation suggests different perspectives on what constitutes a high-quality mathematical explanation.

5. Limitations

Partial overlap: Only 17 explanations were judged by both groups, limiting comparability

Context differences: Student and expert judgements were collected in different contexts

Sample size: The small number of common items may affect the reliability of the correlation (see the bootstrap sketch after this list)

Model assumptions: Plackett-Luce assumes transitive preferences, which may not always hold
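
One way to gauge how much the small number of common items affects the correlation estimate is a nonparametric bootstrap of Spearman's ρ. The sketch below uses only base R and the comparison_df object defined earlier; the number of replicates and the seed are arbitrary choices.

# Bootstrap the Spearman correlation over explanations to gauge
# how unstable the estimate is with only 17 common items.
set.seed(123)                      # arbitrary seed for reproducibility
n_boot <- 2000                     # arbitrary number of replicates
boot_rho <- replicate(n_boot, {
  idx <- sample(nrow(comparison_df), replace = TRUE)
  cor(comparison_df$student_quality[idx],
      comparison_df$expert_quality[idx],
      method = "spearman")
})
quantile(boot_rho, c(0.025, 0.975))  # approximate 95% percentile interval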

6. Conclusion

This analysis reveals limited agreement between student and expert judgments of mathematical explanation quality. The correlation of 0.123 suggests that students and experts employ different criteria when evaluating explanations.

Educational implications:

Caution is needed when using student peer assessment for explanation quality

Different perspectives highlight the need for explicit quality criteria

Future research should investigate the specific features that students and experts value in mathematical explanations to better understand the sources of both agreement and disagreement.
