This research focuses on the pronunciation of the word “что” in Russian dialects.
The data were taken from the Dialect Corpus of the Russian National Corpus.
In total, there were 99 observations.
The dataset contains the following columns: variant, region, gender, age, birth year, position (part of speech), yat before hard consonants, yat before soft consonants, type of vocalism, and realisation of /г/, /ц/ and /ч/.
We hypothesize that the pronunciation of “что” depends on all of the above-mentioned variables.
Sys.setlocale("LC_ALL", "ru_RU.UTF-8")
## [1] "ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/C"
library(tidyverse)
cho <- read.csv(url("http://goo.gl/m38tRA"), sep = ";")
summary(cho)
## variant region gender age birth_year position
## чё :50 Тамбов :41 ж:93 Min. :32.0 Min. :1870 conj:32
## шо : 5 Волгоград :36 м: 6 1st Qu.:71.5 1st Qu.:1918 part:11
## што:44 Тверь : 8 Median :76.0 Median :1924 pron:56
## Забайкалье : 5 Mean :75.7 Mean :1923
## Архангельск: 3 3rd Qu.:82.0 3rd Qu.:1931
## Самара : 3 Max. :96.0 Max. :1979
## (Other) : 3
## yat_hard yat_soft vocalism
## не [и] :97 не [и] :88 аканье сильное:94
## непоследовательно [и]: 2 непоследовательно [и]:11 оканье полное : 5
##
##
##
##
##
## g ts ch
## взрывной: 6 [ц'] мягк : 3 [ч'] мягкий :74
## щелевой :93 [ц] твёрд. :90 утрата затвора [ш]:25
## утрата затвора [с]: 6
##
##
##
##
## variant_num region_num gender_num yat_hard_num
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.000 Median :6.000 Median :1.0000 Median :0.0000
## Mean :1.939 Mean :4.768 Mean :0.9394 Mean :0.0202
## 3rd Qu.:3.000 3rd Qu.:7.000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :3.000 Max. :8.000 Max. :1.0000 Max. :1.0000
##
## yat_soft_num vocalism_num g_num ch_num
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.1111 Mean :0.05051 Mean :0.06061 Mean :0.2525
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.5000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
##
cho %>%
ggplot(aes(variant, fill = variant)) +
geom_bar()
This graph shows the distribution of the three possible pronunciation variants of “что” in the dialect corpus: 50 cases of “чё”, 44 cases of “што” and 5 cases of “шо”.
cho %>%
ggplot(aes(region, fill = variant)) +
geom_bar()
This graph shows which regions are represented in the dialect corpus and which pronunciation variants of “что” occur in each. Although the dialect variant “чё” is the most frequent in the corpus overall (50 cases), in 5 regions out of 8 the non-dialect variant “што” predominates.
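As a quick numeric check of this claim, the per-region shares of each variant can be tabulated with the same dplyr verbs (a sketch; output omitted):
# share of each variant within each region
cho %>%
  count(region, variant) %>%
  group_by(region) %>%
  mutate(share = n / sum(n)) %>%
  arrange(region, desc(share))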
cho %>%
ggplot(aes(age, variant, colour = gender)) +
geom_point(size = 2)
This graph shows the gender and age of the informants. About 94% of the informants (93 of 99) are female. The median age is 76; the minimum is 32 and the maximum is 96.
cho %>%
ggplot(aes(birth_year, variant, colour = gender)) +
geom_point(size = 1)
Here is the distribution of our informants by year of birth. The median year of birth is 1924; the earliest is 1870 and the latest is 1979.
cho %>%
ggplot(aes(position, fill = variant)) +
geom_bar()
We also explored which part of speech each “что” token belongs to. In the corpus there are 56 pronouns, 32 conjunctions and 11 particles. According to the graph, pronouns are mostly pronounced as “чё”, while conjunctions and particles are mostly pronounced as “што”.
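A cross-tabulation backs this reading of the graph (a sketch; the counts come straight from the data):
# counts of each variant by part of speech
with(cho, table(position, variant))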
cho %>%
ggplot(aes(variant, fill = yat_hard)) +
geom_bar()
cho %>%
ggplot(aes(variant, fill = yat_soft)) +
geom_bar()
cho %>%
ggplot(aes(variant, fill = g)) +
geom_bar()
cho %>%
ggplot(aes(variant, fill = ts)) +
geom_bar()
cho %>%
ggplot(aes(variant, fill = ch)) +
geom_bar()
Having explored the dataset, we found that, contrary to our initial expectation, the other dialect features differ little across the variants.
We therefore do not comment on each of these graphs individually; we only point out that the largest difference was found for the last feature, ch.
In order to predict one of the three possible pronunciation variants, we applied tree-based models (a conditional inference forest, cforest from the party package). First, it was necessary to estimate the importance of our variables.
library(party)
cho$region_num <- as.factor(cho$region_num)
set.seed(123)
train <- 1:68
train.cho <- cho[train,]
test.cho <- cho[-train,]
tr <- cforest(variant~region_num + gender + age + birth_year + position + yat_hard + yat_soft + vocalism + g + ts + ch, data = train.cho,
controls = cforest_unbiased(ntree=1000, mtry=3))
varimp(tr)
## region_num gender age birth_year position yat_hard
## 0.06236 0.00000 0.01136 -0.00080 0.07268 0.00000
## yat_soft vocalism g ts ch
## 0.00092 0.00000 0.00000 0.00012 0.00180
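Sorting the importances makes the ranking easier to read (a sketch; varimp() involves random permutations, so the values vary slightly between calls):
# variable importances in decreasing order
sort(varimp(tr), decreasing = TRUE)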
According to the results, we chose the following features for our model: region, age and position.
test_answers <- test.cho$variant
tr <- cforest(variant~region_num + age + position, data = train.cho,
controls = cforest_unbiased(ntree=1000, mtry=3))
predictions <- predict(tr, newdata=test.cho)
results <- test_answers == predictions
results
## [1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## [12] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
results_true <- results[results == TRUE]
accuracy <- length(results_true) / length(test_answers)
accuracy
## [1] 0.6129032
The accuracy of the model was 0.61.
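A confusion matrix gives a more detailed picture than raw accuracy (a sketch using the objects computed above):
# predicted vs. actual variants on the test set
table(predicted = predictions, actual = test_answers)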
Then we removed all the “шо” variants from the dataset in order to reduce the model’s choice to a binary one.
cho <- read.csv(url("http://goo.gl/m38tRA"), sep = ";")
cho <- cho[-c(34, 35, 36, 86, 87),]
train <- 1:65
train.cho <- cho[train,]
test.cho <- cho[-train,]
test_answers <- test.cho$variant
tr <- cforest(variant~region_num + age + position, data = train.cho,
controls = cforest_unbiased(ntree=1000, mtry=3))
predictions <- predict(tr, newdata=test.cho)
results <- test_answers == predictions
results
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE FALSE
results_true <- results[results == TRUE]
accuracy <- length(results_true) / length(test_answers)
accuracy
## [1] 0.7586207
The result was better. The accuracy of the model became 0.76.
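The hard-coded row indices above rely on a fixed row order; assuming those five rows are exactly the “шо” cases, the same filtering can be done by value (a sketch):
# remove the "шо" observations by value rather than by row index
cho <- cho[cho$variant != "шо", ]
cho$variant <- droplevels(cho$variant)  # drop the now-empty factor level (variant was read as a factor)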
Finally, we also decided to remove several outliers with respect to age.
cho <- read.csv(url("http://goo.gl/m38tRA"), sep = ";")
cho <- cho[-c(34, 35, 36, 49, 50, 63, 86, 87),]
train <- 1:62
train.cho <- cho[train,]
test.cho <- cho[-train,]
test_answers <- test.cho$variant
tr <- cforest(variant~region_num + age + position, data = train.cho,
controls = cforest_unbiased(ntree=1000, mtry=3))
predictions <- predict(tr, newdata=test.cho)
results <- test_answers == predictions
results
## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE
## [23] TRUE FALSE TRUE TRUE TRUE TRUE FALSE
results_true <- results[results == TRUE]
accuracy <- length(results_true) / length(test_answers)
accuracy
## [1] 0.7586207
The accuracy of the model did not change, so the best accuracy with the tree-based model was obtained after removing “шо” but without removing the outliers.
Apart from the tree-based models, we used kNN (k-nearest neighbours) for classification. Before applying the method, we normalized our data:
library(class)
cho <- read.csv(url("http://goo.gl/m38tRA"), sep = ";")
# keep only the columns used for kNN
myvars <- c("variant_num", "region_num", "gender_num", "age", "birth_year", "position", "yat_hard_num", "yat_soft_num", "vocalism_num", "g_num", "ts", "ch_num")
cho.subset <- cho[myvars]
cho_new <- cho.subset
cho$variant_num <- as.factor(cho$variant_num)
# standardize the numeric columns
cho_new$gender_num <- scale(cho_new$gender_num)
cho_new$age <- scale(cho_new$age)
cho_new$birth_year <- scale(cho_new$birth_year)
cho_new$yat_hard_num <- scale(cho_new$yat_hard_num)
cho_new$yat_soft_num <- scale(cho_new$yat_soft_num)
cho_new$vocalism_num <- scale(cho_new$vocalism_num)
cho_new$g_num <- scale(cho_new$g_num)
cho_new$ch_num <- scale(cho_new$ch_num)
# one-hot encode the multi-level factors via model.matrix, then standardize the dummies
cho_new$region_num <- as.factor(cho_new$region_num)
region <- model.matrix(~region_num + 0, data=cho_new)
region <- scale(region)
cho_new$position <- as.factor(cho_new$position)
position <- model.matrix(~position + 0, data=cho_new)
position <- scale(position)
cho_new$ts <- as.factor(cho_new$ts)
ts <- model.matrix(~ts + 0, data=cho_new)
ts <- scale(ts)
# drop the original factor columns and attach the scaled dummies
cho_new$region_num <- NULL
cho_new$position <- NULL
cho_new$ts <- NULL
cho_final <- cbind(cho_new, region)
cho_final <- cbind(cho_final, position)
cho_final <- cbind(cho_final, ts)
head(cho_final)
## variant_num gender_num age birth_year yat_hard_num yat_soft_num
## 1 1 0.2527142 1.46378506 -1.32445281 -0.1428646 -0.3517632
## 2 1 0.2527142 -0.06666744 0.24218566 -0.1428646 -0.3517632
## 3 1 0.2527142 -0.06666744 0.24218566 -0.1428646 -0.3517632
## 4 1 0.2527142 -0.06666744 0.24218566 -0.1428646 -0.3517632
## 5 1 0.2527142 0.50725225 -0.16650264 -0.1428646 -0.3517632
## 6 1 0.2527142 0.41159897 -0.03027321 -0.1428646 -0.3517632
## vocalism_num g_num ch_num region_num1 region_num2 region_num3
## 1 -0.229465 -0.2527142 -0.5782952 -0.1758816 1.3161775 -0.229465
## 2 -0.229465 3.9170696 -0.5782952 -0.1758816 -0.7521014 4.313943
## 3 -0.229465 3.9170696 -0.5782952 -0.1758816 -0.7521014 4.313943
## 4 -0.229465 3.9170696 -0.5782952 -0.1758816 -0.7521014 4.313943
## 5 -0.229465 -0.2527142 -0.5782952 -0.1758816 -0.7521014 -0.229465
## 6 -0.229465 -0.2527142 -0.5782952 -0.1758816 -0.7521014 -0.229465
## region_num4 region_num5 region_num6 region_num7 region_num8 positionconj
## 1 -0.1005038 -0.1758816 -0.1428646 -0.8365143 -0.2949985 -0.6875955
## 2 -0.1005038 -0.1758816 -0.1428646 -0.8365143 -0.2949985 -0.6875955
## 3 -0.1005038 -0.1758816 -0.1428646 -0.8365143 -0.2949985 -0.6875955
## 4 -0.1005038 -0.1758816 -0.1428646 -0.8365143 -0.2949985 1.4396531
## 5 -0.1005038 -0.1758816 -0.1428646 1.1833617 -0.2949985 -0.6875955
## 6 -0.1005038 -0.1758816 -0.1428646 1.1833617 -0.2949985 -0.6875955
## positionpart positionpron ts[ц'] мягк ts[ц] твёрд. tsутрата затвора [с]
## 1 -0.3517632 0.8718377 -0.1758816 0.3146266 -0.2527142
## 2 2.8141059 -1.1354166 -0.1758816 -3.1462660 3.9170696
## 3 -0.3517632 0.8718377 -0.1758816 -3.1462660 3.9170696
## 4 -0.3517632 -1.1354166 -0.1758816 -3.1462660 3.9170696
## 5 -0.3517632 0.8718377 -0.1758816 0.3146266 -0.2527142
## 6 -0.3517632 0.8718377 -0.1758816 0.3146266 -0.2527142
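As an aside, the eight scale() calls above could be collapsed into one step (a sketch with the same behaviour, shown for reference rather than re-run):
# standardize all numeric predictor columns at once
num_cols <- c("gender_num", "age", "birth_year", "yat_hard_num",
              "yat_soft_num", "vocalism_num", "g_num", "ch_num")
cho_new[num_cols] <- lapply(cho_new[num_cols], scale)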
set.seed(123)
We wanted to know not only the accuracy, but also which features are the most important for our model with KNN.
train <- 1:68
max <- 0
# Exhaustive search over subsets of the optional feature columns 2..9
# (gender, age, birth_year, yat_hard, yat_soft, vocalism, g, ch).
# Each i is treated as a bit mask: bit j says whether column j is
# dropped (0) or kept (1); the region, position and ts dummy columns
# (column 10 onwards) are always kept.
for (i in 1:511){
  copy_i <- i
  copy_cho <- cho_final
  comb <- vector()
  for (j in 9:2){
    k <- copy_i %% 2           # lowest remaining bit of the mask
    if (k == 0){
      copy_cho[,j] <- NULL     # drop this column
    } else{
      comb <- c(comb, j)       # remember this column was kept
    }
    copy_i <- copy_i %/% 2
  }
  train.cho <- copy_cho[train,]
  test.cho <- copy_cho[-train,]
  train.var <- copy_cho$variant_num[train]
  knn.2 <- knn(train.cho, test.cho, train.var, k=2)
  test_answers <- test.cho$variant_num
  results <- test_answers == knn.2
  results_true <- results[results == TRUE]
  accuracy <- length(results_true) / length(test_answers)
  if (accuracy > max){         # keep the best subset found so far
    max <- accuracy
    best_comb <- comb
  }
}
print (max)
## [1] 0.8709677
print (best_comb)
## [1] 8 6 3
The best accuracy was 0.87. The selected features were region, position and ts (these three were kept in every combination), plus age, yat_soft and g. We were not sure that ts was really an important feature, so we removed it and checked the accuracy again.
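Since best_comb stores column indices of cho_final, mapping them back to names makes the result readable (a sketch):
# column names corresponding to the best combination
names(cho_final)[best_comb]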
cho_final[, 21:23] <- NULL  # remove the ts dummy columns
train <- 1:68
max <- 0
# repeat the subset search, now without the ts dummies
for (i in 1:511){
  copy_i <- i
  copy_cho <- cho_final
  comb <- vector()
  for (j in 9:2){
    k <- copy_i %% 2
    if (k == 0){
      copy_cho[,j] <- NULL
    } else{
      comb <- c(comb, j)
    }
    copy_i <- copy_i %/% 2
  }
  train.cho <- copy_cho[train,]
  test.cho <- copy_cho[-train,]
  train.var <- copy_cho$variant_num[train]
  knn.2 <- knn(train.cho, test.cho, train.var, k=2)
  test_answers <- test.cho$variant_num
  results <- test_answers == knn.2
  results_true <- results[results == TRUE]
  accuracy <- length(results_true) / length(test_answers)
  if (accuracy > max){
    max <- accuracy
    best_comb <- comb
  }
}
print (max)
## [1] 0.8709677
print (best_comb)
## [1] 3
The accuracy did not change. The features were region and position (always kept) plus age; we also kept yat_soft, which had appeared in the previous best combination.
Then we tried a single neighbour (k = 1) with these features.
notvars <- c("gender_num", "birth_year", "yat_hard_num", "vocalism_num", "g_num", "ts", "ch_num")
cho_final[notvars] <- NULL
train <- 1:68
train.cho <- cho_final[train,]
test.cho <- cho_final[-train,]
train.var <- cho_final$variant_num[train]
knn.1 <- knn(train.cho, test.cho, train.var, k=1)
results <- test_answers == knn.1
results
## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
results_true <- results[results == TRUE]
accuracy <- length(results_true) / length(test_answers)
accuracy
## [1] 0.8709677
The accuracy was 0.87.
Finally, we removed the “шо” variants and got our best accuracy with kNN, 0.93:
cho_final <- cho_final[-c(34, 35, 36, 86, 87),]
train <- 1:65
train.cho <- cho_final[train,]
test.cho <- cho_final[-train,]
train.var <- cho_final$variant_num[train]
test.var <- cho_final$variant_num[-train]
knn.1 <- knn(train.cho, test.cho, train.var, k=1)
test_answers <- test.cho$variant_num
results <- test_answers == knn.1
results
## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [12] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
results_true <- results[results == TRUE]
accuracy <- length(results_true) / length(test_answers)
accuracy
## [1] 0.9310345
We chose region, position and age to build a regression model. We also included the realisation of /ч/, since it was the most important of the remaining dialect features according to the graphs and to varimp() from the tree-based model, to check whether it would matter here. We again removed the rows with “шо”.
cho_1 <- read.csv(url("http://goo.gl/m38tRA"), header = TRUE, sep=";", stringsAsFactors = FALSE)
cho <- cho_1[-c(34, 35, 36, 63, 86, 87),]
cho_dataset <- function(df) {
  df$variant <- as.factor(df$variant)
  df$ch <- as.factor(df$ch)
  df$position <- as.factor(df$position)
  df <- df[,c("variant","ch","position","age","region_num")]
  df <- df[complete.cases(df),]
  return(df)
}
cho <- cho_dataset(cho)
str(cho)
## 'data.frame': 93 obs. of 5 variables:
## $ variant : Factor w/ 2 levels "чё","што": 1 1 1 1 1 1 1 1 1 1 ...
## $ ch : Factor w/ 2 levels "[ч'] мягкий",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ position : Factor w/ 3 levels "conj","part",..: 3 2 3 1 3 3 3 3 3 3 ...
## $ age : int 91 75 75 75 81 80 96 70 83 68 ...
## $ region_num: int 2 3 3 3 7 7 7 7 7 7 ...
The model was evaluated by cross-validation: at each step it was trained on N − 10 cases, while the remaining 10 were used for testing.
log_reg <- function(df, size=10) {
  N <- nrow(df)
  df <- df[sample(N),]                      # shuffle the rows
  num <- floor(N/size)                      # number of full folds
  rest <- N - num * size                    # size of the leftover fold
  ncv <- cumsum(c(rep(size,num), rest))     # right edges of the folds
  predictions <- data.frame(variant = df$variant, pred = NA)
  for(n in ncv) {
    v <- rep(TRUE, N)
    v[(n-size+1):n] <- FALSE                # hold out the current fold
    lr <- glm(variant ~ ., data = df[v,], family = binomial(logit))
    predictions[!v,"pred"] <- predict(lr, newdata=df[!v,], type="response")
  }
  return(predictions)
}
predictions <- log_reg(cho, size=10)
str(predictions)
## 'data.frame': 93 obs. of 2 variables:
## $ variant: Factor w/ 2 levels "чё","што": 1 1 2 2 2 2 2 2 1 2 ...
## $ pred : num 0.337 0.161 0.768 0.438 0.477 ...
lr <- glm(variant ~ ., data = cho, family = binomial(logit))
summary(lr)
##
## Call:
## glm(formula = variant ~ ., family = binomial(logit), data = cho)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2924 -0.6497 -0.4276 0.6348 2.1173
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.74328 2.44187 1.942 0.0521 .
## chутрата затвора [ш] 0.43269 0.64377 0.672 0.5015
## positionpart -0.43055 0.85578 -0.503 0.6149
## positionpron -2.60660 0.62623 -4.162 3.15e-05 ***
## age -0.03102 0.02999 -1.034 0.3011
## region_num -0.24162 0.11763 -2.054 0.0400 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.398 on 92 degrees of freedom
## Residual deviance: 89.493 on 87 degrees of freedom
## AIC: 101.49
##
## Number of Fisher Scoring iterations: 4
exp(lr$coefficients)
## (Intercept) chутрата затвора [ш] positionpart
## 114.81048157 1.54139155 0.65015261
## positionpron age region_num
## 0.07378482 0.96945890 0.78535579
According to the summary, the ch type is not a significant predictor in the logistic regression model (p ≈ 0.50), even though its odds ratio is greater than 1 (1.54).
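To make that uncertainty explicit, Wald confidence intervals can be put on the odds-ratio scale (a sketch; confint() would give profile-likelihood intervals instead):
# odds ratios with 95% Wald confidence intervals
exp(cbind(OR = coef(lr), confint.default(lr)))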
lr_reduced <- glm(variant ~ position + age + region_num , data = cho, family = binomial(logit))
summary(lr_reduced)
##
## Call:
## glm(formula = variant ~ position + age + region_num, family = binomial(logit),
## data = cho)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1987 -0.6562 -0.4385 0.6565 2.0951
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.00386 2.41900 2.069 0.0386 *
## positionpart -0.46362 0.85743 -0.541 0.5887
## positionpron -2.61098 0.62269 -4.193 2.75e-05 ***
## age -0.03099 0.03003 -1.032 0.3021
## region_num -0.27111 0.10971 -2.471 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128.398 on 92 degrees of freedom
## Residual deviance: 89.944 on 88 degrees of freedom
## AIC: 99.944
##
## Number of Fisher Scoring iterations: 4
exp(lr_reduced$coefficients)
## (Intercept) positionpart positionpron age region_num
## 148.98719728 0.62900353 0.07346225 0.96948950 0.76253136
Then we compared the two models, with and without the ch variable.
anova(lr, lr_reduced, test = 'Chisq')
## Analysis of Deviance Table
##
## Model 1: variant ~ ch + position + age + region_num
## Model 2: variant ~ position + age + region_num
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 87 89.493
## 2 88 89.944 -1 -0.45157 0.5016
Comparing the two models, with and without the ch-type factor, we see that there is no significant difference between them (Pr(>Chi) = 0.5016), so we can use the reduced model. This confirms that the ch type is actually not important.
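AIC points the same way (a sketch; the values match the two summaries above, 101.49 vs 99.94):
# compare the full and reduced models by AIC
AIC(lr, lr_reduced)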
library(tidyverse)
plot_pred_type_distribution <- function(df, threshold) {
  # label each prediction as TP/FP/FN/TN, with "што" as the positive class
  # (variant is a factor with levels "чё"/"што", so we compare against those levels)
  v <- rep(NA, nrow(df))
  v <- ifelse(df$pred >= threshold & df$variant == "што", "TP", v)
  v <- ifelse(df$pred >= threshold & df$variant == "чё", "FP", v)
  v <- ifelse(df$pred < threshold & df$variant == "што", "FN", v)
  v <- ifelse(df$pred < threshold & df$variant == "чё", "TN", v)
  df$pred_type <- v
  ggplot(data=df, aes(x=variant, y=pred)) +
    geom_violin(fill=rgb(1,1,1,alpha=0.6), color=NA) +
    geom_jitter(aes(color = pred_type)) +
    geom_hline(yintercept=threshold, color="red", alpha=0.6) +
    scale_color_discrete(name = "type") +
    labs(title=sprintf("Threshold at %.2f", threshold))
}
plot_pred_type_distribution(predictions, 0.6)
Points closer to 1 correspond to the “што” pronunciation and points closer to 0 to “чё”. We decided to put the threshold at 0.6; moving it up lowers the number of false positives and increases the number of false negatives. This plot illustrates the distribution of the predictions made by the model.
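The same threshold can be summarized as confusion counts, treating “што” as the positive class as in the plot (a sketch):
# confusion counts at the 0.6 threshold
pred_class <- ifelse(predictions$pred >= 0.6, "што", "чё")
table(predicted = pred_class, actual = predictions$variant)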
We tried different models to predict the pronunciation type of “что” in Russian dialects. The results show that not all of the variables in our dataset are useful for prediction: g, ts and vocalism have very little influence on the “что” pronunciation, while age, region and position are important. This means that our initial hypothesis was not confirmed. Judging by the results, the pronunciation of “что” is largely independent of the other pronunciation features described in the dataset.