In standard Russian, all third person pronouns in prepositional constructions must have initial [n]. In some dialects, however, this is not the case. The dialect of the village of Mikhalevskaya is characterized by forms without the initial [n], which is one of its dialectal features. However, the language of the speakers is becoming standardized, and there are fewer and fewer occurrences of dialectal forms. The data set used in this project consists of 1015 observations. It includes the following variables. The output variable is the absence or presence of the initial [n] (categorical). The input variables are the informants’ ID (categorical), their year of birth (numerical), gender (categorical: male or female) and education level (categorical), as well as some variables that characterize the prepositional constructions: the type (categorical) and frequency (numerical) of the preposition, and the form (categorical) and case (categorical) of the pronoun. Our hypothesis is that sociolinguistic factors (such as the age of the informants, their gender and education) might influence the proportion of dialectal forms. The main idea is that the younger the speakers are, the more forms with [n] they produce. Another (weaker) supposition is that the higher the education level (i.e. the more contact with the standard variety), the more cases of the initial [n] we can observe. The other variables will serve as possible predictors, although we do not know to what degree and in what direction they can influence the absence or presence of [n].
The phenomenon examined in this paper is one of the linguistic variables that distinguish the dialect of Mikhalevskaya from the standard language. It is the result of a reanalysis of constructions consisting of the prepositions vъn ‘in’, kъn ‘to’ and sъn ‘with’ (later extended to other prepositions) and third person pronouns, which took place very early in the history of Russian. In modern standard Russian, the initial nasal in pronouns is obligatory in most prepositional constructions (those with primary prepositions) and is optional or even impossible in constructions with some other prepositions. Examples of initial [n]: u n’ego ‘by him’, na n’ix ‘on them’, s n’im ‘with him’. In some Russian dialects, by contrast, the initial nasal consonant after prepositions was lost, and its absence became a dialectal feature.
This research is based on data from the Ustja River Basin Corpus, which includes material collected from 2013 to 2016 during four field trips to Mikhalevskaya, a village in the Ustya district of Arkhangelskaya Oblast. It consists of interviews transcribed in standard Russian orthography and aligned with the original audio (von Waldenfels et al. 2014). The data were collected through CQP queries of the following form: [lemma=“pronoun”] ::match.utterance_spkr=“speaker”. The word pronoun was replaced with a pronoun (он, она or они), and the word speaker with the abbreviation of the selected speaker (пфп1928, авм1922 etc., where the letters abbreviate the speaker’s name and the digits give their year of birth). For example, the query [lemma=“он”] ::match.utterance_spkr=“пфп1928” finds all forms of the pronoun он (singular) that were used by the speaker PFP, born in 1928. The data include third person pronouns, singular and plural, in oblique cases in prepositional constructions (1015 occurrences, 33 informants). Masculine and neuter pronouns were considered together, because they are not differentiated in the corpus annotation. Each pronominal form was examined for the presence or absence of initial nasal [n]. We did not record the quality of the initial sound in pronouns without the nasal consonant (i.e. [j] or a vowel, which may itself be another parameter of variation), because in many cases determining the quality of the anlaut is very problematic. We only recorded whether the initial nasal [n] is present or not.
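library(tidyverse)  # provides %>%, the dplyr verbs and ggplot2 used below
df <- read.csv("pronouns.csv", stringsAsFactors = TRUE)  # keep categorical columns as factors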
We first check which variables are relevant for the research. They are: speaker (33 informants); year (year of birth, from 1922 to 1996); gender (female or male); education (low, low-mid, high-mid, high); preposition (25 prepositions); prep_type (the original prepositions, i.e. в, к and с, versus later, i.e. other prepositions); case and form (case and number / gender form of the pronoun); consonant (no and yes).
To see whether the frequency of the preposition has any influence on the pronunciation of the following pronoun, we need to calculate this information and add it to the data. To do so, we create a separate table with the frequencies and then add them as a column to the data frame with the help of inner_join (it is a numerical variable):
# Relative frequency of each preposition across all 1015 observations
df %>%
  group_by(preposition) %>%
  summarise(prep_frequency = n()/1015) ->
  df_freq
df <- inner_join(df, df_freq, by = "preposition")
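The frequency step above can be tried on a small invented example. The sketch below uses base R only; the data frame `toy`, its values and the helper table `toy_freq` are assumptions made for illustration and are not taken from the corpus.

```r
# Toy data (invented): one row per prepositional construction
toy <- data.frame(
  preposition = c("u", "u", "na", "s"),
  consonant   = c("yes", "no", "yes", "yes")
)

# Separate table: relative frequency of each preposition
counts <- table(toy$preposition)
toy_freq <- data.frame(
  preposition    = names(counts),
  prep_frequency = as.numeric(counts) / nrow(toy)
)

# Attach the frequency to every observation (same role as the join above)
toy <- merge(toy, toy_freq, by = "preposition")
toy
```

Each row of `toy` now carries the share of all observations that use its preposition, so the new column can enter a model as a numerical predictor.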
To begin with, we want to visualize the relationship between the year of birth and the absence or presence of [n]. At first, we do not differentiate between the speakers, in order to see the general tendency. We draw a violin plot and see that, in general, there is a trend towards more observations without [n] among older speakers and with [n] among younger ones. But we must be careful, because this kind of visualization does not take into account how many utterances altogether there are in the interview with each speaker. It might display not the tendency but the disproportionality of the collected data.
df %>%
  ggplot(aes(consonant, year, fill = consonant, color = consonant)) +
  geom_violin(show.legend = FALSE) +
  labs(title = "Correlation between the year of birth and the absence / presence of [n]", x = "Initial [n]", y = "Year of birth")
Therefore, let us draw a scatter plot that considers each speaker separately, in order to check the correlation between the year of birth and the empirical proportion of observations with [n]. As these are only observed proportions and not absolute counts, we also want to see and keep in mind the number of observations behind each of them.
df %>%
  group_by(year, speaker, education) %>%
  summarise(prop_consonant = sum(consonant == "yes")/(sum(consonant == "yes") + sum(consonant == "no")),
            all_consonant = (sum(consonant == "yes") + sum(consonant == "no"))) %>%
  ggplot(aes(year, prop_consonant, color = all_consonant, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Number of observations")
As we also suppose that education level might have an impact on the dialectal performance of the speakers, we need to visualize it first. Let us display the education level on our scatter plot. We can observe that it is probably not a fully independent variable and depends on the year of birth. Therefore, in our analysis we should keep in mind the option of including the interaction of these variables.
df %>%
  group_by(year, speaker, education) %>%
  summarise(prop_consonant = sum(consonant == "yes")/(sum(consonant == "yes") + sum(consonant == "no"))) %>%
  ggplot(aes(year, prop_consonant, colour = education, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3, show.legend = FALSE) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth: with regard to the education level", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Education level")
We also suppose a dependency on gender, so let us display this variable on the scatter plot. Again, the male speakers were mostly born in 1950-1970, so the sample is probably not balanced: gender is entangled with the year of birth. We should check the correlation with statistical methods.
df %>%
  group_by(year, speaker, gender) %>%
  summarise(prop_consonant = sum(consonant == "yes")/(sum(consonant == "yes") + sum(consonant == "no"))) %>%
  ggplot(aes(year, prop_consonant, color = gender, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3, show.legend = FALSE) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth: with regard to the gender", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Gender")
First, we want to check whether a linear regression model is suitable for our data. To do that, we transform our data frame into a shorter format, so that each observation is not a pronoun with a preposition but a speaker with a certain number of dialectal (dial, without [n]) and innovative (inn, with [n]) pronunciations. Then we plot a linear regression with the predictor year:
# One row per speaker: counts of dialectal (dial) and innovative (inn) forms
df %>%
  group_by(speaker, year, gender, education) %>%
  summarise(dial = sum(consonant == "no"), inn = sum(consonant == "yes")) ->
  num_df
num_df %>%
  mutate(perc = inn/(dial + inn)) %>%
  ggplot(aes(year, perc)) +
  geom_smooth(method = "lm") +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth: linear regression", x = "Year of birth", y = "Proportion of forms with initial [n]")
After plotting the linear regression we see two major problems: 1. The fit is poor, because it does not capture all (or even most) of the variability; 2. It predicts values above 1 and below 0, which is impossible, because a speaker cannot produce forms with [n] in less than 0 per cent or more than 100 per cent of cases. Therefore, we need to treat our data as having a binary dependent variable. Only in this case will the model give us realistic results.
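The second problem is easy to reproduce on a handful of invented proportions. The sketch below is only an illustration: the data frame `toy` and its values are assumptions, not the per-speaker proportions from the corpus.

```r
# Invented per-speaker proportions of forms with [n] (not corpus data)
toy <- data.frame(
  year = c(1920, 1940, 1960, 1980, 2000),
  prop = c(0.0, 0.1, 0.5, 0.9, 1.0)
)

fit_lm <- lm(prop ~ year, data = toy)

# Fitted values for the oldest and the youngest toy speaker
pred <- predict(fit_lm, newdata = data.frame(year = c(1920, 2000)))
pred
```

On these toy values the straight line gives -0.06 for 1920 and 1.06 for 2000, i.e. it leaves the possible range even inside the observed span of years.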
The data set analyzed in our research has a binary dependent variable, for which only two outcomes are possible: the variable consonant takes the value no (the absence of initial [n]) or the value yes (the presence of initial [n]). Therefore, the calculations were made with logistic regression (the logit model). The logistic regression model has advantages for our data. First, predicted values are always between 0 and 1, so predicted proportions cannot be above 1 or below 0, which would be impossible in our context. Second, in comparison with simple proportions, this method weights each informant by the number of occurrences they contribute. This is very important for data like ours, because some speakers have an extremely low number of observations. In R, such models can easily be fitted with the function glm() (i.e. Generalized Linear Model). The hypotheses that we are going to check with the help of the logistic regression model are:
H0: there is no relationship between consonant and the predictors (independent variables and their interactions).
H1: there is a relationship between consonant and the predictors (independent variables and their interactions).
First, let us consider a simple model with the numerical variable year as a predictor. Then we can visualize it and compare it with the previous results.
fit_year <- glm(consonant~year, data = df, family = "binomial")
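For reference, a logistic regression with one predictor can be written as follows. This is the standard textbook form; the symbols $p$, $\beta_0$ and $\beta_1$ are notation introduced for this note and do not appear elsewhere in the report.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 \cdot \mathrm{year},
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \mathrm{year})}}
```

Here $p$ is the probability of initial [n]. A one-year increase in the year of birth adds $\beta_1$ to the log odds, i.e. multiplies the odds by $e^{\beta_1}$. With the estimate reported for this model (1.052e-01), that factor is $e^{0.1052} \approx 1.11$.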
The estimated coefficient for the predictor year equals 1.052e-01 and is positive, which means that the later the informants were born, the higher the probability that they produce innovative forms with [n]. The significance code reported for this coefficient is ***, so the dependence is considered significant, i.e. the size of the coefficient is unlikely to be explained by chance alone. Let us plot the sigmoid for this logistic regression, together with confidence intervals.
# Add fitted probabilities (fit) and their standard errors (se.fit) to the data
df_ci <- cbind.data.frame(df, predict(fit_year, df, type = "response", se.fit = TRUE)[1:2])
df_ci %>%
  mutate(`P(consonant)` = as.numeric(consonant) - 1) %>%
  ggplot(aes(x = year, y = `P(consonant)`)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  geom_point() +
  # ranges are fit plus/minus one standard error
  geom_pointrange(aes(x = year, ymin = fit - se.fit, ymax = fit + se.fit)) +
  labs(title = "Logistic regression: initial [n] ~ year of birth", subtitle = "Separate observations. Confidence intervals", x = "Year of birth", y = "Proportion of forms with initial [n]")
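Intervals computed directly on the probability scale can leave the [0, 1] range. A common alternative, sketched here on simulated data, is to build the interval on the log-odds scale and transform its bounds back with plogis(). The data frame `sim`, the simulation parameters and the helper `grid` are invented for illustration; this is not the procedure used for the plot above.

```r
set.seed(1)
# Simulated stand-in for the corpus data (invented values)
sim <- data.frame(year = sample(1922:1996, 300, replace = TRUE))
sim$consonant <- rbinom(300, 1, plogis(-195 + 0.1 * sim$year))

fit <- glm(consonant ~ year, data = sim, family = "binomial")

# Predict on the link (log-odds) scale, then back-transform the bounds
grid <- data.frame(year = seq(1922, 1996, by = 2))
link <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
grid$fit   <- plogis(link$fit)
grid$lower <- plogis(link$fit - 1.96 * link$se.fit)  # approx. 95% bounds
grid$upper <- plogis(link$fit + 1.96 * link$se.fit)
head(grid)
```

Because plogis() maps any real number into (0, 1), the back-transformed bounds always stay within the range of valid probabilities.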
The S-shaped curve shows the predicted probability of [n] for speakers with different years of birth. Unfortunately, this plot does not tell us much about how well the sigmoid reflects the actual performance of each speaker. Moreover, the confidence intervals do not provide us with much information. We need to plot the observed probabilities alongside the predicted sigmoid (first, create a data frame with probabilities, df_probs, then join it with the regular data frame, and then plot separate points for the observed probabilities).
df %>%
  group_by(year, speaker, education) %>%
  summarise(prop_consonant = sum(consonant == "yes")/(sum(consonant == "yes") + sum(consonant == "no"))) ->
  df_probs
df <- inner_join(df, df_probs)
## Joining, by = c("speaker", "year", "education")
df %>%
  mutate(`P(consonant)` = as.numeric(consonant) - 1) %>%
  ggplot(aes(x = year, y = `P(consonant)`)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  geom_point(aes(x = year, y = prop_consonant)) +
  labs(title = "Logistic regression: initial [n] ~ year of birth", subtitle = "Observed probabilities for each speaker", x = "Year of birth", y = "Proportion of forms with initial [n]")
Having drawn a plot for the numerical predictor, let us look at the predictions of the model with all possible predictors:
fit_all <- glm(consonant ~ year + education + gender + prep_type + prep_frequency + case + form, data = df, family = "binomial")
The first thing that we can observe is that the predictors prep_type, case and prep_frequency are not significant for our model (according to the significance codes). However, the masculine form of the pronoun changes the log odds of initial [n] by -0.82600. Both year and gender are statistically significant, as is one term for education: for each one-year increase in year, the log odds of initial [n] (versus its absence) increase by 0.14078; being male versus female changes the log odds of initial [n] by -0.82000; and education of the level high-mid versus the level high changes the log odds of initial [n] by -1.23655.
After applying the Anova test, we also see that the strongest predictor is the year of birth, year: it accounts for the largest reduction in deviance. By contrast, prep_type, case and prep_frequency are not significant. This suggests that the preposition probably has no influence on the absence or presence of the initial [n].
anova(fit_all, test="Chisq")
Therefore, we can remove the predictors prep_type, case and prep_frequency from our model. Model fit4 contains only the significant predictors. Its AIC is 885.59.
fit4 <- glm(consonant ~ year + education + gender + form, data = df, family = "binomial")
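As a reminder of what an AIC value measures, the sketch below fits two logistic regressions to simulated 0/1 data and checks that, for ungrouped binary responses, the AIC equals the residual deviance plus twice the number of estimated coefficients. All names (`sim`, `noise`, `fit_small`, `fit_big`) and values are invented for illustration.

```r
set.seed(2)
# Simulated data (invented): year matters, noise does not
sim <- data.frame(year  = sample(1922:1996, 200, replace = TRUE),
                  noise = rnorm(200))
sim$consonant <- rbinom(200, 1, plogis(-195 + 0.1 * sim$year))

fit_small <- glm(consonant ~ year, data = sim, family = "binomial")
fit_big   <- glm(consonant ~ year + noise, data = sim, family = "binomial")

# For 0/1 responses: AIC = residual deviance + 2 * (number of coefficients)
aic_by_hand <- deviance(fit_small) + 2 * length(coef(fit_small))
c(AIC(fit_small), aic_by_hand)

# Side-by-side comparison; every extra coefficient costs 2 AIC points
AIC(fit_small, fit_big)
```

A lower AIC is preferred, so a predictor is only worth keeping if it reduces the deviance by more than the penalty it adds.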
As a next step, we create a table to look at the predicted probabilities of this model for different values of the independent variables.
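# Unique combinations of predictor values observed in the data
df %>%
  count(year, education, gender, form, consonant) %>%
  select(-n, -consonant) %>%
  unique() ->
  fit_prob_df
# Predicted probability of initial [n] for each combination
fit_prob_df$prediction <- predict(fit4, newdata = fit_prob_df, type = "response")
fit_prob_df %>%
  arrange(prediction)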
## # A tibble: 86 x 5
## year education gender form prediction
## <int> <fctr> <fctr> <fctr> <dbl>
## 1 1928 high-mid female m 0.01613983
## 2 1922 low-mid female m 0.02176185
## 3 1922 low female m 0.02313566
## 4 1930 low-mid male m 0.02962140
## 5 1925 low female m 0.03483201
## 6 1922 low-mid female pl 0.03693281
## 7 1922 low female pl 0.03922595
## 8 1935 high-mid female m 0.04199200
## 9 1922 low-mid female f 0.04908140
## 10 1928 low-mid female m 0.04911725
## # ... with 76 more rows
fit_prob_df %>%
  arrange(desc(prediction))
## # A tibble: 86 x 5
## year education gender form prediction
## <int> <fctr> <fctr> <fctr> <dbl>
## 1 1996 high female f 0.9994919
## 2 1996 high female pl 0.9993163
## 3 1974 high female f 0.9889639
## 4 1974 high female pl 0.9852028
## 5 1975 high male f 0.9787324
## 6 1974 high female m 0.9747619
## 7 1968 high female f 0.9747430
## 8 1975 high male pl 0.9715847
## 9 1968 high female pl 0.9663006
## 10 1975 high male m 0.9520029
## # ... with 76 more rows
As we see, the lowest probability of initial [n], 0.01613983, is for the masculine pronoun in the speech of a female informant born in 1928 with the high-mid education level. The highest, 0.9994919, is for the feminine pronoun in the speech of a female informant born in 1996 with the high education level. These data are hard to interpret because we mix sociolinguistic predictors with a predictor that characterizes individual observations. For the next table, let us consider solely the sociolinguistic factors (to understand something about the speakers of the dialect).
# Unique combinations of the sociolinguistic predictors only
df %>%
  count(year, education, gender, consonant) %>%
  select(-n, -consonant) %>%
  unique() ->
  proba_df
fit_socio <- glm(consonant ~ year + education + gender, data = df, family = "binomial")
proba_df$prediction <- predict(fit_socio, newdata = proba_df, type = "response")
proba_df %>%
  arrange(prediction)
## # A tibble: 30 x 4
## year education gender prediction
## <int> <fctr> <fctr> <dbl>
## 1 1928 high-mid female 0.02796769
## 2 1922 low-mid female 0.03506336
## 3 1922 low female 0.03685432
## 4 1930 low-mid male 0.04137803
## 5 1925 low female 0.05424350
## 6 1935 high-mid female 0.06888340
## 7 1928 low-mid female 0.07547728
## 8 1933 low-mid female 0.13813156
## 9 1951 high-mid male 0.20545696
## 10 1952 high-mid male 0.22835545
## # ... with 20 more rows
proba_df %>%
  arrange(desc(prediction))
## # A tibble: 30 x 4
## year education gender prediction
## <int> <fctr> <fctr> <dbl>
## 1 1996 high female 0.9989835
## 2 1974 high female 0.9805907
## 3 1975 high male 0.9589163
## 4 1968 high female 0.9574235
## 5 1963 high female 0.9197085
## 6 1960 high female 0.8842864
## 7 1966 high male 0.8739123
## 8 1954 high female 0.7728024
## 9 1969 high-mid male 0.7457076
## 10 1961 high-mid female 0.7117072
## # ... with 20 more rows
This provides us with only the essential information. Now we can say what characterizes the most and the least dialectal informant (independently of the pronoun): the lowest probability of [n], 0.02796769, is for the female speaker born in 1928 with the high-mid education level, and the highest, 0.9989835, is for the female speaker born in 1996 with the high education level (the same as in the previous table).
However, before removing the variable case, let us look at its interaction with the variable form, because together they constitute the output form of a pronoun (i.e. case and gender). The problem with this model is that it is too complicated.
fit5 <- glm(consonant ~ year + education + gender + case*form, data = df, family = "binomial")
anova(fit4, fit5, test = "Chisq")
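The comparison of two nested logistic models amounts to a likelihood-ratio test: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. The sketch below reproduces this by hand on simulated data; `sim`, `fit_a`, `fit_b` and all numeric values are invented for illustration.

```r
set.seed(3)
# Simulated data (invented): year and pronoun form both affect the outcome
sim <- data.frame(year = sample(1922:1996, 300, replace = TRUE),
                  form = factor(sample(c("f", "m", "pl"), 300, replace = TRUE)))
sim$consonant <- rbinom(300, 1,
                        plogis(-195 + 0.1 * sim$year - 0.8 * (sim$form == "m")))

fit_a <- glm(consonant ~ year, data = sim, family = "binomial")
fit_b <- glm(consonant ~ year + form, data = sim, family = "binomial")

# Likelihood-ratio test by hand
dev_drop  <- deviance(fit_a) - deviance(fit_b)
df_extra  <- df.residual(fit_a) - df.residual(fit_b)
p_by_hand <- pchisq(dev_drop, df = df_extra, lower.tail = FALSE)

# The same test via anova()
tab <- anova(fit_a, fit_b, test = "Chisq")
c(by_hand = p_by_hand, anova = tab[2, "Pr(>Chi)"])
```

Both numbers agree, which shows what the p-value in the anova table refers to: a small value means the larger model reduces the deviance by more than chance alone would suggest.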
We see that we probably should not fully remove the predictor case but should consider it in interaction with form. Moreover, we want to check the interaction of year, education and gender, because they constitute the group of sociolinguistic factors and may influence each other.
fit6 <- glm(consonant ~ year * education + gender + case*form, data = df, family = "binomial")
fit_final <- glm(consonant ~ year * education * gender + case*form, data = df, family = "binomial")
## Warning: fitted probabilities numerically 0 or 1 occurred
anova(fit5, fit6, test = "Chisq")
anova(fit6, fit_final, test = "Chisq")
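For readers less familiar with R formulas, the sketch below lists the terms implied by the `+` and `*` operators used in these models. It needs no data; the object names `additive`, `two_way` and `three_way` are introduced only for this illustration.

```r
# Term labels implied by each right-hand side (no data needed)
additive  <- attr(terms(consonant ~ year + education), "term.labels")
two_way   <- attr(terms(consonant ~ year * education), "term.labels")
three_way <- attr(terms(consonant ~ year * education * gender), "term.labels")

additive   # "year" "education"
two_way    # "year" "education" "year:education"
length(three_way)  # 7: three main effects, three two-way terms, one three-way term
```

Each `*` therefore adds interaction terms on top of the main effects, which is why the later models have many more coefficients. The warning about fitted probabilities of 0 or 1 that accompanies the three-way model is commonly a sign that some combinations of levels contain only one outcome; this is a general property of logistic regression and has not been checked against the corpus here.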
We can observe that the model with the interaction of all sociolinguistic factors covers more variability than the simple additive model. The AIC (753.33) is the lowest for this model in comparison with the other presented fits. This is going to be our final model (fit_final).
Even on our simple descriptive scatter plots we could see that there might be some correlation between the year of birth and the proportion of observations with initial [n] in third person pronouns in prepositional constructions. We could also suppose that education level and gender may influence this variable. We showed that a linear regression model does not fit data like these: although all speakers have their own degree of being dialectal, we must treat the data as having a binary dependent variable. After trying different possible models, we came to the following results. The simplest and nonetheless quite strong model is the one with the sole predictor year, the year of birth (it explains the largest share of the variability). We can also add other predictors, which, however, complicate the model, and each of them is less significant. One more important result is the observation that the preposition (its type and frequency) apparently has no impact on the choice of the form.