Readings
A Guide to Analysis in MaxEnt Optimality Theory Chapters 1 and 2
My thoughts
- For some data that is quantitative (i.e., linguistic data presented as counts, frequencies, percentages, or other statistical measures, or that requires numerical methods to analyze), there may be no rigorous way to analyze it in a categorical framework, but a MaxEnt grammar is suitable for this purpose
- nonstochastic (non-random) = OT and Classical Harmonic Grammar
My Questions
- Why are probabilities suitable for gradient phenomena? Is gradience the same thing as probability?
- Why can MaxEnt provide concrete interpretations for linguistic experiments?
- MaxEnt offers both quantitatively precise analysis and statistical testing, which permits checking whether each constraint in the analysis is making a meaningful contribution
- Sudden thought! Could MaxEnt be used to test syntactic binding constraints for reflexives, if I already have proportion data for English and Chinese? (Chapter 8 has a syntax example)
- The lenition process is confined to function words: when a /p/ appears in a content word, it never alternates. Why does lenition appear only in function words, and not in content words?
- Answer: it depends on the differences between content and function words
- Grammatical status and prosodic structure: function words usually occupy prosodically weak positions (unstressed, cliticized onto a neighboring word, inside the so-called "clitic group"), and weak positions are more prone to lenition.
- Information structure and prominence: content words carry semantic information, tend to be stressed, and sit in "strong" positions, so their segments are more stable; function words carry little information, are often unstressed, and are more susceptible to assimilation and weakening in connected speech.
- How do people set the initial constraint weights in Classical Harmonic Grammar?
- What is a perturber constraint? Why do candidates that violate a perturber constraint have a probability distribution different from what is seen among non-violating candidates?
- How do we distinguish categorical phenomena, as in Iraqi Arabic, which bans triple consonant clusters outright, from perturber constraints, as in English, where CCC clusters are repaired in surface forms?
- In OT as a nonstochastic theory, there is no way to treat perturber phenomena; could Harmonic Grammar handle them better, given that MaxEnt gives us a way to integrate these phenomena into language-specific analyses?
- Answer: a perturber constraint is not formally special; all constraints are treated equally in the calculation. Narrowly, the label "perturber" describes what certain constraints do: if the candidates follow the same overall trend and the constraint only shifts the rate a little, "perturber constraint" names that particular setup of parallel patterns. It is entirely a descriptive term.
- still confused about the sigmoid part.
- grammatical constraints and inductive biases = the question of whether what MaxEnt identifies is actually in the phonology
- Answer: MaxEnt is both a tool and a theory, and the question is what is in the grammar. Inductive bias informs what kinds of constraints you learn, e.g. a preference for formally simple constraints. If constraints are the way the grammar is represented, you can ask whether a given influence is phonological or not; articulatory constraints relate to what we think phonology is. CON is itself a theory, and MaxEnt lets us compare theories, but the math alone cannot tell us whether something is in the grammar. When learning distributions, you may simply learn things that are frequent. MaxEnt is a tool for theory testing: look at the distribution, use MaxEnt to see how well the model fits, then ask whether the constraint is reasonable and whether the model fits well. The learner learns from the input data, generates constraints, and we see how they fit each other.

Why this framework is defensible
- Information theory: given the known statistics, maximizing entropy guarantees the least presumptuous guess.
- Statistics: it is simply the maximum-likelihood solution of an exponential-family model; it generalizes decently and its interpretation is intuitive.
- Engineering: it is just a clean linear score + softmax: fast, stable, and visualization-friendly.
Steps for doing MaxEnt
- Calculate the Harmony of the candidate
  - per constraint: number of violations × weight
  - Harmony = sum of (number of violations × weight) over all constraints
- higher Harmony = lower probability
- convert Harmony → probability
  - negate the Harmony: −H
  - exponentiate: e^−H (e ≈ 2.718)
  - sum up the e^−H values over all candidates = Z
  - probability = e^−H / Z
Example:
- [mʊntəg poɮən]: Harmony = 3
- [mʊntəg woɮən]: Harmony = 4
- e^−3 ≈ 0.050
- e^−4 ≈ 0.018
- Z = 0.050 + 0.018 = 0.068
- 0.050 / 0.068 ≈ 0.73
- probability of [mʊntəg poɮən] = 73%
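A minimal Python sketch of these steps, reproducing the example numbers above (the candidate strings and Harmony scores come from the example; the rest is just the arithmetic):

```python
import math

# Hand-set Harmony scores from the example above.
harmonies = {"[mʊntəg poɮən]": 3.0, "[mʊntəg woɮən]": 4.0}

e_values = {cand: math.exp(-h) for cand, h in harmonies.items()}  # e^-H
Z = sum(e_values.values())                                        # normalizer Z
probs = {cand: eh / Z for cand, eh in e_values.items()}           # e^-H / Z

for cand, p in probs.items():
    print(f"{cand}: {p:.2f}")  # -> 0.73 and 0.27
```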
Chapter 1: Purpose and orientation
- MaxEnt Grammars are a formal apparatus for constraint-based linguistics and an outgrowth of OT; they work with probabilities (suitable for gradient phenomena)
- free variation = multiple outputs from a single input
- gradient well-formedness (syntactic judgments or phonotactics)
- Matching of native speaker judgments to statistical patterns in their languages
- Provide concrete interpretations for linguistic experiments
2 perspectives
- practical = scholars need an analytical tool to help them deal with quantitative linguistic data
- theory = the consequences of MaxEnt can be extracted from the theory's basic principles through reasoning and tested against language-particular phenomena and typology
1.3 Why is it called “MaxEnt”?
- "Maximum Entropy" in linguistics refers to the mathematical apparatus that Goldwater and Johnson (2003) attached to OT in order to make it probabilistic
- The OT architecture coupled with the MaxEnt math
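The name comes from the maximum-entropy principle; a compact statement (my notation, not the guide's): among all probability distributions that match the observed expected violation counts, pick the one with the greatest entropy. Solving that constrained optimization with Lagrange multipliers yields exactly the exponential form used throughout these notes:

$$
\max_P \; -\sum_x P(x)\,\log P(x) \quad \text{s.t.} \quad \mathbb{E}_P[C_i] = \hat{\mathbb{E}}[C_i] \;\;\Longrightarrow\;\; P(x) = \frac{e^{-\sum_i w_i C_i(x)}}{Z}
$$

where the constraint weights $w_i$ emerge as the Lagrange multipliers.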
Chapter 2: The analysis of variation in outputs
2.1 Analyzing the K1 language in Classical OT

- Many controversial issues for MaxEnt are controversial for OT as well
Classical OT
- Aim: search for a particular candidate output that best satisfies a ranked hierarchy of phonological constraints
- Optimal candidate = unique output of the grammar
- Function GEN = provides the full set of candidates, a finite set drawn from all possible strings
- Function EVAL takes the candidates from GEN together with the ranked constraint set, and outputs a winner
- Candidate selection
- If the highest-ranked constraint C that treats candidates 1 and 2 differently is violated more times by 2 than by 1, then 2 cannot be the winner
## Example of OT in K1
- constraints and rankings
- IDENT(sonorant) = output correspondents of an input [αsonorant] segment in content words are also [αsonorant]
- positional faithfulness constraints = distinguish content from function words
- faithfulness constraints penalize candidates that change something in the input
- AGREE(continuant) = penalizes segment sequences that differ in a particular feature; adjacent segments should have the same value for [continuant], so *[mw] is penalized
- *p = assign a violation to every non-phrase-initial p

OT has been extremely influential in phonological theory, adopted by many who had previously made use of the rule-based framework of SPE (Chomsky and Halle 1968)
- scientifically sensible goal: separation between Markedness and Faithfulness constraints
- Markedness principles that militate against particular surface forms
- Faithfulness constraints that atomize the set of ways in which surface forms can differ from their underlying forms
- OT separates the "problem" (markedness) from the "cost of the solution" (faithfulness), and uses constraint ranking to explain, in a unified way, why the same markedness pressure yields different surface repairs in different languages and environments, sometimes blocking a change and sometimes driving one. This abstraction and comparability is its main advantage over rule-based theories, which bundle the problem and the solution into a single rule.
- Makes language-specific phonological analysis more responsible to typology
- when things go well with OT, we find that a very detailed language-specific analysis can be reduced to a language-specific ranking of principles with cross-linguistic support
2.2 K2 and free variation
- The process of /p/ lenition is optional in K2
- it’s okay to have /p/ lenition and it’s okay not to have /p/ lenition
- free variation = no meaning difference arises
- OT's approach: instead of a single fixed ranking of constraints in the grammar, we let certain constraint pairs be ranked freely
- dotted vertical line = separates the freely ranked constraints (IDENT(sonorant), AGREE(continuant))


2.3 The interpretation of free ranking as probability with partially ranked constraints
Theory of Partially Ranked Constraints
- each group of constraints delimited by solid lines = constraint stratum

- probability assigned to any candidate under the theory of partially ranked constraints = the number of rankings for which that candidate is the winner, divided by the total number of rankings (e.g., with two freely ranked constraints there are two rankings, so a candidate that wins under one of them gets probability 0.5)
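A tiny sketch of this counting interpretation (the constraint names follow the chapter, but the violation profiles and which pair is freely ranked are simplified toy assumptions here):

```python
from itertools import permutations

# Toy violation profiles for the two variants over a freely ranked pair.
violations = {
    "[p]": {"*p": 1, "IDENT(son)": 0},   # faithful candidate
    "[w]": {"*p": 0, "IDENT(son)": 1},   # lenited candidate
}

def ot_winner(ranking):
    """Standard OT evaluation: filter candidates constraint by constraint."""
    cands = list(violations)
    for c in ranking:
        best = min(violations[k][c] for k in cands)
        cands = [k for k in cands if violations[k][c] == best]
        if len(cands) == 1:
            break
    return cands[0]

rankings = list(permutations(["*p", "IDENT(son)"]))
p_lenite = sum(ot_winner(r) == "[w]" for r in rankings) / len(rankings)
print(p_lenite)  # -> 0.5: each variant wins under one of the two rankings
```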
Probabilistic grammar
- unlike OT
- grammar with partially ranked constraints
- assigns probability values (often 0) to the members of GEN
- adopted for the analysis of a variety of systems with free variation
2.4 K3: modeling arbitrary probabilities
- Variants are not equally likely
- percentages in speech and text cover a wide range, and in these cases the rates have been shown to be stable and replicable, both in corpora and in speakers' behavior in experiments
- necessary to make use of a model of phonological grammar which can represent these patterns effectively
In K3
- when a function word is preceded by a non-nasal sound, the data include 300 instances of [p] and 110 instances of [w] (so surface [p] occurs at 300/410 ≈ 73%)
- surface [p] and surface [w] are not equally likely
- In order to model the data of K3, we need a probabilistic grammar that can predict the exact lenition rates

2.5 MaxEnt: an introduction
Key difference between MaxEnt and OT
- Instead of using constraint ranking, MaxEnt assigns a numerical value to each constraint (weight) = strength
- while retaining OT's aims of solving the conspiracy problem and linking language-specific analyses to phonological typology
2.5.1 Preview: Classical Harmonic Grammar
- Classical Harmonic Grammar
- constraint weights and violations are used to calculate a Harmony score for each candidate, which acts as a kind of penalty
- outcome = least-penalized member of GEN for a given input
- produces only a single winner, just like OT; it is not a probabilistic theory
- the winning candidate = lowest Harmony penalty



2.5.2 Shifting to MaxEnt
- MaxEnt needs one more step in addition to Harmonic Grammar = converting the Harmony scores to probabilities
- MaxEnt is a probabilistic version of Classical Harmonic Grammar, just as the theory of partially ranked constraints is a probabilistic version of OT.
2 essential properties
- Assign a lower probability to candidates with higher Harmony penalties
- The probability assigned to the full set of candidates for a given input will sum to one = probability distribution
- Calculating probability from Harmony values in MaxEnt
  - calculate the Harmony of each candidate
  - for each Harmony value, negate it, then exponentiate it using Euler's number: e^−H (e ≈ 2.718; the math also works with other bases, such as 10)
  - for each input, sum up e^−H over all the candidates for that input = Z
  - for each candidate, divide its e^−H by Z; this is called its predicted probability
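Putting these steps into one formula (my notation, consistent with the steps above; $C_i(x)$ = candidate $x$'s violation count for constraint $i$, $w_i$ = its weight):

$$
H(x) = \sum_i w_i\,C_i(x), \qquad P(x) = \frac{e^{-H(x)}}{Z}, \qquad Z = \sum_{y \in \mathrm{GEN}(\text{input})} e^{-H(y)}
$$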

The weights in (17) were chosen by hand. While it is possible to find a good fit to the data by choosing weights by hand for a very small dataset like this one, realistic analyses generally require the weights to be fit by machine.
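The notes below describe the Excel route; one standard machine approach is to minimize the negative log-likelihood of the observed counts. A minimal sketch with scipy, using assumed toy violation profiles and the K3-style counts:

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup: one input, two candidates, two constraints.
# Rows = candidates, columns = violation counts (assumed numbers).
violations = np.array([[1.0, 0.0],    # faithful [p...]: violates *p
                       [0.0, 1.0]])   # lenited [w...]: violates IDENT(son)
observed = np.array([300.0, 110.0])   # K3-style corpus counts

def neg_log_likelihood(w):
    h = violations @ w                # Harmony = weighted violation sums
    e = np.exp(-h)
    p = e / e.sum()                   # single input, so a single Z
    return -(observed * np.log(p)).sum()

# Weights are conventionally kept non-negative.
res = minimize(neg_log_likelihood, x0=np.zeros(2), bounds=[(0, None)] * 2)
print(res.x)  # only the weight *difference* is identified here: ~log(300/110)
```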


2.6 K4: Perturbers as “hidden” phonology in variation
- affected by context = at least one independent factor — we will call it a perturber — that changes the output probabilities in a systematic way
- specifically, lenition is more likely after a vowel than after a consonant
- a preceding vowel is a perturber that increases the probability of lenition
- POSTVOCALICLENITE = Assign a violation for every sequence of the form [+syllabic][−sonorant].
- it lowers the output probability of candidates that violate it, but does not rule out violating candidates entirely
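A quick numeric illustration of that behavior (the weights and Harmonies are my own assumed toy values): the perturber adds its weight to the faithful candidate's Harmony only in the post-vocalic context, raising the lenition rate without making it categorical.

```python
import math

def p_lenite(h_faithful, h_lenited):
    """MaxEnt probability of the lenited candidate in a two-candidate race."""
    e_f, e_l = math.exp(-h_faithful), math.exp(-h_lenited)
    return e_l / (e_f + e_l)

w_perturber = 1.5  # assumed weight for POSTVOCALICLENITE
print(p_lenite(3.0, 4.0))                # after a consonant: ~0.27
print(p_lenite(3.0 + w_perturber, 4.0))  # after a vowel: ~0.62, not 1.0
```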
2.6.2 Fitting the weights
steps in Excel

- Column K computes the e^−H value of each candidate x (from (16b)) by first negating, then exponentiating, the result of Column J. The Excel notation for e^x is EXP(x). As before, this formula can be written once and copied down the column.
- Column L calculates the normalizing factor Z for each input ((16c)), using the SUM() function to total the e^−H values for all (both) candidates.
- Column M carries out the final step, from (16d): calculating the probability assigned to each candidate by dividing the e^−H from Column K by the Z value from the appropriate cell in Column L.
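The same pipeline in Python, if a spreadsheet is not handy (the column letters follow the guide's layout; the DataFrame contents here are placeholder values):

```python
import numpy as np
import pandas as pd

# Placeholder tableau: Column J holds each candidate's Harmony.
df = pd.DataFrame({
    "input":     ["/...p.../", "/...p.../", "/...pi.../", "/...pi.../"],
    "candidate": ["[p]", "[w]", "[pi]", "[wi]"],
    "H":         [3.0, 4.0, 2.0, 2.5],
})
df["eH"] = np.exp(-df["H"])                           # Column K: e^-H
df["Z"]  = df.groupby("input")["eH"].transform("sum") # Column L: Z per input
df["p"]  = df["eH"] / df["Z"]                         # Column M: probability
print(df)
```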
2.7 K5: Multiple perturbers
- closely based on the real Khalkha data analyzed by Fuller
- we will treat the post-nasal environment not as an absolute blocker of lenition, but merely as a downward perturber

a. IDENT(son)_pol: Output correspondents of an input [αsonorant] segment in the lexical item pol are also [αsonorant].
b. IDENT(son)_pai: Output correspondents of an input [αsonorant] segment in the lexical item pai are also [αsonorant].
c. IDENT(son)_pol-: Output correspondents of an input [αsonorant] segment in the lexical item pol- are also [αsonorant].
d. IDENT(son)_pi: Output correspondents of an input [αsonorant] segment in the lexical item pi are also [αsonorant].
sigmoid! = the MaxEnt sigmoid is always symmetrical around the location of 50% probability

The wug-shaped curves are a family of parallel S-shaped (sigmoid) probability curves, used to visualize the combined effect of the baseline constraints and one or more perturber constraints in a MaxEnt/HG model. The baseline controls the overall tendency; a perturber only shifts the whole S-curve horizontally in its particular environment, and with several perturbers you get a "striped wug" (several parallel S-curves).
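Why the curve is a sigmoid, in the two-candidate case (my derivation from the probability formula above): with candidates $a$ and $b$,

$$
P(a) = \frac{e^{-H(a)}}{e^{-H(a)} + e^{-H(b)}} = \frac{1}{1 + e^{-(H(b)-H(a))}} = \sigma\big(H(b) - H(a)\big)
$$

a logistic function of the Harmony difference, hence symmetric around the 50% point. A perturber adds a constant weight to one candidate's Harmony in its environment, so it translates the curve sideways without changing its slope, which is why the wug curves come out parallel.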

- interaction effect: having lexically specific constraints already looks like getting interactions between constraints, since how the other constraints play out depends on the lexical item. A real interaction effect would be conjoined constraints = a device you can put in your grammar: if violating two constraints together incurs an extra penalty beyond the sum of the two, that is an interaction effect in the model. Lexically specific constraints are more like random effects, not interactions.
A toy tableau (both constraint weights = 1):

| candidate | *CODA (w = 1) | ONSET (w = 1) | Harmony score |
|---|---|---|---|
| pa | | | 0 |
| at | * | * | 2 |
| a | | * | 1 |
| pat | * | | 1 |
How to read these curves
- Count the curves at a glance: the number of curves equals the number of statistically effective perturbers.
- Look at the spacing: the wider the gap, the larger that perturber's weight.
- Look at the slope: the shared slope is set by the overall weight scale of the baseline constraint family.
- Look at the middle: perturbers show up most clearly in the mid-range; at the extremes they accomplish almost nothing.