Error Categories
TL;DR
Each Edit object contains an err_cat field, which indicates its error category:
err_cat | Description |
---|---|
PUNC | Punctuation |
SPELL | Spelling |
GRMR | Grammar |
MIX | Mixed (unused/ignore) |
Details
In some use cases for our API, it's helpful to know what types of errors have been detected.
The error category labels each Edit instance in the API response as a punctuation (PUNC), spelling (SPELL), or grammar (GRMR) error. There's also a MIX category that can be ignored for now. It is being used experimentally in development and has been included for completeness in case it appears in future rollouts.
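As a rough illustration, here is one way a client might tally edits by category while skipping MIX. Only the err_cat field comes from this page; the `original` and `replacement` keys, the sample edits, and the `count_by_category` helper are assumptions made up for the sketch.

```python
from collections import Counter

# Hypothetical edits: only "err_cat" is documented above. The other keys
# are placeholders for whatever the real Edit object carries.
edits = [
    {"original": "yeer", "replacement": "year", "err_cat": "SPELL"},
    {"original": "", "replacement": ".", "err_cat": "PUNC"},
    {"original": "goes", "replacement": "go", "err_cat": "GRMR"},
    {"original": "teh cat", "replacement": "the cat,", "err_cat": "MIX"},
]

# MIX is experimental, so it is left out of the reported categories.
REPORTED = {"PUNC", "SPELL", "GRMR"}


def count_by_category(edits):
    """Tally edits per category, ignoring MIX and any unrecognized label."""
    return Counter(e["err_cat"] for e in edits if e["err_cat"] in REPORTED)


counts = count_by_category(edits)
```

Filtering against an explicit set, rather than excluding only MIX, also keeps the tally stable if a new label shows up in a future rollout.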
Assigning Categories
Because our system is based on a neural network, assigning categories is not as simple as it is for rule-based systems, in which each rule targets a specific type of error. Instead, the system must infer the type of error ad hoc, so expect some of these category labels to be inaccurate.
Merging Behavior
Consider the following sentence:
Jack and Jill goes up the hill to fetch some water.
The word goes is incorrect and should be replaced with the plural form go.
Now, consider a sentence with lots of errors that are adjacent:
i be da best writer in the world!
Instead of reporting this as 3 errors, we merge adjacent errors of the same type so that i be da is replaced with I am the as a single Edit. In the future, we may change this default merging behavior and/or allow it to be specified on a per-request basis.
The impact of the error category on merging behavior is that Edits with different error categories will not be merged. For example, consider this sentence:
Sally has taken her first day off from work this yeer
This sentence has 2 errors at the very end of it: a spelling error (yeer -> year) and a punctuation error (the end stop is missing). Because these are different types (SPELL and PUNC), they will not be merged.
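The rule described above can be sketched in a few lines. This is a simplified model for illustration, not our actual implementation: the `Edit` dataclass, its token-index fields `start` and `end`, and the `merge_adjacent` helper are all hypothetical, and the three edits in the first example are assumed to share the GRMR category.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    # Hypothetical fields; only err_cat is documented on this page.
    start: int         # index of the first token replaced
    end: int           # index one past the last token replaced
    replacement: str
    err_cat: str


def merge_adjacent(edits):
    """Merge edits that touch each other and share an error category."""
    merged = []
    for edit in sorted(edits, key=lambda e: (e.start, e.end)):
        prev = merged[-1] if merged else None
        if (prev is not None
                and edit.start == prev.end
                and edit.err_cat == prev.err_cat):
            # Same category and touching: fold into the previous edit.
            merged[-1] = Edit(prev.start, edit.end,
                              f"{prev.replacement} {edit.replacement}",
                              prev.err_cat)
        else:
            merged.append(edit)
    return merged


# "i be da best writer in the world!": three touching edits, assumed GRMR.
same = merge_adjacent([
    Edit(0, 1, "I", "GRMR"),
    Edit(1, 2, "am", "GRMR"),
    Edit(2, 3, "the", "GRMR"),
])

# "... this yeer": a SPELL edit followed by a PUNC insertion at the end.
mixed = merge_adjacent([
    Edit(10, 11, "year", "SPELL"),
    Edit(11, 11, ".", "PUNC"),
])
```

In this model, `same` collapses to a single edit replacing tokens 0 through 2 with "I am the", while `mixed` keeps both edits because their categories differ.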
Why Does This Matter?
If you are wondering why this matters and if it's all "much ado about nothing", then your use case is probably not affected. But for some use cases, merging behavior is important, and you may need to post-process the Edits to get the grouping you want.
If we go back to this example:
Sally has taken her first day off from work this yeer
Here is the corrected text with parentheses added to show the 2 edits made:
Sally has taken her first day off from work this (year)(.)
For an automated scoring system, such as one that grades GRE essays, it might be helpful to view this as a misspelling AND a punctuation error. That's because each error type is probably weighted differently in severity.
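To make the scoring case concrete, here is a minimal sketch of a per-category penalty. The weights and the `penalty` helper are invented for illustration; pick values that match your own rubric.

```python
# Hypothetical severity weights; these numbers are not part of the API.
WEIGHTS = {"SPELL": 1.0, "PUNC": 0.5, "GRMR": 2.0}


def penalty(err_cats):
    """Sum the weight of each edit's category; MIX and unknown labels add 0."""
    return sum(WEIGHTS.get(cat, 0.0) for cat in err_cats)


# The "yeer" sentence yields one SPELL edit and one PUNC edit.
score = penalty(["SPELL", "PUNC"])
```

If the two errors were merged into one Edit, the scorer would see a single category and lose half of this information, which is why separate edits suit this use case.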
However, for an email editor that allows for easy correcting of grammar errors, it's not ideal to have these errors separated (think Gmail). A user doesn't want to click once to change yeer to year and then a second time to insert a period. Besides, that might also make the interface crowded.
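For an editor-style interface, one possible post-processing step is to collapse neighboring edits into a single suggestion regardless of category. The sketch below assumes each edit carries character offsets into the original text; the `Edit` dataclass, its `start` and `end` fields, and the `merge_for_ui` helper are hypothetical and not part of the API.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    # Hypothetical fields: character offsets into the original text.
    start: int
    end: int
    replacement: str
    err_cat: str


def merge_for_ui(text, edits):
    """Collapse edits separated only by whitespace into one suggestion,
    regardless of category.

    Returns a list of (start, end, replacement, categories) tuples.
    """
    groups = []
    for e in sorted(edits, key=lambda e: (e.start, e.end)):
        if groups:
            start, end, repl, cats = groups[-1]
            gap = text[end:e.start]
            if e.start >= end and gap.strip() == "":
                # Keep the original whitespace between the two edits.
                groups[-1] = (start, e.end, repl + gap + e.replacement,
                              cats + [e.err_cat])
                continue
        groups.append((e.start, e.end, e.replacement, [e.err_cat]))
    return groups


text = "Sally has taken her first day off from work this yeer"
at = text.index("yeer")
edits = [
    Edit(at, len(text), "year", "SPELL"),
    Edit(len(text), len(text), ".", "PUNC"),
]

# Both edits collapse into one suggestion that a user can accept in one click.
(start, end, repl, cats), = merge_for_ui(text, edits)
corrected = text[:start] + repl + text[end:]
```

Keeping the list of categories on the merged suggestion means the interface can still show which kinds of errors were fixed, even though the user accepts them together.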
Regardless, the moral of this long story is that one's preference on merging behavior depends on the use case. And we want to make you aware of how things behave and why. If you have suggestions on how to make our API better, or a tutorial that you'd like to see, or extensions that would help for your use case, please reach out.