Winograd Schema Challenge
Commonsense Reasoning is keen to promote the Winograd Schema Challenge and Nuance Communications' competition to develop programs that can pass this alternative to the Turing Test.
What is the Winograd Schema Challenge?
Nuance Communications, Inc. is sponsoring a competition to encourage efforts to develop programs that can solve the Winograd Schema Challenge, an alternative to the Turing Test developed by Hector Levesque, winner of the 2013 IJCAI Award for Research Excellence. The test will be organized, administered, and evaluated by CommonsenseReasoning.org (http://www.CommonsenseReasoning.org), which is dedicated to furthering and promoting research in the field of formal commonsense reasoning.
Background: The Turing Test is intended to serve as a test of whether a machine has achieved human-level intelligence. In one of its best-known versions, a person attempts to determine whether he or she is conversing (via text) with a human or a machine. However, the test has been criticized as inadequate. At its core, the Turing Test measures a human’s ability to judge deception: Can a machine fool a human into thinking that it too is human? Chatbots like Eugene Goostman can fool at least some judges into thinking they are human, but that likely reveals more about how easy it is to fool some humans, especially in the course of a short conversation, than about the bots’ intelligence. It also suggests that the Turing Test may not be an ideal way to judge a machine’s intelligence.
An alternative is the Winograd Schema Challenge.
Rather than base the test on the sort of short free-form conversation suggested by the Turing Test, the Winograd Schema Challenge (WSC) poses a set of multiple-choice questions that have a particular form. Two examples follow; the second, from which the WSC gets its name, is due to Terry Winograd.
I. The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase
II. The town councilors refused to give the demonstrators a permit because they feared (advocated) violence. Who feared (advocated) violence?
Answer 0: the town councilors
Answer 1: the demonstrators
The answers to the questions (in the above examples, 0 when the sentences use “big” and “feared”; 1 when they use the alternate words “small” and “advocated”) are expected to be obvious to a layperson.
A human who answers these questions correctly would likely use knowledge about the typical size of objects and an ability to do spatial reasoning to solve the first example, and knowledge about how political demonstrations unfold and an ability to do interpersonal reasoning to solve the second. Because humans would presumably draw on such a wide variety of commonsense knowledge and commonsense reasoning to solve Winograd Schema problems, it was proposed during Commonsense-2013 that the Winograd Schema Challenge could be a promising method for tracking progress in automating commonsense reasoning. The Winograd Schema Challenge received further attention after Eugene Goostman fooled 30% of judges into thinking it was human in 2014, which sparked interest in developing and furthering alternatives to the Turing Test. It was one of several Turing Test alternatives proposed at the AAAI 2015 Workshop Beyond the Turing Test.
Features of the Challenge:
Winograd Schemas typically share the following features: (Details can be found in Levesque (2011) and Levesque et al. (2012).)
- Two entities or sets of entities, not necessarily people or sentient beings, are mentioned in the sentences by noun phrases.
- A pronoun or possessive adjective is used to reference one of the parties (of the right sort so it can refer to either party).
- The question involves determining the referent of the pronoun.
- There is a special word that is mentioned in the sentence and possibly the question. When replaced with an alternate word, the answer changes although the question still makes sense (e.g., in the above examples, “big” can be changed to “small”; “feared” can be changed to “advocated”.)
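The features listed above can be sketched as a small data structure. This is only an illustrative sketch: the class name `WinogradSchema`, its field names, and the `{word}` placeholder convention are assumptions made for this example and are not part of the official challenge format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """Hypothetical representation of one schema (not the official format)."""
    template: str                 # sentence with a {word} slot for the special word
    question: str                 # question; may also contain the {word} slot
    candidates: Tuple[str, str]   # the two possible referents of the pronoun
    special: str                  # the special word
    alternate: str                # the word that flips the answer
    answer_special: int           # index into candidates when `special` is used
    answer_alternate: int         # index into candidates when `alternate` is used

    def halves(self) -> List[Tuple[str, str, int]]:
        """Return both schema halves as (sentence, question, answer index)."""
        pairs = ((self.special, self.answer_special),
                 (self.alternate, self.answer_alternate))
        return [(self.template.format(word=w), self.question.format(word=w), a)
                for w, a in pairs]


# Example I from the text above.
trophy = WinogradSchema(
    template="The trophy would not fit in the brown suitcase because it was too {word}.",
    question="What was too {word}?",
    candidates=("the trophy", "the suitcase"),
    special="big",
    alternate="small",
    answer_special=0,
    answer_alternate=1,
)

for sentence, question, answer in trophy.halves():
    print(sentence, question, "->", trophy.candidates[answer])
```

Storing both words in one record makes the defining property easy to check: the two halves differ only in the special word, yet their answers differ.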
Ernest Davis has created a collection of more than 140 sample Winograd Schemas that can be used by participants to test their systems during development, at the WSC Collection. Leora Morgenstern has collected more than 60 sample Pronoun Disambiguation Problems, a more general form of Winograd Schemas that is explained below, and in (Morgenstern, Davis, and Ortiz, AI Magazine 2015), at the PDP Collection. These collections will be augmented over time with examples from previous tests.
Further details are below.
Updated Winograd Schema Challenge Competition Rules
- Deadline: The deadline for registration and submission of executable code is July 1, 2016. The competition itself will be held at IJCAI 2016, July 9-15, 2016, in New York City.
- Input format: For ease of processing, input will be given in XML. See http://www.cs.nyu.edu/faculty/davise/papers/WSCExample.xml for an example. Winograd schemas and Pronoun Disambiguation Problems (see below) will all be expressed in natural language.
- Type of Questions: All questions will appear to be Winograd Schema halves. There will be a sentence or short sequence of sentences with at least one pronoun that has two or more possible referents; the system’s task is to correctly identify the referent of the pronoun. Winograd Schema halves have an alternate, hidden form in which a special word or phrase can be substituted in the sentence, resulting in another referent for the pronoun. Sentences that do not have this alternate, hidden form are known as Pronoun Disambiguation Problems.
- Number of rounds and questions: There will be two rounds in the competition. The first round will consist of at least 60 Pronoun Disambiguation Problems. The second round will consist of at least 60 Winograd Schemas. Only those excelling in the first round will advance to the second round.
- Format for submissions:
- Submissions must be made in executable code form on a disc or other memory device that will run on a personal computer or laptop.
- Submissions should be mailed to (and arrive by July 1, 2016):
1198 E. Arques Ave.
Sunnyvale, CA 94085
- If your program requires access to an Internet search engine during processing, please let us know ahead of time so that this can be accommodated. Internet access will be made available through a fiber optic or cable modem line. Cellular and wifi access will be blocked. A restricted set of internet sites will be available, including Google. All internet access will be monitored and recorded.
- Each submission must accept WSs and PDPs in the standard form indicated above.
- Your software should create a file named TeamName-output.txt, where TeamName is your team’s name, with output formatted so that each problem consists of two separate lines followed by a line, as follows:
- Schema Number:
- Carriage Return
An example is shown in http://www.cs.nyu.edu/faculty/davise/WSSampleOutput.txt
- Time limit for processing a WS: 5 minutes of CPU time for each WS.
- Evaluation criteria: The grand prize of $25,000 will be awarded to the best entry that scores at least 90% or within 3 percentage points of human performance, whichever is higher.
- Registration form: All entrants should fill out the registration form, which will be available at www.CommonsenseReasoning.org/winograd.html
- Reproducibility: This competition is meant to advance science. A prerequisite for receiving a prize is demonstrating reproducibility, for example by sharing code or publishing reproducible algorithms. Further details will be made available.
- Supporting Team Formation: A web page will be linked from this page containing the names of teams that are looking for collaborators as well as software that they might want to share with other teams. Academic, industry, and mixed teams may all enter the competition.
- Frequency of competition: TBD
- Announcement of results: The results of each competition will be made public on the www.commonsensereasoning.org and www.nuance.com sites.
- Evaluation committee (To be announced):
- Best of competition prizes: For entries that do not meet the human threshold specified under Evaluation criteria above, the following prizes will be given:
- $3,000 for the best entry that scores over [TBD]
- $1,500 for the second best entry that scores over [TBD]
- Future tests: The committee is considering possible future theme-based tests (e.g., for particular areas of commonsense reasoning) or separate tracks.
- Questions: If you have any questions about the competition, please contact email@example.com or firstname.lastname@example.org
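As an illustration of the Evaluation criteria rule above, the grand-prize threshold can be written as a small calculation. This is one reading of the rule, namely that the bar is the higher of 90% and human performance minus 3 percentage points. The function names and the human-accuracy figures used in the example are hypothetical, not values published by the organizers.

```python
def grand_prize_threshold(human_accuracy: float) -> float:
    """Minimum score (in percent) under one reading of the rule:
    the higher of 90% and (human performance - 3 percentage points)."""
    return max(90.0, human_accuracy - 3.0)


def qualifies_for_grand_prize(score: float, human_accuracy: float) -> bool:
    """True if a score (in percent) meets the assumed threshold."""
    return score >= grand_prize_threshold(human_accuracy)


# Hypothetical human accuracy of 95%: the bar becomes 92%.
print(grand_prize_threshold(95.0))
# Hypothetical human accuracy of 91%: the 90% floor applies.
print(grand_prize_threshold(91.0))
```

Under this reading, an entry scoring 91.5% would not qualify if humans scored 95%, but would qualify if humans scored 91%.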