Christopher K. Riesbeck
The Institute for the Learning Sciences
& Department of Computer Science
Northwestern University
Evanston, IL 60201 USA
+1 847 491 3500
riesbeck@ils.nwu.edu
Wolff Dobson
The Institute for the Learning Sciences
& Department of Computer Science
Northwestern University
Evanston, IL 60201 USA
+1 847 491 3500
wolff@cs.nwu.edu
As the student acts and recommends, the system tracks what is happening. As needed or as requested by the student, the system uses experts captured on video to guide, coach, critique, and give real world examples of similar situations.
A GBS system is not intended to model a teacher or tutor and is not supposed to be like a classroom. A GBS system is supposed to be like learning on the job, only better because experts are available all the time for help and review.
Intelligent critiquing of the student's actions and conclusions is an important element of a GBS. Herein lies the rub. On the one hand, it's important to give the student the same kinds of options that the real-world task would have. On the other hand, the more complex and varied the student's choices, the more complex the critiquing module has to be, and hence the more complex the knowledge engineering needs.
We believe GBS systems for thousands of domains will need to be built and maintained. Therefore, an important goal for us is the development of authoring tools that let content experts create GBS's with complex student activities at modest knowledge engineering cost and with little or no programming.
This paper describes a tool, called Indie, for authoring Investigate and Decide GBS's. In particular, we focus on how Indie is used to author knowledge bases, with special emphasis on the knowledge base used to critique evidence-based arguments.
Along the way, students can ask for help and browse a richly-indexed multimedia hypertext network called an ASK system [1]. The central idea in ASK systems is that all information is linked to other information via follow-up questions. For example, if an expert in a video clip mentions how an immunofluorescence test can be used to detect immune system problems, one follow-up question might be "How does the immunofluorescence test work?"
Several examples of Investigate and Decide systems that have been built with Indie are:
* Immunology Consultant: Medical school students have to determine what has gone wrong with a patient's immune system by interviewing the patient, running lab tests, and collecting information on how the immune system works and sometimes malfunctions.
* Is It A Rembrandt?: Art history students have to determine whether a painting is actually by Rembrandt or a forgery, by inspecting the style of the painting, its materials, the signature, and so on.
* Volcano Investigator: High school students learning geology run experiments to estimate the likelihood of a Mt. St. Helens-like volcano erupting, and then decide whether a nearby town has to evacuate immediately.
* Nutrition Clinician: Medical school students have to determine what nutritional deficiencies a patient has, what the medical implications are, and what needs to be done to remove the deficiencies.
For brevity, in this paper we'll refer to these systems as Immunology, Rembrandt, Volcano, and Nutrition.

The student works with four important interface elements:
* The lab screens, where the student gathers facts about the scenario by interacting with the simulated world. "Lab" is a very broad term here, including screens for running experiments, interviewing patients, reading documents, and so on.
* The notebook, which contains all evidence that has been collected so far. Evidence is added automatically for the student.
* The report screen, where the student constructs an argument for or against one or more of the possible choices, using the evidence gathered in the notebook.
* The ASK browser screens, where the student talks to experts, gets background information about the task and domain, asks follow-up questions about critiques and coaching advice, and so on.
The notebook is usually on-screen all the time. The student switches between the lab, report, and browser screens with a mouse click or two.
Internally, there are three important modules:
* The Simulator produces responses for student actions on the lab screens. These responses include not only interface events such as movies and graphics, but also the pieces of evidence that such actions reveal and that get stored in the student's notebook.
* The Critiquer analyzes the arguments made by the student in the report. Problems found are used to retrieve the relevant responses in the ASK network.
* The ASK system retrieves information from the ASK network and supports browsing through that network. It keeps track of what the student has seen so far so that the same information is not shown twice unless the student asks to review it.
Each module has a knowledge base:
* The domain model holds facts about the particular scenario, e.g., "Mary has sickle cell anemia," and facts and rules about the domain in general, such as "if a patient has sickle cell anemia, the microscope test will show sickle-shaped blood cells."
* The argument models describe what makes good and bad arguments for each possible decision.
* The ASK network links questions to answers (in video or text) and answers to follow-up questions in a large graph. In Indie systems, ASK systems hold any information that should be presented with follow-up questions. This includes not only background reference material, but also critiques.
Indie provides tools for authoring:
* the interface screens
* the domain model
* the argument models
* the ASK network
In this paper, we will focus on how argument models are authored and used for critiquing. More specifically, we'll present the first approach we designed and implemented, its strengths and weaknesses, then the approach we're currently using.
* a claim about the scenario, e.g., "Mary has acute rheumatic fever,"
* a set of presented evidence, consisting of scenario facts (usually test results) supporting the claim, e.g., "Mary has a high fever, and Mary had strep throat recently."
When a student submitted an argument, it was compared against an argument model. Every claim had an argument model, consisting of
* the claim
* one or more proof sets, each a set of scenario facts
* one or more disproof sets, each a set of scenario facts
* a set of relevant fact types (usually types of tests, e.g., "take temperature")
For an argument to be acceptable, the set of presented evidence had to
* be a superset of at least one proof set in the argument model
* not be a superset of any disproof set
* include only facts of the relevant fact types
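The three acceptance conditions amount to simple set tests. The following Python sketch is illustrative only (Indie itself was written in Common Lisp). The `Fact` and `ArgumentModel` layouts, the "patient history" fact type, and the sample disproof set are assumptions made for the example, not details of the Indie 1.0 implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A scenario fact; `kind` is its fact type (hypothetical layout)."""
    kind: str
    value: str


@dataclass
class ArgumentModel:
    claim: str
    proof_sets: list       # each a frozenset of Fact
    disproof_sets: list    # each a frozenset of Fact
    relevant_kinds: set    # fact types allowed in an argument for this claim


def is_acceptable(evidence, model):
    """Apply the three acceptance conditions to the presented evidence."""
    evidence = frozenset(evidence)
    has_proof = any(evidence >= p for p in model.proof_sets)
    has_disproof = any(evidence >= d for d in model.disproof_sets)
    all_relevant = all(f.kind in model.relevant_kinds for f in evidence)
    return has_proof and not has_disproof and all_relevant


# Example facts; only "take temperature" is a fact type named in the text.
fever = Fact("take temperature", "high fever")
no_fever = Fact("take temperature", "normal temperature")
strep = Fact("patient history", "recent strep throat")
young = Fact("patient chart", "age is young")

model = ArgumentModel(
    claim="Mary has acute rheumatic fever",
    proof_sets=[frozenset({fever, strep})],
    disproof_sets=[frozenset({no_fever})],
    relevant_kinds={"take temperature", "patient history"},
)

print(is_acceptable({fever, strep}, model))         # True
print(is_acceptable({fever}, model))                # False: no proof set covered
print(is_acceptable({fever, strep, young}, model))  # False: irrelevant fact type
```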
A typical model would have
* one proof set with several facts that needed to be true for the claim to be true.
* a default set of disproof sets, generated automatically by making a singleton set of the negation of each fact in the proof set.
Disproof sets were intended to give authors some control over how much evidence it took to disprove a claim.
Given an argument, argument model, and the scenario facts currently available to the student in the notebook, the Critiquer looked for
* Contradictions, i.e., evidence used to support a claim when in fact it argues against it, or vice versa.
* Overlooked necessary evidence, in the notebook but not used in the argument.
* Missing necessary evidence, not in the notebook because some tests had not yet been run.
* Irrelevancies, i.e., evidence used for or against a claim not in the list of relevant fact types
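One way to read these four categories is as set differences between the presented evidence, the notebook, and the argument model. The Python sketch below assumes a single proof set (the typical case), treats any presented fact that appears in a disproof set as a contradiction, and uses made-up fact and fact-type names. It is an interpretation of the description, not the Indie 1.0 code.

```python
def critique(evidence, notebook, proof_set, disproof_sets, relevant_kinds, kind_of):
    """Group problems with an argument into the four critique categories."""
    evidence, notebook = set(evidence), set(notebook)
    against = set().union(*disproof_sets)  # every fact that argues against the claim
    return {
        # used in support of the claim, but actually argues against it
        "contradictions": sorted(evidence & against),
        # needed, already collected, but left out of the argument
        "overlooked": sorted((proof_set & notebook) - evidence),
        # needed, but the test has not been run yet
        "missing": sorted(proof_set - notebook),
        # presented, but not of a relevant fact type
        "irrelevant": sorted(f for f in evidence if kind_of[f] not in relevant_kinds),
    }


# Hypothetical mapping from facts to fact types.
KIND_OF = {
    "high fever": "take temperature",
    "normal temperature": "take temperature",
    "high ASO titer": "ASO titer test",
    "low ASO titer": "ASO titer test",
    "age is young": "patient chart",
}

result = critique(
    evidence={"high fever", "age is young"},
    notebook={"high fever", "high ASO titer", "age is young"},
    proof_set={"high fever", "high ASO titer"},
    disproof_sets=[{"normal temperature"}, {"low ASO titer"}],
    relevant_kinds={"take temperature", "ASO titer test"},
    kind_of=KIND_OF,
)
print(result["overlooked"])   # ['high ASO titer']
print(result["irrelevant"])   # ['age is young']
```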
Our model of critiquing had the following steps:
* The student constructed an argument.
* The Critiquer analyzed the argument and created a set of categorized critiques, such as "Overlooked: High ASO Titer; Irrelevant: Age is young"
* The ASK system retrieved the most relevant responses for each critique.
Our intent was that authors would write responses for the top-level critique categories, such as
* Contradictions exist: "I'm confused. Are you sure all the evidence implies what you say it does?"
* Omissions exist: "Maybe, but it seems like you need a stronger case."
* Irrelevancies exist: "Seems right but I don't see why you mentioned some of the things you did."
* None of the above: "Makes sense. Good job!"
In addition, authors could also write more specific responses for particular problems, such as "You seem to be confused about how the ASO titer test works..." The ASK system would take care of finding the most relevant of the authored responses for each critique.
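The text does not say how the ASK system ranked responses. As a rough illustration, the Python sketch below assumes that "most relevant" means "most specific": a response authored for a particular category and fact is preferred, with the category-level response as the fallback. The table layout and the pairing of the ASO titer response with the omission category are assumptions.

```python
# Keys are (critique category, specific fact or None for the category default).
RESPONSES = {
    ("contradiction", None): "I'm confused. Are you sure all the evidence implies what you say it does?",
    ("omission", None): "Maybe, but it seems like you need a stronger case.",
    ("irrelevancy", None): "Seems right but I don't see why you mentioned some of the things you did.",
    # A more specific response; attaching it to this category is hypothetical.
    ("omission", "high ASO titer"): "You seem to be confused about how the ASO titer test works...",
}
NO_PROBLEMS = "Makes sense. Good job!"


def respond(critiques):
    """Pick one response per critique, preferring fact-specific ones."""
    if not critiques:
        return [NO_PROBLEMS]
    chosen = []
    for category, fact in critiques:
        text = RESPONSES.get((category, fact)) or RESPONSES[(category, None)]
        if text not in chosen:  # do not repeat the same response
            chosen.append(text)
    return chosen


replies = respond([("omission", "high ASO titer"), ("irrelevancy", "age is young")])
for line in replies:
    print(line)
```

Because every category has a default entry, any argument gets some response even when no specific remediation has been authored.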
* Simple: Since, for pedagogical reasons, there are usually only a handful of claims in an Indie GBS and fewer than a dozen tests necessary to prove any particular claim, the amount of knowledge engineering was fairly small.
* Robust: Using the critique taxonomy to organize responses meant that all student arguments would be handled gracefully, even silly ones.
* Tailorable: Authors could easily add remediation responses for very specific argument errors.
Problems arose almost at once, however, when the authors tried to adapt the Immunology examples to Volcano, Rembrandt, and Nutrition. They felt that the Indie 1.0 model didn't support
* the simple things they wanted to do
* the complicated things they wanted to do
* IF the student says that a high fever is a sign of acute rheumatic fever, THEN play the movie that says fever can be a symptom of many things.
or even, while building a mock-up to demonstrate the interface, a rule like
* IF the student clicks on the "submit report" button, THEN play the movie that says "This isn't enough evidence. You need to collect some real data."
Instead, to make a movie play in response to clicking on submit report, authors had to
* determine the appropriate critique category for the mock-up student argument
* develop an argument model that would generate that critique category
* index the desired movie in the ASK network under that critique category
This is a lot of steps (and thinking) compared to what you need to do to play a movie in a typical interface authoring tool. Of course, such tools provide no support for intelligent critiquing.
There was clearly a serious conflict between our knowledge engineering approach and how goal-based scenario systems were actually being developed. The challenge was making Indie usable for authors without compromising the needs of the final GBS.
1. Single scenario mock-up: In this initial phase, authors develop key segments for one scenario, for design review purposes. They want to specify as simply and directly as possible a (mostly) linear sequence of events, triggered by button clicks. Interface concerns dominate the design and implementation process.
2. Single scenario run-through: In this phase, all the scenes for the scenario are specified, to make sure all functionality is available, feasible and consistent. Enough branches are defined to do some brief usability assessments with test users. The authors need to be able to specify a few conditional responses, especially for remediation, but they want to be able to leave other transitions "hard-wired." Interface work slowly gives way to scenario building.
3. Single scenario completion: In this phase, the authors finish all the branches and the ASK system content for the scenario. The authors need bookkeeping tools to check for consistency, completeness, redundancies, and so on. Some systems stop at this phase. Scenario building and ASK content work dominate.
4. Multiple scenario development: In this phase, the authors specify sequel scenarios. Most of the interface remains the same, as well as some of the artwork, but many of the system responses, especially remedial, have to be significantly changed. Authors need support for replacement, generalization, and reuse of response rules. Scenario building and support content work dominate.
This prototype-based development sequence is typical of modern interactive systems. A key point for intelligent systems is that
* In the early phases, authors want total control over what happens. The system isn't intended to stand on its own yet. The auctorial attitude is "I want it to do this!"
* In the later phases, authors want the system to be intelligent, or, more accurately, not stupid [3]. The auctorial attitude is "I want it to be able to handle this new stuff, along with the old stuff."
Our knowledge engineering approach did not support the direct control of system behavior needed in the first two phases. It only supported the robust response handling needed in the latter phases. Though we thought our critiquing model was a "good value" in terms of intelligence gained for work required, it was still more work to make something happen than, say, simply attaching "play movie" to a button.
Not surprisingly, it doesn't matter if an authoring system supports Phases 3 and 4 if authors don't want to use it in Phases 1 and 2. An author wants to say "when I click here, it does this." That's it. Given how rapidly things change in the first phase, doing more work than this is simply not in the author's interest.
The problem with giving our authors exactly what they want is that what works in the first two phases falls apart in the last two phases. In Phase 1, when building a mock-up, it's nice to be able to just attach the command "play the 'needs more work' movie" to the button labelled "Submit report." Unfortunately, by the start of Phase 3, which movie to play depends in a non-trivial way on what the report being submitted actually contains. By the end of Phase 3, interface issues are mostly irrelevant to what the authors are trying to specify.
         Interface                                Model
Event    button click, item dragged onto a list   rule fired, test result generated
Action   play movie, update button state          fill test tube, critique argument, set current topic
Authors create triggers using a form-based editor that lets them select from lists of available actions and events. Triggers for events on interface objects can be edited by simply clicking on the interface object.
In Phase 1, an author can make the "submit report" button play a particular movie by creating a trigger that goes from interface event to interface action. Schematically, it looks like this:

Later, in Phase 2 or 3, when the author wants the movie that gets played to be selected based on some property of the student argument, the author
* changes the trigger on the submit button to call the model action "critique report" rather than "play movie,"
* creates a critiquing rule (as described below) to catch the relevant property,
* creates a new trigger that goes from the rule firing to "play movie."
Schematically, the new triggers look like this:

If, later, the authors want to add follow-up questions to the movie, they
* link the movie into the ASK network
* link the appropriate follow-up questions and answers to the movie in the network
* add a "set topic" model action to the second trigger
Schematically, the second trigger above becomes:

The ASK Browser interface then takes care of presenting the follow-up questions after the movie plays.
In this way, Indie supports migration from hard-wired button responses to full-fledged critiquing.
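The migration path can be illustrated with a small event-to-action table. This Python sketch is not Indie's trigger mechanism. The `Triggers` class, the event tuples, the rule name "no evidence", and the use of a log in place of real movies are all assumptions made for illustration.

```python
class Triggers:
    """Maps an event to the ordered list of actions it triggers (hypothetical design)."""

    def __init__(self):
        self.table = {}
        self.log = []  # stands in for real interface and model effects

    def on(self, event, *actions):
        self.table[event] = list(actions)  # re-authoring a trigger replaces it

    def fire(self, event):
        for action in self.table.get(event, []):
            action(self)


def play_movie(name):  # interface action
    return lambda t: t.log.append(("play movie", name))


def set_topic(node):  # model action: tells the ASK browser where to start
    return lambda t: t.log.append(("set topic", node))


def critique_report(rules, get_report):  # model action
    def action(t):
        for rule_name, matches in rules.items():
            if matches(get_report()):
                t.fire(("rule fired", rule_name))  # model event
    return action


SUBMIT = ("button click", "submit report")
report = {"BECAUSE": []}  # the student has presented no evidence yet

# Phase 1: interface event straight to interface action.
t = Triggers()
t.on(SUBMIT, play_movie("needs more work"))
t.fire(SUBMIT)
phase1_log = list(t.log)

# Phase 2 or 3: the button now calls the Critiquer; a rule firing plays the movie.
rules = {"no evidence": lambda r: not r["BECAUSE"]}
t.on(SUBMIT, critique_report(rules, lambda: report))
t.on(("rule fired", "no evidence"), play_movie("needs more work"))

# Later: add a "set topic" action so follow-up questions are available.
t.on(("rule fired", "no evidence"),
     play_movie("needs more work"), set_topic("needs more work"))

t.log.clear()
t.fire(SUBMIT)
final_log = list(t.log)
print(phase1_log)
print(final_log)
```

The button's visible behavior is the same in both runs; what changes is that the movie is now selected by a rule, so other rules can be added without touching the interface.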
* more in a student argument than just claim plus evidence,
* more complex analysis of arguments than the proof and disproof set model could provide
In particular, authors wanted students to be able to include in their reports:
* contrary evidence, e.g., in Volcano and Rembrandt, students needed to be able to show that they were aware of test results that didn't fit the claim, e.g., "X is true, because ..., despite the fact that ..."
* categorized evidence, e.g., in Nutrition some evidence is from scenario-dependent test results and some is from scenario-independent background information
* non-evidence, e.g., in Nutrition, the argument for a particular nutritional deficiency is part of a bigger report that also includes medical implications and recommended actions, all of which need critiquing.
Furthermore, authors wanted more control over the argument analysis. In Rembrandt, where evidence about a painting's authorship can be quite fuzzy and subjective, a student's argument has to be based on a preponderance of evidence, not a simple all-or-none logic. In addition, pieces of evidence can interact in complex ways with other points. For example, one of the claims in Rembrandt had the following relationships between its evidence points A, B, C, D, E, F and G:
* Necessary: Any two of the following groups: (A and B), (C and D), E, or F.
* Irrelevant: If (A and B) are present, then E is irrelevant. If (C and D) are present then F is irrelevant.
* Conflicting: G conflicts with B, so if B is mentioned G shouldn't be.
* added support for multiple lists of evidence in arguments, and
* replaced proof and disproof sets with critiquing rules
* at least (or at most) M points from a set of N possible points
* are (or are not) in a BECAUSE list, a DESPITE list, some other labelled list, or the notebook
For example, a rule in Volcano Investigator is:
CLAIM: the volcano will erupt in the next 24 hours,
CHECK IF: BECAUSE does NOT include data from either of two ground deformation tests OR either of the strainmeter results
Since many critiques are based on missing evidence, many rules check for the absence of evidence. Checks for presence of evidence are usually to catch common errors, e.g., "If the students said that high blood pressure is associated with underweight patients, show them this movie about causes of high BP."
* an evidence point,
* at least M of N conditions being true, or
* at most M of N conditions being true.
In the Volcano example above, the rule says "CHECK IF: BECAUSE does NOT include" and the condition says
* at least 1 of
  * at least 1 of 2 ground deformation test results
  * at least 1 of 2 strainmeter results
Conditions can simulate various logical connectives:
* OR is "at least 1 of N conditions"
* AND is "at least N of N conditions."
* NOT is "at most 0 of N conditions."
The use of "at least" and "at most" is similar in approach to SNePS [6]. SNePS is more powerful, because it lets you specify at least and at most simultaneously, but this hasn't been needed by our authors. On the other hand, Indie authors do frequently go beyond AND and OR by asking for 2 of 6 possible conditions to be true or 3 of 7 possible points to be present.
In Indie, the rule specifies what evidence lists are being checked and whether the check is for presence or absence. The conditions specify the logic of the check.
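The split between rules and conditions can be sketched as a small recursive evaluator. The Python below is an illustration under assumed names (`AtLeast`, `AtMost`, `Rule`, and the four test-result labels are hypothetical) and is not the representation Indie uses internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtLeast:
    m: int
    parts: tuple  # evidence points (strings) or nested conditions


@dataclass(frozen=True)
class AtMost:
    m: int
    parts: tuple


def holds(cond, points):
    """Evaluate a condition against the set of points in one evidence list."""
    if isinstance(cond, str):  # a bare evidence point
        return cond in points
    count = sum(holds(part, points) for part in cond.parts)
    return count >= cond.m if isinstance(cond, AtLeast) else count <= cond.m


# The logical connectives, expressed as counts.
def any_of(*parts):
    return AtLeast(1, parts)


def all_of(*parts):
    return AtLeast(len(parts), parts)


def none_of(*parts):
    return AtMost(0, parts)


@dataclass(frozen=True)
class Rule:
    claim: str
    list_name: str        # which evidence list to check, e.g. "BECAUSE"
    check_present: bool   # True: fire on presence; False: fire on absence
    condition: object


def fires(rule, lists):
    included = holds(rule.condition, lists.get(rule.list_name, set()))
    return included == rule.check_present


volcano_rule = Rule(
    claim="the volcano will erupt in the next 24 hours",
    list_name="BECAUSE",
    check_present=False,  # "does NOT include"
    condition=any_of(
        AtLeast(1, ("ground deformation test 1", "ground deformation test 2")),
        AtLeast(1, ("strainmeter result 1", "strainmeter result 2")),
    ),
)

print(fires(volcano_rule, {"BECAUSE": {"seismograph reading"}}))   # True
print(fires(volcano_rule, {"BECAUSE": {"strainmeter result 1"}}))  # False
```

A request such as "2 of 6 possible conditions" is simply `AtLeast(2, ...)` with six parts, which is why the counting form covers more than AND and OR.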
Indie has form-based editors for rules and conditions. The rule editor looks like this:

and the condition editor looks like this:

Almost everything is selected from lists, rather than typed in. The only time something is typed is when something new is created, e.g., a new kind of test result. Anything that's created automatically becomes available for later re-use.
* Rules can be marked "once only," which means they fire at most once.
* Rules can be collected into rule sets. Rules in a rule set are checked in order and checking stops when an author-specified number of rules has fired.
* Each scenario has its own rule collection. It's easy to share rule sets across collections.
Rembrandt's authors used once-only rules and rule sets to give different hints on different rounds of critiquing. The first time a student forgot to analyze the signature on the painting, the first rule in a rule set fired and said the report was incomplete. That rule was once-only and the rule set allowed only one rule to fire. Therefore, the second time the student submitted a report with the same mistake, the second (once-only) rule in that set fired and suggested looking at the signature. If this happened a third time, the third rule said the signature was atypical and the student needed to analyze it.
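The escalating-hint behavior follows from combining the two features. The Python sketch below is a hypothetical reconstruction: the `Rule` and `RuleSet` classes, the response wording, and the choice to leave the third rule repeatable are assumptions, not Rembrandt's actual rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    applies: Callable   # report -> bool
    response: str
    once_only: bool = False
    fired: bool = False


@dataclass
class RuleSet:
    rules: list
    max_fires: int = 1  # author-specified limit on rules fired per critique

    def critique(self, report):
        responses = []
        for rule in self.rules:  # rules are checked in order
            if len(responses) >= self.max_fires:
                break
            if rule.once_only and rule.fired:
                continue  # a once-only rule never fires twice
            if rule.applies(report):
                rule.fired = True
                responses.append(rule.response)
        return responses


def no_signature(report):
    return "signature" not in report


signature_hints = RuleSet(
    rules=[
        Rule(no_signature, "Your report is incomplete.", once_only=True),
        Rule(no_signature, "Have you looked at the signature?", once_only=True),
        Rule(no_signature, "The signature is atypical. You need to analyze it."),
    ],
    max_fires=1,
)

report = {"style", "materials"}  # the student never analyzes the signature
rounds = [signature_hints.critique(report) for _ in range(3)]
for responses in rounds:
    print(responses)
```

Each submission with the same mistake skips the once-only rules that have already fired, so the student hears a progressively more specific hint.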
* Adding an interface by which the student can specify a set of claims.
* Allowing two kinds of claims:
  * The usual kind of claim and argument, e.g., an argument for calcium deficiency.
  * The claim that all problems have been found.
The second kind of claim (usually labelled "I'm done") leads to three possible categories of critique:
* Yes, you're done.
* At least one of your arguments still has problems.
* The arguments you've given are OK, but there's at least one more claim that can be made.
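The three outcomes can be sketched as a short decision function. In this Python illustration the order of the checks, the function name, and the "iron deficiency" claim are assumptions. The text only specifies the three categories.

```python
DONE = "Yes, you're done."
FIX_ARGUMENTS = "At least one of your arguments still has problems."
MORE_CLAIMS = ("The arguments you've given are OK, "
               "but there's at least one more claim that can be made.")


def critique_done(argument_ok, valid_claims):
    """Critique the claim that all problems have been found.

    argument_ok: maps each claim the student argued for to whether that
                 argument passed its own critique.
    valid_claims: the set of claims that hold in this scenario.
    """
    if not all(argument_ok.values()):
        return FIX_ARGUMENTS
    if valid_claims - set(argument_ok):
        return MORE_CLAIMS
    return DONE


valid = {"calcium deficiency", "iron deficiency"}
print(critique_done({"calcium deficiency": False}, valid))
print(critique_done({"calcium deficiency": True}, valid))
print(critique_done({"calcium deficiency": True, "iron deficiency": True}, valid))
```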
They wanted to represent these risks and treatments as points that can be dragged from predefined notebooks into evidence lists associated with an argument. These evidence lists really aren't BECAUSE or DESPITE, or even NOTEBOOKS, exactly. Rather than remove the BECAUSE and DESPITE formalisms from the rules, which saved authoring effort in more traditional critiquers, we added "special conditions" as a final slot on the rule editor, hidden under a twist-down menu so as not to confuse novice authors.
A special condition is just like a normal condition except it relates directly to some evidence list that may or may not be part of the argument.
* the Critiquer focussed on here and the various rule editors
* an interface editor
* an ASK network editor and browser
* a very lightweight experiment simulator
Indie is implemented in Digitool's Macintosh Common LISP 4.1, and generates stand-alone MCL applications.
In terms of complexities of the Indie systems built so far:
            Immunology   Volcano   Nutrition   Rembrandt
Points      120          36        1000        514
Rules       15           30        150         77
ASK nodes   217          150       600         620
Indie application sizes are dominated by graphics and video. Rembrandt, for example, has around 60MB of pictures (largely uncompressed) and nearly 4 gigabytes of video consisting of nearly 500 clips of experts talking about Rembrandt. Nutrition has 3 gigabytes of video.
All of these systems have at least 15 different screens, including introductions, tests, interviews, ASK zoomers, ASK browsers, report-building, and feedback. Each project took a team of 2 to 3 content analysts about 5 months to complete, with guidance and tool support by two graduate students. Immunology took almost twice as long, largely because it was the first Indie project, and had a programmer whose main role was to work around gaps in the first interface editor.
On average, the Indie team spent less than 3 hours a week communicating with each team, though more at the beginning or end of each project. Most of the interactions after the first week working with the tool were emails with suggestions, bug reports, or questions about the best way to "Indie-engineer" a critiquer rule or interface interaction.
The Indie tool has also been used by several groups of graduate students, both PhD and masters, in course projects. These projects go through Phase 2, building at least half of a complete scenario, including video and artwork. Volcano Investigator is one of the more successful student projects, built by first-year masters students in an intensive project in 4 months. A similar MS project underway now is Clinical Monitor (drug testing). Two recent class projects were Car Repair and KERMIT (the ecology of polluted ponds).
These many projects have helped us explore the "space" of Investigate and Decide GBS's. Encouragingly, Indie did not need any major change for the most recent student projects. It seems to have reached a stable point where there are enough options to satisfy typical needs and enough concrete examples to show how to use those options.
ACE [7] was an early interesting effort to apply natural language understanding and argument interpretation to the analysis of student explanations of Nuclear Magnetic Resonance (NMR) spectra. The focus was on finding incorrect and incomplete arguments with the pedagogical goal of making the student resolve the problems and re-articulate the argument. Conceptually, ACE is similar to the Indie 1.0 Critiquer, in that it matched the student argument against correct arguments. ACE used a great deal of domain knowledge in a narrow area. No attention was given to authoring such knowledge for new domains.
Belvedere [8] is a very recent effort to allow students to collaboratively articulate arguments about scientific issues using a graphical argument tool. Like Indie 2.0, Belvedere uses rules to analyze student arguments in order to suggest areas where the student needs to flesh things out or repair logical problems. Unlike an Indie GBS and ACE, Belvedere has no domain knowledge and doesn't try to understand the propositions in the arguments. On the one hand, this means Belvedere needs little or no knowledge engineering. On the other hand, it can only critique structural problems, such as missing support links or circular reasoning chains.
Indie GBS's in particular, and GBS's in general, sit between these two approaches to educational software. GBS's are neither as knowledge-intensive and closed as AI-based systems like ACE (and many other early systems), nor as knowledge-free and open-ended as Belvedere (and many other recent educational systems). The purpose of GBS tools is to make it possible to easily author large numbers of scenarios in many domains with a cost-effective level of knowledge engineering.
The usefulness of a tool is inversely proportional to its intelligence.
Authors don't want smart tools, they want tools that aren't stupid [3]. Stupidity can come from missing knowledge, but it can also come from tools that require knowledge engineering at the wrong time. In some important ways, the current Indie Critiquer is stupider than its predecessor. It's less robust and more prone to allowing logical inconsistencies and gaps. But it seems less stupid because it lets authors do what they want to do. It doesn't require knowledge to be authored until the need for that knowledge is clear.
Indie 2.0 lets authors move at their own pace from interface authoring to model authoring. It is our hope to be able to develop a version of the tool which will support gradual on-demand migration from the current rule model, which gives control but not robustness, to an argument model similar to Indie 1.0.
This work has been supported in part by the Defense Advanced Research Projects Agency, monitored by the Office of Naval Research, under contracts N00014-90-J-4117 and N00014-91-J-4092. The Institute for the Learning Sciences was established in 1989 with the support of Andersen Consulting.
2. Goldberg, A. Information Models, Views, and Controllers. Dr. Dobb's Journal (July 1990), 54-60.
3. Riesbeck, C. What Next? The Future of Case-Based Reasoning in Postmodern AI. In Case-Based Reasoning: Experiences, Lessons, and Future Directions, D. Leake, Ed. AAAI Press/The MIT Press, Menlo Park, CA, 1996, 371-388.
4. Schank, R. Goal-based scenarios: A radical look at education. Journal of the Learning Sciences 3, 4 (1994), 429-453.
5. Schank, R., Fano, A., Jona, M., and Bell, B. The design of goal-based scenarios. Journal of the Learning Sciences 3, 4 (1994), 305-345.
6. Shapiro, S. The SNePS semantic network processing system. In Associative Networks: The Representation and Use of Knowledge by Computers, N. V. Findler, Ed. Academic Press, New York, 1979, 179-203.
7. Sleeman, D., and Hendley, R. ACE: A system which Analyzes Complex Explanations. In Intelligent Tutoring Systems, D. Sleeman and J. Brown, Eds. Academic Press, London, 1982, 99-118.
8. Suthers, D., Weiner, A., Connelly, J. and Paolucci, M. Belvedere: Engaging students in critical discussion of science and public policy issues. In Proceedings AI-Ed 95, the 7th World Conference on Artificial Intelligence in Education (Washington DC, August 16-19, 1995) 266-273.