With the diffusion of digital technologies, problems that have been witnessed in the domain of personal computers since the 1980s (Shackel & Richardson, 1991) began to be observed in the use of once-humble products (Thimbleby, 1991). As a result, the conventional paradigm of consumer ergonomics was no longer sufficient to embrace all the dimensions of the user-product relationship.
The relatively complex cognitive processes involved necessitated the adoption of methods that traditionally belong to the domain of HCI. In a survey carried out in 1996 among 25 federated societies of the IEA, ‘usability of consumer products’ was ranked as the third most important emerging area in ergonomics, leaving ‘human computer interface’ behind (Helander, 1997). Since the 1990s, it has no longer been uncommon to come across cases in which consumer products are evaluated using techniques pertaining to HCI (e.g., Connel, Blandford & Green, 2004; Garmer, Liljegren, Oswalder & Dahlman, 2002; Lauretta & Deffner, 1996).
Being a fundamental technique in HCI, usability testing is one of the most frequently applied techniques in both design and evaluation. As the observation of participant behavior forms the backbone of the technique, it is empirical and somewhat objective in character. Given this, it is often the technique of choice when a systematic approach is required for eliminating evaluator biases as much as possible (Potosnak, 1988).
In the case of consumer products, applying HCI-specific methods while adhering ‘verbatim’ to the conventions valid for HCI may cause incompatibilities. Most of the time, problems arise because the two domains have dissimilar system paradigms, particularly with respect to how the ‘user’ is defined. Although it is possible to anticipate a wide user population within contemporary HCI theories and practice, the ‘user’ is traditionally conceptualized as a professional using a tool for sustaining her/his activity within the work domain. Such a tendency may be observed in the fundamental works in the usability literature (e.g. Nielsen, 1993; Dumas & Redish, 1993). Based on professional activity, users can be defined in terms of their occupations and the abilities and skills they are expected to possess. Added to this ‘screening’ effect is the homogenizing effect of personnel selection and training. Therefore, the user population is relatively homogeneous.
Given these, for professional products, it is usually possible to determine the characteristics of target users and ‘choose’ participants that represent the actual population, with the help of observable attributes such as job experience, education, and age. In the case of consumer products, working on homogeneous ‘subsets’ is not plausible most of the time, given that such products are usually intended for a larger portion of the population. For example, everybody in the world is a potential user of a cellular phone produced for global markets. Therefore, the diversity to be accommodated is quite large, and many user characteristics that vary both quantitatively and qualitatively should be considered.
1.1. Do we really test the interface?
The causes and consequences of the heterogeneity of the user population in the case of consumer products may best be illustrated with a speculative example:
Suppose that during the development process of an innovative cellular phone, the manufacturer wants to see whether users will easily adapt to the innovative interface. Furthermore, the manufacturer wants to compare the performance of this innovative design with that of its competitors and needs to verify that basic functions can be easily used by all users. Although usability testing would be the right choice to fulfill those needs, the test would not be able to yield unambiguous results.
Firstly, the possibility that the variance observed in user performance may be explained by individual differences causes methodological problems, and is hard to neglect, especially in the case of consumer products. Some participants may not be able to complete even a single task successfully; interpreting this result would be far from trivial. Was it the interface’s design that caused too many problems for the participants? Or was it the participants’ lack of experience with such innovative modes of interaction?
Secondly, when the task is to compare the design with its competitors, a methodological problem with ‘experiment design’ arises. Suppose that interface A is to be compared with three other products (B, C and D). It is evident that a single test in which each participant experiences all the interfaces is not possible, since such a test session would take too much time and it would be difficult to isolate and eliminate the effects of positive and negative transfer among interfaces. Therefore, one would look for experiment designs with more than one group. For example, there may be three groups, in each of which one competitor is compared with interface A, so that each participant uses only two interfaces instead of four. In such a design, participants in each group should be comparable with regard to individual differences that may directly influence the test results.
Thirdly, the manufacturer in the example above would never know whether the sample was representative enough to infer that ‘basic functions can be easily used by all users’, regardless of the level of success observed in the tests.
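The requirement that the three comparison groups be comparable can be sketched in code. The following is a minimal illustration, assuming that each participant has a single numeric expertise score (a hypothetical measure, not one defined here) and using blocked random assignment: participants are ranked by score, cut into blocks of three, and each block is spread over the three groups at random. The function name, participant identifiers and scores are all assumptions made for the example.

```python
import random
from statistics import mean

def assign_matched_groups(scores, n_groups=3, seed=0):
    """Blocked random assignment of participants to comparison groups.

    scores: dict mapping participant id -> expertise score
            (a hypothetical measure used only for illustration).
    Returns a list of n_groups lists of participant ids.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # lowest to highest score
    groups = [[] for _ in range(n_groups)]
    for start in range(0, len(ranked), n_groups):
        block = ranked[start:start + n_groups]  # participants with similar scores
        order = list(range(n_groups))
        rng.shuffle(order)  # random group order within each block
        for pid, g in zip(block, order):
            groups[g].append(pid)
    return groups

# Hypothetical scores for 12 participants
raw = [12, 35, 18, 40, 22, 27, 31, 15, 38, 20, 29, 33]
scores = {f"P{i:02d}": s for i, s in enumerate(raw, start=1)}

groups = assign_matched_groups(scores)
for label, members in zip(["A vs B", "A vs C", "A vs D"], groups):
    print(label, sorted(members), round(mean(scores[p] for p in members), 1))
```

Because every block contributes one participant to each group, the group means cannot drift far apart, which is the property the multi-group design depends on. Other matching schemes would serve equally well; this one is chosen only for brevity.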
The primary aim of any usability test should be to observe the effect of interface design on user performance, and eliminate all other interfering factors. Individual differences should be regarded as the most important factor to be eliminated or controlled since early studies show that huge variability in performance can be explained by individual differences among users, regardless of design or other factors (Egan, 1988). Experiential factors, among other individual differences, are known to have a significant effect on performance (e.g. Nielsen, 1993; Dumas & Redish, 1993).
Despite the famous phrase reminding participants that what is tested is the interface not their abilities, it is usually the participant’s familiarity with digital interfaces that is being reflected in results.
1.2. When does heterogeneity really cause problems?
Although the considerable effect of experiential factors on results indicates that a methodological flaw is present, this is not a criticism of the methodology of usability in general. Most of the time, usability tests are conducted to uncover major problems and to get a rough idea about the fit between the user and the system. It may be assumed that whether a test is carried out in ‘discount usability situations’ (Nielsen, 1993) or for strict, inferential purposes (Potosnak, 1988) determines how meticulously external factors should be controlled.
Usability tests are carried out for either summative (Quadrants 1 and 2 in Figure 1) or formative purposes (Quadrants 3 and 4 in Figure 1). In general, the aim is to evaluate or measure user performance in the former, and to diagnose usability problems and generate results that will lead to design decisions in the latter. Regardless of the nature of the research and the motivations behind it, representative sampling and heterogeneity of the user population are issues that require attention for obtaining plausible results, unless the only function of observations is to inspire usability experts who rely heavily on their expertise for anticipating usability flaws. However, it should be noted that when a valid inference is to be made from the results of a usability study, control over sampling-related factors that may affect test results becomes even more vital.
Although the sampling literature concentrates mainly on the sample size sufficient to discover the majority of usability problems (see Caulton, 2001 for a review), the probability of experiencing usability problems in a user test seems to be related to experiential factors. Therefore, all types of homogeneity assumptions, regarding age, gender, occupation, or experience, may prove to be inaccurate. If this is the case, then even the diversity and significance of the problems observed in a discount situation may not be plausible unless the sample is checked for serious biases in terms of the expertise levels of the participants involved. With a small sample size, even some of the most serious problems may not be encountered by the participants if the sample is heavily skewed in terms of experiential factors.
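The effect of a skewed sample can be made concrete with a small calculation. The sketch below assumes that each participant encounters a given problem independently with some probability, so that the chance of observing the problem at least once is one minus the product of the individual miss probabilities. This is the simple independence model commonly used in sample-size discussions, extended here to participants with different probabilities. The probabilities assigned to novice and expert participants are illustrative assumptions, not measured values.

```python
from math import prod

def p_problem_observed(encounter_probs):
    """Probability that at least one participant encounters the problem,
    assuming participants behave independently of one another."""
    return 1 - prod(1 - p for p in encounter_probs)

# Assumed per-participant probabilities of running into one serious problem
P_NOVICE = 0.50
P_EXPERT = 0.05

balanced = [P_NOVICE] * 3 + [P_EXPERT] * 2  # three novices, two experts
skewed = [P_EXPERT] * 5                     # five experts only

print(f"mixed sample:       {p_problem_observed(balanced):.2f}")
print(f"expert-only sample: {p_problem_observed(skewed):.2f}")
```

Under these assumed values, a five-person sample containing three novices has roughly an 89% chance of exposing the problem, whereas an expert-only sample of the same size has roughly a 23% chance. The sample size is identical in both cases; only its composition differs.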
1.3. Structure of the chapter
In this chapter, two approaches for accommodating the effects of individual differences among participants in usability testing of consumer products will be proposed.
In Section 2, after discussing how experiential factors are conventionally handled in usability tests, a working model that suggests a triadic relationship between experience, actual performance, and self-efficacy will be presented. In this regard, how expertise is constructed through a personal history of individual experiences with technological artifacts, and how these acquisitions are reflected in mechanisms of self-perception, will be discussed.
In Section 3, an approach based on performance observation will be illustrated by presenting the development process of a prototypical apparatus test aimed at assessing expertise.
In Section 4, a complementary approach based on the concept of self-efficacy will be put forward. In this part, a scale developed in order to measure a construct defined as General Interaction Self-efficacy will briefly be presented.
Finally, in Section 5, the conclusions drawn and further studies will be discussed.
2. A triadic model of experience, actual performance and self-efficacy
2.1. Conventional approach to experiential factors
Although representative sampling of participants finds support in the usability literature, suggestions about the factors to be considered are divergent. Furthermore, it is hard to come across suggestions about how to handle experiential factors.
Nielsen states that “sample should be as representative as possible of the intended users of the system” (1993, p. 175). According to him, for systems with large intended populations, such as consumer products, anyone can be a participant. He suggests that age and gender are among the most critical factors, as these may be significant in some cases. He further adds that both novices and experts should be involved in tests. He enumerates experiential factors as computer experience, experience with the particular system, and domain knowledge. Finally, he adds that some “less immediately obvious” factors such as basic abilities are known to play a role. Chapanis argues that “human characteristics that are important” (1991, p. 375) are sensory capacities, motor abilities, intellectual capacities, learned cognitive skills, experience, personality, attitudes and motivation. Dumas and Redish (1993) suggest that participants should be chosen directly from the target user population when possible. They state that experience and motivation are two important factors that explain differences among people, and propose a construct of experience similar to that of Nielsen (1993). The experiential factors to be considered are listed as work experience, general computer experience, specific computer experience, experience with the particular product, and experience with similar products (p. 122). The studies reviewed above share a common attitude, both in considering experience an important factor and in how they define it.
In questionnaires administered before usability tests, and in some tools developed for measuring computer experience, experience is usually, if not always, defined as the quantity, frequency and duration of participation in a task, or of interaction with a class of applications, a specific application, or computer systems in general (e.g. Bunz, 2004; Kinzie, Delcourt & Powers, 1994; Igbaria, Zinatelli, Cragg & Cavaye, 2001). Such an approach is valuable and has practical appeal for investigating the influences of experiential factors on various measures. Moreover, the fact that such information may readily be extracted by asking frequency-of-use questions before tests has many practical merits. Nevertheless, it is better to treat such information as a nominal variable that distinguishes users having some experience from users having none.
The problem with defining experience in the above-mentioned terms arises when experience is treated as a ratio variable that confounds performance, or as a substitute for a variable representing the transformations that occur during the learning process. For example, consider two individuals who have some experience with computers. Assume that individual A has 6 years of computer experience and uses computers 3 times a week, whereas individual B has 12 years of computer experience and uses computers every day. It is not safe to assume that individual B has a higher level of expertise than individual A, or that her/his performance will be better due to this difference, since there is no one-to-one relationship between what is experienced and what is retained.
Even after attending a formal education program, people show great variability in the extent of the knowledge and skills they have acquired (Ackerman & Humphreys, 1990). This is actually one of the motives behind the study of individual differences. The concept of expertise seems to be a proper starting point for arriving at a better way of approaching experiential factors. Expertise is defined as “aspects of skill and general (background) knowledge that has been acquired…” (Freudenthal, 2001, p. 23). With such an approach, experience is treated as a causal variable rather than a reflective one.
2.2. Definition of General Interaction Expertise
In a usability test, most of the time, if not always, participants experience a novel situation. In other words, either a new interface is being tested or participants are asked to complete novel tasks with a familiar interface. It is observed that participants try to grasp the designer’s model by navigating within the interface and trying to complete the tasks assigned to them. Some participants may predict the model with ease before a thorough experience, while others may never form a working model of the system that conforms to the actual model and keep experiencing problems.
Therefore, in essence, participants in usability tests are asked to adapt to a novel interaction situation. It is argued that a test participant’s expertise level, acquired by experiencing a diversity of interfaces, is one of the most determining factors that affect how s/he copes with this novel situation. The term suggested for this construct is General Interaction Expertise (GIE) (Berkman & Erbuğ, 2005), which may be briefly defined as:
2.3. Triadic model
In this study, the model suggested in Figure 2 will be utilized for comprehending the relationship between what is experienced (experience) and the manifestations of what is retained (GIE), i.e. expressions of permanent cognitive changes in the form of actual performance and self-efficacy belief.
This triadic model is in line with Bandura’s social cognitive theory (1986). Before going into a detailed discussion of the reciprocal relationships among the components of this model, the concept of self-efficacy should be briefly discussed.
Bandura suggests that as individuals experience a domain, they simultaneously develop a self-system called self-efficacy, which is a reflection of their actual performances. However, being more than a mere reflection, this system also influences cognitive processes and actions.
While discussing what the term ‘self-efficacy’ includes and excludes, Bandura asserts that self-efficacy is more than the possession of the underlying skills required for completing a particular task (1986). He maintains that “competent functioning requires both skills and self-beliefs of efficacy to use them effectively” (p. 391). Therefore, self-efficacy is proposed as a generative entity that makes it possible to use skills, yielding a desired outcome, within various contexts. In this regard, the concept is markedly different from outcome expectancies and can be delineated as an individual’s self-belief in attaining a certain level of performance. Bandura views self-efficacy as a functional mechanism rather than just a self-reflection on one’s own capabilities. Stemming from this argument, it is suggested that self-efficacy partly determines which actions are undertaken and which social milieus are entered. Therefore, as self-efficacy about a domain starts to grow, through its effects on choice behavior it starts to determine what is experienced and what is avoided by the individual, partly influencing the course of personal development.
Another effect of self-efficacy beliefs concerns breakdown conditions. It is argued that individuals with high self-efficacy beliefs do not easily give up when faced with obstacles and may even expend greater effort, as they may treat the problem as a challenge. Thus, it is asserted that individuals with strong self-efficacy beliefs tend to invest more effort and persist more in sustaining it.
A third effect of having strong self-efficacy beliefs is on the efficiency with which cognitive resources are focused on accomplishing the task at hand. Individuals with low self-efficacy tend to concentrate more on their limitations and shortcomings when they cannot proceed. Individuals with strong self-efficacy beliefs, on the other hand, concentrate on how to solve the problem and put more effort into dealing with ‘external’ problems.
Proceeding from this general conception of self-efficacy and the related mechanisms that stem from Bandura’s cognitive theory, it may be proposed that a user with strong self-efficacy regarding interaction may be expected to have a personal history of interaction in which positive experiences are dominant, to use digital interfaces more often, to exhibit persistent behavior in breakdown situations, and not to exhibit self-blaming behavior in case of an error.
2.4. Construction of GIE
In order to discuss how GIE is constructed, each link between the elements of the triadic model should be examined.
2.4.1. Experience - Actual performance (1)
The suggested relationship between experience and actual performance (see arrow 1 in Figure 3) is illustrated using the elaborated taxonomy suggested by Smith (1997).
It may be suggested that as individuals interact with a specific product, they acquire a system-specific component of expertise (SS). After experiencing a number of similar systems for carrying out the same task (for example, listening to music), an application-specific component (AS) of expertise is formed. Therefore, as people use specific systems with similar functionalities, they acquire an AS together with individual SS components. Domain-specific knowledge (DS), on the other hand, consists of all the knowledge and skills required for carrying out a specific task. For example, the etiquette of unmediated face-to-face communication may be situated within the DS of communication.
By coming across a variety of SS, AS, and DS, individuals acquire several kinds of schema-based expertise (see Preece, 1994), which help them to manage known systems and novel but familiar ones. Even if users face a totally novel application area, their expertise helps them to orient themselves to the new system, provided that the prior expertise bears sufficient commonalities with the novel situation.
Therefore, although the separate areas of AS and DS are illustrated in Figure 3 as if they do not overlap, in reality they do. Moreover, the areas of intersection among separate areas of SS are larger than depicted. This taxonomy is further clarified with a concrete example about using a washing machine, provided in Table 1.
2.4.2. Actual performance – experience (2)
The relationship between experience and expertise is suggested to be a reciprocal one (see arrow 2 in Figure 3). It may be argued that as an individual’s expertise is observed to improve over time, a social image will be formed and the probability of coming across novel interaction situations may eventually increase. For example, if an individual is known to be good at handling novel interaction situations, others may start to consult her/him frequently. Thus, if an individual’s observed expertise becomes prominent, it may affect what s/he will experience. On the other hand, if an individual is observed to be a poor performer, then others will not ask her/him for help or encourage her/him to get involved in novel interaction situations.
2.4.3. Actual performance – self-efficacy (3)
As mentioned earlier, as individuals experience a diversity of interfaces they form a self-efficacy belief (see arrow 3 in Figure 3). This belief may be strong or weak depending on how the outcome of the experience was perceived by the individual. In other words, an individual’s performance in novel interaction situations will be reflected in the form of self-efficacy belief.
2.4.4. Self-efficacy – actual performance (4)
As individuals develop self-efficacy beliefs about interaction, their actual performance with interfaces is influenced through several mechanisms (see arrow 4 in Figure 3). As discussed earlier, people with a strong self-efficacy belief are good at overcoming breakdown situations and focusing cognitive resources on problem solving. People with low self-efficacy may tend to get frustrated more easily, ask for help, or be prone to quit when confronted with a problem.
2.4.5. Self-efficacy – experience (5)
Individuals with strong self-efficacy beliefs with regard to interaction are expected to use digital interfaces extensively and to get involved frequently in challenging interaction situations. Individuals with low self-efficacy may choose not to use digital interfaces and may try to avoid challenging interaction situations as much as possible.
2.5. Actual performance and self-efficacy as manifestations of GIE
As defined by Cronbach and Meehl (1955), a construct is an attribute postulated to be possessed by individuals and reflected in behavior. It is developed “generally to organize knowledge and direct research in an attempt to describe or explain some aspect of nature” in a scientific inquiry (Peter, 1981, p. 134). It is only possible to make inferences about the attribute by examining its surface manifestations. Therefore, constructs can only be observed indirectly.
As depicted in Figure 3, GIE was treated as a construct, which is manifested in actual performance and self-efficacy beliefs. Although it was mentioned that there is a reciprocal relationship between experience and expertise (see 2.4), treating experience as a manifestation of GIE is methodologically inappropriate since ‘what is experienced’ is not a reflection but one of the causes of GIE in the first place.
3. Assessment of actual performance
In this section, the method devised for assessing the ‘actual performance’ component of GIE will be explored. The first step in developing an assessment tool was to suggest a way of recognizing expert behavior. For this purpose, a variety of cognitive theories were examined, and automatic loops of execution – evaluation and problem solving were judged to be types of behavior in which expertise could be assessed based on performance observation (see Berkman & Erbuğ, 2005). In the following paragraphs, the theoretical basis for this measurement strategy is briefly put forward.
3.1. Automated processing
Everyday activities that people carry out are usually composed of automated processes. It is possible to handle such tasks while attending to another one. Such a process of automation is observed in many of the sensory-motor tasks that are practiced frequently. After a sufficient period of experience, even demanding cognitive processes are observed to become automatic (Preece, 1994). From an information processing perspective, the phenomenon may be explained with the theory of automatic and controlled processing. Automatic processes demand little effort, may be unavailable to consciousness, and may be identified by their fluency, whereas controlled processes tap a considerable amount of cognitive resources and are slower than automatic processes (Sternberg, 1999). According to Ackerman (1987), after sufficient practice under consistent task conditions, controlled tasks may become automatic. For consistent tasks, improvements in performance are limited by the individual’s sensory-motor capacity or motivation to perform better.
Although it stems from a different school of thought, Activity Theory provides a similar explanation of the learning process. According to Vygotsky (1978), when people get involved in an activity, they make plans that help them to formulate actions, which are meant to satisfy certain sub-goals. Actions, then, are actualized by a set of operations. After individuals gain a certain level of expertise, actions and even whole activities are carried out as routine operations. However, when conditions vary, a simple operation will be handled as an Activity in itself (see Koschmann, Kuuti & Hickman, 1998 and Bodker, 1991 for a complete model).
Both theories have common points that give clues about ways of recognizing expert behavior:
The extent of expertise gained by practicing a task may be predicted by whether the task is automated, still under conscious control, or both.
After a certain level of automation is attained in a specific task, gains can be transferred to other tasks with similar conditions.
Therefore, sensory-motor fluency observed in an easy task with a familiar interface may be an observable indication of expertise. Individuals with a high level of GIE would have gained expertise by practicing similar tasks and may be expected to switch to automatic behavior after a brief orientation period.
3.2. Controlled processing
According to Norman (1990), in order for their goals to be fulfilled, individuals should be able to perceive and evaluate the current state of the world. The evaluation is then followed by a set of actions for changing the state of the world so that the goals are accomplished.
The steps of the cycle presented in Figure 4 are run until the goals are accomplished and “the world” is in the desired state. However, whether the flow is smooth or constantly interrupted, and whether a single iteration is enough or the cycle is run many times, may vary. The cycle may be so internalized by the user that both the concretization of goals and the interpretation of the world become only marginally necessary. Taken to the extreme, executions may dominate the cycle, minimizing even the need for feedback. A secretary making a fair copy of a hand-written letter without getting feedback from the monitor or keyboard is an illustrative example of such an extreme case of execution-dominated behavior. This type of interaction may be characterized by automatic processing, as discussed previously.
At the other extreme, there may be cases where the sequence of actions is not readily available to the individual, or where “interpreting the perception” is not possible. This usually occurs when people are confronted with serious problems, or when they come across a totally novel interface. In such cases, translating the intention to act into a meaningful sequence of actions and transforming perceptions into evaluations may be problematic. With similar concerns, Sutcliff et al. (2000) propose certain elaborations that transform Norman’s model so that the level of detail is sufficient to discuss breakdown and learning situations.
In Figure 5, certain shortcuts and sub-cycles are suggested to embrace rather extreme cases mentioned above.
Mack and Montaniz (1994) state that such extreme cases are represented by different sets of behaviors. They claim that when users are engaged with “well-understood” (p. 301) tasks, they often exhibit goal-directed behavior and utilize routine cognitive skills. However, when the task or the interface is a novel one, behavior is dominated by a problem-solving type of activity.
As far as the elaborated model suggested by Sutcliff et al. (2000) is concerned, this type of behavior is represented by the “error correct” and “explore” loops (see Figure 5). While discussing learning through experience, Proctor and Dutta (1995) typify this problem solving – learning behavior with cases of learning to operate complex devices without instructions. In a typical usability test, this is encouraged in order to see whether the interface provides an intuitive mode of interaction. Therefore, it is possible to state that, in almost every usability test, participants are first confronted with a problem-solving activity, hopefully followed by relatively smooth, uninterrupted task-action cycles.
In an experiment conducted by Shrager and Klar (1986, cited in Proctor & Dutta, 1995) on learning a complex device without instructions, it was observed that, after an initial orientation phase in which participants learned how to change the device state, they started to investigate the system systematically by generating hypotheses about ways of attaining task goals. These hypotheses were then tested, and the ones that survived helped participants to construct and refine the device model. Therefore, in the terms of Mack and Montaniz (1994), the systematic investigation phase represents problem-solving activity.
All the studies reviewed above indicate that some sort of problem-solving activity takes place especially when users are involved in novel interaction situations.
3.3. Development of General Interaction Expertise Test (GIE-T)
Based on the theories discussed above, it is suggested that GIE may be manifested in two fundamental types of behavior, namely automatic loops of execution – evaluation and controlled problem solving. In order to assess expertise by observing actual performance on tasks that target these two types of behavior, GIE-T, which consists of two prototypical apparatus tests, was developed.
The following set of heuristics guided the development process of the GIE_XEC test:
Task content should be neutral, so that prior knowledge specific to systems, applications and domains should not alter performance.
Test should not contain tasks that require cognitively complex processes.
Test should not be comprised of tasks that require novel modes of interaction.
Test should be comprised of familiar sub-tasks in order to maximize the effects of experience with digital interfaces on performance.
The task consisted of three simple sub-tasks, assumed to fall into the domain of automatic loops of execution and evaluation defined previously. In order to eliminate the effects of task content on performance, generic tasks were designed. Thus, the effects of SS, AS, or DS were ruled out. Task difficulty and novelty were adjusted to a level at which signs of automatic processing and fluency in completing sub-tasks would indicate that the participant has a certain level of expertise. Furthermore, it was expected that participants with a relatively higher degree of expertise would exhibit automatic behavior after a very brief period of adaptation to the specific conditions of the hard and soft elements of the interface.
Before the administration of the test, step-by-step instructions were provided; goals and methods of achieving them were clear. A trial session was provided for participants to familiarize themselves with interface and sub-tasks. After the trial, participants were asked to complete 5 identical trials in order to further increase the ratio of automatic behavior observed during performance. Steps to complete one trial were as follows:
Sub-task 1: Navigate and choose modify (‘değiştir’),
Sub-task 2: Navigate and choose ‘P’,
Sub-task 3: Complete the required modifications and choose confirm (‘onay’)
According to the initial findings, performance in GIE_XEC test may simply be represented by means of elapsed times recorded in 5 successive trials (see Berkman, 2007 for further discussion).
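One simple way to summarize such a record is sketched below. It assumes that the elapsed time of each of the five scored trials is logged in seconds and that the arithmetic mean is an acceptable summary; both the function name and the sample values are hypothetical, and the second figure (change from first to last trial) is added only as an illustrative indicator of how quickly performance settles.

```python
from statistics import mean

def gie_xec_summary(elapsed_seconds):
    """Summarize one participant's GIE_XEC record.

    elapsed_seconds: elapsed time of each of the five scored trials, in order.
    The arithmetic mean is an assumed summary, chosen for illustration.
    """
    if len(elapsed_seconds) != 5:
        raise ValueError("expected elapsed times for exactly 5 trials")
    return {
        "mean_time": mean(elapsed_seconds),
        # Negative values mean the participant got faster across trials.
        "first_to_last_change": elapsed_seconds[-1] - elapsed_seconds[0],
    }

# Hypothetical elapsed times (seconds) for one participant
summary = gie_xec_summary([14.2, 11.8, 10.5, 9.9, 9.6])
print(summary)
```

A lower mean time would then be read as more fluent, more automatic execution of the sub-tasks.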
The following set of heuristics was utilized in the design of the apparatus test that targets problem-solving behavior:
The goal state and the current state of the device should be apparent to the participants. Participants’ performance should not be hindered by the effort of understanding the goal state or comparing it with the current state.
The task should not require domain knowledge or a specific ability.
The task should be easy to complete without the interface. If the task were handled in an unmediated manner (e.g. with paper and pencil, or verbally), all of the participants should be able to complete it.
Task difficulty should be related to how the problem is represented, to flexibility in refining the representation, and to the selection of appropriate methods to control both external and internal processes.
The task should be complex enough to decrease the probability of success by chance.
Completion of the task should not require long procedures. This would ensure that the ratio of time spent on problem solving to time spent on keystrokes is large, and that performance is determined to a great extent by efficiency in problem-solving activity rather than by execution – evaluation loops.
Among many alternatives, one problem situation was chosen to be developed as an apparatus test.
Participants were asked to form a pattern of shapes that reproduces the goal pattern. The interface elements were a display and five push buttons. Three buttons were located so that each of them was coupled with a single-digit numerical display. A button labeled with an arrow pointing towards the screen was positioned on the right (redraw button). Another button labeled “done” (“tamam”) was positioned between the goal pattern and the display. By pushing that button, participants could indicate that the task was successfully completed (see Figure 7).
Parameters that could be manipulated were not described to participants. At the beginning of the test, the aim of the test was briefly described, together with some limited instructions about the task.
A typical sequence of actions taken by an expert user for accomplishing the task would be as follows:
Select the slot to be filled with the leftmost button,
Modify the type parameter with the middle button,
Select the appropriate value for the color parameter with the rightmost button,
Press redraw button to see the results,
After the goal state is reached, press the button labeled “done”.
According to the preliminary studies, the GIE_PS test should be regarded as a pass-fail item (see Berkman, 2007). Although elapsed time was considered as a measure that could yield more precise results, it was observed that some participants were not able to solve the problem and either quit the task or were asked to quit.
3.4. Validity of GIE-T
In order to evaluate validity, both tests were administered in usability tests of consumer products with digital interfaces, such as washing machines and dishwashers. Results show that GIE_XEC and GIE_PS scores correlate with effectiveness scores. Participants scoring high on the apparatus tests were observed to be more successful in completing usability scenarios. The correlation coefficients obtained with individual and combined scores of the apparatus tests were in the range of 0.46-0.76.
In a usability test in which a dishwasher with a menu-driven interface was tested with 15 participants, GIE_XEC scores were observed to be highly correlated with the number of tasks completed (0.68). In an experiment composed of three sub-tests, where 3 washing machines were compared with a primary interface, both GIE_XEC and GIE_PS scores correlated significantly with test results (0.69 and 0.46 respectively). Furthermore, when the scores were combined with a linear model, the correlation increased further (0.76).
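How two apparatus scores relate to an effectiveness measure, individually and in combination, can be illustrated with a short sketch. The data below are invented for illustration and are not the data of the study; the function names are hypothetical. The exact form of the linear model used in the study is not given in the text, so the sketch assumes an ordinary two-predictor linear regression, for which the multiple correlation follows from the three pairwise correlations by a standard identity.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    Assumes an ordinary least-squares model with both predictors.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(max(0.0, r_squared))


if __name__ == "__main__":
    # Invented scores for eight hypothetical participants.
    xec = [3.1, 4.0, 2.2, 5.1, 3.8, 4.6, 2.9, 5.5]   # fluency score
    ps = [0, 1, 0, 1, 0, 1, 1, 1]                     # pass (1) / fail (0)
    tasks_done = [4, 6, 3, 8, 5, 6, 5, 9]             # effectiveness

    r_xec = pearson(xec, tasks_done)
    r_ps = pearson(ps, tasks_done)
    r_between = pearson(xec, ps)
    print("XEC vs effectiveness:", round(r_xec, 2))
    print("PS vs effectiveness:", round(r_ps, 2))
    print("combined:", round(multiple_r(r_xec, r_ps, r_between), 2))
```

Under this assumption the combined coefficient can never be lower than the larger of the two individual coefficients, which is consistent with the pattern reported above.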
Although there is still much to do in order to develop and investigate the validity of the GIE-T as it is presented in this study, initial results show that systematic assessment of ‘actual performance’ or similar constructs may increase the researcher’s control over experiential factors. The following points summarize the outcomes of the studies completed so far and what can be done in the future.
Although a small number of sub-tasks seems to be sufficient for recognizing expert behavior during automatic processes, several pass-fail items are necessary for assessing problem-solving behavior. Apparatus tests developed in accordance with the aforementioned heuristics would yield similar results in terms of assessing GIE and predicting performance in usability tests. Therefore, researchers can tailor their own apparatus tests when the cost of development is not a problem. Although it is quite an ambitious goal, standardized tests may be developed for setting normative standards. This way, the results of individual studies would be comparable. It is recommended that apparatus tests be updated from time to time in order to make sure that they comply with the contemporary culture of interaction as much as possible. Finally, it should be noted that observing participants before the actual usability tests has merits of its own. Observing how participants behave beforehand may provide the following information that may be helpful in moderating tests:
4. Assessment of self-efficacy – General Interaction Self-efficacy Scale (GISE-S)
As far as the measurement strategy proposed in this study is concerned, the assessment of self-efficacy serves two purposes. First, by definition, one of the manifestations of GIE is self-efficacy belief. Thus, assessment of self-efficacy is proposed as a means of measuring GIE and of complementing the performance measurement approach adopted in GIE-T. Second, like many other tools developed for measuring attitudes (Spector, 1992), a paper-based tool would be appropriate for assessing self-efficacy; it provides the opportunity to assess GIE with a tool that is easier to administer, even in groups and without any equipment or trained administrators.
4.1. Development of GISE-S
In order to identify the essential steps of the development procedure, both basic material on the fundamentals of scale development (e.g. Crocker & Algina, 1986; Spector, 1992; DeVellis, 1991; Netemeyer, Bearden & Sharma, 2003) and focused discussions on technical and theoretical issues were reviewed. After a comparative examination of the selected procedures, the attributes common to all of them were identified. The steps followed in scale development are listed below:
Development of item pool
Initial item tryout
Major data collection
Preliminary reliability and validity studies
The concept of ‘self-efficacy’ is frequently utilized to measure and even predict performance. According to Pajares, “what people know, the skills they possess, or what they have previously accomplished are not always good predictors of subsequent attainments because the beliefs they hold about their capabilities powerfully influence the ways in which they will behave” (1997, 7). In line with this view, researchers developed many scales that targeted ‘computer self-efficacy’ (e.g. Murphy, Coover & Owen, 1989; Compeau & Higgins, 1995). Suggested as ‘more than just a mere reflection of performance’, the concept of ‘self-efficacy’ was considered as a framework for defining the construct that will form the backbone of the scale under development. However, there are some pitfalls to be avoided when defining the specific construct and scale development.
According to Compeau & Higgins (1995), concentrating on individual sub-skills rather than self-efficacy beliefs for accomplishing tasks is a misconception exhibited by some researchers.
While discussing common errors in assessment, Bong (2006) maintains that self-efficacy should not be confused with other self-referent constructs such as self-esteem and self-concept. Bong argues that constructs that claim to be a type of self-efficacy should concentrate on one’s confidence in accomplishing a task, and not on self-worth or self-perceptions regarding a specific domain.
Another error to be avoided is ignoring the context-specific nature of self-efficacy constructs. Consequently, measurements should not be based on self-assessments made in a vacuum, and respondents should not be forced to weigh their self-confidence in highly abstracted situations. Finally, Bong (2006) warns that the beliefs assessed should match what is to be predicted. In other words, the predictive power of self-efficacy is maximized when these beliefs are about tasks that are related to the criterion variable.
Bandura (2006) states that perceived capability should be targeted by items “phrased in terms of can do rather than will do” (p.308) so that intentions are not mistaken for self-efficacy perceptions. He also warns against the danger of focusing on outcome expectancies.
4.1.1. Construct definition
General Interaction Self-Efficacy (GISE) is specified as individuals’ self-efficacy perceptions with regard to learning new devices. Although the core definition may seem too narrowly formulated, it will be used primarily in relation to usability tests, where participants are involved in a novel interaction situation. Therefore, long-term appropriation of digital products and long-term transformations in the nature of interaction are excluded.
General Interaction Self-Efficacy (GISE) is a judgment of capability to establish interaction with a new device and to adapt to novel interaction situations.
In accordance with this definition, GISE has two facets. First, GISE is related to learning to use new devices. In this regard, it is the capability to learn how to use a new device under unfavorable conditions, as well as the ability to sustain adaptation in the absence of factors that enhance the process. Second, it is the ability to reorient, recover interaction and survive in a multitude of breakdown situations.
4.1.2. Development of item pool
Scale development may be regarded as a subtractive process where refinement of a large set of items is done in successive steps. In each step, items that are observed to exhibit certain flaws are eliminated. Therefore, quality of a scale is determined by the quality of items in the initial item pool.
There are no well-established theories in the self-efficacy literature that may be helpful in content sampling during item generation. In order to grasp users’ perceptions of the factors that influence adaptation processes positively or negatively, an empirical study was conducted. Data were collected with a self-administered questionnaire, titled Learning Electronic Devices Questionnaire (LEDQ), which consists of open-ended questions. Respondents were asked to report favorable and then unfavorable situations for learning electronic devices.
A total of 550 expressions were gathered from 102 respondents. 287 of the expressions were negative whereas 269 were positive. After the elimination of some problematic expressions, 425 of them were retained for item generation. In order to gain an understanding of the content domain and the distribution of expressions, the data were arranged into categories. The extracted semantic model is presented in Figure 8.
It should be noted that the model depicted in Figure 8 should not be mistaken for a factual model based on empirical findings. The rationale behind constructing such a model was to gain insight about users’ perceptions about adaptation to new interfaces, guiding the item generation and item reduction processes.
Item generation was based on the extracted expressions and a well-established guideline on user interface design (Nielsen, 1994). At the end of the process, a total of 242 items had been generated. A sample of items is provided in Table 2. After being given instructions on how to rate the items and a short exercise, respondents were asked to rate their confidence in learning a new device under the circumstances depicted in each item.
4.1.3. Item reduction
Item reduction was done in three successive steps. First, the items were rated by five experts. Second, the surviving 104 items were tested in a pilot study (N=52). Third, the remaining 92 items were administered to a sample of 442 respondents.
4.1.4. Factor structure
After the final data were gathered, an exploratory factor analysis was conducted in order to determine the number of factors and uncover the factor structure. In the first iteration, in compliance with the Kaiser-Guttman criterion, a 9-factor solution was arrived at. A close inspection of the item groupings indicated that the 9-factor solution was quite comprehensible. When the items included in these factors were evaluated, it was evident that the preliminary semantic structure suggested (see Figure 9) was almost fully reflected in the factor structure derived from the analysis. However, because no item loaded sufficiently on Factors 8 and 9 (loadings below 0.50), a 7-factor solution was forced in the second iteration.
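The two decision rules mentioned in this paragraph, retaining factors with eigenvalues greater than 1 (the Kaiser-Guttman criterion) and dropping factors on which no item loads at 0.50 or higher, can be sketched as follows. The eigenvalues and loadings are invented for illustration, the function names are hypothetical, and the use of absolute loadings is an assumption; the sketch does not reproduce the analysis software or the data used in the study.

```python
def kaiser_guttman_count(eigenvalues):
    """Number of factors whose eigenvalue exceeds 1."""
    return sum(1 for value in eigenvalues if value > 1.0)


def weakly_loaded_factors(loadings, threshold=0.50):
    """Indices of factors on which no item reaches the loading threshold.

    loadings: one row per item, one column per factor.
    Absolute values are compared (an assumption of this sketch), so a
    strong negative loading also counts as a sufficient loading.
    """
    if not loadings:
        return []
    n_factors = len(loadings[0])
    return [
        factor
        for factor in range(n_factors)
        if all(abs(row[factor]) < threshold for row in loadings)
    ]


if __name__ == "__main__":
    # Invented eigenvalues of an item correlation matrix.
    eigenvalues = [3.2, 1.4, 1.1, 0.9, 0.4]
    print("factors retained:", kaiser_guttman_count(eigenvalues))

    # Invented loadings: three items on three factors.
    loadings = [
        [0.70, 0.10, 0.30],
        [0.60, 0.20, 0.45],
        [0.10, 0.80, 0.20],
    ]
    print("weak factors:", weakly_loaded_factors(loadings))
```

In a procedure like the one described, the number of factors forced in the next iteration would be the retained count minus the number of weakly loaded factors.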
In the second iteration, the item groupings in the 7-factor solution were still theoretically comprehensible. The remaining factors and sample items are listed in Table 3.
4.1.5. Initial validity studies
During major data collection, some additional data were gathered in order to conduct a preliminary validity analysis. These additional data consisted of age and the number of types of electronic devices experienced (NED). Initial findings indicate a positive correlation between NED and a coarse estimate of the GISE-S score (EGISE-S). Age correlated negatively with both NED and the EGISE-S score. The highest correlation coefficient was observed between NED and the EGISE-S score (see Figure 10). The scale should be finalized and additional data should be collected from a different sample in order to obtain a reliable analysis. However, it should be noted that the initial findings are in line with the hypothesized relationship between experience and self-efficacy (see 2.4 and Figure 10).
Initial results provide preliminary evidence that the GIE model proposed here may prove useful for measurement purposes. Preliminary analyses indicate that, in their fully-fledged forms, GIE-T and GISE-S may be valuable tools for sampling, or may be administered when control over experiential factors is necessary. Depending on the nature of the research, the tools may be administered in combination or individually, or in reduced forms. GISE-S, being a paper-based tool, has certain advantages over GIE-T, such as cost and ease of administration. However, administration of GIE-T provides the opportunity to observe the actual performance of participants. A variety of real-life studies, in which the tools are administered in parallel with running usability projects, are necessary to weigh the cost-effectiveness of both tools.
Measurement of GIE may be helpful for justifying certain assumptions regarding the participant profile, for manipulating GIE as an independent variable, or for ascertaining that the effects of GIE on test results were kept to a minimum. Furthermore, if normative standards are determined, the tool may also be used to evaluate the usability of interfaces in absolute terms. In other words, it would be possible to identify interfaces that require high levels of GIE and those that do not. A final merit of pre-evaluating participants would be the ability to detect individuals who exhibit intolerable levels of test / performance anxiety before the actual usability test.
Additional studies are necessary for refining GIE-T and GISE-S, increasing the variety of prototypic tools, and finally demonstrating that acceptable reliability and validity levels are attained in various research settings.