Why would we want to track physiological data?
You've just finished the alpha version of your game. During the user-testing stage, you would want to be able to collect information about the players' abilities to actually complete the level, as well as get some idea of how engaged they were while they were playing (and whether or not they were actually having any fun).
That first part is fairly straightforward quality assurance. You can capture all the information you need about in-game activity just by recording the game and tracking in-game metrics like the amount of time taken to complete the level and the number and nature of any bugs encountered during those trial runs. It's the latter bit where things get tricky.
Traditionally, if you wanted to get an idea of what people thought about something you made, your options for getting that information were primarily based on asking questions, writing down the answers you received, and praying your testers were competent, honest humans with good memories and decent language skills.
All of these are fantastic methods of gathering information about gameplay and user experience, and I'm not going to debate that. I will say, though, that traditional user-testing methods suffer from two major limitations.
They are not objective. Any time you ask someone a question, the answer you receive will be based entirely upon that person's subjective experience, and as such, cannot be reliably compared to that of any other person.
They are not quantifiable. Yes, it is possible to perform statistical analysis on survey results, and yes, you could totally say "9 out of 10 players said they had fun", but there is no way a talk-aloud trial, for example, could tell you a player was definitively more aroused at point A than point B, nor would direct observation be able to tell you if that one guy who kept turning right for an hour and sending himself in circles was getting more or less agitated over time.
To get this information, we have to turn to biology.
When talking about video game user testing, most of the experiments to date have fallen into two categories.
1. Experiments in which physiological data is collected in addition to traditional data.
In these experiments, physiological data is treated as an additional source of information about subjects' mental states. Ideally, this means that at the end of your experiment, you will be able to say things like "The majority of players showed signs of heightened arousal during gameplay. This is further supported by information collected during post-gameplay interviews. As such, we can confidently assume that this particular level is engaging and fun for users."
2. Experiments in which physiological data is used to shape traditional data-collection.
In contrast to the "Traditional plus physiological data" experiments described above, these "shaping" experiments are a three-phase process in which data is collected, mapped to a timeline or recording of the users' test run, and then used to guide post-testing data collection by identifying critical moments/events during gameplay. E.g. "Your heart rate increased dramatically at this point. Do you remember what you were thinking at that time?" This particular method has been shown to be extremely effective at identifying significant events, with one experiment finding 63% more issues in the biometric-supplemented trials compared to observation alone (Pejman et al. 2011).
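To make the "map it to a timeline" step a little more concrete, here is a minimal sketch in Python. It assumes you already have heart rate samples as (seconds into session, beats per minute) pairs. The function name flag_critical_moments, the sample format, and the two-standard-deviation cutoff are all placeholders of mine, not anything taken from the study cited above.

```python
from statistics import mean, stdev


def flag_critical_moments(samples, threshold_sd=2.0):
    """Return the timestamps where heart rate rises well above the session average.

    samples: list of (seconds_into_session, beats_per_minute) tuples.
    threshold_sd: how many standard deviations above the mean counts as a spike
                  (an assumed default; tune it for your own data).
    """
    rates = [bpm for _, bpm in samples]
    cutoff = mean(rates) + threshold_sd * stdev(rates)
    return [t for t, bpm in samples if bpm > cutoff]


def to_clock(seconds):
    """Format a timestamp as mm:ss so it can be matched to the gameplay recording."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"
```

The flagged timestamps are only candidates for follow-up questions. Whether a given spike actually meant anything is still something you find out in the interview.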
In either case, physiological data is valuable for user testing not only because it can serve as an additional source of information about users' psychological states, but also because these same records can be used to guide and modify traditional methods so as to produce data that is both more reliable and more comprehensive than traditional methods alone.
So now that we know why you might want to include physiological data in your user-testing experiments, let's get into the how.
The first step in this process is going to be determining what kind of information you want to collect. Different monitoring devices have different strengths and weaknesses. Measuring how tightly someone grips a controller, for example, can be a good way to track how aroused someone is, but it doesn't tell you much about valence. The physical size and shape of the equipment you will be using are also going to vary, as will its potential effects on gameplay and user experience.
The items on this list are techniques which have been used in the user-testing experiments with which I am familiar and which have already been shown to produce reliable/valid data. It is by no means a completely comprehensive list of all possible user-testing methods.
This topic actually got its own post over on my other blog, but to review: Electroencephalography (EEG) is the recording of electrical signals along the scalp. You can track them with a sensor net, an EEG cap, and, now that there's a market for them, headsets and toys like the Nekomimi. They've been shown to be pretty useful at identifying critical in-game events, but they can be rather invasive, expensive, and time-consuming. Also, the data collected from them can be fairly difficult to interpret.
Useful for: Monitoring attention/boredom, could be useful for identifying common patterns and behaviors, identifying in-game events which may trigger significant changes in focus
Eye tracking is a good way to get an idea of where your users are looking during play, as well as how fast their eyes are moving. By far the most common method for doing this is by using camera-based eye tracking systems.
The second method, electrooculography (EOG), measures the resting potential of the retina, which changes based on eye orientation. EOG measurements are already used in motion capture to faithfully track the positions of actors' eyes, and have the added advantage of being fairly non-invasive, as the electrodes used do not interfere with the subjects' field of vision.
On the other hand, the lack of any standardized electrode configuration means it's difficult to compare your results with those of other researchers, the signal itself can be cluttered with blinking-artifacts (which are exactly what they sound like), and the whole setup requires a much higher sampling rate than other methods.
Useful for: Determining where people were looking and how long they were looking at whatever was there, and identifying distractions, areas of interest, and users' ability to identify important in-game elements visually. Unfortunately, gaze is not a functional proxy for "attention", nor would it give you much information (in 3D environments) about how far "out" someone was looking.
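As a rough illustration of the "how long were they looking at it" question, here is a small Python sketch that totals gaze time per on-screen region. It assumes your eye tracker hands you screen-space (x, y) samples at a fixed interval. The region names, the rectangle format, and the function name dwell_time_per_region are all hypothetical.

```python
from collections import defaultdict


def dwell_time_per_region(gaze_samples, regions, sample_interval):
    """Total the seconds of gaze that landed inside each named screen rectangle.

    gaze_samples: list of (x, y) screen coordinates, one per tracker sample.
    regions: dict mapping a region name to (left, top, right, bottom) in pixels.
    sample_interval: seconds between consecutive samples (assumed constant).
    """
    totals = defaultdict(float)
    for x, y in gaze_samples:
        for name, (left, top, right, bottom) in regions.items():
            if left <= x < right and top <= y < bottom:
                totals[name] += sample_interval
    return dict(totals)
```

Regions that never receive a sample are simply absent from the result, which makes it easy to spot interface elements nobody looked at.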
Electromyography (EMG) records the electrical activity produced by skeletal muscles. Facial EMG in particular can be useful because it allows one to track the muscles involved in making facial expressions like smiling or frowning, and as such, can give you a good idea of the nature (positive or negative) of subjects' emotional states during play.
Unfortunately, even if your test subjects were totally comfortable with a bunch of sensors on their faces, you'd still have to sacrifice their ability to speak. You're going to get enough hassle from normal recording artifacts, so you don't want to create any more from people talking.
Useful for: Using as a proxy for measuring valence, as it enables you to capture the activity of the muscles involved in making facial expressions.
Galvanic Skin Response/Skin conductance is a measure of the electrical conductance of your skin, and varies based on how moist your skin is at any given point in time. GSR can be used as an indicator of arousal because your sweat glands are controlled by the sympathetic nervous system.
Possible issues with this method come from several places. The temperature and humidity in which you are operating can have a significant effect on readings, and make the task of comparing readings from different sessions rather difficult. Internal factors, both biological and psychological, can also lead to depressed readings, or a complete lack of significant variation, depending on the subject.
Useful for: Monitoring arousal/stress, identifying in-game situations (not one-off events) which may increase stress or arousal over time.
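One way to soften the session-to-session comparison problem is to rescale each session's readings against that session's own minimum and maximum before comparing anything. This is an assumption on my part about how you might handle it, not something the experiments above prescribe, and normalize_session is a made-up name. It does nothing about a subject whose readings barely vary in the first place.

```python
def normalize_session(readings):
    """Rescale one session's skin conductance readings to the range 0.0 to 1.0.

    readings: list of conductance values from a single subject in a single session.
    A flat session (no variation at all) is returned as all zeros.
    """
    low, high = min(readings), max(readings)
    if high == low:
        return [0.0 for _ in readings]
    return [(r - low) / (high - low) for r in readings]
```

After rescaling, a value of 0.8 means "near the top of what this subject showed in this session", which is easier to compare across rooms and days than the raw numbers.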
There are actually a number of different cardiac responses one can track for the purposes of measuring arousal, including interbeat intervals, heart rate, heart rate variability, and blood pressure. The obtrusiveness of these methods depends on what's being tracked, and can range anywhere from arm cuffs to video analysis. Metrics like heart rate variability and systolic blood pressure, for example, have been shown to be fairly reliable indicators of "invested effort" and can be used to detect immediate changes in mental workload. The limitations of these methods depend upon the technique being used and, as with the other metrics, readings can differ significantly between individuals.
Useful for: Identifying immediate responses to game events, monitoring arousal, and may be used as a proxy for subjects' mental workload.
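If your hardware gives you interbeat intervals, both heart rate and a basic heart rate variability figure can be derived from them. The sketch below assumes intervals in milliseconds and uses RMSSD (the root mean square of successive differences), which is one common variability measure. The experiments I'm describing may well have used something else, and the function names are mine.

```python
from math import sqrt


def mean_heart_rate(ibis_ms):
    """Average heart rate in beats per minute from interbeat intervals in milliseconds."""
    return 60000.0 / (sum(ibis_ms) / len(ibis_ms))


def rmssd(ibis_ms):
    """Root mean square of successive differences between interbeat intervals.

    Needs at least two intervals. Larger values mean more beat-to-beat variability.
    """
    diffs = [later - earlier for earlier, later in zip(ibis_ms, ibis_ms[1:])]
    return sqrt(sum(d * d for d in diffs) / len(diffs))
```

Computing these over a sliding window rather than the whole session would give you something you can line up against in-game events.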
While respiration is indeed sensitive to changes in mental workload and emotional states, it's also one of the few physiological responses used in user testing that almost everyone can control, consciously, without much effort or training. As such, while it can be easy to measure, it may not be as useful as the other methods outlined above if you're looking for raw physical responses. Also, as with facial EMGs, your subjects wouldn't be able to talk while you're recording their breathing.
Useful for: Testing games which may contain an actual physical element to gameplay in addition to, or in lieu of, traditional input.
Once you know what you will be collecting, keep the following questions in mind when designing your experiment:
1. Are we making sure it works, looking for general input and information, or are we doing an actual game-design experiment?
User testing is not quality assurance. Quality assurance is, and should always be, conducted as a separate activity. It has a specific goal, and the people who do it need to be intimately familiar with the game they're testing. They literally cannot give us the kind of information we're looking for in the other two situations.
If you are looking for feedback, then you want to collect as much information as possible about as many game elements as you can. Your goal will be to identify anything which might be interesting, troublesome, or worth expanding upon. Ask questions not only about gameplay ("Were the goals clear?") but about emotional experiences as well.
If, however, you would like to test something specific, something that you can use as an actual test variable, then you will be performing an actual experiment, and you should act accordingly. Strive for consistency and repeatability, and never change more than one variable at a time.
2. How is the physiological data we collect going to be used?
Next, make a decision about when you are going to use this information. You can collect data for post-test analysis, record data to be played back immediately as a means of guiding post-gameplay interviews, or do both. (...or some shiny new method that you just came up with, in which case I want to hear about it.)
3. How will we attempt to control for the presence of the monitoring equipment? (If at all?)
Certain equipment can distract test subjects and affect their performance during gameplay, while others may have little to no effect whatsoever. Depending on the method being used, you may not feel the need to control for the presence of your monitoring devices, and that's okay.
If you do want to control for such things, however, you can do so by creating three experimental groups: one with no physiological data collection; one with no traditional data collection; and one with both. Alternatively, if you have a whole lot of time and resources at your disposal, you could just try to integrate monitoring devices into the normal input/gameplay hardware itself.
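For the three-group setup, you would want to assign testers to groups at random rather than by hand. Here is a minimal Python sketch of that, assuming you just have a list of participant identifiers. The group labels and the function name assign_groups are hypothetical, and the round-robin dealing after a shuffle is simply one easy way to keep group sizes even.

```python
import random

# "traditional_only" has no physiological data collection,
# "physiological_only" has no traditional data collection,
# "both" has both.
GROUPS = ("traditional_only", "physiological_only", "both")


def assign_groups(participants, seed=None):
    """Randomly split participants across the three groups, keeping sizes within one of each other.

    seed: optional value so the same assignment can be reproduced later.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    assignment = {name: [] for name in GROUPS}
    for index, person in enumerate(shuffled):
        assignment[GROUPS[index % len(GROUPS)]].append(person)
    return assignment
```

Recording the seed alongside your results means anyone repeating the experiment can see exactly who ended up where.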
The integration of physiological data into user testing and research can have significant impacts on both the quality and quantity of the data collected. If you are creating a good or service of any sort, it is of the utmost importance that you perform both quality assurance and user-testing as early and as often as possible. Make sure you understand what you are looking for, as well as the pros and cons of the methods you can use to gather that information so that you can frame your questions properly, and design your experiments appropriately.
For more information about implementing physiological monitoring methods in user-testing:
Papers and Academic Articles