Telepresence for formal presentations


People in academia and in companies often take part in lectures and formal presentations that have a live audience and a remote audience at the same time. The remote audience typically follows through programs such as Skype or Google Hangouts. Although both audiences watch the same presentation, they perceive and understand it differently.

A survey of thirty people suggested several possible reasons why remote spectators perceive a presentation differently from live spectators:

  • Different video feeds between presenter and slides
  • Presenter movements
  • Interactions between presenter and slides, like pointing

Addressed problem

The hypothesis is that spectators' pointing inference, that is, their ability to tell what the presenter is pointing at, changes between watching the lecture live and watching it remotely. A loss of inference could lower the understanding of the presentation and is therefore a good area to research for improvements.

The main goal of this project is to measure the pointing inference performance of a group of spectators, in order to investigate whether reality immersion is better or worse than video conferencing, taking into account their performance during a live experiment.

The approach

The experiment has three phases. First, a group of spectators watches a presenter point to six elements on a board thirty times, in an order unknown to them. Second, the same group watches thirty videos of the presenter pointing to the elements, in a different order from the previous phase. Third, the same group watches the presenter perform the same task, again in a different order, in a 3D reality immersion environment. During each phase, every member of the group notes in a predefined grid which element the presenter pointed to.


  • Presenter: The presenter has a list of 30 iterations. Each iteration corresponds to a number between 1 and 6, which he is supposed to point at with his index finger. Every time he points to a number, he also looks at it, turning his head. The presenter also wears a reflective marker on his index finger and a structure of reflective markers (called the rigid body throughout this report) attached to his forehead.
  • Moderator: The moderator is responsible for controlling the video capture device (starting and stopping it) and for announcing, out loud, the number of the current iteration to the spectators. During the reality immersion part of the experiment, the moderator also takes notes of the spectator's responses.
  • Spectator: The spectator watches a person (the presenter) point at 6 numbers on a board 30 times and has a predefined grid to mark which number he/she thinks the presenter pointed at. After that, the spectator repeats the experiment twice: first watching a series of videos of the presenter pointing at the numbers, in a random order, and then using a reality immersion application, also in a random order. During the reality immersion phase, the spectator tells the moderator which number he/she thinks is the correct one.


The scenario consists of:

  • A white board with six numbers displayed as shown in figure 1
  • Three tracking cameras arranged as shown in figure 2
  • A presenter wearing reflective markers, as shown in figures 3 and 4
  • A video camera recording the whole presentation, as shown in figure 5
  • A group of spectators positioned with the video camera
Figure 1 - Board with elements
Figure 2 - Camera arrangement
Figure 3 - Rigid body markers
Figure 4 - Reflective markers for index finger
Figure 5 - Video camera position

Live presentation

The group of spectators is placed at the end of the room, facing the board and the presenter. The presenter has a predefined list of thirty iterations, each one holding a number between 1 and 6 that he is supposed to point to. This list is generated randomly before the experiment and is unknown to the spectators.

The spectators receive a predefined grid of thirty lines and six columns, in which they note, for each iteration, the number they think was pointed at. A moderator tells all spectators which iteration is the current one.
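The random list and the comparison of a spectator's grid against it can be sketched as follows. This is only an illustration, not the script used in the experiment: the function names `generate_targets` and `score` and the seed value are assumptions.

```python
import random

NUM_ITERATIONS = 30  # iterations per phase, as described above
NUM_ELEMENTS = 6     # numbers displayed on the board


def generate_targets(seed=None):
    """Return a random list of 30 targets, each a number between 1 and 6."""
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    return [rng.randint(1, NUM_ELEMENTS) for _ in range(NUM_ITERATIONS)]


def score(targets, answers):
    """Fraction of iterations where the noted answer matches the target."""
    if len(targets) != len(answers):
        raise ValueError("one answer is required per iteration")
    hits = sum(1 for target, answer in zip(targets, answers) if target == answer)
    return hits / len(targets)


# Hypothetical usage: the seed is an arbitrary illustrative value.
targets = generate_targets(seed=2024)
```

A spectator's grid can then be reduced to one answer per line and passed to `score` together with the presenter's list.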

Movement capture

All the movement capture is done using the OptiTrack system. OptiTrack is a set of infrared cameras that capture reflective markers and locate them in space. Every second, the data is broadcast to a local network.

As previously said, the presenter wears two sets of markers on his body:

  • The first set is a fixed structure with three reflective markers, attached to his forehead, also known as the rigid body.
  • The second is a single reflective marker on his index finger.

The rigid body makes it possible to capture not only the exact position of the presenter but also the direction in which he is looking when pointing to an element. The marker on the index finger is used to capture the direction in which the presenter is pointing.

Using the open source C++ library NatNetLinux, it is possible to connect to the multicast group and receive the data streamed by the Motive tracker system (responsible for interpreting the data from the cameras). Once connected to Motive, it is possible to generate the following feed:

Frame: 461998
Rigid body: true
loc: ( 0.621, 0.445, 1.156 )
rigidMarker: ( 0.603, 0.468, 1.219 )
rigidMarker: ( 0.651, 0.474, 1.112 )
rigidMarker: ( 0.608, 0.394, 1.139 )
Markers: true
marker: ( 0.343, 0.421, 0.524 )
-- END --

This feed indicates:

  • The number of the frame, for identification purposes
  • Whether a rigid body was captured
  • If a rigid body was captured, its position and the position of each of its markers
  • Whether any single marker was captured
  • If markers were captured, the position of each one of them
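A feed in this textual form is straightforward to turn into a structured record. The sketch below is an assumption about how this could be done, not part of the project: the function names and the dictionary keys are hypothetical, and it only handles the fields visible in the sample above.

```python
import re

# Matches "( x, y, z )" with optional signs and decimals.
VECTOR = re.compile(
    r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)

SAMPLE = """Frame: 461998
Rigid body: true
loc: ( 0.621, 0.445, 1.156 )
rigidMarker: ( 0.603, 0.468, 1.219 )
rigidMarker: ( 0.651, 0.474, 1.112 )
rigidMarker: ( 0.608, 0.394, 1.139 )
Markers: true
marker: ( 0.343, 0.421, 0.524 )
-- END --"""


def parse_vector(text):
    """Extract one (x, y, z) tuple of floats from a line fragment."""
    match = VECTOR.search(text)
    if match is None:
        raise ValueError(f"no 3D vector in {text!r}")
    return tuple(float(group) for group in match.groups())


def parse_frame(lines):
    """Parse the lines of one frame (up to '-- END --') into a dictionary."""
    frame = {"frame": None, "rigid_body": False, "loc": None,
             "rigid_markers": [], "has_markers": False, "markers": []}
    for raw in lines:
        line = raw.strip()
        if not line or line == "-- END --":
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Frame":
            frame["frame"] = int(value)
        elif key == "Rigid body":
            frame["rigid_body"] = value == "true"
        elif key == "loc":
            frame["loc"] = parse_vector(value)
        elif key == "rigidMarker":
            frame["rigid_markers"].append(parse_vector(value))
        elif key == "Markers":
            frame["has_markers"] = value == "true"
        elif key == "marker":
            frame["markers"].append(parse_vector(value))
    return frame


record = parse_frame(SAMPLE.splitlines())
```

Here `record["markers"]` holds the index finger marker and `record["rigid_markers"]` the three forehead markers.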

Data feed analysis

To get the best possible precision, we calibrate the scenario with two measurements. The first measurement captures the presenter in his position, wearing the rigid body and holding two reflective markers in front of his eyes. This makes it possible to calculate the relation between the eyes and the rigid body. With this relation, it is possible to calculate the direction in which the presenter is pointing his head and translate it to the point between his eyes, and therefore to calculate the direction of his gaze.

The second measurement, also taken during calibration, collects the position of each element on the board. This information is then coded into the animation.
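One way to express the relation between the eyes and the rigid body is to store the eye midpoint in a coordinate frame built from the three rigid-body markers, and to map it back to world coordinates in later frames. The sketch below illustrates that idea only; it is not the project's calibration code. All function names are hypothetical, it assumes the three markers are reported in the same order in every frame, and the eye midpoint in the usage lines is an invented value.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return scale(a, 1.0 / length)


def frame_from_markers(m0, m1, m2):
    """Origin and orthonormal axes built from three non-collinear markers."""
    x_axis = normalize(sub(m1, m0))
    z_axis = normalize(cross(x_axis, sub(m2, m0)))
    y_axis = cross(z_axis, x_axis)
    return m0, (x_axis, y_axis, z_axis)


def to_local(point, frame):
    """Express a world-space point in the rigid body's own frame."""
    origin, axes = frame
    offset = sub(point, origin)
    return tuple(dot(offset, axis) for axis in axes)


def to_world(local, frame):
    """Map a point stored in the rigid body's frame back to world space."""
    origin, axes = frame
    result = origin
    for coordinate, axis in zip(local, axes):
        result = add(result, scale(axis, coordinate))
    return result


# Calibration: marker positions from the sample feed, invented eye midpoint.
markers = [(0.603, 0.468, 1.219), (0.651, 0.474, 1.112), (0.608, 0.394, 1.139)]
eye_midpoint = (0.62, 0.40, 1.20)  # hypothetical value for illustration
eye_local = to_local(eye_midpoint, frame_from_markers(*markers))
```

In a later frame, `to_world(eye_local, frame_from_markers(...))` with the current marker positions gives the current point between the eyes, from which a gaze direction can be traced.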

Video presentation

The videos of the iterations recorded during the live presentation are scrambled and presented to each spectator, who needs to indicate which number was pointed at in each one. Note that the list of videos is scrambled randomly once, and the same scrambled order is presented to every spectator.
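Scrambling once and reusing the same order for every spectator can be sketched with a seeded shuffle. The function name and the seed value are assumptions made for this illustration.

```python
import random


def scrambled_order(num_videos, seed):
    """Return the video numbers 1..num_videos in a reproducible random order."""
    order = list(range(1, num_videos + 1))
    random.Random(seed).shuffle(order)  # same seed, same order for everyone
    return order


# Hypothetical usage: 30 videos, arbitrary seed shared by all spectators.
playlist = scrambled_order(30, seed=7)
```

Because the seed is fixed, calling `scrambled_order(30, seed=7)` again yields the same playlist, so answers from different spectators stay comparable line by line.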

Immersion scene

“Y_Bot”, a 3D articulated humanoid

For the 3D scene, the main goal was to have a human model that is able to point at different numbers on a whiteboard. We used an existing articulated humanoid (from the Unity Asset Store), which made it easier to create the animated scene in Unity. In the images below we can observe the articulated model, as well as its “bone structure”, which in this case is represented as a hierarchical structure:

Figure 6 - Humanoid structure

Animation scene

The Y_Bot humanoid came with a predefined (recorded) animation, in which the model kept repeating a simple pointing movement. This was not sufficient for our final goal, so we had to adapt it to point exactly in the direction of a specific object. For that, we created an “Animator Controller” asset, which allows us to maintain a set of animations for a specific object, in this case our humanoid. As we can see in the image below, an Animator Controller is represented by a group of different states.

Figure 7 - Animator Controller

In our case, only one state was created (the “Pointing state”), to which we associated the existing “pointing” animation.

Figure 8 - Associating the pointing animation to Animator Controller

  • In the case where more animations are used, more states of this kind must be created. The “Entry” and “Exit” states are created by default.

Inverse kinematics: pointing at a specific object

To make Y_Bot look and point precisely at a certain object, we used inverse kinematics, a technique that finds a way of orienting the model’s joints so that the final position matches a chosen position in space. Unity offers support for inverse kinematics, and a few simple configurations were needed to use this feature:

Figure 9 - Activating the ik parameter in Animator Controller

The next step was to create a corresponding C# script that actually handles the IK, by setting the “pointing” target as well as the “looking” target. For our test, both the looking and the pointing targets were a number on the whiteboard.

The initial animation represented a humanoid pointing at an imaginary object in front of it. Additionally, the inverse kinematics script has a parameter IkWeight, which takes values between 0 and 1. If IkWeight equals 1, the humanoid’s arm points towards the target, and if IkWeight equals 0, the humanoid's arm keeps the position defined by the animation. Every value between 0 and 1 represents a mix of these two positions.

For a more natural movement of the right arm, we set up a curve that varies the “ikWeight” parameter during the animation (see the script on GitHub [1]).

Figure 10 - CurveIKWeight : right arm position variation during the animation
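The effect of a weight that varies over the animation can be illustrated outside Unity with a simple linear blend. This sketch is not the project's C# script and does not reproduce Unity's IK solver: the mix is approximated by linear interpolation of a hand position, the curve is simplified to piecewise-linear segments, and the keyframes in `HYPOTHETICAL_KEYS` are invented for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def blend_position(animated, target, ik_weight):
    """Mix the animated hand position with the IK target position."""
    weight = min(1.0, max(0.0, ik_weight))  # keep the weight in [0, 1]
    return tuple(lerp(a, b, weight) for a, b in zip(animated, target))


def weight_at(time, keyframes):
    """Evaluate a piecewise-linear weight curve given (time, weight) keys."""
    if time <= keyframes[0][0]:
        return keyframes[0][1]
    if time >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (t0, w0), (t1, w1) in zip(keyframes, keyframes[1:]):
        if t0 <= time <= t1:
            return lerp(w0, w1, (time - t0) / (t1 - t0))
    raise ValueError("keyframes must be sorted by time")


# Invented curve: the arm eases towards the target, holds, then eases back.
HYPOTHETICAL_KEYS = [(0.0, 0.0), (0.4, 1.0), (0.8, 1.0), (1.0, 0.0)]

# Hand position halfway through the ease-in, for made-up coordinates.
hand = blend_position((0.0, 1.0, 0.0), (0.5, 1.5, 1.0),
                      weight_at(0.2, HYPOTHETICAL_KEYS))
```

With a weight of 0 the hand stays where the recorded animation puts it, and with a weight of 1 it reaches the target, which matches the behaviour described for IkWeight.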

Below, you can see the final animation scene, where the humanoid model points at a specific number on the whiteboard. In this image you can also observe the text "Iteration:", which helped the moderator keep track of each pointing iteration, during the testing.

Figure 11 - 3D scene - Pointing at a number

Reality immersion presentation

For the reality immersion presentation, the experiment was conducted with one person at a time. The moderator paused the animation after each iteration, so that the participant could indicate the pointed number; the moderator then noted the number on the experiment grid.


Our experiment was carried out with a group of 5 people, on two different days: the first day was dedicated to the live session, and the second one to the video and Oculus Rift sessions.

Figure 12 illustrates the results of the experiment on the first day, during the live presentation. We can observe that each person had a different perception of this pointing session. On average, more than 50% of the given answers matched the expected ones.

Figure 12 - Results of the live pointing session

If we compare the live presentation and the video presentation, we can observe the poor results of the latter: on average, only 10% of the participants' answers matched the expected ones.

Figure 13 - Live session vs Video session

Better results were obtained when testing with the Oculus Rift. We can observe a better perception of the pointing scene, as presented in Figure 14 below:

Figure 14 - Video session vs Oculus Rift session
Figure 15 - Final results


The data clearly shows that the proposed task is not easy. Even live, the spectators had problems performing it correctly. The video and immersion phases gave even worse results, although immersion performed as well as or better than video. We believe that the experiment was not sufficient to conclude that immersion is better than a simple video feed; it only reinforced our initial gut feeling.

As improvements, more test subjects should be used, especially more subjects who don't wear glasses. Also, we believe that capturing only the index finger movements and the direction of the gaze is not enough to translate to a 3D reality, because of the short distances between one position and another. One possible solution would be to also capture the shoulder position and trace a line of projection in the animation. This way the spectator could be sure of which element the presenter is pointing at, without the use of any extra hardware.

One more point to consider is how to address some drawbacks of wearing a head-mounted display, for example how to take notes or how to decide whether or not to answer your phone.