I sometimes work remotely, so I’ve tried out many video conferencing and telepresence “solutions.” The state of the art is pretty sad, and it’s been that way for at least a decade.
In hopes of improving things, I’m going to list here three classic blunders that I’ve seen in most video conferencing software. If you’re developing one of these apps, please fix!
Echo cancellation
Most video conferencing software is apparently developed by people who only use headphones, because if you use a separate speaker and microphone, you’ll find that echo makes the product unusable.
This makes the software a non-starter for telepresence. Sure, I can wear a headset on my end, but I can’t require everyone at the other end to wear headphones. And I’d rather not wear headphones myself.
I can’t emphasize enough how common this blunder is. Even Google, with all of its resources, has fallen for this one: when I last checked, echo cancellation in Google’s WebRTC worked fine on Windows, but for some reason it was not enabled on Linux. Perhaps it’s working now; I haven’t checked recently. In any case, it’s a feature that seems to come and go. You just can’t count on it.
Frankly, I think this feature should not be the responsibility of an application; it should be handled by the operating system. On Linux, PulseAudio now supports echo cancellation (through its module-echo-cancel module). For boring reasons I’m not using that, so I’ve been forced to use a hardware solution.
Multiple network interfaces
Many computers have multiple network interfaces. For example, you may have both wireless and ethernet connections, or you might be running a VPN over a standard connection.
SIP, the Session Initiation Protocol used by most video conferencing software, is horrible, and its design in particular encourages screw-ups with multiple network interfaces. When a SIP client initiates a connection to another client, it must tell the other client how to reach it, which means advertising one of its own addresses. To do this it has to pick the right interface. Most clients fail to pick correctly. The symptom is a call that starts with a bit of audio/video but then fails, or a call that fails to start at all. Debugging this generally requires packet sniffing.
The real blunder here is in the design of SIP. Essentially SIP is trying to do routing, which would be better left to the networking stack. That said, clients could certainly do a better job here.
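One way a client could do a better job is to ask the networking stack which local address it would use to reach the peer, instead of guessing. The following is a minimal sketch of that idea, assuming IPv4 and a peer address that is already known; the function names `local_address_for` and `sdp_connection_line` are made up for illustration and do not come from any particular SIP library.

```python
import socket


def local_address_for(peer_host: str, peer_port: int = 5060) -> str:
    """Return the local IPv4 address the OS would use to reach the peer."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        # connect() on a UDP socket sends no packets; it only makes the
        # kernel consult its routing table and pick a source address.
        probe.connect((peer_host, peer_port))
        return probe.getsockname()[0]


def sdp_connection_line(local_ip: str) -> str:
    """Format an IPv4 SDP connection line advertising local_ip (hypothetical helper)."""
    return f"c=IN IP4 {local_ip}"


if __name__ == "__main__":
    # 127.0.0.1 stands in for a real peer address here.
    ip = local_address_for("127.0.0.1")
    print(sdp_connection_line(ip))
```

With a VPN peer address as the argument, this should return the address of the VPN interface rather than the wifi or ethernet one, because the kernel’s routing table makes the choice. It does not help with NAT, which is a separate problem.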
If you have a client that makes this blunder and you can’t modify the source code, the only solution is to eliminate all network interfaces but one. Unplug your ethernet or disable your wifi. If you have to use a VPN, then you’ll have to use a virtual machine configured with the VPN as its only interface.
Firewalls and NAT traversal
Everyone knows that firewalls and NAT screw up video conferencing.
SIP and other video conferencing standards blundered by trying to bake in their own solution. Again, this problem is not specific to video conferencing at all. It affects many applications, and it should have been handled by a completely separate mechanism in the OS or networking stack. Instead, it has become the responsibility of every SIP client, and they all handle it a bit differently, and badly.
One particularly infuriating problem is that some situations, but not all, require help from a third-party server. A server is completely unnecessary if you are using a VPN, for example; the VPN is already doing any firewall/NAT traversal, and it provides point-to-point connections. (And encryption!) However, a lot of SIP software requires you to run a server even if you don’t need it.
A related issue is that SIP requires several connections between clients (signaling plus separate media streams), and every connection requires firewall/NAT traversal separately. That’s multiple chances to get things wrong, and it makes calls unreliable.
We would have all been better off if SIP assumed that it could make direct peer connections, and left routing and network traversal as problems to be solved by someone else.
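To show how little is needed when peers can already reach each other (over a VPN, say), here is a minimal sketch of a direct exchange over plain UDP sockets with no server involved. It is not SIP; the helper names `open_peer` and `demo` are hypothetical, and both “peers” run on the loopback interface so the example is self-contained.

```python
import socket


def open_peer(bind_addr: str = "127.0.0.1") -> socket.socket:
    """Open a UDP socket on an OS-chosen port of the given local address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, 0))   # port 0 lets the OS pick a free port
    sock.settimeout(2.0)        # fail fast instead of hanging forever
    return sock


def demo() -> bytes:
    """Send a datagram from one peer to the other and return the reply."""
    alice = open_peer()
    bob = open_peer()
    try:
        # Alice sends straight to Bob's address; no third party is involved.
        alice.sendto(b"hello", bob.getsockname())
        data, sender = bob.recvfrom(2048)
        # Bob answers directly to the address the datagram came from.
        bob.sendto(data.upper(), sender)
        reply, _ = alice.recvfrom(2048)
        return reply
    finally:
        alice.close()
        bob.close()


if __name__ == "__main__":
    print(demo())
```

On a real VPN, each peer would bind to its VPN address and send to the other’s; everything else stays the same.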
All SIP clients should at least be able to make peer-to-peer calls without doing firewall/NAT traversal and without requiring a server.