Lip-to-Speech Communication Device

An assistive technology device that converts lip movements to speech using computer vision, AI models, and vibration feedback.

Technologies Used

Assistive Technology · Computer Vision · AI/ML · ESP32 · Speech Synthesis · Accessibility · Research

Technical Implementation

The project uses a distributed design: the custom PCB handles image capture, while a connected laptop performs the computationally heavy AI processing.

  • Hardware and Power Design:
    • Power Source: The device is powered by two LiPo batteries in series, providing 7.4V.
    • Voltage Regulation: A two-stage power system is used. An efficient buck converter first steps the 7.4V down to 5V, then a Low-Dropout (LDO) regulator produces a very stable 3.3V output for the sensitive main processor (ESP32-S3) and the camera. This design combines the high efficiency of a buck converter with the clean, stable power of an LDO, which is critical for the processor's performance; a rough loss comparison follows these lists.
    • Core Components: The PCB is built around an ESP32-S3 microcontroller and an OV5640 camera module.
  • Data Processing and Workflow:
    1. The OV5640 camera captures images of the user’s lips.
    2. The ESP32-S3 processor uses its built-in Wi-Fi to stream these images to a connected laptop (a laptop-side receiver is sketched after these lists).
    3. All the intensive AI processing happens on the laptop (a pipeline sketch follows below):
      • An open-source lip-to-text model analyzes the video stream and generates text.
      • An open-source text-to-speech model converts that text into an audio signal.
    4. The final audio signal is sent back to the ESP32-S3 (one possible return path is sketched after these lists).
    5. The ESP32-S3 outputs the audio as a vibration signal through a connected sound module.
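
To see why the two-stage regulator matters in practice, here is a rough loss estimate. It assumes a 500 mA load on the 3.3V rail and ~90% buck efficiency; both figures are assumptions for illustration, not measured values.

```latex
% Assumed: I = 0.5 A at 3.3 V, buck efficiency eta = 0.90 (illustrative values).
\begin{align*}
\text{Single LDO, } 7.4 \to 3.3\,\mathrm{V}: \quad
  P_{\mathrm{loss}} &= (7.4 - 3.3)\,\mathrm{V} \times 0.5\,\mathrm{A} = 2.05\,\mathrm{W} \\
\text{Buck, } 7.4 \to 5\,\mathrm{V}: \quad
  P_{\mathrm{buck}} &= \frac{5\,\mathrm{V} \times 0.5\,\mathrm{A}}{\eta}
                      - 5\,\mathrm{V} \times 0.5\,\mathrm{A} \approx 0.28\,\mathrm{W} \\
\text{LDO, } 5 \to 3.3\,\mathrm{V}: \quad
  P_{\mathrm{LDO}} &= (5.0 - 3.3)\,\mathrm{V} \times 0.5\,\mathrm{A} = 0.85\,\mathrm{W} \\
\text{Two-stage total:} \quad
  P_{\mathrm{loss}} &\approx 0.28 + 0.85 = 1.13\,\mathrm{W}
  \quad (\text{vs. } 2.05\,\mathrm{W} \text{ single-stage})
\end{align*}
```

Under these assumptions the two-stage design roughly halves the heat dissipated on the board while the LDO still keeps switching noise off the 3.3V rail.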
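
For step 2, the exact streaming protocol isn't specified, so here is a minimal laptop-side receiver sketch assuming the firmware exposes the camera as an MJPEG-over-HTTP stream, a common pattern in ESP32 camera projects. The URL is hypothetical.

```python
# Laptop-side receiver: a minimal sketch, assuming the ESP32-S3 firmware
# serves its OV5640 frames as an MJPEG stream over HTTP.
import cv2

STREAM_URL = "http://192.168.4.1:81/stream"  # hypothetical device address

def read_frames(url: str = STREAM_URL):
    """Yield frames from the ESP32's MJPEG stream as OpenCV BGR images."""
    cap = cv2.VideoCapture(url)
    if not cap.isOpened():
        raise ConnectionError(f"Could not open stream at {url}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # stream dropped; the caller can reconnect
            yield frame
    finally:
        cap.release()

if __name__ == "__main__":
    for frame in read_frames():
        cv2.imshow("lip camera", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cv2.destroyAllWindows()
```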
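
For step 3, the source names neither the specific lip-to-text model nor the text-to-speech model, so this sketch keeps the lip-reading stage as a hypothetical lip_to_text() placeholder and uses pyttsx3 as one illustrative open-source, offline TTS option.

```python
# Pipeline sketch: buffer a short window of frames, run a lip-reading model,
# then synthesize speech to a WAV file. lip_to_text() is a hypothetical
# placeholder for whatever open-source lip-reading model is plugged in.
import pyttsx3

FRAMES_PER_UTTERANCE = 75  # assumption: ~3 s of video at 25 fps

def lip_to_text(frames) -> str:
    """Placeholder: run the open-source lip-reading model on a frame window."""
    raise NotImplementedError("plug the lip-to-text model in here")

def text_to_wav(text: str, path: str = "utterance.wav") -> str:
    """Render text to a WAV file with pyttsx3 (offline TTS)."""
    engine = pyttsx3.init()
    engine.save_to_file(text, path)
    engine.runAndWait()
    return path

def run_pipeline(frame_source):
    """Consume frames (e.g. from read_frames above) and yield WAV paths."""
    window = []
    for frame in frame_source:
        window.append(frame)
        if len(window) == FRAMES_PER_UTTERANCE:
            text = lip_to_text(window)
            if text:
                yield text_to_wav(text)
            window.clear()
```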
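
For steps 4 and 5, here is one way the laptop could push the synthesized audio back to the device. The TCP port and the length-prefixed framing are assumptions; the actual firmware protocol is not described in the source.

```python
# Return-path sketch: send the synthesized WAV back to the ESP32-S3 over a
# raw TCP socket, which the firmware would then play through the sound module.
import socket
import struct

DEVICE_ADDR = ("192.168.4.1", 3333)  # hypothetical IP and port of the ESP32

def send_audio(wav_path: str, addr=DEVICE_ADDR) -> None:
    """Transmit a WAV file as a 4-byte length header followed by the payload."""
    with open(wav_path, "rb") as f:
        payload = f.read()
    with socket.create_connection(addr, timeout=5) as sock:
        sock.sendall(struct.pack(">I", len(payload)) + payload)
```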