Software Engineer

Crafting an AI clone to handle my interviews

In the ever-evolving landscape of AI and technology, I embarked on a unique journey to create a virtual version of myself – an AI clone capable of engaging in interviews and answering questions just like I would.

Let’s start by having a peek at the end result and then I will explain how it was built.

Answering the prompts with a Large Language Model

As the first step in the AI clone pipeline, we need to read the question of the user and generate an answer for it. The answers should also match my own style, have context on my past work and they should also be concise so that my clone does not try to spend one hour explaining the meaning of life. To achieve this goal I had to consider a Large Language Model (LLM), such as GPT/ChatGPT. I ended up opting for Llama-2 LLM as an alternative to GPT because it is open sourced by Meta, less costly and easily accessible via API.

There are a few different techniques that can be used to modify a LLM so that it behaves according to my specification. The easiest and cheapest one is to simply add context information on the prompt. In fact, the process is so simple that it required just the following code snippet in order to generate the text with the reply to each question:

import Replicate from 'replicate';
await"a16z-infra/llama-2-13b-chat:2a7f981751ec7fdf87b5b91ad4db53683a98082e9ff7bfd12c8cd5ea85980a52", {
            input: {
                max_new_tokens: 500,
                system_prompt: `You are acting as Gustavo Silva, a software engineer who is being asked questions about himself and his professional work.
Here is your resume:
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
Prefer direct replies, most of the times one sentence is enough.
Do not include any introductory text, go straight to the answer. Remember to keep your answer short.`,
                temperature: 0.2,

Explaining each parameter:

  • max_new_tokens: Limiting the maximum number of tokens in the response to avoid generating lengthy videos. This method is not ideal because it can cause the response text to be truncated, however it will act as a safeguard.

  • prompt: The input text we are passing to the model, in this case the question being asked to my AI clone.

  • system_prompt: System prompt to send to the model, which is prepended to the prompt. Here I customized the behavior of the bot by including the following information, in order:

    1. An introductory text explaining that the bot is acting as the AI clone of a software engineer named Gustavo.

    2. A text dump of my resume.

    3. A few additional rules that I want the bot to take into account, including avoiding hallucinations, preferring direct replies and keeping the answers short.

  • temperature: Adjusts randomness of outputs. I have decided to lower it from the default value of 0.75 so that it provides more factual and less creative answers.

Bringing Words to Life: Text-to-Speech Service

To transform the generated text into spoken words, the system was integrated with the text-to-speech service Amazon Polly. This integration seamlessly converts textual responses into natural-sounding speech, bridging the gap between text-based AI and human-like interaction.

From my own experience (plus publicly available reviews) this is not the best text-to-speech service available and you will get more natural results with a tool such as ElevenLabs. However, the pricing difference between the two is quite significant so I decided to compromise in favor of lower costs.

The Visual Element: Wav2Lip and Template Videos

Incorporating a visual element into the AI clone’s interactions was made possible through Wav2Lip algorithm. Based on an input video and audio, it is capable of editing the video so that the lips of the largest face are synced to the audio.

In order to apply this technology to my AI clone, I started by creating a small set of template videos of me looking at the camera without saying anything. These videos lack lip movement but feature head movement and hand gestures.

Template video:

Wav2Lip breathed life into these videos. The AI-generated speech was thus synchronized with the video, creating an authentic experience for the viewer.

Video synced to speech using Wav2Lip:

Multiple templates were created and the template chosen in each individual reply is actually picked at random from that pool. This adds some variety to the results.

Wrapping up

Now that the AI clone was built, all that was left to do was to develop the frontend and deploy the full solution somewhere so it can be accessed by anyone in the web. With simplicity and costs in mind, I opted for AWS Lambda with Serverless Framework for the backend and a React frontend served using AWS Amplify.

Last but not least, and in fact the most time consuming part of the whole process, I implemented some anti-bot and anti-spam measures to ensure fair use and that the operational costs don’t blow up.

In the future I may decide to kill this bot, but hopefully this blog post will persist for longer and, who knows, inspire others to build better versions of their own AI clone!