Tell me about any recent project-
you worked on using Paithan
- Random Recruiter on call
My online friends made fun of me because I didn't have a mic.
So I made my own mic using my Game Console.
That mic evolved into a node for my voice assistant.
I could not afford a microphone, so I used my PS Vita as one. I ended up using Dear ImGui to build a custom client app for the PS Vita that connects to my computer, which runs my custom voice assistant built on an Ollama model plus lots of algorithms, rules, and fuzzy logic.
Initially I made a single-node version of this using my laptop and its built-in mic. You can check the video below.
hacked/jailbroken and can run custom code. Therefore, I decided to use it.
.vpk file which works like an Android .apk file in design. They both are fancy zip files with rules for where the files should be copied to, etc.
Hmmm...
Those Hexadecimal numbers?
Seems like I can transfer it to my PC via WiFi to make a DIY wireless Mic to make calls.
SCE_NET_SOCK_DGRAM to set up a UDP network. I chose UDP because I first wanted to check if PSV's 2.4 GHz 802.11n WiFi (it does not match true 802.11n speeds though, I only get a max of around 2 megabytes/sec) is good enough for modern standards. Surprisingly, it worked very well → given my router is literally above my computer monitor.
IP Address. Initially I had hardcoded the IP Address of my computer, but since I also use my laptop sometimes, it was not ideal to hardcode. So I used file operations to read the IP Address from a file instead: sceNetInetPton(SCE_NET_AF_INET, SERVER_IP, &server_audio.sin_addr)
server_audio.sin_port = sceNetHtons(SERVER_PORT)
#define NET_PARAM_MEM_SIZE (1*1024*1024) → This is how much space is allotted for the networking stack of the entire program. 1*1024*1024 here means 1 MB → networking stack RAM.
I used port 2012 because I had this feeling of impending doom when I was writing the code for this.
Therefore -> 2012.
(。﹏。) don't ask...
mic_socket at port 2012.
sceNetSendto(sock, cmd, strlen(cmd), 0,(SceNetSockaddr*)server_addr, sizeof(*server_addr)). Now that the WiFi initialization is done, this is the function used to do the real WiFi communication.
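To try the PC side without the console, the same two steps (read the server IP from a file, then send datagrams over UDP) can be mimicked on a desktop. This is only a rough Python sketch, not the Vita code: the file name, `read_server_ip`, and `send_command` are hypothetical, and `socket.inet_pton` stands in for `sceNetInetPton`.

```python
import socket
from pathlib import Path

SERVER_PORT = 2012  # port number mentioned in the post


def read_server_ip(path):
    """Read an IPv4 address from the first line of a text file and validate it."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    ip = lines[0].strip() if lines else ""
    try:
        # Desktop analogue of sceNetInetPton: fails on malformed addresses.
        socket.inet_pton(socket.AF_INET, ip)
    except OSError as exc:
        raise ValueError(f"not a valid IPv4 address: {ip!r}") from exc
    return ip


def send_command(sock, cmd, server_addr):
    """Send one text command as a single UDP datagram; returns bytes sent."""
    return sock.sendto(cmd.encode("utf-8"), server_addr)
```

A typical call would be `send_command(socket.socket(socket.AF_INET, socket.SOCK_DGRAM), "hello", (read_server_ip("server_ip.txt"), SERVER_PORT))`, where `server_ip.txt` is an assumed file name.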
やった (I did it!)
Math time
(ミ^ᆽ^ミ)
SCE_AUDIO_IN_PARAM_FORMAT_S16_MONO
int grain = 256,
short is 2 bytes,
「16,000 samples/sec ÷ 256 samples/packet」 = 62.5 packets/sec
「62.5 packets/sec x 512 bytes/packet」 = 32,000 bytes/sec
libopus as that would make the required speed of 256 kbps go down to 32 kbps, 8x smaller size.
Only Walkie.
No Walkie-Talkie yet.
Until I implement 2 way communication.
(。╯︵╰。)
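The bandwidth math for the mic stream can be double-checked with a few lines of Python. The numbers (16,000 samples/sec, a grain of 256 samples, 2-byte samples, and the 32 kbps Opus target) come from the post; the function name `stream_stats` is made up for this sketch.

```python
def stream_stats(sample_rate_hz, grain, bytes_per_sample):
    """Return packet size and throughput for an uncompressed mono PCM stream."""
    packet_bytes = grain * bytes_per_sample          # one packet = one grain
    packets_per_sec = sample_rate_hz / grain
    bytes_per_sec = packets_per_sec * packet_bytes
    return {
        "packet_bytes": packet_bytes,
        "packets_per_sec": packets_per_sec,
        "bytes_per_sec": bytes_per_sec,
        "kbps": bytes_per_sec * 8 / 1000,
    }


stats = stream_stats(sample_rate_hz=16_000, grain=256, bytes_per_sample=2)
print(stats)
# Ratio between the raw stream and the 32 kbps Opus target from the post.
print("compression factor:", stats["kbps"] / 32)
```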
AUDIO_FORMAT = 's16le' (signed 16-bit little-endian) and the values from earlier, along with others like the IP address, socket port number, number of channels, etc.
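For illustration, a receiver built around those settings might look roughly like this. It is a sketch under assumptions, not the actual alter-ego code: the constant names other than `AUDIO_FORMAT`, and the helpers `decode_s16le`, `open_mic_socket`, and `receive_samples`, are hypothetical.

```python
import socket
import struct

AUDIO_FORMAT = "s16le"   # signed 16-bit little-endian, as in the post
SAMPLE_RATE = 16_000     # samples/sec, from the math above
CHANNELS = 1             # mono
PORT = 2012
RECV_BUFFER = 2048       # assumed; comfortably larger than one 512-byte packet


def decode_s16le(packet):
    """Turn raw s16le bytes into a list of Python ints."""
    if len(packet) % 2:
        raise ValueError("s16le data must have an even number of bytes")
    return list(struct.unpack(f"<{len(packet) // 2}h", packet))


def open_mic_socket(host="0.0.0.0", port=PORT):
    """Bind a UDP socket that the Vita client can send audio packets to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_samples(sock, max_packets):
    """Collect samples from up to max_packets datagrams (blocks while waiting)."""
    samples = []
    for _ in range(max_packets):
        packet, _addr = sock.recvfrom(RECV_BUFFER)
        samples.extend(decode_s16le(packet))
    return samples
```

Since UDP gives no delivery guarantee, a real receiver would also have to tolerate missing packets; this sketch simply decodes whatever arrives.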
It's called alter-ego because this project was heavily inspired by Chihiro from Danganronpa.
I then made a few rules like "open browser": lambda: (speak("Opening browser"), send2vita("Opening browser"), notify("Opening Browser"), subprocess.Popen(["zen-browser"]))
"open browser" → If the user says "open browser", then do something
speak("Opening browser") → It says "Opening browser" in a human-like voice
send2vita("Opening browser") → Currently a placeholder function for a future upgrade
notify("Opening Browser") → Displays a notification on the desktop
subprocess.Popen(["zen-browser"]) → Runs the actual command using subprocess
Here are some more example commands which demonstrate the assistant being able to do system level tasks like changing volume, brightness, keyboard backlight, taking a screenshot, opening YouTube, etc.
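The rule-table pattern described above can be sketched as a dictionary of lambdas plus a small dispatcher. In this sketch the helpers are passed in as parameters so it runs anywhere; `make_commands`, `run_command`, and the `launch` parameter are illustrative names, and only the "open browser" rule is taken from the post.

```python
def make_commands(speak, send2vita, notify, launch):
    """Build the rule table: spoken phrase -> zero-argument action."""
    return {
        "open browser": lambda: (
            speak("Opening browser"),
            send2vita("Opening browser"),
            notify("Opening Browser"),
            launch(["zen-browser"]),
        ),
    }


def run_command(commands, text):
    """Run the action for an exact phrase match; return False if unknown."""
    action = commands.get(text.strip().lower())
    if action is None:
        return False
    action()
    return True
```

In real use, `launch` would be `subprocess.Popen` and the other three would be the assistant's own `speak`, `send2vita`, and `notify` functions.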
But the assistant was not perfect when I tried to use it. Since vosk converts human speech to English text, a slight difference in accent could make it mishear the command, and the assistant would then do nothing at all because the misheard text is not defined in its command dictionary. Therefore, I implemented a Fuzzy Logic selection method using RapidFuzz to narrow down what the user said to something in the command dictionary.
Fuzzy Logic here means fuzzy string matching: checking how similar 2 strings are, instead of a strict binary output like "both are the same" or "both are not the same". It gives a score for how similar the 2 strings are.
Simple, effective,
elegant
ヽ(´▽`)/
fuzz.partial_ratio(command, text) → partial_ratio() means it checks partially. For example, I could say something like "open browser please" and it would still execute "open browser" command.
fuzz.token_sort_ratio(command, text) → token_sort_ratio() means it checks for words without checking their order. For example, I could also say "browser open" and it would still execute "open browser" command.
fuzz.partial_sort as from my experiments, 0.6:0.4 was the sweet spot.
if score > 60: return best_match → If matching score is above 60, accept it and run the command. 60 because PSV microphone is not very good and 60 was the sweet spot where it was working well.
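The idea of blending two similarity scores and applying a cutoff of 60 can be tried out with a small standard-library sketch. Note the assumptions: this uses `difflib.SequenceMatcher` as a stand-in for RapidFuzz (its scores will differ somewhat from `fuzz.partial_ratio` and `fuzz.token_sort_ratio`), and which metric gets the 0.6 weight and which gets 0.4 is a guess, since the post only gives the ratio.

```python
from difflib import SequenceMatcher

THRESHOLD = 60
W_PARTIAL, W_TOKEN_SORT = 0.6, 0.4  # assumed assignment of the 0.6:0.4 weights


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a, b):
    """Best score of the shorter string against same-length windows of the longer."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0.0
    windows = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in windows)


def token_sort_ratio(a, b):
    """Compare the strings after sorting their words, so word order is ignored."""
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))


def best_match(text, commands):
    """Return the best-scoring command, or None if nothing beats THRESHOLD."""
    text = text.lower().strip()
    best, best_score = None, 0.0
    for command in commands:
        score = (W_PARTIAL * partial_ratio(command, text)
                 + W_TOKEN_SORT * token_sort_ratio(command, text))
        if score > best_score:
            best, best_score = command, score
    return best if best_score > THRESHOLD else None
```

With a command list such as `["open browser", "take screenshot", "volume up"]`, both "open browser please" and "browser open" resolve to "open browser", while unrelated text returns `None`.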
A voice assistant needs to respond by sound, not by us pressing a button
This was done because I was once yelling "shut up" at myself, and it fuzzy-matched that to "shutdown" and shut down my PC...
selection where I can select some text and ask the assistant to explain the selected text. For this, I used Ollama. The specific model I used is Qwen 2.5 0.5B. It's a tiny and very capable model for this use case.
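One way to ask a local Ollama server to explain a piece of text is through its HTTP `/api/generate` endpoint. The sketch below is a guess at how that could look, not the assistant's actual code: it assumes Ollama is listening on its default address `localhost:11434`, that the model is pulled under the tag `qwen2.5:0.5b`, and the prompt wording and function names are invented for the example.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default address
MODEL = "qwen2.5:0.5b"                              # assumed model tag


def build_explain_request(selected_text, model=MODEL):
    """Build a non-streaming generate request asking for an explanation."""
    payload = {
        "model": model,
        "prompt": f"Explain the following text briefly:\n\n{selected_text}",
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain(selected_text):
    """Send the request and return the model's answer (needs a running server)."""
    with request.urlopen(build_explain_request(selected_text), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

How the selected text is captured from the desktop is left out here, since that depends on the windowing system.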
A voice assistant needs a personality. These are the things it says when it detects a wake word.
It says one of these if fuzzy match score is below 60.
It's basically my test code, which I use to test the graphical capabilities of various hardware. Yes, it runs everywhere, like Doom.