Sounds like a jaw harp recording (possibly layered with a pulse wave) processed through a formant filter and/or several bandpass filters.
If you go with the bandpass filter route, you’ll need at least 3 to give it that “vocal quality”.
You’ll want to set things up so the filter frequencies are modulated by hand (mod wheel?) rather than by an LFO, with the bands overlapping and ‘passing each other’ as you raise or lower the frequency. You might also try one high-pass, one bandpass, and one low-pass filter for this.
You’ll need to tweak the resonance peak of each filter until it sounds just right.
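If it helps to see the bandpass idea outside a synth, here is a rough Python sketch: a naive pulse wave fed through three resonant bandpass filters whose centre frequencies all follow one 0..1 control value (standing in for the mod wheel). Everything specific in it is my own assumption rather than anything taken from the sound in question: the frequency ranges in BANDS, the q value, the block size, and the choice of the well-known RBJ "cookbook" bandpass formula. The first two bands sweep in opposite directions so that they cross partway through the wheel's travel.

```python
import math

SAMPLE_RATE = 44100

# Hypothetical (start_hz, end_hz) sweep ranges; band 0 rises while band 1
# falls, so the two cross somewhere in the middle of the control range.
BANDS = [(300.0, 1200.0), (1400.0, 500.0), (2200.0, 3200.0)]


def pulse_wave(freq, duration, duty=0.25, sample_rate=SAMPLE_RATE):
    """Naive (not band-limited) pulse wave with values of +1.0 or -1.0."""
    n = int(round(duration * sample_rate))
    return [1.0 if (i * freq / sample_rate) % 1.0 < duty else -1.0
            for i in range(n)]


def bandpass_coeffs(center, q, sample_rate=SAMPLE_RATE):
    """RBJ cookbook bandpass (0 dB peak gain), already divided by a0."""
    w0 = 2.0 * math.pi * center / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)


def band_freq(lo, hi, control):
    """Exponential interpolation from lo (control=0) to hi (control=1)."""
    return lo * (hi / lo) ** control


def sweep_filter_bank(signal, control, bands=BANDS, q=8.0, block=64,
                      sample_rate=SAMPLE_RATE):
    """Sum of bandpass filters whose centres follow `control`.

    `control` is a sequence the same length as `signal`, holding the
    wheel position (0..1) for each sample. Coefficients are refreshed
    once per block of samples; a higher q gives a sharper resonance peak.
    """
    out = [0.0] * len(signal)
    for lo, hi in bands:
        x1 = x2 = y1 = y2 = 0.0  # filter state carried across blocks
        for start in range(0, len(signal), block):
            c = min(1.0, max(0.0, control[start]))
            b0, b1, b2, a1, a2 = bandpass_coeffs(
                band_freq(lo, hi, c), q, sample_rate)
            for i in range(start, min(start + block, len(signal))):
                x0 = signal[i]
                y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
                x2, x1 = x1, x0
                y2, y1 = y1, y0
                out[i] += y0
    return out
```

To try it, build a source with something like pulse_wave(110.0, 1.0), make a control list of the same length that ramps from 0 to 1 (or follows whatever wheel movement you like), and pass both to sweep_filter_bank. Raising q is the code equivalent of turning up the resonance on each filter.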
If you go with a formant filter... well, it’s pretty much the above, but it’s all taken care of for you.
Hope that gets you headed in the right direction.