Alireza Heshmati

Lightweight Speaker Verification Model This project focuses on verifying whether two audio files belong to the same speaker. Unlike heavy speech feature extractors like WavLM, I designed a lightweight model based on ECAPA-TDNN, which generates fixed-length representations for audio files of varying durations using statistical pooling. The model was trained for a speaker identification task on a dataset of 1,000 Persian speakers using supervised learning. The dataset was highly imbalanced, so I conducted data analysis and preprocessing to mitigate its negative effects on training. Finally, I detached the classifier layer and retained the trained model as a feature extractor. Using this feature extractor, cosine similarity as the scoring function, and an Equal Error Rate (EER) threshold for decision-making, I developed a speaker verification system with a model size of only 13 MB, which can be further reduced by using Float16 quantization.
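The scoring and decision steps described above can be sketched as follows. This is a minimal illustration, assuming the embeddings are already available as plain vectors; the function names `cosine_score`, `eer_threshold`, and `same_speaker` are hypothetical and are not taken from the project code.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def eer_threshold(target_scores, nontarget_scores):
    """Return (eer, threshold) at the point where the false-accept and
    false-reject rates are closest, scanning the observed scores."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    best = None
    for t in np.sort(np.concatenate([target, nontarget])):
        frr = np.mean(target < t)       # same-speaker pairs rejected
        far = np.mean(nontarget >= t)   # different-speaker pairs accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return float(best[1]), float(best[2])

def same_speaker(emb_a, emb_b, threshold):
    """Accept the pair when its cosine score reaches the threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

In this sketch the threshold would be estimated once on a held-out set of same-speaker and different-speaker pairs and then reused for every new pair of audio files.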

Persian Automatic Speech Recognition (ASR) This project is a large-scale initiative aimed at developing a high-quality offline ASR system and a fast real-time ASR system for the Persian language, optimized for operation in extremely noisy environments while accommodating a wide range of accents. Given the complexity of the task, I thoroughly explored different ASR architectures, including Whisper (an attention-based encoder-decoder, or AED, model), Wav2Vec (a CTC-based model), and FastConformer (a hybrid model utilizing both RNN-T and CTC). After extensive evaluation, I selected Whisper for offline transcription due to its strong performance in handling long-form audio with complex linguistic structures. For real-time applications, I chose FastConformer because of its efficiency, lower latency, and suitability for streaming scenarios. To enhance the robustness of these models, our team fine-tuned them on a carefully curated Persian dataset designed to simulate real-world conditions. This dataset includes a wide variety of noise types, distortions, audio codecs, recording devices, reverberation effects, and diverse Persian accents. The goal was to ensure that the ASR system performs well across different environments, from clean studio recordings to highly challenging acoustic conditions. By leveraging state-of-the-art architectures and targeted fine-tuning, this project aims to push the boundaries of Persian ASR, making it both highly accurate and practical for real-world applications.

Designing Pixel-wise and Group-wise Attacks to Deep Neural Networks This project focuses on designing pixel-wise and group-wise attacks that identify the key pixels and features in images from the viewpoint of Deep Neural Networks (DNNs). These methods generate both imperceptible perturbations for non-robust DNNs and counterfactual explanations for robust DNNs. We use a general notion of sparsity, overlapping sparsity, to introduce a general regularizer that covers all modes of sparsity.

Few-Shot Keyword Spotting (FSKWS) This project involved recognizing a specific keyword with very limited training data. This approach allows us to change the target keyword with five or fewer samples. We prepared a rich dataset in the Persian language for pretraining using a prototypical network. Additionally, we implemented streaming capabilities for real-time keyword spotting.
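The way a prototypical network lets the target keyword change from a handful of samples can be sketched as follows. This is a minimal illustration, assuming the pretrained encoder has already turned each audio clip into an embedding vector; `build_prototypes` and `classify` are hypothetical names, and squared Euclidean distance is an assumed choice of metric.

```python
import numpy as np

def build_prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into one prototype."""
    embeddings = np.asarray(support_embeddings, dtype=float)
    labels = np.asarray(support_labels)
    classes = np.unique(labels)
    prototypes = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in classes]
    )
    return classes, prototypes

def classify(query_embedding, classes, prototypes):
    """Assign the query to the class with the nearest prototype
    (squared Euclidean distance, an assumed metric)."""
    query = np.asarray(query_embedding, dtype=float)
    distances = np.sum((prototypes - query) ** 2, axis=1)
    return classes[int(np.argmin(distances))]
```

Under these assumptions, swapping the keyword only requires recomputing one prototype from the new samples; the encoder itself is left untouched.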

Designing Fast Gradual Sparse Attacks to Deep Neural Networks In this paper, we propose a new algorithm to design fast, sparse attacks on DNNs using proximal-based optimization methods. This algorithm uses the ℓ1 norm, the ℓ0 norm, and the Smoothly Clipped Absolute Deviation (SCAD) function for sparsity regularization. In addition, it starts with a dense perturbation and gradually makes it sparse using the penalty method.
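For readers unfamiliar with proximal methods, the standard element-wise proximal (thresholding) operators of the three regularizers named above can be sketched as follows. These are the textbook forms, not the paper's code; the function names and the default SCAD parameter `a=3.7` are assumptions made for illustration.

```python
import numpy as np

def prox_l1(x, lam):
    """Soft thresholding: proximal operator of lam * ||x||_1."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_l0(x, lam):
    """Hard thresholding: proximal operator of lam * ||x||_0.
    Entries with |x_i| <= sqrt(2 * lam) are set to zero."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

def prox_scad(x, lam, a=3.7):
    """SCAD thresholding rule (requires a > 2): soft thresholding for
    small entries, a linear blend in the middle range, identity above a * lam."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    middle = ((a - 1.0) * x - np.sign(x) * a * lam) / (a - 2.0)
    return np.where(
        abs_x <= 2.0 * lam,
        prox_l1(x, lam),
        np.where(abs_x <= a * lam, middle, x),
    )
```

In a proximal-gradient loop, one of these operators would be applied to the perturbation after each gradient step; increasing `lam` across iterations is one way a dense perturbation could be made gradually sparser.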

Speech Enhancement (SE) The goal was to build a time-efficient SE module, placed before the ASR system, that enhances noisy input speech so that the resulting ASR embeddings become close to the corresponding clean ones. The noise was mostly environmental (both natural and artificial), and a small amount of reverberation was also considered in the input speech.

Voice Activity Detection (VAD) This project was about classifying each audio frame as speech or non-speech. This module is needed to reduce the time complexity and errors of Automatic Speech Recognition (ASR) systems. In our project, we designed the VAD with deep learning modules such as CNN, RNN, and FNN so that it has fewer parameters, a shorter execution time on CPU, and higher accuracy than recent models.

Tokens Position Detection in Speech This project was about detecting the positions of tokens in audio using the encoder of an Automatic Speech Recognition (ASR) system. For this, we found the correspondence between the input and the output of an encoder that has a CTC (Connectionist Temporal Classification) layer. Specifically, the blank tokens were removed and consecutive repeated tokens were merged to obtain their positions.
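The collapsing step can be sketched as follows. This is a minimal illustration, assuming the per-frame argmax token ids of the CTC layer are already available and that the blank id is 0; `ctc_token_positions` is a hypothetical name, not the project's function.

```python
def ctc_token_positions(frame_ids, blank_id=0):
    """Collapse per-frame CTC predictions into (token, first_frame, last_frame)
    spans: blanks are dropped and runs of the same token are merged."""
    spans = []
    previous = blank_id
    for frame, token in enumerate(frame_ids):
        if token == blank_id:
            previous = blank_id
            continue
        if token == previous:
            # Same token as the previous frame: extend the current span.
            tok, start, _ = spans[-1]
            spans[-1] = (tok, start, frame)
        else:
            # New token, or a repeat separated by a blank: start a new span.
            spans.append((token, frame, frame))
        previous = token
    return spans
```

Multiplying the frame indices by the encoder's frame duration (which depends on its subsampling factor and is not stated here) would convert each span into a time interval in the audio.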

Designing Low Coherent Measurement Matrix This is my first paper in IEEE SPL. It is about designing a low coherent measurement matrix with a controlled spectral norm via an efficient approximation of the ℓ∞ norm. Compressed Sensing (CS) aims to reconstruct a signal from a small set of measurements, provided that the signal is sparse in some domain. In this respect, a low coherent measurement matrix plays an important role. In this letter, an efficient approximation of the ℓ∞ norm based on the soft maximum was introduced to design a low coherent measurement matrix with a controllable spectral norm. The proposed approximation, called Logarithm of Sum of Exponential Absolute values (LSEAp), is convex (similar to the ℓ∞ norm) and almost smooth. Accordingly, we designed a low coherent measurement matrix with a small spectral norm via minimization of the ℓ∞ norm of the Gram matrix. The resulting problem is not convex, but our simulations show that the LSEAp leads to an improved design of the measurement matrix compared to current methods.
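The idea of replacing the ℓ∞ norm with a soft maximum can be illustrated with the standard log-sum-exp form below. The exact definition of LSEAp in the letter is not reproduced here, so the formula (1/p)·log Σ exp(p·|x_i|), the parameter `p`, and the function names are assumptions made for this sketch.

```python
import numpy as np

def lse_abs(x, p=50.0):
    """Smooth approximation of max|x_i|: (1/p) * log(sum(exp(p * |x_i|))).
    The maximum is subtracted before exponentiating for numerical stability."""
    scaled = p * np.abs(np.asarray(x, dtype=float)).ravel()
    peak = scaled.max()
    return float((peak + np.log(np.sum(np.exp(scaled - peak)))) / p)

def mutual_coherence(A):
    """Largest absolute off-diagonal entry of the Gram matrix
    of the column-normalized matrix A."""
    A = np.asarray(A, dtype=float)
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    gram = A.T @ A
    off_diagonal = gram - np.diag(np.diag(gram))
    return float(np.max(np.abs(off_diagonal)))
```

For a vector with n entries, this form lies between max|x_i| and max|x_i| + log(n)/p, so a larger `p` gives a tighter but less smooth approximation.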

Attack to Deep Learning Networks This, my master's degree project, evaluates the robustness of DNNs against a designed attack using pixel-wise or group-wise (structured) perturbations on images (CIFAR-10 and ImageNet). The key challenges include controlling the sparsity of the perturbed units and the perturbation intensity. The proposed method implements sparsity and imperceptibility criteria using the Smoothed ℓ0 function and an approximation of the ℓ∞ norm, respectively. In this project, the proposed sparse adversarial attacks were developed such that the element-wise perturbations can be converted into either pixel-wise or group-wise perturbations.

Keyword Spotting This project was about recognizing a desired word (keyword spotting) in audio. The network is lightweight, which increases the speed of the application. It was inspired by residual blocks and uses 1D and 2D audio features to improve both accuracy and execution speed. The Google Speech Commands v1 dataset was used for training and evaluation.

Object Detection and Depth Estimation In this project, I used a YOLO network for object detection and FastDepth, which was introduced by researchers at MIT, as the depth estimator. FastDepth is an encoder-decoder model that uses MobileNetV2 as a lightweight encoder and skip connections between the encoder and decoder layers so that the decoder needs fewer layers. For this project, I used the NYU Depth Dataset V2.

Pose Estimation using Convolutional Neural Network In this project, I had to determine the locations of the head, torso, and the joints of the hands and feet by feeding an image to a convolutional neural network (CNN). Accordingly, I used a version of the LSP dataset and a simplified version of the network in this paper, which consists of two parts (Initial stage and Stages). Finally, I managed to train this network.

Recovery of an Image with IMAT and OMP Methods In this project, IMAT (Iterative Method with Adaptive Threshold) and OMP (Orthogonal Matching Pursuit) were used to recover an image from its random, uniform, and non-uniform samples by exploiting its sparsity in a transform domain.
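The greedy selection that OMP performs can be sketched as follows. This is a minimal textbook version, assuming a measurement model y = A x with a known sparsity level; the function name `omp` and its arguments are hypothetical, and the image-specific sampling and sparsifying transform used in the project are not reproduced.

```python
import numpy as np

def omp(A, y, sparsity):
    """Orthogonal Matching Pursuit: pick the column most correlated with the
    residual, then re-fit all selected columns by least squares."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(sparsity):
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never select a column twice
        support.append(int(np.argmax(correlations)))
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
        x[support] = coeffs
    return x
```

For image recovery, `A` would typically combine the sampling pattern with the inverse of the sparsifying transform, so that `x` holds the transform coefficients of the image.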

PPM Demodulation Using Non-uniform Sampling and Inverse System Approach: Pulse Position Modulation (PPM) signals are generated at the instants where the modulating signal intersects a sawtooth wave of constant amplitude and period. In a PPM signal, the information is carried not by the amplitude but by the spacing between consecutive pulses. Two approaches were defined for PPM demodulation. The first approach is non-uniform sampling, with methods such as:

Wiley/Marvasti
Time-varying
Zero-Order-Hold
Linear interpolation
Voronoi
Adaptive Weight Method (ADPW)

The second approach is a PPM inverse system that uses an iterative method and a Chebyshev Acceleration (CA) method. In this project, I used these methods for PPM demodulation.

Compensating Distortion of Interpolation of 1D Signals Interpolation methods such as sample-and-hold (S&H), linear, 1-spline, and c-spline introduce distortion. In this project, a modular method and an iterative method were used to compensate for the distortion of common interpolators. In the modular method, the interpolation function is multiplied by 1 + 2[cos(2πt/T) + cos(4πt/T) + ⋯ + cos(2Nπt/T)] and then low-pass filtered. The interpolation of a discrete signal is modeled as the output of a linear time-invariant system whose inputs are the discrete samples. Simulation results:

The iterative method achieved an interpolation SNR of about 35 dB.
The modular method achieved an SNR of about 15 dB.

Mask Detection Using Neural Networks This was my bachelor's project and the first project in which I used Deep Neural Networks; it was about detecting people wearing masks. For this, I used a practical Convolutional Neural Network (CNN) and prepared my own dataset to identify masks in the frames of a video. The main issues were tracking the position of a person with a mask across video frames and cropping the head; these were addressed by detecting the eyes in the frames.

Alireza Heshmati

Research Assistant and Signal Processing Engineer at the Electronics Research Institute (ERI), Sharif University of Technology. Interested in Speech and Image Processing, Trustworthy AI, Robustness of Deep Learning Networks, NLP, and Compressed Sensing 🧑‍💻💻✖️➖