Research: My research areas span computer vision (CV), deep learning (DL), and machine learning (ML).
Some of the active areas of research include:
Transfer learning (domain adaptation, incremental learning).
Fairness and bias-free learning.
Uncertainty in deep learning (Bayesian models).
Generative models (GANs, VAEs, and diffusion models).
Medical image processing.
3D reconstruction, augmented reality (AR), and virtual reality (VR).
We are seeking motivated researchers who are passionate about computer vision, machine learning, and deep learning. If you are interested, please send us your CV and highlight your areas of interest and experience. We have openings for long-term internships and PhD positions in this group.
For more details, please visit the Visual Data Computing Group (VisDom).
In machine learning applications, gradual data ingress is common, especially in audio processing, where incremental learning is vital for real-time analytics. Few-shot class-incremental learning addresses the challenges arising from limited incoming data. Existing methods often integrate additional trainable components or rely on a fixed embedding extractor after training on the base sessions to mitigate catastrophic forgetting and the danger of model overfitting. However, using cross-entropy loss alone during base-session training is suboptimal for audio data. To address this, we propose incorporating supervised contrastive learning to refine the representation space, enhancing discriminative power and leading to better generalization, as it facilitates seamless integration of incremental classes upon arrival. Experimental results on the NSynth and LibriSpeech datasets with 100 classes, as well as the ESC dataset with 50 and 10 classes, demonstrate state-of-the-art performance.
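The loss combination described above can be sketched as follows. This is a minimal illustration, not the authors' code: the temperature, the weight lam, and the single-view formulation of the supervised contrastive term are assumptions.

import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.07):
    # features: (N, D) L2-normalized embeddings; labels: (N,) class ids
    sim = features @ features.t() / temperature
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logits = sim.masked_fill(eye, float('-inf'))       # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)          # avoid -inf * 0 on diagonal
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                             # anchors that have positives
    per_anchor = -(log_prob * pos_mask.float()).sum(1)[valid] / pos_counts[valid]
    return per_anchor.mean()

def base_session_loss(logits, embeddings, labels, lam=0.5):
    # cross-entropy on classifier logits plus the contrastive term
    z = F.normalize(embeddings, dim=1)
    return F.cross_entropy(logits, labels) + lam * supcon_loss(z, labels)

Here supcon_loss pulls same-class embeddings together and pushes different-class embeddings apart, which is the property the abstract relies on when incremental classes arrive.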
@inproceedings{singh_interspeech24,
Author = {Singh, Riyansha
and Nema, Parinita
and Kurmi, Vinod},
Title = {Towards Robust Few-shot
Class Incremental Learning in Audio
Classification using Contrastive Representation},
Booktitle = {Interspeech},
Year = {2024}
}
Regression models are of fundamental importance in explicitly explaining the response variable in terms of covariates. However, the point predictions of these models limit their use in many real-world applications. Heteroscedasticity is common in most real-world scenarios and is hard to model due to its randomness. The Gaussian process generally captures epistemic (model) uncertainty but fails to capture heteroscedastic aleatoric uncertainty. The HetGP framework inherently captures both epistemic and aleatoric uncertainty by placing independent GP priors on both the mean function and the error term. We propose applying a post-hoc HetGP to the residuals of a trained deterministic neural network to obtain both epistemic and aleatoric uncertainty. The advantage of post-hoc HetGP on residuals is that it can be extended to any type of model, since the model is assumed to be a black box that provides point predictions. We demonstrate our approach through simulation studies and UCI regression datasets.
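A rough sketch of the post-hoc idea, under stated assumptions: the trained network is treated as a black box, one GP is fit to its residuals (its predictive standard deviation standing in for epistemic uncertainty), and a second GP smooths log squared residuals as a proxy for the heteroscedastic noise level. The paper's HetGP places GP priors on the mean and error term jointly, so this is only an approximation of that formulation.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def posthoc_uncertainty(black_box_predict, X_train, y_train, X_test):
    residuals = y_train - black_box_predict(X_train)
    kernel = RBF() + WhiteKernel()
    # GP on residuals: predictive std reflects epistemic uncertainty
    gp_mean = GaussianProcessRegressor(kernel=kernel).fit(X_train, residuals)
    corr, epi_std = gp_mean.predict(X_test, return_std=True)
    # GP on log squared residuals: a smooth estimate of input-dependent noise
    gp_noise = GaussianProcessRegressor(kernel=kernel).fit(
        X_train, np.log(residuals**2 + 1e-8))
    ale_std = np.exp(0.5 * gp_noise.predict(X_test))
    y_hat = black_box_predict(X_test) + corr   # bias-corrected prediction
    return y_hat, epi_std, ale_std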
@inproceedings{udbhav,
Author = {Dalavai, Udbhav Mallanna
and Dwivedi, Rajeev R
and Thakur, Rini S
and Kurmi, Vinod},
Title = {Quantifying Uncertainty
in Neural Networks through Residuals},
Booktitle = {CIKM},
Year = {2024}
}
Deep neural networks trained on biased data often inadvertently learn unintended inference rules, particularly when labels are strongly correlated with biased features. Existing bias mitigation methods typically involve either a) predefining bias types and enforcing them as prior knowledge or b) reweighting training samples to emphasize bias-conflicting samples over bias-aligned samples. However, both strategies address bias indirectly in the feature or sample space, with no direct control over the learned weights, making it difficult to restrict bias propagation across different layers. Based on this observation, we introduce a novel approach that addresses bias directly in the model's parameter space, preventing its propagation across layers. Our method involves training two models: a bias model for biased features and a debias model for unbiased details, guided by the bias model. We enforce dissimilarity in the debias model's later layers and similarity in its initial layers with the bias model, ensuring it learns unbiased low-level features without adopting biased high-level abstractions. By incorporating this explicit constraint during training, our approach shows enhanced classification accuracy and debiasing effectiveness across various synthetic and real-world datasets of different sizes. Moreover, the proposed method demonstrates robustness across different bias types and percentages of biased samples in the training data.
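One plausible reading of the parameter-space constraint is a cosine-similarity penalty on flattened layer weights: early layers of the debias model are pulled toward the bias model, later layers are pushed away. The layer split and the exact penalty form below are assumptions for illustration, not the released code.

import torch
import torch.nn.functional as F

def cosine_constraint(bias_model, debias_model, early_names, late_names):
    # early_names/late_names: parameter names marking the assumed layer split
    params_b = dict(bias_model.named_parameters())
    params_d = dict(debias_model.named_parameters())
    loss = 0.0
    for name in early_names:   # encourage similarity: 1 - cos
        cs = F.cosine_similarity(params_d[name].flatten(),
                                 params_b[name].detach().flatten(), dim=0)
        loss = loss + (1.0 - cs)
    for name in late_names:    # encourage dissimilarity: |cos|
        cs = F.cosine_similarity(params_d[name].flatten(),
                                 params_b[name].detach().flatten(), dim=0)
        loss = loss + cs.abs()
    return loss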
@inproceedings{rajeevbmvc24_abs,
Author = {Dwivedi, Rajeev R
and Kumari, Priyadarshini
and Kurmi, Vinod},
Title = {CosFairNet: A Parameter-Space based
Approach for Bias Free Learning},
Booktitle = {BMVC},
Year = {2024}
}
Recently, there have been significant advancements in the 3D face reconstruction field, largely driven by monocular image-based deep learning methods. However, these methods still face challenges in reliable deployment due to their sensitivity to facial occlusions and their inability to maintain identity consistency across different occlusions within the same facial image. To address these issues, we propose two frameworks: Distillation Assisted Mono Image Occlusion Robustification (DAMIOR) and Duplicate Images Assisted Multi Occlusions Robustification (DIAMOR). The DAMIOR framework leverages the knowledge from the Occlusion Frail Trainer (OFT) network to enhance robustness against facial occlusions. Our proposed method overcomes the sensitivity to occlusions and improves reconstruction accuracy. To tackle the issue of identity inconsistency, the DIAMOR framework utilizes the estimates from DAMIOR to mitigate inconsistencies in the geometry and texture, collectively known as identity, of the reconstructed 3D faces. We evaluate the performance of DAMIOR on two variations of the CelebA test dataset: empirical occlusions and irrational occlusions. Furthermore, we analyze the performance of the proposed DIAMOR framework using the irrational occlusion-based variant of the CelebA test dataset. Our methods outperform state-of-the-art approaches by a significant margin. For example, DAMIOR reduces the 3D vertex-based shape error by 41.1% and the texture error by 21.8% for empirical occlusions. Moreover, for facial data with irrational occlusions, DIAMOR achieves a substantial decrease in shape error of 42.5% and texture error of 30.5%. These results demonstrate the effectiveness of our proposed methods.
@inproceedings{hitika_ivc3,
Author = {Tiwari, Hitika
and Kurmi, Vinod K
and Subramanian, Venkatesh K
and Chen, Yong-Sheng},
Title = {Distilling Knowledge
for Occlusion Robust Monocular
3D Face Reconstruction},
Booktitle = {Image and Vision Computing (IMAVIS)},
Year = {2022}
}
Generalized Keyword Spotting using ASR embeddings
Kirandevraj R, Vinod K Kurmi, Vinay P Namboodiri, C V Jawahar. Conference of the International Speech Communication Association (Interspeech), 2022, Incheon, Korea.
Keyword Spotting (KWS) detects a set of pre-defined spoken keywords. Building a KWS system for an arbitrary set requires massive training datasets. We propose to use the text transcripts from an Automatic Speech Recognition (ASR) system alongside triplets for KWS training. The intermediate representation from the ASR system trained on a speech corpus is used as acoustic word embeddings for keywords. Triplet loss is added to the Connectionist Temporal Classification (CTC) loss in the ASR while training. This method achieves an Average Precision (AP) of 0.843 over 344 words unseen by the model trained on the TIMIT dataset. In contrast, the Multi-View recurrent method that learns jointly on the text and acoustic embeddings achieves only 0.218 for out-of-vocabulary words. This method is also applied to low-resource languages such as Tamil by converting Tamil characters to English using transliteration. This is a very challenging novel task for which we provide a dataset of transcripts for the keywords. Despite our model not generalizing well, we achieve a benchmark AP of 0.321 over 38 words unseen by the model on the MSWC Tamil keyword set. The model also produces an accuracy of 96.2% for classification tasks on the Google Speech Commands dataset.
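The joint objective is straightforward to sketch with standard PyTorch losses; the shapes, pooling of the intermediate embeddings, and the mixing weight lam are assumptions.

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
triplet = nn.TripletMarginLoss(margin=1.0)

def kws_loss(log_probs, targets, in_lens, tgt_lens, anc, pos, neg, lam=1.0):
    # log_probs: (T, N, C) ASR outputs for CTC; anc/pos/neg: pooled
    # intermediate embeddings for anchor/positive/negative keyword examples
    return ctc(log_probs, targets, in_lens, tgt_lens) + lam * triplet(anc, pos, neg)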
@inproceedings{kiran_inter22,
Author = {R, Kirandevraj
and Kurmi, Vinod K
and Namboodiri, Vinay P
and Jawahar, C V},
Title = {Generalized Keyword
Spotting using ASR embeddings},
Booktitle = {InterSpeech},
Year = {2022}
}
Gradient Based Activations for Accurate Bias-Free Learning. Vinod K Kurmi*, Rishabh Sharma*, Yash Vardhan Sharma*, Vinay P Namboodiri (*equal contribution). Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 2022.
Bias mitigation in machine learning models is imperative, yet challenging. While several approaches have been proposed, one view towards mitigating bias is through adversarial learning. A discriminator is used to identify the bias attributes in question, such as gender, age, or race. This discriminator is used adversarially to ensure that it cannot distinguish the bias attributes. The main drawback of such a model is that it directly introduces a trade-off with accuracy, as the features that the discriminator deems sensitive for bias discrimination could be correlated with classification. In this work, we address this problem. We show that a biased discriminator can actually be used to improve this bias-accuracy trade-off. Specifically, this is achieved through a feature masking approach using the discriminator's gradients. We ensure that the features favoured for bias discrimination are de-emphasized and the unbiased features are enhanced during classification. We show that this simple approach works well to reduce bias as well as improve accuracy significantly. We evaluate the proposed model on standard benchmarks. We improve the accuracy of the adversarial methods while maintaining or even improving unbiasedness, and we also outperform several other recent methods.
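A minimal sketch of the gradient-based masking idea as read from the abstract: features to which the bias discriminator's loss is most sensitive are softly suppressed before classification. The sigmoid-based soft mask is an assumption; the paper's exact gating may differ.

import torch
import torch.nn.functional as F

def debiased_logits(features, classifier, discriminator, bias_labels):
    feats = features.detach().requires_grad_(True)
    d_loss = F.cross_entropy(discriminator(feats), bias_labels)
    grads, = torch.autograd.grad(d_loss, feats)
    # larger |gradient| => feature more useful for predicting the bias
    # attribute => stronger suppression before classification
    mask = 1.0 - torch.sigmoid(grads.abs())
    return classifier(features * mask)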
@inproceedings{kurmi_aaai22,
Author = {Kurmi, Vinod K
and Sharma, Rishabh and
Sharma, Yash Vardhan and
Namboodiri, Vinay P},
Title = {Gradient Based
Activations for Accurate
Bias-Free Learning},
Booktitle = {AAAI},
Year = {2022}
}
3D face reconstruction from a monocular face image is a mathematically ill-posed problem. Recently, we observed a surge of interest in deep learning-based approaches to address the issue. These methods are extremely sensitive to occlusions. Thus, in this paper, we present a novel context-learning-based distillation approach to tackle occlusions in face images. Our training pipeline focuses on distilling the knowledge from a pre-trained occlusion-sensitive deep network. The proposed model learns the context of the target occluded face image. Hence, our approach uses a weak model (unsuitable for occluded face images) to train a network that is highly robust to partially and fully occluded face images. We obtain a landmark accuracy of $0.77$ against $5.84$ for a recent state-of-the-art method on real-life challenging facial occlusions. Also, we propose a novel end-to-end training pipeline to reconstruct 3D faces from multiple variations of the target image per identity, to emphasize the significance of visible facial features during learning. For this purpose, we leverage a novel composite multi-occlusion loss function. Our multi-occlusion per-identity model reduces the landmark error by a large margin of $6.67$ in comparison to a recent state-of-the-art method. We deploy the occluded variations of the CelebA validation dataset and the AFLW2000-3D face dataset (naturally occluded and artificially occluded) for the comparisons. We comprehensively compare our results with other approaches concerning the accuracy of the reconstructed 3D face mesh for occluded face images.
@inproceedings{tiwari_wacv22,
Author = {Tiwari, H.
and Kurmi, Vinod Kumar
and Venkatesh, KS and
Chen, Yong-Sheng},
Title = {Occlusion Resistant Network
for 3D Face Reconstruction},
Booktitle = {WACV},
Year = {2022}
}
Generating natural questions from an image is a semantic task that requires using
vision and language modalities to learn multimodal representations. Images can have
multiple visual and language cues such as places, captions, and tags. In this paper,
we propose a principled deep Bayesian learning framework that combines these cues
to produce natural questions. We observe that with the addition of more cues and
by minimizing uncertainty among the cues, the Bayesian network becomes more
confident. We propose Minimizing Uncertainty of Mixture of Cues (MUMC), which
minimizes the uncertainty present in a mixture of cue experts for generating probabilistic
questions. This is a Bayesian framework, and the results show a remarkable similarity to
natural questions as validated by a human study. Ablation studies of our model indicate
that a subset of cues is inferior at this task and hence the principled fusion of cues is
preferred. Further, we observe that the proposed approach substantially improves over
state-of-the-art benchmarks on the quantitative metrics (BLEU-n, METEOR, ROUGE,
and CIDEr).
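One simple way to realize uncertainty-minimizing fusion of cue experts is inverse-variance weighting of their outputs, sketched below. The paper's Bayesian treatment is richer, so treat this purely as an illustration of the idea; the variance estimates are assumed to come from MC-dropout passes over each cue expert.

import torch

def fuse_cues(cue_means, cue_vars):
    # cue_means/cue_vars: lists of (N, D) tensors, one per cue expert
    # (e.g., place, caption, tag), with variances from repeated stochastic passes
    means = torch.stack(cue_means)                 # (K, N, D)
    precisions = 1.0 / (torch.stack(cue_vars) + 1e-6)
    weights = precisions / precisions.sum(0, keepdim=True)
    return (weights * means).sum(0)                # uncertainty-weighted fusion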
@inproceedings{patro_ivc_2,
Author = {Patro, Badri Narayana
and Kumar, Sandeep and Kurmi,
Vinod Kumar and Namboodiri, Vinay},
Title = {MUMC: Minimizing Uncertainty
of Mixture of Cues},
Booktitle = {Image and Vision
Computing (IMAVIS)},
Year = {2021}
}
Adaptation of a classifier to new domains is one of the challenging problems in machine learning. This has been addressed using many deep and non-deep learning based methods. Among the methodologies used, adversarial learning is widely applied to solve many deep learning problems, including domain adaptation. These methods are based on a discriminator that ensures source and target distributions are close. However, here we suggest that rather than using a point estimate obtained from a single discriminator, it would be useful if a distribution based on an ensemble of discriminators could be used to bridge this gap. This could be achieved using multiple classifiers or traditional ensemble methods. In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator suffices to obtain the distribution based discriminator. Specifically, we propose a curriculum based dropout discriminator that gradually increases the variance of the sample based distribution, and the corresponding reverse gradients are used to align the source and target feature representations.
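A minimal sketch of the core construction, under assumptions: a discriminator whose dropout stays active at every pass yields Monte Carlo samples of the domain prediction, and a gradient reversal layer supplies the reverse gradients; the curriculum corresponds to gradually raising the dropout rate p. Layer sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g                      # reversed gradient aligns the features

class MCDropoutDiscriminator(nn.Module):
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, feats, p, n_samples=8):
        x = GradReverse.apply(feats)
        # dropout forced on: each pass is one Monte Carlo sample of the
        # discriminator; raising p over training widens the sample variance
        outs = [self.fc2(F.dropout(F.relu(self.fc1(x)), p=p, training=True))
                for _ in range(n_samples)]
        return torch.stack(outs).mean(0)   # averaged domain logit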
@inproceedings{kurmi2021_dropout_j,
Author = {Kurmi, Vinod Kumar and K S, Venkatesh
and Namboodiri, Vinay},
Title = {Exploring Dropout
Discriminator for Domain Adaptation},
Booktitle = {Neurocomputing},
Year = {2021}
}
Understanding unsupervised domain adaptation is an important task that has been well explored. However, the wide variety of existing methods has not analyzed the role of the classifier's performance in detail. In this paper, we thoroughly examine the role of the classifier in matching source and target distributions. We specifically investigate the classifier's ability by matching a) the distribution of features, b) probabilistic uncertainty for samples, and c) certainty activation mappings. Our analysis suggests that using these three distributions results in consistently improved performance on all the datasets. Our work thus extends present knowledge on the role of the various distributions obtained from the classifier in solving unsupervised domain adaptation.
@inproceedings{shanu,
Author = {Kumar, Shanu and
Kurmi, Vinod Kumar and Singh,
Praphul and Namboodiri, Vinay P},
Title = {Mitigating Uncertainty
of Classifier for Unsupervised Domain Adaptation},
Booktitle = {arXiv preprint},
Year = {2021}
}
Generative adversarial networks (GANs) are very popular for generating realistic images, but they often suffer from training instability and the phenomenon of mode loss. In order to attain greater diversity in GAN-synthesized data, it is critical to solve the problem of mode loss. Our work explores probabilistic approaches to GAN modelling that could allow us to tackle these issues. We present Prb-GANs, a new variation that uses dropout to create a distribution over the network parameters, with the posterior learnt using variational inference. We describe theoretically, and validate experimentally on simple and complex datasets, the benefits of such an approach. We look into further improvements using the concept of uncertainty measures. Through a set of further modifications to the loss functions for each network of the GAN, we are able to obtain results that show improved GAN performance. Our methods are extremely simple and require very little modification to existing GAN architectures.
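The central mechanism is easy to sketch: dropout kept active at generation time induces a distribution over generator weights, so repeated passes behave like samples from an approximate posterior. The MLP form and sizes below are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DropoutGenerator(nn.Module):
    def __init__(self, z_dim=64, hidden=256, out_dim=784, p=0.3):
        super().__init__()
        self.fc1 = nn.Linear(z_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)
        self.p = p

    def forward(self, z):
        # dropout active in train and test alike: each call effectively
        # samples one generator from the approximate posterior over weights
        h = F.dropout(F.relu(self.fc1(z)), p=self.p, training=True)
        return torch.tanh(self.fc2(h))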
@article{george2021prb,
title={Prb-GAN: A Probabilistic
Framework for GAN Modelling},
author={George, Blessen and
Kurmi, Vinod K and Namboodiri, Vinay P},
journal={arXiv preprint arXiv:2107.05241},
year={2021}
}
A fingerprint region of interest (ROI) segmentation algorithm is designed to separate the foreground fingerprint from the background noise. All the learning based state-of-the-art fingerprint ROI segmentation algorithms proposed in the literature are benchmarked in scenarios where both training and testing databases consist of fingerprint images acquired from the same sensors. However, when testing is conducted on a different sensor, the segmentation performance obtained is often unsatisfactory. As a result, every time a new fingerprint sensor is used for testing, the fingerprint ROI segmentation model needs to be re-trained with fingerprint images acquired from the new sensor and their corresponding manually marked ROIs. Manually marking fingerprint ROI is expensive because it is time-consuming and, more importantly, requires domain expertise. In order to save the human effort in generating the annotations required by the state-of-the-art, we propose a fingerprint ROI segmentation model which aligns the features of fingerprint images derived from the unseen sensor such that they are similar to the ones obtained from the fingerprints whose ground truth ROI masks are available for training. Specifically, we propose a recurrent adversarial learning based feature alignment network that helps the fingerprint ROI segmentation model to learn sensor-invariant features. Consequently, the sensor-invariant features learnt by the proposed ROI segmentation model help it to achieve improved segmentation performance on fingerprints acquired from the new sensor. Experiments on publicly available FVC databases demonstrate the efficacy of the proposed work.
@inproceedings{joshi021_1,
Author = {Joshi, Indu and Kothari, Riya
and Utkarsh, Ayush and Kurmi, Vinod K
and Dantcheva, Antitza and Roy,
Sumantra Dutta and Kalra, Prem Kumar},
Title = {Sensor-invariant Fingerprint
ROI Segmentation Using Recurrent
Adversarial Learning},
Booktitle = {IJCNN},
Year = {2021}
}
The effectiveness of fingerprint-based authentication systems on good quality fingerprints has long been established. However, the performance of standard fingerprint matching systems on noisy and poor quality fingerprints is far from satisfactory. Towards this, we propose a data uncertainty-based framework which enables state-of-the-art fingerprint preprocessing models to quantify the noise present in the input image and identify fingerprint regions with background noise and poor ridge clarity. Quantification of noise helps the model in two ways: first, it makes the objective function adaptive to the noise in a particular input fingerprint and consequently helps to achieve robust performance on noisy and distorted fingerprint regions; second, it provides a noise variance map which indicates noisy pixels in the input fingerprint image. The predicted noise variance map enables end-users to understand erroneous predictions due to noise present in the input image. Extensive experimental evaluation on 13 publicly available fingerprint databases, across different architectural choices and two fingerprint processing tasks, demonstrates the effectiveness of the proposed framework.
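A common way to realize such a noise-adaptive objective is the heteroscedastic regression loss, where the network predicts a per-pixel log-variance alongside its output: noisy regions are down-weighted in the data term and the variance map flags them. The sketch below illustrates that standard pattern and is an assumption about the paper's exact formulation.

import torch

def noise_aware_loss(pred, log_var, target):
    # pred, log_var, target: (N, 1, H, W); exp(log_var) is the noise map.
    # High predicted variance shrinks the squared-error term but pays a
    # log-variance penalty, so the model cannot inflate variance everywhere.
    return (0.5 * torch.exp(-log_var) * (pred - target) ** 2
            + 0.5 * log_var).mean()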
@inproceedings{joshi021_2,
Author = {Joshi, Indu and Kothari, Riya
and Utkarsh, Ayush and Kurmi, Vinod K
and Dantcheva, Antitza and Roy,
Sumantra Dutta and Kalra, Prem Kumar},
Title = {Data Uncertainty Guided
Noise-aware Preprocessing of Fingerprints},
Booktitle = {IJCNN},
Year = {2021}
}
In this paper, we consider the problem of domain adaptation for multi-class classification, where we are provided a labeled source dataset and a target dataset with no supervision. We tackle the mode collapse problem in adapting the classifier across domains. In this setting, we propose an adversarial learning-based approach using an informative discriminator. Our observation relies on the analysis showing that if the discriminator has access to all the available information, including the class structure present in the source dataset, then it can guide the transformation of target features to a more structured adapted space. Further, by training the informative discriminator using the more robust source samples, we are able to obtain better domain-invariant features. Using this formulation, we achieve state-of-the-art results for the standard evaluation on benchmark datasets. We also provide a detailed analysis, which shows that using all the labeled information results in improved domain adaptation.
@inproceedings{kurmi2021_idda_j,
Author = {Kurmi, Vinod Kumar and K S, Venkatesh
and Namboodiri, Vinay},
Title = {Informative Discriminator
for Domain Adaptation},
Booktitle = {Image and Vision Computing (IMAVIS)},
Year = {2021}
}
Collaborative Learning to Generate Audio-Video Jointly. Vinod K Kurmi, Vipul Bajaj, Badri N. Patro, Venkatesh K Subramanian, Vinay P. Namboodiri, Preethi Jyothi. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021.
There have been a number of techniques that have demonstrated the generation of multimedia data for one modality at a time using GANs, such as the ability to generate images, videos, and audio. However, so far, the task of multi-modal generation of data, specifically for audio and video together, has not been sufficiently well explored. Towards this, we propose a method that demonstrates that we are able to generate naturalistic samples of video and audio data by the joint correlated generation of the audio and video modalities. The proposed method uses multiple discriminators to ensure that the audio, the video, and the joint output are each indistinguishable from real-world samples. We present a dataset for this task and show that we are able to generate realistic samples. This method is validated using standard metrics such as Inception Score and Frechet Inception Distance (FID) and through human evaluation.
@inproceedings{kurmi2021_avg,
Author = {Kurmi, Vinod Kumar and Bajaj, Vipul and Patro, Badri N
and K S, Venkatesh and Namboodiri, Vinay and Jyothi, Preethi},
Title = {Collaborative Learning to Generate Audio-Video Jointly},
Booktitle = {ICASSP},
Year = {2021}
}
Unsupervised domain adaptation methods solve the adaptation problem for an unlabeled target set, assuming that the source dataset is available with all labels. However, the availability of actual source samples is not always possible in practical cases, due to memory constraints, privacy concerns, and challenges in sharing data. This practical scenario creates a bottleneck for the domain adaptation problem. This paper addresses this challenging scenario by proposing a domain adaptation technique that does not need any source data. Instead of the source data, we are only provided with a classifier that is trained on the source data. Our proposed approach is based on a generative framework, where the trained classifier is used for generating samples from the source classes. We learn the joint distribution of data by using energy-based modeling of the trained classifier. At the same time, a new classifier is adapted for the target domain. We perform various ablation analyses under different experimental setups and demonstrate that the proposed approach achieves better results than the baseline models in this challenging source-free scenario.
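One concrete reading of generating source-like samples from only a trained classifier is Langevin dynamics on logsumexp of the logits treated as an unnormalized log-density (the usual energy-based view of a classifier). The sketch below is illustrative; step size, noise scale, number of steps, and initialization are assumptions.

import torch

def langevin_sample(classifier, x, steps=50, step_size=1.0, noise=0.01):
    # x: initial noise images; gradient ascent on the log-density plus
    # Gaussian noise drives x toward source-like samples
    x = x.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        log_density = torch.logsumexp(classifier(x), dim=1).sum()
        grad, = torch.autograd.grad(log_density, x)
        x = (x + step_size * grad + noise * torch.randn_like(x)).detach()
    return x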
@inproceedings{kurmi2021_sfda,
Author = {Kurmi, Vinod Kumar and K S, Venkatesh
and Namboodiri, Vinay},
Title = {Domain Impression: A Source Data Free
Domain Adaptation Method},
Booktitle = {WACV},
Year = {2021}
}
One of the major limitations of deep learning models is that they face catastrophic forgetting in an incremental learning scenario. Several approaches have been proposed to tackle the problem of incremental learning. Most of these methods are based on knowledge distillation and do not adequately utilize the information provided by older task models, such as uncertainty estimates for predictions. Predictive uncertainty provides distributional information that can be applied to mitigate catastrophic forgetting in a deep learning framework. In the proposed work, we consider a Bayesian formulation to obtain the data and model uncertainties. We also incorporate a self-attention framework to address the incremental learning problem. We define distillation losses in terms of aleatoric uncertainty and self-attention, and we present ablation analyses of these losses. Furthermore, we obtain better results in terms of accuracy on standard benchmarks.
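An illustrative (assumed) form of an aleatoric-uncertainty-weighted distillation term: the old-task model's logits are matched more strongly where its predictive uncertainty is low. The paper's Bayesian formulation and the companion self-attention distillation term are not reproduced here.

import torch
import torch.nn.functional as F

def uncertainty_distill_loss(student_logits, teacher_logits, teacher_log_var, T=2.0):
    # per-sample KL between temperature-softened teacher and student outputs
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='none').sum(1)
    weight = torch.exp(-teacher_log_var)   # trust the teacher where it is confident
    return (weight * kl).mean() * T * T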
@inproceedings{kurmi2021_incre,
Author = {Kurmi, Vinod Kumar and Patro, Badri Narayana
and K S, Venkatesh and Namboodiri, Vinay},
Title = {Do not Forget to Attend to Uncertainty while
Mitigating Catastrophic Forgetting},
Booktitle = {WACV},
Year = {2021}
}
A fingerprint Region of Interest (ROI) segmentation module is one of the most crucial components in the fingerprint pre-processing pipeline. It separates the foreground fingerprint from the background region, so that feature extraction and matching are restricted to the ROI instead of the entire fingerprint image. However, state-of-the-art segmentation algorithms act like a black box and do not indicate model confidence. In this direction, we propose an explainable fingerprint ROI segmentation model which indicates the pixels on which the model is uncertain. Towards this, we benchmark four state-of-the-art models for semantic segmentation on fingerprint ROI segmentation. Furthermore, we demonstrate the effectiveness of model uncertainty as an attention mechanism to improve the segmentation performance of the best performing model. Experiments on publicly available Fingerprint Verification Challenge (FVC) databases showcase the effectiveness of the proposed model.
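The uncertainty map itself comes from standard Monte Carlo dropout, which can be sketched directly; how the paper turns the map into an attention mechanism is not reproduced here. A binary (sigmoid) segmentation head is an assumption.

import torch

def mc_dropout_uncertainty(seg_model, image, n_samples=20):
    seg_model.train()          # keep dropout layers stochastic at inference
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(seg_model(image))
                             for _ in range(n_samples)])
    # mean: ROI probability map; variance: per-pixel model uncertainty
    return probs.mean(0), probs.var(0)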
@inproceedings{joshi021_w,
Author = {Joshi, Indu and Kothari, Riya
and Utkarsh, Ayush and Kurmi, Vinod K
and Dantcheva, Antitza and Roy,
Sumantra Dutta and Kalra, Prem Kumar},
Title = {Explainable Fingerprint ROI Segmentation
Using Monte Carlo Dropout},
Booktitle = {IEEE Winter Conference on Applications
of Computer Vision Workshops (WACVW)},
Year = {2021}
}
In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of obtaining word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings, achieved by a simple approach in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder and is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings on a sentiment analysis task. The proposed method results in semantic embeddings and outperforms the state-of-the-art on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.
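The pair-wise constraint can be sketched as a margin loss on sentence embeddings: true paraphrases are pulled together and unrelated candidates pushed beyond a margin. The margin value and Euclidean distance choice are assumptions for illustration.

import torch
import torch.nn.functional as F

def pairwise_discriminator_loss(anchor, paraphrase, unrelated, margin=1.0):
    # anchor/paraphrase/unrelated: (N, D) sentence embeddings from the encoder
    d_pos = F.pairwise_distance(anchor, paraphrase)   # keep small
    d_neg = F.pairwise_distance(anchor, unrelated)    # keep beyond the margin
    return (d_pos + F.relu(margin - d_neg)).mean()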
@inproceedings{patro2020paraphrase,
title={Revisiting Paraphrase Question Generator
using Pairwise Discriminator},
author={Patro, Badri Narayana and Chauhan, Dev
and Kurmi, Vinod Kumar and Namboodiri, Vinay},
booktitle={Neurocomputing},
year = {2020}
}
Generating natural questions from an image is a semantic task that requires using visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).
@inproceedings{patro2020Bayesian,
title={Deep Bayesian Network for Visual Question Generation},
author={Patro, Badri Narayana and Kumar, Sandeep
and Kurmi, Vinod Kumar and Namboodiri, Vinay},
booktitle={IEEE Winter Conference on Applications
of Computer Vision (WACV)},
year = {2020}
}
Domain adaptation is essential to enable wide usage of deep learning based networks trained using large labeled datasets. Adversarial learning based techniques have shown their utility towards solving this problem using a discriminator that ensures source and target distributions are close. However, here we suggest that rather than using a point estimate, it would be useful if a distribution based discriminator could be used to bridge this gap. This could be achieved using multiple classifiers or using traditional ensemble methods. In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator could suffice to obtain the distribution based discriminator. Specifically, we propose a curriculum based dropout discriminator that gradually increases the variance of the sample based distribution, and the corresponding reverse gradients are used to align the source and target feature representations. The detailed results and thorough ablation analysis show that our model outperforms state-of-the-art results.
@inproceedings{kurmi2019curriculum,
title={Curriculum based Dropout Discriminator for Domain Adaptation},
author={Kurmi, Vinod Kumar and Bajaj, Vipul
and Subramanian, Venkatesh K and Namboodiri, Vinay P},
booktitle={BMVC},
year={2019}
}
In this paper, we tackle the problem of domain adaptation. In this setting, we are provided a labeled source dataset with multiple classes present and a target dataset that has no supervision. We propose an adversarial discriminator based approach. While adversarial discriminator based approaches have been proposed previously, in this paper we present an informed adversarial discriminator. Our observation relies on the analysis showing that if the discriminator has access to all the available information, including the class structure present in the source dataset, then it can guide the transformation of target features to a more structured adapted space. Using this formulation, we obtain state-of-the-art results for the standard evaluation on benchmark datasets. We further provide a detailed analysis which shows that using all the labeled information results in improved domain adaptation.
@InProceedings{kurmi2019looking,
author = {Kurmi, Vinod Kumar and
Namboodiri, Vinay P},
title = {Looking back at Labels:
A Class based Domain Adaptation
Technique},
booktitle = {International Joint
Conference on Neural Networks (IJCNN)},
month = {July},
year = {2019}
}
In this paper, we aim to solve unsupervised domain adaptation of classifiers, where we have access to label information for the source domain while it is not available for the target domain. While various methods have been proposed for solving this, including adversarial discriminator based methods, most approaches have focused on adapting the entire image. In an image, there are regions that can be adapted better; for instance, the foreground object may be similar in nature. To obtain such regions, we propose methods that consider the probabilistic certainty estimate of various regions and specifically focus on these during classification for adaptation. We observe that just by incorporating the probabilistic certainty of the discriminator while training the classifier, we are able to obtain state-of-the-art results on various datasets as compared against all the recent methods. We provide a thorough empirical analysis of the method through ablation analysis, statistical significance tests, and visualization of the attention maps and t-SNE embeddings. These evaluations convincingly demonstrate the effectiveness of the proposed approach.
@InProceedings{Kurmi_2019_CVPR,
author = {Kurmi, Vinod Kumar and Kumar, Shanu
and Namboodiri, Vinay P.},
title = {Attending to Discriminative Certainty
for Domain Adaptation},
booktitle = {IEEE Computer Society Conference
on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
Generating natural questions from an image is a semantic task that requires using visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).
@inproceedings{patro2018multimodal,
title={Multimodal Differential Network
for Visual Question Generation},
author={Patro, Badri Narayana and Kumar,
Sandeep and Kurmi, Vinod Kumar and Namboodiri, Vinay},
booktitle={Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing},
pages={4002--4012},
year={2018}
}
In this paper, we propose a method for obtaining sentence-level embeddings. While the problem of obtaining word-level embeddings is very well studied, we propose a novel method for obtaining sentence-level embeddings, achieved by a simple approach in the context of solving the paraphrase generation task. If we use a sequential encoder-decoder model for generating paraphrases, we would like the generated paraphrase to be semantically close to the original sentence. One way to ensure this is by adding constraints for true paraphrase embeddings to be close and unrelated paraphrase candidate sentence embeddings to be far. This is ensured by using a sequential pair-wise discriminator that shares weights with the encoder and is trained with a suitable loss function. Our loss function penalizes paraphrase sentence embedding distances from being too large. This loss is used in combination with a sequential encoder-decoder network. We also validate our method by evaluating the obtained embeddings on a sentiment analysis task. The proposed method results in semantic embeddings and outperforms the state-of-the-art on the paraphrase generation and sentiment analysis tasks on standard datasets. These results are also shown to be statistically significant.
@inproceedings{patro2018learning,
title={Learning Semantic Sentence Embeddings
using Sequential Pair-wise Discriminator},
author={Patro, Badri Narayana and Kurmi,
Vinod Kumar and Kumar, Sandeep and
Namboodiri, Vinay},
booktitle={Proceedings of the 27th
International Conference on Computational
Linguistics},
pages={2715--2729},
year={2018}
}
Robust hand gesture recognition from 3D data. Vinod K Kurmi, Garima Jain, Venkatesh K Subramanian. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2015.
In this paper, we use the output of a 3D sensor (e.g., the Microsoft Kinect) to capture depth images of humans making a set of predefined hand gestures in various body poses. Conventional approaches using Kinect data have been constrained by the limitation of the human detector middleware, which requires close conformity to a standard near-erect, legs-apart, hands-apart pose for the subject. Our approach also permits clutter and possible motion in the scene background and, to a limited extent, in the foreground as well. We make an important point in this work to emphasize that the recognition performance is considerably improved by a choice of hand gestures that accommodates the sensor's specific limitations. These sensor limitations include low resolution in x and y as well as z. Hand gestures have been chosen (designed) for easy detection by seeking to detect a fingers-apart, fingertip constellation with minimum computation, without, however, compromising utility or ergonomics. It is shown that these gestures can be recognised in real time irrespective of visible-band illumination levels, background motion, foreground clutter, user body pose, gesturing speed, and user distance. The last is, of course, limited by the sensor's own range limitations. Our main contributions are the selection and design of gestures suitable for limited-range, limited-resolution 3D sensors and the novel method of depth slicing used to extract hand features from the background. This obviates the need for preliminary human detection and enables easy detection and highly reliable and fast (30 fps) gesture classification.
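The depth-slicing step can be sketched as keeping a thin depth band nearest the sensor, where the gesturing hand is assumed to lie, and labelling connected blobs as hand/fingertip candidates. The band width and blob-size threshold below are illustrative assumptions, not the paper's values.

import numpy as np
from scipy import ndimage

def depth_slice_hand(depth, band_mm=80, min_blob=30):
    # depth: (H, W) array in millimetres, 0 marking invalid pixels;
    # assumes at least one valid depth pixel is present
    valid = depth[depth > 0]
    near = valid.min()                       # nearest surface ~ the hand
    mask = (depth > 0) & (depth < near + band_mm)
    labels, n = ndimage.label(mask)          # connected components in the slice
    sizes = ndimage.sum(mask, labels, range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_blob]
    return np.isin(labels, keep)             # hand/fingertip candidate mask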
@inproceedings{kurmi2015robust,
title={Robust hand gesture recognition
from 3D data},
author={Kurmi, Vinod K and Jain, Garima
and Venkatesh, KS},
booktitle={WSCG},
year={2015},
publisher={V{\'a}clav Skala-UNION Agency}
}