Experienced Data Scientist with a demonstrated history of working in the higher education industry. Skilled in Statistical Data Analysis, Big Data Programming, Machine Learning, and Apache Spark.
Strong research professional with a Master's degree focused in Computer Science from Simon Fraser University.
Kamal HighSchool GPA: 19.3/20 (4/4)
Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with true labels. In this paper, we consider how to apply oracle-based label cleaning to reduce the gap. We propose TARS, a label-cleaning advisor that can provide two pieces of valuable advice for data scientists when they need to train or test a model using noisy labels. Firstly, in the model testing stage, given a test dataset with noisy labels, and a classification model, TARS can use the test data to estimate how well the model will perform w.r.t. true labels. Secondly, in the model training stage, given a training dataset with noisy labels, and a classification algorithm, TARS can determine which label should be sent to an oracle to clean such that the model can be improved the most. For the first advice, we propose an effective estimation technique, and study how to compute confidence intervals to bound its estimation error. For the second advice, we propose a novel cleaning strategy along with two optimization techniques, and illustrate that it is superior to the existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model’s true performance for various evaluation metrics; and (2) TARS can improve the model accuracy by a larger margin than the existing cleaning strategies, for the same cleaning budget
Emerging location-based systems and data analysis frameworks requires efficient management of spatial data for approximate and exact search. Exact similarity search can be done using space partitioning data structures, such as Kd-tree, R*-tree, and Ball-tree. In this paper, we focus on Ball-tree, an efficient search tree that is specific for spatial queries which use euclidean distance. Each node of a Ball-tree defines a ball, i.e. a hypersphere that contains a subset of the points to be searched. In this paper, we propose Ball*-tree, an improved Ball-tree that is more efficient for spatial queries. Ball*-tree enjoys a modified space partitioning algorithm that considers the distribution of the data points in order to find an efficient splitting hyperplane. Also, we propose a new algorithm for KNN queries with restricted range using Ball*-tree, which performs better than both KNN and range search for such queries. Results show that Ball*-tree performs 39%-57% faster than the original Ball-tree algorithm.
Performed a Monte Carlo study for the Update Rate of webpages using Poisson distribution and maximum-likelihood estimation (MLE).
Applied exploratory Data Analysis (EDA) and visualization in matplotlib along with implementation of the k-means algorithm on a real dataset using Spark(PySpark) on Databricks Cloud Computing framework.
A recommender system designed for offering movies based on user ranking built by Kd-tree and later by Ball-tree data structure as an optimization on Python.