Education

Teaching Experience

Activities

  • November 2014
    Conference Committee, PyCon 2014
    PyCon is the largest annual gathering for the community using and developing the open-source Python programming language.
  • April 2013
    Elected as the Chief Director of the Computer Engineering Department Student Association (CESA)
    CESA is the student committee that directs the department’s extracurricular activities.
  • April 2013
    Elected as a member of the Central Scientific Student Association (CSSA)
    CSSA is the scientific student organization of Iran University of Science and Technology; it aims to facilitate students’ scientific activities and educational process.

Research Interests


  • Data Cleaning, Big Data Analytics
  • Database and Query Processing
  • Parallel and Distributed Computing
  • Machine Learning, Pattern Recognition
  • Computational Geometry

Research Experience


  • Research Assistant in the SFU Data Science Laboratory (Sep 2016 - Sep 2018)
    • Explored a variation of the typical active learning setting, in which a learning algorithm can interactively query the user to obtain the desired labels for training data.
    • Developed an intelligent system that reduces the cost of manually obtaining labels, with an estimation error up to 3× smaller than existing methods. This work resulted in the paper "Cleaning Crowdsourced Labels Using Oracles For Supervised Learning", published at the VLDB 2019 conference.

  • Research Assistant in the IUST Data Mining Laboratory (2014 - March 2016)
    Joint work with Mr. Ali Hadian, under the supervision of Dr. Behrouz Minaei-Bidgoli. It started as extensive research on various spatial data structures. The intention was to propose a taxonomy of existing methods for indexing geometric spaces, along with a survey of the literature.

Work Experience

  • Data Scientist at Traction on Demand

    Mar 2018 - Sep 2018

    • Built a data pipeline to extract and map different sources of information related to customers and employees, resulting in an 800% speedup in data collection.
    • Designed machine learning models for customer lifetime value (LTV) analysis, sales lead scoring, customer churn prediction, project assignment, etc., using Pandas, scikit-learn, TensorFlow, and Keras to increase profitability and throughput.
    https://tractionondemand.com/blog/are-you-ready-for-ai/
  • Database Manager of www.zirend.com

    2013 - 2016

    • Managed data warehousing for an online outsourcing marketplace that allows employers to post projects for freelancers. Built with Django (a Python MVC web framework), Zirend lets anyone post work to be done by thousands of active, skilled users.
  • Founder & Software Manager of Elexir

    Apr 2015 - Jul 2015

    • Used Neo4j for handling user relations and MongoDB for document-based objects in a social network mobile app, cutting the access time in half compared to a traditional RDBMS.
  • Database Manager at Emersun Industries Co

    Apr 2014 - Jul 2014

    • Refined the pipeline for data transmission between two Microsoft SQL Server databases at one of the biggest manufacturers of home appliances in the Middle East, reducing the human effort required.


Cleaning Crowdsourced Labels Using Oracles For Supervised Learning

Mohamad Dolatshah, Mathew Teoh, Jiannan Wang
Conference Paper
Proceedings of the 2019 International Conference on Very Large Data Bases (VLDB 2019)

Abstract

Nowadays, crowdsourcing is being widely used to collect training data for solving classification problems. However, crowdsourced labels are often noisy, and there is a performance gap between classification with noisy labels and classification with true labels. In this paper, we consider how to apply oracle-based label cleaning to reduce the gap. We propose TARS, a label-cleaning advisor that can provide two pieces of valuable advice for data scientists when they need to train or test a model using noisy labels. Firstly, in the model testing stage, given a test dataset with noisy labels, and a classification model, TARS can use the test data to estimate how well the model will perform w.r.t. true labels. Secondly, in the model training stage, given a training dataset with noisy labels, and a classification algorithm, TARS can determine which label should be sent to an oracle to clean such that the model can be improved the most. For the first advice, we propose an effective estimation technique, and study how to compute confidence intervals to bound its estimation error. For the second advice, we propose a novel cleaning strategy along with two optimization techniques, and illustrate that it is superior to the existing cleaning strategies. We evaluate TARS on both simulated and real-world datasets. The results show that (1) TARS can use noisy test data to accurately estimate a model’s true performance for various evaluation metrics; and (2) TARS can improve the model accuracy by a larger margin than the existing cleaning strategies, for the same cleaning budget.

Ball*-tree: Efficient spatial indexing for constrained nearest-neighbor search

Mohamad Dolatshah, Ali Hadian, Behrouz Minaei-Bidgoli
Conference Paper
3rd International Conference on Mathematical Sciences & Computer Engineering (ICMSCE 2016)

Abstract

Emerging location-based systems and data analysis frameworks require efficient management of spatial data for approximate and exact search. Exact similarity search can be done using space partitioning data structures, such as Kd-tree, R*-tree, and Ball-tree. In this paper, we focus on Ball-tree, an efficient search tree that is specific for spatial queries which use Euclidean distance. Each node of a Ball-tree defines a ball, i.e. a hypersphere that contains a subset of the points to be searched. In this paper, we propose Ball*-tree, an improved Ball-tree that is more efficient for spatial queries. Ball*-tree enjoys a modified space partitioning algorithm that considers the distribution of the data points in order to find an efficient splitting hyperplane. Also, we propose a new algorithm for KNN queries with restricted range using Ball*-tree, which performs better than both KNN and range search for such queries. Results show that Ball*-tree performs 39%-57% faster than the original Ball-tree algorithm.
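
To make the query type in this abstract concrete, the following is a minimal sketch of a range-constrained KNN query, written as a brute-force reference rather than as the Ball*-tree algorithm from the paper. The function names `constrained_knn` and `ball_can_be_pruned` are hypothetical and chosen for illustration. The pruning test is the standard triangle-inequality bound that ball-based trees rely on; how Ball*-tree applies it internally is not described here.

```python
import heapq
import math


def constrained_knn(points, query, k, radius):
    """Return up to k points nearest to `query` among those within `radius`.

    Brute-force reference: it scans every point, so it only defines the
    expected answer of a range-constrained KNN query.
    """
    candidates = []
    for p in points:
        d = math.dist(p, query)  # Euclidean distance (Python 3.8+)
        if d <= radius:
            candidates.append((d, p))
    nearest = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [p for _, p in nearest]


def ball_can_be_pruned(center, ball_radius, query, bound):
    """Standard ball pruning test based on the triangle inequality.

    Every point inside the ball is at least dist(query, center) - ball_radius
    away from the query. If that lower bound exceeds `bound` (the search
    radius, or the current k-th best distance), the ball can be skipped.
    """
    return math.dist(center, query) - ball_radius > bound
```

A tree-based implementation would call a test like `ball_can_be_pruned` on each node and descend only into balls that may still contain a qualifying point, while returning the same result as `constrained_knn`.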

Pattern Recognition in the 2016 Presidential Election

Project
Leveraged natural language processing (NLP) and pattern recognition on the top candidates' tweets, along with statistical tests of the null hypothesis in Pandas.

Re-indexing Model for Webpage Crawler

Project

Abstract

Performed a Monte Carlo study of the update rate of webpages using the Poisson distribution and maximum-likelihood estimation (MLE).
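
As an illustration of this kind of study, here is a minimal sketch under stated assumptions: page changes are modeled as a Poisson process with a fixed rate per unit interval, and the number of changes in each interval is fully observed. The project's actual setup is not described above, so the function names and parameters below are hypothetical. Under this model the MLE of the rate is the sample mean of the counts.

```python
import random
import statistics


def simulate_change_counts(rate, n_intervals, rng):
    """Simulate the number of page changes in each of n unit-length intervals.

    Inter-arrival times of a Poisson process are exponential, so counting
    exponential arrivals that fall inside [0, 1] yields a Poisson(rate) count.
    """
    counts = []
    for _ in range(n_intervals):
        elapsed, changes = 0.0, 0
        while True:
            elapsed += rng.expovariate(rate)
            if elapsed > 1.0:
                break
            changes += 1
        counts.append(changes)
    return counts


def poisson_mle(counts):
    """MLE of the Poisson rate: the sample mean of the observed counts."""
    return sum(counts) / len(counts)


def monte_carlo_study(true_rate, n_intervals, n_trials, seed=0):
    """Repeat the simulation and summarize the spread of the MLE."""
    rng = random.Random(seed)
    estimates = [
        poisson_mle(simulate_change_counts(true_rate, n_intervals, rng))
        for _ in range(n_trials)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)
```

Calling `monte_carlo_study(2.0, 50, 200)` returns the average estimate and its standard deviation across trials, which shows how far the estimated update rate can drift from the true rate for a given number of observed intervals.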

Spark Image Clustering

Project

Abstract

Applied exploratory data analysis (EDA) and visualization in matplotlib, and implemented the k-means algorithm on a real dataset using Spark (PySpark) on the Databricks cloud computing platform.

Recommender System

Project

Abstract

A recommender system designed to suggest movies based on user ranking, built in Python with a Kd-tree and later optimized with a Ball-tree data structure.

Analyzing NASA Server Logs

Project

Abstract

Examined the NASA web server logs using HDFS clusters to analyze the Pearson correlation between features in SparkSQL.

Customer Lifetime Value (LTV) Analysis

Project

Abstract

Calculated LTV values and trained supervised learning models to predict important customers based on behavioral features, using the Kaplan-Meier estimator, resulting in a 77% increase in profitability.
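
For readers unfamiliar with the Kaplan-Meier estimator mentioned here, the sketch below computes a survival (retention) curve from customer lifetimes with right-censoring. It is a plain-Python illustration of the textbook formula, not the project's code. The function name `kaplan_meier` and the input layout are assumptions, and how the curve fed into the LTV calculation is not stated above.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where a churn event occurred.

    durations: time each customer was followed (e.g. months as a customer).
    observed:  True if the customer churned at that time, False if the
               customer was still active when observation ended (censored).
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers who churned exactly at time t.
        churned = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - churned / at_risk
        curve.append((t, survival))
    return curve
```

Each value S(t) estimates the fraction of customers still active after time t, which is one common input when projecting expected revenue per customer.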

Churn Rate Prediction

Project

Abstract

Conducted data wrangling on business order information to predict whether a sales lead will be converted, by training an LSTM (long short-term memory) neural network model based on the Weibull distribution.

Customer Activity Forecast

Project

Abstract

Predicted whether a customer will be inactive for a window of time by feeding CRM cycle history to a recurrent neural network (RNN), with 96% accuracy.

Deep Feature Selection and Classification

Project

Abstract

Used ensemble feature engineering and cross-validation in MATLAB for a Deep Belief Network (DBN), resulting in 71% accuracy.

Skills

Programming Languages

C, C++, Python, CUDA, R, SQL, Java

Data Science

Hadoop, Hive, SparkSQL, Cassandra, TensorFlow, Pandas, Amazon Web Services

Tools

Tableau, Trifacta, Keras, Microsoft Azure, IBM Watson, Jupyter, WEKA, MATLAB

Database Technologies

MySQL, Microsoft SQL Server, SQLite, MongoDB, OrientDB, Neo4j

Web Technologies

PHP, Django, HTML, CSS, JavaScript, jQuery, AJAX

Operating Systems

Linux, Ubuntu Server, CentOS, Windows

Typesetting

LaTeX, XeTeX

Hobbies


  • Sports
    Playing soccer, table tennis, and chess.

  • Camping and mountain climbing.

References

Ali Hadian

Research Assistant

Email: ali.hadian@gmail.com

Lara Gilchrist

Product Director, Traction on Demand

Email: info@tractionondemand.com

Jiannan Wang

Assistant Professor

Email: jnwang@sfu.ca

Behrouz Minaei-Bidgoli

Associate Professor (BSc & MSc Supervisor)

Email: b_minaei@iust.ac.ir