Intro

Kenneth L. Lay (Chairman), Jeffrey K. Skilling (CEO), Andrew S. Fastow (CFO)

Enron was one of America's largest energy corporations, and it collapsed due to fraudulent activities. Enron's collapse impacted the lives of thousands of people and shook Wall Street: its shares dropped from a peak of $90.75 to $0.67 by January 2002.

Kenneth Lay was charged with securities fraud, wire fraud, and making false and misleading statements. Jeffrey Skilling was charged with hiding the financial losses of the trading business and other operations of the company. Andrew Fastow schemed to make the company appear to be in great shape, despite the fact that it was losing money.

This project focuses on the Enron company datasets. We attempt to classify POIs (Persons of Interest) from two different sets of data. One dataset contains insider pay for all the Enron executives. The second dataset contains the text of emails sent through their company accounts.

Through the email dataset we attempt to improve the accuracy of our prediction.

The email dataset can be downloaded from here, and the insider payroll can be downloaded from here (final_project_dataset.pkl).

Data

Insider Pays Data Preprocessing

The two datasets (insider payroll and emails) are in pkl format, so we converted them into CSV and text format in order to use them for our data analysis. The pandas and cPickle libraries are used for converting the pickle files into CSV and text, respectively.
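The conversion step can be sketched as follows. This is a minimal, self-contained illustration, assuming the insider-pay pickle maps each person's name to a dict of features (the miniature dictionary and its values here are made up for demonstration):

```python
import pickle
import pandas as pd

# Hypothetical miniature of the insider-pay dict; the real
# final_project_dataset.pkl maps each name to a feature dict.
sample = {
    'LAY KENNETH L': {'salary': 1072321, 'bonus': 7000000, 'poi': True},
    'SKILLING JEFFREY K': {'salary': 1111258, 'bonus': 5600000, 'poi': True},
}
with open('sample_dataset.pkl', 'wb') as f:
    pickle.dump(sample, f)

# Convert pickle -> DataFrame -> CSV (rows = people, columns = features)
with open('sample_dataset.pkl', 'rb') as f:
    data_dict = pickle.load(f)
df = pd.DataFrame.from_dict(data_dict, orient='index')
df.to_csv('insider_pay.csv')
```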

Statistics

Some statistical analysis was done over different features of the data, and the results are depicted in the following table. A high standard deviation indicates possible outliers in our dataset. It could also mean that a few people have very high income/stock compared to others. To be sure, we have to perform further analysis, which is done using the visualizations below.
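A per-feature summary of this kind can be produced with pandas; the sketch below uses illustrative numbers, not the real payroll values:

```python
import pandas as pd

# Hypothetical subset of the payroll features (values are illustrative).
df = pd.DataFrame({
    'salary': [1072321, 1111258, 267102, 239671],
    'bonus':  [7000000, 5600000, 1200000, 400000],
})

# Per-feature count, mean, std, min, quartiles, max
stats = df.describe()
print(stats)
```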

Visualization

The numerical distribution of different features of the data is visualized using Tableau. After analysing these images, our confidence in the existence of outliers increases.



In order to find the outliers, we implemented code in MATLAB and analysed the data for the suspicious features: bonus, exercised stock options, loan advances, other, restricted stock, and restricted stock deferred. Our code finds the top five values for each of these features and checks whether each value belongs to a POI or not. The results of our analysis for the bonus feature are shown in the following table:
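The same top-five check can be sketched in Python (a hedged equivalent of the MATLAB code, on a hypothetical slice of the payroll data; the names and most values are made up for illustration):

```python
import pandas as pd

# Hypothetical slice of the payroll data: bonus values plus the POI flag.
df = pd.DataFrame(
    {'bonus': [97343619, 7000000, 5600000, 8000000, 1200000, 400000],
     'poi':   [False,    True,    True,    False,   False,   False]},
    index=['TOTAL', 'LAY KENNETH L', 'SKILLING JEFFREY K',
           'LAVORATO JOHN J', 'FREVERT MARK A', 'OTHER PERSON'],
)

# Top five bonus values, keeping the POI flag for inspection
top5 = df.sort_values('bonus', ascending=False).head(5)
print(top5)
```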

Based on these results, we concluded that TOTAL is an outlier (it is the spreadsheet's sum row, not a person); therefore, we deleted this row from our dataset.

The top five values for the rest of the suspicious features are shown in the following table:

Based on these results, we concluded that all the values are normal except for restricted stock deferred: for every other feature the largest value belongs to a POI, and we do not delete values related to POIs. The top value of the restricted stock deferred feature, however, seems suspicious.

We then performed further analysis on the raw data to find out whether the restricted stock deferred value is really an outlier. Based on the information provided here (enron61702insiderpay.pdf), we know that the features are correlated in a specific way: the payments are combined into two totals. One is total payments, which is the sum of salary, bonus, director fees, deferral payments, deferred income, loan advances, long term incentive, expenses, and other; the other is total stock value, which is the sum of restricted stock, exercised stock options, and restricted stock deferred.

Summing the component features of total payments and total stock value and comparing the results with the stated totals, we noticed some differences. The differences come from the 'BHATNAGAR SANJAY' and 'BELFER ROBERT' rows; therefore, we corrected the values of all the relevant features for these two names.
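The consistency check can be sketched like this: recompute each row's total from its components and flag rows where it disagrees with the stated total. Column names follow the insider-pay sheet; the two rows and their numbers are made up for illustration:

```python
import pandas as pd

# Payment components that should sum to total_payments for every row.
payment_cols = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
                'deferral_payments', 'loan_advances', 'other',
                'expenses', 'director_fees']

df = pd.DataFrame({
    'salary': [100, 200], 'bonus': [50, 0], 'long_term_incentive': [0, 0],
    'deferred_income': [0, -20], 'deferral_payments': [0, 0],
    'loan_advances': [0, 0], 'other': [10, 0], 'expenses': [5, 5],
    'director_fees': [0, 0], 'total_payments': [165, 190],
}, index=['OK PERSON', 'SHIFTED PERSON'])

# Rows where the recomputed total disagrees with the stated total
mismatch = df[df[payment_cols].sum(axis=1) != df['total_payments']]
print(mismatch.index.tolist())
```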

Finally, we re-ran the check on the modified dataset for both groups of payments and verified that no differences remained. The modified dataset was then plotted again in order to visualize the corrected data for the different features.

The modified dataset is visualized again using Tableau. The results showed no more outliers.

Correlation analysis of the features was performed in order to assess how independent the features are of each other, as well as how they depend on the POI label.

We found that the total payments and total stock value features are highly correlated, so we removed them from the dataset.
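A correlation screen of this kind can be sketched with pandas; the columns and values below are hypothetical, and the 0.9 threshold is an assumption chosen for illustration:

```python
import pandas as pd

# Hypothetical numeric columns; salary and total_stock are near-linear.
df = pd.DataFrame({
    'salary':      [100, 200, 300, 400],
    'total_stock': [1000, 2100, 2900, 4100],
    'expenses':    [5, 3, 9, 1],
})

corr = df.corr()
# Flag pairs with |correlation| above the (assumed) 0.9 threshold
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.9]
print(high)
```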

The MATLAB code for removing outliers and data cleansing on Insider payroll data can be found here (RemovingOutliers.m and correlation.m).

Email Dataset Preprocessing

SparkContext cannot read a pkl file that was not saved with RDD.saveAsPickleFile, so the data was converted from pkl to text before being processed by Spark.

Sample Pkl text

Sample from text file

The entire text file can be downloaded from here.
The code to convert pkl file to text can be found here. (convertPklToText.py)

Analysis

1. Loading Data

After conversion to text, the email dataset is copied into HDFS and loaded into a DataFrame with the following structure:

from pyspark.sql.types import StructType, StructField, StringType

# load data into an RDD and split each line into its fields
wordsRdd = sc.textFile(inputPath).repartition(sc.defaultParallelism * 20)
dataRdd = wordsRdd.map(lambda line: splitLines(line))

# define the schema
schema = StructType([
    StructField('person', StringType(), False),
    StructField('email', StringType(), False),
    StructField('poi', StringType(), False),
    StructField('sentence', StringType(), False)])

dataFrameEnron = sqlContext.createDataFrame(dataRdd, schema)
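The splitLines helper used above is not shown in the snippet; a minimal sketch is below. The tab separator is an assumption about how the converted text file delimits the four fields:

```python
# Hypothetical splitLines helper: each text line is assumed to hold the
# four tab-separated fields of the schema (person, email, poi, sentence).
def splitLines(line):
    person, email, poi, sentence = line.split('\t', 3)
    return (person, email, poi, sentence)
```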

2. Tokenization:

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). Here we use a tokenizer that converts the input string to lowercase and then splits it on whitespace.

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(dataFrameEnron)

3. Remove Stop Words:

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequence. Here we remove stop words, i.e. words that do not carry significant meaning.

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
removedwordsData = remover.transform(wordsData)

4. TF-IDF:

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.

# perform TF using hashing. Maps a sequence of terms to their term frequencies using the hashing trick.
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=100)
featurizedData = hashingTF.transform(removedwordsData)

# perform IDF
# ignore terms which occur in less than a minimum number of documents.
# In such cases, the IDF for these terms is set to 0.
idf = IDF(minDocFreq=2, inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
idfDataFrame = idfModel.transform(featurizedData)


TFij = fij, the frequency of term ti in document dj
IDFi = log(N / ni), where ni is the number of documents that mention term i and N is the total number of documents
TF-IDF score: wij = TFij × IDFi
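A worked instance of the score with made-up counts (note that Spark's IDF applies extra smoothing, so its exact values differ slightly from this plain formula):

```python
import math

# Illustrative counts: term i appears f_ij = 3 times in document j,
# in n_i = 2 of N = 8 documents.
f_ij, n_i, N = 3, 2, 8

tf = f_ij                # raw term frequency TF_ij
idf = math.log(N / n_i)  # IDF_i = log(N / n_i)
w = tf * idf             # TF-IDF score w_ij = TF_ij x IDF_i
print(round(w, 4))       # 4.1589
```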

5. Classification Using Decision Trees:

model = DecisionTree.trainClassifier(trainData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', maxDepth=5, maxBins=32)

6. Experiments

Gini: Accuracy 83.9%
Entropy: Accuracy 83.8%
MaxDepth 7: Accuracy 83.1%
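The accuracy figures above are the fraction of test examples whose predicted label matches the true label; a minimal sketch of that computation with toy labels:

```python
# Toy true/predicted POI labels, for illustration only.
true_labels = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
predicted   = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

# Accuracy = number of matching pairs / total number of examples
correct = sum(t == p for t, p in zip(true_labels, predicted))
accuracy = correct / len(true_labels)
print(accuracy)  # 0.8
```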

Comparison

The accuracy of classifying the payroll data was 80%, which is lower than the 83.9% achieved on the email dataset.
The code for classifying insider payroll and emails can be found here. (decisionTree.py & bagofwords.py)

Summary

Based on our analysis on both the data sets, here are our findings:
1. Outliers can cause large deviations in our data, so analysing and removing them should be the first step before the analysis.
2. Finding correlations between features helped us identify dependencies among the data features and decide which type of classifier to apply.
3. Even though the email content does not seem important at first thought, it actually showed better results than the insider payroll. So we should not ignore any kind of data in our analysis, as data that seems less important can bring better results.

Links to our codes and data can be found as following:
1. https://github.com/skhaksho/Enron-Project
2. http://www.sfu.ca/~skhaksho/Data/

Marking the project

Getting the data: Acquiring/gathering/downloading: (2 points)
ETL: Extract-Transform-Load work and cleaning the data set: (3 points)
Problem: Work on defining problem itself and motivation for the analysis: (1 point)
Algorithmic work: Work on the algorithms needed to work with the data, including integrating data mining and machine learning techniques: (4 points)
Bigness/parallelization: Efficiency of the analysis on a cluster, and scalability to larger data sets: (2 points)
UI: User interface to the results, possibly including web or data exploration frontends: (3 points)
Visualization: Visualization of analysis results: (3 points)
Technologies: New technologies learned as part of doing the project: (2 points)
