Quest for an Offensiveness Detector Part 1
It seems like so much of our online activity is tied to our identity. All manner of online shopping and social media services require you to hand over pieces of your identity before you can enjoy their product or service. At the crux of this particular quest is this thought:
What kinds of conversations are possible with social media that is completely anonymous?
I recently started doing some data science projects at Confesh, an anonymous social media platform that makes a promise never to track you… no username, email, or IP address. One of the interesting things we’re exploring is classifying user sentiment, i.e. what do people think/feel about a confession? Is most of it spam, trolling, and bigotry, or – maybe counterintuitively – can there be honest, substantive, or at least some kind of civil conversation?
The thing about sentiment analysis is that a sentiment classifier (i.e. “this post has a positive/negative sentiment”) only performs well if it has access to a lot of labeled data. Luckily, Confesh also has a mechanism for reporting spam. These reports are a potential source of labels because users can provide free text stating the reason for reporting a confession or comment.
One other limitation of sentiment analysis is that it typically only handles simple binary outcomes, like “this review is positive or negative”. I can talk about this more in a future post, but generally, going for the simplest model is the most expedient thing to do when building these kinds of data pipelines. Luckily, the subset of the Confesh dataset that we’re going to take a look at might be able to provide us with everything we need to create a rudimentary ‘offensiveness’ detector.
In data science speak, I’d say we’re dealing with semi-structured data (which we often are). In this post, we’re going to reshape and recast our dataset into a structure that can help answer some interesting questions.
I always like to have a working hypothesis to guide my explorations, so here goes:
There are statistical patterns in the word composition of confessions such that we can predict whether a confession is `offensive` or `not offensive`, with some degree of accuracy, using a simple classifier algorithm.
I won’t really be able to test this hypothesis in this post, but I think it’s a good enough motivation to get us started!
The Toolbox
As with any craft, we need some tools… in our case, those would be Python and a bunch of nice open source libraries!
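To make that concrete, here’s the kind of import block the rest of this post leans on. The specific libraries (pandas, NLTK, scikit-learn, wordcloud, plotly) are my best guess at a typical stack for this workflow, not a confirmed requirements list:

```python
# A plausible toolbox for this post; the specific library choices are
# assumptions, not a confirmed requirements list.
import re                                     # text cleaning and pattern matching
import pandas as pd                           # tabular data wrangling
from sklearn.feature_extraction.text import CountVectorizer  # ngram counting
from wordcloud import WordCloud               # word cloud rendering
from plotly import tools                      # interactive subplot grids
```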
The Data
What do you get if you give a bunch of liberal arts college students an anonymous platform?
This dataset is a small subset of the confessions, comments, and reports from the Mount Holyoke Confesh.
We can read the dataset into memory to take a closer look. Think of this as our chopping block: we’re going to take four separate CSV (comma-separated values) files and splice them together.
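In pandas, that splicing might look something like the sketch below. The file names and join keys are hypothetical, since the actual schema isn’t shown in this post; the comments and report-reason files join in the same way:

```python
import pandas as pd

# Hypothetical file names and join keys; the real schema isn't shown here.
secrets = pd.read_csv("secrets.csv")
reports = pd.read_csv("reports.csv")

# Flag each confession as reported or not by left-joining the reports table.
df = secrets.merge(reports, left_on="id_secret", right_on="secret_id", how="left")
df["reported"] = df["secret_id"].notnull()
df.head()
```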
| | id_secret | confession | clean_tokens_secret |
|---|---|---|---|
| 0 | 14040 | goddamn insomnia. | goddamn insomnia |
| 1 | 13994 | GO TO SLEEP. KEEP YOUR SECRETS TO YOURSELF. | sleep keep secret |
| 2 | 10971 | we are accidents waiting to happen | accident waiting happen |
| 3 | 12515 | Is this site ruining your life? | site ruining life |
| 4 | 9854 | I just do it for kicks, and I don't believe an... | kick dont believe |
The `confession` column is the original raw text, and `clean_tokens_secret` is the result of some preprocessing that I did. For this initial preprocessing step, I did the following (a sketch follows the list):
- removed punctuation
- removed special characters like `/` or `~`
- removed numbers
- lowercased all letters
- removed stopwords (i.e. common words like ‘the’, ‘and’, ‘a’ that are typically structural and contribute little to the ‘aboutness’ of a piece of text)
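Here’s a rough sketch of that cleaning step. The listed transformations are from above; the NLTK stopword list and the lemmatizer are assumptions I’m making based on the sample output (e.g. ‘secrets’ becoming ‘secret’):

```python
import re
from nltk.corpus import stopwords          # assumes nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # assumes nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tokens(text):
    text = text.lower()                        # lowercase all letters
    text = re.sub(r"[^a-z\s]", "", text)       # drop punctuation, /, ~, numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    # Lemmatizing is an extra step implied by the sample output above.
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

df["clean_tokens_secret"] = df["confession"].apply(clean_tokens)
```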
Censoring Problematic Words
Not surprisingly, we need to do more preprocessing…
Ultimately, we want to process our data so that we can create some interesting things with them, like visualizations and machine learning models.
After seeing the unfiltered version of the results you are about to see, I decided that censoring select words (namely the n-word) was appropriate. While it’s important to let the data speak for itself, I didn’t feel comfortable presenting these results without exercising some editorial judgement.
Warning: there is some offensive language in this text analysis.
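The masking itself can be a simple token replacement once you have a list of terms. A sketch, with the actual term list deliberately left out:

```python
import re

# Placeholder list; the actual censored terms are deliberately not reproduced.
CENSORED_TERMS = ["..."]

def censor(text):
    # Replace each censored term with the placeholder token "n_word".
    for term in CENSORED_TERMS:
        text = re.sub(re.escape(term), "n_word", text)
    return text

df["clean_tokens_secret"] = df["clean_tokens_secret"].apply(censor)
```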
| | clean_tokens_secret |
|---|---|
| 1955 | define making fool many people think dont want... |
| 4235 | n_word n_word n_word discus |
| 4236 | n_word n_word n_word discus |
| 4237 | n_word n_word n_word discus |
| 4238 | n_word n_word n_word discus |
Let’s find a pattern… not!
This next little code block is meant to sift through all the secrets for a specific `pattern` and return only those posts that contain a match. In this initial analysis, I want to be able to analyze all the confessions, so we’ll leave `pattern` at `''`, which means all secrets will be matched.
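Since the original block isn’t reproduced here, this is my reconstruction of what it likely does, using pandas’ `str.contains`:

```python
def match_secrets(df, pattern=""):
    """Return only the confessions whose cleaned tokens match `pattern`.

    An empty pattern matches every row.
    """
    mask = df["clean_tokens_secret"].str.contains(pattern, regex=True, na=False)
    return df[mask]

matched = match_secrets(df, pattern="")   # '' matches all secrets
```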
Makin’ a Word Cloud, ‘cause we can…
For all their limitations, word clouds are still fun :) They’re great for giving you a broad impression of the word composition of a text, which is exactly what we want to do right now.
Below we create a word cloud in the shape of the Confesh logo, all purple n’ stuff, ‘cause purple is pretty.
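A sketch using the `wordcloud` package; the logo file name here is hypothetical:

```python
import numpy as np
from PIL import Image
from wordcloud import WordCloud

mask = np.array(Image.open("confesh_logo.png"))   # hypothetical file name

# Render the matched confessions as a purple word cloud in the mask's shape.
text = " ".join(matched["clean_tokens_secret"].dropna())
cloud = WordCloud(mask=mask, background_color="white",
                  colormap="Purples").generate(text)
cloud.to_file("confesh_wordcloud.png")
```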
As you can see, the n-word is one of the most frequently used words in the dataset, along with a smattering of expletives and some other pretty mundane verbiage. I don’t know about you, but when I first saw this word cloud I thought to myself: “wow, this platform enables racism and bigotry because anonymity”.
Pardon the grammatically incorrect thought, but actually I think I may have been jumping to a conclusion there. Isn’t the entire internet a platform for trolling, bigotry, and racism? It occurred to me that the quality of content on a social media platform is heavily influenced by that platform’s moderation system.
Like Facebook and Twitter, Confesh has a moderation system that communities can use to report confessions and comments. We will end this post by answering a final question:
If we group confessions by those that were reported by the community and those that were not, how would the above word frequency distribution change?
Counting Ngrams: An Introduction to Text-mining
It’s great to count individual words and all, but what we lose by doing that is context.
What words appeared together in sequence?
A simple way to address this problem is by computing ngrams. An ngram is a sequence of `n` words that appear in succession in a given piece of text. So a unigram would be a single word, a bigram a sequence of two words, and so on:
- Unigram (1-gram): ‘the’
- Bigram (2-gram): ‘the cat’
- Trigram (3-gram): ‘the cat sits’
- …
Doing this allows us to at least capture the most frequent sequences of words.
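One way to compute these counts is with scikit-learn’s `CountVectorizer`; a minimal sketch (the table below also splits counts by reported status, which just means running this once per group and merging the results):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def ngram_frequencies(docs, n_max=6):
    """Count every 1- to n_max-gram across the given documents."""
    vec = CountVectorizer(ngram_range=(1, n_max))
    counts = vec.fit_transform(docs)
    words = vec.get_feature_names_out()
    return (pd.DataFrame({"word": words,
                          "frequency": counts.sum(axis=0).A1,
                          "ngrams": [len(w.split()) for w in words]})
              .sort_values("frequency", ascending=False))

freq_all = ngram_frequencies(df["clean_tokens_secret"].dropna())
```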
| | word | frequency_all | ngrams | frequency_not_reported | frequency_reported |
|---|---|---|---|---|---|
| 0 | n_word | 59120 | 1 | 148 | 58972 |
| 1 | n_word n_word | 52896 | 2 | 6 | 52890 |
| 2 | n_word n_word n_word | 52874 | 3 | 2 | 52872 |
| 3 | n_word n_word n_word n_word n_word n_word | 52858 | 6 | 1 | 52857 |
| 4 | like | 29268 | 1 | 27518 | 1750 |
Just looking at the first 5 rows in the ngram frequency table, we can pose an interesting hypothesis:
The same word repeated many times in sequence is an indicator of spam.
I think the relationship between offensiveness and spam is an interesting topic, but I think that’s for another post. For now, we need to do a…
Sanity Check!
As a data scientist, it’s important to do sanity checks often. Our data now looks so different from how it started that we need to check and double-check that the transformations we are actually performing are the ones we intend.
Below, we do a quick test to make sure that, for each row, the sum of `frequency_not_reported` and `frequency_reported` equals `frequency_all`. This should hold because the not-reported and reported categories are mutually exclusive and collectively exhaustive.
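The check itself is one line of arithmetic. Here `ngrams` stands for the merged frequency table from above (the variable name is an assumption):

```python
# reported + not reported should account for every occurrence of each ngram.
residual = (ngrams["frequency_all"]
            - ngrams["frequency_not_reported"]
            - ngrams["frequency_reported"]).abs().sum()
print("We should expect this to be zero!:", residual)
assert residual == 0, "Sanity check failed!"
print("Sanity check passed!")
```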
We should expect this to be zero!: 0
Sanity check passed!
Creating an Interactive Visualization
Wouldn’t it be nice to compare the confessions that contain the most frequent words in the corpus? What if you could break it down by whether a confession was reported or not?
To do this, we need to enrich our ngram frequency data with some more text data. Below, we filter the ngrams table to include only the top 20 unigrams, bigrams, and trigrams for a total of 60 (1,2,3)-grams.
Then we search through the cleaned confession text to find confessions that contain our top 60 (1,2,3)-grams, and keep only the top 5 confessions with the most comments for each.
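Sketched out, that filtering and enrichment might look like this; the `num_comments` column name is hypothetical:

```python
# Top 20 ngrams for each of n = 1, 2, 3 (60 total).
top60 = (ngrams[ngrams["ngrams"].isin([1, 2, 3])]
         .sort_values("frequency_all", ascending=False)
         .groupby("ngrams").head(20))

def top_confessions(ngram, df, k=5):
    """The k most-commented confessions containing `ngram`.

    The num_comments column is an assumed name, not the actual schema.
    """
    hits = df[df["clean_tokens_secret"].str.contains(ngram, na=False)]
    return hits.nlargest(k, "num_comments")["confession"].tolist()
```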
| | word | top_secrets | top_reports |
|---|---|---|---|
| 0 | n_word | to the two bitches who didn't make ...<br>i'm ... | you who are shaming am for having a...<br>n_wo... |
| 4 | like | hi you guys. i'm a recent-ish alum....<br>i re... | let's give this a go: rate my body!...<br>some... |
| 5 | dont | bringing back an oldie. paste whate...<br>hi y... | the mhc confessional needs to be bl...<br>so h... |
| 6 | get | can smith & holyoke together count ...<br>deba... | the mhc confessional needs to be bl...<br>okay... |
| 7 | want | i'm a guy. ask me whatever you want...<br>deba... | "fellow classmates, hope all is wel...<br>so, ... |
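The grid printout below comes from laying out three stacked subplots, one per ngram order. This sketch assumes the legacy `plotly.tools` API, whose `make_subplots` echoes exactly that message:

```python
from plotly import tools
import plotly.graph_objs as go

# One row each for unigrams, bigrams, and trigrams.
fig = tools.make_subplots(rows=3, cols=1, print_grid=True)
for row, n in enumerate([1, 2, 3], start=1):
    subset = top60[top60["ngrams"] == n]
    fig.append_trace(go.Bar(x=subset["word"], y=subset["frequency_not_reported"],
                            name="not reported (%d-gram)" % n), row, 1)
    fig.append_trace(go.Bar(x=subset["word"], y=subset["frequency_reported"],
                            name="reported (%d-gram)" % n), row, 1)
```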
This is the format of your plot grid:
[ (1,1) x1,y1 ]
[ (2,1) x2,y2 ]
[ (3,1) x3,y3 ]
Compare and Contrast
What do confessions look like when you remove reported posts?
An emerging question from this exploration is this:
“How does the Mount Holyoke community feel about the use of the n-word?”.
It’ll take a little bit more data smithery to get at this question in a deeper way, but for now, you can explore the distribution of unigrams, bigrams, and trigrams in the interactive frequency plots below. Click on the legend items to hide/show a particular category, and see what you get!
Takeaways
- The n-word is being used a lot in this community forum.
- Preliminary analysis suggests that the n-word is mostly being used as spam.
- The Mount Holyoke community is moderating the hell out of posts that contain the n-word.
More Questions
As always, exploring data only leads to more questions. The next step on this quest is to see why the community is reporting a particular post. With these text data, we can start to label our confessions with something like `offensive` / `not offensive`.
Just to give you a little taste:
| | report_reason |
|---|---|
| 2 | wrong thread |
| 4 | troll. |
| 5 | type a reason here... |
| 8 | error |
| 11 | spam |
| 15 | double post |
| 16 | doesn't make sense since i deleted my double post |
| 18 | name |
| 19 | its demeaning. |
| 20 | it attacks a person |
Thoughts? Comments? Questions? Let me know what you think in the comments section below!