How to get your Facebook messages into R

2016-10-29
7 min read

It is a truism that we live in the information age, yet on a day to day basis we engage remarkably little with insights on the personal information we create. Sure, Netflix shows you films you want to see, Amazon offers books you want to buy and Facebook shows you pictures of cats with boobs or whatever it is you tend to click on, but explicit purpose of that is to get your money. What about using all that data to gain insights on who you are, who your friends are, what you tend to talk about? Introspection has a long history, but Wundt wasn’t particularly data-driven.

A few interesting facts are shown by WolframAlpha & myPersonality but they are limited by what Facebook decides it is acceptable for you to share with other apps. For instance, you cannot share your Facebook message info (probably for the best) which is the most interesting bit. I decided to dig deeper and look into my Facebook data.

How (technical stuff)

To do this I first downloaded my Facebook data (this link shows you how). It comes in an html format, and I started to extract relevant information with rvest when I noticed a python script on github which handily converts it into JSON format. After a bit of fiddling (different time formats and timezones can confuse the script) I got it into JSON and using rjson into R.

#read and fix JSON file
library(rjson)

msgR<-fromJSON(file = "C:\\Users\\user1\\Desktop\\Life\\RProjects\\facebookMsg\\FB-Message-Parser\\messages.json",)
msg<-msgR[[1]]
rm(msgR)

threads<-sapply(msg,function(x) paste(x[[1]], sep ="", collapse = "-"))

thread<-vector()
sender<-vector()
date<-vector()
message<-vector()

for (i in 1:length(msg)){
  thread.length<-length(msg[[i]][[2]])
  for (a in 1:thread.length){
          thread[length(thread)+1]  <- threads[i]
         sender[length(sender)+1]   <- msg[[i]][[2]][[a]][[3]]
         date[length(date)+1]       <- msg[[i]][[2]][[a]][[2]]
         message[length(message)+1] <- msg[[i]][[2]][[a]][[1]]
  }
cat(paste0("thread ",i," involving ", threads[i]," has been completed ,
           ...its length was ", thread.length, "\n"))
}

df<-data.frame(thread,sender,date,message)
rm(message,msg,sender,thread,thread.length,threads, date)

df$date<-as.POSIXct(strptime(df$date,format = "%A, %B %e, %Y at %I:%M%p"))

df$thread<-as.character(df$thread)
df$sender<-as.character(df$sender)
df$message<-as.character(df$message)

Thus the data is ready for nice little plots about frequencies and such. However, one of my goals was to analyse the text, and to do that I needed to categorise threads into different languages. This was an issue because I regularly chat in four languages. I thought I would have to go through the astounding effort of categorising by hand, when a quick google search showed me the textcat package, which will do it all for me.

#first step, try text categorisation
install.packages("textcat")
library(textcat)
library(dplyr)
library(ggplot2)

#testing
df$language<-textcat(df$message)

#exploring. Unfortunately it didn't work perfectly, certain messages were classified wrong (I don't speak Basque)
select(df,message,language) %>% filter(language == "basque")%>%
  select(message)

#So what I did was to take the language most spoken in the thread and classify it as that
lang.m<-select(df,thread,language) %>% group_by(thread,language) %>% summarise(count = n()) %>%
  top_n(1)

df<-merge(df,lang.m,all.x = T,by = "thread")

#I removed the few remaining wrongly classified threads
df<-df %>% filter(language == "hungarian" |
              language == "english" |
              language == "spanish" |
              language == "german")

In addition, I thought I would classify the messages whether they were sent by a male or a female. Luckily, R provided me with a package for this too, but it wasn’t super effective.

Results (interesting stuff?)

After a bit of cleaning and struggling with encodings the data was ready for analysis. The first thing that became clear was the sheer volume of the messages sent and received. To put this into perspective I downloaded a few books from Project Gutenberg(using Gutenbergr) and compared them to my own Facebook opus.

Total characters written and read by me in Facebook messages compared to total characters in well known literature.

What would the world come to if we spent all that time we spend on Facebook messenger writing novels? Of course, it is senseless to compare Plato’s Republic with my teenage lol’s, but its increadible how much stuff I’ve written. Given that I sent very few of these to myself, it seems to confirm that social media is indeed social. Just out of interest, I had a look to see how my vocabulary is different to Tolstoy’s.

What distinguishes my personal correspondance from War & Peace. Word size is indicating Chi-square overrepresentation.

What distinguishes my personal correspondance from War & Peace. Word size is indicating Chi-square overrepresentation.

When do I send these messages? I admittedly what I first wanted to see if there are individuals whom I message almost exclusively at certain times of the night. But then I realised the fact I’ve travelled a lot has confused Facebook’s understanding of what time different message’s were sent, moreover it would be imprudent to share that information in a blogpost.

So instead I looked at when I send messages during the the time I was at Cambridge. I’m least likely to be messaging at around 4,5am and most likely to be messaging after breakfast and before dinner. Funnily enough, there are more messages sent at 3am than at 2am when I am on holiday. These are the famous “where are you” text messages.

An average day in my life in terms of Facebook messages sent and received between 2011 and 2014.

An average day in my life in terms of Facebook messages sent and received between 2011 and 2014.

What time of the day messages are sent and received is only one aspect of analysing the flow of messages over time. With the rise of Whatsapp and other messaging apps Facebook messanger might be slowly becoming useless. After all, who uses MSN Messenger nowdays? Looking back at how my use of Facebook messaging since I started using it seems discredit this.

Messages sent from my Facebook inception till the end of 2015.

Messages sent from my Facebook inception till the end of 2015.

While the peaks are for the most part related to my love affairs, the minimums are a little harder to explain with the notable exception of August 2015, where the Chinese firewall stopped me from sending too many messages. My month long stay in Cuba left a similar dent in this year’s results.

Most polyglots will be familiar with the suffering that messaging in multiple languages entails. There is a big difference between “año” and “ano”. If you don’t believe me, I encourage you to search for images of “ano” on Google. An question for me would be, what languages do I actually use to message others?

Messages sent from my Facebook inception till the end of 2015.

Messages sent from my Facebook inception till the end of 2015.

Based on this I predominantly write in Hungarian, followed by English, and a minimum amount of Spanish and German. I recently started reading more in Spanish, perhaps its time to get my Goethe out and read some German, lest I completely forget the language.

Anther question of interest to me was whether I use different words to talk to girls than I use to talk to boys. It turns out I do (slightly). Here is what I am more likely to say to boys, with word size indicating Chi-square overrepresentation.

Most distinguishing words sent to females compared to males.

Most frequently sent words to females and to males.

So the stereotypes are true. I swear more when talking to other men, I talk about *"sex"*,*"money"*, *"models"*, *"cars"* and *"free"* *"shit"*. Funnily enough, the most overrepresented words are related to the amateur chess I play through Facebook chat(*"fbchess"*,*"Sam"*,*"Oliver"*,*"move"*,*"black"*,*"white"*) which I only play with other guys. I could have also computed wordclouds for words I use MORE with one gender than the other. I also did this, but they were too embarrassing to share on the internet.

Before you judge me for being a knuckle-dragging male stereotype, I encourage you to read this paper. What I say to girls is not nearly as interesting, partly because I the gender classifying algorithm was based on names and it wasn’t very effective and partly because the girls I message the most I talk to in languages other than English. If I make the effort to correcty classify more conversations the results would be potentially more interesting.

In conclusion, while writing this I learned quite a bit about myself. Given that most of the things I looked at were social, I decided not to share most of the most interesting insights because they involve other people. Importantly, this type of introspection is based on hard data, not just what I think about myself, which is of course heavily biased.