Saturday, September 20, 2014

Mining the Social Web, by Matthew Russell, O'Reilly Media

"Mining the Social Web" is a book about how to access social data from today's most popular social services through their public APIs, and how to analyze the retrieved data to gain insights from it.

The book uses the Python programming language to access and manipulate the data, and provides code snippets for common tasks in the text, as well as full IPython notebooks on GitHub. The book is written as documentation for the freely available IPython notebooks, with the documentation providing context and background for the code, as well as describing the algorithms used to mine the social data.

The author tries to be as concise as possible, although he does not succeed in the first chapter, whose first three sections are verbose and largely unnecessary, describing what Twitter is and why people use it as a microblogging platform. With that out of the way, the writing style improves as the book progresses, settling into a mixture of code examples and step-by-step explanations.

The author follows the same formula throughout the book: for each of the popular social services examined, he starts with an overview of the API used to access the required data and how to configure the requisite authorization tokens. He then explains how to make requests, followed by a brief description of the important APIs and the data sets they return. Next, he presents a couple of algorithms to mine the data and extract valuable statistics from it, describing the algorithms without assuming prior knowledge on the reader's part. Finally, the author presents a cool visualization of the insights, using either Python libraries and packages or the Google Earth APIs. The formula is quite useful and gives the book consistency across chapters, which can be read independently and out of order.

The author starts with Twitter. He explains the structure of the Twitter API, how it uses OAuth, and how to connect to it using the Python "twitter" library. The chapter progresses with example Python notebooks that show how to retrieve trending topics, user timelines, and search results, and how to manipulate tweet contents and tweet locations to gather interesting statistics about them. The writing style is expository, showing the notebooks piecemeal and explaining them well.

In the next chapter, the author focuses on Facebook and the social Graph API. The chapter starts with an exposition of the entities available through Facebook (timeline, likes, locations, etc.), explains how to obtain access tokens for each of these entities, and introduces the Facebook Query Language (FQL). The author provides ample examples that analyze social graph connections, Facebook pages (the Pepsi vs. Coke examples), statistics on friends' likes, and friend graph cliques, using PrettyTable, histograms, and graph plots.

The author then tackles LinkedIn in a similar manner, but starts introducing the more interesting data mining techniques with a brief introduction to data clustering algorithms. The author talks about normalizing the data and using NLTK for language processing, and describes and uses several clustering algorithms, such as greedy clustering, hierarchical clustering, and k-means, to cluster LinkedIn connections. The chapter ends with a cool visualization, built with Google Earth, of where the talent is.
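
To give a flavor of what greedy clustering looks like, here is a minimal sketch of my own, not code from the book. It assumes connections are grouped by job title, that similarity is measured with the Jaccard distance between the sets of words in two titles, and that a title joins the first cluster whose first member is close enough. The titles, the function names, and the 0.6 threshold are all made up for illustration.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for the word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def greedy_cluster(titles, threshold=0.6):
    """Assign each title to the first cluster whose seed (first member)
    is within the distance threshold; otherwise start a new cluster."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if jaccard_distance(title, cluster[0]) <= threshold:
                cluster.append(title)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([title])
    return clusters


# Hypothetical job titles; real input would come from the LinkedIn API.
titles = [
    "Senior Software Engineer",
    "Software Engineer",
    "Data Scientist",
    "Senior Data Scientist",
    "Chief Executive Officer",
]

clusters = greedy_cluster(titles)
for cluster in clusters:
    print(cluster)
```

The approach is order dependent and only compares against each cluster's seed, which is what makes it "greedy"; hierarchical clustering and k-means trade that simplicity for better groupings.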

The author then proceeds to Google+, and describes an information retrieval example that clusters documents, using it to introduce concepts such as TF-IDF, document similarity, and the analysis of language bigrams.
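
For readers who have not met these concepts, the following is a small sketch of my own using only the Python standard library (the book itself leans on NLTK). It assumes a common TF-IDF variant, term frequency normalized by document length multiplied by log(N / document frequency), and uses cosine similarity to compare documents. The toy corpus and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real run would use text fetched from an API.
docs = {
    "a": "mining the social web with python",
    "b": "python notebooks for mining twitter data",
    "c": "a recipe for baking sourdough bread",
}


def tokenize(text):
    """Naive whitespace tokenizer; assumed good enough for this sketch."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Map each document name to a {term: tf-idf weight} dictionary."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def bigrams(text):
    """Return adjacent word pairs, e.g. ('social', 'web')."""
    words = tokenize(text)
    return list(zip(words, words[1:]))


vectors = tf_idf_vectors(docs)
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
print(sim_ab, sim_ac)
print(bigrams(docs["a"]))
```

Documents "a" and "b" share the terms "mining" and "python", so they score higher than "a" and "c", which share no terms at all.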

The next chapters are about understanding blog posts, with a brief interlude on how to crawl and scrape the web, and on how to summarize documents, which comes in handy if you have no time to read the full content of a web page and would like to figure out its gist.

The following chapter tackles mining user emails, including high-level statistics on who connects with whom and how frequently they send emails to each other. The author uses the Enron data as an example, and introduces a toolkit to do the same with Gmail accounts.
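
The "who emails whom, and how often" statistic boils down to counting sender and recipient pairs. Here is a minimal sketch of my own, assuming each message has already been parsed into a dictionary with a "from" address and a list of "to" addresses. The messages and addresses are invented, and the book's own notebooks work on the real Enron corpus rather than a list like this.

```python
from collections import Counter

# Hypothetical, already-parsed messages; not taken from the Enron data.
messages = [
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"]},
]


def pair_counts(messages):
    """Count how many messages each sender sent to each recipient."""
    counts = Counter()
    for msg in messages:
        for recipient in msg["to"]:
            counts[(msg["from"], recipient)] += 1
    return counts


counts = pair_counts(messages)
# List the most frequent sender/recipient pairs first.
for (sender, recipient), n in counts.most_common():
    print(sender, "->", recipient, n)
```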

The last couple of chapters deal with GitHub project analytics, and with microformats and RDF. The book ends with a cookbook of recipes, each of which states the problem to be solved, offers a solution, and discusses the salient points of that solution to drive the point home. Most of the cookbook recipes are for Twitter, with a couple of cases for Facebook.

Overall, I recommend the book. It is decently written, and contains a wealth of introductory material on how to access content from the popular social websites, and a cornucopia of algorithms that can be used to analyze the data.
