
Mining the Social Web, by Matthew Russell, O'Reilly Media

"Mining the Social Web" is a book about how to access social data from today's most popular social services through their public APIs, and how to analyze the retrieved data to gain insights from it.

The book uses the Python programming language to access and manipulate the data, and provides code snippets for common tasks within the text, as well as full IPython notebooks on GitHub. The book is written as documentation for the freely available IPython notebooks, with the documentation providing context and background for the code, as well as describing the algorithms used to mine the social data.

The author tries to be as concise as possible, although he does not succeed in the first chapter, where the first three sections are verbose and relatively unnecessary, describing what Twitter is and why people use it as a microblogging platform. With that out of the way, the writing style improves as the book progresses, and is a mixture of code examples and step-by-step explanations.

The author follows the same formula throughout the book: for each of the popular social services examined, he starts with an overview of the API used to access the required data and how to configure the requisite authorization tokens. He then explains how to make requests, followed by a brief description of the important APIs and the data sets they return. Next, he presents a couple of algorithms to mine the data and extract valuable statistics from it, describing the algorithms without assuming prior knowledge on the reader's part. Finally, the author presents a cool visualization of the insights, using either Python libraries and packages or the Google Earth APIs. The formula is quite useful, and gives the book consistency across chapters, which can be read independently and out of order.

The author starts with Twitter. He explains the structure of the Twitter API, how it uses OAuth, and how to connect to it using the Python "twitter" library. The chapter progresses with example Python notebooks that show how to retrieve trending topics, user timelines, and search results, and how to manipulate tweet contents and tweet locations to gather interesting statistics about them. The writing style is expository, showing the notebooks piecemeal and explaining them well.

In the next chapter, the author focuses on Facebook and the social graph API. The chapter starts with an exposition of the entities available through Facebook (timeline, likes, locations, etc.), explains how to grant access tokens for each of these entities, and introduces the Facebook Query Language (FQL). The author provides ample examples that analyze social graph connections, Facebook pages (the Pepsi vs. Coke examples), statistics on friends' likes, and friend graph cliques, using PrettyTable, histograms, and graph plots.

The author then tackles LinkedIn in a similar manner, but starts introducing the more interesting data mining techniques with a brief introduction to data clustering algorithms. The author talks about normalizing the data and using NLTK for language processing, then describes and uses a few clustering algorithms (greedy clustering, hierarchical clustering, and k-means) to cluster LinkedIn connections. The chapter ends with a cool visualization of where the talent is, using Google Earth.
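To give a flavor of what k-means does, here is a minimal sketch of my own (not code from the book). It assumes each connection has already been reduced to a hypothetical (latitude, longitude) pair, uses plain Euclidean distance on those coordinates as a simplification, and takes the starting centroids as an argument so the result is repeatable.

```python
import math


def k_means(points, centroids, iterations=10):
    """Cluster points around the given starting centroids.

    Returns the final centroids and the cluster assignment that
    produced them (one list of points per centroid).
    """
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical connection locations: two on the US West Coast, two in the East.
points = [(37.8, -122.4), (37.4, -122.1), (40.7, -74.0), (40.8, -73.9)]
centroids, clusters = k_means(points, centroids=[points[0], points[2]])
```

Each centroid ends up at the average location of its group, which is the kind of summary one could then plot on a map.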

The author then proceeds to Google+, and describes an information retrieval example to cluster documents, using it to introduce concepts such as TF-IDF, document similarity, and analyzing language bigrams.
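For readers unfamiliar with these terms, the following is a minimal sketch of my own, not taken from the book. It assumes documents are already split into lowercase tokens, weights each term by its frequency in the document times the log of its inverse document frequency (one common TF-IDF variant among several), and compares documents with cosine similarity. The sample documents are made up.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document (docs are token lists)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def bigrams(tokens):
    """Return adjacent token pairs, in order."""
    return list(zip(tokens, tokens[1:]))


# Made-up sample documents.
docs = [
    "mining the social web".split(),
    "mining social data with python".split(),
    "cooking pasta at home".split(),
]
vectors = tf_idf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

The first two documents share terms and so score higher than the first and third, which share none.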

The next chapters are about understanding blog posts, with a brief interlude on how to crawl and scrape the web, and how to summarize documents, which comes in handy if you have no time to read the full content of web pages and you'd like to figure out the gist of the document.

The following chapter tackles mining user emails, including high-level statistics on who connects with whom and how frequently they email each other. The author uses the Enron data as an example, and introduces a toolkit to do the same with Gmail accounts.
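The "who emails whom, and how often" statistic boils down to counting sender and recipient pairs. Here is a minimal sketch of my own, assuming each message has already been parsed into a dict with hypothetical "from" and "to" fields; the addresses are invented and this is not the book's code.

```python
from collections import Counter


def count_exchanges(messages):
    """Count how many messages each sender sent to each recipient."""
    pairs = Counter()
    for message in messages:
        for recipient in message["to"]:
            pairs[(message["from"], recipient)] += 1
    return pairs


# Invented sample mailbox.
messages = [
    {"from": "alice@example.com", "to": ["bob@example.com"]},
    {"from": "alice@example.com", "to": ["bob@example.com", "carol@example.com"]},
    {"from": "bob@example.com", "to": ["alice@example.com"]},
]
exchanges = count_exchanges(messages)
busiest_pair, busiest_count = exchanges.most_common(1)[0]
```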

The last couple of chapters deal with GitHub project analytics, and with microformats and RDF. The book ends with a cookbook of recipes, each of which states the problem to be solved, offers a solution, and discusses the salient points of the solution to drive the point home. Most of the cookbook recipes are for Twitter, with a couple of cases for Facebook.

Overall, I recommend the book. It is decently written, and contains a wealth of introductory material on how to access content from the popular social websites, and a cornucopia of algorithms that can be used to analyze the data.

