
Mining the Social Web, by Matthew Russell, O'Reilly Media

"Mining the Social Web" is a book about how to access data from today's most popular social services through their public APIs, and how to analyze the retrieved data to gain insights from it.

The book uses the Python programming language to access and manipulate the data. It provides code snippets for common tasks in the text, as well as full IPython notebooks on GitHub. The book is written as documentation for the freely available IPython notebooks: it gives context and background for the code, and describes the algorithms used to mine the social data.

The author tries to be as concise as possible, although he does not succeed in the first chapter, where the first three sections are verbose and largely unnecessary, describing what Twitter is and why people use it as a microblogging platform. With that out of the way, the writing style improves as the book progresses, settling into a mixture of code examples and step-by-step explanations.

The author follows the same formula throughout the book. For each of the popular social services examined, he starts with an overview of the API used to access the required data and how to configure the requisite authorization tokens. He then explains how to make requests, followed by a brief description of the important APIs and the data sets they return. Next, he presents a couple of algorithms to mine the data and extract valuable statistics from it, describing the algorithms without assuming prior knowledge on the reader's part. Finally, he presents a cool visualization of the insights, using either Python libraries and packages or the Google Earth APIs. The formula is quite useful and gives the book consistency across chapters, which can be read independently and out of order.

The author starts with Twitter. He explains the structure of the Twitter API, how it uses OAuth, and how to connect to it using the Python "twitter" library. The chapter progresses with example Python notebooks that show how to retrieve trending topics, user timelines, and search results, and how to manipulate tweet contents and tweet locations to gather interesting statistics about them. The writing style is expository, showing the notebooks piecemeal and explaining them well.

In the next chapter, the author focuses on Facebook and the Social Graph API. The chapter starts with an exposition of the entities available through Facebook (timeline, likes, locations, etc.), explains how to grant access tokens for each of these entities, and introduces the Facebook Query Language (FQL). The author provides ample examples that analyze social graph connections, Facebook pages (the Pepsi vs. Coke examples), statistics on friends' likes, and cliques in the friend graph, using PrettyTable output, histograms, and graph plots.

The author then tackles LinkedIn in a similar manner, but starts introducing the more interesting data mining techniques with a brief introduction to clustering algorithms. He talks about normalizing the data and using NLTK for language processing, then describes and uses several clustering algorithms (greedy clustering, hierarchical clustering, and k-means) to cluster LinkedIn connections. The chapter ends with a cool Google Earth visualization of where the talent is.
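To give a flavor of the greedy clustering idea, here is a minimal sketch of my own, not code from the book. It assumes connections are grouped by job title and that titles are compared by Jaccard similarity over their words; the sample titles, the `greedy_cluster` helper, and the 0.5 threshold are all hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(titles, threshold=0.5):
    """Assign each title to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster."""
    clusters = []
    for title in titles:
        tokens = title.lower().split()
        for cluster in clusters:
            representative = cluster[0].lower().split()
            if jaccard(tokens, representative) >= threshold:
                cluster.append(title)
                break
        else:
            # No existing cluster matched, so this title seeds a new one.
            clusters.append([title])
    return clusters


# Hypothetical job titles standing in for LinkedIn connection data.
titles = [
    "Software Engineer",
    "Senior Software Engineer",
    "Data Scientist",
    "Chief Data Scientist",
    "Product Manager",
]

for cluster in greedy_cluster(titles):
    print(cluster)
```

The result depends on the order of the input and on the threshold, which is the usual trade-off of a greedy pass compared to hierarchical clustering or k-means.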

The author then proceeds to Google+, and describes an information retrieval example to cluster documents, using it to introduce concepts such as TF-IDF, document similarity, and analyzing language bigrams.
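As a rough illustration of TF-IDF and document similarity, here is a small pure-Python sketch. It is my own example rather than the book's code, and it assumes whitespace tokenization, term frequency normalized by document length, and an unsmoothed logarithmic IDF; the sample documents and function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents standing in for retrieved posts.
docs = [
    "mining the social web with python",
    "mining social data with python notebooks",
    "a recipe for baking bread",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # related documents
print(cosine_similarity(vectors[0], vectors[2]))  # unrelated documents
```

Terms that appear in every document get a weight of zero under this IDF, which is why common words contribute nothing to the similarity score.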

The next chapters are about understanding blog posts, with a brief interlude on how to crawl and scrape the web, and on how to summarize documents. Summarization comes in handy if you have no time to read the full content of a web page and just want the gist of it.

The following chapter tackles mining user emails, including high-level statistics on who communicates with whom and how frequently they send emails to each other. The author uses the Enron data set as an example, and introduces a toolkit to do the same with Gmail accounts.
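The "who emails whom, and how often" statistic can be sketched with nothing but the Python standard library. This is my own minimal example under stated assumptions, not the book's toolkit: the messages are made-up raw RFC 822 strings, and `count_sender_recipient_pairs` is a hypothetical helper that only looks at the From, To, and Cc headers.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_sender_recipient_pairs(raw_messages):
    """Count how many messages each sender addressed to each recipient."""
    pairs = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr for _, addr in getaddresses(recipient_headers)]
        for sender in senders:
            for recipient in recipients:
                pairs[(sender.lower(), recipient.lower())] += 1
    return pairs


# Made-up messages for illustration only.
raw_messages = [
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.com>\n"
    "Subject: status\n\nFirst message.",
    "From: alice@example.com\n"
    "To: Bob <bob@example.com>\n"
    "Subject: follow-up\n\nSecond message.",
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: re: follow-up\n\nThird message.",
]

pairs = count_sender_recipient_pairs(raw_messages)
for (sender, recipient), count in pairs.most_common():
    print(sender, "->", recipient, count)
```

Real mailboxes need more care (encoded headers, mailing lists, aliases for the same person), so treat this as a starting point only.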

The last couple of chapters deal with GitHub project analytics, and with microformats and RDF. The book ends with a cookbook of recipes, each of which states the problem to be solved, offers a solution, and discusses the salient points of that solution to drive the point home. Most of the cookbook recipes are for Twitter, with a couple for Facebook.

Overall, I recommend the book. It is decently written, and contains a wealth of introductory material on how to access content from the popular social websites, along with a cornucopia of algorithms that can be used to analyze the data.

