Sunday, August 25, 2013

Hungry for Change -- the documentary

Documentaries about food and health are almost always controversial, especially if they take a position on dieting and weight loss. They tend to espouse a myopic view of how to lose weight and be healthy, and they lose the viewer who has become a longtime skeptic after seeing a lot of fad diets come and go.

The documentary "Hungry for Change" was a pleasant change. It focused on giving balanced information about why the US as a nation has a high percentage of obese individuals, despite all the health awareness and diets in circulation. The premise of the documentary is that we have strayed from eating the foods our bodies evolved with over millions of years to consuming highly processed foods, and that if we reverse this by eating fewer processed foods, we'd be healthier.

The documentary starts with some eye-opening highlights:
  • Supermarket foods are engineered to have a long shelf life, to be appealing and addictive, and to be unfulfilling so that we buy more
  • The protocol for making mice fat to study the effects of obesity on health is to feed mice MSG, which is present in about 80% of the supermarket foods
  • Airplane pilots refrain from drinking diet sodas, as they interfere with their focus
  • Phosphoric acid in sodas can cause loss of bone density
  • Total blueberry pomegranate cereal contains neither blueberries nor pomegranate. It contains propylene glycol, which is used to make the blue bits. Interestingly, propylene glycol is also used to winterize cars and to cleanse the colon before a colonoscopy
  • 68% of the population is fat
  • Most of that is due to high fructose corn syrup consumption. In the 1970s, Japanese scientists developed a process to separate fructose from corn at a low cost, and its use skyrocketed afterwards
  • In 1900, the average person consumed 15g of fructose a day. Today that number has jumped to 70-80g, with some people consuming as much as 120-150g a day, up to a 10x increase
  • On average, people consume about 150 lbs of sugar every year
  • Sugar is the main culprit in making us fat
  • Consuming sugars produces beta endorphins, which should be considered a drug
The documentary then proposes eating fewer processed foods to get rid of the toxins the body has accumulated over the years. It argues against exercising at the outset: toxins are generally fat soluble, and the body's fat cells keep them in balance. When we burn the fat without letting the body get rid of the toxins, they become unbalanced and make us sick.

The documentary lists some of the good detox foods: anything with chlorophyll, such as greens, and gelatinous foods, such as aloe vera and chia seeds. Although they don't sound very appealing, the theory is that as the gelatinous foods move through the intestines, they absorb and clear out the toxins. The show also recommends parsley and cilantro, as the latter binds with heavy metals and removes them from the body. Who would have thought?

Finally, the show recommends getting more sleep to reduce stress-related weight gain. These sound like small changes for a bigger benefit, and who knows, maybe we can all enjoy more chia seeds.

Thursday, August 15, 2013

Enterprise Data Workflows with Cascading, by Paco Nathan, O'Reilly Media

For people interested in developing Hadoop analytic applications, there is a plethora of options. They range from writing low-level, hand-tuned Java MapReduce code to using a higher-level language such as Pig or Hive to manipulate the data. There are pros and cons to each. With the former, the code becomes complex for anything beyond the canonical word-count example. With the latter, doing anything meaningful almost always means augmenting the higher-level language with user-defined functions written in a different language to regain power and flexibility, which causes maintenance nightmares. A happy medium is to use one of the data-flow libraries for Hadoop, of which Cascading is one.

Since Cascading has been around for some time, the online documentation is relatively mature, and includes a gentle introduction to the library, with example source code, and a well-written user's guide. However, this does not obviate the need for a book that describes the library and walks the reader gently through its usage and subtleties. "Enterprise Data Workflows with Cascading" is such a book.

The book starts with a simple example of copying a file on Hadoop, and introduces the concepts of taps, which serve as data sources and data sinks, and the pipes that connect them. It then graduates to the canonical word-count example, using it as a vehicle to explain flows and the operations that can be performed on them through functions and aggregation functions.
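
To make the source, pipe, and sink vocabulary concrete, here is a minimal sketch of the word-count flow in plain Python. It does not use Cascading's API, and it is not taken from the book: the names `source_tap`, `tokenize_pipe`, `count_pipe`, `sink_tap`, and `word_count_flow` are hypothetical stand-ins for the concepts described above.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def source_tap(lines: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for a source tap: yields raw text lines."""
    yield from lines


def tokenize_pipe(lines: Iterable[str]) -> Iterator[str]:
    """Stand-in for a per-record function: split each line into lowercase words."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word


def count_pipe(words: Iterable[str]) -> Counter:
    """Stand-in for a group-by followed by an aggregation: count each word."""
    return Counter(words)


def sink_tap(counts: Counter) -> List[Tuple[str, int]]:
    """Stand-in for a sink tap: emit (word, count) rows sorted by word."""
    return sorted(counts.items())


def word_count_flow(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Wire source -> tokenize -> count -> sink into one flow."""
    return sink_tap(count_pipe(tokenize_pipe(source_tap(lines))))
```

Calling `word_count_flow(["a b a", "B c"])` would yield one row per distinct word. In a real Cascading application the same shape would presumably be expressed with the library's own tap, pipe, and flow classes, running on Hadoop rather than in memory.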

Next come more complex tasks that require joins. The book starts with HashJoins, and then progresses to LeftJoins and distributed joins. It then works through a meaty example, a text analytics pipeline that calculates term frequency-inverse document frequency (TF-IDF) for a text corpus, and uses it as a vehicle to walk through splits, merges, and more complex joins.
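
For readers unfamiliar with TF-IDF, the following is a small in-memory sketch of one common variant: term frequency is a term's count divided by the document length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. This is an assumption for illustration; the book's pipeline may use a different weighting, and the function name `tf_idf` and the input layout are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Return a TF-IDF score for every (document, term) pair.

    docs maps a document id to its list of tokens.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq: Counter = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores: Dict[str, Dict[str, float]] = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            # tf = count / document length; idf = ln(N / df)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

A term that appears in every document gets an IDF of zero under this variant, so it contributes nothing to the score. In a distributed setting the per-term document counts and the per-document term counts would be computed in separate branches and joined, which is why the example lends itself to illustrating splits and joins.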

By then, the reader has become familiar and comfortable with Cascading, and the author walks them through the benefits of developing applications in a data-flow language instead of the other options available to Hadoop developers. One benefit is the ability to test the code before deployment, which the author illustrates with an example of a TDD pipeline. Others include using a consistent pattern language to describe the workflows, and having a single deployable JAR that can be used in dev/test/production environments.

Toward the end, the author lists other language bindings for Cascading, such as Scalding (Scala) and Cascalog (Clojure). The later chapters contain good references for further reading on TDD, Scala, and Clojure. The book closes with an open-data use case.

Throughout the book, the author provides ample links to the source code and code gists on GitHub, as well as alternate implementations in different languages.

I liked the style of the book: it is a gentle introduction to Cascading, interspersed with some good advice on doing TDD for enterprise applications, the use of a pattern language for describing data-flows, and an introduction to other language bindings for Cascading.