The Last Post

This really is the last post. I’m shutting down the Tumblr blog, switching to a new platform, and focusing my writing on topics related to my sector. I will also continue the Cooking with Data posts through a redesign of my personal web site.

Thanks for reading. This blog was a great outlet and discussion point while I was working in my Ph.D. program, but it’s time to move on.

Considering False Discovery Rate in Financial Data Mining

When searching for investment opportunities using data mining and high-throughput science, most people are more concerned with the losses from making a wrong pick than with the opportunity lost by missing a marginal one. Smart data scientists therefore use some variant of a False Discovery Rate (FDR) procedure to screen out findings that are truly trivial and focus on the important few.
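The post does not say which FDR variant is meant. As one common illustration, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name, the `alpha` default, and the sample p-values are assumptions made for this example, not anything taken from the original analysis.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    the name and defaults are assumptions for this example.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per candidate investment signal.
candidate_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                      0.06, 0.074, 0.205, 0.212, 0.216]
picks = benjamini_hochberg(candidate_p_values, alpha=0.05)
print(picks)
```

With these made-up inputs only the two smallest p-values survive, even though five of them fall below an unadjusted 0.05 cutoff. That is the "important few" behavior the post describes.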

Python Unicode Encoding Bugs

It’s a tricky business to work with applications that pass around Unicode strings as ASCII byte strings. Small bugs can lead to lost data or data that is processed differently by different programs. Here’s an example.

One program outputs the following Unicode string representation of a Tweet:

C:\\Documents and Settings\\u30e6\u30fc\u30b6

If you process this string in Python, it will not be interpreted correctly. You can follow along with the Python interpreter (there is just a space between “and Settings”, not a new line):

>>> s = u'C:\\Documents and Settings\\u30e6\u30fc\u30b6'
>>> s.encode('ascii', errors='replace')
'C:\\Documents and Settings\\u30e6??'
>>> len(s.encode('ascii', errors='replace'))
33

When the Python Unicode string is encoded into ASCII, the bug is easy to see. The \u30e6 is not converted to a Unicode character because Python parses the string literal from left to right. It reads the double \ characters just before it as a single escaped backslash, which leaves u30e6 behind as plain text rather than as part of a \u escape. The two escapes that follow are converted normally and become the two ? replacement characters.

The moral of the story is: don’t pass data this way. Just store it in proper Unicode format and pass it around with the file’s encoding and endian order.
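As a rough illustration of that advice, the sketch below reproduces the pitfall in Python 3 (the session above is Python 2) and then round-trips the text as UTF-8 with the encoding stated explicitly. The `intended` value is only a guess at what the producing program meant to emit, and the variable names are made up for this example.

```python
# The literal as received: "\\" is one escaped backslash, so "u30e6"
# stays as five plain ASCII characters instead of becoming a \u escape.
s = 'C:\\Documents and Settings\\u30e6\u30fc\u30b6'
ascii_bytes = s.encode('ascii', errors='replace')
print(ascii_bytes)
print(len(s), len(ascii_bytes))

# Assumption: the producer meant a backslash followed by three katakana
# characters. Written as a literal, that needs "\\" and then "\u30e6".
intended = 'C:\\Documents and Settings\\\u30e6\u30fc\u30b6'

# Keep the text as real Unicode and name the encoding at the boundary.
utf8_bytes = intended.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == intended)
```

UTF-8 has no byte-order ambiguity. If UTF-16 or UTF-32 were used instead, the byte order would also have to travel with the data, which is the endian point made above.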

Model updating with terabytes of data in real-time

Readers of my Facebook page may remember that I’ve been writing about updating models on high-frequency trading data on an Intilop at sub-100-ns speeds. One of the tricks to doing this …

Does the Bakshy Study Demonstrate An End to the Echo Chamber?

Bakshy (Facebook’s Data Team) studied who influences whom on Facebook. One of the findings is that users propagated novel information from weak ties. In interviews, Facebook’s PR people are claiming that this indicates an end to the online echo chamber. This is highly unlikely.

Hybrid Operating Systems

It is entertaining how Google has copied Microsoft’s strategies.

alexainslie:

“My goal is for web apps to become compelling enough to force OS creators to hybridize their platforms. In other words, I’d like to see rightward movement in both the app and OS spectrums.” - Boris Smus

Michigan's Big Data Success

The health care savings and increased tax compliance are only two of the benefits from incorporating data from multiple databases into behavioral models.


This is a graph of the frequency of Tweets by character length (1 to 140 characters) from the 500 million Tweet sample used by Scott Golder and Vladimir Barash (sample extracted from Twitter in late 2009) in their academic papers. View this graph in comparison to a Twitter employee’s set of graphs that describe the typical length of a Tweet. Current Tweets have a number of changes (different text processing in the twitter-text API, wrapping of URLs, different application distributions, etc.). Note that the early peak seen in the current Twitter graphs is much less pronounced. Another note is that this graph shifts a little depending on how the Tweets are processed. More on that some other time.

Thanks to Scott Golder and Vladimir Barash for letting me use this data.

Business Intelligence versus Data Science

Business Intelligence (BI) professionals are beginning to connect with me on LinkedIn, Twitter, and Facebook. As I read their commentary (such as this), I realize how the market has split to support complex, real-time prediction tasks.

When I think about working on real-time prediction tasks, I’m worried about throughput. The easy part of the job is fitting a model. The difficult part of the job is implementing the model in a real-time prediction system. I’ve been working with a few friends on a toy example for explanatory purposes. The toy is an iPhone application that recognizes a wine label (from its picture) and then integrates with Cellartracker to allow insertion/deletion from Cellartracker's inventory.

The Fat Trap Article

It’s worth bookmarking this article on current research results in weight loss for future review. Our 2011 CHI research paper on the limits of persuasive and machine learning weight loss technology sees about 30 downloads a week from my personal site. It’s no wonder that it’s popular.

Data Scientists Disliked in Corporate America?

A recent post by mathbabe about the desirable traits of data scientists has met with a lot of backlash in the community. In the comments section of her blog, on my Facebook page, and in my Twitter DMs, I’m seeing notes from data scientists who tell me that corporate America doesn’t like data when it disagrees with corporate desires.

I’ve worked in corporate America and academia, and one thing is inevitably true: every decision is political. Many of the people who work with data seem to believe that their analysis is objective and that this should shield them from politics. They need an injection of reality, first because all data is situated and subjective, and second because the use of that data is always situated. These are attributes of political problems, and scientists should not be surprised that a fight over scarce resources brings politics with it.

Let me add another item to the list of attributes that mathbabe finds useful for data scientists: a good data scientist is politically aware. They know which stakeholders have an interest and what motivates them. And, instead of tanking programs, they do everything they can to make them successful.

Why I like Stanford’s Online Machine Learning Course for Training

The University of Washington is making a big push for additional funding to support UW Computer Science and Engineering. Some people have suggested doubling the funding to increase the number of graduates per year by fewer than a thousand.

I’m skeptical, mostly because I think integrating courses like Stanford’s online machine learning course into someone’s job will result in better outcomes for students. Stanford’s course is much lighter than a typical undergraduate or graduate course on machine learning, but it may be more useful. It establishes a common set of code, cases, and vocabulary around machine learning, and it is unobtrusive enough for a programmer working at a large company or a startup to do each week and still hold a full-time job. The “assignments” become integrating the lessons learned into the company’s business.

If we coupled this type of learning with a certification program, it would be more valuable for students and industry than taking a student out of the work force and asking them to spend a lot of time commuting to (or living at) an undergraduate or graduate program. The learning could be more integrated with their job. And avoiding the lost income and the cost of education would be especially attractive to young people who have become overburdened with debt from the modern American university system.

This won’t work for everyone, but for most tech workers, it should work much better than the current system. 

Managing the Bias-Variance Tradeoff in Machine Learning

Update: Several industry professionals have suggested that aspects of this note are incorrect. However, the academic literature generally supports every one of these “cliff’s notes”. The most common complaint from experienced industry veterans is a riff on “more data solves every learning problem”. While that is true in the abstract, in practice there are problems where even large amounts of data are too noisy if the features are not designed well. Designing features well is beyond the scope of this Cliff Note, which is meant for students who are beginning to implement machine learning in real settings.