By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
SmartData Collective
  • Analytics
    AnalyticsShow More
    data analytics in sports industry
    Here’s How Data Analytics In Sports Is Changing The Game
    6 Min Read
    data analytics on nursing career
    Advances in Data Analytics Are Rapidly Transforming Nursing
    8 Min Read
    data analytics reveals the benefits of MBA
    Data Analytics Technology Proves Benefits of an MBA
    9 Min Read
    data-driven image seo
    Data Analytics Helps Marketers Substantially Boost Image SEO
    8 Min Read
    construction analytics
    5 Benefits of Analytics to Manage Commercial Construction
    5 Min Read
  • Big Data
  • BI
  • Exclusive
  • IT
  • Marketing
  • Software
Search
© 2008-23 SmartData Collective. All Rights Reserved.
Reading: A Lesson about Optimization
Share
Notification Show More
Latest News
big data mac performance
Data-Driven Tips to Optimize the Speed of Macs
News
3 Ways AI Has Helped Marketers and Creative Professionals Streamline Workflows
3 Ways AI Has Helped Marketers and Creative Professionals Streamline Workflows
Artificial Intelligence
data analytics in sports industry
Here’s How Data Analytics In Sports Is Changing The Game
Big Data
data analytics on nursing career
Advances in Data Analytics Are Rapidly Transforming Nursing
Analytics
data analytics reveals the benefits of MBA
Data Analytics Technology Proves Benefits of an MBA
Analytics
Aa
SmartData Collective
Aa
Search
  • About
  • Help
  • Privacy
Follow US
© 2008-23 SmartData Collective. All Rights Reserved.
SmartData Collective > Uncategorized > A Lesson about Optimization
Uncategorized

A Lesson about Optimization

Editor SDC
Last updated: 2009/04/22 at 11:26 AM
Editor SDC
7 Min Read
SHARE

In my last post, I wrote about the productivity achievable with Python, telling the story of creating the SPSSINC TURF extension command and dialog box.  Well, when the cat’s away, the mice will play.  This post is about scalability and optimizing the TURF algorithm,

The TURF algorithm is computationally explosive.  It has to compute a number […]

In my last post, I wrote about the productivity achievable with Python, telling the story of creating the SPSSINC TURF extension command and dialog box.  Well, when the cat’s away, the mice will play.  This post is about scalability and optimizing the TURF algorithm,

More Read

big data improves

3 Ways Big Data Improves Leadership Within Companies

IT Is Not Analytics. Here’s Why.
Romney Invokes Analytics in Rebuke of Trump
WEF Davos 2016: Top 100 CEO bloggers
In Memoriam: Robin Fray Carey

The TURF algorithm is computationally explosive.  It has to compute a number of set unions that grows very rapidly as the problem size grows and harvest the best.  Apart from the number of cases, which affects the time required to compute a set union and the amount of memory required, the size is determined mainly by the number of variables, N, and the maximum number of variables that can be combined. e.g., best three variables, depth.

If we hold the depth constant at 10, i.e., find the best combination of up to 10 variables, the number of unions required grows like this as we increase the number of variables.

N     unions
3     4
6     57
12   4070
24   4,540,361
48   8,682,997,422

Looking in another dimension, fixing the number of variables at 24 and varying the depth, the union count grows like this.

depth  unions
3         1330
6         190,026
12       9,740,661
24       16,777,191

48 variables and a depth of 24 would require 156,861,290,196,829 unions!

This clearly can get out of hand with what seem to be reasonable problems!  I added code to precalculate and report the number of unions required and syntax to set a limit on problem size to make the user better informed, but that is not enough.

In calculating the set unions, I was careful to make the code pretty efficient.  That works well.  But I found that as the number of sets got into the millions, the algorithm stalled and eventually exhausted the machine memory and failed.

Some experimentation showed that the set calculations completed in a reasonable amount of time, but finding the best combinations was very slow.  A little optimization of that part of the Python code sped it up by a factor of 4.  But I could see that what was killing the code was my strategy of first accumulating the reach count for all the sets and then picking out the best ones.  Even though each reach statistic saved added only a small object to the list, the number of such objects was coming to dominate memory and time usage.

Handling the result list had initially seemed to me to be a trivial part of the process, but it clearly is not as the size grows.  So I needed to change the code to only keep reach counts for combinations that have a chance at making the final list.  Each new count needs to be checked against the counts already computed.  Then the code should discard that count if it was dominated by the others, and it should replace a count if it is better than the worst already in the list.

Doing this efficiently requires an entirely different data structure for keeping the list of counts.  A heap is a data structure that has two useful properties for this problem.  First, the smallest element is always at the head of the list, and, second, elements can be added or deleted quickly while maintaining the heap property.

Python provides a heap data structure in the heapq module in the standard library.  For this problem, I actually needed to keep a list of heaps, one for each different number of combinations, but I could use the heapq functions for each one.  One other problem is that the heap items are a single number, and I needed to keep something a little more complicated (the count and a list of the variables that produced it).  Because Python does not use strong types, I could easily create an object that acted like an integer for comparison purposes but held all the information I needed.  A Python motto is, “if it walks like a duck and talks like a duck, it’s a duck”.

With these changes, the result management parts of the algorithm now run in a constant and small amount of memory and the harvesting of the best combinations is very fast.  A test problem that previously died after consuming 1.5GB of memory now runs in 30MB – and finishes.  Of course, constructing and counting all those set unions can still take a long time, but that’s the nature of the problem.  I am not going to try that problem requiring 156 trillion sets.

There are several points to this story.

  • You may find a need to optimize in places you didn’t expect: testing and measuring is important.
  • Producing an initial version and putting it out for real world use can quickly flush out the places that need work.  Being able to respond quickly and outside annual product release cycles as we can on Developer Central is a great help.  It can change how one builds software.
  • Python provides a rich set of technology that can be tapped without the need to invent or reinvent the basics.

You can download the SPSSINC TURF module from the Downloads section of SPSS Developer Central, www.spss.com/devcentral.  It requires at least SPSS Statistics Version 17 and the Python plug-in.  And it’s free.

Editor SDC April 22, 2009
Share this Article
Facebook Twitter Pinterest LinkedIn
Share

Follow us on Facebook

Latest News

big data mac performance
Data-Driven Tips to Optimize the Speed of Macs
News
3 Ways AI Has Helped Marketers and Creative Professionals Streamline Workflows
3 Ways AI Has Helped Marketers and Creative Professionals Streamline Workflows
Artificial Intelligence
data analytics in sports industry
Here’s How Data Analytics In Sports Is Changing The Game
Big Data
data analytics on nursing career
Advances in Data Analytics Are Rapidly Transforming Nursing
Analytics

Stay Connected

1.2k Followers Like
33.7k Followers Follow
222 Followers Pin

You Might also Like

big data improves
Big DataJobsKnowledge ManagementUncategorized

3 Ways Big Data Improves Leadership Within Companies

6 Min Read
Image
Uncategorized

IT Is Not Analytics. Here’s Why.

7 Min Read

Romney Invokes Analytics in Rebuke of Trump

4 Min Read

WEF Davos 2016: Top 100 CEO bloggers

14 Min Read

SmartData Collective is one of the largest & trusted community covering technical content about Big Data, BI, Cloud, Analytics, Artificial Intelligence, IoT & more.

ai in ecommerce
Artificial Intelligence for eCommerce: A Closer Look
Artificial Intelligence
giveaway chatbots
How To Get An Award Winning Giveaway Bot
Big Data Chatbots Exclusive

Quick Link

  • About
  • Contact
  • Privacy
Follow US

© 2008-23 SmartData Collective. All Rights Reserved.

Removed from reading list

Undo
Go to mobile version
Welcome Back!

Sign in to your account

Lost your password?