
Submitted URL: http://go2.anaconda.com/Mzg3LVhOVy02ODgAAAGHzdpDkZ544oZp8WUWjKPp3VmEB1Z7VHdqbFWfVqaAeHmp8vU1sz6i_D3ZfdhQ4-OuWbRTdRo=
Effective URL: https://www.anaconda.com/blog/spine-tingling-data-science-tales-from-beyond-the-desk?utm_campaign=cybersecurity&utm_mediu...
NEWS


SPINE-TINGLING DATA SCIENCE TALES FROM BEYOND THE DESK

OCT 31, 2022

By Team Anaconda

Data represents one of the most important competitive advantages in modern
business, and the increasing reliance on data throughout decision-making
processes has elevated the role of data scientists and the IT teams that support
them. But like all departments, data science teams still encounter their share
of headaches, frustrations, and, well, horror stories.

Light a candle and, if you dare, read up on these spine-tingling data science
tales from beyond the desk!

When You Have to Find the Needle In A Haystack

Sean Law, Principal Data Scientist at Charles Schwab, recently ran into this
“needle-in-a-haystack” problem in a 23-million-row JSON file that kept coming
back as invalid. Having to comb through all that data to find the few small
errors breaking the system sounds like a nightmare, but it is just another day
for data scientists.
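
One way to shrink the haystack is to let the parser say where it gave up. The
sketch below uses Python's standard `json` module, whose `JSONDecodeError`
carries the line, column, and character offset of the first problem. The helper
name `locate_json_error` and the sample data are made up for illustration, and
the approach assumes the file fits in memory; it is not a description of how the
original problem was solved.

```python
import json


def locate_json_error(text, context=30):
    """Return None if text is valid JSON, else details about the first error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        return {
            "message": err.msg,
            "line": err.lineno,
            "column": err.colno,
            # A little text on either side of the failure point.
            "snippet": text[start:err.pos + context],
        }
    return None


# Illustrative data only: the second record is missing its value.
sample = (
    '[\n'
    '  {"id": 1, "value": 10},\n'
    '  {"id": 2, "value": },\n'
    '  {"id": 3, "value": 30}\n'
    ']'
)

report = locate_json_error(sample)
print(report)
```

The parser stops at the first error, so a file with several problems needs a
fix-and-rerun loop.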



Most if not all data scientists are familiar with this sort of issue, including
Anaconda Senior Data Scientist Vicky Kwan, who encountered a similar problem
when looking into an infrastructure bug. After searching for the missing “}” in
300 lines of CloudFormation, an infrastructure-as-code (IaC) service for
modeling, provisioning, and managing AWS and third-party resources, Vicky
eventually fixed the bug and kept Anaconda’s servers running.
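
A hunt for a missing brace can be partly automated with a small bracket
matcher. The sketch below is an assumption-laden illustration, not the method
used in the story: `find_unbalanced` is a hypothetical helper, it only
understands JSON-style double-quoted strings, and the template is invented for
the example. It reports where the nesting goes wrong, which narrows the search
but is not necessarily the exact spot where the brace was dropped.

```python
def find_unbalanced(text):
    """Return (line, column, char) for the first bracket problem, or None.

    Only JSON-style double-quoted strings and backslash escapes are handled.
    """
    pairs = {"}": "{", "]": "["}
    stack = []  # (opening char, line, column) for every bracket still open
    in_string = False
    escaped = False
    line, col = 1, 0
    for ch in text:
        if ch == "\n":
            line += 1
            col = 0
            continue
        col += 1
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append((ch, line, col))
        elif ch in pairs:
            if not stack or stack[-1][0] != pairs[ch]:
                return (line, col, ch)  # closer without a matching opener
            stack.pop()
    if stack:
        opener, open_line, open_col = stack[-1]
        return (open_line, open_col, opener)  # innermost opener never closed
    return None


# Invented template: the closing brace for "Resources" is missing.
template = (
    '{\n'
    '  "Resources": {\n'
    '    "Bucket": {"Type": "AWS::S3::Bucket"}\n'
    '}\n'
)

print(find_unbalanced(template))
```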

Format errors and data conversion nightmares aren’t unique to data scientists;
technologists in all departments, including IT, frequently wrestle with such
challenges. Parsing through data is laborious and exhausting, but addressing
these kinds of problems is an important part of turning data into insights and
moving businesses forward.

We asked around internally for some tips and tricks for dealing with data
conversions and formatting. Here are a few of our team’s suggestions:


 * IT teams frequently help convert data to different structures, like JSON to
   CSV. Check out the pandas library in Python to simplify this process.

 * Use existing parsers rather than writing your own. And while you’re at it,
   use a parser instead of a regular expression. The road of “I’ll just use a
   regex to extract an email from this HTML doc” holds nothing but tears.

 * Never forget to hand-scan a sample of your data after any conversion. There
   are lots of obvious issues that you'll catch just by scrolling through some
   of the records yourself, and it’s easy to assume “no errors raised” means
   “everything converted as I expected.”

 * If it’s a conversion you need to run over and over, an automated “quick
   consistency check” for basic things like tracking records, times of events,
   or values is amazingly helpful.
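
To make the last two tips concrete, here is a minimal sketch of a JSON-to-CSV
conversion followed by an automated consistency check. It sticks to Python's
standard library so it runs anywhere; with pandas, `pandas.read_json` and
`DataFrame.to_csv` cover the conversion step. The function names, the sample
records, and the choice of checks (row count and field-by-field comparison) are
assumptions made for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects to CSV text."""
    records = json.loads(json_text)
    # Union of keys in first-seen order, so no field is silently dropped.
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def consistency_check(json_text, csv_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    records = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != len(records):
        problems.append(
            f"row count {len(rows)} != record count {len(records)}"
        )
    for i, (rec, row) in enumerate(zip(records, rows)):
        for key, value in rec.items():
            # The csv module writes None as an empty field.
            expected = "" if value is None else str(value)
            if row.get(key) != expected:
                problems.append(
                    f"record {i}: {key!r} is {row.get(key)!r}, "
                    f"expected {expected!r}"
                )
    return problems


# Illustrative records only.
data = (
    '[{"id": 1, "event": "login", "ts": "2022-10-31T09:00:00"},'
    ' {"id": 2, "event": "logout", "ts": "2022-10-31T17:30:00"}]'
)

csv_text = json_records_to_csv(data)
problems = consistency_check(data, csv_text)
print(csv_text)
print(problems)
```

A check like this does not replace scrolling through a sample by hand, but it
catches dropped rows and mangled values on every run.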

When the Model Is a Little Too Accurate

If your model reports nearly 100% accuracy, your first instinct should be to
look for where you went wrong, because chances are a mistake was made somewhere.
And if that model flies under the radar and makes it to production, something is
going to break. A failure like that can have serious consequences for the
business, so model accuracy is worth checking before release and monitoring
afterward.

In a recent Reddit thread, multiple users who had encountered this scenario
shared their spine-chilling tales:


> I was reviewing a Data Scientist's predictive model. They had used a neural
> net, probably because that's what they knew. Their model was extremely
> accurate. Next to perfect. Which for me was a red flag. So I dug deeper. They
> were using time series data that had a lot of missing data as time went on.
> Because they needed a matrix, they had to do something with the missing data
> points. So they imputed all missing data with a value of 0. A significant
> portion of the data was now just 0. After I pointed out the problem, the model
> turned out to be useless. -Reddit user

> A coworker confidently declared his model had a 98% accuracy. In the code
> review I found he wasn't properly splitting his train and test sets. He was
> training on nearly his entire test set….I wasn't supposed to be reviewing the
> code and it nearly made it into production. -Reddit user

> I just commented this on another post: a phd of economics pushed me away for
> asking if he validated his model. I noticed he didn’t split his dataset at all
> and, I quote, made “economical assessments”….His model was deployed and was
> blatantly incorrect: it would predict a housing quality score, and estimated
> residential neighborhoods with high income as “poor quality”. -Reddit user
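
The stories above share a pattern: a few cheap checks would have raised the
alarm before review. The sketch below shows three such checks in plain Python:
whether any test rows also appear in the training set, what accuracy a
"always guess the most common label" baseline would reach, and how much of a
column consists of a single fill value such as 0. All function names and the
toy data are assumptions for illustration, and rows are assumed to be hashable
(for example, tuples).

```python
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into disjoint train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def leaked_rows(train, test):
    """Return the test rows that also appear in the training set."""
    train_set = set(train)
    return [row for row in test if row in train_set]


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def fraction_equal_to(values, fill_value=0):
    """Share of values equal to fill_value, e.g. zeros left by imputation."""
    return sum(1 for v in values if v == fill_value) / len(values)


# Toy data: 100 unique (feature, label) rows.
rows = [(i, i % 2) for i in range(100)]

train, test = split_train_test(rows)
print(len(train), len(test), leaked_rows(train, test))

# A careless split that trains on most of the test set.
bad_train, bad_test = rows[:95], rows[80:]
print(len(leaked_rows(bad_train, bad_test)))

# With 90% of labels in one class, 90% accuracy is no achievement.
print(majority_baseline_accuracy([0] * 90 + [1] * 10))

# Three of four values are the fill value.
print(fraction_equal_to([0, 0, 0, 5]))
```

None of these checks proves a model is sound, but a non-empty overlap, an
accuracy barely above the baseline, or a column that is mostly one fill value
are all reasons to dig deeper.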

Code is code, whether it’s meant for pure software development, managing
infrastructure (IaC), or your data science models. The application of
tried-and-true software development techniques to data science might be novel,
but you can and should apply those hard-won lessons from other areas of
technology to data science. For example, pair programming during model
development can help find and eliminate mistakes and increase your velocity.

No one makes a model fail on purpose, and getting a second—or third—set of eyes
on your work is the best way to ensure improvement. Requiring code reviews prior
to publishing can also strengthen your team’s output.

Finally, documentation must be considered at every step of the process. Not only
will this keep your code transparent and allow others to spot things you might
have missed, but proper documentation is also a critical part of IT governance
and will make your auditors happy.

When You Realize Your Mistake Is on Full Display to Stakeholders

Fixing bugs and reviewing models can be enormously time-consuming when you don’t
know exactly where you went wrong. But sometimes you spot the mistake right
away, just as you are presenting it to coworkers or stakeholders.

Dan Killam, Environmental Scientist at San Francisco Estuary Institute, recently
shared a scary data science story on Twitter about a truly hair-raising
happening: presenting to a critical audience with mislabeled objects.




And in a Medium post, Vincent Vanhoucke, Distinguished Scientist at Google,
talks about one of his first projects as an intern and how one failure led to
the lesson of a lifetime:


> Picture this other horror story: You’re an intern, and you’re asked to build a
> “yes” versus “no” speech classifier. You have audio files: yes1.wav, no1.wav,
> yes2.wav, no2.wav, yes3.wav, and so on. You build your classifier and obtain
> great results. The moment you are about to present your work, you discover
> that the only thing your model is actually doing is reading the words “yes” or
> “no” in the filenames of your audio files to determine the answer, and not
> listening to the audio samples at all. So you cower in shame, cry a lot, and
> find the nearest exit. -My Data Science Horror Story, Vincent Vanhoucke
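
This kind of label leakage can be tested for before the big presentation. The
sketch below is a hypothetical illustration, not anything from the Medium post:
it measures how often the label can be read straight off the filename, then
replaces the filenames with opaque IDs so a model has nothing to peek at. The
helper names and the `clip_0000.wav` naming scheme are invented for the
example.

```python
def label_from_filename(filename):
    """Guess 'yes' or 'no' from the filename alone; None if neither matches."""
    name = filename.lower()
    for word in ("yes", "no"):
        if name.startswith(word):
            return word
    return None


def filename_leak_rate(samples):
    """Fraction of (filename, label) pairs whose label is in the filename."""
    hits = sum(
        1 for name, label in samples if label_from_filename(name) == label
    )
    return hits / len(samples)


def anonymize(samples):
    """Swap filenames for opaque IDs; labels stay attached to each sample."""
    return [
        (f"clip_{index:04d}.wav", label)
        for index, (_, label) in enumerate(samples)
    ]


samples = [
    ("yes1.wav", "yes"),
    ("no1.wav", "no"),
    ("yes2.wav", "yes"),
    ("no2.wav", "no"),
]

print(filename_leak_rate(samples))             # every label is in the name
print(filename_leak_rate(anonymize(samples)))  # nothing left to read
```

In practice the samples should also be shuffled before the IDs are assigned,
so the numbering itself does not encode the original yes/no ordering.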

We all make mistakes, which is good news for those of us who are just beginning
our careers, because those who came before us, those we look up to as luminaries
in our fields, make mistakes too. It’s expected. Just ask your coworkers:
everyone will have an example of a technical demo that went sideways, a
presentation to leadership that didn’t hit the mark, or a production change that
caused a global outage.

Some would argue that if you aren’t making mistakes, you likely aren’t pushing
yourself to achieve your peak performance. So accept your stumbles with
self-compassion, have a sense of humor about them in the moment, and, most
importantly, learn from them. Sometimes mistakes are the best teachers.

Don’t Let Scary Stories Hold You Back

There’s a lot of overlap between the challenges of data science teams and IT
teams, particularly when it comes to ensuring organizations are prepared to meet
future goals. As the lifeblood of the proper provisioning and configuration of
IT resources, digital asset stability, and security, data can feel like a
business area that’s high stakes and, sometimes, downright spooky.

There are moments over the course of any career that are scary, embarrassing, or
stressful, but such moments are necessary for learning and growth. With some
effort we can come out the other side and look back at past “horror stories” as
moments that challenged us to become better. With resilience, a supportive
community, and great tools, we’ll surely find some king-size candy bars along
the way.

Happy Halloween from Anaconda!

