Published in DataDrivenInvestor

Sadrach Pierre, Ph.D.
Jan 12 · 9 min read









MASTERING THE GPT-3 API IN PYTHON


EXPLORING DATA SCIENCE USE CASES WITH THE GPT-3 API


Image by Miguel Á. Padriñán on Pexels

GPT-3 is a machine-learning language model released by OpenAI in 2020. It has gained widespread media attention for its ability to write essays, songs, poetry, and even code! The tool is free to use and simply requires an email to sign up.

GPT-3 is a type of machine-learning model called a transformer. Specifically, it is an extension of the Generative Pre-trained Transformer, hence “GPT.” The transformer architecture uses self-attention to model text, and OpenAI’s instruction-tuned variants are further refined with reinforcement learning from human feedback. Generally, it works by processing text one token at a time, using the previous tokens to predict the next token in the sequence.

GPT-3 has a large list of use cases across all fields including the sciences,
arts, and technology. It can be used to answer basic questions about science and
math. It can even answer, with decent accuracy, questions related to
graduate-level math and science concepts. Even crazier, I asked some questions
related to my Ph.D. research in chemical physics and it was able to provide
decent explanations for some graduate-level concepts. It does have its limitations, though. When I asked GPT-3 about more novel research methods in chemical physics, it was unable to provide a sensible answer. Because of this, GPT-3 should be used with caution as a search engine for education and research; it has no fact-checking capabilities. If fact-checking capabilities improve, I can imagine GPT-3 being very useful for graduate students and maybe even researchers one day.

In addition to my personal experience, I’ve seen many other cool applications of the tool. For example, a developer used GPT-3 to orchestrate cloud services for completing complex tasks. Other users have generated working Python and SQL scripts, among other languages, with GPT-3. In the arts, users have asked GPT-3 to write essays comparing modern and contemporary art. The potential applications of GPT-3 are plentiful across basically any domain.

GPT-3 performs very well at answering basic questions about topics for which there is a great deal of accurate content on the web. For example, it can do a decent job explaining photosynthesis. It won’t do as good a job answering cutting-edge research questions around photosynthesis. For example, it won’t be able to describe the mechanistic process of photosynthesis and the quantum particles involved. While it can give decent high-level responses, it isn’t likely to give insight into the technical details of most research questions. Similarly, GPT-3 can write some simple working code, but as the complexity of the task increases, the generated code becomes more error-prone and often wrong. It also can’t generate political opinions, ethical values, investment advice, accurate news reporting, or other types of content that are typically human-generated.

While GPT-3 has its limitations, its wide range of applicability is impressive.
I thought it would be fun to come up with some interesting data science and
machine learning prompts to see if they can supplement parts of the data science
workflow.

To start, we will generate some data science-related text from some simple prompts. Once we have a bit more familiarity with the tool, we will see if we can ask questions that can help with some common data science tasks.

There are several interesting data science and machine learning questions that we can ask GPT-3. For example, can we use GPT-3 to source publicly available data sets? The training data for GPT-3 contains data up until 2021, so presumably it should know about public datasets from the past 1–2 years. Another interesting application is question framing. Can GPT-3 help users frame good machine-learning research questions? While it struggles to give specific technical answers, maybe it can do a good job of framing good unanswered research questions.

Another cool application is using GPT-3 to decide which ML model to use for a particular application. This would be nice since, for well-established techniques with copious online literature, it should be able to do a good job of helping a user select a model, as well as explain why the selected model will work best. Finally, we can try to use GPT-3 to write some Python code for data science tasks. For example, we will see if we can use it to write code that generates synthetic data for particular use cases.

NOTE: The results of the GPT-3 API are not deterministic. Because of this, the results you get may differ slightly from the output shown here. Further, since GPT-3 does not have a fact-checking mechanism, I advise you to double-check any factual results you plan to use for work, school, or personal projects.

For this work, I will be writing code in Deepnote, which is a collaborative data
science notebook that makes running reproducible experiments very easy.


SETTING UP GPT-3

To start, let’s navigate to Deepnote and create a new project (you can sign up for free if you don’t already have an account).

Let’s create a project called ‘GPT3’ and a notebook within this project called
‘GPT3_ds’.

Next, let’s install OpenAI using pip in the first notebook cell:


Embedding created by author
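The install cell would typically look something like this (a minimal sketch, assuming a Deepnote or Jupyter environment):

# Install the OpenAI Python client from within a notebook cell
%pip install openai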


SOME SIMPLE DATA SCIENCE & MACHINE LEARNING PROMPTS

Asking about Python libraries

Let’s start by storing our key in the api_key attribute of the openai module:

import openai
openai.api_key = "your-key"

Now let’s ask about some well-known Python libraries. If we ask “what is the Pandas library?” GPT-3 will give a response:


Embedding created by author
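As a rough sketch, with the pre-1.0 openai Python library a prompt like this is sent through the Completion endpoint. The helper name, engine, and parameters below are illustrative assumptions, not necessarily what was used for the cells in this post:

import openai

openai.api_key = "your-key"

def ask_gpt3(prompt, max_tokens=256):
    # Send a single prompt to the Completions endpoint and return the generated text
    response = openai.Completion.create(
        engine="text-davinci-003",  # assumed engine; the post does not say which one was used
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

print(ask_gpt3("What is the Pandas library?"))

The remaining prompts in this post follow the same pattern; only the prompt string changes.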

We can even ask more specific questions such as “what are some common Pandas use
cases?”. It gives reasonable answers listing data wrangling, data visualization,
data aggregation, and time series analysis:


Embedding created by author

Let’s ask “what are the most common deep learning libraries?”:


Embedding created by author

We can even ask something like “what is a deep neural network?”:


Embedding created by author

Using GPT-3 for fact searching should be done with caution since there are no
fact-checking mechanisms in place. For the most part, if you are looking to get
a surface-level understanding of a well-known concept (that has many online
blogs & forums) in data science and machine learning, GPT-3 should work fairly
well.


FEATURE ENGINEERING

Another interesting use case for GPT-3 is using it as an aid to guide data
science work. For example, can GPT-3 give some ideas on feature transformations
that can improve model performance? Let’s ask GPT-3:


Embedding created by author

We see that it gives some good feature transformation suggestions as well as
explanations of each.

Let’s see if we can take this even further. Let’s ask it to write some Python code that performs standardization:


Embedding created by author
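A typical answer to this prompt standardizes each numeric column to zero mean and unit variance with scikit-learn, along these lines (the dataframe and column names below are illustrative assumptions, not GPT-3’s exact output):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative dataframe with two numeric features
df = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40000, 52000, 81000, 95000]})

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standardized)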

Let’s copy and paste this into a new cell and run it:


Embedding created by author

Next, let’s do the same for normalization:


Embedding created by author
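Normalization is usually implemented as min-max scaling to the [0, 1] range, for example (again, the example data is an illustrative assumption):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40000, 52000, 81000, 95000]})

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_normalized)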

Let’s run this code:


Embedding created by author


MODEL SELECTION

Another cool use case is model selection. Let’s see if GPT-3 can guide us in selecting machine-learning models. Let’s ask “How do I select a time series model?”


Embedding created by author

The answers are reasonable, listing checks for stationarity, seasonality, trends, and model performance evaluation. Let’s ask a more specific question like “When to use ARIMA vs Prophet?”:


Embedding created by author

The full response:

> ARIMA and FB Prophet are both used for time series forecasting. ARIMA is a
> linear model and is better for predicting long-term trends with stationary
> data. FB Prophet is a non-linear model and is better for predicting short-term
> trends with non-stationary data. So, you would use ARIMA when you are mainly
> concerned with predicting long-term trends, and you would use FB Prophet when
> you are mainly concerned with predicting short-term trends. -GPT-3


GENERATING SYNTHETIC DATA

Can we use GPT-3 to generate synthetic data? Let’s see! We will look at
well-known examples of healthcare readmission data and transaction data.

Healthcare readmission

Let’s ask GPT-3 to “Write example python code that generates synthetic
healthcare readmission data stored in a data frame”:


Embedding created by author
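A typical answer draws random values with numpy and wraps them in a pandas dataframe, roughly like this sketch (the column names and distributions are illustrative assumptions):

import numpy as np
import pandas as pd

np.random.seed(42)
n_patients = 1000

# Synthetic patient records with a binary readmission outcome
df = pd.DataFrame({
    "patient_id": np.arange(n_patients),
    "age": np.random.randint(18, 90, n_patients),
    "length_of_stay": np.random.randint(1, 30, n_patients),
    "num_prior_admissions": np.random.poisson(1.5, n_patients),
    "readmitted": np.random.binomial(1, 0.3, n_patients),
})
print(df.head())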

Let’s run this code:


Embedding created by author

Let’s see if we can ask it to build a classification model on this synthetic data that predicts readmission and evaluates performance. The prompt will be:

> “Write example python code that generates synthetic healthcare readmission
> data stored in a dataframe. From this write code that builds a catboost model
> that predicts readmission outcomes. Also write code to calculate and print
> performance.”


Embedding created by author
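A sketch of what a CatBoost readmission model on synthetic data might look like follows; the features, model settings, and metrics are assumptions rather than GPT-3’s exact output:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n_patients = 1000

# Synthetic readmission data (same assumed columns as above)
df = pd.DataFrame({
    "age": np.random.randint(18, 90, n_patients),
    "length_of_stay": np.random.randint(1, 30, n_patients),
    "num_prior_admissions": np.random.poisson(1.5, n_patients),
    "readmitted": np.random.binomial(1, 0.3, n_patients),
})

# Split into features/target and train a CatBoost classifier
X = df.drop("readmitted", axis=1)
y = df["readmitted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = CatBoostClassifier(verbose=0)
model.fit(X_train, y_train)

# Evaluate and print performance
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))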

And let’s copy and run this code:


Embedding created by author

Transaction data

Let’s ask GPT-3 to “Write example python code that generates synthetic
transaction data stored in a dataframe”:


Embedding created by author
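A typical answer looks roughly like the following sketch (column names and value ranges are illustrative assumptions):

import numpy as np
import pandas as pd

np.random.seed(42)
n_transactions = 500

# Synthetic purchase transactions
df = pd.DataFrame({
    "item_id": np.random.randint(1, 100, n_transactions),
    "customer_id": np.random.randint(1, 50, n_transactions),
    "purchase_amount": np.round(np.random.uniform(5, 500, n_transactions), 2),
})
print(df.head())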

Let’s copy this code and run it in a new cell:


Embedding created by author

Right now we have item ID, customer ID, and purchase amount. Let’s see if we can get more specific. Let’s ask for age, gender, and location as well:


Embedding created by author
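Extending the previous sketch with the requested demographic columns might look like this (the specific value choices are again assumptions):

import numpy as np
import pandas as pd

np.random.seed(42)
n_transactions = 500

# Synthetic transactions with added customer demographics
df = pd.DataFrame({
    "item_id": np.random.randint(1, 100, n_transactions),
    "customer_id": np.random.randint(1, 50, n_transactions),
    "purchase_amount": np.round(np.random.uniform(5, 500, n_transactions), 2),
    "age": np.random.randint(18, 75, n_transactions),
    "gender": np.random.choice(["M", "F"], n_transactions),
    "location": np.random.choice(["NY", "CA", "TX", "FL"], n_transactions),
})
print(df.head())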

And let’s run the code:


Embedding created by author

Pretty impressive!


PROMPTS FOR PUBLIC DATA SETS

Another application is querying GPT-3 for public data sets. Let’s ask GPT-3 to
list some public data sets:


Embedding created by author

Let’s see if we can find public data licensed under Apache 2.0. Let’s also ask
for links to the source:


Embedding created by author

While not all of these links are correct, it does a decent job of finding source links. The Google Ngrams, Common Crawl, and NASA data are pretty good. If it doesn’t provide the exact location of the data, in most cases it provides a link to a webpage where the data can be found.

Let’s ask for descriptions as well. Note that while results may overlap, they
are slightly different with each run. To my knowledge, results are not always
reproducible:


Embedding created by author

The results are:

 1. OpenStreetMap: OpenStreetMap is a free, editable map of the world, created
    and maintained by volunteers and available for use under an open license. It
    contains millions of data points, including roads, buildings, and points of
    interest. Source: https://www.openstreetmap.org/
 2. Google Books Ngrams: Google Books Ngrams is a dataset of over 5 million
    books from Google Books, spanning from 1500 to 2008. It contains word counts
    for each year, allowing researchers to track the usage of words over time.
    Source: https://aws.amazon.com/datasets/google-books-ngrams/
 3. Wikipedia: Wikipedia is a free, open-source encyclopedia with millions of
    articles in hundreds of languages. It is available for use under the
    Creative Commons Attribution-ShareAlike license. Source:
    https://www.wikipedia.org/
 4. Common Crawl: Common Crawl is a large-scale web crawl that collects data
    from over 5 billion webpages. It is available for use under the Apache 2.0
    license. Source: https://commoncrawl.org/
 5. Open Images Dataset: The Open Images Dataset is a collection of 9 million
    images annotated with labels spanning over 6000 categories. It is available
    for use under the Apache 2.0 license. Source:
    https://storage.googleapis.com/openimages/web/index.html

-GPT-3

The OpenStreetMap, Google Ngrams, and Open Images dataset links are reasonable.


PROMPTS FOR MACHINE LEARNING QUESTION FRAMING

For our final example, we will see if GPT-3 can help us frame machine-learning
problems. We will consider use-case formulation as well as research question
framing.

Use Case Prompts

While GPT-3 contains data only up until 2021, it still may be able to help us
ideate ML use cases that are relevant today. Let’s ask “What are some emerging
machine learning use cases in social media?”:


Embedding created by author

1. Automated Content Curation: Automatically curating content from social media
platforms to create personalized content feeds for users.

2. Sentiment Analysis: Analyzing user sentiment from social media posts to gain
insights into customer opinions and preferences.

3. Social Media Monitoring: Using machine learning algorithms to monitor social
media conversations and detect potential issues or trends.

4. Social Media Advertising: Leveraging machine learning to optimize social
media advertising campaigns and target the right audience.

5. Social Media Recommendations: Using machine learning to recommend content to
users based on their interests and preferences.

Let’s also see the response for healthcare:


Embedding created by author


RESEARCH QUESTION PROMPTS

Now let’s see if it can frame some research questions. Let’s ask “What are some
good research questions on using deep learning for image detection?”


Embedding created by author

Next, let’s ask “What are some good research questions related to NLP transformer models?”:


Embedding created by author

The code in this post is available on GitHub.


CONCLUSIONS

In this post, we discussed how to use GPT-3 for a variety of data science and machine learning tasks. First, we gave the API some simple prompts related to Python libraries, feature engineering, and model selection. Next, we walked through how to use the GPT-3 API to generate synthetic healthcare and synthetic transaction data. We were also able to generate a query to build a readmission model on top of the synthetic healthcare data. We then discussed how we can query the API for information on publicly available data sets. Finally, we queried the API for emerging machine learning use cases and research questions. I hope you enjoyed this post. Feel free to download the code and see if you can incorporate GPT-3 into your data science workflow.





