COMPUTER VISION AND NLP ON OCR TEXT EXTRACTION (SUPERVISED ML)

Erasmo Soares · Published in GoPenAI · 13 min read · May 29, 2024


A few days ago, I began exploring Machine Learning and its practical
applications. During this journey, a conversation with friends raised the
question of how to extract text from images, specifically how to read and pull
out the important parts of a pay stub. This got me thinking: while I didn't
have an answer immediately, I realized that the challenge could be effectively
addressed with Artificial Intelligence, particularly Machine Learning.

Working on this project helped me gain practical insights into Machine Learning
and how to implement and train a supervised model. In this article, I will
briefly describe the steps leading to the conclusion.

This article provides a general overview and is not intended as a step-by-step
guide. Its purpose is to offer a broad perspective on the process, highlighting
key concepts and considerations without delving into detailed code instructions.

Let’s dive into the proof of concept.


Photo by Jakob Søby on Unsplash


THE PROJECT: NLP PSR

The NLP Pay Slip Reader is a proof of concept (POC) designed to develop an
automated system for extracting key information from slips (pay stubs). The web
app allows users to upload their payment slips, automatically detect and extract
relevant entities, and visualize the data. The project utilizes various
techniques, including optical character recognition (OCR), named entity
recognition (NER), and image processing to achieve its objectives.



The solution utilizes Computer Vision to scan the slips, identify the text’s
location, and extract it from the image. In Natural Language Processing, the
solution extracts entities from the text, performs necessary text cleaning, and
parses the information.


DATA PREPARATION

The initial step in the process is to extract text from images using Optical
Character Recognition (OCR). This technology enables us to convert text within
images into a machine-readable format, facilitating subsequent data processing
and analysis.

To accomplish this, we can utilize PyTesseract.

PyTesseract is a popular Optical Character Recognition (OCR) tool for Python,
based on Google’s Tesseract-OCR engine. It allows developers to extract text
from images, making it possible to convert scanned documents, images, or
screenshots containing text into machine-readable text data.

import numpy as np
import pandas as pd
import cv2
import PIL
import pytesseract

# Load the image using cv2
img_cv = cv2.imread('./images/Slip.png')

# Extract text from the image with PyTesseract
text_cv = pytesseract.image_to_string(img_cv)

PyTesseract operates through a system of hierarchies and levels to detect data
related to the page, blocks, paragraphs, lines, and words. By examining the
results in `text_cv`, we can see the text extracted from the image.

Once the text is extracted from the images, we perform basic cleaning procedures
to prepare the data for entity extraction. This may involve removing unwanted
characters, correcting errors, or standardizing formats. Following the cleaning
process, we organize the extracted text by saving all the words or tokens in a
CSV (Comma-Separated Values) file. Each word or token is linked to the
corresponding filename, allowing us to maintain the connection between the
extracted text and its source slip.

import numpy as np
import pandas as pd
import pytesseract
import cv2
import os

from glob import glob
from tqdm import tqdm
import warnings

warnings.filterwarnings('ignore')

imPaths = glob('./images/*.png')

allSlips = pd.DataFrame(columns=['id','text'])    

for imgPath in tqdm(imPaths, desc='Slips'):

    # extract the filename
    _, filename = os.path.split(imgPath)

    # extract data and text with PyTesseract
    image = cv2.imread(imgPath)
    data = pytesseract.image_to_data(image)

    # build a dataframe from the tab-separated output
    dataList = list(map(lambda x: x.split('\t'), data.split('\n')))
    df = pd.DataFrame(dataList[1:], columns=dataList[0])
    df.dropna(inplace=True)

    # optionally keep only confident detections, e.g. df.query('conf >= 30')
    useFulData = df

    # dataframe for this slip
    slip = pd.DataFrame()
    slip['text'] = useFulData['text']
    slip['id'] = filename

    # concatenation
    allSlips = pd.concat((allSlips, slip))

allSlips.to_csv('slips.csv',index=False)

The above code should generate a long DataFrame that will be exported as a CSV
like:
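
Since the original screenshot is not reproduced here, the rows below are a purely illustrative sketch of the id/text layout of slips.csv (the values are made up):

id,text
Slip.png,Cofomo
Slip.png,Developpement
Slip.png,Inc
Slip.png,Periode
Slip.png,2024-05-15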




MANUAL LABELLING WITH BIO TAGGING


Photo by Beazy on Unsplash

After the text extraction, the subsequent step involves labeling, which is a
process commonly used in supervised machine learning. This step entails
identifying and marking the relevant entities within the text, such as names,
dates, organizations, etc. Labeling plays a critical role in training machine
learning models to accurately recognize and categorize these entities.

BIO stands for Beginning, Inside, and Outside (of a text segment). In a system
that recognizes entity boundaries, only three labels are used: B, I, and O. The
labeling technique is a crucial part of constructing our ground truth, which
refers to the accurate and reliable set of data or information used as a
benchmark to validate and train models. In this manual process, the information
in the DataFrame will be marked with B, I, or O. This is a costly but essential
step that will determine the success of the machine learning model. The more
diverse information in the DataFrame, such as texts read from different slips
with varying formats and positions, the better the results.

The labeling process is extensive. In the example below, I have the following
texts in the CSV:



In the context of the process, `slip.png` is the name of the image file, and
each row that follows holds a portion of the extracted text. Since the company
name is a crucial part of the entity I want to build, I need to locate all the
texts in the CSV that represent the organization's name and tag them
accordingly. Here, the tag `ORG` is utilized. `O` is used for text that does
not represent anything relevant and can be discarded. `B` marks the beginning
of a piece of information: in this example, "Cofomo" is the start of the
organization's name, while "Developpement" is in the middle or at the end, so
it receives the tag `I` (here `I-ORG`). The sequence [B-ORG, I-ORG, I-ORG]
therefore corresponds to [Cofomo Developpement Inc.].
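
Put differently, the labelled file pairs each token with a tag. A hypothetical fragment (tokens and values are illustrative, not taken from a real slip) could look like this:

id,text,tag
Slip.png,Cofomo,B-ORG
Slip.png,Developpement,I-ORG
Slip.png,Inc,I-ORG
Slip.png,Periode,O
Slip.png,2024-05-15,B-DATE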


TEXT CLEANING AND PREPROCESSING

After labeling, the extracted text undergoes a cleaning process. This involves
removing any unnecessary characters, correcting errors, and standardizing
formats to ensure the text is consistent and usable. Clean text is essential for
achieving high-quality results in subsequent analytical steps.

During the cleaning stage, it's necessary to convert the data into a specific
format for training the NER model. For this purpose, spaCy can be used.

spaCy is a free, open-source library for advanced Natural Language Processing
(NLP) in Python. spaCy is designed specifically for production use and helps
you build applications that process and “understand” large volumes of text. It
can be used to build information extraction or natural language understanding
systems, or to pre-process text for deep learning.

The format expected by spaCy consists of the slip text paired with a dictionary
containing the tags that were manually mapped in the previous stage of the POC,
together with the start and end positions of each tag.
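
Because the screenshot is not reproduced here, the snippet below is a hand-made illustration of that (text, annotations) structure; the text, offsets, and labels are made up but follow the convention used in this POC:

# Each training example pairs the full slip text with a dictionary of
# character-offset entity spans in the form (start, end, tag)
example = (
    "cofomo developpement inc periode 2024-05-15 ...",
    {'entities': [(0, 6, 'B-ORG'), (7, 20, 'I-ORG'), (21, 24, 'I-ORG')]}
)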



The code below demonstrates the conversion of the CSV data into the format
illustrated above: start, end, and tag. In this instance, the file
‘slips-tag.txt’ contains all the slips that have been tagged.

import numpy as np
import pandas as pd
import string
import re

# Data Preparation
with open('slips-tag.txt',mode='r',encoding='utf8',errors='ignore') as f:
    text = f.read()

data = list(map(lambda x:x.split('\t'),text.split('\n')))

# Load Data and convert into Pandas DataFrame
df = pd.DataFrame(data[1:],columns=data[0])

df.head()

# See whitespaces
# string.whitespace
# string.punctuation

whitespace = string.whitespace
punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
tableWhitespace = str.maketrans('','',whitespace)
tablePunctuation = str.maketrans('','',punctuation)
def cleanText(txt):
    text = str(txt)
    text = text.lower()
    removewhitespace = text.translate(tableWhitespace)
    removepunctuation = removewhitespace.translate(tablePunctuation)
    
    return str(removepunctuation)

df['text'] = df['text'].apply(cleanText)

dataClean = df.query("text != '' ")
dataClean.dropna(inplace=True)
dataClean.head(10)

# Convert data into spaCy format
group = dataClean.groupby(by='id')

# Preview the conversion for a single slip
grouparray = group.get_group('Slip.png')[['text','tag']].values
content = ''
annotations = {'entities':[]}
start = 0
end = 0
for text, label in grouparray:
    text = str(text)
    stringLength = len(text) + 1

    start = end
    end = start + stringLength

    if label != 'O':
        annot = (start, end - 1, label)
        annotations['entities'].append(annot)

    content = content + text + ' '

# Repeat the conversion for every slip
slips = group.groups.keys()

allSlips = []
for slip in slips:
    grouparray = group.get_group(slip)[['text','tag']].values
    content = ''
    annotations = {'entities':[]}
    start = 0
    end = 0
    for text, label in grouparray:
        text = str(text)
        stringLength = len(text) + 1

        start = end
        end = start + stringLength

        if label != 'O':
            annot = (start, end - 1, label)
            annotations['entities'].append(annot)

        content = content + text + ' '

    slipData = (content, annotations)
    allSlips.append(slipData)

# SpacyData
allSlips


SPLIT THE DATA INTO TRAINING AND TESTING SET


Photo by Jo Coenen - Studio Dries 2.6 on Unsplash

Here, the data is randomized and divided into a training and testing set.
Splitting data into these sets is a fundamental practice in machine learning for
several reasons. By training the model on one subset of the data (the training
set) and testing it on another subset (the testing set), you can assess how well
the model generalizes to new, unseen examples.

For this POC, the training and testing data will be stored in a pickle file. The
pickle module implements binary protocols for serializing and de-serializing a
Python object structure.

# Splitting Data into Training and Testing Set
import random

random.shuffle(allSlips)
len(allSlips)

TrainData = allSlips[:240]
TestData = allSlips[240:]

# Save data
import pickle

pickle.dump(TrainData,open('./data/TrainData.pickle',mode='wb'))
pickle.dump(TestData,open('./data/TestData.pickle',mode='wb'))


TRAINING THE NAMED ENTITY RECOGNITION MODEL

To train the NER model, you can utilize spaCy and load it with the pickle file
generated during the preprocessing stage. The spaCy documentation offers a
thorough, step-by-step guide on configuring the tool, including setup for
specific scenarios and installation instructions. In the project’s GitHub
repository, I have versioned the previously trained model within the output
folder, which you can check at the end of this article. SpaCy also provides
pre-trained models that are ready for use in your project. Detailed instructions
can be found in the “Training Models” section of the spaCy documentation.


TRAINING PIPELINES & MODELS · SPACY USAGE DOCUMENTATION


TRAIN AND UPDATE COMPONENTS ON YOUR OWN DATA AND INTEGRATE CUSTOM MODELS

spacy.io



One of the first steps is to download the base_config.cfg file, which contains
the basic configuration for spaCy. For this POC, the NER component and the
English language have been selected, as the slips are in English.



Next, copy the file to your project’s root directory and execute the following
command to generate the “complete” configuration file, which in this case is
named config.cfg:
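
That step uses spaCy's init fill-config command, documented in the spaCy usage guide; with the file names used in this POC it looks like this:

python -m spacy init fill-config base_config.cfg config.cfg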




PREPARE THE DATA FOR TRAINING


Photo by Meghan Holmes on Unsplash

Training data for NLP projects can arrive in various formats. The spaCy
documentation includes a section that explains the model configuration and
training process. In essence, you must furnish spaCy with the training and
testing data, which, in our scenario, are the two pickle files generated during
the preprocessing stage. Subsequently, these pickle files will be converted to
the spaCy format.

import spacy
from spacy.tokens import DocBin
import pickle

nlp = spacy.blank("en")

# Load Data
training_data = pickle.load(open('./data/TrainData.pickle','rb'))
testing_data = pickle.load(open('./data/TestData.pickle','rb'))


# the DocBin will store the training documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations['entities']:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when the offsets do not align with token
        # boundaries; skip those spans so doc.ents can be set without errors
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./data/train.spacy")


# the DocBin will store the testing documents
db_test = DocBin()
for text, annotations in testing_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations['entities']:
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db_test.add(doc)
db_test.to_disk("./data/test.spacy")


The model is trained using the following command:

python -m spacy train .\config.cfg --output .\output --paths.train .\data\train.spacy --paths.dev .\data\test.spacy

After training, two folders are created: “model-best,” which contains the most
accurate model, and “model-last,” which contains the model from the latest
training iteration.
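
To get a quick sense of how "model-best" performs on the held-out set, spaCy's evaluate command can be run against the test file (the paths below assume this POC's folder layout):

python -m spacy evaluate .\output\model-best .\data\test.spacy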


PREDICTIONS


Photo by Anne Nygård on Unsplash

With the Named Entity Recognition (NER) model trained on the cleaned and
labeled text, the final step is to use it for predictions. During training, the
model was fed the labeled dataset and iteratively improved its accuracy through
supervised learning; it can now automatically recognize and extract entities
from new text data.

Consider a scenario where one of the pay stubs contains the following text. The
confidential information has been blurred for privacy; please disregard it:



Given that the entity “Slip” includes gains and deductions as parameters, the
model, after analyzing the text extracted from the image, should accurately
detect the value of each field and populate it in our entity, as demonstrated
in the example below:



In this phase, there are four steps to follow to obtain predictions from the NER
model:

1. Load the image.
2. Extract data using Pytesseract.
3. Convert data into content.
4. Obtain predictions.

import numpy as np
import pandas as pd
import cv2
import pytesseract
from glob import glob
import spacy
import re
import string
import warnings

def cleanText(txt):
    whitespace = string.whitespace
    punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
    tableWhitespace = str.maketrans('','',whitespace)
    tablePunctuation = str.maketrans('','',punctuation)
    text = str(txt)
    text = text.lower()
    removewhitespace = text.translate(tableWhitespace)
    removepunctuation = removewhitespace.translate(tablePunctuation)
    
    return str(removepunctuation)


warnings.filterwarnings('ignore')

### Load NER model
model_ner = spacy.load('./output/model-best/')

# Load Image
image = cv2.imread('./Selected/Slip.png')


# extract data using Pytesseract 
tessData = pytesseract.image_to_data(image)

# convert into dataframe
tessList = list(map(lambda x:x.split('\t'), tessData.split('\n')))
df = pd.DataFrame(tessList[1:],columns=tessList[0])
df.dropna(inplace=True) # drop missing values
df['text'] = df['text'].apply(cleanText)

# convert data into content
df_clean = df.query('text != "" ')
content = " ".join([w for w in df_clean['text']])
print(content)

# get prediction from NER model
doc = model_ner(content)

In spaCy, a “doc” (short for “document”) refers to a container for accessing
linguistic annotations and a sequence of tokens. The Doc object holds the
processed text along with its annotations.
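
Before rendering anything, a quick sanity check is to iterate over the predicted entities in the Doc (a minimal sketch, assuming `doc` was created as above):

# Print each predicted entity with its BIO-prefixed label and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)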

The code snippet below initializes a server where it is possible to view the
prediction:

from spacy import displacy

displacy.serve(doc,style='ent',port=5001)

Here is the result. The confidential information has been blurred in the
example for privacy, but all the text was extracted from a new pay slip.
Whenever a tag mapped during the labeling process is identified, it is
displayed accordingly. For example, the text following “compagnie”, which
represents the company’s name, was automatically mapped to the tag B-ORG. The
same applies to all other relevant information in the text, such as names,
salaries, gains, deductions, etc.



It is possible to work with the “Doc” variable, which contains the annotation
information and token sequences returned by the model. The code below converts
the `Doc` object into JSON to transform it into a DataFrame containing all the
necessary information to create bounding boxes on the image at the positions
where the information was found.

Next, an entity dictionary is created, along with a parsing function to process
each piece of text found.

#!/usr/bin/env python
# coding: utf-8

import numpy as np
import pandas as pd
import cv2
import pytesseract
from glob import glob
import spacy
import re
import string
import warnings
warnings.filterwarnings('ignore')

### Load NER model
model_ner = spacy.load('./output/model-best/')


def cleanText(txt):
    whitespace = string.whitespace
    punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
    tableWhitespace = str.maketrans('','',whitespace)
    tablePunctuation = str.maketrans('','',punctuation)
    text = str(txt)
    text = text.lower()
    removewhitespace = text.translate(tableWhitespace)
    removepunctuation = removewhitespace.translate(tablePunctuation)
    
    return str(removepunctuation)

# group the label
class groupgen():
    def __init__(self):
        self.id = 0
        self.text = ''
        
    def getgroup(self,text):
        if self.text == text:
            return self.id
        else:
            self.id +=1
            self.text = text
            return self.id


def parser(text,label):
    if label == 'NAME':
        text = text.lower()
        text = re.sub(r'[^a-z ]','',text)
        text = text.title()
        
    elif label in ('ORG','ROLE'):
        text = text.lower()
        text = re.sub(r'[^a-z0-9 ]','',text)
        text = text.title()
        
    elif label == 'DATE':
        text = text.lower()
        text = re.sub(r'[^0-9/]','',text)
        text = text.title()        
        
    elif label in ('BASE','HOURS','QTD','GAINS','DEDUCTIONS','NETTE'):
        text = text.lower()
        text = re.sub(r'[^0-9.,]','',text)
        text = text.title()
        
    return text

grp_gen = groupgen()

def getPredictions(image):
    try:
        # extract data using Pytesseract 
        tessData = pytesseract.image_to_data(image)
        
        # convert into dataframe
        tessList = list(map(lambda x:x.split('\t'), tessData.split('\n')))
        df = pd.DataFrame(tessList[1:],columns=tessList[0])
        df.dropna(inplace=True) # drop missing values
        df['text'] = df['text'].apply(cleanText)

        # convert data into content
        df_clean = df.query('text != "" ')
        content = " ".join([w for w in df_clean['text']])
    
        
        # get prediction from NER model (doc file)
        doc = model_ner(content)

        # converting doc in json
        docjson = doc.to_json()
        doc_text = docjson['text']

        # creating tokens
        datafram_tokens = pd.DataFrame(docjson['tokens'])
        datafram_tokens['token'] = datafram_tokens[['start','end']].apply(
            lambda x:doc_text[x[0]:x[1]] , axis = 1)

        right_table = pd.DataFrame(docjson['ents'])[['start','label']]
        datafram_tokens = pd.merge(datafram_tokens,right_table,how='left',on='start')
        datafram_tokens.fillna('O',inplace=True)

        # join label to df_clean dataframe
        df_clean['end'] = df_clean['text'].apply(lambda x: len(x)+1).cumsum() - 1 
        df_clean['start'] = df_clean[['text','end']].apply(lambda x: x[1] - len(x[0]),axis=1)

        # inner join with start 
        dataframe_info = pd.merge(df_clean,datafram_tokens[['start','token','label']],how='inner',on='start')

        # Bounding Box
        bb_df = dataframe_info.query("label != 'O' ")

        bb_df['label'] = bb_df['label'].apply(lambda x: x[2:])
        bb_df['group'] = bb_df['label'].apply(grp_gen.getgroup)

        # right and bottom of bounding box
        bb_df[['left','top','width','height']] = bb_df[['left','top','width','height']].astype(int)
        bb_df['right'] = bb_df['left'] + bb_df['width']
        bb_df['bottom'] = bb_df['top'] + bb_df['height']

        # tagging: groupby group
        col_group = ['left','top','right','bottom','label','token','group']
        group_tag_img = bb_df[col_group].groupby(by='group')
        img_tagging = group_tag_img.agg({

            'left':min,
            'right':max,
            'top':min,
            'bottom':max,
            'label':np.unique,
            'token':lambda x: " ".join(x)

        })

        img_bb = image.copy()
        for l,r,t,b,label,token in img_tagging.values:
            cv2.rectangle(img_bb,(l,t),(r,b),(0,255,0),2)

            cv2.putText(img_bb,str(label),(l,t),cv2.FONT_HERSHEY_PLAIN,1,(255,0,255),2)


        # Entities
        info_array = dataframe_info[['token','label']].values
        entities = dict(NAME=[],ORG=[],DATE=[],ROLE=[],BASE=[],HOURS=[],QTD=[],GAINS=[],DEDUCTIONS=[],NETTE=[])
        previous = 'O'

        for token, label in info_array:
            bio_tag = label[0]
            label_tag = label[2:]

            # step -1 parse the token
            text = parser(token,label_tag)

            if bio_tag in ('B','I'):

                if previous != label_tag:
                    entities[label_tag].append(text)

                else:
                    if bio_tag == "B":
                        entities[label_tag].append(text)

                    else:
                        if label_tag in ("NAME",'ORG','ROLE'):
                            entities[label_tag][-1] = entities[label_tag][-1] + " " + text

                        else:
                            entities[label_tag][-1] = entities[label_tag][-1] + text



            previous = label_tag
        
        return img_bb, entities
    
    except Exception as e:
        print(f"An error occurred, make sure the image contours are correct: {str(e)}")
        return None, None
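
As a usage sketch (the input path and output filename are assumptions), the function can be called on a new slip image like this:

# Run the full pipeline on a new slip and inspect the results
image = cv2.imread('./Selected/Slip.png')
img_bb, entities = getPredictions(image)

if entities is not None:
    print(entities)  # parsed values grouped by label (NAME, ORG, DATE, ...)
    cv2.imwrite('./output/predictions.png', img_bb)  # slip with bounding boxes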


CONCLUSION


Photo by Kelly Sikkema on Unsplash

There are numerous applications for Machine Learning in character detection and
recognition and NLP. Another potential proof of concept could entail employing
the same technique by using a mobile phone and automatically translating the
extracted text into a designated language.

This demonstrates the versatility and practicality of Machine Learning in
various contexts.

Hope you like it!

The POC for the NLP Scanner is available on my GitHub.


REFERENCES

 1. Vision view & Data Science Anywhere, Intelligently Extract Text & Data from
    Documents
 2. Li, J., Lu, Q., & Zhang, B. (2019). An efficient business card recognition
    system based on OCR and NER. In 2019 International Conference on Robotics,
    Automation and Artificial Intelligence (RAAI) (pp. 334–338). IEEE.
 3. Sharma, S., & Sharma, A. (2020). Business Card Recognition using
    Convolutional Neural Networks. In 2020 5th International Conference on
    Computing, Communication and Security (ICCCS) (pp. 1–5). IEEE.
 4. Spacy — Industrial-strength Natural Language Processing in Python. (n.d.).
    Retrieved from https://spacy.io/
 5. PyTesseract: Python-tesseract — OCR tool for Python. (n.d.). Retrieved from
    https://pypi.org/project/pytesseract/
 6. OpenCV: Open Source Computer Vision Library. (n.d.). Retrieved from
    https://opencv.org/
 7. Flask: A Python Microframework. (n.d.). Retrieved from
    https://flask.palletsprojects.com/



