Submitted URL: https://blog.gopenai.com/computer-vision-and-nlp-on-ocr-text-extraction-supervised-ml-d9b7c8de15a2?source=email-d780b06a5...
COMPUTER VISION AND NLP ON OCR TEXT EXTRACTION (SUPERVISED ML)

Erasmo Soares · Published in GoPenAI · 13 min read · May 29, 2024

A few days ago, I began exploring Machine Learning and its practical applications. During this journey, a conversation with friends raised the question of how to extract text from images, specifically how to read and extract the important parts of a pay stub. I didn’t have an answer immediately, but I realized that this challenge could be addressed effectively with Artificial Intelligence, particularly Machine Learning.

Working on this project gave me practical insight into Machine Learning and into how to implement and train a supervised model. In this article, I will briefly describe the steps that led to the result. It is a general overview rather than a step-by-step guide: the goal is to give a broad perspective on the process and to highlight key concepts and considerations without going into detailed code instructions.

Let’s dive into the proof of concept.

Photo by Jakob Søby on Unsplash

THE PROJECT: NLP PSR

The NLP Pay Slip Reader is a proof of concept (POC) of an automated system for extracting key information from pay slips (pay stubs). The web app lets users upload their pay slips, automatically detects and extracts the relevant entities, and visualizes the data. The project combines several techniques, including optical character recognition (OCR), named entity recognition (NER), and image processing.

On the Computer Vision side, the solution scans the slips, locates the text, and extracts it from the image. On the Natural Language Processing side, it cleans the text, extracts entities from it, and parses the information.
DATA PREPARATION

The initial step in the process is to extract text from images using Optical Character Recognition (OCR). This technology enables us to convert text within images into a machine-readable format, facilitating subsequent data processing and analysis. To accomplish this, we can utilize PyTesseract.

PyTesseract is a popular Optical Character Recognition (OCR) tool for Python, based on Google’s Tesseract-OCR engine. It allows developers to extract text from images, making it possible to convert scanned documents, images, or screenshots containing text into machine-readable text data.

    import numpy as np
    import pandas as pd
    import cv2
    import PIL
    import pytesseract

    # Loading using cv2
    img_cv = cv2.imread('./images/Slip.png')

    ### Extract Text from Image from cv2
    text_cv = pytesseract.image_to_string(img_cv)

PyTesseract operates through a system of hierarchies and levels to detect data related to the page, blocks, paragraphs, lines, and words. By examining the results in `text_cv`, we can find the texts extracted from the images.

Once the text is extracted from the images, we perform basic cleaning procedures to prepare the data for entity extraction. This may involve removing unwanted characters, correcting errors, or standardizing formats.

Following the cleaning process, we organize the extracted text by saving all the words or tokens in a CSV (Comma-Separated Values) file. Each word or token is linked to the corresponding filename, allowing us to maintain the connection between the extracted text and its source slip.
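To make the level hierarchy mentioned above more concrete, here is a minimal sketch of how the tab-separated output of `pytesseract.image_to_data` could be read and filtered by confidence using only the standard library. The sample text, the `confident_words` helper, and the threshold of 30 are assumptions for illustration and are not part of the original project; the column names and the level numbering (5 for words) follow Tesseract's TSV output.

```python
import csv
import io

# Hypothetical sample of the tab-separated text returned by
# pytesseract.image_to_data(); real output has one row per detected element.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t800\t600\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t40\t30\t90\t20\t96.1\tCofomo\n"
    "5\t1\t1\t1\t1\t2\t140\t30\t60\t20\t12.4\t~~\n"
    "5\t1\t1\t1\t2\t1\t40\t60\t70\t20\t88.0\tSalary\n"
)

WORD_LEVEL = 5  # 1=page, 2=block, 3=paragraph, 4=line, 5=word


def confident_words(tsv_text, min_conf=30.0):
    """Return the word-level texts whose confidence is at least min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        # Skip page, block, paragraph, and line rows; they carry no text.
        if int(row["level"]) != WORD_LEVEL:
            continue
        # Skip low-confidence or empty detections.
        if float(row["conf"]) < min_conf or not row["text"].strip():
            continue
        words.append(row["text"])
    return words


print(confident_words(SAMPLE_TSV))
```

Filtering at this stage is a trade-off: a higher threshold removes OCR noise before labeling, but it can also drop faint yet relevant values, so the cut-off is worth checking against a few real slips.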
    import os
    import warnings
    from glob import glob

    import cv2
    import pandas as pd
    import pytesseract
    from tqdm import tqdm

    warnings.filterwarnings('ignore')

    imPaths = glob('./images/*.png')
    allSlips = pd.DataFrame(columns=['id', 'text'])

    for imgPath in tqdm(imPaths, desc='Slips'):
        # extract the filename so every token can be traced back to its slip
        # (the original loop reassigned imgPath = imPaths[0] here, which
        # processed the first image on every iteration)
        _, filename = os.path.split(imgPath)

        # extract data and text; image_to_data returns tab-separated text
        image = cv2.imread(imgPath)
        data = pytesseract.image_to_data(image)

        # dataframe: the first row holds the column names
        dataList = list(map(lambda x: x.split('\t'), data.split('\n')))
        df = pd.DataFrame(dataList[1:], columns=dataList[0])
        df.dropna(inplace=True)
        useFulData = df  # .query('conf >= 30')

        # one row per token, tagged with the source filename
        slip = pd.DataFrame()
        slip['text'] = useFulData['text']
        slip['id'] = filename

        # concatenation
        allSlips = pd.concat((allSlips, slip))

    allSlips.to_csv('slips.csv', index=False)

The code above builds one long DataFrame, with one row per token, and exports it as a CSV file.

MANUAL LABELLING WITH BIO TAGGING

Photo by Beazy on Unsplash

After the text extraction, the next step is labeling, a process commonly used in supervised machine learning. It consists of identifying and marking the relevant entities within the text, such as names, dates, and organizations. Labeling plays a critical role in training machine learning models to recognize and categorize these entities accurately.

BIO stands for Beginning, Inside, and Outside (of a text segment). In a system that recognizes entity boundaries, only three labels are used: B, I, and O.

Labeling is how we construct our ground truth, the accurate and reliable set of data used as a benchmark to train and validate models. In this manual process, each piece of information in the DataFrame is marked with B, I, or O. This is a costly but essential step that will determine the success of the machine learning model.
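Because the tags are entered by hand, a quick consistency check can catch slips of the pen before training. The following is a minimal sketch, not part of the original project: `find_bio_errors` is a hypothetical helper, and it assumes tags are written as `O`, `B-<TYPE>`, or `I-<TYPE>`, with illustrative types such as `ORG` and `NAME`.

```python
def find_bio_errors(tags):
    """Return (index, tag) pairs that break the BIO scheme.

    An I-<TYPE> tag is only valid directly after B-<TYPE> or I-<TYPE>
    of the same type.
    """
    errors = []
    previous_type = None  # entity type of the previous tag, None after "O"
    for index, tag in enumerate(tags):
        if tag == "O":
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        # Anything that is not B-<TYPE> or I-<TYPE> is malformed.
        if prefix not in ("B", "I") or not entity_type:
            errors.append((index, tag))
            previous_type = None
            continue
        # An "inside" tag must continue an entity of the same type.
        if prefix == "I" and previous_type != entity_type:
            errors.append((index, tag))
        previous_type = entity_type
    return errors


print(find_bio_errors(["B-ORG", "I-ORG", "I-ORG", "O"]))
print(find_bio_errors(["O", "I-ORG", "B-NAME", "I-ORG"]))
```

Running a check like this over each slip's tag column flags rows where an `I-` tag has no matching beginning, which are usually labeling mistakes rather than genuine entities.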
The more diverse the information in the DataFrame, such as texts read from slips with varying formats and positions, the better the results. The labeling process is extensive. In the example below, I have the following texts in the CSV:

Here, `slip.png` is the name of the image file, and it is followed by each portion of text extracted from that image. Since the company name is a crucial part of the entity I want to build, I need to locate all the texts in the CSV that represent the organization’s name and tag them accordingly. The tag `ORG` is used for this.

`O` is used for texts that do not represent anything relevant and can be discarded. `B` marks the beginning of a piece of information; in this example, “Cofomo” is the start of the organization’s name, while “Developpement” and “Inc.” are in the middle and at the end. For those, the tag `I` (here `I-ORG`) is used, so [B-ORG, I-ORG, I-ORG] corresponds to [Cofomo Developpement Inc.].

TEXT CLEANING AND PREPROCESS

After labeling, the extracted text undergoes a cleaning process. This involves removing unnecessary characters, correcting errors, and standardizing formats so that the text is consistent and usable. Clean text is essential for achieving high-quality results in the subsequent steps.

During the cleaning stage, the data must also be converted into the specific format required for training the NER model. spaCy can be used for this purpose.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

The format expected by spaCy consists of a dictionary containing the tags that were manually mapped in the previous stage of the POC.
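As a small illustration of that target format, the sketch below turns a handful of tagged tokens into a `(text, {'entities': [...]})` pair, where each entity is a `(start, end, label)` tuple of character offsets. The function name `to_spacy_example` and the sample tokens are assumptions made for this example; the offset arithmetic mirrors the approach used in this article, where tokens are joined with a single space.

```python
def to_spacy_example(tagged_tokens):
    """Build (content, annotations) from (token, tag) pairs.

    Tokens are joined with single spaces; every token whose tag is not "O"
    becomes a (start, end, label) entry with character offsets into content.
    """
    content = ""
    entities = []
    for text, tag in tagged_tokens:
        start = len(content)      # offset where this token begins
        end = start + len(text)   # offset just past its last character
        if tag != "O":
            entities.append((start, end, tag))
        content += text + " "
    return content, {"entities": entities}


example = to_spacy_example([
    ("cofomo", "B-ORG"),
    ("developpement", "I-ORG"),
    ("inc.", "I-ORG"),
    ("salary", "O"),
])
print(example)
```

Slicing the text with any of the returned offsets, for example `content[7:20]`, gives back the original token, which is a handy sanity check before handing the data to spaCy.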
Additionally, it includes the start and end positions of each tag.

The code below converts the CSV data into that format: start, end, and tag. In this instance, the file ‘slips-tag.txt’ contains all the slips that have been tagged.

    import string

    import pandas as pd

    # Data Preparation
    with open('slips-tag.txt', mode='r', encoding='utf8', errors='ignore') as f:
        text = f.read()

    data = list(map(lambda x: x.split('\t'), text.split('\n')))

    # Load Data and convert into Pandas DataFrame
    df = pd.DataFrame(data[1:], columns=data[0])
    df.head()

    # See string.whitespace and string.punctuation for the full sets
    whitespace = string.whitespace
    punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
    tableWhitespace = str.maketrans('', '', whitespace)
    tablePunctuation = str.maketrans('', '', punctuation)

    def cleanText(txt):
        # lowercase, then strip whitespace and the selected punctuation
        text = str(txt)
        text = text.lower()
        removewhitespace = text.translate(tableWhitespace)
        removepunctuation = removewhitespace.translate(tablePunctuation)
        return str(removepunctuation)

    df['text'] = df['text'].apply(cleanText)
    dataClean = df.query("text != '' ").dropna()
    dataClean.head(10)

    # Convert Data into spacy format
    group = dataClean.groupby(by='id')
    # group.groups.keys()

    # Worked example for a single slip
    grouparray = group.get_group('Slip.png')[['text', 'tag']].values
    content = ''
    annotations = {'entities': []}
    start = 0
    end = 0
    for text, label in grouparray:
        text = str(text)
        stringLength = len(text) + 1  # +1 for the separating space
        start = end
        end = start + stringLength
        if label != 'O':
            annot = (start, end - 1, label)
            annotations['entities'].append(annot)
        content = content + text + ' '

    # Same conversion for every slip
    slips = group.groups.keys()
    allSlips = []
    for slip in slips:
        grouparray = group.get_group(slip)[['text', 'tag']].values
        content = ''
        annotations = {'entities': []}
        start = 0
        end = 0
        for text, label in grouparray:
            text = str(text)
            stringLength = len(text) + 1
            start = end
            end = start + stringLength
            if label != 'O':
                annot = (start, end - 1, label)
                annotations['entities'].append(annot)
            content = content + text + ' '
        slipData = (content, annotations)
        allSlips.append(slipData)

    # SpacyData
    allSlips

SPLIT THE DATA INTO TRAINING AND TESTING SET

Photo by Jo Coenen - Studio Dries 2.6 on Unsplash

Here, the data is shuffled and divided into a training set and a testing set. Splitting the data this way is a fundamental practice in machine learning: by training the model on one subset (the training set) and testing it on another (the testing set), you can assess how well the model generalizes to new, unseen examples.

For this POC, the training and testing data are stored in pickle files. The pickle module implements binary protocols for serializing and de-serializing a Python object structure.

    # Splitting Data into Training and Testing Set
    import random

    random.shuffle(allSlips)
    len(allSlips)

    TrainData = allSlips[:240]
    TestData = allSlips[240:]

    # Save data
    import pickle

    with open('./data/TrainData.pickle', mode='wb') as f:
        pickle.dump(TrainData, f)
    with open('./data/TestData.pickle', mode='wb') as f:
        pickle.dump(TestData, f)

TRAINING THE NAMED ENTITY RECOGNITION MODEL

To train the NER model, you can use spaCy and feed it the pickle files generated during the preprocessing stage. The spaCy documentation offers a thorough, step-by-step guide on configuring the tool, including setup for specific scenarios and installation instructions.

In the project’s GitHub repository, linked at the end of this article, I have versioned the previously trained model in the output folder. spaCy also provides pre-trained models that are ready for use in your project. Detailed instructions can be found in the “Training Models” section of the spaCy documentation.
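The split shown earlier cuts the shuffled list at a fixed index of 240, which only suits a dataset of roughly that size. As a hedged alternative, the sketch below splits by proportion and uses a seeded random generator so that the same split can be reproduced later. The helper name `split_train_test`, the 20% test share, and the seed value are assumptions for illustration, not choices made in the original project.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of examples and return (train, test) lists."""
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(10))
print(len(train), len(test))
```

With a fixed seed, evaluation scores from different training runs remain comparable, because every run is tested on the same held-out slips.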
Training Pipelines & Models · spaCy Usage Documentation: train and update components on your own data and integrate custom models (spacy.io)

One of the first steps is to download the base_config.cfg file, which contains the basic configuration for spaCy. For this POC, the NER component and the English language have been selected, as the slips are in English.

Next, copy the file to your project’s root directory and execute the following command to generate the “complete” configuration file, which in this case is named config.cfg:

    python -m spacy init fill-config base_config.cfg config.cfg

PREPARE THE DATA FOR TRAINING

Photo by Meghan Holmes on Unsplash

Training data for NLP projects can arrive in various formats. The spaCy documentation includes a section that explains the model configuration and training process. In essence, you must provide spaCy with the training and testing data, which, in our scenario, are the two pickle files generated during the preprocessing stage. These pickle files are then converted to the spaCy binary format.

    import pickle

    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank("en")

    # Load Data
    with open('./data/TrainData.pickle', 'rb') as f:
        training_data = pickle.load(f)
    with open('./data/TestData.pickle', 'rb') as f:
        testing_data = pickle.load(f)

    def to_docbin(examples):
        # the DocBin will store the example documents
        db = DocBin()
        for text, annotations in examples:
            doc = nlp(text)
            ents = []
            for start, end, label in annotations['entities']:
                span = doc.char_span(start, end, label=label)
                # char_span returns None when the offsets do not line up
                # with token boundaries; skip those instead of failing
                if span is not None:
                    ents.append(span)
            doc.ents = ents
            db.add(doc)
        return db

    to_docbin(training_data).to_disk("./data/train.spacy")
    to_docbin(testing_data).to_disk("./data/test.spacy")

The model is trained using the following command:

    python -m spacy train .\config.cfg --output .\output --paths.train .\data\train.spacy --paths.dev .\data\test.spacy

After training, two folders are created in the output directory: “model-best,” which contains the most accurate model, and “model-last,” which contains the model from the latest training iteration.

PREDICTIONS

Photo by Anne Nygård on Unsplash

The final step puts the trained Named Entity Recognition (NER) model to work. During training, the model learned from the cleaned and labeled text to identify and classify the various entities, iteratively improving its accuracy through supervised learning. The trained model can now be used to automatically recognize and extract entities from new text data.

Consider a scenario where one of the pay stubs contains the following texts. The confidential information has been blurred for privacy; please disregard it:

Given that the entity “Slip” includes gains and deductions as parameters, the model, after analyzing the text extracted from the image, should detect the value of each piece of information and populate it in our entity, as demonstrated in the example below:

In this phase, there are four steps to follow to obtain predictions from the NER model:

1. Load the image.
2. Extract data using Pytesseract.
3. Convert data into content.
4. Obtain predictions.
    import string
    import warnings

    import cv2
    import pandas as pd
    import pytesseract
    import spacy

    def cleanText(txt):
        # lowercase, then strip whitespace and the selected punctuation
        whitespace = string.whitespace
        punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
        tableWhitespace = str.maketrans('', '', whitespace)
        tablePunctuation = str.maketrans('', '', punctuation)
        text = str(txt)
        text = text.lower()
        removewhitespace = text.translate(tableWhitespace)
        removepunctuation = removewhitespace.translate(tablePunctuation)
        return str(removepunctuation)

    warnings.filterwarnings('ignore')

    ### Load NER model
    model_ner = spacy.load('./output/model-best/')

    # Load Image
    image = cv2.imread('./Selected/Slip.png')

    # extract data using Pytesseract
    tessData = pytesseract.image_to_data(image)

    # convert into dataframe
    tessList = list(map(lambda x: x.split('\t'), tessData.split('\n')))
    df = pd.DataFrame(tessList[1:], columns=tessList[0])
    df.dropna(inplace=True)  # drop missing values
    df['text'] = df['text'].apply(cleanText)

    # convert data into content
    df_clean = df.query('text != "" ')
    content = " ".join([w for w in df_clean['text']])
    print(content)

    # get prediction from NER model
    doc = model_ner(content)

In spaCy, a “doc” (short for “document”) is a container for accessing linguistic annotations and a sequence of tokens. The Doc object holds the processed text along with its annotations. The code snippet below starts a local server where the prediction can be viewed:

    from spacy import displacy

    displacy.serve(doc, style='ent', port=5001)

Here is the result. The confidential information has been blurred in the example for privacy, but we have extracted all the text from a new pay slip. When a tag mapped during the labeling process is identified, it is displayed accordingly. For example, the text following “compagnie,” which represents the company’s name, was automatically mapped to the tag B-ORG.
The same process was applied to all other relevant information in the text, such as names, salaries, gains, and deductions.

It is possible to work with the “Doc” variable, which contains the annotation information and token sequences returned by the model. The code below converts the `Doc` object into JSON and then into a DataFrame containing all the information needed to draw bounding boxes on the image at the positions where the information was found. It also creates an entity dictionary, along with a parsing function to process each piece of text found.

    #!/usr/bin/env python
    # coding: utf-8
    import re
    import string
    import warnings

    import cv2
    import numpy as np
    import pandas as pd
    import pytesseract
    import spacy

    warnings.filterwarnings('ignore')

    ### Load NER model
    model_ner = spacy.load('./output/model-best/')

    def cleanText(txt):
        # strip whitespace and the selected punctuation (case is preserved here)
        whitespace = string.whitespace
        punctuation = "!#$%&\'()*+:;<=>?[\\]^`{|}~"
        tableWhitespace = str.maketrans('', '', whitespace)
        tablePunctuation = str.maketrans('', '', punctuation)
        text = str(txt)
        removewhitespace = text.translate(tableWhitespace)
        removepunctuation = removewhitespace.translate(tablePunctuation)
        return str(removepunctuation)

    # group the label: consecutive tokens with the same label share one id
    class groupgen():
        def __init__(self):
            self.id = 0
            self.text = ''

        def getgroup(self, text):
            if self.text == text:
                return self.id
            else:
                self.id += 1
                self.text = text
                return self.id

    def parser(text, label):
        # keep only the characters that make sense for each entity type
        # (the original used `label in ('NAME')`, a substring test on a string)
        if label == 'NAME':
            text = text.lower()
            text = re.sub(r'[^a-z ]', '', text)
            text = text.title()
        elif label in ('ORG', 'ROLE'):
            text = text.lower()
            text = re.sub(r'[^a-z0-9 ]', '', text)
            text = text.title()
        elif label == 'DATE':
            text = text.lower()
            text = re.sub(r'[^0-9/]', '', text)
            text = text.title()
        elif label in ('BASE', 'HOURS', 'QTD', 'GAINS', 'DEDUCTIONS', 'NETTE'):
            text = text.lower()
            text = re.sub(r'[^0-9.,]', '', text)
            text = text.title()
        return text

    grp_gen = groupgen()

    def getPredictions(image):
        try:
            # extract data using Pytesseract
            tessData = pytesseract.image_to_data(image)

            # convert into dataframe
            tessList = list(map(lambda x: x.split('\t'), tessData.split('\n')))
            df = pd.DataFrame(tessList[1:], columns=tessList[0])
            df.dropna(inplace=True)  # drop missing values
            df['text'] = df['text'].apply(cleanText)

            # convert data into content
            df_clean = df.query('text != "" ').copy()
            content = " ".join([w for w in df_clean['text']])

            # get prediction from NER model (doc file)
            doc = model_ner(content)

            # converting doc in json
            docjson = doc.to_json()
            doc_text = docjson['text']

            # creating tokens
            datafram_tokens = pd.DataFrame(docjson['tokens'])
            datafram_tokens['token'] = datafram_tokens[['start', 'end']].apply(
                lambda x: doc_text[x['start']:x['end']], axis=1)
            right_table = pd.DataFrame(docjson['ents'])[['start', 'label']]
            datafram_tokens = pd.merge(datafram_tokens, right_table, how='left', on='start')
            datafram_tokens.fillna('O', inplace=True)

            # join label to df_clean dataframe: rebuild each word's character offsets
            df_clean['end'] = df_clean['text'].apply(lambda x: len(x) + 1).cumsum() - 1
            df_clean['start'] = df_clean[['text', 'end']].apply(
                lambda x: x['end'] - len(x['text']), axis=1)

            # inner join with start
            dataframe_info = pd.merge(
                df_clean, datafram_tokens[['start', 'token', 'label']], how='inner', on='start')

            # Bounding Box
            bb_df = dataframe_info.query("label != 'O' ").copy()
            bb_df['label'] = bb_df['label'].apply(lambda x: x[2:])  # drop the B-/I- prefix
            bb_df['group'] = bb_df['label'].apply(grp_gen.getgroup)

            # right and bottom of bounding box
            bb_df[['left', 'top', 'width', 'height']] = bb_df[['left', 'top', 'width', 'height']].astype(int)
            bb_df['right'] = bb_df['left'] + bb_df['width']
            bb_df['bottom'] = bb_df['top'] + bb_df['height']

            # tagging: groupby group
            col_group = ['left', 'top', 'right', 'bottom', 'label', 'token', 'group']
            group_tag_img = bb_df[col_group].groupby(by='group')
            img_tagging = group_tag_img.agg({
                'left': min,
                'right': max,
                'top': min,
                'bottom': max,
                'label': np.unique,
                'token': lambda x: " ".join(x)
            })

            img_bb = image.copy()
            for l, r, t, b, label, token in img_tagging.values:
                cv2.rectangle(img_bb, (l, t), (r, b), (0, 255, 0), 2)
                cv2.putText(img_bb, str(label), (l, t), cv2.FONT_HERSHEY_PLAIN, 1, (255, 0, 255), 2)

            # Entities
            info_array = dataframe_info[['token', 'label']].values
            entities = dict(NAME=[], ORG=[], DATE=[], ROLE=[], BASE=[], HOURS=[],
                            QTD=[], GAINS=[], DEDUCTIONS=[], NETTE=[])
            previous = 'O'

            for token, label in info_array:
                bio_tag = label[0]
                label_tag = label[2:]

                # step 1: parse the token
                text = parser(token, label_tag)

                if bio_tag in ('B', 'I'):
                    if previous != label_tag:
                        # a new entity type starts here
                        entities[label_tag].append(text)
                    else:
                        if bio_tag == "B":
                            # same type, but a new entity begins
                            entities[label_tag].append(text)
                        else:
                            # continuation: words are joined with a space,
                            # numeric fragments are concatenated directly
                            if label_tag in ("NAME", 'ORG', 'ROLE'):
                                entities[label_tag][-1] = entities[label_tag][-1] + " " + text
                            else:
                                entities[label_tag][-1] = entities[label_tag][-1] + text

                previous = label_tag

            return img_bb, entities

        except Exception as e:
            print(f"An error occurred, make sure the image contours are correct: {str(e)}")
            return None, None

CONCLUSION

Photo by Kelly Sikkema on Unsplash

There are numerous applications for Machine Learning in character detection and recognition and in NLP. Another potential proof of concept could apply the same technique on a mobile phone and automatically translate the extracted text into a designated language. This demonstrates the versatility and practicality of Machine Learning in various contexts.

Hope you like it! The POC for the NLP Scanner is available on my GitHub.

REFERENCES

1. Vision view & Data Science Anywhere, Intelligently Extract Text & Data from Documents.
2. Li, J., Lu, Q., & Zhang, B. (2019). An efficient business card recognition system based on OCR and NER. In 2019 International Conference on Robotics, Automation and Artificial Intelligence (RAAI) (pp. 334–338). IEEE.
3. Sharma, S., & Sharma, A. (2020). Business Card Recognition using Convolutional Neural Networks. In 2020 5th International Conference on Computing, Communication and Security (ICCCS) (pp. 1–5). IEEE.
4. spaCy: Industrial-strength Natural Language Processing in Python. (n.d.). Retrieved from https://spacy.io/
5. PyTesseract: Python-tesseract, OCR tool for Python. (n.d.). Retrieved from https://pypi.org/project/pytesseract/
6. OpenCV: Open Source Computer Vision Library. (n.d.). Retrieved from https://opencv.org/
7. Flask: A Python Microframework. (n.d.). Retrieved from https://flask.palletsprojects.com/