Effective URL: https://www.nlpsummit.org/lessons-learned-de-identifying-700-million-patient-notes-with-spark-nlp/

 * Program
 * NLP Training
 * Watch Past Summits
   * Healthcare NLP 2023
   * NLP Summit 2022
   * Healthcare NLP 2022
   * NLP Summit 2021
   * Healthcare NLP 2021
   * NLP Summit 2020
 * Register now



LESSONS LEARNED DE-IDENTIFYING 700 MILLION PATIENT NOTES WITH SPARK NLP



Providence St. Joseph Health’s (PSJH) unstructured data de-identification
methodology relies on pre-trained BiLSTM-CNN-Char NER models provided by John
Snow Labs.

The PSJH Data Science department evaluated the John Snow Labs models on
accuracy and speed. Accuracy was evaluated by randomly selecting 1,000 patient
notes, de-identifying them with the John Snow Labs de-identification model, and
having human experts validate each de-identified note. The sample contained
34,701 sentences in total, and the reviewers found 281 leaked PHI events.
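
The sentence-level leak rate implied by these counts can be checked with a few
lines of Python. This is only an illustrative sketch: the function name
leak_rate is hypothetical, and it assumes each leaked PHI event is counted
against the total number of sentences, as the abstract does.

```python
def leak_rate(leaked_events: int, total_sentences: int) -> float:
    """Return leaked PHI events as a fraction of all reviewed sentences."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return leaked_events / total_sentences


# Counts reported in the abstract for the 1,000-note validation sample.
LEAKED_EVENTS = 281
TOTAL_SENTENCES = 34_701

rate = leak_rate(LEAKED_EVENTS, TOTAL_SENTENCES)
print(f"Leak rate: {rate:.2%} of sentences")
```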

Therefore, PHI leaks into at least 0.81% of sentences. Speed was evaluated by
measuring the time to process 100K and 500K patient notes (the expected daily
load ranges from 100K to 500K) on a moderately sized cluster. The cluster used
for this test had 15 workers, each with 112 GB of memory, 1 GPU, and 5 DBU.

It took 43.76 minutes to de-identify 100K patient notes and 2.46 hours to
de-identify 500K patient notes. In conclusion, the John Snow Labs
de-identification model is fast enough to process even the upper end of the
expected daily load in a few hours.
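
To put the two timings on a common scale, they can be converted to notes per
minute. The sketch below uses only the figures reported above; the names
notes_per_minute, RUNS, and WORKERS are hypothetical, and dividing evenly by
the 15 workers is a simplifying assumption rather than a measured per-worker
rate.

```python
def notes_per_minute(notes: int, minutes: float) -> float:
    """Return average throughput in notes per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return notes / minutes


# (number of notes, wall-clock minutes) for each reported run.
RUNS = {
    "100K": (100_000, 43.76),
    "500K": (500_000, 2.46 * 60),  # 2.46 hours expressed in minutes
}
WORKERS = 15  # cluster size reported for the test

for label, (notes, minutes) in RUNS.items():
    rate = notes_per_minute(notes, minutes)
    print(f"{label}: {rate:,.0f} notes/min overall, "
          f"{rate / WORKERS:,.0f} notes/min per worker")
```

On these figures the 500K run averages more notes per minute than the 100K
run. That would be consistent with fixed start-up costs being spread over a
larger batch, although the abstract does not state a cause.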

The John Snow Labs de-identification model is also reasonably accurate, and
its measured accuracy is consistent with the advertised figures.

About the speaker


Vivek Tomer 

Principal Data Scientist at Providence St. Joseph Health

Vivek Tomer is a Principal Data Scientist in the Healthcare Intelligence
department at Providence St. Joseph Health (PSJH), where he is responsible for
creating and leading strategic enterprise Data Science projects.

Prior to PSJH, Mr. Tomer was Vice President, Model Development at Umpqua Bank
where he led the development of the bank’s first loan-level credit risk and
customer analytics models.

Mr. Tomer has two master’s degrees from the University of Illinois at
Urbana-Champaign, one in Theoretical Statistics and the other in Quantitative
Finance, and has over a decade of experience in solving complex business
problems using statistical models.


WHEN

Sessions: October 5 – 7
Trainings: October 4, 12 – 15


CONTACT

nlpsummit@johnsnowlabs.com






 * Code of Conduct
 * Privacy Policy