www.together.ai
34.249.200.254
Public Scan
Submitted URL: http://together.ai/
Effective URL: https://www.together.ai/
Submission: On May 02 via api from US — Scanned from DE
Form analysis
2 forms found in the DOM
Name: email-form-2 — GET
<form id="email-form-2" name="email-form-2" data-name="Email Form 2" method="get" fs-cmsfilter-element="filters" data-wf-page-id="6589b69bcbb8c584607c03ff" data-wf-element-id="c46924a9-b883-a3d1-bd45-136dd5e25b44" aria-label="Email Form 2">
<div class="filters-wrapper">
<div class="filters-menu"><label checked="" fs-cmsfilter-element="clear" class="tab fs-cmsfilter_active w-radio"><input type="radio" data-name="filter" id="all" name="filter" class="w-form-formradioinput hide w-radio-input" value="all"><span
class="text-size-navigation w-form-label" for="all">All</span></label></div>
<div class="w-dyn-list">
<div fs-cmsstatic-element="list" role="list" class="filters-menu w-dyn-items">
<div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-0-0" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Chat</span></label></div>
<div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-1-1" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Language</span></label></div>
<div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-2-2" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Embeddings</span></label></div>
<div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-3-3" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Image</span></label></div>
<div role="listitem" class="filters-item w-dyn-item"><label class="tab w-radio"><input type="radio" data-name="filter" id="radio-2-4-4" name="filter" class="w-form-formradioinput hide w-radio-input" value="radio"><span
fs-cmsfilter-field="type" class="text-size-navigation w-form-label" for="radio-2">Code</span></label></div>
</div>
</div>
</div>
</form>
Name: email-form — GET
<form id="email-form" name="email-form" data-name="Email Form" method="get" class="footer_form-inner" data-wf-page-id="6589b69bcbb8c584607c03ff" data-wf-element-id="d9d8a00a-8aed-ff21-807d-bc6c899f4abb" aria-label="Email Form"><input
class="footer_form-field w-input" maxlength="256" name="email-2" data-name="Email 2" placeholder="your@email.com" type="email" id="email-2" required=""><input type="submit" data-wait="" class="footer_form-submit w-button" value=""></form>
Text Content
Together AI launches Meta Llama 3 with over 350 tokens per second

TOGETHER.WE BUILD
The fastest cloud platform for building and running generative AI. Start building now · Docs

* 01 TOGETHER INFERENCE: The fastest inference stack available — just an API call away.
* 02 TOGETHER FINE-TUNING: Train your own generative AI model with your private data.
* 03 TOGETHER GPU CLUSTERS: Dedicated training clusters, ready to go with the Together Training stack.
* 04 TOGETHER CUSTOM MODELS: Build state-of-the-art models from scratch with ease.

THE FASTEST CLOUD FOR GEN AI. BUILT ON LEADING AI RESEARCH.

INNOVATIONS: Our research team is behind breakthrough AI models, datasets, and optimizations. See all research.

COCKTAIL SGD: With Cocktail SGD, we’ve addressed a key hindrance to training generative AI models in a distributed environment: networking overhead. Cocktail SGD is a set of optimizations that reduces network overhead by up to 117x. Read more.

FLASHATTENTION 2: FlashAttention is now broadly used by all Transformer models. It accelerates training and fine-tuning of LLMs by up to 9x and achieves an unprecedented 72% model FLOPs utilization for training on NVIDIA A100s. Read more.

SUB-QUADRATIC MODEL ARCHITECTURES: In close collaboration with Hazy Research, we’re working on the next core architecture for generative AI models that will provide even faster performance with longer context. Our research published in this area includes Striped Hyena, Monarch Mixer, and FlashConv. Read more.

REDPAJAMA: Our RedPajama project enables leading generative AI models to be available as fully open-source. The RedPajama models have been downloaded millions of times, and the RedPajama dataset has been used to create over 500 leading models. Read more.

100+ OPEN MODELS (filter by All, Chat, Language, Embeddings, Image, or Code)

* LLAMA-3 (Chat): Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
* ARCTIC-INSTRUCT (Chat): Arctic is a dense-MoE hybrid transformer architecture pre-trained from scratch by the Snowflake AI Research Team.
* DBRX-INSTRUCT: DBRX Instruct is a mixture-of-experts (MoE) large language model trained from scratch by Databricks. DBRX Instruct specializes in few-turn interactions.
* LLAMA-2 CHAT (Chat): Llama 2-chat leverages publicly available instruction datasets and over 1 million human annotations. Available in three sizes: 7B, 13B and 70B parameters.
* MIXTRAL INSTRUCT (Chat): The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.
* STRIPED HYENA NOUS (Chat): A hybrid architecture composed of multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks, different from traditional decoder-only Transformers.
* MIXTRAL-8X22B (Language): The Mixtral-8x22B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts.
* DEEPSEEK-67B (Chat): Trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.
* MISTRAL INSTRUCT (Chat): Instruct fine-tuned version of Mistral-7B-v0.1.
* 01-AI YI (Language): The Yi series models are large language models trained from scratch by developers at 01.AI.
* UAE-LARGE V1 (Embeddings): A universal English sentence embedding model by WhereIsAI. Its embedding dimension is 1024 and it supports up to 512 tokens of context.
* STABLE DIFFUSION XL 1.0 (Image): A text-to-image generative AI model that excels at creating 1024x1024 images.
* M2-BERT 80M 32K RETRIEVAL (Embeddings): An 80M checkpoint of M2-BERT, pretrained with sequence length 32768 and fine-tuned for long-context retrieval.
* CODE LLAMA INSTRUCT (Chat): Code Llama is a family of large language models for code based on Llama 2, providing infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.
* REDPAJAMA-INCITE INSTRUCT (Language): Designed for few-shot prompts, fine-tuned over the RedPajama-INCITE-Base-7B-v1 base model.
* VICUNA V1.5 16K (Chat): Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.
* NOUS CAPYBARA (Chat): The first Nous collection of datasets and models, made by fine-tuning mostly on data created in-house by Nous.
* NEXUS RAVEN (Language): NexusRaven is an open-source and commercially viable function-calling LLM that surpasses the state of the art in function-calling capabilities.
* NOUS HERMES LLAMA-2 (Chat): Nous-Hermes-Llama2-13b is a state-of-the-art language model fine-tuned on over 300,000 instructions.
* WIZARD LM (Chat): This model achieves a substantial and comprehensive improvement in coding, mathematical reasoning, and open-domain conversation capabilities.
* WIZARDCODER PYTHON V1.0 (Code): This model empowers Code LLMs with complex instruction fine-tuning by adapting the Evol-Instruct method to the domain of code.
* M2-BERT 80M 8K RETRIEVAL (Embeddings): An 80M checkpoint of M2-BERT, pretrained with sequence length 8192 and fine-tuned for long-context retrieval.
* M2-BERT 80M 2K RETRIEVAL (Embeddings): An 80M checkpoint of M2-BERT, pretrained with sequence length 2048 and fine-tuned for long-context retrieval.
* BGE-LARGE-EN V1.5 (Embeddings): BAAI general embedding, large, English, v1.5. FlagEmbedding can map any text to a low-dimensional dense vector for tasks like retrieval, classification, clustering, or semantic search, and it can also be used in vector databases for LLMs.
* LLAMA-2-7B-32K-INSTRUCT (Chat): Extends LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations, instruction tuned by Together.
* BGE-BASE-EN V1.5 (Embeddings): BAAI general embedding, base, English, v1.5. FlagEmbedding can map any text to a low-dimensional dense vector for tasks like retrieval, classification, clustering, or semantic search, and it can also be used in vector databases for LLMs.
* MISTRAL (Language): A 7.3B parameter model that outperforms Llama 2 13B on all benchmarks and approaches CodeLlama 7B performance on code. It uses grouped-query attention (GQA) for faster inference and sliding window attention (SWA) to handle longer sequences at smaller cost.
* CODE LLAMA PYTHON (Code): Code Llama is a family of large language models for code based on Llama 2, providing infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.
* STABLE DIFFUSION 2.1 (Image): Latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
* VICUNA V1.5 (Chat): Vicuna is a chat assistant trained by fine-tuning Llama 2 on user-shared conversations collected from ShareGPT.
* PHIND CODE LLAMA V2 (Language): Phind-CodeLlama-34B-v1 trained on an additional 1.5B tokens of high-quality programming-related data, proficient in Python, C/C++, TypeScript, Java, and more.
* REALISTIC VISION 3.0 (Image): Fine-tuned version of Stable Diffusion focused on photorealism.
* LLAMA-2 (Language): Language model trained on 2 trillion tokens with double the context length of Llama 1. Available in three sizes: 7B, 13B and 70B parameters.
* CODE LLAMA (Code): Code Llama is a family of large language models for code based on Llama 2, providing infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks.
* LLAMA-2-32K (Language): Extends LLaMA-2 to 32K context, built with Meta's Position Interpolation and Together AI's data recipe and system optimizations.
* CHRONOS HERMES (Chat): A 75/25 merge of the Chronos (13B) and Nous Hermes (13B) models, resulting in a strong ability to produce evocative storywriting and follow a narrative.
* PLATYPUS2 INSTRUCT (Chat): An instruction fine-tuned LLaMA-2 (70B) model created by merging Platypus2 (70B) by garage-bAInd and LLaMA-2 Instruct v2 (70B) by upstage.
* WIZARDLM V1.0 (70B) (Language): This model achieves a substantial and comprehensive improvement in coding, mathematical reasoning, and open-domain conversation capabilities.
* MYTHOMAX-L2 (Chat): A MythoLogic-L2 and Huginn merge using a highly experimental tensor-type merge technique. The main difference from MythoMix is that more of Huginn is allowed to intermingle with the single tensors located at the front and end of the model.
* QWEN-CHAT (Chat): 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B-Chat is a large-model-based AI assistant trained with alignment techniques.
* QWEN (Language): 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc.
* REDPAJAMA-INCITE CHAT (Chat): Chat model fine-tuned using data from Dolly 2.0 and Open Assistant over the RedPajama-INCITE-Base-7B-v1 base model.
* REDPAJAMA-INCITE (Language): Base model that aims to replicate the LLaMA recipe as closely as possible (blog post).
* BERT (Embeddings): Model pretrained on English text using a masked language modeling (MLM) objective. The embedding dimension is 768, and the number of model parameters is 110M. This model is uncased: it does not distinguish between english and English.
* SENTENCE-BERT (Embeddings): A sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It has been trained on 500K (query, answer) pairs from the MS MARCO dataset. Its embedding dimension is 768 with a 512 max context length.
* GPT-JT-MODERATION (Language): This model can be used to moderate other chatbot models. Built using the GPT-JT model fine-tuned on Ontocord.ai's OIG-moderation dataset v0.1.
* GPT-JT (Language): Fork of GPT-J instruction-tuned to excel at few-shot prompts (blog post).
* GPT-NEOXT-CHAT-BASE (Chat): Chat model fine-tuned from EleutherAI’s GPT-NeoX with over 40 million instructions on carbon-reduced compute.
* LLAMA (Language): An auto-regressive language model based on the transformer architecture. The model comes in different sizes: 7B, 13B, 33B and 65B parameters.
* FALCON INSTRUCT (Chat): Falcon-40B-Instruct is a causal decoder-only model built by TII, based on Falcon-40B and fine-tuned on a mixture of Baize.
* FALCON (Language): Falcon-40B is a causal decoder-only model built by TII and trained on 1,000B tokens of RefinedWeb enhanced with curated corpora.
* OPENJOURNEY V4 (Image): An open-source Stable Diffusion model fine-tuned on Midjourney images.
* PYTHIA-CHAT-BASE (Chat): Chat model based on EleutherAI’s Pythia-7B model, fine-tuned with data focusing on dialog-style interactions.
* ANALOG DIFFUSION (Image): Dreambooth model trained on a diverse set of analog photographs to provide an analog film effect.

TOGETHER INFERENCE
The fastest inference on the market1. Start with an easy-to-use API and immediately benefit from the fastest backend optimizations anywhere.
SPEED RELATIVE TO TGI, VLLM OR OTHER INFERENCE SERVICES: 3x faster2
LLAMA-2 70B: 117 TOKENS / SEC3
COST RELATIVE TO GPT-3.5-TURBO: 6x lower cost4
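The Input/Output pairs that follow are transcripts from the page's interactive inference demo. For orientation only, here is a minimal sketch of calling Together Inference over HTTP from Python; the endpoint path, model identifier, and response shape follow Together's OpenAI-compatible API as I understand it, and should be treated as assumptions rather than details taken from this page.

    import os
    import requests

    # Minimal sketch of a chat completion request to Together Inference.
    # Endpoint, model name, and response fields are assumptions, not page content.
    API_KEY = os.environ["TOGETHER_API_KEY"]

    response = requests.post(
        "https://api.together.xyz/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "meta-llama/Llama-2-70b-chat-hf",
            "messages": [
                {"role": "user", "content": "Summarize the Monarch Mixer announcement in two sentences."}
            ],
            "max_tokens": 200,
        },
        timeout=60,
    )
    print(response.json()["choices"][0]["message"]["content"])

The same request shape covers the demo tasks below; only the model field and the messages change.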
Input:
Summarize the following article: "Transformers are the workhorse architecture behind modern foundation models. Our team at Together AI has released optimizations like FlashAttention-2 to help further scale their capabilities. However, the computational requirements for Transformers increase quadratically with sequence length, which limits the extent of knowledge these models can capture from available datasets. We are interested in new, sub-quadratic approaches to increase scale and eventually produce models that learn from data more comprehensively. Today, in partnership with Hazy Research, we’re releasing a first look at Monarch Mixer—an exploration into an alternative, more efficient model architecture. Monarch Mixer (M2) is sub-quadratic in both sequence length and model dimension, enabling it to scale more efficiently while still matching quality. Our first target for M2 is BERT, the most popular model in use today for language tasks. M2-BERT is 25% more parameter-efficient than BERT and trains up to 3x faster. This is just the beginning of research into Monarch Mixer models, and we are excited to share the results. You can access the first set of checkpoints at a standard sequence length today, and expect long-sequence BERT models to be coming soon!"

Output:
The article discusses the limitations of Transformers, which are the workhorse architecture behind modern foundation models. The authors introduce Monarch Mixer, a new exploration into an alternative, more efficient model architecture that is sub-quadratic in both sequence length and model dimension. The article highlights the benefits of Monarch Mixer, including its ability to scale more efficiently and match the quality of Transformers. The authors also provide a first look at Monarch Mixer-BERT, which is 25% more parameter-efficient than BERT and trains up to 3x faster. They encourage readers to access the first set of checkpoints and expect long-sequence BERT models to be coming soon.

Input:
You are an expert programmer that writes simple, concise code and explanations. Write a python function that accepts a long input string, extracts words in between `<span>` and `<\span>` and returns a list.

Output:
Here is a Python function that accepts a long input string, extracts words in between <span> and </span> and returns a list:

    import re

    def extract_words(input_string):
        pattern = r"<span>(.*?)</span>"
        matches = re.findall(pattern, input_string)
        return matches

This function uses the re module to perform a regular expression search on the input string. The regular expression pattern <span>(.*?)</span> matches any text that is enclosed in <span> and </span> tags. The findall function returns a list of all the matches found in the input string. Here is an example of how you can use this function:

    input_string = "<span>Hello</span><span>World</span>"
    words = extract_words(input_string)
    print(words)  # Output: ['Hello', 'World']

This code will extract the words "Hello" and "World" from the input string and return them in a list.

Input:
<s>[INST] <<SYS>>Your task is to extract top 10 keywords from a provided text.<</SYS>>\n\nScaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4× compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2× speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).[/INST]

Output:
1. Scaling
2. Transformers
3. Longer sequence lengths
4. Language modeling
5. High-resolution image understanding
6. Code generation
7. Audio generation
8. Video generation
9. Attention layer
10. Memory saving
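The keyword-extraction transcript above and the customer-service transcript below both show the raw Llama-2 chat prompt format ([INST], <<SYS>>). As an illustrative sketch only (the helper name is made up here and is not part of any Together tooling), the single-turn template can be assembled like this:

    def build_llama2_prompt(system: str, user: str) -> str:
        # Hypothetical helper: reproduces the single-turn Llama-2 chat template
        # shown in the transcripts: <s>[INST] <<SYS>>system<</SYS>> user [/INST]
        return f"<s>[INST] <<SYS>>{system}<</SYS>>\n\n{user}[/INST]"

    prompt = build_llama2_prompt(
        "Your task is to extract top 10 keywords from a provided text.",
        "Scaling Transformers to longer sequence lengths has been a major problem...",
    )

Multi-turn chats append each earlier exchange, closed with </s>, before the final [INST] block, as the customer-service transcript below shows.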
Input:
<s>[INST] <<SYS>>You are a helpful, polite customer service agent for Awesome Cheeses Inc named Sally. If the customer mentions anything about food, tell them you have a 20% off deal on gouda cheese.<</SYS>>\n\nhi who am i speaking to?[/INST]Hello! this is Sally from Awesome Cheeses. How is your day going?</s><s>[INST]im hungry for something cheesy[/INST]

Output:
Well, you've come to the right place! I'm happy to help you with that. We're running a special promotion right now - you can get 20% off all gouda cheese products. Would you like me to tell you a bit more about our gouda selection? We have a variety of delicious flavors to choose from, such as Smoked Gouda, Goat Gouda, and Garlic Gouda. Which one sounds the most appealing to you?

Input:
Label the sentences as either "positive", "negative", "mixed", or "neutral":
Sentence: I can say that there isn't anything I would change.
Sentence: I'm not sure about this.
Sentence: I liked some parts but I didn't like other parts.
Sentence: I think the background image could have been better.
Sentence: I really like it.

Output:
Sentence: I can say that there isn't anything I would change. Label: positive
Sentence: I'm not sure about this. Label: neutral
Sentence: I liked some parts but I didn't like other parts. Label: mixed
Sentence: I think the background image could have been better. Label: negative
Sentence: I really like it. Label: positive

* 1. Document summarization: Turn lengthy documents into concise summaries.
* 2. Code generation: Automate the creation of repetitive or template-based code.
* 3. Entity extraction: Identify and extract specific data points from unstructured text.
* 4. Chat: Engage users in real-time conversations to assist or inform.
* 5. Sentiment analysis: Determine the emotional tone or subjective opinion expressed in a piece of text.

Together Inference

TOGETHER FINE-TUNING
Fine-tune leading open-source models with your data to achieve greater accuracy for your tasks.

    together files upload acme_corp_customer_support.jsonl

    {
      "filename": "acme_corp_customer_support.json",
      "id": "file-aab9997e-bca8-4b7e-a720-e820e682a10a",
      "object": "file"
    }

    together finetune create --training-file file-aab9997-bca8-4b7e-a720-e820e682a10a --model togethercomputer/RedPajama-INCITE-7B-Chat

    together finetune create --training-file $FILE_ID --model $MODEL_NAME --wandb-api-key $WANDB_API_KEY --n-epochs 10 --n-checkpoints 5 --batch-size 8 --learning-rate 0.0003

    {
      "training_file": "file-aab9997-bca8-4b7e-a720-e820e682a10a",
      "model_output_name": "username/togethercomputer/llama-2-13b-chat",
      "model_output_path": "s3://together/finetune/63e2b89da6382c4d75d5ef22/username/togethercomputer/llama-2-13b-chat",
      "Suffix": "Llama-2-13b 1",
      "model": "togethercomputer/llama-2-13b-chat",
      "n_epochs": 4,
      "batch_size": 128,
      "learning_rate": 1e-06,
      "checkpoint_steps": 2,
      "created_at": 1687982945,
      "updated_at": 1687982945,
      "status": "pending",
      "id": "ft-5bf8990b-841d-4d63-a8a3-5248d73e045f",
      "epochs_completed": 3,
      "events": [
        {
          "object": "fine-tune-event",
          "created_at": 1687982945,
          "message": "Fine tune request created",
          "type": "JOB_PENDING"
        }
      ],
      "queue_depth": 0,
      "wandb_project_name": "Llama-2-13b Fine-tuned 1"
    }

* Start simple: Fine-tune your model with just two commands.
* Go deep: Control hyperparameters like learning rate, batch size, and epochs to optimize model quality.

Together Fine-tuning
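For readers reproducing the workflow above, here is a hedged sketch of generating the acme_corp_customer_support.jsonl file passed to together files upload. The one-JSON-object-per-line layout with a single "text" field is an assumption about the expected training format, not something documented on this page, and the records themselves are invented placeholders.

    import json

    # Hypothetical training examples; the "text" field schema is an assumption.
    examples = [
        {"text": "<human>: How do I reset my password?\n<bot>: Go to Settings > Account > Reset password."},
        {"text": "<human>: What is your refund policy?\n<bot>: Refunds are available within 30 days of purchase."},
    ]

    # Write one JSON object per line (JSONL).
    with open("acme_corp_customer_support.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

Because each line is an independent JSON object, the file can be appended to incrementally without rewriting earlier records.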
NEED A CLUSTER OF H100S ON INFINIBAND TODAY? Reserve your cluster today.

TOGETHER GPU CLUSTERS
Get your own private GPU cluster for large training and fine-tuning jobs today. Use our purpose-built training clusters with H100 and A100 GPUs connected over fast Infiniband networks. Your cluster comes optimized for distributed training with the high-performance Together Training stack out of the box. You focus on your model, and we’ll ensure everything runs smoothly.

* 01 HIGH-END CLUSTERS WITHOUT LONG-TERM COMMITS: We offer flexible terms, even with our highest quality hardware. You can commit to just a month or reserve capacity for up to 5 years.
* 02 CLUSTERS FROM 16 GPUS TO 2048 GPUS:
  A100 PCIe Clusters. Node specifications:
  - 8x A100 / 80GB / PCIe
  - 200Gb node-to-node non-blocking Ethernet
  - 120 vCPU Intel Xeon (Ice Lake)
  - 960GB RAM
  - 7.68 TB NVMe storage
  A100 SXM Clusters. Node specifications:
  - 8x NVIDIA A100 80GB SXM
  - 4120 vCPU Intel Xeon (Sapphire Rapids)
  - 960 GB RAM
  - 8x 960GB NVMe storage
  - 200 Gbps Ethernet or 3200 Gbps Infiniband configs available
  H100 Clusters. Node specifications:
  - 8x NVIDIA H100 / 80GB / SXM5
  - 3.2 Tbps Infiniband network
  - 2x AMD EPYC 9474F CPUs (48 cores / 96 threads, 3.6GHz)
  - 1.5TB ECC DDR5 memory
  - 8x 3.84TB NVMe SSDs
* 03 SNAPPY SETUP. BLAZING FAST TRAINING. We value your time. Clusters are pre-configured for high-speed distributed training, using Slurm and the Together Custom Models stack to get you up and running at lightspeed.

Together GPU Clusters

TOGETHER CUSTOM MODELS
We built the open-source RedPajama models. Now, we’re helping you build models from scratch. Together Custom Models optimizes model builds end-to-end, from data collection and preparation to deployment. And in the end, you own the model.

Together Custom Models
TRAINING: 9x faster5
NETWORK COMMUNICATION: 117x lower6
RELATIVE TO AWS: 4x lower cost7

WE BELIEVE THE FUTURE OF AI IS OPEN SOURCE
01 TRANSPARENCY: Inspect how models are trained and what data is used, to increase accuracy and minimize risks.
02 NO LOCK-IN: You own the model you fine-tune, not your cloud provider. Change providers for whatever reason, including price changes.
03 PRIVACY: Maintain complete data privacy by storing data locally or in our secure cloud.

BUILD, DEPLOY, AND SCALE. ALL IN A SINGLE PLATFORM.
* 01 BUILD: Whether prompt engineering, fine-tuning, or training, we are ready to meet your business demands.
* 02 DEPLOY: Easily integrate your new model into your production application using the Together Inference API.
* 03 SCALE: With the fastest performance available and elastic scaling, Together AI is built to scale with your needs as you grow.

CUSTOMER STORIES
See how we support leading teams around the world. Our customers are creating innovative generative AI applications, faster.
* PIKA CREATES THE NEXT GEN TEXT-TO-VIDEO MODELS ON TOGETHER GPU CLUSTERS
* NEXUSFLOW USES TOGETHER GPU CLUSTERS TO BUILD CYBERSECURITY MODELS
* ARCEE BUILDS DOMAIN ADAPTIVE LANGUAGE MODELS WITH TOGETHER CUSTOM MODELS
Start building yours here →

Footnotes:
1. Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. MosaicML comparison based on published numbers in the MosaicML blog. Detailed results and methodology published here.
2. Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference, TGI, vLLM, Anyscale, Perplexity, and OpenAI. MosaicML comparison based on published numbers in the MosaicML blog. Detailed results and methodology published here.
3. Testing conducted by Together AI in November 2023 using Llama-2-70B running on Together Inference. Detailed results and methodology published here.
4. Based on published pricing November 8th, 2023, comparing OpenAI GPT-3.5-Turbo to Llama-2-13B on Together Inference using Serverless Endpoints. Assumes an equal number of input and output tokens.
5. Compared to a standard attention implementation in PyTorch, FlashAttention-2 can be up to 9x faster. Source.
6. Testing methodology and results published in this research paper.
7. Based on published pricing November 8th, 2023, comparing AWS Capacity Blocks and AWS p5.48xlarge instances to Together GPU Clusters configured with an equal number of H100 SXM5 GPUs on our 3200 Gbps Infiniband networking configuration.

SUBSCRIBE TO NEWSLETTER

© 2023 · San Francisco, CA 94114 · Privacy policy · Terms of service