Today I am announcing OpenOrca, an open-source dataset and series of instruct-tuned language models.
As I read Orca: Progressive Learning from Complex Explanation Traces of GPT-4 by Mukherjee et al. of Microsoft, I had to consider the implications for Open Source AI.
This was impressive work. But I realized that while Microsoft would probably release their LLaMA-13b-based model (as of this writing, they still haven't), they might not release the dataset.
Therefore, I resolved to replicate their efforts: download the data myself and train the model myself, so that OpenOrca can be released on other sizes of LLaMA as well as other foundation models such as Falcon, OpenLLaMA, RedPajama, MPT, and RWKV.
This was a nontrivial undertaking. With the help of an all-star team of open-source AI/ML engineers, we have completed the OpenOrca dataset.
Our dataset consists of:
We followed the submix and system prompt distribution outlined in the Orca paper, with a few exceptions: we included all 75k CoT entries from the FLAN-1m dataset rather than sampling them, and we removed the many duplicated items we found, resulting in 3.5M instructions in the ChatGPT dataset.
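The deduplication step can be sketched as follows. This is a minimal, hypothetical illustration (not our actual pipeline code), assuming each record is a dict with `system_prompt`, `question`, and `response` fields; it keys duplicates on a hash of the normalized fields so exact and near-trivially-varying copies are dropped while the first occurrence survives.

```python
import hashlib
import json


def dedup_instructions(records):
    """Drop duplicate instruction records, keeping first occurrences.

    Duplicates are detected by hashing the whitespace-stripped,
    lowercased (system_prompt, question, response) triple.
    Field names here are illustrative assumptions.
    """
    seen = set()
    unique = []
    for rec in records:
        key_src = json.dumps([
            rec.get("system_prompt", "").strip().lower(),
            rec.get("question", "").strip().lower(),
            rec.get("response", "").strip().lower(),
        ])
        digest = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


records = [
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "4"},
    {"system_prompt": "You are a helpful assistant.",
     "question": "What is 2+2?", "response": "4"},  # exact duplicate
    {"system_prompt": "", "question": "Name a color.", "response": "Blue"},
]
print(len(dedup_instructions(records)))  # 2
```

Hashing the normalized triple rather than the raw record keeps memory bounded on millions of rows, since only fixed-size digests are retained in the seen-set.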
We are presently performing full-weights fine-tuning of OpenOrca on the foundation of LLaMA-13b, so that our performance can be compared with Microsoft's model when it releases.
We expect to release OpenOrca-LLaMA-13b in mid-July 2023. At that time we will publish our evaluation findings and the dataset.
We are currently seeking GPU compute sponsors for training OpenOrca on the following platforms:
From the Orca paper and our experiments, we roughly estimate the compute costs as follows:
| Model Size | Compute Estimate |
| --- | --- |
We will share our appreciation for sponsorship in this space, as well as the model cards.
Our current sponsors:
Please reach out to me if you are interested in providing compute sponsorship for any specific targets of OpenOrca.
I would like to thank the motley crew of Open Source AI/ML engineers who have worked beside me in this endeavor, including:
Wing “Caseus” Lian and NanoBit of OpenAccess AI Collective
AutoMeta, Entropi, AtlasUnified, and neverendingtoast of Alignment Lab AI
Tom “TheBloke” Jobbins for quantizing and amplifying
All the other people in the Open Source AI community who have taught me and helped me along the way.