RWKV: Reinventing RNNs for the Transformer Era
Notably, the Quasi-Recurrent Neural Network (QRNN) (Bradbury et al., 2017) uses both convolutional layers and recurrent pooling functions across timesteps and channels. While QRNN utilizes convolutional filters with fixed sizes, RWKV employs a time-mixing module as an attention mechanism with time-decaying factors. Different from the element-wise pooling in QRNN, RWKV includes a parametrized channel-mixing module (see the green blocks in Fig. 1c) that is parallelizable.

3 Background

Here we briefly review the fundamentals of RNNs and Transformers.

3.1 Recurrent Neural Networks (RNNs)

Popular RNN architectures such as LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014) are characterized by the following formulation (shown for LSTM; other architectures can be reasoned about similarly).

In the standard self-attention formulation (with the multi-head structure and the scaling factor 1/\sqrt{d_k} omitted for convenience), the core QK^\top multiplication is an ensemble of pairwise attention scores between each token in a sequence, which can be decomposed as vector operations:

    Attn(Q, K, V)_t = \frac{\sum_{i=1}^{T} e^{q_t^\top k_i} v_i}{\sum_{i=1}^{T} e^{q_t^\top k_i}}.    (8)

In AFT (Zhai et al., 2021), this is alternately formulated as

    Attn^{+}(W, K, V)_t = \frac{\sum_{i=1}^{t} e^{w_{t,i} + k_i} v_i}{\sum_{i=1}^{t} e^{w_{t,i} + k_i}},    (9)

where {w_{t,i}} ∈ R^{T×T} are the learned pair-wise position biases, and each w_{t,i} is a scalar.

Inspired by AFT, we let each w_{t,i} in RWKV be a channel-wise time decay vector multiplied by the relative position, traced backwards from the current time as it decays:

    w_{t,i} = -(t - i) w,    (10)

where w ∈ (R_{≥0})^d, with d the number of channels. We require w to be non-negative to ensure that e^{w_{t,i}} ≤ 1 and the per-channel weights decay backwards in time.
[Figure 1 panels: block diagrams with layers labeled Linear, Convolution, RNN Cell/Linear, Elementwise Pooling, Time-mixing, and Channel-mixing.]

Figure 1: Computation structure of the RWKV in comparison to QRNN and RNN (Vanilla, LSTM, GRU, etc.) architectures. Color codes: orange indicates time-mixing, convolutions or matrix multiplications, and the continuous block indicates that these computations can proceed simultaneously; blue signifies parameterless functions that operate concurrently along the channel or feature dimension (element-wise). Green indicates channel-mixing.
5 Evaluations

In this section, we focus on evaluations that answer the following questions:

• RQ1: Is RWKV competitive against quadratic transformer architectures with an equal number of parameters and training tokens?
• RQ2: When increasing the number of parameters, does RWKV remain competitive against quadratic transformer architectures?
• RQ3: Does increasing the number of parameters of RWKV yield better language modeling loss when RWKV models are trained for context lengths that most open-sourced quadratic transformers cannot efficiently process?

Addressing RQ1 and RQ2, from Fig. 4 we can see that RWKV is very competitive on six benchmarks (Winogrande, PIQA, ARC-C, ARC-E, LAMBADA, and SciQ) against major open-source quadratic-complexity transformer models, including Pythia (Biderman et al., 2023) and OPT (Zhang et al., 2022).

Figure 4: Zero-shot performance on (a) Winogrande, (b) PIQA, and (c) ARC-Challenge. The horizontal axis is the number of parameters and the vertical axis is accuracy.

Figure 6: Cumulative time during text generation for different LLMs.

Additionally, we carried out comparative studies of RWKV-4 and ChatGPT / GPT-4 (see Appendix J). They revealed that RWKV-4 is very sensitive to prompt engineering. When the prompts were adjusted from the ones used for GPT to ones more suitable for RWKV, the F1 performance increased from 44.2% to 74.8%.

7 Future Work

There are several promising directions for future work on the RWKV architecture:

• Increasing model expressivity with enhanced time-decay formulations and exploring initial model states while maintaining efficiency.
References

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium. Association for Computational Linguistics.

Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. 2021. Pay attention to MLPs.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernandez. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library.

Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.

Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

Ilan Price, Jordan Gifford-Moore, Jory Flemming, Saul Musker, Maayan Roichman, Guillaume Sylvain, Nithum Thain, Lucas Dixon, and Jeffrey Sorensen. 2020. Six attributes of unhealthy conversations. In Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 114–124, Online. Association for Computational Linguistics.

Markus N. Rabe and Charles Staats. 2022. Self-attention does not need O(n^2) memory.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2018. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Ramsha Siddiqui. 2019. SARCASMANIA: Sarcasm Exposed! http://www.kaggle.com/rmsharks4/sarcasmania-dataset. [Online; accessed 02-February-2023].

David R. So, Wojciech Manke, Hanxiao Liu, Zihang Dai, Noam Shazeer, and Quoc V. Le. 2021. Primer: Searching for efficient transformers for language modeling. CoRR, abs/2109.08668.

Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2022. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.

Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR, abs/2105.01601.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. ArXiv, abs/1609.03499.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A healthcare dataset for complex reasoning. In ACL.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity.

Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, pages 1391–1399. ACM.

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In ACL.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2020. WinoGrande: An adversarial Winograd schema challenge at scale. In ACL.

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. An attention free transformer.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv:1810.12885.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
A Author Contributions

Bo Peng Original RWKV idea, original code, performance optimizations, original experiments, and trained RWKV models from 0.1B to 14B.

Eric Alcaide Manuscript (initial draft sections 1, 2; sections 4, 7 and 8; revision and proofreading; final version). Figures 2, 3, 4, 7. Experiments of section 6. Appendices D, I. Contributions to Appendix K.

Quentin Anthony Led writing the paper. Manuscript (initial draft sections 1, 2, 3; revision and proofreading; final version).

Zhenyuan Zhang Manuscript (revision and proofreading). Figure 3. Experiments for Appendix G. Contributions to Appendices B and K.

Kranthi Kiran GV Manuscript (sections 2 and 5; contributions to section 3; revision and proofreading). Tables 3 and 4. Appendix C.

Alon Albalak Manuscript (abstract and sections 1, 9; proofreading and revision).

Jan Kocon Manuscript (section 1; proofreading and revision). Contributions to Appendix J.

Przemysław Kazienko Manuscript (section 6; proofreading and revision). Contributions to Appendix J.

Ruichong Zhang Manuscript (proofreading and revision). Contributions to Figure 5 and Appendix K.

Stanisław Woźniak Appendix J.

Bartłomiej Koptyra Contributions to Appendix J.

Xiangru Tang Manuscript (sections 2 and 3; contributions to abstract; revision and proofreading). Contributions to Appendix K.

Matteo Grella Manuscript (sections 4.5, 4.6, 8; contributions to sections 1, 7 and 9; proofreading and revision). Contributions to Appendix B.

Ferdinand Mom Manuscript (contributions to sections 1, 2, 4.3, 4.6; proofreading and revision). Contributions to Appendix B.

Atsushi Saito Manuscript (sections 3 and 5; contributions to section 2). Figures 1a, 1b, 1c. Contributions to Appendix H.

Krishna Sri Ipsit Mantri Figure 4.

Rui-Jie Zhu Tables 1 and 5. Experiments for Table 5.

Peng Zhou Contributions to Table 5.

Qihang Zhao Manuscript (proofreading and revision). Contributions to Table 5.

Xuzheng He Manuscript (contributions to section 3; proofreading and revision). Contributions to Figures 1 and 7. Appendix G. Contributions to Appendix F.

Hayden Lau Manuscript (contributions to section 1; proofreading and revision). Contributions to Appendix K.

Michael Chung Manuscript (contributions to section 4.6; proofreading and revision).

Haowen Hou Figure 8. Appendix E.

Jiaming Kong Manuscript (revision and proofreading). Appendix F.

Johan S. Wind RWKV performance optimizations (CUDA). Contributions to Appendix C.

Jian Zhu Manuscript (section 2; proofreading and revision). Figures 3 and 5.

Huanqi Cao Manuscript (contributions to sections 4.2 and 4.3; proofreading and revision). Experiments for Appendix G.

Samuel Arcadinho Contributions to Figures 6, 10, and 11. Contributions to Appendix I.

Xin Cheng Manuscript (proofreading and revision). Contributions to Appendices K and H.

B Time-Mixing Block as an RNN Cell

As stated in Section 4.3, the RWKV time-mixing block can be formulated as an RNN, as the WKV computation can be written in the following recursive form:

    a_0, b_0 = 0,    (19)
    wkv_t = \frac{a_{t-1} + e^{u + k_t} v_t}{b_{t-1} + e^{u + k_t}},    (20)
    a_t = e^{-w} a_{t-1} + e^{k_t} v_t,    (21)
    b_t = e^{-w} b_{t-1} + e^{k_t}.    (22)

The dataflow of the RNN-like time-mixing is shown in Fig. 7, where the hidden state h is the numerator-denominator tuple (a, b).

To avoid numerical overflow in the exponentials, note that

    a_1 = e^{-w} a_0 + e^{k_0} v_0 = e^{k_0} v_0,    (23)
    b_1 = e^{-w} b_0 + e^{k_0} = e^{k_0},    (24)

and we set a'_1 = v_0, b'_1 = 1, p_0 = k_0, where p_{t-1} stores the shared exponent of a_t and b_t. The above recursion can now be converted into a numerically safe version, for each time step t > 1:

    q := \max(p_{t-1}, u + k_t),    (25)
    a^*_t = e^{p_{t-1} - q} a'_{t-1} + e^{u + k_t - q} v_t,    (26)
    b^*_t = e^{p_{t-1} - q} b'_{t-1} + e^{u + k_t - q},    (27)
    wkv_t = \frac{a^*_t}{b^*_t}.    (28)

The updates to a'_t, b'_t and their shared exponent are carried out in a similar fashion:

    q := \max(p_{t-1} - w, k_t),    (29)
    a'_t = e^{p_{t-1} - w - q} a'_{t-1} + e^{k_t - q} v_t,    (30)
    b'_t = e^{p_{t-1} - w - q} b'_{t-1} + e^{k_t - q},    (31)
    p_t = q.    (32)
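For readers who want to experiment with this recursion, the following PyTorch sketch implements the numerically stable update of Eqs. (25)–(32) step by step. It is an illustrative, loop-based and unbatched reference (the function name, tensor shapes, and the very negative initialization of the shared exponent p are our choices), not the optimized CUDA kernel used in practice.

import torch

def wkv_recurrent(k, v, w, u):
    # k, v: (T, d); w, u: (d,) with w >= 0 (time decay and current-token bonus).
    T, d = k.shape
    a = torch.zeros(d)                 # a'_t: scaled numerator state
    b = torch.zeros(d)                 # b'_t: scaled denominator state
    p = torch.full((d,), -1e38)        # p_t: shared exponent of a_t and b_t
    out = []
    for t in range(T):
        # Output, Eqs. (25)-(28): wkv_t = (a_{t-1} + e^{u+k_t} v_t) / (b_{t-1} + e^{u+k_t})
        q = torch.maximum(p, u + k[t])
        a_star = torch.exp(p - q) * a + torch.exp(u + k[t] - q) * v[t]
        b_star = torch.exp(p - q) * b + torch.exp(u + k[t] - q)
        out.append(a_star / b_star)
        # State update, Eqs. (29)-(32): a_t = e^{-w} a_{t-1} + e^{k_t} v_t, likewise for b_t
        q = torch.maximum(p - w, k[t])
        a = torch.exp(p - w - q) * a + torch.exp(k[t] - q) * v[t]
        b = torch.exp(p - w - q) * b + torch.exp(k[t] - q)
        p = q
    return torch.stack(out)            # (T, d)

# Example usage with random inputs.
T, d = 16, 8
k, v = torch.randn(T, d), torch.randn(T, d)
w, u = torch.rand(d), torch.randn(d)
wkv = wkv_recurrent(k, v, w, u)

After the first iteration the state is (a', b', p) = (v_0, 1, k_0), matching Eqs. (23)–(24); every later step only exponentiates differences of exponents, which is what keeps the recursion numerically safe.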
C Parameter and FLOP Count for the RWKV Models

The following section provides an overview of the different RWKV model architectures along with their respective parameter and FLOP counts in Table 2.

Name | Layers | Model Dimension | Parameters | FLOPs per token
169 M | 12 | 768 | 1.693 × 10^8 | 2.613 × 10^8
430 M | 24 | 1024 | 4.304 × 10^8 | 7.573 × 10^8
1.5 B | 24 | 2048 | 1.515 × 10^9 | 2.823 × 10^9
3 B | 32 | 2560 | 2.985 × 10^9 | 5.710 × 10^9
7 B | 32 | 4096 | 7.393 × 10^9 | 1.437 × 10^10
14 B | 40 | 5120 | 1.415 × 10^10 | 2.778 × 10^10
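As a rough cross-check of the table (our observation; the FLOP counting convention is not spelled out in this excerpt), the reported FLOPs per token match twice the number of non-embedding parameters when one assumes the RWKV-4 vocabulary size of 50277:

# Hypothetical sanity check: FLOPs/token ~ 2 * (parameters outside the token embedding),
# assuming a vocabulary of 50277 (an assumption, not stated in this excerpt).
vocab = 50277
rows = {
    "169M": (768, 1.693e8, 2.613e8),
    "1.5B": (2048, 1.515e9, 2.823e9),
    "14B": (5120, 1.415e10, 2.778e10),
}
for name, (dim, params, flops) in rows.items():
    estimate = 2 * (params - vocab * dim)
    print(f"{name}: estimated {estimate:.3e} vs reported {flops:.3e} FLOPs per token")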
[Figure: learned time-decay values per channel (channel index 0–800) for layers 3–12.]

[Figure: information propagation path, showing the log-probability of "Paris" across layers 1–21 for each token of the prompt "The Eiffel Tower is located in the city of".]

    wkv_T = \frac{\sum_{t=1}^{T} K^e_t v_t}{\sum_{t=1}^{T} K^e_t} = E(v_t) = \frac{S(v_t)}{S(1)},    (33)

where

    v_t = W_v x_t,    \frac{\partial (v_t)_i}{\partial (W_v)_{i,j}} = (x_t)_j,

    K^e_t = e^{W_k x_t + w_{T,t}},    \frac{\partial (K^e_t)_i}{\partial (W_k)_{i,j}} = (x_t)_j (K^e_t)_i,

and S(·) and E(·) are shorthand for denoting sums and averages over the weights K^e_t. The loss function at position T can then be written in terms of these quantities.
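The two Jacobians stated above can be verified numerically; the following short PyTorch check (our addition, with arbitrary values standing in for x_t and w_{T,t}) confirms them via autograd.

import torch

d = 4
x = torch.randn(d)                       # stand-in for x_t
w_Tt = torch.randn(d)                    # stand-in for the position bias w_{T,t}
Wv = torch.randn(d, d, requires_grad=True)
Wk = torch.randn(d, d, requires_grad=True)

v = Wv @ x                               # v_t = W_v x_t
Ke = torch.exp(Wk @ x + w_Tt)            # K^e_t = exp(W_k x_t + w_{T,t})

i, j = 1, 2
# d(v_t)_i / d(W_v)_{i,j} should equal (x_t)_j
grad_v, = torch.autograd.grad(v[i], Wv, retain_graph=True)
assert torch.allclose(grad_v[i, j], x[j])
# d(K^e_t)_i / d(W_k)_{i,j} should equal (x_t)_j (K^e_t)_i
grad_k, = torch.autograd.grad(Ke[i], Wk)
assert torch.allclose(grad_k[i, j], x[j] * Ke[i])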
Table 3: Zero-Shot Performance of the model on Common Sense Reasoning Tasks. * Interpolation of Pythia and GPT-Neo models.

Table 4: Zero-Shot Performance of various models on different tasks. * Interpolation of Pythia and GPT-Neo models.
Method | L | d | T | Train bpc | Test bpc | Time Complexity | Space Complexity
Transformer | 12 | 512 | 1024 | 0.977 | 1.137 | O(T^2 d) | O(T^2 + T d)
Transformer | 24 | 256 | 1024 | 1.039 | 1.130 | O(T^2 d) | O(T^2 + T d)
Reformer | 12 | 512 | 1024 | 1.040 | 1.195 | O(T log T · d) | O(T log T + T d)
Synthesizer | 12 | 512 | 1024 | 0.994 | 1.298 | O(T^2 d) | O(T^2 + T d)
Linear Transformer | 12 | 512 | 1024 | 0.981 | 1.207 | O(T d^2) | O(T d + d^2)
Performer | 12 | 512 | 1024 | 1.002 | 1.199 | O(T d^2 log d) | O(T d log d + d^2 log d)
AFT-simple | 12 | 512 | 1024 | 0.854 | 1.180 | O(T d) | O(T d)
RWKV-RNN | 6 | 512 | 1024 | 0.720 | – | O(T d) | O(d)

Table 5: Enwik8 results, measured in bits per character (bpc); lower is better. Baseline comparisons are made with Reformer (Kitaev et al., 2020), Synthesizer (Tay et al., 2020) (the best performing dense version), Linear Transformer (Katharopoulos et al., 2020), and Performer (Choromanski et al., 2020). L, d, and T denote the number of blocks (network depth), the feature dimension, and the sequence length, respectively. Both Linear Transformer and Performer are implemented with customized CUDA kernels (github.com/idiap/fast-transformers), and all other models are implemented in native PyTorch.
In this part, we present a few instances of outputs produced by the RWKV model using a chat interface (code: https://github.com/BlinkDL/ChatRWKV; online demo: https://huggingface.co/spaces/BlinkDL/ChatRWKV-gradio).
RWKV-4-Raven-14B
Tell me about ravens.
Ravens are large, black birds with a distinctive white head and neck. They are found in most parts
of the world, including North America, Europe, Asia, and Australia. Ravens are known for their
intelligence and problem-solving abilities. They are also considered to be symbols of death and
transformation in many cultures. Ravens are often associated with the afterlife or death because
they have been known to eat carrion or even other birds. In some cultures, ravens are also believed
to be messengers of the gods or guardians of treasure.
RWKV-4-Raven-14B
Explain the concept of machine learning to a 10-year-old.
Machine learning is like a super-smart computer that can learn on its own. It can recognize patterns
and make decisions based on what it has learned. Just like how you might remember something
from the past, the computer can remember what it’s learned and use it to make predictions or
decisions in the future. It’s like having a smart friend who always knows what to do!
RWKV-4-Raven-14B
Write a short poem describing nature’s beauty in springtime.
RWKV-4-Raven-14B
Below is the numerical stable formula for updating the state variable in RWKV:
Here is the PyTorch code to update the state variable in RWKV using the given formula:
import torch

def update_state(a, b, u, v, p, q):
    a_t = torch.exp(p - q) * a + torch.exp(u + k - q) * v
    b_t = torch.exp(p - q) * b + torch.exp(u + k - q)
    return a_t, b_t

a, b = torch.randn(2, 3)
u = torch.randn(2)
v = torch.randn(2)
wkv = torch.randn(2)
a_t, b_t = update_state(a, b, u, v, wkv, torch.randn(2))
print("a_t:", a_t)
print("b_t:", b_t)
This code defines a function update_state that takes in the state variables a and b, the control
variables u and v, and the parameters wkv and torch.randn. It then uses the PyTorch torch.exp
function to compute the exponential of the control variables and the log of the normalization factor.
Finally, it uses these values to update the state.
RWKV-4-Raven-14B
Explain the following metaphor: Life is like cats.
The metaphor "Life is like cats" suggests that life can be unpredictable and full of unexpected
twists and turns, just like a cat’s behavior. It implies that we cannot fully control or predict the
outcome of our lives, just as we cannot control the movements of a cat. The metaphor also implies
that life is complex and multifaceted, with many different variables and factors at play. Just as a
cat can have multiple personalities or moods, life can have many twists and turns that are difficult
to predict or understand. Overall, the metaphor suggests that life is complex and unpredictable,
much like a cat’s behavior.
RWKV-4-Raven-14B
Test the convergence of \sum_{i=0}^{\infty} 2^{-i}.
The given series is a geometric series with the first term a_0 = 2 and common ratio r = 1/2. We can
find the sum of this series using the formula for a geometric series:

    \sum_{i=0}^{\infty} a_i = \frac{a_0}{1 - r} = \frac{1}{1 - \frac{1}{2}} = \frac{1}{\frac{1}{2}} = 2
RWKV-4-Raven-14B
I would like to cook some Chinese food at home. Do you have any suggestions on the meal I
could choose?
Yes, of course! If you are not very experienced in cooking Chinese food, you could start with
making noodles or rice, then make a dish using one of those. I also think making stir-fry with
some meat is an excellent idea.
I want to make something more special. What about red braised pork belly?
Yes, red braised pork belly is also very popular and tasty.
You will need a few ingredients such as pork belly, star anise, sugar, ginger, scallions and wine.
Put all of these in a pot, cook on medium heat and simmer until the meat is tender. After that,
remove the meat from the pot and reserve the sauce. Slice the meat and add some ginger slices
into it, then boil until it’s almost done.
What do I need to be careful with the most during the cooking process?
To cook pork belly properly, it is very important to monitor the heat of the stove and be patient
with the cooking process. Overcooking it can cause the pork to be tough, dry and difficult to chew.
You can add a few ingredients like soy sauce or black beans to give the pork more flavor.
If the temperature of the water is too high, it can damage the collagen and connective tissues,
resulting in tough pork belly. It is essential to control the temperature carefully, especially
when braising the pork. Additionally, keep the water simmering but not boiling to maintain a
low simmer, as high heat will dry out the pork belly and result in tough and unpalatable meat.
Lastly, monitor the heat during cooking and make sure the heat is medium, as this is what is most
appropriate for slow cooking pork belly.
You’re welcome! It was a pleasure chatting with you. Let me know if you need any further
assistance in the future.
RWKV-4-Raven-14B
If all apples are fruits and all fruits are sweet, are all apples sweet? Please explain your
answer in detail.
Yes, all apples are fruits, and all fruits are sweet. Therefore, it is safe to assume that all apples are
sweet.
If birds can fly. Penguins are birds. Can penguins fly? Please explain your answer in detail.
No, penguins cannot fly. Penguins are birds, but they are not capable of flight like other birds.
They have wings and can glide for short distances, but they cannot take off or fly long distances
like other birds.
The day before two days after the day before tomorrow is Saturday. What day is it today?
Please explain your answer in detail.
If five cats can catch five mice in five minutes, how long will it take one cat to catch one
mouse? Please explain your answer in detail.
If five cats can catch five mice in five minutes, it means that each cat can catch one mouse in one
minute. To find out how long it will take one cat to catch one mouse, we can use the concept of
proportionality.
Let’s assume that it takes x minutes for a cat to catch a mouse. Then, the number of mice caught
by a cat in one minute is equal to the number of cats multiplied by the number of mice caught by
each cat: x = 5 * 1 = 5
So, it will take 5 minutes for one cat to catch one mouse.
RWKV-4-Raven-14B
Summarize the following paragraph.
The paper presents a novel architecture, Receptance Weighted Key Value (RWKV), that combines
the efficient parallelizable training of Transformers with the efficient inference of RNNs. The
model is able to scale to tens of billions of parameters and exhibits linear computational complexity
during training and inference, making it a promising alternative to Transformers for sequence
processing tasks.
The researchers have developed a new architecture that combines the efficient parallelizable
training of Transformers with the efficient inference of RNNs, which allows for linear scaling in
memory and computational requirements.